Python CLI to call Cloudflare HTTP ingestor. To start running tomorrow morning! https://tenor.com/view/bbnft-yuge-theriouthly-seriously-mike-tyson-gif-5575001300972074272

Diego Ripley
2026-03-10 14:31:09 -04:00
parent 6f9eddc5be
commit d3b5d69571
10 changed files with 235676 additions and 0 deletions
@@ -0,0 +1,96 @@
# 06 Call HTTP Ingestor
Python CLI that orchestrates concurrent calls to the [Cloudflare HTTP ingestor worker](../05_cloudflare_http_ingestor/) and persists results to SQLite + Parquet.
## Quick start
```bash
cd scripts/06_call_http_ingestor
# Install & run (uv handles the virtualenv automatically)
uv run d4c-http-ingestor \
--urls ../05_cloudflare_http_ingestor/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec_2026-03-10.txt \
--dataset-id ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
--worker-url https://cf-data-ingestor.labs.dataforcanada.org/ \
--auth-token "$D4C_INGESTOR_AUTH_TOKEN" \
--db ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec.sqlite \
--key-prefix dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
--out parquet/ \
--concurrency 12
```
If `--auth-token` is omitted, the token is read from the `D4C_INGESTOR_AUTH_TOKEN` environment variable.
## CLI reference
```
usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
[--worker-url WORKER_URL] [--auth-token AUTH_TOKEN]
--db DB [--key-prefix KEY_PREFIX] [--out OUT]
[--concurrency CONCURRENCY] [--timeout TIMEOUT]
[--max-retries MAX_RETRIES]
[--resume | --no-resume] [--force-refresh]
```
| Flag | Default | Description |
|------|---------|-------------|
| `--urls` | *(required)* | Path to a newline-delimited file of URLs to ingest |
| `--dataset-id` | *(required)* | Logical dataset identifier (used in User-Agent and DB) |
| `--worker-url` | `https://cf-data-ingestor.labs.dataforcanada.org/` | Base URL of the Cloudflare ingestor worker |
| `--auth-token` | `$D4C_INGESTOR_AUTH_TOKEN` | Bearer token for the worker |
| `--db` | *(required)* | Path to the SQLite database file |
| `--key-prefix` | `""` | S3 key prefix passed to the worker |
| `--out` | `parquet/` | Output directory for the Parquet artifact |
| `--concurrency` | `12` | Maximum concurrent worker requests |
| `--timeout` | `600` | Per-request timeout in seconds |
| `--max-retries` | `3` | Maximum retry attempts per URL on failure |
| `--resume` / `--no-resume` | `true` | Skip URLs already recorded as `success`; pass `--no-resume` to process them again |
| `--force-refresh` | `false` | Ignore cached freshness; re-process all URLs |
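The flag table above could be declared roughly as follows. This is a minimal sketch that assumes the CLI is built on `argparse`; the function name `build_parser` is hypothetical, and the real entry point may be organized differently.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="d4c-http-ingestor")
    parser.add_argument("--urls", required=True,
                        help="Path to a newline-delimited file of URLs")
    parser.add_argument("--dataset-id", required=True,
                        help="Logical dataset identifier")
    parser.add_argument("--worker-url",
                        default="https://cf-data-ingestor.labs.dataforcanada.org/")
    # Fall back to the environment variable when the flag is not given.
    parser.add_argument("--auth-token",
                        default=os.environ.get("D4C_INGESTOR_AUTH_TOKEN"))
    parser.add_argument("--db", required=True,
                        help="Path to the SQLite database file")
    parser.add_argument("--key-prefix", default="")
    parser.add_argument("--out", default="parquet/")
    parser.add_argument("--concurrency", type=int, default=12)
    parser.add_argument("--timeout", type=float, default=600)
    parser.add_argument("--max-retries", type=int, default=3)
    # BooleanOptionalAction (Python 3.9+) generates --resume and --no-resume.
    parser.add_argument("--resume", action=argparse.BooleanOptionalAction,
                        default=True)
    parser.add_argument("--force-refresh", action="store_true")
    return parser
```

Parsing only the three required flags should leave every other option at the default listed in the table.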
## How it works
1. Reads URLs from the input file.
2. Opens (or creates) a SQLite database with the `downloads` table.
3. If `--resume` (default), filters out URLs already marked `success`.
4. Submits up to `--concurrency` concurrent POST requests to the worker.
5. Each request sends:
```json
{
"download_url": "<url from file>",
"user_agent": "Data for Canada - <dataset-id>",
"key_prefix": "<key-prefix>"
}
```
6. Persists each result (`success` or `failed`) to SQLite with an idempotent upsert.
7. Retries failed URLs with exponential backoff and jitter, up to `--max-retries` attempts.
8. On completion, exports the full `downloads` table to `parquet/downloads.parquet` (with the default `--out`).
On re-runs, rows for new datasets are appended to the Parquet dataset and existing rows are updated.
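The bounded concurrency and retry behaviour described in the steps above can be sketched with the standard library alone. The names `build_payload`, `backoff_delay`, and `ingest_all` are hypothetical, `send` stands in for the real httpx POST to the worker, and the sketch assumes that `--max-retries` counts retries after the first attempt and that the jitter is "full jitter"; the actual implementation may differ on both points.

```python
import asyncio
import random
from typing import Awaitable, Callable, Dict, Iterable


def build_payload(url: str, dataset_id: str, key_prefix: str) -> dict:
    """Request body in the shape shown in step 5."""
    return {
        "download_url": url,
        "user_agent": f"Data for Canada - {dataset_id}",
        "key_prefix": key_prefix,
    }


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def ingest_all(
    urls: Iterable[str],
    send: Callable[[str], Awaitable[None]],
    concurrency: int = 12,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> Dict[str, str]:
    """Run send(url) for every URL with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process(url: str) -> str:
        for attempt in range(max_retries + 1):
            try:
                # Hold a slot only while the request is in flight,
                # not while sleeping between attempts.
                async with semaphore:
                    await send(url)
                return "success"
            except Exception:
                if attempt == max_retries:
                    return "failed"
                await asyncio.sleep(backoff_delay(attempt, base=base_delay))
        return "failed"  # not reached; keeps the return type explicit

    url_list = list(urls)
    statuses = await asyncio.gather(*(process(u) for u in url_list))
    return dict(zip(url_list, statuses))
```

Releasing the semaphore before sleeping is a design choice in this sketch: a URL waiting out its backoff does not block other URLs from using the slot.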
## Data model
### SQLite schema (`downloads` table)
```sql
CREATE TABLE IF NOT EXISTS downloads (
url TEXT PRIMARY KEY,
dataset_id TEXT NOT NULL,
status TEXT NOT NULL, -- success | failed | skipped
http_status INTEGER,
error TEXT,
started_at TEXT NOT NULL,
finished_at TEXT
);
CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
```
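An idempotent upsert against this schema, together with the `--resume` filter, might look like the sketch below. The helper names (`open_db`, `record_result`, `pending_urls`) are hypothetical and the timestamp handling is an assumption; only the table and column names come from the schema above. `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Iterable, List, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    url TEXT PRIMARY KEY,
    dataset_id TEXT NOT NULL,
    status TEXT NOT NULL,
    http_status INTEGER,
    error TEXT,
    started_at TEXT NOT NULL,
    finished_at TEXT
);
CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
"""

# Writing the same URL twice updates the existing row instead of failing.
UPSERT = """
INSERT INTO downloads
    (url, dataset_id, status, http_status, error, started_at, finished_at)
VALUES
    (:url, :dataset_id, :status, :http_status, :error, :started_at, :finished_at)
ON CONFLICT(url) DO UPDATE SET
    dataset_id = excluded.dataset_id,
    status = excluded.status,
    http_status = excluded.http_status,
    error = excluded.error,
    started_at = excluded.started_at,
    finished_at = excluded.finished_at
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_result(conn: sqlite3.Connection, url: str, dataset_id: str,
                  status: str, http_status: Optional[int] = None,
                  error: Optional[str] = None,
                  started_at: Optional[str] = None) -> None:
    """Insert or update the row for `url` in a single transaction."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, {
            "url": url, "dataset_id": dataset_id, "status": status,
            "http_status": http_status, "error": error,
            "started_at": started_at or now, "finished_at": now,
        })


def pending_urls(conn: sqlite3.Connection, urls: Iterable[str]) -> List[str]:
    """Drop URLs already recorded as success (the --resume behaviour)."""
    done = {row[0] for row in
            conn.execute("SELECT url FROM downloads WHERE status = 'success'")}
    return [u for u in urls if u not in done]
```

Because `url` is the primary key, a retry that eventually succeeds overwrites the earlier `failed` row rather than adding a second one.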
### Parquet columns
Mirrors the SQLite schema exactly.
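One way to produce a Parquet file with the same columns is sketched below using pyarrow. The helper names and the choice of `int64` for `http_status` are assumptions, and the sketch rewrites the whole file on each export rather than reproducing whatever append or update strategy the CLI actually uses.

```python
import sqlite3
from pathlib import Path
from typing import Dict, List, Sequence

COLUMNS = ["url", "dataset_id", "status", "http_status",
           "error", "started_at", "finished_at"]


def rows_to_columns(rows: Sequence[tuple]) -> Dict[str, List]:
    """Pivot row tuples (in COLUMNS order) into a column-name -> values dict."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def export_parquet(conn: sqlite3.Connection, out_dir: str = "parquet/") -> Path:
    """Write the full downloads table to <out_dir>/downloads.parquet."""
    # Imported here so rows_to_columns stays usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumed mapping: INTEGER -> int64, TEXT -> string.
    schema = pa.schema([
        (name, pa.int64() if name == "http_status" else pa.string())
        for name in COLUMNS
    ])
    query = f"SELECT {', '.join(COLUMNS)} FROM downloads ORDER BY url"
    rows = conn.execute(query).fetchall()
    table = pa.Table.from_pydict(rows_to_columns(rows), schema=schema)

    out_path = Path(out_dir) / "downloads.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, str(out_path))
    return out_path
```

Passing an explicit schema keeps nullable columns such as `error` typed as strings even when every value in a run is `NULL`.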
## Dependencies
- [httpx](https://www.python-httpx.org/): async HTTP client
- [pyarrow](https://arrow.apache.org/docs/python/): Parquet I/O
- [rich](https://rich.readthedocs.io/): progress bars and terminal output