Python CLI to call Cloudflare HTTP ingestor. To start running tomorrow morning! https://tenor.com/view/bbnft-yuge-theriouthly-seriously-mike-tyson-gif-5575001300972074272

2026-06-14 14:40:54 +02:00 · 2026-03-10 14:31:09 -04:00
parent 6f9eddc5be
commit d3b5d69571
10 changed files with 235676 additions and 0 deletions
@@ -0,0 +1,96 @@
+# 06 – Call HTTP Ingestor
+
+Python CLI that orchestrates concurrent calls to the [Cloudflare HTTP ingestor worker](../05_cloudflare_http_ingestor/) and persists results to SQLite + Parquet.
+
+## Quick start
+
+```bash
+cd scripts/06_call_http_ingestor
+
+# Install & run (uv handles the virtualenv automatically)
+uv run d4c-http-ingestor \
+  --urls ../05_cloudflare_http_ingestor/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec_2026-03-10.txt \
+  --dataset-id ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
+  --worker-url https://cf-data-ingestor.labs.dataforcanada.org/ \
+  --auth-token "$D4C_INGESTOR_AUTH_TOKEN" \
+  --db ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec.sqlite \
+  --key-prefix dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
+  --out parquet/ \
+  --concurrency 12
+```
+
+`--auth-token` defaults to the `D4C_INGESTOR_AUTH_TOKEN` environment variable, so the flag can be omitted when that variable is set.
+
+## CLI reference
+
+```
+usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
+                         [--worker-url WORKER_URL] [--auth-token AUTH_TOKEN]
+                         --db DB [--key-prefix KEY_PREFIX] [--out OUT]
+                         [--concurrency CONCURRENCY] [--timeout TIMEOUT]
+                         [--max-retries MAX_RETRIES]
+                         [--resume | --no-resume] [--force-refresh]
+```
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--urls` | *(required)* | Path to a newline-delimited file of URLs to ingest |
+| `--dataset-id` | *(required)* | Logical dataset identifier (used in User-Agent and DB) |
+| `--worker-url` | `https://cf-data-ingestor.labs.dataforcanada.org/` | Base URL of the Cloudflare ingestor worker |
+| `--auth-token` | `$D4C_INGESTOR_AUTH_TOKEN` | Bearer token for the worker |
+| `--db` | *(required)* | Path to the SQLite database file |
+| `--key-prefix` | `""` | S3 key prefix passed to the worker |
+| `--out` | `parquet/` | Output directory for the Parquet artifact |
+| `--concurrency` | `12` | Maximum concurrent worker requests |
+| `--timeout` | `600` | Per-request timeout in seconds |
+| `--max-retries` | `3` | Maximum retry attempts per URL on failure |
+| `--resume` / `--no-resume` | `true` | Skip URLs already recorded as `success`; pass `--no-resume` to disable this |
+| `--force-refresh` | `false` | Ignore cached freshness; re-process all URLs |
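
The usage text and table above read like `argparse` output, so the flag set can be sketched as below. This is an illustration under that assumption, not the tool's actual source: the `build_parser` helper, the argument types, and the use of `argparse.BooleanOptionalAction` for `--resume | --no-resume` are all guesses based on the documented defaults.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in the table."""
    p = argparse.ArgumentParser(prog="d4c-http-ingestor")
    p.add_argument("--urls", required=True)
    p.add_argument("--dataset-id", required=True)
    p.add_argument(
        "--worker-url",
        default="https://cf-data-ingestor.labs.dataforcanada.org/",
    )
    # Falls back to the environment variable when the flag is omitted.
    p.add_argument(
        "--auth-token",
        default=os.environ.get("D4C_INGESTOR_AUTH_TOKEN"),
    )
    p.add_argument("--db", required=True)
    p.add_argument("--key-prefix", default="")
    p.add_argument("--out", default="parquet/")
    p.add_argument("--concurrency", type=int, default=12)
    p.add_argument("--timeout", type=float, default=600)
    p.add_argument("--max-retries", type=int, default=3)
    # BooleanOptionalAction (Python 3.9+) generates --resume and --no-resume.
    p.add_argument(
        "--resume", action=argparse.BooleanOptionalAction, default=True
    )
    p.add_argument("--force-refresh", action="store_true")
    return p


# Example: only the required flags plus --no-resume.
args = build_parser().parse_args(
    ["--urls", "urls.txt", "--dataset-id", "demo", "--db", "demo.sqlite", "--no-resume"]
)
```

With this parser, omitted flags take the defaults listed in the table, and `--no-resume` flips `args.resume` to `False`.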
+
+## How it works
+
+1. Reads URLs from the input file.
+2. Opens (or creates) a SQLite database with the `downloads` table.
+3. If `--resume` (default), filters out URLs already marked `success`.
+4. Submits up to `--concurrency` concurrent POST requests to the worker.
+5. Each request sends:
+   ```json
+   {
+     "download_url": "<url from file>",
+     "user_agent": "Data for Canada - <dataset-id>",
+     "key_prefix": "<key-prefix>"
+   }
+   ```
+6. Persists each result (`success` or `failed`) to SQLite with an idempotent upsert.
+7. Retries failed URLs with exponential backoff and jitter (up to `--max-retries` attempts per URL).
+8. On completion, exports the full `downloads` table to `downloads.parquet` in the `--out` directory (`parquet/downloads.parquet` by default).
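
Steps 1 and 5 of the list above can be sketched with the standard library alone. The helper names `read_urls` and `build_request` are hypothetical, and two details are assumptions rather than documented behaviour: blank lines in the URL file are skipped, and the bearer token is sent in a standard `Authorization` header.

```python
import json


def read_urls(text: str) -> list[str]:
    """Split newline-delimited text into URLs, skipping blank lines (assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_request(url, dataset_id, key_prefix="", auth_token=None):
    """Return (headers, body) for one POST to the worker."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        # Assumption: the token travels as "Authorization: Bearer <token>".
        headers["Authorization"] = f"Bearer {auth_token}"
    body = json.dumps(
        {
            "download_url": url,
            "user_agent": f"Data for Canada - {dataset_id}",
            "key_prefix": key_prefix,
        }
    )
    return headers, body


# Example with placeholder URLs and a placeholder token.
urls = read_urls("https://example.com/a.tif\n\nhttps://example.com/b.tif\n")
headers, body = build_request(
    urls[0], "demo-dataset", key_prefix="archive/demo", auth_token="secret"
)
```

The resulting `headers` and `body` could then be handed to an HTTP client such as httpx.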
+
+Re-runs add rows for new datasets and update existing rows, so the Parquet dataset reflects the latest state of each URL.
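
Bounded concurrency (step 4) and retries with backoff (step 7) could look roughly like the sketch below. It is not the tool's implementation: `ingest_one`, `ingest_all`, and `backoff_delay` are hypothetical names, the `send` coroutine stands in for the real HTTP call, "full jitter" is one common backoff variant chosen here for illustration, and `max_retries` is read as the number of retries after the first attempt.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2**attempt))


async def ingest_one(url, send, sem, max_retries=3, base=1.0):
    """Call send(url) under the semaphore, retrying when it raises."""
    error = None
    for attempt in range(max_retries + 1):
        async with sem:  # hold a concurrency slot only during the request
            try:
                http_status = await send(url)
                return {"url": url, "status": "success",
                        "http_status": http_status, "error": None}
            except Exception as exc:  # a real client would narrow this
                error = str(exc)
        if attempt < max_retries:
            await asyncio.sleep(backoff_delay(attempt, base))
    return {"url": url, "status": "failed", "http_status": None, "error": error}


async def ingest_all(urls, send, concurrency=12, max_retries=3, base=1.0):
    """Run all URLs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    tasks = [ingest_one(u, send, sem, max_retries, base) for u in urls]
    return await asyncio.gather(*tasks)  # results keep the input order


async def _demo():
    calls = {}

    async def fake_send(url):
        # Stand-in for the worker call: "flaky" fails once, "bad" always fails.
        calls[url] = calls.get(url, 0) + 1
        if url.endswith("flaky") and calls[url] == 1:
            raise RuntimeError("transient")
        if url.endswith("bad"):
            raise RuntimeError("boom")
        return 200

    urls = ["https://example.com/ok", "https://example.com/flaky",
            "https://example.com/bad"]
    results = await ingest_all(urls, fake_send, concurrency=2,
                               max_retries=2, base=0.01)
    return results, calls


results, calls = asyncio.run(_demo())
```

Sleeping outside the `async with sem` block means a URL waiting out its backoff does not occupy a concurrency slot.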
+
+## Data model
+
+### SQLite schema (`downloads` table)
+
+```sql
+CREATE TABLE IF NOT EXISTS downloads (
+  url              TEXT PRIMARY KEY,
+  dataset_id       TEXT NOT NULL,
+  status           TEXT NOT NULL,  -- success | failed | skipped
+  http_status      INTEGER,
+  error            TEXT,
+  started_at       TEXT NOT NULL,
+  finished_at      TEXT
+);
+CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
+CREATE INDEX IF NOT EXISTS ix_downloads_status  ON downloads(status);
+```
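
Against this schema, the idempotent upsert and the `--resume` filter can be sketched with the standard `sqlite3` module. The helper names `record_result` and `pending_urls`, the choice to overwrite every column on conflict, and the ISO 8601 timestamp strings are assumptions for illustration; `ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
  url         TEXT PRIMARY KEY,
  dataset_id  TEXT NOT NULL,
  status      TEXT NOT NULL,
  http_status INTEGER,
  error       TEXT,
  started_at  TEXT NOT NULL,
  finished_at TEXT
);
"""

# One row per URL: a second write for the same URL replaces the first.
UPSERT = """
INSERT INTO downloads
  (url, dataset_id, status, http_status, error, started_at, finished_at)
VALUES
  (:url, :dataset_id, :status, :http_status, :error, :started_at, :finished_at)
ON CONFLICT(url) DO UPDATE SET
  dataset_id  = excluded.dataset_id,
  status      = excluded.status,
  http_status = excluded.http_status,
  error       = excluded.error,
  started_at  = excluded.started_at,
  finished_at = excluded.finished_at
"""


def record_result(conn, row):
    """Insert or update one result row inside its own transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, row)


def pending_urls(conn, urls):
    """Drop URLs already recorded as 'success' (the --resume behaviour)."""
    done = {r[0] for r in conn.execute(
        "SELECT url FROM downloads WHERE status = 'success'")}
    return [u for u in urls if u not in done]


# Example with an in-memory database and placeholder URLs.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
base = {"dataset_id": "demo",
        "started_at": "2026-03-10T14:00:00Z",
        "finished_at": "2026-03-10T14:00:05Z"}
a, b, c = (f"https://example.com/{n}.tif" for n in "abc")
record_result(conn, {**base, "url": a, "status": "failed",
                     "http_status": 502, "error": "bad gateway"})
record_result(conn, {**base, "url": a, "status": "success",
                     "http_status": 200, "error": None})
record_result(conn, {**base, "url": b, "status": "failed",
                     "http_status": 500, "error": "server error"})
row_count = conn.execute("SELECT COUNT(*) FROM downloads").fetchone()[0]
todo = pending_urls(conn, [a, b, c])
```

Because `url` is the primary key, retrying a URL updates its row instead of adding a duplicate, which keeps re-runs safe to repeat.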
+
+### Parquet columns
+
+Mirrors the SQLite schema exactly.
+
+## Dependencies
+
+- [httpx](https://www.python-httpx.org/) – async HTTP client
+- [pyarrow](https://arrow.apache.org/docs/python/) – Parquet I/O
+- [rich](https://rich.readthedocs.io/) – progress bars and terminal output