mirror of
https://github.com/dataforcanada/d4c-infra-distribution.git
synced 2026-06-14 14:40:54 +02:00
Python CLI to call Cloudflare HTTP ingestor. To start running tomorrow morning! https://tenor.com/view/bbnft-yuge-theriouthly-seriously-mike-tyson-gif-5575001300972074272
@@ -0,0 +1,96 @@
# 06 – Call HTTP Ingestor

Python CLI that orchestrates concurrent calls to the [Cloudflare HTTP ingestor worker](../05_cloudflare_http_ingestor/) and persists results to SQLite + Parquet.

## Quick start

```bash
cd scripts/06_call_http_ingestor

# Install & run (uv handles the virtualenv automatically)
uv run d4c-http-ingestor \
  --urls ../05_cloudflare_http_ingestor/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec_2026-03-10.txt \
  --dataset-id ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
  --worker-url https://cf-data-ingestor.labs.dataforcanada.org/ \
  --auth-token "$D4C_INGESTOR_AUTH_TOKEN" \
  --db ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec.sqlite \
  --key-prefix dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
  --out parquet/ \
  --concurrency 12
```

The auth token can also be set via the `D4C_INGESTOR_AUTH_TOKEN` environment variable.

## CLI reference

```
usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
                         [--worker-url WORKER_URL] [--auth-token AUTH_TOKEN]
                         --db DB [--key-prefix KEY_PREFIX] [--out OUT]
                         [--concurrency CONCURRENCY] [--timeout TIMEOUT]
                         [--max-retries MAX_RETRIES]
                         [--resume | --no-resume] [--force-refresh]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--urls` | *(required)* | Path to a newline-delimited file of URLs to ingest |
| `--dataset-id` | *(required)* | Logical dataset identifier (used in User-Agent and DB) |
| `--worker-url` | `https://cf-data-ingestor.labs.dataforcanada.org/` | Base URL of the Cloudflare ingestor worker |
| `--auth-token` | `$D4C_INGESTOR_AUTH_TOKEN` | Bearer token for the worker |
| `--db` | *(required)* | Path to the SQLite database file |
| `--key-prefix` | `""` | S3 key prefix passed to the worker |
| `--out` | `parquet/` | Output directory for the Parquet artifact |
| `--concurrency` | `12` | Maximum concurrent worker requests |
| `--timeout` | `600` | Per-request timeout in seconds |
| `--max-retries` | `3` | Maximum retry attempts per URL on failure |
| `--resume` | `true` | Skip URLs already recorded as `success` |
| `--force-refresh` | `false` | Ignore cached freshness; re-process all URLs |
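
The flags and defaults in the table above could be declared roughly as follows. This is a minimal sketch and not the CLI's actual source: `build_parser` and `DEFAULT_WORKER_URL` are hypothetical names, and the use of `argparse.BooleanOptionalAction` (Python 3.9+) for `--resume | --no-resume` is an assumption based on the usage string.

```python
import argparse
import os

# Default taken from the flag table; the constant name is hypothetical.
DEFAULT_WORKER_URL = "https://cf-data-ingestor.labs.dataforcanada.org/"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(prog="d4c-http-ingestor")
    p.add_argument("--urls", required=True, help="newline-delimited file of URLs")
    p.add_argument("--dataset-id", required=True, help="logical dataset identifier")
    p.add_argument("--worker-url", default=DEFAULT_WORKER_URL)
    # Falls back to the environment variable when the flag is omitted.
    p.add_argument("--auth-token", default=os.environ.get("D4C_INGESTOR_AUTH_TOKEN"))
    p.add_argument("--db", required=True, help="path to the SQLite database file")
    p.add_argument("--key-prefix", default="")
    p.add_argument("--out", default="parquet/")
    p.add_argument("--concurrency", type=int, default=12)
    p.add_argument("--timeout", type=float, default=600)
    p.add_argument("--max-retries", type=int, default=3)
    # Generates both --resume and --no-resume.
    p.add_argument("--resume", action=argparse.BooleanOptionalAction, default=True)
    p.add_argument("--force-refresh", action="store_true")
    return p


args = build_parser().parse_args(
    ["--urls", "urls.txt", "--dataset-id", "demo", "--db", "demo.sqlite", "--no-resume"]
)
```

With this layout, omitted optional flags take the defaults listed in the table, while `--no-resume` turns off the skip-on-success behaviour.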

## How it works

1. Reads URLs from the input file.
2. Opens (or creates) a SQLite database with the `downloads` table.
3. If `--resume` (default), filters out URLs already marked `success`.
4. Submits up to `--concurrency` concurrent POST requests to the worker.
5. Each request sends:
   ```json
   {
     "download_url": "<url from file>",
     "user_agent": "Data for Canada - <dataset-id>",
     "key_prefix": "<key-prefix>"
   }
   ```
6. Persists each result (success/failed) to SQLite with idempotent upsert.
7. Failed URLs are retried with exponential backoff + jitter (up to `--max-retries`).
8. On completion, exports the full `downloads` table to `parquet/downloads.parquet`.
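
Steps 4, 5 and 7 above can be sketched with `asyncio` alone. The code below is an illustration under stated assumptions, not the CLI's implementation: `build_payload`, `backoff_delay`, `ingest_one`, `ingest_all` and the injected `send` coroutine are hypothetical names, "full jitter" is one possible jitter strategy, and `--max-retries` is read here as the number of extra attempts after the first. In the real tool, `send` would presumably wrap an httpx POST to the worker.

```python
import asyncio
import random
from typing import Awaitable, Callable, Iterable, Optional

# Hypothetical transport: posts one payload and returns the HTTP status code.
# A real version might wrap httpx.AsyncClient.post(worker_url, json=payload).
Sender = Callable[[dict], Awaitable[int]]


def build_payload(url: str, dataset_id: str, key_prefix: str = "") -> dict:
    """Request body in the shape documented in step 5."""
    return {
        "download_url": url,
        "user_agent": f"Data for Canada - {dataset_id}",
        "key_prefix": key_prefix,
    }


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


async def ingest_one(url: str, send: Sender, sem: asyncio.Semaphore, dataset_id: str,
                     key_prefix: str = "", max_retries: int = 3, base: float = 1.0) -> dict:
    payload = build_payload(url, dataset_id, key_prefix)
    status: Optional[int] = None
    error: Optional[str] = None
    for attempt in range(max_retries + 1):  # first attempt plus max_retries retries
        try:
            async with sem:  # hold a concurrency slot only while the request is in flight
                status = await send(payload)
            if 200 <= status < 300:
                return {"url": url, "status": "success", "http_status": status, "error": None}
            error = f"HTTP {status}"
        except Exception as exc:  # network errors, timeouts, ...
            status, error = None, repr(exc)
        if attempt < max_retries:
            await asyncio.sleep(backoff_delay(attempt, base))
    return {"url": url, "status": "failed", "http_status": status, "error": error}


async def ingest_all(urls: Iterable[str], send: Sender, dataset_id: str,
                     concurrency: int = 12, **kwargs) -> list:
    """Run all URLs with at most `concurrency` requests in flight at once."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(ingest_one(u, send, sem, dataset_id, **kwargs) for u in urls))
```

Releasing the semaphore before sleeping means a URL waiting on its backoff does not occupy one of the `--concurrency` slots.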

Re-running the CLI appends rows for new datasets and updates existing rows in the Parquet dataset.

## Data model

### SQLite schema (`downloads` table)

```sql
CREATE TABLE IF NOT EXISTS downloads (
    url         TEXT PRIMARY KEY,
    dataset_id  TEXT NOT NULL,
    status      TEXT NOT NULL,  -- success | failed | skipped
    http_status INTEGER,
    error       TEXT,
    started_at  TEXT NOT NULL,
    finished_at TEXT
);
CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
```
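
Because `url` is the primary key, the idempotent upsert and the `--resume` filter can both be expressed directly against this table. The following is a sketch using the standard-library `sqlite3` module. The helper names `open_db`, `record_result` and `pending_urls` are hypothetical, and the `ON CONFLICT ... DO UPDATE` form (which needs SQLite 3.24 or newer) is one way to implement an upsert, not necessarily the one the CLI uses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    url         TEXT PRIMARY KEY,
    dataset_id  TEXT NOT NULL,
    status      TEXT NOT NULL,
    http_status INTEGER,
    error       TEXT,
    started_at  TEXT NOT NULL,
    finished_at TEXT
);
CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
"""

# One row per URL: a second write for the same URL overwrites the first.
UPSERT = """
INSERT INTO downloads (url, dataset_id, status, http_status, error, started_at, finished_at)
VALUES (:url, :dataset_id, :status, :http_status, :error, :started_at, :finished_at)
ON CONFLICT(url) DO UPDATE SET
    dataset_id  = excluded.dataset_id,
    status      = excluded.status,
    http_status = excluded.http_status,
    error       = excluded.error,
    started_at  = excluded.started_at,
    finished_at = excluded.finished_at
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_result(conn: sqlite3.Connection, row: dict) -> None:
    """Insert or update the row for row['url'] in a single transaction."""
    with conn:
        conn.execute(UPSERT, row)


def pending_urls(conn: sqlite3.Connection, urls: list, resume: bool = True) -> list:
    """With resume enabled, drop URLs already recorded as 'success'."""
    if not resume:
        return list(urls)
    done = {u for (u,) in conn.execute("SELECT url FROM downloads WHERE status = 'success'")}
    return [u for u in urls if u not in done]
```

Writing the same URL twice leaves a single row holding the latest outcome, which is what makes re-runs safe.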

### Parquet columns

Mirrors the SQLite schema exactly.
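
One way such an export could look with pyarrow is sketched below. It is an assumption-laden illustration rather than the project's code: `COLUMNS`, `table_to_columns` and `export_parquet` are hypothetical names, and mapping `TEXT` to Arrow strings and `INTEGER` to `int64` is a guess at what "mirrors" means in practice. pyarrow is imported inside `export_parquet` so the SQLite half can be used without it.

```python
import sqlite3
from pathlib import Path

# Column order follows the CREATE TABLE statement.
COLUMNS = ["url", "dataset_id", "status", "http_status", "error", "started_at", "finished_at"]


def table_to_columns(conn: sqlite3.Connection) -> dict:
    """Read the whole downloads table into a column-oriented dict."""
    rows = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM downloads ORDER BY url"
    ).fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def export_parquet(conn: sqlite3.Connection, out_dir: str = "parquet/") -> Path:
    """Write the table to <out_dir>/downloads.parquet (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumed type mapping: TEXT -> string, INTEGER -> int64.
    schema = pa.schema([
        ("url", pa.string()),
        ("dataset_id", pa.string()),
        ("status", pa.string()),
        ("http_status", pa.int64()),
        ("error", pa.string()),
        ("started_at", pa.string()),
        ("finished_at", pa.string()),
    ])
    out_path = Path(out_dir) / "downloads.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(pa.table(table_to_columns(conn), schema=schema), str(out_path))
    return out_path
```

Exporting the full table on every run keeps the Parquet file a plain snapshot of SQLite, so no separate merge logic is needed on the Parquet side.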

## Dependencies

- [httpx](https://www.python-httpx.org/) – async HTTP client
- [pyarrow](https://arrow.apache.org/docs/python/) – Parquet I/O
- [rich](https://rich.readthedocs.io/) – progress bars and terminal output