mirror of
https://github.com/dataforcanada/d4c-infra-distribution.git
synced 2026-06-13 22:20:55 +02:00
Update Python CLI to call Cloudflare HTTP ingestor. TODO: For singlepart upload, return back part size and number of parts. More descriptive error logged
This commit is contained in:
@@ -15,7 +15,7 @@ uv run d4c-http-ingestor \
|
||||
--auth-token "$D4C_INGESTOR_AUTH_TOKEN" \
|
||||
--db ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec.sqlite \
|
||||
--key-prefix dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
|
||||
--out parquet/ \
|
||||
--out ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
|
||||
--concurrency 12
|
||||
```
|
||||
|
||||
@@ -26,7 +26,7 @@ The auth token can also be set via the `D4C_INGESTOR_AUTH_TOKEN` environment var
|
||||
```
|
||||
usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
|
||||
[--worker-url WORKER_URL] [--auth-token AUTH_TOKEN]
|
||||
--db DB [--key-prefix KEY_PREFIX] [--out OUT]
|
||||
--db DB [--key-prefix KEY_PREFIX] --out OUT
|
||||
[--concurrency CONCURRENCY] [--timeout TIMEOUT]
|
||||
[--max-retries MAX_RETRIES]
|
||||
[--resume | --no-resume] [--force-refresh]
|
||||
@@ -40,7 +40,7 @@ usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
|
||||
| `--auth-token` | `$D4C_INGESTOR_AUTH_TOKEN` | Bearer token for the worker |
|
||||
| `--db` | *(required)* | Path to the SQLite database file |
|
||||
| `--key-prefix` | `""` | S3 key prefix passed to the worker |
|
||||
| `--out` | `parquet/` | Output directory for the Parquet artifact |
|
||||
| `--out` | *(required)* | Parquet output filename stem (e.g. `my-dataset` → `my-dataset.parquet`) |
|
||||
| `--concurrency` | `12` | Maximum concurrent worker requests |
|
||||
| `--timeout` | `600` | Per-request timeout in seconds |
|
||||
| `--max-retries` | `3` | Maximum retry attempts per URL on failure |
|
||||
@@ -61,9 +61,10 @@ usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
|
||||
"key_prefix": "<key-prefix>"
|
||||
}
|
||||
```
|
||||
6. Persists each result (success/failed) to SQLite with idempotent upsert.
|
||||
6. Persists each result (success/failed) to SQLite with idempotent upsert, including the ETag, multipart info, and timestamps returned by the worker.
|
||||
7. Failed URLs are retried with exponential backoff + jitter (up to `--max-retries`).
|
||||
8. On completion, exports the full `downloads` table to `parquet/downloads.parquet`.
|
||||
8. **Every 100 successful downloads**, the full `downloads` table is exported to `{out}.parquet` and uploaded to S3 via the worker's PUT endpoint at the `--key-prefix` location.
|
||||
9. On completion, a final Parquet export + upload is performed.
|
||||
|
||||
Re-runs append new datasets or update existing rows into the Parquet dataset.
|
||||
|
||||
@@ -73,13 +74,16 @@ Re-runs append new datasets or update existing rows into the Parquet dataset.
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS downloads (
|
||||
url TEXT PRIMARY KEY,
|
||||
dataset_id TEXT NOT NULL,
|
||||
status TEXT NOT NULL, -- success | failed | skipped
|
||||
http_status INTEGER,
|
||||
error TEXT,
|
||||
started_at TEXT NOT NULL,
|
||||
finished_at TEXT
|
||||
url TEXT PRIMARY KEY,
|
||||
dataset_id TEXT NOT NULL,
|
||||
status TEXT NOT NULL, -- success | failed | skipped
|
||||
http_status INTEGER,
|
||||
etag TEXT,
|
||||
error TEXT,
|
||||
started_at TEXT NOT NULL,
|
||||
finished_at TEXT,
|
||||
multipart_part_size INTEGER,
|
||||
multipart_number_parts INTEGER
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
|
||||
CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
|
||||
@@ -87,7 +91,20 @@ CREATE INDEX IF NOT EXISTS ix_downloads_status ON downloads(status);
|
||||
|
||||
### Parquet columns
|
||||
|
||||
Mirrors the SQLite schema exactly.
|
||||
| Column | Arrow Type | Description |
|
||||
|--------|-----------|-------------|
|
||||
| `url` | `string` | Source download URL |
|
||||
| `dataset_id` | `string` | Logical dataset identifier |
|
||||
| `status` | `string` | `success`, `failed`, or `skipped` |
|
||||
| `http_status` | `int32` | HTTP status code from the worker |
|
||||
| `etag` | `string` | S3 ETag of the uploaded object (quotes stripped) |
|
||||
| `error` | `string` | Error message (if failed) |
|
||||
| `started_at` | `timestamp[us, tz=UTC]` | When the worker started processing (from worker response) |
|
||||
| `finished_at` | `timestamp[us, tz=UTC]` | When the worker finished processing (from worker response) |
|
||||
| `multipart_part_size` | `int32` | S3 multipart part size in bytes (if multipart was used) |
|
||||
| `multipart_number_parts` | `int32` | Number of parts uploaded (if multipart was used) |
|
||||
|
||||
> Note: `started_at` and `finished_at` are stored as ISO-8601 text in SQLite but converted to proper Arrow timestamps in the Parquet output. These values come from the Cloudflare worker response, not the Python CLI.
|
||||
|
||||
## Dependencies
|
||||
|
||||
|
||||
Reference in New Issue
Block a user