Update Python CLI to call Cloudflare HTTP ingestor. TODO: For singlepart upload, return back part size and number of parts. More descriptive error logged

2026-06-13 22:20:55 +02:00 · 2026-03-13 07:18:36 -04:00
parent d4484b665f
commit 422cff5273
5 changed files with 318 additions and 63 deletions
@@ -15,7 +15,7 @@ uv run d4c-http-ingestor \
  --auth-token "$D4C_INGESTOR_AUTH_TOKEN" \
  --db ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec.sqlite \
  --key-prefix dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
-  --out parquet/ \
+  --out ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec \
  --concurrency 12
 ```

@@ -26,7 +26,7 @@ The auth token can also be set via the `D4C_INGESTOR_AUTH_TOKEN` environment var
 ```
 usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
                         [--worker-url WORKER_URL] [--auth-token AUTH_TOKEN]
-                         --db DB [--key-prefix KEY_PREFIX] [--out OUT]
+                         --db DB [--key-prefix KEY_PREFIX] --out OUT
                         [--concurrency CONCURRENCY] [--timeout TIMEOUT]
                         [--max-retries MAX_RETRIES]
                         [--resume | --no-resume] [--force-refresh]
@@ -40,7 +40,7 @@ usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
 | `--auth-token` | `$D4C_INGESTOR_AUTH_TOKEN` | Bearer token for the worker |
 | `--db` | *(required)* | Path to the SQLite database file |
 | `--key-prefix` | `""` | S3 key prefix passed to the worker |
-| `--out` | `parquet/` | Output directory for the Parquet artifact |
+| `--out` | *(required)* | Parquet output filename stem (e.g. `my-dataset` → `my-dataset.parquet`) |
 | `--concurrency` | `12` | Maximum concurrent worker requests |
 | `--timeout` | `600` | Per-request timeout in seconds |
 | `--max-retries` | `3` | Maximum retry attempts per URL on failure |
@@ -61,9 +61,10 @@ usage: d4c-http-ingestor [-h] --urls URLS --dataset-id DATASET_ID
     "key_prefix": "<key-prefix>"
   }
   ```
-6. Persists each result (success/failed) to SQLite with idempotent upsert.
+6. Persists each result (success/failed) to SQLite with idempotent upsert, including the ETag, multipart info, and timestamps returned by the worker.
 7. Failed URLs are retried with exponential backoff + jitter (up to `--max-retries`).
-8. On completion, exports the full `downloads` table to `parquet/downloads.parquet`.
+8. **Every 100 successful downloads**, the full `downloads` table is exported to `{out}.parquet` and uploaded to S3 via the worker's PUT endpoint at the `--key-prefix` location.
+9. On completion, a final Parquet export + upload is performed.

 Re-runs append new datasets or update existing rows into the Parquet dataset.

@@ -73,13 +74,16 @@ Re-runs append new datasets or update existing rows into the Parquet dataset.

 ```sql
 CREATE TABLE IF NOT EXISTS downloads (
-  url              TEXT PRIMARY KEY,
-  dataset_id       TEXT NOT NULL,
-  status           TEXT NOT NULL,  -- success | failed | skipped
-  http_status      INTEGER,
-  error            TEXT,
-  started_at       TEXT NOT NULL,
-  finished_at      TEXT
+  url                    TEXT PRIMARY KEY,
+  dataset_id             TEXT NOT NULL,
+  status                 TEXT NOT NULL,  -- success | failed | skipped
+  http_status            INTEGER,
+  etag                   TEXT,
+  error                  TEXT,
+  started_at             TEXT NOT NULL,
+  finished_at            TEXT,
+  multipart_part_size    INTEGER,
+  multipart_number_parts INTEGER
 );
 CREATE INDEX IF NOT EXISTS ix_downloads_dataset ON downloads(dataset_id);
 CREATE INDEX IF NOT EXISTS ix_downloads_status  ON downloads(status);
@@ -87,7 +91,20 @@ CREATE INDEX IF NOT EXISTS ix_downloads_status  ON downloads(status);

 ### Parquet columns

-Mirrors the SQLite schema exactly.
+| Column | Arrow Type | Description |
+|--------|-----------|-------------|
+| `url` | `string` | Source download URL |
+| `dataset_id` | `string` | Logical dataset identifier |
+| `status` | `string` | `success`, `failed`, or `skipped` |
+| `http_status` | `int32` | HTTP status code from the worker |
+| `etag` | `string` | S3 ETag of the uploaded object (quotes stripped) |
+| `error` | `string` | Error message (if failed) |
+| `started_at` | `timestamp[us, tz=UTC]` | When the worker started processing (from worker response) |
+| `finished_at` | `timestamp[us, tz=UTC]` | When the worker finished processing (from worker response) |
+| `multipart_part_size` | `int32` | S3 multipart part size in bytes (if multipart was used) |
+| `multipart_number_parts` | `int32` | Number of parts uploaded (if multipart was used) |
+
+> Note: `started_at` and `finished_at` are stored as ISO-8601 text in SQLite but converted to proper Arrow timestamps in the Parquet output. These values come from the Cloudflare worker response, not the Python CLI.

 ## Dependencies