Files
d4c-infra-distribution/scripts/05_cloudflare_http_ingestor

cf-data-ingestor

A Cloudflare Worker that acts as a secure proxy: it downloads a file from a URL provided in a JSON payload and streams it directly into an S3 bucket in us-west-2, keeping memory usage constant regardless of file size. All uploads use multipart upload with 5 MiB chunks to stay well within the Workers 128 MiB memory limit.

Architecture

Client POST ──▶ Worker ──stream──▶ S3 PutObject / Multipart
                  │
                  ├─ Auth check (Bearer token)
                  ├─ Fetch source URL (custom User-Agent)
                  └─ Sign with AWS Sig V4 (aws4fetch)

Client PUT  ──▶ Worker ──stream──▶ S3 PutObject / Multipart
                  │
                  ├─ Auth check (Bearer token)
                  └─ Direct binary upload (X-S3-Key header)

All uploads use S3 multipart upload with 5 MiB parts, keeping peak memory bounded to ~5 MiB regardless of file size. This avoids hitting the Cloudflare Workers 128 MiB memory limit that can occur when buffering large single PUT request bodies.

Setup

1. Install dependencies

pnpm install

2. Configure wrangler.toml

Edit the [vars] section:

[vars]
S3_BUCKET   = "us-west-2.opendata.source.coop"
S3_REGION   = "us-west-2"
S3_ENDPOINT = ""

S3_ENDPOINT should be left empty when targeting AWS S3 (path-style addressing is used automatically). Set it only for non-AWS S3-compatible services — https:// is prepended automatically if omitted.

3. Set secrets

Copy the example .env file and fill in your values:

cp .env.example .env
AUTH_TOKEN="your-auth-token"
AWS_ACCESS_KEY_ID="AKIAxxxxxxxxxxxxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Wrangler automatically loads the .env file during local development (pnpm run dev). For deployed Workers, push each secret with:

pnpm wrangler secret put AUTH_TOKEN
pnpm wrangler secret put AWS_ACCESS_KEY_ID
pnpm wrangler secret put AWS_SECRET_ACCESS_KEY

4. Deploy

pnpm run deploy

Usage

Download mode (POST)

Downloads a file from a URL and uploads it to S3.

Method: POST Content-Type: application/json Authorization: Bearer <AUTH_TOKEN>

Payload parameters:

Field Required Description
download_url Yes Direct link to the source file
user_agent Yes User-Agent string for the download request
key_prefix No Destination path within the S3 bucket

Example

curl -X POST https://cf-data-ingestor.labs.dataforcanada.org \
  -H "Authorization: Bearer <AUTH_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "download_url": "https://diffusion.mern.gouv.qc.ca/diffusion/RGQ/Imagerie/Orthomosaique/Generique/Mosa30rvb0015_30cm_Rvb/Mtm9/Jpeg2000/mos_14_31n02_se_30cm_f09.JP2",
    "user_agent": "Data for Canada - d4c-datapkg-orthoimagery",
    "key_prefix": "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec"
  }'

Successful response

{
    "ok": true,
    "bucket": "us-west-2.opendata.source.coop",
    "key": "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebecdataforcanada/.../mos_14_31n02_se_30cm_f09.JP2",
    "content_type": "application/x-msdownload",
    "size_bytes": 773722941,
    "etag": "abc123def456",
    "multipart_part_size": 5242880,
    "multipart_number_parts": 148,
    "started_at": "2026-03-12T21:00:00.000Z",
    "finished_at": "2026-03-12T21:01:30.000Z"
}

multipart_part_size and multipart_number_parts are always present since all uploads use multipart.

Direct upload mode (PUT)

Uploads a binary file body directly to S3. Useful for uploading local files (e.g. Parquet artifacts) without needing a public download URL.

Method: PUT Authorization: Bearer <AUTH_TOKEN>

Required headers:

Header Description
X-S3-Key Full S3 object key (e.g. dataforcanada/my-dataset/data.parquet)

Optional headers:

Header Description
Content-Type MIME type (default: application/octet-stream)
Content-Length File size in bytes

Body: Raw binary file content.

Example

curl -X PUT https://cf-data-ingestor.labs.dataforcanada.org \
  -H "Authorization: Bearer <AUTH_TOKEN>" \
  -H "X-S3-Key: dataforcanada/my-dataset/downloads.parquet" \
  -H "Content-Type: application/octet-stream" \
  -H "Content-Length: $(stat -c%s downloads.parquet)" \
  --data-binary @downloads.parquet

Successful response

{
    "ok": true,
    "bucket": "us-west-2.opendata.source.coop",
    "key": "dataforcanada/my-dataset/downloads.parquet",
    "content_type": "application/octet-stream",
    "size_bytes": 45231,
    "etag": "def456abc789",
    "started_at": "2026-03-12T21:00:00.000Z",
    "finished_at": "2026-03-12T21:00:01.000Z"
}

Response fields

Field Type Always present Description
ok boolean Yes true on success
bucket string Yes S3 bucket name
key string Yes S3 object key
content_type string Yes MIME type of the uploaded file
size_bytes number When Content-Length known File size in bytes
etag string When available S3 ETag (quotes stripped)
multipart_part_size number Yes Part size in bytes (5 MiB)
multipart_number_parts number Yes Number of parts uploaded
started_at string Yes ISO-8601 UTC timestamp when processing started
finished_at string Yes ISO-8601 UTC timestamp when processing finished

Error responses

Status Meaning
401 Missing or invalid Bearer token
405 Non-POST/PUT method
415 Content-Type is not application/json (POST only)
400 Malformed JSON, missing fields, or missing X-S3-Key header
502 Source download or S3 upload failed

S3 Object Key

POST mode

Only the filename is extracted from the download_url and placed under the key_prefix. The source URL's directory hierarchy is not preserved.

download_url: https://diffusion.mern.gouv.qc.ca/diffusion/RGQ/Imagerie/Orthomosaique/Generique/Mosa30rvb0015_30cm_Rvb/Mtm9/Jpeg2000/mos_14_31n02_se_30cm_f09.JP2
key_prefix:   "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec"
→ key:        dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec/mos_14_31n02_se_30cm_f09.JP2

If key_prefix is omitted or empty, the file uploads to the bucket root.

PUT mode

The full S3 key is specified directly via the X-S3-Key header.

Local Development

pnpm run dev

Then POST or PUT to http://localhost:8787. Wrangler reads secrets from the .env file you created in step 3. You can also create environment-specific overrides (e.g. .env.staging) — see the Cloudflare docs for the full .env precedence rules.