cf-data-ingestor
A Cloudflare Worker that acts as a secure proxy: it downloads a file from a
URL provided in a JSON payload and streams it directly into an S3 bucket in
us-west-2, keeping memory usage constant regardless of file size.
Architecture
Client POST ──▶ Worker ──stream──▶ S3 PutObject / Multipart
│
├─ Auth check (Bearer token)
├─ Fetch source URL (custom User-Agent)
└─ Sign with AWS Sig V4 (aws4fetch)
Client PUT ──▶ Worker ──stream──▶ S3 PutObject / Multipart
│
├─ Auth check (Bearer token)
└─ Direct binary upload (X-S3-Key header)
The Worker chooses between two upload paths automatically:
| Condition | Upload method | Memory overhead |
|---|---|---|
| Known size ≤ 100 MiB | Single streaming PUT | ~0 (pipe-through) |
| Unknown size or > 100 MiB | Multipart upload in 25 MiB chunks | ≤ 25 MiB |
Files larger than 100 MiB always use multipart upload because Cloudflare Workers enforce a body-size limit on single outbound fetch() requests.
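The selection rule in the table can be sketched as a small function. This is only an illustration, not the Worker's actual code: the name `chooseUploadMethod` and the decision to treat a missing, non-numeric, or negative Content-Length as "unknown size" are assumptions.

```typescript
// Threshold taken from the table above.
const SINGLE_PUT_LIMIT = 100 * 1024 * 1024; // 100 MiB

type UploadMethod = "single-put" | "multipart";

// Hypothetical helper: picks the upload path from an optional Content-Length value.
function chooseUploadMethod(contentLength: string | null | undefined): UploadMethod {
  // No header at all means the size is unknown.
  if (contentLength == null || contentLength.trim() === "") return "multipart";
  const size = Number(contentLength);
  // Assumption: a non-numeric or negative value is also treated as unknown.
  if (!Number.isInteger(size) || size < 0) return "multipart";
  return size <= SINGLE_PUT_LIMIT ? "single-put" : "multipart";
}
```

For instance, `chooseUploadMethod("104857600")` (exactly 100 MiB) selects the single PUT, while one byte more selects multipart.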
Setup
1. Install dependencies
pnpm install
2. Configure wrangler.toml
Edit the [vars] section:
[vars]
S3_BUCKET = "us-west-2.opendata.source.coop"
S3_REGION = "us-west-2"
S3_ENDPOINT = ""
S3_ENDPOINT should be left empty when targeting AWS S3 (path-style
addressing is used automatically). Set it only for non-AWS S3-compatible
services — https:// is prepended automatically if omitted.
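As a rough sketch of the endpoint rules above, the following shows one way the base URL and a path-style object URL could be derived. The function names `resolveEndpoint` and `objectUrl` are hypothetical, and using path-style addressing for custom endpoints as well is an assumption; the README only states it for AWS.

```typescript
// Hypothetical helper: turns the S3_ENDPOINT and S3_REGION vars into a base URL.
function resolveEndpoint(endpoint: string, region: string): string {
  const trimmed = endpoint.trim();
  if (trimmed === "") {
    // Empty endpoint: use the AWS regional endpoint; the bucket goes in the path.
    return `https://s3.${region}.amazonaws.com`;
  }
  // Prepend https:// when no scheme was given, then drop trailing slashes.
  const withScheme = /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
  return withScheme.replace(/\/+$/, "");
}

// Path-style object URL: <endpoint>/<bucket>/<key>, encoding each key segment.
function objectUrl(endpoint: string, region: string, bucket: string, key: string): string {
  const encodedKey = key
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return `${resolveEndpoint(endpoint, region)}/${bucket}/${encodedKey}`;
}
```

Path-style addressing suits a bucket name such as us-west-2.opendata.source.coop, because the dots in the name would not match the wildcard TLS certificate used for virtual-hosted-style URLs.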
3. Set secrets
Copy the example .env file and fill in your values:
cp .env.example .env
AUTH_TOKEN="your-auth-token"
AWS_ACCESS_KEY_ID="AKIAxxxxxxxxxxxxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Wrangler automatically loads the .env file during local development
(pnpm run dev). For deployed Workers, push each secret with:
pnpm wrangler secret put AUTH_TOKEN
pnpm wrangler secret put AWS_ACCESS_KEY_ID
pnpm wrangler secret put AWS_SECRET_ACCESS_KEY
4. Deploy
pnpm run deploy
Usage
Download mode (POST)
Downloads a file from a URL and uploads it to S3.
Method: POST
Content-Type: application/json
Authorization: Bearer <AUTH_TOKEN>
Payload parameters:
| Field | Required | Description |
|---|---|---|
| download_url | Yes | Direct link to the source file |
| user_agent | Yes | User-Agent string for the download request |
| key_prefix | No | Destination path within the S3 bucket |
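The payload rules in the table could be checked along these lines. This is a minimal sketch under assumptions: the `DownloadPayload` type, the `parsePayload` name, and the error messages are illustrative and not taken from the Worker source.

```typescript
interface DownloadPayload {
  download_url: string;
  user_agent: string;
  key_prefix?: string;
}

// Hypothetical validator: returns the parsed payload, or an error message
// suitable for a 400 response.
function parsePayload(raw: string): DownloadPayload | { error: string } {
  let body: unknown;
  try {
    body = JSON.parse(raw);
  } catch {
    return { error: "Malformed JSON" };
  }
  if (typeof body !== "object" || body === null) {
    return { error: "Payload must be a JSON object" };
  }
  const { download_url, user_agent, key_prefix } = body as Record<string, unknown>;
  if (typeof download_url !== "string" || download_url === "") {
    return { error: "download_url is required" };
  }
  if (typeof user_agent !== "string" || user_agent === "") {
    return { error: "user_agent is required" };
  }
  const result: DownloadPayload = { download_url, user_agent };
  // key_prefix is optional, but must be a string when present.
  if (typeof key_prefix === "string") result.key_prefix = key_prefix;
  else if (key_prefix !== undefined) return { error: "key_prefix must be a string" };
  return result;
}
```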
Example
curl -X POST https://cf-data-ingestor.labs.dataforcanada.org \
-H "Authorization: Bearer <AUTH_TOKEN>" \
-H "Content-Type: application/json" \
-d '{
"download_url": "https://diffusion.mern.gouv.qc.ca/diffusion/RGQ/Imagerie/Orthomosaique/Generique/Mosa30rvb0015_30cm_Rvb/Mtm9/Jpeg2000/mos_14_31n02_se_30cm_f09.JP2",
"user_agent": "Data for Canada - d4c-datapkg-orthoimagery",
"key_prefix": "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec"
}'
Successful response
{
"ok": true,
"bucket": "us-west-2.opendata.source.coop",
"key": "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec/mos_14_31n02_se_30cm_f09.JP2",
"content_type": "application/x-msdownload",
"size_bytes": 773722941,
"etag": "abc123def456",
"multipart_part_size": 26214400,
"multipart_number_parts": 30,
"started_at": "2026-03-12T21:00:00.000Z",
"finished_at": "2026-03-12T21:01:30.000Z"
}
multipart_part_size and multipart_number_parts are only present when multipart upload was used (file > 100 MiB or unknown size).
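The two multipart fields relate to size_bytes by ceiling division: 773722941 bytes split into 26214400-byte parts gives 30 parts, the last one shorter than the rest. The sketch below shows that arithmetic together with one possible way to regroup stream chunks into fixed-size parts. `expectedPartCount` and `PartBuffer` are hypothetical names, and the Worker's real buffering may differ.

```typescript
const PART_SIZE = 25 * 1024 * 1024; // 26214400 bytes, as reported in multipart_part_size

// Number of parts for a known size: ceiling division by the part size.
function expectedPartCount(sizeBytes: number, partSize: number = PART_SIZE): number {
  return Math.ceil(sizeBytes / partSize);
}

// Hypothetical buffer that regroups arbitrary stream chunks into fixed-size parts.
// Between calls it holds less than one part of data.
class PartBuffer {
  private readonly partSize: number;
  private pending: Uint8Array[] = [];
  private pendingBytes = 0;

  constructor(partSize: number = PART_SIZE) {
    this.partSize = partSize;
  }

  // Add a chunk; returns every full part that became available.
  push(chunk: Uint8Array): Uint8Array[] {
    const parts: Uint8Array[] = [];
    let offset = 0;
    while (offset < chunk.length) {
      const take = Math.min(this.partSize - this.pendingBytes, chunk.length - offset);
      this.pending.push(chunk.subarray(offset, offset + take));
      this.pendingBytes += take;
      offset += take;
      if (this.pendingBytes === this.partSize) parts.push(this.drain());
    }
    return parts;
  }

  // Returns the final, possibly short, part, or null if nothing is buffered.
  flush(): Uint8Array | null {
    return this.pendingBytes > 0 ? this.drain() : null;
  }

  // Concatenate the buffered pieces into one part and reset the buffer.
  private drain(): Uint8Array {
    const part = new Uint8Array(this.pendingBytes);
    let pos = 0;
    for (const piece of this.pending) {
      part.set(piece, pos);
      pos += piece.length;
    }
    this.pending = [];
    this.pendingBytes = 0;
    return part;
  }
}
```

In a streaming loop, each part returned by `push` would be sent as one UploadPart request, and `flush` would supply the last part once the source stream ends.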
Direct upload mode (PUT)
Uploads a binary file body directly to S3. Useful for uploading local files (e.g. Parquet artifacts) without needing a public download URL.
Method: PUT
Authorization: Bearer <AUTH_TOKEN>
Required headers:
| Header | Description |
|---|---|
| X-S3-Key | Full S3 object key (e.g. dataforcanada/my-dataset/data.parquet) |
Optional headers:
| Header | Description |
|---|---|
| Content-Type | MIME type (default: application/octet-stream) |
| Content-Length | File size in bytes (enables single PUT for files ≤ 100 MiB) |
Body: Raw binary file content.
Example
curl -X PUT https://cf-data-ingestor.labs.dataforcanada.org \
-H "Authorization: Bearer <AUTH_TOKEN>" \
-H "X-S3-Key: dataforcanada/my-dataset/downloads.parquet" \
-H "Content-Type: application/octet-stream" \
-H "Content-Length: $(stat -c%s downloads.parquet)" \
--data-binary @downloads.parquet
Successful response
{
"ok": true,
"bucket": "us-west-2.opendata.source.coop",
"key": "dataforcanada/my-dataset/downloads.parquet",
"content_type": "application/octet-stream",
"size_bytes": 45231,
"etag": "def456abc789",
"started_at": "2026-03-12T21:00:00.000Z",
"finished_at": "2026-03-12T21:00:01.000Z"
}
Response fields
| Field | Type | Always present | Description |
|---|---|---|---|
| ok | boolean | Yes | true on success |
| bucket | string | Yes | S3 bucket name |
| key | string | Yes | S3 object key |
| content_type | string | Yes | MIME type of the uploaded file |
| size_bytes | number | When Content-Length known | File size in bytes |
| etag | string | When available | S3 ETag (quotes stripped) |
| multipart_part_size | number | Only for multipart | Part size in bytes (25 MiB) |
| multipart_number_parts | number | Only for multipart | Number of parts uploaded |
| started_at | string | Yes | ISO-8601 UTC timestamp when processing started |
| finished_at | string | Yes | ISO-8601 UTC timestamp when processing finished |
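For clients written in TypeScript, the field table maps naturally onto an interface with optional members for the conditional fields. The sketch below is illustrative only: the type name `IngestResponse` and both helpers are assumptions, not part of the Worker.

```typescript
// Shape of a successful response, following the field table above.
interface IngestResponse {
  ok: boolean;
  bucket: string;
  key: string;
  content_type: string;
  size_bytes?: number; // present when Content-Length is known
  etag?: string; // present when available
  multipart_part_size?: number; // multipart only
  multipart_number_parts?: number; // multipart only
  started_at: string; // ISO-8601 UTC
  finished_at: string; // ISO-8601 UTC
}

// True when both multipart fields are present.
function usedMultipart(r: IngestResponse): boolean {
  return r.multipart_part_size !== undefined && r.multipart_number_parts !== undefined;
}

// Processing time in seconds, derived from the two timestamps.
function elapsedSeconds(r: IngestResponse): number {
  return (Date.parse(r.finished_at) - Date.parse(r.started_at)) / 1000;
}
```

Applied to the POST example response above, `elapsedSeconds` gives 90 and `usedMultipart` is true.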
Error responses
| Status | Meaning |
|---|---|
| 401 | Missing or invalid Bearer token |
| 405 | Method other than POST or PUT |
| 415 | Content-Type is not application/json (POST only) |
| 400 | Malformed JSON, missing fields, or missing X-S3-Key header |
| 502 | Source download or S3 upload failed |
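The request-level errors could be produced by a pre-flight check like the one below. It is a sketch under assumptions: the order of the checks simply follows the table, `IncomingMeta` and `precheck` are hypothetical names, and header names are expected in lower case. A production check might also compare the token in constant time.

```typescript
interface IncomingMeta {
  method: string;
  // Header names in lower case.
  headers: Record<string, string | undefined>;
}

// Hypothetical pre-flight check. Returns the HTTP status to reject with,
// or null when the request may proceed.
function precheck(req: IncomingMeta, authToken: string): number | null {
  if (req.headers["authorization"] !== `Bearer ${authToken}`) return 401;
  if (req.method !== "POST" && req.method !== "PUT") return 405;
  if (req.method === "POST") {
    // Allow parameters such as "; charset=utf-8" after the media type.
    const contentType = (req.headers["content-type"] ?? "").toLowerCase();
    if (!contentType.startsWith("application/json")) return 415;
  }
  if (req.method === "PUT" && !req.headers["x-s3-key"]) return 400;
  return null;
}
```

Malformed JSON and missing payload fields (also 400) and upstream failures (502) can only be detected later, once the body is read and the transfer is attempted.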
S3 Object Key
POST mode
Only the filename is extracted from the download_url and placed under the key_prefix. The source URL's directory hierarchy is not preserved.
download_url: https://diffusion.mern.gouv.qc.ca/diffusion/RGQ/Imagerie/Orthomosaique/Generique/Mosa30rvb0015_30cm_Rvb/Mtm9/Jpeg2000/mos_14_31n02_se_30cm_f09.JP2
key_prefix: "dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec"
→ key: dataforcanada/d4c-datapkg-orthoimagery/archive/ca-qc_government_and_municipalities_of_quebec-2026A000224_d4c-datapkg-orthoimagery_orthorectified_imagery_from_quebec/mos_14_31n02_se_30cm_f09.JP2
If key_prefix is omitted or empty, the file uploads to the bucket root.
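The key derivation described above can be sketched as follows. The function name `buildObjectKey` is hypothetical. Trimming slashes from the prefix and leaving the filename percent-encoded as it appears in the URL are assumptions about details the README does not specify.

```typescript
// Hypothetical helper: builds the S3 key from the source URL and an optional prefix.
function buildObjectKey(downloadUrl: string, keyPrefix: string = ""): string {
  // Keep only the last path segment; query string and directories are dropped.
  const path = new URL(downloadUrl).pathname;
  const filename = path.slice(path.lastIndexOf("/") + 1);
  if (filename === "") {
    throw new Error("download_url has no filename component");
  }
  // Assumption: leading and trailing slashes on the prefix are ignored.
  const prefix = keyPrefix.replace(/^\/+|\/+$/g, "");
  return prefix === "" ? filename : `${prefix}/${filename}`;
}
```

With the example values above, the result is the prefix followed by /mos_14_31n02_se_30cm_f09.JP2. With no prefix, the key is just the filename.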
PUT mode
The full S3 key is specified directly via the X-S3-Key header.
Local Development
pnpm run dev
Then POST or PUT to http://localhost:8787. Wrangler reads secrets from the .env file you created in step 3. You can also create environment-specific overrides (e.g. .env.staging) — see the Cloudflare docs for the full .env precedence rules.