diff --git a/content/blog/2026/2026-05-06_10-42-ia-slowdown.png b/content/blog/2026/2026-05-06_10-42-ia-slowdown.png
new file mode 100644
index 0000000..aa5ad90
Binary files /dev/null and b/content/blog/2026/2026-05-06_10-42-ia-slowdown.png differ
diff --git a/content/blog/2026/petabytes-at-the-tip-of-your-fingers-internet-archive.md b/content/blog/2026/petabytes-at-the-tip-of-your-fingers-internet-archive.md
new file mode 100644
index 0000000..fe42a90
--- /dev/null
+++ b/content/blog/2026/petabytes-at-the-tip-of-your-fingers-internet-archive.md
@@ -0,0 +1,73 @@
+---
+title: Petabytes of Internet Archive Data at the Tip of Your Fingers
+summary: If you know the bucket name, that is 😄
+date: 2026-05-06T09:00:00-04:00
+authors:
+  - name: diegoripley
+    link: https://github.com/diegoripley
+    image: https://github.com/diegoripley.png
+tags:
+  - internet-archive
+  - ia
+excludeSearch: false
+draft: false
+---
+
+We have added the internetarchive bucket to https://s3.labs.dataforcanada.org.
+
+```mermaid
+flowchart TD
+    Client(["🌐 S3 Client / User"])
+    Gateway["s3.dataforcanada.org\nS3-Compatible Gateway"]
+
+    Client -->|"S3 API Request"| Gateway
+
+    Gateway -->|"sourcecooperative bucket"| AWS
+    Gateway -->|"backblaze-ca-east-006 bucket"| BB
+    Gateway -->|"cloudflare-apac bucket"| CFAPAC
+    Gateway -->|"cloudflare-enam bucket"| CFENAM
+    Gateway -->|"internetarchive bucket"| INTERNETARCHIVE
+    Gateway -->|"tigris bucket"| TIGRIS
+
+    subgraph AWS ["☁️ Amazon Web Services"]
+        AWSNode["📍 Oregon, United States"]
+    end
+
+    subgraph BB ["🔵 Backblaze B2"]
+        BBNode["📍 Toronto, ON, Canada"]
+    end
+
+    subgraph CFAPAC ["🟠 Cloudflare R2"]
+        CFAPACNode["📍 Asia Pacific Region"]
+    end
+
+    subgraph CFENAM ["🟠 Cloudflare R2"]
+        CFENAMNode["📍 Eastern North America"]
+    end
+
+    subgraph INTERNETARCHIVE ["📚 Internet Archive Data"]
+        INTERNETARCHIVENode["📍 San Francisco, United States and Vancouver, Canada"]
+    end
+
+    subgraph TIGRIS ["⚡ Tigris Data"]
+        TIGRISNode["11 Regions Worldwide 🌍\nAuto-routes to nearest location\nfor lowest latency"]
+    end
+
+    style Gateway fill:#1a5f7a,color:#fff,stroke:#0d3d52
+    style Client fill:#2d6a4f,color:#fff,stroke:#1b4332
+    style AWSNode fill:#ff9900,color:#000,stroke:#cc7a00
+    style BBNode fill:#e03c31,color:#fff,stroke:#b02d24
+    style CFAPACNode fill:#f6821f,color:#fff,stroke:#c4681a
+    style CFENAMNode fill:#f6821f,color:#fff,stroke:#c4681a
+    style TIGRISNode fill:#6c3483,color:#fff,stroke:#512e6b
+```
+
+# Notes
+- Internet Archive S3 documentation: https://archive.org/developers/ias3.html
+- You need to know the identifier of your dataset (that is, its bucket). For example, [earth-at-night-2016](https://s3.labs.dataforcanada.org/internetarchive/earth-at-night-2016)
+- There is a bandwidth quota of 30 GB per 5 minutes, as the Internet Archive has a limited budget
+- I'm curious whether anybody has made an index of all Internet Archive buckets
+- I believe it would be ideal to have one or more serverless workers situated close to the Internet Archive and to connect through those, as the Internet Archive does not have a CDN
+
+- It also looks like we're getting throttled, even with keys
+![landscape](2026-05-06_10-42-ia-slowdown.png)
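
Reviewer sketch (not part of the patch): the notes mention a quota of 30 GB per 5 minutes and addressing items by identifier, so a client may want to budget its own transfers. The Python below is a minimal illustration under stated assumptions, not the gateway's actual behaviour. It assumes path-style URLs of the form shown in the post's example link, that a file key can be appended after the identifier, that "30 GB" means decimal gigabytes, and that the quota behaves like a sliding window. `object_url` and `SlidingWindowBudget` are hypothetical names introduced here.

```python
import time
from collections import deque
from urllib.parse import quote

GATEWAY = "https://s3.labs.dataforcanada.org"  # endpoint named in the post
BUCKET = "internetarchive"                      # bucket named in the post
QUOTA_BYTES = 30 * 10**9   # assumption: "30 GB" read as decimal gigabytes
WINDOW_SECONDS = 5 * 60    # the 5 minute window from the notes


def object_url(identifier: str, key: str = "") -> str:
    """Build a path-style URL for an item (and optionally a file key, an assumption)."""
    path = f"{BUCKET}/{quote(identifier, safe='')}"
    if key:
        path += "/" + quote(key, safe="/")
    return f"{GATEWAY}/{path}"


class SlidingWindowBudget:
    """Client-side tally of bytes transferred during the most recent window."""

    def __init__(self, quota=QUOTA_BYTES, window=WINDOW_SECONDS, clock=time.monotonic):
        self.quota = quota
        self.window = window
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.events = deque()   # (timestamp, nbytes), oldest first
        self.used = 0

    def _expire(self, now):
        # Drop transfers that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, nbytes = self.events.popleft()
            self.used -= nbytes

    def wait_time(self, nbytes):
        """Return how many seconds to wait before nbytes more fit in the window."""
        if nbytes > self.quota:
            raise ValueError("request is larger than the whole quota")
        now = self.clock()
        self._expire(now)
        used = self.used
        if used + nbytes <= self.quota:
            return 0.0
        # Walk the oldest transfers until enough budget would be freed.
        for timestamp, size in self.events:
            used -= size
            if used + nbytes <= self.quota:
                return max(0.0, timestamp + self.window - now)
        return 0.0

    def record(self, nbytes):
        """Note that nbytes were just transferred."""
        now = self.clock()
        self._expire(now)
        self.events.append((now, nbytes))
        self.used += nbytes
```

A caller would check `wait_time(size)` before each download, sleep for that long if it is non-zero, and call `record(size)` afterwards. This only keeps the client polite; it does not explain the throttling seen in the screenshot, which may have other causes.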