diff --git a/content/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp b/content/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp
new file mode 100644
index 0000000..6ca2d4c
Binary files /dev/null and b/content/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp differ
diff --git a/content/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp b/content/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp
new file mode 100644
index 0000000..4ce28f0
Binary files /dev/null and b/content/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp differ
diff --git a/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp b/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp
new file mode 100644
index 0000000..7b513e3
Binary files /dev/null and b/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp differ
diff --git a/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp b/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp
new file mode 100644
index 0000000..3649a34
Binary files /dev/null and b/content/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp differ
diff --git a/content/blog/2026/creating-web-snapshots-for-data-for-canada-universe.md b/content/blog/2026/creating-web-snapshots-for-data-for-canada-universe.md
new file mode 100644
index 0000000..70c5547
--- /dev/null
+++ b/content/blog/2026/creating-web-snapshots-for-data-for-canada-universe.md
@@ -0,0 +1,28 @@
+---
+title: "A Permanent Record: Creating Web Snapshots for Data for Canada / Data for the Universe"
+summary:
+date: 2026-05-04T09:00:00-04:00
+authors:
+  - name: diegoripley
+    link: https://github.com/diegoripley
+    image: https://github.com/diegoripley.png
+tags:
+  - corpus
+excludeSearch: false
+draft: true
+---
+
+- Using https://github.com/ArchiveBox/ArchiveBox to archive web pages can create high quality web snapshots to complement Internet Archive operations.
+
+{{< cards >}}
+  {{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
+  {{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777838776.472139/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
+{{< /cards >}}
+
+In this case, the Internet Archive's snapshot is superior as they also save all of the URLs listed in the "Letter Text" section
+{{< cards >}}
+  {{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
+  {{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777842968.455868/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
+{{< /cards >}}
+
+- Talk about needing a customized [yt-dlp](https://github.com/yt-dlp/yt-dlp) as YouTube has recently changed and is harder to archive.
\ No newline at end of file