d4c-service-main-site/content/blog/2026/creating-web-snapshots-for-data-for-canada-universe.md
2026-05-04 10:10:26 -04:00


---
title: "A Permanent Record: Creating Web Snapshots for Data for Canada / Data for the Universe"
summary: ""
date: 2026-05-04T09:00:00-04:00
authors:
  - name: diegoripley
    link: https://github.com/diegoripley
    image: https://github.com/diegoripley.png
tags:
  - corpus
excludeSearch: false
draft: true
---

{{< cards >}}
  {{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
  {{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777838776.472139/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}

In this case, the Internet Archive's snapshot is superior because it also saves all of the URLs listed in the "Letter Text" section.

{{< cards >}}
  {{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
  {{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777842968.455868/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}
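
Both kinds of snapshot link carry a capture time in the URL itself, which gives a rough way to check how close together two captures of the same page were taken. The sketch below is illustrative only. It relies on the Wayback Machine's 14-digit `YYYYMMDDhhmmss` (UTC) path segment, and it assumes that the directory name after `/archive/` in the ArchiveBox link is a Unix epoch timestamp. The function names are hypothetical and are not part of either tool.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Snapshot links taken from the cards in this post.
WAYBACK_URL = (
    "https://web.archive.org/web/20260503194906/"
    "https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html"
)
ARCHIVEBOX_URL = (
    "https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/"
    "archive/1777838776.472139/singlefile.html"
)


def wayback_capture_time(snapshot_url: str) -> datetime:
    """Parse the capture time from a /web/<timestamp>/<original-url> path."""
    parts = urlparse(snapshot_url).path.split("/")
    # parts[0] is empty, parts[1] is "web", parts[2] is the timestamp.
    # Keep only the first 14 characters in case a modifier suffix follows.
    stamp = parts[2][:14]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def archivebox_snapshot_time(snapshot_url: str) -> datetime:
    """Read the path segment after 'archive' as a Unix epoch (assumption)."""
    parts = urlparse(snapshot_url).path.split("/")
    stamp = parts[parts.index("archive") + 1]
    return datetime.fromtimestamp(float(stamp), tz=timezone.utc)


if __name__ == "__main__":
    wayback = wayback_capture_time(WAYBACK_URL)
    archivebox = archivebox_snapshot_time(ARCHIVEBOX_URL)
    print("Internet Archive capture:", wayback.isoformat())
    print("ArchiveBox capture:      ", archivebox.isoformat())
    print("Gap between captures:    ", abs(archivebox - wayback))
```

If the epoch assumption holds, the two captures in the first pair of cards were taken on the same day, so differences between them are more likely to come from the archiving tools than from changes to the page.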

- Talk about needing a customized yt-dlp, as YouTube has recently changed and is now harder to archive.