Add web corpus blog post draft

Diego Ripley
2026-05-04 10:10:26 -04:00
parent 75f502ab47
commit d7cae67922
5 changed files with 28 additions and 0 deletions
Binary file not shown. Size: 191 KiB
Binary file not shown. Size: 196 KiB
Binary file not shown. Size: 368 KiB
Binary file not shown. Size: 124 KiB

@@ -0,0 +1,28 @@
---
title: "A Permanent Record: Creating Web Snapshots for Data for Canada / Data for the Universe"
summary:
date: 2026-05-04T09:00:00-04:00
authors:
- name: diegoripley
link: https://github.com/diegoripley
image: https://github.com/diegoripley.png
tags:
- corpus
excludeSearch: false
draft: true
---
- Archiving web pages with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) can produce high-quality web snapshots that complement the Internet Archive's own captures.
{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777838776.472139/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}
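As a rough sketch of how a page like the one above could be queued for archiving, the snippet below wraps the ArchiveBox CLI's `archivebox add` command. The helper names (`build_add_command`, `archive_url`) and the `./archivebox-data` directory are hypothetical and only for illustration; it also assumes the collection was already initialized with `archivebox init`. This is not necessarily how the snapshots in this post were produced.

```python
import shutil
import subprocess
from pathlib import Path


def build_add_command(url: str) -> list[str]:
    """Return the argv for archiving a single URL with the ArchiveBox CLI."""
    return ["archivebox", "add", url]


def archive_url(url: str, data_dir: Path) -> bool:
    """Run `archivebox add` inside data_dir.

    Returns False without doing anything if the CLI is not installed.
    """
    if shutil.which("archivebox") is None:
        return False
    # ArchiveBox commands are run from inside the collection's data directory.
    subprocess.run(build_add_command(url), cwd=data_dir, check=True)
    return True


if __name__ == "__main__":
    page = "https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html"
    # Hypothetical data directory; adjust to wherever `archivebox init` was run.
    if not archive_url(page, Path("./archivebox-data")):
        print("archivebox not found; would have run:", " ".join(build_add_command(page)))
```

Keeping the command construction separate from the subprocess call makes it easy to log or dry-run a batch of URLs before anything is actually fetched.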
In the case below, the Internet Archive's snapshot is superior because it also saves all of the URLs listed in the "Letter Text" section.
{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777842968.455868/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}
- Talk about needing a customized [yt-dlp](https://github.com/yt-dlp/yt-dlp), as YouTube has recently changed and is now harder to archive.