mirror of
https://github.com/dataforcanada/d4c-service-main-site.git
synced 2026-06-13 14:00:51 +02:00
28 lines
2.0 KiB
Markdown
---
title: "A Permanent Record: Creating Web Snapshots for Data for Canada / Data for the Universe"
summary:
date: 2026-05-04T09:00:00-04:00
authors:
  - name: diegoripley
    link: https://github.com/diegoripley
    image: https://github.com/diegoripley.png
tags:
  - corpus
excludeSearch: false
draft: true
---

- Archiving web pages with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) can produce high-quality web snapshots that complement the Internet Archive's own captures.

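As a rough illustration of the archiving step, the sketch below wraps the `archivebox add` command in Python. It assumes ArchiveBox is installed and on `PATH`, and that the data directory was already set up with `archivebox init`. The helper names `build_add_command` and `archive` are hypothetical and are not part of ArchiveBox or of this site's tooling.

```python
import shutil
import subprocess


def build_add_command(url: str, depth: int = 0) -> list[str]:
    """Build the argument list for `archivebox add` (hypothetical helper).

    depth=0 archives only the given page; depth=1 also archives the pages
    it links to. Other values are rejected here as an assumption about
    what the CLI accepts.
    """
    if depth not in (0, 1):
        raise ValueError("depth must be 0 or 1")
    return ["archivebox", "add", f"--depth={depth}", url]


def archive(url: str, data_dir: str, depth: int = 0) -> int:
    """Run `archivebox add` inside an already initialized data directory.

    Returns the process exit code, or -1 if archivebox is not on PATH.
    """
    if shutil.which("archivebox") is None:
        return -1
    result = subprocess.run(build_add_command(url, depth), cwd=data_dir)
    return result.returncode
```

Calling `archive(page_url, data_dir)` with a page URL and the path to an ArchiveBox data directory would then queue a single-page snapshot. Passing `depth=1` is one way to also capture pages linked from the article, at the cost of a larger archive.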
{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777838776.472139/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}

In this case, the Internet Archive's snapshot is superior because it also saves all of the URLs listed in the "Letter Text" section.

{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777842968.455868/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}

- Talk about needing a customized [yt-dlp](https://github.com/yt-dlp/yt-dlp), since YouTube has recently changed and is now harder to archive.