mirror of
https://github.com/dataforcanada/d4c-service-main-site.git
synced 2026-06-13 14:00:51 +02:00
Add web corpus blog post draft
Binary file not shown. Size: 191 KiB
Binary file not shown. Size: 196 KiB
Binary file not shown. Size: 368 KiB
Binary file not shown. Size: 124 KiB
@@ -0,0 +1,28 @@
---
title: "A Permanent Record: Creating Web Snapshots for Data for Canada / Data for the Universe"
summary:
date: 2026-05-04T09:00:00-04:00
authors:
  - name: diegoripley
    link: https://github.com/diegoripley
    image: https://github.com/diegoripley.png
tags:
  - corpus
excludeSearch: false
draft: true
---

- Archiving web pages with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) produces high-quality snapshots that complement the Internet Archive's captures.

{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777838776.472139/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/2026-04-usindian-space-mission-extreme-subsidence-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}

In this case, the Internet Archive's snapshot is superior because it also saves all of the URLs listed in the "Letter Text" section.
{{< cards >}}
{{< card link="https://web.archive.org/web/20260503194906/https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html" title="Internet Archive Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-internet-archive-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< card link="https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus/archive/1777842968.455868/singlefile.html" title="ArchiveBox Snapshot" image="/blog/2026/200-journalists-applaud-internet-archive-archivebox-snapshot.webp" subtitle="Click on the image to preview page" >}}
{{< /cards >}}

- Talk about needing a customized [yt-dlp](https://github.com/yt-dlp/yt-dlp), as YouTube has recently changed and become harder to archive.
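
The snapshot links used in the cards follow two regular patterns, so they could be generated rather than pasted by hand. The Python sketch below is only an illustration: the function names are hypothetical, and the ArchiveBox path layout (`archive/<snapshot id>/singlefile.html` under a bucket base URL) is an assumption inferred from the links in this post, not a documented guarantee.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback Machine snapshot URL.

    `timestamp` is the 14-digit capture time (YYYYMMDDhhmmss) that
    appears in the Internet Archive links in this post.
    """
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def archivebox_singlefile_url(bucket_base: str, snapshot_id: str) -> str:
    """Build the URL of an ArchiveBox SingleFile capture.

    Assumes (hypothetically) that snapshots are published under
    <bucket_base>/archive/<snapshot_id>/singlefile.html.
    """
    return f"{bucket_base.rstrip('/')}/archive/{snapshot_id}/singlefile.html"


if __name__ == "__main__":
    page = "https://phys.org/news/2026-04-usindian-space-mission-extreme-subsidence.html"
    bucket = "https://s3.datafortheuniverse.org/tigris/d4u-datapkg-web-corpus"
    print(wayback_url("20260503194906", page))
    print(archivebox_singlefile_url(bucket, "1777838776.472139"))
```

Keeping the two builders separate makes it easy to check each generated link against the ones already embedded in the cards.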