diff --git a/README.md b/README.md
index 0edd730..020fca9 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,151 @@
-## High-Level Overview
+## Data Dissemination Strategy High-Level Overview
+
+```mermaid
+flowchart TD
+    classDef linkNode stroke:#0000EE,color:#0000EE,stroke-width:2px;
+    subgraph mirrors [Mirrors & Preservation]
+        SourceCoop[Source Cooperative]
+        Zenodo[Zenodo]
+        InternetArchive[Internet Archive]
+    end
+
+    Sources[Open Data Sources]
+    Processes[Transformation Pipelines]
+    Artifacts[Systems-Ready Data]
+
+    subgraph CoreInfra [Data for Canada Infrastructure]
+        Portal[Object Storage]
+        Metadata[FAIR Data Catalog]
+    end
+
+    Distribution[Decentralized Distribution]
+    Torrent[BitTorrent Swarm]
+
+    subgraph Consumers [Consumption]
+        Users[Data People & Developers]
+        Systems[Automated Systems]
+    end
+
+    %% Flow with Animations
+    Sources a1@--> Processes
+    a1@{animate: true, animation: slow}
+
+    Processes a2@--> Artifacts
+    a2@{animate: true, animation: slow}
+
+    Artifacts a3@--> Portal
+    a3@{animate: true, animation: slow}
+
+    Portal a4@--> Metadata
+    a4@{animate: true, animation: fast}
+
+    Metadata a5@--> Distribution
+    a5@{animate: true, animation: fast}
+
+    %% Distribution Logic
+    Distribution a7@-.->|Primary| mirrors
+    a7@{animate: true, animation: slow}
+
+    Distribution a8@-.->|Resiliency| Torrent
+    a8@{animate: true, animation: slow}
+
+    %% Mirror Connections
+    mirrors a12@--> Users
+    a12@{animate: true, animation: slow}
+
+    mirrors a13@--> Systems
+    a13@{animate: true, animation: slow}
+
+    mirrors a9@-.->|Seeding| Torrent
+    a9@{animate: true, animation: fast}
+
+    %% Torrent Connections
+    Torrent a10@--> Users
+    a10@{animate: true, animation: fast}
+
+    Torrent a11@--> Systems
+    a11@{animate: true, animation: fast}
+
+    %% Click Actions
+    click Artifacts "https://www.dataforcanada.org/docs/processes/" _blank
+    click Metadata "https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/" _blank
+    click Zenodo "https://zenodo.org/communities/dataforcanada/" _blank
+    click SourceCoop "https://source.coop/dataforcanada/" _blank
+
+    %% APPLY STYLES TO LINKED NODES
+    class Artifacts,Metadata,Zenodo,SourceCoop linkNode
+```
+
+## 🏗️ Open Processing Architecture
+
+We believe that true open data requires open production. To ensure the longevity and resilience of Canada's data infrastructure, we treat our data pipelines as **open source software artifacts**. We provide the "blueprints" alongside the data, allowing any user to verify our work or rebuild the dataset from scratch on their own infrastructure.
+
+### The Blueprint Model
+
+Our processing strategy relies on three immutable components to guarantee transparency:
+
+1. **Build Manifests:** Every dataset version is accompanied by a strict manifest. This locks the exact "ingredients" used: the cryptographic hashes of the raw source files, the specific Git commit of the processing code, and the configuration parameters.
+2. **Environment Definitions:** Rather than opaque binaries, we publish the exact **Infrastructure as Code (IaC)** definitions (e.g., Dockerfiles). This allows users to inspect the system context (GDAL versions, libraries, and dependencies) and build the environment themselves.
+3. **Deterministic Builds:** By combining a *Build Manifest* with our *Environment Definitions*, any user can execute a **deterministic build**. This process guarantees a bit-for-bit identical copy of the official Data for Canada artifact, ensuring that the pipeline is independent of our specific servers.
+
+**Mirrored Source Artifacts:**
+Crucially, we do not rely solely on external version control systems like GitHub, which may change or disappear. A complete snapshot of the processing code, environment definitions, and manifests is bundled with every data release.
These source artifacts are replicated across **Source Cooperative, Zenodo, the Internet Archive, Data for Canada infrastructure, and the community**, ensuring that the *method* of creation is preserved with the same redundancy as the *result*.
+
+## Dissemination Process
+
+Once data products reach a production-ready state, they enter a dissemination flow designed for permanence and performance:
+
+* **Cloud-Native First:** Priority is given to performant, system-to-system file formats (e.g., Parquet) to enable high-throughput applications without the need for local parsing.
+* **Persistent Identification:** Every dataset version is assigned a DOI for citation and immutability.
+* **The FAIR Data Catalog:** Global metadata is aggregated into a single, queryable **[FAIR Data Catalog](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/)**. This catalog acts as the "brain" of the system, tracking all versions and DOIs, and directing users to the optimal source within our multi-tier storage network:
+* **[Source Cooperative](https://source.coop/dataforcanada)** serves as our **primary mirror** for all datasets, including large-scale products like orthoimagery.
+* **[Zenodo](https://zenodo.org/communities/dataforcanada/)** serves as our repository for **long-term academic preservation** and provides a high-speed mirror for European users.
+* **[The Internet Archive](https://archive.org)** is utilized **strategically** for specific datasets to ensure historical redundancy.
+
+### Decentralized Distribution (BitTorrent)
+
+We are piloting BitTorrent to maximize infrastructure resilience. By leveraging [HTTP Web Seeding (BEP 19)](https://www.bittorrent.org/beps/bep_0019.html), torrents will be seeded simultaneously by Source Cooperative, Zenodo, Data for Canada infrastructure, and community peers. This ensures high availability without a single point of failure.
Current laboratory work is available in the [Decentralized Distribution Labs](https://github.com/dataforcanada/decentralized-distribution-labs).
+
+## Work in the Lab: Smart Nodes
+
+To further democratize access and ensure the persistence of Canada’s open data, we are experimenting with the features defined in previous work done by [Academic Torrents](https://academictorrents.com/docs/mirroring.html#smartnodes).
+
+A Smart Node functions as a "set-it-and-forget-it" volunteer server, an automated library branch for our data infrastructure.
+
+* **Automated Mirroring:** Unlike a standard download, a Smart Node automatically synchronizes with our central **FAIR Data Catalog**. It intelligently fetches new or "at-risk" datasets to ensure they remain available even if the central portal experiences downtime.
+* **Volunteer-Powered Resilience:** This model allows partner institutions (e.g., universities, research labs) and public volunteers to donate bandwidth and storage. By running a Smart Node, contributors actively protect vital Canadian datasets from being lost or gated.
+* **Dynamic Storage Management:** The node software monitors network health to optimize resource usage. Leveraging BitTorrent's capability for **selective piece mapping**, the node does not need to store the entire catalog. Instead, it identifies specific file indices or "rare" pieces within the metadata and sends granular `REQUEST` messages for only those blocks. This allows a node with limited storage (e.g., 500 GB) to provide critical redundancy for a much larger archive (e.g., 50 TB) by surgically targeting only the data that is currently under-shared.
+
+We are currently refining the concepts from [smartnode-transmission](https://github.com/academictorrents/smartnode-transmission) to work seamlessly with our catalog, enabling a fully decentralized data mesh for Canadian geospatial information.
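
The selection step described under Dynamic Storage Management can be sketched roughly as follows. This is a minimal illustration, not the smartnode-transmission implementation: the `Piece` record, the per-piece `seeders` count, the `min_seeders` threshold, and the dataset names are all hypothetical, and a real node would have to derive piece availability from the swarm itself rather than receive it as a ready-made list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical view of one torrent piece as seen by a Smart Node."""
    dataset: str     # illustrative dataset identifier from the catalog
    index: int       # piece index within that dataset's torrent
    size_bytes: int  # piece length in bytes
    seeders: int     # how many peers currently hold this piece (assumed known)


def select_pieces(pieces: list[Piece], budget_bytes: int,
                  min_seeders: int = 3) -> list[Piece]:
    """Pick under-shared pieces, rarest first, without exceeding the budget."""
    # Only pieces held by fewer than `min_seeders` peers count as at-risk.
    at_risk = [p for p in pieces if p.seeders < min_seeders]
    # Rarest first; dataset and index break ties so the order is stable.
    at_risk.sort(key=lambda p: (p.seeders, p.dataset, p.index))

    chosen: list[Piece] = []
    used = 0
    for piece in at_risk:
        if used + piece.size_bytes > budget_bytes:
            continue  # does not fit; a later, smaller piece still might
        chosen.append(piece)
        used += piece.size_bytes
    return chosen


if __name__ == "__main__":
    swarm = [
        Piece("dataset-a", 12, 4_000_000, seeders=9),
        Piece("dataset-b", 804, 4_000_000, seeders=1),
        Piece("dataset-b", 805, 4_000_000, seeders=2),
        Piece("dataset-c", 3, 4_000_000, seeders=0),
    ]
    for piece in select_pieces(swarm, budget_bytes=8_000_000):
        print(f"request piece #{piece.index} of {piece.dataset}")
```

The greedy rarest-first ordering is one possible policy; a production node might also weigh dataset priority from the catalog or evict pieces once they become well seeded.
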
+
+```mermaid
+graph TD
+    %% Node Definitions with custom labels
+    Catalog[("FAIR Data Catalog")]
+    SmartNode["Volunteer Smart Node<br/>(Limited Storage Optimization)"]
+    BTNetwork(["BitTorrent Peer Network<br/>(Massive Data Pool)"])
+
+    %% The Process Flow
+    Catalog -->|"1. Syncs metadata & identifies 'at-risk' data"| SmartNode
+
+    note["Note: The Node does NOT<br/>download the whole file."]
+    SmartNode -.- note
+
+    SmartNode -->|"2. Sends granular REQUESTs for specific pieces only<br/>(e.g., 'Send Piece #804 of Dataset B')"| BTNetwork
+
+    BTNetwork -.->|"3. Transfers ONLY the requested blocks"| SmartNode
+
+    %% Optional Styling for visual clarity
+    classDef central fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
+    classDef node fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5;
+    classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
+
+    class Catalog central;
+    class SmartNode node;
+    class BTNetwork network;
+```
+
+## Infrastructure High-Level Overview
 
 ```mermaid
 flowchart TD