Update dissemination strategy. Vision is much more clear for those that are technical

2026-06-13 14:00:51 +02:00 · 2026-02-08 14:08:18 -05:00
parent 2d19979bc9
commit 49fa00a5ab
1 changed files with 80 additions and 34 deletions
@@ -13,55 +13,73 @@ We prioritize interoperability, long-term preservation, and decentralized resili

 ```mermaid
 flowchart TD
-    subgraph mirrors [Mirrors]
+    subgraph mirrors [Mirrors & Preservation]
        SourceCoop[Source Cooperative]
        Zenodo[Zenodo]
        InternetArchive[Internet Archive]
    end

    Sources[Open Data Sources]
-    Processes[Transformation Processes]
+    Processes[Transformation Pipelines]
    Artifacts[Systems-Ready Data]
-    Portal[Object Storage]
-    Metadata[Metadata]
-    Distribution[Decentralized Distribution]
-    Torrent[BitTorrent]
-    Users[Data People & Developers]
-    Systems[Systems]
    
+    subgraph CoreInfra [Data for Canada Infrastructure]
+        Portal[Object Storage]
+        Metadata[FAIR Data Catalog]
+    end
+    
+    Distribution[Decentralized Distribution]
+    Torrent[BitTorrent Swarm]
+    
+    subgraph Consumers [Consumption]
+        Users[Data People & Developers]
+        Systems[Automated Systems]
+    end
+
+    %% Flow with Animations
    Sources a1@--> Processes
    a1@{animate: true, animation: slow}
+    
    Processes a2@--> Artifacts
    a2@{animate: true, animation: slow}
+    
    Artifacts a3@--> Portal
    a3@{animate: true, animation: slow}
+    
    Portal a4@--> Metadata
    a4@{animate: true, animation: fast}
+    
    Metadata a5@--> Distribution
    a5@{animate: true, animation: fast}

-    Distribution a7@--> Torrent
+    %% Distribution Logic
+    Distribution a7@-.->|Primary| mirrors
    a7@{animate: true, animation: slow}
-    Distribution a8@--> mirrors
+    
+    Distribution a8@-.->|Resiliency| Torrent
    a8@{animate: true, animation: slow}

-    mirrors a9@--> Torrent
-    a9@{animate: true, animation: fast}
-
+    %% Mirror Connections
    mirrors a12@--> Users
    a12@{animate: true, animation: slow}
+    
    mirrors a13@--> Systems
    a13@{animate: true, animation: slow}
    
+    mirrors a9@-.->|Seeding| Torrent
+    a9@{animate: true, animation: fast}
+
+    %% Torrent Connections
    Torrent a10@--> Users
    a10@{animate: true, animation: fast}
+    
    Torrent a11@--> Systems
    a11@{animate: true, animation: fast}

-    click Metadata "[https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/)" _blank
-    click Zenodo "[https://zenodo.org/communities/dataforcanada/](https://zenodo.org/communities/dataforcanada/)" _blank
-    click SourceCoop "[https://source.coop/dataforcanada/](https://source.coop/dataforcanada/)" _blank
-
+    %% Click Actions
+    click Metadata "https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/" _blank
+    click Zenodo "https://zenodo.org/communities/dataforcanada/" _blank
+    click SourceCoop "https://source.coop/dataforcanada/" _blank
 ```

 ## 🏗️ Open Processing Architecture
@@ -73,7 +91,7 @@ We believe that true open data requires open production. To ensure the longevity
 Our processing strategy relies on three immutable components to guarantee transparency:

 1. **Build Manifests:** Every dataset version is accompanied by a strict manifest. This locks the exact "ingredients" used: the cryptographic hashes of the raw source files, the specific Git commit of the processing code, and the configuration parameters.
-2. **Environment Definitions:** Rather than opaque binaries, we publish the exact **Infrastructure as Code (IaC)** definitions (e.g., Dockerfiles). This allows users to inspect the system context GDAL versions, libraries, and dependencies, and build the environment themselves.
+2. **Environment Definitions:** Rather than opaque binaries, we publish the exact **Infrastructure as Code (IaC)** definitions (e.g., Dockerfiles). This allows users to inspect the system context (GDAL versions, libraries, and dependencies) and build the environment themselves.
 3. **Deterministic Builds:** By combining a *Build Manifest* with our *Environment Definitions*, any user can execute a **deterministic build**. This process guarantees a bit-for-bit identical copy of the official Data for Canada artifact, ensuring that the pipeline is independent of our specific servers.
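
The manifest check described above can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: the manifest layout (a `sources` list with `name` and `sha256` fields) and the function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sources(manifest: dict, source_dir: Path) -> list[str]:
    """Return names of source files that are missing or whose hash differs.

    Hypothetical manifest layout:
    {"sources": [{"name": "roads.gpkg", "sha256": "<hex digest>"}, ...]}
    """
    mismatched = []
    for entry in manifest["sources"]:
        candidate = source_dir / entry["name"]
        if not candidate.is_file() or sha256_of(candidate) != entry["sha256"]:
            mismatched.append(entry["name"])
    return mismatched
```

An empty list from `verify_sources` means the raw inputs match the manifest, which is a precondition for a rebuild to reproduce the published artifact. A real manifest would also pin the Git commit and configuration parameters, which this sketch does not check.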

 **Mirrored Source Artifacts:**
@@ -81,26 +99,54 @@ Crucially, we do not rely solely on external version control systems like GitHub

 ## Dissemination Process

-Once data products reach a production-ready state, the workflow is as follows:
+Once data products reach a production-ready state, they enter a dissemination flow designed for permanence and performance:

-* **Cloud-Native First:** Priority is given to performant, system-to-system file formats (e.g., Parquet) to enable highly performant applications.
-* **Persistent Identification & Cataloging:** Every dataset version will be assigned a DOI for citation and immutability.
-* The endpoint `https://data-01.dataforcanada.org/processed/` will strictly serve the **latest** version of a dataset.
-* Global metadata will be aggregated into a single, queryable [STAC GeoParquet](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/) file. This catalog will track all versions and DOIs, directing users to our multi-tier storage network:
-* **[Source Cooperative](https://source.coop/dataforcanada)** serves as our **primary mirror** for all datasets, including large-scale products like orthoimagery (see [Funding and Governance](https://docs.source.coop/#funding-and-governance)).
-* **[Zenodo](https://zenodo.org/communities/dataforcanada/)** serves as our repository for **long-term preservation** and provides a mirror for users in Europe (see [Funding](https://about.zenodo.org/infrastructure/)).
-* **[The Internet Archive](https://archive.org)** is utilized **strategically** for specific datasets to minimize load on their limited infrastructure (see [Funding](https://projects.propublica.org/nonprofits/organizations/943242767)).
+* **Cloud-Native First:** Priority is given to performant, system-to-system file formats (e.g., Parquet) so that high-throughput applications can read data directly from storage without a separate conversion step.
+* **Persistent Identification:** Every dataset version is assigned a DOI for citation and immutability.
+* **The FAIR Data Catalog:** Global metadata is aggregated into a single, queryable **[FAIR Data Catalog](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/)**. This catalog acts as the "brain" of the system, tracking all versions and DOIs, and directing users to the optimal source within our multi-tier storage network:
+* **[Source Cooperative](https://source.coop/dataforcanada)** serves as our **primary mirror** for all datasets, including large-scale products like orthoimagery.
+* **[Zenodo](https://zenodo.org/communities/dataforcanada/)** serves as our repository for **long-term academic preservation** and provides a high-speed mirror for European users.
+* **[The Internet Archive](https://archive.org)** is utilized **strategically** for specific datasets to ensure historical redundancy.

-* **Decentralized Distribution:** We will pilot BitTorrent to maximize infrastructure resilience. By leveraging [HTTP Web Seeding (BEP 19)](https://www.bittorrent.org/beps/bep_0019.html), torrents will be seeded simultaneously by Source Cooperative, Zenodo, the Data for Canada infrastructure, and community peers, ensuring high availability without a single point of failure. Current laboratory work is available on the [dataforcanada/decentralized-distribution-labs](https://github.com/dataforcanada/decentralized-distribution-labs) repo.
+### Decentralized Distribution (BitTorrent)
+
+We are piloting BitTorrent to maximize infrastructure resilience. By leveraging [HTTP Web Seeding (BEP 19)](https://www.bittorrent.org/beps/bep_0019.html), torrents will be seeded simultaneously by Source Cooperative, Zenodo, Data for Canada infrastructure, and community peers. This ensures high availability without a single point of failure. Current laboratory work is available in the [Decentralized Distribution Labs](https://github.com/dataforcanada/decentralized-distribution-labs).
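
As a rough illustration of how web seeding fits into a torrent, the sketch below builds a single-file metainfo dictionary with the `url-list` key that BEP 19 defines, then bencodes it. It is a simplified example under stated assumptions: the mirror URLs are placeholders, the payload is held in memory (real datasets would be hashed in a streaming fashion), and tracker or DHT settings are left out.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists, and dicts using BitTorrent bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        pairs = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name: str, payload: bytes, piece_length: int, web_seeds: list[str]) -> dict:
    """Build a single-file (BitTorrent v1) metainfo dict with HTTP web seeds."""
    # v1 torrents store the concatenated SHA-1 digest of every piece.
    pieces = b"".join(
        hashlib.sha1(payload[start:start + piece_length]).digest()
        for start in range(0, len(payload), piece_length)
    )
    return {
        "info": {
            "name": name,
            "length": len(payload),
            "piece length": piece_length,
            "pieces": pieces,
        },
        # BEP 19: HTTP(S) locations that clients may use as additional seeds.
        "url-list": list(web_seeds),
    }


def info_hash(metainfo: dict) -> str:
    """The torrent's identity: SHA-1 of the bencoded info dictionary."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()
```

Because the info-hash covers only the `info` dictionary, the same torrent identity holds no matter which mirrors appear in `url-list`, so mirrors can be added or retired without invalidating existing swarms.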

 ## Work in the Lab: Smart Nodes

 To further democratize access and ensure the persistence of Canada’s open data, we are experimenting with the features defined in previous work done by [Academic Torrents](https://academictorrents.com/docs/mirroring.html#smartnodes).

-A smart node functions as a "set-it-and-forget-it" volunteer server, an automated library branch for our data infrastructure.
+A Smart Node functions as a "set-it-and-forget-it" volunteer server, an automated library branch for our data infrastructure.

-* **Automated Mirroring:** Unlike a standard download, a smart node automatically synchronizes with our central catalog. It intelligently fetches new or "at-risk" datasets to ensure they remain available even if the central portal experiences downtime.
-* **Volunteer-Powered Resilience:** This model allows partner institutions (universities, research labs) and public volunteers to donate bandwidth and storage. By running a smart node, contributors actively protect vital Canadian datasets from being lost or gated behind paywalls.
-* **Dynamic Storage Management:** The node software monitors network health, automatically prioritizing rare data to maintain high availability across the entire Data for Canada ecosystem.
+* **Automated Mirroring:** Unlike a standard download, a Smart Node automatically synchronizes with our central **FAIR Data Catalog**. It intelligently fetches new or "at-risk" datasets to ensure they remain available even if the central portal experiences downtime.
+* **Volunteer-Powered Resilience:** This model allows partner institutions (e.g., universities, research labs) and public volunteers to donate bandwidth and storage. By running a Smart Node, contributors actively protect vital Canadian datasets from being lost or gated.
+* **Dynamic Storage Management:** The node software monitors network health to optimize resource usage. Because BitTorrent transfers data piece by piece (**selective piece mapping**), the node does not need to store the entire catalog. Instead, it uses the torrent metadata to identify specific file indices or "rare" pieces and sends granular `REQUEST` messages for only those blocks. This allows a node with limited storage (e.g., 500 GB) to provide critical redundancy for a much larger archive (e.g., 50 TB) by targeting only the data that is currently under-seeded.

-We are currently refining the concepts from  [smart-node-transmission](https://github.com/academictorrents/smartnode-transmission)  to work seamlessly with our STAC GeoParquet catalog, enabling a fully decentralized data mesh for Canadian geospatial information.
+We are currently refining the concepts from [smart-node-transmission](https://github.com/academictorrents/smartnode-transmission) to work seamlessly with our catalog, enabling a fully decentralized data mesh for Canadian geospatial information.
+
+```mermaid
+graph TD
+    %% Node Definitions with custom labels
+    Catalog[("FAIR Data Catalog")]
+    SmartNode["Volunteer Smart Node<br/>(Limited Storage Optimization)"]
+    BTNetwork(["BitTorrent Peer Network<br/>(Massive Data Pool)"])
+
+    %% The Process Flow
+    Catalog -->|"1. Syncs metadata & identifies 'at-risk' data"| SmartNode
+    
+    note["Note: The Node does NOT<br/>download the whole file."]
+    SmartNode -.- note
+
+    SmartNode -->|"2. Sends granular REQUESTs for specific pieces only<br/>(e.g., 'Send Piece #804 of Dataset B')"| BTNetwork
+    
+    BTNetwork -.->|"3. Transfers ONLY the requested blocks"| SmartNode
+
+    %% Optional Styling for visual clarity
+    classDef central fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
+    classDef volunteer fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5;
+    classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
+    
+    class Catalog central;
+    class SmartNode volunteer;
+    class BTNetwork network;
+```
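
The Dynamic Storage Management idea can be approximated with a greedy rarest-first selection under a storage budget. The sketch below is an assumption-laden illustration rather than the Smart Node implementation: the `Piece` record, the `seeders` count (which a real client would derive from peers' availability announcements), and the function names are hypothetical. The `request_message` helper follows the peer wire protocol's `request` layout (length prefix 13, message ID 6, then piece index, byte offset, and block length).

```python
import struct
from dataclasses import dataclass

BLOCK_SIZE = 16 * 1024  # conventional request size in BitTorrent clients


@dataclass(frozen=True)
class Piece:
    dataset: str
    index: int
    size_bytes: int
    seeders: int  # hypothetical count of peers currently holding this piece


def choose_pieces(pieces: list[Piece], budget_bytes: int) -> list[Piece]:
    """Greedy rarest-first: keep the least-replicated pieces that still fit."""
    chosen = []
    remaining = budget_bytes
    for piece in sorted(pieces, key=lambda p: (p.seeders, p.dataset, p.index)):
        if piece.size_bytes <= remaining:
            chosen.append(piece)
            remaining -= piece.size_bytes
    return chosen


def request_message(index: int, begin: int, length: int = BLOCK_SIZE) -> bytes:
    """Serialize a peer wire `request` for one block of one piece."""
    return struct.pack(">IBIII", 13, 6, index, begin, length)
```

For example, `request_message(804, 0)` asks a peer for the first 16 KiB block of piece 804. A greedy pass is not guaranteed to use the budget optimally, but it keeps the policy easy to reason about: the rarest data is always considered first.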