mirror of https://github.com/dataforcanada/d4c-infra-distribution.git
synced 2026-06-13 14:10:53 +02:00
Update README.md
@@ -1,5 +1,7 @@
|
|||||||
## Data Dissemination Strategy High-Level Overview
|
## Data Dissemination Strategy High-Level Overview
|
||||||
|
|
||||||
|
## High-Level Overview
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart TD
|
flowchart TD
|
||||||
classDef linkNode stroke:#0000EE,color:#0000EE,stroke-width:2px;
|
classDef linkNode stroke:#0000EE,color:#0000EE,stroke-width:2px;
|
||||||
@@ -7,59 +9,70 @@ flowchart TD
|
|||||||
SourceCoop[Source Cooperative]
|
SourceCoop[Source Cooperative]
|
||||||
Zenodo[Zenodo]
|
Zenodo[Zenodo]
|
||||||
InternetArchive[Internet Archive]
|
InternetArchive[Internet Archive]
|
||||||
|
Community[Community]
|
||||||
end
|
end
|
||||||
|
|
||||||
Sources[Open Data Sources]
|
Sources[Open Data Sources]
|
||||||
Processes[Transformation Pipelines]
|
Processes[Data for Canada Packages Collection]
|
||||||
Artifacts[Systems-Ready Data]
|
Artifacts[Systems-Ready Data]
|
||||||
|
|
||||||
subgraph CoreInfra [Data for Canada Infrastructure]
|
subgraph CoreInfra [Infrastructure]
|
||||||
Portal[Object Storage]
|
Portal[Object Storage]
|
||||||
Metadata[FAIR Data Catalog]
|
Metadata[FAIR Data Catalogue]
|
||||||
end
|
end
|
||||||
|
|
||||||
Distribution[Decentralized Distribution]
|
P2P["P2P Technology"]
|
||||||
Torrent["P2P Technology"]
|
|
||||||
|
|
||||||
subgraph Consumers [Consumption]
|
subgraph Consumers [Consumption]
|
||||||
Users[Data People & Developers]
|
Users[Data People & Developers]
|
||||||
Systems[Automated Systems]
|
Systems[Systems]
|
||||||
end
|
end
|
||||||
|
|
||||||
%% Flow with Animations
|
%% Flow with Animations
|
||||||
Sources a1@--> Processes
|
Sources a1@<--> Processes
|
||||||
a1@{animate: true, animation: slow}
|
a1@{animate: true, animation: slow}
|
||||||
|
|
||||||
Processes a2@--> Artifacts
|
Processes a2@<--> Artifacts
|
||||||
a2@{animate: true, animation: slow}
|
a2@{animate: true, animation: slow}
|
||||||
|
|
||||||
Artifacts a3@--> Portal
|
Artifacts a3@<--> CoreInfra
|
||||||
a3@{animate: true, animation: slow}
|
a3@{animate: true, animation: slow}
|
||||||
|
|
||||||
Portal a4@--> Metadata
|
Portal a4@<--> Metadata
|
||||||
a4@{animate: true, animation: fast}
|
a4@{animate: true, animation: fast}
|
||||||
|
|
||||||
Metadata a5@--> Distribution
|
Metadata a5@<--> mirrors
|
||||||
a5@{animate: true, animation: fast}
|
a5@{animate: true, animation: fast}
|
||||||
|
|
||||||
%% Distribution Logic
|
CoreInfra a8@<-.->P2P
|
||||||
Distribution a7@-.->|Primary| mirrors
|
|
||||||
a7@{animate: true, animation: slow}
|
|
||||||
|
|
||||||
Distribution a8@-.->|Resiliency| Torrent
|
|
||||||
a8@{animate: true, animation: slow}
|
a8@{animate: true, animation: slow}
|
||||||
|
|
||||||
%% Mirror Connections
|
%% Mirror Connections
|
||||||
mirrors a12@--> Consumers
|
mirrors a12@<--> Consumers
|
||||||
a12@{animate: true, animation: slow}
|
a12@{animate: true, animation: slow}
|
||||||
|
|
||||||
mirrors a9@-.->|Pooled| Torrent
|
mirrors a9@<-.->|Pooled| P2P
|
||||||
a9@{animate: true, animation: fast}
|
a9@{animate: true, animation: fast}
|
||||||
|
|
||||||
%% Torrent Connections
|
%% P2P Connections
|
||||||
Torrent a10@--> Consumers
|
P2P a10@<--> Consumers
|
||||||
a10@{animate: true, animation: fast}
|
a10@{animate: true, animation: fast}
|
||||||
|
|
||||||
|
style Sources fill:#FFB74D,stroke:#EF6C00,color:#000000
|
||||||
|
style Artifacts fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
style Metadata fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
class Metadata Metadata
|
||||||
|
style Processes fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
class Processes Processes
|
||||||
|
style SourceCoop fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
style Zenodo fill:#FFB74D,stroke:#EF6C00,color:#000000
|
||||||
|
style Community fill:#D32F2F,stroke:#8E0000,color:#FFFFFF
|
||||||
|
style P2P fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
style InternetArchive fill:#66BB6A,stroke:#2E7D32,color:#000000
|
||||||
|
style Users fill:#FFB74D,stroke:#EF6C00,color:#000000
|
||||||
|
style Systems fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
|
||||||
%% Click Actions
|
%% Click Actions
|
||||||
click Sources "https://www.dataforcanada.org/#high-level-overview" _blank
|
click Sources "https://www.dataforcanada.org/#high-level-overview" _blank
|
||||||
click Processes "https://www.dataforcanada.org/docs/processes/" _blank
|
click Processes "https://www.dataforcanada.org/docs/processes/" _blank
|
||||||
@@ -70,9 +83,25 @@ flowchart TD
|
|||||||
click InternetArchive "https://archive.org/details/@diegoripley/uploads/" _blank
|
click InternetArchive "https://archive.org/details/@diegoripley/uploads/" _blank
|
||||||
|
|
||||||
%% APPLY STYLES TO LINKED NODES
|
%% APPLY STYLES TO LINKED NODES
|
||||||
class Sources,Processes,Artifacts,Metadata,SourceCoop,Zenodo,InternetArchive linkNode
|
class Sources linkNode
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Dissemination Process
|
||||||
|
|
||||||
|
Once data products reach a production-ready state, they enter a dissemination flow designed for permanence and performance:
|
||||||
|
|
||||||
|
* **Cloud-Native First:** Priority is given to performant, system-to-system file formats (e.g., Parquet) to enable high-throughput applications without the need for local parsing.
|
||||||
|
* **Persistent Identification:** Every dataset version is assigned a DOI for citation and immutability.
|
||||||
|
* **The FAIR Data Catalogue:** Global metadata is aggregated into a single, queryable **[FAIR Data Catalogue](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/)**. This catalogue acts as the "brain" of the system, tracking all versions and DOIs, and directing users to the optimal source within our multi-tier storage network:
|
||||||
|
* **[Source Cooperative](https://source.coop/dataforcanada)** serves as our **primary mirror** for all datasets, including large-scale products like orthoimagery (see [Funding and Governance](https://docs.source.coop/#funding-and-governance)).
|
||||||
|
* **[Zenodo](https://zenodo.org/communities/dataforcanada/)** serves as our repository for **long-term academic preservation** and provides a high-speed mirror for European users (see [Funding](https://about.zenodo.org/infrastructure/)).
|
||||||
|
* **[The Internet Archive](https://archive.org)** is utilized **strategically** for specific datasets to ensure historical redundancy (see [Funding](https://projects.propublica.org/nonprofits/organizations/943242767)).
|
||||||
|
* **[Data for Canada Infrastructure](https://www.dataforcanada.org/infrastructure/)** is utilized **strategically** for specific high-value datasets.
|
||||||
|
|
||||||
|
### Decentralized Distribution
|
||||||
|
|
||||||
|
We are piloting <abbr title="Peer-to-Peer">P2P</abbr> technology to maximize infrastructure resilience. By leveraging the [P2P HTTP consumption feature](https://www.bittorrent.org/beps/bep_0019.html), users will be able to download simultaneously from Source Cooperative, Zenodo, Data for Canada infrastructure, and community peers. This keeps data available even if one source goes offline, with no single point of failure. Current laboratory work is available in the [Decentralized Distribution Labs](https://github.com/dataforcanada/decentralized-distribution-labs).
|
||||||
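The P2P HTTP consumption feature linked above (BEP 19) lists HTTP sources under a `url-list` key in the torrent metainfo, so a client can fetch pieces from mirrors and peers at the same time. The following is a minimal sketch of that idea, not the project's actual tooling: the file name, the mirror hostnames, and the `build_metainfo` helper are hypothetical, and the small `bencode` function is written here only to keep the example self-contained.

```python
import hashlib


def bencode(value):
    """Encode ints, strings, bytes, lists, and dicts using bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be sorted in raw byte order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name, data, piece_length, web_seeds):
    """Build a single-file metainfo dict with HTTP web seeds (hypothetical helper)."""
    # The "pieces" field is the concatenation of the SHA-1 digest of every piece.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    return {
        "info": {
            "name": name,
            "length": len(data),
            "piece length": piece_length,
            "pieces": pieces,
        },
        # BEP 19: HTTP sources a client may download from alongside peers.
        "url-list": list(web_seeds),
    }


if __name__ == "__main__":
    # Placeholder payload and placeholder mirror URLs, for illustration only.
    payload = b"x" * 40000
    meta = build_metainfo(
        "demo.parquet",
        payload,
        16384,
        [
            "https://mirror-a.example.org/demo.parquet",
            "https://mirror-b.example.org/demo.parquet",
        ],
    )
    print(len(bencode(meta)), "bytes of metainfo")
```

Because every mirror serves the same bytes, the piece hashes let a client verify data regardless of which source delivered it.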
|
|
||||||
## 🏗️ Open Processing Architecture
|
## 🏗️ Open Processing Architecture
|
||||||
|
|
||||||
We believe that true open data requires open production. To ensure the longevity and resilience of Canada's data infrastructure, we treat our data pipelines as **open source software artifacts**. We provide the "blueprints" alongside the data, allowing any user to verify our work or rebuild the dataset from scratch on their own infrastructure.
|
We believe that true open data requires open production. To ensure the longevity and resilience of Canada's data infrastructure, we treat our data pipelines as **open source software artifacts**. We provide the "blueprints" alongside the data, allowing any user to verify our work or rebuild the dataset from scratch on their own infrastructure.
|
||||||
@@ -88,22 +117,6 @@ Our processing strategy relies on three immutable components to guarantee transp
|
|||||||
**Mirrored Source Artifacts:**
|
**Mirrored Source Artifacts:**
|
||||||
Crucially, we do not rely solely on external version control systems like GitHub, which may change or disappear. A complete snapshot of the processing code, environment definitions, and manifests is bundled with every data release. These source artifacts are replicated across **Source Cooperative, Zenodo, the Internet Archive, Data for Canada infrastructure, and the community**, ensuring that the *method* of creation is preserved with the same redundancy as the *result*.
|
Crucially, we do not rely solely on external version control systems like GitHub, which may change or disappear. A complete snapshot of the processing code, environment definitions, and manifests is bundled with every data release. These source artifacts are replicated across **Source Cooperative, Zenodo, the Internet Archive, Data for Canada infrastructure, and the community**, ensuring that the *method* of creation is preserved with the same redundancy as the *result*.
|
||||||
|
|
||||||
## Dissemination Process
|
|
||||||
|
|
||||||
Once data products reach a production-ready state, they enter a dissemination flow designed for permanence and performance:
|
|
||||||
|
|
||||||
* **Cloud-Native First:** Priority is given to performant, system-to-system file formats (e.g., Parquet) to enable high-throughput applications without the need for local parsing.
|
|
||||||
* **Persistent Identification:** Every dataset version is assigned a DOI for citation and immutability.
|
|
||||||
* **The FAIR Data Catalog:** Global metadata is aggregated into a single, queryable **[FAIR Data Catalog](https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/)**. This catalog acts as the "brain" of the system, tracking all versions and DOIs, and directing users to the optimal source within our multi-tier storage network:
|
|
||||||
- **[Source Cooperative](https://source.coop/dataforcanada)** serves as our **primary mirror** for all datasets, including large-scale products like orthoimagery.
|
|
||||||
- **[Zenodo](https://zenodo.org/communities/dataforcanada/)** serves as our repository for **long-term academic preservation** and provides a high-speed mirror for European users.
|
|
||||||
- **[The Internet Archive](https://archive.org)** is utilized **strategically** for specific datasets to ensure historical redundancy.
|
|
||||||
- **[Data for Canada Infrastructure](https://www.dataforcanada.org/infrastructure/)** is utilized **strategically** for specific high-value datasets.
|
|
||||||
|
|
||||||
### Decentralized Distribution
|
|
||||||
|
|
||||||
We are piloting <abbr title="Peer-to-Peer">P2P</abbr> technology to maximize infrastructure resilience. By leveraging the [P2P HTTP consumption feature](https://www.bittorrent.org/beps/bep_0019.html), users will be able to download simultaneously from Source Cooperative, Zenodo, Data for Canada infrastructure, and community peers. This ensures high availability without a single point of failure. Current laboratory work is available in the [Decentralized Distribution Labs](https://github.com/dataforcanada/decentralized-distribution-labs).
|
|
||||||
|
|
||||||
## Work in the Lab: Smart Nodes
|
## Work in the Lab: Smart Nodes
|
||||||
|
|
||||||
To further democratize access and ensure the persistence of Canada’s open data, we are experimenting with features defined in earlier work by other organizations.
|
To further democratize access and ensure the persistence of Canada’s open data, we are experimenting with features defined in earlier work by other organizations.
|
||||||
@@ -119,154 +132,26 @@ We are currently refining the concepts from [smart-node-transmission](https://gi
|
|||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
graph TD
|
graph TD
|
||||||
%% Node Definitions with custom labels
|
Catalogue[("FAIR Data Catalogue")]
|
||||||
Catalog[("FAIR Data Catalog")]
|
SmartNode["Volunteer Smart Node<br/>(Limited Storage)"]
|
||||||
SmartNode["Volunteer Smart Node<br/>(Limited Storage Optimization)"]
|
P2PNetwork(["P2P Community Peers<br/>(Massive Data Pool)"])
|
||||||
BTNetwork(["P2P Network<br/>(Massive Data Pool)"])
|
|
||||||
|
|
||||||
%% The Process Flow
|
Catalogue -->|"1. Syncs metadata"| SmartNode
|
||||||
Catalog -->|"1. Syncs metadata"| SmartNode
|
|
||||||
|
|
||||||
note["Note: The Node does NOT<br/>download the whole file."]
|
note["Note: The Node does NOT<br/>download the whole file."]
|
||||||
SmartNode -.- note
|
SmartNode -.- note
|
||||||
|
|
||||||
SmartNode -->|"2. Sends granular REQUESTs for specific pieces only<br/>(e.g., 'Send Piece #804 of Dataset B')"| BTNetwork
|
SmartNode -->|"2. Sends granular REQUESTs for specific pieces only<br/>(e.g., 'Send Piece #804 of Dataset B')"| P2PNetwork
|
||||||
|
|
||||||
BTNetwork -.->|"3. Transfers ONLY the requested blocks"| SmartNode
|
P2PNetwork -.->|"3. Transfers ONLY the requested blocks"| SmartNode
|
||||||
|
|
||||||
%% Optional Styling for visual clarity
|
|
||||||
classDef central fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
|
|
||||||
classDef node fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5;
|
classDef node fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5;
|
||||||
classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
|
classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
|
||||||
|
|
||||||
class Catalog central;
|
style Catalogue fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
class SmartNode node;
|
style P2PNetwork fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
class BTNetwork network;
|
style SmartNode fill:#B71C1C,stroke:#7F0000,color:#FFFFFF
|
||||||
|
|
||||||
|
click Catalogue "https://stac-utils.github.io/stac-geoparquet/latest/spec/stac-geoparquet-spec/";
|
||||||
|
click SmartNode "https://www.dataforcanada.org/infrastructure/";
|
||||||
```
|
```
|
||||||
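Step 2 of the diagram above depends on the node knowing which pieces cover the bytes it needs. A minimal sketch of that arithmetic follows; the function name `pieces_for_range` and the 256 KiB piece length are assumptions for illustration, not values taken from the project.

```python
def pieces_for_range(offset, length, piece_length):
    """Return the piece indices that cover bytes [offset, offset + length)."""
    if offset < 0 or piece_length <= 0:
        raise ValueError("offset must be >= 0 and piece_length must be > 0")
    if length <= 0:
        return range(0)
    first = offset // piece_length
    # The last byte requested sits at offset + length - 1.
    last = (offset + length - 1) // piece_length
    return range(first, last + 1)


if __name__ == "__main__":
    PIECE_LENGTH = 262144  # assumed 256 KiB pieces
    # A 1000-byte read that starts 100 bytes into piece 804.
    wanted = pieces_for_range(804 * PIECE_LENGTH + 100, 1000, PIECE_LENGTH)
    print(list(wanted))
```

A node with limited storage can request only these indices and discard them once the read is served.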
|
|
||||||
## Infrastructure High-Level Overview
|
|
||||||
|
|
||||||
```mermaid
|
|
||||||
flowchart TD
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
%% STYLING
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
style Canada_Region fill:#ffe6e6,stroke:#ff0000,stroke-width:2px
|
|
||||||
style USA_Region fill:#e6f2ff,stroke:#0066cc,stroke-width:2px
|
|
||||||
style Europe_Region fill:#e6ffe6,stroke:#009900,stroke-width:2px
|
|
||||||
|
|
||||||
%% Highlight Primary Storage
|
|
||||||
style R2 fill:#fffde7,stroke:#fbc02d,stroke-width:4px
|
|
||||||
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
%% REGION: CANADA
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
subgraph Canada_Region ["🇨🇦 Physical Location: Canada"]
|
|
||||||
direction TB
|
|
||||||
NodeTO["Smart Node 01
|
|
||||||
Location: Toronto, CA
|
|
||||||
Specs: 50Gbps / 50Gbps, 950GB Flash Storage
|
|
||||||
Jurisdiction: Canada"]
|
|
||||||
|
|
||||||
IA_Van["Internet Archive Mirror
|
|
||||||
Location: Vancouver, CA
|
|
||||||
Protocol: HTTP
|
|
||||||
Jurisdiction: USA"]
|
|
||||||
end
|
|
||||||
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
%% REGION: USA
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
subgraph USA_Region ["🇺🇸 Physical Location: USA"]
|
|
||||||
direction TB
|
|
||||||
SourceCoop["Source Cooperative
|
|
||||||
Location: Oregon, USA
|
|
||||||
Protocol: HTTP
|
|
||||||
Jurisdiction: USA"]
|
|
||||||
|
|
||||||
R2["☁️ Cloudflare R2
|
|
||||||
(Primary Object Storage)
|
|
||||||
Location: Eastern North America
|
|
||||||
Protocol: HTTP
|
|
||||||
Jurisdiction: USA"]
|
|
||||||
|
|
||||||
IA_SF["The Internet Archive
|
|
||||||
Location: San Francisco, USA
|
|
||||||
Protocol: HTTP
|
|
||||||
Jurisdiction: USA"]
|
|
||||||
|
|
||||||
Netcup["VPS 01
|
|
||||||
Location: Manassas, Virginia, USA
|
|
||||||
Specs: 2.5Gbps / 2.5Gbps, 512GB Flash Storage
|
|
||||||
Protocol: HTTP & P2P
|
|
||||||
Jurisdiction: Germany"]
|
|
||||||
end
|
|
||||||
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
%% REGION: EUROPE
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
subgraph Europe_Region ["🇪🇺 Physical Location: Europe"]
|
|
||||||
direction TB
|
|
||||||
subgraph Netherlands ["🇳🇱 Netherlands"]
|
|
||||||
NodeAMS["Smart Node 02
|
|
||||||
Location: Amsterdam, NL
|
|
||||||
Specs: 50Gbps / 50Gbps, 950GB Flash Storage
|
|
||||||
Jurisdiction: Netherlands"]
|
|
||||||
end
|
|
||||||
|
|
||||||
subgraph Switzerland ["🇨🇭 Switzerland"]
|
|
||||||
Zenodo["Zenodo
|
|
||||||
Location: Geneva, CH
|
|
||||||
(Replicated in Budapest, HU)
|
|
||||||
Protocol: HTTP
|
|
||||||
Jurisdiction: Switzerland"]
|
|
||||||
end
|
|
||||||
end
|
|
||||||
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
%% CONNECTIONS
|
|
||||||
%% ---------------------------------------------------------
|
|
||||||
|
|
||||||
NodeTO <==>|P2P| NodeAMS
|
|
||||||
IA_SF -.->|Internal Replication| IA_Van
|
|
||||||
|
|
||||||
NodeTO -.->|HTTP Pull| SourceCoop
|
|
||||||
NodeTO ==> R2
|
|
||||||
NodeTO -.->|HTTP Pull| Zenodo
|
|
||||||
NodeTO -.->|HTTP Pull| IA_SF
|
|
||||||
NodeTO -.->|HTTP or P2P| Netcup
|
|
||||||
|
|
||||||
```
|
|
||||||
|
|
||||||
## Infrastructure & Operating Costs
|
|
||||||
|
|
||||||
| Service | Description | CAD | USD | EUR |
|
|
||||||
| :--- | :--- | :--- | :--- | :--- |
|
|
||||||
| **CDN** | [CDN - Cloudflare Details](https://www.cloudflare.com/plans/) - WAF, CDN (Amortized Annual) | $30.90 | $22.60 | €19.13 |
|
|
||||||
| **CDN Services** | [Object Storage - Cloudflare Details](https://www.cloudflare.com/products/r2/) & [Serverless - Cloudflare Details](https://www.cloudflare.com/en-ca/plans/developer-platform/) (Variable) | $32.71 | $23.93 | €20.26 |
|
|
||||||
| **Smart Node 01** | [Decentralized Distribution - SlashN Services Details](https://ultra.cc/#plan-pricing) - Dedicated <abbr title="Peer-to-Peer">P2P</abbr> client | $28.98 | $21.21 | €17.95 |
|
|
||||||
| **Smart Node 02** | [Decentralized Distribution - SlashN Services Details](https://ultra.cc/#plan-pricing) - Dedicated <abbr title="Peer-to-Peer">P2P</abbr> client | $28.98 | $21.21 | €17.95 |
|
|
||||||
| **VPS 01** | [Geospatial Services - Netcup Details](https://www.netcup.com/en/server/root-server) - ARM64 | $14.64 | $10.72 | €9.07 |
|
|
||||||
| **TOTAL** | **Monthly Run Rate** | **$136.21** | **$99.67** | **€84.36** |
|
|
||||||
|
|
||||||
**Note:** Currency conversions are based on rates from February 16, 2026.
|
|
||||||
|
|
||||||
## Roadmap: Resilience & Transparency
|
|
||||||
|
|
||||||
To support our mission of providing high-performance, analysis-ready data, we are currently developing a suite of public tools to make this distributed ecosystem more **FAIR** (Findable, Accessible, Interoperable, Reusable), **resilient**, and **transparent**.
|
|
||||||
|
|
||||||
These planned features are designed to help researchers and automated systems coordinate data access across the various platforms and mirrors we utilize.
|
|
||||||
|
|
||||||
### 1. Real-Time Service Status
|
|
||||||
|
|
||||||
We are building a comprehensive status dashboard that monitors the availability of the diverse storage locations we rely on, from our own Smart Nodes to external providers like the Internet Archive, Source Cooperative, and Zenodo. Users will be able to verify if a specific mirror is operational before initiating workflows.
|
|
||||||
|
|
||||||
### 2. Traffic & Load Optimization Statistics
|
|
||||||
|
|
||||||
To foster better cooperation between our systems and downstream users, we plan to expose traffic and connectivity statistics where possible.
|
|
||||||
|
|
||||||
This transparency allows automated systems to be "smart" about data retrieval. For example, a system could query these statistics to schedule bandwidth-intensive HTTP downloads during non-peak hours, or adjust behavior based on current connectivity loads during high-traffic periods of the workday. This improves performance for individual users while respecting the bandwidth constraints of the various host providers.
|
|
||||||
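As an illustration of how an automated system might use such statistics together with the service status described earlier, the sketch below picks the least-loaded mirror that is online and defers when every mirror is busy. The `MirrorStats` fields, the `pick_mirror` function, and the 0.8 load threshold are all hypothetical; the roadmap does not define a statistics format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MirrorStats:
    """Hypothetical per-mirror snapshot; field names are assumptions."""
    name: str
    online: bool
    load: float  # 0.0 means idle, 1.0 means saturated


def pick_mirror(stats: Iterable[MirrorStats], max_load: float = 0.8) -> Optional[MirrorStats]:
    """Return the least-loaded online mirror, or None if all are down or too busy."""
    candidates = [m for m in stats if m.online and m.load <= max_load]
    if not candidates:
        # Nothing suitable right now: the caller can retry later, off-peak.
        return None
    return min(candidates, key=lambda m: m.load)


if __name__ == "__main__":
    snapshot = [
        MirrorStats("Source Cooperative", online=True, load=0.9),
        MirrorStats("Zenodo", online=True, load=0.3),
        MirrorStats("Internet Archive", online=False, load=0.0),
    ]
    choice = pick_mirror(snapshot)
    print(choice.name if choice else "defer download")
```

Returning `None` instead of raising keeps the deferral decision with the caller, which can reschedule the download for a quieter period.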
|
|
||||||
### 3. Community Issue Reporting
|
|
||||||
|
|
||||||
We are introducing a streamlined method for users to report access issues across any of the services we aggregate. By allowing the community to flag connectivity drops or data integrity issues quickly, we can identify bottlenecks or outages at specific providers and route users to alternative sources more effectively.
|
|
||||||
|
|||||||