d4c-service-statcan-geography/notebooks/generate_sql.ipynb
Diego Ripley c73a343599 Initial commit
2025-06-02 18:13:00 -04:00


{
"cells": [
{
"cell_type": "markdown",
"id": "05ac8556",
"metadata": {},
"source": [
"# TODO\n",
"- Fix encoding issues with the place names table (see below for troublesome records)\n",
"- Add remaining geographic hierarchy (Health Regions, CT, DA, DB, ADA, HCCSS)\n",
"- Read geographic hierarchy from Parquet files and do the SQL work using DuckDB\n",
"- Add a field so the user can search by province (if possible). It won't be possible to add the field to the country and region tables\n",
"- Add a field so the user can search by census year\n",
"- Standardize search values. Look into porting CASK to JavaScript, as the user input will need to be standardized as well"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "68f3cacd",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sqlite3\n",
"\n",
"from dotenv import load_dotenv\n",
"import duckdb\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "88719ee9",
"metadata": {},
"source": [
"# Create the geographies table"
]
},
{
"cell_type": "markdown",
"id": "400083f5",
"metadata": {},
"source": [
"## Create tables in SQLite\n",
"These are the instructions for exporting the database tables and importing them into Cloudflare D1. The steps are currently manual and should be automated.\n",
"\n",
"1. Export the geographies table using `sqlite3`\n",
"```\n",
"sqlite3 geography.db\n",
".output ./geographies.sql\n",
".dump geographies\n",
"```\n",
"2. Remove the `PRAGMA foreign_keys=off`, `BEGIN TRANSACTION` and `COMMIT` parts\n",
"3. Remove the `CREATE TABLE geographies` statement\n",
"4. Add the following to the top, before the insert statements\n",
"```\n",
"DROP TABLE IF EXISTS geographies;\n",
"CREATE TABLE IF NOT EXISTS geographies (\n",
"    id INTEGER PRIMARY KEY,\n",
"    dguid TEXT,\n",
"    search_name TEXT,\n",
"    geographic_level INTEGER\n",
");\n",
"\n",
"DROP TABLE IF EXISTS geographies_fts;\n",
"CREATE VIRTUAL TABLE IF NOT EXISTS geographies_fts USING fts5(\n",
"    id UNINDEXED,\n",
"    search_name,\n",
"    content='geographies',\n",
"    content_rowid='id',\n",
"    tokenize = \"unicode61 tokenchars '-/.,''&():+'\"\n",
");\n",
"\n",
"```\n",
"5. Add `INSERT INTO geographies_fts(geographies_fts) VALUES ('rebuild');` at the end of the SQL file\n",
"6. Add `PRAGMA optimize;` as the last statement of the SQL file, as recommended in the Cloudflare D1 documentation: https://developers.cloudflare.com/d1/best-practices/use-indexes/\n",
"7. Log in to Cloudflare by running `npx wrangler login`\n",
"8. Import as follows\n",
"```\n",
"npx wrangler d1 execute geographies_search --remote --file=./geographies.sql\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "17f8ffd8",
"metadata": {},
"outputs": [],
"source": [
"con = sqlite3.connect(\"geography.db\")\n",
"cur = con.cursor()\n",
"\n",
"cur.executescript(\"\"\"\n",
"DROP TABLE IF EXISTS geographies;\n",
"CREATE TABLE IF NOT EXISTS geographies (\n",
"    id INTEGER PRIMARY KEY,\n",
"    dguid TEXT,\n",
"    search_name TEXT,\n",
"    geographic_level INTEGER\n",
");\n",
"\"\"\")\n",
"\n",
"# Allow searches to use -/.,'&():+\n",
"cur.executescript(\"\"\"\n",
"DROP TABLE IF EXISTS geographies_fts;\n",
"CREATE VIRTUAL TABLE IF NOT EXISTS geographies_fts USING fts5(\n",
"    id UNINDEXED,\n",
"    search_name,\n",
"    content='geographies',\n",
"    content_rowid='id',\n",
"    tokenize = \"unicode61 tokenchars '-/.,''&():+'\"\n",
");\n",
"\"\"\")\n",
"\n",
"con.commit()"
]
},
{
"cell_type": "markdown",
"id": "f8010194",
"metadata": {},
"source": [
"## SQL to create search table\n",
"For tables that have both an English and a French name field, the query creates one record per language (identical names collapse into a single record because the query uses `UNION`). A field could be added to the search table to indicate whether a name is English, French, or both.\n",
"\n",
"Statistics Canada searches the English field when the page is in English and the French field when the page is in French. Here are two examples:\n",
"- **English:** https://www150.statcan.gc.ca/n1/en/geo?geotext=Quebec%20%5BProvince%5D&geocode=A000224\n",
"- **French:** https://www150.statcan.gc.ca/n1/fr/geo?geotext=Qu%C3%A9bec%20%5BProvince%5D&geocode=A000224"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c04b4979",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7a9c8c29df524416a8280e3f80b2a6cb",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"duck_con = duckdb.connect()\n",
"duck_con.install_extension(\"spatial\")\n",
"duck_con.load_extension(\"spatial\")\n",
"\n",
"duck_con.sql(\"\"\"\n",
"DROP TABLE IF EXISTS geography;\n",
"CREATE TABLE geography AS\n",
"WITH country AS (\n",
"\tSELECT country_dguid AS dguid, country_en_name AS search_name, 13 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/country_2021.parquet'\n",
"), regions AS (\n",
"\tSELECT DISTINCT grc_dguid AS dguid, grc_en_name AS search_name, 12 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/grc_2021.parquet'\n",
"\tUNION\n",
"\tSELECT DISTINCT grc_dguid AS dguid, grc_fr_name AS search_name, 12 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/grc_2021.parquet'\n",
"), pr AS (\n",
"\tSELECT DISTINCT pr_dguid AS dguid, pr_en_name AS search_name, 11 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/pr_2021.parquet'\n",
"\tUNION\n",
"\tSELECT DISTINCT pr_dguid AS dguid, pr_fr_name AS search_name, 11 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/pr_2021.parquet'\n",
"), er AS (\n",
"\tSELECT DISTINCT er_dguid AS dguid, er_name AS search_name, 10 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/er_2021.parquet'\n",
"), car AS (\n",
"\tSELECT DISTINCT car_dguid AS dguid, car_en_name AS search_name, 9 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/car_2021.parquet'\n",
"\tUNION\n",
"\tSELECT DISTINCT car_dguid AS dguid, car_fr_name AS search_name, 9 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/car_2021.parquet'\n",
"), cd AS (\n",
"\tSELECT cd_dguid AS dguid, cd_name AS search_name, 8 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/cd_2021.parquet'\n",
"), ccs AS (\n",
"\tSELECT ccs_dguid AS dguid, ccs_name AS search_name, 7 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/ccs_2021.parquet'\n",
"), cma AS (\n",
"\tSELECT \n",
"\tCASE \n",
"\t\tWHEN cma_p_dguid IS NOT NULL THEN cma_p_dguid\n",
"\t\tELSE cma_dguid \n",
"\tEND AS dguid, cma_name AS search_name, 6 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/cma_2021.parquet'\n",
"), csd AS (\n",
"\tSELECT csd_dguid AS dguid, csd_name AS search_name, 5 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/csd_2021.parquet'\n",
"), fed AS (\n",
"\tSELECT DISTINCT fed_dguid AS dguid, fed_en_name AS search_name, 4 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/fed_2021_2013.parquet'\n",
"\tUNION\n",
"\tSELECT DISTINCT fed_dguid AS dguid, fed_fr_name AS search_name, 4 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/fed_2021_2013.parquet'\n",
"), dpl AS (\n",
"\tSELECT dpl_dguid AS dguid, dpl_name AS search_name, 3 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/dpl_2021.parquet'\n",
"), pc AS (\n",
"\tSELECT \n",
"\tCASE \n",
"\t\tWHEN pop_ctr_p_dguid IS NOT NULL THEN pop_ctr_p_dguid\n",
"\t\tELSE pop_ctr_dguid\n",
"\tEND AS dguid, pop_ctr_name AS search_name, 2 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/boundaries/2021/digital_boundary_files/pop_ctr_2021.parquet'\n",
"), pn AS (\n",
"\tSELECT pn_dguid AS dguid, pn_name AS search_name, 1 AS geographic_level, ST_AsGeoJSON(geom) AS geom FROM 'https://data.dataforcanada.org/processed/statistics_canada/placenames/2021/pn_2021.parquet'\n",
"), concatenation AS (\n",
"\tSELECT * FROM country\n",
"\tUNION\n",
"\tSELECT * FROM regions\n",
"\tUNION\n",
"\tSELECT * FROM pr\n",
"\tUNION\n",
"\tSELECT * FROM er\n",
"\tUNION\n",
"\tSELECT * FROM car\n",
"\tUNION\n",
"\tSELECT * FROM cd\n",
"\tUNION\n",
"\tSELECT * FROM ccs\n",
"\tUNION\n",
"\tSELECT * FROM cma\n",
"\tUNION\n",
"\tSELECT * FROM csd\n",
"\tUNION\n",
"\tSELECT * FROM fed\n",
"\tUNION\n",
"\tSELECT * FROM dpl\n",
"\tUNION\n",
"\tSELECT * FROM pc\n",
"\tUNION\n",
"\tSELECT * FROM pn\n",
")\n",
"SELECT * FROM concatenation\n",
"ORDER BY search_name, geographic_level DESC;\n",
"\"\"\")\n",
"duck_con.commit()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "4c873914",
"metadata": {},
"outputs": [],
"source": [
"geography = duck_con.sql(\"SELECT * FROM geography;\").df()"
]
},
{
"cell_type": "markdown",
"id": "d1180ec8",
"metadata": {},
"source": [
"# TODO\n",
"## Fix encoding issues with place names"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "0d807307",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dguid</th>\n",
" <th>search_name</th>\n",
" <th>geographic_level</th>\n",
" <th>geom</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5653</th>\n",
" <td>2021S0515005422</td>\n",
" <td>Cascapédia–Saint-Jules</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-65.9166667,48....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8243</th>\n",
" <td>2021S0515007864</td>\n",
" <td>Côte-des-Neiges–Notre-Dame-de-Grâce</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.6263889,45....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18297</th>\n",
" <td>2021S0515017557</td>\n",
" <td>L'Île-Bizard–Sainte-Geneviève</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.866667,45.4...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20327</th>\n",
" <td>2021S0515019487</td>\n",
" <td>Le Coteau-des-Sœurs</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-70.456886,47.0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20569</th>\n",
" <td>2021S0515019731</td>\n",
" <td>Le Sacré-Cœur</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-69.979863,46.9...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23733</th>\n",
" <td>2021S0515022795</td>\n",
" <td>Mercier–Hochelaga-Maisonneuve</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.5388889,45....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25319</th>\n",
" <td>2021S0515024311</td>\n",
" <td>Métabetchouan–Lac-à-la-Croix</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-71.8666667,48....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29619</th>\n",
" <td>2021S0515028429</td>\n",
" <td>Port-Daniel–Gascons</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-64.9666667,48....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31289</th>\n",
" <td>2021S0515030028</td>\n",
" <td>Rivière-des-Prairies–Pointe-aux-Trembles</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.516667,45.65]}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31432</th>\n",
" <td>2021S0515030168</td>\n",
" <td>Rock Forest–Saint-Élie–Deauville</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-72.0416667,45....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31702</th>\n",
" <td>2021S0515030432</td>\n",
" <td>Rosemont–La Petite-Patrie</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.5902778,45....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32617</th>\n",
" <td>2021S0515031197</td>\n",
" <td>Saint-Côme–Linière</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-70.5166667,46....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32737</th>\n",
" <td>2021S0515031295</td>\n",
" <td>Saint-Faustin–Lac-Carré</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-74.4833333,46....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33177</th>\n",
" <td>2021S0515031660</td>\n",
" <td>Saint-Lin–Laurentides</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.755663,45.8...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34031</th>\n",
" <td>2021S0515032370</td>\n",
" <td>Sainte-Foy–Sillery–Cap-Rouge</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-71.308333,46.7...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40241</th>\n",
" <td>2021S0515038300</td>\n",
" <td>Vieux-Québec–Basse-Ville</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-71.2069444,46....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40329</th>\n",
" <td>2021S0515038389</td>\n",
" <td>Villeray–Saint-Michel–Parc-Extension</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-73.6222222,45....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42483</th>\n",
" <td>2021S0515040448</td>\n",
" <td>Yuneŝit'in</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-123.1363889,51...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42561</th>\n",
" <td>2021S0515040522</td>\n",
" <td>ʔEsdilagh</td>\n",
" <td>1</td>\n",
" <td>{\"type\":\"Point\",\"coordinates\":[-122.4972222,52...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dguid search_name \\\n",
"5653 2021S0515005422 Cascapédia–Saint-Jules \n",
"8243 2021S0515007864 Côte-des-Neiges–Notre-Dame-de-Grâce \n",
"18297 2021S0515017557 L'Île-Bizard–Sainte-Geneviève \n",
"20327 2021S0515019487 Le Coteau-des-Sœurs \n",
"20569 2021S0515019731 Le Sacré-Cœur \n",
"23733 2021S0515022795 Mercier–Hochelaga-Maisonneuve \n",
"25319 2021S0515024311 Métabetchouan–Lac-à-la-Croix \n",
"29619 2021S0515028429 Port-Daniel–Gascons \n",
"31289 2021S0515030028 Rivière-des-Prairies–Pointe-aux-Trembles \n",
"31432 2021S0515030168 Rock Forest–Saint-Élie–Deauville \n",
"31702 2021S0515030432 Rosemont–La Petite-Patrie \n",
"32617 2021S0515031197 Saint-Côme–Linière \n",
"32737 2021S0515031295 Saint-Faustin–Lac-Carré \n",
"33177 2021S0515031660 Saint-Lin–Laurentides \n",
"34031 2021S0515032370 Sainte-Foy–Sillery–Cap-Rouge \n",
"40241 2021S0515038300 Vieux-Québec–Basse-Ville \n",
"40329 2021S0515038389 Villeray–Saint-Michel–Parc-Extension \n",
"42483 2021S0515040448 Yuneŝit'in \n",
"42561 2021S0515040522 ʔEsdilagh \n",
"\n",
" geographic_level geom \n",
"5653 1 {\"type\":\"Point\",\"coordinates\":[-65.9166667,48.... \n",
"8243 1 {\"type\":\"Point\",\"coordinates\":[-73.6263889,45.... \n",
"18297 1 {\"type\":\"Point\",\"coordinates\":[-73.866667,45.4... \n",
"20327 1 {\"type\":\"Point\",\"coordinates\":[-70.456886,47.0... \n",
"20569 1 {\"type\":\"Point\",\"coordinates\":[-69.979863,46.9... \n",
"23733 1 {\"type\":\"Point\",\"coordinates\":[-73.5388889,45.... \n",
"25319 1 {\"type\":\"Point\",\"coordinates\":[-71.8666667,48.... \n",
"29619 1 {\"type\":\"Point\",\"coordinates\":[-64.9666667,48.... \n",
"31289 1 {\"type\":\"Point\",\"coordinates\":[-73.516667,45.65]} \n",
"31432 1 {\"type\":\"Point\",\"coordinates\":[-72.0416667,45.... \n",
"31702 1 {\"type\":\"Point\",\"coordinates\":[-73.5902778,45.... \n",
"32617 1 {\"type\":\"Point\",\"coordinates\":[-70.5166667,46.... \n",
"32737 1 {\"type\":\"Point\",\"coordinates\":[-74.4833333,46.... \n",
"33177 1 {\"type\":\"Point\",\"coordinates\":[-73.755663,45.8... \n",
"34031 1 {\"type\":\"Point\",\"coordinates\":[-71.308333,46.7... \n",
"40241 1 {\"type\":\"Point\",\"coordinates\":[-71.2069444,46.... \n",
"40329 1 {\"type\":\"Point\",\"coordinates\":[-73.6222222,45.... \n",
"42483 1 {\"type\":\"Point\",\"coordinates\":[-123.1363889,51... \n",
"42561 1 {\"type\":\"Point\",\"coordinates\":[-122.4972222,52... "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dguids_to_fix = ['2021S0515005422',\n",
"                 '2021S0515007864',\n",
"                 '2021S0515017557',\n",
"                 '2021S0515019487',\n",
"                 '2021S0515019731',\n",
"                 '2021S0515022795',\n",
"                 '2021S0515024311',\n",
"                 '2021S0515028429',\n",
"                 '2021S0515030028',\n",
"                 '2021S0515030168',\n",
"                 '2021S0515030432',\n",
"                 '2021S0515031197',\n",
"                 '2021S0515031295',\n",
"                 '2021S0515031660',\n",
"                 '2021S0515032370',\n",
"                 '2021S0515038300',\n",
"                 '2021S0515038389',\n",
"                 '2021S0515040448',\n",
"                 '2021S0515040522']\n",
"place_names_to_fix = geography[geography['dguid'].isin(dguids_to_fix)]\n",
"place_names_to_fix.head(19)"
]
},
{
"cell_type": "markdown",
"id": "5bec091c",
"metadata": {},
"source": [
"## Generate GeoJSON file for every dguid\n",
"Copy the files into Cloudflare R2 by running:\n",
"```\n",
"cd geojson\n",
"rclone copy . --transfers 50 --progress cloudflare:/geographies-search\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "12ddba86",
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(\"geojson\"):\n",
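"    print(\"Creating DGUID geojson folder\")\n",
"    os.mkdir(\"geojson\")\n",
"\n",
"# to_records() puts the DataFrame index at position 0, so dguid is at position 1 and geom is last\n",
"for record in geography.to_records():\n",
"    dguid = record[1]\n",
"    geom = record[-1]\n",
"    path = f\"geojson/{dguid}.geojson\"\n",
"    # Skip files already written (including the second-language record of the same dguid)\n",
"    if os.path.exists(path):\n",
"        continue\n",
"    with open(path, 'w', encoding='utf-8') as geography_fp:\n",
"        geography_fp.write(geom)"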
]
},
{
"cell_type": "markdown",
"id": "39c1ff9f",
"metadata": {},
"source": [
"## Insert data into SQLite database"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "6e39bbc4",
"metadata": {},
"outputs": [],
"source": [
"# Subset of fields to import into the SQLite database, plus an id field taken from the DataFrame index.\n",
"# .copy() makes the subset an independent DataFrame, so adding a column never touches `geography`.\n",
"geography_subset = geography[['dguid', 'search_name', 'geographic_level']].copy()\n",
"geography_subset.insert(0, 'id', geography_subset.index)\n",
"\n",
"cur.executemany(\"INSERT INTO geographies VALUES(?, ?, ?, ?)\", geography_subset.values.tolist())\n",
"cur.execute(\"INSERT INTO geographies_fts(geographies_fts) VALUES ('rebuild')\")\n",
"con.commit()"
]
},
{
"cell_type": "markdown",
"id": "0675ca6d",
"metadata": {},
"source": [
"### Test out a search query"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "c49c2f06",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dguid</th>\n",
" <th>search_name</th>\n",
" <th>geographic_level</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2021S05003510</td>\n",
" <td>Ottawa</td>\n",
" <td>10</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2021A00033506</td>\n",
" <td>Ottawa</td>\n",
" <td>8</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2021S05023506008</td>\n",
" <td>Ottawa</td>\n",
" <td>7</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2021A00053506008</td>\n",
" <td>Ottawa</td>\n",
" <td>5</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2013A000435078</td>\n",
" <td>Ottawa--Vanier</td>\n",
" <td>4</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2013A000435075</td>\n",
" <td>Ottawa-Centre</td>\n",
" <td>4</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2013A000435079</td>\n",
" <td>Ottawa-Ouest--Nepean</td>\n",
" <td>4</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2013A000435077</td>\n",
" <td>Ottawa-Sud</td>\n",
" <td>4</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2021S0515026282</td>\n",
" <td>Ottawa</td>\n",
" <td>1</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2021S0515026283</td>\n",
" <td>Ottawa</td>\n",
" <td>1</td>\n",
" <td>-9.011603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2013A000435075</td>\n",
" <td>Ottawa Centre</td>\n",
" <td>4</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2013A000435077</td>\n",
" <td>Ottawa South</td>\n",
" <td>4</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2013A000435079</td>\n",
" <td>Ottawa West--Nepean</td>\n",
" <td>4</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2021S0515026271</td>\n",
" <td>Ottawa Brook</td>\n",
" <td>1</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2021S0515026273</td>\n",
" <td>Ottawa East</td>\n",
" <td>1</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>2021S0515026275</td>\n",
" <td>Ottawa South</td>\n",
" <td>1</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>2021S0515026277</td>\n",
" <td>Ottawa West</td>\n",
" <td>1</td>\n",
" <td>-6.940322</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>2021S0511240616</td>\n",
" <td>Ottawa - Gatineau</td>\n",
" <td>2</td>\n",
" <td>-5.643245</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>2021S0511350616</td>\n",
" <td>Ottawa - Gatineau</td>\n",
" <td>2</td>\n",
" <td>-5.643245</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2021S050535505</td>\n",
" <td>Ottawa - Gatineau (Ontario part / partie de l'...</td>\n",
" <td>6</td>\n",
" <td>-2.660226</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>2021S050524505</td>\n",
" <td>Ottawa - Gatineau (partie du Québec / Quebec p...</td>\n",
" <td>6</td>\n",
" <td>-2.660226</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dguid search_name \\\n",
"0 2021S05003510 Ottawa \n",
"1 2021A00033506 Ottawa \n",
"2 2021S05023506008 Ottawa \n",
"3 2021A00053506008 Ottawa \n",
"4 2013A000435078 Ottawa--Vanier \n",
"5 2013A000435075 Ottawa-Centre \n",
"6 2013A000435079 Ottawa-Ouest--Nepean \n",
"7 2013A000435077 Ottawa-Sud \n",
"8 2021S0515026282 Ottawa \n",
"9 2021S0515026283 Ottawa \n",
"10 2013A000435075 Ottawa Centre \n",
"11 2013A000435077 Ottawa South \n",
"12 2013A000435079 Ottawa West--Nepean \n",
"13 2021S0515026271 Ottawa Brook \n",
"14 2021S0515026273 Ottawa East \n",
"15 2021S0515026275 Ottawa South \n",
"16 2021S0515026277 Ottawa West \n",
"17 2021S0511240616 Ottawa - Gatineau \n",
"18 2021S0511350616 Ottawa - Gatineau \n",
"19 2021S050535505 Ottawa - Gatineau (Ontario part / partie de l'... \n",
"20 2021S050524505 Ottawa - Gatineau (partie du Québec / Quebec p... \n",
"\n",
" geographic_level rank \n",
"0 10 -9.011603 \n",
"1 8 -9.011603 \n",
"2 7 -9.011603 \n",
"3 5 -9.011603 \n",
"4 4 -9.011603 \n",
"5 4 -9.011603 \n",
"6 4 -9.011603 \n",
"7 4 -9.011603 \n",
"8 1 -9.011603 \n",
"9 1 -9.011603 \n",
"10 4 -6.940322 \n",
"11 4 -6.940322 \n",
"12 4 -6.940322 \n",
"13 1 -6.940322 \n",
"14 1 -6.940322 \n",
"15 1 -6.940322 \n",
"16 1 -6.940322 \n",
"17 2 -5.643245 \n",
"18 2 -5.643245 \n",
"19 6 -2.660226 \n",
"20 6 -2.660226 "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_sql_query(\"\"\"\n",
"SELECT geographies.dguid, fts.search_name, geographies.geographic_level, rank\n",
"FROM geographies_fts AS fts,\n",
"     geographies\n",
"WHERE fts.search_name MATCH '\"Ottawa\"*'\n",
"AND fts.id = geographies.id\n",
"ORDER BY fts.rank, geographies.geographic_level DESC\n",
"\"\"\", con)\n",
"df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}