From dfe7e70527a2a5f66597b1799cbc7d3351e01ece Mon Sep 17 00:00:00 2001 From: Till Hoffmann Date: Fri, 21 Feb 2025 12:34:22 -0500 Subject: [PATCH 1/4] Change README to markdown. --- README.md | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++ README.rst | 123 ----------------------------------------------------- 2 files changed, 117 insertions(+), 123 deletions(-) create mode 100644 README.md delete mode 100644 README.rst diff --git a/README.md b/README.md new file mode 100644 index 0000000..830016b --- /dev/null +++ b/README.md @@ -0,0 +1,117 @@ +# Wastewater Catchment Areas in Great Britain + +This repository provides code to consolidate wastewater catchment areas in Great Britain and evaluate their spatial overlap with statistical reporting units, such as Lower Layer Super Output Areas (LSOAs). Please see the [accompanying publication](https://doi.org/10.1002/essoar.10510612.2) for a detailed description of the analysis. If you have questions about the analysis, code, or accessing the data, please [create a new issue](https://github.com/tillahoffmann/wastewater-catchment-areas/issues/new). + +## 🏁 Just give me the dataset + +If you are interested in the consolidated dataset of wastewater catchment areas rather than reproducing the analysis, you can download it [here](https://gist.githubusercontent.com/tillahoffmann/fc12349c02950e43a9edefe5907eb62c/raw/catchments_consolidated.zip) ([Shapefile](https://en.wikipedia.org/wiki/Shapefile) format). More comprehensive results, including the CSV files described below, can be found [here](https://gist.githubusercontent.com/tillahoffmann/fc12349c02950e43a9edefe5907eb62c). + +## 💾 Data + +We obtained wastewater catchment area data from sewerage service providers under [Environmental Information Regulations 2004](https://en.wikipedia.org/wiki/Environmental_Information_Regulations_2004). 
We consolidated these geospatial data and matched catchments to wastewater treatment works data collected under the [Urban Wastewater Treatment Directive of the European Union](https://uwwtd.eu/United-Kingdom/). After analysis, the data comprise + +- `catchments_consolidated.*`: geospatial data as a shapefile in the [British National Grid projection](https://epsg.io/7405), including auxiliary files. Each feature has the following attributes: + + - `identifier`: a unique identifier for the catchment based on its geometry. These identifiers are stable across different versions of the data provided the geometry of the associated catchment remains unchanged. + - `company`: the water company that contributed the feature. + - `name`: the name of the catchment as provided by the water company. + - `comment` (optional): an annotation providing additional information about the catchment, e.g. overlaps with other catchments. +- `waterbase_consolidated.csv`: wastewater treatment plant metadata reported under the UWWTD between 2006 and 2018. See [here](https://www.eea.europa.eu/data-and-maps/data/waterbase-uwwtd-urban-waste-water-treatment-directive-7) for the original data. The columns comprise: + + - `uwwState`: whether the treatment works is `active` or `inactive`. + - `rptMStateKey`: key of the member state (should be `UK` or `GB` for all entries). + - `uwwCode`: unique treatment works identifier in the UWWTD database. + - `uwwName`: name of the treatment works. + - `uwwLatitude` and `uwwLongitude`: GPS coordinates of the treatment works in degrees. + - `uwwLoadEnteringUWWTP`: actual load entering the treatment works measured in BOD person equivalents, corresponding to an "organic biodegradable load having a five-day biochemical oxygen demand (BOD5) of 60 g of oxygen per day". + - `uwwCapacity`: potential treatment capacity measured in BOD person equivalents. + - `version`: the reporting version (incremented with each reporting cycle, corresponding to two years). 
+ - `year`: the reporting year. + + Note that there are some data quality issues, e.g. treatment works `UKENNE_YW_TP000055` and `UKENNE_YW_TP000067` are both named `Doncaster (Bentley)` in 2006. + +- `waterbase_catchment_lookup.csv`: lookup table to walk between catchments and treatment works. The columns comprise: + + - `identifier` and `name`: catchment identifier and name as used in `catchments_consolidated.*`. + - `uwwCode` and `uwwName`: treatment works identifier and name as used in `waterbase_consolidated.csv`. + - `distance`: distance between the catchment and treatment works in British National Grid projection (approximately metres). + +- `lsoa_catchment_lookup.csv`: lookup table to walk between catchments and Lower Layer Super Output Areas (LSOAs). The columns comprise: + + - `identifier`: catchment identifier as used in `catchments_consolidated.*`. + - `LSOA11CD`: LSOA identifier as used in the 2011 census. + - `intersection_area`: area of the intersection between the catchment and LSOA in British National Grid projection (approximately square metres). + +### Environmental Information Requests + +Details of the submitted Environmental Information Requests can be found here: + +- 🟢 [Anglian Water](https://www.whatdotheyknow.com/r/615f2df6-b1b3-42db-a236-8b311789a468): data provided and publicly accessible. +- 🔴 [Northern Ireland Water](https://www.whatdotheyknow.com/r/2b144b5d-abe6-4ad9-a61b-4e39f1e96e9f): request refused. +- 🟢 [Northumbrian Water](https://www.whatdotheyknow.com/r/aad55c04-bbc4-47a9-bec8-ea7e2a97f6d3): data provided and publicly accessible. +- 🟢 [Scottish Water](https://www.whatdotheyknow.com/r/0998addc-63f7-4a78-ac75-17fcf9b54b7d): data provided and publicly accessible. +- 🟢 [Severn Trent Water](https://www.whatdotheyknow.com/request/wastewater_catchment_areas): data provided and publicly accessible. +- 🟢 [Southern Water](https://www.whatdotheyknow.com/r/4cde4e22-1df0-42c8-b1a2-02e2cbd45b1b): data provided and publicly accessible. 
+- 🟢 [South West Water](https://www.whatdotheyknow.com/request/catchment_geospatial_data_files): data provided and publicly accessible. +- 🟢 [Thames Water](https://www.whatdotheyknow.com/r/e5915cbb-dc3b-4797-bf75-fe7cd8eb75c0): data provided and publicly accessible. +- 🟢 [United Utilities](https://www.whatdotheyknow.com/r/578035f9-a422-4c1b-a803-c257bf4f3414): data provided and publicly accessible. +- 🟢 [Welsh Water](https://www.whatdotheyknow.com/r/f482d33f-e753-45b2-9518-45ddf92fa718): data provided and publicly accessible. +- 🟢 [Wessex Water](https://www.whatdotheyknow.com/r/bda33cfd-e23d-49e6-b651-4ff8997c83c3): data provided and publicly accessible. +- 🟢 [Yorkshire Water](https://www.whatdotheyknow.com/r/639740ed-b0a3-4609-b4b6-a30a052fe037): data provided and publicly accessible. + +You can use the following template to request the raw data directly from water companies. + +> Dear EIR Team, +> +> Could you please provide the geospatial extent of wastewater catchment areas served by wastewater treatment plants owned or operated by your company as an attachment in response to this request? Could you please provide these data at the highest spatial resolution available in a machine-readable vector format (see below for a non-exhaustive list of suitable formats)? Catchment areas served by different treatment plants should be distinguishable. +> +> For example, geospatial data could be provided as shapefile (https://en.wikipedia.org/wiki/Shapefile), GeoJSON (https://en.wikipedia.org/wiki/GeoJSON), or GeoPackage (https://en.wikipedia.org/wiki/GeoPackage) formats. Other commonly used geospatial file formats may also be suitable, but rasterised file formats are not suitable. +> +> This request was previously submitted directly to the EIR team, and I trust I will receive the same response via the whatdotheyknow.com platform. Thank you for your time and I look forward to hearing from you. +> +> All the best, +> [your name here] + +## 🔎 Reproducing the Analysis + +1. 
Install [GDAL](https://gdal.org), e.g., on a Mac with [`brew`](https://brew.sh) installed, + +```bash +$ brew install gdal +``` + +2. Set up a clean Python environment (this code has only been tested using Python 3.9 on an Apple Silicon MacBook Pro), ideally using a virtual environment. Then install the required dependencies by running + +```bash +$ pip install -r requirements.txt +``` + +3. Download the data (including data on Lower Layer Super Output Areas (LSOAs) and population in LSOAs from the ONS, Urban Wastewater Treatment Directive data from the European Environment Agency, and wastewater catchment area data from whatdotheyknow.com) by running the following command. + +```bash +$ make data +``` + +4. Validate that all the data are in place and that you have the correct input data by running + +```bash +$ make data/validation +``` + +5. Run the analysis by executing + +```bash +$ make analysis +``` + +The last command will execute the following notebooks in sequence and generate both the data products listed above and the figures in the accompanying manuscript. The analysis will take between 15 and 30 minutes depending on your computer. + +1. `consolidate_waterbase.ipynb`: load the UWWTD data, extract all treatment works information, and write the `waterbase_consolidated.csv` file. +2. `consolidate_catchments.ipynb`: load all catchments, remove duplicates, annotate, and write the `catchments_consolidated.*` files. +3. `match_waterbase_and_catchments.ipynb`: match UWWTD treatment works to catchments based on distances, names, and manual review. Writes the `waterbase_catchment_lookup.csv` file. +4. `match_catchments_and_lsoas.ipynb`: match catchments to LSOAs to evaluate their spatial overlap. Writes the files `lsoa_catchment_lookup.csv` and `lsoa_coverage.csv`. +5. `estimate_population.ipynb`: estimate the population resident within catchments, and write the `geospatial_population_estimates.csv` file. 
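The lookup tables described above are designed to be combined. As an illustration of areal weighting (a simplified sketch with made-up numbers, not the exact method implemented in `estimate_population.ipynb`), a catchment's population can be approximated by apportioning each LSOA's population according to the fraction of the LSOA's area that intersects the catchment:

```python
import pandas as pd

# Hypothetical rows mimicking the schema of lsoa_catchment_lookup.csv
# (identifier, LSOA11CD, intersection_area in square metres).
lookup = pd.DataFrame({
    "identifier": ["c1", "c1", "c2"],
    "LSOA11CD": ["E01000001", "E01000002", "E01000002"],
    "intersection_area": [1e6, 5e5, 1.5e6],
})
# Hypothetical LSOA total areas and resident populations.
lsoas = pd.DataFrame({
    "LSOA11CD": ["E01000001", "E01000002"],
    "area": [1e6, 2e6],
    "population": [1500, 2000],
})

# Apportion each LSOA's population to catchments in proportion to the
# fraction of the LSOA covered by each catchment, then sum per catchment.
merged = lookup.merge(lsoas, on="LSOA11CD")
merged["population_share"] = (
    merged["population"] * merged["intersection_area"] / merged["area"]
)
estimates = merged.groupby("identifier")["population_share"].sum()
print(estimates)
```

All numbers here are illustrative; with the real data you would read the CSV files produced by the analysis instead of constructing the frames inline.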
+ +## Acknowledgements + +This research is part of the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (grant ref MC_PC_20029). diff --git a/README.rst b/README.rst deleted file mode 100644 index 9ce6b31..0000000 --- a/README.rst +++ /dev/null @@ -1,123 +0,0 @@ -Wastewater Catchment Areas in Great Britain -=========================================== - -This repository provides code to consolidate wastewater catchment areas in Great Britain and evaluate their spatial overlap with statistical reporting units, such as Lower Layer Super Output Areas (LSOAs). Please see the `accompanying publication `__ for a detailed description of the analysis. If you have questions about the analysis, code, or accessing the data, please contact :code:`till dot hoffmann at oxon dot org`. - -🏁 Just give me the dataset --------------------------- - -If you are interested in the consolidated dataset of wastewater catchment areas rather than reproducing the analysis, you can download it `here `__ (`Shapefile `__ format). More comprehensive results, including the CSV files described below, can be found `here `__. - -💾 Data -------- - -We obtained wastewater catchment area data from sewerage service providers under `Environmental Information Regulations 2004 `__. We consolidated these geospatial data and matched catchments to wastewater treatment works data collected under the `Urban Wastewater Treatment Directive of the European Union `__. After analysis, the data comprise - -- :code:`catchments_consolidated.*`: geospatial data as a shapefile in the `British National Grid projection `__, including auxiliary files. Each feature has the following attributes: - - - :code:`identifier`: a unique identifier for the catchment based on its geometry. 
These identifiers are stable across different versions of the data provided the geometry of the associated catchment remains unchanged. - - :code:`company`: the water company that contributed the feature. - - :code:`name`: the name of the catchment as provided by the water company. - - :code:`comment` (optional): an annotation providing additional information about the catchment, e.g. overlaps with other catchments. -- :code:`waterbase_consolidated.csv`: wastewater treatment plant metadata reported under the UWWTD between 2006 and 2018. See `here `__ for the original data. The columns comprise: - - - :code:`uwwState`: whether the treatment work is :code:`active` or :code:`inactive`. - - :code:`rptMStateKey`: key of the member state (should be :code:`UK` or :code:`GB` for all entries). - - :code:`uwwCode`: unique treatment works identifier in the UWWTD database. - - :code:`uwwName`: name of the treatment works. - - :code:`uwwLatitude` and :code:`uwwLongitude`: GPS coordinates of the treatment works in degrees. - - :code:`uwwLoadEnteringUWWTP`: actual load entering the treatment works measured in BOD person equivalents, corresponding to an "organic biodegradable load having a five-day biochemical oxygen demand (BOD5) of 60 g of oxygen per day". - - :code:`uwwCapacity`: potential treatment capacity measured in BOD person equivalents. - - :code:`version`: the reporting version (incremented with each reporting cycling, corresponding to two years). - - :code:`year`: the reporting year. - - Note that there are some data quality issues, e.g. treatment works :code:`UKENNE_YW_TP000055` and :code:`UKENNE_YW_TP000067` are both named :code:`Doncaster (Bentley)` in 2006. - -- :code:`waterbase_catchment_lookup.csv`: lookup table to walk between catchments and treatment works. The columns comprise: - - - :code:`identifier` and :code:`name`: catchment identifier and name as used in `catchments_consolidated.*`. 
- - :code:`uwwCode` and :code:`uwwName`: treatment works identifier and name as used in :code:`waterbase_consolidated.csv`. - - :code:`distance`: distance between the catchment and treatment works in British National Grid projection (approximately metres). - -- :code:`lsoa_catchment_lookup.csv`: lookup table to walk between catchments and Lower Layer Super Output Areas (LSOAs). The columns comprise: - - - :code:`identifier`: catchment identifier as used in `catchments_consolidated.*`. - - :code:`LSOA11CD`: LSOA identifier as used in the 2011 census. - - :code:`intersection_area`: area of the intersection between the catchment and LSOA in British National Grid projection (approximately square metres). - -Environmental Information Requests -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Details of the submitted Environmental Information Requests can be found here: - -- 🟢 `Anglian Water `__: data provided and publicly accessible. -- 🔴 `Northern Ireland Water `__: request refused. -- 🟢 `Northumbrian Water `__: data provided and publicly accessible. -- 🟢 `Scottish Water `__: data provided and publicly accessible. -- 🟢 `Severn Trent Water `__: data provided and publicly accessible. -- 🟢 `Southern Water `__: data provided and publicly accessible. -- 🟢 `South West Water `__: data provided and publicly accessible. -- 🟢 `Thames Water `__: data provided and publicly accessible. -- 🟢 `United Utilities `__: data provided and publicly accessible. -- 🟢 `Welsh Water `__: data provided and publicly accessible. -- 🟢 `Wessex Water `__: data provided and publicly accessible. -- 🟢 `Yorkshire Water `__: data provided and publicly accessible. - -You can use the following template to request the raw data directly from water companies. - - Dear EIR Team, - - Could you please provide the geospatial extent of wastewater catchment areas served by wastewater treatment plants owned or operated by your company as an attachment in response to this request? 
Could you please provide these data at the highest spatial resolution available in a machine-readable vector format (see below for a non-exhaustive list of suitable formats)? Catchment areas served by different treatment plants should be distinguishable. - - For example, geospatial data could be provided as shapefile (https://en.wikipedia.org/wiki/Shapefile), GeoJSON (https://en.wikipedia.org/wiki/GeoJSON), or GeoPackage (https://en.wikipedia.org/wiki/GeoPackage) formats. Other commonly used geospatial file formats may also be suitable, but rasterised file formats are not suitable. - - This request was previously submitted directly to the EIR team, and I trust I will receive the same response via the whatdotheyknow.com platform. Thank you for your time and I look forward to hearing from you. - - All the best, - [your name here] - -🔎 Reproducing the Analysis ---------------------------- - -1. Install `GDAL `__, e.g., on a Mac with `brew `__ installed, - - .. code:: bash - - brew install gdal - -2. Set up a clean python environment (this code has only been tested using python 3.9 on an Apple Silicon Macbook Pro), ideally using a virtual environment. Then install the required dependencies by running - - .. code:: bash - - pip install -r requirements.txt - -3. Download the data (including data on Lower Layer Super Output Areas (LSOAs) and population in LSOAs from the ONS, Urban Wastewater Treatment Directive Data from the European Environment Agency, and wastewater catchment area data from whatdotheyknow.com) by running the following command. Catchment area data for Anglian Water and Severn Trent Water are available by submitting an Environmental Information Request, but they are not currently available for download from whatdotheyknow.com. Please use the Environmental Information Request template above or get in touch with the authors at :code:`till dot hoffmann at oxon dot org`. - - .. code:: bash - - make data - -5. 
Validate all the data are in place and that you have the correct input data by running - - .. code:: bash - - make data/validation - -6. Run the analysis by executing - - .. code:: bash - - make analysis - -The last command will execute the following notebooks in sequence and generate both the data products listed above as well as the figures in the accompanying manuscript. The analysis will take between 15 and 30 minutes depending on your computer. - -1. :code:`consolidate_waterbase.ipynb`: load the UWWTD data, extract all treatment work information, and write the :code:`waterbase_consolidated.csv` file. -2. :code:`conslidate_catchments.ipynb`: load all catchments, remove duplicates, annotate, and write the :code:`catchments_consolidated.*` files. -3. :code:`match_waterbase_and_catchments.ipynb`: match UWWTD treatment works to catchments based on distances, names, and manual review. Writes the :code:`waterbase_catchment_lookup.csv` file. -4. :code:`match_catchments_and_lsoas.ipynb`: match catchments to LSOAs to evaluate their spatial overlap. Writes the files :code:`lsoa_catchment_lookup.csv` and :code:`lsoa_coverage.csv`. -5. :code:`estimate_population.ipynb`: estimate the population resident within catchments, and write the :code:`geospatial_population_estimates.csv` file. - -Acknowledgements ----------------- - -This research is part of the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (grant ref MC_PC_20029). From 2ae5a1f26fdcfe18d9e98fc2d53b883d2f5ee941 Mon Sep 17 00:00:00 2001 From: Till Hoffmann Date: Fri, 21 Feb 2025 12:55:45 -0500 Subject: [PATCH 2/4] Update requirements and drop Sphinx. 
--- Makefile | 5 +- conf.py | 1 - requirements.in | 10 +- requirements.txt | 362 ++++++++++++++++++++++------------------------- 4 files changed, 177 insertions(+), 201 deletions(-) delete mode 100644 conf.py diff --git a/Makefile b/Makefile index 2559cdf..1f06416 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,4 @@ -.PHONY : clear_output data data/geoportal.statistics.gov.uk data/ons.gov.uk data/eea.europa.eu docs \ +.PHONY : clear_output data data/geoportal.statistics.gov.uk data/ons.gov.uk data/eea.europa.eu \ data/raw_catchments data/validation data.shasum NBEXECUTE = jupyter nbconvert --execute --output-dir=workspace --to=html @@ -11,9 +11,6 @@ requirements.txt : requirements.in sync : requirements.txt pip-sync -docs : - sphinx-build . docs/_build - clear_output : jupyter nbconvert --clear-output *.ipynb diff --git a/conf.py b/conf.py deleted file mode 100644 index 7cb80a4..0000000 --- a/conf.py +++ /dev/null @@ -1 +0,0 @@ -master_doc = 'README' diff --git a/requirements.in b/requirements.in index a8063a1..f2e298a 100644 --- a/requirements.in +++ b/requirements.in @@ -1,16 +1,18 @@ chardet -fiona +# AttributeError: module 'fiona' has no attribute 'path' +fiona<1.10.0 flake8 -geopandas +# AttributeError: 'SpatialIndex' object has no attribute 'query_bulk' +geopandas<0.13.0 matplotlib networkx -numpy +# ValueError: Unable to avoid copy while creating an array as requested. 
+numpy<2 jupyter pandas pyproj rtree scipy shapely -sphinx retrying tqdm diff --git a/requirements.txt b/requirements.txt index e4bfcaf..29dad28 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,49 +4,47 @@ # # pip-compile # -alabaster==0.7.13 - # via sphinx -anyio==3.6.2 - # via jupyter-server -appnope==0.1.3 - # via - # ipykernel - # ipython -argon2-cffi==21.3.0 +anyio==4.8.0 # via + # httpx # jupyter-server - # nbclassic - # notebook +appnope==0.1.4 + # via ipykernel +argon2-cffi==23.1.0 + # via jupyter-server argon2-cffi-bindings==21.2.0 # via argon2-cffi -arrow==1.2.3 +arrow==1.3.0 # via isoduration -asttokens==2.2.1 +asttokens==3.0.0 # via stack-data -attrs==22.2.0 +async-lru==2.0.4 + # via jupyterlab +attrs==25.1.0 # via # fiona # jsonschema -babel==2.12.1 - # via sphinx -backcall==0.2.0 - # via ipython -beautifulsoup4==4.11.2 + # referencing +babel==2.17.0 + # via jupyterlab-server +beautifulsoup4==4.13.3 # via nbconvert -bleach==6.0.0 +bleach[css]==6.2.0 # via nbconvert -certifi==2022.12.7 +certifi==2025.1.31 # via # fiona + # httpcore + # httpx # pyproj # requests -cffi==1.15.1 +cffi==1.17.1 # via argon2-cffi-bindings -chardet==5.1.0 +chardet==5.2.0 # via -r requirements.in -charset-normalizer==3.1.0 +charset-normalizer==3.4.1 # via requests -click==8.1.3 +click==8.1.8 # via # click-plugins # cligj @@ -55,164 +53,164 @@ click-plugins==1.1.1 # via fiona cligj==0.7.2 # via fiona -comm==0.1.2 - # via ipykernel -contourpy==1.0.7 +comm==0.2.2 + # via + # ipykernel + # ipywidgets +contourpy==1.3.1 # via matplotlib -cycler==0.11.0 +cycler==0.12.1 # via matplotlib -debugpy==1.6.6 +debugpy==1.8.12 # via ipykernel decorator==5.1.1 # via ipython defusedxml==0.7.1 # via nbconvert -docutils==0.19 - # via sphinx -executing==1.2.0 +executing==2.2.0 # via stack-data -fastjsonschema==2.16.3 +fastjsonschema==2.21.1 # via nbformat -fiona==1.9.1 +fiona==1.9.6 # via # -r requirements.in # geopandas -flake8==6.0.0 +flake8==7.1.2 # via -r requirements.in 
-fonttools==4.39.0 +fonttools==4.56.0 # via matplotlib fqdn==1.5.1 # via jsonschema geopandas==0.12.2 # via -r requirements.in -idna==3.4 +h11==0.14.0 + # via httpcore +httpcore==1.0.7 + # via httpx +httpx==0.28.1 + # via jupyterlab +idna==3.10 # via # anyio + # httpx # jsonschema # requests -imagesize==1.4.1 - # via sphinx -ipykernel==6.21.3 +ipykernel==6.29.5 # via - # ipywidgets # jupyter # jupyter-console - # nbclassic - # notebook - # qtconsole -ipython==8.11.0 + # jupyterlab +ipython==8.32.0 # via # ipykernel # ipywidgets # jupyter-console -ipython-genutils==0.2.0 - # via - # nbclassic - # notebook - # qtconsole -ipywidgets==8.0.4 +ipywidgets==8.1.5 # via jupyter isoduration==20.11.0 # via jsonschema -jedi==0.18.2 +jedi==0.19.2 # via ipython -jinja2==3.1.2 +jinja2==3.1.5 # via # jupyter-server - # nbclassic + # jupyterlab + # jupyterlab-server # nbconvert - # notebook - # sphinx -jsonpointer==2.3 +json5==0.10.0 + # via jupyterlab-server +jsonpointer==3.0.0 # via jsonschema -jsonschema[format-nongpl]==4.17.3 +jsonschema[format-nongpl]==4.23.0 # via # jupyter-events + # jupyterlab-server # nbformat -jupyter==1.0.0 +jsonschema-specifications==2024.10.1 + # via jsonschema +jupyter==1.1.1 # via -r requirements.in -jupyter-client==8.0.3 +jupyter-client==8.6.3 # via # ipykernel # jupyter-console # jupyter-server - # nbclassic # nbclient - # notebook - # qtconsole jupyter-console==6.6.3 # via jupyter -jupyter-core==5.3.0 +jupyter-core==5.7.2 # via # ipykernel # jupyter-client # jupyter-console # jupyter-server - # nbclassic + # jupyterlab # nbclient # nbconvert # nbformat - # notebook - # qtconsole -jupyter-events==0.6.3 +jupyter-events==0.12.0 # via jupyter-server -jupyter-server==2.5.0 +jupyter-lsp==2.2.5 + # via jupyterlab +jupyter-server==2.15.0 # via - # nbclassic + # jupyter-lsp + # jupyterlab + # jupyterlab-server + # notebook # notebook-shim -jupyter-server-terminals==0.4.4 +jupyter-server-terminals==0.5.3 # via jupyter-server -jupyterlab-pygments==0.2.2 
+jupyterlab==4.3.5 + # via + # jupyter + # notebook +jupyterlab-pygments==0.3.0 # via nbconvert -jupyterlab-widgets==3.0.5 +jupyterlab-server==2.27.3 + # via + # jupyterlab + # notebook +jupyterlab-widgets==3.0.13 # via ipywidgets -kiwisolver==1.4.4 +kiwisolver==1.4.8 # via matplotlib -markupsafe==2.1.2 +markupsafe==3.0.2 # via # jinja2 # nbconvert -matplotlib==3.7.1 +matplotlib==3.10.0 # via -r requirements.in -matplotlib-inline==0.1.6 +matplotlib-inline==0.1.7 # via # ipykernel # ipython mccabe==0.7.0 # via flake8 -mistune==2.0.5 +mistune==3.1.2 # via nbconvert -munch==2.5.0 - # via fiona -nbclassic==0.5.3 - # via notebook -nbclient==0.7.2 +nbclient==0.10.2 # via nbconvert -nbconvert==7.2.10 +nbconvert==7.16.6 # via # jupyter # jupyter-server - # nbclassic - # notebook -nbformat==5.7.3 +nbformat==5.10.4 # via # jupyter-server - # nbclassic # nbclient # nbconvert - # notebook -nest-asyncio==1.5.6 - # via - # ipykernel - # nbclassic - # notebook +nest-asyncio==1.6.0 + # via ipykernel networkx==3.4.2 # via -r requirements.in -notebook==6.5.3 +notebook==7.3.2 # via jupyter -notebook-shim==0.2.2 - # via nbclassic -numpy==1.24.2 +notebook-shim==0.2.4 + # via + # jupyterlab + # notebook +numpy==1.26.4 # via # -r requirements.in # contourpy @@ -220,97 +218,88 @@ numpy==1.24.2 # pandas # scipy # shapely -packaging==23.0 +overrides==7.7.0 + # via jupyter-server +packaging==24.2 # via # geopandas # ipykernel + # jupyter-events # jupyter-server + # jupyterlab + # jupyterlab-server # matplotlib # nbconvert - # qtconsole - # qtpy - # sphinx -pandas==1.5.3 +pandas==2.2.3 # via # -r requirements.in # geopandas -pandocfilters==1.5.0 +pandocfilters==1.5.1 # via nbconvert -parso==0.8.3 +parso==0.8.4 # via jedi -pexpect==4.8.0 - # via ipython -pickleshare==0.7.5 +pexpect==4.9.0 # via ipython -pillow==9.4.0 +pillow==11.1.0 # via matplotlib -platformdirs==3.1.1 +platformdirs==4.3.6 # via jupyter-core -prometheus-client==0.16.0 - # via - # jupyter-server - # nbclassic - # notebook 
-prompt-toolkit==3.0.38 +prometheus-client==0.21.1 + # via jupyter-server +prompt-toolkit==3.0.50 # via # ipython # jupyter-console -psutil==5.9.4 +psutil==7.0.0 # via ipykernel ptyprocess==0.7.0 # via # pexpect # terminado -pure-eval==0.2.2 +pure-eval==0.2.3 # via stack-data -pycodestyle==2.10.0 +pycodestyle==2.12.1 # via flake8 -pycparser==2.21 +pycparser==2.22 # via cffi -pyflakes==3.0.1 +pyflakes==3.2.0 # via flake8 -pygments==2.14.0 +pygments==2.19.1 # via # ipython # jupyter-console # nbconvert - # qtconsole - # sphinx -pyparsing==3.0.9 +pyparsing==3.2.1 # via matplotlib -pyproj==3.4.1 +pyproj==3.7.1 # via # -r requirements.in # geopandas -pyrsistent==0.19.3 - # via jsonschema -python-dateutil==2.8.2 +python-dateutil==2.9.0.post0 # via # arrow # jupyter-client # matplotlib # pandas -python-json-logger==2.0.7 +python-json-logger==3.2.1 # via jupyter-events -pytz==2022.7.1 +pytz==2025.1 # via pandas -pyyaml==6.0 +pyyaml==6.0.2 # via jupyter-events -pyzmq==25.0.1 +pyzmq==26.2.1 # via # ipykernel # jupyter-client # jupyter-console # jupyter-server - # nbclassic - # notebook - # qtconsole -qtconsole==5.4.1 - # via jupyter -qtpy==2.3.0 - # via qtconsole -requests==2.28.2 - # via sphinx +referencing==0.36.2 + # via + # jsonschema + # jsonschema-specifications + # jupyter-events +requests==2.32.3 + # via jupyterlab-server retrying==1.3.4 # via -r requirements.in rfc3339-validator==0.1.4 @@ -321,68 +310,49 @@ rfc3986-validator==0.1.1 # via # jsonschema # jupyter-events -rtree==1.0.1 +rpds-py==0.23.1 + # via + # jsonschema + # referencing +rtree==1.3.0 # via -r requirements.in -scipy==1.10.1 +scipy==1.15.2 # via -r requirements.in -send2trash==1.8.0 - # via - # jupyter-server - # nbclassic - # notebook -shapely==2.0.1 +send2trash==1.8.3 + # via jupyter-server +shapely==2.0.7 # via # -r requirements.in # geopandas -six==1.16.0 +six==1.17.0 # via - # asttokens - # bleach - # munch + # fiona # python-dateutil # retrying # rfc3339-validator -sniffio==1.3.0 +sniffio==1.3.1 
# via anyio -snowballstemmer==2.2.0 - # via sphinx -soupsieve==2.4 +soupsieve==2.6 # via beautifulsoup4 -sphinx==6.1.3 - # via -r requirements.in -sphinxcontrib-applehelp==1.0.4 - # via sphinx -sphinxcontrib-devhelp==1.0.2 - # via sphinx -sphinxcontrib-htmlhelp==2.0.1 - # via sphinx -sphinxcontrib-jsmath==1.0.1 - # via sphinx -sphinxcontrib-qthelp==1.0.3 - # via sphinx -sphinxcontrib-serializinghtml==1.1.5 - # via sphinx -stack-data==0.6.2 +stack-data==0.6.3 # via ipython -terminado==0.17.1 +terminado==0.18.1 # via # jupyter-server # jupyter-server-terminals - # nbclassic - # notebook -tinycss2==1.2.1 - # via nbconvert -tornado==6.2 +tinycss2==1.4.0 + # via bleach +tornado==6.4.2 # via # ipykernel # jupyter-client # jupyter-server - # nbclassic + # jupyterlab # notebook # terminado -tqdm==4.65.0 +tqdm==4.67.1 # via -r requirements.in -traitlets==5.9.0 +traitlets==5.14.3 # via # comm # ipykernel @@ -393,28 +363,36 @@ traitlets==5.9.0 # jupyter-core # jupyter-events # jupyter-server + # jupyterlab # matplotlib-inline - # nbclassic # nbclient # nbconvert # nbformat - # notebook - # qtconsole -uri-template==1.2.0 +types-python-dateutil==2.9.0.20241206 + # via arrow +typing-extensions==4.12.2 + # via + # anyio + # beautifulsoup4 + # ipython + # referencing +tzdata==2025.1 + # via pandas +uri-template==1.3.0 # via jsonschema -urllib3==1.26.15 +urllib3==2.3.0 # via requests -wcwidth==0.2.6 +wcwidth==0.2.13 # via prompt-toolkit -webcolors==1.12 +webcolors==24.11.1 # via jsonschema webencodings==0.5.1 # via # bleach # tinycss2 -websocket-client==1.5.1 +websocket-client==1.8.0 # via jupyter-server -widgetsnbextension==4.0.5 +widgetsnbextension==4.0.13 # via ipywidgets # The following packages are considered to be unsafe in a requirements file: From 44c99ad82524df304d3601f4f8d10f0ca5414f94 Mon Sep 17 00:00:00 2001 From: Till Hoffmann Date: Fri, 21 Feb 2025 13:11:41 -0500 Subject: [PATCH 3/4] Use markdown notebooks. 
--- .gitignore | 1 + Makefile | 29 +- consolidate_catchments.ipynb | 830 --------------------------- consolidate_catchments.md | 758 ++++++++++++++++++++++++ consolidate_waterbase.ipynb | 300 ---------- consolidate_waterbase.md | 222 +++++++ estimate_population.ipynb | 322 ----------- estimate_population.md | 268 +++++++++ match_catchments_and_lsoas.ipynb | 233 -------- match_catchments_and_lsoas.md | 167 ++++++ match_waterbase_and_catchments.ipynb | 469 --------------- match_waterbase_and_catchments.md | 373 ++++++++++++ requirements.in | 1 + requirements.txt | 16 +- 14 files changed, 1819 insertions(+), 2170 deletions(-) delete mode 100644 consolidate_catchments.ipynb create mode 100644 consolidate_catchments.md delete mode 100644 consolidate_waterbase.ipynb create mode 100644 consolidate_waterbase.md delete mode 100644 estimate_population.ipynb create mode 100644 estimate_population.md delete mode 100644 match_catchments_and_lsoas.ipynb create mode 100644 match_catchments_and_lsoas.md delete mode 100644 match_waterbase_and_catchments.ipynb create mode 100644 match_waterbase_and_catchments.md diff --git a/.gitignore b/.gitignore index d790bf0..b65df51 100644 --- a/.gitignore +++ b/.gitignore @@ -147,3 +147,4 @@ data # Don't include figures. 
/*.pdf +*.ipynb diff --git a/Makefile b/Makefile index 1f06416..b0b0e2c 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,9 @@ -.PHONY : clear_output data data/geoportal.statistics.gov.uk data/ons.gov.uk data/eea.europa.eu \ +.PHONY : data data/geoportal.statistics.gov.uk data/ons.gov.uk data/eea.europa.eu \ data/raw_catchments data/validation data.shasum -NBEXECUTE = jupyter nbconvert --execute --output-dir=workspace --to=html +EXECUTE_NB = cd $(dir $<) \ + && jupytext --to ipynb --output - $(notdir $<) \ + | jupyter nbconvert --stdin --execute --to html --output $@ OUTPUT_ROOT = data/wastewater_catchment_areas_public CURL = curl -L --retry 3 --retry-all-errors @@ -11,9 +13,6 @@ requirements.txt : requirements.in sync : requirements.txt pip-sync -clear_output : - jupyter nbconvert --clear-output *.ipynb - # Getting the data ================================================================================= data : data/eea.europa.eu data/geoportal.statistics.gov.uk data/ons.gov.uk data/raw_catchments @@ -132,30 +131,30 @@ workspace : mkdir -p $@ workspace/consolidate_waterbase.html ${OUTPUT_ROOT}/waterbase_consolidated.csv \ - : consolidate_waterbase.ipynb workspace ${OUTPUT_ROOT} data/eea.europa.eu - ${NBEXECUTE} $< + : consolidate_waterbase.md workspace ${OUTPUT_ROOT} data/eea.europa.eu + ${EXECUTE_NB} $< workspace/consolidate_catchments.html ${OUTPUT_ROOT}/catchments_consolidated.shp overview.pdf \ - : consolidate_catchments.ipynb workspace ${OUTPUT_ROOT} data/raw_catchments - ${NBEXECUTE} $< + : consolidate_catchments.md workspace ${OUTPUT_ROOT} data/raw_catchments + ${EXECUTE_NB} $< workspace/match_waterbase_and_catchments.html ${OUTPUT_ROOT}/waterbase_catchment_lookup.csv \ - : match_waterbase_and_catchments.ipynb workspace ${OUTPUT_ROOT} \ + : match_waterbase_and_catchments.md workspace ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/catchments_consolidated.shp ${OUTPUT_ROOT}/waterbase_consolidated.csv - ${NBEXECUTE} $< + ${EXECUTE_NB} $< 
workspace/match_catchments_and_lsoas.html ${OUTPUT_ROOT}/lsoa_coverage.csv \ ${OUTPUT_ROOT}/lsoa_catchment_lookup.csv \ - : match_catchments_and_lsoas.ipynb workspace ${OUTPUT_ROOT} \ + : match_catchments_and_lsoas.md workspace ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/catchments_consolidated.shp data/geoportal.statistics.gov.uk - ${NBEXECUTE} $< + ${EXECUTE_NB} $< workspace/estimate_population.html ${OUTPUT_ROOT}/population_estimates.csv \ population_estimates.pdf estimation_method.pdf \ - : estimate_population.ipynb data/ons.gov.uk workspace ${OUTPUT_ROOT} \ + : estimate_population.md data/ons.gov.uk workspace ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/lsoa_catchment_lookup.csv ${OUTPUT_ROOT}/lsoa_coverage.csv \ ${OUTPUT_ROOT}/waterbase_catchment_lookup.csv - ${NBEXECUTE} $< + ${EXECUTE_NB} $< ${OUTPUT_ROOT}/catchments_consolidated.zip : ${OUTPUT_ROOT}/catchments_consolidated.shp zip $@ ${@:.zip=}* diff --git a/consolidate_catchments.ipynb b/consolidate_catchments.ipynb deleted file mode 100644 index e9a34bb..0000000 --- a/consolidate_catchments.ipynb +++ /dev/null @@ -1,830 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "instructional-chapter", - "metadata": {}, - "outputs": [], - "source": [ - "import fiona\n", - "import os\n", - "import logging\n", - "from tqdm.notebook import tqdm\n", - "import shapely as sh\n", - "import shapely.geometry\n", - "import geopandas as gpd\n", - "import pandas as pd\n", - "from matplotlib import pyplot as plt\n", - "import matplotlib as mpl\n", - "import numpy as np\n", - "import collections\n", - "import hashlib\n", - "\n", - "mpl.rcParams['figure.dpi'] = 144\n", - "mpl.style.use('scrartcl.mplstyle')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "utility-melbourne", - "metadata": {}, - "outputs": [], - "source": [ - "# Define helper functions for processing the data.\n", - "\n", - "def assert_in(*values, key=None):\n", - " \"\"\"\n", - " Assert that the value belongs to a set of 
values.\n", - " \"\"\"\n", - " def _wrapper(properties, value):\n", - " assert value in values, f'{value} is not one of {values}'\n", - " if key:\n", - " properties[key] = value\n", - " return _wrapper\n", - "\n", - "\n", - "def assert_unique(key=None, *, on_error='raise', exceptions=None):\n", - " \"\"\"\n", - " Assert that the value is unique.\n", - " \"\"\"\n", - " values = collections.Counter()\n", - " exceptions = exceptions or set()\n", - " def _wrapper(properties, value):\n", - " if value in values and value not in exceptions:\n", - " message = f'`{value}` is not a unique value and has occurred {values[value]} times'\n", - " if on_error == 'raise':\n", - " raise ValueError(message)\n", - " elif on_error == 'warn':\n", - " logging.warning(message)\n", - " else:\n", - " raise NotImplementedError\n", - "\n", - " values[value] += 1\n", - " if key:\n", - " properties[key] = value\n", - " return _wrapper" ] }, { "cell_type": "code", "execution_count": null, "id": "electrical-oregon", "metadata": {}, "outputs": [], "source": [ - "root = 'data/raw_catchments'\n", - "crs = 'epsg:27700'\n", - "\n", - "# List of recognised catchment owners as supplied by Severn Trent (to make sure there aren't any\n", - "# unexpected values).\n", - "severn_trent_allowed_owners = [\n", - " 'STW', 'Hafren Dyfrdwy', 'Yorkshire Water', 'Anglian Water', 'Private', 'Welsh Water',\n", - " 'United Utilities', 'Thames Water', None,\n", - "]\n", - "\n", - "def united_utilities_aggregation(catchments):\n", - " \"\"\"\n", - " Function to aggregate United Utilities subcatchments into catchments at the treatment work\n", - " level.\n", - " \"\"\"\n", - " catchments['reference'] = catchments.reference.str.extract(r'(?P<reference>UU-\d+-SC\d+-\w+)')\n", - " assert catchments.reference.nunique() < len(catchments)\n", - " return catchments.dissolve(by='reference').reset_index()\n", - "\n", - "\n", - "# Declare the schema we expect to see for all the different shapefiles. 
`None` means we acknowledge\n", - "# the field exists but ignore it in processing. We use the `postprocessor` function to apply\n", - "# transformations after loading the data, e.g. to aggregate subcatchments.\n", - "datasets = [\n", - " {\n", - " 'company': 'scottish_water',\n", - " 'filename': 'scottish_water.zip/DOAs and WWTWs',\n", - " 'properties': {\n", - " 'doa_name': assert_unique('name', exceptions=[\n", - " # This occurs twice, once for DOA000038 (which should probably be called Maxton)\n", - " # and also for DOA002612 which actually has a road called 'Wellrig'.\n", - " 'Wellrig DOA'\n", - " ]),\n", - " # See postprocessor for exception.\n", - " 'doa_refere': assert_unique('reference', exceptions=['DOA000917']),\n", - " 'SHAPE_STAr': None,\n", - " 'wams_stw_n': None,\n", - " },\n", - " # This deals with the one repeated reference in the dataset due to a new development near Dunbar.\n", - " 'postprocessor': lambda x: x.dissolve(by='reference').reset_index(),\n", - " },\n", - " {\n", - " 'company': 'thames_water',\n", - " 'filename': 'thames_water.zip',\n", - " 'properties': {\n", - " # We do not assert uniqueness for Thames Water Catchments because they have\n", - " # provided us with detailed subcatchments (which belong to different STWs).\n", - " 'STWCODE': 'reference',\n", - " 'STWNAME': 'name',\n", - " 'SDACFOUL_1': None,\n", - " 'RECIEVINGS': None,\n", - " 'OBJECTID': None,\n", - " 'CREATIONUS': None,\n", - " 'DATECREATE': None,\n", - " 'DATEMODIFI': None,\n", - " 'LASTUSER': None,\n", - " 'SHAPEAREA': None,\n", - " 'SDACFOULBN': None,\n", - " 'DATEPOSTED': None,\n", - " 'GLOBALID': None,\n", - " 'SHAPE_AREA': None,\n", - " 'SHAPE_LEN': None,\n", - " },\n", - " 'postprocessor': lambda x: x.dissolve(by='reference').reset_index(),\n", - " },\n", - " {\n", - " 'company': 'anglian_water',\n", - " 'filename': 'anglian_water.zip',\n", - " 'properties': {\n", - " 'AREANAME': assert_unique('name'),\n", - " 'AREASHORTC': assert_unique('reference'),\n", - " 
'LIQUIDTYPE': None,\n", - " 'COLLECTION': None,\n", - " 'COLLECTIO1': None,\n", - " 'TREATMENTM': None,\n", - " 'TREATMENT1': None,\n", - " 'NOSINLINES': None,\n", - " 'NOSTERMINA': None,\n", - " 'NOSPROPS': None,\n", - " 'NOSDOMESTI': None,\n", - " 'NOSMIXEDPR': None,\n", - " 'NOSCOMMPRO': None,\n", - " 'HOMEPOPN': None,\n", - " 'HOUSEHOLDP': None,\n", - " 'DOMESTICPE': None,\n", - " 'HOLIDAYPE': None,\n", - " 'INSTITUTIO': None,\n", - " 'TRADEPE': None,\n", - " 'IMPORTPE': None,\n", - " 'EXPORTPE': None,\n", - " 'NOSSURFACE': None,\n", - " 'STWS_TOTNU': None,\n", - " 'SEWOUTFALL': None,\n", - " 'DATA_CONFI': None,\n", - " 'ID': None,\n", - " }\n", - " },\n", - " {\n", - " 'company': 'northumbrian_water',\n", - " 'filename': 'northumbrian_water.zip/Export',\n", - " 'properties': {\n", - " 'NAME': assert_unique('name'),\n", - " 'ID': assert_unique('reference'),\n", - " 'GID': None,\n", - " 'APIC_STYLE': None,\n", - " 'APIC_SPACE': None,\n", - " 'APIC_STATE': None,\n", - " 'APIC_CDATE': None,\n", - " 'APIC_MDATE': None,\n", - " }\n", - " },\n", - " {\n", - " 'company': 'severn_trent_water',\n", - " 'filename': 'severn_trent_water.zip',\n", - " 'properties': {\n", - " # We allow non-unique names for private treatment works because they're not easily\n", - " # identifiable.\n", - " 'CATCHMENT1': assert_unique('name', exceptions={\n", - " 'private SPS', 'private water works', 'private', 'private STW',\n", - " 'Surface water only', None,\n", - " }),\n", - " # Owner was useful in the original dataset because it allowed us to identify records\n", - " # that weren't actually from Severn Trent Water. 
We'll have to do that manually now.\n", - " # 'Owner': assert_in(*severn_trent_allowed_owners, key='owner'),\n", - " 'SU_REFEREN': assert_in('WwTw', 'WwTW', 'ST', None, 'Cess Pit'),\n", - " 'SAP_FLOC_I': 'reference',\n", - " 'OBJECTID': None,\n", - " 'CATCHMENT_': None,\n", - " 'CACI_2001': None,\n", - " 'CACI_20010': None,\n", - " 'CACI_2006': None,\n", - " 'CACI_2007': None,\n", - " 'CACI_2030': None,\n", - " 'PROPERTIES': None,\n", - " 'STATUS': None,\n", - " 'UPDATED': None,\n", - " 'SUGRAN': None,\n", - " 'COMMENT': None,\n", - " 'SAP_FLOC_D': None,\n", - " 'SHAPE_Leng': None,\n", - " 'STAR_ID': None,\n", - " 'WORKS_CODE': None,\n", - " 'WORKS_ID': None,\n", - " 'CREATED_DA': None,\n", - " 'MODIFIED_D': None,\n", - " 'MODIFIED_U': None,\n", - " 'CREATED_US': None,\n", - " 'JOB_ID': None,\n", - " 'QA_STATUS': None,\n", - " 'SAP_USER_S': None,\n", - " 'SAP_CLASS': None,\n", - " 'Shape_STAr': None,\n", - " 'Shape_STLe': None,\n", - " 'Shape_ST_1': None,\n", - " 'Shape_ST_2': None,\n", - " },\n", - " },\n", - " {\n", - " 'company': 'southern_water',\n", - " 'filename': 'southern_water.zip',\n", - " 'properties': {\n", - " 'Site_Unit_': assert_unique('name'),\n", - " }\n", - " },\n", - " {\n", - " 'company': 'united_utilities',\n", - " 'filename': 'united_utilities.zip',\n", - " 'properties': {\n", - " 'DA_CODE': assert_unique('reference'),\n", - " },\n", - " 'postprocessor': united_utilities_aggregation,\n", - " },\n", - " {\n", - " 'company': 'welsh_water',\n", - " 'filename': 'welsh_water.zip',\n", - " 'properties': {\n", - " 'CATCHMENT_': assert_unique('name'),\n", - " 'TERMINAL_A': None,\n", - " 'Area_M': None,\n", - " }\n", - " },\n", - " {\n", - " 'company': 'wessex_water',\n", - " 'filename': 'wessex_water.zip',\n", - " 'properties': {\n", - " # There is a shared code because the Corsley Heath catchment comprises\n", - " # two disjoint areas.\n", - " 'SITEID': assert_unique('reference', exceptions=[23390]),\n", - " 'NAME': 'name',\n", - " 'Comment': 
None,\n", - " },\n", - " # Wessex Water provide subcatchments, and we need to aggregate.\n", - " 'postprocessor': lambda x: x.dissolve(by='name').reset_index(),\n", - " },\n", - " {\n", - " 'company': 'yorkshire_water',\n", - " 'filename': 'yorkshire_water.zip/EIR - Wastewater Catchments',\n", - " 'properties':\n", - " {\n", - " 'Company': assert_in('Yorkshire'),\n", - " 'Name': assert_unique('name'),\n", - " }\n", - " },\n", - " {\n", - " 'company': 'south_west_water',\n", - " 'filename': 'south_west_water.shp',\n", - " 'properties': {\n", - " \"ID1\": assert_unique(\"reference\"),\n", - " \"EQUIPMENTN\": None,\n", - " \"EQUIPMENTD\": assert_unique(\"name\"),\n", - " \"CATCHMENT1\": None,\n", - " \"NOTES\": None,\n", - " \"NOTEDATE\": None,\n", - " \"CATCHMENT_\": None,\n", - " },\n", - " },\n", - "]\n", - "assert len(datasets) == 11\n", - "\n", - "# Iterate over all datasets, then over all features in each dataset, and validate the records.\n", - "parts = []\n", - "total = 0\n", - "for dataset in datasets:\n", - " try:\n", - " # Load the data. Probably should've used geopandas rather than the lower-level fiona. But\n", - " # this implementation works for the time being.\n", - " company_catchments = []\n", - " path = os.path.join(root, dataset['filename'])\n", - " if \".zip\" in path:\n", - " path = \"zip://\" + path\n", - " with fiona.open(path) as fp:\n", - " assert fp.crs['init'] == crs, 'dataset is not in the British National Grid projection'\n", - " for feature in tqdm(fp, desc=dataset['company']):\n", - " # Get the properties that we haven't declared and are thus unexpected. 
Then complain\n", - " # if necessary.\n", - " missing = {key: value for key, value in feature['properties'].items()\n", - " if key not in dataset['properties']}\n", - " unexpected = set(feature['properties']) - set(dataset['properties'])\n", - " if unexpected:\n", - " raise KeyError(f'could not remap unexpected fields {unexpected}')\n", - " missing = set(dataset['properties']) - set(feature['properties'])\n", - " if missing:\n", - " raise KeyError(f'missing expected fields {missing}')\n", - "\n", - " # Create a record for the consolidated dataset.\n", - " properties = {\n", - " 'company': dataset['company'],\n", - " 'geometry': sh.geometry.shape(feature['geometry']),\n", - " }\n", - "\n", - " # Process each feature and either store it if desired or apply a function to\n", - " # validate the data.\n", - " for key, value in feature['properties'].items():\n", - " key = dataset['properties'][key]\n", - " if callable(key):\n", - " key(properties, value)\n", - " elif key:\n", - " properties[key] = value\n", - "\n", - " # Check that we have *some* identifier.\n", - " if not any([properties.get('reference'), properties.get('name')]):\n", - " logging.warning(f'{dataset[\"company\"]} has feature without identifiers')\n", - "\n", - " # Store the record for future processing.\n", - " company_catchments.append(properties)\n", - "\n", - " # Package the features in a geopandas frame and apply postprocessing steps if necessary.\n", - " company_catchments = gpd.GeoDataFrame(company_catchments, crs=crs)\n", - " total += len(company_catchments)\n", - " postprocessor = dataset.get('postprocessor')\n", - " if postprocessor:\n", - " company_catchments = postprocessor(company_catchments)\n", - "\n", - " parts.append(company_catchments)\n", - " except Exception as ex:\n", - " raise RuntimeError(f'failed to process {dataset[\"filename\"]}') from ex\n", - "\n", - "catchments_raw = pd.concat(parts, ignore_index=True)\n", - "print(f'loaded {len(catchments_raw)} catchments with {total} 
shapes')\n", - "\n", - "# Filter out catchments for Severn Trent that are odd.\n", - "fltr = (catchments_raw.company == 'severn_trent_water') & (\n", - " (catchments_raw.reference == '0') # These are private treatment works.\n", - " | catchments_raw.reference.str.startswith('TW') # These are from Thames Water ...\n", - " | catchments_raw.reference.str.startswith('AW') # ... Anglian Water ...\n", - " | catchments_raw.reference.str.startswith('UU') # ... United Utilities ...\n", - " | catchments_raw.reference.str.startswith('YW') # ... Yorkshire Water ...\n", - " | catchments_raw.reference.str.startswith('WW') # ... Welsh Water ...\n", - ")\n", - "catchments_raw = catchments_raw[~fltr]\n", - "print(f'dropped {fltr.sum()} treatment works not owned by Severn Trent Water')\n", - "\n", - "# Backfill the name if only the ref is available and vice-versa.\n", - "catchments_raw.name.fillna(catchments_raw.reference, inplace=True)\n", - "catchments_raw.reference.fillna(catchments_raw.name, inplace=True)\n", - "\n", - "# Buffer to make the shapes valid.\n", - "print(f'{100 * catchments_raw.is_valid.mean():.3f}% of catchments are valid')\n", - "catchments_raw['geometry'] = catchments_raw.geometry.buffer(0)\n", - "\n", - "# Show the breakdown by company.\n", - "counts = catchments_raw.groupby('company').count()\n", - "print(counts.sum())\n", - "counts" ] }, { "cell_type": "code", "execution_count": null, "id": "bottom-roads", "metadata": {}, "outputs": [], "source": [ - "# Apply manual fixes to the data.\n", - "update = [\n", - " [('thames_water', 'EAST HYDE STW'),\n", - " {'comment': 'Overlap with the Chalton and Offley catchments from Anglian Water.'}],\n", - " [('thames_water', 'NAGS HEAD LANE STW'),\n", - " {'comment': 'Overlap with the (Shenfield and Hutton) and Upminster catchments from Anglian Water.'}],\n", - " [('thames_water', 'BARKWAY STW'),\n", - " {'comment': 'The catchment has substantial overlap with the Barley catchment from anglian_water.'}],\n", 
- " [('anglian_water', 'Chalton'),\n", - " {'comment': 'Overlap with the East Hyde catchment from Thames Water.'}],\n", - " [('anglian_water', 'Shenfield And Hutton'),\n", - " {'comment': 'Overlap with the Nags Head Lane catchment from Thames Water.'}],\n", - " [('anglian_water', 'Buckminster'),\n", - " {'comment': 'Overlap with the Waltham catchment from Severn Trent Water.'}],\n", - " [('anglian_water', 'Barley'),\n", - " {'comment': 'Overlap with the BARKWAY STW catchment from thames_water.'}],\n", - " [('severn_trent_water', 'CLAY CROSS (WRW)'),\n", - " {'comment': 'Overlap with the Danesmoor catchment from Yorkshire Water.'}],\n", - " [('severn_trent_water', 'WALTHAM (WRW)'),\n", - " {'comment': 'Overlap with the Buckminster catchment from Anglian Water.'}],\n", - " [('severn_trent_water', 'ABBEY LATHE - MALTBY (WRW)'),\n", - " {'comment': 'Overlap with the Aldwarke catchment from Yorkshire Water.'}],\n", - " [('yorkshire_water', 'Danesmoor'),\n", - " {'comment': 'Overlap with the Clay Cross catchment from Severn Trent Water.'}],\n", - " [('yorkshire_water', 'Aldwarke'),\n", - " {'comment': 'Overlap with the Abbey Lathe - Maltby catchment from Severn Trent Water.'}],\n", - " [('severn_trent_water', 'WIGMORE (WRW)'),\n", - " {'comment': 'Overlap with the The Orchards SPS catchment from welsh_water.'}],\n", - " [('welsh_water', 'The Orchards SPS'),\n", - " {'comment': 'Overlap with the WIGMORE (WRW) catchment from severn_trent_water.'}],\n", - " [('united_utilities', 'UU-08-SC42-BROMB'),\n", - " {'comment': 'Overlap with the NESTON catchment from welsh_water.'}],\n", - " [('welsh_water', 'NESTON'),\n", - " {'comment': 'Overlap with the UU-08-SC42-BROMB catchment from united_utilities.'}],\n", - " [('severn_trent_water', 'SCARCLIFFE (WRW)'),\n", - " {'comment': 'Overlap with the Bolsover catchment from yorkshire_water.'}],\n", - " [('yorkshire_water', 'Bolsover'),\n", - " {'comment': 'Overlap with the SCARCLIFFE (WRW) catchment from severn_trent_water.'}],\n", - 
" [('severn_trent_water', 'CLOWNE (WRW)'),\n", - " {'comment': 'Overlap with the Staveley catchment from yorkshire_water.'}],\n", - " [('yorkshire_water', 'Staveley'),\n", - " {'comment': 'Overlap with the CLOWNE (WRW) catchment from severn_trent_water.'}],\n", - " [('southern_water', 'LONGFIELD'),\n", - " {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],\n", - " [('united_utilities', 'UU-05-SC26-COLNE'),\n", - " {'comment': 'Overlap with the Foulridge catchment from yorkshire_water.'}],\n", - " [('yorkshire_water', 'Foulridge'),\n", - " {'comment': 'Overlap with the UU-05-SC26-COLNE catchment from united_utilities.'}],\n", - " [('anglian_water', 'Offley'),\n", - " {'comment': 'Overlap with the EAST HYDE STW catchment from thames_water.'}],\n", - " [('thames_water', 'RIVERSIDE STW'),\n", - " {'comment': 'Overlap with the Upminster catchment from anglian_water.'}],\n", - " [('anglian_water', 'Upminster'),\n", - " {'comment': 'Overlap with the RIVERSIDE STW and NAGS HEAD LANE STW catchments from thames_water.'}],\n", - " [('southern_water', 'REDLYNCH'),\n", - " {'comment': 'Overlap with the DOWNTON STW CATCHMENT catchment from wessex_water.'}],\n", - " [('wessex_water', 'DOWNTON STW CATCHMENT'),\n", - " {'comment': 'Overlap with the REDLYNCH catchment from southern_water.'}],\n", - " [('severn_trent_water', 'COALEY (WRW)'),\n", - " {'comment': 'Overlap with the NORTH NIBLEY STW CATCHMENT catchment from wessex_water.'}],\n", - " [('wessex_water', 'NORTH NIBLEY STW CATCHMENT'),\n", - " {'comment': 'Overlap with the COALEY (WRW) catchment from severn_trent_water.'}],\n", - " [('united_utilities', 'UU-11-SC58-ELLES'),\n", - " {'comment': 'Overlap with the CHESTER catchment from welsh_water.'}],\n", - " [('welsh_water', 'CHESTER'),\n", - " {'comment': 'Overlap with the UU-11-SC58-ELLES and UU-11-SC58-WAVER catchments from united_utilities.'}],\n", - " [('severn_trent_water', 'LYDNEY (WRW)'),\n", - " {'comment': 'Overlap with the NEWLAND 
catchment from welsh_water.'}],\n", - " [('welsh_water', 'NEWLAND'),\n", - " {'comment': 'Overlap with the LYDNEY (WRW) catchment from severn_trent_water.'}],\n", - " [('thames_water', 'GUILDFORD STW'),\n", - " {'comment': 'Overlap with the GUILDFORD catchment from southern_water.'}],\n", - " [('southern_water', 'GUILDFORD'),\n", - " {'comment': 'Overlap with the GUILDFORD STW catchment from thames_water.'}],\n", - " [('severn_trent_water', 'STRONGFORD (WRW)'),\n", - " {'comment': 'Overlap with the UU-11-SC59-MADEL and UU-11-SC60-KIDSG catchments from united_utilities.'}],\n", - " [('united_utilities', 'UU-11-SC59-MADEL'),\n", - " {'comment': 'Overlap with the STRONGFORD (WRW) catchment from severn_trent_water.'}],\n", - " [('southern_water', 'PENNINGTON'),\n", - " {'comment': 'Overlap with the CHRISTCHURCH STW CATCHMENT catchment from wessex_water.'}],\n", - " [('wessex_water', 'CHRISTCHURCH STW CATCHMENT'),\n", - " {'comment': 'Overlap with the PENNINGTON catchment from southern_water.'}],\n", - " [('thames_water', 'LONG REACH STW'),\n", - " {'comment': 'Overlap with the IDE HILL TO THAMES, LONG HILL TO THAMES, LONGFIELD, and NORTHFLEET catchments from southern_water.'}],\n", - " [('southern_water', 'IDE HILL TO THAMES'),\n", - " {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],\n", - " [('united_utilities', 'UU-08-SC41-BIRKE'),\n", - " {'comment': 'Overlap with the HESWALL catchment from welsh_water.'}],\n", - " [('welsh_water', 'HESWALL'),\n", - " {'comment': 'Overlap with the UU-08-SC41-BIRKE catchment from united_utilities.'}],\n", - " [('severn_trent_water', 'BRANTON (WRW)'),\n", - " {'comment': 'Overlap with the Sandall catchment from yorkshire_water.'}],\n", - " [('yorkshire_water', 'Sandall'),\n", - " {'comment': 'Overlap with the BRANTON (WRW) catchment from severn_trent_water.'}],\n", - " [('united_utilities', 'UU-11-SC60-KIDSG'),\n", - " {'comment': 'Overlap with the STRONGFORD (WRW) catchment from 
severn_trent_water.'}],\n", - " [('united_utilities', 'UU-11-SC58-WAVER'),\n", - " {'comment': 'Overlap with the CHESTER catchment from welsh_water.'}],\n", - " [('severn_trent_water', 'DINNINGTON (WRW)'),\n", - " {'comment': 'Overlap with the Woodhouse Mill catchment from yorkshire_water.'}],\n", - " [('yorkshire_water', 'Woodhouse Mill'),\n", - " {'comment': 'Overlap with the DINNINGTON (WRW) catchment from severn_trent_water.'}],\n", - " [('southern_water', 'LONGFIELD HILL TO THAMES'),\n", - " {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],\n", - " [('southern_water', 'NORTHFLEET'),\n", - " {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],\n", - " [('northumbrian_water', 'GUISBOROUGH STW Holding Tanks? NZ60160101'),\n", - " {'comment': 'Overlap with the MARSKE STW NZ62221701 catchment from yorkshire_water. These are only holding tanks.'}],\n", - "]\n", - "\n", - "# Treatment works to remove because they are duplicates. We can check which is the correct company\n", - "# using https://www.water.org.uk/advice-for-customers/find-your-supplier/.\n", - "drop = {\n", - " 'northumbrian_water': [\n", - " # Great Smeaton, DL6 2ET, supplied Yorkshire Water.\n", - " 'GREAT SMEATON STW NZ34043201',\n", - " # Brampton, CA8 2Qg, United Utilities.\n", - " 'TINDALE STW NY61599206',\n", - " ],\n", - " 'thames_water': [\n", - " # Wingrave, HP22 4PS, supplied by Anglian Water.\n", - " 'WINGRAVE STW',\n", - " # Stewkley, LU7 0EL, supplied by Anglian Water.\n", - " 'STEWKLEY STW',\n", - " # RH12 4BB, supplied by Southern Water.\n", - " 'COLGATE STW',\n", - " # Anglian.\n", - " 'DUNSTABLE (AW) STW',\n", - " ],\n", - " 'united_utilities': [\n", - " # Doveholes, SK17 8BL, supplied by Severn Trent.\n", - " 'UU-10-SC54-DOVEH',\n", - " # Various catchments supplied by Severn Trent.\n", - " 'UU-11-SC59-SVNTR',\n", - " ],\n", - " 'severn_trent_water': [\n", - " # Acton Green, SWR6 5AA, supplied by Welsh Water.\n", - " 'ACTON GREEN 
(WRW)',\n", - " # Severn Trent are missing Thealby and Normanby\n", - " # (but the Anglian Water catchment is rather coarse).\n", - " 'BURTON STATHER (WRW)',\n", - " ],\n", - " 'anglian_water': [\n", - " # SG4 7AA, Thames Water.\n", - " 'Weston',\n", - " # CM24 8BE, Thames Water.\n", - " 'Henham',\n", - " # DN9 3ED, Severn Trent.\n", - " 'Ex Raf Finningley',\n", - " # Charndon, Thames.\n", - " 'Chardon',\n", - " # This is technically Anglian, but Thames have better data.\n", - " # Specifically, Anglian are missing Marston St Lawrence.\n", - " 'Halse',\n", - " # Severn Trent have better data. Anglian are missing Bothamsall.\n", - " 'Elkesley',\n", - " # LE15 7PL, Severn Trent.\n", - " 'Market Overton',\n", - " # The Severn Trent data is more refined and actually covers Alkborough.\n", - " 'Alkborough',\n", - " # Covered by Thames Water.\n", - " 'Graveley',\n", - " ],\n", - " 'welsh_water': [\n", - " # SY12 9LD, Severn Trent Water.\n", - " 'DUDLESTON HEATH',\n", - " # sy13 2af, Severn Trent.\n", - " 'ASH MAGNA',\n", - " ],\n", - " 'southern_water': [\n", - " # Thames Water provide the treatment in Crawley.\n", - " 'COPTHORNE',\n", - " # Covered by Thames.\n", - " 'SMALLFIELD',\n", - " 'HASLEMERE',\n", - " 'GREENHYTHE',\n", - " ],\n", - " 'south_west_water': [\n", - " # Overlapping data from Wessex water seems more likely to be correct by manual\n", - " # inspection because the treatment works at 50.863244779522724, -2.753987039083524\n", - " # Likely serves both Mosterton and Chedington.\n", - " 'MOSTERTON_SPS_MOSTERTON',\n", - " ]\n", - "}\n", - "\n", - "catchments = catchments_raw.copy()\n", - "\n", - "# Drop each catchment individually, ensuring that we drop exactly one per name/company.\n", - "for company, names in drop.items():\n", - " for name in names:\n", - " f = (catchments.name == name) & (catchments.company == company)\n", - " assert f.sum() == 1, (company, name, f.sum())\n", - " catchments = catchments[~f]\n", - "print(f'dropped {sum(len(names) for names 
in drop.values())} catchments')\n", - "\n", - "# Update each record in turn, ensuring we only update exactly one per name/company(/reference).\n", - "for key, item in update:\n", - " if len(key) == 2:\n", - " (company, name) = key\n", - " f = (catchments.name == name) \\\n", - " & (catchments.company == company)\n", - " else:\n", - " (company, name, reference) = key\n", - " f = (catchments.name == name) \\\n", - " & (catchments.company == company) \\\n", - " & (catchments.reference == reference)\n", - " assert f.sum() == 1, (company, name, f.sum())\n", - "\n", - " for key, value in item.items():\n", - " catchments.loc[f, key] = value\n", - "\n", - "print(f'updated {len(update)} catchments')\n", - "\n", - "# Show the counts by company after updates.\n", - "counts = catchments.groupby('company').count()\n", - "print(counts.sum())\n", - "counts" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "outside-arabic", - "metadata": {}, - "outputs": [], - "source": [ - "# Show an overview of the data we are now left with.\n", - "ax = catchments.plot(\n", - " column='company',\n", - " categorical=True,\n", - " legend=True,\n", - " legend_kwds={\n", - " 'fontsize': 'x-small',\n", - " 'ncol': 2,\n", - " }\n", - ")\n", - "ax.ticklabel_format(scilimits=(0, 0), useMathText=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "broke-highway", - "metadata": {}, - "outputs": [], - "source": [ - "# Get the intersections of catchments.\n", - "sindex = catchments.geometry.sindex\n", - "indices = sindex.query_bulk(catchments.geometry).T\n", - "print(f'found {len(indices)} intersections')\n", - "\n", - "# Remove self-intersections and only report each intersection once.\n", - "indices = indices[indices[:, 0] < indices[:, 1]]\n", - "print(f'found {len(indices)} intersections without self intersections')\n", - "\n", - "# Calculate the actual intersection areas and remove zero-area intersections.\n", - "i, j = indices.T\n", - "x = 
catchments.iloc[i].geometry.reset_index(drop=True)\n", - "y = catchments.iloc[j].geometry.reset_index(drop=True)\n", - "intersection_areas = x.intersection(y).area.values\n", - "intersection_threshold = 10\n", - "f = intersection_areas > intersection_threshold\n", - "intersection_areas = intersection_areas[f]\n", - "indices = indices[f]\n", - "\n", - "print(f'found {len(indices)} intersections with intersection area > {intersection_threshold}')\n", - "\n", - "# Evaluate the unions.\n", - "i, j = indices.T\n", - "x = catchments.iloc[i].geometry.reset_index(drop=True)\n", - "y = catchments.iloc[j].geometry.reset_index(drop=True)\n", - "union_areas = x.union(y).area.values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "boolean-front", - "metadata": {}, - "outputs": [], - "source": [ - "# Create a data frame that has information about both pairs of an intersection as well as\n", - "# intersection information, such as the area.\n", - "parts = [catchments.iloc[i].copy().reset_index(drop=True), catchments.iloc[j].copy().reset_index(drop=True)]\n", - "for suffix, part in zip(['_x', '_y'], parts):\n", - " part.columns = [col + suffix for col in part]\n", - "overlapping = gpd.GeoDataFrame(pd.concat(parts, axis=1))\n", - "overlapping['intersection_areas'] = intersection_areas\n", - "overlapping['union_areas'] = union_areas\n", - "overlapping['iou'] = overlapping.intersection_areas / overlapping.union_areas\n", - "overlapping = overlapping.sort_values('iou', ascending=False)\n", - "\n", - "# Restrict to settings where the water companies that provided the data differ (e.g. 
to avoid self\n", - "# overlap within subcatchments).\n", - "f = (overlapping.company_y != overlapping.company_x)\n", - "print(f'{f.sum()} overlapping catchments from different companies')\n", - "\n", - "# Filter out the pairs where one or more catchments have an annotation.\n", - "f &= (overlapping.comment_x.isnull() | overlapping.comment_y.isnull())\n", - "overlapping = overlapping.loc[f]\n", - "\n", - "# Manual exclusion of items that have been investigated and considered insignificant.\n", - "ignore = [\n", - " {\n", - " 'company_x': 'severn_trent_water',\n", - " 'name_x': 'private',\n", - " 'company_y': 'welsh_water',\n", - " 'name_y': 'LYDBROOK',\n", - " }\n", - "]\n", - "for item in ignore:\n", - " f = True\n", - " for key, value in item.items():\n", - " f = f & (overlapping[key] == value)\n", - " overlapping = overlapping.loc[~f]\n", - "\n", - "assert len(overlapping) == 0, f'{len(overlapping)} unaccounted overlaps found'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "approximate-suicide", - "metadata": {}, - "outputs": [], - "source": [ - "# Show the breakdown of areas covered by each company in square kilometres.\n", - "areas = catchments.groupby('company').geometry.apply(lambda x: x.area.sum()) / 1e6\n", - "print(f'total area covered: {areas.sum()}')\n", - "areas.round()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "shaped-baking", - "metadata": {}, - "outputs": [], - "source": [ - "# Show an overview figure.\n", - "fig = plt.figure(figsize=(5.8, 5.8))\n", - "gs = fig.add_gridspec(2, 2, width_ratios=[3, 2])\n", - "ax3 = fig.add_subplot(gs[:, 0])\n", - "tolerance = 1000\n", - "countries = gpd.read_file('data/geoportal.statistics.gov.uk/countries20_BGC.zip')\n", - "countries.simplify(tolerance=tolerance).plot(\n", - " ax=ax3, facecolor='none', edgecolor='k', lw=.5)\n", - "simplified = catchments.copy().sort_values(\"company\")\n", - "simplified['geometry'] = simplified.simplify(tolerance=tolerance)\n", - 
"\n", - "# Group the catchments by company and plot them. We could also use the `column`\n", - "# keyword argument, but that is non-trivial with the number of companies we have.\n", - "# This may not be the \"best\" approach, but it's simple.\n", - "\n", - "colors_by_company = {}\n", - "for i, (company, subset) in enumerate(simplified.groupby(\"company\")):\n", - " color = f\"C{i}\" if i < 10 else \"#7FB087\"\n", - " subset.boundary.plot(ax=ax3, zorder=9, lw=.5, color=color)\n", - " colors_by_company[company] = color\n", - "ax3.set_ylim(top=1.43e6)\n", - "\n", - "ax1 = fig.add_subplot(gs[0, 1])\n", - "ax2 = fig.add_subplot(gs[1, 1])\n", - "\n", - "alpha = 0.5\n", - "\n", - "# Easy to tell that they're duplicates.\n", - "colors = ['C0', 'C3']\n", - "subset = catchments_raw.loc[catchments_raw.name.str.lower().str.contains('market overton').fillna(False)]\n", - "subset.plot(ax=ax1, facecolor=colors, alpha=alpha, edgecolor='face')\n", - "handles = [mpl.patches.Rectangle((0, 0), 1, 1, color=color, alpha=alpha)\n", - " for color in colors]\n", - "\n", - "# Hard to tell what's going on.\n", - "colors = ['C5', 'C0']\n", - "subset = catchments_raw.loc[np.in1d(catchments_raw.name, ['EAST HYDE STW', 'Chalton'])]\n", - "subset.plot(ax=ax2, facecolor=colors, alpha=alpha, edgecolor='face')\n", - "\n", - "for ax, label in zip([ax3, ax1, ax2], ['(a)', '(b)', '(c)']):\n", - " ax.set_axis_off()\n", - " ax.text(0.05, 0.05, label, transform=ax.transAxes)\n", - "\n", - "handles_labels = [\n", - " (\n", - " mpl.lines.Line2D([], [], color=value, marker=\"o\", ls=\"none\"),\n", - " key.replace('_', ' ').title().removesuffix(\" Water\"),\n", - " )\n", - " for key, value in colors_by_company.items()\n", - "]\n", - "fig.legend(*zip(*handles_labels), ncol=2, loc='upper left', frameon=False)\n", - "\n", - "fig.tight_layout()\n", - "fig.savefig('overview.pdf')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "hydraulic-contract", - "metadata": {}, - "outputs": [], - 
"source": [ - "# Evaluate unique identifiers for the catchments based on the first ten characters of the sha1 hash\n", - "# of the geometry rounded to the nearest metre. Then save the file.\n", - "def hash_function(geometry):\n", - " coords = geometry.envelope.exterior.coords\n", - " return hashlib.sha1(np.round(coords)).hexdigest()[:10]\n", - "\n", - "identifiers = catchments.geometry.apply(hash_function)\n", - "assert identifiers.nunique() == len(identifiers)\n", - "\n", - "catchments['identifier'] = identifiers\n", - "catchments.to_file(\n", - " 'data/wastewater_catchment_areas_public/catchments_consolidated.shp',\n", - " encoding=\"utf-8\",\n", - ")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/consolidate_catchments.md b/consolidate_catchments.md new file mode 100644 index 0000000..3aa77e3 --- /dev/null +++ b/consolidate_catchments.md @@ -0,0 +1,758 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.16.7 + kernelspec: + display_name: venv + language: python + name: python3 +--- + +```python +import fiona +import os +import logging +from tqdm.notebook import tqdm +import shapely as sh +import shapely.geometry +import geopandas as gpd +import pandas as pd +from matplotlib import pyplot as plt +import matplotlib as mpl +import numpy as np +import collections +import hashlib + +mpl.rcParams['figure.dpi'] = 144 +mpl.style.use('scrartcl.mplstyle') +``` + +```python +# Define helper functions for processing the data. 
+ +def assert_in(*values, key=None): + """ + Assert that the value belongs to a set of values. + """ + def _wrapper(properties, value): + assert value in values, f'{value} is not one of {values}' + if key: + properties[key] = value + return _wrapper + + +def assert_unique(key=None, *, on_error='raise', exceptions=None): + """ + Assert that the value is unique. + """ + values = collections.Counter() + exceptions = exceptions or set() + def _wrapper(properties, value): + if value in values and value not in exceptions: + message = f'`{value}` is not a unique value and has occurred {values[value]} times' + if on_error == 'raise': + raise ValueError(message) + elif on_error == 'warn': + logging.warning(message) + else: + raise NotImplementedError + + values[value] += 1 + if key: + properties[key] = value + return _wrapper +``` + +```python +root = 'data/raw_catchments' +crs = 'epsg:27700' + +# List of recognised catchment owners as supplied by Severn Trent (to make sure there aren't any +# unexpected values). +severn_trent_allowed_owners = [ + 'STW', 'Hafren Dyfrdwy', 'Yorkshire Water', 'Anglian Water', 'Private', 'Welsh Water', + 'United Utilities', 'Thames Water', None, +] + +def united_utilities_aggregation(catchments): + """ + Function to aggregate United Utilities subcatchments into catchments at the treatment work + level. + """ + catchments['reference'] = catchments.reference.str.extract(r'(?PUU-\d+-SC\d+-\w+)') + assert catchments.reference.nunique() < len(catchments) + return catchments.dissolve(by='reference').reset_index() + + +# Declare the schema we expect to see for all the different shapefiles. `None` means we acknowledge +# the field exists but ignore it in processing. We use the `postprocessor` function to apply +# transformations after loading the data, e.g. to aggregate subcatchments. 
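The closure-based validators above can be exercised in isolation. A minimal, self-contained sketch of the `assert_unique` pattern with toy values (the helper name `make_unique_validator` and the names are this sketch's own, not from the datasets):

```python
import collections

def make_unique_validator(key=None, exceptions=None):
    # Simplified version of `assert_unique`: reject repeated values unless
    # they are explicitly exempted; optionally store the value under `key`.
    seen = collections.Counter()
    exceptions = exceptions or set()

    def _validator(properties, value):
        if value in seen and value not in exceptions:
            raise ValueError(f'`{value}` is not unique; seen {seen[value]} times before')
        seen[value] += 1
        if key:
            properties[key] = value

    return _validator

validator = make_unique_validator('name', exceptions={'Wellrig DOA'})
record = {}
validator(record, 'Maxton DOA')   # first occurrence is fine
validator(record, 'Wellrig DOA')  # exempted value ...
validator(record, 'Wellrig DOA')  # ... may repeat without error
try:
    validator(record, 'Maxton DOA')  # a genuine duplicate raises
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Because the counter lives in the closure, one validator instance tracks uniqueness across every feature of a dataset while the loader simply calls it per feature.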
+datasets = [ + { + 'company': 'scottish_water', + 'filename': 'scottish_water.zip/DOAs and WWTWs', + 'properties': { + 'doa_name': assert_unique('name', exceptions=[ + # This occurs twice, once for DOA000038 (which should probably be called Maxton) + # and also for DOA002612 which actually has a road called 'Wellrig'. + 'Wellrig DOA' + ]), + # See postprocessor for exception. + 'doa_refere': assert_unique('reference', exceptions=['DOA000917']), + 'SHAPE_STAr': None, + 'wams_stw_n': None, + }, + # This deals with the one repeated reference in the dataset due to a new development near Dunbar. + 'postprocessor': lambda x: x.dissolve(by='reference').reset_index(), + }, + { + 'company': 'thames_water', + 'filename': 'thames_water.zip', + 'properties': { + # We do not assert uniqueness for Thames Water Catchments because they have + # provided us with detailed subcatchments (which belong to different STWs). + 'STWCODE': 'reference', + 'STWNAME': 'name', + 'SDACFOUL_1': None, + 'RECIEVINGS': None, + 'OBJECTID': None, + 'CREATIONUS': None, + 'DATECREATE': None, + 'DATEMODIFI': None, + 'LASTUSER': None, + 'SHAPEAREA': None, + 'SDACFOULBN': None, + 'DATEPOSTED': None, + 'GLOBALID': None, + 'SHAPE_AREA': None, + 'SHAPE_LEN': None, + }, + 'postprocessor': lambda x: x.dissolve(by='reference').reset_index(), + }, + { + 'company': 'anglian_water', + 'filename': 'anglian_water.zip', + 'properties': { + 'AREANAME': assert_unique('name'), + 'AREASHORTC': assert_unique('reference'), + 'LIQUIDTYPE': None, + 'COLLECTION': None, + 'COLLECTIO1': None, + 'TREATMENTM': None, + 'TREATMENT1': None, + 'NOSINLINES': None, + 'NOSTERMINA': None, + 'NOSPROPS': None, + 'NOSDOMESTI': None, + 'NOSMIXEDPR': None, + 'NOSCOMMPRO': None, + 'HOMEPOPN': None, + 'HOUSEHOLDP': None, + 'DOMESTICPE': None, + 'HOLIDAYPE': None, + 'INSTITUTIO': None, + 'TRADEPE': None, + 'IMPORTPE': None, + 'EXPORTPE': None, + 'NOSSURFACE': None, + 'STWS_TOTNU': None, + 'SEWOUTFALL': None, + 'DATA_CONFI': None, + 'ID': None, + 
} + }, + { + 'company': 'northumbrian_water', + 'filename': 'northumbrian_water.zip/Export', + 'properties': { + 'NAME': assert_unique('name'), + 'ID': assert_unique('reference'), + 'GID': None, + 'APIC_STYLE': None, + 'APIC_SPACE': None, + 'APIC_STATE': None, + 'APIC_CDATE': None, + 'APIC_MDATE': None, + } + }, + { + 'company': 'severn_trent_water', + 'filename': 'severn_trent_water.zip', + 'properties': { + # We allow non-unique names for private treatment works because they're not easily + # identifiable. + 'CATCHMENT1': assert_unique('name', exceptions={ + 'private SPS', 'private water works', 'private', 'private STW', + 'Surface water only', None, + }), + # Owner was useful in the original dataset because it allowed us to identify records + # that weren't actually from Severn Trent Water. We'll have to do that manually now. + # 'Owner': assert_in(*severn_trent_allowed_owners, key='owner'), + 'SU_REFEREN': assert_in('WwTw', 'WwTW', 'ST', None, 'Cess Pit'), + 'SAP_FLOC_I': 'reference', + 'OBJECTID': None, + 'CATCHMENT_': None, + 'CACI_2001': None, + 'CACI_20010': None, + 'CACI_2006': None, + 'CACI_2007': None, + 'CACI_2030': None, + 'PROPERTIES': None, + 'STATUS': None, + 'UPDATED': None, + 'SUGRAN': None, + 'COMMENT': None, + 'SAP_FLOC_D': None, + 'SHAPE_Leng': None, + 'STAR_ID': None, + 'WORKS_CODE': None, + 'WORKS_ID': None, + 'CREATED_DA': None, + 'MODIFIED_D': None, + 'MODIFIED_U': None, + 'CREATED_US': None, + 'JOB_ID': None, + 'QA_STATUS': None, + 'SAP_USER_S': None, + 'SAP_CLASS': None, + 'Shape_STAr': None, + 'Shape_STLe': None, + 'Shape_ST_1': None, + 'Shape_ST_2': None, + }, + }, + { + 'company': 'southern_water', + 'filename': 'southern_water.zip', + 'properties': { + 'Site_Unit_': assert_unique('name'), + } + }, + { + 'company': 'united_utilities', + 'filename': 'united_utilities.zip', + 'properties': { + 'DA_CODE': assert_unique('reference'), + }, + 'postprocessor': united_utilities_aggregation, + }, + { + 'company': 'welsh_water', + 'filename': 
'welsh_water.zip', + 'properties': { + 'CATCHMENT_': assert_unique('name'), + 'TERMINAL_A': None, + 'Area_M': None, + } + }, + { + 'company': 'wessex_water', + 'filename': 'wessex_water.zip', + 'properties': { + # There is a shared code because the Corsley Heath catchment comprises + # two disjoint areas. + 'SITEID': assert_unique('reference', exceptions=[23390]), + 'NAME': 'name', + 'Comment': None, + }, + # Wessex Water provide subcatchments, and we need to aggregate. + 'postprocessor': lambda x: x.dissolve(by='name').reset_index(), + }, + { + 'company': 'yorkshire_water', + 'filename': 'yorkshire_water.zip/EIR - Wastewater Catchments', + 'properties': + { + 'Company': assert_in('Yorkshire'), + 'Name': assert_unique('name'), + } + }, + { + 'company': 'south_west_water', + 'filename': 'south_west_water.shp', + 'properties': { + "ID1": assert_unique("reference"), + "EQUIPMENTN": None, + "EQUIPMENTD": assert_unique("name"), + "CATCHMENT1": None, + "NOTES": None, + "NOTEDATE": None, + "CATCHMENT_": None, + }, + }, +] +assert len(datasets) == 11 + +# Iterate over all datasets, then over all features in each dataset, and validate the records. +parts = [] +total = 0 +for dataset in datasets: + try: + # Load the data. Probably should've used geopandas rather than the lower-level fiona. But + # this implementation works for the time being. + company_catchments = [] + path = os.path.join(root, dataset['filename']) + if ".zip" in path: + path = "zip://" + path + with fiona.open(path) as fp: + assert fp.crs['init'] == crs, 'dataset is not in the British National Grid projection' + for feature in tqdm(fp, desc=dataset['company']): + # Get the properties that we haven't declared and are thus unexpected. Then complain + # if necessary. 
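The schema check that follows boils down to set algebra between the declared fields and a feature's actual properties, plus a remapping pass. A standalone sketch with a hypothetical three-field schema (values are target names, validator callables, or `None` for acknowledged-but-ignored fields, mirroring the `datasets` convention):

```python
# Hypothetical declared schema and feature properties for illustration only.
schema = {'NAME': 'name', 'ID': 'reference', 'SHAPE_AREA': None}
feature_properties = {'NAME': 'Maxton', 'ID': 'DOA000038', 'SHAPE_AREA': 12.5}

# Complain about fields we cannot remap, and about declared fields that are absent.
unexpected = set(feature_properties) - set(schema)
missing = set(schema) - set(feature_properties)
if unexpected:
    raise KeyError(f'could not remap unexpected fields {unexpected}')
if missing:
    raise KeyError(f'missing expected fields {missing}')

# Remap declared fields, dropping the ignored ones.
record = {}
for field, value in feature_properties.items():
    target = schema[field]
    if callable(target):
        target(record, value)
    elif target:
        record[target] = value
```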
+                unexpected = set(feature['properties']) - set(dataset['properties'])
+                if unexpected:
+                    raise KeyError(f'could not remap unexpected fields {unexpected}')
+                missing = set(dataset['properties']) - set(feature['properties'])
+                if missing:
+                    raise KeyError(f'missing expected fields {missing}')
+
+                # Create a record for the consolidated dataset.
+                properties = {
+                    'company': dataset['company'],
+                    'geometry': sh.geometry.shape(feature['geometry']),
+                }
+
+                # Process each field and either store the value under the target key or apply
+                # a validator function to the data.
+                for key, value in feature['properties'].items():
+                    target = dataset['properties'][key]
+                    if callable(target):
+                        target(properties, value)
+                    elif target:
+                        properties[target] = value
+
+                # Check that we have *some* identifier.
+                if not any([properties.get('reference'), properties.get('name')]):
+                    logging.warning(f'{dataset["company"]} has feature without identifiers')
+
+                # Store the record for future processing.
+                company_catchments.append(properties)
+
+        # Package the features in a geopandas frame and apply postprocessing steps if necessary.
+        company_catchments = gpd.GeoDataFrame(company_catchments, crs=crs)
+        total += len(company_catchments)
+        postprocessor = dataset.get('postprocessor')
+        if postprocessor:
+            company_catchments = postprocessor(company_catchments)
+
+        parts.append(company_catchments)
+    except Exception as ex:
+        raise RuntimeError(f'failed to process {dataset["filename"]}') from ex
+
+catchments_raw = pd.concat(parts, ignore_index=True)
+print(f'loaded {len(catchments_raw)} catchments with {total} shapes')
+
+# Filter out Severn Trent catchments that are private or belong to other companies.
+fltr = (catchments_raw.company == 'severn_trent_water') & (
+    (catchments_raw.reference == '0')  # These are private treatment works.
+    | catchments_raw.reference.str.startswith('TW')  # These are from Thames Water ...
+    | catchments_raw.reference.str.startswith('AW')  # ... Anglian Water ...
+    | catchments_raw.reference.str.startswith('UU')  # ... United Utilities ...
+    | catchments_raw.reference.str.startswith('YW')  # ... Yorkshire Water ...
+    | catchments_raw.reference.str.startswith('WW')  # ... Welsh Water ...
+)
+catchments_raw = catchments_raw[~fltr]
+print(f'dropped {fltr.sum()} treatment works not owned by Severn Trent Water')
+
+# Backfill the name if only the reference is available, and vice versa.
+catchments_raw.name.fillna(catchments_raw.reference, inplace=True)
+catchments_raw.reference.fillna(catchments_raw.name, inplace=True)
+
+# Buffer by zero to make invalid shapes valid.
+print(f'{100 * catchments_raw.is_valid.mean():.3f}% of catchments are valid')
+catchments_raw['geometry'] = catchments_raw.geometry.buffer(0)
+
+# Show the breakdown by company.
+counts = catchments_raw.groupby('company').count()
+print(counts.sum())
+counts
+```
+
+```python
+# Apply manual fixes to the data.
+update = [
+    [('thames_water', 'EAST HYDE STW'),
+     {'comment': 'Overlap with the Chalton and Offley catchments from Anglian Water.'}],
+    [('thames_water', 'NAGS HEAD LANE STW'),
+     {'comment': 'Overlap with the (Shenfield and Hutton) and Upminster catchments from Anglian Water.'}],
+    [('thames_water', 'BARKWAY STW'),
+     {'comment': 'The catchment has substantial overlap with the Barley catchment from anglian_water.'}],
+    [('anglian_water', 'Chalton'),
+     {'comment': 'Overlap with the East Hyde catchment from Thames Water.'}],
+    [('anglian_water', 'Shenfield And Hutton'),
+     {'comment': 'Overlap with the Nags Head Lane catchment from Thames Water.'}],
+    [('anglian_water', 'Buckminster'),
+     {'comment': 'Overlap with the Waltham catchment from Severn Trent Water.'}],
+    [('anglian_water', 'Barley'),
+     {'comment': 'Overlap with the BARKWAY STW catchment from thames_water.'}],
+    [('severn_trent_water', 'CLAY CROSS (WRW)'),
+     {'comment': 'Overlap with the Danesmoor catchment from Yorkshire Water.'}],
[('severn_trent_water', 'WALTHAM (WRW)'), + {'comment': 'Overlap with the Buckminster catchment from Anglian Water.'}], + [('severn_trent_water', 'ABBEY LATHE - MALTBY (WRW)'), + {'comment': 'Overlap with the Aldwarke catchment from Yorkshire Water.'}], + [('yorkshire_water', 'Danesmoor'), + {'comment': 'Overlap with the Clay Cross catchment from Severn Trent Water.'}], + [('yorkshire_water', 'Aldwarke'), + {'comment': 'Overlap with the Abbey Lathe - Maltby catchment from Severn Trent Water.'}], + [('severn_trent_water', 'WIGMORE (WRW)'), + {'comment': 'Overlap with the The Orchards SPS catchment from welsh_water.'}], + [('welsh_water', 'The Orchards SPS'), + {'comment': 'Overlap with the WIGMORE (WRW) catchment from severn_trent_water.'}], + [('united_utilities', 'UU-08-SC42-BROMB'), + {'comment': 'Overlap with the NESTON catchment from welsh_water.'}], + [('welsh_water', 'NESTON'), + {'comment': 'Overlap with the UU-08-SC42-BROMB catchment from united_utilities.'}], + [('severn_trent_water', 'SCARCLIFFE (WRW)'), + {'comment': 'Overlap with the Bolsover catchment from yorkshire_water.'}], + [('yorkshire_water', 'Bolsover'), + {'comment': 'Overlap with the SCARCLIFFE (WRW) catchment from severn_trent_water.'}], + [('severn_trent_water', 'CLOWNE (WRW)'), + {'comment': 'Overlap with the Staveley catchment from yorkshire_water.'}], + [('yorkshire_water', 'Staveley'), + {'comment': 'Overlap with the CLOWNE (WRW) catchment from severn_trent_water.'}], + [('southern_water', 'LONGFIELD'), + {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}], + [('united_utilities', 'UU-05-SC26-COLNE'), + {'comment': 'Overlap with the Foulridge catchment from yorkshire_water.'}], + [('yorkshire_water', 'Foulridge'), + {'comment': 'Overlap with the UU-05-SC26-COLNE catchment from united_utilities.'}], + [('anglian_water', 'Offley'), + {'comment': 'Overlap with the EAST HYDE STW catchment from thames_water.'}], + [('thames_water', 'RIVERSIDE STW'), + {'comment': 
'Overlap with the Upminster catchment from anglian_water.'}], + [('anglian_water', 'Upminster'), + {'comment': 'Overlap with the RIVERSIDE STW and NAGS HEAD LANE STW catchments from thames_water.'}], + [('southern_water', 'REDLYNCH'), + {'comment': 'Overlap with the DOWNTON STW CATCHMENT catchment from wessex_water.'}], + [('wessex_water', 'DOWNTON STW CATCHMENT'), + {'comment': 'Overlap with the REDLYNCH catchment from southern_water.'}], + [('severn_trent_water', 'COALEY (WRW)'), + {'comment': 'Overlap with the NORTH NIBLEY STW CATCHMENT catchment from wessex_water.'}], + [('wessex_water', 'NORTH NIBLEY STW CATCHMENT'), + {'comment': 'Overlap with the COALEY (WRW) catchment from severn_trent_water.'}], + [('united_utilities', 'UU-11-SC58-ELLES'), + {'comment': 'Overlap with the CHESTER catchment from welsh_water.'}], + [('welsh_water', 'CHESTER'), + {'comment': 'Overlap with the UU-11-SC58-ELLES and UU-11-SC58-WAVER catchments from united_utilities.'}], + [('severn_trent_water', 'LYDNEY (WRW)'), + {'comment': 'Overlap with the NEWLAND catchment from welsh_water.'}], + [('welsh_water', 'NEWLAND'), + {'comment': 'Overlap with the LYDNEY (WRW) catchment from severn_trent_water.'}], + [('thames_water', 'GUILDFORD STW'), + {'comment': 'Overlap with the GUILDFORD catchment from southern_water.'}], + [('southern_water', 'GUILDFORD'), + {'comment': 'Overlap with the GUILDFORD STW catchment from thames_water.'}], + [('severn_trent_water', 'STRONGFORD (WRW)'), + {'comment': 'Overlap with the UU-11-SC59-MADEL and UU-11-SC60-KIDSG catchments from united_utilities.'}], + [('united_utilities', 'UU-11-SC59-MADEL'), + {'comment': 'Overlap with the STRONGFORD (WRW) catchment from severn_trent_water.'}], + [('southern_water', 'PENNINGTON'), + {'comment': 'Overlap with the CHRISTCHURCH STW CATCHMENT catchment from wessex_water.'}], + [('wessex_water', 'CHRISTCHURCH STW CATCHMENT'), + {'comment': 'Overlap with the PENNINGTON catchment from southern_water.'}], + [('thames_water', 
'LONG REACH STW'),
+     {'comment': 'Overlap with the IDE HILL TO THAMES, LONGFIELD HILL TO THAMES, LONGFIELD, and NORTHFLEET catchments from southern_water.'}],
+    [('southern_water', 'IDE HILL TO THAMES'),
+     {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],
+    [('united_utilities', 'UU-08-SC41-BIRKE'),
+     {'comment': 'Overlap with the HESWALL catchment from welsh_water.'}],
+    [('welsh_water', 'HESWALL'),
+     {'comment': 'Overlap with the UU-08-SC41-BIRKE catchment from united_utilities.'}],
+    [('severn_trent_water', 'BRANTON (WRW)'),
+     {'comment': 'Overlap with the Sandall catchment from yorkshire_water.'}],
+    [('yorkshire_water', 'Sandall'),
+     {'comment': 'Overlap with the BRANTON (WRW) catchment from severn_trent_water.'}],
+    [('united_utilities', 'UU-11-SC60-KIDSG'),
+     {'comment': 'Overlap with the STRONGFORD (WRW) catchment from severn_trent_water.'}],
+    [('united_utilities', 'UU-11-SC58-WAVER'),
+     {'comment': 'Overlap with the CHESTER catchment from welsh_water.'}],
+    [('severn_trent_water', 'DINNINGTON (WRW)'),
+     {'comment': 'Overlap with the Woodhouse Mill catchment from yorkshire_water.'}],
+    [('yorkshire_water', 'Woodhouse Mill'),
+     {'comment': 'Overlap with the DINNINGTON (WRW) catchment from severn_trent_water.'}],
+    [('southern_water', 'LONGFIELD HILL TO THAMES'),
+     {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],
+    [('southern_water', 'NORTHFLEET'),
+     {'comment': 'Overlap with the LONG REACH STW catchment from thames_water.'}],
+    [('northumbrian_water', 'GUISBOROUGH STW Holding Tanks? NZ60160101'),
+     {'comment': 'Overlap with the MARSKE STW NZ62221701 catchment from yorkshire_water. These are only holding tanks.'}],
+]
+
+# Treatment works to remove because they are duplicates. We can check which is the correct company
+# using https://www.water.org.uk/advice-for-customers/find-your-supplier/.
+drop = {
+    'northumbrian_water': [
+        # Great Smeaton, DL6 2ET, supplied by Yorkshire Water.
+        'GREAT SMEATON STW NZ34043201',
+        # Brampton, CA8 2QG, United Utilities.
+        'TINDALE STW NY61599206',
+    ],
+    'thames_water': [
+        # Wingrave, HP22 4PS, supplied by Anglian Water.
+        'WINGRAVE STW',
+        # Stewkley, LU7 0EL, supplied by Anglian Water.
+        'STEWKLEY STW',
+        # RH12 4BB, supplied by Southern Water.
+        'COLGATE STW',
+        # Anglian.
+        'DUNSTABLE (AW) STW',
+    ],
+    'united_utilities': [
+        # Doveholes, SK17 8BL, supplied by Severn Trent.
+        'UU-10-SC54-DOVEH',
+        # Various catchments supplied by Severn Trent.
+        'UU-11-SC59-SVNTR',
+    ],
+    'severn_trent_water': [
+        # Acton Green, SWR6 5AA, supplied by Welsh Water.
+        'ACTON GREEN (WRW)',
+        # Severn Trent are missing Thealby and Normanby
+        # (but the Anglian Water catchment is rather coarse).
+        'BURTON STATHER (WRW)',
+    ],
+    'anglian_water': [
+        # SG4 7AA, Thames Water.
+        'Weston',
+        # CM24 8BE, Thames Water.
+        'Henham',
+        # DN9 3ED, Severn Trent.
+        'Ex Raf Finningley',
+        # Charndon, Thames.
+        'Chardon',
+        # This is technically Anglian, but Thames have better data.
+        # Specifically, Anglian are missing Marston St Lawrence.
+        'Halse',
+        # Severn Trent have better data. Anglian are missing Bothamsall.
+        'Elkesley',
+        # LE15 7PL, Severn Trent.
+        'Market Overton',
+        # The Severn Trent data is more refined and actually covers Alkborough.
+        'Alkborough',
+        # Covered by Thames Water.
+        'Graveley',
+    ],
+    'welsh_water': [
+        # SY12 9LD, Severn Trent Water.
+        'DUDLESTON HEATH',
+        # SY13 2AF, Severn Trent.
+        'ASH MAGNA',
+    ],
+    'southern_water': [
+        # Thames Water provide the treatment in Crawley.
+        'COPTHORNE',
+        # Covered by Thames.
+        'SMALLFIELD',
+        'HASLEMERE',
+        'GREENHYTHE',
+    ],
+    'south_west_water': [
+        # Overlapping data from Wessex Water seems more likely to be correct: by manual
+        # inspection, the treatment works at 50.863244779522724, -2.753987039083524
+        # likely serves both Mosterton and Chedington.
+ 'MOSTERTON_SPS_MOSTERTON', + ] +} + +catchments = catchments_raw.copy() + +# Drop each catchment individually, ensuring that we drop exactly one per name/company. +for company, names in drop.items(): + for name in names: + f = (catchments.name == name) & (catchments.company == company) + assert f.sum() == 1, (company, name, f.sum()) + catchments = catchments[~f] +print(f'dropped {sum(len(names) for names in drop.values())} catchments') + +# Update each record in turn, ensuring we only update exactly one per name/company(/reference). +for key, item in update: + if len(key) == 2: + (company, name) = key + f = (catchments.name == name) \ + & (catchments.company == company) + else: + (company, name, reference) = key + f = (catchments.name == name) \ + & (catchments.company == company) \ + & (catchments.reference == reference) + assert f.sum() == 1, (company, name, f.sum()) + + for key, value in item.items(): + catchments.loc[f, key] = value + +print(f'updated {len(update)} catchments') + +# Show the counts by company after updates. +counts = catchments.groupby('company').count() +print(counts.sum()) +counts +``` + +```python +# Show an overview of the data we are now left with. +ax = catchments.plot( + column='company', + categorical=True, + legend=True, + legend_kwds={ + 'fontsize': 'x-small', + 'ncol': 2, + } +) +ax.ticklabel_format(scilimits=(0, 0), useMathText=True) +``` + +```python +# Get the intersections of catchments. +sindex = catchments.geometry.sindex +indices = sindex.query_bulk(catchments.geometry).T +print(f'found {len(indices)} intersections') + +# Remove self-intersections and only report each intersection once. +indices = indices[indices[:, 0] < indices[:, 1]] +print(f'found {len(indices)} intersections without self intersections') + +# Calculate the actual intersection areas and remove zero-area intersections. 
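The spatial index query above returns candidate pairs whose bounding boxes touch, and the `i < j` filter then keeps each unordered pair once while discarding self-matches. The same logic on plain bounding boxes, as a toy stand-in for the R-tree:

```python
# Toy bounding boxes as (xmin, ymin, xmax, ymax); real geometries go through
# the geopandas spatial index instead.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]

def bboxes_intersect(a, b):
    # Boxes intersect iff they overlap on both axes.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# All candidate index pairs, including self-matches and both orderings.
candidates = [(i, j) for i, a in enumerate(boxes)
              for j, b in enumerate(boxes) if bboxes_intersect(a, b)]

# Keep each unordered pair once and drop self-intersections.
pairs = [(i, j) for i, j in candidates if i < j]
```

Here the three self-matches and the two orderings of the one overlapping pair collapse to the single pair `(0, 1)`.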
+i, j = indices.T +x = catchments.iloc[i].geometry.reset_index(drop=True) +y = catchments.iloc[j].geometry.reset_index(drop=True) +intersection_areas = x.intersection(y).area.values +intersection_threshold = 10 +f = intersection_areas > intersection_threshold +intersection_areas = intersection_areas[f] +indices = indices[f] + +print(f'found {len(indices)} intersections with intersection area > {intersection_threshold}') + +# Evaluate the unions. +i, j = indices.T +x = catchments.iloc[i].geometry.reset_index(drop=True) +y = catchments.iloc[j].geometry.reset_index(drop=True) +union_areas = x.union(y).area.values +``` + +```python +# Create a data frame that has information about both pairs of an intersection as well as +# intersection information, such as the area. +parts = [catchments.iloc[i].copy().reset_index(drop=True), catchments.iloc[j].copy().reset_index(drop=True)] +for suffix, part in zip(['_x', '_y'], parts): + part.columns = [col + suffix for col in part] +overlapping = gpd.GeoDataFrame(pd.concat(parts, axis=1)) +overlapping['intersection_areas'] = intersection_areas +overlapping['union_areas'] = union_areas +overlapping['iou'] = overlapping.intersection_areas / overlapping.union_areas +overlapping = overlapping.sort_values('iou', ascending=False) + +# Restrict to settings where the water companies that provided the data differ (e.g. to avoid self +# overlap within subcatchments). +f = (overlapping.company_y != overlapping.company_x) +print(f'{f.sum()} overlapping catchments from different companies') + +# Filter out the pairs where one or more catchments have an annotation. +f &= (overlapping.comment_x.isnull() | overlapping.comment_y.isnull()) +overlapping = overlapping.loc[f] + +# Manual exclusion of items that have been investigated and considered insignificant. 
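The intersection-over-union score computed above is easiest to sanity-check on axis-aligned rectangles, where the areas can be worked out by hand. A small sketch independent of shapely (`rect_iou` is this sketch's own helper):

```python
def overlap_1d(a0, a1, b0, b1):
    # Length of the overlap between intervals [a0, a1] and [b0, b1].
    return max(0.0, min(a1, b1) - max(a0, b0))

def rect_iou(a, b):
    # Rectangles as (xmin, ymin, xmax, ymax).
    inter = overlap_1d(a[0], a[2], b[0], b[2]) * overlap_1d(a[1], a[3], b[1], b[3])
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union

# Two 2x2 squares offset by one unit: intersection 1, union 4 + 4 - 1 = 7.
score = rect_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Identical shapes score 1 and disjoint shapes score 0, which is why sorting by `iou` descending surfaces near-duplicate catchments first.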
+ignore = [ + { + 'company_x': 'severn_trent_water', + 'name_x': 'private', + 'company_y': 'welsh_water', + 'name_y': 'LYDBROOK', + } +] +for item in ignore: + f = True + for key, value in item.items(): + f = f & (overlapping[key] == value) + overlapping = overlapping.loc[~f] + +assert len(overlapping) == 0, f'{len(overlapping)} unaccounted overlaps found' +``` + +```python +# Show the breakdown of areas covered by each company in square kilometres. +areas = catchments.groupby('company').geometry.apply(lambda x: x.area.sum()) / 1e6 +print(f'total area covered: {areas.sum()}') +areas.round() +``` + +```python +# Show an overview figure. +fig = plt.figure(figsize=(5.8, 5.8)) +gs = fig.add_gridspec(2, 2, width_ratios=[3, 2]) +ax3 = fig.add_subplot(gs[:, 0]) +tolerance = 1000 +countries = gpd.read_file('data/geoportal.statistics.gov.uk/countries20_BGC.zip') +countries.simplify(tolerance=tolerance).plot( + ax=ax3, facecolor='none', edgecolor='k', lw=.5) +simplified = catchments.copy().sort_values("company") +simplified['geometry'] = simplified.simplify(tolerance=tolerance) + +# Group the catchments by company and plot them. We could also use the `column` +# keyword argument, but that is non-trivial with the number of companies we have. +# This may not be the "best" approach, but it's simple. + +colors_by_company = {} +for i, (company, subset) in enumerate(simplified.groupby("company")): + color = f"C{i}" if i < 10 else "#7FB087" + subset.boundary.plot(ax=ax3, zorder=9, lw=.5, color=color) + colors_by_company[company] = color +ax3.set_ylim(top=1.43e6) + +ax1 = fig.add_subplot(gs[0, 1]) +ax2 = fig.add_subplot(gs[1, 1]) + +alpha = 0.5 + +# Easy to tell that they're duplicates. 
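The manual-exclusion loop above builds a boolean mask per ignore entry by AND-ing one equality test per column. The same pattern on a tiny made-up frame, starting from an all-`True` Series rather than the scalar `True` (a stylistic variant, not the notebook's exact code):

```python
import pandas as pd

# Two hypothetical overlap records; only the first matches the ignore entry.
overlapping = pd.DataFrame({
    'company_x': ['severn_trent_water', 'thames_water'],
    'name_x': ['private', 'EAST HYDE STW'],
})
ignore = [{'company_x': 'severn_trent_water', 'name_x': 'private'}]

for item in ignore:
    # Start with everything selected, then require every column to match.
    mask = pd.Series(True, index=overlapping.index)
    for column, value in item.items():
        mask &= overlapping[column] == value
    overlapping = overlapping.loc[~mask]
```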
+colors = ['C0', 'C3']
+subset = catchments_raw.loc[catchments_raw.name.str.lower().str.contains('market overton').fillna(False)]
+subset.plot(ax=ax1, facecolor=colors, alpha=alpha, edgecolor='face')
+
+# Hard to tell what's going on.
+colors = ['C5', 'C0']
+subset = catchments_raw.loc[np.in1d(catchments_raw.name, ['EAST HYDE STW', 'Chalton'])]
+subset.plot(ax=ax2, facecolor=colors, alpha=alpha, edgecolor='face')
+
+for ax, label in zip([ax3, ax1, ax2], ['(a)', '(b)', '(c)']):
+    ax.set_axis_off()
+    ax.text(0.05, 0.05, label, transform=ax.transAxes)
+
+handles_labels = [
+    (
+        mpl.lines.Line2D([], [], color=value, marker="o", ls="none"),
+        key.replace('_', ' ').title().removesuffix(" Water"),
+    )
+    for key, value in colors_by_company.items()
+]
+fig.legend(*zip(*handles_labels), ncol=2, loc='upper left', frameon=False)
+
+fig.tight_layout()
+fig.savefig('overview.pdf')
+```
+
+```python
+# Evaluate unique identifiers for the catchments based on the first ten characters of the sha1 hash
+# of the geometry rounded to the nearest metre. Then save the file.
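The identifier scheme described in the comment above — a sha1 over metre-rounded coordinates, truncated to ten hex characters — can be sketched without shapely. `geometry_identifier` and the explicit `.tobytes()` call are this sketch's own choices (the notebook hashes the rounded array's buffer directly):

```python
import hashlib
import numpy as np

def geometry_identifier(coords):
    # Round coordinates to the nearest metre so tiny numerical jitter does
    # not change the identifier, then hash the raw bytes of the array.
    rounded = np.round(np.asarray(coords, dtype=float))
    return hashlib.sha1(rounded.tobytes()).hexdigest()[:10]

# Sub-metre perturbations of the same shape map to the same identifier.
a = geometry_identifier([(0.2, 0.4), (10.1, 5.9)])
b = geometry_identifier([(0.4, 0.1), (9.8, 6.2)])
```

This is what makes the identifiers stable across data releases: they only change if the rounded geometry itself changes.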
+def hash_function(geometry): + coords = geometry.envelope.exterior.coords + return hashlib.sha1(np.round(coords)).hexdigest()[:10] + +identifiers = catchments.geometry.apply(hash_function) +assert identifiers.nunique() == len(identifiers) + +catchments['identifier'] = identifiers +catchments.to_file( + 'data/wastewater_catchment_areas_public/catchments_consolidated.shp', + encoding="utf-8", +) +``` diff --git a/consolidate_waterbase.ipynb b/consolidate_waterbase.ipynb deleted file mode 100644 index a6eaf2b..0000000 --- a/consolidate_waterbase.ipynb +++ /dev/null @@ -1,300 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "living-wagon", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from matplotlib import pyplot as plt\n", - "import matplotlib as mpl\n", - "import pandas as pd\n", - "import pathlib\n", - "import chardet\n", - "mpl.rcParams['figure.dpi'] = 144" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "equal-watch", - "metadata": {}, - "outputs": [], - "source": [ - "# Filenames for each version because there's no consistent naming\n", - "# convention.\n", - "filenames = {\n", - " 1: 'T_UWWTPS.csv',\n", - " 2: 'T_UWWTPS.csv',\n", - " 3: 'T_UWWTPS.csv',\n", - " 4: 'T_UWWTPS.csv',\n", - " 5: 'T_UWWTPs.csv',\n", - " 6: 'dbo.VL_UWWTPS.csv',\n", - " 7: 'UWWTPS.csv',\n", - " 8: 'UWWTPS.csv',\n", - "}\n", - "reporting_period_filenames = {\n", - " 1: 'T_ReportPeriod.csv',\n", - " 2: 'T_ReportPeriod.csv',\n", - " 3: 'T_ReportPeriod.csv',\n", - " 4: 'T_ReportPeriod.csv',\n", - " 5: 'T_ReportPeriod.csv',\n", - " 6: 'dbo.VL_ReportPeriod.csv',\n", - " 7: 'ReportPeriod.csv',\n", - " 8: 'ReportPeriod.csv',\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "balanced-programming", - "metadata": {}, - "outputs": [], - "source": [ - "# The column names we're interested in.\n", - "state_keys = ['UK', 'GB']\n", - "columns = ['uwwCode', 'uwwName', 
'uwwCapacity', 'uwwLoadEnteringUWWTP', \n", - " 'rptMStateKey', 'uwwLatitude', 'uwwLongitude', 'uwwState']\n", - "\n", - "# Load the data and stuff it all into one data frame.\n", - "parts = []\n", - "for version, filename in filenames.items():\n", - " try:\n", - " folder = pathlib.Path(f'data/eea.europa.eu/waterbase_v{version}_csv')\n", - " \n", - " # Detect the encoding.\n", - " path = folder / filename\n", - " with open(path, 'rb') as fp:\n", - " text = fp.read()\n", - " encoding = chardet.detect(text)\n", - "\n", - " # Load the data and filter to the UK.\n", - " part = pd.read_csv(path, usecols=columns, encoding=encoding.get('encoding'))\n", - " part = part[np.in1d(part.rptMStateKey, state_keys)]\n", - " # Store the version.\n", - " part['version'] = version\n", - " # Strip leading and trailing whitespace from names.\n", - " part['uwwName'] = part.uwwName.str.strip()\n", - "\n", - " # Load information on the reporting period.\n", - " report_period = pd.read_csv(folder / reporting_period_filenames[version])\n", - " report_period = report_period.set_index('rptMStateKey').repReportedPeriod.to_dict()\n", - " year, = [report_period[key] for key in state_keys if key in report_period]\n", - "\n", - " print(version, year, len(part))\n", - " part['year'] = year\n", - " # Store the data\n", - " parts.append(part)\n", - " except Exception as ex:\n", - " raise RuntimeError(f'failed to process {path}') from ex\n", - " \n", - "data = pd.concat(parts)\n", - "# Remove random characters (non-breaking space in latin-1 and a missing character.\n", - "data['uwwName'] = data.uwwName.str.replace('\\xa0', ' ').str.replace('�', '')\n", - "# Recode the state (two different values).\n", - "# https://dd.eionet.europa.eu/dataelements/99468\n", - "# https://www.eea.europa.eu/data-and-maps/data/waterbase-uwwtd-urban-waste-water-treatment-directive-7\n", - "state_mapping = {\n", - " 0: 'inactive',\n", - " 1: 'active',\n", - " 2: 'temporary inactive',\n", - " 105: 'inactive',\n", - " 106: 
'active',\n", - "}\n", - "data['uwwState'] = data.uwwState.apply(lambda x: state_mapping[x])\n", - "data.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "expanded-bosnia", - "metadata": {}, - "outputs": [], - "source": [ - "# Show the year number as a function of the version.\n", - "versions = data[['version', 'year']].drop_duplicates()\n", - "fig, ax = plt.subplots()\n", - "ax.plot(versions.version, versions.year, marker='o')\n", - "ax.set_xlabel('Version')\n", - "ax.set_ylabel('Year')\n", - "fig.tight_layout()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "worst-reading", - "metadata": {}, - "outputs": [], - "source": [ - "# Check that all data are the same in versions 3 and 4.\n", - "versions = [3, 4]\n", - "subsets = [data[data.version == version]\n", - " .drop('version', axis=1)\n", - " .sort_values('uwwCode')\n", - " .reset_index(drop=True)\n", - " for version in versions]\n", - "x, y = subsets\n", - "fltr = ((x == y) | (x.isnull() & y.isnull())).all(axis=1)\n", - "assert all(fltr)\n", - "\n", - "# Then drop version 3.\n", - "cleaned = data[data.version != 3].copy()\n", - "\n", - "# Fix a data error for Davyhulme that's got an order of magnitude error because there's an extra 1 prefixed.\n", - "cleaned.loc[cleaned.uwwCode == 'UKENNW_UU_TP000042', 'uwwCapacity'] = np.minimum(\n", - " 1206250, cleaned.loc[cleaned.uwwCode == 'UKENNW_UU_TP000042', 'uwwCapacity']\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "collected-cooper", - "metadata": {}, - "outputs": [], - "source": [ - "# Show number of data points over time.\n", - "counts = cleaned.groupby('year').count()\n", - "fig, ax = plt.subplots()\n", - "ax.plot(counts.index, counts.uwwCapacity, label='capacity')\n", - "ax.plot(counts.index, counts.uwwLoadEnteringUWWTP, label='load entering')\n", - "ax.legend()\n", - "ax.set_ylabel('Number of data points')\n", - "ax.set_xlabel(counts.index.name)\n", - "fig.tight_layout()" - 
] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ambient-daisy", - "metadata": {}, - "outputs": [], - "source": [ - "# Show capacity and load for Mogden.\n", - "code = 'UKENTH_TWU_TP000113'\n", - "subset = cleaned[cleaned.uwwCode == code]\n", - "kwargs = {\n", - " 'marker': 'o',\n", - " 'alpha': 0.5,\n", - "}\n", - "\n", - "fig, ax = plt.subplots()\n", - "ax.plot(subset.year, subset.uwwCapacity, label='capacity', **kwargs)\n", - "ax.plot(subset.year, subset.uwwLoadEnteringUWWTP, label='load entering', **kwargs)\n", - "ax.set_xlabel('version')\n", - "ax.set_ylabel('people equivalent')\n", - "ax.legend()\n", - "fig.tight_layout()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "persistent-benjamin", - "metadata": {}, - "outputs": [], - "source": [ - "# Save the data.\n", - "cleaned.to_csv('data/wastewater_catchment_areas_public/waterbase_consolidated.csv', index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "about-contemporary", - "metadata": {}, - "outputs": [], - "source": [ - "# Show availability of treatmentworks over time.\n", - "capacityAvailability = cleaned.set_index(['uwwCode', 'year']).uwwCapacity.unstack()\n", - "loadAvailability = cleaned.set_index(['uwwCode', 'year']).uwwLoadEnteringUWWTP.unstack()\n", - "fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)\n", - "kwargs = {\n", - " 'aspect': 'auto',\n", - "}\n", - "idx = np.argsort(loadAvailability.isnull().sum(axis=1))\n", - "ax1.imshow(~capacityAvailability.isnull().values[idx], **kwargs)\n", - "ax2.imshow(~loadAvailability.isnull().values[idx], **kwargs)\n", - "ax1.set_title('Capacity availability')\n", - "ax2.set_title('Load availability')\n", - "loadAvailability.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "southern-nutrition", - "metadata": {}, - "outputs": [], - "source": [ - "# Show the cumulative distribution function of treatment work capacities.\n", - "fig, ax = 
plt.subplots()\n", - "for key in ['uwwCapacity', 'uwwLoadEnteringUWWTP']:\n", - " values = cleaned.groupby('uwwCode')[key].max().sort_values()\n", - " ax.plot(values, (np.arange(len(values)) + 1) / len(values), label=key)\n", - "ax.set_xscale('log')\n", - "ax.legend()\n", - "fig.tight_layout()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "graduate-sociology", - "metadata": {}, - "outputs": [], - "source": [ - "# Show the scatter of capacity and load coloured by year.\n", - "fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [3, 2]})\n", - "ax = ax1\n", - "fltr = (cleaned.uwwCapacity > 0) & (cleaned.uwwLoadEnteringUWWTP > 0)\n", - "subset = cleaned[fltr]\n", - "mm = subset.uwwCapacity.min(), subset.uwwCapacity.max()\n", - "ax.scatter(subset.uwwCapacity, subset.uwwLoadEnteringUWWTP, c=subset.year, marker='.')\n", - "ax.plot(mm, mm, color='k', ls=':')\n", - "ax.set_yscale('log')\n", - "ax.set_xscale('log')\n", - "ax.set_aspect('equal')\n", - "ax.set_xlabel('Capacity')\n", - "ax.set_ylabel('Load entering')\n", - "\n", - "above_capacity = cleaned.groupby('year').apply(lambda x: (x.uwwLoadEnteringUWWTP > x.uwwCapacity).mean())\n", - "above_or_at_capacity = cleaned.groupby('year').apply(lambda x: (x.uwwLoadEnteringUWWTP >= x.uwwCapacity).mean())\n", - "ax = ax2\n", - "ax.plot(above_capacity.index, above_capacity, marker='.', label='above capacity')\n", - "ax.plot(above_or_at_capacity.index, above_or_at_capacity, marker='.', label='above or at capacity')\n", - "ax.set_ylabel('Fraction of treatment works')\n", - "ax.set_xlabel('Year')\n", - "ax.set_yscale('log')\n", - "fig.tight_layout()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - 
"pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/consolidate_waterbase.md b/consolidate_waterbase.md new file mode 100644 index 0000000..9f2f955 --- /dev/null +++ b/consolidate_waterbase.md @@ -0,0 +1,222 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.16.7 + kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +```python +import numpy as np +from matplotlib import pyplot as plt +import matplotlib as mpl +import pandas as pd +import pathlib +import chardet +mpl.rcParams['figure.dpi'] = 144 +``` + +```python +# Filenames for each version because there's no consistent naming +# convention. +filenames = { + 1: 'T_UWWTPS.csv', + 2: 'T_UWWTPS.csv', + 3: 'T_UWWTPS.csv', + 4: 'T_UWWTPS.csv', + 5: 'T_UWWTPs.csv', + 6: 'dbo.VL_UWWTPS.csv', + 7: 'UWWTPS.csv', + 8: 'UWWTPS.csv', +} +reporting_period_filenames = { + 1: 'T_ReportPeriod.csv', + 2: 'T_ReportPeriod.csv', + 3: 'T_ReportPeriod.csv', + 4: 'T_ReportPeriod.csv', + 5: 'T_ReportPeriod.csv', + 6: 'dbo.VL_ReportPeriod.csv', + 7: 'ReportPeriod.csv', + 8: 'ReportPeriod.csv', +} +``` + +```python +# The column names we're interested in. +state_keys = ['UK', 'GB'] +columns = ['uwwCode', 'uwwName', 'uwwCapacity', 'uwwLoadEnteringUWWTP', + 'rptMStateKey', 'uwwLatitude', 'uwwLongitude', 'uwwState'] + +# Load the data and stuff it all into one data frame. +parts = [] +for version, filename in filenames.items(): + try: + folder = pathlib.Path(f'data/eea.europa.eu/waterbase_v{version}_csv') + + # Detect the encoding. + path = folder / filename + with open(path, 'rb') as fp: + text = fp.read() + encoding = chardet.detect(text) + + # Load the data and filter to the UK. + part = pd.read_csv(path, usecols=columns, encoding=encoding.get('encoding')) + part = part[np.in1d(part.rptMStateKey, state_keys)] + # Store the version. 
+        part['version'] = version
+        # Strip leading and trailing whitespace from names.
+        part['uwwName'] = part.uwwName.str.strip()
+
+        # Load information on the reporting period.
+        report_period = pd.read_csv(folder / reporting_period_filenames[version])
+        report_period = report_period.set_index('rptMStateKey').repReportedPeriod.to_dict()
+        year, = [report_period[key] for key in state_keys if key in report_period]
+
+        print(version, year, len(part))
+        part['year'] = year
+        # Store the data.
+        parts.append(part)
+    except Exception as ex:
+        raise RuntimeError(f'failed to process {path}') from ex
+
+data = pd.concat(parts)
+# Remove stray characters (a latin-1 non-breaking space and a replacement character).
+data['uwwName'] = data.uwwName.str.replace('\xa0', ' ').str.replace('�', '')
+# Recode the state (two different coding schemes are in use).
+# https://dd.eionet.europa.eu/dataelements/99468
+# https://www.eea.europa.eu/data-and-maps/data/waterbase-uwwtd-urban-waste-water-treatment-directive-7
+state_mapping = {
+    0: 'inactive',
+    1: 'active',
+    2: 'temporary inactive',
+    105: 'inactive',
+    106: 'active',
+}
+data['uwwState'] = data.uwwState.apply(lambda x: state_mapping[x])
+data.head()
+```
+
+```python
+# Show the year number as a function of the version.
+versions = data[['version', 'year']].drop_duplicates()
+fig, ax = plt.subplots()
+ax.plot(versions.version, versions.year, marker='o')
+ax.set_xlabel('Version')
+ax.set_ylabel('Year')
+fig.tight_layout()
+```
+
+```python
+# Check that all data are the same in versions 3 and 4.
+versions = [3, 4]
+subsets = [data[data.version == version]
+           .drop('version', axis=1)
+           .sort_values('uwwCode')
+           .reset_index(drop=True)
+           for version in versions]
+x, y = subsets
+fltr = ((x == y) | (x.isnull() & y.isnull())).all(axis=1)
+assert all(fltr)
+
+# Then drop version 3.
+cleaned = data[data.version != 3].copy()
+
+# Fix a data error for Davyhulme: the capacity is an order of magnitude too large because an extra leading 1 was prefixed.
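As an aside for readers reproducing the Davyhulme fix below: `np.minimum` caps only the inflated rows and leaves already-correct values untouched. A minimal sketch with made-up numbers (the 1,206,250 figure is the cap used in the notebook; the inflated entry is hypothetical):

```python
import numpy as np

# Hypothetical capacities: the second entry has a spurious leading 1,
# inflating it by an order of magnitude.
capacities = np.array([1206250.0, 11206250.0])
capped = np.minimum(1206250, capacities)
assert capped.tolist() == [1206250.0, 1206250.0]
```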
+cleaned.loc[cleaned.uwwCode == 'UKENNW_UU_TP000042', 'uwwCapacity'] = np.minimum(
+    1206250, cleaned.loc[cleaned.uwwCode == 'UKENNW_UU_TP000042', 'uwwCapacity']
+)
+```
+
+```python
+# Show number of data points over time.
+counts = cleaned.groupby('year').count()
+fig, ax = plt.subplots()
+ax.plot(counts.index, counts.uwwCapacity, label='capacity')
+ax.plot(counts.index, counts.uwwLoadEnteringUWWTP, label='load entering')
+ax.legend()
+ax.set_ylabel('Number of data points')
+ax.set_xlabel(counts.index.name)
+fig.tight_layout()
+```
+
+```python
+# Show capacity and load for Mogden.
+code = 'UKENTH_TWU_TP000113'
+subset = cleaned[cleaned.uwwCode == code]
+kwargs = {
+    'marker': 'o',
+    'alpha': 0.5,
+}
+
+fig, ax = plt.subplots()
+ax.plot(subset.year, subset.uwwCapacity, label='capacity', **kwargs)
+ax.plot(subset.year, subset.uwwLoadEnteringUWWTP, label='load entering', **kwargs)
+ax.set_xlabel('year')
+ax.set_ylabel('person equivalents')
+ax.legend()
+fig.tight_layout()
+```
+
+```python
+# Save the data.
+cleaned.to_csv('data/wastewater_catchment_areas_public/waterbase_consolidated.csv', index=False)
+```
+
+```python
+# Show availability of treatment works over time.
+capacityAvailability = cleaned.set_index(['uwwCode', 'year']).uwwCapacity.unstack()
+loadAvailability = cleaned.set_index(['uwwCode', 'year']).uwwLoadEnteringUWWTP.unstack()
+fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
+kwargs = {
+    'aspect': 'auto',
+}
+idx = np.argsort(loadAvailability.isnull().sum(axis=1))
+ax1.imshow(~capacityAvailability.isnull().values[idx], **kwargs)
+ax2.imshow(~loadAvailability.isnull().values[idx], **kwargs)
+ax1.set_title('Capacity availability')
+ax2.set_title('Load availability')
+loadAvailability.shape
+```
+
+```python
+# Show the cumulative distribution function of treatment work capacities.
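The plotting cell that follows computes an empirical CDF by hand. As a sketch of the idea with toy values (not the Waterbase data), sorting the values and pairing them with the normalised ranks `(i + 1) / n` yields the ECDF:

```python
import numpy as np

# Toy data: the ECDF at each sorted value is the fraction of
# observations less than or equal to it.
values = np.sort(np.array([5.0, 1.0, 3.0, 2.0]))
ecdf = (np.arange(len(values)) + 1) / len(values)
assert values.tolist() == [1.0, 2.0, 3.0, 5.0]
assert ecdf.tolist() == [0.25, 0.5, 0.75, 1.0]
```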
+fig, ax = plt.subplots() +for key in ['uwwCapacity', 'uwwLoadEnteringUWWTP']: + values = cleaned.groupby('uwwCode')[key].max().sort_values() + ax.plot(values, (np.arange(len(values)) + 1) / len(values), label=key) +ax.set_xscale('log') +ax.legend() +fig.tight_layout() +``` + +```python +# Show the scatter of capacity and load coloured by year. +fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [3, 2]}) +ax = ax1 +fltr = (cleaned.uwwCapacity > 0) & (cleaned.uwwLoadEnteringUWWTP > 0) +subset = cleaned[fltr] +mm = subset.uwwCapacity.min(), subset.uwwCapacity.max() +ax.scatter(subset.uwwCapacity, subset.uwwLoadEnteringUWWTP, c=subset.year, marker='.') +ax.plot(mm, mm, color='k', ls=':') +ax.set_yscale('log') +ax.set_xscale('log') +ax.set_aspect('equal') +ax.set_xlabel('Capacity') +ax.set_ylabel('Load entering') + +above_capacity = cleaned.groupby('year').apply(lambda x: (x.uwwLoadEnteringUWWTP > x.uwwCapacity).mean()) +above_or_at_capacity = cleaned.groupby('year').apply(lambda x: (x.uwwLoadEnteringUWWTP >= x.uwwCapacity).mean()) +ax = ax2 +ax.plot(above_capacity.index, above_capacity, marker='.', label='above capacity') +ax.plot(above_or_at_capacity.index, above_or_at_capacity, marker='.', label='above or at capacity') +ax.set_ylabel('Fraction of treatment works') +ax.set_xlabel('Year') +ax.set_yscale('log') +fig.tight_layout() +``` diff --git a/estimate_population.ipynb b/estimate_population.ipynb deleted file mode 100644 index bc090eb..0000000 --- a/estimate_population.ipynb +++ /dev/null @@ -1,322 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "mexican-luxury", - "metadata": {}, - "outputs": [], - "source": [ - "import fiona\n", - "import geopandas as gpd\n", - "from matplotlib import pyplot as plt\n", - "import matplotlib as mpl\n", - "import numpy as np\n", - "import pandas as pd\n", - "import pathlib\n", - "import shapely\n", - "from scipy import stats\n", - "from tqdm.notebook import tqdm\n", - "\n", - 
"mpl.rcParams['figure.dpi'] = 144\n", - "mpl.style.use('scrartcl.mplstyle')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "valid-telescope", - "metadata": {}, - "outputs": [], - "source": [ - "# Load all the data we need.\n", - "ROOT = pathlib.Path('data/wastewater_catchment_areas_public')\n", - "\n", - "lsoas = gpd.read_file('data/geoportal.statistics.gov.uk/LSOA11_BGC.zip').set_index('LSOA11CD')\n", - "\n", - "catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp')\n", - "\n", - "lsoa_catchment_lookup = pd.read_csv(ROOT / 'lsoa_catchment_lookup.csv')\n", - "\n", - "lsoa_coverage = pd.read_csv(ROOT / 'lsoa_coverage.csv')\n", - "\n", - "lsoa_population = pd.read_csv('data/ons.gov.uk/lsoa_syoa_all_years_t.csv',\n", - " usecols=['LSOA11CD', 'year', 'Pop_Total'])\n", - "lsoa_population['year'] = lsoa_population.year.apply(lambda x: int(x[4:]))\n", - "\n", - "waterbase_catchment_lookup = pd.read_csv(ROOT / 'waterbase_catchment_lookup.csv')\n", - "\n", - "waterbase_consolidated = pd.read_csv(ROOT / 'waterbase_consolidated.csv',\n", - " index_col=['uwwCode', 'year'])\n", - "# Fix a data problem where someone dropped a zero (or another digit) for Kinmel Bay.\n", - "waterbase_consolidated.loc[('UKWAWA_WW_TP000093', 2016), 'uwwLoadEnteringUWWTP'] *= 10\n", - "\n", - "# Add up the treated load for the two works in Abingdon (which should really just be one).\n", - "x = waterbase_consolidated.loc['UKENTH_TWU_TP000001'].uwwLoadEnteringUWWTP\n", - "y = waterbase_consolidated.loc['UKENTH_TWU_TP000165'].uwwLoadEnteringUWWTP\n", - "z = y.reindex(x.index).fillna(0) + x\n", - "waterbase_consolidated.loc['UKENTH_TWU_TP000001', 'uwwLoadEnteringUWWTP'] = z.values\n", - "\n", - "# Get rid of the duplicate treatment work.\n", - "waterbase_consolidated = waterbase_consolidated.drop('UKENTH_TWU_TP000165', level=0)\n", - "waterbase_consolidated = waterbase_consolidated.reset_index()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": 
"alternate-purse", - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate the total intersection area for each LSOA.\n", - "intersection_area_sum = lsoa_catchment_lookup.groupby('LSOA11CD')\\\n", - " .intersection_area.sum().reset_index(name='intersection_area_sum')\n", - "\n", - "# Construct a data frame that has a number of different areas that we can use for normalisation.\n", - "merged = pd.merge(lsoa_coverage, intersection_area_sum, on='LSOA11CD')\n", - "merged = pd.merge(merged, lsoa_catchment_lookup, on='LSOA11CD')\n", - "merged = pd.merge(merged, lsoa_population, on='LSOA11CD')\n", - "\n", - "def aggregate(subset):\n", - " # Construct different normalisations.\n", - " # - for total area, we divide by the area of the LSOA.\n", - " # - for area covered, we divide by the area of the LSOA that's covered by *any* \n", - " # catchment.\n", - " # - for intersection sum, we divide by the sum of all intersections. This may be\n", - " # larger than the area of the catchments if the data has overlapping catchments.\n", - " norms = {\n", - " 'norm_total_area': subset.total_area,\n", - " 'norm_area_covered': subset.area_covered,\n", - " 'norm_intersection_sum': subset.intersection_area_sum,\n", - " }\n", - " intersection_area_pop = subset.intersection_area * subset.Pop_Total\n", - " return pd.Series({key: (intersection_area_pop / value).sum() for key, value in norms.items()})\n", - "\n", - "grouped = merged.groupby(['identifier', 'year'])\n", - "geospatial_estimate = grouped.apply(aggregate)\n", - "geospatial_estimate.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5b28c289", - "metadata": {}, - "outputs": [], - "source": [ - "# Generate summary for table 1 of the publication.\n", - "year = 2016\n", - "method = \"norm_area_covered\"\n", - "\n", - "summary = geospatial_estimate.reset_index()\n", - "summary = summary[summary.year == year]\n", - "summary = pd.merge(catchments, summary, on=\"identifier\")\n", - "summary = 
summary.groupby(\"company\").apply(\n", - " lambda subset: {\n", - " \"population\": subset[method].sum() / 1e6,\n", - " \"retained_catchments\": len(subset),\n", - " \"area\": np.round(subset.geometry.area.sum() / 1e6),\n", - " \"matched_uwwtp\": subset.identifier.isin(waterbase_catchment_lookup.identifier).sum(),\n", - " }\n", - ").apply(pd.Series) \n", - "\n", - "frac = (\n", - " 1e6 * summary.population.sum() \n", - " / lsoa_population[lsoa_population.year == year].Pop_Total.sum()\n", - ")\n", - "print(f\"fraction covered: {frac}\")\n", - "summary" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "geological-person", - "metadata": {}, - "outputs": [], - "source": [ - "# Merge the waterbase data (BOD p.e.) with geospatial population estimates for comparison.\n", - "merged = pd.merge(waterbase_catchment_lookup, waterbase_consolidated, on=['uwwCode', 'uwwName'])\n", - "merged = pd.merge(merged, geospatial_estimate, on=['year', 'identifier'])\n", - "\n", - "# Sum by year and uwwCode (because the same treatment work may be linked to multiple catchments if\n", - "# the subcatchment aggregation didn't work out properly). 
Then assign back to the merged dataset and\n", - "# drop duplicates.\n", - "estimates = merged.groupby(['uwwCode', 'year']).agg({\n", - " 'norm_total_area': 'sum',\n", - " 'norm_area_covered': 'sum',\n", - "})\n", - "for key in estimates:\n", - " merged[key] = [estimates.loc[(x.uwwCode, x.year), key] for _, x in merged.iterrows()]\n", - "merged = merged.drop_duplicates(['uwwCode', 'year'])\n", - "\n", - "# Evaluate the pearson correlation on the log scale (omitting treatment works without load).\n", - "f = merged.uwwLoadEnteringUWWTP > 0\n", - "stats.pearsonr(np.log(merged.uwwLoadEnteringUWWTP[f]), np.log(merged.norm_area_covered[f]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "governing-russia", - "metadata": {}, - "outputs": [], - "source": [ - "# Show a figure of different population estimates for a given year.\n", - "fig = plt.figure()\n", - "gs = fig.add_gridspec(2, 2)\n", - "ax = fig.add_subplot(gs[:, 0])\n", - "year = 2016\n", - "subset = merged[merged.year == year]\n", - "ax.scatter(subset.uwwLoadEnteringUWWTP, subset.norm_area_covered, marker='.', alpha=.5)\n", - "ax.set_yscale('log')\n", - "ax.set_xscale('log')\n", - "lims = subset.uwwLoadEnteringUWWTP.quantile([0, 1])\n", - "ax.plot(lims, lims, color='k', ls=':')\n", - "ax.set_aspect('equal')\n", - "ax.set_xlabel('BOD person equivalent')\n", - "ax.set_ylabel('Geospatial population estimate')\n", - "ax.text(0.05, 0.95, '(a)', transform=ax.transAxes, va='top')\n", - "\n", - "# Annotations.\n", - "annotations = [\n", - " {\n", - " 'code': 'UKENNE_NU_TP000026',\n", - " 'label': 'Haggerston',\n", - " 'xfactor': 3,\n", - " 'yfactor': 1,\n", - " },\n", - " {\n", - " 'code': 'UKWAWA_WW_TP000016',\n", - " 'label': 'Rotherwas',\n", - " 'xfactor': 2.5,\n", - " },\n", - " {\n", - " 'code': 'UKENAN_AW_TP000020',\n", - " 'label': 'Billericay',\n", - " 'xfactor': 2/3,\n", - " 'yfactor': 3,\n", - " 'kwargs': {'ha': 'center'},\n", - " },\n", - " # For better estimation of population coverage, 
we've dropped the Chalton\n", - " # treatment work as it overlaps with East Hyde (see catchment consolidation\n", - " # notebook for annotations and catchment-LSOA matching for dropping \n", - " # annotated catchments).\n", - " # {\n", - " # 'code': 'UKENAN_AW_TP000051',\n", - " # 'label': 'Chalton',\n", - " # 'xfactor': 1 / 3,\n", - " # 'kwargs': {'ha': 'right'},\n", - " # },\n", - "]\n", - "indexed = subset.set_index('uwwCode')\n", - "for annotation in annotations:\n", - " item = indexed.loc[annotation['code']]\n", - "\n", - " ax.annotate(\n", - " annotation['label'],\n", - " (item.uwwLoadEnteringUWWTP, item.norm_area_covered),\n", - " (item.uwwLoadEnteringUWWTP * annotation.get('xfactor', 1),\n", - " item.norm_area_covered * annotation.get('yfactor', 1)),\n", - " arrowprops={\n", - " 'arrowstyle': '-|>',\n", - " },\n", - " va='center',\n", - " **annotation.get('kwargs', {}),\n", - " )\n", - " print(annotation['label'], item.uwwName)\n", - "\n", - "ax3 = ax = fig.add_subplot(gs[:, 1])\n", - "target = lambda x: np.median(np.abs(np.log10(x.norm_area_covered / x.uwwLoadEnteringUWWTP)))\n", - "\n", - "x = []\n", - "y = []\n", - "ys = []\n", - "for year, subset in tqdm(merged.groupby('year')):\n", - " x.append(year)\n", - " # Evaluate the statistic.\n", - " y.append(target(subset))\n", - " # Run a bootstrap sample.\n", - " ys.append([target(subset.iloc[np.random.randint(len(subset), size=len(subset))])\n", - " for _ in range(1000)])\n", - "\n", - "ys = np.asarray(ys)\n", - "l, u = np.percentile(ys, [25, 75], axis=1)\n", - "ax.errorbar(x, y, (y - l, u - y), marker='.')\n", - "ax.ticklabel_format(scilimits=(0, 0), axis='y', useMathText=True)\n", - "ax.set_xlabel('Year')\n", - "ax.set_ylabel('Median absolute\\n$\\\\log_{10}$ error')\n", - "ax.xaxis.set_ticks([2006, 2008, 2010, 2012, 2014, 2016])\n", - "plt.setp(ax.xaxis.get_ticklabels(), rotation=30, ha='right')\n", - "ax.text(0.95, 0.95, '(b)', transform=ax.transAxes, ha='right', va='top')\n", - "\n", - 
"fig.tight_layout()\n", - "fig.savefig('population-estimates.pdf')\n", - "\n", - "# Show the log10 median absolute error over time.\n", - "y" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "synthetic-senator", - "metadata": {}, - "outputs": [], - "source": [ - "# Plot to illustrate why we're using area covered.\n", - "fig, ax = plt.subplots()\n", - "\n", - "xmin = 515000\n", - "xmax = 523000\n", - "ymin = 170000\n", - "ymax = 176000\n", - "box = shapely.geometry.box(xmin, ymin, xmax, ymax)\n", - "\n", - "# Plot the catchments.\n", - "idx_catchment = catchments.sindex.query(box)\n", - "subset = catchments.iloc[idx_catchment].sort_values('name')\n", - "subset = subset[subset.intersection(box).area > 10]\n", - "colors = ['C0', 'C1', 'C2']\n", - "subset.intersection(box).plot(ax=ax, color=colors)\n", - "\n", - "# Plot the LSOAs.\n", - "idx = lsoas.sindex.query(box)\n", - "lsoas.iloc[idx].plot(ax=ax, facecolor='none', edgecolor='k', alpha=.1)\n", - "lsoas.loc[['E01003817']].plot(ax=ax, facecolor=(.5, .5, .5, .25), edgecolor='k')\n", - "\n", - "ax.set_xlim(xmin, xmax)\n", - "ax.set_ylim(ymin, ymax)\n", - "ax.set_axis_off()\n", - "handles = [mpl.patches.Rectangle((0, 0), 1, 1, color=color) for color in colors]\n", - "labels = subset.name.str.replace(' STW', '').str.title()\n", - "ax.legend(handles, labels)\n", - "\n", - "fig.tight_layout()\n", - "fig.savefig('estimation_method.pdf')" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/estimate_population.md b/estimate_population.md new file mode 100644 index 0000000..c72ee3d --- /dev/null +++ 
b/estimate_population.md @@ -0,0 +1,268 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.16.7 + kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +```python +import fiona +import geopandas as gpd +from matplotlib import pyplot as plt +import matplotlib as mpl +import numpy as np +import pandas as pd +import pathlib +import shapely +from scipy import stats +from tqdm.notebook import tqdm + +mpl.rcParams['figure.dpi'] = 144 +mpl.style.use('scrartcl.mplstyle') +``` + +```python +# Load all the data we need. +ROOT = pathlib.Path('data/wastewater_catchment_areas_public') + +lsoas = gpd.read_file('data/geoportal.statistics.gov.uk/LSOA11_BGC.zip').set_index('LSOA11CD') + +catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp') + +lsoa_catchment_lookup = pd.read_csv(ROOT / 'lsoa_catchment_lookup.csv') + +lsoa_coverage = pd.read_csv(ROOT / 'lsoa_coverage.csv') + +lsoa_population = pd.read_csv('data/ons.gov.uk/lsoa_syoa_all_years_t.csv', + usecols=['LSOA11CD', 'year', 'Pop_Total']) +lsoa_population['year'] = lsoa_population.year.apply(lambda x: int(x[4:])) + +waterbase_catchment_lookup = pd.read_csv(ROOT / 'waterbase_catchment_lookup.csv') + +waterbase_consolidated = pd.read_csv(ROOT / 'waterbase_consolidated.csv', + index_col=['uwwCode', 'year']) +# Fix a data problem where someone dropped a zero (or another digit) for Kinmel Bay. +waterbase_consolidated.loc[('UKWAWA_WW_TP000093', 2016), 'uwwLoadEnteringUWWTP'] *= 10 + +# Add up the treated load for the two works in Abingdon (which should really just be one). +x = waterbase_consolidated.loc['UKENTH_TWU_TP000001'].uwwLoadEnteringUWWTP +y = waterbase_consolidated.loc['UKENTH_TWU_TP000165'].uwwLoadEnteringUWWTP +z = y.reindex(x.index).fillna(0) + x +waterbase_consolidated.loc['UKENTH_TWU_TP000001', 'uwwLoadEnteringUWWTP'] = z.values + +# Get rid of the duplicate treatment work. 
+waterbase_consolidated = waterbase_consolidated.drop('UKENTH_TWU_TP000165', level=0) +waterbase_consolidated = waterbase_consolidated.reset_index() +``` + +```python +# Evaluate the total intersection area for each LSOA. +intersection_area_sum = lsoa_catchment_lookup.groupby('LSOA11CD')\ + .intersection_area.sum().reset_index(name='intersection_area_sum') + +# Construct a data frame that has a number of different areas that we can use for normalisation. +merged = pd.merge(lsoa_coverage, intersection_area_sum, on='LSOA11CD') +merged = pd.merge(merged, lsoa_catchment_lookup, on='LSOA11CD') +merged = pd.merge(merged, lsoa_population, on='LSOA11CD') + +def aggregate(subset): + # Construct different normalisations. + # - for total area, we divide by the area of the LSOA. + # - for area covered, we divide by the area of the LSOA that's covered by *any* + # catchment. + # - for intersection sum, we divide by the sum of all intersections. This may be + # larger than the area of the catchments if the data has overlapping catchments. + norms = { + 'norm_total_area': subset.total_area, + 'norm_area_covered': subset.area_covered, + 'norm_intersection_sum': subset.intersection_area_sum, + } + intersection_area_pop = subset.intersection_area * subset.Pop_Total + return pd.Series({key: (intersection_area_pop / value).sum() for key, value in norms.items()}) + +grouped = merged.groupby(['identifier', 'year']) +geospatial_estimate = grouped.apply(aggregate) +geospatial_estimate.head() +``` + +```python +# Generate summary for table 1 of the publication. 
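To make the normalisations in `aggregate` above concrete, here is a toy example (all numbers invented, not taken from the data) of apportioning an LSOA's population to a catchment in proportion to the intersection area:

```python
# A hypothetical LSOA of 4 km^2 with 1000 residents, of which 1 km^2
# falls inside a catchment: area weighting attributes a quarter of
# the population to that catchment.
pop_total = 1000.0
total_area = 4.0
intersection_area = 1.0
estimate = intersection_area * pop_total / total_area
assert estimate == 250.0
```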
+year = 2016 +method = "norm_area_covered" + +summary = geospatial_estimate.reset_index() +summary = summary[summary.year == year] +summary = pd.merge(catchments, summary, on="identifier") +summary = summary.groupby("company").apply( + lambda subset: { + "population": subset[method].sum() / 1e6, + "retained_catchments": len(subset), + "area": np.round(subset.geometry.area.sum() / 1e6), + "matched_uwwtp": subset.identifier.isin(waterbase_catchment_lookup.identifier).sum(), + } +).apply(pd.Series) + +frac = ( + 1e6 * summary.population.sum() + / lsoa_population[lsoa_population.year == year].Pop_Total.sum() +) +print(f"fraction covered: {frac}") +summary +``` + +```python +# Merge the waterbase data (BOD p.e.) with geospatial population estimates for comparison. +merged = pd.merge(waterbase_catchment_lookup, waterbase_consolidated, on=['uwwCode', 'uwwName']) +merged = pd.merge(merged, geospatial_estimate, on=['year', 'identifier']) + +# Sum by year and uwwCode (because the same treatment work may be linked to multiple catchments if +# the subcatchment aggregation didn't work out properly). Then assign back to the merged dataset and +# drop duplicates. +estimates = merged.groupby(['uwwCode', 'year']).agg({ + 'norm_total_area': 'sum', + 'norm_area_covered': 'sum', +}) +for key in estimates: + merged[key] = [estimates.loc[(x.uwwCode, x.year), key] for _, x in merged.iterrows()] +merged = merged.drop_duplicates(['uwwCode', 'year']) + +# Evaluate the pearson correlation on the log scale (omitting treatment works without load). +f = merged.uwwLoadEnteringUWWTP > 0 +stats.pearsonr(np.log(merged.uwwLoadEnteringUWWTP[f]), np.log(merged.norm_area_covered[f])) +``` + +```python +# Show a figure of different population estimates for a given year. 
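The error bars later in this cell come from a nonparametric bootstrap: resample the rows with replacement and recompute the statistic many times. A self-contained sketch of that pattern with toy data (not the notebook's `merged` frame, and using the median rather than the notebook's log-error statistic):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Resample with replacement and recompute the median 1000 times,
# then summarise the spread of the bootstrap replicates.
boot = [np.median(rng.choice(data, size=len(data))) for _ in range(1000)]
lo, hi = np.percentile(boot, [25, 75])
assert len(boot) == 1000
assert 1.0 <= lo <= hi <= 5.0
```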
+fig = plt.figure() +gs = fig.add_gridspec(2, 2) +ax = fig.add_subplot(gs[:, 0]) +year = 2016 +subset = merged[merged.year == year] +ax.scatter(subset.uwwLoadEnteringUWWTP, subset.norm_area_covered, marker='.', alpha=.5) +ax.set_yscale('log') +ax.set_xscale('log') +lims = subset.uwwLoadEnteringUWWTP.quantile([0, 1]) +ax.plot(lims, lims, color='k', ls=':') +ax.set_aspect('equal') +ax.set_xlabel('BOD person equivalent') +ax.set_ylabel('Geospatial population estimate') +ax.text(0.05, 0.95, '(a)', transform=ax.transAxes, va='top') + +# Annotations. +annotations = [ + { + 'code': 'UKENNE_NU_TP000026', + 'label': 'Haggerston', + 'xfactor': 3, + 'yfactor': 1, + }, + { + 'code': 'UKWAWA_WW_TP000016', + 'label': 'Rotherwas', + 'xfactor': 2.5, + }, + { + 'code': 'UKENAN_AW_TP000020', + 'label': 'Billericay', + 'xfactor': 2/3, + 'yfactor': 3, + 'kwargs': {'ha': 'center'}, + }, + # For better estimation of population coverage, we've dropped the Chalton + # treatment work as it overlaps with East Hyde (see catchment consolidation + # notebook for annotations and catchment-LSOA matching for dropping + # annotated catchments). + # { + # 'code': 'UKENAN_AW_TP000051', + # 'label': 'Chalton', + # 'xfactor': 1 / 3, + # 'kwargs': {'ha': 'right'}, + # }, +] +indexed = subset.set_index('uwwCode') +for annotation in annotations: + item = indexed.loc[annotation['code']] + + ax.annotate( + annotation['label'], + (item.uwwLoadEnteringUWWTP, item.norm_area_covered), + (item.uwwLoadEnteringUWWTP * annotation.get('xfactor', 1), + item.norm_area_covered * annotation.get('yfactor', 1)), + arrowprops={ + 'arrowstyle': '-|>', + }, + va='center', + **annotation.get('kwargs', {}), + ) + print(annotation['label'], item.uwwName) + +ax3 = ax = fig.add_subplot(gs[:, 1]) +target = lambda x: np.median(np.abs(np.log10(x.norm_area_covered / x.uwwLoadEnteringUWWTP))) + +x = [] +y = [] +ys = [] +for year, subset in tqdm(merged.groupby('year')): + x.append(year) + # Evaluate the statistic. 
+ y.append(target(subset)) + # Run a bootstrap sample. + ys.append([target(subset.iloc[np.random.randint(len(subset), size=len(subset))]) + for _ in range(1000)]) + +ys = np.asarray(ys) +l, u = np.percentile(ys, [25, 75], axis=1) +ax.errorbar(x, y, (y - l, u - y), marker='.') +ax.ticklabel_format(scilimits=(0, 0), axis='y', useMathText=True) +ax.set_xlabel('Year') +ax.set_ylabel('Median absolute\n$\\log_{10}$ error') +ax.xaxis.set_ticks([2006, 2008, 2010, 2012, 2014, 2016]) +plt.setp(ax.xaxis.get_ticklabels(), rotation=30, ha='right') +ax.text(0.95, 0.95, '(b)', transform=ax.transAxes, ha='right', va='top') + +fig.tight_layout() +fig.savefig('population-estimates.pdf') + +# Show the log10 median absolute error over time. +y +``` + +```python +# Plot to illustrate why we're using area covered. +fig, ax = plt.subplots() + +xmin = 515000 +xmax = 523000 +ymin = 170000 +ymax = 176000 +box = shapely.geometry.box(xmin, ymin, xmax, ymax) + +# Plot the catchments. +idx_catchment = catchments.sindex.query(box) +subset = catchments.iloc[idx_catchment].sort_values('name') +subset = subset[subset.intersection(box).area > 10] +colors = ['C0', 'C1', 'C2'] +subset.intersection(box).plot(ax=ax, color=colors) + +# Plot the LSOAs. 
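A note on the spatial-index queries in this cell: `sindex.query(box)` only returns bounding-box candidates, which is why the code then filters by the true intersection (`subset.intersection(box).area > 10`). A minimal, dependency-free sketch of that refinement step (axis-aligned rectangles as `(xmin, ymin, xmax, ymax)`; the real code uses shapely geometries instead):

```python
def rect_overlap_area(a, b):
    # Rectangles as (xmin, ymin, xmax, ymax); clamp negative
    # extents to zero so disjoint rectangles yield area 0.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

# A genuinely overlapping pair passes an area filter; a pair that
# merely touches along an edge is discarded.
assert rect_overlap_area((0, 0, 2, 2), (1, 1, 3, 3)) == 1
assert rect_overlap_area((0, 0, 1, 1), (1, 0, 2, 1)) == 0
```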
+idx = lsoas.sindex.query(box) +lsoas.iloc[idx].plot(ax=ax, facecolor='none', edgecolor='k', alpha=.1) +lsoas.loc[['E01003817']].plot(ax=ax, facecolor=(.5, .5, .5, .25), edgecolor='k') + +ax.set_xlim(xmin, xmax) +ax.set_ylim(ymin, ymax) +ax.set_axis_off() +handles = [mpl.patches.Rectangle((0, 0), 1, 1, color=color) for color in colors] +labels = subset.name.str.replace(' STW', '').str.title() +ax.legend(handles, labels) + +fig.tight_layout() +fig.savefig('estimation_method.pdf') +``` diff --git a/match_catchments_and_lsoas.ipynb b/match_catchments_and_lsoas.ipynb deleted file mode 100644 index dc17005..0000000 --- a/match_catchments_and_lsoas.ipynb +++ /dev/null @@ -1,233 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "trying-facing", - "metadata": {}, - "outputs": [], - "source": [ - "import fiona\n", - "import geopandas as gpd\n", - "from tqdm.notebook import tqdm\n", - "import pandas as pd\n", - "import pathlib\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "import networkx as nx" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "statutory-cartridge", - "metadata": {}, - "outputs": [], - "source": [ - "ROOT = pathlib.Path('data/wastewater_catchment_areas_public')\n", - "\n", - "catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp').set_index('identifier')\n", - "print(f'loaded {len(catchments)} catchments')\n", - "catchments.head()\n", - "\n", - "# Drop catchments that are annotated with an overlap comment.\n", - "catchments = catchments[~catchments.comment.str.startswith(\"Overlap\").fillna(False)]\n", - "print(f'retained {len(catchments)} after dropping annotated catchments')\n", - "\n", - "# Drop all scottish water catchments because there should not be overlap with LSOAs.\n", - "catchments = catchments[catchments.company != \"scottish_water\"]\n", - "print(f'retained {len(catchments)} after dropping Scottish Water')" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "id": "b1acb27f", - "metadata": {}, - "outputs": [], - "source": [ - "# Find non-trivial intersections between treatment works. That have not already been \n", - "# annotated.\n", - "sindex = catchments.geometry.sindex\n", - "indices = sindex.query_bulk(catchments.geometry).T\n", - "print(f'found {len(indices)} intersections')\n", - "\n", - "# Remove self-intersections and only report each intersection once.\n", - "indices = indices[indices[:, 0] < indices[:, 1]]\n", - "print(f'found {len(indices)} intersections without self intersections')\n", - "\n", - "# Calculate the actual intersection areas and remove zero-area intersections.\n", - "i, j = indices.T\n", - "x = catchments.iloc[i].geometry.reset_index(drop=True)\n", - "y = catchments.iloc[j].geometry.reset_index(drop=True)\n", - "intersection_areas = x.intersection(y).area.values\n", - "intersection_threshold = 100\n", - "f = intersection_areas > intersection_threshold\n", - "intersection_areas = intersection_areas[f]\n", - "indices = indices[f]\n", - "\n", - "# Sort to have largest intersection areas first.\n", - "intersection_areas = intersection_areas\n", - "indices = indices\n", - "identifiers = catchments.index.values[indices]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29275036", - "metadata": {}, - "outputs": [], - "source": [ - "# Construct an intersection graph and get connected compnents.\n", - "graph = nx.Graph()\n", - "graph.add_edges_from([(*edge, {\"weight\": area}) for edge, area in zip(identifiers, intersection_areas)])\n", - "components = list(sorted(\n", - " nx.connected_components(graph), \n", - " key=lambda nodes: sum(data[\"weight\"] for *_, data in graph.edges(nodes, data=True)),\n", - " reverse=True,\n", - "))\n", - "print(f\"found {len(components)} connected components\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed57bab2", - "metadata": {}, - "outputs": [], - "source": [ - 
"catchments[catchments.index.isin(components[3])].plot(column=\"name\", alpha=0.5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "headed-humidity", - "metadata": {}, - "outputs": [], - "source": [ - "lsoas = gpd.read_file('data/geoportal.statistics.gov.uk/LSOA11_BGC.zip').set_index('LSOA11CD')\n", - "print(f'loaded {len(lsoas)} LSOAs')\n", - "lsoas.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "statewide-metallic", - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate the intersection area between LSOAs and catchments.\n", - "catchment_idx, lsoa_idx = lsoas.sindex.query_bulk(catchments.geometry)\n", - "print(f'found {len(catchment_idx)} intersections between catchments and LSOAs')\n", - "print(f'{len(set(lsoa_idx))} of {len(lsoas)} LSOAs intersect at least one catchment (at the '\n", - " 'envelope level)')\n", - "\n", - "# Evaluate the proper intersection areas (not just whether they intersect).\n", - "intersection_areas = [catchments.geometry.iloc[i].intersection(lsoas.geometry.iloc[j]).area\n", - " for i, j in tqdm(zip(catchment_idx, lsoa_idx), total=len(catchment_idx))]\n", - "\n", - "# Package the intersection areas in a dataframe and only retain intersections with non-zero area.\n", - "intersections = pd.DataFrame({\n", - " 'identifier': catchments.index[catchment_idx],\n", - " 'LSOA11CD': lsoas.index[lsoa_idx],\n", - " 'intersection_area': intersection_areas,\n", - "})\n", - "intersections = intersections[intersections.intersection_area > 0]\n", - "print(f'retained {len(intersections)} intersections after removing zero areas')\n", - "intersections.head()\n", - "intersections.to_csv(ROOT / 'lsoa_catchment_lookup.csv', index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "abefb5d2", - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate the fraction of each LSOA covered.\n", - "grouped = intersections.groupby(\"LSOA11CD\")\n", - "frac_covered = 
grouped.intersection_area.sum() / lsoas.geometry.area\n", - "\n", - "coverage = pd.DataFrame({\n", - " \"n_catchments\": grouped.identifier.nunique(), \n", - " \"frac\": frac_covered,\n", - "})\n", - "\n", - "# Check that there is at most full coverage if there is only an\n", - "# intersection with one treatment work.\n", - "assert coverage[coverage.n_catchments == 1].frac.max() < 1 + 1e-9\n", - "coverage.loc[coverage.n_catchments == 1, \"frac\"] = np.minimum(\n", - " 1, coverage.frac[coverage.n_catchments == 1]\n", - ")\n", - "coverage = coverage[coverage.frac > 1 + 1e-6]\n", - "print(\n", - " f\"There are {len(coverage)} LSOAs with total \"\n", - " \"intersections exceeding their own area.\"\n", - ")\n", - "plt.hist(coverage.frac)\n", - "\n", - "# Find the catchments that give rise to the \"over-coverage\".\n", - "catchment_identifiers = intersections[\n", - " intersections.LSOA11CD.isin(coverage.index)\n", - "].groupby(\"identifier\").identifier.count()\n", - "\n", - "\n", - "catchment_identifiers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "manual-fifteen", - "metadata": {}, - "outputs": [], - "source": [ - "coverage = {}\n", - "for lsoa_code, subset in tqdm(intersections.groupby('LSOA11CD')):\n", - " # Get the union of all possible intersections.\n", - " if len(subset) > 1:\n", - " all_intersecting = catchments.loc[subset.identifier].unary_union\n", - " else:\n", - " identifier = subset.identifier.iloc[0]\n", - " all_intersecting = catchments.geometry.loc[identifier]\n", - " # Evaluate the intersection of the LSOA with any catchment by intersecting with the spatial \n", - " # union of the catchments.\n", - " intersection = all_intersecting.intersection(lsoas.geometry.loc[lsoa_code])\n", - " coverage[lsoa_code] = intersection.area\n", - " \n", - "coverage = pd.Series(coverage)\n", - "\n", - "# Compute the coverage and fill with zeros where there are no intersections.\n", - "lsoas['area_covered'] = coverage\n", - 
"lsoas['area_covered'] = lsoas.area_covered.fillna(0)\n", - "lsoas['total_area'] = lsoas.area\n", - "lsoas[['total_area', 'area_covered']].to_csv(ROOT / 'lsoa_coverage.csv')" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/match_catchments_and_lsoas.md b/match_catchments_and_lsoas.md new file mode 100644 index 0000000..2330b92 --- /dev/null +++ b/match_catchments_and_lsoas.md @@ -0,0 +1,167 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.16.7 + kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +```python +import fiona +import geopandas as gpd +from tqdm.notebook import tqdm +import pandas as pd +import pathlib +from matplotlib import pyplot as plt +import numpy as np +import networkx as nx +``` + +```python +ROOT = pathlib.Path('data/wastewater_catchment_areas_public') + +catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp').set_index('identifier') +print(f'loaded {len(catchments)} catchments') +catchments.head() + +# Drop catchments that are annotated with an overlap comment. +catchments = catchments[~catchments.comment.str.startswith("Overlap").fillna(False)] +print(f'retained {len(catchments)} after dropping annotated catchments') + +# Drop all scottish water catchments because there should not be overlap with LSOAs. +catchments = catchments[catchments.company != "scottish_water"] +print(f'retained {len(catchments)} after dropping Scottish Water') +``` + +```python +# Find non-trivial intersections between treatment works. 
Only consider pairs that have not already been
+# annotated.
+sindex = catchments.geometry.sindex
+indices = sindex.query_bulk(catchments.geometry).T
+print(f'found {len(indices)} candidate intersections')
+
+# Remove self-intersections and only report each intersection once.
+indices = indices[indices[:, 0] < indices[:, 1]]
+print(f'found {len(indices)} intersections without self-intersections')
+
+# Calculate the actual intersection areas and remove intersections with negligible area.
+i, j = indices.T
+x = catchments.iloc[i].geometry.reset_index(drop=True)
+y = catchments.iloc[j].geometry.reset_index(drop=True)
+intersection_areas = x.intersection(y).area.values
+intersection_threshold = 100
+f = intersection_areas > intersection_threshold
+intersection_areas = intersection_areas[f]
+indices = indices[f]
+
+# Sort to have the largest intersection areas first.
+order = np.argsort(-intersection_areas)
+intersection_areas = intersection_areas[order]
+indices = indices[order]
+identifiers = catchments.index.values[indices]
+```
+
+```python
+# Construct an intersection graph and get connected components.
+graph = nx.Graph()
+graph.add_edges_from([(*edge, {"weight": area}) for edge, area in zip(identifiers, intersection_areas)])
+components = list(sorted(
+    nx.connected_components(graph),
+    key=lambda nodes: sum(data["weight"] for *_, data in graph.edges(nodes, data=True)),
+    reverse=True,
+))
+print(f"found {len(components)} connected components")
+```
+
+```python
+catchments[catchments.index.isin(components[3])].plot(column="name", alpha=0.5)
+```
+
+```python
+lsoas = gpd.read_file('data/geoportal.statistics.gov.uk/LSOA11_BGC.zip').set_index('LSOA11CD')
+print(f'loaded {len(lsoas)} LSOAs')
+lsoas.head()
+```
+
+```python
+# Evaluate the intersection area between LSOAs and catchments. 
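+# (Editor's sketch, not part of the original pipeline; assumes shapely, which
+# geopandas depends on.) The spatial index compares bounding boxes only, so
+# pairs returned by `query_bulk` need their exact intersection areas computed
+# afterwards: a triangle and a small square can have overlapping envelopes yet
+# be disjoint.
+import shapely.geometry
+_tri = shapely.geometry.Polygon([(0, 0), (2, 0), (0, 2)])
+_box = shapely.geometry.box(1.5, 1.5, 2, 2)
+assert _tri.envelope.intersects(_box.envelope) and not _tri.intersects(_box)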
+catchment_idx, lsoa_idx = lsoas.sindex.query_bulk(catchments.geometry)
+print(f'found {len(catchment_idx)} intersections between catchments and LSOAs')
+print(f'{len(set(lsoa_idx))} of {len(lsoas)} LSOAs intersect at least one catchment (at the '
+      'envelope level)')
+
+# Evaluate the proper intersection areas (not just whether they intersect).
+intersection_areas = [catchments.geometry.iloc[i].intersection(lsoas.geometry.iloc[j]).area
+                      for i, j in tqdm(zip(catchment_idx, lsoa_idx), total=len(catchment_idx))]
+
+# Package the intersection areas in a dataframe and only retain intersections with non-zero area.
+intersections = pd.DataFrame({
+    'identifier': catchments.index[catchment_idx],
+    'LSOA11CD': lsoas.index[lsoa_idx],
+    'intersection_area': intersection_areas,
+})
+intersections = intersections[intersections.intersection_area > 0]
+print(f'retained {len(intersections)} intersections after removing zero areas')
+intersections.head()
+intersections.to_csv(ROOT / 'lsoa_catchment_lookup.csv', index=False)
+```
+
+```python
+# Evaluate the fraction of each LSOA covered.
+grouped = intersections.groupby("LSOA11CD")
+frac_covered = grouped.intersection_area.sum() / lsoas.geometry.area
+
+coverage = pd.DataFrame({
+    "n_catchments": grouped.identifier.nunique(),
+    "frac": frac_covered,
+})
+
+# Check that an LSOA intersecting only a single catchment is covered at most
+# fully (up to numerical tolerance).
+assert coverage[coverage.n_catchments == 1].frac.max() < 1 + 1e-9
+coverage.loc[coverage.n_catchments == 1, "frac"] = np.minimum(
+    1, coverage.frac[coverage.n_catchments == 1]
+)
+coverage = coverage[coverage.frac > 1 + 1e-6]
+print(
+    f"There are {len(coverage)} LSOAs whose total "
+    "intersection area exceeds their own area."
+)
+plt.hist(coverage.frac)
+
+# Find the catchments that give rise to the "over-coverage". 
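+# (Editor's note, a sketch of why over-coverage arises rather than part of the
+# pipeline:) summing per-catchment intersection areas double-counts regions
+# where catchments overlap, since for any two catchments A, B and an LSOA L
+#
+#     area(A ∩ L) + area(B ∩ L) >= area((A ∪ B) ∩ L),
+#
+# with equality only when A and B do not overlap inside L. The union-based
+# computation in the next cell therefore replaces the naive sum.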
+catchment_identifiers = intersections[ + intersections.LSOA11CD.isin(coverage.index) +].groupby("identifier").identifier.count() + + +catchment_identifiers +``` + +```python +coverage = {} +for lsoa_code, subset in tqdm(intersections.groupby('LSOA11CD')): + # Get the union of all possible intersections. + if len(subset) > 1: + all_intersecting = catchments.loc[subset.identifier].unary_union + else: + identifier = subset.identifier.iloc[0] + all_intersecting = catchments.geometry.loc[identifier] + # Evaluate the intersection of the LSOA with any catchment by intersecting with the spatial + # union of the catchments. + intersection = all_intersecting.intersection(lsoas.geometry.loc[lsoa_code]) + coverage[lsoa_code] = intersection.area + +coverage = pd.Series(coverage) + +# Compute the coverage and fill with zeros where there are no intersections. +lsoas['area_covered'] = coverage +lsoas['area_covered'] = lsoas.area_covered.fillna(0) +lsoas['total_area'] = lsoas.area +lsoas[['total_area', 'area_covered']].to_csv(ROOT / 'lsoa_coverage.csv') +``` diff --git a/match_waterbase_and_catchments.ipynb b/match_waterbase_and_catchments.ipynb deleted file mode 100644 index d0ff4cf..0000000 --- a/match_waterbase_and_catchments.ipynb +++ /dev/null @@ -1,469 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "serious-hospital", - "metadata": {}, - "outputs": [], - "source": [ - "import fiona # Needs to be imported first for geopandas to work.\n", - "import geopandas as gpd\n", - "import matplotlib as mpl\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import pathlib\n", - "import re\n", - "from tqdm.notebook import tqdm\n", - "\n", - "mpl.rcParams['figure.dpi'] = 144\n", - "mpl.style.use('scrartcl.mplstyle')\n", - "\n", - "ROOT = pathlib.Path('data/wastewater_catchment_areas_public')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adverse-knife", - "metadata": {}, - 
"outputs": [], - "source": [ - "catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp')\n", - "catchments.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fifth-inspector", - "metadata": {}, - "outputs": [], - "source": [ - "# Load the waterbase data.\n", - "uwwtps = gpd.GeoDataFrame(pd.read_csv(ROOT / 'waterbase_consolidated.csv'))\n", - "\n", - "# Identify the treatment plants that were inactive in the most recent report.\n", - "inactive = []\n", - "for uwwCode, subset in uwwtps.groupby('uwwCode'):\n", - " subset = subset.sort_values('year')\n", - " item = subset.iloc[-1]\n", - " if item.uwwState == 'inactive':\n", - " inactive.append(item)\n", - "inactive = pd.DataFrame(inactive)\n", - "\n", - "# Drop treatment works that are inactive.\n", - "print(f'starting with {uwwtps.uwwCode.nunique()} treatment plants')\n", - "uwwtps = uwwtps.loc[~np.in1d(uwwtps.uwwCode, inactive.uwwCode)]\n", - "print(f'retained {uwwtps.uwwCode.nunique()} treatment plants after removing inactive plants')\n", - "\n", - "# Drop all treatment works from Northern Ireland Water and and Gibraltar.\n", - "uwwtps = uwwtps[~(\n", - " uwwtps.uwwCode.str.startswith('UKNI') |\n", - " uwwtps.uwwCode.str.startswith('UKGIB')\n", - ")]\n", - "\n", - "print(f'retained {uwwtps.uwwCode.nunique()} treatment plants after removing treatment plants owned '\n", - " 'by water companies that did not supply data')\n", - "\n", - "# Ensure that all retained treatment plants have reported at least one record for capacity and load.\n", - "grouped = uwwtps.groupby('uwwCode')\n", - "np.testing.assert_array_less(1, grouped.uwwCapacity.max())\n", - "np.testing.assert_array_less(1, grouped.uwwLoadEnteringUWWTP.max())\n", - "\n", - "# Reproject into the british national grid.\n", - "uwwtps['geometry'] = gpd.points_from_xy(uwwtps.uwwLongitude, uwwtps.uwwLatitude)\n", - "uwwtps = gpd.GeoDataFrame(uwwtps).set_crs('epsg:4326').to_crs('epsg:27700')\n", - "\n", - "# Group by treatment code 
and take the most recent data.\n", - "uwwtps = uwwtps.sort_values('year').groupby('uwwCode').last().reset_index()\n", - "uwwtps.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "invisible-chapel", - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate the distance matrix between catchments and treatment works.\n", - "distances = []\n", - "for _, catchment in tqdm(catchments.iterrows(), total=len(catchments)):\n", - " distances.append(uwwtps.distance(catchment.geometry, align=False))\n", - "# Cast to a dataframe.\n", - "distances = np.asarray(distances)\n", - "distances = pd.DataFrame(distances, index=catchments.identifier, columns=uwwtps.uwwCode)\n", - "distances.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "later-wonder", - "metadata": {}, - "outputs": [], - "source": [ - "# Sweep over different distance thresholds and evaluate how many entities are close to each other\n", - "# at that radius.\n", - "thresholds = np.logspace(1, 3, 50)\n", - "num_uwwtps_near_catchment = []\n", - "num_catchments_near_uwwtp = []\n", - "for threshold in thresholds:\n", - " near = distances < threshold\n", - " num_uwwtps_near_catchment.append(near.sum(axis=1))\n", - " num_catchments_near_uwwtp.append(near.sum(axis=0))\n", - "\n", - "num_uwwtps_near_catchment = np.asarray(num_uwwtps_near_catchment)\n", - "num_catchments_near_uwwtp = np.asarray(num_catchments_near_uwwtp)\n", - "assert num_uwwtps_near_catchment.shape == (len(thresholds), len(catchments))\n", - "assert num_catchments_near_uwwtp.shape == (len(thresholds), len(uwwtps))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "centered-accuracy", - "metadata": {}, - "outputs": [], - "source": [ - "# Show the fraction of UWWTPs that have exactly one catchment within a distance given by the\n", - "# threshold. 
As expected, there is an optimal distance: small thresholds are too restrictive whereas\n", - "# large ones \"confuse\" different catchments.\n", - "fig, ax = plt.subplots()\n", - "fraction_single_catchment_near_uwwtp = np.mean(num_catchments_near_uwwtp == 1, axis=1)\n", - "line, = ax.plot(thresholds, fraction_single_catchment_near_uwwtp)\n", - "i = np.argmax(fraction_single_catchment_near_uwwtp)\n", - "ax.scatter(thresholds[i], fraction_single_catchment_near_uwwtp[i], color=line.get_color())\n", - "ax.set_xscale('log')\n", - "ax.set_xlabel('Thresholds (metres)')\n", - "ax.set_ylabel('Fraction of UWWTPs with unique\\ncatchment @ threshold')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "painted-whale", - "metadata": {}, - "outputs": [], - "source": [ - "# Find all combinations of catchments and treatment plants where both the treatment plant\n", - "# and the catchment are associated with exactly one other treatment plant/catchment at a given\n", - "# distance.\n", - "thresholds = np.linspace(0, 500, 50)\n", - "num = []\n", - "for threshold in tqdm(thresholds):\n", - " binary = (distances <= threshold).values\n", - " f = (binary.sum(axis=1) == 1)[:, None] & ((binary).sum(axis=0) == 1) & binary\n", - " catchment_idx, stp_idx = np.nonzero(f)\n", - " num.append(len(catchment_idx))\n", - "\n", - "fig, ax = plt.subplots()\n", - "ax.plot(thresholds, num)\n", - "ax.set_xlabel('Thresholds (metres)')\n", - "ax.set_ylabel('# unique pairs @ threshold')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "formed-capital", - "metadata": {}, - "outputs": [], - "source": [ - "# Step 1: distance-based matching.\n", - "\n", - "# Find all combinations of catchments and treatment plants where both the treatment plant and the\n", - "# catchment are associated with exactly one other treatment plant/catchment below a given distance.\n", - "threshold = 100\n", - "binary = (distances <= threshold).values\n", - "f_distance = (binary.sum(axis=1) 
== 1)[:, None] & ((binary).sum(axis=0) == 1) & binary\n", - "print('distance matched', f_distance.sum())\n", - "\n", - "# Validate uniqueness to ensure the logic is consistent.\n", - "np.testing.assert_array_less(f.sum(axis=0), 2)\n", - "np.testing.assert_array_less(f.sum(axis=1), 2)\n", - "\n", - "def normalise_name(value):\n", - " \"\"\"\n", - " Function to normalise treatment names by removing special characters and common abbreviations.\n", - " \"\"\"\n", - " if pd.isnull(value):\n", - " return value\n", - " # Drop all special characters.\n", - " value, _ = re.subn(r'[\\(\\)\\[\\]\\{\\}\\.\\&-]', '', value.lower())\n", - " # Drop stw and wwtw (often suffixes).\n", - " value, _ = re.subn(r'\\b(stw|wwtw|doa|wrw|bucks)\\b', '', value)\n", - " # Drop all whitespace.\n", - " value, _ = re.subn(r'\\s', '', value)\n", - " return value\n", - "\n", - "# Step 2: name-based matching.\n", - "\n", - "# Add all treatment works where the normalised names are identical and the distance is less than\n", - "# another threshold.\n", - "f_name = (catchments['name'].apply(normalise_name).values[:, None] == uwwtps.uwwName.apply(normalise_name).values) \\\n", - " & (distances.values <= 2500)\n", - "\n", - "print('name matched', f_name.sum())\n", - "\n", - "# Create the auto-matched lookup table.\n", - "catchment_idx, stp_idx = np.nonzero(f_distance | f_name)\n", - "auto_matched = pd.concat([\n", - " uwwtps.iloc[stp_idx].reset_index(drop=True),\n", - " catchments.iloc[catchment_idx].reset_index(drop=True).drop('geometry', axis=1)\n", - "], axis=1)[['identifier', 'name', 'uwwCode', 'uwwName']]\n", - "print('total automatically matched', auto_matched.uwwCode.nunique())\n", - "auto_matched.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "provincial-stomach", - "metadata": {}, - "outputs": [], - "source": [ - "manual_matched = pd.DataFrame([\n", - " ('214873b5cf', 'UU-09-SC47-DAVYH', 'UKENNW_UU_TP000042', 'Manchester and Salford (Davyhulme)STW'),\n", - " 
('d793a815d9', 'UU-10-SC50-ROYTO', 'UKENNW_UU_TP000105', 'ROYTON STW'),\n", - " ('b65d2eebc7', 'UU-10-SC51-FAILS', 'UKENNW_UU_TP000048', 'FAILSWORTH STW'),\n", - " ('bb6ca16304', 'UU-10-SC54-HAZEL', 'UKENNW_UU_TP000058', 'HAZEL GROVE STW'),\n", - " ('59bd836fc3', 'UU-10-SC54-STOCK', 'UKENNW_UU_TP000116', 'STOCKPORT STW'),\n", - " ('5900c96825', 'Shenfield and Hutton', 'UKENAN_AW_TP000230', 'SHENFIELD STW'),\n", - " ('4647e75f9e', 'CARDIFF BAY', 'UKWAWA_WW_TP000026', 'CARDIFF STW'),\n", - " # These two are the same treatment work.\n", - " ('cd33d70a80', 'Catterick Village', 'UKENNE_YW_TP000007', 'CATTERICK STW'),\n", - " ('cd33d70a80', 'Catterick Village', 'UKENNE_YW_TP000008', 'COLBURN STW'),\n", - " ('27ec2163de', 'UU-11-SC56-NTHWI', 'UKENNW_UU_TP000097', 'NORTHWICH STW'),\n", - " # Cuddington has been decommissioned and now feeds into Northwich, but it's all a bit tricky\n", - " # because there can be pumping going on between the different treatment plants\n", - " # (https://www.unitedutilities.com/about-us/cheshire/).\n", - " (None, None, 'UKENNW_UU_TP000040', 'CUDDINGTON STW'),\n", - " # Cotgrave feeds into Radcliffe.\n", - " # https://www.nottinghamshire.gov.uk/planningsearch/DisplayImage.aspx?doc=cmVjb3JkX251bWJlcj02OTU4JmZpbGVuYW1lPVxcbnMwMS0wMDI5XGZpbGVkYXRhMiRcREIwMy0wMDMwXFNoYXJlZEFwcHNcRExHU1xQbGFuc1xQTEFOTklOR1xTY3ItMzY0OFxSYWRjbGlmZmUgRUlBIFNjcmVlbmluZyBPcGluaW9uIFJlcXVlc3QucGRmJmltYWdlX251bWJlcj01JmltYWdlX3R5cGU9cGxhbm5pbmcmbGFzdF9tb2RpZmllZF9mcm9tX2Rpc2s9MTIvMDQvMjAxNyAxMDozMzo0MQ==\n", - " (None, None, 'UKENMI_ST_TP000064', 'COTGRAVE STW'),\n", - " ('fb13da240a', 'MANSFIELD-BATH LANE (WRW)', 'UKENMI_ST_TP000143', 'MANSFIELD STW'),\n", - " # https://goo.gl/maps/pFhJy3ZAxyABKwWc6\n", - " ('9fcbac5ecc', 'STOKE BARDOLPH (WRW)', 'UKENMI_ST_TP000163', 'NOTTINGHAM STW'),\n", - " ('7be11f772b', 'BEESTON -LILAC GROVE (WRW)', 'UKENMI_ST_TP000025', 'BEESTON STW'),\n", - " ('c861b125a1', 'STRATFORD-MILCOTE (WRW)', 'UKENMI_ST_TP000206', 'STRATFORD STW'),\n", - " 
# Long Marston has been decommissioned.\n", - " # https://waterprojectsonline.com/custom_case_study/pebworth-long-marston-stratford-upon-avon-transfer/\n", - " (None, None, 'UKENMI_ST_TP000135', 'LONG MARSTON STW'),\n", - " ('01bf76d616', 'MAPLE LODGE STW', 'UKENTH_TWU_TP000106', 'MAPLE LODGE, BUCKS STW\"'),\n", - " ('034de6cd3b', 'Eastwood', 'UKENNE_YW_TP000079', 'TODMORDEN STW'),\n", - " ('c64f685e8b', 'Neiley', 'UKENNE_YW_TP000109', 'HOLMFIRTH STW'),\n", - " ('7e5b0d4c85', 'UU-05-SC28-BLACK', 'UKENNW_UU_TP000018', 'BLACKBURN STW'),\n", - " ('ccad6a0777', 'UU-05-SC28-DARWE', 'UKENNW_UU_TP000041', 'DARWEN STW'),\n", - " ('31ec9a8fc2', 'UU-06-SC35-HUYTO', 'UKENNW_UU_TP000066', 'HUYTON STW'),\n", - " ('a83266015d', 'UU-06-SC35-LIVER', 'UKENNW_UU_TP000080', 'LIVERPOOL SOUTH [WOOLTON]) STW'),\n", - " ('0d2d457ce0', 'MARGATE AND BROADSTAIRS', 'UKENSO_SW_TP000019', 'BROADSTAIRS/MARGATE OUT CSM (Weatherlees B)'),\n", - " ('bbd09a27de', 'WEATHERLEES HILL', 'UKENSO_SW_TP000022', 'RAMSGATE, SANDWICH, DEAL STW\"'),\n", - " ('00e1c4c53b', 'UU-09-SC46-ECCLE', 'UKENNW_UU_TP000046', 'ECCLES STW'),\n", - " ('feab30c503', 'UU-07-SC40-WORSL', 'UKENNW_UU_TP000140', 'WORSLEY STW'),\n", - " ('a3aea16d4a', 'ABINGDON STW', 'UKENTH_TWU_TP000001', 'ABINGDON (OXON STW)'),\n", - " ('b428837dda', 'NAGS HEAD LANE STW', 'UKENTH_TWU_TP000115', 'NAGS HEAD LANE ( BRENTWOOD STW'),\n", - " ('3d27aa0af3', 'RIVERSIDE STW', 'UKENTH_TWU_TP000125', 'LONDON (Riverside STW)'),\n", - " ('cee1fb4e33', 'UU-07-SC40-GLAZE', 'UKENNW_UU_TP000053', 'GLAZEBURY STW'),\n", - " ('feec09aaaf', 'UU-07-SC40-LEIGH', 'UKENNW_UU_TP000078', 'LEIGH STW'),\n", - " ('3b82554080', 'UU-07-SC39-BOLTO', 'UKENNW_UU_TP000019', 'BOLTON STW'),\n", - " ('51bdbc0969', 'UU-07-SC37-BURYZ', 'UKENNW_UU_TP000026', 'BURY STW'),\n", - " ('f08942bc5a', 'UU-10-SC52-ASHUL', 'UKENNW_UU_TP000007', 'ASHTON-UNDER-LYNE STW'),\n", - " ('5d269b4c58', 'UU-10-SC52-DUKIN', 'UKENNW_UU_TP000044', 'DUKINFIELD STW'),\n", - " ('b24ca82746', 'RHIWSAESON (NEW)', 
'UKWAWA_WW_TP000024', 'RHIWSAESON STW RHIWSAESON LLANTRI STW'),\n", - " ('a7ecedaef5', 'COSLECH', 'UKWAWA_WW_TP000020', 'SOUTH ELY VALLEY STW'),\n", - " ('2e2e1b5104', 'Dalscone DOA', 'UKSC_TP00053', 'DALSCONE S.T.W. (NEW)'),\n", - " ('2e2e1b5104', 'Dalscone DOA', 'UKSC_TP00054', 'DALSCONE S.T.W. (OLD)'),\n", - " ('5f66794c55', 'Whilton', 'UKENAN_AW_TP000286', 'DAVENTRY STW'),\n", - " ('3b3e3efde8', 'BAKEWELL, PICKORY CORNER (WRW)', 'UKENMI_ST_TP000013', 'BAKEWELL STW'),\n", - " ('2eb71ffcfc', 'BOTTESFORD-STW', 'UKENMI_ST_TP000033', 'BOTTESFORD STW'),\n", - " ('e4a961fba1', 'KEMPSEY WORKS (WRW)', 'UKENMI_ST_TP000119', 'KEMPSEY STW'),\n", - " ('53fe082858', 'CLAYMILLS (WRW)', 'UKENMI_ST_TP000056', 'BURTON ON TRENT STW'),\n", - " ('54e25cb6a9', 'MELTON (WRW)', 'UKENMI_ST_TP000152', 'MELTON MOWBRAY STW'),\n", - " ('b0753764b4', 'STANLEY DOWNTON (WRW)', 'UKENMI_ST_TP000208', 'STROUD STW'),\n", - " ('b11e60b850', 'ROUNDHILL (WRW)', 'UKENMI_ST_TP000180', 'STOURBRIDGE & HALESOWEN STW'),\n", - " ('ffe8483a02', 'ASH VALE STW', 'UKENTH_TWU_TP000009', 'ASH VALE, STRATFORD ROAD, NORTH STW\"'),\n", - " ('e526b8f3b6', 'BRACKNELL STW', 'UKENTH_TWU_TP000025', 'BRACKNELL, HAZELWOOD LANE, BINF STW\"'),\n", - " ('29d52395a3', 'UU-04-SC21-PREST', 'UKENNW_UU_TP000102', 'PRESTON (CLIFTON MARSH) STW'),\n", - " ('c1e0195f63', 'WINDSOR STW', 'UKENTH_TWU_TP000152', 'WINDSOR, HAM ISLAND, OLD WINDSO STW\"'),\n", - " ('743841f528', 'CAMBERLEY STW', 'UKENTH_TWU_TP000033', 'CAMBERLEY, CAMBERLEY, SURREY STW\"'),\n", - " ('80c85e78c6', 'DORCHESTER STW', 'UKENTH_TWU_TP000056', 'DORCHESTER STW (OXON)'),\n", - " ('0f5561c4b9', 'STANFORD RIVERS STW', 'UKENTH_TWU_TP000136', 'STANFORD RIVERS, ONGAR, ESSEX STW\"'),\n", - " ('45b2ebfd2f', 'UU-03-SC17-KRKLO', 'UKENNW_UU_TP000151', 'KIRKBY LONSDALE STW HUMUS TANK EFFLUENT'),\n", - " ('935c2734be', 'MARLBOROUGH STW', 'UKENTH_TWU_TP000108', 'MARLBOROUGH, MARLBOROUGH, WILTS STW\"'),\n", - " ('ba3e9907f2', 'EARLSWOOD STW', 'UKENTH_TWU_TP000123', 'REIGATE 
STW'),\n", - " ('98a14c9923', 'STANTON - DERBYSHIRE (WRW)', 'UKENMI_ST_TP000201', 'STANTON STW'),\n", - " ('1d9dcc765f', 'WORMINGHALL STW', 'UKENTH_TWU_TP000158', 'WORMINGHALL, WORMINGHALL, BUCKS STW\"'),\n", - " ('a087889829', 'SUTTON BONNINGTON (WRW)', 'UKENMI_ST_TP000278', 'SUTTON BONINGTON STW, FE\"'),\n", - " ('fd7e5039c1', 'STANSTED MOUNTFITCHET STW', 'UKENTH_TWU_TP000137', 'STANSTED MOUNTFITCHET, STANSTED STW\"'),\n", - " ('0d09ae579a', 'RAMSBURY STW', 'UKENTH_TWU_TP000121', 'RAMSBURY, RAMSBURY, MARLBOROUGH STW\"'),\n", - " ('0de7c65bea', 'UU-06-SC31-HSKBK', 'UKENNW_UU_TP000060', 'HESKETH BANK STW'),\n", - " ('a8d4ccb494', 'CHICKENHALL EASTLEIGH', 'UKENSO_SW_TP000013', 'EASTLEIGH STW'),\n", - " ('0340b3c3c4', 'WALSALL WOOD (WRW)', 'UKENMI_ST_TP000221', 'WALSALL NORTH STW'),\n", - " ('47aebf7f1e', 'UU-03-SC18-LANCA', 'UKENNW_UU_TP000076', 'LANCASTER (STODDAY) STW'),\n", - " ('95fa57e680', 'MILL GREEN STW', 'UKENTH_TWU_TP000111', 'MILL GREEN, HATFIELD, HERTS STW\"'),\n", - " ('f01161329e', 'SLOUGH STW', 'UKENTH_TWU_TP000133', 'SLOUGH, WOOD STW\"'),\n", - " ('d2e0ce618b', 'FINSTOCK STW', 'UKENTH_TWU_TP000067', 'FINSTOCK, FINSTOCK, OXON STW\"'),\n", - " ('15831f78a8', 'WOTTON UNDER EDGE STW CATCHMENT', 'UKENSW_WXW_TP000109', 'WOTTON UNDER EDGE STW'),\n", - " ('8237f30715', 'STANFORD IN THE VALE STW', 'UKENTH_TWU_TP000169', 'Standford in the Vale STW'),\n", - " ('a1e551a2fd', 'MORESTEAD ROAD WINCHESTER', 'UKENSO_SW_TP000003', 'WINCHESTER CENTRAL AND SOUTH (MORESTEAD) STW'),\n", - " ('3b917f6c5f', 'BASINGSTOKE STW', 'UKENTH_TWU_TP000013', 'BASINGSTOKE, WILDMOOR, BASINGST STW\"'),\n", - " ('1903e1316b', 'FULLERTON', 'UKENSO_SW_TP000006', 'ANDOVER STW'),\n", - " ('e9216a235f', 'Fochabers DOA', 'UKSC_TP00077', 'FOCHABERS WWTP'),\n", - " ('965a041c6a', 'THAME STW', 'UKENTH_TWU_TP000140', 'THAME, THAME, OXON STW\"'),\n", - " ('c836b06a6e', 'GODALMING STW', 'UKENTH_TWU_TP000071', 'GODALMING, UNSTEAD, GODALMING, STW\"'),\n", - " ('bb40f7f215', 'UU-03-SC16-GRNGS', 
'UKENNW_UU_TP000055', 'GRANGE-OVER-SANDS STW'),\n", - " ('60cd9fef4c', 'DRENEWYDD - OSWESTRY (WRW)', 'UKENMI_ST_TP000166', 'OSWESTRY DRENEWYDD STW'),\n", - " ('9edf9e05e3', 'HIGHER HEATH-PREES (WRW)', 'UKENMI_ST_TP000268', 'PREES HIGHER HEATH STW'),\n", - " ('5b6ed3d068', 'RUSHMOOR (WRW)', 'UKENMI_ST_TP000184', 'TELFORD STW'),\n", - " ('d8b6f988ea', 'CHESHAM STW', 'UKENTH_TWU_TP000039', 'CHESHAM, BUCKS STW\"'),\n", - " ('04a2d621cd', 'UU-07-SC40-TYLDE', 'UKENNW_UU_TP000120', 'TYLDESLEY STW'),\n", - " ('5353ff23aa', 'APPLETON STW', 'UKENTH_TWU_TP000006', 'APPLETON, ABINGDON, OXON STW\"'),\n", - " ('8fccf79af1', 'WHITE WALTHAM STW', 'UKENTH_TWU_TP000150', 'WHITE WALTHAM, WHITE WALTHAM, B STW\"'),\n", - " ('a2a6cca4f6', 'POWICK (WRW)', 'UKENMI_ST_TP000173', 'POWICK NEW STW'),\n", - " ('c33bed15a5', 'Knostrop Merge High + Low', 'UKENNE_YW_TP000098', 'LEEDS (KNOSTROP) STW'),\n", - " ('11d6a4c3d9', 'HAYDEN (WRW)', 'UKENMI_ST_TP000256', 'CHELTENHAM STW'),\n", - " ('5b3b7769c8', 'Withernsea No. 2', 'UKENNE_YW_TP000139', 'WITHERNSEA OUTFALL STW'),\n", - " ('864e5eaa8d', 'TRIMDON VILLAGE STW NZ37346301', 'UKENNE_NU_TP000052', 'TRIMDON VILLAGE STW'),\n", - " ('d31b19b35b', 'NORTH TIDWORTH STW CATCHMENT', 'UKENSW_VE_TP000001', 'TIDWORTH GARRISON STW FE'),\n", - " ('f00ad841f9', 'PEACEHAVEN', 'UKENSO_SW_TP000126', 'PEACEHAVEN WASTEWATER TREATMENT WKS'),\n", - " # Morecambe is auto-matched to Middleton. But we also need the Morecambe catchment area itself.\n", - " # (cf. 
page 3 of https://www.unitedutilities.com/globalassets/documents/pdf/7013b-morcambe-a5-6pp-flyer-v10_acc17.pdf).\n", - " ('15d6d1fe75', 'UU-03-SC18-MOREC', 'UKENNW_UU_TP000092', 'MORECAMBE STW'),\n", - " # South West Water\n", - " ('86490b6f49', 'CULLOMPTON_STW_CULLOMPTON', 'UKENSW_SWS_TP000018', 'CULLOMPTON STW'),\n", - " ('e0cef5009e', 'BRADNINCH_STW_BRADNINCH', 'UKENSW_SWS_TP000085', 'BRADNINCH STW'),\n", - " ('9e49ba976b', 'WOODBURY_STW_WOODBURY', 'UKENSW_SWS_TP000083', 'WOODBURY STW'),\n", - " ('85685023c6', 'SCARLETTS WELL_STW_BODMIN', 'UKENSW_SWS_TP000005', 'BODMIN (SCARLETTS WELL) STW'),\n", - " ('8a6451d565', 'NANSTALLON_STW_BODMIN', 'UKENSW_SWS_TP000004', 'BODMIN (NANSTALLON) STW'),\n", - " ('f67cadab83', 'COUNTESS WEAR_STW_EXETER', 'UKENSW_SWS_TP000023', 'EXETER STW'),\n", - " # No good matches.\n", - " (None, None, 'UKSC_TP00011', 'ANNANDALE WATER MSA S.T.W.'),\n", - " (None, None, 'UKENTH_TWU_TP000162', 'ALDERSHOT MILITARY STW'),\n", - " (None, None, 'UKSC_TP00201', 'FASLANE STW'),\n", - " (None, None, 'UKSC_TP00085', 'GLENEAGLES STW'),\n", - " (None, None, 'UKWAWA_WW_TP000128', 'BLUESTONE LEISURE CANASTON'),\n", - " (None, None, 'UKENMI_ST_TP000227', 'WELSHPOOL STW'),\n", - " (None, None, 'UKSC_TP00171', 'SOUTHERNESS STW'),\n", - " (None, None, 'UKENMI_ST_TP000127', 'KNIGHTON STW'),\n", - " (None, None, 'UKENMI_ST_TP000133', 'LLANIDLOES STW'),\n", - " (None, None, 'UKENMI_ST_TP000161', 'NEWTOWN STW'),\n", - " (None, None, 'UKWAWA_WW_TP000011', 'MERTHYR TYDFIL STW'),\n", - "], columns=['identifier', 'name', 'uwwCode', 'uwwName'])\n", - "\n", - "matched = pd.concat([auto_matched, manual_matched])\n", - "unmatched = uwwtps[~uwwtps.uwwCode.isin(matched.uwwCode)]\n", - "unmatched" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a20c340f", - "metadata": {}, - "outputs": [], - "source": [ - "# Ensure that the matching accounts for exactly the right set of treatment works.\n", - "assert set(matched.uwwCode) == set(uwwtps.uwwCode)" - ] 
- }, - { - "cell_type": "code", - "execution_count": null, - "id": "67b0e472", - "metadata": {}, - "outputs": [], - "source": [ - "# Make sure there are no invalid identifiers amongst the manual matches. That might happen\n", - "# if the underlying data change, e.g. as part of the request via whatdotheyknow.com.\n", - "invalid_identifiers = set(manual_matched.identifier) - set(catchments.identifier) - {None}\n", - "if invalid_identifiers:\n", - " raise ValueError(f'found {len(invalid_identifiers)} invalid identifiers: {invalid_identifiers}')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "optimum-finnish", - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate the distance between the catchment and the stp to give \"a feel\" for how good the matching\n", - "# is.\n", - "catchments_indexed = catchments.set_index('identifier')\n", - "uwwtps_indexed = uwwtps.set_index('uwwCode')\n", - "matched['distance'] = matched.apply(\n", - " lambda x: None if pd.isnull(x.identifier) else\n", - " catchments_indexed.geometry.loc[x.identifier].distance(uwwtps_indexed.geometry.loc[x.uwwCode]), axis=1)\n", - "matched.sort_values('uwwCode').to_csv(ROOT / 'waterbase_catchment_lookup.csv', index=False)\n", - "# Show the 99% distance between matched catchments and stps.\n", - "matched.distance.quantile([.5, .99])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adapted-jersey", - "metadata": {}, - "outputs": [], - "source": [ - "# Show catchments that are matched to more than one treatment work. This should NEVER happen, unless\n", - "# the same treatment work has been assigned different identifiers by UWWTD.\n", - "\n", - "exceptions = [\n", - " '2e2e1b5104', # Dalscone New and Old.\n", - " 'a3aea16d4a', # Abingdon appears twice.\n", - " '8a5a88fe31', # Balby appears twice.\n", - " 'd8270e9ad8', # Catterick village and Colburn STWs are in the same location...\n", - " 'cd33d70a80', # ... 
and the catchment needs to be linked to both treatment works.\n", - "]\n", - "\n", - "counts = 0\n", - "for identifier, subset in matched.groupby('identifier'):\n", - " if identifier in exceptions:\n", - " continue\n", - " if len(subset) > 1:\n", - " print(subset)\n", - " counts += 1\n", - "\n", - "assert counts == 0, 'data doesn\\'t make sense'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "parental-yemen", - "metadata": {}, - "outputs": [], - "source": [ - "# Show treatment works broken down by company.\n", - "counts = pd.merge(matched, catchments, on='identifier', how='left').groupby('company').uwwCode.nunique()\n", - "print('total number of treatment works', counts.sum())\n", - "counts" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/match_waterbase_and_catchments.md b/match_waterbase_and_catchments.md new file mode 100644 index 0000000..dbdf5d1 --- /dev/null +++ b/match_waterbase_and_catchments.md @@ -0,0 +1,373 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.16.7 + kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +```python +import fiona # Needs to be imported first for geopandas to work. 
+import geopandas as gpd
+import matplotlib as mpl
+from matplotlib import pyplot as plt
+import numpy as np
+import pandas as pd
+import pathlib
+import re
+from tqdm.notebook import tqdm
+
+mpl.rcParams['figure.dpi'] = 144
+mpl.style.use('scrartcl.mplstyle')
+
+ROOT = pathlib.Path('data/wastewater_catchment_areas_public')
+```
+
+```python
+catchments = gpd.read_file(ROOT / 'catchments_consolidated.shp')
+catchments.head()
+```
+
+```python
+# Load the waterbase data.
+uwwtps = gpd.GeoDataFrame(pd.read_csv(ROOT / 'waterbase_consolidated.csv'))
+
+# Identify the treatment plants that were inactive in the most recent report.
+inactive = []
+for uwwCode, subset in uwwtps.groupby('uwwCode'):
+    subset = subset.sort_values('year')
+    item = subset.iloc[-1]
+    if item.uwwState == 'inactive':
+        inactive.append(item)
+inactive = pd.DataFrame(inactive)
+
+# Drop treatment works that are inactive.
+print(f'starting with {uwwtps.uwwCode.nunique()} treatment plants')
+uwwtps = uwwtps.loc[~np.isin(uwwtps.uwwCode, inactive.uwwCode)]
+print(f'retained {uwwtps.uwwCode.nunique()} treatment plants after removing inactive plants')
+
+# Drop all treatment works from Northern Ireland Water and Gibraltar.
+uwwtps = uwwtps[~(
+    uwwtps.uwwCode.str.startswith('UKNI') |
+    uwwtps.uwwCode.str.startswith('UKGIB')
+)]
+
+print(f'retained {uwwtps.uwwCode.nunique()} treatment plants after removing treatment plants owned '
+      'by water companies that did not supply data')
+
+# Ensure that all retained treatment plants have reported at least one record for capacity and load.
+grouped = uwwtps.groupby('uwwCode')
+np.testing.assert_array_less(1, grouped.uwwCapacity.max())
+np.testing.assert_array_less(1, grouped.uwwLoadEnteringUWWTP.max())
+
+# Reproject into the British National Grid.
+uwwtps['geometry'] = gpd.points_from_xy(uwwtps.uwwLongitude, uwwtps.uwwLatitude) +uwwtps = gpd.GeoDataFrame(uwwtps).set_crs('epsg:4326').to_crs('epsg:27700') + +# Group by treatment code and take the most recent data. +uwwtps = uwwtps.sort_values('year').groupby('uwwCode').last().reset_index() +uwwtps.head() +``` + +```python +# Evaluate the distance matrix between catchments and treatment works. +distances = [] +for _, catchment in tqdm(catchments.iterrows(), total=len(catchments)): + distances.append(uwwtps.distance(catchment.geometry, align=False)) +# Cast to a dataframe. +distances = np.asarray(distances) +distances = pd.DataFrame(distances, index=catchments.identifier, columns=uwwtps.uwwCode) +distances.shape +``` + +```python +# Sweep over different distance thresholds and evaluate how many entities are close to each other +# at that radius. +thresholds = np.logspace(1, 3, 50) +num_uwwtps_near_catchment = [] +num_catchments_near_uwwtp = [] +for threshold in thresholds: + near = distances < threshold + num_uwwtps_near_catchment.append(near.sum(axis=1)) + num_catchments_near_uwwtp.append(near.sum(axis=0)) + +num_uwwtps_near_catchment = np.asarray(num_uwwtps_near_catchment) +num_catchments_near_uwwtp = np.asarray(num_catchments_near_uwwtp) +assert num_uwwtps_near_catchment.shape == (len(thresholds), len(catchments)) +assert num_catchments_near_uwwtp.shape == (len(thresholds), len(uwwtps)) +``` + +```python +# Show the fraction of UWWTPs that have exactly one catchment within a distance given by the +# threshold. As expected, there is an optimal distance: small thresholds are too restrictive whereas +# large ones "confuse" different catchments. 
+fig, ax = plt.subplots()
+fraction_single_catchment_near_uwwtp = np.mean(num_catchments_near_uwwtp == 1, axis=1)
+line, = ax.plot(thresholds, fraction_single_catchment_near_uwwtp)
+i = np.argmax(fraction_single_catchment_near_uwwtp)
+ax.scatter(thresholds[i], fraction_single_catchment_near_uwwtp[i], color=line.get_color())
+ax.set_xscale('log')
+ax.set_xlabel('Thresholds (metres)')
+ax.set_ylabel('Fraction of UWWTPs with unique\ncatchment @ threshold')
+```
+
+```python
+# Find all combinations of catchments and treatment plants where both the treatment plant
+# and the catchment are associated with exactly one other treatment plant/catchment at a given
+# distance.
+thresholds = np.linspace(0, 500, 50)
+num = []
+for threshold in tqdm(thresholds):
+    binary = (distances <= threshold).values
+    f = (binary.sum(axis=1) == 1)[:, None] & (binary.sum(axis=0) == 1) & binary
+    catchment_idx, stp_idx = np.nonzero(f)
+    num.append(len(catchment_idx))
+
+fig, ax = plt.subplots()
+ax.plot(thresholds, num)
+ax.set_xlabel('Thresholds (metres)')
+ax.set_ylabel('# unique pairs @ threshold')
+```
+
+```python
+# Step 1: distance-based matching.
+
+# Find all combinations of catchments and treatment plants where both the treatment plant and the
+# catchment are associated with exactly one other treatment plant/catchment below a given distance.
+threshold = 100
+binary = (distances <= threshold).values
+f_distance = (binary.sum(axis=1) == 1)[:, None] & (binary.sum(axis=0) == 1) & binary
+print('distance matched', f_distance.sum())
+
+# Validate uniqueness to ensure the logic is consistent.
+np.testing.assert_array_less(f_distance.sum(axis=0), 2)
+np.testing.assert_array_less(f_distance.sum(axis=1), 2)
+
+def normalise_name(value):
+    """
+    Normalise treatment works names by removing special characters and common abbreviations.
+    """
+    if pd.isnull(value):
+        return value
+    # Drop all special characters.
+ value, _ = re.subn(r'[\(\)\[\]\{\}\.\&-]', '', value.lower()) + # Drop stw and wwtw (often suffixes). + value, _ = re.subn(r'\b(stw|wwtw|doa|wrw|bucks)\b', '', value) + # Drop all whitespace. + value, _ = re.subn(r'\s', '', value) + return value + +# Step 2: name-based matching. + +# Add all treatment works where the normalised names are identical and the distance is less than +# another threshold. +f_name = (catchments['name'].apply(normalise_name).values[:, None] == uwwtps.uwwName.apply(normalise_name).values) \ + & (distances.values <= 2500) + +print('name matched', f_name.sum()) + +# Create the auto-matched lookup table. +catchment_idx, stp_idx = np.nonzero(f_distance | f_name) +auto_matched = pd.concat([ + uwwtps.iloc[stp_idx].reset_index(drop=True), + catchments.iloc[catchment_idx].reset_index(drop=True).drop('geometry', axis=1) +], axis=1)[['identifier', 'name', 'uwwCode', 'uwwName']] +print('total automatically matched', auto_matched.uwwCode.nunique()) +auto_matched.head() +``` + +```python +manual_matched = pd.DataFrame([ + ('214873b5cf', 'UU-09-SC47-DAVYH', 'UKENNW_UU_TP000042', 'Manchester and Salford (Davyhulme)STW'), + ('d793a815d9', 'UU-10-SC50-ROYTO', 'UKENNW_UU_TP000105', 'ROYTON STW'), + ('b65d2eebc7', 'UU-10-SC51-FAILS', 'UKENNW_UU_TP000048', 'FAILSWORTH STW'), + ('bb6ca16304', 'UU-10-SC54-HAZEL', 'UKENNW_UU_TP000058', 'HAZEL GROVE STW'), + ('59bd836fc3', 'UU-10-SC54-STOCK', 'UKENNW_UU_TP000116', 'STOCKPORT STW'), + ('5900c96825', 'Shenfield and Hutton', 'UKENAN_AW_TP000230', 'SHENFIELD STW'), + ('4647e75f9e', 'CARDIFF BAY', 'UKWAWA_WW_TP000026', 'CARDIFF STW'), + # These two are the same treatment work. 
+ ('cd33d70a80', 'Catterick Village', 'UKENNE_YW_TP000007', 'CATTERICK STW'), + ('cd33d70a80', 'Catterick Village', 'UKENNE_YW_TP000008', 'COLBURN STW'), + ('27ec2163de', 'UU-11-SC56-NTHWI', 'UKENNW_UU_TP000097', 'NORTHWICH STW'), + # Cuddington has been decommissioned and now feeds into Northwich, but it's all a bit tricky + # because there can be pumping going on between the different treatment plants + # (https://www.unitedutilities.com/about-us/cheshire/). + (None, None, 'UKENNW_UU_TP000040', 'CUDDINGTON STW'), + # Cotgrave feeds into Radcliffe. + # https://www.nottinghamshire.gov.uk/planningsearch/DisplayImage.aspx?doc=cmVjb3JkX251bWJlcj02OTU4JmZpbGVuYW1lPVxcbnMwMS0wMDI5XGZpbGVkYXRhMiRcREIwMy0wMDMwXFNoYXJlZEFwcHNcRExHU1xQbGFuc1xQTEFOTklOR1xTY3ItMzY0OFxSYWRjbGlmZmUgRUlBIFNjcmVlbmluZyBPcGluaW9uIFJlcXVlc3QucGRmJmltYWdlX251bWJlcj01JmltYWdlX3R5cGU9cGxhbm5pbmcmbGFzdF9tb2RpZmllZF9mcm9tX2Rpc2s9MTIvMDQvMjAxNyAxMDozMzo0MQ== + (None, None, 'UKENMI_ST_TP000064', 'COTGRAVE STW'), + ('fb13da240a', 'MANSFIELD-BATH LANE (WRW)', 'UKENMI_ST_TP000143', 'MANSFIELD STW'), + # https://goo.gl/maps/pFhJy3ZAxyABKwWc6 + ('9fcbac5ecc', 'STOKE BARDOLPH (WRW)', 'UKENMI_ST_TP000163', 'NOTTINGHAM STW'), + ('7be11f772b', 'BEESTON -LILAC GROVE (WRW)', 'UKENMI_ST_TP000025', 'BEESTON STW'), + ('c861b125a1', 'STRATFORD-MILCOTE (WRW)', 'UKENMI_ST_TP000206', 'STRATFORD STW'), + # Long Marston has been decommissioned. 
+ # https://waterprojectsonline.com/custom_case_study/pebworth-long-marston-stratford-upon-avon-transfer/ + (None, None, 'UKENMI_ST_TP000135', 'LONG MARSTON STW'), + ('01bf76d616', 'MAPLE LODGE STW', 'UKENTH_TWU_TP000106', 'MAPLE LODGE, BUCKS STW"'), + ('034de6cd3b', 'Eastwood', 'UKENNE_YW_TP000079', 'TODMORDEN STW'), + ('c64f685e8b', 'Neiley', 'UKENNE_YW_TP000109', 'HOLMFIRTH STW'), + ('7e5b0d4c85', 'UU-05-SC28-BLACK', 'UKENNW_UU_TP000018', 'BLACKBURN STW'), + ('ccad6a0777', 'UU-05-SC28-DARWE', 'UKENNW_UU_TP000041', 'DARWEN STW'), + ('31ec9a8fc2', 'UU-06-SC35-HUYTO', 'UKENNW_UU_TP000066', 'HUYTON STW'), + ('a83266015d', 'UU-06-SC35-LIVER', 'UKENNW_UU_TP000080', 'LIVERPOOL SOUTH [WOOLTON]) STW'), + ('0d2d457ce0', 'MARGATE AND BROADSTAIRS', 'UKENSO_SW_TP000019', 'BROADSTAIRS/MARGATE OUT CSM (Weatherlees B)'), + ('bbd09a27de', 'WEATHERLEES HILL', 'UKENSO_SW_TP000022', 'RAMSGATE, SANDWICH, DEAL STW"'), + ('00e1c4c53b', 'UU-09-SC46-ECCLE', 'UKENNW_UU_TP000046', 'ECCLES STW'), + ('feab30c503', 'UU-07-SC40-WORSL', 'UKENNW_UU_TP000140', 'WORSLEY STW'), + ('a3aea16d4a', 'ABINGDON STW', 'UKENTH_TWU_TP000001', 'ABINGDON (OXON STW)'), + ('b428837dda', 'NAGS HEAD LANE STW', 'UKENTH_TWU_TP000115', 'NAGS HEAD LANE ( BRENTWOOD STW'), + ('3d27aa0af3', 'RIVERSIDE STW', 'UKENTH_TWU_TP000125', 'LONDON (Riverside STW)'), + ('cee1fb4e33', 'UU-07-SC40-GLAZE', 'UKENNW_UU_TP000053', 'GLAZEBURY STW'), + ('feec09aaaf', 'UU-07-SC40-LEIGH', 'UKENNW_UU_TP000078', 'LEIGH STW'), + ('3b82554080', 'UU-07-SC39-BOLTO', 'UKENNW_UU_TP000019', 'BOLTON STW'), + ('51bdbc0969', 'UU-07-SC37-BURYZ', 'UKENNW_UU_TP000026', 'BURY STW'), + ('f08942bc5a', 'UU-10-SC52-ASHUL', 'UKENNW_UU_TP000007', 'ASHTON-UNDER-LYNE STW'), + ('5d269b4c58', 'UU-10-SC52-DUKIN', 'UKENNW_UU_TP000044', 'DUKINFIELD STW'), + ('b24ca82746', 'RHIWSAESON (NEW)', 'UKWAWA_WW_TP000024', 'RHIWSAESON STW RHIWSAESON LLANTRI STW'), + ('a7ecedaef5', 'COSLECH', 'UKWAWA_WW_TP000020', 'SOUTH ELY VALLEY STW'), + ('2e2e1b5104', 'Dalscone DOA', 
'UKSC_TP00053', 'DALSCONE S.T.W. (NEW)'), + ('2e2e1b5104', 'Dalscone DOA', 'UKSC_TP00054', 'DALSCONE S.T.W. (OLD)'), + ('5f66794c55', 'Whilton', 'UKENAN_AW_TP000286', 'DAVENTRY STW'), + ('3b3e3efde8', 'BAKEWELL, PICKORY CORNER (WRW)', 'UKENMI_ST_TP000013', 'BAKEWELL STW'), + ('2eb71ffcfc', 'BOTTESFORD-STW', 'UKENMI_ST_TP000033', 'BOTTESFORD STW'), + ('e4a961fba1', 'KEMPSEY WORKS (WRW)', 'UKENMI_ST_TP000119', 'KEMPSEY STW'), + ('53fe082858', 'CLAYMILLS (WRW)', 'UKENMI_ST_TP000056', 'BURTON ON TRENT STW'), + ('54e25cb6a9', 'MELTON (WRW)', 'UKENMI_ST_TP000152', 'MELTON MOWBRAY STW'), + ('b0753764b4', 'STANLEY DOWNTON (WRW)', 'UKENMI_ST_TP000208', 'STROUD STW'), + ('b11e60b850', 'ROUNDHILL (WRW)', 'UKENMI_ST_TP000180', 'STOURBRIDGE & HALESOWEN STW'), + ('ffe8483a02', 'ASH VALE STW', 'UKENTH_TWU_TP000009', 'ASH VALE, STRATFORD ROAD, NORTH STW"'), + ('e526b8f3b6', 'BRACKNELL STW', 'UKENTH_TWU_TP000025', 'BRACKNELL, HAZELWOOD LANE, BINF STW"'), + ('29d52395a3', 'UU-04-SC21-PREST', 'UKENNW_UU_TP000102', 'PRESTON (CLIFTON MARSH) STW'), + ('c1e0195f63', 'WINDSOR STW', 'UKENTH_TWU_TP000152', 'WINDSOR, HAM ISLAND, OLD WINDSO STW"'), + ('743841f528', 'CAMBERLEY STW', 'UKENTH_TWU_TP000033', 'CAMBERLEY, CAMBERLEY, SURREY STW"'), + ('80c85e78c6', 'DORCHESTER STW', 'UKENTH_TWU_TP000056', 'DORCHESTER STW (OXON)'), + ('0f5561c4b9', 'STANFORD RIVERS STW', 'UKENTH_TWU_TP000136', 'STANFORD RIVERS, ONGAR, ESSEX STW"'), + ('45b2ebfd2f', 'UU-03-SC17-KRKLO', 'UKENNW_UU_TP000151', 'KIRKBY LONSDALE STW HUMUS TANK EFFLUENT'), + ('935c2734be', 'MARLBOROUGH STW', 'UKENTH_TWU_TP000108', 'MARLBOROUGH, MARLBOROUGH, WILTS STW"'), + ('ba3e9907f2', 'EARLSWOOD STW', 'UKENTH_TWU_TP000123', 'REIGATE STW'), + ('98a14c9923', 'STANTON - DERBYSHIRE (WRW)', 'UKENMI_ST_TP000201', 'STANTON STW'), + ('1d9dcc765f', 'WORMINGHALL STW', 'UKENTH_TWU_TP000158', 'WORMINGHALL, WORMINGHALL, BUCKS STW"'), + ('a087889829', 'SUTTON BONNINGTON (WRW)', 'UKENMI_ST_TP000278', 'SUTTON BONINGTON STW, FE"'), + ('fd7e5039c1', 
'STANSTED MOUNTFITCHET STW', 'UKENTH_TWU_TP000137', 'STANSTED MOUNTFITCHET, STANSTED STW"'), + ('0d09ae579a', 'RAMSBURY STW', 'UKENTH_TWU_TP000121', 'RAMSBURY, RAMSBURY, MARLBOROUGH STW"'), + ('0de7c65bea', 'UU-06-SC31-HSKBK', 'UKENNW_UU_TP000060', 'HESKETH BANK STW'), + ('a8d4ccb494', 'CHICKENHALL EASTLEIGH', 'UKENSO_SW_TP000013', 'EASTLEIGH STW'), + ('0340b3c3c4', 'WALSALL WOOD (WRW)', 'UKENMI_ST_TP000221', 'WALSALL NORTH STW'), + ('47aebf7f1e', 'UU-03-SC18-LANCA', 'UKENNW_UU_TP000076', 'LANCASTER (STODDAY) STW'), + ('95fa57e680', 'MILL GREEN STW', 'UKENTH_TWU_TP000111', 'MILL GREEN, HATFIELD, HERTS STW"'), + ('f01161329e', 'SLOUGH STW', 'UKENTH_TWU_TP000133', 'SLOUGH, WOOD STW"'), + ('d2e0ce618b', 'FINSTOCK STW', 'UKENTH_TWU_TP000067', 'FINSTOCK, FINSTOCK, OXON STW"'), + ('15831f78a8', 'WOTTON UNDER EDGE STW CATCHMENT', 'UKENSW_WXW_TP000109', 'WOTTON UNDER EDGE STW'), + ('8237f30715', 'STANFORD IN THE VALE STW', 'UKENTH_TWU_TP000169', 'Standford in the Vale STW'), + ('a1e551a2fd', 'MORESTEAD ROAD WINCHESTER', 'UKENSO_SW_TP000003', 'WINCHESTER CENTRAL AND SOUTH (MORESTEAD) STW'), + ('3b917f6c5f', 'BASINGSTOKE STW', 'UKENTH_TWU_TP000013', 'BASINGSTOKE, WILDMOOR, BASINGST STW"'), + ('1903e1316b', 'FULLERTON', 'UKENSO_SW_TP000006', 'ANDOVER STW'), + ('e9216a235f', 'Fochabers DOA', 'UKSC_TP00077', 'FOCHABERS WWTP'), + ('965a041c6a', 'THAME STW', 'UKENTH_TWU_TP000140', 'THAME, THAME, OXON STW"'), + ('c836b06a6e', 'GODALMING STW', 'UKENTH_TWU_TP000071', 'GODALMING, UNSTEAD, GODALMING, STW"'), + ('bb40f7f215', 'UU-03-SC16-GRNGS', 'UKENNW_UU_TP000055', 'GRANGE-OVER-SANDS STW'), + ('60cd9fef4c', 'DRENEWYDD - OSWESTRY (WRW)', 'UKENMI_ST_TP000166', 'OSWESTRY DRENEWYDD STW'), + ('9edf9e05e3', 'HIGHER HEATH-PREES (WRW)', 'UKENMI_ST_TP000268', 'PREES HIGHER HEATH STW'), + ('5b6ed3d068', 'RUSHMOOR (WRW)', 'UKENMI_ST_TP000184', 'TELFORD STW'), + ('d8b6f988ea', 'CHESHAM STW', 'UKENTH_TWU_TP000039', 'CHESHAM, BUCKS STW"'), + ('04a2d621cd', 'UU-07-SC40-TYLDE', 'UKENNW_UU_TP000120', 
'TYLDESLEY STW'), + ('5353ff23aa', 'APPLETON STW', 'UKENTH_TWU_TP000006', 'APPLETON, ABINGDON, OXON STW"'), + ('8fccf79af1', 'WHITE WALTHAM STW', 'UKENTH_TWU_TP000150', 'WHITE WALTHAM, WHITE WALTHAM, B STW"'), + ('a2a6cca4f6', 'POWICK (WRW)', 'UKENMI_ST_TP000173', 'POWICK NEW STW'), + ('c33bed15a5', 'Knostrop Merge High + Low', 'UKENNE_YW_TP000098', 'LEEDS (KNOSTROP) STW'), + ('11d6a4c3d9', 'HAYDEN (WRW)', 'UKENMI_ST_TP000256', 'CHELTENHAM STW'), + ('5b3b7769c8', 'Withernsea No. 2', 'UKENNE_YW_TP000139', 'WITHERNSEA OUTFALL STW'), + ('864e5eaa8d', 'TRIMDON VILLAGE STW NZ37346301', 'UKENNE_NU_TP000052', 'TRIMDON VILLAGE STW'), + ('d31b19b35b', 'NORTH TIDWORTH STW CATCHMENT', 'UKENSW_VE_TP000001', 'TIDWORTH GARRISON STW FE'), + ('f00ad841f9', 'PEACEHAVEN', 'UKENSO_SW_TP000126', 'PEACEHAVEN WASTEWATER TREATMENT WKS'), + # Morecambe is auto-matched to Middleton. But we also need the Morecambe catchment area itself. + # (cf. page 3 of https://www.unitedutilities.com/globalassets/documents/pdf/7013b-morcambe-a5-6pp-flyer-v10_acc17.pdf). + ('15d6d1fe75', 'UU-03-SC18-MOREC', 'UKENNW_UU_TP000092', 'MORECAMBE STW'), + # South West Water + ('86490b6f49', 'CULLOMPTON_STW_CULLOMPTON', 'UKENSW_SWS_TP000018', 'CULLOMPTON STW'), + ('e0cef5009e', 'BRADNINCH_STW_BRADNINCH', 'UKENSW_SWS_TP000085', 'BRADNINCH STW'), + ('9e49ba976b', 'WOODBURY_STW_WOODBURY', 'UKENSW_SWS_TP000083', 'WOODBURY STW'), + ('85685023c6', 'SCARLETTS WELL_STW_BODMIN', 'UKENSW_SWS_TP000005', 'BODMIN (SCARLETTS WELL) STW'), + ('8a6451d565', 'NANSTALLON_STW_BODMIN', 'UKENSW_SWS_TP000004', 'BODMIN (NANSTALLON) STW'), + ('f67cadab83', 'COUNTESS WEAR_STW_EXETER', 'UKENSW_SWS_TP000023', 'EXETER STW'), + # No good matches. 
+    (None, None, 'UKSC_TP00011', 'ANNANDALE WATER MSA S.T.W.'),
+    (None, None, 'UKENTH_TWU_TP000162', 'ALDERSHOT MILITARY STW'),
+    (None, None, 'UKSC_TP00201', 'FASLANE STW'),
+    (None, None, 'UKSC_TP00085', 'GLENEAGLES STW'),
+    (None, None, 'UKWAWA_WW_TP000128', 'BLUESTONE LEISURE CANASTON'),
+    (None, None, 'UKENMI_ST_TP000227', 'WELSHPOOL STW'),
+    (None, None, 'UKSC_TP00171', 'SOUTHERNESS STW'),
+    (None, None, 'UKENMI_ST_TP000127', 'KNIGHTON STW'),
+    (None, None, 'UKENMI_ST_TP000133', 'LLANIDLOES STW'),
+    (None, None, 'UKENMI_ST_TP000161', 'NEWTOWN STW'),
+    (None, None, 'UKWAWA_WW_TP000011', 'MERTHYR TYDFIL STW'),
+], columns=['identifier', 'name', 'uwwCode', 'uwwName'])
+
+matched = pd.concat([auto_matched, manual_matched])
+unmatched = uwwtps[~uwwtps.uwwCode.isin(matched.uwwCode)]
+unmatched
+```
+
+```python
+# Ensure that the matching accounts for exactly the right set of treatment works.
+assert set(matched.uwwCode) == set(uwwtps.uwwCode)
+```
+
+```python
+# Make sure there are no invalid identifiers amongst the manual matches. That might happen
+# if the underlying data change, e.g. as part of the request via whatdotheyknow.com.
+invalid_identifiers = set(manual_matched.identifier) - set(catchments.identifier) - {None}
+if invalid_identifiers:
+    raise ValueError(f'found {len(invalid_identifiers)} invalid identifiers: {invalid_identifiers}')
+```
+
+```python
+# Evaluate the distance between each catchment and its matched treatment works to gauge the quality
+# of the matching.
+catchments_indexed = catchments.set_index('identifier')
+uwwtps_indexed = uwwtps.set_index('uwwCode')
+matched['distance'] = matched.apply(
+    lambda x: None if pd.isnull(x.identifier) else
+    catchments_indexed.geometry.loc[x.identifier].distance(uwwtps_indexed.geometry.loc[x.uwwCode]), axis=1)
+matched.sort_values('uwwCode').to_csv(ROOT / 'waterbase_catchment_lookup.csv', index=False)
+# Show the median and 99th percentile distances between matched catchments and treatment works.
+matched.distance.quantile([.5, .99]) +``` + +```python +# Show catchments that are matched to more than one treatment work. This should NEVER happen, unless +# the same treatment work has been assigned different identifiers by UWWTD. + +exceptions = [ + '2e2e1b5104', # Dalscone New and Old. + 'a3aea16d4a', # Abingdon appears twice. + '8a5a88fe31', # Balby appears twice. + 'd8270e9ad8', # Catterick village and Colburn STWs are in the same location... + 'cd33d70a80', # ... and the catchment needs to be linked to both treatment works. +] + +counts = 0 +for identifier, subset in matched.groupby('identifier'): + if identifier in exceptions: + continue + if len(subset) > 1: + print(subset) + counts += 1 + +assert counts == 0, 'data doesn\'t make sense' +``` + +```python +# Show treatment works broken down by company. +counts = pd.merge(matched, catchments, on='identifier', how='left').groupby('company').uwwCode.nunique() +print('total number of treatment works', counts.sum()) +counts +``` diff --git a/requirements.in b/requirements.in index f2e298a..adde995 100644 --- a/requirements.in +++ b/requirements.in @@ -9,6 +9,7 @@ networkx # ValueError: Unable to avoid copy while creating an array as requested. 
numpy<2 jupyter +jupytext pandas pyproj rtree diff --git a/requirements.txt b/requirements.txt index 29dad28..a40f7cf 100644 --- a/requirements.txt +++ b/requirements.txt @@ -173,8 +173,14 @@ jupyterlab-server==2.27.3 # notebook jupyterlab-widgets==3.0.13 # via ipywidgets +jupytext==1.16.7 + # via -r requirements.in kiwisolver==1.4.8 # via matplotlib +markdown-it-py==3.0.0 + # via + # jupytext + # mdit-py-plugins markupsafe==3.0.2 # via # jinja2 @@ -187,6 +193,10 @@ matplotlib-inline==0.1.7 # ipython mccabe==0.7.0 # via flake8 +mdit-py-plugins==0.4.2 + # via jupytext +mdurl==0.1.2 + # via markdown-it-py mistune==3.1.2 # via nbconvert nbclient==0.10.2 @@ -198,6 +208,7 @@ nbconvert==7.16.6 nbformat==5.10.4 # via # jupyter-server + # jupytext # nbclient # nbconvert nest-asyncio==1.6.0 @@ -228,6 +239,7 @@ packaging==24.2 # jupyter-server # jupyterlab # jupyterlab-server + # jupytext # matplotlib # nbconvert pandas==2.2.3 @@ -286,7 +298,9 @@ python-json-logger==3.2.1 pytz==2025.1 # via pandas pyyaml==6.0.2 - # via jupyter-events + # via + # jupyter-events + # jupytext pyzmq==26.2.1 # via # ipykernel From d5224df7a679ff30596e97c8fdca84534e348be8 Mon Sep 17 00:00:00 2001 From: Till Hoffmann Date: Fri, 21 Feb 2025 13:23:17 -0500 Subject: [PATCH 4/4] Simplify Makefile targets. --- Makefile | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/Makefile b/Makefile index b0b0e2c..924689d 100644 --- a/Makefile +++ b/Makefile @@ -1,8 +1,9 @@ .PHONY : data data/geoportal.statistics.gov.uk data/ons.gov.uk data/eea.europa.eu \ data/raw_catchments data/validation data.shasum -EXECUTE_NB = cd $(dir $<) \ - && jupytext --to ipynb --output - $(notdir $<) \ +# This only works for notebooks at the root level. 
+EXECUTE_NB = mkdir -p $(dir $@) \ + && jupytext --to ipynb --output - $< \ | jupyter nbconvert --stdin --execute --to html --output $@ OUTPUT_ROOT = data/wastewater_catchment_areas_public CURL = curl -L --retry 3 --retry-all-errors @@ -127,31 +128,27 @@ analysis : workspace/consolidate_waterbase.html \ ${OUTPUT_ROOT} : mkdir -p $@ -workspace : - mkdir -p $@ +${OUTPUT_ROOT}/waterbase_consolidated.csv : workspace/consolidate_waterbase.html +${OUTPUT_ROOT}/catchments_consolidated.shp overview.pdf : workspace/consolidate_catchments.html +${OUTPUT_ROOT}/waterbase_catchment_lookup.csv : workspace/match_waterbase_and_catchments.html +${OUTPUT_ROOT}/lsoa_coverage.csv ${OUTPUT_ROOT}/lsoa_catchment_lookup.csv : workspace/match_catchments_and_lsoas.html +${OUTPUT_ROOT}/population_estimates.csv population_estimates.pdf estimation_method.pdf : workspace/estimate_population.html -workspace/consolidate_waterbase.html ${OUTPUT_ROOT}/waterbase_consolidated.csv \ - : consolidate_waterbase.md workspace ${OUTPUT_ROOT} data/eea.europa.eu +workspace/consolidate_waterbase.html : consolidate_waterbase.md ${OUTPUT_ROOT} data/eea.europa.eu ${EXECUTE_NB} $< -workspace/consolidate_catchments.html ${OUTPUT_ROOT}/catchments_consolidated.shp overview.pdf \ - : consolidate_catchments.md workspace ${OUTPUT_ROOT} data/raw_catchments +workspace/consolidate_catchments.html : consolidate_catchments.md ${OUTPUT_ROOT} data/raw_catchments ${EXECUTE_NB} $< -workspace/match_waterbase_and_catchments.html ${OUTPUT_ROOT}/waterbase_catchment_lookup.csv \ - : match_waterbase_and_catchments.md workspace ${OUTPUT_ROOT} \ +workspace/match_waterbase_and_catchments.html : match_waterbase_and_catchments.md ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/catchments_consolidated.shp ${OUTPUT_ROOT}/waterbase_consolidated.csv ${EXECUTE_NB} $< -workspace/match_catchments_and_lsoas.html ${OUTPUT_ROOT}/lsoa_coverage.csv \ - ${OUTPUT_ROOT}/lsoa_catchment_lookup.csv \ - : match_catchments_and_lsoas.md workspace ${OUTPUT_ROOT} \ 
+workspace/match_catchments_and_lsoas.html : match_catchments_and_lsoas.md ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/catchments_consolidated.shp data/geoportal.statistics.gov.uk ${EXECUTE_NB} $< -workspace/estimate_population.html ${OUTPUT_ROOT}/population_estimates.csv \ - population_estimates.pdf estimation_method.pdf \ - : estimate_population.md data/ons.gov.uk workspace ${OUTPUT_ROOT} \ +workspace/estimate_population.html : estimate_population.md data/ons.gov.uk ${OUTPUT_ROOT} \ ${OUTPUT_ROOT}/lsoa_catchment_lookup.csv ${OUTPUT_ROOT}/lsoa_coverage.csv \ ${OUTPUT_ROOT}/waterbase_catchment_lookup.csv ${EXECUTE_NB} $<
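The distance-based matching step in the notebook above keeps a catchment/treatment-works pair only if each is the other's sole neighbour within the threshold. A minimal sketch of that mutual-uniqueness filter on a toy distance matrix (the values below are invented purely for illustration; the real notebook uses the full pairwise distance matrix):

```python
import numpy as np

# Toy distance matrix in metres: rows are catchments, columns are treatment works.
distances = np.array([
    [50.0, 900.0, 900.0],  # catchment 0: close to works 0 only.
    [900.0, 60.0, 80.0],   # catchment 1: close to works 1 and 2 (ambiguous).
    [900.0, 900.0, 70.0],  # catchment 2: close to works 2, which catchment 1 is also near.
])
threshold = 100
binary = distances <= threshold
# Keep a pair only if the catchment is within the threshold of exactly one works
# AND the works is within the threshold of exactly one catchment.
f = (binary.sum(axis=1) == 1)[:, None] & (binary.sum(axis=0) == 1) & binary
catchment_idx, works_idx = np.nonzero(f)
print(list(zip(catchment_idx, works_idx)))  # only the unambiguous pair survives
```

Because both row and column sums must equal one, every surviving pair is unique in both directions, which is what the `assert_array_less(..., 2)` checks in the notebook verify.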
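The name-based matching step relies on the `normalise_name` helper defined in the notebook. The standalone sketch below reproduces the same regular expressions but checks `value is None` instead of `pd.isnull` so it runs without pandas; the example names are invented to show how two differently formatted spellings collapse to the same key:

```python
import re

def normalise_name(value):
    """Normalise a treatment works name for fuzzy comparison."""
    if value is None:
        return value
    # Drop special characters such as brackets, dots, ampersands, and hyphens.
    value = re.sub(r'[\(\)\[\]\{\}\.\&-]', '', value.lower())
    # Drop common suffixes and abbreviations such as "stw" and "wwtw".
    value = re.sub(r'\b(stw|wwtw|doa|wrw|bucks)\b', '', value)
    # Drop all whitespace.
    value = re.sub(r'\s', '', value)
    return value

print(normalise_name('Hazel Grove STW'))       # hazelgrove
print(normalise_name('HAZEL-GROVE W.W.T.W.'))  # hazelgrove
```

Normalised names are then compared for exact equality, with the additional 2.5 km distance cut guarding against identically named works in different regions.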