55 clean up repo and add walk through #56
Conversation
mshron left a comment
Leaving review on the first two files, haven't gotten to the second two yet, but I didn't want to make you wait.
import pandas as pd
from shapely.ops import unary_union

timestamp = datetime.now().strftime("%Y%m%d")
It looks like this isn't used anywhere now?
MIN_START = "2026-01-01"
# --- Code Cell 5 ---

# Find the most recent Peoples Gas data file
I don't like seeing these things all in the global namespace like this. If you ever tried to import this file, it would immediately run all of the commands. Any reason not to wrap this all in a main() and add an if __name__ == "__main__": main() guard at the end?
I guess I just never planned on importing this. The reason for these separate preprocessing scripts was to import fewer and smaller files into geo_data_cleaning.qmd after I kept running into memory issues. But that is an easy change to make.
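A minimal sketch of the structure being suggested (the step comments are placeholders for the work the scripts currently do at module level):

```python
def main() -> None:
    # 1. Read inputs (currently module-level code).
    # 2. Run the cleaning / matching steps.
    # 3. Write the timestamped outputs.
    ...


if __name__ == "__main__":
    # Importing the module no longer runs the pipeline; only a direct
    # invocation (e.g. `python notebooks/match_parcels_buildings.py`) does.
    main()
```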
from shapely.ops import unary_union

# Threshold for unioning overlapping polygons: 1 square meter
# Note: This is applied in UTM Zone 16N (EPSG:32616) which uses meters, so area is in square meters
Comment for line 118 (but I can't comment on it directly since it's not part of this PR):
Since I'm not sure how you're going to represent your graphs, I strongly suggest at least putting in type hints for this function. Up until now, everything has been a dataframe; what is our graph? I can read ahead to figure it out, but it'd be easier to be able to read this in one go.
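For illustration only: assuming the graph ends up as an adjacency mapping from polygon index to the indices of the polygons it overlaps (the actual representation may differ), the kind of annotated signature being asked for might look like the sketch below. The function name and parameters are hypothetical.

```python
from shapely.geometry import Polygon


def union_overlapping(
    polygons: list[Polygon],
    overlap_graph: dict[int, set[int]],   # hypothetical graph representation
    min_overlap_area: float = 1.0,        # square meters, per the threshold above
) -> list[Polygon]:
    """Union connected components of overlap_graph whose pairwise overlap
    area exceeds min_overlap_area."""
    ...
```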
I'm guessing the end result is fine from visual inspection, but if you write this kind of thing again, I'd suggest breaking out this transitive / graph union operation into a function and writing a small test for it. This is exactly the kind of tricky code that I'd prefer to see tests for and not just a yolo. ~3 small tests (A ∪ B, A ∪ B ∪ C, A ∪ B but overlap with C is too small) would add confidence.
I've had good luck with this set of Claude skills for e.g. test-driven development, or having an LLM do more planning before writing https://github.com/obra/superpowers
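A sketch of the three tests suggested above, assuming the transitive-union step were factored into a function union_overlapping_polygons(polygons, min_overlap_area) that returns the merged geometries; the name, module, and signature are hypothetical:

```python
import pytest
from shapely.geometry import box

# Hypothetical import: assumes the union step is factored out of the script.
from clean_peoples_construction_polygons import union_overlapping_polygons


def test_two_overlapping_squares_merge():
    a, b = box(0, 0, 10, 10), box(5, 0, 15, 10)  # overlap area = 50 m^2
    merged = union_overlapping_polygons([a, b], min_overlap_area=1.0)
    assert len(merged) == 1
    assert merged[0].area == pytest.approx(150)


def test_three_way_transitive_union():
    # A overlaps B and B overlaps C, but A never touches C: all three should merge.
    a, b, c = box(0, 0, 10, 10), box(8, 0, 18, 10), box(16, 0, 26, 10)
    merged = union_overlapping_polygons([a, b, c], min_overlap_area=1.0)
    assert len(merged) == 1


def test_overlap_below_threshold_is_not_merged():
    a, b = box(0, 0, 10, 10), box(5, 0, 15, 10)  # these two merge
    c = box(14.9999, 0, 25, 10)                  # overlap with b is ~0.001 m^2
    merged = union_overlapping_polygons([a, b, c], min_overlap_area=1.0)
    assert len(merged) == 2
```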
s3_bucket="data.sb",
s3_key=s3_key,
)
print(f" Loaded {len(parcels):,} parcels")
Comment for line 149. Why are we guaranteed to get the most recent one? Do we control these files ("chicago_buildings_*.geojson"), and can we enforce that they're always named with YYYY-MM-DD or similar where the glob is? We're using the LastModified date further down; maybe I'm just misunderstanding what we've got here.
I didn't include the utility scripts in your review, but I have scripts for each of these datasets that download the geojsons either through the Cook County API or direct download. I add the date suffix when the file is created. While working on this I kept all files local and set up the S3 read for your review. I need to move fully away from local reading and writing and over to only S3 reads, at which point I think it does not make sense to keep the older copies of the raw data unless they are a dependency of a published report. This is a separate task, though.
4. Clean and validate geometries to avoid spatial errors.
5. Perform a spatial join between parcels and buildings to determine which buildings are located within which parcels. Parcel data provides us with the building type (residential, mixed-use, commercial, industrial) but does not have the number of units in the building for multi-family buildings. Building data provides us with the number of units in the building for multi-family buildings.
6. For buildings that fall within a parcel, the script aims to establish a 1:1 correspondence between buildings and parcels:
Flagging as potentially a big issue: this seems wrong to me.
Suppose I have a large multifamily complex with several buildings on one parcel (common in Chicago, I imagine). Under this scenario, we'd end up way undercounting the number of multifamily buildings, no?
By the way, this is a good example of why it's better to refactor these things into functions instead of one big script. I would like to run a quick check to see how big an issue this is, but there's no way to do that without copying and pasting out sections of this script, or putting in a break and running it to a stopping point. For reproducibility I'd prefer to be able to write a quick script that imports this one, calls a few of the functions, and does that calculation.
Based on the output from running it, there are about 1.3 buildings for every parcel. Assuming most single family buildings are the only thing on their parcel, that means that multifamily buildings are likely to have >> 1.3 buildings per parcel. We should be summing the number of units, not just taking the largest one.
That is a good point. A ton of the single-family homes have detached garages/ADUs that fall within the parcel, and I chose not to count those as dwellings, so that is one reason for the average being > 1. I spent a lot of time looking at satellite imagery of these lots and did not come across multiple apartment complexes on a single parcel, but I agree it would be good to check the data and get our arms around which buildings are being dropped.
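As a quick illustration of the sum-versus-largest point (the column names and toy rows below are made up, not the actual schema):

```python
import pandas as pd

# Toy joined data: one parcel holding two apartment buildings and a detached garage.
joined = pd.DataFrame(
    {
        "parcel_id": ["P1", "P1", "P1"],
        "building_type": ["apartment", "apartment", "garage"],
        "units": [24, 18, 0],
    }
)

dwellings = joined[joined["building_type"] != "garage"]

# Keeping only the largest building undercounts the parcel (P1 -> 24)...
largest_only = dwellings.groupby("parcel_id")["units"].max()

# ...while summing across buildings on the parcel does not (P1 -> 42).
summed = dwellings.groupby("parcel_id")["units"].sum()
```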
(parcels_with_units["longitude"].astype(float).round(8) == round(-87.57623748, 8))
& (parcels_with_units["latitude"].astype(float).round(8) == round(41.74593814, 8))
]
if len(test_parcel) > 0:
I'd prefer if this was a test function or something instead of being in the middle of the code. Also, getting an LLM to generate some test data that isn't tied to the actual Chicago one would probably make it easier to test that this is working, instead of hand-rolling a single example.
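A sketch of what that could look like as a standalone test with synthetic data; assign_units is a hypothetical name for the factored-out join/assignment step, and the columns are illustrative:

```python
import geopandas as gpd
from shapely.geometry import box

from match_parcels_buildings import assign_units  # hypothetical


def test_units_assigned_to_containing_parcel():
    # One parcel containing one six-unit building, in a metric CRS.
    parcels = gpd.GeoDataFrame(
        {"parcel_id": ["P1"]}, geometry=[box(0, 0, 100, 100)], crs="EPSG:32616"
    )
    buildings = gpd.GeoDataFrame(
        {"units": [6]}, geometry=[box(10, 10, 30, 30)], crs="EPSG:32616"
    )
    result = assign_units(parcels, buildings)
    assert result.loc[result["parcel_id"] == "P1", "building_units"].item() == 6
```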
| print(" Match differs from expected") | ||
|
|
||
| # Create working column starting with raw data | ||
| parcels_with_units["building_units"] = parcels_with_units["building_units_raw"].copy() |
We should be putting out diagnostics on how often these fallback logics are being hit, instead of applying them silently. What if half of the buildings are being coerced to 2 units because of missing data? We should know that.
I think what you are describing is in lines 347-365. The final output retains the building_units_raw column (data from the join), which we can compare with building_units (join + interpolated data). I will make a note to specifically report these counts in the data and methods.
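A small sketch of the kind of diagnostic being asked for, assuming the final frame keeps both columns as described (column names taken from the diff above; the exact conditions are illustrative):

```python
# parcels_with_units is the final GeoDataFrame with both columns retained.
raw = parcels_with_units["building_units_raw"]
final = parcels_with_units["building_units"]

n_total = len(parcels_with_units)
n_filled = int((raw.isna() & final.notna()).sum())      # missing raw value filled by fallback
n_changed = int((raw.notna() & (raw != final)).sum())   # raw value overridden by fallback

print(
    f"Fallback logic filled {n_filled:,} of {n_total:,} parcels "
    f"({n_filled / n_total:.1%}) and overrode {n_changed:,} raw values."
)
```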
Added overview and context to scripts
Removed commented-out/stale code
Minor refactor for readability
Fleshed out the data and methods section of index.qmd, in case that is helpful for context during review.
Some review guidance:
Files to review
1. notebooks/clean_peoples_construction_polygons.py (~300 lines)
Purpose: Unions overlapping construction polygons from Peoples Gas
Operations: Geometric unions, date filtering, status assignment
Output: peoples_polygons_unioned.geojson - cleaned construction areas
2. notebooks/match_parcels_buildings.py (~390 lines)
Purpose: Match Cook County parcels with Chicago building footprints to get unit counts
Operations: Spatial join, 1:1 matching by largest overlap, unit count assignment (see the sketch after this file list)
Output: parcels_with_units_YYYYMMDD.geojson - parcels with residential unit data
3. notebooks/geo_data_cleaning.qmd (~1,200 lines)
Purpose: Main spatial ETL - creates block-level dataset by combining all sources
Inputs:
Key Operations:
4. notebooks/analysis.qmd (~1,580 lines)
Purpose: Cost calculations and comparisons (PRP vs NPA)
Input: Block-level summary from geo_data_cleaning
Output: peoplesgas_with_buildings_streets_block_YYYYMMDD.geojson - final block-level dataset
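A rough sketch of the spatial join and largest-overlap matching described for match_parcels_buildings.py above; the column names and the exact matching rule are assumptions based on this summary, not the script itself:

```python
import geopandas as gpd


def match_largest_overlap(
    parcels: gpd.GeoDataFrame, buildings: gpd.GeoDataFrame
) -> gpd.GeoDataFrame:
    # Both frames are assumed to be in a projected CRS (e.g. EPSG:32616),
    # so intersection areas are in square meters.
    joined = gpd.sjoin(buildings, parcels, predicate="intersects", how="inner")

    # Area of each building/parcel intersection, used to rank candidate matches.
    joined["overlap_area"] = joined.apply(
        lambda row: row.geometry.intersection(
            parcels.geometry.loc[row["index_right"]]
        ).area,
        axis=1,
    )

    # Keep, for each building, only the parcel it overlaps most (1:1 by largest overlap).
    return joined.sort_values("overlap_area").groupby(level=0).tail(1)
```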
Data Flow Logic
S3 Fallback Pattern
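Based on the notes above and at the end of this description (download from S3 on the first run, reuse local files afterwards), the fallback presumably amounts to something like the following; the helper name and paths are assumptions:

```python
from pathlib import Path

import boto3


def fetch_if_missing(local_path: Path, bucket: str, key: str) -> Path:
    """Use the local copy if it exists; otherwise download it from S3 first."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(local_path))
    return local_path


# e.g. fetch_if_missing(Path("data/parcels.geojson"), "data.sb", "raw/parcels.geojson")
```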
File Naming & Latest Selection
Scripts write timestamped files (e.g., parcels_with_units_20260105.geojson) to avoid overwriting.
When reading, scripts use glob patterns to find the most recent file.
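For example, with a zero-padded YYYYMMDD suffix the lexicographic order of the file names matches date order, so picking the latest file can be as simple as the sketch below (the directory argument is a placeholder):

```python
from pathlib import Path


def latest_output(directory: Path, pattern: str = "parcels_with_units_*.geojson") -> Path:
    # Zero-padded YYYYMMDD suffixes sort lexicographically in date order,
    # so the last match after sorting is the most recent file.
    candidates = sorted(directory.glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern} in {directory}")
    return candidates[-1]
```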
Workflow
just all includes prep-data, which only runs preprocessing scripts if outputs are missing:
Checks if peoples_polygons_unioned.geojson exists → skip if yes
Checks if parcels_with_units_*.geojson exists → skip if yes
This means:
First run: Downloads from S3, runs preprocessing, creates timestamped outputs
Subsequent runs: Uses existing files, skips preprocessing, runs only the notebooks
I can probably make the S3 reads faster, but I think arrow is a little particular about geojsons, so that is a separate task. For now, just let it write the files to your local drive.
closes #55