55 clean up repo and add walk through #56
Conversation
mshron left a comment
Leaving review on the first two files, haven't gotten to the second two yet, but I didn't want to make you wait.
import pandas as pd
from shapely.ops import unary_union

timestamp = datetime.now().strftime("%Y%m%d")
It looks like this isn't used anywhere now?
MIN_START = "2026-01-01"
# --- Code Cell 5 ---

# Find the most recent Peoples Gas data file
I don't like seeing these things all in the global namespace like this. If you ever tried to import this file, it would immediately run all of the commands. Any reason not to wrap this all in a main() and add an if __name__ == "__main__": main() guard at the end?
I guess I just never planned on importing this. The reason for these separate preprocessing scripts was to import fewer and smaller files into geo_data_cleaning.qmd after I kept running into memory issues. But that is an easy change to make.
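A minimal sketch of the structure being suggested (the step comments are placeholders for the work the scripts currently do at module level):

```python
def main() -> None:
    # 1. Read inputs (currently module-level code).
    # 2. Run the cleaning / matching steps.
    # 3. Write the timestamped outputs.
    ...


if __name__ == "__main__":
    # Importing the module no longer runs the pipeline; only a direct
    # invocation (e.g. `python notebooks/match_parcels_buildings.py`) does.
    main()
```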
from shapely.ops import unary_union

# Threshold for unioning overlapping polygons: 1 square meter
# Note: This is applied in UTM Zone 16N (EPSG:32616) which uses meters, so area is in square meters
Comment for line 118 (but I can't comment on it directly since it's not part of this PR):
Since I'm not sure how you're going to represent your graphs, I strongly suggest at least putting in type hints for this function. Up until now, everything has been a dataframe; what is our graph? I can read ahead to figure it out, but it'd be easier to be able to read this in one go.
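For illustration only: assuming the graph ends up as an adjacency mapping from polygon index to the indices of the polygons it overlaps (the actual representation may differ), the kind of annotated signature being asked for might look like the sketch below. The function name and parameters are hypothetical.

```python
from shapely.geometry import Polygon


def union_overlapping(
    polygons: list[Polygon],
    overlap_graph: dict[int, set[int]],   # hypothetical graph representation
    min_overlap_area: float = 1.0,        # square meters, per the threshold above
) -> list[Polygon]:
    """Union connected components of overlap_graph whose pairwise overlap
    area exceeds min_overlap_area."""
    ...
```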
I'm guessing the end result is fine from visual inspection, but if you write this kind of thing again, I'd suggest breaking out this transitive / graph union operation into a function and writing a small test for it. This is exactly the kind of tricky code that I'd prefer to see tests for and not just a yolo. ~3 small tests (A ∪ B, A ∪ B ∪ C, A ∪ B but overlap with C is too small) would add confidence.
I've had good luck with this set of Claude skills for e.g. test-driven development, or having an LLM do more planning before writing https://github.com/obra/superpowers
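A sketch of the three tests suggested above, assuming the transitive-union step were factored into a function union_overlapping_polygons(polygons, min_overlap_area) that returns the merged geometries; the name, module, and signature are hypothetical:

```python
import pytest
from shapely.geometry import box

# Hypothetical import: assumes the union step is factored out of the script.
from clean_peoples_construction_polygons import union_overlapping_polygons


def test_two_overlapping_squares_merge():
    a, b = box(0, 0, 10, 10), box(5, 0, 15, 10)  # overlap area = 50 m^2
    merged = union_overlapping_polygons([a, b], min_overlap_area=1.0)
    assert len(merged) == 1
    assert merged[0].area == pytest.approx(150)


def test_three_way_transitive_union():
    # A overlaps B and B overlaps C, but A never touches C: all three should merge.
    a, b, c = box(0, 0, 10, 10), box(8, 0, 18, 10), box(16, 0, 26, 10)
    merged = union_overlapping_polygons([a, b, c], min_overlap_area=1.0)
    assert len(merged) == 1


def test_overlap_below_threshold_is_not_merged():
    a, b = box(0, 0, 10, 10), box(5, 0, 15, 10)  # these two merge
    c = box(14.9999, 0, 25, 10)                  # overlap with b is ~0.001 m^2
    merged = union_overlapping_polygons([a, b, c], min_overlap_area=1.0)
    assert len(merged) == 2
```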
s3_bucket="data.sb",
s3_key=s3_key,
)
print(f" Loaded {len(parcels):,} parcels")
Comment for line 149. Why are we guaranteed to get the most recent one? Do we control these files ("chicago_buildings_*.geojson"), and can we enforce that they're always named with YYYY-MM-DD or similar where the glob is? We're using the LastModified date further down; maybe I'm just misunderstanding what we've got here.
I didn't include the utility scripts in your review, but I have scripts for each of these datasets that download the geojsons either through the Cook County API or direct download. I add the date suffix when the file is created. While working on this I kept all files local and set up the S3 read for your review. I need to move fully away from local reading and writing and over to only S3 reads, at which point I think it does not make sense to keep the older copies of the raw data unless they are a dependency of a published report. This is a separate task, though.
4. Clean and validate geometries to avoid spatial errors.
5. Perform a spatial join between parcels and buildings to determine which buildings are located within which parcels. Parcel data provides us with the building type (residential, mixed-use, commercial, industrial) but does not have the number of units in the building for multi-family buildings. Building data provides us with the number of units in the building for multi-family buildings.
6. For buildings that fall within a parcel, the script aims to establish a 1:1 correspondence between buildings and parcels:
Flagging as potentially a big issue: this seems wrong to me.
Suppose I have a large multifamily complex with several buildings on one parcel (common in Chicago, I imagine). Under this scenario, we'd end up way undercounting the number of multifamily buildings, no?
By the way, this is a good example of why it's better to refactor these things into functions instead of one big script. I would like to run a quick check to see how big an issue this is, but there's no way to do that without copying and pasting out sections of this script, or putting in a break and running it to a stopping point. For reproducibility I'd prefer to be able to write a quick script that imports this one, calls a few of the functions, and does that calculation.
Based on the output from running it, there are about 1.3 buildings for every parcel. Assuming most single family buildings are the only thing on their parcel, that means that multifamily buildings are likely to have >> 1.3 buildings per parcel. We should be summing the number of units, not just taking the largest one.
That is a good point. A ton of the single-family homes have detached garages/ADUs that fall within the parcel, and I chose not to count those as dwellings, so that is one reason for the average being > 1. I spent a lot of time looking at satellite imagery of these lots and did not come across multiple apartment complexes on a single parcel, but I agree it would be good to check the data and get our arms around which buildings are being dropped.
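As a quick illustration of the sum-versus-largest point (the column names and toy rows below are made up, not the actual schema):

```python
import pandas as pd

# Toy joined data: one parcel holding two apartment buildings and a detached garage.
joined = pd.DataFrame(
    {
        "parcel_id": ["P1", "P1", "P1"],
        "building_type": ["apartment", "apartment", "garage"],
        "units": [24, 18, 0],
    }
)

dwellings = joined[joined["building_type"] != "garage"]

# Keeping only the largest building undercounts the parcel (P1 -> 24)...
largest_only = dwellings.groupby("parcel_id")["units"].max()

# ...while summing across buildings on the parcel does not (P1 -> 42).
summed = dwellings.groupby("parcel_id")["units"].sum()
```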
(parcels_with_units["longitude"].astype(float).round(8) == round(-87.57623748, 8))
& (parcels_with_units["latitude"].astype(float).round(8) == round(41.74593814, 8))
]
if len(test_parcel) > 0:
I'd prefer if this was a test function or something instead of being in the middle of the code. Also, getting an LLM to generate some test data that isn't tied to the actual Chicago one would probably make it easier to test that this is working, instead of hand-rolling a single example.
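A sketch of what that could look like as a standalone test with synthetic data; assign_units is a hypothetical name for the factored-out join/assignment step, and the columns are illustrative:

```python
import geopandas as gpd
from shapely.geometry import box

from match_parcels_buildings import assign_units  # hypothetical


def test_units_assigned_to_containing_parcel():
    # One parcel containing one six-unit building, in a metric CRS.
    parcels = gpd.GeoDataFrame(
        {"parcel_id": ["P1"]}, geometry=[box(0, 0, 100, 100)], crs="EPSG:32616"
    )
    buildings = gpd.GeoDataFrame(
        {"units": [6]}, geometry=[box(10, 10, 30, 30)], crs="EPSG:32616"
    )
    result = assign_units(parcels, buildings)
    assert result.loc[result["parcel_id"] == "P1", "building_units"].item() == 6
```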
| print(" Match differs from expected") | ||
|
|
||
| # Create working column starting with raw data | ||
| parcels_with_units["building_units"] = parcels_with_units["building_units_raw"].copy() |
We should be putting out diagnostics on how often these fallback logics are being hit, instead of applying them silently. What if half of the buildings are being coerced to 2 units because of missing data? We should know that.
I think what you are describing is in lines 347-365. The final output retains the building_units_raw column (data from the join), which we can compare with building_units (join + interpolated data). I will make a note to specifically report these counts in the data and methods.
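A small sketch of the kind of diagnostic being asked for, assuming the final frame keeps both columns as described (column names taken from the diff above; the exact conditions are illustrative):

```python
# parcels_with_units is the final GeoDataFrame with both columns retained.
raw = parcels_with_units["building_units_raw"]
final = parcels_with_units["building_units"]

n_total = len(parcels_with_units)
n_filled = int((raw.isna() & final.notna()).sum())      # missing raw value filled by fallback
n_changed = int((raw.notna() & (raw != final)).sum())   # raw value overridden by fallback

print(
    f"Fallback logic filled {n_filled:,} of {n_total:,} parcels "
    f"({n_filled / n_total:.1%}) and overrode {n_changed:,} raw values."
)
```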
Added overview and context to scripts
Removed commented-out/stale code
Minor refactor for readability
Fleshed out the data and methods section of index.qmd, in case that is helpful for context during review.
Some review guidance:
Files to review
1. notebooks/clean_peoples_construction_polygons.py (~300 lines)
Purpose: Unions overlapping construction polygons from Peoples Gas
Operations: Geometric unions, date filtering, status assignment
Output: peoples_polygons_unioned.geojson - cleaned construction areas
2. notebooks/match_parcels_buildings.py (~390 lines)
Purpose: Match Cook County parcels with Chicago building footprints to get unit counts
Operations: Spatial join, 1:1 matching by largest overlap, unit count assignment (see the sketch after this file list)
Output: parcels_with_units_YYYYMMDD.geojson - parcels with residential unit data
3. notebooks/geo_data_cleaning.qmd (~1,200 lines)
Purpose: Main spatial ETL - creates block-level dataset by combining all sources
Inputs:
Key Operations:
4. notebooks/analysis.qmd (~1,580 lines)
Purpose: Cost calculations and comparisons (PRP vs NPA)
Input: Block-level summary from geo_data_cleaning
Output: peoplesgas_with_buildings_streets_block_YYYYMMDD.geojson - final block-level dataset
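A rough sketch of the spatial join and largest-overlap matching described for match_parcels_buildings.py above; the column names and the exact matching rule are assumptions based on this summary, not the script itself:

```python
import geopandas as gpd


def match_largest_overlap(
    parcels: gpd.GeoDataFrame, buildings: gpd.GeoDataFrame
) -> gpd.GeoDataFrame:
    # Both frames are assumed to be in a projected CRS (e.g. EPSG:32616),
    # so intersection areas are in square meters.
    joined = gpd.sjoin(buildings, parcels, predicate="intersects", how="inner")

    # Area of each building/parcel intersection, used to rank candidate matches.
    joined["overlap_area"] = joined.apply(
        lambda row: row.geometry.intersection(
            parcels.geometry.loc[row["index_right"]]
        ).area,
        axis=1,
    )

    # Keep, for each building, only the parcel it overlaps most (1:1 by largest overlap).
    return joined.sort_values("overlap_area").groupby(level=0).tail(1)
```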
Data Flow Logic
S3 Fallback Pattern
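Based on the notes above and at the end of this description (download from S3 on the first run, reuse local files afterwards), the fallback presumably amounts to something like the following; the helper name and paths are assumptions:

```python
from pathlib import Path

import boto3


def fetch_if_missing(local_path: Path, bucket: str, key: str) -> Path:
    """Use the local copy if it exists; otherwise download it from S3 first."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(local_path))
    return local_path


# e.g. fetch_if_missing(Path("data/parcels.geojson"), "data.sb", "raw/parcels.geojson")
```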
File Naming & Latest Selection
Scripts write timestamped files (e.g., parcels_with_units_20260105.geojson) to avoid overwriting.
When reading, scripts use glob patterns to find the most recent file.
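For example, with a zero-padded YYYYMMDD suffix the lexicographic order of the file names matches date order, so picking the latest file can be as simple as the sketch below (the directory argument is a placeholder):

```python
from pathlib import Path


def latest_output(directory: Path, pattern: str = "parcels_with_units_*.geojson") -> Path:
    # Zero-padded YYYYMMDD suffixes sort lexicographically in date order,
    # so the last match after sorting is the most recent file.
    candidates = sorted(directory.glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern} in {directory}")
    return candidates[-1]
```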
Workflow
just all includes prep-data, which only runs preprocessing scripts if outputs are missing:
Checks if peoples_polygons_unioned.geojson exists → skip if yes
Checks if parcels_with_units_*.geojson exists → skip if yes
This means:
First run: Downloads from S3, runs preprocessing, creates timestamped outputs
Subsequent runs: Uses existing files, skips preprocessing, runs only the notebooks
I can probably make the S3 reads faster, but I think arrow is a little particular about geojsons, so that is a separate task. For now, just let it write the files to your local drive.
closes #55