QAQC Process
From an EDM perspective, the QAQC of the Developments and Housing Databases will focus on whether the code behaves the way we expect it to and whether each step of the process was successful in generating the full extent of the data. Of particular importance are reviewing the ingestion, build, and export steps executed in the GitHub Workflow, the final outputs exported to the Digital Ocean edm-publishing space, and the Data Engineering QAQC app.
Besides the QAQC reports generated in the Data Engineering QAQC app, there are some tests that can be performed on a build of DevDB and HousingDB to help check that the production run of the data product is accurate. The accuracy of each individual record is outside the scope of this QAQC process, but the Housing team does do an extensive research phase before release.
Some of the more indicative fields that can shed light on the accuracy of the data are certain date fields in the final outputs (and, as a proxy, in the intermediate tables used to create the final tables; these fields almost exclusively come from DOB data). These checks should be performed for every production build of DevDB and HousingDB. Those date fields should reflect the most recent source data and the timeline the data is meant to cover, e.g. the 22Q4 version should include data up to the date of the DOB data. Three useful fields to check are:
- `permit_qrtr` - field comes from the DOB Permits data
- `date_filed` - field comes from the DOB Job Application Filings
- `date_completed` - field comes from the DOB COFO data
The above graphs come from a Jupyter notebook using the 21Q4 DevDB data. Note that the graphs show cumulative counts by week.
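The date-field checks above can be sketched with pandas. The DataFrame, column values, and the 2022-12-31 cutoff below are illustrative assumptions, not the product's actual schema or data date; a real check would read the exported build output instead.

```python
import pandas as pd

# Illustrative stand-in for a DevDB extract; a real check would load the
# exported final table instead of constructing this by hand.
df = pd.DataFrame(
    {
        "job_number": ["1", "2", "3"],
        "date_filed": pd.to_datetime(["2022-10-03", "2022-11-18", "2022-12-27"]),
        "date_completed": pd.to_datetime(["2022-11-01", "2022-12-05", "2022-12-30"]),
    }
)

# For a 22Q4 build, records should run up to (but not past) the DOB data date.
version_cutoff = pd.Timestamp("2022-12-31")  # assumed DOB data date

for col in ["date_filed", "date_completed"]:
    assert df[col].max() <= version_cutoff, f"{col} has records past the cutoff"
    print(col, "latest:", df[col].max().date())

# Cumulative counts by week, mirroring the notebook graphs described above.
weekly = df.set_index("date_filed").resample("W").size().cumsum()
print(weekly)
```

The key signal is the maximum of each date field: if it falls well before the version cutoff, the build likely ingested stale DOB source data.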
A crucial step to check in the build process is whether or not the corrections files provided by the Housing research team are properly applied to the final DevDB and HousingDB tables. Currently, we do random spot checks of certain records, comparing manual_corrections.csv with the final outputs.
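The spot check could be automated along these lines. Aside from the manual_corrections.csv name, which comes from the text, the column names and values here are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins: in practice, read manual_corrections.csv and the final
# DevDB/HousingDB output. Column names here are assumed, not the real schema.
corrections = pd.DataFrame(
    {
        "job_number": ["101", "102"],
        "new_value": ["12", "48"],  # corrected value for an assumed field
    }
)
final = pd.DataFrame(
    {
        "job_number": ["101", "102", "103"],
        "units_co": ["12", "48", "7"],  # assumed field the correction targets
    }
)

# Randomly sample corrected records and confirm the final table reflects
# each correction, replacing manual spot checks with a repeatable one.
sample = corrections.sample(n=2, random_state=0)
merged = sample.merge(final, on="job_number", how="left")
mismatches = merged[merged["new_value"] != merged["units_co"]]
assert mismatches.empty, f"Corrections not applied:\n{mismatches}"
print("All sampled corrections applied.")
```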
Unique ID: `job_number` should be a unique ID, so:

```python
assert df.shape[0] == len(df.job_number.unique())
```

Job Type: There should only be three types of jobs in the final data:

- Alteration
- Demolition
- New Building
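Both checks can be run together in one sketch; the DataFrame below is synthetic, and the `job_type` values are the three types named above.

```python
import pandas as pd

# Synthetic stand-in for the final DevDB table; a real check would load
# the exported build output instead.
df = pd.DataFrame(
    {
        "job_number": ["1", "2", "3"],
        "job_type": ["Alteration", "Demolition", "New Building"],
    }
)

# job_number should be a unique ID.
assert df.shape[0] == len(df.job_number.unique()), "duplicate job_number values"

# Only the three expected job types should appear in the final data.
expected = {"Alteration", "Demolition", "New Building"}
unexpected = set(df.job_type.unique()) - expected
assert not unexpected, f"Unexpected job types: {unexpected}"
```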
A good sanity check is also plotting the aggregate tables on a map to show where the most development is taking place; this should match SME projections.
In the above graph, you can see that development is heavily concentrated in Long Island City and Downtown Brooklyn, which account for a large share of the citywide total. This geographic distribution tracks with what we know about development in NYC.
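A minimal version of such a map can be drawn with matplotlib, assuming the final table carries latitude/longitude columns and a unit count; the points below are synthetic, not real DevDB records.

```python
import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic points; a real check would plot the final table's coordinates.
df = pd.DataFrame(
    {
        "latitude": [40.744, 40.692, 40.750],   # assumed column names
        "longitude": [-73.948, -73.985, -73.940],
        "units": [300, 450, 120],
    }
)

# Marker size scaled by unit count highlights where development concentrates.
fig, ax = plt.subplots()
ax.scatter(df["longitude"], df["latitude"], s=df["units"] / 2, alpha=0.5)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Development by location (marker size = units)")
fig.savefig("development_map.png")
```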