This repository was archived by the owner on Jun 30, 2023. It is now read-only.

QAQC Process

Max B-H edited this page Apr 18, 2023 · 17 revisions

EDM

Overview

From an EDM perspective, QAQC of the Developments and Housing Databases focuses on whether the code behaves the way we expect it to and whether each step of the process succeeded in generating the full extent of the data. Of particular importance is reviewing the ingestion, build, and export steps executed in the GitHub workflow, the final outputs exported to the Digital Ocean edm-publishing space, and the Data Engineering QAQC app.

Besides the QAQC reports generated in the Data Engineering QAQC app, several tests can be run on a build of DevDB and HousingDB to help confirm an accurate production run of the data product. The accuracy of each individual record is outside the scope of this QAQC process, but the Housing team does carry out an extensive research phase before release.

Dates

Certain date fields in the final outputs are among the more indicative signals of the data's accuracy. The same holds, by proxy, for the intermediate tables used to create the final tables; these fields come almost exclusively from DOB data. These checks should be performed for every production build of DevDB and HousingDB. The date fields should reflect the most recent source data and the timeline the data is meant to cover, e.g. the 22Q4 version should include data up to the date of the DOB data. Three useful fields to check are:

  • permit_qrtr - field comes from DOB Permits data
  • date_filed - field comes from the DOB Job Application Filings
  • date_completed - field comes from the DOB COFO data
[Two screenshots of graphs, captured 2022-04-08]

The above graphs come from a Jupyter notebook using the 21Q4 DevDB data. Note that the graphs show cumulative counts by week.
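The date check described above can be sketched in pandas as follows. This is a minimal illustration, not the notebook used for the graphs: the helper names `latest_dates` and `weekly_cumulative_counts`, the sample rows, the cutoff date, and the 30-day tolerance are all assumptions. `permit_qrtr` is left out because its format is not described on this page.

```python
import pandas as pd


def latest_dates(df, date_fields=("date_filed", "date_completed")):
    """Return the most recent non-null value of each date field."""
    return pd.Series(
        {field: pd.to_datetime(df[field], errors="coerce").max() for field in date_fields}
    )


def weekly_cumulative_counts(df, field):
    """Return the cumulative number of records by week for one date field."""
    dates = pd.to_datetime(df[field], errors="coerce").dropna()
    weekly = dates.dt.to_period("W").value_counts().sort_index()
    return weekly.cumsum()


# Made-up sample rows standing in for a DevDB export.
devdb = pd.DataFrame(
    {
        "job_number": ["1", "2", "3"],
        "date_filed": ["2022-11-01", "2022-12-15", None],
        "date_completed": ["2022-12-20", None, "2022-12-30"],
    }
)

latest = latest_dates(devdb)

# Assumed date of the DOB source data for a hypothetical 22Q4 build.
cutoff = pd.Timestamp("2022-12-31")

# Fields whose latest value is more than 30 days (arbitrary tolerance) before the cutoff.
stale = latest[latest < cutoff - pd.Timedelta(days=30)]

# Fields with values after the cutoff, which would also be suspicious.
too_new = latest[latest > cutoff]

print(latest)
print(weekly_cumulative_counts(devdb, "date_filed"))
```

A non-empty `stale` or `too_new` result would be a prompt to look at the source data version, not proof of a bad build.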

Check Manual Corrections Were Applied

A crucial step in checking the build process is confirming that the corrections files received from the Housing research team are properly applied to the final DevDB and HousingDB tables. Currently, we spot-check randomly selected records, comparing manual_corrections.csv against the final outputs.
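The spot check could also be run across every correction at once. The sketch below assumes a layout for manual_corrections.csv with `job_number`, `field`, and `new_value` columns; that layout, the `units` column, and the helper name `unapplied_corrections` are hypothetical, so adjust them to the real file. It also assumes `job_number` is unique in the final table, and it compares values as strings to sidestep dtype differences.

```python
import pandas as pd


def unapplied_corrections(final, corrections):
    """Return the correction rows whose new_value is not found in the final table."""
    indexed = final.set_index("job_number")
    misses = []
    for row in corrections.itertuples(index=False):
        if row.job_number not in indexed.index or row.field not in indexed.columns:
            actual = None  # record or field is missing from the final output
        else:
            actual = indexed.at[row.job_number, row.field]
        if str(actual) != str(row.new_value):
            misses.append(
                {
                    "job_number": row.job_number,
                    "field": row.field,
                    "expected": row.new_value,
                    "actual": actual,
                }
            )
    return pd.DataFrame(misses, columns=["job_number", "field", "expected", "actual"])


# Made-up rows; "units" is a hypothetical column name.
final = pd.DataFrame({"job_number": ["100", "200"], "units": ["5", "8"]})
corrections = pd.DataFrame(
    {
        "job_number": ["100", "200"],
        "field": ["units", "units"],
        "new_value": ["5", "10"],
    }
)

missing = unapplied_corrections(final, corrections)
print(missing)
```

With real files, both tables could be loaded with `pd.read_csv(..., dtype=str)` so that the string comparison is consistent.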

Unique ID Check

The job number serves as the unique ID, so each value should appear only once:

assert df.shape[0] == len(df.job_number.unique())
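When the assertion fails, it helps to see which job numbers repeat. A minimal sketch, assuming the table is loaded as a pandas DataFrame with a `job_number` column; the helper name `duplicated_job_numbers` and the sample rows are made up for illustration:

```python
import pandas as pd


def duplicated_job_numbers(df):
    """Return every row whose job_number appears more than once."""
    dupes = df[df["job_number"].duplicated(keep=False)]
    return dupes.sort_values("job_number")


# Made-up sample with one repeated job number.
sample = pd.DataFrame(
    {
        "job_number": ["A1", "A2", "A2", "A3"],
        "source": ["x", "y", "z", "w"],
    }
)

dupes = duplicated_job_numbers(sample)
print(dupes)
```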

Job Type

There should only be three types of jobs in the final data:

  • Alteration
  • Demolition
  • New Building
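The job type list above can be checked with a short pandas sketch. The column name `job_type` and the helper `unexpected_job_types` are assumptions, since this page does not name the column; the sample values are made up.

```python
import pandas as pd

# The three job types expected in the final data.
EXPECTED_JOB_TYPES = ["Alteration", "Demolition", "New Building"]


def unexpected_job_types(df, column="job_type"):
    """Return counts of values outside the expected job types, nulls included."""
    counts = df[column].value_counts(dropna=False)
    return counts[~counts.index.isin(EXPECTED_JOB_TYPES)]


# Made-up sample with one miscased value and one null.
sample = pd.DataFrame(
    {"job_type": ["Alteration", "New Building", "Demolition", "alteration", None]}
)

unexpected = unexpected_job_types(sample)
print(unexpected)
```

An empty result means only the three expected types are present.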

Another good sanity check is plotting the aggregate tables on a map to show where the most development is taking place; this should match SME projections.

[Screenshot of the development map, captured 2022-04-08]

In the above graph, Long Island City and Downtown Brooklyn account for a large majority of overall development compared to the rest of the city. This geographic distribution tracks with what we know about development in NYC.
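Before mapping, the same concentration can be read off a simple ranking of the aggregate table. This is only a sketch: the column names `area_name` and `net_units`, the helper `top_areas`, and the numbers are all hypothetical placeholders, not values from DevDB.

```python
import pandas as pd


def top_areas(agg, area_col="area_name", value_col="net_units", n=5):
    """Rank areas by total value and report each area's share of the citywide total."""
    ranked = agg.groupby(area_col)[value_col].sum().sort_values(ascending=False)
    share = ranked / ranked.sum()
    return pd.DataFrame({"total": ranked, "share": share}).head(n)


# Made-up toy values, for illustration only.
agg = pd.DataFrame(
    {
        "area_name": ["Area A", "Area B", "Area C", "Area A"],
        "net_units": [500, 700, 400, 400],
    }
)

top = top_areas(agg)
print(top)
```

If the top-ranked areas do not match what SMEs expect, that is worth raising before release.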

Housing Team
