This tool ingests data from external sources or a local machine, optionally transforms/cleans the data, and stores the result in Parquet files. It supports both geospatial and non-geospatial datasets. There is also an Open Data Week presentation covering the tool.
For a one-liner to launch JupyterLab with an environment ready to code, run:

```bash
docker run --rm -p 8888:8888 nycplanning/open_data_ingest
```
To ingest a dataset:

```bash
python3 -m dcpy.cli lifecycle ingest nysparks_historicplaces --template-dir ./templates
```

```mermaid
flowchart LR
    B[Raw data]
    n1@{ shape: "tri", label: "Local File" } --> B
    n2@{ shape: "tri", label: "API" } --> B
    n3@{ shape: "tri", label: "Open Data" } -->|Extract to local machine| B
    n4@{ shape: "tri", label: "ArcGIS Server" } --> B
    n5@{ shape: "tri", label: "S3" } --> B
    B -->|Convert to parquet| C[init.parquet]
    E[dataset_id.parquet]
    C -->|No change| E
    C -->|Apply pre-processing steps| E
```
The ingestion process produces the following files:

- `config.json`: a metadata file about the data and the ingestion process.
- `init.parquet`: raw data in parquet format.
- `<dataset_id>.<format extension>`: raw data. Example: `dpr_forever_wild.zip`.
- `<dataset_id>.parquet`: output data. Example: `dpr_forever_wild.parquet`. If pre-processing steps were not specified in the dataset template, this file will be the same as `init.parquet`.
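To spot-check an output file, you can read it back with pandas; a minimal sketch using the example filename above:

```python
import pandas as pd

# Load the ingested output and inspect the first few rows
df = pd.read_parquet("dpr_forever_wild.parquet")
print(df.head())
```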
If you have docker installed, you can simply run `docker run --rm -p 8888:8888 nycplanning/open_data_ingest`. This will run a JupyterLab instance and prompt you with a url after it's built.
There's also a `requirements.txt` file here - it's pretty minimal. You will need both pip and git installed: since we haven't officially published dcpy yet, the requirements file points to our repo.
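For example, a minimal setup with venv and pip (one of many equally valid approaches):

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```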
However you choose to manage your python environment (venv, pyenv), simply install from `requirements.txt` and you should be good to go. If you use VSCode like we do, you can run snippets in notebooks right in your editor, or you can try out the cli targets, such as:
```bash
python3 -m dcpy.cli lifecycle ingest fdny_firehouses
```
Each dataset requires a YAML configuration file defining its ingestion settings. Below are the required fields.
```yaml
# <unique_dataset_id>.yml
id: <unique_dataset_id> # Unique identifier for the dataset. It must match the config filename, i.e. <unique_dataset_id>.yml
attributes:
  name: <dataset name> # Human-readable name of the dataset
ingestion:
  source: <source details> # Specifies the data source, see section below
  file_format: <format details> # Defines the file format of the source data, see section below
```
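Putting the pieces together, a complete minimal template might look like the following sketch (the id, name, and path are made up for illustration; the source and file format options are detailed in the sections below):

```yaml
# my_dataset.yml -- hypothetical example
id: my_dataset
attributes:
  name: My Dataset
ingestion:
  source:
    type: local_file
    path: path/to/my_dataset.csv
  file_format:
    type: csv
```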
### From local file

This option assumes that you already have the dataset of interest on your local machine.
```yaml
source:
  type: local_file
  path: <path to local file>
```

Example:
```yaml
source:
  type: local_file
  path: path/to/my/dataset.csv
```
### From OpenData portal

Pull data from OpenData. To find the `org` and `uid` values for a given dataset, refer to the dataset's url on the OpenData portal. Though the source `format` is specified here, the `file_format` section is still required.
```yaml
source:
  type: socrata
  org: <organization> # Allowed values are: `nyc`, `nys`, and `nys_health`
  uid: <dataset identifier> # Dataset identifier
  format: <file format> # Data format of the source file. Allowed values are: `csv`, `geojson`, and `shapefile`
```

Examples:
```yaml
# DPR Parks Properties: https://data.cityofnewyork.us/Recreation/Parks-Properties/enfh-gkve
source:
  type: socrata
  org: nyc
  uid: enfh-gkve
  format: geojson
```

```yaml
# Solid Waste Management Facilities: https://data.ny.gov/Energy-Environment/Solid-Waste-Management-Facilities/2fni-raj8
source:
  type: socrata
  org: nys
  uid: 2fni-raj8
  format: csv
```
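Because `file_format` is still required alongside a `socrata` source, a full `ingestion` block for the Parks Properties example might pair the two like this (a sketch; per the geojson section below, that file format needs no extra attributes):

```yaml
ingestion:
  source:
    type: socrata
    org: nyc
    uid: enfh-gkve
    format: geojson
  file_format:
    type: geojson
```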
### From Esri Feature Service

```yaml
source:
  type: esri
  server: <server name> # Allowed values are: `nys_clearinghouse`, `nys_parks`, `nps`, `dcp`, and `nyc_maphub`
  dataset: <dataset name> # Name of the Esri dataset
  layer_id: <layer_id> # ID of the layer (only specified if the dataset has multiple layers)
```

Example:
```yaml
# National Register of Historic Places: https://services.arcgis.com/1xFZPtKn1wKC6POA/ArcGIS/rest/services/National_Register_Building_Listings/FeatureServer
source:
  type: esri
  server: nys_parks
  dataset: National_Register_Building_Listings
  layer_id: 13
```
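To find the `layer_id` for a multi-layer service, one option is to query the service's REST metadata, which includes a list of layers when you request json output; a minimal sketch with requests:

```python
import requests

# Query the FeatureServer above for its service metadata
url = (
    "https://services.arcgis.com/1xFZPtKn1wKC6POA/ArcGIS/rest/services/"
    "National_Register_Building_Listings/FeatureServer"
)
info = requests.get(url, params={"f": "json"}).json()
for layer in info.get("layers", []):
    print(layer["id"], layer["name"])
```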
### From API

Pull data from an API. Currently available for datasets in `csv` and `json` file formats. Though the source `format` is specified here, the `file_format` section is still required.

```yaml
source:
  type: api
  endpoint: <api endpoint>
  format: <file format> # Must be `csv` or `json`
```

Example:
```yaml
# NY Public Libraries: https://www.nypl.org/locations
source:
  type: api
  endpoint: https://refinery.nypl.org/api/nypl/locations/v1.0/locations
  format: json
```
### In csv

```yaml
file_format:
  type: csv
  geometry: <geometry details> # Only required if dataset is geospatial (see section below). Otherwise can be omitted
```

Examples:
```yaml
# Non-geospatial dataset
file_format:
  type: csv
```

```yaml
# Non-geospatial dataset with some optional attributes
file_format:
  type: csv
  encoding: utf-8
  delimiter: "|"
  column_names: ["Column 1", "Column 2"] # When data doesn't have headers, add new ones
```

```yaml
# Geospatial dataset with geometry stored in "Longitude" and "Latitude" columns
file_format:
  type: csv
  geometry:
    geom_column:
      x: Longitude
      y: Latitude
    crs: EPSG:4326
```

```yaml
# Geospatial dataset with geometry in "GEOM" column
file_format:
  type: csv
  geometry:
    geom_column: GEOM
    crs: EPSG:2263
    format: wkb
```
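Conceptually, the x/y geometry config above builds point geometries from two columns; a rough sketch of the equivalent geopandas operation (illustrative only, not the tool's actual implementation):

```python
import geopandas as gpd
import pandas as pd

# Build point geometries from "Longitude"/"Latitude" columns
df = pd.read_csv("dataset.csv")  # hypothetical input file
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]),
    crs="EPSG:4326",
)
```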
### In excel

```yaml
file_format:
  type: xlsx # The value can also be `excel`
  sheet_name: <excel sheet name or number>
  geometry: <geometry details> # Only required if dataset is geospatial (see section below). Otherwise can be omitted
```

Examples:
```yaml
# Non-geospatial dataset
file_format:
  type: xlsx
  sheet_name: Sheet_1
```

```yaml
# Geospatial dataset with geometry in "wkb_geometry" column
file_format:
  type: xlsx
  sheet_name: Sheet_1
  geometry:
    geom_column: wkb_geometry
    crs: EPSG:2263
```
### In Json

```yaml
file_format:
  type: json
  json_read_fn: <json_read_fn> # Allowed values: `normalize`, `read_json`. These are pandas functions to read in a json file -- refer to pandas docs for more details
  geometry: <geometry details> # Only required if dataset is geospatial (see section below). Otherwise can be omitted
```

Examples:
```yaml
# Non-geospatial dataset of Brooklyn Libraries: https://www.bklynlibrary.org/locations
file_format:
  type: json
  json_read_fn: normalize
  json_read_kwargs: { "record_path": ["locations"] }
```

```yaml
# Geospatial dataset with geometry stored in "Longitude" and "Latitude" columns
file_format:
  type: json
  json_read_fn: normalize
  json_read_kwargs:
    {
      "record_path": ["Locations", "Location"],
      "meta": ["TrackerID", "FMSID", "Title", "TotalFunding"],
    }
  geometry:
    crs: EPSG:4326
    geom_column:
      x: Longitude
      y: Latitude
```
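For reference, `json_read_fn: normalize` with a `record_path` corresponds to pandas' `json_normalize`; a small standalone sketch of what that kwarg does, using made-up data shaped like the Brooklyn Libraries example:

```python
import pandas as pd

# Made-up payload: records live in a nested "locations" list
data = {"locations": [{"title": "Central Library"}, {"title": "Bushwick Library"}]}

# record_path tells pandas which nested list holds the records
df = pd.json_normalize(data, record_path=["locations"])
print(df)
```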
### In GeoJson

Note: `crs` is not an attribute for the geojson format, since the geojson specification fixes the CRS to "EPSG:4326".

```yaml
file_format:
  type: geojson
```

Example:

```yaml
file_format:
  type: geojson
```
### In shapefile

```yaml
file_format:
  type: shapefile
  crs: <crs> # Coordinate Reference System. Ex: `EPSG:4326`
```

Example:
```yaml
file_format:
  type: shapefile
  crs: EPSG:2263
```
### In geodatabase

```yaml
file_format:
  type: geodatabase
  crs: <crs> # Coordinate Reference System. Ex: `EPSG:4326`
  layer: <layer name> # Only required if the file contains multiple layers. Otherwise can be omitted
```

Examples:
```yaml
# Geodatabase file with one layer
file_format:
  type: geodatabase
  crs: EPSG:2263
```

```yaml
# Geodatabase file with multiple layers. Pick `lion` layer
file_format:
  type: geodatabase
  layer: lion
  crs: EPSG:2263
```
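If you don't know a geodatabase's layer names, one way to list them is fiona (a sketch; the path is hypothetical):

```python
import fiona

# Print the layer names available in a geodatabase
print(fiona.listlayers("path/to/data.gdb"))
```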
Once your YAML file is ready, run the following command:

```bash
python3 -m dcpy.cli lifecycle ingest <unique_dataset_id> --template-dir <directory path>
```

Processed data is saved as a Parquet file in the designated output directory.
- Transformations: You can specify optional `processing_steps` for column renaming, cleaning, and more (see the sketch after this list).
- Geospatial Data: Define geospatial info in the `geometry` property under the `file_format` property.
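The snippet below is a purely hypothetical illustration of the shape a `processing_steps` list might take; the step names and arguments are made up, so check the dcpy source for the actual available steps:

```yaml
ingestion:
  # Hypothetical sketch -- step names/args below are illustrative, not real dcpy steps
  processing_steps:
    - name: rename_columns
      args:
        map: { "Column 1": "column_1" }
```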