Skip to content

NYCPlanning/open_data_ingest_demo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview

This tool ingests data from external sources or a local machine, optionally transforms/cleans the data, and stores the result in Parquet files. It supports both geospatial and non-geospatial datasets. Here is a link to the Open Data Week presentation.

For a one-liner to launch JupyterLab with an environment ready to code, run docker run --rm -p 8888:8888 nycplanning/open_data_ingest

Extract data with ingest tool by running this:

python3 -m dcpy.cli lifecycle ingest nysparks_historicplaces --template-dir ./templates

Ingest tool in action

flowchart LR
  B[Raw data]
  n1@{ shape: "tri", label: "Local File" } -->  B
  n2@{ shape: "tri", label: "API" } -->  B
  n3@{ shape: "tri", label: "Open Data" } --> |Extract to local machine| B
  n4@{ shape: "tri", label: "ArcGIS Server" }--> B
  n5@{ shape: "tri", label: "S3" } -->  B
  
  B -->|Convert to parquet| C[init.parquet]

  %%C --> D{Pre-processing steps specified?}
  %%D -- Yes --> E[Apply pre-processing steps]
  %%E --> F[dataset_id.parquet]
  %%D -- No --> F

  E[dataset_id.parquet]
  C -->|No change| E
  C -->|Apply pre-processing steps| E
Loading

Output files

  • config.json: this is a metadata file about data and ingestion process.
  • init.parquet: raw data in parquet format
  • <dataset_id>.<format extension>: raw data. Example: dpr_forever_wild.zip.
  • <dataset_id>.parquet: output data. Example: dpr_forever_wild.parquet. If pre-processing steps were not specified in dataset template, then this file will be the same as init.parquet

Environment Setup

Docker

If you have docker installed, you can simply run docker run --rm -p 8888:8888 nycplanning/open_data_ingest. This will run a JupyterLab instance, and prompt you with a url after it's built.

More Manual

There's also a requirements.txt file here - it's pretty minimal. You will need both pip and git installed, since we haven't officially published dcpy yet, so the requirements file points to our repo.

However you choose to manage your python environment (venv, pyenv), simply install requirements.txt and you should be good to go. If you use VSCode like we do, you can run snippets in notebooks right in your editor, or you can try out the cli targets, such as

python3 -m dcpy.cli lifecycle ingest fdny_firehouses

dcpy Quick Start Guide

Steps to Ingest Data

1. Create a YAML Ingest Config File

Each dataset requires a YAML configuration file defining its ingestion settings. Below are the required fields.

Shared Required Fields

# <unique_dataset_id>.yml

id: <unique_dataset_id>  # Unique identifier for the dataset. It must match with its config filename like <unique_dataset_id>.yml

attributes:
  name: <dataset name>  # Human-readable name of the dataset

ingestion:
  source: <source details>  # Specifies the data source, see section below
  file_format: <format details>  # Defines the file format of the source data, seen section below

Required Fields by Source Type

From local file

This option assumes that you already have dataset of interest on your local machine.

source:
  type: local_file
  path: <path to local file>

Example:

source:
  type: local_file
  path: path/to/my/dataset.csv
From OpenData portal

Pull data from OpenData. To find org and uid values for a given dataset, refer to OpenData portal dataset's url. Though source format is specified, the file_format section is still required.

source:
  type: socrata
  org: <organization>  # Allowed values are: `nyc`, `nys`, and `nys_health`
  uid: <dataset identifier>  # Dataset identifier
  format: <file format>  # Data format of the source file. Allowed values are: `csv`, `geojson`, and `shapefile`

Examples:

# DPR Parks roperties: https://data.cityofnewyork.us/Recreation/Parks-Properties/enfh-gkve
source:
  type: socrata
  org: nyc
  uid: enfh-gkve
  format: geojson
# Solid Waste Management Facilities: https://data.ny.gov/Energy-Environment/Solid-Waste-Management-Facilities/2fni-raj8
source:
  type: socrata
  org: nys
  uid: 2fni-raj8
  format: csv
From Esri Feature Service
source:
  type: esri
  server: <server name>  # Allowed values are: `nys_clearinghouse`, `nys_parks`, `nps`, `dcp`, and `nyc_maphub`
  dataset: <dataset name>  # Name of the Esri dataset
  layer_id: <layer_id>  # ID of the layer (only specified if the dataset has multiple layers)

Example:

#  National Register of Historic Places: https://services.arcgis.com/1xFZPtKn1wKC6POA/ArcGIS/rest/services/National_Register_Building_Listings/FeatureServer
source:
  type: esri
  server: nys_parks
  dataset: National_Register_Building_Listings
  layer_id: 13
From API Pull data from an API. Currently available for datasets in `csv` and `json` file formats. Though source `format` is specified, the `file_format` section is still required.
source:
  type: api
  endpoint: <api endpoint> 
  format: <file format>  # Must be `csv` or `json` 

Example:

# NY Public Libraries: https://www.nypl.org/locations
source:
  type: api
  endpoint: https://refinery.nypl.org/api/nypl/locations/v1.0/locations
  format: json

Required Fields by Data Format Type

In csv
file_format:
  type: csv
  geometry: <geometry details>  # Only required if dataset is geospatial (see section below). Otherwise can be ommitted 

Examples:

# Non-geospatial dataset
file_format:
  type: csv
# Non-geospatial dataset with some optional attributes
file_format:
  type: csv
    encoding: utf-8
    delimiter: "|"
    column_names: ["Column 1", "Column 2"]	 # When data doesn't have headers, add new ones 
# Geospatial dataset with geometry stored in "Longitude" and "Latitude" columns
file_format:
  type: csv
  geometry:
    geom_column:
      x: Longitude
      y: Latitude
    crs: EPSG:4326
# Geospatial dataset with geometry in "GEOM" column
file_format:
  type: csv
  geometry:
    geom_column: GEOM
    crs: EPSG:2263
    format: wkb
In excel
file_format:
  type: xlsx  # The value can also be `excel`
  sheet_name: <excel sheet name or number>
  geometry: <geometry details>  # Only required if dataset is geospatial (see section below). Otherwise can be ommitted 

Examples:

# Non-geospatial dataset
file_format:
  type: xlsx
  sheet_name: Sheet_1
# Geospatial dataset with geometry in "wkb_geometry" column
file_format:
  type: xlsx
  sheet_name: Sheet_1
  geometry:
    geom_column: wkb_geometry
    crs: EPSG:2263
In Json
file_format:
  type: json
  json_read_fn: <json_read_fn>  # Allowed values: `normalize`, `read_json`. These are pandas functions to read in a json file -- refer to pandas docs for more details
  geometry: <geometry details>  # Only required if dataset is geospatial (see section below). Otherwise can be ommitted 

Examples:

# Non-geospatial dataset of Brooklyn Libraries: https://www.bklynlibrary.org/locations
file_format:
  type: json
  json_read_fn: normalize
  json_read_kwargs: { "record_path": [ "locations" ] }
# Geospatial dataset with geometry stored in "Longitude" and "Latitude" columns
file_format:
  type: json
  json_read_fn: normalize
  json_read_kwargs:
    {
      "record_path": ["Locations", "Location"],
      "meta": ["TrackerID", "FMSID", "Title", "TotalFunding"],
    }
  geometry:
    crs: EPSG:4326
    geom_column:
      x: Longitude
      y: Latitude
In GeoJson Note, crs is not an attribute for geojson format. Geojson has a specification of "EPSG:4326"
file_format:
  type: geojson

Example:

file_format:
  type: geojson
In shapefile
file_format:
  type: shapefile
  crs: <crs>  # Coordinate Reference System. Ex: `EPSG:4326`

Example:

file_format:
  type: shapefile
  crs: EPSG:2263
In geodatabase
file_format:
  type: geodatabase
  crs: <crs>  # Coordinate Reference System. Ex: `EPSG:4326`
  layer: <layer name>  # Only required if the file contains multiple layers. Otherwise can be ommitted 

Examples:

# Geodatabase file with one layer
file_format:
  type: geodatabase
  crs: EPSG:2263
# Geodatabase file with multiple layers. Pick `lion` layer
file_format:
  type: geodatabase
  layer: lion
  crs: EPSG:2263

2. Run the Ingest Command

Once your YAML file is ready, run the following command:

python3 -m dcpy.cli lifecycle ingest <unique_dataset_id> --template-dir <directory path>

3. Output

Processed data is saved as a Parquet file in the designated output directory.

Additional Options

  • Transformations: You can specify optional processing_steps for column renaming, cleaning, and more.
  • Geospatial Data: Define geospatial info in geometry property under file_format property.

About

Repo to demo how one can use ingest package. To be presented during 2025 Open Data Week

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •