Boettiger Lab Geospatial Datasets

A collection of cloud-native geospatial datasets optimized for analysis and visualization, available as H3-indexed GeoParquet, PMTiles, and Cloud-Optimized GeoTIFFs (COGs). All datasets are accessible via a STAC catalog hosted on the National Research Platform.

Available Datasets

Our collection currently includes 11 published datasets covering biodiversity, conservation, environmental justice, and infrastructure:

  • CPAD (California Protected Areas Database) - Protected lands in California
  • IUCN - Global species range maps and Red List assessments
  • WDPA (World Database on Protected Areas) - Global protected areas
  • Mapping Inequality - Historical redlining maps of US cities
  • HydroBasins - Global watershed boundaries at multiple hierarchical levels
  • Natural Capital Project - Ecosystem services and nature's contributions to people
  • GBIF - Global biodiversity occurrence records
  • US Census - Demographic and geographic data
  • Carbon - Carbon storage and emissions datasets
  • Social Vulnerability Index - CDC's social vulnerability indicators
  • Wetlands - National Wetlands Inventory data

All datasets are H3-indexed and partitioned by resolution-0 (h0) cells, the coarsest H3 level, for efficient spatial queries and parallel processing. Browse the STAC catalog for complete metadata, spatial/temporal extents, and direct HTTPS access to files.
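
For example, a static catalog can be walked with pystac; this is a sketch assuming pystac >= 1.8, and the catalog URL below is a placeholder, not the real endpoint:

import pystac

# placeholder URL -- substitute the catalog's actual root
catalog = pystac.Catalog.from_file("https://example.org/stac/catalog.json")
for item in catalog.get_items(recursive=True):
    print(item.id, list(item.assets))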


CNG Datasets Toolkit

This Python toolkit was used to process and generate the datasets above, converting large geospatial datasets into cloud-native formats with H3 hexagonal indexing.

Features

  • Vector Processing: Convert polygon and point datasets to H3-indexed GeoParquet
  • Raster Processing: Create Cloud-Optimized GeoTIFFs (COGs) and H3-indexed parquet
  • Kubernetes Integration: Generate and submit K8s jobs for large-scale processing
  • Cloud Storage: Manage S3 buckets and sync across multiple providers with rclone
  • Scalable: Chunk-based processing for datasets that don't fit in memory

Usage

While all package functions can run locally via the Python API, the intended use of this package is to auto-generate Kubernetes jobs that handle all of the processing, e.g. on the NRP Nautilus cluster. First follow the NRP documentation to get an account, set up kubectl, and run basic jobs on the cluster.

Vector Processing

For example, the following command will create PMTiles, GeoParquet, and partitioned, H3-indexed Parquet, with all processing running on the cluster. It will also create the bucket and configure public-read access and CORS headers appropriately. The helper utility generates the K8s jobs:

cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://dsl.richmond.edu/panorama/redlining/static/mappinginequality.gpkg \
  --bucket public-test \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 8Gi \
  --max-completions 200

The tool will then instruct you to submit the workflow as follows:

# One-time RBAC setup
kubectl apply -f k8s/workflow-rbac.yaml

# Apply all workflow files (safe to re-run)
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/workflow.yaml

The jobs will then run on the cluster in order. (Because the workflow itself also runs on the cluster, you don't have to keep your laptop open.)

The configmap is simply a list of the underlying jobs that the workflow will run. You can modify any of the k8s *-job.yaml files for other Kubernetes clusters or to tweak various settings, and then kubectl apply -f them individually as well. Most jobs use a single pod, except for the hex job, where the most computationally intensive steps happen.

Architecture

Kubernetes jobs run in remote pods that are cleaned up on completion. All data outputs are written directly to an S3 bucket, ready for use.

Output Structure

s3://bucket/
├── dataset-name.parquet         # GeoParquet with all attributes
├── dataset-name.pmtiles         # PMTiles vector tiles
└── dataset-name/
    └── hex/                     # H3-indexed parquet (partitioned by h0)
        └── h0=*/
            └── *.parquet
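
Because the hex output is hive-partitioned on h0, engines such as DuckDB can prune reads to a single partition. A hedged sketch, assuming S3 credentials and endpoint are configured as described under Configuration below; the bucket path and h0 value are placeholders:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
n = con.execute("""
    SELECT count(*)
    FROM read_parquet('s3://bucket/dataset-name/hex/h0=*/*.parquet',
                      hive_partitioning = true)
    WHERE h0 = '8029fffffffffff'  -- hypothetical h0 cell; only its partition is read
""").fetchone()[0]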

Processing Approach

Vector Datasets:

  1. Convert to optimized GeoParquet (if needed)
  2. Generate PMTiles for web visualization
  3. Tile to H3 hexagons in chunks
  4. Partition by h0 cells for efficient querying
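
A minimal sketch of steps 3-4, assuming point features with lat/lng columns and the h3-py v4 API (the file names are hypothetical); the real pipeline also handles polygons and writes directly to S3:

import h3
import pandas as pd

def index_chunk(df: pd.DataFrame, res: int = 10) -> pd.DataFrame:
    # assign each point its H3 cell plus the h0 partition key
    out = df.copy()
    out["h10"] = [h3.latlng_to_cell(lat, lng, res) for lat, lng in zip(out.lat, out.lng)]
    out["h0"] = [h3.cell_to_parent(c, 0) for c in out["h10"]]
    return out

# stream the source in chunks so the full dataset never sits in memory
for i, chunk in enumerate(pd.read_csv("points.csv", chunksize=500_000)):
    index_chunk(chunk).to_parquet(f"hex/part-{i:05d}.parquet")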

Raster Datasets:

  1. Create Cloud-Optimized GeoTIFF (COG)
  2. Convert to H3-indexed parquet by h0 regions
  3. Partition by h0 cells for efficient querying
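
A rough sketch of step 2, assuming a single-band raster in EPSG:4326 with a declared nodata value (the path is hypothetical); the real jobs process one h0 region at a time:

import h3
import numpy as np
import rasterio

with rasterio.open("dataset.tif") as src:
    band = src.read(1)
    rows, cols = np.nonzero(band != src.nodata)
    # pixel centers -> lon/lat -> H3 cells at the primary resolution
    xs, ys = rasterio.transform.xy(src.transform, rows, cols)
    cells = [h3.latlng_to_cell(y, x, 10) for x, y in zip(xs, ys)]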

H3 Resolutions

--h3-resolution 10        # Primary resolution (default: 10)
--parent-resolutions "9,8,0"  # Parent hexes for aggregation (default: "9,8,0")

Resolution Reference:

  • h12: ~3m (building-level)
  • h11: ~10m (lot-level)
  • h10: ~15m (street-level) - default
  • h9: ~50m (block-level)
  • h8: ~175m (neighborhood)
  • h7: ~600m (district)
  • h0: continent-scale (partitioning key)
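
Every parent resolution in --parent-resolutions is a cheap lookup from the primary cells, since each cell has a unique parent at every coarser resolution. An h3-py (v4) illustration:

import h3

cell = h3.latlng_to_cell(37.87, -122.27, 10)  # a street-level hex near Berkeley
h3.cell_to_parent(cell, 9)                    # block-level parent
h3.cell_to_parent(cell, 0)                    # continent-scale partition key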

Configuration

These routines rely on several software tools that can all read and write to S3 buckets: GDAL, DuckDB, and rclone. GDAL and DuckDB can both stream data directly to a bucket without writing a local file, and the package relies on environment variables to configure them. rclone provides file-based operations when streaming is not an option or is slower. Initial bucket creation, access permissions, and CORS configuration use the AWS CLI.

NRP provides a separate internal endpoint URL with faster performance (notably, it can handle many more open connections for parallel reads and writes).
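
A hedged sketch of how the configuration lines up across tools, with placeholder endpoint and credentials (GDAL reads the AWS_* variables; DuckDB's httpfs has matching s3_* settings):

import os
import duckdb

os.environ["AWS_S3_ENDPOINT"] = "s3.example.org"  # placeholder endpoint
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint = 's3.example.org';")
con.execute("SET s3_url_style = 'path';")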

Sync

Sync a bucket to another S3 system, e.g. source.coop. (Be sure to create the destination bucket first; on source.coop, create the repository in the web interface.) NOTE: with large datasets it is important to run this as a K8s job rather than syncing through a local machine, which is slow and prone to network timeouts.

cng-datasets sync-job \
    --job-name sync-to-source-coop \
    --source nrp:public-mappinginequality \
    --destination source:us-west-2.opendata.source.coop/cboettig/mappinginequality \
    --output sync-job.yaml
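
The generated manifest can then be submitted with kubectl apply -f sync-job.yaml, just like the workflow manifests above.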

Examples

See the individual dataset directories for complete examples:

  • redlining/ - Vector polygon processing with chunking
  • wetlands/glwd/ - Raster to H3 conversion with global h0 processing
  • wdpa/ - Large-scale protected areas processing
  • hydrobasins/ - Multi-level watershed processing
  • gbif/ - Species occurrence data processing

License

MIT License - see LICENSE for details

Contributing

See CONTRIBUTING.md for development guidelines
