@mojodna mojodna commented Oct 19, 2025

Usage: overture-schema [OPTIONS] COMMAND [ARGS]...

  Overture Schema command-line interface.

  Provides validation, schema generation, and type discovery for Overture Maps
  data.

  Examples:
    # Validate a file
    $ overture-schema validate data.json

    # Validate from stdin
    $ overture-schema validate - < data.json

    # List available types
    $ overture-schema list-types

    # Generate JSON schema
    $ overture-schema json-schema --theme buildings

    # Validate specific types
    $ overture-schema validate --theme buildings data.json

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  json-schema  Generate JSON schema for Overture Maps types.
  list-types   List all available types grouped by theme with descriptions.
  validate     Validate Overture Maps data against schemas.

overture-schema list-types:

overture-schema validate --type division --show-field id --show-field names:

─ [0] (Division) id=23e81262-d6ed-45a3-a1a0-4bc6a2... ─────────────────────────
  id       "23e81262-d6ed-45a3-a1a0-4bc6a2a887d8"
  bbox     {xmin: ..., xmax: ..., ymin: ...}
  country  <missing>  ← Input should be a valid string
  version  1
  names    primary: "Amundsen–Scott South Pole"
  ...
────────────────────────────────────────────────────────────────────────────────

I anticipate introducing a [parquet] variant in the future that can parse Parquet directly, but in the meantime, this works:

duckdb <<-'SQL' | uv run overture-schema validate --type division --show-field type --show-field id -
	INSTALL spatial;
	LOAD spatial;
	INSTALL httpfs;
	LOAD httpfs;
	SET s3_region='us-west-2';
	COPY (
	  SELECT ST_AsGeoJSON(geometry) AS geometry, * EXCLUDE geometry
	  FROM read_parquet('s3://overturemaps-us-west-2/release/2025-09-24.0/theme=divisions/type=division/*.parquet', hive_partitioning=true)
	  LIMIT 100
	) TO '/dev/stdout' (FORMAT JSON, ARRAY false);
SQL

@mojodna mojodna requested a review from vcschapp October 19, 2025 20:28
@mojodna mojodna added the `change type - cosmetic 🌹` label Oct 19, 2025
@mojodna mojodna changed the title overture-schema CLI for type listing, JSON Schema generation, and validation [Pydantic] overture-schema CLI for type listing, JSON Schema generation, and validation Oct 19, 2025

@vcschapp vcschapp left a comment


The output and examples are very cool.

The main thing I'm looking for/trying to wrap my head around is how all the pieces fit together.

Can I ask a bunch of questions that will hopefully help me understand the big picture? Links to code might be helpful on some...


  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?
  2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?
    • I suspect entry-point discovery will discover from globally-installed packages.
    • But can we also discover from the current venv/workspace?
  3. Where will code generation live? In this CLI?
  4. Are the theme and type that the CLI depends on just values parsed from the entry-points?
    • I assume yes.
    • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.
  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....
    • This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.
    • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.
    • Shouldn't the CLI just require you to specify the exact type you're trying to validate?
  6. Are the "overture" and "annex" namespaces reserved in some way?

mojodna commented Oct 30, 2025

  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?

Yes. The full list of dependencies on core is https://github.com/OvertureMaps/schema/pull/406/files#diff-bcbfe867ab7a1405a6384886b8ed2975dc659d798cb72af5fdc18a71e5617298R5-R13:

from overture.schema.core import parse_feature  # sets exclude_unset=True
from overture.schema.core.discovery import discover_models  # model discovery mechanism
from overture.schema.core.json_schema import json_schema  # this uses EnhancedJsonSchemaGenerator
from overture.schema.core.parser import (
    # list[BaseModel] variant
    parse_features,
    # validate-only variants
    validate_feature,
    validate_features,
)
from overture.schema.core.unions import create_union_from_models  # dynamically creates unions from discovered models for use by `*_feature[s]`.
  2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?

    • I suspect entry-point discovery will discover from globally-installed packages.
    • But can we also discover from the current venv/workspace?

I don't know. I haven't tried this yet. Within a specific Python environment, it will detect all entry points. That may extend to anything in PYTHONPATH, but setuptools' hooks may only apply to packages that are "installed."

uv run can't access (and doesn't seem to support configuring/allowing access to) modules outside the virtualenv that it manages. The opposite is probably less common, but I suspect that it works.

  3. Where will code generation live? In this CLI?

I was envisioning an overture-schema-codegen package. This would either contribute a sub-command (using the same entry point mechanism) to the CLI or provide its own CLI.

  4. Are the theme and type that the CLI depends on just values parsed from the entry-points?

    • I assume yes.
    • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

Yes. The only Overture-specific implementation is the --overture-types flag (e.g., https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R457-R461), which sets the namespace to use when discovering models: https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R66-R67

  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....

    • This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.
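The "fewest validation errors wins" heuristic can be sketched roughly as below; `Division` and `Source` here are toy stand-ins for the real discovered models, and the function shape is illustrative rather than the PR's actual implementation:

```python
# Toy sketch of ranking candidate models by validation-error count.
from pydantic import BaseModel, ValidationError


class Division(BaseModel):
    id: str
    country: str


class Source(BaseModel):
    property: str
    dataset: str


def best_match(data: dict, candidates: list[type[BaseModel]]):
    """Return (model, instance_or_None); fewest validation errors wins."""
    scored = []
    for model in candidates:
        try:
            return model, model.model_validate(data)  # clean match: done
        except ValidationError as exc:
            scored.append((len(exc.errors()), model))
    scored.sort(key=lambda pair: pair[0])  # fewest errors first
    return scored[0][1], None


# country=1 gives Division one type error; Source is missing both of its
# required fields, so Division is reported as the more likely type.
model, _ = best_match({"id": "x", "country": 1}, [Division, Source])
```

Counting errors is a coarse signal (a near-miss against a small model can outrank a near-miss against a large one), but it degrades gracefully when no discriminator field is available.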

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

  6. Are the "overture" and "annex" namespaces reserved in some way?

"overture" kind of is (see above), "annex" is not. This PR changes the entry point key to <namespace>[:<theme>]:<type>. Anyone could theoretically register an Overture type, but we'd presumably check this when doing our "extension validation" dance.
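For illustration, registering an extension type under the new key format might look something like this in a third party's pyproject.toml; the entry-point group name and the `annex_schema` package are hypothetical:

```toml
[project.entry-points."overture.schema.models"]
# themed type: <namespace>:<theme>:<type>
"annex:buildings:solar_array" = "annex_schema.models:SolarArray"
# non-themed type: <namespace>:<type>
"annex:watercraft" = "annex_schema.models:Watercraft"
```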

@vcschapp

  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?

Yes. The full list of dependencies from core are https://github.com/OvertureMaps/schema/pull/406/files#diff-bcbfe867ab7a1405a6384886b8ed2975dc659d798cb72af5fdc18a71e5617298R5-R13:

from overture.schema.core import parse_feature  # sets exclude_unset=True
from overture.schema.core.discovery import discover_models  # model discovery mechanism
from overture.schema.core.json_schema import json_schema  # this uses EnhancedJsonSchemaGenerator
from overture.schema.core.parser import (
    # list[BaseModel] variant
    parse_features,
    # validate-only variants
    validate_feature,
    validate_features,
)
from overture.schema.core.unions import create_union_from_models  # dynamically creates unions from discovered models for use by `*_feature[s]`.

Awesome! 🤯

@vcschapp

2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?

* I suspect entry-point discovery will discover from globally-installed packages.
* But can we also discover from the current venv/workspace?

I don't know. I haven't tried this yet. Within a specific Python environment, it will detect all entry points. That may extend to anything in PYTHONPATH, but setuptools' hooks may only apply to packages that are "installed."

I'm pretty sure it should work and we should aim for that deployment mode. Ideally the CLI is generic functionality that can be applied to any schema and doesn't have to depend on a specific schema or schema version.

I would ideally love to have all the following 8 configurations work the same ... Where the CLI can "discover at the same level and downward"...

[attached diagram: 2025-10-31-overture-schema-cli-package-architecture.drawio]

I think we can achieve this roughly. I haven't been able to capture all the nuance yet, but it's definitely possible for a Python program to figure out what "tier" it is running at (system, user, venv) and also to detect if the CWD has a venv.

@vcschapp

  3. Where will code generation live? In this CLI?

I was envisioning an overture-schema-codegen package.

That's how I see it also.

This would either contribute a sub-command (using the same entry point mechanism) to the CLI or provide its own CLI.

I guess we don't have to have the answer today.

(My preference is just to have one CLI, because it's fewer packages and tools for people to keep track of, and it aligns with the philosophy that every schema, even the core, is really an "extension".)


vcschapp commented Oct 31, 2025

4. Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

Yes. The only Overture-specific implementation is the --overture-types flag (e.g., https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R457-R461), which sets the namespace to use when discovering models: https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R66-R67

Awesome. 👍

@vcschapp

Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....

* This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

Hmmm. Are you sure this idea will be portable beyond the JSON context? I tried writing out how it'd work with a Parquet table or Spark dataframe and confused myself so I just stopped. Just want to make sure you are confident it will work sensibly.

@vcschapp

  6. Are the "overture" and "annex" namespaces reserved in some way?

"overture" kind of is (see above), "annex" is not. This PR changes the entry point key to <namespace>[:<theme>]:<type>. Anyone could theoretically register an Overture type, but we'd presumably check this when doing our "extension validation" dance.

Excellent. 👍


@vcschapp vcschapp left a comment


Approved. I merged PR #412 so there may be some minor rebase conflicts - sorry!


vcschapp commented Nov 4, 2025

Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.
  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....
* This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

Hmmm. Are you sure this idea will be portable beyond the JSON context? I tried writing out how it'd work with a Parquet table or Spark dataframe and confused myself so I just stopped. Just want to make sure you are confident it will work sensibly.

The more I think about this, the more I think it's a bad idea to try to validate against unions.

It works for the main Overture schema but only because we have the type discriminator. Otherwise it's an intractable problem.

Consider for example two arbitrary schemas for which the empty object {} is a valid instance. Which one of these schemas did {} successfully validate against? It turns out the answer is completely arbitrary and depends on the order in which the union was constructed:

>>> from pydantic import BaseModel, TypeAdapter
>>> 
>>> class Foo(BaseModel):
...   pass
... 
>>> class Bar(BaseModel):
...   pass
... 
>>> 
>>> Foo.model_validate({})
Foo()
>>> Bar.model_validate({})
Bar()
>>> type_adapter = TypeAdapter(Foo | Bar)
>>> type_adapter.validate_python({})
Foo()
>>> type_adapter2 = TypeAdapter(Bar | Foo)
>>> type_adapter2.validate_python({})
Bar()

This is a dangerous road to go down and I think we'll continually run into weird edge cases. The question is, for what gain? It's somewhat neat, but then again, why not ask the person with the context what model they're trying to validate and move on to bigger things?
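For contrast, a sketch of the tagged-union case mentioned above: with a `Literal` discriminator, Pydantic selects the branch from the tag value rather than from union ordering (the `kind` field here is illustrative, not an Overture field):

```python
# Discriminated (tagged) union: the tag decides the branch deterministically.
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class Foo(BaseModel):
    kind: Literal["foo"]


class Bar(BaseModel):
    kind: Literal["bar"]


adapter = TypeAdapter(Annotated[Union[Foo, Bar], Field(discriminator="kind")])
assert type(adapter.validate_python({"kind": "bar"})) is Bar

# Reversing the union order no longer changes the outcome.
adapter2 = TypeAdapter(Annotated[Union[Bar, Foo], Field(discriminator="kind")])
assert type(adapter2.validate_python({"kind": "foo"})) is Foo

# And {} now fails fast ("Unable to extract tag using discriminator 'kind'")
# instead of arbitrarily validating as whichever member came first.
```

This is exactly why the scheme works for the main Overture schema (which has `type`) and breaks down for tag-less schemas like Sources.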

Enhance the model discovery system to support multiple namespaces
and capture the fully qualified class name for each registered model.

ModelKey dataclass now contains:
- namespace: distinguishes "overture" from extensions like "annex"
- theme: optional, as some models may not belong to a theme
- type: the feature type name
- class_name: the entry point value for introspection

Entry point format changes from "theme.type" to "namespace:theme:type"
(or "namespace:type" for non-themed models). This enables third-party
schema extensions to register models without conflicting with core
Overture types.

discover_models() now accepts an optional namespace filter parameter.

Introduce a command-line interface for working with Overture schema
models. The CLI provides tools for schema introspection, validation,
and JSON Schema generation.

Commands:

  list-types [--theme THEME] [--detailed]
    List registered Overture types, optionally filtered by theme.
    With --detailed, shows model descriptions from docstrings.

  json-schema [--theme THEME] [--type TYPE]
    Generate JSON Schema for specified themes/types or all models.

  validate [--theme THEME] [--type TYPE] [--show-field FIELD] FILE
    Validate GeoJSON features against Overture schemas. Supports:
    - Single features and FeatureCollections
    - Heterogeneous collections (mixed types)
    - JSONL input from stdin (use '-' as FILE)
    - Automatic type detection via discriminator fields
    - Rich error display with data context windows

Type Resolution:

When --type is not specified, the validator builds a discriminated
union from registered models and uses Pydantic's tagged union support
to identify the most likely type. For heterogeneous collections, each
feature is validated against its detected type independently.

Error Display:

Validation errors show surrounding data context to help locate issues.
The --show-field option pins specific fields (e.g., id) in the display
header for easier identification in large datasets.

Pipeline Support:

The validate command accepts JSONL on stdin for integration with tools
like jq and gpq:

    gpq convert file.geoparquet --to geojson | \
    jq -c '.features[]' | \
    overture-schema validate --type building -

Module Structure:

- commands.py: Click command definitions
- type_analysis.py: Union type construction and discriminator handling
- error_formatting.py: Validation error processing and display
- data_display.py: Context window and field extraction
- output.py: Rich console output helpers
@vcschapp vcschapp merged commit 2e8a7b7 into pydantic Nov 26, 2025
4 checks passed
@vcschapp vcschapp deleted the pydantic-cli branch November 26, 2025 18:22