@mojodna mojodna commented Oct 19, 2025

Usage: overture-schema [OPTIONS] COMMAND [ARGS]...

  Overture Schema command-line interface.

  Provides validation, schema generation, and type discovery for Overture Maps
  data.

  Examples:
    # Validate a file
    $ overture-schema validate data.json

    # Validate from stdin
    $ overture-schema validate - < data.json

    # List available types
    $ overture-schema list-types

    # Generate JSON schema
    $ overture-schema json-schema --theme buildings

    # Validate specific types
    $ overture-schema validate --theme buildings data.json

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  json-schema  Generate JSON schema for Overture Maps types.
  list-types   List all available types grouped by theme with descriptions.
  validate     Validate Overture Maps data against schemas.

overture-schema list-types:

overture-schema validate --type division --show-field id --show-field names:

─ [0] (Division) id=23e81262-d6ed-45a3-a1a0-4bc6a2... ─────────────────────────
  id       "23e81262-d6ed-45a3-a1a0-4bc6a2a887d8"
  bbox     {xmin: ..., xmax: ..., ymin: ...}
  country  <missing>  ← Input should be a valid string
  version  1
  names    primary: "Amundsen–Scott South Pole"
  ...
────────────────────────────────────────────────────────────────────────────────

I anticipate introducing a [parquet] variant in the future that can parse Parquet directly, but in the meantime, this works:

duckdb <<-'SQL' | uv run overture-schema validate --type division --show-field type --show-field id -
	INSTALL spatial;
	LOAD spatial;
	INSTALL httpfs;
	LOAD httpfs;
	SET s3_region='us-west-2';
	COPY (
	  SELECT ST_AsGeoJSON(geometry) AS geometry, * EXCLUDE geometry
	  FROM read_parquet('s3://overturemaps-us-west-2/release/2025-09-24.0/theme=divisions/type=division/*.parquet', hive_partitioning=true)
	  LIMIT 100
	) TO '/dev/stdout' (FORMAT JSON, ARRAY false);
SQL

@mojodna mojodna requested a review from vcschapp October 19, 2025 20:28
@mojodna mojodna added the `change type - cosmetic 🌹` label Oct 19, 2025
@mojodna mojodna changed the title overture-schema CLI for type listing, JSON Schema generation, and validation [Pydantic] overture-schema CLI for type listing, JSON Schema generation, and validation Oct 19, 2025

@vcschapp vcschapp left a comment


The output and examples are very cool.

The main thing I'm looking for/trying to wrap my head around is how all the pieces fit together.

Can I ask a bunch of questions that will hopefully help me understand the big picture? Links to code might be helpful on some...


  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?
  2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?
    • I suspect entry-point discovery will discover from globally-installed packages.
    • But can we also discover from the current venv/workspace?
  3. Where will code generation live? In this CLI?
  4. Are the theme and type that the CLI depends on just values parsed from the entry-points?
    • I assume yes.
    • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.
  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....
    • This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.
    • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.
    • Shouldn't the CLI just require you to specify the exact type you're trying to validate?
  6. Are the "overture" and "annex" namespaces reserved in some way?

mojodna commented Oct 30, 2025

  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?

Yes. The full list of dependencies on core is https://github.com/OvertureMaps/schema/pull/406/files#diff-bcbfe867ab7a1405a6384886b8ed2975dc659d798cb72af5fdc18a71e5617298R5-R13:

from overture.schema.core import parse_feature  # sets exclude_unset=True
from overture.schema.core.discovery import discover_models  # model discovery mechanism
from overture.schema.core.json_schema import json_schema  # this uses EnhancedJsonSchemaGenerator
from overture.schema.core.parser import (
    # list[BaseModel] variant
    parse_features,
    # validate-only variants
    validate_feature,
    validate_features,
)
from overture.schema.core.unions import create_union_from_models  # dynamically creates unions from discovered models for use by `*_feature[s]`.
  2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?

    • I suspect entry-point discovery will discover from globally-installed packages.
    • But can we also discover from the current venv/workspace?

I don't know. I haven't tried this yet. Within a specific Python environment, it will detect all entry points. That may extend to anything in PYTHONPATH, but setuptools' hooks may only apply to packages that are "installed."

uv run can't access (and doesn't seem to support configuring/allowing access to) modules outside the virtualenv that it manages. The opposite is probably less common, but I suspect that it works.

  3. Where will code generation live? In this CLI?

I was envisioning an overture-schema-codegen package. This would either contribute a sub-command (using the same entry point mechanism) to the CLI or provide its own CLI.

  4. Are the theme and type that the CLI depends on just values parsed from the entry-points?

    • I assume yes.
    • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

Yes. The only Overture-specific implementation is the --overture-types flag (e.g., https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R457-R461), which sets the namespace to use when discovering models: https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R66-R67

  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....

    • This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.
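The "fewest validation errors wins" heuristic can be sketched roughly as below; `Division` and `Source` here are toy stand-ins for the real discovered models, and the function shape is illustrative rather than the PR's actual implementation:

```python
# Toy sketch of ranking candidate models by validation-error count.
from pydantic import BaseModel, ValidationError


class Division(BaseModel):
    id: str
    country: str


class Source(BaseModel):
    property: str
    dataset: str


def best_match(data: dict, candidates: list[type[BaseModel]]):
    """Return (model, instance_or_None); fewest validation errors wins."""
    scored = []
    for model in candidates:
        try:
            return model, model.model_validate(data)  # clean match: done
        except ValidationError as exc:
            scored.append((len(exc.errors()), model))
    scored.sort(key=lambda pair: pair[0])  # fewest errors first
    return scored[0][1], None


# country=1 gives Division one type error; Source is missing both of its
# required fields, so Division is reported as the more likely type.
model, _ = best_match({"id": "x", "country": 1}, [Division, Source])
```

Counting errors is a coarse signal (a near-miss against a small model can outrank a near-miss against a large one), but it degrades gracefully when no discriminator field is available.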

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

  6. Are the "overture" and "annex" namespaces reserved in some way?

"overture" kind of is (see above), "annex" is not. This PR changes the entry point key to <namespace>[:<theme>]:<type>. Anyone could theoretically register an Overture type, but we'd presumably check this when doing our "extension validation" dance.
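For illustration, registering an extension type under the new key format might look something like this in a third party's pyproject.toml; the entry-point group name and the `annex_schema` package are hypothetical:

```toml
[project.entry-points."overture.schema.models"]
# themed type: <namespace>:<theme>:<type>
"annex:buildings:solar_array" = "annex_schema.models:SolarArray"
# non-themed type: <namespace>:<type>
"annex:watercraft" = "annex_schema.models:Watercraft"
```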

@vcschapp

  1. I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?

Yes. The full list of dependencies from core are https://github.com/OvertureMaps/schema/pull/406/files#diff-bcbfe867ab7a1405a6384886b8ed2975dc659d798cb72af5fdc18a71e5617298R5-R13:

from overture.schema.core import parse_feature  # sets exclude_unset=True
from overture.schema.core.discovery import discover_models  # model discovery mechanism
from overture.schema.core.json_schema import json_schema  # this uses EnhancedJsonSchemaGenerator
from overture.schema.core.parser import (
    # list[BaseModel] variant
    parse_features,
    # validate-only variants
    validate_feature,
    validate_features,
)
from overture.schema.core.unions import create_union_from_models  # dynamically creates unions from discovered models for use by `*_feature[s]`.

Awesome! 🤯

@vcschapp

2. Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?

* I suspect entry-point discovery will discover from globally-installed packages.
* But can we also discover from the current venv/workspace?

I don't know. I haven't tried this yet. Within a specific Python environment, it will detect all entry points. That may extend to anything in PYTHONPATH, but setuptools' hooks may only apply to packages that are "installed."

I'm pretty sure it should work and we should aim for that deployment mode. Ideally the CLI is generic functionality that can be applied to any schema and doesn't have to depend on a specific schema or schema version.

I would ideally love to have all the following 8 configurations work the same ... Where the CLI can "discover at the same level and downward"...

[attached diagram: 2025-10-31-overture-schema-cli-package-architecture.drawio]

I think we can achieve this roughly. I haven't been able to capture all the nuance yet, but it's definitely possible for a Python program to figure out what "tier" it is running at (system, user, venv) and also to detect if the CWD has a venv.

@vcschapp

  3. Where will code generation live? In this CLI?

I was envisioning an overture-schema-codegen package.

That's how I see it also.

This would either contribute a sub-command (using the same entry point mechanism) to the CLI or provide its own CLI.

I guess we don't have to have the answer today.

(My preference is just to have one CLI, because it's fewer packages and tools for people to keep track of, and it aligns with the philosophy that every schema, even the core, is really an "extension".)


vcschapp commented Oct 31, 2025

4. Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

Yes. The only Overture-specific implementation is the --overture-types flag (e.g., https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R457-R461), which sets the namespace to use when discovering models: https://github.com/OvertureMaps/schema/pull/406/files#diff-5c898d587fe2fbc5e7d402913e372a553edfbe979182578cf3d8176381dfbf35R66-R67

Awesome. 👍

@vcschapp

Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.

5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....

* This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

Hmmm. Are you sure this idea will be portable beyond the JSON context? I tried writing out how it'd work with a Parquet table or Spark dataframe and confused myself so I just stopped. Just want to make sure you are confident it will work sensibly.

@vcschapp

  6. Are the "overture" and "annex" namespaces reserved in some way?

"overture" kind of is (see above), "annex" is not. This PR changes the entry point key to <namespace>[:<theme>]:<type>. Anyone could theoretically register an Overture type, but we'd presumably check this when doing our "extension validation" dance.

Excellent. 👍


@vcschapp vcschapp left a comment


Approved. I merged PR #412 so there may be some minor rebase conflicts - sorry!


vcschapp commented Nov 4, 2025

Are the theme and type that the CLI depends on just values parsed from the entry-points?

  • I assume yes.
  • A yes means the CLI doesn't depend on OvertureFeature or any of the models in core, which would be ideal.
  5. While the ability to take in "arbitrary blob of data", validate it against "all the models" and tell you which one it is is neat ....
* This seems complex and maybe also unsustainable if we keep adding in schemas like "sources" that don't have theme/type discriminators to make the job easy.

I sorted it generally while supporting Sources; it uses Pydantic's built-in support for unions and has a heuristic (which also applies when choosing between Overture themes/types) to determine the most likely type based on the number of validation errors produced by each candidate.

  • Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.

I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.

  • Shouldn't the CLI just require you to specify the exact type you're trying to validate?

In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.

Hmmm. Are you sure this idea will be portable beyond the JSON context? I tried writing out how it'd work with a Parquet table or Spark dataframe and confused myself so I just stopped. Just want to make sure you are confident it will work sensibly.

The more I think about this, the more I think it's a bad idea to try to validate against unions.

It works for the main Overture schema but only because we have the type discriminator. Otherwise it's an intractable problem.

Consider for example two arbitrary schemas for which the empty object {} is a valid instance. Which one of these schemas did {} successfully validate against? It turns out the answer is completely arbitrary and depends on the order in which the union was constructed:

>>> from pydantic import BaseModel, TypeAdapter
>>> 
>>> class Foo(BaseModel):
...   pass
... 
>>> class Bar(BaseModel):
...   pass
... 
>>> 
>>> Foo.model_validate({})
Foo()
>>> Bar.model_validate({})
Bar()
>>> type_adapter = TypeAdapter(Foo | Bar)
>>> type_adapter.validate_python({})
Foo()
>>> type_adapter2 = TypeAdapter(Bar | Foo)
>>> type_adapter2.validate_python({})
Bar()

This is a dangerous road to go down and I think we'll continually run into weird edge cases. The question is, for what gain? It's somewhat neat, but then again, why not ask the person with the context what model they're trying to validate and move on to bigger things?
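For contrast, a sketch of the tagged-union case mentioned above: with a `Literal` discriminator, Pydantic selects the branch from the tag value rather than from union ordering (the `kind` field here is illustrative, not an Overture field):

```python
# Discriminated (tagged) union: the tag decides the branch deterministically.
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class Foo(BaseModel):
    kind: Literal["foo"]


class Bar(BaseModel):
    kind: Literal["bar"]


adapter = TypeAdapter(Annotated[Union[Foo, Bar], Field(discriminator="kind")])
assert type(adapter.validate_python({"kind": "bar"})) is Bar

# Reversing the union order no longer changes the outcome.
adapter2 = TypeAdapter(Annotated[Union[Bar, Foo], Field(discriminator="kind")])
assert type(adapter2.validate_python({"kind": "foo"})) is Foo

# And {} now fails fast ("Unable to extract tag using discriminator 'kind'")
# instead of arbitrarily validating as whichever member came first.
```

This is exactly why the scheme works for the main Overture schema (which has `type`) and breaks down for tag-less schemas like Sources.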

Enhance the model discovery system to support multiple namespaces
and capture the fully qualified class name for each registered model.

ModelKey dataclass now contains:
- namespace: distinguishes "overture" from extensions like "annex"
- theme: optional, as some models may not belong to a theme
- type: the feature type name
- class_name: the entry point value for introspection

Entry point format changes from "theme.type" to "namespace:theme:type"
(or "namespace:type" for non-themed models). This enables third-party
schema extensions to register models without conflicting with core
Overture types.

discover_models() now accepts an optional namespace filter parameter.

Introduce a command-line interface for working with Overture schema
models. The CLI provides tools for schema introspection, validation,
and JSON Schema generation.

Commands:

  list-types [--theme THEME] [--detailed]
    List registered Overture types, optionally filtered by theme.
    With --detailed, shows model descriptions from docstrings.

  json-schema [--theme THEME] [--type TYPE]
    Generate JSON Schema for specified themes/types or all models.

  validate [--theme THEME] [--type TYPE] [--show-field FIELD] FILE
    Validate GeoJSON features against Overture schemas. Supports:
    - Single features and FeatureCollections
    - Heterogeneous collections (mixed types)
    - JSONL input from stdin (use '-' as FILE)
    - Automatic type detection via discriminator fields
    - Rich error display with data context windows

Type Resolution:

When --type is not specified, the validator builds a discriminated
union from registered models and uses Pydantic's tagged union support
to identify the most likely type. For heterogeneous collections, each
feature is validated against its detected type independently.

Error Display:

Validation errors show surrounding data context to help locate issues.
The --show-field option pins specific fields (e.g., id) in the display
header for easier identification in large datasets.

Pipeline Support:

The validate command accepts JSONL on stdin for integration with tools
like jq and gpq:

    gpq convert file.geoparquet --to geojson | \
    jq -c '.features[]' | \
    overture-schema validate --type building -

Module Structure:

- commands.py: Click command definitions
- type_analysis.py: Union type construction and discriminator handling
- error_formatting.py: Validation error processing and display
- data_display.py: Context window and field extraction
- output.py: Rich console output helpers
@vcschapp vcschapp merged commit 2e8a7b7 into pydantic Nov 26, 2025
4 checks passed
@vcschapp vcschapp deleted the pydantic-cli branch November 26, 2025 18:22