[Pydantic] overture-schema CLI for type listing, JSON Schema generation, and validation #406
vcschapp
left a comment
The output and examples are very cool.
The main thing I'm looking for/trying to wrap my head around is how all the pieces fit together.
Can I ask a bunch of questions that will hopefully help me understand the big picture? Links to code might be helpful on some...
- I see the CLI only depends on the "core" package. If we migrate the model discovery stuff down into system, does that mean we can drop the direct dependency on core and just have the CLI depend on system?
- Functionally, it would be ideal if the CLI can be PIP-installed and just detect all the entry-points available from the CWD including globally-installed packages and workspace/env packages. How far are we from that?
- I suspect entry-point discovery will discover from globally-installed packages.
- But can we also discover from the current venv/workspace?
- Where will code generation live? In this CLI?
- Are the theme and type that the CLI depends on just values parsed from the entry-points?
- I assume yes.
- A yes means the CLI doesn't depend on `OvertureFeature` or any of the models in core, which would be ideal.
- While the ability to take in an "arbitrary blob of data", validate it against "all the models", and tell you which one it matches is neat ....
- This seems complex, and maybe also unsustainable, if we keep adding schemas like "sources" that don't have theme/type discriminators to make the job easy.
- Is there any business need for this? It's hard to imagine a case where the type of data wouldn't be known in advance.
- Shouldn't the CLI just require you to specify the exact type you're trying to validate?
- Are the "overture" and "annex" namespaces reserved in some way?
Yes. The full list of dependencies from core is here: https://github.com/OvertureMaps/schema/pull/406/files#diff-bcbfe867ab7a1405a6384886b8ed2975dc659d798cb72af5fdc18a71e5617298R5-R13

```python
from overture.schema.core import parse_feature  # sets exclude_unset=True
from overture.schema.core.discovery import discover_models  # model discovery mechanism
from overture.schema.core.json_schema import json_schema  # this uses EnhancedJsonSchemaGenerator
from overture.schema.core.parser import (
    # list[BaseModel] variant
    parse_features,
    # validate-only variants
    validate_feature,
    validate_features,
)
from overture.schema.core.unions import create_union_from_models  # dynamically creates unions from discovered models for use by `*_feature[s]`
```
I don't know. I haven't tried this yet. Within a specific Python environment, it will detect all entry points. That may extend to anything in
I was envisioning an
Yes. The only Overture-specific implementation is the
I sorted it generally while supporting
I thought about that too, but decided that the convenience was valuable. It also supports heterogeneous lists.
In the future paradigm (no unified schema), yes, but if people are working with 〰️ Overture Data 〰️ , not requiring it is convenient.
"overture" kind of is (see above), "annex" is not. This PR changes the entry point key to
Awesome! 🤯
I'm pretty sure it should work, and we should aim for that deployment mode. Ideally the CLI is generic functionality that can be applied to any schema and doesn't have to depend on a specific schema or schema version. I would ideally love to have all the following 8 configurations work the same ... where the CLI can "discover at the same level and downward" ... I think we can achieve this roughly. I haven't been able to capture all the nuance yet, but it's definitely possible for a Python program to figure out what "tier" it is running at (system, user, venv) and also to detect if the CWD has a venv.
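As a rough sketch of the "tier" detection mentioned above, two well-known checks get most of the way there; the function names are made up for illustration.

```python
# Sketch: detecting whether the interpreter runs inside a venv, and
# whether the CWD contains one. Function names are illustrative only.
import sys
from pathlib import Path

def running_in_venv() -> bool:
    # In a virtual environment, sys.prefix points at the venv while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix

def cwd_has_venv() -> bool:
    # Heuristic: conventional venv directory names containing the
    # pyvenv.cfg marker file that `python -m venv` writes.
    return any((Path.cwd() / name / "pyvenv.cfg").is_file()
               for name in (".venv", "venv"))

print(running_in_venv(), cwd_has_venv())
```

Distinguishing "system" from "user" installs would additionally need to compare the running interpreter's path against `site`-reported locations, which is where the remaining nuance lives.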
That's how I see it also.
I guess we don't have to have the answer today. (My preference is just to have one CLI, because it's fewer packages and tools for people to keep track of, and it aligns with the philosophy that every schema, even the core, is really an "extension".)
Awesome. 👍
Hmmm. Are you sure this idea will be portable beyond the JSON context? I tried writing out how it'd work with a Parquet table or Spark dataframe and confused myself, so I just stopped. Just want to make sure you are confident it will work sensibly.
Excellent. 👍
vcschapp
left a comment
Approved. I merged PR #412 so there may be some minor rebase conflicts - sorry!
The more I think about this, the more I think it's a bad idea to try to validate against unions. It works for the main Overture schema, but only because we have the type discriminator. Otherwise it's an intractable problem. Consider, for example, two arbitrary schemas for which the empty object is valid: the union's answer depends purely on declaration order.

```python
>>> from pydantic import BaseModel, TypeAdapter
>>>
>>> class Foo(BaseModel):
...     pass
...
>>> class Bar(BaseModel):
...     pass
...
>>> Foo.model_validate({})
Foo()
>>> Bar.model_validate({})
Bar()
>>> type_adapter = TypeAdapter(Foo | Bar)
>>> type_adapter.validate_python({})
Foo()
>>> type_adapter2 = TypeAdapter(Bar | Foo)
>>> type_adapter2.validate_python({})
Bar()
```

This is a dangerous road to go down, and I think we'll continually run into weird edge cases. The question is, for what gain? It's somewhat neat, but then again, why not ask the person with the context which model they're trying to validate and move on to bigger things?
packages/overture-schema-core/src/overture/schema/core/parser.py
Enhance the model discovery system to support multiple namespaces and capture the fully qualified class name for each registered model.

ModelKey dataclass now contains:
- namespace: distinguishes "overture" from extensions like "annex"
- theme: optional, as some models may not belong to a theme
- type: the feature type name
- class_name: the entry point value for introspection

Entry point format changes from "theme.type" to "namespace:theme:type" (or "namespace:type" for non-themed models). This enables third-party schema extensions to register models without conflicting with core Overture types.

discover_models() now accepts an optional namespace filter parameter.
Introduce a command-line interface for working with Overture schema
models. The CLI provides tools for schema introspection, validation,
and JSON Schema generation.
Commands:
list-types [--theme THEME] [--detailed]
List registered Overture types, optionally filtered by theme.
With --detailed, shows model descriptions from docstrings.
json-schema [--theme THEME] [--type TYPE]
Generate JSON Schema for specified themes/types or all models.
validate [--theme THEME] [--type TYPE] [--show-field FIELD] FILE
Validate GeoJSON features against Overture schemas. Supports:
- Single features and FeatureCollections
- Heterogeneous collections (mixed types)
- JSONL input from stdin (use '-' as FILE)
- Automatic type detection via discriminator fields
- Rich error display with data context windows
Type Resolution:
When --type is not specified, the validator builds a discriminated
union from registered models and uses Pydantic's tagged union support
to identify the most likely type. For heterogeneous collections, each
feature is validated against its detected type independently.
Error Display:
Validation errors show surrounding data context to help locate issues.
The --show-field option pins specific fields (e.g., id) in the display
header for easier identification in large datasets.
Pipeline Support:
The validate command accepts JSONL on stdin for integration with tools
like jq and gpq:
gpq convert file.geoparquet --to geojson | \
jq -c '.features[]' | \
overture-schema validate --type building -
Module Structure:
- commands.py: Click command definitions
- type_analysis.py: Union type construction and discriminator handling
- error_formatting.py: Validation error processing and display
- data_display.py: Context window and field extraction
- output.py: Rich console output helpers
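To make the module structure concrete, here is a sketch of what a command in `commands.py` might look like. The command and option names mirror the commit message; the body is a placeholder, not the PR's actual code.

```python
# Sketch: a Click command shaped like the "list-types" command described
# above. Names mirror the commit message; the body is a placeholder.
import click

@click.group()
def cli():
    """overture-schema command group."""

@cli.command("list-types")
@click.option("--theme", default=None, help="Filter by theme.")
@click.option("--detailed", is_flag=True, help="Show model docstrings.")
def list_types(theme, detailed):
    # A real implementation would call discover_models() and render
    # the registered types; this just echoes the parsed options.
    click.echo(f"theme={theme} detailed={detailed}")

if __name__ == "__main__":
    cli()
```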
`overture-schema list-types`:

`overture-schema validate --type division --show-field id --show-field names`:

I anticipate introducing a `[parquet]` variant in the future that can parse Parquet directly, but in the meantime, this works: