Conversation

@vcschapp vcschapp commented Nov 5, 2025

tl;dr

This is part #5. Part #6 will happen ... sometime, and it is hoped it will be the last, and the best, of the package re-orgs. Our last best hope, if you will.

Done in this part:

  • The following themes are (mostly) fully re-organized: base, buildings (and places and addresses from part the fourth).
  • Core package is partly re-organized and json_schema module is promoted to system.
  • A number of bugs are fixed including several big GeoJSON-related Pydantic parsing issues and a subtle bug with the json_schema generation.
  • Parsing modules are aligned more with Pydantic conventions, which is made possible by the more robust GeoJSON parsing coming from the bug fixes, especially the GeoJSON-aware discriminated union fix for Feature.
  • Pydantic warnings are eliminated and asserted against in make check.
  • Doc format errors are reduced to less than a terminal page and only errors that are in scope for episode 6 remain to be fixed.
  • Various other clean-up.

Merge guidance: Rebase

Unlike parts 1-3, this iteration has an adequate commit history. It should ideally be rebase-merged onto the pydantic branch rather than squashed.

Journey

Destination

  1. Packages.
  2. Documentation.
    • ✅ All doctests will pass and automatically be enforced to pass by make check.
    • ▶️ All documentation will pass pydocstyle with the NumPy convention. (As of now, the tool reports too many complaints to add it into make check.)
    • ▶️ pydocstyle will be enforced by make check.
    • ✖️ docformatter will be run by make check. Incompatible with pydocstyle and NumPy-style documentation.
    • ▶️ Package-level README content will migrate into the package's __init__.py docstring.
    • ⬜️ There will be a doc target in the Makefile that does uv run pdoc to a useful place.
  3. Modules.
    • ▶️ Re-exports will be standardized and everything of value will be re-exported into the main module.
    • ▶️ In theme packages, each feature type will be within its own non-exported module within the package root.
    • ▶️ In theme packages, if a field contains a theme-specific "deep model" then it will be given its own module, e.g. perspectives.

TODO list for episode 6

  1. Promote model discovery from core into system.
  2. Finish documenting land and land_cover in base based on discussion in #working-group-schema.
  3. Re-organize divisions, transportation, and annex.
  4. Move perspectives into scoping.
  5. Finish re-organizing core, which is almost there.
  6. Eliminate all make docformat errors and add pydocstyle into make check.
  7. Add make api-docs-live and make api-docs-html targets to Makefile.
  8. Add doctest examples in system package __init__.py for main Feature use cases:
    • Make a feature with constrained geometry and roundtrip GeoJSON
    • Make a discriminated union of features
  9. Generify FieldConstraint so validate method can be type hint valid as much as possible, along the lines sketched below.
from typing import TypeVar, Generic

T = TypeVar('T')

class Parent(Generic[T]):
    def method(self, x: T) -> None:
        ...

class ChildStr(Parent[str]):
    def method(self, x: str) -> None:  # ✓ OK - matches Parent[str]
        ...

class ChildInt(Parent[int]):
    def method(self, x: int) -> None:  # ✓ OK - matches Parent[int]
        ...

@vcschapp vcschapp requested a review from mojodna November 5, 2025 10:03
@vcschapp vcschapp added the "change type - cosmetic 🌹" label Nov 5, 2025
Note: Still waiting on some information from Jennings and Jonah to
improve the documentation for `land`, `land_cover`, and `water`.
The PR author (@vcschapp) left a comment:

Added callouts of some of the more interesting parts in comments, as is the tradition.

Global land derived from the inverse of OSM Coastlines. Translates `natural` tags from OpenStreetMap.
TODO: Update this description when the relationship to `land_cover` is better understood.
The PR author (@vcschapp) commented:

Coming back to finish the descriptions for Land and LandCover is on the TODO list for episode #6 that's listed in the PR description....

from overture.schema.system.model_constraint import FieldEqCondition


class PlaceType(str, Enum):
The PR author (@vcschapp) commented:

NB: To be renamed in the divisions package re-org to remove the "Place" prefix, preventing any confusion with the places theme. The historical tie to WOF will still be documented, obviously.

>>> try:
...     MyModel(**{"osm_id": "foo"})
... except ValidationError as e:
...     assert "Invalid OSM ID format: foo. Must be n123, w123, or r123." in str(e)
The PR author (@vcschapp) commented:

I'm trying to standardize all exception messages to begin with lowercase.


I'm basing this convention on multiple (hopefully unbiased) LLM prompts to different models that basically land on this as the exception message convention. (The below summary is from ChatGPT, which claims it has examined the Python standard library in generalizing them.)

| Element | Convention | Example |
| --- | --- | --- |
| Capitalization | lowercase unless proper noun/type | `"invalid configuration option 'mode'"` |
| Ending punctuation | no period | `"missing required field 'name'"` |
| Formatting | short, plain English phrase | `"expected int, got str"` |
| Including data | quote or `repr()` values | `"invalid value: {!r}".format(val)` |
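As a small illustration of those conventions (the `parse_mode` helper and its modes are invented for this example):

```python
def parse_mode(value: str) -> str:
    """Return the mode if it is valid, else raise with a convention-following message."""
    valid_modes = {"fast", "safe"}
    if value not in valid_modes:
        # lowercase start, no trailing period, offending value shown with !r
        raise ValueError(f"invalid mode: {value!r}, expected one of {sorted(valid_modes)}")
    return value
```

Calling `parse_mode("turbo")` raises `ValueError` with a message that starts lowercase, ends without a period, and quotes the bad value.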

Comment on lines +112 to +117
Minimum X-coordinate of the bounding box.
Returns
-------
float | int
Minimum X-coordinate
The PR author (@vcschapp) commented:

Correct NumPy-style documentation that pydocstyle approves of.

Big favor ask! Please use make docformat to make sure your changes don't regress the pydocstyle warnings and you are hopefully killing all the warnings in files you're touching!
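For reference, a complete property in this style (a hypothetical `BBox` class, not the actual one in this PR) that pydocstyle's NumPy convention accepts:

```python
class BBox:
    """Axis-aligned bounding box (hypothetical example)."""

    def __init__(self, xmin: float, ymin: float, xmax: float, ymax: float) -> None:
        self._xmin = xmin
        self._ymin = ymin
        self._xmax = xmax
        self._ymax = ymax

    @property
    def xmin(self) -> float:
        """Minimum X-coordinate of the bounding box.

        Returns
        -------
        float
            Minimum X-coordinate.
        """
        return self._xmin
```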

for f in feature_classes
),
),
Field(discriminator=Feature.field_discriminator("type", *feature_classes)),
The PR author (@vcschapp) commented:

👉 The other use of GeoJSON-aware dynamic discriminator.
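For readers unfamiliar with the underlying mechanism, a generic Pydantic v2 discriminated union (with a plain string discriminator, not the GeoJSON-aware variant this PR adds) looks like:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class Cat(BaseModel):
    kind: Literal["cat"]
    lives: int = 9

class Dog(BaseModel):
    kind: Literal["dog"]

# The discriminator tells Pydantic which union branch to pick, based on "kind".
Pet = Annotated[Union[Cat, Dog], Field(discriminator="kind")]

pet = TypeAdapter(Pet).validate_python({"kind": "cat"})
```

The GeoJSON-aware version in this PR builds the discriminator dynamically from the feature classes instead of hard-coding a literal field name in an annotation.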

A reviewer commented:

I was hoping this would also eliminate the Overture-specific type field reference (by using the explicit Tags and pushing Feature.field_discriminator into the base class implementation, currying the field name), but I'm not sure how we would do that. The primary motivation for doing that would be to support union creation in system.

The PR author (@vcschapp) replied:

That would be powerful - we should noodle on how to achieve it. It would be ideal if all that logic could be lifted up into system...

@@ -1,76 +1,171 @@
__path__ = __import__("pkgutil").extend_path(__path__, __name__)
The PR author (@vcschapp) commented:

Calling this file out because:

  1. It is a significant re-imagining of the "parsing features" concept.
  2. There's some overlap with the peripheries of #406 ("[Pydantic] overture-schema CLI for type listing, JSON Schema generation, and validation").

The path to get here wasn't entirely linear, but if we pretend it was then I asked myself:

  1. What do parse and parse_feature do?
  2. What value do they add above the parsing features Pydantic already has?

My answers were basically that if our models integrate properly with Pydantic then:

  1. parse_feature does not add any real value because the same results can be obtained from calling BaseModel.model_validate and BaseModel.model_validate_json on the feature class.
  2. The value added by parse is that it can see the union of all discovered Overture models, so it can validate against the entire union.†

My actions then were to delete parse_feature and to re-imagine the interface of parse to:

  1. Be as consistent as possible with Pydantic conventions. This is where the rename from parse to validate* comes from, as well as the split between validate and validate_json and the return value being clearly expressed as BaseModel.
  2. Leverage existing Pydantic features, such as the ability to validate a model from raw JSON, and the ability to validate a Python data structure, either a dict or a model class, against a model.

Some of the effects of this re-imagining are:

  • The entire "GeoJSON rotation" of a feature is now completely encapsulated within Feature in the system package, and, I hope, it will not leak out of it.
  • All GeoJSON validation is just JSON validation of a feature using BaseModel.model_validate_json.
  • There is no longer any concept of "a GeoJSON dict" independent of a Feature but if someone wants that they just have to dump the feature with BaseModel.model_dump_json and parse the JSON into a dict.
  • There may be some "technical" performance regressions in places because, while Pydantic has all the facilities it needs, there are edge cases where you need to go through serde with the json package to achieve your goal, as is always true with Pydantic. I think this is fine. I doubt JSON optimization is really needed because at-scale use cases shouldn't use JSON anyway, but if some optimized path is desired by someone we should evaluate the benefits against the cons of complexity and mission creep.

†: I'm still not sold on validating the union being a feature that is particularly needed by customers or particularly scalable, as some of my comments on #406 show, but that's a discussion for another day.
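The validate/validate_json split mirrors Pydantic's own model_validate/model_validate_json pair. A minimal sketch with a generic model (a made-up Point, not an Overture feature):

```python
from pydantic import BaseModel

class Point(BaseModel):
    x: float
    y: float

# model_validate takes Python data (a dict or an existing model instance) ...
p1 = Point.model_validate({"x": 1.0, "y": 2.0})
# ... while model_validate_json takes raw JSON text.
p2 = Point.model_validate_json('{"x": 1.0, "y": 2.0}')
assert p1 == p2
```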

A reviewer commented:

I agree with this. I think it's safe to remove. The primary consumer is currently the validation tests, IIRC, and it can be moved there.

Since you fixed the "GeoJSON rotation," I also think it's valid to say that consumers (like the CLI package) should create their own TypeAdapters (or use a public version of _union_type_adapter), since there's no magic needed any longer.

Comment on lines +45 to +46
When using this class to generate a JSON Schema, you must dump your model JSON carefully to
ensure that it matches the JSON Schema.
The PR author (@vcschapp) commented:

The need for this warning is why I think Omitable is needed.

It's weird for us to say "Overture uses this omit convention but if you use our system you have to think carefully about matching your JSON Schema and JSON dumps of your model"...

A reviewer commented:

Sure, but that still feels like a later problem if we find that lots of people are using this with JSON data + JSON Schema (vs. other approaches that are more in line with how our data pipeline is going to use it).

2. If the field's default value is `None` (JSON `null`), the default value is removed.
"""

@override
The PR author (@vcschapp) commented:

This is a slight refactoring of how things were achieved in the previous EnhancedJsonSchemaGenerator but there's also a subtle bug fix. Previously, it would eliminate the null option from required nullable fields.

So if you had something along the lines of this:

class FooModel(BaseModel):
    foo: str | Null   # This is a required nullable field, you can't leave it unset.

Then Pydantic would give you a JSON Schema like this:

{
  "required": ["foo"],
  "properties": {
    "foo": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  }
}

And EnhancedJsonSchemaGenerator would remove the "anyOf"/null option, which changed the meaning of the JSON Schema from "you have to put something, but it can be a string or the value null" to "you have to put a string".


Apart from the ⚠️ warning above, it's subtle issues like this that leave me preferring the one-shot Omitable option where the schema itself encodes the semantics so that our users don't have to think about these things.

A reviewer commented:

I'm not aware of any uses of | Null, nor do I expect Overture to need it.

Strictly speaking, you're right, but I lean toward YAGNI until we learn that we do (and documenting the limitations is a great first step).

"""

source_tags: SourceTags | None = None
wikidata: WikidataId | None = None
A reviewer commented:

Should this be split out? It may be present in the OSM tags already, so wikidata is either redundant or applies to more than data sourced from OSM.

The PR author (@vcschapp) replied:

Almost definitely yes. It just ended up below my subjective refactoring line.

A reviewer commented:

Why _common (private) vs. common and not exported at the top level?

The PR author (@vcschapp) replied:

They're all exported from __init__.py. I guess the answer is just that I find the proliferation of multiple import options confusing; sometimes it's easier to have files for code organization but just a single flat import namespace for usability...

such as bridges, airports, runways, aerialways, communication towers, and power lines.
"""

model_config = ConfigDict(title="infrastructure")
A reviewer commented:

These are still messy. We don't strictly need them, but I ported them over directly from JSON Schema, and they don't carry meaning right now. The only place they manifest is as the title of the schema in the generated JSON Schema.

The PR author (@vcschapp) replied:

Yeah. I haven't checked how consistently they're applied across the different packages either, I'll try to remember that for the next one. They should be everywhere or nowhere.

physical thing covering the land, while land use is the human use to which the land is being
put.
TODO: Explain relationship to `Land` features.
A reviewer commented:

Also include representative sources, like being derived from remote sensing data as WorldCover or Global Dynamic Land Cover. Knowing that it's often derived from raster data is valuable context.

Comment on lines +96 to +97
In Overture data releases, water features are sourced from OpenStreetMap. There are two main
categories of water feature: ocean and inland water bodies.
A reviewer commented:

Future TODO: we shouldn't assume that it comes from OSM and that it might be transformed data from USGS, for example.

The PR author (@vcschapp) replied:

This is the problem with the "base" theme, which is kind of really the "straight from OSM" theme.

Fundamentally we need to promote water to a real theme with a schema that's not just OSM dependent.

connectors: Annotated[
list[ConnectorReference] | None,
Field(
default=[],
A reviewer commented:

Good; this was weird (but matched the JSON Schema).

)
# If validation passed and we have a parsed feature, serialize to JSON and compare with the
# original JSON.
json_dump = model.model_dump(exclude_unset=True, by_alias=True, mode="json")
A reviewer commented:

Looking at this here makes me wonder if we should implement Feature.model_dump, providing different defaults for exclude_unset and by_alias. That would eliminate some of the gotchas you described above.


import pytest
import yaml
from deepdiff import DeepDiff
A reviewer commented:

Without this, how's the diff output for failures?

Should we remove it from the list of dependencies too?

vcschapp added a commit that referenced this pull request Nov 17, 2025
Per Seth's suggestion on PR #415.

Co-authored-by: Seth Fitzsimmons <[email protected]>
This type was unnecessary and misleading.

The misleading part comes from the fact that, historically, the Overture
feature in JSON Schema land had an `update_time` property. We eventually
removed that property, making a reusable type called `FeatureUpdateTime`
wholly redundant, but it seems the type was reused by the `update_time`
property in the sources construct.

This led to misleadingness not only at the type name/intent level, but
also at the documentation level since the documentation inherited by the
update time field in sources was:

> Timestamp when the feature was last updated

Which assuredly made no sense for sources. As a result of this commit,
that documentation is now updated to:

> Last update time of the source data record

Which makes more sense (and hopefully is also correct).
This was losing a description and adding a Pydantic warning when running
the tests.
The biggest subtle bug was that `level_rules` in the transportation
segment were ending up with a default value of `0`, so that you could
just omit `value` altogether. This was not intended and is bizarre; it is
better not to create the rules at all. Meanwhile, `level` in all the
other features did NOT have a default of zero, which was again not
intended.

As a result of the `Stacked` model overriding the default value that
is set on `Level` in its `level` field, we were also getting a Pydantic
warning that was polluting the test run output. By fixing the underlying
issue, this commit makes that warning go away.
This should hopefully prevent recurrence of the Pydantic warning we were
seeing for a very long time, before I fixed them in the preceding
commit.
This is based on querying the `v2025_10_22_0` release and finding that
the range of levels in the rest of the dataset is -9,999 to 1,940.

And really, do you need billions of levels? That's not the intention
for Z-order stacking.
This commit also fixes one definite bug, and one weird thing that was
arguably a bug.

Definite bug: `Prominence` was modeled as `lt=100`, which would
indicate a peculiar-sized 99-element range from 1 to 99 inclusive, but
the actual data in the latest Overture release, `v2025_10_22_0`, ranges
from 1 to 100 inclusive.

Weird thing: Sort key had a default of zero, and zero is also supposed
to be the highest rendering priority. This seems weird. It would mean
that you can't use the sort key to give any feature higher rendering
priority than a feature that has no rendering priority set. This won't
work as expected, so I removed the default.

To fix another day: prominence and sort key are reversed. A high number
means high prominence, but a high number for sort key means a low
rendering priority.
As of this commit, running `make docformat` detects there are 22
remaining errors, less than one terminal full. They are all of
the following two types:

1. D100: Missing docstring in public module
2. D101: Missing docstring in public class

Moreover all the remaining errors pertain to modules and/or classes that
are slated for further refactoring. Therefore, I'm going to leave
finishing them for another day.

As part of this commit, I edited the `make docformat` rule in the
`Makefile` to suppress four new rules. The rules suppressed are
listed below along with the description of the rule and the reason for
suppressing it.

1. D102: Missing docstring in public method. This is suppressed because
   `pydocstyle` is dumb and doesn't traverse inheritance hierarchies
   or even look at the `@override` decorator, so it needlessly flags
   overridden methods for which documentation is superfluous and a
   waste of time.
2. D200: One-line docstring should fit on one line with quotes. This is
   suppressed because `pydocstyle` is dumb and isn't aware of reasonable
   line length limits or wrapping.
3. D205: 1 blank line required between summary line and description.
   This is suppressed for the same reason as D200.
4. D400: First line should end with a period. This is suppressed for the
   same reason as D200 and D205.
The issues are that `docformatter` doesn't understand NumPy-style
documentation, which is how most of the API docs are written today, and
that it edits code in the opposite way to what `pydocstyle` expects.

Both `docformatter` and `pydocstyle` are somewhat underwhelming tools
with very few configuration options and a very heavy-handed approach
to what they do, but as `pydocstyle` understands more documentation
"dialects", including NumPy, it seems reasonable to keep it and drop
the other, at least for now.
This was a subtle bug, which survived only because we weren't previously
exercising this functionality at the `OvertureFeature` level via unit test.

Basically the Pydantic decorators like `@model_validator` break in an
inheritance context if the derived class overrides a validator method
that was used in a base class. This only makes sense and isn't really
Pydantic's fault: it'll be trying to call `instance.validator_method()`
and Python is going to resolve which code to invoke. If you overrode
the base class' `validator_method()` then yes, it stands to reason
that only the derived version will be called.

As you can see from the diff, the wrap validator in `Feature` that had
been fixing up the input data for GeoJSON conversion was called
`validate_model`, and I had created a method of the same name in
`OvertureFeature` to handle the soon-to-be-deprecated `ext_*` field
exception. So, of course, `Feature`'s version got overridden, and with
it the whole GeoJSON-to-flat-model transform that enabled the parsing
to continue.

I fixed the issue by giving the method names in both classes much more
specific names and tried to make them a bit obscure with dunder-naming,
which won't obfuscate them but will at least hopefully reduce the
chances of name collisions.

Also added a test, because the lack of it is the cause of this whole
fiasco in the first place!
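The failure mode (and the dunder-naming fix) can be reproduced in plain Python, independent of Pydantic; the names here are illustrative, not the real ones:

```python
class Base:
    def validate_model(self) -> str:
        return "base transform"

class Derived(Base):
    # Silently replaces the base version: only this one is ever called.
    def validate_model(self) -> str:
        return "derived-only logic"

assert Derived().validate_model() == "derived-only logic"

class BaseFixed:
    def __rotate(self) -> str:  # mangled to _BaseFixed__rotate
        return "base transform"

    def run(self) -> str:
        return self.__rotate()

class DerivedFixed(BaseFixed):
    def __rotate(self) -> str:  # mangled to _DerivedFixed__rotate: no collision
        return "derived logic"

assert DerivedFixed().run() == "base transform"
```

Because Python mangles `__name` attributes with the defining class's name, the base class's hook keeps running even if a subclass happens to pick the same-looking name.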
Bug Fix
-------

At least one bug is fixed in this commit: the class
`EnhancedJsonSchemaGenerator` (now rechristened with the more accurate
if more unwieldy name `GenerateOmitNullableOptionalJsonSchema`) was
eliminating the `null` option on nullable fields that were required,
which subtly changed the meaning of the schema.

Where before you had:

{
  "required": ["foo"],
  "properties": {
    "foo": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  }
}

The anyOf/null option was being removed but the required remained, which
changed the meaning from "you have to put something, but it can be
a string or the value `null`" to "you have to put a string". That's not
an equivalent schema.

Re-Homing of `json_schema`
--------------------------
The function `json_schema` and its entourage, the JSON Schema generator
class `EnhancedJsonSchemaGenerator`, are promoted from core to system.
This is because they are very general-purpose "George" functionality
that isn't really tied to the Overture schema per se.

Longer term, I think there's a case to be made that we should drop
`json_schema` and `GenerateOmitNullableOptionalJsonSchema` and just rely
on the existing Pydantic functionality, which is rich, along with the
`Omitable` type that's already in system. I'd like to reopen this topic.

As part of this work, I gave `GenerateOmitNullableOptionalJsonSchema`
some very thorough documentation including doctests.

Re-Imagining of Parsing
-----------------------
I killed the `parse_feature` function as I don't see what value it
brings above and beyond the native Pydantic parsing. Now that I have
fully fixed the `Feature` and `OvertureFeature` Pydantic integrations
to correctly parse the GeoJSON in all cases, it's hard to see why we
need a function that "can parse an Overture feature from GeoJSON" when
the existing Pydantic functions can already do that.

I renamed the `parse` function in the overture-schema package to
`validate` and split it into two functions, `validate` and
`validate_json`, mirroring the fairly consistent way that Pydantic
likes to break it down. I used the new GeoJSON-compatible discriminated
union functionality from `Feature` (see earlier commit) and brought the
functions more into line with how Pydantic generally works. Basically
now they're "do the Pydantic thing, but with the union of all the
discovered models" functions. Also bugs are fixed, docs are added, and
the docs and code are hopefully closer to being in harmony...
@vcschapp vcschapp merged commit 4454603 into pydantic Nov 17, 2025
3 checks passed
@vcschapp vcschapp deleted the pp5 branch November 17, 2025 18:06