Pydantic package organization episode 5: "Only mostly dead" #415
Conversation
Note: Still waiting on some information from Jennings and Jonah to improve the documentation for `land`, `land_cover`, and `water`.
vcschapp left a comment
Added callouts of some of the more interesting parts in comments, as is the tradition.
> Global land derived from the inverse of OSM Coastlines. Translates `natural` tags from OpenStreetMap.
>
> TODO: Update this description when the relationship to `land_cover` is better understood.
Coming back to finish the descriptions for `Land` and `LandCover` is on the TODO list for episode #6 that's listed in the PR description...
packages/overture-schema-buildings-theme/src/overture/schema/buildings/building.py
```python
from overture.schema.system.model_constraint import FieldEqCondition


class PlaceType(str, Enum):
```
NB: To be renamed in the divisions package re-org to remove the "Place" prefix and prevent any confusion with the places theme. The historical tie to WOF will still be documented, obviously.
```python
>>> try:
...     MyModel(**{"osm_id": "foo"})
... except ValidationError as e:
...     assert "Invalid OSM ID format: foo. Must be n123, w123, or r123." in str(e)
```
I'm trying to standardize all exception messages to begin with lowercase.
I'm basing this convention on multiple (hopefully unbiased) LLM prompts to different models that basically converge on this as the exception message convention. (The summary below is from ChatGPT, which claims to have examined the Python standard library in generalizing these.)
| Element | Convention | Example |
|---|---|---|
| Capitalization | lowercase unless proper noun/type | `"invalid configuration option 'mode'"` |
| Ending punctuation | no period | `"missing required field 'name'"` |
| Formatting | short, plain English phrase | `"expected int, got str"` |
| Including data | quote or `repr()` values | `"invalid value: {!r}".format(val)` |
```
Minimum X-coordinate of the bounding box.

Returns
-------
float | int
    Minimum X-coordinate
```
Correct NumPy-style documentation that `pydocstyle` approves of.

Big favor to ask! Please run `make docformat` to make sure your changes don't regress the `pydocstyle` warnings, and ideally to kill all the warnings in files you're touching!
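For reference, a minimal sketch of the shape `pydocstyle`'s NumPy convention accepts for the snippet above (the class and attribute names are illustrative, not the real code):

```python
class BoundingBox:
    """Axis-aligned bounding box (illustrative)."""

    def __init__(self, xmin: float | int) -> None:
        self._xmin = xmin

    @property
    def xmin(self) -> float | int:
        """Minimum X-coordinate of the bounding box.

        Returns
        -------
        float | int
            Minimum X-coordinate.
        """
        return self._xmin
```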
```python
        for f in feature_classes
    ),
),
Field(discriminator=Feature.field_discriminator("type", *feature_classes)),
```
👉 The other use of the GeoJSON-aware dynamic discriminator.
I was hoping this would also eliminate the Overture-specific `type` field reference (by using the explicit `Tags` and pushing `Feature.field_discriminator` into the base class implementation, currying the field name), but I'm not sure how we would do that. The primary motivation for doing that would be to support union creation in system.
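A rough sketch of what I mean, purely hypothetical (none of this exists yet), with the field name curried into the base class:

```python
from typing import Any

from pydantic import Discriminator


class Feature:
    # Hypothetical: the base class owns the discriminator field name, so
    # callers would no longer pass "type" explicitly at each use site.
    discriminator_field: str = "type"

    @classmethod
    def field_discriminator(cls, *feature_classes: type) -> Discriminator:
        field = cls.discriminator_field  # curried into the closure below

        def tag(value: Any) -> Any:
            # feature_classes elided in this sketch; the real implementation
            # presumably also checks the tag against the known classes.
            if isinstance(value, dict):
                return value.get(field)
            return getattr(value, field, None)

        return Discriminator(tag)
```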
That would be powerful - we should noodle on how to achieve it. It would be ideal if all that logic could be lifted up into system...
```diff
@@ -1,76 +1,171 @@
 __path__ = __import__("pkgutil").extend_path(__path__, __name__)
```
Calling this file out because:

- It is a significant re-imagining of the "parsing features" concept.
- There's some overlap with the peripheries of #406 ([Pydantic] overture-schema CLI for type listing, JSON Schema generation, and validation).
The path to get here wasn't entirely linear, but if we pretend it was then I asked myself:
- What do `parse` and `parse_feature` do?
- What value do they add above the parsing features Pydantic already has?
My answers were basically that if our models integrate properly with Pydantic then:
- `parse_feature` does not add any real value because the same results can be obtained from calling `BaseModel.model_validate` and `BaseModel.model_validate_json` on the feature class.
- The value added by `parse` is that it can see the union of all discovered Overture models, so it can validate against the entire union.†

My actions then were to delete `parse_feature` and to re-imagine the interface of `parse` to:

- Be as consistent as possible with Pydantic conventions. This is where the rename from `parse` to `validate*` comes from, as well as the split between `validate` and `validate_json` and the return value being clearly expressed as `BaseModel`.
- Leverage existing Pydantic features, such as the ability to validate a model from raw JSON, and the ability to validate a Python data structure, either a `dict` or a model class, against a model.
Some of the effects of this re-imagining are:
- The entire "GeoJSON rotation" of a feature is now completely encapsulated within
Featurein the system package, and, I hope, it will not leak out of it. - All GeoJSON validation is just JSON validation of a feature using
BaseModel.model_validate_json. - There is no longer any concept of "a GeoJSON
dict" independent of aFeaturebut if someone wants that they just have to dump the feature withBaseModel.model_dump_jsonand parse the JSON into adict. - There may be some "technical" performance regressions in places because while Pydantic has all the facilities it needs, there edge cases where you need to go through serde with the
jsonpackage to achieve your goal, as is always true with Pydantic. I think this is fine. I doubt JSON optimization is really needed because at-scale use cases shouldn't use JSON anyway, but if some optimized path is desired by someone we should evaluate the benefits against the cons of complexity and mission creep.
†: I'm still not sold on validating the union being a feature that is particularly needed by customers or particularly scalable, as some of my comments on #406 show, but that's a discussion for another day.
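For the avoidance of doubt, the resulting shape is roughly this, a sketch rather than the actual code (the `_union_type_adapter` helper that builds a `TypeAdapter` over the discovered union is assumed here):

```python
from typing import Any

from pydantic import BaseModel, TypeAdapter


def _union_type_adapter() -> TypeAdapter[Any]:
    """Assumed helper: a TypeAdapter over the discriminated union of all
    discovered Overture feature models."""
    raise NotImplementedError  # sketch only


def validate(data: Any) -> BaseModel:
    # Mirrors BaseModel.model_validate, but against the entire union.
    return _union_type_adapter().validate_python(data)


def validate_json(data: str | bytes) -> BaseModel:
    # Mirrors BaseModel.model_validate_json, but against the entire union.
    return _union_type_adapter().validate_json(data)
```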
I agree with this. I think it's safe to remove. The primary consumer is currently the validation tests, IIRC, and it can be moved there.

Since you fixed the "GeoJSON rotation," I also think it's valid to say that consumers (like the CLI package) should create their own `TypeAdapter`s (or use a public version of `_union_type_adapter`), since there's no magic needed any longer.
> When using this class to generate a JSON Schema, you must dump your model JSON carefully to ensure that it matches the JSON Schema.
The need for this warning is why I think `Omitable` is needed.

It's weird for us to say "Overture uses this omit convention, but if you use our system you have to think carefully about matching your JSON Schema and JSON dumps of your model"...
Sure, but that still feels like a later problem if we find that lots of people are using this with JSON data + JSON Schema (vs. other approaches that are more in line with how our data pipeline is going to use it).
```python
    2. If the field's default value is `None` (JSON `null`), the default value is removed.
    """

    @override
```
This is a slight refactoring of how things were achieved in the previous `EnhancedJsonSchemaGenerator`, but there's also a subtle bug fix: previously, it would eliminate the `null` option from required nullable fields.
So if you had something along the lines of this:

```python
class FooModel(BaseModel):
    foo: str | Null  # This is a required nullable field; you can't leave it unset.
```

Then Pydantic would give you a JSON Schema like this:

```json
{
  "required": ["foo"],
  "properties": {
    "foo": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  }
}
```

And `EnhancedJsonSchemaGenerator` would remove the `anyOf`/null option, which changed the meaning of the JSON Schema from "you have to put something, but it can be a string or the value `null`" to "you have to put a string".
That's apart from the `Omitable` option, where the schema itself encodes the semantics so that our users don't have to think about these things.
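To make "required nullable" concrete with plain Pydantic (using `str | None` here purely for illustration):

```python
from pydantic import BaseModel


class RequiredNullable(BaseModel):
    foo: str | None  # no default: "foo" must be present, but may be null


class OptionalNullable(BaseModel):
    bar: str | None = None  # has a default: "bar" may be omitted entirely


# The first schema lists "foo" under "required" with an anyOf of string/null;
# the second has no "required" entry for "bar".
print(RequiredNullable.model_json_schema())
print(OptionalNullable.model_json_schema())
```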
I'm not aware of any uses of `| Null`, nor do I expect Overture to need it.

Strictly speaking, you're right, but I lean toward YAGNI until we learn that we do need it (and documenting the limitations is a great first step).
| """ | ||
|
|
||
| source_tags: SourceTags | None = None | ||
| wikidata: WikidataId | None = None |
Should this be split out? It may be present in the OSM tags already, so `wikidata` is either redundant or applies to more than data sourced from OSM.
Almost definitely yes. It just ended up below my subjective refactoring line.
Why `_common` (private) vs. `common`, and not exported at the top level?
They're all exported from `__init__.py`. I guess the answer is just that I find the proliferation of multiple import options confusing, and sometimes it's easier to have files for code organization but just a single flat import namespace for usability...
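i.e., something like this pattern (module and symbol names illustrative), where private modules organize the code and `__init__.py` presents the single flat namespace:

```python
# in _common.py: the implementation lives in a private module
# class SourceTags: ...
# class WikidataId: ...

# in __init__.py: everything is re-exported at the package root, so users
# only ever see one import location
from ._common import SourceTags, WikidataId

__all__ = ["SourceTags", "WikidataId"]
```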
```python
    such as bridges, airports, runways, aerialways, communication towers, and power lines.
    """

    model_config = ConfigDict(title="infrastructure")
```
These are still messy. We don't strictly need them, but I ported them over directly from JSON Schema, and they don't carry meaning right now. The only place they manifest is as the title of the schema in the generated JSON Schema.
Yeah. I haven't checked how consistently they're applied across the different packages either; I'll try to remember that for the next one. They should be everywhere or nowhere.
> physical thing covering the land, while land use is the human use to which the land is being put.
>
> TODO: Explain relationship to `Land` features.
Also include representative sources, like being derived from remote sensing data such as WorldCover or Global Dynamic Land Cover. Knowing that it's often derived from raster data is valuable context.
> In Overture data releases, water features are sourced from OpenStreetMap. There are two main categories of water feature: ocean and inland water bodies.
Future TODO: we shouldn't assume that it comes from OSM; it might be transformed data from USGS, for example.
This is the problem with the "base" theme, which is kind of really the "straight from OSM" theme.
Fundamentally we need to promote water to a real theme with a schema that's not just OSM dependent.
```python
connectors: Annotated[
    list[ConnectorReference] | None,
    Field(
        default=[],
```
Good; this was weird (but matched the JSON Schema).
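For context, the oddity in question, sketched with a placeholder element type (the real field uses `ConnectorReference`): a nullable list field whose default is a concrete empty list, versus the more conventional spellings:

```python
from typing import Annotated

from pydantic import BaseModel, Field


class Before(BaseModel):
    # Odd: the type admits None, yet an omitted field silently becomes [].
    connectors: Annotated[list[str] | None, Field(default=[])]


class After(BaseModel):
    # More conventional: omitted means None. (Alternatively, drop the None
    # option and use default_factory=list if [] is the real "unset" value.)
    connectors: list[str] | None = None
```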
```python
)
# If validation passed and we have a parsed feature, serialize to JSON and compare with the
# original JSON.
json_dump = model.model_dump(exclude_unset=True, by_alias=True, mode="json")
```
Looking at this here makes me wonder if we should implement `Feature.model_dump`, providing different defaults for `exclude_unset` and `by_alias`. That would eliminate some of the gotchas you described above.
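Something along these lines, a sketch only, assuming we want defaults that favor faithful round-tripping:

```python
from typing import Any

from pydantic import BaseModel


class Feature(BaseModel):
    def model_dump(
        self,
        *,
        exclude_unset: bool = True,  # flipped from Pydantic's default of False
        by_alias: bool = True,       # ditto
        **kwargs: Any,
    ) -> dict[str, Any]:
        # Delegate to Pydantic; only the defaults change. model_dump_json
        # would presumably get the same treatment.
        return super().model_dump(
            exclude_unset=exclude_unset, by_alias=by_alias, **kwargs
        )
```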
```python
import pytest
import yaml
from deepdiff import DeepDiff
```
Without this, how's the diff output for failures?
Should we remove it from the list of dependencies too?
Per Seth's suggestion on PR #415. Co-authored-by: Seth Fitzsimmons <[email protected]>
This type was unnecessary and misleading. The misleading part comes from the fact that, historically, the Overture feature in JSON Schema land had an `update_time` property. We eventually removed that property, making a reusable type called `FeatureUpdateTime` wholly redundant, but it seems the type was reused by the `update_time` property in the sources construct. This led to misleading-ness not only at the type name/intent level, but also at the documentation level, since the documentation inherited by the update time field in sources was:

> Timestamp when the feature was last updated

which assuredly made no sense for sources. As a result of this commit, that documentation is now updated to:

> Last update time of the source data record

which makes more sense (and hopefully is also correct).
This was losing a description and adding a Pydantic warning when running the tests.
The biggest subtle bug was that `level_rules` in the transportation segment were ending up with a default value of `0`, so that you could just omit `value` altogether. This was not intended; it's bizarre, and it's better to not create the rules at all. Meanwhile, `level` in all the other features did NOT have a default of zero, which was again not intended. As a result of the `Stacked` model overriding the default value that is set on `Level` in its `level` field, we were also getting a Pydantic warning that was polluting the test run output. By fixing the underlying issue, this commit makes that warning go away.
This should hopefully prevent recurrence of the Pydantic warnings we were seeing for a very long time, before I fixed them in the preceding commit.
This is based on querying the `v2025_10_22_0` release and finding that the range of levels in the rest of the dataset is -9,999 to 1,940. And really, do you need billions of levels? That's not the intention for Z-order stacking.
This commit also fixes one definite bug, and one weird thing that was arguably a bug.

Definite bug: `Prominence` was modeled as `lt=100`, which would indicate a peculiar-sized 99-element range from 1 to 99 inclusive, but the actual data in the latest Overture release, `v2025_10_22_0`, ranges from 1 to 100 inclusive.

Weird thing: sort key had a default of zero, and zero is also supposed to be the highest rendering priority. This seems weird. It would mean that you can't use the sort key to give any feature higher rendering priority than a feature that has no rendering priority set. This won't work as expected, so I removed the default.

To fix another day: prominence and sort key are reversed. A high number means high prominence, but a high number for sort key means a low rendering priority.
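The constraints themselves are just Pydantic bounded integers; illustratively (the `Level` bounds below are placeholders, not the ones actually chosen; the `Prominence` bounds follow the fix described above):

```python
from typing import Annotated

from pydantic import Field

# Placeholder bounds for illustration; the commit derives the real ones from
# the v2025_10_22_0 release data (observed range: -9,999 to 1,940).
Level = Annotated[int, Field(ge=-10_000, le=10_000)]

# Per the fix above: the data ranges 1..100 inclusive, so le=100, not lt=100.
Prominence = Annotated[int, Field(ge=1, le=100)]
```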
As of this commit, running `make docformat` detects that there are 22 remaining errors, less than one terminal full. They are all of the following two types:

1. D100: Missing docstring in public module
2. D101: Missing docstring in public class

Moreover, all the remaining errors pertain to modules and/or classes that are slated for further refactoring. Therefore, I'm going to leave finishing them for another day.

As part of this commit, I edited the `make docformat` rule in the `Makefile` to suppress four new rules. The rules suppressed are listed below along with the description of the rule and the reason for suppressing it.

1. D102: Missing docstring in public method. This is suppressed because `pydocstyle` is dumb and doesn't traverse inheritance hierarchies or even look at the `@override` decorator, so it needlessly flags overridden methods for which documentation is superfluous and a waste of time.
2. D200: One-line docstring should fit on one line with quotes. This is suppressed because `pydocstyle` is dumb and isn't aware of reasonable line length limits or wrapping.
3. D205: 1 blank line required between summary line and description. This is suppressed for the same reason as D200.
4. D400: First line should end with a period. This is suppressed for the same reason as D200 and D205.
The issues are that `docformatter` doesn't understand NumPy-style documentation, which is how most of the API docs are written today, and that it edits code in the opposite way to what `pydocstyle` expects. Both `docformatter` and `pydocstyle` are somewhat underwhelming tools with very few configuration options and a very heavy-handed approach to what they do, but as `pydocstyle` understands more documentation "dialects", including NumPy, it seems reasonable to keep it and drop the other, at least for now.
This was a subtle bug, but only because we weren't previously exercising this functionality at the `OvertureFeature` level via unit test. Basically, the Pydantic decorators like `@model_validator` break in an inheritance context if the derived class overrides a validator method that was used in a base class. This makes sense and isn't really Pydantic's fault: it'll be trying to call `instance.validator_method()`, and Python is going to resolve which code to invoke. If you overrode the base class' `validator_method()`, then yes, it stands to reason that only the derived version will be called.

As you can see from the diff, the wrap validator in `Feature` that had been fixing up the input data for GeoJSON conversion was called `validate_model`, and I had created a method of the same name in `OvertureFeature` to handle the soon-to-be-deprecated `ext_*` field exception. So, of course, `Feature`'s version got overridden, and with it the whole GeoJSON-to-flat-model transform that enabled the parsing to continue.

I fixed the issue by giving the methods in both classes much more specific names and tried to make them a bit obscure with dunder naming, which won't obfuscate them but will at least hopefully reduce the chances of name collisions. Also added a test, because the lack of one is the cause of this whole fiasco in the first place!
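A distilled reproduction of the pitfall, independent of anything Overture-specific:

```python
from typing import Any

from pydantic import BaseModel, model_validator


class Base(BaseModel):
    @model_validator(mode="before")
    @classmethod
    def validate_model(cls, data: Any) -> Any:
        print("base transform runs")  # e.g. a GeoJSON-to-flat-model fix-up
        return data


class Derived(Base):
    # Same method name: Python resolves the attribute to this version, so
    # the base class's transform silently never runs for Derived.
    @model_validator(mode="before")
    @classmethod
    def validate_model(cls, data: Any) -> Any:
        print("derived transform runs")
        return data


Derived()  # prints only "derived transform runs"
```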
Bug Fix
-------
At least one bug is fixed in this commit: the class
`EnhancedJsonSchemaGenerator` (now rechristened with the more accurate
if more unwieldy name `GenerateOmitNullableOptionalJsonSchema`) was
eliminating the `null` option on nullable fields that were required,
which subtly changed the meaning of the schema.
Where before you had:
```json
{
  "required": ["foo"],
  "properties": {
    "foo": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  }
}
```
The anyOf/null option was being removed but the required remained, which changed the meaning from "you have to put something, but it can be a string or the value `null`" to "you have to put a string". That's not an equivalent schema.
Re-Homing of `json_schema`
--------------------------
The function `json_schema` and its entourage, the JSON Schema generator
class `EnhancedJsonSchemaGenerator`, are promoted from core to system.
This is because they are very general-purpose "George" functionality
that isn't really tied to the Overture schema per se.
Longer term, I think there's a case to be made that we should drop
`json_schema` and `GenerateOmitNullableOptionalJsonSchema` and just rely
on the existing Pydantic functionality, which is rich, along with the
`Omitable` type that's already in system. I'd like to reopen this topic.
As part of this work, I gave `GenerateOmitNullableOptionalJsonSchema`
some very thorough documentation including doctests.
Re-Imagining of Parsing
-----------------------
I killed the `parse_feature` function as I don't see what value it
brings above and beyond the native Pydantic parsing. Now that I have
fully fixed the `Feature` and `OvertureFeature` Pydantic integrations
to correctly parse the GeoJSON in all cases, it's hard to see why we
need a function that "can parse an Overture feature from GeoJSON" when
the existing Pydantic functions can already do that.
I renamed the `parse` function in the overture-schema package to
`validate` and split it into two functions, `validate` and
`validate_json`, mirroring the fairly consistent way that Pydantic
likes to break it down. I used the new GeoJSON-compatible discriminated
union functionality from `Feature` (see earlier commit) and brought the
functions more into line with how Pydantic generally works. Basically
now they're "do the Pydantic thing, but with the union of all the
discovered models" functions. Also bugs are fixed, docs are added, and
the docs and code are hopefully closer to being in harmony...
tl;dr

This is part #5. Part #6 will happen ... sometime, and it is hoped it will be the last, and the best, of the package re-orgs. Our last best hope, if you will.

Done in this part:
- The `json_schema` module is promoted to system.
- Bug fixes in `json_schema` generation.
- GeoJSON handling is encapsulated in `Feature`.
- Everything passes `make check`.

Merge guidance: Rebase
Unlike parts 1-3, this iteration has an adequate commit history. It should ideally be rebase-merged onto the `pydantic` branch rather than squashed.

Journey
Destination

- Package dependencies and `pyproject.toml` will be standardized. (Seth already did this in an earlier PR.)
- `pydocstyle` with the NumPy convention will be enforced by `make check`. (As of now, the tool reports too many complaints to add it into `make check`.)
- `docformatter` will be run by `make check`. (Incompatible with `pydocstyle` and NumPy-style documentation.)
- `__init__.py` docstring.
- `doc` target in the `Makefile` that does `uv run pdoc` to a useful place.
- perspectives.

TODO list for episode 6
- Improve documentation for `land` and `land_cover` in base based on discussion in #working-group-schema.
- Fix the remaining `make docformat` errors and add `pydocstyle` into `make check`.
- Add `make api-docs-live` and `make api-docs-html` targets to the `Makefile`.
- `__init__.py` for main `Feature` use cases.
- `FieldConstraint` so the validate method can be type-hint valid as much as possible, along the lines sketched below.