Skip to content

Conversation

@pierrepo
Copy link
Member

No description provided.

@pierrepo pierrepo marked this pull request as draft January 23, 2026 19:05
@pierrepo pierrepo marked this pull request as ready for review January 24, 2026 23:06
@pierrepo pierrepo requested a review from Copilot January 24, 2026 23:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR simplifies and renames several Pydantic models/enums, and updates the NOMAD scraper + tests/docs to match the new schemas.

Changes:

  • Consolidates dataset “repository/project” enums into DatasetSourceName and updates model fields accordingly.
  • Refactors simulation/file models (field renames, structured submodels, computed file_type, centralized datetime format).
  • Updates NOMAD scraper output keys, tests, and NOMAD documentation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/mdverse_scrapers/models/simulation.py Renames/extends simulation metadata fields and validators; introduces ForceFieldModel and molecule formula support.
src/mdverse_scrapers/models/dataset.py Moves to DatasetSourceName, adds DATETIME_FORMAT, changes stats constraints, replaces date_last_fetched logic.
src/mdverse_scrapers/models/file.py Simplifies file model, updates byte-size normalization, changes file_type computation behavior.
src/mdverse_scrapers/models/enums.py Replaces repo/project enums with a unified DatasetSourceName.
src/mdverse_scrapers/models/date.py Introduces shared datetime format constant.
src/mdverse_scrapers/scrapers/nomad.py Updates scraper metadata keys/types and docstrings for the renamed models/fields.
tests/models/test_simulation.py Updates tests for renamed simulation fields and structured submodels.
tests/models/test_file.py Updates tests for file size parsing and file type extraction behavior.
tests/models/test_dataset.py Updates tests for enum changes, datetime formatting, and date_last_fetched.
docs/nomad.md Expands NOMAD API documentation and payload examples.
Comments suppressed due to low confidence (1)

src/mdverse_scrapers/scrapers/nomad.py:312

  • extract_software_and_version returns list[Software] | None, but the docstring return section describes a single Software instance. Update the docs to match the list/None behavior (or change the function signature/return value).
    Returns
    -------
    Software
        A Software instance with `name` and `version` fields, possibly None.
    """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +27 to +44
```json
{
"owner": "visible",
"query": {"results.method.workflow_name:any": ["MolecularDynamics"]},
"aggregations": {},
"pagination": {
"order_by": "upload_create_time",
"order": "desc",
"page_size": 10,
"page_after_value":None,
},
"required": {
"exclude": [
"quantities",
"sections",
]
},
}
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snippet is labeled as json but it isn’t valid JSON (None should be null, and there are trailing commas). Either fix it to valid JSON or change the fence language (e.g., to js/text) to avoid copy/paste errors.

Copilot uses AI. Check for mistakes.
Comment on lines 108 to 121

@computed_field
@property
def date_last_fetched(self) -> str:
"""
Populate the date when the dataset metadata was last fetched.

Returns
-------
str
The current date and time in ISO 8601 format.
"""
return datetime.now().strftime(DATETIME_FORMAT)

Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date_last_fetched is implemented as a @computed_field returning datetime.now(), so the value can change between accesses/serializations and can’t be overridden from input. This makes the metadata nondeterministic. Prefer storing the fetch time once (e.g., a normal field with default_factory=datetime.now(...).strftime(...), or an after model validator that sets it only when missing).

Suggested change
@computed_field
@property
def date_last_fetched(self) -> str:
"""
Populate the date when the dataset metadata was last fetched.
Returns
-------
str
The current date and time in ISO 8601 format.
"""
return datetime.now().strftime(DATETIME_FORMAT)
date_last_fetched: str = Field(
default_factory=lambda: datetime.now().strftime(DATETIME_FORMAT),
description="Date when the dataset metadata was last fetched.",
)

Copilot uses AI. Check for mistakes.
)
view_number: int | None = Field(
None,
gt=0,
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

view_number uses gt=0, which rejects a valid value of 0 and can cause validation failures for datasets with no views yet. Use ge=0 instead.

Suggested change
gt=0,
ge=0,

Copilot uses AI. Check for mistakes.
# ------------------------------------------------------------------
number_of_files: int | None = Field(
None,
gt=0,
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number_of_files uses gt=0, which rejects datasets with zero files (and will also fail when a scraper reports 0). Consider ge=0 to allow 0 while still rejecting negative values.

Suggested change
gt=0,
ge=0,

Copilot uses AI. Check for mistakes.
"Name of the source data repository. "
"Allowed values in DatasetRepositoryName enum. "
"Examples: ZENODO, FIGSHARE, OSF, NOMAD, MDPOSIT..."
"Allowed values in the DatasetRepoProjectName enum. "
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Field description references DatasetRepoProjectName enum, but the type is DatasetSourceName. This is misleading for API consumers and autogenerated docs; update the wording to match the actual enum name.

Suggested change
"Allowed values in the DatasetRepoProjectName enum. "
"Allowed values in the DatasetSourceName enum. "

Copilot uses AI. Check for mistakes.
)
# Temperature
metadata["simulation_temperature"] = None # TODO?
metadata["simulation_temperatures"] = None # TODO?
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_datasets_metadata sets metadata["simulation_temperatures"], but the model field is simulation_temperatures_in_kelvin. With model_validate, this key will be ignored (or rejected if extra is forbidden), and it also doesn’t match the unit naming used elsewhere. Rename this key to the model field name (or rename the model field to match) for consistency.

Suggested change
metadata["simulation_temperatures"] = None # TODO?
metadata["simulation_temperatures_in_kelvin"] = None # TODO?

Copilot uses AI. Check for mistakes.
# ------------------------------------------------------------------
download_number: int | None = Field(
None,
gt=0,
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download_number uses gt=0, which rejects a valid value of 0. For sources where downloads can legitimately be zero (or counters start at 0), this will cause validation failures. Use ge=0 instead.

Suggested change
gt=0,
ge=0,

Copilot uses AI. Check for mistakes.

class SimulationMetadata(BaseModel):
"""Base Pydantic model for simulation-related metadata."""
"""Base Pydantic model for MD simulationmetadata.
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring typo: “simulationmetadata” is missing a space (should be “simulation metadata”).

Suggested change
"""Base Pydantic model for MD simulationmetadata.
"""Base Pydantic model for MD simulation metadata.

Copilot uses AI. Check for mistakes.
Comment on lines 45 to 62
@field_validator("file_size_in_bytes", mode="before")
def normalize_byte_string(cls, v: str | None) -> str | None: # noqa: N805
@classmethod
def normalize_byte_string(cls, value: str | None) -> str | None:
"""
Normalize the unit "Bytes" with "B" acceptable for ByteSize.
Normalize the unit "Bytes" with "B" to make it acceptable for ByteSize.

- If it's a string containing "Bytes", replace with "B".
- Let ByteSize parse strings like '24.4 kB', '3MB', '689 B'.
- Integers are accepted as bytes directly.
Documentation: https://docs.pydantic.dev/latest/api/types/#pydantic.types.ByteSize

Returns
-------
str | None
The normalized "Bytes" file size as "B", or None if input is None.
"""
if v is None:
if value is None:
return None

if isinstance(v, str) and "Bytes" in v:
v = v.replace("Bytes", "b").strip()

return v
if isinstance(value, str) and "bytes" in value.lower():
value = value.lower().replace("bytes", "B").strip()
return value
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize_byte_string is annotated as taking/returning str | None, but the tests pass an int (and Pydantic ByteSize accepts ints). Widen the type hints (e.g., str | int | None and corresponding return type) to match actual supported inputs and avoid misleading static typing.

Copilot uses AI. Check for mistakes.
Comment on lines 397 to 398
float | None
The time step in fs, or None if not found.
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_time_step is annotated to return list[float] | None and actually returns [time_step], but the docstring says it returns float | None. Please align the docstring with the actual return type (or change the function to return a single float).

Suggested change
float | None
The time step in fs, or None if not found.
list[float] | None
A list containing the time step in fs, or None if not found.

Copilot uses AI. Check for mistakes.
@pierrepo pierrepo merged commit f8425f3 into main Jan 24, 2026
1 check passed
@pierrepo pierrepo deleted the simplify-pydantic-models branch January 24, 2026 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants