feat: Create Pydantic model to manage scraper metadata #70
Conversation
Pull request overview
Introduces a new Pydantic model to centralize scraper runtime/config metadata (paths, counts, timing), and wires it into the NOMAD scraper to standardize output paths and end-of-run statistics. Adds tests to validate default path creation and basic validation behavior.
Changes:
- Add `ScraperContext` Pydantic model to manage scraper configuration and derived paths.
- Update the NOMAD scraper to use `ScraperContext`, track exported counts, and print summary statistics.
- Add tests covering directory/path derivation, optional query file validation, and non-negative counters.
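For orientation, here is a minimal sketch of what such a context model could look like. Only `data_source_name`, `output_dir_path`, `token`, and `start_time` appear in the diffs below; the counter field names here are assumptions, not the PR's actual code.

```python
# Minimal sketch of a scraper context model, not the PR's exact code.
import datetime

from pydantic import BaseModel, DirectoryPath, Field, NonNegativeInt


class ScraperContext(BaseModel):
    data_source_name: str  # an enum (DatasetSourceName) in the actual PR
    output_dir_path: DirectoryPath  # must already exist; see review comments below
    token: str = ""
    start_time: datetime.datetime = Field(default_factory=datetime.datetime.now)
    dataset_count: NonNegativeInt = 0  # assumed counter name
    file_count: NonNegativeInt = 0  # assumed counter name
```

Using `NonNegativeInt` is one way to get the "non-negative counters" validation the tests cover without writing a custom validator.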
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/models/test_scraper.py` | Adds unit tests for `ScraperContext` path derivation and validation rules. |
| `src/mdverse_scrapers/models/scraper.py` | Adds the new `ScraperContext` Pydantic model and a validator that derives and creates output paths. |
| `src/mdverse_scrapers/scrapers/nomad.py` | Refactors the NOMAD CLI flow to use `ScraperContext`, updates parquet export calls, and centralizes statistics printing. |
| `src/mdverse_scrapers/core/toolbox.py` | Updates the parquet export helper to return row counts and adds `print_statistics(scraper, ...)`. |
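The toolbox row is terse, so here is a hedged sketch of what a row-count-returning export helper and `print_statistics` might look like. The use of pandas and every field name on `scraper` are assumptions, not the PR's actual code.

```python
import datetime
from pathlib import Path

import pandas as pd  # assumed dataframe library; the project may use another


def export_to_parquet(df: pd.DataFrame, path: Path) -> int:
    """Write the dataframe to parquet and return the number of rows written."""
    df.to_parquet(path)
    return len(df)


def print_statistics(scraper) -> None:
    """Print end-of-run statistics from a scraper context."""
    elapsed = datetime.datetime.now() - scraper.start_time
    print(f"Datasets exported: {scraper.dataset_count}")  # assumed field name
    print(f"Files exported: {scraper.file_count}")  # assumed field name
    print(f"Elapsed time: {elapsed}")
```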
Comments suppressed due to low confidence (1)
src/mdverse_scrapers/core/toolbox.py:39
The PR description says this fixes #69 (convert the `ContextManager` dataclass to a Pydantic model), but `ContextManager` is still defined in this module and is still used across other scrapers (e.g., `src/mdverse_scrapers/scrapers/zenodo.py`, `figshare.py`). Either adjust the PR description/scope, or complete the migration by updating the remaining call sites and removing `ContextManager`.
```python
import pathlib
import warnings
from dataclasses import dataclass

import loguru

from ..models.scraper import ScraperContext

warnings.filterwarnings(
    "ignore",
    message="The input looks more like a filename than markup",
    category=UserWarning,
    module="bs4",
)


@dataclass(kw_only=True)
class ContextManager:
    """ContextManager dataclass."""

    logger: "loguru.Logger" = loguru.logger
    output_path: pathlib.Path = pathlib.Path("")
    query_file_name: pathlib.Path = pathlib.Path("")
    token: str = ""
```
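For scale, the conversion the suppressed comment asks for is small. A sketch of the Pydantic equivalent of this dataclass, under two stated assumptions: the class name is hypothetical, and the logger is typed as `Any` because the `Logger` class is not importable from the top-level `loguru` package at runtime (it exists only in loguru's type stubs).

```python
import pathlib
from typing import Any

import loguru
from pydantic import BaseModel


class ScraperSettings(BaseModel):  # hypothetical name for the migrated model
    # Any rather than loguru.Logger: the Logger class is stub-only at runtime,
    # so Pydantic could not resolve the annotation.
    logger: Any = loguru.logger
    output_path: pathlib.Path = pathlib.Path("")
    query_file_name: pathlib.Path = pathlib.Path("")
    token: str = ""
```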
In `src/mdverse_scrapers/models/scraper.py`:

```python
def create_output_dir_path(self) -> Self:
    """
    Create output directory path if it does not exist.

    Returns
    -------
    Self
        The ScraperContext instance with updated paths.
    """
    # Update and create output directory path.
    self.output_dir_path = (
        Path(self.output_dir_path)
        / self.data_source_name.value
        / self.start_time.strftime("%Y-%m-%d")
    )
```
**Copilot AI** commented on Jan 25, 2026:
The docstring says "Create output directory path if it does not exist", but `output_dir_path` is typed as `DirectoryPath`, which requires the provided directory to already exist. Either update the docstring/field description to clarify that only the subdirectory (`<base>/<source>/<date>`) is created, or change the field type/validation to allow creating the base directory as well.
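One way to reconcile the two, sketched as an assumption rather than the PR's actual fix: type the field as a plain `Path` and let the validator create the whole tree, base directory included.

```python
from datetime import datetime
from pathlib import Path
from typing import Self  # Python 3.11+

from pydantic import BaseModel, Field, model_validator


class ScraperContext(BaseModel):
    data_source_name: str  # simplified; an enum in the actual model
    output_dir_path: Path  # plain Path: the base need not exist yet
    start_time: datetime = Field(default_factory=datetime.now)

    @model_validator(mode="after")
    def create_output_dir_path(self) -> Self:
        """Derive <base>/<source>/<date> and create it, base included."""
        self.output_dir_path = (
            self.output_dir_path
            / self.data_source_name
            / self.start_time.strftime("%Y-%m-%d")
        )
        self.output_dir_path.mkdir(parents=True, exist_ok=True)
        return self
```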
In `src/mdverse_scrapers/scrapers/nomad.py`:

```python
def main(output_dir_path: Path) -> None:
    """Scrape molecular dynamics datasets and files from NOMAD."""
    # Create directories and logger.
    output_dir_path = output_dir_path / DatasetSourceName.NOMAD.value
    output_dir_path.mkdir(parents=True, exist_ok=True)
    logfile_path = output_dir_path / f"{DatasetSourceName.NOMAD.value}_scraper.log"
    logger = create_logger(logpath=logfile_path, level="INFO")
    # Create scraper context.
    scraper = ScraperContext(
        data_source_name=DatasetSourceName.NOMAD,
        output_dir_path=output_dir_path,
    )
```
**Copilot AI** commented on Jan 25, 2026:
`output_dir_path` is provided by Click with `exists=False`, so users can pass a non-existent directory. But `ScraperContext.output_dir_path` is a `DirectoryPath`, which requires the directory to already exist; this will raise a `ValidationError` at runtime for common CLI usage. Create the base output directory before instantiating `ScraperContext` (e.g., `output_dir_path.mkdir(parents=True, exist_ok=True)`) or relax the model field type/validation to allow non-existent paths and create them in the validator.
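The first option is a one-line fix at the call site. A sketch of it, reusing the names from the diff above (so it assumes the module's existing imports of `ScraperContext`, `DatasetSourceName`, and `create_logger`):

```python
from pathlib import Path


def main(output_dir_path: Path) -> None:
    """Scrape molecular dynamics datasets and files from NOMAD."""
    # Create the base directory before DirectoryPath validation runs.
    output_dir_path.mkdir(parents=True, exist_ok=True)
    scraper = ScraperContext(
        data_source_name=DatasetSourceName.NOMAD,
        output_dir_path=output_dir_path,
    )
    logger = create_logger(logpath=scraper.log_file_path, level="INFO")
```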
Also in `src/mdverse_scrapers/scrapers/nomad.py`:

```python
        output_dir_path=output_dir_path,
    )
    logger = create_logger(logpath=scraper.log_file_path, level="INFO")
    print(scraper.model_dump_json(indent=4))
```
**Copilot AI** commented on Jan 25, 2026:
`print(scraper.model_dump_json(...))` writes the full context (including `token` if set) to stdout, which can leak secrets in logs/CI output and is hard to control for library consumers. Prefer logging at an appropriate level and explicitly exclude sensitive fields (e.g., `exclude={"token"}`) or remove this output entirely.
Suggested change:

```diff
- print(scraper.model_dump_json(indent=4))
+ logger.debug(scraper.model_dump_json(indent=4, exclude={"token"}))
```
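A complementary hardening option, not part of this PR: type the token as `pydantic.SecretStr`, which masks the value in every dump by default rather than relying on each call site to pass `exclude`.

```python
from pydantic import BaseModel, SecretStr


class ScraperContext(BaseModel):
    token: SecretStr = SecretStr("")


ctx = ScraperContext(token=SecretStr("super-secret"))
print(ctx.model_dump_json())  # -> {"token":"**********"}
```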
Fix #69