feat: Create Pydantic model to manage scraper metadata #70
Conversation
Pull request overview
Introduces a new Pydantic model to centralize scraper runtime/config metadata (paths, counts, timing), and wires it into the NOMAD scraper to standardize output paths and end-of-run statistics. Adds tests to validate default path creation and basic validation behavior.
Changes:
- Add `ScraperContext` Pydantic model to manage scraper configuration and derived paths.
- Update the NOMAD scraper to use `ScraperContext`, track exported counts, and print summary statistics.
- Add tests covering directory/path derivation, optional query file validation, and non-negative counters.
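For orientation, here is a minimal sketch of what such a context model could look like. Only `data_source_name`, `output_dir_path`, `token`, and `start_time` appear in the diffs below; the counter field names here are assumptions, not the PR's actual code.

```python
# Minimal sketch of a scraper context model, not the PR's exact code.
import datetime

from pydantic import BaseModel, DirectoryPath, Field, NonNegativeInt


class ScraperContext(BaseModel):
    data_source_name: str  # an enum (DatasetSourceName) in the actual PR
    output_dir_path: DirectoryPath  # must already exist; see review comments below
    token: str = ""
    start_time: datetime.datetime = Field(default_factory=datetime.datetime.now)
    dataset_count: NonNegativeInt = 0  # assumed counter name
    file_count: NonNegativeInt = 0  # assumed counter name
```

Using `NonNegativeInt` is one way to get the "non-negative counters" validation the tests cover without writing a custom validator.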
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/models/test_scraper.py` | Adds unit tests for `ScraperContext` path derivation and validation rules. |
| `src/mdverse_scrapers/models/scraper.py` | Adds the new `ScraperContext` Pydantic model and a validator that derives and creates output paths. |
| `src/mdverse_scrapers/scrapers/nomad.py` | Refactors the NOMAD CLI flow to use `ScraperContext`, updates parquet export calls, and centralizes statistics printing. |
| `src/mdverse_scrapers/core/toolbox.py` | Updates the parquet export helper to return row counts and adds `print_statistics(scraper, ...)`. |
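The toolbox row is terse, so here is a hedged sketch of what a row-count-returning export helper and `print_statistics` might look like. The use of pandas and every field name on `scraper` are assumptions, not the PR's actual code.

```python
import datetime
from pathlib import Path

import pandas as pd  # assumed dataframe library; the project may use another


def export_to_parquet(df: pd.DataFrame, path: Path) -> int:
    """Write the dataframe to parquet and return the number of rows written."""
    df.to_parquet(path)
    return len(df)


def print_statistics(scraper) -> None:
    """Print end-of-run statistics from a scraper context."""
    elapsed = datetime.datetime.now() - scraper.start_time
    print(f"Datasets exported: {scraper.dataset_count}")  # assumed field name
    print(f"Files exported: {scraper.file_count}")  # assumed field name
    print(f"Elapsed time: {elapsed}")
```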
Comments suppressed due to low confidence (1)
src/mdverse_scrapers/core/toolbox.py:39
The PR description says this fixes #69 (convert the `ContextManager` dataclass to a Pydantic model), but `ContextManager` is still defined in this module and is still used across other scrapers (e.g., `src/mdverse_scrapers/scrapers/zenodo.py`, `figshare.py`). Either adjust the PR description/scope, or complete the migration by updating the remaining call sites and removing `ContextManager`.
```python
import pathlib
import warnings
from dataclasses import dataclass

import loguru

from ..models.scraper import ScraperContext

warnings.filterwarnings(
    "ignore",
    message="The input looks more like a filename than markup",
    category=UserWarning,
    module="bs4",
)


@dataclass(kw_only=True)
class ContextManager:
    """ContextManager dataclass."""

    logger: "loguru.Logger" = loguru.logger
    output_path: pathlib.Path = pathlib.Path("")
    query_file_name: pathlib.Path = pathlib.Path("")
    token: str = ""
```
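For scale, the conversion the suppressed comment asks for is small. A sketch of the Pydantic equivalent of this dataclass, under two stated assumptions: the class name is hypothetical, and the logger is typed as `Any` because the `Logger` class is not importable from the top-level `loguru` package at runtime (it exists only in loguru's type stubs).

```python
import pathlib
from typing import Any

import loguru
from pydantic import BaseModel


class ScraperSettings(BaseModel):  # hypothetical name for the migrated model
    # Any rather than loguru.Logger: the Logger class is stub-only at runtime,
    # so Pydantic could not resolve the annotation.
    logger: Any = loguru.logger
    output_path: pathlib.Path = pathlib.Path("")
    query_file_name: pathlib.Path = pathlib.Path("")
    token: str = ""
```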
In `src/mdverse_scrapers/models/scraper.py`:

```python
def create_output_dir_path(self) -> Self:
    """
    Create output directory path if it does not exist.

    Returns
    -------
    Self
        The ScraperContext instance with updated paths.
    """
    # Update and create output directory path.
    self.output_dir_path = (
        Path(self.output_dir_path)
        / self.data_source_name.value
        / self.start_time.strftime("%Y-%m-%d")
    )
```
**Copilot AI** commented on Jan 25, 2026:
The docstring says "Create output directory path if it does not exist", but `output_dir_path` is typed as `DirectoryPath`, which requires the provided directory to already exist. Either update the docstring/field description to clarify that only the subdirectory (`<base>/<source>/<date>`) is created, or change the field type/validation to allow creating the base directory as well.
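One way to reconcile the two, sketched as an assumption rather than the PR's actual fix: type the field as a plain `Path` and let the validator create the whole tree, base directory included.

```python
from datetime import datetime
from pathlib import Path
from typing import Self  # Python 3.11+

from pydantic import BaseModel, Field, model_validator


class ScraperContext(BaseModel):
    data_source_name: str  # simplified; an enum in the actual model
    output_dir_path: Path  # plain Path: the base need not exist yet
    start_time: datetime = Field(default_factory=datetime.now)

    @model_validator(mode="after")
    def create_output_dir_path(self) -> Self:
        """Derive <base>/<source>/<date> and create it, base included."""
        self.output_dir_path = (
            self.output_dir_path
            / self.data_source_name
            / self.start_time.strftime("%Y-%m-%d")
        )
        self.output_dir_path.mkdir(parents=True, exist_ok=True)
        return self
```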
In `src/mdverse_scrapers/scrapers/nomad.py`:

```python
def main(output_dir_path: Path) -> None:
    """Scrape molecular dynamics datasets and files from NOMAD."""
    # Create directories and logger.
    output_dir_path = output_dir_path / DatasetSourceName.NOMAD.value
    output_dir_path.mkdir(parents=True, exist_ok=True)
    logfile_path = output_dir_path / f"{DatasetSourceName.NOMAD.value}_scraper.log"
    logger = create_logger(logpath=logfile_path, level="INFO")
    # Create scraper context.
    scraper = ScraperContext(
        data_source_name=DatasetSourceName.NOMAD,
        output_dir_path=output_dir_path,
    )
```
**Copilot AI** commented on Jan 25, 2026:
`output_dir_path` is provided by Click with `exists=False`, so users can pass a non-existent directory. But `ScraperContext.output_dir_path` is a `DirectoryPath`, which requires the directory to already exist; this will raise a `ValidationError` at runtime for common CLI usage. Create the base output directory before instantiating `ScraperContext` (e.g., `output_dir_path.mkdir(parents=True, exist_ok=True)`) or relax the model field type/validation to allow non-existent paths and create them in the validator.
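The first option is a one-line fix at the call site. A sketch of it, reusing the names from the diff above (so it assumes the module's existing imports of `ScraperContext`, `DatasetSourceName`, and `create_logger`):

```python
from pathlib import Path


def main(output_dir_path: Path) -> None:
    """Scrape molecular dynamics datasets and files from NOMAD."""
    # Create the base directory before DirectoryPath validation runs.
    output_dir_path.mkdir(parents=True, exist_ok=True)
    scraper = ScraperContext(
        data_source_name=DatasetSourceName.NOMAD,
        output_dir_path=output_dir_path,
    )
    logger = create_logger(logpath=scraper.log_file_path, level="INFO")
```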
Also in `src/mdverse_scrapers/scrapers/nomad.py`:

```python
        output_dir_path=output_dir_path,
    )
    logger = create_logger(logpath=scraper.log_file_path, level="INFO")
    print(scraper.model_dump_json(indent=4))
```
**Copilot AI** commented on Jan 25, 2026:
`print(scraper.model_dump_json(...))` writes the full context (including `token` if set) to stdout, which can leak secrets in logs/CI output and is hard to control for library consumers. Prefer logging at an appropriate level and explicitly exclude sensitive fields (e.g., `exclude={"token"}`) or remove this output entirely.
Suggested change:

```diff
- print(scraper.model_dump_json(indent=4))
+ logger.debug(scraper.model_dump_json(indent=4, exclude={"token"}))
```
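A complementary hardening option, not part of this PR: type the token as `pydantic.SecretStr`, which masks the value in every dump by default rather than relying on each call site to pass `exclude`.

```python
from pydantic import BaseModel, SecretStr


class ScraperContext(BaseModel):
    token: SecretStr = SecretStr("")


ctx = ScraperContext(token=SecretStr("super-secret"))
print(ctx.model_dump_json())  # -> {"token":"**********"}
```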
Fix #69