Conversation

@pierrepo (Member)
Fix #69

@pierrepo pierrepo marked this pull request as draft January 25, 2026 00:24
Copilot AI (Contributor) left a comment
Pull request overview

Introduces a new Pydantic model to centralize scraper runtime/config metadata (paths, counts, timing), and wires it into the NOMAD scraper to standardize output paths and end-of-run statistics. Adds tests to validate default path creation and basic validation behavior.

Changes:

  • Add ScraperContext Pydantic model to manage scraper configuration/derived paths.
  • Update NOMAD scraper to use ScraperContext, track exported counts, and print summary statistics.
  • Add tests covering directory/path derivation, optional query file validation, and non-negative counters.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/models/test_scraper.py | Adds unit tests for ScraperContext path derivation and validation rules. |
| src/mdverse_scrapers/models/scraper.py | Adds the new ScraperContext Pydantic model and a validator that derives and creates output paths. |
| src/mdverse_scrapers/scrapers/nomad.py | Refactors the NOMAD CLI flow to use ScraperContext, updates parquet export calls, and centralizes stats printing. |
| src/mdverse_scrapers/core/toolbox.py | Updates the parquet export helper to return row counts and adds print_statistics(scraper, ...). |
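As a rough illustration of what a centralized end-of-run summary helper can look like: the `Stats` dataclass below is a hypothetical stand-in for the real `ScraperContext`, and its field names are assumptions, not the project's actual API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Stats:
    """Hypothetical stand-in for ScraperContext counters and timing."""

    start_time: datetime
    end_time: datetime
    datasets_exported: int
    files_exported: int


def format_statistics(stats: Stats) -> str:
    # Centralized summary, as a print_statistics() helper might render it.
    elapsed = stats.end_time - stats.start_time
    return (
        f"Datasets exported: {stats.datasets_exported}\n"
        f"Files exported: {stats.files_exported}\n"
        f"Elapsed time: {elapsed}"
    )
```

Returning row counts from the export helper (as the table describes) is what makes such a summary possible without re-reading the parquet files.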
Comments suppressed due to low confidence (1)

src/mdverse_scrapers/core/toolbox.py:39

  • PR description says this fixes #69 (convert ContextManager dataclass to a Pydantic model), but ContextManager is still defined in this module and is still used across other scrapers (e.g., src/mdverse_scrapers/scrapers/zenodo.py, figshare.py). Either adjust the PR description/scope, or complete the migration by updating the remaining call sites and removing ContextManager.
```python
from ..models.scraper import ScraperContext

warnings.filterwarnings(
    "ignore",
    message="The input looks more like a filename than markup",
    category=UserWarning,
    module="bs4",
)


@dataclass(kw_only=True)
class ContextManager:
    """ContextManager dataclass."""

    logger: "loguru.Logger" = loguru.logger
    output_path: pathlib.Path = pathlib.Path("")
    query_file_name: pathlib.Path = pathlib.Path("")
    token: str = ""
```

Comment on lines 62 to 76
```python
def create_output_dir_path(self) -> Self:
    """
    Create output directory path if it does not exist.

    Returns
    -------
    Self
        The ScraperContext instance with updated paths.
    """
    # Update and create output directory path.
    self.output_dir_path = (
        Path(self.output_dir_path)
        / self.data_source_name.value
        / self.start_time.strftime("%Y-%m-%d")
    )
```
Copilot AI, Jan 25, 2026
The docstring says "Create output directory path if it does not exist", but output_dir_path is typed as DirectoryPath, which requires the provided directory to already exist. Either update the docstring/field description to clarify that only the subdirectory (<base>/<source>/<date>) is created, or change the field type/validation to allow creating the base directory as well.

Comment on lines 568 to +573
```python
def main(output_dir_path: Path) -> None:
    """Scrape molecular dynamics datasets and files from NOMAD."""
    # Create directories and logger.
    output_dir_path = output_dir_path / DatasetSourceName.NOMAD.value
    output_dir_path.mkdir(parents=True, exist_ok=True)
    logfile_path = output_dir_path / f"{DatasetSourceName.NOMAD.value}_scraper.log"
    logger = create_logger(logpath=logfile_path, level="INFO")
    # Create scraper context.
    scraper = ScraperContext(
        data_source_name=DatasetSourceName.NOMAD,
        output_dir_path=output_dir_path,
    )
```
Copilot AI, Jan 25, 2026

output_dir_path is provided by Click with exists=False, so users can pass a non-existent directory. But ScraperContext.output_dir_path is a DirectoryPath, which requires the directory to already exist; this will raise a ValidationError at runtime for common CLI usage. Create the base output directory before instantiating ScraperContext (e.g., output_dir_path.mkdir(parents=True, exist_ok=True)) or relax the model field type/validation to allow non-existent paths and create them in the validator.

```python
        output_dir_path=output_dir_path,
    )
    logger = create_logger(logpath=scraper.log_file_path, level="INFO")
    print(scraper.model_dump_json(indent=4))
```
Copilot AI, Jan 25, 2026
print(scraper.model_dump_json(...)) writes the full context (including token if set) to stdout, which can leak secrets in logs/CI output and is hard to control for library consumers. Prefer logging at an appropriate level and explicitly exclude sensitive fields (e.g., exclude={"token"}) or remove this output entirely.

Suggested change:

```diff
- print(scraper.model_dump_json(indent=4))
+ logger.debug(scraper.model_dump_json(indent=4, exclude={"token"}))
```
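Both mitigations can also be combined. The sketch below (with a hypothetical `Ctx` model) shows `exclude={"token"}` dropping the field from the dump entirely, and Pydantic's `SecretStr` masking the value even when the field is serialized:

```python
import json

from pydantic import BaseModel, SecretStr


class Ctx(BaseModel):
    """Hypothetical context model holding an API token."""

    data_source_name: str
    token: SecretStr = SecretStr("")


ctx = Ctx(data_source_name="nomad", token=SecretStr("super-secret"))

# exclude={"token"} removes the field from the serialized output entirely.
safe = ctx.model_dump_json(exclude={"token"})

# Even without exclude, SecretStr serializes as a masked placeholder.
full = json.loads(ctx.model_dump_json())
```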

@pierrepo pierrepo marked this pull request as ready for review January 25, 2026 17:28
@pierrepo pierrepo merged commit f4271d6 into main Jan 25, 2026
1 check passed
@pierrepo pierrepo deleted the add-scraper-context branch January 25, 2026 17:28


Development

Successfully merging this pull request may close these issues.

Create Pydantic model to handle scraper basic configuration

2 participants