-
Notifications
You must be signed in to change notification settings - Fork 4
BREAKING CHANGES: Update ATLAS (and NOMAD) scrapers #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a new ATLAS scraper for the MDverse scrapers project and refactors the NOMAD scraper to use a generalized connection testing function. The PR also updates the DOI validation pattern to support DOIs containing forward slashes (required for the ATLAS DOI).
Changes:
- Added new ATLAS scraper with HTML parsing to extract dataset and file metadata from the ATLAS database
- Refactored connection testing by moving
is_nomad_connection_workingto a genericis_connection_to_server_workingfunction - Updated DOI regex pattern to allow forward slashes in addition to existing allowed characters
- Enhanced logger configuration in both ATLAS and NOMAD scrapers to respect debug mode
- Removed old standalone ATLAS scraping script and updated documentation
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/atlas.py | New scraper implementation for ATLAS database with functions to extract PDB chains, scrape metadata, and parse file information from HTML |
| src/mdverse_scrapers/scrapers/nomad.py | Refactored to use generic connection testing function and improved logger initialization based on debug mode |
| src/mdverse_scrapers/core/network.py | Added generic is_connection_to_server_working function to replace NOMAD-specific implementation |
| src/mdverse_scrapers/models/dataset.py | Updated DOI pattern to include forward slash character, allowing DOIs like 10.1093/nar/gkad1084 |
| scripts/scrap_atlas.py | Removed old standalone ATLAS scraper script in favor of integrated scraper module |
| pyproject.toml | Added project metadata (license, authors, classifiers) and registered new scrape-atlas command |
| docs/atlas.md | Updated documentation with clearer structure and comprehensive API endpoint descriptions |
| docs/zenodo.md | Reorganized documentation for better clarity by moving base URL to API section |
| .gitignore | Removed test-related directory entries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.info("Starting ATLAS data scraping...") | ||
| # Create HTTPX client | ||
| client = create_httpx_client() | ||
| # Check connection to NOMAD API |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "Check connection to NOMAD API" but this is the ATLAS scraper, so it should say "Check connection to ATLAS API" instead.
| # Check connection to NOMAD API | |
| # Check connection to ATLAS API |
| { | ||
| "file_name": Path(href).name, | ||
| "file_url_in_repository": href, | ||
| # File size are sometimes expressed with comma as decimal separator. |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should use plural form: "File sizes are sometimes expressed" instead of "File size are sometimes expressed".
| # File size are sometimes expressed with comma as decimal separator. | |
| # File sizes are sometimes expressed with comma as decimal separator. |
| set[str] | ||
| List of PDB chain identifiers found. |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring describes the return value as "List of PDB chain identifiers found" but the function returns a set[str], not a list. The description should say "Set of PDB chain identifiers found" for consistency.
src/mdverse_scrapers/core/network.py
Outdated
|
|
||
| def is_connection_to_server_working( | ||
| client: httpx.Client, url: str, logger: "loguru.Logger" = loguru.logger | ||
| ) -> bool | None: |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type annotation indicates bool | None, but the function only returns True or False and never returns None. The return type should be bool instead.
| ) -> bool | None: | |
| ) -> bool: |
| set[str] | ||
| List of PDB chains (datasets) found. |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring describes the return value as "List of PDB chains (datasets) found" but the function returns a set[str], not a list. The description should say "Set of PDB chains (datasets) found" for consistency.
| try: | ||
| meta_json = response.json().get(f"{chain_id}") | ||
| except (json.decoder.JSONDecodeError, KeyError) as exc: | ||
| logger.warning("Failed to deconde JSON response fromthe ATLAS API.") |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two spelling errors in this log message: "deconde" should be "decode" and "fromthe" should be "from the".
| logger.warning("Failed to deconde JSON response fromthe ATLAS API.") | |
| logger.warning("Failed to decode JSON response from the ATLAS API.") |
No description provided.