
Conversation

@pierrepo
Member

No description provided.

@pierrepo pierrepo marked this pull request as draft January 26, 2026 17:21
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new ATLAS scraper for the MDverse scrapers project and refactors the NOMAD scraper to use a generalized connection testing function. The PR also updates the DOI validation pattern to support DOIs containing forward slashes (required for the ATLAS DOI).

Changes:

  • Added new ATLAS scraper with HTML parsing to extract dataset and file metadata from the ATLAS database
  • Refactored connection testing by moving is_nomad_connection_working to a generic is_connection_to_server_working function
  • Updated DOI regex pattern to allow forward slashes in addition to existing allowed characters
  • Enhanced logger configuration in both ATLAS and NOMAD scrapers to respect debug mode
  • Removed old standalone ATLAS scraping script and updated documentation

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

File | Description
src/mdverse_scrapers/scrapers/atlas.py | New scraper implementation for the ATLAS database, with functions to extract PDB chains, scrape metadata, and parse file information from HTML
src/mdverse_scrapers/scrapers/nomad.py | Refactored to use the generic connection testing function and improved logger initialization based on debug mode
src/mdverse_scrapers/core/network.py | Added generic is_connection_to_server_working function to replace the NOMAD-specific implementation
src/mdverse_scrapers/models/dataset.py | Updated DOI pattern to include the forward slash character, allowing DOIs like 10.1093/nar/gkad1084 (see the regex sketch after this table)
scripts/scrap_atlas.py | Removed the old standalone ATLAS scraper script in favor of the integrated scraper module
pyproject.toml | Added project metadata (license, authors, classifiers) and registered the new scrape-atlas command
docs/atlas.md | Updated documentation with clearer structure and comprehensive API endpoint descriptions
docs/zenodo.md | Reorganized documentation for better clarity by moving the base URL to the API section
.gitignore | Removed test-related directory entries
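The exact pattern added in models/dataset.py is not shown in this overview, so the snippet below is only a minimal sketch of a DOI validation regex that accepts forward slashes and matches a DOI such as 10.1093/nar/gkad1084. The pattern and the helper name are assumptions for illustration, not the code shipped in the PR.

```python
import re

# Assumed pattern (not the one in the PR): "10." prefix, a numeric
# registrant code, then a suffix that may contain forward slashes.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")


def is_valid_doi(doi: str) -> bool:
    """Return True if the string looks like a DOI (hypothetical helper)."""
    return bool(DOI_PATTERN.match(doi))


print(is_valid_doi("10.1093/nar/gkad1084"))  # True: slashes are accepted
print(is_valid_doi("not-a-doi"))             # False
```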


logger.info("Starting ATLAS data scraping...")
# Create HTTPX client
client = create_httpx_client()
# Check connection to NOMAD API

Copilot AI Jan 26, 2026


The comment says "Check connection to NOMAD API" but this is the ATLAS scraper, so it should say "Check connection to ATLAS API" instead.

Suggested change
# Check connection to NOMAD API
# Check connection to ATLAS API
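For context, here is a minimal sketch of how the opening of the ATLAS scraping entry point could read with the corrected comment. The function name, the import paths, and the placeholder URL are assumptions; only create_httpx_client and is_connection_to_server_working are named in the PR.

```python
import sys

import loguru

# Module paths are assumptions; the PR summary only places
# is_connection_to_server_working in src/mdverse_scrapers/core/network.py.
from mdverse_scrapers.core.network import (
    create_httpx_client,
    is_connection_to_server_working,
)

# Placeholder value: the real ATLAS base URL is defined in the scraper module.
ATLAS_URL = "https://atlas.example.org"


def scrape_atlas(logger: "loguru.Logger" = loguru.logger) -> None:
    """Sketch of the ATLAS scraping entry point (function name assumed)."""
    logger.info("Starting ATLAS data scraping...")
    # Create HTTPX client
    client = create_httpx_client()
    # Check connection to ATLAS API
    if not is_connection_to_server_working(client, ATLAS_URL, logger):
        logger.critical("Cannot reach the ATLAS server, aborting.")
        sys.exit(1)
```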

{
    "file_name": Path(href).name,
    "file_url_in_repository": href,
    # File size are sometimes expressed with comma as decimal separator.

Copilot AI Jan 26, 2026


The comment should use plural form: "File sizes are sometimes expressed" instead of "File size are sometimes expressed".

Suggested change
# File size are sometimes expressed with comma as decimal separator.
# File sizes are sometimes expressed with comma as decimal separator.
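The rest of the parsing code is not shown; the snippet below is a self-contained sketch of the idea the comment refers to, converting size strings that may use a comma as decimal separator. The helper name and the unit table are assumptions, not code from atlas.py.

```python
def parse_file_size(size_text: str) -> float:
    """Convert a size string such as '1,5 GB' or '250 MB' to bytes.

    Hypothetical helper: the real parsing logic in atlas.py may differ.
    """
    units = {"B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}
    value_text, unit = size_text.split()
    # File sizes are sometimes expressed with comma as decimal separator.
    value = float(value_text.replace(",", "."))
    return value * units[unit.upper()]


print(parse_file_size("1,5 GB"))  # 1500000000.0
print(parse_file_size("250 MB"))  # 250000000.0
```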

Comment on lines 60 to 61
set[str]
    List of PDB chain identifiers found.

Copilot AI Jan 26, 2026


The docstring describes the return value as "List of PDB chain identifiers found" but the function returns a set[str], not a list. The description should say "Set of PDB chain identifiers found" for consistency.
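To illustrate why a set is the natural return type here, below is a minimal sketch of a function that collects PDB chain identifiers from an HTML page and deduplicates them. The regex and the example identifiers are assumptions; the real function in atlas.py may parse the page differently.

```python
import re

# Assumed identifier format: a 4-character PDB code, an underscore,
# and a chain identifier (e.g. "1ab2_A").
PDB_CHAIN_REGEX = re.compile(r"\b([0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+)\b")


def extract_pdb_chains(html: str) -> set[str]:
    """Return the set of PDB chain identifiers found in an HTML page.

    A set (not a list) is returned so that repeated links to the same
    chain collapse into a single entry.
    """
    return set(PDB_CHAIN_REGEX.findall(html))


html = '<a href="1ab2_A.zip">1ab2_A</a> <a href="1ab2_A.pdb">1ab2_A</a>'
print(extract_pdb_chains(html))  # {'1ab2_A'}
```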


def is_connection_to_server_working(
    client: httpx.Client, url: str, logger: "loguru.Logger" = loguru.logger
) -> bool | None:

Copilot AI Jan 26, 2026


The return type annotation indicates bool | None, but the function only returns True or False and never returns None. The return type should be bool instead.

Suggested change
) -> bool | None:
) -> bool:
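The function body is not included in the diff shown here; the following is a minimal sketch of how a generic connection check with the suggested bool return type could be written. It mirrors the signature above, but the actual implementation in core/network.py may differ.

```python
import httpx
import loguru


def is_connection_to_server_working(
    client: httpx.Client, url: str, logger: "loguru.Logger" = loguru.logger
) -> bool:
    """Return True if an HTTP GET request to url succeeds, False otherwise."""
    try:
        response = client.get(url)
        response.raise_for_status()
    except httpx.HTTPError as exc:
        logger.error(f"Connection to {url} failed: {exc}")
        return False
    logger.info(f"Connection to {url} is working.")
    return True


# Usage in a scraper (sketch):
# if not is_connection_to_server_working(client, BASE_URL, logger):
#     sys.exit(1)
```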

Comment on lines 183 to 184
set[str]
    List of PDB chains (datasets) found.

Copilot AI Jan 26, 2026


The docstring describes the return value as "List of PDB chains (datasets) found" but the function returns a set[str], not a list. The description should say "Set of PDB chains (datasets) found" for consistency.

try:
    meta_json = response.json().get(f"{chain_id}")
except (json.decoder.JSONDecodeError, KeyError) as exc:
    logger.warning("Failed to deconde JSON response fromthe ATLAS API.")

Copilot AI Jan 26, 2026


There are two spelling errors in this log message: "deconde" should be "decode" and "fromthe" should be "from the".

Suggested change
logger.warning("Failed to deconde JSON response fromthe ATLAS API.")
logger.warning("Failed to decode JSON response from the ATLAS API.")

@pierrepo pierrepo marked this pull request as ready for review January 27, 2026 07:42
@pierrepo pierrepo merged commit e7075ef into main Jan 27, 2026
1 check passed
@pierrepo pierrepo deleted the update-atlas-scraper branch January 27, 2026 15:14