
Conversation

@Essmaw (Collaborator) commented Jan 16, 2026

No description provided.

@Essmaw marked this pull request as draft January 16, 2026 18:10
@Essmaw marked this pull request as ready for review January 19, 2026 17:03
@Essmaw requested a review from pierrepo January 19, 2026 17:03
@pierrepo (Member)

Thanks @Essmaw
Could you please write a short documentation page like this one to explain how the API works and how datasets and files are collected?

@Essmaw requested a review from pierrepo January 20, 2026 17:56
Comment on lines +7 to +12
- API: https://www.gpcrmd.org/api/
- `version v1.3`

No account / token is needed to access GPCRmd API.


Member

Please remove the version since it might become obsolete soon.

Could you add the reference publication related to this repo?


API entrypoint to search for all datasets at once:

- Path: /search_all/info/

Member

"Endpoint" instead of "Path"?

- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api)


#### Dataset metadata retrieved via API:

Member

via the API

| Field | Description |
| ------------------ | ----------------------------------- |
| `dyn_id` | *Unique dynamic (dataset) identifier* |
| `modelname` | *Name of the simulated model* |

Member

"Name of the simulated system"?

| ------------------ | ----------------------------------- |
| `dyn_id` | *Unique dynamic (dataset) identifier* |
| `modelname` | *Name of the simulated model* |
| `timestep` | *MD integration timestep* |

Member

"MD integration time step in fs"


| Field | Description |
| -------------------- | ------------------------------------------ |
| `description` | *Full textual description of the simulation* |

Member

Only "Textual description of the simulation"


### Files

The GPCRmd API does not provide any endpoint to access file-level metadata. All file information must therefore be extracted from the dataset web page. Two file categories are available: **simulation output files** and **simulation protocol and starting files**.

Member

"All file information must therefore be extracted from the dataset web page"
->
"All file information is extracted from the dataset web page"

Member

Are "simulation output files" and "simulation protocol and starting files" categories still relevant here?

Comment on lines +56 to +59
- https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd
- https://www.gpcrmd.org/dynadb/files/Dynamics/10167_dyn_7.psf
- https://www.gpcrmd.org/dynadb/files/Dynamics/10168_dyn_7.pdb

Member

Could you please provide a second example with simulation output files and simulation protocol files?


## References

Member

Is the reference also scraped into the dataset metadata?

Comment on lines +190 to +197
```python
header = next(
    (
        h
        for h in soup.find_all("h3")
        if h.get_text(strip=True) == "References"
    ),
    None,
)
```

Member

Looks complicated. Could you please avoid comprehensions?
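
For instance, a plain loop with an early break would do the same job (a rough sketch):

```python
# Sketch: loop-based equivalent of the next(...) expression above.
header = None
for h in soup.find_all("h3"):
    if h.get_text(strip=True) == "References":
        header = h
        break
```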

Comment on lines +205 to +210
```python
return [
    a["href"].strip()
    for a in content_div.find_all("a", href=True)
    if isinstance(a, Tag)
    and a["href"].strip().startswith(("http://", "https://"))
]
```

Member

Please avoid the list comprehension.
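
For instance (a sketch of an equivalent plain loop; the `isinstance(a, Tag)` check only exists to satisfy the type checker):

```python
# Sketch: loop-based equivalent of the list comprehension above.
links = []
for a in content_div.find_all("a", href=True):
    href = a["href"].strip()
    if href.startswith(("http://", "https://")):
        links.append(href)
return links
```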

Comment on lines +235 to +250
```python
def count_links(container_id: str) -> int:
    # Find the container <div> by ID
    container = soup.find("div", id=container_id)
    # Ensure the container is actually a Tag
    if not isinstance(container, Tag):
        return 0

    # Collect all hrefs in <a> tags, stripping whitespace
    links = [
        str(a.get("href", "")).strip()
        for a in container.find_all("a", href=True)
        if isinstance(a, Tag) and str(a.get("href", "")).strip()
    ]

    # Remove duplicates while preserving order
    return len(dict.fromkeys(links))
```

Member

Please, no function inside a function.
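
For instance, the helper could live at module level, with `soup` passed explicitly (a rough sketch, also without comprehensions):

```python
from bs4 import BeautifulSoup, Tag

# Sketch: the nested helper hoisted to module level, with soup passed in.
def count_links(soup: BeautifulSoup, container_id: str) -> int:
    container = soup.find("div", id=container_id)
    if not isinstance(container, Tag):
        return 0
    links = []
    for a in container.find_all("a", href=True):
        href = str(a.get("href", "")).strip()
        # Keep unique, non-empty hrefs, preserving order.
        if href and href not in links:
            links.append(href)
    return len(links)
```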

"""
Count files in the dataset webpage.

Especially in 'Simulation output files' and 'Simulation protocol \

Member

I'm not sure we need to make a distinction between "simulation output files" and "simulation protocol & starting files".

Why not search for any "a" tag whose "href" points to "/dynadb/files/Dynamics/" anywhere in the entire HTML document?

For instance:

```python
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
for link in links:
    href = link.get("href", "")
    if "/dynadb/files/Dynamics/" in href:
        print("Interesting file:", href)
```

```python
# Extract other metadata from dataset url page if available.
if html_content is None:
    logger.warning(
        "Error parsing additional metadata from web page for dataset"
```

Member

Please write the dataset_url on a separate line in the logger message.
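
For instance (a sketch, assuming `dataset_url` is in scope):

```python
# Sketch: put the URL on its own line in the warning message.
logger.warning(
    "Error parsing additional metadata from web page for dataset.\n"
    f"URL: {dataset_url}"
)
```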

Comment on lines +316 to +317
```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing author names for entry {dataset_id}: {e}")
```

Member

Could you pass the logger object to the retrieve_metadata() function and handle exceptions inside the function itself?
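
For instance (a rough sketch; the exact signature is up to you):

```python
import logging

# Sketch (assumed signature): exceptions are handled inside the function itself.
def retrieve_metadata(dataset_id: str, logger: logging.Logger) -> dict:
    metadata = {}
    try:
        # ... parse author names, simulation time, reference links, etc.
        pass
    except (ValueError, KeyError) as e:
        logger.warning(f"Error parsing metadata for entry {dataset_id}: {e}")
    return metadata
```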

Comment on lines +155 to +158
```python
bold_tag = next(
    (b for b in soup.find_all("b") if b.get_text(strip=True) == field_name),
    None,
)
```

Member

No comprehension please. This is difficult to read.
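
Same pattern as suggested above (sketch):

```python
# Sketch: loop-based lookup of the matching <b> tag.
bold_tag = None
for b in soup.find_all("b"):
    if b.get_text(strip=True) == field_name:
        bold_tag = b
        break
```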

```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing simulation time for entry {dataset_id}: {e}")
metadata["simulation_time"] = stime_list
# Reference links.
```

Member

It looks like the retrieve_metadata() function returns a single value only. So why have a list?

```python
file_size = None
logger.warning(f"Could not retrieve file size for '{file_name}'")

files_metadata.append((file_name, file_type, file_size, file_url))
```

Member

Could you store the metadata in a list of dictionaries instead of a list of tuples?
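
For instance (a sketch; the key names are just a suggestion):

```python
# Sketch: one dictionary per file, with hypothetical key names.
files_metadata.append(
    {
        "name": file_name,
        "type": file_type,
        "size": file_size,
        "url": file_url,
    }
)
```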

```python
# Number of files.
nb_files = None
try:
    nb_files: int | None = count_simulation_files(html_content)
```

Member

All the info could be provided by the extract_files_metadata_from_html() function. We don't really need the count_simulation_files() function. You could fill the metadata["nb_files"] field after getting metadata for files.
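
For instance (a sketch reusing the function names above):

```python
# Sketch: derive the count from the files metadata directly.
files_metadata = extract_files_metadata_from_html(html_content)
metadata["nb_files"] = len(files_metadata)
```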

```python
# Example of file urls:
# From dataset ID: 2316 (https://www.gpcrmd.org/dynadb/dynamics/id/2316/)
# 1. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/tmp_dyn_0_2667.pdb
# 2. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/25400_trj_2316.dcd
```

Member

See the comment for the count_simulation_files() function. Could you provide a second example with output and protocol files?

"""Scrape molecular dynamics datasets and files from GPCRmd."""
# Create directories and logger.
output_dir_path = (output_dir_path / DatasetProjectName.GPCRMD.value
/ datetime.now().strftime("%Y-%m-%d"))

Member

;-)
