# Update GPCRmd scraper #51
base: main
Conversation
…raping and metadata extraction.
- API: https://www.gpcrmd.org/api/
- `version v1.3`

No account / token is needed to access GPCRmd API.
Please remove the version since it might be obsolete soon.
Could you add the reference publication related to this repo?
API entrypoint to search for all datasets at once:

- Path: /search_all/info/
"Endpoint" instead of "Path"?
- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api)
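Putting the endpoint details above together, here is a minimal client sketch. The base URL and `/search_all/info/` path come from the text; the sample record and field handling are assumptions for illustration only (a real call would use e.g. `requests.get(...)`):

```python
import json

# Base URL and search endpoint as documented above; no token is required.
API_BASE = "https://www.gpcrmd.org/api"
SEARCH_ALL_URL = API_BASE + "/search_all/info/"

# Hypothetical response excerpt, shaped like the metadata tables below.
# A real call would be, e.g., requests.get(SEARCH_ALL_URL).json().
sample_response = json.loads('[{"dyn_id": 7, "modelname": "example model"}]')

# Index the datasets by their unique identifier.
datasets_by_id = {}
for entry in sample_response:
    datasets_by_id[entry["dyn_id"]] = entry

print(SEARCH_ALL_URL)
print(datasets_by_id[7]["modelname"])
```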
#### Dataset metadata retrieved via API:
via the API
| Field | Description |
| ------------------ | ----------------------------------- |
| `dyn_id` | *Unique dynamic (dataset) identifier* |
| `modelname` | *Name of the simulated model* |
"Name of the simulated system"?
| Field | Description |
| ------------------ | ----------------------------------- |
| `timestep` | *MD integration timestep* |
"MD integration time step in fs"
| Field | Description |
| -------------------- | ------------------------------------------ |
| `description` | *Full textual description of the simulation* |
Only "Textual description of the simulation"
### Files

The GPCRmd API does not provide any endpoint to access file-level metadata. All file information must therefore be extracted from the dataset web page. Two file categories are available: **simulation output files** and **simulation protocol and starting files**.
"All file information must therefore be extracted from the dataset web page"
->
"All file information is extracted from the dataset web page"
Are "simulation output files" and "simulation protocol and starting files" categories still relevant here?
- https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd
- https://www.gpcrmd.org/dynadb/files/Dynamics/10167_dyn_7.psf
- https://www.gpcrmd.org/dynadb/files/Dynamics/10168_dyn_7.pdb
Could you please provide a second example with simulation output files and simulation protocol files?
For instance this one: https://www.gpcrmd.org/dynadb/dynamics/id/2316/
## References
Is the reference also scraped into the dataset metadata?
```python
header = next(
    (
        h
        for h in soup.find_all("h3")
        if h.get_text(strip=True) == "References"
    ),
    None,
)
```
Looks complicated. Could you please avoid comprehensions?
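To illustrate the suggestion, here is a loop-based equivalent of the `next(...)` pattern, sketched over a plain list standing in for the `soup.find_all("h3")` results (the heading names are invented):

```python
# Hypothetical stand-in for the h3 headings found on a dataset page.
headings = ["Summary", "Methods", "References"]

# Generator-expression form, as in the diff:
header = next((h for h in headings if h == "References"), None)

# Explicit loop: same result, easier to read.
header_loop = None
for h in headings:
    if h == "References":
        header_loop = h
        break

print(header, header_loop)
```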
```python
return [
    a["href"].strip()
    for a in content_div.find_all("a", href=True)
    if isinstance(a, Tag)
    and a["href"].strip().startswith(("http://", "https://"))
]
```
Please avoid the list comprehension.
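For illustration, the same filter can be unrolled into a plain loop; sketched here over bare href strings rather than bs4 tags (the example URLs are invented):

```python
# Hypothetical href values pulled from <a> tags.
hrefs = ["  https://example.org/paper  ", "/dynadb/files/local", "http://example.com"]

links = []
for href in hrefs:
    href = href.strip()
    # Keep only absolute http(s) links, as the comprehension above does.
    if href.startswith(("http://", "https://")):
        links.append(href)

print(links)
```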
```python
def count_links(container_id: str) -> int:
    # Find the container <div> by ID
    container = soup.find("div", id=container_id)
    # Ensure the container is actually a Tag
    if not isinstance(container, Tag):
        return 0

    # Collect all hrefs in <a> tags, stripping whitespace
    links = [
        str(a.get("href", "")).strip()
        for a in container.find_all("a", href=True)
        if isinstance(a, Tag) and str(a.get("href", "")).strip()
    ]

    # Remove duplicates while preserving order
    return len(dict.fromkeys(links))
```
Please, no nested function definitions.
```python
"""
Count files in the dataset webpage.

Especially in 'Simulation output files' and 'Simulation protocol \
```
I'm not sure we need to make a distinction between "simulation output files" and "simulation protocol & starting files".
Why not search for any "a" tags with an "href" pointing to "/dynadb/files/Dynamics/" anywhere in the entire HTML document?
For instance:

```python
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
for link in links:
    href = link.get("href", "")
    if "/dynadb/files/Dynamics/" in href:
        print("Interesting file:", href)
```

```python
# Extract other metadata from dataset url page if available.
if html_content is None:
    logger.warning(
        "Error parsing additionnal metadatas from web page for dataset"
```
Please write the dataset_url in a separate line in logger
```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing author names for entry {dataset_id}: {e}")
```
Could you pass the logger object to the retrieve_metadata() function and handle exceptions in the function itself?
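A sketch of that suggestion, with a placeholder body since retrieve_metadata()'s real parsing logic is not shown in the diff:

```python
import logging

def retrieve_metadata(dataset_id, logger):
    """Hypothetical sketch: the logger is passed in and exceptions
    are handled inside the function rather than at the call site."""
    try:
        # Placeholder for the real parsing logic.
        raise KeyError("authors")
    except (ValueError, KeyError) as e:
        logger.warning(f"Error parsing author names for entry {dataset_id}: {e}")
        return None

logger = logging.getLogger("gpcrmd")
print(retrieve_metadata(7, logger))  # None
```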
```python
bold_tag = next(
    (b for b in soup.find_all("b") if b.get_text(strip=True) == field_name),
    None,
)
```
No comprehension please. This is difficult to read.
```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing simulation time for entry {dataset_id}: {e}")
metadata["simulation_time"] = stime_list
# Reference links.
```
It looks like the retrieve_metadata() function returns a single value only. So why have a list?
```python
file_size = None
logger.warning(f"Could not retrieve file size for '{file_name}'")

files_metadata.append((file_name, file_type, file_size, file_url))
```
Could you store the metadata as a list of dictionaries instead of a list of tuples?
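A sketch of the dictionary-based record; the example file name and URL come from the list earlier in the README, while the key names are assumptions:

```python
files_metadata = []

# One record per file, with named fields instead of tuple positions.
files_metadata.append({
    "file_name": "10166_trj_7.dcd",
    "file_type": "dcd",
    "file_size": None,
    "file_url": "https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd",
})

# Fields are then accessed by name, which survives reordering.
print(files_metadata[0]["file_type"])
```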
```python
# Number of files.
nb_files = None
try:
    nb_files: int | None = count_simulation_files(html_content)
```
All the info could be provided by the extract_files_metadata_from_html() function. We don't really need the count_simulation_files() function. You could fill the metadata["nb_files"] field after getting metadata for files.
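A sketch of that simplification, using hypothetical records shaped like the files metadata discussed above:

```python
# Hypothetical records returned by extract_files_metadata_from_html().
files_metadata = [
    {"file_name": "10167_dyn_7.psf"},
    {"file_name": "10168_dyn_7.pdb"},
]

metadata = {}
# nb_files falls out of the already-collected file records,
# so no separate count_simulation_files() pass is needed.
metadata["nb_files"] = len(files_metadata)
print(metadata["nb_files"])  # 2
```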
```python
# Example of file urls:
# From dataset ID: 2316 (https://www.gpcrmd.org/dynadb/dynamics/id/2316/)
# 1. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/tmp_dyn_0_2667.pdb
# 2. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/25400_trj_2316.dcd
```
See the comment for the count_simulation_files() function. Could you provide a second example with output and protocol files?
```python
"""Scrape molecular dynamics datasets and files from GPCRmd."""
# Create directories and logger.
output_dir_path = (output_dir_path / DatasetProjectName.GPCRMD.value
                   / datetime.now().strftime("%Y-%m-%d"))
```
;-)