2 changes: 1 addition & 1 deletion README.md
@@ -132,7 +132,7 @@ The scraping takes about 2 h.
Scrape GPCRmd to collect molecular dynamics (MD) datasets and files related to G-protein-coupled receptors (GPCRs), a major family of membrane proteins and common drug targets.

```bash
-uv run -m scripts.scrape_gpcrmd
+uv run scrape-gpcrmd --output-dir data
```

This command will:
113 changes: 113 additions & 0 deletions docs/gpcrmd.md
@@ -0,0 +1,113 @@
# GPCRmd

> GPCRmd is an online platform for visualizing, analyzing, and sharing molecular dynamics simulations of G-protein-coupled receptors (GPCRs), a key family of membrane proteins and common drug targets.

- web site: https://www.gpcrmd.org/
- documentation: https://gpcrmd-docs.readthedocs.io/en/latest/index.html
- API: https://www.gpcrmd.org/api/
- `version v1.3`

No account or token is needed to access the GPCRmd API.


Review comment (Member) on lines +7 to +12: Please remove the version since it might become obsolete soon.

Review comment (Member), continued: Could you add the reference publication related to this repo?

## Finding molecular dynamics datasets and files

Although GPCRmd provides a public API to discover molecular dynamics datasets, **some important metadata fields and all file-level information are not exposed via the API**. For this reason, web scraping of the dataset page is required to retrieve complete dataset descriptions and file metadata.

### Datasets

In GPCRmd, a dataset (a simulation and its related files) is called a "dynamic".

API entrypoint to search for all datasets at once:

- Path: /search_all/info/
Review comment (Member): "Endpoint" instead of "Path"?

- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api)


#### Dataset metadata retrieved via API:
Review comment (Member): "via the API"


| Field | Description |
| ------------------ | ----------------------------------- |
| `dyn_id` | *Unique dynamic (dataset) identifier* |
| `modelname` | *Name of the simulated model* |
Review comment (Member): "Name of the simulated system"?

| `timestep` | *MD integration timestep* |
Review comment (Member): "MD integration time step in fs"

| `atom_num` | *Number of atoms* |
| `mysoftware` | *MD engine used* |
| `software_version` | *Version of the MD engine* |
| `forcefield` | *Force field and model name* |
| `forcefield_version` | *Force field and model version* |
| `creation_timestamp` | *Dataset creation date* |
| `dataset_url` | *URL of the dataset web page* |
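
As a minimal sketch (not the scraper's actual implementation), the listing endpoint could be queried and reduced to the fields in the table with the Python standard library. This assumes that `/search_all/info/` returns a JSON array of objects keyed by the field names listed in the table; `API_BASE`, `API_FIELDS`, `select_api_fields`, and `fetch_all_datasets` are hypothetical names used only for illustration.

```python
import json
from urllib.request import urlopen

# Base URL taken from the "API" link at the top of this page.
API_BASE = "https://www.gpcrmd.org/api"

# Field names as listed in the table (assumed to match the API keys).
API_FIELDS = (
    "dyn_id",
    "modelname",
    "timestep",
    "atom_num",
    "mysoftware",
    "software_version",
    "forcefield",
    "forcefield_version",
    "creation_timestamp",
    "dataset_url",
)


def select_api_fields(record: dict) -> dict:
    """Keep only the documented fields; missing fields become None."""
    return {field: record.get(field) for field in API_FIELDS}


def fetch_all_datasets(timeout: float = 60.0) -> list[dict]:
    """Query /search_all/info/ and return one reduced dict per dynamic.

    Assumption: the response body is a JSON array of objects.
    """
    with urlopen(f"{API_BASE}/search_all/info/", timeout=timeout) as response:
        records = json.load(response)
    return [select_api_fields(record) for record in records]
```

Keeping the field selection in a separate pure function makes it possible to check the mapping without contacting the server.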

#### Dataset metadata retrieved via web scraping (URL provided by the API):

| Field | Description |
| -------------------- | ------------------------------------------ |
| `description` | *Full textual description of the simulation* |
Review comment (Member): Only "Textual description of the simulation"

| `authors` | *Dataset authors* |
| `simulation_time` | *Total simulation length* |


### Files

The GPCRmd API does not provide any endpoint to access file-level metadata. All file information must therefore be extracted from the dataset web page. Two file categories are available: **simulation output files** and **simulation protocol and starting files**.
Review comment (Member): "All file information must therefore be extracted from the dataset web page" -> "All file information is extracted from the dataset web page"

Review comment (Member): Are "simulation output files" and "simulation protocol and starting files" categories still relevant here?


For example, dataset `7` (https://www.gpcrmd.org/dynadb/dynamics/id/7/) includes these files:
- https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd
- https://www.gpcrmd.org/dynadb/files/Dynamics/10167_dyn_7.psf
- https://www.gpcrmd.org/dynadb/files/Dynamics/10168_dyn_7.pdb
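
The file URLs listed here share the `/dynadb/files/Dynamics/` path prefix, so one possible way to collect them from a dataset page is to keep every link containing that prefix. The sketch below uses only the standard library and assumes that files appear as ordinary `<a href="...">` links in the page HTML, which has not been verified against the real page; `FileLinkParser` and `extract_file_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Path prefix shared by the example file URLs.
FILES_PREFIX = "/dynadb/files/Dynamics/"


class FileLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a dynamics file."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.file_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and FILES_PREFIX in href:
            # Resolve relative links against the site root.
            url = urljoin(self.base_url, href)
            if url not in self.file_urls:  # skip duplicate links
                self.file_urls.append(url)


def extract_file_urls(html: str, base_url: str = "https://www.gpcrmd.org") -> list[str]:
    """Return the unique file URLs found in a dataset page, in page order."""
    parser = FileLinkParser(base_url)
    parser.feed(html)
    return parser.file_urls
```

The HTML passed to `extract_file_urls` would come from downloading the dataset page at the `dataset_url` returned by the API.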

Review comment (Member) on lines +56 to +59: Could you please provide a second example with simulation output files and simulation protocol files?


#### File metadata retrieved via web scraping (URL provided by the API):

| Field | Description |
| ---------- | ---------------------- |
| `file_name` | *Name of the file* |
| `file_type` | *File extension* |
| `file_path`| *Public download URL* |
| `file_size`| *File size in bytes* |

> 💡 File size is obtained using an HTTP `HEAD` request on the file path, **avoiding file download**.


## Examples
### Dataset ID 2316

- [Dataset on GPCRmd GUI](https://www.gpcrmd.org/dynadb/dynamics/id/2316/)
- [Dataset on GPCRmd API](https://www.gpcrmd.org/api/search_dyn/info/2316)


#### Dataset metadata (API + scraping)

| Field | Value |
| ------------------ | ----------------------------------- |
| `dyn_id` | *2316* |
| `modelname` | *FFA2_TUG1375_Gi1-TUG1375* |
| `timestep` | *2* |
| `atom_num` | *4829* |
| `mysoftware` | *AMBER PMEMD.CUDA* |
| `software_version` | *2020* |
| `forcefield` | *ff19SB/lipid21/GAFF2* |
| `forcefield_version` | *ff19SB/lipid21* |
| `creation_timestamp` | *2025-05-13* |
| `dataset_url` | *https://www.gpcrmd.org/dynadb/dynamics/id/2316/* |
| `description` | *Simulation aims to observe structural features of FFA2 without an orthosteric agonist and G-protein, which will be compared to docking-based simulations of allosteric activators...* |
| `authors` | *Abdul-Akim Guseinov, University of Glasgow* |
| `simulation_time` | *3.0 µs* |


- [files on GPCRmd GUI](https://www.gpcrmd.org/dynadb/dynamics/id/2316/) (accessible via the *Technical Information* section)

#### Example file from the dataset

| Field | Value |
| ---------- | ---------------------- |
| `file_name` | *tmp_dyn_0_2667.pdb* |
| `file_type` | *pdb* |
| `file_path`| *https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/tmp_dyn_0_2667.pdb* |
| `file_size`| *1 024 bytes* |


## References

Review comment (Member): Is the reference also scraped into the dataset metadata?

Rodríguez-Espigares, I., Torrens-Fontanals, M., Tiemann, J.K.S. et al. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nat Methods. 2020;17(8):777-787. doi:[10.1038/s41592-020-0884-y](https://www.nature.com/articles/s41592-020-0884-y)
1 change: 1 addition & 0 deletions pyproject.toml
@@ -47,3 +47,4 @@ build-backend = "uv_build"
scrape-zenodo = "mdverse_scrapers.scrapers.zenodo:main"
scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
scrape-gpcrmd = "mdverse_scrapers.scrapers.gpcrmd:main"