Merged
3 changes: 0 additions & 3 deletions .gitignore
@@ -27,6 +27,3 @@ __pycache__/

# MAC tmp files
.DS_Store

Test/*
!Test/Github_version
67 changes: 33 additions & 34 deletions docs/atlas.md
@@ -1,63 +1,62 @@
# ATLAS.
# ATLAS

ATLAS (Atlas of proTein moLecular dynAmicS) is an open-access data repository that gathers standardized molecular dynamics simulations of protein structures, accompanied by their analysis in the form of interactive diagrams and trajectory visualisation. All raw trajectories as well as the results of analysis are available for download.

- web site: https://www.dsimb.inserm.fr/ATLAS/
- documentation: https://www.dsimb.inserm.fr/ATLAS/api/redoc
- API: https://www.dsimb.inserm.fr/ATLAS/api/
- web site: <https://www.dsimb.inserm.fr/ATLAS/>
- publication: [ATLAS: protein flexibility description from atomistic molecular dynamics simulations](https://academic.oup.com/nar/article/52/D1/D384/7438909), Nucleic Acids Research, 2024.

No account / token is needed to access ATLAS API.
## API

---
- Base URL: <https://www.dsimb.inserm.fr/ATLAS/api/>
- [documentation](https://www.dsimb.inserm.fr/ATLAS/api/redoc)

## Finding molecular dynamics datasets and files
No account or token is needed to access the ATLAS API.

### Datasets

In ATLAS, each dataset corresponds to a molecular dynamics simulation of a **protein chain** and is uniquely identified by a **PDB ID and chain identifier** (`pdb_chain`).

The list of all available datasets can be obtained from the ATLAS HTML index:

https://www.dsimb.inserm.fr/ATLAS/

This page is used as the **discovery layer** to extract all available PDB chain identifiers.
The list of all available datasets can be obtained from the ATLAS index page: <https://www.dsimb.inserm.fr/ATLAS/>

---
All datasets (pdb chains) are extracted from this page with a regular expression.
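
The extraction step could be sketched as follows. This is only an illustration: the regular expression below is an assumption (a 4-character PDB ID starting with a digit, an underscore, then a chain identifier), and `extract_pdb_chains` is a hypothetical helper name, not necessarily what the scraper uses.

```python
import re

# Assumed identifier shape: 4-character PDB ID (first character is a digit),
# an underscore, then an alphanumeric chain identifier, e.g. "1k5n_A".
PDB_CHAIN_PATTERN = re.compile(r"\b([0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+)\b")


def extract_pdb_chains(html: str) -> list[str]:
    """Return the unique PDB chain identifiers found in an HTML page, sorted."""
    return sorted(set(PDB_CHAIN_PATTERN.findall(html)))


if __name__ == "__main__":
    # Made-up HTML fragment, used only to exercise the function.
    sample = '<a href="database/ATLAS/1k5n_A/1k5n_A.html">1k5n_A</a>'
    print(extract_pdb_chains(sample))
```

Identifiers usually appear several times in a page (link target and link text), so the sketch deduplicates them with a set before sorting.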

### API entrypoint to search for entries
### Metadata for a given dataset

API endpoint to retrieve metadata for a given dataset:

- Path: `/ATLAS/metadata/{pdb_chain}`
- documentation: https://www.dsimb.inserm.fr/ATLAS/api/redoc
- Endpoint: `/ATLAS/metadata/{pdb_chain}`
- HTTP method: GET
- documentation: <https://www.dsimb.inserm.fr/ATLAS/api/redoc>

This endpoint returns structured JSON metadata describing the protein and its molecular dynamics simulation.
This endpoint returns structured JSON metadata describing the simulated protein.
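
A minimal sketch of a call to this endpoint, using only the Python standard library. The helper names `build_metadata_url` and `fetch_metadata` are hypothetical, the response is assumed to be a JSON object, and the project itself may rely on a different HTTP client.

```python
import json
import urllib.request

# Base URL of the API, as listed in the API section.
API_BASE_URL = "https://www.dsimb.inserm.fr/ATLAS/api"


def build_metadata_url(pdb_chain: str) -> str:
    """Build the metadata URL for a dataset such as '1k5n_A'."""
    return f"{API_BASE_URL}/ATLAS/metadata/{pdb_chain}"


def fetch_metadata(pdb_chain: str, timeout: float = 30.0) -> dict:
    """Send a GET request and decode the JSON body (assumed to be an object)."""
    url = build_metadata_url(pdb_chain)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(build_metadata_url("1k5n_A"))
```

No authentication header is set, since no account or token is required.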

---
Example with dataset id `1k5n_A`:

### Files
- [web page](https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A.html)
- [API view](https://www.dsimb.inserm.fr/ATLAS/api/ATLAS/metadata/1k5n_A)

Files associated with a given dataset are hosted in a public directory.

- Base path: `/database/ATLAS/{pdb_chain}/`
Remarks:

These directories contain structure files (PDB, CIF), molecular dynamics trajectories, and precomputed analysis results.
- The title of the dataset is the protein name.
- No comment or description is provided. We use the organism as the description.

---
### Metadata for files

## Examples

### 1k5n_A
Files associated with a given dataset are hosted in a public directory.

- entry id: `1k5n_A`
- entry on ATLAS GUI: https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A.html
- entry on ATLAS API: https://www.dsimb.inserm.fr/ATLAS/api/ATLAS/metadata/1k5n_A
For each dataset, 3 zip files are provided. They are accessible through the web page of each individual dataset: <https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/{pdb_chain}/{pdb_chain}.html>

### Description (called "Comment") :
The zip file URLs follow these patterns:

HLA class I histocompatibility antigen, B alpha chain
- Analysis & MDs (1,000 frames, only protein): <https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/{pdb_chain}/{pdb_chain}_analysis.zip>
- MDs (10,000 frames, only protein): <https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/{pdb_chain}/{pdb_chain}_protein.zip>
- MDs (10,000 frames, total system): <https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/{pdb_chain}/{pdb_chain}_total.zip>
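
As an illustration, the three patterns above can be expanded for a given dataset with a small helper. `build_zip_urls` is a hypothetical name introduced here, and the sketch only builds the URLs; it does not check that the files exist.

```python
# Public directory that hosts the dataset files.
DATABASE_BASE_URL = "https://www.dsimb.inserm.fr/ATLAS/database/ATLAS"

# Suffixes of the three zip files listed above.
ZIP_SUFFIXES = ("analysis", "protein", "total")


def build_zip_urls(pdb_chain: str) -> dict[str, str]:
    """Map each zip file suffix to its expected URL for one dataset."""
    return {
        suffix: f"{DATABASE_BASE_URL}/{pdb_chain}/{pdb_chain}_{suffix}.zip"
        for suffix in ZIP_SUFFIXES
    }


if __name__ == "__main__":
    for suffix, url in build_zip_urls("1k5n_A").items():
        print(suffix, url)
```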

### Files
Example with dataset id `1k5n_A`:

- files on ATLAS GUI: https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A.html
- [web page](https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A.html)
- [1k5n_A_analysis.zip](https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A_analysis.zip)
- [1k5n_A_protein.zip](https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A_protein.zip)
- [1k5n_A_total.zip](https://www.dsimb.inserm.fr/ATLAS/database/ATLAS/1k5n_A/1k5n_A_total.zip)

We parse the HTML content of the dataset page and use regular expressions to extract URLs, file names, and file sizes.
7 changes: 1 addition & 6 deletions docs/zenodo.md
@@ -10,8 +10,7 @@ So we don't expect much files to have an individual size above 50 GB.

## API

### Documentation

- Base URL: <https://zenodo.org/>
- [REST API](https://developers.zenodo.org/)
- List of [HTTP status codes](https://developers.zenodo.org/#http-status-codes)

@@ -21,10 +20,6 @@ Zenodo requires a token to access its API with higher rate limits. See "[Authent

Example of direct API link for a given dataset: <https://zenodo.org/api/records/8183728>

### Base ULR

<https://zenodo.org/>

### Query

[Search guide](https://help.zenodo.org/guides/search/)
22 changes: 22 additions & 0 deletions pyproject.toml
@@ -3,6 +3,27 @@ name = "mdverse-scrapers"
version = "0.1.0"
description = "MDverse scrapers"
readme = "README.md"
license = "BSD-3-Clause"
authors = [
{ name = "Pierre Poulain", email = "pierre.poulain@cupnet.net" },
{ name = "Essmay Touami", email = "essmay.touami@etu.u-paris.fr" },
{ name = "Salahudin Sheikh", email = "sheikh@ibpc.fr"}
]
maintainers = [
{ name = "Pierre Poulain", email = "pierre.poulain@cupnet.net" }
]
classifiers = [
"Development Status :: 4 - Beta",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Intended Audience :: Science/Research",
"Topic :: Database",
"Topic :: Scientific/Engineering :: Bio-Informatics",
"Topic :: Scientific/Engineering :: Chemistry",
]
requires-python = ">=3.12"
dependencies = [
"beautifulsoup4>=4.13.3",
@@ -50,3 +71,4 @@ build-backend = "uv_build"
scrape-zenodo = "mdverse_scrapers.scrapers.zenodo:main"
scrape-figshare = "mdverse_scrapers.scrapers.figshare:main"
scrape-nomad = "mdverse_scrapers.scrapers.nomad:main"
scrape-atlas = "mdverse_scrapers.scrapers.atlas:main"