# Update GPCRmd scraper #51
base: main
Conversation
…raping and metadata extraction.
- API: https://www.gpcrmd.org/api/
- `version v1.3`

No account / token is needed to access GPCRmd API.
Please remove the version since it might be obsolete soon.
Could you add the reference publication related to this repo?
API entrypoint to search for all datasets at once:

- Path: /search_all/info/
"Endpoint" instead of "Path"?
- [documentation](https://gpcrmd-docs.readthedocs.io/en/latest/api.html#main-gpcrmd-api)
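Putting the endpoint details above together, here is a minimal client sketch. The base URL and `/search_all/info/` path come from the text; the sample record and field handling are assumptions for illustration only (a real call would use e.g. `requests.get(...)`):

```python
import json

# Base URL and search endpoint as documented above; no token is required.
API_BASE = "https://www.gpcrmd.org/api"
SEARCH_ALL_URL = API_BASE + "/search_all/info/"

# Hypothetical response excerpt, shaped like the metadata tables below.
# A real call would be, e.g., requests.get(SEARCH_ALL_URL).json().
sample_response = json.loads('[{"dyn_id": 7, "modelname": "example model"}]')

# Index the datasets by their unique identifier.
datasets_by_id = {}
for entry in sample_response:
    datasets_by_id[entry["dyn_id"]] = entry

print(SEARCH_ALL_URL)
print(datasets_by_id[7]["modelname"])
```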
#### Dataset metadata retrieved via API:
via the API
| Field | Description |
| ------------------ | ----------------------------------- |
| `dyn_id` | *Unique dynamic (dataset) identifier* |
| `modelname` | *Name of the simulated model* |
"Name of the simulated system"?
| Field | Description |
| ------------------ | ----------------------------------- |
| `timestep` | *MD integration timestep* |
"MD integration time step in fs"
| Field | Description |
| -------------------- | ------------------------------------------ |
| `description` | *Full textual description of the simulation* |
Only "Textual description of the simulation"
### Files

The GPCRmd API does not provide any endpoint to access file-level metadata. All file information must therefore be extracted from the dataset web page. Two file categories are available: **simulation output files** and **simulation protocol and starting files**.
"All file information must therefore be extracted from the dataset web page"
->
"All file information is extracted from the dataset web page"
Are "simulation output files" and "simulation protocol and starting files" categories still relevant here?
- https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd
- https://www.gpcrmd.org/dynadb/files/Dynamics/10167_dyn_7.psf
- https://www.gpcrmd.org/dynadb/files/Dynamics/10168_dyn_7.pdb
Could you please provide a second example with simulation output files and simulation protocol files?
For instance this one: https://www.gpcrmd.org/dynadb/dynamics/id/2316/
## References
Is the reference also scraped into the dataset metadata?
```python
header = next(
    (
        h
        for h in soup.find_all("h3")
        if h.get_text(strip=True) == "References"
    ),
    None,
)
```
Looks complicated. Could you please avoid comprehensions?
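To illustrate the suggestion, here is a loop-based equivalent of the `next(...)` pattern, sketched over a plain list standing in for the `soup.find_all("h3")` results (the heading names are invented):

```python
# Hypothetical stand-in for the h3 headings found on a dataset page.
headings = ["Summary", "Methods", "References"]

# Generator-expression form, as in the diff:
header = next((h for h in headings if h == "References"), None)

# Explicit loop: same result, easier to read.
header_loop = None
for h in headings:
    if h == "References":
        header_loop = h
        break

print(header, header_loop)
```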
```python
return [
    a["href"].strip()
    for a in content_div.find_all("a", href=True)
    if isinstance(a, Tag)
    and a["href"].strip().startswith(("http://", "https://"))
]
```
Please avoid the list comprehension.
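For illustration, the same filter can be unrolled into a plain loop; sketched here over bare href strings rather than bs4 tags (the example URLs are invented):

```python
# Hypothetical href values pulled from <a> tags.
hrefs = ["  https://example.org/paper  ", "/dynadb/files/local", "http://example.com"]

links = []
for href in hrefs:
    href = href.strip()
    # Keep only absolute http(s) links, as the comprehension above does.
    if href.startswith(("http://", "https://")):
        links.append(href)

print(links)
```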
```python
def count_links(container_id: str) -> int:
    # Find the container <div> by ID
    container = soup.find("div", id=container_id)
    # Ensure the container is actually a Tag
    if not isinstance(container, Tag):
        return 0

    # Collect all hrefs in <a> tags, stripping whitespace
    links = [
        str(a.get("href", "")).strip()
        for a in container.find_all("a", href=True)
        if isinstance(a, Tag) and str(a.get("href", "")).strip()
    ]

    # Remove duplicates while preserving order
    return len(dict.fromkeys(links))
```
Please, no nested function definitions.
```python
"""
Count files in the dataset webpage.

Especially in 'Simulation output files' and 'Simulation protocol \
```
I'm not sure we need to make a distinction between "simulation output files" and "simulation protocol & starting files".
Why not search for any "a" tags with an "href" pointing to "/dynadb/files/Dynamics/" anywhere in the entire HTML document?
For instance:

```python
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
for link in links:
    href = link.get("href", "")
    if "/dynadb/files/Dynamics/" in href:
        print("Interesting file:", href)
```

```python
# Extract other metadata from dataset url page if available.
if html_content is None:
    logger.warning(
        "Error parsing additionnal metadatas from web page for dataset"
```
Please write the dataset_url in a separate line in logger
```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing author names for entry {dataset_id}: {e}")
```
Could you pass the logger object to the retrieve_metadata() function and handle exceptions in the function itself?
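A sketch of that suggestion, with a placeholder body since retrieve_metadata()'s real parsing logic is not shown in the diff:

```python
import logging

def retrieve_metadata(dataset_id, logger):
    """Hypothetical sketch: the logger is passed in and exceptions
    are handled inside the function rather than at the call site."""
    try:
        # Placeholder for the real parsing logic.
        raise KeyError("authors")
    except (ValueError, KeyError) as e:
        logger.warning(f"Error parsing author names for entry {dataset_id}: {e}")
        return None

logger = logging.getLogger("gpcrmd")
print(retrieve_metadata(7, logger))  # None
```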
```python
bold_tag = next(
    (b for b in soup.find_all("b") if b.get_text(strip=True) == field_name),
    None,
)
```
No comprehension please. This is difficult to read.
```python
except (ValueError, KeyError) as e:
    logger.warning(f"Error parsing simulation time for entry {dataset_id}: {e}")
metadata["simulation_time"] = stime_list
# Reference links.
```
It looks like the retrieve_metadata() function returns a single value only. So why have a list?
```python
file_size = None
logger.warning(f"Could not retrieve file size for '{file_name}'")

files_metadata.append((file_name, file_type, file_size, file_url))
```
Could you store the metadata as a list of dictionaries instead of a list of tuples?
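A sketch of the dictionary-based record; the example file name and URL come from the list earlier in the README, while the key names are assumptions:

```python
files_metadata = []

# One record per file, with named fields instead of tuple positions.
files_metadata.append({
    "file_name": "10166_trj_7.dcd",
    "file_type": "dcd",
    "file_size": None,
    "file_url": "https://www.gpcrmd.org/dynadb/files/Dynamics/10166_trj_7.dcd",
})

# Fields are then accessed by name, which survives reordering.
print(files_metadata[0]["file_type"])
```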
```python
# Number of files.
nb_files = None
try:
    nb_files: int | None = count_simulation_files(html_content)
```
All the info could be provided by the extract_files_metadata_from_html() function. We don't really need the count_simulation_files() function. You could fill the metadata["nb_files"] field after getting metadata for files.
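A sketch of that simplification, using hypothetical records shaped like the files metadata discussed above:

```python
# Hypothetical records returned by extract_files_metadata_from_html().
files_metadata = [
    {"file_name": "10167_dyn_7.psf"},
    {"file_name": "10168_dyn_7.pdb"},
]

metadata = {}
# nb_files falls out of the already-collected file records,
# so no separate count_simulation_files() pass is needed.
metadata["nb_files"] = len(files_metadata)
print(metadata["nb_files"])  # 2
```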
```python
# Example of file urls:
# From dataset ID: 2316 (https://www.gpcrmd.org/dynadb/dynamics/id/2316/)
# 1. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/tmp_dyn_0_2667.pdb
# 2. https://www.gpcrmd.org/dynadb/files/Dynamics/dyn2667/25400_trj_2316.dcd
```
See the comment for the count_simulation_files() function. Could you provide a second example with output and protocol files?
```python
"""Scrape molecular dynamics datasets and files from GPCRmd."""
# Create directories and logger.
output_dir_path = (output_dir_path / DatasetProjectName.GPCRMD.value
                   / datetime.now().strftime("%Y-%m-%d"))
```
;-)