- No due date
Currently we rely on many API calls, which makes the pipeline brittle. It would be better to build methods that query local databases, which do not fail because of internet connection problems, API query limits, etc.

There are five different API queries:

- blastp via the BLAST+ command line, wrapped in `subprocess`
- Foldseek API to search structures against afdb50, afdb-proteome, and afdb-swissprot
- UniProt ID mapping queries to map RefSeq/GenBank accessions to UniProtKB-AC IDs
- UniProt ID mapping queries to pull down metadata for UniProt sequences
- download of AlphaFold structures from GCP

This will involve first creating a small test version of each of the following datasets:

- a Foldseek database (to eventually use the afdb50 database, ~190GB)
- a BLAST database (to eventually query the clusterednr database, ~70GB)
- a UniProt ID mapping database (to eventually query ~22GB)

Eventually it would be best to also be able to download from:

- some version of UniProt with all the metadata (figure out where this is located)
- the structures of the full AlphaFold database (23 TiB)
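As a starting point for the small test version of the ID mapping database, here is a minimal sketch using SQLite from the Python standard library. It assumes the input is a three-column, tab-separated file (UniProtKB-AC, ID type, external ID), which is how UniProt's bulk `idmapping.dat` download is laid out as far as we know. The table name, the function names, and the accessions in the usage example are all hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_idmapping_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (uniprot_ac, id_type, external_id) from tab-separated lines.

    Assumption: three tab-separated columns per line. Lines with a
    different number of columns are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield parts[0], parts[1], parts[2]


def build_idmapping_db(
    rows: Iterable[Tuple[str, str, str]], db_path: str = ":memory:"
) -> sqlite3.Connection:
    """Load mapping rows into a SQLite table indexed by external ID."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idmapping ("
        "uniprot_ac TEXT NOT NULL, "
        "id_type TEXT NOT NULL, "
        "external_id TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO idmapping VALUES (?, ?, ?)", rows)
    # The index makes lookups by RefSeq/GenBank accession fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_external_id ON idmapping (external_id)"
    )
    conn.commit()
    return conn


def map_to_uniprot(
    conn: sqlite3.Connection, accessions: Iterable[str]
) -> Dict[str, List[str]]:
    """Return the UniProtKB-AC IDs for each accession (empty list if none)."""
    query = (
        "SELECT uniprot_ac FROM idmapping "
        "WHERE external_id = ? ORDER BY uniprot_ac"
    )
    return {
        acc: [row[0] for row in conn.execute(query, (acc,))]
        for acc in accessions
    }
```

A usage example with made-up accessions (not real records):

```python
lines = [
    "P12345\tRefSeq\tNP_000001.1\n",
    "Q99999\tRefSeq\tNP_000001.1\n",
    "P12345\tEMBL-CDS\tAAA00001.1\n",
]
conn = build_idmapping_db(parse_idmapping_lines(lines))
print(map_to_uniprot(conn, ["NP_000001.1", "XP_missing.1"]))
```

Passing a file path as `db_path` instead of `:memory:` would persist the database to disk, which is likely what the full ~22GB mapping needs.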
No due date•3/3 issues closed