Skip to content

Replace BLAST search with mmseqs2 #84

@naailkhan28

Description

@naailkhan28

Description of feature

As discussed in #67 , the BLAST search step takes a very long time and is also prone to errors. I've experimented with replacing this with mmseqs2 which is much faster than BLAST and also more sensitive.

Code

See this commit for my proposed changes:

naailkhan28@46a1a44

I've added files mmseqs2_utils.py and run_mmseqs2.py which are analogous to the blast_utils.py and run_blast.py scripts from before. run_mmseqs2.py I adapted this script from the ColabFold repo which queries the mmseqs2 server.

I've also provided an updated Snakefile which implements these new rules. There's a step to query the API, a step to process the resulting .a3m file, and write an output .csv file with hits and a .txt file with UniRef100 hits. Most of these can be supplied directly to the next aggregate_lists rule but some of them are UniParc accessions starting with UPIXXX and so I've written a script map_uniparc_ids.py which will use bioservices to map these to UniProt KB accessions where possible.

Results

The biggest performance benefit of mmseqs2 is speed - requests to the API typically take under 20 seconds whereas BLAST searches often take 30-40 minutes or more.

But with this increased speed we also get many, many more sequence hits back. Here's a histogram of sequence identity of all the hits you get back against the Actin example, with mmseqs2 (with a max E-value of 1e-03), compared to the default setting of BLAST with max 3000 sequences:

image

Loads more sequences, and covering much more of sequence space too! Of course you can increase the number of BLAST hits, but this greatly increases the search time (and sometimes gives errors too). I tried running BLAST with 100,000 max sequences and it's still not given me a result after several hours!

mmseqs2 also recovers most of the BLAST hits anyways. Here's a venn diagram of the Actin sequence hits, and associated structure hits:

image

Only a few hundred hits are unique to BLAST, and as the previous graph shows mmseqs2 has far greater coverage of sequences with lower identity anyways.

Happy to chat more about this idea - my code is pretty basic for now and not polished at all, but I'd be open to working on this idea some more if you guys think it's worthwhile.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions