AlignBucket is a tool for splitting a FASTA file into smaller pieces suitable for alignment with BLAST. The split is optimized for a required minimum alignment coverage.
References
G. Profiti, P. Fariselli, and R. Casadio. "AlignBucket: a tool to speed up 'all-against-all' protein sequence alignments optimizing length constraints". Bioinformatics, 31(23): 3841-3843, 2015.
- sample dataset and SwissProt 2015_05 as described in the paper (about 100 megabytes)
The program was tested on a GNU/Linux Debian 7 system. It requires a C++ compiler (g++), the GMP library with its C++ bindings, and Boost, including Boost.Program_options.
To compile the program, you need to install the required libraries.
On Debian and Ubuntu you can use the following commands:
- sudo apt-get install g++
- sudo apt-get install libgmp3-dev
- sudo apt-get install libboost-dev
- sudo apt-get install libboost-program-options-dev
Then you must run:
or, if you don't have Automake on your system:
g++ -o alignbucket src/alignbucket.cpp -lgmpxx -lgmp -lboost_program_options
Then, you have to set execution permission on the program:
chmod u+x alignbucket
To split your fasta file into optimized buckets for 90% coverage, just run the following command:
./alignbucket.sh <fasta file>
Example with the sample dataset from SwissProt:
The output is a set of FASTA files named xxx-yyy.fasta, where xxx and yyy are the minimum and maximum lengths of the sequences in the file. You can now run a BLAST alignment on each of these subsets.
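The per-bucket alignment step can be scripted. The following is a minimal sketch and is not part of AlignBucket: it assumes the bucket files sit in one directory and follow the xxx-yyy.fasta naming described above, and it assumes NCBI BLAST+ (blastp) as the aligner. The function name print_blast_commands and the .tsv output names are hypothetical. It only prints the commands, so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Dry run: print one all-against-all blastp command per bucket file.
# Assumptions (not from AlignBucket itself): buckets are named
# <min>-<max>.fasta, and NCBI BLAST+ provides the blastp command.
print_blast_commands() {
    dir="${1:-.}"
    for bucket in "$dir"/*-*.fasta; do
        [ -e "$bucket" ] || continue          # glob matched nothing
        name=$(basename "$bucket" .fasta)
        min=${name%%-*}                       # text before the first '-'
        max=${name#*-}                        # text after the first '-'
        # Keep only names where both parts are plain integers.
        case "$min" in ''|*[!0-9]*) continue ;; esac
        case "$max" in ''|*[!0-9]*) continue ;; esac
        # Align the bucket against itself; -outfmt 6 is tabular output.
        echo "blastp -query $bucket -subject $bucket -outfmt 6 -out $dir/$name.tsv"
    done
}

print_blast_commands .
```

Piping the printed lines to sh would execute them. For large buckets, a BLAST database built with makeblastdb may be preferable to -subject.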
For a different coverage ratio, pass the percentage as a second argument:
./alignbucket.sh <fasta file> <percentage>
For example, for 75% coverage use:
./alignbucket.sh sample3.fasta 75
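The percentage limits how different two sequence lengths can be while still reaching the required coverage. The sketch below illustrates this under an assumption that this page does not state: coverage is taken as the aligned length divided by the length of the longer sequence, so the shorter sequence must be at least that percentage of the longer one. The function name can_reach_coverage is hypothetical and is not part of AlignBucket.

```shell
#!/bin/sh
# can_reach_coverage LEN_A LEN_B PERCENT
# Succeeds (exit status 0) if the shorter length is at least PERCENT% of
# the longer one.
# Assumption: coverage = aligned length / length of the longer sequence,
# so the best possible coverage for a pair is short / long.
can_reach_coverage() {
    a=$1; b=$2; pct=$3
    if [ "$a" -le "$b" ]; then short=$a; long=$b; else short=$b; long=$a; fi
    # Integer form of short/long >= pct/100, avoiding floating point.
    [ $((short * 100)) -ge $((long * pct)) ]
}

can_reach_coverage 90 100 90 && echo "90 vs 100: worth aligning"
can_reach_coverage 50 100 90 || echo "50 vs 100: can be skipped"
```

Under this assumption, lowering the percentage widens the range of lengths that may need to be compared with each other.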
The program is released under the GNU General Public License version 2.