Saturday, January 24, 2009

How To Automate BLAST Search?

Blastcl3, the NCBI BLAST network client, connects remotely to the NCBI BLAST servers and runs automated BLAST searches. No web browser is required, you don't have to wait in front of the computer for the results, and you can run a "batch" search of many sequences against the databases in one go. The results from Network Blast are identical to those from the Web version (http://blast.ncbi.nlm.nih.gov/Blast.cgi), as long as identical parameters are used. I have tried and tested this.

Note on BLAST parameters for users who will rely on the Web version to choose the parameters for the Network version:



If you don't uncheck (untick) the option for short sequence queries on the Web version, the default algorithm parameters displayed on the input page will not be used (for example, E-value of 10, Blosum 62 matrix, Word size 2, etc). I think this is a bug: it makes no sense to show default parameters that are not going to be used, especially when the short queries option is checked by default. Ideally, the short queries option would be unchecked by default, and if the user checks it, the resulting default parameters would be shown.

So, if you don't plan to use the short queries option, just uncheck it and the parameters displayed will be used for the search; you can change them if you like. If you are happy with what you get on the results page, then take note of the parameters and then set them for use by the Network Blast.

Now, for those who plan to use the short queries option, do the following:

1. Don't bother modifying the default algorithm parameters; just click on the submit button.
2. On the results page, click on the "Edit and Resubmit" button to find out which algorithm parameters were used for the search. The parameters shown on the resulting page are the exact values that were used. If you are happy with the blast result, you can stick with these parameters; otherwise, you are free to modify them. If you do modify them, make sure you uncheck the short queries option so that your changes take effect; otherwise, the short-query defaults displayed on that page will still be used. Once you are satisfied with your final set of parameters, take note of them and then set them for use by the Network Blast.
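Once the parameters are noted, they have to be turned into blastcl3 options. The sketch below shows one way to do that for three of them, using the -e, -M and -W options listed in the manual at the end of this post. The function name blast_flags is made up for illustration and is not part of Network Blast.

```shell
# blast_flags EVALUE MATRIX WORDSIZE   (hypothetical helper)
# Prints the matching blastcl3 options: -e (expectation value),
# -M (matrix) and -W (word size).
blast_flags() {
  printf '%s %s %s %s %s %s\n' -e "$1" -M "$2" -W "$3"
}
```

It could then be used as, for example, blastcl3 -p blastp -d nr -i xx1 $(blast_flags 10 BLOSUM62 2) -o xx1.out, so that the values copied from the Web page live in one place.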





== Where to get Network Blast ==

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/
File: netblast-2.2.17-ia32-freebsd.tar.gz 2334 KB 8/26/2007 4:09:00 PM
File: netblast-2.2.17-ia32-linux.tar.gz 2758 KB 8/26/2007 4:13:00 PM
File: netblast-2.2.17-ia32-solaris.tar.gz 2416 KB 8/27/2007 8:10:00 PM
File: netblast-2.2.17-ia32-solaris9.tar.gz 2455 KB 8/26/2007 4:15:00 PM
File: netblast-2.2.17-ia32-win32.exe 1479 KB 8/22/2007 9:40:00 PM
File: netblast-2.2.17-ia64-linux.tar.gz 5319 KB 11/1/2007 12:23:00 PM
File: netblast-2.2.17-mips64-irix.tar.gz 4340 KB 8/26/2007 4:17:00 PM
File: netblast-2.2.17-sparc64-solaris.tar.gz 4448 KB 8/26/2007 4:41:00 PM
File: netblast-2.2.17-universal-macosx.tar.gz 4398 KB 8/26/2007 4:25:00 PM
File: netblast-2.2.17-x64-linux.tar.gz 2834 KB 8/26/2007 4:13:00 PM
File: netblast-2.2.17-x64-solaris.tar.gz 3287 KB 8/27/2007 8:27:00 PM
File: netblast-2.2.17-x64-win64.exe 1686 KB 8/19/2007 4:29:00 AM

For other methods, see http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.pdf
and http://www.ncbi.nlm.nih.gov/blast/blast_overview.shtml
and the query.fcgi pages, which describe how to build requests to the various NCBI query services, e.g. to automate the retrieval of GI numbers in FASTA or other formats.

== How to run Network Blast in Linux OS (e.g. BioSlax) ==

tar -zxvf netblast-2.2.17-ia32-linux.tar.gz


* The following will be untarred (unzipped): VERSION bin/ data/ doc/
* The blastcl3 executable is found in the bin folder. Either add that folder to your PATH or call blastcl3 with its full path.
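Which archive to unpack depends on the machine. As a rough sketch, the Linux file names from the listing above can be selected from the output of uname -m. The function name netblast_archive and the mapping from uname -m values to archives are my assumptions, so check them against your own system.

```shell
# netblast_archive MACHINE   (hypothetical helper)
# Prints the Linux netblast 2.2.17 archive name for a `uname -m` value.
netblast_archive() {
  case "$1" in
    x86_64) echo "netblast-2.2.17-x64-linux.tar.gz" ;;
    i?86)   echo "netblast-2.2.17-ia32-linux.tar.gz" ;;
    ia64)   echo "netblast-2.2.17-ia64-linux.tar.gz" ;;
    *)      echo "no Linux build listed for: $1" >&2; return 1 ;;
  esac
}
```

For example, tar -zxvf "$(netblast_archive "$(uname -m)")" would unpack the matching archive, provided it has already been downloaded into the current folder.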

== Example of how to run Network Blast ==

1. Split the input fasta file (containing multiple sequences) into multiple files, with one sequence per file, using the command csplit. For example:

csplit -n 1 multiple.fasta /\>/ \{\*\}


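The above command numbers the output files (xx0, xx1, ...) without zero padding. With GNU csplit, the first line of the input already matches the pattern, so xx0 is created empty and can be ignored or deleted.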

* Example of content of the input fasta file:
>1
ESWILRNPGYALVA
>2
GCGLFGKGSIDTCA

..etc..

* Examples of output files:

file name: xx1
content of xx1:
>1
ESWILRNPGYALVA

file name: xx2
content of xx2:
>2
GCGLFGKGSIDTCA

.
.
.

file name: xx10
content of xx10:
>10
GCGLFGKGSIDTCA

and so on to the last sequence ...
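If csplit is not available, or the empty xx0 piece it can produce is unwanted, the same split can be sketched with awk. This is an alternative to the csplit command above and not the method used in this post. The function name split_fasta is made up for illustration.

```shell
# split_fasta INPUT PREFIX   (hypothetical helper)
# Writes one file per sequence: PREFIX1, PREFIX2, ... (no zero padding).
# Lines before the first ">" header are skipped.
split_fasta() {
  awk -v prefix="$2" '
    /^>/      { if (out != "") close(out); n++; out = prefix n }
    out != "" { print > out }
  ' "$1"
}
```

Calling split_fasta multiple.fasta xx gives the same xx1, xx2, ... names as in the examples above.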


2. Create an executable file containing the following code:

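#!/bin/sh
# Submit each split FASTA file (xx1, xx2, ...) to NCBI, one search at a time.
for i in xx*
do
  case "$i" in *.out) continue ;; esac   # skip results from an earlier run
  [ -s "$i" ] || continue                # skip empty pieces such as xx0
  blastcl3 -p blastp -d nr -i "$i" -F F -e 200000 -I T -v 0 -b 20000 \
    -f 11 -W 2 -T F -C F -M PAM30 \
    -u "Root[ORGN] NOT txid11082[Organism:exp] NOT txid81077[ORGN]" \
    -o "$i.out"
  sleep 10
done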


'''NOTE:'''
* xx is the prefix of the input file names, for example xx1, xx2, etc.
* sleep 10 pauses for 10 seconds between the end of one submission and the start of the next. A longer pause is recommended so as not to overload the NCBI server; otherwise, they may blacklist your IP.
* -p blastp selects the blastp program.
* -d nr selects the NR database.
* -i gives the input (query) file name.
* -F F switches off the low-complexity and other filters.
* -e 200000 sets the expectation value (E).
* -I T shows the GI numbers in the definition lines.
* -v 0 sets the number of database sequences shown as one-line descriptions to zero.
* -b 20000 shows alignments for up to 20000 database sequences.
* -f 11 sets the threshold for extending hits.
* -W 2 sets the word size for short sequence queries; if set to zero, the default is used (blastn 11, megablast 28, all others 3, and 2 for short queries).
* -T F produces text output instead of HTML.
* -u "Root[ORGN] NOT txid11082[Organism:exp] NOT txid81077[ORGN]" restricts the database search to the results of an Entrez2 lookup.
* -C F turns off composition-based statistics for blastp or tblastn (0 or F or f: no composition-based statistics).
* -M PAM30 selects the matrix for short sequence queries (BLOSUM62 is used for normal protein searches).
* -o gives the output file name.
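A long batch can be interrupted part-way, and resubmitting everything puts extra load on the NCBI server. The sketch below lists only the queries that still lack a result, assuming the xx and .out naming used above. The function name pending_queries is made up for illustration.

```shell
# pending_queries   (hypothetical helper)
# Prints the xx* query files in the current folder that have no
# non-empty .out file yet.
pending_queries() {
  for f in xx*; do
    case "$f" in *.out) continue ;; esac   # result files are not queries
    [ -s "$f" ] || continue                # skip empty or missing pieces
    [ -s "$f.out" ] && continue            # already has a result
    printf '%s\n' "$f"
  done
}
```

Replacing the xx* in the for loop with $(pending_queries) would then resume the batch where it stopped.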


3. Change the mode of the file to make it executable. For example:

chmod 755 x

* x refers to the name of the executable file.
* For more information, see [http://en.wikipedia.org/wiki/Chmod Chmod] and [http://www.xaviermedia.com/documents/chmod755.php Chmod755].
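To see what mode 755 actually sets, the following sketch applies it to a throwaway file (created with mktemp as a stand-in for x) and prints the resulting permission string.

```shell
tmp=$(mktemp)                          # throwaway file standing in for x
printf '#!/bin/sh\necho ok\n' > "$tmp"
chmod 755 "$tmp"                       # rwx for owner, r-x for group and others
perms=$(ls -l "$tmp" | cut -c1-10)     # first 10 characters: type + mode bits
echo "$perms"                          # prints -rwxr-xr-x
rm -f "$tmp"
```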


4. Run the executable file. For example:

./x

* x refers to the name of the executable file

A task can usually be started and run as a background task by putting a '&' at the end of the command line. For example:

./x &


If a task was started and is running in the foreground, it is still possible to move it to the background without cancelling it. To move a task from the foreground to the background perform the following steps:

1. CTRL-Z (That is, while holding the CTRL key down, tap the 'z' key) This will suspend the current foreground job (task).
2. Enter the job control command 'bg'
3. Tap the 'Enter' key

The job is now running in the background.

Useful commands to see which jobs are still running are 'jobs' and 'ps ua'. If the 'jobs' command is used, a background job can be brought to the foreground with the command fg n, where n is the job number (not the PID).
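Background jobs can also be followed from inside a script. In the minimal sketch below, sleep 1 stands in for ./x so that it is harmless to try. $! holds the PID of the most recent background job, kill -0 only checks that the process still exists, and wait blocks until it finishes.

```shell
sleep 1 &                 # stand-in for ./x, started in the background
pid=$!                    # PID of that background job
if kill -0 "$pid" 2>/dev/null; then
  status=running
else
  status=finished
fi
echo "job $pid is $status"
wait "$pid"               # block until the job completes
rc=$?                     # exit status of the background job
echo "job $pid exited with status $rc"
```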

== Manual pages of Network Blast ==

blastcl3 2.2.17 arguments:

-p Program Name [String]
-d Database [String]
default = nr
-i Query File [File In]
default = stdin
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
range from 0 to 11
-o BLAST report Output File [File Out] Optional
default = stdout
-F Filter query sequence (DUST with blastn, SEG with others) [String]
default = T
-G Cost to open a gap (-1 invokes default behavior) [Integer]
default = -1
-E Cost to extend a gap (-1 invokes default behavior) [Integer]
default = -1
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
default = 0
-I Show GI's in deflines [T/F]
default = F
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
default = 1
-v Number of database sequences to show one-line descriptions for (V) [Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-f Threshold for extending hits, default if zero
blastp 11, blastn 0, blastx 12, tblastn 13
tblastx 13, megablast 0 [Real]
default = 0
-g Perform gapped alignment (not available with tblastx) [T/F]
default = T
-Q Query Genetic code to use [Integer]
default = 1
-D DB Genetic code (for tblast[nx] only) [Integer]
default = 1
-a Number of processors to use [Integer]
default = 1
-O SeqAlign file [File Out] Optional
-J Believe the query defline [T/F]
default = F
-M Matrix [String]
default = BLOSUM62
-W Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer]
default = 0
-z Effective length of the database (use zero for the real size) [Real]
default = 0
-K Number of best hits from a region to keep (off by default, if used a value of 100 is recommended) [Integer]
default = 0
-P 0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
default = 0
-Y Effective length of the search space (use zero for the real size) [Real]
default = 0
-S Query strands to search against database (for blast[nx], and tblastx)
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output [T/F]
default = F
-u Restrict search of database to results of Entrez2 lookup [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
-y X dropoff value for ungapped extensions in bits (0.0 invokes default behavior)
blastn 20, megablast 10, all others 7 [Real]
default = 0.0
-Z X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
blastn/megablast 50, tblastx 0, all others 25 [Integer]
default = 0
-R RPS Blast search [T/F]
default = F
-n MegaBlast search [T/F]
default = F
-L Location on query sequence [String] Optional
-A Multiple Hits window size, default if zero (blastn/megablast 0, all others 40 [Integer]
default = 0
-w Frame shift penalty (OOF algorithm for blastx) [Integer]
default = 0
-t Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments. (0 invokes default behavior; a negative value disables linking.) [Integer]
default = 0
-C Use composition-based statistics for blastp or tblastn:
As first character:
D or d: default (equivalent to T)
0 or F or f: no composition-based statistics
1 or T or t: Composition-based statistics as in NAR 29:2994-3005, 2001
2: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, unconditionally
For programs other than tblastn, must either be absent or be D, F or 0.
As second character, if first character is equivalent to 1, 2, or 3:
U or u: unified p-value combining alignment p-value and compositional p-value in round 1 only
[String]
default = D
-s Compute locally optimal Smith-Waterman alignments (This option is only
available for gapped tblastn.) [T/F]
default = F
