
Getting started - BLAST

BLAST (Basic Local Alignment Search Tool) is a tool used in bioinformatics to find regions of local similarity between sequences. BLAST is a software package that contains several different tools that search existing databases given an input of nucleotides or a protein. The exact details of BLAST and how to run and use the software are beyond the scope of this documentation. The examples that follow are to illustrate how to use the program with a cluster scheduling system and a variety of scripting techniques to streamline program operation.

Note: The example inputs in this section are taken from the tcoffee package, which has many different kinds of inputs in a variety of formats.

Note: The scripts in this section are primarily examples. They are not a "best method" for running a particular program in all cases; instead, they are meant to showcase different scripting and job setup techniques. In general, the best submission scripts are those where the process is standardized and organized to a degree that you seldom (if ever) have to change the script. You are encouraged to write scripts that suit your particular style and preferences.

Here we have already set up a blast_test directory with an input file ready to go.


[jdpoisso@umms-amino ~]$ cd blast_test/ 
[jdpoisso@umms-amino blast_test]$ ls 
[jdpoisso@umms-amino blast_test]$

Let's say you wanted to run this without using the scheduling system. Your command might be something like this:

/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr

Note: Depending on how paths are set up on your cluster system or in your personal settings, you may not need to use the full path "/opt/bio/ncbi/bin/blastall". Also, your system may have multiple versions of a program installed, and only one of them may be the default at a given time. Be aware of these factors when writing your script, and be sure to run the correct version if your input is version sensitive.
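One way to see which copy of a program a script would actually run is to check the full path first and fall back to the shell's PATH lookup. The following is a minimal sketch; the helper name resolve_tool is hypothetical and not part of BLAST or the scheduling system.

```shell
#!/bin/bash
# Print the program that would run: the given full path if it is executable,
# otherwise whatever the shell finds on PATH under the same name.
# Returns non-zero if neither exists. (resolve_tool is a hypothetical helper.)
resolve_tool() {
    local full_path="$1"
    local name
    name="$(basename "$full_path")"
    if [ -x "$full_path" ]; then
        printf '%s\n' "$full_path"
    else
        command -v "$name"
    fi
}

if BLASTALL="$(resolve_tool /opt/bio/ncbi/bin/blastall)"; then
    echo "using ${BLASTALL}"
else
    echo "blastall not found; check your PATH" >&2
fi
```

Running this on a login node before submitting shows whether the full path in your script matches what is installed.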

Submitting that exact command to the scheduling system means writing it into a script. Here is a script that will run that exact command through the scheduling system:

	/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr

As you can see, only a small amount of setup is required to run a command. The first line identifies the file as a script and specifies which shell (for our purposes, the language of the script) to use. The second line changes the directory using the ${PBS_O_WORKDIR} environment variable. As previously mentioned, the scheduling system may set environment variables for a job. The ${PBS_O_WORKDIR} variable is set to the directory from which the job is submitted, our submission directory. So if we submit the job from our blast_test directory, the variable is set to "blast_test" (actually it's the absolute path, /home/jdpoisso/blast_test). This is necessary because, by default, the scheduling system starts your job in your home directory, which may not contain your input (or worse, may contain different input). The job would then either fail to find its input and end, or run on the wrong data.
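Put together, a script matching this description might look like the following sketch. The choice of bash as the shell and the file name blast_job.sh are assumptions made for illustration; the blastall command is the one shown earlier. The script is written out with a here-document so its syntax can be checked before it is handed to the scheduler.

```shell
#!/bin/bash
# Write the job script to a file (blast_job.sh is a hypothetical name).
cat > blast_job.sh <<'EOF'
#!/bin/bash
cd "${PBS_O_WORKDIR}" || exit 1   # start in the submission directory, or stop
/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr
EOF

# Check the syntax without running anything.
# The script would then be submitted with: qsub blast_job.sh
bash -n blast_job.sh && echo "blast_job.sh: syntax OK"
```

The "|| exit 1" after cd is an addition: it stops the job if the submission directory cannot be entered, rather than running BLAST in the wrong place.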

Example:

[jdpoisso@umms-amino blast_test]$ qsub 
[jdpoisso@umms-amino blast_test]$ qstat 1231700 
Job id                    Name             User            Time Use S Queue 
------------------------- ---------------- --------------- -------- - ----- 
1231700.umms-amino         jdpoisso        00:03:20 R default
<----job finishes---->
[jdpoisso@umms-amino blast_test]$ ls  sv.fasta 
[jdpoisso@umms-amino blast_test]$

After the script is submitted, the job runs to completion and its output is placed in the submission directory. Those familiar with BLAST may know that when the specified command is run without the scheduling system, all the results are printed to the screen and not saved to a file. Under the scheduling system, the results normally printed to the screen are not lost: as mentioned in a previous section, the scheduling system captures the standard output and saves it into a file, which is then delivered back to your submission directory.
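On Torque/PBS systems the captured standard output file is, by default, named after the job name (normally the script file name) followed by ".o" and the numeric part of the job id. Sites can change this, so treat the sketch below as the usual default rather than a guarantee. The helper name pbs_stdout_name and the script name blast_job.sh are hypothetical.

```shell
#!/bin/bash
# Build the default name of the captured standard output file.
# $1 = job name (by default the script file name)
# $2 = job id as shown by qstat, for example 1231700.umms-amino
pbs_stdout_name() {
    # ${2%%.*} keeps only the part of the job id before the first dot
    printf '%s.o%s\n' "$1" "${2%%.*}"
}

pbs_stdout_name blast_job.sh 1231700.umms-amino   # prints blast_job.sh.o1231700
```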

What if your job, instead of writing its results to the screen, writes them out to a file? Or multiple files? In this case, there may be nothing in the captured output file. Instead, your data may be in the files you specified, or in files chosen by the program.

[jdpoisso@umms-amino blast_test]$ qsub 
[jdpoisso@umms-amino blast_test]$ qstat 1231986 
Job id                    Name             User            Time Use S Queue 
------------------------- ---------------- --------------- -------- - ----- 
1231986.umms-amino         jdpoisso        00:00:36 R default        
[jdpoisso@umms-amino blast_test]$ ls 
blast.out  blast.seq  sv.fasta 
[jdpoisso@umms-amino blast_test]$

The command in the script has been changed to produce two output files instead of printing anything to the screen. Both files have been written to our submission directory, as our script still explicitly changes to our submission directory.

In this example, the files are small and manageable. However, what if your results are large files? Or your program uses megabytes or gigabytes of temporary storage while it runs? There are many ways your cluster system could be configured to handle these situations. Some systems may have a high-speed shared space that provides the necessary performance to handle many concurrent jobs of this type. In most cases, though, you will want to copy your job's data to a local scratch space.

Accomplishing this is quite simple. Assuming, as in the previous example, that all the necessary data is in the submission directory, we can modify our script to copy out and stage our data into a local scratch space (/tmp):



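#!/bin/bash
# per-job scratch directory under /tmp (this path layout is an assumption; adjust for your site)
LOCAL_WORKDIR=/tmp/${USER}/${PBS_JOBID}

# setup and copy workdir
cd "${PBS_O_WORKDIR}"
mkdir -p "${LOCAL_WORKDIR}"
cp -r * "${LOCAL_WORKDIR}"
cd "${LOCAL_WORKDIR}"

/opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta -d /library/yzhang/nr/nr -o blast.out -O blast.seq

# copy back data
cp -r * "${PBS_O_WORKDIR}"

# cleanup
cd "${PBS_O_WORKDIR}"
rm -rf "${LOCAL_WORKDIR}"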

Example:

[jdpoisso@umms-amino blast_test]$ qsub 
[jdpoisso@umms-amino blast_test]$ qstat 1382907 
Job id                    Name             User            Time Use S Queue 
------------------------- ---------------- --------------- -------- - ----- 
1382907.umms-amino         jdpoisso        00:04:10 R default      
[jdpoisso@umms-amino blast_test]$ ls  sv.fasta 
<----job finishes---->
[jdpoisso@umms-amino blast_test]$ ls
blast.out  blast.seq  sv.fasta 
[jdpoisso@umms-amino blast_test]$

Note: The example script here does not take into account any potential caveats that could occur due to copying files back and forth, such as running out of disk space, or failing to copy files back to their origin. A modified version of the script above that includes checks for these factors is included in the scripting appendix. (Draft Note - Write a scripting appendix)
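As an illustration of the kind of checks meant here, the sketch below wraps the staging steps in a function that stops if the copy to scratch fails (for example because the disk is full) and keeps the scratch copy if the results cannot be copied back. The function name run_staged, the SUBMIT_DIR variable, and the scratch path layout are assumptions for illustration. The fallbacks for unset PBS variables only exist so the logic can be tried outside the scheduling system.

```shell
#!/bin/bash
# Sketch: stage a job through local scratch space with basic error checks.
# run_staged, SUBMIT_DIR and the scratch path layout are hypothetical.
SUBMIT_DIR="${PBS_O_WORKDIR:-$PWD}"
LOCAL_WORKDIR="/tmp/${USER:-user}/${PBS_JOBID:-manual.$$}"

# Usage: run_staged command [args...]
run_staged() {
    local status
    mkdir -p "$LOCAL_WORKDIR" || { echo "cannot create $LOCAL_WORKDIR" >&2; return 1; }

    # Copy the inputs; stop before running anything if the copy fails.
    if ! cp -r "$SUBMIT_DIR"/. "$LOCAL_WORKDIR"/; then
        echo "copy to scratch failed (is the disk full?)" >&2
        rm -rf "$LOCAL_WORKDIR"
        return 1
    fi

    # Run the command inside the scratch directory and remember its exit status.
    ( cd "$LOCAL_WORKDIR" && "$@" )
    status=$?

    # Copy everything back; only remove the scratch copy if that succeeded.
    if cp -r "$LOCAL_WORKDIR"/. "$SUBMIT_DIR"/; then
        rm -rf "$LOCAL_WORKDIR"
    else
        echo "copy back failed; results kept in $LOCAL_WORKDIR" >&2
        return 1
    fi
    return "$status"
}

# In a job script this might be called as:
# run_staged /opt/bio/ncbi/bin/blastall -p blastp -i sv.fasta \
#     -d /library/yzhang/nr/nr -o blast.out -O blast.seq
```

Returning the command's own exit status lets the scheduling system still see a failed BLAST run as a failed job, even though the copying succeeded.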

As you can see, the job can be submitted and run without any change appearing in the submission directory. This is because the relevant data (in this case the sv.fasta file) was copied to a local scratch space on whatever node the program was assigned to. Once the program completed, the data was copied back, and all the results appeared in your submission directory.

Using these examples, you have a simple framework and knowledge base you can use to submit jobs to the cluster. You should be able to submit jobs and have them run. However, you may notice certain problems when running your jobs using modified versions of these examples. Your jobs may end prematurely. They may not thread or distribute properly. They may run extremely slowly. They may crash for lack of memory or disk space. This is because so far there has been no discussion of how to request resources.
