[Optional] Using Singularity to run BLAST+


Teaching: 30 min
Exercises: 30 min
  • How can I use Singularity to run bioinformatics workflows with BLAST+?

  • Show example of using Singularity with a common bioinformatics tool.

We have now learned enough to use Singularity to deploy software without needing to install the software itself on the host system.

In this section we will demonstrate the use of a Singularity container image that provides the BLAST+ software.

Source material

This example is based on the example from the official NCBI BLAST+ Docker container documentation. Note: the efetch parts of the step-by-step guide do not currently work using the Singularity version of the image, so we provide a dataset with the data already downloaded.

(This is because the NCBI BLAST+ Docker container image has the efetch tool installed in the /root directory and this special location gets overwritten during the conversion to a Singularity container image.)

Download the required data

Download the blast_example.tar.gz archive.

Unpack the archive which contains the downloaded data required for the BLAST+ example:

tar -xvf blast_example.tar.gz
x blast/
x blast/blastdb/
x blast/queries/
x blast/fasta/
x blast/results/
x blast/blastdb_custom/
x blast/fasta/nurse-shark-proteins.fsa
x blast/queries/P01349.fsa

Finally, move into the newly created directory:

cd blast
ls
blastdb        blastdb_custom fasta          queries        results
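Before going further, it can be worth confirming that the archive unpacked as expected. The following is a minimal sketch that checks for the two input files listed in the tar output above; the function name check_inputs is our own invention and not part of the lesson materials.

```shell
# Hypothetical helper: report any example input files that are missing.
# $1 is the directory to check (defaults to the current directory).
check_inputs() {
    dir="${1:-.}"
    missing=0
    for f in fasta/nurse-shark-proteins.fsa queries/P01349.fsa; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Return 0 only if every file was found.
    return "$missing"
}

# Usage, from inside the blast directory:
#   check_inputs . && echo "all inputs present"
```

The function prints nothing and returns success when both files are present, so it can be used as a guard at the top of a script.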

Create the Singularity container image

NCBI provide official Docker containers with the BLAST+ software hosted on Docker Hub. We can create a Singularity container image from the Docker container image with:

singularity pull ncbi-blast.sif docker://ncbi/blast
INFO:    Creating SIF file...

Now that we have a container image with the software in it, we can use it.
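If you rerun the lesson, you may not want to pull the image again each time. The sketch below pulls the image only when the file is not already present and then asks BLAST+ for its version as a quick check that the container works. The names IMAGE and ensure_image are our own; the singularity commands are only attempted if Singularity is found on the system.

```shell
# IMAGE is our own variable name for the image file used in this lesson.
IMAGE="ncbi-blast.sif"

# Hypothetical helper: pull an image only if the file does not exist yet.
# $1: image file, $2: source URI
ensure_image() {
    if [ -f "$1" ]; then
        echo "using existing image: $1"
    else
        singularity pull "$1" "$2"
    fi
}

# Only attempt this where Singularity is actually installed.
if command -v singularity >/dev/null 2>&1; then
    ensure_image "$IMAGE" docker://ncbi/blast
    # Print the version of blastp provided by the container.
    singularity exec "$IMAGE" blastp -version
fi
```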

Build and verify the BLAST database

Our example dataset already contains the query and database sequences. We first use these data to create a custom BLAST database by using a container to run the command makeblastdb with the correct options.

singularity exec ncbi-blast.sif \
    makeblastdb -in fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

Building a new DB, current time: 06/16/2023 14:35:07
New DB name:   /home/auser/test/blast/blast/nurse-shark-proteins
New DB title:  Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.0199499 seconds.

To verify the newly created BLAST database, you can run the blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T" command to display the accession, sequence length, and taxonomy ID of each sequence in the database.

singularity exec ncbi-blast.sif \
    blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
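A further check is to compare the number of records in the FASTA file with the number of entries in the database. The following is a minimal sketch of that comparison, assuming it is run from the blast directory after makeblastdb has finished; count_fasta is a helper name we made up for illustration.

```shell
# Hypothetical helper: count the sequences in a FASTA file.
# Each record starts with a header line beginning with '>'.
count_fasta() {
    grep -c '^>' "$1"
}

# Only run the comparison where Singularity and the image are available.
if command -v singularity >/dev/null 2>&1 && [ -f ncbi-blast.sif ]; then
    n_fasta=$(count_fasta fasta/nurse-shark-proteins.fsa)
    # One accession is printed per database entry, so count the lines.
    n_db=$(singularity exec ncbi-blast.sif \
        blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a" | wc -l)
    echo "FASTA records: $n_fasta, database entries: $n_db"
fi
```

For the example data the two numbers should agree with the 7 sequences reported by makeblastdb.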

Now that we have our database, we can run queries against it.

Run a query against the BLAST database

Let's execute a query on our database using the blastp command:

singularity exec ncbi-blast.sif \
    blastp -query queries/P01349.fsa -db nurse-shark-proteins \
    -out results/blastp.out

At this point, you should see the results of the query in the output file results/blastp.out. To view the content of this output file, use the command less results/blastp.out.

less results/blastp.out
...output trimmed...

Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName:
Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName...  14.2    0.96

>P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName: Full=Liver-type
fatty acid-binding protein; Short=L-FABP

...output trimmed...

With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.
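The default report is convenient to read but awkward to process in scripts. BLAST+ can also write tab-separated output with -outfmt 6, where column 2 is the subject ID, column 11 the E-value and column 12 the bit score. The sketch below reruns the query in that format and keeps only hits at or below a chosen E-value; the output file name results/blastp.tsv, the threshold of 1.0 and the helper filter_hits are our own choices for illustration.

```shell
# Hypothetical helper: filter BLAST tabular output (-outfmt 6) by E-value.
# $1: maximum E-value, $2: tabular results file
# Prints subject ID, E-value and bit score (columns 2, 11 and 12).
filter_hits() {
    awk -F '\t' -v max="$1" '$11 + 0 <= max + 0 { print $2, $11, $12 }' "$2"
}

# Only run the query where Singularity and the image are available.
if command -v singularity >/dev/null 2>&1 && [ -f ncbi-blast.sif ]; then
    singularity exec ncbi-blast.sif \
        blastp -query queries/P01349.fsa -db nurse-shark-proteins \
        -outfmt 6 -out results/blastp.tsv
    filter_hits 1.0 results/blastp.tsv
fi
```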

Accessing online BLAST databases

As well as building your own local database to query, you can also access databases that are available online. For example, to see which databases are available online in the Google Cloud Platform (GCP):

singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source gcp
Connected to GCP
BLASTDB                                                      DESCRIPTION                                                                                                              SIZE (GB)      LAST_UPDATED
nr                                                           All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects        369.4824      2023-06-10
swissprot                                                    Non-redundant UniProtKB/SwissProt sequences                                                                                 0.3576      2023-06-10
refseq_protein                                               NCBI Protein Reference Sequences                                                                                          146.5088      2023-06-12
landmark                                                     Landmark database for SmartBLAST                                                                                            0.3817      2023-04-25
pdbaa                                                        PDB protein database                                                                                                        0.1967      2023-06-10
nt                                                           Nucleotide collection (nt)                                                                                                319.5044      2023-06-11
pdbnt                                                        PDB nucleotide database                                                                                                     0.0145      2023-06-09
patnt                                                        Nucleotide sequences derived from the Patent division of GenBank                                                           15.7342      2023-06-09
refseq_rna                                                   NCBI Transcript Reference Sequences                                                                                        47.8721      2023-06-12

...output trimmed...

Similarly, for databases hosted at NCBI:

singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source ncbi
Connected to NCBI
BLASTDB                                                      DESCRIPTION                                                                                                              SIZE (GB)      LAST_UPDATED
env_nr                                                       Proteins from WGS metagenomic projects (env_nr).                                                                            3.9459      2023-06-11
SSU_eukaryote_rRNA                                           Small subunit ribosomal nucleic acid for Eukaryotes                                                                         0.0063      2022-12-05
LSU_prokaryote_rRNA                                          Large subunit ribosomal nucleic acid for Prokaryotes                                                                        0.0041      2022-12-05
16S_ribosomal_RNA                                            16S ribosomal RNA (Bacteria and Archaea type strains)                                                                       0.0178      2023-06-16
env_nt                                                       environmental samples                                                                                                      48.8599      2023-06-08
LSU_eukaryote_rRNA                                           Large subunit ribosomal nucleic acid for Eukaryotes                                                                         0.0053      2022-12-05
ITS_RefSeq_Fungi                                             Internal transcribed spacer region (ITS) from Fungi type and reference material                                             0.0067      2022-10-28
Betacoronavirus                                              Betacoronavirus                                                                                                            55.3705      2023-06-16

...output trimmed...
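Some of these databases are hundreds of gigabytes in size, so it can help to pick out the small ones before deciding what to download. The following sketch filters the listing by size, assuming the "pretty" layout shown above, in which the size in GB is the second-to-last field on each row. The helper name small_dbs and the 1 GB limit are our own.

```shell
# Hypothetical helper: read `update_blastdb.pl --showall pretty` output on
# stdin and print the name and size of databases no larger than $1 GB.
# Header, blank and trimmed lines are skipped because their second-to-last
# field is missing or not numeric.
small_dbs() {
    awk -v max="$1" '
        NF >= 3 && $(NF-1) ~ /^[0-9]+(\.[0-9]+)?$/ && $(NF-1) + 0 <= max + 0 {
            print $1, $(NF-1)
        }'
}

# Only query the listing where Singularity and the image are available
# (this step also needs network access).
if command -v singularity >/dev/null 2>&1 && [ -f ncbi-blast.sif ]; then
    singularity exec ncbi-blast.sif \
        update_blastdb.pl --showall pretty --source gcp | small_dbs 1
fi
```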


You have now completed a simple example of using a complex piece of bioinformatics software through Singularity containers. You may have noticed that some things just worked without you needing to set them up even though you were running using containers:

  1. We did not need to explicitly bind any files/directories into the container. This worked because Singularity automatically binds the current directory into the running container, so any data in the current directory (or its subdirectories) will generally be available in running Singularity containers. (If you have used Docker containers, you will notice that this is different from the default behaviour there.)
  2. Access to the internet is automatically available within the running container in the same way as it is on the host system, without us needing to specify any additional options.
  3. Files and data we create within the container have the right ownership and permissions for us to access outside the container.
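If your data lives outside the current directory, the automatic binding described in the first point may not cover it, and you can bind a directory explicitly with the --bind option of singularity exec. The sketch below wraps this in a helper of our own, run_with_data, which makes a host directory visible at /mnt inside the container. The choice of /mnt as the mount point and the SINGULARITY variable (used here only to substitute echo for a dry run) are assumptions of this example, and whether a given bind works can depend on how Singularity is configured on your system.

```shell
# Hypothetical helper: run a command in the container with a host directory
# bound to /mnt. $1: host directory, remaining arguments: command to run.
# Set SINGULARITY=echo to print the command instead of running it.
run_with_data() {
    host_dir=$1
    shift
    ${SINGULARITY:-singularity} exec --bind "$host_dir:/mnt" ncbi-blast.sif "$@"
}

# Dry run in a subshell: show the command that would be executed.
( SINGULARITY=echo; run_with_data "$PWD" ls /mnt )
```

Without the SINGULARITY=echo override, the same call would list the contents of the chosen host directory from inside the container.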

In addition, we were able to use the tools in the container image provided by NCBI without having to do any work to install the software, irrespective of the computing platform that we are using. (In fact, the example this is based on runs the pipeline using Docker on a cloud computing platform rather than on your local system.)

Key Points

  • We can use containers to run software without having to install it

  • The commands we use are very similar to those we would use natively

  • Singularity handles a lot of complexity around data and internet access for us