Using Singularity to run BLAST+
Overview
Teaching: 30 min
Exercises: 30 min
Questions
How can I use Singularity to run bioinformatics workflows with BLAST+?
Objectives
Show an example of using Singularity with a common bioinformatics tool.
We have now learned enough to be able to use Singularity to deploy software without needing to install the software itself on the host system.
In this section we will demonstrate the use of a Singularity container image that provides the BLAST+ software.
Source material
This example is based on the official NCBI BLAST+ Docker container documentation.
Note: the efetch parts of the step-by-step guide do not currently work with the Singularity version of the container image, so we provide a dataset with the data already downloaded. (This is because the NCBI BLAST+ Docker container image has the efetch tool installed in the /root directory, and this special location gets overwritten during the conversion to a Singularity container image.)
Download the required data
Download the blast_example.tar.gz archive, then unpack it. The archive contains the data required for the BLAST+ example:
remote$ wget https://epcced.github.io/2024-04-16_containers_bham/files/blast_example.tar.gz
remote$ tar -xvf blast_example.tar.gz
x blast/
x blast/blastdb/
x blast/queries/
x blast/fasta/
x blast/results/
x blast/blastdb_custom/
x blast/fasta/nurse-shark-proteins.fsa
x blast/queries/P01349.fsa
Finally, move into the newly created directory:
remote$ cd blast
remote$ ls
blastdb blastdb_custom fasta queries results
Create the Singularity container image
NCBI provides official Docker container images with the BLAST+ software on Docker Hub. We can create a Singularity container image from the Docker container image with:
remote$ singularity pull ncbi-blast.sif docker://ncbi/blast
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob f3b81f6693c5 done
Copying blob 9e3ea8720c6d done
Copying blob f1910abb61ed done
Copying blob 5ac33d4de47b done
Copying blob 8402427c8382 done
Copying blob 06add1a477bc done
Copying blob d9781f222125 done
Copying blob 4aae31cc8a8b done
Copying blob 6a61413c1ffa done
Copying blob c657bf8fc6ca done
Copying blob 1776e565f5f8 done
Copying blob d90474a0d8c8 done
Copying blob 0bc89cb1b9d7 done
Copying blob b8a272fccf13 done
Copying blob 891eb09f891f done
Copying blob 4c64befa8a35 done
Copying blob 7ab0b7afbc21 done
Copying blob b007c620c60b done
Copying blob f877ffc04713 done
Copying blob 6ee97c348001 done
Copying blob 03f0ee97190b done
Copying config 28914b3519 done
Writing manifest to image destination
Storing signatures
2023/06/16 08:26:53 info unpack layer: sha256:9e3ea8720c6de96cc9ad544dddc695a3ab73f5581c5d954e0504cc4f80fb5e5c
2023/06/16 08:26:53 info unpack layer: sha256:06add1a477bcffec8bac0529923aa8ae25d51f0660f0c8ef658e66aa89ac82c2
2023/06/16 08:26:53 info unpack layer: sha256:f3b81f6693c592ab94c8ebff2109dc60464d7220578331c39972407ef7b9e5ec
2023/06/16 08:26:53 info unpack layer: sha256:5ac33d4de47beb37ae35e9cad976d27afa514ab8cbc66e0e60c828a98e7531f4
2023/06/16 08:27:03 info unpack layer: sha256:8402427c8382ab723ac504155561fb6d3e5ea1e7b4f3deac8449cec9e44ae65a
2023/06/16 08:27:03 info unpack layer: sha256:f1910abb61edef8947e9b5556ec756fd989fa13f329ac503417728bf3b0bae5e
2023/06/16 08:27:03 info unpack layer: sha256:d9781f222125b5ad192d0df0b59570f75b797b2ab1dc0d82064c1b6cead04840
2023/06/16 08:27:03 info unpack layer: sha256:4aae31cc8a8b726dce085e4e2dc4671a9be28162b8d4e1b1c00b8754f14e6fe6
2023/06/16 08:27:03 info unpack layer: sha256:6a61413c1ffa309d92931265a5b0ecc9448568f13ccf3920e16aaacc8fdfc671
2023/06/16 08:27:03 info unpack layer: sha256:c657bf8fc6cae341e3835cb101dc4c6839ba4aad69578ff8538b3c1eba7abb21
2023/06/16 08:27:04 info unpack layer: sha256:1776e565f5f85562b8601edfd29c35f3fba76eb53177c8e89105f709387e3627
2023/06/16 08:27:04 info unpack layer: sha256:d90474a0d8c8e6165d909cc0ebbf97dbe70fd759a93eff11a5a3f91fa09a470e
2023/06/16 08:27:04 warn rootless{root/edirect/aux/lib/perl5/Mozilla/CA/cacert.pem} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04 warn rootless{root/edirect/aux/lib/perl5/Mozilla/CA.pm} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04 warn rootless{root/edirect/aux/lib/perl5/Mozilla/mk-ca-bundle.pl} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04 info unpack layer: sha256:0bc89cb1b9d7ca198a7a1b95258006560feffaff858509be8eb7388b315b9cf5
2023/06/16 08:27:04 info unpack layer: sha256:b8a272fccf13b721fa68826f17f0c2bb395de377e0d22c98d38748eb5957a4c6
2023/06/16 08:27:04 info unpack layer: sha256:891eb09f891ff2c26f24a5466112e134f6fb30bd3d0e78c14c0d676b0e68d60a
2023/06/16 08:27:04 info unpack layer: sha256:4c64befa8a35c9f8518324524dfc27966753462a4c07b2234811865387058bf4
2023/06/16 08:27:04 info unpack layer: sha256:7ab0b7afbc21b75697a7b8ed907ee9b81e5b17a04895dc6ff7d25ea2ba1eeba4
2023/06/16 08:27:04 info unpack layer: sha256:b007c620c60b91ce6a9e76584ecc4bc062c822822c204d8c2b1c8668193d44d1
2023/06/16 08:27:04 info unpack layer: sha256:f877ffc04713a03dffd995f540ee13b65f426b350cdc8c5f1e20c290de129571
2023/06/16 08:27:04 info unpack layer: sha256:6ee97c348001fca7c98e56f02b787ce5e91d8cc7af7c7f96810a9ecf4a833504
2023/06/16 08:27:04 info unpack layer: sha256:03f0ee97190baebded2f82136bad72239254175c567b19def105b755247b0193
INFO: Creating SIF file...
Now that we have a container image with the software in it, we can use it.
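Before moving on, it is worth checking that the tools inside the container image actually run. For example, you can ask blastp to report its version (the exact version you see will depend on when you pulled the image):
remote$ singularity exec ncbi-blast.sif blastp -version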
Build and verify the BLAST database
Our example dataset already contains the downloaded query and database sequences. We first
use these data to create a custom BLAST database by using the container to run
the makeblastdb command with the appropriate options:
remote$ singularity exec ncbi-blast.sif \
makeblastdb -in fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
Building a new DB, current time: 06/16/2023 14:35:07
New DB name: /home/auser/test/blast/blast/nurse-shark-proteins
New DB title: Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.0199499 seconds.
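The new database is written as a set of files in the current directory, which Singularity binds into the running container automatically. If you are curious, you can list them from the host (the exact set of file extensions depends on the BLAST+ version):
remote$ ls nurse-shark-proteins.*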
To verify the newly created BLAST database, you can run the
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
command to display the accession, sequence length and taxonomy ID of each sequence in the database.
remote$ singularity exec ncbi-blast.sif \
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
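Another way to summarise the database, not part of the original NCBI walkthrough, is blastdbcmd's -info option, which reports the database title, the number of sequences and the total length:
remote$ singularity exec ncbi-blast.sif \
    blastdbcmd -db nurse-shark-proteins -info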
Now that we have our database, we can run queries against it.
Run a query against the BLAST database
Let's execute a query on our database using the blastp command:
remote$ singularity exec ncbi-blast.sif \
blastp -query queries/P01349.fsa -db nurse-shark-proteins \
-out results/blastp.out
At this point, you should see the results of the query in the output file results/blastp.out.
To view the contents of this output file, use the command less results/blastp.out:
remote$ less results/blastp.out
...output trimmed...
Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName:
Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain
Length=44
Score E
Sequences producing significant alignments: (Bits) Value
P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName... 14.2 0.96
>P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName: Full=Liver-type
fatty acid-binding protein; Short=L-FABP
Length=132
...output trimmed...
With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.
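The default report format is quite verbose. If you prefer a compact, tab-separated summary of the hits, you could re-run the search with BLAST+'s tabular output format. This particular invocation is a sketch rather than part of the original example, but -outfmt 6 is a standard blastp option:
remote$ singularity exec ncbi-blast.sif \
    blastp -query queries/P01349.fsa -db nurse-shark-proteins \
    -outfmt 6 -out results/blastp.tsv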
Accessing online BLAST databases
As well as building your own local database to query, you can also access databases that are available online. For example, to see which databases are available online on the Google Cloud Platform (GCP):
remote$ singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source gcp
Connected to GCP
BLASTDB DESCRIPTION SIZE (GB) LAST_UPDATED
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects 369.4824 2023-06-10
swissprot Non-redundant UniProtKB/SwissProt sequences 0.3576 2023-06-10
refseq_protein NCBI Protein Reference Sequences 146.5088 2023-06-12
landmark Landmark database for SmartBLAST 0.3817 2023-04-25
pdbaa PDB protein database 0.1967 2023-06-10
nt Nucleotide collection (nt) 319.5044 2023-06-11
pdbnt PDB nucleotide database 0.0145 2023-06-09
patnt Nucleotide sequences derived from the Patent division of GenBank 15.7342 2023-06-09
refseq_rna NCBI Transcript Reference Sequences 47.8721 2023-06-12
...output trimmed...
Similarly, for databases hosted at NCBI:
remote$ singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source ncbi
Connected to NCBI
BLASTDB DESCRIPTION SIZE (GB) LAST_UPDATED
env_nr Proteins from WGS metagenomic projects (env_nr). 3.9459 2023-06-11
SSU_eukaryote_rRNA Small subunit ribosomal nucleic acid for Eukaryotes 0.0063 2022-12-05
LSU_prokaryote_rRNA Large subunit ribosomal nucleic acid for Prokaryotes 0.0041 2022-12-05
16S_ribosomal_RNA 16S ribosomal RNA (Bacteria and Archaea type strains) 0.0178 2023-06-16
env_nt environmental samples 48.8599 2023-06-08
LSU_eukaryote_rRNA Large subunit ribosomal nucleic acid for Eukaryotes 0.0053 2022-12-05
ITS_RefSeq_Fungi Internal transcribed spacer region (ITS) from Fungi type and reference material 0.0067 2022-10-28
Betacoronavirus Betacoronavirus 55.3705 2023-06-16
...output trimmed...
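The update_blastdb.pl script can also download one of these databases into the current directory, where Singularity will again bind it into the container for later searches. Downloads can be large and slow, so this step is optional; as a sketch, fetching one of the smaller NCBI databases listed above might look like this (--source and --decompress are standard update_blastdb.pl options):
remote$ singularity exec ncbi-blast.sif \
    update_blastdb.pl --source ncbi --decompress 16S_ribosomal_RNA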
Notes
You have now completed a simple example of using a complex piece of bioinformatics software through Singularity containers. You may have noticed that some things just worked without you needing to set them up, even though you were running the software in containers:
- We did not need to explicitly bind any files/directories into the container. This worked because Singularity automatically binds the current directory into the running container, so any data in the current directory (or its subdirectories) will generally be available in running Singularity containers. (If you have used Docker containers, you will notice that this is different from the default behaviour there; see the example of explicit binding after this list.)
- Access to the internet is automatically available within the running container in the same way as it is on the host system, without us needing to specify any additional options.
- Files and data we create within the container have the right ownership and permissions for us to access outside the container.
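If your data were somewhere that is not bound automatically (for example, a separate scratch filesystem), you would need to bind it into the container yourself with the --bind option. A minimal sketch, assuming a hypothetical host directory /scratch/auser/blastdb containing the database files:
remote$ singularity exec --bind /scratch/auser/blastdb:/data \
    ncbi-blast.sif blastdbcmd -db /data/nurse-shark-proteins -info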
In addition, we were able to use the tools in the container image provided by NCBI without having to do any work to install the software, irrespective of the computing platform we are using. (In fact, the example this is based on runs the pipeline using Docker on a cloud computing platform rather than on your local system.)
Key Points
We can use containers to run software without having to install it
The commands we use are very similar to those we would use natively
Singularity handles a lot of complexity around data and internet access for us