Using Docker containers on Biochemistry Linux Cluster

Summary

A short example of running software within a Docker container on a Linux cluster. Since the software is installed within the container, no separate installation is necessary. The computation can be scaled to accommodate large sample files.

Docker on Biochemistry Compute Cluster

The HTCondor scheduler software on the Biochemistry Compute Cluster (BCC) can run Docker containers. The same methods can also be applied on the larger compute cluster at the Center for High Throughput Computing (CHTC).

Docker containers are isolated, transient computing spaces that run within another (host) computer. Containers are created “on the fly” from template Docker images, which users can build and save in a “repository.” The main repository for Docker images is “Docker Hub,” where images shared by others are also available. A Docker image typically contains all the software and libraries needed to run independently of the host computer.

The example below is small and simple enough to run on a laptop, but much larger jobs can be accomplished with the same method on a Linux cluster.
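To try the image on a laptop before moving to the cluster, a small wrapper along these lines may help. It is only a sketch and not part of the original workflow: the function name `run_in_container`, the `/work` mount point, and the `DOCKER` variable are illustrative assumptions, and it presumes the image defines no entrypoint that would intercept the command. The image tag is the one used in the submit script in the appendix.

```shell
#!/bin/bash
# Image name and tag as used in the submit script in the appendix.
IMAGE="jysgro/bowtie2:sam115_bcf118"

# Run a command inside a throwaway container (--rm deletes it on exit).
# The current directory is mounted at /work (an arbitrary, assumed path)
# so that files written there remain after the container is gone.
# DOCKER can be overridden for a dry run: DOCKER=echo prints the
# docker arguments instead of executing them.
run_in_container() {
    "${DOCKER:-docker}" run --rm -v "$PWD":/work -w /work "$IMAGE" "$@"
}
```

For example, `run_in_container bowtie2 --version` should print the bowtie2 version installed in the image, provided Docker is installed and the image can be pulled.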

Example

Below is a quick example of how to use a Docker image for the bowtie2 software, which is used to map (align) DNA reads onto a genome. The example data are part of the bowtie2 software installation. The commands are inspired by the “live demo” available in the interactive console at the sandbox.bio site, in the “Sequence alignment with bowtie2” episode.

The example assumes access to a Linux cluster (see BCC or CHTC) and a basic understanding of running the HTCondor software.

Ingredients

1. user account
2. Docker image, found at my repository: jysgro/bowtie2
3. shell script to conduct computation (see appendix below)
4. shell script to submit job via HTCondor (see appendix below)

The Docker image contains the bowtie2 software and, in addition, the samtools and bcftools software needed to conduct all the computations proposed in the “live” Sequence alignment with bowtie2 demonstration on sandbox.bio.

1. Users on BCC should work within the /scratch directory.

2. The Docker image will be called in step 4 by the submit script

3. The script contains all the shell commands needed to complete the computations without any manual intervention. The example uses the genome of phage lambda and three “FASTQ” files containing just a few DNA reads and their quality scores. The bowtie2 software maps (aligns) the DNA reads to their matching locations on the genome and saves the result in a “map” file as plain text (SAM, Sequence Alignment/Map), which is then converted to its binary (BAM) version with the samtools software. The sorted BAM file is in turn used by the bcftools software to export SNPs (Single Nucleotide Polymorphisms) into a .vcf “Variant Call Format” file.

4. The submit script calls the Docker image and specifies that we are working in the “Docker Universe” rather than the default “Vanilla Universe.” The only file we need to transfer is the shell script, as all other files are already installed within the Docker image.
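Before submitting, it can help to confirm that the ingredients listed above are in place. The following is a minimal sketch: the function name `check_ingredients` is hypothetical, and the file names `bt2.sh` and `bt2.sub` are the ones suggested in the appendix. It only checks that both files exist and that the submit script selects the Docker universe and names an image.

```shell
#!/bin/bash
# Verify that the two scripts exist in the given directory (default: current)
# and that the submit script requests the Docker universe and an image.
check_ingredients() {
    dir="${1:-.}"
    for f in bt2.sh bt2.sub; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            return 1
        fi
    done
    # HTCondor submit commands are case-insensitive, hence grep -i.
    if ! grep -Eiq '^[[:space:]]*universe[[:space:]]*=[[:space:]]*docker[[:space:]]*$' "$dir/bt2.sub"; then
        echo "bt2.sub does not set universe = docker"
        return 1
    fi
    if ! grep -Eiq '^[[:space:]]*docker_image[[:space:]]*=[[:space:]]*[^[:space:]]+' "$dir/bt2.sub"; then
        echo "bt2.sub does not set docker_image"
        return 1
    fi
    echo "ready to submit"
}
```

Calling `check_ingredients` with no argument inspects the current directory, which on BCC would be a folder under /scratch.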

Results

The files created are copied to the user’s account at the end of the computation, when the ephemeral Docker container is deleted. The list of files and their sizes will be:

eg2.vcf 21K
eg2.bcf 5.8K
eg2.sorted.bam  2.5M
eg2.bam 3.0M
eg3.sam 4.8M
eg2.sam 6.8M
eg1.sam 3.3M
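
A quick way to confirm that the job produced the files listed above is to check that each one exists and is non-empty, and to count the variant records in the VCF. This is a sketch only: the function names are hypothetical, and the record count relies on the VCF convention that header lines start with `#`.

```shell
#!/bin/bash
# Expected output files, as listed in the Results section.
EXPECTED="eg1.sam eg2.sam eg3.sam eg2.bam eg2.sorted.bam eg2.bcf eg2.vcf"

# Report every expected file that is missing or empty in directory $1
# (default: current directory). Returns non-zero if any file fails.
check_outputs() {
    dir="${1:-.}"
    rc=0
    for f in $EXPECTED; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return $rc
}

# Count variant records: every line of a VCF that is not a '#' header line.
count_variants() {
    grep -vc '^#' "$1"
}
```

Typical use after the job has finished would be `check_outputs && count_variants eg2.vcf`, run from the directory the job was submitted from.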

Appendix

Shell script: name it bt2.sh

#!/bin/bash

# Copy the bowtie2 example data into the working directory to keep paths short
cp -r /opt/bowtie2-2.5.1-linux-x86_64/example/ "${PWD}"
mv example/* "${PWD}"

# Define HOME, index and location of reference genome
export HOME=${PWD}
export REF=./index/lambda_virus
export REF_FASTA=./reference/lambda_virus.fa

# Align reads: single-end, paired-end, long reads
bowtie2 -x $REF -U reads/reads_1.fq -S eg1.sam
bowtie2 -x $REF -1 reads/reads_1.fq -2 reads/reads_2.fq -S eg2.sam
bowtie2 --local -x $REF -U reads/longreads.fq -S eg3.sam

# Samtools: convert SAM to BAM (-b), then sort the BAM file
samtools view -b eg2.sam -o eg2.bam
samtools sort eg2.bam -o eg2.sorted.bam

# Generate variant calls, then convert bcf file to uncompressed vcf
bcftools mpileup -f $REF_FASTA eg2.sorted.bam | bcftools call -m -v -Ob -o eg2.bcf -
bcftools convert eg2.bcf -Ov -o eg2.vcf
echo DONE

Submit script: name it bt2.sub

universe =  Docker
docker_image = jysgro/bowtie2:sam115_bcf118

executable =  bt2.sh
transfer_input_files = bt2.sh

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

request_memory = 2GB
request_disk = 2GB
request_cpus = 1

output = hello.out
error  = hello.err
log    = hello.log

queue 1

Submit the computation via HTCondor and check the job queue:

condor_submit bt2.sub 

condor_q
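
Once the job has left the queue, the standard output file named on the `output =` line of the submit script should contain the `DONE` marker printed by the last line of `bt2.sh`. A minimal sketch of that check follows; the function name `job_finished` is an assumption made for illustration.

```shell
#!/bin/bash
# Succeed if the given stdout file exists and contains a line that is
# exactly "DONE", which bt2.sh echoes after its final command.
job_finished() {
    [ -f "$1" ] && grep -qx 'DONE' "$1"
}
```

For example, `job_finished hello.out && echo "job completed"`. Because `bt2.sh` does not stop on errors, the marker only shows that the script reached its last line, so the file named on the `error =` line is worth reading as well.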

Image credits: oil in water by Pixabay artist A_Different_Perspective; Docker logo/container by artist kamanea.