Reproducible research
Docker is a great tool for reproducible research: it should allow other researchers to reproduce any advanced computation that you, as a researcher, have performed, most likely in association with publishing a paper.
However, there are many caveats along the way, the first one being the explanation of how to launch and use a Docker image downloaded from Docker Hub.
As I am writing Docker tutorials for biologists I am looking for examples, and unfortunately I am finding a lot of images that are either unusable or unclear.
In many cases a simple README file would suffice to make them usable by many more scientists.
RNA-Seq pipeline
While trying a specific Docker image that looked promising (An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study. Wang Z, Ma’ayan A. PubMed: 27583132) I ran into some unexpected problems, as well as some necessary set-up steps that are unclearly documented.
It is not necessary to use Docker to see or use their pipeline, which is visible in a Jupyter notebook available online: RNA Seq pipeline.
I wrote a blog entry elsewhere about the problem of downloading the data from the Sequence Read Archive (SRA), simply because some of the data has disappeared from the FTP site.
On the Docker Hub entry maayanlab/zika they do provide enough information to start the Docker image into a usable container, even though it is not a straightforward affair for a novice. The important information from the hub web page is:
- Run the Docker image. The Docker container requires the host to mount two directories as data volumes: the reference genome directory (/notebook/genomes) and the data directory (/notebook/data). This can be done by specifying the -v tag when running Docker: -v /host/path/to/genomes:/notebook/genomes -v /host/path/to/data:/notebook/data
$ docker run -d -p 80:8888 -e "PASSWORD=YourPassword" -e "USE_HTTP=1" -v /host/path/to/genomes:/notebook/genomes -v /host/path/to/data:/notebook/data maayanlab/zika
The information provided next on that list is obsolete, as Docker has evolved and there is no longer a “Docker machine” available. However, by going to localhost:80 and getting a “404 Not Found” page, one can click on the Jupyter logo and end up at http://localhost/zika/tree, where the notebook “Zika.ipynb” is found. It is the same as the online version (but can be edited and run). With this knowledge, one could go directly to http://localhost/zika/notebooks/Zika.ipynb and start working, but this is not clear from the original explanations.
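To make this concrete, a minimal check (assuming the docker run command above succeeded and nothing else occupies port 80 on the host) is:

$ docker ps --filter ancestor=maayanlab/zika    # confirm the container is up and see the 80->8888 port mapping
# then point a browser directly at:
#   http://localhost/zika/notebooks/Zika.ipynb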
What’s missing?
Data
The fact that there have been changes at SRA is not the authors’ fault. To find and download the actual data one can still use the “Project” number SRP070895 in the convenient “SRA-explorer” web site and find alternate sources, e.g. from the UK rather than the US.
In this case some of the files may only be downloadable as *.fastq.gz rather than .sra, but at least they’d be available.
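For illustration, SRA-explorer generates direct download commands against the European Nucleotide Archive (ENA) mirror. For the paired-end run SRR3191545 mentioned below they would look roughly like this (treat the exact URLs as an assumption; use whatever SRA-explorer actually prints):

$ curl -L -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/005/SRR3191545/SRR3191545_1.fastq.gz
$ curl -L -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/005/SRR3191545/SRR3191545_2.fastq.gz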
EDIT_1: The *.fastq.gz files are stored in a different directory than the SRA files. The missing ones are “paired-end reads” and should be placed in /notebook/data/Zika/paired_fastqs
EDIT_2: The *.fastq.gz files downloaded with the help of “SRA-explorer” had longer, more meaningful names. However, upon running the pipeline there are errors due to file names: files downloaded as *.fastq.gz can only have an SRR name, for example SRR3191545_1.fastq.gz, otherwise the file is not found. However, those that were saved as *.sra keep their original name.
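A minimal renaming sketch, assuming the SRA-explorer names keep the SRR accession as a prefix and the read number (_1/_2) as a suffix (the long middle part shown in the comment is hypothetical):

cd /host/path/to/data/Zika/paired_fastqs
for f in SRR*_*.fastq.gz; do
    # keep only the accession and the read number, e.g.
    # SRR3191545_some_long_description_1.fastq.gz -> SRR3191545_1.fastq.gz
    new=$(echo "$f" | sed -E 's/^(SRR[0-9]+)_.*_([12])\.fastq\.gz$/\1_\2.fastq.gz/')
    [ "$f" != "$new" ] && mv -v "$f" "$new"
done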
Single-read files issued from the SRA conversion are placed in /notebook/data/Zika/fastqs
We know from the Docker Hub page that these files should be placed under /notebook/data
However, it has to be deciphered from the notebook that the data should be placed in:
data/Zika//SRR*/*.sra
Perhaps it would be better written as /data/Zika/SRR*/*.sra: the double “//” is probably a typo, and perhaps the missing leading “/” as well?
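In practice this means one sub-directory per SRA run under the mounted data directory. A host-side sketch, assuming the /host/path/to/data mount point from the docker run command above:

# create one sub-directory per run so that the notebook finds data/Zika/SRR*/*.sra
for f in SRR*.sra; do
    run=${f%.sra}
    mkdir -p "/host/path/to/data/Zika/$run"
    mv "$f" "/host/path/to/data/Zika/$run/"
done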
Folder structure
The pipeline requires data that is not included in the Docker image or the pipeline, but there are no clear explanations of exactly what is needed and how it should be organized. It is up to the user to figure it out by “reverse engineering” the information provided in the notebook, as we already saw above with the exact location where the data should be placed and organized.
- Genome – it is mentioned that the human genome hg19 is used. The implication is that this genome should be the one from the University of California, Santa Cruz (UCSC), which can be downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/
- The file to download is hg19.2bit, which will require conversion from the 2bit format to the FASTA format with a tool (twoBitToFa); see the sketch after this list. Alternatively, the sequence could also be downloaded directly as hg19.fa.gz
- There is also the need for an annotation file in GTF format, which can be generated with the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables?command=start by specifying the hg19 “assembly” and GTF as the output format.
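A minimal sketch of the genome preparation, assuming a Linux x86_64 host; twoBitToFa is UCSC’s pre-compiled utility, and the output name genome.fa anticipates the path expected by the pipeline scripts shown below:

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
chmod +x twoBitToFa
./twoBitToFa hg19.2bit genome.fa    # convert the 2bit archive to FASTA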
From the information on the Docker Hub we know that these files should be in the directory /notebook/genomes, but other important specific information is still missing… it can be gleaned from one of the pipeline scripts.
In both scripts analyze_sra.sh and analyze_fastq.sh it is written:
GENOME="$HOME/genomes/Homo_sapiens/UCSC/hg19"
and we can also discover the following declarations that altogether tell us where the files should go:
GENOME_GTF="$GENOME/Annotation/Genes/genes.gtf"
GENOME_FA="$GENOME/Sequence/WholeGenomeFasta/genome.fa"
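Putting these together, the genome files apparently have to follow an iGenomes-style layout under the mounted genomes directory. A host-side sketch, assuming $HOME resolves to /notebook inside the container and /host/path/to/genomes is the mount point from the docker run command above:

GENOMES=/host/path/to/genomes    # mounted as /notebook/genomes in the container
mkdir -p "$GENOMES/Homo_sapiens/UCSC/hg19/Annotation/Genes"
mkdir -p "$GENOMES/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta"
cp genes.gtf "$GENOMES/Homo_sapiens/UCSC/hg19/Annotation/Genes/"
cp genome.fa "$GENOMES/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/"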
However, if one only feels comfortable using the notebook, it would be difficult to figure out where files have to be placed.
I will try to run it once the data has finished downloading (a slow process…)