Exercise 2: Running a bioinformatics application

Objective

Run a real-world bioinformatics application in a Docker container

Running velvet from the bioboxes repository

This example is heavily abridged from the bioboxes.org tutorial at http://bioboxes.org/docs/using-a-biobox/; please refer to that tutorial for full details.

I’ve included the biobox.yaml file and the input reads directly in this repository, for convenience. You only need to cd into the bioboxes directory, check that you have everything, and invoke docker run appropriately.

# Clone this documentation if you haven't already done so
> git clone https://gitlab.ebi.ac.uk/TSI/tsi-ccdoc
> cd tsi-ccdoc

# You need to be in the right working directory
> cd tsi-cc/ResOps/scripts/docker/bioboxes
> ls -l
total 0
drwxr-xr-x  4 wildish  wildish  136 Nov 30 11:37 input_data
drwxr-xr-x  5 wildish  wildish  170 Nov 30 11:45 output_data
> ls -l input_data/
total 64
-rw-r--r--  1 wildish  wildish    125 Nov 30 11:31 biobox.yaml
-rw-r--r--  1 wildish  wildish  27944 Nov 30 11:30 reads.fq.gz

Note the two --volume options below. The first maps the input_data directory on your machine to the /bbx/input directory in the container, where the application will look for it; the input directory is marked read-only (:ro). The second does the same for the output data, marking it read-write (:rw). This is a useful generic pattern for getting data into and out of containers. Be aware that absolute and relative paths behave differently: docker expects an absolute host path for a bind mount, and treats a bare relative name as a named volume instead, which is why we build absolute paths with pwd here.
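As a quick illustration of that requirement, here is a sketch that only constructs and prints the command, without invoking docker; the INPUT and OUTPUT variable names are mine, while the image and mount points are the ones used in this exercise:

```shell
# Build the bind-mount arguments from absolute paths. $(pwd) expands
# to an absolute path, which docker expects for host-side bind mounts;
# a bare relative name would be interpreted as a named volume instead.
INPUT="$(pwd)/input_data"
OUTPUT="$(pwd)/output_data"
echo docker run \
  --volume="${INPUT}:/bbx/input:ro" \
  --volume="${OUTPUT}:/bbx/output:rw" \
  bioboxes/velvet default
```

Drop the echo and you have the real command, run below.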

The default argument at the end of the command line is passed straight through to the application (velvet, in this case); docker itself pays no attention to it.

> docker run --volume=`pwd`/input_data:/bbx/input:ro --volume=`pwd`/output_data:/bbx/output:rw bioboxes/velvet default
Unable to find image 'bioboxes/velvet:latest' locally
latest: Pulling from bioboxes/velvet
e190868d63f8: Pull complete 
909cd34c6fd7: Pull complete 
0b9bfabab7c1: Pull complete 
a3ed95caeb02: Pull complete 
c16026f9e2f2: Pull complete 
d64cce756b2d: Pull complete 
e1705543da3f: Pull complete 
e003a99fece9: Pull complete 
1ca78c008b50: Pull complete 
1b41cafd4a53: Pull complete 
e846c07b1d98: Pull complete 
dc7515d258fb: Pull complete 
6354b2058d01: Pull complete 
497a947c4908: Pull complete 
Digest: sha256:6611675a6d3755515592aa71932bd4ea4c26bccad34fae7a3ec1198ddcccddad
Status: Downloaded newer image for bioboxes/velvet:latest
[0.000002] Reading FastQ file /bbx/input/reads.fq.gz;
[0.001655] 228 sequences found
[0.001659] Done
[0.001697] Reading read set file /tmp/tmp.n2NrkikPB0/Sequences;
[0.001757] 228 sequences found
[0.001938] Done
[0.001941] 228 sequences in total.
[0.001970] Writing into roadmap file /tmp/tmp.n2NrkikPB0/Roadmaps...
[0.002214] Inputting sequences...
[0.002217] Inputting sequence 0 / 228
[0.019130]  === Sequences loaded in 0.016915 s
[0.019161] Done inputting sequences
[0.019164] Destroying splay table
[0.020453] Splay table destroyed
[0.000003] Reading roadmap file /tmp/tmp.n2NrkikPB0/Roadmaps
[0.000301] 228 roadmaps read
[0.000310] Creating insertion markers
[0.000338] Ordering insertion markers
[0.000430] Counting preNodes
[0.000459] 453 preNodes counted, creating them now
[0.001127] Adjusting marker info...
[0.001141] Connecting preNodes
[0.001255] Cleaning up memory
[0.001257] Done creating preGraph
[0.001260] Concatenation...
[0.001349] Renumbering preNodes
[0.001351] Initial preNode count 453
[0.001358] Destroyed 398 preNodes
[0.001360] Concatenation over!
[0.001362] Clipping short tips off preGraph
[0.001368] Concatenation...
[0.001381] Renumbering preNodes
[0.001383] Initial preNode count 55
[0.001387] Destroyed 18 preNodes
[0.001390] Concatenation over!
[0.001392] 9 tips cut off
[0.001396] 37 nodes left
[0.001435] Writing into pregraph file /tmp/tmp.n2NrkikPB0/PreGraph...
[0.001609] Reading read set file /tmp/tmp.n2NrkikPB0/Sequences;
[0.001657] 228 sequences found
[0.001834] Done
[0.002079] Reading pre-graph file /tmp/tmp.n2NrkikPB0/PreGraph
[0.002088] Graph has 37 nodes and 228 sequences
[0.002182] Scanning pre-graph file /tmp/tmp.n2NrkikPB0/PreGraph for k-mers
[0.002221] 3170 kmers found
[0.002365] Sorting kmer occurence table ... 
[0.002760] Sorting done.
[0.002763] Computing acceleration table... 
[0.022207] Computing offsets... 
[0.022236] Ghost Threading through reads 0 / 228
[0.022243]  === Ghost-Threaded in 0.000008 s
[0.022248] Threading through reads 0 / 228
[0.024005]  === Threaded in 0.001756 s
[0.027663] Correcting graph with cutoff 0.200000
[0.027683] Determining eligible starting points
[0.027725] Done listing starting nodes
[0.027729] Initializing todo lists
[0.027736] Done with initilization
[0.027738] Activating arc lookup table
[0.027741] Done activating arc lookup table
[0.027923] Concatenation...
[0.027927] Renumbering nodes
[0.027929] Initial node count 37
[0.027932] Removed 21 null nodes
[0.027939] Concatenation over!
[0.027965] Clipping short tips off graph, drastic
[0.027968] Concatenation...
[0.027971] Renumbering nodes
[0.027974] Initial node count 16
[0.027977] Removed 0 null nodes
[0.027980] Concatenation over!
[0.027983] 16 nodes left
[0.028051] Writing into graph file /tmp/tmp.n2NrkikPB0/Graph...
[0.028457] Measuring median coverage depth...
[0.028464] Median coverage depth = 7.238477
[0.028482] Removing contigs with coverage < 3.619238...
[0.028487] Concatenation...
[0.028504] Renumbering nodes
[0.028507] Initial node count 16
[0.028509] Removed 15 null nodes
[0.028511] Concatenation over!
[0.028514] Concatenation...
[0.028516] Renumbering nodes
[0.028518] Initial node count 1
[0.028521] Removed 0 null nodes
[0.028523] Concatenation over!
[0.028526] Clipping short tips off graph, drastic
[0.028528] Concatenation...
[0.028530] Renumbering nodes
[0.028533] Initial node count 1
[0.028535] Removed 0 null nodes
[0.028537] Concatenation over!
[0.028539] 1 nodes left
[0.028542] WARNING: NO EXPECTED COVERAGE PROVIDED
[0.028544] Velvet will be unable to resolve any repeats
[0.028546] See manual for instructions on how to set the expected coverage parameter
[0.028549] Concatenation...
[0.028551] Renumbering nodes
[0.028553] Initial node count 1
[0.028555] Removed 0 null nodes
[0.028558] Concatenation over!
[0.028560] Removing reference contigs with coverage < 3.619238...
[0.028562] Concatenation...
[0.028684] Renumbering nodes
[0.028687] Initial node count 1
[0.028689] Removed 0 null nodes
[0.028692] Concatenation over!
[0.028730] Writing contigs into /tmp/tmp.n2NrkikPB0/contigs.fa...
[0.028981] Writing into stats file /tmp/tmp.n2NrkikPB0/stats.txt...
[0.029047] Writing into graph file /tmp/tmp.n2NrkikPB0/LastGraph...
[0.029271] Estimated Coverage cutoff = 3.619238
Final graph has 1 nodes and n50 of 2703, max 2703, total 2703, using 0/228 reads
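The last line of the log reports an n50 of 2703. As a refresher, the N50 is the contig length at which, walking down the contigs from longest to shortest, the running total first reaches half of the total assembly size. A small sketch (the n50 helper function is mine, not part of velvet):

```shell
# Compute N50 from contig lengths supplied one per line on stdin:
# sort descending, then report the length at which the running sum
# first reaches half of the total assembly length.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END { run = 0
          for (i = 1; i <= NR; i++) {
            run += len[i]
            if (run >= total / 2) { print len[i]; exit }
          } }'
}

printf '%s\n' 10 8 5 3 2 | n50   # prints 8 (10 + 8 >= 28 / 2)
```

With a single contig, as in this run, the N50 is simply that contig's length, which is why n50, max, and total all come out to 2703.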

And that’s all there is to it! You can find the output data in the output_data directory (stunning, I know…)

> ls -l output_data/
total 16
-rw-r--r--  1 wildish  wildish   108 Nov 30 11:43 biobox.yaml
-rw-r--r--  1 wildish  wildish  2812 Nov 30 11:43 contigs.fa
> cd ..
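For a quick sanity check on the assembly, remember that each FASTA record starts with a '>' header line, so you can count contigs with grep. The stand-in file below keeps the example self-contained; in the exercise you would point grep at output_data/contigs.fa:

```shell
# Count contigs in a FASTA file: each record begins with a '>' header.
# A tiny stand-in file is created here; the header mimics velvet's
# NODE_<id>_length_<len>_cov_<coverage> naming.
DEMO=$(mktemp -d)
cat > "$DEMO/contigs.fa" <<'EOF'
>NODE_1_length_2703_cov_7.238477
ACGTACGTACGT
EOF
grep -c '^>' "$DEMO/contigs.fa"   # prints 1
```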

bioboxes.org is a really well-organized site, with excellent documentation, and is well worth exploring.

Conclusion

There are a lot of bioinformatics applications already wrapped up in container images. Bioboxes isn’t the only site that provides them, but it’s an extremely good one.

Best practices

  • don’t re-invent the wheel; it’s worth checking whether someone has already packaged the application you want