## Exercise 2: Running a bioinformatics application ##

### Objective ###

Run a real-world bioinformatics application in a docker container

### Running velvet from the bioboxes repository ###

This example is heavily abridged from the [bioboxes.org](http://bioboxes.org/) tutorial, at [http://bioboxes.org/docs/using-a-biobox/](http://bioboxes.org/docs/using-a-biobox/). Please refer to that tutorial for full details.

I've included the **biobox.yaml** file and the input reads directly in this repository, for convenience. You only need to **cd** into the **bioboxes** directory, check you have everything, and invoke **docker run** appropriately.

```
# Clone this documentation if you haven't already done so
> # git clone https://gitlab.ebi.ac.uk/TSI/tsi-ccdoc.git
> # cd tsi-ccdoc

# You need to be in the right working directory
> cd tsi-cc/ResOps/scripts/docker/bioboxes
> ls -l
total 0
drwxr-xr-x 4 wildish wildish 136 Nov 30 11:37 input_data
drwxr-xr-x 5 wildish wildish 170 Nov 30 11:45 output_data

> ls -l input_data/
total 64
-rw-r--r-- 1 wildish wildish 125 Nov 30 11:31 biobox.yaml
-rw-r--r-- 1 wildish wildish 27944 Nov 30 11:30 reads.fq.gz
```

Note the two ```--volume``` options, below. The first one maps the **input_data** directory from your machine to the **/bbx/input** directory in the container, where the application will look for it. The input directory is marked read-only (**:ro**). The second does the same for the output data, marking it read-write. This is a useful generic pattern for getting data into and out of containers. Be aware that the behaviour is different if you use absolute or relative paths; we use absolute paths here.

The ```default``` argument at the end of the command line is fed to the application (velvet, in this case); docker doesn't pay attention to it.

```
> docker run --volume=`pwd`/input_data:/bbx/input:ro --volume=`pwd`/output_data:/bbx/output:rw bioboxes/velvet default
Unable to find image 'bioboxes/velvet:latest' locally
latest: Pulling from bioboxes/velvet
e190868d63f8: Pull complete
909cd34c6fd7: Pull complete
0b9bfabab7c1: Pull complete
a3ed95caeb02: Pull complete
c16026f9e2f2: Pull complete
d64cce756b2d: Pull complete
e1705543da3f: Pull complete
e003a99fece9: Pull complete
1ca78c008b50: Pull complete
1b41cafd4a53: Pull complete
e846c07b1d98: Pull complete
dc7515d258fb: Pull complete
6354b2058d01: Pull complete
497a947c4908: Pull complete
Digest: sha256:6611675a6d3755515592aa71932bd4ea4c26bccad34fae7a3ec1198ddcccddad
Status: Downloaded newer image for bioboxes/velvet:latest
[0.000002] Reading FastQ file /bbx/input/reads.fq.gz;
[0.001655] 228 sequences found
[0.001659] Done
[0.001697] Reading read set file /tmp/tmp.n2NrkikPB0/Sequences;
[0.001757] 228 sequences found
[0.001938] Done
[0.001941] 228 sequences in total.
[0.001970] Writing into roadmap file /tmp/tmp.n2NrkikPB0/Roadmaps...
[0.002214] Inputting sequences...
[0.002217] Inputting sequence 0 / 228
[0.019130] === Sequences loaded in 0.016915 s
[0.019161] Done inputting sequences
[0.019164] Destroying splay table
[0.020453] Splay table destroyed
[0.000003] Reading roadmap file /tmp/tmp.n2NrkikPB0/Roadmaps
[0.000301] 228 roadmaps read
[0.000310] Creating insertion markers
[0.000338] Ordering insertion markers
[0.000430] Counting preNodes
[0.000459] 453 preNodes counted, creating them now
[0.001127] Adjusting marker info...
[0.001141] Connecting preNodes
[0.001255] Cleaning up memory
[0.001257] Done creating preGraph
[0.001260] Concatenation...
[0.001349] Renumbering preNodes
[0.001351] Initial preNode count 453
[0.001358] Destroyed 398 preNodes
[0.001360] Concatenation over!
[0.001362] Clipping short tips off preGraph
[0.001368] Concatenation...
[0.001381] Renumbering preNodes
[0.001383] Initial preNode count 55
[0.001387] Destroyed 18 preNodes
[0.001390] Concatenation over!
[0.001392] 9 tips cut off
[0.001396] 37 nodes left
[0.001435] Writing into pregraph file /tmp/tmp.n2NrkikPB0/PreGraph...
[0.001609] Reading read set file /tmp/tmp.n2NrkikPB0/Sequences;
[0.001657] 228 sequences found
[0.001834] Done
[0.002079] Reading pre-graph file /tmp/tmp.n2NrkikPB0/PreGraph
[0.002088] Graph has 37 nodes and 228 sequences
[0.002182] Scanning pre-graph file /tmp/tmp.n2NrkikPB0/PreGraph for k-mers
[0.002221] 3170 kmers found
[0.002365] Sorting kmer occurence table ...
[0.002760] Sorting done.
[0.002763] Computing acceleration table...
[0.022207] Computing offsets...
[0.022236] Ghost Threading through reads 0 / 228
[0.022243] === Ghost-Threaded in 0.000008 s
[0.022248] Threading through reads 0 / 228
[0.024005] === Threaded in 0.001756 s
[0.027663] Correcting graph with cutoff 0.200000
[0.027683] Determining eligible starting points
[0.027725] Done listing starting nodes
[0.027729] Initializing todo lists
[0.027736] Done with initilization
[0.027738] Activating arc lookup table
[0.027741] Done activating arc lookup table
[0.027923] Concatenation...
[0.027927] Renumbering nodes
[0.027929] Initial node count 37
[0.027932] Removed 21 null nodes
[0.027939] Concatenation over!
[0.027965] Clipping short tips off graph, drastic
[0.027968] Concatenation...
[0.027971] Renumbering nodes
[0.027974] Initial node count 16
[0.027977] Removed 0 null nodes
[0.027980] Concatenation over!
[0.027983] 16 nodes left
[0.028051] Writing into graph file /tmp/tmp.n2NrkikPB0/Graph...
[0.028457] Measuring median coverage depth...
[0.028464] Median coverage depth = 7.238477
[0.028482] Removing contigs with coverage < 3.619238...
[0.028487] Concatenation...
[0.028504] Renumbering nodes
[0.028507] Initial node count 16
[0.028509] Removed 15 null nodes
[0.028511] Concatenation over!
[0.028514] Concatenation...
[0.028516] Renumbering nodes
[0.028518] Initial node count 1
[0.028521] Removed 0 null nodes
[0.028523] Concatenation over!
[0.028526] Clipping short tips off graph, drastic
[0.028528] Concatenation...
[0.028530] Renumbering nodes
[0.028533] Initial node count 1
[0.028535] Removed 0 null nodes
[0.028537] Concatenation over!
[0.028539] 1 nodes left
[0.028542] WARNING: NO EXPECTED COVERAGE PROVIDED
[0.028544] Velvet will be unable to resolve any repeats
[0.028546] See manual for instructions on how to set the expected coverage parameter
[0.028549] Concatenation...
[0.028551] Renumbering nodes
[0.028553] Initial node count 1
[0.028555] Removed 0 null nodes
[0.028558] Concatenation over!
[0.028560] Removing reference contigs with coverage < 3.619238...
[0.028562] Concatenation...
[0.028684] Renumbering nodes
[0.028687] Initial node count 1
[0.028689] Removed 0 null nodes
[0.028692] Concatenation over!
[0.028730] Writing contigs into /tmp/tmp.n2NrkikPB0/contigs.fa...
[0.028981] Writing into stats file /tmp/tmp.n2NrkikPB0/stats.txt...
[0.029047] Writing into graph file /tmp/tmp.n2NrkikPB0/LastGraph...
[0.029271] Estimated Coverage cutoff = 3.619238
Final graph has 1 nodes and n50 of 2703, max 2703, total 2703, using 0/228 reads
```

And that's all there is to it! You can find the output data in the **output_data** directory (stunning, I know...)
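If you want to confirm the container exited cleanly before inspecting the results, checking the shell's exit status is enough. This is a minimal, plain-shell check rather than anything biobox-specific, and the `0` shown simply reflects the successful run above:

```
# Exit status of the docker run above; 0 means velvet (and docker) finished without error
> echo $?
0
```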
```
> ls -l output_data/
total 16
-rw-r--r-- 1 wildish wildish 108 Nov 30 11:43 biobox.yaml
-rw-r--r-- 1 wildish wildish 2812 Nov 30 11:43 contigs.fa

> cd ..
```

[bioboxes.org](http://bioboxes.org/) is a really well-organized site, with excellent documentation, and is well worth exploring.

### Conclusion ###

There are a lot of bioinformatics applications already wrapped up in container images. Bioboxes isn't the only site that provides them, but it's an extremely good one.

### Best practices ###

- don't re-invent the wheel; it's worth looking to see who's done what you want already (see the quick example below)
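As a small illustration of that last point, Docker Hub can be searched directly from the command line before you build anything yourself. This is standard docker; the results change over time, so no output is shown, and catalogues such as the one on bioboxes.org are also worth browsing, as noted above.

```
# Search Docker Hub for ready-made images of a tool before building your own
# (output omitted here, since the results change over time)
> docker search velvet
```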