From pjotr.public66 at thebird.nl Sat Jun 4 09:50:44 2016
From: pjotr.public66 at thebird.nl (Pjotr Prins)
Date: Sat, 4 Jun 2016 11:50:44 +0200
Subject: [Bio-packaging] GNU Guix update
In-Reply-To:
References: <20150705074338.GA18376@thebird.nl>
	<87fv53jb4z.fsf@mdc-berlin.de>
	<20150705082734.GA18540@thebird.nl>
	<20150706193118.GA26592@thebird.nl>
	<87lhes4wx2.fsf@izanagi.i-did-not-set--mail-host-address--so-tickle-me>
	<20150712093800.GA24041@thebird.nl>
	<20150918065854.GA18316@thebird.nl>
	<20151230115526.GA12168@thebird.nl>
Message-ID: <20160604095044.GA29370@thebird.nl>

I am writing a section on Guix etc. for the biohackathon 2015 report. It currently reads:

* Software for reproducible analysis

Computational genomics faces the challenges of scalability, reproducibility and provenance. Large datasets, such as those produced by The Cancer Genome Atlas [pmid: 24071849], are now petabyte-sized, while procedures for read-mapping, variant calling, genome assembly and downstream imputation have grown impressively sophisticated, involving numerous steps and individual programs. In addition to the need for reproducible, reusable and trustworthy data (see above), there is also the question of capturing reproducible data analysis, i.e., the steps that happen after data retrieval. Genomics analyses involving DNA or RNA sequencing are used not just in primary research but increasingly in the clinic, adding a legal component that makes it essential that analyses can be reproduced precisely.

We formed a working group on the challenges of creating reproducible pipelines for data analysis in the context of semantic technologies. With the advent of large sequencing efforts, pipelines are getting wider attention in bioinformatics now that biologists regularly have to deal with terabytes of sequencing data (cite). Such data can no longer be analyzed easily on a single workstation, so analyses are executed on compute clusters, with analysis steps running serially and in parallel on multiple machines using multiple software programs. To describe such a complex setup, pipeline runners or engines are being developed. For example, we worked on the Common Workflow Language (CWL), which abstracts away the underlying platform and describes the workflow in a language that can be used on different computing platforms. To describe the deployed software and make reproducible software installation a reality, we worked on Docker, Bioconda and GNU Guix.

One key insight is that versioned software is a form of data and can be represented by a unique hash value, e.g., a SHA can be calculated over the source code or the binary executables. Likewise, the steps in a pipeline can be captured in scripts or data and represented by a hash value, such as those calculated by git. This means that a full data analysis can be captured in a single hash value that uniquely identifies a result together with the software used, the analysis steps executed and the raw data.

** CWL

CWL (http://www.commonwl.org/) is a modern initiative to describe command line tools and connect them together to create workflows. Because CWL is a specification and not a specific piece of software, tools and workflows described using CWL are portable across a variety of platforms that support the CWL standard. CWL has roots in "make" and many similar tools that determine the order of execution based on dependencies between tasks. However, unlike "make", CWL tasks are isolated and you must be explicit about your inputs and outputs.
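As a hedged illustration of this explicitness, the following Python sketch runs a CWL tool description with cwltool, the CWL reference runner. The file names are hypothetical; the point is that all inputs are declared in a job file rather than picked up implicitly, and the runner reports the outputs it produced as a JSON object:

    import json
    import subprocess

    # Run a hypothetical CWL tool description with its job (input) file
    # using cwltool, the CWL reference runner. The file names are
    # placeholders; every input must be declared in the job file.
    result = subprocess.run(
        ["cwltool", "bwa-mem.cwl", "bwa-mem-job.yml"],
        capture_output=True, text=True, check=True)

    # cwltool reports the outputs it produced as a JSON object on stdout.
    outputs = json.loads(result.stdout)
    print(outputs)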
The benefits of explicitness and isolation are flexibility, portability and scalability: tools and workflows described with CWL can transparently leverage technologies such as Docker, can be used with CWL implementations from different vendors, and are well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes. At the biohackathon, CWL support was added to the Toil workflow engine (https://github.com/BD2KGenomics/toil) and work was done on Schema Salad, the module used to process YAML CWL files into JSON-LD linked data documents. A tutorial on the Common Workflow Language was given to interested participants.

** Docker

One challenge is the creation of standard mechanisms for running tools reproducibly and efficiently. Over the last couple of years containerization (ref?), e.g., Docker, has gained popularity as a solution to this problem. Container technologies have less overhead than full virtual machines (VMs) and are smaller in size. At the biohackathon we formed a working group to initiate a registry of bioinformatics Docker containers that can be used from CWL, for example. From this meeting evolved the GA4GH Tool Registry API (https://github.com/ga4gh/tool-registry-schemas), which provides ontology-based metadata describing inputs and outputs. Work was also done on an Ensembl API Docker image (https://github.com/helios/ensembl-docker).

** GNU Guix

One problem of Docker-based deployment is that it requires special permissions from the Linux kernel which are not granted in many HPC environments. More importantly, Docker binary images are 'opaque', i.e., it is not clear what is inside them, and the software they contain depends on when the image was created (e.g., an apt update may generate a different image). Finally, distributing binary images can be considered a security risk: you have to trust the party who created the image.

An alternative to Docker is the GNU Guix packaging system, which takes a far more rigorous approach towards reproducible software deployment. Guix packages, including their dependencies, are built from source and generate byte-identical outputs. The hash value of a Guix package includes the hash calculated over the source code, the build configuration (inputs) and the dependencies (see the sketch below). This means you get a fully tractable deployment graph which can be regenerated at any time. Guix also supports binary installs and does not require special kernel privileges. As of July 2016, Guix has fast-growing support for Perl (415 packages), Python (613), Ruby (148) and R (151).
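To make the hashing idea concrete, here is a toy Python sketch of how a package identifier can be derived from the source hash, the build configuration and the hashes of the dependencies. This illustrates the principle only; it is not the actual derivation hashing that Guix performs:

    import hashlib

    def sha256_of_file(path):
        """SHA-256 over a file's bytes, e.g. a source tarball."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def package_hash(source_tarball, build_config, dependency_hashes):
        """Toy package identifier: changing the source, the build
        configuration or any dependency changes the resulting hash."""
        h = hashlib.sha256()
        h.update(sha256_of_file(source_tarball).encode())
        h.update(build_config.encode())
        for dep in sorted(dependency_hashes):  # order-independent
            h.update(dep.encode())
        return h.hexdigest()

Because the dependency hashes are themselves computed this way, a single hash pins down the entire deployment graph, which is what makes the graph regenerable at any time.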
GNU Guix includes 129 specific bioinformatics packages, namely: aragorn bamtools bedops bedtools bio-blastxmlparser bio-locus bioawk bioperl-minimal bioruby blast+ bless bowtie bwa bwa-pssm cd-hit clipper clustal-omega codingquarry couger crossmap cufflinks cutadapt deeptools diamond edirect express express-beta-diversity fasttree fastx-toolkit filevercmp flexbar fraggenescan fxtract grit hisat hmmer htseq htslib idr java-htsjdk java-ngs jellyfish libbigwig macs mafft metabat miso mosaik muscle ncbi-vdb ngs-sdk orfm pardre pbtranscript-tofu pepr piranha plink preseq prodigal pyicoteo python-biopython python-plastid python-pybigwig python-pysam python-twobitreader python2-biopython python2-bx-python python2-pbcore python2-plastid python2-pybedtools python2-pybigwig python2-pysam python2-twobitreader python2-warpedlmm r-acsnminer r-annotationdbi r-biobase r-biocgenerics r-biocparallel r-biomart r-biostrings r-bsgenome r-bsgenome-celegans-ucsc-ce6 r-bsgenome-dmelanogaster-ucsc-dm3 r-bsgenome-hsapiens-ucsc-hg19 r-bsgenome-mmusculus-ucsc-mm9 r-dnacopy r-genomation r-genomationdata r-genomeinfodb r-genomicalignments r-genomicfeatures r-genomicranges r-go-db r-graph r-impute r-iranges r-motifrg r-org-ce-eg-db r-org-dm-eg-db r-org-hs-eg-db r-org-mm-eg-db r-qtl r-rsamtools r-rtracklayer r-s4vectors r-seqlogo r-seqpattern r-summarizedexperiment r-topgo r-variantannotation r-xvector r-zlibbioc rsem rseqc samtools seqan seqmagick smithlab-cpp snap-aligner sortmerna sra-tools star stringtie subread tophat vcftools vsearch.

At the biohackathon we added more bioinformatics packages and documentation (https://github.com/pjotrp/guix-notes/) to GNU Guix and created a deployment of Guix inside a Docker container. We packaged CWL in Guix. We also added support for Ruby gems to Guix, which means that existing Ruby packages can easily be deployed in Guix, similar to the existing support for Python and R packages. Another project that deserves mention is Bioconda (https://github.com/bioconda), which has a rapidly growing set of bioinformatics software packages. Similar to the support for Python packages in Guix, we expect to add Bioconda support to Guix in 2016. This would combine the ease of Bioconda with the rigorous control of dependencies that GNU Guix offers.

** Software Discovery

We also started preparing for software discovery. Guix comes with a continuous integration system on a build farm. We want to harvest that information to see whether packages build or fail. See, for example, the Ruby builds (http://hydra.gnu.org/job/gnu/master/ruby-2.2.3.x86_64-linux), which contain the SHA values of the package as well as the checkout of the Guix git repository reflecting the exact dependency graph. We are collaborating with the Nix and Guix communities to get this information as JSON output so it can be used in a web service (a sketch of such harvesting follows at the end of this section). The next idea is to use RDF for software discovery.

One problem software discovery needs to address is that bioinformaticians often spend a long time assessing the value of particular tools. For example, there are hundreds of structural DNA variant callers. Assessing a tool means reading papers and online documentation, followed by installing, trying and tweaking it (if it can be installed at all!). Next there may need to be an assessment of code quality (reading the source code) and of development activity: recent updates, bug fixes and mailing list traffic.
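As mentioned above, we want the build farm information as JSON. A minimal Python sketch of such harvesting could look as follows; Hydra can render many pages as JSON through HTTP content negotiation, but the "latest" endpoint and the field names shown here are assumptions, not a stable interface:

    import json
    import urllib.request

    # Ask the build farm for the latest build of the Ruby job cited
    # above, requesting JSON via content negotiation. The endpoint
    # and the "buildstatus" field are assumptions, not a stable API.
    url = "http://hydra.gnu.org/job/gnu/master/ruby-2.2.3.x86_64-linux/latest"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as response:
        build = json.load(response)

    print(build.get("buildstatus"))  # 0 indicates a successful build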
An RDF interface could make this kind of information more accessible, containing both automated and harvested data as well as contributions by individuals.
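To give an impression of what such an interface could hold, here is a small Python sketch using rdflib. The vocabulary is invented for illustration; a real service would need an agreed ontology:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    # Made-up namespace and predicates for software discovery.
    SD = Namespace("http://example.org/software-discovery#")

    g = Graph()
    tool = URIRef("http://example.org/tools/bwa")
    g.add((tool, RDF.type, SD.Tool))
    g.add((tool, RDFS.label, Literal("bwa")))
    g.add((tool, SD.buildStatus, Literal("succeeded")))  # harvested from the build farm
    g.add((tool, SD.lastUpdate, Literal("2016-05-30")))  # harvested from git history
    g.add((tool, SD.review, Literal("fast, widely used read mapper")))  # contributed by a user

    print(g.serialize(format="turtle"))

Such a graph could then be queried with SPARQL to shortlist tools by build status, development activity and user reviews.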