From pjotr.public66 at thebird.nl Sat Jun 4 09:50:44 2016
From: pjotr.public66 at thebird.nl (Pjotr Prins)
Date: Sat, 4 Jun 2016 11:50:44 +0200
Subject: [Bio-packaging] GNU Guix update
In-Reply-To:
References: <20150705074338.GA18376@thebird.nl>
	<87fv53jb4z.fsf@mdc-berlin.de>
	<20150705082734.GA18540@thebird.nl>
	<20150706193118.GA26592@thebird.nl>
	<87lhes4wx2.fsf@izanagi.i-did-not-set--mail-host-address--so-tickle-me>
	<20150712093800.GA24041@thebird.nl>
	<20150918065854.GA18316@thebird.nl>
	<20151230115526.GA12168@thebird.nl>
Message-ID: <20160604095044.GA29370@thebird.nl>

I am writing a section on Guix etc. for the biohackathon 2015 report. It currently reads:

* Software for reproducible analysis

Computational genomics faces the challenges of scalability, reproducibility and provenance. Large datasets, such as those produced by The Cancer Genome Atlas [pmid: 24071849], are now petabyte-sized, while procedures for read-mapping, variant calling, genome assembly and downstream imputation have grown impressively sophisticated, involving numerous steps and individual programs. In addition to the need for reproducible, reusable and trustworthy data (see above), there is also the question of capturing reproducible data analysis, i.e., the steps that happen after data retrieval. Genomics analyses involving DNA or RNA sequencing are used not just in primary research but increasingly in the clinic, adding a legal component that makes it essential that analyses can be reproduced precisely.

We formed a working group on the challenges of creating reproducible pipelines for data analysis in the context of semantic technologies. With the advent of large sequencing efforts, pipelines are getting wider attention in bioinformatics now that biologists regularly have to deal with terabytes of sequencing data (cite). Such data can no longer be analyzed easily on a single workstation, so analyses are executed on compute clusters, with analysis steps running serially and in parallel on multiple machines using multiple software programs. To describe such a complex setup, pipeline runners or engines are being developed. For example, we worked on the Common Workflow Language (CWL), which abstracts away the underlying platform and describes the workflow in a language that can be used on different computing platforms. To describe the deployed software and make reproducible software installation a reality, we worked on Docker, Bioconda and GNU Guix.

One key insight is that versioned software is a form of data and can be represented by a unique hash value, e.g., a SHA can be calculated over the source code or the binary executables. Likewise, the steps in a pipeline can be captured in scripts or data and represented by a hash value, such as those calculated by git. This means that a full data analysis can be captured in a single hash value that uniquely identifies a result together with the software used, the analysis steps executed and the raw data.

** CWL

CWL (http://www.commonwl.org/) is a modern initiative to describe command line tools and connect them together to create workflows. Because CWL is a specification and not a specific piece of software, tools and workflows described using CWL are portable across a variety of platforms that support the CWL standard. CWL has roots in "make" and many similar tools that determine the order of execution based on dependencies between tasks. However, unlike "make", CWL tasks are isolated and you must be explicit about your inputs and outputs.
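As a hedged illustration of this explicitness, the following Python sketch runs a CWL tool description with cwltool, the CWL reference runner. The file names are hypothetical; the point is that all inputs are declared in a job file rather than picked up implicitly, and the runner reports the outputs it produced as a JSON object:

    import json
    import subprocess

    # Run a hypothetical CWL tool description with its job (input) file
    # using cwltool, the CWL reference runner. The file names are
    # placeholders; every input must be declared in the job file.
    result = subprocess.run(
        ["cwltool", "bwa-mem.cwl", "bwa-mem-job.yml"],
        capture_output=True, text=True, check=True)

    # cwltool reports the outputs it produced as a JSON object on stdout.
    outputs = json.loads(result.stdout)
    print(outputs)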
The benefits of explicitness and isolation are flexibility, portability and scalability: tools and workflows described with CWL can transparently leverage technologies such as Docker, can be used with CWL implementations from different vendors, and are well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes. At the biohackathon, CWL support was added to the Toil workflow engine (https://github.com/BD2KGenomics/toil) and work was done on Schema Salad, the module used to process YAML CWL files into JSON-LD linked data documents. A tutorial on the Common Workflow Language was given to interested participants.

** Docker

One challenge is the creation of standard mechanisms for running tools reproducibly and efficiently. Over the last couple of years containerization (ref?), e.g., Docker, has gained popularity as a solution to this problem. Container technologies have less overhead than full virtual machines (VMs) and are smaller in size. At the biohackathon we formed a working group to initiate a registry of bioinformatics Docker containers that can be used from CWL, for example. From this meeting evolved the GA4GH Tool Registry API (https://github.com/ga4gh/tool-registry-schemas), which provides ontology-based metadata describing inputs and outputs. Work was also done on an Ensembl API Docker image (https://github.com/helios/ensembl-docker).

** GNU Guix

One problem of Docker-based deployment is that it requires special permissions from the Linux kernel which are not granted in many HPC environments. More importantly, Docker binary images are 'opaque', i.e., it is not clear what is inside them, and the software they contain depends on when the image was created (e.g., an apt update may generate a different image). Finally, distributing binary images can be considered a security risk: you have to trust the party who created the image.

An alternative to Docker is the GNU Guix packaging system, which takes a far more rigorous approach towards reproducible software deployment. Guix packages, including their dependencies, are built from source and generate byte-identical outputs. The hash value of a Guix package includes the hash calculated over the source code, the build configuration (inputs) and the dependencies (see the sketch below). This means you get a fully tractable deployment graph which can be regenerated at any time. Guix also supports binary installs and does not require special kernel privileges. As of July 2016, Guix has fast-growing support for Perl (415 packages), Python (613), Ruby (148) and R (151).
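To make the hashing idea concrete, here is a toy Python sketch of how a package identifier can be derived from the source hash, the build configuration and the hashes of the dependencies. This illustrates the principle only; it is not the actual derivation hashing that Guix performs:

    import hashlib

    def sha256_of_file(path):
        """SHA-256 over a file's bytes, e.g. a source tarball."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def package_hash(source_tarball, build_config, dependency_hashes):
        """Toy package identifier: changing the source, the build
        configuration or any dependency changes the resulting hash."""
        h = hashlib.sha256()
        h.update(sha256_of_file(source_tarball).encode())
        h.update(build_config.encode())
        for dep in sorted(dependency_hashes):  # order-independent
            h.update(dep.encode())
        return h.hexdigest()

Because the dependency hashes are themselves computed this way, a single hash pins down the entire deployment graph, which is what makes the graph regenerable at any time.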
GNU Guix includes 129 specific bioinformatics packages, namely: aragorn bamtools bedops bedtools bio-blastxmlparser bio-locus bioawk bioperl-minimal bioruby blast+ bless bowtie bwa bwa-pssm cd-hit clipper clustal-omega codingquarry couger crossmap cufflinks cutadapt deeptools diamond edirect express express-beta-diversity fasttree fastx-toolkit filevercmp flexbar fraggenescan fxtract grit hisat hmmer htseq htslib idr java-htsjdk java-ngs jellyfish libbigwig macs mafft metabat miso mosaik muscle ncbi-vdb ngs-sdk orfm pardre pbtranscript-tofu pepr piranha plink preseq prodigal pyicoteo python-biopython python-plastid python-pybigwig python-pysam python-twobitreader python2-biopython python2-bx-python python2-pbcore python2-plastid python2-pybedtools python2-pybigwig python2-pysam python2-twobitreader python2-warpedlmm r-acsnminer r-annotationdbi r-biobase r-biocgenerics r-biocparallel r-biomart r-biostrings r-bsgenome r-bsgenome-celegans-ucsc-ce6 r-bsgenome-dmelanogaster-ucsc-dm3 r-bsgenome-hsapiens-ucsc-hg19 r-bsgenome-mmusculus-ucsc-mm9 r-dnacopy r-genomation r-genomationdata r-genomeinfodb r-genomicalignments r-genomicfeatures r-genomicranges r-go-db r-graph r-impute r-iranges r-motifrg r-org-ce-eg-db r-org-dm-eg-db r-org-hs-eg-db r-org-mm-eg-db r-qtl r-rsamtools r-rtracklayer r-s4vectors r-seqlogo r-seqpattern r-summarizedexperiment r-topgo r-variantannotation r-xvector r-zlibbioc rsem rseqc samtools seqan seqmagick smithlab-cpp snap-aligner sortmerna sra-tools star stringtie subread tophat vcftools vsearch.

At the biohackathon we added more bioinformatics packages and documentation (https://github.com/pjotrp/guix-notes/) to GNU Guix and created a deployment of Guix inside a Docker container. We packaged CWL in Guix. We also added support for Ruby gems to Guix, which means that existing Ruby packages can easily be deployed in Guix, similar to the existing support for Python and R packages. Another project that deserves mention is Bioconda (https://github.com/bioconda), which has a rapidly growing set of bioinformatics software packages. Similar to the support for Python packages in Guix, we expect to add Bioconda support to Guix in 2016. This would combine the ease of Bioconda with the rigorous control of dependencies that GNU Guix offers.

** Software Discovery

We also started preparing for software discovery. Guix comes with a continuous integration system on a build farm. We want to harvest that information to see whether packages build or fail. See, for example, the Ruby builds (http://hydra.gnu.org/job/gnu/master/ruby-2.2.3.x86_64-linux), which contain the SHA values of the package as well as the checkout of the Guix git repository reflecting the exact dependency graph. We are collaborating with the Nix and Guix communities to get this information as JSON output so it can be used in a web service (a sketch of such harvesting follows at the end of this section). The next idea is to use RDF for software discovery.

One problem software discovery needs to address is that bioinformaticians often spend a long time assessing the value of particular tools. For example, there are hundreds of structural DNA variant callers. Assessing a tool means reading papers and online documentation, followed by installing, trying and tweaking it (if it can be installed at all!). Next there may need to be an assessment of code quality (reading the source code) and of development activity: recent updates, bug fixes and mailing list traffic.
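As mentioned above, we want the build farm information as JSON. A minimal Python sketch of such harvesting could look as follows; Hydra can render many pages as JSON through HTTP content negotiation, but the "latest" endpoint and the field names shown here are assumptions, not a stable interface:

    import json
    import urllib.request

    # Ask the build farm for the latest build of the Ruby job cited
    # above, requesting JSON via content negotiation. The endpoint
    # and the "buildstatus" field are assumptions, not a stable API.
    url = "http://hydra.gnu.org/job/gnu/master/ruby-2.2.3.x86_64-linux/latest"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as response:
        build = json.load(response)

    print(build.get("buildstatus"))  # 0 indicates a successful build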
An RDF interface could make this kind of information more accessible, containing both automated and harvested data as well as contributions by individuals.
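To give an impression of what such an interface could hold, here is a small Python sketch using rdflib. The vocabulary is invented for illustration; a real service would need an agreed ontology:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    # Made-up namespace and predicates for software discovery.
    SD = Namespace("http://example.org/software-discovery#")

    g = Graph()
    tool = URIRef("http://example.org/tools/bwa")
    g.add((tool, RDF.type, SD.Tool))
    g.add((tool, RDFS.label, Literal("bwa")))
    g.add((tool, SD.buildStatus, Literal("succeeded")))  # harvested from the build farm
    g.add((tool, SD.lastUpdate, Literal("2016-05-30")))  # harvested from git history
    g.add((tool, SD.review, Literal("fast, widely used read mapper")))  # contributed by a user

    print(g.serialize(format="turtle"))

Such a graph could then be queried with SPARQL to shortlist tools by build status, development activity and user reviews.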