[GSoC] BioRuby project proposal & VCF search API inquiry

Mon Mar 17 05:06:17 EDT 2014

Hi Loris,
thanks for your interest in this proposal. The overall ideas you raised are
all good, and you got the point of the VCF query proposal. For the other
proposals, I will let people comment as well, but I believe we already have
quite a large number of GFF / VCF parsers and formats, although a
GPU approach sounds really cool.

About the technical aspects, you are free to develop your project as you
like, and your approach sounds fine to me. A database with support for
map-reduce queries and a REST API on top to wrap common use cases is a
good idea. One point you could also explore, as others are doing for the
same proposal, is the possibility to offer support for the SPARQL language
as well. This will be a plus in terms of interoperability for such an
application.

All the best.
Francesco

On Sun, Mar 16, 2014 at 6:44 PM, Loris Cro <l.cro at campus.unimib.it> wrote:

> Hello, I'm Loris Cro, 3rd year CS student at UNIMIB (http://unimib.it).
>
> I am writing both to get some more information about the official project
> idea (and to suggest a possible solution while doing so) and to discuss a
> different proposal.
>
> In the interest of time I offer only a short introduction of myself.
>
> - I have an adequate level of the required soft and technical skills (as
>   described in the student guide); hopefully this message will attest to
>   that.
> - I have minor experience with the main programming paradigms /
>   abstraction levels (C, Java, Ruby, LISP, Prolog), with Python being
>   my programming language of choice.
> - I have a good amount of practical experience with NoSQL data modelling,
>   mainly from building an online RSS aggregator on Google App Engine
>   (which offers a NoSQL database built on BigTable).
> - Regarding my bioinformatics background, I'm only ~4.2% "bio", so I'll
>   need some help to understand the actual workflows/use cases, and the
>   semantics of some, if not most, data formats will not be instantly
>   clear to me.
>
> Now, regarding the proposals, I'll start with the new one:
>
> Would you be interested in a GFF3 or VCF parser that runs on the GPU or,
> alternatively, a binary format for GFF3?
>
> About the "official" idea:
> What is the exact level of speed/scalability/flexibility you need?
>
> I'll assume, from what I understood by reading the rationale, that:
> - you are talking about an arbitrarily large dataset (so no Redis),
> - users should be able to do complex searches, but they are not
>   necessarily expected to build queries by hand (meaning that we aim to
>   offer an easy, although extensible, interface that covers the most
>   common use cases and/or lets the user filter out enough data to do the
>   rest of the job in memory).
>   For example, a mutations_diff(sample_list_A, sample_list_B) method
>   (accessible via REST, of course).
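To make the mutations_diff example concrete, here is a minimal in-memory
sketch in Python. The semantics are an assumption, since the message does not
define them: the function returns the variants seen in any sample of the first
list and in no sample of the second. The `calls` mapping and the
(chrom, pos, ref, alt) variant key are hypothetical stand-ins for whatever the
database would actually store.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def mutations_diff(calls: Dict[str, Set[Variant]],
                   sample_list_a: Iterable[str],
                   sample_list_b: Iterable[str]) -> Set[Variant]:
    """Variants present in any sample of A and absent from every sample of B.

    This meaning of "diff" is an assumption made for illustration.
    """
    # Union of all variants called in the samples of each list;
    # unknown sample names simply contribute nothing.
    in_a = set().union(*(calls.get(s, set()) for s in sample_list_a))
    in_b = set().union(*(calls.get(s, set()) for s in sample_list_b))
    return in_a - in_b


# Usage with three made-up samples
calls = {
    "s1": {("1", 100, "A", "T"), ("1", 200, "G", "C")},
    "s2": {("1", 200, "G", "C")},
    "s3": {("2", 50, "T", "G")},
}
diff = mutations_diff(calls, ["s1", "s3"], ["s2"])
```

A REST endpoint would then only need to wrap a call like this one, with the
set operations pushed down into the database for large datasets.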
>
> Given those assumptions I think the solution could be dissected as follows:
>
> [1] A component that reads VCF files and loads them into the database.
>
> [2] The database itself.
>
> [3] A component that exposes the operations that the database doesn't offer.
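As a rough illustration of component [1], the Python sketch below turns one
VCF data line into a document that could be loaded into a document store. It
relies only on the eight fixed, tab-separated VCF columns (CHROM, POS, ID,
REF, ALT, QUAL, FILTER, INFO). The output field names are a hypothetical
choice, and per-sample genotype columns are ignored to keep it short.

```python
def parse_vcf_line(line):
    """Turn one VCF data line into a dict; return None for header lines."""
    if line.startswith("#"):
        return None  # meta-information (##...) or column header (#CHROM...)

    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a semicolon-separated list of key=value pairs or bare flags
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Usage on a single made-up record
doc = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\n")
```

INFO values are left as strings here; a real importer would probably type
them using the ##INFO header definitions.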
>
>
> Component [3] is the one you propose to build in JRuby or Scala.
> Why would you want to build [3] entirely by hand?
> If the amount of data is in the 1000 * few-gigabytes ballpark, you will
> inevitably end up with a mapreduce-y solution (please do correct me if I'm
> wrong): the items get partitioned into N groups, local
> filtering/aggregation is applied, then comes more partitioning and a final
> join of the partial results, or something like that.
> How about using a different approach?
> We find a "mapreduce engine" and implement our operations as
> "recipes" for such an engine.
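The partition / local aggregation / join pattern described above can be
sketched in a few lines of plain Python. This is a single-process toy, not a
recipe for any of the engines named in this thread: the function names are
hypothetical, and counting variants per chromosome is just a stand-in for a
real query.

```python
from collections import Counter
from functools import reduce


def partition(items, n):
    """Split items into n groups (round-robin)."""
    return [items[i::n] for i in range(n)]


def map_count(group):
    """Local aggregation: count variants per chromosome within one group."""
    return Counter(v["chrom"] for v in group)


def reduce_counts(left, right):
    """Join two partial results into one."""
    return left + right


def variants_per_chrom(variants, n_groups=4):
    """Partition, aggregate each group locally, then join the partial counts."""
    partials = [map_count(g) for g in partition(variants, n_groups)]
    return reduce(reduce_counts, partials, Counter())


# Usage on a handful of made-up records
variants = [{"chrom": c} for c in ["1", "1", "2", "X", "2", "1"]]
totals = variants_per_chrom(variants, n_groups=3)
```

In a real engine each group would live on a different node and the join
would happen over the network, but the shape of the "recipe" is the same.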
>
> Some DBs that offer such functionality: MongoDB, CouchDB, ArangoDB.
> If you want a larger-scale approach, Apache Accumulo might be a possible
> candidate.
>
> Please check out the ArangoDB site; it offers support for both JOINs and
> custom mapreduce-y scripts (written in either JS or Ruby!), plus other
> amenities.
>
> In this case we could focus on finding a good data representation,
> building a fast importer, and writing the scripts for the most common
> operations. ArangoDB even offers some nice scaffolding tools for the
> RESTful API, so extensions would be dead easy to build (if the data
> representation is decent).
>
> That said, while I have more experience working with the latter kind of
> project, I think I would be a decent fit for the former too. In fact, I
> plan to make my proposal count as the internship required to graduate, so
> I would also get some help from within my university.
>
> Let me know, the deadline for proposals is fast approaching :)
> Loris Cro
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc
>

-- 

Francesco Strozzi