[GSoC] BioRuby project proposal & VCF search API inquiry
Loris Cro
l.cro at campus.unimib.it
Sun Mar 23 13:47:22 EDT 2014
The proposal is up on Melange.
If any organisation member wants to discuss it on IRC, just let me know.
2014-03-17 10:06 GMT+01:00 Francesco Strozzi <francesco.strozzi at gmail.com>:
> Hi Loris,
> thanks for your interest in this proposal. The overall ideas you raised
> are all good and you got the point on the VCF query proposal. For the other
> proposals, I will let people comment as well, but I believe we already
> have quite a large number of GFF / VCF parsers and formats, although a
> GPU approach sounds really cool.
>
> About the technical aspects, you are free to develop your project as you
> like and your approach sounds fine to me. A db with support for
> map-reduce queries with a REST API on top to wrap common use cases is a
> good idea. One point you could also explore, as others are doing for the
> same proposal, is the possibility to offer support for the SPARQL language
> as well. This will be a plus in terms of interoperability for such an
> application.
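>
> Just to give the idea, a variant lookup could then be expressed with
> something like the following (the vocabulary here is purely
> illustrative, not an agreed-upon ontology):
>
>   PREFIX ex: <http://example.org/vcf#>
>   SELECT ?variant ?alt
>   WHERE {
>     ?variant ex:chromosome "20" ;
>              ex:position   ?pos ;
>              ex:alt        ?alt .
>     FILTER (?pos >= 14370 && ?pos <= 17330)
>   }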
>
> All the best.
> Francesco
>
>
>
>
> On Sun, Mar 16, 2014 at 6:44 PM, Loris Cro <l.cro at campus.unimib.it> wrote:
>
>> Hello, I'm Loris Cro, a 3rd-year CS student at UNIMIB (http://unimib.it).
>>
>> I'm writing both to get some more information about the official project
>> idea (and to sketch a possible solution while doing so) and to discuss a
>> different proposal.
>>
>> In the interest of time, I will introduce myself only briefly:
>>
>> - I have an adequate level of the required soft and technical skills (as
>> described in the student guide); hopefully this email will attest to that.
>> - I have some experience with the main programming paradigms and
>> abstraction levels (C, Java, Ruby, Lisp, Prolog), with Python being my
>> language of choice.
>> - I have a good amount of practical experience with NoSQL data modelling,
>> mainly from having to build an online RSS aggregator on Google App
>> Engine (which offers a NoSQL database built on BigTable).
>> - Regarding my bioinformatics preparation, I'm only ~4.2% "bio", so I'll
>> need some help understanding the actual workflows/use-cases, and the
>> semantics of some, if not most, data formats will not be immediately
>> clear to me.
>>
>> Now, regarding the proposals, I'll start with the new one:
>>
>> Would you be interested in a GFF3 or VCF parser that runs on the GPU or,
>> alternatively, a binary format for GFF3?
>>
>> About the "official" idea:
>> What is the exact level of speed/scalability/flexibility you need?
>>
>> I'll assume, from what I understood by reading the rationale, that:
>> - you are talking about an arbitrarily large dataset (so no Redis),
>> - users should be able to do complex searches, but they are not
>> necessarily expected to build queries by hand; that is, we aim to offer
>> an easy (though extensible) interface that covers the most common
>> use-cases and lets the user filter out enough data to do the rest of
>> the job in-memory.
>> For example, a mutations_diff(sample_list_A, sample_list_B) method,
>> accessible via REST of course (see the sketch below).
>>
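>> A minimal sketch of how such a wrapped use-case could look as a REST
>> endpoint, here using Sinatra (the mutations_diff implementation and the
>> parameter names are hypothetical):
>>
>>   require 'sinatra'
>>   require 'json'
>>
>>   get '/mutations_diff' do
>>     a = params[:samples_a].split(',')  # sample ids of group A
>>     b = params[:samples_b].split(',')  # sample ids of group B
>>     mutations_diff(a, b).to_json       # variants seen in A but not in B
>>   end
>>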
>> Given those assumptions I think the solution could be dissected as
>> follows:
>>
>> [1] A component that reads VCF files and loads them into the database
>> (sketched right after this list).
>>
>> [2] The database itself.
>>
>> [3] A component that exposes the operations that the database doesn't
>> offer.
>>
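>> To make [1] concrete, a minimal sketch assuming a generic document-store
>> client (db and the collection name are placeholders):
>>
>>   # Stream a VCF file and emit one document per record.
>>   File.foreach('samples.vcf') do |line|
>>     next if line.start_with?('#')  # skip meta-information and header lines
>>     chrom, pos, id, ref, alt, qual, filter, info = line.chomp.split("\t")
>>     db.insert('variants',
>>               chrom: chrom, pos: pos.to_i, id: id,
>>               ref: ref, alt: alt.split(','),  # ALT may list several alleles
>>               qual: qual.to_f, filter: filter, info: info)
>>   end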
>>
>> The [3] component is the one you propose to build in JRuby or Scala.
>> Why would you want to build [3] entirely by hand?
>> If the amount of data is in the 1000 * few-gigabytes ballpark, you will
>> inevitably end up with a map-reduce-style solution (please do correct me
>> if I'm wrong): the items get partitioned into N groups, local
>> filtering/aggregation is applied, then more partitioning and a last join
>> of the partial results, or something like that.
>> How about a different approach?
>> We find a "map-reduce engine" and implement our operations as "recipes"
>> for that engine.
>>
>> Some DBs that offer such functionality: MongoDB, CouchDB, ArangoDB.
>> If you want a bigger-scale approach, Apache Accumulo might be a possible
>> candidate.
>>
>> Please check out the ArangoDB site: it offers support for both JOINs and
>> custom map-reduce scripts (written in either JS or Ruby!) plus other
>> amenities.
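>>
>> For instance, one could talk to ArangoDB straight over its HTTP API via
>> the /_api/cursor endpoint (the collection and attribute names below are
>> just illustrative):
>>
>>   require 'net/http'
>>   require 'json'
>>
>>   aql  = 'FOR v IN variants FILTER v.chrom == @chrom && v.pos >= @from RETURN v'
>>   body = { query: aql, bindVars: { chrom: '20', from: 14370 } }.to_json
>>   res  = Net::HTTP.post(URI('http://localhost:8529/_api/cursor'),
>>                         body, 'Content-Type' => 'application/json')
>>   puts JSON.parse(res.body)['result']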
>>
>> In this case we could focus on finding a good data representation,
>> building a fast importer, and writing the scripts for the most common
>> operations. ArangoDB even offers some nice scaffolding tools for the
>> RESTful API, so extensions would be dead easy to build (if the data
>> representation is decent).
>>
>> That said, while I have more experience working with the latter kind of
>> project, I think I would be a decent fit for the former too. In fact, I
>> plan to make my proposal count as the internship required to graduate,
>> so I would also get some help from within my university.
>>
>> Let me know; the deadline for proposals is fast approaching :)
>> Loris Cro
>> _______________________________________________
>> GSoC mailing list
>> GSoC at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/gsoc
>>
>
>
>
> --
>
> Francesco Strozzi
>