[Open-bio-l] Fwd: GSOC (VCF DBMS)

Loris Cro l.cro at campus.unimib.it
Tue Apr 29 18:06:26 UTC 2014


Hi, we're discussing the implementation of my proposal on
the bioruby ml.

There is some news regarding the problem of computing
privates that might be of special interest.

Feel free to join the conversation.

---------- Forwarded message ----------
From: Loris Cro <l.cro at campus.unimib.it>
Date: 2014-04-29 19:46 GMT+02:00
Subject: Re: GSOC
To: Pjotr Prins <pjotr.public14 at thebird.nl>
Cc: bioruby at lists.open-bio.org


Hi all! Let me start the discussion with some info about what I've
done, what I'm planning to do next and what questions I need help with.


As far as I understand, the project idea published on the OBF
wiki was primarily to answer the problem of computing privates.
There are other features that you are interested in, but this
specific problem was the biggest pain point. I say "was" because
in fact computing privates is not that hard in the end (no JOINs or
heavy denormalization required anymore) as you can see by
reading:

      https://gist.github.com/kappaloris/11356517

In fact now it seems to me that this tool would be best implemented
as a library (with support for all the features mentioned in the gist) to
be used in conjunction with tabix. (If anyone wants to help me write it
in python, I reserved the name PrivatePy on pypi :3. I don't want to
commit this early to extra work so, if you like the idea, please offer
some help, I still have a DBMS to think about :) If you want to write
it in Ruby I can still help, ofc, no cool name tho).

I prefer Python because it seems to me that it is the language
with the most educational value, since the "private" concept is not
private to biology alone, and it's also the language I know best (and
accordingly the one I can help most effectively with).

Nevertheless, as I stated in my proposal, this script can also
be implemented as a processing step during the import to the DBMS.
Unless you're working with really huge amounts of data, you shouldn't
expect the DBMS to be faster than a command-line utility, tho.

Now, what I want to understand is how exactly VCF files constitute
a bottleneck:

1. Regarding performance: are there other computationally heavy
    operations (like privates once were :D )? Mixing filtering and other
    "by row" rules doesn't really count as 'heavy', I'm talking about ugly
    cross-referencing business.

2. Regarding current cases: what operations are really easy but made
    tedious by lack of proper interfaces / inconsistent formats / ... that
    this system should be expected to offer? An example would be the
    possibility of doing a "walk-together" import of multiple VCF files.
    This would also be extremely beneficial for making private-indexing
    faster.

3. Regarding new cases: what new features should be considered a
    must-have? For example, 1-click scalability? If you don't already
    have a specific idea, don't worry, as the other details fall into
    place I will offer some ideas depending on what the most plausible
    solutions might offer.

4. Regarding ???: are there any other aspects that I'm missing?
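
To make the "walk-together" idea in [2] concrete, here is a minimal
Python sketch. It is not the code from the gist, only an illustration
under stated assumptions: a variant is identified by (CHROM, POS, REF,
ALT), a "private" is a variant seen in exactly one input file, and
every input is already sorted by that key (chromosome names compared
as plain strings). The function names are made up for this example.

```python
import heapq
from itertools import groupby


def variant_keys(lines, sample):
    """Yield ((chrom, pos, ref, alt), sample) for each data line of one VCF."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        yield (chrom, int(pos), ref, alt), sample


def walk_together(vcfs):
    """Walk several sorted VCFs in lockstep.

    vcfs maps a sample name to an iterable of VCF lines. Yields
    (key, [samples carrying that variant]) in key order.
    Assumption: each input is sorted by (chrom, pos, ref, alt).
    """
    streams = [variant_keys(lines, name) for name, lines in vcfs.items()]
    merged = heapq.merge(*streams)  # lazy k-way merge, one record per file in memory
    for key, group in groupby(merged, key=lambda item: item[0]):
        yield key, sorted({sample for _, sample in group})


def privates(vcfs):
    """Return (key, sample) for variants found in exactly one input."""
    return [(key, samples[0])
            for key, samples in walk_together(vcfs)
            if len(samples) == 1]


# Tiny hand-written inputs, for illustration only.
sample_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t200\t.\tC\tT",
]
sample_b = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "1\t300\t.\tG\tA",
]
found = privates({"a": sample_a, "b": sample_b})
```

Here `found` holds the variant at position 200 as private to "a" and
the one at position 300 as private to "b"; the shared variant at
position 100 is dropped. Multi-allelic ALT values are treated as one
opaque string in this sketch, which a real tool would likely split.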


Please note that [1] is what I understand best. [2] especially is not
easy for me: I don't work in a sequencing center so if you want to
point something out please add a little context and don't be afraid
to paste some example code of what you are doing now and how
you think it should be done if you had this system already available.

As of now, other than exploring my solution for computing privates
a bit more, this is the first hurdle to jump. Talking with
Francesco, it seemed that privates were the biggest computational
problem, meaning that, unless someone points to something that
I'm missing, the focus should be more on ease of use and less on
"raw" performance (since every DBMS has its own kinks).

Next week I will start writing the blog and publish some more
information about what I'm doing, to make the development easier
to follow.



Thank you all for your time and please don't bother wasting time
on courtesy/etiquette: if you have any objections share them as
soon as possible, I don't mind criticism and "form" is just a prefix
of formalism.




2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:

> Hi Lori,
>
> Congrats from BioRuby for being accepted as a GSoC student this year!
>
> To all others: Lori is going to work on a scalable NoSQL VCF container
> for BioRuby with Francesco as a primary mentor. I am pretty excited
> about this project - VCF parsing is quite a bottleneck in many
> sequencing centers (including ours) and a NoSQL solution may just be
> the right idea.
>
> From here on we will discuss this project on this ML.
>
> Lori, I am on IRC this morning, if you like to chat.
>
> Pj.
>


