[BioRuby] GSOC
Francesco Strozzi
francesco.strozzi at gmail.com
Mon May 12 10:01:09 UTC 2014
Hi Loris,
*if* the VCF file is generated following general rules and guidelines, what
may change is the presence / absence of keys in the INFO and GENOTYPE
fields. Normally variant callers describe the keys they write to the INFO
field in the VCF header. We will provide you with VCF example files
generated with the latest versions of the most used calling software, i.e.
Samtools (v2.0-rc7), FreeBayes (v0.9.14) and GATK (v3.1) so you can have a
look at differences.
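
As a rough illustration of how to spot those differences, the INFO keys a
file declares can be read from its "##INFO=<ID=...>" header lines and
compared across files. This is only a sketch: the function name info_keys
is hypothetical, and the two header snippets are made up for the example,
not real output of any of the callers above.

```python
import re

# Captures the ID of a "##INFO=<ID=...,Number=...,Type=...,Description=...>" line.
INFO_LINE = re.compile(r'^##INFO=<ID=([^,>]+)')

def info_keys(lines):
    """Return the set of INFO IDs declared in the meta lines of a VCF header."""
    keys = set()
    for line in lines:
        if not line.startswith("##"):
            break  # meta lines end at the "#CHROM" column header
        match = INFO_LINE.match(line)
        if match:
            keys.add(match.group(1))
    return keys

# Made-up headers standing in for files produced by two different callers.
caller_a = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=MQ,Number=1,Type=Integer,Description="Mapping quality">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]
caller_b = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

keys_a = info_keys(caller_a)
keys_b = info_keys(caller_b)
shared = keys_a & keys_b   # keys both files declare
only_a = keys_a - keys_b   # keys specific to the first file
only_b = keys_b - keys_a   # keys specific to the second file
```

The same approach should work for the "##FORMAT=<ID=...>" lines that
describe the per-sample genotype keys.
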
Francesco
On Mon, May 12, 2014 at 11:55 AM, Loris Cro <l.cro at campus.unimib.it> wrote:
> Pjotr pointed out in another discussion that VCF files have
> some differences depending on what program generated them.
>
> How can I find out more about these differences? Or is it only a
> matter of custom keys in the INFO and/or FORMAT field structure?
> Basically I'm wondering if the VCF file specification is enough to
> understand these differences.
>
> Also, since the objective is to work with multiple files, some fields
> seem to lose meaning in that context. Is there any convention
> regarding that matter?
>
>
> 2014-04-29 19:46 GMT+02:00 Loris Cro <l.cro at campus.unimib.it>:
>
> > Hi all! Let me start the discussion with some info about what I've
> > done, what I'm planning to do next and what questions I need help with.
> >
> >
> > As far as I understand, the project idea published on the OBF
> > wiki was primarily to answer the problem of computing privates.
> > There are other features that you are interested in, but this
> > specific problem was the biggest pain point. I say "was" because
> > in fact computing privates is not that hard in the end (no JOINs or
> > heavy denormalization required anymore) as you can see by
> > reading:
> >
> > https://gist.github.com/kappaloris/11356517
> >
> > In fact now it seems to me that this tool would be best implemented
> > as a library (with support for all the features mentioned in the gist) to
> > be used in conjunction with tabix. (If anyone wants to help me write it
> > in python, I reserved the name PrivatePy on pypi :3. I don't want to
> > commit this early to extra work so, if you like the idea, please offer
> > some help, I still have a DBMS to think about :) If you want to write
> > it in Ruby I can still help, ofc, no cool name tho).
> >
> > I prefer python because it seems to me that python is the language
> > with the most educational value since the "private" concept is not
> > private to biology alone and also it's the language I know best (and
> > accordingly I can help most effectively with).
> >
> > Nevertheless, as I stated in my proposal, this script can also
> > be implemented as a processing step during the import to the DBMS.
> > Unless you're working with really huge amounts of data, you shouldn't
> > expect the DBMS to be faster than a command-line utility, tho.
> >
> > Now, what I want to understand is how exactly VCF files constitute
> > a bottleneck:
> >
> > 1. Regarding performance: are there other computationally heavy
> > operations (like privates once were :D )? Mixing filtering and other
> > "by row" rules doesn't really count as 'heavy', I'm talking about
> > ugly cross-referencing business.
> >
> > 2. Regarding current cases: what operations are really easy but made
> > tedious by lack of proper interfaces / inconsistent formats / ...
> > that this system should be expected to offer? An example would be the
> > possibility of doing a "walk-together" import of multiple VCF files.
> > This would also be extremely beneficial for making private-indexing
> > faster.
> >
> > 3. Regarding new cases: what new features should be considered a
> > must-have? For example, 1-click scalability? If you don't already
> > have a specific idea, don't worry, as the other details fall into
> > place I will offer some ideas depending on what the most plausible
> > solutions might offer.
> >
> > 4. Regarding ???: are there any other aspects that I'm missing?
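
To make the "walk-together" idea in point 2 concrete, here is a minimal
sketch of reading several position-sorted VCF files in lockstep and
reporting, for each variant, which files contain it. Variants seen in
exactly one file then fall out as privates. Assumptions, none of which come
from the gist: every file is sorted by the same chromosome order and then
by position, a variant is identified by (CHROM, POS, REF, ALT) with ALT
compared as a plain string, and the names records and walk_together are
hypothetical.

```python
import heapq
from itertools import groupby

def records(name, lines, chrom_rank):
    """Yield (site_key, variant, file_name) for each data line of one VCF."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta lines and the column header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        pos = int(pos)
        yield (chrom_rank[chrom], pos), (chrom, pos, ref, alt), name

def walk_together(files, chrom_order):
    """Merge sorted VCFs and yield (variant, [names of files containing it])."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    streams = [records(name, lines, rank) for name, lines in files.items()]
    # heapq.merge keeps only one pending record per file in memory.
    merged = heapq.merge(*streams, key=lambda rec: rec[0])
    for _, site in groupby(merged, key=lambda rec: rec[0]):
        seen = {}  # variant -> files that report it at this site
        for _, variant, name in site:
            seen.setdefault(variant, []).append(name)
        for variant, names in seen.items():
            yield variant, names

# Tiny made-up inputs; real files would be opened and passed as line iterators.
files = {
    "a.vcf": [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "chr1\t100\t.\tA\tG\n",
        "chr1\t200\t.\tC\tT\n",
        "chr2\t50\t.\tG\tA\n",
    ],
    "b.vcf": [
        "chr1\t100\t.\tA\tG\n",
        "chr1\t300\t.\tT\tC\n",
    ],
}

result = list(walk_together(files, ["chr1", "chr2"]))
privates = [(variant, names[0]) for variant, names in result if len(names) == 1]
```

Because the merge is streaming, memory use depends on the number of files
rather than on their size, which is the main appeal over loading everything
first and joining afterwards.
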
> >
> >
> > Please note that [1] is what I understand best. [2] especially is not
> > easy for me: I don't work in a sequencing center so if you want to
> > point something out please add a little context and don't be afraid
> > to paste some example code of what you are doing now and how
> > you think it should be done if you had this system already available.
> >
> > As of now, other than doing some more exploring on my solution for
> > computing privates, this is the first hurdle to jump. Talking with
> > Francesco, it seemed that privates were the biggest computational
> > problem, meaning that, unless someone points to something that
> > I'm missing, the focus should be more on ease of use and less on
> > "raw" performance (since every DBMS has its own kinks).
> >
> > Next week I will start writing the blog and publish some other
> > information about what I'm doing to make the development easier
> > to follow.
> >
> >
> >
> > Thank you all for your time and please don't bother wasting time
> > on courtesy/etiquette: if you have any objections share them as
> > soon as possible, I don't mind criticism and "form" is just a prefix
> > of formalism.
> >
> >
> >
> >
> > 2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:
> >
> >> Hi Lori,
> >>
> >> Congrats from BioRuby for being accepted as a GSoC student this year!
> >>
> >> To all others: Lori is going to work on a scalable NoSQL VCF container
> >> for BioRuby with Francesco as a primary mentor. I am pretty excited
> >> about this project - VCF parsing is quite a bottleneck in many
> >> sequencing centers (including ours) and a NoSQL solution may just be
> >> the right idea.
> >>
> >> From here on we will discuss this project on this ML.
> >>
> >> Lori, I am on IRC this morning, if you like to chat.
> >>
> >> Pj.
> >>
> >
> >
>
--
Francesco Strozzi