[BioRuby] GSOC

Mon May 12 16:12:29 UTC 2014

I'm trying to write a list of all the problems that must be addressed:

https://github.com/kappaloris/GSoC-2014-OBF/blob/master/problems-features.md

For now I believe I should try to fill the first section as much as
possible and I wouldn't mind some input in that regard.

I stubbed a possible data model that would preserve all the information
present in the VCF files, considering also the possibility of having
multiple reference genomes inside a single collection.

https://gist.github.com/kappaloris/462082314dc2e940ba4e
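
For illustration only, here is a minimal sketch of what a lossless
per-record model could look like. It is not the content of the gist
above: the class name VariantRecord, the parse_line helper and every
field name are hypothetical, and keeping INFO and per-sample values as
raw strings is an assumption made so that nothing from the file is lost.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariantRecord:
    """One VCF data line. All names here are hypothetical."""
    reference: str                 # reference genome the coordinates refer to
    source_file: str               # VCF file the line came from
    chrom: str
    pos: int
    ident: Optional[str]           # ID column, None when "."
    ref: str
    alt: List[str]                 # one entry per alternate allele
    qual: Optional[float]          # None when "."
    filters: List[str]             # empty when "."
    info: Dict[str, str]           # raw INFO values; flags map to ""
    samples: Dict[str, Dict[str, str]] = field(default_factory=dict)


def parse_line(line, sample_names, reference, source_file):
    """Split one tab-separated VCF data line into a VariantRecord."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ident, ref, alt, qual, filt, info = cols[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value

    samples = {}
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column names the per-sample keys
        for name, raw in zip(sample_names, cols[9:]):
            samples[name] = dict(zip(keys, raw.split(":")))

    return VariantRecord(
        reference=reference,
        source_file=source_file,
        chrom=chrom,
        pos=int(pos),
        ident=None if ident == "." else ident,
        ref=ref,
        alt=alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=[] if filt == "." else filt.split(";"),
        info=info_dict,
        samples=samples,
    )
```

Keeping the reference genome and the source file on every record is one
way to let a single collection hold several genomes side by side.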

How to merge the results of queries is still TBD, tho.

2014-05-12 14:23 GMT+02:00 Francesco Strozzi <francesco.strozzi at gmail.com>:

> I think this is a slightly different point, which is more related to
> downstream software accepting only particular variations, like working with
> SNPs but not InDels or multi-allelic sites, as in the example you
> mentioned. I agree it is a good approach in these situations to have an
> "ignore errors" option, which only discards unwanted or bad records
> instead of dropping the whole dataset :-), but this is not something
> strictly related to the VCF format per se. My point was that the VCF
> specification already gives a certain degree of flexibility, and callers,
> when they take liberties, also provide enough information in the
> header to see where and what kind of custom data is added to each record
> (I am talking mainly about the INFO field here).
>
> Ciao!
> Francesco
>
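
The "ignore errors" behaviour discussed in this message can be sketched
roughly as follows. This is not how bio-vcf implements its option: the
function names parse_record and read_records and the ignore_errors
parameter are hypothetical, and the only checks made are the column
count and an integer POS, which is an assumption chosen to keep the
example short.

```python
def parse_record(line):
    """Return (chrom, pos, ref, alts) or raise ValueError on a bad line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    # int() raises ValueError when POS is not a number
    return cols[0], int(cols[1]), cols[3], cols[4].split(",")


def read_records(lines, ignore_errors=False):
    """Yield parsed records, optionally skipping the ones that fail."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # header or empty line
        try:
            record = parse_record(line)
        except ValueError as err:
            if ignore_errors:
                continue  # lenient mode: drop only this record
            raise ValueError(f"line {number}: {err}") from err
        yield record
```

In strict mode the first bad line stops the run, in lenient mode the
rest of the dataset still comes through.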
>
> On Mon, May 12, 2014 at 12:00 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
>
>> Not completely accurate. Variant callers do take liberties. I have
>> just had a varscan2 result which was rejected by cartagenia. Obviously
>> the latter is not flexible in what it accepts ;). Turns out it does
>> not allow the variant field to contain multiple nucleotides.
>>
>> My main point is that, if you want your software to be generally
>> useful, you can not predict what liberties programmers take. That is
>> in fact the secret of the success of 'flexible' formats in
>> bioinformatics - think VCF, FASTA, GFF3, SAM etc. The trick is to have
>> minimal guidelines on what you expect - but don't become rigorous or,
>> if you can't resist being rigorous, make it so that you can switch it
>> off. With bio-vcf I have added a --ignore-errors option for that very
>> reason.
>>
>> Pj.
>>
>> On Mon, May 12, 2014 at 12:01:09PM +0200, Francesco Strozzi wrote:
>> > Hi Loris,
>> > *if* the VCF file is generated following general rules and guidelines,
>> > what may change is the presence / absence of keys in the INFO and
>> > GENOTYPE fields. Normally variation callers provide information on the
>> > INFO field composition in the VCF header. We will provide you with VCF
>> > example files generated with the latest versions of the most used
>> > calling software, i.e. Samtools (v2.0-rc7), FreeBayes (v0.9.14) and
>> > GATK (v3.1) so you can have a look at differences.
>> >
>> > Francesco
>> >
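
As a rough illustration of reading the INFO definitions from the header,
here is a small sketch. The ##INFO=<ID=...,Number=...,Type=...,
Description="..."> line layout comes from the VCF specification; the
function name info_definitions and the returned dictionary layout are
hypothetical, and header lines that deviate from that key order are
simply skipped, which is an assumption.

```python
import re

# Matches the ID, Number, Type and Description keys of an ##INFO header line.
INFO_LINE = re.compile(
    r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)"'
)


def info_definitions(header_lines):
    """Map each declared INFO key to its Number, Type and Description."""
    definitions = {}
    for line in header_lines:
        match = INFO_LINE.match(line)
        if match:
            key, number, type_, description = match.groups()
            definitions[key] = {
                "Number": number,
                "Type": type_,
                "Description": description,
            }
    return definitions
```

Comparing the key sets returned for files from different callers is one
quick way to see which INFO keys are caller-specific.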
>> >
>> > On Mon, May 12, 2014 at 11:55 AM, Loris Cro <l.cro at campus.unimib.it> wrote:
>> >
>> >     Pjotr pointed out in another discussion that VCF files have
>> >     some differences depending on what program generated them.
>> >
>> >     How can I find out more about these differences? Or is it only a
>> >     matter of custom keys in the INFO and/or FORMAT field structure?
>> >     Basically I'm wondering if the VCF file specification is enough to
>> >     understand these differences.
>> >
>> >     Also, since the objective is to work with multiple files, some
>> >     fields seem to lose meaning in that context. Is there any
>> >     convention regarding that matter?
>> >
>> >
>> >     2014-04-29 19:46 GMT+02:00 Loris Cro <l.cro at campus.unimib.it>:
>> >
>> >     > Hi all! Let me start the discussion with some info about what I've
>> >     > done, what I'm planning to do next and what questions I need help
>> >     > with.
>> >     >
>> >     >
>> >     > As far as I understand, the project idea published on the OBF
>> >     > wiki was primarily to answer the problem of computing privates.
>> >     > There are other features that you are interested in, but this
>> >     > specific problem was the biggest pain point. I say "was" because
>> >     > in fact computing privates is not that hard in the end (no JOINs
>> >     > or heavy denormalization required anymore) as you can see by
>> >     > reading:
>> >     >
>> >     >       https://gist.github.com/kappaloris/11356517
>> >     >
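
To make the "privates" idea concrete, here is a minimal in-memory
sketch. It assumes that a private variant is one seen in exactly one of
the input sets, and it is not the approach described in the gist: the
function name privates and the (chrom, pos, ref, alt) tuple used as a
variant key are hypothetical.

```python
from collections import Counter


def privates(variants_by_sample):
    """Return, for each sample, the variants that no other sample has."""
    counts = Counter()
    for variants in variants_by_sample.values():
        counts.update(set(variants))  # count each variant once per sample
    return {
        name: {variant for variant in set(variants) if counts[variant] == 1}
        for name, variants in variants_by_sample.items()
    }
```

Everything is held in memory here, so this only shows the logic and
says nothing about how it would scale.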
>> >     > In fact now it seems to me that this tool would be best
>> >     > implemented as a library (with support for all the features
>> >     > mentioned in the gist) to be used in conjunction with tabix. (If
>> >     > anyone wants to help me write it in python, I reserved the name
>> >     > PrivatePy on pypi :3. I don't want to commit this early to extra
>> >     > work so, if you like the idea, please offer some help, I still
>> >     > have a DBMS to think about :) If you want to write it in Ruby I
>> >     > can still help, ofc, no cool name tho).
>> >     >
>> >     > I prefer python because it seems to me that python is the
>> >     > language with the most educational value since the "private"
>> >     > concept is not private to biology alone and also it's the language
>> >     > I know best (and accordingly I can help most effectively with).
>> >     >
>> >     > Nevertheless, as I stated in my proposal, this script can also
>> >     > be implemented as a processing step during the import to the DBMS.
>> >     > Unless you're working with really huge amounts of data, you
>> >     > shouldn't expect the DBMS to be faster than a command-line
>> >     > utility, tho.
>> >     >
>> >     > Now, what I want to understand is how exactly VCF files constitute
>> >     > a bottleneck:
>> >     >
>> >     > 1. Regarding performance: are there other computationally heavy
>> >     >     operations (like privates once were :D )? Mixing filtering and
>> >     >     other "by row" rules doesn't really count as 'heavy', I'm
>> >     >     talking about ugly cross-referencing business.
>> >     >
>> >     > 2. Regarding current cases: what operations are really easy but
>> >     >     made tedious by lack of proper interfaces / inconsistent
>> >     >     formats / ... that this system should be expected to offer? An
>> >     >     example would be the possibility of doing a "walk-together"
>> >     >     import of multiple VCF files. This would also be extremely
>> >     >     beneficial for making private-indexing faster.
>> >     >
>> >     > 3. Regarding new cases: what new features should be considered a
>> >     >     must-have? For example, 1-click scalability? If you don't
>> >     >     already have a specific idea, don't worry, as the other details
>> >     >     fall into place I will offer some ideas depending on what the
>> >     >     most plausible solutions might offer.
>> >     >
>> >     > 4. Regarding ???: are there any other aspects that I'm missing?
>> >     >
>> >     >
>> >     > Please note that [1] is what I understand best. [2] especially is
>> >     > not easy for me: I don't work in a sequencing center so if you
>> >     > want to point something out please add a little context and don't
>> >     > be afraid to paste some example code of what you are doing now
>> >     > and how you think it should be done if you had this system
>> >     > already available.
>> >     >
>> >     > As of now, other than doing some more exploring on my solution for
>> >     > computing privates, this is the first hurdle to jump. Talking with
>> >     > Francesco, it seemed that privates were the biggest computational
>> >     > problem, meaning that, unless someone points to something that
>> >     > I'm missing, the focus should be more on ease of use and less on
>> >     > "raw" performance (since every DBMS has its own kinks).
>> >     >
>> >     > Next week I will start writing the blog and publish some other
>> >     > information about what I'm doing to make following the
>> >     > development easier.
>> >     >
>> >     >
>> >     >
>> >     > Thank you all for your time and please don't bother wasting time
>> >     > on courtesy/etiquette: if you have any objections share them as
>> >     > soon as possible, I don't mind criticism and "form" is just a
>> >     > prefix of formalism.
>> >     >
>> >     >
>> >     >
>> >     >
>> >     > 2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:
>> >     >
>> >     >> Hi Lori,
>> >     >>
>> >     >> Congrats from BioRuby for being accepted as a GSoC student this
>> >     >> year!
>> >     >>
>> >     >> To all others: Lori is going to work on a scalable NoSQL VCF
>> >     >> container for BioRuby with Francesco as a primary mentor. I am
>> >     >> pretty excited about this project - VCF parsing is quite a
>> >     >> bottleneck in many sequencing centers (including ours) and a NoSQL
>> >     >> solution may just be the right idea.
>> >     >>
>> >     >> From here on we will discuss this project on this ML.
>> >     >>
>> >     >> Lori, I am on IRC this morning, if you like to chat.
>> >     >>
>> >     >> Pj.
>> >     >>
>> >     >
>> >     >
>> >     _______________________________________________
>> >     BioRuby Project - http://www.bioruby.org/
>> >     BioRuby mailing list
>> >     BioRuby at lists.open-bio.org
>> >     http://lists.open-bio.org/mailman/listinfo/bioruby
>> >
>> >
>> >
>> >
>> > --
>> >
>> > Francesco Strozzi
>>
>
>
>
> --
>
> Francesco Strozzi
>