[BioRuby] GSOC
Loris Cro
l.cro at campus.unimib.it
Mon May 12 16:12:29 UTC 2014
I'm trying to write a list of all the problems that must be addressed:
https://github.com/kappaloris/GSoC-2014-OBF/blob/master/problems-features.md
For now I believe I should try to fill in the first section as much as
possible, and I wouldn't mind some input in that regard.
I stubbed a possible data model that would preserve all the information
present in the VCF files, also considering the possibility of having
multiple reference genomes inside a single collection.
https://gist.github.com/kappaloris/462082314dc2e940ba4e
How to merge the results of queries is still TBD, tho.
2014-05-12 14:23 GMT+02:00 Francesco Strozzi <francesco.strozzi at gmail.com>:
> I think this is a slightly different point, which is more related to
> downstream software accepting only particular variations, like working with
> SNPs but not InDels or multi-allelic sites, as in the example you
> mentioned. I agree it is a good approach in these situations to have an
> "ignore errors" option, which only discards unwanted or bad records
> instead of dropping the whole dataset :-), but this is not something
> strictly related to the VCF format per se. My point was that the VCF
> specification already gives a certain degree of flexibility, and callers,
> when they take liberties, also provide enough information in the
> header to see where and what kind of custom data is added to each record
> (I am talking mainly about the INFO field here).
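
To make the point about the header concrete, here is a minimal sketch of
reading the ##INFO declarations and using them to type the INFO column of
a record. It assumes the header lines follow the
ID/Number/Type/Description layout from the VCF specification; the function
names info_definitions and parse_info are made up for this example and do
not come from any existing library.

```python
import re

# Matches structured header lines such as:
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
_INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)"'
)


def info_definitions(header_lines):
    """Collect the INFO keys declared in the header (hypothetical helper)."""
    defs = {}
    for line in header_lines:
        m = _INFO_RE.match(line)
        if m:
            defs[m.group("id")] = {
                "number": m.group("number"),
                "type": m.group("type"),
                "description": m.group("desc"),
            }
    return defs


def parse_info(field, defs):
    """Split an INFO column into a dict, casting values by declared type.

    Keys that the header does not declare are kept as plain strings.
    """
    casts = {"Integer": int, "Float": float}
    out = {}
    if field == ".":
        return out
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            out[key] = True  # a key without a value is a flag
            continue
        cast = casts.get(defs.get(key, {}).get("type"), str)
        # "." marks a missing value and is mapped to None
        values = [None if v == "." else cast(v) for v in value.split(",")]
        out[key] = values[0] if len(values) == 1 else values
    return out
```

Undeclared keys are tolerated on purpose, so a caller that takes liberties
does not break the parse.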
>
> Ciao!
> Francesco
>
>
> On Mon, May 12, 2014 at 12:00 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
>
>> Not completely accurate. Variant callers do take liberties. I have
>> just had a varscan2 result which was rejected by cartagenia. Obviously
>> the latter is not flexible in what it accepts ;). Turns out it does
>> not allow the variant field to contain multiple nucleotides.
>>
>> My main point is that, if you want your software to be generally
>> useful, you cannot predict what liberties programmers take. That is
>> in fact the secret of the success of 'flexible' formats in
>> bioinformatics - think VCF, FASTA, GFF3, SAM etc. The trick is to have
>> minimal guidelines on what you expect - but don't become rigorous or,
>> if you can't resist being rigorous, make it so that you can switch it
>> off. With bio-vcf I have added a --ignore-errors option for that very
>> reason.
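
As an illustration of the "be rigorous, but make it switchable" idea, the
following is a minimal sketch of a record reader with an ignore_errors
switch. It is not how bio-vcf implements --ignore-errors; the names
BadRecord, parse_record and read_records are hypothetical, and the
validation is deliberately shallow (column count and an integer POS).

```python
import sys


class BadRecord(ValueError):
    """Raised when a data line cannot be interpreted as a VCF record."""


def parse_record(line):
    # Only the fixed leading columns are checked in this sketch.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise BadRecord("expected at least 8 columns, got %d" % len(fields))
    chrom, pos, _id, ref, alt = fields[:5]
    try:
        pos = int(pos)
    except ValueError:
        raise BadRecord("POS is not an integer: %r" % pos)
    return {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt.split(",")}


def read_records(lines, ignore_errors=False, log=sys.stderr):
    """Yield parsed records; with ignore_errors, skip and report bad lines."""
    for lineno, line in enumerate(lines, 1):
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        try:
            yield parse_record(line)
        except BadRecord as err:
            if not ignore_errors:
                raise
            print("skipping line %d: %s" % (lineno, err), file=log)
```

With the default ignore_errors=False the first bad line aborts the run;
with ignore_errors=True only that record is dropped and the rest of the
dataset survives.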
>>
>> Pj.
>>
>> On Mon, May 12, 2014 at 12:01:09PM +0200, Francesco Strozzi wrote:
>> > Hi Loris,
>> > *if* the VCF file is generated following general rules and guidelines,
>> > what may change is the presence / absence of keys in the INFO and
>> > GENOTYPE fields. Normally variation callers provide information on the
>> > INFO field composition in the VCF header. We will provide you with VCF
>> > example files generated with the latest versions of the most used
>> > calling software, i.e. Samtools (v2.0-rc7), FreeBayes (v0.9.14) and
>> > GATK (v3.1) so you can have a look at differences.
>> >
>> > Francesco
>> >
>> >
>> > On Mon, May 12, 2014 at 11:55 AM, Loris Cro <l.cro at campus.unimib.it> wrote:
>> >
>> > Pjotr pointed out in another discussion that VCF files have
>> > some differences depending on what program generated them.
>> >
>> > How can I find out more about these differences? Or is it only a
>> > matter of custom keys in the INFO and/or FORMAT field structure?
>> > Basically I'm wondering if the VCF file specification is enough to
>> > understand these differences.
>> >
>> > Also, since the objective is to work with multiple files, some fields
>> > seem to lose meaning in that context. Is there any convention
>> > regarding that matter?
>> >
>> >
>> > 2014-04-29 19:46 GMT+02:00 Loris Cro <l.cro at campus.unimib.it>:
>> >
>> > > Hi all! Let me start the discussion with some info about what I've
>> > > done, what I'm planning to do next and what questions I need help
>> > > with.
>> > >
>> > >
>> > > As far as I understand, the project idea published on the OBF
>> > > wiki was primarily to answer the problem of computing privates.
>> > > There are other features that you are interested in, but this
>> > > specific problem was the biggest pain point. I say "was" because
>> > > in fact computing privates is not that hard in the end (no JOINs or
>> > > heavy denormalization required anymore) as you can see by
>> > > reading:
>> > >
>> > > https://gist.github.com/kappaloris/11356517
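
For readers without the gist at hand, here is a naive in-memory sketch of
what "computing privates" means, assuming a private is a variant observed
in exactly one sample. This is not the approach described in the gist; the
function name private_variants and the (chrom, pos, ref, alt) tuple used
as a variant key are assumptions made for this example.

```python
from collections import defaultdict


def private_variants(samples):
    """Return, per sample, the variants seen in that sample only.

    samples maps a sample name to an iterable of hashable variant keys,
    here assumed to be (chrom, pos, ref, alt) tuples.
    """
    seen_in = defaultdict(set)
    for name, variants in samples.items():
        for variant in variants:
            seen_in[variant].add(name)

    privates = {name: set() for name in samples}
    for variant, names in seen_in.items():
        if len(names) == 1:
            (owner,) = names  # the single sample carrying this variant
            privates[owner].add(variant)
    return privates
```

Everything is held in memory, so this only shows the definition; it says
nothing about how to do it at scale.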
>> > >
>> > > In fact now it seems to me that this tool would be best implemented
>> > > as a library (with support for all the features mentioned in the
>> > > gist) to be used in conjunction with tabix. (If anyone wants to help
>> > > me write it in python, I reserved the name PrivatePy on pypi :3. I
>> > > don't want to commit this early to extra work so, if you like the
>> > > idea, please offer some help, I still have a DBMS to think about :)
>> > > If you want to write it in Ruby I can still help, ofc, no cool name
>> > > tho).
>> > >
>> > > I prefer python because it seems to me that python is the language
>> > > with the most educational value since the "private" concept is not
>> > > private to biology alone and also it's the language I know best (and
>> > > accordingly I can help most effectively with).
>> > >
>> > > Nevertheless, as I stated in my proposal, this script can also
>> > > be implemented as a processing step during the import to the DBMS.
>> > > Unless you're working with really huge amounts of data, you
>> > > shouldn't expect the DBMS to be faster than a command-line utility,
>> > > tho.
>> > >
>> > > Now, what I want to understand is how exactly VCF files constitute
>> > > a bottleneck:
>> > >
>> > > 1. Regarding performance: are there other computationally heavy
>> > >    operations (like privates once were :D )? Mixing filtering and
>> > >    other "by row" rules doesn't really count as 'heavy', I'm talking
>> > >    about ugly cross-referencing business.
>> > >
>> > > 2. Regarding current cases: what operations are really easy but made
>> > >    tedious by lack of proper interfaces / inconsistent formats / ...
>> > >    that this system should be expected to offer? An example would be
>> > >    the possibility of doing a "walk-together" import of multiple VCF
>> > >    files. This would also be extremely beneficial for making
>> > >    private-indexing faster.
>> > >
>> > > 3. Regarding new cases: what new features should be considered a
>> > >    must-have? For example, 1-click scalability? If you don't
>> > >    already have a specific idea, don't worry, as the other details
>> > >    fall into place I will offer some ideas depending on what the
>> > >    most plausible solutions might offer.
>> > >
>> > > 4. Regarding ???: are there any other aspects that I'm missing?
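
The "walk-together" import from point 2 could look roughly like the sketch
below: several VCF streams are advanced in lockstep and grouped by site.
It assumes every input is already sorted by (CHROM, POS) under the same
ordering (chromosome names are compared as plain strings here), and the
names keyed and walk_together are hypothetical.

```python
import heapq
from itertools import groupby


def keyed(name, lines):
    """Yield ((chrom, pos), name, record_line) for each data line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        chrom, pos, _rest = line.rstrip("\n").split("\t", 2)
        yield (chrom, int(pos)), name, line.rstrip("\n")


def walk_together(files):
    """Yield (site, {file_name: record_line}) in site order.

    files maps a name to an iterable of lines. If one file has several
    records at the same site, only the last one is kept in this sketch.
    """
    streams = [keyed(name, lines) for name, lines in files.items()]
    # heapq.merge pulls lazily from each stream, so nothing is loaded whole
    merged = heapq.merge(*streams, key=lambda item: item[0])
    for site, group in groupby(merged, key=lambda item: item[0]):
        yield site, {name: record for _, name, record in group}
```

A site whose dictionary has a single entry was seen in one file only,
which is why a pass of this kind would also help with private-indexing.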
>> > >
>> > >
>> > > Please note that [1] is what I understand best. [2] especially is not
>> > > easy for me: I don't work in a sequencing center so if you want to
>> > > point something out please add a little context and don't be afraid
>> > > to paste some example code of what you are doing now and how
>> > > you think it should be done if you had this system already
>> > > available.
>> > >
>> > > As of now, other than doing some more exploring on my solution for
>> > > computing privates, this is the first hurdle to jump. Talking with
>> > > Francesco, it seemed that privates were the biggest computational
>> > > problem, meaning that, unless someone points to something that
>> > > I'm missing, the focus should be more on ease of use and less on
>> > > "raw" performance (since every DBMS has its own kinks).
>> > >
>> > > Next week I will start writing the blog and publish some other
>> > > information about what I'm doing to make the development easier
>> > > to follow.
>> > >
>> > >
>> > >
>> > > Thank you all for your time and please don't bother wasting time
>> > > on courtesy/etiquette: if you have any objections share them as
>> > > soon as possible, I don't mind criticism and "form" is just a
>> > > prefix of formalism.
>> > >
>> > >
>> > >
>> > >
>> > > 2014-04-27 8:02 GMT+02:00 Pjotr Prins <pjotr.public14 at thebird.nl>:
>> > >
>> > >> Hi Lori,
>> > >>
>> > >> Congrats from BioRuby for being accepted as a GSoC student this year!
>> > >>
>> > >> To all others: Lori is going to work on a scalable NoSQL VCF container
>> > >> for BioRuby with Francesco as a primary mentor. I am pretty excited
>> > >> about this project - VCF parsing is quite a bottleneck in many
>> > >> sequencing centers (including ours) and a NoSQL solution may just be
>> > >> the right idea.
>> > >>
>> > >> From here on we will discuss this project on this ML.
>> > >>
>> > >> Lori, I am on IRC this morning, if you like to chat.
>> > >>
>> > >> Pj.
>> > >>
>> > >
>> > >
>> >
>> >
>> >
>> >
>> > --
>> >
>> > Francesco Strozzi
>>
>
>
>
> --
>
> Francesco Strozzi
>