[Biopython-dev] Merging the GFF3 and VCF branches

Eric Talevich eric.talevich at gmail.com
Tue Jun 9 19:17:33 UTC 2015


On Thu, Jun 4, 2015 at 3:44 AM, Peter Cock <p.j.a.cock at googlemail.com>
wrote:

> This would be great to have merged - pathological test cases
> and interconversion too :)
>
> Did we settle on a plan for parent/child relationships in
> SeqFeature objects (beyond deprecating sub_features
> which has been replaced with CompoundLocations)?
>
> Peter
>

The last thread I see on this topic is from the end of summer 2012:
http://mailman.open-bio.org/pipermail/biopython-dev/2012-July/018979.html
(thread)
http://mailman.open-bio.org/pipermail/biopython-dev/2012-September/019101.html
(terminal)

I'm a bit confused because the CompoundLocation class exists in
Bio/SeqFeature.py, and git blame says it was written in late 2011 --
Peter's Time Machine in action? Does the f_loc5 branch modify the existing
CompoundLocation class, then?

The threads above also mention a deprecation process. To begin that
process, I suppose we need to decide what we're deprecating sub_features in
favor of, land the new functionality, and then trigger a DeprecationWarning
from the old-and-tired sub_features attribute, along with some shim to keep
things working approximately the way they used to?
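
To make the shim idea concrete, here is a rough sketch of the general
pattern. It is not Biopython's actual code: the Feature class and its
"parts" attribute are hypothetical names standing in for whatever
replacement we settle on. Only the warnings module usage is standard Python.

```python
import warnings


class Feature:
    """Hypothetical stand-in for a SeqFeature-like class (not Biopython code)."""

    def __init__(self, parts=None):
        # "parts" is an assumed name for the new representation.
        self.parts = list(parts or [])

    @property
    def sub_features(self):
        # Old attribute still works, but warns and forwards to the new one.
        warnings.warn(
            "sub_features is deprecated; use parts instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.parts

    @sub_features.setter
    def sub_features(self, value):
        warnings.warn(
            "sub_features is deprecated; use parts instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.parts = list(value)
```

Old code that reads or assigns sub_features would keep running and see a
DeprecationWarning pointing at the caller's line, while new code uses the
replacement attribute directly.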

Even if a perfectly smooth transition isn't possible, I think it's
worthwhile to make a gentle break to allow Biopython to correctly handle
modern file formats for genomic features/annotations.


> On Thu, Jun 4, 2015 at 10:54 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >
> > Eric;
> > Thanks for looking at this. +1 on getting Lenna's work in and I'll let
> > her comment on that compared to the current state of VCF support in
> > pysam and PyVCF. For GFF, I'd actually rather see
> > integration/collaboration with Ryan's gffutils:
> >
> > https://github.com/daler/gffutils
> >
> > It uses sqlite to organize the data and is much better engineered than
> > my GFF work. He took all my pathological test cases and made them work,
> > and it also has initial biopython integration:
> >
> > https://github.com/daler/gffutils/blob/master/gffutils/biopython_integration.py
> >
>

Ryan is a superstar. I see gffutils is MIT-licensed, too, so maybe we can
just copy a relevant chunk of the code?



> > The main work would be to take some of the scripts in bcbio-gff that
> > folks find useful, like the GFF/GenBank conversion through SeqIO, and
> > port these over. This has been something I wanted to do for a while but
> > never got done. What does everyone think?
> > Brad
>

These?:
https://github.com/chapmanb/bcbb/tree/master/gff/Scripts/gff

I like that plan. The main goal in my mind is to provide a sensible
substrate in Biopython for integrating the "tabix" family of formats, using
SeqFeature as a core object and making it a little more useful, rather than
trying to provide a full-featured environment or high-performance I/O. I
think Lenna's work was headed in this direction, so I'd also like to focus
on merging that functionality and seeing what else falls out of it.
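
As a rough illustration of what I mean by a substrate, here is a minimal
sketch in plain Python. It is not the bcbio-gff or gffutils parser, and
GffRecord, parse_gff3_line and group_children are names I made up for the
example. It reads one GFF3 line into a small record, converts the 1-based
inclusive GFF3 coordinates into the 0-based half-open style Biopython uses,
and groups records under their Parent IDs.

```python
from collections import namedtuple
from urllib.parse import unquote

# Hypothetical lightweight record; a real version would build SeqFeature objects.
GffRecord = namedtuple("GffRecord", "seqid type start end strand attributes")


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a GffRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # Values may be comma-separated lists with URL-style escapes.
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    # GFF3 is 1-based inclusive; shift start to get 0-based half-open.
    strand_value = {"+": 1, "-": -1}.get(strand)
    return GffRecord(seqid, ftype, int(start) - 1, int(end), strand_value, attributes)


def group_children(records):
    """Map each parent ID to the list of records naming it in Parent=."""
    children = {}
    for rec in records:
        for parent_id in rec.attributes.get("Parent", []):
            children.setdefault(parent_id, []).append(rec)
    return children
```

The parent/child lookup here is a plain dictionary, which sidesteps the
question of where the relationship should live on SeqFeature itself.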

-Eric

>
> >> Biopythoneers,
> >>
> >> I am interested in improving Biopython's support for genomic data,
> >> namely through merging the existing GFF3 and VCF branches.
> >>
> >> Where we last left off, Brad's GFF branch was available on a fork:
> >> http://biopython.org/wiki/GFF_Parsing
> >> https://github.com/chapmanb/bcbb/tree/master/gff
> >>
> >> When this branch was submitted to Biopython, in 2009 or so, there was a
> >> subtle conflict with the way nested annotations were represented as
> >> SeqFeatures in Biopython. Peter tested several possible resolutions to
> >> this issue on branches, the last of which appears to be f_loc5:
> >> https://github.com/peterjc/biopython/tree/f_loc5
> >>
> >> For GSoC 2012, Lenna developed a VCF parser and genomic coordinate
> >> mapper compatible with Peter's SeqFeature updates (actually the f_loc4
> >> branch, I guess?) and Brad's GFF parser:
> >> http://biopython.org/wiki/Google_Summer_of_Code#Representation_and_manipulation_of_genomic_variants
> >> http://arklenna.tumblr.com/post/29808300789/and-the-summer-ends
> >> https://github.com/lennax/biopython/
> >>
> >> What would it take to merge all of this once-recent work into Biopython?
> >> Are the SeqFeature CompoundLocation changes satisfactory and ready to
> >> merge into the mainline? Are we willing to make this compatibility break?
> >> If not, should we instead add another class/module to support the new
> >> behavior (BetterSeqFeature)?
> >>
> >> Happy to help,
> >> Eric
> >> _______________________________________________
> >> Biopython-dev mailing list
> >> Biopython-dev at mailman.open-bio.org
> >> http://mailman.open-bio.org/mailman/listinfo/biopython-dev