[Biopython-dev] Merging the GFF3 and VCF branches
Ryan Dale
dalerr at niddk.nih.gov
Wed Jun 10 19:15:11 UTC 2015
On 06/10/2015 01:44 PM, Brad Chapman wrote:
> Eric and Peter;
> Thanks again for moving this forward. cc'ing in Ryan as well, in case he
> hasn't seen this discussion.
>
>> So I suppose the remaining tasks are, in no particular order:
>>
>> - Add/port Brad's GFF-GenBank converters and tests to Biopython. Ensure all
>> the tests pass.
> I'd suggest moving those scripts to use gffutils, rather than rely on
> bcbb/gff. Ryan's implementation is better and I'd prefer to deprecate
> mine and move forward with his work.
>
>> - Enable GFF3 support by merging or porting from Brad's branch, bcbb/gff,
>> or gffutils?
> My vote is for gffutils.
>
>>> What to add for parent/child relationships between features is
>>> yet to be decided.
>> I wonder if we can follow the lead of one of the GFF implementations
>> mentioned above.
>>
>> Has this been discussed in a more recent thread that I didn't link
>> here?
> I lost this as well so am not sure the best starting place. I don't have
> a strong opinion and open to doing whatever y'all think is best.
>
> Thanks again,
> Brad
Hi all -
Brad, thanks for the CC. I'd be happy to help get any or all of
gffutils into Biopython. Let me give a high-level overview so you can
decide what makes sense to bring in.
There are two main tricky parts to working with GFF/GTF: parsing the
attributes and inferring the hierarchy of parent/child relationships.
The parsing is mostly self-contained in gffutils.parser. It borrows the
idea of a "dialect" from the built-in Python csv module, and the kinds
of trickiness we see in Brad's pathological cases are encoded in the
fields of the dialect (see comments in the gffutils.constants.dialect
dictionary).
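To make the dialect idea concrete, here is a minimal sketch in plain Python. It is not gffutils' actual code: the dialect keys (field_sep, keyval_sep, multival_sep, quoted) and the parse_attributes function are hypothetical names chosen for illustration, and the real gffutils dialect has more fields to cover the pathological cases.

```python
# Hypothetical dialects: these keys are illustrative only and are NOT the
# actual fields of gffutils.constants.dialect.
GFF3_DIALECT = {"field_sep": ";", "keyval_sep": "=", "multival_sep": ",", "quoted": False}
GTF_DIALECT = {"field_sep": ";", "keyval_sep": " ", "multival_sep": ",", "quoted": True}


def parse_attributes(text, dialect):
    """Parse a GFF/GTF attributes column into a dict of lists.

    Naive on purpose: it does not handle separators that appear inside
    quoted values, which is one of the cases a real parser has to cover.
    """
    attrs = {}
    for field in text.strip().split(dialect["field_sep"]):
        field = field.strip()
        if not field:
            continue  # skip the empty piece left by a trailing separator
        key, _, value = field.partition(dialect["keyval_sep"])
        value = value.strip()
        if dialect["quoted"]:
            value = value.strip('"')
        attrs.setdefault(key.strip(), []).extend(value.split(dialect["multival_sep"]))
    return attrs


gff3_attrs = parse_attributes("ID=mRNA1;Parent=gene1,gene2", GFF3_DIALECT)
gtf_attrs = parse_attributes('gene_id "g1"; transcript_id "t1";', GTF_DIALECT)
```

The point is that the same parsing function handles both formats, and the per-file quirks live in the dialect rather than in the code.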
The relationships are by far the hardest. I could write a lot about the
difficulties of GFF vs GTF, but let's just say a sqlite3 db is the most
portable and performant way I've found to use both GFF and GTF and
interconvert between them. The bulk of gffutils' code and complexity is
for working on this task.
Converting GFF to Biopython objects while reliably keeping track of
parent/child relations requires parsing the entire file, creating a
database, and then querying the db for the relations. gffutils does
this, and currently creates SeqFeature objects. Any additional
CompoundLocation stuff can easily be added, as long as there's a
gffutils database to get relationship info from. Likewise, assuming
presence of a db, Brad's scripts can easily be ported. I can certainly
work on this.
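As a rough illustration of the parse, load, then query workflow, here is a sketch using only the standard library's sqlite3 module. The two-table schema and the build_db and descendants helpers are assumptions made up for this example, not gffutils' real schema or API; it also assumes an SQLite build with recursive CTE support (3.8.3 or later).

```python
import sqlite3


def build_db(records):
    """Load (id, featuretype, parent_id or None) tuples into an in-memory db.

    Hypothetical, simplified schema for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    for fid, ftype, parent in records:
        conn.execute("INSERT INTO features VALUES (?, ?)", (fid, ftype))
        if parent is not None:
            # No ordering constraint: a child may be seen before its parent.
            conn.execute("INSERT INTO relations VALUES (?, ?)", (parent, fid))
    conn.commit()
    return conn


def descendants(conn, fid):
    """Return ids of all children, grandchildren, etc. of fid, sorted by id."""
    query = """
        WITH RECURSIVE sub(id) AS (
            SELECT child FROM relations WHERE parent = ?
            UNION
            SELECT r.child FROM relations r JOIN sub ON r.parent = sub.id
        )
        SELECT id FROM sub ORDER BY id
    """
    return [row[0] for row in conn.execute(query, (fid,))]


# The exon is listed before its mRNA, and the mRNA before its gene.
conn = build_db([
    ("exon1", "exon", "mRNA1"),
    ("mRNA1", "mRNA", "gene1"),
    ("gene1", "gene", None),
])
gene_parts = descendants(conn, "gene1")
```

Because relations are only queried after the whole file is loaded, the order of lines in the file stops mattering, which is the reason the full parse is needed before any hierarchy can be trusted.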
So I guess the big question is whether you want to introduce all the
sqlite3 machinery to Biopython in order to access relationship info, or
just use the parser.
thanks,
-ryan
P.S. I've been on the dev list but had its messages going to an unread
mailbox folder. I'll be checking it regularly now; sorry about that.