[Biopython-dev] Merging the GFF3 and VCF branches

Ryan Dale dalerr at niddk.nih.gov
Wed Jun 10 19:15:11 UTC 2015


On 06/10/2015 01:44 PM, Brad Chapman wrote:
> Eric and Peter;
> Thanks again for moving this forward. cc'ing in Ryan as well, in case he
> hasn't seen this discussion.
>
>> So I suppose the remaining tasks are, in no particular order:
>>
>> - Add/port Brad's GFF-GenBank converters and tests to Biopython. Ensure all
>> the tests pass.
> I'd suggest moving those scripts to use gffutils, rather than rely on
> bcbb/gff. Ryan's implementation is better and I'd prefer to deprecate
> mine and move forward with his work.
>
>> - Enable GFF3 support by merging or porting from Brad's branch, bcbb/gff,
>> or gffutils?
> My vote is for gffutils.
>
>> What to add for parent/child relationships between features is
>>> yet to be decided.
>> I wonder if we can follow the lead of one of the GFF implementations
>> mentioned above.
>>
>> Has this been discussed in a more recent thread that I didn't link
>> here?
> I lost this as well so am not sure the best starting place. I don't have
> a strong opinion and open to doing whatever y'all think is best.
>
> Thanks again,
> Brad

Hi all -

Brad, thanks for the CC.  I'd be happy to help out getting any/all of 
gffutils into BioPython. Let me give a high-level overview so you can 
decide what makes sense to bring into BioPython . . .

There are two main tricky parts to working with GFF/GTF: parsing the 
attributes and inferring the hierarchy of parent/child relationships.

The parsing is mostly self-contained in gffutils.parser. It borrows the 
idea of a "dialect" from the built-in Python csv module, and the kinds 
of trickiness we see in Brad's pathological cases are encoded in the 
fields of the dialect (see comments in the gffutils.constants.dialect 
dictionary).
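
As a rough illustration of the dialect idea (this is not gffutils' actual 
code; the dialect keys and the parse_attributes helper below are 
hypothetical and far simpler than the real dictionary), the same parsing 
function can handle GFF3-style and GTF-style attributes by changing only 
the dialect:

```python
# Hypothetical, simplified dialects. The real gffutils dialect has more
# fields and uses different key names; these are assumptions for the sketch.
GFF3_DIALECT = {"field_sep": ";", "keyval_sep": "=",
                "multival_sep": ",", "quoted": False}
GTF_DIALECT = {"field_sep": ";", "keyval_sep": " ",
               "multival_sep": ",", "quoted": True}


def parse_attributes(text, dialect):
    """Parse a column-9 attribute string into a dict mapping key -> list of values.

    Simplification: separators inside quoted values are not handled.
    """
    attrs = {}
    for field in text.strip().split(dialect["field_sep"]):
        field = field.strip()
        if not field:
            continue  # tolerate a leading or trailing separator
        key, _, value = field.partition(dialect["keyval_sep"])
        value = value.strip()
        if dialect["quoted"]:
            value = value.strip('"')
        values = value.split(dialect["multival_sep"]) if value else []
        # Repeated keys accumulate their values instead of overwriting.
        attrs.setdefault(key.strip(), []).extend(values)
    return attrs


gff3 = parse_attributes("ID=mRNA1;Parent=gene1,gene2", GFF3_DIALECT)
gtf = parse_attributes('gene_id "g1"; transcript_id "t1";', GTF_DIALECT)
```

Keeping the quirks in a data structure rather than in branches of the 
parser means a new pathological file needs a new dialect, not new code.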

The relationships are by far the hardest. I could write a lot about the 
difficulties of GFF vs GTF, but let's just say a sqlite3 db is the most 
portable and performant way I've found to use both GFF and GTF and 
interconvert between them. The bulk of gffutils' code and complexity is 
devoted to this task.
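
To make the database approach concrete, here is a minimal sketch using 
only the standard-library sqlite3 module. The table layout, column names 
and the children helper are assumptions for illustration, not gffutils' 
real schema or API. It stores direct Parent links as level-1 relations, 
derives grandparent links as level 2, and then answers "what are the 
children of X?" with a query:

```python
import sqlite3

# (feature id, feature type, parent id), as might be read from the
# ID/Parent attributes of a small GFF3 file. Made-up example data.
rows = [
    ("gene1", "gene", None),
    ("mRNA1", "mRNA", "gene1"),
    ("exon1", "exon", "mRNA1"),
    ("exon2", "exon", "mRNA1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
conn.execute("CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER)")
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(fid, ftype) for fid, ftype, _ in rows])
# Level 1: direct parent -> child links.
conn.executemany("INSERT INTO relations VALUES (?, ?, 1)",
                 [(parent, fid) for fid, _, parent in rows if parent is not None])
# Level 2: grandparent -> grandchild links, derived from the level-1 links.
conn.execute("""
    INSERT INTO relations
    SELECT p.parent, c.child, 2
    FROM relations AS p JOIN relations AS c ON p.child = c.parent
    WHERE p.level = 1 AND c.level = 1
""")


def children(conn, parent_id, featuretype=None):
    """Return ids of all recorded descendants, optionally filtered by type."""
    sql = ("SELECT f.id FROM relations AS r "
           "JOIN features AS f ON f.id = r.child WHERE r.parent = ?")
    params = [parent_id]
    if featuretype is not None:
        sql += " AND f.featuretype = ?"
        params.append(featuretype)
    sql += " ORDER BY f.id"
    return [row[0] for row in conn.execute(sql, params)]


exons_of_gene = children(conn, "gene1", featuretype="exon")
```

The relations can only be completed once every line has been seen, which 
is why the whole file has to be parsed before any hierarchy query is 
reliable.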

Converting GFF to BioPython objects while reliably keeping track of 
parent/child relations requires parsing the entire file, creating a 
database, and then querying the db for the relations. gffutils does 
this, and currently creates SeqFeature objects. Any additional 
CompoundLocation stuff can easily be added, as long as there's a 
gffutils database to get relationship info from. Likewise, assuming 
presence of a db, Brad's scripts can easily be ported. I can certainly 
work on this.

So I guess the big question is whether you want to introduce all the sqlite3 
machinery to BioPython in order to access relationship info, or just use 
the parser.

thanks,
-ryan

P.S. I've been on the dev list but had its messages going to an unread 
mailbox folder. I'll be checking it regularly now; sorry about that.

