[Biopython-dev] Merging the GFF3 and VCF branches

Thu Jun 25 14:16:26 UTC 2015

On 06/24/2015 05:54 PM, Eric Talevich wrote:
> On Wed, Jun 10, 2015 at 12:15 PM, Ryan Dale <dalerr at niddk.nih.gov 
> <mailto:dalerr at niddk.nih.gov>> wrote:
>
>
>     On 06/10/2015 01:44 PM, Brad Chapman wrote:
>
>         Eric and Peter;
>         Thanks again for moving this forward. cc'ing in Ryan as well,
>         in case he
>         hasn't seen this discussion.
>
>             So I suppose the remaining tasks are, in no particular order:
>
>             - Add/port Brad's GFF-GenBank converters and tests to
>             Biopython. Ensure all
>             the tests pass.
>
>         I'd suggest moving those scripts to use gffutils, rather than
>         rely on
>         bcbb/gff. Ryan's implementation is better and I'd prefer to
>         deprecate
>         mine and move forward with his work.
>
>             - Enable GFF3 support by merging or porting from Brad's
>             branch, bcbb/gff,
>             or gffutils?
>
>         My vote is for gffutils.
>
>             What to add for parent/child relationships between features is
>
>                 yet to be decided.
>
>             I wonder if we can follow the lead of one of the GFF
>             implementations
>             mentioned above.
>
>             Has this been discussed in a more recent thread that I
>             didn't link
>             here?
>
>         I lost this as well so am not sure the best starting place. I
>         don't have
>         a strong opinion and open to doing whatever y'all think is best.
>
>         Thanks again,
>         Brad
>
>
>     Hi all -
>
>     Brad, thanks for the CC.  I'd be happy to help out getting any/all
>     of gffutils into BioPython. Let me give a high-level overview so
>     you can decide what makes sense to bring into BioPython . . .
>
>
> Awesome. (Sorry for the lag.) I've looked through the gffutils code to 
> see how this might work.
>
> Starting with the most mundane, I see that gffutils has these 
> dependencies (Biopython aims for a functioning dependency-free 
> installation):
>
> six-- Could port using Bio._py3k, straightforward but monotonous work.
>
> argh, argcomplete -- Only for argument parsing in the "gffutils-cli" 
> script, maybe not needed in Biopython.
>
> simplejson-- I think this is roughly the same code as the "json" 
> module in the standard library of Python 2.6+. Since Biopython doesn't 
> support Python 2.5 anymore, we can probably just import "json" instead 
> of "simplejson" in feature.py and helpers.py.
>
> pyfaidx-- This takes some consideration. Since GSoC 2014, Biopython 
> can index a genome-scale FASTA file with sqlite3 using its own index 
> format, not the samtools faidx format. I don't see a ton of pygr-style 
> indexing in gffutils beyond just extracting the specified subsequences 
> from a FASTA file, so Biopython's internal solution may suffice. This 
> is not yet merged; the pull request is here:
> https://github.com/biopython/biopython/pull/356
>
> If reading the .fai file is mandatory but writing it is not, then I 
> can contribute a minimal ~100-line implementation of that (which could 
> alternatively go into Biopython if we prefer):
> https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py
>
>     There are two main tricky parts to working with GFF/GTF: parsing
>     the attributes and inferring the hierarchy of parent/child
>     relationships.
>
>     The parsing is mostly self-contained in gffutils.parser. It
>     borrows the idea of a "dialect" from the built-in Python csv
>     module, and the kinds of trickiness we see in Brad's pathological
>     cases are encoded in the fields of the dialect (see comments in
>     the gfftutils.constants.dialect dictionary).
>
>
> This looks valuable to have in Biopython even without inferring 
> parent-child relationships. Would it be possible to start by 
> extracting and merging the GFF3 parser, and work on the parent-child 
> relationships separately?
>
>     The relationships are by far the hardest. I could write a lot
>     about the difficulties of GFF vs GTF, but let's just say a sqlite3
>     db is the most portable and performant way I've found to use both
>     GFF and GTF and interconvert between them. The bulk of gffutils'
>     code and complexity is for working on this task.
>
>     Converting GFF to BioPython objects while reliably keeping track
>     of parent/child relations requires parsing the entire file,
>     creating a database, and then querying the db for the relations.
>     gffutils does this, and currently creates SeqFeatures objects. Any
>     additional CompoundLocation stuff can easily be added, as long as
>     there's a gffutils database to get relationship info from.
>     Likewise, assuming presence of a db, Brad's scripts can easily be
>     ported. I can certainly work on this.
>
>     So I guess the big question is if you want to introduce all the
>     sqlite3 machinery to BioPython in order to access relationship
>     info, or just use the parser.
>
>
> I think we're happy to use sqlite3 wherever it's a sensible 
> engineering choice, since it's part of the standard library. Biopython 
> users may want the option to skip the database if parent-child 
> relationships are not needed, or keep it in RAM to avoid hitting the disk.
>
>
> -Eric

Hi Eric -

Regarding the dependencies, I'm pretty sure we can drop them all (I'll 
address them individually below) for integration with Biopython. I'm 
imagining gffutils will remain as an independent project, with a subset 
of useful parts shared with Biopython. In this case, it would be 
important to minimize the "merge barrier" to facilitate 
bugfixes/improvements between code bases in both directions.

With that in mind:

> six-- Could port using Bio._py3k, straightforward but monotonous work.

I think the way to address this is to make a copy of Bio._py3k in 
gffutils and use that to replace six in the main gffutils repo. Then 
picking-and-choosing pieces of gffutils to put in Biopython will be 
straightforward: just edit the import from "gffutils._py3k" to "Bio._py3k".

> argh, argcomplete -- Only for argument parsing in the "gffutils-cli" 
> script, maybe not needed in Biopython.

Agreed, nothing targeted for Biopython integration needs these.

> simplejson-- I think this is roughly the same code as the "json" 
> module in the standard library of Python 2.6+. Since Biopython doesn't 
> support Python 2.5 anymore, we can probably just import "json" instead 
> of "simplejson" in feature.py and helpers.py.

simplejson won in some performance benchmarks I had done, but not by a 
huge amount. It's a drop-in replacement for json, so this should be a 
straightforward fix.

> pyfaidx-- This takes some consideration. Since GSoC 2014, Biopython 
> can index a genome-scale FASTA file with sqlite3 using its own index 
> format, not the samtools faidx format. I don't see a ton of pygr-style 
> indexing in gffutils beyond just extracting the specified subsequences 
> from a FASTA file, so Biopython's internal solution may suffice. This 
> is not yet merged; the pull request is here:
> https://github.com/biopython/biopython/pull/356
>
> If reading the .fai file is mandatory but writing it is not, then I 
> can contribute a minimal ~100-line implementation of that (which could 
> alternatively go into Biopython if we prefer):
> https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py

Right, it's only used for extracting subsequences from a FASTA. Could 
either drop this functionality altogether, or use the options you 
suggest. I would prefer the Biopython internal solution, since a 
read-only approach limits the user to pre-constructed indexes -- or 
requires them to install additional dependencies to create their own.

> Would it be possible to start by extracting and merging the GFF3 
> parser, and work on the parent-child relationships separately?

Certainly. I think this is a good strategy. I think it would be good to 
hold off on the sequence extraction for now as well.

> I think we're happy to use sqlite3 wherever it's a sensible 
> engineering choice, since it's part of the standard library. Biopython 
> users may want the option to skip the database if parent-child 
> relationships are not needed, or keep it in RAM to avoid hitting the disk.

If parent-child relationships are not needed, then maybe all you need to 
do is parse:

from gffutils.iterators import DataIterator
for feature in DataIterator('annotations.gff'):
     # do something with feature

The attributes-field-parsing machinery ported from Brad's code handles 
the pathological attributes fields, but other than that it's pretty 
trivial and like any line-by-line parser uses very little RAM.

Creating parent-child relationships is a different beast. This takes a 
lot longer and is a lot more complex:

from gffutils import create_db
db = create_db('annotations.gff', 'annotations.db')

I should point out that using ":memory:" for the database name puts it 
in RAM. The downside is no persistence, so the time cost of 
parsing/constructing the db (~10 mins for 1.2M-feature human GENCODE 
GTF) has to be spent every time.

Anyway, I think the next step is to get a draft PR going to iron out the 
details of parser integration. Where do you want this this live? 
Bio.GFF? Bio.GTF?

-ryan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython-dev/attachments/20150625/f587b3ca/attachment.html>