[Biopython-dev] Merging the GFF3 and VCF branches
Ryan Dale
dalerr at niddk.nih.gov
Thu Jun 25 14:16:26 UTC 2015
On 06/24/2015 05:54 PM, Eric Talevich wrote:
> On Wed, Jun 10, 2015 at 12:15 PM, Ryan Dale <dalerr at niddk.nih.gov
> <mailto:dalerr at niddk.nih.gov>> wrote:
>
>
> On 06/10/2015 01:44 PM, Brad Chapman wrote:
>
> Eric and Peter;
> Thanks again for moving this forward. cc'ing in Ryan as well,
> in case he
> hasn't seen this discussion.
>
> So I suppose the remaining tasks are, in no particular order:
>
> - Add/port Brad's GFF-GenBank converters and tests to
> Biopython. Ensure all
> the tests pass.
>
> I'd suggest moving those scripts to use gffutils, rather than
> rely on
> bcbb/gff. Ryan's implementation is better and I'd prefer to
> deprecate
> mine and move forward with his work.
>
> - Enable GFF3 support by merging or porting from Brad's
> branch, bcbb/gff,
> or gffutils?
>
> My vote is for gffutils.
>
> What to add for parent/child relationships between features is
>
> yet to be decided.
>
> I wonder if we can follow the lead of one of the GFF
> implementations
> mentioned above.
>
> Has this been discussed in a more recent thread that I
> didn't link
> here?
>
> I lost this as well so am not sure the best starting place. I
> don't have
> a strong opinion and open to doing whatever y'all think is best.
>
> Thanks again,
> Brad
>
>
> Hi all -
>
> Brad, thanks for the CC. I'd be happy to help out getting any/all
> of gffutils into BioPython. Let me give a high-level overview so
> you can decide what makes sense to bring into BioPython . . .
>
>
> Awesome. (Sorry for the lag.) I've looked through the gffutils code to
> see how this might work.
>
> Starting with the most mundane, I see that gffutils has these
> dependencies (Biopython aims for a functioning dependency-free
> installation):
>
> six-- Could port using Bio._py3k, straightforward but monotonous work.
>
> argh, argcomplete -- Only for argument parsing in the "gffutils-cli"
> script, maybe not needed in Biopython.
>
> simplejson-- I think this is roughly the same code as the "json"
> module in the standard library of Python 2.6+. Since Biopython doesn't
> support Python 2.5 anymore, we can probably just import "json" instead
> of "simplejson" in feature.py and helpers.py.
>
> pyfaidx-- This takes some consideration. Since GSoC 2014, Biopython
> can index a genome-scale FASTA file with sqlite3 using its own index
> format, not the samtools faidx format. I don't see a ton of pygr-style
> indexing in gffutils beyond just extracting the specified subsequences
> from a FASTA file, so Biopython's internal solution may suffice. This
> is not yet merged; the pull request is here:
> https://github.com/biopython/biopython/pull/356
>
> If reading the .fai file is mandatory but writing it is not, then I
> can contribute a minimal ~100-line implementation of that (which could
> alternatively go into Biopython if we prefer):
> https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py
>
> There are two main tricky parts to working with GFF/GTF: parsing
> the attributes and inferring the hierarchy of parent/child
> relationships.
>
> The parsing is mostly self-contained in gffutils.parser. It
> borrows the idea of a "dialect" from the built-in Python csv
> module, and the kinds of trickiness we see in Brad's pathological
> cases are encoded in the fields of the dialect (see comments in
> the gfftutils.constants.dialect dictionary).
>
>
> This looks valuable to have in Biopython even without inferring
> parent-child relationships. Would it be possible to start by
> extracting and merging the GFF3 parser, and work on the parent-child
> relationships separately?
>
> The relationships are by far the hardest. I could write a lot
> about the difficulties of GFF vs GTF, but let's just say a sqlite3
> db is the most portable and performant way I've found to use both
> GFF and GTF and interconvert between them. The bulk of gffutils'
> code and complexity is for working on this task.
>
> Converting GFF to BioPython objects while reliably keeping track
> of parent/child relations requires parsing the entire file,
> creating a database, and then querying the db for the relations.
> gffutils does this, and currently creates SeqFeatures objects. Any
> additional CompoundLocation stuff can easily be added, as long as
> there's a gffutils database to get relationship info from.
> Likewise, assuming presence of a db, Brad's scripts can easily be
> ported. I can certainly work on this.
>
> So I guess the big question is if you want to introduce all the
> sqlite3 machinery to BioPython in order to access relationship
> info, or just use the parser.
>
>
> I think we're happy to use sqlite3 wherever it's a sensible
> engineering choice, since it's part of the standard library. Biopython
> users may want the option to skip the database if parent-child
> relationships are not needed, or keep it in RAM to avoid hitting the disk.
>
>
> -Eric
Hi Eric -
Regarding the dependencies, I'm pretty sure we can drop them all (I'll
address them individually below) for integration with Biopython. I'm
imagining gffutils will remain as an independent project, with a subset
of useful parts shared with Biopython. In this case, it would be
important to minimize the "merge barrier" to facilitate
bugfixes/improvements between code bases in both directions.
With that in mind:
> six-- Could port using Bio._py3k, straightforward but monotonous work.
I think the way to address this is to make a copy of Bio._py3k in
gffutils and use that to replace six in the main gffutils repo. Then
picking-and-choosing pieces of gffutils to put in Biopython will be
straightforward: just edit the import from "gffutils._py3k" to "Bio._py3k".
> argh, argcomplete -- Only for argument parsing in the "gffutils-cli"
> script, maybe not needed in Biopython.
Agreed, nothing targeted for Biopython integration needs these.
> simplejson-- I think this is roughly the same code as the "json"
> module in the standard library of Python 2.6+. Since Biopython doesn't
> support Python 2.5 anymore, we can probably just import "json" instead
> of "simplejson" in feature.py and helpers.py.
simplejson won in some performance benchmarks I had done, but not by a
huge amount. It's a drop-in replacement for json, so this should be a
straightforward fix.
> pyfaidx-- This takes some consideration. Since GSoC 2014, Biopython
> can index a genome-scale FASTA file with sqlite3 using its own index
> format, not the samtools faidx format. I don't see a ton of pygr-style
> indexing in gffutils beyond just extracting the specified subsequences
> from a FASTA file, so Biopython's internal solution may suffice. This
> is not yet merged; the pull request is here:
> https://github.com/biopython/biopython/pull/356
>
> If reading the .fai file is mandatory but writing it is not, then I
> can contribute a minimal ~100-line implementation of that (which could
> alternatively go into Biopython if we prefer):
> https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py
Right, it's only used for extracting subsequences from a FASTA. Could
either drop this functionality altogether, or use the options you
suggest. I would prefer the Biopython internal solution, since a
read-only approach limits the user to pre-constructed indexes -- or
requires them to install additional dependencies to create their own.
> Would it be possible to start by extracting and merging the GFF3
> parser, and work on the parent-child relationships separately?
Certainly. I think this is a good strategy. I think it would be good to
hold off on the sequence extraction for now as well.
> I think we're happy to use sqlite3 wherever it's a sensible
> engineering choice, since it's part of the standard library. Biopython
> users may want the option to skip the database if parent-child
> relationships are not needed, or keep it in RAM to avoid hitting the disk.
If parent-child relationships are not needed, then maybe all you need to
do is parse:
from gffutils.iterators import DataIterator
for feature in DataIterator('annotations.gff'):
# do something with feature
The attributes-field-parsing machinery ported from Brad's code handles
the pathological attributes fields, but other than that it's pretty
trivial and like any line-by-line parser uses very little RAM.
Creating parent-child relationships is a different beast. This takes a
lot longer and is a lot more complex:
from gffutils import create_db
db = create_db('annotations.gff', 'annotations.db')
I should point out that using ":memory:" for the database name puts it
in RAM. The downside is no persistence, so the time cost of
parsing/constructing the db (~10 mins for 1.2M-feature human GENCODE
GTF) has to be spent every time.
Anyway, I think the next step is to get a draft PR going to iron out the
details of parser integration. Where do you want this this live?
Bio.GFF? Bio.GTF?
-ryan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython-dev/attachments/20150625/f587b3ca/attachment.html>
More information about the Biopython-dev
mailing list