[Biopython] gff3 file

Peter Cock p.j.a.cock at googlemail.com
Tue Jun 2 14:35:48 UTC 2015


On Tue, Jun 2, 2015 at 3:24 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
>
> Atteyet-Alla,
>
> My guess: this is using one of the various genbank-to-GFF scripts in
> BioPerl?  Most of those are designed to work w/RefSeq data, where the
> features are halfway consistent.  Also, I believe features spanning the
> origin are supported but this depends on which version of BioPerl you are
> using for the conversion (was added in the last few releases I believe).
>
> Depending on the script you use and settings, they do a fairly decent job
> but in many cases need some tweaks to get it where you want it.  Frankly
> they have been subsumed by using the NCBI GFF3 data directly.
>
> Speaking of, is there any reason you aren’t simply using the NCBI GFF3 and
> bypassing GenBank altogether?  They have been working pretty hard to make
> their output GFF3-compliant, and last I checked they work with most genome
> browsers and parsers:
>
> GenBank:
> ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz
> RefSeq:
> ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz
>

Hi Chris,

Good point, I'd also advocate using any NCBI provided GFF file
nowadays rather than attempting GenBank to GFF conversion.
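
As a rough illustration of working with such a file directly, here is
a minimal sketch using only the Python standard library (not Biopython).
The function name count_feature_types is made up for this example, and
it assumes a local gzipped copy of one of the NCBI GFF3 files linked
above. It simply tallies the feature types found in column 3:

```python
import gzip
from collections import Counter


def count_feature_types(path):
    """Count the column-3 feature types in a gzipped GFF3 file.

    Hypothetical helper for illustration; not part of Biopython.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                # Anything after this directive is sequence, not features.
                break
            if line.startswith("#") or not line.strip():
                # Skip directives, comments, and blank lines.
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) == 9:  # GFF3 feature lines have nine columns
                counts[columns[2]] += 1
    return counts
```

Calling count_feature_types("GCF_000724605.1_ASM72460v1_genomic.gff.gz")
on a downloaded copy should give a quick overview of what the
annotation contains (gene, CDS, and so on) before deciding how to
process it further.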

In this case it is a shame that the NCBI web interface doesn't
offer GFF output (yet) from pages like this:

http://www.ncbi.nlm.nih.gov/nuccore/CP008802

> (not sure about Biopython support here, but I would be really surprised if
> there are problems)
>
> I personally find GenBank to be a legacy format, useful for human
> readability but little more, and a huge pain to deal with from the parsing
> end due to lack of a true specification (no, the ‘Sample GenBank file’ at
> NCBI doesn’t count in my book when they change it at will).
>
> Apologies for the snarkiness, need coffee :)

As things stand, Biopython's GenBank parsing is pretty robust, and
the SeqRecord/SeqFeature object model fits it very well (likewise
with BioSQL - this is what they were originally designed for).

However, Biopython's GFF support is not yet merged into the core.
Brad did a lot of work on this, but full support needs changes to our
objects to handle the parent/child relationship between features
in GFF (which is implicit in GenBank format).
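
To make the parent/child point concrete: in GFF3 the ninth column can
carry ID and Parent tags that link features explicitly, whereas a
GenBank feature table is a flat list. The sketch below is plain Python
rather than Biopython's or Brad's code; the function names and the
example records (gene0, rna0, cds0) are invented for illustration.
It groups child features under the ID of their parent:

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn 'ID=rna0;Parent=gene0' into {'ID': ['rna0'], 'Parent': ['gene0']}.

    Every value is returned as a list, since GFF3 allows comma-separated
    multiple values (for example, several parents).
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters in attribute values.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


def children_by_parent(gff_lines):
    """Map each parent ID to a list of (feature type, child ID) tuples."""
    children = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(columns[8])
        feature_id = attrs.get("ID", [None])[0]
        for parent in attrs.get("Parent", []):
            children[parent].append((columns[2], feature_id))
    return dict(children)


# Made-up records, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsketch\tgene\t1\t900\t.\t+\t.\tID=gene0;Name=dnaA",
    "chr1\tsketch\tmRNA\t1\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "chr1\tsketch\tCDS\t1\t900\t.\t+\t0\tID=cds0;Parent=rna0",
]
hierarchy = children_by_parent(example)
```

Here hierarchy maps gene0 to its mRNA and rna0 to its CDS. A flat
list of features has no slot for this kind of nesting, which is
roughly why the object model would need changes.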

That said, these details are not important for Atteyet-Alla's question.

Peter


