[Biopython] gff3 file

Fields, Christopher J cjfields at illinois.edu
Tue Jun 2 14:24:39 UTC 2015


On Jun 2, 2015, at 5:49 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

On Tue, Jun 2, 2015 at 11:32 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
On Tue, Jun 2, 2015 at 11:11 AM, <Atteyet-Alla.Yassin at ukb.uni-bonn.de> wrote:
I would like to convert a GFF file (which I received by converting a
sequence in GenBank format using BioPerl) into a table, e.g. like the
following one:

Seqname Source feature Start End Score Strand Frame Attributes
chr1 hg19_gold exon 67088326 67183780 0,000000 + . gene_id "AL139147.7";
transcript_id "AL139147.7"

In my GFF file you will observe the following:

Lines are doubled, i.e. repeated, e.g.


CP008802    Genbank    gene    417    638    .    +    .    ID=FB03_00010
CP008802    Genbank    CDS    417    638    .    +    .
Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1



I assume this is a continuation of your past email, i.e.
http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html

You posted the full GFF file then:
http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj

Note that these "repeated" GFF lines are normal - you have a line
describing a "gene" at 417..638, and a matching "CDS" at 417..638.
In the original GenBank file there would also have been two entries,
one for the "gene" and one for the "CDS".

So, given this example gene/CDS, what would you like to have
in the output file? Maybe something like this?

Seqname Source feature Start End Score Strand Frame Attributes
CP008802 Genbank gene 417 638 0,000000 + . gene_id "FB03_00010";
transcript_id "FB03_00010"

Peter

You've not explained this file format, so I am guessing here
(e.g. should start/end be counting from one, should the frame
be just plus or minus, should feature be of type "gene"?).
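
The guessed conversion above could be sketched in plain Python along these
lines. This is only a sketch under assumptions: the GFF3 input is taken to be
tab separated (the archive shows spaces), only "gene" lines are kept, start
and end are passed through unchanged (GFF3 counts from one), the score is
copied as found rather than written as 0,000000, and the ID attribute is
reused for both gene_id and transcript_id. The function names are
hypothetical, not part of Biopython.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Turn 'ID=x;Parent=y' into a dict, decoding %XX escapes."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes


def gff3_gene_to_row(line):
    """Return a table row for a 'gene' line, or None for any other line."""
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    seqid, source, ftype, start, end, score, strand, phase, column9 = fields
    # Assumption: the GFF3 ID doubles as gene_id and transcript_id.
    gene_id = parse_gff3_attributes(column9).get("ID", "")
    attributes = 'gene_id "%s"; transcript_id "%s"' % (gene_id, gene_id)
    return " ".join(
        [seqid, source, ftype, start, end, score, strand, phase, attributes]
    )


example = "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010"
print(gff3_gene_to_row(example))
```

Skipping the CDS lines is what removes the "doubled" rows; if the product
or protein_id is wanted in the table, the CDS attributes can be decoded with
the same parse_gff3_attributes helper.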

I would work from the original GenBank file rather than a
conversion to GFF which may introduce additional problems.
There's an example at the end of this email - but note this
does not handle complex locations like FB03_00005 which
appears to span the origin.

Peter

Atteyet-Alla,

My guess: this is using one of the various GenBank-to-GFF scripts in BioPerl? Most of those are designed to work with RefSeq data, where the features are halfway consistent. Also, I believe features spanning the origin are supported, but this depends on which version of BioPerl you are using for the conversion (I believe support was added in the last few releases).

Depending on the script you use and its settings, they do a fairly decent job, but in many cases the output needs some tweaks to get it where you want it. Frankly, they have been superseded by using the NCBI GFF3 data directly.

Speaking of which, is there any reason you aren’t simply using the NCBI GFF3 and bypassing GenBank altogether? They have been working pretty hard to make their output GFF3-compliant, and last I checked it works with most genome browsers and parsers:

GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000724605.1_ASM72460v1/GCA_000724605.1_ASM72460v1_genomic.gff.gz
RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000724605.1_ASM72460v1/GCF_000724605.1_ASM72460v1_genomic.gff.gz

(not sure about Biopython support here, but I would be really surprised if there are problems)
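
As a quick sanity check that does not depend on any parser support, a
downloaded copy of one of the .gff.gz files above can be read with the Python
standard library alone. A minimal sketch, assuming the file has already been
fetched to a local path; the function names are hypothetical:

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Count column 3 (feature type) over an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


def count_feature_types_in_gzip(path):
    """Same as count_feature_types, reading a gzip-compressed GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle)


# Usage, with a local path of your choosing:
# print(count_feature_types_in_gzip("GCA_000724605.1_ASM72460v1_genomic.gff.gz"))
```

Roughly matching gene and CDS counts would confirm the paired gene/CDS
lines discussed earlier in the thread.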

I personally find GenBank to be a legacy format, useful for human readability but little more, and a huge pain to deal with from the parsing end due to lack of a true specification (no, the ‘Sample GenBank file’ at NCBI doesn’t count in my book when they change it at will).

Apologies for the snarkiness, need coffee :)

chris
