[Biopython] gff3 file

Tue Jun 2 10:32:41 UTC 2015

On Tue, Jun 2, 2015 at 11:11 AM,  <Atteyet-Alla.Yassin at ukb.uni-bonn.de> wrote:
> I would like to convert a GFF file (which I received on converting a
> sequence in GenBank format using BioPerl) into a table, e.g. like the
> following one:
>
> Seqname Source feature Start End Score Strand Frame Attributes
> chr1 hg19_gold exon 67088326 67183780 0,000000 + . gene_id "AL139147.7";
> transcript_id "AL139147.7"
>
> In my GFF file you will observe the following:
>
> Lines are doubled, i.e. repeated, e.g.
>
>
> CP008802    Genbank    gene    417    638    .    +    .    ID=FB03_00010
> CP008802    Genbank    CDS    417    638    .    +    .    Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1
>
>

I assume this is a continuation of your past email, i.e.
http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html

You posted the full GFF file then:
http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj

Note that these "repeated" GFF lines are normal - you have one line
describing a "gene" at 417..638, and a matching "CDS" line at 417..638.
In the original GenBank file there would also have been two entries,
one for the "gene" and one for the "CDS".
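
If you want to check this on the whole file, here is a minimal sketch in
plain Python (no Biopython needed) that groups GFF lines by their
coordinates and lists the feature types found at each location. The
function name feature_types_by_location is made up for illustration, and
the code assumes the columns are tab-separated as the GFF3 format requires.

```python
from collections import defaultdict


def feature_types_by_location(lines):
    """Map (seqid, start, end, strand) to the feature types found there."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines such as ##gff-version
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            # Not a feature line (assumption: real feature lines have 8+ columns)
            continue
        seqid, _source, ftype, start, end, _score, strand = cols[:7]
        groups[(seqid, int(start), int(end), strand)].append(ftype)
    return dict(groups)


# Shortened versions of the two lines from your email
example = [
    "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010",
    "CP008802\tGenbank\tCDS\t417\t638\t.\t+\t.\tParent=FB03_00010.t00",
]
locations = feature_types_by_location(example)
print(locations)
```

Each location that lists both "gene" and "CDS" is one of the pairs you
described as doubled lines.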

So, given this example gene/CDS, what would you like to have
in the output file? Maybe something like this?

Seqname Source feature Start End Score Strand Frame Attributes
CP008802 Genbank gene 417 638 0,000000 + . gene_id "FB03_00010";
transcript_id "FB03_00010"
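
If that is roughly what you want, something along these lines might be a
starting point. It is only a sketch under my own assumptions: it keeps
just the "gene" lines, reuses the GFF3 ID as both gene_id and
transcript_id, and leaves the score as "." rather than inventing a
number. The helper names parse_attributes and gene_line_to_row are
hypothetical, and only the Python standard library is used.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict, decoding %XX escapes."""
    attrs = {}
    for part in column9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def gene_line_to_row(line):
    """Return a space-separated table row for a 'gene' line, else None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return None
    # Assumption: the GFF3 ID is what you want as gene_id and transcript_id
    gene_id = parse_attributes(cols[8]).get("ID", "")
    new_attrs = 'gene_id "%s"; transcript_id "%s"' % (gene_id, gene_id)
    return " ".join(cols[:8] + [new_attrs])


example = "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010"
row = gene_line_to_row(example)
print(row)
```

To process a whole file you would loop over its lines, call
gene_line_to_row on each, and write out the rows that are not None.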

Peter