[Biopython] gff3 file
Peter Cock
p.j.a.cock at googlemail.com
Tue Jun 2 10:32:41 UTC 2015
On Tue, Jun 2, 2015 at 11:11 AM, <Atteyet-Alla.Yassin at ukb.uni-bonn.de> wrote:
> I would like to convert a gff file (which I received on converting a
> sequence in Genbank format using bioperl) into a table, e.g. like the following
> one:
>
> Seqname Source feature Start End Score Strand Frame Attributes
> chr1 hg19_gold exon 67088326 67183780 0,000000 + . gene_id "AL139147.7"; transcript_id "AL139147.7"
>
> In my gff file you will observe the following :
>
> Lines are doubled, i.e. repeated, e.g.
>
>
> CP008802 Genbank gene 417 638 . + . ID=FB03_00010
> CP008802 Genbank CDS 417 638 . + . Parent=FB03_00010.t00;db_xref=EnsemblGenomes-Gn%3AFB03_00010,EnsemblGenomes-Tr%3AAIE81925,UniProtKB%2FTrEMBL%3AA0A068NGQ6;codon_start=1;inference=COORDINATES%3Aab%20initio%20prediction%3AGeneMarkS%2B;product=hypothetical%20protein;translation=MAKRKKKDRGGVLTWVGIFAIVLASIADFVLFFFDNGSRYILYTLPLWFLGIGCFAWLGRAEERRNNTKRTGN;transl_table=11;note=Derived%20by%20automated%20computational%20analysis%20using%20gene%20prediction%20method%3A%20GeneMarkS%2B.;protein_id=AIE81925.1
>
>
I assume this is a continuation of your past email, i.e.
http://lists.open-bio.org/pipermail/biopython/2015-May/015641.html
You posted the full GFF file then:
http://mailman.open-bio.org/pipermail/biopython/attachments/20150530/dd32ee7e/attachment-0001.obj
Note that these "repeated" GFF lines are normal: you have one line
describing a "gene" at 417..638, and a matching "CDS" line at 417..638.
In the original GenBank file there would also have been two entries,
one for the "gene" and one for the "CDS".
So, given this example gene/CDS, what would you like to have
in the output file? Maybe something like this?
Seqname Source feature Start End Score Strand Frame Attributes
CP008802 Genbank gene 417 638 0,000000 + . gene_id "FB03_00010";
transcript_id "FB03_00010"
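If that is roughly the output you are after, a minimal sketch in plain Python (standard library only) might look like the following. It rests on several assumptions: the GFF3 file is tab-separated with nine columns, only the "gene" lines are kept (so each gene/CDS pair gives one row), and the ID is reused as both gene_id and transcript_id. The function names are made up for illustration, and the score column is left as "." rather than rewritten as 0,000000.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters, e.g. %20 for a space
            attributes[key] = unquote(value)
    return attributes


def gff3_gene_rows(lines):
    """Yield one nine-column row (a list of strings) per 'gene' line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # drop the matching CDS line and anything malformed
        gene_id = parse_attributes(fields[8]).get("ID", "")
        # Assumption: reuse the ID for both gene_id and transcript_id
        fields[8] = 'gene_id "%s"; transcript_id "%s"' % (gene_id, gene_id)
        yield fields


# Shortened version of the two example lines from the email
example = [
    "CP008802\tGenbank\tgene\t417\t638\t.\t+\t.\tID=FB03_00010\n",
    "CP008802\tGenbank\tCDS\t417\t638\t.\t+\t.\t"
    "Parent=FB03_00010.t00;product=hypothetical%20protein\n",
]

rows = list(gff3_gene_rows(example))
print("\t".join(["Seqname", "Source", "feature", "Start", "End",
                 "Score", "Strand", "Frame", "Attributes"]))
for row in rows:
    print("\t".join(row))
```

To run this on a real file, pass an open file handle instead of the example list, e.g. gff3_gene_rows(open("your_file.gff")), where the filename is a placeholder. If you would rather keep the CDS details (product, protein_id and so on), parse_attributes gives you those as a dictionary from the CDS line.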
Peter