[Biopython] gff3 problem
Michal
mictadlo at gmail.com
Tue Apr 5 12:26:21 UTC 2011
Hello,
I found http://www.biopython.org/wiki/GFF_Parsing, which describes how to
read GFF3 files with Biopython. The following code
import sys
from BCBio import GFF
from pprint import pprint

in_file = sys.argv[1]
in_handle = open(in_file)
for rec in GFF.parse(in_handle):
    pprint(rec.id)
    pprint(rec.description)
    pprint(rec.name)
    pprint(rec.features)
in_handle.close()
uses this GFF3 file:
##gff-version 3
BC test chromosome 1 15923202 . . . ID=BC;Name=BC
BC x gene 2235 3344 . - . ID=BC-x.1;Name=BC-x.1;Note=Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis RepID:D7L774_ARALY
BC x exon 2235 2279 5.336 - . ID=BC-x.1-Exon-1;Parent=BC-x.1;Name=BC-x.1-Exon-1
BC x exon 2423 2535 -3.679 - . ID=BC-x.1-Exon-2;Parent=BC-x.1;Name=BC-x.1-Exon-2
BC x exon 2610 2691 13.041 - . ID=BC-x.1-Exon-3;Parent=BC-x.1;Name=BC-x.1-Exon-3
BC x exon 2763 2864 26.072 - . ID=BC-x.1-Exon-4;Parent=BC-x.1;Name=BC-x.1-Exon-4
BC x exon 2972 3049 17.020 - . ID=BC-x.1-Exon-5;Parent=BC-x.1;Name=BC-x.1-Exon-5
BC x exon 3126 3251 8.398 - . ID=BC-x.1-Exon-6;Parent=BC-x.1;Name=BC-x.1-Exon-6
BC x exon 3321 3344 1.792 - . ID=BC-x.1-Exon-7;Parent=BC-x.1;Name=BC-x.1-Exon-7
BC blastp protein 2423 3332 . - . ID=BC.protein_x.1_1;Name=UniRef90_Q2RBC6;Note=Elongation factor P, putative, expressed n:4 Tax:Oryza sativa RepID:Q2RBC6_ORYSJ
BC blastp cds 2423 2535 . - . ID=BC.protein_x.1_1_1;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_1
BC blastp cds 2610 2691 . - . ID=BC.protein_x.1_1_2;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_2
BC blastp cds 2763 2864 . - . ID=BC.protein_x.1_1_3;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_3
BC blastp cds 2972 3049 . - . ID=BC.protein_x.1_1_4;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_4
BC blastp cds 3126 3251 . - . ID=BC.protein_x.1_1_5;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_5
BC blastp cds 3321 3332 . - . ID=BC.protein_x.1_1_6;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_6
BC blastp protein 2423 3338 . - . ID=BC.protein_x.1_2;Name=UniRef90_B4B801;Note=Elongation factor P n:1 Tax:Cyanothece sp. PCC 7822 RepID:B4B801_9CHRO
BC blastp cds 2423 2535 . - . ID=BC.protein_x.1_2_1;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_1
BC blastp cds 2610 2691 . - . ID=BC.protein_x.1_2_2;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_2
BC blastp cds 2763 2864 . - . ID=BC.protein_x.1_2_3;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_3
BC blastp cds 2972 3049 . - . ID=BC.protein_x.1_2_4;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_4
BC blastp cds 3126 3251 . - . ID=BC.protein_x.1_2_5;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_5
BC blastp cds 3321 3338 . - . ID=BC.protein_x.1_2_6;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_6
BC x gene 3859 4071 . + . ID=BC-x.2;Name=BC-x.2;Note=No hits found
BC x exon 3859 4071 -0.231 + . ID=BC-x.2-Exon-1;Parent=BC-x.2;Name=BC-x.2-Exon-1
BC x gene 5536 7351 . + . ID=BC-x.3;Name=BC-x.3;Note=Probable protein phosphatase 2C 65 n:3 Tax:Arabidopsis RepID:P2C65_ARATH
BC x exon 5536 5739 -2.746 + . ID=BC-x.3-Exon-1;Parent=BC-x.3;Name=BC-x.3-Exon-1
BC x exon 5827 5907 17.396 + . ID=BC-x.3-Exon-2;Parent=BC-x.3;Name=BC-x.3-Exon-2
BC x exon 5971 6111 9.268 + . ID=BC-x.3-Exon-3;Parent=BC-x.3;Name=BC-x.3-Exon-3
BC x exon 6202 6319 9.154 + . ID=BC-x.3-Exon-4;Parent=BC-x.3;Name=BC-x.3-Exon-4
BC x exon 6476 6699 15.287 + . ID=BC-x.3-Exon-5;Parent=BC-x.3;Name=BC-x.3-Exon-5
BC x exon 6795 7023 9.286 + . ID=BC-x.3-Exon-6;Parent=BC-x.3;Name=BC-x.3-Exon-6
BC x exon 7323 7351 6.774 + . ID=BC-x.3-Exon-7;Parent=BC-x.3;Name=BC-x.3-Exon-7
BC blastp protein 5536 6968 . + . ID=BC.protein_x.3_1;Name=UniRef90_A5BF43;Note=Putative uncharacterized protein n:1 Tax:Vitis vinifera RepID:A5BF43_VITVI
BC blastp cds 5536 5739 . + . ID=BC.protein_x.3_1_1;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_1
BC blastp cds 5827 5907 . + . ID=BC.protein_x.3_1_2;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_2
BC blastp cds 5971 6111 . + . ID=BC.protein_x.3_1_3;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_3
BC blastp cds 6202 6319 . + . ID=BC.protein_x.3_1_4;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_4
BC blastp cds 6476 6699 . + . ID=BC.protein_x.3_1_5;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_5
BC blastp cds 6795 6968 . + . ID=BC.protein_x.3_1_6;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_6
and I get the following output:
$ python test2.py test.gff3
'BC'
'<unknown description>'
'<unknown name>'
[SeqFeature(FeatureLocation(ExactPosition(0),ExactPosition(15923202)),
type='chromosome', id='BC'),
SeqFeature(FeatureLocation(ExactPosition(2234),ExactPosition(3344)),
type='gene', location_operator='join', strand=-1, id='BC-x.1'),
SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3332)),
type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_1'),
SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3338)),
type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_2'),
SeqFeature(FeatureLocation(ExactPosition(3858),ExactPosition(4071)),
type='gene', location_operator='join', strand=1, id='BC-x.2'),
SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(7351)),
type='gene', location_operator='join', strand=1, id='BC-x.3'),
SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(6968)),
type='protein', location_operator='join', strand=1, id='BC.protein_x.3_1')]
How can I access the exon and CDS information from the GFF3 file?
Why is the start position always one less than in the GFF3 file, while
the end position is the same?
Why don't I get the Note=Elongation factor P (EF-P)... attribute?
Thank you in advance.
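
A note on the coordinate and Note questions, as a rough sketch rather than
a description of what the BCBio parser does internally. GFF3 coordinates
are 1-based with an inclusive end, while Python-style locations are 0-based
with an exclusive end, so converting shifts the start down by one and
leaves the end unchanged. The example below uses only the standard library;
parse_gff3_line is a hypothetical helper, not part of Biopython or BCBio,
and it assumes the real file is tab-separated. With the parser itself, the
attribute column is probably available through each feature's qualifiers
dictionary and the exon/cds children through sub_features, but that is an
assumption about the installed version and is not exercised here.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one tab-separated GFF3 record into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # Column 9 is a ';'-separated list of key=value pairs; a ',' inside a
    # value separates multiple values, so each key maps to a list.
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start) - 1,  # 1-based inclusive -> 0-based
        "end": int(end),          # inclusive end equals the exclusive end
        "strand": strand,
        "attributes": attributes,
    }


line = ("BC\tx\tgene\t2235\t3344\t.\t-\t.\t"
        "ID=BC-x.1;Name=BC-x.1;Note=Elongation factor P (EF-P) family protein")
rec = parse_gff3_line(line)
print(rec["start"], rec["end"])      # 2234 3344
print(rec["end"] - rec["start"])     # 1110, same as 3344 - 2235 + 1
print(rec["attributes"]["Note"])
```

The feature length comes out the same under both conventions, which is why
only the start appears to change in the pprint output.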