[Biopython] gff3 problem

Michal mictadlo at gmail.com
Tue Apr 5 12:26:21 UTC 2011


Hello,
I found http://www.biopython.org/wiki/GFF_Parsing, which describes how to
read GFF3 files with Biopython. The following code
import sys
from pprint import pprint

from BCBio import GFF

in_file = sys.argv[1]

# Parse every record in the GFF3 file and print its top-level fields.
with open(in_file) as in_handle:
    for rec in GFF.parse(in_handle):
        pprint(rec.id)
        pprint(rec.description)
        pprint(rec.name)
        pprint(rec.features)

uses this GFF3 file:
##gff-version    3

BC  test    chromosome  1    15923202    .    .    .    ID=BC;Name=BC

BC    x    gene    2235    3344    .    -    .    ID=BC-x.1;Name=BC-x.1;Note=Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis RepID:D7L774_ARALY
BC    x    exon    2235    2279    5.336    -    .    ID=BC-x.1-Exon-1;Parent=BC-x.1;Name=BC-x.1-Exon-1
BC    x    exon    2423    2535    -3.679    -    .    ID=BC-x.1-Exon-2;Parent=BC-x.1;Name=BC-x.1-Exon-2
BC    x    exon    2610    2691    13.041    -    .    ID=BC-x.1-Exon-3;Parent=BC-x.1;Name=BC-x.1-Exon-3
BC    x    exon    2763    2864    26.072    -    .    ID=BC-x.1-Exon-4;Parent=BC-x.1;Name=BC-x.1-Exon-4
BC    x    exon    2972    3049    17.020    -    .    ID=BC-x.1-Exon-5;Parent=BC-x.1;Name=BC-x.1-Exon-5
BC    x    exon    3126    3251    8.398    -    .    ID=BC-x.1-Exon-6;Parent=BC-x.1;Name=BC-x.1-Exon-6
BC    x    exon    3321    3344    1.792    -    .    ID=BC-x.1-Exon-7;Parent=BC-x.1;Name=BC-x.1-Exon-7

BC    blastp    protein    2423    3332    .    -    .    ID=BC.protein_x.1_1;Name=UniRef90_Q2RBC6;Note=Elongation factor P, putative, expressed n:4 Tax:Oryza sativa RepID:Q2RBC6_ORYSJ
BC    blastp    cds    2423    2535    .    -    .    ID=BC.protein_x.1_1_1;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_1
BC    blastp    cds    2610    2691    .    -    .    ID=BC.protein_x.1_1_2;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_2
BC    blastp    cds    2763    2864    .    -    .    ID=BC.protein_x.1_1_3;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_3
BC    blastp    cds    2972    3049    .    -    .    ID=BC.protein_x.1_1_4;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_4
BC    blastp    cds    3126    3251    .    -    .    ID=BC.protein_x.1_1_5;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_5
BC    blastp    cds    3321    3332    .    -    .    ID=BC.protein_x.1_1_6;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_6

BC    blastp    protein    2423    3338    .    -    .    ID=BC.protein_x.1_2;Name=UniRef90_B4B801;Note=Elongation factor P n:1 Tax:Cyanothece sp. PCC 7822 RepID:B4B801_9CHRO
BC    blastp    cds    2423    2535    .    -    .    ID=BC.protein_x.1_2_1;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_1
BC    blastp    cds    2610    2691    .    -    .    ID=BC.protein_x.1_2_2;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_2
BC    blastp    cds    2763    2864    .    -    .    ID=BC.protein_x.1_2_3;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_3
BC    blastp    cds    2972    3049    .    -    .    ID=BC.protein_x.1_2_4;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_4
BC    blastp    cds    3126    3251    .    -    .    ID=BC.protein_x.1_2_5;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_5
BC    blastp    cds    3321    3338    .    -    .    ID=BC.protein_x.1_2_6;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_6

BC    x    gene    3859    4071    .    +    .    ID=BC-x.2;Name=BC-x.2;Note=No hits found
BC    x    exon    3859    4071    -0.231    +    .    ID=BC-x.2-Exon-1;Parent=BC-x.2;Name=BC-x.2-Exon-1

BC    x    gene    5536    7351    .    +    .    ID=BC-x.3;Name=BC-x.3;Note=Probable protein phosphatase 2C 65 n:3 Tax:Arabidopsis RepID:P2C65_ARATH
BC    x    exon    5536    5739    -2.746    +    .    ID=BC-x.3-Exon-1;Parent=BC-x.3;Name=BC-x.3-Exon-1
BC    x    exon    5827    5907    17.396    +    .    ID=BC-x.3-Exon-2;Parent=BC-x.3;Name=BC-x.3-Exon-2
BC    x    exon    5971    6111    9.268    +    .    ID=BC-x.3-Exon-3;Parent=BC-x.3;Name=BC-x.3-Exon-3
BC    x    exon    6202    6319    9.154    +    .    ID=BC-x.3-Exon-4;Parent=BC-x.3;Name=BC-x.3-Exon-4
BC    x    exon    6476    6699    15.287    +    .    ID=BC-x.3-Exon-5;Parent=BC-x.3;Name=BC-x.3-Exon-5
BC    x    exon    6795    7023    9.286    +    .    ID=BC-x.3-Exon-6;Parent=BC-x.3;Name=BC-x.3-Exon-6
BC    x    exon    7323    7351    6.774    +    .    ID=BC-x.3-Exon-7;Parent=BC-x.3;Name=BC-x.3-Exon-7

BC    blastp    protein    5536    6968    .    +    .    ID=BC.protein_x.3_1;Name=UniRef90_A5BF43;Note=Putative uncharacterized protein n:1 Tax:Vitis vinifera RepID:A5BF43_VITVI
BC    blastp    cds    5536    5739    .    +    .    ID=BC.protein_x.3_1_1;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_1
BC    blastp    cds    5827    5907    .    +    .    ID=BC.protein_x.3_1_2;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_2
BC    blastp    cds    5971    6111    .    +    .    ID=BC.protein_x.3_1_3;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_3
BC    blastp    cds    6202    6319    .    +    .    ID=BC.protein_x.3_1_4;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_4
BC    blastp    cds    6476    6699    .    +    .    ID=BC.protein_x.3_1_5;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_5
BC    blastp    cds    6795    6968    .    +    .    ID=BC.protein_x.3_1_6;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_6

and produces the following output:
$ python test2.py test.gff3
'BC'
'<unknown description>'
'<unknown name>'
[SeqFeature(FeatureLocation(ExactPosition(0),ExactPosition(15923202)), type='chromosome', id='BC'),
  SeqFeature(FeatureLocation(ExactPosition(2234),ExactPosition(3344)), type='gene', location_operator='join', strand=-1, id='BC-x.1'),
  SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3332)), type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_1'),
  SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3338)), type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_2'),
  SeqFeature(FeatureLocation(ExactPosition(3858),ExactPosition(4071)), type='gene', location_operator='join', strand=1, id='BC-x.2'),
  SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(7351)), type='gene', location_operator='join', strand=1, id='BC-x.3'),
  SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(6968)), type='protein', location_operator='join', strand=1, id='BC.protein_x.3_1')]
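
The positions printed above are consistent with two different coordinate
conventions: GFF3 coordinates are 1-based and inclusive, while Biopython
locations are 0-based and half-open, like Python slices. A minimal sketch
of the conversion follows; the helper names are hypothetical and are not
part of Biopython or BCBio.

```python
def gff_to_python_interval(start, end):
    """Convert a 1-based, inclusive GFF3 interval to 0-based, half-open."""
    # Only the start shifts; the inclusive end equals the exclusive end.
    return start - 1, end


def python_to_gff_interval(start, end):
    """Convert a 0-based, half-open interval back to GFF3 coordinates."""
    return start + 1, end


# The BC-x.1 gene row from the file above: 2235..3344 in GFF3.
py_start, py_end = gff_to_python_interval(2235, 3344)
print(py_start, py_end)      # 2234 3344
print(py_end - py_start)     # 1110, the feature length in bases
```

With this convention the length of a feature is simply end minus start,
and sequence[start:end] extracts it without any off-by-one adjustment.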

How can I access the exon and cds information from the GFF3 file?
Why is the start position always one less than in the GFF3 file, while
the end position is the same?
Why do I not get Note=Elongation factor P (EF-P)...?
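
For context on the first and third questions: in GFF3 the Note text sits
in the ninth (attributes) column, and exon/cds rows point to their gene
or protein through the Parent attribute, whereas pprint(rec.features)
prints only the repr of each top-level feature. The sketch below is a
plain-Python illustration of that structure, not the BCBio
implementation; parse_attributes and the dictionaries are hypothetical
names, and the rows are assumed to be tab-separated as GFF3 requires. In
the parsed records the same information is understood to be available as
feature.qualifiers and feature.sub_features, but that is an assumption
worth checking against the installed version.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> list of values."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # GFF3 separates multiple values with commas and URL-escapes text.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


# Two rows taken from the file above, written with explicit tabs.
rows = [
    "BC\tx\tgene\t3859\t4071\t.\t+\t.\t"
    "ID=BC-x.2;Name=BC-x.2;Note=No hits found",
    "BC\tx\texon\t3859\t4071\t-0.231\t+\t.\t"
    "ID=BC-x.2-Exon-1;Parent=BC-x.2;Name=BC-x.2-Exon-1",
]

features = {}                  # feature ID -> parsed feature
children = defaultdict(list)   # parent ID -> list of child features
for row in rows:
    columns = row.split("\t")
    attrs = parse_attributes(columns[8])
    feature = {
        "type": columns[2],
        "start": int(columns[3]),
        "end": int(columns[4]),
        "attributes": attrs,
    }
    features[attrs["ID"][0]] = feature
    for parent_id in attrs.get("Parent", []):
        children[parent_id].append(feature)

print(features["BC-x.2"]["attributes"]["Note"])   # ['No hits found']
for child in children["BC-x.2"]:
    print(child["type"], child["start"], child["end"])   # exon 3859 4071
```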

Thank you in advance.


