[Biopython] GFF parsing with biopython
Mic
mictadlo at gmail.com
Tue Apr 30 04:12:34 UTC 2013
Hi,
I have the following GFF file produced by SNAP:
X1 SNAP Einit 2579 2712 -3.221 + . X1-snap.1
X1 SNAP Exon 2813 2945 4.836 + . X1-snap.1
X1 SNAP Eterm 3013 3033 10.467 + . X1-snap.1
X1 SNAP Esngl 3457 3702 -17.856 + . X1-snap.2
X1 SNAP Einit 4901 4974 -4.954 + . X1-snap.3
X1 SNAP Eterm 5021 5150 14.231 + . X1-snap.3
X1 SNAP Einit 6245 7325 -1.525 - . X1-snap.4
X1 SNAP Eterm 5974 6008 5.398 - . X1-snap.4
I have tried to parse the above GFF file with the code below:
from BCBio import GFF
from pprint import pprint
from BCBio.GFF import GFFExaminer
def retrieve_pred_genes_data():
    with open("test/X1_small.snap.gff") as sf:
        #examiner = GFFExaminer()
        #pprint(examiner.available_limits(sf))
        for rec in GFF.parse(sf):
            pprint(rec.id)
            pprint(rec.description)
            pprint(rec.name)
            pprint(rec.features)
            #pprint(rec.type) #'SeqRecord' object has no attribute
            #pprint(rec.ref) #'SeqRecord' object has no attribute
            #pprint(rec.ref_db) #'SeqRecord' object has no attribute
            #pprint(rec.location) #'SeqRecord' object has no attribute
            #pprint(rec.location_operator) #'SeqRecord' object has no attribute
            #pprint(rec.strand) #'SeqRecord' object has no attribute
            #pprint(rec.sub_features) #'SeqRecord' object has no attribute
retrieve_pred_genes_data()
and got the following output:
'X1'
'<unknown description>'
'<unknown name>'
[SeqFeature(FeatureLocation(ExactPosition(2578), ExactPosition(2712), strand=1), type='Einit'),
 SeqFeature(FeatureLocation(ExactPosition(2812), ExactPosition(2945), strand=1), type='Exon'),
 SeqFeature(FeatureLocation(ExactPosition(3012), ExactPosition(3033), strand=1), type='Eterm'),
 SeqFeature(FeatureLocation(ExactPosition(3456), ExactPosition(3702), strand=1), type='Esngl'),
 SeqFeature(FeatureLocation(ExactPosition(4900), ExactPosition(4974), strand=1), type='Einit'),
 SeqFeature(FeatureLocation(ExactPosition(5020), ExactPosition(5150), strand=1), type='Eterm'),
 SeqFeature(FeatureLocation(ExactPosition(6160), ExactPosition(7325), strand=-1), type='Einit'),
 SeqFeature(FeatureLocation(ExactPosition(5973), ExactPosition(6008), strand=-1), type='Eterm')]
and with GFFExaminer I got these:
{'gff_id': {('X1',): 8},
 'gff_source': {('SNAP',): 8},
 'gff_source_type': {('SNAP', 'Einit'): 3,
                     ('SNAP', 'Esngl'): 1,
                     ('SNAP', 'Eterm'): 3,
                     ('SNAP', 'Exon'): 1},
 'gff_type': {('Einit',): 3, ('Esngl',): 1, ('Eterm',): 3, ('Exon',): 1}}
I found these examples
(https://github.com/patena/jonikaslab-mutant-pools/blob/master/notes_on_GFF_parsing.txt),
but when I tried the attributes they use, I got these kinds of errors:
#pprint(rec.type) #'SeqRecord' object has no attribute
#pprint(rec.ref) #'SeqRecord' object has no attribute
#pprint(rec.ref_db) #'SeqRecord' object has no attribute
#pprint(rec.location) #'SeqRecord' object has no attribute
#pprint(rec.location_operator) #'SeqRecord' object has no attribute
#pprint(rec.strand) #'SeqRecord' object has no attribute
#pprint(rec.sub_features) #'SeqRecord' object has no attribute
What did I do wrong and how is it possible to access all fields in the
above GFF file?
Thank you in advance.
Mic
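A note on the question above: the printed output suggests that the type, location and strand are held by the individual SeqFeature objects in rec.features rather than by the SeqRecord, which would explain the attribute errors. Independently of BCBio, the nine columns of each line can also be read directly. The following is only a minimal sketch in plain Python, assuming the columns are whitespace-separated as in the listing (real GFF files are normally tab-separated) and that the ninth column is a plain group name. GffRow and parse_gff_line are hypothetical names, not part of Biopython or BCBio.

```python
from typing import NamedTuple, Optional


class GffRow(NamedTuple):
    """One line of the SNAP listing; the field names are hypothetical."""
    seqid: str
    source: str
    feature_type: str
    start: int               # 1-based, inclusive, as written in the file
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column is "."
    strand: str
    phase: str
    group: str               # ninth column, e.g. "X1-snap.1"


def parse_gff_line(line: str) -> GffRow:
    """Split one whitespace-separated line into its nine columns."""
    cols = line.split(None, 8)  # at most nine fields; the last keeps inner spaces
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
    seqid, source, ftype, start, end, score, strand, phase, group = cols
    return GffRow(seqid, source, ftype, int(start), int(end),
                  None if score == "." else float(score),
                  strand, phase, group.strip())


# Three lines copied from the listing in the post.
sample = """\
X1 SNAP Einit 2579 2712 -3.221 + . X1-snap.1
X1 SNAP Exon 2813 2945 4.836 + . X1-snap.1
X1 SNAP Eterm 3013 3033 10.467 + . X1-snap.1
"""

rows = [parse_gff_line(line) for line in sample.splitlines()
        if line.strip() and not line.startswith("#")]

for row in rows:
    # start - 1 gives the 0-based start that appears in the parser output
    # (GFF 2579 corresponds to ExactPosition(2578)).
    print(row.feature_type, row.start - 1, row.end, row.strand, row.group)
```

Grouping the rows by the ninth column (row.group) would then collect the exons that belong to one predicted gene.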