[Biopython] Bug in GenBank/EMBL parser?
Uri Laserson
laserson at mit.edu
Thu Apr 22 01:07:19 UTC 2010
Hi,
I am trying to use the EMBL parser to parse the IMGT/LIGM flatfile (which
supposedly conforms to the EMBL standard).
In short, whenever the parser encounters a feature, an assert statement
checks that the feature has qualifiers, so features with no qualifiers are
rejected. However, the IMGT flatfile is full of entries that have features
with no qualifiers (only coordinates).
Who is wrong here? Does the EMBL specification require that a feature have
qualifiers, or is this a bug to be fixed in the parser?
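As a rough way to check how common such features are, here is a minimal sketch that lists the features in a record's FT lines that carry no qualifiers. It is not part of Biopython; the function name features_without_qualifiers is made up for illustration, and it assumes that feature keys start in column 6 of an FT line and that qualifier lines start with "/" after the indent.

```python
def features_without_qualifiers(lines):
    """Return (key, location) for each feature that has no /qualifier lines.

    Hypothetical helper, not Biopython code. Assumes feature keys start in
    column 6 (index 5) of an FT line and qualifiers begin with '/'.
    """
    bare = []
    current = None  # [key, location, has_qualifier]
    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue  # blank FT spacer line
        if len(line) > 5 and line[5] != " ":
            # A new feature starts here; flush the previous one if it was bare.
            if current is not None and not current[2]:
                bare.append((current[0], current[1]))
            parts = body.split(None, 1)
            location = parts[1].strip() if len(parts) > 1 else ""
            current = [parts[0], location, False]
        elif body.startswith("/") and current is not None:
            current[2] = True
    if current is not None and not current[2]:
        bare.append((current[0], current[1]))
    return bare


sample = [
    'FT   source          1..412',
    'FT                   /organism="Homo sapiens"',
    'FT   sig_peptide     8..64',
    'FT   D_segment       356..371',
    'FT   J_segment       372..>412',
    'FT                   /note="J(H)2 region"',
]
print(features_without_qualifiers(sample))
```

Run over the lines of a flatfile, this gives a quick count of the features a qualifier-requiring parser would reject.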
To be more concrete, the parser broke on the following record:
ID A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP.
XX
AC A03907;
XX
DT 11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT 10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE H.sapiens antibody D1.3 variable region protein ;
DE unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV.
XX
KW antigen receptor; Immunoglobulin superfamily (IgSF);
KW Immunoglobulin (IG); IG-Heavy; variable; diversity; joining;
KW rearranged.
XX
OS Homo sapiens (human)
OC cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa;
OC Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata;
OC Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda;
OC Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates;
OC Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae;
OC Homo/Pan/Gorilla group; Homo.
XX
RN [1]
RP 1-412
RA ;
RT "Recombinant antibodies and methods for their production.";
RL Patent number EP0239400-A/10, 30-SEP-1987.
RL MEDICAL RESEARCH COUNCIL.
XX
DR EMBL; A03907.
XX
FH Key Location/Qualifiers (from EMBL)
FH
FT source 1..412
FT /organism="Homo sapiens"
FT /mol_type="unassigned DNA"
FT /db_xref="taxon:9606"
FT V_region 8..>412
FT /note="antibody D1.3 V region"
FT sig_peptide 8..64
FT CDS 8..>412
FT /product="antibody D1.3 V region (VDJ)"
FT /protein_id="CAA00308.1"
FT /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT D_segment 356..371
FT J_segment 372..>412
FT /note="J(H)2 region"
XX
SQ Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct 60
ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc 120
catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca 180
gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta 240
taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt 300
cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag 360
agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca 412
//
And the traceback was:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython console> in <module>()
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features)
418 #This is a generator function
419 while True :
--> 420 record = self.parse(handle, do_features)
421 if record is None : break
422 assert record.id is not None
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features)
401 feature_cleaner = FeatureValueCleaner())
402
--> 403 if self.feed(handle, consumer, do_features) :
404 return consumer.data
405 else :
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features)
373 #Features (common to both EMBL and GenBank):
374 if do_features :
--> 375 self._feed_feature_table(consumer, self.parse_features(skip=False))
376 else :
377 self.parse_features(skip=True) # ignore the data
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip)
170 feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
171 line = self.handle.readline()
--> 172 features.append(self.parse_feature(feature_key, feature_lines))
173 self.line = line
174 return features
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines)
267 else :
268 #Unquoted continuation
--> 269 assert len(qualifiers) > 0
270 assert key==qualifiers[-1][0]
271 #if debug : print "Unquoted Cont %s:%s" % (key, line)
AssertionError:
The failure traces to an assert statement in Scanner.py at line 269. The
code appears to assume that any line reaching this branch is an unquoted
continuation of a preceding feature qualifier, so at least one qualifier
must already have been seen.
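To make that assumption concrete, here is a minimal sketch of splitting one feature's lines into a location and a list of qualifiers without requiring any qualifiers. This is not Biopython's Scanner code; split_feature is a hypothetical name, and the rule that a non-"/" line before any qualifier extends the location is my own assumption about a tolerant behaviour.

```python
def split_feature(lines):
    """Split a feature's lines into (location, [(key, value), ...]).

    Hypothetical sketch, not Biopython's implementation. `lines` holds the
    location line followed by any qualifier lines, with the FT prefix and
    feature key already removed.
    """
    location_parts = []
    qualifiers = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Start of a new qualifier: /key or /key=value
            key, sep, value = text[1:].partition("=")
            qualifiers.append([key, value if sep else None])
        elif qualifiers:
            # Continuation of the most recent qualifier value. Joined with a
            # space here for simplicity; a real parser would likely join
            # /translation lines without one.
            previous = qualifiers[-1][1]
            qualifiers[-1][1] = text if previous is None else previous + " " + text
        else:
            # No qualifier seen yet: treat the line as more of the location
            # rather than asserting that a qualifier must exist.
            location_parts.append(text)
    return "".join(location_parts), [(k, v) for k, v in qualifiers]


print(split_feature(["8..64"]))
print(split_feature(["372..>412", '/note="J(H)2 region"']))
```

With this structure a feature such as sig_peptide 8..64 simply yields an empty qualifier list.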
Finally, I am using Biopython 1.51, which I built from source using Python 2.5
(from an EPD 4.3.0 install). I am on a Mac running OS X 10.5.8 (Leopard).
Thanks!
Uri