[Biopython-dev] [Bug 3062] New: GenBank/EMBL parser breaks when features have no qualifiers
bugzilla-daemon at portal.open-bio.org
Thu Apr 22 13:56:48 UTC 2010
http://bugzilla.open-bio.org/show_bug.cgi?id=3062
Summary: GenBank/EMBL parser breaks when features have no qualifiers
Product: Biopython
Version: 1.54b
Platform: All
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: laserson at mit.edu
CC: laserson at mit.edu
I am trying to use the EMBL parser to parse the IMGT/LIGM flatfile. When the
parser reads a feature, an assert statement checks that the feature has
qualifiers, so features with no qualifiers are rejected. However, the EMBL
specification does not require features to have qualifiers, and the IMGT
flatfile is full of entries whose features have no qualifiers (only
coordinates).
The AssertionError comes from an assert statement at line 269 of Scanner.py.
The code appears to assume that it is looking at an unquoted continuation of a
feature qualifier, rather than at a feature with no qualifiers.
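To illustrate the distinction, here is a minimal sketch of how the lines of one
feature could be split so that a feature with only a location is accepted. This
is not the Biopython Scanner code; the function name split_feature and its input
format (the text of each line after the feature key column) are assumptions
made for the example.

```python
# Hypothetical sketch, not the Biopython Scanner implementation.
def split_feature(lines):
    """Split the lines of one feature into (location, qualifiers).

    `lines` holds the text that follows the feature key: the location
    first, then zero or more "/key=value" qualifier lines.
    """
    location_parts = []
    qualifiers = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Start of a new qualifier; flag qualifiers have no "=value".
            key, sep, value = text[1:].partition("=")
            qualifiers.append([key, value if sep else None])
        elif qualifiers:
            # Continuation of the previous qualifier value. Joined without
            # a space, which only suits sequence-like values.
            qualifiers[-1][1] = (qualifiers[-1][1] or "") + text
        else:
            # No qualifier seen yet, so this is still part of the location.
            location_parts.append(text)
    return "".join(location_parts), [tuple(q) for q in qualifiers]


# A feature with only coordinates yields an empty qualifier list.
location, quals = split_feature(["8..64"])
print(location, quals)
```

The point of the sketch is the final else branch: a line that is not a
qualifier, seen before any qualifier, is treated as location text instead of
triggering an assertion.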
I am using Biopython 1.51, built from source with Python 2.5 (from an EPD
4.3.0 install), on a Mac running OS X 10.5.8 (Leopard). Peter mentioned that
the problem is still present in the 1.54b release and in the repository.
The parser broke on the following record, which reproduces the problem (the
traceback follows the record):
ID A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP.
XX
AC A03907;
XX
DT 11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT 10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE H.sapiens antibody D1.3 variable region protein ;
DE unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV.
XX
KW antigen receptor; Immunoglobulin superfamily (IgSF);
KW Immunoglobulin (IG); IG-Heavy; variable; diversity; joining;
KW rearranged.
XX
OS Homo sapiens (human)
OC cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa;
OC Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata;
OC Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda;
OC Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates;
OC Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae;
OC Homo/Pan/Gorilla group; Homo.
XX
RN [1]
RP 1-412
RA ;
RT "Recombinant antibodies and methods for their production.";
RL Patent number EP0239400-A/10, 30-SEP-1987.
RL MEDICAL RESEARCH COUNCIL.
XX
DR EMBL; A03907.
XX
FH Key Location/Qualifiers (from EMBL)
FH
FT source 1..412
FT /organism="Homo sapiens"
FT /mol_type="unassigned DNA"
FT /db_xref="taxon:9606"
FT V_region 8..>412
FT /note="antibody D1.3 V region"
FT sig_peptide 8..64
FT CDS 8..>412
FT /product="antibody D1.3 V region (VDJ)"
FT /protein_id="CAA00308.1"
FT /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT D_segment 356..371
FT J_segment 372..>412
FT /note="J(H)2 region"
XX
SQ Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct 60
ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc 120
catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca 180
gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta 240
taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt 300
cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag 360
agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca 412
//
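In the record above, sig_peptide and D_segment are the features with a location
but no qualifiers. As a rough illustration, such features could be listed with
a sketch like the one below. It is not part of Biopython; the function name
features_without_qualifiers is hypothetical, and the code assumes the standard
EMBL column layout in the original file (feature key starting in column 6,
qualifiers further to the right).

```python
# Hypothetical helper, assuming standard EMBL FT column layout.
def features_without_qualifiers(record_lines):
    """Return the keys of FT features that have a location but no qualifiers."""
    features = []  # each entry is [feature_key, qualifier_count]
    for line in record_lines:
        if not line.startswith("FT"):
            continue
        if len(line) > 5 and line[5] != " ":
            # A non-blank character in column 6 starts a new feature.
            key = line[5:].split()[0]
            features.append([key, 0])
        elif features and line[2:].strip().startswith("/"):
            # An indented line starting with "/" is a qualifier.
            features[-1][1] += 1
    return [key for key, count in features if count == 0]


example = [
    "FT   source          1..412",
    'FT                   /organism="Homo sapiens"',
    "FT   sig_peptide     8..64",
    "FT   D_segment       356..371",
    "FT   J_segment       372..>412",
    'FT                   /note="J(H)2 region"',
]
print(features_without_qualifiers(example))  # ['sig_peptide', 'D_segment']
```

This is a simplification: a continuation line of a quoted qualifier value that
happens to begin with "/" would be miscounted as a new qualifier.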
And the traceback was:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython console> in <module>()
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features)
418 #This is a generator function
419 while True :
--> 420 record = self.parse(handle, do_features)
421 if record is None : break
422 assert record.id is not None
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features)
401 feature_cleaner = FeatureValueCleaner())
402
--> 403 if self.feed(handle, consumer, do_features) :
404 return consumer.data
405 else :
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features)
373 #Features (common to both EMBL and GenBank):
374 if do_features :
--> 375 self._feed_feature_table(consumer, self.parse_features(skip=False))
376 else :
377 self.parse_features(skip=True) # ignore the data
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip)
170 feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
171 line = self.handle.readline()
--> 172 features.append(self.parse_feature(feature_key, feature_lines))
173 self.line = line
174 return features
/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines)
267 else :
268 #Unquoted continuation
--> 269 assert len(qualifiers) > 0
270 assert key==qualifiers[-1][0]
271 #if debug : print "Unquoted Cont %s:%s" % (key, line)
AssertionError:
Thanks!
Uri