[Biopython] Bug in GenBank/EMBL parser?

Uri Laserson laserson at mit.edu
Thu Apr 22 01:07:19 UTC 2010


Hi,

I am trying to use the EMBL parser to parse the IMGT/LIGM flatfile (which
supposedly conforms to the EMBL standard).

The short story is that whenever the parser encounters a feature, it uses an
assert statement to check that the feature has qualifiers, and so it does not
allow features with no qualifiers.  However, the IMGT flatfile is full of
entries that have features with no qualifiers (only coordinates).
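
For reference, here is a minimal sketch (plain Python, independent of
Biopython) of how such features can be listed.  It assumes that feature keys
start in column 6 of an FT line and that qualifier lines start with "/", as
in the record below.  The function name features_without_qualifiers is my own
and not part of any library.

```python
def features_without_qualifiers(lines):
    """Return (key, location) pairs for FT features that have no /qualifier line.

    Assumption: feature keys start in column 6 of an FT line and qualifier
    lines start with "/", as in the record shown in this message.
    """
    bare = []
    current = None  # [key, location, has_qualifier] for the feature being read

    def flush():
        # Record the feature just finished if it never saw a qualifier.
        if current is not None and not current[2]:
            bare.append((current[0], current[1]))

    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("FT"):
            # Any non-FT line ends the current feature.
            flush()
            current = None
        elif len(line) > 5 and line[5] != " ":
            # A non-blank character in column 6 starts a new feature.
            flush()
            parts = line[5:].split(None, 1)
            location = parts[1].strip() if len(parts) > 1 else ""
            current = [parts[0], location, False]
        elif current is not None and line[2:].lstrip().startswith("/"):
            # Qualifier line belonging to the current feature.
            current[2] = True
    flush()
    return bare
```

Run over the FT lines of the record below, this should report the
sig_peptide and D_segment features, which carry coordinates only.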

Who is wrong here?  Does the EMBL specification require that a feature have
qualifiers?  Or is this a bug to be fixed in the parser?

To be more concrete, the parser broke on the following record:

ID   A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP.
XX
AC   A03907;
XX
DT   11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT   10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE   H.sapiens antibody D1.3 variable region protein  ;
DE   unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV.
XX
KW   antigen receptor; Immunoglobulin superfamily (IgSF);
KW   Immunoglobulin (IG); IG-Heavy; variable; diversity; joining;
KW   rearranged.
XX
OS   Homo sapiens (human)
OC   cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa;
OC   Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata;
OC   Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda;
OC   Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates;
OC   Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae;
OC   Homo/Pan/Gorilla group; Homo.
XX
RN   [1]
RP   1-412
RA   ;
RT   "Recombinant antibodies and methods for their production.";
RL   Patent number EP0239400-A/10, 30-SEP-1987.
RL   MEDICAL RESEARCH COUNCIL.
XX
DR   EMBL; A03907.
XX
FH   Key             Location/Qualifiers (from EMBL)
FH
FT   source          1..412
FT                   /organism="Homo sapiens"
FT                   /mol_type="unassigned DNA"
FT                   /db_xref="taxon:9606"
FT   V_region        8..>412
FT                   /note="antibody D1.3 V region"
FT   sig_peptide     8..64
FT   CDS             8..>412
FT                   /product="antibody D1.3 V region (VDJ)"
FT                   /protein_id="CAA00308.1"
FT                   /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT                   FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT                   HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT   D_segment       356..371
FT   J_segment       372..>412
FT                   /note="J(H)2 region"
XX
SQ   Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
     tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct        60
     ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc       120
     catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca       180
     gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta       240
     taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt       300
     cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag       360
     agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca               412
//

And the traceback was:

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features)
    418         #This is a generator function
    419         while True :
--> 420             record = self.parse(handle, do_features)
    421             if record is None : break
    422             assert record.id is not None

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features)
    401                     feature_cleaner = FeatureValueCleaner())
    402
--> 403         if self.feed(handle, consumer, do_features) :
    404             return consumer.data
    405         else :

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features)
    373         #Features (common to both EMBL and GenBank):
    374         if do_features :
--> 375             self._feed_feature_table(consumer, self.parse_features(skip=False))
    376         else :
    377             self.parse_features(skip=True) # ignore the data

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip)
    170                     feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
    171                     line = self.handle.readline()
--> 172                 features.append(self.parse_feature(feature_key, feature_lines))
    173         self.line = line
    174         return features

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines)
    267                 else :
    268                     #Unquoted continuation
--> 269                     assert len(qualifiers) > 0
    270                     assert key==qualifiers[-1][0]
    271                     #if debug : print "Unquoted Cont %s:%s" % (key, line)

AssertionError:

The failure traces to an assert statement in Scanner.py at line 269.  It
appears that the code assumes the line it is looking at is an unquoted
continuation of a feature qualifier, and so expects at least one qualifier
to have been read already.
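
To make the question concrete, here is a simplified stand-in (my own sketch,
not Biopython's actual code) for splitting one feature's lines into a
location and a list of qualifiers.  It treats leading lines that do not start
with "/" as part of the location, and only treats a line as a continuation
once a qualifier has been seen.  The name split_feature and the choice to
join continuation lines without a space (as for /translation) are
assumptions.

```python
def split_feature(lines):
    """Split one feature's lines into (location, qualifiers).

    Hypothetical helper: `lines` are the feature's lines with the "FT"
    prefix and feature key already removed.  Qualifiers are returned as
    [key, value] pairs; value is None for a qualifier with no "=".
    """
    location_parts = []
    qualifiers = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            # Start of a new qualifier, e.g. /note="..."
            key, sep, value = line[1:].partition("=")
            qualifiers.append([key, value if sep else None])
        elif qualifiers:
            # Continuation of the previous qualifier's value.
            previous = qualifiers[-1][1]
            qualifiers[-1][1] = line if previous is None else previous + line
        else:
            # No qualifier seen yet, so this is (part of) the location.
            location_parts.append(line)
    return "".join(location_parts), qualifiers
```

With this approach a coordinates-only feature such as D_segment simply yields
an empty qualifier list instead of raising an AssertionError.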

Finally, I am using Biopython 1.51, which I built from source using Python 2.5
(from an EPD 4.3.0 install).  I am on a Mac running OS X 10.5.8 (Leopard).

Thanks!
Uri


