[Biopython-dev] [Bug 3062] New: GenBank/EMBL parser breaks when features have no qualifiers

bugzilla-daemon at portal.open-bio.org
Thu Apr 22 09:56:48 EDT 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3062

           Summary: GenBank/EMBL parser breaks when features have no
                    qualifiers
           Product: Biopython
           Version: 1.54b
          Platform: All
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: laserson at mit.edu
                CC: laserson at mit.edu


I am trying to parse the IMGT/LIGM flat file with the EMBL parser.  While
parsing a feature, the parser uses an assert statement to check that the
feature has qualifiers, so it rejects any feature that has none.  However, the
EMBL specification does not require features to have qualifiers, and the IMGT
flat file is full of entries whose features have no qualifiers (only
coordinates).

I traced the AssertionError to an assert statement in Scanner.py at line 269.
The code appears to assume that it is looking at an unquoted continuation of a
feature qualifier, rather than at a feature with no qualifiers.
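
To illustrate the behaviour I would expect, here is a minimal sketch that
groups FT lines into features and accepts a feature with no qualifiers.  It is
not Biopython's code: the function group_features, the SAMPLE lines, and the
indent parameter (set to 21, the column where the location starts in the
record below) are hypothetical names chosen for this example.

```python
def group_features(lines, indent=21):
    """Group EMBL 'FT' lines into (key, location, qualifiers) tuples.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    features = []
    for raw in lines:
        if not raw.startswith("FT"):
            continue
        key = raw[2:indent].strip()
        text = raw[indent:].strip()
        if key:
            # A non-blank key column starts a new feature; the rest of
            # the line is its location. The qualifier list starts empty.
            features.append([key, text, []])
        elif not features or not text:
            # Blank FT line, or a continuation with nothing to attach to.
            continue
        elif text.startswith("/"):
            # A new qualifier for the current feature.
            features[-1][2].append(text)
        elif features[-1][2]:
            # Continuation of the previous qualifier value. Joined without
            # a separator, which suits sequence-like values; free-text
            # values would need a space.
            features[-1][2][-1] += text
        else:
            # No qualifier seen yet, so treat it as a wrapped location.
            features[-1][1] += text
    return [tuple(f) for f in features]


SAMPLE = [
    "FT   sig_peptide     8..64",
    "FT   CDS             8..>412",
    'FT                   /product="antibody D1.3 V region (VDJ)"',
    "FT   D_segment       356..371",
    "FT   J_segment       372..>412",
    'FT                   /note="J(H)2 region"',
]

for key, location, qualifiers in group_features(SAMPLE):
    print(key, location, qualifiers)
```

With this grouping, sig_peptide and D_segment simply come back with an empty
qualifier list instead of triggering an assertion.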

I am using Biopython 1.51, which I built from source using Python 2.5 (from an
EPD 4.3.0 install).  I am on a Mac running OS X 10.5.8 (Leopard).  Peter
mentioned that the problem in the code is still present in the 1.54b release,
and also in the repository.

To reproduce the problem, parse the following record, which broke the parser
(the traceback follows it):

ID   A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP.
XX
AC   A03907;
XX
DT   11-MAR-1998 (Rel. 8, arrived in LIGM-DB )
DT   10-JUN-2008 (Rel. 200824-2, Last updated, Version 3)
XX
DE   H.sapiens antibody D1.3 variable region protein  ;
DE   unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV. 
XX
KW   antigen receptor; Immunoglobulin superfamily (IgSF); 
KW   Immunoglobulin (IG); IG-Heavy; variable; diversity; joining; 
KW   rearranged. 
XX
OS   Homo sapiens (human)
OC   cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; 
OC   Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; 
OC   Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; 
OC   Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; 
OC   Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; 
OC   Homo/Pan/Gorilla group; Homo. 
XX
RN   [1]
RP   1-412
RA   ;
RT   "Recombinant antibodies and methods for their production.";
RL   Patent number EP0239400-A/10, 30-SEP-1987.
RL   MEDICAL RESEARCH COUNCIL.
XX
DR   EMBL; A03907.
XX
FH   Key             Location/Qualifiers (from EMBL) 
FH   
FT   source          1..412
FT                   /organism="Homo sapiens"
FT                   /mol_type="unassigned DNA"
FT                   /db_xref="taxon:9606"
FT   V_region        8..>412
FT                   /note="antibody D1.3 V region"
FT   sig_peptide     8..64
FT   CDS             8..>412
FT                   /product="antibody D1.3 V region (VDJ)"
FT                   /protein_id="CAA00308.1"
FT                   /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG
FT                   FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL
FT                   HTDDTARYYCARERDYRLDYWGQGTTLTVSS"
FT   D_segment       356..371
FT   J_segment       372..>412
FT                   /note="J(H)2 region"
XX
SQ   Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other;
     tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct       60
     ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc      120
     catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca      180
     gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta      240
     taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt      300
     cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag      360
     agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca              412
//

And the traceback was:

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (311, 0))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

/Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/<ipython console> in <module>()

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features)
    418         #This is a generator function
    419         while True :
--> 420             record = self.parse(handle, do_features)
    421             if record is None : break
    422             assert record.id is not None

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features)
    401                     feature_cleaner = FeatureValueCleaner())
    402 
--> 403         if self.feed(handle, consumer, do_features) :
    404             return consumer.data
    405         else :

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features)
    373         #Features (common to both EMBL and GenBank):
    374         if do_features :
--> 375             self._feed_feature_table(consumer, self.parse_features(skip=False))
    376         else :
    377             self.parse_features(skip=True) # ignore the data

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip)
    170                     feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip())
    171                     line = self.handle.readline()
--> 172                 features.append(self.parse_feature(feature_key, feature_lines))
    173         self.line = line
    174         return features

/Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines)
    267                 else :
    268                     #Unquoted continuation
--> 269                     assert len(qualifiers) > 0
    270                     assert key==qualifiers[-1][0]
    271                     #if debug : print "Unquoted Cont %s:%s" % (key, line)

AssertionError: 



Thanks!
Uri

