[BioPython] Cannot parse ApE plasmid editor GenBank file

Tue Jun 5 19:55:29 UTC 2007

On Jun 5, 2007, at 1:57 PM, Martin MOKREJŠ wrote:

> Hi Peter, Chris and others,
>   here I am passing the answer from Wayne back, sorry for the difficult
> cross-communication. Chris, I hope you will update the bioperl bug I have
> opened on this once it is clearer. I do not know whether Wayne will have
> enough time to answer all your comments, on email lists and in bugzilla.
> A few days ago he said they are organizing a meeting, so ... Anyway,
> the official answer:
>
> Wayne Davis wrote:
>> The locus line I'm using is the old standard (some older parsers wanted it
>> that way).
>> I've updated to write the new standard, in case your program isn't flexible
>> enough to read the old style locus lines. We'll see if anyone is using
>> the older parsers still.
>> From the document laying out the new standard:
>>
>>  We encourage software developers to switch to a token-based LOCUS parsing
>> approach, rather than a column-specific approach. If this is done, then future
>> changes to the LOCUS line that affect only the spacing of its data values will
>> not require any modifications to software.
>>
>> I've made the default behavior to put "." in the empty fields. I left
>> those fields there because there are other parsers that require them.
>> In my new version you can change the default genbank record values by
>> adding a line to your preferences file like this:
>> empty_genbank_header<TAB>{LOCUS       } {} {DEFINITION  } {.} {ACCESSION   } {.} {VERSION     } {.} {SOURCE      } {.} {  ORGANISM  } {.}
>>
>> or
>> empty_genbank_header<TAB>{LOCUS       } {}
>>
>>
>> My access to our web server is temporarily unavailable, but I'll post
>> the update as soon as I can.
>
> Martin

The bioperl parser doesn't rely on the exact spacing and uses a
tokenized approach.  It does rely on the presence of the LOCUS line
and a locus name in that line (which Martin's sequence record
lacks).  According to the release notes, the locus name is then followed
by the sequence length, 'bp' or 'aa', and the rest.  As might be
guessed, the lack of a locus name is probably the major source of
headaches here.
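
To make the token order above concrete, here is a minimal Python sketch of
a token-based LOCUS reader.  It is not the bioperl (or Biopython) code; the
function name parse_locus_line and the returned dictionary keys are
hypothetical, and the sample lines in the usage note are made up for
illustration.  It only assumes the order described above: LOCUS keyword,
locus name, sequence length, 'bp' or 'aa', then the rest.

```python
def parse_locus_line(line):
    """Parse a GenBank LOCUS line by whitespace tokens, not by columns.

    Hypothetical helper for illustration only.  Assumed token order:
    LOCUS <name> <length> <bp|aa> <rest...>
    """
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] != "LOCUS":
        raise ValueError("not a usable LOCUS line: %r" % line)

    name, length, unit = tokens[1], tokens[2], tokens[3]

    # If the locus name is missing, every later token shifts left by one,
    # so the "length" slot holds 'bp'/'aa' and this check fails.
    if not length.isdigit() or unit not in ("bp", "aa"):
        raise ValueError(
            "expected '<name> <length> bp|aa' after LOCUS: %r" % line
        )

    return {
        "name": name,
        "length": int(length),
        "unit": unit,
        "rest": tokens[4:],  # molecule type, division, date, ... left unparsed
    }
```

Because only the token order matters, two lines that differ purely in
spacing parse identically, while a line with no locus name (for example
"LOCUS       5028 bp DNA circular 05-JUN-2007") raises ValueError instead
of being silently misread.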

Note that the presence of the locus name appears to be required
according to the GenBank release notes.  There is no optional
designation for the LOCUS line (it is mandatory as stated in sec.
3.4.2), and the locus name appears in the line for all records (sec.
3.5.4).  I could argue that errors encountered parsing a record
lacking a locus name are actually features (albeit horribly
documented ones).  I have added a warning which catches fewer than six
tokens on the line, but I don't see the point of going beyond that
without descending into tokenizing oblivion (is it an accession, if not
is it the length, if not ....) when the initial source of the problem is
a badly formatted line in a sequence record.
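
For readers on the Python side, a token-count check of this kind might look
roughly like the sketch below.  This is not the actual bioperl change; the
function name check_locus_tokens and the constant MIN_LOCUS_TOKENS are
hypothetical, and counting the LOCUS keyword itself as one of the six
tokens is an assumption.

```python
import warnings

# Threshold taken from the message above; whether the LOCUS keyword itself
# counts toward the six tokens is an assumption made for this sketch.
MIN_LOCUS_TOKENS = 6


def check_locus_tokens(line):
    """Warn (rather than guess) when a LOCUS line has too few tokens.

    Hypothetical helper: returns True if the line has enough tokens,
    False after emitting a warning otherwise.
    """
    tokens = line.split()
    if len(tokens) < MIN_LOCUS_TOKENS:
        warnings.warn(
            "LOCUS line has only %d tokens (expected at least %d): %r"
            % (len(tokens), MIN_LOCUS_TOKENS, line)
        )
        return False
    return True
```

The check deliberately stops at counting: it does not try to work out
which field is missing, which is the "tokenizing oblivion" described above.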

chris