[BioPython] Cannot parse ApE plasmid editor GenBank file
Chris Fields
cjfields at uiuc.edu
Tue Jun 5 19:55:29 UTC 2007
On Jun 5, 2007, at 1:57 PM, Martin MOKREJŠ wrote:
> Hi Peter, Chris and others,
> here I am passing the answer from Wayne back; sorry for the difficult
> cross-communication. Chris, I hope you will update the bioperl bug I
> have opened on this once it is clearer. I do not know whether Wayne
> will have enough time to answer all your comments, on email lists and
> in bugzilla. A few days ago he said they are organizing a meeting,
> so ... Anyway, the official answer:
>
> Wayne Davis wrote:
>> The locus line I'm using is the old standard (some older parsers
>> wanted it that way).
>> I've updated to write the new standard, if your program isn't
>> flexible enough to read the old style locus lines. We'll see if
>> anyone is using the older parsers still.
>> From the document laying out the new standard:
>>
>> We encourage software developers to switch to a token-based LOCUS
>> parsing approach, rather than a column-specific approach. If this is
>> done, then future changes to the LOCUS line that affect only the
>> spacing of its data values will not require any modifications to
>> software.
>>
>> I've made the default behavior to put "." in the empty fields. I left
>> those fields there because there are other parsers that require them.
>> In my new version you can change the default genbank record values by
>> adding a line to your preferences file like this:
>> empty_genbank_header<TAB>{LOCUS } {} {DEFINITION } {.}
>> {ACCESSION } {.} {VERSION } {.} {SOURCE } {.}
>> { ORGANISM } {.}
>>
>> or
>> empty_genbank_header<TAB>{LOCUS } {}
>>
>>
>> My access to our web server is temporarily unavailable, but I'll post
>> the update as soon as I can.
>
> Martin
The bioperl parser doesn't rely on the exact spacing and uses a
tokenized approach. It does rely on the presence of the LOCUS line
and a locus name in that line (which Martin's sequence record
lacks). According to the release notes, the locus name is then
followed by the sequence length, 'bp' or 'aa', and the rest. As might
be guessed, the lack of a locus name is probably the major source of
headaches here.
Note that the presence of the locus name appears to be required
according to the GenBank release notes. There is no optional
designation for the LOCUS line (it is mandatory as stated in sec.
3.4.2), and the locus name appears in the line for all records (sec.
3.5.4). I could argue that errors encountered parsing a record
lacking a locus name are actually features (albeit horribly
documented ones). I have added a warning which catches fewer than six
tokens on the line, but I don't see the point of going beyond that
without descending into tokenizing oblivion (is it an accession, if not
is it the length, if not ....) when the initial source of the problem is
a badly formatted line in a sequence record.
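
For illustration only, here is a minimal Python sketch of the token-based
idea described above. It is not the bioperl code (which is Perl); the
function name parse_locus_line is hypothetical, and counting the LOCUS
keyword itself toward the six tokens is an assumption.

```python
import warnings


def parse_locus_line(line):
    """Parse a GenBank LOCUS line by whitespace tokens, not fixed columns.

    Hypothetical helper for illustration; not taken from bioperl or Biopython.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")

    # Assumption: the six-token threshold counts the LOCUS keyword itself.
    if len(tokens) < 6:
        warnings.warn("LOCUS line has fewer than six tokens")

    # Expected order: LOCUS <name> <length> <bp|aa> <rest...>
    if len(tokens) < 4 or not tokens[2].isdigit() or tokens[3] not in ("bp", "aa"):
        raise ValueError("expected locus name, length and 'bp' or 'aa' after LOCUS")

    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],
        "rest": tokens[4:],
    }
```

Because the line is split on whitespace, changes in column spacing do not
affect the result. A line with no locus name (so that the length lands
where the name should be) raises ValueError rather than trying to guess
what each token was meant to be.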
chris