[BioPython] Cannot parse ApE plasmid editor GenBank file

Chris Fields cjfields at uiuc.edu
Tue Jun 5 20:28:08 UTC 2007


On Jun 5, 2007, at 2:58 PM, Peter wrote:

...
> Easier said than done, as some fields can also contain white space.
> However, Howard Salis has some interesting code to tackle this
> attached to Biopython bug 2294.

The bioperl parser simply splits the LOCUS line on white space.  The
first three tokens (not counting the LOCUS keyword) are always the locus
name, the seq length, and 'bp' or 'aa' (which we use to determine the
alphabet); that order seems to go back to GenBank release 100 (1997):

ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb100.release.notes

The next few tokens fluctuate depending on the release or sequence type,
but the division and date are always last.  I don't think we require a
division code to be present, but I'm not sure.
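
For illustration, here is a rough Python sketch of that whitespace-split
approach.  It is not the bioperl (or Biopython) code; the function name
parse_locus_line, the returned dictionary keys, the date pattern and the
set of division codes are all my own assumptions, and the example locus
name is made up.

```python
import re

# Assumption: dates look like 21-JUN-1999.
DATE_RE = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4}$")

# Assumption: a hand-typed set of GenBank division codes, used only to
# tell a division apart from a molecule type such as DNA or RNA.
DIVISIONS = {
    "PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "VRL", "PHG", "SYN",
    "UNA", "EST", "PAT", "STS", "GSS", "HTG", "HTC", "ENV", "CON", "TSA",
}


def parse_locus_line(line):
    """Split a LOCUS line on white space (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] != "LOCUS":
        raise ValueError("not a usable LOCUS line: %r" % line)

    # The first three tokens after the keyword: name, length, unit.
    name, length, unit = tokens[1], tokens[2], tokens[3]
    if unit not in ("bp", "aa"):
        raise ValueError("expected 'bp' or 'aa', got %r" % unit)

    # Date and division, when present, are taken from the end.
    rest = tokens[4:]
    date = rest.pop() if rest and DATE_RE.match(rest[-1]) else None
    division = rest.pop() if rest and rest[-1] in DIVISIONS else None

    return {
        "name": name,
        "length": int(length),
        "alphabet": "protein" if unit == "aa" else "nucleotide",
        "other": rest,  # molecule type, topology, etc. vary by release
        "division": division,  # None if no division code was found
        "date": date,
    }


if __name__ == "__main__":
    print(parse_locus_line("LOCUS  pExample  5000 bp  DNA  circular  05-JUN-2007"))
```

A locus name containing white space would still break this, which is the
problem Peter raised.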

> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
>
> Just to correct myself, many of those fields are described as
> mandatory single entries further up in the documentation - so using a
> dot/period (as Wayne has done for the ApE plasmid editor) does seem
> the best solution.
>
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2  Entry Organization
>> ...
>>   The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS	... Mandatory keyword/exactly one record.
>> DEFINITION ... Mandatory keyword/one or more records.
>> ACCESSION ... Mandatory keyword/one or more records.
>> VERSION...  Mandatory keyword/exactly one record.
>> ...
>
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all
> annotated entries (so not mandatory in general). COMMENT is optional.
>
> Peter

Probably something we should look into and correct as well.  We don't  
require those fields for parsing, but they should be present in  
output sequence records, strictly speaking.
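
As a rough sketch of what such a check on output could look like, here
is some Python that reports which of the keywords quoted above are absent
from the header of a flat-file record.  The function name
missing_keywords and the "annotated" switch are hypothetical, and it only
looks at the first word of each header line, so a continuation line that
happens to start with a keyword would fool it.

```python
# Keywords taken from the gbrel.txt excerpt quoted above.
MANDATORY = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")
ANNOTATED_ONLY = ("KEYWORDS", "SOURCE", "ORGANISM")


def missing_keywords(record_text, annotated=True):
    """Return the expected header keywords absent from record_text."""
    seen = set()
    for line in record_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Stop at the end of the header section.
        if tokens[0] in ("FEATURES", "ORIGIN", "//"):
            break
        seen.add(tokens[0])
    wanted = MANDATORY + (ANNOTATED_ONLY if annotated else ())
    return [keyword for keyword in wanted if keyword not in seen]


if __name__ == "__main__":
    # Made-up record using a dot/period as the place holder value.
    record = (
        "LOCUS       pExample 10 bp DNA\n"
        "DEFINITION  .\n"
        "ACCESSION   .\n"
        "ORIGIN\n"
        "        1 acgtacgtac\n"
        "//\n"
    )
    print(missing_keywords(record))
```

A writer could use the returned list to emit each missing keyword with a
dot/period as the value.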

chris



