[BioPython] Cannot parse ApE plasmid editor GenBank file

Tue Jun 5 18:29:52 UTC 2007

Hi Wayne & all the Biopython mailing list,

Martin has been trying to parse some GenBank files produced by the ApE 
plasmid editor, and Biopython (and BioPerl?) doesn't like them.

Hopefully between us we can sort this out :)

By the way, is this still the current ApE plasmid editor webpage? It 
times out for me:

http://www.biology.utah.edu/jorgensen/wayned/ape/

Martin MOKREJŠ wrote:
> I would appreciate if you could tell me then what was exactly wrong with 
> the generated files by ApE editor (author Cc:ed).

OK then, looking at file elh/pNEX3.gb which starts:

LOCUS               2981 bp ds-DNA     linear       12-OCT-2006
DEFINITION
ACCESSION
VERSION
SOURCE
   ORGANISM
COMMENT
COMMENT     ApEinfo:methylated:1
FEATURES             Location/Qualifiers
      misc_feature    225..257
                      /ApEinfo_label=pNEX3-compatibile
...

I think the sequence length (2981 bp), the sequence type (ds-DNA, 
linear) and the date (12-OCT-2006) are not in the correct positions 
(i.e. column numbers).  The locus name is also missing, which is not 
ideal. Showing examples in an email is tricky because the line wrapping 
spoils the column alignment.

Interestingly all these files seem to have their LOCUS line fields in 
the same place - perhaps the ApE plasmid editor is following an out of 
date version of the GenBank file format which I haven't seen before? If 
so, we (Biopython) should be able to deal with this too.

For the current version of the LOCUS line spec, see:
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

In particular:
> The detailed format for the LOCUS line format is as follows:
> 
> Positions  Contents
> ---------  --------
> 01-05      'LOCUS'
> 06-12      spaces
> 13-28      Locus name
> 29-29      space
> 30-40      Length of sequence, right-justified
> 41-41      space
> 42-43      bp
> 44-44      space
> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>            ms- (mixed-stranded)
> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
>            mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
>            snoRNA. Left justified.
> 54-55      space
> 56-63      'linear' followed by two spaces, or 'circular'
> 64-64      space
> 65-67      The division code (see Section 3.3)
> 68-68      space
> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

Note that the protein variant, "GenPept", is slightly different.
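
To make the column positions concrete, here is a small sketch that 
slices a nucleotide LOCUS line at exactly the positions listed in the 
excerpt above. The names LOCUS_COLUMNS, build_locus_line and 
parse_locus_fixed are invented for this illustration, as are the locus 
name "EXAMPLE1" and the division "SYN" in the usage note; this is not 
how Biopython itself is implemented.

```python
# Column ranges (1-based, inclusive) copied from the gbrel.txt excerpt above.
LOCUS_COLUMNS = {
    "name": (13, 28),
    "length": (30, 40),
    "strand": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}


def build_locus_line(name, length, strand, molecule, topology, division, date):
    """Lay the fields out in the columns listed above (illustrative helper)."""
    return (
        "LOCUS" + " " * 7              # columns 01-12
        + name.ljust(16) + " "         # columns 13-29
        + str(length).rjust(11)        # columns 30-40, right-justified
        + " bp "                       # columns 41-44
        + strand.ljust(3)              # columns 45-47
        + molecule.ljust(6) + "  "     # columns 48-55
        + topology.ljust(8) + " "      # columns 56-64
        + division.ljust(3) + " "      # columns 65-68
        + date                         # columns 69-79
    )


def parse_locus_fixed(line):
    """Slice a LOCUS line strictly by column; raise ValueError if it does not fit."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    padded = line.rstrip("\n").ljust(79)
    # 'bp' must sit in columns 42-43 (string indices 41 and 42).
    if padded[41:43] != "bp":
        raise ValueError("'bp' is not in columns 42-43")
    fields = {
        key: padded[start - 1:end].strip()
        for key, (start, end) in LOCUS_COLUMNS.items()
    }
    fields["length"] = int(fields["length"])
    return fields
```

For example, build_locus_line("EXAMPLE1", 2981, "ds-", "DNA", "linear", 
"SYN", "12-OCT-2006") gives a 79 character line that parse_locus_fixed 
accepts, whereas the ApE LOCUS line as quoted in this email is rejected 
because "bp" is not in columns 42-43.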

The next six lines of that example file (elh/pNEX3.gb) have keywords 
but no values. As Chris Fields pointed out on the Biopython mailing 
list, the NCBI likes to use a dot/period as a placeholder in that case.

The spec explicitly says that the KEYWORDS line can be omitted, but 
seems to assume the other lines are present. Biopython should be happy 
if these lines are just omitted.
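
As a possible workaround on the user side, the value-less header lines 
could be stripped before the file is handed to a parser. The sketch 
below is only an illustration under that assumption; the function name 
drop_empty_header_lines and the keyword set are my own choices, and it 
only touches lines that appear before the FEATURES table.

```python
# Header keywords that appear without a value in the example file.
EMPTY_OK = {"DEFINITION", "ACCESSION", "VERSION", "SOURCE", "ORGANISM", "COMMENT"}


def drop_empty_header_lines(lines):
    """Yield lines, skipping header lines that hold a bare keyword and no value."""
    in_header = True
    for line in lines:
        if line.startswith("FEATURES"):
            in_header = False  # leave the feature table and sequence untouched
        if in_header and line.strip() in EMPTY_OK:
            continue
        yield line
```

Applied to the start of the example file, this keeps the LOCUS line, 
the "COMMENT     ApEinfo:methylated:1" line and everything from 
FEATURES onwards.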

See also:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

> Hope this helps,

You might have upset some people by emailing an attachment to the entire 
Biopython mailing list, but it wasn't too big at least ;)

Regards,

Peter