[BioPython] Cannot parse ApE plasmid editor GenBank file

Chris Fields cjfields at uiuc.edu
Thu Jun 7 16:42:13 UTC 2007


On Jun 7, 2007, at 9:44 AM, Martin MOKREJŠ wrote:

> Hi Peter,
>> ...
>> That's good news.  Martin - will this solve your problem, or do you
>> think we should also update Biopython to cope with these  "old style"
>> LOCUS lines (which also lack identifiers)?
>
> I think that if it was ever a valid format it should cope with it.

I think it's better to explicitly state that the parser is compliant
with a particular GenBank release and can likely parse other
similarly formatted GenBank records from third-party software.  If
the parser chokes on a bad record then you can point out the
deficiency in the record and (if possible) try to make the parser
more flexible w/o borking it later on.  The release notes are
there for a good reason!

The LOCUS line format, however, has been relatively stable over  
time.  Here are the release notes for a GenBank release from late 1992:

ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes

and the LOCUS line is:

Positions   Contents

1-12        LOCUS
13-22       Locus name
23-29       Length of sequence, right-justified
31-32       bp
34-36       Blank, ss- (single-stranded), ds- (double-stranded), or
            ms- (mixed-stranded)
37-40       Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
            mRNA (messenger RNA), or uRNA (small nuclear RNA)
43-52       Blank (implies linear) or circular
53-55       The division code (see Section 3.3)
63-73       Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
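
As a rough sketch only (this is not Biopython's actual parser), the
column layout above could be read with plain string slicing.  The names
OLD_LOCUS_FIELDS and parse_old_locus, and the sample record, are made up
for illustration; the column ranges are the ones quoted from the release
notes.

```python
# Column ranges (1-based, inclusive) copied from the table above.
# The dictionary and function names are hypothetical, not Biopython API.
OLD_LOCUS_FIELDS = {
    "keyword": (1, 12),
    "name": (13, 22),
    "length": (23, 29),
    "units": (31, 32),
    "strand": (34, 36),
    "molecule": (37, 40),
    "topology": (43, 52),
    "division": (53, 55),
    "date": (63, 73),
}


def parse_old_locus(line):
    """Slice a fixed-width, old-style LOCUS line into its fields."""
    # Pad short lines so every slice is well defined.
    padded = line.rstrip("\n").ljust(73)
    fields = {
        name: padded[start - 1:end].strip()
        for name, (start, end) in OLD_LOCUS_FIELDS.items()
    }
    if fields["keyword"] != "LOCUS":
        raise ValueError("not a LOCUS line")
    # Raises ValueError if the length column is blank or not numeric.
    fields["length"] = int(fields["length"])
    return fields


# Made-up record, assembled column by column instead of typed by hand.
sample = (
    "LOCUS".ljust(12)        # columns 1-12
    + "EXAMPLE1".ljust(10)   # columns 13-22
    + "1234".rjust(7)        # columns 23-29
    + " bp "                 # columns 30-33 ("bp" sits in 31-32)
    + "ds-"                  # columns 34-36
    + "DNA".ljust(4)         # columns 37-40
    + "  "                   # columns 41-42
    + "circular".ljust(10)   # columns 43-52
    + "SYN"                  # columns 53-55
    + " " * 7                # columns 56-62
    + "07-JUN-2007"          # columns 63-73
)

print(parse_old_locus(sample))
```

A line that does not follow these columns (for example one written by
third-party software with different spacing) will either raise
ValueError or come back with fields in the wrong slots, which is
exactly the kind of record the release notes let you diagnose.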

The spacing is more explicitly laid out in later versions of the
release notes.  The best part is the Entrez CD order form (to be
clipped out with scissors and snail-mailed) at the end of the file!

chris


