[BioPython] Cannot parse ApE plasmid editor GenBank file
Chris Fields
cjfields at uiuc.edu
Thu Jun 7 16:42:13 UTC 2007
On Jun 7, 2007, at 9:44 AM, Martin MOKREJŠ wrote:
> Hi Peter,
>> ...
>> That's good news. Martin - will this solve your problem, or do you
>> think we should also update Biopython to cope with these "old style"
>> LOCUS lines (which also lack identifiers)?
>
> I think that if it was ever a valid format it should cope with it.
I think it's better to explicitly state that the parser is compliant
with a particular GenBank release and can likely parse other
similarly formatted GenBank records from third-party software. If
the parser chokes on a bad record then you can point out the
deficiency in the record and (if possible) try to make the parser
more flexible w/o borking it later on. The release notes are
there for a good reason!
The LOCUS line format, however, has been relatively stable over
time. Here are the release notes for a GenBank release from late 1992:
ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes
and the LOCUS line is:
Positions  Contents
1-12       LOCUS
13-22      Locus name
23-29      Length of sequence, right-justified
31-32      bp
34-36      Blank, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
37-40      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
           mRNA (messenger RNA), or uRNA (small nuclear RNA)
43-52      Blank (implies linear) or circular
53-55      The division code (see Section 3.3)
63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
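
For illustration, here is a minimal Python sketch of slicing an old-style
LOCUS line by the column positions above. This is not Biopython's parser:
LOCUS_FIELDS, parse_old_locus and the example record are hypothetical
names and made-up data, and the sketch assumes the line follows the
release 74 layout exactly.

```python
# Column ranges (1-based, inclusive) copied from the release 74 table above.
LOCUS_FIELDS = {
    "keyword": (1, 12),
    "name": (13, 22),
    "length": (23, 29),
    "unit": (31, 32),
    "strandedness": (34, 36),
    "molecule_type": (37, 40),
    "topology": (43, 52),
    "division": (53, 55),
    "date": (63, 73),
}


def parse_old_locus(line):
    """Split a fixed-width LOCUS line into a dict of stripped fields."""
    # Pad short lines so every slice is well defined.
    padded = line.rstrip("\n").ljust(73)
    fields = {
        name: padded[start - 1:end].strip()
        for name, (start, end) in LOCUS_FIELDS.items()
    }
    if fields["keyword"] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    if not fields["length"].isdigit():
        raise ValueError("no sequence length in columns 23-29: %r" % line)
    fields["length"] = int(fields["length"])
    return fields


# Build a made-up example record column by column, so the spacing is
# guaranteed to match the table rather than being aligned by hand.
example = (
    "LOCUS".ljust(12)         # columns 1-12
    + "EXAMPLE1".ljust(10)    # columns 13-22
    + "1234".rjust(7)         # columns 23-29, right-justified
    + " " + "bp"              # column 30 blank, 31-32
    + " " + "ds-"             # column 33 blank, 34-36
    + "DNA".ljust(4)          # columns 37-40
    + "  " + "circular".ljust(10)  # columns 41-42 blank, 43-52
    + "SYN"                   # columns 53-55
    + " " * 7 + "15-MAR-1991"  # columns 56-62 blank, 63-73
)

record = parse_old_locus(example)
```

A line whose columns are shifted would typically fail the length check or
come back with fields in the wrong slots, which is the kind of record
deficiency you can then point out.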
The spacing is more explicitly laid out in later versions. The best
part is the Entrez CD order form (clipped out by scissors to be snail-
mailed) at the end of the file!
chris