[BioPython] Patch: Minor GenBank Parsing Problems

Jan T. Kim jtk at cmp.uea.ac.uk
Thu Mar 10 07:20:00 EST 2005


Dear Biopython Maintainers,

I've patched Bio/expressions/genbank.py to solve the parsing
problems I recently reported. The patch is small, so I've simply
attached it to this message; I hope that's OK.

The patch doesn't introduce new failures into the regression tests (I
currently get a failure on Restriction, which is somewhat strange but,
at any rate, unrelated to the subject here). I haven't added any tests
myself.

On Mon, Mar 07, 2005 at 05:02:45PM +0000, Jan T. Kim wrote:

> I have noticed the following problems with the Bio.GenBank.FeatureParser:
> 
>     * The parser appears to depend on additional information in the
>       LOCUS line. It works with
> 
>         LOCUS       U00096               4639675 bp    DNA     circular BCT 24-JUN-2004
>       while the undecorated line
> 
>         LOCUS       U00096
> 
>       results in a Martel.Parser.ParserPositionException.

I solved this by making everything following the locus value optional.
Biopython doesn't appear to need the additional information; at least
for me, this fixes the problem without introducing any new ones.

Note: You probably won't get records with the additional info missing
from NCBI, but if you use EMBOSS to write GenBank formatted files, the
additional info is lost, and having to make that up manually just to
satisfy BioPython is a bit annoying.
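
To illustrate the effect of wrapping the rest of the LOCUS line in a
single optional group, here is a minimal sketch using Python's standard
re module instead of Martel. The pattern, the group names and the
parse_locus helper are simplified assumptions made for this example;
they are not the actual Biopython expression.

```python
import re

# Simplified stand-in for the patched expression (an assumption, not the
# real Biopython grammar): everything after the locus name sits inside
# one optional group, mirroring the Martel.Opt(...) wrapper in the patch.
LOCUS_LINE = re.compile(
    r"^LOCUS\s+(?P<locus>\S+)"
    r"(?:\s+(?P<size>\d+)\s+(?P<unit>bp|aa)"
    r"\s+(?:(?P<residue_type>.+?)\s+)?"
    r"(?P<division>[A-Z]{3})\s+(?P<date>\d{2}-[A-Z]{3}-\d{4}))?"
    r"\s*$"
)


def parse_locus(line):
    """Return the named fields of a LOCUS line, or None if it does not match."""
    match = LOCUS_LINE.match(line)
    return match.groupdict() if match else None
```

With this sketch, both the full line and the bare "LOCUS       U00096"
line are accepted; for the bare line all fields except locus are None.
Because the group is optional as a whole, a line carrying only part of
the additional information (for example a size but no date) is still
rejected, which is also how the patched expression is structured.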

>     * The parser also doesn't like some accession types; it fails on the line
> 
>         ACCESSION   U00096 AE000111-AE000510
> 
>       while trimming that to
> 
>         ACCESSION   U00096
> 
>       results in a file that now parses ok.

This was solved by allowing a "-" in accession numbers. While this fixes
the problem for now, it may not be ideal, as "AE000111-AE000510" may not
be a valid accession number. It seems to denote a range of accession
numbers (AE000111, AE000112, ..., AE000510), but I'm not certain about
this.
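
The difference between the old and new character classes can be shown
with a short sketch based on Python's standard re module rather than
Martel. The helper names accession_tokens and expand_range are
hypothetical, and expand_range only encodes the tentative reading above
(that "AE000111-AE000510" denotes a contiguous range); that reading is
an assumption, not confirmed GenBank semantics.

```python
import re

OLD_ACCESSION = re.compile(r"[\w]+")    # character class before the patch
NEW_ACCESSION = re.compile(r"[\w\-]+")  # character class after the patch

# Assumed shape of a range token: PREFIXdigits-PREFIXdigits
RANGE_TOKEN = re.compile(r"([A-Za-z]+)(\d+)-([A-Za-z]+)(\d+)")


def accession_tokens(line, token_re):
    """Return the accession tokens if each one fully matches token_re, else None."""
    parts = line.split()
    if not parts or parts[0] != "ACCESSION":
        return None
    items = parts[1:]
    if all(token_re.fullmatch(item) for item in items):
        return items
    return None


def expand_range(token):
    """Expand a token like AE000111-AE000510, ASSUMING it denotes a range.

    Tokens that do not look like a range with a shared prefix are
    returned unchanged as a one-element list.
    """
    match = RANGE_TOKEN.fullmatch(token)
    if not match or match.group(1) != match.group(3):
        return [token]
    prefix, start, _, end = match.groups()
    width = len(start)  # keep the zero padding of the original numbers
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]
```

Applied to "ACCESSION   U00096 AE000111-AE000510", the old class rejects
the line because of the "-", while the new class accepts both tokens as
they are. The patch itself does nothing like expand_range; it simply
stores the hyphenated token as one accession.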

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |    *NEW*    email: jtk at cmp.uea.ac.uk                               |
 |    *NEW*    WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*
-------------- next part --------------
diff -Naur biopython-1.40b/Bio/expressions/genbank.py biopython-1.40b-hacked/Bio/expressions/genbank.py
--- biopython-1.40b/Bio/expressions/genbank.py	2004-03-18 00:53:24.000000000 +0000
+++ biopython-1.40b-hacked/Bio/expressions/genbank.py	2005-03-10 11:34:37.134325592 +0000
@@ -116,20 +116,22 @@
 data_file_division = Martel.Group("data_file_division",
                                   Martel.Alt(*divisions))
 
+# JTK: made everything following "LOCUS XXXXXX" optional -- seqret of EMBOSS
+# doesn't supply all this base count, division, linear/circular etc. stuff
 locus_line = Martel.Group("locus_line",
                           Martel.Str("LOCUS") +
                           blank_space +
                           locus +
-                          blank_space +
-                          size +
-                          blank_space +
-                          Martel.Re("bp|aa") +
-                          blank_space +
-                          Martel.Opt(residue_type +
-                                     blank_space) +
-                          data_file_division +
-                          blank_space +
-                          date +
+                          Martel.Opt(blank_space +
+                                     size +
+                                     blank_space +
+                                     Martel.Re("bp|aa") +
+                                     blank_space +
+                                     Martel.Opt(residue_type +
+                                                blank_space) +
+                                     data_file_division +
+                                     blank_space +
+                                     date) +
                           Martel.AnyEol())
 
 # definition line
@@ -141,8 +143,11 @@
 
 # accession line
 # ACCESSION   AC007323
+# JTK: allowed also a "-" in accession, to allow for
+# ACCESSION   U00096 AE000111-AE000510
+# as found in E. coli K12 GenBank record
 accession = Martel.Group("accession",
-                         Martel.Re("[\w]+"))
+                         Martel.Re("[\w\-]+"))
 
 accession_block = Martel.Group("accession_block",
                                Martel.Str("ACCESSION") +

