[BioPython] Problem using GenBank.Dictionary

Brad Chapman chapmanb@arches.uga.edu
Fri, 24 May 2002 12:23:25 -0400


Hi Michael;
Sorry for the delay in getting back with you on this -- I was out of
town on family matters and am swamped with work.

> I am having some trouble making a GenBank database (using an index file
> and GenBank.Dictionary). I can use a GenBank.Iterator to parse through
> genbank records in a file, however, if I make a GenBank database of the
> file and then request a particular record, I get a parsing error
> "ParserPositionException: error parsing at or beyond character 0". The
> dictionary seems to be created properly (ie it has keys corresponding to
> all the records in the file) and the error only occurs when I call for a
> particular record.

The problem here lies in dealing with Martel (which I used for the
underlying parser). I made some assumptions in creating the dictionary
which work fine for small files, but fail once the file you are indexing
grows beyond a certain size. I spent some really frustrating hours
tracking this down a while back, and it turned out to be a limitation in
the Martel system (or rather, an unexpected need).

We have been working on solutions to this problem:

1. We have been developing a new indexing system in collaboration with
the other Bio* projects. As far as I know this is still "under
development"; Andrew would be the person who knows the most about
its current status.

2. You can load GenBank records into an SQL database using BioSQL and
then retrieve them. This requires MySQL and MySQLdb (the Python interface
to MySQL), but after that it's simple to use. You can just create a
database and load the BioSQL schema into it.

Then the following python code will load a GenBank file into this
database:

from BioSQL import BioSeqDatabase
from Bio import GenBank

server = BioSeqDatabase.open_database(user = your_mysql_username,
                                      passwd = your_mysql_password,
                                      db = your_db)
db = server.new_database(internal_db_name)
parser = GenBank.FeatureParser()
iterator = GenBank.Iterator(open("your_file.gb"), parser)
db.load(iterator)

You can then retrieve things from this database with code like:

server = BioSeqDatabase.open_database(user = your_mysql_username,
                                      passwd = your_mysql_password, 
                                      db = your_db)
db = server[internal_db_name]
item = db.lookup(accession = "whatever")

Then you'll get something to work with.

This does work well (I use it in my work), but still needs
documentation.
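
If setting up MySQL is more than you want just to pull single records
out of one file, the lookup itself can be sketched with the standard
library alone. This is only an illustration under my own assumptions,
not how GenBank.Dictionary or the new indexing system work: the function
names build_offset_index and fetch_record are made up for this example,
the two records are toy data, and records are keyed by the LOCUS name
rather than by a real accession.

```python
import io


def build_offset_index(handle):
    """Map each record's LOCUS name to its (start, length) byte span.

    The handle must be opened in binary mode so that tell() offsets
    are exact byte positions.
    """
    index = {}
    name = None
    start = 0
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS"):
            # Second whitespace-separated field on the LOCUS line is the name.
            name = line.split()[1].decode("ascii")
            start = pos
        elif line.startswith(b"//") and name is not None:
            # "//" terminates a record; store the span including this line.
            index[name] = (start, handle.tell() - start)
            name = None
    return index


def fetch_record(handle, index, name):
    """Return the raw text of one record using the stored byte span."""
    start, length = index[name]
    handle.seek(start)
    return handle.read(length).decode("ascii")


# Toy data standing in for open("your_file.gb", "rb").
sample = (
    b"LOCUS       AB000001   12 bp    DNA\n"
    b"ORIGIN\n"
    b"        1 acgtacgtac gt\n"
    b"//\n"
    b"LOCUS       AB000002   8 bp    DNA\n"
    b"ORIGIN\n"
    b"        1 ttggccaa\n"
    b"//\n"
)
handle = io.BytesIO(sample)
index = build_offset_index(handle)
text = fetch_record(handle, index, "AB000002")
```

The index only stores positions, so each lookup reads a single record
from the file. The returned text is unparsed; you would still need to
hand it to a parser to get a record object back.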

I hope this rambling helps some. If you want to explore an option, let
us know and I'll be happy to work with you more on it.

Brad
-- 
PGP public key available from http://pgp.mit.edu/