[Biopython-dev] Second go at GenBank parser

Wed Dec 20 10:11:49 EST 2000

Hello all;

I've got together a second tarball of the GenBank parser that we've
been working on. You can grab it from:

http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001222.tar.gz

I think this is a huge improvement over the first, mostly due to the
many many helpful comments from everyone here. I really appreciated
everyone's comments and interest, and I think that we've fixed/worked
on all of the points that people raised. I'll try to respond to some
specific mails later today. Sorry to not be able to respond to
everything in a timely manner. I guess if I only have time to write or 
code, it is better to be coding :-).

Anyways, the new version has the following new and
oh-so-incredibly-exciting features:

o Much better Martel syntax for parsing things. This is almost
entirely due to Andrew -- who sent me lots of nice comments and good
tips, and even wrote up his own syntax which I could borrow from. Tons 
of the new syntax is taken from Andrew's stuff, so he deserves a huge
pat on the back for this :-).

o Tested on a bunch of different downloads from the NCBI GenBank
directory, so the syntax is much more "battle tested" than the last
one and handles lots more cases, including the dreaded "fake /" cases
(found some more hideous ones like that in a bacterial
dataset). GenBank, wow, what a headache!

o I integrated Andrew's SPARK based location parser, and now use it to
parse the locations. spark.py is included in the tarball, but we
still need to figure out how we want to do it in Biopython (once the
GenBank parser is up to snuff). Another big thanks to Andrew for
providing the location parser! I integrated this first before doing
all the testing, so it has been through a workout over here. I found
one case it didn't handle (when you have a "between" location by
itself without parentheses, like '6.27') and made the small fix for
this. Otherwise it performed great!

o Coded up a Record class for GenBank records and added a parser and
consumer that parse GenBank data into it.

o Miscellaneous bug fixes that popped up (hopefully I squashed more
than I introduced :-).

o Better testing -- again thanks to Andrew. Have I mentioned yet that
he is my personal hero?
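
To make the bare "between" case from the location parser item above
concrete, here is a rough sketch of the kinds of strings involved. This
is NOT the SPARK based parser from the tarball; the function name
parse_simple_location and the tuple it returns are made up purely for
illustration, and it only assumes three simple forms ('6', '6..27' and
'6.27', the last with or without parentheses).

```python
import re

# Hypothetical helper for illustration only; not code from the tarball.
# It ignores operators such as join() and complement() entirely.
_PATTERNS = (
    ("between", re.compile(r"^(\d+)\.(\d+)$")),   # e.g. '6.27'
    ("range", re.compile(r"^(\d+)\.\.(\d+)$")),   # e.g. '6..27'
    ("single", re.compile(r"^(\d+)$")),           # e.g. '6'
)


def parse_simple_location(text):
    """Return a (kind, start, end) tuple for a simple location string."""
    text = text.strip()
    # Accept the same location with or without surrounding parentheses,
    # so '6.27' and '(6.27)' are treated alike.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            numbers = [int(group) for group in match.groups()]
            return (kind, numbers[0], numbers[-1])
    raise ValueError("unsupported location: %r" % text)
```

With this sketch, parse_simple_location('6.27') and
parse_simple_location('(6.27)') give the same result, which is the
behaviour the small fix was after.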

If people have time to download and test this and give me their
feedback I would really appreciate it. I only want to get it into
Biopython if people feel it is up to par (don't want to bring down the 
good name of Biopython :-). I'm especially interested in feedback on
the following points:

o I would really like to hear about anything that causes errors in any 
of the parsers (or my code!).

o Naming of modules -- right now my naming sucks (the "supplementary"
feature classes, like Location.py and Reference.py, are in a module
called 'FeatureInfo', for instance. yeck.), so if people have good
ideas for how to name things I'll definitely take 'em. I'm also not
sure where a good place for spark.py to live in Biopython is (BTW, I
think we should include it :-). Finally, I noticed Jeff put his snazzy 
code in GenBank/__init__.py -- Should my GenBank.py go into
__init__.py? Should it be named something else?

o Data transfer -- is everything being transferred okay? Am I messing
anything up/losing data? People hand checking different records for me 
would be very very helpful.

o HTML -- Cayte expressed concerns about parsing GenBank files with a
bunch o' HTML stuck in them. In my opinion it isn't really worth
worrying about this because it is so easy to get the text flat files
-- do lots of people think I should work on HTML support, or do they
agree with me?

Thanks again for everyone's feedback on the first version!

Brad