[Biopython-dev] Second go at GenBank parser

Wed Dec 20 19:23:21 EST 2000

Hi Brad,

This is great!  You've filled two gaping holes in Biopython functionality.  
Please check these in, as I'm sure people will want to start using the
code.

> o Tested on a bunch of different downloads from the ncbi genbank
> directory, so the syntax is much more "battle tested" than the last
> and handles lots more cases, including the dreaded "fake /" cases
> (found some more hideous ones like that in a bacterial
> dataset). GenBank, wow, what a headache!

Good.  GenBank is notoriously hard to deal with, and I suspect work on the
format will be ongoing.

> o I integrated Andrew's SPARK based location parser, and now use it to
> parse the locations. spark.py is included in the tarball, but we still
> need to figure out how we want to handle it in Biopython

Yep, definitely a good thing.  Using SPARK is the right way to go.

> o Coded up a Record class for GenBank records and added a parser and
> consumer that parse GenBank data into it.

Thanks!

> I only want to get it into Biopython if people feel it is up to par
> (don't want to bring down the good name of Biopython :-).

Heh.  From what I gather, it's runnable.  Let's get this out the door so
people can start using it, and hopefully give good comments and (even
better) patches.

> o Naming of modules -- right now my naming sucks (the "supplementary"
> feature classes, like Location.py and Reference.py, are in a module
> called 'FeatureInfo', for instance. yeck.), so if people have good
> ideas for how to name things I'll definitely take 'em.

Are these meant to be used with SeqFeatures?  If so, how about just
SeqFeature.Location and SeqFeature.Reference?

> I'm also not sure where a good place for spark.py to live in Biopython
> is (BTW, I think we should include it :-).

Where you have it now seems as good a place as any (without the
PGML).  Including it is fine with me.

> Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py --
> Should my GenBank.py go into __init__.py?

Yes.  GenBank is a good name for it, and as per Andrew's earlier email, we
should avoid having code in both GenBank/__init__.py and
GenBank/GenBank.py.

> o HTML -- Cayte expressed concerns about parsing GenBank files with a
> bunch o' HTML stuck in them. In my opinion it isn't really worth
> worrying about this because it is so easy to get the text flat files
> -- do lots of people think I should work on HTML support, or do they
> agree with me?

Are the HTML-formatted files different?  Does it work if you just strip
the HTML tags?  I guess for HTML-formatted data from GenBank, it would be
nice to handle, but very low priority.  HTML-formatted data from other
sources, no.  If someone needs that functionality, they can submit the
patches!  :)
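
For what it's worth, the "just strip the HTML tags" idea could be sketched
roughly like this with Python's standard html.parser module.  This is only
an illustration, not code from Brad's tarball: TagStripper and strip_html
are hypothetical names, and it assumes the HTML-formatted record differs
from the flat file only by markup and character entities, which is exactly
the open question.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; back into "<",
        # which matters for GenBank locations like "<1..200".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; the tags themselves
        # are simply ignored.
        self.chunks.append(data)


def strip_html(text):
    """Return `text` with HTML tags removed and entities decoded."""
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.chunks)


if __name__ == "__main__":
    html_line = '<pre>LOCUS       <a href="x">AB000001</a>  100 bp</pre>'
    print(strip_html(html_line))
```

The stripped text could then be handed to the GenBank parser as if it were
a plain flat file, and any remaining differences would show up as parse
errors.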

Jeff