[Biopython-dev] Second go at GenBank parser
Jeffrey Chang
jchang at SMI.Stanford.EDU
Wed Dec 20 19:23:21 EST 2000
Hi Brad,
This is great! You've filled two gaping holes in biopython functionality.
Please check these in, as I'm sure people will want to start using the
code.
> o Tested on a bunch of different downloads from the ncbi genbank
> directory, so the syntax is much more "battle tested" than the last
> and handles lots more cases, including the dreaded "fake /" cases
> (found some more hideous ones like that in a bacterial
> dataset). GenBank, wow, what a headache!
Good. GenBank is notoriously hard to deal with, and I suspect work on the
format will be ongoing.
> o I integrated Andrew's SPARK-based location parser, and now use it to
> parse the locations. spark.py is included in the tarball, but we still
> need to figure out how we want to do it in Biopython
Yep, definitely a good thing. Using SPARK is the right way to go.
> o Coded up a Record class for GenBank record and added a parser and
> consumer that parse GenBank data into it.
Thanks!
> I only want to get it into Biopython if people feel it is up to par
> (don't want to bring down the good name of Biopython :-).
Heh. From what I gather, it's runnable. Let's get this out the door so
people can start using it, and hopefully give good comments and (even
better) patches.
> o Naming of modules -- right now my naming sucks (the "supplementary"
> feature classes, like Location.py and Reference.py, are in a module
> called 'FeatureInfo', for instance. yeck.), so if people have good
> ideas for how to name things I'll definitely take 'em.
Are these meant to be used with SeqFeatures? If so, how about just
SeqFeature.Location and SeqFeature.Reference?
> I'm also not sure where a good place for spark.py to live in Biopython
> is (BTW, I think we should include it :-).
Where you have it now seems as good a place as any (without the
PGML). Including it is fine with me.
> Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py --
> Should my GenBank.py go into __init__.py?
Yes. GenBank is a good name for it, and as per Andrew's earlier email, we
should avoid having code in both GenBank/__init__.py and
GenBank/GenBank.py.
> o HTML -- Cayte expressed concerns about parsing GenBank files with a
> bunch o' HTML stuck in them. In my opinion it isn't really worth
> worrying about this because it is so easy to get the text flat files
> -- do lots of people think I should work on HTML support, or do they
> agree with me?
Are the HTML-formatted files different? Does it work if you just strip
the HTML tags? I guess for HTML-formatted data from GenBank, it would be
nice to handle, but very low priority. HTML-formatted data from other
sources, no. If someone needs that functionality, they can submit the
patches! :)
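
For what it's worth, here is a minimal sketch of the "just strip the HTML
tags" idea, using only the Python standard library. The names strip_html
and _TagStripper are made up for illustration and are not part of
Biopython, and the sample record is invented. Whether the stripped text
actually parses as a flat file is exactly the open question above.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Hypothetical helper: keeps text content and discards every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; back into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are ignored.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(html_text):
    """Return html_text with all markup removed (illustrative only)."""
    stripper = _TagStripper()
    stripper.feed(html_text)
    stripper.close()
    return stripper.text()


# Invented sample, loosely shaped like an HTML-wrapped record.
sample = (
    '<pre>LOCUS       <a href="#x">AB000001</a>   120 bp    DNA\n'
    "DEFINITION  Example &amp; made-up record.\n"
    "//\n</pre>"
)
flat = strip_html(sample)
```

The result in flat could then be handed to whatever parser expects the
plain-text format, assuming the HTML version differs only by its markup.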
Jeff