[Biopython-dev] GenBank parser -- first go

Mon Dec 4 22:32:59 EST 2000

Hello all;
As promised, I spent this weekend getting together a GenBank parser,
which I hope is something that we could include in Biopython in the
future. What I've got so far is available from:

http://www.bioinformatics.org/bradstuff/bp/gb_parser-20001204.tar.gz

It has a nice distutils setup script, and everything will install into
the Bio.PGML directory (PGML = Plant Genome Mapping Lab -> that's my
little subdirectory to keep things I work on separate from Biopython).
The parser uses Martel-0.4, so you'll need to have that installed to
use this. Making this would definitely not have been possible without
all of the cool things in Martel, so we all definitely have to give
Andrew another big pat on the back for his awesome tool :-).

It is, I hope, a full-featured GenBank parser that parses things into
SeqFeature classes. I'm hoping that these SeqFeature classes (or
something derived from them) will be something we can include in
Biopython as well. It would be really nice to have some "standard"
objects for features, to help us be more compatible with the Biocorba
and BioXML projects. Anyway, the parser and seq features have
the following exciting features:

* fully parses out Feature tables. This includes support for
sub-features (e.g. the exons of a CDS object).

* deals with 'the dreaded fuzziness' in locations. There should be
support for all of the types of fuzziness, but I've tried not to make
it much more difficult to access locations if you don't care about
fuzziness at all.

* parses into SeqRecord objects with Seq objects that are hopefully
AlphabetStrict in the proper manner.
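
As a rough illustration of the fuzziness point above, here is a minimal
sketch of the general idea. It is not the code from the tarball: the
class and function names (Position, ExactPosition, BeforePosition,
AfterPosition, parse_position, parse_span) are made up for this example,
and it only handles simple "start..end" spans with optional "<" and ">"
markers.

```python
class Position:
    """Hypothetical base class: every position wraps a plain integer."""

    def __init__(self, position):
        self.position = position

    def __int__(self):
        # Callers who don't care about fuzziness can just use int(pos).
        return self.position


class ExactPosition(Position):
    def __str__(self):
        return str(self.position)


class BeforePosition(Position):
    """A position written as '<5': it begins somewhere before base 5."""

    def __str__(self):
        return "<%d" % self.position


class AfterPosition(Position):
    """A position written as '>5': it ends somewhere after base 5."""

    def __str__(self):
        return ">%d" % self.position


def parse_position(text):
    """Turn one end of a location into a Position object."""
    if text.startswith("<"):
        return BeforePosition(int(text[1:]))
    if text.startswith(">"):
        return AfterPosition(int(text[1:]))
    return ExactPosition(int(text))


def parse_span(text):
    """Parse a simple 'start..end' location into two Position objects."""
    start, end = text.split("..")
    return parse_position(start), parse_position(end)


start, end = parse_span("<1..888")
print(start, end)            # prints: <1 888
print(int(start), int(end))  # prints: 1 888
```

The point of the sketch is that the fuzziness lives in the type of the
position object, so code that ignores it still gets plain integers.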

I haven't written any docs on using these yet (I've got to get to work on
things for lab and school now :-), but the parsers work like other
Biopython parsers such as the Blast one (i.e. with Iterators and
Parsers). There are also a couple of example scripts to get things going.
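
For anyone who hasn't used the Iterator/Parser style before, here is a
minimal, self-contained sketch of the pattern. It is only an assumption
about the general shape, not the actual interface in the tarball:
ToyIterator and ToyParser are hypothetical names, the toy parser only
pulls out the LOCUS name, and the two records are made-up sample data.

```python
import io

# Made-up sample data: two tiny GenBank-style records ending in "//".
SAMPLE = """LOCUS       ABC123   10 bp
ORIGIN
        1 acgtacgtac
//
LOCUS       XYZ789   8 bp
ORIGIN
        1 ttgacctg
//
"""


class ToyParser:
    """Hypothetical parser: turns the text of one record into a dict."""

    def parse(self, text):
        for line in text.splitlines():
            if line.startswith("LOCUS"):
                return {"locus": line.split()[1]}
        return {"locus": None}


class ToyIterator:
    """Hypothetical iterator: yields one record at a time from a handle.

    Without a parser it returns the raw record text; with a parser it
    returns whatever the parser makes of that text.
    """

    def __init__(self, handle, parser=None):
        self.handle = handle
        self.parser = parser

    def __iter__(self):
        return self

    def __next__(self):
        lines = []
        for line in self.handle:
            lines.append(line)
            if line.rstrip() == "//":  # end-of-record marker
                break
        text = "".join(lines)
        if not text.strip():
            raise StopIteration
        if self.parser is None:
            return text
        return self.parser.parse(text)


records = list(ToyIterator(io.StringIO(SAMPLE), ToyParser()))
print(records)
```

Splitting the work this way means the iterator only has to know where
one record ends, and the parser only ever sees a single record.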

I'm really looking for feedback in the following areas:

1. Does this code look decent? Anyone besides me want to see this in
Biopython? 

2. Does this parser parse your favorite GenBank files? I've tested it
on a few things, but they are mostly plant sequences, since that's
what I've got around here. There is a script included in the tarball,
"find_parser_problems.py", which will, if you run it on a GenBank
file, tell you which accession numbers, if any, cause parser
problems. If you could send me lists of accession numbers that break
it, it would really help to make sure it works in more cases.

3. Does the output you get have the same info as the initial GenBank
file (i.e. are there any ugly bugs)? I have another script included,
"check_output.py", which will spit out the parsed information to make
it possible to compare it with the initial GenBank file and see if I
screwed anything up. I've hand-checked a couple of files, but it would
really help to have other people debugging this as well.

4. What do people think about the SeqFeature classes? Like 'em? Hate
'em? Suggestions for improvement?

5. Can the code be sped up or improved in any way? Suggestions to
help me code better are always very welcome!

Thanks for listening and enjoy!

Brad