[Biopython-dev] GenBank parser -- first go

Tue Dec 5 02:36:50 EST 2000

Hi Brad,

On Mon, 4 Dec 2000, Brad Chapman wrote:

> Hello all;
> As promised, I spent this weekend getting together a GenBank parser,
> which I hope is something that we could include in Biopython in the
> future. What I've got so far is available from:

Great!  We need a GenBank parser.

> It is, I hope, a full featured GenBank parser that parses things into
> SeqFeature classes. I'm hoping that these SeqFeature classes (or
> something derived from them) will be something we can include in
> Biopython as well. It would be really nice to have some "standard"
> objects for features, to help us be more compatible with the Biocorba
> and  BioXML projects.

Yes, I definitely agree that we need a general class.  However, I've been
purposely shying away from proposing a general framework for
annotations, for two main reasons.  First, it's a hard, unsolved problem
that we don't know how to do yet.  If you look at the models for BioJava,
BioPerl, and GAME, you'll see three different, only partially
compatible solutions.  I suspect how you handle annotations is going to
depend on the purpose of the application.  (Though I suppose "to store
GenBank annotations" is a reasonable purpose.)  The second reason is that
I like the idea of specific data structures for each database.  That way,
people who really care about, say, SwissProt, will know how to retrieve
the data from their favorite field without having to muck around to see
how it's getting coerced into a one-size-fits-all framework.  If you can
only parse into a general data structure, then you're bound to lose data,
since I don't believe a single data structure can hold all the types of
information from every database.  I don't believe there's any general
data structure in existence that can handle the GenBank location
field.  It's described by a BNF grammar and requires a tree!
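
To illustrate why a location needs a tree, here is a minimal sketch of a
recursive-descent parser for a small subset of the location syntax.  It
is not code from Brad's parser or from Biopython: the names tokenize,
parse_location, and _parse, and the tuple shapes ("span", "point"), are
made up for this example.  It assumes only plain positions, N..M ranges,
and the complement, join, and order operators, and ignores fuzziness and
references to other accessions.

```python
import re

# Tokens in this simplified subset: operator names, integers, "..",
# parentheses, and commas.
_TOKEN = re.compile(r"\s*(complement|join|order|\d+|\.\.|[(),])")
_END = "<end>"  # sentinel so the parser never runs off the token list


def tokenize(text):
    """Split a location string into tokens, followed by the sentinel."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected input: %r" % text[pos:])
        tokens.append(match.group(1))
        pos = match.end()
    tokens.append(_END)
    return tokens


def _parse(tokens, i):
    """Parse one location starting at tokens[i]; return (node, next_i)."""
    tok = tokens[i]
    if tok in ("complement", "join", "order"):
        if tokens[i + 1] != "(":
            raise ValueError("expected '(' after %s" % tok)
        i += 2
        children = []
        while True:
            child, i = _parse(tokens, i)  # operators nest, hence a tree
            children.append(child)
            if tokens[i] == ",":
                i += 1
            elif tokens[i] == ")":
                return (tok, children), i + 1
            else:
                raise ValueError("expected ',' or ')', got %r" % tokens[i])
    if tok.isdigit():
        if tokens[i + 1] == "..":
            if not tokens[i + 2].isdigit():
                raise ValueError("expected a number after '..'")
            return ("span", int(tok), int(tokens[i + 2])), i + 3
        return ("point", int(tok)), i + 1
    raise ValueError("unexpected token %r" % tok)


def parse_location(text):
    """Return a nested-tuple tree for a simplified location string."""
    tokens = tokenize(text)
    node, i = _parse(tokens, 0)
    if tokens[i] != _END:
        raise ValueError("trailing input after location")
    return node


tree = parse_location("join(complement(1..10),20..30)")
print(tree)
```

The nested result, a join node holding a complement node and a span, is
the part a flat start/end pair cannot represent.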

> Anyways, the parser and seq features have the following exciting
> features:
> 
> * fully parses out Feature tables. This includes support for sub
> Features (ie. the exons of a CDS object).
> 
> * deals with 'the dreaded fuzziness' in locations. There should be
> support for all of the types of fuzziness, but I've tried to not make
> it much more difficult to access locations if you don't care about 
> fuzziness at all.

Do we need to deal with GenBank location functions like complement or
order?

> * parses into SeqRecord objects with Seq objects that are hopefully
> AlphabetStrict in the proper manner.

I'm not sure that's a good thing for GenBank.  Does GenBank store the
alphabet for the sequence?  What if the sequence doesn't strictly follow
the alphabet?
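
One way to find out before committing to a strict alphabet would be to
check which letters in a parsed sequence fall outside it.  The sketch
below is only an illustration and does not use the Biopython alphabet
classes: the constant names and the unexpected_letters helper are
hypothetical, and the letter sets are the standard IUPAC nucleotide
codes.

```python
# Assumed letter sets: the four unambiguous DNA bases, and the full set
# of IUPAC nucleotide codes including ambiguity symbols.
UNAMBIGUOUS_DNA = frozenset("GATC")
AMBIGUOUS_DNA = frozenset("GATCRYWSMKHBVDN")


def unexpected_letters(sequence, alphabet):
    """Return the sorted letters in sequence that are not in alphabet."""
    return sorted(set(sequence.upper()) - alphabet)


# GenBank sequence lines are lowercase, so compare case-insensitively.
sequence = "gatcnnryga"
print(unexpected_letters(sequence, UNAMBIGUOUS_DNA))
print(unexpected_letters(sequence, AMBIGUOUS_DNA))
```

A record whose sequence fails the strict check could then fall back to a
looser alphabet instead of being rejected.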

> I didn't write any docs on using these yet (I've got to get to work on 
> things for lab and school now :-), but the parsers work like other
> Biopython parsers like Blast (ie. with Iterators and Parsers). There
> are also a couple of example scripts to get things going.
> 
> I'm really looking for feedback in the following areas:
> 
> 1. Does this code look decent? Anyone besides me want to see this in
> Biopython? 

- There's a TaggingConsumer in Bio.ParserSupport.  It looks like this does
something similar to _PrintConsumer.  It's supposed to be used for
debugging purposes so that you know what's getting passed when.  If it's
not appropriate, please let me know how to extend it so that it's more
generally useful.
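
For anyone unfamiliar with the idea, here is a rough sketch of what a
debugging consumer of this kind does: it accepts any event and records
the event name with its arguments, so you can see what gets passed and
in what order.  This is not the actual TaggingConsumer code from
Bio.ParserSupport; the TracingConsumer class and the event names in the
usage lines are invented for illustration.

```python
class TracingConsumer:
    """Record every event a scanner sends, in the order received."""

    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        # Only called for attributes that are not found normally, so
        # any unknown public name is treated as an event method.
        if name.startswith("_"):
            raise AttributeError(name)

        def record(*args):
            self.log.append((name, args))

        return record


# Hypothetical events, standing in for whatever a scanner would emit.
consumer = TracingConsumer()
consumer.start_feature_table()
consumer.feature_key("CDS")
consumer.location("join(1..10,20..30)")
for event, args in consumer.log:
    print(event, args)
```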

> 2. Does this parser parse your favorite GenBank files? I've tested it
> on a few things, but they are mostly plant sequences, since that's
> what I've got around here. There is a script included in the tarball
> "find_parser_problems.py", which will, if you run it on a GenBank
> file, tell you what accession numbers, if any, cause parser
> problems. If you could send me lists of accession numbers that break
> it, it would really help to make sure it works in more cases.
> 
> 3. Does the output you get have the same info as the initial GenBank
> file (ie -- are there any ugly bugs)? I have another script included,
> "check_output.py," which will spit out the parsed information to make
> it possible to compare it with the initial GenBank file and see if I
> screwed anything up. I've hand checked a couple of files, but it would 
> really help to have other people debugging this as well.
> 
> 4. What do people think about the SeqFeature classes? Like 'em? Hate
> 'em? Suggestions for improvement?

Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py?  
It would prevent stuff like:
    from Bio.SeqFeature import SeqFeature
or even worse,
    from Bio.SeqFeature.SeqFeature import SeqFeature

> 5. Can the code be speeded up/improved in any ways? Suggestions to
> help me code better are always very welcome!

Thanks for doing this!

Jeff