[Biopython-dev] GenBank parser -- first go

Wed Dec 6 02:21:38 EST 2000

Hey Jeff,
Thanks a lot for taking a look at the code!

[SeqFeature classes]
> Yes, I definitely agree with needing a general class.  However, I've been
> purposefully shying away from proposing a general framework for
> annotations for two main reasons.  First, it's a hard, unsolved problem
> that we don't know how to do yet.  If you look at the models for biojava,
> bioperl, and game, you'll see that there are 3 different partially
> compatible solutions.  I suspect how you handle annotations is going to
> depend on the purpose of the applications.  (Though I suppose "to store
> genbank annotations" is a reasonable purpose).  

Agreed. I think our chances of getting it perfect are pretty slim
:-). However, I think it would really help writing "applications that
use Biopython" to have some kind of general class to work off of (even 
if it is imperfect). It is just too much work to have to support
Genbank record classes and EMBL record classes and whatever record
classes. I imagine this would be especially painful for writing user
interfaces. I'm not sure if my SeqFeature classes are the best thing
ever, but it is just meant as a bit of a start. I am definately
willing to throw out/ammend lots of what I wrote if people have got
ideas for changing them to be better.

> The second reason is that
> I like the idea of specific data structures for each database.  That way,
> people that really care about, say, swissprot, will know how to retrieve
> the data from their favorite field without having to muck around to see
> how it's getting coerced into a one-size-fits-all framework.  

Also agreed. I'll work on a GenBank specific Record class as well, as
I think these make it much easier for people who just want to parse
out GenBank.

> If you can
> only parse into a general data structure, then, since I don't believe a
> single data structure can hold all the types of information from every
> data base, you're bound to lose data.  I don't believe there's any general
> data structure in existance that can handle the genbank location
> field.  It's describe by a BNF grammar and requires a tree!

Very true :-). When putting the "GenBank specific stuff" into the
SeqFeature classes I ended up dumping a lot of it into dictionaries
(like Andrew's annotations dictionary in the SeqRecord class). A
GenBank specific record would definately hold the information in a lot 
more readily accessible format.

> Do we need to deal with genbank function like complement or order?

I'm trying to deal with them (although I forgot to do order! Thanks
for mentioning it). I'm dealing with them in the following way:

complement - I mark the feature as being on the opposite strand (I'm
using a 1, 0, -1 scale like BioCorba -- so -1 is the opposite strand).

join - The top level feature has a location from the start to the end
of the join. The feature then has sub SeqFeatures (also borrowed from
Biocorba) which are the individual exons (or whatever) in the
join. Right now if the top level feature is a CDS, then the sub
Features are labelled as type CDS_span. I think I'll change this to
CDS_join to make it clear they are part of a join.

order - I'll treat this like join, except call the sub features
CDS_order. 

It should also be able to deal with nested locations, like
complement(join(location,location)).

Does this all sound reasonable?

[Alphabets for GenBank]
> I'm not sure that's a good thing for GenBank.  Does GenBank store the
> alphabet for the sequence?  What if the sequence doesn't strictly follow
> the alphabet?

Well, GenBank doesn't really store the alphabet (it does give a base
count for common bases (AGTC) but then specifies anything else as
"other" which isn't very useful for our purposes). What I do is
remember the type from the GenBank file (DNA, RNA, PROTEIN) and then
give the sequence this alphabet. I use the ambiguous DNA and RNA
alphabets so this should cover any letters in the sequence
(hopefully). I'm not sure if this is ideal, but at least it associates 
the type with the sequence. Suggestions about how to be more strict
are welcome on this.

> - There's a TaggingConsumer in Bio.ParserSupport.  It looks like this does
> something similar to _PrintConsumer.  It's supposed to be used for
> debugging purposes so that you know what's getting passed when.  If it's
> not appropriate, please let me know how to extend it so that it's more
> generally useful.

Oh sorry, I meant to document that. TaggingConsumer is great. I just
used the PrintConsumer as I was coding this so that I would make sure
I added all of the necessary callbacks and didn't forget any
information. This was more for my use in coding then for anything
else. Once I was done building the parser, I just copy/pasted it and
used it to build the Feature consumer. I don't think it is worthwhile
to actually include in a final version, but I saved it because I'll
probably need to copy and paste it again to write a
RecordConsumer. But anyways, it was just a coding tool -- for later
debugging TaggingConsumer is great for me. PrintConsumer was just my
way to reduce the number of bugs in my code :-).
> Could you put Bio/SeqFeature/SeqFeature.py code into Bio/SeqFeature.py?  
> It would prevent stuff like:
> from Bio.SeqFeature import SeqFeature
> or even worse,
> from Bio.SeqFeature.SeqFeature import SeqFeature

Well, I was going to recommend to use it like this:

from Bio import SeqFeature
my_feature = SeqFeature.SeqFeature.SeqFeature()

:-) Seriously, this is indeed very ugly. Another possible solution
would be to put all of the Features in a directory called Feature or
Features (instead of SeqFeature) so then the imports would look like:

from Bio.Feature import SeqFeature

Either way is fine, though (or I'm very open to additional
suggestions), so whatever you think.

Thanks again for taking a look at this. I'll try to produce another
version based on this (and any future comments) for next week.

Brad