[Biopython-dev] Parsing features: fuzzies in GenBank annotation (attn: Brad)

Mon Sep 24 16:43:08 EDT 2001

Since this is being reviewed ....

Brad, please make a note of this; everyone should understand that it is there.

In GenBank records, the following join format can pop up:

    "join(10000,10200..10450)"

The numbers here describe a single base (10000) joined to a second exon
(10200..10450). Can this happen in biology? I am still working that out.
And if it is not a real one-base join, what does the annotation represent
biologically? I am working that out too.

However, please note that this is a possible annotation in any case. Programs
that use feature information should know how to handle it.
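
To make the point concrete, here is a minimal sketch of how a program might
cope with a bare position inside a join. It is not the Biopython parser; the
function name parse_join is hypothetical, and it assumes only plain integers
and start..end ranges (no fuzzy positions such as <10).

```python
import re


def parse_join(location):
    """Return (operator, [(start, end), ...]) for a simple GenBank location.

    Hypothetical helper for illustration only: handles plain integers and
    start..end ranges, optionally wrapped in join(...).
    """
    text = location.strip()
    operator = None
    match = re.fullmatch(r"join\((.*)\)", text)
    if match:
        operator, text = "join", match.group(1)
    parts = []
    for piece in text.split(","):
        piece = piece.strip()
        if ".." in piece:
            start, end = piece.split("..")
            parts.append((int(start), int(end)))
        else:
            # A bare number is a single base: start and end coincide.
            parts.append((int(piece), int(piece)))
    return operator, parts


operator, parts = parse_join("join(10000,10200..10450)")
```

The key design choice is to treat the lone 10000 as a range whose start and
end are equal, so downstream code can handle every segment the same way.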

This cropped up while I was parsing the RefSeq S. cerevisiae data. Go to
NCBI and download Chromosome 9, and you will see what I am talking about.

I will post what I find out, but if anyone else has some insight on this or
wants to look into it, please post.

Peter

P.S. Pretty unbelievable, is it not?

In response to the following comment --------------------------

3.
Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:
CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that.

I'd like to propose adding a location_operator attribute to
SeqFeature (already done in CVS) and have the top level SeqFeature
be type "CDS" with location_operator "join", and all sub_features
also be of the same type and location_operator. This will only
affect people who relied on the previous (fairly ugly)
type/location_operator concatenation mechanism.
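
A rough sketch of the proposed layout, for anyone who wants to see it spelled
out. This is not the actual SeqFeature class: the Feature dataclass and the
build_joined_feature helper are hypothetical stand-ins, and only the attribute
names type, location_operator and sub_features come from the proposal above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature (not Bio.SeqFeature)."""
    type: str
    start: int
    end: int
    location_operator: str = ""
    sub_features: List["Feature"] = field(default_factory=list)


def build_joined_feature(feature_type, parts, operator="join"):
    """Build a top-level feature spanning all parts, plus one sub-feature each.

    Top level and sub-features share the same type and location_operator,
    rather than encoding the operator in the type (e.g. "CDS_join").
    """
    top = Feature(
        feature_type,
        min(start for start, _ in parts),
        max(end for _, end in parts),
        operator,
    )
    top.sub_features = [
        Feature(feature_type, start, end, operator) for start, end in parts
    ]
    return top


# CDS             join(104..160,320..390,504..579)
cds = build_joined_feature("CDS", [(104, 160), (320, 390), (504, 579)])
```

With this layout, code that only cares about the overall extent reads the top
level (104..579), while code that needs the exons walks sub_features and can
check location_operator instead of picking apart the type string.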