[Biopython-dev] Parsing fuzzy features in GenBank annotation (attn: Brad)
Peter Wilkinson
pewilkinson at informaxinc.com
Mon Sep 24 16:43:08 EDT 2001
Since this is being reviewed ....
Brad, please make a note of this; everyone should understand that this case exists.
In GenBank records the following join format will pop up:
"join(10000,10200..10450)"
Read literally, the numbers describe a one-base exon (position 10000) joined
to a second exon (10200..10450). Whether this can actually happen in biology,
or, if it is not a real one-base join, what biology the annotation is meant
to represent, I am still working out.
However, please note that this is a possible annotation in any case. Programs
that use feature information should know how to handle it.
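To make the case concrete, here is a minimal sketch in plain Python of how
a parser could treat a bare position inside a join. It is not the Biopython
GenBank parser; the function name parse_join is made up for illustration,
and it assumes simple locations only (no complement(), no fuzzy "<" or ">"
markers, no remote accessions). A bare number N is returned as the span
(N, N), in the 1-based inclusive coordinates of the flat file.

```python
import re


def parse_join(location):
    """Parse a simple GenBank location string into (start, end) spans.

    Hypothetical helper for illustration only. Coordinates are kept as the
    1-based inclusive numbers found in the record. A bare position such as
    "10000" becomes the one-base span (10000, 10000).
    """
    text = location.strip()
    # Strip the join(...) wrapper if present; otherwise treat the whole
    # string as a single span or position.
    match = re.fullmatch(r"join\((.*)\)", text)
    parts = match.group(1).split(",") if match else [text]

    spans = []
    for part in parts:
        part = part.strip()
        if ".." in part:
            start, end = part.split("..")
        else:
            # Single base: start and end are the same position.
            start = end = part
        spans.append((int(start), int(end)))
    return spans


print(parse_join("join(10000,10200..10450)"))
# [(10000, 10000), (10200, 10450)]
```

Code that assumes every element of a join contains ".." would fail on the
first element here, which is the point of the warning above.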
This cropped up while I was parsing the RefSeq S_cerevisiae data. Go to
NCBI and download chromosome 9, and you will see what I am talking about.
I will post what I find out, but if anyone else wants to look into this or
has some insight, please post.
Peter
P.S. Pretty unbelievable, is it not?
In response to the following comment --------------------------
3.
Andrew:
> Related to that, what's the type used when there are subfeatures?
Previously, if we had a sequence feature like:
CDS join(104..160,320..390,504..579)
I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that.
I'd like to propose adding a location_operator attribute to
SeqFeature (already done in CVS) and have the top level SeqFeature
be type "CDS" with location_operator "join", and all sub_features
also be of the same type and location_operator. This will only
affect people who relied on the previous (fairly ugly)
type/location_operator concatenation mechanism.
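A rough sketch of the proposed layout, for readers who want to see it
spelled out. This is not the actual SeqFeature code in CVS: the Feature
class and the build_joined_feature helper are hypothetical stand-ins, and
locations are simplified to (start, end) tuples. Only the attribute names
type, location, location_operator and sub_features come from the proposal
quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature (illustration only)."""
    type: str
    location: Tuple[int, int]
    location_operator: str = ""
    sub_features: List["Feature"] = field(default_factory=list)


def build_joined_feature(feature_type, spans, operator="join"):
    """Build a top-level feature whose sub-features share its type.

    The top-level location covers the full extent of the spans. Every
    sub-feature keeps the same type and location_operator, instead of a
    concatenated type such as "CDS_join".
    """
    parts = [Feature(feature_type, span, operator) for span in spans]
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    return Feature(feature_type, (start, end), operator, parts)


# CDS join(104..160,320..390,504..579)
cds = build_joined_feature("CDS", [(104, 160), (320, 390), (504, 579)])
print(cds.type, cds.location, cds.location_operator)
for sub in cds.sub_features:
    print(" ", sub.type, sub.location, sub.location_operator)
```

With this layout, code that filters on type == "CDS" sees both the parent
and its parts, and can check location_operator to tell how the parts
combine.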