repost - Re: [Bioperl-l] Huh? Bioperl Seq objects and strands

Ewan Birney birney@ebi.ac.uk
Wed, 20 Sep 2000 15:18:24 +0100 (GMT)


On Wed, 20 Sep 2000, Arlin Stoltzfus wrote:

> Ewan Birney wrote:
> > 
> > > > Also, why are introns and exons top-level features of a sequence, when
> > > > they are also (obviously) sub-features of a gene?
> > > >
> > 
> > This is an issue with GenBank/EMBL being mapped into a more interpretable
> > format.
> > 
> > GenBank/EMBL sometimes puts introns/exons separate from the CDS lines.
> > Quite often they *disagree* with the CDS lines. What are we meant to do in
> > these cases?
> 
> It may help to know that the information on the CDS line of a GenBank  
> text file is not a description of the splicing process, but a "SeqLoc" or 
> sequence location for the CDS feature.  This is why it starts and ends 
> with start and stop codons, and not with the beginning and ending of 
> the first and last exons.  Some published surveys of exon lengths are 
> actually based on interpreting the first and last intervals in the CDS 
> SeqLoc statements as exons, but they are not.  
> 
> Every feature in the feature table has a SeqLoc mapping it to the 
> sequence, and a SeqLoc of "1..4" is the same as "join(1..3,4)" or 
> "order(1..2,3..4)" or "join(1,2,3,4)" etc, because they all specify the 
> same sequence location.  A CDS that results from -1 translational
> frameshifting after the 45th nucleotide might be specified by
> "join(1..45,45..599)", so that the 45th nucleotide is included twice.
> Some entries in GenBank actually use this entirely legitimate method.
> 
> Also, it's not GenBank's decision whether the introns and exons
> appear explicitly in the feature table; the people who submit
> sequences typically annotate only the CDS, and do not annotate
> mRNA or intron features (usually they have no experimental evidence 
> to do this anyway).  Programmers can interpret the CDS SeqLoc to 
> get implicit information on splicing (I do it all the time), but this 
> has its risks.  

Indeed. Well said.
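
To make the SeqLoc point above concrete, here is a minimal Python sketch.
It is not Bioperl code, and the function names (parse_location,
covered_positions, spliced_length, implied_gaps) are made up for this
illustration. It assumes only the simplest location syntax: plain
intervals, single positions, join(...) and order(...). Complement
strands, fuzzy ends and remote references are not handled. It shows that
the different spellings of "1..4" cover the same positions, that the
frameshift location counts nucleotide 45 twice, and how gaps between CDS
intervals could be read off, with the caveat from the quoted text that
such gaps are only implied and the flanking intervals are not
necessarily whole exons.

```python
import re


def parse_location(loc):
    """Parse a simple feature location into a list of (start, end) pairs.

    Assumption: only "a..b", "a", "join(...)" and "order(...)" are handled.
    """
    loc = loc.replace(" ", "")
    match = re.fullmatch(r"(?:join|order)\((.*)\)", loc)
    body = match.group(1) if match else loc
    intervals = []
    for part in body.split(","):
        start, _, end = part.partition("..")
        # A single position such as "4" is treated as the interval 4..4.
        intervals.append((int(start), int(end or start)))
    return intervals


def covered_positions(loc):
    """Set of sequence positions the location touches (1-based, inclusive)."""
    return {p for s, e in parse_location(loc) for p in range(s, e + 1)}


def spliced_length(loc):
    """Length of the sequence obtained by concatenating the intervals."""
    return sum(e - s + 1 for s, e in parse_location(loc))


def implied_gaps(loc):
    """Gaps between consecutive intervals, i.e. *implied* introns only."""
    ivs = parse_location(loc)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ivs, ivs[1:])
        if next_start - prev_end > 1
    ]


if __name__ == "__main__":
    for loc in ("1..4", "join(1..3,4)", "order(1..2,3..4)", "join(1,2,3,4)"):
        print(loc, sorted(covered_positions(loc)))
    frameshift = "join(1..45,45..599)"
    print(len(covered_positions(frameshift)), spliced_length(frameshift))
    print(implied_gaps("join(10..50,120..200)"))
```

Comparing covered positions treats the four spellings of "1..4" as
equal, while spliced_length keeps the duplicated nucleotide in the
frameshift case, so which comparison is "right" depends on the question
being asked of the location.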


To summarise:

	EMBL/GenBank is a mess to get data out of, except for the DNA
sequence itself, because the rest of the data has been only loosely
standardised over the last 20 years, in a variety of ways, by millions of
people.


	Representing this data in objects which are more than "an
EMBL/GenBank file as an object" is challenging, but in some sense it is
what we want to do.


Or in other words:

	there is no silver bullet here.

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------