[Bioperl-l] Gene Structure / GenScan

hilmar.lapp@pharma.Novartis.com
Tue, 1 Aug 2000 14:33:51 +0100


The Ensembl Genscan parser Ewan sent yesterday seems to be a good starting
point. However, I'd prefer a gene structure representation that can
optionally be independent of an underlying sequence object, that is, a
feature which may or may not have a sequence attached. In addition, a
parser should not need to be given the source sequence; the client can
attach the resulting gene structure to the corresponding source sequence
afterwards.

I'd propose the following:
Bio::SeqFeature::GeneStructure is-a Bio::SeqFeature::Generic (or just a
Bio::SeqFeatureI ?)
and offers specific support for gene structure related things, like
   $gene->promotor();
   $gene->initial_exon();
   $gene->exon($which);
   $gene->intron($which);
   $gene->all_exons();
   $gene->terminal_exon();
   $gene->poly_adenylation();
All of the above would return a SeqFeature::Generic object. The following
can only work if a sequence (and the correct one!) is attached:
   $gene->cds();          # returns a string (exons concatenated,
                          # phase to be taken into account)
   $gene->translation();  # ditto
The problem with the latter two methods is that the phase of the exons
appears to be ambiguous if the prediction is not complete (i.e., the
initial exon is missing). As a first guess, I'd suspect that at least
GenScan would not predict a CDS (= concatenated exons) that contains stops
in the correct frame. So, ideally I'd check which of the three frames does
not yield an intervening stop, and take that one. I guess the Ensembl
people will have checked for this, and it probably wasn't that easy. Any
experiences out there?
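
To make the frame check concrete, here is a rough sketch in plain Perl. It
does not use any BioPerl classes, and the helper names concat_exons() and
open_frames() are made up for illustration only. It assumes forward-strand
exons given as 1-based inclusive [start, end] pairs, and it treats a stop
in the very last complete codon as the legitimate end of the CDS.

```perl
use strict;
use warnings;

# Hypothetical helper: build the CDS string by concatenating exon
# subsequences. Exons are [start, end] pairs, 1-based and inclusive,
# forward strand only (an assumption of this sketch).
sub concat_exons {
    my ($seq, @exons) = @_;
    return join '',
        map { substr($seq, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @exons;
}

# Hypothetical helper: return the frame offsets (0, 1, 2) in which the
# CDS contains no stop codon other than in its last complete codon.
sub open_frames {
    my ($cds) = @_;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @open;
    for my $frame (0 .. 2) {
        my $has_stop = 0;
        # Walk complete codons only.
        for (my $i = $frame; $i + 3 <= length($cds); $i += 3) {
            my $codon   = uc substr($cds, $i, 3);
            # True if no further complete codon follows this one.
            my $is_last = $i + 6 > length($cds);
            if ($is_stop{$codon} && !$is_last) {
                $has_stop = 1;
                last;
            }
        }
        push @open, $frame unless $has_stop;
    }
    return @open;
}

# Toy example: two exons separated by a 12 bp intron, initial exon missing.
my $source = "GCTGAC" . "GTAAGTTTTCAG" . "TAAGTGA";
my $cds    = concat_exons($source, [1, 6], [19, 25]);
print "CDS: $cds\n";                                      # GCTGACTAAGTGA
print "open frames: ", join(',', open_frames($cds)), "\n"; # 1
```

If exactly one frame comes back, that would be the one to use for cds()
and translation(); if more than one frame is open (likely for short
partial predictions), the check alone cannot resolve the ambiguity.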

What do you think? And please let me know what I'm duplicating here from
what other people have already written.

     Hilmar