Bioperl: translating genes with >1 exon

Ewan Birney birney@ebi.ac.uk
Sun, 9 Apr 2000 21:56:29 +0100 (BST)


On 9 Apr 2000, Keith James wrote:

> 
> Hi,
> 
> I have to write scripts to process bacterial sequence, so I didn't
> notice until recently that the way one has to translate a multi-exon
> gene seems quite laborious.

Keith - you are stumbling right into an area which I know I did not think
out cleanly enough when I originally wrote this.

Generally all of this is my fault ;)

> 
> e.g. taking a feature which is a 5 exon gene it appears that I have
> to:
> 
> Ask for its subsequences
> 
> Get the PrimarySeq object of each subsequence
> 
> Get the sequence string of each PrimarySeq object
> 
> Join the sequence strings together
> 
> Make a new PrimarySeq object from the string
> 
> Translate that, making yet another PrimarySeq object
> 
> 
> I think that this is done often enough that there should be a method
> for getting the combined exons from a feature. In fact, having the
> translate method available on a PrimarySeq object only makes sense in
> the special case of having 1 exon in your gene.
> 
> If there are no objections I would like to add a method which returns
> a PrimarySeq object consisting of the concatenated subsequences,
> analogous to the seq() method which currently gives the PrimarySeq
> object representing both introns and exons.

I think this is a fine idea. I like your basic proposal as well.
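
To make the steps listed above concrete, here is a minimal sketch in plain Perl. It deliberately uses no Bioperl objects: the sequence is a bare string, exons are assumed to be [start, end] pairs (1-based, inclusive, forward strand only), and the helper names spliced_seq and translate are hypothetical, chosen just for this illustration. The codon table is the standard genetic code.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper: join the exon subsequences into one string.
# Exons are [start, end] pairs, 1-based and inclusive (an assumption).
sub spliced_seq {
    my ($dna, @exons) = @_;
    return join '',
        map  { substr($dna, $_->[0] - 1, $_->[1] - $_->[0] + 1) }
        sort { $a->[0] <=> $b->[0] } @exons;
}

# Hypothetical helper: translate frame 1 of a DNA string.
# Unknown codons become 'X'; a trailing partial codon is ignored.
sub translate {
    my ($dna) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        my $triplet = uc substr($dna, $pos, 3);
        $protein .= exists $codon{$triplet} ? $codon{$triplet} : 'X';
    }
    return $protein;
}

# Two exons separated by a lower-case intron.
my $dna     = 'ATGGCC' . 'gtaaag' . 'AAATAA';
my $spliced = spliced_seq($dna, [1, 6], [13, 18]);
print "$spliced\n";               # ATGGCCAAATAA
print translate($spliced), "\n";  # MAK*
```

A real method would return a sequence object rather than a string and would have to handle reverse-strand features, which this sketch ignores.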


> 
> I'm not quite sure how to deal with the potential multiple levels of
> subsequences. I suppose there could be a method which returns a list
> of integers, one for each level of subsequences, the integer being the
> number of features on that 'level'.

Allowing sequence features to be nested to an arbitrary depth is
probably a big mistake in bioperl. I thought it was clever about 6 months
ago. Now - I am not so sure.

I know in ensembl we can definitely classify sequence features into three
classes:

	plain sequence features (we use Bioperl SeqFeature::Generic)

	homol sequence features (BLAST hits - one sequence feature on one
	sequence and one sequence feature on another) (we use Bioperl
	SeqFeature::Homol)

	fsets - feature sets - one sequence feature with one level of sub
	sequence features (Genscan predictions) (we use Bioperl
	SeqFeature::Generic with only one level of sub SeqFeatures).


Now - what to do here. (BTW - this is all for 0.7 - this can't go on the
branch as the branch is *only for bugs*. This is a feature, pardon my
pun).

Options:

	a) reorganise the Sequence feature code to have official Fset type
features with one level of sub sequence features. Fset could have three
methods returning sequence objects

	$fset->seq (or $fset->seqobj - see naming proposal below).
	$fset->spliced_seq
	$fset->entire_seq

	- for backwards compatibility we would keep sub_SeqFeature method
but it would always return empty for SeqFeature::Generic and
SeqFeature::Homol

	b) Keep the arbitrary nesting of sequence features generically
and come up with standards, such as the one suggested below, for descending
the tree. But what if the tree is not balanced, and how does one cope with
sub sequence features of the wrong "type" being added?


Personally - I am happier with (a).
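
For discussion, option (a) could be sketched roughly as below. Everything here is an assumption rather than existing Bioperl code: the package name My::Fset is made up, the methods return plain strings instead of sequence objects, and I am assuming seq means the region the feature spans (introns included), spliced_seq means the sub features joined together, and entire_seq means the whole parent sequence.

```perl
use strict;
use warnings;

package My::Fset;    # hypothetical name, not a Bioperl class

# entire: the parent sequence as a string.
# subs:   array ref of [start, end] pairs, 1-based inclusive, one level only.
sub new {
    my ($class, %args) = @_;
    my @subs = sort { $a->[0] <=> $b->[0] } @{ $args{subs} };
    return bless { entire => $args{entire}, subs => \@subs }, $class;
}

# The whole parent sequence the feature set sits on.
sub entire_seq {
    my ($self) = @_;
    return $self->{entire};
}

# The region spanned by the feature set, introns and exons together.
sub seq {
    my ($self) = @_;
    my $start = $self->{subs}[0][0];
    my ($end) = sort { $b <=> $a } map { $_->[1] } @{ $self->{subs} };
    return substr($self->{entire}, $start - 1, $end - $start + 1);
}

# The sub features concatenated in order of start position.
sub spliced_seq {
    my ($self) = @_;
    return join '',
        map { substr($self->{entire}, $_->[0] - 1, $_->[1] - $_->[0] + 1) }
        @{ $self->{subs} };
}

package main;

my $fset = My::Fset->new(
    entire => 'cc' . 'ATGGCC' . 'gtaaag' . 'AAATAA' . 'cc',
    subs   => [ [3, 8], [15, 20] ],
);
print $fset->seq,         "\n";   # ATGGCCgtaaagAAATAA
print $fset->spliced_seq, "\n";   # ATGGCCAAATAA
print $fset->entire_seq,  "\n";   # ccATGGCCgtaaagAAATAAcc
```

With only one level of sub features there is no tree to descend, which is the main attraction of (a).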


I'd like to hear more views on this - I am sure many people have them -
and then, I hope, Keith, you can put together a final, sensible proposal
for the whole thing.

Let's wait for a week or so to let people think about it. Feel free to
hack your own code to make things easier for the moment, but don't commit
it.

> 
> e.g.
> 
> parent sequence  -------------
> feature           -----------
> sub-features       -- ---  --   level 0
> sub-sub-features   -  - -  -    level 1
> 
> returns (3, 4)
> 
> You could then ask for the sequence of level 0 (exons) as a single,
> new PrimarySeq object. I wonder if anyone would want second (or
> subsequent layers) at all, much less want the combined sequence of
> them (maybe the individual sequences, though). However, as multiple
> layers are allowed there should be easy ways of dealing with them.
> 
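
The per-level count suggested above could look something like this sketch. It assumes each feature is a plain hash with an optional 'subs' array ref of child features; that layout and the name level_counts are hypothetical and are not Bioperl's API. Counting is done by depth, so an unbalanced tree still gives one number per level.

```perl
use strict;
use warnings;

# Hypothetical helper: return one count per level of sub features,
# level 0 first. A feature is a hash ref with an optional 'subs'
# array ref (an assumed layout, for illustration only).
sub level_counts {
    my ($feature) = @_;
    my @counts;
    my @level = @{ $feature->{subs} || [] };
    while (@level) {
        push @counts, scalar @level;
        # Step down one level: gather the children of every current node.
        @level = map { @{ $_->{subs} || [] } } @level;
    }
    return @counts;
}

# The feature from the diagram: 3 sub-features holding 1, 2 and 1
# sub-sub-features respectively.
my $feature = {
    subs => [
        { subs => [ {} ] },
        { subs => [ {}, {} ] },
        { subs => [ {} ] },
    ],
};
print join(', ', level_counts($feature)), "\n";   # 3, 4
```
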
> Incidentally, I found having two methods with the same name, but doing
> different things a bit confusing:
> 
> my $nt = $feature->seq()->seq();
> 

Hmmm. I have seen this as well. It makes me wince too.

What about moving to $feature->seqobj() in the future?


> where the first seq() returns a PrimarySeq object and the second one
> a sequence string.
> 
> Comments?
> 
> cheers,
> 
> Keith
> 
> -- 
> 
> Keith James  --  kdj@sanger.ac.uk  --  http://www.sanger.ac.uk/Users/kdj
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------
