[Biopython-dev] [Bug 2381] translate and transcribe methods for the Seq object (in Bio.Seq)

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Fri Nov 7 09:37:23 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2381





------- Comment #51 from lpritc at scri.sari.ac.uk  2008-11-07 04:37 EST -------
Just to perpetuate the discussion, what I suggest is (in pseudocode, and with
argument names up for, well, argument):

class Seq:
    [...]
    def startswith_startcodon(self):
        """ Returns True if the first three bases of the sequence
            are a valid start codon in the sequence's codon table,
            returns False otherwise
        """

    def endswith_stopcodon(self):
        """ Returns True if the length of the sequence is a multiple
            of three, and the last three bases are a valid stop codon
            in the sequence's codon table, returns False otherwise
        """

    def is_cds(self):
        """ Returns True if the sequence meets the criteria for a CDS,
            False otherwise.  The criteria are:
            i)   The very first three bases of the sequence are a valid
                 start codon
            ii)  The sequence length is a multiple of three
            iii) The final three bases of the sequence are a valid stop codon
            iv)  There are no in-frame stop codons, other than the final
                 stop codon
        """
        if not self.startswith_startcodon(): return False
        if not self.endswith_stopcodon(): return False
        # Test for in-frame stop codons: return True if none is found,
        # return False otherwise

    def translate(self, [...], assert_cds=False, assert_cdsfirstcodon=False):
        """ Returns a new Seq object with the protein translation.
            If assert_cds is True, but the sequence is not a CDS as
            determined by self.is_cds(), then an error is raised.
            Otherwise, the sequence is translated with the first codon
            read as a methionine, rather than the amino acid which it
            would encode at any other position.
            If assert_cdsfirstcodon is True, but the sequence doesn't
            start with a valid start codon, then an error is raised.
            Otherwise, the sequence is translated with the first codon
            read as a methionine, as above.
        """
        # Translate away as normal, here
        [...]
        if assert_cds:
            if not self.is_cds():
                raise ValueError("WTF? This is no CDS, my good fellow human!")
            else:
                # Make the first amino acid of the translated sequence a Met
        if assert_cdsfirstcodon:
            if not self.startswith_startcodon():
                raise ValueError("Hey!  Stop playing around, this sequence "
                                 "doesn't start with a start codon")
            else:
                # Make the first amino acid of the translated sequence a Met
        # Then continue as normal
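
To make the three predicates concrete, here is a minimal runnable sketch that
works on plain strings rather than Seq objects.  The function names follow the
pseudocode, but the module-level START_CODONS and STOP_CODONS sets are
assumptions for illustration only (they correspond to NCBI table 11, the
bacterial table); a real implementation would look the codons up in the
sequence's own codon table.

```python
# Illustrative codon sets (NCBI table 11, bacterial).  Hypothetical constants:
# a real implementation would take these from the sequence's codon table.
START_CODONS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def startswith_startcodon(seq, starts=START_CODONS):
    """Return True if the first three bases form a valid start codon."""
    return seq[:3].upper() in starts


def endswith_stopcodon(seq, stops=STOP_CODONS):
    """Return True if the length is a multiple of three and the
    last three bases form a valid stop codon."""
    return len(seq) >= 3 and len(seq) % 3 == 0 and seq[-3:].upper() in stops


def is_cds(seq, starts=START_CODONS, stops=STOP_CODONS):
    """Return True if seq meets criteria (i) to (iv) for a CDS."""
    if not startswith_startcodon(seq, starts):
        return False
    if not endswith_stopcodon(seq, stops):
        return False
    # Check every in-frame codon except the final one for a premature stop.
    upper = seq.upper()
    internal = (upper[i:i + 3] for i in range(0, len(upper) - 3, 3))
    return not any(codon in stops for codon in internal)
```

With these definitions, is_cds("GTGAAATAA") is True, while
is_cds("ATGTAAAAATAA") is False because of the internal TAA.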

This approach provides the following behaviour (assuming things about argument
names that can be thrashed out later)

# I want to translate some nt sequence, and don't care about stops, starts,
# or any other stuff
aaseq = ntseq.translate()
# I want to translate my nt sequence to the first in-frame stop codon, and
# no further
aaseq = ntseq.translate(to_stop=True)
# I want to know if my nt sequence is a (putative) CDS
ntseq.is_cds()
# I want to know if my nt sequence starts with a start codon
ntseq.startswith_startcodon()
# I want to know if my nt sequence ends with an in-frame stop codon
# Note that this is a different question to asking whether there is *any*
# in-frame stop codon
ntseq.endswith_stopcodon()
# I want to translate my nt sequence, which I know is a CDS, 
# but not convert the first codon to a methionine
aaseq = ntseq.translate()
# I want to translate my nt sequence, which I know is a CDS, 
# and convert the first codon to a methionine
aaseq = ntseq.translate(assert_cds=True)
# OK, my sequence isn't a *real* CDS, but it still starts with a valid start
# codon (I checked already with ntseq.startswith_startcodon()), and I'd like
# to convert the first codon as if it was really a CDS.  You don't need to
# know why, I just do.  I'm wacky that way.
aaseq = ntseq.translate(assert_cdsfirstcodon=True)
# I'd like a list of all my sequences that are valid CDS
seqlist = [s for s in myntseqs if s.is_cds()]
# I'd like translations of all my sequences that are valid CDS
tlist1 = [s.translate() for s in seqlist]
tlist2 = [s.translate() for s in myntseqs if s.is_cds()]


In terms of nomenclature:

The default behaviour of translate() as Peter proposed (read through in-frame
and translate with the appropriate codon table) is fine in nearly all
circumstances.  Most other circumstances are covered by stopping at the first
in-frame stop codon, which Peter has implemented, and which is an option we
all seem to agree on.

Biologically speaking, this behaviour is not always correct for CDS in
prokaryotes, where alternative start codons occur in a significant minority of
cases.  Their first codon will be mistranslated if no provision is made for
it.  I think a useful biological sequence object should at least try to mimic
actual biology, so we should provide an option to handle this.
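
As a small illustration of the mistranslation, the sketch below translates a
made-up bacterial-style gene that starts with GTG.  The naive_translate
function and the example sequence are hypothetical; the codon table is built
from the standard genetic code in NCBI's TCAG ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def naive_translate(seq):
    """Translate codon by codon, treating the first codon like any other."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


gene = "GTGAAACGCTAA"  # hypothetical CDS with a GTG start codon
print(naive_translate(gene))  # VKR*
```

In the cell, a GTG start codon is read as methionine, so the biologically
correct product starts with M rather than V.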

We should not assume that a sequence is a CDS unless it is specified by the
user.  It seems reasonable to me that the term 'cds' should occur in any such
argument from the user.

We have at least two options for how to proceed with a CDS: i) we can provide a
strict CDS-type translation, which requires confirmation that the sequence is,
in fact, a CDS; ii) we can provide a weak CDS-type translation, which only
modifies the way the start codon is translated.  In both cases, behaviour is
specific to CDS, and so having 'cds' in the argument name *somewhere* seems
obvious, and entirely reasonable.
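
The difference between the strict and the weak option can be sketched as
follows.  This is only one possible reading of the proposal, on plain strings:
the argument names are the ones suggested above, while START_CODONS,
STOP_CODONS and the helper _codons are assumptions made for the example
(start codons as in NCBI table 11).  The terminal stop is kept as '*' because
the proposal does not say what should happen to it.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Hypothetical constants for illustration (NCBI table 11 start codons).
START_CODONS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def _codons(seq):
    """Split seq into complete in-frame codons, dropping any partial tail."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def is_cds(seq):
    """Strict check: start codon, length % 3 == 0, final stop, no inner stop."""
    codons = _codons(seq)
    return (len(seq) % 3 == 0 and len(codons) >= 2
            and codons[0] in START_CODONS
            and codons[-1] in STOP_CODONS
            and not any(c in STOP_CODONS for c in codons[1:-1]))


def translate(seq, assert_cds=False, assert_cdsfirstcodon=False):
    codons = _codons(seq)
    if assert_cds and not is_cds(seq):
        raise ValueError("sequence is not a valid CDS")
    if assert_cdsfirstcodon and not (codons and codons[0] in START_CODONS):
        raise ValueError("sequence does not start with a start codon")
    protein = "".join(CODON_TABLE[c] for c in codons)
    if assert_cds or assert_cdsfirstcodon:
        protein = "M" + protein[1:]  # read the first codon as methionine
    return protein
```

Here translate("GTGAAATAA") gives "VK*" and
translate("GTGAAATAA", assert_cds=True) gives "MK*".  A sequence such as
"GTGAAATAAAAA" fails the strict check (it does not end with a stop codon) but
is accepted with assert_cdsfirstcodon=True.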

I think that 'assert_cds' makes clear that we are asserting that the sequence
is a valid CDS - no internal stops and everything else that comes with that
status.

I think that 'assert_cdsfirstcodon' avoids any ambiguity over the word 'start',
and also conveys that we are asserting that the first (rather than start) codon
has some relationship to a CDS; in this case the relationship is that the first
codon of the sequence meets the criteria for a CDS.  But that's kind of a long
argument name ;)

