[Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq)

Mon Oct 20 08:24:25 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2381

------- Comment #21 from lpritc at scri.sari.ac.uk  2008-10-20 04:24 EST -------
(In reply to comment #19)
> (In reply to comment #18)

> The "complete" is a cryptic naming, I wouldn't be fond of it. I think everybody
> would rather him/herself rather check is a.startswith('M') and a.endswith('*')
> instead. But, what would be useful is a.find_orf(offset=0).

Ditto the 'complete' naming - it's not clear at all.

> > Related to this, it would be useful to have a boolean option to stop
> > translation at the first in frame stop codon (possible argument names for this
> > include "stop" if not used as above, "to_stop", "auto_stop", "terminate" etc).
> 
> Yes, find_orf(offset) with default offset=0.

I would like to raise the issue that 'ORF' has taken on (at least) two meanings
over the years, and it's not yet clear which is being discussed here.  

The correct definition of 'Open Reading Frame' is an uninterrupted sequence of
nucleotides that do not contain an in-frame stop codon.  However, more
restrictive definitions have found a way in erroneously over the years,
asserting that the sequence must have an in-frame start codon, or additionally
that the ORF begins at that start codon.  This latter case in particular would
be a putative coding sequence (CDS), rather than an ORF.  See a Google define:
orf search for details... (http://www.google.com/search?q=define:+orf).  

As an implementation examply, Sanger's Artemis
(http://www.sanger.ac.uk/Software/Artemis/) correctly identifies ORFs.  See
also Doolittle's 'Of URFS and ORFS', available on Google Books:
http://books.google.com/books?id=jIlMMx6Ji-sC - it's 22 years old now, and a
good candidate for the first manual on bioinformatics.  The Wikipedia page for
ORF is typically egregious, and also incorrect.

Also, by 'offset' in the proposed syntax above, is 'reading_frame' intended? 
If so I think it would be clearer to indicate that the reading frame is what is
desired, as specifying a reading frame of -1 implies something different to an
offset of -1.  I propose that the default behaviour is to find all ORFs in all
reading frames, leaving it to the user to decide whether that behaviour is
appropriate for their sequence and optionally specify a reading frame.

For discussion purposes, I'm attaching code for an ORF search I implemented
locally in a subclass of the Seq object.  As ever, I don't claim that it's
perfect, but it did what I needed at the time.  In particular the returned
index for ORFs is 1-based, as that is what I wanted then.  

    def find_ORFs(self, codon_table=1, min_length=100):
        """ find_ORFs(self, codon_table=1, min_length=100)

            codon_table        Integer, must be one of the integers in
                               Bio.Data.CodonTable.generic_by_id; these are
                               the standard codon table numbers used by
                               sequence databases.

            min_length         Integer, the shortest length of consecutive
                               nucleotides to consider as an ORF

            Finds ORFs within the SeqRecord sequence, and returns them as
            a list of tuples in the format:
            (frame, start, end, sequence)
            where start and end are the start and end points on the sequence
            (i.e. the first and last base positions, NOT the values you
            should use when indexing sequences in Python), and sequence is a
            Seq object.
        """
        assert self.alphabet.__class__ in dna_alphabets, \
               "Alphabet is not a known DNA alphabet"
        # Get the codon table; raises a KeyError if an invalid table number
        codon_table = CodonTable.generic_by_id[codon_table]
        # Loop over the record's sequence in all six forward and reverse
        # frames, returning a list of (frame, start, end, sequence) tuples
        # List of tuples
        orflist = []
        # Forward frames first
        forward_orfs = self.__find_orfs_in_sequence(self.data, codon_table)
        for frame, start, end, sequence in forward_orfs:
            if len(sequence) >= min_length:
                orflist.append(('+%d' % frame, start, end, Seq(sequence,
                                                               self.alphabet)))
        # Then reverse frames
        seq = reverse_complement(self.data)
        reverse_orfs = self.__find_orfs_in_sequence(seq, codon_table)
        for frame, start, end, sequence in reverse_orfs:
            if len(sequence) >= min_length:
                start = len(self.data) - start + 1
                end = len(self.data) - end + 1
                start, end = end, start
                orflist.append(('-%d' % frame, start, end, Seq(sequence,
                                                               self.alphabet)))
        return orflist

    def __find_orfs_in_sequence(self, sequence, codon_table):
        """ Returns a list of ORFs for a passed sequence, in three forward
            frames, as tuples (frame, start, end, sequence)
        """
        orflist = []
        for frame, offset in [(1, 0), (2, 1), (3, 2)]:
            tmporf = []
            orfstart = offset
            i = offset
            while i < len(sequence):
                codon = sequence[i:i+3]
                if len(codon) == 3 and codon not in codon_table.stop_codons:
                    tmporf.append(codon)
                else:
                    if codon in codon_table.stop_codons:
                        tmporf.append(codon)
                    tmporf = ''.join(tmporf)
                    orflist.append((frame, orfstart+1,
                                    orfstart+len(tmporf), tmporf))
                    orfstart += len(tmporf)
                    tmporf = []
                i += 3
            # Catch ORFs that run up to the end of the sequence, by checking
            # for an empty tmporf list
            if tmporf != []:
                tmporf = ''.join(tmporf)
                orflist.append((frame, orfstart+1,
                                orfstart+len(tmporf), tmporf))
        return orflist

In order to obtain a potential coding sequence that begins with a methionine, I
would translate, and then use this method in a subclass of Seq for the
translated sequence:

    def trim_to_first_met(self):
        """ Assuming that the sequence is a protein sequence, trims to
            the first methionine in the sequence and returns a Seq object

            If the sequence has no methionine, then the full sequence is
            returned
        """
        # Crop the sequence to the first Methionine. If there is no methionine
        # the full sequence is returned
        # We assert that we have a protein sequence
        assert self.alphabet.__class__ in protein_alphabets, \
               "Sequence alphabet is not a known ProteinAlphabet"
        if self.data.count('M'):
            seq = self.data[self.data.index('M'):]
        else:
            seq = self.data
        return Seq(seq, self.alphabet)

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.