[Biopython-dev] [Bug 2381] translate and transcribe methods for the Seq object (in Bio.Seq)

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Nov 6 15:28:20 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2381





------- Comment #46 from lpritc at scri.sari.ac.uk  2008-11-06 10:28 EST -------
(In reply to comment #43)

> > (2) the "complete_cds" argument (perhaps under another name, maybe "cds"?)
> > illustrated in this patch.  This would check the start codon is valid AND
> > translate it as a methionine AND check there are a whole number of codons AND
> > check it ends with a stop codon AND check there are no extra in-frame stop
> > codons.

> I support (1) but strongly disagree with (2) because 'cds' refers to a complete
> DNA sequence not just if the sequence starts with M.
> http://www.yeastgenome.org/help/glossary.html
> "CDS:    CoDing Sequence, region of nucleotides that corresponds to the
> sequence of amino acids in the predicted protein. The CDS includes start and
> stop codons, therefore coding sequences begin with an "ATG" and end with a stop
> codon. In SGD, unexpressed sequences, including the 5'-UTR, the 3'-UTR,
> introns, or bases not expressed due to frameshifting, are not included within a
> CDS. Note that the CDS does not correspond to the actual mRNA sequence."

That definition seems to correspond exactly to (2), above; not that web-based
definitions have any particular authority ;)

"Begin with an ATG" is a eukaryote-specific statement; "Begin with a (valid)
start codon" covers this.

"End with a stop codon", which implies the *first in-frame* stop codon, is the
same in both cases.

Where do you see that they differ?

> I do not support the name 'cds_start' because of the DNA interpretation and
> that many Genbank records include the upstream and downstream non-coding
> regions. In such cases, I would have to find the actual start codon, then I
> might as well do the translation after that start codon than rely on a check
> that might be wrong.

I don't think the argument is proposed for that particular use case, which is
why I don't think the objection is valid there.  If, say, you knew that the
5'UTR ran to base 17, then you could check with
seq[17:].translate(complete_cds=True) or some such arrangement, but that's not
the problem being solved with that method argument, I think.
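
To make the indexing concrete, here is a minimal sketch in plain Python. It
assumes the 5'UTR occupies bases 1 to 17 when counting from 1, so the coding
region starts at 0-based index 17.  The sequence and the name utr_end are made
up for illustration; a plain string stands in for the Seq object, which slices
the same way.

```python
# Hypothetical record: 17 bases of 5'UTR followed by a short coding region.
utr = "GC" * 8 + "G"            # 17 bases of made-up untranslated sequence
coding = "ATGGCCTAA"            # start codon, one sense codon, stop codon
record = utr + coding

utr_end = 17                    # last base of the 5'UTR, counting from 1
candidate = record[utr_end:]    # 0-based slice, so this begins at base 18

print(candidate)
```

Only the slice is shown; whatever check is applied afterwards sees just the
candidate coding region.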

> Perhaps some variant of:
> a) Similar cases in Python:
> has_met or has_met1
> get_met or get_met1
> b) More direct meaning:
> starts_with_methionine, starts_with_met, starts_with_m

I quite like this way of checking sequence properties, and would prefer an
is_cds() (or, to be pedantic, is_conceptual_cds()) method that returns a
Boolean, but otherwise implements the sort of behaviour described above.

If you only wanted the conceptual translations of sequences that fit the
criteria for a CDS, then a one-liner to replace

[seq.translate(cds=True) for seq in seqlist]

might be

[seq.translate() for seq in seqlist if seq.is_cds()]

I prefer the second option, for readability, but YMMV.
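
As a rough sketch of what such a check might look like, here is a standalone
function in plain Python.  It is not an existing Seq method: the name is_cds,
its signature, and the hard-coded codon sets (start and stop codons of the
standard genetic code, NCBI table 1, DNA alphabet only) are assumptions made
for illustration.

```python
# Assumed codon sets for the standard genetic code (NCBI table 1), DNA only.
START_CODONS = {"ATG", "CTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_cds(seq, start_codons=START_CODONS, stop_codons=STOP_CODONS):
    """Hypothetical check: True if seq passes the CDS criteria in this thread."""
    seq = str(seq).upper()
    # Whole number of codons, with room for at least a start and a stop.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Must begin with a valid start codon and end with a stop codon.
    if codons[0] not in start_codons or codons[-1] not in stop_codons:
        return False
    # No extra in-frame stop codons before the final one.
    return not any(codon in stop_codons for codon in codons[1:-1])


seqlist = ["ATGGCCTAA", "ATGTAAGCCTAA", "GCCGCCTAA", "ATGGCCTA"]
passing = [seq for seq in seqlist if is_cds(seq)]
print(passing)  # ['ATGGCCTAA']
```

Note that the filtering form silently skips sequences that fail the check,
whereas a checking argument on translate() would have to signal failure some
other way, presumably with an exception.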

