[BioPython] Determine alphabet (DNA or Protein) of a sequence

Peter biopython at maubp.freeserve.co.uk
Mon Jan 12 09:34:13 EST 2009


On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Björn:
> I am agreed with Peter; guessing should be the last resort. The
> guessing is not that smart, and will fall apart for very
> pathological cases like short amino acids with lots of Gly, Ala, Cys
> or Thrs. That being said, here is some code that does this. Hope
> this helps,
>
> Brad
>
> from Bio import Seq
>
> def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
>    """Guess if the given sequence is DNA.
>
>    It's considered DNA if more than 90% of the sequence is GATCs. The threshold
>    is configurable via the thresh parameter. dna_letters can be used to configure
>    which letters are considered DNA; for instance, adding N might be useful if
>    you are expecting data with ambiguous bases.
>    """
>    if isinstance(seq, Seq.Seq):
>        seq = seq.data
>    elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
>        seq = str(seq)
>    else:
>        raise ValueError("Do not know provided type: %s" % seq)
>    seq = seq.upper()

This code is trying to get the sequence as an upper case string, given
that the Seq object does not support the upper method (yet - I've just
filed enhancement Bug 2731 on this, something I'd been thinking about
for a while).

Anyway, this would be shorter and would cope with strings or Seq
objects, or even MutableSeq objects.

seq = str(seq.upper())

Also using the Seq object's data property is discouraged (see Bug 2509).

>    dna_alpha_count = 0
>    for letter in dna_letters:
>        dna_alpha_count += seq.count(letter)
>    if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
>        return True
>    else:
>        return False

You could just do:

return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh)

Peter



More information about the BioPython mailing list