[BioPython] Determine alphabet (DNA or Protein) of a sequence
Peter
biopython at maubp.freeserve.co.uk
Mon Jan 12 09:34:13 EST 2009
On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Björn:
> I am agreed with Peter; guessing should be the last resort. The
> guessing is not that smart, and will fall apart for very
> pathological cases like short amino acids with lots of Gly, Ala, Cys
> or Thrs. That being said, here is some code that does this. Hope
> this helps,
>
> Brad
>
> from Bio import Seq
>
> def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
> """Guess if the given sequence is DNA.
>
> It's considered DNA if more than 90% of the sequence is GATCs. The threshold
> is configurable via the thresh parameter. dna_letters can be used to configure
> which letters are considered DNA; for instance, adding N might be useful if
> you are expecting data with ambiguous bases.
> """
> if isinstance(seq, Seq.Seq):
> seq = seq.data
> elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
> seq = str(seq)
> else:
> raise ValueError("Do not know provided type: %s" % seq)
> seq = seq.upper()
This code is trying to get the sequence as an upper case string, given
that the Seq object does not support the upper method (yet - I've just
filed enhancement Bug 2731 on this, something I'd been thinking about
for a while).
Anyway, this would be shorter and would cope with strings or Seq
objects, or even MutableSeq objects.
seq = str(seq.upper())
Also using the Seq object's data property is discouraged (see Bug 2509).
> dna_alpha_count = 0
> for letter in dna_letters:
> dna_alpha_count += seq.count(letter)
> if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
> return True
> else:
> return False
You could just do:
return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh)
Peter
More information about the BioPython
mailing list