[BioPython] Determine alphabet (DNA or Protein) of a sequence

Mon Jan 12 13:47:37 UTC 2009

Hi Björn:
I agree with Peter; guessing should be the last resort. The
guessing is not that smart, and will fall apart for
pathological cases like short amino acid sequences with lots of Gly,
Ala, Cys or Thr. That being said, here is some code that does this.
Hope this helps,

Brad

from Bio.Seq import Seq

def guess_if_dna(seq, thresh=0.90, dna_letters=("G", "A", "T", "C")):
    """Guess if the given sequence is DNA.

    It's considered DNA if at least 90% of the sequence is GATCs. The threshold
    is configurable via the thresh parameter. dna_letters can be used to configure
    which letters are considered DNA; for instance, adding N might be useful if
    you are expecting data with ambiguous bases.
    """
    # Accept Seq objects and plain strings; str() works for both.
    if not isinstance(seq, (Seq, str)):
        raise ValueError("Do not know provided type: %s" % type(seq))
    seq = str(seq).upper()
    # An empty sequence is reported as DNA.
    if len(seq) == 0:
        return True
    # Count how many characters belong to the DNA letter set.
    dna_alpha_count = sum(seq.count(letter) for letter in dna_letters)
    return float(dna_alpha_count) / len(seq) >= thresh
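
To see where this kind of guessing breaks down, here is a rough sketch
using a plain-string helper. The name dna_fraction and the example
sequences are made up for illustration and are not part of Biopython.
It shows a short Gly/Ala/Cys/Thr peptide that looks exactly like DNA,
and an N-rich nucleotide sequence that only passes a 0.90 threshold
once N is added to the accepted letters.

```python
def dna_fraction(seq, dna_letters="GATC"):
    """Return the fraction of characters in seq that are in dna_letters.

    Hypothetical helper for illustration; works on plain strings only.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(seq.count(letter) for letter in dna_letters) / len(seq)


# A short peptide made only of Gly, Ala, Thr and Cys (one-letter codes
# G, A, T, C) cannot be told apart from DNA by letter content alone.
peptide = "GATTACAGCGT"
peptide_fraction = dna_fraction(peptide)

# DNA with many ambiguous bases (N) scores low with the default letters...
ambiguous_dna = "ACGTNNNNACGTNNNNACGT"
strict_fraction = dna_fraction(ambiguous_dna)

# ...but scores high once N is counted as a DNA letter.
relaxed_fraction = dna_fraction(ambiguous_dna, dna_letters="GATCN")

print(peptide_fraction, strict_fraction, relaxed_fraction)
```

The trade-off is that every letter you add to the accepted set (N, R,
Y and so on) also makes real protein sequences more likely to pass.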

On Mon, Jan 12, 2009 at 11:30:26AM +0000, Peter wrote:
> On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson
> <bjorn_johansson at bio.uminho.pt> wrote:
> > Hi, I am fairly new to biopython, so I don't know if this question has
> > been answered in the archives (tried to look but found nothing).
> >
> > Is there a (bio)python module or code snippet that I can use to
> > determine if a sequence is likely to be nucleic acid or protein?
> >
> > I believe the program ReadSeq does this for example, when formatting a
> > fasta sequence to genbank.
> >
> > grateful for answers!
> >
> > /bjorn
> 
> It seems like lots of different tools (e.g. FASTA) have come up with
> their own way to try and guess this, usually by looking at the letter
> content.  This is impossible to get right 100% of the time (especially
> if the nucleotide includes ambiguous characters - which can make it
> look more protein like).  I don't think we have a standard bit of code
> in Biopython to do this (but I've never searched).
> 
> In python there is a general preference for making things explicit
> rather than trying to guess and do the right thing.  If you don't know
> which you have (e.g. user input?) then you are in an awkward position.
>  What are you going to do with the sequence?  If you are going to pass
> it to a command line tool, maybe you can let it guess?
> 
> Peter