[BioPython] Determine alphabet (DNA or Protein) of a sequence

Mon Jan 12 06:30:26 EST 2009

On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson
<bjorn_johansson at bio.uminho.pt> wrote:
> Hi, I am fairly new to biopython, so I don't know if this question has
> been answered in the archives (tried to look but found nothing).
>
> Is there a (bio)python module or code snippet that I can use to
> determine if a sequence is likely to be nucleic acid or protein?
>
> I believe the program ReadSeq does this for example, when formatting a
> fasta sequence to genbank.
>
> grateful for answers!
>
> /bjorn

It seems like lots of different tools (e.g. FASTA) have come up with
their own way to try and guess this, usually by looking at the letter
content.  This is impossible to get right 100% of the time (especially
if the nucleotide sequence includes ambiguity characters - which can
make it look more protein-like).  I don't think we have a standard bit
of code in Biopython to do this (but I've never searched).
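
As a rough illustration of the letter-content approach, here is a
minimal sketch in plain Python.  It is not Biopython code and not what
FASTA or ReadSeq actually do; the function name, the "ACGTUN" letter
set and the 0.9 threshold are all assumptions picked for the example.

```python
def guess_sequence_type(sequence, threshold=0.9):
    """Guess "nucleotide" or "protein" from letter content alone.

    Hypothetical helper: the letter set and threshold are arbitrary
    assumptions, not values taken from any existing tool.
    """
    # Ignore gaps, digits, whitespace and other non-letter characters.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Count letters that are plain bases (plus N for "unknown base").
    nucleotide_like = sum(1 for c in letters if c in "ACGTUN")
    fraction = nucleotide_like / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


print(guess_sequence_type("ATGCGTACGTTAGC"))
print(guess_sequence_type("MKVLAAGIVGLLLAQPSW"))
```

A nucleotide sequence written mostly with ambiguity codes (R, Y, K, M
and so on) falls below the threshold and is reported as protein, which
is exactly the failure mode described above.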

In Python there is a general preference for making things explicit
rather than trying to guess and do the right thing.  If you don't know
which you have (e.g. user input?) then you are in an awkward position.
 What are you going to do with the sequence?  If you are going to pass
it to a command line tool, maybe you can let it guess?
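
If you can ask the user which type they have, the explicit route might
look like the sketch below: the caller states the type and the code
only checks the letters against it.  The names ALLOWED_LETTERS and
check_sequence are hypothetical, and the letter sets are deliberately
strict (no ambiguity codes) to keep the example short.

```python
# Hypothetical strict letter sets; real data may need ambiguity codes.
ALLOWED_LETTERS = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def check_sequence(sequence, kind):
    """Return the upper-cased sequence if it fits the declared kind."""
    if kind not in ALLOWED_LETTERS:
        raise ValueError("unknown sequence kind: %r" % kind)
    upper = sequence.upper()
    # Any letter outside the declared set means the declaration is wrong
    # (or the sequence uses symbols this sketch does not allow).
    unexpected = set(upper) - ALLOWED_LETTERS[kind]
    if unexpected:
        raise ValueError(
            "letters not valid for %s: %s" % (kind, "".join(sorted(unexpected)))
        )
    return upper


print(check_sequence("acgtacgt", "dna"))
```

A wrong declaration then fails loudly instead of being silently
guessed around.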

Peter