[BioPython] Determine alphabet (DNA or Protein) of a sequence

Mon Jan 12 21:57:39 UTC 2009

On Mon, Jan 12, 2009 at 6:34 PM, Björn Johansson wrote:
>
> Hi,
> and thanks for the quick replies and the submitted code! It's very nice
> to have the help of such a devoted community!
>
> I am writing a plug-in to deal with reformatting pasted code (DNA or
> protein) snippets into the editor (incidentally WikidPad, which is
> written in Python and uses Scintilla, open-source
> http://wikidpad.sourceforge.net/) and I would like to be able to
> format (DNA or protein) code in the selection from raw format to FASTA
> and GenBank.
>
> The identity of the code (DNA or protein) is only needed to feed into
> the SeqIO.write method, which demands to know if the sequence is DNA or
> protein in order to write GenBank format.

Yes - this is because the GenBank format distinguishes between
nucleotides and proteins, so if you try to output a SeqRecord using a
generic alphabet, we have a problem.  We could guess, but from a
Python style point of view I think most would agree it is preferable
to make you (the programmer) make the choice explicitly.

As an aside, you might prefer to use the SeqRecord's format method to
get the record as a FASTA or GenBank string - but this calls
Bio.SeqIO.write() internally anyway, so the alphabet problem remains.

> I know I could add a dialog, but I want a function to quickly reformat
> sequences, although I agree that guessing is bad from a theoretical
> viewpoint.

You could have a selection box offering:
(*) Guess (default)
(*) Nucleotide
(*) Amino acids

That way for any borderline cases, the user can easily
change this if they need to.
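
As a rough sketch of how the "Guess" default and the explicit override
could fit together, here is a plain-Python helper. It is an assumption
for illustration only: the name guess_seq_type, the letter set and the
0.9 threshold are all hypothetical and not part of Biopython.

```python
# Hypothetical helper, not a Biopython function.
# Letters treated as "nucleotide-like" (an assumption; extend with
# IUPAC ambiguity codes if your data needs them).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_seq_type(seq, choice="guess", threshold=0.9):
    """Return "nucleotide" or "protein" for a raw sequence string.

    choice mirrors the selection box: "guess", "nucleotide" or "protein".
    threshold is the minimum fraction of nucleotide-like letters needed
    to call the sequence a nucleotide sequence (an arbitrary cut-off).
    """
    if choice in ("nucleotide", "protein"):
        return choice  # the user made an explicit choice, so trust it
    if choice != "guess":
        raise ValueError("unknown choice: %r" % (choice,))

    # Ignore whitespace, digits and punctuation from pasted text.
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("no sequence letters found")

    hits = sum(1 for c in letters if c in NUCLEOTIDE_LETTERS)
    if hits / len(letters) >= threshold:
        return "nucleotide"
    return "protein"
```

Short peptides made mostly of A, C, G and T residues will be misread as
nucleotides by any heuristic like this, which is why the explicit
options are worth keeping next to the default.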

Once you know you have nucleotides, deciding if it is DNA or RNA is
pretty easy :)
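
For example, a minimal sketch of that DNA versus RNA check in plain
Python. The function name dna_or_rna is hypothetical, and defaulting to
DNA when neither T nor U is present is an assumption you may want to
change.

```python
# Hypothetical helper, not a Biopython function.
def dna_or_rna(seq):
    """Classify a nucleotide string as "DNA" or "RNA".

    U without T means RNA. Both together is treated as an error.
    A sequence with neither T nor U is reported as DNA (an assumption).
    """
    letters = set(seq.upper())
    has_t = "T" in letters
    has_u = "U" in letters
    if has_t and has_u:
        raise ValueError("sequence contains both T and U")
    if has_u:
        return "RNA"
    return "DNA"
```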

> I'll try the code that you submitted as soon as I can and I'll get back to you!
> thanks,
> /bjorn

Peter