[BioPython] Determine alphabet (DNA or Protein) of a sequence

Björn Johansson bjorn_johansson at bio.uminho.pt
Mon Jan 12 18:34:36 UTC 2009


Hi,
and thanks for the quick replies and the submitted code! Its very nice
to have the help of such a devoted community!

I am writing a plug-in to deal with reformatting pasted code (DNA or
protein) snippets into the editor (incidently WikidPad which is
written in python and uses scintilla, open-source
http://wikidpad.sourceforge.net/) and I would like to be able to
format (DNA or protein) code in the selection from raw format to fasta
and genbank.

The identity of the code (DNA or protein) is only needed to feed into
the SeqIO.write method, it demands to know if the sequence is DNA or
protein to write genbank format.

I know I could add a dialog, but I want a function to quickly reformat
sequences, although I agree that guessing is bad from a theoretical
viewpoint.

Ill try the code that you submitted as soon as I can and Ill get back to you!
thanks,
/bjorn



On Mon, Jan 12, 2009 at 14:34, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> Hi Björn:
>> I am agreed with Peter; guessing should be the last resort. The
>> guessing is not that smart, and will fall apart for very
>> pathological cases like short amino acids with lots of Gly, Ala, Cys
>> or Thrs. That being said, here is some code that does this. Hope
>> this helps,
>>
>> Brad
>>
>> from Bio import Seq
>>
>> def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
>>    """Guess if the given sequence is DNA.
>>
>>    It's considered DNA if more than 90% of the sequence is GATCs. The threshold
>>    is configurable via the thresh parameter. dna_letters can be used to configure
>>    which letters are considered DNA; for instance, adding N might be useful if
>>    you are expecting data with ambiguous bases.
>>    """
>>    if isinstance(seq, Seq.Seq):
>>        seq = seq.data
>>    elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
>>        seq = str(seq)
>>    else:
>>        raise ValueError("Do not know provided type: %s" % seq)
>>    seq = seq.upper()
>
> This code is trying to get the sequence as an upper case string, given
> that the Seq object does not support the upper method (yet - I've just
> filed enhancement Bug 2731 on this, something I'd been thinking about
> for a while).
>
> Anyway, this would be shorter and would cope with strings or Seq
> objects, or even MutableSeq objects.
>
> seq = str(seq.upper())
>
> Also using the Seq object's data property is discouraged (see Bug 2509).
>
>>    dna_alpha_count = 0
>>    for letter in dna_letters:
>>        dna_alpha_count += seq.count(letter)
>>    if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
>>        return True
>>    else:
>>        return False
>
> You could just do:
>
> return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh)
>
> Peter
>



-- 
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariate) +351-253 60 4310
Dept of Biology (fax) +351-253 678980




More information about the Biopython mailing list