[Biopython-dev] [Biopython - Bug #2597] Enforce alphabet letters in Seq objects

Thu Feb 23 17:17:59 UTC 2012

Issue #2597 has been updated by Eric Talevich.

It would also be useful to be able to validate alphabets when constructing Seqs or SeqRecords from scratch.

Here's a proposal that I believe fits with most of what's been agreed to so far.

In Bio/Alphabet/__init__.py, replace _verify_alphabet with an efficiently implemented method on the Alphabet class and perhaps make it public:

<pre>
def validate(self, sequence):
    """Raise a ValueError if sequence contains letters not allowed by alphabet.

    If alphabet does not define letters, it's all OK.
    ...
    """
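    # Generic alphabets have letters = None, and set(None) would raise
    # a TypeError, so test before building the set.
    if self.letters:
        bad_letters = set(str(sequence)) - set(self.letters)
        if bad_letters:
            # Sort so the error message is deterministic.
            raise ValueError("Alphabet does not accept these letters: "
                             + ''.join(sorted(bad_letters)))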
</pre>

In the Seq class, optionally add a method 'check_alphabet' which wraps Alphabet.validate:

<pre>
def check_alphabet(self):
    """Raise a ValueError if the sequence has letters outside its alphabet."""
    self.alphabet.validate(str(self))
</pre>
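
Put together, the two pieces might behave roughly as follows. This is only a sketch: the Alphabet, UnambiguousDNA and Seq classes below are simplified stand-ins written for illustration (not the real Biopython classes), so the example runs without Biopython installed.

```python
class Alphabet(object):
    """Stand-in for Bio.Alphabet.Alphabet (hypothetical, for illustration)."""
    letters = None  # generic alphabets define no letters

    def validate(self, sequence):
        """Raise ValueError if sequence contains letters not in the alphabet."""
        if not self.letters:
            return  # no letters defined, so nothing to check against
        bad_letters = set(str(sequence)) - set(self.letters)
        if bad_letters:
            raise ValueError("Alphabet does not accept these letters: "
                             + "".join(sorted(bad_letters)))


class UnambiguousDNA(Alphabet):
    """Stand-in for an IUPAC-style alphabet with a fixed letter set."""
    letters = "GATC"


class Seq(object):
    """Stand-in for Bio.Seq.Seq (hypothetical, for illustration)."""

    def __init__(self, data, alphabet):
        self._data = data
        self.alphabet = alphabet

    def __str__(self):
        return self._data

    def check_alphabet(self):
        # Thin wrapper, as proposed: delegate to the alphabet.
        self.alphabet.validate(str(self))


# Valid sequence: returns None without raising.
Seq("GATTACA", UnambiguousDNA()).check_alphabet()

# Generic alphabet: anything goes, because no letters are defined.
Seq("gat?", Alphabet()).check_alphabet()

# Ambiguity code under an unambiguous alphabet: rejected.
try:
    Seq("GATNACA", UnambiguousDNA()).check_alphabet()
except ValueError as err:
    error_message = str(err)
```

The set difference keeps the check to a single pass over the sequence, which is the "efficiently implemented" part of the proposal.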

In SeqIO.parse and SeqIO.read, add an option check_alphabet (default False) which, when set to True, calls either Alphabet.validate(seq) or seq.check_alphabet() on each sequence. If validation fails, the exception is propagated up to the caller.
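
To make the opt-in behaviour concrete, here is a toy sketch. The parse and validate functions below are hypothetical stand-ins, not the real SeqIO.parse or Alphabet API: the parser simply iterates over (name, sequence) tuples instead of reading a file, and the alphabet is passed as a plain string of letters.

```python
def validate(sequence, letters):
    """Stand-in for the proposed Alphabet.validate (hypothetical)."""
    if not letters:
        return  # no letters defined, so nothing to check
    bad_letters = set(sequence) - set(letters)
    if bad_letters:
        raise ValueError("Alphabet does not accept these letters: "
                         + "".join(sorted(bad_letters)))


def parse(records, letters=None, check_alphabet=False):
    """Toy stand-in for SeqIO.parse: yield (name, sequence) pairs.

    `records` is any iterable of (name, sequence) tuples; a real parser
    would read them from a file handle instead.
    """
    for name, sequence in records:
        if check_alphabet:  # the one extra 'if' paid in the default case
            validate(sequence, letters)
        yield name, sequence


records = [("ok", "GATTACA"), ("bad", "GAT-ACA")]

# Default: no validation, so both records come through unchanged.
unchecked = list(parse(records, letters="GATC"))

# Opt-in: the ValueError propagates to the caller at the bad record.
seen = []
message = None
try:
    for name, sequence in parse(records, letters="GATC", check_alphabet=True):
        seen.append(name)
except ValueError as err:
    message = str(err)
```

Because the parser is a generator, the error surfaces when the offending record is reached rather than when parse is called, so records before it are still delivered.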

I don't know how much this would affect performance, but it seems that users are willing to accept a small performance hit if they explicitly opt into validation. The extra 'if' statement may or may not be noticeable in the default case.
----------------------------------------
Bug #2597: Enforce alphabet letters in Seq objects
https://redmine.open-bio.org/issues/2597

Author: Peter Cock
Status: In Progress
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: Not Applicable
URL: 

If a Seq object is created with an alphabet with a pre-defined set of letters (e.g. the IUPAC alphabets) then I think Biopython should validate that the sequence does indeed only use those letters.

This will catch misuse of ambiguous sequences with non-ambiguous alphabets, letters in an unexpected case, and most importantly any unexpected symbols (e.g. from a parsing problem).
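
The three kinds of problem can be illustrated with a small sketch. The helper unexpected_letters is a hypothetical name used only here, and the alphabet is represented as a plain string of its allowed letters (the unambiguous DNA letters) rather than as a Biopython alphabet object.

```python
UNAMBIGUOUS_DNA_LETTERS = "GATC"  # upper case only, no ambiguity codes


def unexpected_letters(sequence, letters):
    """Return a sorted list of the letters in sequence not found in letters."""
    return sorted(set(sequence) - set(letters))


# Ambiguity code used with a non-ambiguous alphabet.
ambiguous = unexpected_letters("GATNACA", UNAMBIGUOUS_DNA_LETTERS)

# Lower-case letters with an upper-case-only alphabet.
wrong_case = unexpected_letters("gattaca", UNAMBIGUOUS_DNA_LETTERS)

# Stray symbol, e.g. left behind by a parsing problem.
stray_symbol = unexpected_letters("GAT*ACA", UNAMBIGUOUS_DNA_LETTERS)

# A clean sequence yields an empty list.
clean = unexpected_letters("GATTACA", UNAMBIGUOUS_DNA_LETTERS)
```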

This will impose a performance overhead - which can be avoided if the user instead chooses to use a generic dna/rna/protein alphabet which does not list the letters expected.

Note that we will have to resolve Bug 2532 before doing this, as currently some parts of Biopython are misusing the upper-case-only IUPAC alphabet objects with mixed-case sequences.
