[Biopython-dev] [Biopython - Bug #2597] Enforce alphabet letters in Seq objects
redmine at redmine.open-bio.org
Thu Feb 23 17:17:59 UTC 2012
Issue #2597 has been updated by Eric Talevich.
It would also be useful to be able to validate alphabets when constructing Seqs or SeqRecords from scratch.
Here's a proposal that I believe fits with most of what's been agreed to so far.
In Bio/Alphabet/__init__.py, replace _verify_alphabet with an efficiently implemented method on the Alphabet class and perhaps make it public:
<pre>
def validate(self, sequence):
    """Raise a ValueError if sequence contains letters not allowed by alphabet.

    If alphabet does not define letters, it's all OK.
    ...
    """
    # self.letters may be None (no letters defined), so test it before
    # building a set; set(None) would raise a TypeError.
    if self.letters:
        bad_letters = set(str(sequence)) - set(self.letters)
        if bad_letters:
            raise ValueError("Alphabet does not accept these letters: "
                             + ''.join(sorted(bad_letters)))
</pre>
In the Seq class, optionally add a method 'check_alphabet' which wraps Alphabet.validate:
<pre>
def check_alphabet(self):
    # Pass the sequence as a string rather than via the deprecated .data attribute.
    self.alphabet.validate(str(self))
</pre>
In SeqIO.parse and SeqIO.read, add an optional argument check_alphabet, defaulting to False. When it is True, each parsed sequence is validated by calling either Alphabet.validate(seq) or seq.check_alphabet(). If validation fails, the ValueError is propagated up to the caller.
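As a rough sketch of how the opt-in check might behave, here is a toy version that does not use the real Bio.Alphabet or Bio.SeqIO code. ToyAlphabet, ToySeq and parse_records are hypothetical stand-ins invented for illustration; only the check_alphabet flag and the validate/check_alphabet method names come from the proposal.

```python
class ToyAlphabet(object):
    """Hypothetical stand-in for an Alphabet; letters=None means no restriction."""

    def __init__(self, letters=None):
        self.letters = letters

    def validate(self, sequence):
        # An alphabet without defined letters accepts everything.
        if not self.letters:
            return
        bad_letters = set(str(sequence)) - set(self.letters)
        if bad_letters:
            raise ValueError("Alphabet does not accept these letters: "
                             + "".join(sorted(bad_letters)))


class ToySeq(object):
    """Hypothetical stand-in for Seq: a string plus an alphabet."""

    def __init__(self, data, alphabet):
        self._data = data
        self.alphabet = alphabet

    def __str__(self):
        return self._data

    def check_alphabet(self):
        self.alphabet.validate(self)


def parse_records(strings, alphabet, check_alphabet=False):
    """Hypothetical parser: yields ToySeq objects, validating only on request."""
    for text in strings:
        seq = ToySeq(text, alphabet)
        if check_alphabet:
            # Any ValueError propagates up to whoever is iterating.
            seq.check_alphabet()
        yield seq
```

With the default check_alphabet=False the only extra work per record is the single 'if' test; with check_alphabet=True the first record containing a disallowed letter raises ValueError when the iterator reaches it.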
I don't know how much this would affect performance, but it seems that users are willing to accept a small performance hit if they explicitly opt into validation. The extra 'if' statement may or may not be noticeable in the default case.
----------------------------------------
Bug #2597: Enforce alphabet letters in Seq objects
https://redmine.open-bio.org/issues/2597
Author: Peter Cock
Status: In Progress
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: Not Applicable
URL:
If a Seq object is created with an alphabet with a pre-defined set of letters (e.g. the IUPAC alphabets) then I think Biopython should validate that the sequence does indeed only use those letters.
This will catch mis-use of ambiguous sequences with non-ambiguous alphabets, letters in an unexpected case, and most importantly any unexpected symbols (e.g. from a parsing problem).
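To make those three cases concrete, here is a minimal sketch using plain Python sets rather than Biopython objects. The helper name unexpected_letters is hypothetical, and "GATC" is used as the letter set of an unambiguous upper case DNA alphabet.

```python
UNAMBIGUOUS_DNA = "GATC"  # assumed letter set for an unambiguous DNA alphabet


def unexpected_letters(sequence, letters):
    """Return the sorted letters in sequence that are not in letters (hypothetical helper)."""
    return sorted(set(sequence) - set(letters))


print(unexpected_letters("GATTACA", UNAMBIGUOUS_DNA))   # valid sequence
print(unexpected_letters("GATNACA", UNAMBIGUOUS_DNA))   # ambiguous letter N
print(unexpected_letters("gattaca", UNAMBIGUOUS_DNA))   # unexpected lower case
print(unexpected_letters("GAT-AC>A", UNAMBIGUOUS_DNA))  # stray symbols, e.g. from a parsing problem
```

An empty list means the sequence is consistent with the alphabet; anything else is what validation would report.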
This will impose a performance overhead - which can be avoided if the user instead chooses to use a generic dna/rna/protein alphabet which does not list the letters expected.
Note that we will have to resolve Bug 2532 before doing this, as currently some parts of Biopython are mis-using the upper case only IUPAC alphabet objects with mixed case sequences.