[Biopython-dev] Alphabet case and standards
Bruce Southey
bsouthey at gmail.com
Mon Jan 12 22:24:00 UTC 2009
Hi,
I am moving a potential discussion away from the bugzilla because it
affects at least the following Bugs (please add others):
2351 (Make Seq more like a string, even subclass string?
http://bugzilla.open-bio.org/show_bug.cgi?id=2351 ),
2532 (Using IUPAC alphabets in mixed case Seq objects
http://bugzilla.open-bio.org/show_bug.cgi?id=2532 ),
2597 (Enforce alphabet letters in Seq objects
http://bugzilla.open-bio.org/show_bug.cgi?id=2597 )
2731 (Adding .upper() and .lower() methods to the Seq object
http://bugzilla.open-bio.org/show_bug.cgi?id=2731 ).
I am hoping it gets wider feedback here than on bugzilla, avoids
unnecessary duplication, and leads to closure of these bugs.
From Bug 2351, "Bio.Alphabets.IUPAC defines a number of alphabets with
defined lists of valid letters which are in upper case ONLY". But
various applications ignore the alphabet case and hence the standards.
So this creates the problem of how Biopython should handle alphabet case.
If we follow the standard for all modules then there should be no need
to do anything except to ensure we follow it. There are numerous
examples where the standard is not followed, whether through user
ignorance, simplicity or design (such as using mixed case to denote
'important' things), and various databases and applications do not
follow it. But I think that the actual case is irrelevant in most
situations and not following the standard would make Biopython
inefficient.
One suggestion given in two of the bugs is to change the Alphabet object
but I believe that this is wrong because you do not know which alphabet
to use. If you already know the case then my preferred option is to
change the case of your query. Otherwise you would have to obtain and
use one alphabet for every case used; for example, a user may need two
alphabets to handle upper and lower case, or one combined alphabet.
Also, if mixed case alphabets are used, then an excessive number of
alphabets may be required.
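To illustrate the difference between changing the query and changing the
alphabet, here is a minimal sketch. The letter strings and the is_valid
helper are hypothetical stand-ins, not Biopython's Alphabet API; "GATC"
is only assumed as an example of an upper-case-only letter set.

```python
# Assumption: "GATC" stands in for an upper-case-only alphabet's letters;
# these names are illustrative and not part of Biopython.
UPPER_LETTERS = "GATC"
COMBINED_LETTERS = UPPER_LETTERS + UPPER_LETTERS.lower()


def is_valid(sequence, letters):
    """Return True if every character of sequence is in letters."""
    return all(char in letters for char in sequence)


query = "gatTACA"
print(is_valid(query, UPPER_LETTERS))          # False
print(is_valid(query, COMBINED_LETTERS))       # True
print(is_valid(query.upper(), UPPER_LETTERS))  # True
```

Upper-casing the query needs only the one standard letter set, whereas
accepting the query as given needs a second, combined set.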
I think that the current approach is to force the user to use uppercase
when interacting with the Alphabet object or anything derived from it
(such as an actual alphabet). While this preserves the input case as
stored, it does not enforce the standard. It is also inefficient because
it requires constant checks for the correct case.
Similar to the first suggestion in Bug 2731, I think that we should
automatically change the case when creating any sequence-related object
and provide a warning that the input has changed. This enforces the
standard and probably requires only small changes to the code, but loses
the format of the input. Outside of Biopython, an example of this is the
web version of NCBI BLAST, which silently converts the input case of the
query.
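A rough sketch of what this could look like, assuming a hypothetical
wrapper class (CaseNormalisedSeq is not an existing Biopython class, and
the real Seq constructor may need to differ):

```python
import warnings


class CaseNormalisedSeq:
    """Hypothetical sequence wrapper that upper-cases its input.

    Assumption: this is an illustration only, not Biopython's Seq.
    """

    def __init__(self, data):
        upper = data.upper()
        if upper != data:
            # Tell the user that the stored sequence differs from the input.
            warnings.warn("sequence converted to upper case", UserWarning)
        self.data = upper

    def __str__(self):
        return self.data


seq = CaseNormalisedSeq("gatTACA")  # emits a UserWarning
print(seq)
```

The original case is not kept anywhere, which is the loss of input
format mentioned above.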
Less desirable options:
a) Enforce the standard, such as with Bug 2597, so that an error is
returned for any sequence-related object if the case is incorrect. This is
probably a little too harsh for a difference in case.
b) Use regular expressions to ignore case, but this will create a large
performance penalty, especially if it is not required.
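For comparison, both of these options can be sketched as follows. The
function names and the "GATC" letter set are assumptions made for
illustration and are not taken from Biopython or from the bug reports.

```python
import re

# Assumption: stand-in for an upper-case-only alphabet's letters.
LETTERS = "GATC"


def check_strict(data, letters=LETTERS):
    """Option a: raise an error for any letter outside the alphabet."""
    bad = sorted(set(data) - set(letters))
    if bad:
        raise ValueError("letters not in alphabet: %s" % "".join(bad))
    return data


# Option b: a case-insensitive pattern built from the same letters.
CASE_INSENSITIVE = re.compile("[%s]*" % re.escape(LETTERS), re.IGNORECASE)


def matches_ignoring_case(data):
    """Return True if data uses only alphabet letters in either case."""
    return CASE_INSENSITIVE.fullmatch(data) is not None
```

With these definitions check_strict("gatc") raises ValueError, while
matches_ignoring_case("gatc") returns True.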
Regards
Bruce