[Biopython-dev] Alphabet case and standards

Mon Jan 12 22:24:00 UTC 2009

Hi,
I am moving a potential discussion away from the bugzilla because it 
affects at least the following Bugs (please add others):
2351 (Make Seq more like a string, even subclass string? 
http://bugzilla.open-bio.org/show_bug.cgi?id=2351 ),
2532 (Using IUPAC alphabets in mixed case Seq objects 
http://bugzilla.open-bio.org/show_bug.cgi?id=2532 ),
2597 (Enforce alphabet letters in Seq objects 
http://bugzilla.open-bio.org/show_bug.cgi?id=2597 ),
2731 (Adding .upper() and .lower() methods to the Seq object 
http://bugzilla.open-bio.org/show_bug.cgi?id=2731 ).

I am hoping that this gets wider feedback than Bugzilla would, and that 
it avoids unnecessary duplication and closure of these bugs.

From Bug 2351, "Bio.Alphabets.IUPAC defines a number of alphabets with 
defined lists of valid letters which are in upper case ONLY". But 
various applications ignore the alphabet case and hence the standards. 
So this creates the problem of how Biopython should handle alphabet case.

If we follow the standard in all modules then there should be no need 
to do anything except ensure that we follow it. There are numerous 
examples where the standard is not followed, whether through user 
ignorance, for simplicity or by design (such as using mixed case to 
denote 'important' things), and various databases and applications do 
not follow it either. But I think that the actual case is irrelevant in 
most situations, and not following the standard would make Biopython 
inefficient.

One suggestion given in two of the bugs is to change the Alphabet 
object, but I believe that this is wrong because you do not know which 
alphabet to use. If you already know the case then my preferred option 
is to change the case of your query. Otherwise you would have to obtain 
and use one alphabet for every case used; for example, a user may need 
two alphabets to handle upper and lower case, or just one combined one. 
Also, if mixed case alphabets are used, then an excessive number of 
alphabets may be required.
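
To make the counting concrete, here is a minimal sketch in plain Python. 
It does not use Biopython at all; the letter string and the helper name 
are assumptions made up for illustration, not existing API. It shows the 
letter sets that would be needed for a four-letter DNA alphabet if case 
were handled in the alphabet itself:

```python
# Hypothetical stand-in for the letters of an upper-case-only alphabet.
UPPER_DNA = "ACGT"


def case_variants(letters):
    """Return the letter sets needed to cover upper, lower and mixed case."""
    upper = letters.upper()
    lower = letters.lower()
    return {"upper": upper, "lower": lower, "mixed": upper + lower}


variants = case_variants(UPPER_DNA)
for name, letters in variants.items():
    print(name, letters)
```

Every alphabet (DNA, RNA, protein, ambiguous variants and so on) would 
need the same treatment, which is where the number of alphabets grows.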

I think that the current approach is to force the user to use uppercase 
when interacting with the Alphabet object or anything derived from it 
(such as an actual alphabet). While this preserves the input case, it 
does not enforce the standard. It is also inefficient because it 
requires constant checks for the correct case.

Similar to the first suggestion in Bug 2731, I think that we should 
automatically change the case when creating any sequence-related object 
and provide a warning that the input has changed. This enforces the 
standard and probably requires only small changes to the code, but loses 
the format of the input. Outside of Biopython, an example of this is the 
web version of NCBI BLAST, which silently converts the case of the query.
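
A rough sketch of this convert-and-warn idea, again in plain Python 
rather than against the real Seq class, could look like the following. 
The function name and the warning text are assumptions for illustration 
only; nothing here is existing Biopython code.

```python
import warnings


def normalise_case(data):
    """Return data in upper case, warning if the input had to be changed."""
    upper = data.upper()
    if upper != data:
        # Tell the caller that the original case has been lost.
        warnings.warn(
            "sequence converted to upper case", UserWarning, stacklevel=2
        )
    return upper
```

With this, normalise_case("acGT") gives "ACGT" and issues a warning, 
while input that is already upper case passes through silently. A 
constructor for a sequence-related object would call something like this 
once, instead of checking the case on every later operation.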

Less desirable options:
a) Enforce the standard, as with Bug 2597, so that an error is 
returned for any sequence-related object if the case is incorrect. This 
is probably a little too harsh for a difference in case.
b) Use regular expressions to ignore case, but this will create a large 
performance penalty, especially when it is not required.
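
For comparison, the two options above might be sketched as follows. 
This is plain Python with made-up names (DNA_LETTERS, check_strict and 
matches_ignoring_case are all hypothetical), so treat it as an 
illustration of the trade-off rather than a proposal for the actual 
code.

```python
import re

# Assumed upper-case-only letters, standing in for a real alphabet.
DNA_LETTERS = "ACGT"


def check_strict(data, letters=DNA_LETTERS):
    """Option (a): raise an error for any letter outside the alphabet."""
    bad = sorted(set(data) - set(letters))
    if bad:
        raise ValueError("letters not in alphabet: " + "".join(bad))
    return data


# Option (b): a case-insensitive pattern built from the same letters.
DNA_PATTERN = re.compile("[" + re.escape(DNA_LETTERS) + "]*", re.IGNORECASE)


def matches_ignoring_case(data):
    """Return True if data uses only alphabet letters, in either case."""
    return DNA_PATTERN.fullmatch(data) is not None
```

Here check_strict("acgt") raises ValueError purely because of the case, 
while matches_ignoring_case("acGT") accepts the input at the cost of a 
regular expression match on each check.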

Regards
Bruce