[Biopython] Sequence object "find" is still case specific?

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Mon Mar 4 13:43:09 UTC 2013


Michiel de Hoon wrote:
> --- On Sun, 3/3/13, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> We're going off topic here, but for the record I think
>> that 'find' should continue to be case sensitive (like
>> Python strings).
> 
> I would prefer find to be case-insensitive. Biochemically there is no difference between upper case and lower case nucleotides; lower case is just used for annotation purposes. I find it quite counter-intuitive that 
> >>> s = Seq("ACGTttt")
> >>> s.find("ACGTT")
> returns -1.
> 
> While it is possible to change the sequences to upper case before executing .find, it has the disadvantage that then we won't be able to tell what the original case was (and therefore whether we are hitting a repeat region or not).

I agree that it would be bad if Biopython converted my sequences to all-uppercase.
And it is not only me: a typical use case nowadays is importing raw sequencing reads
that carry low-quality/masked regions in lower case.

I use mixed case quite often, and I think it is acceptable to ask the user to do the
.find like this:

str(s).upper().find('ACGTT')

and let the user slice the mixed-case match out of the original sequence object
afterwards.
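
A minimal sketch of that approach, assuming a plain Python string stands in for
the sequence (with a Seq object one would call str() on it first). The helper
name find_ignore_case is made up for illustration and is not part of Biopython:

```python
def find_ignore_case(sequence, pattern):
    """Case-insensitive find that also returns the match in its original case.

    Hypothetical helper, not a Biopython function. Returns (start, text);
    start is -1 and text is '' when the pattern does not occur.
    """
    # Search an upper-cased copy so the original sequence is left untouched.
    start = sequence.upper().find(pattern.upper())
    if start == -1:
        return -1, ""
    # Slice the ORIGINAL sequence so the lower-case (masked) bases survive.
    return start, sequence[start:start + len(pattern)]


seq = "ACGTttt"
start, match = find_ignore_case(seq, "ACGTT")
print(start, match)  # 0 ACGTt
```

The slice keeps the original casing, so one can still tell whether the hit
overlaps a masked/low-quality region.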

I don't think I want anything changed in Biopython, except maybe more
runtime control over the alphabet checks during data import. Supporting
searches through a possibly mixed-case sequence object would require using a
regexp engine and would possibly be slower.
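
For comparison, a regexp-based version of such a search might look like the
sketch below. It again assumes a plain Python string in place of the sequence
object, and find_ignore_case_re is a hypothetical name; whether it is actually
slower than the upper-case approach has not been measured here:

```python
import re


def find_ignore_case_re(sequence, pattern):
    """Return the index of the first case-insensitive match, or -1.

    Hypothetical helper, not a Biopython function.
    """
    # re.escape keeps the pattern literal, so this mirrors str.find semantics.
    match = re.search(re.escape(pattern), sequence, flags=re.IGNORECASE)
    return match.start() if match else -1


print(find_ignore_case_re("ACGTttt", "ACGTT"))  # 0
```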

Hope I got your discussion right. ;-)
Martin


