[Biopython] Sequence object "find" is still case specific?

Wed Mar 6 10:48:18 UTC 2013

Peter Cock wrote:
> On Wed, Mar 6, 2013 at 7:45 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> --- On Mon, 3/4/13, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz> wrote:
>>> I do use mixed-casing quite often and I think it is
>>> acceptable to ask user to do the
>>> .find like:
>>>
>>> s.tostring().upper().find('ACGTT')
>>>
>>> and leave the user to slice out the mixed-case match
>>> eventually from the original sequence object.
>>
>> The problem though is that the call to .upper() will be slow if s is a
>> long sequence. Trying this for human chromosome 1 showed that
>> the search will take 20,000 times longer, and is unacceptably slow
>> if you want to execute this search often.

I call .upper() on raw 454 read sequences, which are up to about 1200 nt long,
but I haven't studied the performance. I just wanted to avoid calling re.compile()
for every unpredictable query. The objects returned by SeqIO.parse() stay in mixed
case, which is what I am happy with.
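
For illustration, the upper-case-then-slice approach can be sketched roughly as
below. It uses plain Python strings so that it runs without Biopython; a Seq
object would first have to be converted to a string. The helper name
find_ignore_case is made up for this example and is not a Biopython function.

```python
def find_ignore_case(sequence, query):
    """Return (start, matched_text) for the first case-insensitive hit, or (-1, "")."""
    # Search an upper-cased temporary copy. upper() keeps the length for ASCII
    # nucleotide letters, so the index is also valid in the original string.
    start = sequence.upper().find(query.upper())
    if start == -1:
        return -1, ""
    # Slice the original so the mixed casing of the read is preserved.
    return start, sequence[start:start + len(query)]


read = "ttgaACGttcAAg"
print(find_ignore_case(read, "acgtt"))  # (4, 'ACGtt')
```

The temporary copy is the cost discussed above: cheap for a read of about
1200 nt, but noticeable for a whole chromosome.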

> 
> With the current code, the simple route is to standardise all your
> query and search strings into one case (e.g. upper case).
> 
> Might an optional case-insensitive search be useful if we
> can make it fast with some optional C code (and a pure Python
> fallback for PyPy, Jython, etc)?

Yes, if you provide a case-insensitive search interface to SeqIO objects
I will gladly use it instead of making a temporary copy of s.tostring().upper().
But I do it only once for some, maybe not even all, reads; I would have to dig
more into my code. ;-) I mention this just in case you were considering paying
the penalty during the initial *all data* import.
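
A rough idea of what a pure Python search without the temporary upper-cased
copy might look like, using the standard re module on a plain string. The
function name search_ignore_case is hypothetical; this is not an existing
Biopython interface, and it has not been timed against the .upper() approach.

```python
import re


def search_ignore_case(sequence, query):
    """Return the start of the first case-insensitive match, or -1."""
    # re.escape() makes the query literal text, so an unpredictable query is
    # never interpreted as regex syntax. re.IGNORECASE handles the casing
    # during matching instead of copying the whole sequence.
    match = re.search(re.escape(query), sequence, flags=re.IGNORECASE)
    return match.start() if match else -1


print(search_ignore_case("ttgaACGttcAAg", "ACGTT"))  # 4
```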

Martin