[BioRuby] matching against a zillion patterns
George Githinji
georgkam at gmail.com
Sun Oct 18 07:01:41 UTC 2009
Thank you for the approach.
The initial list contained 2000 sequences (i.e. the ones that need to be
classified) and there were 570 patterns. All of them are the same
size (14 amino acids long). Using Robert's approach I was able to create a
single very long regular expression for matching. It seems to work, but I
will run some benchmarks and diagnostic tests.
I can tweak this approach so as to do a single pass over all the sequences
while doing the search.
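
For the archive, here is a minimal sketch of the single-regexp idea. It
assumes the patterns are combined with Regexp.union, which may differ from
what Robert actually suggested; the patterns and sequences are made-up
sample data, not my real set.

```ruby
# Hypothetical sample data; the real run had 570 patterns and 2000 sequences.
patterns  = [/A[RK]GNDLYSKNIG/, /GG[AS]RGN/]
sequences = %w[CAARGNDLYSKNIG GGARGNDLYSKNIG KKTTTTTTTTTTTT]

# Regexp.union joins the patterns into one alternation, so each sequence
# is tested against a single regexp instead of looping over every pattern.
combined = Regexp.union(patterns)

# Sequences matching at least one pattern go in the first list ("box A"),
# the rest in the second ("box B").
matched, unmatched = sequences.partition { |seq| combined.match?(seq) }
```

Regexp#match? needs Ruby 2.4 or later; on older versions `seq =~ combined`
does the same job.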
Thank you so much!!
On Fri, Oct 16, 2009 at 4:51 PM, Jan Aerts <jan.aerts at gmail.com> wrote:
> Hey George,
> So if I understand correctly you've got a huge number of amino acid
> sequences (how many?) and about 400 regular expressions. And for each of the
> amino acid sequences: if it matches at least one of the regular expressions
> it is put in box A, and if it matches none of the regexps, it goes into box B.
> Correct?
>
> It just happens that something very similar was the subject of Jim
> Tisdall's (from Beginning Perl for Bioinformatics fame) talk at the
> bioinformatics course we're teaching at the moment :-)
>
> First thing: avoid loops. You don't want to loop over all regexps for
> each AA sequence, or the other way around.
>
> Are all regexps of the same length? Would be nice if they are, but not
> critical. My approach would be to go over the data just once. So suppose the
> regexps all are of the same length.
>
> A. Prepare your data:
>   a. "Decode" the regexps into literal strings: e.g. /A[BC]D/ becomes "ABD"
>      and "ACD".
>   b. Create a hash that contains all those strings as keys.
>   c. Concatenate all AA sequences together, joined with a non-AA character,
>      let's say a semicolon ";". E.g.
>      CAARGNDLYSKNIG;GGARGNDLYSKNIG;KKARGNDLYSKNIG
>
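
Step A could be sketched roughly as below. This assumes the patterns contain
only literal residues and simple character classes like [BC] (no quantifiers,
ranges, negation or alternation); expand_pattern is a hypothetical helper
name, and the pattern list is sample data.

```ruby
# Expand a pattern source such as "A[BC]D" into every literal string it
# can match. Only literals and simple character classes are supported.
def expand_pattern(source)
  # Each token is either a bracketed class or a single literal character.
  choices = source.scan(/\[([^\]]+)\]|(.)/).map do |klass, literal|
    klass ? klass.chars : [literal]
  end
  return [""] if choices.empty?

  # Cartesian product of the per-position choices gives all literals.
  first, *rest = choices
  first.product(*rest).map(&:join)
end

# a. decode the patterns, b. store the literals as hash keys
patterns = ["A[BC]D"] # hypothetical sample pattern
lookup = {}
patterns.each do |pattern|
  expand_pattern(pattern).each { |literal| lookup[literal] = true }
end

# c. join the sequences with a character that is not an amino acid code
sequences = %w[CAARGNDLYSKNIG GGARGNDLYSKNIG KKARGNDLYSKNIG]
concatenated = sequences.join(";")
```

Note that the number of literals grows multiplicatively with each character
class, so this only stays practical while the classes are few and small.
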
> B. Do the actual search
>   a. If the length of the strings to match (what used to be the regexps,
>      and are now the keys in the hash) is 5: take the first 5 characters of
>      your concatenated AA string and check if that substring exists as a key
>      in the hash. If so: you know that the AA sequence between the
>      surrounding ";" characters should go in box A.
>   b. Advance 1 position: take AAs 2 to 6.
>   c. Go back to a.
>
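
The search in step B might look something like this. It is only a sketch:
classify is a hypothetical name, the hash and sequences are sample data, and
it assumes all keys have the same length and never contain the separator.

```ruby
# Slide a fixed-width window over the concatenated sequences and split
# them into those with at least one hash hit (box A) and the rest (box B).
def classify(sequences, lookup, width, separator = ";")
  concatenated = sequences.join(separator)
  hit = Array.new(sequences.length, false)
  seq_index = 0 # which sequence the current window starts in

  (0..concatenated.length - width).each do |i|
    # Passing a separator means later windows belong to the next sequence.
    # Windows that contain a separator can never be a key, so they are
    # harmless even when they straddle two sequences.
    seq_index += 1 if concatenated[i] == separator
    hit[seq_index] = true if lookup.key?(concatenated[i, width])
  end

  sequences.partition.with_index { |_seq, idx| hit[idx] }
end

lookup = { "ARGND" => true } # hypothetical 5-residue key
box_a, box_b = classify(%w[CAARGNDLYSKNIG GGTTTTTTTTTTTT], lookup, 5)
```

Each window costs one hash lookup, so the work depends on the total length
of the sequences rather than on the number of patterns.
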
> You might have to tweak this approach to exactly fit your requirements, but
> if your code used to take a very long time, this might speed things up
> immensely.
>
> (George: can you forward this to the ruby mailing list it was discussed on
> initially? Cheers)
>
> Good luck,
> jan.
>
>
> 2009/10/16 George Githinji <georgkam at gmail.com>
>
>> Recently had this discussion on the Ruby mailing list. Any ideas or
>> solutions?
>>
>> http://www.ruby-forum.com/topic/197365#new
>>
>> --
>> ---------------
>> Sincerely
>> George
>>
>> Skype: george_g2
>> Blog: http://biorelated.wordpress.com/
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>
>
>
--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/