[Bioperl-l] Findpatterns

Sun, 24 Nov 2002 12:04:34 -0800

You might want to check out the EMBOSS suite of tools; some of them may suit 
your needs.

I don't know how the bioperl gurus would handle building a list of locations 
based on pattern matching.  I had a similar problem and this is what I did:
1) Used bioperl's SeqIO module to read sequences from a flat file.
   (The next 2 steps are for increased search speed; we have 19,000 seqs in 
   our db.)
2) Built a hash of all wordmers of a specified length in the database, with 
   each wordmer acting as a pointer to a list of the sequence IDs where it 
   was found.
3) Looked at the first wordmer in the query sequence and searched only the 
   sequences listed under that wordmer in the hash.
4) Worked through each bare sequence in the list with the canonical use of 
   the pos and index functions in a while loop, straight from the Camel book, 
   3rd edition.
5) Dumped each find to a file.
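
Steps 2 to 4 can be sketched roughly as follows. This is a Python analogue, 
not the Perl I actually ran: str.find stands in for Perl's index, a plain dict 
of ID => sequence stands in for what SeqIO would read, and the function names 
and toy sequences are made up for illustration. It assumes the query is at 
least k letters long and only finds exact matches.

```python
from collections import defaultdict

def build_kmer_index(seqs, k):
    """Map every k-letter wordmer to the set of sequence IDs containing it."""
    index = defaultdict(set)
    for seq_id, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def find_all(query, seqs, index, k):
    """Return (seq_id, 0-based offset) for every exact occurrence of query."""
    hits = []
    # Only sequences that contain the query's first wordmer can contain the query.
    for seq_id in sorted(index.get(query[:k], ())):
        seq = seqs[seq_id]
        pos = seq.find(query)
        while pos != -1:
            hits.append((seq_id, pos))
            pos = seq.find(query, pos + 1)  # step by 1 so overlapping hits are kept
    return hits

# Toy data standing in for the flat file (hypothetical sequences).
seqs = {"seq1": "ACGTACGTTT", "seq2": "TTTTGGGG", "seq3": "GGACGTAC"}
index = build_kmer_index(seqs, 4)
hits = find_all("ACGTAC", seqs, index, 4)
print(hits)
```

Building the index walks every sequence once, which is the slow part; each 
query after that only touches the sequences listed under its first wordmer.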

The main speedbump is creating the index of wordmers; once that is done, the 
search itself runs in very reasonable time. To be honest I feel sort of rustic 
putting this on the list; perhaps someone can educate me as to a smoother, 
less roll-your-own method using bioperl. :)

Nathanael Kuipers
--
Center for Biomedical Research,
Dept. of Biology,
University of Victoria

>===== Original Message From "Fernandez-Capetillo, Oscar (NIH/NCI)" 
<fernando@mail.nih.gov> =====
>Hi there,
>Not sure I need the Bio part of Perl to do this. Anyway, maybe somebody has
>tried this before and can help me out.
>I am trying to search for a short nucleotide pattern in the human and
>mouse genome databases. Specifically, I want to find where the
>pattern occurs in the genome. My nucleotide pattern is of
>medium complexity (I can only represent it as a regular expression; otherwise
>the number of combinations would be huge). Let's say something like:
>ACTCTATCANNNNNNNNNNNNNNACTATCTTGGCATCGACNNNNNNNNCATGCTAGCATCGGG
>I know that years ago the freely usable GCG package had a tool named
>findpatterns that could do this. Unfortunately, people are not only
>driven by the sake of Science, and GCG is now commercial (I wonder what would
>have happened if Newton had patented differential calculus). So I am
>all alone. I could try a Blast for short sequences, but it does not accept
>ambiguities such as NNNNN...
>I'd want to run it against both mouse and human genome databases, which I
>don't think I can access through the bioperl interface.
>I'd appreciate any help. Thanks,
>Oskar
>Oskar Fernandez-Capetillo, Ph. D.
> NCI Build., 10 Room 4A01
>National Institute of Health
>10 Center Drive
>Bethesda, MD
>20892, 1360
>
>Phone: 301-496-4673
>Fax:      301-496-0887
>e-mail: fernando@mail.nih.gov
>www: http://usuarios.lycos.es/h2ax/