[Biopython] matching sequences from fasta files

Vincent Davis vincent at vincentdavis.net
Thu Mar 11 13:42:40 UTC 2010


Thanks again for all the responses. I'll let you know what I end up with.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


On Thu, Mar 11, 2010 at 4:06 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis
> <vincent at vincentdavis.net> wrote:
> > So I had an idea and wanted to get some feedback.
> > I could make all possible single-position mismatches for the sequences. I
> > have 230,000 now, and that would give me 17,250,000 (3 * 25 * 230,000).
> > Then use BLAST to look for perfect matches. I would probably do this
> > incrementally, maybe even just BLAST for each sequence. The advantage I
> > see in this is that BLAST can run multi-core, and I am running it on an
> > 8-core machine with 48 GB of memory. So it seems that this would be the
> > fastest way to do this, and very straightforward, as there is very little
> > parsing. There is either a match or not. I am purely guessing that
> > generating the list is faster than parsing the results.
>
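
The enumeration described above can be sketched in a few lines of Python. This is only an illustration, assuming plain DNA over A, C, G and T; the function name single_mismatch_variants and the example probe are hypothetical and not from the thread.

```python
def single_mismatch_variants(seq, alphabet="ACGT"):
    """Yield every sequence that differs from seq at exactly one position."""
    seq = seq.upper()
    for i, original in enumerate(seq):
        for base in alphabet:
            if base != original:
                # Swap in the alternative base at position i only.
                yield seq[:i] + base + seq[i + 1:]


# Hypothetical 25-mer probe, used only to check the arithmetic.
probe = "ACGTACGTACGTACGTACGTACGTA"
variants = list(single_mismatch_variants(probe))

print(len(variants))            # 75, i.e. 3 alternatives * 25 positions
print(len(variants) * 230000)   # 17250000 for 230,000 probes
```

Because it is a generator, the variants for one probe can be written out or searched without holding all 17,250,000 sequences in memory at once.
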
> The strengths of BLAST are in fast fuzzy matching. My instinct is that it
> would be silly to take your 230,000 queries, generate an extra
> 17,250,000 queries, and then run BLAST against your (organism
> specific?) database. Just run BLAST on your queries with some
> reasonably strict match parameters, then post-filter for your single
> base change.
>
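
The post-filtering step could look roughly like the sketch below. It does not run or parse BLAST; it assumes each hit has already been reduced to an ungapped subject string of the same length as the query. The helper names mismatch_count and is_single_base_change, and the example sequences, are hypothetical.

```python
def mismatch_count(query, subject):
    """Count differing positions between two equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(query.upper(), subject.upper()) if a != b)


def is_single_base_change(query, subject):
    """True if subject is a full-length match with exactly one substitution."""
    return len(query) == len(subject) and mismatch_count(query, subject) == 1


# Hypothetical query and candidate hits (shortened for readability).
query = "ACGTACGT"
candidates = ["ACGTACGT", "ACGTTCGT", "TCGTACGA"]

single_changes = [s for s in candidates if is_single_base_change(query, s)]
print(single_changes)  # ['ACGTTCGT']
```

Hits with gaps or partial-length alignments would need to be excluded before this check, since the comparison is position by position.
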
> Now, if you really want to go for the brute force approach of looking
> for the perfect matches, what you could do is for each query of length
> 25, generate 25 simple regular expressions (e.g. using the "any letter"
> wild card in each position). You can do the regular expression matching
> within Python, or even with a command line tool like EMBOSS dreg.
> http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html
>
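
A minimal sketch of the regular expression idea using Python's standard re module, shown on a short query rather than a 25-mer. The names wildcard_patterns and find_near_matches are hypothetical. Each pattern also matches the unmodified query, because the wildcard can match the original letter, and a lookahead is used so that overlapping occurrences are not skipped.

```python
import re


def wildcard_patterns(query):
    """Return one compiled regex per position, with '.' at that position."""
    patterns = []
    for i in range(len(query)):
        body = re.escape(query[:i]) + "." + re.escape(query[i + 1:])
        # The lookahead lets finditer report overlapping matches.
        patterns.append(re.compile("(?=(" + body + "))"))
    return patterns


def find_near_matches(query, target):
    """Return sorted (start, matched_text) pairs with at most one mismatch."""
    hits = set()
    for pattern in wildcard_patterns(query):
        for match in pattern.finditer(target):
            hits.add((match.start(1), match.group(1)))
    return sorted(hits)


# Hypothetical example: one exact match and one single-base variant.
print(find_near_matches("ACGTA", "TTACGTATTACCTATT"))
# [(2, 'ACGTA'), (9, 'ACCTA')]
```

To keep only true single-base changes, drop any hit whose matched text equals the query.
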
> Speaking of the EMBOSS tools, their fuzzy nucleotide search tool
> fuzznuc might be useful (you can specify the patterns using the
> IUPAC codes rather than regular expressions):
> http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html
>
> As far as I know, EMBOSS doesn't have a tool/option for fuzzy matching
> where you can specify an allowed number of mismatches - unless one
> of the primer/vector tools can be used in this way? I'd suggest using
> primersearch, but I think that only takes pairs of primers (not single
> probes).
>
> There is going to be more than one way to solve your problem. This
> will be a useful learning process for you.
>
> Regards,
>
> Peter
>


