[Biopython] matching sequences from fasta files

Thu Mar 11 11:06:24 UTC 2010

On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis
<vincent at vincentdavis.net> wrote:
> So I had an idea and wanted to get some feedback.
> I could make all possible single-position mismatches for the sequences. I
> have 230,000 now and that would give me 17,250,000 (3 * 25 * 230,000). Then
> use BLAST to look for perfect matches. I would probably do this
> incrementally, maybe even just BLAST for each sequence. The advantage I see
> in this is that BLAST can run multi-core and I am running it on an 8-core
> machine with 48 GB of memory. So it seems that this would be the fastest way
> to do this, and very straightforward as there is very little parsing. There
> is either a match or not. I am purely guessing that generating the list is
> faster than parsing the results.

The strength of BLAST is fast fuzzy matching. My instinct is that it
would be silly to take your 230,000 queries, generate an extra
17,250,000 queries, and then run BLAST against your (organism
specific?) database. Just run BLAST on your original queries with some
reasonably strict match parameters, then post-filter the hits for your
single base change.
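
A minimal sketch of that post-filter step in Python, assuming you have
already pulled the query and the aligned hit out of the BLAST output as
plain strings of equal length with no gaps (the function names here are
made up for illustration, they are not Biopython functions):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def is_single_base_change(query, hit):
    """True if hit is the same length as query and differs at exactly one base."""
    return len(query) == len(hit) and hamming_distance(query, hit) == 1


if __name__ == "__main__":
    print(is_single_base_change("ACGTACGT", "ACGTACCT"))
    print(is_single_base_change("ACGTACGT", "ACGTACGT"))
```

Comparing upper-cased strings would be a sensible extra step if your
database and queries differ in case.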

Now, if you really want to go for the brute force approach of looking
for the perfect matches, what you could do is for each query of length
25, generate 25 simple regular expressions (one per position, with the
"any letter" wildcard at that position). You can do the regular
expression matching within Python, or even with a command line tool like
EMBOSS dreg.
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html
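
Within Python, the regular expression idea could look roughly like this.
It is only a sketch using the standard library re module; the function
names are invented for the example, and it searches one strand of one
target string only. Note that "." also matches the original letter, so
perfect matches are reported as well as single mismatches.

```python
import re


def single_wildcard_patterns(query):
    """One compiled regex per position, with '.' replacing that position.

    Each pattern is wrapped in a lookahead so that overlapping matches
    in the target are all reported.
    """
    patterns = []
    for i in range(len(query)):
        body = re.escape(query[:i]) + "." + re.escape(query[i + 1:])
        patterns.append(re.compile("(?=" + body + ")"))
    return patterns


def find_near_matches(query, target):
    """Sorted start positions where target matches query with at most one mismatch."""
    starts = set()
    for pattern in single_wildcard_patterns(query):
        for match in pattern.finditer(target):
            starts.add(match.start())
    return sorted(starts)


if __name__ == "__main__":
    # A 4-mer stands in for a 25-mer probe to keep the example readable.
    print(find_near_matches("ACGT", "TTACGTTTACCTAA"))
```

For a 25-mer this gives the 25 patterns mentioned above. To drop the
perfect matches, compare each hit back against the query afterwards.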

Speaking of the EMBOSS tools, their fuzzy nucleotide search tool
fuzznuc might be useful (you can specify the patterns using the
IUPAC codes rather than regular expressions):
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html
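
As an illustration of the IUPAC route, the ambiguity codes B, D, H and V
mean "not A", "not C", "not G" and "not T" respectively, so one pattern
per position covers all three alternative bases at that position. The
sketch below only builds the pattern strings (the function name is
hypothetical, and it assumes plain upper case A/C/G/T queries); how you
then feed them to fuzznuc is up to you.

```python
# IUPAC code meaning "any base except this one".
NOT_BASE = {"A": "B", "C": "D", "G": "H", "T": "V"}


def single_mismatch_iupac_patterns(query):
    """One IUPAC pattern per position, each requiring a mismatch at that position."""
    patterns = []
    for i, base in enumerate(query):
        if base not in NOT_BASE:
            raise ValueError("unexpected letter %r at position %d" % (base, i))
        patterns.append(query[:i] + NOT_BASE[base] + query[i + 1:])
    return patterns


if __name__ == "__main__":
    for pattern in single_mismatch_iupac_patterns("ACGT"):
        print(pattern)
```

That is 25 patterns per 25-mer rather than the 3 * 25 explicit
sequences, and unlike the "any letter" wildcard these do not match the
unmodified query.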

As far as I know, EMBOSS doesn't have a tool/option for fuzzy matching
where you can specify an allowed number of mismatches - unless one
of the primer/vector tools can be used in this way? I'd suggest using
primersearch but I think that only takes pairs of primers (not single
probes).

There is going to be more than one way to solve your problem. This
will be a useful learning process for you.

Regards,

Peter