[Biopython] matching sequences from fasta files

Wed Mar 10 16:29:17 UTC 2010

On Wed, Mar 10, 2010 at 3:19 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I am considering just using Python and regular expressions. Blast is
> great but I don't seem to be able to easily filter it to get only close
> matches that differ at 1 SNP.
> I have a custom microarray and a list of the sequences it will bind. I need
> to test if they are in the genome of toxoplasma gondii (just yes or no) and
> if there are close matches (differ at 1 snp) and where the diff is in the
> sequence.
>
> So from reading the responses I should consider Python's re module, or look
> more into FASTA or needle to see if I can get my version of a close match
> from them. Is this right? Like I said I am very new to this, just got called
> in to get this project done.

Using BLAST / FASTA / needle / any pairwise alignment is going to
boil down to running the tool and parsing its output to filter out
what you want. I don't think any of these general purpose tools allow
for a "single base pair difference" threshold. This approach should
work though.
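
As a rough sketch of the "run the tool, then filter" idea, assuming
you have BLAST output in the standard 12-column tabular format (query,
subject, % identity, alignment length, mismatches, gap opens, query
start/end, subject start/end, e-value, bit score). The function name
near_exact_hits and the probe_lengths dictionary are made up for this
illustration, they are not part of BLAST or Biopython:

```python
def near_exact_hits(tabular_lines, probe_lengths):
    """Yield (query, subject, mismatches, subject_start) for hits that
    cover the whole probe with no gaps and at most one mismatch.

    tabular_lines: iterable of lines in 12-column BLAST tabular format.
    probe_lengths: dict mapping probe (query) name to its length;
                   a hypothetical helper structure you build yourself.
    """
    for line in tabular_lines:
        # Skip blank lines and comment lines (commented tabular output).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        aln_len = int(fields[3])
        mismatches = int(fields[4])
        gap_opens = int(fields[5])
        subject_start = int(fields[8])
        if (aln_len == probe_lengths[query]
                and gap_opens == 0
                and mismatches <= 1):
            yield query, subject, mismatches, subject_start
```

One caveat with this filter: BLAST does local alignment, so if the
mismatch is at or near the end of a probe the reported alignment may
be trimmed short and would fail the full-length test here. You would
probably want to check such cases separately.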

If you want to allow a single mismatch anywhere in the sequence,
I'm not sure regular expressions are ideal either. If you wanted to
look for matches with a single mismatch at a particular point
(i.e. a known SNP) then a regular expression would work fine.
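
To make the distinction concrete, here is a minimal pure Python
sketch of both cases. The function names are invented for this
example. The first does a brute-force scan counting mismatches at
every position (so it reports where the difference is), the second
builds a regular expression for a SNP at one known position. Both
only look at the forward strand, and the brute-force scan will be
slow on a whole genome with many probes:

```python
import re


def find_with_one_mismatch(probe, genome):
    """Return a list of (start, mismatch_index) for every position in
    genome where probe matches with at most one substitution.
    mismatch_index is None for an exact match."""
    hits = []
    n = len(probe)
    for start in range(len(genome) - n + 1):
        window = genome[start:start + n]
        # Positions within the probe where the two sequences differ.
        diffs = [i for i in range(n) if window[i] != probe[i]]
        if len(diffs) <= 1:
            hits.append((start, diffs[0] if diffs else None))
    return hits


def known_snp_pattern(probe, snp_index, alleles="ACGT"):
    """Compile a regular expression matching probe with any of the
    given alleles allowed at the single position snp_index."""
    return re.compile(
        re.escape(probe[:snp_index])
        + "[" + alleles + "]"
        + re.escape(probe[snp_index + 1:])
    )


# Example with made-up sequences:
hits = find_with_one_mismatch("ACGTAC", "TTACGAACGATT")
match = known_snp_pattern("ACGTAC", 3).search("TTACGAACGATT")
```

Here hits records the start of the match in the genome string and the
index within the probe where the base differs, which covers the "where
is the diff" part of the question. For the reverse strand you would
need to repeat the search with the reverse complement of each probe.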

However, you might have more success with software designed for
second generation sequencing - there are certainly similarities to
mapping short reads (e.g. Solexa/Illumina data) to a reference
genome. You might also be able to use software designed to
look for primer matches (again, these are short sequences).

Just some ideas...

Peter