[Biopython] matching sequences from fasta files

Peter biopython at maubp.freeserve.co.uk
Wed Mar 10 16:29:17 UTC 2010


On Wed, Mar 10, 2010 at 3:19 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> I am considering just using Python and regular expressions. Blast is
> great but I don't seem to be able to easily filter it to get only close
> matches that differ at 1 snp.
> I have a custom microarray and a list of the sequences it will bind. I need
> to test if they are in the genome of toxoplasma gondii (just yes or no) and
> if there are close matches (differ at 1 snp) and where the diff is in the
> sequence.
>
> So from reading the responses I should consider Python's re module, or look
> more into FASTA or needle to see if I can get my version of a close match
> from them.
> Is this right? Like I said I am very new to this, just got called in to get
> this project done.

Using BLAST / FASTA / needle / any pairwise alignment tool is going to
boil down to running the tool and parsing its output to filter out what
you want. I don't think any of these general purpose tools allow for a
"single base pair difference" threshold. This approach should work though.
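
As a rough sketch of that filtering step, assuming your parser gives you
the aligned query and subject strings for each hit: keep only hits that
cover the whole probe, contain no gaps, and differ at exactly one
position. The function name and arguments below are made up for
illustration and are not part of any tool's API.

```python
def single_mismatch_position(query_aln, subject_aln, probe_length):
    """Return the 0-based position of the only mismatch, or None.

    Hypothetical helper: query_aln and subject_aln are the two aligned
    strings of one hit, as reported by whatever parser you use.
    """
    # Reject alignments with gaps (an indel is not a single SNP).
    if "-" in query_aln or "-" in subject_aln:
        return None
    # Reject partial alignments that do not cover the whole probe.
    if len(query_aln) != probe_length or len(subject_aln) != probe_length:
        return None
    pairs = zip(query_aln.upper(), subject_aln.upper())
    diffs = [i for i, (q, s) in enumerate(pairs) if q != s]
    # Exactly one difference counts as a single-SNP close match.
    return diffs[0] if len(diffs) == 1 else None


print(single_mismatch_position("ACGTTGCA", "ACGTAGCA", 8))
```

An exact match returns None here as well, so you would test for identity
separately if you also want the plain yes/no answer.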

If you want to allow a single mismatch anywhere in the sequence,
I'm not sure regular expressions are ideal either. If you wanted to
look for matches with a single mismatch at a particular point
(i.e. a known SNP) then a regular expression would work fine.
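
To make the distinction concrete, here is a minimal sketch in plain
Python. The first function builds a regular expression for a SNP at one
known position. The second is a brute-force sliding comparison that
allows a mismatch anywhere; that is my own illustration of an
alternative, not something any of the tools above do. The function
names and the toy sequences are invented for the example.

```python
import re


def known_snp_pattern(probe, snp_index):
    """Regex matching the probe with any base at one known position."""
    left = re.escape(probe[:snp_index])
    right = re.escape(probe[snp_index + 1:])
    return re.compile(left + "[ACGT]" + right)


def find_near_matches(probe, genome, max_mismatches=1):
    """Slide the probe along the genome, keeping windows that differ
    at no more than max_mismatches positions.

    Returns a list of (start, mismatch_positions) tuples.
    """
    hits = []
    n = len(probe)
    for start in range(len(genome) - n + 1):
        window = genome[start:start + n]
        diffs = [i for i in range(n) if window[i] != probe[i]]
        if len(diffs) <= max_mismatches:
            hits.append((start, diffs))
    return hits


# Toy data, not real Toxoplasma sequence.
probe = "ACGTTGCA"
genome = "TTACGTAGCAGGACGTTGCA"

pattern = known_snp_pattern(probe, 4)
print([m.start() for m in pattern.finditer(genome)])
print(find_near_matches(probe, genome))
```

An empty mismatch list means an exact match. Note that this only looks
at one strand (you would repeat it with the reverse complement of the
probe), and a pure Python loop like this will be slow on a whole genome.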

However, you might have more success with software designed for
second generation sequencing - there are certainly similarities to
mapping short reads (e.g. Solexa/Illumina data) to a reference
genome. You might also be able to use software designed to
look for primer matches (again, these are short sequences).

Just some ideas...

Peter


