[Biopython] matching sequences from fasta files

Leighton Pritchard lpritc at scri.ac.uk
Thu Mar 11 09:00:37 UTC 2010


On 11/03/2010 Thursday, March 11, 00:47, "Vincent Davis"
<vincent at vincentdavis.net> wrote:

> So I had an idea and wanted to get some feedback.
> I could make all possible single position mismatches for the sequences. I
> have 230,000 now and that would give me 17,250,000 (3 * 25 * 230,000). Then
> use BLAST to look for perfect matches.

That doesn't sound very elegant (or like a good solution) to me, but if you
wanted to do that you wouldn't necessarily need Python, except perhaps to
generate all possible mismatches.  You can restrict BLAST output to the best
match, and you can force match identities to 100% with the option (in BLAST+)

-word_size 25

which restricts BLAST to finding seed words of the same length (25) as your
oligos, so any hit has to be an exact match along the whole oligo.  This
would also speed up BLAST.  You might also consider exploring
other output formats, so you could process tabular output from the command
line, for instance.
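
If you did want to generate the mismatches in Python, a minimal sketch might
look like the following.  It assumes the oligos are unambiguous DNA (A, C, G,
T only); the function name and the example 25-mer are made up for
illustration and are not part of Biopython.

```python
from typing import Iterator

ALPHABET = "ACGT"  # assumption: unambiguous DNA only


def single_mismatches(seq: str) -> Iterator[str]:
    """Yield every sequence that differs from seq at exactly one position."""
    seq = seq.upper()
    for i, original in enumerate(seq):
        for base in ALPHABET:
            if base != original:
                # Swap in one alternative base at position i
                yield seq[:i] + base + seq[i + 1:]


oligo = "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer
variants = list(single_mismatches(oligo))
print(len(variants))  # 3 alternatives * 25 positions = 75
```

Because this is a generator, the 17,250,000 variants could be written straight
to a FASTA file one at a time rather than being held in memory.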

However, given the size of your data set, and the sizes of your sequences
(neither of which was stated in the OP), I'd be inclined to bypass this
altogether, and instead use one of the short-read sequence alignment
packages such as SOAP or PASS, to see if one of them can be applied to your
problem.
Michiel's suggestion of NEXALIGN might be a good one - I've never used it,
so can't say much about it.

> I would probably do this
> incrementally maybe even just blast for each sequence. The advantage I see
> in this is that BLAST can run multi core and I am running it on an 8-core
> machine with 48 GB of memory. So it seems that this would be the fastest
> way to do this and very straightforward as there is very little parsing.

If you BLASTed each of 17m sequences individually, you would have to parse
17m output files.  That sounds like a *lot* of parsing and file IO to me. ;)
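
For comparison, a single BLAST+ run over one multi-FASTA query file with
tabular output (-outfmt 6) gives one tab-separated file that can be scanned
line by line.  The sketch below assumes the default twelve-column layout
(qseqid, sseqid, pident, length, ...); the function name and the two example
rows are invented for illustration.

```python
import csv
import io


def perfect_hits(handle, oligo_length=25):
    """Return the set of query IDs with a full-length, 100% identity hit."""
    matched = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        # Default -outfmt 6 columns: 0 = qseqid, 2 = pident, 3 = length
        qseqid, pident, length = row[0], float(row[2]), int(row[3])
        if pident == 100.0 and length == oligo_length:
            matched.add(qseqid)
    return matched


# Two made-up rows standing in for a real BLAST output file
example = io.StringIO(
    "oligo1_v3\tchr1\t100.000\t25\t0\t0\t1\t25\t1001\t1025\t2e-08\t46.4\n"
    "oligo2_v7\tchr2\t95.833\t24\t1\t0\t1\t24\t500\t523\t0.001\t35.6\n"
)
print(perfect_hits(example))  # {'oligo1_v3'}
```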

> There is
> either a match or not. I am purely guessing that generating the list is
> faster than parsing the results.

You could try timing it with 10, 100 and 1000 sequences and see if you
notice a trend.  With your sequence set, I wouldn't bother - I'd jump
straight to the next-gen sequence aligners.
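
If you did want to time it, something along these lines would do as a rough
sketch.  It only times the mismatch generation step on random 25-mers (the
BLAST and parsing steps could be wrapped in the same way); the helper names
are hypothetical, and random sequences stand in for your real oligos.

```python
import random
import time


def single_mismatches(seq, alphabet="ACGT"):
    """Return all sequences differing from seq at exactly one position."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i, original in enumerate(seq)
        for base in alphabet
        if base != original
    ]


def time_generation(n_seqs, length=25, seed=0):
    """Time mismatch generation for n_seqs random oligos.

    Returns (number of variants generated, elapsed seconds).
    """
    rng = random.Random(seed)
    seqs = ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(n_seqs)]
    start = time.perf_counter()
    total = sum(len(single_mismatches(s)) for s in seqs)
    return total, time.perf_counter() - start


for n in (10, 100, 1000):
    total, elapsed = time_generation(n)
    print(f"{n:>5} sequences -> {total} variants in {elapsed:.4f} s")
```

If the elapsed time grows roughly tenfold at each step, the cost is linear
and you can extrapolate to the full 230,000 sequences.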

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

