[EMBOSS] run parameters for inexact matching with EMBOSS needle
David Mathog
mathog at caltech.edu
Mon Feb 4 17:57:06 UTC 2019
On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:
> - I want to search with FASTA formatted DNA sequence queries that
> are <=
> 100nt in length,
> - against FASTA formatted genomic DNA sequence,
> - for global end-to-end matches (using Needleman Wunsch type global
> search, not Smith Waterman type local matching)
> - but allowing >= 80% identity and >= 80% of query length coverage
Not sure about needle, except that I expect it would be very slow.
However, I made a modified version of Lastal that added command line
parameters for identity/query/target cutoff for almost exactly the same
application. It is available here in the "modified programs" directory.
https://sourceforge.net/projects/lemonade-assemble/
There is a Centos 6 64 bit binary there, for any other platform you will
need to recompile. Lastal does not have NW exactly, but it does have a
-T parameter which does something which in many cases is similar. -T 0
is a local alignment, and -T 1 "extends to the end" (starting from the
local alignment). My use case involved Illumina reads, and what would
happen is if there was a mismatch at 3 or 4 bp from the end of that read
in the alignment against the target (in this case mRNA transcripts) -T 0
would truncate the alignment at the position before the mismatch, so the
read would not be seen as a full length match. With -T 1 it would pick
up the extra matching bases. Anyway, let's say you have
reads.fasta # bunch of 100bp reads
genome.fasta # self explanatory
CPUS=40
lastdb -P $CPUS -w 10 genome_10 genome.fasta
lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
-f BlastTab genome_10 reads.fasta > results.txt 2>errors.log
The lastdb command makes a database but uses only every 10th position.
This will make the database smaller and the search much faster. These
positions are only used as seeds, so if the expected match is very good
-w10 gives the same results as -w1 (default). If the match is expected
to be poor use -w 1 and expect much longer run times. Set CPUS to match
your system.
The lastal command looks for 80% identity (-I80), 80% coverage on query
(-Y80), allows up to 250 overlapping matches (-m250, see below), allows
offset overlap alignment as
long as the overlap is at least 50bp (-O50), requires overlap alignments
to go right
to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format
output file (except when it is on the other strand the query positions
are flipped rather than the target). Offset overlap is like this:
-------------------------
||||||||||||||| <- -O50 == this must be >= 50bp
------------------------------
This only concerns alignments which overlap the ends. The -m parameter
is a standard lastal command line option:
-m: maximum initial matches per query position (10)
What it means is that if there are a lot of alignments in exactly the
same place by default one only retrieves a few of them. Increasing the
value to 250 means many more will be retrieved. Conceivably if there is
super high coverage you might want to use an even larger number. Best
to first try a small subset to determine the appropriate -m value for
your case.
I don't know if it would be faster if the query/database were the other
way around.
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
More information about the EMBOSS
mailing list