[Biojava-l] comparison of the pairwise aligner to emboss' needle
Andreas Prlic
andreas at sdsc.edu
Wed Apr 20 17:33:03 UTC 2011
Hi Wim,
are you sure you are using the correct sequences in your test? When I
run the code at the bottom of this emails I am getting 95 and 97%
sequence ID, which is similar to what you are expecting.
Andreas
Here my code: (using latest code from SVN)
package demo;
import org.biojava3.alignment.Alignments;
import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.SimpleGapPenalty;
import org.biojava3.alignment.SubstitutionMatrixHelper;
import org.biojava3.alignment.template.GapPenalty;
import org.biojava3.alignment.template.PairwiseSequenceAligner;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;
public class TestDNANeedlemanWunsch {
public static void main(String[] args){
String query =
"AGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTGAAGCCCAGCTTGCTGGGTGGATCA"
+
"GTGGCGAACGGGTGAGTAACACGTGAGCAACCTGCCCCTGACTCTGGGATAAGCGCTGGAAACGGTGTCT" +
"AATACTGGATATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCG" +
"GCCTATCAGCTTGTTGGTGAGGTAATGGCTCACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGACC" +
"GGCCACACTGGGACTGAGACACGGCCCAGACTCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGC" +
"GGAAGCCTGATGCAGCAACGCCGCGTGAGGGACGACGGCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAG" +
"AAGCGAGAGTGACGGTACCTGCAGAAAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAG" +
"GGCGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAA" +
"CCCGAGGCTCAACCTNNGGGCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCC" +
"TGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAAC" +
"TGACGCTGAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTT" +
"GGGAACTAGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCACGTAACGCATTAAGTTCCCCGCCTGGGG" +
"AGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTA" +
"AATCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCT" +
"TTGGACACTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCG" +
"CAACGAGCGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTC" +
"AACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACA" +
"ATGGCCGGTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATT" +
"GAGGTCTGCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATA" +
"CGTTCCCGGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCTAA" +
"CCCTTGTGGAGGGAGCCGGTAATTAAA";
String target =
"CTGGCCGCCTGCTTAACACATCCAAGTCGAACGGTGAAGCCCCANCTTACTGGGTGGATCAGTGCCGAAC"
+
"GGGTGAGTAACACGTGAGCAACCTCCCCCTGACTCTGGGATAAGCGCTGGAANCGGTGTCTAATACTGGA" +
"TATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCGCCCTATGAG" +
"CTTGTTGGTGAGGTAATGGCTCACCAAGCCGTCGACGGGTAGCCGGCCTGAGAGGGTGACCGNCCACACT" +
"GGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGGAAGCCT" +
"GATTCANCAACCCCGCGTGAGGGACGACGGCCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAGAAGCGAG" +
"AGTGACGGTACCTGCAGAAAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAA" +
"GCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAACCCGAGG" +
"CTCAACCTCGGGCCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTA" +
"GCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCT" +
"GAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTTGGGAACT" +
"AGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCAGCTAACGCATTAAGTTCCCCGCCTGGGGAGTACGG" +
"CCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTAATTCGAT" +
"GCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCTTTGGACA" +
"CTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAG" +
"CGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTCAACTCGG" +
"AGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACAATGGCCG" +
"GTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATTGAGGTCT" +
"GCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCC" +
"GGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCCAACCCTTGT" +
"GGAGGGAGCCGTCGAAGGTGGGATCGGTAATTAGGACTAAGTCGTAACAAGGTAGCCGTACC";
GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
Alignments.getPairwiseAligner(
new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
PairwiseSequenceAlignerType.GLOBAL,
penalty, SubstitutionMatrixHelper.getNuc4_4());
SequencePair<DNASequence, NucleotideCompound>
alignment = aligner.getPair();
System.out.println(alignment);
int identical = alignment.getNumIdenticals();
System.out.println("Number of identical residues: " + identical);
System.out.println("% identical query: " + identical / (float)
query.length() );
System.out.println("% identical query: " + identical / (float)
target.length() );
}
}
On Mon, Apr 18, 2011 at 8:22 AM, Wim De Smet <Wim.DeSmet at ugent.be> wrote:
> Hi all,
>
> I've been trying to generate some global alignments with biojava and
> comparing them with what needle returns. Doing this, I can't seem to
> reproduce needle's alignment with biojava. The score returned from biojava
> seems to be worse than that from needle, so I'm not sure what's happening
> here.
>
> The sequences are AB004720 and Y17238 (I didn't attach a fasta file to avoid
> spamming people, let me know if you want one). I align them with:
> GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
> PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
> Alignments.getPairwiseAligner(
> new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
> new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
> PairwiseSequenceAlignerType.GLOBAL,
> penalty, SubstitutionMatrixHelper.getNuc4_4());
> SequencePair<DNASequence, NucleotideCompound>
> alignment = aligner.getPair();
>
> This gives me an alignment with only 23% similarity and a gap at the end.
> Varying the gap penalties can give me a gap in front too, but that's about
> it. When aligning in needle, I get a sequence with a higher score (6784 vs
> (-)5862) and 94% similarity (which seems closer to home). Needle I just run
> with defaults (so it uses EDNAFULL) and a go/ge of 14/4.
>
> Could this be a bug or am I misunderstanding some of the options?
>
> BTW, if I use a really large gapextend, say -4000, I also get a nullpointer
> exception.
>
> TIA,
> Wim De Smet
>
> --
> Wim De Smet
> http://www.straininfo.net/
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
More information about the Biojava-l
mailing list