[Biojava-l] comparison of the pairwise aligner to emboss' needle

Wim De Smet Wim.DeSmet at UGent.be
Thu Apr 21 07:54:56 UTC 2011


Hi Andreas

Thanks for having a look. I found the issue: if your sequence is lower 
case, the alignment is different. Try aligning with query.toLowerCase() 
and target.toLowerCase(), then the output is:

Number of identical residues: 334
% identical query: 0.23405746
% identical query: 0.22845417

vs upper case

Number of identical residues: 1394
% identical query: 0.9768746
% identical query: 0.95348835

So I just inserted a toUpperCase() and it works now.

Regards
Wim

On 20-04-11 19:33, Andreas Prlic wrote:
> Hi Wim,
>
> are you sure you are using the correct sequences in your test? When I
> run the code at the bottom of this emails I am getting 95 and 97%
> sequence ID, which is similar to what you are expecting.
>
> Andreas
>
> Here my code: (using latest code from SVN)
>
> package demo;
>
> import org.biojava3.alignment.Alignments;
> import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
> import org.biojava3.alignment.SimpleGapPenalty;
> import org.biojava3.alignment.SubstitutionMatrixHelper;
> import org.biojava3.alignment.template.GapPenalty;
> import org.biojava3.alignment.template.PairwiseSequenceAligner;
> import org.biojava3.alignment.template.SequencePair;
> import org.biojava3.core.sequence.DNASequence;
> import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
> import org.biojava3.core.sequence.compound.NucleotideCompound;
>
> public class TestDNANeedlemanWunsch {
> 	public static void main(String[] args){
>
> 		String query =
> "AGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTGAAGCCCAGCTTGCTGGGTGGATCA"
> +
> 		"GTGGCGAACGGGTGAGTAACACGTGAGCAACCTGCCCCTGACTCTGGGATAAGCGCTGGAAACGGTGTCT" +
> 		"AATACTGGATATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCG" +
> 		"GCCTATCAGCTTGTTGGTGAGGTAATGGCTCACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGACC" +
> 		"GGCCACACTGGGACTGAGACACGGCCCAGACTCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGC" +
> 		"GGAAGCCTGATGCAGCAACGCCGCGTGAGGGACGACGGCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAG" +
> 		"AAGCGAGAGTGACGGTACCTGCAGAAAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAG" +
> 		"GGCGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAA" +
> 		"CCCGAGGCTCAACCTNNGGGCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCC" +
> 		"TGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAAC" +
> 		"TGACGCTGAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTT" +
> 		"GGGAACTAGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCACGTAACGCATTAAGTTCCCCGCCTGGGG" +
> 		"AGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTA" +
> 		"AATCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCT" +
> 		"TTGGACACTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCG" +
> 		"CAACGAGCGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTC" +
> 		"AACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACA" +
> 		"ATGGCCGGTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATT" +
> 		"GAGGTCTGCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATA" +
> 		"CGTTCCCGGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCTAA" +
> 		"CCCTTGTGGAGGGAGCCGGTAATTAAA";
>
> 		String target =
> "CTGGCCGCCTGCTTAACACATCCAAGTCGAACGGTGAAGCCCCANCTTACTGGGTGGATCAGTGCCGAAC"
> +
> 		"GGGTGAGTAACACGTGAGCAACCTCCCCCTGACTCTGGGATAAGCGCTGGAANCGGTGTCTAATACTGGA" +
> 		"TATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCGCCCTATGAG" +
> 		"CTTGTTGGTGAGGTAATGGCTCACCAAGCCGTCGACGGGTAGCCGGCCTGAGAGGGTGACCGNCCACACT" +
> 		"GGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGGAAGCCT" +
> 		"GATTCANCAACCCCGCGTGAGGGACGACGGCCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAGAAGCGAG" +
> 		"AGTGACGGTACCTGCAGAAAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAA" +
> 		"GCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAACCCGAGG" +
> 		"CTCAACCTCGGGCCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTA" +
> 		"GCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCT" +
> 		"GAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTTGGGAACT" +
> 		"AGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCAGCTAACGCATTAAGTTCCCCGCCTGGGGAGTACGG" +
> 		"CCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTAATTCGAT" +
> 		"GCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCTTTGGACA" +
> 		"CTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAG" +
> 		"CGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTCAACTCGG" +
> 		"AGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACAATGGCCG" +
> 		"GTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATTGAGGTCT" +
> 		"GCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCC" +
> 		"GGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCCAACCCTTGT" +
> 		"GGAGGGAGCCGTCGAAGGTGGGATCGGTAATTAGGACTAAGTCGTAACAAGGTAGCCGTACC";
>
> 		GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
> 		PairwiseSequenceAligner<DNASequence, NucleotideCompound>  aligner =
> Alignments.getPairwiseAligner(
> 				new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
> 				new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
> 				PairwiseSequenceAlignerType.GLOBAL,
> 				penalty, SubstitutionMatrixHelper.getNuc4_4());
> 		SequencePair<DNASequence, NucleotideCompound>
> 		alignment = aligner.getPair();
> 		
> 		System.out.println(alignment);
> 		
> 		int identical = alignment.getNumIdenticals();
> 		System.out.println("Number of identical residues: " + identical);
> 		System.out.println("% identical query: " + identical / (float)
> query.length() );
> 		System.out.println("% identical query: " + identical / (float)
> target.length() );
> 	}
> }
>
>
>
>
>
>
> On Mon, Apr 18, 2011 at 8:22 AM, Wim De Smet<Wim.DeSmet at ugent.be>  wrote:
>> Hi all,
>>
>> I've been trying to generate some global alignments with biojava and
>> comparing them with what needle returns. Doing this, I can't seem to
>> reproduce needle's alignment with biojava. The score returned from biojava
>> seems to be worse than that from needle, so I'm not sure what's happening
>> here.
>>
>> The sequences are AB004720 and Y17238 (I didn't attach a fasta file to avoid
>> spamming people, let me know if you want one). I align them with:
>> GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
>> PairwiseSequenceAligner<DNASequence, NucleotideCompound>  aligner =
>> Alignments.getPairwiseAligner(
>> new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
>> new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
>> PairwiseSequenceAlignerType.GLOBAL,
>> penalty, SubstitutionMatrixHelper.getNuc4_4());
>> SequencePair<DNASequence, NucleotideCompound>
>> alignment = aligner.getPair();
>>
>> This gives me an alignment with only 23% similarity and a gap at the end.
>> Varying the gap penalties can give me a gap in front too, but that's about
>> it. When aligning in needle, I get a sequence with a higher score (6784 vs
>> (-)5862) and 94% similarity (which seems closer to home). Needle I just run
>> with defaults (so it uses EDNAFULL) and a go/ge of 14/4.
>>
>> Could this be a bug or am I misunderstanding some of the options?
>>
>> BTW, if I use a really large gapextend, say -4000, I also get a nullpointer
>> exception.
>>
>> TIA,
>> Wim De Smet
>>
>> --
>> Wim De Smet
>> http://www.straininfo.net/
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>


-- 
Wim De Smet
http://www.straininfo.net/



More information about the Biojava-l mailing list