[Biojava-dev] Case sensitivity in Alignment

Thu Nov 21 19:18:14 UTC 2013

If that is the intended behavior, then it must be explicit, and dependent on the class of sequence. So there would need to be a MaskedDNASequence, or perhaps a MaskedNucleotideCompound, which had a different equals() method.
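To make the idea concrete, here is a minimal sketch of two compound types that differ only in equals(). The classes SimpleNucleotide and MaskedNucleotide are hypothetical stand-ins I made up for illustration; they are not the BioJava classes and do not reflect how NucleotideCompound is actually implemented.

```java
import java.util.Locale;

// Hypothetical stand-in for a plain compound (NOT the BioJava class):
// case carries no meaning, so "a" and "A" are the same compound.
class SimpleNucleotide {
    protected final String base;

    SimpleNucleotide(String base) {
        this.base = base;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (other == null || getClass() != other.getClass()) return false;
        return base.equalsIgnoreCase(((SimpleNucleotide) other).base);
    }

    // hashCode must agree with equals, so hash the upper-cased form.
    @Override
    public int hashCode() {
        return base.toUpperCase(Locale.ROOT).hashCode();
    }
}

// Hypothetical masked variant: case is significant, lower case means soft-masked.
class MaskedNucleotide extends SimpleNucleotide {
    MaskedNucleotide(String base) {
        super(base);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (other == null || getClass() != other.getClass()) return false;
        return base.equals(((MaskedNucleotide) other).base);
    }

    @Override
    public int hashCode() {
        return base.hashCode();
    }

    // A compound counts as masked when its symbol is lower case.
    boolean isMasked() {
        return !base.isEmpty() && Character.isLowerCase(base.charAt(0));
    }
}

class CompoundEqualityDemo {
    public static void main(String[] args) {
        System.out.println(new SimpleNucleotide("a").equals(new SimpleNucleotide("A")));
        System.out.println(new MaskedNucleotide("a").equals(new MaskedNucleotide("A")));
    }
}
```

Under these assumptions "a" equals "A" for the plain type but not for the masked type, and because the behavior is tied to the class, nobody has to guess which rule applies.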

A DNASequence<NucleotideCompound> should behave exactly the same way regardless of how it was created, and in particular regardless of the file format it was read from. How does the current code behave with a GenBank file? An EMBL file? A GCG file? There should be no question in a user's mind about how it will behave. Now, if a user is explicitly working with a mixed-case file and is aware of its significance, that user should have different options. So a DNASequence<MaskedNucleotideCompound> could be available. This path would also allow for programmatically masking a sequence and using the alignment tools in the same way.
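As a rough sketch of what format-independent behavior plus explicit masking might look like: the helper below is hypothetical (SequenceCase, normalize and softMask are names I invented, not BioJava API). It assumes plain sequences are upper-cased wherever they are built, and that masking is a deliberate step applied afterwards.

```java
import java.util.Locale;

// Hypothetical helper, not part of BioJava.
final class SequenceCase {
    private SequenceCase() {
    }

    // Upper-case the residues so a plain sequence looks the same
    // no matter which file format or constructor it came from.
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT);
    }

    // Soft-mask the half-open range [start, end), 0-based, by lower-casing it.
    static String softMask(String sequence, int start, int end) {
        if (start < 0 || end > sequence.length() || start > end) {
            throw new IllegalArgumentException("invalid mask range");
        }
        return sequence.substring(0, start)
                + sequence.substring(start, end).toLowerCase(Locale.ROOT)
                + sequence.substring(end);
    }
}

class SequenceCaseDemo {
    public static void main(String[] args) {
        String plain = SequenceCase.normalize("agggctttaccccggttaa");
        System.out.println(plain);
        System.out.println(SequenceCase.softMask(plain, 2, 5));
    }
}
```

The normalized string would feed the plain sequence type, while the output of softMask would only make sense with a case-sensitive masked type.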

On Nov 20, 2013, at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:

> Sorry, I may not be keeping up with you both here, but the code in
> question is in the alignment package, and if the substitution matrices
> are all upper case they won't match lower case soft masked sequence;
> wouldn't that be the intent?  (A feature not a bug)
> 
>  michael
> 
> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> The problem is that the substitution matrices are all upper case. We can
>> probably fix this by making the NucleotideCompound.equals method case
>> insensitive...
>> 
>> Does anybody see an issue with that?
>> 
>> A
>> 
>> 
>> On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>> 
>>> Hello Andreas, David
>>> 
>>> Lower case is the convention for soft-masking sequences from alignment
>>> 
>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>>> 
>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>>> 
>>> If we are using this convention, perhaps it should be more clearly
>>> documented.  What happens if you use mixed case?
>>> 
>>>  michael
>>> 
>>> 
>>> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi David,
>>>> 
>>>> not sure if we should consider this a bug or a feature: It should be easy
>>>> to work around this by calling toUpperCase() on your strings. We could of
>>>> course internally convert all nucleotides to upper case, but that would
>>>> remove the possibility for people to use mixed upper case and lower case
>>>> sequences to represent e.g. alignment conservation.
>>>> 
>>>> Any opinions by other people on this? Is anybody using mixed case
>>>> sequences?
>>>> 
>>>> Andreas
>>>> 
>>>> 
>>>> On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>>>> 
>>>>> 
>>>>> There seems to be a bug in the alignment package. If DNA sequences are
>>>>> created using lower case letters, the alignment methods don't work. It
>>>>> looks like the default substitution matrix is coded in upper case, and
>>>>> the underlying case of the DNA sequence is being used in the alignment.
>>>>> Seems like a bug to me.
>>>>> 
>>>>> This problem occurs when the DNA sequence is created either using the
>>>>> DNASequence constructor, or by reading from a FASTA file which is in
>>>>> lower case.
>>>>> 
>>>>> 
>>>>> The code below shows the problem.
>>>>> 
>>>>> 
>>>>>   static SimpleGapPenalty gapP;
>>>>>   static SubstitutionMatrix<NucleotideCompound> matrix;
>>>>> 
>>>>>   public static void main(String[] args) throws Exception {
>>>>>       // Default nucleotide substitution matrix (NUC 4.4)
>>>>>       matrix = SubstitutionMatrixHelper.getNuc4_4();
>>>>>       gapP = new SimpleGapPenalty();
>>>>>       gapP.setOpenPenalty((short) 5);
>>>>>       gapP.setExtensionPenalty((short) 2);
>>>>>       testHardcoded();
>>>>>   }
>>>>> 
>>>>>   public static void testHardcoded() throws Exception {
>>>>>       // seq3 and seq4 are lower case copies of seq1 and seq2
>>>>>       Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>>>>       Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>>>>       Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>>>>       Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>>>>       alignPair(seq1, seq2);  // upper case vs. upper case
>>>>>       alignPair(seq1, seq4);  // upper case vs. lower case
>>>>>       alignPair(seq3, seq4);  // lower case vs. lower case
>>>>>   }
>>>>> 
>>>>>   // Globally aligns the two sequences and prints the resulting pair.
>>>>>   public static void alignPair(Sequence<NucleotideCompound> seq1,
>>>>>           Sequence<NucleotideCompound> seq2) {
>>>>>       SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>>>>               Alignments.getPairwiseAlignment(seq1, seq2,
>>>>>                       Alignments.PairwiseSequenceAlignerType.GLOBAL, gapP, matrix);
>>>>> 
>>>>>       System.out.printf("%s", pair);
>>>>>       System.out.println();
>>>>>   }
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>> 