[Biojava-dev] Case sensitivity in Alignment

Waring, David A dwaring at fhcrc.org
Thu Nov 21 19:19:01 UTC 2013


The fact that there is a class CaseFreeAminoAcidCompoundSet is a sign of the very problem with the design. 

There is no such thing as an uppercase amino acid or a lowercase nucleotide. Representing nucleotides with ASCII characters is a convention, and in most cases a guanine is written as either 'g' or 'G'. Regardless of how it is represented in a file, the object must represent a guanine, not a G or a g.
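
To make the point concrete, here is a minimal sketch in plain Java. It is not BioJava API: the Nucleotide enum and its fromChar method are hypothetical names chosen only for illustration. The idea is that the one-letter code is parsed once, case is discarded at that boundary, and everything downstream compares compounds rather than characters.

```java
/** Hypothetical sketch, not BioJava API: a compound that has no notion of case. */
enum Nucleotide {
    ADENINE('A'), CYTOSINE('C'), GUANINE('G'), THYMINE('T');

    private final char symbol;

    Nucleotide(char symbol) {
        this.symbol = symbol;
    }

    /** Canonical (upper-case) one-letter code, used only for display. */
    char symbol() {
        return symbol;
    }

    /** Maps either case of a one-letter code to the same constant. */
    static Nucleotide fromChar(char c) {
        char upper = Character.toUpperCase(c);
        for (Nucleotide n : values()) {
            if (n.symbol == upper) {
                return n;
            }
        }
        throw new IllegalArgumentException("Unknown nucleotide symbol: " + c);
    }
}

final class CaseFreeNucleotideSketch {
    public static void main(String[] args) {
        // 'g' and 'G' resolve to the very same object.
        System.out.println(Nucleotide.fromChar('g') == Nucleotide.fromChar('G'));
    }
}
```

With this shape, a substitution matrix keyed on the compound cannot miss a lookup because of the case used in the input file.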

BioJava 1 was quite explicit in its understanding of this basic point. As best I can tell, BioJava 3 seems to miss it. I have just begun to try out BioJava 3, and this makes me wonder what other issues I will run into.


On Nov 20, 2013, at 3:02 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:

> I neglected the CaseFreeAminoAcidCompoundSet from the aa-prop module's xml
> package. I have no idea why it's there.
> 
> -Spencer
> 
> 
> On Wed, Nov 20, 2013 at 2:54 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
> 
>> The issue of case has come up before, and to my knowledge it hasn't been
>> handled particularly consistently. There's a CaseInsensitiveCompound in
>> core which is not used by any other class, and which is pretty much useless
>> since it doesn't derive from NucleotideCompound but merely wraps it.
>> There's also a CasePreservingProteinSequenceCreator, which was my solution
>> to maintain the case information while still working with a standard
>> AminoAcid CompoundSet. It's an ugly solution: I just turn everything to
>> uppercase while storing the case as a boolean array in the sequence's
>> UserCollection. That could easily be adapted to nucleic acid, but I'd
>> welcome a cleaner solution if anyone has one.
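
The approach described above (upper-case everything, keep the case in a boolean array) can be sketched in plain Java as follows. This is an illustration only, not the actual CasePreservingProteinSequenceCreator code: the class and method names are hypothetical, and the choice that true marks a lower-case position is an assumption.

```java
import java.util.Locale;

/** Hypothetical sketch of "upper-case the residues, remember the case separately". */
final class CaseMaskSketch {

    /** Returns a mask where mask[i] is true if raw.charAt(i) is lower case (assumed convention). */
    static boolean[] lowerCaseMask(String raw) {
        boolean[] mask = new boolean[raw.length()];
        for (int i = 0; i < raw.length(); i++) {
            mask[i] = Character.isLowerCase(raw.charAt(i));
        }
        return mask;
    }

    /** Re-applies the stored case to an upper-case sequence string. */
    static String restoreCase(String upper, boolean[] mask) {
        if (upper.length() != mask.length) {
            throw new IllegalArgumentException("Sequence and mask lengths differ");
        }
        StringBuilder sb = new StringBuilder(upper.length());
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            sb.append(mask[i] ? Character.toLowerCase(c) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "ACGTacgtNN";
        boolean[] mask = lowerCaseMask(raw);          // keep this alongside the sequence
        String upper = raw.toUpperCase(Locale.ROOT);  // this is what the compound set sees
        System.out.println(restoreCase(upper, mask).equals(raw));
    }
}
```

Nothing here is specific to proteins, which is why the same idea would carry over to nucleic acid sequences.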
>> 
>> 
>> On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
>> 
>>> Sorry, I may not be keeping up with you both here, but the code in
>>> question is in the alignment package, and if the substitution matrices
>>> are all upper case they won't match lower case soft masked sequence;
>>> wouldn't that be the intent?  (A feature not a bug)
>>> 
>>>   michael
>>> 
>>> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> The problem is that the substitution matrices are all upper case. We can
>>>> probably fix this by making the NucleotideCompound.equals method case
>>>> insensitive...
>>>> 
>>>> Does anybody see an issue with that?
>>>> 
>>>> A
>>>> 
>>>> 
>>>> On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>>>> 
>>>>> Hello Andreas, David
>>>>> 
>>>>> Lower case is the convention for soft-masking sequences from alignment:
>>>>> 
>>>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>>>>> 
>>>>> 
>>>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>>>>> 
>>>>> If we are using this convention, perhaps it should be more clearly
>>>>> documented.  What happens if you use mixed case?
>>>>> 
>>>>>   michael
>>>>> 
>>>>> 
>>>>> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi David,
>>>>>> 
>>>>>> Not sure if we should consider this a bug or a feature. It should be easy
>>>>>> to work around this by calling toUpperCase() on your strings. We could, of
>>>>>> course, internally convert all nucleotides to upper case, but that would
>>>>>> remove the possibility for people to use mixed upper-case and lower-case
>>>>>> sequences to represent, e.g., alignment conservation.
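
A minimal sketch of that workaround in plain Java is shown below. The helper name normalize is hypothetical, and passing Locale.ROOT is an addition not mentioned in the thread: it keeps the mapping independent of the JVM's default locale (under a Turkish locale, for example, 'i' upper-cases to a dotted capital I, which would matter for amino acid codes). The normalized string would then be handed to the DNASequence constructor in place of the raw one.

```java
import java.util.Locale;

/** Hypothetical helper: normalize sequence text before building sequence objects. */
final class UpperCaseWorkaroundSketch {

    /** Upper-cases sequence text with a locale-independent mapping. */
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints AGGGCTTTACCCCGGTTAA
        System.out.println(normalize("agggctttaccccggttaa"));
    }
}
```

Note that this throws away any soft-masking or conservation information carried by the original case.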
>>>>>> 
>>>>>> Any opinions by other people on this? Is anybody using mixed case
>>>>>> sequences?
>>>>>> 
>>>>>> Andreas
>>>>>> 
>>>>>> 
>>>>>> On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> There seems to be a bug in the alignment package. If DNA sequences are
>>>>>>> created using lower case letters, the alignment methods don't work. It
>>>>>>> looks like the default substitution matrix is coded in upper case, and the
>>>>>>> underlying case of the DNA sequence is being used in the alignment. Seems
>>>>>>> like a bug to me.
>>>>>>> 
>>>>>>> This problem occurs when the DNA sequence is created either using the
>>>>>>> DNASequence constructor, or by reading from a FASTA file that is in lower
>>>>>>> case.
>>>>>>> 
>>>>>>> 
>>>>>>> The code below shows the problem.
>>>>>>> 
>>>>>>> 
>>>>>>>    static SimpleGapPenalty gapP;
>>>>>>>    static SubstitutionMatrix<NucleotideCompound> matrix;
>>>>>>> 
>>>>>>>    public static void main(String[] args) throws Exception {
>>>>>>>        matrix = SubstitutionMatrixHelper.getNuc4_4();
>>>>>>>        gapP = new SimpleGapPenalty();
>>>>>>>        gapP.setOpenPenalty((short) 5);
>>>>>>>        gapP.setExtensionPenalty((short) 2);
>>>>>>>        testHardcoded();
>>>>>>>    }
>>>>>>> 
>>>>>>>    public static void testHardcoded() throws Exception {
>>>>>>>        Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>>>>>>        Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>>>>>>        Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>>>>>>        Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>>>>>>        alignPair(seq1, seq2); // upper case vs. upper case
>>>>>>>        alignPair(seq1, seq4); // upper case vs. lower case
>>>>>>>        alignPair(seq3, seq4); // lower case vs. lower case
>>>>>>>    }
>>>>>>> 
>>>>>>>    public static void alignPair(Sequence<NucleotideCompound> seq1,
>>>>>>>                                 Sequence<NucleotideCompound> seq2) {
>>>>>>>        SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>>>>>>                Alignments.getPairwiseAlignment(seq1, seq2,
>>>>>>>                        Alignments.PairwiseSequenceAlignerType.GLOBAL,
>>>>>>>                        gapP, matrix);
>>>>>>> 
>>>>>>>        System.out.printf("%s", pair);
>>>>>>>        System.out.println();
>>>>>>>    }
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> biojava-dev mailing list
>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>> 




