[Biojava-dev] Case sensitivity in Alignment

Spencer Bliven sbliven at ucsd.edu
Wed Nov 20 23:02:01 UTC 2013


I neglected to mention the CaseFreeAminoAcidCompoundSet in the aa-prop
module's xml package. I have no idea why it's there.

-Spencer
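
For readers skimming the thread, the "uppercase plus boolean array" idea
described in the quoted message below can be sketched in plain Java. This is
only an illustration under stated assumptions: the class name CaseMask and its
members are hypothetical, it does not use BioJava at all, and it is not the
code of CasePreservingProteinSequenceCreator. It also assumes plain ASCII
sequence letters, so that upper-casing never changes the string length.

```java
import java.util.Locale;

/** Hypothetical helper (not a BioJava class): upper-cases a sequence and remembers the original case. */
final class CaseMask {
    final String upper;        // sequence normalized to upper case
    final boolean[] wasLower;  // true where the original character was lower case

    CaseMask(String original) {
        // Assumes ASCII letters, so the upper-cased string has the same length.
        upper = original.toUpperCase(Locale.ROOT);
        wasLower = new boolean[original.length()];
        for (int i = 0; i < original.length(); i++) {
            wasLower[i] = Character.isLowerCase(original.charAt(i));
        }
    }

    /** Rebuilds the original mixed-case string from the upper-case form and the mask. */
    String restore() {
        StringBuilder sb = new StringBuilder(upper.length());
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            sb.append(wasLower[i] ? Character.toLowerCase(c) : c);
        }
        return sb.toString();
    }
}
```

The upper-case string is what would be handed to code that expects a standard
compound set, while the mask travels alongside it so the soft-masking
information is not lost.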


On Wed, Nov 20, 2013 at 2:54 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:

> The issue of case has come up before, and to my knowledge it hasn't been
> handled particularly consistently. There's a CaseInsensitiveCompound in
> core which is not used by any other class, and which is pretty much useless
> since it doesn't derive from NucleotideCompound but merely wraps it.
> There's also a CasePreservingProteinSequenceCreator, which was my solution
> to maintain the case information while still working with a standard
> AminoAcid CompoundSet. It's an ugly solution: I just turn everything to
> uppercase while storing the case as a boolean array in the sequence's
> UserCollection. That could easily be adapted to nucleic acid, but I'd
> welcome a cleaner solution if anyone has one.
>
>
> On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
>
>> Sorry, I may not be keeping up with you both here, but the code in
>> question is in the alignment package, and if the substitution matrices
>> are all upper case they won't match lower case soft masked sequence;
>> wouldn't that be the intent?  (A feature not a bug)
>>
>>    michael
>>
>> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> > The problem is that the substitution matrices are all upper case. We can
>> > probably fix this by making the NucleotideCompound.equals method case
>> > insensitive...
>> >
>> > Does anybody see an issue with that?
>> >
>> > A
>> >
>> >
>> > On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>> >>
>> >> Hello Andreas, David
>> >>
>> >> Lower case is the convention for soft-masking sequences from alignment
>> >>
>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>> >>
>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>> >>
>> >> If we are using this convention, perhaps it should be more clearly
>> >> documented.  What happens if you use mixed case?
>> >>
>> >>    michael
>> >>
>> >>
>> >> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> >> > Hi David,
>> >> >
>> >> > not sure if we should consider this a bug or a feature: It should be
>> >> > easy to work around this by calling toUpperCase on your strings. We
>> >> > could of course internally convert all nucleotides to upper case, but
>> >> > that would remove the possibility for people to use mixed upper case
>> >> > and lower case sequences to represent e.g. alignment conservation.
>> >> >
>> >> > Any opinions by other people on this? Is anybody using mixed case
>> >> > sequences?
>> >> >
>> >> > Andreas
>> >> >
>> >> >
>> >> > On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>> >> >
>> >> >>
>> >> >> There seems to be a bug in the alignment package. If DNA sequences
>> >> >> are created using lower case letters, the alignment methods don't
>> >> >> work. It looks like the default substitution matrix is coded in upper
>> >> >> case, and the underlying case of the DNA sequence is being used in
>> >> >> the alignment. Seems like a bug to me.
>> >> >>
>> >> >> This problem occurs when the DNA sequence is created either using the
>> >> >> DNASequence constructor or by reading from a FASTA file which is in
>> >> >> lower case.
>> >> >>
>> >> >>
>> >> >> The code below shows the problem.
>> >> >>
>> >> >>
>> >> >>     static SimpleGapPenalty gapP;
>> >> >>     static SubstitutionMatrix<NucleotideCompound> matrix;
>> >> >>
>> >> >>     public static void main(String[] args) throws Exception {
>> >> >>         matrix = SubstitutionMatrixHelper.getNuc4_4();
>> >> >>         gapP = new SimpleGapPenalty();
>> >> >>         gapP.setOpenPenalty((short) 5);
>> >> >>         gapP.setExtensionPenalty((short) 2);
>> >> >>         testHardcoded();
>> >> >>     }
>> >> >>
>> >> >>     public static void testHardcoded() throws Exception {
>> >> >>         Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>> >> >>         Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>> >> >>         Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>> >> >>         Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>> >> >>         alignPair(seq1, seq2);
>> >> >>         alignPair(seq1, seq4);
>> >> >>         alignPair(seq3, seq4);
>> >> >>     }
>> >> >>
>> >> >>     public static void alignPair(Sequence<NucleotideCompound> seq1,
>> >> >>             Sequence<NucleotideCompound> seq2) {
>> >> >>         SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>> >> >>                 Alignments.getPairwiseAlignment(seq1, seq2,
>> >> >>                         Alignments.PairwiseSequenceAlignerType.GLOBAL,
>> >> >>                         gapP, matrix);
>> >> >>
>> >> >>         System.out.printf("%s", pair);
>> >> >>         System.out.println();
>> >> >>     }
>> >> >>
>> >> >>
>> >> >>
>> >> >> _______________________________________________
>> >> >> biojava-dev mailing list
>> >> >> biojava-dev at lists.open-bio.org
>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>> >> >>
>> >
>> >
>> >
>>
>
>
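
To make the reported behaviour in the thread above concrete, here is a toy
sketch of why a lookup table keyed only on upper-case letters misses
lower-case input, and how normalizing with String.toUpperCase (the workaround
suggested above) avoids the miss. Everything here is an assumption for
illustration: ToyMatrix, lookup, and normalize are hypothetical names, the
scores are arbitrary toy values, and this is not how BioJava's
SubstitutionMatrix is implemented.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy score table keyed on upper-case nucleotide pairs (hypothetical, not BioJava). */
final class ToyMatrix {
    private final Map<String, Integer> scores = new HashMap<>();

    ToyMatrix() {
        String alphabet = "ACGT"; // upper case only, like the matrices discussed in the thread
        for (char a : alphabet.toCharArray()) {
            for (char b : alphabet.toCharArray()) {
                scores.put("" + a + b, a == b ? 5 : -4); // arbitrary toy match/mismatch scores
            }
        }
    }

    /** Returns the score for the pair, or null if the pair is not in the table. */
    Integer lookup(char a, char b) {
        return scores.get("" + a + b);
    }

    /** Workaround from the thread: upper-case the raw string before building a sequence. */
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT);
    }
}
```

With this toy table, lookup('a', 'a') finds nothing while lookup('A', 'A')
succeeds, so a caller that wants case-insensitive scoring would pass
normalize(raw) to the sequence constructor instead of the raw string. Note
that this discards soft-masking information, which is the trade-off debated in
the thread.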


