[Biojava-dev] Case sensitivity in Alignment

Spencer Bliven sbliven at ucsd.edu
Wed Nov 20 22:54:08 UTC 2013


The issue of case has come up before, and to my knowledge it hasn't been
handled particularly consistently. There's a CaseInsensitiveCompound in
core which is not used by any other class, and which is pretty much useless
since it doesn't derive from NucleotideCompound but merely wraps it.
There's also a CasePreservingProteinSequenceCreator, which was my solution
to maintain the case information while still working with a standard
AminoAcid CompoundSet. It's an ugly solution: I just turn everything to
uppercase while storing the case as a boolean array in the sequence's
UserCollection. That could easily be adapted to nucleic acid, but I'd
welcome a cleaner solution if anyone has one.
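
To illustrate the idea, here is a minimal, library-independent sketch of
the "upper-case everything, remember the case separately" approach. The
class and method names (CaseMask, lowerCaseMask, normalize, restoreCase)
are hypothetical and are not part of BioJava; the actual creator stores the
flags in the sequence's user collection rather than returning them.

```java
import java.util.Locale;

/** Hypothetical helper (not a BioJava class): keeps case apart from residue identity. */
final class CaseMask {

    /** Returns true at every position where the input residue was lower case. */
    static boolean[] lowerCaseMask(String sequence) {
        boolean[] mask = new boolean[sequence.length()];
        for (int i = 0; i < sequence.length(); i++) {
            mask[i] = Character.isLowerCase(sequence.charAt(i));
        }
        return mask;
    }

    /** Upper-case form, suitable for lookup in an upper-case compound set or matrix. */
    static String normalize(String sequence) {
        return sequence.toUpperCase(Locale.ROOT);
    }

    /** Re-applies the stored case flags to a normalized sequence. */
    static String restoreCase(String normalized, boolean[] mask) {
        if (normalized.length() != mask.length) {
            throw new IllegalArgumentException("sequence and mask lengths differ");
        }
        StringBuilder restored = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            restored.append(mask[i] ? Character.toLowerCase(c) : c);
        }
        return restored.toString();
    }

    public static void main(String[] args) {
        String raw = "agggCTTTaccc";            // mixed case, e.g. soft-masked ends
        boolean[] mask = lowerCaseMask(raw);
        String upper = normalize(raw);
        System.out.println(upper);                    // AGGGCTTTACCC
        System.out.println(restoreCase(upper, mask)); // agggCTTTaccc
    }
}
```

The aligner would only ever see the normalized string, so an upper-case
substitution matrix keeps working, while the mask lets the original
soft-masking be written back out afterwards.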


On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:

> Sorry, I may not be keeping up with you both here, but the code in
> question is in the alignment package, and if the substitution matrices
> are all upper case they won't match lower case soft masked sequence;
> wouldn't that be the intent?  (A feature not a bug)
>
>    michael
>
> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> > The problem is that the substitution matrices are all upper case. We can
> > probably fix this by making the NucleotideCompound.equals method case
> > insensitive...
> >
> > Does anybody see an issue with that?
> >
> > A
> >
> >
> > On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
> >>
> >> Hello Andreas, David
> >>
> >> Lower case is the convention for soft-masking sequences from alignment
> >>
> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/
> >>
> >>
> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
> >>
> >> If we are using this convention, perhaps it should be more clearly
> >> documented.  What happens if you use mixed case?
> >>
> >>    michael
> >>
> >>
> >> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >> > Hi David,
> >> >
> >> > not sure if we should consider this a bug or a feature: It should be easy
> >> > to work around this by calling toUpperCase on your strings. We could of
> >> > course internally convert all nucleotides to upper case, but that would
> >> > remove the possibility for people to use mixed upper case and lower case
> >> > sequences to represent e.g. alignment conservation.
> >> >
> >> > Any opinions by other people on this? Is anybody using mixed case
> >> > sequences?
> >> >
> >> > Andreas
> >> >
> >> >
> >> > On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
> >> >
> >> >>
> >> >> There seems to be a bug in the alignment package. If DNA sequences are
> >> >> created using lower case letters, the alignment methods don't work. It
> >> >> looks like the default substitution matrix is coded in upper case, and
> >> >> the underlying case of the DNA sequence is being used in the alignment.
> >> >> Seems like a bug to me.
> >> >>
> >> >> This problem occurs when the DNA sequence is created either using the
> >> >> DNASequence constructor, or by reading from a FASTA file which is in
> >> >> lower case.
> >> >>
> >> >>
> >> >> The code below shows the problem.
> >> >>
> >> >>
> >> >>     static SimpleGapPenalty gapP;
> >> >>     static SubstitutionMatrix<NucleotideCompound> matrix;
> >> >>
> >> >>     public static void main(String[] args)throws Exception{
> >> >>        matrix = SubstitutionMatrixHelper.getNuc4_4();
> >> >>         gapP = new SimpleGapPenalty();
> >> >>         gapP.setOpenPenalty((short)5);
> >> >>         gapP.setExtensionPenalty((short)2);
> >> >>         testHardcoded();
> >> >>     }
> >> >>
> >> >>     public static void testHardcoded()throws Exception{
> >> >>         Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
> >> >>         Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
> >> >>         Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
> >> >>         Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
> >> >>         alignPair(seq1,seq2);
> >> >>         alignPair(seq1,seq4);
> >> >>         alignPair(seq3,seq4);
> >> >>
> >> >>     }
> >> >>
> >> >>
> >> >>     public static void alignPair(Sequence<NucleotideCompound> seq1,
> >> >>             Sequence<NucleotideCompound> seq2) {
> >> >>         SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
> >> >>                 Alignments.getPairwiseAlignment(seq1, seq2,
> >> >>                         Alignments.PairwiseSequenceAlignerType.GLOBAL,
> >> >>                         gapP, matrix);
> >> >>
> >> >>         System.out.printf("%s", pair);
> >> >>         System.out.println();
> >> >>     }
> >> >>
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> biojava-dev mailing list
> >> >> biojava-dev at lists.open-bio.org
> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >> >>
> >
> >
> >
>


