[Biojava-dev] Case sensitivity in Alignment

Andreas Prlic andreas at sdsc.edu
Thu Nov 21 08:44:14 UTC 2013


Scooter: There is the nuc-4_4 matrix, I believe it does exactly what you
want...

Andreas


On Wed, Nov 20, 2013 at 6:04 PM, Willis,Scooter <
Scooter.Willis at sanfordhealth.org> wrote:

> The problem extends to other cases in alignment, such as the ambiguity
> codes R = A or G, Y = C or T, etc. In this particular problem, A = a or A,
> and C = c or C. If the alignment code treated these equivalences as matches
> when comparing compounds, that would add some interesting functionality.
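
A rough sketch of what such ambiguity-aware, case-insensitive matching could look like in plain Java. The class IupacMatch and its lookup table are made up for illustration and are not BioJava API; only a handful of IUPAC codes are listed, and unknown symbols are simply treated as mismatches.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of BioJava): case-insensitive IUPAC matching. */
final class IupacMatch {
    // Each IUPAC nucleotide code expands to the concrete bases it can stand for.
    private static final Map<Character, String> EXPANSION = new HashMap<>();
    static {
        EXPANSION.put('A', "A");
        EXPANSION.put('C', "C");
        EXPANSION.put('G', "G");
        EXPANSION.put('T', "T");
        EXPANSION.put('R', "AG");   // purine
        EXPANSION.put('Y', "CT");   // pyrimidine
        EXPANSION.put('N', "ACGT"); // any base
    }

    /** True if the two codes share at least one concrete base, ignoring case. */
    static boolean matches(char x, char y) {
        String a = EXPANSION.get(Character.toUpperCase(x));
        String b = EXPANSION.get(Character.toUpperCase(y));
        if (a == null || b == null) {
            return false; // unknown symbol: treat as a mismatch
        }
        for (int i = 0; i < a.length(); i++) {
            if (b.indexOf(a.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches('a', 'A')); // same base, different case
        System.out.println(matches('R', 'g')); // R covers G
        System.out.println(matches('Y', 'A')); // Y covers only C and T
    }
}
```

A yes/no match like this loses the graded scores a substitution matrix provides, so it is only meant to show the equivalence being described.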
>
> Scooter Willis
>
> Director of Computational Bioinformatics
>
> Edith Sanford Breast Cancer Research
>
>
> -----------------------------------------------------------------------
>
> Confidentiality Notice: This e-mail message, including any attachments, is
> for the sole use of the intended recipient(s) and may contain privileged
> and confidential information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient, please
> contact the sender by reply e-mail and destroy all copies of the original
> message.
>
>
>
> On 11/20/13, 6:02 PM, "Spencer Bliven" <sbliven at ucsd.edu> wrote:
>
> >I neglected to mention the CaseFreeAminoAcidCompoundSet from the aa-prop
> >module's xml package. I have no idea why it's there.
> >
> >-Spencer
> >
> >
> >On Wed, Nov 20, 2013 at 2:54 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
> >
> >> The issue of case has come up before, and to my knowledge it hasn't been
> >> handled particularly consistently. There's a CaseInsensitiveCompound in
> >> core which is not used by any other class, and which is pretty much
> >> useless since it doesn't derive from NucleotideCompound but merely wraps
> >> it. There's also a CasePreservingProteinSequenceCreator, which was my
> >> solution to maintain the case information while still working with a
> >> standard AminoAcid CompoundSet. It's an ugly solution: I just turn
> >> everything to uppercase while storing the case as a boolean array in the
> >> sequence's UserCollection. That could easily be adapted to nucleic acid,
> >> but I'd welcome a cleaner solution if anyone has one.
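
The "uppercase plus boolean array" idea described above can be sketched in plain Java roughly as follows. This is not the code of CasePreservingProteinSequenceCreator; the class name CaseMaskedSequence and its fields are hypothetical, and the sketch assumes plain ASCII sequence letters.

```java
/** Hypothetical sketch of the "uppercase plus case mask" idea; not BioJava code. */
final class CaseMaskedSequence {
    final String upper;       // sequence normalized to upper case
    final boolean[] wasLower; // wasLower[i] is true if position i was lower case

    CaseMaskedSequence(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        this.wasLower = new boolean[raw.length()];
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            wasLower[i] = Character.isLowerCase(c);
            // Per-character conversion keeps the length unchanged.
            sb.append(Character.toUpperCase(c));
        }
        this.upper = sb.toString();
    }

    /** Rebuilds the original mixed-case string from the mask. */
    String restore() {
        StringBuilder sb = new StringBuilder(upper.length());
        for (int i = 0; i < upper.length(); i++) {
            char c = upper.charAt(i);
            sb.append(wasLower[i] ? Character.toLowerCase(c) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CaseMaskedSequence s = new CaseMaskedSequence("ACgtAC");
        System.out.println(s.upper);     // upper-case form for matrix lookups
        System.out.println(s.restore()); // original case recovered from the mask
    }
}
```

The upper-case string is what would be handed to the aligner, while the mask travels alongside it so soft-masking information is not lost.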
> >>
> >>
> >> On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
> >>
> >>> Sorry, I may not be keeping up with you both here, but the code in
> >>> question is in the alignment package, and if the substitution matrices
> >>> are all upper case they won't match lower case soft masked sequence;
> >>> wouldn't that be the intent?  (A feature not a bug)
> >>>
> >>>    michael
> >>>
> >>> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>> > The problem is that the substitution matrices are all upper case. We
> >>> > can probably fix this by making the NucleotideCompound.equals method
> >>> > case insensitive...
> >>> >
> >>> > Does anybody see an issue with that?
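
One thing to watch with a case-insensitive equals is that hashCode has to be changed to agree with it, otherwise lookups in hash-based collections can still miss. A minimal sketch, using a hypothetical stand-in class CaseFreeBase rather than the real NucleotideCompound, and assuming ASCII symbols:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a compound class; not BioJava's NucleotideCompound. */
final class CaseFreeBase {
    private final String symbol;

    CaseFreeBase(String symbol) {
        this.symbol = symbol;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CaseFreeBase)) {
            return false;
        }
        return symbol.equalsIgnoreCase(((CaseFreeBase) other).symbol);
    }

    @Override
    public int hashCode() {
        // Must agree with equals: hash the case-folded symbol, not the raw one.
        return symbol.toUpperCase(Locale.ROOT).hashCode();
    }

    public static void main(String[] args) {
        Map<CaseFreeBase, Integer> scores = new HashMap<>();
        scores.put(new CaseFreeBase("A"), 5);
        // A lower-case key finds the entry stored under the upper-case key.
        System.out.println(scores.get(new CaseFreeBase("a")));
    }
}
```

With this change "a" and "A" become indistinguishable to any code that relies on equals, which is exactly the concern for soft-masked input.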
> >>> >
> >>> > A
> >>> >
> >>> >
> >>> > On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
> >>> >>
> >>> >> Hello Andreas, David
> >>> >>
> >>> >> Lower case is the convention for soft-masking sequences from alignment
> >>> >>
> >>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/
> >>> >>
> >>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
> >>> >>
> >>> >> If we are using this convention, perhaps it should be more clearly
> >>> >> documented.  What happens if you use mixed case?
> >>> >>
> >>> >>    michael
> >>> >>
> >>> >>
> >>> >> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>> >> > Hi David,
> >>> >> >
> >>> >> > not sure if we should consider this a bug or a feature: It should be
> >>> >> > easy to work around this by calling toUpperCase on your strings. We
> >>> >> > could of course internally convert all nucleotides to upper case, but
> >>> >> > that would remove the possibility for people to use mixed upper case
> >>> >> > and lower case sequences to represent e.g. alignment conservation.
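
To make the failure mode and the workaround concrete, here is a toy model in plain Java. It does not use BioJava: UpperCaseMatrixDemo is a made-up class, and its score table is only an assumed stand-in for a substitution matrix that is keyed by upper-case letters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of an upper-case-only score table; not BioJava's SubstitutionMatrix. */
final class UpperCaseMatrixDemo {
    // Score table keyed by upper-case base pairs, e.g. "AA" -> 5 (illustrative values).
    static final Map<String, Integer> SCORES = new HashMap<>();
    static {
        String bases = "ACGT";
        for (char x : bases.toCharArray()) {
            for (char y : bases.toCharArray()) {
                SCORES.put("" + x + y, x == y ? 5 : -4);
            }
        }
    }

    /** Returns the score for a pair, or null if the pair is not in the table. */
    static Integer score(char x, char y) {
        return SCORES.get("" + x + y);
    }

    /** The workaround: normalize the input before building sequences from it. */
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "agggc";
        System.out.println(score(raw.charAt(0), raw.charAt(0))); // lower case: no entry
        String fixed = normalize(raw);
        System.out.println(score(fixed.charAt(0), fixed.charAt(0))); // upper case: found
    }
}
```

In real code the normalized string would be what gets passed to the DNASequence constructor.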
> >>> >> >
> >>> >> > Any opinions by other people on this? Is anybody using mixed case
> >>> >> > sequences?
> >>> >> >
> >>> >> > Andreas
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
> >>> >> >
> >>> >> >>
> >>> >> >> There seems to be a bug in the alignment package. If DNA sequences
> >>> >> >> are created using lower case letters, the alignment methods don't
> >>> >> >> work. Looks like the default substitution matrix is coded in upper
> >>> >> >> case, and the underlying case of the DNA sequence is being used in
> >>> >> >> the alignment. Seems like a bug to me.
> >>> >> >>
> >>> >> >> This problem occurs when the DNA sequence is created either using
> >>> >> >> the DNASequence constructor, or reading from a FASTA file which is
> >>> >> >> in lower case.
> >>> >> >>
> >>> >> >>
> >>> >> >> The code below shows the problem.
> >>> >> >>
> >>> >> >>
> >>> >> >>     static SimpleGapPenalty gapP;
> >>> >> >>     static SubstitutionMatrix<NucleotideCompound> matrix;
> >>> >> >>
> >>> >> >>     public static void main(String[] args) throws Exception {
> >>> >> >>         matrix = SubstitutionMatrixHelper.getNuc4_4();
> >>> >> >>         gapP = new SimpleGapPenalty();
> >>> >> >>         gapP.setOpenPenalty((short) 5);
> >>> >> >>         gapP.setExtensionPenalty((short) 2);
> >>> >> >>         testHardcoded();
> >>> >> >>     }
> >>> >> >>
> >>> >> >>     public static void testHardcoded() throws Exception {
> >>> >> >>         Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
> >>> >> >>         Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
> >>> >> >>         Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
> >>> >> >>         Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
> >>> >> >>         alignPair(seq1, seq2);
> >>> >> >>         alignPair(seq1, seq4);
> >>> >> >>         alignPair(seq3, seq4);
> >>> >> >>     }
> >>> >> >>
> >>> >> >>     public static void alignPair(Sequence<NucleotideCompound> seq1,
> >>> >> >>             Sequence<NucleotideCompound> seq2) {
> >>> >> >>         SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
> >>> >> >>                 Alignments.getPairwiseAlignment(seq1, seq2,
> >>> >> >>                         Alignments.PairwiseSequenceAlignerType.GLOBAL,
> >>> >> >>                         gapP, matrix);
> >>> >> >>
> >>> >> >>         System.out.printf("%s", pair);
> >>> >> >>         System.out.println();
> >>> >> >>     }
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> _______________________________________________
> >>> >> >> biojava-dev mailing list
> >>> >> >> biojava-dev at lists.open-bio.org
> >>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>> >> >>


