[Biojava-dev] Case sensitivity in Alignment
Willis,Scooter
Scooter.Willis at SanfordHealth.org
Thu Nov 21 02:04:49 UTC 2013
The problem extends to other cases in alignment where ambiguity codes are
involved: R = A or G, Y = C or T, etc. In this particular problem, A = a or
A, and C = c or C. If the alignment code, when comparing two compounds,
treated this kind of equivalence as a match, that would add some
interesting functionality.
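
As a minimal sketch of that idea: the class below is a standalone,
hypothetical helper (IupacMatch and matches are illustrative names, not
existing BioJava classes or methods). It upper-cases both symbols and then
treats two IUPAC nucleotide codes as compatible if the sets of bases they
stand for overlap. Treating U like T is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of BioJava: case-insensitive IUPAC matching. */
final class IupacMatch {

    // Each IUPAC code maps to the bases it can stand for, as a bitmask:
    // A = 1, C = 2, G = 4, T = 8.
    private static final Map<Character, Integer> CODES = new HashMap<>();
    static {
        CODES.put('A', 1);
        CODES.put('C', 2);
        CODES.put('G', 4);
        CODES.put('T', 8);
        CODES.put('U', 8);          // assumption: treat RNA U like T
        CODES.put('R', 1 | 4);      // A or G
        CODES.put('Y', 2 | 8);      // C or T
        CODES.put('S', 2 | 4);      // C or G
        CODES.put('W', 1 | 8);      // A or T
        CODES.put('K', 4 | 8);      // G or T
        CODES.put('M', 1 | 2);      // A or C
        CODES.put('B', 2 | 4 | 8);  // not A
        CODES.put('D', 1 | 4 | 8);  // not C
        CODES.put('H', 1 | 2 | 8);  // not G
        CODES.put('V', 1 | 2 | 4);  // not T
        CODES.put('N', 15);         // any base
    }

    /** True if both symbols are known codes and share at least one base. */
    static boolean matches(char x, char y) {
        Integer a = CODES.get(Character.toUpperCase(x));
        Integer b = CODES.get(Character.toUpperCase(y));
        return a != null && b != null && (a & b) != 0;
    }

    public static void main(String[] args) {
        System.out.println(matches('a', 'A')); // true: same base, different case
        System.out.println(matches('R', 'g')); // true: R covers G
        System.out.println(matches('Y', 'a')); // false: Y is C or T
    }
}
```

Note that this discards the soft-masking information carried by lower case,
so it only fits situations where case is not meant to be significant.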
Scooter Willis
Director of Computational Bioinformatics
Edith Sanford Breast Cancer Research
-----------------------------------------------------------------------
Confidentiality Notice: This e-mail message, including any attachments, is
for the sole use of the intended recipient(s) and may contain privileged
and confidential information. Any unauthorized review, use, disclosure or
distribution is prohibited. If you are not the intended recipient, please
contact the sender by reply e-mail and destroy all copies of the original
message.
On 11/20/13, 6:02 PM, "Spencer Bliven" <sbliven at ucsd.edu> wrote:
>I neglected the CaseFreeAminoAcidCompoundSet from the aa-prop module's xml
>package. I have no idea why it's there.
>
>-Spencer
>
>
>On Wed, Nov 20, 2013 at 2:54 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
>
>> The issue of case has come up before, and to my knowledge it hasn't been
>> handled particularly consistently. There's a CaseInsensitiveCompound in
>> core which is not used by any other class, and which is pretty much
>> useless since it doesn't derive from NucleotideCompound but merely wraps
>> it. There's also a CasePreservingProteinSequenceCreator, which was my
>> solution to maintain the case information while still working with a
>> standard AminoAcid CompoundSet. It's an ugly solution: I just turn
>> everything to uppercase while storing the case as a boolean array in the
>> sequence's UserCollection. That could easily be adapted to nucleic acid,
>> but I'd welcome a cleaner solution if anyone has one.
>>
>>
>> On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>
>>> Sorry, I may not be keeping up with you both here, but the code in
>>> question is in the alignment package, and if the substitution matrices
>>> are all upper case they won't match lower case soft masked sequence;
>>> wouldn't that be the intent? (A feature not a bug)
>>>
>>> michael
>>>
>>> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> > The problem is that the substitution matrices are all upper case. We
>>> > can probably fix this by making the NucleotideCompound.equals method
>>> > case insensitive...
>>> >
>>> > Does anybody see an issue with that?
>>> >
>>> > A
>>> >
>>> >
>>> > On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>> >>
>>> >> Hello Andreas, David
>>> >>
>>> >> Lower case is the convention for soft-masking sequences from alignment
>>> >>
>>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>>> >>
>>> >>
>>> >> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>>> >>
>>> >> If we are using this convention, perhaps it should be more clearly
>>> >> documented. What happens if you use mixed case?
>>> >>
>>> >> michael
>>> >>
>>> >>
>>> >> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> >> > Hi David,
>>> >> >
>>> >> > not sure if we should consider this a bug or a feature: It should be
>>> >> > easy to work around this by calling toUpperCase on your strings. We
>>> >> > could of course internally convert all nucleotides to upper case, but
>>> >> > that would remove the possibility for people to use mixed upper case
>>> >> > and lower case sequences to represent e.g. alignment conservation.
>>> >> >
>>> >> > Any opinions by other people on this? Is anybody using mixed case
>>> >> > sequences?
>>> >> >
>>> >> > Andreas
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>>> >> >
>>> >> >>
>>> >> >> There seems to be a bug in the alignment package. If DNA sequences are
>>> >> >> created using lower case letters, the alignment methods don't work.
>>> >> >> It looks like the default substitution matrix is coded in upper case,
>>> >> >> and the underlying case of the DNA sequence is being used in the
>>> >> >> alignment. Seems like a bug to me.
>>> >> >>
>>> >> >> This problem occurs when the DNA sequence is created either using the
>>> >> >> DNASequence constructor, or by reading from a FASTA file which is in
>>> >> >> lower case.
>>> >> >>
>>> >> >>
>>> >> >> The code below shows the problem.
>>> >> >>
>>> >> >>
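>>> >> >> // Imports assume the BioJava 3.x package layout (org.biojava3.*).
>>> >> >> import org.biojava3.alignment.Alignments;
>>> >> >> import org.biojava3.alignment.SimpleGapPenalty;
>>> >> >> import org.biojava3.alignment.SubstitutionMatrixHelper;
>>> >> >> import org.biojava3.alignment.template.SequencePair;
>>> >> >> import org.biojava3.alignment.template.SubstitutionMatrix;
>>> >> >> import org.biojava3.core.sequence.DNASequence;
>>> >> >> import org.biojava3.core.sequence.compound.NucleotideCompound;
>>> >> >> import org.biojava3.core.sequence.template.Sequence;
>>> >> >>
>>> >> >> // Class name added only so the snippet compiles on its own.
>>> >> >> public class CaseAlignmentDemo {
>>> >> >>     static SimpleGapPenalty gapP;
>>> >> >>     static SubstitutionMatrix<NucleotideCompound> matrix;
>>> >> >>
>>> >> >>     public static void main(String[] args) throws Exception {
>>> >> >>         matrix = SubstitutionMatrixHelper.getNuc4_4();
>>> >> >>         gapP = new SimpleGapPenalty();
>>> >> >>         gapP.setOpenPenalty((short) 5);
>>> >> >>         gapP.setExtensionPenalty((short) 2);
>>> >> >>         testHardcoded();
>>> >> >>     }
>>> >> >>
>>> >> >>     public static void testHardcoded() throws Exception {
>>> >> >>         // seq3 and seq4 are lower-case copies of seq1 and seq2.
>>> >> >>         Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>> >> >>         Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>> >> >>         Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>> >> >>         Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>> >> >>         alignPair(seq1, seq2); // upper case vs. upper case
>>> >> >>         alignPair(seq1, seq4); // upper case vs. lower case
>>> >> >>         alignPair(seq3, seq4); // lower case vs. lower case
>>> >> >>     }
>>> >> >>
>>> >> >>     // Globally aligns the two sequences and prints the resulting pair.
>>> >> >>     public static void alignPair(Sequence<NucleotideCompound> seq1,
>>> >> >>             Sequence<NucleotideCompound> seq2) {
>>> >> >>         SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>> >> >>                 Alignments.getPairwiseAlignment(seq1, seq2,
>>> >> >>                         Alignments.PairwiseSequenceAlignerType.GLOBAL,
>>> >> >>                         gapP, matrix);
>>> >> >>
>>> >> >>         System.out.printf("%s", pair);
>>> >> >>         System.out.println();
>>> >> >>     }
>>> >> >> }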
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> _______________________________________________
>>> >> >> biojava-dev mailing list
>>> >> >> biojava-dev at lists.open-bio.org
>>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>> >> >>
>>> >
>>> >
>>> >
>>>
>>
>>