[Biojava-dev] Case sensitivity in Alignment

Waring, David A dwaring at fhcrc.org
Thu Nov 21 19:18:40 UTC 2013


Exactly my point: lower case is a convention for masking. It is not a general convention across sequence formats, FASTA included.

If someone is working within a system using masking, then he is well aware of that and will need to be working with Sequence objects that make this explicit.

But if someone is not using masking, and is simply working with a set of files he received from a researcher that simply represent DNA sequence, the sequences should never behave differently depending on the underlying file format.
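
To make the workaround Andreas mentions below concrete, here is a minimal sketch in plain Java with no BioJava calls. The class name CaseNormalization and both methods are hypothetical, and score() is only a toy stand-in for a substitution matrix that knows upper-case symbols, not the library's actual lookup. It shows a lower-case symbol being rejected until the input is normalized.

```java
import java.util.Locale;

public class CaseNormalization {

    // Toy scoring function (an assumption for illustration): it recognizes
    // only the upper-case symbols A, C, G, T, like a matrix coded in upper case.
    static int score(char a, char b) {
        String alphabet = "ACGT";
        if (alphabet.indexOf(a) < 0 || alphabet.indexOf(b) < 0) {
            throw new IllegalArgumentException("Unknown symbol: " + a + " or " + b);
        }
        return a == b ? 5 : -4; // illustrative match/mismatch values
    }

    // Upper-case the raw residues; Locale.ROOT avoids locale-specific case rules.
    static String normalize(String rawSequence) {
        return rawSequence.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String upper = "ACCCCGGTTTAATATTTTT";
        String lower = "accccggtttaatattttt";

        // After normalization the two inputs are the same sequence.
        System.out.println(normalize(lower).equals(upper));

        // Scoring works once the lower-case input has been normalized.
        System.out.println(score(normalize(lower).charAt(0), upper.charAt(0)));
    }
}
```

Note that normalizing this way throws away any soft-masking or conservation information carried by case, so it only suits the situation I describe above, where case carries no meaning.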



On Nov 20, 2013, at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:

> Hello Andreas, David
> 
> Lower case is the convention for soft-masking sequences from alignment
> 
> http://www.ncbi.nlm.nih.gov/books/NBK1763/
> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
> 
> If we are using this convention, perhaps it should be more clearly
> documented.  What happens if you use mixed case?
> 
>  michael
> 
> 
> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi David,
>> 
>> not sure if we should consider this a bug or a feature: It should be easy
>> to work around this by calling toUpperCase() on your strings. We could of
>> course internally convert all nucleotides to upper case, but that would
>> remove the possibility for people to use mixed upper case and lower case
>> sequences to represent e.g. alignment conservation.
>> 
>> Any opinions by other people on this? Is anybody using mixed case sequences?
>> 
>> Andreas
>> 
>> 
>> On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>> 
>>> 
>>> There seems to be a bug in the alignment package. If DNA sequences are
>>> created using lower case letters, the alignment methods don't work. Looks
>>> like the default substitution matrix is coded in upper case, and the
>>> underlying case of the DNA sequence is being used in the alignment. Seems
>>> like a bug to me.
>>> 
>>> This problem occurs when the DNA sequence is created either using the
>>> DNASequence constructor, or by reading from a FASTA file that is in lower case.
>>> 
>>> 
>>> The code below shows the problem.
>>> 
>>> 
>>>   static SimpleGapPenalty gapP;
>>>   static SubstitutionMatrix<NucleotideCompound> matrix;
>>> 
>>>   public static void main(String[] args) throws Exception {
>>>       matrix = SubstitutionMatrixHelper.getNuc4_4();
>>>       gapP = new SimpleGapPenalty();
>>>       gapP.setOpenPenalty((short) 5);
>>>       gapP.setExtensionPenalty((short) 2);
>>>       testHardcoded();
>>>   }
>>> 
>>>   public static void testHardcoded() throws Exception {
>>>       // seq1 and seq2 are upper case; seq3 and seq4 are the same residues in lower case.
>>>       Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>>       Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>>       Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>>       Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>>       alignPair(seq1, seq2);  // upper vs. upper
>>>       alignPair(seq1, seq4);  // upper vs. lower
>>>       alignPair(seq3, seq4);  // lower vs. lower
>>>   }
>>> 
>>>   public static void alignPair(Sequence<NucleotideCompound> seq1,
>>>                                Sequence<NucleotideCompound> seq2) {
>>>       SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>>               Alignments.getPairwiseAlignment(seq1, seq2,
>>>                       Alignments.PairwiseSequenceAlignerType.GLOBAL, gapP, matrix);
>>> 
>>>       System.out.printf("%s", pair);
>>>       System.out.println();
>>>   }
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>> 