[Biojava-dev] SimpleGappedSymbolList problem, weird "String seqString()" results.

Kalle Näslund kalle.naslund at genpat.uu.se
Wed Feb 19 15:23:16 EST 2003


Hi!

I noticed that if you insert leading or trailing gaps and then call 
seqString(), you get "n" instead of "-" for those gaps. To illustrate, 
the following set of gap operations on a SimpleGappedSymbolList:



Alphabet           dna         =   DNATools.getDNA();
SymbolTokenization dnaParser   =   dna.getTokenization( "token" );

SymbolList       symList1    =   new SimpleSymbolList( dnaParser, "TTCCTTCCGGGTCGTC" );
GappedSymbolList gl1         =   new SimpleGappedSymbolList( symList1 );

System.out.println( gl1.seqString() );
gl1.addGapsInSource( 1, 4 );
System.out.println( gl1.seqString() );
gl1.addGapsInSource( 10, 2 );
System.out.println( gl1.seqString() );
gl1.addGapsInSource( 17, 4 );
System.out.println( gl1.seqString() );

gives this result :

ttccttccgggtcgtc
nnnnttccttccgggtcgtc
nnnnttccttccg--ggtcgtc
nnnnttccttccg--ggtcgtcnnnn
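
For comparison, here is a minimal sketch of the output one would presumably expect, with "-" for every gap. It is a toy model in plain Java and uses no BioJava classes. The class name ExpectedGapOutput and the render method are hypothetical, and the coordinate rule (gaps go immediately before the given 1-based source position, with length + 1 meaning "after the last symbol") is only inferred from the output shown above.

```java
// Toy model only: plain Java, no BioJava classes are used.
class ExpectedGapOutput {

    // Each op is { sourcePosition, gapLength }. gapsBefore[p] counts the gaps
    // placed immediately before source symbol p (1-based). The slot at
    // length + 1 holds gaps placed after the last source symbol.
    static String render(String source, int[][] gapOps) {
        int[] gapsBefore = new int[source.length() + 2];
        for (int[] op : gapOps) {
            gapsBefore[op[0]] += op[1];
        }
        StringBuilder out = new StringBuilder();
        for (int p = 1; p <= source.length() + 1; p++) {
            for (int g = 0; g < gapsBefore[p]; g++) {
                out.append('-');
            }
            if (p <= source.length()) {
                out.append(source.charAt(p - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same three operations as in the example above.
        int[][] ops = { { 1, 4 }, { 10, 2 }, { 17, 4 } };
        // prints ----ttccttccg--ggtcgtc----
        System.out.println(render("ttccttccgggtcgtc", ops));
    }
}
```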


I haven't managed to fully understand why this happens, but the start of 
the story goes like this:

1) SimpleGappedSymbolList's symbolAt method returns a different gap 
symbol depending on whether the gap is an "internal" gap or a 
leading/trailing gap. The relevant piece of code in the symbolAt method is:

	if( (indx < firstNonGap()) || (indx > lastNonGap()) ) {
		return Alphabet.EMPTY_ALPHABET.getGapSymbol();
	}
	else {
		return getAlphabet().getGapSymbol();
	}
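
To make the branch above concrete, here is a small stand-alone sketch that applies the same test to the final 26-column list from the example. It is a toy model, not BioJava code: the class GapKindSketch and the enum Gap are hypothetical stand-ins for the two gap symbols, and the values 5 and 22 for firstNonGap/lastNonGap are worked out by hand from the printed output.

```java
// Toy model only: plain Java, no BioJava classes are used.
class GapKindSketch {

    // Hypothetical stand-ins for the two gap symbols the real method can return.
    enum Gap { EMPTY_ALPHABET_GAP, OWN_ALPHABET_GAP }

    // Same comparison as the quoted branch; indices are 1-based.
    static Gap gapAt(int indx, int firstNonGap, int lastNonGap) {
        if (indx < firstNonGap || indx > lastNonGap) {
            return Gap.EMPTY_ALPHABET_GAP;   // leading or trailing gap
        }
        return Gap.OWN_ALPHABET_GAP;         // internal gap
    }

    public static void main(String[] args) {
        // In the final list, the source symbols occupy columns 5..22.
        int first = 5;
        int last = 22;
        // Column 1 is a leading gap, 14 an internal gap, 26 a trailing gap.
        for (int col : new int[] { 1, 14, 26 }) {
            System.out.println(col + " -> " + gapAt(col, first, last));
        }
    }
}
```

So only columns 14 and 15 would get the alphabet's own gap symbol. The eight leading and trailing columns get the other one, which matches where the "n" characters show up.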

2) When one calls seqString on a SimpleGappedSymbolList, it simply uses 
the method it inherited from AbstractSymbolList, which looks like this:


	public String seqString() {
		try {
			SymbolTokenization toke = getAlphabet().getTokenization("token");
			return toke.tokenizeSymbolList(this);
		}
		catch (BioException ex) {
			throw new BioRuntimeException(ex, "Couldn't tokenize sequence");
		}
	}

So what happens is that all symbols get fed to the SymbolTokenization 
object that you get from whatever default alphabet a DNA 
SimpleGappedSequence uses. If you feed the gap symbol you get from 
Alphabet.EMPTY_ALPHABET.getGapSymbol() to this SymbolTokenization, it 
returns an "n" and not a "-".
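
The mismatch can be imitated with a toy tokenizer. This is only a sketch of the observed behaviour and not a claim about how BioJava's tokenization is implemented. The class TokenizeSketch and its two gap objects are hypothetical, and the fallback to 'n' for any symbol the tokenizer does not recognise is an assumption made to reproduce the output.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: plain Java, no BioJava classes are used.
class TokenizeSketch {

    // Hypothetical stand-ins: two distinct gap objects, as in symbolAt.
    static final Object EMPTY_ALPHABET_GAP = new Object();
    static final Object DNA_GAP = new Object();

    // The toy tokenizer knows plain bases and its own gap symbol.
    static char token(Object sym) {
        if (sym instanceof Character) {
            return (Character) sym;
        }
        if (sym == DNA_GAP) {
            return '-';
        }
        // ASSUMPTION: anything unrecognised is rendered as 'n'.
        return 'n';
    }

    static String tokenize(List<Object> symbols) {
        StringBuilder sb = new StringBuilder();
        for (Object sym : symbols) {
            sb.append(token(sym));
        }
        return sb.toString();
    }

    // Two leading gaps, "ac", one internal gap, "g", one trailing gap.
    static String demo() {
        List<Object> syms = new ArrayList<>();
        syms.add(EMPTY_ALPHABET_GAP);
        syms.add(EMPTY_ALPHABET_GAP);
        syms.add('a');
        syms.add('c');
        syms.add(DNA_GAP);
        syms.add('g');
        syms.add(EMPTY_ALPHABET_GAP);
        return tokenize(syms);
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints nnac-gn
    }
}
```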

At this point my limited knowledge of the black arts of Alphabets in 
biojava stopped me from writing the end of the story, and I was hoping 
that someone else might end it for me =),

regards Kalle






