[Biojava-dev] FastaFormat performance enhancement

ml-it-biojava-dev at epigenomics.com
Wed Oct 19 04:41:38 EDT 2005


Hi, 

I had a lot of trouble using SeqIOTools.writeFasta on large sequences. The subStr method of SymbolList seems to leak memory (I did not track this down in detail!). In any case, I would suggest changing FastaFormat.writeSequence from:
  
    public void writeSequence(Sequence seq, PrintStream os)
    throws IOException {
        os.print(">");
        os.println(describeSequence(seq));
        
        int length = seq.length();
        
        for (int pos = 1; pos <= length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth - 1, length);
            os.println(seq.subStr(pos, end));
        }
    }

to 

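    public void writeSequence(Sequence seq, PrintStream os)
    throws IOException {
        // Header line: '>' followed by the sequence description.
        os.print(">");
        os.println(describeSequence(seq));

        int length = seq.length();
        // Convert the sequence to a String once, instead of calling subStr per line.
        String seqString = seq.seqString();
        // String.substring is 0-based with an exclusive end index.
        for (int pos = 0; pos < length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth, length);
            String sub = seqString.substring(pos, end);
            os.println(sub);
        }
    }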

Since only String manipulation takes place in the loop, I think there is no point in using SymbolList's subStr there anyway.
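
The change in indexing between the two versions (1-based, inclusive end for subStr; 0-based, exclusive end for String.substring) can be checked in isolation. Below is a minimal sketch, assuming a plain String stands in for the result of seq.seqString(). The class name FastaWrapSketch and the method wrap are made up for illustration and are not part of BioJava.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of BioJava.
public class FastaWrapSketch {

    /** Splits seqString into chunks of at most lineWidth characters. */
    public static List<String> wrap(String seqString, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        List<String> lines = new ArrayList<>();
        int length = seqString.length();
        // Same loop as in the proposed writeSequence: 0-based start,
        // exclusive end, last chunk clipped to the sequence length.
        for (int pos = 0; pos < length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth, length);
            lines.add(seqString.substring(pos, end));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints ACGT, ACGT and AC on separate lines.
        for (String line : wrap("ACGTACGTAC", 4)) {
            System.out.println(line);
        }
    }
}
```

The last chunk is simply shorter when the length is not a multiple of the line width, and an empty sequence produces no sequence lines at all.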

ciao dirk
  
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst at epigenomics.com

