[Biojava-dev] FastaFormat performance enhancement

Thu Oct 20 03:16:39 EDT 2005

Hello -

I think I may have solved this. I found that the ChangeSupport held 
WeakReferences to the SymbolLists that are created in the subList() 
method. The referenced SymbolLists were becoming weakly reachable and 
getting garbage collected, but the ChangeSupport was not clearing out the 
WeakReference objects that no longer pointed to anything. There was a 
provision for this if someone did something that fired a change event, 
but not otherwise.

I've tweaked ChangeSupport a bit so that when it tries to grow its array 
of WeakReferences it first checks whether it can purge some. This seems to 
stabilize the number of WeakReferences at about 1500 on my machine; each 
typically lasts about 4 GC cycles. I will check this into CVS.
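
For readers following along, the purge-before-grow idea can be sketched 
roughly as below. This is a minimal illustration under stated assumptions, 
not the code checked into CVS: the class WeakListenerList and all of its 
method names are hypothetical, and the real ChangeSupport may differ in its 
details.

```java
import java.lang.ref.WeakReference;
import java.util.Arrays;

/** Hypothetical stand-in for a listener registry; NOT BioJava's ChangeSupport. */
class WeakListenerList {
    private WeakReference<Object>[] refs;
    private int count = 0;

    @SuppressWarnings("unchecked")
    WeakListenerList(int initialCapacity) {
        refs = (WeakReference<Object>[]) new WeakReference[Math.max(1, initialCapacity)];
    }

    /** Registers a listener through a weak reference. */
    public void add(Object listener) {
        addReference(new WeakReference<>(listener));
    }

    /** Exposed so a demo can pass in a reference that it later clears by hand. */
    public void addReference(WeakReference<Object> ref) {
        if (count == refs.length) {
            purge();                     // first try to reclaim slots
            if (count == refs.length) {  // nothing reclaimed: grow the array
                refs = Arrays.copyOf(refs, refs.length * 2);
            }
        }
        refs[count++] = ref;
    }

    /** Compacts the array, dropping references whose referent is gone. */
    private void purge() {
        int live = 0;
        for (int i = 0; i < count; i++) {
            if (refs[i].get() != null) {
                refs[live++] = refs[i];
            }
        }
        Arrays.fill(refs, live, count, null); // release the vacated slots
        count = live;
    }

    public int size() { return count; }

    public int capacity() { return refs.length; }
}
```

In this sketch a cleared reference is only removed when the array is full, 
so the array stops growing once references are cleared about as fast as new 
ones are added.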

I'm still a little concerned by the gradual increase of 
java.lang.ref.Finalizer objects. However, these are package private and only 
used by the JVM, so I don't think they have anything to do with what biojava 
is doing (directly); hopefully they will sort themselves out given 
enough time.

- Mark

ml-it-biojava-dev at epigenomics.com
Sent by: biojava-dev-bounces at portal.open-bio.org
10/20/2005 12:28 AM

        To:     biojava-dev at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-dev] FastaFormat performance enhancement

Dirk Habighorst wrote:
> Thomas Down wrote:
> 
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large 
>>> sequences. The subStr method of SymbolList seems to introduce a 
>>> memory leak (I did not track that in detail!). Anyway I would 
>>> suggest to change FastaFormat:
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        for (int pos = 1; pos <= length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth - 1, length);
>>>            os.println(seq.subStr(pos, end));
>>>        }
>>>    }
>>>
>>> to
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        String seqString = seq.seqString();
>>>        for (int pos = 0; pos < length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth, length);
>>>            String sub = seqString.substring(pos, end);
>>>            os.println(sub);
>>>        }
>>>    }
>>>
>>> since it is String manipulation that takes place in the loop, I 
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some 
>> really huge strings.  Suppose I've got a Sequence object representing 
>> human chromosome 1 (somewhere around 220Mb).  If this is a database- 
>> backed object with chunks of sequence lazy-loaded on demand (biojava- 
>> ensembl does this, for example) then there'll be no problem working 
>> with it even on a fairly modest PC.  But converting the whole thing 
>> to a String is going to use at least 440Mb of RAM, and could easily 
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than 
>> one line at a time -- but I think we should be cautious about 
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be?  I'd be 
>> interested to track it down.  What sort of sequences were you using?
>>
>>              Thomas
>>
> Hi thomas,
> 
> I experienced performance problems (even OutOfMemoryError) when working 
> with large Sequences (not lazy loaded). You might want to check this 
> little example:
> 
> package test;
> 
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
> 
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
> 
> 
> public class ExportFasta
> {
> 
>  /**
>   * @param args
>   */
>  public static void main (String[] args) {
>    // TODO Auto-generated method stub
>    Properties props = createDriverProperties (args);
>    try {
>      OutputStream os;
>      os = new FileOutputStream (args[3]);
> 
>      CoreDriver coreDriver = DriverManager.loadDriver (props);
>      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>      SequenceRegion[] srs =
>          sra.fetchAllByCoordinateSystem(coordinateSystem);
>      int size = Integer.parseInt(args[5]);
>      for (SequenceRegion seqRegion : srs) {
>        Location loc = null;
>        int length = (int) seqRegion.getLength();
>        int start = 1;
>        int end;
>        while (start <= length) {
>          end = start + size - 1 < length ? start + size - 1: length;
>          loc = new Location (coordinateSystem, seqRegion.getName(),
>              start, end, 1);
>          System.out.println(loc);
>          start = end + 1;
>          Sequence seq = sa.fetch(loc);
>          org.biojava.bio.seq.Sequence bioseq =
>              DNATools.createDNASequence(seq.getString(), loc.toString());
>          SeqIOTools.writeFasta(os, bioseq);
>        }
>      }
>    }
>    catch (ConfigurationException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (AdaptorException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (FileNotFoundException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IllegalSymbolException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IOException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>  }
> 
>  private static Properties createDriverProperties (String[] args) {
>    Properties props = new Properties ();
>    props.setProperty("host", args[0]);
>    props.setProperty("user", args[1]);
>    props.setProperty("database", args[2]);
>    return props;
>  }
> 
> }
> 
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE 
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
> 
> Since the chunk size is constant, the memory required should be stable. 
> With large chunks (1000000), allocated memory keeps growing!
> Hope that helps, dirk

Hi thomas,

I did a little debugging myself and found an interesting place to look at! 
The SimpleSymbolList backing Sequences created with the DNATools 
implements subList like this:

    public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                      "Sublist index out of bounds " + length() + ":" +
                      start + "," + end
                      );
        }

        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start +
                ", end=" + end
                );
        }

        SimpleSymbolList sl =
            new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }

So every call adds another SymbolList reference to the original Sequence 
via the addChangeListener method. It appears that garbage collection 
can't keep up with that if the Sequence is too long. I have not checked 
this in detail though.
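
One way to cut down the number of per-line sublist calls, while still 
respecting Thomas's concern about stringifying a whole chromosome, might be 
to stringify in larger chunks and split each chunk into lines with plain 
String operations. The sketch below only illustrates that idea: SeqSource 
and ChunkedFastaWriter are hypothetical names, not BioJava classes, and the 
linesPerChunk parameter is an assumption.

```java
import java.io.PrintStream;

/** Hypothetical minimal view of a sequence; NOT a BioJava interface. */
interface SeqSource {
    int length();

    /** Returns residues start..end, 1-based and inclusive. */
    String subStr(int start, int end);
}

/** Hypothetical helper that writes FASTA records chunk by chunk. */
class ChunkedFastaWriter {
    static void write(String description, SeqSource seq, PrintStream os,
                      int lineWidth, int linesPerChunk) {
        os.print(">");
        os.println(description);
        int length = seq.length();
        // A chunk is a whole number of lines, so line breaks stay aligned.
        int chunkSize = lineWidth * linesPerChunk;
        for (int pos = 1; pos <= length; pos += chunkSize) {
            int end = Math.min(pos + chunkSize - 1, length);
            String chunk = seq.subStr(pos, end); // one call per chunk, not per line
            for (int i = 0; i < chunk.length(); i += lineWidth) {
                int lineEnd = Math.min(i + lineWidth, chunk.length());
                os.println(chunk.substring(i, lineEnd));
            }
        }
    }
}
```

With a lineWidth of 60 and a linesPerChunk of 1000, for example, at most 
60000 residues would be held as a String at any one time.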

ciao, dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst at epigenomics.com
_______________________________________________
biojava-dev mailing list
biojava-dev at biojava.org
http://biojava.org/mailman/listinfo/biojava-dev