[Biojava-dev] FastaFormat performance enhancement
mark.schreiber at novartis.com
Thu Oct 20 03:16:39 EDT 2005
Hello -
I think I may have solved this. I found that ChangeSupport held
WeakReferences to the SymbolLists that are created in the subList()
method. The referenced SymbolLists were becoming weakly reachable and
getting garbage collected, but ChangeSupport was not clearing out the
WeakReference objects that no longer pointed to anything. There was a
provision for this if someone did something that fired a change event,
but not otherwise.
I've tweaked ChangeSupport a bit so that when it tries to grow its array
of WeakReferences it first checks whether it can purge some. This seems to
stabilize the number of WeakReferences at about 1500 on my machine; each
typically lasts about 4 GC cycles. I will check this into CVS.
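To illustrate the purge-before-grow idea, here is a minimal sketch. It is not the actual ChangeSupport code: the class WeakListenerRegistry and its methods are hypothetical names made up for this example, and addListener returns the WeakReference only so that a demo can call clear() on it to simulate garbage collection.

```java
import java.lang.ref.WeakReference;
import java.util.Arrays;

/** Hypothetical stand-in for a weak listener registry; NOT BioJava's ChangeSupport. */
final class WeakListenerRegistry {
    private WeakReference<Object>[] refs;
    private int count = 0;

    @SuppressWarnings("unchecked")
    WeakListenerRegistry(int initialCapacity) {
        refs = (WeakReference<Object>[]) new WeakReference[Math.max(1, initialCapacity)];
    }

    /** Registers a listener weakly; purges dead entries before growing the array. */
    WeakReference<Object> addListener(Object listener) {
        if (count == refs.length) {
            purge();                          // first try to reclaim slots
            if (count == refs.length) {       // still full, so grow
                refs = Arrays.copyOf(refs, refs.length * 2);
            }
        }
        WeakReference<Object> ref = new WeakReference<>(listener);
        refs[count++] = ref;
        return ref;
    }

    /** Compacts the array, dropping references whose referent is gone. */
    void purge() {
        int live = 0;
        for (int i = 0; i < count; i++) {
            WeakReference<Object> r = refs[i];
            if (r.get() != null) {
                refs[live++] = r;
            }
        }
        Arrays.fill(refs, live, count, null); // release the dead slots
        count = live;
    }

    int size() { return count; }

    int capacity() { return refs.length; }
}
```

With this arrangement the array only grows when every slot still points at a live listener, so short-lived sublists should not inflate it indefinitely even if no change event is ever fired.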
I'm still a little concerned by the gradual increase of
java.lang.ref.Finalizer objects. However, these are package-private and only
used by the JVM, so I don't think they have anything to do with what BioJava
is doing (directly). Hopefully they will sort themselves out given
enough time.
- Mark
ml-it-biojava-dev at epigenomics.com
Sent by: biojava-dev-bounces at portal.open-bio.org
10/20/2005 12:28 AM
To: biojava-dev at biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-dev] FastaFormat performance enhancement
Dirk Habighorst wrote:
> Thomas Down wrote:
>
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large
>>> sequences. The subStr method of SymbolList seems to introduce a
>>> memory leak (I did not track that in detail!). Anyway I would
>>> suggest to change FastaFormat:
>>> public void writeSequence(Sequence seq, PrintStream os)
>>>         throws IOException {
>>>     os.print(">");
>>>     os.println(describeSequence(seq));
>>>     int length = seq.length();
>>>     for (int pos = 1; pos <= length; pos += lineWidth) {
>>>         int end = Math.min(pos + lineWidth - 1, length);
>>>         os.println(seq.subStr(pos, end));
>>>     }
>>> }
>>>
>>> to
>>> public void writeSequence(Sequence seq, PrintStream os)
>>>         throws IOException {
>>>     os.print(">");
>>>     os.println(describeSequence(seq));
>>>     int length = seq.length();
>>>     String seqString = seq.seqString();
>>>     for (int pos = 0; pos < length; pos += lineWidth) {
>>>         int end = Math.min(pos + lineWidth, length);
>>>         String sub = seqString.substring(pos, end);
>>>         os.println(sub);
>>>     }
>>> }
>>>
>>> since it is String manipulation that takes place in the loop, I
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some
>> really huge strings. Suppose I've got a Sequence object representing
>> human chromosome 1 (somewhere around 220Mb). If this is a database-
>> backed object with chunks of sequence lazy-loaded on demand (biojava-
>> ensembl does this, for example) then there'll be no problem working
>> with it even on a fairly modest PC. But converting the whole thing
>> to a String is going to use at least 440 MB of RAM, and could easily
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than
>> one line at a time -- but I think we should be cautious about
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be? I'd be
>> interested to track it down. What sort of sequences were you using?
>>
>> Thomas
>>
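The chunked approach suggested above (stringify a block of many lines at once, never the whole sequence) could be sketched roughly as follows. This is only an illustration under stated assumptions: SequenceSource is a hypothetical stand-in interface, not a BioJava type, with subStr using 1-based inclusive coordinates as in the loop quoted earlier; the linesPerChunk parameter is likewise made up for the example.

```java
import java.io.PrintStream;

final class ChunkedFastaWriter {

    /** Hypothetical minimal view of a sequence; coordinates are 1-based, inclusive. */
    interface SequenceSource {
        int length();
        String subStr(int start, int end);
    }

    /**
     * Writes one FASTA record, fetching the sequence in chunks of
     * lineWidth * linesPerChunk symbols so that at most one chunk is
     * held as a String at any time.
     */
    static void write(String description, SequenceSource seq, PrintStream os,
                      int lineWidth, int linesPerChunk) {
        os.print(">");
        os.println(description);
        int length = seq.length();
        int chunkSize = lineWidth * linesPerChunk;
        for (int chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = Math.min(chunkStart + chunkSize - 1, length);
            // One subStr call per chunk instead of one per output line.
            String chunk = seq.subStr(chunkStart, chunkEnd);
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }
}
```

Because the chunk size is a multiple of the line width, line breaks fall in the same places as in the line-at-a-time version, while the number of subStr calls drops by a factor of linesPerChunk and peak String size stays bounded by the chunk size.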
> Hi Thomas,
>
> I experienced performance problems (even OutOfMemoryError) when working
> with large Sequences (not lazy loaded). You might want to check this
> little example:
>
> package test;
>
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
>
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
>
>
> public class ExportFasta
> {
>
>     /**
>      * @param args
>      */
>     public static void main (String[] args) {
>         // TODO Auto-generated method stub
>         Properties props = createDriverProperties (args);
>         try {
>             OutputStream os;
>             os = new FileOutputStream (args[3]);
>
>             CoreDriver coreDriver = DriverManager.loadDriver (props);
>             SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>             SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>             CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>             SequenceRegion[] srs =
>                 sra.fetchAllByCoordinateSystem(coordinateSystem);
>             int size = Integer.parseInt(args[5]);
>             for (SequenceRegion seqRegion : srs) {
>                 Location loc = null;
>                 int length = (int) seqRegion.getLength();
>                 int start = 1;
>                 int end;
>                 while (start < length) {
>                     end = start + size - 1 < length ? start + size - 1 : length;
>                     loc = new Location (coordinateSystem, seqRegion.getName(),
>                                         start, end, 1);
>                     System.out.println(loc);
>                     start = end + 1;
>                     Sequence seq = sa.fetch(loc);
>                     org.biojava.bio.seq.Sequence bioseq =
>                         DNATools.createDNASequence(seq.getString(), loc.toString());
>                     SeqIOTools.writeFasta(os, bioseq);
>                 }
>             }
>         }
>         catch (ConfigurationException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (AdaptorException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (FileNotFoundException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IllegalSymbolException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
>
>     private static Properties createDriverProperties (String[] args) {
>         Properties props = new Properties ();
>         props.setProperty("host", args[0]);
>         props.setProperty("user", args[1]);
>         props.setProperty("database", args[2]);
>         return props;
>     }
>
> }
>
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
>
> Since the chunk size is stable, the memory required should be stable too. With
> large chunks (1000000), allocated memory keeps growing!
> Hope that helps, Dirk
Hi Thomas,
I did a little debugging myself and found an interesting place to look at!
The SimpleSymbolList backing Sequences created with DNATools
implements subList like this:
public SymbolList subList(int start, int end){
    if (start < 1 || end > length()) {
        throw new IndexOutOfBoundsException(
            "Sublist index out of bounds " + length() + ":" +
            start + "," + end
        );
    }
    if (end < start) {
        throw new IllegalArgumentException(
            "end must not be lower than start: start=" + start +
            ", end=" + end
        );
    }
    SimpleSymbolList sl =
        new SimpleSymbolList(this, viewOffset+start, viewOffset+end);
    if (isView){
        referenceSymbolList.addChangeListener(sl);
    }else{
        this.addChangeListener(sl);
    }
    return sl;
}
So every subList call adds a reference to a new SymbolList to the original
Sequence via the addChangeListener method. It appears that garbage
collection can't keep up with that if the Sequence is too long. I have not
checked this in detail, though.
ciao, dirk
--
Dirk Habighorst Software Engineer/ Bioinformatician
Epigenomics AG Kleine Praesidentenstr. 1 10178 Berlin, Germany
phone:+49-30-24345-372 fax:+49-30-24345-555
http://www.epigenomics.com dirk.habighorst at epigenomics.com
_______________________________________________
biojava-dev mailing list
biojava-dev at biojava.org
http://biojava.org/mailman/listinfo/biojava-dev