[Biojava-dev] FastaFormat performance enhancement
ml-it-biojava-dev at epigenomics.com
Wed Oct 19 11:09:27 EDT 2005
Thomas Down wrote:
>
> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>
>> Hi,
>> I had a lot of trouble using SeqIOTools.writeFasta on large
>> sequences. The subStr method of SymbolList seems to introduce a
>> memory leak (I did not track that down in detail!). Anyway, I would
>> suggest changing FastaFormat from:
>> public void writeSequence(Sequence seq, PrintStream os)
>> throws IOException {
>> os.print(">");
>> os.println(describeSequence(seq));
>> int length = seq.length();
>> for (int pos = 1; pos <= length; pos += lineWidth) {
>> int end = Math.min(pos + lineWidth - 1, length);
>> os.println(seq.subStr(pos, end));
>> }
>> }
>>
>> to
>> public void writeSequence(Sequence seq, PrintStream os)
>> throws IOException {
>> os.print(">");
>> os.println(describeSequence(seq));
>> int length = seq.length();
>> String seqString = seq.seqString();
>> for (int pos = 0; pos < length; pos += lineWidth) {
>> int end = Math.min(pos + lineWidth, length);
>> String sub = seqString.substring(pos, end);
>> os.println(sub);
>> }
>> }
>>
>> Since only String manipulation takes place in the loop, I think
>> there is no point in using SymbolList subStr anyway.
>
>
> Hi,
>
> I'd argue against this patch since it could potentially generate some
> really huge strings. Suppose I've got a Sequence object representing
> human chromosome 1 (somewhere around 220Mb). If this is a database-
> backed object with chunks of sequence lazy-loaded on demand (biojava-
> ensembl does this, for example) then there'll be no problem working
> with it even on a fairly modest PC. But converting the whole thing to
> a String is going to use at least 440Mb of RAM, and could easily cause
> an OutOfMemoryError.
>
> I'd be fine with stringifying sequences in larger chunks rather than
> one line at a time -- but I think we should be cautious about
> stringifying complete large sequences.
>
> Do you have any idea where the memory leak might be? I'd be interested
> to track it down. What sort of sequences were you using?
>
> Thomas
>
Hi Thomas,
I ran into performance problems, and even an OutOfMemoryError, when working with large Sequences that were not lazy-loaded. You might want to check this little example:
package test;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;
import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.ensembl.datamodel.CoordinateSystem;
import org.ensembl.datamodel.Location;
import org.ensembl.datamodel.Sequence;
import org.ensembl.datamodel.SequenceRegion;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.CoreDriver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.SequenceAdaptor;
import org.ensembl.driver.SequenceRegionAdaptor;
public class ExportFasta
{
    /**
     * @param args host, user, database, result file, coordinate system, chunk size
     */
    public static void main (String[] args) {
        Properties props = createDriverProperties (args);
        try {
            OutputStream os;
            os = new FileOutputStream (args[3]);
            CoreDriver coreDriver = DriverManager.loadDriver (props);
            SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
            SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
            CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
            SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);
            int size = Integer.parseInt(args[5]);
            for (SequenceRegion seqRegion : srs) {
                Location loc = null;
                int length = (int) seqRegion.getLength();
                int start = 1;
                int end;
                // "<=" so that a final chunk consisting of a single base is not skipped
                while (start <= length) {
                    end = start + size - 1 < length ? start + size - 1 : length;
                    loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
                    System.out.println(loc);
                    start = end + 1;
                    // fetch one chunk from Ensembl and write it as its own FASTA record
                    Sequence seq = sa.fetch(loc);
                    org.biojava.bio.seq.Sequence bioseq = DNATools.createDNASequence(seq.getString(), loc.toString());
                    SeqIOTools.writeFasta(os, bioseq);
                }
            }
        }
        catch (ConfigurationException e) {
            e.printStackTrace();
        }
        catch (AdaptorException e) {
            e.printStackTrace();
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        catch (IllegalSymbolException e) {
            e.printStackTrace();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static Properties createDriverProperties (String[] args) {
        Properties props = new Properties ();
        props.setProperty("host", args[0]);
        props.setProperty("user", args[1]);
        props.setProperty("database", args[2]);
        return props;
    }
}
java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
Since the chunk size is constant, the memory required should stay constant as well. With large chunks (1000000), however, the allocated memory keeps growing!
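Since memory use should be bounded by the chunk size, one middle ground between calling subStr once per line and stringifying the whole sequence would be to stringify a bounded block of several lines at a time, as suggested earlier in this thread. The following is only a minimal sketch of that idea, not the actual FastaFormat code: SubStrSource is a hypothetical stand-in for SymbolList (1-based, inclusive subStr), and the names writeWrapped, wrap and linesPerChunk are made up for illustration.

```java
import java.io.PrintStream;

public class ChunkedFastaSketch {

    /** Hypothetical stand-in for SymbolList: 1-based, inclusive coordinates. */
    interface SubStrSource {
        int length();
        String subStr(int start, int end);
    }

    /**
     * Writes a FASTA record, stringifying at most lineWidth * linesPerChunk
     * symbols at a time instead of the whole sequence or a single line.
     */
    static void writeWrapped(SubStrSource seq, String description,
                             int lineWidth, int linesPerChunk, PrintStream os) {
        if (lineWidth <= 0 || linesPerChunk <= 0) {
            throw new IllegalArgumentException("lineWidth and linesPerChunk must be positive");
        }
        os.print(">");
        os.println(description);
        int length = seq.length();
        // long arithmetic avoids int overflow for very long sequences
        long chunkSize = (long) lineWidth * linesPerChunk;
        for (long chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = (int) Math.min(chunkStart + chunkSize - 1, length);
            // one subStr call per chunk rather than one per output line
            String chunk = seq.subStr((int) chunkStart, chunkEnd);
            // wrap the chunk into lines with plain String operations
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }

    /** Adapts a plain String to the hypothetical interface, for demonstration only. */
    static SubStrSource wrap(final String s) {
        return new SubStrSource() {
            public int length() {
                return s.length();
            }
            public String subStr(int start, int end) {
                return s.substring(start - 1, end);
            }
        };
    }

    public static void main(String[] args) {
        // 10 symbols, 4 per line, 2 lines per chunk
        writeWrapped(wrap("ACGTACGTAC"), "demo", 4, 2, System.out);
    }
}
```

With lineWidth = 60 and linesPerChunk = 1000, for example, at most 60000 symbols would be held as a String at any one time, independent of the sequence length. Whether this avoids the memory growth described above has not been tested.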
Hope that helps, Dirk
--
Dirk Habighorst Software Engineer/ Bioinformatician
Epigenomics AG Kleine Praesidentenstr. 1 10178 Berlin, Germany
phone:+49-30-24345-372 fax:+49-30-24345-555
http://www.epigenomics.com dirk.habighorst at epigenomics.com