[Biojava-dev] FastaFormat performance enhancement
Thomas Down
td2 at sanger.ac.uk
Wed Oct 19 09:53:39 EDT 2005
On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
> Hi,
> I had a lot of trouble using SeqIOTools.writeFasta on large
> sequences. The subStr method of SymbolList seems to introduce a
> memory leak (I did not track it down in detail!). Anyway, I would
> suggest changing FastaFormat from:
>   public void writeSequence(Sequence seq, PrintStream os)
>       throws IOException {
>     os.print(">");
>     os.println(describeSequence(seq));
>     int length = seq.length();
>     for (int pos = 1; pos <= length; pos += lineWidth) {
>       int end = Math.min(pos + lineWidth - 1, length);
>       os.println(seq.subStr(pos, end));
>     }
>   }
>
> to
>   public void writeSequence(Sequence seq, PrintStream os)
>       throws IOException {
>     os.print(">");
>     os.println(describeSequence(seq));
>     int length = seq.length();
>     String seqString = seq.seqString();
>     for (int pos = 0; pos < length; pos += lineWidth) {
>       int end = Math.min(pos + lineWidth, length);
>       String sub = seqString.substring(pos, end);
>       os.println(sub);
>     }
>   }
>
> Since it is String manipulation that takes place in the loop, I
> think there is no point in using SymbolList subStr anyway.
Hi,
I'd argue against this patch since it could potentially generate some
really huge strings. Suppose I've got a Sequence object representing
human chromosome 1 (somewhere around 220Mb). If this is a database-
backed object with chunks of sequence lazy-loaded on demand (biojava-
ensembl does this, for example) then there'll be no problem working
with it even on a fairly modest PC. But converting the whole thing
to a String is going to use at least 440MB of RAM (Java chars are two
bytes each), and could easily cause an OutOfMemoryError.
I'd be fine with stringifying sequences in larger chunks rather than
one line at a time -- but I think we should be cautious about
stringifying complete large sequences.
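For illustration, stringifying in larger chunks might look something like the sketch below. It is not the FastaFormat code: SeqSource is a hypothetical stand-in for the length() and subStr(start, end) calls used in the quoted patch (1-based, inclusive), and the class name, the of() helper and the linesPerChunk parameter are made up for the example. Each chunk is a whole number of lines, so at most lineWidth * linesPerChunk characters are held as a String at any one time.

```java
import java.io.PrintStream;

final class ChunkedFastaWriter {

    /** Hypothetical stand-in for the two SymbolList methods used here. */
    interface SeqSource {
        int length();

        /** Returns positions start..end, 1-based and inclusive. */
        String subStr(int start, int end);
    }

    /** Wraps a plain String as a SeqSource, for demonstration only. */
    static SeqSource of(final String s) {
        return new SeqSource() {
            public int length() {
                return s.length();
            }

            public String subStr(int start, int end) {
                return s.substring(start - 1, end);
            }
        };
    }

    /**
     * Writes one FASTA record, calling subStr once per chunk of
     * linesPerChunk lines instead of once per line or once per sequence.
     */
    static void write(String description, SeqSource seq, PrintStream os,
                      int lineWidth, int linesPerChunk) {
        os.print(">");
        os.println(description);
        int length = seq.length();
        // A multiple of lineWidth, so chunk boundaries fall on line boundaries.
        int chunkSize = lineWidth * linesPerChunk;
        for (int start = 1; start <= length; start += chunkSize) {
            int end = Math.min(start + chunkSize - 1, length);
            String chunk = seq.subStr(start, end);
            // Split the chunk into output lines using plain String operations.
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int lineEnd = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, lineEnd));
            }
        }
    }

    public static void main(String[] args) {
        // Tiny values so the chunking is visible; real values would be far larger.
        write("demo", of("ACGTACGTAC"), System.out, 4, 2);
    }
}
```

With a line width of 60 and, say, 1000 lines per chunk, this would hold about 60,000 characters as a String at a time, whatever the sequence length.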
Do you have any idea where the memory leak might be? I'd be
interested to track it down. What sort of sequences were you using?
Thomas