[Biojava-dev] [Biojava-l] How to parse large Genbank files?

Mon Jul 27 22:52:44 EDT 2009

Dear Paolo -

Calling the garbage collector is generally not required and often not
recommended. Modern JVMs do a better job of this than programmers do.
Also, a garbage collector cannot release memory allocated to objects
that are still referenced. I suspect the problem here is that objects
are being copied while references to the old copies are retained.
These old copies are no longer needed, so the references can be set
to null, which will allow the GC to clean them up.
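
To illustrate the point about retained references, here is a minimal
sketch. SequenceHolder is a made-up class, not part of BioJava, and the
whitespace stripping only stands in for whatever conversion the parser
really does. The idea is simply that the old copy is dropped as soon as
the new one exists:

```java
import java.util.Arrays;

/** Hypothetical parser state, not a BioJava class: holds one sequence in two forms. */
final class SequenceHolder {
    private StringBuilder rawText; // text as read from the file, whitespace included
    private char[] residues;       // compact copy built from rawText on first request

    SequenceHolder(String fileText) {
        this.rawText = new StringBuilder(fileText);
    }

    /** Builds the compact copy once, then drops the reference to the raw text. */
    char[] residues() {
        if (residues == null) {
            char[] buf = new char[rawText.length()];
            int n = 0;
            for (int i = 0; i < rawText.length(); i++) {
                char c = rawText.charAt(i);
                if (!Character.isWhitespace(c)) {
                    buf[n++] = c;
                }
            }
            residues = Arrays.copyOf(buf, n);
            // Nothing in this object refers to the old copy any more,
            // so the GC is free to reclaim it without being asked.
            rawText = null;
        }
        return residues;
    }

    /** True while the raw text is still being kept alive by this object. */
    boolean holdsRawText() {
        return rawText != null;
    }
}
```

If the field were left pointing at the raw text, both copies would stay
in memory for as long as the holder itself is reachable, no matter how
often the GC runs.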

Also, manually calling the GC is very aggressive. It typically forces a
full collection, which can also unload classes that are no longer in
use; when such a class is called again, the classloader will need to
reload it, which can result in a performance hit.

- Mark

On Tue, Jul 28, 2009 at 12:47 AM, Paolo Pavan<paolo.pavan at gmail.com> wrote:
> Calling the garbage collector between the steps wouldn't achieve
> anything, would it?
>
> 2009/7/27 Richard Holland <holland at eaglegenomics.com>:
>>
>>> My question to this list again:
>>> Is there a way to achieve my goal of parsing a 200MB Genbank file with the
>>> current biojava version without code changes?
>>
>> Probably not. The internal requirement to convert everything into
>> SymbolLists and back again really does get in the way. This is one of
>> the main drivers behind BioJava3 - to refactor out unnecessary
>> complexity, of which this is a prime example.
>>
>> The ideal solution would be to parse the file and keep the sequence as a
>> string, only to be converted into Symbols when _absolutely necessary_ -
>> otherwise to remain as a string (or even just as a pointer to a string
>> stored on a disk-based temporary file repository somewhere, to save
>> memory). Hibernate et al could then work directly with the string.
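
A rough sketch of the "keep it as a string, convert only on demand" idea
described above. LazySequence and its converter argument are invented
for illustration; they are not BioJava or BioJava3 API, and the
converter merely stands in for a real String-to-SymbolList parser:

```java
import java.util.function.Function;

/** Hypothetical wrapper, not a BioJava class: keeps the sequence as a String. */
final class LazySequence<T> {
    private final String text;
    private final Function<String, T> converter; // stand-in for a symbol parser
    private T symbols;                           // stays null until first requested
    private int conversions;                     // how often the converter has run

    LazySequence(String text, Function<String, T> converter) {
        this.text = text;
        this.converter = converter;
    }

    /** Returns the stored string; no conversion takes place. */
    String seqString() {
        return text;
    }

    /** Length is answered from the string as well. */
    int length() {
        return text.length();
    }

    /** Converts on first use and caches the result (not thread-safe). */
    T symbols() {
        if (symbols == null) {
            symbols = converter.apply(text);
            conversions++;
        }
        return symbols;
    }

    int conversionCount() {
        return conversions;
    }
}
```

For example, new LazySequence<char[]>("acgt", String::toCharArray) can be
stored or written out via seqString() without the converter ever running.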
>>
>> cheers,
>> Richard
>>
>>>
>>> - Florian
>>>
>>>
>>>
>>> > On 25 Jul 2009, 1:33 AM, "Florian Mittag" <florian.mittag at uni-tuebingen.de>
>>> > wrote:
>>> >
>>> > Hi!
>>> >
>>> > I think this is a problem worthy of its own thread, so I'll start one:
>>> >
>>> > I want to store all human chromosomes in a BioSQL database after loading
>>> > the information from .gbk files. I get the files from NCBI using URIs of
>>> > the following form, where the id ranges from nc_000001 to nc_000024 plus
>>> > nc_001804:
>>> >
>>> > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
>>> >
>>> > I then try to parse the files as described in
>>> > http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
>>> > but it won't work. While there are no problems parsing 1804 and 24,
>>> > chromosome 23 leads to an OutOfMemoryError although I gave it 2GB of heap
>>> > space.
>>> >
>>> > Here is a stack trace (the line numbers might differ, because I already
>>> > tried to improve the memory efficiency of GenbankFormat.java):
>>> >
>>> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>> >        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
>>> >        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
>>> >        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
>>> >        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
>>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
>>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
>>> >
>>> > The line in GenbankFormat.java is:
>>> >
>>> > rlistener.addSymbols(
>>> >        symParser.getAlphabet(),
>>> >        (Symbol[])(sl.toList().toArray(new Symbol[0])),
>>> >        0, sl.length());
>>> >
>>> > Sometimes it fails at the sl.toList().toArray() part, sometimes it fails
>>> > later inside the addSymbols method, but it always fails.
>>> >
>>> > How can this be? I mean, the file is only 190MB in size, so 2GB of memory
>>> > should be more than enough. Browsing through the source code, I discovered
>>> > what I think of as very inefficient handling of sequences:
>>> >
>>> > 1) the sequence string is read from file into a StringBuffer
>>> > 2) it is converted to a string (with whitespaces removed)
>>> > 3) a SimpleSymbolList is created out of the string
>>> > 4) the SymbolList is converted to a List of Symbols
>>> > 5) the List is converted to an array of Symbols
>>> > 6) the array is passed to addSymbols
>>> > 7) there it is added to a ChunkedSymbolListFactory
>>> > 8) if at some point the sequence is requested, a SymbolList is created and
>>> > then converted to a string.
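
For contrast with the eight steps listed above, here is a sketch of how
the sequence lines could be streamed in small chunks so that no
full-length intermediate copy is ever built. ChunkListener and
ChunkedSequenceReader are hypothetical names, not BioJava classes, and
the sketch assumes the Reader is already positioned on the sequence
lines, where everything that is not a letter (whitespace, position
numbers) can be skipped:

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical callback, loosely analogous to a sequence builder's addSymbols. */
interface ChunkListener {
    /** The buffer is reused between calls, so consume or copy it before returning. */
    void addChunk(char[] buf, int length);
}

/** Hypothetical helper, not part of BioJava. */
final class ChunkedSequenceReader {

    /**
     * Passes residues to the listener in chunks of at most chunkSize characters
     * and returns the total number of residues seen. chunkSize must be positive.
     */
    static long stream(Reader in, int chunkSize, ChunkListener listener) throws IOException {
        char[] inBuf = new char[8192];
        char[] chunk = new char[chunkSize];
        int filled = 0;
        long total = 0;
        int read;
        while ((read = in.read(inBuf)) != -1) {
            for (int i = 0; i < read; i++) {
                char c = inBuf[i];
                if (!Character.isLetter(c)) {
                    continue; // skip whitespace and position numbers
                }
                chunk[filled++] = c;
                if (filled == chunkSize) {
                    listener.addChunk(chunk, filled);
                    total += filled;
                    filled = 0;
                }
            }
        }
        if (filled > 0) { // flush the last, partially filled chunk
            listener.addChunk(chunk, filled);
            total += filled;
        }
        return total;
    }
}
```

Memory use here is bounded by the two fixed buffers plus whatever the
listener decides to keep, rather than by several copies of the whole
chromosome.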
>>> >
>>> > You see, there is a lot of copying and converting, but in the end I have
>>> > the same string I started with. Or rather, I would have it if the process
>>> > ever reached the end, but it crashes before completing.
>>> >
>>> >
>>> > Am I doing something wrong, or is there great potential for improving the
>>> > parsing of Genbank files?
>>> >
>>> >
>>> > Regards,
>>> >   Florian
>>> > _______________________________________________
>>> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>