[Biojava-dev] [Biojava-l] How to parse large Genbank files?

Mon Jul 27 16:47:46 UTC 2009

Calling the garbage collector between the steps wouldn't achieve
anything, would it?

2009/7/27 Richard Holland <holland at eaglegenomics.com>:
>
>> My question to this list again:
>> Is there a way to achieve my goal of parsing a 200MB Genbank file with the
>> current biojava version without code changes?
>
> Probably not. The internal requirement to convert everything into
> SymbolLists and back again really does get in the way. This is one of
> the main drivers behind BioJava3 - to refactor out unnecessary
> complexity, of which this is a prime example.
>
> The ideal solution would be to parse the file and keep the sequence as a
> string, only to be converted into Symbols when _absolutely necessary_ -
> otherwise to remain as a string (or even just as a pointer to a string
> stored on a disk-based temporary file repository somewhere, to save
> memory). Hibernate et al could then work directly with the string.
>
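A minimal sketch of the "keep it as a string until absolutely necessary" idea, in plain Java rather than against the BioJava API. The class name `LazySequence`, its methods, and the pluggable converter are all hypothetical; in a real setup the converter would be whatever turns a string into a SymbolList.

```java
import java.util.function.Function;

// Hypothetical holder (not a BioJava class): keeps the raw sequence text and
// runs the expensive conversion only when the symbol form is first requested.
// Not thread-safe; a converter that returns null would be invoked again.
final class LazySequence<T> {
    private final String raw;                     // sequence exactly as parsed
    private final Function<String, T> converter;  // assumed string-to-symbols step
    private T converted;                          // stays null until first use
    private int conversions = 0;                  // only here to make laziness visible

    LazySequence(String raw, Function<String, T> converter) {
        this.raw = raw;
        this.converter = converter;
    }

    // Cheap accessors that never trigger a conversion.
    String asString() { return raw; }
    int length() { return raw.length(); }

    // Converts on first call and caches the result for later calls.
    T asSymbols() {
        if (converted == null) {
            converted = converter.apply(raw);
            conversions++;
        }
        return converted;
    }

    int conversionCount() { return conversions; }
}
```

Code that only stores or writes the sequence would call `asString()` and never pay for the conversion; the disk-backed variant Richard mentions could sit behind the same two accessors.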
> cheers,
> Richard
>
>>
>> - Florian
>>
>>
>>
>> > On 25 Jul 2009, 1:33 AM, "Florian Mittag" <florian.mittag at uni-tuebingen.de>
>> > wrote:
>> >
>> > Hi!
>> >
>> > I think this is a problem worthy of its own thread, so I'll start one:
>> >
>> > I want to store all human chromosomes in a BioSQL database after loading the
>> > information from .gbk files. I get the files from NCBI using the following
>> > URIs, where the id ranges from nc_000001 to nc_000024 plus nc_001804:
>> >
>> > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=nc_000023&rettype=gbwithparts&retmode=text
>> >
>> > I then try to parse the files as described in
>> > http://biojava.org/wiki/BioJava:BioJavaXDocs#Tools_for_reading.2Fwriting_files
>> > but it won't work. While there are no problems parsing 1804 and 24, chromosome
>> > 23 leads to an OutOfMemoryError although I gave it 2GB of heap space.
>> >
>> > Here is a stack trace (the line numbers might differ, because I already tried
>> > to improve the memory efficiency of GenbankFormat.java):
>> >
>> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> >        at org.biojava.bio.seq.io.ChunkedSymbolListFactory.addSymbols(ChunkedSymbolListFactory.java:222)
>> >        at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addSymbols(SimpleRichSequenceBuilder.java:256)
>> >        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:535)
>> >        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.updateChromosome(UpdateDB_Main.java:537)
>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.newGenome(UpdateDB_Main.java:468)
>> >        at org.prodge.sequence_viewer.db.UpdateDB_Main.main(UpdateDB_Main.java:164)
>> >
>> > The line in GenbankFormat.java is:
>> >
>> > rlistener.addSymbols(
>> >        symParser.getAlphabet(),
>> >        (Symbol[])(sl.toList().toArray(new Symbol[0])),
>> >        0, sl.length());
>> >
>> > Sometimes it fails at the sl.toList().toArray() part, sometimes it fails later
>> > inside the addSymbols method, but it always fails.
>> >
>> > How can this be? I mean, the file is only 190MB in size, so 2GB of memory
>> > should be more than enough. Browsing through the source code, I discovered
>> > what I consider very inefficient handling of sequences:
>> >
>> > 1) the sequence string is read from file into a StringBuffer
>> > 2) it is converted to a string (with whitespace removed)
>> > 3) a SimpleSymbolList is created out of the string
>> > 4) the SymbolList is converted to a List of Symbols
>> > 5) the List is converted to an array of Symbols
>> > 6) the array is passed to addSymbols
>> > 7) there it is added to a ChunkedSymbolListFactory
>> > 8) if at some point the sequence is requested, a SymbolList is created and
>> > then converted to a string.
>> >
>> > You see, there is a lot of copying and converting, but in the end I have
>> > the same string I started with. Or rather, I would have it if the process
>> > ever reached the end, but it crashes before completing.
>> >
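A rough back-of-the-envelope estimate of why 2GB can run out, given the copy chain listed above. Everything numeric here is an assumption rather than a measurement: about 150 million bases in the record, 2 bytes per Java char, 4 or 8 bytes per object reference depending on the JVM, and two character copies plus two per-base reference arrays alive at the same time. How BioJava stores symbols internally is not modelled.

```java
// Hypothetical helper for the arithmetic only; not part of BioJava.
final class SequenceMemoryEstimate {

    // Bytes needed to hold n bases as Java chars (UTF-16, 2 bytes each).
    static long charBytes(long n) {
        return 2L * n;
    }

    // Bytes needed for an array holding one object reference per base.
    static long referenceArrayBytes(long n, int bytesPerReference) {
        return (long) bytesPerReference * n;
    }

    // Assumed worst case: two char copies (buffer + string) and two
    // reference arrays (symbol list + Symbol[]) live at once.
    static long pipelineBytes(long n, int bytesPerReference) {
        return 2 * charBytes(n) + 2 * referenceArrayBytes(n, bytesPerReference);
    }

    public static void main(String[] args) {
        long bases = 150_000_000L; // assumed round figure, not read from the file
        System.out.println("4-byte refs: " + pipelineBytes(bases, 4) / 1_000_000 + " MB");
        System.out.println("8-byte refs: " + pipelineBytes(bases, 8) / 1_000_000 + " MB");
    }
}
```

Under these assumptions the chain already needs about 1.8GB with 4-byte references, before object headers, buffer growth, or anything else on the heap is counted, so failing somewhere around the toArray() call is plausible.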
>> >
>> > Am I doing something wrong, or is there great potential for improving the
>> > parsing of Genbank files?
>> >
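One way the first steps of that chain could be shortened, sketched in plain Java and not taken from GenbankFormat.java: read the sequence lines once and keep only the letters, appending into a single pre-sized buffer. The class `OriginReader` and the `expectedLength` parameter are hypothetical; the sketch assumes the reader is already positioned just after the ORIGIN line and that the length is known in advance (for example from the LOCUS line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical one-pass reader for the sequence block of a GenBank record.
final class OriginReader {

    // Reads lines until the "//" record terminator, dropping position
    // numbers and spaces so that no intermediate whitespace-laden copy exists.
    static StringBuilder readSequence(Reader in, int expectedLength) {
        // Pre-sizing avoids repeated buffer growth and the copies it causes.
        StringBuilder seq = new StringBuilder(expectedLength);
        try {
            BufferedReader br = new BufferedReader(in);
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("//")) {
                    break; // end of this record
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (Character.isLetter(c)) {
                        seq.append(c); // keep bases, skip digits and blanks
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seq;
    }
}
```

This only addresses steps 1 and 2; the later conversions into symbol arrays would still dominate unless they are avoided or deferred as well.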
>> >
>> > Regards,
>> >   Florian
>> > _______________________________________________
>> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>