[Biojava-dev] Bug when reading FASTA file with many DNA Sequences
Scooter Willis
HWillis at scripps.edu
Wed Feb 9 15:50:28 UTC 2011
Dan
I have attached a copy of my biojava3-core that has that method in it, as
well as other memory/speed optimizations I worked on. It sounds like that
method (recently added) hasn't made its way into the current biojava3 jars.
You should see a dramatic reduction in memory if you only need to select a
subset of sequences. Trying to load a 245MB FASTA file does take lots of
memory. If you plan on reading every sequence, you will eventually run
into a memory problem, because I am currently not freeing the sequence data
that is loaded lazily. My plan is to add optimization hints/logic that the
developer can control, so that each time you load a new sequence and use
more memory, I internally free sequence data that was allocated earlier.
If you go back to a sequence whose storage has been deallocated, I will
simply reload it. This way you can work with very large sequence files at
genome scale without running out of memory or being forced to put them in
a database.
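
A minimal sketch of the free-and-reload idea described above, assuming a
simple least-recently-used policy. LazySequenceCache, its loader callback,
and the maxLoaded limit are hypothetical names used only for illustration;
they are not part of biojava3, and the eventual implementation may look
quite different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration only: not a biojava3 class. */
class LazySequenceCache {
    private final int maxLoaded;                    // how many sequences may stay in memory
    private final Function<String, String> loader;  // (re)loads sequence data by name
    private final LinkedHashMap<String, String> loaded;
    int loadCount = 0;                              // number of loads and reloads so far

    LazySequenceCache(int maxLoaded, Function<String, String> loader) {
        this.maxLoaded = maxLoaded;
        this.loader = loader;
        // accessOrder=true orders entries from least to most recently used,
        // so the "eldest" entry is the one that was touched longest ago.
        this.loaded = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > LazySequenceCache.this.maxLoaded;
            }
        };
    }

    /** Returns the sequence data, reloading it if it was freed earlier. */
    String get(String name) {
        String data = loaded.get(name);
        if (data == null) {
            data = loader.apply(name);
            loadCount++;
            loaded.put(name, data); // may evict the least recently used entry
        }
        return data;
    }

    int loadedCount() {
        return loaded.size();
    }
}
```

In real use the loader would seek into the FASTA file and read one record,
so a freed sequence costs one extra disk read when it is revisited.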
Let me know if this works. If you need to analyze every sequence, I will
see if I can find some time to add the lazy-load memory management
features.
Thanks
Scooter
On 2/9/11 10:07 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
wrote:
>I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit. I double
>checked I was using the 64-bit version of java at runtime using JConsole.
> I'm also using biojava3, in case that makes a difference.
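
The same facts can also be checked from inside the application instead of
through JConsole. This is only a sketch: JvmInfo is a hypothetical class
name, and the sun.arch.data.model property is specific to HotSpot-style
JVMs, so the code falls back to "unknown" when it is absent.

```java
/** Hypothetical helper for illustration: reports heap limit and JVM data model. */
class JvmInfo {
    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** "32" or "64" on HotSpot-style JVMs, "unknown" if the property is not set. */
    static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("OS arch:      " + System.getProperty("os.arch"));
        System.out.println("Data model:   " + dataModel());
        System.out.println("Max heap MB:  " + maxHeapMb());
    }
}
```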
>
>I tried the FastaReaderHelper.readFastaDNASequence(File f) version of the
>method, but I still haven't found the lazySequenceLoad version. Same
>problem.
>
>Best regards,
>Dan
>
>>-----Original Message-----
>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>Sent: Wednesday, February 09, 2011 2:48 PM
>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>Sequences
>>
>>Dan
>>
>>I usually do 40MB DNA files with no problem. I will concat together and
>>test a 245MB version. What operating system and version of Java? 32-bit
>>or 64-bit? The GC should be able to keep up in lazySequenceLoad mode.
>>
>>You need to use File because an InputStream doesn't provide the ability
>>to seek randomly to an offset, as it could be an HTTP stream etc.
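
To illustrate why a File matters here, the sketch below records the byte
offset of each record in one pass and later seeks straight to a single
record with java.io.RandomAccessFile. FastaOffsetIndex and its methods are
hypothetical names for illustration and do not show how biojava3
implements lazy loading.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only: not a biojava3 class. */
class FastaOffsetIndex {

    /** Maps each header (text after '>') to the byte offset of its first sequence line. */
    static Map<String, Long> index(File fasta) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<String, Long>();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // After readLine() the file pointer sits just past the header line.
                    offsets.put(line.substring(1).trim(), raf.getFilePointer());
                }
            }
        }
        return offsets;
    }

    /** Seeks to the offset and reads sequence lines until the next header or end of file. */
    static String read(File fasta, long offset) throws IOException {
        StringBuilder sequence = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                sequence.append(line.trim());
            }
        }
        return sequence.toString();
    }
}
```

RandomAccessFile.readLine() is unbuffered, so indexing a file of several
hundred megabytes this way would be slow; a real implementation would
probably read through a buffer and track offsets itself.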
>>
>>Thanks
>>
>>Scooter
>>
>>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>wrote:
>>
>>>Hi Scooter,
>>>
>>>Thanks for the quick feedback.
>>>
>>>Unfortunately, memory isn't the issue. I set my JVM to use a 2500MB max
>>>heap (the most I can get away with on my machine) and still encountered
>>>the same problem. On a colleague's machine the max heap is essentially
>>>unbounded and he still gets the same error. It seems to be something to
>>>do with the garbage collector removing temporary items from memory
>>>rather than the maximum available memory.
>>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>>>
>>>Also, the FastaReaderHelper.readFastaDNASequence method ran into the same
>>>problem. I passed in an input stream rather than a file, but I don't
>>>think that should cause the problem, should it? Also, I couldn't find an
>>>overloaded variant with the lazySequenceLoad signature. Maybe I'm using
>>>an older version, but I couldn't find it in the biojava3 API docs either.
>>>
>>>Any other ideas?
>>>
>>>Best regards,
>>>Dan
>>>
>>>
>>>>-----Original Message-----
>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>Sequences
>>>>
>>>>Daniel
>>>>
>>>>You have two options. The first is to run java -Xmx2048m (plus the rest
>>>>of your parameters), and the out-of-memory error will go away. The
>>>>second is a helper method that reads the FASTA file and lazy-loads each
>>>>sequence when you request it. If you call this method
>>>>
>>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>>you will be able to load the entire FASTA file with minimal memory
>>>>requirements.
>>>>
>>>>Even though your FASTA file is X in size, when we load it into memory
>>>>each sequence position gets represented by a Java object, so the memory
>>>>footprint will be larger.
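
A rough worked estimate of that overhead, under an assumption this thread
does not confirm: that each sequence position is held as one object
reference in a list, pointing at a shared compound object. A reference
typically costs 4 or 8 bytes depending on the JVM, against 1 byte per
position in the file. FootprintEstimate is a hypothetical name for
illustration.

```java
/** Hypothetical back-of-the-envelope estimate: not a biojava3 class. */
class FootprintEstimate {

    /** Bytes needed if every sequence position holds one object reference. */
    static long referenceBytes(long positions, int bytesPerReference) {
        return positions * bytesPerReference;
    }

    public static void main(String[] args) {
        // Treats a 270MB file as roughly 270 million positions,
        // ignoring header lines and line breaks.
        long positions = 270000000L;
        System.out.println("4-byte refs: " + referenceBytes(positions, 4) / 1000000L + " MB");
        System.out.println("8-byte refs: " + referenceBytes(positions, 8) / 1000000L + " MB");
    }
}
```

Under that assumption the references alone would need roughly four to
eight times the file size, before counting list growth or the temporary
buffers used while parsing.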
>>>>
>>>>Let me know if you don't have that particular method in the jars you are
>>>>using. I am not sure what the latest jar release is. If you look in the
>>>>biojava3-genome module you will find examples of working with DNA
>>>>sequences to translate proteins etc., assuming you have CDS features to
>>>>map onto your sequences.
>>>>
>>>>Thanks
>>>>
>>>>Scooter
>>>>
>>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>wrote:
>>>>
>>>>>Hello,
>>>>>
>>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>>sequences and is around 270MB big. Each sequence starts like this:
>>>>>">SequenceName" followed by a linefeed. The actual DNA sequence data
>>>>>does contain a linefeed every 40 characters or so.
>>>>>
>>>>>I want to read the data into a LinkedHashMap object, similar to the
>>>>>example you specify in your cookbook:
>>>>>
>>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>>    new FastaReader<DNASequence, NucleotideCompound>(
>>>>>        inStream,
>>>>>        new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>>        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>>
>>>>>try {
>>>>>    genomeData = fastaReader.process();
>>>>>} catch (Exception ex) { }
>>>>>
>>>>>This works on some files but not the one containing the 4000 sequences.
>>>>>I get an exception generated by the JVM:
>>>>>
>>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>>    at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>>    at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>>    at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>>    at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>>    at workbench.MirCat.openFile(MirCat.java:283)
>>>>>    at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>>    at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>>    at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>>    at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>>    at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>>    at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>>    at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>>    at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>>    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>>    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>>    at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>>    at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>>    at java.awt.Component.processEvent(Component.java:6028)
>>>>>    at java.awt.Container.processEvent(Container.java:2041)
>>>>>    at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>>    at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>>    at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>>    at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>>    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>>    at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>>    at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>>    at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>    at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>>    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>>    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>>
>>>>>The amount of memory on the system isn't an issue. I tried this on a
>>>>>machine with 12GB of RAM. It seems to be an issue with the garbage
>>>>>collector getting tired of deleting temporary objects! Also, I noticed
>>>>>that although the file is less than 300MB, the amount of heap space
>>>>>used increases from 100MB to over 900MB in FastaReader.process before
>>>>>the exception occurs.
>>>>>
>>>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>>>
>>>>>Would it be possible for you guys to look into this and either produce a
>>>>>fix or suggest a workaround? Also, do you think there is some way to
>>>>>optimise the performance and memory usage of this process?
>>>>>
>>>>>Finally, I have a question about selectively loading sequences from a
>>>>>FASTA file, the idea being to reduce memory usage. Is it possible to do
>>>>>this using biojava? I.e., given a DNA sequence name, can we load only
>>>>>that sequence into memory? Or do we have to load the entire FASTA file
>>>>>into a LinkedHashMap each time?
>>>>>
>>>>>Thanks in advance for your help on this one,
>>>>>
>>>>>Best regards,
>>>>>Dr Daniel Mapleson (UEA)
>>>>>
>>>>>_______________________________________________
>>>>>biojava-dev mailing list
>>>>>biojava-dev at lists.open-bio.org
>>>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biojava3-core-3.0.1-SNAPSHOT.jar
Type: application/java-archive
Size: 255200 bytes
Desc: biojava3-core-3.0.1-SNAPSHOT.jar
URL: <http://lists.open-bio.org/pipermail/biojava-dev/attachments/20110209/6eb4db1e/attachment-0001.jar>