[Biojava-dev] Bug when reading FASTA file with many DNA Sequences
Mapleson Daniel Dr (CMP)
D.Mapleson at uea.ac.uk
Wed Feb 9 15:07:33 UTC 2011
I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit. I double-checked with JConsole that the 64-bit JVM is the one in use at runtime. I'm also using biojava3, in case that makes a difference.
I tried the FastaReaderHelper.readFastaDNASequence(File f) version of the method and hit the same problem. I still haven't found the lazySequenceLoad version.
Best regards,
Dan
>-----Original Message-----
>From: Scooter Willis [mailto:HWillis at scripps.edu]
>Sent: Wednesday, February 09, 2011 2:48 PM
>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>Sequences
>
>Dan
>
>I usually do 40MB DNA files with no problem. I will concat together and
>test a 245MB version. What operating system and version of Java? 32bit
>or 64bit? The GC should be able to keep up in lazySequenceLoad mode.
>
>You need to use File because an InputStream doesn't provide the ability
>to random seek based on an offset, as it could be an HTTP stream etc.
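
A minimal sketch of the random-seek idea described above, using only the JDK. This is not BioJava's implementation; the class FastaOffsetIndex and its methods are hypothetical names chosen for illustration. It records the byte offset at which each record's sequence data starts, then seeks back to one offset on demand, which is exactly what a plain InputStream cannot do.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of BioJava): indexes a FASTA file by byte offset. */
final class FastaOffsetIndex {

    /** Maps each header (text after '>') to the offset of the line that follows it. */
    static Map<String, Long> build(String path) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<String, Long>();
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // After readLine() the file pointer is at the start of the sequence data.
                    offsets.put(line.substring(1).trim(), raf.getFilePointer());
                }
            }
        } finally {
            raf.close();
        }
        return offsets;
    }

    /** Seeks to a stored offset and reads residues up to the next header or end of file. */
    static String load(String path, long offset) throws IOException {
        StringBuilder sequence = new StringBuilder();
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                sequence.append(line.trim());
            }
        } finally {
            raf.close();
        }
        return sequence.toString();
    }
}
```

Only the names and offsets stay in memory between calls. RandomAccessFile.readLine() is unbuffered, so a real index pass over a few hundred MB would want a buffered reader that tracks positions itself.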
>
>Thanks
>
>Scooter
>
>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>wrote:
>
>>Hi Scooter,
>>
>>Thanks for the quick feedback.
>>
>>Unfortunately, the memory isn't the issue. I set my JVM to use 2500MB max
>>heap (the most I can get away with on my machine), and still encountered
>>the same problem. On a colleague's machine he has the max heap
>>essentially unbounded and he still gets the same error. It seems to be
>>something to do with the garbage collector removing temporary items from
>>memory rather than max available memory.
>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
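
For context on the error itself: as far as I know, HotSpot raises "GC overhead limit exceeded" when nearly all of the run time (roughly 98%) goes into garbage collection while very little heap (under about 2%) is recovered, and the check can be switched off with -XX:-UseGCOverheadLimit. The sketch below, with a hypothetical class name HeapReport, uses only standard JDK calls to print the heap ceiling the running JVM really has and how much collecting it has done, which is one way to confirm that an -Xmx setting took effect inside the application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical diagnostic helper: reports heap usage and GC activity of this JVM. */
final class HeapReport {

    static String summary() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        long max = rt.maxMemory() / mb; // effective -Xmx ceiling as seen by the JVM
        StringBuilder sb = new StringBuilder();
        sb.append("heap used=").append(used).append("MB max=").append(max).append("MB");
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(" | ").append(gc.getName())
              .append(" collections=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Calling HeapReport.summary() just before and just after the FASTA load would show whether the heap is close to its ceiling when the collector starts thrashing.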
>>
>>Also the FastaReaderHelper.readFastaDNASequence method ran into the same
>>problem. I passed in an input stream rather than a file but I don't
>>think that should cause the problem, should it? Also I couldn't find an
>>overloaded variant with the lazySequenceLoad signature. Maybe I'm using
>>an older version but I couldn't find it in the biojava3 API docs either.
>>
>>Any other ideas?
>>
>>Best regards,
>>Dan
>>
>>
>>>-----Original Message-----
>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>Sequences
>>>
>>>Daniel
>>>
>>>You have two options. The first is to run java -Xmx2048m (the rest of
>>>your parameters) and the out of memory error will go away. The second is
>>>a Helper method that will read the fasta file and lazy load when you
>>>request a sequence. If you call this method
>>>
>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>you will be able to load the entire fasta file with minimal memory
>>>requirements.
>>>
>>>Even though your fasta file is X when we load it into memory each
>>>sequence position gets represented by a Java object so the memory
>>>footprint will be larger.
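
A back-of-the-envelope version of that point, as a sketch only. It assumes the in-memory sequence holds one object reference per base (an assumption about the storage, not something confirmed in this thread), that a reference costs 4 bytes with compressed pointers or 8 bytes on a plain 64-bit JVM, and that a 270MB file holds roughly 270 million bases. FootprintEstimate is a hypothetical name.

```java
/** Hypothetical estimate of the heap taken by one reference per residue. */
final class FootprintEstimate {

    /** Bytes needed for the references alone, ignoring list growth slack and headers. */
    static long referenceBytes(long residues, int bytesPerReference) {
        return residues * bytesPerReference;
    }

    public static void main(String[] args) {
        long residues = 270L * 1000 * 1000; // roughly one base per byte of a 270MB file
        System.out.println("4-byte references: " + referenceBytes(residues, 4) / 1000000 + " MB");
        System.out.println("8-byte references: " + referenceBytes(residues, 8) / 1000000 + " MB");
    }
}
```

Under those assumptions the references alone come to about 1080 MB or 2160 MB, before counting temporary StringBuilder buffers, so a heap of 2048MB to 2500MB could plausibly still be too tight for an eager load.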
>>>
>>>Let me know if you don't have that particular method in the jars you are
>>>using. Not sure of the latest release on jars. If you look in the
>>>biojava3-genome module you will find examples of working with the DNA
>>>sequences to translate proteins etc assuming you have CDS features to
>>>map onto your sequences.
>>>
>>>Thanks
>>>
>>>Scooter
>>>
>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>wrote:
>>>
>>>>Hello,
>>>>
>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>sequences and is around 270MB big. Each sequence starts like this:
>>>>">SequenceName" followed by a linefeed. The actual DNA sequence data
>>>>does contain a linefeed every 40 characters or so.
>>>>
>>>>I want to read in the data into a LinkedHashMap object, similar to the
>>>>example you specify in your cookbook:
>>>>
>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>    new FastaReader<DNASequence, NucleotideCompound>(
>>>>        inStream,
>>>>        new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>
>>>>try {
>>>>    genomeData = fastaReader.process();
>>>>} catch (Exception ex) { }
>>>>
>>>>This works on some files but not the one containing the 4000 sequences.
>>>>I get an exception generated by the JVM:
>>>>
>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>    at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>    at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>    at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>    at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>    at workbench.MirCat.openFile(MirCat.java:283)
>>>>    at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>    at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>    at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>    at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>    at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>    at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>    at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>    at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>    at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>    at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>    at java.awt.Component.processEvent(Component.java:6028)
>>>>    at java.awt.Container.processEvent(Container.java:2041)
>>>>    at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>    at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>    at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>    at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>    at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>    at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>    at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>    at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>
>>>>The amount of memory on the system isn't an issue. I tried this on a
>>>>machine with 12GB of RAM. It seems to be an issue with the garbage
>>>>collector getting tired of deleting temporary objects! Also I noticed
>>>>that although the file is less than 300MB, the actual amount of
>>>>heap space used increases from 100MB to over 900MB inside
>>>>FastaReader.process before the exception occurs.
>>>>
>>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>>
>>>>Would it be possible for you guys to look into this and either produce
>>>>a fix or suggest a workaround? Also do you think there is some way to
>>>>optimise the performance and memory usage of this process?
>>>>
>>>>Finally, I have a question about selectively loading sequences from a
>>>>FASTA file, the idea being to reduce memory usage. Is it possible to
>>>>do this using biojava? i.e. given a DNA sequence name, only load that
>>>>sequence into memory? Or do we have to load the entire FASTA file into
>>>>a LinkedHashMap each time?
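
One way the selective-loading question could be approached without any library support is a single streaming pass that keeps only the wanted record. The sketch below uses plain JDK I/O; it is not a BioJava API, and SingleSequenceReader is a hypothetical name. It assumes the name passed in matches the whole header text after '>'.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical helper: streams a FASTA file and keeps only one named record. */
final class SingleSequenceReader {

    /** Returns the residues of the record whose header equals name, or null if absent. */
    static String read(String path, String name) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            StringBuilder sequence = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (sequence != null) {
                        break; // next record reached, so the wanted one is complete
                    }
                    if (line.substring(1).trim().equals(name)) {
                        sequence = new StringBuilder();
                    }
                } else if (sequence != null) {
                    sequence.append(line.trim());
                }
            }
            return sequence == null ? null : sequence.toString();
        } finally {
            in.close();
        }
    }
}
```

Memory use is bounded by the one sequence requested, at the cost of rescanning the file on every lookup; repeated lookups would be better served by an offset index.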
>>>>
>>>>Thanks in advance for your help on this one,
>>>>
>>>>Best regards,
>>>>Dr Daniel Mapleson (UEA)
>>>>
>>>>_______________________________________________
>>>>biojava-dev mailing list
>>>>biojava-dev at lists.open-bio.org
>>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>