[Biojava-dev] Bug when reading FASTA file with many DNA Sequences

Thu Feb 10 15:15:36 UTC 2011

Hi Scooter,

Thanks for all your help yesterday; it was much appreciated.  Sorry to trouble you again, but while the modified jar you provided worked great for the file I was working with yesterday, I have to process some other files that contain ambiguous DNA nucleotides (particularly "Y").  I noticed that readFastaDNASequence hardcodes the use of DNACompoundSet rather than AmbiguityDNACompoundSet.  Is there any chance of getting an overloaded version of readFastaDNASequence that lets the caller choose between the ambiguous and unambiguous compound sets?
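To illustrate the kind of overload being asked for, here is a minimal sketch in plain Java (assuming Java 8 or later). It is not BioJava code: the class name, the method names and the two alphabet constants are hypothetical, and the alphabets are simply the four unambiguous bases versus the IUPAC nucleotide codes, which include "Y" (C or T).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not part of BioJava: the alphabet is a parameter instead of being hardcoded. */
final class CompoundSetSketch {
    /** The four unambiguous DNA bases. */
    static final String UNAMBIGUOUS_DNA = "ACGT";
    /** IUPAC nucleotide codes, including ambiguity codes such as Y (C or T). */
    static final String AMBIGUOUS_DNA = "ACGTRYSWKMBDHVN";

    /** Existing-style entry point: defaults to the unambiguous alphabet. */
    static Map<String, String> readFasta(String fastaText) {
        return readFasta(fastaText, UNAMBIGUOUS_DNA);
    }

    /** Overload that lets the caller pick the alphabet used for validation. */
    static Map<String, String> readFasta(String fastaText, String alphabet) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A header line starts a new record keyed by the sequence name.
                current = new StringBuilder();
                building.put(line.substring(1), current);
            } else {
                if (current == null) {
                    throw new IllegalArgumentException("Sequence data before first header");
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = Character.toUpperCase(line.charAt(i));
                    if (alphabet.indexOf(c) < 0) {
                        throw new IllegalArgumentException(
                                "Compound '" + c + "' is not in alphabet " + alphabet);
                    }
                    current.append(c);
                }
            }
        }
        // Preserve file order, as a LinkedHashMap does.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : building.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

With this shape, existing callers keep the one-argument method, while files containing "Y" are read with `CompoundSetSketch.readFasta(text, CompoundSetSketch.AMBIGUOUS_DNA)`.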

Best regards,
Dan

>-----Original Message-----
>From: Scooter Willis [mailto:HWillis at scripps.edu]
>Sent: Wednesday, February 09, 2011 4:47 PM
>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>Sequences
>
>Dan
>
>You can check out the code via Subversion to get the latest and greatest.
>Our goal is to have minimal changes in biojava3-core, but modules that
>depend on core will change more frequently. We use Maven for building. If
>you are not using Maven but use NetBeans, it should be easy to set up. It
>is easy for Eclipse as well, but I am not sure how much configuration is
>required. This way, if you have requirements, it is easier for me to make
>changes and check in code that you can then test.
>
>Scooter
>
>On 2/9/11 11:40 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>wrote:
>
>>Scooter,
>>
>>Yup, I'll keep that in mind going forward, although I suspect our use
>>case will involve going back to the sequence more than once in some
>>instances.
>>
>>However, readFastaDNASequence is pretty swift now, so we could manage this
>>to some extent by removing the hashmap and calling
>>readFastaDNASequence with lazySequenceLoad again if required.  Not ideal,
>>but it's a workaround for the case where we had to remove a particular
>>sequence from the hashmap to save memory, and then later realised we need
>>it again.
>>
>>Is there some place I can view the latest changes to the biojava jars on
>>your wiki?  I'd like to keep an eye on new functionality that gets added.
>>
>>Best regards,
>>Dan
>>
>>
>>
>>>-----Original Message-----
>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>Sent: Wednesday, February 09, 2011 4:27 PM
>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>Sequences
>>>
>>>Dan
>>>
>>>Glad that worked. If you only need to use a sequence once then you could
>>>remove it from the hashmap (allowing GC) and that should keep your memory
>>>low.
>>>
>>>Scooter
>>>
>>>On 2/9/11 11:22 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>wrote:
>>>
>>>>Thanks Scooter.  That's great.  This jar fixes the problem.  The file is
>>>>loaded really quickly and memory usage has decreased massively too.
>>>>
>>>>We will need to process each (or most) of the sequences in the file at some
>>>>stage, so the ability to free up memory containing sequences that aren't
>>>>currently being processed/used will be useful going forward with our
>>>>project.  It's not urgent though.  The main thing from my perspective is
>>>>that we can actually run the program, which your fix allows us to do.
>>>>
>>>>Thanks again for the quick turnaround!  Much appreciated! :)
>>>>
>>>>Best regards,
>>>>Dan
>>>>
>>>>>-----Original Message-----
>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>Sent: Wednesday, February 09, 2011 3:50 PM
>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>>Sequences
>>>>>
>>>>>Dan
>>>>>
>>>>>I have attached a copy of my biojava3-core that has that method in it, as
>>>>>well as other memory/speed optimizations I worked on. It sounds like that
>>>>>method (recently added) hasn't made its way into the current biojava3 jars.
>>>>>You should see a dramatic reduction in memory if you only need to select
>>>>>a subset of sequences. Trying to load a 245MB FASTA file does take lots
>>>>>of memory. If you plan on reading each sequence then you will eventually
>>>>>run into a memory problem, as I am currently not freeing up the sequence
>>>>>data that is loaded lazily. My plan is to add some optimization
>>>>>hints/logic that the developer can control, so that every time you load a
>>>>>new sequence and use more memory, I will internally free up sequence data
>>>>>that has been allocated. If you go back to a sequence that has had its
>>>>>storage deallocated then I will simply reload it. This way you can work
>>>>>with very large sequence files at a genome scale without running out of
>>>>>memory or being forced to put them in a database.
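The free-and-reload plan described in the quoted message can be sketched as a small bounded cache. This is only an illustration under stated assumptions (Java 8 or later): the class `ReloadingSequenceCache`, its `maxLoaded` limit and the `loader` function are hypothetical names, and the least-recently-used eviction policy is one possible choice, not necessarily what biojava3-core does or will do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: keep at most maxLoaded sequences in memory and reload evicted ones on demand. */
final class ReloadingSequenceCache {
    private final Function<String, String> loader;
    private final Map<String, String> loaded;
    private int loadCount = 0;

    ReloadingSequenceCache(final int maxLoaded, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry drops the sequence that was touched longest ago.
        this.loaded = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxLoaded;
            }
        };
    }

    /** Returns the sequence, loading (or reloading) it if it is not currently resident. */
    String get(String name) {
        String sequence = loaded.get(name);
        if (sequence == null) {
            sequence = loader.apply(name); // e.g. seek into the FASTA file and read one record
            loadCount++;
            loaded.put(name, sequence);
        }
        return sequence;
    }

    /** Number of times the loader was invoked, including reloads. */
    int loadCount() {
        return loadCount;
    }

    /** Number of sequences currently held in memory. */
    int residentCount() {
        return loaded.size();
    }
}
```

The trade-off is the usual one for such caches: a small `maxLoaded` keeps the heap bounded, at the cost of re-reading sequences that are revisited after eviction.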
>>>>>
>>>>>Let me know if this works and if you need to analyze every sequence, and
>>>>>I will see if I can find some time to add in the lazy-load memory
>>>>>management features.
>>>>>
>>>>>Thanks
>>>>>
>>>>>Scooter
>>>>>
>>>>>On 2/9/11 10:07 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>>wrote:
>>>>>
>>>>>>I'm running Windows 7 Enterprise 64-bit and Java 1.6 64-bit.  I double
>>>>>>checked I was using the 64-bit version of Java at runtime using JConsole.
>>>>>>I'm also using biojava3, in case that makes a difference.
>>>>>>
>>>>>>I tried the FastaReaderHelper.readFastaDNASequence(File f) version of
>>>>>>the method, but I still haven't found the lazySequenceLoad version.
>>>>>>Same problem.
>>>>>>
>>>>>>Best regards,
>>>>>>Dan
>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>>Sent: Wednesday, February 09, 2011 2:48 PM
>>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>>>>Sequences
>>>>>>>
>>>>>>>Dan
>>>>>>>
>>>>>>>I usually do 40MB DNA files with no problem. I will concat together
>>>>>>>and test a 245MB version. What operating system and version of Java?
>>>>>>>32bit or 64bit? The GC should be able to keep up in lazySequenceLoad
>>>>>>>mode.
>>>>>>>
>>>>>>>You need to use File because an InputStream doesn't provide the
>>>>>>>ability to random seek based on an offset as it could be an HTTP
>>>>>>>stream etc.
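The point about needing a File for random seeks can be illustrated with a small standalone sketch (assuming Java 7 or later). It is not the BioJava implementation: `LazyFastaIndexSketch`, `indexOffsets` and `loadSequence` are hypothetical names. One pass records the byte offset that follows each header, and a sequence is later read by seeking straight to its offset, which a plain InputStream cannot do.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not BioJava code: index header offsets once, then seek to load one record. */
final class LazyFastaIndexSketch {

    /** Maps each sequence name to the byte offset of the first line after its header. */
    static Map<String, Long> indexOffsets(File fasta) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            String line;
            // RandomAccessFile.readLine is unbuffered, so a real reader would
            // likely add its own buffering; this keeps the sketch short.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    offsets.put(line.substring(1).trim(), raf.getFilePointer());
                }
            }
        }
        return offsets;
    }

    /** Seeks to the given offset and reads sequence lines until the next header or end of file. */
    static String loadSequence(File fasta, long offset) throws IOException {
        StringBuilder sequence = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                sequence.append(line.trim());
            }
        }
        return sequence.toString();
    }
}
```

Only the name-to-offset map stays in memory; the sequence data itself is read when `loadSequence` is called for a given name.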
>>>>>>>
>>>>>>>Thanks
>>>>>>>
>>>>>>>Scooter
>>>>>>>
>>>>>>>On 2/9/11 9:38 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>>>>wrote:
>>>>>>>
>>>>>>>>Hi Scooter,
>>>>>>>>
>>>>>>>>Thanks for the quick feedback.
>>>>>>>>
>>>>>>>>Unfortunately, the memory isn't the issue.  I set my JVM to use 2500MB max
>>>>>>>>heap (the most I can get away with on my machine), and still encountered
>>>>>>>>the same problem.  On a colleague's machine the max heap is
>>>>>>>>essentially unbounded and he still gets the same error.  It seems to
>>>>>>>>be something to do with the garbage collector removing temporary items from
>>>>>>>>memory rather than max available memory.
>>>>>>>>http://stackoverflow.com/questions/1393486/what-means-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee
>>>>>>>>
>>>>>>>>Also the FastaReaderHelper.readFastaDNASequence method ran into the same
>>>>>>>>problem.  I passed in an input stream rather than a file but I don't
>>>>>>>>think that should cause the problem, should it?  Also I couldn't find an
>>>>>>>>overloaded variant with the lazySequenceLoad signature.  Maybe I'm using
>>>>>>>>an older version but I couldn't find it in the biojava3 API docs either.
>>>>>>>>
>>>>>>>>Any other ideas?
>>>>>>>>
>>>>>>>>Best regards,
>>>>>>>>Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>>-----Original Message-----
>>>>>>>>>From: Scooter Willis [mailto:HWillis at scripps.edu]
>>>>>>>>>Sent: Wednesday, February 09, 2011 1:31 PM
>>>>>>>>>To: Mapleson Daniel Dr (CMP); biojava-dev at lists.open-bio.org
>>>>>>>>>Subject: Re: [Biojava-dev] Bug when reading FASTA file with many DNA
>>>>>>>>>Sequences
>>>>>>>>>
>>>>>>>>>Daniel
>>>>>>>>>
>>>>>>>>>You have two options. The first is to run java -Xmx2048m (the rest of
>>>>>>>>>your parameters) and the out of memory error will go away. The second is
>>>>>>>>>a helper method that reads the FASTA file and lazy loads each sequence
>>>>>>>>>when you request it. If you call this method
>>>>>>>>>
>>>>>>>>>FastaReaderHelper.readFastaDNASequence(File f, boolean lazySequenceLoad)
>>>>>>>>>
>>>>>>>>>you will be able to load the entire FASTA file with minimal memory
>>>>>>>>>requirements.
>>>>>>>>>
>>>>>>>>>Even though your FASTA file is size X on disk, when we load it into
>>>>>>>>>memory each sequence position gets represented by a Java object, so the
>>>>>>>>>memory footprint will be larger.
>>>>>>>>>
>>>>>>>>>Let me know if you don't have that particular method in the jars you are
>>>>>>>>>using. Not sure of the latest release on jars. If you look in the
>>>>>>>>>biojava3-genome module you will find examples of working with the
>>>>>>>>>DNA sequences to translate proteins etc assuming you have CDS
>>>>>>>>>features to map onto your sequences.
>>>>>>>>>
>>>>>>>>>Thanks
>>>>>>>>>
>>>>>>>>>Scooter
>>>>>>>>>
>>>>>>>>>On 2/9/11 7:13 AM, "Mapleson Daniel Dr (CMP)" <D.Mapleson at uea.ac.uk>
>>>>>>>>>wrote:
>>>>>>>>>
>>>>>>>>>>Hello,
>>>>>>>>>>
>>>>>>>>>>I'm trying to read a FASTA file that contains just over 4000 DNA
>>>>>>>>>>sequences and is around 270MB big.  Each sequence starts like this:
>>>>>>>>>>">SequenceName" followed by a linefeed.  The actual DNA sequence
>>>>>>>>>>data does contain a linefeed every 40 characters or so.
>>>>>>>>>>
>>>>>>>>>>I want to read in the data into a LinkedHashMap object, similar to the
>>>>>>>>>>example you specify in your cookbook:
>>>>>>>>>>
>>>>>>>>>>FileInputStream inStream = new FileInputStream(genomeFile);
>>>>>>>>>>FastaReader<DNASequence, NucleotideCompound> fastaReader =
>>>>>>>>>>    new FastaReader<DNASequence, NucleotideCompound>(
>>>>>>>>>>        inStream,
>>>>>>>>>>        new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
>>>>>>>>>>        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
>>>>>>>>>>
>>>>>>>>>>try {
>>>>>>>>>>    genomeData = fastaReader.process();
>>>>>>>>>>} catch (Exception ex) { }
>>>>>>>>>>
>>>>>>>>>>This works on some files but not the one containing the 4000 sequences.
>>>>>>>>>>I get an exception generated by the JVM:
>>>>>>>>>>
>>>>>>>>>>Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>>>>>        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>>>>>>>>>        at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>>>>>>>>>        at org.biojava3.core.sequence.io.FastaReader.process(FastaReader.java:111)
>>>>>>>>>>        at workbench.Process_Hits_Patman.openFastaFile(Process_Hits_Patman.java:414)
>>>>>>>>>>        at workbench.Process_Hits_Patman.readFiles(Process_Hits_Patman.java:451)
>>>>>>>>>>        at workbench.MirCat.openFile(MirCat.java:283)
>>>>>>>>>>        at workbench.MainMDIWindow.openMenuItemActionPerformed(MainMDIWindow.java:252)
>>>>>>>>>>        at workbench.MainMDIWindow.access$000(MainMDIWindow.java:25)
>>>>>>>>>>        at workbench.MainMDIWindow$1.actionPerformed(MainMDIWindow.java:140)
>>>>>>>>>>        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>>>>>>>>>>        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>>>>>>>>>>        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>>>>>>>>>>        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>>>>>>>>>>        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>>>>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1225)
>>>>>>>>>>        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1266)
>>>>>>>>>>        at java.awt.Component.processMouseEvent(Component.java:6263)
>>>>>>>>>>        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>>>>>>>>>>        at java.awt.Component.processEvent(Component.java:6028)
>>>>>>>>>>        at java.awt.Container.processEvent(Container.java:2041)
>>>>>>>>>>        at java.awt.Component.dispatchEventImpl(Component.java:4630)
>>>>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2099)
>>>>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>>>>        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4574)
>>>>>>>>>>        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>>>>>>>>>>        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>>>>>>>>>>        at java.awt.Container.dispatchEventImpl(Container.java:2085)
>>>>>>>>>>        at java.awt.Window.dispatchEventImpl(Window.java:2475)
>>>>>>>>>>        at java.awt.Component.dispatchEvent(Component.java:4460)
>>>>>>>>>>        at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
>>>>>>>>>>        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>>>>>>>>>>        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>>>>>>>>>>
>>>>>>>>>>The amount of memory on the system isn't an issue.  I tried this on
>>>>>>>>>>a machine with 12GB of RAM.  It seems to be an issue with the
>>>>>>>>>>garbage collector getting tired of deleting temporary objects!
>>>>>>>>>>Also I noticed that although the file is less than 300MB, the actual
>>>>>>>>>>amount of heap space used increases from 100MB to over 900MB inside
>>>>>>>>>>FastaReader.process before the exception occurs.
>>>>>>>>>>
>>>>>>>>>>Unfortunately I can't share the FASTA file that is causing the problem.
>>>>>>>>>>
>>>>>>>>>>Would it be possible for you guys to look into this and either produce a
>>>>>>>>>>fix or suggest a workaround?  Also do you think there is some way to
>>>>>>>>>>optimise the performance and memory usage of this process?
>>>>>>>>>>
>>>>>>>>>>Finally, I have a question about selectively loading sequences from
>>>>>>>>>>a FASTA file.  The idea being to reduce memory usage.  Is it possible to
>>>>>>>>>>do this using biojava?  i.e. given a DNA sequence name, only load that
>>>>>>>>>>sequence into memory?  Or do we have to load the entire FASTA file into a
>>>>>>>>>>LinkedHashMap each time?
>>>>>>>>>>
>>>>>>>>>>Thanks in advance for your help on this one,
>>>>>>>>>>
>>>>>>>>>>Best regards,
>>>>>>>>>>Dr Daniel Mapleson (UEA)
>>>>>>>>>>
>>>>>>>>>>_______________________________________________
>>>>>>>>>>biojava-dev mailing list
>>>>>>>>>>biojava-dev at lists.open-bio.org
>>>>>>>>>>http://lists.open-bio.org/mailman/listinfo/biojava-dev