[Biojava-dev] Fasta parsing bug
Schreiber, Mark
mark.schreiber at agresearch.co.nz
Thu Apr 24 15:31:41 EDT 2003
Hi -
I have been slowly tracking down a bug in the reading of large FASTA files (10K+ sequences). The FastaFormat object sets a mark in a BufferedReader; later the mark cannot be reset, and an IOException is thrown.
A typical stack trace is:
java.io.IOException: Can't reset: Mark invalid parseStart=417 bytesRead=512
at org.biojava.bio.seq.io.FastaFormat.readSequenceData(FastaFormat.java:170)
at org.biojava.bio.seq.io.FastaFormat.readSequence(FastaFormat.java:120)
at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:100)
rethrown as org.biojava.bio.BioException: Could not read sequence
at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:103)
at utils.SeqLength.main(SeqLength.java:42)
although the specific values of parseStart and bytesRead depend on the buffer size of the BufferedReader.
Looking into the Java docs I found some hints about the size of the buffer. If you decrease the size of the buffer from the default of 8192, the errors occur in smaller files, or earlier in the file. I then started doubling the size of the buffer; once I got to 65536 I could read the largest FASTA library I had on my machine. This is a bit of a kludge, and it may point to an error in the bowels of the JVM itself rather than in biojava.
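The buffer-size dependence described above can be reproduced outside biojava with nothing but java.io. The sketch below is only an illustration, not the FastaFormat code: the class and method names (MarkResetDemo, resetSurvives) and all the sizes are made up for the example. It relies on the documented contract of BufferedReader.mark(readAheadLimit), which says that reset() may fail once readAheadLimit or more characters have been read past the mark. In the common OpenJDK-derived implementations the mark appears to be dropped only when the buffer has to be refilled, which would explain why a larger buffer makes the failure rarer rather than impossible.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkResetDemo {

    /**
     * Reads charsBeforeMark characters, sets a mark with the given
     * readAheadLimit, reads charsAfterMark more characters, then tries
     * to reset. Returns true if reset() succeeded, false if it threw.
     * All parameters are illustrative; none come from biojava.
     */
    public static boolean resetSurvives(int bufferSize, int readAheadLimit,
                                        int charsBeforeMark, int charsAfterMark)
            throws IOException {
        // 200 characters of dummy sequence data, more than any case below reads.
        StringBuilder data = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            data.append("ACGT".charAt(i % 4));
        }
        BufferedReader in =
                new BufferedReader(new StringReader(data.toString()), bufferSize);

        for (int i = 0; i < charsBeforeMark; i++) {
            in.read();
        }
        in.mark(readAheadLimit);
        for (int i = 0; i < charsAfterMark; i++) {
            in.read();
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            // The mark was invalidated before reset() was called.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Small buffer, small limit: the 10 characters read after the mark
        // cross a buffer boundary and exceed the limit of 4.
        System.out.println("buffer=16 limit=4  -> " + resetSurvives(16, 4, 12, 10));
        // Same limit, larger buffer: no refill happens between mark and reset.
        System.out.println("buffer=64 limit=4  -> " + resetSurvives(64, 4, 12, 10));
        // Small buffer, but a limit that covers everything read after the mark.
        System.out.println("buffer=16 limit=32 -> " + resetSurvives(16, 32, 12, 10));
    }
}
```

If this is what is happening in FastaFormat, the robust fix would be a readAheadLimit at least as large as the longest stretch read between mark() and reset(), rather than a larger buffer, since only the former is guaranteed by the API contract.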
This was on Windows XP with biojava-live and Java build 1.4.1_02-b06, but I think others have been periodically bugged by this as well; I am not sure which OS etc. they were using.
Is there a way to avoid using the mark/reset paradigm in FastaFormat?
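One possible direction, sketched here purely as an illustration and not as how FastaFormat is or should be written: read whole lines and, when the header of the next record is encountered, keep that line in a field instead of rewinding the stream. The class name LookaheadFastaReader and its next() method are hypothetical, and the sketch assumes the reader object owns the stream across calls, which may not fit an interface where the BufferedReader is passed in on every call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class LookaheadFastaReader {
    private final BufferedReader in;
    // Header line of the next record, already consumed from the stream
    // while finishing the previous record. Replaces mark()/reset().
    private String pendingHeader;

    public LookaheadFastaReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /**
     * Returns {description, residues} for the next record,
     * or null when the input is exhausted.
     */
    public String[] next() throws IOException {
        String header = pendingHeader;
        pendingHeader = null;

        // No saved header: skip forward to the first line starting with '>'.
        while (header == null) {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            if (line.startsWith(">")) {
                header = line;
            }
        }

        // Collect sequence lines until the next header or end of input.
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                pendingHeader = line; // remember it instead of rewinding
                break;
            }
            residues.append(line.trim());
        }
        return new String[] { header.substring(1).trim(), residues.toString() };
    }
}
```

The trade-off is that the look-ahead state lives in the reader object rather than in the stream, so anything else reading from the same underlying stream between records would miss the saved header line.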
- Mark