[Biojava-l] FASTA parsing bug ?
Richard Holland
holland at eaglegenomics.com
Wed Apr 29 09:49:58 UTC 2009
I'd love to see a proper solution to this that doesn't involve upping
the read-ahead limit. I was aware that it might be the issue, but had no
idea why it was not failing for other similar long sequences. I look
forward to seeing your suggested fix!
thanks,
Richard
Josh Goodman wrote:
> Hi Richard and JP,
>
> I think I can be of some help as I'm the FlyBase developer responsible for
> generating these troublesome FASTA files :-). The cause of this problem
> appears to be the description line length for the record FBpp0145470.
>
> The trouble lies in org.biojavax.bio.seq.io.FastaFormat in the while loop
> at line 196. Biojava correctly reads in FBpp0145468 but throws an error
> when trying to parse FBpp0145469. There is nothing wrong in FBpp0145469
> but when biojava reaches the end of the sequence it reads in the header
> for the next record (FBpp0145470). It then tries to reset the
> BufferedReader to the start of FBpp0145470 but that is where the exception
> is thrown because line 197 sets the read ahead limit to 500 characters and
> the reader.readLine() command exceeds that limit.
>
> What isn't obvious to me is why other large definition lines that precede
> that line don't throw the same error (e.g. FBpp0157909). I guess the
> javadoc on BufferedReader.mark() does say "may fail" but I assumed it
> would be more predictable than that.
>
> The file in question can be downloaded from
> ftp://ftp.flybase.net/genomes/Drosophila_grimshawi/dgri_r1.3_FB2008_07/fasta/dgri-all-translation-r1.3.fasta.gz.
>
> If there is interest in a solution that doesn't involve simply upping the
> read ahead limit I can put a patch file together in the next day or so.
>
> Cheers,
> Josh
>
> On Tue, 28 Apr 2009, Richard Holland wrote:
>
>> You're right, doesn't look like newlines.
>>
>> The "Mark invalid" happens when the parser looks too far ahead in the
>> file attempting to seek out the next valid sequence to parse. I'm not
>> sure why this is happening.
>>
>> I don't have the time to test right now but if you could post the link
>> to where someone could download the same FASTA as you're using, then it
>> would make it possible for someone else to investigate in more detail.
>>
>> thanks,
>> Richard
>>
>> JP wrote:
>>> Thanks Richard for your prompt reply.
>>>
>>> I will not attach the fasta file I am parsing (12MB) its
>>> dgri-all-translation-r1.3.fasta from the flybase project.
>>>
>>> If the file had any extra new lines I would see them when I loaded it in
>>> a text editor - no ?
>>>
>>> I implemented the whole thing without using Biojava (for this part)
>>>
>>> fr = new FileReader(fastaProteinFileName);
>>> br = new BufferedReader(fr);
>>> String fastaLine;
>>> String startAccession = '>' + accessionId.trim();
>>> String fastaEntry = "";
>>> boolean record = false;
>>> while ((fastaLine = br.readLine()) != null) {
>>> fastaLine = fastaLine.trim() + '\n';
>>> if (fastaLine.startsWith(startAccession)) {
>>> record = true;
>>> } else if (record && fastaLine.startsWith(">")) {
>>> record = false;
>>> break;
>>> }
>>> if (record) {
>>> fastaEntry += fastaLine;
>>> }
>>> }
>>>
>>>
>>> Notice - I do not use regex - since I'd need to read the whole file and
>>> then regex upon it (if the record is the first one - I just read that one).
>>>
>>> Cheers
>>> JP
>>>
>>>
>>> On Tue, Apr 28, 2009 at 3:27 PM, Richard Holland
>>> <holland at eaglegenomics.com <mailto:holland at eaglegenomics.com>> wrote:
>>>
>>> The "Mark invalid" exception is indicating that the parser has gone too
>>> far ahead in the file looking for a valid header. I'm not sure why but
>>> looking at your original query, there may be extra newlines embedded
>>> into your FASTA header line? That would definitely confuse it.
>>>
>>> The parser is not able to currently pull out just one sequence - in
>>> effect this is a search facility, which it doesn't have. :(
>>>
>>> thanks,
>>> Richard
>>>
>>> JP wrote:
>>> > Hi all at BioJava,
>>> >
>>> > I am trying to parse several FASTA files using the following code:
>>> >
>>> > fr = new FileReader(fastaProteinFileName);
>>> >> br = new BufferedReader(fr);
>>> >>
>>> >> RichSequenceIterator protIter = IOTools.readFastaProtein(br, null);
>>> >> while (protIter.hasNext()) {
>>> >> BioEntry bioEntry = protIter.nextBioEntry();
>>> >> System.out.println (fastaProteinFileName + " == " +
>>> accessionId + " =
>>> >> " + bioEntry.getAccession());
>>> >> }
>>> >
>>> >
>>> > At particular points in my fasta file - I get the following exception:
>>> >
>>> > 14:53:42,546 ERROR FastaFileProcessing - File parsing exception (from
>>> >> biojava library)
>>> >> org.biojava.bio.BioException: Could not read sequence
>>> >> at
>>> >>
>>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>>> >> at
>>> >>
>>> org.biojavax.bio.seq.io.RichStreamReader.nextBioEntry(RichStreamReader.java:99)
>>> >> at
>>> >>
>>> edu.imperial.msc.orthologue.fasta.FastaFileProcessing.getProteinSequenceFromFASTAFile(FastaFileProcessing.java:60)
>>> >> at
>>> >>
>>> edu.imperial.msc.orthologue.core.OrthologueFinder.getFASTAEntries(OrthologueFinder.java:64)
>>> >> at
>>> >>
>>> edu.imperial.msc.orthologue.core.OrthologueFinder.<init>(OrthologueFinder.java:51)
>>> >> at
>>> >>
>>> edu.imperial.msc.orthologue.launcher.OrthologueFinderLauncher.main(OrthologueFinderLauncher.java:60)
>>> >> Caused by: java.io.IOException: Mark invalid
>>> >> at java.io.BufferedReader.reset(Unknown Source)
>>> >> at
>>> >>
>>> org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>>> >> at
>>> >>
>>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>> >> ... 5 more
>>> >
>>> >
>>> > Interestingly if I delete the header portion of the header line (from
>>> > type=protein... till the end of the line ...Dgri;)
>>> >
>>> >> FBpp0145468 type=protein;
>>> >>
>>> loc=scaffold_15252:join(13219687..13219727,13219972..13220279,13220507..13220798,13220861..13221180,13221286..13221467,13222258..13222629,13226331..13226463,13226531..13226658);
>>> >> ID=FBpp0145468; name=Dgri\GH11562-PA; parent=FBgn0119042,FBtr0146976;
>>> >> dbxref=FlyBase:FBpp0145468,FlyBase_Annotation_IDs:GH11562-PA;
>>> >> MD5=c8dc38c7197a0d3c93c78b08059e2604; length=591; release=r1.3;
>>> >> species=Dgri;
>>> >>
>>> >
>>> > It works - but I have a number of these exceptions (and I do not
>>> want to
>>> > edit the original data). Mind you I have longer headers in my
>>> file which
>>> > are parsed OK (strange!).
>>> >
>>> > Any ideas anyone ? Alternatively - is there a better way how to
>>> get ONE
>>> > SINGLE sequence from the whole fasta file give that I have the
>>> accession id
>>> > (FBpp0145468) ?
>>> >
>>> > Many Thanks
>>> > JP
>>> > _______________________________________________
>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> <mailto:Biojava-l at lists.open-bio.org>
>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>> >
>>>
>>> --
>>> Richard Holland, BSc MBCS
>>> Finance Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> <mailto:holland at eaglegenomics.com>
>>> http://www.eaglegenomics.com/
>>>
>>>
>> --
>> Richard Holland, BSc MBCS
>> Finance Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/
More information about the Biojava-l
mailing list