[Biojava-l] Error parsing ipi.HUMAN.fasta file

Mon Jan 11 14:34:21 UTC 2010

[ posting back to biojava-l as omitted the address previously ]

Ah, right. It wasn't clear on the wiki whether those were included or 
not with the 'all' package.

It compiles fine now (with one warning) and through trial and error a 
buffer value of 2000 works with ipi.HUMAN.fasta as well as mouse and 
chicken.
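
For anyone finding this thread later, the failure Richard describes below can be reproduced with plain java.io, without BioJava. This is only a minimal sketch: the class MarkResetDemo and the method canPeekLine are hypothetical names, and it does not reproduce what FastaFormat actually does internally. It marks a BufferedReader, reads one long header line ahead and then tries to reset. Note that the limit passed to mark() is counted in characters rather than bytes, and the small internal buffer of 64 is an assumption chosen purely to make the failure deterministic.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of BioJava.
class MarkResetDemo {

    /**
     * Marks the stream, reads one full line ahead, then tries to back up.
     * Returns true if reset() succeeded and the same line can be read again.
     */
    static boolean canPeekLine(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader br = new BufferedReader(new StringReader(text), bufferSize)) {
            br.mark(readAheadLimit);        // remember the current position
            String peeked = br.readLine();  // read ahead to the end of the line
            try {
                br.reset();                 // back up to the marked position
            } catch (IOException e) {
                // The mark was dropped because the read-ahead went past the limit
                return false;
            }
            return peeked != null && peeked.equals(br.readLine());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A header line of just over 600 characters, then a short sequence line.
        StringBuilder header = new StringBuilder(">");
        while (header.length() < 600) {
            header.append("ENSP00000382748;");
        }
        String text = header + "\nMLGLWGQRLPAAWVLLLLPF\n";

        System.out.println("limit 500:  " + canPeekLine(text, 64, 500));
        System.out.println("limit 2000: " + canPeekLine(text, 64, 2000));
    }
}
```

On an OpenJDK runtime the first call should report false and the second true. With the default 8192-character buffer, the mark appears to be dropped only when the buffer has to be refilled part-way through an over-long line, which would fit the observation that the same entry parses fine on its own or in a small file.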

Thanks very much for your help.

Chris

On 11/01/10 12:31, Richard Holland wrote:
> Hello. You need to make sure the support libraries are also on your classpath:
>
> http://www.biojava.org/wiki/BioJava:Download#Support_libraries
>
> cheers,
> Richard
>
> On 11 Jan 2010, at 12:16, Chris Cole wrote:
>
>> Thanks for the reply, Richard.
>>
>> Just getting back to this problem. I've upped the buffer to 1000 bytes, but I can't get it to compile with ant. I get a whole slew of compile errors; something seems to be missing, but I don't know how to solve it. Output from the ant build follows:
>>
>> caterpillar: ~/Downloads/biojava-1.7/src>  ant -f ../build.xml
>> Buildfile: ../build.xml
>>
>> init:
>>      [echo] Building biojava-1.7
>>      [echo] Java Home:                       /usr/java/jdk1.6.0_17/jre
>>      [echo] JUnit present:                   ${junit.present}
>>      [echo] JUnit supported by Ant:          ${junit.support}
>>      [echo] HSQLDB driver present:           ${sqlDriver.hsqldb}
>>      [echo] XSLT support:                    true
>>
>> prepare:
>>
>> prepare-biojava:
>>
>> compile-biojava:
>>     [javac] Compiling 1462 source files to /opt/Downloads/biojava-1.7/ant-build/classes/biojava
>>     [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:55: package org.biojava.utils.bytecode does not exist
>>     [javac] import org.biojava.utils.bytecode.ByteCode;
>>     [javac]                                  ^
>>     [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:56: package org.biojava.utils.bytecode does not exist
>>     [javac] import org.biojava.utils.bytecode.CodeClass;
>>     [javac]                                  ^
>>     [javac] /opt/Downloads/biojava-1.7/src/org/biojava/bio/dp/twohead/DPCompiler.java:57: package org.biojava.utils.bytecode does not exist
>>     [javac] import org.biojava.utils.bytecode.CodeException;
>>     [javac]                                  ^
>> ...etc.
>>
>> I originally downloaded biojava-1.7-all.jar, and I can't see what else I need.
>>
>> I'm also trying to do this from within Eclipse, so any Eclipse-specific pointers would be much appreciated.
>> Cheers,
>>
>> Chris
>>
>> On 18/12/09 16:58, Richard Holland wrote:
>>> The FASTA parser has a buffer which it uses to read ahead to the next
>>> complete line then back up before it actually parses it on the second
>>> pass (in order to allow it to do things like hasNext()). The
>>> exception shows that the size of that buffer is being exceeded,
>>> causing it to fail to back up again afterwards.
>>>
>>> There are two cures - one is to rewrite the FASTA parser to buffer
>>> things in a different way. The other is to open up
>>> org/biojavax/bio/seq/io/FastaFormat.java in a text editor, search for
>>> the line where it sets the buffer (somewhere around line 202
>>> according to the exception, in the readRichSequence() method - the
>>> command to look for is 'mark'), and increase the buffer size to
>>> something suitably large (it's currently set at 500 bytes).
>>> Then recompile BioJava and it should work.
>>>
>>> cheers, Richard
>>>
>>> On 18 Dec 2009, at 15:53, Chris Cole wrote:
>>>
>>>> I'm wanting to parse a fasta file obtained from IPI using the code
>>>> at the bottom of this message, but I get the following error:
>>>>
>>>> org.biojava.bio.BioException: Could not read sequence
>>>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>>>>     at test.readFasta(test.java:39)
>>>>     at test.main(test.java:18)
>>>> Caused by: java.io.IOException: Mark invalid
>>>>     at java.io.BufferedReader.reset(BufferedReader.java:485)
>>>>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>>>>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>>>>     ... 2 more
>>>>
>>>> Looking at the Fasta file itself and doing some tests, it seems to
>>>> fail consistently at one or two entries /preceding/ an entry with a
>>>> very long description line e.g.:
>>>> >IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
>>>> MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
>>>> ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
>>>> LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
>>>> DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
>>>> FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
>>>> LS
>>>>
>>>> Deleting the large entries allows the code to continue until it
>>>> reaches another long description line.
>>>>
>>>> It also seems to be a feature of large Fasta files as reading the
>>>> above sequence alone or as part of a small file is fine.
>>>>
>>>> Is this a known problem or am I doing something wrong? BTW I'm
>>>> using biojava 1.7 and Java 1.6.0_17. Any help would be most
>>>> appreciated. Cheers.
>>>>
>>>> code:
>>>>
>>>> import java.io.*;
>>>>
>>>> import org.biojava.bio.*;
>>>> import org.biojavax.*;
>>>> import org.biojavax.bio.seq.*;
>>>>
>>>> public class test {
>>>>     private static PrintStream o = System.out;
>>>>
>>>>     public static void main(String[] args) {
>>>>         // TODO Auto-generated method stub
>>>>         readFasta(args[0]);
>>>>     }
>>>>
>>>>     public static void readFasta(String filename) {
>>>>         try {
>>>>             o.println("Reading file: " + filename);
>>>>             // prepare a BufferedReader for file io
>>>>             BufferedReader br = new BufferedReader(new FileReader(filename));
>>>>
>>>>             // read Fasta file as BioJava RichSequence object
>>>>             Namespace ns = RichObjectFactory.getDefaultNamespace();
>>>>             RichSequenceIterator iter = RichSequence.IOTools.readFastaProtein(br, ns);
>>>>
>>>>             int numProteins = 0;
>>>>             while (iter.hasNext()) {
>>>>                 ++numProteins;
>>>>
>>>>                 // Retrieve sequence and description data
>>>>                 RichSequence seq = iter.nextRichSequence();
>>>>                 String ipi = seq.getName().substring(4, 15);
>>>>                 o.println(ipi);
>>>>             }
>>>>             o.println("Found " + numProteins + " in Fasta file");
>>>>         } catch (FileNotFoundException ex) {
>>>>             // can't find file specified by args[0]
>>>>             ex.printStackTrace();
>>>>         } catch (BioException ex) {
>>>>             // error parsing requested format
>>>>             ex.printStackTrace();
>>>>         }
>>>>     }
>>>> }
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

-- 
Dr Chris Cole
Senior Bioinformatics Research Officer
School of Life Sciences Research
University of Dundee
Dow Street
Dundee
DD1 5EH
Scotland, UK

url: http://network.nature.com/profile/drchriscole
e-mail: chris at compbio.dundee.ac.uk
Tel: +44 (0)1382 388 721

The University of Dundee is a registered Scottish charity, No: SC015096