[Biojava-l] reading peptides from a fasta file

Wed Nov 8 12:39:20 UTC 2006

Hi Sarah,

Just a few comments and guesses.
There may be other approaches, but you could follow the FASTA header
format that BioJavaX understands:

>gi|<identifier>|<namespace>|<accession>|<name> <description>

http://biojava.org/wiki/BioJava:BioJavaXDocs#Reading   (chapter 8.2.1)

So, if your file looked like this:

>gi|0|namespace|null|1 0.9992
ASITENGGAEEESVAK
>gi|1|namespace|null|1 0.9953
ASITENGGAEEESVAK
.
.
.

you could use:

    System.out.println("id: " + rich_seq.getIdentifier());
    System.out.println("rank: " + rich_seq.getName());
    System.out.println("probability: " + rich_seq.getDescription());
    System.out.println("sequence: " + rich_seq.seqString());

to get the data.
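
As a rough illustration of how the header fields above line up with the
getters, here is a minimal plain-Java sketch. It does not use BioJavaX and
is not its parser: the class FastaHeaderSketch, the Header holder and the
regular expression are made up for this example and only mirror the
documented layout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: mirrors the documented header layout, not BioJavaX's own parser. */
class FastaHeaderSketch {

    /** Hypothetical holder for the five header fields. */
    static final class Header {
        final String identifier;
        final String namespace;
        final String accession;
        final String name;
        final String description;

        Header(String identifier, String namespace, String accession,
               String name, String description) {
            this.identifier = identifier;
            this.namespace = namespace;
            this.accession = accession;
            this.name = name;
            this.description = description;
        }
    }

    // >gi|<identifier>|<namespace>|<accession>|<name> <description>
    private static final Pattern LAYOUT = Pattern.compile(
            "^>gi\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(\\S+)(?:\\s+(.*))?$");

    /** Splits one header line into its fields; the description may be absent. */
    static Header parse(String line) {
        Matcher m = LAYOUT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected header: " + line);
        }
        String description = (m.group(5) == null) ? "" : m.group(5);
        return new Header(m.group(1), m.group(2), m.group(3), m.group(4), description);
    }

    public static void main(String[] args) {
        Header h = parse(">gi|0|namespace|null|1 0.9992");
        System.out.println("id: " + h.identifier);           // corresponds to getIdentifier()
        System.out.println("rank: " + h.name);               // corresponds to getName()
        System.out.println("probability: " + h.description); // corresponds to getDescription()
    }
}
```

With this layout the rank ends up in the name field and the probability in
the description field, which is why the getters above are used that way.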

However, 'rank' and 'probability' are really annotations of the
sequence, so when processing the data (e.g. storing it in a database), you
would store them as annotations rather than as name and description.
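
To make the "annotations" idea concrete without relying on any particular
BioJavaX call, here is a small plain-Java sketch that reads the header
layout from the original question (>id|<id>|<probability>|<rank>) and keeps
every peptide in memory, keyed by id, with rank and probability as typed
fields. The names PeptideStoreSketch, Peptide and readAll are hypothetical,
and the parsing assumes exactly four '|'-separated fields per header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: a hand-written reader, independent of BioJavaX. */
class PeptideStoreSketch {

    /** Hypothetical value type: the sequence plus its two annotations. */
    static final class Peptide {
        final String id;
        final double probability;
        final int rank;
        final String sequence;

        Peptide(String id, double probability, int rank, String sequence) {
            this.id = id;
            this.probability = probability;
            this.rank = rank;
            this.sequence = sequence;
        }
    }

    /** Reads all entries into a map keyed by id, preserving file order. */
    static Map<String, Peptide> readAll(Reader source) {
        Map<String, Peptide> byId = new LinkedHashMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                if (line.startsWith(">")) {
                    // A new header closes the previous entry, if there was one.
                    if (header != null) {
                        store(byId, header, seq.toString());
                    }
                    header = line;
                    seq.setLength(0);
                } else {
                    // Sequence lines may be wrapped; concatenate them.
                    seq.append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (header != null) {
            store(byId, header, seq.toString());
        }
        return byId;
    }

    // Assumed header layout: >id|<id>|<probability>|<rank>
    private static void store(Map<String, Peptide> byId, String header, String sequence) {
        String[] f = header.substring(1).split("\\|");
        if (f.length != 4) {
            throw new IllegalArgumentException("Unexpected header: " + header);
        }
        byId.put(f[1], new Peptide(f[1], Double.parseDouble(f[2]),
                Integer.parseInt(f[3]), sequence));
    }

    public static void main(String[] args) {
        String fasta = ">id|0|0.9992|1\nASITENGGAEEESVAK\n"
                + ">id|2|0.9998|1\nASNASSAGDEVDNVATSSK\n";
        for (Peptide p : readAll(new StringReader(fasta)).values()) {
            System.out.println(p.id + " " + p.probability + " " + p.rank + " " + p.sequence);
        }
    }
}
```

A map keyed by id gives direct lookup of a peptide; if the ids are not
unique in your file, a List<Peptide> would be the safer container.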

- As for Java naming conventions, camelCase is recommended for
variables, e.g. richSeq instead of rich_seq.

- To get the alphabet:
System.out.println("alphabet: "+rich_seq.getAlphabet().getName());

- Also, you should probably pass the default namespace instead of null:
RichSequenceIterator rich_stream = 
RichSequence.IOTools.readFastaProtein(br,RichObjectFactory.getDefaultNamespace());

Cheers,
Felix

Gerster Sarah wrote:
> Hi!
>
> I'm trying to read peptides from a fasta file:
> >id|0|0.9992|1
> ASITENGGAEEESVAK
> >id|1|0.9953|1
> ASITENGGAEEESVAK
> >id|2|0.9998|1
> ASNASSAGDEVDNVATSSK
> >id|3|0.9998|1
> EAAAAEEPQPSDEGDVVAK
> >id|4|0.9998|1
> EAAAAEEPQPSDEGDVVAK
> ....
> I would like to have all peptides somewhere in memory. I need their id, the sequence and the 2 numbers at the end (e.g. id = 0, probability = 0.9992, rank = 1 for the first entry in the file).
>
> I tried to use readFastaProtein, but I guess I'm not using it right. I get the sequences, but none of the other information I want.
>
> Here is my code:
> try
> {
>   BufferedReader br = new BufferedReader(new FileReader(file_name));
>   RichSequenceIterator rich_stream = RichSequence.IOTools.readFastaProtein(br,null);
>   while(rich_stream.hasNext())
>   {
>     RichSequence rich_seq = rich_stream.nextRichSequence();
>     System.out.println(rich_seq.toString());
>     System.out.println(rich_seq.getAccession());
>     System.out.println(rich_seq.getAlphabet());
>     System.out.println(rich_seq.getAnnotation());
>     System.out.println(rich_seq.getName());
>     System.out.println(rich_seq.getDescription());
>     System.out.println(rich_seq.getIdentifier());
>     System.out.println(rich_seq.seqString());
>   }     
> }
> catch(Exception e) 
> {
>   System.err.println("Bug while reading the sequences from the FASTA file"); 
> } 
>  
>
> Here's the output (for the first entry in the fasta file):
> id|0:1/0.9992
> 0
> org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1df073d
>
> 1
> null
> null
> ASITENGGAEEESVAK
>
>
> Can anyone tell me what's going wrong? 
> Is there already a function to put all the sequences directly in the memory (like a HashSet) while reading them?
>
> Cheers
>
> Sarah
>

-- 
Felix Dreher

Max Planck Institute for Molecular Genetics
Department of Vertebrate Genomics
Bioinformatics Group
Ihnestr. 73
D-14195 Berlin

Phone: +49 30 - 8413 1682
Mobile: +49 163 - 754 24 26
E-mail: dreher at molgen.mpg.de
www.molgen.mpg.de/~lh_bioinf