[Biojava-l] reading peptides from a fasta file
Felix Dreher
dreher at molgen.mpg.de
Wed Nov 8 12:39:20 UTC 2006
Hi Sarah,
Just a few comments and guesses.
There may be other approaches, but you could follow the FASTA header
format that BioJavaX understands:
>gi|<identifier>|<namespace>|<accession>|<name> <description>
http://biojava.org/wiki/BioJava:BioJavaXDocs#Reading (chapter 8.2.1)
So, if your file looked like this:
>gi|0|namespace|null|1 0.9992
ASITENGGAEEESVAK
>gi|1|namespace|null|1 0.9953
ASITENGGAEEESVAK
.
.
.
you could use:
System.out.println("id: "+rich_seq.getIdentifier());
System.out.println("rank: "+rich_seq.getName());
System.out.println("probability: "+rich_seq.getDescription());
System.out.println("sequence: "+rich_seq.seqString());
to get the data.
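To make the field mapping concrete, here is a small plain-Java sketch of how a header in the layout above splits into identifier, namespace, accession, name and description. It is only an illustration of the layout, not BioJava's own parser; the class name GiHeaderSketch and its fields are made up for this example.

```java
/** Plain-Java illustration of the header layout; this is not BioJava's own parser. */
final class GiHeaderSketch {
    final String identifier;
    final String namespace;
    final String accession;
    final String name;
    final String description;

    private GiHeaderSketch(String identifier, String namespace, String accession,
                           String name, String description) {
        this.identifier = identifier;
        this.namespace = namespace;
        this.accession = accession;
        this.name = name;
        this.description = description;
    }

    /** Splits a line of the form ">gi|identifier|namespace|accession|name description". */
    static GiHeaderSketch parse(String headerLine) {
        final String prefix = ">gi|";
        if (!headerLine.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a gi-style header: " + headerLine);
        }
        // A limit of 4 keeps any further '|' characters inside the last field.
        String[] parts = headerLine.substring(prefix.length()).split("\\|", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields after 'gi': " + headerLine);
        }
        // The last field is "<name> <description>", separated by the first space.
        String tail = parts[3].trim();
        int space = tail.indexOf(' ');
        String name = space < 0 ? tail : tail.substring(0, space);
        String description = space < 0 ? "" : tail.substring(space + 1).trim();
        return new GiHeaderSketch(parts[0], parts[1], parts[2], name, description);
    }

    public static void main(String[] args) {
        GiHeaderSketch h = parse(">gi|0|namespace|null|1 0.9992");
        System.out.println("id: " + h.identifier);
        System.out.println("rank: " + h.name);
        System.out.println("probability: " + h.description);
    }
}
```

With the example file above, the rank ends up in the name field and the probability in the description field, which is why getName() and getDescription() are used to read them back.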
However, 'rank' and 'probability' are really annotations of the
sequence, so when processing the data (e.g. storing it in a database),
one would store them as annotations rather than as name and description.
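As a rough sketch of that idea outside of BioJava, the strings returned for name and description could be converted into typed values and kept in memory in a map keyed by id. Everything here (PeptideStore, Peptide, the add method) is hypothetical and only meant to show the shape of such a container; it assumes the name holds an integer rank and the description holds a decimal probability, as in the example file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory container; not part of BioJava. */
final class PeptideStore {

    /** One peptide with its rank and probability kept as typed values. */
    static final class Peptide {
        final String id;
        final String sequence;
        final int rank;
        final double probability;

        Peptide(String id, String sequence, int rank, double probability) {
            this.id = id;
            this.sequence = sequence;
            this.rank = rank;
            this.probability = probability;
        }
    }

    // LinkedHashMap keeps the peptides in the order they were read from the file.
    private final Map<String, Peptide> byId = new LinkedHashMap<>();

    /**
     * Adds one entry. 'name' and 'description' are the raw strings from the header
     * (assumed here to carry the rank and the probability, respectively).
     */
    void add(String id, String name, String description, String sequence) {
        int rank = Integer.parseInt(name.trim());
        double probability = Double.parseDouble(description.trim());
        byId.put(id, new Peptide(id, sequence, rank, probability));
    }

    Peptide get(String id) {
        return byId.get(id);
    }

    int size() {
        return byId.size();
    }

    public static void main(String[] args) {
        PeptideStore store = new PeptideStore();
        store.add("0", "1", "0.9992", "ASITENGGAEEESVAK");
        store.add("1", "1", "0.9953", "ASITENGGAEEESVAK");
        Peptide first = store.get("0");
        System.out.println(first.id + " rank=" + first.rank
                + " probability=" + first.probability + " " + first.sequence);
    }
}
```

Inside the reading loop, the four arguments would come from getIdentifier(), getName(), getDescription() and seqString() of each sequence.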
- As for Java naming conventions, camelCase is recommended for
variables, e.g. richSeq instead of rich_seq.
- To get the alphabet:
System.out.println("alphabet: "+rich_seq.getAlphabet().getName());
- Also, it may be better to use the default namespace instead of null:
RichSequenceIterator rich_stream =
RichSequence.IOTools.readFastaProtein(br,RichObjectFactory.getDefaultNamespace());
Cheers,
Felix
Gerster Sarah wrote:
> Hi!
>
> I'm trying to read peptides from a fasta file:
>
> >id|0|0.9992|1
> ASITENGGAEEESVAK
> >id|1|0.9953|1
> ASITENGGAEEESVAK
> >id|2|0.9998|1
> ASNASSAGDEVDNVATSSK
> >id|3|0.9998|1
> EAAAAEEPQPSDEGDVVAK
> >id|4|0.9998|1
> EAAAAEEPQPSDEGDVVAK
> ....
> I would like to keep all peptides in memory. I need their id, the sequence and the 2 numbers at the end (e.g. id = 0, probability = 0.9992, rank = 1 for the first entry in the file).
>
> I tried to use readFastaProtein... but I guess I'm not using it right. Anyway, I get the sequences, but I don't get any of the other information I want...
>
> Here is my code:
> try
> {
>     BufferedReader br = new BufferedReader(new FileReader(file_name));
>     RichSequenceIterator rich_stream = RichSequence.IOTools.readFastaProtein(br,null);
>     while(rich_stream.hasNext())
>     {
>         RichSequence rich_seq = rich_stream.nextRichSequence();
>         System.out.println(rich_seq.toString());
>         System.out.println(rich_seq.getAccession());
>         System.out.println(rich_seq.getAlphabet());
>         System.out.println(rich_seq.getAnnotation());
>         System.out.println(rich_seq.getName());
>         System.out.println(rich_seq.getDescription());
>         System.out.println(rich_seq.getIdentifier());
>         System.out.println(rich_seq.seqString());
>     }
> }
> catch(Exception e)
> {
>     System.err.println("Bug while reading the sequences from the FASTA file");
> }
>
>
> Here's the output (for the first entry in the fasta file):
> id|0:1/0.9992
> 0
> org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1df073d
>
> 1
> null
> null
> ASITENGGAEEESVAK
>
>
> Can anyone tell me what's going wrong?
> Is there already a function to put all the sequences directly in the memory (like a HashSet) while reading them?
>
> Cheers
>
> Sarah
>
--
Felix Dreher
Max Planck Institute for Molecular Genetics
Department of Vertebrate Genomics
Bioinformatics Group
Ihnestr. 73
D-14195 Berlin
Phone: +49 30 - 8413 1682
Mobile: +49 163 - 754 24 26
E-mail: dreher at molgen.mpg.de
www.molgen.mpg.de/~lh_bioinf