[Biojava-l] WriteFasta
Saif Ur-Rehman
su24 at st-andrews.ac.uk
Fri Oct 5 13:44:29 UTC 2007
Setting the System properties solved the problem.
Thanks a lot,
Saif
Quoting Richard Holland <holland at ebi.ac.uk>:
> Great, thanks.
>
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
>
> This is a hex dump of the file:
>
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
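
[Editorial sketch] The leading bytes of an output file can also be checked from Java itself. The class and method names below are hypothetical and this is not BioJava code. As a side observation, the bytes ac ed 00 05 are identical to the stream header that java.io.ObjectOutputStream writes (ObjectStreamConstants.STREAM_MAGIC followed by STREAM_VERSION). Nothing in this thread establishes whether that is what happened here, so treat it only as something worth ruling out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of BioJava.
final class LeadingBytes {

    /** Formats the first n bytes of data as space-separated lowercase hex. */
    static String hex(byte[] data, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    /** Reads up to n bytes from the start of a file and formats them as hex. */
    static String hexOfFile(Path file, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // end of file reached before n bytes
                }
                total += r;
            }
        }
        return hex(buf, total);
    }

    /** Returns the bytes an ObjectOutputStream emits when nothing else is written. */
    static byte[] serializationHeader() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.flush();
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Dump the first 8 bytes of the file named on the command line.
            System.out.println(hexOfFile(Paths.get(args[0]), 8));
        }
        System.out.println(hex(serializationHeader(), 4));
    }
}
```

Running it with the path of a suspect FASTA file as the only argument prints that file's first eight bytes, followed by the serialization header for comparison.
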
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
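
[Editorial sketch] For reference, the standard system property holding the JVM's default encoding is named file.encoding, and Charset.defaultCharset() reports the charset actually in effect. The following is a minimal sketch that prints both and writes text through an explicit UTF-8 writer, so the output does not depend on the platform default. The class name and the sample record are made up for illustration; this is not BioJava code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic class for illustration.
final class EncodingCheck {

    /** Encodes text through a Writer with an explicit UTF-8 charset. */
    static byte[] encodeUtf8(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // What the JVM was started with, and what it is actually using.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // A made-up FASTA record; the first byte written should be '>' (0x3e).
        byte[] bytes = encodeUtf8(">example_id\nMSLRIK\n");
        System.out.println("first byte      = 0x" + Integer.toHexString(bytes[0] & 0xff));
    }
}
```

If the default does need changing, the usual route is a launch option such as `java -Dfile.encoding=UTF-8 ...`. Setting the property from inside a running program is generally not picked up, because the default charset is determined at JVM startup.
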
>
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
> > Dear Richard,
> >
> > The input file is just the entire set of RefSeq proteins for Arabidopsis
> > thaliana and is too large for me to send as an attachment. But it can be
> > downloaded from NCBI using the query
> > "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> >
> > Cheers,
> >
> > Saif
> >
> >
> >
> > Quoting Richard Holland <holland at ebi.ac.uk>:
> >
> > Interesting. Could you send your input file as well?
> >
> > cheers,
> > Richard
> >
> > Saif Ur-Rehman wrote:
> >>>> Dear Richard,
> >>>>
> >>>> The sequences are being read by SeqIO.readFasta. The code from read to
> >>>> write is as follows. Essentially the program wants to read in a fasta
> >>>> file containing all the protein sequences in a given organism and split
> >>>> them up into one file per protein.
> >>>>
> >>>>
> >>>> BufferedReader br = null;
> >>>> try
> >>>> {
> >>>>     br = new BufferedReader(new FileReader(filename));
> >>>> }
> >>>> catch (FileNotFoundException e1)
> >>>> {
> >>>>     e1.printStackTrace();
> >>>> }
> >>>>
> >>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
> >>>> while (stream.hasNext())
> >>>> {
> >>>>     try
> >>>>     {
> >>>>         Sequence seq = stream.nextSequence();
> >>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/"
> >>>>                 + name + "/" + seq.getName());
> >>>>
> >>>>         try
> >>>>         {
> >>>>             scriptFile1.createNewFile();
> >>>>         }
> >>>>         catch (IOException e1)
> >>>>         {
> >>>>             e1.printStackTrace();
> >>>>         }
> >>>>
> >>>>         try
> >>>>         {
> >>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
> >>>>             BufferedWriter out = new BufferedWriter(fstream);
> >>>>
> >>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
> >>>>
> >>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
> >>>>
> >>>>             try {
> >>>>                 RichSequence.IOTools.writeFasta(
> >>>>                     f,
> >>>>                     rs,
> >>>>                     RichObjectFactory.getDefaultNamespace()
> >>>>                 );
> >>>>             }
> >>>>             catch (IOException ioe) {}
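
[Editorial sketch] The task described above, splitting a multi-record FASTA file into one file per record, can also be sketched with the standard library alone, which may help separate plain I/O problems from library behaviour. This is not the poster's code and not BioJava. The class name, the file-naming rule (first whitespace-delimited token of the header, with unsafe characters replaced by underscores) and the choice of UTF-8 are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone splitter; does not use BioJava.
final class FastaSplitter {

    /** Derives a file name from a header line such as ">id description". */
    static String fileNameFor(String header) {
        String id = header.substring(1).trim().split("\\s+")[0];
        if (id.isEmpty()) {
            id = "unnamed";
        }
        // Replace characters that are awkward in file names (e.g. '|' and '/').
        return id.replaceAll("[^A-Za-z0-9._-]", "_") + ".fasta";
    }

    /** Writes each record of input to its own file in outDir; returns the files written. */
    static List<Path> split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    // A new record starts: close the previous file, open the next.
                    if (out != null) {
                        out.close();
                    }
                    Path target = outDir.resolve(fileNameFor(line));
                    out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
                    written.add(target);
                }
                if (out != null) {
                    out.write(line);
                    out.write('\n'); // fixed separator, independent of the platform
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        for (Path p : split(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(p);
        }
    }
}
```

Each output is opened through exactly one writer and closed before the next is opened. Records whose identifiers map to the same file name overwrite one another in this sketch, so a real tool would need a policy for duplicates.
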
> >>>>
> >>>> An example of an outputted fasta file from this code is attached.
> >>>>
> >>>>
> >>>>
> >>>> Thanks a lot for your time.
> >>>>
> >>>> Saif
> >>>>
> >>>>
> >>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>
> >>>> Where are the input sequences coming from? i.e. what method are you
> >>>> using to construct them or read them from a file.
> >>>>
> >>>> Also, what do you mean by the 'front' of each write? Could you send me
> >>>> an example of an entire FASTA file containing the problem? (It'd be best
> >>>> to attach the file to an email to me personally as this list will not
> >>>> accept attachments, and copying-and-pasting from a text editor to an
> >>>> email client may obscure the underlying problem).
> >>>>
> >>>> It'd be good also to see your entire code from the point the sequences
> >>>> are read or created to the point where they are written out. Or, a
> >>>> sample program which exhibits the same behaviour would suffice.
> >>>>
> >>>> I suspect that the sequences themselves contain the incorrect data,
> >>>> although technically this should be impossible as the sequence alphabet
> >>>> should prevent it.
> >>>>
> >>>> We recently had an issue reported here regarding BioJava not being able
> >>>> to do certain sequence tasks on platforms using non-Western-European
> >>>> character mappings. If your machine is running such a mapping, try it
> >>>> again on a machine with an English or other Western European language
> >>>> set up by default. If it works there but not on your machine, then
> >>>> this'll be the same problem. (There is no solution yet, but at least
> >>>> you'll know what's wrong).
> >>>>
> >>>> cheers,
> >>>> Richard
> >>>>
> >>>> Saif Ur-Rehman wrote:
> >>>>>>> Dear Richard,
> >>>>>>>
> >>>>>>> I have tried the RichSequence.IOTools.writeFasta method and this
> >>>>>>> method is still adding the characters "??" to the front of each
> >>>>>>> write. I am using a FileOutputStream and a Sequence object as inputs
> >>>>>>> to the method, like so:
> >>>>>>>
> >>>>>>>
> >>>>>>> Sequence seq; // read in from File
> >>>>>>> FileOutputStream f =new FileOutputStream (fileName);
> >>>>>>>
> >>>>>>>
> >>>>>>> try{
> >>>>>>>
> >>>>>>> RichSequence.IOTools.writeFasta(f,
> >>>>>>> seq,
> >>>>>>> RichObjectFactory.getDefaultNamespace()
> >>>>>>> );
> >>>>>>>
> >>>>>>>
> >>>>>>> }
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks a lot for your time
> >>>>>>>
> >>>>>>> Sincerely,
> >>>>>>>
> >>>>>>> Saif
> >>>>>>>
> >>>>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>>>>
> >>>>>>> SeqIOTools is deprecated.
> >>>>>>>
> >>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
> >>>>>>>
> >>>>>>> e.g.:
> >>>>>>>
> >>>>>>> RichSequence.IOTools.writeFasta(
> >>>>>>> System.out,
> >>>>>>> seq,
> >>>>>>> RichObjectFactory.getDefaultNamespace()
> >>>>>>> );
> >>>>>>>
> >>>>>>> where seq is either a Sequence or a SequenceIterator.
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Richard
> >>>>>>>
> >>>>>>> Saif Ur-Rehman wrote:
> >>>>>>>>>> Dear All,
> >>>>>>>>>>
> >>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() method. I
> >>>>>>>>>> am currently trying to break up FASTA files of whole organisms
> >>>>>>>>>> into one file per gene for further analysis. However the
> >>>>>>>>>> writeFasta method appears to append the characters
> >>>>>>>>>> "¨Ì
> >>>>>>>>>>
> >>>>>>>>>> ------------------------------------------------------------------
> >>>>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>>>>
> >>
>
-------------------------------------------------------------------------------
> >>>>>>> Saif Ur-Rehman
> >>>>>>> Research Student
> >>>>>>> The Centre for Evolution, Genes & Genomics (CEGG)
> >>>>>>> Dyers Brae
> >>>>>>> School of Biology
> >>>>>>> The University of St Andrews
> >>>>>>> St Andrews,
> >>>>>>> Fife
> >>>>>>> Scotland,UK
> >>>>>>> ------------------------------------------------------------------
> >>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk