[Biojava-l] WriteFasta
Richard Holland
holland at ebi.ac.uk
Fri Oct 5 12:10:58 UTC 2007
Great, thanks.
The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.
This is a hex dump of the file:
00000000  ac ed 00 05 3e 67 69 7c 31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f 35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34 31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65 69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20 74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b 4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41 44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49 58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41 45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c 4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d 47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54 56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b 0a                       |LRFSKAKK.|
The four extra bytes are hex ac ed 00 05. They show as question marks in
your text editor because they do not map to printable characters in the
encoding the editor is using.
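If you want to check other output files the same way without a hex dump
tool, here is a minimal sketch that prints the leading bytes of a file as
hex. The class name LeadingBytes and the helper toHex are made up for
illustration and are not part of BioJava.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LeadingBytes {
    /** Formats up to 'count' leading bytes as space-separated two-digit hex. */
    static String toHex(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int limit = Math.min(count, data.length);
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative byte values print as 80..ff.
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to inspect.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(toHex(data, 16));
    }
}
```

A clean FASTA file should start with 3e (the > symbol).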
Does anyone recognise these characters? There is no code in BioJava
which writes anything like this, in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?
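One thing worth ruling out: these four bytes are identical to the stream
header that java.io.ObjectOutputStream writes as soon as it is constructed
(STREAM_MAGIC 0xaced followed by STREAM_VERSION 5). Whether anything in
this particular write path actually wraps the output stream in an
ObjectOutputStream is an assumption, not something confirmed here. The
sketch below only demonstrates where such bytes can come from; the class
name SerializationHeader is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializationHeader {
    /** Returns the bytes an ObjectOutputStream emits before any object is written. */
    static byte[] headerBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // The constructor itself writes the serialization stream header.
            ObjectOutputStream oos = new ObjectOutputStream(buffer);
            oos.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : headerBytes()) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim()); // prints "ac ed 00 05"
    }
}
```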
I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:
System.getProperty("file.encoding");
to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.
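A minimal sketch of that check, reading the standard file.encoding system
property alongside Charset.defaultCharset(), which is what Java uses when
no encoding is given explicitly. The class name EncodingCheck is made up
for illustration.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    /** Describes the platform encoding settings the JVM is running with. */
    static String describeEncoding() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeEncoding());
    }
}
```

Run it on the same machine and with the same JVM that produced the bad
files, since the values can differ between installations.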
We've had internationalisation problems before with BioJava. Hopefully
these will be addressed in future development, but there is no current
activity in that area due to lack of resources. In the meantime the best
workaround is to set every setting you can find to a Western European
character set/character mapping and UTF-8 file encoding, in the hope
that it will all match up nicely and work.
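Independently of the system-wide settings, text output in your own code
can be pinned to one encoding rather than relying on the platform default.
The following is only a sketch of that idea, not BioJava's own writer: the
class ExplicitEncodingWrite and the method writeRecord are made up for
illustration, and the record is written by hand as a header line plus a
single sequence line.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingWrite {
    /** Writes one FASTA-style record as UTF-8 with '\n' line endings. */
    static void writeRecord(OutputStream os, String header, String residues) {
        try {
            // Naming the charset here makes the bytes independent of file.encoding.
            Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8);
            w.write(">" + header + "\n");
            w.write(residues + "\n");
            w.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        writeRecord(System.out, "example_id example description", "MSLRIKLVVDKFVE");
    }
}
```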
cheers,
Richard
Saif Ur-Rehman wrote:
> Dear Richard,
>
> The input file is just the entire set of RefSeq proteins for Arabidopsis thaliana
> and is too large for me to send as an attachment. But it can be downloaded from
> NCBI using the query "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
>
> Cheers,
>
> Saif
>
>
>
> Quoting Richard Holland <holland at ebi.ac.uk>:
>
> Interesting. Could you send your input file as well?
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
>>>> Dear Richard,
>>>>
>>>> The sequences are being read by SeqIO.readFasta. The code from read to
>>>> write is as follows. Essentially the program wants to read in a fasta file
>>>> containing all the protein sequences in a given organism and split them up
>>>> into one file per protein.
>>>>
>>>>
>>>> BufferedReader br=null;
>>>> try
>>>> {
>>>> br = new BufferedReader(new FileReader(filename));
>>>> }
>>>> catch (FileNotFoundException e1)
>>>> {
>>>>
>>>> e1.printStackTrace();
>>>> }
>>>>
>>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
>>>> while (stream.hasNext())
>>>> {
>>>> try
>>>> {
>>>> Sequence seq = stream.nextSequence();
>>>> File scriptFile1= new File("///Users/Saif/Organisms/RunTemp/"+name+"/"+seq.getName());
>>>>
>>>> try
>>>> {
>>>> scriptFile1.createNewFile();
>>>> }
>>>> catch (IOException e1)
>>>> {
>>>>
>>>> e1.printStackTrace();
>>>> }
>>>>
>>>> try
>>>> {
>>>> FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
>>>> BufferedWriter out = new BufferedWriter(fstream);
>>>>
>>>> FileOutputStream f =new FileOutputStream (scriptFile1);
>>>>
>>>> RichSequence rs=RichSequence.Tools.enrich(seq);
>>>>
>>>>
>>>> try{
>>>>
>>>>
>>>> RichSequence.IOTools.writeFasta(
>>>> f,
>>>> rs,
>>>> RichObjectFactory.getDefaultNamespace()
>>>> );
>>>>
>>>>
>>>> }
>>>>
>>>> catch (IOException ioe){}
>>>>
>>>> An example of an outputted fasta file from this code is attached.
>>>>
>>>>
>>>>
>>>> Thanks a lot for your time.
>>>>
>>>> Saif
>>>>
>>>>
>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
>>>>
>>>> Where are the input sequences coming from? i.e. what method are you
>>>> using to construct them or read them from a file.
>>>>
>>>> Also, what do you mean by the 'front' of each write? Could you send me
>>>> an example of an entire FASTA file containing the problem? (It'd be best
>>>> to attach the file to an email to me personally as this list will not
>>>> accept attachments, and copying-and-pasting from a text editor to an
>>>> email client may obscure the underlying problem).
>>>>
>>>> It'd be good also to see your entire code from the point the sequences
>>>> are read or created to the point where they are written out. Or, a
>>>> sample program which exhibits the same behaviour would suffice.
>>>>
>>>> I suspect that the sequences themselves contain the incorrect data,
>>>> although technically this should be impossible as the sequence alphabet
>>>> should prevent it.
>>>>
>>>> We recently had an issue reported here regarding BioJava not being able
>>>> to do certain sequence tasks on platforms using non-Western-European
>>>> character mappings. If your machine is running such a mapping, try it
>>>> again on a machine with an English or other Western European language
>>>> set up by default. If it works there but not on your machine, then
>>>> this'll be the same problem. (There is no solution yet, but at least
>>>> you'll know what's wrong).
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> Saif Ur-Rehman wrote:
>>>>>>> Dear Richard,
>>>>>>>
>>>>>>> I have tried the RichSequence.IOTools.writeFasta method and this method is
>>>>>>> still appending the characters "??" to the front of each write. I am using a
>>>>>>> FileOutputStream and a Sequence object as inputs to the method, like so.
>>>>>>>
>>>>>>>
>>>>>>> Sequence seq; // read in from File
>>>>>>> FileOutputStream f =new FileOutputStream (fileName);
>>>>>>>
>>>>>>>
>>>>>>> try{
>>>>>>>
>>>>>>> RichSequence.IOTools.writeFasta(f,
>>>>>>> seq,
>>>>>>> RichObjectFactory.getDefaultNamespace()
>>>>>>> );
>>>>>>>
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot for your time
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> Saif
>>>>>>>
>>>>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
>>>>>>>
>>>>>>> SeqIOTools is deprecated.
>>>>>>>
>>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
>>>>>>>
>>>>>>> e.g.:
>>>>>>>
>>>>>>> RichSequence.IOTools.writeFasta(
>>>>>>> System.out,
>>>>>>> seq,
>>>>>>> RichObjectFactory.getDefaultNamespace()
>>>>>>> );
>>>>>>>
>>>>>>> where seq is either a Sequence or a SequenceIterator.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>>>
>>>>>>> Saif Ur-Rehman wrote:
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I am
>>>>>>>>>> currently trying to break up Fasta Files of whole organisms into one
>>>>>>>>>> file per gene for further analysis. However the writeFasta method
>>>>>>>>>> appears to append the characters
>>>>>>>>>> "¨Ì
>>>>>>>>>>
>> -------------------------------------------------------------------------------
>>>>>>> Saif Ur-Rehman
>>>>>>> Research Student
>>>>>>> The Centre for Evolution, Genes & Genomics (CEGG)
>>>>>>> Dyers Brae
>>>>>>> School of Biology
>>>>>>> The University of St Andrews
>>>>>>> St Andrews,
>>>>>>> Fife
>>>>>>> Scotland,UK
>>>>>>> ------------------------------------------------------------------
>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk