[Biojava-l] WriteFasta

Fri Oct 5 12:10:58 UTC 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Great, thanks.

The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.

This is a hex dump of the file:

00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|

The four extra bytes are 0xac 0xed 0x00 0x05. They show up as question
marks in your text editor because that is how many text editors display
bytes they cannot render as printable characters.

Does anyone recognise these characters? There is no code in BioJava
which writes anything like this, in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?
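
One thing that may be worth ruling out: the byte pair 0xac 0xed followed
by 0x00 0x05 is also what java.io.ObjectOutputStream writes as its stream
header (ObjectStreamConstants.STREAM_MAGIC and STREAM_VERSION). The sketch
below is only a diagnostic and says nothing about what BioJava does
internally; the class name LeadingBytes and its helper methods are made up
for illustration. It compares the first four bytes of a file with that
header.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of BioJava.
class LeadingBytes {

    // Formats bytes as two-digit lowercase hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True if the data begins with STREAM_MAGIC followed by STREAM_VERSION.
    static boolean startsWithSerializationHeader(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        int magic = ((data[0] & 0xff) << 8) | (data[1] & 0xff);
        int version = ((data[2] & 0xff) << 8) | (data[3] & 0xff);
        return magic == (ObjectStreamConstants.STREAM_MAGIC & 0xffff)
                && version == ObjectStreamConstants.STREAM_VERSION;
    }

    // Bytes produced by opening an ObjectOutputStream and writing no objects.
    static byte[] emptyObjectStreamBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(buf);
            oos.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads up to n bytes from the start of the named file.
    static byte[] firstBytes(String path, int n) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break;
                }
                total += read;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("empty object stream: " + toHex(emptyObjectStreamBytes()));
        // Each argument is a path to a file to inspect, e.g. one of the FASTA outputs.
        for (String path : args) {
            byte[] head = firstBytes(path, 4);
            System.out.println(path + ": " + toHex(head)
                    + " serialization header? " + startsWithSerializationHeader(head));
        }
    }
}
```

If one of the generated files reports true here, it would suggest looking
for an ObjectOutputStream wrapped around the output somewhere rather than
at the character encoding.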

I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:

System.getProperty("file.encoding");

to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.
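
In case it helps, here is a small standalone check. It is only a sketch:
the class name ShowEncoding is made up for illustration. It prints the
file.encoding system property next to the charset the JVM reports as its
default, so the two can be compared.

```java
import java.nio.charset.Charset;

// Hypothetical helper for reporting the JVM's encoding settings.
class ShowEncoding {

    // Builds a one-line summary of the encoding-related settings.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```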

We've had internationalisation problems before with BioJava. Hopefully
these will be addressed in future development, but there is no current
activity in that area due to lack of resources. In the meantime the best
workaround is to set every setting you can find to a Western European
character set/character mapping and UTF-8 file encoding, in the hope
that it will all match up nicely and work.
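
For text written from your own code, one way to take the platform default
out of the picture is to name the charset explicitly rather than rely on
whatever the JVM picks. The sketch below is an illustration under that
assumption and not a fix for writeFasta itself; the class ExplicitUtf8 and
its methods are hypothetical. Java's UTF-8 encoder does not emit a
byte-order mark, so an ASCII FASTA record written this way starts directly
with the '>' byte (0x3e).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; writes a FASTA-style record with an explicit charset.
class ExplicitUtf8 {

    // Writes ">header\nresidues\n" to the stream as UTF-8, using '\n' line ends.
    static void writeFastaText(OutputStream out, String header, String residues) {
        try {
            Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            w.write('>');
            w.write(header);
            w.write('\n');
            w.write(residues);
            w.write('\n');
            w.flush(); // flush only; closing the stream is left to the caller
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper returning the encoded bytes for inspection.
    static byte[] encode(String header, String residues) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeFastaText(buf, header, residues);
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("seq1", "MSLRIKL");
        System.out.println("first byte: 0x" + Integer.toHexString(bytes[0] & 0xff));
    }
}
```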

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
> 
> The input file is just the entire set of RefSeq proteins for Arabidopsis thaliana
> and is too large for me to send as an attachment. But it can be downloaded from
> NCBI using the query "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> 
> Cheers,
> 
> Saif
> 
> 
> 
> Quoting Richard Holland <holland at ebi.ac.uk>:
> 
> Interesting. Could you send your input file as well?
> 
> cheers,
> Richard
> 
> Saif Ur-Rehman wrote:
>>>> Dear Richard,
>>>>
>>>> The sequences are being read by SeqIO.readFasta. The code from read to write is
>>>> as follows. Essentially the program wants to read in a fasta file containing
>>>> all the protein sequences in a given organism and split them up into one file
>>>> per protein.
>>>>
>>>>
>>>> BufferedReader br = null;
>>>> try
>>>> {
>>>>     br = new BufferedReader(new FileReader(filename));
>>>> }
>>>> catch (FileNotFoundException e1)
>>>> {
>>>>     e1.printStackTrace();
>>>> }
>>>>
>>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
>>>> while (stream.hasNext())
>>>> {
>>>>     try
>>>>     {
>>>>         Sequence seq = stream.nextSequence();
>>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/" + name
>>>>                 + "/" + seq.getName());
>>>>
>>>>         try
>>>>         {
>>>>             scriptFile1.createNewFile();
>>>>         }
>>>>         catch (IOException e1)
>>>>         {
>>>>             e1.printStackTrace();
>>>>         }
>>>>
>>>>         try
>>>>         {
>>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
>>>>             BufferedWriter out = new BufferedWriter(fstream);
>>>>
>>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
>>>>
>>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
>>>>
>>>>             try
>>>>             {
>>>>                 RichSequence.IOTools.writeFasta(
>>>>                         f,
>>>>                         rs,
>>>>                         RichObjectFactory.getDefaultNamespace()
>>>>                         );
>>>>             }
>>>>             catch (IOException ioe){}
>>>>
>>>> An example of an outputted fasta file from this code is attached.
>>>>
>>>>
>>>>
>>>> Thanks a lot for your time.
>>>>
>>>> Saif
>>>>
>>>>
>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
>>>>
>>>> Where are the input sequences coming from? i.e. what method are you
>>>> using to construct them or read them from a file.
>>>>
>>>> Also, what do you mean by the 'front' of each write? Could you send me
>>>> an example of an entire FASTA file containing the problem? (It'd be best
>>>> to attach the file to an email to me personally as this list will not
>>>> accept attachments, and copying-and-pasting from a text editor to an
>>>> email client may obscure the underlying problem).
>>>>
>>>> It'd be good also to see your entire code from the point the sequences
>>>> are read or created to the point where they are written out. Or, a
>>>> sample program which exhibits the same behaviour would suffice.
>>>>
>>>> I suspect that the sequences themselves contain the incorrect data,
>>>> although technically this should be impossible as the sequence alphabet
>>>> should prevent it.
>>>>
>>>> We recently had an issue reported here regarding BioJava not being able
>>>> to do certain sequence tasks on platforms using non-Western-European
>>>> character mappings. If your machine is running such a mapping, try it
>>>> again on a machine with an English or other Western European language
>>>> set up by default. If it works there but not on your machine, then
>>>> this'll be the same problem. (There is no solution yet, but at least
>>>> you'll know what's wrong).
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> Saif Ur-Rehman wrote:
>>>>>>> Dear Richard,
>>>>>>>
>>>>>>> I have tried the RichSequence.IOTools.writeFasta method and this method is still
>>>>>>> adding the characters "??" to the front of each write. I am using a
>>>>>>> FileOutputStream and a Sequence object as inputs to the method, like so:
>>>>>>>
>>>>>>>
>>>>>>>  Sequence seq; // read in from File
>>>>>>>  FileOutputStream f =new FileOutputStream (fileName);
>>>>>>>
>>>>>>>
>>>>>>> 			   try{
>>>>>>>
>>>>>>> 			    	RichSequence.IOTools.writeFasta(f,
>>>>>>> 			    	        seq,
>>>>>>> 			    	        RichObjectFactory.getDefaultNamespace()
>>>>>>> 			    	        );
>>>>>>>
>>>>>>>
>>>>>>> 			    }
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot for your time
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> Saif
>>>>>>>
>>>>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
>>>>>>>
>>>>>>> SeqIOTools is deprecated.
>>>>>>>
>>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
>>>>>>>
>>>>>>> e.g.:
>>>>>>>
>>>>>>> RichSequence.IOTools.writeFasta(
>>>>>>> 	System.out,
>>>>>>> 	seq,
>>>>>>> 	RichObjectFactory.getDefaultNamespace()
>>>>>>> 	);
>>>>>>>
>>>>>>> where seq is either a Sequence or a SequenceIterator.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>>>
>>>>>>> Saif Ur-Rehman wrote:
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
>>>>>>>>>> trying to break up Fasta Files of whole organisms into one file per gene for
>>>>>>>>>> further analysis. However the writeFasta method appears to append the
>>>>>>>>>> characters
>>>>>>>>>> "¨Ì
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------
>>>>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>
>> -------------------------------------------------------------------------------
>>>>>>> Saif Ur-Rehman
>>>>>>> Research Student
>>>>>>> The Centre for Evolution, Genes & Genomics (CEGG)
>>>>>>> Dyers Brae
>>>>>>> School of Biology
>>>>>>> The University of St Andrews
>>>>>>> St Andrews,
>>>>>>> Fife
>>>>>>> Scotland,UK
>>>>>>> ------------------------------------------------------------------
>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
pxAPAybISoRQgbvQ1wyzqVg=
=MS7P
-----END PGP SIGNATURE-----