[Biojava-l] WriteFasta

Fri Oct 5 12:28:43 UTC 2007

I've done a quick search & it seems as if U+ACED is a Korean (Hangul) character
& the other is just a blank. Something is getting confused quite badly here.
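
One more thing that may be worth ruling out: the four bytes ac ed 00 05 are the same values as java.io.ObjectStreamConstants.STREAM_MAGIC (0xACED) and STREAM_VERSION (0x0005), which is the header that java.io.ObjectOutputStream writes as soon as it is constructed. Whether an ObjectOutputStream is actually involved in the code that produced this file is only a guess on my part; nothing in this thread confirms it. A minimal sketch of the check (the class and method names HeaderCheck, hasSerializationHeader and emptyObjectStream are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class HeaderCheck {

    /** Returns true if the data starts with the bytes ac ed 00 05. */
    static boolean hasSerializationHeader(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        // Combine the first two byte pairs into unsigned 16-bit values.
        int magic = ((data[0] & 0xff) << 8) | (data[1] & 0xff);
        int version = ((data[2] & 0xff) << 8) | (data[3] & 0xff);
        // STREAM_MAGIC is a (negative) short, so mask it before comparing.
        return magic == (ObjectStreamConstants.STREAM_MAGIC & 0xffff)
                && version == ObjectStreamConstants.STREAM_VERSION;
    }

    /** Bytes produced by constructing and flushing an ObjectOutputStream, with no object written. */
    static byte[] emptyObjectStream() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            ObjectOutputStream oos = new ObjectOutputStream(buf);
            oos.flush(); // pushes the stream header into buf
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // First eight bytes of the hex dump quoted below.
        byte[] fromDump = {(byte) 0xac, (byte) 0xed, 0x00, 0x05, '>', 'g', 'i', '|'};
        byte[] plainFasta = ">gi|18398390".getBytes(StandardCharsets.US_ASCII);

        System.out.println("dump has header:         " + hasSerializationHeader(fromDump));
        System.out.println("empty object stream has: " + hasSerializationHeader(emptyObjectStream()));
        System.out.println("plain FASTA has header:  " + hasSerializationHeader(plainFasta));
    }
}
```

If the stream passed to the FASTA writer turns out to be wrapped in an ObjectOutputStream, passing a plain FileOutputStream instead would be the first thing to try.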

Richard Holland wrote:
> Great, thanks.
> 
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
> 
> This is a hex dump of the file:
> 
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
> 
> 
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
> 
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
> 
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
> 
> System.out.println(System.getProperty("file.encoding"));
> 
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
> 
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
> 
> cheers,
> Richard
> 
>
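
To follow up on the encoding check Richard suggests, here is a small sketch that prints the JVM's default encoding and then writes a FASTA-like record through a Writer with an explicitly chosen charset, so the output does not depend on the platform default. The class name EncodingCheck and the helper encodeAsUtf8 are made up for illustration, the record is a shortened stand-in for the real one, and StandardCharsets needs Java 7 or later (on older JVMs the charset name "UTF-8" can be passed as a string instead). This is only a way to inspect the settings, not a confirmed fix for the extra bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    /** Encodes text through an OutputStreamWriter with an explicit UTF-8 charset. */
    static byte[] encodeAsUtf8(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // The explicit charset argument overrides the platform default.
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // What the JVM believes the default encoding is on this machine.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // Shortened, illustrative record; '\n' is the line separator seen in the dump.
        byte[] bytes = encodeAsUtf8(">gi|18398390\nMSLRIK\n");

        // Print the leading bytes in hex so they can be compared with a hex dump.
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4 && i < bytes.length; i++) {
            hex.append(String.format("%02x ", bytes[i] & 0xff));
        }
        System.out.println("first bytes     = " + hex.toString().trim());
    }
}
```

If the first bytes printed here are 3e 67 69 7c (that is, ">gi|") while the real output file still starts with ac ed 00 05, the default encoding is probably not what is adding them.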