[Biojava-l] WriteFasta
Andy Yates
ayates at ebi.ac.uk
Fri Oct 5 12:28:43 UTC 2007
I've done a quick search, and it seems that U+ACED is a Hangul (Korean)
syllable and the other, U+0005, is just a control character. Something is
getting quite badly confused here.
Richard Holland wrote:
> Great, thanks.
>
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
>
> This is a hex dump of the file:
>
> 00000000 ac ed 00 05 3e 67 69 7c 31 38 33 39 38 33 39 30 |....>gi|18398390|
> 00000010 7c 6c 63 6c 7c 4e 50 5f 35 36 35 34 31 33 2e 31 ||lcl|NP_565413.1|
> 00000020 7c 4e 50 5f 35 36 35 34 31 33 20 75 6e 6b 6e 6f ||NP_565413 unkno|
> 00000030 77 6e 20 70 72 6f 74 65 69 6e 20 5b 41 72 61 62 |wn protein [Arab|
> 00000040 69 64 6f 70 73 69 73 20 74 68 61 6c 69 61 6e 61 |idopsis thaliana|
> 00000050 5d 0a 4d 53 4c 52 49 4b 4c 56 56 44 4b 46 56 45 |].MSLRIKLVVDKFVE|
> 00000060 45 4c 4b 51 41 4c 44 41 44 49 51 44 52 49 4d 4b |ELKQALDADIQDRIMK|
> 00000070 45 52 45 4d 51 53 59 49 58 58 58 58 58 58 58 58 |EREMQSYIXXXXXXXX|
> 00000080 58 58 58 58 58 57 4b 41 45 4c 53 52 52 45 54 45 |XXXXXWKAELSRRETE|
> 00000090 49 41 52 51 45 41 52 4c 4b 4d 45 52 45 4e 4c 45 |IARQEARLKMERENLE|
> 000000a0 4b 45 0a 4b 53 56 4c 4d 47 54 41 53 4e 51 44 4e |KE.KSVLMGTASNQDN|
> 000000b0 51 44 47 41 4c 45 49 54 56 53 47 45 4b 59 52 43 |QDGALEITVSGEKYRC|
> 000000c0 4c 52 46 53 4b 41 4b 4b 0a |LRFSKAKK.|
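For anyone who wants to produce a similar dump from inside Java rather than with an external tool, here is a minimal sketch. The class name HexDump and the method dump are made up for illustration and are not part of BioJava; the layout only approximates the dump shown above.

```java
class HexDump {
    /** Formats data as 16-byte rows: offset, hex bytes, then printable ASCII between bars. */
    static String dump(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += 16) {
            StringBuilder hex = new StringBuilder();
            StringBuilder text = new StringBuilder();
            int end = Math.min(off + 16, data.length);
            for (int i = off; i < end; i++) {
                int b = data[i] & 0xff;                 // treat the byte as unsigned
                hex.append(String.format("%02x ", b));
                // Show printable ASCII as-is, everything else as '.'
                text.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
            }
            // Pad the hex column to 48 characters so the text column lines up
            out.append(String.format("%08x %-48s|%s|\n", off, hex, text));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First eight bytes of the file discussed in this thread
        byte[] head = {(byte) 0xac, (byte) 0xed, 0, 5, '>', 'g', 'i', '|'};
        System.out.print(dump(head));
    }
}
```

Reading the real file with java.nio.file.Files.readAllBytes and passing the result to dump would show whether the four leading bytes are present.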
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
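One candidate worth ruling out: the bytes ac ed 00 05 are the header that java.io.ObjectOutputStream writes when it is constructed (the serialization stream magic number 0xACED followed by stream version 5). Whether the code that produced this file wraps its output in an ObjectOutputStream is only an assumption, since that code is not shown in this thread. The sketch below just demonstrates the header; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class StreamHeaderDemo {
    /** Returns the bytes an ObjectOutputStream emits before any object is written. */
    static byte[] objectStreamHeader() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.flush(); // push the header through to the underlying stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Renders bytes as space-separated two-digit hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(objectStreamHeader())); // prints: ac ed 00 05
    }
}
```

If the calling code does turn out to use an ObjectOutputStream, writing through a plain FileOutputStream or a Writer instead would avoid the header.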
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
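A small standalone check along these lines might look like the sketch below. The property name is file.encoding; the class name EncodingCheck and its method are hypothetical. Printing Charset.defaultCharset() as well is a suggestion of mine, since that is the charset the JVM actually uses when none is given explicitly.

```java
import java.nio.charset.Charset;

class EncodingCheck {
    /** Summarises the encoding-related defaults of the running JVM. */
    static String describeDefaultEncoding() {
        String prop = System.getProperty("file.encoding"); // may be null on unusual setups
        Charset cs = Charset.defaultCharset();             // charset used when none is specified
        return "file.encoding=" + prop + ", defaultCharset=" + cs.name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultEncoding());
    }
}
```

Run it with the same JVM and environment that produced the file, because the values can differ between machines and launch settings.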
>
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
>
> cheers,
> Richard
>
>