[Fwd: Re: [BioPython] BLAST XML problem?]

Manuel Prinz manuel at pinguinkiste.de
Thu Jan 12 14:50:07 EST 2006


> XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
> 
> (source http://www.w3.org/TR/2004/REC-xml-20040204/#NT-XMLDecl)
> 
> As far as I understand that definition, the encoding attribute is 
> optional, so the NCBI File should be ok from the XML point of view.

This is not entirely right. The encoding declaration is optional only if
the entity is actually encoded in UTF-8 (or UTF-16), or if the encoding
is supplied by a higher-level protocol such as a MIME type, which does
not apply to a file on disk. The standard says this (in "4.3.3 Character
Encoding in Entities"):

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

That section also states that processors MUST support UTF-8 and UTF-16
and MAY support other encodings. The standard further states the
following:

"It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains byte sequences that
are not legal in that encoding. Specifically, it is a fatal error if an
entity encoded in UTF-8 contains any irregular code unit sequences, as
defined in Unicode 3.1 [Unicode3]. Unless an encoding is determined by a
higher-level protocol, it is also a fatal error if an XML entity
contains no encoding declaration and its content is not legal UTF-8 or
UTF-16."

So to meet the standard, the BioPython parser has to reject the XML file
(since it is/was not proper UTF-8 or UTF-16). Auto-detecting encodings
is a nice feature, but in terms of the standard it is only useful to a
processor for checking whether the declared encoding matches the real
one.
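
To illustrate the rule quoted above, here is a minimal sketch using
Python's standard xml.etree.ElementTree (which is built on expat), not
the BioPython BLAST parser itself. The sample documents and the helper
name parses() are made up for the illustration, and ISO-8859-1 is only
an assumed example encoding; the thread does not say which encoding the
NCBI file actually used.

```python
import xml.etree.ElementTree as ET

# Byte 0xE9 is "e acute" in ISO-8859-1, but on its own it is not a
# legal UTF-8 sequence. Both sample documents are hypothetical.
undeclared = b"<hit>caf\xe9</hit>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><hit>caf\xe9</hit>'

def parses(data):
    """Return True if the parser accepts the bytes, False on a fatal error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(parses(undeclared))  # no declaration, so the parser assumes UTF-8
print(parses(declared))    # declaration matches the actual bytes
```

The first document is rejected and the second is accepted, which is the
behaviour the standard asks for.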

> Anyway, how can I tell SAX which encoding table to use, beside editing 
> the XML file itself?

Since SAX is standards-compliant AFAIK, there probably isn't a way.
Either convert your files to UTF-8 or declare the character encoding in
the file. (iconv is a great tool for converting between different
encodings.)
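
If iconv is not at hand, the conversion can also be sketched in Python.
This is only a rough equivalent under the assumption that the source
encoding is known and is ISO-8859-1; the helper name to_utf8() and the
sample bytes are hypothetical.

```python
import xml.etree.ElementTree as ET

def to_utf8(data, source_encoding="iso-8859-1"):
    """Re-encode raw bytes from source_encoding (an assumption) to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

# Hypothetical sample: no encoding declaration, content in ISO-8859-1.
raw = b"<hit>caf\xe9</hit>"
converted = to_utf8(raw)

# After conversion the document is legal UTF-8 and parses without a
# declaration.
root = ET.fromstring(converted)
print(root.tag)
```

Note that this only makes sense for files without an encoding
declaration. If the file already declares some other encoding, the
declaration has to be changed as well, otherwise it no longer matches
the bytes.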

With kind regards,
Manuel




