[Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error)

Wed Oct 28 11:12:05 UTC 2009

http://bugzilla.open-bio.org/show_bug.cgi?id=2938

------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp  2009-10-28 07:12 EST -------
(In reply to comment #2)
> In the meantime, instead of the white list, how about a blacklist?
> i.e. If the data starts "<html" (ignoring case) raise an error?
> We could also spot things like FASTA and GenBank files etc, and
> as all we want to do is spot non-XML, this should be reliable.
> 
One important point is that the initial <?xml ... > declaration is not handled
as a regular XML tag by the parser. There is a separate handler method
specifically for the <?xml ... > declaration. This makes it much easier to check
whether the input is really XML: if this special handler is never called, it's
not XML.
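
As a rough illustration of that check, here is a minimal sketch using the
XmlDeclHandler hook of Python's xml.parsers.expat module. The function name
has_xml_declaration is made up for this example and is not part of Bio.Entrez.

```python
import xml.parsers.expat


def has_xml_declaration(data):
    """Return True if expat saw an initial <?xml ... ?> declaration.

    Hypothetical helper for illustration only; not Bio.Entrez code.
    Raises xml.parsers.expat.ExpatError if the data is not well formed.
    """
    seen = []

    def on_xml_decl(version, encoding, standalone):
        # expat calls this handler only for the <?xml ... ?> declaration,
        # never for ordinary element tags.
        seen.append(version)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(data, True)
    return bool(seen)
```

Well-formed HTML without a declaration parses without error here, but the
handler is never called, so the function returns False.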

Checking for FASTA and GenBank files is also relatively easy; the parser
raises an xml.parsers.expat.ExpatError syntax error, which we can catch and
turn into a more informative message.
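
A sketch of what catching and re-raising could look like. The exception class
NotXMLInputError and the function parse_or_explain are names invented for this
example, and the wording of the message is only a suggestion.

```python
import xml.parsers.expat


class NotXMLInputError(ValueError):
    """Hypothetical exception for input that is not XML."""


def parse_or_explain(data):
    """Parse data with expat, replacing ExpatError by a clearer message.

    Illustration only; a real parser would also install element handlers.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        # FASTA (">id ...") and GenBank ("LOCUS ...") input do not start
        # with "<", so expat rejects them with a syntax error.
        raise NotXMLInputError(
            "The data does not look like XML (%s). "
            "Was a non-XML return format requested?" % err
        ) from err
```

Passing b">seq1\nACGT\n" or b"LOCUS       AB000001\n" raises NotXMLInputError,
while well-formed XML passes through silently.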

Checking for HTML is trickier. The parser will not raise an error because,
except for the missing initial <?xml ... > declaration, the HTML could in
principle be regarded as XML. To check whether the input starts with <html>,
we'd have to read some data ahead, check for <html>, and pass the data on to
the parser only if it seems to be OK.
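
To show what that read-ahead would involve, here is a minimal sketch. The
function parse_rejecting_html, the peek_size parameter and the choice to
collect element names are all assumptions made for this example.

```python
import io
import xml.parsers.expat


def parse_rejecting_html(handle, peek_size=1024):
    """Read ahead, refuse input starting with <html, else feed expat.

    Hypothetical helper: handle is a binary file-like object. Returns the
    element names seen, just to have something observable. HTML that
    begins with <!DOCTYPE ...> is not caught by this simple check.
    """
    head = handle.read(peek_size)
    if head.lstrip().lower().startswith(b"<html"):
        raise ValueError("The data looks like HTML, not XML")

    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)

    # The data already read must be passed to the parser first.
    parser.Parse(head, False)
    while True:
        chunk = handle.read(peek_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)  # signal end of input
    return names
```

The extra buffering step is the cost of this check: the parser can no longer
simply be handed the file handle.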

So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors now,
and add a check for the initial <?xml ... > once NCBI has fixed the XML output
to always contain this tag, but don't check for <html>.
