[Biopython-dev] [Bug 2938] Bio.Entrez.read() returns empty string for HTML (not an error)

bugzilla-daemon at portal.open-bio.org
Wed Oct 28 11:26:33 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2938





------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-10-28 07:26 EST -------
(In reply to comment #3)
> (In reply to comment #2)
> > In the meantime, instead of the white list, how about a blacklist?
> > i.e. If the data starts "<html" (ignoring case) raise an error?
> > We could also spot things like FASTA and GenBank files etc, and
> > as all we want to do is spot non-XML, this should be reliable.
> > 
> One important point is that the initial <?xml ... > tag is not handled as a
> regular XML tag by the parser. There is a separate handler method specific for
> parsing the <?xml ... > tag. This makes it much easier to check if an XML
> document is really XML: If this special handler is never called, it's not XML.
> 
> Checking for FASTA and GenBank files is also relatively easy; the parser
> raises an xml.parsers.expat.ExpatError syntax error, which we can catch and
> transform into a more informative message.

Sounds good.
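
For reference, a minimal sketch of the special-handler idea, assuming Python's
xml.parsers.expat and its XmlDeclHandler hook. The function name
has_xml_declaration is made up for illustration and this is not the
Bio.Entrez code:

```python
import xml.parsers.expat


def has_xml_declaration(data):
    """Return True if expat saw an <?xml ... ?> declaration in data (bytes).

    Hypothetical helper; raises ExpatError if data is not well formed.
    """
    seen = []

    def xml_decl(version, encoding, standalone):
        # expat calls this handler only for the <?xml ... ?> declaration,
        # never for ordinary start tags.
        seen.append(version)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = xml_decl
    parser.Parse(data, True)  # True marks this as the final chunk
    return bool(seen)
```

If the handler never fires, the data had no declaration, which is the
signal described in comment #3.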

> Checking for HTML is trickier. The parser will not raise an error, because
> except for the missing <?xml ... > initial tag, the HTML could in principle be
> regarded as XML. To check if the input starts with <html>, we'd have to read
> some data ahead, check for the <html>, and pass the data to the parser if it
> seems to be OK.

Understood.
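
Roughly what that read-ahead might look like, as a sketch only. It assumes a
binary handle and the plain expat parser; parse_unless_html and block_size
are names invented here, and HTML that opens with a DOCTYPE or a comment
would slip past this simple prefix test:

```python
import xml.parsers.expat


def parse_unless_html(handle, block_size=1024):
    """Peek at the first block; refuse HTML, otherwise feed everything to expat.

    Hypothetical helper, not the Bio.Entrez implementation.
    """
    head = handle.read(block_size)
    # Case-insensitive check, ignoring leading whitespace.
    if head.lstrip().lower().startswith(b"<html"):
        raise ValueError("Data looks like HTML, not XML")
    parser = xml.parsers.expat.ParserCreate()
    # The block we read ahead still has to go to the parser.
    parser.Parse(head, False)
    while True:
        block = handle.read(block_size)
        if not block:
            break
        parser.Parse(block, False)
    parser.Parse(b"", True)  # signal end of input
```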

> So I suggest we add a check for xml.parsers.expat.ExpatError syntax errors
> now, and add a check for the initial <?xml ... > once NCBI has fixed the XML
> output to always contain this tag, but don't check for <html>.

+1 on adding the syntax error check now, that will be a worthwhile improvement
in itself.
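
Something along these lines is what I have in mind for the syntax error
check. This is a sketch under assumptions: the exception name NotXMLError
and the function parse_or_explain are illustrative only, and the message
wording is a placeholder:

```python
import xml.parsers.expat


class NotXMLError(ValueError):
    """Illustrative exception for data that is not XML (name assumed here)."""


def parse_or_explain(data):
    """Parse data (bytes) with expat, turning syntax errors into a clearer message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        # FASTA, GenBank and other plain text end up here.
        raise NotXMLError(
            "Failed to parse the data as XML (%s). "
            "Was a non-XML format such as FASTA or GenBank requested?" % err
        ) from err
```

A FASTA record such as ">seq1" followed by sequence lines has text before
any root element, so expat rejects it and the caller sees the friendlier
error instead of a bare ExpatError.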

Regarding flagging <html>, is it currently a safe assumption that anything
starting <html> is NOT an NCBI XML file? If the NCBI will fix all their XML
output to always start <?xml ... > then great. I suspect it will take a while
though. If you want to wait, fine. I'm happy to leave this decision to you -
it's your module after all ;)

Peter

