[Biopython-dev] [Bug 2771] Bio.Entrez.read can't parse XML files from dbSNP (snp database)

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Sat Mar 21 04:24:43 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2771

------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp  2009-03-21 00:24 EST -------
(In reply to comment #0)
> >>> handle = Entrez.efetch(db='snp', id='9996597', retmode='xml')
> >>> cont = handle.read()
> >>> print cont
> <?xml version="1.0"?>
> <ExchangeSet...>
> ...
> </ExchangeSet>
> 
With the version of Bio.Entrez currently in CVS, Entrez.read does not raise an
exception, but simply returns an empty record. The problem is that EFetch for
the SNP database uses an XML Schema instead of a DTD to describe the contents
of the XML file, as shown in the first few lines of the XML file:

<?xml version="1.0"?>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"
xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum
http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">

The last URL in xsi:schemaLocation points to the XML Schema.
All other Entrez Utilities I've seen so far use a DTD instead of an XML Schema.
Hence, Entrez.read only has a DTD parser to find out how to interpret the XML
file, and without a DTD it has nothing to go on. In principle, Bio.Entrez can
be modified to add an XML Schema parser. While this is not trivial, it is
probably not very difficult. Marco, would you be willing to write such a
parser? If you have a parser for the XML Schema, I can show you how to
integrate it with Bio.Entrez.
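
To illustrate the distinction, here is a rough sketch (not code from
Bio.Entrez) that uses the standard library's expat bindings to report whether
an XML document declares a DTD or an XML Schema. The function name
describe_xml_grammar is made up for this example, and the code is written for
Python 3 rather than the Python 2 used in the session quoted above. It only
detects which kind of grammar is declared; it does not parse the schema itself.

```python
import xml.parsers.expat

# Namespace that the xsi: prefix is bound to in the SNP output.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def describe_xml_grammar(data):
    """Return ("dtd", system_id), ("xsd", [schema_urls]) or (None, None).

    Hypothetical helper: it only looks at the DOCTYPE declaration and at
    the xsi:schemaLocation attribute of the root element.
    """
    kind = None
    location = None
    seen_root = False

    def start_doctype(name, system_id, public_id, has_internal_subset):
        nonlocal kind, location
        kind, location = "dtd", system_id

    def start_element(name, attrs):
        nonlocal kind, location, seen_root
        if seen_root:
            return  # only the root element is inspected
        seen_root = True
        # With namespace_separator=" ", a namespaced attribute name is
        # reported as "<namespace URI> <local name>".
        value = attrs.get(XSI_NS + " schemaLocation")
        if kind is None and value is not None:
            # schemaLocation holds "namespace URL" pairs; keep the URLs.
            kind, location = "xsd", value.split()[1::2]

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartDoctypeDeclHandler = start_doctype
    parser.StartElementHandler = start_element
    parser.Parse(data, True)
    return kind, location


# The header shown above, closed so that it is well-formed.
snp_header = """<?xml version="1.0"?>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"
xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum
http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
</ExchangeSet>"""

print(describe_xml_grammar(snp_header))
```

A check along these lines could let Entrez.read notice that no DTD is
available, instead of silently returning an empty record; how it would be
wired into Bio.Entrez is not something this sketch addresses.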

