[Biopython] Parsing xml from Bioproject without DTD - how to use schema?

Joshua Klein mobiusklein at gmail.com
Sun May 17 21:59:33 UTC 2015


For your last issue, if you don't mind disentangling the data after you've
pulled it out of the XML document, you can use this pattern to convert the
document into an equivalent nested collection of dictionaries:

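from collections import defaultdict

def recursive_dict(element):
    # Start from the element's XML attributes.
    data_dict = dict(element.attrib)
    # Convert each child element into a (tag, data) pair.
    children = map(recursive_dict, element)
    # Group child data by tag, since a tag may repeat.
    children_nodes = defaultdict(list)
    clean_nodes = {}
    for node, data in children:
        children_nodes[node].append(data)
    # A tag that occurs once maps to its data; a repeated tag maps to a list.
    for node, data_list in children_nodes.items():
        clean_nodes[node] = data_list[0] if len(data_list) == 1 else data_list

    if clean_nodes:
        data_dict.update(clean_nodes)

    # Keep text content unless it is missing or whitespace-only.
    if element.text is not None and not element.text.isspace():
        data_dict['text'] = element.text
    # An element with only text collapses to the bare string.
    if len(data_dict) == 1 and 'text' in data_dict:
        data_dict = data_dict['text']
    tag = element.tag
    return tag, data_dict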

Feed it the root of the ElementTree you want to parse, and it will return
a (tag, data) pair holding the root's tag and the complete tree in
dictionary form.
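
As a rough sketch of that step, the example below parses a small XML
document from a file-like object and converts it. The sample XML and its
element names are made up for illustration and are not the real bioproject
layout. An Entrez.efetch handle is also file-like, so it can probably be
passed to ET.parse the same way, but that is an assumption I have not
checked against bioproject.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def recursive_dict(element):
    """Convert an Element into a (tag, data) pair of nested dicts."""
    data_dict = dict(element.attrib)
    children_nodes = defaultdict(list)
    for node, data in map(recursive_dict, element):
        children_nodes[node].append(data)
    # A tag that occurs once maps to its data; a repeated tag maps to a list.
    for node, data_list in children_nodes.items():
        data_dict[node] = data_list[0] if len(data_list) == 1 else data_list
    if element.text is not None and not element.text.isspace():
        data_dict['text'] = element.text
    # An element with only text collapses to the bare string.
    if len(data_dict) == 1 and 'text' in data_dict:
        data_dict = data_dict['text']
    return element.tag, data_dict


# Hypothetical sample data, standing in for a real efetch result.
SAMPLE = b"""<RecordSet>
  <DocumentSummary uid="1"><Project><Title>First</Title></Project></DocumentSummary>
  <DocumentSummary uid="2"><Project><Title>Second</Title><Organism taxID="9606">Homo sapiens</Organism></Project></DocumentSummary>
</RecordSet>"""

# io.BytesIO stands in for any file-like handle.
handle = io.BytesIO(SAMPLE)
root = ET.parse(handle).getroot()
tag, tree = recursive_dict(root)

# The repeated DocumentSummary tag becomes a list of dictionaries.
for summary in tree['DocumentSummary']:
    print(summary['uid'], summary['Project']['Title'])
```

Note that the nesting differs between the two records, which is the
disentangling you still have to do afterwards.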

From that dictionary you can infer an ad-hoc schema, which will most likely
be dependent on the class of organism you're looking at.
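
One possible way to infer such a schema, sketched under the assumption that
you have already converted several records to nested dictionaries: walk each
record, collect the path to every leaf value, and count how many records
contain each path. The helper names iter_paths and infer_schema and the
sample records are hypothetical.

```python
from collections import Counter


def iter_paths(data, prefix=""):
    """Yield a slash-separated path for every leaf value in nested data."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_paths(value, prefix + "/" + key)
    elif isinstance(data, list):
        # Repeated tags share one path, so list items keep the same prefix.
        for item in data:
            yield from iter_paths(item, prefix)
    else:
        yield prefix


def infer_schema(records):
    """Count how many records contain each leaf path."""
    counts = Counter()
    for record in records:
        # Use a set so a path is counted once per record.
        counts.update(set(iter_paths(record)))
    return counts


# Hypothetical records in the shape recursive_dict produces.
records = [
    {"uid": "1", "Project": {"Title": "First"}},
    {"uid": "2", "Project": {"Title": "Second",
                             "Organism": {"taxID": "9606",
                                          "text": "Homo sapiens"}}},
]

schema = infer_schema(records)
for path, count in sorted(schema.items()):
    print(path, count)
```

Paths present in every record are safe to rely on; the rest need a fallback
when you pull values out in a loop.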


On Sun, May 17, 2015 at 4:24 PM, Anna Simpson <acsimpson at gmail.com> wrote:

> Hi all,
> I've been trying to parse xml files from an efetch query to the bioproject
> database, and kept getting an error message about no dtd (and
> validation=False gets me no data at all) when using Entrez.read or
> Entrez.parse. I found a post on this mailing list from 2013, where a
> gentleman had the same problem - he emailed NCBI and was told the
> following:
>
> "Yes this is the "normal" but it is an oversight as a dtd was never
> created for this database. I will have to open a ticket to the developers
> to create this and have it included in the XML and on the DTD web page."
>
> I've emailed NCBI about this again but I'm guessing there still isn't one
> (and I can't find it in the DTD index page). But my various googlings have
> led me to find that there is a schema for bioproject, and that perhaps,
> somehow, it could be used to parse these xml files. How might I go about
> doing that?
>
> I've been trying to use xml parsers like element tree and Beautiful Soup
> but keep running into walls (how to stick an entrez handle into a parser,
> how to get it to give me deeply nested information when the nesting is
> different for each xml document I get and I'm running this through a loop)
> so it would be great if I could ...stop doing that.
>
> Thanks,
> Anna
> University of Washington, Seattle