[Biopython] Pubmeddata XML parsing with Entrez .fetch and .read

Thu Jul 15 09:44:06 UTC 2010

On Thu, Jul 15, 2010 at 3:52 AM, Guy Eakin <guyeakin at gmail.com> wrote:
> Sure, I am new, so there are probably errors, but how about something like a
> demonstration appended to the end of the tutorial section at
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105
>
> At its core, a simple demonstration that type(record) returns a class
> rather than a list, and that foo.attributes and foo.tag exist, would be
> helpful.  I am not using any of the sequence utilities, so I admit that my
> reading of those sections was brief.  Reiteration in the Entrez parsing
> sections would probably help people like me.
>
> A more verbose demonstration follows.
>
> Again, thanks for the help.
> Guy

Thank you for the detailed suggested text.

> 8.11.1  Parsing Medline records [intervening text omitted]
>
> At this point let’s address what these elements contain.  Consider the
> information returned by the following statement.
>
>>>> records[0]['PubmedData']
>
> {u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878',
> 'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month':
> '3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day':
> '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3',
> u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month':
> '7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]}
>
> It is important to recall that each item is a Biopython class, rather than
> simply a dictionary or list.  This can be verified by
>
>>>> type(records[0]['PubmedData']['ArticleIdList'])
>
> Which returns <class 'Bio.Entrez.Parser.ListElement'> rather than <type
> 'list'>
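
For anyone following along without a network connection, here is a minimal
sketch of the idea in Python 3. The ListElement class below is a hypothetical
stand-in written for this example, not the real Bio.Entrez.Parser.ListElement;
it only assumes what is described above, namely a list subclass that also
carries a tag and an attributes dictionary.

```python
# Hypothetical stand-in, NOT the real Bio.Entrez.Parser.ListElement.
class ListElement(list):
    """A list that also remembers the XML tag and attributes it came from."""

    def __init__(self, items=(), tag=None, attributes=None):
        super().__init__(items)
        self.tag = tag
        self.attributes = dict(attributes or {})


ids = ListElement(
    ["btp163", "10.1093/bioinformatics/btp163", "19304878", "PMC2682512"],
    tag="ArticleIdList",
)

print(type(ids).__name__)       # the subclass name, not 'list'
print(isinstance(ids, list))    # it still behaves as a list
print(ids == list(ids))         # and compares equal to a plain list
print(repr(ids))                # the default repr hides the subclass
print(ids.tag, ids.attributes)  # the extra information is on the object
```

Because the subclass inherits the list repr, printing it gives no hint that
it is anything other than a plain list, which is the source of the confusion.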

This is why I was suggesting to Michiel that we override the __repr__
for our subclassed objects, so that rather than seeing things like this:

['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']

we get something like:

ListElement(['btp163', '10.1093/bioinformatics/btp163', '19304878',
'PMC2682512'], attributes={...})

On deeper reflection, the trouble with this is that every child within the
list would get the longer form too, so the full representation of a
ListElement (or any container) would become very long, swamping the console
output, even if we literally show the attributes as a dot dot dot :(
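
To make the concern concrete, here is a rough Python 3 sketch. Both classes
are hypothetical stand-ins written for this message, not the actual
Bio.Entrez.Parser code, and the attribute value is made up. It only
illustrates how a verbose __repr__ on both the container and its children
compounds.

```python
# Hypothetical stand-ins for illustration only, not the Bio.Entrez classes.
class StringElement(str):
    def __new__(cls, value, attributes=None):
        self = super().__new__(cls, value)
        self.attributes = dict(attributes or {})
        return self

    def __repr__(self):
        # Elide the attributes with a literal "..." to keep it short.
        return "StringElement(%s, attributes={...})" % str.__repr__(self)


class ListElement(list):
    def __init__(self, items=(), attributes=None):
        super().__init__(items)
        self.attributes = dict(attributes or {})

    def __repr__(self):
        # list.__repr__ calls repr() on every child, so each child's
        # verbose form ends up inside the container's form as well.
        return "ListElement(%s, attributes={...})" % list.__repr__(self)


raw = ["btp163", "10.1093/bioinformatics/btp163", "19304878", "PMC2682512"]
ids = ListElement([StringElement(x, {"example": "made-up"}) for x in raw])

plain = repr(raw)
verbose = repr(ids)
print(plain)
print(verbose)
print(len(plain), len(verbose))  # compare the two lengths
```

With only four short strings the verbose form is already several times the
length of the plain one, and a whole PubmedData record nests much deeper.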

Maybe we'll have to settle for just documentation improvements.
Michiel - this is your code - what do you think?

Peter