[Biopython] Pubmeddata XML parsing with Entrez .fetch and .read

Guy Eakin guyeakin at gmail.com
Thu Jul 15 13:32:43 UTC 2010


From the naive first-time user perspective, the current implementation is
fine for the computer, but could benefit from a viewer output that creates a
more human readable representation.  I found myself cross referencing the
console output to the original XML in most cases.

That says to me that there might be benefit to a function that recapitulates
the original XML's nested structure, listing attribute values.  I would
think something along the following lines would be quite useful and, if
limited to a particular range of records, would not necessarily be unwieldy.

(apologies for the admittedly unwieldy markup)

>>> Bio.Entrez.viewer(recordlist, range=(0, len(recordlist)),
...     ShowMedlineCitation=True, ShowPubmedData=True)

<Parent1> - (Attributes = Parent1.attribute) -  \n      #(80 characters/line)
..............value
..............indented text allows word wrap of entries > 80 char.
........<Children>1 - Attribute - \n
.......................value


That's an off-the-cuff representation before I dash to a meeting, but I
think you can see what I am suggesting.
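
A minimal sketch of how such a viewer might walk a parsed record is given here. Everything in it is an assumption made for illustration: `view` is a hypothetical helper (no `Bio.Entrez.viewer` exists in the thread's code), and `ListElement`, `DictElement` and `StringElement` are small stand-ins that only mimic the behaviour described in this thread, namely built-in types that also carry an `attributes` dictionary. The sample identifier and its `IdType` attribute are illustrative values.

```python
import textwrap


# Stand-ins for the parser's element classes (assumption: each subclasses a
# built-in type and carries an `attributes` dict, as described in the thread).
class ListElement(list):
    def __init__(self, items=(), attributes=None):
        super().__init__(items)
        self.attributes = attributes or {}


class DictElement(dict):
    def __init__(self, items=(), attributes=None):
        super().__init__(items)
        self.attributes = attributes or {}


class StringElement(str):
    def __new__(cls, value, attributes=None):
        self = super().__new__(cls, value)
        self.attributes = attributes or {}
        return self


def view(element, name="record", depth=0, width=80):
    """Return the lines of an indented, XML-like outline (hypothetical helper)."""
    pad = "." * (4 * depth)
    header = "%s<%s>" % (pad, name)
    attrs = getattr(element, "attributes", {})
    if attrs:
        header += " - (Attributes = %r)" % (dict(attrs),)
    lines = [header]
    if isinstance(element, dict):
        # Recurse into each child, labelled by its key.
        for key in sorted(element):
            lines.extend(view(element[key], key, depth + 1, width))
    elif isinstance(element, (list, tuple)):
        # Recurse into each item, labelled by its position.
        for index, child in enumerate(element):
            lines.extend(view(child, "%s[%d]" % (name, index), depth + 1, width))
    else:
        # Leaf value: wrap long text and keep the indentation on every line.
        child_pad = "." * (4 * (depth + 1))
        lines.extend(textwrap.wrap(str(element), width=width,
                                   initial_indent=child_pad,
                                   subsequent_indent=child_pad))
    return lines


sample = DictElement({
    "Status": "ppublish",
    "Ids": ListElement([StringElement("19304878", {"IdType": "pubmed"})]),
})
print("\n".join(view(sample, "PubmedData")))
```

Returning a list of lines rather than printing directly would make it easy to restrict the output to a range of records or to send it to a file.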

Guy


On Thu, Jul 15, 2010 at 5:44 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Thu, Jul 15, 2010 at 3:52 AM, Guy Eakin <guyeakin at gmail.com> wrote:
> > Sure, I am new, so there are probably errors, but how about something
> > like a demonstration appended to the end of the tutorial section at
> > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105
> >
> > At its core, a simple demonstration that type(record) returns a
> > Biopython class rather than a list, and that foo.attributes and foo.tag
> > exist, would be helpful.  I am not using any of the sequence utilities,
> > so I admit that my reading of those sections was brief.  Reiteration in
> > the Entrez parsing sections is probably helpful for people like me.
> >
> > A more verbose demonstration follows.
> >
> > Again, thanks for the help.
> > Guy
>
> Thank you for the detailed suggested text.
>
> > 8.11.1  Parsing Medline records [intervening text omitted]
> >
> > At this point let’s address what these elements contain.  Consider the
> > information returned by the following statement.
> >
> > >>> records[0]['PubmedData']
> >
> > {u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163',
> > '19304878', 'PMC2682512'], u'PublicationStatus': 'ppublish',
> > u'History': [{u'Month': '3', u'Day': '20', u'Year': '2009'},
> > {u'Minute': '0', u'Month': '3', u'Day': '24', u'Hour': '9',
> > u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day': '24',
> > u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '7',
> > u'Day': '10', u'Hour': '9', u'Year': '2009'}]}
> >
> > It is important to recall that each item is an instance of a Biopython
> > class, rather than simply a dictionary or list.  This can be verified by
> >
> > >>> type(records[0]['PubmedData']['ArticleIdList'])
> >
> > which returns <class 'Bio.Entrez.Parser.ListElement'> rather than
> > <type 'list'>.
>
> This is why I was suggesting to Michiel that we override the __repr__
> for our subclassed objects, so that rather than seeing things like this:
>
> ['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']
>
> we get something like:
>
> ListElement(['btp163', '10.1093/bioinformatics/btp163', '19304878',
> 'PMC2682512'], attributes={...})
>
> On deeper reflection, the trouble with this is that all the children within
> the list would get longer, so the full representation of a ListElement (or
> any container) would become very very long - swamping the console
> output. Even if we literally show the attributes with a dot dot dot :(
>
> Maybe we'll have to settle for just documentation improvements.
> Michiel - this is your code - what do you think?
>
> Peter
>
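
One possible middle ground for the __repr__ concern quoted above, sketched under stated assumptions: leave __repr__ alone so console output stays short, and offer a separate helper that annotates only the outermost object. `describe` is a hypothetical name, and the `ListElement` below is a stand-in that merely mimics what the thread describes (a list subclass carrying an `attributes` dictionary); it is not the parser's actual class.

```python
class ListElement(list):
    """Stand-in for the parser's list subclass (assumption: has `attributes`)."""

    def __init__(self, items=(), attributes=None):
        super().__init__(items)
        self.attributes = attributes or {}


def describe(element):
    """Show the class name and attributes of the outermost object only.

    Hypothetical helper: children keep their plain repr, so nested
    containers do not each add their own wrapper text.
    """
    attributes = getattr(element, "attributes", None)
    if attributes is None:
        # Plain built-in objects are shown unchanged.
        return repr(element)
    return "%s(%s, attributes=%r)" % (
        type(element).__name__, repr(element), dict(attributes))


ids = ListElement(["btp163", "19304878"])
print(isinstance(ids, list))  # the stand-in still behaves as a list
print(type(ids) is list)      # but it is not the built-in type itself
print(repr(ids))              # unchanged, list-style console output
print(describe(ids))          # class name and attributes, one level deep
```

Because the helper is opt-in, existing console output and doctests that rely on the plain list and dict representations would be unaffected.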



