[Biopython] Parsing Pubmed-Entrez searches into a normalized relational resource

Sun Sep 19 10:51:19 UTC 2010

Christopher,

> What I would like to do is parse the returns of an entrez pubmed
> search into their smallest, unique useful bits and create a
> relational database (sqlite, dee?).  Ideally this would not only be
> of returned fields, but also drilling further down into say
> affiliation, addresses, etc...
[...]
> Where I am falling down is understanding how to extract the
> structure of these outputs and create a persistent relational
> resource that's been normalized such that these fields can be mapped
> to used to "correct" values in an uncurated dataset with highly
> analogous fields.

This is the standard problem of representing object-style data in a
flat relational database. It's tough to answer succinctly on a
mailing list, as there are entire textbooks and courses devoted to
the problem. The Wikipedia entry on database normalization, which
covers first normal form, is a good place to get started:

http://en.wikipedia.org/wiki/Database_normalization
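
To make the idea concrete, here is a minimal sketch using Python's
built-in sqlite3 module. The record layout, table names and column
names are all made up for illustration; they are not the real Entrez
output structure, which you would need to inspect and map yourself.
The point is only that repeated values (authors, affiliations) move
into their own tables and get referenced by id:

```python
import sqlite3

# Hypothetical, simplified records shaped like parsed PubMed articles.
# Field names are illustrative assumptions, not the actual Entrez schema.
records = [
    {"pmid": "1001", "title": "Paper A",
     "authors": [{"name": "Smith J", "affiliation": "Univ X"},
                 {"name": "Lee K", "affiliation": "Univ Y"}]},
    {"pmid": "1002", "title": "Paper B",
     "authors": [{"name": "Smith J", "affiliation": "Univ X"}]},
]

SCHEMA = """
CREATE TABLE affiliation (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE author (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    affiliation_id INTEGER NOT NULL REFERENCES affiliation(id),
    UNIQUE (name, affiliation_id));
CREATE TABLE article (pmid TEXT PRIMARY KEY, title TEXT);
CREATE TABLE article_author (
    pmid TEXT REFERENCES article(pmid),
    author_id INTEGER REFERENCES author(id),
    PRIMARY KEY (pmid, author_id));
"""

def build_database(records, path=":memory:"):
    """Load nested records into normalized tables; return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    for rec in records:
        conn.execute("INSERT INTO article VALUES (?, ?)",
                     (rec["pmid"], rec["title"]))
        for au in rec["authors"]:
            # Each distinct affiliation is stored once and looked up by name.
            conn.execute("INSERT OR IGNORE INTO affiliation (name) VALUES (?)",
                         (au["affiliation"],))
            aff_id = conn.execute(
                "SELECT id FROM affiliation WHERE name = ?",
                (au["affiliation"],)).fetchone()[0]
            # Each distinct (name, affiliation) pair is stored once.
            conn.execute(
                "INSERT OR IGNORE INTO author (name, affiliation_id) "
                "VALUES (?, ?)", (au["name"], aff_id))
            author_id = conn.execute(
                "SELECT id FROM author WHERE name = ? AND affiliation_id = ?",
                (au["name"], aff_id)).fetchone()[0]
            # The link table records the many-to-many relationship.
            conn.execute("INSERT OR IGNORE INTO article_author VALUES (?, ?)",
                         (rec["pmid"], author_id))
    conn.commit()
    return conn

conn = build_database(records)
```

Matching authors purely on name plus affiliation is a simplifying
assumption; real author disambiguation is considerably messier.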

As for accessing relational databases, Python is great for this.
An object-relational mapper like SQLAlchemy:

http://www.sqlalchemy.org/

is a great place to get started. This allows you to deal more
directly with objects, and also generalizes database access so you
can quickly switch from SQLite to MySQL to whatever.

Another suggestion is to use a document-oriented database like
MongoDB for storing your data:

http://www.mongodb.org/

This allows you to store objects without flattening them, which may
be more intuitive for the XML/dictionary results you get back from
Entrez searches.
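
As a rough illustration of what "without flattening" means, the
sketch below keeps a nested record intact as a single document. To
keep it runnable without a database server, it uses the standard
json module and a plain dictionary as a stand-in for MongoDB; the
record layout and the helper names store_document and load_document
are hypothetical:

```python
import json

# Hypothetical nested record, shaped like a parsed Entrez result.
# Field names are illustrative assumptions, not the real Entrez schema.
record = {
    "pmid": "1001",
    "title": "Paper A",
    "authors": [
        {"name": "Smith J", "affiliation": "Univ X"},
        {"name": "Lee K", "affiliation": "Univ Y"},
    ],
}

def store_document(store, doc):
    """Save the whole nested record as one JSON document, keyed by pmid."""
    store[doc["pmid"]] = json.dumps(doc)

def load_document(store, pmid):
    """Return the record with its nested structure restored."""
    return json.loads(store[pmid])

store = {}  # stand-in for a document collection
store_document(store, record)
restored = load_document(store, "1001")

# Nested fields are reachable directly, with no joins required.
second_affiliation = restored["authors"][1]["affiliation"]
```

The trade-off is that shared values such as affiliations are
repeated in every document, so correcting one means touching each
record that contains it.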

Hope this helps,
Brad