[BioPython] Bio.Medline parser

Sat Aug 2 06:35:54 EDT 2008

Hi everybody,

For bug #2454:

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

I was looking at the parser in Bio.Medline, which can parse flat files in the Medline format. For an example, see Tests/Medline/pubmed_result2.txt:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/Medline/pubmed_result2.txt?rev=1.1&cvsroot=biopython&content-type=text/vnd.viewcvs-markup

I would like to suggest some changes to this parser.

Currently, it works as follows:

>>> from Bio import Medline
>>> parser = Medline.RecordParser()
>>> handle = open("mymedlinefile.txt")
>>> record = parser.parse(handle)

or, to iterate over a bunch of Medline records:

>>> from Bio import Medline
>>> parser = Medline.RecordParser()
>>> handle = open("mymedlinefile.txt")
>>> records = Medline.Iterator(handle, parser)
>>> for record in records:
...     pass  # do something with the record

I'd like to change these to

>>> from Bio import Medline
>>> handle = open("mymedlinefile.txt")
>>> record = Medline.read(handle)

and

>>> from Bio import Medline
>>> handle = open("mymedlinefile.txt")
>>> records = Medline.parse(handle)
>>> for record in records:
...     pass  # do something with the record

respectively.
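
To illustrate how the two functions could relate, here is a rough sketch (not the actual implementation). It assumes that records in the file are separated by blank lines, that parse is a generator, and that read simply returns the first record parse yields. The function bodies, the record type (just the raw lines of each block) and the second PMID in the sample text are made up for illustration.

```python
import io

def parse(handle):
    """Yield one record per blank-line-separated block.

    In this sketch a "record" is only the list of raw lines in the block.
    """
    lines = []
    for line in handle:
        line = line.rstrip()
        if line:
            lines.append(line)
        elif lines:
            # A blank line closes the current record.
            yield lines
            lines = []
    if lines:
        # The last record need not be followed by a blank line.
        yield lines

def read(handle):
    """Return the first record in the handle; raise if there is none."""
    for record in parse(handle):
        return record
    raise ValueError("No Medline record found")

# Two tiny records; the second PMID is invented for this example.
handle = io.StringIO("PMID- 12230038\nOWN - NLM\n\nPMID- 99999999\nOWN - NLM\n")
first = read(handle)
handle.seek(0)
count = len(list(parse(handle)))
```

Whether read should complain when the file holds more than one record is a design choice this sketch leaves open.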

In addition, currently the fields in the Medline file are stored as attributes of the record. For example, if the file is

PMID- 12230038
OWN - NLM
STAT- MEDLINE
DA  - 20020916
...

then the corresponding record is

record.pubmed_id = "12230038"
record.owner = "NLM"
record.status = "MEDLINE"
record.entry_date = "20020916"

I'd like to change two things here:

1) Use the key shown in the Medline file (e.g. "PMID"), instead of a descriptive name chosen by us (e.g. pubmed_id), to store each field.
2) Let the record class derive from a dictionary, and store each field as a key/value pair in this dictionary:

record["PMID"] = "12230038"
record["OWN"]  = "NLM"
record["STAT"] = "MEDLINE"
record["DA"]   = "20020916"
...

This avoids the attribute names that we chose rather arbitrarily, and greatly simplifies the parser. The parser will also be more robust if new fields are added to the Medline file format, since a new key needs no matching attribute name.
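
A minimal sketch of such a dictionary-based parser, to show why unknown fields need no special handling. This is an illustration under assumptions, not the proposed code: the Record class and parse function here are hypothetical, the key is taken from the first four columns of each line, the value starts at the seventh column, and lines indented by six spaces are treated as continuations of the previous field. The XYZ field in the sample text is invented. Keys that occur more than once in a record would need list values, which this sketch does not attempt.

```python
import io

class Record(dict):
    """Hypothetical record class: a dict keyed by the keys used in the file."""

def parse(handle):
    """Yield one Record per blank-line-separated block of the handle."""
    record = Record()
    key = None
    for line in handle:
        line = line.rstrip()
        if not line:
            # A blank line ends the current record, if there is one.
            if record:
                yield record
            record = Record()
            key = None
        elif line.startswith("      ") and key is not None:
            # Assumed continuation line: append to the previous field.
            record[key] += " " + line.strip()
        else:
            # Assumed layout: key in columns 1-4, "- " next, then the value.
            key = line[:4].strip()
            record[key] = line[6:]
    if record:
        yield record

# XYZ is a made-up key, standing in for a field added to the format later.
text = """PMID- 12230038
OWN - NLM
STAT- MEDLINE
DA  - 20020916
XYZ - a key this sketch has never heard of
      continued on a second line
"""
record = next(parse(io.StringIO(text)))
```

Because nothing in the loop depends on a table of known field names, the made-up XYZ field ends up in the record in the same way as PMID or DA.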

Currently there is very little information on the Medline parser in the documentation, so I doubt it has many users. Nevertheless, I wanted to check whether anybody has any objections or comments before I implement these changes.

--Michiel