[BioPython] Bio.Medline parser

Sat Aug 2 08:18:11 EDT 2008

On Sat, Aug 2, 2008 at 11:35 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi everybody,
>
> For bug #2454:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2454
>
> I was looking at the parser in Bio.Medline, which can parse flat files
> in the Medline format. For an example, see Tests/Medline/pubmed_result2.txt:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/Medline/pubmed_result2.txt?rev=1.1&cvsroot=biopython&content-type=text/vnd.viewcvs-markup
>
> I would like to suggest some changes to this parser.
>
> Currently, it works as follows:
>
>>>> from Bio import Medline
>>>> parser = Medline.RecordParser()
>>>> handle = open("mymedlinefile.txt")
>>>> record = parser.parse(handle)
>
> or, to iterate over a bunch of Medline records:
>
>>>> from Bio import Medline
>>>> parser = Medline.RecordParser()
>>>> handle = open("mymedlinefile.txt")
>>>> records = Medline.Iterator(handle, parser)
>>>> for record in records:
> ...     # do something with the record.
>
> I'd like to change these to
>
>>>> from Bio import Medline
>>>> handle = open("mymedlinefile.txt")
>>>> record = Medline.read(handle)
>
> and
>
>>>> from Bio import Medline
>>>> handle = open("mymedlinefile.txt")
>>>> records = Medline.parse(handle)
>>>> for record in records:
> ...     # do something with the record.
>
> respectively.

+1 (I agree)
That would fit with our recent parser changes, and consistency is good :)
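
To make the proposal concrete, here is a minimal sketch of what
module-level parse() and read() functions could look like. It is not the
Bio.Medline implementation: the Record class and both function bodies are
written only for illustration, and the sketch assumes that records are
separated by blank lines, that the tag sits in the first four columns, and
that continuation lines are indented by six spaces.

```python
import io


class Record(dict):
    """Hypothetical record type: a dict keyed by the raw Medline tags."""


def parse(handle):
    """Yield one Record per entry found in a Medline-style flat file."""
    record = Record()
    key = None
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current entry.
            if record:
                yield record
                record = Record()
            key = None
        elif line.startswith("      ") and key is not None:
            # Continuation of the previous field's value.
            record[key] += " " + line.strip()
        else:
            # Tag in columns 0-3, "- " in columns 4-5, value after that.
            key = line[:4].strip()
            # Simplification: a repeated tag overwrites the earlier value.
            record[key] = line[6:].strip()
    if record:
        yield record


def read(handle):
    """Return the first Record in the handle, or raise ValueError."""
    for record in parse(handle):
        return record
    raise ValueError("No Medline record found in handle")


SAMPLE = "PMID- 12230038\nOWN - NLM\nSTAT- MEDLINE\nDA  - 20020916\n"
for record in parse(io.StringIO(SAMPLE)):
    print(record["PMID"], record["DA"])
```

A real parser would need to collect repeated tags (several authors, for
example) into lists rather than overwrite them as this sketch does.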

> In addition, currently the fields in the Medline file are stored as attributes of the record. For example, if the file is
>
> PMID- 12230038
> OWN - NLM
> STAT- MEDLINE
> DA  - 20020916
> ...
>
> then the corresponding record is
>
> record.pubmed_id = "12230038"
> record.owner = "NLM"
> record.status = "MEDLINE"
> record.entry_date = "20020916"
>
> I'd like to change two things here:
>
> 1) Use the key shown in the Medline file instead of the name to store each field.
> 2) Let the record class derive from a dictionary, and store each field as a key, value pair in this dictionary.
>
> record["PMID"] = "12230038"
> record["OWN"]  = "NLM"
> record["STAT"] = "MEDLINE"
> record["DA"]   = "20020916"
> ...
>
> This avoids the names that were rather arbitrarily chosen by ourselves,
> and greatly simplifies the parser. The parser will also be more robust if
> new fields are added to the Medline file format.

One downside of this is that the user then has to consult the file
format documentation to discover that "DA" is the entry date, etc.  In
some cases the abbreviations are probably a little unclear.  I would
find code using the current named properties easier to read than the
suggested dictionary-based approach, which exposes the raw field names.
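
One possible middle ground, sketched below purely as an illustration, is a
dictionary-based record that also answers to the descriptive names. The
Record class and its ALIASES table are hypothetical; only the four
name-to-tag pairs come from the example in this thread.

```python
class Record(dict):
    """Hypothetical dict-based record with readable attribute aliases."""

    # Descriptive name -> raw Medline tag (pairs taken from the example above).
    ALIASES = {
        "pubmed_id": "PMID",
        "owner": "OWN",
        "status": "STAT",
        "entry_date": "DA",
    }

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[self.ALIASES[name]]
        except KeyError:
            raise AttributeError(name) from None


record = Record(PMID="12230038", OWN="NLM", STAT="MEDLINE", DA="20020916")
print(record["DA"], record.entry_date)
```

The raw tags remain the storage keys, so unknown or newly added fields
still parse; the alias table only has to cover the fields worth naming.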

Also, could you make the changes while leaving the older parser with
the old record behaviour in place (with deprecation warnings) for a
few releases?  This would allow existing users' scripts to continue
working as is (but with a warning).
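
As a rough sketch of that transition, the old entry point could stay
importable and emit a DeprecationWarning when it is used. The class below is
a stand-in written for this illustration, not the existing
Medline.RecordParser code, and the message wording is an assumption.

```python
import warnings


class RecordParser:
    """Stand-in for the old parser class, kept for backwards compatibility."""

    def __init__(self):
        # stacklevel=2 makes the warning point at the caller's line.
        warnings.warn(
            "RecordParser is deprecated; use the module-level "
            "read() and parse() functions instead.",
            DeprecationWarning,
            stacklevel=2,
        )


# Existing scripts keep working, but the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parser = RecordParser()
print(caught[0].category.__name__)
```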

> Currently there is very little information on the Medline parser in the
> documentation, so I doubt it has many users. Nevertheless, I wanted
> to check if anybody has any objections or comments before I implement
> these changes.

I think the first addition (read and parse functions) is very
sensible, but I am not sure about the suggested change to the record
behaviour.

Peter