[BioPython] Bio.Medline parser

Peter biopython at maubp.freeserve.co.uk
Sat Aug 2 14:09:58 UTC 2008


On Sat, Aug 2, 2008 at 2:32 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> One downside of this is that the user then has to go and
>> consult the file format documentation to discover "DA" is the
>> entry date, etc.  In some cases the abbreviations are probably
>> a little unclear.  I would find code using the current named
>> properties easier to read than the suggested dictionary based
>> approach which exposes the raw field names.
>
> What I noticed when I was playing with this parser is that it is often
> unclear which (Biopython-chosen) name goes with which (NCBI-chosen)
> key. For example, PMID is the pubmed ID number in the flat file. Should
> I look under "pmid", "PMID", "PubmedID"? (the correct answer is "pubmed_id").

If you did dir(record) how many possible candidates would you see?

> As you mention, the NCBI-chosen keys are often not very informative
> (who can guess that TT stands for "transliterated title"?). I was thinking
> to have a list of NCBI keys and their description in the docstring of
> Bio.Medline's Record class, so users can always find them without
> having to go into NCBI's documentation.

That would help users - and also future developers trying to
understand what the parser is doing!
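
Something along these lines, perhaps.  This is only a sketch: the
class body is a stand-in, and the four keys are just the ones mentioned
in this thread, with descriptions as given here rather than checked
against NCBI's documentation:

```python
class Record(dict):
    """Holds one MEDLINE record as a dictionary of NCBI keys.

    Keys mentioned so far (illustrative, incomplete list):

    PMID    PubMed ID
    DA      Entry date
    AU      Author
    TT      Transliterated title
    """


# help(Record) or Record.__doc__ then shows the key descriptions
# without a trip to the file format documentation.
print(Record.__doc__)
```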

> Another possibility is to overload the dictionary class such that all keys
> are automatically mapped to their more descriptive names. So the
> parser only knows about the NCBI-defined keys, but if a user types
> record["Author"], then the Record class knows it should return
> record["AU"]. With a corresponding modification of record.keys().
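
To check I follow, the aliasing would be something like the sketch
below.  The class name and the _ALIAS_TO_KEY table are made up for
illustration ("Author" -> "AU" is the only pair from your example),
and the author value is a placeholder:

```python
class AliasRecord(dict):
    """Dictionary of NCBI keys that also accepts descriptive aliases."""

    # Hypothetical alias table; only this one pair comes from the thread.
    _ALIAS_TO_KEY = {"Author": "AU"}

    def __getitem__(self, key):
        # Translate a descriptive alias to the NCBI key, if there is one.
        return dict.__getitem__(self, self._ALIAS_TO_KEY.get(key, key))

    def keys(self):
        # Report descriptive names where an alias exists, raw keys otherwise.
        key_to_alias = {v: k for k, v in self._ALIAS_TO_KEY.items()}
        return [key_to_alias.get(k, k) for k in dict.keys(self)]


record = AliasRecord()
record["AU"] = ["Smith J"]        # what the parser would store
print(record["Author"])           # same list, reached via the alias
print(record.keys())
```

Note that "Author" in record would still be False here unless
__contains__ is overridden as well.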

The alias idea is nice, but it means there is more than one way to
access the data (not encouraged in Python).  A related suggestion is
to support the properties record.entry_date, record.author etc
(whatever the current parser does) as alternatives to record["DA"],
record["AU"], ...  This would then be backwards compatible.  It
could probably be done with a private dictionary mapping keys ("DA")
to property names ("entry_date").  Whenever we add a new entry to the
record dictionary, we would also check whether it has a named property
to set.
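
A rough sketch of the property idea, not the existing Bio.Medline
code.  The _KEY_TO_PROPERTY table is hypothetical and holds only the
three names that have come up in this thread, and the values are
placeholders:

```python
# Private mapping from NCBI keys to the descriptive property names.
_KEY_TO_PROPERTY = {
    "DA": "entry_date",
    "AU": "author",
    "PMID": "pubmed_id",
}


class Record(dict):
    """Dictionary of NCBI keys which also sets the old named properties."""

    def __setitem__(self, key, value):
        dict.__setitem__(self, key, value)
        # If this key has a named property, define that too.
        name = _KEY_TO_PROPERTY.get(key)
        if name is not None:
            setattr(self, name, value)


record = Record()
record["PMID"] = "12345678"
record["AU"] = ["Smith J", "Jones A"]
print(record.pubmed_id)                 # old style access
print(record["AU"] is record.author)    # both routes give the same object
```

One catch: dict.update() and the dict constructor do not go through
__setitem__, so the parser would have to add entries one at a time
(or update() would need overriding too).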

>> Also, could you make the changes while leaving the older
>> parser with the old record behaviour in place (with deprecation
>> warnings) for a few releases?
>
> Yes that is possible. Existing scripts will use ...

Good, we shouldn't break existing scripts during the deprecation
transition period.
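
For the warning itself, the standard warnings module should be enough.
A minimal sketch, where OldRecordParser is a made-up stand-in for
whatever the old parser class ends up being and the message text is
only an example:

```python
import warnings


class OldRecordParser:
    """Stand-in for the old parser, kept working during the transition."""

    def __init__(self):
        # stacklevel=2 makes the warning point at the user's script
        # rather than at this line.
        warnings.warn(
            "This parser is deprecated; please switch to the new "
            "dictionary based Record.",
            DeprecationWarning,
            stacklevel=2,
        )


# Existing scripts keep running; they just see a warning if
# DeprecationWarning is enabled in their warning filters.
parser = OldRecordParser()
```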

Peter


