[BioPython] Medline records

Tue Dec 12 22:09:20 UTC 2006

Hi Thomas,
  I use the Medline/PubMed modules from Biopython (1.42 at least) and they
do work, although some fields contain weird data. See my comments
in the code below. It seems the PubMed people should clean up their database
contents a bit.

Thomas Elliott wrote:
> Hi,
> 
> I started playing with Biopython again.  Glad to see it is still  
> active.   I love Python.
> 
> Some issues came up in parsing Medline records according to the  
> tutorial.
> 
> It wasn't obvious which variables exist to be queried on a record.   
> The tutorial example gives title, authors, and source only.
> 
> I tried looking at the code.  There are terms in NLMMedlineXML.py  
> that look like they should work, but which raise AttributeError (e.g.  
> journal and date_created).
> 
> I finally looked in the keys to __dict__ for a  record.  I found  
> 'year' there, but record.year seems to be always empty, or rather it  
> is a blank string.
> 
> • record.title_abbreviation is actually the abbreviated journal name.
> • record.volume_issue gives only the volume, not the issue.
> • record.journal_title_code is always a blank string.
> 
> Not sure what the right way is to do this.  I guess it would be  
> helpful to know which file of the source I should be looking at for  
> variable names.
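
On the question of which variables exist on a record: dumping the instance
attributes, as you did with __dict__, is probably the quickest way to find out,
and it also shows which fields come back empty. A minimal sketch of that idea
follows. DemoRecord is only a made-up stand-in so the snippet runs without
network access, it is not Biopython's record class, and filled_attributes is a
helper name I chose for illustration.

```python
class DemoRecord(object):
    """Hypothetical stand-in for a parsed Medline record (NOT Biopython's class)."""

    def __init__(self):
        self.title = 'Molecular characterization of the murine Hif-1 alpha locus.'
        self.authors = ['Luo G', 'Gu YZ']
        self.source = 'Gene Expr 1997;6(5):287-99.'
        self.year = ''
        self.journal_title_code = ''


def filled_attributes(record):
    """Split the instance attribute names of any object into (filled, empty)."""
    filled, empty = [], []
    for name, value in sorted(vars(record).items()):
        if value in ('', None, [], {}):
            empty.append(name)
        else:
            filled.append(name)
    return filled, empty


filled, empty = filled_attributes(DemoRecord())
print("filled: " + ", ".join(filled))
print("empty:  " + ", ".join(empty))
```

Calling filled_attributes() on a real parsed record should work the same way, as
long as the record keeps its fields as plain instance attributes.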

from Bio import PubMed, Medline

def get_citation_by_pmid(pmid):
    """Fetches citation data from NCBI PubMed using pmid as a key.
    It returns a dictionary with the following structure:
    {'title': 'Molecular characterization of the murine Hif-1 alpha locus.', 'journal': 'Gene. Expr.',
     'author': 'Luo G., Gu Y. Z., Jain S., Chan W. K., Carr K. M., Hogenesch J. B., Bradfield C. A.',
     'volume': '6', 'year': 1997, 'issue': '5', 'pages': '287-299'}

    Test with PMID: 15703059, 10851087, 1111, 123456, 15703059, Y00664, 12509242, 10713153
    """

    rec_parser = Medline.RecordParser()
    medline_dict = PubMed.Dictionary(parser = rec_parser)
    cur_record = medline_dict[pmid]
    _authors = cur_record.authors # ['Luo G', 'Gu YZ', 'Jain S', 'Chan WK', 'Carr KM', 'Hogenesch JB', 'Bradfield CA']
    _new_authors = []
    for _author in _authors:
        _author = ' '.join(_author.split(' ')[:-1]+['. '.join(tuple(_author.split(' ')[-1:][0]))+'.'])
        # e.g. 'van Carr-Schmidt KM' becomes 'van Carr-Schmidt K. M.'
        _new_authors.append(_author)

    _title = cur_record.title # '[The laboratory in programs for enteric infection control]'
                              # 'Cap-independent translation of maize Hsp101.'
                              # "The chicken c-Jun 5' untranslated region directs translation by internal\ninitiation."
    if _title.startswith('[') and _title.endswith(']'):
        _title = cur_record.title[1:-1]

    if '\r\n' in _title:
        _title = _title.replace('\r\n', ' ')

    if '\n' in _title:
        _title = _title.replace('\n', ' ')

    # _volume_issue = cur_record.volume_issue # '6' but also '1-2' and also 'Pt 6'

    # pages
    _pages = cur_record.pagination # '287-99'
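    # only touch plain 'start-end' ranges; single pages and odd values pass through unchanged
    if _pages.count('-') == 1:
        _start_page, _last_page = _pages.split('-')
        if _start_page.isdigit() and _last_page.isdigit() and int(_last_page) < int(_start_page):
            # abbreviated last page: '287-99' really means '287-299'
            _last_page = _start_page[:-len(_last_page)] + _last_page
            _pages = _start_page + "-" + _last_page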

    # year
    _year = cur_record.publication_date
    if not _year:
        _year = cur_record.year

    try:
        _year = int(_year)
    except ValueError:
        # '1998 Oct' or '1975 May-Jun': keep only the leading year
        _year = int(_year.split(' ')[0])

    # journal
    _source = cur_record.source # 'Gene Expr 1997;6(5):287-99.'
                                # 'J Comp Physiol [A] 2000 Jun;186(6):567-74'
                                # 'Biokhimiia 1975 May-Jun;40(3):645-51.'

    # BUG: we should not blindly append dots to the end of the string,
    # for example this would be wrong in case of journals:
    # RNA, Oncogene, Nature ... where the correct citation is `Nature 33: 22-33', etc.
    _index = _source.find(cur_record.publication_date)
    _journal = _source[:_index - 1].strip() # strip the trailing space
    _journal = _journal.replace(' ','. ')
    if _journal[-1] != '.' and _journal[-1] != ')':
        _journal = _journal + '.'

    # volume and issue
    if ';' not in _source:
        raise ValueError("cannot find the semicolon separating journal title and date from volume and issue")
    else:
        # print "_source is " + _source
        _index = _source.index(';')
        if '(' not in _source or ')' not in _source:
            raise ValueError("cannot find round brackets around issue number")
        else:
            # use rindex so we do not match the first bracket, as with 'DNA Repair (Amst) 2002 May 30;1(5):379-90.'
            _i1 = _source.rindex('(')
            _i2 = _source.rindex(')')
            _volume = _source[_index + 1:_i1]
            # print "volume is " + str(_volume)

            # get the issue from here, although we might already have it right
            _issue = _source[_i1 + 1:_i2]
            # print "issue is " + str(_issue)

    if str(pmid) != str(cur_record.pubmed_id):
        raise RuntimeError("we asked PubMed for pmid=%s but received record with pmid=%s" % (pmid, cur_record.pubmed_id))

    _dict = {}
    _dict['year'] = _year
    _dict['author'] = ', '.join(_new_authors)
    _dict['title'] = _title
    _dict['journal'] = _journal
    _dict['volume'] = _volume
    _dict['issue'] = _issue
    _dict['pages'] = _pages

    return _dict
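
The page fix-up is the part most likely to trip over odd records, so here is
the same idea as a standalone function you can test without fetching anything.
This is only a sketch: expand_page_range is a name I made up, and unlike the
code above it compares string lengths rather than numeric values and returns
anything it does not recognize unchanged.

```python
def expand_page_range(pages):
    """Expand an abbreviated page range, e.g. '287-99' -> '287-299'.

    Anything that is not two plain numbers joined by a single hyphen
    is returned unchanged.
    """
    parts = pages.split('-')
    if len(parts) != 2:
        return pages
    first, last = parts[0].strip(), parts[1].strip()
    if not (first.isdigit() and last.isdigit()):
        return pages
    if len(last) < len(first):
        # borrow the missing leading digits from the first page
        last = first[:len(first) - len(last)] + last
    return first + '-' + last


for sample in ('287-99', '567-74', '645-51', '12', 'e123-e130'):
    print(sample + ' -> ' + expand_page_range(sample))
```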

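An alternative to the find()/rindex() slicing on the source string is a single
regular expression. The sketch below is an assumption-laden variant, not what I
actually run: SOURCE_RE and parse_source are names invented here, the pattern
is derived only from the handful of source strings quoted in the comments
above, and it treats the issue in brackets as optional.

```python
import re

# journal, then a date starting with a 4-digit year, then ';volume(issue):pages.'
SOURCE_RE = re.compile(
    r'^(?P<journal>.+?)\s+'
    r'(?P<date>\d{4}(?:\s+[^;]*)?);'
    r'(?P<volume>[^(:]*)'
    r'(?:\((?P<issue>[^)]*)\))?'
    r':(?P<pages>[^.]*)\.?$'
)


def parse_source(source):
    """Split a Medline source string into journal, date, year, volume, issue, pages."""
    match = SOURCE_RE.match(source.strip())
    if match is None:
        raise ValueError("unrecognized source string: %r" % (source,))
    fields = match.groupdict()
    fields['year'] = int(fields['date'][:4])
    return fields


fields = parse_source('Biokhimiia 1975 May-Jun;40(3):645-51.')
print("%(journal)s %(year)d; vol %(volume)s, issue %(issue)s, pages %(pages)s" % fields)
```

Because the journal part is matched lazily up to the first four-digit year, a
bracket inside the journal name, as in 'DNA Repair (Amst)', does not get
confused with the issue number.
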
Hope this helps.
-- 
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs