[BioPython] Medline records
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Tue Dec 12 22:09:20 UTC 2006
Hi Thomas,
I use the Medline/PubMed modules from Biopython (1.42 at least) and they
do work, although some fields contain weird data. See my comments
in the code below. It seems the PubMed people should clean up their database
contents a bit.
Thomas Elliott wrote:
> Hi,
>
> I started playing with Biopython again. Glad to see it is still
> active. I love Python.
>
> Some issues came up in parsing Medline records according to the
> tutorial.
>
> It wasn't obvious which variables exist to be queried on a record.
> The tutorial example gives title, authors, and source only.
>
> I tried looking at the code. There are terms in NLMMedlineXML.py
> that look like they should work, but which raise AttributeError (e.g.
> journal and date_created).
>
> I finally looked in the keys to __dict__ for a record. I found
> 'year' there, but record.year seems to be always empty, or rather it
> is a blank string.
>
> • record.title_abbreviation is actually the abbreviated journal name.
> • record.volume_issue gives only the volume, not the issue.
> • record.journal_title_code is always a blank string.
>
> Not sure what the right way is to do this. I guess it would be
> helpful to know which file of the source I should be looking at for
> variable names.
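On the question of which variables exist on a record: a generic way to find out is to list the instance attributes with vars() and see which ones actually hold data. This is only a minimal sketch, assuming the parsed record is an ordinary Python object (your __dict__ remark suggests it is). FakeRecord and filled_and_empty_fields are hypothetical names for illustration, not part of Biopython:

```python
class FakeRecord(object):
    """Hypothetical stand-in for a parsed Medline record (not a Biopython class)."""

    def __init__(self):
        self.title = 'Molecular characterization of the murine Hif-1 alpha locus.'
        self.authors = ['Luo G', 'Gu YZ']
        self.year = ''                # empty, as reported for record.year
        self.journal_title_code = ''  # empty, as reported above


def filled_and_empty_fields(record):
    """Split an object's instance attributes into names with and without data."""
    filled, empty = [], []
    for name, value in sorted(vars(record).items()):
        if value:
            filled.append(name)
        else:
            empty.append(name)
    return filled, empty


filled, empty = filled_and_empty_fields(FakeRecord())
```

Running the same function over a real record should show at a glance which attribute names are usable and which always come back blank.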
from Bio import PubMed, Medline

def get_citation_by_pmid(pmid):
    """Fetch citation data from NCBI PubMed using pmid as a key.

    Returns a dictionary with the following structure:
    {'title': 'Molecular characterization of the murine Hif-1 alpha locus.', 'journal': 'Gene. Expr.',
    'author': 'Luo G., Gu Y. Z., Jain S., Chan W. K., Carr K. M., Hogenesch J. B., Bradfield C. A.',
    'volume': '6', 'year': 1997, 'issue': '5', 'pages': '287-299'}
    Test with PMID: 15703059, 10851087, 1111, 123456, 15703059, Y00664, 12509242, 10713153
    """
    rec_parser = Medline.RecordParser()
    medline_dict = PubMed.Dictionary(parser=rec_parser)
    cur_record = medline_dict[pmid]

    # authors: put a dot after every initial, 'Gu YZ' becomes 'Gu Y. Z.'
    _authors = cur_record.authors # ['Luo G', 'Gu YZ', 'Jain S', 'Chan WK', 'Carr KM', 'Hogenesch JB', 'Bradfield CA']
    _new_authors = []
    for _author in _authors:
        _parts = _author.split(' ')
        _initials = '. '.join(_parts[-1]) + '.'
        # e.g. 'van Carr-Schmidt K. M.'
        _new_authors.append(' '.join(_parts[:-1] + [_initials]))

    # title: strip the square brackets around translated titles, join wrapped lines
    _title = cur_record.title # '[The laboratory in programs for enteric infection control]'
    # 'Cap-independent translation of maize Hsp101.'
    # "The chicken c-Jun 5' untranslated region directs translation by internal\ninitiation."
    if _title.startswith('[') and _title.endswith(']'):
        _title = _title[1:-1]
    _title = _title.replace('\r\n', ' ').replace('\n', ' ')

    # _volume_issue = cur_record.volume_issue # '6' but also '1-2' and also 'Pt 6'

    # pages: expand an abbreviated last page, '287-99' becomes '287-299'
    # (assumes two plain numbers separated by a single dash)
    _pages = cur_record.pagination # '287-99'
    _start_page, _last_page = _pages.split('-')
    _start_page, _last_page = int(_start_page), int(_last_page)
    if _last_page < _start_page:
        _fixed_last_page = str(_start_page)[:-len(str(_last_page))] + str(_last_page)
        _pages = str(_start_page) + "-" + _fixed_last_page

    # year: publication_date may be '1997', but also '1998 Oct' or '1975 May-Jun',
    # so keep only the part before the first space
    _year = cur_record.publication_date
    if not _year:
        _year = cur_record.year
    _year = int(_year.split(' ')[0])

    # journal
    _source = cur_record.source # 'Gene Expr 1997;6(5):287-99.'
    # 'J Comp Physiol [A] 2000 Jun;186(6):567-74'
    # 'Biokhimiia 1975 May-Jun;40(3):645-51.'
    # BUG: we should not blindly append dots to the end of the string,
    # for example this would be wrong in case of journals:
    # RNA, Oncogene, Nature ... where the correct citation is `Nature 33: 22-33', etc.
    _index = _source.find(cur_record.publication_date)
    _journal = _source[:_index - 1].strip() # strip the trailing space
    _journal = _journal.replace(' ', '. ')
    if _journal[-1] != '.' and _journal[-1] != ')':
        _journal = _journal + '.'

    # volume and issue
    if ';' not in _source:
        raise ValueError("cannot find semicolon as the delimiter of the title from volume and issue")
    # print "_source is " + _source
    _index = _source.index(';')
    if '(' not in _source or ')' not in _source:
        raise ValueError("cannot find round brackets around issue number")
    # use rindex so we do not match the first bracket, as with 'DNA Repair (Amst) 2002 May 30;1(5):379-90.'
    _i1 = _source.rindex('(')
    _i2 = _source.rindex(')')
    _volume = _source[_index + 1:_i1]
    # print "volume is " + str(_volume)
    # get the issue from here, although we might already have it right
    _issue = _source[_i1 + 1:_i2]
    # print "issue is " + str(_issue)

    if str(pmid) != str(cur_record.pubmed_id):
        raise RuntimeError("we asked PubMed for pmid=" + str(pmid) + " but received record with pmid=" + str(cur_record.pubmed_id))

    _dict = {}
    _dict['year'] = _year
    _dict['author'] = ', '.join(_new_authors)
    _dict['title'] = _title
    _dict['journal'] = _journal
    _dict['volume'] = _volume
    _dict['issue'] = _issue
    _dict['pages'] = _pages
    return _dict
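
The page and source handling above can also be pulled out into small helpers that are testable without a PubMed connection. What follows is only a rough sketch: expand_pages and split_source are hypothetical names, and the regular expression assumes the 'Journal YYYY[ Mon[ DD]];volume(issue):pages' layout of the example source strings in the comments above. Other layouts will raise ValueError.

```python
import re


def expand_pages(pages):
    """Expand an abbreviated page range such as '287-99' to '287-299'.

    Only touches ranges made of two plain numbers where the last page has
    fewer digits than the first; anything else is returned unchanged.
    """
    match = re.match(r'^(\d+)-(\d+)$', pages)
    if not match:
        return pages
    start, last = match.group(1), match.group(2)
    if len(last) < len(start):
        # Borrow the missing leading digits from the start page.
        last = start[:len(start) - len(last)] + last
    return start + '-' + last


# Assumed layout: journal, 4-digit year, optional month/day, ';',
# volume, optional '(issue)', ':', pages, optional trailing dot.
_SOURCE_RE = re.compile(
    r'^(?P<journal>.+?)\s+(?P<year>\d{4})[^;]*;'
    r'(?P<volume>[^(:]+)(?:\((?P<issue>[^)]*)\))?:(?P<pages>[^.]+)\.?$'
)


def split_source(source):
    """Split a source string into journal, year, volume, issue and pages."""
    match = _SOURCE_RE.match(source.strip())
    if match is None:
        raise ValueError("unrecognised source string: %r" % source)
    fields = match.groupdict()  # 'issue' is None when there are no brackets
    fields['pages'] = expand_pages(fields['pages'])
    return fields
```

With this, split_source('Gene Expr 1997;6(5):287-99.') gives the journal as 'Gene Expr' and the pages as '287-299'. Anchoring on the four-digit year also means the '(Amst)' in 'DNA Repair (Amst) 2002 May 30;1(5):379-90.' stays part of the journal name.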
Hope this helps.
--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs