[Biopython] Finding protein ID using Entrez.efetch

Fri Aug 28 10:56:24 UTC 2009

On Fri, Aug 28, 2009 at 11:37 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> To: <biopython at biopython.org>
>> Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST)
>> Subject: Finding protein ID using Entrez.efetch
>>
>> Hi all,
>>
>> I'm looking for the way to extract the data of protein ID numbers in
>> the Genbank. ...
>>
>> What I need is all the protein ID (For example: EEU21068.1) or GI
>> number (for example: 256615878) in this Genbank page for the blast
>> search.
>
> ...
>
> However, if that is all you need, then it is a waste to download the
> full GenBank file. Try using NCBI Entrez ELink instead?
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html

Try something based on this:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.elink(db="protein", dbfrom="nuccore", id="256615878", retmode="xml"))
>>> for db in data :
...     print "Links for", db["IdList"], "from database", db["DbFrom"]
...     for link in db["LinkSetDb"][0]["Link"] : print link["Id"]
...
Links for ['256615878'] from database nuccore
256616663
256616662
...
256615879

As we try to explain in the tutorial, the Entrez.read() XML parser turns the
XML data into Python lists, dictionaries and strings. This reflects the deeply
nested nature of the NCBI XML files - you have to dig into the hierarchy to
get to the actual list of protein IDs.
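
If it helps, here is a rough sketch of that digging wrapped up as a small
function. It is only an illustration, not part of Biopython: the name
linked_ids and the target_db argument are made up for this example, and it
uses Python 3 print() syntax rather than the print statement in the session
above. The sample data is a hand-written stand-in for what Entrez.read()
returns for an ELink query (a list of link sets, each holding "LinkSetDb"
entries with a "DbTo" name and a list of "Link" dictionaries), holding only
the three IDs shown in the output above, so it runs without contacting the
NCBI.

```python
def linked_ids(link_sets, target_db=None):
    """Collect the linked IDs from a parsed ELink result.

    link_sets is assumed to look like the output of Entrez.read() on an
    Entrez.elink() handle: a list of dictionaries, one per link set.
    target_db is a hypothetical convenience filter on the "DbTo" field.
    """
    ids = []
    for link_set in link_sets:
        # "LinkSetDb" can be an empty list when a record has no links,
        # so loop over it rather than indexing [0] directly.
        for link_db in link_set.get("LinkSetDb", []):
            if target_db is not None and link_db.get("DbTo") != target_db:
                continue
            ids.extend(link["Id"] for link in link_db["Link"])
    return ids


# Hand-written stand-in for the parsed XML (not a real download).
sample = [
    {
        "DbFrom": "nuccore",
        "IdList": ["256615878"],
        "LinkSetDb": [
            {
                "DbTo": "protein",
                "LinkName": "nuccore_protein",
                "Link": [
                    {"Id": "256616663"},
                    {"Id": "256616662"},
                    {"Id": "256615879"},
                ],
            }
        ],
    }
]

for protein_id in linked_ids(sample, target_db="protein"):
    print(protein_id)
```

With real data you would pass in the result of Entrez.read(Entrez.elink(...))
instead of the sample list. Looping over "LinkSetDb" rather than taking
element [0] means a record with no protein links just gives an empty list
instead of an IndexError.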

Peter