[Biopython] PDBid to Uniprot ID?

Wed Jul 29 00:38:44 EDT 2009

Peter wrote:
> On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke <matzke at berkeley.edu> wrote:
>> Hi all,
>>
>> I have succeeded in using the BioPython PDB parser to download a PDB file,
>> parse the structure, etc.  But I am wondering if there is an easy way to retrieve
>> the UniProt ID that corresponds to the structure?
>>
>> I.e., if the structure is 1QFC...
>> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC
>>
>> ...the Uniprot ID is (click "Sequence" above): P29288
>> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC
>>
>> I don't see a way to get this out of the current parser, so I guess I will schlep
>> through the downloaded structure file for "UNP    P29288" unless someone
>> has a better idea.
> 
> Well, I would at least look for a line starting "DBREF" and then search that
> for the reference.
> 
> Right now the PDB header parsing is minimal, and even that was something
> of an afterthought - Eric has been looking at this stuff recently, but I imagine
> he will be busy with his GSoC work at the moment. This could be handled
> as another tiny incremental addition to parse_pdb_header.py - right now I
> don't think it looks at the "DBREF" lines.
> 
> Peter
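
For the narrow question of pulling the UniProt accession out of a downloaded file, the DBREF scan suggested above could look roughly like the sketch below. It assumes the fixed-column layout of the DBREF record (chain ID in column 13, database code in columns 27-32, accession in columns 34-41) and collects only records whose database code is UNP. The function name uniprot_accessions and the placeholder REMARK line are made up for illustration; neither is part of Biopython.

```python
def uniprot_accessions(pdb_lines):
    """Return {chain ID: [UniProt accessions]} found in DBREF records."""
    found = {}
    for line in pdb_lines:
        # "DBREF " with the trailing space skips other record types,
        # including DBREF1/DBREF2 records, which use a different layout.
        if not line.startswith("DBREF "):
            continue
        chain = line[12:13]               # column 13
        database = line[26:32].strip()    # columns 27-32
        accession = line[33:41].strip()   # columns 34-41
        if database == "UNP":
            found.setdefault(chain, []).append(accession)
    return found


# Hypothetical input: one placeholder line plus the DBREF record for 1QFC.
example = [
    "REMARK placeholder line, ignored by the scan",
    "DBREF  1QFC A    1   306  UNP    P29288   PPA5_RAT        22    327",
]
print(uniprot_accessions(example))
```

For the 1QFC record this gives P29288 for chain A. With a real file, the same function could be fed the open file handle instead of a list.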

I forgot to post this to the list: a couple of weeks ago I wrote a 
function for parsing DBREF lines. It should be fairly comprehensive, 
since it follows the official specification for DBREF records.

Here's the code, to save other people from reinventing the wheel. Feel 
free to use it, modify it, or include it in a Biopython update.

===================
def parse_DBREF_line(line):
	"""
	Parse one DBREF record into a dictionary keyed by field name.

	Each value is a list: [data type, (start, end) columns, text, definition].

	Following the format here:
	http://www.wwpdb.org/documentation/format23/sect3.html

	Record Format

	COLUMNS   DATA TYPE     FIELD         DEFINITION
	----------------------------------------------------------------------
	 1 -  6   Record name   "DBREF "
	 8 - 11   IDcode        idCode        ID code of this entry.
	13        Character     chainID       Chain identifier.
	15 - 18   Integer       seqBegin      Initial sequence number of the
	                                      PDB sequence segment.
	19        AChar         insertBegin   Initial insertion code of the
	                                      PDB sequence segment.
	21 - 24   Integer       seqEnd        Ending sequence number of the
	                                      PDB sequence segment.
	25        AChar         insertEnd     Ending insertion code of the
	                                      PDB sequence segment.
	27 - 32   LString       database      Sequence database name.
	34 - 41   LString       dbAccession   Sequence database accession code.
	43 - 54   LString       dbIdCode      Sequence database identification
	                                      code.
	56 - 60   Integer       dbseqBegin    Initial sequence number of the
	                                      database segment.
	61        AChar         idbnsBeg      Insertion code of initial residue
	                                      of the segment, if PDB is the
	                                      reference.
	63 - 67   Integer       dbseqEnd      Ending sequence number of the
	                                      database segment.
	68        AChar         dbinsEnd      Insertion code of the ending
	                                      residue of the segment, if PDB is
	                                      the reference.

	Database name                         database
	                                      (code in columns 27 - 32)
	----------------------------------------------------------
	GenBank                               GB
	Protein Data Bank                     PDB
	Protein Identification Resource       PIR
	SWISS-PROT                            SWS
	TREMBL                                TREMBL
	UNIPROT                               UNP
	Test line:
	line = "DBREF  1QFC A    1   306  UNP    P29288   PPA5_RAT        22    327 "
	"""

	data_type_list = ['Record name',
	'IDcode',
	'Character',
	'Integer',
	'AChar',
	'Integer',
	'AChar',
	'LString',
	'LString',
	'LString',
	'Integer',
	'AChar',
	'Integer',
	'AChar']

	field_list = ['"DBREF "',
	'idCode',
	'chainID',
	'seqBegin',
	'insertBegin',
	'seqEnd',
	'insertEnd',
	'database',
	'dbAccession',
	'dbIdCode',
	'dbseqBegin',
	'idbnsBeg',
	'dbseqEnd',
	'dbinsEnd']

	def_list = ['',
	'ID code of this entry.',
	'Chain identifier.',
	'Initial sequence number of the PDB sequence segment.',
	'Initial insertion code of the PDB sequence segment.',
	'Ending sequence number of the PDB sequence segment.',
	'Ending insertion code of the PDB sequence segment.',
	'Sequence database name.',
	'Sequence database accession code.',
	'Sequence database identification code.',
	'Initial sequence number of the database segment.',
	'Insertion code of initial residue of the segment, if PDB is the reference.',
	'Ending sequence number of the database segment.',
	'Insertion code of the ending residue of the segment, if PDB is the reference.']

	charpos_list = [(1,6),
	(8,11),
	(13,13),
	(15,18),
	(19,19),
	(21,24),
	(25,25),
	(27,32),
	(34,41),
	(43,54),
	(56,60),
	(61,61),
	(63,67),
	(68,68)]

	# One empty placeholder value per field; filled in from the line below.
	data_list = [''] * len(field_list)

	# Build the result dictionary:
	# field name -> [data type, (start, end) columns, value, definition]
	dbref_dict = {}
	for index in range(len(field_list)):
		dbref_dict[field_list[index]] = [data_type_list[index],
			charpos_list[index], data_list[index], def_list[index]]

	# Fill in each value from its fixed-width column range.
	for field in field_list:
		startpos, endpos = dbref_dict[field][1]

		# get_char_range() was not included in this post. A plain slice is
		# assumed to be equivalent: columns are 1-based and inclusive.
		dbref_dict[field][2] = line[startpos - 1:endpos]

	return dbref_dict
===================
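
The function above returns every value as a raw, space-padded string. If plain values are more convenient, a possible follow-up is sketched below, assuming the same column ranges as the table in the docstring. The names DBREF_FIELDS and dbref_values are made up for illustration, and returning None for a blank Integer column is a choice made here, not something the specification prescribes.

```python
# (field name, first column, last column, convert to int?)
# Columns are 1-based and inclusive, as in the DBREF record format table.
DBREF_FIELDS = [
    ("idCode", 8, 11, False),
    ("chainID", 13, 13, False),
    ("seqBegin", 15, 18, True),
    ("insertBegin", 19, 19, False),
    ("seqEnd", 21, 24, True),
    ("insertEnd", 25, 25, False),
    ("database", 27, 32, False),
    ("dbAccession", 34, 41, False),
    ("dbIdCode", 43, 54, False),
    ("dbseqBegin", 56, 60, True),
    ("idbnsBeg", 61, 61, False),
    ("dbseqEnd", 63, 67, True),
    ("dbinsEnd", 68, 68, False),
]


def dbref_values(line):
    """Return {field name: value}, padding stripped, Integer fields as int."""
    record = {}
    for name, first, last, is_int in DBREF_FIELDS:
        text = line[first - 1:last].strip()
        if is_int:
            # Assumption: a blank Integer column becomes None.
            record[name] = int(text) if text else None
        else:
            record[name] = text
    return record


row = dbref_values(
    "DBREF  1QFC A    1   306  UNP    P29288   PPA5_RAT        22    327")
print(row["database"], row["dbAccession"], row["dbseqBegin"], row["dbseqEnd"])
```

For the 1QFC record this prints the database code UNP, the accession P29288, and the database range 22 to 327.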


-- 
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Department of Integrative Biology
University of California, Berkeley