[Biopython] PDBid to Uniprot ID?
Nick Matzke
matzke at berkeley.edu
Wed Jul 29 04:38:44 UTC 2009
Peter wrote:
> On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke <matzke at berkeley.edu> wrote:
>> Hi all,
>>
>> I have succeeded in using the BioPython PDB parser to download a PDB file,
>> parse the structure, etc. But I am wondering if there is an easy way to retrieve
>> the UniProt ID that corresponds to the structure?
>>
>> I.e., if the structure is 1QFC...
>> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC
>>
>> ...the Uniprot ID is (click "Sequence" above): P29288
>> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC
>>
>> I don't see a way to get this out of the current parser, so I guess I will schlep
>> through the downloaded structure file for "UNP P29288" unless someone
>> has a better idea.
>
> Well, I would at least look for a line starting "DBREF" and then search that
> for the reference.
>
> Right now the PDB header parsing is minimal, and even that was something
> of an afterthought - Eric has been looking at this stuff recently, but I imagine
> he will be busy with his GSoC work at the moment. This could be handled
> as another tiny incremental addition to parse_pdb_header.py - right now I
> don't think it looks at the "DBREF" lines.
>
> Peter
I forgot to post this to the list: a couple of weeks ago I wrote a
function for parsing the DBREF line. It should be fairly comprehensive,
as it follows the official specification for DBREF records.
Here's the code, to save other people reinventing the wheel. Feel free
to use, modify, or include it in a Biopython upgrade, whatever...
===================
def get_char_range(line, startpos, endpos):
    """
    Return columns startpos..endpos of line (1-based, inclusive).

    NOTE: this helper is called below but was not included in the
    original post. This version is an assumed reconstruction based on
    the 1-based column numbering used in the PDB format description.
    """
    return line[startpos - 1:endpos]


def parse_DBREF_line(line):
    """
    Parse one DBREF record into a dictionary keyed by field name.
    Each value is a list:
    [data type, (start, end) columns, data, definition]

    Following format here:
    http://www.wwpdb.org/documentation/format23/sect3.html

    Record Format

    COLUMNS   DATA TYPE    FIELD        DEFINITION
    ----------------------------------------------------------------
     1 -  6   Record name  "DBREF "
     8 - 11   IDcode       idCode       ID code of this entry.
    13        Character    chainID      Chain identifier.
    15 - 18   Integer      seqBegin     Initial sequence number
                                        of the PDB sequence segment.
    19        AChar        insertBegin  Initial insertion code
                                        of the PDB sequence segment.
    21 - 24   Integer      seqEnd       Ending sequence number
                                        of the PDB sequence segment.
    25        AChar        insertEnd    Ending insertion code
                                        of the PDB sequence segment.
    27 - 32   LString      database     Sequence database name.
    34 - 41   LString      dbAccession  Sequence database
                                        accession code.
    43 - 54   LString      dbIdCode     Sequence database
                                        identification code.
    56 - 60   Integer      dbseqBegin   Initial sequence number
                                        of the database segment.
    61        AChar        idbnsBeg     Insertion code of initial
                                        residue of the segment, if
                                        PDB is the reference.
    63 - 67   Integer      dbseqEnd     Ending sequence number
                                        of the database segment.
    68        AChar        dbinsEnd     Insertion code of the ending
                                        residue of the segment, if
                                        PDB is the reference.

    Database name                       database
                                        (code in columns 27 - 32)
    ----------------------------------------------------------
    GenBank                             GB
    Protein Data Bank                   PDB
    Protein Identification Resource     PIR
    SWISS-PROT                          SWS
    TREMBL                              TREMBL
    UNIPROT                             UNP

    Test line:
    line = "DBREF  1QFC A    1   306  UNP    P29288   PPA5_RAT        22    327 "
    """
    data_type_list = ['Record name',
                      'IDcode',
                      'Character',
                      'Integer',
                      'AChar',
                      'Integer',
                      'AChar',
                      'LString',
                      'LString',
                      'LString',
                      'Integer',
                      'AChar',
                      'Integer',
                      'AChar']
    field_list = ['"DBREF "',
                  'idCode',
                  'chainID',
                  'seqBegin',
                  'insertBegin',
                  'seqEnd',
                  'insertEnd',
                  'database',
                  'dbAccession',
                  'dbIdCode',
                  'dbseqBegin',
                  'idbnsBeg',
                  'dbseqEnd',
                  'dbinsEnd']
    def_list = ['',
                'ID code of this entry.',
                'Chain identifier.',
                'Initial sequence number of the PDB sequence segment.',
                'Initial insertion code of the PDB sequence segment.',
                'Ending sequence number of the PDB sequence segment.',
                'Ending insertion code of the PDB sequence segment.',
                'Sequence database name.',
                'Sequence database accession code.',
                'Sequence database identification code.',
                'Initial sequence number of the database segment.',
                'Insertion code of initial residue of the segment, '
                'if PDB is the reference.',
                'Ending sequence number of the database segment.',
                'Insertion code of the ending residue of the segment, '
                'if PDB is the reference.']
    # (start, end) column positions, 1-based and inclusive
    charpos_list = [(1, 6),
                    (8, 11),
                    (13, 13),
                    (15, 18),
                    (19, 19),
                    (21, 24),
                    (25, 25),
                    (27, 32),
                    (34, 41),
                    (43, 54),
                    (56, 60),
                    (61, 61),
                    (63, 67),
                    (68, 68)]
    # One empty data slot per field, filled in below
    data_list = [''] * len(field_list)

    # Build the dictionary: field name -> [type, columns, data, definition]
    dbref_dict = {}
    for index in range(len(field_list)):
        dbref_dict[field_list[index]] = [data_type_list[index],
                                         charpos_list[index],
                                         data_list[index],
                                         def_list[index]]

    # Slice each field out of the line by its column positions
    for field in field_list:
        startpos, endpos = dbref_dict[field][1]
        dbref_dict[field][2] = get_char_range(line, startpos, endpos)
    return dbref_dict
===================
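
For the original question (PDB ID to UniProt ID), only three of the columns in the table above are needed: the chain (column 13), the database code (columns 27 - 32) and the accession (columns 34 - 41). Here is a minimal, self-contained sketch of that shortcut. The function name uniprot_ids_from_pdb_lines is hypothetical and not part of Biopython, the REMARK line is just a placeholder, and the sketch assumes the file uses the UNP database code (older entries may use SWS instead, which it ignores).

```python
def uniprot_ids_from_pdb_lines(lines):
    """Return (chain ID, accession) pairs for every UNP DBREF record.

    Hypothetical helper for illustration; column positions follow the
    DBREF record format (1-based columns converted to 0-based slices).
    """
    results = []
    for line in lines:
        # Columns 1 - 6 must be the record name "DBREF "
        if line[0:6] != "DBREF ":
            continue
        # Columns 27 - 32: sequence database code
        if line[26:32].strip() != "UNP":
            continue
        chain_id = line[12:13]            # column 13
        accession = line[33:41].strip()   # columns 34 - 41
        results.append((chain_id, accession))
    return results


example_lines = [
    "REMARK   placeholder line that is not a DBREF record",
    "DBREF  1QFC A    1   306  UNP    P29288   PPA5_RAT        22    327",
]
print(uniprot_ids_from_pdb_lines(example_lines))
# prints [('A', 'P29288')]
```

To run it on a downloaded structure, pass an open file handle for the PDB file as the lines argument, since iterating over a file yields one record per line.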