[Biopython] problems searching swiss prot
Peter Cock
p.j.a.cock at googlemail.com
Tue Sep 14 05:13:04 EDT 2010
On Mon, Sep 13, 2010 at 9:40 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Forwarding a query from Jessica Grant since she appears
> to have had trouble posting to the mailing list.
>
> Jessica wrote:
>
>> Hello,
>>
>> I am running a few scripts to try to extract sequence information
>> out of uniprot. One program called AutoFACT gives me ID numbers
>> associated with that database. Most of these look like this:
>>
>> D2V5S4_NAEGR
>> Q48KU2_PSE14
>> Q22B72_TETTH
>>
>>
>> and my downstream scripts, which are written in biopython, are
>> fine with this. Then, every once in a while, a sequence will come
>> back with a name that looks like this:
>>
>> UPI00006CC162
>>
>> and everything goes bad. My script can't handle these names,
>> apparently, although if I go to uniprot.org and search for it, the
>> sequence comes up.
>>
>> My script uses the following, where RepID is the number
>> extracted from AutoFACT:
>>
>> handle = ExPASy.get_sprot_raw(RepID, cgi=None)
>> seq_record = SeqIO.read(handle, "swiss")
>>
>> Any thoughts?
>>
>> Thank you,
>>
>> Jessica
>
> Hi Jessica,
>
> I think the problem is that these unusual identifiers are
> not UniProt/SwissProt accession identifiers. The URL
> this Biopython function uses was originally from
> www.expasy.ch but is now on www.uniprot.org as
> described here:
>
> http://www.expasy.ch/expasy_urls.html
>
> I think the ID UPI00006CC162 is a UniProt ID of some
> kind, so it may be possible to access the information
> you want somehow. See for example:
>
> http://www.uniprot.org/uniparc/UPI00006CC162
>
> However, it is not clear to me right away if you can get
> this record back as a plain text "swiss" format entry...
>
> Peter
Jessica replied (off list) to say:
>> Oh, and I got a great help from someone at Uniprot for my
>> previous question...turns out you can get the sequences
>> downloaded as fasta files:
>>
>> http://www.uniprot.org/uniparc/UPI00006CC162.fasta
>>
>> and I could then read them into SeqIO as a fasta and
>> manipulate them that way.
I guess the UPI at the start stands for UniParc Identifier.
Note that the page I linked to earlier has links to several
file formats, including FASTA, but not the plain text
"SwissProt" format:

http://www.uniprot.org/uniparc/UPI00006CC162

Peter
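
The workaround described above could be sketched roughly as follows.
This is only an illustration, not tested against the live service. It
assumes that UniParc identifiers are "UPI" followed by ten hexadecimal
digits, that the .fasta URL quoted in this thread is still valid (it
may have moved since), and a recent Biopython on Python 3. The helper
names is_uniparc_id, uniparc_fasta_url and fetch_record are made up
for this example and are not part of Biopython.

```python
import re
from io import StringIO
from urllib.request import urlopen

# Assumption: a UniParc identifier is "UPI" plus ten hexadecimal digits,
# e.g. UPI00006CC162.
UNIPARC_PATTERN = re.compile(r"^UPI[0-9A-F]{10}$")


def is_uniparc_id(identifier):
    """Return True if the identifier looks like a UniParc ID (hypothetical helper)."""
    return bool(UNIPARC_PATTERN.match(identifier))


def uniparc_fasta_url(identifier):
    """Build the FASTA URL quoted in this thread (the address may have changed)."""
    return "http://www.uniprot.org/uniparc/%s.fasta" % identifier


def fetch_record(identifier):
    """Fetch one entry as a SeqRecord, picking the parser by identifier type."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import ExPASy, SeqIO

    if is_uniparc_id(identifier):
        # UniParc entries are not offered in "swiss" format, so use FASTA.
        with urlopen(uniparc_fasta_url(identifier)) as response:
            text = response.read().decode("utf-8")
        return SeqIO.read(StringIO(text), "fasta")

    # Ordinary UniProt entries (e.g. D2V5S4_NAEGR) go through ExPASy as before.
    handle = ExPASy.get_sprot_raw(identifier)
    try:
        return SeqIO.read(handle, "swiss")
    finally:
        handle.close()
```

Calling fetch_record("UPI00006CC162") would then take the FASTA route,
while fetch_record("D2V5S4_NAEGR") would still use the "swiss" parser.
Bear in mind that a FASTA record carries only the sequence and its
header line, so annotation found in a "swiss" record will be missing.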