[Biopython] IPI fetching

Tue Sep 8 13:41:40 UTC 2009

On Tue, Sep 8, 2009 at 1:39 PM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>>
>> Can you give us a specific example of an IPI number and the FASTA
>> record you want back?
>
> IPI00109764
>
>> ipi|IPI00109764|IPI00109764.2 DNA TOPOISOMERASE 1.
> MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF
>
> This particular entry has this Uniprot accession number:Q04750

So if you can work out the UniProt accession number, you can use
the Bio.ExPASy.get_sprot_raw() function to download the record in the
SwissProt/UniProt plain text format, e.g.

>>> from Bio import ExPASy
>>> from Bio import SeqIO
>>> record = SeqIO.read(ExPASy.get_sprot_raw("Q04750"), "swiss")
>>> print(record.format("fasta"))
>Q04750 RecName: Full=DNA topoisomerase 1; EC=5.99.1.2; AltName: Full=DNA topoisomerase I;
MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF
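
In case it helps to see what that FASTA text amounts to: it is a ">" header
line followed by the sequence split into fixed-width lines. Below is a rough
stdlib-only sketch of that layout. The helper name to_fasta is made up for
illustration (it is not a Biopython function), the 60 column width is an
assumption about the usual default, and the sequence is only the prefix
quoted in this thread, not the full entry.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record as text (hypothetical helper, not Biopython)."""
    # Header line: ">" + identifier, then the free-text description.
    header = ">%s %s" % (identifier, description)
    # Split the sequence into fixed-width chunks; width=60 is an assumption.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Only the first 30 residues shown in this thread, not the full protein.
prefix = "MSGDHLHNDSQIEADFRLNDSHKHKDKHKD"
print(to_fasta("Q04750", "DNA topoisomerase 1 (prefix only)", prefix))
```

In practice record.format("fasta") does this for you, so the sketch is only
meant to show the shape of the output.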

It looks like you should also be able to get the sequence directly from
the EBI using the International Protein Index (IPI) identifier,
IPI00109764. See:
http://www.ebi.ac.uk/IPI/IPIhelp.html

As per that old thread you referenced, Biopython should be able
to parse the "swiss" output from IPI. How about a quick and dirty
URL hack to access the EBI's SRS?

>>> import urllib
>>> from Bio import SeqIO
>>> ipi = "IPI00109764"
>>> url = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[IPI-acc:%s]+-ascii" % ipi
>>> record = SeqIO.read(urllib.urlopen(url), "swiss")
>>> print(record.format("fasta"))
>IPI00109764 DNA TOPOISOMERASE 1.
MSGDHLHNDSQIEADFRLNDSHKHKDKHKDRE...YEF
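
If you have a whole list of identifiers, it may be worth building the URL in
one place and sanity checking each identifier first. Here is a small
stdlib-only sketch. The URL template is the one from the session above;
srs_url is a hypothetical helper name, and the "IPI followed by eight digits"
pattern is an assumption based on the single example in this thread rather
than a documented rule.

```python
import re

# URL template copied from the session above (EBI SRS, plain text output).
SRS_TEMPLATE = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[IPI-acc:%s]+-ascii"

# Assumed identifier shape: "IPI" followed by 8 digits, as in IPI00109764.
IPI_PATTERN = re.compile(r"^IPI\d{8}$")


def srs_url(ipi):
    """Build the SRS query URL for one IPI identifier (hypothetical helper)."""
    ipi = ipi.strip()
    if not IPI_PATTERN.match(ipi):
        raise ValueError("Does not look like an IPI identifier: %r" % ipi)
    return SRS_TEMPLATE % ipi


print(srs_url("IPI00109764"))
```

The resulting URL can then be opened and handed to SeqIO.read(..., "swiss")
exactly as in the session above.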

Done? With a little tweaking of the URL you could also download this
directly as FASTA if you like, which would save some bandwidth.

Peter