[Biopython] downloading genome Protein table

Peter Cock p.j.a.cock at googlemail.com
Thu Oct 27 13:14:10 UTC 2011


On Thu, Oct 27, 2011 at 11:47 AM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> The problem is I have only the Refseq ID like NC_008390 and I don't have
> Protein table ID (in this case CP000441.ptt) so I can't download the .ptt
> file (as in ftp url
> ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid13490/CP000441.ptt
>   )

Since you have RefSeq identifiers, use ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
rather than ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/. In this case:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Burkholderia_ambifaria_AMMD_uid58303/NC_008390.ptt

>
> Also, not all the RefSeq IDs I have belong to 'Bacteria'.
>

Then the NCBI won't have them on the Bacterial FTP sites, and I
don't think they will provide *.ptt files for them.

> So for ID
> NC_004314 (just an example) I have to change the ftp url as
> ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Plasmodium_falciparum/NC_004314.ptt
>
> Downloading the *.gbk file may be an option (but later I need to convert
> them into protein table)

Just download *all* the bacterial protein tables as one tarball; it's only
120MB compressed:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.ptt.tar.gz

Then you can just search locally for a file by name etc.
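
As a minimal sketch of that local search, assuming the tarball has already
been saved to disk (the function name find_ptt_members is made up for
illustration, and I am not assuming any particular folder layout inside the
archive), Python's tarfile module can match members by file name:

```python
import posixpath
import tarfile


def find_ptt_members(tar_path, accession):
    """Return the names of archive members called <accession>.ptt.

    Only the base name is compared, so it does not matter which
    species folder (if any) the file sits in inside the tarball.
    """
    wanted = accession + ".ptt"
    with tarfile.open(tar_path, "r:gz") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and posixpath.basename(member.name) == wanted
        ]
```

Calling find_ptt_members("all.ptt.tar.gz", "NC_008390") should then list the
matching member name(s), which can be read with the archive's extractfile()
method.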

> so I tried this
> from Bio import Entrez
> Entrez.email = "from.d.putto at gmail.com"
> handle = Entrez.efetch(db="genome", id="NC_008390", rettype="gbk")
> print handle.read()
> The output shows me 'Nothing has been found'
> I am not sure in which database I should look for id like NC_008390.

Try searching for it across all the NCBI databases on their website:
http://www.ncbi.nlm.nih.gov/sites/gquery?term=NC_008390

You'll see it does match the genome database, but also the
nucleotide database. In this case you want the sequence as
a GenBank file so use the nucleotide database.
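
In Biopython terms that would look something like the sketch below. The
function name fetch_genbank and its arguments are made up for illustration.
I have used rettype="gbwithparts" to ask for the complete record including
the sequence, and the Bio import sits inside the function only so that the
snippet can be loaded on a machine without Biopython installed:

```python
def fetch_genbank(accession, email, out_path):
    """Save the GenBank record for a nucleotide accession to out_path."""
    # Imported here so the module loads even where Biopython is missing.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(
        db="nucleotide",  # not "genome": the sequence record lives here
        id=accession,
        rettype="gbwithparts",
        retmode="text",
    )
    try:
        with open(out_path, "w") as out:
            out.write(handle.read())
    finally:
        handle.close()
    return out_path
```

For example, fetch_genbank("NC_008390", "you@example.com", "NC_008390.gbk")
should leave the GenBank file on disk, network permitting.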

> Moreover later-on I need to convert 'gbk' file to .ptt (or extract protein
> information)

The Biopython GenBank parser can do that - life is easier with
bacterial genomes as there are (almost) no nasty join(...)
locations to deal with.

Peter



