[Biopython] About access NCBI taxonomy database

Peter Cock p.j.a.cock at googlemail.com
Wed Jun 24 08:04:48 EDT 2009


Hi again Jing,

I have again CC'd the mailing list.

On Wed, Jun 24, 2009 at 12:25 PM, Tian, Jing <JT0831 at ecu.edu> wrote:
>
> Hi,Peter,
>
> Thank you very much for your detailed reply.That's a huge help.
> Your explanation is exactly what I want.

Thanks :)

> I still have some questions based on your reply:
>
> To implement this stage:(b) Map the GI numbers to NCBI taxonomy numbers.
>
> My original thought is going through each GI and find the corresponding
> tax_id in gi_taxid_prot.dmp,and then using tax_id to get its lineage from
> node.dmp and name.dmp,but I don't know if it will cause memory overload
> problem?

Excellent idea! I hadn't noticed that the gi_taxid_prot.dmp file
existed, as taxdump_readme.txt doesn't mention it. Looking closer,
yes, downloading that would give you a nice simple way to map from
protein GI numbers to their NCBI taxonomy IDs.

ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.zip

This is a simple tab separated file, so it is very easy to parse.
It is 422MB, so I would try loading it as a simple Python dict
(mapping GI to taxon ID), which should be fine on a recent
computer. Using integers rather than strings saves quite a bit
of memory - but you will need to turn GI strings into integers
when looking them up.

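>>> gi_to_taxon = dict()
>>> # Plain text mode; the old "rU" mode is rejected by Python 3.11+
>>> with open("gi_taxid_prot.dmp") as handle:
...     for line in handle:
...         # Each line is "<GI>\t<taxon ID>"
...         gi, taxon = line.rstrip("\n").split("\t")
...         gi_to_taxon[int(gi)] = int(taxon)
...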
>>> len(gi_to_taxon)
27416138
>>> gi_to_taxon[229305135]
525271

If you are still limited by memory you could do something
more clever, like mapping ranges of GI numbers to taxon IDs.
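
To make the range idea concrete, here is a minimal sketch. It assumes
the (GI, taxon ID) pairs arrive sorted by GI, and that runs of
consecutive GI numbers often share a taxon ID; I have not measured how
much memory this saves on the real file. The names build_ranges and
lookup are just illustrative.

```python
from bisect import bisect_right


def build_ranges(pairs):
    """Collapse (gi, taxon) pairs, sorted by GI, into runs of
    consecutive GI numbers that share one taxon ID."""
    starts, ends, taxa = [], [], []
    for gi, taxon in pairs:
        if starts and taxa[-1] == taxon and gi == ends[-1] + 1:
            # Extends the current run by one GI
            ends[-1] = gi
        else:
            # Gap in the GI numbers or a new taxon: start a new run
            starts.append(gi)
            ends.append(gi)
            taxa.append(taxon)
    return starts, ends, taxa


def lookup(ranges, gi):
    """Return the taxon ID for a GI, or raise KeyError if unknown."""
    starts, ends, taxa = ranges
    # Index of the last run starting at or before this GI
    i = bisect_right(starts, gi) - 1
    if i >= 0 and gi <= ends[i]:
        return taxa[i]
    raise KeyError(gi)
```

You would feed build_ranges the integer pairs parsed from each line of
the file instead of storing them in a dict, then call lookup(ranges, gi)
for each GI number from your BLAST results.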

> You mentioned there is a different way to approach this
> (You said:
> You can BLAST against species specific (or genera
> specific) databases, and then you know in advance
> where the matches come from.)
>
> Could you give a little more detail?

Given the existence of the gi_taxid_prot.dmp file you probably won't
need this. However, using standalone BLAST, you can prepare your
own species-specific databases from FASTA files using formatdb.
For online BLAST, the NCBI provides several pre-built databases,
and also lets you filter large databases like NR by species. See:
http://lists.open-bio.org/pipermail/biopython/2009-June/005264.html

> Another question is after we use the BioSQL script load_ncbi_taxonomy.pl
> to download the NCBI taxonomy and store it in the BioSQL taxon and
> taxon_name tables, do these tables include the mapping information
> (from GI to NCBI taxid)? or we also need to write code myself to do (b)
> stage separately,is that right?

No, using the BioSQL script load_ncbi_taxonomy.pl will not download and
store the GI to NCBI taxon id. You would have to do this yourself.

It sounds like working directly with the NCBI taxonomy files will be
simplest for your task.
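
As a rough sketch of getting a lineage from the taxonomy files: this
assumes the nodes.dmp layout from the taxonomy dump, where fields are
separated by tab, pipe, tab, the first two fields are the taxon ID and
its parent's taxon ID, and the root node is its own parent. The names
load_parents and lineage are just illustrative.

```python
def load_parents(lines):
    """Map each taxon ID to its parent taxon ID from nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t|\t")
        # fields[0] is the taxon ID, fields[1] is the parent taxon ID
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(parents, taxon):
    """Return the list of taxon IDs from the root down to taxon."""
    path = [taxon]
    # Walk upwards until we reach the node that is its own parent
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path[::-1]
```

For example, pass an open file handle for nodes.dmp to load_parents,
then call lineage(parents, gi_to_taxon[gi]) for each GI number you have
mapped to a taxon ID.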

> If we want change stage (b) as:Map the [species name] to NCBI tax_id,
> how could I approach that?

You could use Entrez to look up the species name online. However, the
names.dmp file in the taxonomy dump should include this information
(including any previous names and sometimes also misspellings, which
can be helpful).
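
A minimal sketch of the offline route, assuming the names.dmp layout
from the taxonomy dump: four fields per line (taxon ID, name, unique
name, name class) separated by tab, pipe, tab, with a trailing tab and
pipe. Names are lower-cased for lookup, and each name maps to a set
because I am not assuming names are unique. The name load_names is just
illustrative.

```python
def load_names(lines):
    """Map each lower-cased name in names.dmp to a set of taxon IDs."""
    name_to_taxa = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            # Drop the trailing field terminator
            line = line[:-2]
        fields = line.split("\t|\t")
        tax_id, name = int(fields[0]), fields[1]
        # fields[3] is the name class (e.g. "scientific name", "synonym"),
        # ignored here so that old names and misspellings also match
        name_to_taxa.setdefault(name.lower(), set()).add(tax_id)
    return name_to_taxa
```

After loading, name_to_taxa["homo sapiens"] would give the matching
taxon IDs, provided that name appears in the file.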

> I'm sorry I have so much questions.
>
> Thanks,
> Jing

Thanks Jing - I learnt something new too :)

Peter

