[Biopython] About access NCBI taxonomy database

Tue Jun 23 05:37:34 EDT 2009

Hi Jing & Tina,

I hope you don't mind me CC'ing this reply to the Biopython mailing
list, as I think this sort of advice could be of general interest.

On Tue, Jun 23, 2009 at 4:52 AM, Tian, Jing <JT0831 at ecu.edu> wrote:
>
> Hi, Peter,
>
> My classmate Tina asked you about how to do a local taxonomy
> search. Thank you for your reply, it's very helpful.
>
> I also have a question that needs your suggestions:
>
> From the taxonomy database, we need to get lineage information
> for a set of BLAST hits based on their GI numbers. This set
> might be very large, because we have about 1,000 to 10,000
> sequence IDs for BLAST input.

I wonder if you are trying to reproduce something like
the "Taxonomy Report" available with online BLAST?
http://www.ncbi.nlm.nih.gov/blast/taxblasthelp.shtml
As far as I know, the NCBI standalone BLAST doesn't
offer this feature - and you probably have too many
sequences to use the online BLAST search.

> Based on what you told us, we have three
> options to do that:
>
> 1. Use the NCBI Entrez tool to access NCBI Taxonomy online.
> 2. Download NCBI taxonomy from the FTP site and parse it ourselves.
> 3. Download NCBI taxonomy from the FTP site and use BioSQL.
>
> I'm new to Biopython and Python, but I'm familiar with SQL.
> Which option do you suggest?

Yes, to go from an NCBI taxonomy number to the NCBI lineage
any of those would work.

e.g. Going from NCBI taxonomy number 9606 (humans) to
the lineage: root; cellular organisms; Eukaryota;
Fungi/Metazoa group; Metazoa; ...; Homininae; Homo;
Homo sapiens

If you have only a small number of species to work with (say
under 50 lineages) I would recommend using the Entrez tool
online. There is an example of how to do this in the Entrez
chapter of the Biopython Tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

If you have, say, 2000 species, you could still use Entrez
online in stages, storing the results locally. Make sure
you follow the NCBI Entrez usage guidelines!
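
As a rough illustration of the "in stages" idea, here is a minimal
sketch that splits a list of taxonomy IDs into batches and fetches
each batch with Bio.Entrez. The function names, the batch size and
the pause between requests are assumptions made for this example,
not NCBI-recommended values, so check the usage guidelines before
relying on them. fetch_lineages() needs Biopython and a network
connection.

```python
import time


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_lineages(tax_ids, email, batch_size=100, pause=1.0):
    """Return {NCBI taxonomy ID: lineage string} via Entrez EFetch.

    batch_size and pause are illustrative guesses only.
    """
    # Imported here so that batches() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    lineages = {}
    for batch in batches(tax_ids, batch_size):
        handle = Entrez.efetch(db="taxonomy",
                               id=",".join(str(t) for t in batch),
                               retmode="xml")
        records = Entrez.read(handle)
        handle.close()
        for record in records:
            lineages[int(record["TaxId"])] = record["Lineage"]
        time.sleep(pause)  # wait between requests
    return lineages
```

Calling fetch_lineages([9606], email="you@example.org") (with your
own address) should give a dictionary you can save to a local file,
so each species only has to be looked up once.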

In general, if you have lots of species to find the lineage
for, I would use the taxonomy file downloaded from the
NCBI. If you think BioSQL will be useful in other aspects
of your work, then try that. You'll need to access the taxon
information with your own SQL queries. Otherwise it might
be easier to parse the file directly - but you will have to
write this code yourself!
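
If you do parse the file directly, the following is a minimal
sketch of one way to do it. It assumes the nodes.dmp and names.dmp
files from the NCBI taxonomy dump, where fields are separated by
"<tab>|<tab>" and each line ends with "<tab>|". The function names
are made up for this example.

```python
def split_dmp(line):
    """Split one taxonomy dump line into its fields."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line marker
    return line.split("\t|\t")


def load_parents(nodes_lines):
    """Map each taxonomy ID to its parent ID (from nodes.dmp)."""
    parents = {}
    for line in nodes_lines:
        if not line.strip():
            continue
        fields = split_dmp(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def load_scientific_names(names_lines):
    """Map each taxonomy ID to its scientific name (from names.dmp)."""
    names = {}
    for line in names_lines:
        if not line.strip():
            continue
        tax_id, name, _unique_name, name_class = split_dmp(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def lineage(tax_id, parents, names):
    """Walk up the parent links and return the lineage as a string."""
    path = []
    seen = set()  # the root node is its own parent, so guard loops
    while tax_id is not None and tax_id not in seen:
        seen.add(tax_id)
        path.append(names.get(tax_id, str(tax_id)))
        tax_id = parents.get(tax_id)
    return "; ".join(reversed(path))
```

Both loaders accept any iterable of lines, so you can pass an open
file handle, e.g. load_parents(open("nodes.dmp")), and then call
lineage(9606, parents, names).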

------------------------------------------------------------------

However, the above advice only covers the final step.
Your plan seems to have three stages:
(a) Run BLAST, getting back GI numbers.
(b) Map the GI numbers to NCBI taxonomy numbers.
(c) Map the NCBI taxonomy numbers to a lineage.

You haven't said anything about the organisms you are
working with, or the BLAST database you are using.
However, while you will have a vast number of BLAST
hits, I would guess these may only cover 2000 species.
This means step (c), mapping from the species to the
lineage will actually be relatively simple.

For step (a), running BLAST: You've said you have between
1,000 and 10,000 sequences to BLAST. With that many query
sequences, you should be running BLAST locally (either a
standalone installation, or on a local server at your institute).

I think step (b) will be the bottleneck: How to go from the
BLAST result GI numbers to a list of NCBI taxonomy
numbers, as this seems to be a big job. Depending on
what database you search, and your thresholds, you
might have 20 hits per sequence on average. That means
you could have 20,000 to 200,000 GI numbers to deal
with! You will need to be able to map all these BLAST
GI number results back to an NCBI taxonomy ID, and
you'll have to do this locally (not online - there are too
many).
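
One possible way to do the local mapping is sketched below. It
rests on two assumptions that are not stated above: that your BLAST
hit identifiers look like "gi|12345|ref|...", and that you have a
local two-column text file of GI number and NCBI taxonomy ID (I
believe the NCBI taxonomy FTP area offers such dumps, e.g.
gi_taxid_nucl.dmp and gi_taxid_prot.dmp, but please check). The
function names are invented for this example.

```python
import re

# Matches the GI number in identifiers such as "gi|12345|ref|..."
GI_PATTERN = re.compile(r"gi\|(\d+)")


def extract_gi(hit_id):
    """Return the GI number from a BLAST hit ID, or None if absent."""
    match = GI_PATTERN.search(hit_id)
    return int(match.group(1)) if match else None


def map_gi_to_taxid(lines, wanted_gis):
    """Scan 'GI <whitespace> taxid' lines, keeping only wanted GIs.

    The file is streamed line by line, so only the GIs you ask
    for are held in memory.
    """
    wanted = set(wanted_gis)
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        gi = int(parts[0])
        if gi in wanted:
            mapping[gi] = int(parts[1])
            if len(mapping) == len(wanted):
                break  # found everything, stop early
    return mapping
```

Collecting the GI numbers into a set first and then making a
single pass over the mapping file keeps this manageable even with
a few hundred thousand hits.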

Perhaps you need to approach this in a different way?
You can BLAST against species-specific (or genus-specific)
databases, and then you know in advance
where the matches come from.

------------------------------------------------------------------

> If we choose option 3, I know how to download and import the NCBI
> taxonomy into BioSQL, but I still have no idea how to get
> lineage information for each hit. I read some tutorials about
> BioSQL, but did not find the answer. Do you have some
> examples or suggestions for doing that?

http://biopython.org/wiki/BioSQL#NCBI_Taxonomy

The BioSQL script load_ncbi_taxonomy.pl will download the
NCBI taxonomy and store it in the BioSQL taxon and
taxon_name tables.

Each node will be recorded with a link to its parent ID. This
means that to get a lineage you can just recurse (or loop)
up the tree. Watch out for the root node pointing to itself
(BioSQL bug 2664).
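
A minimal sketch of looping up the tree via the parent links is
shown below. To keep it self-contained it uses sqlite3 with a tiny
in-memory table holding only a simplified subset of the BioSQL
taxon and taxon_name columns; the helper names are made up, and
with MySQL you would swap in your own connection and that
driver's placeholder style.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the BioSQL taxon tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,
                            ncbi_taxon_id INTEGER,
                            parent_taxon_id INTEGER);
        CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT,
                                 name_class TEXT);
        INSERT INTO taxon VALUES (1, 1, 1);  -- root points to itself
        INSERT INTO taxon VALUES (2, 131567, 1);
        INSERT INTO taxon VALUES (3, 2759, 2);
        INSERT INTO taxon_name VALUES
            (1, 'root', 'scientific name'),
            (2, 'cellular organisms', 'scientific name'),
            (3, 'Eukaryota', 'scientific name');
    """)
    return conn


def lineage_from_biosql(conn, ncbi_taxon_id):
    """Return the lineage (root first) as a list of names."""
    cur = conn.cursor()
    cur.execute("SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
                (ncbi_taxon_id,))
    row = cur.fetchone()
    taxon_id = row[0] if row else None
    names = []
    seen = set()  # stops the loop when the root points to itself
    while taxon_id is not None and taxon_id not in seen:
        seen.add(taxon_id)
        cur.execute("SELECT name FROM taxon_name WHERE taxon_id = ? "
                    "AND name_class = 'scientific name'", (taxon_id,))
        name_row = cur.fetchone()
        names.append(name_row[0] if name_row else str(taxon_id))
        cur.execute("SELECT parent_taxon_id FROM taxon "
                    "WHERE taxon_id = ?", (taxon_id,))
        parent_row = cur.fetchone()
        taxon_id = parent_row[0] if parent_row else None
    return list(reversed(names))
```

Note that parent_taxon_id refers to the internal taxon_id, not to
the NCBI taxonomy ID, which is why the first query translates one
into the other.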

In addition to these parent links (useful for going up the
tree towards the root), there are also left/right fields which
are useful for going down the tree (e.g. getting all the
taxa within a group). The idea is described at:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
(linked to from Biopython's BioSQL wiki page).
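
To show how the left/right fields can be used, here is a small
sketch that lists every taxon nested inside a given group with a
single query. It again uses an in-memory sqlite3 table with only a
simplified subset of the BioSQL taxon columns (assuming they are
named left_value and right_value), and the left/right numbers in
the demo data are made up by hand for a four-node tree.

```python
import sqlite3


def make_demo_db():
    """Tiny stand-in for the BioSQL taxon table with left/right values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,
                            ncbi_taxon_id INTEGER,
                            left_value INTEGER,
                            right_value INTEGER);
        INSERT INTO taxon VALUES (1, 1, 1, 8);       -- root
        INSERT INTO taxon VALUES (2, 131567, 2, 7);  -- cellular organisms
        INSERT INTO taxon VALUES (3, 2759, 3, 4);    -- Eukaryota
        INSERT INTO taxon VALUES (4, 2, 5, 6);       -- Bacteria
    """)
    return conn


def taxa_within(conn, ncbi_taxon_id):
    """Return NCBI taxonomy IDs of all taxa below the given node.

    A node lies inside a group when its left/right interval is
    strictly contained in the group's interval.
    """
    cur = conn.execute(
        "SELECT child.ncbi_taxon_id "
        "FROM taxon AS parent JOIN taxon AS child "
        "ON child.left_value > parent.left_value "
        "AND child.right_value < parent.right_value "
        "WHERE parent.ncbi_taxon_id = ? "
        "ORDER BY child.left_value",
        (ncbi_taxon_id,))
    return [row[0] for row in cur.fetchall()]
```

The point of this design is that no recursion is needed when going
down the tree: one range comparison finds the whole group.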

> Another question: can BioSQL be used under Windows?

Yes, I personally have tested BioSQL with MySQL on my
old Windows laptop. It wasn't very fast, but this was an
old machine.

> I appreciate your help very much!
>
> Best,
> Jing

Sure,

Peter