[Biopython] Blast.NCBIWWW qblast taxonID search space limitation

Thu Sep 24 16:04:52 UTC 2009

On Thu, Sep 24, 2009 at 4:47 PM, Carlos Javier Borroto
<carlos.borroto at gmail.com> wrote:
> On Thu, Sep 24, 2009 at 11:34 AM, Michael S. Koeris
> <michael.koeris at gmail.com> wrote:
>> I saw the option to put in an [ORGANISM] but I was hoping I could use the
>> TaxonID because, say, I want to BLAST all bacteria or all archaea.
>
> I'm doing exactly that, by putting something like this in my entrez_query:
> "txid6945[Organism:noexp]"
>
> I got that string by searching on the taxonomic database and then
> clicking to see all of the sequences of that taxon. I haven't tried
> using only "txid6945", and I don't know what "[Organism:noexp]"
> means, but I can tell you this works.

Where did [Organism:noexp] come from? I guess it tells Entrez
not to expand the organism name or the hierarchy?

I would just use "txid6945[Organism]" or "txid6945[orgn]" which is
shorter and I think clearer.
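
As a rough sketch of how this fits together with qblast: the two helper
names below are made up for illustration and are not part of Biopython;
only NCBIWWW.qblast and its entrez_query argument are. The ":noexp"
variant is included on the assumption that it switches off expansion to
child taxa, as guessed above. The default program and database are
arbitrary choices for the example.

```python
def taxon_entrez_query(taxid, expand=True):
    """Build an Entrez organism filter such as 'txid6945[Organism]'.

    Hypothetical helper, not a Biopython function. With expand=False the
    ':noexp' qualifier is added (assumed to disable taxonomy expansion).
    """
    field = "Organism" if expand else "Organism:noexp"
    return "txid%d[%s]" % (int(taxid), field)


def blast_within_taxon(sequence, taxid, program="blastp", database="nr"):
    """Run an online BLAST search restricted to one taxon (needs network access)."""
    # Imported here so the query helper above works without Biopython installed.
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(program, database, sequence,
                          entrez_query=taxon_entrez_query(taxid))
```

Calling blast_within_taxon(my_sequence, 6945) would then return a handle
to the XML results, which can be parsed with Bio.Blast.NCBIXML as usual.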

See also this blog post and the EInfo entry in the tutorial:
http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/

>>> from Bio import Entrez
>>> record = Entrez.read(Entrez.einfo(db="nuccore"))
>>> for field in record["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
PORG, Primary Organism, Scientific and common names of primary organism, and all higher levels of taxonomy

> As a side note on blasting, I think there is a bug in the XML
> generator from NCBI; I'm getting stuff like this:
> >>> print(blast_record.descriptions[0].title)
> gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes
> scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, putative
> [Ixodes scapularis]

This isn't a bug as such. The NCBI BLAST tools merge redundant
entries (identical sequences) into a single entry, joining their
titles with " >", which results in these odd identifiers.
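
If you want the individual entries back, something like the following
should do. It is only a sketch: split_merged_title is a made-up name,
and it assumes that " >" never occurs inside a description itself.

```python
def split_merged_title(title):
    """Split a merged BLAST hit title into its individual entries."""
    # Collapse any line wrapping (e.g. from an email) to single spaces first.
    flat = " ".join(title.split())
    # Merged entries are separated by " >" in the title string.
    return flat.split(" >")


title = ("gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative "
         "[Ixodes scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, "
         "putative [Ixodes scapularis]")
for entry in split_merged_title(title):
    print(entry)
```

The first entry is the one BLAST reports as the hit; the others are the
identical sequences that were folded into it.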

Peter