[Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage

Tue Apr 8 14:17:54 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #20 from mdehoon at ims.u-tokyo.ac.jp  2008-04-08 10:17 EST -------
> Bio.Entrez.read(taxon_handle) will return a list of dictionaries (one for each
> taxon ID supplied).  We've established a convention of sorts about "read()"
> versus "parse()", the first returns a single record and the second a record
> iterator.

> If a single taxon entry (currently held as a dictionary) is regarded as a
> record, then should Bio.Entrez.read() be called Bio.Entrez.parse() instead?

I thought about that also, but I think that having only Bio.Entrez.read() is
better. The reason is that some XML files returned by NCBI can be regarded as a
list of records (possibly a list of only one record), but others can never be
regarded as a list of records. That means we could have a Bio.Entrez.parse() in
addition to Bio.Entrez.read(), but not instead of Bio.Entrez.read().

Now, in practical situations that could get ugly, not to say counterintuitive.
For example, take Bio.Entrez.einfo. Without an argument, Bio.Entrez.einfo()
returns a list of NCBI databases. Bio.Entrez.einfo(db="pubmed") then returns a
dictionary with information about the pubmed database. (This double usage is
not my choice; this is how NCBI has it set up.) If we apply the parse/read rule
strictly, we'd get the following:

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> records = Entrez.parse(handle)
>>> for record in records:
...     print(record)
pubmed
protein
nucleotide
nuccore
....
taxonomy
toolkit
unigene
unists
>>> 

To me, this seems to be a bit too much, since this is actually just a list.
Now if we want information about pubmed, we'd use
>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
# Now we have to use read() instead of parse()

And here is the really tricky part: Is the following possible?
>>> handle = Entrez.einfo(db=["pubmed","taxonomy"])
For example, Entrez.efetch allows a list of Ids; a user may guess that
Entrez.einfo can handle a list of dbs. If it can, should the user then call
parse() for such a list, instead of the read() used above with db="pubmed"?

Unlike, for example, Bio.Blast.NCBIXML, where we always get a list of records,
for Bio.Entrez some XML files are more like a single record, whereas others are
more like a list of records, and it may not be obvious to the user which is
which. If you make a mistake, you have to repeat your query to NCBI, because
the handle is already partially read.
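
To illustrate the partially-read handle, here is a minimal sketch. It assumes
io.StringIO as a stand-in for the handle returned by an Entrez call, and the
XML content is made up for the example. A real handle is a network stream,
which cannot be rewound at all.

```python
import io

# Made-up XML standing in for an NCBI reply (illustrative only).
xml = "<eInfoResult><DbList><DbName>pubmed</DbName></DbList></eInfoResult>"
handle = io.StringIO(xml)

# The wrong function starts consuming the stream before it gives up...
first_attempt = handle.read(20)

# ...so whatever is called next only sees the remainder, not the full document.
second_attempt = handle.read()

print(repr(first_attempt))
print(repr(second_attempt))
```

With a real network handle the only way to get the start of the data back is
to repeat the query.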

If we define the read/parse rule as "read returns an object, parse returns an
iterator", then the existing Bio.Entrez.read() is still fine.
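
A rough sketch of that rule, using toy read() and parse() functions built on
the standard library's xml.etree.ElementTree. These are hypothetical
stand-ins written for this example, not the Bio.Entrez implementation, and
the XML is made up:

```python
import io
import xml.etree.ElementTree as ET

# Made-up XML resembling a list of database names (illustrative only).
XML = ("<DbList><DbName>pubmed</DbName><DbName>protein</DbName>"
       "<DbName>taxonomy</DbName></DbList>")


def read(handle):
    """Toy read(): consume the whole handle and return one object (a list)."""
    root = ET.parse(handle).getroot()
    return [element.text for element in root.iter("DbName")]


def parse(handle):
    """Toy parse(): return an iterator yielding one item at a time."""
    for _event, element in ET.iterparse(handle):
        if element.tag == "DbName":
            yield element.text


result = read(io.StringIO(XML))    # a single object, here a list
records = parse(io.StringIO(XML))  # an iterator; nothing is parsed yet
```

Under this rule, read() returning a list is not a contradiction: the list is
still a single object, and only parse() hands back an iterator.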

> I am also wondering if we should create simple record classes for
> the different XML data types (instead of using dictionaries).

This can be useful if the record is an empty object deriving from a dict. It
allows us to add a docstring to each record, while still preserving the
functionality of each record as a dictionary. I don't see a good use for
additional functionality right now.
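
As a sketch of what I mean (the class name and the field values are made up
for illustration, not an existing Bio.Entrez class):

```python
class TaxonRecord(dict):
    """Hypothetical record holding the fields of one taxon entry.

    The class body is empty apart from this docstring, so instances
    behave exactly like ordinary dictionaries.
    """


# Illustrative values only.
record = TaxonRecord(TaxId="9606", ScientificName="Homo sapiens")

print(record["ScientificName"])
print(sorted(record.keys()))
print(TaxonRecord.__doc__.splitlines()[0])
```

The only thing the subclass adds is a place to hang the documentation.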

Essentially, the XML file represents a dictionary (or a list of dictionaries);
the Python object we return should correspond to this. One alternative is to
have a record class with fields corresponding to the keys in the dictionary. So
>>> record.abc
>>> record.ddd
>>> record.klmnop
instead of
>>> record["abc"]
>>> record["ddd"]
>>> record["klmnop"]
But I like the second form better, because it allows us to call keys() on the
record and get the names of all fields.
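
A small sketch of the difference, assuming a hypothetical AttributeRecord
class for the first form and a plain dict for the second (the field names
are the placeholders used above):

```python
class AttributeRecord(object):
    """Hypothetical record exposing its fields as attributes."""

    def __init__(self, **fields):
        # Store each field as an instance attribute.
        self.__dict__.update(fields)


by_attribute = AttributeRecord(abc=1, ddd=2, klmnop=3)
by_key = {"abc": 1, "ddd": 2, "klmnop": 3}

# The dictionary lists its own field names directly.
key_names = sorted(by_key.keys())

# The attribute form has no keys(); the names must be dug out with vars().
attribute_names = sorted(vars(by_attribute))

print(key_names)
print(attribute_names)
```

The names are recoverable either way, but with the dictionary form the user
does not need to know about vars() or the instance __dict__.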
