[BioPython] EUtils-2.0a1

Tue Sep 21 05:15:07 EDT 2004

Hi all,

It's been a few months ... okay, more like 4 ... and I finally
have the first alpha version of the new iteration of EUtils
done.  I won't be able to work on it for the next few days,
so I hope some of you here get a chance to try it out
and give me feedback.

It's (temporarily) at
   http://www.dalkescientific.com/EUtils-2.0a1.tar.gz

There is essentially no documentation for it.  Here
are a few examples to get you started.

First, the simple interface

 >>> import EUtils
 >>> query = EUtils.search("biopython")
 >>> len(query)
3
 >>> for x in query.summary():
...   print x.dataitems["Authors"]
...
de Hoon MJ, Imoto S, Nolan J, Miyano S
Hamelryck T, Manderick B
Mangalam H
 >>> dbids = EUtils.DBIds("pubmed", ["9390282"])
 >>> print EUtils.efetch(dbids, "abstract").read()

1: Pac Symp Biocomput.  1997;:85-96.

Using Tcl for molecular visualization and analysis.

Dalke A, Schulten K.

Beckman Institute, Urbana, IL 61801, USA.

Reading and manipulating molecular structure data is a standard task in every
molecular visualization and analysis program, but is rarely available in a form
readily accessible to the user. Instead, the development of new methods for
analysis, display, and interaction is often achieved by writing a new program,
rather than building on pre-existing software. We present the Tcl-based script
language used in our molecular modeling program, VMD, and show how it can access
information about the molecular structure, perform analysis, and graphically
display and animate the results. The commands are available to the user and make
VMD a useful environment for studying biomolecules.

PMID: 9390282 [PubMed - indexed for MEDLINE]

 >>>

One thing to note is the "format" field ("abstract" in the
efetch call above).  This is new.  You can ask for formats like
"docsum", "brief", "fasta", etc. and it maps the format name to
the underlying EUtils request parameters.
E.g., "fasta" -> "rettype=fasta&retmode=text"

You can force the parameters to be what you want with a new
notation, so the "fasta/xml" format returns sequence data in the
TSeqSet XML format:

 >>> dataset = EUtils.post(EUtils.DBIds("protein", ["12345"]))
 >>> print dataset.efetch("fasta/xml").read()
<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
   <TSeq_seqtype value="protein"/>
   <TSeq_gi>12345</TSeq_gi>
   <TSeq_accver>CAA44029.1</TSeq_accver>
   <TSeq_taxid>4565</TSeq_taxid>
   <TSeq_orgname>Triticum aestivum</TSeq_orgname>
   <TSeq_defline>psaI [Triticum aestivum]</TSeq_defline>
   <TSeq_length>36</TSeq_length>
   <TSeq_sequence>MTDLNLPSIFVPLVGLVFPAIAMTSLFLYVQKNKIV</TSeq_sequence>
</TSeq>
</TSeqSet>
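
To make the format-name idea concrete, here is a minimal sketch of how
such a mapping could work.  It is not the EUtils code: FORMAT_DEFAULTS
and format_to_params are hypothetical names, only the "fasta" row comes
from the example above, and treating the part after the "/" as the
retmode is my assumption.  It is written for Python 3 rather than the
Python 2 used in the sessions here.

```python
# Hypothetical table: format name -> (rettype, retmode).
# Only the "fasta" entry is taken from the message; a real table
# would need one row per supported format name.
FORMAT_DEFAULTS = {
    "fasta": ("fasta", "text"),
}


def format_to_params(fmt):
    """Turn 'fasta' or 'fasta/xml' into a dict of request parameters.

    Assumption: the text after the slash overrides the default retmode.
    Unknown format names raise KeyError.
    """
    name, _, forced_mode = fmt.partition("/")
    rettype, retmode = FORMAT_DEFAULTS[name]
    if forced_mode:
        retmode = forced_mode
    return {"rettype": rettype, "retmode": retmode}


print(format_to_params("fasta"))
print(format_to_params("fasta/xml"))
```

The point of the table is that callers only need to remember one short
name, while the "/" form still leaves an escape hatch for unusual
combinations.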

Here's a new feature: experimental support for Unicode
searches.  Entrez converts some non-ASCII characters into
ASCII by removing accents and other marks.  This library
tries to do the same conversion before sending the query.

   Enable some debugging code

 >>> from EUtils import ThinClient
 >>> ThinClient.DUMP_URL = True

   I'm going to search for a city in Sweden.  I originally
   tried "España" but it turns out a lot of records spell it
   "Espana", without the tilde, so a match wouldn't show much.
   By comparison, the English spelling of Göteborg is
   Gothenburg, so records are unlikely to contain a plain
   "Goteborg".

 >>> GOT = u"G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"
 >>> results = EUtils.search(GOT)
Opening with GET:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=Goteborg&retmax=20&WebEnv=0Y9-O043-9cLfWBD8rh7CXNDbbccF8wDOOW0fwVKPRkyM890EFVPAe&retstart=0&tool=Biopython_EUtils_2_0&db=pubmed&email=biopython-dev%40biopython.org&usehistory=y

   See how the term=Goteborg doesn't have any special characters?

 >>> ThinClient.DUMP_URL = False
 >>> len(results)
14165
 >>> s = results[0].efetch("xml").read()

   To prove that Goteborg isn't in the string

 >>> s.find("Goteborg")
-1

   But that Göteborg is present

 >>> unicode(s, "utf8").find(GOT)
2594
 >>> print unicode(s, "utf8")[2500:2651].encode("utf8")
  whole communities.</AbstractText>
                 </Abstract>
                 <Affiliation>Department of Marine Ecology, Göteborg University, Box 461, SE 405 30, Göteborg, Sweden
 >>>
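
For anyone curious how accents and other marks can be removed, here is
a minimal sketch using the standard unicodedata module.  It is one
common way to do it and not necessarily what EUtils or Entrez does;
strip_marks is a hypothetical name.  It is written for Python 3, where
all strings are Unicode, rather than the Python 2 shown above.

```python
import unicodedata


def strip_marks(text):
    """Remove accents by decomposing and dropping combining marks.

    NFKD splits a character such as o-with-diaeresis into a plain "o"
    followed by a combining diaeresis; the combining characters are
    then filtered out.  Letters with no decomposition are left alone.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


got = "G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"
print(strip_marks(got))
```

This turns the search term into "Goteborg", the same ASCII form that
shows up in the dumped URL.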

The code I showed above uses a middle layer which
is more object oriented.  It has two clients.

One is the DBIdsClient, which sends the list of database
identifiers to/from the server.  The other is the
HistoryClient, which uses a sort of cookie mechanism
at NCBI that lets users store lists of identifiers.

I use the HistoryClient for the Entrez searches because
that gives access to the list of all matches.  By
comparison, the DBIdsClient.search takes a count of
the maximum number of records to return.

Under that is the ThinClient.  This does the conversion
to the URL requests expected by NCBI.
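
As a rough illustration of what "conversion to the URL requests" means,
here is a minimal sketch that builds an esearch URL like the one in the
debugging dump above.  It is not the ThinClient code: build_esearch_url
is a hypothetical name, and the parameter names and values are simply
the ones visible in that dump.  The WebEnv value is left out because it
is a session token handed back by the server.  Written for Python 3.

```python
from urllib.parse import urlencode

# Base URL as it appears in the dumped request.
BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retstart=0, retmax=20):
    """Build an esearch GET URL from keyword-style arguments."""
    params = [
        ("db", db),
        ("term", term),
        ("retstart", retstart),
        ("retmax", retmax),
        ("usehistory", "y"),
        ("tool", "Biopython_EUtils_2_0"),
        ("email", "biopython-dev@biopython.org"),
    ]
    # urlencode percent-escapes values, e.g. "@" becomes "%40".
    return BASE + "?" + urlencode(params)


print(build_esearch_url("Goteborg"))
```

Keeping this step in its own layer means the higher-level clients never
have to deal with query-string escaping themselves.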

Hopefully it makes some sense even without documentation!

WARNING: some of the docstrings are from the version 1
library and haven't been updated to this version 2.

					Andrew
					dalke at dalkescientific.com