[Bioperl-l] question regarding entrez

Jonathan Epstein Jonathan_Epstein@nih.gov
Mon, 04 Feb 2002 17:00:43 -0500


I'm not aware of anything in BioPerl which uses Entrez per se.

Within the NCBI toolkit, you'll find a source file called entrcmd.c, which compiles into a network client called Nentrcmd.  I wrote this program when I was at NCBI, and it was originally used as the engine for the first WWW Entrez server.

Unfortunately, this network client/server interface has grown less reliable over the years, especially for large queries.  My understanding is that there is a newer Entrez API, but it's not clear to me whether it's been deployed yet.

E.g., the command
   Nentrcmd -d n -e 'human[ORGN]' -p su >/tmp/human
should dump all the human GIs into the output file, but in practice this command fails to produce any output, probably because of server problems and/or the dataset size.
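
For anyone who wants to call this from a script, here is a minimal Python sketch of the command above. It assumes Nentrcmd is on the PATH; the function names, the timeout value, and the empty-output check are illustrative choices and not part of the toolkit.

```python
import subprocess

def build_dump_command(database="n", expression="human[ORGN]", program="su"):
    """Return the argv list for the Nentrcmd call shown above."""
    # Passing a list (no shell) avoids quoting problems with the brackets.
    return ["Nentrcmd", "-d", database, "-e", expression, "-p", program]

def dump_uids(out_path, timeout_s=600):
    """Run the query, write stdout to out_path, report whether output arrived.

    timeout_s is an arbitrary illustrative value, not a recommended setting.
    """
    argv = build_dump_command()
    try:
        with open(out_path, "wb") as out:
            subprocess.run(argv, stdout=out, timeout=timeout_s, check=True)
    except (OSError, subprocess.SubprocessError) as err:
        # Covers a missing binary, a non-zero exit status, and a timeout.
        print(f"Nentrcmd did not complete: {err}")
        return False
    # An empty output file is the failure mode described above.
    with open(out_path, "rb") as result:
        return bool(result.read(1))
```

Checking for an empty file matters here because the command may produce no output at all for a large query.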

For support with this program, you should contact toolbox@ncbi.nlm.nih.gov.  If you strike out with them I can try to help you, but note that I no longer have any control over the network services.

It strikes me that it would be cool if Catherine L. and her group were to create a GUI for this program using their automatic GUI-creator (although note that there is already a built-in GUI of sorts for certain platforms).

Here's the short online help, followed by the full help:

mgchd1 2% Nentrcmd -

Entrez command-line $Revision: 6.3 $   arguments:

   -d  Initial database [String]  Optional
     default = m
   -e  Boolean expression [String]  Optional
   -u  Comma-delimited list of UIDs [String]  Optional
   -p  Program of commands [String]
   -s  Display status report [T/F]  Optional
     default = F
   -w  Produce WWW/HTML formatted output (recommended value is /htbin) [String]  Optional
   -h  Detailed help [T/F]  Optional
     default = F
   -f  For WWW output, use Forms [T/F]  Optional
     default = F
   -c  'Check' WWW output Forms [T/F]  Optional
     default = F
   -x  Name of export file for named UID list [String]  Optional
   -i  Comma-delimited list of files to import for named UID list [String]  Optional
   -t  Produce a list of terms (term) [String]  Optional
   -l  Taxonomy lookup [String]  Optional
   -n  On-the-fly neighboring [File In]  Optional
   -o  Output file [File Out]
     default = stdout
   -g  Use WWW-style encoding for special input characters [T/F]  Optional
     default = T
   -r  Get sequences from ID Repository [T/F]  Optional
     default = F
   -y  Complexity (1=bioseq only, 2=bioseq set, 3=nuc-prot set) [Integer]  Optional
     default = 3


---------------------

Entrcmd is a non-interactive command-line interface which allows a user to
perform a series of neighboring and output operations, based upon an initial
set of UIDs or a boolean expression which describes a set of UIDs.
Alternatively, it can be used to display an alphabetically sorted list of
terms near an initial term.

Type 'entrcmd' with no arguments for a brief summary of command-line options.

     EXPRESSION SYNTAX (-e option)

The following grammar is based upon Backus-Naur form.  Braces ({}) are used to
specify optional fields, and an ellipsis (...) represents an arbitrary number
of repetitions.  In most Backus-Naur forms, the vertical bar (|) and brackets
([]) are used as meta-symbols.  However, in the following grammar, the
vertical bar and brackets are terminal symbols, and three stacked vertical
bars are used to represent alternation.

expression ::= diff { - diff ... }
diff ::= term { | term ... }
term ::= factor { & factor ... }
                      |
factor ::= qualtoken | ( expression )
                      |
qualtoken ::= token { [ fld { ,S } ] }


token is a string of characters which either contains no special characters,
or which is delimited by double-quotes (").  Double-quote marks and
backslashes (\) which appear within a quoted token must be escaped with an
additional backslash.
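
The quoting rule can be sketched in Python as follows. The help text does not list which characters count as "special", so this sketch assumes that only letters, digits, and underscores are safe to leave unquoted; quote_token is a hypothetical helper name.

```python
import re

# Assumption: anything other than letters, digits and underscores is treated
# as "special" and forces quoting. The help text does not define the set.
_PLAIN_TOKEN = re.compile(r"[A-Za-z0-9_]+")

def quote_token(token):
    """Return token in a form suitable for the -e expression."""
    if _PLAIN_TOKEN.fullmatch(token):
        return token
    # Escape backslashes first, so the backslashes added for the
    # double-quotes are not escaped a second time.
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

For example, quote_token("Kay LE") returns the token wrapped in double-quotes, while quote_token("human") returns it unchanged.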

fld is an appropriate string describing a field.  The possible values are
described in the following table.  For all databases, an asterisk(*) is a
possible value for fld, signifying the union of all possible fields for that
database.  '*' is also the default field, if no field qualifier is specified.

   | fld| Databases and descriptions
   +----+--------------------------------------------------------------------
   |WORD| For MEDLINE, "Abstract or Title"; for Sequences, "Text Terms"
   |MESH| MEDLINE only, "MeSH term"
   |AUTH| For all databases, "Author Name"
   |JOUR| For all databases, "Journal Title"
   |GENE| For all databases, "Gene Name"
   |KYWD| For MEDLINE, "Substance"; for Sequences, "Keyword"
   |ECNO| For MEDLINE and protein, "E.C. number"
   |ORGN| For all databases, "Organism"
   |ACCN| For Sequence databases, "Accession"
   |PROT| For protein, "Protein Name"

The presence of ",S"  after a field specifier implies the same semantics
as "special" in Entrez.  Entrez "total" semantics are the default.
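
To make the precedence in the grammar concrete, here is a small recursive-descent parser sketch in Python. It is an illustration written from the grammar above, not code from entrcmd.c; the function names and the tuple-based tree are assumptions, and bare tokens are assumed to end at whitespace or at any of the characters - | & ( ) [ ] , ".

```python
OPERATORS = "-|&()[],"
FIELDS = {"*", "WORD", "MESH", "AUTH", "JOUR", "GENE",
          "KYWD", "ECNO", "ORGN", "ACCN", "PROT"}

def tokenize(text):
    """Split an expression into ("op", char) and ("tok", string) pairs."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in OPERATORS:
            tokens.append(("op", ch))
            i += 1
        elif ch == '"':
            i, chars = i + 1, []
            while i < len(text) and text[i] != '"':
                if text[i] == "\\" and i + 1 < len(text):
                    i += 1  # a backslash escapes the next character
                chars.append(text[i])
                i += 1
            if i >= len(text):
                raise ValueError("unterminated quoted token")
            i += 1  # skip the closing quote
            tokens.append(("tok", "".join(chars)))
        else:
            start = i
            while (i < len(text) and not text[i].isspace()
                   and text[i] not in OPERATORS and text[i] != '"'):
                i += 1
            tokens.append(("tok", text[start:i]))
    return tokens

def parse_expression(text):
    """Parse text into nested tuples; '&' binds tightest, then '|', then '-'."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != ("op", expected)):
            raise ValueError(f"expected {expected or 'token'} at position {pos}")
        pos += 1
        return tok

    def chain(operator, operand):
        # Left-associative repetition: operand { operator operand ... }
        node = operand()
        while peek() == ("op", operator):
            take(operator)
            node = (operator, node, operand())
        return node

    def expression():
        return chain("-", diff)

    def diff():
        return chain("|", term)

    def term():
        return chain("&", factor)

    def factor():
        if peek() == ("op", "("):
            take("(")
            node = expression()
            take(")")
            return node
        kind, value = take()
        if kind != "tok":
            raise ValueError(f"unexpected {value!r}")
        field, special = "*", False  # '*' is the default field
        if peek() == ("op", "["):
            take("[")
            kind, field = take()
            if kind != "tok" or field not in FIELDS:
                raise ValueError(f"unknown field {field!r}")
            if peek() == ("op", ","):
                take(",")
                if take() != ("tok", "S"):
                    raise ValueError("only ',S' may follow a field")
                special = True
            take("]")
        return ("token", value, field, special)

    tree = expression()
    if peek() is not None:
        raise ValueError(f"unexpected trailing input at position {pos}")
    return tree
```

With this sketch, parse_expression("a & b | c - d") groups as ((a & b) | c) - d, which is what the three grammar rules imply.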


     PROGRAM OF COMMANDS (-p option)

For the "-e" and "-u" options, the program of commands consists of a sequence of
neighboring operations alternated with optional output commands.  All output
commands, except the first, must be preceded by a period (.), and all
neighboring commands must be preceded by a comma (,).

The output commands are:
    no    None (default)             sg    Sequence GenBank/GenPept flat file format
    ma    MEDLINE ASN.1 format       sa    Sequence ASN.1 format
    md    MEDLINE docsums            sd    Sequence docsums
    ml    MEDLARS format             sf    Sequence FASTA format
    mr    MEDLINE report format      sr    Sequence report format
    mu    MEDLINE UIDs               su    Sequence UIDs
                                     si    Sequence IDs
Each output command may be followed by an optional count indicating how
many articles to display.  The default is to display all the articles.

If the "-x" command line option appears ("export to a saved UID list"), then
the first "mu" or "su" command results in those UIDs being written to that
"saved UID list" file, rather than being written to the standard output.

Each neighboring command indicates the database to neighbor "to".  It
consists of the first letter of one of the possible databases
(medline, protein, nucleotide), followed by an optional count of
how many of the current set of articles should be included in the
neighboring operation.

Example:
   Find the articles written by "Kay LE", but not by "Forman-Kay JD".  Find
   their MEDLINE neighbors.  Print document summaries for all of these
   neighbors.  Of these neighbors, neighbor the first 5 entries to the protein
   database.  Print up to 10 of these sequences in Sequence Report format.

     entrcmd -e '"Kay LE" [AUTH] - "Forman-Kay JD" [AUTH]' -p ,m.md,p5.sr10
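
The -p string in this example can be assembled mechanically. The Python sketch below is one reading of the rules: it assumes the period may be dropped only when an output command is the very first thing in the program, which matches both "su" and ",m.md,p5.sr10" as they appear in this message. build_program and its input format are hypothetical.

```python
OUTPUT_COMMANDS = {"no", "ma", "md", "ml", "mr", "mu",
                   "sg", "sa", "sd", "sf", "sr", "su", "si"}
NEIGHBOR_DATABASES = {"m", "p", "n"}  # medline, protein, nucleotide

def build_program(steps):
    """Build a -p string from (command, count) pairs; count may be None."""
    parts = []
    for command, count in steps:
        if command in OUTPUT_COMMANDS:
            # Assumption: no period when the output command starts the program.
            prefix = "." if parts else ""
        elif command in NEIGHBOR_DATABASES:
            prefix = ","
        else:
            raise ValueError(f"unknown command {command!r}")
        if count is not None and count < 1:
            raise ValueError("count must be a positive integer")
        suffix = "" if count is None else str(count)
        parts.append(f"{prefix}{command}{suffix}")
    return "".join(parts)

# The example above: neighbor to MEDLINE, print docsums, neighbor the
# first 5 to protein, print up to 10 in Sequence Report format.
example = build_program([("m", None), ("md", None), ("p", 5), ("sr", 10)])
```

Here example is the string ",m.md,p5.sr10", and build_program([("su", None)]) gives "su".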


If the "-t" option is used, then the program of commands is different from
what is described above.  Rather, it consists of a seven character string,
optionally followed by the number of terms which should be displayed.
The default number of terms is 40.

The string is of the form '123FLDD', where 1, 2, and 3 are as follows,
and FLDD is one of the field specifications described above (AUTH, etc.).

1 - one of 't', 's', or 'o', where 't' means that the total term counts
     should be displayed after the term, 's' means that the special and
     total term counts should be displayed after the term, and 'o' means
     that only the term itself should be displayed
2 - one of 'b', 'c', 'e', or an integer from 3 to 9, where:
     'b' - display terms beginning with the specified term
     'c' - "center" terms; i.e., display half the terms before the specified
           term, and half the terms after the specified term
     'e' - display terms ending with the specified term
     k   - an integer from 3 to 9, indicating that (2/k)ths of the terms
           should be alphabetically before the specified term.  Note that
           '4' is the same as 'c'.  The value '9' is recommended for
           scrolled displays.
3 - One of 'i' or 'n', indicating for the 'b' and 'e' options above whether
     the specified term is to be included in the output, where 'i' means
     inclusive, and 'n' means non-inclusive.  This value is ignored for
     other values of the previous character, but must be present as a
     place-holder.

[ WARNING: SOME OF THESE TERM SPECIFICATION OPTIONS (COMBINATIONS OF 1,
2, AND 3 ABOVE) ARE CURRENTLY UNIMPLEMENTED ]
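
The seven-character string for the "-t" case can be sketched in Python as below. The helper names are hypothetical, and the rounding in terms_before is an assumption, since the help text only gives the fraction 2/k. As the warning above says, not every combination is implemented by the program itself.

```python
FOUR_LETTER_FIELDS = {"WORD", "MESH", "AUTH", "JOUR", "GENE",
                      "KYWD", "ECNO", "ORGN", "ACCN", "PROT"}

def term_list_program(counts, position, inclusion, field, n_terms=None):
    """Build the '123FLDD' string, optionally followed by a term count."""
    position = str(position)
    if counts not in ("t", "s", "o"):
        raise ValueError("character 1 must be 't', 's' or 'o'")
    if position not in ("b", "c", "e", "3", "4", "5", "6", "7", "8", "9"):
        raise ValueError("character 2 must be 'b', 'c', 'e' or 3 to 9")
    if inclusion not in ("i", "n"):
        raise ValueError("character 3 must be 'i' or 'n'")
    if field not in FOUR_LETTER_FIELDS:
        raise ValueError(f"unknown field {field!r}")
    spec = f"{counts}{position}{inclusion}{field}"
    return spec if n_terms is None else f"{spec}{n_terms}"

def terms_before(position, n_terms=40):
    """Terms shown before the specified term for 'c' or an integer k.

    Uses (2/k) of n_terms, rounded down (the rounding is an assumption).
    """
    k = 4 if position == "c" else int(position)
    if not 3 <= k <= 9:
        raise ValueError("k must be between 3 and 9")
    return (2 * n_terms) // k
```

With the default of 40 terms, terms_before("c") and terms_before(4) both give 20, which matches the note that '4' is the same as 'c'.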


     WORLD WIDE WEB STYLE OUTPUT (-w option)

The entrcmd program can also generate output which is appropriate for
display in an HTML document, to be "served" by a WWW server.  In particular,
some output text contains HTML hypertext links to other data, as well as
HTML formatting information.  The parameter to the -w option is the
directory prefix for the linked hypertext items; "/htbin" is recommended.

If the "-w" option is selected, then the "-f" option may also be selected.
This indicates that the HTML output should be of a form which is
appropriate for an HTML "FORM".  This output can only be processed by
advanced WWW clients, but potentially provides a nicer interface, where
each document summary has an associated checkbox, resulting in a display
which is similar to the Entrez CD-ROM application.  The "-c" option, if used
in conjunction with "-f", indicates that these checkboxes should be
"pre-checked", i.e., selected.  This potentially provides the equivalent
of the Entrez "select all" operation for neighboring.
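
The dependency between these three options can be captured in a small Python helper. This is only a sketch: www_arguments is a hypothetical name, passing "T" as the value of a [T/F] option is an assumption about how the program reads its arguments, and rejecting "-c" without "-f" is a design choice of the sketch rather than documented behavior.

```python
def www_arguments(prefix="/htbin", forms=False, check=False):
    """Return the argv fragment for WWW/HTML output."""
    if check and not forms:
        # The text only describes -c in conjunction with -f.
        raise ValueError("check=True only makes sense together with forms=True")
    args = ["-w", prefix]  # "/htbin" is the recommended prefix
    if forms:
        args += ["-f", "T"]
    if check:
        args += ["-c", "T"]
    return args
```

For instance, www_arguments(forms=True, check=True) yields the fragment for a FORM-style page with every checkbox pre-checked.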



Hope this helps,

-Jonathan

At 04:09 PM 2/3/2002, Xiaowu Gai wrote:
>I browsed through the BioPerl documentation and could not find anything about using Entrez with BioPerl. In other words, it appears impossible to do an Entrez search from a program written with BioPerl. Why does BioPerl not support Entrez? Is there a way I can work around it? Is there any known program for Net Entrez that I can call from my program? (I downloaded Network Entrez from NCBI and played with it a little, but it only has a GUI and no command-line version.)



Jonathan Epstein                                Jonathan_Epstein@nih.gov
Head, Unit on Biologic Computation              (301)402-4563
Office of the Scientific Director               Bldg 31, Room 2A47
Nat. Inst. of Child Health & Human Development  31 Center Drive
National Institutes of Health                   Bethesda, MD 20892