From mjldehoon at yahoo.com Sat Jun 7 04:35:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 7 Jun 2008 01:35:05 -0700 (PDT) Subject: [BioPython] Bio.Gobase, anybody? Message-ID: <844450.31822.qm@web62415.mail.re1.yahoo.com> Hi everybody, As part of bug report 2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454, I started looking at the Bio.Gobase module. This module provides access to the gobase database: http://megasun.bch.umontreal.ca/gobase/ This module is about seven years old and (AFAICT) is not actively maintained. We don't have documentation for this module, but the unit tests suggest that it parses HTML files from gobase. I am not sure exactly where the HTML files came from, but I doubt that after seven years this still works. So I was wondering: Does anybody use Bio.Gobase? If not, I suggest we deprecate it for the next release, and remove it in some future release. If there are users, we need to make some (small) changes to this module (that is what the original bug report was about). --Michiel. From mmokrejs at ribosome.natur.cuni.cz Sat Jun 7 05:27:26 2008 From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ) Date: Sat, 07 Jun 2008 11:27:26 +0200 Subject: [BioPython] Bio.Gobase, anybody? In-Reply-To: <844450.31822.qm@web62415.mail.re1.yahoo.com> References: <844450.31822.qm@web62415.mail.re1.yahoo.com> Message-ID: <484A547E.1030909@ribosome.natur.cuni.cz> Hi, I don't use it, but it seems an interesting resource. ;-) See http://gobase.bcm.umontreal.ca/samples.html . Martin > This module is about seven years old and (AFAICT) > is not actively maintained. We don't have documentation > for this module, but the unit tests suggest that it > parses HTML files from gobase. I am not sure exactly > where the HTML files came from, but I doubt that > after seven years this still works. From cg5x6 at yahoo.com Mon Jun 9 01:21:50 2008 From: cg5x6 at yahoo.com (C. G.)
Date: Sun, 8 Jun 2008 22:21:50 -0700 (PDT) Subject: [BioPython] splice variants in GenBank/Entrez Message-ID: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Hi all, I've been using BioPython for a few projects the last two months to process BLAST results but now I need to take those results and determine which of them have known splice variants. By "known" I mean those that have annotations contained in a database that indicate they have (or are) splice variants. My thought was that Entrez would have this information (which I would then retrieve and parse with BioPython) but I can't find a consistent means of determining if an entry has splice variants. I was hoping that maybe someone on this list had some experience trying to find this information. Perhaps there is a sequence feature or a common user-defined field I could access? I'm also sending an email to NCBI requesting information but I thought I would cover my bases. Thanks in advance for any information or help you can provide. -steve From krewink at inb.uni-luebeck.de Mon Jun 9 02:58:52 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Mon, 9 Jun 2008 08:58:52 +0200 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <664146.43151.qm@web65604.mail.ac4.yahoo.com> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Message-ID: <20080609065852.GB13032@inb.uni-luebeck.de> Hi Steve, On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > > I've been using BioPython for a few projects the last > two months to process BLAST results but now I need to > take those results and determine which of them have > known splice variants. By "known" I mean those that > have annotations contained in a database that indicate > they have (or are) splice variants. Depending on which organism you are looking at, you might want to use the Ensembl genome database. 
There is no biopython interface, but you can use the jython interface from their website (at least they once had one, I didn't check if that's still the case). Otherwise you might have to use perl or java packages for that. Another good resource for this is the Alternative Splicing Database: http://www.ebi.ac.uk/asd/ Hope that helps, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/ From bsouthey at gmail.com Mon Jun 9 09:25:44 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 09 Jun 2008 08:25:44 -0500 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <20080609065852.GB13032@inb.uni-luebeck.de> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> <20080609065852.GB13032@inb.uni-luebeck.de> Message-ID: <484D2F58.6020502@gmail.com> Albert Krewinkel wrote: > Hi Steve, > > On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > >> I've been using BioPython for a few projects the last >> two months to process BLAST results but now I need to >> take those results and determine which of them have >> known splice variants. By "known" I mean those that >> have annotations contained in a database that indicate >> they have (or are) splice variants. >> > > Depending on which organism you are looking at, you might want to use > the Ensembl genome database. There is no biopython interface, but you > can use the jython interface from their website (at least they once > had one, I didn't check if that's still the case). Otherwise you > might have to use perl or java packages for that. > > Another good resource for this is the Alternative Splicing Database: > http://www.ebi.ac.uk/asd/ > > Hope that helps, > > Albert > > > The 'ALTERNATIVE PRODUCTS' section of CC lines in a UniProt (SwissProt) record can contain alternative splicing information. See for example, the manual section: **3.12.5. 
Syntax of the topic 'ALTERNATIVE PRODUCTS'** http://ca.expasy.org/sprot/userman.html#CCAP (Given below for completeness). Bruce Example of the CC lines and the corresponding FT lines for an entry with alternative splicing: CC -!- ALTERNATIVE PRODUCTS: CC Event=Alternative splicing, Alternative initiation; Named isoforms=8; CC Comment=Additional isoforms seem to exist; CC Name=1; Synonyms=Non-muscle isozyme; CC IsoId=Q15746-1; Sequence=Displayed; CC Name=2; CC IsoId=Q15746-2; Sequence=VSP_004791; CC Name=3A; CC IsoId=Q15746-3; Sequence=VSP_004792, VSP_004794; CC Name=3B; CC IsoId=Q15746-4; Sequence=VSP_004791, VSP_004792, VSP_004794; CC Name=4; CC IsoId=Q15746-5; Sequence=VSP_004792, VSP_004793; CC Name=Del-1790; CC IsoId=Q15746-6; Sequence=VSP_004795; CC Name=5; Synonyms=Smooth-muscle isozyme; CC IsoId=Q15746-7; Sequence=VSP_018845; CC Note=Produced by alternative initiation at Met-923 of isoform 1; CC Name=6; Synonyms=Telokin; CC IsoId=Q15746-8; Sequence=VSP_018846; CC Note=Produced by alternative initiation at Met-1761 of isoform CC 1. Has no catalytic activity; ... FT VAR_SEQ 1 1760 Missing (in isoform 6). FT /FTId=VSP_018846. FT VAR_SEQ 1 922 Missing (in isoform 5). FT /FTId=VSP_018845. FT VAR_SEQ 437 506 VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA FT RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in FT isoform 2 and isoform 3B). FT /FTId=VSP_004791. FT VAR_SEQ 1433 1439 DEVEVSD -> MKWRCQT (in isoform 3A, FT isoform 3B and isoform 4). FT /FTId=VSP_004792. FT VAR_SEQ 1473 1545 Missing (in isoform 4). FT /FTId=VSP_004793. FT VAR_SEQ 1655 1705 Missing (in isoform 3A and isoform 3B). FT /FTId=VSP_004794. FT VAR_SEQ 1790 1790 Missing (in isoform Del-1790). FT /FTId=VSP_004795. 
CC -!- ALTERNATIVE PRODUCTS: CC Event=Alternative splicing, Alternative initiation; Named isoforms=3; CC Comment=Isoform 1 and isoform 2 arise due to the use of two CC alternative first exons joined to a common exon 2 at the same CC acceptor site but in different reading frames, resulting in two CC completely different isoforms; CC Name=1; Synonyms=p16INK4a; CC IsoId=O77617-1; Sequence=Displayed; CC Name=3; CC IsoId=O77617-2; Sequence=VSP_018701; CC Note=Produced by alternative initiation at Met-35 of isoform 1. CC No experimental confirmation available; CC Name=2; Synonyms=p19ARF; CC IsoId=O77618-1; Sequence=External; .. FT VAR_SEQ 1 34 Missing (in isoform 3). FT /FTId=VSP_004099. From lueck at ipk-gatersleben.de Tue Jun 10 04:38:14 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Tue, 10 Jun 2008 10:38:14 +0200 Subject: [BioPython] formatdb over python code Message-ID: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Hi! Does someone know whether it's possible to make a database with formatdb (NCBI) via Python code (on Windows) and not over the console? Regards Stefanie From biopython at maubp.freeserve.co.uk Tue Jun 10 05:41:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 10:41:27 +0100 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806100241i68b24632s121324ce1c942dd9@mail.gmail.com> On Tue, Jun 10, 2008 at 9:38 AM, Stefanie Lück wrote: > Hi! > > Does someone know whether it's possible to make a database with formatdb > (NCBI) via Python code (on Windows) and not over the console? Hello Stefanie, I don't think Biopython has a wrapper for the NCBI formatdb tool, but you could construct the command line string yourself and call it with one of the standard python os functions, e.g. os.popen().
Peter From winter at biotec.tu-dresden.de Tue Jun 10 06:13:06 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 10 Jun 2008 12:13:06 +0200 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <484E53B2.5060102@biotec.tu-dresden.de> Stefanie Lück wrote, On 06/10/08 10:38: > Hi! > > Does someone know whether it's possible to make a database with formatdb > (NCBI) via Python code (on Windows) and not over the console? Here is the Python code I use for that: cmd = "formatdb -i %s -p T -o F" % database os.system(cmd) -p T specifies protein sequences, -o T creates indexes, but fails if the fasta file does not follow the defline format (see http://en.wikipedia.org/wiki/Fasta_format#Sequence_identifiers). If it fails, use -o F. Christof From mjldehoon at yahoo.com Fri Jun 13 22:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [BioPython] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for plain-text output from Rebase)? If not, I think this module should be deprecated. --Michiel.
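[Editor's note: Christof's os.system() call above can also be written with the subprocess module, which avoids shell-quoting problems with unusual file names. This is a minimal sketch, not part of Biopython; it assumes the formatdb executable is on the PATH, and the helper names are illustrative only.]

```python
import subprocess

def formatdb_command(fasta_path, protein=True, create_indexes=False):
    # Build the argument list: -p T/F selects protein vs. nucleotide,
    # -o T/F controls index creation (-o T needs NCBI-style deflines).
    return ["formatdb", "-i", fasta_path,
            "-p", "T" if protein else "F",
            "-o", "T" if create_indexes else "F"]

def run_formatdb(fasta_path, protein=True, create_indexes=False):
    # check_call raises CalledProcessError if formatdb exits non-zero,
    # so a bad defline (with -o T) is reported instead of silently ignored.
    subprocess.check_call(formatdb_command(fasta_path, protein, create_indexes))
```

Calling run_formatdb("proteins.fasta") should then build the same database files as running formatdb from the console, on Windows as well as Unix.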
From biopython at maubp.freeserve.co.uk Mon Jun 16 10:01:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Jun 2008 15:01:31 +0100 Subject: [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO Message-ID: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> I've recently had to deal with some contig files in the Ace format (output by CAP3, but many assembly programs will produce this output). We have a module for parsing Ace files in Biopython, Bio.Sequencing.Ace, but I was wondering about integrating this into the Bio.SeqIO or Bio.AlignIO framework. http://www.biopython.org/wiki/SeqIO http://www.biopython.org/wiki/AlignIO I'd like to hear from anyone currently using Ace files, on how they tend to treat the data - and if they think a SeqRecord or Alignment based representation would be useful. Each contig in an Ace file could be treated as a SeqRecord using the consensus sequence. The identifiers of each sub-sequence used to build the consensus could be stored as database cross-references, or perhaps we could store these as SeqFeatures describing which part of the consensus they support. This would then fit into Bio.SeqIO quite well. Alternatively, each contig could be treated as an alignment (with a consensus) and integrated into Bio.AlignIO. One drawback is that doing this with the current generic alignment class would require padding the start and/or end of each sequence with gaps in order to make every sequence the same length. However, if we did this (or created a more specialised alignment class), the Ace file format would then fit into Bio.AlignIO too. So, Ace users - would either (or both) of the above approaches make sense for how you use the Ace contig files?
Thanks Peter From laserson at mit.edu Tue Jun 17 14:44:08 2008 From: laserson at mit.edu (Uri Laserson) Date: Tue, 17 Jun 2008 14:44:08 -0400 Subject: [BioPython] Dependency help: libssl.so.0.9.7 Message-ID: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Hi, I am trying to use some biopython packages, and it turns out there is an error when I try to import _hashlib: >>> import _hashlib Traceback (most recent call last): File "", line 1, in ImportError: libssl.so.0.9.7: cannot open shared object file: No such file or directory I am working on unix system that is administered by a university, but I have installed my own local version of python along with biopython and all necessary packages for that. There exists a libssl.so.0.9.8 and libssl.so (a symbolic link to the former) in /usr/lib ldd _hashlib.so in my own /python/lib/python2.5/lib-dynload gives me: linux-gate.so.1 => (0xffffe000) libssl.so.0.9.7 => not found libcrypto.so.0.9.7 => not found libpthread.so.0 => /lib32/libpthread.so.0 (0xf7f67000) libc.so.6 => /lib32/libc.so.6 (0xf7e3c000) /lib/ld-linux.so.2 (0x56555000) What is the easiest way to solve this? How do I get my local (home directory) installation of python to find the libssl.so library in /usr/lib? Thanks! 
Uri -- Uri Laserson PhD Candidate, Biomedical Engineering Harvard Medical School (Genetics) Massachusetts Institute of Technology (Mathematics) phone +1 917 742 8019 laserson at mit.edu From biopython at maubp.freeserve.co.uk Wed Jun 18 05:11:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 10:11:42 +0100 Subject: [BioPython] Dependency help: libssl.so.0.9.7 In-Reply-To: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> References: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Message-ID: <320fb6e00806180211o5d505ct4099cdd4fc9e11dc@mail.gmail.com> On Tue, Jun 17, 2008 at 7:44 PM, Uri Laserson wrote: > Hi, > > I am trying to use some biopython packages, and it turns out there is an > error when I try to import _hashlib: > >>>> import _hashlib > Traceback (most recent call last): > ... Hi Uri, I'm guessing you are trying to use Bio.SeqUtils.Checksum, but did you mean "import hashlib"? See http://code.krypto.org/python/hashlib/ Peter From biopython at maubp.freeserve.co.uk Wed Jun 18 07:32:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 12:32:10 +0100 Subject: [BioPython] blastx works fine? In-Reply-To: <1131745582.4368.22.camel@osiris.biology.duke.edu> References: <1131745582.4368.22.camel@osiris.biology.duke.edu> Message-ID: <320fb6e00806180432x60ceea96o3e45f05590003e8e@mail.gmail.com> In Nov 2005, Frank Kauff wrote: > Hi all, > > qblast currently says it works only for blastp and blastn. Actually it > seems to work fine with blastx as well - xml output parses well with > NCBIXML. Or am I missing something? > > Frank Yes, using BLASTX with the Biopython XML parser does seem to work. In fact, the NCBI documentation now explicitly lists blastn, blastp, blastx, tblastn and tblastx, so I updated Biopython's qblast function to allow them too. http://www.ncbi.nlm.nih.gov/BLAST/Doc/node43.html Fixed in Bio/Blast/NCBIWWW.py revision 1.50 - better late than never?
Peter From mjldehoon at yahoo.com Thu Jun 19 09:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [BioPython] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 09:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. 
Peter From mjldehoon at yahoo.com Thu Jun 19 09:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From bsouthey at gmail.com Thu Jun 19 10:44:00 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 19 Jun 2008 09:44:00 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? 
In-Reply-To: <352888.20937.qm@web62409.mail.re1.yahoo.com> References: <352888.20937.qm@web62409.mail.re1.yahoo.com> Message-ID: <485A70B0.1010202@gmail.com> Michiel de Hoon wrote: >> I wonder if the NCBI make any of this available as XML via Entrez? I >> had a quick look and couldn't find anything. >> > > Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. > > --Michiel. > > > Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > >> records. The parser parses HTML pages from CDD's web site. Since the parser >> was written about six years ago, the CDD web site has changed considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. >> > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Do you know how the test files were created? If there is not an easy answer then it makes the decision easier. Anyhow, I vote to remove this module as, in addition to the things previously mentioned, it would be far better to support interproscan (http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool.
Bruce From cjfields at uiuc.edu Thu Jun 19 10:45:05 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 19 Jun 2008 09:45:05 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: They don't, though you can get esummary XML information (which includes description), and I believe you can use elink to grab other information (including proteins with the specified domain). chris On Jun 19, 2008, at 8:38 AM, Peter wrote: >> Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain >> Database) >> records. The parser parses HTML pages from CDD's web site. Since >> the parser >> was written about six years ago, the CDD web site has changed >> considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Thu Jun 19 12:13:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 17:13:16 +0100 Subject: [BioPython] Adding NCBI XML sequence formats to Bio.SeqIO Message-ID: <320fb6e00806190913h2f3f81bgd9d16fb0f2a740f9@mail.gmail.com> Dear all, I've realised that as a bonus from Michiel's work on Bio.Entrez, Biopython should be able to parse several of the XML sequence file formats used by the NCBI - and ideally we should be able to do this via Bio.SeqIO and get SeqRecord objects. I am thinking about adding a new module to Bio.SeqIO which will map the python list/dictionary structures from Bio.Entrez into SeqRecord object(s). What I wanted to ask the list about is which XML sequence files are of interest - and are there any strong views on which format names I should use? I've looked at BioPerl's list, since I try to re-use the same format names, but could only spot one NCBI XML file listed here: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats NCBI TinySeq XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd BioPerl call this "tinyseq", which seems like a good choice of name. http://www.bioperl.org/wiki/Tinyseq_sequence_format Also potentially of interest are: NCBI INSDSeq XML format http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd NCBI Seq-entry XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the ASN.1 variant of this file format). http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd (I haven't actually sat down and looked at the details of the implementation yet, so no promises on the timing!) Peter From sbassi at gmail.com Sun Jun 22 18:49:48 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 19:49:48 -0300 Subject: [BioPython] Secondary structure alphabet?
Message-ID: Here is the secondary structure alphabet: class SecondaryStructure(SingleLetterAlphabet) | Method resolution order: | SecondaryStructure | SingleLetterAlphabet | Alphabet | | Data and other attributes defined here: | | letters = 'HSTC' I can't find what that HSTC stands for. The closest match I found was the DSSP code: The DSSP code The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is: * H = alpha helix * B = residue in isolated beta-bridge * E = extended strand, participates in beta ladder * G = 3-helix (3/10 helix) * I = 5 helix (pi helix) * T = hydrogen bonded turn * S = bend (http://swift.cmbi.ru.nl/gv/dssp/) Does anybody know the meaning of HSTC? I am CC'ing this mail to Andrew Dalke; it seems he was the one who submitted it to Biopython. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From idoerg at gmail.com Sun Jun 22 19:03:52 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 22 Jun 2008 16:03:52 -0700 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: Probably Helix Turn Strand Coil On Sun, Jun 22, 2008 at 3:49 PM, Sebastian Bassi wrote: > Here is the secondary structure alphabet: > > class SecondaryStructure(SingleLetterAlphabet) > | Method resolution order: > | SecondaryStructure > | SingleLetterAlphabet > | Alphabet > | > | Data and other attributes defined here: > | > | letters = 'HSTC' > > I can't find what that HSTC stands for. The closest match I found was > the DSSP code: > > The DSSP code > > The output of DSSP is explained extensively under 'explanation'.
The > very short summary of the output is: > > * H = alpha helix > * B = residue in isolated beta-bridge > * E = extended strand, participates in beta ladder > * G = 3-helix (3/10 helix) > * I = 5 helix (pi helix) > * T = hydrogen bonded turn > * S = bend > > (http://swift.cmbi.ru.nl/gv/dssp/) > > Does anybody knows the meaning of HSTC? I am CC this mail to Andrew > Dalke it seems he was the one who submit it the Biopython. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From sbassi at gmail.com Sun Jun 22 19:05:13 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 20:05:13 -0300 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: On Sun, Jun 22, 2008 at 8:03 PM, Iddo Friedberg wrote: > Probably Helix Turn Strand Coil Sounds plausible. Thank you. Best, SB. 
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From jdieten at gmail.com Tue Jun 24 06:58:23 2008 From: jdieten at gmail.com (Joost van Dieten) Date: Tue, 24 Jun 2008 12:58:23 +0200 Subject: [BioPython] Blastp XML malfunction Message-ID: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> MY CODE: result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]') blast_results = result_handle.read() print result_handle result_handler = cStringIO.StringIO(blast_results) print result_handler blast_records = NCBIXML.parse(result_handler) blast_record = blast_records.next() This code doesn't seem to work anymore. I got an error that my blast_record is empty, but it worked fine 3 weeks ago. Has something changed in the NCBIXML code??? Any ideas?? Greetz, Joost Dieten From biopython at maubp.freeserve.co.uk Tue Jun 24 07:11:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Jun 2008 12:11:12 +0100 Subject: [BioPython] Blastp XML malfunction In-Reply-To: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> References: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> Message-ID: <320fb6e00806240411j1c01903cm1f40d53eb9c5ad77@mail.gmail.com> On Tue, Jun 24, 2008 at 11:58 AM, Joost van Dieten wrote: > MY CODE: > result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, > entrez_query='man[ORGN]') > blast_results = result_handle.read() > print result_handle > result_handler = cStringIO.StringIO(blast_results) > print result_handler > blast_records = NCBIXML.parse(result_handler) > blast_record = blast_records.next() You probably know this, but for anyone trying to cut-and-paste the code, it's much simpler to do this: result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]') blast_records = NCBIXML.parse(result_handle) blast_record =
blast_records.next() Joost's code is a handy way to print out the raw data before parsing it, to try and identify any problems by eye. > This code doesn't seem to work anymore. I got an error that my blast_record > is empty, but it worked fine 3 weeks ago. Has something changed in the NCBIXML > code??? Any ideas?? Yes, it's probably a recent NCBI change, which we've fixed with Bug 2499: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 If you want to just update the Blast parser, I think you need to update both NCBIXML.py and Record.py, but a complete install from CVS might be simpler. Peter From mjldehoon at yahoo.com Wed Jun 25 10:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [BioPython] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From dag at sonsorol.org Wed Jun 25 11:08:33 2008 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 25 Jun 2008 11:08:33 -0400 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython References: Message-ID: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is of concern to us.
> Mainly the BioPython suite does not appear to be written to the
> recommendations made on the main NCBI E-utilities web page
> (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).
> Principally, the following are not being done by BioPython tools:
>
> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
> not the standard NCBI Web address.
>
> * Make no more than one request every 3 seconds.
>
> In fact I recently cc'd you on an event when a user was coming in at
> over 18 requests per second. We really wish that you would alter your
> scripts to run with some sort of sleep in them in order to not send
> requests more than once per 3 seconds, and to not send these to the
> main www web servers but to use http://eutils.ncbi.nlm.nih.gov.
>
> Also, there is the problem of huge searches in order to build local
> databases. With your package it seems that if one were so inclined you
> would send a search for all human sequences (over 10,000,000 sequences)
> and your program would then retrieve these one ID at a time. Regardless
> of the fact that this is an extreme example, we would much prefer if
> your program could use the webenv from the Esearch and use the search
> history and webenv to retrieve sets of sequences 200 at a time.
>
> History: Requests utility to maintain results in user's environment.
> Used in conjunction with WebEnv.
>
> usehistory=y
>
> Web Environment: Value previously returned in XML results from ESearch
> or EPost. This value may change with each utility call. If WebEnv is
> used, History search numbers can be included in an ESummary URL, e.g.,
> term=cancer+AND+%23X (where %23 replaces # and X is the History search
> number).
>
> Note: WebEnv is similar to the cookie that is set on a user's computer
> when accessing PubMed on the web.
> If the parameter usehistory=y is included in an ESearch URL both a
> WebEnv (cookie string) and query_key (history number) values will be
> returned in the results. Rather than using the retrieved PMIDs in an
> ESummary or EFetch URL you may simply use the WebEnv and query_key
> values to retrieve the records. WebEnv will change for each ESearch
> query, but a sample URL would be as follows:
>
> http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed
> &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh
> GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D
> &query_key=6&retmode=html&rettype=medline&retmax=15
>
> WebEnv=WgHmIcDG]B etc.
>
> Display Numbers:
>
> retstart=x (x = sequential number of the first record retrieved -
> default=0, which will retrieve the first record)
> retmax=y (y = number of items retrieved)
>
> Otherwise we will end up blocking more of your users, which we are
> unfortunately already doing in some cases.
>
> Sincerely,
> Scott D. McGinnis, M.S.
> DHHS/NIH/NLM/NCBI
> www.ncbi.nlm.nih.gov

From cjfields at uiuc.edu Wed Jun 25 11:34:34 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 25 Jun 2008 10:34:34 -0500
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID:

Just as a note from the BioPerl side, BioPerl modules which access eutils use the 3 min sleep rule, and we specify in the documentation the NCBI rules. The modules also identify the tool/agent used as 'bioperl', I believe.

chris

On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>
> Can someone from the biopython dev team respond officially to Scott
> please?
>
> Regards,
> Chris
>
> Begin forwarded message:
>
>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>> Date: June 25, 2008 10:54:28 AM EDT
>> Subject: NCBI Abuse Activity with BioPython
>>
>> [...]
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From rjalves at igc.gulbenkian.pt Wed Jun 25 12:16:49 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 25 Jun 2008 17:16:49 +0100
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID: <48626F71.4020804@igc.gulbenkian.pt>

you mean 3 seconds no?

Quoting Chris Fields on 06/25/2008 04:34 PM:
> Just as a note from the BioPerl side, BioPerl modules which access
> eutils use the 3 min sleep rule, and we specify in the documentation
> the NCBI rules. The modules also identify the tool/agent used as
> 'bioperl', I believe.
>
> chris
>
> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>>
>> Can someone from the biopython dev team respond officially to Scott
>> please?
>>
>> Begin forwarded message:
>>
>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>>> Subject: NCBI Abuse Activity with BioPython
>>>
>>> [...]
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Wed Jun 25 15:00:34 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 25 Jun 2008 14:00:34 -0500
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <48626F71.4020804@igc.gulbenkian.pt>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <48626F71.4020804@igc.gulbenkian.pt>
Message-ID: <16811EA1-130D-4F47-B0B5-654E840705B9@uiuc.edu>

Yes, my bad (was in a hurry). I have heard of instances where specific users/IPs were blocked temporarily by NCBI based on spamming, so it's best to be proactive.

chris

On Jun 25, 2008, at 11:16 AM, Renato Alves wrote:
> you mean 3 seconds no?
>
> Quoting Chris Fields on 06/25/2008 04:34 PM:
>> Just as a note from the BioPerl side, BioPerl modules which access
>> eutils use the 3 min sleep rule, and we specify in the documentation
>> the NCBI rules. The modules also identify the tool/agent used as
>> 'bioperl', I believe.
>>
>> chris
>>
>> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>>> Can someone from the biopython dev team respond officially to
>>> Scott please?
>>>
>>> Begin forwarded message:
>>>
>>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>>>> Subject: NCBI Abuse Activity with BioPython
>>>>
>>>> [...]
>>>>
>>>> Display Numbers:
>>>>
>>>> retstart=x (x = sequential number of the first record retrieved -
>>>> default=0, which will retrieve the first record)
>>>> retmax=y (y = number of items retrieved)
>>>>
>>>> Otherwise we will end up blocking more of your users, which we are
>>>> unfortunately already doing in some cases.
>>>>
>>>> Sincerely,
>>>> Scott D. McGinnis, M.S.
>>>> DHHS/NIH/NLM/NCBI
>>>> www.ncbi.nlm.nih.gov
>>>
>>> _______________________________________________
>>> BioPython mailing list - BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Marie-Claude Hofmann
>> College of Veterinary Medicine
>> University of Illinois Urbana-Champaign

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From dalke at dalkescientific.com Wed Jun 25 21:15:50 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 26 Jun 2008 03:15:50 +0200
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID:

Hi Chris,

I'm no longer part of the Biopython dev team, but I read at least the subject line on the mailing list.

I wrote the Biopython EUtils package around December 2002 and, according to the CVS logs, it was added to Biopython in June 2003, so more than 5 years ago. Looking at the commit logs there haven't been any changes to the relevant code since 2004, and that was a minor patch.

I thought I put a rate limiter into the code, but looking at it now I see I didn't.
The documentation clearly states that users must follow NCBI's recommendations, but who actually reads documentation?

>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
>> not the standard NCBI Web address.

That change was announced on May 21, 2003, and most likely no one on the Biopython dev group tracks the EUtils mailing list. It was also after I wrote the code, but to be fair I was subscribed to the utilities list at the time and should have caught the change. I think the correct fix is to this code in ThinClient.py:

    def __init__(self, opener = None, tool = TOOL, email = EMAIL,
                 baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):

Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I have not tested this.

>> * Make no more than one request every 3 seconds.

There are a couple of points here. The quickest and most direct way to force/fix the code is to change the "def _get()" in ThinClient.py. The current code is:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

Here's one possible fix: add the following two lines at module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        global _prev_time
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        # Follow NCBI's 3 second restriction; sleep for the remainder
        # of the 3 second window since the previous request.
        if time.time() - _prev_time < 3:
            time.sleep(3 - (time.time() - _prev_time))
        _prev_time = time.time()
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

(I recall that I had something like that, and it made my unit tests - which I did during the off hours - interminable.)

When I wrote this module I think I assumed that whoever would use the library would use the code correctly. Using it correctly means a few things:

 - obey the restrictions set by NCBI
 - change the 'tool' and 'email' settings, so NCBI complains to the
   right person. (The default is to say 'EUtils_Python_client' and
   'biopython-dev at biopython.org')

This isn't happening. The patch above force-fixes the first. Should Biopython do a better job of the second? It's not easy to figure out the correct email. I couldn't then and can't now think of a better solution. Perhaps use the result of getpass.getuser()? But that doesn't get the rest of the domain for a proper email. Though NCBI should be able to guess the site from the IP address.

The reason I made this assumption is that I meant EUtils to be used by conscientious developers. I've since learned that that's seldom the case, and because it was imported into Biopython it's been exposed to a wider audience.

>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the Esearch and use the search
>> history and webenv to retrieve sets of sequences 200 at a time.
It does exactly that. There's an entire interface for handling search history - and it took some non-trivial work and questions to NCBI to get things working right. Rather, there are two layers. One is for the low-level protocol ("ThinClient") that EUtils offers, and another wraps around the history mechanism ("HistoryClient").

    >>> from Bio import EUtils
    >>> from Bio.EUtils import HistoryClient
    >>> client = HistoryClient.HistoryClient()
    >>> result = client.search("polio AND picornavirus")
    >>> len(result)
    3437
    >>> f = result.efetch()
    >>> print f.read(1000)
    [first 1000 bytes of PubMed XML; the fragment shows PMID 18540199,
    a record created 2008-06-10, ISSN 0041-3771, volume 50, issue 2,
    year 2008, the journal Tsitologiia, and the start of an article
    title about the family Picornaviridae]

and there's a way to populate the history with a list of records, then fetch those records in a block:

    >>> result = client.from_dbids(EUtils.DBIds("pubmed",
    ...     ["100","200","300","400","500"]))
    >>> f = result.efetch("text", "brief")
    >>> print f.read()
    1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
    2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
    3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
    4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
    5: Nourse ES. The regional workshops on pri...[PMID: 500]

If I had to guess, likely more people find the ThinClient code easier to understand, because the NCBI interface has a simple way to get the result for a single record, without using the history interface. The NCBI interface doesn't guide people to the right way to use it effectively.

I started working on an update to EUtils which improved the API to include a few helper functions, like "EUtils.search()" instead of having to create a HistoryClient. That might help guide people to using it better. I wrote up something about it a few years ago: http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html

But a problem in completing that is that I never got any sort of funding or user feedback on how people were using the software, and as I moved over to chemistry it became lower and lower on my list. That's still the problem with me working on this again.

I don't know about this next point, but there might also be a lack of documentation on how to use the Biopython interface effectively? The NCBI documentation isn't meant for non-programmers (it's more of a bytes-on-the-wire document), so perhaps people are pattern matching on what looks right and going with what works, vs. what works well. Then because there was no 3 second limit, they had no incentive to find a better/faster solution.
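[Editor's note: the batched, history-based retrieval pattern described in this thread - run one usehistory=y ESearch, then page through the stored results with retstart/retmax - can be sketched with the Python standard library alone. The helper names below (batch_windows, efetch_url) and the default tool/email values are illustrative, not part of Biopython or EUtils; only the URL parameters come from the E-utilities documentation quoted above.]

```python
from urllib.parse import urlencode

# Base URL per NCBI's request: use eutils.ncbi.nlm.nih.gov,
# not the main www web servers.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def batch_windows(total, batch_size=200):
    """Yield (retstart, retmax) pairs covering `total` records in batches."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

def efetch_url(db, webenv, query_key, retstart, retmax,
               tool="example-tool", email="user@example.org"):
    """Build an EFetch URL that reuses a stored WebEnv/query_key pair.

    The tool and email parameters identify the client to NCBI,
    as the guidelines ask; the defaults here are placeholders.
    """
    params = {
        "db": db,
        "WebEnv": webenv,        # from a usehistory=y ESearch
        "query_key": query_key,  # history number from the same search
        "retstart": retstart,
        "retmax": retmax,
        "tool": tool,
        "email": email,
    }
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)
```

A client would loop over batch_windows(count) for the record count returned by ESearch, sleeping at least three seconds between successive requests, rather than fetching millions of IDs one at a time.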
Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 07:21:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 12:21:57 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> On Thu, Jun 26, 2008 at 2:15 AM, Andrew Dalke wrote: > Hi Chris, > > I'm no longer part of the Biopython dev team, but I read at least the > subject line on the mailing list. > > I wrote the Biopython EUtils package around December 2002 and according to > the CVS logs it was added to Biopython in June 2003, so more then 5 years > ago. Looking at the commit logs there haven't been any change to the > relevant code since 2004, and that was a minor patch. > > I thought I put a rate limiter into the code, but looking at it now I see I > didn't. The documentation clearly states that users must follow NCBI's > recommendations, but who actually reads documentation? > > There's a couple of points here. The quickest and most direct way to > force/fix the code is to change the "def _get()" in ThinClient.py . ... I've updated Bio/EUtils/ThinClient.py in CVS based on your suggested change, and checked the unit tests test_EUtils.py and test_SeqIO_online.py (which calls Bio.EUtils via Bio.GenBank). Looking over the code, should this wait also be done for the ThinClient's epost() method as well? > When I wrote this module I think I assumed that whoever would use the > library would use the code correctly. Using it correctly means a few > things: > - obey the restrictions set by NCBI > - change the 'tool' and 'email' settings, so NCBI complains the right > person. > (The default is to say 'EUtils_Python_client' and > 'biopython-dev at biopython.org') > > This isn't happening. The patch above force-fixes the first. Should > Biopython do a better job of the second? 
It's not easy to figure out the > correct email. I couldn't then and can't now think of a better solution. > Perhaps use the result of getpass.getuser()? But that doesn't get the rest > of the domain for a proper email. Though NCBI should be able to guess the > site from the IP address. Figuring out the user's email address is tricky, especially cross platform. Perhaps we should update the Bio.EUtils and Bio.Entrez documentation to recommend the user set their email address here, and if they are wrapping Biopython in part of a larger tool (e.g. a webservice) to set the tool name too. > If I had to guess, likely more people find the ThinClient code easier to > understand, because the NCBI interface has a simple way to get the result > for a single record, without using the history interface. The NCBI > interface doesn't guide people to the right way to use it effectively. I would agree with you. I would go further, and say for a new user even the ThinClient is a bit scary, and that the wrapper functions in Bio.GenBank are nicer to use. > I started working on an update to EUtils which improved the API to include a > few helper functions, like "EUtils.search()" instead of having to create a > HistoryClient. That might help guide people to using it better. I wrote up > something about it a few years ago: > http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html > > But a problem in completing that is that I never got any sort of funding or > user feedback on how people were using the software, and as I moved over to > chemistry it became lower and lower on my list. That's still the problem > with me working on this again. This complexity is also daunting for anyone else considering taking over the Bio.EUtils code base. > I don't know about this next point, but there might also be a lack of > documentation on how to use the Biopython interface effectively? 
The NCBI > documentation isn't mean for non-programmers (it's more of a > bytes-on-the-wire document) so perhaps people are pattern matching on what > looks right and going with what works, vs. what works well. Then because > there was no 3 second limit, they had no incentive to find a better/faster > solution. That would explain how the unnamed user ended up making over 18 requests per second! I confess I had assumed that things like the Bio.GenBank wrappers would be respecting the 3 second rule (at least they should do now). Peter From mjldehoon at yahoo.com Thu Jun 26 07:48:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:48:09 -0700 (PDT) Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <53670.7764.qm@web62412.mail.re1.yahoo.com> Dear Chris, Sorry for the trouble. We are now discussing on the Biopython mailing list how to fix this issue. I will write a reply to Scott shortly. Best, --Michiel. --- On Wed, 6/25/08, Chris Dagdigian wrote: From: Chris Dagdigian Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython To: biopython at lists.open-bio.org Date: Wednesday, June 25, 2008, 11:08 AM Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is us > concern. > Mainly the BioPython suite does not appear to be written to the > recommendations made on the main NCBI E-utilities web page > (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).Pr > inciply the following are not being done by BioPython tools. 
> > > > * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov > , not the standard NCBI Web address. > > * Make no more than one request every 3 seconds. > > > > In fact I recently cc'd you on an event when a user was coming in at > over 18 requests per second. We really wish that you would alter you > scripts to run with a some sort of sleep in it in order to not send > requests more than once per 3 seconds and to not send these to the > main > www web servers but use the http://eutils.ncbi.nlm.nih.gov > . > > > > Also, there is the problem of huge searches in order to build local > databases. With you package it seems that if one were so inclined you > would send a search for all human sequences (over 10,000,000 > sequences) > and you program would then retrieve these one ID at a time. Regardless > of the fact that this is an extreme example, we would much prefer if > your program could webenv from the Esearch and use the search > history > and webenv to retrieve sets of sequences at 200 - 200 at a time. > > > > History: Requests utility to maintain results in user's environment. > Used in conjunction with WebEnv. > > usehistory=y > > Web Environment: Value previously returned in XML results from ESearch > or EPost. This value may change with each utility call. If WebEnv is > used, History search numbers can be included in an ESummary URL, e.g., > term=cancer+AND+%23X (where %23 replaces # and X is the History search > number). > > Note: WebEnv is similar to the cookie that is set on a user's > computers > when accessing PubMed on the web. If the parameter usehistory=y is > included in an ESearch URL both a WebEnv (cookie string) and query_key > (history number) values will be returned in the results. Rather than > using the retrieved PMIDs in an ESummary or EFetch URL you may simply > use the WebEnv and query_key values to retrieve the records. 
WebEnv > will > change for each ESearch query, but a sample URL would be as follows: > > http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed > &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh > GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D > &query_key=6&retmode=html&rettype=medline&retmax=15 > > WebEnv=WgHmIcDG]B etc. > > Display Numbers: > > retstart=x (x= sequential number of the first record retrieved - > default=0 which will retrieve the first record) > retmax=y (y= number of items retrieved) > > > > Otherwise we will end up blocking more of your users which we are > unfortunately already doing in some cases. > > > > Sincerely, > Scott D. McGinnis, M.S. > DHHS/NIH/NLM/NCBI > www.ncbi.nlm.nih.gov > > > _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Thu Jun 26 10:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [BioPython] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2. It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module? 
--Michiel From binbin.liu at umb.no Thu Jun 26 11:35:46 2008 From: binbin.liu at umb.no (binbin) Date: Thu, 26 Jun 2008 17:35:46 +0200 Subject: [BioPython] Entrez Message-ID: <1214494546.6215.3.camel@ubuntu> Hei, Am using biopython 1.45, my problem is as follows:

>>> from Bio import GenBank
>>> from Bio import Entrez
Traceback (most recent call last):
  File "", line 1, in
ImportError: cannot import name Entrez

I could not import Entrez. Was it deleted from Bio? From biopython at maubp.freeserve.co.uk Thu Jun 26 11:57:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 16:57:47 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why you are trying to do "from Bio import Entrez"? 
Peter From winter at biotec.tu-dresden.de Thu Jun 26 11:53:23 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Thu, 26 Jun 2008 17:53:23 +0200 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <4863BB73.2020509@biotec.tu-dresden.de> binbin wrote, On 06/26/08 17:35: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Import works fine for me, so I don't think it has been deleted. With my Linux installation, I can do locate Entrez which finds /var/lib/python-support/python2.5/Bio/Entrez HTH, Christof From biopython at maubp.freeserve.co.uk Thu Jun 26 12:12:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:12:53 +0100 Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <320fb6e00806260912j3395d2c0s3d7bbb7227f84421@mail.gmail.com> > Hello binbin, > > A long long time ago there was a Bio.Entrez module which was deleted in 2000. > > We are going to re-introduce a Bio.Entrez module in Biopython 1.46 > (hopefully out next month?), which will replace Bio.WWW.NCBI. If you > want to try this out now, please install the latest CVS version of > Biopython from source. Sorry - I've confused myself as the Bio.Entrez module has been under revision recently. From the user's point of view Biopython 1.46 will add an XML parser, but otherwise Bio.Entrez should be there in Biopython 1.45. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 16:19:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:19:31 +0100 Subject: [BioPython] Removing the unit test GUI? Message-ID: <320fb6e00806261319w5be098d1y48404f3f93934fa3@mail.gmail.com> Hello all, I wanted to do a quick survey of opinion about the Biopython test suite and its interface. Those of you who have ever installed Biopython from source may have tried running the unit tests too. You do this by changing to the Tests subdirectory, and then running the run_tests.py script. Currently by default this will show a GUI. However, from the developer's point of view the unit tests are almost always run at the command line with: python run_tests.py --no-gui It would let us simplify the test harness if we got rid of the GUI, and it would make life very slightly easier for people running the tests at the command line. But would anyone be upset at the loss of the test GUI? So - have any of you ever run the unit tests? Did you use the GUI or the command line? Would you prefer the GUI to remain? Thanks Peter P.S. See also bug 2525 http://bugzilla.open-bio.org/show_bug.cgi?id=2525 From mjldehoon at yahoo.com Thu Jun 26 18:24:41 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 15:24:41 -0700 (PDT) Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <987374.9439.qm@web62409.mail.re1.yahoo.com> Bio.Entrez was reintroduced in release 1.45 already (though without the parser), so binbin should be able to find it. --Michiel. 
--- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [BioPython] Entrez To: "binbin" Cc: biopython at biopython.org Date: Thursday, June 26, 2008, 11:57 AM On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why are you trying to do "from Bio import Entrez"? Peter _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 27 07:16:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 12:16:12 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214562160.6026.2.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> Message-ID: <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote: > thank you for answering, i am a beginner of biopython,in the "Biopython > Tutorial and Cookbook": > 2.5 Connecting with biological databases: > this is found > "from Bio import Entrez" > > i tried this but it did work for me, that is why i asked. That should have worked if your installation of Biopython 1.45 was successful. We may be able to work out what is wrong. What operating system are you using, which version of python, and how did you install Biopython? 
Regards, Peter From fredgca at hotmail.com Fri Jun 27 09:19:04 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 13:19:04 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython Message-ID: Guys (sorry for the informality), I have followed the discussion about "NCBI Abuse Activity with BioPython". I have to confess that I followed it superficially, since I was not able to understand everything you said. So I am going to ask some questions about it: 1) I believe that using BLAST with NCBIWWW.qblast is included in "Abuse Activity". Right? I am asking because sometimes I use it. The recommendation of NCBI is "Make no more than one request every 3 seconds.". Biopython code does not assure it with the following code in NCBIWWW.py, line 779:

[code]
limiter = RequestLimiter(3)
while 1:
    limiter.wait()
[/code]

2) Do you have any recommendation for using it that is not included in the tutorial? Maybe listing some recommendations here would help. Sorry if I have asked something obvious. Thanks, Fred _________________________________________________________________ Conheça o Windows Live Spaces, a rede de relacionamentos do Messenger! http://www.amigosdomessenger.com.br/ From biopython at maubp.freeserve.co.uk Fri Jun 27 09:57:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 14:57:49 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: Message-ID: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi wrote: > > Guys (sorry the informality), > > I have followed the discussion about "NCBI Abuse Activity with BioPython". I > have to confess that followed it superficially, since I am not able to understand > everything you said. So, I am going to make some questions about it: > > 1)I believe that using BLAST with NCBIWWW.qblast is included in "Abuse Activity". Right? 
I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. > I am asking because sometimes I use it. The recommendation of NCBI is > "Make no more than one request every 3 seconds.". True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > Biopython code does not assure it with the following code in NCBIWWW.py, > line 779: > [code] > limiter = RequestLimiter(3) > while 1: > limiter.wait() > [/code] I believe that bit of code is polling the server for results every three seconds. Perhaps we should insert an additional enforced three second delay between submission of queries as well. > 2)Do you have any recommendation for using it that it is not included in the > tutorial? Maybe listing some recommendations here would help. I would recommend running your own local BLAST server for any large jobs - either the standalone blast tools, or if you have a machine on the network that many people could share, run the WWW version locally. Peter From cjfields at uiuc.edu Fri Jun 27 11:51:12 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 10:51:12 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> Message-ID: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> On Jun 27, 2008, at 8:57 AM, Peter wrote: > On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi > wrote: >> >> Guys (sorry the informality), >> >> I have followed the discussion about "NCBI Abuse Activity with >> BioPython". I >> have to confess that followed it superficially, since I am not able >> to understand >> everything you said. So, I am going to make some questions about it: >> >> 1)I believe that using BLAST with NCBIWWW.qblast is included in >> "Abuse Activity". Right? > > I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. 
Similar policy though, for the same reasons they insist on a delay for E-utils. >> I am asking because sometimes I use it. The recommendation of NCBI is >> "Make no more than one request every 3 seconds.". > > True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > >> Biopython code does not assure it with the following code in >> NCBIWWW.py, >> line 779: >> [code] >> limiter = RequestLimiter(3) >> while 1: >> limiter.wait() >> [/code] > > I believe that bit of code is polling the server for results every > three seconds. Perhaps we should insert an additional enforced three > second delay between submission of queries as well. > >> 2)Do you have any recommendation for using it that it is not >> included in the >> tutorial? Maybe listing some recommendations here would help. > > I would recommend running your own local BLAST server for any large > jobs - either the standalone blast tools, or if you have a machine on > the network that many people could share, run the WWW version locally. > > Peter The above appears to submit a single job at a time and wait 3 sec. between polling the server until the current job is finished. I don't think that is the problem indicated in the link above. The 3 sec. is for submitting new BLAST jobs, for instance if you want to submit one BLAST request after another (gathering the RIDs), then grab all the reports at once, or if you are threading 50 submission requests all at once. chris From fredgca at hotmail.com Fri Jun 27 12:18:47 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 16:18:47 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: Right, thanks for the answers. If I understood, the problem is threading the requests. 
If I am not threading my requests I am not abusing the NCBI server, so I won't thread them. Thanks again, Fred > >> 2)Do you have any recommendation for using it that it is not > >> included in the > >> tutorial? Maybe listing some recommendations here would help. > > > > I would recommend running your own local BLAST server for any large > > jobs - either the standalone blast tools, or if you have a machine on > > the network that many people could share, run the WWW version locally. > > > > Peter > > The above appears to submit a single job at a time and wait 3 sec. > between polling the server until the current job is finished. I don't > think that is the problem indicated in the link above. The 3 sec. is > for submitting new BLAST jobs, for instance if you want to submit one > BLAST request after another (gathering the RIDs), then grab all the > reports at once, or if you are threading 50 submission requests all at > once. > > chris _________________________________________________________________ Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS para o Messenger! É GRÁTIS! http://www.msn.com.br/emoticonpack From cjfields at uiuc.edu Fri Jun 27 13:32:31 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 12:32:31 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: <53E19130-4EAC-4DC7-A58C-883581F8B468@uiuc.edu> No, not just threading. The requests could be made by a simple script/program of any kind with no timeout implemented; the IPs of those abusing the timeout will likely be blocked. The idea is not to spam their server (let alone any server which provides a free service) with tons of requests of any kind, be it eutils or BLAST submission requests, BLAST report retrieval requests using RID, etc. 
Any tools using these services should implement the minimum recommended delay between them. Alternatively, set up a local BLAST service as Peter recommends. chris On Jun 27, 2008, at 11:18 AM, Frederico Arnoldi wrote: > > Right, thanks for the answers. > If I understood, the problem is threading the requests. If I am not > threading my requests I am not abusing NCBI server, so don't thread > them. > Thanks again, > Fred >>>> 2)Do you have any recommendation for using it that it is not >>>> included in the >>>> tutorial? Maybe listing some recommendations here would help. >>> >>> I would recommend running your own local BLAST server for any large >>> jobs - either the standalone blast tools, or if you have a machine >>> on >>> the network that many people could share, run the WWW version >>> locally. >>> >>> Peter >> >> The above appears to submit a single job at a time and wait 3 sec. >> between polling the server until the current job is finished. I >> don't >> think that is the problem indicated in the link above. The 3 sec. is >> for submitting new BLAST jobs, for instance if you want to submit one >> BLAST request after another (gathering the RIDs), then grab all the >> reports at once, or if you are threading 50 submission requests all >> at >> once. >> >> chris > > _________________________________________________________________ > Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS > para o Messenger! É GRÁTIS! 
> http://www.msn.com.br/emoticonpack > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From sbassi at gmail.com Sat Jun 28 10:46:45 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 28 Jun 2008 11:46:45 -0300 Subject: [BioPython] one function, two behaviors Message-ID: If I invoke "transcribe" with a RNA sequence like this:

>>> from Bio.Seq import transcribe
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
Seq('CCGGGUU', RNAAlphabet())

But I can't "transcribe" a RNA sequence if I invoke it this way:

>>> from Bio import Transcribe
>>> transcriber = Transcribe.unambiguous_transcriber
>>> transcriber.transcribe(rna_seq)
Traceback (most recent call last):
  File "", line 1, in
    transcriber.transcribe(rna_seq)
  File "/usr/local/lib/python2.5/site-packages/Bio/Transcribe.py", line 13, in transcribe
    "transcribe has the wrong DNA alphabet"
AssertionError: transcribe has the wrong DNA alphabet

I get the same result when using "translate". What is the rationale behind this? -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From biopython at maubp.freeserve.co.uk Sat Jun 28 11:16:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 28 Jun 2008 16:16:13 +0100 Subject: [BioPython] one function, two behaviors In-Reply-To: References: Message-ID: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Hi Sebastian, As to why there are two ways, well, frankly the Bio.Transcribe and Bio.Translate code isn't very nice to use! The Bio.Seq functions are much simpler. 
We've talked about deprecating the Bio.Transcribe and Bio.Translate modules in favour of just Bio.Seq -- we could deprecate Bio.Transcribe now, but there is functionality in Bio.Translate that has not been duplicated. See also bug 2381. http://bugzilla.open-bio.org/show_bug.cgi?id=2381 On Sat, Jun 28, 2008 at 3:46 PM, Sebastian Bassi wrote: > If I invoke "transcribe" with a RNA sequence like this: > >>>> from Bio.Seq import transcribe >>>> from Bio.Seq import Seq >>>> import Bio.Alphabet >>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna) >>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA" > Seq('CCGGGUU', RNAAlphabet()) When Michiel added this code for Biopython 1.41, originally there was no error checking on the alphabet. For Biopython 1.44, I added a check to prevent protein transcribing (which is clearly meaningless), and made a note to consider also banning transcribing RNA. Here there is at least one reason to want to do this - suppose you have a mixed set of nucleotide sequences and want to ensure they are all RNA. Do you think the Bio.Seq.transcribe() method should reject RNA sequences? Peter From biopython at maubp.freeserve.co.uk Sat Jun 28 11:23:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 28 Jun 2008 16:23:40 +0100 Subject: [BioPython] one function, two behaviors In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Message-ID: <320fb6e00806280823h36f3f01ema2886dca98635588@mail.gmail.com> I wrote, > As to why there are two ways, well, frankly the Bio.Transcribe and > Bio.Translate code isn't very nice to use! The Bio.Seq functions are > much simpler. Hmm - the tutorial is still using Bio.Transcribe and Bio.Translate at the moment. I could update the tutorial to use the Bio.Seq functions for (back)transcription. 
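At the string level, (back)transcription is just the T/U substitution, so the question being debated above amounts to a one-line check. A plain-Python sketch, including the proposed reject-RNA check — this is illustrative, not Biopython's actual implementation:

```python
def transcribe(seq):
    """DNA -> RNA: replace T with U (sketch, not Bio.Seq's real code)."""
    # The check being debated: refuse input that already looks like RNA.
    if "U" in seq.upper():
        raise ValueError("sequence already contains U - looks like RNA")
    return seq.replace("T", "U").replace("t", "u")

def back_transcribe(seq):
    """RNA -> DNA: replace U with T (sketch)."""
    return seq.replace("U", "T").replace("u", "t")

assert transcribe("ACGTacgt") == "ACGUacgu"
assert back_transcribe("ACGUacgu") == "ACGTacgt"
```

With the check in place, transcribing 'CCGGGUU' would raise a ValueError rather than returning the input unchanged.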
However, as I said in the last email, Bio.Translate still has its uses - there is no way to do a "translate to stop" with Bio.Seq for example. Maybe Bug 2381 should be a priority for the next release AFTER the imminent Biopython 1.46. We can then use object methods in the tutorial, which I personally would find much nicer to use. http://bugzilla.open-bio.org/show_bug.cgi?id=2381 If you could have a look at the suggested changes on Bug 2381, I'd welcome some feedback. Peter From sbassi at gmail.com Sat Jun 28 12:47:05 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 28 Jun 2008 13:47:05 -0300 Subject: [BioPython] one function, two behaviors In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Message-ID: On Sat, Jun 28, 2008 at 12:16 PM, Peter wrote: .... > Here there is at least one reason to want to do this - suppose you > have a mixed set of nucleotide sequences and want to ensure they are > all RNA. > Do you think the Bio.Seq.transcibe() method should reject RNA sequences? IMHO, it should reject RNA sequences. The case you point out (ensure a set of sequences are all RNA) could be done by checking the type before applying "transcribe". -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From lueck at ipk-gatersleben.de Sun Jun 29 10:42:47 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Sun, 29 Jun 2008 16:42:47 +0200 Subject: [BioPython] Sequence from Fasta Message-ID: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Hi! Is there a way to extract only the sequence (full length) from a fasta file? 
If I try the code from page 10 in the tutorial, I get of course this:

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet())

But I'm looking for something like this:

Name Sequence without linebreak

Example:

MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca

etc. Regards Stefanie From biopython at maubp.freeserve.co.uk Sun Jun 29 11:19:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:19:13 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> On Sun, Jun 29, 2008 at 3:42 PM, Stefanie Lück wrote: > Hi! > > Is there a way to extract only the sequence (full length) from a fasta file? Yes. Based on your requirement to have name-space-sequence, how about:

handle = open(filename)
from Bio import SeqIO
for record in SeqIO.parse(handle, "fasta") :
    print "%s %s" % (record.id, record.seq)
handle.close()

> If I try the code from page 10 in the tutorial, I get of course this: > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet()) Which bit of the tutorial exactly? That looks like printing the repr() of a Seq object, and Seq objects don't have names. If something could be clarified that's useful feedback. Peter From lueck at ipk-gatersleben.de Mon Jun 30 05:09:53 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Mon, 30 Jun 2008 11:09:53 +0200 Subject: [BioPython] Sequence from Fasta References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> Message-ID: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> Hi Peter! 
I mean the biopython tutorial (16.3.2007), page 10:

>>>
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print seq_record.seq
    print len(seq_record.seq)
handle.close()
<<<

I tried your code but I still have the same problem. It doesn't show the full sequence. Output:

1
Seq('atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACG ...', SingleLetterAlphabet())
2
Seq('AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGG ...', SingleLetterAlphabet())

Fasta File looks like this:

>1
atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACGCATCAGCCCACCAGCGACGACGACGACGAGGAAGACAGAGCCGcCC
>2
AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGGTACCCGTCGGTGAACCTTT

I can try with regular expressions but I first wanted to know whether there is a way in biopython. Regards Stefanie From biopython at maubp.freeserve.co.uk Mon Jun 30 05:19:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 10:19:16 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806300219j54f7f43dpe0051f54be27d402@mail.gmail.com> Which version of Biopython do you have? I'm guessing Biopython 1.44. On older versions you would have to explicitly turn the Seq into a string. Does this work:

from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print seq_record.seq.tostring()
    print len(seq_record.seq)
handle.close()

Since Biopython 1.45, doing str(...) on a Seq object gives you the sequence in full as a plain string. When you do a print this happens implicitly. Peter P.S. For the implementation, str(object) calls the object.__str__() method. 
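The difference Peter describes — the full sequence from str() versus the truncated repr-style output on older versions — can be mimicked with a minimal stand-in class. This is an illustration of __str__ versus a tostring() method, not Biopython's actual Seq implementation:

```python
class MiniSeq:
    """Toy stand-in for Bio.Seq.Seq (illustrative only)."""
    def __init__(self, data):
        self.data = data

    def tostring(self):
        # The long-standing accessor: works on old and new versions alike.
        return self.data

    def __str__(self):
        # Biopython >= 1.45 behaviour: str(seq) is the full plain string,
        # so "print seq" shows the whole sequence.
        return self.data

    def __repr__(self):
        # The truncated, alphabet-style display seen in the output above.
        return "Seq(%r ...)" % self.data[:60]

s = MiniSeq("CGTAACAAGGTTTCCGTAGGTGAA")
assert str(s) == s.tostring() == "CGTAACAAGGTTTCCGTAGGTGAA"
```

On a pre-1.45 Seq, only the tostring() route reliably yields the full plain string.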
From dalloliogm at gmail.com Mon Jun 30 05:40:23 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 30 Jun 2008 11:40:23 +0200 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Message-ID: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote: > If I try the code from page 10 in the tutorial, I get of course this: Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet()) Try with seq_record.seq.data. > But I'm looking for something like this: > > Name Sequence without linebreak > > Example: > > MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg > MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca Bioperl's SeqIO has support for a 'tab sequence format' which is similar to this[1]. Maybe it could be useful in the future to add support for such a format in biopython. 
[1] http://www.bioperl.org/wiki/Tab_sequence_format > > Regards > Stefanie > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (Italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Jun 30 06:25:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 11:25:01 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> Message-ID: <320fb6e00806300325r10c96b57qffee9ab3df81cb9e@mail.gmail.com> On Mon, Jun 30, 2008 at 10:40 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote: > >> If I try the code from page 10 in the tutorial, I get of course this: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA > ...', SingleLetterAlphabet()) > > Try with seq_record.seq.data. I would like to discourage using the Seq object's .data property if possible, in favour of my_seq.tostring() which will work even on very old versions of Biopython, or str(my_seq) if you are up to date. I've mooted deprecating the Seq object's .data property as part of making the Seq object more string-like (Bug 2509 and Bug 2351). http://bugzilla.open-bio.org/show_bug.cgi?id=2509 http://bugzilla.open-bio.org/show_bug.cgi?id=2351 User feedback would be good, but to explain my current thinking: I'm hoping to reduce the Seq's .data to a read-only property in a future release, and then in a later release start issuing a deprecation warning, before its eventual removal (Bug 2509). At some point in this process the Seq object would hopefully subclass the python string (Bug 2351). 
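The direction sketched in those bug reports — a string-like Seq with a warning, read-only .data shim — could look roughly like this. Hypothetical code, not what Biopython ships:

```python
import warnings

class StringSeq(str):
    """Hypothetical string-subclassing Seq (the Bug 2351 direction)."""

    @property
    def data(self):
        # Read-only shim for old-style .data access (the Bug 2509 plan):
        # keep it working for now, but warn ahead of eventual removal.
        warnings.warn("Seq.data is deprecated; use str(my_seq) instead",
                      DeprecationWarning, stacklevel=2)
        return str(self)

s = StringSeq("ATGC")
assert s == "ATGC"          # behaves as a plain string
assert s.lower() == "atgc"  # inherits all str methods for free
```

Subclassing str would make slicing, comparison, and dictionary keys "just work" for Seq objects, at the cost of losing alphabet information on operations that return plain strings.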
Peter From krewink at inb.uni-luebeck.de Mon Jun 9 06:58:52 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Mon, 9 Jun 2008 08:58:52 +0200 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <664146.43151.qm@web65604.mail.ac4.yahoo.com> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Message-ID: <20080609065852.GB13032@inb.uni-luebeck.de> Hi Steve, On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > > I've been using BioPython for a few projects the last > two months to process BLAST results but now I need to > take those results and determine which of them have > known splice variants. By "known" I mean those that > have annotations contained in a database that indicate > they have (or are) splice variants. Depending on which organism you are looking at, you might want to use the Ensembl genome database. 
There is no biopython interface, but you can use the jython interface from their website (at least they once had one, I didn't check if that's still the case). Otherwise you might have to use perl or java packages for that. Another good resource for this is the Alternative Splicing Database: http://www.ebi.ac.uk/asd/ Hope that helps, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/ From bsouthey at gmail.com Mon Jun 9 13:25:44 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 09 Jun 2008 08:25:44 -0500 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <20080609065852.GB13032@inb.uni-luebeck.de> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> <20080609065852.GB13032@inb.uni-luebeck.de> Message-ID: <484D2F58.6020502@gmail.com> Albert Krewinkel wrote: > Hi Steve, > > On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > >> I've been using BioPython for a few projects the last >> two months to process BLAST results but now I need to >> take those results and determine which of them have >> known splice variants. By "known" I mean those that >> have annotations contained in a database that indicate >> they have (or are) splice variants. >> > > Depending on which organism you are looking at, you might want to use > the Ensembl genome database. There is no biopython interface, but you > can use the jython interface from their website (at least they once > had one, I didn't check if that's still the case). Otherwise you > might have to use perl or java packages for that. > > Another good resource for this is the Alternative Splicing Database: > http://www.ebi.ac.uk/asd/ > > Hope that helps, > > Albert > > > The 'ALTERNATIVE PRODUCTS' section of CC lines in a UniProt (SwissProt) record can contain alternative splicing information. See for example, the manual section: **3.12.5. 
Syntax of the topic 'ALTERNATIVE PRODUCTS'** http://ca.expasy.org/sprot/userman.html#CCAP (given below for completeness). Bruce

Example of the CC lines and the corresponding FT lines for an entry with alternative splicing:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=8;
CC       Comment=Additional isoforms seem to exist;
CC       Name=1; Synonyms=Non-muscle isozyme;
CC         IsoId=Q15746-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=Q15746-2; Sequence=VSP_004791;
CC       Name=3A;
CC         IsoId=Q15746-3; Sequence=VSP_004792, VSP_004794;
CC       Name=3B;
CC         IsoId=Q15746-4; Sequence=VSP_004791, VSP_004792, VSP_004794;
CC       Name=4;
CC         IsoId=Q15746-5; Sequence=VSP_004792, VSP_004793;
CC       Name=Del-1790;
CC         IsoId=Q15746-6; Sequence=VSP_004795;
CC       Name=5; Synonyms=Smooth-muscle isozyme;
CC         IsoId=Q15746-7; Sequence=VSP_018845;
CC         Note=Produced by alternative initiation at Met-923 of isoform 1;
CC       Name=6; Synonyms=Telokin;
CC         IsoId=Q15746-8; Sequence=VSP_018846;
CC         Note=Produced by alternative initiation at Met-1761 of isoform
CC         1. Has no catalytic activity;
...
FT   VAR_SEQ       1   1760       Missing (in isoform 6).
FT                                /FTId=VSP_018846.
FT   VAR_SEQ       1    922       Missing (in isoform 5).
FT                                /FTId=VSP_018845.
FT   VAR_SEQ     437    506       VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT                                RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in
FT                                isoform 2 and isoform 3B).
FT                                /FTId=VSP_004791.
FT   VAR_SEQ    1433   1439       DEVEVSD -> MKWRCQT (in isoform 3A,
FT                                isoform 3B and isoform 4).
FT                                /FTId=VSP_004792.
FT   VAR_SEQ    1473   1545       Missing (in isoform 4).
FT                                /FTId=VSP_004793.
FT   VAR_SEQ    1655   1705       Missing (in isoform 3A and isoform 3B).
FT                                /FTId=VSP_004794.
FT   VAR_SEQ    1790   1790       Missing (in isoform Del-1790).
FT                                /FTId=VSP_004795.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=3;
CC       Comment=Isoform 1 and isoform 2 arise due to the use of two
CC       alternative first exons joined to a common exon 2 at the same
CC       acceptor site but in different reading frames, resulting in two
CC       completely different isoforms;
CC       Name=1; Synonyms=p16INK4a;
CC         IsoId=O77617-1; Sequence=Displayed;
CC       Name=3;
CC         IsoId=O77617-2; Sequence=VSP_018701;
CC         Note=Produced by alternative initiation at Met-35 of isoform 1.
CC         No experimental confirmation available;
CC       Name=2; Synonyms=p19ARF;
CC         IsoId=O77618-1; Sequence=External;
..
FT   VAR_SEQ       1     34       Missing (in isoform 3).
FT                                /FTId=VSP_004099.

From lueck at ipk-gatersleben.de Tue Jun 10 08:38:14 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 10 Jun 2008 10:38:14 +0200 Subject: [BioPython] formatdb over python code Message-ID: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Hi! Does anyone know whether it's possible to build a database with formatdb (NCBI) from Python code (on Windows) rather than from the console? Regards Stefanie From biopython at maubp.freeserve.co.uk Tue Jun 10 09:41:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 10:41:27 +0100 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806100241i68b24632s121324ce1c942dd9@mail.gmail.com> On Tue, Jun 10, 2008 at 9:38 AM, Stefanie Lück wrote: > Hi! > > Does anyone know whether it's possible to build a database with formatdb > (NCBI) from Python code (on Windows) rather than from the console? Hello Stefanie, I don't think Biopython has a wrapper for the NCBI formatdb tool, but you could construct the command line string yourself and call it with one of the standard python os functions, e.g. os.popen().
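Peter's suggestion can also be sketched with the subprocess module rather than os.popen(). The helper below only assembles the documented formatdb flags (-i input file, -p protein T/F, -o SeqId parsing/indexing T/F); the helper name itself is invented for illustration, and the tool is only executed if it is actually on the PATH:

```python
import shutil
import subprocess

def formatdb_command(fasta_path, protein=True, parse_deflines=False):
    """Build the argument list for NCBI formatdb (helper name is made up)."""
    return [
        "formatdb",
        "-i", fasta_path,                       # input FASTA file
        "-p", "T" if protein else "F",          # T = protein, F = nucleotide
        "-o", "T" if parse_deflines else "F",   # parse SeqIds / create indexes
    ]

cmd = formatdb_command("proteins.fasta")
# Only actually invoke the tool if it is installed:
if shutil.which("formatdb"):
    subprocess.check_call(cmd)
```

Passing an argument list (rather than one shell string) also avoids quoting problems with paths containing spaces, which matters on Windows.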
Peter From winter at biotec.tu-dresden.de Tue Jun 10 10:13:06 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 10 Jun 2008 12:13:06 +0200 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <484E53B2.5060102@biotec.tu-dresden.de> Stefanie Lück wrote, On 06/10/08 10:38: > Hi! > > Does anyone know whether it's possible to build a database with formatdb > (NCBI) from Python code (on Windows) rather than from the console? Here is the Python code I use for that:

cmd = "formatdb -i %s -p T -o F" % database
os.system(cmd)

-p T specifies protein sequences; -o T creates indexes, but fails if the FASTA file does not follow the defline format (see http://en.wikipedia.org/wiki/Fasta_format#Sequence_identifiers). If it fails, use -o F. Christof From mjldehoon at yahoo.com Sat Jun 14 02:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [BioPython] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for Rebase's plain-text output)? If not, I think this module should be deprecated. --Michiel.
From biopython at maubp.freeserve.co.uk Mon Jun 16 14:01:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Jun 2008 15:01:31 +0100 Subject: [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO Message-ID: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> I've recently had to deal with some contig files in the Ace format (output by CAP3, but many assembly programs produce this format). We have a module for parsing Ace files in Biopython, Bio.Sequencing.Ace, but I was wondering about integrating this into the Bio.SeqIO or Bio.AlignIO framework. http://www.biopython.org/wiki/SeqIO http://www.biopython.org/wiki/AlignIO I'd like to hear from anyone currently using Ace files, on how they tend to treat the data - and if they think a SeqRecord or Alignment-based representation would be useful. Each contig in an Ace file could be treated as a SeqRecord using the consensus sequence. The identifiers of each sub-sequence used to build the consensus could be stored as database cross-references, or perhaps we could store these as SeqFeatures describing which part of the consensus they support. This would then fit into Bio.SeqIO quite well. Alternatively, each contig could be treated as an alignment (with a consensus) and integrated into Bio.AlignIO. One drawback is that doing this with the current generic alignment class would require padding the start and/or end of each sequence with gaps in order to make every sequence the same length. However, if we did this (or created a more specialised alignment class), the Ace file format would then fit into Bio.AlignIO too. So, Ace users - would either (or both) of the above approaches make sense for how you use the Ace contig files?
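The gap-padding step needed to fit contig reads into a generic alignment can be sketched in a few lines of plain Python. The function name and the 0-based offset convention are assumptions for illustration, not anything from Bio.Sequencing.Ace:

```python
def pad_read(read_seq, offset, consensus_len, gap="-"):
    """Pad a sub-read with gaps so it lines up with consensus coordinates.

    offset is the 0-based position of the read's first base on the
    consensus; the result always has length consensus_len.
    """
    if offset < 0 or offset + len(read_seq) > consensus_len:
        raise ValueError("read does not fit within the consensus")
    return gap * offset + read_seq + gap * (consensus_len - offset - len(read_seq))


consensus = "ACGTACGTAC"
assert pad_read("GTAC", 2, len(consensus)) == "--GTAC----"
assert len(pad_read("AC", 8, len(consensus))) == len(consensus)
```

Padding every read this way gives equal-length rows, which is exactly what the current generic alignment class would require.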
Thanks Peter From laserson at mit.edu Tue Jun 17 18:44:08 2008 From: laserson at mit.edu (Uri Laserson) Date: Tue, 17 Jun 2008 14:44:08 -0400 Subject: [BioPython] Dependency help: libssl.so.0.9.7 Message-ID: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Hi, I am trying to use some biopython packages, and it turns out there is an error when I try to import _hashlib:

>>> import _hashlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libssl.so.0.9.7: cannot open shared object file: No such file or directory

I am working on a Unix system that is administered by a university, but I have installed my own local version of Python along with Biopython and all necessary packages for that. There is a libssl.so.0.9.8 and a libssl.so (a symbolic link to the former) in /usr/lib. Running ldd on _hashlib.so in my own /python/lib/python2.5/lib-dynload gives me:

    linux-gate.so.1 => (0xffffe000)
    libssl.so.0.9.7 => not found
    libcrypto.so.0.9.7 => not found
    libpthread.so.0 => /lib32/libpthread.so.0 (0xf7f67000)
    libc.so.6 => /lib32/libc.so.6 (0xf7e3c000)
    /lib/ld-linux.so.2 (0x56555000)

What is the easiest way to solve this? How do I get my local (home directory) installation of Python to find the libssl.so library in /usr/lib? Thanks!
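One common workaround is to give the dynamic linker a compatibility symlink to search. Note the big caveat: OpenSSL 0.9.7 and 0.9.8 are not guaranteed to be ABI-compatible, so the robust fix is to rebuild the local Python so that _hashlib links against the system OpenSSL. A sketch, with the library paths assumed from the ldd output above:

```shell
# Create a private lib dir with compat symlinks pointing at the 0.9.8 libraries.
# WARNING: 0.9.7 vs 0.9.8 ABI compatibility is NOT guaranteed; if anything
# misbehaves, rebuild Python against the system OpenSSL instead.
mkdir -p "$HOME/lib"
ln -sf /usr/lib/libssl.so.0.9.8    "$HOME/lib/libssl.so.0.9.7"
ln -sf /usr/lib/libcrypto.so.0.9.8 "$HOME/lib/libcrypto.so.0.9.7"

# Make the dynamic linker search the private dir first
# (add this line to ~/.bashrc to make it permanent):
export LD_LIBRARY_PATH="$HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Then retry:  python -c "import _hashlib"
```

If the import still fails with relocation or version errors, that is the ABI mismatch showing up, and recompiling Python is the only clean way out.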
Uri -- Uri Laserson PhD Candidate, Biomedical Engineering Harvard Medical School (Genetics) Massachusetts Institute of Technology (Mathematics) phone +1 917 742 8019 laserson at mit.edu From biopython at maubp.freeserve.co.uk Wed Jun 18 09:11:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 10:11:42 +0100 Subject: [BioPython] Dependency help: libssl.so.0.9.7 In-Reply-To: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> References: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Message-ID: <320fb6e00806180211o5d505ct4099cdd4fc9e11dc@mail.gmail.com> On Tue, Jun 17, 2008 at 7:44 PM, Uri Laserson wrote: > Hi, > > I am trying to use some biopython packages, and it turns out there is an > error when I try to import _hashlib: > >>>> import _hashlib > Traceback (most recent call last): > ... Hi Uri, I'm guessing you are trying to use Bio.SeqUtils.Checksum, but did you mean "import hashlib"? See http://code.krypto.org/python/hashlib/ Peter From biopython at maubp.freeserve.co.uk Wed Jun 18 11:32:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 12:32:10 +0100 Subject: [BioPython] blastx works fine? In-Reply-To: <1131745582.4368.22.camel@osiris.biology.duke.edu> References: <1131745582.4368.22.camel@osiris.biology.duke.edu> Message-ID: <320fb6e00806180432x60ceea96o3e45f05590003e8e@mail.gmail.com> In Nov 2005, Frank Kauff wrote: > Hi all, > > qblast currently says it works only for blastp and blastn. Actually it > seems to work fine with blastx as well - xml output parses well with > NCBIXML. Or am I missing something? > > Frank Yes, using BLASTX with the Biopython XML parser does seem to work. In fact the NCBI (now) documentation explicitly lists blastn, blastp, blastx, tblastn and tblastx so I updated Biopython's qblast function to allow them too. http://www.ncbi.nlm.nih.gov/BLAST/Doc/node43.html Fixed in Bio/Blast/NCBIWWW.py revision 1.50 - better late than never? 
Peter From mjldehoon at yahoo.com Thu Jun 19 13:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [BioPython] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 13:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. 
Peter From mjldehoon at yahoo.com Thu Jun 19 13:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From bsouthey at gmail.com Thu Jun 19 14:44:00 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 19 Jun 2008 09:44:00 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? 
In-Reply-To: <352888.20937.qm@web62409.mail.re1.yahoo.com> References: <352888.20937.qm@web62409.mail.re1.yahoo.com> Message-ID: <485A70B0.1010202@gmail.com> Michiel de Hoon wrote: >> I wonder if the NCBI make any of this available as XML via Entrez? I >> had a quick look and couldn't find anything. >> > > Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. > > --Michiel. > > > Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > >> records. The parser parses HTML pages from CDD's web site. Since the parser >> was written about six years ago, the CDD web site has changed considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. >> > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Do you know how the test files were created? If there is not an easy answer then it makes the decision easier. Anyhow, I vote to remove this module as, in addition to the things previously mentioned, it would far better to support interproscan (http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool. 
Bruce From cjfields at uiuc.edu Thu Jun 19 14:45:05 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 19 Jun 2008 09:45:05 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: They don't, though you can get esummary XML information (which includes description), and I believe you can use elink to grab other information (including proteins with the specified domain). chris On Jun 19, 2008, at 8:38 AM, Peter wrote: >> Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain >> Database) >> records. The parser parses HTML pages from CDD's web site. Since >> the parser >> was written about six years ago, the CDD web site has changed >> considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Thu Jun 19 16:13:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 17:13:16 +0100 Subject: [BioPython] Adding NCBI XML sequence formats to Bio.SeqIO Message-ID: <320fb6e00806190913h2f3f81bgd9d16fb0f2a740f9@mail.gmail.com> Dear all, I've realised that as a bonus from Michiel's work on Bio.Entrez, Biopython should be able to parse several of the XML sequence file formats used by the NCBI - and ideally we should be able to do this via Bio.SeqIO and get SeqRecord objects. I am thinking about adding a new module to Bio.SeqIO which will map the python list/dictionary structures from Bio.Entrez into SeqRecord object(s). What I wanted to ask the list about, is which XML sequence files are of interest - and are there any strong views on format names should I use? I've looked at BioPerl list since I try and re-use the same format names, but could only spot one NCBI XML file listed here: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats NCBI TinySeq XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd BioPerl call this "tinyseq", which seems like a good choice of name. http://www.bioperl.org/wiki/Tinyseq_sequence_format Also potentially of interest are: NCBI INSDSeq XML format http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd NCBI Seq-entry XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the ASN.1 variant of this file format). http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd (I haven't actually sat down and looked at the details of the implementation yet, so no promises on the timing!) Peter From sbassi at gmail.com Sun Jun 22 22:49:48 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 19:49:48 -0300 Subject: [BioPython] Secondary structure alphabet? 
Message-ID: Here is the secondary structure alphabet: class SecondaryStructure(SingleLetterAlphabet) | Method resolution order: | SecondaryStructure | SingleLetterAlphabet | Alphabet | | Data and other attributes defined here: | | letters = 'HSTC' I can't find what that HSTC stands for. The closer match I found was the DSSP code: The DSSP code The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is: * H = alpha helix * B = residue in isolated beta-bridge * E = extended strand, participates in beta ladder * G = 3-helix (3/10 helix) * I = 5 helix (pi helix) * T = hydrogen bonded turn * S = bend (http://swift.cmbi.ru.nl/gv/dssp/) Does anybody knows the meaning of HSTC? I am CC this mail to Andrew Dalke it seems he was the one who submit it the Biopython. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From idoerg at gmail.com Sun Jun 22 23:03:52 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 22 Jun 2008 16:03:52 -0700 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: Probably Helix Turn Strand Coil On Sun, Jun 22, 2008 at 3:49 PM, Sebastian Bassi wrote: > Here is the secondary structure alphabet: > > class SecondaryStructure(SingleLetterAlphabet) > | Method resolution order: > | SecondaryStructure > | SingleLetterAlphabet > | Alphabet > | > | Data and other attributes defined here: > | > | letters = 'HSTC' > > I can't find what that HSTC stands for. The closer match I found was > the DSSP code: > > The DSSP code > > The output of DSSP is explained extensively under 'explanation'. 
The > very short summary of the output is: > > * H = alpha helix > * B = residue in isolated beta-bridge > * E = extended strand, participates in beta ladder > * G = 3-helix (3/10 helix) > * I = 5 helix (pi helix) > * T = hydrogen bonded turn > * S = bend > > (http://swift.cmbi.ru.nl/gv/dssp/) > > Does anybody knows the meaning of HSTC? I am CC this mail to Andrew > Dalke it seems he was the one who submit it the Biopython. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From sbassi at gmail.com Sun Jun 22 23:05:13 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 20:05:13 -0300 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: On Sun, Jun 22, 2008 at 8:03 PM, Iddo Friedberg wrote: > Probably Helix Turn Strand Coil Sounds plausible. Thank you. Best, SB. 
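Taking Iddo's reading at face value, the letters of 'HSTC' in order would map as below. This is an unverified guess following the thread's own conclusion, not documented anywhere in Biopython at the time:

```python
# Plausible expansion of the 'HSTC' secondary structure alphabet,
# following Iddo's suggestion in this thread -- a guess, not documentation.
SS_MEANINGS = {
    "H": "Helix",
    "S": "Strand",
    "T": "Turn",
    "C": "Coil",
}

assert "".join(SS_MEANINGS) == "HSTC"  # matches the alphabet's letter order
```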
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From jdieten at gmail.com Tue Jun 24 10:58:23 2008 From: jdieten at gmail.com (Joost van Dieten) Date: Tue, 24 Jun 2008 12:58:23 +0200 Subject: [BioPython] Blastp XML malfunction Message-ID: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> MY CODE:

result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]')
blast_results = result_handle.read()
print result_handle
result_handler = cStringIO.StringIO(blast_results)
print result_handler
blast_records = NCBIXML.parse(result_handler)
blast_record = blast_records.next()

This code doesn't seem to work anymore. I got an error that my blast_record is empty, but it worked fine 3 weeks ago. Did something change in the NCBIXML code? Any ideas? Greetz, Joost Dieten From biopython at maubp.freeserve.co.uk Tue Jun 24 11:11:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Jun 2008 12:11:12 +0100 Subject: [BioPython] Blastp XML malfunction In-Reply-To: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> References: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> Message-ID: <320fb6e00806240411j1c01903cm1f40d53eb9c5ad77@mail.gmail.com> On Tue, Jun 24, 2008 at 11:58 AM, Joost van Dieten wrote:
> MY CODE:
> result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence,
> entrez_query='man[ORGN]')
> blast_results = result_handle.read()
> print result_handle
> result_handler = cStringIO.StringIO(blast_results)
> print result_handler
> blast_records = NCBIXML.parse(result_handler)
> blast_record = blast_records.next()

You probably know this, but for anyone trying to cut-and-paste the code, it's much simpler to do this:

result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]')
blast_records = NCBIXML.parse(result_handle)
blast_record =
blast_records.next() Joost's code is a handy way to print out the raw data before parsing it, to try and identify any problems by eye. > This code doesn't seem to work anymore. I got an error that my blast_record > is empty, but it worked fine 3 weeks ago. Something changed to the NCBIXML > code??? Any ideas?? Yes, its probably a recent NCBI change, which we've fixed with Bug 2499: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 If you want to just update the Blast parser, I think you need to update both NCBIXML.py and Record.py, but a complete install from CVS might be simpler. Peter From mjldehoon at yahoo.com Wed Jun 25 14:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [BioPython] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From dag at sonsorol.org Wed Jun 25 15:08:33 2008 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 25 Jun 2008 11:08:33 -0400 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython References: Message-ID: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is us > concern. 
> Mainly the BioPython suite does not appear to be written to the > recommendations made on the main NCBI E-utilities web page > (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).Pr > inciply the following are not being done by BioPython tools. > > > > * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov > , not the standard NCBI Web address. > > * Make no more than one request every 3 seconds. > > > > In fact I recently cc'd you on an event when a user was coming in at > over 18 requests per second. We really wish that you would alter you > scripts to run with a some sort of sleep in it in order to not send > requests more than once per 3 seconds and to not send these to the > main > www web servers but use the http://eutils.ncbi.nlm.nih.gov > . > > > > Also, there is the problem of huge searches in order to build local > databases. With you package it seems that if one were so inclined you > would send a search for all human sequences (over 10,000,000 > sequences) > and you program would then retrieve these one ID at a time. Regardless > of the fact that this is an extreme example, we would much prefer if > your program could webenv from the Esearch and use the search > history > and webenv to retrieve sets of sequences at 200 - 200 at a time. > > > > History: Requests utility to maintain results in user's environment. > Used in conjunction with WebEnv. > > usehistory=y > > Web Environment: Value previously returned in XML results from ESearch > or EPost. This value may change with each utility call. If WebEnv is > used, History search numbers can be included in an ESummary URL, e.g., > term=cancer+AND+%23X (where %23 replaces # and X is the History search > number). > > Note: WebEnv is similar to the cookie that is set on a user's > computers > when accessing PubMed on the web. 
If the parameter usehistory=y is > included in an ESearch URL both a WebEnv (cookie string) and query_key > (history number) values will be returned in the results. Rather than > using the retrieved PMIDs in an ESummary or EFetch URL you may simply > use the WebEnv and query_key values to retrieve the records. WebEnv > will > change for each ESearch query, but a sample URL would be as follows: > > http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed > &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh > GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D > &query_key=6&retmode=html&rettype=medline&retmax=15 > > WebEnv=WgHmIcDG]B etc. > > Display Numbers: > > retstart=x (x= sequential number of the first record retrieved - > default=0 which will retrieve the first record) > retmax=y (y= number of items retrieved) > > > > Otherwise we will end up blocking more of your users which we are > unfortunately already doing in some cases. > > > > Sincerely, > Scott D. McGinnis, M.S. > DHHS/NIH/NLM/NCBI > www.ncbi.nlm.nih.gov > > > From cjfields at uiuc.edu Wed Jun 25 15:34:34 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 25 Jun 2008 10:34:34 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: Just as a note from the BioPerl side, BioPerl modules which access eutils use the 3 min sleep rule, and we specify in the documentation the NCBI rules. The modules also identify the tool/agent used as 'bioperl', I believe. chris On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: > > Can someone from the biopython dev team respond officially to Scott > please? 
> > Regards, > Chris > > Begin forwarded message: > >> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >> Date: June 25, 2008 10:54:28 AM EDT >> Subject: NCBI Abuse Activity with BioPython >> >> [...] > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr.
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From rjalves at igc.gulbenkian.pt Wed Jun 25 16:16:49 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Wed, 25 Jun 2008 17:16:49 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <48626F71.4020804@igc.gulbenkian.pt> you mean 3 seconds no? Quoting Chris Fields on 06/25/2008 04:34 PM: > Just as a note from the BioPerl side, BioPerl modules which access > eutils use the 3 min sleep rule, and we specify in the documentation > the NCBI rules. The modules also identify the tool/agent used as > 'bioperl', I believe. > > chris > > On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: > >> Can someone from the biopython dev team respond officially to Scott >> please? >> >> Begin forwarded message: >> >>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >>> Date: June 25, 2008 10:54:28 AM EDT >>> Subject: NCBI Abuse Activity with BioPython >>> >>> [...] >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Wed Jun 25 19:00:34 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 25 Jun 2008 14:00:34 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <48626F71.4020804@igc.gulbenkian.pt> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <48626F71.4020804@igc.gulbenkian.pt> Message-ID: <16811EA1-130D-4F47-B0B5-654E840705B9@uiuc.edu> Yes, my bad (was in a hurry). I have heard of instances where specific users/IPs were blocked temporarily by NCBI based on spamming, so it's best to be proactive. chris On Jun 25, 2008, at 11:16 AM, Renato Alves wrote: > you mean 3 seconds no?
> > Quoting Chris Fields on 06/25/2008 04:34 PM: >> Just as a note from the BioPerl side, BioPerl modules which access >> eutils use the 3 min sleep rule, and we specify in the >> documentation the NCBI rules. The modules also identify the tool/ >> agent used as 'bioperl', I believe. >> >> chris >> >> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: >> >>> Can someone from the biopython dev team respond officially to >>> Scott please? >>> >>> Begin forwarded message: >>> >>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >>>> Date: June 25, 2008 10:54:28 AM EDT >>>> Subject: NCBI Abuse Activity with BioPython >>>> >>>> [...] >>> >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Marie-Claude Hofmann >> College of Veterinary Medicine >> University of Illinois Urbana-Champaign From dalke at dalkescientific.com Thu Jun 26 01:15:50 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 26 Jun 2008 03:15:50 +0200 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: Hi Chris, I'm no longer part of the Biopython dev team, but I read at least the subject line on the mailing list. I wrote the Biopython EUtils package around December 2002 and according to the CVS logs it was added to Biopython in June 2003, so more than 5 years ago. Looking at the commit logs there haven't been any changes to the relevant code since 2004, and that was a minor patch. I thought I put a rate limiter into the code, but looking at it now I see I didn't.
The documentation clearly states that users must follow NCBI's recommendations, but who actually reads documentation?

>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
>> not the standard NCBI Web address.

That change was announced on May 21, 2003, and most likely no one on the Biopython dev group tracks the EUtils mailing list. It was also after I wrote the code, but to be fair I was subscribed to the utilities list at the time and should have caught the change. I think the correct fix is to this code in ThinClient.py:

    def __init__(self, opener = None, tool = TOOL, email = EMAIL,
                 baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):

Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I have not tested this.

>> * Make no more than one request every 3 seconds.

There are a couple of points here. The quickest and most direct way to force/fix the code is to change the "def _get()" in ThinClient.py. The current code is

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

Here's one possible fix: add the following two lines at module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function, so that it sleeps out whatever remains of the 3-second window:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        global _prev_time
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        # Follow NCBI's 3 second restriction
        if time.time() - _prev_time < 3:
            time.sleep(3 - (time.time() - _prev_time))
        _prev_time = time.time()
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

(I recall that I had something like that, and it made my unit tests - which I did during the off hours - interminable.) When I wrote this module I think I assumed that whoever would use the library would use the code correctly. Using it correctly means a few things:

- obey the restrictions set by NCBI
- change the 'tool' and 'email' settings, so NCBI complains to the right person. (The default is to say 'EUtils_Python_client' and 'biopython-dev at biopython.org')

This isn't happening. The patch above force-fixes the first. Should Biopython do a better job of the second? It's not easy to figure out the correct email. I couldn't then and can't now think of a better solution. Perhaps use the result of getpass.getuser()? But that doesn't get the rest of the domain for a proper email. Though NCBI should be able to guess the site from the IP address. The reason I made this assumption is that I meant EUtils to be used by conscientious developers. I've since learned that that's seldom the case, and because it was imported into Biopython it's been exposed to a wider audience.

>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the ESearch and use the search
>> history and webenv to retrieve sets of sequences at 200 - 200 at a time.
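[Editor's note: for anyone adapting this patch outside ThinClient.py, the same throttling idea can be factored into a small standalone helper. This is a sketch only - the names below are illustrative, not part of Biopython or EUtils:]

```python
import time

MIN_INTERVAL = 3.0  # NCBI's rule: at most one E-utilities request every 3 seconds
_prev_time = 0.0    # module-level timestamp of the last request

def wait_for_ncbi():
    """Sleep just long enough that successive calls are >= MIN_INTERVAL apart."""
    global _prev_time
    elapsed = time.time() - _prev_time
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _prev_time = time.time()
```

Calling wait_for_ncbi() immediately before each request gives the same behaviour as patching _get() itself, without touching the rest of the client.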
It does exactly that. There's an entire interface for handling search history - and it took some non-trivial work and questions to NCBI to get things working right. Rather, there are two layers. One is for the low-level protocol ("ThinClient") that EUtils offers, and another wraps around the history mechanism ("HistoryClient").

    >>> from Bio import EUtils
    >>> from Bio.EUtils import HistoryClient
    >>> client = HistoryClient.HistoryClient()
    >>> result = client.search("polio AND picornavirus")
    >>> len(result)
    3437
    >>> f = result.efetch()
    >>> print f.read(1000)
    [start of the PubMed XML for record 18540199 (Tsitologiia 50(2), 2008); the markup was mangled in the list archive, so it is trimmed here]

and there's a way to populate the history with a list of records, then fetch those records in a block:

    >>> result = client.from_dbids(EUtils.DBIds("pubmed", ["100","200","300","400","500"]))
    >>> f = result.efetch("text", "brief")
    >>> print f.read()
    1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
    2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
    3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
    4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
    5: Nourse ES. The regional workshops on pri...[PMID: 500]

If I had to guess, likely more people find the ThinClient code easier to understand, because the NCBI interface has a simple way to get the result for a single record, without using the history interface. The NCBI interface doesn't guide people to the right way to use it effectively. I started working on an update to EUtils which improved the API to include a few helper functions, like "EUtils.search()" instead of having to create a HistoryClient. That might help guide people to using it better. I wrote up something about it a few years ago: http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html But a problem in completing that is that I never got any sort of funding or user feedback on how people were using the software, and as I moved over to chemistry it became lower and lower on my list. That's still the problem with me working on this again. I don't know about this next point, but there might also be a lack of documentation on how to use the Biopython interface effectively? The NCBI documentation isn't meant for non-programmers (it's more of a bytes-on-the-wire document) so perhaps people are pattern matching on what looks right and going with what works, vs. what works well. Then because there was no 3 second limit, they had no incentive to find a better/faster solution.
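[Editor's note: the retstart/retmax batching that NCBI asks for reduces to a short loop. Here is a minimal network-free sketch, where fetch_batch is a hypothetical stand-in for an EFetch call made with the WebEnv/query_key of an earlier ESearch - not a real Biopython function:]

```python
def fetch_in_batches(total, batch_size, fetch_batch):
    """Retrieve `total` records in chunks of `batch_size` rather than
    one ID at a time.  `fetch_batch(retstart, retmax)` stands in for a
    history-based EFetch request and returns a list of records."""
    records = []
    for retstart in range(0, total, batch_size):
        # The last chunk may be smaller than batch_size.
        retmax = min(batch_size, total - retstart)
        records.extend(fetch_batch(retstart, retmax))
    return records
```

With batch_size set to 200, a 10,000,000-record search becomes 50,000 requests instead of 10,000,000 - exactly the reduction NCBI is asking for.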
Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 11:21:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 12:21:57 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> On Thu, Jun 26, 2008 at 2:15 AM, Andrew Dalke wrote: > Hi Chris, > > I'm no longer part of the Biopython dev team, but I read at least the > subject line on the mailing list. > > I wrote the Biopython EUtils package around December 2002 and according to > the CVS logs it was added to Biopython in June 2003, so more then 5 years > ago. Looking at the commit logs there haven't been any change to the > relevant code since 2004, and that was a minor patch. > > I thought I put a rate limiter into the code, but looking at it now I see I > didn't. The documentation clearly states that users must follow NCBI's > recommendations, but who actually reads documentation? > > There's a couple of points here. The quickest and most direct way to > force/fix the code is to change the "def _get()" in ThinClient.py . ... I've updated Bio/EUtils/ThinClient.py in CVS based on your suggested change, and checked the unit tests test_EUtils.py and test_SeqIO_online.py (which calls Bio.EUtils via Bio.GenBank). Looking over the code, should this wait also be done for the ThinClient's epost() method as well? > When I wrote this module I think I assumed that whoever would use the > library would use the code correctly. Using it correctly means a few > things: > - obey the restrictions set by NCBI > - change the 'tool' and 'email' settings, so NCBI complains the right > person. > (The default is to say 'EUtils_Python_client' and > 'biopython-dev at biopython.org') > > This isn't happening. The patch above force-fixes the first. Should > Biopython do a better job of the second? 
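[Editor's note: to make the 'tool' and 'email' point concrete, here is a rough sketch of how a client could tag every E-utilities request. The function name and default values are made up for illustration and are not Biopython's actual interface; only the two URL parameters, tool and email, come from NCBI's guidelines:]

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def eutils_url(program, params, tool="myscript", email="me@example.org"):
    """Build an E-utilities URL on the eutils host, always identifying
    the calling tool and a contact email as NCBI requests."""
    query = dict(params)
    query.setdefault("tool", tool)
    query.setdefault("email", email)
    # Sort for a deterministic parameter order.
    return EUTILS_BASE + program + "?" + urlencode(sorted(query.items()))
```

A script that sets its own tool/email here would let NCBI contact the actual user instead of the biopython-dev list.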
It's not easy to figure out the > correct email. I couldn't then and can't now think of a better solution. > Perhaps use the result of getpass.getuser()? But that doesn't get the rest > of the domain for a proper email. Though NCBI should be able to guess the > site from the IP address. Figuring out the user's email address is tricky, especially cross platform. Perhaps we should update the Bio.EUtils and Bio.Entrez documentation to recommend the user set their email address here, and if they are wrapping Biopython in part of a larger tool (e.g. a webservice) to set the tool name too. > If I had to guess, likely more people find the ThinClient code easier to > understand, because the NCBI interface has a simple way to get the result > for a single record, without using the history interface. The NCBI > interface doesn't guide people to the right way to use it effectively. I would agree with you. I would go further, and say for a new user even the ThinClient is a bit scary, and that the wrapper functions in Bio.GenBank are nicer to use. > I started working on an update to EUtils which improved the API to include a > few helper functions, like "EUtils.search()" instead of having to create a > HistoryClient. That might help guide people to using it better. I wrote up > something about it a few years ago: > http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html > > But a problem in completing that is that I never got any sort of funding or > user feedback on how people were using the software, and as I moved over to > chemistry it became lower and lower on my list. That's still the problem > with me working on this again. This complexity is also daunting for anyone else considering taking over the Bio.EUtils code base. > I don't know about this next point, but there might also be a lack of > documentation on how to use the Biopython interface effectively? 
The NCBI > documentation isn't meant for non-programmers (it's more of a > bytes-on-the-wire document) so perhaps people are pattern matching on what > looks right and going with what works, vs. what works well. Then because > there was no 3 second limit, they had no incentive to find a better/faster > solution. That would explain how the unnamed user ended up making over 18 requests per second! I confess I had assumed that things like the Bio.GenBank wrappers would be respecting the 3 second rule (at least they should do now). Peter From mjldehoon at yahoo.com Thu Jun 26 11:48:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:48:09 -0700 (PDT) Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <53670.7764.qm@web62412.mail.re1.yahoo.com> Dear Chris, Sorry for the trouble. We are now discussing on the Biopython mailing list how to fix this issue. I will write a reply to Scott shortly. Best, --Michiel. --- On Wed, 6/25/08, Chris Dagdigian wrote: From: Chris Dagdigian Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython To: biopython at lists.open-bio.org Date: Wednesday, June 25, 2008, 11:08 AM Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > Subject: NCBI Abuse Activity with BioPython > > [...] _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Thu Jun 26 14:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [BioPython] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2. It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module?
--Michiel

From binbin.liu at umb.no Thu Jun 26 15:35:46 2008
From: binbin.liu at umb.no (binbin)
Date: Thu, 26 Jun 2008 17:35:46 +0200
Subject: [BioPython] Entrez
Message-ID: <1214494546.6215.3.camel@ubuntu>

Hei,
Am using biopython 1.45

My problem is as follows:

>>> from Bio import GenBank
>>> from Bio import Entrez
Traceback (most recent call last):
  File "", line 1, in
ImportError: cannot import name Entrez

I could not import Entrez. Was it deleted from Bio?

From biopython at maubp.freeserve.co.uk Thu Jun 26 15:57:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 16:57:47 +0100
Subject: [BioPython] Entrez
In-Reply-To: <1214494546.6215.3.camel@ubuntu>
References: <1214494546.6215.3.camel@ubuntu>
Message-ID: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com>

On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote:
> Hei,
> Am using biopython 1.45
>
> My problem is as follows:
>
> >>> from Bio import GenBank
> >>> from Bio import Entrez
> Traceback (most recent call last):
>   File "", line 1, in
> ImportError: cannot import name Entrez
>
> I could not import Entrez. Was it deleted from Bio?

Hello binbin,

A long long time ago there was a Bio.Entrez module which was deleted in
2000.

We are going to re-introduce a Bio.Entrez module in Biopython 1.46
(hopefully out next month?), which will replace Bio.WWW.NCBI. If you
want to try this out now, please install the latest CVS version of
Biopython from source.

Can I ask why you are trying to do "from Bio import Entrez"?
Peter From winter at biotec.tu-dresden.de Thu Jun 26 15:53:23 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Thu, 26 Jun 2008 17:53:23 +0200 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <4863BB73.2020509@biotec.tu-dresden.de> binbin wrote, On 06/26/08 17:35: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Import works fine for me, so I don't think it has been deleted. With my Linux installation, I can do locate Entrez which finds /var/lib/python-support/python2.5/Bio/Entrez HTH, Christof From biopython at maubp.freeserve.co.uk Thu Jun 26 16:12:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:12:53 +0100 Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <320fb6e00806260912j3395d2c0s3d7bbb7227f84421@mail.gmail.com> > Hello binbin, > > A long long time ago there was a Bio.Entrez module which was deleted in 2000. > > We are going to re-introduce a Bio.Entrez module in Biopython 1.46 > (hopefully out next month?), which will replace Bio.WWW.NCBI. If you > want to try this out now, please install the latest CVS version of > Biopython from source. Sorry - I've confused myself as the Bio.Entrez module has been under revision recently. >From the user's point of view Biopython 1.46 will add an XML parser, but otherwise Bio.Entrez should be there in Biopython 1.45. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 20:19:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:19:31 +0100 Subject: [BioPython] Removing the unit test GUI? Message-ID: <320fb6e00806261319w5be098d1y48404f3f93934fa3@mail.gmail.com> Hello all, I wanted to do a quick survey of opinion about the Biopython test suite and its interface. Those of you who have ever installed Biopython from source may have tried running the unit tests too. You do this by changing to the Tests subdirectory, and then running the run_tests.py script. Currently by default this will show a GUI. However, from the developer's point of view the unit tests are almost always run at the command line with: python run_tests.py --no-gui It would let us simplify the test harness if we got rid of the GUI, and it would make life very slightly easier for people running the tests at the command line. But would anyone be upset at the loss of the test GUI? So - have any of you ever run the unit tests? Did you use the GUI or the command line? Would you prefer the GUI to remain? Thanks Peter P.S. See also bug 2525 http://bugzilla.open-bio.org/show_bug.cgi?id=2525 From mjldehoon at yahoo.com Thu Jun 26 22:24:41 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 15:24:41 -0700 (PDT) Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <987374.9439.qm@web62409.mail.re1.yahoo.com> Bio.Entrez was reintroduced in release 1.45 already (though without the parser), so binbin should be able to find it. --Michiel. 
--- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [BioPython] Entrez To: "binbin" Cc: biopython at biopython.org Date: Thursday, June 26, 2008, 11:57 AM On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why are you trying to do "from Bio import Entrez"? Peter _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 27 11:16:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 12:16:12 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214562160.6026.2.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> Message-ID: <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote: > thank you for answering, i am a beginner of biopython,in the "Biopython > Tutorial and Cookbook": > 2.5 Connecting with biological databases: > this is found > "from Bio import Entrez" > > i tried this but it did work for me, that is why i asked. That should have worked if your installation of Biopython 1.45 was successful. We may be able to work out what is wrong. What operating system are you using, which version of python, and how did you install Biopython? 
Regards,
Peter

From fredgca at hotmail.com Fri Jun 27 13:19:04 2008
From: fredgca at hotmail.com (Frederico Arnoldi)
Date: Fri, 27 Jun 2008 13:19:04 +0000
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
Message-ID:

Guys (sorry the informality),

I have followed the discussion about "NCBI Abuse Activity with
BioPython". I have to confess that I followed it superficially, since I
am not able to understand everything you said. So, I am going to ask
some questions about it:

1) I believe that using BLAST with NCBIWWW.qblast is included in "Abuse
Activity". Right? I am asking because sometimes I use it. The
recommendation of NCBI is "Make no more than one request every 3
seconds.". Biopython's code does not seem to assure this, judging by the
following code in NCBIWWW.py, line 779:

[code]
limiter = RequestLimiter(3)
while 1:
    limiter.wait()
[/code]

2) Do you have any recommendation for using it that is not included in
the tutorial? Maybe listing some recommendations here would help.

Sorry if I have asked about something obvious.

Thanks,
Fred

_________________________________________________________________
Conheça o Windows Live Spaces, a rede de relacionamentos do Messenger!
http://www.amigosdomessenger.com.br/

From biopython at maubp.freeserve.co.uk Fri Jun 27 13:57:49 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Jun 2008 14:57:49 +0100
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References:
Message-ID: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com>

On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi wrote:
>
> Guys (sorry the informality),
>
> I have followed the discussion about "NCBI Abuse Activity with
> BioPython". I have to confess that I followed it superficially, since
> I am not able to understand everything you said. So, I am going to ask
> some questions about it:
>
> 1) I believe that using BLAST with NCBIWWW.qblast is included in
> "Abuse Activity". Right?
I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. > I am asking because sometimes I use it. The recommendation of NCBI is > "Make no more than one request every 3 seconds.". True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > Biopython code does not assure it with the following code in NCBIWWW.py, > line 779: > [code] > limiter = RequestLimiter(3) > while 1: > limiter.wait() > [/code] I believe that bit of code is polling the server for results every three seconds. Perhaps we should insert an additional enforced three second delay between submission of queries as well. > 2)Do you have any recommendation for using it that it is not included in the > tutorial? Maybe listing some recommendations here would help. I would recommend running your own local BLAST server for any large jobs - either the standalone blast tools, or if you have a machine on the network that many people could share, run the WWW version locally. Peter From cjfields at uiuc.edu Fri Jun 27 15:51:12 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 10:51:12 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> Message-ID: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> On Jun 27, 2008, at 8:57 AM, Peter wrote: > On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi > wrote: >> >> Guys (sorry the informality), >> >> I have followed the discussion about "NCBI Abuse Activity with >> BioPython". I >> have to confess that followed it superficially, since I am not able >> to understand >> everything you said. So, I am going to make some questions about it: >> >> 1)I believe that using BLAST with NCBIWWW.qblast is included in >> "Abuse Activity". Right? > > I'm not aware that abuse of BLAST was singled out, only Entrez / E- > utils. 
Similar policy though, for the same reasons they insist on a delay for E-utils. >> I am asking because sometimes I use it. The recommendation of NCBI is >> "Make no more than one request every 3 seconds.". > > True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > >> Biopython code does not assure it with the following code in >> NCBIWWW.py, >> line 779: >> [code] >> limiter = RequestLimiter(3) >> while 1: >> limiter.wait() >> [/code] > > I believe that bit of code is polling the server for results every > three seconds. Perhaps we should insert an additional enforced three > second delay between submission of queries as well. > >> 2)Do you have any recommendation for using it that it is not >> included in the >> tutorial? Maybe listing some recommendations here would help. > > I would recommend running your own local BLAST server for any large > jobs - either the standalone blast tools, or if you have a machine on > the network that many people could share, run the WWW version locally. > > Peter The above appears to submit a single job at a time and wait 3 sec. between polling the server until the current job is finished. I don't think that is the problem indicated in the link above. The 3 sec. is for submitting new BLAST jobs, for instance if you want to submit one BLAST request after another (gathering the RIDs), then grab all the reports at once, or if you are threading 50 submission requests all at once. chris From fredgca at hotmail.com Fri Jun 27 16:18:47 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 16:18:47 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: Right, thanks for the answers. If I understood, the problem is threading the requests. 
If I am not threading my requests I am not abusing NCBI server, so don't thread them. Thanks again, Fred > >> 2)Do you have any recommendation for using it that it is not > >> included in the > >> tutorial? Maybe listing some recommendations here would help. > > > > I would recommend running your own local BLAST server for any large > > jobs - either the standalone blast tools, or if you have a machine on > > the network that many people could share, run the WWW version locally. > > > > Peter > > The above appears to submit a single job at a time and wait 3 sec. > between polling the server until the current job is finished. I don't > think that is the problem indicated in the link above. The 3 sec. is > for submitting new BLAST jobs, for instance if you want to submit one > BLAST request after another (gathering the RIDs), then grab all the > reports at once, or if you are threading 50 submission requests all at > once. > > chris _________________________________________________________________ Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS para o Messenger! ? GR?TIS! http://www.msn.com.br/emoticonpack From cjfields at uiuc.edu Fri Jun 27 17:32:31 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 12:32:31 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: <53E19130-4EAC-4DC7-A58C-883581F8B468@uiuc.edu> No, not just threading. The requests could be made by a simple script/ program of any kind with no timeout implemented; the IPs of those abusing the timeout will likely be blocked. The idea is not to spam their server (let alone any server which provides a free service) with tons of requests of any kind, be it eutils or BLAST submission requests, BLAST report retrieval requests using RID, etc. 
Any tools using these services should implement the minimum recommended delay between them. Alternatively, set up a local BLAST service as Peter recommends. chris On Jun 27, 2008, at 11:18 AM, Frederico Arnoldi wrote: > > Right, thanks for the answers. > If I understood, the problem is threading the requests. If I am not > threading my requests I am not abusing NCBI server, so don't thread > them. > Thanks again, > Fred >>>> 2)Do you have any recommendation for using it that it is not >>>> included in the >>>> tutorial? Maybe listing some recommendations here would help. >>> >>> I would recommend running your own local BLAST server for any large >>> jobs - either the standalone blast tools, or if you have a machine >>> on >>> the network that many people could share, run the WWW version >>> locally. >>> >>> Peter >> >> The above appears to submit a single job at a time and wait 3 sec. >> between polling the server until the current job is finished. I >> don't >> think that is the problem indicated in the link above. The 3 sec. is >> for submitting new BLAST jobs, for instance if you want to submit one >> BLAST request after another (gathering the RIDs), then grab all the >> reports at once, or if you are threading 50 submission requests all >> at >> once. >> >> chris > > _________________________________________________________________ > Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS > para o Messenger! ? GR?TIS! 
> http://www.msn.com.br/emoticonpack
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From sbassi at gmail.com Sat Jun 28 14:46:45 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 28 Jun 2008 11:46:45 -0300
Subject: [BioPython] one function, two behaivors
Message-ID:

If I invoke "transcribe" with a RNA sequence like this:

>>> from Bio.Seq import transcribe
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
Seq('CCGGGUU', RNAAlphabet())

But I can't "transcribe" a RNA sequence if I invoke it this way:

>>> from Bio import Transcribe
>>> transcriber = Transcribe.unambiguous_transcriber
>>> transcriber.transcribe(rna_seq)
Traceback (most recent call last):
  File "", line 1, in
    transcriber.transcribe(rna_seq)
  File "/usr/local/lib/python2.5/site-packages/Bio/Transcribe.py", line 13, in transcribe
    "transcribe has the wrong DNA alphabet"
AssertionError: transcribe has the wrong DNA alphabet

I get the same result when using "translate". What is the rationale
behind this?

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From biopython at maubp.freeserve.co.uk Sat Jun 28 15:16:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Jun 2008 16:16:13 +0100
Subject: [BioPython] one function, two behaivors
In-Reply-To:
References:
Message-ID: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>

Hi Sebastian,

As to why there are two ways, well, frankly the Bio.Transcribe and
Bio.Translate code isn't very nice to use! The Bio.Seq functions are
much simpler.
We've talked about deprecating the Bio.Transcribe and Bio.Translate
modules in favour of just Bio.Seq -- we could deprecate Bio.Transcribe
now, but there is functionality in Bio.Translate that has not been
duplicated. See also bug 2381.
http://bugzilla.open-bio.org/show_bug.cgi?id=2381

On Sat, Jun 28, 2008 at 3:46 PM, Sebastian Bassi wrote:
> If I invoke "transcribe" with a RNA sequence like this:
>
>>>> from Bio.Seq import transcribe
>>>> from Bio.Seq import Seq
>>>> import Bio.Alphabet
>>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
> Seq('CCGGGUU', RNAAlphabet())

When Michiel added this code for Biopython 1.41, originally there was no
error checking on the alphabet. For Biopython 1.44, I added a check to
prevent protein transcribing (which is clearly meaningless), and made a
note to consider also banning transcribing RNA.

Here there is at least one reason to want to do this - suppose you have
a mixed set of nucleotide sequences and want to ensure they are all RNA.

Do you think the Bio.Seq.transcribe() method should reject RNA
sequences?

Peter

From biopython at maubp.freeserve.co.uk Sat Jun 28 15:23:40 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Jun 2008 16:23:40 +0100
Subject: [BioPython] one function, two behaivors
In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
Message-ID: <320fb6e00806280823h36f3f01ema2886dca98635588@mail.gmail.com>

I wrote,
> As to why there are two ways, well, frankly the Bio.Transcribe and
> Bio.Translate code isn't very nice to use! The Bio.Seq functions are
> much simpler.

Hmm - the tutorial is still using Bio.Transcribe and Bio.Translate at
the moment. I could update the tutorial to use the Bio.Seq functions for
(back)transcription.
However, as I said in the last email, Bio.Translate still has its uses -
there is no way to do a "translate to stop" with Bio.Seq for example.

Maybe Bug 2381 should be a priority for the next release AFTER the
imminent Biopython 1.46. We can then use object methods in the tutorial,
which I personally would find much nicer to use.
http://bugzilla.open-bio.org/show_bug.cgi?id=2381

If you could have a look at the suggested changes on Bug 2381, I'd
welcome some feedback.

Peter

From sbassi at gmail.com Sat Jun 28 16:47:05 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 28 Jun 2008 13:47:05 -0300
Subject: [BioPython] one function, two behaivors
In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
Message-ID:

On Sat, Jun 28, 2008 at 12:16 PM, Peter wrote:
....
> Here there is at least one reason to want to do this - suppose you
> have a mixed set of nucleotide sequences and want to ensure they are
> all RNA.
> Do you think the Bio.Seq.transcribe() method should reject RNA
> sequences?

IMHO, it should reject RNA sequences. The case you point out (ensuring a
set of sequences are all RNA) could be done by checking the type before
applying "transcribe".

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From lueck at ipk-gatersleben.de Sun Jun 29 14:42:47 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Sun, 29 Jun 2008 16:42:47 +0200
Subject: [BioPython] Sequence from Fasta
Message-ID: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>

Hi!

Is there a way to extract only the sequence (full length) from a fasta
file?
If I try the code from page 10 in the tutorial, I get of course this:

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
SingleLetterAlphabet())

But I'm looking for something like this:

Name Sequence without linebreak

Example:

MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca
etc.

Regards
Stefanie

From biopython at maubp.freeserve.co.uk Sun Jun 29 15:19:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 29 Jun 2008 16:19:13 +0100
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>

On Sun, Jun 29, 2008 at 3:42 PM, Stefanie Lück wrote:
> Hi!
>
> Is there a way to extract only the sequence (full length) from a fasta
> file?

Yes. Based on your requirement to have name-space-sequence, how about:

handle = open(filename)
from Bio import SeqIO
for record in SeqIO.parse(handle, "fasta"):
    print "%s %s" % (record.id, record.seq)
handle.close()

> If I try the code from page 10 in the tutorial, I get of course this:
> Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
> SingleLetterAlphabet())

Which bit of the tutorial exactly? That looks like printing the repr()
of a Seq object, and Seq objects don't have names. If something could be
clarified, that's useful feedback.

Peter

From lueck at ipk-gatersleben.de Mon Jun 30 09:09:53 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 30 Jun 2008 11:09:53 +0200
Subject: [BioPython] Sequence from Fasta
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
<320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>
Message-ID: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>

Hi Peter!
I mean the biopython tutorial (16.3.2007), page 10:

>>>
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print seq_record.seq
    print len(seq_record.seq)
handle.close()
<<<

I tried your code but I still have the same problem. It doesn't show the
full sequence. Output:

1
Seq('atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACG ...',
SingleLetterAlphabet())
2
Seq('AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGG ...',
SingleLetterAlphabet())

The fasta file looks like this:

>1
atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACGCATCAGCCCACCAGCGACGACGACGACGAGGAAGACAGAGCCGcCC
>2
AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGGTACCCGTCGGTGAACCTTT

I can try with regular expressions but I first wanted to know whether
there is a way in biopython.

Regards
Stefanie

From biopython at maubp.freeserve.co.uk Mon Jun 30 09:19:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 30 Jun 2008 10:19:16 +0100
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
<320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>
<001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00806300219j54f7f43dpe0051f54be27d402@mail.gmail.com>

Which version of Biopython do you have? I'm guessing Biopython 1.44. On
older versions you have to explicitly turn the Seq into a string. Does
this work:

from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print seq_record.seq.tostring()
    print len(seq_record.seq)
handle.close()

Since Biopython 1.45, doing str(...) on a Seq object gives you the
sequence in full as a plain string. When you do a print this happens
implicitly.

Peter

P.S. For the implementation, str(object) calls the object.__str__()
method.
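The name-space-sequence output discussed above can also be illustrated with a few lines of plain Python (a minimal sketch in modern Python 3; the `fasta_records` helper below is hypothetical, and `Bio.SeqIO.parse` as Peter shows remains the supported way to read FASTA files with Biopython):

```python
def fasta_records(text):
    """Yield (name, sequence) pairs from FASTA-formatted text.

    Minimal illustration only: Bio.SeqIO.parse(handle, "fasta")
    is the robust way to do this in Biopython.
    """
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # The record name is the first word after ">".
            name, parts = line[1:].split()[0], []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


fasta = """>1
atgctcgatgcgcg
ctcgcgtccgtcg
>2
AGAAAAATCCGG
"""

for name, seq in fasta_records(fasta):
    # One "name sequence" line per record, internal line breaks removed.
    print("%s %s" % (name, seq))
```

Joining the per-line fragments with `"".join(parts)` is what removes the line breaks, which is exactly the effect Stefanie was after.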
From dalloliogm at gmail.com Mon Jun 30 09:40:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 30 Jun 2008 11:40:23 +0200
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
Message-ID: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com>

On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote:
> If I try the code from page 10 in the tutorial, I get of course this:
> Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
> SingleLetterAlphabet())

Try with seq_record.seq.data.

> But I'm looking for something like this:
>
> Name Sequence without linebreak
>
> Example:
>
> MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
> MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca

Bioperl's SeqIO has support for a 'tab sequence format' which is similar
to this [1]. Maybe it could be useful in the future to add support for
such a format in biopython.
[1] http://www.bioperl.org/wiki/Tab_sequence_format > > Regards > Stefanie > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Jun 30 10:25:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 11:25:01 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> Message-ID: <320fb6e00806300325r10c96b57qffee9ab3df81cb9e@mail.gmail.com> On Mon, Jun 30, 2008 at 10:40 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Jun 29, 2008 at 4:42 PM, Stefanie L?ck wrote: > >> If I try the code from page 10 in the tutorial, I get of course this: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA > ...', SingleLetterAlphabet()) > > Try with seq_record.seq.data. I would like to discourage using the Seq object's .data property if possible, in favour of my_seq.tostring() which will work even on very old versions of Biopython, or str(my_seq) if you are up to date. I've mooted deprecating the Seq object's .data property as part of making the Seq object more string like (Bug 2509 and Bug 2351). http://bugzilla.open-bio.org/show_bug.cgi?id=2509 http://bugzilla.open-bio.org/show_bug.cgi?id=2351 User feedback would be good, but to explain my current thinking: I'm hoping to reduce the Seq's .data to a read only property in a future release, and then in a later release start issuing a deprecation warning, before its eventual removal (Bug 2509). At some point in this process the Seq object would hopefully subclass the python string (Bug 2351). Peter
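The string-subclass direction Peter mentions for Bug 2351 can be sketched in isolation (a hypothetical toy class written for modern Python 3, not Biopython's actual Seq implementation):

```python
class MiniSeq(str):
    """Toy sequence type subclassing str.

    Hypothetical sketch of the Bug 2351 idea: because the object *is*
    a string, str(seq), slicing, and len() all work directly, with no
    need for a .data property or a .tostring() method.
    """

    def __new__(cls, data, alphabet="generic"):
        # str is immutable, so the extra attribute is attached in __new__.
        obj = str.__new__(cls, data)
        obj.alphabet = alphabet
        return obj

    def complement(self):
        # Simple DNA complement, for illustration only.
        table = str.maketrans("ACGTacgt", "TGCAtgca")
        return MiniSeq(self.translate(table), self.alphabet)


seq = MiniSeq("ACGT", alphabet="unambiguous_dna")
print(str(seq))          # the plain sequence, no .data needed
print(seq[1:3])          # slicing works like any string
print(seq.complement())  # TGCA
```

Because `str` is immutable, the alphabet has to be attached in `__new__` rather than `__init__`; that is the main wrinkle in making a sequence class "more string like" this way.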