From mjldehoon at yahoo.com Sat Jun 7 04:35:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 7 Jun 2008 01:35:05 -0700 (PDT) Subject: [BioPython] Bio.Gobase, anybody? Message-ID: <844450.31822.qm@web62415.mail.re1.yahoo.com> Hi everybody, As part of bug report 2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454, I started looking at the Bio.Gobase module. This module provides access to the gobase database: http://megasun.bch.umontreal.ca/gobase/ This module is about seven years old and (AFAICT) is not actively maintained. We don't have documentation for this module, but the unit tests suggest that it parses HTML files from gobase. I am not sure exactly where the HTML files came from, but I doubt that after seven years this still works. So I was wondering: Does anybody use Bio.Gobase? If not, I suggest we deprecate it for the next release, and remove it in some future release. If there are users, we need to make some (small) changes to this module (that is what the original bug report was about). --Michiel. From mmokrejs at ribosome.natur.cuni.cz Sat Jun 7 05:27:26 2008 From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ) Date: Sat, 07 Jun 2008 11:27:26 +0200 Subject: [BioPython] Bio.Gobase, anybody? In-Reply-To: <844450.31822.qm@web62415.mail.re1.yahoo.com> References: <844450.31822.qm@web62415.mail.re1.yahoo.com> Message-ID: <484A547E.1030909@ribosome.natur.cuni.cz> Hi, I don't use it, but it seems an interesting resource. ;-) See http://gobase.bcm.umontreal.ca/samples.html . Martin > This module is about seven years old and (AFAICT) > is not actively maintained. We don't have documentation > for this module, but the unit tests suggest that it > parses HTML files from gobase. I am not sure exactly > where the HTML files came from, but I doubt that > after seven years this still works. From cg5x6 at yahoo.com Mon Jun 9 01:21:50 2008 From: cg5x6 at yahoo.com (C. G.)
Date: Sun, 8 Jun 2008 22:21:50 -0700 (PDT) Subject: [BioPython] splice variants in GenBank/Entrez Message-ID: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Hi all, I've been using BioPython for a few projects the last two months to process BLAST results but now I need to take those results and determine which of them have known splice variants. By "known" I mean those that have annotations contained in a database that indicate they have (or are) splice variants. My thought was that Entrez would have this information (which I would then retrieve and parse with BioPython) but I can't find a consistent means of determining if an entry has splice variants. I was hoping that maybe someone on this list had some experience trying to find this information. Perhaps there is a sequence feature or a common user-defined field I could access? I'm also sending an email to NCBI requesting information but I thought I would cover my bases. Thanks in advance for any information or help you can provide. -steve From krewink at inb.uni-luebeck.de Mon Jun 9 02:58:52 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Mon, 9 Jun 2008 08:58:52 +0200 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <664146.43151.qm@web65604.mail.ac4.yahoo.com> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Message-ID: <20080609065852.GB13032@inb.uni-luebeck.de> Hi Steve, On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > > I've been using BioPython for a few projects the last > two months to process BLAST results but now I need to > take those results and determine which of them have > known splice variants. By "known" I mean those that > have annotations contained in a database that indicate > they have (or are) splice variants. Depending on which organism you are looking at, you might want to use the Ensembl genome database. 
There is no biopython interface, but you can use the jython interface from their website (at least they once had one, I didn't check if that's still the case). Otherwise you might have to use perl or java packages for that. Another good resource for this is the Alternative Splicing Database: http://www.ebi.ac.uk/asd/ Hope that helps, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/ From bsouthey at gmail.com Mon Jun 9 09:25:44 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 09 Jun 2008 08:25:44 -0500 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <20080609065852.GB13032@inb.uni-luebeck.de> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> <20080609065852.GB13032@inb.uni-luebeck.de> Message-ID: <484D2F58.6020502@gmail.com> Albert Krewinkel wrote: > Hi Steve, > > On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > >> I've been using BioPython for a few projects the last >> two months to process BLAST results but now I need to >> take those results and determine which of them have >> known splice variants. By "known" I mean those that >> have annotations contained in a database that indicate >> they have (or are) splice variants. >> > > Depending on which organism you are looking at, you might want to use > the Ensembl genome database. There is no biopython interface, but you > can use the jython interface from their website (at least they once > had one, I didn't check if that's still the case). Otherwise you > might have to use perl or java packages for that. > > Another good resource for this is the Alternative Splicing Database: > http://www.ebi.ac.uk/asd/ > > Hope that helps, > > Albert > > > The 'ALTERNATIVE PRODUCTS' section of CC lines in a UniProt (SwissProt) record can contain alternative splicing information. See for example, the manual section: **3.12.5. 
Syntax of the topic 'ALTERNATIVE PRODUCTS'** http://ca.expasy.org/sprot/userman.html#CCAP (Given below for completeness). Bruce Example of the CC lines and the corresponding FT lines for an entry with alternative splicing: CC -!- ALTERNATIVE PRODUCTS: CC Event=Alternative splicing, Alternative initiation; Named isoforms=8; CC Comment=Additional isoforms seem to exist; CC Name=1; Synonyms=Non-muscle isozyme; CC IsoId=Q15746-1; Sequence=Displayed; CC Name=2; CC IsoId=Q15746-2; Sequence=VSP_004791; CC Name=3A; CC IsoId=Q15746-3; Sequence=VSP_004792, VSP_004794; CC Name=3B; CC IsoId=Q15746-4; Sequence=VSP_004791, VSP_004792, VSP_004794; CC Name=4; CC IsoId=Q15746-5; Sequence=VSP_004792, VSP_004793; CC Name=Del-1790; CC IsoId=Q15746-6; Sequence=VSP_004795; CC Name=5; Synonyms=Smooth-muscle isozyme; CC IsoId=Q15746-7; Sequence=VSP_018845; CC Note=Produced by alternative initiation at Met-923 of isoform 1; CC Name=6; Synonyms=Telokin; CC IsoId=Q15746-8; Sequence=VSP_018846; CC Note=Produced by alternative initiation at Met-1761 of isoform CC 1. Has no catalytic activity; ... FT VAR_SEQ 1 1760 Missing (in isoform 6). FT /FTId=VSP_018846. FT VAR_SEQ 1 922 Missing (in isoform 5). FT /FTId=VSP_018845. FT VAR_SEQ 437 506 VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA FT RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in FT isoform 2 and isoform 3B). FT /FTId=VSP_004791. FT VAR_SEQ 1433 1439 DEVEVSD -> MKWRCQT (in isoform 3A, FT isoform 3B and isoform 4). FT /FTId=VSP_004792. FT VAR_SEQ 1473 1545 Missing (in isoform 4). FT /FTId=VSP_004793. FT VAR_SEQ 1655 1705 Missing (in isoform 3A and isoform 3B). FT /FTId=VSP_004794. FT VAR_SEQ 1790 1790 Missing (in isoform Del-1790). FT /FTId=VSP_004795. 
CC -!- ALTERNATIVE PRODUCTS: CC Event=Alternative splicing, Alternative initiation; Named isoforms=3; CC Comment=Isoform 1 and isoform 2 arise due to the use of two CC alternative first exons joined to a common exon 2 at the same CC acceptor site but in different reading frames, resulting in two CC completely different isoforms; CC Name=1; Synonyms=p16INK4a; CC IsoId=O77617-1; Sequence=Displayed; CC Name=3; CC IsoId=O77617-2; Sequence=VSP_018701; CC Note=Produced by alternative initiation at Met-35 of isoform 1. CC No experimental confirmation available; CC Name=2; Synonyms=p19ARF; CC IsoId=O77618-1; Sequence=External; .. FT VAR_SEQ 1 34 Missing (in isoform 3). FT /FTId=VSP_004099. From lueck at ipk-gatersleben.de Tue Jun 10 04:38:14 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Tue, 10 Jun 2008 10:38:14 +0200 Subject: [BioPython] formatdb over python code Message-ID: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Hi! Does someone know whether it's possible to make a database with formatdb (NCBI) via Python code (on Windows) and not over the console? Regards Stefanie From biopython at maubp.freeserve.co.uk Tue Jun 10 05:41:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 10:41:27 +0100 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806100241i68b24632s121324ce1c942dd9@mail.gmail.com> On Tue, Jun 10, 2008 at 9:38 AM, Stefanie Lück wrote: > Hi! > > Does someone know whether it's possible to make a database with formatdb > (NCBI) via Python code (on Windows) and not over the console? Hello Stefanie, I don't think Biopython has a wrapper for the NCBI formatdb tool, but you could construct the command line string yourself and call it with one of the standard python os functions, e.g. os.popen().
Peter From winter at biotec.tu-dresden.de Tue Jun 10 06:13:06 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 10 Jun 2008 12:13:06 +0200 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <484E53B2.5060102@biotec.tu-dresden.de> Stefanie Lück wrote, On 06/10/08 10:38: > Hi! > > Does someone know whether it's possible to make a database with formatdb > (NCBI) via Python code (on Windows) and not over the console? Here is the Python code I use for that: cmd = "formatdb -i %s -p T -o F" % database os.system(cmd) -p T specifies protein sequences, -o T creates indexes, but fails if the fasta file does not follow the defline format (see http://en.wikipedia.org/wiki/Fasta_format#Sequence_identifiers). If it fails, use -o F. Christof From mjldehoon at yahoo.com Fri Jun 13 22:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [BioPython] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for plain-text output from Rebase)? If not, I think this module should be deprecated. --Michiel.
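[Editor's note: Christof's os.system() call above can also be written with the subprocess module, which avoids shell-quoting problems with unusual file names. This is a minimal sketch, not part of Biopython; it assumes the formatdb executable is on the PATH, and the helper names are illustrative only.]

```python
import subprocess

def formatdb_command(fasta_path, protein=True, create_indexes=False):
    # Build the argument list: -p T/F selects protein vs. nucleotide,
    # -o T/F controls index creation (-o T needs NCBI-style deflines).
    return ["formatdb", "-i", fasta_path,
            "-p", "T" if protein else "F",
            "-o", "T" if create_indexes else "F"]

def run_formatdb(fasta_path, protein=True, create_indexes=False):
    # check_call raises CalledProcessError if formatdb exits non-zero,
    # so a bad defline (with -o T) is reported instead of silently ignored.
    subprocess.check_call(formatdb_command(fasta_path, protein, create_indexes))
```

Calling run_formatdb("proteins.fasta") should then build the same database files as running formatdb from the console, on Windows as well as Unix.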
From biopython at maubp.freeserve.co.uk Mon Jun 16 10:01:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Jun 2008 15:01:31 +0100 Subject: [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO Message-ID: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> I've recently had to deal with some contig files in the Ace format (output by CAP3, but many assembly programs will produce this output). We have a module for parsing Ace files in Biopython, Bio.Sequencing.Ace, but I was wondering about integrating this into the Bio.SeqIO or Bio.AlignIO framework. http://www.biopython.org/wiki/SeqIO http://www.biopython.org/wiki/AlignIO I'd like to hear from anyone currently using Ace files, on how they tend to treat the data - and if they think a SeqRecord or Alignment based representation would be useful. Each contig in an Ace file could be treated as a SeqRecord using the consensus sequence. The identifiers of each sub-sequence used to build the consensus could be stored as database cross-references, or perhaps we could store these as SeqFeatures describing which part of the consensus they support. This would then fit into Bio.SeqIO quite well. Alternatively, each contig could be treated as an alignment (with a consensus) and integrated into Bio.AlignIO. One drawback is that doing this with the current generic alignment class would require padding the start and/or end of each sequence with gaps in order to make every sequence the same length. However, if we did this (or created a more specialised alignment class), the Ace file format would then fit into Bio.AlignIO too. So, Ace users - would either (or both) of the above approaches make sense for how you use the Ace contig files?
Thanks Peter From laserson at mit.edu Tue Jun 17 14:44:08 2008 From: laserson at mit.edu (Uri Laserson) Date: Tue, 17 Jun 2008 14:44:08 -0400 Subject: [BioPython] Dependency help: libssl.so.0.9.7 Message-ID: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Hi, I am trying to use some biopython packages, and it turns out there is an error when I try to import _hashlib: >>> import _hashlib Traceback (most recent call last): File "", line 1, in ImportError: libssl.so.0.9.7: cannot open shared object file: No such file or directory I am working on unix system that is administered by a university, but I have installed my own local version of python along with biopython and all necessary packages for that. There exists a libssl.so.0.9.8 and libssl.so (a symbolic link to the former) in /usr/lib ldd _hashlib.so in my own /python/lib/python2.5/lib-dynload gives me: linux-gate.so.1 => (0xffffe000) libssl.so.0.9.7 => not found libcrypto.so.0.9.7 => not found libpthread.so.0 => /lib32/libpthread.so.0 (0xf7f67000) libc.so.6 => /lib32/libc.so.6 (0xf7e3c000) /lib/ld-linux.so.2 (0x56555000) What is the easiest way to solve this? How do I get my local (home directory) installation of python to find the libssl.so library in /usr/lib? Thanks! 
Uri -- Uri Laserson PhD Candidate, Biomedical Engineering Harvard Medical School (Genetics) Massachusetts Institute of Technology (Mathematics) phone +1 917 742 8019 laserson at mit.edu From biopython at maubp.freeserve.co.uk Wed Jun 18 05:11:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 10:11:42 +0100 Subject: [BioPython] Dependency help: libssl.so.0.9.7 In-Reply-To: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> References: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Message-ID: <320fb6e00806180211o5d505ct4099cdd4fc9e11dc@mail.gmail.com> On Tue, Jun 17, 2008 at 7:44 PM, Uri Laserson wrote: > Hi, > > I am trying to use some biopython packages, and it turns out there is an > error when I try to import _hashlib: > >>>> import _hashlib > Traceback (most recent call last): > ... Hi Uri, I'm guessing you are trying to use Bio.SeqUtils.Checksum, but did you mean "import hashlib"? See http://code.krypto.org/python/hashlib/ Peter From biopython at maubp.freeserve.co.uk Wed Jun 18 07:32:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 12:32:10 +0100 Subject: [BioPython] blastx works fine? In-Reply-To: <1131745582.4368.22.camel@osiris.biology.duke.edu> References: <1131745582.4368.22.camel@osiris.biology.duke.edu> Message-ID: <320fb6e00806180432x60ceea96o3e45f05590003e8e@mail.gmail.com> In Nov 2005, Frank Kauff wrote: > Hi all, > > qblast currently says it works only for blastp and blastn. Actually it > seems to work fine with blastx as well - xml output parses well with > NCBIXML. Or am I missing something? > > Frank Yes, using BLASTX with the Biopython XML parser does seem to work. In fact, the NCBI documentation now explicitly lists blastn, blastp, blastx, tblastn and tblastx, so I updated Biopython's qblast function to allow them too. http://www.ncbi.nlm.nih.gov/BLAST/Doc/node43.html Fixed in Bio/Blast/NCBIWWW.py revision 1.50 - better late than never?
Peter From mjldehoon at yahoo.com Thu Jun 19 09:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [BioPython] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 09:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. 
Peter From mjldehoon at yahoo.com Thu Jun 19 09:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From bsouthey at gmail.com Thu Jun 19 10:44:00 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 19 Jun 2008 09:44:00 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? 
In-Reply-To: <352888.20937.qm@web62409.mail.re1.yahoo.com> References: <352888.20937.qm@web62409.mail.re1.yahoo.com> Message-ID: <485A70B0.1010202@gmail.com> Michiel de Hoon wrote: >> I wonder if the NCBI make any of this available as XML via Entrez? I >> had a quick look and couldn't find anything. >> > > Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. > > --Michiel. > > > Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > >> records. The parser parses HTML pages from CDD's web site. Since the parser >> was written about six years ago, the CDD web site has changed considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. >> > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Do you know how the test files were created? If there is not an easy answer then it makes the decision easier. Anyhow, I vote to remove this module as, in addition to the things previously mentioned, it would be far better to support interproscan (http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool.
Bruce From cjfields at uiuc.edu Thu Jun 19 10:45:05 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 19 Jun 2008 09:45:05 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: They don't, though you can get esummary XML information (which includes description), and I believe you can use elink to grab other information (including proteins with the specified domain). chris On Jun 19, 2008, at 8:38 AM, Peter wrote: >> Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain >> Database) >> records. The parser parses HTML pages from CDD's web site. Since >> the parser >> was written about six years ago, the CDD web site has changed >> considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Thu Jun 19 12:13:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 17:13:16 +0100 Subject: [BioPython] Adding NCBI XML sequence formats to Bio.SeqIO Message-ID: <320fb6e00806190913h2f3f81bgd9d16fb0f2a740f9@mail.gmail.com> Dear all, I've realised that as a bonus from Michiel's work on Bio.Entrez, Biopython should be able to parse several of the XML sequence file formats used by the NCBI - and ideally we should be able to do this via Bio.SeqIO and get SeqRecord objects. I am thinking about adding a new module to Bio.SeqIO which will map the python list/dictionary structures from Bio.Entrez into SeqRecord object(s). What I wanted to ask the list about is which XML sequence files are of interest - and are there any strong views on which format names I should use? I've looked at BioPerl's list, since I try to re-use the same format names, but could only spot one NCBI XML file listed here: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats NCBI TinySeq XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd BioPerl call this "tinyseq", which seems like a good choice of name. http://www.bioperl.org/wiki/Tinyseq_sequence_format Also potentially of interest are: NCBI INSDSeq XML format http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd NCBI Seq-entry XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the ASN.1 variant of this file format). http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd (I haven't actually sat down and looked at the details of the implementation yet, so no promises on the timing!) Peter From sbassi at gmail.com Sun Jun 22 18:49:48 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 19:49:48 -0300 Subject: [BioPython] Secondary structure alphabet?
Message-ID: Here is the secondary structure alphabet: class SecondaryStructure(SingleLetterAlphabet) | Method resolution order: | SecondaryStructure | SingleLetterAlphabet | Alphabet | | Data and other attributes defined here: | | letters = 'HSTC' I can't find what that HSTC stands for. The closest match I found was the DSSP code: The DSSP code The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is: * H = alpha helix * B = residue in isolated beta-bridge * E = extended strand, participates in beta ladder * G = 3-helix (3/10 helix) * I = 5 helix (pi helix) * T = hydrogen bonded turn * S = bend (http://swift.cmbi.ru.nl/gv/dssp/) Does anybody know the meaning of HSTC? I am CC'ing this mail to Andrew Dalke; it seems he was the one who submitted it to Biopython. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From idoerg at gmail.com Sun Jun 22 19:03:52 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 22 Jun 2008 16:03:52 -0700 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: Probably Helix Turn Strand Coil On Sun, Jun 22, 2008 at 3:49 PM, Sebastian Bassi wrote: > Here is the secondary structure alphabet: > > class SecondaryStructure(SingleLetterAlphabet) > | Method resolution order: > | SecondaryStructure > | SingleLetterAlphabet > | Alphabet > | > | Data and other attributes defined here: > | > | letters = 'HSTC' > > I can't find what that HSTC stands for. The closest match I found was > the DSSP code: > > The DSSP code > > The output of DSSP is explained extensively under 'explanation'.
The > very short summary of the output is: > > * H = alpha helix > * B = residue in isolated beta-bridge > * E = extended strand, participates in beta ladder > * G = 3-helix (3/10 helix) > * I = 5 helix (pi helix) > * T = hydrogen bonded turn > * S = bend > > (http://swift.cmbi.ru.nl/gv/dssp/) > > Does anybody knows the meaning of HSTC? I am CC this mail to Andrew > Dalke it seems he was the one who submit it the Biopython. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From sbassi at gmail.com Sun Jun 22 19:05:13 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 20:05:13 -0300 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: On Sun, Jun 22, 2008 at 8:03 PM, Iddo Friedberg wrote: > Probably Helix Turn Strand Coil Sounds plausible. Thank you. Best, SB. 
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From jdieten at gmail.com Tue Jun 24 06:58:23 2008 From: jdieten at gmail.com (Joost van Dieten) Date: Tue, 24 Jun 2008 12:58:23 +0200 Subject: [BioPython] Blastp XML malfunction Message-ID: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> MY CODE: result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]') blast_results = result_handle.read() print result_handle result_handler = cStringIO.StringIO(blast_results) print result_handler blast_records = NCBIXML.parse(result_handler) blast_record = blast_records.next() This code doesn't seem to work anymore. I got an error that my blast_record is empty, but it worked fine 3 weeks ago. Has something changed in the NCBIXML code??? Any ideas?? Greetz, Joost Dieten From biopython at maubp.freeserve.co.uk Tue Jun 24 07:11:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Jun 2008 12:11:12 +0100 Subject: [BioPython] Blastp XML malfunction In-Reply-To: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> References: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> Message-ID: <320fb6e00806240411j1c01903cm1f40d53eb9c5ad77@mail.gmail.com> On Tue, Jun 24, 2008 at 11:58 AM, Joost van Dieten wrote: > MY CODE: > result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, > entrez_query='man[ORGN]') > blast_results = result_handle.read() > print result_handle > result_handler = cStringIO.StringIO(blast_results) > print result_handler > blast_records = NCBIXML.parse(result_handler) > blast_record = blast_records.next() You probably know this, but for anyone trying to cut-and-paste the code, it's much simpler to do this: result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]') blast_records = NCBIXML.parse(result_handle) blast_record =
blast_records.next() Joost's code is a handy way to print out the raw data before parsing it, to try and identify any problems by eye. > This code doesn't seem to work anymore. I got an error that my blast_record > is empty, but it worked fine 3 weeks ago. Has something changed in the NCBIXML > code??? Any ideas?? Yes, it's probably a recent NCBI change, which we've fixed with Bug 2499: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 If you want to just update the Blast parser, I think you need to update both NCBIXML.py and Record.py, but a complete install from CVS might be simpler. Peter From mjldehoon at yahoo.com Wed Jun 25 10:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [BioPython] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From dag at sonsorol.org Wed Jun 25 11:08:33 2008 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 25 Jun 2008 11:08:33 -0400 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython References: Message-ID: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is of concern to us.
> Mainly the BioPython suite does not appear to be written to the
> recommendations made on the main NCBI E-utilities web page
> (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).
> Principally, the following are not being done by BioPython tools:
>
> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
> not the standard NCBI Web address.
>
> * Make no more than one request every 3 seconds.
>
> In fact I recently cc'd you on an event when a user was coming in at
> over 18 requests per second. We really wish that you would alter your
> scripts to run with some sort of sleep in them in order to not send
> requests more than once per 3 seconds, and to not send these to the
> main www web servers but to use http://eutils.ncbi.nlm.nih.gov.
>
> Also, there is the problem of huge searches in order to build local
> databases. With your package it seems that if one were so inclined you
> would send a search for all human sequences (over 10,000,000 sequences)
> and your program would then retrieve these one ID at a time. Regardless
> of the fact that this is an extreme example, we would much prefer if
> your program could use the webenv from the Esearch and use the search
> history and webenv to retrieve sets of sequences 200 at a time.
>
> History: Requests utility to maintain results in user's environment.
> Used in conjunction with WebEnv.
>
> usehistory=y
>
> Web Environment: Value previously returned in XML results from ESearch
> or EPost. This value may change with each utility call. If WebEnv is
> used, History search numbers can be included in an ESummary URL, e.g.,
> term=cancer+AND+%23X (where %23 replaces # and X is the History search
> number).
>
> Note: WebEnv is similar to the cookie that is set on a user's computer
> when accessing PubMed on the web.
> If the parameter usehistory=y is included in an ESearch URL both a
> WebEnv (cookie string) and query_key (history number) values will be
> returned in the results. Rather than using the retrieved PMIDs in an
> ESummary or EFetch URL you may simply use the WebEnv and query_key
> values to retrieve the records. WebEnv will change for each ESearch
> query, but a sample URL would be as follows:
>
> http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed
> &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh
> GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D
> &query_key=6&retmode=html&rettype=medline&retmax=15
>
> WebEnv=WgHmIcDG]B etc.
>
> Display Numbers:
>
> retstart=x (x = sequential number of the first record retrieved -
> default=0, which will retrieve the first record)
> retmax=y (y = number of items retrieved)
>
> Otherwise we will end up blocking more of your users, which we are
> unfortunately already doing in some cases.
>
> Sincerely,
> Scott D. McGinnis, M.S.
> DHHS/NIH/NLM/NCBI
> www.ncbi.nlm.nih.gov

From cjfields at uiuc.edu Wed Jun 25 11:34:34 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 25 Jun 2008 10:34:34 -0500
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID:

Just as a note from the BioPerl side, BioPerl modules which access eutils use the 3 min sleep rule, and we specify in the documentation the NCBI rules. The modules also identify the tool/agent used as 'bioperl', I believe.

chris

On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>
> Can someone from the biopython dev team respond officially to Scott
> please?
>
> Regards,
> Chris
>
> Begin forwarded message:
>
>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>> Date: June 25, 2008 10:54:28 AM EDT
>> Subject: NCBI Abuse Activity with BioPython
>>
>> [...]
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From rjalves at igc.gulbenkian.pt Wed Jun 25 12:16:49 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 25 Jun 2008 17:16:49 +0100
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID: <48626F71.4020804@igc.gulbenkian.pt>

you mean 3 seconds no?

Quoting Chris Fields on 06/25/2008 04:34 PM:
> Just as a note from the BioPerl side, BioPerl modules which access
> eutils use the 3 min sleep rule, and we specify in the documentation
> the NCBI rules. The modules also identify the tool/agent used as
> 'bioperl', I believe.
>
> chris
>
> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>>
>> Can someone from the biopython dev team respond officially to Scott
>> please?
>>
>> Begin forwarded message:
>>
>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>>> Subject: NCBI Abuse Activity with BioPython
>>>
>>> [...]
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Wed Jun 25 15:00:34 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 25 Jun 2008 14:00:34 -0500
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <48626F71.4020804@igc.gulbenkian.pt>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <48626F71.4020804@igc.gulbenkian.pt>
Message-ID: <16811EA1-130D-4F47-B0B5-654E840705B9@uiuc.edu>

Yes, my bad (was in a hurry). I have heard of instances where specific users/IPs were blocked temporarily by NCBI based on spamming, so it's best to be proactive.

chris

On Jun 25, 2008, at 11:16 AM, Renato Alves wrote:
> you mean 3 seconds no?
>
> Quoting Chris Fields on 06/25/2008 04:34 PM:
>> Just as a note from the BioPerl side, BioPerl modules which access
>> eutils use the 3 min sleep rule, and we specify in the documentation
>> the NCBI rules. The modules also identify the tool/agent used as
>> 'bioperl', I believe.
>>
>> chris
>>
>> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote:
>>> Can someone from the biopython dev team respond officially to
>>> Scott please?
>>>
>>> Begin forwarded message:
>>>
>>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]"
>>>> Subject: NCBI Abuse Activity with BioPython
>>>>
>>>> [...]
>>>>
>>>> Display Numbers:
>>>>
>>>> retstart=x (x = sequential number of the first record retrieved -
>>>> default=0, which will retrieve the first record)
>>>> retmax=y (y = number of items retrieved)
>>>>
>>>> Otherwise we will end up blocking more of your users, which we are
>>>> unfortunately already doing in some cases.
>>>>
>>>> Sincerely,
>>>> Scott D. McGinnis, M.S.
>>>> DHHS/NIH/NLM/NCBI
>>>> www.ncbi.nlm.nih.gov
>>>
>>> _______________________________________________
>>> BioPython mailing list - BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Marie-Claude Hofmann
>> College of Veterinary Medicine
>> University of Illinois Urbana-Champaign

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From dalke at dalkescientific.com Wed Jun 25 21:15:50 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 26 Jun 2008 03:15:50 +0200
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID:

Hi Chris,

I'm no longer part of the Biopython dev team, but I read at least the subject line on the mailing list.

I wrote the Biopython EUtils package around December 2002 and, according to the CVS logs, it was added to Biopython in June 2003, so more than 5 years ago. Looking at the commit logs there haven't been any changes to the relevant code since 2004, and that was a minor patch.

I thought I put a rate limiter into the code, but looking at it now I see I didn't.
The documentation clearly states that users must follow NCBI's recommendations, but who actually reads documentation?

>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
>> not the standard NCBI Web address.

That change was announced on May 21, 2003, and most likely no one on the Biopython dev group tracks the EUtils mailing list. It was also after I wrote the code, but to be fair I was subscribed to the utilities list at the time and should have caught the change. I think the correct fix is to this code in ThinClient.py:

    def __init__(self, opener = None, tool = TOOL, email = EMAIL,
                 baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):

Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I have not tested this.

>> * Make no more than one request every 3 seconds.

There are a couple of points here. The quickest and most direct way to force/fix the code is to change the "def _get()" in ThinClient.py. The current code is:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

Here's one possible fix: add the following two lines at module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        global _prev_time
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        # Follow NCBI's 3 second restriction; sleep for the remainder
        # of the 3 second window since the previous request.
        if time.time() - _prev_time < 3:
            time.sleep(3 - (time.time() - _prev_time))
        _prev_time = time.time()
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

(I recall that I had something like that, and it made my unit tests - which I did during the off hours - interminable.)

When I wrote this module I think I assumed that whoever would use the library would use the code correctly. Using it correctly means a few things:

 - obey the restrictions set by NCBI
 - change the 'tool' and 'email' settings, so NCBI complains to the
   right person. (The default is to say 'EUtils_Python_client' and
   'biopython-dev at biopython.org')

This isn't happening. The patch above force-fixes the first. Should Biopython do a better job of the second? It's not easy to figure out the correct email. I couldn't then and can't now think of a better solution. Perhaps use the result of getpass.getuser()? But that doesn't get the rest of the domain for a proper email. Though NCBI should be able to guess the site from the IP address.

The reason I made this assumption is that I meant EUtils to be used by conscientious developers. I've since learned that that's seldom the case, and because it was imported into Biopython it's been exposed to a wider audience.

>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the Esearch and use the search
>> history and webenv to retrieve sets of sequences 200 at a time.
It does exactly that. There's an entire interface for handling search history - and it took some non-trivial work and questions to NCBI to get things working right. Rather, there are two layers. One is for the low-level protocol ("ThinClient") that EUtils offers, and another wraps around the history mechanism ("HistoryClient").

    >>> from Bio import EUtils
    >>> from Bio.EUtils import HistoryClient
    >>> client = HistoryClient.HistoryClient()
    >>> result = client.search("polio AND picornavirus")
    >>> len(result)
    3437
    >>> f = result.efetch()
    >>> print f.read(1000)
    [first 1000 bytes of PubMed XML; the fragment shows PMID 18540199,
    a record created 2008-06-10, ISSN 0041-3771, volume 50, issue 2,
    year 2008, the journal Tsitologiia, and the start of an article
    title about the family Picornaviridae]

and there's a way to populate the history with a list of records, then fetch those records in a block:

    >>> result = client.from_dbids(EUtils.DBIds("pubmed",
    ...     ["100","200","300","400","500"]))
    >>> f = result.efetch("text", "brief")
    >>> print f.read()
    1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
    2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
    3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
    4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
    5: Nourse ES. The regional workshops on pri...[PMID: 500]

If I had to guess, likely more people find the ThinClient code easier to understand, because the NCBI interface has a simple way to get the result for a single record, without using the history interface. The NCBI interface doesn't guide people to the right way to use it effectively.

I started working on an update to EUtils which improved the API to include a few helper functions, like "EUtils.search()" instead of having to create a HistoryClient. That might help guide people to using it better. I wrote up something about it a few years ago: http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html

But a problem in completing that is that I never got any sort of funding or user feedback on how people were using the software, and as I moved over to chemistry it became lower and lower on my list. That's still the problem with me working on this again.

I don't know about this next point, but there might also be a lack of documentation on how to use the Biopython interface effectively? The NCBI documentation isn't meant for non-programmers (it's more of a bytes-on-the-wire document), so perhaps people are pattern matching on what looks right and going with what works, vs. what works well. Then because there was no 3 second limit, they had no incentive to find a better/faster solution.
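[Editor's note: the batched, history-based retrieval pattern described in this thread - run one usehistory=y ESearch, then page through the stored results with retstart/retmax - can be sketched with the Python standard library alone. The helper names below (batch_windows, efetch_url) and the default tool/email values are illustrative, not part of Biopython or EUtils; only the URL parameters come from the E-utilities documentation quoted above.]

```python
from urllib.parse import urlencode

# Base URL per NCBI's request: use eutils.ncbi.nlm.nih.gov,
# not the main www web servers.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def batch_windows(total, batch_size=200):
    """Yield (retstart, retmax) pairs covering `total` records in batches."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

def efetch_url(db, webenv, query_key, retstart, retmax,
               tool="example-tool", email="user@example.org"):
    """Build an EFetch URL that reuses a stored WebEnv/query_key pair.

    The tool and email parameters identify the client to NCBI,
    as the guidelines ask; the defaults here are placeholders.
    """
    params = {
        "db": db,
        "WebEnv": webenv,        # from a usehistory=y ESearch
        "query_key": query_key,  # history number from the same search
        "retstart": retstart,
        "retmax": retmax,
        "tool": tool,
        "email": email,
    }
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)
```

A client would loop over batch_windows(count) for the record count returned by ESearch, sleeping at least three seconds between successive requests, rather than fetching millions of IDs one at a time.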
Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 07:21:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 12:21:57 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> On Thu, Jun 26, 2008 at 2:15 AM, Andrew Dalke wrote: > Hi Chris, > > I'm no longer part of the Biopython dev team, but I read at least the > subject line on the mailing list. > > I wrote the Biopython EUtils package around December 2002 and according to > the CVS logs it was added to Biopython in June 2003, so more then 5 years > ago. Looking at the commit logs there haven't been any change to the > relevant code since 2004, and that was a minor patch. > > I thought I put a rate limiter into the code, but looking at it now I see I > didn't. The documentation clearly states that users must follow NCBI's > recommendations, but who actually reads documentation? > > There's a couple of points here. The quickest and most direct way to > force/fix the code is to change the "def _get()" in ThinClient.py . ... I've updated Bio/EUtils/ThinClient.py in CVS based on your suggested change, and checked the unit tests test_EUtils.py and test_SeqIO_online.py (which calls Bio.EUtils via Bio.GenBank). Looking over the code, should this wait also be done for the ThinClient's epost() method as well? > When I wrote this module I think I assumed that whoever would use the > library would use the code correctly. Using it correctly means a few > things: > - obey the restrictions set by NCBI > - change the 'tool' and 'email' settings, so NCBI complains the right > person. > (The default is to say 'EUtils_Python_client' and > 'biopython-dev at biopython.org') > > This isn't happening. The patch above force-fixes the first. Should > Biopython do a better job of the second? 
It's not easy to figure out the > correct email. I couldn't then and can't now think of a better solution. > Perhaps use the result of getpass.getuser()? But that doesn't get the rest > of the domain for a proper email. Though NCBI should be able to guess the > site from the IP address. Figuring out the user's email address is tricky, especially cross platform. Perhaps we should update the Bio.EUtils and Bio.Entrez documentation to recommend the user set their email address here, and if they are wrapping Biopython in part of a larger tool (e.g. a webservice) to set the tool name too. > If I had to guess, likely more people find the ThinClient code easier to > understand, because the NCBI interface has a simple way to get the result > for a single record, without using the history interface. The NCBI > interface doesn't guide people to the right way to use it effectively. I would agree with you. I would go further, and say for a new user even the ThinClient is a bit scary, and that the wrapper functions in Bio.GenBank are nicer to use. > I started working on an update to EUtils which improved the API to include a > few helper functions, like "EUtils.search()" instead of having to create a > HistoryClient. That might help guide people to using it better. I wrote up > something about it a few years ago: > http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html > > But a problem in completing that is that I never got any sort of funding or > user feedback on how people were using the software, and as I moved over to > chemistry it became lower and lower on my list. That's still the problem > with me working on this again. This complexity is also daunting for anyone else considering taking over the Bio.EUtils code base. > I don't know about this next point, but there might also be a lack of > documentation on how to use the Biopython interface effectively? 
The NCBI > documentation isn't mean for non-programmers (it's more of a > bytes-on-the-wire document) so perhaps people are pattern matching on what > looks right and going with what works, vs. what works well. Then because > there was no 3 second limit, they had no incentive to find a better/faster > solution. That would explain how the unnamed user ended up making over 18 requests per second! I confess I had assumed that things like the Bio.GenBank wrappers would be respecting the 3 second rule (at least they should do now). Peter From mjldehoon at yahoo.com Thu Jun 26 07:48:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:48:09 -0700 (PDT) Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <53670.7764.qm@web62412.mail.re1.yahoo.com> Dear Chris, Sorry for the trouble. We are now discussing on the Biopython mailing list how to fix this issue. I will write a reply to Scott shortly. Best, --Michiel. --- On Wed, 6/25/08, Chris Dagdigian wrote: From: Chris Dagdigian Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython To: biopython at lists.open-bio.org Date: Wednesday, June 25, 2008, 11:08 AM Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is us > concern. > Mainly the BioPython suite does not appear to be written to the > recommendations made on the main NCBI E-utilities web page > (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).Pr > inciply the following are not being done by BioPython tools. 
> > > > * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov > , not the standard NCBI Web address. > > * Make no more than one request every 3 seconds. > > > > In fact I recently cc'd you on an event when a user was coming in at > over 18 requests per second. We really wish that you would alter you > scripts to run with a some sort of sleep in it in order to not send > requests more than once per 3 seconds and to not send these to the > main > www web servers but use the http://eutils.ncbi.nlm.nih.gov > . > > > > Also, there is the problem of huge searches in order to build local > databases. With you package it seems that if one were so inclined you > would send a search for all human sequences (over 10,000,000 > sequences) > and you program would then retrieve these one ID at a time. Regardless > of the fact that this is an extreme example, we would much prefer if > your program could webenv from the Esearch and use the search > history > and webenv to retrieve sets of sequences at 200 - 200 at a time. > > > > History: Requests utility to maintain results in user's environment. > Used in conjunction with WebEnv. > > usehistory=y > > Web Environment: Value previously returned in XML results from ESearch > or EPost. This value may change with each utility call. If WebEnv is > used, History search numbers can be included in an ESummary URL, e.g., > term=cancer+AND+%23X (where %23 replaces # and X is the History search > number). > > Note: WebEnv is similar to the cookie that is set on a user's > computers > when accessing PubMed on the web. If the parameter usehistory=y is > included in an ESearch URL both a WebEnv (cookie string) and query_key > (history number) values will be returned in the results. Rather than > using the retrieved PMIDs in an ESummary or EFetch URL you may simply > use the WebEnv and query_key values to retrieve the records. 
WebEnv > will > change for each ESearch query, but a sample URL would be as follows: > > http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed > &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh > GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D > &query_key=6&retmode=html&rettype=medline&retmax=15 > > WebEnv=WgHmIcDG]B etc. > > Display Numbers: > > retstart=x (x= sequential number of the first record retrieved - > default=0 which will retrieve the first record) > retmax=y (y= number of items retrieved) > > > > Otherwise we will end up blocking more of your users which we are > unfortunately already doing in some cases. > > > > Sincerely, > Scott D. McGinnis, M.S. > DHHS/NIH/NLM/NCBI > www.ncbi.nlm.nih.gov > > > _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Thu Jun 26 10:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [BioPython] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2. It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module? 
--Michiel From binbin.liu at umb.no Thu Jun 26 11:35:46 2008 From: binbin.liu at umb.no (binbin) Date: Thu, 26 Jun 2008 17:35:46 +0200 Subject: [BioPython] Entrez Message-ID: <1214494546.6215.3.camel@ubuntu> Hei, Am using biopython 1.45, my problem is as follows:

>>> from Bio import GenBank
>>> from Bio import Entrez
Traceback (most recent call last):
  File "", line 1, in
ImportError: cannot import name Entrez

I could not import Entrez. Was it deleted from Bio? From biopython at maubp.freeserve.co.uk Thu Jun 26 11:57:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 16:57:47 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why you are trying to do "from Bio import Entrez"? 
Peter From winter at biotec.tu-dresden.de Thu Jun 26 11:53:23 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Thu, 26 Jun 2008 17:53:23 +0200 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <4863BB73.2020509@biotec.tu-dresden.de> binbin wrote, On 06/26/08 17:35: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Import works fine for me, so I don't think it has been deleted. With my Linux installation, I can do locate Entrez which finds /var/lib/python-support/python2.5/Bio/Entrez HTH, Christof From biopython at maubp.freeserve.co.uk Thu Jun 26 12:12:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:12:53 +0100 Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <320fb6e00806260912j3395d2c0s3d7bbb7227f84421@mail.gmail.com> > Hello binbin, > > A long long time ago there was a Bio.Entrez module which was deleted in 2000. > > We are going to re-introduce a Bio.Entrez module in Biopython 1.46 > (hopefully out next month?), which will replace Bio.WWW.NCBI. If you > want to try this out now, please install the latest CVS version of > Biopython from source. Sorry - I've confused myself as the Bio.Entrez module has been under revision recently. From the user's point of view Biopython 1.46 will add an XML parser, but otherwise Bio.Entrez should be there in Biopython 1.45. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 16:19:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:19:31 +0100 Subject: [BioPython] Removing the unit test GUI? Message-ID: <320fb6e00806261319w5be098d1y48404f3f93934fa3@mail.gmail.com> Hello all, I wanted to do a quick survey of opinion about the Biopython test suite and its interface. Those of you who have ever installed Biopython from source may have tried running the unit tests too. You do this by changing to the Tests subdirectory, and then running the run_tests.py script. Currently by default this will show a GUI. However, from the developer's point of view the unit tests are almost always run at the command line with: python run_tests.py --no-gui It would let us simplify the test harness if we got rid of the GUI, and it would make life very slightly easier for people running the tests at the command line. But would anyone be upset at the loss of the test GUI? So - have any of you ever run the unit tests? Did you use the GUI or the command line? Would you prefer the GUI to remain? Thanks Peter P.S. See also bug 2525 http://bugzilla.open-bio.org/show_bug.cgi?id=2525 From mjldehoon at yahoo.com Thu Jun 26 18:24:41 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 15:24:41 -0700 (PDT) Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <987374.9439.qm@web62409.mail.re1.yahoo.com> Bio.Entrez was reintroduced in release 1.45 already (though without the parser), so binbin should be able to find it. --Michiel. 
--- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [BioPython] Entrez To: "binbin" Cc: biopython at biopython.org Date: Thursday, June 26, 2008, 11:57 AM On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why are you trying to do "from Bio import Entrez"? Peter _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 27 07:16:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 12:16:12 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214562160.6026.2.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> Message-ID: <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote: > thank you for answering, i am a beginner of biopython,in the "Biopython > Tutorial and Cookbook": > 2.5 Connecting with biological databases: > this is found > "from Bio import Entrez" > > i tried this but it did work for me, that is why i asked. That should have worked if your installation of Biopython 1.45 was successful. We may be able to work out what is wrong. What operating system are you using, which version of python, and how did you install Biopython? 
Regards, Peter From fredgca at hotmail.com Fri Jun 27 09:19:04 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 13:19:04 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython Message-ID: Guys (sorry for the informality), I have followed the discussion about "NCBI Abuse Activity with BioPython". I have to confess that I followed it superficially, since I was not able to understand everything you said. So I am going to ask some questions about it: 1) I believe that using BLAST with NCBIWWW.qblast is included in "Abuse Activity". Right? I am asking because sometimes I use it. The recommendation of NCBI is "Make no more than one request every 3 seconds.". Biopython code does not assure it with the following code in NCBIWWW.py, line 779:

[code]
limiter = RequestLimiter(3)
while 1:
    limiter.wait()
[/code]

2) Do you have any recommendation for using it that is not included in the tutorial? Maybe listing some recommendations here would help. Sorry if I have asked something obvious. Thanks, Fred _________________________________________________________________ Conheça o Windows Live Spaces, a rede de relacionamentos do Messenger! http://www.amigosdomessenger.com.br/ From biopython at maubp.freeserve.co.uk Fri Jun 27 09:57:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 14:57:49 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: Message-ID: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi wrote: > > Guys (sorry the informality), > > I have followed the discussion about "NCBI Abuse Activity with BioPython". I > have to confess that followed it superficially, since I am not able to understand > everything you said. So, I am going to make some questions about it: > > 1)I believe that using BLAST with NCBIWWW.qblast is included in "Abuse Activity". Right? 
I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. > I am asking because sometimes I use it. The recommendation of NCBI is > "Make no more than one request every 3 seconds.". True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > Biopython code does not assure it with the following code in NCBIWWW.py, > line 779: > [code] > limiter = RequestLimiter(3) > while 1: > limiter.wait() > [/code] I believe that bit of code is polling the server for results every three seconds. Perhaps we should insert an additional enforced three second delay between submission of queries as well. > 2)Do you have any recommendation for using it that it is not included in the > tutorial? Maybe listing some recommendations here would help. I would recommend running your own local BLAST server for any large jobs - either the standalone blast tools, or if you have a machine on the network that many people could share, run the WWW version locally. Peter From cjfields at uiuc.edu Fri Jun 27 11:51:12 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 10:51:12 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> Message-ID: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> On Jun 27, 2008, at 8:57 AM, Peter wrote: > On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi > wrote: >> >> Guys (sorry the informality), >> >> I have followed the discussion about "NCBI Abuse Activity with >> BioPython". I >> have to confess that followed it superficially, since I am not able >> to understand >> everything you said. So, I am going to make some questions about it: >> >> 1)I believe that using BLAST with NCBIWWW.qblast is included in >> "Abuse Activity". Right? > > I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. 
Similar policy though, for the same reasons they insist on a delay for E-utils. >> I am asking because sometimes I use it. The recommendation of NCBI is >> "Make no more than one request every 3 seconds.". > > True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > >> Biopython code does not assure it with the following code in >> NCBIWWW.py, >> line 779: >> [code] >> limiter = RequestLimiter(3) >> while 1: >> limiter.wait() >> [/code] > > I believe that bit of code is polling the server for results every > three seconds. Perhaps we should insert an additional enforced three > second delay between submission of queries as well. > >> 2)Do you have any recommendation for using it that it is not >> included in the >> tutorial? Maybe listing some recommendations here would help. > > I would recommend running your own local BLAST server for any large > jobs - either the standalone blast tools, or if you have a machine on > the network that many people could share, run the WWW version locally. > > Peter The above appears to submit a single job at a time and wait 3 sec. between polling the server until the current job is finished. I don't think that is the problem indicated in the link above. The 3 sec. is for submitting new BLAST jobs, for instance if you want to submit one BLAST request after another (gathering the RIDs), then grab all the reports at once, or if you are threading 50 submission requests all at once. chris From fredgca at hotmail.com Fri Jun 27 12:18:47 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 16:18:47 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: Right, thanks for the answers. If I understood, the problem is threading the requests. 
If I am not threading my requests I am not abusing the NCBI server, so I won't thread them. Thanks again, Fred > >> 2)Do you have any recommendation for using it that it is not > >> included in the > >> tutorial? Maybe listing some recommendations here would help. > > > > I would recommend running your own local BLAST server for any large > > jobs - either the standalone blast tools, or if you have a machine on > > the network that many people could share, run the WWW version locally. > > > > Peter > > The above appears to submit a single job at a time and wait 3 sec. > between polling the server until the current job is finished. I don't > think that is the problem indicated in the link above. The 3 sec. is > for submitting new BLAST jobs, for instance if you want to submit one > BLAST request after another (gathering the RIDs), then grab all the > reports at once, or if you are threading 50 submission requests all at > once. > > chris _________________________________________________________________ Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS para o Messenger! É GRÁTIS! http://www.msn.com.br/emoticonpack From cjfields at uiuc.edu Fri Jun 27 13:32:31 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 12:32:31 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: <53E19130-4EAC-4DC7-A58C-883581F8B468@uiuc.edu> No, not just threading. The requests could be made by a simple script/program of any kind with no timeout implemented; the IPs of those abusing the timeout will likely be blocked. The idea is not to spam their server (let alone any server which provides a free service) with tons of requests of any kind, be it eutils or BLAST submission requests, BLAST report retrieval requests using RID, etc. 
Any tools using these services should implement the minimum recommended delay between them. Alternatively, set up a local BLAST service as Peter recommends. chris On Jun 27, 2008, at 11:18 AM, Frederico Arnoldi wrote: > > Right, thanks for the answers. > If I understood, the problem is threading the requests. If I am not > threading my requests I am not abusing NCBI server, so don't thread > them. > Thanks again, > Fred >>>> 2)Do you have any recommendation for using it that it is not >>>> included in the >>>> tutorial? Maybe listing some recommendations here would help. >>> >>> I would recommend running your own local BLAST server for any large >>> jobs - either the standalone blast tools, or if you have a machine >>> on >>> the network that many people could share, run the WWW version >>> locally. >>> >>> Peter >> >> The above appears to submit a single job at a time and wait 3 sec. >> between polling the server until the current job is finished. I >> don't >> think that is the problem indicated in the link above. The 3 sec. is >> for submitting new BLAST jobs, for instance if you want to submit one >> BLAST request after another (gathering the RIDs), then grab all the >> reports at once, or if you are threading 50 submission requests all >> at >> once. >> >> chris > > _________________________________________________________________ > Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS > para o Messenger! É GRÁTIS! 
> http://www.msn.com.br/emoticonpack > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From sbassi at gmail.com Sat Jun 28 10:46:45 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 28 Jun 2008 11:46:45 -0300 Subject: [BioPython] one function, two behaviors Message-ID: If I invoke "transcribe" with a RNA sequence like this:

>>> from Bio.Seq import transcribe
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
Seq('CCGGGUU', RNAAlphabet())

But I can't "transcribe" a RNA sequence if I invoke it this way:

>>> from Bio import Transcribe
>>> transcriber = Transcribe.unambiguous_transcriber
>>> transcriber.transcribe(rna_seq)
Traceback (most recent call last):
  File "", line 1, in
    transcriber.transcribe(rna_seq)
  File "/usr/local/lib/python2.5/site-packages/Bio/Transcribe.py", line 13, in transcribe
    "transcribe has the wrong DNA alphabet"
AssertionError: transcribe has the wrong DNA alphabet

I get the same result when using "translate". What is the rationale behind this? -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From biopython at maubp.freeserve.co.uk Sat Jun 28 11:16:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 28 Jun 2008 16:16:13 +0100 Subject: [BioPython] one function, two behaviors In-Reply-To: References: Message-ID: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Hi Sebastian, As to why there are two ways, well, frankly the Bio.Transcribe and Bio.Translate code isn't very nice to use! The Bio.Seq functions are much simpler. 
We've talked about deprecating the Bio.Transcribe and Bio.Translate modules in favour of just Bio.Seq -- we could deprecate Bio.Transcribe now, but there is functionality in Bio.Translate that has not been duplicated. See also bug 2381. http://bugzilla.open-bio.org/show_bug.cgi?id=2381 On Sat, Jun 28, 2008 at 3:46 PM, Sebastian Bassi wrote: > If I invoke "transcribe" with a RNA sequence like this: > >>>> from Bio.Seq import transcribe >>>> from Bio.Seq import Seq >>>> import Bio.Alphabet >>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna) >>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA" > Seq('CCGGGUU', RNAAlphabet()) When Michiel added this code for Biopython 1.41, originally there was no error checking on the alphabet. For Biopython 1.44, I added a check to prevent protein transcribing (which is clearly meaningless), and made a note to consider also banning transcribing RNA. Here there is at least one reason to want to do this - suppose you have a mixed set of nucleotide sequences and want to ensure they are all RNA. Do you think the Bio.Seq.transcribe() method should reject RNA sequences? Peter From biopython at maubp.freeserve.co.uk Sat Jun 28 11:23:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 28 Jun 2008 16:23:40 +0100 Subject: [BioPython] one function, two behaviors In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Message-ID: <320fb6e00806280823h36f3f01ema2886dca98635588@mail.gmail.com> I wrote, > As to why there are two ways, well, frankly the Bio.Transcribe and > Bio.Translate code isn't very nice to use! The Bio.Seq functions are > much simpler. Hmm - the tutorial is still using Bio.Transcribe and Bio.Translate at the moment. I could update the tutorial to use the Bio.Seq functions for (back)transcription. 
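At the string level, (back)transcription is just the T/U substitution, so the question being debated above amounts to a one-line check. A plain-Python sketch, including the proposed reject-RNA check — this is illustrative, not Biopython's actual implementation:

```python
def transcribe(seq):
    """DNA -> RNA: replace T with U (sketch, not Bio.Seq's real code)."""
    # The check being debated: refuse input that already looks like RNA.
    if "U" in seq.upper():
        raise ValueError("sequence already contains U - looks like RNA")
    return seq.replace("T", "U").replace("t", "u")

def back_transcribe(seq):
    """RNA -> DNA: replace U with T (sketch)."""
    return seq.replace("U", "T").replace("u", "t")

assert transcribe("ACGTacgt") == "ACGUacgu"
assert back_transcribe("ACGUacgu") == "ACGTacgt"
```

With the check in place, transcribing 'CCGGGUU' would raise a ValueError rather than returning the input unchanged.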
However, as I said in the last email, Bio.Translate still has its uses - there is no way to do a "translate to stop" with Bio.Seq for example. Maybe Bug 2381 should be a priority for the next release AFTER the imminent Biopython 1.46. We can then use object methods in the tutorial, which I personally would find much nicer to use. http://bugzilla.open-bio.org/show_bug.cgi?id=2381 If you could have a look at the suggested changes on Bug 2381, I'd welcome some feedback. Peter From sbassi at gmail.com Sat Jun 28 12:47:05 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 28 Jun 2008 13:47:05 -0300 Subject: [BioPython] one function, two behaviors In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com> Message-ID: On Sat, Jun 28, 2008 at 12:16 PM, Peter wrote: .... > Here there is at least one reason to want to do this - suppose you > have a mixed set of nucleotide sequences and want to ensure they are > all RNA. > Do you think the Bio.Seq.transcibe() method should reject RNA sequences? IMHO, it should reject RNA sequences. The case you point out (ensure a set of sequences are all RNA) could be done by checking the type before applying "transcribe". -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From lueck at ipk-gatersleben.de Sun Jun 29 10:42:47 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Sun, 29 Jun 2008 16:42:47 +0200 Subject: [BioPython] Sequence from Fasta Message-ID: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Hi! Is there a way to extract only the sequence (full length) from a fasta file? 
If I try the code from page 10 in the tutorial, I get of course this:

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet())

But I'm looking for something like this:

Name Sequence without linebreak

Example:

MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca

etc. Regards Stefanie From biopython at maubp.freeserve.co.uk Sun Jun 29 11:19:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:19:13 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> On Sun, Jun 29, 2008 at 3:42 PM, Stefanie Lück wrote: > Hi! > > Is there a way to extract only the sequence (full length) from a fasta file? Yes. Based on your requirement to have name-space-sequence, how about:

handle = open(filename)
from Bio import SeqIO
for record in SeqIO.parse(handle, "fasta") :
    print "%s %s" % (record.id, record.seq)
handle.close()

> If I try the code from page 10 in the tutorial, I get of course this: > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet()) Which bit of the tutorial exactly? That looks like printing the repr() of a Seq object, and Seq objects don't have names. If something could be clarified that's useful feedback. Peter From lueck at ipk-gatersleben.de Mon Jun 30 05:09:53 2008 From: lueck at ipk-gatersleben.de (Stefanie Lück) Date: Mon, 30 Jun 2008 11:09:53 +0200 Subject: [BioPython] Sequence from Fasta References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> Message-ID: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> Hi Peter! 
I mean the biopython tutorial (16.3.2007), page 10:

>>>
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print seq_record.seq
    print len(seq_record.seq)
handle.close()
<<<

I tried your code but I still have the same problem. It doesn't show the full sequence. Output:

1
Seq('atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACG ...', SingleLetterAlphabet())
2
Seq('AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGG ...', SingleLetterAlphabet())

Fasta File looks like this:

>1
atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACGCATCAGCCCACCAGCGACGACGACGACGAGGAAGACAGAGCCGcCC
>2
AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGGTACCCGTCGGTGAACCTTT

I can try with regular expressions but I first wanted to know whether there is a way in biopython. Regards Stefanie From biopython at maubp.freeserve.co.uk Mon Jun 30 05:19:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 10:19:16 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com> <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806300219j54f7f43dpe0051f54be27d402@mail.gmail.com> Which version of Biopython do you have? I'm guessing Biopython 1.44. On older versions you would have to explicitly turn the Seq into a string. Does this work:

from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print seq_record.seq.tostring()
    print len(seq_record.seq)
handle.close()

Since Biopython 1.45, doing str(...) on a Seq object gives you the sequence in full as a plain string. When you do a print this happens implicitly. Peter P.S. For the implementation, str(object) calls the object.__str__() method. 
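The difference Peter describes — the full sequence from str() versus the truncated repr-style output on older versions — can be mimicked with a minimal stand-in class. This is an illustration of __str__ versus a tostring() method, not Biopython's actual Seq implementation:

```python
class MiniSeq:
    """Toy stand-in for Bio.Seq.Seq (illustrative only)."""
    def __init__(self, data):
        self.data = data

    def tostring(self):
        # The long-standing accessor: works on old and new versions alike.
        return self.data

    def __str__(self):
        # Biopython >= 1.45 behaviour: str(seq) is the full plain string,
        # so "print seq" shows the whole sequence.
        return self.data

    def __repr__(self):
        # The truncated, alphabet-style display seen in the output above.
        return "Seq(%r ...)" % self.data[:60]

s = MiniSeq("CGTAACAAGGTTTCCGTAGGTGAA")
assert str(s) == s.tostring() == "CGTAACAAGGTTTCCGTAGGTGAA"
```

On a pre-1.45 Seq, only the tostring() route reliably yields the full plain string.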
From dalloliogm at gmail.com Mon Jun 30 05:40:23 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 30 Jun 2008 11:40:23 +0200 Subject: [BioPython] Sequence from Fasta In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> Message-ID: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote: > If I try the code from page 10 in the tutorial, I get of course this: Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...', SingleLetterAlphabet()) Try with seq_record.seq.data. > But I'm looking for something like this: > > Name Sequence without linebreak > > Example: > > MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg > MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca Bioperl's SeqIO has support for a 'tab sequence format' which is similar to this[1]. Maybe it could be useful in the future to add support for such a format in biopython. 
[1] http://www.bioperl.org/wiki/Tab_sequence_format > > Regards > Stefanie > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (Italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Jun 30 06:25:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 11:25:01 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> Message-ID: <320fb6e00806300325r10c96b57qffee9ab3df81cb9e@mail.gmail.com> On Mon, Jun 30, 2008 at 10:40 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote: > >> If I try the code from page 10 in the tutorial, I get of course this: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA > ...', SingleLetterAlphabet()) > > Try with seq_record.seq.data. I would like to discourage using the Seq object's .data property if possible, in favour of my_seq.tostring() which will work even on very old versions of Biopython, or str(my_seq) if you are up to date. I've mooted deprecating the Seq object's .data property as part of making the Seq object more string-like (Bug 2509 and Bug 2351). http://bugzilla.open-bio.org/show_bug.cgi?id=2509 http://bugzilla.open-bio.org/show_bug.cgi?id=2351 User feedback would be good, but to explain my current thinking: I'm hoping to reduce the Seq's .data to a read-only property in a future release, and then in a later release start issuing a deprecation warning, before its eventual removal (Bug 2509). At some point in this process the Seq object would hopefully subclass the python string (Bug 2351). 
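The direction sketched in those bug reports — a string-like Seq with a warning, read-only .data shim — could look roughly like this. Hypothetical code, not what Biopython ships:

```python
import warnings

class StringSeq(str):
    """Hypothetical string-subclassing Seq (the Bug 2351 direction)."""

    @property
    def data(self):
        # Read-only shim for old-style .data access (the Bug 2509 plan):
        # keep it working for now, but warn ahead of eventual removal.
        warnings.warn("Seq.data is deprecated; use str(my_seq) instead",
                      DeprecationWarning, stacklevel=2)
        return str(self)

s = StringSeq("ATGC")
assert s == "ATGC"          # behaves as a plain string
assert s.lower() == "atgc"  # inherits all str methods for free
```

Subclassing str would make slicing, comparison, and dictionary keys "just work" for Seq objects, at the cost of losing alphabet information on operations that return plain strings.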
Peter From krewink at inb.uni-luebeck.de Mon Jun 9 06:58:52 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Mon, 9 Jun 2008 08:58:52 +0200 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <664146.43151.qm@web65604.mail.ac4.yahoo.com> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> Message-ID: <20080609065852.GB13032@inb.uni-luebeck.de> Hi Steve, On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > > I've been using BioPython for a few projects the last > two months to process BLAST results but now I need to > take those results and determine which of them have > known splice variants. By "known" I mean those that > have annotations contained in a database that indicate > they have (or are) splice variants. Depending on which organism you are looking at, you might want to use the Ensembl genome database. 
There is no biopython interface, but you can use the jython interface from their website (at least they once had one, I didn't check if that's still the case). Otherwise you might have to use perl or java packages for that. Another good resource for this is the Alternative Splicing Database: http://www.ebi.ac.uk/asd/ Hope that helps, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/ From bsouthey at gmail.com Mon Jun 9 13:25:44 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 09 Jun 2008 08:25:44 -0500 Subject: [BioPython] splice variants in GenBank/Entrez In-Reply-To: <20080609065852.GB13032@inb.uni-luebeck.de> References: <664146.43151.qm@web65604.mail.ac4.yahoo.com> <20080609065852.GB13032@inb.uni-luebeck.de> Message-ID: <484D2F58.6020502@gmail.com> Albert Krewinkel wrote: > Hi Steve, > > On Sun, Jun 08, 2008 at 10:21:50PM -0700, C. G. wrote: > >> I've been using BioPython for a few projects the last >> two months to process BLAST results but now I need to >> take those results and determine which of them have >> known splice variants. By "known" I mean those that >> have annotations contained in a database that indicate >> they have (or are) splice variants. >> > > Depending on which organism you are looking at, you might want to use > the Ensembl genome database. There is no biopython interface, but you > can use the jython interface from their website (at least they once > had one, I didn't check if that's still the case). Otherwise you > might have to use perl or java packages for that. > > Another good resource for this is the Alternative Splicing Database: > http://www.ebi.ac.uk/asd/ > > Hope that helps, > > Albert > > > The 'ALTERNATIVE PRODUCTS' section of CC lines in a UniProt (SwissProt) record can contain alternative splicing information. See for example, the manual section: **3.12.5. 
Syntax of the topic 'ALTERNATIVE PRODUCTS'** http://ca.expasy.org/sprot/userman.html#CCAP (given below for completeness). Bruce

Example of the CC lines and the corresponding FT lines for an entry with alternative splicing:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=8;
CC       Comment=Additional isoforms seem to exist;
CC       Name=1; Synonyms=Non-muscle isozyme;
CC         IsoId=Q15746-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=Q15746-2; Sequence=VSP_004791;
CC       Name=3A;
CC         IsoId=Q15746-3; Sequence=VSP_004792, VSP_004794;
CC       Name=3B;
CC         IsoId=Q15746-4; Sequence=VSP_004791, VSP_004792, VSP_004794;
CC       Name=4;
CC         IsoId=Q15746-5; Sequence=VSP_004792, VSP_004793;
CC       Name=Del-1790;
CC         IsoId=Q15746-6; Sequence=VSP_004795;
CC       Name=5; Synonyms=Smooth-muscle isozyme;
CC         IsoId=Q15746-7; Sequence=VSP_018845;
CC         Note=Produced by alternative initiation at Met-923 of isoform 1;
CC       Name=6; Synonyms=Telokin;
CC         IsoId=Q15746-8; Sequence=VSP_018846;
CC         Note=Produced by alternative initiation at Met-1761 of isoform
CC         1. Has no catalytic activity;
...
FT   VAR_SEQ       1   1760       Missing (in isoform 6).
FT                                /FTId=VSP_018846.
FT   VAR_SEQ       1    922       Missing (in isoform 5).
FT                                /FTId=VSP_018845.
FT   VAR_SEQ     437    506       VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT                                RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in
FT                                isoform 2 and isoform 3B).
FT                                /FTId=VSP_004791.
FT   VAR_SEQ    1433   1439       DEVEVSD -> MKWRCQT (in isoform 3A,
FT                                isoform 3B and isoform 4).
FT                                /FTId=VSP_004792.
FT   VAR_SEQ    1473   1545       Missing (in isoform 4).
FT                                /FTId=VSP_004793.
FT   VAR_SEQ    1655   1705       Missing (in isoform 3A and isoform 3B).
FT                                /FTId=VSP_004794.
FT   VAR_SEQ    1790   1790       Missing (in isoform Del-1790).
FT                                /FTId=VSP_004795.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=3;
CC       Comment=Isoform 1 and isoform 2 arise due to the use of two
CC       alternative first exons joined to a common exon 2 at the same
CC       acceptor site but in different reading frames, resulting in two
CC       completely different isoforms;
CC       Name=1; Synonyms=p16INK4a;
CC         IsoId=O77617-1; Sequence=Displayed;
CC       Name=3;
CC         IsoId=O77617-2; Sequence=VSP_018701;
CC         Note=Produced by alternative initiation at Met-35 of isoform 1.
CC         No experimental confirmation available;
CC       Name=2; Synonyms=p19ARF;
CC         IsoId=O77618-1; Sequence=External;
..
FT   VAR_SEQ       1     34       Missing (in isoform 3).
FT                                /FTId=VSP_004099.

From lueck at ipk-gatersleben.de Tue Jun 10 08:38:14 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 10 Jun 2008 10:38:14 +0200 Subject: [BioPython] formatdb over python code Message-ID: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Hi! Does anyone know whether it's possible to build a database with formatdb (NCBI) from Python code (on Windows) rather than from the console? Regards Stefanie From biopython at maubp.freeserve.co.uk Tue Jun 10 09:41:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 10:41:27 +0100 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00806100241i68b24632s121324ce1c942dd9@mail.gmail.com> On Tue, Jun 10, 2008 at 9:38 AM, Stefanie Lück wrote: > Hi! > > Does anyone know whether it's possible to build a database with formatdb > (NCBI) from Python code (on Windows) rather than from the console? Hello Stefanie, I don't think Biopython has a wrapper for the NCBI formatdb tool, but you could construct the command line string yourself and call it with one of the standard python os functions, e.g. os.popen().
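Peter's suggestion can also be sketched with the subprocess module rather than os.popen(). The helper below only assembles the documented formatdb flags (-i input file, -p protein T/F, -o SeqId parsing/indexing T/F); the helper name itself is invented for illustration, and the tool is only executed if it is actually on the PATH:

```python
import shutil
import subprocess

def formatdb_command(fasta_path, protein=True, parse_deflines=False):
    """Build the argument list for NCBI formatdb (helper name is made up)."""
    return [
        "formatdb",
        "-i", fasta_path,                       # input FASTA file
        "-p", "T" if protein else "F",          # T = protein, F = nucleotide
        "-o", "T" if parse_deflines else "F",   # parse SeqIds / create indexes
    ]

cmd = formatdb_command("proteins.fasta")
# Only actually invoke the tool if it is installed:
if shutil.which("formatdb"):
    subprocess.check_call(cmd)
```

Passing an argument list (rather than one shell string) also avoids quoting problems with paths containing spaces, which matters on Windows.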
Peter From winter at biotec.tu-dresden.de Tue Jun 10 10:13:06 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 10 Jun 2008 12:13:06 +0200 Subject: [BioPython] formatdb over python code In-Reply-To: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> References: <001501c8cad5$52c312e0$1022a8c0@ipkgatersleben.de> Message-ID: <484E53B2.5060102@biotec.tu-dresden.de> Stefanie Lück wrote, On 06/10/08 10:38: > Hi! > > Does anyone know whether it's possible to build a database with formatdb > (NCBI) from Python code (on Windows) rather than from the console? Here is the Python code I use for that:

cmd = "formatdb -i %s -p T -o F" % database
os.system(cmd)

-p T specifies protein sequences; -o T creates indexes, but fails if the FASTA file does not follow the defline format (see http://en.wikipedia.org/wiki/Fasta_format#Sequence_identifiers). If it fails, use -o F. Christof From mjldehoon at yahoo.com Sat Jun 14 02:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [BioPython] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for Rebase's plain-text output)? If not, I think this module should be deprecated. --Michiel.
From biopython at maubp.freeserve.co.uk Mon Jun 16 14:01:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Jun 2008 15:01:31 +0100 Subject: [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO Message-ID: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> I've recently had to deal with some contig files in the Ace format (output by CAP3, but many assembly programs produce this format). We have a module for parsing Ace files in Biopython, Bio.Sequencing.Ace, but I was wondering about integrating this into the Bio.SeqIO or Bio.AlignIO framework. http://www.biopython.org/wiki/SeqIO http://www.biopython.org/wiki/AlignIO I'd like to hear from anyone currently using Ace files, on how they tend to treat the data - and if they think a SeqRecord or Alignment-based representation would be useful. Each contig in an Ace file could be treated as a SeqRecord using the consensus sequence. The identifiers of each sub-sequence used to build the consensus could be stored as database cross-references, or perhaps we could store these as SeqFeatures describing which part of the consensus they support. This would then fit into Bio.SeqIO quite well. Alternatively, each contig could be treated as an alignment (with a consensus) and integrated into Bio.AlignIO. One drawback is that doing this with the current generic alignment class would require padding the start and/or end of each sequence with gaps in order to make every sequence the same length. However, if we did this (or created a more specialised alignment class), the Ace file format would then fit into Bio.AlignIO too. So, Ace users - would either (or both) of the above approaches make sense for how you use the Ace contig files?
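The gap-padding step needed to fit contig reads into a generic alignment can be sketched in a few lines of plain Python. The function name and the 0-based offset convention are assumptions for illustration, not anything from Bio.Sequencing.Ace:

```python
def pad_read(read_seq, offset, consensus_len, gap="-"):
    """Pad a sub-read with gaps so it lines up with consensus coordinates.

    offset is the 0-based position of the read's first base on the
    consensus; the result always has length consensus_len.
    """
    if offset < 0 or offset + len(read_seq) > consensus_len:
        raise ValueError("read does not fit within the consensus")
    return gap * offset + read_seq + gap * (consensus_len - offset - len(read_seq))


consensus = "ACGTACGTAC"
assert pad_read("GTAC", 2, len(consensus)) == "--GTAC----"
assert len(pad_read("AC", 8, len(consensus))) == len(consensus)
```

Padding every read this way gives equal-length rows, which is exactly what the current generic alignment class would require.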
Thanks Peter From laserson at mit.edu Tue Jun 17 18:44:08 2008 From: laserson at mit.edu (Uri Laserson) Date: Tue, 17 Jun 2008 14:44:08 -0400 Subject: [BioPython] Dependency help: libssl.so.0.9.7 Message-ID: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Hi, I am trying to use some biopython packages, and it turns out there is an error when I try to import _hashlib:

>>> import _hashlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libssl.so.0.9.7: cannot open shared object file: No such file or directory

I am working on a Unix system that is administered by a university, but I have installed my own local version of Python along with Biopython and all necessary packages for that. There is a libssl.so.0.9.8 and a libssl.so (a symbolic link to the former) in /usr/lib. Running ldd on _hashlib.so in my own /python/lib/python2.5/lib-dynload gives me:

    linux-gate.so.1 => (0xffffe000)
    libssl.so.0.9.7 => not found
    libcrypto.so.0.9.7 => not found
    libpthread.so.0 => /lib32/libpthread.so.0 (0xf7f67000)
    libc.so.6 => /lib32/libc.so.6 (0xf7e3c000)
    /lib/ld-linux.so.2 (0x56555000)

What is the easiest way to solve this? How do I get my local (home directory) installation of Python to find the libssl.so library in /usr/lib? Thanks!
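One common workaround is to give the dynamic linker a compatibility symlink to search. Note the big caveat: OpenSSL 0.9.7 and 0.9.8 are not guaranteed to be ABI-compatible, so the robust fix is to rebuild the local Python so that _hashlib links against the system OpenSSL. A sketch, with the library paths assumed from the ldd output above:

```shell
# Create a private lib dir with compat symlinks pointing at the 0.9.8 libraries.
# WARNING: 0.9.7 vs 0.9.8 ABI compatibility is NOT guaranteed; if anything
# misbehaves, rebuild Python against the system OpenSSL instead.
mkdir -p "$HOME/lib"
ln -sf /usr/lib/libssl.so.0.9.8    "$HOME/lib/libssl.so.0.9.7"
ln -sf /usr/lib/libcrypto.so.0.9.8 "$HOME/lib/libcrypto.so.0.9.7"

# Make the dynamic linker search the private dir first
# (add this line to ~/.bashrc to make it permanent):
export LD_LIBRARY_PATH="$HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Then retry:  python -c "import _hashlib"
```

If the import still fails with relocation or version errors, that is the ABI mismatch showing up, and recompiling Python is the only clean way out.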
Uri -- Uri Laserson PhD Candidate, Biomedical Engineering Harvard Medical School (Genetics) Massachusetts Institute of Technology (Mathematics) phone +1 917 742 8019 laserson at mit.edu From biopython at maubp.freeserve.co.uk Wed Jun 18 09:11:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 10:11:42 +0100 Subject: [BioPython] Dependency help: libssl.so.0.9.7 In-Reply-To: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> References: <165c1bda0806171144g20f62ab7s401007fd69c661cc@mail.gmail.com> Message-ID: <320fb6e00806180211o5d505ct4099cdd4fc9e11dc@mail.gmail.com> On Tue, Jun 17, 2008 at 7:44 PM, Uri Laserson wrote: > Hi, > > I am trying to use some biopython packages, and it turns out there is an > error when I try to import _hashlib: > >>>> import _hashlib > Traceback (most recent call last): > ... Hi Uri, I'm guessing you are trying to use Bio.SeqUtils.Checksum, but did you mean "import hashlib"? See http://code.krypto.org/python/hashlib/ Peter From biopython at maubp.freeserve.co.uk Wed Jun 18 11:32:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 12:32:10 +0100 Subject: [BioPython] blastx works fine? In-Reply-To: <1131745582.4368.22.camel@osiris.biology.duke.edu> References: <1131745582.4368.22.camel@osiris.biology.duke.edu> Message-ID: <320fb6e00806180432x60ceea96o3e45f05590003e8e@mail.gmail.com> In Nov 2005, Frank Kauff wrote: > Hi all, > > qblast currently says it works only for blastp and blastn. Actually it > seems to work fine with blastx as well - xml output parses well with > NCBIXML. Or am I missing something? > > Frank Yes, using BLASTX with the Biopython XML parser does seem to work. In fact the NCBI (now) documentation explicitly lists blastn, blastp, blastx, tblastn and tblastx so I updated Biopython's qblast function to allow them too. http://www.ncbi.nlm.nih.gov/BLAST/Doc/node43.html Fixed in Bio/Blast/NCBIWWW.py revision 1.50 - better late than never? 
Peter From mjldehoon at yahoo.com Thu Jun 19 13:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [BioPython] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 13:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. 
Peter From mjldehoon at yahoo.com Thu Jun 19 13:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From bsouthey at gmail.com Thu Jun 19 14:44:00 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 19 Jun 2008 09:44:00 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? 
In-Reply-To: <352888.20937.qm@web62409.mail.re1.yahoo.com> References: <352888.20937.qm@web62409.mail.re1.yahoo.com> Message-ID: <485A70B0.1010202@gmail.com> Michiel de Hoon wrote: >> I wonder if the NCBI make any of this available as XML via Entrez? I >> had a quick look and couldn't find anything. >> > > Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. > > --Michiel. > > > Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > >> records. The parser parses HTML pages from CDD's web site. Since the parser >> was written about six years ago, the CDD web site has changed considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. >> > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Do you know how the test files were created? If there is not an easy answer then it makes the decision easier. Anyhow, I vote to remove this module as, in addition to the things previously mentioned, it would far better to support interproscan (http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool. 
Bruce From cjfields at uiuc.edu Thu Jun 19 14:45:05 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 19 Jun 2008 09:45:05 -0500 Subject: [BioPython] [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: They don't, though you can get esummary XML information (which includes description), and I believe you can use elink to grab other information (including proteins with the specified domain). chris On Jun 19, 2008, at 8:38 AM, Peter wrote: >> Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain >> Database) >> records. The parser parses HTML pages from CDD's web site. Since >> the parser >> was written about six years ago, the CDD web site has changed >> considerably. >> Bio.CDD therefore cannot parse current HTML pages from CDD. > > A couple of years ago, I wanted to get the CDD domain name and > description and ended up writing my own very simple and crude parser > to extract just this information. Doing a proper job would mean > extracting lots and lots of fields, e.g. > http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 > > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Thu Jun 19 16:13:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 17:13:16 +0100 Subject: [BioPython] Adding NCBI XML sequence formats to Bio.SeqIO Message-ID: <320fb6e00806190913h2f3f81bgd9d16fb0f2a740f9@mail.gmail.com> Dear all, I've realised that as a bonus from Michiel's work on Bio.Entrez, Biopython should be able to parse several of the XML sequence file formats used by the NCBI - and ideally we should be able to do this via Bio.SeqIO and get SeqRecord objects. I am thinking about adding a new module to Bio.SeqIO which will map the python list/dictionary structures from Bio.Entrez into SeqRecord object(s). What I wanted to ask the list about, is which XML sequence files are of interest - and are there any strong views on format names should I use? I've looked at BioPerl list since I try and re-use the same format names, but could only spot one NCBI XML file listed here: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats NCBI TinySeq XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd BioPerl call this "tinyseq", which seems like a good choice of name. http://www.bioperl.org/wiki/Tinyseq_sequence_format Also potentially of interest are: NCBI INSDSeq XML format http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd NCBI Seq-entry XML format http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the ASN.1 variant of this file format). http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd (I haven't actually sat down and looked at the details of the implementation yet, so no promises on the timing!) Peter From sbassi at gmail.com Sun Jun 22 22:49:48 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 19:49:48 -0300 Subject: [BioPython] Secondary structure alphabet? 
Message-ID: Here is the secondary structure alphabet: class SecondaryStructure(SingleLetterAlphabet) | Method resolution order: | SecondaryStructure | SingleLetterAlphabet | Alphabet | | Data and other attributes defined here: | | letters = 'HSTC' I can't find what that HSTC stands for. The closer match I found was the DSSP code: The DSSP code The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is: * H = alpha helix * B = residue in isolated beta-bridge * E = extended strand, participates in beta ladder * G = 3-helix (3/10 helix) * I = 5 helix (pi helix) * T = hydrogen bonded turn * S = bend (http://swift.cmbi.ru.nl/gv/dssp/) Does anybody knows the meaning of HSTC? I am CC this mail to Andrew Dalke it seems he was the one who submit it the Biopython. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From idoerg at gmail.com Sun Jun 22 23:03:52 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 22 Jun 2008 16:03:52 -0700 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: Probably Helix Turn Strand Coil On Sun, Jun 22, 2008 at 3:49 PM, Sebastian Bassi wrote: > Here is the secondary structure alphabet: > > class SecondaryStructure(SingleLetterAlphabet) > | Method resolution order: > | SecondaryStructure > | SingleLetterAlphabet > | Alphabet > | > | Data and other attributes defined here: > | > | letters = 'HSTC' > > I can't find what that HSTC stands for. The closer match I found was > the DSSP code: > > The DSSP code > > The output of DSSP is explained extensively under 'explanation'. 
The > very short summary of the output is: > > * H = alpha helix > * B = residue in isolated beta-bridge > * E = extended strand, participates in beta ladder > * G = 3-helix (3/10 helix) > * I = 5 helix (pi helix) > * T = hydrogen bonded turn > * S = bend > > (http://swift.cmbi.ru.nl/gv/dssp/) > > Does anybody knows the meaning of HSTC? I am CC this mail to Andrew > Dalke it seems he was the one who submit it the Biopython. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From sbassi at gmail.com Sun Jun 22 23:05:13 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 22 Jun 2008 20:05:13 -0300 Subject: [BioPython] Secondary structure alphabet? In-Reply-To: References: Message-ID: On Sun, Jun 22, 2008 at 8:03 PM, Iddo Friedberg wrote: > Probably Helix Turn Strand Coil Sounds plausible. Thank you. Best, SB. 
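Taking Iddo's reading at face value, the letters of 'HSTC' in order would map as below. This is an unverified guess following the thread's own conclusion, not documented anywhere in Biopython at the time:

```python
# Plausible expansion of the 'HSTC' secondary structure alphabet,
# following Iddo's suggestion in this thread -- a guess, not documentation.
SS_MEANINGS = {
    "H": "Helix",
    "S": "Strand",
    "T": "Turn",
    "C": "Coil",
}

assert "".join(SS_MEANINGS) == "HSTC"  # matches the alphabet's letter order
```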
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From jdieten at gmail.com Tue Jun 24 10:58:23 2008 From: jdieten at gmail.com (Joost van Dieten) Date: Tue, 24 Jun 2008 12:58:23 +0200 Subject: [BioPython] Blastp XML malfunction Message-ID: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> MY CODE:

result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]')
blast_results = result_handle.read()
print result_handle
result_handler = cStringIO.StringIO(blast_results)
print result_handler
blast_records = NCBIXML.parse(result_handler)
blast_record = blast_records.next()

This code doesn't seem to work anymore. I got an error that my blast_record is empty, but it worked fine 3 weeks ago. Did something change in the NCBIXML code? Any ideas? Greetz, Joost Dieten From biopython at maubp.freeserve.co.uk Tue Jun 24 11:11:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Jun 2008 12:11:12 +0100 Subject: [BioPython] Blastp XML malfunction In-Reply-To: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> References: <4ac065b80806240358r4d687514k84a8b77aaff9b142@mail.gmail.com> Message-ID: <320fb6e00806240411j1c01903cm1f40d53eb9c5ad77@mail.gmail.com> On Tue, Jun 24, 2008 at 11:58 AM, Joost van Dieten wrote:
> MY CODE:
> result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence,
> entrez_query='man[ORGN]')
> blast_results = result_handle.read()
> print result_handle
> result_handler = cStringIO.StringIO(blast_results)
> print result_handler
> blast_records = NCBIXML.parse(result_handler)
> blast_record = blast_records.next()

You probably know this, but for anyone trying to cut-and-paste the code, it's much simpler to do this:

result_handle = NCBIWWW.qblast('blastp', 'swissprot', sequence, entrez_query='man[ORGN]')
blast_records = NCBIXML.parse(result_handle)
blast_record =
blast_records.next() Joost's code is a handy way to print out the raw data before parsing it, to try and identify any problems by eye. > This code doesn't seem to work anymore. I got an error that my blast_record > is empty, but it worked fine 3 weeks ago. Something changed to the NCBIXML > code??? Any ideas?? Yes, its probably a recent NCBI change, which we've fixed with Bug 2499: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 If you want to just update the Blast parser, I think you need to update both NCBIXML.py and Record.py, but a complete install from CVS might be simpler. Peter From mjldehoon at yahoo.com Wed Jun 25 14:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [BioPython] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From dag at sonsorol.org Wed Jun 25 15:08:33 2008 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 25 Jun 2008 11:08:33 -0400 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython References: Message-ID: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > To: > Subject: NCBI Abuse Activity with BioPython > > Dear Colleague: > > > > My name is Scott McGinnis and I am responsible for monitoring the web > page at NCBI and blocking users with excessive access. > > > > I am seeing more and more activity with BioPython and it is us > concern. 
> Mainly the BioPython suite does not appear to be written to the > recommendations made on the main NCBI E-utilities web page > (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).Pr > inciply the following are not being done by BioPython tools. > > > > * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov > , not the standard NCBI Web address. > > * Make no more than one request every 3 seconds. > > > > In fact I recently cc'd you on an event when a user was coming in at > over 18 requests per second. We really wish that you would alter you > scripts to run with a some sort of sleep in it in order to not send > requests more than once per 3 seconds and to not send these to the > main > www web servers but use the http://eutils.ncbi.nlm.nih.gov > . > > > > Also, there is the problem of huge searches in order to build local > databases. With you package it seems that if one were so inclined you > would send a search for all human sequences (over 10,000,000 > sequences) > and you program would then retrieve these one ID at a time. Regardless > of the fact that this is an extreme example, we would much prefer if > your program could webenv from the Esearch and use the search > history > and webenv to retrieve sets of sequences at 200 - 200 at a time. > > > > History: Requests utility to maintain results in user's environment. > Used in conjunction with WebEnv. > > usehistory=y > > Web Environment: Value previously returned in XML results from ESearch > or EPost. This value may change with each utility call. If WebEnv is > used, History search numbers can be included in an ESummary URL, e.g., > term=cancer+AND+%23X (where %23 replaces # and X is the History search > number). > > Note: WebEnv is similar to the cookie that is set on a user's > computers > when accessing PubMed on the web. 
If the parameter usehistory=y is > included in an ESearch URL both a WebEnv (cookie string) and query_key > (history number) values will be returned in the results. Rather than > using the retrieved PMIDs in an ESummary or EFetch URL you may simply > use the WebEnv and query_key values to retrieve the records. WebEnv > will > change for each ESearch query, but a sample URL would be as follows: > > http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed > &WebEnv=%3D%5DzU%5D%3FIJIj%3CC%5E%5DA%3CT%5DEACgdn%3DF%5E%3Eh > GFA%5D%3CIFKGCbQkA%5E_hDFiFd%5C%3D > &query_key=6&retmode=html&rettype=medline&retmax=15 > > WebEnv=WgHmIcDG]B etc. > > Display Numbers: > > retstart=x (x= sequential number of the first record retrieved - > default=0 which will retrieve the first record) > retmax=y (y= number of items retrieved) > > > > Otherwise we will end up blocking more of your users which we are > unfortunately already doing in some cases. > > > > Sincerely, > Scott D. McGinnis, M.S. > DHHS/NIH/NLM/NCBI > www.ncbi.nlm.nih.gov > > > From cjfields at uiuc.edu Wed Jun 25 15:34:34 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 25 Jun 2008 10:34:34 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: Just as a note from the BioPerl side, BioPerl modules which access eutils use the 3 min sleep rule, and we specify in the documentation the NCBI rules. The modules also identify the tool/agent used as 'bioperl', I believe. chris On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: > > Can someone from the biopython dev team respond officially to Scott > please? 
> > Regards, > Chris > > Begin forwarded message: > >> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >> Date: June 25, 2008 10:54:28 AM EDT >> Subject: NCBI Abuse Activity with BioPython >> >> [...] > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr.
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From rjalves at igc.gulbenkian.pt Wed Jun 25 16:16:49 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Wed, 25 Jun 2008 17:16:49 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <48626F71.4020804@igc.gulbenkian.pt> you mean 3 seconds no? Quoting Chris Fields on 06/25/2008 04:34 PM: > Just as a note from the BioPerl side, BioPerl modules which access > eutils use the 3 min sleep rule, and we specify in the documentation > the NCBI rules. The modules also identify the tool/agent used as > 'bioperl', I believe. > > chris > > On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: > >> Can someone from the biopython dev team respond officially to Scott >> please? >> >> Begin forwarded message: >> >>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >>> Date: June 25, 2008 10:54:28 AM EDT >>> Subject: NCBI Abuse Activity with BioPython >>> >>> [...] >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Wed Jun 25 19:00:34 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 25 Jun 2008 14:00:34 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <48626F71.4020804@igc.gulbenkian.pt> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <48626F71.4020804@igc.gulbenkian.pt> Message-ID: <16811EA1-130D-4F47-B0B5-654E840705B9@uiuc.edu> Yes, my bad (was in a hurry). I have heard of instances where specific users/IPs were blocked temporarily by NCBI based on spamming, so it's best to be proactive. chris On Jun 25, 2008, at 11:16 AM, Renato Alves wrote: > you mean 3 seconds no?
> > Quoting Chris Fields on 06/25/2008 04:34 PM: >> Just as a note from the BioPerl side, BioPerl modules which access >> eutils use the 3 min sleep rule, and we specify in the >> documentation the NCBI rules. The modules also identify the tool/ >> agent used as 'bioperl', I believe. >> >> chris >> >> On Jun 25, 2008, at 10:08 AM, Chris Dagdigian wrote: >> >>> Can someone from the biopython dev team respond officially to >>> Scott please? >>> >>> Begin forwarded message: >>> >>>> From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" >>>> Date: June 25, 2008 10:54:28 AM EDT >>>> Subject: NCBI Abuse Activity with BioPython >>>> >>>> [...] >>> >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Marie-Claude Hofmann >> College of Veterinary Medicine >> University of Illinois Urbana-Champaign From dalke at dalkescientific.com Thu Jun 26 01:15:50 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 26 Jun 2008 03:15:50 +0200 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: Hi Chris, I'm no longer part of the Biopython dev team, but I read at least the subject line on the mailing list. I wrote the Biopython EUtils package around December 2002 and according to the CVS logs it was added to Biopython in June 2003, so more than 5 years ago. Looking at the commit logs there haven't been any changes to the relevant code since 2004, and that was a minor patch. I thought I put a rate limiter into the code, but looking at it now I see I didn't.
The documentation clearly states that users must follow NCBI's recommendations, but who actually reads documentation?

>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov,
>> not the standard NCBI Web address.

That change was announced on May 21, 2003, and most likely no one on the Biopython dev group tracks the EUtils mailing list. It was also after I wrote the code, but to be fair I was subscribed to the utilities list at the time and should have caught the change. I think the correct fix is to this code in ThinClient.py:

    def __init__(self, opener = None, tool = TOOL, email = EMAIL,
                 baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):

Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I have not tested this.

>> * Make no more than one request every 3 seconds.

There are a couple of points here. The quickest and most direct way to force/fix the code is to change the "def _get()" in ThinClient.py. The current code is

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

Here's one possible fix: add the following two lines at module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function, so that it sleeps out whatever remains of the 3-second window:

    def _get(self, program, query):
        """Internal function: send the query string to the program as GET"""
        # NOTE: epost uses a different interface
        global _prev_time
        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        # Follow NCBI's 3 second restriction
        if time.time() - _prev_time < 3:
            time.sleep(3 - (time.time() - _prev_time))
        _prev_time = time.time()
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)

(I recall that I had something like that, and it made my unit tests - which I did during the off hours - interminable.) When I wrote this module I think I assumed that whoever would use the library would use the code correctly. Using it correctly means a few things:

- obey the restrictions set by NCBI
- change the 'tool' and 'email' settings, so NCBI complains to the right person. (The default is to say 'EUtils_Python_client' and 'biopython-dev at biopython.org')

This isn't happening. The patch above force-fixes the first. Should Biopython do a better job of the second? It's not easy to figure out the correct email. I couldn't then and can't now think of a better solution. Perhaps use the result of getpass.getuser()? But that doesn't get the rest of the domain for a proper email. Though NCBI should be able to guess the site from the IP address. The reason I made this assumption is that I meant EUtils to be used by conscientious developers. I've since learned that that's seldom the case, and because it was imported into Biopython it's been exposed to a wider audience.

>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the ESearch and use the search
>> history and webenv to retrieve sets of sequences at 200 - 200 at a time.
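[Editor's note: for anyone adapting this patch outside ThinClient.py, the same throttling idea can be factored into a small standalone helper. This is a sketch only - the names below are illustrative, not part of Biopython or EUtils:]

```python
import time

MIN_INTERVAL = 3.0  # NCBI's rule: at most one E-utilities request every 3 seconds
_prev_time = 0.0    # module-level timestamp of the last request

def wait_for_ncbi():
    """Sleep just long enough that successive calls are >= MIN_INTERVAL apart."""
    global _prev_time
    elapsed = time.time() - _prev_time
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _prev_time = time.time()
```

Calling wait_for_ncbi() immediately before each request gives the same behaviour as patching _get() itself, without touching the rest of the client.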
It does exactly that. There's an entire interface for handling search history - and it took some non-trivial work and questions to NCBI to get things working right. Rather, there are two layers. One is for the low-level protocol ("ThinClient") that EUtils offers, and another wraps around the history mechanism ("HistoryClient").

    >>> from Bio import EUtils
    >>> from Bio.EUtils import HistoryClient
    >>> client = HistoryClient.HistoryClient()
    >>> result = client.search("polio AND picornavirus")
    >>> len(result)
    3437
    >>> f = result.efetch()
    >>> print f.read(1000)
    [start of the PubMed XML for record 18540199 (Tsitologiia 50(2), 2008); the markup was mangled in the list archive, so it is trimmed here]

and there's a way to populate the history with a list of records, then fetch those records in a block:

    >>> result = client.from_dbids(EUtils.DBIds("pubmed", ["100","200","300","400","500"]))
    >>> f = result.efetch("text", "brief")
    >>> print f.read()
    1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
    2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
    3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
    4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
    5: Nourse ES. The regional workshops on pri...[PMID: 500]

If I had to guess, likely more people find the ThinClient code easier to understand, because the NCBI interface has a simple way to get the result for a single record, without using the history interface. The NCBI interface doesn't guide people to the right way to use it effectively. I started working on an update to EUtils which improved the API to include a few helper functions, like "EUtils.search()" instead of having to create a HistoryClient. That might help guide people to using it better. I wrote up something about it a few years ago: http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html But a problem in completing that is that I never got any sort of funding or user feedback on how people were using the software, and as I moved over to chemistry it became lower and lower on my list. That's still the problem with me working on this again. I don't know about this next point, but there might also be a lack of documentation on how to use the Biopython interface effectively? The NCBI documentation isn't meant for non-programmers (it's more of a bytes-on-the-wire document) so perhaps people are pattern matching on what looks right and going with what works, vs. what works well. Then because there was no 3 second limit, they had no incentive to find a better/faster solution.
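[Editor's note: the retstart/retmax batching that NCBI asks for reduces to a short loop. Here is a minimal network-free sketch, where fetch_batch is a hypothetical stand-in for an EFetch call made with the WebEnv/query_key of an earlier ESearch - not a real Biopython function:]

```python
def fetch_in_batches(total, batch_size, fetch_batch):
    """Retrieve `total` records in chunks of `batch_size` rather than
    one ID at a time.  `fetch_batch(retstart, retmax)` stands in for a
    history-based EFetch request and returns a list of records."""
    records = []
    for retstart in range(0, total, batch_size):
        # The last chunk may be smaller than batch_size.
        retmax = min(batch_size, total - retstart)
        records.extend(fetch_batch(retstart, retmax))
    return records
```

With batch_size set to 200, a 10,000,000-record search becomes 50,000 requests instead of 10,000,000 - exactly the reduction NCBI is asking for.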
Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 11:21:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 12:21:57 +0100 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> On Thu, Jun 26, 2008 at 2:15 AM, Andrew Dalke wrote: > Hi Chris, > > I'm no longer part of the Biopython dev team, but I read at least the > subject line on the mailing list. > > I wrote the Biopython EUtils package around December 2002 and according to > the CVS logs it was added to Biopython in June 2003, so more then 5 years > ago. Looking at the commit logs there haven't been any change to the > relevant code since 2004, and that was a minor patch. > > I thought I put a rate limiter into the code, but looking at it now I see I > didn't. The documentation clearly states that users must follow NCBI's > recommendations, but who actually reads documentation? > > There's a couple of points here. The quickest and most direct way to > force/fix the code is to change the "def _get()" in ThinClient.py . ... I've updated Bio/EUtils/ThinClient.py in CVS based on your suggested change, and checked the unit tests test_EUtils.py and test_SeqIO_online.py (which calls Bio.EUtils via Bio.GenBank). Looking over the code, should this wait also be done for the ThinClient's epost() method as well? > When I wrote this module I think I assumed that whoever would use the > library would use the code correctly. Using it correctly means a few > things: > - obey the restrictions set by NCBI > - change the 'tool' and 'email' settings, so NCBI complains the right > person. > (The default is to say 'EUtils_Python_client' and > 'biopython-dev at biopython.org') > > This isn't happening. The patch above force-fixes the first. Should > Biopython do a better job of the second? 
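[Editor's note: to make the 'tool' and 'email' point concrete, here is a rough sketch of how a client could tag every E-utilities request. The function name and default values are made up for illustration and are not Biopython's actual interface; only the two URL parameters, tool and email, come from NCBI's guidelines:]

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def eutils_url(program, params, tool="myscript", email="me@example.org"):
    """Build an E-utilities URL on the eutils host, always identifying
    the calling tool and a contact email as NCBI requests."""
    query = dict(params)
    query.setdefault("tool", tool)
    query.setdefault("email", email)
    # Sort for a deterministic parameter order.
    return EUTILS_BASE + program + "?" + urlencode(sorted(query.items()))
```

A script that sets its own tool/email here would let NCBI contact the actual user instead of the biopython-dev list.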
It's not easy to figure out the > correct email. I couldn't then and can't now think of a better solution. > Perhaps use the result of getpass.getuser()? But that doesn't get the rest > of the domain for a proper email. Though NCBI should be able to guess the > site from the IP address. Figuring out the user's email address is tricky, especially cross platform. Perhaps we should update the Bio.EUtils and Bio.Entrez documentation to recommend the user set their email address here, and if they are wrapping Biopython in part of a larger tool (e.g. a webservice) to set the tool name too. > If I had to guess, likely more people find the ThinClient code easier to > understand, because the NCBI interface has a simple way to get the result > for a single record, without using the history interface. The NCBI > interface doesn't guide people to the right way to use it effectively. I would agree with you. I would go further, and say for a new user even the ThinClient is a bit scary, and that the wrapper functions in Bio.GenBank are nicer to use. > I started working on an update to EUtils which improved the API to include a > few helper functions, like "EUtils.search()" instead of having to create a > HistoryClient. That might help guide people to using it better. I wrote up > something about it a few years ago: > http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html > > But a problem in completing that is that I never got any sort of funding or > user feedback on how people were using the software, and as I moved over to > chemistry it became lower and lower on my list. That's still the problem > with me working on this again. This complexity is also daunting for anyone else considering taking over the Bio.EUtils code base. > I don't know about this next point, but there might also be a lack of > documentation on how to use the Biopython interface effectively? 
The NCBI > documentation isn't meant for non-programmers (it's more of a > bytes-on-the-wire document) so perhaps people are pattern matching on what > looks right and going with what works, vs. what works well. Then because > there was no 3 second limit, they had no incentive to find a better/faster > solution. That would explain how the unnamed user ended up making over 18 requests per second! I confess I had assumed that things like the Bio.GenBank wrappers would be respecting the 3 second rule (at least they should do now). Peter From mjldehoon at yahoo.com Thu Jun 26 11:48:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:48:09 -0700 (PDT) Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> Message-ID: <53670.7764.qm@web62412.mail.re1.yahoo.com> Dear Chris, Sorry for the trouble. We are now discussing on the Biopython mailing list how to fix this issue. I will write a reply to Scott shortly. Best, --Michiel. --- On Wed, 6/25/08, Chris Dagdigian wrote: From: Chris Dagdigian Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython To: biopython at lists.open-bio.org Date: Wednesday, June 25, 2008, 11:08 AM Can someone from the biopython dev team respond officially to Scott please? Regards, Chris Begin forwarded message: > From: "Mcginnis, Scott (NIH/NLM/NCBI) [E]" > Date: June 25, 2008 10:54:28 AM EDT > Subject: NCBI Abuse Activity with BioPython > > [...] _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Thu Jun 26 14:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [BioPython] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2. It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module?
--Michiel

From binbin.liu at umb.no Thu Jun 26 15:35:46 2008
From: binbin.liu at umb.no (binbin)
Date: Thu, 26 Jun 2008 17:35:46 +0200
Subject: [BioPython] Entrez
Message-ID: <1214494546.6215.3.camel@ubuntu>

Hei,
Am using biopython 1.45

My problem is as follows:

>>> from Bio import GenBank
>>> from Bio import Entrez
Traceback (most recent call last):
  File "", line 1, in
ImportError: cannot import name Entrez

I could not import Entrez. Was it deleted from Bio?

From biopython at maubp.freeserve.co.uk Thu Jun 26 15:57:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 16:57:47 +0100
Subject: [BioPython] Entrez
In-Reply-To: <1214494546.6215.3.camel@ubuntu>
References: <1214494546.6215.3.camel@ubuntu>
Message-ID: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com>

On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote:
> Hei,
> Am using biopython 1.45
>
> My problem is as follows:
>
> >>> from Bio import GenBank
> >>> from Bio import Entrez
> Traceback (most recent call last):
>   File "", line 1, in
> ImportError: cannot import name Entrez
>
> I could not import Entrez. Was it deleted from Bio?

Hello binbin,

A long long time ago there was a Bio.Entrez module which was deleted in
2000.

We are going to re-introduce a Bio.Entrez module in Biopython 1.46
(hopefully out next month?), which will replace Bio.WWW.NCBI. If you
want to try this out now, please install the latest CVS version of
Biopython from source.

Can I ask why you are trying to do "from Bio import Entrez"?
Peter From winter at biotec.tu-dresden.de Thu Jun 26 15:53:23 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Thu, 26 Jun 2008 17:53:23 +0200 Subject: [BioPython] Entrez In-Reply-To: <1214494546.6215.3.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> Message-ID: <4863BB73.2020509@biotec.tu-dresden.de> binbin wrote, On 06/26/08 17:35: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Import works fine for me, so I don't think it has been deleted. With my Linux installation, I can do locate Entrez which finds /var/lib/python-support/python2.5/Bio/Entrez HTH, Christof From biopython at maubp.freeserve.co.uk Thu Jun 26 16:12:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:12:53 +0100 Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <320fb6e00806260912j3395d2c0s3d7bbb7227f84421@mail.gmail.com> > Hello binbin, > > A long long time ago there was a Bio.Entrez module which was deleted in 2000. > > We are going to re-introduce a Bio.Entrez module in Biopython 1.46 > (hopefully out next month?), which will replace Bio.WWW.NCBI. If you > want to try this out now, please install the latest CVS version of > Biopython from source. Sorry - I've confused myself as the Bio.Entrez module has been under revision recently. >From the user's point of view Biopython 1.46 will add an XML parser, but otherwise Bio.Entrez should be there in Biopython 1.45. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 20:19:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:19:31 +0100 Subject: [BioPython] Removing the unit test GUI? Message-ID: <320fb6e00806261319w5be098d1y48404f3f93934fa3@mail.gmail.com> Hello all, I wanted to do a quick survey of opinion about the Biopython test suite and its interface. Those of you who have ever installed Biopython from source may have tried running the unit tests too. You do this by changing to the Tests subdirectory, and then running the run_tests.py script. Currently by default this will show a GUI. However, from the developer's point of view the unit tests are almost always run at the command line with: python run_tests.py --no-gui It would let us simplify the test harness if we got rid of the GUI, and it would make life very slightly easier for people running the tests at the command line. But would anyone be upset at the loss of the test GUI? So - have any of you ever run the unit tests? Did you use the GUI or the command line? Would you prefer the GUI to remain? Thanks Peter P.S. See also bug 2525 http://bugzilla.open-bio.org/show_bug.cgi?id=2525 From mjldehoon at yahoo.com Thu Jun 26 22:24:41 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 15:24:41 -0700 (PDT) Subject: [BioPython] Entrez In-Reply-To: <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> Message-ID: <987374.9439.qm@web62409.mail.re1.yahoo.com> Bio.Entrez was reintroduced in release 1.45 already (though without the parser), so binbin should be able to find it. --Michiel. 
--- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [BioPython] Entrez To: "binbin" Cc: biopython at biopython.org Date: Thursday, June 26, 2008, 11:57 AM On Thu, Jun 26, 2008 at 4:35 PM, binbin wrote: > Hei, > Am using biopython 1.45 > > my problem is as follow > > > >>> from Bio import GenBank > >>> from Bio import Entrez > Traceback (most recent call last): > File "", line 1, in > ImportError: cannot import name Entrez > > I could not import Entrez. was it deleted from Bio? Hello binbin, A long long time ago there was a Bio.Entrez module which was deleted in 2000. We are going to re-introduce a Bio.Entrez module in Biopython 1.46 (hopefully out next month?), which will replace Bio.WWW.NCBI. If you want to try this out now, please install the latest CVS version of Biopython from source. Can I ask why are you trying to do "from Bio import Entrez"? Peter _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 27 11:16:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 12:16:12 +0100 Subject: [BioPython] Entrez In-Reply-To: <1214562160.6026.2.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> Message-ID: <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote: > thank you for answering, i am a beginner of biopython,in the "Biopython > Tutorial and Cookbook": > 2.5 Connecting with biological databases: > this is found > "from Bio import Entrez" > > i tried this but it did work for me, that is why i asked. That should have worked if your installation of Biopython 1.45 was successful. We may be able to work out what is wrong. What operating system are you using, which version of python, and how did you install Biopython? 
Regards,
Peter

From fredgca at hotmail.com Fri Jun 27 13:19:04 2008
From: fredgca at hotmail.com (Frederico Arnoldi)
Date: Fri, 27 Jun 2008 13:19:04 +0000
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
Message-ID:

Guys (sorry the informality),

I have followed the discussion about "NCBI Abuse Activity with
BioPython". I have to confess that I followed it superficially, since I
am not able to understand everything you said. So, I am going to ask
some questions about it:

1) I believe that using BLAST with NCBIWWW.qblast is included in "Abuse
Activity". Right? I am asking because sometimes I use it. The
recommendation of NCBI is "Make no more than one request every 3
seconds.". Biopython's code does not seem to assure this, judging by the
following code in NCBIWWW.py, line 779:

[code]
limiter = RequestLimiter(3)
while 1:
    limiter.wait()
[/code]

2) Do you have any recommendation for using it that is not included in
the tutorial? Maybe listing some recommendations here would help.

Sorry if I have asked about something obvious.

Thanks,
Fred

_________________________________________________________________
Conheça o Windows Live Spaces, a rede de relacionamentos do Messenger!
http://www.amigosdomessenger.com.br/

From biopython at maubp.freeserve.co.uk Fri Jun 27 13:57:49 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Jun 2008 14:57:49 +0100
Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References:
Message-ID: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com>

On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi wrote:
>
> Guys (sorry the informality),
>
> I have followed the discussion about "NCBI Abuse Activity with
> BioPython". I have to confess that I followed it superficially, since
> I am not able to understand everything you said. So, I am going to ask
> some questions about it:
>
> 1) I believe that using BLAST with NCBIWWW.qblast is included in
> "Abuse Activity". Right?
I'm not aware that abuse of BLAST was singled out, only Entrez / E-utils. > I am asking because sometimes I use it. The recommendation of NCBI is > "Make no more than one request every 3 seconds.". True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > Biopython code does not assure it with the following code in NCBIWWW.py, > line 779: > [code] > limiter = RequestLimiter(3) > while 1: > limiter.wait() > [/code] I believe that bit of code is polling the server for results every three seconds. Perhaps we should insert an additional enforced three second delay between submission of queries as well. > 2)Do you have any recommendation for using it that it is not included in the > tutorial? Maybe listing some recommendations here would help. I would recommend running your own local BLAST server for any large jobs - either the standalone blast tools, or if you have a machine on the network that many people could share, run the WWW version locally. Peter From cjfields at uiuc.edu Fri Jun 27 15:51:12 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 10:51:12 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> Message-ID: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> On Jun 27, 2008, at 8:57 AM, Peter wrote: > On Fri, Jun 27, 2008 at 2:19 PM, Frederico Arnoldi > wrote: >> >> Guys (sorry the informality), >> >> I have followed the discussion about "NCBI Abuse Activity with >> BioPython". I >> have to confess that followed it superficially, since I am not able >> to understand >> everything you said. So, I am going to make some questions about it: >> >> 1)I believe that using BLAST with NCBIWWW.qblast is included in >> "Abuse Activity". Right? > > I'm not aware that abuse of BLAST was singled out, only Entrez / E- > utils. 
Similar policy though, for the same reasons they insist on a delay for E-utils. >> I am asking because sometimes I use it. The recommendation of NCBI is >> "Make no more than one request every 3 seconds.". > > True, http://www.ncbi.nlm.nih.gov/blast/Doc/node60.html > >> Biopython code does not assure it with the following code in >> NCBIWWW.py, >> line 779: >> [code] >> limiter = RequestLimiter(3) >> while 1: >> limiter.wait() >> [/code] > > I believe that bit of code is polling the server for results every > three seconds. Perhaps we should insert an additional enforced three > second delay between submission of queries as well. > >> 2)Do you have any recommendation for using it that it is not >> included in the >> tutorial? Maybe listing some recommendations here would help. > > I would recommend running your own local BLAST server for any large > jobs - either the standalone blast tools, or if you have a machine on > the network that many people could share, run the WWW version locally. > > Peter The above appears to submit a single job at a time and wait 3 sec. between polling the server until the current job is finished. I don't think that is the problem indicated in the link above. The 3 sec. is for submitting new BLAST jobs, for instance if you want to submit one BLAST request after another (gathering the RIDs), then grab all the reports at once, or if you are threading 50 submission requests all at once. chris From fredgca at hotmail.com Fri Jun 27 16:18:47 2008 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 27 Jun 2008 16:18:47 +0000 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: Right, thanks for the answers. If I understood, the problem is threading the requests. 
If I am not threading my requests I am not abusing NCBI server, so don't thread them. Thanks again, Fred > >> 2)Do you have any recommendation for using it that it is not > >> included in the > >> tutorial? Maybe listing some recommendations here would help. > > > > I would recommend running your own local BLAST server for any large > > jobs - either the standalone blast tools, or if you have a machine on > > the network that many people could share, run the WWW version locally. > > > > Peter > > The above appears to submit a single job at a time and wait 3 sec. > between polling the server until the current job is finished. I don't > think that is the problem indicated in the link above. The 3 sec. is > for submitting new BLAST jobs, for instance if you want to submit one > BLAST request after another (gathering the RIDs), then grab all the > reports at once, or if you are threading 50 submission requests all at > once. > > chris _________________________________________________________________ Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS para o Messenger! ? GR?TIS! http://www.msn.com.br/emoticonpack From cjfields at uiuc.edu Fri Jun 27 17:32:31 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 27 Jun 2008 12:32:31 -0500 Subject: [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: References: <320fb6e00806270657p1150698ue23be1b7c547b2ae@mail.gmail.com> <6172FCA3-63CB-4CD8-8F12-BB0385DCBD0C@uiuc.edu> Message-ID: <53E19130-4EAC-4DC7-A58C-883581F8B468@uiuc.edu> No, not just threading. The requests could be made by a simple script/ program of any kind with no timeout implemented; the IPs of those abusing the timeout will likely be blocked. The idea is not to spam their server (let alone any server which provides a free service) with tons of requests of any kind, be it eutils or BLAST submission requests, BLAST report retrieval requests using RID, etc. 
Any tools using these services should implement the minimum recommended delay between them. Alternatively, set up a local BLAST service as Peter recommends. chris On Jun 27, 2008, at 11:18 AM, Frederico Arnoldi wrote: > > Right, thanks for the answers. > If I understood, the problem is threading the requests. If I am not > threading my requests I am not abusing NCBI server, so don't thread > them. > Thanks again, > Fred >>>> 2)Do you have any recommendation for using it that it is not >>>> included in the >>>> tutorial? Maybe listing some recommendations here would help. >>> >>> I would recommend running your own local BLAST server for any large >>> jobs - either the standalone blast tools, or if you have a machine >>> on >>> the network that many people could share, run the WWW version >>> locally. >>> >>> Peter >> >> The above appears to submit a single job at a time and wait 3 sec. >> between polling the server until the current job is finished. I >> don't >> think that is the problem indicated in the link above. The 3 sec. is >> for submitting new BLAST jobs, for instance if you want to submit one >> BLAST request after another (gathering the RIDs), then grab all the >> reports at once, or if you are threading 50 submission requests all >> at >> once. >> >> chris > > _________________________________________________________________ > Instale a Barra de Ferramentas com Desktop Search e ganhe EMOTICONS > para o Messenger! ? GR?TIS! 
> http://www.msn.com.br/emoticonpack
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From sbassi at gmail.com Sat Jun 28 14:46:45 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 28 Jun 2008 11:46:45 -0300
Subject: [BioPython] one function, two behaivors
Message-ID:

If I invoke "transcribe" with a RNA sequence like this:

>>> from Bio.Seq import transcribe
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
Seq('CCGGGUU', RNAAlphabet())

But I can't "transcribe" a RNA sequence if I invoke it this way:

>>> from Bio import Transcribe
>>> transcriber = Transcribe.unambiguous_transcriber
>>> transcriber.transcribe(rna_seq)
Traceback (most recent call last):
  File "", line 1, in
    transcriber.transcribe(rna_seq)
  File "/usr/local/lib/python2.5/site-packages/Bio/Transcribe.py", line 13, in transcribe
    "transcribe has the wrong DNA alphabet"
AssertionError: transcribe has the wrong DNA alphabet

I get the same result when using "translate". What is the rationale
behind this?

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From biopython at maubp.freeserve.co.uk Sat Jun 28 15:16:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Jun 2008 16:16:13 +0100
Subject: [BioPython] one function, two behaivors
In-Reply-To:
References:
Message-ID: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>

Hi Sebastian,

As to why there are two ways, well, frankly the Bio.Transcribe and
Bio.Translate code isn't very nice to use! The Bio.Seq functions are
much simpler.
We've talked about deprecating the Bio.Transcribe and Bio.Translate
modules in favour of just Bio.Seq -- we could deprecate Bio.Transcribe
now, but there is functionality in Bio.Translate that has not been
duplicated. See also bug 2381.
http://bugzilla.open-bio.org/show_bug.cgi?id=2381

On Sat, Jun 28, 2008 at 3:46 PM, Sebastian Bassi wrote:
> If I invoke "transcribe" with a RNA sequence like this:
>
>>>> from Bio.Seq import transcribe
>>>> from Bio.Seq import Seq
>>>> import Bio.Alphabet
>>>> rna_seq = Seq('CCGGGUU',Bio.Alphabet.IUPAC.unambiguous_rna)
>>>> transcribe(rna_seq) # Look!, I am "transcribing a RNA"
> Seq('CCGGGUU', RNAAlphabet())

When Michiel added this code for Biopython 1.41, originally there was no
error checking on the alphabet. For Biopython 1.44, I added a check to
prevent protein transcribing (which is clearly meaningless), and made a
note to consider also banning transcribing RNA.

Here there is at least one reason to want to do this - suppose you have
a mixed set of nucleotide sequences and want to ensure they are all RNA.

Do you think the Bio.Seq.transcribe() method should reject RNA
sequences?

Peter

From biopython at maubp.freeserve.co.uk Sat Jun 28 15:23:40 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Jun 2008 16:23:40 +0100
Subject: [BioPython] one function, two behaivors
In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
Message-ID: <320fb6e00806280823h36f3f01ema2886dca98635588@mail.gmail.com>

I wrote,
> As to why there are two ways, well, frankly the Bio.Transcribe and
> Bio.Translate code isn't very nice to use! The Bio.Seq functions are
> much simpler.

Hmm - the tutorial is still using Bio.Transcribe and Bio.Translate at
the moment. I could update the tutorial to use the Bio.Seq functions for
(back)transcription.
However, as I said in the last email, Bio.Translate still has its uses -
there is no way to do a "translate to stop" with Bio.Seq for example.

Maybe Bug 2381 should be a priority for the next release AFTER the
imminent Biopython 1.46. We can then use object methods in the tutorial,
which I personally would find much nicer to use.
http://bugzilla.open-bio.org/show_bug.cgi?id=2381

If you could have a look at the suggested changes on Bug 2381, I'd
welcome some feedback.

Peter

From sbassi at gmail.com Sat Jun 28 16:47:05 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 28 Jun 2008 13:47:05 -0300
Subject: [BioPython] one function, two behaivors
In-Reply-To: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
References: <320fb6e00806280816i666076aetcf1dcc12128924@mail.gmail.com>
Message-ID:

On Sat, Jun 28, 2008 at 12:16 PM, Peter wrote:
....
> Here there is at least one reason to want to do this - suppose you
> have a mixed set of nucleotide sequences and want to ensure they are
> all RNA.
> Do you think the Bio.Seq.transcribe() method should reject RNA
> sequences?

IMHO, it should reject RNA sequences. The case you point out (ensuring a
set of sequences are all RNA) could be done by checking the type before
applying "transcribe".

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From lueck at ipk-gatersleben.de Sun Jun 29 14:42:47 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Sun, 29 Jun 2008 16:42:47 +0200
Subject: [BioPython] Sequence from Fasta
Message-ID: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>

Hi!

Is there a way to extract only the sequence (full length) from a fasta
file?
If I try the code from page 10 in the tutorial, I get of course this:

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
SingleLetterAlphabet())

But I'm looking for something like this:

Name Sequence without linebreak

Example:

MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca
etc.

Regards
Stefanie

From biopython at maubp.freeserve.co.uk Sun Jun 29 15:19:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 29 Jun 2008 16:19:13 +0100
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>

On Sun, Jun 29, 2008 at 3:42 PM, Stefanie Lück wrote:
> Hi!
>
> Is there a way to extract only the sequence (full length) from a fasta
> file?

Yes. Based on your requirement to have name-space-sequence, how about:

handle = open(filename)
from Bio import SeqIO
for record in SeqIO.parse(handle, "fasta"):
    print "%s %s" % (record.id, record.seq)
handle.close()

> If I try the code from page 10 in the tutorial, I get of course this:
> Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
> SingleLetterAlphabet())

Which bit of the tutorial exactly? That looks like printing the repr()
of a Seq object, and Seq objects don't have names. If something could be
clarified, that's useful feedback.

Peter

From lueck at ipk-gatersleben.de Mon Jun 30 09:09:53 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 30 Jun 2008 11:09:53 +0200
Subject: [BioPython] Sequence from Fasta
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
<320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>
Message-ID: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>

Hi Peter!
I mean the biopython tutorial (16.3.2007), page 10:

>>>
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print seq_record.seq
    print len(seq_record.seq)
handle.close()
<<<

I tried your code but I still have the same problem. It doesn't show the
full sequence. Output:

1
Seq('atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACG ...',
SingleLetterAlphabet())
2
Seq('AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGG ...',
SingleLetterAlphabet())

The fasta file looks like this:

>1
atgctcgatgcgcgctcgcgtccgtcgCAGGAgGAGATGGGGAGGCGCCGCCGGTTCACGCATCAGCCCACCAGCGACGACGACGACGAGGAAGACAGAGCCGcCC
>2
AGAAAAATCCGGAATCAGAGGAGGAGGAGGAGTCTCGCGAGGAGGATAGCACGGAGGCGGTACCCGTCGGTGAACCTTT

I can try with regular expressions but I first wanted to know whether
there is a way in biopython.

Regards
Stefanie

From biopython at maubp.freeserve.co.uk Mon Jun 30 09:19:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 30 Jun 2008 10:19:16 +0100
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
<320fb6e00806290819s73f95d32x563879e9bb64b924@mail.gmail.com>
<001901c8da91$0eedfcd0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00806300219j54f7f43dpe0051f54be27d402@mail.gmail.com>

Which version of Biopython do you have? I'm guessing Biopython 1.44. On
older versions you have to explicitly turn the Seq into a string. Does
this work:

from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta"):
    print seq_record.id
    print seq_record.seq.tostring()
    print len(seq_record.seq)
handle.close()

Since Biopython 1.45, doing str(...) on a Seq object gives you the
sequence in full as a plain string. When you do a print this happens
implicitly.

Peter

P.S. For the implementation, str(object) calls the object.__str__()
method.
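The name-space-sequence output discussed above can also be illustrated with a few lines of plain Python (a minimal sketch in modern Python 3; the `fasta_records` helper below is hypothetical, and `Bio.SeqIO.parse` as Peter shows remains the supported way to read FASTA files with Biopython):

```python
def fasta_records(text):
    """Yield (name, sequence) pairs from FASTA-formatted text.

    Minimal illustration only: Bio.SeqIO.parse(handle, "fasta")
    is the robust way to do this in Biopython.
    """
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # The record name is the first word after ">".
            name, parts = line[1:].split()[0], []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


fasta = """>1
atgctcgatgcgcg
ctcgcgtccgtcg
>2
AGAAAAATCCGG
"""

for name, seq in fasta_records(fasta):
    # One "name sequence" line per record, internal line breaks removed.
    print("%s %s" % (name, seq))
```

Joining the per-line fragments with `"".join(parts)` is what removes the line breaks, which is exactly the effect Stefanie was after.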
From dalloliogm at gmail.com Mon Jun 30 09:40:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 30 Jun 2008 11:40:23 +0200
Subject: [BioPython] Sequence from Fasta
In-Reply-To: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de>
Message-ID: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com>

On Sun, Jun 29, 2008 at 4:42 PM, Stefanie Lück wrote:
> If I try the code from page 10 in the tutorial, I get of course this:
> Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA ...',
> SingleLetterAlphabet())

Try with seq_record.seq.data.

> But I'm looking for something like this:
>
> Name Sequence without linebreak
>
> Example:
>
> MySequence atgcgcgctcggcgcgctcgfcgcgccccccatggctcgcgcactacagcg
> MySequence2 atgcgctctgcgcgctcgatgtagaatatgagatctctatgagatcagcatca

Bioperl's SeqIO has support for a 'tab sequence format' which is similar
to this [1]. Maybe it could be useful in the future to add support for
such a format in biopython.
[1] http://www.bioperl.org/wiki/Tab_sequence_format > > Regards > Stefanie > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Jun 30 10:25:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 11:25:01 +0100 Subject: [BioPython] Sequence from Fasta In-Reply-To: <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> References: <001e01c8d9f6$66134420$1022a8c0@ipkgatersleben.de> <5aa3b3570806300240g25a5c311y1d5e1872a9fa97d5@mail.gmail.com> Message-ID: <320fb6e00806300325r10c96b57qffee9ab3df81cb9e@mail.gmail.com> On Mon, Jun 30, 2008 at 10:40 AM, Giovanni Marco Dall'Olio wrote: > On Sun, Jun 29, 2008 at 4:42 PM, Stefanie L?ck wrote: > >> If I try the code from page 10 in the tutorial, I get of course this: > > Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA > ...', SingleLetterAlphabet()) > > Try with seq_record.seq.data. I would like to discourage using the Seq object's .data property if possible, in favour of my_seq.tostring() which will work even on very old versions of Biopython, or str(my_seq) if you are up to date. I've mooted deprecating the Seq object's .data property as part of making the Seq object more string like (Bug 2509 and Bug 2351). http://bugzilla.open-bio.org/show_bug.cgi?id=2509 http://bugzilla.open-bio.org/show_bug.cgi?id=2351 User feedback would be good, but to explain my current thinking: I'm hoping to reduce the Seq's .data to a read only property in a future release, and then in a later release start issuing a deprecation warning, before its eventual removal (Bug 2509). At some point in this process the Seq object would hopefully subclass the python string (Bug 2351). Peter
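The string-subclass direction Peter mentions for Bug 2351 can be sketched in isolation (a hypothetical toy class written for modern Python 3, not Biopython's actual Seq implementation):

```python
class MiniSeq(str):
    """Toy sequence type subclassing str.

    Hypothetical sketch of the Bug 2351 idea: because the object *is*
    a string, str(seq), slicing, and len() all work directly, with no
    need for a .data property or a .tostring() method.
    """

    def __new__(cls, data, alphabet="generic"):
        # str is immutable, so the extra attribute is attached in __new__.
        obj = str.__new__(cls, data)
        obj.alphabet = alphabet
        return obj

    def complement(self):
        # Simple DNA complement, for illustration only.
        table = str.maketrans("ACGTacgt", "TGCAtgca")
        return MiniSeq(self.translate(table), self.alphabet)


seq = MiniSeq("ACGT", alphabet="unambiguous_dna")
print(str(seq))          # the plain sequence, no .data needed
print(seq[1:3])          # slicing works like any string
print(seq.complement())  # TGCA
```

Because `str` is immutable, the alphabet has to be attached in `__new__` rather than `__init__`; that is the main wrinkle in making a sequence class "more string like" this way.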