From betainverse at gmail.com Wed May 7 17:02:42 2008
From: betainverse at gmail.com (Katie Edmonds)
Date: Wed, 7 May 2008 17:02:42 -0400
Subject: [BioPython] PSI-BLAST using NCBIWWW
Message-ID: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com>

Hi,

I'm trying to use biopython to run psi-blast on the ncbi server. It looks like qblast cannot be used for psi-blast, and as far as I can tell blast() doesn't work at all anymore, though it has a 'run_psiblast' parameter. Has anyone had success with running psi-blast with biopython who could offer some advice?

Thanks,

Katie

From gbastian at pasteur.fr Wed May 7 17:29:26 2008
From: gbastian at pasteur.fr (gbastian at pasteur.fr)
Date: Wed, 7 May 2008 23:29:26 +0200 (CEST)
Subject: [BioPython] NCBIXML error
Message-ID: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr>

Dear all,

I have been using a script to blast sequences for days without a problem, then, after 2/3 hours it started giving me this error and never worked again...did they change xml blast format?

this is the error:

  File "ppinvestigator.py", line 918, in ?
    pdbs.find_homologous_seqs(int_list)
  File "ppinvestigator.py", line 122, in find_homologous_seqs
    data = search_seq(self.sequences[chain][0], interactor_list)
  File "/home/giacomotion/Desktop/VU-PROJECT/PPI_PDBS/PPINVESTIGATOR/tools.py", line 32, in search_seq
    blast_record = blast_records.next()
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 576, in parse
    expat_parser.Parse(text, False)
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement
    eval("self.%s()" % method)
  File "", line 0, in ?
  File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 216, in _end_BlastOutput_version
    self._header.date = self._value.split()[2][1:-1]
IndexError: list index out of range

this is my script:

#Launch the blastp search
result_handle = NCBIWWW.qblast('blastp', 'nr', sequence, hitlist_size=10, perc_ident=10, alignments=10, descriptions=10, entrez_query='"Saccharomyces ce revisiae" [Organism]')

#Handle the result file
blast_results = result_handle.read()
output_filename = 'tmp_blast.xml'
save_file = open(output_filename,'w')
save_file.write(blast_results)
save_file.close()
result_handle2 = open(output_filename, 'r')
blast_records = NCBIXML.parse(result_handle2)

#Initialize the record dictionary
record_storage = []

#Iterate on the blast_handle file (only one iteration if one blastp search)
blast_record = blast_records.next()

this is the xml that I get:

blastp
BLASTP 2.2.18+
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
nr

4678
unnamed protein product
280
BLOSUM62 10
4678
unnamed protein product
280
1 gi|151567870|pdb|2PM9|B Chain B, Crystal Structure Of Yeast Sec1331 VERTEX ELEMENT OF THE Copii Vesicular Coat 2PM9-B 297 1
577.785
1488

Thanks for any suggestion,

Giacomo

From biopython at maubp.freeserve.co.uk Thu May 8 04:57:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 May 2008 09:57:50 +0100
Subject: [BioPython] NCBIXML error
In-Reply-To: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr>
References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr>
Message-ID: <320fb6e00805080157n188865detf9e8a2791ac76530@mail.gmail.com>

On Wed, May 7, 2008 at 10:29 PM, wrote:
> Dear all,
>
> I have been using a script to blast sequences for days without a
> problem, then, after 2/3 hours it started giving me this error
> and never worked again...did they change xml blast format?

Looking at the XML snippet, the version is "BLASTP 2.2.18+" (with a plus but no date) so it looks like they may well have updated something. It's possible that they'll make further tweaks in the next couple of days, so it would be worth retesting.

The Biopython code expects something like "BLASTP 2.2.12 [Aug-07-2005]", and it's the missing date that is causing this error for you.

On a different note, if you have really been running BLASTP for days over the internet, it would probably be faster and more efficient to install standalone blast and the nr database on your local machine. You can still ask for XML output, so the parsing side of your script shouldn't change.

Peter

From winter at biotec.tu-dresden.de Thu May 8 04:47:38 2008
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Thu, 08 May 2008 10:47:38 +0200
Subject: [BioPython] NCBIXML error
In-Reply-To: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr>
References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr>
Message-ID: <4822BE2A.8080601@biotec.tu-dresden.de>

gbastian at pasteur.fr wrote:
> Dear all,
>
> I have been using a script to blast sequences for days without a
> problem, then, after 2/3 hours it started giving me this error
> and never worked again...did they change xml blast format?
> > this is the error: > > File "ppinvestigator.py", line 918, in ? > pdbs.find_homologous_seqs(int_list) > File "ppinvestigator.py", line 122, in find_homologous_seqs > data = search_seq(self.sequences[chain][0], interactor_list) > File > "/home/giacomotion/Desktop/VU-PROJECT/PPI_PDBS/PPINVESTIGATOR/tools.py", > line 32, in search_seq > blast_record = blast_records.next() > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 576, > in parse > expat_parser.Parse(text, False) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 98, > in endElement > eval("self.%s()" % method) > File "", line 0, in ? > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 216, > in _end_BlastOutput_version > self._header.date = self._value.split()[2][1:-1] > IndexError: list index out of range [...] > this is the xml that I get: > > > "NCBI_BlastOutput.dtd"> > > blastp > BLASTP 2.2.18+ > Altschul, Stephen F., Thomas L. Madden, Alejandro [...] It seems they did change the format. When I run blast locally, it says blastp 2.2.18 [Mar-02-2008] self._header.date = self._value.split()[2][1:-1] works in that case, whereas it chokes on your BLASTP 2.2.18+ as "BLASTP 2.2.18+".split() lacks a third element. Should be easy to fix, shouldn't it? Christof From biopython at maubp.freeserve.co.uk Thu May 8 05:18:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 May 2008 10:18:53 +0100 Subject: [BioPython] NCBIXML error In-Reply-To: <4822BE2A.8080601@biotec.tu-dresden.de> References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> <4822BE2A.8080601@biotec.tu-dresden.de> Message-ID: <320fb6e00805080218k2f6748a1j7c98886622dd90ac@mail.gmail.com> > It seems they did change the format. When I run blast locally ... > whereas it chokes ... as "BLASTP 2.2.18+".split() lacks a third element. > Should be easy to fix, shouldn't it? > > Christof I came to the same conclusion Christof, and its now fixed in CVS (with a new test case too). 
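
For readers following the thread, the failure can be reproduced without Biopython. The sketch below is not the change that went into CVS; it is only an illustration, using a hypothetical helper name (split_blast_version), of reading the version string so that a missing date gives None instead of raising IndexError.

```python
def split_blast_version(value):
    """Split a BlastOutput_version string into (application, version, date).

    Hypothetical helper for illustration only; it is not part of Biopython.
    Fields that are absent from the string are returned as None.
    """
    parts = value.split()
    if not parts:
        return None, None, None
    application = parts[0]
    version = parts[1] if len(parts) > 1 else None
    # Older output looks like "BLASTP 2.2.12 [Aug-07-2005]": the date is the
    # third field with its surrounding square brackets stripped.
    date = parts[2][1:-1] if len(parts) > 2 else None
    return application, version, date


print(split_blast_version("BLASTP 2.2.12 [Aug-07-2005]"))
print(split_blast_version("BLASTP 2.2.18+"))
```

The second call is the case from the traceback: "BLASTP 2.2.18+".split() has only two elements, so indexing [2] fails unless the length is checked first.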
Giacomo, if you want to try this you'll need to update your system. Checking out the latest code from CVS and installing Biopython from source would be one way. However, you only need to update one file, replacing /usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py with CVS revision 1.4. If you want, you can just grab this from the web interface here (once the website is automatically updated in a few hours): http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython Please let us know on the mailing list if that works for you (or if there are still problems). Peter From gbastian at pasteur.fr Thu May 8 05:46:00 2008 From: gbastian at pasteur.fr (gbastian at pasteur.fr) Date: Thu, 8 May 2008 11:46:00 +0200 (CEST) Subject: [BioPython] NCBIXML error Message-ID: <63681.157.99.64.103.1210239960.squirrel@php.pasteur.fr> Hello Peter and Christof, I just modified the code of NCBIXML.py where it trys to get the version date information. Now it works. line 216 self._header.date = 'Dec-21-2012' thanks, Giacomo From biopython at maubp.freeserve.co.uk Thu May 8 07:37:34 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 May 2008 12:37:34 +0100 Subject: [BioPython] PSI-BLAST using NCBIWWW In-Reply-To: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com> References: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com> Message-ID: <320fb6e00805080437h2d13eb5co63a8e5f604223f8f@mail.gmail.com> On Wed, May 7, 2008 at 10:02 PM, Katie Edmonds wrote: > Hi, > > I'm trying to use biopython to run psi-blast on the ncbi server. It looks > like qblast cannot be used for psi-blast, and as far as I can tell blast() > doesn't work at all anymore, though it has a 'run_psiblast' parameter. Has > anyone had success with running psi-blast with biopython who could offer > some advice? I've done some investigation of using Bio.Blast.NCBIWWW.qblast() with psi-blast, and it does seem to be missing the run_psiblast option. 
However, I don't think that's the only issue here... Have you been able to try running standalone psiblast, and parse its XML output? Peter From biopython at maubp.freeserve.co.uk Fri May 9 04:59:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 May 2008 09:59:01 +0100 Subject: [BioPython] PSI-BLAST using NCBIWWW In-Reply-To: <8e76d5310805081228n1558e475ga774f910db020060@mail.gmail.com> References: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com> <320fb6e00805080437h2d13eb5co63a8e5f604223f8f@mail.gmail.com> <8e76d5310805081228n1558e475ga774f910db020060@mail.gmail.com> Message-ID: <320fb6e00805090159n30c96b6p8d11ce6343eac7d6@mail.gmail.com> On Thu, May 8, 2008 at 8:28 PM, Katie Edmonds wrote: > From what I can get with the web interface, it seems like parsing the XML > should be ok, though for however many iterations I try, it only seems to be > giving me XML for iteration #1. Could you try running standalone PSI-Blast and see how the XML output compares? > I'm still trying to figure out how to run subsequent iterations of PSI-BLAST > with your patch. For anyone else interested, see http://bugzilla.open-bio.org/show_bug.cgi?id=2496 > In the web form it seems to keep track of all past iterations with NEXT_I: > > > > > > > > I don't have any idea how similar the qblast interface is to the web > interface, though. > > Thanks, > > Katie You are talking about multiple iterations of results from a single query? Looking at the example output I got yesterday, there is indeed only one iteration present. I know that multiple queries in classic blast or RPS-Blast get returned in the XML file as different iterations, which may be complicating things. Would you like to email the NCBI directly and ask if PSI-Blast is intended to be used by a program via the QBLAST API CGI page? They may have some tips. 
Peter From mmokrejs at ribosome.natur.cuni.cz Sun May 11 18:48:41 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 12 May 2008 00:48:41 +0200 Subject: [BioPython] blastall does not flush buffers due to biopython buffering? Message-ID: <482777C9.60500@ribosome.natur.cuni.cz> Hi, when I try to use Bio/Blast/NCBIStandalone blast sometimes the process hangs and sometimes it works (tested from Unix shell and via Apache mod_python). I see blastall process in the list of system processes, attaching strace(1) to it shows that it did print some line from the result output, but somewhat does not continue to write out the buffers (you know that at the end of blast output is the summary stats ...;). I believe that is because the consuming process did not read yet the output already written. Effectively, blastall gets blocked due to biopython. I see in the stacktrace of a killed process: print ''.join(_error_info.readlines()) File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines lines = self._saved + self._handle.readlines(*args,**keywds) KeyboardInterrupt $ Currently, there is in CVS: def blastall(blastcmd, program, database, infile, align_view='7', **keywds): """blastall(blastcmd, program, database, infile, align_view='7', **keywds) -> read, error Undohandles ... w, r, e = os.popen3(" ".join([blastcmd] + params)) w.close() return File.UndoHandle(r), File.UndoHandle(e) I did not study yet Bio/File.py but let me say that running just the following works fine for me: >>> import os >>> w, r, e = os.popen3('/usr/bin/blastall -p blastn -d /home/mmokrejs/a.fa -i /tmp/bl_FCOri7fa -m 0 -S 1 -e 1000 -W 4 -E 1 -G 1') >>> print ''.join(r.readlines()) BLASTN 2.2.18 [Mar-02-2008] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), [...] 
--> WORKS >>> print ''.join(e.readlines()) >>> I have found that first I have to read from STDOUT of blastall and only afterwards I may try to read from its STDERR. Otherwise, readline() or readlines() get blocked in the "same way" although the os.popen3() approach works otherwise. Is there a way to ensure no output is buffered in python? Something like 'man perlopentut' would be helpful. ;-) Why is the File.UndoHandle() used here at all? Thanks for clarification, Martin From mjldehoon at yahoo.com Sun May 11 22:11:39 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 11 May 2008 19:11:39 -0700 (PDT) Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <482777C9.60500@ribosome.natur.cuni.cz> Message-ID: <558592.90782.qm@web62404.mail.re1.yahoo.com> Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on. On a related note, the UndoHandle works by saving all lines that were read. Particularly for large Blast files, that is not what one would like to do. So if there is no strong reason for returning a UndoHandle, I'd be in favor of simply returning the handle directly. --Michiel.
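
The hang described in this thread is consistent with a full pipe buffer: if the child process fills one pipe (say stdout) while the parent is blocked reading the other (stderr), neither side can make progress. The following is a minimal sketch, assuming a current Python where the subprocess module is available and using a stand-in child process rather than blastall itself, of draining both streams together.

```python
import subprocess
import sys

# Stand-in for blastall (an assumption for this example): a child process
# that writes a lot to stdout and a single line to stderr.
child_code = (
    "import sys\n"
    "sys.stdout.write('x' * 200000)\n"
    "sys.stderr.write('warning: example\\n')\n"
)
cmd = [sys.executable, "-c", child_code]

process = subprocess.Popen(
    cmd,
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# communicate() reads stdout and stderr together until the child exits,
# so a full pipe on one stream cannot stall the other.
out, err = process.communicate()

print(len(out), "bytes on stdout")
print(err.decode().strip())
```

With a real BLAST run, the cmd list would hold the program path and its options instead of the stand-in. Note that communicate() keeps the whole output in memory, which may matter for very large result files.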
From mmokrejs at ribosome.natur.cuni.cz Mon May 12 06:10:07 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 12 May 2008 12:10:07 +0200 Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <558592.90782.qm@web62404.mail.re1.yahoo.com> References: <558592.90782.qm@web62404.mail.re1.yahoo.com> Message-ID: <4828177F.9010009@ribosome.natur.cuni.cz> Michiel de Hoon wrote: > Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on. Maybe it is related to it, but sometimes blast process is probably close to exit: mmokrejs 3343 3329 0 10:37 pts/2 00:00:00 [blastall] But sometimes I see it is still running with all its arguments and I can attach to it by strace(1). So there are two issues. I have hacked NCBIStandalone.blastall() like this: + print "Executing %s" % " ".join([blastcmd] + params) w, r, e = os.popen3(" ".join([blastcmd] + params)) w.close() return File.UndoHandle(r), File.UndoHandle(e) $ python test38a.py Testing Bio.Blast.NCBIStandalone.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_8zjEnKfa -m 0 -M NUC4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10 BLASTN 2.2.18 [Mar-02-2008] ... 
$ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_TR0YHefa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results Traceback (most recent call last): File "test38b.py", line 15, in print ''.join(_error_info.readlines()) File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines lines = self._saved + self._handle.readlines(*args,**keywds) KeyboardInterrupt $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_meaeD7fa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_MRrSAWfa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_VtMzeDfa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... So, it is not much reproducible. However, it is clear it has something to do with the length of the output. As I have already said, I saw by strace(1) many time that blastall did not write all its output. I propose to ensure that the open3() or equivalent does not use buffered outputs ("man perlopentut") and the UndoHandle avoided altogether. $ cat test38a.py #! 
/usr/bin/env python from sys import path, argv path.append('..') from Bio.Blast import NCBIStandalone print "Testing Bio.Blast.NCBIStandalone.blastall() functionality" _we_got_some_results = 0 _blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', '/tmp/sequence_nucleic_acids_all.fa', '/tmp/blast_8zjEnKfa', matrix='NUC4.2', wordsize=4, gap_open=2, gap_extend=1, strands=1, alignments=9999999, descriptions=999, expectation=1, align_view=0) while 1: _line = _blast_out.readline() if _line: print _line, _we_got_some_results = 1 else: break if _we_got_some_results: while 1: _line = _error_info.readline() if _line: print _line, else: break $ cat test38b.py #! /usr/bin/env python import os from sys import path, argv path.append('..') print "Testing blasttest.blastall() functionality" import blasttest _blast_out, _error_info, _blast_file = blasttest.blastall('/tmp/sequence_nucleic_acids_all.fa', 'CCGCCGTCGCGGGCAGTGTCTAGCCAGGCCTTGACAAGCTA', '/tmp', 'sequence') # we have to read first from stdout of blast otherwise reading from stderr blocks print "Fetching blast results" print ''.join(_blast_out.readlines()) print "Fetching blast error messages" print ''.join(_error_info.readlines()) os.remove(_blast_file) $ cat blasttest.py #! 
/usr/bin/env python import os import file_io import tempfile from Bio.Blast import NCBIStandalone def blastall(blast_db, blast_query_string, tmpdir, mode, align_view=0, matrix='NUC.4.2'): _fd, _blast_file = tempfile.mkstemp(suffix='fa', prefix='blast_', dir=tmpdir, text=False) if blast_query_string[0] != '>': os.write(_fd, '>your_query\n' + blast_query_string.replace('\n','') + '\n') else: os.write(_fd, blast_query_string + '\n') os.close(_fd) del(_fd) if not file_io.file_size(_blast_file): raise ValueError, "No input sequence provided by user" return None, None if mode == 'sequence': _wordsize = 4 _gap_open = 1 _gap_extend = 1 _strands = 1 _alignments = 9999999 # -b _descriptions = 999 # -v _expectation = 1000 # -e 1e-20 _blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', blast_db, _blast_file, matrix=matrix, wordsize=_wordsize, gap_open=_gap_open, gap_extend=_gap_extend, strands=_strands, alignments=_alignments, descriptions=_descriptions, expectation=_expectation, align_view=align_view) return _blast_out, _error_info, _blast_file Martin From mmokrejs at ribosome.natur.cuni.cz Mon May 12 09:45:30 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 12 May 2008 15:45:30 +0200 Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <247270.16639.qm@web62406.mail.re1.yahoo.com> References: <247270.16639.qm@web62406.mail.re1.yahoo.com> Message-ID: <482849FA.4010304@ribosome.natur.cuni.cz> Hi, I do not have a patch nor a solution. I justed hacked Bio.Blast.NCBIStandalone to print what it actually does execute to help debugging, nothing else. I will try to think of it more before posting bugzilla. Thanks a have some sleep, ;-) M Michiel de Hoon wrote: > Sorry it is a bit hard to follow which changes exactly you are > proposing to make to Bio.Blast.NCBIStandalone. Maybe it's just me or > maybe because it's late at night. 
Could you open a bug report on > BugZilla and upload a patch there? > > --Michiel. From mjldehoon at yahoo.com Mon May 12 09:39:03 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 12 May 2008 06:39:03 -0700 (PDT) Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <4828177F.9010009@ribosome.natur.cuni.cz> Message-ID: <247270.16639.qm@web62406.mail.re1.yahoo.com> Sorry it is a bit hard to follow which changes exactly you are proposing to make to Bio.Blast.NCBIStandalone. Maybe it's just me or maybe because it's late at night. Could you open a bug report on BugZilla and upload a patch there? --Michiel.
From david.moreira at u-psud.fr Wed May 14 05:26:13 2008 From: david.moreira at u-psud.fr (David Moreira) Date: Wed, 14 May 2008 10:26:13 +0100 Subject: [BioPython] GenBank.NCBIDictionary 'Operation timed out' Message-ID: <482AB035.80508@u-psud.fr> Dear all, I have a problem when using the GenBank.NCBIDictionary to retrieve sequences from GenBank. 
Quite often my script finishes with an error due to a too long connexion time:

urllib2.URLError: >

The code I am using is very simple, I have tried to avoid the problem by waiting 10 seconds before retrying, but it does not work:

def retrieve_taxonomy_seq(acc_number):
    # retrieve the sequence from GenBank
    try:
        ncbi_dict = GenBank.NCBIDictionary(database="nucleotide", format="genbank")
    # exception in case of network problems
    except IOError:
        print "Network down, waiting 10 seconds before continuing"
        time.sleep(10)
        retrieve_taxonomy(acc_number)

I have to use this function about ~100 times. I know that the best would be to have a local data base to avoid Internet connexion problems, but by some practical reasons I prefer to go directly to GenBank for my search.

Do you have any suggestion to avoid this problem?

Best,

David

--
David MOREIRA
Unité d'Ecologie, Systématique et Evolution - UMR CNRS 8079
Université Paris-Sud. Bâtiment 360.
91405 Orsay CEDEX. FRANCE
Tel: 33 1 69 15 76 08
Fax: 33 1 69 15 46 97
http://www.ese.u-psud.fr/microbiologie/

From biopython at maubp.freeserve.co.uk Wed May 14 05:07:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 May 2008 10:07:12 +0100
Subject: [BioPython] GenBank.NCBIDictionary 'Operation timed out'
In-Reply-To: <482AB035.80508@u-psud.fr>
References: <482AB035.80508@u-psud.fr>
Message-ID: <320fb6e00805140207u1e257aafmdecc9741513a48c5@mail.gmail.com>

On Wed, May 14, 2008 at 10:26 AM, David Moreira wrote:
> Dear all,
>
> I have a problem when using the GenBank.NCBIDictionary to retrieve
> sequences from GenBank. Quite often my script finishes with an error due to
> a too long connexion time:
>
> urllib2.URLError: >

That does sound like a network problem. Perhaps you can do some investigation like running ping on the NCBI host when you have issues, or consult your IT helpdesk to see if there are any local issues.

Which version of Biopython are you using? 
In Biopython 1.45 we switched the NCBIDictionary class from using EUtils to using Entrez internally. > I have to use this function about ~100 times. I know that the best would be > to have a local data base to avoid Internet connexion problems, but by some > practical reasons I prefer to go directly to GenBank for my search. > > Do you have any suggestion to avoid this problem? Have you added a local cache to record the sequences on your hard disk once they are downloaded, to avoid re-downloading the same things many times? Peter From biopython at maubp.freeserve.co.uk Fri May 16 11:10:03 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 16 May 2008 16:10:03 +0100 Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool Message-ID: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com> Hello everyone, I'm currently interested in parsing the output from Bill Pearson's FASTA tool in python, and would considering adding support for this to Biopython. See http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Does anyone else on the list care about this file format? If so, do you have any python code you would like to share? Otherwise, would support for the "-m 10" output suffice? This output is specifically intended for parsing rather than being human readable. There are several different options for the output format, some of which looks a bit like the BLAST plain text output. Note that I do not mean the "typical" fasta format where each sequence starts with a ">" line, which was originally introduced as an input file format to the FASTA tools. Biopython already deals with this quite happily. Peter P.S. 
For anyone interested, BioPerl have had support for the human readable FASTA output for a while, and judging from this thread, they added support for the FASTA m10 variant last year: http://bioperl.org/pipermail/bioperl-l/2007-April/025465.html From cjfields at uiuc.edu Fri May 16 11:32:44 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 16 May 2008 10:32:44 -0500 Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool In-Reply-To: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com> References: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com> Message-ID: <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu> Peter, An enhancement request is in place in bugzilla for this, but BioPerl hasn't implemented parsing -m10 yet. As you indicated this shouldn't be too hard to implement; just needs someone with the time to code it up. chris On May 16, 2008, at 10:10 AM, Peter wrote: > ... > P.S. For anyone interested, BioPerl have had support for the human > readable FASTA output for a while, and judging from this thread, they > added support for the FASTA m10 variant last year: > http://bioperl.org/pipermail/bioperl-l/2007-April/025465.html > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Mon May 19 12:21:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 May 2008 17:21:20 +0100 Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool In-Reply-To: <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu> References: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com> <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu> Message-ID: <320fb6e00805190921o4dd3ebb3q39ea967108f77839@mail.gmail.com> On Fri, May 16, 2008 at 4:32 PM, Chris Fields wrote: > Peter, > > An enhancement request is in place in bugzilla for this, but BioPerl hasn't > implemented parsing -m10 yet. As you indicated this shouldn't be too hard > to implement; just needs someone with the time to code it up. Thanks for pointing that out Chris, I had assumed this had already happened without checking bugzilla. For anyone else interested, this is BioPerl Bug 2278 - FASTA m10 output support, http://bugzilla.open-bio.org/show_bug.cgi?id=2278 I'd still like to hear from anyone else using or interested in using the FASTA output in python... Peter From colochera at gmail.com Tue May 20 12:08:22 2008 From: colochera at gmail.com (Raul Guerra) Date: Tue, 20 May 2008 12:08:22 -0400 Subject: [BioPython] Remote PSI-Blast Message-ID: Hello Everybody, I just began using Biopython.I need to do a remote PSI-Blast, and I cannot find how to do it online. Does anybody know how to do it? Is it implemented in BioPython or do I have to implement it? Also, I am trying to do a blastp for a fasta file named fastaStr, but I want to restrict the sarch to the organim Chlamydomonas. 
Every time I run the following code

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Chlamydomonas" [ORGN]')

I get

urllib2.URLError:

from the Biopython code. Any ideas?

Thank you in advance,

David Guerra
Colgate University '09

From peter at maubp.freeserve.co.uk Tue May 20 12:13:40 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 May 2008 17:13:40 +0100
Subject: [BioPython] Remote PSI-Blast
In-Reply-To:
References:
Message-ID: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com>

On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote:
> Hello Everybody,
>
> I just began using Biopython.I need to do a remote PSI-Blast, and I cannot
> find how to do it online.
>
> Does anybody know how to do it? Is it implemented in BioPython or do I have
> to implement it?

I'm not sure if the NCBI have an official way to access the PSI-Blast search online. If you have a look at Bug 2496, we've made some headway in trying to do it anyway, but I would really like some official documentation from the NCBI on if/how they expect this to work.
http://bugzilla.open-bio.org/show_bug.cgi?id=2496

Peter

From biopython at maubp.freeserve.co.uk Tue May 20 12:21:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 May 2008 17:21:12 +0100
Subject: [BioPython] NCBIWWW.qblast time outs
Message-ID: <320fb6e00805200921h557602eet99cd49134d6d068a@mail.gmail.com>

On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote:
> Also,
> I am trying to do a blastp for a fasta file named fastaStr, but I want to
> restrict the sarch to the organim Chlamydomonas.
Everytime I run the > following code > > result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, > entrez_query='"Chlamydomonas" [ORGN]') > > I get > > urllib2.URLError: > > from the Biopresult_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, > entrez_query='"Chlamydomonas" [ORGN]') > > from the Biopython code. Any ideas? You could try running the same query via your web browser, so get an idea of if its just a local networking problem or high server load at the NCBI. It does seem to be working for me right now, but I am using the latest CVS version of Biopython where I updated the CGI URL from http://www.ncbi.nlm.nih.gov/blast/Blast.cgi to http://blast.ncbi.nlm.nih.gov/Blast.cgi - see revision 1.49 for the change: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIWWW.py?cvsroot=biopython This might be worth trying... of course, for big tasks you would be better off installing a local copy of standalone blast, and creating your own blast database from the genome (i.e. the protein fasta file(s) for Chlamydomonas in your case). Peter From betainverse at gmail.com Tue May 20 12:24:08 2008 From: betainverse at gmail.com (Katie Edmonds) Date: Tue, 20 May 2008 12:24:08 -0400 Subject: [BioPython] Remote PSI-Blast In-Reply-To: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com> References: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com> Message-ID: <8e76d5310805200924q5a37e0d0j573d2bad3372b798@mail.gmail.com> I asked NCBI about this, and they (eventually) replied that it's "not officially supported." I have been unable to figure out how to get it to return iterations after the first one. Katie On Tue, May 20, 2008 at 12:13 PM, Peter wrote: > On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote: > > Hello Everybody, > > > > I just began using Biopython.I need to do a remote PSI-Blast, and I > cannot > > find how to do it online. > > > > Does anybody know how to do it? 
Is it implemented in BioPython or do I > have > > to implement it? > > I'm not sure if the NCBI have an official way to access the PSI-Blast > search online. If you have a look at Bug 2496, we've made some headway > in trying to do it anyway, but I would really like some official > documentation from the NCBI on if/how they expect this to work. > http://bugzilla.open-bio.org/show_bug.cgi?id=2496 > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From colochera at gmail.com Tue May 20 19:28:27 2008 From: colochera at gmail.com (Raul Guerra) Date: Tue, 20 May 2008 19:28:27 -0400 Subject: [BioPython] Parstin a remote Blast output Message-ID: Thank you to everyone who replied my last post. I am sorry to bother you again with a question. Thank you in advance for your time. I am trying to parse the output from: result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]') where fastaStr is a string in the fasta format. When I run the following code: b_parser = NCBIWWW.BlastParser() blast_records = b_parser.parse(result_handle) I get the following error: ValueError: Unexpected end of stream. I think that result_handle is a cStringIO.StringI data structure. I thought the code for some reason was trying to invoke the readline() method on the cStringIO.StringI data structure and maybe that was what is causing the error. However, I already saved result_handle.read() in a file and opened it with the open() function, so that I would get a file object with a readline() function. But the code still did not work. I tried to follow the logic of the program and I found that NCBIWWW.qblast() is outputing a XML file, and for some reason NCBIWWW.BlastParser() is expecting a HTML file. That is my guess of what is going wrong. So what I did was to use the parser in NCBIXML. 
So I ran the following result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]') blast_records = NCBIXML.parse(result_handle) and it works fine (at least I do not get errors), but I have no idea on what type of object blast_records is. I tried the following next = blast_records.next() and got the following error: Traceback (most recent call last): File "/home/rguerra/workspace/Summer2008/src/testingBioPython.py", line 28, in next = blast_records.next() File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 574, in parse expat_parser.Parse(text, False) File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement eval("self.%s()" % method) File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in _end_BlastOutput_version self._header.date = self._value.split()[2][1:-1] IndexError: list index out of range I have not been able to understand what is going on here. I just want to parse the results I get from: result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]') Any ideas? Raul Guerra From ibdeno at gmail.com Wed May 21 03:43:42 2008 From: ibdeno at gmail.com (=?ISO-8859-1?Q?Miguel_Ortiz-Lombard=EDa?=) Date: Wed, 21 May 2008 09:43:42 +0200 Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18 Message-ID: Hi, The PSIBlastParser (biopython 1.45) seems not to work with the latest (2.2.18) version of NCBI blastpgp. 
Using the same script/inputs I can successfully run it with blastpgp 2.2.15 but when using 2.2.18 I get this error:

Traceback (most recent call last):
  File "./lpbl.py", line 23, in
    b_record = b_parser.parse(blast_out)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 760, in parse
    self._scanner.feed(handle, self._consumer)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 98, in feed
    self._scan_header(uhandle, consumer)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 208, in _scan_header
    raise ValueError("Invalid header?")
ValueError: Invalid header?

In both cases I had setup align_view='0' so the psi-blast output is plain text and not XML (This is because I don't think there is a PSI-Blast parser working with XML, if there is one, please let me know how to invoke it)

Best regards,

Miguel

--
http://www.pangea.org/mol/spip.php?rubrique2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Je suis de la mauvaise herbe,
Braves gens, braves gens,
Je pousse en liberté
Dans les jardins mal fréquentés!
Georges Brassens

From biopython at maubp.freeserve.co.uk Wed May 21 04:26:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 09:26:37 +0100
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
In-Reply-To:
References:
Message-ID: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com>

On Wed, May 21, 2008 at 8:43 AM, Miguel Ortiz-Lombardía wrote:
> Hi,
>
> The PSIBlastParser (biopython 1.45) seems not to work with the latest
> (2.2.18) version of NCBI blastpgp. Using the same script/inputs I can
> successfully run it with blastpgp 2.2.15 but when using 2.2.18 I get this
> error:
>
> Traceback (most recent call last):
> ...
> ValueError: Invalid header?
> > In both cases I had setup align_view='0' so the psi-blast output is plain > text and not XML (This is because I don't think there is a PSI-Blast parser > working with XML, if there is one, please let me know how to invoke it) Plain text Blast parsing in general is a pain as the NCBI often make minor changes to the file format. If you could file a bug on Bugzilla, with a couple of example input files, I'll try and have a look and see if its an easy fix this time. Matched pairs using blastpgp 2.2.15 and 2.2.18 with the same command line arguments would be especially useful. In the meantime, you could try the Bio.NCBIXML parser on the PSI-Blast XML output. I know that works for blastall and rpsblast, so it may be OK with blastpgp too. Please let us know if this works for you (and we can update our documentation). Thank you, Peter From biopython at maubp.freeserve.co.uk Wed May 21 04:46:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 May 2008 09:46:13 +0100 Subject: [BioPython] Parstin a remote Blast output In-Reply-To: References: Message-ID: <320fb6e00805210146l34fd2cd3q892dd52bbc439eae@mail.gmail.com> On Wed, May 21, 2008 at 12:28 AM, Raul Guerra wrote: > Thank you to everyone who replied my last post. I am sorry to bother you > again with a question. Thank you in advance for your time. > > I am trying to parse the output from: > > result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, > entrez_query='"Arabidopsis thaliana" [ORGN]') > > where fastaStr is a string in the fasta format. As you have discovered, this will return XML by default (since Biopython 1.41). You will get back a handle object (of some sort). > I tried to follow the logic of the program and I found that NCBIWWW.qblast() > is outputing a XML file, and for some reason NCBIWWW.BlastParser() is > expecting a HTML file. That is my guess of what is going wrong. So what I > did was to use the parser in NCBIXML. Well done on working this out. 
Can I ask you why you tried the version using the plain text parser? I thought we'd updated all our documentation on this but perhaps we missed something. See BLAST Chapter of the tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > So I ran the following > > result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, > entrez_query='"Arabidopsis thaliana" [ORGN]') > > blast_records = NCBIXML.parse(result_handle) > > and it works fine (at least I do not get errors), but I have no idea on what > type of object blast_records is. I tried the following Using NCBIXML.parse(result_handle) will return an iterator, but it doesn't actually start parsing the file until you call the next() method, which is usally done in a for loop. > next = blast_records.next() > > and got the following error: > > Traceback (most recent call last): > ... > File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in > _end_BlastOutput_version > self._header.date = self._value.split()[2][1:-1] > IndexError: list index out of range > > I have not been able to understand what is going on here. Sadly the NCBI changed their output format slightly, and Biopython couldn't cope. We've fixed this now (Bug 2499), but you'll have to update your installation. See here for details: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 > I just want to parse the results I get from: > > result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, > entrez_query='"Arabidopsis thaliana" [ORGN]') > > Any ideas? You're very close. 
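For what it is worth, the IndexError in that traceback can be reproduced without Biopython. The failing line, self._value.split()[2][1:-1], expects the BlastOutput_version text to carry a third, bracketed date field, and the "BLASTP 2.2.18+" string now returned by the server has only two fields. The sketch below shows a tolerant way to split such a string. It is only an illustration and not the actual change made for Bug 2499; the function name parse_version_field is made up, and the older-style string is an assumed example of the previous layout.

```python
def parse_version_field(value):
    """Split a BlastOutput_version string into (version, date).

    date is None when the bracketed date field is missing.
    Hypothetical helper, not part of Bio.Blast.NCBIXML.
    """
    fields = value.split()
    version = fields[1] if len(fields) > 1 else None
    # Strip the surrounding square brackets from the date, if present.
    date = fields[2][1:-1] if len(fields) > 2 else None
    return version, date


old_style = "BLASTP 2.2.15 [Oct-15-2006]"  # assumed example of the older layout
new_style = "BLASTP 2.2.18+"               # as returned by the online server

print(parse_version_field(old_style))  # ('2.2.15', 'Oct-15-2006')
print(parse_version_field(new_style))  # ('2.2.18+', None)
```

Indexing fields[2] directly on the new-style string is what raises "list index out of range".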
I suggest updating the NCBIXML file to cope with the current version of BLAST that the NCBI is using online (2.2.18+), and then using the XML parser:

from Bio.Blast import NCBIWWW, NCBIXML
result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]')
for record in NCBIXML.parse(result_handle):
    # Do something with the blast result
    pass

Peter

From biopython at maubp.freeserve.co.uk Wed May 21 07:36:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 12:36:27 +0100
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
In-Reply-To:
References: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com>
Message-ID: <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com>

On Wed, May 21, 2008 at 10:26 AM, Miguel Ortiz-Lombardía wrote:
> Hi Peter,
>
> I will try NCBIXML, thank you.

Good luck :)

> In the mean time, how can I attach a file to bug I'm filing so you have my
> script and an input file? I thought this possibility existed, but I can't
> find it right now.

You have to file the bug, and then go back and add an attachment as a second step. It is a little annoying/confusing.

Does anyone know if bugzilla can be configured to allow attachments when filing a bug?

Peter

From sdavis2 at mail.nih.gov Wed May 21 08:08:57 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 21 May 2008 08:08:57 -0400
Subject: [BioPython] Adding new database types to EUtils
In-Reply-To: <200712041449.39100.luca.beltrame@unimi.it>
References: <200712041119.46997.luca.beltrame@unimi.it> <200712041221.16990.luca.beltrame@unimi.it> <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> <200712041449.39100.luca.beltrame@unimi.it>
Message-ID: <264855a00805210508v6c9945e2j2c7d638a27a0e61b@mail.gmail.com>

On Tue, Dec 4, 2007 at 9:49 AM, Luca Beltrame wrote:
> On Tuesday 04 December 2007 14:35:13 you wrote:
>
>> release.
I have thought about making a python-based version, but I find R >> a much more compelling framework for statistical computing and array-based > > I think it is mostly a matter of personal preference. I turned to Python (but > I have been using GEOquery in the past) because I like the language more than > R. > >> Metadata (and not values), then URLs can be constructed against their web > > I guess I did not make the statement clear enough in my original mail. Yes, I > meant to fetch only the metadata because I wanted to gather the experiment > descriptions from all the accessions I had (a rather large number) in order > to look through them without having to query for each one. > I will try looking at the queries via web and see if I can write something > useful (although I still think that, as basic as it is, it would be nice to > have EUtils GEO support in Bio.EUtils, at least for the metadata). > >> I'm not sure that exactly the same functionality is available via Eutils, >> but I think not. > > I have played a bit with EUtils, but I haven't yet been able to use esearch to > work with a GEO accession. Since I have just looked at them briefly, I can't > guarantee it was just a mistake on my part, though. This is a pretty old post, I know, but I thought I would toss in some new information here. We have parsed all of the GEO metadata into a MySQL database which can be queried directly here: http://meltzerlab.nci.nih.gov/apps/geo In addition, for those who want programmatic access, on the same page is a SQLite database file with the same information. It can be used from python (or any other language with SQLite bindings) for SQL-based queries of GEO metadata. 
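As a rough illustration of the SQLite route, here is a minimal sketch using only the sqlite3 module from the Python standard library. It makes no assumption about the schema of the GEO metadata file: it first lists whatever tables the file contains and then peeks at a few rows of one of them. The function names list_tables and first_rows are made up for this example.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of the tables in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def first_rows(db_path, table, limit=5):
    """Return up to `limit` rows of `table` as a list of dictionaries."""
    # Only accept table names that really exist in the file, since the
    # name has to be placed into the SQL text itself.
    if table not in list_tables(db_path):
        raise ValueError("no such table: %r" % table)
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts
    try:
        cursor = conn.execute('SELECT * FROM "%s" LIMIT ?' % table, (limit,))
        return [dict(row) for row in cursor]
    finally:
        conn.close()
```

With the downloaded file saved under some local name (say geo_metadata.sqlite, a hypothetical file name), list_tables("geo_metadata.sqlite") shows what is available before any real queries are written.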
Sean From cjfields at uiuc.edu Wed May 21 08:21:48 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 21 May 2008 07:21:48 -0500 Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18 In-Reply-To: <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com> References: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com> <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com> Message-ID: <95A355CD-8C57-415A-BF28-0E7C7ABDB5CF@uiuc.edu> You have to file the bug first descriptively; after the ticket is generated the 'Create a New Attachment' link should show up. chris On May 21, 2008, at 6:36 AM, Peter wrote: > On Wed, May 21, 2008 at 10:26 AM, Miguel Ortiz-Lombard?a > wrote: >> Hi Peter, >> >> I will try NCBIXML, thank you. > > Good luck :) > >> In the mean time, how can I attach a file to bug I'm filing so you >> have my >> script and an input file? I thought this possibility existed, but I >> can't >> find it right now. > > You have to file the bug, and then go back and add an attachment as a > second step. It is a little annoying/confusing. > > Does anyone knows if bugzilla can be configured to allow attachments > when filing a bug? > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Wed May 21 10:25:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 May 2008 15:25:31 +0100 Subject: [BioPython] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15 In-Reply-To: References: <200805211305.m4LD5Eo3020573@portal.open-bio.org> Message-ID: <320fb6e00805210725n6a73f2lbe381acc1801d5b0@mail.gmail.com> On Wed, May 21, 2008 at 2:53 PM, Miguel Ortiz-Lombard?a wrote: > Peter, I'm sending this to you because bugzilla rejected my e-mail: I don't think you can submit comments to the bugs on bugzilla via email. You have to go to our bugzilla webpage (using the link included in the email) and fill in the comments box. i.e. http://bugzilla.open-bio.org/show_bug.cgi?id=2502 > Hi, > > Just to clarify: the output is not XML but plain text ( because > align_view='0' ) Is the plain text that you want? I had forgotten to check what the default output was - sorry. To cover all cases, I would like to see both the plain text and the XML for both the old and new blast versions. i.e. Four different files using the same input query and database. I would generate these on the command line. It may we that its only a simple change to the plain text (but I suspect not), and that updating the plain text parser would be simple. I fear that the changes to the plain text are bigger (based on what happened with the other blast tools recently) and we would be better off getting Biopython to parse the XML. 
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Wed May 21 18:42:10 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 23:42:10 +0100
Subject: [BioPython] Parstin a remote Blast output
In-Reply-To:
References: <320fb6e00805210146l34fd2cd3q892dd52bbc439eae@mail.gmail.com>
Message-ID: <320fb6e00805211542t42554d06t88d5ac88f8ba4f64@mail.gmail.com>

On Wed, May 21, 2008 at 6:04 PM, Raul Guerra wrote:
> Hi Peter,
>
> Thank you for all your help, it is greatly appreciated. I am sorry that I
> keep asking about how to parse the blast. About why I tried to use the plain
> text parser, I googled around and found a BioPython CookBook (I know now
> that it is an older version of yours) and tried the code in it.

That would explain why you started out doing things "the old way". We could try writing to the website hosting the old tutorial/cookbook and see if they could update it...

> About NCBIXML, I overwrote the file
> "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py" with the one in CVS
> following the link
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython

OK, that should be able to parse the latest XML files from the NCBI except for the slight catch I had forgotten about (see below)

> However, a new problem comes up. When I run the code,
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='%s [ORGN]'%organism)
> for record in NCBIXML.parse(result_handle) :
>     print record
>
> where organism is "Chlamydomonas"
> and fastaStr is
>>NP_013762_Nup116
> MFGVSRGAFPSAT....

Here is a short version using the GI number of this sequence,

from Bio.Blast import NCBIWWW, NCBIXML
handle = NCBIWWW.qblast("blastp", "nr", "6323691", entrez_query="Chlamydomonas[ORGN]")
for record in NCBIXML.parse(handle):
    print record

This seems to work for me.
With hindsight, it would have been safer just to have got the whole of Biopython from CVS, but I think you also need to update this file /usr/lib/python2.5/site-packages/Bio/Blast/Record.py with the latest from CVS: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/Record.py?cvsroot=biopython I'd forgotten about this related change - sorry. Peter From biopython at maubp.freeserve.co.uk Thu May 22 04:24:39 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 May 2008 09:24:39 +0100 Subject: [BioPython] THANK YOU In-Reply-To: References: Message-ID: <320fb6e00805220124o4b00c578je98946b0951d9624@mail.gmail.com> On Thu, May 22, 2008 at 12:49 AM, Raul Guerra wrote: > > Peter, > > Thank you so much. it WORKED!! I did not know about BioPython until a few > weeks back. Before that I had been programming my own parsers and scripts to > access NCBI. I spent 15 weeks trying to figure out a parser for GenBank, and > I still could not get it to work for all cases. I thought that I was going > to spend a lot of time with a parser for Blast, but BioPython works > beautifully. > > Thanks for all the help, > > Raul Thanks for letting me know it work Raul - we got there in the end with the Blast XML parsing ;) I am optimistic that Biopython will have another release by summer 2008,which will be good news for everyone else wanting to use the NCBI's online blast (without having to mess about with the CVS code). If you find any issues with our GenBank parser, please do get in touch again on the mailing list (or via Bugzilla if you find a bug). Peter From florian.koelling at tu-bs.de Thu May 22 10:38:59 2008 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Thu, 22 May 2008 16:38:59 +0200 Subject: [BioPython] problems to fetch pdb files from server using pdb_list Message-ID: <48358583.3050401@tu-bs.de> Hi Folks! 
I tried to download pdb files using pdb_list:
(the second variant didn't work as well:-@)

from Bio.PDB import*

pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT')
#pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z', uncompress='gunzip', pdir='/home/flo/Desktop')

I receive:

flo at AKB-12:~/Desktop/astex$ python test.py
Traceback (most recent call last):
  File "test.py", line 10, in
    pdbl.retrieve_pdb_file('1FAT')
  File "/var/lib/python-support/python2.5/Bio/PDB/PDBList.py", line 190, in retrieve_pdb_file
    os.mkdir(path)
OSError: [Errno 2] No such file or directory: '/pdb/fa'

What am I doing wrong?

Thanx for your Help!

From peter at maubp.freeserve.co.uk Thu May 22 11:14:52 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 May 2008 16:14:52 +0100
Subject: [BioPython] problems to fetch pdb files from server using pdb_list
In-Reply-To: <48358583.3050401@tu-bs.de>
References: <48358583.3050401@tu-bs.de>
Message-ID: <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com>

On Thu, May 22, 2008 at 3:38 PM, Florian Koelling wrote:
>
> Hi Folks!
>
> I tried to download pdb files using pdb_list:
> (the second variant didn't work as well:-@)
>
> from Bio.PDB import*
>
> pdbl=PDBList()
> pdbl.retrieve_pdb_file('1FAT')

This worked for me, it created a subdirectory "fa" inside the current working directory, and saved the PDB there as "pdb1fat.ent".

Are you sure the error you quoted in your email matches this code?

> #pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z',
> uncompress='gunzip', pdir='/home/flo/Desktop')

The above isn't valid python syntax (there is something wrong between the words obsolete and compression).
You probably want something like this: pdbl.retrieve_pdb_file('1FAT', pdir='/home/flo/Desktop') Peter From florian.koelling at tu-bs.de Fri May 23 04:28:40 2008 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Fri, 23 May 2008 10:28:40 +0200 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> Message-ID: <48368038.7010504@tu-bs.de> Humm - I recognised that I was still using Biopython 1.42 (installed via synaptics) - it works fine on 1.45! :-)))) Peter wrote: > On Thu, May 22, 2008 at 3:38 PM, Florian Koelling > wrote: > >> Hi Folks! >> >> I tried to download pdb files using pdb_list: >> (the second variant didn't work as well:-@) >> >> from Bio.PDB import* >> >> pdbl=PDBList() >> pdbl.retrieve_pdb_file('1FAT') >> > > This worked for me, it create a subdirectory "fa" inside the current > working directory, and saved the PDB there as "pdb1fat.ent". > > Are you sure the error you quoted in your email matches this code? > > >> #pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z', >> uncompress='gunzip', pdir='/home/flo/Desktop') >> > > The above isn't valid python syntax (there is something wrong between > the words obsolete and compression). 
You probably want something like > this: > > pdbl.retrieve_pdb_file('1FAT', pdir='/home/flo/Desktop') > > Peter > From peter at maubp.freeserve.co.uk Fri May 23 06:08:53 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 23 May 2008 11:08:53 +0100 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <48368038.7010504@tu-bs.de> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> Message-ID: <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> On Fri, May 23, 2008 at 9:28 AM, Florian Koelling wrote: > Humm - I recognised that I was still using Biopython 1.42 (installed via > synaptics) - it works fine on 1.45! :-)))) One mystery solved - I did wonder if I should have asked which version of Biopython you had ;) Good luck with your PDB analysis. Peter From florian.koelling at tu-bs.de Fri May 23 06:22:11 2008 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Fri, 23 May 2008 12:22:11 +0200 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> Message-ID: <48369AD3.1010005@tu-bs.de> Thanx alot! Is there any possibility to avoid the construction of the obselete folder? I could not find it in the documentation - and the flags 0, 1 and None didn' t work It perturbs my loops while slicing over several Folders. Greetz, Florian Peter wrote: > On Fri, May 23, 2008 at 9:28 AM, Florian Koelling > wrote: > >> Humm - I recognised that I was still using Biopython 1.42 (installed via >> synaptics) - it works fine on 1.45! 
:-)))) >> > > One mystery solved - I did wonder if I should have asked which version > of Biopython you had ;) > > Good luck with your PDB analysis. > > Peter > From biopython at maubp.freeserve.co.uk Fri May 23 06:35:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 May 2008 11:35:59 +0100 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <48369AD3.1010005@tu-bs.de> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> <48369AD3.1010005@tu-bs.de> Message-ID: <320fb6e00805230335m7444d608l4efb3c743f89f461@mail.gmail.com> Florian Koelling wrote: > Is there any possibility to avoid the construction of the obselete > folder? I could not find it in the documentation - and the flags 0, 1 > and None didn' t work You just use the pdir argument to specify the folder you want to use (otherwise it build the subdirectory structre). The function will understand the shorthand of "." for the current directory. Based on your previous example try: from Bio.PDB import * pdbl=PDBList() pdbl.retrieve_pdb_file('1FAT', pdir=".") Peter From gbastian at pasteur.fr Wed May 28 06:18:19 2008 From: gbastian at pasteur.fr (gbastian at pasteur.fr) Date: Wed, 28 May 2008 12:18:19 +0200 (CEST) Subject: [BioPython] biopython sequence retreive from locus_tag Message-ID: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr> Hello all, I am trying to find a way to retreive the sequences from NCBI starting from the locus_tag information. I have no accession number. thanks for your suggestions. Giacomo From colochera at gmail.com Wed May 28 11:41:26 2008 From: colochera at gmail.com (Raul Guerra) Date: Wed, 28 May 2008 11:41:26 -0400 Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast Message-ID: Hi everyone, I was wondering if someone has had the same problem. 
I am running the following code in BioPython. result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Chlamydomonas" [ORGN]',ncbi_gi= True,matrix_name='BLOSUM62', hitlist_size=50) where fastaStr is the fasta string for NP_012855. (When I mention Biopython's pBlast I refer to the code above) The results that I got back are different from the results I get from the pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi (I refer to it when I mention pblast in NCBI's website) The results that I got from NCBI's website are 2 sequences, which were what I was looking for. On the other hand, Biopython gives back as many hits as I specify in the limit. Also in Biopython's pBlast, I only get one of the hits that I get in NCBI's pBlast. I know that the qBlast option in NCNBIWWW has many parameters. def qblast(program, database, sequence, auto_format=None,composition_based_statistics=None, db_genetic_code=None,endpoints=None,entrez_query='(none)', expect=10.0,filter=None,gapcosts=None,genetic_code=None, hitlist_size=50,i_thresh=None,layout=None,lcase_mask=None, matrix_name=None,nucl_penalty=None,nucl_reward=None, other_advanced=None,perc_ident=None,phi_pattern=None, query_file=None,query_believe_defline=None,query_from=None, query_to=None,searchsp_eff=None,service=None,threshold=None, ungapped_alignment=None,word_size=None, alignments=500,alignment_view=None,descriptions=500, entrez_links_new_window=None,expect_low=None,expect_high=None, format_entrez_query=None,format_object=None,format_type='XML', ncbi_gi=None,results_file=None,show_overview=None ): I also know that the pBlast in NCBI's website utilizes a Gap Cost of "Existence: 11 Extension:1". I am not sure how to translate that into the qblast function in Biopython. I am not sure if this is the problem, but it could be that Biopython's pblast and NCBI's pblast have different parameters. 
Thank you for your time,

David

From fahy at chapman.edu Wed May 28 22:16:53 2008
From: fahy at chapman.edu (Michael Fahy)
Date: Wed, 28 May 2008 19:16:53 -0700
Subject: [BioPython] EST Alignment
Message-ID:

I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible through BioPython, for identifying genes by aligning ESTs to genome sequences?

From aloraine at gmail.com Thu May 29 06:18:25 2008
From: aloraine at gmail.com (Ann Loraine)
Date: Thu, 29 May 2008 06:18:25 -0400
Subject: [BioPython] EST Alignment
In-Reply-To:
References:
Message-ID: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com>

Have you looked at blat?

On Wed, May 28, 2008 at 10:16 PM, Michael Fahy wrote:
> I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible through BioPython, for identifying genes by aligning ESTs to genome sequences?
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From sdavis2 at mail.nih.gov Thu May 29 07:33:42 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 29 May 2008 07:33:42 -0400
Subject: [BioPython] EST Alignment
In-Reply-To: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com>
References: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com>
Message-ID: <264855a00805290433x68ee84fn22adbfafd27aef96@mail.gmail.com>

On Thu, May 29, 2008 at 6:18 AM, Ann Loraine wrote:
> Have you looked at blat?

Or GMAP (or many others)?

> On Wed, May 28, 2008 at 10:16 PM, Michael Fahy wrote:
>> I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible through BioPython, for identifying genes by aligning ESTs to genome sequences?

From colochera at gmail.com Thu May 29 08:47:55 2008
From: colochera at gmail.com (Raul Guerra)
Date: Thu, 29 May 2008 08:47:55 -0400
Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast
In-Reply-To:
References:
Message-ID:

Hi everyone,

I was wondering if someone has had the same problem. I am running the following code in BioPython.
result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Chlamydomonas" [ORGN]',ncbi_gi= True,matrix_name='BLOSUM62', hitlist_size=50) where fastaStr is the fasta string for NP_012855. (When I mention Biopython's pBlast I refer to the code above) The results that I got back are different from the results I get from the pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi , if you follow the link click on pblast and do a blast just specifiying the organism and the sequence accession number. The results that I got from NCBI's website are 2 sequences, which were what I was looking for. On the other hand, Biopython gives back as many hits as I specify in the limit. Also in Biopython's pBlast, I only get one of the hits that I get in NCBI's pBlast. I know that the qBlast option in NCNBIWWW has many parameters. def qblast(program, database, sequence, auto_format=None,composition_based_statistics=None, db_genetic_code=None,endpoints=None,entrez_query='(none)', expect=10.0,filter=None,gapcosts=None,genetic_code=None, hitlist_size=50,i_thresh=None,layout=None,lcase_mask=None, matrix_name=None,nucl_penalty=None,nucl_reward=None, other_advanced=None,perc_ident=None,phi_pattern=None, query_file=None,query_believe_defline=None,query_from=None, query_to=None,searchsp_eff=None,service=None,threshold=None, ungapped_alignment=None,word_size=None, alignments=500,alignment_view=None,descriptions=500, entrez_links_new_window=None,expect_low=None,expect_high=None, format_entrez_query=None,format_object=None,format_type='XML', ncbi_gi=None,results_file=None,show_overview=None ): I also know that the pBlast in NCBI's website utilizes a Gap Cost of "Existence: 11 Extension:1". I am not sure how to translate that into the qblast function in Biopython. I am not sure if this is the problem, but it could be that Biopython's pblast and NCBI's pblast have different parameters. 
Thank you for your time, David From biopython at maubp.freeserve.co.uk Thu May 29 17:53:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 May 2008 22:53:58 +0100 Subject: [BioPython] biopython sequence retreive from locus_tag In-Reply-To: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr> References: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr> Message-ID: <320fb6e00805291453x7e2597f8w8554d3a30b716b9@mail.gmail.com> > Hello all, > > I am trying to find a way to retreive the sequences from NCBI > starting from the locus_tag information. > I have no accession number. Hello Giacomo, There is probably more than one solution. My first idea is to try and use the Entrez utilities. Have you tried searching Entrez for your locus tag via this web page? http://www.ncbi.nlm.nih.gov/Entrez/ If this work, then I would try using the Bio.Entrez module in Biopython, and request the desired records in FASTA or GenBank format according to your needs. Peter From biopython at maubp.freeserve.co.uk Thu May 29 18:09:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 May 2008 23:09:59 +0100 Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast In-Reply-To: References: Message-ID: <320fb6e00805291509u14da40b2lb14afa0458210dc8@mail.gmail.com> On Wed, May 28, 2008 at 4:41 PM, Raul Guerra wrote: > Hi everyone, > > I was wondering if someone has had the same problem. I am running the > following code in BioPython. ... > > The results that I got back are different from the results I get from the > pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi (I refer to it > when I mention pblast in NCBI's website) ... > > I also know that the pBlast in NCBI's website utilizes a Gap Cost of > "Existence: 11 Extension:1". I am not sure how to translate that into the > qblast function in Biopython. 
I am not sure if this is the problem, but it > could be that Biopython's pblast and NCBI's pblast have different > parameters. The NCBI have often changed their default options in the BLAST tools, both the online versions and the web interface. As you have seen from the Biopython source code, we set many of the options to specific default values which may no longer match what the NCBI webpages do by default. Its a little frustrating :( For a similar examples from last year, see http://portal.open-bio.org/pipermail/biopython/2007-August/003679.html and http://portal.open-bio.org/pipermail/biopython/2007-August/003693.html You should compare the parameters for the two sets of results, work out where they are different, and decide which settings are best for your problem. For the gap costs, see http://www.ncbi.nlm.nih.gov/BLAST/Doc/node28.html - you can specify this parameter in the Biopython qblast function with the optional gapcosts argument. Peter From betainverse at gmail.com Wed May 7 21:02:42 2008 From: betainverse at gmail.com (Katie Edmonds) Date: Wed, 7 May 2008 17:02:42 -0400 Subject: [BioPython] PSI-BLAST using NCBIWWW Message-ID: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com> Hi, I'm trying to use biopython to run psi-blast on the ncbi server. It looks like qblast cannot be used for psi-blast, and as far as I can tell blast() doesn't work at all anymore, though it has a 'run_psiblast' parameter. Has anyone had success with running psi-blast with biopython who could offer some advice? Thanks, Katie From gbastian at pasteur.fr Wed May 7 21:29:26 2008 From: gbastian at pasteur.fr (gbastian at pasteur.fr) Date: Wed, 7 May 2008 23:29:26 +0200 (CEST) Subject: [BioPython] NCBIXML error Message-ID: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> Dear all, I have been using a script to blast sequences for days without a problem, then, after 2/3 hours it started giving me this error and never worked again...did they change xml blast format? 
this is the error: File "ppinvestigator.py", line 918, in ? pdbs.find_homologous_seqs(int_list) File "ppinvestigator.py", line 122, in find_homologous_seqs data = search_seq(self.sequences[chain][0], interactor_list) File "/home/giacomotion/Desktop/VU-PROJECT/PPI_PDBS/PPINVESTIGATOR/tools.py", line 32, in search_seq blast_record = blast_records.next() File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 576, in parse expat_parser.Parse(text, False) File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement eval("self.%s()" % method) File "", line 0, in ? File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 216, in _end_BlastOutput_version self._header.date = self._value.split()[2][1:-1] IndexError: list index out of range this is my script: #Launch the blastp search result_handle = NCBIWWW.qblast('blastp', 'nr', sequence, hitlist_size=10, perc_ident=10, alignments=10, descriptions=10, entrez_query='"Saccharomyces ce revisiae" [Organism]') #Handle the result file blast_results = result_handle.read() output_filename = 'tmp_blast.xml' save_file = open(output_filename,'w') save_file.write(blast_results) save_file.close() result_handle2 = open(output_filename, 'r') blast_records = NCBIXML.parse(result_handle2) #Initialize the record dictionary record_storage = [] #Iterate on the blast_handle file (only one iteration if one blastp search) blast_record = blast_records.next() this is the xml that I get: blastp BLASTP 2.2.18+ Altschul, Stephen F., Thomas L. Madden, Alejandro A. Sch?ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. nr

4678

unnamed protein product

280

BLOSUM62 10

4678

unnamed protein product

280

1 gi|151567870|pdb|2PM9|B Chain B, Crystal Structure Of Yeast Sec1331 VERTEX ELEMENT OF THE Copii Vesicular Coat 2PM9-B 297 1

577.785

1488 Thanks for any suggestion, Giacomo From biopython at maubp.freeserve.co.uk Thu May 8 08:57:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 May 2008 09:57:50 +0100 Subject: [BioPython] NCBIXML error In-Reply-To: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> Message-ID: <320fb6e00805080157n188865detf9e8a2791ac76530@mail.gmail.com> On Wed, May 7, 2008 at 10:29 PM, wrote: > Dear all, > > I have been using a script to blast sequences for days without a > problem, then, after 2/3 hours it started giving me this error > and never worked again...did they change xml blast format? Looking at the XML snippet, the version is "BLASTP 2.2.18+" (with a plus but no date) so it looks like the may well have updated something. Its possible that they'll make further tweaks in the next couple of days, so it would be worth retesting. The Biopython code expects something like "BLASTP 2.2.12 [Aug-07-2005]", and its the missing date that is causing this error for you. On a different note, if you have really been running BLASTP for days over the internet, it would probably be faster and more efficient to install standalone blast and the nr database on your local machine. You can still ask for XML output, so the parsing side of your script shouldn't change. Peter From winter at biotec.tu-dresden.de Thu May 8 08:47:38 2008 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Thu, 08 May 2008 10:47:38 +0200 Subject: [BioPython] NCBIXML error In-Reply-To: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> Message-ID: <4822BE2A.8080601@biotec.tu-dresden.de> gbastian at pasteur.fr wrote: > Dear all, > > I have been using a script to blast sequences for days without a > problem, then, after 2/3 hours it started giving me this error > and never worked again...did they change xml blast format? 
> > this is the error: > > File "ppinvestigator.py", line 918, in ? > pdbs.find_homologous_seqs(int_list) > File "ppinvestigator.py", line 122, in find_homologous_seqs > data = search_seq(self.sequences[chain][0], interactor_list) > File > "/home/giacomotion/Desktop/VU-PROJECT/PPI_PDBS/PPINVESTIGATOR/tools.py", > line 32, in search_seq > blast_record = blast_records.next() > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 576, > in parse > expat_parser.Parse(text, False) > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 98, > in endElement > eval("self.%s()" % method) > File "", line 0, in ? > File "/usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py", line 216, > in _end_BlastOutput_version > self._header.date = self._value.split()[2][1:-1] > IndexError: list index out of range [...] > this is the xml that I get: > > > "NCBI_BlastOutput.dtd"> > > blastp > BLASTP 2.2.18+ > Altschul, Stephen F., Thomas L. Madden, Alejandro [...] It seems they did change the format. When I run blast locally, it says blastp 2.2.18 [Mar-02-2008] self._header.date = self._value.split()[2][1:-1] works in that case, whereas it chokes on your BLASTP 2.2.18+ as "BLASTP 2.2.18+".split() lacks a third element. Should be easy to fix, shouldn't it? Christof From biopython at maubp.freeserve.co.uk Thu May 8 09:18:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 May 2008 10:18:53 +0100 Subject: [BioPython] NCBIXML error In-Reply-To: <4822BE2A.8080601@biotec.tu-dresden.de> References: <54230.157.99.64.103.1210195766.squirrel@php.pasteur.fr> <4822BE2A.8080601@biotec.tu-dresden.de> Message-ID: <320fb6e00805080218k2f6748a1j7c98886622dd90ac@mail.gmail.com> > It seems they did change the format. When I run blast locally ... > whereas it chokes ... as "BLASTP 2.2.18+".split() lacks a third element. > Should be easy to fix, shouldn't it? > > Christof I came to the same conclusion Christof, and its now fixed in CVS (with a new test case too). 
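
The failure discussed above can be reproduced, and avoided, outside of Biopython. What follows is only a minimal sketch of a tolerant parse of the two version strings quoted in this thread; the helper name split_blast_version is hypothetical, and this is not the change that was committed to CVS.

```python
def split_blast_version(text):
    """Split a BlastOutput_version string into (program, version, date).

    Hypothetical helper for illustration, not the actual Biopython fix.
    The date is None when the string carries no bracketed date field,
    as in "BLASTP 2.2.18+".
    """
    fields = text.split()
    program = fields[0] if fields else None
    version = fields[1] if len(fields) > 1 else None
    date = None
    if len(fields) > 2:
        # Drop the surrounding square brackets, e.g. "[Mar-02-2008]".
        date = fields[2].strip("[]")
    return program, version, date


print(split_blast_version("blastp 2.2.18 [Mar-02-2008]"))
print(split_blast_version("BLASTP 2.2.18+"))
```

The point of the length check is simply that indexing element 2 of the split string is only safe when the date is present, which is the case that raised the IndexError in the traceback.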
Giacomo, if you want to try this you'll need to update your system.
Checking out the latest code from CVS and installing Biopython from
source would be one way. However, you only need to update one file,
replacing /usr/lib/python2.4/site-packages/Bio/Blast/NCBIXML.py with
CVS revision 1.4. If you want, you can just grab this from the web
interface here (once the website is automatically updated in a few
hours):

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython

Please let us know on the mailing list if that works for you (or if
there are still problems).

Peter

From gbastian at pasteur.fr Thu May 8 09:46:00 2008
From: gbastian at pasteur.fr (gbastian at pasteur.fr)
Date: Thu, 8 May 2008 11:46:00 +0200 (CEST)
Subject: [BioPython] NCBIXML error
Message-ID: <63681.157.99.64.103.1210239960.squirrel@php.pasteur.fr>

Hello Peter and Christof,

I just modified the code of NCBIXML.py where it tries to get the
version date information. Now it works.

line 216:
self._header.date = 'Dec-21-2012'

thanks,
Giacomo

From biopython at maubp.freeserve.co.uk Thu May 8 11:37:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 May 2008 12:37:34 +0100
Subject: [BioPython] PSI-BLAST using NCBIWWW
In-Reply-To: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com>
References: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com>
Message-ID: <320fb6e00805080437h2d13eb5co63a8e5f604223f8f@mail.gmail.com>

On Wed, May 7, 2008 at 10:02 PM, Katie Edmonds wrote:
> Hi,
>
> I'm trying to use biopython to run psi-blast on the ncbi server. It looks
> like qblast cannot be used for psi-blast, and as far as I can tell blast()
> doesn't work at all anymore, though it has a 'run_psiblast' parameter. Has
> anyone had success with running psi-blast with biopython who could offer
> some advice?

I've done some investigation of using Bio.Blast.NCBIWWW.qblast() with
psi-blast, and it does seem to be missing the run_psiblast option.
However, I don't think that's the only issue here...

Have you been able to try running standalone psiblast, and parse its
XML output?

Peter

From biopython at maubp.freeserve.co.uk Fri May 9 08:59:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 May 2008 09:59:01 +0100
Subject: [BioPython] PSI-BLAST using NCBIWWW
In-Reply-To: <8e76d5310805081228n1558e475ga774f910db020060@mail.gmail.com>
References: <8e76d5310805071402o77292571oabfe7e5912a826a0@mail.gmail.com>
    <320fb6e00805080437h2d13eb5co63a8e5f604223f8f@mail.gmail.com>
    <8e76d5310805081228n1558e475ga774f910db020060@mail.gmail.com>
Message-ID: <320fb6e00805090159n30c96b6p8d11ce6343eac7d6@mail.gmail.com>

On Thu, May 8, 2008 at 8:28 PM, Katie Edmonds wrote:
> From what I can get with the web interface, it seems like parsing the XML
> should be ok, though for however many iterations I try, it only seems to be
> giving me XML for iteration #1.

Could you try running standalone PSI-Blast and see how the XML output
compares?

> I'm still trying to figure out how to run subsequent iterations of PSI-BLAST
> with your patch.

For anyone else interested, see
http://bugzilla.open-bio.org/show_bug.cgi?id=2496

> In the web form it seems to keep track of all past iterations with NEXT_I:
>
> I don't have any idea how similar the qblast interface is to the web
> interface, though.
>
> Thanks,
>
> Katie

You are talking about multiple iterations of results from a single
query? Looking at the example output I got yesterday, there is indeed
only one iteration present. I know that multiple queries in classic
blast or RPS-Blast get returned in the XML file as different
iterations, which may be complicating things.

Would you like to email the NCBI directly and ask if PSI-Blast is
intended to be used by a program via the QBLAST API CGI page? They
may have some tips.
Peter From mmokrejs at ribosome.natur.cuni.cz Sun May 11 22:48:41 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 12 May 2008 00:48:41 +0200 Subject: [BioPython] blastall does not flush buffers due to biopython buffering? Message-ID: <482777C9.60500@ribosome.natur.cuni.cz> Hi, when I try to use Bio/Blast/NCBIStandalone blast sometimes the process hangs and sometimes it works (tested from Unix shell and via Apache mod_python). I see blastall process in the list of system processes, attaching strace(1) to it shows that it did print some line from the result output, but somewhat does not continue to write out the buffers (you know that at the end of blast output is the summary stats ...;). I believe that is because the consuming process did not read yet the output already written. Effectively, blastall gets blocked due to biopython. I see in the stacktrace of a killed process: print ''.join(_error_info.readlines()) File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines lines = self._saved + self._handle.readlines(*args,**keywds) KeyboardInterrupt $ Currently, there is in CVS: def blastall(blastcmd, program, database, infile, align_view='7', **keywds): """blastall(blastcmd, program, database, infile, align_view='7', **keywds) -> read, error Undohandles ... w, r, e = os.popen3(" ".join([blastcmd] + params)) w.close() return File.UndoHandle(r), File.UndoHandle(e) I did not study yet Bio/File.py but let me say that running just the following works fine for me: >>> import os >>> w, r, e = os.popen3('/usr/bin/blastall -p blastn -d /home/mmokrejs/a.fa -i /tmp/bl_FCOri7fa -m 0 -S 1 -e 1000 -W 4 -E 1 -G 1') >>> print ''.join(r.readlines()) BLASTN 2.2.18 [Mar-02-2008] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), [...] 
--> WORKS >>> print ''.join(e.readlines()) >>> I have found that first I have to read from STDOUT of blastall and only afterwards I may try to read from its STDERR. Otherwise, readline() or readlines() get blocked in the "same way" although the os.popen3() approach works otherwise. Is there a way to ensure no output is buffered in python? Something like 'man perlopentut' would be helpful. ;-) Why is the File.UndoHandle() used here at all? Thanks for clarification, Martin From mjldehoon at yahoo.com Mon May 12 02:11:39 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 11 May 2008 19:11:39 -0700 (PDT) Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <482777C9.60500@ribosome.natur.cuni.cz> Message-ID: <558592.90782.qm@web62404.mail.re1.yahoo.com> Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on. On a related note, the UndoHandle works by saving all lines that were read. Particularly for large Blast files, that is not what one would like to do. So if there is no strong reason for returning a UndoHandle, I'd be in favor of simply returning the handle directly. --Michiel. Martin MOKREJ? wrote: Hi, when I try to use Bio/Blast/NCBIStandalone blast sometimes the process hangs and sometimes it works (tested from Unix shell and via Apache mod_python). I see blastall process in the list of system processes, attaching strace(1) to it shows that it did print some line from the result output, but somewhat does not continue to write out the buffers (you know that at the end of blast output is the summary stats ...;). I believe that is because the consuming process did not read yet the output already written. Effectively, blastall gets blocked due to biopython. 
I see in the stacktrace of a killed process:

    print ''.join(_error_info.readlines())
  File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines
    lines = self._saved + self._handle.readlines(*args,**keywds)
KeyboardInterrupt
$

Currently, there is in CVS:

def blastall(blastcmd, program, database, infile, align_view='7', **keywds):
    """blastall(blastcmd, program, database, infile, align_view='7', **keywds)
    -> read, error Undohandles
    ...
    w, r, e = os.popen3(" ".join([blastcmd] + params))
    w.close()
    return File.UndoHandle(r), File.UndoHandle(e)

I did not study yet Bio/File.py but let me say that running just the following
works fine for me:

>>> import os
>>> w, r, e = os.popen3('/usr/bin/blastall -p blastn -d /home/mmokrejs/a.fa -i /tmp/bl_FCOri7fa -m 0 -S 1 -e 1000 -W 4 -E 1 -G 1')
>>> print ''.join(r.readlines())
BLASTN 2.2.18 [Mar-02-2008]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
[...]
--> WORKS
>>> print ''.join(e.readlines())

>>>

I have found that first I have to read from STDOUT of blastall and only
afterwards I may try to read from its STDERR. Otherwise, readline() or
readlines() get blocked in the "same way" although the os.popen3() approach
works otherwise.

Is there a way to ensure no output is buffered in python? Something like
'man perlopentut' would be helpful. ;-)
Why is the File.UndoHandle() used here at all?

Thanks for clarification,
Martin
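
One common cause of the kind of hang described in this thread is a full pipe: the child blocks writing to one stream while the parent waits for end-of-file on the other. The sketch below is only an illustration of that situation and one way around it. It uses the subprocess module rather than the os.popen3 call quoted above, and a small Python child process stands in for blastall, so none of it is taken from Bio.Blast.NCBIStandalone.

```python
import subprocess
import sys

# Stand-in for blastall (an assumption for illustration): a child that
# writes more to stdout than a typical OS pipe buffer holds, plus a short
# message on stderr.
CHILD_CODE = (
    "import sys\n"
    "sys.stdout.write('x' * 200000)\n"
    "sys.stderr.write('warning: example\\n')\n"
)


def run_and_collect(args):
    """Run a command and return (stdout text, stderr text, return code).

    communicate() drains both pipes until end-of-file, so the child cannot
    stall on a full pipe while the parent is waiting on the other one.
    """
    proc = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    out, err = proc.communicate()
    return out, err, proc.returncode


out, err, code = run_and_collect([sys.executable, "-c", CHILD_CODE])
print(len(out), repr(err), code)
```

For a real search the argument list would hold the blastall path and the flags shown in the "Executing ..." lines of this thread, one list element per token, which also avoids building a shell command string by hand.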
From mmokrejs at ribosome.natur.cuni.cz Mon May 12 10:10:07 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 12 May 2008 12:10:07 +0200 Subject: [BioPython] blastall does not flush buffers due to biopython buffering? In-Reply-To: <558592.90782.qm@web62404.mail.re1.yahoo.com> References: <558592.90782.qm@web62404.mail.re1.yahoo.com> Message-ID: <4828177F.9010009@ribosome.natur.cuni.cz> Michiel de Hoon wrote: > Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on. Maybe it is related to it, but sometimes blast process is probably close to exit: mmokrejs 3343 3329 0 10:37 pts/2 00:00:00 [blastall] But sometimes I see it is still running with all its arguments and I can attach to it by strace(1). So there are two issues. I have hacked NCBIStandalone.blastall() like this: + print "Executing %s" % " ".join([blastcmd] + params) w, r, e = os.popen3(" ".join([blastcmd] + params)) w.close() return File.UndoHandle(r), File.UndoHandle(e) $ python test38a.py Testing Bio.Blast.NCBIStandalone.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_8zjEnKfa -m 0 -M NUC4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10 BLASTN 2.2.18 [Mar-02-2008] ... 
$ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_TR0YHefa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results Traceback (most recent call last): File "test38b.py", line 15, in print ''.join(_error_info.readlines()) File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines lines = self._saved + self._handle.readlines(*args,**keywds) KeyboardInterrupt $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_meaeD7fa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_MRrSAWfa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... $ python test38b.py Testing blasttest.blastall() functionality Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_VtMzeDfa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999 Fetching blast results BLASTN 2.2.18 [Mar-02-2008] ... So, it is not much reproducible. However, it is clear it has something to do with the length of the output. As I have already said, I saw by strace(1) many time that blastall did not write all its output. I propose to ensure that the open3() or equivalent does not use buffered outputs ("man perlopentut") and the UndoHandle avoided altogether. $ cat test38a.py #! 
/usr/bin/env python
from sys import path, argv
path.append('..')
from Bio.Blast import NCBIStandalone

print "Testing Bio.Blast.NCBIStandalone.blastall() functionality"
_we_got_some_results = 0
_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn',
    '/tmp/sequence_nucleic_acids_all.fa', '/tmp/blast_8zjEnKfa',
    matrix='NUC4.2', wordsize=4, gap_open=2, gap_extend=1, strands=1,
    alignments=9999999, descriptions=999, expectation=1, align_view=0)
while 1:
    _line = _blast_out.readline()
    if _line:
        print _line,
        _we_got_some_results = 1
    else:
        break

if _we_got_some_results:
    while 1:
        _line = _error_info.readline()
        if _line:
            print _line,
        else:
            break

$ cat test38b.py
#! /usr/bin/env python
import os
from sys import path, argv
path.append('..')

print "Testing blasttest.blastall() functionality"
import blasttest
_blast_out, _error_info, _blast_file = blasttest.blastall(
    '/tmp/sequence_nucleic_acids_all.fa',
    'CCGCCGTCGCGGGCAGTGTCTAGCCAGGCCTTGACAAGCTA', '/tmp', 'sequence')
# we have to read first from stdout of blast otherwise reading from stderr blocks
print "Fetching blast results"
print ''.join(_blast_out.readlines())
print "Fetching blast error messages"
print ''.join(_error_info.readlines())
os.remove(_blast_file)

$ cat blasttest.py
#! /usr/bin/env python
import os
import file_io
import tempfile
from Bio.Blast import NCBIStandalone

def blastall(blast_db, blast_query_string, tmpdir, mode, align_view=0, matrix='NUC.4.2'):
    _fd, _blast_file = tempfile.mkstemp(suffix='fa', prefix='blast_', dir=tmpdir, text=False)
    if blast_query_string[0] != '>':
        os.write(_fd, '>your_query\n' + blast_query_string.replace('\n','') + '\n')
    else:
        os.write(_fd, blast_query_string + '\n')
    os.close(_fd)
    del(_fd)
    if not file_io.file_size(_blast_file):
        raise ValueError, "No input sequence provided by user"
        return None, None
    if mode == 'sequence':
        _wordsize = 4
        _gap_open = 1
        _gap_extend = 1
        _strands = 1
        _alignments = 9999999 # -b
        _descriptions = 999 # -v
        _expectation = 1000 # -e 1e-20
    _blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn',
        blast_db, _blast_file, matrix=matrix, wordsize=_wordsize,
        gap_open=_gap_open, gap_extend=_gap_extend, strands=_strands,
        alignments=_alignments, descriptions=_descriptions,
        expectation=_expectation, align_view=align_view)
    return _blast_out, _error_info, _blast_file

Martin

From mmokrejs at ribosome.natur.cuni.cz Mon May 12 13:45:30 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Mon, 12 May 2008 15:45:30 +0200
Subject: [BioPython] blastall does not flush buffers due to biopython buffering?
In-Reply-To: <247270.16639.qm@web62406.mail.re1.yahoo.com>
References: <247270.16639.qm@web62406.mail.re1.yahoo.com>
Message-ID: <482849FA.4010304@ribosome.natur.cuni.cz>

Hi,
I do not have a patch nor a solution. I just hacked
Bio.Blast.NCBIStandalone to print what it actually executes, to help
debugging, nothing else. I will try to think about it more before posting
to bugzilla.

Thanks and have some sleep, ;-)
M

Michiel de Hoon wrote:
> Sorry it is a bit hard to follow which changes exactly you are
> proposing to make to Bio.Blast.NCBIStandalone. Maybe it's just me or
> maybe because it's late at night.
> Could you open a bug report on BugZilla and upload a patch there?
>
> --Michiel.

From mjldehoon at yahoo.com Mon May 12 13:39:03 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 12 May 2008 06:39:03 -0700 (PDT)
Subject: [BioPython] blastall does not flush buffers due to biopython buffering?
In-Reply-To: <4828177F.9010009@ribosome.natur.cuni.cz>
Message-ID: <247270.16639.qm@web62406.mail.re1.yahoo.com>

Sorry it is a bit hard to follow which changes exactly you are proposing to make to Bio.Blast.NCBIStandalone. Maybe it's just me or maybe because it's late at night. Could you open a bug report on BugZilla and upload a patch there?

--Michiel.

Martin MOKREJŠ wrote:
Michiel de Hoon wrote:
> Can you show an example script that causes the UndoHandle to block? Just to understand better what is going on.

Maybe it is related to it, but sometimes blast process is probably close to exit:

mmokrejs 3343 3329 0 10:37 pts/2 00:00:00 [blastall]

But sometimes I see it is still running with all its arguments and I can attach to it by strace(1). So there are two issues.

I have hacked NCBIStandalone.blastall() like this:

+ print "Executing %s" % " ".join([blastcmd] + params)
  w, r, e = os.popen3(" ".join([blastcmd] + params))
  w.close()
  return File.UndoHandle(r), File.UndoHandle(e)

$ python test38a.py
Testing Bio.Blast.NCBIStandalone.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_8zjEnKfa -m 0 -M NUC4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10
BLASTN 2.2.18 [Mar-02-2008]
...
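
A minimal sketch of one way to sidestep this kind of hang, assuming the cause is the classic pipe deadlock: the child fills one pipe (say stderr) while the parent is blocked reading the other. This is not the Bio.Blast.NCBIStandalone code; it uses the standard library subprocess module, run_and_collect is a hypothetical helper name, and a small Python child process stands in for blastall.

```python
import subprocess
import sys

def run_and_collect(cmd):
    """Run cmd (a list of arguments) and return (returncode, stdout, stderr).

    communicate() closes the child's stdin, reads both output pipes until
    EOF and then waits for the child, so neither pipe can fill up while we
    are blocked on the other one.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

# Stand-in for blastall: a child that writes 20000 lines to each stream,
# which is more than a typical pipe buffer can hold.
child_code = (
    "import sys\n"
    "for i in range(20000):\n"
    "    sys.stdout.write('out %05d\\n' % i)\n"
    "    sys.stderr.write('err %05d\\n' % i)\n"
)
returncode, out, err = run_and_collect([sys.executable, "-c", child_code])
```

With the real program the argument list would be the blastall command line shown in the "Executing ..." output above; reading stdout to the end and only then stderr, as the test scripts do, can block once the unread stream's buffer is full.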
$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_TR0YHefa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
Traceback (most recent call last):
  File "test38b.py", line 15, in <module>
    print ''.join(_error_info.readlines())
  File "/usr/lib/python2.5/site-packages/Bio/File.py", line 37, in readlines
    lines = self._saved + self._handle.readlines(*args,**keywds)
KeyboardInterrupt

$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_meaeD7fa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 2 -b 10 -v 10
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...

$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_MRrSAWfa -m 0 -M NUC.4.2 -S 1 -e 1 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...

$ python test38b.py
Testing blasttest.blastall() functionality
Executing /usr/bin/blastall -p blastn -d /tmp/sequence_nucleic_acids_all.fa -i /tmp/blast_VtMzeDfa -m 0 -M NUC.4.2 -S 1 -e 1000 -W 4 -E 1 -G 1 -b 9999999 -v 999
Fetching blast results
BLASTN 2.2.18 [Mar-02-2008]
...

So, it is not very reproducible. However, it is clear it has something to do with the length of the output. As I have already said, I saw by strace(1) many times that blastall did not write all its output. I propose to ensure that the open3() or equivalent does not use buffered outputs ("man perlopentut") and that the UndoHandle is avoided altogether.

$ cat test38a.py
#! /usr/bin/env python
from sys import path, argv
path.append('..')
from Bio.Blast import NCBIStandalone

print "Testing Bio.Blast.NCBIStandalone.blastall() functionality"

_we_got_some_results = 0
_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', '/tmp/sequence_nucleic_acids_all.fa', '/tmp/blast_8zjEnKfa', matrix='NUC4.2', wordsize=4, gap_open=2, gap_extend=1, strands=1, alignments=9999999, descriptions=999, expectation=1, align_view=0)
while 1:
    _line = _blast_out.readline()
    if _line:
        print _line,
        _we_got_some_results = 1
    else:
        break

if _we_got_some_results:
    while 1:
        _line = _error_info.readline()
        if _line:
            print _line,
        else:
            break

$ cat test38b.py
#! /usr/bin/env python
import os
from sys import path, argv
path.append('..')

print "Testing blasttest.blastall() functionality"
import blasttest
_blast_out, _error_info, _blast_file = blasttest.blastall('/tmp/sequence_nucleic_acids_all.fa', 'CCGCCGTCGCGGGCAGTGTCTAGCCAGGCCTTGACAAGCTA', '/tmp', 'sequence')

# we have to read first from stdout of blast otherwise reading from stderr blocks
print "Fetching blast results"
print ''.join(_blast_out.readlines())
print "Fetching blast error messages"
print ''.join(_error_info.readlines())
os.remove(_blast_file)

$ cat blasttest.py
#! /usr/bin/env python
import os
import file_io
import tempfile
from Bio.Blast import NCBIStandalone

def blastall(blast_db, blast_query_string, tmpdir, mode, align_view=0, matrix='NUC.4.2'):
    _fd, _blast_file = tempfile.mkstemp(suffix='fa', prefix='blast_', dir=tmpdir, text=False)
    if blast_query_string[0] != '>':
        os.write(_fd, '>your_query\n' + blast_query_string.replace('\n','') + '\n')
    else:
        os.write(_fd, blast_query_string + '\n')
    os.close(_fd)
    del(_fd)

    if not file_io.file_size(_blast_file):
        raise ValueError, "No input sequence provided by user"
        return None, None

    if mode == 'sequence':
        _wordsize = 4
        _gap_open = 1
        _gap_extend = 1
        _strands = 1
        _alignments = 9999999 # -b
        _descriptions = 999 # -v
        _expectation = 1000 # -e 1e-20
        _blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', blast_db, _blast_file, matrix=matrix, wordsize=_wordsize, gap_open=_gap_open, gap_extend=_gap_extend, strands=_strands, alignments=_alignments, descriptions=_descriptions, expectation=_expectation, align_view=align_view)
        return _blast_out, _error_info, _blast_file

Martin

From david.moreira at u-psud.fr Wed May 14 09:26:13 2008
From: david.moreira at u-psud.fr (David Moreira)
Date: Wed, 14 May 2008 10:26:13 +0100
Subject: [BioPython] GenBank.NCBIDictionary 'Operation timed out'
Message-ID: <482AB035.80508@u-psud.fr>

Dear all,

I have a problem when using the GenBank.NCBIDictionary to retrieve sequences from GenBank.
Quite often my script finishes with an error due to an overly long connection time:

urllib2.URLError: >

The code I am using is very simple. I have tried to avoid the problem by waiting 10 seconds before retrying, but it does not work:

def retrieve_taxonomy_seq(acc_number):
    # retrieve the sequence from GenBank
    try:
        ncbi_dict = GenBank.NCBIDictionary(database="nucleotide", format="genbank")
    # exception in case of network problems
    except IOError:
        print "Network down, waiting 10 seconds before continuing"
        time.sleep(10)
        retrieve_taxonomy(acc_number)

I have to use this function about 100 times. I know that the best would be to have a local database to avoid Internet connection problems, but for some practical reasons I prefer to go directly to GenBank for my search.

Do you have any suggestion to avoid this problem?

Best,

David

--
David MOREIRA
Unité d'Ecologie, Systématique et Evolution - UMR CNRS 8079
Université Paris-Sud. Bâtiment 360. 91405 Orsay CEDEX. FRANCE
Tel: 33 1 69 15 76 08
Fax: 33 1 69 15 46 97
http://www.ese.u-psud.fr/microbiologie/

From biopython at maubp.freeserve.co.uk Wed May 14 09:07:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 May 2008 10:07:12 +0100
Subject: [BioPython] GenBank.NCBIDictionary 'Operation timed out'
In-Reply-To: <482AB035.80508@u-psud.fr>
References: <482AB035.80508@u-psud.fr>
Message-ID: <320fb6e00805140207u1e257aafmdecc9741513a48c5@mail.gmail.com>

On Wed, May 14, 2008 at 10:26 AM, David Moreira wrote:
> Dear all,
>
> I have a problem when using the GenBank.NCBIDictionary to retrieve
> sequences from GenBank. Quite often my script finishes with an error due to
> an overly long connection time:
>
> urllib2.URLError: >

That does sound like a network problem. Perhaps you can do some investigation like running ping on the NCBI host when you have issues, or consult your IT helpdesk to see if there are any local issues.

Which version of Biopython are you using?
In Biopython 1.45 we switched the NCBIDictionary class from using EUtils to using Entrez internally.

> I have to use this function about 100 times. I know that the best would be
> to have a local database to avoid Internet connection problems, but for some
> practical reasons I prefer to go directly to GenBank for my search.
>
> Do you have any suggestion to avoid this problem?

Have you added a local cache to record the sequences on your hard disk once they are downloaded, to avoid re-downloading the same things many times?

Peter

From biopython at maubp.freeserve.co.uk Fri May 16 15:10:03 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 May 2008 16:10:03 +0100
Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool
Message-ID: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com>

Hello everyone,

I'm currently interested in parsing the output from Bill Pearson's FASTA tool in python, and would consider adding support for this to Biopython. See http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

Does anyone else on the list care about this file format? If so, do you have any python code you would like to share? Otherwise, would support for the "-m 10" output suffice? This output is specifically intended for parsing rather than being human readable. There are several different options for the output format, some of which look a bit like the BLAST plain text output.

Note that I do not mean the "typical" fasta format where each sequence starts with a ">" line, which was originally introduced as an input file format to the FASTA tools. Biopython already deals with this quite happily.

Peter

P.S.
For anyone interested, BioPerl have had support for the human readable FASTA output for a while, and judging from this thread, they added support for the FASTA m10 variant last year:
http://bioperl.org/pipermail/bioperl-l/2007-April/025465.html

From cjfields at uiuc.edu Fri May 16 15:32:44 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 16 May 2008 10:32:44 -0500
Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool
In-Reply-To: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com>
References: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com>
Message-ID: <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu>

Peter,

An enhancement request is in place in bugzilla for this, but BioPerl hasn't implemented parsing -m10 yet. As you indicated this shouldn't be too hard to implement; just needs someone with the time to code it up.

chris

On May 16, 2008, at 10:10 AM, Peter wrote:
> ...
> P.S. For anyone interested, BioPerl have had support for the human
> readable FASTA output for a while, and judging from this thread, they
> added support for the FASTA m10 variant last year:
> http://bioperl.org/pipermail/bioperl-l/2007-April/025465.html
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From biopython at maubp.freeserve.co.uk Mon May 19 16:21:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 May 2008 17:21:20 +0100
Subject: [BioPython] Parsing the pairwise alignments from the FASTA tool
In-Reply-To: <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu>
References: <320fb6e00805160810s75f27329yc4fa8d2a1676a1dd@mail.gmail.com> <3FE79510-1A20-4863-B2F1-C9AFA13572B7@uiuc.edu>
Message-ID: <320fb6e00805190921o4dd3ebb3q39ea967108f77839@mail.gmail.com>

On Fri, May 16, 2008 at 4:32 PM, Chris Fields wrote:
> Peter,
>
> An enhancement request is in place in bugzilla for this, but BioPerl hasn't
> implemented parsing -m10 yet. As you indicated this shouldn't be too hard
> to implement; just needs someone with the time to code it up.

Thanks for pointing that out Chris, I had assumed this had already happened without checking bugzilla. For anyone else interested, this is BioPerl Bug 2278 - FASTA m10 output support, http://bugzilla.open-bio.org/show_bug.cgi?id=2278

I'd still like to hear from anyone else using or interested in using the FASTA output in python...

Peter

From colochera at gmail.com Tue May 20 16:08:22 2008
From: colochera at gmail.com (Raul Guerra)
Date: Tue, 20 May 2008 12:08:22 -0400
Subject: [BioPython] Remote PSI-Blast
Message-ID:

Hello Everybody,

I just began using Biopython. I need to do a remote PSI-Blast, and I cannot find how to do it online.

Does anybody know how to do it? Is it implemented in BioPython or do I have to implement it?

Also, I am trying to do a blastp for a fasta file named fastaStr, but I want to restrict the search to the organism Chlamydomonas.
Every time I run the following code

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Chlamydomonas" [ORGN]')

I get

urllib2.URLError:

from the Biopython code. Any ideas?

Thank you in advance,
David Guerra
Colgate University '09

From peter at maubp.freeserve.co.uk Tue May 20 16:13:40 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 May 2008 17:13:40 +0100
Subject: [BioPython] Remote PSI-Blast
In-Reply-To:
References:
Message-ID: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com>

On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote:
> Hello Everybody,
>
> I just began using Biopython. I need to do a remote PSI-Blast, and I cannot
> find how to do it online.
>
> Does anybody know how to do it? Is it implemented in BioPython or do I have
> to implement it?

I'm not sure if the NCBI have an official way to access the PSI-Blast search online. If you have a look at Bug 2496, we've made some headway in trying to do it anyway, but I would really like some official documentation from the NCBI on if/how they expect this to work.
http://bugzilla.open-bio.org/show_bug.cgi?id=2496

Peter

From biopython at maubp.freeserve.co.uk Tue May 20 16:21:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 May 2008 17:21:12 +0100
Subject: [BioPython] NCBIWWW.qblast time outs
Message-ID: <320fb6e00805200921h557602eet99cd49134d6d068a@mail.gmail.com>

On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote:
> Also,
> I am trying to do a blastp for a fasta file named fastaStr, but I want to
> restrict the search to the organism Chlamydomonas.
> Every time I run the
> following code
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Chlamydomonas" [ORGN]')
>
> I get
>
> urllib2.URLError:
>
> from the Biopython code. Any ideas?

You could try running the same query via your web browser, to get an idea of whether it's just a local networking problem or high server load at the NCBI.

It does seem to be working for me right now, but I am using the latest CVS version of Biopython where I updated the CGI URL from http://www.ncbi.nlm.nih.gov/blast/Blast.cgi to http://blast.ncbi.nlm.nih.gov/Blast.cgi - see revision 1.49 for the change:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIWWW.py?cvsroot=biopython

This might be worth trying... of course, for big tasks you would be better off installing a local copy of standalone blast, and creating your own blast database from the genome (i.e. the protein fasta file(s) for Chlamydomonas in your case).

Peter

From betainverse at gmail.com Tue May 20 16:24:08 2008
From: betainverse at gmail.com (Katie Edmonds)
Date: Tue, 20 May 2008 12:24:08 -0400
Subject: [BioPython] Remote PSI-Blast
In-Reply-To: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com>
References: <320fb6e00805200913p458b0610p72d9b7c22e139894@mail.gmail.com>
Message-ID: <8e76d5310805200924q5a37e0d0j573d2bad3372b798@mail.gmail.com>

I asked NCBI about this, and they (eventually) replied that it's "not officially supported." I have been unable to figure out how to get it to return iterations after the first one.

Katie

On Tue, May 20, 2008 at 12:13 PM, Peter wrote:
> On Tue, May 20, 2008 at 5:08 PM, Raul Guerra wrote:
> > Hello Everybody,
> >
> > I just began using Biopython. I need to do a remote PSI-Blast, and I cannot
> > find how to do it online.
> >
> > Does anybody know how to do it?
> > Is it implemented in BioPython or do I have
> > to implement it?
>
> I'm not sure if the NCBI have an official way to access the PSI-Blast
> search online. If you have a look at Bug 2496, we've made some headway
> in trying to do it anyway, but I would really like some official
> documentation from the NCBI on if/how they expect this to work.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2496
>
> Peter

From colochera at gmail.com Tue May 20 23:28:27 2008
From: colochera at gmail.com (Raul Guerra)
Date: Tue, 20 May 2008 19:28:27 -0400
Subject: [BioPython] Parstin a remote Blast output
Message-ID:

Thank you to everyone who replied to my last post. I am sorry to bother you again with a question. Thank you in advance for your time.

I am trying to parse the output from:

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]')

where fastaStr is a string in the fasta format.

When I run the following code:

b_parser = NCBIWWW.BlastParser()
blast_records = b_parser.parse(result_handle)

I get the following error:

ValueError: Unexpected end of stream.

I think that result_handle is a cStringIO.StringI data structure. I thought the code for some reason was trying to invoke the readline() method on the cStringIO.StringI data structure and maybe that was what was causing the error. However, I already saved result_handle.read() in a file and opened it with the open() function, so that I would get a file object with a readline() function. But the code still did not work.

I tried to follow the logic of the program and I found that NCBIWWW.qblast() is outputting an XML file, and for some reason NCBIWWW.BlastParser() is expecting an HTML file. That is my guess of what is going wrong. So what I did was to use the parser in NCBIXML.
So I ran the following

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]')
blast_records = NCBIXML.parse(result_handle)

and it works fine (at least I do not get errors), but I have no idea what type of object blast_records is. I tried the following

next = blast_records.next()

and got the following error:

Traceback (most recent call last):
  File "/home/rguerra/workspace/Summer2008/src/testingBioPython.py", line 28, in <module>
    next = blast_records.next()
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 574, in parse
    expat_parser.Parse(text, False)
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement
    eval("self.%s()" % method)
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in _end_BlastOutput_version
    self._header.date = self._value.split()[2][1:-1]
IndexError: list index out of range

I have not been able to understand what is going on here. I just want to parse the results I get from:

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]')

Any ideas?

Raul Guerra

From ibdeno at gmail.com Wed May 21 07:43:42 2008
From: ibdeno at gmail.com (Miguel Ortiz-Lombardía)
Date: Wed, 21 May 2008 09:43:42 +0200
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
Message-ID:

Hi,

The PSIBlastParser (biopython 1.45) seems not to work with the latest (2.2.18) version of NCBI blastpgp.
Using the same script/inputs I can successfully run it with blastpgp 2.2.15 but when using 2.2.18 I get this error:

Traceback (most recent call last):
  File "./lpbl.py", line 23, in
    b_record = b_parser.parse(blast_out)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 760, in parse
    self._scanner.feed(handle, self._consumer)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 98, in feed
    self._scan_header(uhandle, consumer)
  File "/home/mortiz/Progs//lib/python/Bio/Blast/NCBIStandalone.py", line 208, in _scan_header
    raise ValueError("Invalid header?")
ValueError: Invalid header?

In both cases I had set align_view='0' so the psi-blast output is plain text and not XML (this is because I don't think there is a PSI-Blast parser working with XML; if there is one, please let me know how to invoke it).

Best regards,

Miguel

--
http://www.pangea.org/mol/spip.php?rubrique2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Je suis de la mauvaise herbe,
Braves gens, braves gens,
Je pousse en liberté
Dans les jardins mal fréquentés!
Georges Brassens

From biopython at maubp.freeserve.co.uk Wed May 21 08:26:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 09:26:37 +0100
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
In-Reply-To:
References:
Message-ID: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com>

On Wed, May 21, 2008 at 8:43 AM, Miguel Ortiz-Lombardía wrote:
> Hi,
>
> The PSIBlastParser (biopython 1.45) seems not to work with the latest
> (2.2.18) version of NCBI blastpgp. Using the same script/inputs I can
> successfully run it with blastpgp 2.2.15 but when using 2.2.18 I get this
> error:
>
> Traceback (most recent call last):
> ...
> ValueError: Invalid header?
>
> In both cases I had set align_view='0' so the psi-blast output is plain
> text and not XML (this is because I don't think there is a PSI-Blast parser
> working with XML; if there is one, please let me know how to invoke it)

Plain text Blast parsing in general is a pain as the NCBI often make minor changes to the file format. If you could file a bug on Bugzilla, with a couple of example input files, I'll try and have a look and see if it's an easy fix this time. Matched pairs using blastpgp 2.2.15 and 2.2.18 with the same command line arguments would be especially useful.

In the meantime, you could try the Bio.Blast.NCBIXML parser on the PSI-Blast XML output. I know that works for blastall and rpsblast, so it may be OK with blastpgp too. Please let us know if this works for you (and we can update our documentation).

Thank you,

Peter

From biopython at maubp.freeserve.co.uk Wed May 21 08:46:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 09:46:13 +0100
Subject: [BioPython] Parstin a remote Blast output
In-Reply-To:
References:
Message-ID: <320fb6e00805210146l34fd2cd3q892dd52bbc439eae@mail.gmail.com>

On Wed, May 21, 2008 at 12:28 AM, Raul Guerra wrote:
> Thank you to everyone who replied to my last post. I am sorry to bother you
> again with a question. Thank you in advance for your time.
>
> I am trying to parse the output from:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
> where fastaStr is a string in the fasta format.

As you have discovered, this will return XML by default (since Biopython 1.41). You will get back a handle object (of some sort).

> I tried to follow the logic of the program and I found that NCBIWWW.qblast()
> is outputting an XML file, and for some reason NCBIWWW.BlastParser() is
> expecting an HTML file. That is my guess of what is going wrong. So what I
> did was to use the parser in NCBIXML.

Well done on working this out.
Can I ask you why you tried the version using the plain text parser? I thought we'd updated all our documentation on this but perhaps we missed something. See the BLAST chapter of the tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

> So I ran the following
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
> blast_records = NCBIXML.parse(result_handle)
>
> and it works fine (at least I do not get errors), but I have no idea what
> type of object blast_records is. I tried the following

Using NCBIXML.parse(result_handle) will return an iterator, but it doesn't actually start parsing the file until you call the next() method, which is usually done in a for loop.

> next = blast_records.next()
>
> and got the following error:
>
> Traceback (most recent call last):
> ...
> File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in
> _end_BlastOutput_version
> self._header.date = self._value.split()[2][1:-1]
> IndexError: list index out of range
>
> I have not been able to understand what is going on here.

Sadly the NCBI changed their output format slightly, and Biopython couldn't cope. We've fixed this now (Bug 2499), but you'll have to update your installation. See here for details:
http://bugzilla.open-bio.org/show_bug.cgi?id=2499

> I just want to parse the results I get from:
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='"Arabidopsis thaliana" [ORGN]')
>
> Any ideas?

You're very close.
I suggest updating the NCBIXML file to cope with the current version of BLAST that the NCBI is using online (2.2.18+), and then using the XML parser:

from Bio.Blast import NCBIWWW, NCBIXML
result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr, entrez_query='"Arabidopsis thaliana" [ORGN]')
for record in NCBIXML.parse(result_handle) :
    #Do something with the blast result
    pass

Peter

From biopython at maubp.freeserve.co.uk Wed May 21 11:36:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 12:36:27 +0100
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
In-Reply-To:
References: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com>
Message-ID: <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com>

On Wed, May 21, 2008 at 10:26 AM, Miguel Ortiz-Lombardía wrote:
> Hi Peter,
>
> I will try NCBIXML, thank you.

Good luck :)

> In the mean time, how can I attach a file to the bug I'm filing so you have my
> script and an input file? I thought this possibility existed, but I can't
> find it right now.

You have to file the bug, and then go back and add an attachment as a second step. It is a little annoying/confusing.

Does anyone know if bugzilla can be configured to allow attachments when filing a bug?

Peter

From sdavis2 at mail.nih.gov Wed May 21 12:08:57 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 21 May 2008 08:08:57 -0400
Subject: [BioPython] Adding new database types to EUtils
In-Reply-To: <200712041449.39100.luca.beltrame@unimi.it>
References: <200712041119.46997.luca.beltrame@unimi.it> <200712041221.16990.luca.beltrame@unimi.it> <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> <200712041449.39100.luca.beltrame@unimi.it>
Message-ID: <264855a00805210508v6c9945e2j2c7d638a27a0e61b@mail.gmail.com>

On Tue, Dec 4, 2007 at 9:49 AM, Luca Beltrame wrote:
> On Tuesday 04 December 2007 14:35:13 you wrote:
>
>> release.
>> I have thought about making a python-based version, but I find R
>> a much more compelling framework for statistical computing and array-based
>
> I think it is mostly a matter of personal preference. I turned to Python (but
> I have been using GEOquery in the past) because I like the language more than
> R.
>
>> Metadata (and not values), then URLs can be constructed against their web
>
> I guess I did not make the statement clear enough in my original mail. Yes, I
> meant to fetch only the metadata because I wanted to gather the experiment
> descriptions from all the accessions I had (a rather large number) in order
> to look through them without having to query for each one.
> I will try looking at the queries via web and see if I can write something
> useful (although I still think that, as basic as it is, it would be nice to
> have EUtils GEO support in Bio.EUtils, at least for the metadata).
>
>> I'm not sure that exactly the same functionality is available via Eutils,
>> but I think not.
>
> I have played a bit with EUtils, but I haven't yet been able to use esearch to
> work with a GEO accession. Since I have just looked at them briefly, I can't
> guarantee it was just a mistake on my part, though.

This is a pretty old post, I know, but I thought I would toss in some new information here. We have parsed all of the GEO metadata into a MySQL database which can be queried directly here:

http://meltzerlab.nci.nih.gov/apps/geo

In addition, for those who want programmatic access, on the same page is a SQLite database file with the same information. It can be used from python (or any other language with SQLite bindings) for SQL-based queries of GEO metadata.
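
A minimal sketch of what querying such a SQLite file from Python could look like, using the standard library sqlite3 module. The table name gse, its columns and the rows are assumptions made up for illustration, and an in-memory database stands in for the downloaded file; the real schema should be checked first, for example through sqlite_master as shown.

```python
import sqlite3

# In-memory stand-in for the downloaded file; with the real file this would be
# sqlite3.connect("path/to/downloaded.sqlite") (hypothetical path).
conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, only so that the example runs on its own.
conn.execute("CREATE TABLE gse (accession TEXT PRIMARY KEY, title TEXT, summary TEXT)")
conn.executemany(
    "INSERT INTO gse VALUES (?, ?, ?)",
    [
        ("GSE0001", "Heat shock time course", "Made-up yeast expression series"),
        ("GSE0002", "Drug response panel", "Made-up cell line series"),
    ],
)

# List the tables first, since the real schema may differ from this guess.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

def describe(accessions):
    """Return {accession: title} for the given accessions in a single query."""
    placeholders = ",".join("?" * len(accessions))
    query = "SELECT accession, title FROM gse WHERE accession IN (%s)" % placeholders
    return dict(conn.execute(query, accessions))

titles = describe(["GSE0001", "GSE0002"])
```

Fetching many accessions in one IN (...) query is the SQL counterpart of gathering all the experiment descriptions at once instead of querying for each accession separately.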
Sean

From cjfields at uiuc.edu Wed May 21 12:21:48 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 21 May 2008 07:21:48 -0500
Subject: [BioPython] PSIBlastParser and blastpgp 2.2.18
In-Reply-To: <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com>
References: <320fb6e00805210126p469c0fd5id26a0e3be51697d0@mail.gmail.com> <320fb6e00805210436u5f7a0e0bi3ebe7a127d0b477d@mail.gmail.com>
Message-ID: <95A355CD-8C57-415A-BF28-0E7C7ABDB5CF@uiuc.edu>

You have to file the bug first descriptively; after the ticket is generated the 'Create a New Attachment' link should show up.

chris

On May 21, 2008, at 6:36 AM, Peter wrote:
> On Wed, May 21, 2008 at 10:26 AM, Miguel Ortiz-Lombardía wrote:
>> Hi Peter,
>>
>> I will try NCBIXML, thank you.
>
> Good luck :)
>
>> In the mean time, how can I attach a file to bug I'm filing so you
>> have my
>> script and an input file? I thought this possibility existed, but I
>> can't
>> find it right now.
>
> You have to file the bug, and then go back and add an attachment as a
> second step. It is a little annoying/confusing.
>
> Does anyone knows if bugzilla can be configured to allow attachments
> when filing a bug?
>
> Peter

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign

From biopython at maubp.freeserve.co.uk Wed May 21 14:25:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 15:25:31 +0100
Subject: [BioPython] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15
In-Reply-To:
References: <200805211305.m4LD5Eo3020573@portal.open-bio.org>
Message-ID: <320fb6e00805210725n6a73f2lbe381acc1801d5b0@mail.gmail.com>

On Wed, May 21, 2008 at 2:53 PM, Miguel Ortiz-Lombardía wrote:
> Peter, I'm sending this to you because bugzilla rejected my e-mail:

I don't think you can submit comments to the bugs on bugzilla via email. You have to go to our bugzilla webpage (using the link included in the email) and fill in the comments box, i.e.
http://bugzilla.open-bio.org/show_bug.cgi?id=2502

> Hi,
>
> Just to clarify: the output is not XML but plain text ( because
> align_view='0' )

Is it the plain text that you want? I had forgotten to check what the default output was - sorry.

To cover all cases, I would like to see both the plain text and the XML for both the old and new blast versions, i.e. four different files using the same input query and database. I would generate these on the command line.

It may be that it's only a simple change to the plain text (but I suspect not), and that updating the plain text parser would be simple. I fear that the changes to the plain text are bigger (based on what happened with the other blast tools recently) and we would be better off getting Biopython to parse the XML.
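
A minimal sketch of why the XML route tends to be more robust than scanning plain text, assuming the PSI-BLAST XML follows the usual BLAST XML layout with one Iteration element per round. It uses xml.etree.ElementTree from the standard library rather than Bio.Blast.NCBIXML, and the XML fragment is hand-written for illustration, not real blastpgp output.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of BLAST XML; the hit names are made up.
xml_text = """
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.2.18</BlastOutput_version>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>made-up protein A</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_def>made-up protein A</Hit_def></Hit>
        <Hit><Hit_def>made-up protein B</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

root = ET.fromstring(xml_text.strip())

# Fields are looked up by tag name, so cosmetic changes to spacing or
# wording in the report do not affect where each value is found.
version = root.findtext("BlastOutput_version")

hits_per_round = {}
for iteration in root.iter("Iteration"):
    round_number = int(iteration.findtext("Iteration_iter-num"))
    hits_per_round[round_number] = [
        hit.findtext("Hit_def") for hit in iteration.iter("Hit")
    ]
```

Each PSI-BLAST round ends up as its own entry in hits_per_round, which is the information a plain text parser has to recover from the report layout.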
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Wed May 21 22:42:10 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 May 2008 23:42:10 +0100
Subject: [BioPython] Parstin a remote Blast output
In-Reply-To:
References: <320fb6e00805210146l34fd2cd3q892dd52bbc439eae@mail.gmail.com>
Message-ID: <320fb6e00805211542t42554d06t88d5ac88f8ba4f64@mail.gmail.com>

On Wed, May 21, 2008 at 6:04 PM, Raul Guerra wrote:
> Hi Peter,
>
> Thank you for all your help, it is greatly appreciated. I am sorry that I
> keep asking about how to parse the blast. About why I tried to use the plain
> text parser, I googled around and found a BioPython CookBook (I know now
> that it is an older version of yours) and tried the code in it.

That would explain why you started out doing things "the old way". We could try writing to the website hosting the old tutorial/cookbook and see if they could update it...

> About NCBIXML, I overwrote the file
> "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py" with the one in CVS
> following the link
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython

OK, that should be able to parse the latest XML files from the NCBI except for the slight catch I had forgotten about (see below)

> However, a new problem comes up. When I run the code,
>
> result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
> entrez_query='%s [ORGN]'%organism)
> for record in NCBIXML.parse(result_handle) :
> print record
>
> where organism is "Chlamydomonas"
> and fastaStr is
> >NP_013762_Nup116
> MFGVSRGAFPSAT....

Here is a short version using the GI number of this sequence,

from Bio.Blast import NCBIWWW, NCBIXML
handle = NCBIWWW.qblast("blastp", "nr", "6323691", "Chlamydomonas[ORGN]")
for record in NCBIXML.parse(handle) :
    print record

This seems to work for me.

With hindsight, it would have been safer just to have got the whole of Biopython from CVS, but I think you also need to update this file
/usr/lib/python2.5/site-packages/Bio/Blast/Record.py
with the latest from CVS:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/Record.py?cvsroot=biopython

I'd forgotten about this related change - sorry.

Peter

From biopython at maubp.freeserve.co.uk Thu May 22 08:24:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 May 2008 09:24:39 +0100
Subject: [BioPython] THANK YOU
In-Reply-To:
References:
Message-ID: <320fb6e00805220124o4b00c578je98946b0951d9624@mail.gmail.com>

On Thu, May 22, 2008 at 12:49 AM, Raul Guerra wrote:
>
> Peter,
>
> Thank you so much. it WORKED!! I did not know about BioPython until a few
> weeks back. Before that I had been programming my own parsers and scripts to
> access NCBI. I spent 15 weeks trying to figure out a parser for GenBank, and
> I still could not get it to work for all cases. I thought that I was going
> to spend a lot of time with a parser for Blast, but BioPython works
> beautifully.
>
> Thanks for all the help,
>
> Raul

Thanks for letting me know it worked, Raul - we got there in the end with the Blast XML parsing ;)

I am optimistic that Biopython will have another release by summer 2008, which will be good news for everyone else wanting to use the NCBI's online blast (without having to mess about with the CVS code).

If you find any issues with our GenBank parser, please do get in touch again on the mailing list (or via Bugzilla if you find a bug).

Peter

From florian.koelling at tu-bs.de Thu May 22 14:38:59 2008
From: florian.koelling at tu-bs.de (Florian Koelling)
Date: Thu, 22 May 2008 16:38:59 +0200
Subject: [BioPython] problems to fetch pdb files from server using pdb_list
Message-ID: <48358583.3050401@tu-bs.de>

Hi Folks!

I tried to download pdb files using pdb_list:
(the second variant didn't work as well:-@)

from Bio.PDB import*

pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT')
#pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z', uncompress='gunzip', pdir='/home/flo/Desktop')

I receive:

flo at AKB-12:~/Desktop/astex$ python test.py
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    pdbl.retrieve_pdb_file('1FAT')
  File "/var/lib/python-support/python2.5/Bio/PDB/PDBList.py", line 190, in retrieve_pdb_file
    os.mkdir(path)
OSError: [Errno 2] No such file or directory: '/pdb/fa'

What am I doing wrong?

Thanx for your Help!

From peter at maubp.freeserve.co.uk Thu May 22 15:14:52 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 May 2008 16:14:52 +0100
Subject: [BioPython] problems to fetch pdb files from server using pdb_list
In-Reply-To: <48358583.3050401@tu-bs.de>
References: <48358583.3050401@tu-bs.de>
Message-ID: <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com>

On Thu, May 22, 2008 at 3:38 PM, Florian Koelling wrote:
>
> Hi Folks!
>
> I tried to download pdb files using pdb_list:
> (the second variant didn't work as well:-@)
>
> from Bio.PDB import*
>
> pdbl=PDBList()
> pdbl.retrieve_pdb_file('1FAT')

This worked for me: it created a subdirectory "fa" inside the current working directory, and saved the PDB there as "pdb1fat.ent".

Are you sure the error you quoted in your email matches this code?

> #pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z',
> uncompress='gunzip', pdir='/home/flo/Desktop')

The above isn't valid Python syntax (there is something wrong between the words obsolete and compression).
You probably want something like this: pdbl.retrieve_pdb_file('1FAT', pdir='/home/flo/Desktop') Peter From florian.koelling at tu-bs.de Fri May 23 08:28:40 2008 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Fri, 23 May 2008 10:28:40 +0200 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> Message-ID: <48368038.7010504@tu-bs.de> Humm - I recognised that I was still using Biopython 1.42 (installed via synaptics) - it works fine on 1.45! :-)))) Peter wrote: > On Thu, May 22, 2008 at 3:38 PM, Florian Koelling > wrote: > >> Hi Folks! >> >> I tried to download pdb files using pdb_list: >> (the second variant didn't work as well:-@) >> >> from Bio.PDB import* >> >> pdbl=PDBList() >> pdbl.retrieve_pdb_file('1FAT') >> > > This worked for me, it create a subdirectory "fa" inside the current > working directory, and saved the PDB there as "pdb1fat.ent". > > Are you sure the error you quoted in your email matches this code? > > >> #pdbl.retrieve_pdb_file('1FAT', obsolete= compression='.Z', >> uncompress='gunzip', pdir='/home/flo/Desktop') >> > > The above isn't valid python syntax (there is something wrong between > the words obsolete and compression). 
You probably want something like > this: > > pdbl.retrieve_pdb_file('1FAT', pdir='/home/flo/Desktop') > > Peter > From peter at maubp.freeserve.co.uk Fri May 23 10:08:53 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 23 May 2008 11:08:53 +0100 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <48368038.7010504@tu-bs.de> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> Message-ID: <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> On Fri, May 23, 2008 at 9:28 AM, Florian Koelling wrote: > Humm - I recognised that I was still using Biopython 1.42 (installed via > synaptics) - it works fine on 1.45! :-)))) One mystery solved - I did wonder if I should have asked which version of Biopython you had ;) Good luck with your PDB analysis. Peter From florian.koelling at tu-bs.de Fri May 23 10:22:11 2008 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Fri, 23 May 2008 12:22:11 +0200 Subject: [BioPython] problems to fetch pdb files from server using pdb_list In-Reply-To: <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> Message-ID: <48369AD3.1010005@tu-bs.de> Thanx alot! Is there any possibility to avoid the construction of the obselete folder? I could not find it in the documentation - and the flags 0, 1 and None didn' t work It perturbs my loops while slicing over several Folders. Greetz, Florian Peter wrote: > On Fri, May 23, 2008 at 9:28 AM, Florian Koelling > wrote: > >> Humm - I recognised that I was still using Biopython 1.42 (installed via >> synaptics) - it works fine on 1.45! 
>> :-))))
>>
>
> One mystery solved - I did wonder if I should have asked which version
> of Biopython you had ;)
>
> Good luck with your PDB analysis.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri May 23 10:35:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 May 2008 11:35:59 +0100
Subject: [BioPython] problems to fetch pdb files from server using pdb_list
In-Reply-To: <48369AD3.1010005@tu-bs.de>
References: <48358583.3050401@tu-bs.de> <320fb6e00805220814y41bfe236ua2dd6189a55356b1@mail.gmail.com> <48368038.7010504@tu-bs.de> <320fb6e00805230308yecdf4c3r6e604b8edae647c3@mail.gmail.com> <48369AD3.1010005@tu-bs.de>
Message-ID: <320fb6e00805230335m7444d608l4efb3c743f89f461@mail.gmail.com>

Florian Koelling wrote:
> Is there any possibility to avoid the construction of the obsolete
> folder? I could not find it in the documentation - and the flags 0, 1
> and None didn't work

You just use the pdir argument to specify the folder you want to use (otherwise it builds the subdirectory structure). The function will understand the shorthand of "." for the current directory. Based on your previous example, try:

from Bio.PDB import *
pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT', pdir=".")

Peter

From gbastian at pasteur.fr Wed May 28 10:18:19 2008
From: gbastian at pasteur.fr (gbastian at pasteur.fr)
Date: Wed, 28 May 2008 12:18:19 +0200 (CEST)
Subject: [BioPython] biopython sequence retreive from locus_tag
Message-ID: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr>

Hello all,

I am trying to find a way to retrieve the sequences from NCBI starting from the locus_tag information.
I have no accession number.

Thanks for your suggestions.

Giacomo

From colochera at gmail.com Wed May 28 15:41:26 2008
From: colochera at gmail.com (Raul Guerra)
Date: Wed, 28 May 2008 11:41:26 -0400
Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast
Message-ID:

Hi everyone,

I was wondering if someone has had the same problem.
I am running the following code in BioPython.

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
                               entrez_query='"Chlamydomonas" [ORGN]',
                               ncbi_gi=True, matrix_name='BLOSUM62',
                               hitlist_size=50)

where fastaStr is the fasta string for NP_012855. (When I mention Biopython's pBlast I refer to the code above.)

The results that I got back are different from the results I get from the pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi (I refer to it when I mention pblast in NCBI's website).

The results that I got from NCBI's website are 2 sequences, which were what I was looking for. On the other hand, Biopython gives back as many hits as I specify in the limit. Also in Biopython's pBlast, I only get one of the hits that I get in NCBI's pBlast.

I know that the qblast function in NCBIWWW has many parameters.

def qblast(program, database, sequence,
           auto_format=None,composition_based_statistics=None,
           db_genetic_code=None,endpoints=None,entrez_query='(none)',
           expect=10.0,filter=None,gapcosts=None,genetic_code=None,
           hitlist_size=50,i_thresh=None,layout=None,lcase_mask=None,
           matrix_name=None,nucl_penalty=None,nucl_reward=None,
           other_advanced=None,perc_ident=None,phi_pattern=None,
           query_file=None,query_believe_defline=None,query_from=None,
           query_to=None,searchsp_eff=None,service=None,threshold=None,
           ungapped_alignment=None,word_size=None,
           alignments=500,alignment_view=None,descriptions=500,
           entrez_links_new_window=None,expect_low=None,expect_high=None,
           format_entrez_query=None,format_object=None,format_type='XML',
           ncbi_gi=None,results_file=None,show_overview=None
           ):

I also know that the pBlast in NCBI's website utilizes a Gap Cost of "Existence: 11 Extension:1". I am not sure how to translate that into the qblast function in Biopython. I am not sure if this is the problem, but it could be that Biopython's pblast and NCBI's pblast have different parameters.

Thank you for your time,
David

From fahy at chapman.edu Thu May 29 02:16:53 2008
From: fahy at chapman.edu (Michael Fahy)
Date: Wed, 28 May 2008 19:16:53 -0700
Subject: [BioPython] EST Alignment
Message-ID:

I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible through BioPython, for identifying genes by aligning ESTs to genome sequences?

From aloraine at gmail.com Thu May 29 10:18:25 2008
From: aloraine at gmail.com (Ann Loraine)
Date: Thu, 29 May 2008 06:18:25 -0400
Subject: [BioPython] EST Alignment
In-Reply-To:
References:
Message-ID: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com>

Have you looked at blat?

On Wed, May 28, 2008 at 10:16 PM, Michael Fahy wrote:
> I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible through BioPython, for identifying genes by aligning ESTs to genome sequences?
> _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From sdavis2 at mail.nih.gov Thu May 29 11:33:42 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 29 May 2008 07:33:42 -0400 Subject: [BioPython] EST Alignment In-Reply-To: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com> References: <83722dde0805290318m1c1e153er8e9e3e1e5f32db9d@mail.gmail.com> Message-ID: <264855a00805290433x68ee84fn22adbfafd27aef96@mail.gmail.com> On Thu, May 29, 2008 at 6:18 AM, Ann Loraine wrote: > Have you looked at blat? Or GMAP (or many others)? > On Wed, May 28, 2008 at 10:16 PM, Michael Fahy wrote: >> I expect to be getting two sets of ESTs from two stages of an organism whose genome has been sequenced. I would like to align the ESTs to the genome sequence to identify the genes that are being expressed and, more importantly, to find the differences in genes between the two sets. There is an EST2GENOME class in BioPerl that looks like it would help. Is there anything similar in BioPython? Or public tools, accessible thorugh BioPython, for identifying genes by aligning ESTs to genome sequences? >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From colochera at gmail.com Thu May 29 12:47:55 2008 From: colochera at gmail.com (Raul Guerra) Date: Thu, 29 May 2008 08:47:55 -0400 Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast In-Reply-To: References: Message-ID: Hi everyone, I was wondering if someone has had the same problem. I am running the following code in BioPython. 

result_handle = NCBIWWW.qblast("blastp", "nr", fastaStr,
                               entrez_query='"Chlamydomonas" [ORGN]',
                               ncbi_gi=True, matrix_name='BLOSUM62',
                               hitlist_size=50)

where fastaStr is the fasta string for NP_012855. (When I mention Biopython's pBlast I refer to the code above.)

The results that I got back are different from the results I get from the pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi . If you follow the link, click on pblast and do a blast specifying just the organism and the sequence accession number.

The results that I got from NCBI's website are 2 sequences, which were what I was looking for. On the other hand, Biopython gives back as many hits as I specify in the limit. Also in Biopython's pBlast, I only get one of the hits that I get in NCBI's pBlast.

I know that the qblast function in NCBIWWW has many parameters.

def qblast(program, database, sequence,
           auto_format=None,composition_based_statistics=None,
           db_genetic_code=None,endpoints=None,entrez_query='(none)',
           expect=10.0,filter=None,gapcosts=None,genetic_code=None,
           hitlist_size=50,i_thresh=None,layout=None,lcase_mask=None,
           matrix_name=None,nucl_penalty=None,nucl_reward=None,
           other_advanced=None,perc_ident=None,phi_pattern=None,
           query_file=None,query_believe_defline=None,query_from=None,
           query_to=None,searchsp_eff=None,service=None,threshold=None,
           ungapped_alignment=None,word_size=None,
           alignments=500,alignment_view=None,descriptions=500,
           entrez_links_new_window=None,expect_low=None,expect_high=None,
           format_entrez_query=None,format_object=None,format_type='XML',
           ncbi_gi=None,results_file=None,show_overview=None
           ):

I also know that the pBlast in NCBI's website utilizes a Gap Cost of "Existence: 11 Extension:1". I am not sure how to translate that into the qblast function in Biopython. I am not sure if this is the problem, but it could be that Biopython's pblast and NCBI's pblast have different parameters.

Thank you for your time,
David

From biopython at maubp.freeserve.co.uk Thu May 29 21:53:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 May 2008 22:53:58 +0100
Subject: [BioPython] biopython sequence retreive from locus_tag
In-Reply-To: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr>
References: <34099.157.99.64.103.1211969899.squirrel@php.pasteur.fr>
Message-ID: <320fb6e00805291453x7e2597f8w8554d3a30b716b9@mail.gmail.com>

> Hello all,
>
> I am trying to find a way to retreive the sequences from NCBI
> starting from the locus_tag information.
> I have no accession number.

Hello Giacomo,

There is probably more than one solution. My first idea is to try and use the Entrez utilities. Have you tried searching Entrez for your locus tag via this web page?
http://www.ncbi.nlm.nih.gov/Entrez/

If this works, then I would try using the Bio.Entrez module in Biopython, and request the desired records in FASTA or GenBank format according to your needs.

Peter

From biopython at maubp.freeserve.co.uk Thu May 29 22:09:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 May 2008 23:09:59 +0100
Subject: [BioPython] pblast in NCBI's website differs from biopython's pblast
In-Reply-To:
References:
Message-ID: <320fb6e00805291509u14da40b2lb14afa0458210dc8@mail.gmail.com>

On Wed, May 28, 2008 at 4:41 PM, Raul Guerra wrote:
> Hi everyone,
>
> I was wondering if someone has had the same problem. I am running the
> following code in BioPython.
...
>
> The results that I got back are different from the results I get from the
> pblast option at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi (I refer to it
> when I mention pblast in NCBI's website)
...
>
> I also know that the pBlast in NCBI's website utilizes a Gap Cost of
> "Existence: 11 Extension:1". I am not sure how to translate that into the
> qblast function in Biopython.
> I am not sure if this is the problem, but it
> could be that Biopython's pblast and NCBI's pblast have different
> parameters.

The NCBI have often changed their default options in the BLAST tools, both the online versions and the web interface. As you have seen from the Biopython source code, we set many of the options to specific default values which may no longer match what the NCBI webpages do by default. It's a little frustrating :(

For similar examples from last year, see
http://portal.open-bio.org/pipermail/biopython/2007-August/003679.html
and
http://portal.open-bio.org/pipermail/biopython/2007-August/003693.html

You should compare the parameters for the two sets of results, work out where they are different, and decide which settings are best for your problem.

For the gap costs, see http://www.ncbi.nlm.nih.gov/BLAST/Doc/node28.html - you can specify this parameter in the Biopython qblast function with the optional gapcosts argument.

Peter
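
For anyone wanting to try the gapcosts suggestion above, here is a minimal sketch rather than a definitive recipe. It assumes the NCBI accepts the two costs as a single space-separated string ("11 1" for existence 11, extension 1), and the helper names build_qblast_kwargs and run_blastp are made up for illustration; only qblast, its keyword arguments, and NCBIXML.parse come from Biopython. The search itself needs Biopython and network access, so the import is kept inside the function that uses it.

```python
def build_qblast_kwargs(organism, existence=11, extension=1, hitlist_size=50):
    """Collect keyword arguments for NCBIWWW.qblast (hypothetical helper)."""
    return {
        "entrez_query": "%s[ORGN]" % organism,
        "matrix_name": "BLOSUM62",
        # Assumption: gap costs are sent as one string, "<existence> <extension>".
        "gapcosts": "%d %d" % (existence, extension),
        "hitlist_size": hitlist_size,
    }


def run_blastp(query, organism):
    """Run an online blastp search and return the parsed records.

    Needs Biopython installed and a network connection, so the import
    is done here rather than at module level.
    """
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast("blastp", "nr", query,
                            **build_qblast_kwargs(organism))
    return list(NCBIXML.parse(handle))


if __name__ == "__main__":
    # Only show the arguments; calling run_blastp would contact the NCBI.
    print(build_qblast_kwargs("Chlamydomonas"))
```

Passing everything by keyword keeps the call independent of the order of the optional arguments in the qblast signature, which makes it easier to line the settings up one by one against whatever the NCBI web form reports for its own search.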