From abwork at utu.fi Wed Apr 1 00:01:08 2009
From: abwork at utu.fi (Abdi Worku Muleta)
Date: Wed, 01 Apr 2009 06:01:08 +0200
Subject: [BioPython] HELP!
Message-ID:

Hi everyone,

I am trying to write a bio-python script that uses SwissProt accession
numbers to download sequence objects and then run remote BLAST with the
sequences. Then download good hit sequences listed in the BLAST results
and print their sequences. I am using a Windows based system with
bio-python 2.5. If someone could help me out with some sample code or
something, I would really appreciate it. I just started learning Python
and have tried to follow the documentation and cookbook without much
success; my programming experience is virtually non-existent.

Thanks.

From peter at maubp.freeserve.co.uk Wed Apr 1 05:37:11 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 10:37:11 +0100
Subject: [BioPython] HELP!
In-Reply-To: <513066.92437.qm@web111011.mail.gq1.yahoo.com>
References: <513066.92437.qm@web111011.mail.gq1.yahoo.com>
Message-ID: <320fb6e00904010237j65556cf5mb5dd2914a17cc8a0@mail.gmail.com>

On Wed, Apr 1, 2009 at 4:56 AM, Hermella Woldemdihin wrote:
> Hi everyone,
> I am trying to write a bio-python script that uses SwissProt accession
> numbers to download sequence objects and then run remote blast
> with the sequences. Then download good hit sequences listed in Blast
> results and print their sequences. I am using a Windows based system
> with bio-python 2.5, if someone could help me out I would really
> appreciate it with some sample code or something. I just started
> learning python and have tried to follow the documentation and
> cookbook without much success, my programming experience is
> virtually non-existent. Thanks.
> Hermi

Hello Hermi and Abdi Worku Muleta,

You've both emailed almost identical questions at almost the same time -
are you doing the same project for a university assignment?

First of all, the Biopython Tutorial and Cookbook doesn't try to teach
you Python - it assumes you at least know the basics. Have a look at
www.python.org for some beginners' guides, or check your library as
there are plenty of books for learning Python.

To download SwissProt records, look at the get_sprot_raw function from
Bio.ExPASy (there is an example in the Tutorial, search for
get_sprot_raw). You can also use Bio.Entrez.efetch, but I have found the
NCBI only seem to keep track of the latest SwissProt identifiers, so
using ExPASy should be more reliable.

If you want to run BLAST on these records, then you can do this for one
query sequence at a time using Bio.Blast.NCBIWWW.qblast (again there is
an example in the Tutorial, search for qblast). You could also install
standalone BLAST from the NCBI on your own machine and do all the query
sequences together in a FASTA file, but I think this might be a bit
complicated for a novice.

Peter

From Yvan.Strahm at bccs.uib.no Wed Apr 1 06:34:56 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 01 Apr 2009 12:34:56 +0200
Subject: [BioPython] Is query_length really the length of query?
Message-ID: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>

Hello List

I am trying to get the length of the query from the BLAST result itself,
like this:

result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn",
                                                      my_blast_db, my_blast_file)

from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records

but
blast_record.query_length returns None
and
blast_record.query_letters returns the actual size

Should I test the length of the query before the BLAST result? Or did I
misinterpret the meaning of query_length and query_letters?

Thanks for your time

Is query_length really the length of query?
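A minimal sketch of one way to read the query length defensively, given
the behaviour reported above (query_letters populated, query_length left
as None). The helper name get_query_length and the FakeRecord stand-in
are assumptions made for illustration only; they are not part of
Biopython. With a real parse, the object passed in would be each
blast_record from NCBIXML.parse.

```python
def get_query_length(blast_record):
    """Return the query length from a BLAST record-like object.

    Tries query_letters first (the attribute reported as populated in
    this thread), then query_length. Returns None if neither is set.
    """
    for name in ("query_letters", "query_length"):
        value = getattr(blast_record, name, None)
        if value is not None:
            return int(value)
    return None


class FakeRecord(object):
    """Hypothetical stand-in for a parsed BLAST record (illustration only)."""

    def __init__(self, query_letters=None, query_length=None):
        self.query_letters = query_letters
        self.query_length = query_length


if __name__ == "__main__":
    # Mimics the situation described: only query_letters is filled in.
    record = FakeRecord(query_letters=1234)
    print(get_query_length(record))
```

Checking both names means the same call keeps working whichever of the
two attributes a given parser fills in.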
From biopython at maubp.freeserve.co.uk Wed Apr 1 06:59:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 11:59:24 +0100
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>
Message-ID: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>

On Wed, Apr 1, 2009 at 11:34 AM, wrote:
>
> Hello List
>
> I try to get the length of the query from the blast result itself
>
> like that:
> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn",
>                                                       my_blast_db, my_blast_file)
>
> from Bio.Blast import NCBIXML
> blast_records = NCBIXML.parse(result_handle)
> for blast_record in blast_records
>
> but
> blast_record.query_length return None
> and
> blast_record.query_letters return the actual size
>
> Should I test the length of the query before the blast result? Or did I
> misinterpret the meaning of query_length and query_letters?
>
> Thanks for your time
>
> Is query_length really the length of query?

You can use query_letters (although it wouldn't hurt to double check
this if you have the query sequence available). With the current BLAST
XML parser query_length is always None (but I think we should fix this
so they are both populated).

It's an unfortunate historical accident dating back to the plain text
BLAST parser. The plain text output printed the query length in two
places, with different captions, which was reflected in the names
given in the BLAST record (the values should be the same, assuming the
BLAST output is sane). The XML output doesn't have this redundancy,
but our XML parser tries to use the same object to hold the results.
See: http://bugzilla.open-bio.org/show_bug.cgi?id=2176#c12

Have a look at the discussion on Bug 2176 for more about this
(including the far more complicated situation for the database length
which has multiple meanings).

This seems like a timely reminder that we could perhaps tidy up a
little of this ready for Biopython 1.50 ...

Peter

From Yvan.Strahm at bccs.uib.no Wed Apr 1 07:15:13 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 01 Apr 2009 13:15:13 +0200
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>
 <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
Message-ID: <20090401131513.9al87wazacgcw0os@webmail.uib.no>

Thanks for these precisions.

Have a nice day
yvan

From dejmail at gmail.com Sun Apr 5 08:59:15 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Sun, 5 Apr 2009 14:59:15 +0200
Subject: [BioPython] extraction from genbank/embl files
Message-ID:

Hi everyone

I have a list of accession numbers, which I've used to download the
entire genomic sequences of several hundred hepatitis B virus isolates.
What I am trying to do is extract 3 gene sequences from each genomic
sequence, and place each sequence in one of 3 files depending on the
gene for further analysis.

The question is whether there is a shorter way to extract specific gene
sequences from GenBank files using the GenBank parser, or whether I
would need to identify the gene of each genomic isolate individually
(they are called a variety of names despite being the same gene, which
makes it trickier), copy the coordinates of the gene sequence, and then
proceed further down the file and actually perform the copying of the
gene.

I am not experienced in Python (or other languages for that matter), but
I am trying.
Any suggestions would be greatly appreciated

Thanks
Liam

--
-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand
Faculty of Health Sciences, Room 7Q07
7 York Road, Parktown
South Africa
2193

Tel: 2711 717 2465/7
Fax: 2711 717 2395
Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com

From sean.maceach at gmail.com Sun Apr 5 09:13:16 2009
From: sean.maceach at gmail.com (Sean MacEachern)
Date: Sun, 05 Apr 2009 09:13:16 -0400
Subject: [BioPython] extraction from genbank/embl files
In-Reply-To:
Message-ID:

Hi Liam,

Although not a Biopython solution, you should be able to use seqret in
EMBOSS to do something like you have described. You can call seqret in
your Python script using popen and write the results to one of your
three files.

HTH,

Sean

From biopython at maubp.freeserve.co.uk Sun Apr 5 15:29:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Apr 2009 20:29:40 +0100
Subject: [BioPython] extraction from genbank/embl files
In-Reply-To:
References:
Message-ID: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com>

On 4/5/09, Liam Thompson wrote:
> Hi everyone
>
> I have a list of accession numbers, which I've used to download the entire
> genomic sequences of several hundred hepatitis B virus isolates. What I am
> trying to do is extract 3 gene sequences from each genomic sequence, and
> place each sequence in one of 3 files depending on the gene for further
> analysis.

Are you looking for the CDS sequence of these three genes (i.e. a
nucleotide sequence)?

> The question is whether there is a shorter way to extract from Genbank files
> using the Genbank parser, specific gene sequences, or whether I would need
> to identify the gene of each genomic isolate individually (as they are
> called a variety of names, despite being the same gene which makes it
> trickier), copy the coordinates of the gene sequence, and then proceed
> further down the file and actually perform the copying of the gene.

I see two main options for you (regardless of what programming
language you want to use):

(1) Compile a list of all the gene names by hand.
(2) Compile a few examples by hand, and then use pairwise alignments
(e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching
gene in each virus. You could do this with the protein or the
nucleotide sequence.

Using Biopython's Bio.SeqIO EMBL/GenBank parser, each gene/CDS in the
EMBL/GenBank file will be represented as a SeqFeature object, which
includes the location information. If you can identify which features
you want from their annotation, then that tells you where to cut the
parent sequence.
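The two steps described above (recognise the gene from its annotation,
then cut the parent sequence at its location) can be sketched without
any library. This assumes each feature has already been reduced to a
(name, start, end, strand) tuple with 0-based, end-exclusive
coordinates. The alias table, function names and toy sequence are all
hypothetical and only for illustration; this is not the Bio.SeqIO API,
and with Biopython the same information would come from the SeqFeature
objects. The sketch also ignores compound locations such as join(...),
which a gene spanning the origin of a circular genome would need.

```python
# Hypothetical alias table: maps the varied names seen in annotations
# to one canonical gene label. Illustrative and incomplete.
GENE_ALIASES = {
    "S": {"s", "hbsag", "surface antigen"},
    "C": {"c", "hbcag", "core"},
    "X": {"x", "hbx"},
}

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def canonical_gene(name):
    """Return the canonical label for an annotated gene name, or None."""
    key = name.strip().lower()
    for label, aliases in GENE_ALIASES.items():
        if key in aliases:
            return label
    return None


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T or N."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def cut_feature(parent_seq, start, end, strand=1):
    """Slice parent_seq[start:end]; reverse complement if strand is -1."""
    sub = parent_seq[start:end].upper()
    return reverse_complement(sub) if strand == -1 else sub


def sort_genes(parent_seq, features):
    """Group the extracted sequences by canonical gene label."""
    by_gene = {}
    for name, start, end, strand in features:
        label = canonical_gene(name)
        if label is not None:  # skip features we do not recognise
            by_gene.setdefault(label, []).append(
                cut_feature(parent_seq, start, end, strand))
    return by_gene


if __name__ == "__main__":
    genome = "AAATGCCCTAAGGG"  # toy sequence, not real data
    features = [("HBsAg", 2, 8, 1), ("core", 8, 14, -1), ("unknown", 0, 3, 1)]
    for gene, seqs in sort_genes(genome, features).items():
        print(gene, seqs)
```

Each list in the result could then be written to its own file, one file
per gene, which is the layout asked for in the original question.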
See this page for some related discussion:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/

As an alternative approach, rather than starting with the EMBL/GenBank
files, can you just download the CDS sequences as a FASTA file? e.g.
files called *.ffn from the NCBI ftp site. You might also want to
download the genes' protein sequences; the NCBI uses *.faa for these
(FASTA amino acids).

Having FASTA files would make the sequence comparison approach easiest
- most of these tools will expect FASTA input files.

Peter

From dejmail at gmail.com Mon Apr 6 00:21:05 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Mon, 6 Apr 2009 06:21:05 +0200
Subject: [BioPython] extraction from genbank/embl files
In-Reply-To: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com>
References: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com>
Message-ID:

Hi Peter & Sean

I am looking for a nucleotide sequence for these three genes and I have
downloaded the entire genomic sequences so that I can compare the same 3
genes from all the isolates. I downloaded the full GenBank and FASTA
versions of the same set of accession numbers since, as you said, FASTA
will be easier to work with once I can identify the location information
from the GB file.

I'll give SeqFeature a bash, and possibly the seqret feature of EMBOSS
as well.

Thanks
Liam

From biopython at maubp.freeserve.co.uk Mon Apr 6 06:50:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Apr 2009 11:50:15 +0100
Subject: [BioPython] extraction from genbank/embl files
In-Reply-To:
References: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com>
Message-ID: <320fb6e00904060350n4e1caad6l4ea9ae46927e26fb@mail.gmail.com>

On 4/6/09, Liam Thompson wrote:
> Hi Peter & Sean
>
> I am looking for a nucleotide sequence for these three genes and I have
> downloaded the entire genomic sequences so that I can compare the same 3
> genes from all the same isolates. I downloaded the full GenBank and FASTA
> version of the same set of accession numbers, for as you said FASTA will be
> easier to work with once I can identify the location information from the
> info of the GB file.

The NCBI at least provide three flavours of FASTA file for a genome:

*.fna - FASTA Nucleic Acids - entire DNA nucleotide sequence as one record
*.faa - FASTA Amino Acids - amino acid sequences for each gene
*.ffn - FASTA Feature Nucleotides - nucleotide sequences for each gene

This is easiest to see on the FTP site. In your case, using the ffn
files might be simplest - assuming you can recognise the genes from
their sequences (e.g. using pairwise alignments to known references).

Peter

From florian.koelling at tu-bs.de Mon Apr 6 11:55:18 2009
From: florian.koelling at tu-bs.de (Florian Koelling)
Date: Mon, 06 Apr 2009 17:55:18 +0200
Subject: [BioPython] JP (Jarvis Patrick) Clustering
Message-ID: <49DA25E6.2060606@tu-bs.de>

Hi Folks!

Does anybody know an (open source) clustering package containing the
Jarvis Patrick clustering algorithm?
I only know Hcluster and Pycluster but I'm afraid that those candidates
don't have a JP implementation -- maybe one of you knows an alternative.

Thanks and best regards,
florian

From chapmanb at 50mail.com Mon Apr 6 17:57:25 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 17:57:25 -0400
Subject: [BioPython] JP (Jarvis Patrick) Clustering
In-Reply-To: <49DA25E6.2060606@tu-bs.de>
References: <49DA25E6.2060606@tu-bs.de>
Message-ID: <20090406215725.GG43636@sobchak.mgh.harvard.edu>

Hi Florian;

> Does anybody know an (open source) clustering package containing the
> Jarvis Patrick clustering algorithm?

Here is a version in R and C:

http://rguha.net/code/R/#jp

If you need to access it from Python, the RPy bindings are great:

http://rpy.sourceforge.net/

Hope this helps,
Brad

From chapmanb at 50mail.com Mon Apr 6 19:05:42 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 19:05:42 -0400
Subject: [BioPython] Invitation for Biopython news coordinators
Message-ID: <20090406230542.GK43636@sobchak.mgh.harvard.edu>

Biopythonistas;

Communication is a key component of successful open source projects.
The challenges of distributed programming by volunteers can be overcome
by ensuring that the whole community is aware of interesting
discussions, new contributions, and development goals. Traditionally,
this communication has happened through our mailing lists, wiki pages,
and bug tracking system. While these will continue to be useful
resources, new methods of disseminating information are changing how we
interact through the web.

I'd like to issue an invitation for anyone interested in helping
revolutionize how Biopython news is disseminated. We are looking for
contributors from the community to brainstorm new ways to make the
discussions that happen at biopython.org accessible.
You would actively follow development here and on the development lists
and distill this information into useful quick bullet points for those
interested in Biopython but too busy to follow detailed discussions. We
are proposing two ways to do this:

- Monthly highlights on our news server:

  http://news.open-bio.org/news/category/obf-projects/biopython/

  The RSS feed from these posts is currently widely distributed around
  the internet.

- More frequent pointers to interesting discussions or other items of
  interest happening in Biopython through our Twitter account:

  http://twitter.com/biopython

This is an opportunity for those of you who are looking to become more
involved, and would like to learn more about Biopython by following all
of the coding activity more closely. The position is very flexible and
we are happy to have one or more people take it on; we would also
encourage you to be as creative as you want in doing so. I see this as
a chance to both provide information and to highlight the great work
people do at Biopython.

If you are interested in taking on this role please respond with your
ideas.

Thanks for your interest,
Brad

From bradley.h at aggiemail.usu.edu Wed Apr 8 16:08:50 2009
From: bradley.h at aggiemail.usu.edu (Bradley Hintze)
Date: Wed, 8 Apr 2009 14:08:50 -0600
Subject: [BioPython] Clustalw Problems
Message-ID: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com>

Hi,

I am having a hard time running an alignment. I am running on Windows,
and here is my code and the error message that I get after running
do_alignment.
>>> import os
>>> from Bio.Clustalw import MultipleAlignCL
>>> from Bio.Clustalw import do_alignment
>>> cline=MultipleAlignCL(r"C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
>>> cline.set_output(r"C:\Documents and Settings\students\Desktop\Foo\test.aln")
>>> al=do_alignment(cline)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
    % command_line.sequence_file)
IOError: Cannot open sequence file C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta

When I open the file using

o=open('C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta')

it works fine.

Any ideas?

--
Bradley

--
Bradley J. Hintze
Biochemistry Undergraduate
Utah State University
801-712-8799

From biopython at maubp.freeserve.co.uk Wed Apr 8 17:50:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Apr 2009 22:50:24 +0100
Subject: [BioPython] Clustalw Problems
In-Reply-To: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com>
References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com>
Message-ID: <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com>

On 4/8/09, Bradley Hintze wrote:
> Hi,
>
> I am having a hard time running an alignment. I am running in windows and
> here is my code and the error message that I get after running do_alignment.
>
> >>> import os
> >>> from Bio.Clustalw import MultipleAlignCL
> >>> from Bio.Clustalw import do_alignment
> >>> cline=MultipleAlignCL(r"C:\Documents and
> Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program
> Files\clustalw1.83.XP\clustalw.exe")
> >>> cline.set_output(r"C:\Documents and
> Settings\students\Desktop\Foo\test.aln")
> >>> al=do_alignment(cline)
> Traceback (most recent call last):
>  File "", line 1, in
>  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124,
> in do_alignment
>  % command_line.sequence_file)
> IOError: Cannot open sequence file C:\Documents and
> Settings\student\Desktop\Foo\mtr4.fasta
>
> when I open the file using o=open('C:\Documents and
> Settings\student\Desktop\Foo\mtr4.fasta') it works fine.
>
> any ideas?

As a general tip, try this to see the command Biopython is trying to
run:

>>> print cline

Then try running the same command by hand at the command prompt (DOS
prompt), and make sure it works.

I can tell from the error message you have Python 2.5, but what version
of Biopython do you have?

I'm not at a Windows machine to check, but it is generally a good idea
to avoid file names and paths with spaces where you can. In this case,
I'm sure relative names would be fine:

>>> import os
>>> from Bio.Clustalw import MultipleAlignCL
>>> from Bio.Clustalw import do_alignment
>>> cline=MultipleAlignCL("mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
>>> cline.set_output("test.aln")

Peter

From jchen at alumni.caltech.edu Wed Apr 8 20:20:58 2009
From: jchen at alumni.caltech.edu (jchen at alumni.caltech.edu)
Date: Wed, 8 Apr 2009 17:20:58 -0700 (PDT)
Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?
Message-ID: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu>

Hello,

How do I convert a file full of BLAST runs into a FASTA file of
sequences for each hit?
I have tried parsing a file full of BLAST runs per the instructions from
the Biopython tutorial and cookbook
(http://biopython.org/DIST/docs/tutorial/Tutorial.html), but I continue
to get a ValueError. I have tried the hints on throwing certain
exceptions, without much help. The only thing I have gotten working is
parsing a BLAST output consisting of a single hit from a single query.

I used BLAST v.2.2.18 to generate my BLAST output.

Any help would be appreciated. Thanks!

-Jerry

From biopython at maubp.freeserve.co.uk Thu Apr 9 04:49:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Apr 2009 09:49:32 +0100
Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?
In-Reply-To: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu>
References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu>
Message-ID: <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com>

On Thu, Apr 9, 2009 at 1:20 AM, wrote:
> Hello,
>
> How do I convert a file full of BLAST runs into a FASTA file of sequences
> for each hit?

Do you just want the FASTA file to contain the matched region of the
sequences in the database? That information should be in the BLAST
output - you'll need to remove any gap characters.

If you want the full sequence of each matched target, that isn't in
the BLAST output. You'd have to take the reference number and look it
up. If you made the database yourself from a FASTA file, that should be
easy. If it was from NR/NT or another large database then maybe
fetching the sequences from the NCBI would be easiest (try
Bio.Entrez).

> I have tried parsing a file full of BLAST runs per the instructions from
> the Biopython tutorial and cookbook
> (http://biopython.org/DIST/docs/tutorial/Tutorial.html), but I continue to
> get a ValueError. I have tried the hints on throwing certain exceptions,
> without much help. The only thing I have gotten working is parsing a BLAST
> output consisting of a single hit from a single query.
>
> I used BLAST v.2.2.18 to generate my BLAST output.

Are you sure you are using the XML output?

With the plain text output and BLAST v.2.2.18, Biopython can only cope
with single query output. The NCBI regularly change their plain text
output, and we have more-or-less given up on our plain text parser.
The NCBI themselves do not recommend parsing it - that is what the XML
format was introduced for.

I can't offer any more advice without the error message, your OS (e.g.
Windows XP), version of Python, version of Biopython and ideally a
snippet of your code which is failing.

Peter

From jchen at alumni.caltech.edu Thu Apr 9 18:14:02 2009
From: jchen at alumni.caltech.edu (jchen at alumni.caltech.edu)
Date: Thu, 9 Apr 2009 15:14:02 -0700 (PDT)
Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?
In-Reply-To: <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com>
References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu>
 <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com>
Message-ID: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu>

Hi Peter,

> Do you just want the FASTA file to contain the matched region of the
> sequences in the database? That information should be in the BLAST
> output - you'll need to remove any gap characters.
>
> If you want the full sequence of each matched target, that isn't in
> the BLAST output. You'd have to take the reference number and look it
> up. If you made the database yourself from a FASTA file, that should be
> easy. If it was from NR/NT or another large database then maybe
> fetching the sequences from the NCBI would be easiest (try
> Bio.Entrez).

Yeah, I actually do want the full length FASTA sequences. I didn't think
about the fact that the BLAST output only contains (partial) match
regions.
I have a FASTA file of the entire proteome for the organism we are
studying.

> Are you sure you are using the XML output?
>
> With the plain text output and BLAST v.2.2.18, Biopython can only cope
> with single query output. The NCBI regularly change their plain text
> output, and we have more-or-less given up on our plain text
> parser. The NCBI themselves do not recommend parsing it - that is
> what the XML format was introduced for.
>

It's unfortunate that there's no standard BLAST format. Yeah, I am
trying to parse the plain text BLAST output. I'm not familiar with the
XML output - I don't know how to have BLAST output in XML format.

My file contains a few hundred queries. I ended up writing a little
script that extracted the name of each query and each of its significant
hits. I will probably end up writing my own scripts for getting the
FASTA sequences for each of these hits from a FASTA proteome file.

> I can't offer any more advice without the error message, your OS (e.g.
> Windows XP), version of Python, version of Biopython and ideally a
> snippet of your code which is failing.

That's alright. It will be easier for me to write my own little scripts
to parse my BLAST output file. I was just hoping there was an easy, fast
way to do it with Biopython.

Thanks for your help!
-Jerry

From agarbino at gmail.com Thu Apr 9 18:27:54 2009
From: agarbino at gmail.com (Alex Garbino)
Date: Thu, 9 Apr 2009 17:27:54 -0500
Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?
In-Reply-To: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu>
References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu>
 <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com>
 <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu>
Message-ID: <4cf37ad00904091527w519b3757wcd3b5854dd029d0b@mail.gmail.com>

I wrote a simple script to do that, pasted below & attached.
You supply your FASTA protein sequence up top, and it blasts it, and returns the top 200 hits in a CSV format with the full FASTA sequence for each hit. However, although it worked before (see output csv file), I'm trying it with a new protein (I've attached the fasta file .txt) and it gives me a StopIteration error; I'd appreciate help in debugging that!! The script also needs help in that:
1) it sometimes skips a hit for the same organism with a higher HSP value
2) the csv file is not perfectly delimited, sometimes the label gets broken up (see output in excel from a previous run where it did work)
3) I'd like to get e-values instead of HSP scores, but I can't figure out the structure of the record/how to get each piece.
Despite all that, it will do what you are wanting to do... in a very newbie way! :) -Alex Garbino

code:

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import Entrez

#Open file to blast
file = "ryr2fasta.txt"

#Blast, save copy
record = SeqIO.read(open(file), format="fasta")
result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=200)
blast_results = result_handle.read()
save_file = open(file[:-4]+"123.xml", "w")
save_file.write(blast_results)
save_file.close()
result_handle = open(file[:-4]+"123.xml")

#Load the blast record
blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()
output = {}
for x in blast_record.alignments:
    for hsp in x.hsps:
        output[x.accession] = [x.title]
        output[x.accession].extend([x.length])
        output[x.accession].extend([hsp.score])

for x in output:
    handle = Entrez.efetch(db="protein", id=x, rettype="genbank")
    record = SeqIO.parse(handle, "genbank")
    recurd = record.next()
    output[x].insert(0, recurd.id)
    output[x].insert(1, recurd.annotations["source"])
    output[x].extend([recurd.seq.tostring()])
#print output

save_file = open(file[:-4]+"123.csv", "w")

#Generate CSV
for item in output:
#    save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2]))
    save_file.write('%s,%s,%s,%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2],output[item][3],output[item][4],output[item][5]))
save_file.close()

On Thu, Apr 9, 2009 at 5:14 PM, wrote: > Hi Peter, > > > Do you just want the FASTA file to contain the matched region of the > > sequences in the database? That information should be in the BLAST > > output - you'll need to remove any gap characters. > > > > If you want the full sequence of each matched target, that isn't in > > the database. You'd have to take the reference number and look it up. > > If you made the database yourself from a FASTA file, that should be > > easy. If it was from NR/NT or another large database then maybe > > fetching the sequences from the NCBI would be easiest (try > > Bio.Entrez). > > Yeah, I actually do want the full length FASTA sequences. I didn't think > about the fact that the BLAST output only contains (partial) match > regions. I have a FASTA file of the entire proteome for the organism we > are studying. > > > Are you sure you are using the XML output? > > > > With the plain text output and BLAST v.2.2.18, Biopython can only cope > > with single query output. The NCBI regularly change their plain text > > output, and we have more-or-less given up with the our plain text > > parser. The NCBI themselves do not recommend parsing it - that is > > what the XML format was introduced for. > > > > That's unfortunate there's no standard BLAST format. Yeah, I am trying to > parse the plain text BLAST output. I'm not familiar with the XML output - > I don't know how to have BLAST output in XML format. > > My file contains a few hundred queries. I ended up writing a little script > that extracted the name of each query and each of its significant hits. I > will probably end up writing my own scripts for getting the FASTA > sequences for each of these hits from a FASTA proteome file.
> > > I can't offer any more advice without the error message, your OS (e.g. > > Windows XP), version of Python, version of Biopython and ideally a > > snippet of your code which is failing. > > That's alright. It will be easier for me to write my own little scripts to > parse my BLAST output file. I was just hoping there was an easy, fast way > to do it with Biopython. > > Thanks for your help! > -Jerry > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -------------- next part -------------- >gi|112799847|ref|NP_001026.2| cardiac muscle ryanodine receptor [Homo sapiens] MADGGEGEDEIQFLRTDDEVVLQCTATIHKEQQKLCLAAEGFGNRLCFLESTSNSKNVPPDLSICTFVLE QSLSVRALQEMLANTVEKSEGQVDVEKWKFMMKTAQGGGHRTLLYGHAILLRHSYSGMYLCCLSTSRSST DKLAFDVGLQEDTTGEACWWTIHPASKQRSEGEKVRVGDDLILVSVSSERYLHLSYGNGSLHVDAAFQQT LWSVAPISSGSEAAQGYLIGGDVLRLLHGHMDECLTVPSGEHGEEQRRTVHYEGGAVSVHARSLWRLETL RVAWSGSHIRWGQPFRLRHVTTGKYLSLMEDKNLLLMDKEKADVKSTAFTFRSSKEKLDVGVRKEVDGMG TSEIKYGDSVCYIQHVDTGLWLTYQSVDVKSVRMGSIQRKAIMHHEGHMDDGISLSRSQHEESRTARVIR STVFLFNRFIRGLDALSKKAKASTVDLPIESVSLSLQDLIGYFHPPDEHLEHEDKQNRLRALKNRQNLFQ EEGMINLVLECIDRLHVYSSAAHFADVAGREAGESWKSILNSLYELLAALIRGNRKNCAQFSGSLDWLIS RLERLEASSGILEVLHCVLVESPEALNIIKEGHIKSIISLLDKHGRNHKVLDVLCSLCVCHGVAVRSNQH LICDNLLPGRDLLLQTRLVNHVSSMRPNIFLGVSEGSAQYKKWYYELMVDHTEPFVTAEATHLRVGWAST EGYSPYPGGGEEWGGNGVGDDLFSYGFDGLHLWSGCIARTVSSPNQHLLRTDDVISCCLDLSAPSISFRI NGQPVQGMFENFNIDGLFFPVVSFSAGIKVRFLLGGRHGEFKFLPPPGYAPCYEAVLPKEKLKVEHSREY KQERTYTRDLLGPTVSLTQAAFTPIPVDTSQIVLPPHLERIREKLAENIHELWVMNKIELGWQYGPVRDD NKRQHPCLVEFSKLPEQERNYNLQMSLETLKTLLALGCHVGISDEHAEDKVKKMKLPKNYQLTSGYKPAP MDLSFIKLTPSQEAMVDKLAENAHNVWARDRIRQGWTYGIQQDVKNRRNPRLVPYTLLDDRTKKSNKDSL REAVRTLLGYGYNLEAPDQDHAARAEVCSGTGERFRIFRAEKTYAVKAGRWYFEFETVTAGDMRVGWSRP GCQPDQELGSDERAFAFDGFKAQRWHQGNEHYGRSWQAGDVVGCMVDMNEHTMMFTLNGEILLDDSGSEL AFKDFDVGDGFIPVCSLGVAQVGRMNFGKDVSTLKYFTICGLQEGYEPFAVNTNRDITMWLSKRLPQFLQ 
VPSNHEHIEVTRIDGTIDSSPCLKVTQKSFGSQNSNTDIMFYRLSMPIECAEVFSKTVAGGLPGAGLFGP KNDLEDYDADSDFEVLMKTAHGHLVPDRVDKDKEATKPEFNNHKDYAQEKPSRLKQRFLLRRTKPDYSTS HSARLTEDVLADDRDDYDFLMQTSTYYYSVRIFPGQEPANVWVGWITSDFHQYDTGFDLDRVRTVTVTLG DEKGKVHESIKRSNCYMVCAGESMSPGQGRNNNGLEIGCVVDAASGLLTFIANGKELSTYYQVEPSTKLF PAVFAQATSPNVFQFELGRIKNVMPLSAGLFKSEHKNPVPQCPPRLHVQFLSHVLWSRMPNQFLKVDVSR ISERQGWLVQCLDPLQFMSLHIPEENRSVDILELTEQEELLKFHYHTLRLYSAVCALGNHRVAHALCSHV DEPQLLYAIENKYMPGLLRAGYYDLLIDIHLSSYATARLMMNNEYIVPMTEETKSITLFPDENKKHGLPG IGLSTSLRPRMQFSSPSFVSISNECYQYSPEFPLDILKSKTIQMLTEAVKEGSLHARDPVGGTTEFLFVP LIKLFYTLLIMGIFHNEDLKHILQLIEPSVFKEAATPEEESDTLEKELSVDDAKLQGAGEEEAKGGKRPK EGLLQMKLPEPVKLQMCLLLQYLCDCQVRHRIEAIVAFSDDFVAKLQDNQRFRYNEVMQALNMSAALTAR KTKEFRSPPQEQINMLLNFKDDKSECPCPEEIRDQLLDFHEDLMTHCGIELDEDGSLDGNSDLTIRGRLL SLVEKVTYLKKKQAEKPVESDSKKSSTLQQLISETMVRWAQESVIEDPELVRAMFVLLHRQYDGIGGLVR ALPKTYTINGVSVEDTINLLASLGQIRSLLSVRMGKEEEKLMIRGLGDIMNNKVFYQHPNLMRALGMHET VMEVMVNVLGGGESKEITFPKMVANCCRFLCYFCRISRQNQKAMFDHLSYLLENSSVGLASPAMRGSTPL DVAAASVMDNNELALALREPDLEKVVRYLAGCGLQSCQMLVSKGYPDIGWNPVEGERYLDFLRFAVFCNG ESVEENANVVVRLLIRRPECFGPALRGEGGNGLLAAMEEAIKIAEDPSRDGPSPNSGSSKTLDTEEEEDD TIHMGNAIMTFYSALIDLLGRCAPEMHLIHAGKGEAIRIRSILRSLIPLGDLVGVISIAFQMPTIAKDGN VVEPDMSAGFCPDHKAAMVLFLDRVYGIEVQDFLLHLLEVGFLPDLRAAASLDTAALSATDMALALNRYL CTAVLPLLTRCAPLFAGTEHHASLIDSLLHTVYRLSKGCSLTKAQRDSIEVCLLSICGQLRPSMMQHLLR RLVFDVPLLNEHAKMPLKLLTNHYERCWKYYCLPGGWGNFGAASEEELHLSRKLFWGIFDALSQKKYEQE LFKLALPCLSAVAGALPPDYMESNYVSMMEKQSSMDSEGNFNPQPVDTSNITIPEKLEYFINKYAEHSHD KWSMDKLANGWIYGEIYSDSSKVQPLMKPYKLLSEKEKEIYRWPIKESLKTMLAWGWRIERTREGDSMAL YNRTRRISQTSQVSVDAAHGYSPRAIDMSNVTLSRDLHAMAEMMAENYHNIWAKKKKMELESKGGGNHPL LVPYDTLTAKEKAKDREKAQDILKFLQINGYAVSRGFKDLELDTPSIEKRFAYSFLQQLIRYVDEAHQYI LEFDGGSRGKGEHFPYEQEIKFFAKVVLPLIDQYFKNHRLYFLSAASRPLCSGGHASNKEKEMVTSLFCK LGVLVRHRISLFGNDATSIVNCLHILGQTLDARTVMKTGLESVKSALRAFLDNAAEDLEKTMENLKQGQF THTRNQPKGVTQIINYTTVALLPMLSSLFEHIGQHQFGEDLILEDVQVSCYRILTSLYALGTSKSIYVER QRSALGECLAAFAGAFPVAFLETHLDKHNIYSIYNTKSSRERAALSLPTNVEDVCPNIPSLEKLMEEIVE 
LAESGIRYTQMPHVMEVILPMLCSYMSRWWEHGPENNPERAEMCCTALNSEHMNTLLGNILKIIYNNLGI DEGAWMKRLAVFSQPIINKVKPQLLKTHFLPLMEKLKKKAATVVSEEDHLKAEARGDMSEAELLILDEFT TLARDLYAFYPLLIRFVDYNRAKWLKEPNPEAEELFRMVAEVFIYWSKSHNFKREEQNFVVQNEINNMSF LITDTKSKMSKAAVSDQERKKMKRKGDRYSMQTSLIVAALKRLLPIGLNICAPGDQELIALAKNRFSLKD TEDEVRDIIRSNIHLQGKLEDPAIRWQMALYKDLPNRTDDTSDPEKTVERVLDIANVLFHLEQKSKRVGR RHYCLVEHPQRSKKAVWHKLLSKQRKRAVVACFRMAPLYNLPRHRAVNLFLQGYEKSWIETEEHYFEDKL IEDLAKPGAEPPEEDEGTKRVDPLHQLILLFSRTALTEKCKLEEDFLYMAYADIMAKSCHDEEDDDGEEE VKSFEEKEMEKQKLLYQQARLHDRGAAEMVLQTISASKGETGPMVAATLKLGIAILNGGNSTVQQKMLDY LKEKKDVGFFQSLAGLMQSCSVLDLNAFERQNKAEGLGMVTEEGSGEKVLQDDEFTCDLFRFLQLLCEGH NSDFQNYLRTQTGNNTTVNIIISTVDYLLRVQESISDFYWYYSGKDVIDEQGQRNFSKAIQVAKQVFNTL TEYIQGPCTGNQQSLAHSRLWDAVVGFLHVFAHMQMKLSQDSSQIELLKELMDLQKDMVVMLLSMLEGNV VNGTIGKQMVDMLVESSNNVEMILKFFDMFLKLKDLTSSDTFKEYDPDGKGVISKRDFHKAMESHKHYTQ SETEFLLSCAETDENETLDYEEFVKRFHEPAKDIGFNVAVLLTNLSEHMPNDTRLQTFLELAESVLNYFQ PFLGRIEIMGSAKRIERVYFEISESSRTQWEKPQVKESKRQFIFDVVNEGGEKEKMELFVNFCEDTIFEM QLAAQISESDLNERSANKEESEKERPEEQGPRMAFFSILTVRSALFALRYNILTLMRMLSLKSLKKQMKK VKKMTVKDMVTAFFSSYWSIFMTLLHFVASVFRGFFRIICSLLLGGSLVEGAKKIKVAELLANMPDPTQD EVRGDGEEGERKPLEAALPSEDLTDLKELTEESDLLSDIFGLDLKREGGQYKLIPHNPNAGLSDLMSNPV PMPEVQEKFQEQKAKEEEKEEKEETKSEPEKAEGEDGEKEEKAKEDKGKQKLRQLHTHRYGEPEVPESAF WKKIIAYQQKLLNYFARNFYNMRMLALFVAFAINFILLFYKVSTSSVVEGKELPTRSSSENAKVTSLDSS SHRIIAVHYVLEESSGYMEPTLRILAILHTVISFFCIIGYYCLKVPLVIFKREKEVARKLEFDGLYITEQ PSEDDIKGQWDRLVINTQSFPNNYWDKFVKRKVMDKYGEFYGRDRISELLGMDKAALDFSDAREKKKPKK DSSLSAVLNSIDVKYQMWKLGVVFTDNSFLYLAWYMTMSVLGHYNNFFFAAHLLDIAMGFKTLRTILSSV THNGKQLVLTVGLLAVVVYLYTVVAFNFFRKFYNKSEDGDTPDMKCDDMLTCYMFHMYVGVRAGGGIGDE IEDPAGDEYEIYRIIFDITFFFFVIVILLAIIQGLIIDAFGELRDQQEQVKEDMETKCFICGIGNDYFDT VPHGFETHTLQEHNLANYLFFLMYLINKDETEHTGQESYVWKMYQERCWEFFPAGDCFRKQYEDQLN -------------- next part -------------- A non-text attachment was scrubbed... 
Name: blast.py Type: text/x-python Size: 1420 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ryr2fasta123.csv Type: application/csv Size: 590961 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Thu Apr 9 18:46:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Apr 2009 23:46:37 +0100 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? In-Reply-To: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> Message-ID: <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> On 4/9/09, jchen at alumni.caltech.edu wrote: > > Do you just want the FASTA file to contain the matched region of the > > sequences in the database? That information should be in the BLAST > > output - you'll need to remove any gap characters. > > > > If you want the full sequence of each matched target, that isn't in > > the database. You'd have to take the reference number and look it up. > > If you made the database yourself from a FASTA file, that should be > > easy. If it was from NR/NT or another large database then maybe > > fetching the sequences from the NCBI would be easiest (try > > Bio.Entrez). > > Yeah, I actually do want the full length FASTA sequences. I didn't think > about the fact that the BLAST output only contains (partial) match > regions. I have a FASTA file of the entire proteome for the organism we > are studying. You should be able to get the match IDs from the BLAST output and match them up to your FASTA file easily enough. > > Are you sure you are using the XML output? > > > > With the plain text output and BLAST v.2.2.18, Biopython can only cope > > with single query output. 
The NCBI regularly change their plain text > > output, and we have more-or-less given up with the our plain text > > parser. The NCBI themselves do not recommend parsing it - that is > > what the XML format was introduced for. > > That's unfortunate there's no standard BLAST format. Yeah, I am trying to > parse the plain text BLAST output. I'm not familiar with the XML output - > I don't know how to have BLAST output in XML format. If you are using the blastall tool at the command line directly, use the argument -m 7 (from memory - check the blastall help). If you are using the wrapper in Bio.Blast.NCBIStandalone, this defaults to requesting XML. Have you looked at our documentation or the tutorial? > My file contains a few hundred queries. I ended up writing a little script > that extracted the name of each query and each of its significant hits. I > will probably end up writing my own scripts for getting the FASTA > sequences for each of these hits from a FASTA proteome file. If you have already run the BLAST search and it would be slow to rerun it with XML output, then doing your own parser might be expedient. Anyway, once I had the sequence identifiers, I would use Bio.SeqIO to read the FASTA file. If the file is small, loading into memory as a python dictionary would be the simplest solution - see the Bio.SeqIO.to_dict function as one way to do this. Finally, sending attachments to the mailing list isn't a good idea - especially not half a megabyte of BLAST results! I think the mailing list has rejected that email anyway... Peter From biopython at maubp.freeserve.co.uk Thu Apr 9 18:49:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Apr 2009 23:49:51 +0100 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? 
In-Reply-To: <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> Message-ID: <320fb6e00904091549g2ec0c779p81e419964bc2de41@mail.gmail.com> On 4/9/09, Peter wrote: > Finally, sending attachments to the mailing list isn't a good idea - > especially not half a megabyte of BLAST results! I think the mailing > list has rejected that email anyway... Actually Alex's large email may have arrived in everyone's inbox after all. Well I hope Jerry found it useful. Peter From jmm217 at pitt.edu Sat Apr 11 19:49:50 2009 From: jmm217 at pitt.edu (John MacCallum) Date: Sat, 11 Apr 2009 19:49:50 -0400 Subject: [BioPython] Invitation for Biopython news coordinators In-Reply-To: <20090406230542.GK43636@sobchak.mgh.harvard.edu> Message-ID: <1239493790.27790.184.camel@localhost.localdomain> Hi, I'm a biology undergrad at the University of Pittsburgh and am interested in taking on the proposed news coordinator role. I'm likely not the most technically appropriate person for the position from a computational standpoint, but in the absence of other volunteers stepping forward I'd probably be adequate. The only caveat would be that I'd rather wait until after finals (about another two weeks) before beginning any new projects. 
Thanks, John MacCallum From chapmanb at 50mail.com Sun Apr 12 11:38:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 12 Apr 2009 11:38:00 -0400 Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon Message-ID: <20090412153800.GA77169@kunkel> Biopython folks; A friendly reminder that abstracts for BOSC 2009 talks are due tomorrow: http://open-bio.org/wiki/BOSC_2009 It would be great to see a strong Python representation there, so I encourage anyone with open source work to think about putting an abstract and talk together. I would also like to work on organizing a day or two of Biopython coding before or after the conference. The idea is to get a small group of interested programmers, decide on some topics of interest, and sit down together in real life and implement them. Since people are likely starting to get their flights together, the first order of business is to find out who is interested and what days before or after BOSC would work. Drop an e-mail to the list if you're attending BOSC or ISMB this year and would be interested in an informal hackathon. Brad From tiagoantao at gmail.com Sun Apr 12 14:08:33 2009 From: tiagoantao at gmail.com (Tiago Antão) Date: Sun, 12 Apr 2009 19:08:33 +0100 Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon In-Reply-To: <20090412153800.GA77169@kunkel> References: <20090412153800.GA77169@kunkel> Message-ID: <6d941f120904121108p1056719dkc10fe218206feccf@mail.gmail.com> Hi, I was not planning to attend. But if there is a hackathon I will talk with my boss and try to go... On Sun, Apr 12, 2009 at 4:38 PM, Brad Chapman wrote: > Biopython folks; > A friendly reminder that abstracts for BOSC 2009 talks are > due tomorrow: > > http://open-bio.org/wiki/BOSC_2009 > > It would be great to see a strong Python representation there, so I > encourage anyone with open source work to think about putting an > abstract and talk together.
> > I would also like to work on organizing a day or two of > Biopython coding before or after the conference. The idea is to get > a small group of interested programmers, decide on some topics > of interest, and sit down together in real life and implement them. > > Since people are likely starting to get their flights together, the > first order of business is to find out who is interested and what > days before or after BOSC would work. Drop an e-mail to the list > if you're attending BOSC or ISMB this year and would be interested > in an informal hackathon. > > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From biopython at maubp.freeserve.co.uk Mon Apr 13 05:50:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 10:50:19 +0100 Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon In-Reply-To: <20090412153800.GA77169@kunkel> References: <20090412153800.GA77169@kunkel> Message-ID: <320fb6e00904130250x1cdfed70j25079f12119ce01e@mail.gmail.com> On Sun, Apr 12, 2009 at 4:38 PM, Brad Chapman wrote: > Biopython folks; > A friendly reminder that abstracts for BOSC 2009 talks are > due tomorrow: > > http://open-bio.org/wiki/BOSC_2009 Yikes - that has crept up on me, thanks for the heads up. I'd already contacted them informally though... > It would be great to see a strong Python representation there, so I > encourage anyone with open source work to think about putting an > abstract and talk together. > > I would also like to work on organizing a day or two of > Biopython coding before or after the conference. The idea is to get > a small group of interested programmers, decide on some topics > of interest, and sit down together in real life and implement them.
We might be able to do that as part of the BOSC coding sessions? > Since people are likely starting to get their flights together, the > first order of business is to find out who is interested and what > days before or after BOSC would work. Drop an e-mail to the list > if you're attending BOSC or ISMB this year and would be interested > in an informal hackathon. I'm hoping to attend BOSC and ISMB (finances permitting) and an informal hackathon sounds good, although I'm not yet sure when exactly would be the best time. Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 06:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 11:44:29 +0100 Subject: [BioPython] BOSC 2009 Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Hello Biopythoneers, Those of you following the dev-mailing list or the OBF news feed will know that talk abstracts for BOSC 2009 are due in today, see http://www.open-bio.org/wiki/BOSC_2009 I should be able to attend and present the Biopython Project Update, and a few other Biopython developers may be around too, so some sort of hackathon is in the air. It is a bit unfortunate the deadline was scheduled on the Easter break, as I'm sure quite a few of you will be on holiday, but here is an outline abstract. If anyone has comments, please let me know (on the list or directly) in the next couple of hours... Biopython Project Update (draft abstract for BOSC 2009) In this talk we present the current status of the Biopython project, focusing on features developed in the last year, and future plans for the project. The Oxford University Press journal Bioinformatics has recently published an application note describing Biopython: Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Mar 20.
doi:10.1093/bioinformatics/btp163 Since BOSC 2008, Biopython 1.49 has been released. This was an important milestone in bringing support for Python 2.6, and in terms of our dependence on Numerical Python as we made the transition from the obsolete Numeric library to NumPy. Biopython 1.49 also added more biological methods to our core sequence object. April 2009 will see the release of Biopython 1.50 (at the time of writing, a beta has already been released). Some of the new features include: 1. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. 2. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. 3. Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. Biopython will celebrate its 10th Birthday later this year, we will present a brief history of the project and current work. This includes the evaluation of git (and github) as a possible distributed version control system (DVCS) to replace our existing very stable CVS server hosted by the Open Bioinformatics Foundation, which we hope will encourage more participation in the project. 
-- Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 09:33:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:33:03 +0100 Subject: [BioPython] BOSC 2009 In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com> On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote: > Hello Biopythoneers, > > Those of you following the dev-mailing list or the OBF news feed will > know that talk abstracts for BOSC 2009 are due in today, see > http://www.open-bio.org/wiki/BOSC_2009 > I should be able to attend and present the Biopython Project > Update, and a few other Biopython developers may also be > around too, so some sort of hackathon is in the air. > > It is a bit unfortunate the deadline was scheduled on the Easter > break, as I'm sure quite a few of you will be on holiday, but here > is an outline abstract. If anyone has comments, please let me > know (on the list or directly) in the next couple of hours... That's been submitted now, although I can still make revisions at the moment if anyone spots something worth adding/fixing. I did remember to add the website and license information as BOSC request in their instructions. Peter From peter at maubp.freeserve.co.uk Mon Apr 13 09:47:04 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:47:04 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object Message-ID: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> Hi all, I've filed enhancement bug 2809 with a patch to add startswith and endswith methods to the Seq object, http://bugzilla.open-bio.org/show_bug.cgi?id=2809 I'm confident there are many possible use cases for this.
The example which prompted me to work on this was taking SeqRecord objects from sequencing reads (a FASTQ file read in with Bio.SeqIO, possible with Biopython 1.50 beta or later) where some include a PCR primer associated prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this I need to know if a given SeqRecord's sequence starts with (or ends with) a given primer sequence (or a tuple of primer sequences). e.g. I want to be able to do this:

primer = "TGACCTGAAAAGAC"
crop = len(primer)
#record is a SeqRecord object
if record.seq.startswith(primer) :
    record = record[crop:]

Currently you'd have to turn the Seq into a string to use its startswith method, which is not as nice:

primer = "TGACCTGAAAAGAC"
crop = len(primer)
#record is a SeqRecord object
if str(record.seq).startswith(primer) :
    record = record[crop:]

or maybe use the find method instead:

primer = "TGACCTGAAAAGAC"
crop = len(primer)
#record is a SeqRecord object
if 0 == record.seq.find(primer) :
    record = record[crop:]

Does this seem like a sensible addition to the Seq object? It is consistent with making the Seq object more like a python string. Peter From lpritc at scri.ac.uk Mon Apr 13 10:10:30 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 13 Apr 2009 15:10:30 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> Message-ID: Howdo, On 13/04/2009 14:47, "Peter" wrote: > I'm confident there are many possible use cases for this. > > The example which prompted me to work on this was taking SeqRecord > objects from sequencing reads (a FASTQ file read in with Bio.SeqIO, > possible with Biopython 1.50 beta or later) where some include a PCR > primer associated prefix/suffix which I want to strip off (by slicing > the SeqRecord).
To do this I need to know if a given SeqRecord's > sequence starts with (or ends with) a given primer sequence (or a > tuple of primer sequences). > > e.g. I want to be able to do this: > > primer = "TGACCTGAAAAGAC" > crop = len(primer) > #record is a SeqRecord object > if record.seq.startswith(primer) : > record = record[crop:] [...] > Does this seem like a sensible addition to the Seq object? It is > consistent with making the Seq object more like a python string. Yes it does seem sensible. I'd quite like to (eventually) have the capability either to provide ambiguity symbols, or to query with a regular expression along the lines of re.match() (or maybe the nonexistent re.endmatch()). Since this isn't implemented yet, maybe there's still time to consider this potential usage in the implementation? L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From peter at maubp.freeserve.co.uk Mon Apr 13 10:46:31 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 15:46:31 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> Message-ID: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard wrote: > > Howdo, > >> Does this seem like a sensible addition to the Seq object? It is >> consistent with making the Seq object more like a python string. > > Yes it does seem sensible. Good :) > I'd quite like to (eventually) have the capability either to provide ambiguity > symbols, or to query with a regular expression along the lines of > re.match() (or maybe the nonexistent re.endmatch()). > > Since this isn't implemented yet, maybe there's still time to consider this > potential usage in the implementation? I'm not at all happy about the idea of supporting ambiguity characters in these string-like methods of the Seq object. Right now I was proposing nothing special with ambiguity symbols, so:

>>> from Bio.Seq import Seq
>>> Seq("TAN").startswith("TAN")
True
>>> Seq("TAA").startswith("TAN")
False
>>> Seq("TAA").startswith("TAX")
False

I agree this doesn't cover all possible use cases, but it is very simple, and easy to explain. Trying to support ambiguity symbols will be alphabet dependent (consider the above example could be a protein or DNA), and frankly extremely complicated. It also breaks the "act like a string" idea.
Essentially you'd be asking for the following:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> Seq("TAN", generic_dna).startswith("TAN")
True
>>> Seq("TAA", generic_dna).startswith("TAN") #treat N specially
True
>>> Seq("TAN", generic_protein).startswith("TAN")
True
>>> Seq("TAA", generic_protein).startswith("TAN") #protein, so N is a normal amino acid
False
>>> Seq("TAX", generic_protein).startswith("TAX")
True
>>> Seq("TAA", generic_protein).startswith("TAX") #treat X specially
True

So far that is at least understandable - but what would you expect the following to do, where we don't know if it is DNA or protein:

>>> Seq("TAA").startswith("TAN")
>>> Seq("TAA").startswith("TAX")

We don't know, therefore we shouldn't guess, so I think these would have to raise an error. This also applies to the other ambiguous nucleotide letters, like S for G or C in nucleotide sequences. Then there are more alphabet corner cases - consider reduced alphabets (e.g. simplified protein sequences mapping all acidic residues to a single character etc). Several "Zen of Python" points spring to mind, including "If the implementation is hard to explain, it's a bad idea.", but in summary I am against supporting ambiguous characters in the string-like methods of the Seq object (so: find, rfind, split, startswith, endswith, etc). We should handle this another way. Bartek: would Bio.Motif give us a nice way to do these kinds of searches? For example, given a simple nucleotide motif of "TAN" (which should match TAA, TAC, TAG or TAT) or "TAS" (which should match "TAC" or "TAG"), can we check if this matches at the start of a target nucleotide sequence? And similarly for protein motifs (e.g. signal peptides).
Peter From mmokrejs at ribosome.natur.cuni.cz Mon Apr 13 10:44:14 2009 From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ) Date: Mon, 13 Apr 2009 16:44:14 +0200 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> Message-ID: <49E34FBE.3070308@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Hi all, > > I've filed enhancement bug 2809 with a patch to add startswith and > endswith methods to the Seq object, > http://bugzilla.open-bio.org/show_bug.cgi?id=2809 > > e.g. I want to be able to do this: > > primer = "TGACCTGAAAAGAC" > crop = len(primer) > #record is a SeqRecord object > if record.seq.startswith(primer) : > record = record[crop:] > > > Does this seem like a sensible addition to the Seq object? It is > consistent with making the Seq object more like a python string. Yes, I like this new approach. Martin From lpritc at scri.ac.uk Mon Apr 13 11:46:06 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 13 Apr 2009 16:46:06 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> Message-ID: On 13/04/2009 15:46, "Peter" wrote: > On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard wrote: > I'm not at all happy about the idea of supporting ambiguity characters > in these string-like methods of the Seq object. Right now I was > proposing nothing special with ambiguity symbols, so: > >>>> from Bio.Seq import Seq >>>> Seq("TAN").startswith("TAN") > True >>>> Seq("TAA").startswith("TAN") > False >>>> Seq("TAA").startswith("TAX") > False > > I agree this doesn't cover all possible use cases, but it is very > simple, and easy to explain.
That's in its favour, but I don't think that: "Seq.startswith() behaves as expected for standard ambiguity symbols and regular expression syntax if the Seq object is declared with either a protein or nucleotide alphabet, but behaves like String.startswith() otherwise" is either complicated, or hard to explain. I think that the choice is one that is best made on whether the functionality is useful like this, or better implemented in some other way. On a design point, I'm not convinced that direct emulation of String methods in Seq objects is *always* a Good Thing™. There are String methods that it makes sense (to me) to emulate wholesale in Seq, such as .join(), .swapcase(), .upper(), slicing behaviours and so on. However, .title() and .capitalize() seem a bit out of place. Likewise, there are plenty of Seq methods that don't have sensible String counterparts. This is because, conceptually, they represent different abstract concepts, and I don't think that we should lose sight of that when making Seq objects behave like String objects. I think that abstract representation of sequences and provision of useful functionality are the important points. > Trying to support ambiguity symbols will be alphabet dependent > (consider the above example could be a protein or DNA), and frankly > extremely complicated. I don't think it *has* to be complicated at all, though it could be if we wanted it to be. For example, avoiding ambiguity codes for now:

>>> from Bio.Seq import Seq
>>> Seq("TAG").startswith("TA[CG]")

Could be handled internally with re.match(), in the same way that

>>> Seq("TAG").startswith("TAG")

could be. Seq.endswith() might be implementable by checking that an re.search() call returns at least one group that stops at the end of the target sequence, for example. These methods would cover pretty much every use case I can think of right now that doesn't involve an ambiguity symbol.
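A minimal sketch of that regex idea, using plain Python strings in place of Seq objects. The helper names seq_startswith and seq_endswith are hypothetical (they are not part of Biopython), and anchoring with \Z is just one possible way to require that a match stops at the end of the sequence:

```python
import re

def seq_startswith(sequence, pattern):
    """True if the regex pattern matches at the start of the sequence."""
    # re.match() only ever matches at position 0 of the string
    return re.match(pattern, sequence) is not None

def seq_endswith(sequence, pattern):
    """True if the regex pattern matches a stretch ending at the last letter."""
    # Wrap the pattern in a non-capturing group and anchor it with \Z,
    # so the match has to stop exactly at the end of the sequence
    return re.search("(?:%s)\\Z" % pattern, sequence) is not None

print(seq_startswith("TAGGCA", "TA[CG]"))  # True
print(seq_startswith("TATGCA", "TA[CG]"))  # False
print(seq_endswith("TAGGCA", "G[CT]A"))    # True
```

A pattern with no special characters, such as "TAG", gives the same answer here as the ordinary string startswith and endswith methods.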
They wouldn't break String.startswith() behaviour for biological sequences, because the special symbols have no place in the biological sequence alphabets (except, perhaps, for gap characters). Such an implementation could gain extra *useful* functionality in .startswith() without breaking expected behaviour. It would also leave us in the same position originally proposed, that ambiguity symbols have no meaning. > It also breaks the "act like a string" idea. I do not agree, because there's some elision in a lawyer's definition of 'act like', as opposed to 'act as' a string that comes into play ;) If the Seq object acts *like* a string, then *when we expect it to* that doesn't prevent us from having functionality more appropriate for a Seq object, in addition to or instead of String behaviour. We're already doing this with the Seq.transcribe() and Seq.translate() (and no Seq.title()) methods, for example. I don't see how this differs conceptually for extending startswith() functionality, so long as it behaves like String.startswith() *when we expect it to*. The issue here is then: "when is it reasonable to expect this string of symbols to behave like a raw string, and when is it reasonable to expect it to behave like a biological/regex sequence of symbols?". > what would you expect the > following to do, where we don't know if it is DNA or protein: >>>> Seq("TAA").startswith("TAN") >>>> Seq("TAA").startswith("TAX") > We don't know, therefore we shouldn't guess, so I think these would > have to raise an error. That's one option, and likely the sanest - the error would probably provoke the user into specifying an alphabet, at least. However, there's no harm in discussing other options, even if none of us like them... If the sequence has an alphabet that specifies it as either Protein or Nucleotide, then in those cases we can infer clearly what the ambiguity symbol means, and there is no problem. 
If the sequence does not have such an alphabet, then we can potentially consider Seq.startswith() to behave like String.startswith(), with an appropriate warning. Alternatively, Seq.startswith() could behave like String.startswith() all the time, unless passed with an optional argument (e.g. "ambiguity=True"). Or maybe another optional argument could be passed to force the search to treat a sequence without an alphabet as either "type='protein'" or "type='RNA'", thereby suppressing the warning/error described above. > This also applies to the other ambiguous > nucleotide letters, like S for G or C in nucleotide sequences. Then > there are more alphabet corner cases - consider reduced alphabets > (e.g. simplified protein sequences mapping all acidic residues to a > single character etc). ...and what if the user makes up their own alphabet?, and so on... ;) Those would be neither Protein nor DNA/RNA alphabets, and so could do whatever the default is for Seq.startswith() behaviour in those circumstances. Another alternative could be to have an optional argument defining the ambiguity symbols, and what they represent (e.g. "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). > Several "Zen of Python" points spring to mind, including "If the > implementation is hard to explain, it's a bad idea.", but in summary I > against supporting ambiguous characters in the string-like methods of > the Seq object (so: find, rfind, split, startswith, endswith, etc). > We should handle this another way. If the natural home for this functionality is Bio.Motif, then the natural home for it is Bio.Motif, and I don't have a problem with that. I'm happy to go with the consensus. L. 
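The ambiguity_table idea above can be sketched with nothing but the standard re module. This is only an illustration on plain strings, not Biopython code: the function names and the NUCLEOTIDE_AMBIGUITY table are made up for the example, the table lists only a few IUPAC codes, and (unlike the "TA[CG]" example) the query is treated literally apart from the symbols in the table.

```python
import re

# Hypothetical table for illustration only (not part of Biopython):
# a few IUPAC nucleotide ambiguity codes mapped to regex character classes.
NUCLEOTIDE_AMBIGUITY = {
    "N": "[ACGT]",
    "S": "[CG]",
    "W": "[AT]",
    "R": "[AG]",
    "Y": "[CT]",
}


def to_pattern(query, ambiguity_table=None):
    """Build a regex from query, expanding only symbols found in the table."""
    table = ambiguity_table or {}
    return "".join(table.get(letter, re.escape(letter)) for letter in query)


def startswith(sequence, query, ambiguity_table=None):
    """Plain string behaviour by default, ambiguity matching if a table is given."""
    return re.match(to_pattern(query, ambiguity_table), sequence) is not None


def endswith(sequence, query, ambiguity_table=None):
    """As startswith, but the match must stop at the end of the sequence."""
    pattern = "(?:" + to_pattern(query, ambiguity_table) + r")\Z"
    return re.search(pattern, sequence) is not None


print(startswith("TAA", "TAN"))                        # literal comparison
print(startswith("TAA", "TAN", NUCLEOTIDE_AMBIGUITY))  # N expanded to [ACGT]
print(endswith("GGTAG", "TAS", NUCLEOTIDE_AMBIGUITY))  # S expanded to [CG]
```

Without a table the functions behave like the string methods, so the default stays "act like a string" and the ambiguity handling is strictly opt-in.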
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From peter at maubp.freeserve.co.uk Mon Apr 13 11:56:35 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 16:56:35 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> Message-ID: <320fb6e00904130856vbda6446l2fa44c71f8308a0a@mail.gmail.com> Leighton Pritchard wrote: >> I'd quite like to (eventually) have the capability either to provide ambiguity >> symbols, or to query with a regular expression along the lines of >> re.match() (or maybe the nonexistent re.endmatch()). >> >> Since this isn't implemented yet, maybe there's still time to consider this >> potential usage in the implementation? Peter wrote: > [Stuff about issues with the alphabet altering the behaviour] ..., but in > summary I am against supporting ambiguous characters in the string-like > methods of the Seq object (so: find, rfind, split, startswith, endswith, etc). > We should handle this another way. > > Bartek: would Bio.Motif give us a nice way to do these kinds of > searches? For example, given a simple nucleotide motif of "TAN" > (which should match TAA, TAC, TAG or TAT) or "TAS" (which should match > "TAC" or "TAG"), can we check if this matches at the start of a target > nucleotide sequence? And similarly for protein motifs (e.g. signal > peptides). This feels like a rehash of some of the debate on Bug 2601 doesn't it?
http://bugzilla.open-bio.org/show_bug.cgi?id=2601 On Bug 2601 comment 5, Leighton wrote: >> I think the abstract distinction between search types here is: >> >> 1) Find match at start of sequence (re.match() and string.startswith()) >> 2) Find first match in sequence (re.search() and string.find()) >> 3) Find all non-overlapping matches in sequence (re.finditer() only) >> 4) Find all overlapping matches in sequence (neither re nor string) >> 1a) 2a) 3a) 4a) The same, but in the reverse complement. >> >> Moving down the list, the problem becomes more general. The type >> of search I need most often in biological sequences is number (4a), >> or (4) for proteins. Each of search types (1) to (3) (a or not) has a >> theoretically faster implementation than doing (4) then filtering the >> results. I don't mind having more than one search method with >> different names, or having to specify arguments to get a particular >> kind of search. I do mind not having (4a) as an option... Bartek, can Bio.Motif address these four (or eight) questions from Leighton, or am I expecting the wrong things from it? Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 13:04:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 18:04:41 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> Message-ID: <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard wrote: > However, there's no harm in discussing other options, even if none of > us like them... > > If the sequence has an alphabet that specifies it as either Protein or > Nucleotide, then in those cases we can infer clearly what the ambiguity > symbol means, and there is no problem. Strictly speaking, only if the sequence has an (ambiguous) IUPAC alphabet can we know what the (ambiguity) symbols mean with certainty. 
If the sequence has only a generic DNA/RNA/Nucleotide/Protein alphabet then we can only make a pretty good guess. > Alternatively, Seq.startswith() could behave like String.startswith() all > the time, unless passed with an optional argument (e.g. "ambiguity=True"). That idea could work. The default behaviour would be "act like a string", but an optional argument to startswith/endswith/find/rfind/count/... could enable ambiguity matching (provided the sequence has a suitable alphabet). This would be backwards compatible, and allow us to forge ahead with adding simple string-like startswith/endswith methods now (which are useful as is, and so far everyone seems supportive of), and implement ambiguity support later. > Or maybe another optional argument could be passed to force the search to > treat a sequence without an alphabet as either "type='protein'" or > "type='RNA'", thereby suppressing the warning/error described above. > ... > Another alternative could be to have an optional argument defining the > ambiguity symbols, and what they represent (e.g. > "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). If we go down the optional argument route (e.g. ambiguity=True), then a way of specifying the sequence type or ambiguity characters might be possible, although I'd prefer to encourage more rigorous use of alphabets in Seq objects in the first place (see also enhancement Bug 2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this topic). If we consider the situation where someone creates their own custom alphabet, and wants to define their own ambiguity characters, I think any ambiguous search functionality would have to interrogate the alphabet object at run time. Possible, but a bit tricky. 
>> Several "Zen of Python" points spring to mind, including "If the >> implementation is hard to explain, it's a bad idea.", but in summary I am >> against supporting ambiguous characters in the string-like methods of >> the Seq object (so: find, rfind, split, startswith, endswith, etc). >> We should handle this another way. > > If the natural home for this functionality is Bio.Motif, then the natural > home for it is Bio.Motif, and I don't have a problem with that. I'm happy > to go with the consensus. Well, let's hear what Bartek has to say (Bio.Motif author). Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 14:15:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 19:15:00 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? Message-ID: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> Dear Biopythoneers, There is a saying "no news is good news", but as per the title - can we have some feedback from the Biopython 1.50 beta release please? For example, this was the first time we've included a Windows installer for Python 2.6. We were waiting for NumPy 1.3, which was the first NumPy release to support Python 2.6 on Windows. If you've tried this and it works for you, please write us an email. Support for reading/writing FASTQ and QUAL files is also new - if you've tried it out on your own second generation sequencing files, again, it would be nice to know how it worked. If everything is fine, great, but if for example it can't parse the files from your local sequencing center, please let us know.
Thanks, Peter From chapmanb at 50mail.com Mon Apr 13 21:53:01 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 21:53:01 -0400 Subject: [BioPython] Invitation for Biopython news coordinators In-Reply-To: <1239493790.27790.184.camel@localhost.localdomain> References: <20090406230542.GK43636@sobchak.mgh.harvard.edu> <1239493790.27790.184.camel@localhost.localdomain> Message-ID: <20090414015301.GA80360@kunkel> Hi John; Great. We'd be very happy to have you working on this. David Winter had also indicated an interest; see this post over on the development list: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005711.html Having several people involved will work out well. When you are finished up with finals, check back in with the list and David and we can get you set up with whatever you need. Good luck with exams and thanks again for the message, Brad > I'm a biology undergrad at the University of Pittsburgh and am > interested in taking on the proposed news coordinator role. I'm likely > not the most technically appropriate person for the position from a > computational standpoint, but in the absence of other volunteers > stepping forward I'd probably be adequate. > > The only caveat would be that I'd rather wait until after finals (about > another two weeks) before beginning any new projects. > > Thanks, > > John MacCallum > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From yvan.strahm at bccs.uib.no Tue Apr 14 10:00:17 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Tue, 14 Apr 2009 16:00:17 +0200 Subject: [BioPython] Is query_length really the length of query? 
In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> Message-ID: <49E496F1.9060608@bccs.uib.no> Peter wrote: > On Wed, Apr 1, 2009 at 11:34 AM, wrote: >> Hello List >> >> I try to get the length of the query from the blast result itself >> >> like that: >> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, >> "blastn", >> my_blast_db, >> my_blast_file) >> >> from Bio.Blast import NCBIXML >> blast_records = NCBIXML.parse(result_handle) >> for blast_record in blast_records >> >> but >> blast_record.query_length return None >> and >> blast_record.query_letters return the actual size >> >> Should I test the length of the query before the blast result? O did I >> miss-interpreted the meaning of query_length and query_letters? >> >> Thanks for your time >> >> Is query_length really the length of query? > > You can use query_letters (although it wouldn't hurt to double check > this if you have the query sequence available). With the current BLAST > XML parser query_length is always None (but I think we should fix so > they are both populated). > > Its an unfortunate historical accident dating back to the plain text > BLAST parser. The plain text output printed the query length in two > places, with different captions, which was reflected in the names > given in the BLAST record (the values should be the same, assuming the > BLAST output is sane). The XML output doesn't have this redundancy, > but our XML parser tries to use the same object to hold the results. > See: http://bugzilla.open-bio.org/show_bug.cgi?id=2176#c12 > > Have a look at the discussion on Bug 2176 for more about this > (including the far more complicated situation for the database length > which has multiple meanings). > > This seems like a timely reminder that we could perhaps tidy up a > little of this ready for Biopython 1.50 ... 
> > Peter Hello, I tried to check the length before sending it to blast. My problem is that all the query sequences are in a file so I used SeqIO to read/parse them for record in SeqIO.parse(fh, "fasta"): l_query = len(record.seq) result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, record.seq) doesn't work as NCBIStandalone.blastall takes a file as infile. Should I write a temporary file with the record.id and record.seq and pass it to NCBIStandalone.blastall? Or is there an easier way? For now I am just using the blast_record.query_letters variable. I am using Biopython 1.49 and Python 2.6.1 cheers, yvan From biopython at maubp.freeserve.co.uk Tue Apr 14 10:23:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 15:23:03 +0100 Subject: [BioPython] Is query_length really the length of query? In-Reply-To: <49E496F1.9060608@bccs.uib.no> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no> Message-ID: <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm wrote: > Hello, > > I tried to check the length before sending it to blast. > My problem is that all the query sequences are in a file so I used SeqIO to > read/parse them > > for record in SeqIO.parse(fh, "fasta"): > l_query = len(record.seq) > result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, > "blastn", > my_blast_db, > record.seq) > > doesn't work as NCBIStandalone.blastall takes a file as infile. > > Should I write a temporary file with the record.id and record.seq and pass > it to NCBIStandalone.blastall ? > > or is there an easier way? It sounds like you already have a FASTA file containing the query sequences, so just use that as the input to standalone BLAST. i.e.
I would do something like this to double check the reported query length matches up with the actual query length:

    from Bio import SeqIO
    from Bio.Blast import NCBIXML
    from Bio.Blast import NCBIStandalone

    query_filename = "example.fasta"

    #Load all the queries into memory as a dictionary of SeqRecord objects
    query_handle = open(query_filename)
    query_dict = SeqIO.to_dict(SeqIO.parse(query_handle, "fasta"))
    query_handle.close()

    #Run BLAST and loop over the XML blast results one by one (memory efficient),
    result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
                                                          "blastn", my_blast_db,
                                                          query_filename)
    for blast_record in NCBIXML.parse(result_handle):
        query_record = query_dict[blast_record.query_id] #check this
        assert len(query_record) == blast_record.query_letters
        assert len(query_record) == blast_record.query_length #Biopython 1.50b or later

Note I haven't actually tested this example, but I think the idea is clear. This approach gives you easy access to the full query sequence, and its full description. If all you care about is the length, then rather than storing a dictionary of the queries as SeqRecords, just use a dictionary of their lengths as integers. Peter From biopython at maubp.freeserve.co.uk Tue Apr 14 10:25:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 15:25:40 +0100 Subject: [BioPython] Is query_length really the length of query? In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> Message-ID: <320fb6e00904140725w341b5882xeeecd9b0664950a2@mail.gmail.com> On Wed, Apr 1, 2009 at 11:59 AM, Peter wrote: >> >> Should I test the length of the query before the blast result? O did I >> miss-interpreted the meaning of query_length and query_letters? >> >> Thanks for your time >> >> Is query_length really the length of query?
> > You can use query_letters (although it wouldn't hurt to double check > this if you have the query sequence available). With the current BLAST > XML parser query_length is always None (but I think we should fix so > they are both populated). > > Its an unfortunate historical accident dating back to the plain text > BLAST parser. [...] > > This seems like a timely reminder that we could perhaps tidy up a > little of this ready for Biopython 1.50 ... This was fixed in Biopython 1.50 beta, so you can now use either the query_length or the query_letters property when parsing BLAST XML output. For older versions of Biopython as noted above, query_length was left as None when parsing XML. Peter From biopython at maubp.freeserve.co.uk Tue Apr 14 14:07:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 19:07:13 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> Message-ID: <320fb6e00904141107m388985a1h56c8b534041f51b4@mail.gmail.com> On Mon, Apr 13, 2009 at 7:15 PM, Peter wrote: > Support for reading/writing FASTQ and QUAL files is also new - if > you've tried it out on your own second generation sequencing files, > again, it would be nice to know how it worked. ?If everything is fine, > great, but if for example it can't parse the files from your local > sequencing center, please let us know. David Schruth emailed to let me know he's successfully been using the new QUAL functionality in Bio.SeqIO on 454 data. Thanks David! There isn't much in the main Biopython tutorial on this yet, but in the meantime have a look at the built in documentation for our FASTQ and QUAL support: >>> from Bio import SeqIO >>> help(SeqIO.QualityIO) ... 
For those that didn't know, the Roche 454 off instrument applications (available on Linux only I believe) include a command line tool called "sffinfo" which can convert a binary SFF file into FASTA (using the command line option -s or -seq) or QUAL format using PHRED qualities (command line option -q or -qual). I've been using this myself to get some Roche 454 SFF read data into Bio.SeqIO in order to manually trim off primer sequences. Peter From biopython at maubp.freeserve.co.uk Tue Apr 14 16:36:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 21:36:09 +0100 Subject: [BioPython] Reading Roche 454 binary SFF files in Python Message-ID: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> Jose Blanca wrote this interesting reply (which I assume he meant to send to the whole mailing list, not just me): On 4/14/09, Blanca Postigo Jose Miguel wrote: > > > For those that didn't know, the Roche 454 off instrument applications > > (available on Linux only I believe) include a command line tool called > > "sffinfo" which can convert a binary SFF file into FASTA (using the > > command line option -s or -seq) or QUAL format using PHRED qualities > > (command line option -q or -qual). I've been using this myself to get > > some Roche 454 SFF read data into Bio.SeqIO in order to manually trim > > off primer sequences. > > For the ones that do not have the 454 software there's a free software > alternative. Some time ago Bastien Chevreux and I created a little utility to > convert sff files to fasta and xml (for the ancilliary info). It's called > sff_extract, is written in python and released under the GPL. > You can get the python script here: > http://bioinf.comav.upv.es/sff_extract/index.html > Maybe I should have announce it here, but I didn't, my fault. > If you think this code could be of some interest for you I could talk with > Bastien about the possibility of submitting it to biopython. 
Although in that > case it could use some cleaning, it works, but it could be nicer. > > Best regards, > > Jose Blanca That does sound interesting - if you want, email me a proper release announcement and I can forward it to the Biopython announcement mailing list. I was aware that some information was available about the SFF file format, and it should be possible to reverse engineer the format in order to read and write it directly from Biopython. Right now with your code under the GPL, we can't incorporate it into Biopython, but if you and Bastien are prepared to offer it to Biopython under our MIT/BSD licence that could be very useful. Even without that, any documentation on the file format or example files you might be able to share could be valuable. I felt that adding FASTQ and QUAL support to Biopython should come first, but since the Bio.SeqIO framework is extendible perhaps we could add native support for SFF files to Biopython later on. Given people can use the Roche 454 tools (if they have them) or your open source sff_extract to get the data out of an SFF file, this isn't urgent, but is worth thinking about :) Peter P.S. Have you tested your sff_extract software on SFF files from the new Roche v2 software, released about the same time as the "titanium" 454 upgrade? From jblanca at btc.upv.es Wed Apr 15 03:07:27 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 15 Apr 2009 09:07:27 +0200 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> Message-ID: <200904150907.27360.jblanca@btc.upv.es> > I was aware that some information was available about the SFF file > format, and it should be possible to reverse engineer the format in > order to read and write it directly from Biopython. The sff format is fully documented in the NCBI's SRA web site.
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff > Right now with your code under the GPL, we can't incorporate it into > Biopython, but if you and Bastien are prepared to offer it to > Biopython under our MIT/BSD licence that could be very useful. Even > without that, any documentation on the file format or example files > you might be able to share could be valuable. I guess that it wouldn't be a problem to offer you the code under your licence. But I don't think that's the best approach. The code as it is right now is not well suited to be integrated in a library. It would be easier to rewrite the sff reading part from scratch. I could do that for you in no time. The main problem would be to have sff files small enough to be used for the test. If you could provide that I could write the code to extract the information from the sff file for you. It would be easy to build a generator able to deliver the sequences one by one. sff_extract also is able to split the paired-ends reads. That's the part that Bastien wrote. Integrating that would be nice, but I think that in Biopython that should be treated as an independent problem. > P.S. Have you tested your sff_extract software on SFF files from the > new Roche v2 software, released about the same time as the "titanium" > 454 upgrade? Not me, but I think that Bastien has and he has found no problem at all with that. The sff format is well thought and consistent, the 454 people did a much better job than the ABI people did with the abi format. Best regards, -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From yvan.strahm at bccs.uib.no Wed Apr 15 04:32:49 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Wed, 15 Apr 2009 10:32:49 +0200 Subject: [BioPython] Is query_length really the length of query? In-Reply-To: <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no> <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> Message-ID: <49E59BB1.7050208@bccs.uib.no> Peter wrote: > On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm wrote: >> Hello, >> >> I tried to check the length before sending it to blast. >> My problem is that all the query sequences are in a file so I used SeqIO to >> read/parse them >> >> for record in SeqIO.parse(fh, "fasta"): >> l_query = len(record.seq) >> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, >> "blastn", >> my_blast_db, >> record.seq) >> >> doesn't work as NCBIStandalone.blastall takes a file as infile. >> >> Should I write a temporary file with the record.id and record.seq and pass >> it to NCBIStandalone.blastall ? >> >> or is there an easier way? > > It sounds like you already have a FASTA file containing the query > sequences, so just use that as the input to standalone BLAST. > > i.e. 
I would do something like this to double check the reported query > length matches up with the actual query length: > > from Bio import SeqIO > from Bio.Blast import NCBIXML > from Bio.Blast import NCBIStandalone > query_filename = "example.fasta" > #Load all the queries into memory as a dictionary of SeqRecord objects > query_handle = open("example.fasta") > query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta")) > query_handle.close() > #Run BLAST and loop over the XML blast results one by one (memory efficient), > result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \ > "blastn", my_blast_db, > query_filename) > for blast_record in NCBIXML.parse(result_handle) : > query_record = query_dict[blast_record.query_id] #check this > assert len(query_record) == blast_record.query_letters > assert len(query_record) == blast_record.query_length #Biopython > 1.50b or later > > Note I haven't actually tested this example, but I think the idea is clear. > > This approach gives you easy access to the full query sequence, and > its full description. If all you care about is the length, then > rather than storing a dictionary of the queries as SeqRecords, just > use a dictionary of their lengths as integers. > > Peter Thanks a lot! Just have to change the query_record = query_dict[blast_record.query_id] to query_record = query_dict[blast_record.query] because query_id return something like lcl|XXX and not the actual fasta header. and yes I am interested in the whole SeqRecords ;-). Does the query_dict size is limited by the memory of the machine ? 
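On the memory question: a dictionary built with SeqIO.to_dict holds every record in memory, so its size is bounded by the machine. If only the lengths are needed, the "dictionary of lengths" suggestion quoted above is much lighter. A minimal sketch using only the standard library, assuming the key is the first word of each FASTA header line (the fasta_lengths name and the example data are made up for illustration):

```python
import io


def fasta_lengths(handle):
    """Map each FASTA identifier (first word of the header) to its length."""
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split(None, 1)
            current_id = words[0] if words else ""
            lengths[current_id] = 0
        elif current_id is not None:
            # Sequence data may be wrapped over several lines
            lengths[current_id] += len(line)
    return lengths


example = io.StringIO(">seq1 first query\nACGT\nAC\n>seq2\nGGG\n")
print(fasta_lengths(example))
```

Only one integer per query is kept, so this scales to far larger query files than a dictionary of full records.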
yvan From biopython at maubp.freeserve.co.uk Wed Apr 15 04:42:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 09:42:12 +0100 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <200904150907.27360.jblanca@btc.upv.es> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> Message-ID: <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> On Wed, Apr 15, 2009 at 8:07 AM, Jose Blanca wrote: > >> I was aware that some information was available about the SFF file >> format, and it should be possible to reverse engineer the format in >> order to read and write it directly from Biopython. > > The sff format is fully documented in the NCBI's SRA web site. > http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff Nice link - thanks. Given the specification is public (and as you say later, well thought out), we shouldn't have to worry so much about Roche making changes to it in future releases. >> Right now with your code under the GPL, we can't incorporate it into >> Biopython, but if you and Bastien are prepared to offer it to >> Biopython under our MIT/BSD license that could be very useful. Even >> without that, any documentation on the file format or example files >> you might be able to share could be valuable. > > I guess that it wouldn't be a problem to offer you the code under your > license. But I don't think that's the best approach. The code as it is right > now is not well suited to be integrated in a library. It would be easier to > rewrite the sff reading part from scratch. I could do that for you in no > time. I was expecting your sff_extract code would serve only as a basis - perhaps just lifting some core routines. If you are happy to extract/rewrite the core bits and give them to Biopython under the Biopython License that would be great. See http://biopython.org/DIST/LICENSE (basically MIT/BSD style).
> The main problem would be to have sff files small enough to be used > for the test. The Roche command line tools allow you to take a large SFF file and produce a filtered version (use sfffile with the -i option and a simple text file of read identifiers). So making a small SFF file for unit tests should be simple. > If you could provide that I could write the code to extract the > information from the sff file for you. It would be easy to build a > generator able to deliver the sequences one by one. That would be very welcome :) > sff_extract also is able to split the paired-ends reads. That's the part > that Bastien wrote. Integrating that would be nice, but I think that in > Biopython that ?should be treated as an independent problem. Quite possibly - I haven't yet had to work with paired end reads, and at this point I'm not sure how best to represent them with the Biopython SeqRecord object. In some senses they are two short sequences (so using two Biopython SeqRecord objects would work, but with some kind of cross referencing). Alternatively you might treat them as a long sequence with known end regions, but an unknown region of unknown length in the middle (something we don't currently have a sequence object to represent). >> P.S. Have you tested your sff_extract software on SFF files from the >> new Roche v2 software, released about the same time as the "titanium" >> 454 upgrade? > > Not me, but I think that Bastien has and he has found no problem at all with > that. Great. > The sff format is well thought and consistent, the 454 people did a > much better job than the ABI people did with the abi format. That makes a pleasant change - the FASTQ format strikes me as less than ideal in several ways (and the fact Solexa made their own incompatible variant just made things worse). 
Peter From jblanca at btc.upv.es Wed Apr 15 05:01:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 15 Apr 2009 11:01:06 +0200 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> Message-ID: <200904151101.07113.jblanca@btc.upv.es> > > The main problem would be to have sff files small enough to be used > > for the test. > > The Roche command line tools allow you to take a large SFF file and > produce a filtered version (use sfffile with the -i option and a simple > text file of read identifiers). So making a small SFF file for unit tests > should be simple. Could you send a couple of example sff files to me? I haven't the Roche tools, that's why I implemented sff_extract in the first place :) -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Apr 15 05:16:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 10:16:37 +0100 Subject: [BioPython] Is query_length really the length of query? 
In-Reply-To: <49E59BB1.7050208@bccs.uib.no> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no> <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> <49E59BB1.7050208@bccs.uib.no> Message-ID: <320fb6e00904150216g71856e8dwfe5515000bbda2de@mail.gmail.com> On Wed, Apr 15, 2009 at 9:32 AM, Yvan Strahm wrote: > > Peter wrote: >> It sounds like you already have a FASTA file containing the query >> sequences, so just use that as the input to standalone BLAST. >> >> i.e. I would do something like this to double check the reported query >> length matches up with the actual query length: >> >> from Bio import SeqIO >> from Bio.Blast import NCBIXML >> from Bio.Blast import NCBIStandalone >> query_filename = "example.fasta" >> #Load all the queries into memory as a dictionary of SeqRecord objects >> query_handle = open("example.fasta") >> query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta")) >> query_handle.close() >> #Run BLAST and loop over the XML blast results one by one (memory >> efficient), >> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \ >>     "blastn", my_blast_db, >> query_filename) >> for blast_record in NCBIXML.parse(result_handle) : >>   query_record = query_dict[blast_record.query_id] #check this >>   assert len(query_record) == blast_record.query_letters >>   assert len(query_record) == blast_record.query_length #Biopython >> 1.50b or later >> >> Note I haven't actually tested this example, but I think the idea is >> clear. >> >> This approach gives you easy access to the full query sequence, and >> its full description. If all you care about is the length, then >> rather than storing a dictionary of the queries as SeqRecords, just >> use a dictionary of their lengths as integers. >> >> Peter > > Thanks a lot!
> Just have to change the query_record = query_dict[blast_record.query_id] > to query_record = query_dict[blast_record.query] > because query_id return something like lcl|XXX and not the actual fasta > header. There may be a blastall command line argument to alter this behaviour, but if that works, great :) > and yes I am interested in the whole SeqRecords ;-). OK, good. That example should be a good starting point then. > Does the query_dict size is limited by the memory of the machine ? In the example I gave, yes. This is using a standard python dictionary, the keys are the record identifiers (strings) and the values are SeqRecord objects. A larger query file means more SeqRecord in memory in this dictionary. Unless you are using thousands of query sequences (in which case your BLAST search will be slow), I don't expect this to be a problem. However each SeqRecord in this example will be wasting some memory e.g. an empty list of features, an empty list of database cross references, and an empty dictionary of annotations. If you find you are running out of memory, then perhaps use a dictionary where the keys are the record identifiers (strings) and the values are just the sequence (as strings). If after that you are still running out of memory, you could index the FASTA file somehow, or use a full database (e.g. BioSQL). 
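As a rough illustration of the lighter-weight option described above (identifiers mapped to plain lengths rather than to full SeqRecord objects), here is a minimal sketch in plain Python 3. It does not use Biopython at all: the helper name fasta_lengths is hypothetical, and it assumes the identifier is simply the first whitespace-delimited word on each ">" header line.

```python
import io


def fasta_lengths(handle):
    """Return {identifier: sequence length} for an open FASTA handle.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first word after ">"
            parts = line[1:].split(None, 1)
            current_id = parts[0] if parts else ""
            lengths[current_id] = 0
        elif current_id is not None:
            # Sequence lines may be wrapped, so accumulate their lengths
            lengths[current_id] += len(line)
    return lengths


# Small in-memory example instead of a real query file
example = io.StringIO(">seq1 first example\nACGTAC\nGT\n>seq2\nTTGA\n")
query_lengths = fasta_lengths(example)
print(query_lengths["seq1"], query_lengths["seq2"])  # 8 4
```

Only one integer per query is kept, so memory use stays small even for large query files. Swapping the integer for the concatenated sequence string would give the identifier-to-sequence dictionary mentioned above.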
Peter From biopython at maubp.freeserve.co.uk Wed Apr 15 05:24:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 10:24:19 +0100 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <200904151101.07113.jblanca@btc.upv.es> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> Message-ID: <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> On Wed, Apr 15, 2009 at 10:01 AM, Jose Blanca wrote: >> > The main problem would be to have sff files small enough to be used >> > for the test. >> >> The Roche command line tools allow you to take a large SFF file and >> produce a filtered version (use sfffile with the -i option and a simple >> text file of read identifiers). So making a small SFF file for unit tests >> should be simple. > > Could you send a couple of example sff files to me? > I'm not sure what SFF data I would be allowed to distribute, especially if you want it for a publicly available unit test example. Once the analysis is published, I'm sure this would be easier, but there would still be some administrative channels at work to go through to do this officially. It might be simpler if you gave me the URL of a public SFF file (or one of your own files) you would like split up, and I make a reduced version of that. Email me off list and we can talk about this. > I haven't the Roche tools, that's why I implemented sff_extract in the > first place :) Try having a word with the sequencing center where you get your Roche 454 sequencing done - they may be able to organize access to the software for you with Roche's approval. That's what we did.
Peter From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 06:04:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 12:04:02 +0200 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> Message-ID: <49E5B112.7030402@ribosome.natur.cuni.cz> Hi, Peter wrote: > On Wed, Apr 15, 2009 at 10:01 AM, Jose Blanca wrote: >>>> The main problem would be to have sff files small enough to be used >>>> for the test. >>> The Roche command line tools allow you to take a large SFF file and >>> produce a filtered version (use sfffile with the -i option and a simple >>> text file of read identifiers). So making a small SFF file for unit tests >>> should be simple. >> Could you send a couple of example sff files to me? >> > > I'm not sure what SFF data I would be allowed to distribute, especially if > you want it for a publicly available unit test example. Once the analysis Just some random links to NCBI Trace Archive: ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has 454 data from 2007) http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639 ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz Hope this helps, Martin From sbassi at gmail.com Wed Apr 15 09:35:55 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 15 Apr 2009 10:35:55 -0300 Subject: [BioPython] Help for a presentation. 
Message-ID: I am working in a laptop session for a local workshop where I plan to show off some biopython features. The file would be available (CC-BY 3.0) after presentation in Crunchy compatible HTML (if you don't know Crunchy, take a look, it is impressive!). In the following drill, the task is to read a DNA sequence from a genbank file, translate it to an aminoacid sequence and save it as a Fasta file. The code here does this, but I think it looks a little complex when I want to show that biopython way to do it is easy. So I wonder if someone knows how to modify this code to get the same result with less steps. from Bio import SeqIO from Bio.Seq import translate from Bio.SeqRecord import SeqRecord handle = open('ampRdna.gb') seq_record = SeqIO.read(handle, "genbank") print "DNA Sequence:",seq_record.seq # make translation (numbers here is where the CDS starts) # I don't want to grab the translated sequence from genbank file # , I want to show how to translate it. protseq = translate(seq_record.seq[89:694]) # show translation print "Protein Sequence:",protseq # Make a SeqRecord seqid = seq_record.id seqdesc = seq_record.description protrec = SeqRecord(protseq,id=seqid,description=seqdesc) # save it to a fasta file. outfile_h = open('ampRprot.fasta','w') SeqIO.write([protrec],outfile_h,'fasta') outfile_h.close() From biopython at maubp.freeserve.co.uk Wed Apr 15 09:59:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 14:59:04 +0100 Subject: [BioPython] Help for a presentation. In-Reply-To: References: Message-ID: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com> On Wed, Apr 15, 2009 at 2:35 PM, Sebastian Bassi wrote: > I am working in a laptop session for a local workshop where I plan to > show off some biopython features. The file would be available (CC-BY > 3.0) after presentation in Crunchy compatible HTML (if you don't know > Crunchy, take a look, it is impressive!). 
Sharing the presentation sounds good, we can link to it from here when its done if you like: http://biopython.org/wiki/Documentation#Presentations > In the following drill, the task is to read a DNA sequence from a > genbank file, translate it to an aminoacid sequence and save it as a > Fasta file. Funnily enough, I just recently added a couple of cookbook examples doing something a bit similar to this to the Tutorial in CVS. Also, have you looked at this page with some related stuff? http://www.warwick.ac.uk/go/peter_cock/python/genbank2fasta/ > The code here does this, but I think it looks a little complex when I > want to show that biopython way to do it is easy. So I wonder if > someone knows how to modify this code to get the same result with less > steps. > > from Bio import SeqIO > from Bio.Seq import translate > from Bio.SeqRecord import SeqRecord > > handle = open('ampRdna.gb') > seq_record = SeqIO.read(handle, "genbank") You never closed the input handle in the first place, so it should be just as safe to just do this: seq_record = SeqIO.read(open('ampRdna.gb'), "genbank") I do this often myself - its output handles you must be careful about closing. > print "DNA Sequence:",seq_record.seq > # make translation (numbers here is where the CDS starts) > # I don't want to grab the translated sequence from genbank file > # , I want to show how to translate it. > protseq = translate(seq_record.seq[89:694]) You can do that as a method call instead, which saves you an import line: protseq = seq_record.seq[89:694].translate() > # show translation > print "Protein Sequence:",protseq > # Make a SeqRecord > seqid = seq_record.id > seqdesc = seq_record.description > protrec = SeqRecord(protseq,id=seqid,description=seqdesc) I would merge those three lines as just: protrec = SeqRecord(protseq, id=seq_record.id, description=seq_record.description) But is it meaningful to use the whole nucleotide's ID and description for the protein? > # save it to a fasta file. 
> outfile_h = open('ampRprot.fasta','w') > SeqIO.write([protrec],outfile_h,'fasta') > outfile_h.close() I'm not sure if you'll find it clearer or not, but you could change the last three lines to this: outfile_h = open('ampRprot.fasta','w') outfile_h.write(protrec.format('fasta')) outfile_h.close() Peter From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 10:08:34 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 16:08:34 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> Message-ID: <49E5EA62.2090902@ribosome.natur.cuni.cz> Peter wrote: > Dear Biopythoneers, > > There is a saying "no news is good news", but as per the title - can > we have some feedback from the Biopython 1.50 beta release please? The tests gave me: test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet ok test_NCBIStandalone ... 
ERROR ====================================================================== ERROR: test_NCBIStandalone ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 247, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_NCBIStandalone.py", line 9, in <module> from Bio.Blast import NCBIStandalone File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673 <<<<<<< NCBIStandalone.py ^ SyntaxError: invalid syntax ---------------------------------------------------------------------- Ran 111 tests in 373.969 seconds FAILED (failures = 1) $ From biopython at maubp.freeserve.co.uk Wed Apr 15 10:23:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 15:23:30 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <49E5EA62.2090902@ribosome.natur.cuni.cz> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> Message-ID: <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> On Wed, Apr 15, 2009 at 3:08 PM, Martin MOKREJŠ wrote: > Peter wrote: >> Dear Biopythoneers, >> >> There is a saying "no news is good news", but as per the title - can >> we have some feedback from the Biopython 1.50 beta release please? > > The tests gave me: > > test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated >   from sets import ImmutableSet That is a harmless bug in MySQLdb (it wasn't quite ready for Python 2.6), which I think has been fixed on their trunk, although I'm not sure if it is in their latest release or not. > test_NCBIStandalone ...
ERROR > > ====================================================================== > ERROR: test_NCBIStandalone > ---------------------------------------------------------------------- > Traceback (most recent call last): >   File "run_tests.py", line 247, in runTest >     suite = unittest.TestLoader().loadTestsFromName(name) >   File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName >     module = __import__('.'.join(parts_copy)) >   File "test_NCBIStandalone.py", line 9, in <module> >     from Bio.Blast import NCBIStandalone >   File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673 >     <<<<<<< NCBIStandalone.py >      ^ > SyntaxError: invalid syntax > ---------------------------------------------------------------------- That "<<<<<<<" text looks like a CVS merge failed, inserting a diff marker into the file. How did you install the Biopython 1.50 beta? I've just downloaded and checked the Bio/Blast/NCBIStandalone.py file looks OK in both the archives: http://biopython.org/DIST/biopython-1.50b.tar.gz http://biopython.org/DIST/biopython-1.50b.zip Peter From sbassi at gmail.com Wed Apr 15 10:48:38 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 15 Apr 2009 11:48:38 -0300 Subject: [BioPython] Help for a presentation. In-Reply-To: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com> References: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com> Message-ID: On Wed, Apr 15, 2009 at 10:59 AM, Peter wrote: > Sharing the presentation sounds good, we can link to it from here when > its done if you like: > http://biopython.org/wiki/Documentation#Presentations OK, I will add the link there. Thanks for your suggestions, I applied most of them (sometimes shorter code is not the easiest to read for a novice, they require more verbose examples). I will post the link here and in the wiki. Best, SB.
From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 11:18:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 17:18:02 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> Message-ID: <49E5FAAA.9080505@ribosome.natur.cuni.cz> Peter wrote: > On Wed, Apr 15, 2009 at 3:08 PM, Martin MOKREJŠ > wrote: >> Peter wrote: >>> Dear Biopythoneers, >>> >>> There is a saying "no news is good news", but as per the title - can >>> we have some feedback from the Biopython 1.50 beta release please? >> The tests gave me: > >> test_NCBIStandalone ... ERROR >> >> ====================================================================== >> ERROR: test_NCBIStandalone >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "run_tests.py", line 247, in runTest >> suite = unittest.TestLoader().loadTestsFromName(name) >> File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName >> module = __import__('.'.join(parts_copy)) >> File "test_NCBIStandalone.py", line 9, in <module> >> from Bio.Blast import NCBIStandalone >> File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673 >> <<<<<<< NCBIStandalone.py >> ^ >> SyntaxError: invalid syntax >> ---------------------------------------------------------------------- > > That "<<<<<<<" text looks like a CVS merge failed, inserting a diff > marker into the file. How did you install the Biopython 1.50 beta?
> I've just downloaded and checked the Bio/Blast/NCBIStandalone.py file > looks OK in both the archives: > http://biopython.org/DIST/biopython-1.50b.tar.gz > http://biopython.org/DIST/biopython-1.50b.zip Yes, sorry, I forgot to delete my old changes to it. Dropped the file and read-in current version from cvs now. ;) Anyway, I think "python setup.py clean" should zap .pyc files: find Bio -name \*.pyc | xargs rm -f find BioSQL -name \*.pyc | xargs rm -f rm -f Tests/Quality/temp.fastq rm -f Tests/Quality/temp.qual The built-in tests ran fine for me. M. From biopython at maubp.freeserve.co.uk Wed Apr 15 11:38:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 16:38:16 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <49E5FAAA.9080505@ribosome.natur.cuni.cz> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz> Message-ID: <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> On Wed, Apr 15, 2009 at 4:18 PM, Martin MOKREJŠ wrote: > > rm -f Tests/Quality/temp.fastq > rm -f Tests/Quality/temp.qual > Those two files were produced by the Bio.SeqIO.QualityIO doctest. I'm not aware of any nice way to do doctest clean up which doesn't show up in the docstrings themselves, so I've improvised in CVS revision 1.10 and included the deletions explicitly. I could instead have used a fancy temp file, or a StringIO handle - but this would detract from the documentation side of things more I feel. Maybe we should have a general clean up of any temp files at the end of run_tests.py ... it might be worth thinking about. > > The built-in tests ran fine for me. > M. Great.
Peter From bartek at rezolwenta.eu.org Wed Apr 15 19:36:02 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 01:36:02 +0200 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> Message-ID: <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> Hi all, Sorry, but I've missed this thread completely (despite being called by name a few times). It's too late for me to address the multiple points raised here, so I'll try to summarize what I understood: Peter wants to have the Seq. startswith function and do stuff like: >if record.seq.startswith(primer) : > record = record[crop:] Leighton would like to have an even more powerful method which would do things like: >>> Seq("TAG").startswith("TA[CG]") Which is quite cool, but Peter raises objections to the semantics of startswith called with arbitrary strings. I think that the issue would be resolved if the startswith method would not accept strings, but Seqs or Motifs. Assuming that we would have a nice way of generating appropriate motifs, it would lead to simple code: m=Motif.from_IUPAC("TAN") or alternatively m=Motif.from_re("TA[C|G]") s.startswith(m) Currently there are no methods from_IUPAC or from_re, but it should be fairly straightforward to implement them (if there is interest). writing the startswith method using a motif instance is very straightforward. There is one caveat: implementing complex regexps with Bio.Motif might be not as efficient as using regexps directly, but again I could work on improving the Motif class. hope this helps cheers Bartek On Mon, Apr 13, 2009 at 7:04 PM, Peter wrote: > On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard wrote: >> However, there's no harm in discussing other options, even if none of >> us like them... 
>> >> If the sequence has an alphabet that specifies it as either Protein or >> Nucleotide, then in those cases we can infer clearly what the ambiguity >> symbol means, and there is no problem. > > Strictly speaking, only if the sequence has an (ambiguous) IUPAC > alphabet can we know what the (ambiguity) symbols mean with certainty. > If the sequence has only a generic DNA/RNA/Nucleotide/Protein > alphabet then we can only make a pretty good guess. > >> Alternatively, Seq.startswith() could behave like String.startswith() all >> the time, unless passed with an optional argument (e.g. "ambiguity=True"). > > That idea could work. The default behaviour would be "act like a > string", but an optional argument to > startswith/endswith/find/rfind/count/... could enable ambiguity > matching (provided the sequence has a suitable alphabet). This would > be backwards compatible, and allow us to forge ahead with adding > simple string-like startswith/endswith methods now (which are useful > as is, and so far everyone seems supportive of), and implement > ambiguity support later. > >> Or maybe another optional argument could be passed to force the search to >> treat a sequence without an alphabet as either "type='protein'" or >> "type='RNA'", thereby suppressing the warning/error described above. >> ... >> Another alternative could be to have an optional argument defining the >> ambiguity symbols, and what they represent (e.g. >> "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). > > If we go down the optional argument route (e.g. ambiguity=True), then > a way of specifying the sequence type or ambiguity characters might be > possible, although I'd prefer to encourage more rigorous use of > alphabets in Seq objects in the first place (see also enhancement Bug > 2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this > topic).
> > If we consider the situation where someone creates their own custom > alphabet, and wants to define their own ambiguity characters, I think > any ambiguous search functionality would have to interrogate the > alphabet object at run time. Possible, but a bit tricky. > >>> Several "Zen of Python" points spring to mind, including "If the >>> implementation is hard to explain, it's a bad idea.", but in summary I am >>> against supporting ambiguous characters in the string-like methods of >>> the Seq object (so: find, rfind, split, startswith, endswith, etc). >>> We should handle this another way. >> >> If the natural home for this functionality is Bio.Motif, then the natural >> home for it is Bio.Motif, and I don't have a problem with that. I'm happy >> to go with the consensus. > > Well, let's hear what Bartek has to say (Bio.Motif author). > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From dalloliogm at gmail.com Thu Apr 16 05:00:42 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 16 Apr 2009 11:00:42 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz> <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> Message-ID: <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> On Wed, Apr 15, 2009 at 5:38 PM, Peter wrote: > > Maybe we should have a general clean up of any temp files at the end > of run_tests.py ... it might be worth thinking about.
We need global fixtures for doctests :-) A tearDownAll which deletes all the temporary files created by the doctests would be sufficient. The way to implement it depends on how you want to run the tests (run_tests.py or nose). > >> >> The built-in tests ran fine for me. >> M. > > Great. > > Peter > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Apr 16 05:10:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 10:10:53 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> Message-ID: <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com> > Peter wants to have the Seq. startswith function and do stuff like: >>if record.seq.startswith(primer) : >>   record = record[crop:] > > Leighton would like to have an even more powerful method which would > do things like: >>>> Seq("TAG").startswith("TA[CG]") > > Which is quite cool, but Peter raises objections to the semantics of > startswith called with arbitrary strings. > > I think that the issue would be resolved if the startswith method > would not accept strings, but Seqs or Motifs. Note that the existing search related Seq methods like find, rfind, split, rsplit already take a string or another Seq object - so I was intending (with the patch on Bug 2809) that startswith and endswith did the same. However, while they take Seq objects like Seq("TAN",generic_dna), these methods would all still do a blind search for "TAN" literally, just like a python string would.
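To make that distinction concrete, here is a small sketch in plain Python 3 (deliberately not Biopython code) contrasting a literal prefix test with an ambiguity-aware one. The IUPAC_DNA table and both helper names are made up for illustration, and expanding ambiguity codes into a regular expression is only one possible approach; it is not how Seq or Bio.Motif work.

```python
import re

# IUPAC nucleotide ambiguity codes and the bases each one stands for
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand an IUPAC DNA pattern into a regular expression string."""
    return "".join("[%s]" % IUPAC_DNA[letter] for letter in pattern.upper())


def startswith_ambiguous(sequence, pattern):
    """True if the sequence starts with something matching the pattern."""
    return re.match(iupac_to_regex(pattern), sequence.upper()) is not None


# Literal comparison: "N" only ever matches the letter "N"
print("TAG".startswith("TAN"))             # False
# Ambiguity-aware comparison: "N" stands for any of A, C, G, T
print(startswith_ambiguous("TAG", "TAN"))  # True
```

The sketch only interprets ambiguity codes in the pattern, not in the sequence being searched, and it assumes DNA throughout, which is exactly the kind of alphabet question raised in this thread.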
Having these Seq object methods all cope with a Motif object is an interesting idea - I hadn't thought of that. We can have string or Seq arguments act as dumb python strings (no ambiguity magic), but giving a Motif object allows the ambiguity matches to be handled explicitly. I would like to clarify that I was thinking more the other way round: the Motif object has a search method where you give it a Seq (or string?) to be searched. Much like Python's regular expression objects take the target string as an argument. One advantage of doing it this way round is the Seq object is kept quite simple (which I think is a good thing), and all the ambiguity complexity lives in Bio.Motif instead. > Assuming that we would have a nice way of generating appropriate > motifs, it would lead to simple code: > > m=Motif.from_IUPAC("TAN") > > or alternatively > > m=Motif.from_re("TA[C|G]") > > s.startswith(m) > > Currently there are no methods from_IUPAC or from_re, but it should be > fairly straightforward to implement them (if there is interest). I think there is interest - although you might want to have from_IUPAC_protein, from_IUPAC_DNA, from_IUPAC_RNA. Just using m=Motif.from_IUPAC("TAN") it isn't clear if that is protein or DNA. If Motif.from_IUPAC only took a Seq object with a relevant alphabet that would solve this ambiguity, but would not be so easy to use. > writing the startswith method using a motif instance is very straightforward. If you say so :) > There is one caveat: implementing complex regexps with Bio.Motif might > be not as efficient as using regexps directly, but again I could work > on improving the Motif class. > > hope this helps Let's have a look at this (after Biopython 1.50 is out). Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 05:14:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 10:14:56 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? 
In-Reply-To: <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz> <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> Message-ID: <320fb6e00904160214i4bc6721he1b54b911a5003d1@mail.gmail.com> On Thu, Apr 16, 2009 at 10:00 AM, Giovanni Marco Dall'Olio wrote: > On Wed, Apr 15, 2009 at 5:38 PM, Peter wrote: >> >> Maybe we should have a general clean up of any temp files at the end >> of run_tests.py ... it might be worth thinking about. > > We need global fixtures for doctests :-) > I'm not quite sure what you mean (other than your general enthusiasm for global fixtures in unittests). The doctest framework doesn't have anything like this, it doesn't even have any way to issue instructions other than by embedding text in docstrings, does it? > A tearDownAll which deletes all the temporary files created by the > doctests would be sufficent. The way to implement it depends on how > you want to run the tests (run_tests.py or nose). That is what I just suggested - as long as we are consistent about naming temp files, run_tests.py can just delete say */temp_*.* under the Tests directory. However this won't clean up after running a doctest directly (bypassing run_tests.py). 
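A rough sketch of that clean-up step, assuming the temp_ naming convention is followed consistently. The function name remove_temp_files is hypothetical and is not something run_tests.py currently provides.

```python
import glob
import os


def remove_temp_files(tests_dir, pattern="temp_*.*"):
    """Delete files matching <tests_dir>/*/<pattern> and return their paths.

    Hypothetical helper: it only looks one directory level below tests_dir,
    mirroring the */temp_*.* pattern suggested above.
    """
    removed = []
    for path in glob.glob(os.path.join(tests_dir, "*", pattern)):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Something like remove_temp_files("Tests") could then be called once at the very end of the test run. As noted, it would not help when a doctest is run directly.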
Peter From bartek at rezolwenta.eu.org Thu Apr 16 05:56:59 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 11:56:59 +0200 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com> References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com> Message-ID: <8b34ec180904160256u4600c572jb92068cc81dece1f@mail.gmail.com> hi, On Thu, Apr 16, 2009 at 11:10 AM, Peter wrote: > Having these Seq object methods all cope with a Motif object is an > interesting idea - I hadn't thought of that. ?We can have string or > Seq arguments act as dumb python strings (no ambiguity magic), but > giving a Motif object allows the ambiguity matches to be handled > explicitly. That's exactly what I meant. > > I would like to clarify that I was thinking more the other way round: > the Motif object has a search method where you give it a Seq (or > string?) to be searched. ?Much like Python's regular expression > objects take the target string as an argument. ?One advantage of doing > it this way round is the Seq object is kept quite simple (which I > think is a good thing), and all the ambiguity complexity lives in > Bio.Motif instead. > Yes, so the idea would be for the startswith, endswith etc methods to only check whether the argument is a motif and if so, call the proper methods of the argument. I need to look into it more closely (especially for the other methods like find) but there are search methods as well as methods for finding instances for the whole sequence as well as for a given position. >> Currently there are no methods from_IUPAC or from_re, but it should be >> fairly straightforward to implement them (if there is interest). 
> > I think there is interest - although you might want to have > from_IUPAC_protein, from_IUPAC_DNA, from_IUPAC_RNA. ?Just using > m=Motif.from_IUPAC("TAN") it isn't clear if that is protein or DNA. > If Motif.from_IUPAC only took a Seq object with a relevant alphabet > that would solve this ambiguity, but would not be so easy to use. Good. I'll try to implement this. > >> writing the startswith method using a motif instance is very straightforward. > > If you say so :) > Once you have the motif instance, it's really easy. The problem is with making Motif creation easy enough. > Let's have a look at this (after Biopython 1.50 is out). I agree cheers Bartek From gatoygata at hotmail.com Fri Apr 17 07:25:48 2009 From: gatoygata at hotmail.com (Joaquin Abian Monux) Date: Fri, 17 Apr 2009 11:25:48 +0000 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= Message-ID: Dear all, I coded a GUI utility to perform local blast searches on lists of peptides using NCBIStandalone.blastall(). I work on windows XP with python 2.5 After I updated from biopython1.48 to 1.49, when I make a search, NCBIStandalone.blastall() produces a black screen (a windows system console produced by the execution of ..\bin\blastall.exe) that pops up and rapidly disappears as blastall.exe is executed. Nothing more has been changed in the application and the screen does not appears if I downgrade to 1.48. This black window is very annoying (it appears in front of all other open windows) and in fact it is preventing me from upgrading my biopython installation. This problem occurs both with biopyton 1.49 an 1.50. With biopython 1.48 and below NCBIStandalone.blastall works silently. I have seen looking at the code that 1.49 uses preferently subprocess.popen() (in function _invoke_blast) to execute blast while in 1.48 it was os.popen3() (in function blastall). 
I have been playing with this but found nothing conclusive. Is this something already known? I could not find any hint by googling. Is there some way to get rid of this screen? Thanks Joaquin From biopython at maubp.freeserve.co.uk Fri Apr 17 08:06:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 13:06:42 +0100 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <49E5B112.7030402@ribosome.natur.cuni.cz> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> <49E5B112.7030402@ribosome.natur.cuni.cz> Message-ID: <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com> On Wed, Apr 15, 2009 at 11:04 AM, Martin MOKREJŠ wrote: > Hi, > ... > Just some random links to NCBI Trace Archive: > > ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead > http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study > http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has 454 data from 2007) > http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639 > ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz > > Hope this helps, > Martin Thanks for those links Martin, I've used a FASTQ file from that list for a couple of examples I've just added to the tutorial.
Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 08:14:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 13:14:22 +0100 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= In-Reply-To: References: Message-ID: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> On Fri, Apr 17, 2009 at 12:25 PM, Joaquin Abian Monux wrote: > > Dear all, > > ?I coded a GUI utility to perform local blast searches > on lists of peptides using NCBIStandalone.blastall(). > > I work on windows XP with python 2.5 > > After I updated from biopython1.48 to 1.49, ?when I > make a search, NCBIStandalone.blastall() ?produces > a black screen (a windows system console produced > by the execution of ..\bin\blastall.exe) that pops up and > rapidly disappears as blastall.exe is executed. ?Nothing > more has been changed in the application and the > screen does not appears if I downgrade to 1.48. > > This black window is very annoying (it appears in > front of all other open windows) and in fact it is > preventing me from upgrading my biopython installation. > > This problem occurs both with biopyton 1.49 an 1.50. > With biopython 1.48 and below NCBIStandalone.blastall > works silently. > > I have seen looking at the code that 1.49 uses preferently > subprocess.popen() (in function _invoke_blast) to execute > blast while in 1.48 it was os.popen3() (in function blastall). > I have been playing with this but I got nothing clear > > Is this something already known? I could not found any hint > by googling. Is there some way to get rid of this screen?. 
Stefanie Lück had some issues with BLAST and subprocess on her Windows GUI program, which we traced to a bug in Python itself, http://bugs.python.org/issue1124861 See: http://lists.open-bio.org/pipermail/biopython/2009-January/004896.html http://lists.open-bio.org/pipermail/biopython/2009-February/004898.html We were able to resolve this for Stefanie, and the fix was included in Biopython 1.50 beta. Have you tried this yet? If that doesn't work could you show us a short GUI example that fails? Details of how you are running your program could also help. This may also make a difference - e.g. is it started from the command line with "python my_gui.py", or run from IDLE? Thanks Peter From cjfields at illinois.edu Fri Apr 17 08:18:53 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 17 Apr 2009 07:18:53 -0500 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> <49E5B112.7030402@ribosome.natur.cuni.cz> <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com> Message-ID: <70EB1D0C-BFB4-4603-A4B3-19B9769BDD68@illinois.edu> Just to add, Pjotr Prins' BioLib initiative (http://biolib.open-bio.org/wiki/Main_Page ) is building SWIG-based interfaces to several C/C++-based libraries, including Staden io_lib, which supports the following formats (c&p from the latest io_lib README): SCF trace files; ABI trace files; ALF trace files; CTF trace files; ZTR trace files; SFF trace archives; SRF trace archives; Experiment files; Plain text files. We're working on the Perl/Ruby bindings; it shouldn't be hard at all to get Python (and by extension, Biopython) working.
chris On Apr 17, 2009, at 7:06 AM, Peter wrote: > On Wed, Apr 15, 2009 at 11:04 AM, Martin MOKREJ? > wrote: >> Hi, >> ... >> Just some random links to NCBI Trace Archive: >> >> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead >> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study >> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has >> 454 data from 2007) >> http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639 >> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz >> >> Hope this helps, >> Martin > > Thanks for those links Martin, > > I've use a FASTQ file from that list for a couple of examples I've > just added to the tutorial. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From sbassi at gmail.com Fri Apr 17 10:16:48 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 17 Apr 2009 11:16:48 -0300 Subject: [BioPython] Help for a presentation. In-Reply-To: References: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com> Message-ID: I did the presentation yesterday. Most assistants were bioinformatics students and they liked what they saw. But they were previously exposed to Bioperl and they made comparative questions. I set up a little FAQ with my answers about this. Here is the laptop session: http://www.bioinformatica.info/biopython/ Anyone is invited to improve it. From gatoygata at hotmail.com Fri Apr 17 10:16:56 2009 From: gatoygata at hotmail.com (Joaquin Abian Monux) Date: Fri, 17 Apr 2009 14:16:56 +0000 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= In-Reply-To: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> Message-ID: Dear Peter, I already had seen that issue. 
I though it was not related with my problem because my program didn't hang (it had been also compiled with py2exe). The program works perfect. It runs the searches and shows the output, but each time I make a search I get the /Blast/bin/blastall.exe console popping. I have tried right now with 1.50b (I upgraded to 1.50b). Now, If I execute the main script from the system console (>python blast_main_205.pyw) the black screen does not appear when I send a Blast query. Neither If I execute from within the Stani's Python Editor IDE: it does not appear. It seems that this has been solved: before, with biopython 1.49 I was getting a console flash. But if I execute by double clicking on the main script, then it appears (!?) I compiled with py2exe (after fixing some problems in spark.py, see below) and the single file executable worked perfectly but I still get the nasty console when I make a blast query. So, still looking for help Joaquin Note: I got and error log when trying to execute the single file executable produced by py2exe: Traceback (most recent call last): File "blast_main_205.pyw", line 23, in File "zipextimporter.pyo", line 82, in load_module ......etc File "Bio\Parsers\spark.pyo", line 129, in collectRules File "Bio\Parsers\spark.pyo", line 101, in addRule AttributeError: 'NoneType' object has no attribute 'split' I fixed it by modifying spark.py in biopython 1.50b: original: def addRule(self, doc, func): rules = doc.split() fixed: def addRule(self, doc, func): rules = doc.split() if doc else [] > Date: Fri, 17 Apr 2009 13:14:22 +0100 > Subject: Re: [BioPython] blastall produces a black console screen with biopython 1.49 and up? > From: biopython at maubp.freeserve.co.uk > To: gatoygata at hotmail.com > CC: biopython at lists.open-bio.org > > On Fri, Apr 17, 2009 at 12:25 PM, Joaquin Abian Monux > wrote: > > > > Dear all, > > > > I coded a GUI utility to perform local blast searches > > on lists of peptides using NCBIStandalone.blastall(). 
> > > > I work on windows XP with python 2.5 > > > > After I updated from biopython1.48 to 1.49, when I > > make a search, NCBIStandalone.blastall() produces > > a black screen (a windows system console produced > > by the execution of ..\bin\blastall.exe) that pops up and > > rapidly disappears as blastall.exe is executed. Nothing > > more has been changed in the application and the > > screen does not appears if I downgrade to 1.48. > > > > This black window is very annoying (it appears in > > front of all other open windows) and in fact it is > > preventing me from upgrading my biopython installation. > > > > This problem occurs both with biopyton 1.49 an 1.50. > > With biopython 1.48 and below NCBIStandalone.blastall > > works silently. > > > > I have seen looking at the code that 1.49 uses preferently > > subprocess.popen() (in function _invoke_blast) to execute > > blast while in 1.48 it was os.popen3() (in function blastall). > > I have been playing with this but I got nothing clear > > > > Is this something already known? I could not found any hint > > by googling. Is there some way to get rid of this screen?. > > Stefanie Lück had some issues with BLAST and subprocess on > her Windows GUI program, which we traced to a bug in Python itself, > http://bugs.python.org/issue1124861 > > See: > http://lists.open-bio.org/pipermail/biopython/2009-January/004896.html > http://lists.open-bio.org/pipermail/biopython/2009-February/004898.html > > We were able to resolve this for Stefanie, and the fix was included in > Biopython 1.50 beta. Have you tried this yet? > > If that doesn't work could you show as a short GUI example that fails? > Details of how you are running your program could also help. This may > also make a difference - e.g. is it started from the command line with > "python my_gui.py", or run from IDLE? > > Thanks > > Peter
From biopython at maubp.freeserve.co.uk Fri Apr 17 10:36:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 15:36:22 +0100 Subject: [BioPython] blastall produces a black console screen with biopython 1.49 and up In-Reply-To: References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> Message-ID: <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com> On Fri, Apr 17, 2009 at 3:16 PM, Joaquin Abian Monux wrote: > Dear Peter, > > I already had seen that issue. I though it was not related with my problem > because my program didn't hang (it had been also compiled with py2exe). The > program works perfect. It runs the searches and shows the output, but each > time I make a search I get the /Blast/bin/blastall.exe console popping. I > have tried right now with 1.50b (I upgraded to 1.50b). > > Now, If I execute the main script from the system console (>python > blast_main_205.pyw) the black screen does not appear when I send a Blast > query. Neither If I execute from within the Stani's Python Editor IDE: it > does not appear. It seems that this has been solved: before, with biopython > 1.49 I was getting a console flash. > > But if I execute by double clicking on the main script, then it appears (!?) > > I compiled with py2exe (after fixing some problems in spark.py, see below) > and the single file executable worked perfectly but I still get the nasty > console when I make a blast query. > > So, still looking for help > > Joaquin Right now I suggest you install Biopython 1.50 beta and then edit your copy of Bio/Blast/NCBIStandalone.py to use os.popen3 instead of subprocess. Then run py2exe, test it, and let us know if that works.
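For readers who want to experiment with subprocess directly, the following is a minimal sketch of one other commonly used way to ask Windows not to show a console window for a child process, via the startupinfo argument. It is not taken from Bio/Blast/NCBIStandalone.py and was not tested against this particular py2exe setup; the helper names hidden_window_kwargs and run_quietly are hypothetical, and the code is written for current Python rather than Python 2.5.

```python
import subprocess
import sys


def hidden_window_kwargs():
    """Extra Popen keyword arguments asking Windows to hide the child's window.

    Returns an empty dict on other platforms, where STARTUPINFO does not exist.
    """
    if sys.platform != "win32":
        return {}
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {"startupinfo": startupinfo}


def run_quietly(args):
    """Run a command (given as an argument list) with all three pipes captured.

    Returns a (returncode, stdout_bytes, stderr_bytes) tuple.
    """
    process = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        **hidden_window_kwargs()
    )
    out, err = process.communicate()
    return process.returncode, out, err
```

With this, something like run_quietly(["blastall", "-p", "blastp", ...]) would be the place to try it from a GUI program; whether it removes the flashing console in a given py2exe build is something that would need to be checked on the affected machine.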
If that works, we could revert to using os.popen3 on Python 2.5 or older, and only use subprocess on Python 2.6+ (where os.popen3 is deprecated), but that still leaves a possible problem on Python 2.6. I think basically you have a platform specific corner use case, and we may not be able to get Bio/Blast/NCBIStandalone.py to cope without a lot of effort (fixes welcome). You could also try using subprocess but modify the shell argument, but I think that will break other situations. We've been discussing our command line application wrappers on the dev mailing list this month, and after Biopython 1.50 I plan to update Bio.Blast.Applications (and make Bio.Blast.NCBIStandalone use this internally). The (slightly out of date) wrappers in Bio.Blast.Applications just take care of building the command line string - you would then be able to invoke it as you see fit (e.g. using os.system, os.popen3, subprocess - or even submit the task to your local computing cluster). The idea is that Bio.Blast.NCBIStandalone would continue to be a general purpose solution suitable for most situations (but perhaps not Windows GUI programs using py2exe), while Bio.Blast.Applications would give you a lower level option with more control. Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 11:38:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 16:38:20 +0100 Subject: [BioPython] =?windows-1256?q?blastall_produces_a_black_console_sc?= =?windows-1256?q?reen_with_biopython_1=2E49_and_up=FE?= In-Reply-To: References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com> Message-ID: <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com> On Fri, Apr 17, 2009 at 4:24 PM, Joaquin Abian Monux wrote: > Dear Peter, > > Thanks, I fixed it. I'm not sure how. > > In: > > blast_process = subprocess.Popen(cmd_string, > ???????????????????????????????? 
stdout=subprocess.PIPE, >     stderr=subprocess.PIPE, >     stdin=subprocess.PIPE, >     shell= (sys.platform!="win32")) > > (sys.platform!="win32") is False for my winXP computer. But if I set the > parameter shell=True, then the problem disappears, either in the py2exe > executables and in the scripts. Everything works perfect as usual. > > if shell=True in windows the shell used is the one set in COMSPEC. In my > case it is the normal windows shell (cmd.exe). If shell=False I am not sure > what happens in windows. Obviously no cmd shell should be expected...still, > blast.exe opens his. > ... > Done! (Although it would be nice someone could explain why the 'shell' > parameter must be 'True' for the program to behave properly)... From memory, using the shell argument with subprocess in this way was deliberate, to get things to work cross platform. It looks like when running from the command line you need the *opposite* shell setting to running from py2exe. Could you put together a trivial python GUI application that calls BLAST that we could use for testing? Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 13:43:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 18:43:44 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> Message-ID: <320fb6e00904171043x3bc3ecb2s5be4ba52bec15076@mail.gmail.com> On Mon, Apr 13, 2009 at 7:15 PM, Peter wrote: > Dear Biopythoneers, > > There is a saying "no news is good news", but as per the title - can > we have some feedback from the Biopython 1.50 beta release please? Thanks everyone for your feedback so far.
The plan is to do the final release this weekend, so if anyone has some last minute comments now is the time - even little things like reporting typos in the Tutorial are worthwhile. Thanks Peter From peter at maubp.freeserve.co.uk Fri Apr 17 13:48:56 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 18:48:56 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> Message-ID: <320fb6e00904171048n68544510n285380f7efaef447@mail.gmail.com> On Mon, Apr 13, 2009 at 2:47 PM, Peter wrote: > Hi all, > > I've filed enhancement bug 2809 with a patch to add startswith and > endswith methods to the Seq object, > http://bugzilla.open-bio.org/show_bug.cgi?id=2809 > > I'm confident there are many possible use cases for this. > ... > Does this seem like a sensible addition to the Seq object? It is > consistent with making the Seq object more like a python string.
For anyone not following the Bug or the dev mailing list, this has been checked in and will be included with Biopython 1.50, and there is an example using it in the new Tutorial. Peter From lueck at ipk-gatersleben.de Mon Apr 20 07:07:20 2009 From: lueck at ipk-gatersleben.de (=?utf-8?Q?Stefanie_L=C3=BCck?=) Date: Mon, 20 Apr 2009 13:07:20 +0200 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com><320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com> <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com> Message-ID: <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de> Hi! How does your py2exe setup.py file looks? from distutils.core import setup import py2exe #python setup.py py2exe setup( windows = [ { "script": "nb_psst.py", ### Main Python script "icon_resources": [(0, "Icon.ico")] ### Icon to embed into the PE file. } ], ) If you use console instead of windows in setup() will the black screen dissapear? Kind regards Stefanie ----- Original Message ----- From: "Peter" To: "Joaquin Abian Monux" ; "BioPython Mailing List" Sent: Friday, April 17, 2009 5:38 PM Subject: Re: [BioPython]blastall produces a black console screen with biopython 1.49 and up? On Fri, Apr 17, 2009 at 4:24 PM, Joaquin Abian Monux wrote: > Dear Peter, > > Thanks, I fixed it. I'm not sure how. > > In: > > blast_process = subprocess.Popen(cmd_string, > stdout=subprocess.PIPE, > stderr=subprocess.PIPE, > stdin=subprocess.PIPE, > shell= (sys.platform!="win32")) > > (sys.platform!="win32") is False for my winXP computer. But if I set the > parameter shell=True, then the problem disappears, either in the py2exe > executables and in the scripts. Everything works perfect as usual. > > if shell=True in windows the shell used is the one set in COMSPEC. In my > case it is the normal windows shell (cmd.exe). 
If shell=False I am not > sure > what happens in windows. Obviously no cmd shell should be > expected...still, > blast.exe opens his. > ... > Done! (Although it would be nice someone could explain why the 'shell' > parameter must be 'True' for the program to behave properly)... >From memory, with subprocess using the shell argument in this way was deliberate on to get things to work cross platform. I looks like when running from the command line you need the *opposite* shell setting to running from py2exe. Could you put together a trivial python GUI application that calls BLAST that we could use for testing? Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Apr 20 11:22:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 16:22:02 +0100 Subject: [Biopython] [BioPython] Invitation for Biopython news coordinators In-Reply-To: <20090414015301.GA80360@kunkel> References: <20090406230542.GK43636@sobchak.mgh.harvard.edu> <1239493790.27790.184.camel@localhost.localdomain> <20090414015301.GA80360@kunkel> Message-ID: <320fb6e00904200822l375c1a3crdab46b8043fdad49@mail.gmail.com> On Tue, Apr 14, 2009 at 2:53 AM, Brad Chapman wrote: > Having several people involved will work out well. When you are > finished up with finals, check back in with the list and David and > we can get you set up with whatever you need. Good luck with exams > and thanks again for the message, Our news server uses WordPress, and the default roles are defined here: http://codex.wordpress.org/Roles_and_Capabilities I propose to make any "News Coordinator" volunteers "Contributors", meaning they can write and manage their own posts but not publish posts. 
Initially that will need an OK from above ;) Once we're happy with your work, we can upgrade you to an "Author" meaning you can publish and manage your own posts independently. Further down the line there is "Editor" (which will let you edit other peoples posts) or even "Admin" status... So David and John, you should be able to register yourselves here http://news.open-bio.org/news/wp-register.php and then drop me an email and I'll upgrade you from the default "Subscriber" to "Contributor". If that doesn't work, just email me directly with your contact details and a suggested username (I'd go with "johnm" or "davidw", but any sensible suggestion is fine - Brad just picked "brad"). Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 15:02:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 20:02:18 +0100 Subject: [Biopython] Biopython 1.50 released Message-ID: <320fb6e00904201202j4bb9666es18c89136ce973a48@mail.gmail.com> Dear all, We are pleased to announce Biopython release 1.50, featuring some significant additions since Biopython 1.49 was released late last year. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. Also have a look at Bio.SwissProt and Bio.ExPASy and their revised parsers. As noted in a previous news posting, Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. In connection with this, our SeqRecord object has a new dictionary attribute, letter_annotations, for per-letter-annotation information like sequence quality scores or secondary structure predictions. Also, the SeqRecord object can now be sliced to give a new SeqRecord covering just part of the sequence. Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. 
However, this is expected to be the final version to support Python 2.3 (see this previous announcement). Also, Biopython 1.50 should be the last release to include our old deprecated parsing infrastructure (Martel and Bio.Mindy). We?ve also updated the Biopython Tutorial and Cookbook (also available in PDF), and not just by adding our logo to the cover ;) http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thank you to everyone who tested the Biopython 1.50 beta release, and to all our contributors. Source distributions and Windows installers are available from the downloads page on the Biopython website: http://biopython.org/wiki/Download -Peter, on behalf of the Biopython developers P.S. This news post is online at http://news.open-bio.org/news/2009/04/biopython-release-150/ You may wish to subscribe to our news feed. For RSS links etc, see: http://biopython.org/wiki/News From chapmanb at 50mail.com Mon Apr 20 17:51:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Apr 2009 17:51:07 -0400 Subject: [Biopython] Google Summer of Code at Biopython Message-ID: <20090420215107.GA30529@sobchak.mgh.harvard.edu> Biopython folks; I am very happy to announce that Biopython has had two students accepted for Google's Summer of Code (http://code.google.com/soc/): - Nick Matzke will be working on modules for Biogeographical Phylogenetics - Eric Talevich will be adding support for parsing and writing PhyloXML (http://www.phyloxml.org/) You have likely seen Nick and Eric around the mailing lists; both prepared excellent applications and project plans, navigating a stringent selection process. We should expect to see much more of them during the summer as they will be working full time on the projects with generous support from Google. I'd like to thank everyone who submitted Biopython proposals. We received many great queries and proposals, and it is a shame more could not have been included. 
Many thanks are also due to Hilmar and all the folks at NESCent for inviting Biopython to participate. We are looking forward to a great summer, and beyond, for the projects. Below are links to the NESCent main page, abstracts and full proposals; I believe you need to sign in with a Google account to see the full proposals: http://socghop.appspot.com/org/home/google/gsoc2009/nescent http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798969 http://socghop.appspot.com/student_proposal/show/google/gsoc2009/etal/t123854016039 http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798250 http://socghop.appspot.com/student_proposal/show/google/gsoc2009/nickmatzke/t123854590776 Congratulations again to Eric and Nick, Brad From biopython at maubp.freeserve.co.uk Mon Apr 20 18:15:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 23:15:51 +0100 Subject: [Biopython] Google Summer of Code at Biopython In-Reply-To: <20090420215107.GA30529@sobchak.mgh.harvard.edu> References: <20090420215107.GA30529@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904201515j1d2e855ue29665037a9edaa2@mail.gmail.com> On Mon, Apr 20, 2009 at 10:51 PM, Brad Chapman wrote: > Biopython folks; > I am very happy to announce that Biopython has had two students accepted > for Google's Summer of Code (http://code.google.com/soc/): Cool :) Congratulations Eric, Nick and Brad for stepping up to mentor on the Biopython side. I think that warrants a news post... John or David are you up for this? Peter From lpritc at scri.ac.uk Tue Apr 21 03:40:37 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 21 Apr 2009 08:40:37 +0100 Subject: [Biopython] Google Summer of Code at Biopython In-Reply-To: <20090420215107.GA30529@sobchak.mgh.harvard.edu> Message-ID: Congratulations to Nick and Eric! L. 
On 20/04/2009 22:51, "Brad Chapman" wrote:

> Biopython folks;
> I am very happy to announce that Biopython has had two students accepted
> for Google's Summer of Code (http://code.google.com/soc/):
>
> - Nick Matzke will be working on modules for Biogeographical Phylogenetics
>
> - Eric Talevich will be adding support for parsing and writing PhyloXML
> (http://www.phyloxml.org/)

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie,
Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From gatoygata at hotmail.com Tue Apr 21 13:15:21 2009
From: gatoygata at hotmail.com (Joaquin Abian Monux)
Date: Tue, 21 Apr 2009 17:15:21 +0000
Subject: [Biopython] [BioPython]blastall produces a black console screen with biopython 1.49 and up
In-Reply-To: <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com><320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com> <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com> <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
Message-ID:

Hi Stefanie,

It is a normal, minimal setup.py for windows and single file executables:

# exWx/setup.py
from distutils.core import setup
import py2exe

setup(
    windows=[
        {'script': "blast_main_205.pyw",
         'icon_resources': [(0, 'blast.ico')]
        }
    ],
    options={
        'py2exe': {
            #'packages' : [],
            #'includes': [],
            'excludes': [
                'Tkconstants', 'Tkinter', 'tcl'
            ],
            'ignores': ['wxmsw26uh_vc.dll'],
            'dll_excludes': ['libgdk_pixbuf-2.0-0.dll',
                             'libgdk-win32-2.0-0.dll',
                             'libgobject-2.0-0.dll'
            ],
            'compressed': 1,
            'optimize': 2,
            'bundle_files': 1
        }
    },
    zipfile = None,
    data_files = [ ]
)

I think I understand the logic of your question. I will try with 'console' and let you know what happens. Here at work I have 1.48 installed and 1.50 at home.

Currently, with biopython 1.50, I can produce py2exe single file executables of GUI programs based on wxpython that work perfectly after:

a) I set console=True in the subprocess call in _invoke_blast (NCBIStandalone.py). Otherwise I have the Blast.exe console popping when I make a search.

b) I modify a line in the addRule function in spark.py:
rules = doc.split() ---to--> rules = doc.split() if doc else []
Otherwise the executable is produced but gives an exception when ran saying that can not split 'None'.
Best regards Joaquin > From: lueck at ipk-gatersleben.de > To: biopython at maubp.freeserve.co.uk; gatoygata at hotmail.com; biopython at lists.open-bio.org > Subject: Re: [BioPython]blastall produces a black console screen with biopython 1.49 and up? > Date: Mon, 20 Apr 2009 13:07:20 +0200 > > Hi! > > How does your py2exe setup.py file looks? > > from distutils.core import setup > import py2exe > > #python setup.py py2exe > > setup( > windows = [ > { > "script": "nb_psst.py", ### Main Python > script > "icon_resources": [(0, "Icon.ico")] ### Icon to embed into > the PE file. > } > ], > ) > > If you use console instead of windows in setup() will the black screen > dissapear? > > Kind regards > Stefanie > > > ----- Original Message ----- > From: "Peter" > To: "Joaquin Abian Monux" ; "BioPython Mailing List" > > Sent: Friday, April 17, 2009 5:38 PM > Subject: Re: [BioPython]blastall produces a black console screen with > biopython 1.49 and up? > > > On Fri, Apr 17, 2009 at 4:24 PM, Joaquin Abian Monux > wrote: > > Dear Peter, > > > > Thanks, I fixed it. I'm not sure how. > > > > In: > > > > blast_process = subprocess.Popen(cmd_string, > > stdout=subprocess.PIPE, > > stderr=subprocess.PIPE, > > stdin=subprocess.PIPE, > > shell= (sys.platform!="win32")) > > > > (sys.platform!="win32") is False for my winXP computer. But if I set the > > parameter shell=True, then the problem disappears, either in the py2exe > > executables and in the scripts. Everything works perfect as usual. > > > > if shell=True in windows the shell used is the one set in COMSPEC. In my > > case it is the normal windows shell (cmd.exe). If shell=False I am not > > sure > > what happens in windows. Obviously no cmd shell should be > > expected...still, > > blast.exe opens his. > > ... > > Done! (Although it would be nice someone could explain why the 'shell' > > parameter must be 'True' for the program to behave properly)... 
> >
> From memory, with subprocess using the shell argument in this way was
> deliberate on to get things to work cross platform. I looks like when
> running from the command line you need the *opposite* shell setting to
> running from py2exe.
>
> Could you put together a trivial python GUI application that calls
> BLAST that we could use for testing?
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From lueck at ipk-gatersleben.de Wed Apr 22 04:15:14 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 22 Apr 2009 10:15:14 +0200
Subject: [Biopython] [BioPython]blastall produces a black console screen with biopython 1.49 and up
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com><320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com> <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com> <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
Message-ID: <03b601c9c322$76b94ed0$1022a8c0@ipkgatersleben.de>

Hi!

Sorry I mixed up things. It's should be windows (as in your setup.py) and not console to get no black screen! Sorry again, I think I need holidays ;-)

Stefanie
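For readers following the shell= discussion in this thread, here is a minimal sketch of another way to keep a child process from opening a console window on Windows: passing a STARTUPINFO object to subprocess.Popen instead of toggling shell. This is not what NCBIStandalone in Biopython 1.50 does; the helper names hidden_console_kwargs and run_quietly are made up for illustration, and the code is written for Python 3 rather than the Python 2.5 used in the thread.

```python
import subprocess
import sys


def hidden_console_kwargs():
    """Return extra Popen keyword arguments that hide the child's console.

    Hypothetical helper. On platforms other than Windows there is no
    console window to hide, so an empty dict is returned.
    """
    if sys.platform != "win32":
        return {}
    startupinfo = subprocess.STARTUPINFO()
    # Tell Windows to honour wShowWindow, then ask for a hidden window.
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {"startupinfo": startupinfo}


def run_quietly(args):
    """Run a command given as an argument list (no shell involved).

    Returns (returncode, stdout_bytes, stderr_bytes). All three standard
    streams are redirected to pipes, and communicate() drains them so the
    child cannot block on a full pipe.
    """
    process = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        **hidden_console_kwargs()
    )
    out, err = process.communicate()
    return process.returncode, out, err
```

The intended use would be something like run_quietly([my_blast_exe, ...]) with whatever argument list your BLAST installation expects. Redirecting all three streams is a deliberate choice here: as far as I know, a windowed (py2exe "windows") program has no console handles for the child to inherit, so leaving any stream unredirected can fail.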
From chapmanb at 50mail.com Wed Apr 22 09:09:53 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 22 Apr 2009 09:09:53 -0400
Subject: [Biopython] [BioPython] JP (Jarvis Patrick) Clustering
In-Reply-To: <49EDD55E.5010801@tu-bs.de>
References: <49DA25E6.2060606@tu-bs.de> <20090406215725.GG43636@sobchak.mgh.harvard.edu> <49EDD55E.5010801@tu-bs.de>
Message-ID: <20090422130953.GB34546@sobchak.mgh.harvard.edu>

Hi Florian;
[Moving to Biopython list]

> >> Does anybody know an (open source) clustering package containing the
> >> Jarvis Patrick clustering algorithm?
> >>
> > Here is a version in R and C:
> > http://rguha.net/code/R/#jp
[...]
> Do you know how to do the 'rpy calls '
>
> >dyn.load('jpc.so')
> >source('jpc.R')
> >clus <- jpc(dat, j=3, k=1, diss=FALSE)
>
> for executing the script?

Sure, here is how I would do it with python and rpy2. This builds a random array of 10 items to cluster, each with 4 points, and then runs the JPC clustering algorithm from Rajarshi's page on them. At the end, it shows how to extract the clustered indexes from the results.

Hope this helps,
Brad

import numpy
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

robjects.r('''
dyn.load('jpc.so')
source('jpc.R')
''')

num_groups = 10
data = numpy.random.random((num_groups, 4))
cluster = robjects.r.jpc(data, j=3, k=1, diss=False)
for gindex in range(num_groups):
    print 'Group ID:', gindex, 'Cluster ID:', cluster[0][1][gindex]
print cluster

From winda002 at student.otago.ac.nz Wed Apr 22 22:14:31 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 23 Apr 2009 14:14:31 +1200
Subject: [Biopython] main page on wiki
Message-ID: <49EFCF07.2050502@student.otago.ac.nz>

Hi all,

As you probably know the main page of the wiki (http://biopython.org/wiki/Main_Page) is the first place someone washes up when they google 'biopython'.
As part of this "news coordinator" idea I have made an alternative version of the main page (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more as a "portal" for the wiki/project. This is born from my own experience with the wiki as a newcomer; it took me a long time to cotton on to the fact there was a navigation box on each page so I didn't realise what the website had to offer (this may say more about me than the design of the front page). Which version would you like to see as the main page? Obviously this isn't an either-or thing, my 'mock-up' version can be edited by anyone with an account on the wiki (the main page is protected for obvious reasons) so any ideas that you have can be incorporated to that one (older versions of the page are all saved so you can edit as bravely as you like). Thanks, David From biopython at maubp.freeserve.co.uk Thu Apr 23 05:16:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:16:44 +0100 Subject: [Biopython] [Biopython-dev] main page on wiki In-Reply-To: <49EFE553.6070405@gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> Message-ID: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> On Thu, Apr 23, 2009 at 4:49 AM, Iddo Friedberg wrote: > I second Sebastian on the icons, and third Sebastian and Alex on preferring > David's take on a main page. Are you all looking at the *current* home page which already has a few of David's suggestions (in particular the news feed on the right), or the old version from memory? Also, what size screens do you all have? It should ideally look OK on small screens or windows (e.g. 1024 by 768 is what my laptop uses, which isn't that old). From playing with my window size, it should be OK - the proposed layout seems quite flexible :) If there are no counter comments, I'll put David's changes up later today or tomorrow. 
Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 10:22:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 15:22:17 +0100 Subject: [Biopython] Ace support in Bio.SeqIO and/or Bio.AlignIO? Message-ID: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com> How do you picture a contig in an ACE assembly file? As a single record with read data annotations (e.g. as a SeqRecord), or as a sequence alignment with a consensus (e.g. as an Alignment object)? I suspect the answer is "it depends", and that both are useful. Currently we use Bio.Sequencing.Ace in Bio.SeqIO to turn each contig into a SeqRecord. Now that we have per-letter-annotation support in the SeqRecord, this code could be updated to record the consensus base quality (BQ lines). We could also record the supporting reads (RD lines), maybe as SeqFeature objects. Recently David put together an example on the wiki using Bio.Sequencing.Ace to build an alignment, which we could use a basis for supporting Ace files in Bio.AlignIO as alignments: http://biopython.org/wiki/ACE_contig_to_alignment What do people think? I should be able to try out David's code on some real world ACE files from Newbler (i.e. 454Contigs.ace files)... Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 11:47:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 16:47:28 +0100 Subject: [Biopython] Fwd: [BioPython] Clustalw Problems In-Reply-To: <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com> References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com> <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com> Message-ID: <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com> Forwarding to the mailing list - I'll reply soon. 
Peter

---------- Forwarded message ----------
From: Bradley Hintze
Date: Thu, Apr 23, 2009 at 4:06 PM
Subject: Re: [BioPython] Clustalw Problems
To: Peter

Peter,

Sorry that it has taken so long to reply..school.
I am still having issues with the alignments. I have BioPython 1.50

I tried to do what you suggested and got the following:

>>> from Bio.Clustalw import MultipleAlignCL
>>> from Bio.Clustalw import do_alignment
>>> cline=MultipleAlignCL(r'C:\Bradley_BioPython\mtr4.fasta',r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe')
>>> cline.set_output(r'C:\Bradley_BioPython\test.aln')
>>> print cline
C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln
>>> al=do_alignment('C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 120, in do_alignment
    % str(command_line))
ValueError: Bad command line option in the command: C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython    est.aln

When i try running 'cline' i get this

>>> al=do_alignment(cline)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
    % command_line.sequence_file)
IOError: Cannot open sequence file C:\Bradley_BioPython\mtr4.fasta

Any ideas?

On Wed, Apr 8, 2009 at 3:50 PM, Peter wrote:
>
> On 4/8/09, Bradley Hintze wrote:
> > Hi,
> >
> > I am having a hard time running an alignment. I am running in windows and
> > here is my code and the error message that I get after running do_alignment.
> >
> > >>> import os
> > >>> from Bio.Clustalw import MultipleAlignCL
> > >>> from Bio.Clustalw import do_alignment
> > >>> cline=MultipleAlignCL(r"C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
> > >>> cline.set_output(r"C:\Documents and Settings\students\Desktop\Foo\test.aln")
> > >>> al=do_alignment(cline)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
> >     % command_line.sequence_file)
> > IOError: Cannot open sequence file C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta
> >
> > when I open the file using o=open('C:\Documents and
> > Settings\student\Desktop\Foo\mtr4.fasta') it works fine.
> >
> > any ideas?
>
> As a general tip, try this to see what the command Biopython is trying
> to run is:
>
> >>> print cline
>
> Then try running the same command by hand at the command prompt (DOS
> prompt), and make sure it works.
>
> I can tell from the error message you have Python 2.5, but what
> version of Biopython do you have?
>
> I'm not at a Windows machine to check, but it is generally a good idea
> to avoid file names and paths with spaces where you can. In this
> case, I'm sure relative names would be fine:
>
> >>> import os
> >>> from Bio.Clustalw import MultipleAlignCL
> >>> from Bio.Clustalw import do_alignment
> >>> cline=MultipleAlignCL("mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
> >>> cline.set_output("test.aln")
>
> Peter

--
Bradley J. Hintze
Biochemistry Undergraduate
Utah State University
801-712-8799

From biopython at maubp.freeserve.co.uk Thu Apr 23 11:59:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 16:59:23 +0100
Subject: [Biopython] [BioPython] Clustalw Problems
In-Reply-To: <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com>
References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com> <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com> <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com>
Message-ID: <320fb6e00904230859v4d3c9860kc7e5f574afbcbe9a@mail.gmail.com>

Bradley Hintze wrote:
>
> Peter,
>
> Sorry that it has taken so long to reply..school.
> I am still having issues with the alignments. I have BioPython 1.50

OK, it is good that you are on the latest release already :)
By the way it is "Biopython", not "BioPython" ;)

> I tried to do what you suggested and got the following:
>
> >>> from Bio.Clustalw import MultipleAlignCL
> >>> from Bio.Clustalw import do_alignment
> >>> cline=MultipleAlignCL(r'C:\Bradley_BioPython\mtr4.fasta',r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe')
> >>> cline.set_output(r'C:\Bradley_BioPython\test.aln')
> >>> print cline
> C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln

Did you try running this "by hand" at the windows command prompt? From the Windows start menu, pick "run", then enter "cmd.exe". Then paste in this command (from memory I think you need to right click on the icon in the top left of this window to get the paste menu option). I am expecting it to say "Bad command line option in the command ..."

> >>> al=do_alignment('C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 120, in do_alignment
>     % str(command_line))
> ValueError: Bad command line option in the command: C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython    est.aln

You have used '\t' in this string which means it was treated as a tab. Instead use \\t, or raw mode as you did earlier for the filenames:

al=do_alignment(r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')

The text "Bad command line option in the command" tells me ClustalW returned error code 1, but this makes sense due to the tab.

> When i try running 'cline' i get this
>
> >>> al=do_alignment(cline)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
>     % command_line.sequence_file)
> IOError: Cannot open sequence file C:\Bradley_BioPython\mtr4.fasta

This means ClustalW returned error code 2 (which should mean it can't find your input file). Are you sure the path is correct? Try:

import os
print os.path.isfile(r'C:\Bradley_BioPython\mtr4.fasta')

Peter

From winda002 at student.otago.ac.nz Thu Apr 23 21:41:37 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 24 Apr 2009 13:41:37 +1200
Subject: [Biopython] Ace support in Bio.SeqIO and/or Bio.AlignIO?
In-Reply-To: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com>
References: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com>
Message-ID: <49F118D1.7070701@student.otago.ac.nz>

Peter wrote:
> How do you picture a contig in an ACE assembly file? As a single
> record with read data annotations (e.g. as a SeqRecord), or as a
> sequence alignment with a consensus (e.g. as an Alignment object)?
> I suspect the answer is "it depends", and that both are useful
>

Yup, I usually want to trust the assembler and treat them as sequences (and having the option of annotations would be great for these cases) but sometimes I want to pull them apart and look inside.

> Recently David put together an example on the wiki using
> Bio.Sequencing.Ace to build an alignment, which we could use a basis
> for supporting Ace files in Bio.AlignIO as alignments:
> http://biopython.org/wiki/ACE_contig_to_alignment
>
> What do people think? I should be able to try out David's code on
> some real world ACE files from Newbler (i.e. 454Contigs.ace files)..

The wiki example is based on a script that I use with newbler and mira (http://www.chevreux.org/projects_mira.html) assembled contigs (I thought I'd use one everyone with biopython has as the example) so it should be OK. I'm sure there are much prettier ways of doing what it does (eg, using the new SeqRecord annotations to hold the clipping masks ?). If people want it to be part of biopython I'm happy to provide what help I can with it.

From marco at gallotta.co.za Fri Apr 24 16:57:12 2009
From: marco at gallotta.co.za (Marco Gallotta)
Date: Fri, 24 Apr 2009 22:57:12 +0200
Subject: [Biopython] Clustalw Hangs on Python 2.6
In-Reply-To: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com>
References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com>
Message-ID: <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>

Hi

I recently upgraded to Python 2.6 (from 2.5) and this seems to have revealed a potential bug in biopython. I'm using biopython to run clustalw and after the upgrade it just hangs. I discovered that it was hanging on a write to stdout. The best reason I could determine for this behaviour was that Python's subprocess module (which biopython uses to spawn clustalw) was piping stdout to biopython, which wasn't reading it.

The code I used to call clustalw:

cline = MultipleAlignCL(blast_results_file)
cline.set_output(alignment_file, output_type = format.upper(), output_order = "INPUT")
Clustalw.do_alignment(cline)

I was able to get it working by changing the arguments to the subprocess module to pipe to /dev/null as in the attached patch. Unfortunately this approach only works on Linux.

If there is a better fix, or perhaps I'm calling clustalw incorrectly, please do let me know.

Thanks
Marco

--
Marco Gallotta
MSc Student | SACO Scientific Committee | ACM ICPC Coach
Department of Computer Science, University of Cape Town
people.cs.uct.ac.za/~mgallott | marco-za.blogspot.com
marco AT gallotta DOT co DOT za | 073 170 4444 | 021 552 2731

-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_clustalw.patch
Type: text/x-diff
Size: 507 bytes
Desc: not available
URL:

From biopython at maubp.freeserve.co.uk Fri Apr 24 17:47:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 22:47:24 +0100
Subject: [Biopython] Clustalw Hangs on Python 2.6
In-Reply-To: <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>
References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>
Message-ID: <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com>

On 4/24/09, Marco Gallotta wrote:
> Hi
>
> I recently upgraded to Python 2.6 (from 2.5) and this seems to have
> revealed a potential bug in biopython. I'm using biopython to run
> clustalw and after the upgrade it just hangs. I discovered that it was
> hanging on a write to stdout. The best reason I could determine for
> this behaviour was that Python's subprocess module (which biopython
> uses to spawn clustalw) was piping stdout to biopython, which wasn't
> reading it.
Hi, You didn't say what version of Biopython you are using - Bug 2804 was fixed in Biopython 1.50 which sounds possibly related: http://bugzilla.open-bio.org/show_bug.cgi?id=2804 Peter From marco at gallotta.co.za Fri Apr 24 18:08:51 2009 From: marco at gallotta.co.za (Marco Gallotta) Date: Sat, 25 Apr 2009 00:08:51 +0200 Subject: [Biopython] Clustalw Hangs on Python 2.6 In-Reply-To: <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com> <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> Message-ID: <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> On Fri, Apr 24, 2009 at 11:47 PM, Peter wrote: > You didn't say what version of Biopython you are using - Bug 2804 was > fixed in Biopython 1.50 which sounds possibly related: > http://bugzilla.open-bio.org/show_bug.cgi?id=2804 I was using 1.49. Upgrading to 1.50 solved the problem. Thanks! 
Marco -- Marco Gallotta MSc Student | SACO Scientific Committee | ACM ICPC Coach Department of Computer Science, University of Cape Town people.cs.uct.ac.za/~mgallott | marco-za.blogspot.com marco AT gallotta DOT co DOT za | 073 170 4444 | 021 552 2731 From biopython at maubp.freeserve.co.uk Sat Apr 25 08:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Apr 2009 13:10:33 +0100 Subject: [Biopython] Clustalw Hangs on Python 2.6 In-Reply-To: <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com> <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> Message-ID: <320fb6e00904250510y747118c2le71a34fb667737e6@mail.gmail.com> On 4/24/09, Marco Gallotta wrote: > > On Fri, Apr 24, 2009 at 11:47 PM, Peter wrote: > > You didn't say what version of Biopython you are using - Bug 2804 > > was fixed in Biopython 1.50 which sounds possibly related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2804 > > I was using 1.49. Upgrading to 1.50 solved the problem. Thanks! > > Marco Great :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 05:58:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 10:58:52 +0100 Subject: [Biopython] [Biopython-dev] main page on wiki In-Reply-To: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> Message-ID: <320fb6e00904270258s523c49a1j1bfc5d4a12ca86a9@mail.gmail.com> On Thu, Apr 23, 2009 at 10:16 AM, Peter wrote: > > If there are no counter comments, I'll put David's changes up later > today or tomorrow. 
>

OK - make that a couple of days later ;)

This isn't exactly as in David's draft - I shortened some of the link text and omitted a couple of links under "Contribute" which seemed unnecessary on the home page. I've also kept the final line giving the latest release and date (although the text is shorter now). Brad commented (off list?) that having this is a good indicator of the project's activity, and I agree. Alternatively, I'd like to try having dates on the news feed, but the media wiki plugin needs to be updated for that to work...

Peter

From lueck at ipk-gatersleben.de Mon Apr 27 06:34:28 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 27 Apr 2009 12:34:28 +0200
Subject: [Biopython] Parsing large blast files
Message-ID: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>

Hi!

I want to blast many sequences against one DB and parse the outputs. At the moment, I do it in that way:

def blast_a_record(fasta_rec):
    open("to_blast.fasta", "w").write(str(fasta_rec))
    f = open('out.txt', 'a')
    my_blast_db = "\"G:\RNAiscan\\barleyv9\""
    my_blast_file = "G:\\RNAiscan\\to_blast.fasta"
    my_blast_exe = "G:\\RNAiscan\\blastall.exe"
    result_handle, error_info = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
    blast_results = result_handle.read()
    save_file = open("my_blast.xml", "w")
    save_file.write(blast_results)
    save_file.close()
    result = open("my_blast.xml")
    blast_records = NCBIXML.parse(result)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # extract xml data from blast
                percent = float(100) * float(hsp.score) / float(len(fasta_rec))
                percent = round(percent, 0)
                if percent > 99.99:
                    primer_name = str(alignment.hit_def)
                    primer_length = str(alignment.length)
                    f.write(str(percent) + str(alignment.hit_def) + '\n')
    f.close()

def start_blast():
    handle = open("G:\RNAiscan\est.fasta", 'r')
    data = handle.readlines()
    for seq_record in data:
        rec = seq_record
        first_blast_hit = blast_a_record(rec)
    handle.close()

start_blast()

This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator() code from the tutorial but I got the error message "Invalid header". Would NCBIStandalone.Iterator() be faster?

Or, is there a way not to save a xml file or to save only the best hits (100 % match)?

Kind regards
Stefanie

From p.j.a.cock at googlemail.com Mon Apr 27 06:54:09 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 27 Apr 2009 11:54:09 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>

On Mon, Apr 27, 2009 at 11:34 AM, Stefanie Lück wrote:
> Hi!
>
> I want to blast many sequences against one DB and parse the outputs.
> At the moment, I do it in that way:
>
> ...
>
> This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator()
> code from the tutorial but I got the error message "Invalid header".
> Would NCBIStandalone.Iterator() be faster?

NCBIStandalone.Iterator() is the old semi-obsolete plain text parser - it won't parse the XML output, hence the "Invalid header" error. Maybe the tutorial (or the error message) could be clearer.

>
> Or, is there a way not to save a xml file or to save only the best hits
> (100 % match)?
>

You could set the expectation threshold (I don't think there is an identity threshold which would be ideal for your example).

If you only want the single BEST hit for a query, set the number of alignments and/or descriptions to show to just one (these do different things in the plain text output - maybe for XML output you only need to limit the number of alignments). This should give a much smaller file, which will be fast to parse.

Finally, and perhaps most importantly - don't do an individual BLAST query for each record.
Instead, prepare a FASTA file of ALL your queries, and use that as the input to BLAST. This way there is only one command line call, and the BLAST database is only loaded into memory once.

Peter

From lueck at ipk-gatersleben.de Tue Apr 28 04:23:02 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 28 Apr 2009 10:23:02 +0200
Subject: [Biopython] Parsing large blast files
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>
Message-ID: <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>

Thanks Peter!

>You could set the expectation threshold (I don't think there is an
>identity threshold which would be ideal for your example).

I can't say what the expectation threshold will be, so this won't work.

>If you only want the single BEST hit for a query, set the number of
>alignments and/or descriptions to show to just one (these do different
>things in the plain text output - maybe for XML output you only need
>to limit the number of alignments). This should give a much smaller
>file, which will be fast to parse.

This is too risky. There might be several 100 % hits which I need.

>Finally, and perhaps most importantly - don't do an individual BLAST
>query for each record. Instead, prepare a FASTA file of ALL your
>queries, and use that as the input to BLAST. This way there is only
>one command line call, and the BLAST database is only loaded into
>memory once.

Cool, I didn't know that this would work! Great, that's very nice! 50 % time speed up!

Thanks Peter and have a nice day!
Stefanie

From p.j.a.cock at googlemail.com Tue Apr 28 04:33:52 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 09:33:52 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com>

On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what will be the expectation threshold. This won't work.

You might still be able to reduce it from the default of 10.0, maybe even just to 1.0, without losing the very high identity matches you want.

>> If you only want the single BEST hit for a query, set the number of
>> alignments and/or descriptions to show to just one (these do different
>> things in the plain text output - maybe for XML output you only need
>> to limit the number of alignments). This should give a much smaller
>> file, which will be fast to parse.
>
> This is too risky. There might be several 100 % hits which I need.

If you expect and want several hits per query, then my suggestion is inappropriate.

>> Finally, and perhaps most importantly - don't do an individual BLAST
>> query for each record. Instead, prepare a FASTA file of ALL your
>> queries, and use that as the input to BLAST. This way there is only
>> one command line call, and the BLAST database is only loaded into
>> memory once.
>
> Cool, I didn't know that this will work! Great, that's very nice! 50 % time
> speed up!

Only a 50% time speed up? i.e. It took half the time? Not bad, although I expected more.
It will probably depend on the number of queries, their sizes, and the database - probably the speed up would be more for a larger database like NR.

Peter

From lueck at ipk-gatersleben.de Tue Apr 28 06:05:30 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 28 Apr 2009 12:05:30 +0200
Subject: [Biopython] Parsing large blast files
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de> <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com>
Message-ID: <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>

Hi Peter!

I'll play a little bit with the thresholds, and also with the short query parameters (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node74.html) which I actually need (nt = 21 bp). Of course, e = 1000 makes it even slower.

>Only a 50% time speed up? i.e. It took half the time? Not bad,
>although I expected more. It will probably depend on the number of
>queries, their sizes, and the database - probably the speed up would
>be more for a larger database like NR.

I blast ~3000 queries against the TIGR barley v9 DB (50500 subjects). It takes about 35 seconds with XP, E8400 (3 GHz), 4 GB RAM. Hope this is normal...

Kind regards
Stefanie

----- Original Message -----
From: "Peter Cock"
To: "Stefanie Lück"
Cc:
Sent: Tuesday, April 28, 2009 10:33 AM
Subject: Re: [Biopython] Parsing large blast files

On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what will be the expectation threshold. This won't work.

You might still be able to reduce it from the default of 10.0, maybe even just to 1.0, without losing the very high identity matches you want.
>> If you only want the single BEST hit for a query, set the number of
>> alignments and/or descriptions to show to just one (these do different
>> things in the plain text output - maybe for XML output you only need
>> to limit the number of alignments). This should give a much smaller
>> file, which will be fast to parse.
>
> This is too risky. There might be several 100 % hits which I need.

If you expect and want several hits per query, then my suggestion is inappropriate.

>> Finally, and perhaps most importantly - don't do an individual BLAST
>> query for each record. Instead, prepare a FASTA file of ALL your
>> queries, and use that as the input to BLAST. This way there is only
>> one command line call, and the BLAST database is only loaded into
>> memory once.
>
> Cool, I didn't know that this will work! Great, that's very nice! 50 %
> time speed up!

Only a 50% time speed up? i.e. It took half the time? Not bad, although I expected more. It will probably depend on the number of queries, their sizes, and the database - probably the speed up would be more for a larger database like NR.
Peter

From lueck at ipk-gatersleben.de Tue Apr 28 06:13:54 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 28 Apr 2009 12:13:54 +0200
Subject: [Biopython] [BioPython] BLAST subprocess problem with a GUI
References: <320fb6e00901280945p32eff05by64d8a42d576f76cc@mail.gmail.com> <002d01c98509$1dc1cd90$1022a8c0@ipkgatersleben.de> <320fb6e00902020216v231729dcm5d7e3ccdd3459ad4@mail.gmail.com> <007301c98aca$0a2084e0$1022a8c0@ipkgatersleben.de> <320fb6e00902090806k3ed4f286r5f2208801ca207ec@mail.gmail.com> <001e01c9971d$7a8d5be0$1022a8c0@ipkgatersleben.de> <320fb6e00902250059v1afd152ex51bc4439a34441c4@mail.gmail.com>
Message-ID: <001701c9c7ea$0946cf40$1022a8c0@ipkgatersleben.de>

A short note on how I solved the problem: I simply do in Python what the EmbossWin installer usually does: http://www.interactive-biosoftware.com:80/embosswin/install.html Afterwards I copy the important files (eprimer3...) to the directory. This works quite well and there's no need to install EmbossWin (I didn't like that solution, in case it disappears from the server).

Regards
Stefanie

----- Original Message -----
From: "Peter"
To: "Stefanie Lück"
Cc:
Sent: Wednesday, February 25, 2009 10:59 AM
Subject: Re: [BioPython] BLAST subprocess problem with a GUI

On Wed, Feb 25, 2009 at 7:48 AM, Stefanie Lück wrote:
> I tried this but after compilation/installation Primer3 gives no output,
> only a return code of -1073741515 (but no errors or messages)...

The error number -1073741515 can be regarded as the hex representation 0xc0000135, which could be "The application failed to initialize properly" (try searching for both in Google). Without more information this would be hard to resolve as this seems to be a rather generic error code.

Is this problem showing up on your own machine or someone else's? If it works on some computers and not others, this would suggest it could be a problem with the primer3 installation rather than your code.
Does it work if you run the code directly in Python rather than via py2exe?

I would suggest you add a message box or print statement to show the actual command line string to the user just before trying to run the program. Make a note of this, then also try running this command by hand. It could be something missing in your EMBOSS setup (e.g. it can't find certain data files).

Peter

From p.j.a.cock at googlemail.com Tue Apr 28 06:16:53 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 11:16:53 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de> <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com> <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904280316j480a679cy4437555923a8ff8f@mail.gmail.com>

On Tue, Apr 28, 2009 at 11:05 AM, Stefanie Lück wrote:
>> Only a 50% time speed up? i.e. It took half the time? Not bad,
>> although I expected more. It will probably depend on the number of
>> queries, their sizes, and the database - probably the speed up would
>> be more for a larger database like NR.
>
> I blast ~3000 queries against the tigr barley v9 DB (50500 subjects). It
> takes about 35 seconds with XP, E8400 (3GHZ), 4 GB RAM. Hope this is
> normal...

35s sounds good :)

I normally deal with much slower searches (e.g. protein against NR, or with RPS-BLAST against CDD), measured in minutes or when querying whole genomes, maybe hours. On this sort of problem I would expect doing individual searches for each query to be much much slower. You are dealing with a much smaller database, and with shorter queries, so it will in general be faster.
Peter

From mjldehoon at yahoo.com Tue Apr 28 09:00:07 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Apr 2009 06:00:07 -0700 (PDT)
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>
Message-ID: <627305.69090.qm@web62401.mail.re1.yahoo.com>

--- On Mon, 4/27/09, Peter Cock wrote:
> > Would NCBIStandalone.Iterator() be faster?
>
> NCBIStandalone.Iterator() is the old semi-obsolete plain
> text parser - it won't parse the XML output, hence the
> "Invalid header" error. Maybe the tutorial
> (or the error message) could be clearer.

I think part of the problem is the organization of the code in Bio.Blast, which seems to have grown historically. Bio.Blast.NCBIStandalone contains blastall, blastpgp, and rpsblast, which makes sense, but also BlastParser and PsiBlastParser, which are not necessarily connected to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the parser for Blast HTML output, though qblast does not necessarily generate output in HTML format.

The usage of this module may be more understandable if all functions were accessible from Bio.Blast directly in a fashion more consistent with current Biopython. Bio.Blast would then have the following functions:

read(handle, format='xml')
parse(handle, format='xml')
blastall
blastpgp
rpsblast
qblast

with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.

Any objections, comments?

--Michiel.
From p.j.a.cock at googlemail.com Tue Apr 28 09:36:37 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 14:36:37 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <627305.69090.qm@web62401.mail.re1.yahoo.com>
References: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <627305.69090.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>

On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon wrote:
>> NCBIStandalone.Iterator() is the old semi-obsolete plain
>> text parser - it won't parse the XML output, hence the
>> "Invalid header" error. Maybe the tutorial
>> (or the error message) could be clearer.
>
> I think part of the problem is the organization of the code in Bio.Blast,
> which seems to have grown historically. Bio.Blast.NCBIStandalone
> contains blastall, blastpgp, and rpsblast, which makes sense, but also
> BlastParser and PsiBlastParser, which are not necessarily connected
> to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for
> blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the
> parser for Blast HTML output, though qblast does not necessarily
> generate output in HTML format.

I presumed that initially the standalone tools only produced plain text, and the website (qblast) only produced HTML - hence the use of Bio.Blast.NCBIStandalone for both command line wrappers AND the plain text parser, and Bio.Blast.NCBIWWW for both the qblast function AND the HTML parser.

> The usage of this module may be more understandable if all functions
> were accessible from Bio.Blast directly in a fashion more consistent
> with current Biopython. Bio.Blast would then have the following functions:
>
> read(handle, format='xml')
> parse(handle, format='xml')
> blastall
> blastpgp
> rpsblast
> qblast
>
> with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.
>
> Any objections, comments?
I do like the idea of moving/importing the qblast function directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on.

For read/parse functions, we should probably call the format "blastxml" to match BioPerl. Would you continue to support the plain text output here? Also something to keep in mind is there may be non-NCBI variants of BLAST with their own formats as well.

Rather than continuing to encourage the use of blastall, blastpgp and rpsblast I would rather bring Bio.Blast.Applications up to date, and then declare them obsolete. These three "helper" functions are very limiting in how the command line is invoked - you can't choose the exact call used (e.g. subprocess options) or what you want back (e.g. you may not care about the handles). For example, getting BLAST to write its output to a file is confusingly difficult right now using these functions. Also, dealing with errors isn't nice.

Peter

From mjldehoon at yahoo.com Tue Apr 28 21:28:26 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Apr 2009 18:28:26 -0700 (PDT)
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
Message-ID: <290052.25369.qm@web62407.mail.re1.yahoo.com>

--- On Tue, 4/28/09, Peter Cock wrote:
> I do like the idea of moving/importing the qblast function
> directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML
> later on.

Well Bio.Blast.NCBIXML would still be there (containing the code for the XML parser), but users would access it through Bio.Blast.parse/read.

> For read/parse functions, we should probably call the
> format "blastxml" to match BioPerl.

We could have both "xml" and "blastxml" for Blast XML output, "text" and "blasttext" for Blast text output, and "table" and "blasttable" for Blast table (-m 8 and 9) output.

> Would you continue to support the plain text output here?

Yes. I'm more thinking about code reorganization than removing/adding functionality.
> Rather than continuing to encourage the use of blastall,
> blastpgp and rpsblast I would rather bring Bio.Blast.Applications
> up to date, and then declare them obsolete.

How would users typically use Bio.Blast.Applications?

--Michiel.

From p.j.a.cock at googlemail.com Wed Apr 29 04:33:03 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 09:33:03 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <290052.25369.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>

On Wed, Apr 29, 2009 at 2:28 AM, Michiel de Hoon wrote:
>
> How would users typically use Bio.Blast.Applications?
>

In the next release, I would aim to have Bio.Blast.Applications updated to cover blastall (fully), plus blastpgp and rpsblast (currently not covered) and for the three helper functions Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use Bio.Blast.Applications internally. I would suggest at some point (perhaps a release later) calling the three helper functions obsolete, and eventually deprecating them, but I appreciate these are well documented and well used, so this should be a gradual transition.

In the future I would see people constructing their application command line object and then using it to spawn the task as needed. The Bio.Application.generic_run function might suffice for low output tools, ranging up to using the builtin subprocess module for full control. The command line string can also be used in other ways, e.g. for submission to a computing cluster using qsub, or writing to a shell script etc.

The point about this is decoupling construction of the command line string from actually executing it.
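
A minimal sketch of that decoupling, assuming nothing about how Bio.Blast.Applications is or will be written: `SketchCommandline` and `run` are hypothetical names used only for illustration. The single-letter options shown (-p, -d, -i, -m, -o) are ordinary blastall flags; the database and file names are made up.

```python
import subprocess


class SketchCommandline(object):
    """Hypothetical builder for a blastall command line.

    Illustration only; this is not the Bio.Blast.Applications API.
    """

    def __init__(self, cmd="blastall", **options):
        self.cmd = cmd
        # Maps a single-letter blastall flag (without the dash) to its value.
        self.options = dict(options)

    def as_args(self):
        """Return the command as an argument list, flags in sorted order."""
        args = [self.cmd]
        for flag in sorted(self.options):
            args.extend(["-" + flag, str(self.options[flag])])
        return args

    def __str__(self):
        """Return one string, double-quoting any argument containing a space."""
        return " ".join('"%s"' % a if " " in a else a for a in self.as_args())


def run(cmdline):
    """Execute a previously built command line and return its exit code."""
    return subprocess.call(cmdline.as_args())


# Build the command first; nothing is executed at this point.
cline = SketchCommandline(p="blastn", d="barleyv9", i="queries.fasta",
                          m=7, o="out.xml")
print(cline)  # the string can be logged, pasted into a shell, or sent to qsub
```

Because building and running are separate steps, the same object can be printed for debugging, written into a shell script, or passed to run() with whatever subprocess settings a GUI needs.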
Right now the Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions do both, so there is no way to (a) see what command line was used, which makes debugging difficult, or (b) control how it is invoked (e.g. recent Windows GUI questions).

Another immediate benefit is an example usage that I do quite often: Running BLAST and saving the output to a file. The cleanest way to do this is to use the -o option to get BLAST itself to write to a file. If you do this, then there is no useful output written to the handles - but the Bio.Blast.NCBIStandalone functions make this fiddly (see Bug 2654). Right now the tutorial does something equally indirect - in python read BLAST output from stdout and save it to a file (and probably not in a memory efficient way either!).

See also this thread on where to put new command line wrappers: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

If you were asking about the actual code for how to build the command line object, well I have some thoughts on making the current Bio.Application base class easier to use (properties and keyword arguments at init) which I have started to discuss on the dev list.

Peter

From bala.biophysics at gmail.com Wed Apr 29 05:15:50 2009
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Wed, 29 Apr 2009 11:15:50 +0200
Subject: [Biopython] chanding res id
Message-ID: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>

Friends,
Following is a script that I wrote to change the resid. This works with a single PDB file, but if I use an NMR structure with multiple models I get an error message. I hope there should be a loop or some fancy way to iterate over all the models in the NMR structure. Kindly help me in doing the same.
#!/usr/bin/env python
from Bio.PDB import PDBParser
from Bio.PDB import PDBIO
from sys import argv
outfile=raw_input('enter outfile name: ')
par=PDBParser()
s=par.get_structure('x',argv[1])
for index, residue in enumerate(s.get_residues()):
    residue.id=(" ", index, " ")
out=PDBIO()
out.set_structure(s)
out.save(outfile)

Bala

From biopython at maubp.freeserve.co.uk Wed Apr 29 05:26:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 10:26:26 +0100
Subject: [Biopython] chanding res id
In-Reply-To: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>
References: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>
Message-ID: <320fb6e00904290226k40d0e7bcvf2973b20b7b39cd8@mail.gmail.com>

On Wed, Apr 29, 2009 at 10:15 AM, Bala subramanian wrote:
> Friends,
> Following is a script that i wrote to change the resid. This works with
> single pdb file but if i use a NMR with multiple models i get an error
> message. I hope there shd be a loop or some fancy way to iterate over all
> the models in the NMR structure. Kindly help me in doing the same.
> #!/usr/bin/env python
> from Bio.PDB import PDBParser
> from Bio.PDB import PDBIO
> from sys import argv
> outfile=raw_input('enter outfile name: ')
> par=PDBParser()
> s=par.get_structure('x',argv[1])
> for index, residue in enumerate(s.get_residues()):

I would add a loop here, because you want to reset the index for each model - something like this (untested):

for model in s :
    for index, residue in enumerate(model.get_residues()):

>     residue.id=(" ", index, " ")

Should you be using index+1 here? I don't recall if PDB files allow an index of zero or not.
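
A minimal sketch of the per-model loop suggested above, written as a small function. The name `renumber_per_model` and its `start` parameter are hypothetical; the function only assumes an object that iterates over models, each offering get_residues(), which is how a Bio.PDB structure is used in the script quoted above. Like the original script, it sets the hetero flag and insertion code of every residue to a blank.

```python
def renumber_per_model(structure, start=1):
    """Renumber residues consecutively, restarting at `start` in each model.

    `structure` is anything that iterates over models, where each model
    offers get_residues(); a Bio.PDB structure is assumed to fit this.
    """
    for model in structure:
        for offset, residue in enumerate(model.get_residues()):
            # Residue ids are (hetero flag, sequence number, insertion code).
            residue.id = (" ", start + offset, " ")
```

In the script above this would replace the single enumerate loop with a call such as renumber_per_model(s) before saving. One caveat, stated as an assumption rather than a tested fact: some Biopython versions may reject a new id that is still in use by another residue of the same chain, so shifting numbers upwards in place may need an intermediate step.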
> out=PDBIO()
> out.set_structure(s)
> out.save(outfile)
>
> Bala

Peter

From p.j.a.cock at googlemail.com Wed Apr 29 06:31:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 11:31:26 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00904290331n654964bficfc68ae92d477387@mail.gmail.com>

On Apr 29, Peter wrote:
> On Apr 29, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>>
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally. ...
>
> If you where asking about the actual code for how to build the command
> line object, well I have some thoughts on making the current
> Bio.Application base class easier to use (properties and keyword
> arguments at init) which I have started to discuss on the dev list.

See this dev list thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005916.html
And Bug 2822 (with examples): http://bugzilla.open-bio.org/show_bug.cgi?id=2822

Peter

From hermifi at yahoo.com Wed Apr 1 03:56:22 2009
From: hermifi at yahoo.com (Hermella Woldemdihin)
Date: Tue, 31 Mar 2009 20:56:22 -0700 (PDT)
Subject: [BioPython] HELP!
Message-ID: <513066.92437.qm@web111011.mail.gq1.yahoo.com>

Hi everyone,
I am trying to write a bio-python script that uses SwissProt accession numbers to download a sequence objects and then run remote blast with the sequences.
Then download good hit sequences listed in Blast results and print their sequences. I am using a Windows based system with bio-python 2.5, if someone could help me out I would really appreciate it with some sample code or something. I just started learning python and have tried to follow the documentation and cookbook without much success, my programming experience is virtually non-existent. Thanks.

Hermi

From biopython at maubp.freeserve.co.uk Wed Apr 1 10:59:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 11:59:24 +0100
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no>
Message-ID: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>

On Wed, Apr 1, 2009 at 11:34 AM, wrote:
>
> Hello List
>
> I try to get the length of the query from the blast result itself
>
> like that:
> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
> "blastn", my_blast_db, my_blast_file)
>
> from Bio.Blast import NCBIXML
> blast_records = NCBIXML.parse(result_handle)
> for blast_record in blast_records
>
> but
> blast_record.query_length return None
> and
> blast_record.query_letters return the actual size
>
> Should I test the length of the query before the blast result? O did I
> miss-interpreted the meaning of query_length and query_letters?
>
> Thanks for your time
>
> Is query_length really the length of query?

You can use query_letters (although it wouldn't hurt to double check this if you have the query sequence available).
With the current BLAST XML parser query_length is always None (but I think we should fix this so they are both populated).

It's an unfortunate historical accident dating back to the plain text BLAST parser. The plain text output printed the query length in two places, with different captions, which was reflected in the names given in the BLAST record (the values should be the same, assuming the BLAST output is sane). The XML output doesn't have this redundancy, but our XML parser tries to use the same object to hold the results. See: http://bugzilla.open-bio.org/show_bug.cgi?id=2176#c12

Have a look at the discussion on Bug 2176 for more about this (including the far more complicated situation for the database length which has multiple meanings).

This seems like a timely reminder that we could perhaps tidy up a little of this ready for Biopython 1.50 ...

Peter

From Yvan.Strahm at bccs.uib.no Wed Apr 1 11:15:13 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 01 Apr 2009 13:15:13 +0200
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
Message-ID: <20090401131513.9al87wazacgcw0os@webmail.uib.no>

Quoting Peter :
> On Wed, Apr 1, 2009 at 11:34 AM, wrote:
>>
>> Hello List
>>
>> I try to get the length of the query from the blast result itself
>>
>> like that:
>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
>> "blastn", my_blast_db, my_blast_file)
>>
>> from Bio.Blast import NCBIXML
>> blast_records = NCBIXML.parse(result_handle)
>> for blast_record in blast_records
>>
>> but
>> blast_record.query_length return None
>> and
>> blast_record.query_letters return the actual size
>>
>> Should I test the length of the query before the blast result?
O did I >> miss-interpreted the meaning of query_length and query_letters? >> >> Thanks for your time >> >> Is query_length really the length of query? > > You can use query_letters (although it wouldn't hurt to double check > this if you have the query sequence available). With the current BLAST > XML parser query_length is always None (but I think we should fix so > they are both populated). > > Its an unfortunate historical accident dating back to the plain text > BLAST parser. The plain text output printed the query length in two > places, with different captions, which was reflected in the names > given in the BLAST record (the values should be the same, assuming the > BLAST output is sane). The XML output doesn't have this redundancy, > but our XML parser tries to use the same object to hold the results. > See: http://bugzilla.open-bio.org/show_bug.cgi?id=2176#c12 > > Have a look at the discussion on Bug 2176 for more about this > (including the far more complicated situation for the database length > which has multiple meanings). > > This seems like a timely reminder that we could perhaps tidy up a > little of this ready for Biopython 1.50 ... > > Peter > Thanks for these precisions. have a nice day yvan From dejmail at gmail.com Sun Apr 5 12:59:15 2009 From: dejmail at gmail.com (Liam Thompson) Date: Sun, 5 Apr 2009 14:59:15 +0200 Subject: [BioPython] extraction from genbank/embl files Message-ID: Hi everyone I have a list of accession numbers, which I've used to download the entire genomic sequences of several hundred hepatitis B virus isolates. What I am trying to do is extract 3 gene sequences from each genomic sequence, and place each sequence in one of 3 files depending on the gene for further analysis. 
The question is whether there is a shorter way to extract from Genbank files using the Genbank parser, specific gene sequences, or whether I would need to identify the gene of each genomic isolate individually (as they are called a variety of names, despite being the same gene which makes it trickier), copy the coordinates of the gene sequence, and then proceed further down the file and actually perform the copying of the gene. I not experienced in python (or other languages for that matter), but I am trying. Any suggestions would be greatly appreciated Thanks Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown South Africa 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From sean.maceach at gmail.com Sun Apr 5 13:13:16 2009 From: sean.maceach at gmail.com (Sean MacEachern) Date: Sun, 05 Apr 2009 09:13:16 -0400 Subject: [BioPython] extraction from genbank/embl files In-Reply-To: Message-ID: Hi Liam, Although not a biopython solution, you should be able to use seqret in EMBOSS to do something like you have described. You can call seqret in your python script using popen and write the results to one of your three files. HTH, Sean On 4/5/09 8:59 AM, "Liam Thompson" wrote: > Hi everyone > > I have a list of accession numbers, which I've used to download the entire > genomic sequences of several hundred hepatitis B virus isolates. What I am > trying to do is extract 3 gene sequences from each genomic sequence, and > place each sequence in one of 3 files depending on the gene for further > analysis. 
> > The question is whether there is a shorter way to extract from Genbank files > using the Genbank parser, specific gene sequences, or whether I would need > to identify the gene of each genomic isolate individually (as they are > called a variety of names, despite being the same gene which makes it > trickier), copy the coordinates of the gene sequence, and then proceed > further down the file and actually perform the copying of the gene. > > I not experienced in python (or other languages for that matter), but I am > trying. > > Any suggestions would be greatly appreciated > > Thanks > Liam > > > > > > From biopython at maubp.freeserve.co.uk Sun Apr 5 19:29:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Apr 2009 20:29:40 +0100 Subject: [BioPython] extraction from genbank/embl files In-Reply-To: References: Message-ID: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com> On 4/5/09, Liam Thompson wrote: > Hi everyone > > I have a list of accession numbers, which I've used to download the entire > genomic sequences of several hundred hepatitis B virus isolates. What I am > trying to do is extract 3 gene sequences from each genomic sequence, and > place each sequence in one of 3 files depending on the gene for further > analysis. Are you looking for the CDS sequence of these three genes (i.e. a nucleotide sequence)? > The question is whether there is a shorter way to extract from Genbank files > using the Genbank parser, specific gene sequences, or whether I would need > to identify the gene of each genomic isolate individually (as they are > called a variety of names, despite being the same gene which makes it > trickier), copy the coordinates of the gene sequence, and then proceed > further down the file and actually perform the copying of the gene. I see two main options for you (regardless of what programming language you want to use): (1) Compile a list of all the gene names by hand. 
(2) Compile a few examples by hand, and then use pairwise alignments (e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching gene in each virus. You could do this with the protein or the nucleotide sequence. Using Biopython's Bio.SeqIO EMBL/GenBank parser each gene/CDS in the EMBL/GenBank file will be represented as a SeqFeature object, which includes the location information. If you can identify which features you want from their annotation, then that tells you where to cut the parent sequence. See this page for some related discussion: http://www.warwick.ac.uk/go/peter_cock/python/genbank/ As an alternative approach, rather than starting with the EMBL/GenBank files, can you just download the CDS sequences as a FASTA file? e.g. files called *.ffn from the NCBI ftp site. You might also want to download the genes protein sequence, the NCBI uses *.faa for these (FASTA amino acids). Having FASTA files would make the sequence comparison approach easiest - most of these tools will expect FASTA input files. Peter From dejmail at gmail.com Mon Apr 6 04:21:05 2009 From: dejmail at gmail.com (Liam Thompson) Date: Mon, 6 Apr 2009 06:21:05 +0200 Subject: [BioPython] extraction from genbank/embl files In-Reply-To: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com> References: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com> Message-ID: Hi Peter & Sean I am looking for a nucleotide sequence for these three genes and I have downloaded the entire genomic sequences so that I can compare the same 3 genes from all the same isolates. I downloaded the full GenBank and FASTA version of the same set of accession numbers, for as you said FASTA will be easier to work with once I can identify the location information from the info of the GB file. I'll give SeqFeature a bash, and possibly the seqret feature of EMBOSS as well. 
Thanks
Liam

On Sun, Apr 5, 2009 at 9:29 PM, Peter wrote:

> On 4/5/09, Liam Thompson wrote:
> > Hi everyone
> >
> > I have a list of accession numbers, which I've used to download the entire
> > genomic sequences of several hundred hepatitis B virus isolates. What I am
> > trying to do is extract 3 gene sequences from each genomic sequence, and
> > place each sequence in one of 3 files depending on the gene for further
> > analysis.
>
> Are you looking for the CDS sequence of these three genes (i.e. a
> nucleotide sequence)?
>
> > The question is whether there is a shorter way to extract from Genbank files
> > using the Genbank parser, specific gene sequences, or whether I would need
> > to identify the gene of each genomic isolate individually (as they are
> > called a variety of names, despite being the same gene which makes it
> > trickier), copy the coordinates of the gene sequence, and then proceed
> > further down the file and actually perform the copying of the gene.
>
> I see two main options for you (regardless of what programming
> language you want to use):
>
> (1) Compile a list of all the gene names by hand.
> (2) Compile a few examples by hand, and then use pairwise alignments
> (e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching
> gene in each virus. You could do this with the protein or the
> nucleotide sequence.
>
> Using Biopython's Bio.SeqIO EMBL/GenBank parser each gene/CDS in the
> EMBL/GenBank file will be represented as a SeqFeature object, which
> includes the location information. If you can identify which features
> you want from their annotation, then that tells you where to cut the
> parent sequence. See this page for some related discussion:
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/
>
> As an alternative approach, rather than starting with the EMBL/GenBank
> files, can you just download the CDS sequences as a FASTA file? e.g.
> files called *.ffn from the NCBI ftp site.
You might also want to > download the genes protein sequence, the NCBI uses *.faa for these > (FASTA amino acids). > > Having FASTA files would make the sequence comparison approach easiest > - most of these tools will expect FASTA input files. > > Peter > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From biopython at maubp.freeserve.co.uk Mon Apr 6 10:50:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Apr 2009 11:50:15 +0100 Subject: [BioPython] extraction from genbank/embl files In-Reply-To: References: <320fb6e00904051229v3698c00dr7dee7b58445b4bec@mail.gmail.com> Message-ID: <320fb6e00904060350n4e1caad6l4ea9ae46927e26fb@mail.gmail.com> On 4/6/09, Liam Thompson wrote: > Hi Peter & Sean > > I am looking for a nucleotide sequence for these three genes and I have > downloaded the entire genomic sequences so that I can compare the same 3 > genes from all the same isolates. I downloaded the full GenBank and FASTA > version of the same set of accession numbers, for as you said FASTA will be > easier to work with once I can identify the location information from the > info of the GB file. The NCBI at least provide three flavours of FASTA file for a genome: *.fna - FASTA Nucleic Acids - entire DNA nucleotide sequence as one record *.faa - FASTA Amino Acids - amino acid sequences for each gene *.ffn - FASTA Feature Nucleotides - nucleotide sequences for each gene This is easiest to see on the FTP site. In your case, using the ffn files might be simplest - assuming you can recognise the genes from their sequences (e.g. using pairwise alignments to known references). 
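Option (1) from earlier in this thread (a hand-compiled list of gene names), applied to a per-gene FASTA file, might be sketched roughly as follows. This is an illustration only and makes several assumptions: it is written for Python 3 using just the standard library, the synonym table and the FASTA headers are invented (the thread does not say which three genes are wanted or how they are labelled), and the helper names read_fasta, canonical_gene and bin_by_gene are hypothetical. In practice Bio.SeqIO could replace the small FASTA reader.

```python
from collections import defaultdict
from io import StringIO

# Hypothetical synonym table: every label on the right is filed under the gene on the left.
GENE_SYNONYMS = {
    "S": {"s", "hbsag", "surface antigen"},
    "C": {"c", "hbcag", "core protein"},
    "P": {"p", "pol", "polymerase"},
}

# Made-up FASTA text standing in for a per-gene nucleotide (*.ffn style) file.
EXAMPLE_FASTA = """>isolate1|HBsAg
ATGGAGAACATC
ACATCAGGA
>isolate1|pol
ATGCCCCTATCT
>isolate2|core protein
ATGGACATTGAC
>isolate2|unknown ORF
ATGGCTGCTAGG
"""


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def canonical_gene(header):
    """Map the label after the last '|' to a canonical gene name, or None if unknown."""
    label = header.rsplit("|", 1)[-1].strip().lower()
    for gene, names in GENE_SYNONYMS.items():
        if label in names:
            return gene
    return None


def bin_by_gene(handle):
    """Group (header, sequence) records by canonical gene; unmatched records go under None."""
    bins = defaultdict(list)
    for header, seq in read_fasta(handle):
        bins[canonical_gene(header)].append((header, seq))
    return bins


bins = bin_by_gene(StringIO(EXAMPLE_FASTA))
for gene in ("S", "C", "P"):
    print(gene, [header for header, _ in bins[gene]])
```

Records that land under None are the ones whose names still need adding to the table by hand, or checking by pairwise alignment as in option (2).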
Peter From florian.koelling at tu-bs.de Mon Apr 6 15:55:18 2009 From: florian.koelling at tu-bs.de (Florian Koelling) Date: Mon, 06 Apr 2009 17:55:18 +0200 Subject: [BioPython] JP (Jarvis Patrick) Clustering Message-ID: <49DA25E6.2060606@tu-bs.de> Hi Folks! Does anybody know an (open source) clustering package containing the Jarvis Patrick clustering algorithm? I only know Hcluster and Pycluster but I'm afraid that those candidates don't have a JP implementation -- maybe one of you knows an alternative. thanx and best regards, florian From chapmanb at 50mail.com Mon Apr 6 21:57:25 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 6 Apr 2009 17:57:25 -0400 Subject: [BioPython] JP (Jarvis Patrick) Clustering In-Reply-To: <49DA25E6.2060606@tu-bs.de> References: <49DA25E6.2060606@tu-bs.de> Message-ID: <20090406215725.GG43636@sobchak.mgh.harvard.edu> Hi Florian; > Does anybody know an (open source) clustering package containing the > Jarvis Patrick clustering algorithm? Here is a version in R and C: http://rguha.net/code/R/#jp If you need to access it from python, the RPy bindings are great: http://rpy.sourceforge.net/ Hope this helps, Brad From chapmanb at 50mail.com Mon Apr 6 23:05:42 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 6 Apr 2009 19:05:42 -0400 Subject: [BioPython] Invitation for Biopython news coordinators Message-ID: <20090406230542.GK43636@sobchak.mgh.harvard.edu> Biopythonistas; Communication is a key component of successful open source projects. The challenges of distributed programming by volunteers can be overcome by ensuring that the whole community is aware of interesting discussions, new contributions, and development goals. Traditionally, this communication has happened through our mailing lists, wiki pages, and bug tracking system. While these will continue to to be useful resources, new methods of disseminating information are changing how we interact through the web. 
I'd like to issue an invitation for anyone interested in helping revolutionize how Biopython news is disseminated. We are looking for contributors from the community to brainstorm new ways to make the discussions that happen at biopython.org accessible. You would actively follow development here and on the development lists and distill this information into useful quick bullet points for those interested in Biopython but too busy to follow detailed discussions. We are proposing two ways to do this: - Monthly highlights on our news server: http://news.open-bio.org/news/category/obf-projects/biopython/ The RSS feed from these posts are currently widely distributed around the internet. - More frequent pointers to interesting discussions or other items of interest happening in Biopython through our Twitter account: http://twitter.com/biopython This is an opportunity for those of you who are looking to become more involved, and would like to learn more about Biopython by following all of the coding activity more closely. The position is very flexible and we are happy to have one or more people take it on; we would also encourage you to be as creative as you want in doing so. I see this as an chance to both provide information and to highlight the great work people do at Biopython. If you are interested in taking on this role please respond with your ideas. Thanks for your interest, Brad From bradley.h at aggiemail.usu.edu Wed Apr 8 20:08:50 2009 From: bradley.h at aggiemail.usu.edu (Bradley Hintze) Date: Wed, 8 Apr 2009 14:08:50 -0600 Subject: [BioPython] Clustalw Problems Message-ID: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> Hi, I am having a hard time running an alignment. I am running in windows and here is my code and the error message that I get after running do_alignment. 
>>> import os >>> from Bio.Clustalw import MultipleAlignCL >>> from Bio.Clustalw import do_alignment >>> cline=MultipleAlignCL(r"C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe") >>> cline.set_output(r"C:\Documents and Settings\students\Desktop\Foo\test.aln") >>> al=do_alignment(cline) Traceback (most recent call last): File "", line 1, in File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment % command_line.sequence_file) IOError: Cannot open sequence file C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta when I open the file using o=open('C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta') it woks fine. any ideas? -- Bradley -- Bradley J. Hintze Biochemistry Undergraduate Utah State University 801-712-8799 From biopython at maubp.freeserve.co.uk Wed Apr 8 21:50:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Apr 2009 22:50:24 +0100 Subject: [BioPython] Clustalw Problems In-Reply-To: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> Message-ID: <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com> On 4/8/09, Bradley Hintze wrote: > Hi, > > I am having a hard time running an alignment. I am running in windows and > here is my code and the error message that I get after running do_alignment. 
> > >>> import os > >>> from Bio.Clustalw import MultipleAlignCL > >>> from Bio.Clustalw import do_alignment > >>> cline=MultipleAlignCL(r"C:\Documents and > Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program > Files\clustalw1.83.XP\clustalw.exe") > >>> cline.set_output(r"C:\Documents and > Settings\students\Desktop\Foo\test.aln") > >>> al=do_alignment(cline) > Traceback (most recent call last): > File "", line 1, in > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, > in do_alignment > % command_line.sequence_file) > IOError: Cannot open sequence file C:\Documents and > Settings\student\Desktop\Foo\mtr4.fasta > > when I open the file using o=open('C:\Documents and > Settings\student\Desktop\Foo\mtr4.fasta') it woks fine. > > any ideas? As a general tip, try this to see what the command Biopython is trying to run is: >>> print cline Then try running the same command by hand at the command prompt (DOS prompt), and make sure it works. I can tell from the error message you have Python 2.5, but what version of Biopython do you have? I'm not at a Windows machine to check, but it is generally a good idea to avoid file names and paths with spaces where you can. In this case, I'm sure relative names would be fine: >>> import os >>> from Bio.Clustalw import MultipleAlignCL >>> from Bio.Clustalw import do_alignment >>> cline=MultipleAlignCL("mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe") >>> cline.set_output("test.aln") Peter From jchen at alumni.caltech.edu Thu Apr 9 00:20:58 2009 From: jchen at alumni.caltech.edu (jchen at alumni.caltech.edu) Date: Wed, 8 Apr 2009 17:20:58 -0700 (PDT) Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? Message-ID: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> Hello, How do I convert a file full of BLAST runs into a FASTA file of sequences for each hit? 
I have tried parsing a file full of BLAST runs per the instructions from the Biopython tutorial and cookbook (http://biopython.org/DIST/docs/tutorial/Tutorial.html), but I continue to get a ValueError. I have tried the hints on throwing certain exceptions, without much help. The only thing I have gotten working is parsing a BLAST output consisting of a single hit from a single query. I used BLAST v.2.2.18 to generate my BLAST output. Any help would be appreciated. Thanks! -Jerry From biopython at maubp.freeserve.co.uk Thu Apr 9 08:49:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Apr 2009 09:49:32 +0100 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? In-Reply-To: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> Message-ID: <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> On Thu, Apr 9, 2009 at 1:20 AM, wrote: > Hello, > > How do I convert a file full of BLAST runs into a FASTA file of sequences > for each hit? Do you just want the FASTA file to contain the matched region of the sequences in the database? That information should be in the BLAST output - you'll need to remove any gap characters. If you want the full sequence of each matched target, that isn't in the database. You'd have to take the reference number and look it up. If you made the database yourself from a FASTA file, that should be easy. If it was from NR/NT or another large database then maybe fetching the sequences from the NCBI would be easiest (try Bio.Entrez). > I have tried parsing a file full of BLAST runs per the instructions from > the Biopython tutorial and cookbook > (http://biopython.org/DIST/docs/tutorial/Tutorial.html), but I continue to > get a ValueError. I have tried the hints on throwing certain exceptions, > without much help. 
The only thing I have gotten working is parsing a BLAST > output consisting of a single hit from a single query. > > I used BLAST v.2.2.18 to generate my BLAST output. Are you sure you are using the XML output? With the plain text output and BLAST v.2.2.18, Biopython can only cope with single query output. The NCBI regularly change their plain text output, and we have more-or-less given up with the our plain text parser. The NCBI themselves do not recommend parsing it - that is what the XML format was introduced for. I can't offer any more advice without the error message, your OS (e.g. Windows XP), version of Python, version of Biopython and ideally a snippet of your code which is failing. Peter From jchen at alumni.caltech.edu Thu Apr 9 22:14:02 2009 From: jchen at alumni.caltech.edu (jchen at alumni.caltech.edu) Date: Thu, 9 Apr 2009 15:14:02 -0700 (PDT) Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? In-Reply-To: <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> Message-ID: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> Hi Peter, > Do you just want the FASTA file to contain the matched region of the > sequences in the database? That information should be in the BLAST > output - you'll need to remove any gap characters. > > If you want the full sequence of each matched target, that isn't in > the database. You'd have to take the reference number and look it up. > If you made the database yourself from a FASTA file, that should be > easy. If it was from NR/NT or another large database then maybe > fetching the sequences from the NCBI would be easiest (try > Bio.Entrez). Yeah, I actually do want the full length FASTA sequences. I didn't think about the fact that the BLAST output only contains (partial) match regions. 
I have a FASTA file of the entire proteome for the organism we are studying. > Are you sure you are using the XML output? > > With the plain text output and BLAST v.2.2.18, Biopython can only cope > with single query output. The NCBI regularly change their plain text > output, and we have more-or-less given up with the our plain text > parser. The NCBI themselves do not recommend parsing it - that is > what the XML format was introduced for. > That's unfortunate there's no standard BLAST format. Yeah, I am trying to parse the plain text BLAST output. I'm not familiar with the XML output - I don't know how to have BLAST output in XML format. My file contains a few hundred queries. I ended up writing a little script that extracted the name of each query and each of its significant hits. I will probably end up writing my own scripts for getting the FASTA sequences for each of these hits from a FASTA proteome file. > I can't offer any more advice without the error message, your OS (e.g. > Windows XP), version of Python, version of Biopython and ideally a > snippet of your code which is failing. That's alright. It will be easier for me to write my own little scripts to parse my BLAST output file. I was just hoping there was an easy, fast way to do it with Biopython. Thanks for your help! -Jerry From agarbino at gmail.com Thu Apr 9 22:27:54 2009 From: agarbino at gmail.com (Alex Garbino) Date: Thu, 9 Apr 2009 17:27:54 -0500 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? In-Reply-To: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> Message-ID: <4cf37ad00904091527w519b3757wcd3b5854dd029d0b@mail.gmail.com> I wrote a simple script to do that, pasted below & attached. 
You supply your FASTA protein sequence up top, and it blasts it, and returns the top 200 hits in a CSV format with the full FASTA sequence for each hit.

However, although it worked before (see output csv file), I'm trying it with a new protein (I've attached the fasta file .txt) and it gives me a StopIteration error; I'd appreciate help in debugging that!!

The script also needs help in that:
1) it sometimes skips a hit for the same organism with a higher HSP value
2) the csv file is not perfectly delimited, sometimes the label gets broken up (see output in excel from a previous run where it did work)
3) I'd like to get e-values instead of HSP scores, but I can't figure out the structure of the record/how to get each piece.

Despite all that, it will do what you are wanting to do... in a very newbie way! :)

-Alex Garbino

code:

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import Entrez

#Open file to blast
file = "ryr2fasta.txt"

#Blast, save copy
record = SeqIO.read(open(file), format="fasta")
result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=200)
blast_results = result_handle.read()
save_file = open(file[:-4]+"123.xml", "w")
save_file.write(blast_results)
save_file.close()
result_handle = open(file[:-4]+"123.xml")

#Load the blast record
blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()

output = {}
for x in blast_record.alignments:
    for hsp in x.hsps:
        output[x.accession] = [x.title]
        output[x.accession].extend([x.length])
        output[x.accession].extend([hsp.score])

for x in output:
    handle = Entrez.efetch(db="protein", id=x, rettype="genbank")
    record = SeqIO.parse(handle, "genbank")
    recurd = record.next()
    output[x].insert(0, recurd.id)
    output[x].insert(1, recurd.annotations["source"])
    output[x].extend([recurd.seq.tostring()])

#print output
save_file = open(file[:-4]+"123.csv", "w")

#Generate CSV
for item in output:
    # save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2]))
    save_file.write('%s,%s,%s,%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2],output[item][3],output[item][4],output[item][5]))
save_file.close()

On Thu, Apr 9, 2009 at 5:14 PM, wrote:

> Hi Peter,
>
> > Do you just want the FASTA file to contain the matched region of the
> > sequences in the database? That information should be in the BLAST
> > output - you'll need to remove any gap characters.
> >
> > If you want the full sequence of each matched target, that isn't in
> > the database. You'd have to take the reference number and look it up.
> > If you made the database yourself from a FASTA file, that should be
> > easy. If it was from NR/NT or another large database then maybe
> > fetching the sequences from the NCBI would be easiest (try
> > Bio.Entrez).
>
> Yeah, I actually do want the full length FASTA sequences. I didn't think
> about the fact that the BLAST output only contains (partial) match
> regions. I have a FASTA file of the entire proteome for the organism we
> are studying.
>
> > Are you sure you are using the XML output?
> >
> > With the plain text output and BLAST v.2.2.18, Biopython can only cope
> > with single query output. The NCBI regularly change their plain text
> > output, and we have more-or-less given up with the our plain text
> > parser. The NCBI themselves do not recommend parsing it - that is
> > what the XML format was introduced for.
> >
>
> That's unfortunate there's no standard BLAST format. Yeah, I am trying to
> parse the plain text BLAST output. I'm not familiar with the XML output -
> I don't know how to have BLAST output in XML format.
>
> My file contains a few hundred queries. I ended up writing a little script
> that extracted the name of each query and each of its significant hits. I
> will probably end up writing my own scripts for getting the FASTA
> sequences for each of these hits from a FASTA proteome file.
> > > I can't offer any more advice without the error message, your OS (e.g. > > Windows XP), version of Python, version of Biopython and ideally a > > snippet of your code which is failing. > > That's alright. It will be easier for me to write my own little scripts to > parse my BLAST output file. I was just hoping there was an easy, fast way > to do it with Biopython. > > Thanks for your help! > -Jerry > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -------------- next part -------------- >gi|112799847|ref|NP_001026.2| cardiac muscle ryanodine receptor [Homo sapiens] MADGGEGEDEIQFLRTDDEVVLQCTATIHKEQQKLCLAAEGFGNRLCFLESTSNSKNVPPDLSICTFVLE QSLSVRALQEMLANTVEKSEGQVDVEKWKFMMKTAQGGGHRTLLYGHAILLRHSYSGMYLCCLSTSRSST DKLAFDVGLQEDTTGEACWWTIHPASKQRSEGEKVRVGDDLILVSVSSERYLHLSYGNGSLHVDAAFQQT LWSVAPISSGSEAAQGYLIGGDVLRLLHGHMDECLTVPSGEHGEEQRRTVHYEGGAVSVHARSLWRLETL RVAWSGSHIRWGQPFRLRHVTTGKYLSLMEDKNLLLMDKEKADVKSTAFTFRSSKEKLDVGVRKEVDGMG TSEIKYGDSVCYIQHVDTGLWLTYQSVDVKSVRMGSIQRKAIMHHEGHMDDGISLSRSQHEESRTARVIR STVFLFNRFIRGLDALSKKAKASTVDLPIESVSLSLQDLIGYFHPPDEHLEHEDKQNRLRALKNRQNLFQ EEGMINLVLECIDRLHVYSSAAHFADVAGREAGESWKSILNSLYELLAALIRGNRKNCAQFSGSLDWLIS RLERLEASSGILEVLHCVLVESPEALNIIKEGHIKSIISLLDKHGRNHKVLDVLCSLCVCHGVAVRSNQH LICDNLLPGRDLLLQTRLVNHVSSMRPNIFLGVSEGSAQYKKWYYELMVDHTEPFVTAEATHLRVGWAST EGYSPYPGGGEEWGGNGVGDDLFSYGFDGLHLWSGCIARTVSSPNQHLLRTDDVISCCLDLSAPSISFRI NGQPVQGMFENFNIDGLFFPVVSFSAGIKVRFLLGGRHGEFKFLPPPGYAPCYEAVLPKEKLKVEHSREY KQERTYTRDLLGPTVSLTQAAFTPIPVDTSQIVLPPHLERIREKLAENIHELWVMNKIELGWQYGPVRDD NKRQHPCLVEFSKLPEQERNYNLQMSLETLKTLLALGCHVGISDEHAEDKVKKMKLPKNYQLTSGYKPAP MDLSFIKLTPSQEAMVDKLAENAHNVWARDRIRQGWTYGIQQDVKNRRNPRLVPYTLLDDRTKKSNKDSL REAVRTLLGYGYNLEAPDQDHAARAEVCSGTGERFRIFRAEKTYAVKAGRWYFEFETVTAGDMRVGWSRP GCQPDQELGSDERAFAFDGFKAQRWHQGNEHYGRSWQAGDVVGCMVDMNEHTMMFTLNGEILLDDSGSEL AFKDFDVGDGFIPVCSLGVAQVGRMNFGKDVSTLKYFTICGLQEGYEPFAVNTNRDITMWLSKRLPQFLQ 
VPSNHEHIEVTRIDGTIDSSPCLKVTQKSFGSQNSNTDIMFYRLSMPIECAEVFSKTVAGGLPGAGLFGP KNDLEDYDADSDFEVLMKTAHGHLVPDRVDKDKEATKPEFNNHKDYAQEKPSRLKQRFLLRRTKPDYSTS HSARLTEDVLADDRDDYDFLMQTSTYYYSVRIFPGQEPANVWVGWITSDFHQYDTGFDLDRVRTVTVTLG DEKGKVHESIKRSNCYMVCAGESMSPGQGRNNNGLEIGCVVDAASGLLTFIANGKELSTYYQVEPSTKLF PAVFAQATSPNVFQFELGRIKNVMPLSAGLFKSEHKNPVPQCPPRLHVQFLSHVLWSRMPNQFLKVDVSR ISERQGWLVQCLDPLQFMSLHIPEENRSVDILELTEQEELLKFHYHTLRLYSAVCALGNHRVAHALCSHV DEPQLLYAIENKYMPGLLRAGYYDLLIDIHLSSYATARLMMNNEYIVPMTEETKSITLFPDENKKHGLPG IGLSTSLRPRMQFSSPSFVSISNECYQYSPEFPLDILKSKTIQMLTEAVKEGSLHARDPVGGTTEFLFVP LIKLFYTLLIMGIFHNEDLKHILQLIEPSVFKEAATPEEESDTLEKELSVDDAKLQGAGEEEAKGGKRPK EGLLQMKLPEPVKLQMCLLLQYLCDCQVRHRIEAIVAFSDDFVAKLQDNQRFRYNEVMQALNMSAALTAR KTKEFRSPPQEQINMLLNFKDDKSECPCPEEIRDQLLDFHEDLMTHCGIELDEDGSLDGNSDLTIRGRLL SLVEKVTYLKKKQAEKPVESDSKKSSTLQQLISETMVRWAQESVIEDPELVRAMFVLLHRQYDGIGGLVR ALPKTYTINGVSVEDTINLLASLGQIRSLLSVRMGKEEEKLMIRGLGDIMNNKVFYQHPNLMRALGMHET VMEVMVNVLGGGESKEITFPKMVANCCRFLCYFCRISRQNQKAMFDHLSYLLENSSVGLASPAMRGSTPL DVAAASVMDNNELALALREPDLEKVVRYLAGCGLQSCQMLVSKGYPDIGWNPVEGERYLDFLRFAVFCNG ESVEENANVVVRLLIRRPECFGPALRGEGGNGLLAAMEEAIKIAEDPSRDGPSPNSGSSKTLDTEEEEDD TIHMGNAIMTFYSALIDLLGRCAPEMHLIHAGKGEAIRIRSILRSLIPLGDLVGVISIAFQMPTIAKDGN VVEPDMSAGFCPDHKAAMVLFLDRVYGIEVQDFLLHLLEVGFLPDLRAAASLDTAALSATDMALALNRYL CTAVLPLLTRCAPLFAGTEHHASLIDSLLHTVYRLSKGCSLTKAQRDSIEVCLLSICGQLRPSMMQHLLR RLVFDVPLLNEHAKMPLKLLTNHYERCWKYYCLPGGWGNFGAASEEELHLSRKLFWGIFDALSQKKYEQE LFKLALPCLSAVAGALPPDYMESNYVSMMEKQSSMDSEGNFNPQPVDTSNITIPEKLEYFINKYAEHSHD KWSMDKLANGWIYGEIYSDSSKVQPLMKPYKLLSEKEKEIYRWPIKESLKTMLAWGWRIERTREGDSMAL YNRTRRISQTSQVSVDAAHGYSPRAIDMSNVTLSRDLHAMAEMMAENYHNIWAKKKKMELESKGGGNHPL LVPYDTLTAKEKAKDREKAQDILKFLQINGYAVSRGFKDLELDTPSIEKRFAYSFLQQLIRYVDEAHQYI LEFDGGSRGKGEHFPYEQEIKFFAKVVLPLIDQYFKNHRLYFLSAASRPLCSGGHASNKEKEMVTSLFCK LGVLVRHRISLFGNDATSIVNCLHILGQTLDARTVMKTGLESVKSALRAFLDNAAEDLEKTMENLKQGQF THTRNQPKGVTQIINYTTVALLPMLSSLFEHIGQHQFGEDLILEDVQVSCYRILTSLYALGTSKSIYVER QRSALGECLAAFAGAFPVAFLETHLDKHNIYSIYNTKSSRERAALSLPTNVEDVCPNIPSLEKLMEEIVE 
LAESGIRYTQMPHVMEVILPMLCSYMSRWWEHGPENNPERAEMCCTALNSEHMNTLLGNILKIIYNNLGI DEGAWMKRLAVFSQPIINKVKPQLLKTHFLPLMEKLKKKAATVVSEEDHLKAEARGDMSEAELLILDEFT TLARDLYAFYPLLIRFVDYNRAKWLKEPNPEAEELFRMVAEVFIYWSKSHNFKREEQNFVVQNEINNMSF LITDTKSKMSKAAVSDQERKKMKRKGDRYSMQTSLIVAALKRLLPIGLNICAPGDQELIALAKNRFSLKD TEDEVRDIIRSNIHLQGKLEDPAIRWQMALYKDLPNRTDDTSDPEKTVERVLDIANVLFHLEQKSKRVGR RHYCLVEHPQRSKKAVWHKLLSKQRKRAVVACFRMAPLYNLPRHRAVNLFLQGYEKSWIETEEHYFEDKL IEDLAKPGAEPPEEDEGTKRVDPLHQLILLFSRTALTEKCKLEEDFLYMAYADIMAKSCHDEEDDDGEEE VKSFEEKEMEKQKLLYQQARLHDRGAAEMVLQTISASKGETGPMVAATLKLGIAILNGGNSTVQQKMLDY LKEKKDVGFFQSLAGLMQSCSVLDLNAFERQNKAEGLGMVTEEGSGEKVLQDDEFTCDLFRFLQLLCEGH NSDFQNYLRTQTGNNTTVNIIISTVDYLLRVQESISDFYWYYSGKDVIDEQGQRNFSKAIQVAKQVFNTL TEYIQGPCTGNQQSLAHSRLWDAVVGFLHVFAHMQMKLSQDSSQIELLKELMDLQKDMVVMLLSMLEGNV VNGTIGKQMVDMLVESSNNVEMILKFFDMFLKLKDLTSSDTFKEYDPDGKGVISKRDFHKAMESHKHYTQ SETEFLLSCAETDENETLDYEEFVKRFHEPAKDIGFNVAVLLTNLSEHMPNDTRLQTFLELAESVLNYFQ PFLGRIEIMGSAKRIERVYFEISESSRTQWEKPQVKESKRQFIFDVVNEGGEKEKMELFVNFCEDTIFEM QLAAQISESDLNERSANKEESEKERPEEQGPRMAFFSILTVRSALFALRYNILTLMRMLSLKSLKKQMKK VKKMTVKDMVTAFFSSYWSIFMTLLHFVASVFRGFFRIICSLLLGGSLVEGAKKIKVAELLANMPDPTQD EVRGDGEEGERKPLEAALPSEDLTDLKELTEESDLLSDIFGLDLKREGGQYKLIPHNPNAGLSDLMSNPV PMPEVQEKFQEQKAKEEEKEEKEETKSEPEKAEGEDGEKEEKAKEDKGKQKLRQLHTHRYGEPEVPESAF WKKIIAYQQKLLNYFARNFYNMRMLALFVAFAINFILLFYKVSTSSVVEGKELPTRSSSENAKVTSLDSS SHRIIAVHYVLEESSGYMEPTLRILAILHTVISFFCIIGYYCLKVPLVIFKREKEVARKLEFDGLYITEQ PSEDDIKGQWDRLVINTQSFPNNYWDKFVKRKVMDKYGEFYGRDRISELLGMDKAALDFSDAREKKKPKK DSSLSAVLNSIDVKYQMWKLGVVFTDNSFLYLAWYMTMSVLGHYNNFFFAAHLLDIAMGFKTLRTILSSV THNGKQLVLTVGLLAVVVYLYTVVAFNFFRKFYNKSEDGDTPDMKCDDMLTCYMFHMYVGVRAGGGIGDE IEDPAGDEYEIYRIIFDITFFFFVIVILLAIIQGLIIDAFGELRDQQEQVKEDMETKCFICGIGNDYFDT VPHGFETHTLQEHNLANYLFFLMYLINKDETEHTGQESYVWKMYQERCWEFFPAGDCFRKQYEDQLN -------------- next part -------------- A non-text attachment was scrubbed... 
Name: blast.py Type: text/x-python Size: 1420 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ryr2fasta123.csv Type: application/csv Size: 590961 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Thu Apr 9 22:46:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Apr 2009 23:46:37 +0100 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? In-Reply-To: <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> Message-ID: <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> On 4/9/09, jchen at alumni.caltech.edu wrote: > > Do you just want the FASTA file to contain the matched region of the > > sequences in the database? That information should be in the BLAST > > output - you'll need to remove any gap characters. > > > > If you want the full sequence of each matched target, that isn't in > > the database. You'd have to take the reference number and look it up. > > If you made the database yourself from a FASTA file, that should be > > easy. If it was from NR/NT or another large database then maybe > > fetching the sequences from the NCBI would be easiest (try > > Bio.Entrez). > > Yeah, I actually do want the full length FASTA sequences. I didn't think > about the fact that the BLAST output only contains (partial) match > regions. I have a FASTA file of the entire proteome for the organism we > are studying. You should be able to get the match IDs from the BLAST output and match them up to your FASTA file easily enough. > > Are you sure you are using the XML output? > > > > With the plain text output and BLAST v.2.2.18, Biopython can only cope > > with single query output. 
The NCBI regularly change their plain text > > output, and we have more-or-less given up with the our plain text > > parser. The NCBI themselves do not recommend parsing it - that is > > what the XML format was introduced for. > > That's unfortunate there's no standard BLAST format. Yeah, I am trying to > parse the plain text BLAST output. I'm not familiar with the XML output - > I don't know how to have BLAST output in XML format. If you are using the blastall tool at the command line directly, use the argument -m 7 (from memory - check the blastall help). If you are using the wrapper in Bio.Blast.NCBIStandalone, this defaults to requesting XML. Have you looked at our documentation or the tutorial? > My file contains a few hundred queries. I ended up writing a little script > that extracted the name of each query and each of its significant hits. I > will probably end up writing my own scripts for getting the FASTA > sequences for each of these hits from a FASTA proteome file. If you have already run the BLAST search and it would be slow to rerun it with XML output, then doing your own parser might be expedient. Anyway, once I had the sequence identifiers, I would use Bio.SeqIO to read the FASTA file. If the file is small, loading into memory as a python dictionary would be the simplest solution - see the Bio.SeqIO.to_dict function as one way to do this. Finally, sending attachments to the mailing list isn't a good idea - especially not half a megabyte of BLAST results! I think the mailing list has rejected that email anyway... Peter From biopython at maubp.freeserve.co.uk Thu Apr 9 22:49:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Apr 2009 23:49:51 +0100 Subject: [BioPython] how to convert file full of BLAST runs into a FASTA file of sequences? 
In-Reply-To: <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> References: <49768.130.191.204.73.1239236458.squirrel@mail.alumni.caltech.edu> <320fb6e00904090149j54674388m9974abd23dd086ef@mail.gmail.com> <49556.130.191.204.206.1239315242.squirrel@mail.alumni.caltech.edu> <320fb6e00904091546r45f9ef26n8fd441131cce3c42@mail.gmail.com> Message-ID: <320fb6e00904091549g2ec0c779p81e419964bc2de41@mail.gmail.com> On 4/9/09, Peter wrote: > Finally, sending attachments to the mailing list isn't a good idea - > especially not half a megabyte of BLAST results! I think the mailing > list has rejected that email anyway... Actually Alex's large email may have arrived in everyone's inbox after all. Well I hope Jerry found it useful. Peter From jmm217 at pitt.edu Sat Apr 11 23:49:50 2009 From: jmm217 at pitt.edu (John MacCallum) Date: Sat, 11 Apr 2009 19:49:50 -0400 Subject: [BioPython] Invitation for Biopython news coordinators In-Reply-To: <20090406230542.GK43636@sobchak.mgh.harvard.edu> Message-ID: <1239493790.27790.184.camel@localhost.localdomain> Hi, I'm a biology undergrad at the University of Pittsburgh and am interested in taking on the proposed news coordinator role. I'm likely not the most technically appropriate person for the position from a computational standpoint, but in the absence of other volunteers stepping forward I'd probably be adequate. The only caveat would be that I'd rather wait until after finals (about another two weeks) before beginning any new projects. 
Thanks,
John MacCallum

From chapmanb at 50mail.com Sun Apr 12 15:38:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 12 Apr 2009 11:38:00 -0400
Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon
Message-ID: <20090412153800.GA77169@kunkel>

Biopython folks;
A friendly reminder that abstracts for BOSC 2009 talks are due tomorrow:

http://open-bio.org/wiki/BOSC_2009

It would be great to see a strong Python representation there, so I encourage anyone with open source work to think about putting an abstract and talk together.

I would also like to work on organizing a day or two of Biopython coding before or after the conference. The idea is to get a small group of interested programmers, decide on some topics of interest, and sit down together in real life and implement them.

Since people are likely starting to get their flights together, the first order of business is to find out who is interested and what days before or after BOSC would work. Drop an e-mail to the list if you're attending BOSC or ISMB this year and would be interested in an informal hackathon.

Brad

From tiagoantao at gmail.com Sun Apr 12 18:08:33 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sun, 12 Apr 2009 19:08:33 +0100
Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon
In-Reply-To: <20090412153800.GA77169@kunkel>
References: <20090412153800.GA77169@kunkel>
Message-ID: <6d941f120904121108p1056719dkc10fe218206feccf@mail.gmail.com>

Hi,

I was not planning to attend. But if there is a hackathon I will talk with my boss and try to go...

On Sun, Apr 12, 2009 at 4:38 PM, Brad Chapman wrote:
> Biopython folks;
> A friendly reminder that abstracts for BOSC 2009 talks are
> due tomorrow:
>
> http://open-bio.org/wiki/BOSC_2009
>
> It would be great to see a strong Python representation there, so I
> encourage anyone with open source work to think about putting an
> abstract and talk together.
> > I would also like to work on organizing a day or two of > Biopython coding before or after the conference. The idea is to get > a small group of interested programmers, decide on some topics > of interest, and sit down together in real life and implement them. > > Since people are likely starting to get their flights together, the > first order of business is to find out who is interested and what > days before or after BOSC would work. Drop an e-mail to the list > if you're attending BOSC or ISMB this year and would be interested > in an informal hackathon. > > Brad > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From biopython at maubp.freeserve.co.uk Mon Apr 13 09:50:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 10:50:19 +0100 Subject: [BioPython] BOSC 2009 -- abstracts due and informal hackathon In-Reply-To: <20090412153800.GA77169@kunkel> References: <20090412153800.GA77169@kunkel> Message-ID: <320fb6e00904130250x1cdfed70j25079f12119ce01e@mail.gmail.com> On Sun, Apr 12, 2009 at 4:38 PM, Brad Chapman wrote: > Biopython folks; > A friendly reminder that abstracts for BOSC 2009 talks are > due tomorrow: > > http://open-bio.org/wiki/BOSC_2009 Yikes - that has crept up on me, thanks for the heads up. I'd already contacted them informally though... > It would be great to see a strong Python representation there, so I > encourage anyone with open source work to think about putting an > abstract and talk together. > > I would also like to work on organizing a day or two of > Biopython coding before or after the conference. The idea is to get > a small group of interested programmers, decide on some topics > of interest, and sit down together in real life and implement them. 
We might be able to do that as part of the BOSC coding sessions?

> Since people are likely starting to get their flights together, the
> first order of business is to find out who is interested and what
> days before or after BOSC would work. Drop an e-mail to the list
> if you're attending BOSC or ISMB this year and would be interested
> in an informal hackathon.

I'm hoping to attend BOSC and ISMB (finances permitting) and an informal hackathon sounds good, although I'm not yet sure when exactly would be the best time.

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 13 10:44:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 11:44:29 +0100
Subject: [BioPython] BOSC 2009
Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>

Hello Biopythoneers,

Those of you following the dev-mailing list or the OBF news feed will know that talk abstracts for BOSC 2009 are due in today, see http://www.open-bio.org/wiki/BOSC_2009

I should be able to attend and present the Biopython Project Update, and a few other Biopython developers may also be around, so some sort of hackathon is in the air.

It is a bit unfortunate the deadline was scheduled on the Easter break, as I'm sure quite a few of you will be on holiday, but here is an outline abstract. If anyone has comments, please let me know (on the list or directly) in the next couple of hours...

Biopython Project Update (draft abstract for BOSC 2009)

In this talk we present the current status of the Biopython project, focusing on features developed in the last year, and future plans for the project.

The Oxford University Press journal Bioinformatics has recently published an application note describing Biopython:

Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Mar 20.
doi:10.1093/bioinformatics/btp163 Since BOSC 2008, Biopython 1.49 has been released. This was an important milestone in bringing support for Python 2.6, and in terms of our dependence on Numerical Python as we made the transition from the obsolete Numeric library to NumPy. Biopython 1.49 also added more biological methods to our core sequence object. April 2009 will see the release of Biopython 1.50 (at the time of writing, a beta has already been released). Some of the new features include: 1. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. 2. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. 3. Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. Biopython will celebrate its 10th Birthday later this year, we will present a brief history of the project and current work. This includes the evaluation of git (and github) as a possible distributed version control system (DVCS) to replace our existing very stable CVS server hosted by the Open Bioinformatics Foundation, which we hope will encourage more participation in the project. 
--
Thanks,
Peter

From biopython at maubp.freeserve.co.uk Mon Apr 13 13:33:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:33:03 +0100
Subject: [BioPython] BOSC 2009
In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com>

On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote:
> Hello Biopythoneers,
>
> Those of you following the dev-mailing list or the OBF news feed will
> know that talk abstracts for BOSC 2009 are due in today, see
> http://www.open-bio.org/wiki/BOSC_2009
> I should be able to attend and present the Biopython Project
> Update, and a few other Biopython developers may also be
> around, so some sort of hackathon is in the air.
>
> It is a bit unfortunate the deadline was scheduled on the Easter
> break, as I'm sure quite a few of you will be on holiday, but here
> is an outline abstract. If anyone has comments, please let me
> know (on the list or directly) in the next couple of hours...

That's been submitted now, although I can still make revisions at the moment if anyone spots something worth adding/fixing. I did remember to add the website and license information as BOSC request in their instructions.

Peter

From peter at maubp.freeserve.co.uk Mon Apr 13 13:47:04 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:47:04 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
Message-ID: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>

Hi all,

I've filed enhancement bug 2809 with a patch to add startswith and endswith methods to the Seq object,
http://bugzilla.open-bio.org/show_bug.cgi?id=2809

I'm confident there are many possible use cases for this.
The example which prompted me to work on this was taking SeqRecord objects from sequencing reads (a FASTQ file read in with Bio.SeqIO, possible with Biopython 1.50 beta or later) where some include a PCR primer associated prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this I need to know if a given SeqRecord's sequence starts with (or ends with) a given primer sequence (or a tuple of primer sequences).

e.g. I want to be able to do this:

    primer = "TGACCTGAAAAGAC"
    crop = len(primer)
    #record is a SeqRecord object
    if record.seq.startswith(primer) :
        record = record[crop:]

Currently you'd have to turn the Seq into a string to use its startswith method, which is not as nice:

    primer = "TGACCTGAAAAGAC"
    crop = len(primer)
    #record is a SeqRecord object
    if str(record.seq).startswith(primer) :
        record = record[crop:]

or maybe use the find method instead:

    primer = "TGACCTGAAAAGAC"
    crop = len(primer)
    #record is a SeqRecord object
    if 0 == record.seq.find(primer) :
        record = record[crop:]

Does this seem like a sensible addition to the Seq object? It is consistent with making the Seq object more like a python string.

Peter

From lpritc at scri.ac.uk Mon Apr 13 14:10:30 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 13 Apr 2009 15:10:30 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
Message-ID:

Howdo,

On 13/04/2009 14:47, "Peter" wrote:
> I'm confident there are many possible use cases for this.
>
> The example which prompted me to work on this was taking SeqRecord
> objects from sequencing reads (a FASTQ file read in with Bio.SeqIO,
> possible with Biopython 1.50 beta or later) where some include a PCR
> primer associated prefix/suffix which I want to strip off (by slicing
> the SeqRecord).
> To do this I need to know if a given SeqRecord's
> sequence starts with (or ends with) a given primer sequence (or a
> tuple of primer sequences).
>
> e.g. I want to be able to do this:
>
> primer = "TGACCTGAAAAGAC"
> crop = len(primer)
> #record is a SeqRecord object
> if record.seq.startswith(primer) :
>     record = record[crop:]

[...]

> Does this seem like a sensible addition to the Seq object? It is
> consistent with making the Seq object more like a python string.

Yes it does seem sensible. I'd quite like to (eventually) have the capability either to provide ambiguity symbols, or to query with a regular expression along the lines of re.match() (or maybe the nonexistent re.endmatch()).

Since this isn't implemented yet, maybe there's still time to consider this potential usage in the implementation?

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

From peter at maubp.freeserve.co.uk Mon Apr 13 14:46:31 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 15:46:31 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To:
References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
Message-ID: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com>

On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard wrote:
>
> Howdo,
>
>> Does this seem like a sensible addition to the Seq object? It is
>> consistent with making the Seq object more like a python string.
>
> Yes it does seem sensible.

Good :)

> I'd quite like to (eventually) have the capability either to provide ambiguity
> symbols, or to query with a regular expression along the lines of
> re.match() (or maybe the nonexistent re.endmatch()).
>
> Since this isn't implemented yet, maybe there's still time to consider this
> potential usage in the implementation?

I'm not at all happy about the idea of supporting ambiguity characters in these string-like methods of the Seq object. Right now I was proposing nothing special with ambiguity symbols, so:

>>> from Bio.Seq import Seq
>>> Seq("TAN").startswith("TAN")
True
>>> Seq("TAA").startswith("TAN")
False
>>> Seq("TAA").startswith("TAX")
False

I agree this doesn't cover all possible use cases, but it is very simple, and easy to explain.

Trying to support ambiguity symbols will be alphabet dependent (consider the above example could be a protein or DNA), and frankly extremely complicated. It also breaks the "act like a string" idea.
Essentially you'd be asking for the following:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> Seq("TAN", generic_dna).startswith("TAN")
True
>>> Seq("TAA", generic_dna).startswith("TAN") #treat N specially
True
>>> Seq("TAN", generic_protein).startswith("TAN")
True
>>> Seq("TAA", generic_protein).startswith("TAN") #protein, so N is a normal amino acid
False
>>> Seq("TAX", generic_protein).startswith("TAX")
True
>>> Seq("TAA", generic_protein).startswith("TAX") #treat X specially
True

So far that is at least understandable - but what would you expect the following to do, where we don't know if it is DNA or protein:

>>> Seq("TAA").startswith("TAN")
>>> Seq("TAA").startswith("TAX")

We don't know, therefore we shouldn't guess, so I think these would have to raise an error. This also applies to the other ambiguous nucleotide letters, like S for G or C in nucleotide sequences. Then there are more alphabet corner cases - consider reduced alphabets (e.g. simplified protein sequences mapping all acidic residues to a single character etc).

Several "Zen of Python" points spring to mind, including "If the implementation is hard to explain, it's a bad idea.", but in summary I am against supporting ambiguous characters in the string-like methods of the Seq object (so: find, rfind, split, startswith, endswith, etc). We should handle this another way.

Bartek: would Bio.Motif give us a nice way to do these kinds of searches? For example, given a simple nucleotide motif of "TAN" (which should match TAA, TAC, TAG or TAT) or "TAS" (which should match "TAC" or "TAG"), can we check if this matches at the start of a target nucleotide sequence? And similarly for protein motifs (e.g. signal peptides).
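
As an illustration of the kind of check being asked about here, the following is a minimal sketch in plain Python. It does not use Bio.Motif (whose API is not described in this thread), it assumes the target sequence itself is unambiguous DNA, and the names IUPAC_DNA, motif_to_regex and motif_at_start are hypothetical helpers invented for this example.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex character classes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC nucleotide motif such as "TAS" into a regex string."""
    return "".join(IUPAC_DNA[letter] for letter in motif.upper())


def motif_at_start(sequence, motif):
    """Return True if the (unambiguous) sequence begins with the motif."""
    # re.match only matches at the beginning of the string.
    return re.match(motif_to_regex(motif), str(sequence).upper()) is not None


if __name__ == "__main__":
    for target in ("TAAGCC", "TACGCC", "TAGGCC", "TATGCC"):
        print(target, motif_at_start(target, "TAN"), motif_at_start(target, "TAS"))
```

Passing str(record.seq) for a SeqRecord would work the same way, since the helper only needs something that converts to a string.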
Peter

From mmokrejs at ribosome.natur.cuni.cz Mon Apr 13 14:44:14 2009
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Mon, 13 Apr 2009 16:44:14 +0200
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
Message-ID: <49E34FBE.3070308@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> Hi all,
>
> I've filed enhancement bug 2809 with a patch to add startswith and
> endswith methods to the Seq object,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2809
>
> e.g. I want to be able to do this:
>
> primer = "TGACCTGAAAAGAC"
> crop = len(primer)
> #record is a SeqRecord object
> if record.seq.startswith(primer) :
>     record = record[crop:]
>
>
> Does this seem like a sensible addition to the Seq object? It is
> consistent with making the Seq object more like a python string.

Yes, I like this new approach.
Martin

From lpritc at scri.ac.uk Mon Apr 13 15:46:06 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 13 Apr 2009 16:46:06 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com>
Message-ID:

On 13/04/2009 15:46, "Peter" wrote:
> On Mon, Apr 13, 2009 at 3:10 PM, Leighton Pritchard wrote:
> I'm not at all happy about the idea of supporting ambiguity characters
> in these string-like methods of the Seq object. Right now I was
> proposing nothing special with ambiguity symbols, so:
>
> >>> from Bio.Seq import Seq
> >>> Seq("TAN").startswith("TAN")
> True
> >>> Seq("TAA").startswith("TAN")
> False
> >>> Seq("TAA").startswith("TAX")
> False
>
> I agree this doesn't cover all possible use cases, but it is very
> simple, and easy to explain.
That's in its favour, but I don't think that:

"Seq.startswith() behaves as expected for standard ambiguity symbols and regular expression syntax if the Seq object is declared with either a protein or nucleotide alphabet, but behaves like String.startswith() otherwise"

is either complicated, or hard to explain. I think that the choice is one that is best made on whether the functionality is useful like this, or better implemented in some other way.

On a design point, I'm not convinced that direct emulation of String methods in Seq objects is *always* a Good Thing™. There are String methods that it makes sense (to me) to emulate wholesale in Seq, such as .join(), .swapcase(), .upper(), slicing behaviours and so on. However, .title() and .capitalize() seem a bit out of place. Likewise, there are plenty of Seq methods that don't have sensible String counterparts. This is because, conceptually, they represent different abstract concepts, and I don't think that we should lose sight of that when making Seq objects behave like String objects. I think that abstract representation of sequences and provision of useful functionality are the important points.

> Trying to support ambiguity symbols will be alphabet dependent
> (consider the above example could be a protein or DNA), and frankly
> extremely complicated.

I don't think it *has* to be complicated at all, though it could be if we wanted it to be. For example, avoiding ambiguity codes for now:

>>> from Bio.Seq import Seq
>>> Seq("TAG").startswith("TA[CG]")

Could be handled internally with re.match(), in the same way that

>>> Seq("TAG").startswith("TAG")

could be. Seq.endswith() might be implementable by checking that an re.search() call returns at least one group that stops at the end of the target sequence, for example. These methods would cover pretty much every use case I can think of right now that doesn't involve an ambiguity symbol.
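
The re.match() and re.search() idea in the paragraph above could be sketched roughly as follows. This is only an illustration on plain strings, not the patch from Bug 2809 and not an existing Seq method; the function names regex_startswith and regex_endswith are hypothetical, and anchoring with \Z is one possible way to require that a match stops at the end of the sequence.

```python
import re


def regex_startswith(sequence, pattern):
    """True if a match for the regex pattern begins at position 0."""
    # re.match is implicitly anchored at the start of the string.
    return re.match(pattern, str(sequence)) is not None


def regex_endswith(sequence, pattern):
    """True if some match for the regex pattern stops at the end of the sequence."""
    # Wrap the pattern in a non-capturing group so that alternations such as
    # "TAG|TAA" are anchored as a whole, then require the end of the string.
    anchored = "(?:%s)\\Z" % pattern
    return re.search(anchored, str(sequence)) is not None


if __name__ == "__main__":
    print(regex_startswith("TAGCCA", "TA[CG]"))
    print(regex_startswith("TAGCCA", "TAG"))
    print(regex_endswith("CCATAG", "TA[CG]"))
```

A plain prefix such as "TAG" contains no special regex characters, so it behaves the same as str.startswith() here, which is the backwards-compatibility point made above.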
They wouldn't break String.startswith() behaviour for biological sequences, because the special symbols have no place in the biological sequence alphabets (except, perhaps, for gap characters). Such an implementation could gain extra *useful* functionality in .startswith() without breaking expected behaviour. It would also leave us in the same position originally proposed, that ambiguity symbols have no meaning. > It also breaks the "act like a string" idea. I do not agree, because there's some elision in a lawyer's definition of 'act like', as opposed to 'act as' a string that comes into play ;) If the Seq object acts *like* a string, then *when we expect it to* that doesn't prevent us from having functionality more appropriate for a Seq object, in addition to or instead of String behaviour. We're already doing this with the Seq.transcribe() and Seq.translate() (and no Seq.title()) methods, for example. I don't see how this differs conceptually for extending startswith() functionality, so long as it behaves like String.startswith() *when we expect it to*. The issue here is then: "when is it reasonable to expect this string of symbols to behave like a raw string, and when is it reasonable to expect it to behave like a biological/regex sequence of symbols?". > what would you expect the > following to do, where we don't know if it is DNA or protein: >>>> Seq("TAA").startswith("TAN") >>>> Seq("TAA").startswith("TAX") > We don't know, therefore we shouldn't guess, so I think these would > have to raise an error. That's one option, and likely the sanest - the error would probably provoke the user into specifying an alphabet, at least. However, there's no harm in discussing other options, even if none of us like them... If the sequence has an alphabet that specifies it as either Protein or Nucleotide, then in those cases we can infer clearly what the ambiguity symbol means, and there is no problem. 
If the sequence does not have such an alphabet, then we can potentially consider Seq.startswith() to behave like String.startswith(), with an appropriate warning. Alternatively, Seq.startswith() could behave like String.startswith() all the time, unless passed with an optional argument (e.g. "ambiguity=True"). Or maybe another optional argument could be passed to force the search to treat a sequence without an alphabet as either "type='protein'" or "type='RNA'", thereby suppressing the warning/error described above. > This also applies to the other ambiguous > nucleotide letters, like S for G or C in nucleotide sequences. Then > there are more alphabet corner cases - consider reduced alphabets > (e.g. simplified protein sequences mapping all acidic residues to a > single character etc). ...and what if the user makes up their own alphabet?, and so on... ;) Those would be neither Protein nor DNA/RNA alphabets, and so could do whatever the default is for Seq.startswith() behaviour in those circumstances. Another alternative could be to have an optional argument defining the ambiguity symbols, and what they represent (e.g. "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). > Several "Zen of Python" points spring to mind, including "If the > implementation is hard to explain, it's a bad idea.", but in summary I > against supporting ambiguous characters in the string-like methods of > the Seq object (so: find, rfind, split, startswith, endswith, etc). > We should handle this another way. If the natural home for this functionality is Bio.Motif, then the natural home for it is Bio.Motif, and I don't have a problem with that. I'm happy to go with the consensus. L. 
______________________________________________________

From peter at maubp.freeserve.co.uk Mon Apr 13 15:56:35 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 16:56:35 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com>
References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com> <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com>
Message-ID: <320fb6e00904130856vbda6446l2fa44c71f8308a0a@mail.gmail.com>

Leighton Pritchard wrote:
>> I'd quite like to (eventually) have the capability either to provide ambiguity
>> symbols, or to query with a regular expression along the lines of
>> re.match() (or maybe the nonexistent re.endmatch()).
>>
>> Since this isn't implemented yet, maybe there's still time to consider this
>> potential usage in the implementation?

Peter wrote:
> [Stuff about issues with the alphabet altering the behaviour] ..., but in
> summary I am against supporting ambiguous characters in the string-like
> methods of the Seq object (so: find, rfind, split, startswith, endswith, etc).
> We should handle this another way.
>
> Bartek: would Bio.Motif give us a nice way to do these kinds of
> searches? For example, given a simple nucleotide motif of "TAN"
> (which should match TAA, TAC, TAG or TAT) or "TAS" (which should match
> "TAC" or "TAG"), can we check if this matches at the start of a target
> nucleotide sequence? And similarly for protein motifs (e.g. signal
> peptides).

This feels like a rehash of some of the debate on Bug 2601, doesn't it?
http://bugzilla.open-bio.org/show_bug.cgi?id=2601 On Bug 2601 comment 5, Leighton wrote: >> I think the abstract distinction between search types here is: >> >> 1) Find match at start of sequence (re.match() and string.startswith()) >> 2) Find first match in sequence (re.search() and string.find()) >> 3) Find all non-overlapping matches in sequence (re.finditer() only) >> 4) Find all overlapping matches in sequence (neither re nor string) >> 1a) 2a) 3a) 4a) The same, but in the reverse complement. >> >> Moving down the list, the problem becomes more general. The type >> of search I need most often in biological sequences is number (4a), >> or (4) for proteins. Each of search types (1) to (3) (a or not) has a >> theoretically faster implementation than doing (4) then filtering the >> results. I don't mind having more than one search method with >> different names, or having to specify arguments to get a particular >> kind of search. I do mind not having (4a) as an option... Bartek, can Bio.Motif address these four (or eight) questions from Leighton, or am I expecting the wrong things from it? Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 17:04:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 18:04:41 +0100 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> Message-ID: <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard wrote: > However, there's no harm in discussing other options, even if none of > us like them... > > If the sequence has an alphabet that specifies it as either Protein or > Nucleotide, then in those cases we can infer clearly what the ambiguity > symbol means, and there is no problem. Strictly speaking, only if the sequence has an (ambiguous) IUPAC alphabet can we know what the (ambiguity) symbols mean with certainty. 
If the sequence has only a generic DNA/RNA/Nucleotide/Protein alphabet then we can only make a pretty good guess. > Alternatively, Seq.startswith() could behave like String.startswith() all > the time, unless passed with an optional argument (e.g. "ambiguity=True"). That idea could work. The default behaviour would be "act like a string", but an optional argument to startswith/endswith/find/rfind/count/... could enable ambiguity matching (provided the sequence has a suitable alphabet). This would be backwards compatible, and allow us to forge ahead with adding simple string-like startswith/endswith methods now (which are useful as is, and so far everyone seems supportive of), and implement ambiguity support later. > Or maybe another optional argument could be passed to force the search to > treat a sequence without an alphabet as either "type='protein'" or > "type='RNA'", thereby suppressing the warning/error described above. > ... > Another alternative could be to have an optional argument defining the > ambiguity symbols, and what they represent (e.g. > "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). If we go down the optional argument route (e.g. ambiguity=True), then a way of specifying the sequence type or ambiguity characters might be possible, although I'd prefer to encourage more rigorous use of alphabets in Seq objects in the first place (see also enhancement Bug 2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this topic). If we consider the situation where someone creates their own custom alphabet, and wants to define their own ambiguity characters, I think any ambiguous search functionality would have to interrogate the alphabet object at run time. Possible, but a bit tricky. 
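As a rough illustration of the optional-argument idea discussed above, here is a minimal sketch. It is not Biopython code: plain Python strings stand in for Seq objects, and the names `motif_to_regex`, `startswith` and `find_all_overlapping`, as well as the `ambiguity` and `table` arguments, are hypothetical. The table holds the standard IUPAC nucleotide ambiguity codes; a custom alphabet would need its own table.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
# (Assumption for this sketch: upper-case DNA only.)
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif, table=IUPAC_DNA):
    """Translate an ambiguous motif such as 'TAN' into a regex string.

    Raises KeyError for letters missing from the table.
    """
    return "".join(table[letter] for letter in motif.upper())


def startswith(sequence, prefix, ambiguity=False, table=IUPAC_DNA):
    """Act like str.startswith unless ambiguity=True is requested."""
    if not ambiguity:
        return sequence.startswith(prefix)
    return re.match(motif_to_regex(prefix, table), sequence) is not None


def find_all_overlapping(sequence, motif, table=IUPAC_DNA):
    """Return the start of every match, overlapping ones included.

    The zero-width lookahead lets the regex engine report matches that
    share letters, which re.finditer would otherwise skip.
    """
    pattern = re.compile("(?=%s)" % motif_to_regex(motif, table))
    return [match.start() for match in pattern.finditer(sequence)]
```

With the default arguments the helper behaves exactly like the string method, so `startswith("TAGGC", "TAN")` is false, while passing `ambiguity=True` lets "TAN" match the leading "TAG". The last function corresponds to search type (4) in Leighton's list; the reverse-complement variants are left out of the sketch.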
>> Several "Zen of Python" points spring to mind, including "If the
>> implementation is hard to explain, it's a bad idea.", but in summary I
>> am against supporting ambiguous characters in the string-like methods of
>> the Seq object (so: find, rfind, split, startswith, endswith, etc).
>> We should handle this another way.
>
> If the natural home for this functionality is Bio.Motif, then the natural
> home for it is Bio.Motif, and I don't have a problem with that. I'm happy
> to go with the consensus.

Well, let's hear what Bartek has to say (Bio.Motif author).

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 13 18:15:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 19:15:00 +0100
Subject: [BioPython] Feedback from Biopython 1.50 beta?
Message-ID: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com>

Dear Biopythoneers,

There is a saying "no news is good news", but as per the title - can we have some feedback from the Biopython 1.50 beta release please?

For example, this was the first time we've included a Windows installer for Python 2.6. We were waiting for NumPy 1.3, which was the first NumPy release to support Python 2.6 on Windows. If you've tried this and it works for you, please write us an email.

Support for reading/writing FASTQ and QUAL files is also new - if you've tried it out on your own second generation sequencing files, again, it would be nice to know how it worked. If everything is fine, great, but if for example it can't parse the files from your local sequencing center, please let us know.
Thanks, Peter From chapmanb at 50mail.com Tue Apr 14 01:53:01 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 21:53:01 -0400 Subject: [BioPython] Invitation for Biopython news coordinators In-Reply-To: <1239493790.27790.184.camel@localhost.localdomain> References: <20090406230542.GK43636@sobchak.mgh.harvard.edu> <1239493790.27790.184.camel@localhost.localdomain> Message-ID: <20090414015301.GA80360@kunkel> Hi John; Great. We'd be very happy to have you working on this. David Winter had also indicated an interest; see this post over on the development list: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005711.html Having several people involved will work out well. When you are finished up with finals, check back in with the list and David and we can get you set up with whatever you need. Good luck with exams and thanks again for the message, Brad > I'm a biology undergrad at the University of Pittsburgh and am > interested in taking on the proposed news coordinator role. I'm likely > not the most technically appropriate person for the position from a > computational standpoint, but in the absence of other volunteers > stepping forward I'd probably be adequate. > > The only caveat would be that I'd rather wait until after finals (about > another two weeks) before beginning any new projects. > > Thanks, > > John MacCallum > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From yvan.strahm at bccs.uib.no Tue Apr 14 14:00:17 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Tue, 14 Apr 2009 16:00:17 +0200 Subject: [BioPython] Is query_length really the length of query? 
In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> Message-ID: <49E496F1.9060608@bccs.uib.no> Peter wrote: > On Wed, Apr 1, 2009 at 11:34 AM, wrote: >> Hello List >> >> I try to get the length of the query from the blast result itself >> >> like that: >> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, >> "blastn", >> my_blast_db, >> my_blast_file) >> >> from Bio.Blast import NCBIXML >> blast_records = NCBIXML.parse(result_handle) >> for blast_record in blast_records >> >> but >> blast_record.query_length return None >> and >> blast_record.query_letters return the actual size >> >> Should I test the length of the query before the blast result? O did I >> miss-interpreted the meaning of query_length and query_letters? >> >> Thanks for your time >> >> Is query_length really the length of query? > > You can use query_letters (although it wouldn't hurt to double check > this if you have the query sequence available). With the current BLAST > XML parser query_length is always None (but I think we should fix so > they are both populated). > > Its an unfortunate historical accident dating back to the plain text > BLAST parser. The plain text output printed the query length in two > places, with different captions, which was reflected in the names > given in the BLAST record (the values should be the same, assuming the > BLAST output is sane). The XML output doesn't have this redundancy, > but our XML parser tries to use the same object to hold the results. > See: http://bugzilla.open-bio.org/show_bug.cgi?id=2176#c12 > > Have a look at the discussion on Bug 2176 for more about this > (including the far more complicated situation for the database length > which has multiple meanings). > > This seems like a timely reminder that we could perhaps tidy up a > little of this ready for Biopython 1.50 ... 
>
> Peter

Hello,

I tried to check the length before sending it to blast. My problem is that all the query sequences are in a file, so I used SeqIO to read/parse them:

for record in SeqIO.parse(fh, "fasta"):
    l_query = len(record.seq)
    result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
                                                          "blastn",
                                                          my_blast_db,
                                                          record.seq)

This doesn't work, as NCBIStandalone.blastall takes a file as infile.

Should I write a temporary file with the record.id and record.seq and pass it to NCBIStandalone.blastall? Or is there an easier way?

For now I am just using the blast_record.query_letters variable.

I am using Biopython 1.49 and Python 2.6.1.

cheers,
yvan

From biopython at maubp.freeserve.co.uk Tue Apr 14 14:23:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 15:23:03 +0100
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <49E496F1.9060608@bccs.uib.no>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no>
Message-ID: <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com>

On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm wrote:
> Hello,
>
> I tried to check the length before sending it to blast.
> My problem is that all the query sequences are in a file so I used SeqIO to
> read/parse them
>
> for record in SeqIO.parse(fh, "fasta"):
>     l_query = len(record.seq)
>     result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
>                                                           "blastn",
>                                                           my_blast_db,
>                                                           record.seq)
>
> doesn't work as NCBIStandalone.blastall takes a file as infile.
>
> Should I write a temporary file with the record.id and record.seq and pass
> it to NCBIStandalone.blastall ?
>
> or is there an easier way?

It sounds like you already have a FASTA file containing the query sequences, so just use that as the input to standalone BLAST. i.e.
I would do something like this to double check the reported query length matches up with the actual query length:

from Bio import SeqIO
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
query_filename = "example.fasta"
#Load all the queries into memory as a dictionary of SeqRecord objects
query_handle = open("example.fasta")
query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
query_handle.close()
#Run BLAST and loop over the XML blast results one by one (memory efficient),
result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \
                                                "blastn", my_blast_db,
                                                query_filename)
for blast_record in NCBIXML.parse(result_handle) :
    query_record = query_dict[blast_record.query_id] #check this
    assert len(query_record) == blast_record.query_letters
    assert len(query_record) == blast_record.query_length #Biopython 1.50b or later

Note I haven't actually tested this example, but I think the idea is clear.

This approach gives you easy access to the full query sequence, and its full description. If all you care about is the length, then rather than storing a dictionary of the queries as SeqRecords, just use a dictionary of their lengths as integers.

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 14 14:25:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 15:25:40 +0100
Subject: [BioPython] Is query_length really the length of query?
In-Reply-To: <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com>
Message-ID: <320fb6e00904140725w341b5882xeeecd9b0664950a2@mail.gmail.com>

On Wed, Apr 1, 2009 at 11:59 AM, Peter wrote:
>>
>> Should I test the length of the query before the blast result? O did I
>> miss-interpreted the meaning of query_length and query_letters?
>>
>> Thanks for your time
>>
>> Is query_length really the length of query?
>
> You can use query_letters (although it wouldn't hurt to double check
> this if you have the query sequence available). With the current BLAST
> XML parser query_length is always None (but I think we should fix so
> they are both populated).
>
> It's an unfortunate historical accident dating back to the plain text
> BLAST parser. [...]
>
> This seems like a timely reminder that we could perhaps tidy up a
> little of this ready for Biopython 1.50 ...

This was fixed in Biopython 1.50 beta, so you can now use either the query_length or the query_letters property when parsing BLAST XML output. For older versions of Biopython, as noted above, query_length was left as None when parsing XML.

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 14 18:07:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 19:07:13 +0100
Subject: [BioPython] Feedback from Biopython 1.50 beta?
In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com>
References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com>
Message-ID: <320fb6e00904141107m388985a1h56c8b534041f51b4@mail.gmail.com>

On Mon, Apr 13, 2009 at 7:15 PM, Peter wrote:
> Support for reading/writing FASTQ and QUAL files is also new - if
> you've tried it out on your own second generation sequencing files,
> again, it would be nice to know how it worked. If everything is fine,
> great, but if for example it can't parse the files from your local
> sequencing center, please let us know.

David Schruth emailed to let me know he's successfully been using the new QUAL functionality in Bio.SeqIO on 454 data. Thanks David!

There isn't much in the main Biopython tutorial on this yet, but in the meantime have a look at the built in documentation for our FASTQ and QUAL support:

>>> from Bio import SeqIO
>>> help(SeqIO.QualityIO)
...
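For readers who have not met the QUAL format, the following standard-library sketch shows what such a file holds: FASTA-style header lines followed by space-separated PHRED scores. It is only an illustration and not how Bio.SeqIO is implemented; the `parse_qual` helper and the two example reads are made up. In Biopython itself the scores should be available from each record's `letter_annotations["phred_quality"]` when parsing with the "qual" or "fastq" format names.

```python
def parse_qual(lines):
    """Yield (identifier, [PHRED scores]) pairs from QUAL-formatted lines."""
    name, scores = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, scores
            # The identifier is the first word after the ">" marker.
            name, scores = line[1:].split(None, 1)[0], []
        else:
            # Score lines may wrap, so keep extending the current record.
            scores.extend(int(value) for value in line.split())
    if name is not None:
        yield name, scores


# Hypothetical two-read example, not real 454 data.
example = """>read_1 hypothetical 454 read
40 40 38 35
30 22
>read_2
12 9 33
"""
records = dict(parse_qual(example.splitlines()))
```

Because the function is a generator it hands back one read at a time, so a large QUAL file does not have to be held in memory unless the caller chooses to build a dictionary as the example does.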
For those that didn't know, the Roche 454 off instrument applications (available on Linux only I believe) include a command line tool called "sffinfo" which can convert a binary SFF file into FASTA (using the command line option -s or -seq) or QUAL format using PHRED qualities (command line option -q or -qual). I've been using this myself to get some Roche 454 SFF read data into Bio.SeqIO in order to manually trim off primer sequences. Peter From biopython at maubp.freeserve.co.uk Tue Apr 14 20:36:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 21:36:09 +0100 Subject: [BioPython] Reading Roche 454 binary SFF files in Python Message-ID: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> Jose Blanca wrote this interesting reply (which I assume he meant to send to the whole mailing list, not just me): On 4/14/09, Blanca Postigo Jose Miguel wrote: > > > For those that didn't know, the Roche 454 off instrument applications > > (available on Linux only I believe) include a command line tool called > > "sffinfo" which can convert a binary SFF file into FASTA (using the > > command line option -s or -seq) or QUAL format using PHRED qualities > > (command line option -q or -qual). I've been using this myself to get > > some Roche 454 SFF read data into Bio.SeqIO in order to manually trim > > off primer sequences. > > For the ones that do not have the 454 software there's a free software > alternative. Some time ago Bastien Chevreux and I created a little utility to > convert sff files to fasta and xml (for the ancilliary info). It's called > sff_extract, is written in python and released under the GPL. > You can get the python script here: > http://bioinf.comav.upv.es/sff_extract/index.html > Maybe I should have announce it here, but I didn't, my fault. > If you think this code could be of some interest for you I could talk with > Bastien about the possibility of submitting it to biopython. 
Although in that
> case it could use some cleaning, it works, but it could be nicer.
>
> Best regards,
>
> Jose Blanca

That does sound interesting - if you want, email me a proper release announcement and I can forward it to the Biopython announcement mailing list.

I was aware that some information was available about the SFF file format, and it should be possible to reverse engineer the format in order to read and write it directly from Biopython.

Right now with your code under the GPL, we can't incorporate it into Biopython, but if you and Bastien are prepared to offer it to Biopython under our MIT/BSD licence that could be very useful. Even without that, any documentation on the file format or example files you might be able to share could be valuable.

I felt that adding FASTQ and QUAL support to Biopython should come first, but since the Bio.SeqIO framework is extendible perhaps we could add native support for SFF files to Biopython later on. Given people can use the Roche 454 tools (if they have them) or your open source sff_extract to get the data out of an SFF file, this isn't urgent, but is worth thinking about :)

Peter

P.S. Have you tested your sff_extract software on SFF files from the new Roche v2 software, released about the same time as the "titanium" 454 upgrade?

From jblanca at btc.upv.es Wed Apr 15 07:07:27 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 15 Apr 2009 09:07:27 +0200
Subject: [BioPython] Reading Roche 454 binary SFF files in Python
In-Reply-To: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com>
References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com>
Message-ID: <200904150907.27360.jblanca@btc.upv.es>

> I was aware that some information was available about the SFF file
> format, and it should be possible to reverse engineer the format in
> order to read and write it directly from Biopython.

The sff format is fully documented in the NCBI's SRA web site.
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff > Right now with your code under the GPL, we can't incorporate it into > Biopython, but if you and Bastien are prepared to offer it to > Biopython under our MIT/BSD licence that could be very useful. Even > without that, any documentation on the file format or example files > you might be able to share could be valuable. I guess that it wouldn't be a problem to offer you the code under your licence. But I don't think that's the best approach. The code as it is right now is not well suited to be integrated in a library. It would be easier to rewrite the sff reading part from scratch. I could do that for you in no time. The main problem would be to have sff files small enough to be used for the test. If you could provide that I could write the code to extract the information from the sff file for you. It would be easy to build a generator able to deliver the sequences one by one. sff_extract also is able to split the paired-ends reads. That's the part that Bastien wrote. Integrating that would be nice, but I think that in Biopython that should be treated as an independent problem. > P.S. Have you tested your sff_extract software on SFF files from the > new Roche v2 software, released about the same time as the "titanium" > 454 upgrade? Not me, but I think that Bastien has and he has found no problem at all with that. The sff format is well thought and consistent, the 454 people did a much better job than the ABI people did with the abi format. Best regards, -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From yvan.strahm at bccs.uib.no Wed Apr 15 08:32:49 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Wed, 15 Apr 2009 10:32:49 +0200 Subject: [BioPython] Is query_length really the length of query? In-Reply-To: <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no> <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> Message-ID: <49E59BB1.7050208@bccs.uib.no> Peter wrote: > On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm wrote: >> Hello, >> >> I tried to check the length before sending it to blast. >> My problem is that all the query sequences are in a file so I used SeqIO to >> read/parse them >> >> for record in SeqIO.parse(fh, "fasta"): >> l_query = len(record.seq) >> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, >> "blastn", >> my_blast_db, >> record.seq) >> >> doesn't work as NCBIStandalone.blastall takes a file as infile. >> >> Should I write a temporary file with the record.id and record.seq and pass >> it to NCBIStandalone.blastall ? >> >> or is there an easier way? > > It sounds like you already have a FASTA file containing the query > sequences, so just use that as the input to standalone BLAST. > > i.e. 
I would do something like this to double check the reported query > length matches up with the actual query length: > > from Bio import SeqIO > from Bio.Blast import NCBIXML > from Bio.Blast import NCBIStandalone > query_filename = "example.fasta" > #Load all the queries into memory as a dictionary of SeqRecord objects > query_handle = open("example.fasta") > query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta")) > query_handle.close() > #Run BLAST and loop over the XML blast results one by one (memory efficient), > result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \ > "blastn", my_blast_db, > query_filename) > for blast_record in NCBIXML.parse(result_handle) : > query_record = query_dict[blast_record.query_id] #check this > assert len(query_record) == blast_record.query_letters > assert len(query_record) == blast_record.query_length #Biopython > 1.50b or later > > Note I haven't actually tested this example, but I think the idea is clear. > > This approach gives you easy access to the full query sequence, and > its full description. If all you care about is the length, then > rather than storing a dictionary of the queries as SeqRecords, just > use a dictionary of their lengths as integers. > > Peter Thanks a lot! Just have to change the query_record = query_dict[blast_record.query_id] to query_record = query_dict[blast_record.query] because query_id return something like lcl|XXX and not the actual fasta header. and yes I am interested in the whole SeqRecords ;-). Does the query_dict size is limited by the memory of the machine ? 
yvan

From biopython at maubp.freeserve.co.uk Wed Apr 15 08:42:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 09:42:12 +0100
Subject: [BioPython] Reading Roche 454 binary SFF files in Python
In-Reply-To: <200904150907.27360.jblanca@btc.upv.es>
References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es>
Message-ID: <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com>

On Wed, Apr 15, 2009 at 8:07 AM, Jose Blanca wrote:
>
>> I was aware that some information was available about the SFF file
>> format, and it should be possible to reverse engineer the format in
>> order to read and write it directly from Biopython.
>
> The sff format is fully documented in the NCBI's SRA web site.
> http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff

Nice link - thanks. Given the specification is public (and as you say later, well thought out), we shouldn't have to worry so much about Roche making changes to it in future releases.

>> Right now with your code under the GPL, we can't incorporate it into
>> Biopython, but if you and Bastien are prepared to offer it to
>> Biopython under our MIT/BSD license that could be very useful. Even
>> without that, any documentation on the file format or example files
>> you might be able to share could be valuable.
>
> I guess that it wouldn't be a problem to offer you the code under your
> license. But I don't think that's the best approach. The code as it is right
> now is not well suited to be integrated in a library. It would be easier to
> rewrite the sff reading part from scratch. I could do that for you in no
> time.

I was expecting your sff_extract code would serve only as a basis - perhaps just lifting some core routines. If you are happy to extract/rewrite the core bits and give them to Biopython under the Biopython License that would be great. See http://biopython.org/DIST/LICENSE (basically MIT/BSD style).

> The main problem would be to have sff files small enough to be used
> for the test.

The Roche command line tools allow you to take a large SFF file and produce a filtered version (use sfffile with the -i option and a simple text file of read identifiers). So making a small SFF file for unit tests should be simple.

> If you could provide that I could write the code to extract the
> information from the sff file for you. It would be easy to build a
> generator able to deliver the sequences one by one.

That would be very welcome :)

> sff_extract also is able to split the paired-ends reads. That's the part
> that Bastien wrote. Integrating that would be nice, but I think that in
> Biopython that should be treated as an independent problem.

Quite possibly - I haven't yet had to work with paired end reads, and at this point I'm not sure how best to represent them with the Biopython SeqRecord object. In some senses they are two short sequences (so using two Biopython SeqRecord objects would work, but with some kind of cross referencing). Alternatively you might treat them as a long sequence with known end regions, but an unknown region of unknown length in the middle (something we don't currently have a sequence object to represent).

>> P.S. Have you tested your sff_extract software on SFF files from the
>> new Roche v2 software, released about the same time as the "titanium"
>> 454 upgrade?
>
> Not me, but I think that Bastien has and he has found no problem at all with
> that.

Great.

> The sff format is well thought and consistent, the 454 people did a
> much better job than the ABI people did with the abi format.

That makes a pleasant change - the FASTQ format strikes me as less than ideal in several ways (and the fact Solexa made their own incompatible variant just made things worse).
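To show why the Solexa variant is incompatible, here is a small standard-library sketch. It assumes the usual conventions: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, while the early Solexa files store Solexa-scaled scores with an offset of 64. The function names are illustrative only and are not part of Biopython.

```python
import math

SANGER_OFFSET = 33  # Sanger FASTQ: character = chr(PHRED score + 33)
SOLEXA_OFFSET = 64  # early Solexa FASTQ: character = chr(Solexa score + 64)


def decode_sanger(quality_string):
    """Decode a Sanger FASTQ quality string into PHRED scores."""
    return [ord(letter) - SANGER_OFFSET for letter in quality_string]


def decode_solexa(quality_string):
    """Decode a Solexa FASTQ quality string into Solexa scores.

    Solexa scores can be negative, unlike PHRED scores.
    """
    return [ord(letter) - SOLEXA_OFFSET for letter in quality_string]


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the equivalent PHRED score.

    PHRED uses -10*log10(p) and Solexa uses -10*log10(p/(1-p)),
    which gives Q_phred = 10*log10(10**(Q_solexa/10) + 1).
    """
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)
```

The same character therefore means different things depending on the variant: "I" decodes to 40 as Sanger but to 9 as Solexa, so a parser has to be told which flavour it is reading. The two scales agree closely for high scores and diverge for low ones.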
Peter From jblanca at btc.upv.es Wed Apr 15 09:01:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 15 Apr 2009 11:01:06 +0200 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> Message-ID: <200904151101.07113.jblanca@btc.upv.es> > > The main problem would be to have sff files small enough to be used > > for the test. > > The Roche command line tools allow you to take a large SFF file and > produce a filtered version (use sfffile with the -i option and a simple > text file of read identifiers). So making a small SFF file for unit tests > should be simple. Could you send a couple of example sff files to me? I haven't the Roche tools, that's why I implemented sff_extract in the first place :) -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Apr 15 09:16:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 10:16:37 +0100 Subject: [BioPython] Is query_length really the length of query? 
In-Reply-To: <49E59BB1.7050208@bccs.uib.no>
References: <20090401123456.tkhitqrg8wk484ow@webmail.uib.no> <320fb6e00904010359q613c22e4rc3d7aacd4436f4d2@mail.gmail.com> <49E496F1.9060608@bccs.uib.no> <320fb6e00904140723l41514cc9j1c8656a2b0f35a96@mail.gmail.com> <49E59BB1.7050208@bccs.uib.no>
Message-ID: <320fb6e00904150216g71856e8dwfe5515000bbda2de@mail.gmail.com>

On Wed, Apr 15, 2009 at 9:32 AM, Yvan Strahm wrote:
>
> Peter wrote:
>> It sounds like you already have a FASTA file containing the query
>> sequences, so just use that as the input to standalone BLAST.
>>
>> i.e. I would do something like this to double check the reported query
>> length matches up with the actual query length:
>>
>> from Bio import SeqIO
>> from Bio.Blast import NCBIXML
>> from Bio.Blast import NCBIStandalone
>> query_filename = "example.fasta"
>> #Load all the queries into memory as a dictionary of SeqRecord objects
>> query_handle = open("example.fasta")
>> query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
>> query_handle.close()
>> #Run BLAST and loop over the XML blast results one by one (memory efficient),
>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \
>>                                                 "blastn", my_blast_db,
>>                                                 query_filename)
>> for blast_record in NCBIXML.parse(result_handle) :
>>     query_record = query_dict[blast_record.query_id] #check this
>>     assert len(query_record) == blast_record.query_letters
>>     assert len(query_record) == blast_record.query_length #Biopython 1.50b or later
>>
>> Note I haven't actually tested this example, but I think the idea is
>> clear.
>>
>> This approach gives you easy access to the full query sequence, and
>> its full description. If all you care about is the length, then
>> rather than storing a dictionary of the queries as SeqRecords, just
>> use a dictionary of their lengths as integers.
>>
>> Peter
>
> Thanks a lot!
> Just have to change the query_record = query_dict[blast_record.query_id] > to query_record = query_dict[blast_record.query] > because query_id return something like lcl|XXX and not the actual fasta > header. There may be a blastall command line argument to alter this behaviour, but if that works, great :) > and yes I am interested in the whole SeqRecords ;-). OK, good. That example should be a good starting point then. > Does the query_dict size is limited by the memory of the machine ? In the example I gave, yes. This is using a standard python dictionary, the keys are the record identifiers (strings) and the values are SeqRecord objects. A larger query file means more SeqRecord in memory in this dictionary. Unless you are using thousands of query sequences (in which case your BLAST search will be slow), I don't expect this to be a problem. However each SeqRecord in this example will be wasting some memory e.g. an empty list of features, an empty list of database cross references, and an empty dictionary of annotations. If you find you are running out of memory, then perhaps use a dictionary where the keys are the record identifiers (strings) and the values are just the sequence (as strings). If after that you are still running out of memory, you could index the FASTA file somehow, or use a full database (e.g. BioSQL). 
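The lighter-weight option mentioned above, keeping only each query's length instead of a full SeqRecord, might look like the following standard-library sketch. The `fasta_lengths` helper and the example records are hypothetical; with Biopython the same dictionary could presumably be built by looping over `SeqIO.parse(handle, "fasta")` and storing `len(record)` under `record.id`.

```python
def fasta_lengths(lines):
    """Map each FASTA record identifier to its sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the key, which is the
            # same convention as a SeqRecord's id.
            name = line[1:].split(None, 1)[0]
            lengths[name] = 0
        elif line and name is not None:
            # Sequence lines may wrap, so add up every line of the record.
            lengths[name] += len(line)
    return lengths


# Hypothetical queries, for illustration only.
example = """>seq1 first hypothetical query
ACGTACGT
ACG
>seq2
TTGA
"""
query_lengths = fasta_lengths(example.splitlines())
```

Only one integer per query is stored, so memory use stays small even for very large query files. Passing an open file handle instead of `example.splitlines()` works the same way, since the function just iterates over lines.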

Peter

From biopython at maubp.freeserve.co.uk Wed Apr 15 09:24:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 10:24:19 +0100
Subject: [BioPython] Reading Roche 454 binary SFF files in Python
In-Reply-To: <200904151101.07113.jblanca@btc.upv.es>
References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es>
Message-ID: <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com>

On Wed, Apr 15, 2009 at 10:01 AM, Jose Blanca wrote:
>> > The main problem would be to have sff files small enough to be used
>> > for the test.
>>
>> The Roche command line tools allow you to take a large SFF file and
>> produce a filtered version (use sfffile with the -i option and a simple
>> text file of read identifiers). So making a small SFF file for unit tests
>> should be simple.
>
> Could you send a couple of example sff files to me?
>

I'm not sure what SFF data I would be allowed to distribute, especially if you want it for a publicly available unit test example. Once the analysis is published, I'm sure this would be easier, but there would still be some administrative channels at work to go through to do this officially.

It might be simpler if you gave me the URL of a public SFF file (or one of your own files) you would like split up, and I make a reduced version of that. Email me off list and we can talk about this.

> I haven't the Roche tools, that's why I implemented sff_extract in the
> first place :)

Try having a word with the sequencing center where you get your Roche 454 sequencing done - they may be able to organize access to the software for you with Roche's approval. That's what we did.
Peter From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 10:04:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 12:04:02 +0200 Subject: [BioPython] Reading Roche 454 binary SFF files in Python In-Reply-To: <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> Message-ID: <49E5B112.7030402@ribosome.natur.cuni.cz> Hi, Peter wrote: > On Wed, Apr 15, 2009 at 10:01 AM, Jose Blanca wrote: >>>> The main problem would be to have sff files small enough to be used >>>> for the test. >>> The Roche command line tools allow you to take a large SFF file and >>> produce a filtered version (use sfffile with the -i option and a simple >>> text file of read identifiers). So making a small SFF file for unit tests >>> should be simple. >> Could you send a couple of example sff files to me? >> > > I'm not sure what SFF data I would be allowed to distribute, especially if > you want it for a publicly available unit test example. Once the analysis Just some random links to NCBI Trace Archive: ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has 454 data from 2007) http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639 ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz Hope this helps, Martin From sbassi at gmail.com Wed Apr 15 13:35:55 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 15 Apr 2009 10:35:55 -0300 Subject: [BioPython] Help for a presentation. 
Message-ID:

I am working in a laptop session for a local workshop where I plan to
show off some biopython features. The file would be available (CC-BY
3.0) after presentation in Crunchy compatible HTML (if you don't know
Crunchy, take a look, it is impressive!).
In the following drill, the task is to read a DNA sequence from a
genbank file, translate it to an aminoacid sequence and save it as a
Fasta file.
The code here does this, but I think it looks a little complex when I
want to show that biopython way to do it is easy. So I wonder if
someone knows how to modify this code to get the same result with less
steps.

from Bio import SeqIO
from Bio.Seq import translate
from Bio.SeqRecord import SeqRecord

handle = open('ampRdna.gb')
seq_record = SeqIO.read(handle, "genbank")
print "DNA Sequence:",seq_record.seq
# make translation (numbers here is where the CDS starts)
# I don't want to grab the translated sequence from genbank file
# , I want to show how to translate it.
protseq = translate(seq_record.seq[89:694])
# show translation
print "Protein Sequence:",protseq
# Make a SeqRecord
seqid = seq_record.id
seqdesc = seq_record.description
protrec = SeqRecord(protseq,id=seqid,description=seqdesc)
# save it to a fasta file.
outfile_h = open('ampRprot.fasta','w')
SeqIO.write([protrec],outfile_h,'fasta')
outfile_h.close()

From biopython at maubp.freeserve.co.uk Wed Apr 15 13:59:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 14:59:04 +0100
Subject: [BioPython] Help for a presentation.
In-Reply-To:
References:
Message-ID: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com>

On Wed, Apr 15, 2009 at 2:35 PM, Sebastian Bassi wrote:
> I am working in a laptop session for a local workshop where I plan to
> show off some biopython features. The file would be available (CC-BY
> 3.0) after presentation in Crunchy compatible HTML (if you don't know
> Crunchy, take a look, it is impressive!).
Sharing the presentation sounds good, we can link to it from here when its done if you like: http://biopython.org/wiki/Documentation#Presentations > In the following drill, the task is to read a DNA sequence from a > genbank file, translate it to an aminoacid sequence and save it as a > Fasta file. Funnily enough, I just recently added a couple of cookbook examples doing something a bit similar to this to the Tutorial in CVS. Also, have you looked at this page with some related stuff? http://www.warwick.ac.uk/go/peter_cock/python/genbank2fasta/ > The code here does this, but I think it looks a little complex when I > want to show that biopython way to do it is easy. So I wonder if > someone knows how to modify this code to get the same result with less > steps. > > from Bio import SeqIO > from Bio.Seq import translate > from Bio.SeqRecord import SeqRecord > > handle = open('ampRdna.gb') > seq_record = SeqIO.read(handle, "genbank") You never closed the input handle in the first place, so it should be just as safe to just do this: seq_record = SeqIO.read(open('ampRdna.gb'), "genbank") I do this often myself - its output handles you must be careful about closing. > print "DNA Sequence:",seq_record.seq > # make translation (numbers here is where the CDS starts) > # I don't want to grab the translated sequence from genbank file > # , I want to show how to translate it. > protseq = translate(seq_record.seq[89:694]) You can do that as a method call instead, which saves you an import line: protseq = seq_record.seq[89:694].translate() > # show translation > print "Protein Sequence:",protseq > # Make a SeqRecord > seqid = seq_record.id > seqdesc = seq_record.description > protrec = SeqRecord(protseq,id=seqid,description=seqdesc) I would merge those three lines as just: protrec = SeqRecord(protseq, id=seq_record.id, description=seq_record.description) But is it meaningful to use the whole nucleotide's ID and description for the protein? > # save it to a fasta file. 
> outfile_h = open('ampRprot.fasta','w') > SeqIO.write([protrec],outfile_h,'fasta') > outfile_h.close() I'm not sure if you'll find it clearer or not, but you could change the last three lines to this: outfile_h = open('ampRprot.fasta','w') outfile_h.write(protrec.format('fasta')) outfile_h.close() Peter From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 14:08:34 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 16:08:34 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> Message-ID: <49E5EA62.2090902@ribosome.natur.cuni.cz> Peter wrote: > Dear Biopythoneers, > > There is a saying "no news is good news", but as per the title - can > we have some feedback from the Biopython 1.50 beta release please? The tests gave me: test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet ok test_NCBIStandalone ... 
ERROR ====================================================================== ERROR: test_NCBIStandalone ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 247, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_NCBIStandalone.py", line 9, in from Bio.Blast import NCBIStandalone File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673 <<<<<<< NCBIStandalone.py ^ SyntaxError: invalid syntax ---------------------------------------------------------------------- Ran 111 tests in 373.969 seconds FAILED (failures = 1) $ From biopython at maubp.freeserve.co.uk Wed Apr 15 14:23:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 15:23:30 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <49E5EA62.2090902@ribosome.natur.cuni.cz> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> Message-ID: <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> On Wed, Apr 15, 2009 at 3:08 PM, Martin MOKREJ? wrote: > Peter wrote: >> Dear Biopythoneers, >> >> There is a saying "no news is good news", but as per the title - can >> we have some feedback from the Biopython 1.50 beta release please? > > The tests gave me: > > test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated > ?from sets import ImmutableSet That is a harmless bug in MySQLdb (it wasn't quite ready for Python 2.6), which I think has been fixed on their trunk, although I'm not sure if it is in their latest release or not. > test_NCBIStandalone ... 
ERROR
>
> ======================================================================
> ERROR: test_NCBIStandalone
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "run_tests.py", line 247, in runTest
>     suite = unittest.TestLoader().loadTestsFromName(name)
>   File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName
>     module = __import__('.'.join(parts_copy))
>   File "test_NCBIStandalone.py", line 9, in
>     from Bio.Blast import NCBIStandalone
>   File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673
>     <<<<<<< NCBIStandalone.py
>      ^
> SyntaxError: invalid syntax
> ----------------------------------------------------------------------

That "<<<<<<<" text looks like a CVS merge failed, inserting a diff
marker into the file. How did you install the Biopython 1.50 beta?
I've just downloaded and checked that the Bio/Blast/NCBIStandalone.py
file looks OK in both the archives:
http://biopython.org/DIST/biopython-1.50b.tar.gz
http://biopython.org/DIST/biopython-1.50b.zip

Peter

From sbassi at gmail.com Wed Apr 15 14:48:38 2009
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 15 Apr 2009 11:48:38 -0300
Subject: [BioPython] Help for a presentation.
In-Reply-To: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com>
References: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com>
Message-ID:

On Wed, Apr 15, 2009 at 10:59 AM, Peter wrote:
> Sharing the presentation sounds good, we can link to it from here when
> its done if you like:
> http://biopython.org/wiki/Documentation#Presentations

OK, I will add the link there.
Thanks for your suggestions, I applied most of them (sometimes shorter
code is not the easiest to read for a novice; they require more verbose
examples).
I will post the link here and in the wiki.
Best,
SB.
From mmokrejs at ribosome.natur.cuni.cz Wed Apr 15 15:18:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 15 Apr 2009 17:18:02 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> Message-ID: <49E5FAAA.9080505@ribosome.natur.cuni.cz> Peter wrote: > On Wed, Apr 15, 2009 at 3:08 PM, Martin MOKREJ? > wrote: >> Peter wrote: >>> Dear Biopythoneers, >>> >>> There is a saying "no news is good news", but as per the title - can >>> we have some feedback from the Biopython 1.50 beta release please? >> The tests gave me: > >> test_NCBIStandalone ... ERROR >> >> ====================================================================== >> ERROR: test_NCBIStandalone >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "run_tests.py", line 247, in runTest >> suite = unittest.TestLoader().loadTestsFromName(name) >> File "/usr/lib/python2.6/unittest.py", line 576, in loadTestsFromName >> module = __import__('.'.join(parts_copy)) >> File "test_NCBIStandalone.py", line 9, in >> from Bio.Blast import NCBIStandalone >> File "/home/mmokrejs/proj/biopython/build/lib.linux-i686-2.6/Bio/Blast/NCBIStandalone.py", line 1673 >> <<<<<<< NCBIStandalone.py >> ^ >> SyntaxError: invalid syntax >> ---------------------------------------------------------------------- > > That "<<<<<<<" text looks like a CVS merge failed, inserting a diff > marker into the file. How did you install the Biopython 1.50 beta? 
> I've just downloaded and checked that the Bio/Blast/NCBIStandalone.py
> file looks OK in both the archives:
> http://biopython.org/DIST/biopython-1.50b.tar.gz
> http://biopython.org/DIST/biopython-1.50b.zip

Yes, sorry, I forgot to delete my old changes to it. Dropped the file
and read-in current version from cvs now. ;)

Anyway, I think "python setup.py clean" should zap .pyc files

find Bio -name \*.pyc | xargs rm -f
find BioSQL -name \*.pyc | xargs rm -f
rm -f Tests/Quality/temp.fastq
rm -f Tests/Quality/temp.qual

The built-in tests ran fine for me.
M.

From biopython at maubp.freeserve.co.uk Wed Apr 15 15:38:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 16:38:16 +0100
Subject: [BioPython] Feedback from Biopython 1.50 beta?
In-Reply-To: <49E5FAAA.9080505@ribosome.natur.cuni.cz>
References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com>

On Wed, Apr 15, 2009 at 4:18 PM, Martin MOKREJŠ wrote:
>
> rm -f Tests/Quality/temp.fastq
> rm -f Tests/Quality/temp.qual
>

Those two files were produced by the Bio.SeqIO.QualityIO doctest. I'm
not aware of any nice way to do doctest clean up which doesn't show up
in the docstrings themselves, so I've improvised in CVS revision 1.10
and included the deletions explicitly. I could instead have used a
fancy temp file, or a StringIO handle - but this would detract from
the documentation side of things more I feel.

Maybe we should have a general clean up of any temp files at the end
of run_tests.py ... it might be worth thinking about.

>
> The built-in tests ran fine for me.
> M.

Great.
Peter From bartek at rezolwenta.eu.org Wed Apr 15 23:36:02 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 01:36:02 +0200 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> Message-ID: <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> Hi all, Sorry, but I've missed this thread completely (despite being called by name a few times). It's too late for me to address the multiple points raised here, so I'll try to summarize what I understood: Peter wants to have the Seq. startswith function and do stuff like: >if record.seq.startswith(primer) : > record = record[crop:] Leighton would like to have an even more powerful method which would do things like: >>> Seq("TAG").startswith("TA[CG]") Which is quite cool, but Peter raises objections to the semantics of startswith called with arbitrary strings. I think that the issue would be resolved if the startswith method would not accept strings, but Seqs or Motifs. Assuming that we would have a nice way of generating appropriate motifs, it would lead to simple code: m=Motif.from_IUPAC("TAN") or alternatively m=Motif.from_re("TA[C|G]") s.startswith(m) Currently there are no methods from_IUPAC or from_re, but it should be fairly straightforward to implement them (if there is interest). writing the startswith method using a motif instance is very straightforward. There is one caveat: implementing complex regexps with Bio.Motif might be not as efficient as using regexps directly, but again I could work on improving the Motif class. hope this helps cheers Bartek On Mon, Apr 13, 2009 at 7:04 PM, Peter wrote: > On Mon, Apr 13, 2009 at 4:46 PM, Leighton Pritchard wrote: >> However, there's no harm in discussing other options, even if none of >> us like them... 
>> >> If the sequence has an alphabet that specifies it as either Protein or >> Nucleotide, then in those cases we can infer clearly what the ambiguity >> symbol means, and there is no problem. > > Strictly speaking, only if the sequence has an (ambiguous) IUPAC > alphabet can we know what the (ambiguity) symbols mean with certainty. > ?If the sequence has only a generic DNA/RNA/Nucleotide/Protein > alphabet then we can only make a pretty good guess. > >> Alternatively, Seq.startswith() could behave like String.startswith() all >> the time, unless passed with an optional argument (e.g. "ambiguity=True"). > > That idea could work. ?The default behaviour would be "act like a > string", but an optional argument to > startswith/endswith/find/rfind/count/... could enable ambiguity > matching (provided the sequence has a suitable alphabet). ?This would > be backwards compatible, and allow us to forge ahead with adding > simple string-like startswith/endswith methods now (which are useful > as is, and so far everyone seems supportive of), and implement > ambiguity support later. > >> Or maybe another optional argument could be passed to force the search to >> treat a sequence without an alphabet as either "type='protein'" or >> "type='RNA'", thereby suppressing the warning/error described above. >> ... >> Another alternative could be to have an optional argument defining the >> ambiguity symbols, and what they represent (e.g. >> "ambiguity_table={'N':'[ACGT]', 'P':'[QRST]'}). > > If we go down the optional argument route (e.g. ambiguity=True), then > a way of specifying the sequence type or ambiguity characters might be > possible, although I'd prefer to encourage more rigorous use of > alphabets in Seq objects in the first place (see also enhancement Bug > 2597, http://bugzilla.open-bio.org/show_bug.cgi?id=2597 on this > topic). 
> > If we consider the situation where someone creates their own custom > alphabet, and wants to define their own ambiguity characters, I think > any ambiguous search functionality would have to interrogate the > alphabet object at run time. ?Possible, but a bit tricky. > >>> Several "Zen of Python" points spring to mind, including "If the >>> implementation is hard to explain, it's a bad idea.", but in summary I >>> against supporting ambiguous characters in the string-like methods of >>> the Seq object (so: find, rfind, split, startswith, endswith, etc). >>> We should handle this another way. >> >> If the natural home for this functionality is Bio.Motif, then the natural >> home for it is Bio.Motif, and I don't have a problem with that. ?I'm happy >> to go with the consensus. > > Well, let's hear what Bartek has to say (Bio.Motif author). > > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From dalloliogm at gmail.com Thu Apr 16 09:00:42 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 16 Apr 2009 11:00:42 +0200 Subject: [BioPython] Feedback from Biopython 1.50 beta? In-Reply-To: <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz> <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> Message-ID: <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> On Wed, Apr 15, 2009 at 5:38 PM, Peter wrote: > > Maybe we should have a general clean up of any temp files at the end > of run_tests.py ... it might be worth thinking about. 
We need global fixtures for doctests :-)

A tearDownAll which deletes all the temporary files created by the
doctests would be sufficient. The way to implement it depends on how
you want to run the tests (run_tests.py or nose).

>
>>
>> The built-in tests ran fine for me.
>> M.
>
> Great.
>
> Peter
>

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu Apr 16 09:10:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 10:10:53 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com>
References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com>
Message-ID: <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com>

> Peter wants to have the Seq. startswith function and do stuff like:
>>if record.seq.startswith(primer) :
>>  record = record[crop:]
>
> Leighton would like to have an even more powerful method which would
> do things like:
>>>> Seq("TAG").startswith("TA[CG]")
>
> Which is quite cool, but Peter raises objections to the semantics of
> startswith called with arbitrary strings.
>
> I think that the issue would be resolved if the startswith method
> would not accept strings, but Seqs or Motifs.

Note that the existing search related Seq methods like find, rfind,
split, rsplit already take a string or another Seq object - so I was
intending (with the patch on Bug 2809) that startswith and endswith
did the same. However, while they take Seq objects like
Seq("TAN",generic_dna), these methods would all still do a blind
search for "TAN" literally, just like a python string would.
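To make the difference between a blind, string-like search and ambiguity-aware matching concrete, here is a minimal sketch using only the standard library. It is not how Seq or Bio.Motif work; the function ambiguous_startswith and the IUPAC_DNA table are hypothetical names for this example, and only the IUPAC nucleotide codes in the pattern are expanded.

```python
import re

# IUPAC nucleotide ambiguity codes, expanded to the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ambiguous_startswith(sequence, pattern):
    """Return True if sequence begins with pattern, expanding IUPAC codes.

    Only the pattern is expanded; ambiguity codes in the sequence itself
    are compared literally. Letters outside the table raise KeyError.
    """
    regex = "".join("[%s]" % IUPAC_DNA[letter] for letter in pattern.upper())
    return re.match(regex, sequence.upper()) is not None


print("TAG".startswith("TAN"))             # prints False (literal comparison)
print(ambiguous_startswith("TAG", "TAN"))  # prints True (N matches G)
```

Keeping this logic in a separate helper, rather than in the string-like method itself, mirrors the concern raised in the thread about not changing what startswith means for plain strings.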
Having these Seq object methods all cope with a Motif object is an interesting idea - I hadn't thought of that. We can have string or Seq arguments act as dumb python strings (no ambiguity magic), but giving a Motif object allows the ambiguity matches to be handled explicitly. I would like to clarify that I was thinking more the other way round: the Motif object has a search method where you give it a Seq (or string?) to be searched. Much like Python's regular expression objects take the target string as an argument. One advantage of doing it this way round is the Seq object is kept quite simple (which I think is a good thing), and all the ambiguity complexity lives in Bio.Motif instead. > Assuming that we would have a nice way of generating appropriate > motifs, it would lead to simple code: > > m=Motif.from_IUPAC("TAN") > > or alternatively > > m=Motif.from_re("TA[C|G]") > > s.startswith(m) > > Currently there are no methods from_IUPAC or from_re, but it should be > fairly straightforward to implement them (if there is interest). I think there is interest - although you might want to have from_IUPAC_protein, from_IUPAC_DNA, from_IUPAC_RNA. Just using m=Motif.from_IUPAC("TAN") it isn't clear if that is protein or DNA. If Motif.from_IUPAC only took a Seq object with a relevant alphabet that would solve this ambiguity, but would not be so easy to use. > writing the startswith method using a motif instance is very straightforward. If you say so :) > There is one caveat: implementing complex regexps with Bio.Motif might > be not as efficient as using regexps directly, but again I could work > on improving the Motif class. > > hope this helps Let's have a look at this (after Biopython 1.50 is out). Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 09:14:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 10:14:56 +0100 Subject: [BioPython] Feedback from Biopython 1.50 beta? 
In-Reply-To: <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com> <49E5EA62.2090902@ribosome.natur.cuni.cz> <320fb6e00904150723s72c53946lf8129ed2d6dde27d@mail.gmail.com> <49E5FAAA.9080505@ribosome.natur.cuni.cz> <320fb6e00904150838t29b7bafk97badda08098d8c4@mail.gmail.com> <5aa3b3570904160200m2fd868cey822a4e0a9134138a@mail.gmail.com> Message-ID: <320fb6e00904160214i4bc6721he1b54b911a5003d1@mail.gmail.com> On Thu, Apr 16, 2009 at 10:00 AM, Giovanni Marco Dall'Olio wrote: > On Wed, Apr 15, 2009 at 5:38 PM, Peter wrote: >> >> Maybe we should have a general clean up of any temp files at the end >> of run_tests.py ... it might be worth thinking about. > > We need global fixtures for doctests :-) > I'm not quite sure what you mean (other than your general enthusiasm for global fixtures in unittests). The doctest framework doesn't have anything like this, it doesn't even have any way to issue instructions other than by embedding text in docstrings, does it? > A tearDownAll which deletes all the temporary files created by the > doctests would be sufficent. The way to implement it depends on how > you want to run the tests (run_tests.py or nose). That is what I just suggested - as long as we are consistent about naming temp files, run_tests.py can just delete say */temp_*.* under the Tests directory. However this won't clean up after running a doctest directly (bypassing run_tests.py). 
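The clean up idea above could look roughly like the following sketch. It assumes temp files are consistently named temp_*.* and sit one directory below the Tests directory; the function name remove_temp_files is hypothetical and this is not code from run_tests.py. The demonstration runs against a scratch directory rather than a real Tests tree.

```python
import glob
import os
import tempfile


def remove_temp_files(tests_dir):
    """Delete files matching */temp_*.* below tests_dir; return removed paths."""
    removed = []
    for path in sorted(glob.glob(os.path.join(tests_dir, "*", "temp_*.*"))):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demonstration on a scratch directory standing in for Tests/.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "Quality"))
for name in ("temp_example.fastq", "keep.txt"):
    open(os.path.join(demo_root, "Quality", name), "w").close()

removed = remove_temp_files(demo_root)
remaining = os.listdir(os.path.join(demo_root, "Quality"))
print(removed, remaining)
```

As noted, a call like this at the end of run_tests.py would not help when a doctest is run directly.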
Peter From bartek at rezolwenta.eu.org Thu Apr 16 09:56:59 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 11:56:59 +0200 Subject: [BioPython] Adding startswith and endswith methods to the Seq object In-Reply-To: <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com> References: <320fb6e00904130746w29212cbdped4f993d804b16a7@mail.gmail.com> <320fb6e00904131004i379c9bffwe9b3193568568cc@mail.gmail.com> <8b34ec180904151636r46d3216lb33c4323398faa17@mail.gmail.com> <320fb6e00904160210x1cd61a4bo707576b5c3f16861@mail.gmail.com> Message-ID: <8b34ec180904160256u4600c572jb92068cc81dece1f@mail.gmail.com> hi, On Thu, Apr 16, 2009 at 11:10 AM, Peter wrote: > Having these Seq object methods all cope with a Motif object is an > interesting idea - I hadn't thought of that. ?We can have string or > Seq arguments act as dumb python strings (no ambiguity magic), but > giving a Motif object allows the ambiguity matches to be handled > explicitly. That's exactly what I meant. > > I would like to clarify that I was thinking more the other way round: > the Motif object has a search method where you give it a Seq (or > string?) to be searched. ?Much like Python's regular expression > objects take the target string as an argument. ?One advantage of doing > it this way round is the Seq object is kept quite simple (which I > think is a good thing), and all the ambiguity complexity lives in > Bio.Motif instead. > Yes, so the idea would be for the startswith, endswith etc methods to only check whether the argument is a motif and if so, call the proper methods of the argument. I need to look into it more closely (especially for the other methods like find) but there are search methods as well as methods for finding instances for the whole sequence as well as for a given position. >> Currently there are no methods from_IUPAC or from_re, but it should be >> fairly straightforward to implement them (if there is interest). 
> > I think there is interest - although you might want to have > from_IUPAC_protein, from_IUPAC_DNA, from_IUPAC_RNA. ?Just using > m=Motif.from_IUPAC("TAN") it isn't clear if that is protein or DNA. > If Motif.from_IUPAC only took a Seq object with a relevant alphabet > that would solve this ambiguity, but would not be so easy to use. Good. I'll try to implement this. > >> writing the startswith method using a motif instance is very straightforward. > > If you say so :) > Once you have the motif instance, it's really easy. The problem is with making Motif creation easy enough. > Let's have a look at this (after Biopython 1.50 is out). I agree cheers Bartek From gatoygata at hotmail.com Fri Apr 17 11:25:48 2009 From: gatoygata at hotmail.com (Joaquin Abian Monux) Date: Fri, 17 Apr 2009 11:25:48 +0000 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= Message-ID: Dear all, I coded a GUI utility to perform local blast searches on lists of peptides using NCBIStandalone.blastall(). I work on windows XP with python 2.5 After I updated from biopython1.48 to 1.49, when I make a search, NCBIStandalone.blastall() produces a black screen (a windows system console produced by the execution of ..\bin\blastall.exe) that pops up and rapidly disappears as blastall.exe is executed. Nothing more has been changed in the application and the screen does not appears if I downgrade to 1.48. This black window is very annoying (it appears in front of all other open windows) and in fact it is preventing me from upgrading my biopython installation. This problem occurs both with biopyton 1.49 an 1.50. With biopython 1.48 and below NCBIStandalone.blastall works silently. I have seen looking at the code that 1.49 uses preferently subprocess.popen() (in function _invoke_blast) to execute blast while in 1.48 it was os.popen3() (in function blastall). 
I have been playing with this but could not work out anything conclusive.

Is this something already known? I could not find any hint
by googling. Is there some way to get rid of this screen?

Thanks

Joaquin

From biopython at maubp.freeserve.co.uk Fri Apr 17 12:06:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 13:06:42 +0100
Subject: [BioPython] Reading Roche 454 binary SFF files in Python
In-Reply-To: <49E5B112.7030402@ribosome.natur.cuni.cz>
References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com> <200904150907.27360.jblanca@btc.upv.es> <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com> <200904151101.07113.jblanca@btc.upv.es> <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com> <49E5B112.7030402@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com>

On Wed, Apr 15, 2009 at 11:04 AM, Martin MOKREJŠ wrote:
> Hi,
> ...
> Just some random links to NCBI Trace Archive:
>
> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead
> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study
> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has 454 data from 2007)
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639
> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz
>
> Hope this helps,
> Martin

Thanks for those links Martin,

I've used a FASTQ file from that list for a couple of examples I've
just added to the tutorial.
Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 12:14:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 13:14:22 +0100 Subject: [BioPython] =?utf-8?q?blastall_produces_a_black_console_screen_wi?= =?utf-8?q?th_biopython_1=2E49_and_up=E2=80=8F?= In-Reply-To: References: Message-ID: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com> On Fri, Apr 17, 2009 at 12:25 PM, Joaquin Abian Monux wrote: > > Dear all, > > ?I coded a GUI utility to perform local blast searches > on lists of peptides using NCBIStandalone.blastall(). > > I work on windows XP with python 2.5 > > After I updated from biopython1.48 to 1.49, ?when I > make a search, NCBIStandalone.blastall() ?produces > a black screen (a windows system console produced > by the execution of ..\bin\blastall.exe) that pops up and > rapidly disappears as blastall.exe is executed. ?Nothing > more has been changed in the application and the > screen does not appears if I downgrade to 1.48. > > This black window is very annoying (it appears in > front of all other open windows) and in fact it is > preventing me from upgrading my biopython installation. > > This problem occurs both with biopyton 1.49 an 1.50. > With biopython 1.48 and below NCBIStandalone.blastall > works silently. > > I have seen looking at the code that 1.49 uses preferently > subprocess.popen() (in function _invoke_blast) to execute > blast while in 1.48 it was os.popen3() (in function blastall). > I have been playing with this but I got nothing clear > > Is this something already known? I could not found any hint > by googling. Is there some way to get rid of this screen?. 

Stefanie Lück had some issues with BLAST and subprocess on
her Windows GUI program, which we traced to a bug in Python itself,
http://bugs.python.org/issue1124861

See:
http://lists.open-bio.org/pipermail/biopython/2009-January/004896.html
http://lists.open-bio.org/pipermail/biopython/2009-February/004898.html

We were able to resolve this for Stefanie, and the fix was included in
Biopython 1.50 beta. Have you tried this yet?

If that doesn't work, could you show us a short GUI example that fails?
Details of how you are running your program could also help, as this may
make a difference - e.g. is it started from the command line with
"python my_gui.py", or run from IDLE?

Thanks

Peter

From cjfields at illinois.edu Fri Apr 17 12:18:53 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 17 Apr 2009 07:18:53 -0500
Subject: [BioPython] Reading Roche 454 binary SFF files in Python
In-Reply-To: <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com>
References: <320fb6e00904141336u3eccae59p8218d491f554adcb@mail.gmail.com>
 <200904150907.27360.jblanca@btc.upv.es>
 <320fb6e00904150142m3b055792r3cd02f38dffa274e@mail.gmail.com>
 <200904151101.07113.jblanca@btc.upv.es>
 <320fb6e00904150224j1e4fe9ddx3bc490282e5b700e@mail.gmail.com>
 <49E5B112.7030402@ribosome.natur.cuni.cz>
 <320fb6e00904170506u343261acif58e1feaacc8d387@mail.gmail.com>
Message-ID: <70EB1D0C-BFB4-4603-A4B3-19B9769BDD68@illinois.edu>

Just to add, Pjotr Prins' BioLib initiative
(http://biolib.open-bio.org/wiki/Main_Page) is building SWIG-based
interfaces to several C/C++-based libraries, including Staden io_lib,
which supports the following formats (c&p from the latest io_lib README):

    SCF trace files
    ABI trace files
    ALF trace files
    CTF trace files
    ZTR trace files
    SFF trace archives
    SRF trace archives
    Experiment files
    Plain text files

We're working on the Perl/Ruby bindings; it shouldn't be hard at all
to get Python (and by extension, Biopython) working.

chris

On Apr 17, 2009, at 7:06 AM, Peter wrote:

> On Wed, Apr 15, 2009 at 11:04 AM, Martin MOKREJŠ wrote:
>> Hi,
>> ...
>> Just some random links to NCBI Trace Archive:
>>
>> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead
>> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=study&m=data&s=study
>> http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000001 (has
>> 454 data from 2007)
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&report=full&term=SRX003639
>> ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/sff2scf/sff2scf.tar.gz
>>
>> Hope this helps,
>> Martin
>
> Thanks for those links Martin,
>
> I've use a FASTQ file from that list for a couple of examples I've
> just added to the tutorial.
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From sbassi at gmail.com Fri Apr 17 14:16:48 2009
From: sbassi at gmail.com (Sebastian Bassi)
Date: Fri, 17 Apr 2009 11:16:48 -0300
Subject: [BioPython] Help for a presentation.
In-Reply-To:
References: <320fb6e00904150659o77c4db3fw2c1e9b27e6476d55@mail.gmail.com>
Message-ID:

I gave the presentation yesterday. Most attendees were bioinformatics
students and they liked what they saw. But they had previously been
exposed to Bioperl and they asked comparative questions. I set up a
little FAQ with my answers about this. Here is the laptop session:
http://www.bioinformatica.info/biopython/
Anyone is invited to improve it.

From gatoygata at hotmail.com Fri Apr 17 14:16:56 2009
From: gatoygata at hotmail.com (Joaquin Abian Monux)
Date: Fri, 17 Apr 2009 14:16:56 +0000
Subject: [BioPython] blastall produces a black console screen with biopython 1.49 and up
In-Reply-To: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
Message-ID:

Dear Peter,

I had already seen that issue.
I thought it was not related to my problem because my program didn't hang
(it had also been compiled with py2exe). The program works perfectly. It
runs the searches and shows the output, but each time I make a search I
get the /Blast/bin/blastall.exe console popping up. I have just tried
with 1.50b (I upgraded to 1.50b).

Now, if I execute the main script from the system console
(>python blast_main_205.pyw) the black screen does not appear when I send
a Blast query. Nor does it appear if I execute from within the Stani's
Python Editor IDE. It seems that this has been solved: before, with
biopython 1.49, I was getting a console flash.

But if I execute by double clicking on the main script, then it appears (!?)

I compiled with py2exe (after fixing some problems in spark.py, see below)
and the single file executable worked perfectly, but I still get the nasty
console when I make a blast query.

So, still looking for help

Joaquin

Note: I got an error log when trying to execute the single file
executable produced by py2exe:

Traceback (most recent call last):
  File "blast_main_205.pyw", line 23, in <module>
  File "zipextimporter.pyo", line 82, in load_module
  ......etc
  File "Bio\Parsers\spark.pyo", line 129, in collectRules
  File "Bio\Parsers\spark.pyo", line 101, in addRule
AttributeError: 'NoneType' object has no attribute 'split'

I fixed it by modifying spark.py in biopython 1.50b:

original:

    def addRule(self, doc, func):
        rules = doc.split()

fixed:

    def addRule(self, doc, func):
        rules = doc.split() if doc else []

From biopython at maubp.freeserve.co.uk Fri Apr 17 14:36:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 15:36:22 +0100
Subject: [BioPython] blastall produces a black console screen with biopython 1.49 and up
In-Reply-To:
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
Message-ID: <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com>

On Fri, Apr 17, 2009 at 3:16 PM, Joaquin Abian Monux wrote:
> Dear Peter,
>
> I already had seen that issue. I though it was not related with my problem
> because my program didn't hang (it had been also compiled with py2exe). The
> program works perfect. It runs the searches and shows the output, but each
> time I make a search I get the /Blast/bin/blastall.exe console popping. I
> have tried right now with 1.50b (I upgraded to 1.50b).
>
> Now, If I execute the main script from the system console (>python
> blast_main_205.pyw) the black screen does not appear when I send a Blast
> query. Neither If I execute from within the Stani's Python Editor IDE: it
> does not appear. It seems that this has been solved: before, with biopython
> 1.49 I was getting a console flash.
>
> But if I execute by double clicking on the main script, then it appears (!?)
>
> I compiled with py2exe (after fixing some problems in spark.py, see below)
> and the single file executable worked perfectly but I still get the nasty
> console when I make a blast query.
>
> So, still looking for help
>
> Joaquin

Right now I suggest you install Biopython 1.50 beta and then edit
your copy of Bio/Blast/NCBIStandalone.py to use os.popen3 instead
of subprocess. Then run py2exe and test it, and let us know if
that works.

If that works, we could revert to using os.popen3 on Python 2.5 or
older, and only use subprocess on Python 2.6+ (where os.popen3
is deprecated), but that still leaves a possible problem on Python 2.6.

I think basically you have a platform specific corner use case, and
we may not be able to get Bio/Blast/NCBIStandalone.py to cope
without a lot of effort (fixes welcome). You could also try using
subprocess but modify the shell argument, but I think that will break
other situations.

We've been discussing our command line application wrappers on
the dev mailing list this month, and after Biopython 1.50 I plan to
update Bio.Blast.Applications (and make Bio.Blast.NCBIStandalone
use this internally). The (slightly out of date) wrappers in
Bio.Blast.Applications just take care of building the command line
string - you would then be able to invoke it as you see fit (e.g. using
os.system, os.popen3, subprocess - or even submit the task to your
local computing cluster).

The idea is that Bio.Blast.NCBIStandalone would continue to be
a general purpose solution suitable for most situations (but perhaps
not Windows GUI programs using py2exe), while Bio.Blast.Applications
would give you a lower level option with more control.

Peter

From biopython at maubp.freeserve.co.uk Fri Apr 17 15:38:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 16:38:20 +0100
Subject: [BioPython] blastall produces a black console screen with biopython 1.49 and up
In-Reply-To:
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
 <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com>
Message-ID: <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com>

On Fri, Apr 17, 2009 at 4:24 PM, Joaquin Abian Monux wrote:
> Dear Peter,
>
> Thanks, I fixed it. I'm not sure how.
>
> In:
>
> blast_process = subprocess.Popen(cmd_string,
>                                  stdout=subprocess.PIPE,
>                                  stderr=subprocess.PIPE,
>                                  stdin=subprocess.PIPE,
>                                  shell= (sys.platform!="win32"))
>
> (sys.platform!="win32") is False for my winXP computer. But if I set the
> parameter shell=True, then the problem disappears, either in the py2exe
> executables and in the scripts. Everything works perfect as usual.
>
> if shell=True in windows the shell used is the one set in COMSPEC. In my
> case it is the normal windows shell (cmd.exe). If shell=False I am not sure
> what happens in windows. Obviously no cmd shell should be expected...still,
> blast.exe opens his.
> ...
> Done! (Although it would be nice someone could explain why the 'shell'
> parameter must be 'True' for the program to behave properly)...

From memory, using the shell argument with subprocess in this way was
a deliberate choice to get things to work cross platform. It looks like
when running from the command line you need the *opposite* shell
setting to running from py2exe.

Could you put together a trivial python GUI application that calls
BLAST that we could use for testing?

Peter

From biopython at maubp.freeserve.co.uk Fri Apr 17 17:43:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 18:43:44 +0100
Subject: [BioPython] Feedback from Biopython 1.50 beta?
In-Reply-To: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com>
References: <320fb6e00904131115v6a65a305w567edccd6406cb05@mail.gmail.com>
Message-ID: <320fb6e00904171043x3bc3ecb2s5be4ba52bec15076@mail.gmail.com>

On Mon, Apr 13, 2009 at 7:15 PM, Peter wrote:
> Dear Biopythoneers,
>
> There is a saying "no news is good news", but as per the title - can
> we have some feedback from the Biopython 1.50 beta release please?

Thanks everyone for your feedback so far.

The plan is to do the final release this weekend, so if anyone has some
last minute comments now is the time - even little things like reporting
typos in the Tutorial are worthwhile.

Thanks

Peter

From peter at maubp.freeserve.co.uk Fri Apr 17 17:48:56 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 18:48:56 +0100
Subject: [BioPython] Adding startswith and endswith methods to the Seq object
In-Reply-To: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
References: <320fb6e00904130647t7241422ejc4e7757275856c14@mail.gmail.com>
Message-ID: <320fb6e00904171048n68544510n285380f7efaef447@mail.gmail.com>

On Mon, Apr 13, 2009 at 2:47 PM, Peter wrote:
> Hi all,
>
> I've filed enhancement bug 2809 with a patch to add startswith and
> endswith methods to the Seq object,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2809
>
> I'm confident there are many possible use cases for this.
> ...
> Does this seem like a sensible addition to the Seq object? It is
> consistent with making the Seq object more like a python string.

For anyone not following the Bug or the dev mailing list, this has been
checked in and will be included with Biopython 1.50, and there is an
example using it in the new Tutorial.

Peter

From lueck at ipk-gatersleben.de Mon Apr 20 11:07:20 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 20 Apr 2009 13:07:20 +0200
Subject: [BioPython] blastall produces a black console screen with biopython 1.49 and up
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
 <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com>
 <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com>
Message-ID: <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>

Hi!

What does your py2exe setup.py file look like?

    from distutils.core import setup
    import py2exe

    # python setup.py py2exe

    setup(
        windows = [
            {
                "script": "nb_psst.py",              ### Main Python script
                "icon_resources": [(0, "Icon.ico")]  ### Icon to embed into the PE file.
            }
        ],
    )

If you use console instead of windows in setup(), will the black screen
disappear?

Kind regards
Stefanie

----- Original Message -----
From: "Peter"
To: "Joaquin Abian Monux" ; "BioPython Mailing List"
Sent: Friday, April 17, 2009 5:38 PM
Subject: Re: [BioPython]blastall produces a black console screen with biopython 1.49 and up

From biopython at maubp.freeserve.co.uk Mon Apr 20 15:22:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 16:22:02 +0100
Subject: [Biopython] [BioPython] Invitation for Biopython news coordinators
In-Reply-To: <20090414015301.GA80360@kunkel>
References: <20090406230542.GK43636@sobchak.mgh.harvard.edu>
 <1239493790.27790.184.camel@localhost.localdomain>
 <20090414015301.GA80360@kunkel>
Message-ID: <320fb6e00904200822l375c1a3crdab46b8043fdad49@mail.gmail.com>

On Tue, Apr 14, 2009 at 2:53 AM, Brad Chapman wrote:
> Having several people involved will work out well. When you are
> finished up with finals, check back in with the list and David and
> we can get you set up with whatever you need. Good luck with exams
> and thanks again for the message,

Our news server uses WordPress, and the default roles are defined here:
http://codex.wordpress.org/Roles_and_Capabilities

I propose to make any "News Coordinator" volunteers "Contributors",
meaning they can write and manage their own posts but not publish posts.
Initially that will need an OK from above ;)

Once we're happy with your work, we can upgrade you to an "Author",
meaning you can publish and manage your own posts independently.
Further down the line there is "Editor" (which will let you edit other
people's posts) or even "Admin" status...

So David and John, you should be able to register yourselves here
http://news.open-bio.org/news/wp-register.php and then drop me an
email and I'll upgrade you from the default "Subscriber" to "Contributor".

If that doesn't work, just email me directly with your contact details and
a suggested username (I'd go with "johnm" or "davidw", but any sensible
suggestion is fine - Brad just picked "brad").

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 20 19:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 20:02:18 +0100
Subject: [Biopython] Biopython 1.50 released
Message-ID: <320fb6e00904201202j4bb9666es18c89136ce973a48@mail.gmail.com>

Dear all,

We are pleased to announce Biopython release 1.50, featuring some
significant additions since Biopython 1.49 was released late last year.

GenomeDiagram by Leighton Pritchard has been integrated into Biopython
as the Bio.Graphics.GenomeDiagram module. A new module Bio.Motif has
been added, which is intended to replace the existing Bio.AlignAce and
Bio.MEME modules. Also have a look at Bio.SwissProt and Bio.ExPASy
and their revised parsers.

As noted in a previous news posting, Bio.SeqIO can now read and write
FASTQ and QUAL files used in second generation sequencing work. In
connection with this, our SeqRecord object has a new dictionary attribute,
letter_annotations, for per-letter-annotation information like sequence
quality scores or secondary structure predictions. Also, the SeqRecord
object can now be sliced to give a new SeqRecord covering just part of
the sequence.

Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6.
However, this is expected to be the final version to support Python 2.3
(see this previous announcement). Also, Biopython 1.50 should be the last
release to include our old deprecated parsing infrastructure (Martel and
Bio.Mindy).

We've also updated the Biopython Tutorial and Cookbook (also available
in PDF), and not just by adding our logo to the cover ;)

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Thank you to everyone who tested the Biopython 1.50 beta release, and
to all our contributors.

Source distributions and Windows installers are available from the
downloads page on the Biopython website:
http://biopython.org/wiki/Download

-Peter, on behalf of the Biopython developers

P.S. This news post is online at
http://news.open-bio.org/news/2009/04/biopython-release-150/

You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News

From chapmanb at 50mail.com Mon Apr 20 21:51:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 20 Apr 2009 17:51:07 -0400
Subject: [Biopython] Google Summer of Code at Biopython
Message-ID: <20090420215107.GA30529@sobchak.mgh.harvard.edu>

Biopython folks;
I am very happy to announce that Biopython has had two students accepted
for Google's Summer of Code (http://code.google.com/soc/):

- Nick Matzke will be working on modules for Biogeographical Phylogenetics

- Eric Talevich will be adding support for parsing and writing PhyloXML
  (http://www.phyloxml.org/)

You have likely seen Nick and Eric around the mailing lists; both
prepared excellent applications and project plans, navigating a
stringent selection process. We should expect to see much more of
them during the summer as they will be working full time on the
projects with generous support from Google.

I'd like to thank everyone who submitted Biopython proposals. We
received many great queries and proposals, and it is a shame more
could not have been included.
Many thanks are also due to Hilmar and all the folks at NESCent for
inviting Biopython to participate. We are looking forward to a great
summer, and beyond, for the projects.

Below are links to the NESCent main page, abstracts and full
proposals; I believe you need to sign in with a Google account to
see the full proposals:

http://socghop.appspot.com/org/home/google/gsoc2009/nescent

http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798969
http://socghop.appspot.com/student_proposal/show/google/gsoc2009/etal/t123854016039

http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798250
http://socghop.appspot.com/student_proposal/show/google/gsoc2009/nickmatzke/t123854590776

Congratulations again to Eric and Nick,
Brad

From biopython at maubp.freeserve.co.uk Mon Apr 20 22:15:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 23:15:51 +0100
Subject: [Biopython] Google Summer of Code at Biopython
In-Reply-To: <20090420215107.GA30529@sobchak.mgh.harvard.edu>
References: <20090420215107.GA30529@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904201515j1d2e855ue29665037a9edaa2@mail.gmail.com>

On Mon, Apr 20, 2009 at 10:51 PM, Brad Chapman wrote:
> Biopython folks;
> I am very happy to announce that Biopython has had two students accepted
> for Google's Summer of Code (http://code.google.com/soc/):

Cool :)

Congratulations Eric, Nick and Brad for stepping up to mentor on the
Biopython side.

I think that warrants a news post... John or David are you up for this?

Peter

From lpritc at scri.ac.uk Tue Apr 21 07:40:37 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Apr 2009 08:40:37 +0100
Subject: [Biopython] Google Summer of Code at Biopython
In-Reply-To: <20090420215107.GA30529@sobchak.mgh.harvard.edu>
Message-ID:

Congratulations to Nick and Eric!

L.

On 20/04/2009 22:51, "Brad Chapman" wrote:

> Biopython folks;
> I am very happy to announce that Biopython has had two students accepted
> for Google's Summer of Code (http://code.google.com/soc/):
>
> - Nick Matzke will be working on modules for Biogeographical Phylogenetics
>
> - Eric Talevich will be adding support for parsing and writing PhyloXML
> (http://www.phyloxml.org/)

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From gatoygata at hotmail.com Tue Apr 21 17:15:21 2009
From: gatoygata at hotmail.com (Joaquin Abian Monux)
Date: Tue, 21 Apr 2009 17:15:21 +0000
Subject: [Biopython] [BioPython]blastall produces a black console screen with biopython 1.49 and up
In-Reply-To: <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
 <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com>
 <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com>
 <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
Message-ID:

Hi Stefanie,

It is a normal, minimal setup.py for windows and single file executables:

    # exWx/setup.py
    from distutils.core import setup
    import py2exe

    setup(
        windows=[
            {'script': "blast_main_205.pyw",
             'icon_resources': [(0, 'blast.ico')]
            }
        ],
        options={
            'py2exe': {
                #'packages' : [],
                #'includes': [],
                'excludes': ['Tkconstants', 'Tkinter', 'tcl'],
                'ignores': ['wxmsw26uh_vc.dll'],
                'dll_excludes': ['libgdk_pixbuf-2.0-0.dll',
                                 'libgdk-win32-2.0-0.dll',
                                 'libgobject-2.0-0.dll'],
                'compressed': 1,
                'optimize': 2,
                'bundle_files': 1
            }
        },
        zipfile=None,
        data_files=[]
    )

I think I understand the logic of your question. I will try with 'console'
and let you know what happens. Here at work I have 1.48 installed, and
1.50 at home.

Currently, with biopython 1.50, I can produce py2exe single file
executables of GUI programs based on wxpython that work perfectly after:

a) I set shell=True in the subprocess call in _invoke_blast
   (NCBIStandalone.py). Otherwise I have the Blast.exe console popping up
   when I make a search.

b) I modify a line in the addRule function in spark.py:
   rules = doc.split() ---to--> rules = doc.split() if doc else []
   Otherwise the executable is produced but raises an exception when run,
   saying that it cannot split 'None'.
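
A note on point b), offered as a guess rather than something confirmed in this thread: the setup.py above sets 'optimize': 2, which py2exe describes as the equivalent of running python -OO, and -OO strips docstrings (the .pyo files in the earlier traceback fit this). If spark.py builds its rules from method docstrings, every such docstring would then arrive in addRule as None. The sketch below reproduces the guard in isolation; split_rule_doc, collect_rules, DemoParser and its p_ methods are hypothetical stand-ins for illustration, not the actual SPARK or Biopython code.

```python
def split_rule_doc(doc):
    """Return the whitespace-separated tokens of a rule docstring.

    Under ``python -OO`` docstrings are stripped, so ``doc`` may be None;
    in that case an empty list is returned instead of raising.
    """
    return doc.split() if doc else []


class DemoParser(object):
    """Hypothetical parser whose grammar rules live in method docstrings."""

    def p_expr(self, args):
        "expr ::= expr + term"
        return args

    def p_stripped(self, args):
        # Deliberately no docstring: this is what every method looks
        # like once docstrings have been stripped.
        return args


def collect_rules(parser):
    """Map each p_* method name to the tokens found in its docstring."""
    rules = {}
    for name in dir(parser):
        if name.startswith("p_"):
            rules[name] = split_rule_doc(getattr(parser, name).__doc__)
    return rules


if __name__ == "__main__":
    print(collect_rules(DemoParser()))
```

If that guess is right, the guard only avoids the crash at import time: a docstring-driven parser built this way ends up with no rules at all, so code that actually depends on those parsers would presumably still need 'optimize' set to 0 or 1.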

Best regards

Joaquin

From lueck at ipk-gatersleben.de Wed Apr 22 08:15:14 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 22 Apr 2009 10:15:14 +0200
Subject: [Biopython] [BioPython]blastall produces a black console screen with biopython 1.49 and up
References: <320fb6e00904170514m56ff6930x5b3158ad94a5afd9@mail.gmail.com>
 <320fb6e00904170736j44e0a9e1n2171b7a7310303eb@mail.gmail.com>
 <320fb6e00904170838y2a5cfd8fud5a7cd31f1384412@mail.gmail.com>
 <008701c9c1a8$2cc0f740$1022a8c0@ipkgatersleben.de>
Message-ID: <03b601c9c322$76b94ed0$1022a8c0@ipkgatersleben.de>

Hi!

Sorry, I mixed things up. It should be windows (as in your setup.py) and
not console to get no black screen!

Sorry again, I think I need holidays ;-)

Stefanie

----- Original Message -----
From: Joaquin Abian Monux
To: lueck at ipk-gatersleben.de ; biopython at maubp.freeserve.co.uk ; biopython at lists.open-bio.org
Sent: Tuesday, April 21, 2009 7:15 PM
Subject: RE: [BioPython]blastall produces a black console screen with biopython 1.49 and up

Hi Stefanie,

It is a normal, minimal setup.py for windows and single file executables:

# exWx/setup.py
from distutils.core import setup
import py2exe

setup(
    windows=[
        {'script': "blast_main_205.pyw",
         'icon_resources': [(0, 'blast.ico')]
        }
    ],
    options={
        'py2exe': {
            #'packages' : [],
            #'includes': [],
            'excludes': ['Tkconstants', 'Tkinter', 'tcl'],
            'ignores': ['wxmsw26uh_vc.dll'],
            'dll_excludes': ['libgdk_pixbuf-2.0-0.dll',
                             'libgdk-win32-2.0-0.dll',
                             'libgobject-2.0-0.dll'],
            'compressed': 1,
            'optimize': 2,
            'bundle_files': 1
        }
    },
    zipfile=None,
    data_files=[]
)

I think I understand the logic of your question. I will try with 'console'
and let you know what happens.

Here at work I have 1.48 installed and 1.50 at home. Currently, with
biopython 1.50, I can produce py2exe single file executables of GUI programs
based on wxpython that work perfectly after:

a) I set shell=True in the subprocess call in _invoke_blast
(NCBIStandalone.py). Otherwise I have the Blast.exe console popping up when
I make a search.

b) I modify a line in the addRule function in spark.py:
rules = doc.split() ---to--> rules = doc.split() if doc else [].
Otherwise the executable is produced but raises an exception when run,
saying that it cannot split 'None'.

Best regards

Joaquin

> From: lueck at ipk-gatersleben.de
> To: biopython at maubp.freeserve.co.uk; gatoygata at hotmail.com; biopython at lists.open-bio.org
> Subject: Re: [BioPython]blastall produces a black console screen with biopython 1.49 and up?
> Date: Mon, 20 Apr 2009 13:07:20 +0200
>
> Hi!
>
> How does your py2exe setup.py file look?
>
> from distutils.core import setup
> import py2exe
>
> #python setup.py py2exe
>
> setup(
>     windows = [
>         {
>             "script": "nb_psst.py",              ### Main Python script
>             "icon_resources": [(0, "Icon.ico")]  ### Icon to embed into the PE file.
>         }
>     ],
> )
>
> If you use console instead of windows in setup() will the black screen
> disappear?
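
A note for readers of the thread above about the console window that appears when blastall is launched from a GUI program on Windows. The following is only a sketch in modern Python 3, not the code in NCBIStandalone.py: it shows one commonly used way to ask Windows not to show a window for a child process, using the standard library's STARTUPINFO support, instead of toggling the shell argument. The helper names popen_kwargs_without_console and run_quietly are made up for this example, and whether this cures the py2exe case discussed above is an untested assumption.

```python
import subprocess
import sys


def popen_kwargs_without_console():
    """Return extra Popen keyword arguments that hide the child's window.

    On non-Windows platforms nothing special is needed, so the dict is empty.
    """
    kwargs = {}
    if sys.platform == "win32":
        # STARTUPINFO, STARTF_USESHOWWINDOW and SW_HIDE only exist on Windows.
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
        kwargs["startupinfo"] = startupinfo
    return kwargs


def run_quietly(args):
    """Run a command given as a list; return (returncode, stdout, stderr)."""
    process = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=False,  # no intermediate cmd.exe
        **popen_kwargs_without_console()
    )
    # communicate() drains both pipes, so the child cannot block on output.
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr


# Demonstration with the Python interpreter itself standing in for blastall.
returncode, stdout, stderr = run_quietly([sys.executable, "-c", "print('ok')"])
```

In a real program the argument list would name the BLAST executable and its options; the interpreter is used here only so the sketch runs anywhere.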

From chapmanb at 50mail.com Wed Apr 22 13:09:53 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 22 Apr 2009 09:09:53 -0400
Subject: [Biopython] [BioPython] JP (Jarvis Patrick) Clustering
In-Reply-To: <49EDD55E.5010801@tu-bs.de>
References: <49DA25E6.2060606@tu-bs.de> <20090406215725.GG43636@sobchak.mgh.harvard.edu> <49EDD55E.5010801@tu-bs.de>
Message-ID: <20090422130953.GB34546@sobchak.mgh.harvard.edu>

Hi Florian;
[Moving to Biopython list]

> >> Does anybody know an (open source) clustering package containing the
> >> Jarvis Patrick clustering algorithm?
> >>
> > Here is a version in R and C:
> > http://rguha.net/code/R/#jp
[...]
> Do you know how to do the 'rpy calls '
>
> >dyn.load('jpc.so')
> >source('jpc.R')
> >clus <- jpc(dat, j=3, k=1, diss=FALSE)
>
> for executing the script?

Sure, here is how I would do it with python and rpy2. This builds a
random array of 10 items to cluster, each with 4 points, and then runs
the JPC clustering algorithm from Rajarshi's page on them. At the end,
it shows how to extract the clustered indexes from the results.

Hope this helps,
Brad

import numpy
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

robjects.r('''
dyn.load('jpc.so')
source('jpc.R')
''')

num_groups = 10
data = numpy.random.random((num_groups, 4))
cluster = robjects.r.jpc(data, j=3, k=1, diss=False)
for gindex in range(num_groups):
    print 'Group ID:', gindex, 'Cluster ID:', cluster[0][1][gindex]
print cluster

From winda002 at student.otago.ac.nz Thu Apr 23 02:14:31 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 23 Apr 2009 14:14:31 +1200
Subject: [Biopython] main page on wiki
Message-ID: <49EFCF07.2050502@student.otago.ac.nz>

Hi all,

As you probably know the main page of the wiki
(http://biopython.org/wiki/Main_Page) is the first place someone washes
up when they google 'biopython'.
As part of this "news coordinator" idea I have made an alternative version of the main page (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more as a "portal" for the wiki/project. This is born from my own experience with the wiki as a newcomer; it took me a long time to cotton on to the fact there was a navigation box on each page so I didn't realise what the website had to offer (this may say more about me than the design of the front page). Which version would you like to see as the main page? Obviously this isn't an either-or thing, my 'mock-up' version can be edited by anyone with an account on the wiki (the main page is protected for obvious reasons) so any ideas that you have can be incorporated to that one (older versions of the page are all saved so you can edit as bravely as you like). Thanks, David From biopython at maubp.freeserve.co.uk Thu Apr 23 09:16:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:16:44 +0100 Subject: [Biopython] [Biopython-dev] main page on wiki In-Reply-To: <49EFE553.6070405@gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> Message-ID: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> On Thu, Apr 23, 2009 at 4:49 AM, Iddo Friedberg wrote: > I second Sebastian on the icons, and third Sebastian and Alex on preferring > David's take on a main page. Are you all looking at the *current* home page which already has a few of David's suggestions (in particular the news feed on the right), or the old version from memory? Also, what size screens do you all have? It should ideally look OK on small screens or windows (e.g. 1024 by 768 is what my laptop uses, which isn't that old). From playing with my window size, it should be OK - the proposed layout seems quite flexible :) If there are no counter comments, I'll put David's changes up later today or tomorrow. 

Peter

From biopython at maubp.freeserve.co.uk Thu Apr 23 14:22:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 15:22:17 +0100
Subject: [Biopython] Ace support in Bio.SeqIO and/or Bio.AlignIO?
Message-ID: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com>

How do you picture a contig in an ACE assembly file? As a single
record with read data annotations (e.g. as a SeqRecord), or as a
sequence alignment with a consensus (e.g. as an Alignment object)? I
suspect the answer is "it depends", and that both are useful.

Currently we use Bio.Sequencing.Ace in Bio.SeqIO to turn each contig
into a SeqRecord. Now that we have per-letter-annotation support in
the SeqRecord, this code could be updated to record the consensus base
quality (BQ lines). We could also record the supporting reads (RD
lines), maybe as SeqFeature objects.

Recently David put together an example on the wiki using
Bio.Sequencing.Ace to build an alignment, which we could use as a basis
for supporting Ace files in Bio.AlignIO as alignments:
http://biopython.org/wiki/ACE_contig_to_alignment

What do people think? I should be able to try out David's code on
some real world ACE files from Newbler (i.e. 454Contigs.ace files)...

Peter

From biopython at maubp.freeserve.co.uk Thu Apr 23 15:47:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 16:47:28 +0100
Subject: [Biopython] Fwd: [BioPython] Clustalw Problems
In-Reply-To: <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com>
References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com> <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com>
Message-ID: <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com>

Forwarding to the mailing list - I'll reply soon.

Peter

---------- Forwarded message ----------
From: Bradley Hintze
Date: Thu, Apr 23, 2009 at 4:06 PM
Subject: Re: [BioPython] Clustalw Problems
To: Peter

Peter,

Sorry that it has taken so long to reply..school.
I am still having issues with the alignments. I have BioPython 1.50

I tried to do what you suggested and got the following:

>>> from Bio.Clustalw import MultipleAlignCL
>>> from Bio.Clustalw import do_alignment
>>> cline=MultipleAlignCL(r'C:\Bradley_BioPython\mtr4.fasta',r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe')
>>> cline.set_output(r'C:\Bradley_BioPython\test.aln')
>>> print cline
C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln
>>> al=do_alignment('C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 120, in do_alignment
    % str(command_line))
ValueError: Bad command line option in the command: C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython    est.aln

When I try running 'cline' I get this:

>>> al=do_alignment(cline)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
    % command_line.sequence_file)
IOError: Cannot open sequence file C:\Bradley_BioPython\mtr4.fasta

Any ideas?

On Wed, Apr 8, 2009 at 3:50 PM, Peter wrote:
>
> On 4/8/09, Bradley Hintze wrote:
> > Hi,
> >
> > I am having a hard time running an alignment. I am running in windows and
> > here is my code and the error message that I get after running do_alignment.
> >
> > >>> import os
> > >>> from Bio.Clustalw import MultipleAlignCL
> > >>> from Bio.Clustalw import do_alignment
> > >>> cline=MultipleAlignCL(r"C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
> > >>> cline.set_output(r"C:\Documents and Settings\students\Desktop\Foo\test.aln")
> > >>> al=do_alignment(cline)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
> >     % command_line.sequence_file)
> > IOError: Cannot open sequence file C:\Documents and Settings\student\Desktop\Foo\mtr4.fasta
> >
> > When I open the file using o=open('C:\Documents and
> > Settings\student\Desktop\Foo\mtr4.fasta') it works fine.
> >
> > Any ideas?
>
> As a general tip, try this to see what the command Biopython is trying
> to run is:
>
> >>> print cline
>
> Then try running the same command by hand at the command prompt (DOS
> prompt), and make sure it works.
>
> I can tell from the error message you have Python 2.5, but what
> version of Biopython do you have?
>
> I'm not at a Windows machine to check, but it is generally a good idea
> to avoid file names and paths with spaces where you can. In this
> case, I'm sure relative names would be fine:
>
> >>> import os
> >>> from Bio.Clustalw import MultipleAlignCL
> >>> from Bio.Clustalw import do_alignment
> >>> cline=MultipleAlignCL("mtr4.fasta", r"C:\Program Files\clustalw1.83.XP\clustalw.exe")
> >>> cline.set_output("test.aln")
>
> Peter

--
Bradley J.
Hintze
Biochemistry Undergraduate
Utah State University
801-712-8799

From biopython at maubp.freeserve.co.uk Thu Apr 23 15:59:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 16:59:23 +0100
Subject: [Biopython] [BioPython] Clustalw Problems
In-Reply-To: <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com>
References: <3933e78c0904081308p78e049b1i4562857c1ad06df4@mail.gmail.com> <320fb6e00904081450y6a2bcdc2jce7935c543af9b8b@mail.gmail.com> <3933e78c0904230806v77987c19h7221f4943236d82c@mail.gmail.com> <320fb6e00904230847s6e9b60ediea53d4a8624787b4@mail.gmail.com>
Message-ID: <320fb6e00904230859v4d3c9860kc7e5f574afbcbe9a@mail.gmail.com>

Bradley Hintze wrote:
>
> Peter,
>
> Sorry that it has taken so long to reply..school.
> I am still having issues with the alignments. I have BioPython 1.50

OK, it is good that you are on the latest release already :)

By the way it is "Biopython", not "BioPython" ;)

> I tried to do what you suggested and got the following:
>
> >>> from Bio.Clustalw import MultipleAlignCL
> >>> from Bio.Clustalw import do_alignment
> >>> cline=MultipleAlignCL(r'C:\Bradley_BioPython\mtr4.fasta',r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe')
> >>> cline.set_output(r'C:\Bradley_BioPython\test.aln')
> >>> print cline
> C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln

Did you try running this "by hand" at the windows command prompt?
From the Windows start menu, pick "run", then enter "cmd.exe". Then
paste in this command (from memory I think you need to right click on
the icon in the top left of this window to get the paste menu option).
I am expecting it to say "Bad command line option in the command ..."

> >>> al=do_alignment('C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 120, in do_alignment
>     % str(command_line))
> ValueError: Bad command line option in the command: C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython    est.aln

You have used '\t' in this string, which means it was treated as a tab.
Instead use \\t, or raw mode as you did earlier for the filenames:

al=do_alignment(r'C:\Bradley_BioPython\clustalw1.83.XP\clustalw.exe -INFILE=C:\Bradley_BioPython\mtr4.fasta -OUTFILE=C:\Bradley_BioPython\test.aln')

The text "Bad command line option in the command" tells me ClustalW
returned error code 1, but this makes sense due to the tab.

> When I try running 'cline' I get this:
>
> >>> al=do_alignment(cline)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 124, in do_alignment
>     % command_line.sequence_file)
> IOError: Cannot open sequence file C:\Bradley_BioPython\mtr4.fasta

This means ClustalW returned error code 2 (which should mean it can't
find your input file). Are you sure the path is correct? Try:

import os
print os.path.isfile(r'C:\Bradley_BioPython\mtr4.fasta')

Peter

From winda002 at student.otago.ac.nz Fri Apr 24 01:41:37 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 24 Apr 2009 13:41:37 +1200
Subject: [Biopython] Ace support in Bio.SeqIO and/or Bio.AlignIO?
In-Reply-To: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com>
References: <320fb6e00904230722o398ff192u7792d73562b7e4f1@mail.gmail.com>
Message-ID: <49F118D1.7070701@student.otago.ac.nz>

Peter wrote:
> How do you picture a contig in an ACE assembly file? As a single
> record with read data annotations (e.g. as a SeqRecord), or as a
> sequence alignment with a consensus (e.g. as an Alignment object)?
> I suspect the answer is "it depends", and that both are useful
>

Yup, I usually want to trust the assembler and treat them as sequences
(and having the option of annotations would be great for these cases)
but sometimes I want to pull them apart and look inside.

> Recently David put together an example on the wiki using
> Bio.Sequencing.Ace to build an alignment, which we could use as a basis
> for supporting Ace files in Bio.AlignIO as alignments:
> http://biopython.org/wiki/ACE_contig_to_alignment
>
> What do people think? I should be able to try out David's code on
> some real world ACE files from Newbler (i.e. 454Contigs.ace files)..

The wiki example is based on a script that I use with contigs assembled
by newbler and mira (http://www.chevreux.org/projects_mira.html), so it
should be OK (for the wiki example I thought I'd use a file that everyone
with biopython has). I'm sure there are much prettier ways of doing what
it does (e.g. using the new SeqRecord annotations to hold the clipping
masks?).

If people want it to be part of biopython I'm happy to provide what help
I can with it.

From marco at gallotta.co.za Fri Apr 24 20:57:12 2009
From: marco at gallotta.co.za (Marco Gallotta)
Date: Fri, 24 Apr 2009 22:57:12 +0200
Subject: [Biopython] Clustalw Hangs on Python 2.6
In-Reply-To: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com>
References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com>
Message-ID: <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>

Hi

I recently upgraded to Python 2.6 (from 2.5) and this seems to have
revealed a potential bug in biopython. I'm using biopython to run
clustalw and after the upgrade it just hangs. I discovered that it was
hanging on a write to stdout. The best reason I could determine for
this behaviour was that Python's subprocess module (which biopython
uses to spawn clustalw) was piping stdout to biopython, which wasn't
reading it.

The code I used to call clustalw:

cline = MultipleAlignCL(blast_results_file)
cline.set_output(alignment_file, output_type = format.upper(), output_order = "INPUT")
Clustalw.do_alignment(cline)

I was able to get it working by changing the arguments to the
subprocess module to pipe to /dev/null as in the attached patch.
Unfortunately this approach only works on Linux. If there is a better
fix, or perhaps I'm calling clustalw incorrectly, please do let me
know.

Thanks
Marco

--
Marco Gallotta
MSc Student | SACO Scientific Committee | ACM ICPC Coach
Department of Computer Science, University of Cape Town
people.cs.uct.ac.za/~mgallott | marco-za.blogspot.com
marco AT gallotta DOT co DOT za | 073 170 4444 | 021 552 2731
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_clustalw.patch
Type: text/x-diff
Size: 507 bytes
Desc: not available
URL:

From biopython at maubp.freeserve.co.uk Fri Apr 24 21:47:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 22:47:24 +0100
Subject: [Biopython] Clustalw Hangs on Python 2.6
In-Reply-To: <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>
References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com>
Message-ID: <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com>

On 4/24/09, Marco Gallotta wrote:
> Hi
>
> I recently upgraded to Python 2.6 (from 2.5) and this seems to have
> revealed a potential bug in biopython. I'm using biopython to run
> clustalw and after the upgrade it just hangs. I discovered that it was
> hanging on a write to stdout. The best reason I could determine for
> this behaviour was that Python's subprocess module (which biopython
> uses to spawn clustalw) was piping stdout to biopython, which wasn't
> reading it.
Hi, You didn't say what version of Biopython you are using - Bug 2804 was fixed in Biopython 1.50 which sounds possibly related: http://bugzilla.open-bio.org/show_bug.cgi?id=2804 Peter From marco at gallotta.co.za Fri Apr 24 22:08:51 2009 From: marco at gallotta.co.za (Marco Gallotta) Date: Sat, 25 Apr 2009 00:08:51 +0200 Subject: [Biopython] Clustalw Hangs on Python 2.6 In-Reply-To: <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com> <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> Message-ID: <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> On Fri, Apr 24, 2009 at 11:47 PM, Peter wrote: > You didn't say what version of Biopython you are using - Bug 2804 was > fixed in Biopython 1.50 which sounds possibly related: > http://bugzilla.open-bio.org/show_bug.cgi?id=2804 I was using 1.49. Upgrading to 1.50 solved the problem. Thanks! 
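
For readers hitting a similar hang: the general failure mode Marco describes (a child process blocking on a write to a stdout pipe that the parent never reads) can be avoided by draining the pipe, which works on Windows as well as Linux. The snippet below is a minimal Python 3 sketch of that idea using Popen.communicate(). It is not the fix that went into Biopython 1.50 for Bug 2804, and the helper name run_and_drain is made up for this example.

```python
import subprocess
import sys


def run_and_drain(args):
    """Run a command and collect all of its output; return (returncode, output).

    communicate() keeps reading until the child closes its end of the pipe,
    so the child never stalls on a full pipe buffer.
    """
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same pipe
    )
    output, _ = process.communicate()
    return process.returncode, output


# A child that writes far more than a typical pipe buffer holds. With a
# plain process.wait() and no reads, a child like this could block forever.
child_code = "import sys; sys.stdout.write('x' * 200000)"
returncode, output = run_and_drain([sys.executable, "-c", child_code])
```

The Python interpreter stands in for clustalw here so that the example runs without any external tools.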
Marco -- Marco Gallotta MSc Student | SACO Scientific Committee | ACM ICPC Coach Department of Computer Science, University of Cape Town people.cs.uct.ac.za/~mgallott | marco-za.blogspot.com marco AT gallotta DOT co DOT za | 073 170 4444 | 021 552 2731 From biopython at maubp.freeserve.co.uk Sat Apr 25 12:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Apr 2009 13:10:33 +0100 Subject: [Biopython] Clustalw Hangs on Python 2.6 In-Reply-To: <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> References: <68cbba1d0904241354v5d965d6ep176a51b9fc356d4f@mail.gmail.com> <68cbba1d0904241357r4cd3bae4maeea1e03e2243829@mail.gmail.com> <320fb6e00904241447u69343587lb2c83eecb78468ab@mail.gmail.com> <68cbba1d0904241508rf6119bek50c5dabd826ee374@mail.gmail.com> Message-ID: <320fb6e00904250510y747118c2le71a34fb667737e6@mail.gmail.com> On 4/24/09, Marco Gallotta wrote: > > On Fri, Apr 24, 2009 at 11:47 PM, Peter wrote: > > You didn't say what version of Biopython you are using - Bug 2804 > > was fixed in Biopython 1.50 which sounds possibly related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2804 > > I was using 1.49. Upgrading to 1.50 solved the problem. Thanks! > > Marco Great :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 09:58:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 10:58:52 +0100 Subject: [Biopython] [Biopython-dev] main page on wiki In-Reply-To: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> Message-ID: <320fb6e00904270258s523c49a1j1bfc5d4a12ca86a9@mail.gmail.com> On Thu, Apr 23, 2009 at 10:16 AM, Peter wrote: > > If there are no counter comments, I'll put David's changes up later > today or tomorrow. 
>

OK - make that a couple of days later ;)

This isn't exactly as in David's draft - I shortened some of the link
text and omitted a couple of links under "Contribute" which seemed
unnecessary on the home page.

I've also kept the final line giving the latest release and date
(although the text is shorter now). Brad commented (off list?) that
having this is a good indicator of the project's activity, and I
agree. Alternatively, I'd like to try having dates on the news feed,
but the media wiki plugin needs to be updated for that to work...

Peter

From lueck at ipk-gatersleben.de Mon Apr 27 10:34:28 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 27 Apr 2009 12:34:28 +0200
Subject: [Biopython] Parsing large blast files
Message-ID: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>

Hi!

I want to blast many sequences against one DB and parse the outputs.
At the moment, I do it this way:

# Imports needed by the code below (Biopython 1.50 module names)
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML

def blast_a_record(fasta_rec):
    open("to_blast.fasta", "w").write(str(fasta_rec))
    f = open('out.txt', 'a')
    my_blast_db = "\"G:\RNAiscan\\barleyv9\""
    my_blast_file = "G:\\RNAiscan\\to_blast.fasta"
    my_blast_exe = "G:\\RNAiscan\\blastall.exe"
    result_handle, error_info = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
    blast_results = result_handle.read()
    save_file = open("my_blast.xml", "w")
    save_file.write(blast_results)
    save_file.close()
    result = open("my_blast.xml")
    blast_records = NCBIXML.parse(result)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # extract xml data from blast
                percent = float(100) * float(hsp.score) / float(len(fasta_rec))
                percent = round(percent, 0)
                if percent > 99.99:
                    primer_name = str(alignment.hit_def)
                    primer_length = str(alignment.length)
                    f.write(str(percent) + str(alignment.hit_def) + '\n')
    f.close()

def start_blast():
    handle = open("G:\RNAiscan\est.fasta", 'r')
    data = handle.readlines()
    for seq_record in data:
        rec = seq_record
        first_blast_hit = blast_a_record(rec)
    handle.close()

start_blast()

This works but I think it's quite slow. I also tried the
NCBIStandalone.Iterator() code from the tutorial but I got the error
message "Invalid header". Would NCBIStandalone.Iterator() be faster?

Or, is there a way to avoid saving an xml file, or to save only the best
hits (100 % match)?

Kind regards
Stefanie

From p.j.a.cock at googlemail.com Mon Apr 27 10:54:09 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 27 Apr 2009 11:54:09 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>

On Mon, Apr 27, 2009 at 11:34 AM, Stefanie Lück wrote:
> Hi!
>
> I want to blast many sequences against one DB and parse the outputs.
> At the moment, I do it in that way:
>
> ...
>
> This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator()
> code from the tutrorial but I got the error message "Invalid header".
> Would NCBIStandalone.Iterator() be faster?

NCBIStandalone.Iterator() is the old semi-obsolete plain text parser -
it won't parse the XML output, hence the "Invalid header" error. Maybe
the tutorial (or the error message) could be clearer.

>
> Or, is there a way not to save a xml file or to save only the best hits
> (100 % match)?
>

You could set the expectation threshold (I don't think there is an
identity threshold which would be ideal for your example).

If you only want the single BEST hit for a query, set the number of
alignments and/or descriptions to show to just one (these do different
things in the plain text output - maybe for XML output you only need
to limit the number of alignments). This should give a much smaller
file, which will be fast to parse.

Finally, and perhaps most importantly - don't do an individual BLAST
query for each record.
Instead, prepare a FASTA file of ALL your queries, and use that as the
input to BLAST. This way there is only one command line call, and the
BLAST database is only loaded into memory once.

Peter

From lueck at ipk-gatersleben.de Tue Apr 28 08:23:02 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 28 Apr 2009 10:23:02 +0200
Subject: [Biopython] Parsing large blast files
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>
Message-ID: <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>

Thanks Peter!

>You could set the expectation threshold (I don't think there is an
>identity threshold which would be ideal for your example).

I can't say what the expectation threshold will be. This won't work.

>If you only want the single BEST hit for a query, set the number of
>alignments and/or descriptions to show to just one (these do different
>things in the plain text output - maybe for XML output you only need
>to limit the number of alignments). This should give a much smaller
>file, which will be fast to parse.

This is too risky. There might be several 100 % hits which I need.

>Finally, and perhaps most importantly - don't do an individual BLAST
>query for each record. Instead, prepare a FASTA file of ALL your
>queries, and use that as the input to BLAST. This way there is only
>one command line call, and the BLAST database is only loaded into
>memory once.

Cool, I didn't know that this would work! Great, that's very nice! 50 %
time speed up!

Thanks Peter and have a nice day!
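
For readers following the thread, the "one FASTA file for all queries" advice can be sketched roughly as follows. This is plain Python 3 with no Biopython dependency, so it runs as is; the function names write_combined_fasta and is_perfect_full_length and the three example sequences are invented for illustration. The second helper is one possible way to pick out the "100 % hits" afterwards, assuming the identity count, alignment length and query length of each HSP are available from the parsed results (for example via hsp.identities, hsp.align_length and blast_record.query_letters when using NCBIXML).

```python
import os
import tempfile


def write_combined_fasta(records, path):
    """Write (name, sequence) pairs to one FASTA file and return the count."""
    count = 0
    with open(path, "w") as handle:
        for name, sequence in records:
            handle.write(">%s\n" % name)
            # Wrap sequence lines at 60 characters, a common FASTA convention.
            for start in range(0, len(sequence), 60):
                handle.write(sequence[start:start + 60] + "\n")
            count += 1
    return count


def is_perfect_full_length(identities, align_length, query_length):
    """True when the whole query is aligned and every position matches."""
    return identities == align_length == query_length


# Hypothetical 21 bp queries, made up for illustration only.
queries = [
    ("query_1", "ACGTACGTACGTACGTACGTA"),
    ("query_2", "TTGACCGATAGCTAGCTAGCA"),
    ("query_3", "GGCATCGATCGATTAGCTAGC"),
]
fasta_path = os.path.join(tempfile.mkdtemp(), "all_queries.fasta")
written = write_combined_fasta(queries, fasta_path)
```

The resulting file would then be passed to blastall once, and the XML output parsed in a single loop over the records, one record per query.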

Stefanie

From p.j.a.cock at googlemail.com Tue Apr 28 08:33:52 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 09:33:52 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com>

On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what the expectation threshold will be. This won't work.

You still might be able to reduce it from the default of 10.0, maybe even
just to 1.0, without losing the very high identity matches you want.

>> If you only want the single BEST hit for a query, set the number of
>> alignments and/or descriptions to show to just one (these do different
>> things in the plain text output - maybe for XML output you only need
>> to limit the number of alignments). This should give a much smaller
>> file, which will be fast to parse.
>
> This is too risky. There might be several 100 % hits which I need.

If you expect and want several hits per query, then my suggestion is
inappropriate.

>> Finally, and perhaps most importantly - don't do an individual BLAST
>> query for each record. Instead, prepare a FASTA file of ALL your
>> queries, and use that as the input to BLAST. This way there is only
>> one command line call, and the BLAST database is only loaded into
>> memory once.
>
> Cool, I didn't know that this would work! Great, that's very nice! 50 %
> time speed up!

Only a 50% time speed up? i.e. It took half the time? Not bad,
although I expected more.
It will probably depend on the number of
queries, their sizes, and the database - probably the speed up would
be more for a larger database like NR.

Peter

From lueck at ipk-gatersleben.de Tue Apr 28 10:05:30 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 28 Apr 2009 12:05:30 +0200
Subject: [Biopython] Parsing large blast files
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de> <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com>
Message-ID: <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>

Hi Peter!

I'll play a little bit with the thresholds, and also with the short query
parameters
(http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node74.html)
which I actually need (nt = 21 bp). Of course, e = 1000 makes it even
slower.

>Only a 50% time speed up? i.e. It took half the time? Not bad,
>although I expected more. It will probably depend on the number of
>queries, their sizes, and the database - probably the speed up would
>be more for a larger database like NR.

I blast ~3000 queries against the tigr barley v9 DB (50500 subjects). It
takes about 35 seconds with XP, E8400 (3 GHz), 4 GB RAM. Hope this is
normal...

Kind regards
Stefanie

----- Original Message -----
From: "Peter Cock"
To: "Stefanie Lück"
Cc:
Sent: Tuesday, April 28, 2009 10:33 AM
Subject: Re: [Biopython] Parsing large blast files

On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what will be the expectation treshold. This won't work.

Still might be able to reduce it from the default of 10.0, maybe even
just to 1.0, without loosing the very high identity matches you want.
>> If you only want the single BEST hit for a query, set the number of >> alignments and/or descriptions to show to just one (these do different >> things in the plain text output - maybe for XML output you only need >> to limit the number of alignments). This should give a much smaller >> file, which will be fast to parse. > > This is to risky. There might be several 100 % hits which I need. If you expect and want several hits per query, then my suggestion is in appropriate. >> Finally, and perhaps most importantly - don't do an individual BLAST >> query for each record. Instead, prepare a FASTA file of ALL your >> queries, and use that as the input to BLAST. This way there is only >> one command line call, and the BLAST database is only loaded into >> memory once. > > Cool, I didn't know that this will work! Great, that's very nice! 50 % > time > speed up! Only a 50% time speed up? i.e. It took half the time? Not bad, although I expected more. It will probably depend on the number of queries, their sizes, and the database - probably the speed up would be more for a larger database like NR. 
Peter From lueck at ipk-gatersleben.de Tue Apr 28 10:13:54 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 28 Apr 2009 12:13:54 +0200 Subject: [Biopython] [BioPython] BLAST subprocess problem with a GUI References: <320fb6e00901280945p32eff05by64d8a42d576f76cc@mail.gmail.com> <002d01c98509$1dc1cd90$1022a8c0@ipkgatersleben.de> <320fb6e00902020216v231729dcm5d7e3ccdd3459ad4@mail.gmail.com> <007301c98aca$0a2084e0$1022a8c0@ipkgatersleben.de> <320fb6e00902090806k3ed4f286r5f2208801ca207ec@mail.gmail.com> <001e01c9971d$7a8d5be0$1022a8c0@ipkgatersleben.de> <320fb6e00902250059v1afd152ex51bc4439a34441c4@mail.gmail.com> Message-ID: <001701c9c7ea$0946cf40$1022a8c0@ipkgatersleben.de> A short info how I solved the problem: I just do this what usually the EmbossWin Installer does via Python: http://www.interactive-biosoftware.com:80/embosswin/install.html And after I copy the important files (eprimer3...) to the directory. This works quite well and there's no need to install EmbossWin (in case that is dissapear from the server, I don't liked this solution). Regards Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Wednesday, February 25, 2009 10:59 AM Subject: Re: [BioPython] BLAST subprocess problem with a GUI On Wed, Feb 25, 2009 at 7:48 AM, Stefanie L?ck wrote: > I tried this but after compilation/installation Primer3 gives no output, > only a return code of -1073741515 (but no errors or messages)... The error number -1073741515 can be regarded as the hex representation 0xc0000135, which could be "The application failed to initialize properly" (try searching both in Google). Without more information this would be hard to resolve as this seems to be a rather generic error code. Is this problem showing up on your own machine or someone elses? If it works on some computers and not others, this would suggest its could be a problem with the primer3 installation rather than your code. 
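
To see how the signed return code relates to the hex value mentioned above, the number can be reinterpreted as an unsigned 32-bit integer. This is only a small sketch; the helper name as_unsigned_hex is made up for illustration:

```python
def as_unsigned_hex(return_code):
    """Reinterpret a signed 32-bit process return code as unsigned hex."""
    # Masking with 0xFFFFFFFF maps a negative value onto its 32-bit
    # two's complement bit pattern, which is the form in which such
    # status codes are usually written and searched for.
    return "0x%08x" % (return_code & 0xFFFFFFFF)


print(as_unsigned_hex(-1073741515))  # 0xc0000135
```

The hex form is generally the more useful one to search for.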
Does it work if you run the code directly in Python rather than via py2exe?

I would suggest you add a message box or print statement to show the actual command line string to the user just before trying to run the program. Make a note of this, then also try running this command by hand. It could be something missing in your EMBOSS setup (e.g. it can't find certain data files).

Peter

From p.j.a.cock at googlemail.com Tue Apr 28 10:16:53 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 11:16:53 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>
References: <053601c9c723$be3cd3d0$1022a8c0@ipkgatersleben.de> <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <002601c9c7da$8c111ee0$1022a8c0@ipkgatersleben.de> <320fb6e00904280133v31c6d158u1353be561990b709@mail.gmail.com> <000801c9c7e8$dca6d170$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00904280316j480a679cy4437555923a8ff8f@mail.gmail.com>

On Tue, Apr 28, 2009 at 11:05 AM, Stefanie Lück wrote:
>> Only a 50% time speed up? i.e. It took half the time? Not bad,
>> although I expected more. It will probably depend on the number of
>> queries, their sizes, and the database - probably the speed up would
>> be more for a larger database like NR.
>
> I blast ~3000 queries against the tigr barley v9 DB (50500 subjects). It
> takes about 35 seconds with XP, E8400 (3GHZ), 4 GB RAM. Hope this is
> normal...

35s sounds good :)

I normally deal with much slower searches (e.g. protein against NR, or with RPS-BLAST against CDD), measured in minutes or, when querying whole genomes, maybe hours. On this sort of problem I would expect doing individual searches for each query to be much, much slower. You are dealing with a much smaller database, and with shorter queries, so it will in general be faster.
Peter

From mjldehoon at yahoo.com Tue Apr 28 13:00:07 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Apr 2009 06:00:07 -0700 (PDT)
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>
Message-ID: <627305.69090.qm@web62401.mail.re1.yahoo.com>

--- On Mon, 4/27/09, Peter Cock wrote:
> > Would NCBIStandalone.Iterator() be faster?
>
> NCBIStandalone.Iterator() is the old semi-obsolete plain
> text parser - it won't parse the XML output, hence the
> "Invalid header" error. Maybe the tutorial
> (or the error message) could be clearer.

I think part of the problem is the organization of the code in Bio.Blast, which seems to have grown historically. Bio.Blast.NCBIStandalone contains blastall, blastpgp, and rpsblast, which makes sense, but also BlastParser and PsiBlastParser, which are not necessarily connected to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the parser for Blast HTML output, though qblast does not necessarily generate output in HTML format.

The usage of this module may be more understandable if all functions were accessible from Bio.Blast directly in a fashion more consistent with current Biopython. Bio.Blast would then have the following functions:

read(handle, format='xml')
parse(handle, format='xml')
blastall
blastpgp
rpsblast
qblast

with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.

Any objections, comments?

--Michiel.
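
The batching advice given earlier in this thread (one FASTA file holding all the queries, then a single BLAST call) can be sketched roughly as follows. This is only an illustration using the Python standard library: the function names, the example sequences, the file names and the database name barley_db are placeholders, and the blastall switches shown (-p, -d, -i, -e, -m 7, -o) are the legacy command line options as I understand them, so please check them against your installed version.

```python
import os
import tempfile


def write_fasta(records, path):
    """Write (name, sequence) pairs to one FASTA file, 60 letters per line."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(">%s\n" % name)
            for start in range(0, len(seq), 60):
                handle.write(seq[start:start + 60] + "\n")


def build_blastall_command(query_fasta, database, out_xml, expect=1.0):
    """Return the argument list for a single blastall run over all queries."""
    return [
        "blastall",
        "-p", "blastn",      # BLAST program
        "-d", database,      # database name (placeholder)
        "-i", query_fasta,   # one FASTA file containing every query
        "-e", str(expect),   # expectation threshold
        "-m", "7",           # XML output
        "-o", out_xml,       # let BLAST write the output file itself
    ]


# Hypothetical 21 bp queries, just to show the shape of the data.
queries = [("query1", "ACGTACGTACGTACGTACGTA"),
           ("query2", "TTGACCGATTGACCGATTGAC")]
fasta_path = os.path.join(tempfile.mkdtemp(), "all_queries.fasta")
write_fasta(queries, fasta_path)
command = build_blastall_command(fasta_path, "barley_db", "results.xml")
print(" ".join(command))
# The command would then be run once, e.g. with subprocess.call(command),
# so the database is only loaded a single time.
```

Building the argument list separately from running it also makes it easy to print the exact command for debugging.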
From p.j.a.cock at googlemail.com Tue Apr 28 13:36:37 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 14:36:37 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <627305.69090.qm@web62401.mail.re1.yahoo.com>
References: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <627305.69090.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>

On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon wrote:
>> NCBIStandalone.Iterator() is the old semi-obsolete plain
>> text parser - it won't parse the XML output, hence the
>> "Invalid header" error. Maybe the tutorial
>> (or the error message) could be clearer.
>
> I think part of the problem is the organization of the code in Bio.Blast,
> which seems to have grown historically. Bio.Blast.NCBIStandalone
> contains blastall, blastpgp, and rpsblast, which makes sense, but also
> BlastParser and PsiBlastParser, which are not necessarily connected
> to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for
> blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the
> parser for Blast HTML output, though qblast does not necessarily
> generate output in HTML format.

I presumed that initially the standalone tools only produced plain text, and the website (qblast) only produced HTML - hence the use of Bio.Blast.NCBIStandalone for both command line wrappers AND the plain text parser, and Bio.Blast.NCBIWWW for both the qblast function AND the HTML parser.

> The usage of this module may be more understandable if all functions
> were accessible from Bio.Blast directly in a fashion more consistent
> with current Biopython. Bio.Blast would then have the following functions:
>
> read(handle, format='xml')
> parse(handle, format='xml')
> blastall
> blastpgp
> rpsblast
> qblast
>
> with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.
>
> Any objections, comments?

I do like the idea of moving/importing the qblast function directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on.

For read/parse functions, we should probably call the format "blastxml" to match BioPerl. Would you continue to support the plain text output here? Also, something to keep in mind is that there may be non-NCBI variants of BLAST with their own formats as well.

Rather than continuing to encourage the use of blastall, blastpgp and rpsblast, I would rather bring Bio.Blast.Applications up to date, and then declare them obsolete. These three "helper" functions are very limiting in how the command line is invoked - you can't choose the exact call used (e.g. subprocess options) or what you want back (e.g. you may not care about the handles). For example, getting BLAST to write its output to a file is confusingly difficult right now using these functions. Also, dealing with errors isn't nice.

Peter

From mjldehoon at yahoo.com Wed Apr 29 01:28:26 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Apr 2009 18:28:26 -0700 (PDT)
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
Message-ID: <290052.25369.qm@web62407.mail.re1.yahoo.com>

--- On Tue, 4/28/09, Peter Cock wrote:
> I do like the idea of moving/importing the qblast function
> directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML
> later on.

Well, Bio.Blast.NCBIXML would still be there (containing the code for the XML parser), but users would access it through Bio.Blast.parse/read.

> For read/parse functions, we should probably call the
> format "blastxml" to match BioPerl.

We could have both "xml" and "blastxml" for Blast XML output, "text" and "blasttext" for Blast text output, and "table" and "blasttable" for Blast table (-m 8 and 9) output.

> Would you continue to support the plain text output here?

Yes. I'm thinking more about code reorganization than removing/adding functionality.

> Rather than continuing to encourage the use of blastall,
> blastpgp and rpsblast I would rather bring Bio.Blast.Applications
> up to date, and then declare them obsolete.

How would users typically use Bio.Blast.Applications?

--Michiel.

From p.j.a.cock at googlemail.com Wed Apr 29 08:33:03 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 09:33:03 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <290052.25369.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>

On Wed, Apr 29, 2009 at 2:28 AM, Michiel de Hoon wrote:
>
> How would users typically use Bio.Blast.Applications?
>

In the next release, I would aim to have Bio.Blast.Applications updated to cover blastall (fully), plus blastpgp and rpsblast (currently not covered), and for the three helper functions Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use Bio.Blast.Applications internally.

I would suggest at some point (perhaps a release later) calling the three helper functions obsolete, and eventually deprecating them, but I appreciate these are well documented and well used, so this should be a gradual transition.

In the future I would see people constructing their application command line object and then using it to spawn the task as needed. Bio.Application.generic_run might suffice for low-output tools, ranging up to using the built-in subprocess module for full control. The command line string can also be used in other ways, e.g. for submission to a computing cluster using qsub, or writing to a shell script etc.

The point about this is decoupling construction of the command line string from actually executing it.
Right now the Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions do both, and there is no way to (a) see what command line was used, which makes debugging difficult, or (b) control how it is invoked (e.g. recent Windows GUI questions).

Another immediate benefit is an example usage that I do quite often: running BLAST and saving the output to a file. The cleanest way to do this is to use the -o option to get BLAST itself to write to a file. If you do this, then there is no useful output written to the handles - but the Bio.Blast.NCBIStandalone functions make this fiddly (see Bug 2654). Right now the tutorial does something equally indirect - in Python, it reads the BLAST output from stdout and saves it to a file (and probably not in a memory-efficient way either!).

See also this thread on where to put new command line wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

If you were asking about the actual code for how to build the command line object, well, I have some thoughts on making the current Bio.Application base class easier to use (properties and keyword arguments at init), which I have started to discuss on the dev list.

Peter

From bala.biophysics at gmail.com Wed Apr 29 09:15:50 2009
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Wed, 29 Apr 2009 11:15:50 +0200
Subject: [Biopython] chanding res id
Message-ID: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>

Friends,
Following is a script that I wrote to change the resid. This works with a single PDB file, but if I use an NMR structure with multiple models I get an error message. I hope there should be a loop or some fancy way to iterate over all the models in the NMR structure. Kindly help me in doing the same.

#!/usr/bin/env python
from Bio.PDB import PDBParser
from Bio.PDB import PDBIO
from sys import argv
outfile=raw_input('enter outfile name: ')
par=PDBParser()
s=par.get_structure('x',argv[1])
for index, residue in enumerate(s.get_residues()):
    residue.id=(" ", index, " ")
out=PDBIO()
out.set_structure(s)
out.save(outfile)

Bala

From biopython at maubp.freeserve.co.uk Wed Apr 29 09:26:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 10:26:26 +0100
Subject: [Biopython] chanding res id
In-Reply-To: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>
References: <288df32a0904290215h3eb63e4fp8e39bcd22e6e72e@mail.gmail.com>
Message-ID: <320fb6e00904290226k40d0e7bcvf2973b20b7b39cd8@mail.gmail.com>

On Wed, Apr 29, 2009 at 10:15 AM, Bala subramanian wrote:
> Friends,
> Following is a script that i wrote to change the resid. This works with
> single pdb file but if i use a NMR with multiple models i get an error
> message. I hope there shd be a loop or some fancy way to iterate over all
> the models in the NMR structure. Kindly help me in doing the same.
> #!/usr/bin/env python
> from Bio.PDB import PDBParser
> from Bio.PDB import PDBIO
> from sys import argv
> outfile=raw_input('enter outfile name: ')
> par=PDBParser()
> s=par.get_structure('x',argv[1])
> for index, residue in enumerate(s.get_residues()):

I would add a loop here, because you want to reset the index for each model - something like this (untested):

for model in s:
    for index, residue in enumerate(model.get_residues()):

>     residue.id=(" ", index, " ")

Should you be using index+1 here? I don't recall if PDB files allow an index of zero or not.
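
Filling in the untested suggestion above, a per-model loop could look like the sketch below. It assumes only what the script in this thread already relies on: iterating over the structure yields models, each model has get_residues(), and a residue's id can be assigned. The function name renumber_residues and the default start=1 are assumptions for illustration; as in the original script, the hetero flag and insertion code are blanked and the count runs on across chains within a model.

```python
def renumber_residues(structure, start=1):
    """Renumber residues consecutively, restarting the count for each model."""
    for model in structure:
        # enumerate(..., start) begins again at `start` for every model,
        # so each model of an NMR ensemble receives the same numbering.
        for index, residue in enumerate(model.get_residues(), start):
            residue.id = (" ", index, " ")
```

With the script above, this would be called as renumber_residues(s) just before out.set_structure(s).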

> out=PDBIO()
> out.set_structure(s)
> out.save(outfile)
>
> Bala

Peter

From p.j.a.cock at googlemail.com Wed Apr 29 10:31:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 11:31:26 +0100
Subject: [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00904290331n654964bficfc68ae92d477387@mail.gmail.com>

On Apr 29, Peter wrote:
> On Apr 29, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>>
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally. ...
>
> If you were asking about the actual code for how to build the command
> line object, well I have some thoughts on making the current
> Bio.Application base class easier to use (properties and keyword
> arguments at init) which I have started to discuss on the dev list.

See this dev list thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005916.html

And Bug 2822 (with examples):
http://bugzilla.open-bio.org/show_bug.cgi?id=2822

Peter
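
The decoupling discussed in this thread, building the command line as an object and deciding separately how to run it, might look something like the sketch below. The class name CommandLine, its constructor and the run helper are hypothetical and are not the Bio.Application or Bio.Blast.Applications API; only the subprocess module from the standard library is used, and the option values are placeholders.

```python
import subprocess


class CommandLine(object):
    """Hypothetical command line builder: a program name plus its options."""

    def __init__(self, program, **options):
        self.program = program
        self.options = dict(options)

    def arguments(self):
        """Return the command as an argument list, options sorted by name."""
        args = [self.program]
        for name in sorted(self.options):
            args.extend(["-" + name, str(self.options[name])])
        return args

    def __str__(self):
        # The plain string can be logged, shown to the user, or written
        # into a shell script or cluster submission. No quoting of
        # arguments containing spaces is attempted in this sketch.
        return " ".join(self.arguments())


def run(command_line):
    """Execute the command and return its exit code (one choice among many)."""
    return subprocess.call(command_line.arguments())


cline = CommandLine("blastall", p="blastn", d="my_db", i="queries.fasta",
                    m=7, o="results.xml")
print(cline)  # inspect the exact command before (or instead of) running it
```

Because nothing is executed until run() is called, the same object can be printed for debugging, handed to subprocess with different options, or written out for a job scheduler.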