From macmanes at gmail.com Tue Nov 1 15:19:24 2011
From: macmanes at gmail.com (Matthew MacManes)
Date: Tue, 1 Nov 2011 12:19:24 -0700
Subject: [Biopython] parsing fasta based on header
Message-ID:

Hi All,

I have a large fasta file that I am trying to sort into multiple smaller files based on their ID's. The File starts like this:

>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC

I want all of the ID's beginning with 1's to go on one file, ID's starting with 2's in another.

I have been trying to use SeqIO

for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
    for i in range(1,3):
        if record.id %i: #this needs to be changed "if record.id *STARTS WITH* %i"
            print record.id
            output_handle = open("%i.fasta", "w") #naming in this manner does not seem to be allowed
            SeqIO.write(output_handle, "fasta")
            output_handle.close()

But this seems to be not working in many obvious ways... Can anybody help me out with some advice on how to proceed?

Thanks a lot, Matt

From w.arindrarto at gmail.com Tue Nov 1 15:53:11 2011
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 1 Nov 2011 20:53:11 +0100
Subject: [Biopython] parsing fasta based on header
In-Reply-To:
References:
Message-ID:

Hi Matthew,

You can use Python generators for this.
Here's a rough example:

from Bio import SeqIO

# generators for the two different groups
seq_1 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if r.id.startswith('1'))
seq_2 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if r.id.startswith('2'))

# seqs, filenames pair list
pairs = [(seq_1, 'file_1'), (seq_2, 'file_2')]

# the actual write
for seq, filename in pairs:
    SeqIO.write(seq, open(filename, 'w'), 'fasta')

cheers,
Bowo

On Tue, Nov 1, 2011 at 20:19, Matthew MacManes wrote:
> Hi All,
>
> I have a large fasta file that I am trying to sort into multiple smaller
> files based on their ID's. The File starts like this:
>
> >1MUSgi|116063569|ref|NM_010065.2|
> AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
> >2MUSgi|118130562|ref|NM_019880.3|
> CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
> >2MUSgi|118130562|ref|NM_019880.3|
> AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
> >1HOMOgi|59853098|ref|NM_004408.2|
> GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
> >1
> GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
> >2
> CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC
>
> I want all of the ID's beginning with 1's to go on one file, ID's starting
> with 2's in another.
>
> I have been trying to use SeqIO
>
> for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
> for i in range(1,3):
> if record.id %i: #this needs to be changed "if record.id *STARTS WITH*
> %i"
> print record.id
> output_handle = open("%i.fasta", "w") #naming in this manner does not seem
> to be allowed
> SeqIO.write(output_handle, "fasta")
> output_handle.close()
>
> But this seems to be not working in many obvious ways... Can anybody help
> me out with some advice on how to proceed?
>
> Thanks a lot, Matt
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Tue Nov 1 17:04:39 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 1 Nov 2011 21:04:39 +0000
Subject: [Biopython] parsing fasta based on header
In-Reply-To:
References:
Message-ID:

On Tue, Nov 1, 2011 at 7:53 PM, Wibowo Arindrarto wrote:
> Hi Matthew,
>
> You can use Python generators for this. Here's a rough example:
>
> # generators for the two different groups
> seq_1 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('1'))
> seq_2 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('2'))
>
> # seqs, filenames pair list
> pairs = [(seq_1, 'file_1'), (seq_2, 'file_2')]
>
> # the actual write
> for seq, filename in pairs:
> SeqIO.write(seq, open(filename, 'w'), 'fasta')
>
> cheers,
> Bowo

Email does tend to mess up the indentation in Python :(

I'm pleased to see that's very similar to my answer earlier,
http://biostar.stackexchange.com/questions/13791/parsing-fasta-based-on-header/13793

By the way Wibowo, rather than this:

SeqIO.write(seq, open(filename, 'w'), 'fasta')

use this:

SeqIO.write(seq, filename, 'fasta')

It is shorter but also will ensure the handle is closed promptly on Jython/PyPy where garbage collection isn't as predictable as on normal C Python.
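
For anyone reading this on Python 3 with a recent Biopython, the same split can probably also be done in a single pass over the input, closing the output handles explicitly. This is only a sketch and not code from the messages above: the helper names pick_prefix and split_fasta and the "<prefix>.fasta" output naming are made up for illustration, and a record ID such as "10..." would also be treated as starting with "1".

```python
def pick_prefix(record_id, prefixes=("1", "2")):
    """Return the first prefix that record_id starts with, or None."""
    for prefix in prefixes:
        if record_id.startswith(prefix):
            return prefix
    return None


def split_fasta(in_path, prefixes=("1", "2")):
    """Stream in_path once, writing each record to '<prefix>.fasta'."""
    # Imported here so pick_prefix can be used without Biopython installed.
    from Bio import SeqIO

    # One output handle per prefix (hypothetical naming: 1.fasta, 2.fasta).
    handles = {prefix: open(prefix + ".fasta", "w") for prefix in prefixes}
    try:
        for record in SeqIO.parse(in_path, "fasta"):
            prefix = pick_prefix(record.id, prefixes)
            if prefix is not None:
                # SeqIO.write accepts a single SeqRecord as well as an iterable.
                SeqIO.write(record, handles[prefix], "fasta")
    finally:
        # Close every handle even if parsing fails part way through.
        for handle in handles.values():
            handle.close()
```

It would be called as split_fasta("QHM-clean.fasta"). Compared with the two generators, this reads the large input once instead of twice, at the cost of a few more lines.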
Peter

From mictadlo at gmail.com Wed Nov 2 00:39:18 2011
From: mictadlo at gmail.com (Mic)
Date: Wed, 2 Nov 2011 14:39:18 +1000
Subject: [Biopython] subprocess.Popen problem
Message-ID:

Hello,
I have tried to write a SOAPaligner wrapper in order to get the SOAP alignment statistics:

Total Pairs: 1000 PE
Paired:      35 ( 3.50%) PE
Singled:     170 ( 8.50%) SE
Total Elapsed Time:           24.00
      - Load Index Table:     23.22
      - Alignment:             0.78

with the following code:

import os, subprocess
if __name__ == '__main__':
    try:
        cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o test_clonesremoved-tiny_vs_all.m.paired.soap -2 test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
        proc = subprocess.Popen(cmd_soap, shell=True)
        returncode = proc.wait()
        print returncode
    except Exception, e:
        sys.stderr.write( "%s\n" % str(e))
        sys.exit()

However, when I started the script I just got 0 as an output:

$ python soap_wrapper.py

Begin Program SOAPaligner/soap2
Wed Nov 2 14:23:33 2011
Reference: all.m.fasta.index
Query File a: test_A_clonesremoved-tiny.fastq
Query File b: test_B_clonesremoved-tiny.fastq
Output File: test_clonesremoved-tiny_vs_all.m.paired.soap
             test_clonesremoved-tiny_vs_all.m.single.soap
Load Index Table ...
Load Index Table OK
Begin Alignment ...
2000 ok    0.76 sec
Total Pairs: 1000 PE
Paired:      35 ( 3.50%) PE
Singled:     170 ( 8.50%) SE
Total Elapsed Time:           24.00
      - Load Index Table:     23.22
      - Alignment:             0.78

SOAPaligner/soap2 End
Wed Nov 2 14:23:57 2011

0

What did I do wrong?

Thank you in advance.

From p.j.a.cock at googlemail.com Wed Nov 2 04:52:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 2 Nov 2011 08:52:19 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:
Message-ID:

On Wed, Nov 2, 2011 at 4:39 AM, Mic wrote:
> Hello,
> I have tried to write a SOAPaligner wrapper in order to get the SOAP
> alignment statistics:
>
> Total Pairs: 1000 PE
> Paired:      35 ( 3.50%) PE
> Singled:
170 ( 8.50%) SE
> Total Elapsed Time:           24.00
>       - Load Index Table:     23.22
>       - Alignment:             0.78
>
> with the following code:
>
> import os, subprocess
> if __name__ == '__main__':
>     try:
>         cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b
> test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o
> test_clonesremoved-tiny_vs_all.m.paired.soap -2
> test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
>         proc = subprocess.Popen(cmd_soap, shell=True)
>         returncode = proc.wait()
>         print returncode
>     except Exception, e:
>         sys.stderr.write( "%s\n" % str(e))
>         sys.exit()
>
>
> However, when I started the script I just got 0 as an output:
>
> $ python soap_wrapper.py
>
> Begin Program SOAPaligner/soap2
> Wed Nov 2 14:23:33 2011
> Reference: all.m.fasta.index
> Query File a: test_A_clonesremoved-tiny.fastq
> Query File b: test_B_clonesremoved-tiny.fastq
> Output File: test_clonesremoved-tiny_vs_all.m.paired.soap
>              test_clonesremoved-tiny_vs_all.m.single.soap
> Load Index Table ...
> Load Index Table OK
> Begin Alignment ...
> 2000 ok    0.76 sec
> Total Pairs: 1000 PE
> Paired:      35 ( 3.50%) PE
> Singled:     170 ( 8.50%) SE
> Total Elapsed Time:           24.00
>       - Load Index Table:     23.22
>       - Alignment:             0.78
>
> SOAPaligner/soap2 End
> Wed Nov 2 14:23:57 2011
>
> 0
>
> What did I do wrong?
>
> Thank you in advance.
>

Command line tools typically return 0 for success, anything else is usually an error code. You can double check the return code from the command line outside Python - e.g. a quick shell script to call the command and echo the return code to the terminal.

Peter

From p.j.a.cock at googlemail.com Wed Nov 2 05:42:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 2 Nov 2011 09:42:15 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:

Message-ID:

On Wed, Nov 2, 2011 at 9:03 AM, Mic wrote:
> Thank you, but is it possible to store the SOAP output in the memory/file in
> order to retrieve the following statistics lines?
> Total Pairs: 1000 PE
> Paired:      35 ( 3.50%) PE
> Singled:     170 ( 8.50%) SE
> Total Elapsed Time:           24.00
>       - Load Index Table:     23.22
>       - Alignment:             0.78
> Thank you in advance.

Assuming that is written to stdout, just collect it as a pipe (handle) via subprocess, or since it is short, use the .communicate method and get this as a string.

http://docs.python.org/library/subprocess.html

Peter

From patriciaseraos at gmail.com Wed Nov 2 06:39:48 2011
From: patriciaseraos at gmail.com (Patricia Soares)
Date: Wed, 02 Nov 2011 10:39:48 +0000
Subject: [Biopython] Supertree - R
Message-ID: <4EB11DF4.9070804@gmail.com>

Hello,

I need to build a supertree from four different methods. I found a code able to build supertrees with R. But I wanted to use python to do this. I was wondering if you have any way to do something similar to the R code and build a supertree.

Appreciate your help and time.

Best regards,
Patricia
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: supertree.R
URL:

From cy at cymon.org Wed Nov 2 07:30:38 2011
From: cy at cymon.org (Cymon Cox)
Date: Wed, 2 Nov 2011 11:30:38 +0000
Subject: [Biopython] Supertree - R
In-Reply-To: <4EB11DF4.9070804@gmail.com>
References: <4EB11DF4.9070804@gmail.com>
Message-ID:

Hi Patricia,

On 2 November 2011 10:39, Patricia Soares wrote:
> Hello,
>
> I need to build a supertree from four different methods.

You want to build a supertree from a combination of optimal tree(s) of each of 4 different tree building methods, right?

> I found a code
> able to build supertrees with R.

There is a long and complex literature about the best way to reconstruct supertrees.
But from the code you attached, you appear to want to reconstruct a supertree using Matrix Representation with Parsimony (MRP) - basically recode the source trees as nodes present/absent in a matrix then use parsimony to find the shortest tree.

> But I wanted to use python to do this.
> I was wondering if you have any way to do something similar to the R
> code and build a supertree.
>

I'm not aware of anything in Biopython. There is however a module in p4 (http://code.google.com/p/p4-phylogenetics), p4.MRP.mrp(trees, taxNames=None), that will recode your source trees and write the MRP matrix. You can then use your favourite parsimony implementation (PAUP*, TNT, etc) to build the tree.

Cheers, Cymon

From cy at cymon.org Wed Nov 2 09:28:56 2011
From: cy at cymon.org (Cymon Cox)
Date: Wed, 2 Nov 2011 13:28:56 +0000
Subject: [Biopython] Supertree - R
In-Reply-To: <4EB143B1.5010507@gmail.com>
References: <4EB11DF4.9070804@gmail.com> <4EB143B1.5010507@gmail.com>
Message-ID:

Hi Patricia,

On 2 November 2011 13:20, Patricia Soares wrote:
> Thank you!
> I will try with this package.
>
> By the way, do you know the difference between p4.SuperTreeSupport and
> p4.MRP?
>

SuperTreeSupport is not a method to build a supertree as such, but a way of providing measures of support for the supertree given the source trees.

From the docstring:

"""Supertree support measures

Super tree support can be used to calculate a number of support measures for a set of trees and a supertree. The measures can be at split level and placed on the supertree for image production or at tree level with a number of summary measures.

The support of the input trees for a supertree is measured by counting the number of input trees that support(S), conflict(Q), permits(P) or are relevant(R) with the splits in the supertree.

Supply a supertree and the input trees used to create it. Filenames or trees will do. A single supertree and a list of input trees. For example::

etc...

Cheers, C.
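
To make the MRP recoding mentioned earlier in this thread a bit more concrete, here is a minimal sketch of the standard coding: each clade of each source tree becomes one matrix column, taxa inside the clade get 1, other taxa in that tree get 0, and taxa missing from that tree get ?. This is not the p4 implementation or its API; the function name mrp_matrix and the input shape (each tree given as a set of taxa plus a list of clades) are assumptions chosen to keep the example self-contained.

```python
def mrp_matrix(trees):
    """Build an MRP character matrix.

    trees: list of (taxa, clades) pairs, where taxa is the set of taxon
    names in one source tree and clades is a list of sets, one per
    internal node of that tree.
    Returns a dict mapping taxon name -> string of '0', '1' and '?'.
    """
    # Every taxon seen in any source tree gets a row in the matrix.
    all_taxa = sorted(set().union(*(taxa for taxa, _ in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clades in trees:
        for clade in clades:  # one matrix column per clade
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # taxon absent from this tree
                elif taxon in clade:
                    rows[taxon].append("1")  # taxon inside the clade
                else:
                    rows[taxon].append("0")  # in the tree, outside the clade
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


if __name__ == "__main__":
    source_trees = [
        ({"A", "B", "C", "D"}, [{"A", "B"}]),
        ({"A", "B", "E"}, [{"A", "E"}]),
    ]
    for taxon, row in sorted(mrp_matrix(source_trees).items()):
        print(taxon, row)
```

The resulting rows would still need to be written out in whatever format the chosen parsimony program expects.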
> > Cheers, > Patricia > > On 11/02/2011 11:30 AM, Cymon Cox wrote: > > Hi Patricia, > > > > On 2 November 2011 10:39, Patricia Soares > wrote: > > > >> Hello, > >> > >> I need to build a supertree from four different methods. > > > > > > You want to build a supertree from a combination of optimal tree(s) of > each > > of 4 different tree building methods, right? > > > > > > > >> I found a code > >> able to build supertrees with R. > > > > > > There is a long and complex literature about the best way to reconstruct > > supertrees. But from the code you attached, you appear to be wanting to > > reconstruct a supertree using Matrix Representation using Parsimony > (MRP) - > > basically recode the source trees as nodes present/absent in a matrix > then > > use parsimony to find the shortest tree. > > > > > > > >> But I wanted to use python to do this. > >> I was wondering if you have any way to do something similar to the R > >> code and build a supertree. > >> > > > > I'm not aware of anything in Biopython. There is however a module in p4 ( > > http://code.google.com/p/p4-phylogenetics) p4.MRP. mrp(trees, > > taxNames=None) that will recode you source trees and write the MRP > matrix. > > You can then use your favourite parsimony implementation (PAUP*, TNT, > etc) > > to build the tree. > > > > Cheers, Cymon > > > > From macmanes at gmail.com Wed Nov 2 12:21:44 2011 From: macmanes at gmail.com (Matthew MacManes) Date: Wed, 2 Nov 2011 09:21:44 -0700 Subject: [Biopython] send SeqIO.parse to NcbiblastnCommandline Message-ID: Hi All, I am trying to take a large fasta file, send sequences one by one to NcbiblastnCommandline, sending results to a unique file based on the query ID. 
So far I have MUSDATABASE='/media/hd/blastdb/mouse.rna' from Bio import SeqIO from Bio.Blast.Applications import NcbiblastnCommandline for seq_record in SeqIO.parse("test1.fa", "fasta"): cl = NcbiblastnCommandline(cmd="/home/matthew/ncbi-blast/bin/blastn", query=seq_record.seq, db=MUSDATABASE, evalue=0.0000000001, outfmt="'10 qseqid qseq sseqid sseq bitscore'", out=seq_record.id, max_target_seqs=1, num_threads=15) print cl stdout, stderr = cl() This seems like a promising approach, but the issue is that the query argument expects a file, not a sequence itself. In reading in the BLAST+ manual, blastn can accept a sequence from the standard input via query="-", but I cannot get this to work, does not catch the sequence. Any pointers greatly appreciated. Matt From p.j.a.cock at googlemail.com Wed Nov 2 13:04:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 2 Nov 2011 17:04:19 +0000 Subject: [Biopython] send SeqIO.parse to NcbiblastnCommandline In-Reply-To: References: Message-ID: On Wed, Nov 2, 2011 at 4:21 PM, Matthew MacManes wrote: > Hi All, > > I am trying to take a large fasta file, send sequences one by one > to NcbiblastnCommandline, sending results to a unique file based on the > query ID. So far I have > > MUSDATABASE='/media/hd/blastdb/mouse.rna' > > from Bio import SeqIO > from Bio.Blast.Applications import NcbiblastnCommandline > for seq_record in SeqIO.parse("test1.fa", "fasta"): > cl = NcbiblastnCommandline(cmd="/home/matthew/ncbi-blast/bin/blastn", > ?query=seq_record.seq, > db=MUSDATABASE, evalue=0.0000000001, > outfmt="'10 qseqid qseq sseqid sseq bitscore'", > ?out=seq_record.id, > max_target_seqs=1, > ?num_threads=15) > print cl > stdout, stderr = cl() > > > This seems like a promising approach, but the issue is that the query > argument expects a file, not a sequence itself. 
In reading in the BLAST+
> manual, blastn can accept a sequence from the standard input via query="-",
> but I cannot get this to work, does not catch the sequence.
>
>
> Any pointers greatly appreciated.
> Matt

You need to do two things, (1) tell BLAST to read the sequence from stdin, and (2) supply the FASTA formatted sequence to stdin. Try something along these lines:

cline = NcbiblastnCommandline(..., query="-", ...)
stdout, stderr = cline(stdin=record.format("fasta"))

Peter

From mictadlo at gmail.com Wed Nov 2 23:16:19 2011
From: mictadlo at gmail.com (Mic)
Date: Thu, 3 Nov 2011 13:16:19 +1000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:

Message-ID:

Thank you, I wrote the following code, but I am not sure whether it is what you meant. Is the try/except necessary?

import sys,os, subprocess
if __name__ == '__main__':
    try:
        cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o test_clonesremoved-tiny_vs_all.m.paired.soap -2 test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
        proc = subprocess.Popen(cmd_soap, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        returncode = proc.wait()
        print returncode
        stdout_value, stderr_value = proc.communicate()
        print '\tcombined output:', stdout_value, stderr_value
        if returncode == 1:
            sys.exit(1)
    except Exception, e:
        sys.stderr.write( "%s\n" % str(e))
        sys.exit()

Thank you in advance.

On Wed, Nov 2, 2011 at 7:42 PM, Peter Cock wrote:
> On Wed, Nov 2, 2011 at 9:03 AM, Mic wrote:
> > Thank you, but is it possible to store the SOAP output in the
> > memory/file in
> > order to retrieve the following statistics lines?
> > Total Pairs: 1000 PE
> > Paired: 35 ( 3.50%) PE
> > Singled: 170 ( 8.50%) SE
> > Total Elapsed Time: 24.00
> > - Load Index Table: 23.22
> > - Alignment: 0.78
> > Thank you in advance.
>
> Assuming that is written to stdout, just collect it as a pipe (handle)
> via subprocess, or since it is short, use the .communicate method
> and get this as a string.
>
> http://docs.python.org/library/subprocess.html
>
> Peter
>

From p.j.a.cock at googlemail.com Thu Nov 3 05:31:38 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Nov 2011 09:31:38 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:

Message-ID:

On Thu, Nov 3, 2011 at 3:16 AM, Mic wrote:
> Thank you, I wrote the following code, but I am not sure whether it is what
> you meant.

Depending on the tool I would check for a non-zero return code rather than just treating 1 as an error.

You are also not collecting stderr/stdout safely. With both sent to a pipe, calling .wait() before anything reads them can deadlock once the tool has written enough output to fill the pipe buffer. Drop the .wait() call, let .communicate() read both streams and wait for the process to finish, and then check the process object's .returncode. See:

http://docs.python.org/library/subprocess.html

Peter

From fernandocharitha at googlemail.com Mon Nov 7 13:31:09 2011
From: fernandocharitha at googlemail.com (Charitha Fernando)
Date: Mon, 7 Nov 2011 13:31:09 -0500
Subject: [Biopython] finding single domain proteins
Message-ID:

Hi guys,

I know this is not the appropriate forum to ask this question, and I am really sorry for doing so. I am just beginning to find my way through protein science, and I have a question: I want a list of all single domain proteins in the PDB, and I am not sure if there is a list like that.

I tried to play with both CATH/SCOP but I am not getting anywhere. Is there a list someone has of all the single domain proteins? It does not matter if it is all alpha or mixed, I just need a list of them.

Thanks,
Fernando

From from.d.putto at gmail.com Mon Nov 7 14:10:22 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 7 Nov 2011 20:10:22 +0100
Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy'
Message-ID:

Hi All,
Consider the following code (from Biopython Cookbook)

from Bio import SeqIO
uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
handle = open("selected.dat", "w")
for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
    handle.write(uniprot.get_raw(acc))
handle.close()

I want to print only selected parts (ID, description and taxonomy), not the full record.
I modified the code as

for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
    print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy

but this gives error "AttributeError: 'SeqRecord' object has no attribute 'taxonomy' "
Any suggestion !!!!

--
Sheila d angel

From p.j.a.cock at googlemail.com Mon Nov 7 14:38:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Nov 2011 19:38:52 +0000
Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy'
In-Reply-To:
References:
Message-ID:

On Mon, Nov 7, 2011 at 7:10 PM, Sheila the angel wrote:
> Hi All,
> Consider the following code (from Biopython Cookbook)
>
> from Bio import SeqIO
> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
> handle = open("selected.dat", "w")
> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
>     handle.write(uniprot.get_raw(acc))
> handle.close()
>
> I want to print only selected part of (ID, description and Taxonomy ) not
> the full record. I modified the code as
>
> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
>     print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy
>
> but this gives error "AttributeError: 'SeqRecord' object has no attribute
> 'taxonomy' "
> Any suggestion !!!!

What makes you think there would be a taxonomy attribute? Is there a mistake in the documentation somewhere? From memory you should try uniprot[acc].annoations["taxonomy"]

Also your code will parse the record three times for each accession, use this instead:

for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
    record = uniprot[acc]
    print record.id, record.description, ...

Peter

From p.j.a.cock at googlemail.com Mon Nov 7 17:01:41 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Nov 2011 22:01:41 +0000
Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy'
In-Reply-To:
References:

Message-ID: On Monday, November 7, 2011, Peter Cock wrote: > On Mon, Nov 7, 2011 at 7:10 PM, Sheila the angel wrote: >> Hi All, >> Consider the following code (from Biopython Cookbook) >> >> from Bio import SeqIO >> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss") >> handle = open("selected.dat", "w") >> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: >> handle.write(uniprot.get_raw(acc)) >> handle.close() >> >> I want to print only selected part of (ID, description and Taxonomy ) not >> the full record. I modified the code as >> >> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: >> print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy >> >> but this gives error "AttributeError: 'SeqRecord' object has no attribute >> 'taxonomy' " >> Any suggestion !!!! > > What makes you think there would be a taxonomy attribute? Is there > a mistake in the documentation somewhere? From memory you should > try uniprot[acc].annoations["taxonomy"] Typo: annotations Related to this, try dir(...) on Python objects to see what their attributes are. > Also your code will parse the record three times for each accession, > use this instead: > > for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: > record = uniprot[acc] > print.id, record.description, ... > > Peter > Peter From eric.talevich at gmail.com Mon Nov 7 18:26:12 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 7 Nov 2011 18:26:12 -0500 Subject: [Biopython] finding single domain proteins In-Reply-To: References: Message-ID: On Mon, Nov 7, 2011 at 1:31 PM, Charitha Fernando < fernandocharitha at googlemail.com> wrote: > Hi guys, > > I know this is not the appropriate forum to ask this question , I am really > sorry for doing so. > I am just begining to find my way through protein science--I have a > question > I want a list of all Single domain proteins in the PDB, I am not sure if > there is a list like that? 
> > I tried to play with both CATH/SCOP but I am not getting anywhere, is there > a list someone has of all the single domain proteins, does not matter if it > is all alpha or mixed, just need a list of them > Thanks, > Fernando > > Hi Fernando, I think you'll have a better chance of getting this question in front of the right audience if you ask on Biostar: http://biostar.stackexchange.com/ The answer depends on how you define what you're looking for, here. I'll hold off on the details until I see this on Biostar though. -E From bala.biophysics at gmail.com Thu Nov 10 10:46:25 2011 From: bala.biophysics at gmail.com (Bala subramanian) Date: Thu, 10 Nov 2011 16:46:25 +0100 Subject: [Biopython] structural alignment Message-ID: Friends, I have 50 conformations of a peptide. I want to do a pair wise structural alignment and obtain a score for each pair. I would like to know i) if there is any inbuilt function to do structural alignment in biopython ii)if there is a wrapper in biopython for any structural alignment software Thanks, Bala From p.j.a.cock at googlemail.com Thu Nov 10 11:11:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Nov 2011 16:11:15 +0000 Subject: [Biopython] structural alignment In-Reply-To: References: Message-ID: On Thu, Nov 10, 2011 at 3:46 PM, Bala subramanian wrote: > Friends, > I have 50 conformations of a peptide. I want to do a pair wise structural > alignment and obtain a score for each pair. I would like to know > i) if there is any inbuilt function to do structural alignment in biopython > ii)if there is a wrapper in biopython for any structural alignment software > > Thanks, > Bala SVD works nicely for some tasks, e.g. 
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter

From natassa_g_2000 at yahoo.com Tue Nov 15 07:28:16 2011
From: natassa_g_2000 at yahoo.com (natassa)
Date: Tue, 15 Nov 2011 04:28:16 -0800 (PST)
Subject: [Biopython] index_db problem
Message-ID: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>

Hi,

I have a problem with a script using the Bio.SeqIO index_db function: I have previously used it successfully, but as I had to recently change computers I now tried the same script with no success. I get the error:

AttributeError: 'module' object has no attribute 'index_db'

although I have installed biopython 1.58 (the index_db is apparently from 1.57 and on, and on the previous computer I had installed the 1.57). I checked if I have sqlite 3 by importing the module, and this seems OK. So now I am kind of stuck why the module SeqIO from 1.58 gets imported, but not the function.

My function using it is something like:

def Sample_RandomHeader(numbRecs, reps, path):

    species=os.path.basename(path).split('_')[0].split('.')[0]
    newdir=os.path.dirname(path)+'/'+species+str(numbRecs)+'_'+str(reps)+'Randoms/'
    speciesdb=SeqIO.index_db(species+'.idx', path, "fasta")
    ....

and I have python 2.6

Any help would be appreciated!
Thank you,

Anastasia

From p.j.a.cock at googlemail.com Tue Nov 15 07:38:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 15 Nov 2011 12:38:51 +0000
Subject: [Biopython] index_db problem
In-Reply-To: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>
References: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>
Message-ID:

On Tue, Nov 15, 2011 at 12:28 PM, natassa wrote:
> Hi,
>
> I have a problem with a script using the Bio.SeqIO index_db function:
> I have previously used it successfully, but as I had to recently change
> computers I now tried the same script with no success.
I get the error:
>
> AttributeError: 'module' object has no attribute 'index_db'
>
> although I have installed biopython 1.58 (the index_db is apparently
> from 1.57 and on, and on the previous computer I had installed the
> 1.57). I checked if i have sqlite 3 by importing the module, and this
> seems OK. So now i am kind of stuck why the module SeqIO from
> 1.58 gets imported, but not the function.
>
> My function using it is something like:
> ...
>
> and i have python 2.6
>
> Any help would be appreciated!
> Thank you,
>
> Anastasia

It sounds like you're getting an older Biopython (from before 1.57, when index_db was added) even if 1.58 is also installed somewhere else. Try this to find out which is being used and where it is installed:

import Bio
print Bio.__version__
print Bio.__file__
from Bio import SeqIO
print SeqIO.__file__

Peter

From p.j.a.cock at googlemail.com Tue Nov 15 09:10:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 15 Nov 2011 14:10:06 +0000
Subject: [Biopython] index_db problem
In-Reply-To: <1321364191.60299.YahooMailNeo@web39315.mail.mud.yahoo.com>
References: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com> <1321361890.13423.YahooMailNeo@web39311.mail.mud.yahoo.com> <1321364191.60299.YahooMailNeo@web39315.mail.mud.yahoo.com>
Message-ID:

Peter wrote:
> Hi Anastasia,
>
> Are you familiar with the PYTHONPATH environment variable?
>
> Peter

On Tue, Nov 15, 2011 at 1:36 PM, natassa wrote:
> That rings a bell, thanks!
> I added
> PYTHONPATH=path-to-1.58libs
> export PYTHONPATH
> in my .bashrc
> and the script now works!
> Is it OK only with this libraries though? or are there other 1.58 -installed
> files that need to be pointed at somewhere?
> Thanks for your help,
> Anastasia

That should be everything, glad things seem to be working OK now.
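
For the archive, a Python 3 flavoured version of the same check might look like the sketch below. It is only an illustration: the helper name describe_module is made up here, and it simply reports which copy of a module the current interpreter (and its PYTHONPATH) actually picks up.

```python
import importlib


def describe_module(name):
    """Return (version, file) for an importable module, or None if absent."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines these attributes, so fall back gracefully.
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return version, location


if __name__ == "__main__":
    info = describe_module("Bio")
    if info is None:
        print("Biopython is not importable from this interpreter")
    else:
        print("Biopython %s loaded from %s" % info)
        from Bio import SeqIO
        # index_db is the function discussed in this thread.
        print("index_db available:", hasattr(SeqIO, "index_db"))
```

If the reported location is not the directory where 1.58 was installed, the PYTHONPATH (or sys.path) ordering is the first thing to look at.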
Peter From devaniranjan at gmail.com Wed Nov 16 15:39:28 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 16 Nov 2011 15:39:28 -0500 Subject: [Biopython] alignment scores -fitting to Poisson distribution Message-ID: Hello, I did an alignment of a protein with hundreds of other proteins--I got a distribution that is NOT normal (see attached picture). I was wondering if biopython had a way of fitting this to one of the distribution methods? I don't have mathlab/maple software, so if biopython cannot do it but if someone knows of an alternate software that I can use to fit this data--I would appreciate the help. What I want is to say......the target PDB was 'X' sigma away from the mean of the distribution. If it was a normal distribution, it would be easy to do, since the distribution is skewed I am wondering the best way to do it. Thank you, George -------------- next part -------------- A non-text attachment was scrubbed... Name: score.png Type: image/png Size: 8245 bytes Desc: not available URL: From idoerg at gmail.com Wed Nov 16 15:55:42 2011 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 16 Nov 2011 15:55:42 -0500 Subject: [Biopython] alignment scores -fitting to Poisson distribution In-Reply-To: References: Message-ID: It's an extreme value distribution. http://en.wikipedia.org/wiki/Generalized_extreme_value_distribution which is what you generally get from selecting the maximum function from a set of random variables. (You get a normal distribution from a sum of random variables). Aligning proteins provides a maximum function, so multiple alignment scores are EVD distributed. http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html ./I On Wed, Nov 16, 2011 at 3:39 PM, George Devaniranjan wrote: > Hello, > > I did an alignment of a protein with hundreds of other proteins--I got > a distribution that is NOT normal (see attached picture). > I was wondering if biopython had a way of fitting this to one of the > distribution methods? 
> I don't have mathlab/maple software, so if biopython cannot do it but > if someone knows of an alternate software that I can use to fit this > data--I would appreciate the help. > > What I want is to say......the target PDB was 'X' sigma away from the > mean of the distribution. > If it was a normal distribution, it would be easy to do, since the > distribution is skewed I am wondering the best way to do it. > > Thank you, > George > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From caoyaqiang0410 at gmail.com Wed Nov 16 22:20:24 2011 From: caoyaqiang0410 at gmail.com (=?UTF-8?B?5pu55Lqa5by6?=) Date: Thu, 17 Nov 2011 11:20:24 +0800 Subject: [Biopython] Paired-End Read Splitting & Joining Message-ID: Dear mail-lists: Hi, my first time of asking questions in mailing, please excuse me if there is any possible problems. I'm new in Python and biopython, nearly without practically programming experience in Bioinformatics. Recently my work get involved in transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the software needs paired-end sequences in two fastq files. So I wonder can biopython finish the job in a conventient way? Because the paired-end file is too big and can't be done in a conventient way in *Galaxy* Please give me some guide. Thanks. 
Best wishes,
Yaqiang Cao

From p.j.a.cock at googlemail.com Thu Nov 17 04:21:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 09:21:21 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To:
References:
Message-ID:

On Thu, Nov 17, 2011 at 3:20 AM, Yaqiang Cao wrote:
> Dear mail-lists:
> Hi, my first time of asking questions in mailing, please excuse me
> if there is any possible problems.
> I'm new in Python and biopython, nearly without practically
> programming experience in Bioinformatics. Recently my work get involved in
> transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the
> software needs paired-end sequences in two fastq files. So I wonder can
> biopython finish the job in a conventient way? Because the paired-end file
> is too big and can't be done in a conventient way in *Galaxy*
> Please give me some guide. Thanks.
>
> Best wishes,
> Yaqiang Cao

Probably, yes.

So you have one large FASTQ file containing both parts of
each pair (say part one and part two, or they might be
labelled as the forward and reverse reads), and you want
to split this into two FASTQ files?

How are your reads named? The hard part is inferring this,
one common scheme uses /1 and /2 suffixes, but Illumina
have changed this in their latest pipeline and the part is
now in the description instead.

Could you show us the first 6 reads (or so) from the big
FASTQ file?

Also are there any single reads in your file, either never
paired or orphaned where one of a pair failed QC?

Peter

From p.j.a.cock at googlemail.com Thu Nov 17 07:31:30 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 12:31:30 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To: <4EC4F5C5.8010402@gmail.com>
References:

<4EC4F5C5.8010402@gmail.com> Message-ID: On Thu, Nov 17, 2011 at 11:53 AM, Yaqiang Cao wrote: > > Thanks for replying. > > Yes, I have a .fastq file convert from .sra, used one of NCBI > sratools,fastq-dump . And the file is over 1G. I want to split this into two > FASTQ files because the tophat requires two files of paired-end sequence. > The screenshot of the first 20 lines of the .fastq file is like the attached > picture file: Looking at the names, that file seems not to have both parts of each pair. I looked on the NCBI SRA page, and the library is described as paired: http://www.ncbi.nlm.nih.gov/sra?term=srr100235 There only seems to be one SRA file for this accession, ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX042/SRX042254/SRR100235/ i.e. This file: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX042/SRX042254/SRR100235/SRR100235.sra I'd look more but the SRA website tells me "Our database is temporarily unavailable. Please come back later." Peter From caoyaqiang0410 at gmail.com Thu Nov 17 06:53:41 2011 From: caoyaqiang0410 at gmail.com (Yaqiang Cao) Date: Thu, 17 Nov 2011 19:53:41 +0800 Subject: [Biopython] Paired-End Read Splitting & Joining In-Reply-To: References:

Message-ID: <4EC4F5C5.8010402@gmail.com>

On 2011-11-17 17:21, Peter Cock wrote:
> On Thu, Nov 17, 2011 at 3:20 AM, Yaqiang Cao wrote:
>> Dear mail-lists:
>> Hi, my first time of asking questions in mailing, please excuse me
>> if there is any possible problems.
>> I'm new in Python and biopython, nearly without practically
>> programming experience in Bioinformatics. Recently my work get involved in
>> transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the
>> software needs paired-end sequences in two fastq files. So I wonder can
>> biopython finish the job in a conventient way? Because the paired-end file
>> is too big and can't be done in a conventient way in *Galaxy*
>> Please give me some guide. Thanks.
>>
>> Best wishes,
>> Yaqiang Cao
> Probably, yes.
>
> So you have one large FASTQ file containing both parts of
> each pair (say part one and part two, or they might be
> labelled as the forward and reverse reads), and you want
> to split this into two FASTQ files?
>
> How are your reads named? The hard part is inferring this,
> one common scheme used /1 and /2 suffixes, but Illumina
> have changed this in their latest pipeline and the part is
> now in the description instead.
>
> Could you show us the first 6 reads (or so) from the big
> FASTQ file?
>
> Also are there any single reads in your file, either never
> paired or orphaned where one of a pair failed Qc?
>
> Peter

Thanks for replying.

Yes, I have a .fastq file converted from .sra, using one of the NCBI SRA tools, fastq-dump. The file is over 1G. I want to split this into two FASTQ files because TopHat requires two files of paired-end sequence. The screenshot of the first 20 lines of the .fastq file is in the attached picture file.

And because I'm new, I can't quite understand your words about
"
Also are there any single reads in your file, either never
paired or orphaned where one of a pair failed Qc?
"

All I get is in the screenshot. And its original NCBI SRA number is SRR100235.
Thanks dear mail-listing and Peter. Best wishes, Yaqiang Cao -------------- next part -------------- A non-text attachment was scrubbed... Name: Screenshot-2011-11-17 19:49:04.png Type: image/png Size: 122983 bytes Desc: not available URL: From cjfields at illinois.edu Thu Nov 17 08:39:29 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 17 Nov 2011 13:39:29 +0000 Subject: [Biopython] Paired-End Read Splitting & Joining In-Reply-To: <4EC4F5C5.8010402@gmail.com> References:

<4EC4F5C5.8010402@gmail.com>
Message-ID: <17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>

On Nov 17, 2011, at 5:53 AM, Yaqiang Cao wrote:
> On 2011-11-17 17:21, Peter Cock wrote:
>> On Thu, Nov 17, 2011 at 3:20 AM, Yaqiang Cao wrote:
>>> Dear mail-lists:
>>> Hi, my first time of asking questions in mailing, please excuse me
>>> if there is any possible problems.
>>> I'm new in Python and biopython, nearly without practically
>>> programming experience in Bioinformatics. Recently my work get involved in
>>> transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the
>>> software needs paired-end sequences in two fastq files. So I wonder can
>>> biopython finish the job in a conventient way? Because the paired-end file
>>> is too big and can't be done in a conventient way in *Galaxy*
>>> Please give me some guide. Thanks.
>>>
>>> Best wishes,
>>> Yaqiang Cao
>> Probably, yes.
>>
>> So you have one large FASTQ file containing both parts of
>> each pair (say part one and part two, or they might be
>> labelled as the forward and reverse reads), and you want
>> to split this into two FASTQ files?
>>
>> How are your reads named? The hard part is inferring this,
>> one common scheme used /1 and /2 suffixes, but Illumina
>> have changed this in their latest pipeline and the part is
>> now in the description instead.
>>
>> Could you show us the first 6 reads (or so) from the big
>> FASTQ file?
>>
>> Also are there any single reads in your file, either never
>> paired or orphaned where one of a pair failed Qc?
>>
>> Peter
> Thanks for replying.
>
> Yes, I have a .fastq file convert from .sra, used one of NCBI sratools,fastq-dump . And the file is over 1G. I want to split this into two FASTQ files because the tophat requires two files of paired-end sequence. The screenshot of the first 20 lines of the .fastq file is like the attached picture file:
> And because I'm new, I can't quitely understand your words about "

This is one of my gripes about the SRA tools, that they (by default) dump paired-end data as one concatenated string; it's a nasty gotcha. You need to specify the --split-files option to fastq-dump to dump these as paired end, and this will split them into two files.

> Also are there any single reads in your file, either never
> paired or orphaned where one of a pair failed Qc?
>
> "
>
> All I get is in the screenshot. And it's original NCBI SRA number is SRR100235.
> Thanks dear mail-listing and Peter.

These should all be matched pairs.

> Best wishes,
> Yaqiang Cao

chris

From p.j.a.cock at googlemail.com Thu Nov 17 09:28:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 14:28:03 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To: <17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>
References:

<4EC4F5C5.8010402@gmail.com> <17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>
Message-ID:

On Thu, Nov 17, 2011 at 1:39 PM, Fields, Christopher J wrote:
>
> This is one of my gripes about the SRA tools, that they (by default)
> dump paired-end data as one concatenated string; it's a nasty gotcha.
> You need to specify the --split-files option to fastq-dump to dump
> these as paired end, and this will split them into two files.

Thanks Chris - that was one of the possibilities I wanted to check, but couldn't see the reads via the NCBI website. Sadly it is still saying "Our database is temporarily unavailable. Please come back later."

Peter

From petyuk at gmail.com Thu Nov 17 17:51:38 2011
From: petyuk at gmail.com (Vladislav Petyuk)
Date: Thu, 17 Nov 2011 14:51:38 -0800
Subject: [Biopython] fetching chromosome IDs given the organism ID
Message-ID:

I am trying to fetch the chromosome IDs for a given genome.
For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
The piece of Biopython code that used to work for me is:
#---------------------
url = Entrez.esearch(db="genome", term="txid43989")
record = Entrez.read(url)
chromosomeIDs = record["IdList"]
#---------------------
Not anymore. Now it returns the organism id, which is 1608.
Please point me in the right direction on how to get the chromosome ids given the organism id.
For example:
organism id: 1608
chromosomes ids: NC_010546.1 NC_010547.1 NC_010539.1 NC_010541.1 NC_010542.1 NC_010543.1
Thank you!

From p.j.a.cock at googlemail.com Thu Nov 17 18:09:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 23:09:04 +0000
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To:
References:
Message-ID:

On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote:
> I am trying to fetch the chromosome IDs for a given genome.
> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids > http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2 > The piece of Biopython code that used to work for me is: > #--------------------- > url = Entrez.esearch(db="genome", term="txid43989") > record = Entrez.read(url) > chromosomeIDs = record["IdList"] > #--------------------- > Not anymore. Now it returns the organism id, which is 1608. That's annoying of the NCBI to change things. > Please point in the right direction how to get the chromosome ids given the > organism id. Try searching the nucleotide database directly, with term txid43989[orgn] to restrict the species, and I think there is another field to restrict to complete genomes. Have a look at the field list with EInfo (see the Biopython tutorial for EInfo which explains how to do this). I would try it myself right now, but the Entrez website seems very slow from here tonight. Peter From cjfields at illinois.edu Thu Nov 17 21:19:39 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 18 Nov 2011 02:19:39 +0000 Subject: [Biopython] fetching chromosome IDs given the organism ID In-Reply-To: References:

Message-ID: <5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu> On Nov 17, 2011, at 5:09 PM, Peter Cock wrote: > On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote: >> I am trying to fetch the chromosome IDs for a given genome. >> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids >> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2 >> The piece of Biopython code that used to work for me is: >> #--------------------- >> url = Entrez.esearch(db="genome", term="txid43989") >> record = Entrez.read(url) >> chromosomeIDs = record["IdList"] >> #--------------------- >> Not anymore. Now it returns the organism id, which is 1608. > > That's annoying of the NCBI to change things. ...but not unusual (ref: BLAST output over the ages). >> Please point in the right direction how to get the chromosome ids given the >> organism id. > > Try searching the nucleotide database directly, with > term txid43989[orgn] to restrict the species, and I think > there is another field to restrict to complete genomes. > Have a look at the field list with EInfo (see the Biopython > tutorial for EInfo which explains how to do this). > > I would try it myself right now, but the Entrez website > seems very slow from here tonight. > > Peter I think they are doing a lot of work behind the scenes, particularly with efetch (something to be aware of for all us folks who have modules pulling data from genbank). chris From p.j.a.cock at googlemail.com Fri Nov 18 04:21:22 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 18 Nov 2011 09:21:22 +0000 Subject: [Biopython] fetching chromosome IDs given the organism ID In-Reply-To: <5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu> References:

<5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu> Message-ID: On Fri, Nov 18, 2011 at 2:19 AM, Fields, Christopher J wrote: > > I think they are doing a lot of work behind the scenes, particularly > with efetch (something to be aware of for all us folks who have > modules pulling data from genbank). > > chris Yes, this could be related: http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000081.html Peter From devaniranjan at gmail.com Fri Nov 18 10:56:44 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 18 Nov 2011 10:56:44 -0500 Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer Message-ID: Hi, I would like to calculate the RMSD using biopython, there seems to be 2 functions ( SVDSuperimposer and PDB.Superimposer) I could use, is there a difference? Another question I have is if I use the C-alpha carbon of proteins, does it actually calculate the "real" rmsd, what I mean is -for instance, PYMOL would "throw away" atoms over several cycles to improve the RMSD, I don't want to do that--looking at the code I think bipython does calculate the "real" RMSD, is this correct? Thank you very much, George From p.j.a.cock at googlemail.com Fri Nov 18 12:35:59 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 18 Nov 2011 17:35:59 +0000 Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer In-Reply-To: References: Message-ID: On Fri, Nov 18, 2011 at 3:56 PM, George Devaniranjan wrote: > Hi, > > I would like to calculate the RMSD using biopython, there seems to be > 2 functions ( SVDSuperimposer and PDB.Superimposer) I could use, is > there a difference? Note Bio.PDB.Superimposer calls Bio.SVDSuperimposer, so they should do the same thing. 
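
For reference, the quantity an SVD-based superposition reports can be written out as below. This is a sketch of the standard least-squares definition rather than a quote from the Biopython source; the symbols are assumptions introduced here: x_i and y_i are the paired atom coordinates (for example the C-alpha atoms), N is the number of pairs, R is a rotation and t a translation.

```latex
\mathrm{RMSD}
  = \min_{R,\,t}\;
    \sqrt{\frac{1}{N}\sum_{i=1}^{N}
          \bigl\lVert R\,x_i + t - y_i \bigr\rVert^{2}}
```

Under this definition every one of the N supplied pairs contributes to the sum, so nothing is discarded between cycles; the result depends only on which atoms are paired up before the calculation.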
Peter From robert.campbell at queensu.ca Fri Nov 18 12:51:00 2011 From: robert.campbell at queensu.ca (Robert Campbell) Date: Fri, 18 Nov 2011 12:51:00 -0500 Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer In-Reply-To: References: Message-ID: <20111118125100.064e1f2d@adelie.biochem.queensu.ca> Hi George, On Fri, 2011-11-18 10:56 EST, George Devaniranjan wrote: > Hi, > > I would like to calculate the RMSD using biopython, there seems to be > 2 functions ( SVDSuperimposer and PDB.Superimposer) I could use, is > there a difference? > > Another question I have is if I use the C-alpha carbon of proteins, > does it actually calculate the "real" rmsd, what I mean is -for > instance, PYMOL would "throw away" atoms over several cycles to > improve the RMSD, I don't want to do that--looking at the code I think > bipython does calculate the "real" RMSD, is this correct? You can, of course, prevent PyMOL from rejecting atoms, by telling it to only do a single cycle (command line option: "cycles=1"). If you want to do batch aligning of many structures, you can also use a script that will do the alignment without starting up the graphics. Cheers, Rob -- Robert L. Campbell, Ph.D. Senior Research Associate/Adjunct Assistant Professor Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644 Queen's University, Kingston, ON K7L 3N6 Canada Tel: 613-533-6821 http://pldserver1.biochem.queensu.ca/~rlc From devaniranjan at gmail.com Fri Nov 18 17:13:58 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Fri, 18 Nov 2011 17:13:58 -0500 Subject: [Biopython] make_log_odds_matrix Message-ID: The biopython tutorial talks about constructing a substitution matrix....I used the following lines of code test_dict={('Y', 'L'): 1552226.0, ('B', 'B'): 18251943.0, ('D', 'G'): 44863831.0, ('D', 'D'): 22086473.0,........... 
test_dict give the frequency which different amino acids replace each other from Bio import SubsMat my_arm=SubsMat.SeqMat(test_dict) my_log=SubsMat.make_log_odds_matrix(my_arm) my_log.print_mat() I have 2 questions: 1) Is this the correct way to do this? 2) This seems different to the way BLOSUM is generated, there is no "normalising" , so can we really compare this generated matrix to a BLOSUM matrix? Thank you, George From p.j.a.cock at googlemail.com Mon Nov 21 11:52:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 21 Nov 2011 16:52:18 +0000 Subject: [Biopython] fetching chromosome IDs given the organism ID In-Reply-To: References:

Message-ID: On Thu, Nov 17, 2011 at 11:09 PM, Peter Cock wrote: > On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote: >> I am trying to fetch the chromosome IDs for a given genome. >> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids >> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2 >> The piece of Biopython code that used to work for me is: >> #--------------------- >> url = Entrez.esearch(db="genome", term="txid43989") >> record = Entrez.read(url) >> chromosomeIDs = record["IdList"] >> #--------------------- >> Not anymore. Now it returns the organism id, which is 1608. > > That's annoying of the NCBI to change things. > The NCBI have just made a public announcement by email today (21 Nov 2011), and apologized for the lack of notice: http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000083.html Judging from the URL it was also on their news page the day you found the problem, but I hadn't seen that then: http://www.ncbi.nlm.nih.gov/About/news/17Nov2011.html It looks like a sensible long term change to the genome database. Regards, Peter From petyuk at gmail.com Mon Nov 21 17:13:00 2011 From: petyuk at gmail.com (Vladislav Petyuk) Date: Mon, 21 Nov 2011 14:13:00 -0800 Subject: [Biopython] fetching chromosome IDs given the organism ID In-Reply-To: References:

Message-ID: Thanks for pointing in the right direction. I also e-mailed Entrez help desk about this problem. It seems like their advise is to go through "nuccore". However, to restrict the results they suggest to use "srcdb_refseq[prop]" in the query line. http://www.ncbi.nlm.nih.gov/nuccore?term=%22Cyanothece%20sp.%20ATCC%2051142%22[orgn]%20AND%20srcdb_refseq[prop ] That works well for some not-extensively studied organisms such a Cyanothece and returns the right number of records, which is 6. But for human it returns 62051 records instead of 25 (chromosomes + mitochondrial DNA). http://www.ncbi.nlm.nih.gov/nuccore?term=%22homo%20sapiens%22[orgn]%20AND%20srcdb_refseq[prop ] After tuning the query a little bit this one seems like giving a reasonable results (("homo sapiens"[orgn] AND srcdb_refseq[prop]) AND 168[BioProject]) NOT patches NOT contig Tweaking nuccore queries seems like a hack rather then a solution. There better be a straight relationships between the databases: Genome -> Genome Project -> Nuccore This query returns the right thing, but in HTML format (even if &rettype=gb). http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=59013 or http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&term=168 I guess (hope) that GenBank format is something that will be added in the future, unless I am overlooking something. Cheers, Vlad On Mon, Nov 21, 2011 at 8:52 AM, Peter Cock wrote: > On Thu, Nov 17, 2011 at 11:09 PM, Peter Cock > wrote: > > On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk > wrote: > >> I am trying to fetch the chromosome IDs for a given genome. 
> >> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids > >> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2 > >> The piece of Biopython code that used to work for me is: > >> #--------------------- > >> url = Entrez.esearch(db="genome", term="txid43989") > >> record = Entrez.read(url) > >> chromosomeIDs = record["IdList"] > >> #--------------------- > >> Not anymore. Now it returns the organism id, which is 1608. > > > > That's annoying of the NCBI to change things. > > > > The NCBI have just made a public announcement by email today > (21 Nov 2011), and apologized for the lack of notice: > > > http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000083.html > > Judging from the URL it was also on their news page the day you > found the problem, but I hadn't seen that then: > > http://www.ncbi.nlm.nih.gov/About/news/17Nov2011.html > > It looks like a sensible long term change to the genome database. > > Regards, > > Peter > From macrozhu at gmail.com Tue Nov 22 04:44:48 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Tue, 22 Nov 2011 10:44:48 +0100 Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer In-Reply-To: References: Message-ID: http://en.wikipedia.org/wiki/Structural_alignment#Structural_superposition On Fri, Nov 18, 2011 at 4:56 PM, George Devaniranjan wrote: > > Another question I have is if I use the C-alpha carbon of proteins, > does it actually calculate the "real" rmsd, what I mean is -for > instance, PYMOL would "throw away" atoms over several cycles to > improve the RMSD, I don't want to do that--looking at the code I think > bipython does calculate the "real" RMSD, is this correct? > > Hi, George, I believe when you said "real" RMSD, you actually meant the optimal solution to the least squares problem. 
In SVDSuperimposer, SVD is employed to solve the optimization problem given the correspondence between the residues in the two structures. It always generates the optimal solution, or the "real" RMSD with respect to the correspondence between residues. If you change the correspondence between structures (like what pymol does according to your description), it is like you have a different optimization problem to solve. SVDSuperimposer does not change the problem spontaneously. It solves the problem you input. It seems that the PyMOL command you described is "align". Note that "align" and "superimpose" do mean different operations, although they are often (wrongly) used interchangeably. http://en.wikipedia.org/wiki/Structural_alignment#Structural_superposition HTH hongbo Thank you very much, > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Hongbo From jordan.r.willis at Vanderbilt.Edu Wed Nov 23 19:57:19 2011 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 23 Nov 2011 18:57:19 -0600 Subject: [Biopython] Calculating Buried and Surface Residues Message-ID: <89F9C1F4-B3E4-404F-9816-7E7EB7F80427@vanderbilt.edu> Hello Biopython, I was wondering if anyone had a script, knew of a script, API, web-server, Pymol example, etc. to do something I have seen in so many papers but can't seem to find a way to do it. That is, given a PDB, calculate which residues are considered buried, and which ones are considered solvent exposed. If this is easy in Biopython, I would love to use it. Otherwise, if anyone knows anything out there for this task, please let me know. Thanks so much! 
Jordan

From anaryin at gmail.com Thu Nov 24 02:26:20 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 24 Nov 2011 08:26:20 +0100
Subject: [Biopython] Calculating Buried and Surface Residues
In-Reply-To: <89F9C1F4-B3E4-404F-9816-7E7EB7F80427@vanderbilt.edu>
References: <89F9C1F4-B3E4-404F-9816-7E7EB7F80427@vanderbilt.edu>
Message-ID:

Hello Jordan,

If you can get your hands on NACCESS that will do the trick. The Biopython PDB module has a wrapper for this program which makes it pretty easy to use. Another alternative is to use the Half Sphere Exposure measure that is built into Biopython and described here. Either way gives you per-residue information.

Best,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

2011/11/24 Willis, Jordan R

> Hello Biopython,
>
> I was wondering if anyone had a script, knew of a script, API, web-server,
> Pymol example, etc. to do something I have seen in so many papers but can't
> seem to find a way to do it. That is, given a PDB, calculate which residues
> are considered buried, and which ones are considered solvent exposed. If
> this is easy in Biopython, I would love to use it. Otherwise, if anyone
> knows anything out there for this task, please let me know.
>
> Thanks so much!
>
> Jordan
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From mictadlo at gmail.com Mon Nov 28 21:06:58 2011
From: mictadlo at gmail.com (Mic)
Date: Tue, 29 Nov 2011 12:06:58 +1000
Subject: [Biopython] Generator expression for SeqIO
Message-ID:

Hello,
How is it possible to use generator expression for the following code?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

a = {'a': 'AAA', 1: 'ATC', 3: 'CCT', 2: 'TCT'}
order = [1, 2, 3, 'a']
records = (SeqRecord(Seq(seq), id=str(ref), description="") for ref_name in order: for ref, seq in a[ref_name].items())

output_handle = open('a.fasta', 'w')
SeqIO.write(records, output_handle, "fasta")
output_handle.close()

Error message:
$ python x.py
  File "x.py", line 7
    records = (SeqRecord(Seq(seq), id=str(ref), description="") for ref_name in order: for ref, seq in a[ref_name].items())
                                                                                     ^
SyntaxError: invalid syntax

Thank you in advance.

From p.j.a.cock at googlemail.com Tue Nov 29 06:15:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Nov 2011 11:15:53 +0000
Subject: [Biopython] Generator expression for SeqIO
In-Reply-To:
References:
Message-ID:

On Tue, Nov 29, 2011 at 2:06 AM, Mic wrote:
> Hello,
> How is it possible to use generator expression for the following code?
>
> from Bio import SeqIO
> from Bio.Seq import Seq
> from Bio.SeqRecord import SeqRecord
>
> a = {'a': 'AAA', 1: 'ATC', 3: 'CCT', 2: 'TCT'}
> order = [1, 2, 3, 'a']
> records = (SeqRecord(Seq(seq), id=str(ref), description="") for ref_name in
> order: for ref, seq in a[ref_name].items())
>
> output_handle = open('a.fasta', 'w')
> SeqIO.write(records, output_handle, "fasta")
> output_handle.close()
>
> Error message:
> $ python x.py
>  File "x.py", line 7
>    records = (SeqRecord(Seq(seq), id=str(ref), description="") for
> ref_name in order: for ref, seq in a[ref_name].items())
>
>         ^
> SyntaxError: invalid syntax
>
> Thank you in advance.

Well yes, that is invalid syntax - for a start there is a colon in the middle.
I think you want:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

a = {'a': 'AAA', 1: 'ATC', 3: 'CCT', 2: 'TCT'}
order = [1, 2, 3, 'a']
records = (SeqRecord(Seq(a[ref]), id=str(ref), description="")
           for ref in order)

output_handle = open('a.fasta', 'w')
SeqIO.write(records, output_handle, "fasta")
output_handle.close()

Peter

From macmanes at gmail.com Tue Nov 1 19:19:24 2011
From: macmanes at gmail.com (Matthew MacManes)
Date: Tue, 1 Nov 2011 12:19:24 -0700
Subject: [Biopython] parsing fasta based on header
Message-ID:

Hi All,

I have a large fasta file that I am trying to sort into multiple smaller files based on their ID's. The File starts like this:

>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC

I want all of the ID's beginning with 1's to go on one file, ID's starting with 2's in another.

I have been trying to use SeqIO

for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
for i in range(1,3):
if record.id %i: #this needs to be changed "if record.id *STARTS WITH* %i"
print record.id
output_handle = open("%i.fasta", "w") #naming in this manner does not seem to be allowed
SeqIO.write(output_handle, "fasta")
output_handle.close()

But this seems to be not working in many obvious ways... Can anybody help me out with some advice on how to proceed?
Thanks a lot, Matt From w.arindrarto at gmail.com Tue Nov 1 19:53:11 2011 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 1 Nov 2011 20:53:11 +0100 Subject: [Biopython] parsing fasta based on header In-Reply-To: References: Message-ID: Hi Matthew, You can use Python generators for this. Here's a rough example: # generators for the two different groups seq_1 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if r.id.startswith('1')) seq_2 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if r.id.startswith('2')) # seqs, filenames pair list pairs = [(seq_1, 'file_1'), (seq_2, 'file_2')] # the actual write for seq, filename in pairs: SeqIO.write(seq, open(filename, 'w'), 'fasta') cheers, Bowo On Tue, Nov 1, 2011 at 20:19, Matthew MacManes wrote: > Hi All, > > I have a large fasta file that I am trying to sort into multiple smaller > files based on their ID's. The File starts like this: > > >1MUSgi|116063569|ref|NM_010065.2| > > AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG- > >2MUSgi|118130562|ref|NM_019880.3| > CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A > >2MUSgi|118130562|ref|NM_019880.3| > > AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT > >1HOMOgi|59853098|ref|NM_004408.2| > > GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG- > >1 > > GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG > >2 > > CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC > > I want all of the ID's beginning with 1's to go on one file, ID's starting > with 2's in another. 
>
> I have been trying to use SeqIO
>
> for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
> for i in range(1,3):
> if record.id %i: #this needs to be changed "if record.id *STARTS WITH*
> %i"
> print record.id
> output_handle = open("%i.fasta", "w") #naming in this manner does not seem
> to be allowed
> SeqIO.write(output_handle, "fasta")
> output_handle.close()
>
> But this seems to be not working in many obvious ways... Can anybody help
> me out with some advice on how to proceed?
>
> Thanks a lot, Matt
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Tue Nov 1 21:04:39 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 1 Nov 2011 21:04:39 +0000
Subject: [Biopython] parsing fasta based on header
In-Reply-To:
References:
Message-ID:

On Tue, Nov 1, 2011 at 7:53 PM, Wibowo Arindrarto wrote:
> Hi Matthew,
>
> You can use Python generators for this. Here's a rough example:
>
> # generators for the two different groups
> seq_1 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('1'))
> seq_2 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('2'))
>
> # seqs, filenames pair list
> pairs = [(seq_1, 'file_1'), (seq_2, 'file_2')]
>
> # the actual write
> for seq, filename in pairs:
> SeqIO.write(seq, open(filename, 'w'), 'fasta')
>
> cheers,
> Bowo

Email does tend to mess up the indentation in Python :(

I'm pleased to see that's very similar to my answer earlier,
http://biostar.stackexchange.com/questions/13791/parsing-fasta-based-on-header/13793

By the way Wibowo, rather than this:

SeqIO.write(seq, open(filename, 'w'), 'fasta')

use this:

SeqIO.write(seq, filename, 'fasta')

It is shorter but also will ensure the handle is closed promptly on Jython/PyPy where garbage collection isn't as predictable as on normal C Python.
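
For reference, the same split can also be sketched so that the input file is read only once. This is a rough illustration rather than anything posted in the thread: it assumes Python 3, a Biopython recent enough that SeqIO.parse accepts a filename, and record IDs that begin with the group digit as in the sample headers. The helper names pick_prefix and split_fasta and the output names 1.fasta and 2.fasta are made up for the example.

```python
def pick_prefix(record_id, prefixes=("1", "2")):
    """Return the first prefix that record_id starts with, or None."""
    for prefix in prefixes:
        if record_id.startswith(prefix):
            return prefix
    return None


def split_fasta(path, prefixes=("1", "2")):
    """Copy records from `path` into '<prefix>.fasta', reading the input once.

    Returns a dict mapping each prefix to the number of records written.
    """
    # Imported here so pick_prefix can be tried without Biopython installed.
    from Bio import SeqIO

    handles = {}
    counts = dict.fromkeys(prefixes, 0)
    try:
        # One output handle per group, e.g. 1.fasta and 2.fasta (hypothetical names).
        for prefix in prefixes:
            handles[prefix] = open("%s.fasta" % prefix, "w")
        for record in SeqIO.parse(path, "fasta"):
            prefix = pick_prefix(record.id, prefixes)
            if prefix is not None:
                # SeqIO.write returns the number of records it wrote (1 here).
                counts[prefix] += SeqIO.write(record, handles[prefix], "fasta")
    finally:
        # Close explicitly rather than relying on garbage collection.
        for handle in handles.values():
            handle.close()
    return counts
```

Calling split_fasta("QHM-clean.fasta") would then write the two files in a single pass. Note that an ID such as "10..." would also match the prefix "1", so a stricter test may be needed for other naming schemes.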
Peter

From mictadlo at gmail.com Wed Nov 2 04:39:18 2011
From: mictadlo at gmail.com (Mic)
Date: Wed, 2 Nov 2011 14:39:18 +1000
Subject: [Biopython] subprocess.Popen problem
Message-ID:

Hello,
I have tried to write a SOAPaligner wrapper in order to get the SOAP alignment statistics:

Total Pairs: 1000 PE
Paired:      35 ( 3.50%) PE
Singled:     170 ( 8.50%) SE
Total Elapsed Time:           24.00
      - Load Index Table:     23.22
      - Alignment:             0.78

with the following code:

import os, subprocess, sys
if __name__ == '__main__':
    try:
        cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o test_clonesremoved-tiny_vs_all.m.paired.soap -2 test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
        proc = subprocess.Popen(cmd_soap, shell=True)
        returncode = proc.wait()
        print returncode
    except Exception, e:
        sys.stderr.write( "%s\n" % str(e))
        sys.exit()

However, when I started the script I just got 0 as an output:

$ python soap_wrapper.py

Begin Program SOAPaligner/soap2
Wed Nov 2 14:23:33 2011
Reference: all.m.fasta.index
Query File a: test_A_clonesremoved-tiny.fastq
Query File b: test_B_clonesremoved-tiny.fastq
Output File: test_clonesremoved-tiny_vs_all.m.paired.soap
             test_clonesremoved-tiny_vs_all.m.single.soap
Load Index Table ...
Load Index Table OK
Begin Alignment ...
2000 ok    0.76 sec
Total Pairs: 1000 PE
Paired:      35 ( 3.50%) PE
Singled:     170 ( 8.50%) SE
Total Elapsed Time:           24.00
      - Load Index Table:     23.22
      - Alignment:             0.78

SOAPaligner/soap2 End
Wed Nov 2 14:23:57 2011

0

What did I do wrong?

Thank you in advance.

From p.j.a.cock at googlemail.com Wed Nov 2 08:52:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 2 Nov 2011 08:52:19 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:
Message-ID:

On Wed, Nov 2, 2011 at 4:39 AM, Mic wrote:
> Hello,
> I have tried to write a SOAPaligner wrapper in oder to get the SOAP
> alignment statistics:
>
> Total Pairs: 1000 PE
> Paired:      35 ( 3.50%) PE
> Singled:
170 ( 8.50%) SE
> Total Elapsed Time:           24.00
>       - Load Index Table:     23.22
>       - Alignment:             0.78
>
> with the following code:
>
> import os, subprocess
> if __name__ == '__main__':
>     try:
>         cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b
> test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o
> test_clonesremoved-tiny_vs_all.m.paired.soap -2
> test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
>         proc = subprocess.Popen(cmd_soap, shell=True)
>         returncode = proc.wait()
>         print returncode
>     except Exception, e:
>         sys.stderr.write( "%s\n" % str(e))
>         sys.exit()
>
>
> However, when I started the script I just got 0 as an output:
>
> $ python soap_wrapper.py
>
> Begin Program SOAPaligner/soap2
> Wed Nov 2 14:23:33 2011
> Reference: all.m.fasta.index
> Query File a: test_A_clonesremoved-tiny.fastq
> Query File b: test_B_clonesremoved-tiny.fastq
> Output File: test_clonesremoved-tiny_vs_all.m.paired.soap
>              test_clonesremoved-tiny_vs_all.m.single.soap
> Load Index Table ...
> Load Index Table OK
> Begin Alignment ...
> 2000 ok    0.76 sec
> Total Pairs: 1000 PE
> Paired:      35 ( 3.50%) PE
> Singled:     170 ( 8.50%) SE
> Total Elapsed Time:           24.00
>       - Load Index Table:     23.22
>       - Alignment:             0.78
>
> SOAPaligner/soap2 End
> Wed Nov 2 14:23:57 2011
>
> 0
>
> What did I wrong?
>
> Thank you in advance.
>

Command line tools typically return 0 for success, anything else is usually an error code. You can double check the return code from the command line outside Python - e.g. a quick shell script to call the command and echo the return code to the terminal.

Peter

From p.j.a.cock at googlemail.com Wed Nov 2 09:42:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 2 Nov 2011 09:42:15 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:

Message-ID: On Wed, Nov 2, 2011 at 9:03 AM, Mic wrote: > Thank you, but is it possible to store the SOAP output in the memory/file in > order to retrieve the following?statistics?lines? > Total Pairs: 1000 PE > Paired: ? ? ?35 ( 3.50%) PE > Singled: ? ? 170 ( 8.50%) SE > Total Elapsed Time: ? ? ? ? ? 24.00 > ? ? ? - Load Index Table: ? ? 23.22 > ? ? ? - Alignment: ? ? ? ? ? ? 0.78 > Thank you in advance. Assuming that is written to stdout, just collect it as a pipe (handle) via subprocess, or since it is short, use the .communicate method and get this as a string. http://docs.python.org/library/subprocess.html Peter From patriciaseraos at gmail.com Wed Nov 2 10:39:48 2011 From: patriciaseraos at gmail.com (Patricia Soares) Date: Wed, 02 Nov 2011 10:39:48 +0000 Subject: [Biopython] Supertree - R Message-ID: <4EB11DF4.9070804@gmail.com> Hello, I need to build a supertree from four different methods. I found a code able to build supertrees with R. But I wanted to use python to do this. I was wondering if you have any way to do something similar to the R code and build a supertree. Appreciate your help and time. Best regards, Patricia -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: supertree.R URL: From cy at cymon.org Wed Nov 2 11:30:38 2011 From: cy at cymon.org (Cymon Cox) Date: Wed, 2 Nov 2011 11:30:38 +0000 Subject: [Biopython] Supertree - R In-Reply-To: <4EB11DF4.9070804@gmail.com> References: <4EB11DF4.9070804@gmail.com> Message-ID: Hi Patricia, On 2 November 2011 10:39, Patricia Soares wrote: > Hello, > > I need to build a supertree from four different methods. You want to build a supertree from a combination of optimal tree(s) of each of 4 different tree building methods, right? > I found a code > able to build supertrees with R. There is a long and complex literature about the best way to reconstruct supertrees. 
But from the code you attached, you appear to be wanting to reconstruct a supertree using Matrix Representation with Parsimony (MRP) - basically recode the source trees as nodes present/absent in a matrix then use parsimony to find the shortest tree.

> But I wanted to use python to do this.
> I was wondering if you have any way to do something similar to the R
> code and build a supertree.
>

I'm not aware of anything in Biopython. There is however a module in p4 ( http://code.google.com/p/p4-phylogenetics) p4.MRP. mrp(trees, taxNames=None) that will recode your source trees and write the MRP matrix. You can then use your favourite parsimony implementation (PAUP*, TNT, etc) to build the tree.

Cheers, Cymon

From cy at cymon.org Wed Nov 2 13:28:56 2011
From: cy at cymon.org (Cymon Cox)
Date: Wed, 2 Nov 2011 13:28:56 +0000
Subject: [Biopython] Supertree - R
In-Reply-To: <4EB143B1.5010507@gmail.com>
References: <4EB11DF4.9070804@gmail.com> <4EB143B1.5010507@gmail.com>
Message-ID:

Hi Patricia,

On 2 November 2011 13:20, Patricia Soares wrote:
> Thank you!
> I will try with this package.
>
> By the way, do you know the difference between p4.SuperTreeSupport and
> p4.MRP?
>

It's not a method to build a supertree as such, but a way of providing measures of support for the supertree given the source trees.

From the docstring:

"""Supertree support measures

Super tree support can be used to calculate a number of support measures for a set of trees and a supertree. The measures can be at split level and placed on the supertree for image production or at tree level with a number of summary measures.

The support of the input trees for a supertree is measured by counting the number of input trees that support(S), conflict(Q), permits(P) or are relevant(R) with the splits in the supertree.

Supply a supertree and the input trees used to create it. Filenames or trees will do. A single supertree and a list of input trees.

For example::

etc...

Cheers, C.
> > Cheers, > Patricia > > On 11/02/2011 11:30 AM, Cymon Cox wrote: > > Hi Patricia, > > > > On 2 November 2011 10:39, Patricia Soares > wrote: > > > >> Hello, > >> > >> I need to build a supertree from four different methods. > > > > > > You want to build a supertree from a combination of optimal tree(s) of > each > > of 4 different tree building methods, right? > > > > > > > >> I found a code > >> able to build supertrees with R. > > > > > > There is a long and complex literature about the best way to reconstruct > > supertrees. But from the code you attached, you appear to be wanting to > > reconstruct a supertree using Matrix Representation using Parsimony > (MRP) - > > basically recode the source trees as nodes present/absent in a matrix > then > > use parsimony to find the shortest tree. > > > > > > > >> But I wanted to use python to do this. > >> I was wondering if you have any way to do something similar to the R > >> code and build a supertree. > >> > > > > I'm not aware of anything in Biopython. There is however a module in p4 ( > > http://code.google.com/p/p4-phylogenetics) p4.MRP. mrp(trees, > > taxNames=None) that will recode you source trees and write the MRP > matrix. > > You can then use your favourite parsimony implementation (PAUP*, TNT, > etc) > > to build the tree. > > > > Cheers, Cymon > > > > From macmanes at gmail.com Wed Nov 2 16:21:44 2011 From: macmanes at gmail.com (Matthew MacManes) Date: Wed, 2 Nov 2011 09:21:44 -0700 Subject: [Biopython] send SeqIO.parse to NcbiblastnCommandline Message-ID: Hi All, I am trying to take a large fasta file, send sequences one by one to NcbiblastnCommandline, sending results to a unique file based on the query ID. 
So far I have MUSDATABASE='/media/hd/blastdb/mouse.rna' from Bio import SeqIO from Bio.Blast.Applications import NcbiblastnCommandline for seq_record in SeqIO.parse("test1.fa", "fasta"): cl = NcbiblastnCommandline(cmd="/home/matthew/ncbi-blast/bin/blastn", query=seq_record.seq, db=MUSDATABASE, evalue=0.0000000001, outfmt="'10 qseqid qseq sseqid sseq bitscore'", out=seq_record.id, max_target_seqs=1, num_threads=15) print cl stdout, stderr = cl() This seems like a promising approach, but the issue is that the query argument expects a file, not a sequence itself. In reading in the BLAST+ manual, blastn can accept a sequence from the standard input via query="-", but I cannot get this to work, does not catch the sequence. Any pointers greatly appreciated. Matt From p.j.a.cock at googlemail.com Wed Nov 2 17:04:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 2 Nov 2011 17:04:19 +0000 Subject: [Biopython] send SeqIO.parse to NcbiblastnCommandline In-Reply-To: References: Message-ID: On Wed, Nov 2, 2011 at 4:21 PM, Matthew MacManes wrote: > Hi All, > > I am trying to take a large fasta file, send sequences one by one > to NcbiblastnCommandline, sending results to a unique file based on the > query ID. So far I have > > MUSDATABASE='/media/hd/blastdb/mouse.rna' > > from Bio import SeqIO > from Bio.Blast.Applications import NcbiblastnCommandline > for seq_record in SeqIO.parse("test1.fa", "fasta"): > cl = NcbiblastnCommandline(cmd="/home/matthew/ncbi-blast/bin/blastn", > ?query=seq_record.seq, > db=MUSDATABASE, evalue=0.0000000001, > outfmt="'10 qseqid qseq sseqid sseq bitscore'", > ?out=seq_record.id, > max_target_seqs=1, > ?num_threads=15) > print cl > stdout, stderr = cl() > > > This seems like a promising approach, but the issue is that the query > argument expects a file, not a sequence itself. 
?In reading in the BLAST+ > manual, blastn can accept a sequence from the standard input via query="-", > but I cannot get this to work, does not catch the sequence. > > > Any pointers greatly appreciated. > Matt You need to do two things, (1) tell BLAST to read the sequence from stdin, and (2) supply the FASTA formatted sequence to stdin. Try something along these lines: cline = NcbiblastnCommandline(..., query="-", ...) stdout, stderr = cline(stdin=record.format("fasta")) Peter From mictadlo at gmail.com Thu Nov 3 03:16:19 2011 From: mictadlo at gmail.com (Mic) Date: Thu, 3 Nov 2011 13:16:19 +1000 Subject: [Biopython] subprocess.Popen problem In-Reply-To: References:

Message-ID:

Thank you, I wrote the following code and am not sure whether it is what you described. Is try/except necessary?

import sys,os, subprocess
if __name__ == '__main__':
    try:
        cmd_soap = 'soap -p 1 -a test_A_clonesremoved-tiny.fastq -b test_B_clonesremoved-tiny.fastq -D all.m.fasta.index -r 0 -o test_clonesremoved-tiny_vs_all.m.paired.soap -2 test_clonesremoved-tiny_vs_all.m.single.soap -m 100 -x 550'
        proc = subprocess.Popen(cmd_soap, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        returncode = proc.wait()
        print returncode
        stdout_value, stderr_value = proc.communicate()
        print '\tcombined output:', stdout_value, stderr_value
        if returncode == 1:
            sys.exit(1)
    except Exception, e:
        sys.stderr.write( "%s\n" % str(e))
        sys.exit()

Thank you in advance.

On Wed, Nov 2, 2011 at 7:42 PM, Peter Cock wrote:
> On Wed, Nov 2, 2011 at 9:03 AM, Mic wrote:
> > Thank you, but is it possible to store the SOAP output in the
> memory/file in
> > order to retrieve the following statistics lines?
> > Total Pairs: 1000 PE
> > Paired:      35 ( 3.50%) PE
> > Singled:     170 ( 8.50%) SE
> > Total Elapsed Time:           24.00
> >       - Load Index Table:     23.22
> >       - Alignment:             0.78
> > Thank you in advance.
>
> Assuming that is written to stdout, just collect it as a pipe (handle)
> via subprocess, or since it is short, use the .communicate method
> and get this as a string.
>
> http://docs.python.org/library/subprocess.html
>
> Peter
>

From p.j.a.cock at googlemail.com Thu Nov 3 09:31:38 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Nov 2011 09:31:38 +0000
Subject: [Biopython] subprocess.Popen problem
In-Reply-To:
References:

Message-ID: On Thu, Nov 3, 2011 at 3:16 AM, Mic wrote: > Thank you, I wrote the following code and not sure whether it is what did > write me. Depending on the tool I would check for a non-zero return code rather than just treating 1 as an error. You are also not collecting stderr/stdout correctly. If you send them to a pipe, the strings from the .communicate will be empty. Rather reads from the process object's .stdout and .stderr handles. See: http://docs.python.org/library/subprocess.html Peter From fernandocharitha at googlemail.com Mon Nov 7 18:31:09 2011 From: fernandocharitha at googlemail.com (Charitha Fernando) Date: Mon, 7 Nov 2011 13:31:09 -0500 Subject: [Biopython] finding single domain proteins Message-ID: Hi guys, I know this is not the appropriate forum to ask this question , I am really sorry for doing so. I am just begining to find my way through protein science--I have a question I want a list of all Single domain proteins in the PDB, I am not sure if there is a list like that? I tried to play with both CATH/SCOP but I am not getting anywhere, is there a list someone has of all the single domain proteins, does not matter if it is all alpha or mixed, just need a list of them Thanks, Fernando From from.d.putto at gmail.com Mon Nov 7 19:10:22 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Mon, 7 Nov 2011 20:10:22 +0100 Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy' Message-ID: Hi All, Consider the following code (from Biopython Cookbook) from Bio import SeqIO uniprot = SeqIO.index("uniprot_sprot.dat", "swiss") handle = open("selected.dat", "w") for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: handle.write(uniprot.get_raw(acc)) handle.close() I want to print only selected part of (ID, description and Taxonomy ) not the full record. 
I modified the code as

for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
    print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy

but this gives error "AttributeError: 'SeqRecord' object has no attribute 'taxonomy' "
Any suggestion !!!!

--
Sheila d angel

From p.j.a.cock at googlemail.com Mon Nov 7 19:38:52 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Nov 2011 19:38:52 +0000
Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy'
In-Reply-To:
References:
Message-ID:

On Mon, Nov 7, 2011 at 7:10 PM, Sheila the angel wrote:
> Hi All,
> Consider the following code (from Biopython Cookbook)
>
> from Bio import SeqIO
> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
> handle = open("selected.dat", "w")
> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
>     handle.write(uniprot.get_raw(acc))
> handle.close()
>
> I want to print only selected part of (ID, description and Taxonomy ) not
> the full record. I modified the code as
>
> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
>     print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy
>
> but this gives error "AttributeError: 'SeqRecord' object has no attribute
> 'taxonomy' "
> Any suggestion !!!!

What makes you think there would be a taxonomy attribute? Is there a mistake in the documentation somewhere? From memory you should try uniprot[acc].annoations["taxonomy"]

Also your code will parse the record three times for each accession, use this instead:

for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
    record = uniprot[acc]
    print record.id, record.description, ...

Peter

From p.j.a.cock at googlemail.com Mon Nov 7 22:01:41 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Nov 2011 22:01:41 +0000
Subject: [Biopython] SeqIO.index AttributeError: 'SeqRecord' object has no attribute 'taxonomy'
In-Reply-To:
References:

Message-ID: On Monday, November 7, 2011, Peter Cock wrote: > On Mon, Nov 7, 2011 at 7:10 PM, Sheila the angel wrote: >> Hi All, >> Consider the following code (from Biopython Cookbook) >> >> from Bio import SeqIO >> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss") >> handle = open("selected.dat", "w") >> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: >> handle.write(uniprot.get_raw(acc)) >> handle.close() >> >> I want to print only selected part of (ID, description and Taxonomy ) not >> the full record. I modified the code as >> >> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: >> print uniprot[acc].id, uniprot[acc].description, uniprot[acc].taxonomy >> >> but this gives error "AttributeError: 'SeqRecord' object has no attribute >> 'taxonomy' " >> Any suggestion !!!! > > What makes you think there would be a taxonomy attribute? Is there > a mistake in the documentation somewhere? From memory you should > try uniprot[acc].annoations["taxonomy"] Typo: annotations Related to this, try dir(...) on Python objects to see what their attributes are. > Also your code will parse the record three times for each accession, > use this instead: > > for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: > record = uniprot[acc] > print.id, record.description, ... > > Peter > Peter From eric.talevich at gmail.com Mon Nov 7 23:26:12 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 7 Nov 2011 18:26:12 -0500 Subject: [Biopython] finding single domain proteins In-Reply-To: References: Message-ID: On Mon, Nov 7, 2011 at 1:31 PM, Charitha Fernando < fernandocharitha at googlemail.com> wrote: > Hi guys, > > I know this is not the appropriate forum to ask this question , I am really > sorry for doing so. > I am just begining to find my way through protein science--I have a > question > I want a list of all Single domain proteins in the PDB, I am not sure if > there is a list like that? 
> > I tried to play with both CATH/SCOP but I am not getting anywhere, is there > a list someone has of all the single domain proteins, does not matter if it > is all alpha or mixed, just need a list of them > Thanks, > Fernando > > Hi Fernando, I think you'll have a better chance of getting this question in front of the right audience if you ask on Biostar: http://biostar.stackexchange.com/ The answer depends on how you define what you're looking for, here. I'll hold off on the details until I see this on Biostar though. -E From bala.biophysics at gmail.com Thu Nov 10 15:46:25 2011 From: bala.biophysics at gmail.com (Bala subramanian) Date: Thu, 10 Nov 2011 16:46:25 +0100 Subject: [Biopython] structural alignment Message-ID: Friends, I have 50 conformations of a peptide. I want to do a pair wise structural alignment and obtain a score for each pair. I would like to know i) if there is any inbuilt function to do structural alignment in biopython ii)if there is a wrapper in biopython for any structural alignment software Thanks, Bala From p.j.a.cock at googlemail.com Thu Nov 10 16:11:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Nov 2011 16:11:15 +0000 Subject: [Biopython] structural alignment In-Reply-To: References: Message-ID: On Thu, Nov 10, 2011 at 3:46 PM, Bala subramanian wrote: > Friends, > I have 50 conformations of a peptide. I want to do a pair wise structural > alignment and obtain a score for each pair. I would like to know > i) if there is any inbuilt function to do structural alignment in biopython > ii)if there is a wrapper in biopython for any structural alignment software > > Thanks, > Bala SVD works nicely for some tasks, e.g. 
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter

From natassa_g_2000 at yahoo.com Tue Nov 15 12:28:16 2011
From: natassa_g_2000 at yahoo.com (natassa)
Date: Tue, 15 Nov 2011 04:28:16 -0800 (PST)
Subject: [Biopython] index_db problem
Message-ID: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>

Hi,

I have a problem with a script using the Bio.SeqIO index_db function: I have previously used it successfully, but as I had to recently change computers I now tried the same script with no success. I get the error:

AttributeError: 'module' object has no attribute 'index_db'

although I have installed biopython 1.58 (the index_db is apparently from 1.57 and on, and on the previous computer I had installed the 1.57). I checked if I have sqlite 3 by importing the module, and this seems OK. So now I am kind of stuck why the module SeqIO from 1.58 gets imported, but not the function.

My function using it is something like:

def Sample_RandomHeader(numbRecs, reps, path):

    species=os.path.basename(path).split('_')[0].split('.')[0]
    newdir=os.path.dirname(path)+'/'+species+str(numbRecs)+'_'+str(reps)+'Randoms/'
    speciesdb=SeqIO.index_db(species+'.idx', path, "fasta")
    ....

and I have python 2.6

Any help would be appreciated!
Thank you,

Anastasia

From p.j.a.cock at googlemail.com Tue Nov 15 12:38:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 15 Nov 2011 12:38:51 +0000
Subject: [Biopython] index_db problem
In-Reply-To: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>
References: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com>
Message-ID:

On Tue, Nov 15, 2011 at 12:28 PM, natassa wrote:
> Hi,
>
> I have a problem with a script using the Bio.SeqIO index_db function:
> I have previously used it successfully, but as I had to recently change
> computers I now tried the same script with no success.
I get the error: > > AttributeError: 'module' object has no attribute 'index_db' > > although I have installed biopython 1.58 (the index_db is apparently > from 1.57 and on, and on the pevious computer I had installed the > 1.57). I checked if i have sqlite 3 by importing the module, and this > seems OK. So now i am kind of stuck why the module SeqIO from > 1.58 gets imported, but not the function. > > My function using it is something like: > ... > > and i have python 2.6 > > Any help would be appreciated! > Thank you, > > Anastasia It sounds like you're getting Biopython 1.57 even if 1.58 is also installed somewhere else. Try this to find out which is being used and where it is installed: import Bio print Bio.__version__ print Bio.__file__ from Bio import SeqIO print SeqIO.__file__ Peter From p.j.a.cock at googlemail.com Tue Nov 15 14:10:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 15 Nov 2011 14:10:06 +0000 Subject: [Biopython] index_db problem In-Reply-To: <1321364191.60299.YahooMailNeo@web39315.mail.mud.yahoo.com> References: <1321360096.80662.YahooMailNeo@web39309.mail.mud.yahoo.com> <1321361890.13423.YahooMailNeo@web39311.mail.mud.yahoo.com> <1321364191.60299.YahooMailNeo@web39315.mail.mud.yahoo.com> Message-ID: Peter wrote: > Hi Anastasia, > > Are you familiar with the PYTHONPATH environment variable? > > Peter On Tue, Nov 15, 2011 at 1:36 PM, natassa wrote: > That rings a bell, thanks! > I added > PYTHONPATH=path-to-1.58libs > export PYTHONPATH > in my .bashrc > and the script now works! > Is it OK only with this libraries though? or are there other 1.58 -installed > files that need to be pointed at somewhere? > Thanks for your help, > Anastasia That should be everything, glad things seem to be working OK now. 
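
As a side note, on a current Python 3 the question of which installed copy of a package would be picked up can be answered without importing it. The sketch below is only an illustration of that idea, not something from the thread: the helper name locate_package is hypothetical, and it relies on the standard library importlib.util.find_spec, which did not exist in the Python 2.6 used above.

```python
import importlib.util


def locate_package(name):
    """Return the file Python would load for top-level `name`, or None.

    The search follows sys.path, so directories listed in PYTHONPATH
    are consulted before site-packages.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # For example, report which Biopython installation would be imported.
    print(locate_package("Bio"))
```

If the printed path points at the older installation, adjusting PYTHONPATH as described above (or removing the stale copy) should change the result.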
Peter From devaniranjan at gmail.com Wed Nov 16 20:39:28 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 16 Nov 2011 15:39:28 -0500 Subject: [Biopython] alignment scores -fitting to Poisson distribution Message-ID: Hello, I did an alignment of a protein with hundreds of other proteins--I got a distribution that is NOT normal (see attached picture). I was wondering if biopython had a way of fitting this to one of the distribution methods? I don't have mathlab/maple software, so if biopython cannot do it but if someone knows of an alternate software that I can use to fit this data--I would appreciate the help. What I want is to say......the target PDB was 'X' sigma away from the mean of the distribution. If it was a normal distribution, it would be easy to do, since the distribution is skewed I am wondering the best way to do it. Thank you, George -------------- next part -------------- A non-text attachment was scrubbed... Name: score.png Type: image/png Size: 8245 bytes Desc: not available URL: From idoerg at gmail.com Wed Nov 16 20:55:42 2011 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 16 Nov 2011 15:55:42 -0500 Subject: [Biopython] alignment scores -fitting to Poisson distribution In-Reply-To: References: Message-ID: It's an extreme value distribution. http://en.wikipedia.org/wiki/Generalized_extreme_value_distribution which is what you generally get from selecting the maximum function from a set of random variables. (You get a normal distribution from a sum of random variables). Aligning proteins provides a maximum function, so multiple alignment scores are EVD distributed. http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html ./I On Wed, Nov 16, 2011 at 3:39 PM, George Devaniranjan wrote: > Hello, > > I did an alignment of a protein with hundreds of other proteins--I got > a distribution that is NOT normal (see attached picture). > I was wondering if biopython had a way of fitting this to one of the > distribution methods? 
> I don't have mathlab/maple software, so if biopython cannot do it but > if someone knows of an alternate software that I can use to fit this > data--I would appreciate the help. > > What I want is to say......the target PDB was 'X' sigma away from the > mean of the distribution. > If it was a normal distribution, it would be easy to do, since the > distribution is skewed I am wondering the best way to do it. > > Thank you, > George > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From caoyaqiang0410 at gmail.com Thu Nov 17 03:20:24 2011 From: caoyaqiang0410 at gmail.com (=?UTF-8?B?5pu55Lqa5by6?=) Date: Thu, 17 Nov 2011 11:20:24 +0800 Subject: [Biopython] Paired-End Read Splitting & Joining Message-ID: Dear mail-lists: Hi, my first time of asking questions in mailing, please excuse me if there is any possible problems. I'm new in Python and biopython, nearly without practically programming experience in Bioinformatics. Recently my work get involved in transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the software needs paired-end sequences in two fastq files. So I wonder can biopython finish the job in a conventient way? Because the paired-end file is too big and can't be done in a conventient way in *Galaxy* Please give me some guide. Thanks. 
Best wishes,
Yaqiang Cao

From p.j.a.cock at googlemail.com Thu Nov 17 09:21:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 09:21:21 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To:
References:
Message-ID:

On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强 wrote:
> Dear mail-lists:
>        Hi, my first time of asking questions in mailing, please excuse me
> if there is any possible problems.
>        I'm new in Python and biopython, nearly without practically
> programming experience in Bioinformatics. Recently my work get involved in
> transcriptome and TopHat(http://tophat.cbcb.umd.edu/manual.html) , the
> software needs paired-end sequences in two fastq files. So I wonder can
> biopython finish the job in a conventient way? Because the paired-end file
> is too big and can't be done in a conventient way in *Galaxy*
>        Please give me some guide. Thanks.
>
> Best wishes,
> Yaqiang Cao

Probably, yes.

So you have one large FASTQ file containing both parts of each pair (say part one and part two, or they might be labelled as the forward and reverse reads), and you want to split this into two FASTQ files?

How are your reads named? The hard part is inferring this, one common scheme uses /1 and /2 suffixes, but Illumina have changed this in their latest pipeline and the part is now in the description instead.

Could you show us the first 6 reads (or so) from the big FASTQ file?

Also are there any single reads in your file, either never paired or orphaned where one of a pair failed QC?

Peter

From p.j.a.cock at googlemail.com Thu Nov 17 12:31:30 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 12:31:30 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To: <4EC4F5C5.8010402@gmail.com>
References:

<4EC4F5C5.8010402@gmail.com>
Message-ID:

On Thu, Nov 17, 2011 at 11:53 AM, Yaqiang Cao wrote:
>
> Thanks for replying.
>
> Yes, I have a .fastq file converted from .sra using one of the NCBI
> SRA tools, fastq-dump. The file is over 1 GB. I want to split it into two
> FASTQ files because TopHat requires two files of paired-end sequences.
> A screenshot of the first 20 lines of the .fastq file is in the attached
> picture file:

Looking at the names, that file seems not to have both parts
of each pair. I looked on the NCBI SRA page, and the library
is described as paired:
http://www.ncbi.nlm.nih.gov/sra?term=srr100235

There only seems to be one SRA file for this accession,
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX042/SRX042254/SRR100235/

i.e. This file:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX042/SRX042254/SRR100235/SRR100235.sra

I'd look more but the SRA website tells me "Our database is
temporarily unavailable. Please come back later."

Peter

From caoyaqiang0410 at gmail.com Thu Nov 17 11:53:41 2011
From: caoyaqiang0410 at gmail.com (Yaqiang Cao)
Date: Thu, 17 Nov 2011 19:53:41 +0800
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To:
References:

Message-ID: <4EC4F5C5.8010402@gmail.com>

On 2011-11-17 17:21, Peter Cock wrote:
> On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强 wrote:
>> Dear mailing list,
>> Hi, this is my first time asking a question on a mailing list, so please
>> excuse me if there are any problems.
>> I'm new to Python and Biopython, with almost no practical programming
>> experience in bioinformatics. Recently my work has come to involve
>> transcriptomes and TopHat (http://tophat.cbcb.umd.edu/manual.html);
>> the software needs paired-end sequences in two FASTQ files. Can Biopython
>> do this job in a convenient way? The paired-end file is too big to handle
>> conveniently in *Galaxy*.
>> Please give me some guidance. Thanks.
>>
>> Best wishes,
>> Yaqiang Cao
> Probably, yes.
>
> So you have one large FASTQ file containing both parts of
> each pair (say part one and part two, or they might be
> labelled as the forward and reverse reads), and you want
> to split this into two FASTQ files?
>
> How are your reads named? The hard part is inferring this:
> one common scheme uses /1 and /2 suffixes, but Illumina
> have changed this in their latest pipeline and the part is
> now in the description instead.
>
> Could you show us the first 6 reads (or so) from the big
> FASTQ file?
>
> Also are there any single reads in your file, either never
> paired or orphaned where one of a pair failed QC?
>
> Peter

Thanks for replying.

Yes, I have a .fastq file converted from .sra using one of the NCBI
SRA tools, fastq-dump. The file is over 1 GB. I want to split it into two
FASTQ files because TopHat requires two files of paired-end sequences.
A screenshot of the first 20 lines of the .fastq file is in the attached
picture file:

And because I'm new, I can't quite understand what you meant by
"
Also are there any single reads in your file, either never
paired or orphaned where one of a pair failed QC?
"

All I have is in the screenshot. The original NCBI SRA number is SRR100235.
Thanks, dear mailing list and Peter.

Best wishes,
Yaqiang Cao
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot-2011-11-17 19:49:04.png
Type: image/png
Size: 122983 bytes
Desc: not available
URL:

From cjfields at illinois.edu Thu Nov 17 13:39:29 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 17 Nov 2011 13:39:29 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To: <4EC4F5C5.8010402@gmail.com>
References:

<4EC4F5C5.8010402@gmail.com>
Message-ID: <17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>

On Nov 17, 2011, at 5:53 AM, Yaqiang Cao wrote:

> On 2011-11-17 17:21, Peter Cock wrote:
>> On Thu, Nov 17, 2011 at 3:20 AM, 曹亚强 wrote:
>>> Dear mailing list,
>>> Hi, this is my first time asking a question on a mailing list, so please
>>> excuse me if there are any problems.
>>> I'm new to Python and Biopython, with almost no practical programming
>>> experience in bioinformatics. Recently my work has come to involve
>>> transcriptomes and TopHat (http://tophat.cbcb.umd.edu/manual.html);
>>> the software needs paired-end sequences in two FASTQ files. Can Biopython
>>> do this job in a convenient way? The paired-end file is too big to handle
>>> conveniently in *Galaxy*.
>>> Please give me some guidance. Thanks.
>>>
>>> Best wishes,
>>> Yaqiang Cao
>> Probably, yes.
>>
>> So you have one large FASTQ file containing both parts of
>> each pair (say part one and part two, or they might be
>> labelled as the forward and reverse reads), and you want
>> to split this into two FASTQ files?
>>
>> How are your reads named? The hard part is inferring this:
>> one common scheme uses /1 and /2 suffixes, but Illumina
>> have changed this in their latest pipeline and the part is
>> now in the description instead.
>>
>> Could you show us the first 6 reads (or so) from the big
>> FASTQ file?
>>
>> Also are there any single reads in your file, either never
>> paired or orphaned where one of a pair failed QC?
>>
>> Peter
> Thanks for replying.
>
> Yes, I have a .fastq file converted from .sra using one of the NCBI
> SRA tools, fastq-dump. The file is over 1 GB. I want to split it into two
> FASTQ files because TopHat requires two files of paired-end sequences.
> A screenshot of the first 20 lines of the .fastq file is in the attached
> picture file:
> And because I'm new, I can't quite understand what you meant by "

This is one of my gripes about the SRA tools, that they (by default)
dump paired-end data as one concatenated string; it's a nasty gotcha.
You need to specify the --split-files option to fastq-dump to dump
these as paired end, and this will split them into two files.

> Also are there any single reads in your file, either never
> paired or orphaned where one of a pair failed QC?
>
> "
>
> All I have is in the screenshot. The original NCBI SRA number is SRR100235.
> Thanks, dear mailing list and Peter.

These should all be matched pairs.

> Best wishes,
> Yaqiang Cao

chris

From p.j.a.cock at googlemail.com Thu Nov 17 14:28:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 14:28:03 +0000
Subject: [Biopython] Paired-End Read Splitting & Joining
In-Reply-To: <17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>
References:

<4EC4F5C5.8010402@gmail.com>
<17C195A1-3EDB-4E78-AE7C-871124998A8C@illinois.edu>
Message-ID:

On Thu, Nov 17, 2011 at 1:39 PM, Fields, Christopher J wrote:
>
> This is one of my gripes about the SRA tools, that they (by default)
> dump paired-end data as one concatenated string; it's a nasty gotcha.
> You need to specify the --split-files option to fastq-dump to dump
> these as paired end, and this will split them into two files.

Thanks Chris - that was one of the possibilities I wanted to
check, but couldn't see the reads via the NCBI website. Sadly
it is still saying "Our database is temporarily unavailable.
Please come back later."

Peter

From petyuk at gmail.com Thu Nov 17 22:51:38 2011
From: petyuk at gmail.com (Vladislav Petyuk)
Date: Thu, 17 Nov 2011 14:51:38 -0800
Subject: [Biopython] fetching chromosome IDs given the organism ID
Message-ID:

I am trying to fetch the chromosome IDs for a given genome.
For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
The piece of Biopython code that used to work for me is:
#---------------------
url = Entrez.esearch(db="genome", term="txid43989")
record = Entrez.read(url)
chromosomeIDs = record["IdList"]
#---------------------
Not anymore. Now it returns the organism id, which is 1608.
Please point me in the right direction on how to get the chromosome ids
given the organism id.
For example:
organism id: 1608
chromosomes ids: NC_010546.1 NC_010547.1 NC_010539.1 NC_010541.1 NC_010542.1 NC_010543.1

Thank you!

From p.j.a.cock at googlemail.com Thu Nov 17 23:09:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 23:09:04 +0000
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To:
References:
Message-ID:

On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote:
> I am trying to fetch the chromosome IDs for a given genome.
> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
> The piece of Biopython code that used to work for me is:
> #---------------------
> url = Entrez.esearch(db="genome", term="txid43989")
> record = Entrez.read(url)
> chromosomeIDs = record["IdList"]
> #---------------------
> Not anymore. Now it returns the organism id, which is 1608.

That's annoying of the NCBI to change things.

> Please point me in the right direction on how to get the chromosome ids
> given the organism id.

Try searching the nucleotide database directly, with
term txid43989[orgn] to restrict the species, and I think
there is another field to restrict to complete genomes.
Have a look at the field list with EInfo (see the Biopython
tutorial for EInfo which explains how to do this).

I would try it myself right now, but the Entrez website
seems very slow from here tonight.

Peter

From cjfields at illinois.edu Fri Nov 18 02:19:39 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 18 Nov 2011 02:19:39 +0000
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To:
References:

Message-ID: <5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu>

On Nov 17, 2011, at 5:09 PM, Peter Cock wrote:

> On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote:
>> I am trying to fetch the chromosome IDs for a given genome.
>> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
>> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
>> The piece of Biopython code that used to work for me is:
>> #---------------------
>> url = Entrez.esearch(db="genome", term="txid43989")
>> record = Entrez.read(url)
>> chromosomeIDs = record["IdList"]
>> #---------------------
>> Not anymore. Now it returns the organism id, which is 1608.
>
> That's annoying of the NCBI to change things.

...but not unusual (ref: BLAST output over the ages).

>> Please point me in the right direction on how to get the chromosome ids
>> given the organism id.
>
> Try searching the nucleotide database directly, with
> term txid43989[orgn] to restrict the species, and I think
> there is another field to restrict to complete genomes.
> Have a look at the field list with EInfo (see the Biopython
> tutorial for EInfo which explains how to do this).
>
> I would try it myself right now, but the Entrez website
> seems very slow from here tonight.
>
> Peter

I think they are doing a lot of work behind the scenes, particularly
with efetch (something to be aware of for all us folks who have
modules pulling data from GenBank).

chris

From p.j.a.cock at googlemail.com Fri Nov 18 09:21:22 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 09:21:22 +0000
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To: <5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu>
References:

<5314D1EA-D37A-4706-B691-90FBFAAC324B@illinois.edu>
Message-ID:

On Fri, Nov 18, 2011 at 2:19 AM, Fields, Christopher J wrote:
>
> I think they are doing a lot of work behind the scenes, particularly
> with efetch (something to be aware of for all us folks who have
> modules pulling data from GenBank).
>
> chris

Yes, this could be related:
http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000081.html

Peter

From devaniranjan at gmail.com Fri Nov 18 15:56:44 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Fri, 18 Nov 2011 10:56:44 -0500
Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer
Message-ID:

Hi,

I would like to calculate the RMSD using biopython. There seem to be
2 functions (SVDSuperimposer and PDB.Superimposer) I could use; is
there a difference?

Another question I have: if I use the C-alpha carbons of proteins,
does it actually calculate the "real" RMSD? What I mean is, for
instance, PyMOL would "throw away" atoms over several cycles to
improve the RMSD. I don't want to do that. Looking at the code, I think
biopython does calculate the "real" RMSD; is this correct?

Thank you very much,
George

From p.j.a.cock at googlemail.com Fri Nov 18 17:35:59 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 17:35:59 +0000
Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer
In-Reply-To:
References:
Message-ID:

On Fri, Nov 18, 2011 at 3:56 PM, George Devaniranjan wrote:
> Hi,
>
> I would like to calculate the RMSD using biopython. There seem to be
> 2 functions (SVDSuperimposer and PDB.Superimposer) I could use; is
> there a difference?

Note Bio.PDB.Superimposer calls Bio.SVDSuperimposer, so
they should do the same thing.
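To make the "real" RMSD in the question concrete: a plain RMSD over
every supplied atom pair, with no outlier-rejection cycles, is just the
root of the mean squared distance. Below is a minimal pure-Python
sketch, assuming the two coordinate lists are in matching order and
already superimposed. The function name rmsd and the toy coordinates
are made up for illustration; this is not Biopython's own code.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain RMSD over every supplied atom pair (no atoms are rejected).

    Assumption: both lists hold (x, y, z) tuples in matching order and
    the two structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy C-alpha coordinates, invented for this example.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(ref, moved))
```

Every atom pair contributes to the sum, so the value cannot be improved
by dropping poorly fitting atoms. In Biopython itself the superposition
step would come first, and the RMSD after fitting is reported by the
superimposer object rather than by a helper like this one.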
Peter

From robert.campbell at queensu.ca Fri Nov 18 17:51:00 2011
From: robert.campbell at queensu.ca (Robert Campbell)
Date: Fri, 18 Nov 2011 12:51:00 -0500
Subject: [Biopython] Difference between SVDSuperimposer and PDB.Superimposer
In-Reply-To:
References:
Message-ID: <20111118125100.064e1f2d@adelie.biochem.queensu.ca>

Hi George,

On Fri, 2011-11-18 10:56 EST, George Devaniranjan wrote:

> Hi,
>
> I would like to calculate the RMSD using biopython. There seem to be
> 2 functions (SVDSuperimposer and PDB.Superimposer) I could use; is
> there a difference?
>
> Another question I have: if I use the C-alpha carbons of proteins,
> does it actually calculate the "real" RMSD? What I mean is, for
> instance, PyMOL would "throw away" atoms over several cycles to
> improve the RMSD. I don't want to do that. Looking at the code, I think
> biopython does calculate the "real" RMSD; is this correct?

You can, of course, prevent PyMOL from rejecting atoms, by telling it to only
do a single cycle (command line option: "cycles=1"). If you want to do batch
aligning of many structures, you can also use a script that will do the
alignment without starting up the graphics.

Cheers,
Rob
--
Robert L. Campbell, Ph.D.
Senior Research Associate/Adjunct Assistant Professor
Dept. of Biomedical & Molecular Sciences, Botterell Hall Rm 644
Queen's University, Kingston, ON K7L 3N6 Canada
Tel: 613-533-6821
http://pldserver1.biochem.queensu.ca/~rlc

From devaniranjan at gmail.com Fri Nov 18 22:13:58 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Fri, 18 Nov 2011 17:13:58 -0500
Subject: [Biopython] make_log_odds_matrix
Message-ID:

The biopython tutorial talks about constructing a substitution
matrix....I used the following lines of code

test_dict={('Y', 'L'): 1552226.0, ('B', 'B'): 18251943.0, ('D', 'G'):
44863831.0, ('D', 'D'): 22086473.0,...........
test_dict gives the frequency with which different amino acids replace
each other

from Bio import SubsMat
my_arm=SubsMat.SeqMat(test_dict)
my_log=SubsMat.make_log_odds_matrix(my_arm)
my_log.print_mat()

I have 2 questions:
1) Is this the correct way to do this?
2) This seems different to the way BLOSUM is generated, there is no
"normalising", so can we really compare this generated matrix to a
BLOSUM matrix?

Thank you,
George

From p.j.a.cock at googlemail.com Mon Nov 21 16:52:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 21 Nov 2011 16:52:18 +0000
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To:
References:

Message-ID:

On Thu, Nov 17, 2011 at 11:09 PM, Peter Cock wrote:
> On Thu, Nov 17, 2011 at 10:51 PM, Vladislav Petyuk wrote:
>> I am trying to fetch the chromosome IDs for a given genome.
>> For example Cyanothece sp 51142 has 2 chromosomes and 4 plasmids
>> http://www.ncbi.nlm.nih.gov/genome?term=1608%5Buid%5D#tabs-1608-2
>> The piece of Biopython code that used to work for me is:
>> #---------------------
>> url = Entrez.esearch(db="genome", term="txid43989")
>> record = Entrez.read(url)
>> chromosomeIDs = record["IdList"]
>> #---------------------
>> Not anymore. Now it returns the organism id, which is 1608.
>
> That's annoying of the NCBI to change things.
>

The NCBI have just made a public announcement by email today
(21 Nov 2011), and apologized for the lack of notice:
http://www.ncbi.nlm.nih.gov/mailman/pipermail/utilities-announce/2011-November/000083.html

Judging from the URL it was also on their news page the day
you found the problem, but I hadn't seen that then:
http://www.ncbi.nlm.nih.gov/About/news/17Nov2011.html

It looks like a sensible long term change to the genome
database.

Regards,

Peter

From petyuk at gmail.com Mon Nov 21 22:13:00 2011
From: petyuk at gmail.com (Vladislav Petyuk)
Date: Mon, 21 Nov 2011 14:13:00 -0800
Subject: [Biopython] fetching chromosome IDs given the organism ID
In-Reply-To:
References: