From fjruizruano at gmail.com Mon Jan 3 16:47:41 2011
From: fjruizruano at gmail.com (Francisco J. Ruiz-Ruano Campaña)
Date: Mon, 3 Jan 2011 22:47:41 +0100
Subject: [Biopython] search and delete singleton sites
Message-ID:

Hello, list.

I need to find singleton sites in a FASTA alignment and then, at each position that differs, put in the nucleotide that the rest of the sequences have at that position. Any ideas? I'm starting with Python and Biopython.

Thanks.

Francisco.

From dalke at dalkescientific.com Tue Jan 4 21:59:09 2011
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 5 Jan 2011 03:59:09 +0100
Subject: [Biopython] Bio.trie
In-Reply-To: <524070.34823.qm@web62404.mail.re1.yahoo.com>
References: <524070.34823.qm@web62404.mail.re1.yahoo.com>
Message-ID: <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com>

On Dec 29, 2010, at 4:24 AM, Michiel de Hoon wrote:
> We would like to know though how many users Bio.trie has, so we can decide whether it is worthwhile to update this module. If you are using Bio.trie, please let us know (preferably via the mailing list). If there are no current users, I suggest that we deprecate and later remove this module from Biopython.

I am not a user but the other day I was looking through the Python bug list and came across:

http://bugs.python.org/issue9520

The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words.
Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long:
* bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
* defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior

Andrew
dalke at dalkescientific.com

From ruchira.datta at gmail.com Tue Jan 4 22:10:53 2011
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Tue, 4 Jan 2011 19:10:53 -0800
Subject: [Biopython] Bio.trie
In-Reply-To: <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com>
References: <524070.34823.qm@web62404.mail.re1.yahoo.com> <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com>
Message-ID:

I had also seen that. Sorry for my delay in replying. A trie is an important data structure with many uses. The use I had in mind is: tries are the lowest-latency way of implementing an autosuggest/autocomplete (e.g., if you want to allow users to pick a member of the NCBI taxonomy by scientific name).

--Ruchira

On Tue, Jan 4, 2011 at 6:59 PM, Andrew Dalke wrote:

> On Dec 29, 2010, at 4:24 AM, Michiel de Hoon wrote:
> > We would like to know though how many users Bio.trie has, so we can
> decide whether it is worthwhile to update this module. If you are using
> Bio.trie, please let us know (preferably via the mailing list). If there are
> no current users, I suggest that we deprecate and later remove this module
> from Biopython.
>
> I am not a user but the other day I was looking through the Python bug list
> and came across:
>
> http://bugs.python.org/issue9520
>
> The best existing implementation I've been able to find so far
> is one in the BioPython. Compared to defaultdict(int) on the
> task of counting words.
> Dataset 123,981,712 words (6,504,484
> unique), 1..21 characters long:
> * bio.tree - 459 Mb/0.13 Hours, good O(1) behavior
> * defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior
>
> Andrew
> dalke at dalkescientific.com

From rmb32 at cornell.edu Wed Jan 5 11:17:18 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 05 Jan 2011 08:17:18 -0800
Subject: [Biopython] GSOC 2011
In-Reply-To:
References:
Message-ID: <4D24998E.2000200@cornell.edu>

Hi Akshay,

Thanks for your interest! You can subscribe to the OBF-GSOC mailing list by filling in the subscription form at http://lists.open-bio.org/mailman/listinfo/gsoc.

Rob

Akshay Goel wrote:
> Dear Sir,
>
> I found your organization while looking through the list of GSOC 2010
> mentoring organizations. I am proficient in Python, C/C++, PHP and JSP,
> and would like to get involved with the project. I would, therefore,
> like to join the mailing list and know more about current projects.
>
> Thanking You
> Yours Sincerely
> Akshay Goel

From beiko at cs.dal.ca Wed Jan 12 09:06:43 2011
From: beiko at cs.dal.ca (Robert Beiko)
Date: Wed, 12 Jan 2011 10:06:43 -0400
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
Message-ID: <4D2DB573.6020505@cs.dal.ca>

Hi,

I have been experimenting with the excellent Phylo package in BioPython, and am having a bit of trouble with the 'root_with_outgroup' method.
Specifically, it seems to work fine when I apply it to internal nodes, but when I try to root on a terminal, I get an error:

---------------

Traceback (most recent call last):
  File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line 16, in <module>
    tree.root_with_outgroup(leaf)
  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 777, in root_with_outgroup
    parent.clades.pop(parent.clades.index(new_parent))
ValueError: list.index(x): x not in list

---------------

Here is an example script:

import io
import sys
from Bio import Phylo

infile = 'Example1.tre'
trees = Phylo.parse(infile,'newick')

for tree in trees:
    leafList = tree.get_terminals()
    for leaf in leafList:
        tree.root_with_outgroup(leaf)

---------------

And here is the file 'Example1.tre'

((AAA,BBB),(CCC,DDD));

[I have tried many permutations of the tree, and in no case have I been able to root using a terminal].

Line 777 in 'BaseTree.py' is:
parent.clades.pop(parent.clades.index(new_parent))

so it appears that new_parent is not in the list 'parent.clades'. My Python is rather rudimentary so I haven't been able to figure out why this might arise.

If I change 'get_terminals()' to 'get_nonterminals()' in the script above, everything works fine. I imagine I could get things to work by introducing a dummy sister node for each terminal I would like to root on, and then rooting on the LCA of the terminal and its dummy sister. But is there something I am doing wrong in the script above, that could easily be remedied without a hack?

Python version is 2.6, BioPython 1.56.
Best wishes and thanks,
Rob Beiko

From eric.talevich at gmail.com Wed Jan 12 13:31:12 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 12 Jan 2011 13:31:12 -0500
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
In-Reply-To: <4D2DB573.6020505@cs.dal.ca>
References: <4D2DB573.6020505@cs.dal.ca>
Message-ID:

On Wed, Jan 12, 2011 at 9:06 AM, Robert Beiko wrote:

> Hi,
>
> I have been experimenting with the excellent Phylo package in BioPython,
> and am having a bit of trouble with the 'root_with_outgroup' method.
>
> Specifically, it seems to work fine when I apply it to internal nodes, but
> when I try to root on a terminal, I get an error:
>
> ---------------
>
> Traceback (most recent call last):
>  File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line 16, in
> <module>
>    tree.root_with_outgroup(leaf)
>  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 777, in
> root_with_outgroup
>    parent.clades.pop(parent.clades.index(new_parent))
> ValueError: list.index(x): x not in list
>
> ---------------
>
> Here is an example script:
>
> import io
> import sys
> from Bio import Phylo
>
> infile = 'Example1.tre'
> trees = Phylo.parse(infile,'newick')
>
> for tree in trees:
>     leafList = tree.get_terminals()
>     for leaf in leafList:
>         tree.root_with_outgroup(leaf)
>
> ---------------
>
> And here is the file 'Example1.tre'
>
> ((AAA,BBB),(CCC,DDD));
>
> [I have tried many permutations of the tree, and in no case have I been
> able to root using a terminal].
>
> Line 777 in 'BaseTree.py' is:
> parent.clades.pop(parent.clades.index(new_parent))
>
> so it appears that new_parent is not in the list 'parent.clades'. My Python
> is rather rudimentary so I haven't been able to figure out why this might
> arise.
>
> If I change 'get_terminals()' to 'get_nonterminals()' in the script above,
> everything works fine.
> I imagine I could get things to work by introducing a
> dummy sister node for each terminal I would like to root on, and then
> rooting on the LCA of the terminal and its dummy sister. But is there
> something I am doing wrong in the script above, that could easily be
> remedied without a hack?
>
> Python version is 2.6, BioPython 1.56.
>
> Best wishes and thanks,
> Rob Beiko
>

Hi Robert,

Thanks for reporting this. It's certainly possible that there's a bug here; I'll take a closer look at the code.

The odd thing I notice about your code is that you're rerooting inside a loop. The root_with_outgroup method operates in-place, so the "tree" object is changing with each iteration. I assume your original code was doing something after rerooting each time, like writing out a new tree.

The difference in the code between rooting with terminal versus non-terminal nodes is that rooting with a terminal requires creating a new internal node just above the terminal, then using the new node as the new root. (Rerooting with an existing internal node just reroots at that node without creating any new objects.) So if this is done repeatedly, that could be the source of some trouble.

Do you know if the loop is crashing in the first iteration or the second?

Best regards,
Eric

From beiko at cs.dal.ca Wed Jan 12 13:46:24 2011
From: beiko at cs.dal.ca (Robert Beiko)
Date: Wed, 12 Jan 2011 14:46:24 -0400
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
In-Reply-To:
References: <4D2DB573.6020505@cs.dal.ca>
Message-ID: <4D2DF700.3030905@cs.dal.ca>

Hi Eric,

Thank you very much for your quick reply.

Indeed the full script is doing something much more interesting (rolling up in-paralogs with attempts at alternative rootings), but this is my attempt to cut out all of the other things I might have done wrong :^>

The loop is crashing the first time I try it.
Indeed, the following variation fails as well:

import io
import sys
from Bio import Phylo

infile = 'Example1.tre'
trees = Phylo.parse(infile,'newick')

for tree in trees:
    leafList = tree.get_terminals()
    tree.root_with_outgroup(leafList[0])

----

again, the same code with internals rather than terminals works fine.

Best wishes,
Rob

On 12/01/2011 2:31 PM, Eric Talevich wrote:
> On Wed, Jan 12, 2011 at 9:06 AM, Robert Beiko wrote:
>
>     Hi,
>
>     I have been experimenting with the excellent Phylo package in
>     BioPython, and am having a bit of trouble with the
>     'root_with_outgroup' method.
>
>     Specifically, it seems to work fine when I apply it to internal
>     nodes, but when I try to root on a terminal, I get an error:
>
>     ---------------
>
>     Traceback (most recent call last):
>      File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line
>     16, in <module>
>        tree.root_with_outgroup(leaf)
>      File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line
>     777, in root_with_outgroup
>        parent.clades.pop(parent.clades.index(new_parent))
>     ValueError: list.index(x): x not in list
>
>     ---------------
>
>     Here is an example script:
>
>     import io
>     import sys
>     from Bio import Phylo
>
>     infile = 'Example1.tre'
>     trees = Phylo.parse(infile,'newick')
>
>     for tree in trees:
>         leafList = tree.get_terminals()
>         for leaf in leafList:
>             tree.root_with_outgroup(leaf)
>
>     ---------------
>
>     And here is the file 'Example1.tre'
>
>     ((AAA,BBB),(CCC,DDD));
>
>     [I have tried many permutations of the tree, and in no case have I
>     been able to root using a terminal].
>
>     Line 777 in 'BaseTree.py' is:
>     parent.clades.pop(parent.clades.index(new_parent))
>
>     so it appears that new_parent is not in the list 'parent.clades'.
>     My Python is rather rudimentary so I haven't been able to figure
>     out why this might arise.
>
>     If I change 'get_terminals()' to 'get_nonterminals()' in the
>     script above, everything works fine.
>     I imagine I could get things
>     to work by introducing a dummy sister node for each terminal I
>     would like to root on, and then rooting on the LCA of the terminal
>     and its dummy sister. But is there something I am doing wrong in
>     the script above, that could easily be remedied without a hack?
>
>     Python version is 2.6, BioPython 1.56.
>
>     Best wishes and thanks,
>     Rob Beiko
>
>
> Hi Robert,
>
> Thanks for reporting this. It's certainly possible that there's a bug
> here; I'll take a closer look at the code.
>
> The odd thing I notice about your code is that you're rerooting inside
> a loop. The root_with_outgroup method operates in-place, so the "tree"
> object is changing with each iteration. I assume your original code
> was doing something after rerooting each time, like writing out a new
> tree.
>
> The difference in the code between rooting with terminal versus
> non-terminal nodes is that rooting with a terminal requires creating a
> new internal node just above the terminal, then using the new node as
> the new root. (Rerooting with an existing internal node just reroots
> at that node without creating any new objects.) So if this is done
> repeatedly, that could be the source of some trouble.
>
> Do you know if the loop is crashing in the first iteration or the second?
>
> Best regards,
> Eric

From eric.talevich at gmail.com Wed Jan 12 20:55:23 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 12 Jan 2011 20:55:23 -0500
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
In-Reply-To: <4D2DF700.3030905@cs.dal.ca>
References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca>
Message-ID:

Hi Rob,

This was an outright bug in Bio.Phylo, so thanks again for reporting it.
I've pushed a fix to GitHub:
https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09

For your own work, you can get this fix by either:
(a) checking out a development copy of Biopython from GitHub (the master branch is fairly safe) and reinstalling, or
(b) applying just this fix to your copy of Bio.Phylo in-place -- i.e. editing your existing Biopython installation. You can replace the file Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects.

Cheers,
Eric

On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko wrote:

> Hi Eric,
>
> Thank you very much for your quick reply.
>
> Indeed the full script is doing something much more interesting (rolling up
> in-paralogs with attempts at alternative rootings), but this is my attempt
> to cut out all of the other things I might have done wrong :^>
>
> The loop is crashing the first time I try it. Indeed, the following
> variation fails as well:
>
>
> import io
> import sys
> from Bio import Phylo
>
> infile = 'Example1.tre'
> trees = Phylo.parse(infile,'newick')
>
> for tree in trees:
>     leafList = tree.get_terminals()
>     tree.root_with_outgroup(leafList[0])
>
> ----
>
> again, the same code with internals rather than terminals works fine.
>
> Best wishes,
> Rob
>

From fabrice.ciup at gmail.com Thu Jan 13 05:51:01 2011
From: fabrice.ciup at gmail.com (Fabrice Tourre)
Date: Thu, 13 Jan 2011 11:51:01 +0100
Subject: [Biopython] I have little question. How can I get the all CpG sites genome postion for mm9?
Message-ID:

Hi List,

I have a small question. How can I get the genome positions of all CpG sites for mm9? That is, the position of every CpG site in the mm9 assembly.

Thanks.
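
One possible starting point for the CpG question above, offered only as a rough sketch: scan each chromosome sequence for the dinucleotide CG and record where each one starts. The helper name cpg_positions and the toy sequence are made up for illustration, positions are 0-based, and only the forward strand is scanned. With Biopython, each chromosome of an mm9 FASTA file could presumably be read with SeqIO.parse(..., "fasta") and passed in as str(record.seq).

```python
def cpg_positions(sequence):
    """Yield the 0-based index of the C in every CG dinucleotide."""
    # Genome FASTA files often soft-mask repeats in lowercase, so compare
    # in upper case.
    upper = sequence.upper()
    start = upper.find("CG")
    while start != -1:
        yield start
        start = upper.find("CG", start + 1)


# Toy sequence standing in for one chromosome (not real mm9 data).
toy_chromosome = "ACGTcgCG"
print(list(cpg_positions(toy_chromosome)))
```

Because this is a generator, a whole chromosome can be scanned without holding every position in memory at once; the positions could be written straight to a file as they are found.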
From beiko at cs.dal.ca Thu Jan 13 11:48:32 2011
From: beiko at cs.dal.ca (Robert Beiko)
Date: Thu, 13 Jan 2011 12:48:32 -0400
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
In-Reply-To:
References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca>
Message-ID: <4D2F2CE0.9000400@cs.dal.ca>

Hi Eric,

I applied the fix and many of the terminal rootings now work. Thanks!

I am still getting errors on a smaller subset of trees, though. The simplest example is this one (in a file called Example1.tre):

(A,B,(C,D));

I have modified test.py to do things slightly differently:

--------------------------

import io
import sys
from Bio import Phylo

REROOT_INTERNAL = 0

infile = 'Example1.tre'
trees = Phylo.parse(infile,'newick')

for tree in trees:
    print "\nInitial tree:"
    Phylo.write(tree,sys.stdout,'newick')

    if REROOT_INTERNAL == 1:
        for iNode in tree.get_nonterminals():
            if len(tree.get_path(iNode)) > 1:
                tree.root_with_outgroup(iNode)
                break

        print "\nRerooted with nice internal node:"
        Phylo.write(tree,sys.stdout,'newick')

    leafList = tree.get_terminals()
    print "\nAttempting to root on terminal " + leafList[0].name

    tree.root_with_outgroup(leafList[0])

    print "Rerooted on terminal:"
    Phylo.write(tree,sys.stdout,'newick')

-------------------------------

[Apologies for all the print statements and C-like constants]

If REROOT_INTERNAL is set to 0, then we go right to the 'leafList = tree.get_terminals()' line and get the following error:

Initial tree:
(A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;

Attempting to root on terminal A

Traceback (most recent call last):
  File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in <module>
    tree.root_with_outgroup(leafList[0])
  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in root_with_outgroup
    parent = outgroup_path.pop(-2)
IndexError: pop index out of range

which I assume occurs either because the terminal node (A) has the root as its immediate parent, and/or because
it's one leaf from an initial trifurcation. Note that ((A,B),(C,D)); works fine in this case.

Setting REROOT_INTERNAL to 1 is my attempt to get around this problem by first rooting on a 'safe' internal node, and then rooting on the terminal. On many of the larger trees I am working with, this solves the problem. But in the case of the tree above, it seems that the original trifurcation remains in place. A few of the larger trees I have tested also retain this trifurcation, even if branches are moved around. Setting REROOT_INTERNAL to 1 gives me the following output:

Initial tree:
(A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;

Rerooted with nice internal node:
(A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;

Attempting to root on terminal A

Traceback (most recent call last):
  File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in <module>
    tree.root_with_outgroup(leafList[0])
  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in root_with_outgroup
    parent = outgroup_path.pop(-2)
IndexError: pop index out of range

Now, if I change the line

for iNode in tree.get_nonterminals():

to

for iNode in tree.get_terminals():

Then I get the desired behaviour. This succeeds as a workaround as long as I check to make sure that I'm not rooting again on the terminal that I just rooted on.

Best wishes,
Rob

On 12/01/2011 9:55 PM, Eric Talevich wrote:
> Hi Rob,
>
> This was an outright bug in Bio.Phylo, so thanks again for reporting
> it. I've pushed a fix to GitHub:
> https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09
>
> For your own work, you can get this fix by either:
> (a) checking out a development copy of Biopython from GitHub (the
> master branch is fairly safe) and reinstalling, or
> (b) applying just this fix to your copy of Bio.Phylo in-place -- i.e.
> editing your existing Biopython installation.
> You can replace the file
> Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects.
>
> Cheers,
> Eric
>
> On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko wrote:
>
>     Hi Eric,
>
>     Thank you very much for your quick reply.
>
>     Indeed the full script is doing something much more interesting
>     (rolling up in-paralogs with attempts at alternative rootings),
>     but this is my attempt to cut out all of the other things I might
>     have done wrong :^>
>
>     The loop is crashing the first time I try it. Indeed, the
>     following variation fails as well:
>
>
>     import io
>     import sys
>     from Bio import Phylo
>
>     infile = 'Example1.tre'
>     trees = Phylo.parse(infile,'newick')
>
>     for tree in trees:
>         leafList = tree.get_terminals()
>         tree.root_with_outgroup(leafList[0])
>
>     ----
>
>     again, the same code with internals rather than terminals works fine.
>
>     Best wishes,
>     Rob
>

From eric.talevich at gmail.com Fri Jan 14 23:50:58 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 14 Jan 2011 23:50:58 -0500
Subject: [Biopython] Phylo: rerooting a tree with a terminal node
In-Reply-To: <4D2F2CE0.9000400@cs.dal.ca>
References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca> <4D2F2CE0.9000400@cs.dal.ca>
Message-ID:

Hi Rob,

This should work now:
https://github.com/biopython/biopython/commit/e9cfcc3680a5b5692f91e560ea08e51515c9c757

I also added another unit test based on your example -- looping through all the nodes in a few contrived trees, rerooting at each node and testing that the total tree length doesn't change. It passes now. So, thanks for the test!

Cheers,
Eric

On Thu, Jan 13, 2011 at 11:48 AM, Robert Beiko wrote:

> Hi Eric,
>
> I applied the fix and many of the terminal rootings now work. Thanks!
>
> I am still getting errors on a smaller subset of trees, though.
> The simplest example is this one (in a file called Example1.tre):
>
> (A,B,(C,D));
>
> I have modified test.py to do things slightly differently:
>
> --------------------------
>
>
> import io
> import sys
> from Bio import Phylo
>
> REROOT_INTERNAL = 0
>
>
> infile = 'Example1.tre'
> trees = Phylo.parse(infile,'newick')
>
> for tree in trees:
>     print "\nInitial tree:"
>     Phylo.write(tree,sys.stdout,'newick')
>
>     if REROOT_INTERNAL == 1:
>         for iNode in tree.get_nonterminals():
>             if len(tree.get_path(iNode)) > 1:
>                 tree.root_with_outgroup(iNode)
>                 break
>
>         print "\nRerooted with nice internal node:"
>         Phylo.write(tree,sys.stdout,'newick')
>
>     leafList = tree.get_terminals()
>     print "\nAttempting to root on terminal " + leafList[0].name
>
>     tree.root_with_outgroup(leafList[0])
>
>     print "Rerooted on terminal:"
>     Phylo.write(tree,sys.stdout,'newick')
>
> -------------------------------
>
> [Apologies for all the print statements and C-like constants]
>
> If REROOT_INTERNAL is set to 0, then we go right to the 'leafList =
> tree.get_terminals()' line and get the following error:
>
> Initial tree:
> (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;
>
> Attempting to root on terminal A
>
> Traceback (most recent call last):
>  File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in
> <module>
>    tree.root_with_outgroup(leafList[0])
>  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in
> root_with_outgroup
>    parent = outgroup_path.pop(-2)
> IndexError: pop index out of range
>
> which I assume occurs either because the terminal node (A) has the root as
> its immediate parent, and/or because it's one leaf from an initial
> trifurcation. Note that ((A,B),(C,D)); works fine in this case.
>
> Setting REROOT_INTERNAL to 1 is my attempt to get around this problem by
> first rooting on a 'safe' internal node, and then rooting on the terminal.
> On many of the larger trees I am working with, this solves the problem.
> But in the case of the tree above, it seems that the original trifurcation
> remains in place. A few of the larger trees I have tested also retain this
> trifurcation, even if branches are moved around. Setting REROOT_INTERNAL to
> 1 gives me the following output:
>
> Initial tree:
> (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;
>
> Rerooted with nice internal node:
> (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000;
>
> Attempting to root on terminal A
>
> Traceback (most recent call last):
>  File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in
> <module>
>    tree.root_with_outgroup(leafList[0])
>  File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in
> root_with_outgroup
>    parent = outgroup_path.pop(-2)
> IndexError: pop index out of range
>
> Now, if I change the line
>
> for iNode in tree.get_nonterminals():
>
> to
>
> for iNode in tree.get_terminals():
>
> Then I get the desired behaviour. This succeeds as a workaround as long as
> I check to make sure that I'm not rooting again on the terminal that I just
> rooted on.
>
> Best wishes,
> Rob
>
>
> On 12/01/2011 9:55 PM, Eric Talevich wrote:
>
> Hi Rob,
>
> This was an outright bug in Bio.Phylo, so thanks again for reporting it.
> I've pushed a fix to GitHub:
>
> https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09
>
> For your own work, you can get this fix by either:
> (a) checking out a development copy of Biopython from GitHub (the master
> branch is fairly safe) and reinstalling, or
> (b) applying just this fix to your copy of Bio.Phylo in-place -- i.e.
> editing your existing Biopython installation. You can replace the file
> Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects.
>
> Cheers,
> Eric
>
> On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko wrote:
>
>> Hi Eric,
>>
>> Thank you very much for your quick reply.
>>
>> Indeed the full script is doing something much more interesting (rolling
>> up in-paralogs with attempts at alternative rootings), but this is my
>> attempt to cut out all of the other things I might have done wrong :^>
>>
>> The loop is crashing the first time I try it. Indeed, the following
>> variation fails as well:
>>
>>
>> import io
>> import sys
>> from Bio import Phylo
>>
>> infile = 'Example1.tre'
>> trees = Phylo.parse(infile,'newick')
>>
>> for tree in trees:
>>     leafList = tree.get_terminals()
>>     tree.root_with_outgroup(leafList[0])
>>
>> ----
>>
>> again, the same code with internals rather than terminals works fine.
>>
>> Best wishes,
>> Rob
>>

From idoerg at gmail.com Sun Jan 16 13:48:09 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 16 Jan 2011 13:48:09 -0500
Subject: [Biopython] FASTQ to qual+fasta
Message-ID:

Question regarding the use of SeqIO.convert: how do I convert a FASTQ file to qual and fasta files? Currently it seems that I have to run SeqIO.convert twice e.g.:

SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")

Or am I missing something?

Thanks,

./I

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From p.j.a.cock at googlemail.com Sun Jan 16 14:25:43 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Jan 2011 19:25:43 +0000
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To:
References:
Message-ID:

On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
> question regarding the use of SeqIO.convert: how do I convert a FASTQ file
> to qual and fasta files? Currently it seems that I have to run SeqIO.convert
> twice e.g.:
>
> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>
> Or am I missing something?
>
> Thanks,
>
> ./I

Hi Iddo,

That is almost the simplest solution, yes.
You can use filenames directly:

SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")

Is it a bit slow for you?

Using SeqIO.convert(...) in this case does use optimised code for FASTQ to FASTA, but currently we don't have a similar fast FASTQ to QUAL function. See Bio/SeqIO/_convert.py if you want to know how this is implemented. I can see several tricks for FASTQ to QUAL which should work... do you fancy trying this yourself?

Alternatively, you could try combining a single call to SeqIO.parse(...) to iterate over the records as SeqRecord objects with itertools.tee to split this iterator in two to give it to two copies of SeqIO.write(...) to write FASTA and QUAL. I don't know how well that would work with memory consumption, but it would make only a single pass through the FASTQ file.

If speed really matters here, first we should add FASTQ to QUAL to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess).

Peter

From idoerg at gmail.com Sun Jan 16 17:35:35 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 16 Jan 2011 17:35:35 -0500
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To:
References:
Message-ID: <4D3372B7.1080801@gmail.com>

On 01/16/2011 02:25 PM, Peter Cock wrote:
> On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
>> question regarding the use of SeqIO.convert: how do I convert a FASTQ file
>> to qual and fasta files? Currently it seems that I have to run SeqIO.convert
>> twice e.g.:
>>
>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>>
>> Or am I missing something?
>>
>> Thanks,
>>
>> ./I
> Hi Iddo,
>
> That is almost the simplest solution, yes.
> You can use filenames directly:
>
> SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
> SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")
>
> Is it a bit slow for you?
>

Well, although elegant, in this case I am running two loops, where one should suffice.

> Using SeqIO.convert(...) in this case does use optimised code for FASTQ
> to FASTA, but currently we don't have a similar fast FASTQ to QUAL
> function. See Bio/SeqIO/_convert.py if you want to know how this is
> implemented. I can see several tricks for FASTQ to QUAL which should
> work... do you fancy trying this yourself?

I wish I had the time.... :(

> Alternatively, you could try combining a single call to SeqIO.parse(...) to
> iterate over the records as SeqRecord objects with itertools.tee to split
> this iterator in two to give it to two copies of SeqIO.write(...) to write
> FASTA and QUAL. I don't know how well that would work with memory
> consumption, but it would make only a single pass through the FASTQ file.

That's actually what I ended up doing.

> If speed really matters here, first we should add FASTQ to QUAL
> to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for
> FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess).
>
> Peter

I think a fastq to fasta & qual would be best. I'll look into the QualityIO module and see if my code can be massaged in there.

Thanks,

Iddo

--
Iddo Friedberg, Ph.D.
http://iddo-friedberg.org/contact.html

From p.j.a.cock at googlemail.com Sun Jan 16 17:58:30 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Jan 2011 22:58:30 +0000
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To: <4D3372B7.1080801@gmail.com>
References: <4D3372B7.1080801@gmail.com>
Message-ID:

On Sun, Jan 16, 2011 at 10:35 PM, Iddo Friedberg wrote:
> On 01/16/2011 02:25 PM, Peter Cock wrote:
>>
>> On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
>>>
>>> question regarding the use of SeqIO.convert: how do I convert a
>>> FASTQ file to qual and fasta files? Currently it seems that I have
>>> to run SeqIO.convert twice e.g.:
>>>
>>>
>>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
>>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>>>
>>> Or am I missing something?
>>>
>>> Thanks,
>>>
>>> ./I
>>
>> Hi Iddo,
>>
>> That is almost the simplest solution, yes. You can use filenames directly:
>>
>> SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
>> SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")
>>
>> Is it a bit slow for you?
>>
>
> Well, although elegant, in this case I am running two loops, where one
> should suffice.

KISS?

>> Using SeqIO.convert(...) in this case does use optimised code for FASTQ
>> to FASTA, but currently we don't have a similar fast FASTQ to QUAL
>> function. See Bio/SeqIO/_convert.py if you want to know how this is
>> implemented. I can see several tricks for FASTQ to QUAL which should
>> work... do you fancy trying this yourself?
>
> I wish I had the time.... :(

I can picture how I'd solve this - it shouldn't take me too long.

>> Alternatively, you could try combining a single call to SeqIO.parse(...)
>> to iterate over the records as SeqRecord objects with itertools.tee to
>> split this iterator in two to give it to two copies of SeqIO.write(...) to write
>> FASTA and QUAL.
I don't know how well that would work with memory >> consumption, but it would make only a single pass through the FASTQ file. > > That's actually what I ended up doing. Do you have the timings of this versus the two calls to SeqIO.convert()? I'd also be curious to see your code for this. >> If speed really matters here, first we should add FASTQ to QUAL >> to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for >> FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess). > > I think a fastq to fasta & qual would be best. I'll look into the QualityIO > module and see if my code can be massaged in there. Maybe - assuming it would be faster than two calls to SeqIO.convert (once FASTQ to QUAL is optimised). Peter From p.j.a.cock at googlemail.com Mon Jan 17 07:30:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 17 Jan 2011 12:30:44 +0000 Subject: [Biopython] FASTQ to qual+fasta In-Reply-To: References: <4D3372B7.1080801@gmail.com> Message-ID: On Sun, Jan 16, 2011 at 10:58 PM, Peter wrote: >>> Using SeqIO.convert(...) in this case does use optimised code for FASTQ >>> to FASTA, but currently we don't have a similar fast FASTQ to QUAL >>> function. See Bio/SeqIO/_convert.py if you want to know how this is >>> implemented. I can see several tricks for FASTQ to QUAL which should >>> work... do you fancy trying this yourself? >> >> I wish I had the time.... :( > > I can picture how I'd solve this - it shouldn't take me too long. Done: https://github.com/biopython/biopython/commit/e26030a290b6acdf5a5fc431056593cccc5d892a This makes FASTQ to QUAL about three times faster. There is probably scope for speeding up how we do the line wrapping in QUAL output - both in the normal SeqRecord based code called by Bio.SeqIO.write() and in the new optimised code for Bio.SeqIO.convert(). I'd still like to see your itertools.tee solution and your timings of it against calling Bio.SeqIO.convert twice. 
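A minimal sketch of the single-pass idea discussed in this thread, assuming plain four-line FASTQ records (no wrapped sequence lines) with Sanger encoding (ASCII offset 33). It uses only the standard library and is not the optimised code in Bio/SeqIO/_convert.py; the function name fastq_to_fasta_and_qual and its offset parameter are hypothetical, and unlike Biopython's QUAL writer it does not wrap long lines.

```python
import io


def fastq_to_fasta_and_qual(fastq_handle, fasta_handle, qual_handle, offset=33):
    """Write FASTA and QUAL records from a single pass over a FASTQ handle.

    Assumes four lines per record and PHRED scores stored as
    chr(score + offset). Returns the number of records written.
    """
    count = 0
    while True:
        title = fastq_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # the '+' separator line, ignored here
        qual = fastq_handle.readline().rstrip("\n")
        if not title.startswith("@") or len(seq) != len(qual):
            raise ValueError("Malformed FASTQ record near %r" % title)
        name = title[1:].rstrip("\n")
        fasta_handle.write(">%s\n%s\n" % (name, seq))
        scores = " ".join(str(ord(char) - offset) for char in qual)
        qual_handle.write(">%s\n%s\n" % (name, scores))
        count += 1
    return count


# Small in-memory demonstration with made-up reads.
demo_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n")
demo_fasta, demo_qual = io.StringIO(), io.StringIO()
fastq_to_fasta_and_qual(demo_fastq, demo_fasta, demo_qual)
print(demo_fasta.getvalue(), demo_qual.getvalue(), sep="")
```

Whether this beats two calls to SeqIO.convert would still need timing on real data.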
Peter From almeida at cim.sld.cu Tue Jan 18 14:14:56 2011 From: almeida at cim.sld.cu (Yasser Almeida Hernández) Date: Tue, 18 Jan 2011 14:14:56 -0500 Subject: [Biopython] Save coordinates of selected residues... Message-ID: <1295378096.1796.7.camel@almeida-desktop> Hi all... I wonder how can i save the coordinates of a list of residues from a structure. For example, from the structure 154L i want to save in a pdb file the next amino acids: PHE 123 HIS 101 GLY 150 PHE 123 TYR 147 HIS 101 PHE 123 ASP 97 GLU 73 THR 165 ASN 148 How would be the code...??? Best regards and thanks in advance... ;) -- Yasser Almeida Hernández, BSc. Center of Molecular Immunology (CIM) Tumor Biology Direction Nanobiology Department 216 St. & 15th Ave, Siboney, Playa P.O.Box 16040. Havana, Cuba Phone: (+537) 214-3178 almeida at cim.sld.cu From anaryin at gmail.com Wed Jan 19 09:49:41 2011 From: anaryin at gmail.com (João Rodrigues) Date: Wed, 19 Jan 2011 15:49:41 +0100 Subject: [Biopython] Save coordinates of selected residues... In-Reply-To: <1295378096.1796.7.camel@almeida-desktop> References: <1295378096.1796.7.camel@almeida-desktop> Message-ID: Hello Yasser, You can either create a new Structure object with copies of the residues of interest, and then use PDBIO to save it, or use the Dice module. If you choose the latter, make sure that the accept_residue function returns True only for those residues (you can make it based on residue id). Best! João [...] Rodrigues http://doeidoei.wordpress.com 2011/1/18 Yasser Almeida Hernández > Hi all... > I wonder how can i save the coordinates of a list of residues from a > structure. 
> For example, from the structure 154L i want to save in a pdb file the > next amino acids: > PHE 123 > HIS 101 > GLY 150 > PHE 123 > TYR 147 > HIS 101 > PHE 123 > ASP 97 > GLU 73 > THR 165 > ASN 148 > > How would be the code...??? > > Best regards and thanks in advance... ;) > -- > > Yasser Almeida Hernández, BSc. > Center of Molecular Immunology (CIM) > Tumor Biology Direction > Nanobiology Department > 216 St. & 15th Ave, Siboney, Playa > P.O.Box 16040. Havana, Cuba > Phone: (+537) 214-3178 > almeida at cim.sld.cu > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Thu Jan 20 09:52:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Jan 2011 14:52:52 +0000 Subject: [Biopython] FASTQ to qual+fasta In-Reply-To: References: <4D3372B7.1080801@gmail.com> Message-ID: On Mon, Jan 17, 2011 at 12:30 PM, Peter Cock wrote: > > Done: > > https://github.com/biopython/biopython/commit/e26030a290b6acdf5a5fc431056593cccc5d892a > > This makes FASTQ to QUAL about three times faster. There is > probably scope for speeding up how we do the line wrapping in > QUAL output - both in the normal SeqRecord based code > called by Bio.SeqIO.write() and in the new optimised code for > Bio.SeqIO.convert(). > Hi Iddo, I found time yesterday to optimise the line wrapping for QUAL output, for a further significant speed-up. Please update to the current code from git and give it a try :) Peter From rojan at riken.jp Thu Jan 20 23:09:24 2011 From: rojan at riken.jp (Rojan Shrestha) Date: Fri, 21 Jan 2011 13:09:24 +0900 Subject: [Biopython] DSSP python version Message-ID: <000501cbb920$fd1e6230$f75b2690$@jp> Hello: I am very much interested in python libraries available for bioinformatics. Currently, I need a python version of DSSP or any other program that can evaluate the secondary structure of a protein from coordinate data. 
I am pretty sure there should be such a module. If you are familiar with it, please let me know how it can be used in my own python program. Regards, Rojan From anaryin at gmail.com Fri Jan 21 02:21:01 2011 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 21 Jan 2011 08:21:01 +0100 Subject: [Biopython] DSSP python version In-Reply-To: <000501cbb920$fd1e6230$f75b2690$@jp> References: <000501cbb920$fd1e6230$f75b2690$@jp> Message-ID: Hey Rojan, Biopython provides an interface to the command line DSSP. To my knowledge, no "python version" of DSSP exists. You'll have to get the program from G.Vriend's webpage and then use either Biopython or your own code to pass arguments and parse output. João From rojan at riken.jp Fri Jan 21 02:27:50 2011 From: rojan at riken.jp (Rojan Shrestha) Date: Fri, 21 Jan 2011 16:27:50 +0900 Subject: [Biopython] DSSP python version In-Reply-To: References: <000501cbb920$fd1e6230$f75b2690$@jp> Message-ID: <001a01cbb93c$b5b27960$21176c20$@jp> Hello Joao: Thank you very much. I found it in Biopython, but I have another problem using it. When the last column of the PDB file is missing, it gives an error. It is a problem in the pdbparser library. Regards, Rojan From: João Rodrigues [mailto:anaryin at gmail.com] Sent: Friday, January 21, 2011 4:21 PM To: rojan at riken.jp Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] DSSP python version Hey Rojan, Biopython provides an interface to the command line DSSP. To my knowledge, no "python version" of DSSP exists. You'll have to get the program from G.Vriend's webpage and then use either Biopython or your own code to pass arguments and parse output. 
João From anaryin at gmail.com Fri Jan 21 03:11:35 2011 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 21 Jan 2011 09:11:35 +0100 Subject: [Biopython] DSSP python version In-Reply-To: <001a01cbb93c$b5b27960$21176c20$@jp> References: <000501cbb920$fd1e6230$f75b2690$@jp> <001a01cbb93c$b5b27960$21176c20$@jp> Message-ID: Which column are you missing? Which error does the parser give? From dalke at dalkescientific.com Sat Jan 22 07:05:30 2011 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 22 Jan 2011 13:05:30 +0100 Subject: [Biopython] DSSP python version In-Reply-To: <000501cbb920$fd1e6230$f75b2690$@jp> References: <000501cbb920$fd1e6230$f75b2690$@jp> Message-ID: <822F6196-A750-4C3C-B677-5A8CC498D50A@dalkescientific.com> On Jan 21, 2011, at 5:09 AM, Rojan Shrestha wrote: > I am very much interested in python libraries available for bioinformatics. > Currently, I need a python version of DSSP or any other program that can > evaluate the secondary structure of a protein from coordinate data. Another option is STRIDE, which is what VMD uses for its secondary structure calculation code. That's all available through the scripting interface, and VMD supports Python. (Some history now; or should I say reminiscing?) I chose STRIDE instead of DSSP for VMD because 15 years ago there were only three secondary structure programs I could find. DSSP didn't allow redistribution and its output format was nasty to parse. The DSSP implementation in RasMol was completely undocumented. STRIDE, by contrast, allowed redistribution and was easy to parse. I have no idea about the current state of the art. 
Andrew dalke at dalkescientific.com From schaefer at rostlab.org Mon Jan 24 08:27:03 2011 From: schaefer at rostlab.org (Christian Schaefer) Date: Mon, 24 Jan 2011 14:27:03 +0100 Subject: [Biopython] Retrieve RefSeq sequence Message-ID: <4D3D7E27.8040005@rostlab.org> Hi there, I'm just wondering if there's a possibility to retrieve a protein sequence from NCBI's RefSeq just by giving its identifier, e.g. like NP_031402.3 (.3 being the version). Could this be done by the Bio.Entrez library? If so, how would I do this? Thanks in advance, Chris -- Dipl.-Bioinf. Christian Schaefer Technical University Munich Department for Bioinformatics Faculty of Computer Science/I12 Boltzmannstr. 3 D-85748 Garching b. Muenchen Germany http://www.rostlab.org/~schaefer http://gsish.tum.edu/ From moritz.beber at googlemail.com Tue Jan 25 01:03:23 2011 From: moritz.beber at googlemail.com (Moritz Beber) Date: Mon, 24 Jan 2011 23:03:23 -0700 Subject: [Biopython] BRENDA parser Message-ID: <4D3E67AB.5030300@googlemail.com> Dear list users, I've searched online and in the mailing archives but couldn't find anything on the topic. Did anyone come across a python parser for BRENDA before or has written one themselves? If that's not the case, what would be the guidelines for writing one and submitting it to Biopython? TIA, Moritz From ruchira.datta at gmail.com Tue Jan 25 01:20:52 2011 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Mon, 24 Jan 2011 22:20:52 -0800 Subject: [Biopython] BRENDA parser In-Reply-To: <4D3E67AB.5030300@googlemail.com> References: <4D3E67AB.5030300@googlemail.com> Message-ID: I parsed very little of BRENDA in Python -- all I wanted were which UniProt accessions went with which ECs. --Ruchira On Mon, Jan 24, 2011 at 10:03 PM, Moritz Beber wrote: > Dear list users, > > I've searched online and in the mailing archives but couldn't find > anything on the topic. Did anyone come across a python parser for BRENDA > before or has written one themselves? 
> > If that's not the case, what would be the guidelines for writing one and > submitting it to Biopython? > > TIA, > Moritz > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Jan 25 07:27:48 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 25 Jan 2011 07:27:48 -0500 Subject: [Biopython] Retrieve RefSeq sequence In-Reply-To: <4D3D7E27.8040005@rostlab.org> References: <4D3D7E27.8040005@rostlab.org> Message-ID: <20110125122747.GJ27283@sobchak.mgh.harvard.edu> Chris; > I'm just wondering if there's a possibility to retrieve a protein > sequence from NCBI's RefSeq just by giving its identifier, e.g. like > NP_031402.3 (.3 being the version). Could this be done by the > Bio.Entrez library? If so, how would I do this? Definitely. The tutorial has details about doing this and much more: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html Chapter 8 describes the Entrez interface to NCBI in detail. 
For your specific question, you can do this in two steps by searching for the GI number with the identifier using esearch, then retrieving the sequence with efetch: In [1]: from Bio import Entrez In [2]: Entrez.email = "you at email.com" In [3]: rec = Entrez.read(Entrez.esearch(db="protein", term="NP_031402.3")) In [4]: print rec["IdList"] ['110347469'] In [5]: p_handle = Entrez.efetch(db="protein", id=rec["IdList"][0], rettype="fasta") In [6]: print p_handle.read() >gi|110347468|ref|NP_031402.3| alpha-2-macroglobulin precursor [Mus musculus] MRRNQLPTPAFLLLFLLLPRDATTATAKPQYVVLVPSEVYSGVPEKACVSLNHVNETVMLSLTLEYAMQQ TKLLTDQAVDKDSFYCSPFTISGSPLPYTFITVEIKGPTQRFIKKKSIQIIKAESPVFVQTDKPIYKPGQ IVKFRVVSVDISFRPLNETFPVVYIETPKRNRIFQWQNIHLAGGLHQLSFPLSVEPALGIYKVVVQKDSG KKIEHSFEVKEYVLPKFEVIIKMQKTMAFLEEELPITACGVYTYGKPVPGLVTLRVCRKYSRYRSTCHNQ NSMSICEEFSQQADDKGCFRQVVKTKVFQLRQKGHDMKIEVEAKIKEEGTGIELTGIGSCEIANALSKLK FTKVNTNYRPGLPFSGQVLLVDEKGKPIPNKNITSVVSPLGYLSIFTTDEHGLANISIDTSNFTAPFLRV VVTYKQNHVCYDNWWLDEFHTQADHSATLVFSPSQSYIQLELVFGTLACGQTQEIRIHYLLNEDIMKNEK DLTFYYLIKARGSIFNLGSHVLSLEQGNMKGVFSLPIQVEPGMAPEAQLLIYAILPNEELVADAQNFEIE KCFANKVNLSFPSAQSLPASDTHLKVKAAPLSLCALTAVDQSVLLLKPEAKLSPQSIYNLLPGKTVQGAF FGVPVYKDHENCISGEDITHNGIVYTPKHSLGDNDAHSIFQSVGINIFTNSKIHKPRFCQEFQHYPAMGG VAPQALAVAASGPGSSFRAMGVPMMGLDYSDEINQVVEVRETVRKYFPETWIWDLVPLDVSGDGELAVKV PDTITEWKASAFCLSGTTGLGLSSTISLQAFQPFFLELTLPYSVVRGEAFTLKATVLNYMSHCIQIRVDL EISPDFLAVPVGGHENSHCICGNERKTVSWAVTPKSLGEVNFTATAEALQSPELCGNKLTEVPALVHKDT VVKSVIVEPEGIEKEQTYNTLLCPQDTELQDNWSLELPPNVVEGSARATHSVLGDILGSAMQNLQNLLQM PYGCGEQNMVLFVPNIYVLNYLNETQQLTEAIKSKAINYLISGYQRQLNYQHSDGSYSTFGNHGGGNTPG NTWLTAFVLKAFAQAQSHIFIEKTHITNAFNWLSMKQKENGCFQQSGYLLNNAMKGGVDDEVTLSAYITI ALLEMPLPVTHSAVRNALFCLETAWASISQSQESHVYTKALLAYAFALAGNKAKRSELLESLNKDAVKEE DSLHWQRPGDVQKVKALSFYQPRAPSAEVEMTAYVLLAYLTSESSRPTRDLSSSDLSTASKIVKWISKQQ NSHGGFSSTQDTVVALQALSKYGAATFTRSQKEVLVTIESSGTFSKTFHVNSGNRLLLQEVRLPDLPGNY VTKGSGSGCVYLQTSLKYNILPVADGKAPFALQVNTLPLNFDKAGDHRTFQIRINVSYTGERPSSNMVIV 
DVKMVSGFIPMKPSVKKLQDQPNIQRTEVNTNHVLIYIEKLTNQTLGFSFAVEQDIPVKNLKPAPIKVYD YYETDEFTVEEYSAPFSDGSEQGNA Hope this helps, Brad From fahy at chapman.edu Tue Jan 25 20:34:50 2011 From: fahy at chapman.edu (Michael Fahy) Date: Tue, 25 Jan 2011 17:34:50 -0800 Subject: [Biopython] Entrez.efetch from gene In-Reply-To: <4D24998E.2000200@cornell.edu> References: <4D24998E.2000200@cornell.edu> Message-ID: <000001cbbcf9$395d5e50$ac181af0$@edu> Trying to use Entrez.efetch() to query the gene database. The efetch help at http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html says there are no retrieval types supported by the gene database. If I do an efetch query without specifying a value for rettype, it returns html. Is there a way in Biopython to parse this html? Or is there another way to query the gene database so it will return data that can be parsed? Sample code: from Bio import Entrez Entrez.email = 'email at chapman.edu' search_database = 'gene' search_term = 'YIL065C' handle = Entrez.esearch(db='gene',term=search_term) record = Entrez.read(handle) reclist = record['IdList'] handle = Entrez.efetch(db=search_database, id =reclist[0]) myrecord = handle.read() print myrecord ---------------------------------------------------------- Michael A. Fahy fahy at chapman.edu From mjldehoon at yahoo.com Wed Jan 26 05:29:25 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 26 Jan 2011 02:29:25 -0800 (PST) Subject: [Biopython] Entrez.efetch from gene In-Reply-To: <000001cbbcf9$395d5e50$ac181af0$@edu> Message-ID: <503379.20505.qm@web161208.mail.bf1.yahoo.com> Did you try retmode instead of rettype? Probably retmode='xml' will give you output in the XML format, which can then be parsed by Bio.Entrez. --Michiel --- On Tue, 1/25/11, Michael Fahy wrote: > From: Michael Fahy > Subject: [Biopython] Entrez.efetch from gene > To: Biopython at lists.open-bio.org > Date: Tuesday, January 25, 2011, 8:34 PM > Trying to use Entrez.efetch() to > query the gene database. 
> > The efetch help at > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html > says > there are no retrieval types supported by the gene > database. If I do an > efetch query without specifying a value for rettype, it > returns html. Is > there a way in Biopython to parse this html? Or is > there another way to > query the gene database so it will return data that can be > parsed? > > Sample code: > > from Bio import Entrez > Entrez.email = 'email at chapman.edu' > > search_database = 'gene' > search_term = 'YIL065C' > > handle = Entrez.esearch(db='gene',term=search_term) > record = Entrez.read(handle) > reclist = record['IdList'] > > handle = Entrez.efetch(db=search_database, id=reclist[0]) > > myrecord = handle.read() > print myrecord > > ---------------------------------------------------------- > Michael A. Fahy > fahy at chapman.edu > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ediths at botinst.uzh.ch Wed Jan 26 06:05:35 2011 From: ediths at botinst.uzh.ch (Edith Schlagenhauf) Date: Wed, 26 Jan 2011 12:05:35 +0100 (MET) Subject: [Biopython] Entrez.efetch from gene In-Reply-To: <000001cbbcf9$395d5e50$ac181af0$@edu> References: <4D24998E.2000200@cornell.edu> <000001cbbcf9$395d5e50$ac181af0$@edu> Message-ID: Using retmode='xml' returns the data in XML format, i.e.: handle = Entrez.efetch(db=search_database, id=reclist[0], retmode='xml') # The Bio.Entrez.read() function can parse most (if # not all) XML output returned by Entrez. 
record = Entrez.read(handle) handle.close() print record[0].keys() # prints list of available keys print record[0]["Entrezgene_rna"] # example key HTH, Edith ****************************************** Dr Edith Schlagenhauf University of Zurich SWITZERLAND e-mail: ediths AT botinst DOT uzh DOT ch ****************************************** On Tue, 25 Jan 2011, Michael Fahy wrote: > Trying to use Entrez.efetch() to query the gene database. > > The efetch help at > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html says > there are no retrieval types supported by the gene database. If I do an > efetch query without specifying a value for rettype, it returns html. Is > there a way in Biopython to parse this html? Or is there another way to > query the gene database so it will return data that can be parsed? > > Sample code: > > from Bio import Entrez > Entrez.email = 'email at chapman.edu' > > search_database = 'gene' > search_term = 'YIL065C' > > handle = Entrez.esearch(db='gene',term=search_term) > record = Entrez.read(handle) > reclist = record['IdList'] > > handle = Entrez.efetch(db=search_database, id =reclist[0]) > > myrecord = handle.read() > print myrecord > > ---------------------------------------------------------- > Michael A. Fahy > fahy at chapman.edu > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From bergland at stanford.edu Mon Jan 31 13:50:36 2011 From: bergland at stanford.edu (Alan Bergland) Date: Mon, 31 Jan 2011 10:50:36 -0800 Subject: [Biopython] internal function to convert illumina quality scores to phred Message-ID: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> Hi all, I am trying to convert some code I've written to use FastqGeneralIterator rather than SeqIO.parse. For the most part, it works great and there is a big speed improvement. 
However, I need to be able to convert the quality scores of 6 characters from the Illumina format to phred. I can't seem to find the function to do this. I'm sure it must exist, and I apologize if documentation for it is sitting right there in the tutorial - I can't seem to find it. Can someone point me in the right direction? Cheers, Alan From biopython at maubp.freeserve.co.uk Mon Jan 31 14:36:02 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Jan 2011 19:36:02 +0000 Subject: [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> Message-ID: On Mon, Jan 31, 2011 at 6:50 PM, Alan Bergland wrote: > Hi all, > > I am trying to convert some code I've written to use > FastqGeneralIterator rather than SeqIO.parse. For the most part, it works > great and there is a big speed improvement. However, I need to be able to > convert the quality scores of 6 characters from the Illumina format to > phred. I can't seem to find the function to do this. I'm sure it must > exist, and I apologize if documentation for it is sitting right there in the > tutorial - I can't seem to find it. Can someone point me in the right > direction? > > Cheers, > Alan Hi Alan, Probably something in Bio.SeqIO.QualityIO will do what you want, consult the module's built in documentation via help(...) in Python or the online version which is here: http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html I could be more precise if you could clarify what exactly it is you want to do with a couple of examples (input, desired output). If you just want fast Solexa/Illumina FASTQ to Sanger FASTQ or a PHRED style QUAL file from within Python use Bio.SeqIO.convert for this. 
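For the barcode case asked about above, a rough sketch of the character-to-score arithmetic, assuming Illumina 1.3+ FASTQ where each quality character is a PHRED score plus an ASCII offset of 64. The helper names below are hypothetical and are not Biopython functions. Older Solexa files store log-odds scores rather than PHRED scores, so the second helper applies the standard Solexa-to-PHRED formula for that case.

```python
from math import log10


def illumina_to_phred(quality_string, offset=64):
    """Decode Illumina 1.3+ quality characters to integer PHRED scores.

    Assumes each character is chr(phred + offset); offset 64 is the
    Illumina 1.3+ convention, 33 would be the Sanger convention.
    """
    return [ord(char) - offset for char in quality_string]


def solexa_to_phred(solexa_score):
    """Convert one Solexa log-odds score to the PHRED scale (a float)."""
    return 10 * log10(10 ** (solexa_score / 10.0) + 1)


# Made-up quality string standing in for the first 6 characters of a read,
# as yielded in the third element of a FastqGeneralIterator tuple.
barcode_quality = "hhhh^B"
print(illumina_to_phred(barcode_quality[:6]))
```

Working on the raw quality string this way avoids building a SeqRecord for every read, which is the overhead FastqGeneralIterator is meant to sidestep.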
Peter From bergland at stanford.edu Mon Jan 31 14:54:23 2011 From: bergland at stanford.edu (Alan Bergland) Date: Mon, 31 Jan 2011 11:54:23 -0800 Subject: [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> Message-ID: <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> Hi Peter, Thanks for the quick reply. So, I am trying to iterate through two large fastq files (each file is one paired-end read) and split the reads by one of 9 barcodes found on both 5' ends of each read. I would like to use the quality information for those barcode reads to assess which barcode-group they belong to. I think it would be nice to use FastqGeneralIterator because I don't need to translate the quality scores for the full read (100bp) back and forth while I iterate through the file. I gather that when I use SeqIO.parse and SeqIO.write, the quality scores are converted back and forth. There is no need to do this for the whole read. I've written a little snippet of code that simply prints the quality scores from the barcodes: from Bio import SeqIO from Bio.SeqIO.QualityIO import * from Bio.SeqIO import * pe1 = open("head2_pe1.fastq", "r") pe2 = open("head2_pe2.fastq", "r") pe1_record_it = FastqGeneralIterator(pe1) for pe1_seq_record in pe1_record_it: bc = SeqRecord(Seq(pe1_seq_record[1][:6]), id="a") bc.letter_annotations['fastq-illumina'] = pe1_seq_record[2][:6] print bc.letter_annotations["fastq-illumina"] this just prints out the illumina encoded quality scores. How would I print out the phred scores instead? Thanks, Alan On Jan 31, 2011, at 11:36 AM, Peter wrote: > On Mon, Jan 31, 2011 at 6:50 PM, Alan Bergland > wrote: >> Hi all, >> >> I am trying to convert some code I've written to use >> FastqGeneralIterator rather than SeqIO.parse. For the most part, >> it works >> great and there is a big speed improvement. 
However, I need to be >> able to >> convert the quality scores of 6 characters from the Illumina format >> to >> phred. I can't seem to find the function to do this. I'm sure it >> must >> exist, and I apologize if documentation for it is sitting right >> there in the >> tutorial - I can't seem to find it. Can someone point me in the >> right >> direction? >> >> Cheers, >> Alan > > Hi Alan, > > Probably something in Bio.SeqIO.QualityIO will do what you want, > consult the module's built in documentation via help(...) in Python > or the online version which is here: > http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html > > I could be more precise if you could clarify what exactly it is you > want to do with a couple of examples (input, desired output). > > If you just want fast Solexa/Illumina FASTQ to Sanger FASTQ or a > PHRED style QUAL file from within Python use Bio.SeqIO.convert > for this. > > Peter From biopython at maubp.freeserve.co.uk Mon Jan 31 15:50:30 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Jan 2011 20:50:30 +0000 Subject: [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> Message-ID: On Mon, Jan 31, 2011 at 7:54 PM, Alan Bergland wrote: > Hi Peter, > > ? ? ? ?Thanks for the quick reply. ?So, I am trying to iterate through two > large fastq files (each file is one paired-end read) and split the reads by > one of 9 barcodes found on both 5' ends of each read. ?I would like to use > the quality information for those barcode reads to assess which > barcode-group they belong to. I suspect that it is simpler to just ignore reads which don't match the barcode within 1 or 2 mismatches, without worrying about their qualities. 
It will cost you computational time and effort for a relatively small improvement in the number of filtered reads. YMMV. If you look at the archives there was a discussion earlier in Jan about doing this sort of thing with SFF files. I'm currently (between other tasks) trying to wrap up some sort of PCR-primer/barcode/adaptor filtering (Bio)python script as a tool for Galaxy, see http://usegalaxy.org > I think it would be nice to use FastqGeneralIterator because I don't > need to translate the quality scores for the full read (100bp) back and > forth while I iterate through the file. I gather that when I use > SeqIO.parse and SeqIO.write, the quality scores are converted back and > forth. There is no need to do this for the whole read. > > I've written a little snippet of code that simply prints the quality > scores from the barcodes: > > from Bio import SeqIO > from Bio.SeqIO.QualityIO import * > from Bio.SeqIO import * > > pe1 = open("head2_pe1.fastq", "r") > pe2 = open("head2_pe2.fastq", "r") > > pe1_record_it = FastqGeneralIterator(pe1) > > for pe1_seq_record in pe1_record_it: > bc = SeqRecord(Seq(pe1_seq_record[1][:6]), id="a") > bc.letter_annotations['fastq-illumina'] = pe1_seq_record[2][:6] > > print bc.letter_annotations["fastq-illumina"] > > this just prints out the illumina encoded quality scores. How would I print > out the phred scores instead? > > Thanks, > Alan If you have Illumina 1.3+ or later, then they use PHRED scores (not Solexa scores which have a different log encoding). If as it appears above you want to use SeqRecord objects, I think you might as well use Bio.SeqIO.parse and write. The point about using FastqGeneralIterator for speed is to avoid using a SeqRecord due to the overheads. See e.g.: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ Peter P.S. 
Watch out for the fact that Illumina are planning to switch to the standard Sanger FASTQ encoding in their next release: http://seqanswers.com/forums/showthread.php?t=8895 From biopython at maubp.freeserve.co.uk Sat Jan 1 00:05:52 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 1 Jan 2011 00:05:52 +0000 Subject: [Biopython] eprimer3 and primer3 incompatibility In-Reply-To: <4D1E684D.9070007@gmail.com> References: <4D1E684D.9070007@gmail.com> Message-ID: On Fri, Dec 31, 2010 at 11:33 PM, David Koppstein wrote: > Hi, > > I noticed today, while trying to work with > Bio.Emboss.Applications.Primer3CommandLine, that there is an incompatibility > in the TAG format between the current version of Emboss's eprimer3 (6.3.1) > and the current version of primer3_core (2.2.3). > > The EMBOSS developers apparently know about this: > > http://web.archiveorange.com/archive/v/yWYDQkVd25Rxx2EAVunh > > but haven't yet fixed it, unless you want to download a c program and > recompile manually. Oh yeah, it looks like it was me that reported this to them back in April 2010 (linked to in the thread you found): http://www.mail-archive.com/emboss at lists.open-bio.org/msg01405.html > If and when they do fix it, at some point the tag lists will probably have > to be updated for the Biopython module. Possibly - if EMBOSS change their command line arguments. > In the meantime, I am using the old > primer3_core (1.1.4) which should interface with eprimer3 just fine. Would > it make sense, however, to have a Biopython module that interfaces directly > with primer3_core, rather than going through eprimer3? Maybe. I've not looked at the native command line API of primer3_core to form an opinion. > Happy New Year! 
> David You too, Peter From fjruizruano at gmail.com Mon Jan 3 21:47:41 2011 From: fjruizruano at gmail.com (=?ISO-8859-1?Q?Francisco_J=2E_Ruiz=2DRuano_Campa=F1a?=) Date: Mon, 3 Jan 2011 22:47:41 +0100 Subject: [Biopython] search and delete singleton sites Message-ID: Hello, list. > I need to search singleton sites in an fasta alignment and then to put in the position that differs the same nucleotide in rest of sequences in this position. Any idea? I'm starting with Python and Biopython. Thanks. Francisco. From dalke at dalkescientific.com Wed Jan 5 02:59:09 2011 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 5 Jan 2011 03:59:09 +0100 Subject: [Biopython] Bio.trie In-Reply-To: <524070.34823.qm@web62404.mail.re1.yahoo.com> References: <524070.34823.qm@web62404.mail.re1.yahoo.com> Message-ID: <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com> On Dec 29, 2010, at 4:24 AM, Michiel de Hoon wrote: > We would like to know though how many users Bio.trie has, so we can decide whether it is worthwhile to update this module. If you are using Bio.trie, please let us know (preferably via the mailing list). If there are no current users, I suggest that we deprecate and later remove this module from Biopython. I am not a user but the other day I was looking through the Python bug list and came across: http://bugs.python.org/issue9520 The best existing implementation I've been able to find so far is one in the BioPython. Compared to defaultdict(int) on the task of counting words. 
Dataset 123,981,712 words (6,504,484 unique), 1..21 characters long: * bio.tree - 459 Mb/0.13 Hours, good O(1) behavior * defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior Andrew dalke at dalkescientific.com From ruchira.datta at gmail.com Wed Jan 5 03:10:53 2011 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Tue, 4 Jan 2011 19:10:53 -0800 Subject: [Biopython] Bio.trie In-Reply-To: <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com> References: <524070.34823.qm@web62404.mail.re1.yahoo.com> <88E0B7C6-7843-46E8-859B-9ED372ECF8A8@dalkescientific.com> Message-ID: I had also seen that. Sorry for my delay in replying. A trie is an important data structure with many uses. The use I had in mind is: tries are the lowest-latency way of implementing an autosuggest/autocomplete (e.g., if you want to allow users to pick a member of the NCBI taxonomy by scientific name). --Ruchira On Tue, Jan 4, 2011 at 6:59 PM, Andrew Dalke wrote: > On Dec 29, 2010, at 4:24 AM, Michiel de Hoon wrote: > > We would like to know though how many users Bio.trie has, so we can > decide whether it is worthwhile to update this module. If you are using > Bio.trie, please let us know (preferably via the mailing list). If there are > no current users, I suggest that we deprecate and later remove this module > from Biopython. > > I am not a user but the other day I was looking through the Python bug list > and came across: > > http://bugs.python.org/issue9520 > > The best existing implementation I've been able to find so far > is one in the BioPython. Compared to defaultdict(int) on the > task of counting words. 
Dataset 123,981,712 words (6,504,484 > unique), 1..21 characters long: > * bio.tree - 459 Mb/0.13 Hours, good O(1) behavior > * defaultdict(int) - 693 Mb/0.32 Hours, poor, almost O(N) behavior > > > > Andrew > dalke at dalkescientific.com > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From rmb32 at cornell.edu Wed Jan 5 16:17:18 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 05 Jan 2011 08:17:18 -0800 Subject: [Biopython] GSOC 2011 In-Reply-To: References: Message-ID: <4D24998E.2000200@cornell.edu> Hi Akshay, Thanks for your interest! You can subscribe to the OBF-GSOC mailing list by filling in the subscription form at http://lists.open-bio.org/mailman/listinfo/gsoc. Rob Akshay Goel wrote: > Dear Sir, > > I found your organization while looking through the list of GSOC 2010 > mentoring organizations. I am proficient in Python, C/C++, PHP and JSP, > and would like to get involved with the project. I would, therefore, > like to join the mailing list and know more about current projects. > > Thanking You > Yours Sincerely > Akshay Goel > > > From beiko at cs.dal.ca Wed Jan 12 14:06:43 2011 From: beiko at cs.dal.ca (Robert Beiko) Date: Wed, 12 Jan 2011 10:06:43 -0400 Subject: [Biopython] Phylo: rerooting a tree with a terminal node Message-ID: <4D2DB573.6020505@cs.dal.ca> Hi, I have been experimenting with the excellent Phylo package in BioPython, and am having a bit of trouble with the 'root_with_outgroup' method. 
Specifically, it seems to work fine when I apply it to internal nodes, but when I try to root on a terminal, I get an error: --------------- Traceback (most recent call last): File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line 16, in tree.root_with_outgroup(leaf) File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 777, in root_with_outgroup parent.clades.pop(parent.clades.index(new_parent)) ValueError: list.index(x): x not in list --------------- Here is an example script: import io import sys from Bio import Phylo infile = 'Example1.tre' trees = Phylo.parse(infile,'newick') for tree in trees: leafList = tree.get_terminals() for leaf in leafList: tree.root_with_outgroup(leaf) --------------- And here is the file 'Example1.tre' ((AAA,BBB),(CCC,DDD)); [I have tried many permutations of the tree, and in no case have I been able to root using a terminal]. Line 777 in 'BaseTree.py' is: parent.clades.pop(parent.clades.index(new_parent)) so it appears that new_parent is not in the list 'parent.clades'. My Python is rather rudimentary so I haven't been able to figure out why this might arise. If I change 'get_terminals()' to 'get_nonterminals()' in the script above, everything works fine. I imagine I could get things to work by introducing a dummy sister node for each terminal I would like to root on, and then rooting on the LCA of the terminal and its dummy sister. But is there something I am doing wrong in the script above, that could easily be remedied without a hack? Python version is 2.6, BioPython 1.56. 
Best wishes and thanks, Rob Beiko From eric.talevich at gmail.com Wed Jan 12 18:31:12 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 12 Jan 2011 13:31:12 -0500 Subject: [Biopython] Phylo: rerooting a tree with a terminal node In-Reply-To: <4D2DB573.6020505@cs.dal.ca> References: <4D2DB573.6020505@cs.dal.ca> Message-ID: On Wed, Jan 12, 2011 at 9:06 AM, Robert Beiko wrote: > Hi, > > I have been experimenting with the excellent Phylo package in BioPython, > and am having a bit of trouble with the 'root_with_outgroup' method. > > Specifically, it seems to work fine when I apply it to internal nodes, but > when I try to root on a terminal, I get an error: > > --------------- > > Traceback (most recent call last): > File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line 16, in > > tree.root_with_outgroup(leaf) > File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 777, in > root_with_outgroup > parent.clades.pop(parent.clades.index(new_parent)) > ValueError: list.index(x): x not in list > > --------------- > > Here is an example script: > > import io > import sys > from Bio import Phylo > > infile = 'Example1.tre' > trees = Phylo.parse(infile,'newick') > > for tree in trees: > leafList = tree.get_terminals() > for leaf in leafList: > tree.root_with_outgroup(leaf) > > --------------- > > And here is the file 'Example1.tre' > > ((AAA,BBB),(CCC,DDD)); > > [I have tried many permutations of the tree, and in no case have I been > able to root using a terminal]. > > Line 777 in 'BaseTree.py' is: > parent.clades.pop(parent.clades.index(new_parent)) > > so it appears that new_parent is not in the list 'parent.clades'. My Python > is rather rudimentary so I haven't been able to figure out why this might > arise. > > If I change 'get_terminals()' to 'get_nonterminals()' in the script above, > everything works fine. 
I imagine I could get things to work by introducing a > dummy sister node for each terminal I would like to root on, and then > rooting on the LCA of the terminal and its dummy sister. But is there > something I am doing wrong in the script above, that could easily be > remedied without a hack? > > Python version is 2.6, BioPython 1.56. > > Best wishes and thanks, > Rob Beiko > Hi Robert, Thanks for reporting this. It's certainly possible that there's a bug here; I'll take a closer look at the code. The odd thing I notice about your code is that you're rerooting inside a loop. The root_with_outgroup method operates in-place, so the "tree" object is changing with each iteration. I assume your original code was doing something after rerooting each time, like writing out a new tree. The difference in the code between rooting with terminal versus non-terminal nodes is that rooting with a terminal requires creating a new internal node just above the terminal, then using the new node as the new root. (Rerooting with an existing internal node just reroots at that node without creating any new objects.) So if this is done repeatedly, that could be the source of some trouble. Do you know if the loop is crashing in the first iteration or the second? Best regards, Eric From beiko at cs.dal.ca Wed Jan 12 18:46:24 2011 From: beiko at cs.dal.ca (Robert Beiko) Date: Wed, 12 Jan 2011 14:46:24 -0400 Subject: [Biopython] Phylo: rerooting a tree with a terminal node In-Reply-To: References: <4D2DB573.6020505@cs.dal.ca> Message-ID: <4D2DF700.3030905@cs.dal.ca> Hi Eric, Thank you very much for your quick reply. Indeed the full script is doing something much more interesting (rolling up in-paralogs with attempts at alternative rootings), but this is my attempt to cut out all of the other things I might have done wrong :^> The loop is crashing the first time I try it. 
Indeed, the following variation fails as well: import io import sys from Bio import Phylo infile = 'Example1.tre' trees = Phylo.parse(infile,'newick') for tree in trees: leafList = tree.get_terminals() tree.root_with_outgroup(leafList[0]) ---- again, the same code with internals rather than terminals works fine. Best wishes, Rob On 12/01/2011 2:31 PM, Eric Talevich wrote: > On Wed, Jan 12, 2011 at 9:06 AM, Robert Beiko > wrote: > > Hi, > > I have been experimenting with the excellent Phylo package in > BioPython, and am having a bit of trouble with the > 'root_with_outgroup' method. > > Specifically, it seems to work fine when I apply it to internal > nodes, but when I try to root on a terminal, I get an error: > > --------------- > > Traceback (most recent call last): > File "C:/Projects/10000/Phylogenomics/TestTrees/test.py", line > 16, in > tree.root_with_outgroup(leaf) > File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line > 777, in root_with_outgroup > parent.clades.pop(parent.clades.index(new_parent)) > ValueError: list.index(x): x not in list > > --------------- > > Here is an example script: > > import io > import sys > from Bio import Phylo > > infile = 'Example1.tre' > trees = Phylo.parse(infile,'newick') > > for tree in trees: > leafList = tree.get_terminals() > for leaf in leafList: > tree.root_with_outgroup(leaf) > > --------------- > > And here is the file 'Example1.tre' > > ((AAA,BBB),(CCC,DDD)); > > [I have tried many permutations of the tree, and in no case have I > been able to root using a terminal]. > > Line 777 in 'BaseTree.py' is: > parent.clades.pop(parent.clades.index(new_parent)) > > so it appears that new_parent is not in the list 'parent.clades'. > My Python is rather rudimentary so I haven't been able to figure > out why this might arise. > > If I change 'get_terminals()' to 'get_nonterminals()' in the > script above, everything works fine. 
I imagine I could get things > to work by introducing a dummy sister node for each terminal I > would like to root on, and then rooting on the LCA of the terminal > and its dummy sister. But is there something I am doing wrong in > the script above, that could easily be remedied without a hack? > > Python version is 2.6, BioPython 1.56. > > Best wishes and thanks, > Rob Beiko > > > > Hi Robert, > > Thanks for reporting this. It's certainly possible that there's a bug > here; I'll take a closer look at the code. > > The odd thing I notice about your code is that you're rerooting inside > a loop. The root_with_outgroup method operates in-place, so the "tree" > object is changing with each iteration. I assume your original code > was doing something after rerooting each time, like writing out a new > tree. > > The difference in the code between rooting with terminal versus > non-terminal nodes is that rooting with a terminal requires creating a > new internal node just above the terminal, then using the new node as > the new root. (Rerooting with an existing internal node just reroots > at that node without creating any new objects.) So if this is done > repeatedly, that could be the source of some trouble. > > Do you know if the loop is crashing in the first iteration or the second? > > Best regards, > Eric From eric.talevich at gmail.com Thu Jan 13 01:55:23 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 12 Jan 2011 20:55:23 -0500 Subject: [Biopython] Phylo: rerooting a tree with a terminal node In-Reply-To: <4D2DF700.3030905@cs.dal.ca> References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca> Message-ID: Hi Rob, This was an outright bug in Bio.Phylo, so thanks again for reporting it. 
I've pushed a fix to GitHub:
https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09

For your own work, you can get this fix by either:
(a) checking out a development copy of Biopython from GitHub (the master
branch is fairly safe) and reinstalling, or
(b) applying just this fix to your copy of Bio.Phylo in-place -- i.e.
editing your existing Biopython installation. You can replace the file
Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects.

Cheers,
Eric

On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko wrote:

> Hi Eric,
>
> Thank you very much for your quick reply.
>
> Indeed the full script is doing something much more interesting (rolling up
> in-paralogs with attempts at alternative rootings), but this is my attempt
> to cut out all of the other things I might have done wrong :^>
>
> The loop is crashing the first time I try it. Indeed, the following
> variation fails as well:
>
>
> import io
> import sys
> from Bio import Phylo
>
> infile = 'Example1.tre'
> trees = Phylo.parse(infile,'newick')
>
> for tree in trees:
>     leafList = tree.get_terminals()
>     tree.root_with_outgroup(leafList[0])
>
> ----
>
> again, the same code with internals rather than terminals works fine.
>
> Best wishes,
> Rob
>
>

From fabrice.ciup at gmail.com Thu Jan 13 10:51:01 2011
From: fabrice.ciup at gmail.com (Fabrice Tourre)
Date: Thu, 13 Jan 2011 11:51:01 +0100
Subject: [Biopython] I have little question. How can I get the all CpG sites genome postion for mm9?
Message-ID:

Hi List,

I have a little question. How can I get the genome positions of all CpG
sites for mm9? That is, the positions of every CpG site in mm9.

Thanks.
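
One way to approach the question above is a plain string scan for the
"CG" dinucleotide. This is only a minimal sketch: the function name
cpg_positions is made up for illustration, positions are 0-based, and
lower-case (soft-masked) sequence is treated the same as upper-case.
With Biopython the per-chromosome sequences could presumably be read
with SeqIO.parse() from a local FASTA copy of mm9 (file name left to
you) and passed in as str(record.seq).

```python
def cpg_positions(seq):
    """Yield the 0-based start position of every CG dinucleotide in seq.

    seq is a plain string; case is ignored so that soft-masked
    (lower-case) regions are scanned as well.
    """
    seq = seq.upper()
    pos = seq.find("CG")
    while pos != -1:
        yield pos
        # Resume the search one base further along.
        pos = seq.find("CG", pos + 1)


if __name__ == "__main__":
    # Toy sequence only; a real run would loop over chromosome records.
    for position in cpg_positions("ACGTCGcgA"):
        print(position)
```

For a whole genome it is probably best to handle one chromosome at a
time and write the positions out as you go, rather than collecting
them all in one list.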
From beiko at cs.dal.ca Thu Jan 13 16:48:32 2011 From: beiko at cs.dal.ca (Robert Beiko) Date: Thu, 13 Jan 2011 12:48:32 -0400 Subject: [Biopython] Phylo: rerooting a tree with a terminal node In-Reply-To: References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca> Message-ID: <4D2F2CE0.9000400@cs.dal.ca> Hi Eric, I applied the fix and many of the terminal rootings now work. Thanks! I am still getting errors on a smaller subset of trees, though. The simplest example is this one (in a file called Example1.tre): (A,B,(C,D)); I have modified test.py to do things slightly differently: -------------------------- import io import sys from Bio import Phylo REROOT_INTERNAL = 0 infile = 'Example1.tre' trees = Phylo.parse(infile,'newick') for tree in trees: print "\nInitial tree:" Phylo.write(tree,sys.stdout,'newick') if REROOT_INTERNAL == 1: for iNode in tree.get_nonterminals(): if len(tree.get_path(iNode)) > 1: tree.root_with_outgroup(iNode) break print "\nRerooted with nice internal node:" Phylo.write(tree,sys.stdout,'newick') leafList = tree.get_terminals() print "\nAttempting to root on terminal " + leafList[0].name tree.root_with_outgroup(leafList[0]) print "Rerooted on terminal:" Phylo.write(tree,sys.stdout,'newick') ------------------------------- [Apologies for all the print statements and C-like constants] If REROOT_INTERNAL is set to 0, then we go right to the 'leafList = tree.get_terminals()' line and get the following error: Initial tree: (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; Attempting to root on terminal A Traceback (most recent call last): File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in tree.root_with_outgroup(leafList[0]) File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in root_with_outgroup parent = outgroup_path.pop(-2) IndexError: pop index out of range which I assume occurs either because the terminal node (A) has the root as its immediate parent, and/or because 
it's one leaf from an initial trifurcation. Note that ((A,B),(C,D)); works fine in this case. Setting REROOT_INTERNAL to 1 is my attempt to get around this problem by first rooting on a 'safe' internal node, and then rooting on the terminal. On many of the larger trees I am working with, this solves the problem. But in the case of the tree above, it seems that the original trifurcation remains in place. A few of the larger trees I have tested also retain this trifurcation, even if branches are moved around. Setting REROOT_INTERNAL to 1 gives me the following output: Initial tree: (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; Rerooted with nice internal node: (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; Attempting to root on terminal A Traceback (most recent call last): File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in tree.root_with_outgroup(leafList[0]) File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in root_with_outgroup parent = outgroup_path.pop(-2) IndexError: pop index out of range Now, if I change the line for iNode in tree.get_nonterminals(): to for iNode in tree.get_terminals(): Then I get the desired behaviour. This succeeds as a workaround as long as I check to make sure that I'm not rooting again on the terminal that I just rooted on. Best wishes, Rob On 12/01/2011 9:55 PM, Eric Talevich wrote: > Hi Rob, > > This was an outright bug in Bio.Phylo, so thanks again for reporting > it. I've pushed a fix to GitHub: > https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09 > > For your own work, you can get this fix by either: > (a) checking out a development copy of Biopython from GitHub (the > master branch is fairly safe) and reinstalling, or > (b) applying just this fix to your copy of Bio.Phylo in-place -- i.e. > editing your existing Biopython installation. 
You can replace the file > Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects. > > Cheers, > Eric > > On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko > wrote: > > Hi Eric, > > Thank you very much for your quick reply. > > Indeed the full script is doing something much more interesting > (rolling up in-paralogs with attempts at alternative rootings), > but this is my attempt to cut out all of the other things I might > have done wrong :^> > > The loop is crashing the first time I try it. Indeed, the > following variation fails as well: > > > import io > import sys > from Bio import Phylo > > infile = 'Example1.tre' > trees = Phylo.parse(infile,'newick') > > for tree in trees: > leafList = tree.get_terminals() > tree.root_with_outgroup(leafList[0]) > > ---- > > again, the same code with internals rather than terminals works fine. > > Best wishes, > Rob > From eric.talevich at gmail.com Sat Jan 15 04:50:58 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 14 Jan 2011 23:50:58 -0500 Subject: [Biopython] Phylo: rerooting a tree with a terminal node In-Reply-To: <4D2F2CE0.9000400@cs.dal.ca> References: <4D2DB573.6020505@cs.dal.ca> <4D2DF700.3030905@cs.dal.ca> <4D2F2CE0.9000400@cs.dal.ca> Message-ID: Hi Rob, This should work now: https://github.com/biopython/biopython/commit/e9cfcc3680a5b5692f91e560ea08e51515c9c757 I also added another unit test based on your example -- looping through all the nodes in a few contrived trees, rerooting at each node and testing that the total tree length doesn't change. It passes now. So, thanks for the test! Cheers, Eric On Thu, Jan 13, 2011 at 11:48 AM, Robert Beiko wrote: > Hi Eric, > > I applied the fix and many of the terminal rootings now work. Thanks! > > I am still getting errors on a smaller subset of trees, though. 
The > simplest example is this one (in a file called Example1.tre): > > (A,B,(C,D)); > > I have modified test.py to do things slightly differently: > > -------------------------- > > > import io > import sys > from Bio import Phylo > > REROOT_INTERNAL = 0 > > > infile = 'Example1.tre' > trees = Phylo.parse(infile,'newick') > > for tree in trees: > print "\nInitial tree:" > Phylo.write(tree,sys.stdout,'newick') > > if REROOT_INTERNAL == 1: > for iNode in tree.get_nonterminals(): > if len(tree.get_path(iNode)) > 1: > tree.root_with_outgroup(iNode) > break > > print "\nRerooted with nice internal node:" > Phylo.write(tree,sys.stdout,'newick') > > leafList = tree.get_terminals() > print "\nAttempting to root on terminal " + leafList[0].name > > tree.root_with_outgroup(leafList[0]) > > print "Rerooted on terminal:" > Phylo.write(tree,sys.stdout,'newick') > > ------------------------------- > > [Apologies for all the print statements and C-like constants] > > If REROOT_INTERNAL is set to 0, then we go right to the 'leafList = > tree.get_terminals()' line and get the following error: > > Initial tree: > (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; > > Attempting to root on terminal A > > Traceback (most recent call last): > File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in > > > tree.root_with_outgroup(leafList[0]) > File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in > root_with_outgroup > parent = outgroup_path.pop(-2) > IndexError: pop index out of range > > which I assume occurs either because the terminal node (A) has the root as > its immediate parent, and/or because it's one leaf from an initial > trifurcation. Note that ((A,B),(C,D)); works fine in this case. > > Setting REROOT_INTERNAL to 1 is my attempt to get around this problem by > first rooting on a 'safe' internal node, and then rooting on the terminal. > On many of the larger trees I am working with, this solves the problem. 
But > in the case of the tree above, it seems that the original trifurcation > remains in place. A few of the larger trees I have tested also retain this > trifurcation, even if branches are moved around. Setting REROOT_INTERNAL to > 1 gives me the following output: > > Initial tree: > (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; > > Rerooted with nice internal node: > (A:1.00000,B:1.00000,(C:1.00000,D:1.00000)0.00000:1.00000)0.00000:1.00000; > > Attempting to root on terminal A > > Traceback (most recent call last): > File "C:\Projects\10000\Phylogenomics\TestTrees\test.py", line 25, in > > > tree.root_with_outgroup(leafList[0]) > File "C:\Python26\lib\site-packages\Bio\Phylo\BaseTree.py", line 768, in > root_with_outgroup > parent = outgroup_path.pop(-2) > IndexError: pop index out of range > > Now, if I change the line > > for iNode in tree.get_nonterminals(): > > to > > for iNode in tree.get_terminals(): > > Then I get the desired behaviour. This succeeds as a workaround as long as > I check to make sure that I'm not rooting again on the terminal that I just > rooted on. > > Best wishes, > Rob > > > On 12/01/2011 9:55 PM, Eric Talevich wrote: > > Hi Rob, > > This was an outright bug in Bio.Phylo, so thanks again for reporting it. > I've pushed a fix to GitHub: > > https://github.com/biopython/biopython/commit/1a8a39b6d24a9a4b9088255327b0f2fd12c19a09 > > For your own work, you can get this fix by either: > (a) checking out a development copy of Biopython from GitHub (the master > branch is fairly safe) and reinstalling, or > (b) applying just this fix to your copy of Bio.Phylo in-place -- i.e. > editing your existing Biopython installation. You can replace the file > Bio/Phylo/BaseTree.py with the one from GitHub without any ill effects. > > Cheers, > Eric > > On Wed, Jan 12, 2011 at 1:46 PM, Robert Beiko wrote: > >> Hi Eric, >> >> Thank you very much for your quick reply. 
>>
>> Indeed the full script is doing something much more interesting (rolling
>> up in-paralogs with attempts at alternative rootings), but this is my
>> attempt to cut out all of the other things I might have done wrong :^>
>>
>> The loop is crashing the first time I try it. Indeed, the following
>> variation fails as well:
>>
>>
>> import io
>> import sys
>> from Bio import Phylo
>>
>> infile = 'Example1.tre'
>> trees = Phylo.parse(infile,'newick')
>>
>> for tree in trees:
>>     leafList = tree.get_terminals()
>>     tree.root_with_outgroup(leafList[0])
>>
>> ----
>>
>> again, the same code with internals rather than terminals works fine.
>>
>> Best wishes,
>> Rob
>>
>>
>

From idoerg at gmail.com Sun Jan 16 18:48:09 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 16 Jan 2011 13:48:09 -0500
Subject: [Biopython] FASTQ to qual+fasta
Message-ID:

question regarding the use of SeqIO.convert: how do I convert a FASTQ file
to qual and fasta files? Currently it seems that I have to run SeqIO.convert
twice e.g.:

SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")

Or am I missing something?

Thanks,

./I

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From p.j.a.cock at googlemail.com Sun Jan 16 19:25:43 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Jan 2011 19:25:43 +0000
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To:
References:
Message-ID:

On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
> question regarding the use of SeqIO.convert: how do I convert a FASTQ file
> to qual and fasta files? Currently it seems that I have to run SeqIO.convert
> twice e.g.:
>
> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>
> Or am I missing something?
>
> Thanks,
>
> ./I

Hi Iddo,

That is almost the simplest solution, yes.
You can use filenames directly:

SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")

Is it a bit slow for you?

Using SeqIO.convert(...) in this case does use optimised code for FASTQ
to FASTA, but currently we don't have a similar fast FASTQ to QUAL
function. See Bio/SeqIO/_convert.py if you want to know how this is
implemented. I can see several tricks for FASTQ to QUAL which should
work... do you fancy trying this yourself?

Alternatively, you could try combining a single call to SeqIO.parse(...) to
iterate over the records as SeqRecord objects with itertools.tee to split
this iterator in two to give it to two copies of SeqIO.write(...) to write
FASTA and QUAL. I don't know how well that would work with memory
consumption, but it would make only a single pass through the FASTQ file.

If speed really matters here, first we should add FASTQ to QUAL
to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for
FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess).

Peter

From idoerg at gmail.com Sun Jan 16 22:35:35 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 16 Jan 2011 17:35:35 -0500
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To:
References:
Message-ID: <4D3372B7.1080801@gmail.com>

On 01/16/2011 02:25 PM, Peter Cock wrote:
> On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
>> question regarding the use of SeqIO.convert: how do I convert a FASTQ file
>> to qual and fasta files? Currently it seems that I have to run SeqIO.convert
>> twice e.g.:
>>
>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>>
>> Or am I missing something?
>>
>> Thanks,
>>
>> ./I
> Hi Iddo,
>
> That is almost the simplest solution, yes.
You can use filenames directly:
>
> SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
> SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")
>
> Is it a bit slow for you?
>

Well, although elegant, in this case I am running two loops, where one
should suffice.

> Using SeqIO.convert(...) in this case does use optimised code for FASTQ
> to FASTA, but currently we don't have a similar fast FASTQ to QUAL
> function. See Bio/SeqIO/_convert.py if you want to know how this is
> implemented. I can see several tricks for FASTQ to QUAL which should
> work... do you fancy trying this yourself?

I wish I had the time.... :(

> Alternatively, you could try combining a single call to SeqIO.parse(...) to
> iterate over the records as SeqRecord objects with itertools.tee to split
> this iterator in two to give it to two copies of SeqIO.write(...) to write
> FASTA and QUAL. I don't know how well that would work with memory
> consumption, but it would make only a single pass through the FASTQ file.

That's actually what I ended up doing.

> If speed really matters here, first we should add FASTQ to QUAL
> to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for
> FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess).
>
> Peter

I think a fastq to fasta & qual would be best. I'll look into the QualityIO
module and see if my code can be massaged in there.

Thanks,

Iddo

--
Iddo Friedberg, Ph.D.
http://iddo-friedberg.org/contact.html

From p.j.a.cock at googlemail.com Sun Jan 16 22:58:30 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Jan 2011 22:58:30 +0000
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To: <4D3372B7.1080801@gmail.com>
References: <4D3372B7.1080801@gmail.com>
Message-ID:

On Sun, Jan 16, 2011 at 10:35 PM, Iddo Friedberg wrote:
> On 01/16/2011 02:25 PM, Peter Cock wrote:
>>
>> On Sun, Jan 16, 2011 at 6:48 PM, Iddo Friedberg wrote:
>>>
>>> question regarding the use of SeqIO.convert: how do I convert a
>>> FASTQ file to qual and fasta files? Currently it seems that I have
>>> to run SeqIO.convert twice e.g.:
>>>
>>>
>>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.qual","w"),"qual")
>>> SeqIO.convert(open("infile.fastq"),"fastq",open("outfile.fasta","w"),"fasta")
>>>
>>> Or am I missing something?
>>>
>>> Thanks,
>>>
>>> ./I
>>
>> Hi Iddo,
>>
>> That is almost the simplest solution, yes. You can use filenames directly:
>>
>> SeqIO.convert("infile.fastq", "fastq", "outfile.qual", "qual")
>> SeqIO.convert("infile.fastq", "fastq", "outfile.fasta", "fasta")
>>
>> Is it a bit slow for you?
>>
>
> Well, although elegant, in this case I am running two loops, where one
> should suffice.

KISS?

>> Using SeqIO.convert(...) in this case does use optimised code for FASTQ
>> to FASTA, but currently we don't have a similar fast FASTQ to QUAL
>> function. See Bio/SeqIO/_convert.py if you want to know how this is
>> implemented. I can see several tricks for FASTQ to QUAL which should
>> work... do you fancy trying this yourself?
>
> I wish I had the time.... :(

I can picture how I'd solve this - it shouldn't take me too long.

>> Alternatively, you could try combining a single call to SeqIO.parse(...)
>> to iterate over the records as SeqRecord objects with itertools.tee to
>> split this iterator in two to give it to two copies of SeqIO.write(...) to write
>> FASTA and QUAL.
I don't know how well that would work with memory
>> consumption, but it would make only a single pass through the FASTQ file.
>
> That's actually what I ended up doing.

Do you have the timings of this versus the two calls to SeqIO.convert()?
I'd also be curious to see your code for this.

>> If speed really matters here, first we should add FASTQ to QUAL
>> to Bio/SeqIO/_convert.py and if that isn't enough, do a special case for
>> FASTQ to FASTA and QUAL (to live in Bio.SeqIO.QualityIO I guess).
>
> I think a fastq to fasta & qual would be best. I'll look into the QualityIO
> module and see if my code can be massaged in there.

Maybe - assuming it would be faster than two calls to SeqIO.convert
(once FASTQ to QUAL is optimised).

Peter

From p.j.a.cock at googlemail.com Mon Jan 17 12:30:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 17 Jan 2011 12:30:44 +0000
Subject: [Biopython] FASTQ to qual+fasta
In-Reply-To:
References: <4D3372B7.1080801@gmail.com>
Message-ID:

On Sun, Jan 16, 2011 at 10:58 PM, Peter wrote:
>>> Using SeqIO.convert(...) in this case does use optimised code for FASTQ
>>> to FASTA, but currently we don't have a similar fast FASTQ to QUAL
>>> function. See Bio/SeqIO/_convert.py if you want to know how this is
>>> implemented. I can see several tricks for FASTQ to QUAL which should
>>> work... do you fancy trying this yourself?
>>
>> I wish I had the time.... :(
>
> I can picture how I'd solve this - it shouldn't take me too long.

Done:

https://github.com/biopython/biopython/commit/e26030a290b6acdf5a5fc431056593cccc5d892a

This makes FASTQ to QUAL about three times faster. There is
probably scope for speeding up how we do the line wrapping in
QUAL output - both in the normal SeqRecord based code
called by Bio.SeqIO.write() and in the new optimised code for
Bio.SeqIO.convert().

I'd still like to see your itertools.tee solution and your timings
of it against calling Bio.SeqIO.convert twice.

Peter

From almeida at cim.sld.cu Tue Jan 18 19:14:56 2011
From: almeida at cim.sld.cu (Yasser Almeida Hernández)
Date: Tue, 18 Jan 2011 14:14:56 -0500
Subject: [Biopython] Save coordinates of selected residues...
Message-ID: <1295378096.1796.7.camel@almeida-desktop>

Hi all...
I wonder how I can save the coordinates of a list of residues from a
structure.
For example, from the structure 154L I want to save the following
amino acids to a PDB file:
PHE 123
HIS 101
GLY 150
PHE 123
TYR 147
HIS 101
PHE 123
ASP 97
GLU 73
THR 165
ASN 148

What would the code look like?

Best regards and thanks in advance... ;)
--

Yasser Almeida Hernández, BSc.
Center of Molecular Immunology (CIM)
Tumor Biology Direction
Nanobiology Department
216 St. & 15th Ave, Siboney, Playa
P.O.Box 16040. Havana, Cuba
Phone: (+537) 214-3178
almeida at cim.sld.cu

From anaryin at gmail.com Wed Jan 19 14:49:41 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 19 Jan 2011 15:49:41 +0100
Subject: [Biopython] Save coordinates of selected residues...
In-Reply-To: <1295378096.1796.7.camel@almeida-desktop>
References: <1295378096.1796.7.camel@almeida-desktop>
Message-ID:

Hello Yasser,

You can either create a new Structure object with copies of the residues of
interest, and then use PDBIO to save it, or use the Dice module. If you
choose the latter, make sure that the accept_residue function returns True
only for those residues (you can base it on the residue id).

Best!

João [...] Rodrigues
http://doeidoei.wordpress.com

2011/1/18 Yasser Almeida Hernández

> Hi all...
> I wonder how I can save the coordinates of a list of residues from a
> structure.
> For example, from the structure 154L i want to save in a pdb file the > next amino acids: > PHE 123 > HIS 101 > GLY 150 > PHE 123 > TYR 147 > HIS 101 > PHE 123 > ASP 97 > GLU 73 > THR 165 > ASN 148 > > How would be the code...??? > > Best regards and thanks in advance... ;) > -- > > Yasser Almeida Hern?ndez, BSc. > Center of Molecular Immunology (CIM) > Tumor Biology Direction > Nanobiology Department > 216 St. & 15th Ave, Siboney, Playa > P.O.Box 16040. Havana, Cuba > Phone: (+537) 214-3178 > almeida at cim.sld.cu > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Thu Jan 20 14:52:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Jan 2011 14:52:52 +0000 Subject: [Biopython] FASTQ to qual+fasta In-Reply-To: References: <4D3372B7.1080801@gmail.com> Message-ID: On Mon, Jan 17, 2011 at 12:30 PM, Peter Cock wrote: > > Done: > > https://github.com/biopython/biopython/commit/e26030a290b6acdf5a5fc431056593cccc5d892a > > This makes FASTQ to QUAL about three times faster. There is > probably scope for speeding up how we do the line wrapping in > QUAL output - both in the the normal SeqRecord based code > called by Bio.SeqIO.write() and in the new optimised code for > Bio.SeqIO.convert(). > Hi Iddo, I found time yesterday to optimise the line wrapping for QUAL output, for a further significant speed-up. Please update to the current code from git and give it a try :) Peter From rojan at riken.jp Fri Jan 21 04:09:24 2011 From: rojan at riken.jp (Rojan Shrestha) Date: Fri, 21 Jan 2011 13:09:24 +0900 Subject: [Biopython] DSSP python version Message-ID: <000501cbb920$fd1e6230$f75b2690$@jp> Hello: I am very much interested on python libraries available for bioinformatics. Currently, I need python version of DSSP or any other programs that can evaluate secondary structure of protein from coordinate data. 
I am pretty sure there should be such a module. If you are familiar with
it, please let me know how it can be used in my own Python program.

Regards,

Rojan

From anaryin at gmail.com Fri Jan 21 07:21:01 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 21 Jan 2011 08:21:01 +0100
Subject: [Biopython] DSSP python version
In-Reply-To: <000501cbb920$fd1e6230$f75b2690$@jp>
References: <000501cbb920$fd1e6230$f75b2690$@jp>
Message-ID:

Hey Rojan,

Biopython provides an interface to the command line DSSP. To my knowledge,
no "python version" of DSSP exists. You'll have to get the program from
G.Vriend's webpage and then use either Biopython or your own code to pass
arguments and parse output.

João

From rojan at riken.jp Fri Jan 21 07:27:50 2011
From: rojan at riken.jp (Rojan Shrestha)
Date: Fri, 21 Jan 2011 16:27:50 +0900
Subject: [Biopython] DSSP python version
In-Reply-To:
References: <000501cbb920$fd1e6230$f75b2690$@jp>
Message-ID: <001a01cbb93c$b5b27960$21176c20$@jp>

Hello Joao:

Thank you very much. I found it in Biopython, but I have another problem
using it. When the PDB file does not have its last column, it gives an
error. It is a problem in the pdbparser library.

Regards,

Rojan

From: João Rodrigues [mailto:anaryin at gmail.com]
Sent: Friday, January 21, 2011 4:21 PM
To: rojan at riken.jp
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] DSSP python version

Hey Rojan,

Biopython provides an interface to the command line DSSP. To my knowledge,
no "python version" of DSSP exists. You'll have to get the program from
G.Vriend's webpage and then use either Biopython or your own code to pass
arguments and parse output.
João

From anaryin at gmail.com Fri Jan 21 08:11:35 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 21 Jan 2011 09:11:35 +0100
Subject: [Biopython] DSSP python version
In-Reply-To: <001a01cbb93c$b5b27960$21176c20$@jp>
References: <000501cbb920$fd1e6230$f75b2690$@jp> <001a01cbb93c$b5b27960$21176c20$@jp>
Message-ID:

Which column are you missing? Which error does the parser give?

From dalke at dalkescientific.com Sat Jan 22 12:05:30 2011
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 22 Jan 2011 13:05:30 +0100
Subject: [Biopython] DSSP python version
In-Reply-To: <000501cbb920$fd1e6230$f75b2690$@jp>
References: <000501cbb920$fd1e6230$f75b2690$@jp>
Message-ID: <822F6196-A750-4C3C-B677-5A8CC498D50A@dalkescientific.com>

On Jan 21, 2011, at 5:09 AM, Rojan Shrestha wrote:
> I am very much interested on python libraries available for bioinformatics.
> Currently, I need python version of DSSP or any other programs that can
> evaluate secondary structure of protein from coordinate data.

Another option is STRIDE, which is what VMD uses for its secondary
structure calculation code. That's all available through the scripting
interface, and VMD supports Python.

(Some history now; or should I say reminiscing?)

I chose STRIDE instead of DSSP for VMD because 15 years ago there
were only three secondary structure programs I could find. DSSP didn't
allow redistribution and its output format was nasty to parse. The DSSP
implementation in RasMol was completely undocumented. STRIDE, by
contrast, allowed redistribution and was easy to parse.

I have no idea of the current state of the art.
Andrew
dalke at dalkescientific.com

From schaefer at rostlab.org Mon Jan 24 13:27:03 2011
From: schaefer at rostlab.org (Christian Schaefer)
Date: Mon, 24 Jan 2011 14:27:03 +0100
Subject: [Biopython] Retrieve RefSeq sequence
Message-ID: <4D3D7E27.8040005@rostlab.org>

Hi there,

I'm just wondering if there's a possibility to retrieve a protein
sequence from NCBI's RefSeq just by giving its identifier, e.g. like
NP_031402.3 (.3 being the version). Could this be done by the
Bio.Entrez library? If so, how would I do this?

Thanks in advance,
Chris

--
Dipl.-Bioinf. Christian Schaefer
Technical University Munich
Department for Bioinformatics
Faculty of Computer Science/I12
Boltzmannstr. 3
D-85748 Garching b. Muenchen
Germany
http://www.rostlab.org/~schaefer
http://gsish.tum.edu/

From moritz.beber at googlemail.com Tue Jan 25 06:03:23 2011
From: moritz.beber at googlemail.com (Moritz Beber)
Date: Mon, 24 Jan 2011 23:03:23 -0700
Subject: [Biopython] BRENDA parser
Message-ID: <4D3E67AB.5030300@googlemail.com>

Dear list users,

I've searched online and in the mailing archives but couldn't find
anything on the topic. Did anyone come across a python parser for BRENDA
before or has written one themselves?

If that's not the case, what would be the guidelines for writing one and
submitting it to Biopython?

TIA,
Moritz

From ruchira.datta at gmail.com Tue Jan 25 06:20:52 2011
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Mon, 24 Jan 2011 22:20:52 -0800
Subject: [Biopython] BRENDA parser
In-Reply-To: <4D3E67AB.5030300@googlemail.com>
References: <4D3E67AB.5030300@googlemail.com>
Message-ID:

I parsed very little of BRENDA in Python -- all I wanted were which UniProt
accessions went with which ECs.

--Ruchira

On Mon, Jan 24, 2011 at 10:03 PM, Moritz Beber wrote:

> Dear list users,
>
> I've searched online and in the mailing archives but couldn't find
> anything on the topic. Did anyone come across a python parser for BRENDA
> before or has written one themselves?
>
> If that's not the case, what would be the guidelines for writing one and
> submitting it to Biopython?
>
> TIA,
> Moritz
>

From chapmanb at 50mail.com Tue Jan 25 12:27:48 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 25 Jan 2011 07:27:48 -0500
Subject: [Biopython] Retrieve RefSeq sequence
In-Reply-To: <4D3D7E27.8040005@rostlab.org>
References: <4D3D7E27.8040005@rostlab.org>
Message-ID: <20110125122747.GJ27283@sobchak.mgh.harvard.edu>

Chris;

> I'm just wondering if there's a possibility to retrieve a protein
> sequence from NCBI's RefSeq just by giving its identifier, e.g. like
> NP_031402.3 (.3 being the version). Could this be done by the
> Bio.Entrez library? If so, how would I do this?

Definitely. The tutorial has details about doing this and much more:

http://www.biopython.org/DIST/docs/tutorial/Tutorial.html

Chapter 8 describes the Entrez interface to NCBI in detail.
For your specific question, you can do this in two steps by searching for
the GI number with the identifier using esearch, then retrieving the
sequence with efetch:

In [1]: from Bio import Entrez

In [2]: Entrez.email = "you at email.com"

In [3]: rec = Entrez.read(Entrez.esearch(db="protein", term="NP_031402.3"))

In [4]: print rec["IdList"]
['110347469']

In [5]: p_handle = Entrez.efetch(db="protein", id=rec["IdList"][0], rettype="fasta")

In [6]: print p_handle.read()
>gi|110347468|ref|NP_031402.3| alpha-2-macroglobulin precursor [Mus musculus]
MRRNQLPTPAFLLLFLLLPRDATTATAKPQYVVLVPSEVYSGVPEKACVSLNHVNETVMLSLTLEYAMQQ
TKLLTDQAVDKDSFYCSPFTISGSPLPYTFITVEIKGPTQRFIKKKSIQIIKAESPVFVQTDKPIYKPGQ
IVKFRVVSVDISFRPLNETFPVVYIETPKRNRIFQWQNIHLAGGLHQLSFPLSVEPALGIYKVVVQKDSG
KKIEHSFEVKEYVLPKFEVIIKMQKTMAFLEEELPITACGVYTYGKPVPGLVTLRVCRKYSRYRSTCHNQ
NSMSICEEFSQQADDKGCFRQVVKTKVFQLRQKGHDMKIEVEAKIKEEGTGIELTGIGSCEIANALSKLK
FTKVNTNYRPGLPFSGQVLLVDEKGKPIPNKNITSVVSPLGYLSIFTTDEHGLANISIDTSNFTAPFLRV
VVTYKQNHVCYDNWWLDEFHTQADHSATLVFSPSQSYIQLELVFGTLACGQTQEIRIHYLLNEDIMKNEK
DLTFYYLIKARGSIFNLGSHVLSLEQGNMKGVFSLPIQVEPGMAPEAQLLIYAILPNEELVADAQNFEIE
KCFANKVNLSFPSAQSLPASDTHLKVKAAPLSLCALTAVDQSVLLLKPEAKLSPQSIYNLLPGKTVQGAF
FGVPVYKDHENCISGEDITHNGIVYTPKHSLGDNDAHSIFQSVGINIFTNSKIHKPRFCQEFQHYPAMGG
VAPQALAVAASGPGSSFRAMGVPMMGLDYSDEINQVVEVRETVRKYFPETWIWDLVPLDVSGDGELAVKV
PDTITEWKASAFCLSGTTGLGLSSTISLQAFQPFFLELTLPYSVVRGEAFTLKATVLNYMSHCIQIRVDL
EISPDFLAVPVGGHENSHCICGNERKTVSWAVTPKSLGEVNFTATAEALQSPELCGNKLTEVPALVHKDT
VVKSVIVEPEGIEKEQTYNTLLCPQDTELQDNWSLELPPNVVEGSARATHSVLGDILGSAMQNLQNLLQM
PYGCGEQNMVLFVPNIYVLNYLNETQQLTEAIKSKAINYLISGYQRQLNYQHSDGSYSTFGNHGGGNTPG
NTWLTAFVLKAFAQAQSHIFIEKTHITNAFNWLSMKQKENGCFQQSGYLLNNAMKGGVDDEVTLSAYITI
ALLEMPLPVTHSAVRNALFCLETAWASISQSQESHVYTKALLAYAFALAGNKAKRSELLESLNKDAVKEE
DSLHWQRPGDVQKVKALSFYQPRAPSAEVEMTAYVLLAYLTSESSRPTRDLSSSDLSTASKIVKWISKQQ
NSHGGFSSTQDTVVALQALSKYGAATFTRSQKEVLVTIESSGTFSKTFHVNSGNRLLLQEVRLPDLPGNY
VTKGSGSGCVYLQTSLKYNILPVADGKAPFALQVNTLPLNFDKAGDHRTFQIRINVSYTGERPSSNMVIV
DVKMVSGFIPMKPSVKKLQDQPNIQRTEVNTNHVLIYIEKLTNQTLGFSFAVEQDIPVKNLKPAPIKVYD
YYETDEFTVEEYSAPFSDGSEQGNA

Hope this helps,
Brad

From fahy at chapman.edu Wed Jan 26 01:34:50 2011
From: fahy at chapman.edu (Michael Fahy)
Date: Tue, 25 Jan 2011 17:34:50 -0800
Subject: [Biopython] Entrez.efetch from gene
In-Reply-To: <4D24998E.2000200@cornell.edu>
References: <4D24998E.2000200@cornell.edu>
Message-ID: <000001cbbcf9$395d5e50$ac181af0$@edu>

Trying to use Entrez.efetch() to query the gene database.

The efetch help at
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html says
there are no retrieval types supported by the gene database. If I do an
efetch query without specifying a value for rettype, it returns html. Is
there a way in Biopython to parse this html? Or is there another way to
query the gene database so it will return data that can be parsed?

Sample code:

from Bio import Entrez
Entrez.email = 'email at chapman.edu'

search_database = 'gene'
search_term = 'YIL065C'

handle = Entrez.esearch(db='gene',term=search_term)
record = Entrez.read(handle)
reclist = record['IdList']

handle = Entrez.efetch(db=search_database, id =reclist[0])

myrecord = handle.read()
print myrecord

----------------------------------------------------------
Michael A. Fahy
fahy at chapman.edu

From mjldehoon at yahoo.com Wed Jan 26 10:29:25 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 26 Jan 2011 02:29:25 -0800 (PST)
Subject: [Biopython] Entrez.efetch from gene
In-Reply-To: <000001cbbcf9$395d5e50$ac181af0$@edu>
Message-ID: <503379.20505.qm@web161208.mail.bf1.yahoo.com>

Did you try retmode instead of rettype?
Probably retmode='xml' will give you output in the XML format, which can
then be parsed by Bio.Entrez.

--Michiel

--- On Tue, 1/25/11, Michael Fahy wrote:

> From: Michael Fahy
> Subject: [Biopython] Entrez.efetch from gene
> To: Biopython at lists.open-bio.org
> Date: Tuesday, January 25, 2011, 8:34 PM
> Trying to use Entrez.efetch() to
> query the gene database.
>
> The efetch help at
> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
> says
> there are no retrieval types supported by the gene
> database. If I do an
> efetch query without specifying a value for rettype, it
> returns html. Is
> there a way in Biopython to parse this html? Or is
> there another way to
> query the gene database so it will return data that can be
> parsed?
>
> Sample code:
>
> from Bio import Entrez
> Entrez.email = 'email at chapman.edu'
>
> search_database = 'gene'
> search_term = 'YIL065C'
>
> handle = Entrez.esearch(db='gene',term=search_term)
> record = Entrez.read(handle)
> reclist = record['IdList']
>
> handle = Entrez.efetch(db=search_database, id
> =reclist[0])
>
> myrecord = handle.read()
> print myrecord
>
> ----------------------------------------------------------
> Michael A. Fahy
> fahy at chapman.edu
>

From ediths at botinst.uzh.ch Wed Jan 26 11:05:35 2011
From: ediths at botinst.uzh.ch (Edith Schlagenhauf)
Date: Wed, 26 Jan 2011 12:05:35 +0100 (MET)
Subject: [Biopython] Entrez.efetch from gene
In-Reply-To: <000001cbbcf9$395d5e50$ac181af0$@edu>
References: <4D24998E.2000200@cornell.edu> <000001cbbcf9$395d5e50$ac181af0$@edu>
Message-ID:

Using retmode='xml' returns the data in XML format, i.e.

handle = Entrez.efetch(db=search_database, id =reclist[0], retmode='xml')

# The Bio.Entrez.read() function can parse most (if
# not all) XML output returned by Entrez.
record = Entrez.read(handle)
handle.close()

print record[0].keys() # prints list of available keys
print record[0]["Entrezgene_rna"] # example key

HTH,
Edith

******************************************
Dr Edith Schlagenhauf
University of Zurich
SWITZERLAND
e-mail: ediths AT botinst DOT uzh DOT ch
******************************************

On Tue, 25 Jan 2011, Michael Fahy wrote:

> Trying to use Entrez.efetch() to query the gene database.
>
> The efetch help at
> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html says
> there are no retrieval types supported by the gene database. If I do an
> efetch query without specifying a value for rettype, it returns html. Is
> there a way in Biopython to parse this html? Or is there another way to
> query the gene database so it will return data that can be parsed?
>
> Sample code:
>
> from Bio import Entrez
> Entrez.email = 'email at chapman.edu'
>
> search_database = 'gene'
> search_term = 'YIL065C'
>
> handle = Entrez.esearch(db='gene',term=search_term)
> record = Entrez.read(handle)
> reclist = record['IdList']
>
> handle = Entrez.efetch(db=search_database, id =reclist[0])
>
> myrecord = handle.read()
> print myrecord
>
> ----------------------------------------------------------
> Michael A. Fahy
> fahy at chapman.edu
>

From bergland at stanford.edu Mon Jan 31 18:50:36 2011
From: bergland at stanford.edu (Alan Bergland)
Date: Mon, 31 Jan 2011 10:50:36 -0800
Subject: [Biopython] internal function to convert illumina quality scores to phred
Message-ID: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>

Hi all,

I am trying to convert some code I've written to use FastqGeneralIterator
rather than SeqIO.parse. For the most part, it works great and there is a
big speed improvement.
However, I need to be able to convert the quality scores of 6 characters
from the Illumina format to phred. I can't seem to find the function to do
this. I'm sure it must exist, and I apologize if documentation for it is
sitting right there in the tutorial - I can't seem to find it. Can someone
point me in the right direction?

Cheers,
Alan

From biopython at maubp.freeserve.co.uk Mon Jan 31 19:36:02 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Jan 2011 19:36:02 +0000
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>
Message-ID:

On Mon, Jan 31, 2011 at 6:50 PM, Alan Bergland wrote:
> Hi all,
>
> I am trying to convert some code I've written to use
> FastqGeneralIterator rather than SeqIO.parse. For the most part, it works
> great and there is a big speed improvement. However, I need to be able to
> convert the quality scores of 6 characters from the Illumina format to
> phred. I can't seem to find the function to do this. I'm sure it must
> exist, and I apologize if documentation for it is sitting right there in the
> tutorial - I can't seem to find it. Can someone point me in the right
> direction?
>
> Cheers,
> Alan

Hi Alan,

Probably something in Bio.SeqIO.QualityIO will do what you want,
consult the module's built in documentation via help(...) in Python
or the online version which is here:
http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html

I could be more precise if you could clarify what exactly it is you
want to do with a couple of examples (input, desired output).

If you just want fast Solexa/Illumina FASTQ to Sanger FASTQ or a
PHRED style QUAL file from within Python use Bio.SeqIO.convert
for this.
Peter

From bergland at stanford.edu Mon Jan 31 19:54:23 2011
From: bergland at stanford.edu (Alan Bergland)
Date: Mon, 31 Jan 2011 11:54:23 -0800
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>
Message-ID: <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>

Hi Peter,

Thanks for the quick reply. So, I am trying to iterate through two large
fastq files (each file is one paired-end read) and split the reads by one
of 9 barcodes found on both 5' ends of each read. I would like to use the
quality information for those barcode reads to assess which barcode-group
they belong to.

I think it would be nice to use FastqGeneralIterator because I don't need
to translate the quality scores for the full read (100bp) back and forth
while I iterate through the file. I gather that when I use SeqIO.parse and
SeqIO.write, the quality scores are converted back and forth. There is no
need to do this for the whole read.

I've written a little snippet of code that simply prints the quality
scores from the barcodes:

from Bio import SeqIO
from Bio.SeqIO.QualityIO import *
from Bio.SeqIO import *

pe1 = open("head2_pe1.fastq", "r")
pe2 = open("head2_pe2.fastq", "r")

pe1_record_it = FastqGeneralIterator(pe1)

for pe1_seq_record in pe1_record_it:
    bc = SeqRecord(Seq(pe1_seq_record[1][:6]), id="a")
    bc.letter_annotations['fastq-illumina'] = pe1_seq_record[2][:6]

    print bc.letter_annotations["fastq-illumina"]

This just prints out the Illumina-encoded quality scores. How would I
print out the phred scores instead?

Thanks,
Alan

On Jan 31, 2011, at 11:36 AM, Peter wrote:

> On Mon, Jan 31, 2011 at 6:50 PM, Alan Bergland
> wrote:
>> Hi all,
>>
>> I am trying to convert some code I've written to use
>> FastqGeneralIterator rather than SeqIO.parse. For the most part,
>> it works
>> great and there is a big speed improvement.
However, I need to be able to
>> convert the quality scores of 6 characters from the Illumina format to
>> phred. I can't seem to find the function to do this. I'm sure it must
>> exist, and I apologize if documentation for it is sitting right there
>> in the tutorial - I can't seem to find it. Can someone point me in the
>> right direction?
>>
>> Cheers,
>> Alan
>
> Hi Alan,
>
> Probably something in Bio.SeqIO.QualityIO will do what you want,
> consult the module's built in documentation via help(...) in Python
> or the online version which is here:
> http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html
>
> I could be more precise if you could clarify what exactly it is you
> want to do with a couple of examples (input, desired output).
>
> If you just want fast Solexa/Illumina FASTQ to Sanger FASTQ or a
> PHRED style QUAL file from within Python use Bio.SeqIO.convert
> for this.
>
> Peter

From biopython at maubp.freeserve.co.uk Mon Jan 31 20:50:30 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Jan 2011 20:50:30 +0000
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To: <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
Message-ID:

On Mon, Jan 31, 2011 at 7:54 PM, Alan Bergland wrote:
> Hi Peter,
>
> Thanks for the quick reply. So, I am trying to iterate through two
> large fastq files (each file is one paired-end read) and split the reads by
> one of 9 barcodes found on both 5' ends of each read. I would like to use
> the quality information for those barcode reads to assess which
> barcode-group they belong to.

I suspect that it is simpler to just ignore reads which don't match
the barcode within 1 or 2 mismatches, without worrying about their
qualities.
Using the qualities will cost you computational time and effort for a
relatively small improvement in the number of filtered reads. YMMV.

If you look at the archives there was a discussion earlier in Jan about
doing this sort of thing with SFF files. I'm currently (between other
tasks) trying to wrap up some sort of PCR-primer/barcode/adaptor
filtering (Bio)python script as a tool for Galaxy, see
http://usegalaxy.org

> I think it would be nice to use FastqGeneralIterator because I don't
> need to translate the quality scores for the full read (100bp) back and
> forth while I iterate through the file. I gather that when I use
> SeqIO.parse and SeqIO.write, the quality scores are converted back and
> forth. There is no need to do this for the whole read.
>
> I've written a little snippet of code that simply prints the quality
> scores from the barcodes:
>
> from Bio import SeqIO
> from Bio.SeqIO.QualityIO import *
> from Bio.SeqIO import *
>
> pe1 = open("head2_pe1.fastq", "r")
> pe2 = open("head2_pe2.fastq", "r")
>
> pe1_record_it = FastqGeneralIterator(pe1)
>
> for pe1_seq_record in pe1_record_it:
>     bc = SeqRecord(Seq(pe1_seq_record[1][:6]), id="a")
>     bc.letter_annotations['fastq-illumina'] = pe1_seq_record[2][:6]
>
>     print bc.letter_annotations["fastq-illumina"]
>
> this just prints out the illumina encoded quality scores. How would I print
> out the phred scores instead?
>
> Thanks,
> Alan

If you have Illumina 1.3 or later, then they use PHRED scores (not
Solexa scores, which have a different log encoding).

If, as it appears above, you want to use SeqRecord objects, I think you
might as well use Bio.SeqIO.parse and write. The point of using
FastqGeneralIterator for speed is to avoid the overheads of creating
SeqRecord objects. See e.g.:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter

P.S.
Watch out for the fact that Illumina are planning to switch to the standard Sanger FASTQ encoding in their next release: http://seqanswers.com/forums/showthread.php?t=8895
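
[Illustration] For the question in this thread, decoding a short quality
string by hand is simple enough to sketch without SeqRecord objects. The
sketch below is not a Biopython function; decode_qualities and
solexa_to_phred are hypothetical helper names. It assumes the usual ASCII
offsets (64 for Illumina 1.3+ files, 33 for Sanger files) and the standard
Solexa-to-PHRED formula, rounded to the nearest integer. Making the offset
a parameter is meant to cope with the encoding switch mentioned in the P.S.

```python
import math


def decode_qualities(quality_string, offset=64):
    """Turn FASTQ quality characters into integer scores.

    offset=64 assumes Illumina 1.3+ encoding; pass offset=33 for Sanger FASTQ.
    """
    return [ord(char) - offset for char in quality_string]


def solexa_to_phred(solexa_score):
    """Map one Solexa score onto the PHRED scale, rounded to an integer.

    Only needed for old Solexa-scaled files, not for Illumina 1.3+ data.
    """
    return int(round(10 * math.log10(10 ** (solexa_score / 10.0) + 1)))


def barcode_scores(fastq_tuple, barcode_length=6, offset=64):
    """Decode only the barcode qualities from a (title, sequence, quality) tuple.

    The tuple layout matches what the snippet in this thread indexes as
    record[1] (sequence) and record[2] (quality string).
    """
    title, sequence, quality = fastq_tuple
    return decode_qualities(quality[:barcode_length], offset)
```

With tuples from FastqGeneralIterator, barcode_scores(record) would give
the six barcode scores while leaving the rest of the read untouched.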