From italo.maia at gmail.com Tue Sep 1 01:22:14 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Tue, 1 Sep 2009 02:22:14 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
Message-ID: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com>

Is it possible to create phylogenetic trees with Biopython alone, or will I
have to "phylip things up" a little? Phylip doesn't seem to allow execution
with options, as blast does, for example, and that really bothers me. : /

--
"Arrogance is the weapon of the weak."

===========================
Italo Moreira Campelo Maia
Computer Science - UECE
Web and Desktop Developer
Java, Python Programmer
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From stran104 at chapman.edu Tue Sep 1 02:09:23 2009
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 31 Aug 2009 23:09:23 -0700
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com>
References: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com>
	<2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com>
Message-ID: <2a63cc350908312309g59b6cc17i5fb67625acb97fe@mail.gmail.com>

As far as I know (which doesn't say much) Biopython does not wrap the
Phylip programs. However, you can achieve this through some fairly simple
scripting. Phylip allows for options to be specified in command files.

Informally, these command files consist of the same keystrokes you would
enter when running a Phylip program.

A command file to run the program with the default options would look like:
Y\n
This corresponds to pressing Y to accept the default options and pressing
enter for a line break. You can specify any other options here as well. You
can also specify the name of your "infile" (sequence file) on the first line
of the command file.

Then to run phylip with a command file you might execute something like:
phylip protpars < command_file_name

Then to wrap this in Python just programmatically generate your command
files and use the os.system command to execute phylip.
(e.g. os.system('phylip protpars < protpars_commands'))

I hope this helps and good luck.

--
Matthew Strand
stran104 at chapman.edu

From stran104 at chapman.edu Tue Sep 1 02:12:26 2009
From: stran104 at chapman.edu (Matthew Strand)
Date: Mon, 31 Aug 2009 23:12:26 -0700
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com>
References: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com>
	<2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com>
Message-ID: <2a63cc350908312312w2f51b475w7e4ec7101d80572@mail.gmail.com>

Shoot, I've fudged things up a bit and created a duplicate thread. Here is
my original message, I don't know if someone may be able to delete the other
thread, but hopefully so.
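
The command-file approach Matthew describes can be sketched in Python
roughly as follows. This is only a sketch under stated assumptions, not a
Biopython API: the function names build_command_text and run_phylip are
hypothetical, a "phylip" launcher that takes the program name as its
argument is assumed (as in the "phylip protpars < command_file_name"
example), and Python 3's subprocess module feeds the keystrokes on standard
input instead of os.system reading them from a separate file.

```python
import shutil
import subprocess


def build_command_text(infile=None, keystrokes=()):
    """Return the keystrokes a PHYLIP program would read from stdin.

    infile     -- optional sequence file name, sent as the first line
    keystrokes -- menu entries to send before accepting the settings
                  (placeholders here; the valid letters depend on the program)
    """
    lines = []
    if infile is not None:
        lines.append(infile)
    lines.extend(keystrokes)
    lines.append("Y")  # accept the current settings and start the run
    return "\n".join(lines) + "\n"


def run_phylip(program, command_text):
    """Run one PHYLIP program, feeding it command_text on standard input."""
    # Assumption: a "phylip" launcher taking the program name is installed.
    if shutil.which("phylip") is None:
        raise RuntimeError("phylip executable not found on PATH")
    return subprocess.run(
        ["phylip", program],
        input=command_text,
        text=True,
        capture_output=True,
        check=True,
    )
```

With no arguments, build_command_text() returns the default-options command
file from the message ("Y" followed by a line break); calling
run_phylip("protpars", build_command_text()) would then start the run if
PHYLIP is installed.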

--
Matthew Strand
stran104 at chapman.edu
phone: (626) 524-4449
skype: matstrand

From winda002 at student.otago.ac.nz Tue Sep 1 02:17:04 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 01 Sep 2009 18:17:04 +1200
Subject: [Biopython] Phylogenetic trees with biopython?
Message-ID: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>

> Is it possible to create phylogenetic trees with biopython alone or i'll
> have to "phylip things up" a little? Phylip doesn't seem to allow execution
> with options, as blast does, for example, and that really bothers me. :

Hi Italo,

It depends on what exactly you want to do. If you want to run phylip
programs as part of a biopython script then there are classes in
Bio.Emboss.Applications for building up command lines to call the EMBOSS
versions of enough of the phylip packages to make bootstrapped distance or
parsimony trees. That would mean installing EMBOSS if you don't already have
it, but it makes automating phylip much, much easier.

Those should let you define all the relevant arguments (if they don't, it's
easy to add them, so shout out), but they are for the 'old' versions of
phylip. I'm sure you can still get the EMBOSS versions of these from their
site, but there are also 'new' versions which are meant to be a little
faster but take different arguments (so the existing classes won't help
you).
I put up a branch on github which has classes for the new versions as
well as for PhyML in Bio.Phylo here:
http://github.com/dwinter/biopython/tree/phylo

Hope that helps you out,
David

From biopython at maubp.freeserve.co.uk Tue Sep 1 05:14:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 10:14:31 +0100
Subject: [Biopython] IDLE problem
In-Reply-To: <4A9C60B3.4040605@rockefeller.edu>
References: <4A9C60B3.4040605@rockefeller.edu>
Message-ID: <320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com>

On Tue, Sep 1, 2009 at 12:45 AM, xiaoa wrote:
> Hi,
>
> I am new to python and biopython. I ran into a problem when using
> Entrez.esearch and efetch. My script worked fine when I used python 2.6.2
> command line (console), but it returned an empty line when I ran it in IDLE.
> IDLE seems to be working, because I tested with 1. another python script
> (no Entrez modules) and 2. even Entrez.einfo--worked fine. I am using
> Windows Vista, 64-bit and Biopython 1.51 and Python 2.6.2.
> Thanks in advance,
>
> Andrew

Hi Andrew,

Due to occasional network issues, Entrez scripts are not always 100%
reproducible. Perhaps the NCBI was under very high load at the time?
It is difficult to say any more without knowing what your script does.

Also, I see you say you are using Windows Vista, 64-bit and Biopython 1.51
and Python 2.6.2 - how did you install Biopython? I've never used Vista,
and we don't provide 64-bit installers.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 05:21:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 10:21:54 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>

On Tue, Sep 1, 2009 at 7:17 AM, David Winter wrote:
> It depends on what exactly you want to do. If you want to run phylip
> programs as part of a biopython script then there are classes in
> Emboss.Applications for building up command lines to call the EMBOSS
> versions of enough of the phylip packages to make bootstrapped distance or
> parsimony trees. That would mean installing EMBOSS if you don't already have
> it but it makes automating phylip much, much easier.

Yes - we definitely recommend the EMBOSS versions of the PHYLIP tools
because they support command line arguments, and the originals don't.

> Those should let you define all the relevant arguments (if they don't it's
> easy to add them, so shout out) but they are for the 'old' versions of
> phylip, I'm sure you can still get the EMBOSS versions of these from their
> site but there are also 'new' versions which are meant to be a little
> faster but take different arguments (so the existing classes won't help
> you).
> I put up a branch on github which has classes for the new versions as
> well as for PhyML in Bio.Phylo here:
> http://github.com/dwinter/biopython/tree/phylo

Yes, as David points out, Bio.Emboss.Applications has wrappers for the
"old" versions from PHYLIP 3.572 (whose EMBOSS names start with e):
http://emboss.sourceforge.net/apps/release/6.1/embassy/phylip/

We should add the "new" versions from PHYLIP 3.6 (whose EMBOSS names
start with f):
http://emboss.sourceforge.net/apps/release/6.1/embassy/phylipnew/

David - I would prefer we also put your new wrappers in
Bio.Emboss.Applications, and would be happy to look at adding
those to CVS now that Biopython 1.51 is out (I had forgotten
about them actually - so thanks for the reminder).

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 10:00:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 15:00:32 +0100
Subject: [Biopython] IDLE problem
In-Reply-To: <4A9D269E.6080601@rockefeller.edu>
References: <4A9C60B3.4040605@rockefeller.edu>
	<320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com>
	<4A9D269E.6080601@rockefeller.edu>
Message-ID: <320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>

Hi Andrew,

Please keep the mailing list CC'd on replies.

On Tue, Sep 1, 2009 at 2:50 PM, xiaoa wrote:
> Hi Peter,
>
> I forgot to mention that although my OS is 64 bit, I installed 32-bit Python
> 2.6.2 because the IDLE for 64-bit Python 2.6.2 doesn't work in Vista. So
> everything is in 32.
> It seems odd that everything works fine in commandline
> but not in IDLE.
>
> Andrew

I see - if you are using the 32 bit version of Python, then our
Windows installer might work. You may be the first person to
report trying this on Windows Vista...

Right now I am not sure what could be going wrong with Entrez.
If you can show us your script (ideally a cut down example to
show the problem) we may be able to help.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 13:01:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 18:01:20 +0100
Subject: [Biopython] Removing deprecated module Bio.EUtils
Message-ID: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com>

Hi all,

The Bio.Entrez module has long been our preferred interface to the NCBI
Entrez Utilities. It replaced the old Bio.EUtils module which was
officially deprecated in Biopython 1.48, released a year ago (Sept 2008).
In line with our deprecation policy, I plan to remove Bio.EUtils in the
next release.

Are there any objections? If anyone is still using the Bio.EUtils
module in old code, please feel free to ask for tips on porting this
to Bio.Entrez.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 13:05:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 18:05:27 +0100
Subject: [Biopython] Removing deprecated BLAST HTML parser
Message-ID: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com>

Hi all,

The old HTML BLAST parser in Bio.Blast.NCBIWWW was deprecated a year
ago in Biopython 1.48, and in line with our deprecation policy I would
like to remove this for the next release. Are there any objections?

The preferred BLAST output for parsing (as recommended by the NCBI
themselves) is XML. We also have a parser for the plain text output,
but this is not updated very frequently and the NCBI have a history
of making minor changes to the layout and breaking parsers.
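
As a rough illustration of why the XML output is the easier one to parse,
the sketch below pulls the best E-value per hit out of a BLAST XML document
using only the Python standard library. The embedded XML is a hand-written
fragment for illustration, not real BLAST output; the function name
best_evalues is hypothetical; and the tag names (Hit, Hit_id, Hsp_evalue)
are assumed to follow the NCBI BLAST XML layout. In Biopython itself the
XML parser lives in Bio.Blast.NCBIXML.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; not real BLAST output.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|1</Hit_id>
          <Hit_def>hypothetical sequence one</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def best_evalues(xml_text):
    """Return a list of (hit id, smallest E-value) pairs, one per hit."""
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("Hit"):
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        if evalues:  # skip hits that carry no HSPs
            results.append((hit.findtext("Hit_id"), min(evalues)))
    return results
```

Because every value sits in a named element, small layout changes in the
report do not affect this kind of lookup, which is the weakness of the
plain text parser mentioned above.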

Peter

From winda002 at student.otago.ac.nz Tue Sep 1 18:38:04 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 10:38:04 +1200
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
Message-ID: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>

> David - I would prefer we also put your new wrappers in
> Bio.Emboss.Applications, and would be happy to look at adding
> those to CVS now that Biopython 1.51 is out (I had forgotten
> about them actually - so thanks for the reminder).
>
> Peter

Hi Peter,

I'd almost forgotten about them myself! I only put them in their own module
because I had the PhyML wrapper as well and that's not an EMBOSS
application.

I suspect a wrapper for PhyML is probably not going to be widely useful (a
normal run lasts at least several hours and most people will want to look
over their alignments by eye before they set it off). So I'll move the
phylip ones into Bio.Emboss.Applications and gather a few thoughts about
other phylogenetic software including PhyML and see what the dev list thinks
about them.

david

From winda002 at student.otago.ac.nz Tue Sep 1 18:25:49 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 10:25:49 +1200
Subject: [Biopython] IDLE problem
In-Reply-To: <320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>
References: <4A9C60B3.4040605@rockefeller.edu>
	<320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com>
	<4A9D269E.6080601@rockefeller.edu>
	<320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>
Message-ID: <20090902102549.964215i7y6cuyru5@www.studentmail.otago.ac.nz>

>> I forgot to mention that although my OS is 64 bit, I installed 32-bit Python
>> 2.6.2 because the IDLE for 64-bit Python 2.6.2 doesn't work in Vista.
>> So everything is in 32. It seems odd that everything works fine in
>> commandline but not in IDLE.
>>
>> Andrew

Hi Andrew,

You don't connect to the internet via a proxy, do you? I've just been
playing with esearch() in IDLE vs the command line in Vista and found that
both worked fine with a direct connection, but when the system-wide internet
options are set to a proxy, IDLE hangs at the point at which the command
line asks for a username/password.

As I say, otherwise everything works fine for me, so if it's not that then
I'm no help.

Cheers,
David

From italo.maia at gmail.com Tue Sep 1 22:37:05 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Tue, 1 Sep 2009 23:37:05 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
Message-ID: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>

Thank you everyone for your answers. I was about to give up and run my own
wrappers over phylip (emboss for my ubuntu is too old : /) when I just found
out that clustal can create phylogenetic trees too. On the command line,
with the options 4, 1 and 4, I just made a tree here, from a .aln file
generated with clustalw. Does anyone dislike this approach? It seems
easy/fast/efficient enough for me.

PS: I just made a simple frontend for blast, formatdb, clustalw and muscle.
Right now I'm going to add phylogenetic trees, and then I'm finished. It's
my graduation thesis, by the way.

From nuin at genedrift.org Tue Sep 1 22:46:48 2009
From: nuin at genedrift.org (Paulo Nuin)
Date: Tue, 1 Sep 2009 22:46:48 -0400
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
Message-ID: <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>

ClustalW trees are extremely "simple" and can only be generated with
Neighbour Joining. Also, they are not based on the final sequence alignment
created by the program but serve as a guide for the alignment itself.
They have a huge probability of being "wrong" or not representing the actual
relationships. It will heavily depend on the type, distance and differences
among the sequences you are using.

What do you need the trees for?

Paulo

From italo.maia at gmail.com Tue Sep 1 23:17:09 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 00:17:09 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
	<365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>
Message-ID: <800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>

I want to use them to study evolution. Maybe guide some protein modelling.
Would you suggest a better use?

From nuin at genedrift.org Tue Sep 1 23:38:51 2009
From: nuin at genedrift.org (Paulo Nuin)
Date: Tue, 1 Sep 2009 23:38:51 -0400
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
	<365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>
	<800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>
Message-ID: <9C71C598-20E1-424A-8476-30CB661F450F@genedrift.org>

So, ClustalW trees won't even scratch the surface. You should go the
Phylip/EMBOSS route; I don't see another alternative while using BioPython.
Otherwise you will have to create your own wrappers for some command line
applications, like MrBayes, TreePuzzle, etc.

Paulo

From winda002 at student.otago.ac.nz Wed Sep 2 00:35:21 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 16:35:21 +1200
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
Message-ID: <4A9DF609.5050906@student.otago.ac.nz>

> Thank you everyone for your answers.
> I was about to give up and run my own wrappers over phylip
> (emboss for my ubuntu is too old : /)

It shouldn't matter how old your emboss is - I think there are phylip
versions for every release of EMBOSS here:
ftp://emboss.open-bio.org/pub/EMBOSS/old/

EMBOSS doesn't come with the phylip programs by default; you need to
download them independently. There are no binaries for ubuntu but they're
very easy to compile (if I can do it...) - you do need the EMBOSS sources
though.

> when i just found out that clustal can create phylogenetic trees too.
> On commandline, with the options 4,1 and 4, i just made a tree here,
> from a .aln file generated with clustalw. Does anyone dislike this
> approach? Seems like easy/fast/efficient enough for me.

As Paulo says, this might be easy/fast/efficient but there is no promise
it will be accurate/powerful/useful ;). If you want to do it with existing
biopython tools then I think phylip is probably going to be the way to go.

You might also want to look at PyCogent, which has controllers for some
other phylogeny programs:
http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html

(I have no experience using those, so can't tell much about them)

Cheers,
David

From italo.maia at gmail.com Wed Sep 2 18:13:45 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:13:45 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <4A9DF609.5050906@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
	<4A9DF609.5050906@student.otago.ac.nz>
Message-ID: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>

I'll try PyCogent too but, for now, I'll stick with clustalw since it is
the fastest way for me to get it done.
Anyway, the "clustering" option for ClustalwCommandline seems to be buggy.
In the _Option parameter, equate should be set to True in order for it
to work.

2009/9/2 David Winter

>> Thank you everyone for your answers. I was about to give up and run my own
>> wrappers over phylip (emboss for my ubuntu is too old : /)
>
> It shouldn't matter how old your emboss is - I think there are phylip
> versions for every release of EMBOSS here:
> ftp://emboss.open-bio.org/pub/EMBOSS/old/
>
> EMBOSS doesn't come with the phylip programs by default, you need to
> download them independently. There are no binaries for ubuntu but they're
> very easy to compile (if I can do it...) - you do need the EMBOSS sources
> though.
>
>> when i just found out that clustal can create phylogenetic trees too. On
>> commandline, with the options 4,1 and 4, i just made a tree here, from a
>> .aln file generated with clustalw. Does anyone dislike this approach? Seems
>> like easy/fast/efficient enough for me.
>
> As Paulo says this might be easy/fast/efficient but there is no promise
> it will be accurate/powerful/useful ;). If you want to do it with existing
> biopython tools then I think phylip is probably going to be the way to go.
>
> You might also want to look at PyCogent which has controllers for some
> other phylogeny programs:
> http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html
>
> (I have no experience using those, so can't tell much about them)
>
> Cheers,
> David

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Wed Sep 2 18:15:04 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:15:04 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
    <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
    <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
    <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
    <4A9DF609.5050906@student.otago.ac.nz>
    <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
Message-ID: <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com>

PS: is there a workaround for this "wannabe" bug?

2009/9/2 Italo Maia

> I'll try PyCogent too but, for now, i'll leave it with clustalw for being
> the fastest way for me to get it done. Anyway, the "clustering" option for
> ClustalwCommandline seems to be buggy. In the _Option parameter, equate
> should be set to True, in order for it to work.
>
> 2009/9/2 David Winter
>
>>> Thank you everyone for your answers. I was about to give up and run my
>>> own wrappers over phylip (emboss for my ubuntu is too old : /)
>>
>> It shouldn't matter how old your emboss is - I think there are phylip
>> versions for every release of EMBOSS here:
>> ftp://emboss.open-bio.org/pub/EMBOSS/old/
>>
>> EMBOSS doesn't come with the phylip programs by default, you need to
>> download them independently. There are no binaries for ubuntu but they're
>> very easy to compile (if I can do it...) - you do need the EMBOSS sources
>> though.
>>
>>> when i just found out that clustal can create phylogenetic trees too. On
>>> commandline, with the options 4,1 and 4, i just made a tree here, from a
>>> .aln file generated with clustalw.
>>> Does anyone dislike this approach? Seems
>>> like easy/fast/efficient enough for me.
>>
>> As Paulo says this might be easy/fast/efficient but there is no promise
>> it will be accurate/powerful/useful ;). If you want to do it with existing
>> biopython tools then I think phylip is probably going to be the way to go.
>>
>> You might also want to look at PyCogent which has controllers for some
>> other phylogeny programs:
>> http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html
>>
>> (I have no experience using those, so can't tell much about them)
>>
>> Cheers,
>> David

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Wed Sep 2 18:51:49 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:51:49 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
    <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
    <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
    <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
    <4A9DF609.5050906@student.otago.ac.nz>
    <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
    <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com>
Message-ID: <800166920909021551r4aefa301ib6498997f4f45b7a@mail.gmail.com>

Solved with a terribly ugly for loop:

    # Force "-clustering" to be written as "-clustering=VALUE"
    for par in self.cline.parameters:
        names = getattr(par, 'names', None)
        if names is not None:
            if "-clustering" in names:
                par.equate = True

And clustalw can generate trees in PHYLIP format, which is good news for me!
Thank you guys! When I get my hands on PyCogent, I'll post something. And,
by the way, EMBOSS didn't seem to work on my Ubuntu, even though the
package is in the repository. If I type "emboss" in the console, it won't
work. @.o

2009/9/2 Italo Maia

> ps: is there a work around, for this "wannabe" bug?
>
> 2009/9/2 Italo Maia
>
>> I'll try PyCogent too but, for now, i'll leave it with clustalw for being
>> the fastest way for me to get it done. Anyway, the "clustering" option for
>> ClustalwCommandline seems to be buggy. In the _Option parameter, equate
>> should be set to True, in order for it to work.
>>
>> 2009/9/2 David Winter
>>
>>>> Thank you everyone for your answers. I was about to give up and run my
>>>> own wrappers over phylip (emboss for my ubuntu is too old : /)
>>>
>>> It shouldn't matter how old your emboss is - I think there are phylip
>>> versions for every release of EMBOSS here:
>>> ftp://emboss.open-bio.org/pub/EMBOSS/old/
>>>
>>> EMBOSS doesn't come with the phylip programs by default, you need to
>>> download them independently. There are no binaries for ubuntu but they're
>>> very easy to compile (if I can do it...)
>>> - you do need the EMBOSS sources
>>> though.
>>>
>>>> when i just found out that clustal can create phylogenetic trees too. On
>>>> commandline, with the options 4,1 and 4, i just made a tree here, from a
>>>> .aln file generated with clustalw. Does anyone dislike this approach? Seems
>>>> like easy/fast/efficient enough for me.
>>>
>>> As Paulo says this might be easy/fast/efficient but there is no promise
>>> it will be accurate/powerful/useful ;). If you want to do it with existing
>>> biopython tools then I think phylip is probably going to be the way to go.
>>>
>>> You might also want to look at PyCogent which has controllers for some
>>> other phylogeny programs:
>>> http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html
>>>
>>> (I have no experience using those, so can't tell much about them)
>>>
>>> Cheers,
>>> David

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Wed Sep 2 18:54:14 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:54:14 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
Message-ID: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>

Is there any recipe to plot a PHYLIP phylogenetic tree?

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Thu Sep 3 05:32:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 10:32:16 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
    <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
    <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
    <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
    <4A9DF609.5050906@student.otago.ac.nz>
    <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
Message-ID: <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>

On Wed, Sep 2, 2009 at 11:13 PM, Italo Maia wrote:
> I'll try PyCogent too but, for now, i'll leave it with clustalw for being
> the fastest way for me to get it done. Anyway, the "clustering" option for
> ClustalwCommandline seems to be buggy. In the _Option parameter, equate
> should be set to True, in order for it to work.
Could you show us a command line string that works, and a command line
string that doesn't?

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 3 05:34:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 10:34:16 +0100
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
Message-ID: <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>

On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
> Is there any recipe to plot a phylip phylogenetic tree?

If you mean as a simple text representation, then sort of. Try the
print methods of the Bio.Nexus tree objects.

Do you mean drawing a nice diagram (e.g. as a PDF or PNG file)? If
so, no, not at the moment. Some of the Google Summer of Code
project work included linking to NetworkX for graphics... this has
not yet been merged into Biopython.

Peter

From italo.maia at gmail.com Thu Sep 3 08:20:43 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Thu, 3 Sep 2009 09:20:43 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
    <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
Message-ID: <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>

Yep, I meant the pretty PNG way. I was thinking of something with
python-imaging, maybe. Anyway, thanks Peter. If it does not exist, I won't
waste time looking for it.

2009/9/3 Peter

> On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
>> Is there any recipe to plot a phylip phylogenetic tree?
>
> If you mean as a simple text representation, then sort of. Try the
> Bio.Nexus tree objects print methods.
>
> Do you mean draw a nice diagram (e.g. as a PDF or PNG file)? If
> so, no, not at the moment.
> Some of the Google Summer of Code
> project work included linking to NetworkX for graphics... this has
> not yet been merged into Biopython.
>
> Peter

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Thu Sep 3 08:22:17 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Thu, 3 Sep 2009 09:22:17 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
    <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
    <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
    <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
    <4A9DF609.5050906@student.otago.ac.nz>
    <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
    <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
Message-ID: <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>

Not working:
/usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering NJ

Working:
/usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering=NJ

2009/9/3 Peter

> On Wed, Sep 2, 2009 at 11:13 PM, Italo Maia wrote:
>> I'll try PyCogent too but, for now, i'll leave it with clustalw for being
>> the fastest way for me to get it done. Anyway, the "clustering" option for
>> ClustalwCommandline seems to be buggy. In the _Option parameter, equate
>> should be set to True, in order for it to work.
>
> Could you show us a command line string that works, and a command line
> string that doesn't?
>
> Peter

--
"A arrogância é
a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Thu Sep 3 08:38:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 13:38:47 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
    <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
    <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
    <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
    <4A9DF609.5050906@student.otago.ac.nz>
    <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
    <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
    <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>
Message-ID: <320fb6e00909030538m1daa8279x5254d0dac6f00832@mail.gmail.com>

On Thu, Sep 3, 2009 at 1:22 PM, Italo Maia wrote:
> Not working
> /usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP
> -tossgaps -clustering NJ
>
> Working
> /usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP
> -tossgaps -clustering=NJ

OK, yes. Fixed in CVS. I also made the boot labels argument use an equals
sign (by eye all the rest looked fine). Could you confirm that works?
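
The only difference between the two command lines quoted above is how the -clustering option is joined to its value. A minimal sketch of that difference, assuming a hypothetical format_option helper with an "equate" flag like the one discussed earlier in the thread (this is for illustration only and is not Biopython's actual code):

```python
# Hypothetical helper, for illustration only (not Biopython's implementation).
def format_option(name, value, equate=True):
    """Join an option name and its value with '=' (equate) or a space."""
    separator = "=" if equate else " "
    return name + separator + value


# Options shared by both command lines in the thread (file path shortened).
base = ["clustalw", "-infile=coi.dnd", "-tree", "-outputtree=PHYLIP", "-tossgaps"]

# equate=False reproduces the form reported as not working.
not_working = " ".join(base + [format_option("-clustering", "NJ", equate=False)])
# equate=True reproduces the form reported as working.
working = " ".join(base + [format_option("-clustering", "NJ", equate=True)])

print(not_working)
print(working)
```

Printing both strings side by side makes it easy to compare them with the "Not working" and "Working" lines reported above.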
Thanks

Peter

From jjkk73 at gmail.com Thu Sep 3 12:06:19 2009
From: jjkk73 at gmail.com (jorma kala)
Date: Thu, 3 Sep 2009 17:06:19 +0100
Subject: [Biopython] Question about efetch output format
Message-ID:

Hi,
I'm trying to retrieve a record from the protein database (I found the
record id by running Entrez.esearch):

    handle = Entrez.efetch(db="protein", id='483329',mode='xml')
    print(handle.read())

Although I specify xml mode, the result comes in a quite confusing format
using braces (I've pasted a snippet at the end of the email).

Do you know what I should do to get it in XML?
Many thanks

+++++++ Output

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Aspergillus flavus aflatoxin (aflR) gene, and translated products" ,
    source {
      org {
        taxname "Aspergillus flavus" ,
        db {
          {
            db "taxon" ,
            tag
              id 5059 } } ,
        orgname {
          name
            binomial {

From biopython at maubp.freeserve.co.uk Thu Sep 3 12:38:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 17:38:57 +0100
Subject: [Biopython] Question about efetch output format
In-Reply-To:
References:
Message-ID: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>

On Thu, Sep 3, 2009 at 5:06 PM, jorma kala wrote:
> Hi,
> I'm trying to retrieve a record from protein database (I found the record id
> by running Entrez.esearch)
>
>     handle = Entrez.efetch(db="protein", id='483329',mode='xml')
>     print handle.read()
> Although I specify xml mode, the result comes in a quite confusing format
> using braces (I've pasted a snippet at the end of the email)
>
> Do you know what I should do to get it in xml?
> Many thanks

You've got the default ASN.1 output. You need to use "retmode", not "mode":

    from Bio import Entrez
    handle = Entrez.efetch(db="protein", id='483329',retmode='xml')
    print(handle.read())

I thought both the Biopython documentation and the NCBI documentation
were clear on this - maybe you found a typo? Please let us know if there is
an error in any of the documentation or examples.
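
A minimal sketch of why the misspelled keyword went unnoticed, assuming that extra keyword arguments to Entrez.efetch are simply passed through as URL query parameters. The efetch_query helper below is hypothetical, written only for this illustration; it makes no network request and is not part of Biopython:

```python
from urllib.parse import urlencode


# Hypothetical helper, for illustration only (not part of Biopython).
def efetch_query(**params):
    """Return the URL-encoded query string for the given parameters."""
    return urlencode(sorted(params.items()))


# "mode" is not the parameter that selects the return format, so the
# request would fall back to the default output.
wrong = efetch_query(db="protein", id="483329", mode="xml")
# "retmode" is the parameter named in the reply above.
right = efetch_query(db="protein", id="483329", retmode="xml")

print(wrong)
print(right)
```

Comparing the two query strings shows that the first never mentions retmode at all, which would be consistent with getting the default ASN.1 output back.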
Thanks

Peter

From bartomas at gmail.com Fri Sep 4 04:21:04 2009
From: bartomas at gmail.com (bar tomas)
Date: Fri, 4 Sep 2009 09:21:04 +0100
Subject: [Biopython] Question about efetch output format
In-Reply-To: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>
References: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>
Message-ID:

Many thanks. My mistake, I must've copied it badly from the doc.

On Thu, Sep 3, 2009 at 5:38 PM, Peter wrote:

> On Thu, Sep 3, 2009 at 5:06 PM, jorma kala wrote:
>> Hi,
>> I'm trying to retrieve a record from protein database (I found the record
>> id by running Entrez.esearch)
>>
>>     handle = Entrez.efetch(db="protein", id='483329',mode='xml')
>>     print handle.read()
>> Although I specify xml mode, the result comes in a quite confusing format
>> using braces (I've pasted a snippet at the end of the email)
>>
>> Do you know what I should do to get it in xml?
>> Many thanks
>
> You've got the default ASN.1 output. You need to use "retmode" not "mode",
>
>     from Bio import Entrez
>     handle = Entrez.efetch(db="protein", id='483329',retmode='xml')
>     print handle.read()
>
> I thought both the Biopython documentation and the NCBI documentation
> was clear on this - maybe you found a typo? Please let us know if there is
> an error in any of the documentation or examples.
>
> Thanks
>
> Peter

From bav853 at bham.ac.uk Fri Sep 4 08:38:18 2009
From: bav853 at bham.ac.uk (Bhima Auro van der Molen)
Date: Fri, 04 Sep 2009 13:38:18 +0100
Subject: [Biopython] Residue Depth module.
Message-ID: <4AA10A3A.7010303@bham.ac.uk>

Hi everyone

I have been trying to calculate residue depths in PDB files using the
hsexpo.py script, which calls the "msms", "pdb_to_xyz" and "pdb_to_xyzn"
programs and the ResidueDepth.py module, which on my system is located at:

/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py

I opened up the ResidueDepth.py file and, while debugging it, found that
the line

    from AbstractPropertyMap import AbstractPropertyMap

raises an error. When I altered it to

    from Bio.PDB.AbstractPropertyMap import AbstractPropertyMap

it seemed to fix that specific problem.

However, the consistent problem I have each time I try to run the
hsexpo.py script with the RD or RDa option is the following error
message:

Traceback (most recent call last):
  File "hsexpo.py", line 101, in <module>
    d=ResidueDepth(m, pdbfile)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 134, in __init__
    surface=get_surface(pdb_file)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 85, in get_surface
    surface=_read_vertex_array(surface_file)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 53, in _read_vertex_array
    fp=open(filename, "r")
IOError: [Errno 2] No such file or directory: '/tmp/tmpy8yHlH.vert'

When I look in the /tmp folder I can see a number of tmp* files, but not
the one(s) that it is looking for specifically.

To ensure that all the required system files were in the right place, i.e.
a correct installation of msms with pdb_to_xyz etc., I moved the binaries
to /usr/bin and tested them from my home directory. These worked all
right.

I should clarify that even though I am running Python 2.6, I encountered
the same problem in 2.5.2 and 2.5.4 as well.

If anyone can help me figure out why this is not working I'd be grateful.
Thanks

Bhima

From italo.maia at gmail.com Fri Sep 4 10:26:10 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Fri, 4 Sep 2009 11:26:10 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
    <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
    <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
Message-ID: <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>

Does anyone have a link or doc explaining the PHYLIP tree format? I think
I'll try making some simple plotting for it.

2009/9/3 Italo Maia

> Yep, i meant the pretty png way. I was thinking of something with
> python-imaging, maybe. Anyway, thanks Peter. If it does not exist, i won't
> waste time looking for it.
>
> 2009/9/3 Peter
>
>> On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
>>> Is there any recipe to plot a phylip phylogenetic tree?
>>
>> If you mean as a simple text representation, then sort of. Try the
>> Bio.Nexus tree objects print methods.
>>
>> Do you mean draw a nice diagram (e.g. as a PDF or PNG file)? If
>> so, no, not at the moment. Some of the Google Summer of Code
>> project work included linking to NetworkX for graphics... this has
>> not yet been merged into Biopython.
>>
>> Peter

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Fri Sep 4 10:29:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 15:29:57 +0100
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
    <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
    <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
    <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
Message-ID: <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>

On Fri, Sep 4, 2009 at 3:26 PM, Italo Maia wrote:
> Does anyone have a link or doc explaining the phylip tree format? I think
> i'll try making some simple plotting for it.

Do you mean the Newick tree format?
http://evolution.genetics.washington.edu/phylip/newicktree.html

Peter

P.S. There is a short example using Bio.Nexus.Trees in the current
tutorial.

From italo.maia at gmail.com Sat Sep 5 00:41:05 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Sat, 5 Sep 2009 01:41:05 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
    <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
    <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
    <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
    <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>
Message-ID: <800166920909042141y7a9790f1ka8f5c6e846296d0@mail.gmail.com>

Yep, that's the one! I just found a library for parsing these Newick trees.
Found it too late, actually. I just made my own image generator for
Newick trees.
The output kind of looks like *treeview* trees. Big thanks Peter = ]

A sample output can be viewed here:
http://img215.imageshack.us/img215/954/outk.png

PS: drawing trees is a pain in the....gee!

2009/9/4 Peter

> On Fri, Sep 4, 2009 at 3:26 PM, Italo Maia wrote:
>> Does anyone have a link or doc explaining the phylip tree format? I think
>> i'll try making some simple plotting for it.
>
> Do you mean the Newick tree format?
> http://evolution.genetics.washington.edu/phylip/newicktree.html
>
> Peter
>
> P.S. There is a short example using Bio.Nexus.Trees in the current
> tutorial.

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From aduran at fhcrc.org Sat Sep 5 14:13:21 2009
From: aduran at fhcrc.org (Duran, Alysha M)
Date: Sat, 5 Sep 2009 11:13:21 -0700
Subject: [Biopython] Fred Hutchinson Cancer Research Center - Systems Analyst/Programmer III/IV (AD-22564)
References: <455E7DBAAEF0814A9D51D640B8DEDB90010DA4BF1B@ISIS.fhcrc.org>
    <040346FA7309BD439C327F97D4C4D69B05F54166@ISIS.fhcrc.org>
Message-ID: <040346FA7309BD439C327F97D4C4D69B05F5416D@ISIS.fhcrc.org>

Systems Analyst/Programmer III/IV (AD-22564)

About Us

Fred Hutchinson Cancer Research Center, home of three Nobel laureates, is
an independent, nonprofit research institution dedicated to the
development and advancement of biomedical research to eliminate cancer and
other potentially fatal diseases. Recognized internationally for its
pioneering work in bone-marrow transplantation, the Center's four
scientific divisions collaborate to form a unique environment for
conducting basic and applied science.
The Hutchinson Center, in collaboration with its clinical and research
partners, the University of Washington and Children's Hospital and Regional
Medical Center, is the only National Cancer Institute-designated
comprehensive cancer center in the Pacific Northwest. Join us and make a
difference.

Responsibilities

We are seeking an experienced Programmer/Systems Analyst. The person will
join the Cancer Prevention Program providing support for the data analysis
of a large-scale genome-wide association study. This genome-wide scan is an
NCI-funded multiple-year project with the goal to identify susceptibility
genes associated with cancer risk and to investigate interaction between
genes and environmental factors. The Programmer/Systems Analyst will work
in a multidisciplinary research team. He/she will provide programming
support for management of high dimensional data sets, implement quality
control procedures, apply various software applications, and assist with
running statistical analysis on high performance compute clusters.
Furthermore, the person will prepare written documentation for the data
management and the results of data analysis.

Major Duties

In support of the research projects, the incumbent may perform one or more
of the following tasks in addition to other duties as assigned:

1. Participate in design and development of the data management system for
   studies.
2. Design, test, document, and maintain databases.
3. Develop, document, and maintain data cleaning procedures.
4. Implement and maintain standard datasets and reports.
5. Perform study-specific reporting.
6. Implement various software tools, such as PLINK, BeadStudio,
   GenomeStudio, or software packages in R.
7. Develop and/or maintain user interface programs and software tools.
8. Assist with the statistical analysis on high performance compute cluster.
Qualifications

The ideal candidate will possess the following qualifications: Bachelor's
degree in computer science or related field and two years' experience as a
Systems Analyst/Programmer III or equivalent; or one year as a Systems
Analyst/Programmer III or equivalent and a Master's Degree in a job-related
area. Experience with LINUX/Unix and R is required. Knowledge of C,
FORTRAN, PERL, PYTHON, Rmpi or mpi is desirable. Experience with management
of complex, high-dimensional genotype data is desirable. Demonstrated
ability to communicate effectively as part of a team.

Recommended Qualifications

Proficient use of database management and statistical software. Knowledge
of and experience in programming support of statistical analysis and
methods development, or other scientific research are desired. Experience
in writing program, system, and/or database documentation.

To Apply

For more information about the position and to apply, please visit the Fred
Hutchinson Cancer Research Center website at www.fhcrc.org and search for
Job# AD-22564.

Alysha M. Duran
Human Resources Specialist/Recruiter
Fred Hutchinson Cancer Research Center
Seattle Cancer Care Alliance
Phone: (206) 667-2720
Fax: (206) 667-4051
Email: aduran at fhcrc.org

Click here to search for open positions: www.fhcrc.org
Follow new job openings on Twitter: http://twitter.com/FHCRC_Jobs

From mitlox at op.pl Sat Sep 5 20:40:48 2009
From: mitlox at op.pl (xyz)
Date: Sun, 06 Sep 2009 10:40:48 +1000
Subject: [Biopython] DSSP and secondary structure
Message-ID: <4AA30510.4020105@op.pl>

Hello,
I have a solved structure (1E8W) with a ligand and I would like to know
which secondary structure elements are within 16A (cut-off) of the ligand.
I am not interested in coils.

From looking at the PDB file, the ligand is the last residue in chain A,
named QUE.

I wrote a little script (see below) in order to test DSSP and it works.

    from Bio.PDB import PDBParser, DSSP

    pdb_code = "1E8W"
    pdb_filename = "1E8W.pdb"

    # Parse the structure and take its first model
    structure = PDBParser().get_structure(pdb_code, pdb_filename)
    model = structure[0]
    # Run the external DSSP binary (here ./dsspcmbi) on the PDB file
    dssp = DSSP(model, pdb_filename, "./dsspcmbi")

    for r in dssp:
        print(r)
    print(len(dssp))

Unfortunately, I do not know how I can find the secondary structures within
16A of the ligand.

Thank you in advance.

Best regards

From srikrishnamohan at gmail.com Sun Sep 6 01:46:33 2009
From: srikrishnamohan at gmail.com (km)
Date: Sun, 6 Sep 2009 14:46:33 +0900
Subject: [Biopython] DSSP and secondary structure
In-Reply-To: <4AA30510.4020105@op.pl>
References: <4AA30510.4020105@op.pl>
Message-ID:

use pymol
KM

On Sun, Sep 6, 2009 at 9:40 AM, xyz wrote:

> Hello,
> I have a solved structure (1E8W) with a ligand and I would like to know
> which secondary structure are within 16A (cut off) of the ligand. I am no
> interested in coils.
>
> From looking at the PDB file, ligand is last residue in chain A, named QUE.
>
> I wrote a little script (see bellow please) in order to test DSSP and it
> works.
>
> from Bio.PDB import *
>
> pdb_code = "1E8W"
> pdb_filename = "1E8W.pdb"
>
> structure = PDBParser().get_structure(pdb_code, pdb_filename)
> model=structure[0]
> dssp=DSSP(model, pdb_filename, "./dsspcmbi")
>
> for r in dssp:
>     print r
> print len(dssp)
>
> Unfortunately, I do not know how can I find the secondary structures around
> 16A of the ligand.
>
> Thank you in advance.
> > Best regards > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Sun Sep 6 08:05:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 6 Sep 2009 13:05:19 +0100 Subject: [Biopython] DSSP and secondary structure In-Reply-To: <4AA30510.4020105@op.pl> References: <4AA30510.4020105@op.pl> Message-ID: <320fb6e00909060505r4befa189g4c9ce617dfd35511@mail.gmail.com> On Sun, Sep 6, 2009 at 1:40 AM, xyz wrote: > Hello, > I have a solved structure (1E8W) with a ligand and I would like to know > which secondary structure are within 16A (cut off) of the ligand. I am no > interested in coils. > > From looking at the PDB file, ligand is last residue in chain A, named QUE. > ... > Unfortunately, I do not know how can I find the secondary structures around > 16A of the ligand. There was a related thread back in March which may be helpful, http://lists.open-bio.org/pipermail/biopython/2009-March/005021.html If you are doing it just for this one protein, it may be easier to use a PDB viewer - which would also help get a feel for the structure itself. Peter From yvan.strahm at bccs.uib.no Tue Sep 8 08:01:08 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Tue, 08 Sep 2009 14:01:08 +0200 Subject: [Biopython] IPI fetching Message-ID: <4AA64784.7060506@bccs.uib.no> Hello All, I have a project with a bunch of IPI access number and need to get their fasta sequence. Now I am using the SRS web site to get these sequence. I found a old thread on Biopython-dev about IPI parseing (http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html), so I tried to use ExPASy.get_sprot_raw to get the sequence with no luck. Does anyone know how I can use the IPI accession number directly in Biopython? 
Thanks for your help, cheers, yvan From biopython at maubp.freeserve.co.uk Tue Sep 8 08:33:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 13:33:46 +0100 Subject: [Biopython] IPI fetching In-Reply-To: <4AA64784.7060506@bccs.uib.no> References: <4AA64784.7060506@bccs.uib.no> Message-ID: <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com> On Tue, Sep 8, 2009 at 1:01 PM, Yvan Strahm wrote: > Hello All, > > I have a project with a bunch of IPI access number and need to get their > fasta sequence. Now I am using the SRS web site to get these sequence. > I found a old thread on Biopython-dev about IPI parseing > (http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html), > so I tried to use ExPASy.get_sprot_raw to get the sequence with no luck. > > Does anyone know how I can use the IPI accession number directly in > Biopython? Can you give us a specific example of an IPI number and the FASTA record you want back? Peter From yvan.strahm at bccs.uib.no Tue Sep 8 08:39:40 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Tue, 08 Sep 2009 14:39:40 +0200 Subject: [Biopython] IPI fetching In-Reply-To: <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com> References: <4AA64784.7060506@bccs.uib.no> <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com> Message-ID: <4AA6508C.3030705@bccs.uib.no> Peter wrote: > On Tue, Sep 8, 2009 at 1:01 PM, Yvan Strahm wrote: >> Hello All, >> >> I have a project with a bunch of IPI access number and need to get their >> fasta sequence. Now I am using the SRS web site to get these sequence. >> I found a old thread on Biopython-dev about IPI parseing >> (http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html), >> so I tried to use ExPASy.get_sprot_raw to get the sequence with no luck. >> >> Does anyone know how I can use the IPI accession number directly in >> Biopython? 
> > Can you give us a specific example of an IPI number and the FASTA > record you want back? > > Peter IPI00109764 > ipi|IPI00109764|IPI00109764.2 DNA TOPOISOMERASE 1. MSGDHLHNDSQIEADFRLNDSHKHKDKHKDREHRHKEHKKDKDKDREKSKHSNSEHKDSEKKHKEKEKTKHKDGSSEKHKDKHKDRDKERRKEEKIRAAG DAKIKKEKENGFSSPPRIKDEPEDDGYFAPPKEDIKPLKRLRDEDDADYKPKKIKTEDIKKEKKRKSEEEEDGKLKKPKNKDKDKKVAEPDNKKKKPKKE EEQKWKWWEEERYPEGIKWKFLEHKGPVFAPPYEPLPESVKFYYDGKVMKLSPKAEEVATFFAKMLDHEYTTKEIFRKNFFKDWRKEMTNDEKNTITNLS KCDFTQMSQYFKAQSEARKQMSKEEKLKIKEENEKLLKEYGFCVMDNHRERIANFKIEPPGLFRGRGNHPKMGMLKRRIMPEDIIINCSKDAKVPSPPPG HKWKEVRHDNKVTWLVSWTENIQGSIKYIMLNPSSRIKGEKDWQKYETARRLKKCVDKIRNQYREDWKSKEMKVRQRAVALYFIDKLALRAGNEKEEGET ADTVGCCSLRVEHINLHPELDGQEYVVEFDFPGKDSIRYYNKVPVEKRVFKNLQLFMENKQPEDDLFDRLNTGILNKHLQDLMEGLTAKVFRTYNASITL QQQLKELTAPDENVPAKILSYNRANRAVAILCNHQRAPPKTFEKSMMNLQSKIDAKKDQLADARRDLKSAKADAKVMKDAKTKKVVESKKKAVQRLEEQL MKLEVQATDREENKQIALGTSKLNYLDPRITVAWCKKWGVPIEKIYNKTQREKFAWAIDMTDEDYEF This particular entry has this Uniprot accession number:Q04750 From biopython at maubp.freeserve.co.uk Tue Sep 8 09:41:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 14:41:40 +0100 Subject: [Biopython] IPI fetching In-Reply-To: <4AA6508C.3030705@bccs.uib.no> References: <4AA64784.7060506@bccs.uib.no> <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com> <4AA6508C.3030705@bccs.uib.no> Message-ID: <320fb6e00909080641m4fbd9b8duf6e8a13557f2d9e7@mail.gmail.com> On Tue, Sep 8, 2009 at 1:39 PM, Yvan Strahm wrote: >> >> Can you give us a specific example of an IPI number and the FASTA >> record you want back? > > IPI00109764 > >> ipi|IPI00109764|IPI00109764.2 DNA TOPOISOMERASE 1. > MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF > > This particular entry has this Uniprot accession number:Q04750 So if you can work out the uniprot accession number, then you can use the Bio.ExPASy.get_sprot_raw() function to download the file in the SwissProt/UniProt plain text format, e.g. 
>>> from Bio import ExPASy
>>> from Bio import SeqIO
>>> record = SeqIO.read(ExPASy.get_sprot_raw("Q04750"), "swiss")
>>> print record.format("fasta")
>Q04750 RecName: Full=DNA topoisomerase 1; EC=5.99.1.2; AltName: Full=DNA topoisomerase I;
MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF

It looks like you should be able to get the sequence directly from the EBI via the International Protein Index (IPI) identifier, IPI00109764
http://www.ebi.ac.uk/IPI/IPIhelp.html

As per that old thread you referenced, Biopython should be able to parse the "swiss" output from IPI. How about a quick and dirty URL hack to access the EBI's SRS?

>>> import urllib
>>> from Bio import SeqIO
>>> ipi = "IPI00109764"
>>> url = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[IPI-acc:%s]+-ascii" % ipi
>>> record = SeqIO.read(urllib.urlopen(url), "swiss")
>>> print record.format("fasta")
>IPI00109764 DNA TOPOISOMERASE 1.
MSGDHLHNDSQIEADFRLNDSHKHKDKHKDRE...YEF

Done? With a little tweaking to the URL you can download this directly as FASTA if you like (saves some bandwidth).

Peter

From schafer at rostlab.org Tue Sep 8 13:45:53 2009
From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=)
Date: Tue, 08 Sep 2009 13:45:53 -0400
Subject: [Biopython] Problem with pdb-file parsing
Message-ID: <4AA69851.2000605@rostlab.org>

Hi,

I don't know whether this is a bug or I did something wrong. I am parsing the pdb structure 1a2d with the following code to get the one-letter polypeptide sequence for chain A:

------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptide = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptide[0].get_sequence())

print sequence
------------------CODE----------------

This however gives me a sequence that is one amino acid shorter than expected.
The structure contains one HETATM block within the ATOM block of chain A (pos 117), which gets translated into a 'X' in the sequence. The following aminoacid at position 118 (VAL) seems to be missing. So the resulting sequence around the X is: ...VEXMK... To my understanding this should be: ...VEXVMK... Is this behaviour intended? Is it a bug? The biopython version is 1.49 (Ubuntu jaunty). Chris From kelly.oakeson at utah.edu Tue Sep 8 23:22:51 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Tue, 8 Sep 2009 21:22:51 -0600 Subject: [Biopython] Biopython and Snow Leopard Message-ID: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> Hello list, I am wondering of Biopython is compatible with Mac OS 10.6? Kelly Oakeson kelly.oakeson at utah.edu From biopython at maubp.freeserve.co.uk Wed Sep 9 05:13:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 10:13:17 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> Message-ID: <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> On Wed, Sep 9, 2009 at 4:22 AM, Kelly F Oakeson wrote: > Hello list, > I am wondering of Biopython is compatible with Mac OS 10.6? In theory yes, but I don't know if anyone has tested it yet. Apple have updated the compiler (gcc), and probably the system Python since Mac OS 10.5 Leopard. On Leopard: $ python --version Python 2.5.2 $ gcc -v Using built-in specs. Target: i686-apple-darwin9 Configured with: /var/tmp/gcc/gcc-5465~16/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --with-arch=apple --with-tune=generic --host=i686-apple-darwin9 --target=i686-apple-darwin9 Thread model: posix gcc version 4.0.1 (Apple Inc. 
build 5465) So, try it and see? If you run into problems compiling, or running the unit tests please let us know. Also in case it matters, was this an update or a clean install of Snow Leopard? Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 05:25:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 10:25:11 +0100 Subject: [Biopython] Problem with pdb-file parsing In-Reply-To: <4AA69851.2000605@rostlab.org> References: <4AA69851.2000605@rostlab.org> Message-ID: <320fb6e00909090225g686c88fdlf1b3bbaf9c10701d@mail.gmail.com> On Tue, Sep 8, 2009 at 6:45 PM, Christian Sch?fer wrote: > Hi, > > I don't know whether this is either a bug or I did something wrong. > I am parsing the pdb structure 1a2d with the following code to get > the one-letter polypeptide sequence for chain A: > > ------------------CODE---------------- > from Bio.PDB.PDBParser import PDBParser > from Bio.PDB.Polypeptide import * > > parser = PDBParser() > ppb = PPBuilder() > structure = parser.get_structure('tmp', '1a2d.pdb') > polypeptide = ppb.build_peptides(structure[0]['A']) > sequence = str(polypeptide[0].get_sequence()) > > print sequence > ------------------CODE---------------- > > This however gives me a sequence that is one aminoacid shorter than > expected. The structure contains one HETATM block within the ATOM > block of chain A (pos 117), which gets translated into a 'X' in the > sequence. The following aminoacid at position 118 (VAL) seems to be > missing. > > So the resulting sequence around the X is: > ...VEXMK... > To my understanding this should be: > ...VEXVMK... > > Is this behaviour intended? Is it a bug? The biopython version is 1.49 > (Ubuntu jaunty). I agree that does not seem to be sensible. I get the same behaviour with the latest code in the repository (so updating to Biopython 1.51 won't help here). It looks like a bug in the builder code, since the parser seems fine, and you can get the sequence in other ways, e.g. 
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
for model in structure :
    for chain in model :
        #Try adjusting depending on if you expect just the 20
        #standard amino acids etc.
        #aminos = [to_one_letter_code.get(res.resname,"X") \
        #          for res in chain if res.resname != "HOH"]
        aminos = [to_one_letter_code.get(res.resname,"X") \
                  for res in chain if "CA" in res.child_dict]
        sequence = "".join(aminos)
        print sequence

Could you file this as a bug on Bugzilla please?
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

Thanks,

Peter

From lpritc at scri.ac.uk Wed Sep 9 05:35:55 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 09 Sep 2009 10:35:55 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
Message-ID:

Hi all,

I upgraded to 10.6 from 10.5.8 on my laptop, with a Python/Biopython installation still in-place, and I haven't had any problems yet. This, of course, doesn't mean that there aren't issues in the modules I haven't used, or with compilation under 10.6.

Cheers,

L.

On 09/09/2009 04:22, "Kelly F Oakeson" wrote:

> Hello list,
> I am wondering of Biopython is compatible with Mac OS 10.6?
>
>
> Kelly Oakeson
> kelly.oakeson at utah.edu
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
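An in-place OS upgrade can leave a previously compiled Biopython sitting in site-packages, so one quick check is whether the package and its C extensions still import under the interpreter actually in use. What follows is only a minimal sketch, assuming Python 3 and the standard library; `check_importable` is a hypothetical helper name, and `Bio.clistfns` is simply the extension named in the build log elsewhere in this thread (it may not exist in other Biopython releases).

```python
import importlib


def check_importable(module_names):
    """Return {name: True/False} depending on whether each module imports."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # Covers a missing package as well as a compiled extension
            # that cannot be loaded by this interpreter.
            results[name] = False
    return results


if __name__ == "__main__":
    # Module names are assumptions taken from this thread's build log.
    for name, ok in check_importable(["Bio", "Bio.clistfns"]).items():
        print(name, "OK" if ok else "not importable")
```

A successful import does not prove every module works, but a failed one points at a build or installation problem rather than at user code.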
______________________________________________________ From kelly.oakeson at utah.edu Wed Sep 9 09:05:34 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 07:05:34 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> Message-ID: <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> Thanks Peter, I gave it a shot and it won't install for me. Here are my results: $ python --version Python 2.5.4 $ gcc -v Using built-in specs. Target: i686-apple-darwin10 Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 $python setup.py install running build running build_py creating build/lib.macosx-10.3-x86_64-2.5 creating build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/distance.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/DocSQL.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/EZRetrieve.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/File.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/FilteredReader.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/HotRand.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Index.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/kNN.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/listfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/LogisticRegression.py -> build/lib.macosx-10.3-x86_64-2.5/ Bio copying Bio/MarkovModel.py -> build/lib.macosx-10.3-x86_64-2.5/Bio 
copying Bio/mathfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/MaxEntropy.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/NaiveBayes.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/NetCatch.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/pairwise2.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/ParserSupport.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/PropertyManager.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/PubMed.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Search.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Seq.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/SeqFeature.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/SeqRecord.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/stringfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Transcribe.py -> build/lib.macosx-10.3-x86_64-2.5/Bio . . . . . copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- x86_64-2.5/Bio/PopGen/SimCoal/data running build_ext building 'Bio.clistfns' extension creating build/temp.macosx-10.3-x86_64-2.5 creating build/temp.macosx-10.3-x86_64-2.5/Bio Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ MacOSX10.4u.sdk Please check your Xcode installation gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ Python.framework/Versions/2.5/include/python2.5 -c Bio/ clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ clistfnsmodule.o cc1: error: unrecognized command line option "-Wno-long-double" cc1: error: unrecognized command line option "-Wno-long-double" lipo: can't figure out the architecture type of: /var/tmp//ccQJ2KcH.out error: command 'gcc' failed with exit status 1 On Sep 9, 2009, at 3:13 AM, Peter wrote: > On Wed, Sep 9, 2009 at 4:22 AM, Kelly F > Oakeson wrote: >> Hello list, >> I am wondering of 
Biopython is compatible with Mac OS 10.6? > > In theory yes, but I don't know if anyone has tested it yet. > Apple have updated the compiler (gcc), and probably the > system Python since Mac OS 10.5 Leopard. > > On Leopard: > > $ python --version > Python 2.5.2 > > $ gcc -v > Using built-in specs. > Target: i686-apple-darwin9 > Configured with: /var/tmp/gcc/gcc-5465~16/src/configure > --disable-checking -enable-werror --prefix=/usr --mandir=/share/man > --enable-languages=c,objc,c++,obj-c++ > --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ > --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib > --build=i686-apple-darwin9 --with-arch=apple --with-tune=generic > --host=i686-apple-darwin9 --target=i686-apple-darwin9 > Thread model: posix > gcc version 4.0.1 (Apple Inc. build 5465) > > So, try it and see? If you run into problems compiling, or running > the unit tests please let us know. Also in case it matters, was > this an update or a clean install of Snow Leopard? > > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 09:20:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 14:20:44 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> Message-ID: <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> On Wed, Sep 9, 2009 at 2:05 PM, Kelly F Oakeson wrote: > Thanks Peter, > I gave it a shot and it won't install for me. Here are my results: > > $ python --version > Python 2.5.4 > > $ gcc -v > Using built-in specs. 
> Target: i686-apple-darwin10 > Configured with: /var/tmp/gcc/gcc-5646~6/src/configure > --disable-checking --enable-werror --prefix=/usr --mandir=/share/man > --enable-languages=c,objc,c++,obj-c++ > --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ > --with-slibdir=/usr/lib --build=i686-apple-darwin10 > --with-gxx-include-dir=/include/c++/4.2.1 > --program-prefix=i686-apple-darwin10- > --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 Did the last few lines get lost in the cut and paste? > $python setup.py install > running build > running build_py > creating build/lib.macosx-10.3-x86_64-2.5 > creating build/lib.macosx-10.3-x86_64-2.5/Bio > copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio ... > . > copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- > x86_64-2.5/Bio/PopGen/SimCoal/data > running build_ext > building 'Bio.clistfns' extension > creating build/temp.macosx-10.3-x86_64-2.5 > creating build/temp.macosx-10.3-x86_64-2.5/Bio > Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ > MacOSX10.4u.sdk > Please check your Xcode installation > gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - > fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - > fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ > Python.framework/Versions/2.5/include/python2.5 -c Bio/ > clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ > clistfnsmodule.o > cc1: error: unrecognized command line option "-Wno-long-double" > cc1: error: unrecognized command line option "-Wno-long-double" > lipo: can't figure out the architecture type of: /var/tmp//ccQJ2KcH.out > error: command 'gcc' failed with exit status 1 OK, as I feared, the C code isn't compiling. Have you got XCode installed? Which version? The message "Please check your Xcode installation" is troubling. Could you also double check the gcc version (see above). Was this a clean Snow Leopard install, or an update? 
Thanks, Peter From kelly.oakeson at utah.edu Wed Sep 9 09:46:45 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 07:46:45 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> Message-ID: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> On Sep 9, 2009, at 7:20 AM, Peter wrote: > On Wed, Sep 9, 2009 at 2:05 PM, Kelly F > Oakeson wrote: >> Thanks Peter, >> I gave it a shot and it won't install for me. Here are my results: >> >> $ python --version >> Python 2.5.4 >> >> $ gcc -v >> Using built-in specs. >> Target: i686-apple-darwin10 >> Configured with: /var/tmp/gcc/gcc-5646~6/src/configure >> --disable-checking --enable-werror --prefix=/usr --mandir=/share/man >> --enable-languages=c,objc,c++,obj-c++ >> --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ >> --with-slibdir=/usr/lib --build=i686-apple-darwin10 >> --with-gxx-include-dir=/include/c++/4.2.1 >> --program-prefix=i686-apple-darwin10- >> --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 > > Did the last few lines get lost in the cut and paste? Here is how it looks exactly on my screen: $ gcc -v Using built-in specs. Target: i686-apple-darwin10 Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable- checking --enable-werror --prefix=/usr --mandir=/share/man --enable- languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/ $/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx- include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- -- host=x86_64-apple-darwin10 --target=i686-apple-darwin10 Thread model: posix gcc version 4.2.1 (Apple Inc. 
build 5646) > >> $python setup.py install >> running build >> running build_py >> creating build/lib.macosx-10.3-x86_64-2.5 >> creating build/lib.macosx-10.3-x86_64-2.5/Bio >> copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio > ... >> . >> copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- >> x86_64-2.5/Bio/PopGen/SimCoal/data >> running build_ext >> building 'Bio.clistfns' extension >> creating build/temp.macosx-10.3-x86_64-2.5 >> creating build/temp.macosx-10.3-x86_64-2.5/Bio >> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ >> MacOSX10.4u.sdk >> Please check your Xcode installation >> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - >> fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused- >> madd - >> fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ >> Python.framework/Versions/2.5/include/python2.5 -c Bio/ >> clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ >> clistfnsmodule.o >> cc1: error: unrecognized command line option "-Wno-long-double" >> cc1: error: unrecognized command line option "-Wno-long-double" >> lipo: can't figure out the architecture type of: /var/tmp// >> ccQJ2KcH.out >> error: command 'gcc' failed with exit status 1 > > OK, as I feared, the C code isn't compiling. Have you got XCode > installed? Which version? The message "Please check your Xcode > installation" is troubling. > > Could you also double check the gcc version (see above). > > Was this a clean Snow Leopard install, or an update? > > Thanks, > > Peter I installed XCode from the snow leopard install DVD, It is Version 3.2 (1610). It was an update on a MacPro that hadn't had Biopython installed before. I installed Python 2.5.4 and then tried to install Biopython. I also updated to Python 2.6 on a MacbookPro, also running 10.6 and that seemed to have broken my previous Biopython install. 
Thanks for the help,

From biopython at maubp.freeserve.co.uk Wed Sep 9 09:57:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Sep 2009 14:57:11 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu>
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu>
Message-ID: <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com>

Hi Kelly,

From some Google searching this is a general Python issue. This blog post looked helpful,

http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/

It suggests the simplest solution is you should install the optional 10.4 SDK on the system, which Snow Leopard does not install by default - it's an optional install in the developer tools installer.

This fits with the error message you had,

> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
> Please check your Xcode installation

Can you give that a go please?
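The SDK path in that error message comes from the compiler flags recorded when the Python interpreter itself was built, which distutils reuses when compiling extensions. The following is a rough sketch, assuming Python 3 and the standard library only, of how one might look up that path and see whether it exists; `sdk_from_flags` is a hypothetical helper name, and an interpreter built without `-isysroot` simply yields `None`.

```python
import os
import shlex
import sysconfig


def sdk_from_flags(flags):
    """Return the path that follows -isysroot in a flag string, or None."""
    tokens = shlex.split(flags or "")
    for i, tok in enumerate(tokens):
        if tok == "-isysroot" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # CFLAGS may be None on some platforms; sdk_from_flags tolerates that.
    sdk = sdk_from_flags(sysconfig.get_config_var("CFLAGS"))
    if sdk is None:
        print("No -isysroot recorded for this interpreter")
    else:
        print(sdk, "exists" if os.path.isdir(sdk) else "is missing")
```

If the reported directory is missing, either installing that SDK or using a Python build that targets an installed SDK would be the things to try.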
Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 09:59:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 14:59:17 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> Message-ID: <320fb6e00909090659y7895b8f2h1551460f8760244d@mail.gmail.com> On Wed, Sep 9, 2009 at 2:46 PM, Kelly F Oakeson wrote: > > I also updated to Python 2.6 on a MacbookPro, also running 10.6 and > that seemed to have broken my previous Biopython install. > Installing a new version of Python requires re-installing all the 3rd party python libraries for that new version of Python. So if you have Python 2.5 with Biopython working, and then installed Python 2.6, you would have to install Biopython again for Python 2.6 to use. In the meantime, you would still have the old Python 2.5+Biopython available. Peter From kelly.oakeson at utah.edu Wed Sep 9 11:46:02 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 09:46:02 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> Message-ID: Peter, I installed the 10.4 SDK and tried the install again. It still failed, here is the output: $ sudo python setup.py install Password: running install Numerical Python (NumPy) is not installed. 
This package is required for many Biopython features. Please install it before you install Biopython. You can install Biopython anyway, but anything dependent on NumPy will not work. If you do this, and later install NumPy, you should then re-install Biopython. You can find NumPy at http://numpy.scipy.org Do you want to continue this installation? (y/N) Y running build running build_py running build_ext building 'Bio.clistfns' extension gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ Python.framework/Versions/2.5/include/python2.5 -c Bio/ clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ clistfnsmodule.o cc1: error: unrecognized command line option "-Wno-long-double" cc1: error: unrecognized command line option "-Wno-long-double" lipo: can't figure out the architecture type of: /var/tmp//cc6kttGl.out error: command 'gcc' failed with exit status 1 $ arch i386 I do have the 64 bit kernel enabled, could that be causing the issue? Kelly Oakeson kelly.oakeson at utah.edu On Sep 9, 2009, at 7:57 AM, Peter wrote: > Hi Kelly, > > From some Google searching this is a general Python issue. This blog > post looked helpful, > > http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ > > It suggests the simplest solution is you should install the optional > 10.4 SDK on the system, which Snow Leopard does not install by default > ? it?s an optional install in the developer tools installer. > > This fits with the error message you had, > >> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ >> MacOSX10.4u.sdk >> Please check your Xcode installation > > Can you give that a go please? 
> > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 12:05:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:05:13 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> Message-ID: <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> On Wed, Sep 9, 2009 at 4:46 PM, Kelly F Oakeson wrote: > Peter, > I installed the 10.4 SDK and tried the install again. It still failed, > here is the output: > > $ sudo python setup.py install > Password: > running install > > Numerical Python (NumPy) is not installed. > > This package is required for many Biopython features. ?Please install > it before you install Biopython. You can install Biopython anyway, but > anything dependent on NumPy will not work. If you do this, and later > install NumPy, you should then re-install Biopython. > > You can find NumPy at http://numpy.scipy.org > > Do you want to continue this installation? (y/N) ?Y As an aside, from following the NumPy mailing list, it can be installed on Snow Leopard but there were some similar glitches and I don't know how easy it is. 
> running build
> running build_py
> running build_ext
> building 'Bio.clistfns' extension
> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5 -c Bio/clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/clistfnsmodule.o
> cc1: error: unrecognized command line option "-Wno-long-double"
> cc1: error: unrecognized command line option "-Wno-long-double"
> lipo: can't figure out the architecture type of: /var/tmp//cc6kttGl.out
> error: command 'gcc' failed with exit status 1
>
> $ arch
> i386
>
> I do have the 64 bit kernel enabled, could that be causing the issue?

Maybe - it looks like gcc has been called with "-arch ppc -arch i386",
which should probably be "-arch x86_64" (or left as the default?) as
you are running in full 64 bit mode (and you must have an Intel CPU,
not a PowerPC). Try:

gcc -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5 -c Bio/clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/clistfnsmodule.o

[all as one line] and see what gcc says then.
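
A related check, offered only as a sketch and not something tried in this thread: the "-arch" and "-isysroot" flags in the failing gcc call normally come from the settings Python recorded when it was built, so it may help to print those before building. The helper names below (build_flags, arch_flags) are made up for illustration. The sketch assumes the standard "sysconfig" module, which is available from Python 2.7 onwards; on Python 2.5 or 2.6 the equivalent call would be distutils.sysconfig.get_config_var.

```python
import sysconfig


def build_flags():
    """Return the compiler settings Python recorded at build time.

    Values may be None on platforms that do not define a given setting.
    """
    names = ("CC", "CFLAGS", "LDSHARED", "MACOSX_DEPLOYMENT_TARGET")
    return dict((name, sysconfig.get_config_var(name)) for name in names)


def arch_flags(cflags):
    """Pick out the value that follows each -arch flag in a CFLAGS string."""
    parts = (cflags or "").split()
    return [parts[i + 1] for i, part in enumerate(parts[:-1]) if part == "-arch"]


if __name__ == "__main__":
    flags = build_flags()
    for name in sorted(flags):
        print("%s = %s" % (name, flags[name]))
    print("architectures requested: %s" % arch_flags(flags["CFLAGS"]))
```

If the CFLAGS entry mentions MacOSX10.4u.sdk or "-arch ppc", that would suggest the flags are inherited from the Python build rather than chosen by Biopython's setup.py.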
Peter

From kelly.oakeson at utah.edu Wed Sep 9 12:16:49 2009
From: kelly.oakeson at utah.edu (Kelly F Oakeson)
Date: Wed, 9 Sep 2009 10:16:49 -0600
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com>
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com>
Message-ID:

Peter,
It looks like I may have solved it! Taking the advice from the post on
http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/
I changed the makefile in
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/config/Makefile
replacing all of the occurrences of 10.4u with 10.6.
I then ran
$ python setup.py build
$ python setup.py test
$ sudo python setup.py install

Everything worked just fine without any gcc errors. This worked on
both my MacBook pro and MacPro both running the 64 bit kernel.
Thanks for all of the help.

On Sep 9, 2009, at 10:05 AM, Peter wrote:

> On Wed, Sep 9, 2009 at 4:46 PM, Kelly F
> Oakeson wrote:
>> Peter,
>> I installed the 10.4 SDK and tried the install again. It still
>> failed,
>> here is the output:
>>
>> $ sudo python setup.py install
>> Password:
>> running install
>>
>> Numerical Python (NumPy) is not installed.
>>
>> This package is required for many Biopython features. Please install
>> it before you install Biopython. You can install Biopython anyway,
>> but
>> anything dependent on NumPy will not work. If you do this, and later
>> install NumPy, you should then re-install Biopython.
>>
>> You can find NumPy at http://numpy.scipy.org
>>
>> Do you want to continue this installation?
(y/N) Y > > As an aside, from following the NumPy mailing list, it can be > installed > on Snow Leopard but there were some similar glitches and I don't > know how easy it is. > >> running build >> running build_py >> running build_ext >> building 'Bio.clistfns' extension >> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - >> fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused- >> madd - >> fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ >> Python.framework/Versions/2.5/include/python2.5 -c Bio/ >> clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ >> clistfnsmodule.o >> cc1: error: unrecognized command line option "-Wno-long-double" >> cc1: error: unrecognized command line option "-Wno-long-double" >> lipo: can't figure out the architecture type of: /var/tmp// >> cc6kttGl.out >> error: command 'gcc' failed with exit status 1 >> >> $ arch >> i386 >> >> I do have the 64 bit kernel enabled, could that be causing the issue? > > Maybe - it looks like gcc has been called with "-arch ppc -arch i386", > which should probably be "-arch x86_64" (or left as the default?) as > you are running in full 64 bit mode (and you must have an Intel CPU, > not a PowerPC). Try: > > gcc -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - > fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - > fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ > Python.framework/Versions/2.5/include/python2.5 -c Bio/ > clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ > clistfnsmodule.o > > [all as one line] and see what gcc says then. 
> > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 12:21:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:21:28 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> Message-ID: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> On Wed, Sep 9, 2009 at 5:16 PM, Kelly F Oakeson wrote: > Peter, > It looks like I may have solved it! Taking the advice from the post on > http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ > I changed the makefile in /Library/Frameworks/Python.framework/ > Versions/2.6/lib/python2.6/config/Makefile replacing all of the > occurrences of 10.4u with 10.6. > I then ran > $python setup.py build > $python setup.py test > $sudo python setup.py install > > Everything worked just fine without any gcc errors. This worked on > both my MacBook pro and MacPro both running the 64 bit kernel. That's great, but doesn't seem like a "proper" solution for us in the long run. Did you run the Biopython unit tests? It would be good to know they are all fine. 
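
For anyone who wants to preview the Makefile change quoted above before touching the real file, here is a minimal sketch. It only does the plain text substitution Kelly describes (10.4u replaced by 10.6) on a string. The function name retarget_sdk and the sample CFLAGS line are invented for illustration and are not taken from any actual Makefile.

```python
def retarget_sdk(makefile_text, old="10.4u", new="10.6"):
    """Return (patched_text, count) with every `old` SDK tag replaced by `new`."""
    count = makefile_text.count(old)
    return makefile_text.replace(old, new), count


if __name__ == "__main__":
    # Hypothetical line, similar in shape to the flags seen in the gcc call.
    sample = "CFLAGS= -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk\n"
    patched, count = retarget_sdk(sample)
    print("%d replacement(s)" % count)
    print(patched)
```

Reading the real Makefile into a string, running it through a function like this, and comparing the result with the original is a cautious way to see what would change. Keeping a backup copy of the file before writing anything back would be sensible.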
Peter From kelly.oakeson at utah.edu Wed Sep 9 12:26:04 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 10:26:04 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> Message-ID: <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> Peter, I only ran the setup.py test, but here are the results: $ python setup.py test running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw. test_Cluster ... skipping. If you want to use Bio.Cluster, install NumPy first and then reinstall Biopython test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL. test_Emboss ... skipping. Install EMBOSS if you want to use Bio.EMBOSS. test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... 
ok test_GASelection ... ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF. test_GenBank ... ok test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics. test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... skipping. Install NumPy if you want to use Bio.KDTree. test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... ok test_LogisticRegression ... skipping. Install NumPy if you want to use Bio.LogisticRegression. test_MEME ... ok test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper. test_MarkovModel ... skipping. Install NumPy if you want to use Bio.MarkovModel. test_Medline ... ok test_Motif ... ok test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper. test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... skipping. Install NumPy if you want to use Bio.PDB. test_PDB_unit ... skipping. Install NumPy if you want to use Bio.PDB. test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. 
Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper. test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... skipping. Install NumPy if you want to use Bio.SVDSuperimposer. test_SeqIO ... ok test_SeqIO_FastaIO ... ok test_SeqIO_QualityIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper. test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... skipping. Install NumPy if you want to use Bio.kNN. test_lowess ... skipping. Install NumPy if you want to use Bio.Statistics.lowess. test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.PhdIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... 
ok ---------------------------------------------------------------------- Ran 124 tests in 62.858 seconds On Sep 9, 2009, at 10:21 AM, Peter wrote: > On Wed, Sep 9, 2009 at 5:16 PM, Kelly F > Oakeson wrote: >> Peter, >> It looks like I may have solved it! Taking the advice from the post >> on >> http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ >> I changed the makefile in /Library/Frameworks/Python.framework/ >> Versions/2.6/lib/python2.6/config/Makefile replacing all of the >> occurrences of 10.4u with 10.6. >> I then ran >> $python setup.py build >> $python setup.py test >> $sudo python setup.py install >> >> Everything worked just fine without any gcc errors. This worked on >> both my MacBook pro and MacPro both running the 64 bit kernel. > > That's great, but doesn't seem like a "proper" solution for us in > the long run. Did you run the Biopython unit tests? It would be > good to know they are all fine. > > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 12:35:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:35:10 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> Message-ID: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> On Wed, Sep 9, 2009 at 5:26 PM, Kelly F Oakeson wrote: > Peter, > I only ran the setup.py test, but here are the results: > $ python setup.py test > running test > test_Ace ... ok > test_AlignIO ... ok > test_BioSQL ... skipping. 
Enter your settings in Tests/setup_BioSQL.py > (not important if you do not plan to use BioSQL). > test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ > setup_BioSQL.py (not important if you do not plan to use BioSQL). > test_CAPS ... ok > test_Clustalw ... ok > test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you > want to use Bio.Clustalw. > test_Cluster ... skipping. If you want to use Bio.Cluster, install > NumPy first and then reinstall Biopython > ... > Bio.Wise.psw docstring test ... ok > Bio.Motif docstring test ... ok > ---------------------------------------------------------------------- > Ran 124 tests in 62.858 seconds Excellent - a clean bill of health. Do you want to try installing NumPy now, and then reinstalling Biopython to see if everything works? ;) Peter From lpritc at scri.ac.uk Wed Sep 9 12:41:20 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 09 Sep 2009 17:41:20 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: Message-ID: On 09/09/2009 17:40, "Leighton Pritchard" wrote: > Your solution seems to be the simplest one, Kelly. The alternative appears to > be to modify the build files. By which, of course, I mean for *the developers* to modify the build files. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. 
This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From kelly.oakeson at utah.edu Wed Sep 9 12:42:10 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 10:42:10 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <828374D5-C0F9-46AC-93DD-16529B5CD2BE@utah.edu> Peter, I will give it a shot once I get back from class, ah the joys of being a grad student. Kelly O. On Sep 9, 2009, at 10:35 AM, "Peter" wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... 
ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> --- >> ------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? ;) > > Peter From nuin at genedrift.org Wed Sep 9 12:39:58 2009 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 9 Sep 2009 12:39:58 -0400 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> Just another perspective from Snow Leopard install, I didn't have any errors compiling BioPython, after I installed the latest version of Xcode (including 10.4 support). All tests were fine too. 
Paulo On 2009-09-09, at 12:35 PM, Peter wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> ---------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? 
;)
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From lpritc at scri.ac.uk Wed Sep 9 12:40:33 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 09 Sep 2009 17:40:33 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com>
Message-ID:

Hi all,

Thanks to Google, it appears that you're not the only one to run into this
problem:

http://www.reddit.com/r/Python/comments/9gpuc/snow_leopard_and_python_compatibility_issues/
http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/
http://www.allegro.cc/forums/print-thread/601429

And more details about the actual problem here:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35961

Your solution seems to be the simplest one, Kelly. The alternative appears to
be to modify the build files.

Best,

L.

On 09/09/2009 17:21, "Peter" wrote:

> On Wed, Sep 9, 2009 at 5:16 PM, Kelly F Oakeson wrote:
>> Peter,
>> It looks like I may have solved it! Taking the advice from the post on
>> http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/
>> I changed the makefile in /Library/Frameworks/Python.framework/
>> Versions/2.6/lib/python2.6/config/Makefile replacing all of the
>> occurrences of 10.4u with 10.6.
>> I then ran
>> $ python setup.py build
>> $ python setup.py test
>> $ sudo python setup.py install
>>
>> Everything worked just fine without any gcc errors. This worked on
>> both my MacBook pro and MacPro both running the 64 bit kernel.
>
> That's great, but doesn't seem like a "proper" solution for us in
> the long run. Did you run the Biopython unit tests? It would be
> good to know they are all fine.
> > Peter -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Wed Sep 9 12:49:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:49:26 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> Message-ID: <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com> On Wed, Sep 9, 2009 at 5:39 PM, Paulo Nuin wrote: > > Just another perspective from Snow Leopard install, I didn't have any errors > compiling BioPython, after I installed the latest version of Xcode > (including 10.4 support). All tests were fine too. > > Paulo That's good. I wonder what was different on your machine compared to Kelly's?
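
One way to compare two machines, sketched here as a suggestion and not something anyone in the thread ran: print a few facts about the interpreter and the OS on each and look at the differences. The function name describe_environment is made up for this example. Every call used is from the Python standard library, and struct.calcsize("P") is used for the pointer size because it also works on older interpreters.

```python
import platform
import struct
import sys


def describe_environment():
    """Collect a few facts worth comparing between two machines."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "machine": platform.machine(),
        # Pointer size of the running interpreter, in bits.
        "pointer_bits": struct.calcsize("P") * 8,
        # Mac OS X release string; empty when not running on a Mac.
        "mac_ver": platform.mac_ver()[0],
    }


if __name__ == "__main__":
    info = describe_environment()
    for key in sorted(info):
        print("%s: %s" % (key, info[key]))
```

Running this with each Python on each machine (for example the 2.5 and 2.6 installs mentioned in this thread) would show whether the interpreters differ in version or in 32 versus 64 bit mode.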
Peter

From nuin at genedrift.org Wed Sep 9 12:53:34 2009
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 9 Sep 2009 12:53:34 -0400
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com>
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com>
Message-ID: <6053E0BA-407D-4765-B8E4-EB6CCFE56D0C@genedrift.org>

Here is my gcc -v output:

Using built-in specs.
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646)

I updated from Leopard to SL, and I still have my Python 2.5 with
BioPython installed on it. On SL the "default" Python version is 2.6,
and the new installation was performed on it.

Paulo

On 2009-09-09, at 12:49 PM, Peter wrote:

> On Wed, Sep 9, 2009 at 5:39 PM, Paulo Nuin wrote:
>>
>> Just another perspective from Snow Leopard install, I didn't have
>> any errors
>> compiling BioPython, after I installed the latest version of Xcode
>> (including 10.4 support). All tests were fine too.
>>
>> Paulo
>
> That's good. I wonder what was different on your machine compared to
> Kelly's?
> > Peter From kelly.oakeson at utah.edu Wed Sep 9 13:55:01 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 11:55:01 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <7BF0A735-C28C-4AA3-831E-1FA41F2F7826@utah.edu> Peter, After installing NumPy everything tests out ok. Here are the results. $ python setup.py test running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw. test_Cluster ... ok test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL. test_Emboss ... skipping. Install EMBOSS if you want to use Bio.EMBOSS. test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... ok test_GASelection ... 
ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF. test_GenBank ... ok test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics. test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... ok test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... ok test_LogisticRegression ... ok test_MEME ... ok test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper. test_MarkovModel ... ok test_Medline ... ok test_Motif ... ok test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper. test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... ok test_PDB_unit ... ok test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper. test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... 
ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... ok test_SeqIO ... ok test_SeqIO_FastaIO ... ok test_SeqIO_QualityIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper. test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... ok test_lowess ... ok test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.PhdIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... ok Bio.Statistics.lowess docstring test ... ok ---------------------------------------------------------------------- Ran 125 tests in 63.582 seconds On Sep 9, 2009, at 10:35 AM, Peter wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. 
Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> ---------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? ;) > > Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 05:50:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 10:50:50 +0100 Subject: [Biopython] Removing deprecated module Bio.EUtils In-Reply-To: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com> References: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com> Message-ID: <320fb6e00909100250u140382efvbe42a41aedfc3fee@mail.gmail.com> On Tue, Sep 1, 2009 at 6:01 PM, Peter wrote: > Hi all, > > The Bio.Entrez module has long been our prefered interface to the NCBI > Entrez Utilities. It replaced the old Bio.EUtils module which was officially > deprecated in Biopython 1.48, released a year ago (Sept 2008). > > In line with our deprecation policy, I plan to remove Bio.EUtils in the next > release. > > Are there any objections? If anyone is still using the Bio.EUtils module > in old code, please feel free to ask for tips on porting this to Bio.Entrez. Bio.EUtils has been removed now, and will not be included in future releases of Biopython. 
Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 05:51:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 10:51:37 +0100 Subject: [Biopython] Removing deprecated BLAST HTML parser In-Reply-To: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com> References: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com> Message-ID: <320fb6e00909100251m5fae2c6ax44d8d3a989f58a42@mail.gmail.com> On Tue, Sep 1, 2009 at 6:05 PM, Peter wrote: > Hi all, > > The old HTML BLAST parser in Bio.Blast.NCBIWWW was deprecated > a year ago in Biopython 1.48, and in line with our deprecation policy I > would like to remove this for the next release. > > Are there any objections? > > The preferred BLAST output for parsing (as recommended by the NCBI > themselves) is XML. We also have a parser for the plain text output, but > this is not updated very frequently and the NCBI have a history of > making minor changes to the layout and breaking parsers. The old HTML BLAST parser in Bio.Blast.NCBIWWW has been removed now, and will not be included in future releases of Biopython. Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 06:17:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 11:17:25 +0100 Subject: [Biopython] Deprecating Bio.EZRetrieve, NetCatch, FilteredReader and SGMLHandle In-Reply-To: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> References: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> Message-ID: <320fb6e00909100317s39b3b831rdeadc98f5a6995f5@mail.gmail.com> On Thu, Aug 20, 2009 at 10:51 AM, Peter wrote: > Hi all, > > The minor modules Bio.EZRetrieve, Bio.NetCatch, Bio.File.SGMLHandle, > Bio.FilteredReader were declared obsolete in Release 1.50. Are there > any objections to us deprecating them in the next release? These will be deprecated in the next release - if anyone is still using them, please speak up now. 
Thanks, Peter From pzs at dcs.gla.ac.uk Thu Sep 10 11:24:04 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 10 Sep 2009 16:24:04 +0100 Subject: [Biopython] Creating GenBank files Message-ID: <4AA91A14.10602@dcs.gla.ac.uk> I'm trying to create a GenBank file from a sequence and some annotation information. Can BioPython do this? I can't seem to find anything obvious in the documentation. If BioPython does not support this, can anybody recommend another API for doing this? I want to be able to generate genbank files from a script. Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 11:45:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 16:45:19 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <4AA91A14.10602@dcs.gla.ac.uk> References: <4AA91A14.10602@dcs.gla.ac.uk> Message-ID: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> On Thu, Sep 10, 2009 at 4:24 PM, Peter Saffrey wrote: > I'm trying to create a GenBank file from a sequence and some annotation > information. Can BioPython do this? I can't seem to find anything obvious in > the documentation. Yes, you must create a SeqRecord object with suitable SeqFeature objects, and then write it out with SeqIO in GenBank format. If all your features have trivial locations, this is pretty easy. For example, I've done this to make simple gene predictions based on ORF finding and selecting the most upstream start codon, then generating the corresponding SeqFeatures, and saving this as a GenBank file. 
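
A minimal sketch of the SeqRecord / SeqFeature / SeqIO route described above. It assumes a recent Biopython release running under Python 3, where the GenBank writer expects a "molecule_type" annotation; older releases differ in detail. The sequence, the record name and the gene name are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical sequence and identifiers, used only for illustration.
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    id="EXAMPLE01",
    name="EXAMPLE01",
    description="Hypothetical example record",
)
# Assumption: recent Biopython versions need this to write GenBank output.
record.annotations["molecule_type"] = "DNA"

# A feature with a trivial location: bases 1..24 on the forward strand
# (Python-style 0-based, end-exclusive coordinates 0..24).
cds = SeqFeature(
    FeatureLocation(0, 24, strand=+1),
    type="CDS",
    qualifiers={"gene": ["exampleGene"]},
)
record.features.append(cds)

# Write to an in-memory handle; a real script would pass an open file.
handle = StringIO()
SeqIO.write([record], handle, "genbank")
text = handle.getvalue()
print(text)
```

For real output, replace the StringIO handle with open("output.gb", "w") (the file name is again just a placeholder).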
Peter

From biopython at maubp.freeserve.co.uk Thu Sep 10 11:46:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 10 Sep 2009 16:46:22 +0100
Subject: [Biopython] Creating GenBank files
In-Reply-To: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
References: <4AA91A14.10602@dcs.gla.ac.uk> <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
Message-ID: <320fb6e00909100846v7a874fa7tacb515fc7c8d3866@mail.gmail.com>

On Thu, Sep 10, 2009 at 4:45 PM, Peter wrote:
> On Thu, Sep 10, 2009 at 4:24 PM, Peter Saffrey wrote:
>> I'm trying to create a GenBank file from a sequence and some annotation
>> information. Can BioPython do this? I can't seem to find anything obvious in
>> the documentation.
>
> Yes, you must create a SeqRecord object with suitable SeqFeature objects,
> and then write it out with SeqIO in GenBank format. If all your features have
> trivial locations, this is pretty easy.
>
> For example, I've done this to make simple gene predictions based on ORF
> finding and selecting the most upstream start codon, then generating the
> corresponding SeqFeatures, and saving this as a GenBank file.

P.S. You need Biopython 1.51 or later to be able to write out
GenBank files with features.

Peter

From pedro.al at fenhi.uh.cu Thu Sep 10 17:07:20 2009
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernandez)
Date: Thu, 10 Sep 2009 17:07:20 -0400
Subject: [Biopython] Write chains...
Message-ID: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu>

Hi all!!

How can i write custom chains of a PDB.
For example a protein have A, B, C and D chains and i want
to write only the B and C chains....

Thanks

--
Lic. Yasser Almeida Hernández
Center of Molecular Immunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana 11600, Cuba
Phone: (53-7) 271 7933, ext. 219

----------------------------------------------------------------
Correo FENHI

From michael.koeris at gmail.com Thu Sep 10 17:47:03 2009
From: michael.koeris at gmail.com (Michael S.
Koeris) Date: Thu, 10 Sep 2009 17:47:03 -0400 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly Message-ID: Hi, I am trying to parse out the nucleic acid accession numbers from an Entrez.efetch query made to the Gene database. For some reason the parser does not create a dictionary when I call handle.read on it, even when I force the query to return xml. It just generates a string. Any ideas if another parser needs to be utilized? Many thanks Mike From kellrott at gmail.com Thu Sep 10 18:06:39 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 10 Sep 2009 15:06:39 -0700 Subject: [Biopython] MetaGene and GreenGene Message-ID: I've added two modules, MetaGene and GreenGene, to my BioPython fork. (found at http://github.com/kellrott/biopython/ ) Both of these modules deal with tools/databases related to metagenomic research. The GreenGene module parses and stores the 16S RNA sample library found in the file at http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz and provides a query mechanism to lookup sequence and studies. MetaGene parses output from MetaGeneAnnotator ( http://metagene.cb.k.u-tokyo.ac.jp/metagene/ ), a gene prediction program designed for prokaryote and phage. It takes the predictions and produces SeqRecord objects of the predicted Amino Acid sequences. I would appreciate comments on additional functionality people would find usefully for these packages. I would also request these modules be considered for the mainline BioPython distribution. Kyle From biopython at maubp.freeserve.co.uk Fri Sep 11 05:33:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 10:33:09 +0100 Subject: [Biopython] Write chains... 
In-Reply-To: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu>
References: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu>
Message-ID: <320fb6e00909110233s5f020db1m5bd5438842405701@mail.gmail.com>

On Thu, Sep 10, 2009 at 10:07 PM, Yasser Almeida Hernandez wrote:
> Hi all!!
>
> How can i write custom chains of a PDB.
> For example a protein have A, B, C and D chains and i want
> to write only the B and C chains....
>
> Thanks

You can do this with Bio.PDB - see pages 5 and 6 of the Bio.PDB
documentation:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

i.e. Write a "Select class" which picks out only the chains you want
(implement this in the accept_chain method). Here is a related example
picking atoms:
http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html

Alternatively, for a more low-level solution, you could do it manually:

out = open("filtered.pdb", "w")
for line in open("input.pdb") :
    if (line.startswith("ATOM") or line.startswith("HETATM")) \
    and line[21] not in "BC" :
        continue
    out.write(line)
out.close()

[I'd recommend you try and learn how to use Bio.PDB as it is much
more powerful in the long run.]

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 11 05:34:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 11 Sep 2009 10:34:53 +0100
Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly
In-Reply-To:
References:
Message-ID: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com>

On Thu, Sep 10, 2009 at 10:47 PM, Michael S. Koeris wrote:
> Hi,
>
> I am trying to parse out the nucleic acid accession numbers from an
> Entrez.efetch query made to the Gene database. For some reason the parser
> does not create a dictionary when I call handle.read on it, even when I
> force the query to return xml. It just generates a string.

Can you give us a tiny example script? Just fill in the missing bit here ;)

from Bio import Entrez
handle = Entrez.efetch(...)
record = Entrez.read(handle) print record Peter From michael.koeris at gmail.com Fri Sep 11 07:39:31 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Fri, 11 Sep 2009 07:39:31 -0400 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> Message-ID: <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> Hi Peter, I am using handle = Entrez.efetch(db='gene', id=UID['Activin B'][0][0], rettype='xml') testDic = handle.read() UID['Activin B'][0][0] is the unique gene number I grabbed earlier through an Entrez.esearch and in this case is : 90 Many thanks Mike On Sep 11, 2009, at 5:34 AM, Peter wrote: > On Thu, Sep 10, 2009 at 10:47 PM, Michael S. Koeris > wrote: >> Hi, >> >> I am trying to parse out the nucleic acid accession numbers from an >> Entrez.efetch query made to the Gene database. For some reason the >> parser >> does not create a dictionary when I call handle.read on it, even >> when I >> force the query to return xml. It just generates a string. > > Can you give us a tiny example script? Just fill in the missing bit > here ;) > > from Bio import Entrez > handle = Entrez.efetch(...) > record = Entrez.read(handle) > print record > > Peter From biopython at maubp.freeserve.co.uk Fri Sep 11 07:48:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 12:48:13 +0100 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> Message-ID: <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> On Fri, Sep 11, 2009 at 12:39 PM, Michael S. 
Koeris wrote:
> Hi Peter,
>
> I am using
>
> handle = Entrez.efetch(db='gene', id=UID['Activin B'][0][0], rettype='xml')
> testDic = handle.read()
>
> UID['Activin B'][0][0] is the unique gene number I grabbed earlier through
> an Entrez.esearch and in this case is : 90
>
> Many thanks
> Mike

You should be using retmode="xml", not rettype="xml". See:
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html

Is there a mistake in our documentation somewhere, or was this a typo?
You are getting back an HTML error page, which of course our XML parser
doesn't like. Try:

from Bio import Entrez
record = Entrez.read(Entrez.efetch(db='gene', id='90', retmode='xml'))

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 11 09:37:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 11 Sep 2009 14:37:15 +0100
Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly
In-Reply-To: <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com>
References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com>
Message-ID: <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com>

Hi Michael,

I've CC'd this to the list.

On Fri, Sep 11, 2009 at 1:51 PM, Michael S. Koeris wrote:
>
> Yes indeed that does help - go dyslexia....

Easily done. Actually, on looking a little closer the NCBI returned
"XML presented with HTML" (full of &lt; and &gt; entities) - still quite
unsuitable for parsing, but not actually an error page as I assumed.

> what seems to happen though is that it's not a dictionary but a list
> made up of multiple dictionaries is that right?

Probably - the Bio.Entrez parser will turn the XML nested structure into
lists and dictionaries as appropriate.
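
As a rough illustration of working with such a nested result, the sketch below walks a structure of lists and dictionaries and collects every value stored under a given key. It assumes the parsed record behaves like plain Python lists and dicts; the helper name collect_values is hypothetical, and the sample data is hand-written (its key names are borrowed from the elink example elsewhere in this thread), not real NCBI output.

```python
def collect_values(node, key):
    """Return every value stored under `key` anywhere in nested lists/dicts."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the value itself may hold further matches.
            found.extend(collect_values(value, key))
    elif isinstance(node, (list, tuple)):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hand-written stand-in for a parsed record; not actual Entrez output.
sample = [
    {
        "DbFrom": "gene",
        "IdList": ["90"],
        "LinkSetDb": [
            {"Link": [{"Id": "224589811"}, {"Id": "224514625"}]},
        ],
    }
]

link_ids = collect_values(sample, "Id")
print(link_ids)
```

A generic walk like this is mainly useful for exploring an unfamiliar record; once the layout is known, direct indexing is clearer.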
Going back to your original email, you just wanted "to parse out the
nucleic acid accession numbers from an Entrez.efetch query made
to the Gene database.", so I would actually suggest you should be
using elink instead of efetch. See for example,

http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html
http://lists.open-bio.org/pipermail/biopython/2009-August/005472.html

In your case something like this:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.elink(db="nuccore", dbfrom="gene",id="90", retmode="xml"))
>>> for db in data :
...     print "Links for", db["IdList"], "from database", db["DbFrom"]
...     for link in db["LinkSetDb"][0]["Link"] : print link["Id"]
...
Links for ['90'] from database gene
224589811
224514625
194387497
190194409
187169269
187169268
164694819
157724517
157696421
89161198
88958353
74230050
50504351
22450871
21707501
18097079
15668129
2295237
402184
338218

Peter

From michael.koeris at gmail.com Fri Sep 11 09:52:40 2009
From: michael.koeris at gmail.com (Michael S. Koeris)
Date: Fri, 11 Sep 2009 09:52:40 -0400
Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly
In-Reply-To: <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com>
References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com> <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com>
Message-ID:

Hi Peter,

that helps a lot. Indeed that's what I am really looking for. So the
NCACs I get back appear to be in the order in which they appear in the
GeneDB listing (from Chromosome to the various mRNA variants).

Searching those then I can easily narrow it down further to the NM_*
type listings I really need (since I am usually looking for the full
length mRNA of all variants).

Thanks!
Mike On Sep 11, 2009, at 9:37 AM, Peter wrote: > Hi Michael, > > I've CC'd this to the list. > > On Fri, Sep 11, 2009 at 1:51 PM, Michael S. Koeris > wrote: >> >> Yes indeed that does help - go dyslexia.... > > Easily done. Actually, on looking a little closer the NCBI returned > "XML presented with HTML" (full of < and > entities) - still > quite > unsuitable for parsing, but not actually an error page as I assumed. > >> what seems to happen though is that it's not a dictionary but a list >> made up of multiple dictionaries is that right? > > Probably - the Bio.Entrez parser will turn the XML nested structure > into > lists and dictionaries as appropriate. > > Going back to your original email, you just wanted "to parse out the > nucleic acid accession numbers from an Entrez.efetch query made > to the Gene database.", so I would actually suggest you should be > using elink instead of efetch. See for example, > > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html > http://lists.open-bio.org/pipermail/biopython/2009-August/005472.html > > In your case something like this: > >>>> from Bio import Entrez >>>> data = Entrez.read(Entrez.elink(db="nuccore", >>>> dbfrom="gene",id="90", retmode="xml")) >>>> for db in data : > ... print "Links for", db["IdList"], "from database", db["DbFrom"] > ... for link in db["LinkSetDb"][0]["Link"] : print link["Id"] > ... 
> Links for ['90'] from database gene > 224589811 > 224514625 > 194387497 > 190194409 > 187169269 > 187169268 > 164694819 > 157724517 > 157696421 > 89161198 > 88958353 > 74230050 > 50504351 > 22450871 > 21707501 > 18097079 > 15668129 > 2295237 > 402184 > 338218 > > Peter From alexanderdcastro at yahoo.com Mon Sep 14 00:22:21 2009 From: alexanderdcastro at yahoo.com (Alexander Castro) Date: Sun, 13 Sep 2009 21:22:21 -0700 (PDT) Subject: [Biopython] Need help in making a back table for the CodonTable object Message-ID: <655165.18001.qm@web38408.mail.mud.yahoo.com> Hello all, I apologize if this sound elementary but how can create a back table that will return more than one codon? For example, for table="Bacterial", codons CTT, CTG and CTA code for L (Lysine). How can I back out, given "L" get CTT, CTG and CTG. The back_table method only returns one value: output = bacterial_table.back_table["L"]. Any help would be greatly appreciated. Regards, Alexander Castro From biopython at maubp.freeserve.co.uk Mon Sep 14 05:05:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Sep 2009 10:05:45 +0100 Subject: [Biopython] Need help in making a back table for the CodonTable object In-Reply-To: <655165.18001.qm@web38408.mail.mud.yahoo.com> References: <655165.18001.qm@web38408.mail.mud.yahoo.com> Message-ID: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> On Mon, Sep 14, 2009 at 5:22 AM, Alexander Castro wrote: > Hello all, I apologize if this sound elementary but how can create a > back table that will return more than one codon? For example, for > table="Bacterial", codons CTT, CTG and CTA code for L (Lysine). > How can I back out, given "L" get CTT, CTG and CTG. The > back_table method only returns one value: > output = bacterial_table.back_table["L"]. > Any help would be greatly appreciated. 
> Regards,
> Alexander Castro

I'm curious about what you are trying to do - we talked about the
general problem of back-translation, and there was no consensus and
therefore we didn't add a back translation method directly to the Seq
object.

As you are no doubt well aware, back translation is a one to many
mapping, and even with ambiguous codons it is impossible to capture
this fully with a simple string style representation of the nucleotide
sequence.

Anyway, the specific task of back translation of one amino acid
to a list of codons is well defined. How about:

from Bio.Data.CodonTable import unambiguous_dna_by_name
table = unambiguous_dna_by_name["Bacterial"].forward_table
back_table = {}
for codon, amino in table.iteritems() :
    try :
        back_table[amino].append(codon)
    except KeyError :
        back_table[amino] = [codon]
print back_table["L"]

Peter

From dalke at dalkescientific.com Mon Sep 14 08:23:27 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 14 Sep 2009 14:23:27 +0200
Subject: [Biopython] Need help in making a back table for the CodonTable object
In-Reply-To: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com>
References: <655165.18001.qm@web38408.mail.mud.yahoo.com> <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com>
Message-ID:

On Sep 14, 2009, at 11:05 AM, Peter wrote:
> Anyway, the specific task of back translation of one amino acid
> to a list of codons is well defined.
How about:
>
> from Bio.Data.CodonTable import unambiguous_dna_by_name
> table = unambiguous_dna_by_name["Bacterial"].forward_table
> back_table = {}
> for codon, amino in table.iteritems() :
>     try :
>         back_table[amino].append(codon)
>     except KeyError :
>         back_table[amino] = [codon]
> print back_table["L"]

Quick pointer to 'defaultdict' added in Python 2.5:

import collections
back_table = collections.defaultdict(list)
for codon, amino in table.iteritems() :
    back_table[amino].append(codon)
print back_table["L"]

If you really want a dictionary at the end instead of a defaultdict,

back_table = dict(back_table)

Andrew
dalke at dalkescientific.com

From alexanderdcastro at yahoo.com Mon Sep 14 10:12:41 2009
From: alexanderdcastro at yahoo.com (Alexander Castro)
Date: Mon, 14 Sep 2009 07:12:41 -0700 (PDT)
Subject: [Biopython] Need help in making a back table for the CodonTable object
In-Reply-To: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com>
Message-ID: <780019.49409.qm@web38402.mail.mud.yahoo.com>

Hello Peter, my goal was to replace one codon in a sequence with another
to perform a "silent" mutation. I did not want to do this mutation
manually and offload the work to BioPython.

Thanks for the code suggestion! I'll try it out as soon as I'm able.

Regards,
Alexander Castro

--- On Mon, 9/14/09, Peter wrote:

From: Peter
Subject: Re: [Biopython] Need help in making a back table for the CodonTable object
To: "Alexander Castro"
Cc: biopython at lists.open-bio.org
Date: Monday, September 14, 2009, 2:05 AM

On Mon, Sep 14, 2009 at 5:22 AM, Alexander Castro wrote:
> Hello all, I apologize if this sound elementary but how can create a
> back table that will return more than one codon? For example, for
> table="Bacterial", codons CTT, CTG and CTA code for L (Lysine).
> How can I back out, given "L" get CTT, CTG and CTG. The
> back_table method only returns one value:
> output = bacterial_table.back_table["L"].
> Any help would be greatly appreciated.
> Regards,
> Alexander Castro

I'm curious about what you are trying to do - we talked about the
general problem of back-translation, and there was no consensus and
therefore we didn't add a back translation method directly to the Seq
object.

As you are no doubt well aware, back translation is a one to many
mapping, and even with ambiguous codons it is impossible to capture
this fully with a simple string style representation of the nucleotide
sequence.

Anyway, the specific task of back translation of one amino acid
to a list of codons is well defined. How about:

from Bio.Data.CodonTable import unambiguous_dna_by_name
table = unambiguous_dna_by_name["Bacterial"].forward_table
back_table = {}
for codon, amino in table.iteritems() :
    try :
        back_table[amino].append(codon)
    except KeyError :
        back_table[amino] = [codon]
print back_table["L"]

Peter

From biopython at maubp.freeserve.co.uk Mon Sep 14 10:56:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 14 Sep 2009 15:56:07 +0100
Subject: [Biopython] Need help in making a back table for the CodonTable object
In-Reply-To: <780019.49409.qm@web38402.mail.mud.yahoo.com>
References: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> <780019.49409.qm@web38402.mail.mud.yahoo.com>
Message-ID: <320fb6e00909140756y12364311q6f30eb24542cc3a2@mail.gmail.com>

On Mon, Sep 14, 2009 at 3:12 PM, Alexander Castro wrote:
> Hello Peter, my goal was replace one codon in a sequence with another
> to perform a "silent" mutation. I did not want to do this mutation manually
> and offload the work to BioPython.

I see - leave most of the coding sequence as it was, but replace just
one codon in order to have a synonymous mutation. This isn't just a
simple back translation, but the reverse lookup table code I suggested
should be useful here.
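
A minimal sketch of how a reverse lookup table could drive such a synonymous swap, written for Python 3 with the standard library only. Rather than loading Bio.Data.CodonTable, it builds the table from the compact standard genetic code string; to my knowledge the Bacterial table (NCBI table 11) assigns the same amino acids to codons and differs only in its start codons, but treat that as an assumption. The function names translate and silent_mutation are hypothetical, and the example sequence is made up.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the usual compact layout (bases in TCAG order,
# third position varying fastest). Assumed to match the Bacterial table's
# amino acid assignments.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FORWARD_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse lookup: amino acid (or "*" for stop) -> list of codons.
BACK_TABLE = defaultdict(list)
for codon, amino in FORWARD_TABLE.items():
    BACK_TABLE[amino].append(codon)


def translate(dna):
    """Translate an upper-case coding sequence, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(FORWARD_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def silent_mutation(dna, codon_index, new_codon=None):
    """Replace the codon at `codon_index` (0-based) with a synonymous codon."""
    start = 3 * codon_index
    old_codon = dna[start:start + 3]
    synonyms = [c for c in BACK_TABLE[FORWARD_TABLE[old_codon]] if c != old_codon]
    if new_codon is None:
        if not synonyms:
            raise ValueError("%s has no synonymous codon" % old_codon)
        new_codon = synonyms[0]
    elif new_codon not in synonyms:
        raise ValueError("%s is not a synonym of %s" % (new_codon, old_codon))
    return dna[:start] + new_codon + dna[start + 3:]


original = "ATGCTTAAATGA"  # made-up coding sequence: Met, Leu, Lys, stop
mutated = silent_mutation(original, 1, "CTG")
print(mutated, translate(original) == translate(mutated))
```

Here "L" is leucine, which has six codons in this table, so any of the other five could be used in place of CTT. Codons for methionine and tryptophan have no synonyms, which is why the function raises an error in that case.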
Peter

From natassa_g_2000 at yahoo.com Tue Sep 15 04:37:50 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Tue, 15 Sep 2009 01:37:50 -0700 (PDT)
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <4A9DF609.5050906@student.otago.ac.nz>
Message-ID: <618545.48683.qm@web52012.mail.re2.yahoo.com>

Hallo,
I have been using SeqIO to convert Illumina (v1.3+) *sequence.txt files
(containing both quality and sequence info) to simple Fastas.
This worked well until I tried it on new reads of 75 bp. I need to have
them in a single line, so fiddling around with the code I guess I need to
change the wrap=60 argument in the FastaIO/FastaWriter class to
wrap=0 to make it work. Am I right? Are there any other bits of code
that may be affected that I may have missed?
I am sure this way of handling things is not a good one ;-) , so I was
wondering if other people have had the same problem and how this
class could be modified to address it in the future.

Thanks,
Anastasia

Anastasia Gioti
Post-Doc, Evolutionary Biology Department
Uppsala University
Norbyvägen 18D
SE-752 36 UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310

From biopython at maubp.freeserve.co.uk Tue Sep 15 06:11:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 11:11:05 +0100
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <618545.48683.qm@web52012.mail.re2.yahoo.com>
References: <4A9DF609.5050906@student.otago.ac.nz> <618545.48683.qm@web52012.mail.re2.yahoo.com>
Message-ID: <320fb6e00909150311k5e7b7f6p7abd9c16436a6ac8@mail.gmail.com>

On Tue, Sep 15, 2009 at 9:37 AM, natassa wrote:
>
> Hallo,
> I have been using SeqIO to convert Illumina (v1.3+) *sequence.txt files
> (containing both quality and sequence info) to simple Fastas.

Are you talking about Illumina FASTQ files? i.e. fastq-illumina in SeqIO?

> This worked well until I tried it on new reads of 75 bp.
I need to have
> them in a single line, so fiddling around with the code I guess I need to
> change the wrap=60 argument in the FastaIO/FastaWriter class to
> wrap=0 to make it work. Am I right? are there any other bits of code
> that may be affected that I may have missed?

Bio.SeqIO defaults to writing FASTA files with 60bp line wrapping.
You want to output 75bp FASTA files without line wrapping?
As an aside, why? Line wrapping is common and normal in
FASTA files and in fact is more widely used than non-wrapping.
If another software tool can't read line wrapped FASTA it has
a bug in my opinion.

> I am sure this way of handling things is not a good one;-) , so I was
> wondering if other people have had the same problem and how this
> class could be modified to adress it in the future.

If you don't like the Bio.SeqIO.write(...) defaults, you can use the
underlying writer which may offer some options. In the case of
FASTA output, Bio.SeqIO.FastaIO allows you to set the wrapping.

e.g.

from Bio import SeqIO
from Bio.SeqIO.FastaIO import FastaWriter
records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
handle = open("example.fasta", "w")
count = FastaWriter(handle, wrap=80).write_file(records)
handle.close()
print "Converted %i records" % count

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 15 08:19:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 13:19:25 +0100
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <922952.20788.qm@web52012.mail.re2.yahoo.com>
References: <320fb6e00909150311k5e7b7f6p7abd9c16436a6ac8@mail.gmail.com> <922952.20788.qm@web52012.mail.re2.yahoo.com>
Message-ID: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>

On Tue, Sep 15, 2009 at 12:47 PM, natassa wrote:
>
> Hallo Peter,

Hi again,

Note I CC'd the mailing list again.

>>
>> Are you talking about Illumina FASTQ files? i.e. fastq-illumina in SeqIO?
>>
>
> I suppose so, I am quite confused with the names:
> The format is (ex from a file):
> @HWI-EAS293:8:1:0:1311#0/1
> CGCCACTGTTTTTGAGGGACCGCGGGCAGCCGCGGATCCCCAACGCAAGCAGAGCTNNNNGGTTGAAATGACGCTC
> +HWI-EAS293:8:1:0:1311#0/1
> `W_a^a`T``aaa]YIRW^`_X_]XT_``a]U]VWP\^a````_Y_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> So this is fastq-illumina in SeqIO, right?

That does look like a FASTQ file, and you probably know that it came
from a Solexa/Illumina machine. However, it could be an early
Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), or a
more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED scores
("fastq-illumina" in SeqIO). From the read length (76bp) I would guess
this probably is an "fastq-illumina" file, but you should double check
this, as it does matter for poor quality reads.

>> Bio.SeqIO defaults to writing FASTA files with 60bp line wrapping.
>> You want to output 75bp FASTA files without line wrapping?
>> As an aside, why?
>
> Yes, that is what i want. I have paired-end reads in 2 separate files
> that need to combine in one single file for subsequent assembly by
> velvet program. There is a 3rd party perl script in velvet to do this,
> and if I input to this program files converted to Fasta using the
> default argument wrap=60, it does not behave correct. [...]
>
>> Line wrapping is common and normal in
>> FASTA files and in fact is more widely used than non-wrapping.
>> If another software tool can't read line wrapped FASTA it has
>> a bug in my opinion.
>
> You are most probably right, I can report the bug to the person who
> wrote the script, but until now I thought a one-line format would me
> the more common and convenient way.

Please do report this bug in the perl-script, as it will help other
people in future. If you prefer to work in Python, it should be easy
to recreate a Biopython version of the same script. Which script are
we talking about? Is it publicly available?

>> If you don't like the Bio.SeqIO.write(...)
defaults, you can use the
>> underlying writer which may offer some options. In the case of
>> FASTA output, Bio.SeqIO.FastaIO allows you to set the wrapping.
>>
>> e.g.
>>
>> from Bio import SeqIO
>> from Bio.SeqIO.FastaIO import FastaWriter
>> records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
>> handle = open("example.fasta", "w")
>> count = FastaWriter(handle, wrap=80).write_file(records)
>> handle.close()
>> print "Converted %i records" % count
>>
>
> Thanks, I will try this out, it should give the same result as directly
> changing the FastaWriter arguments, but surely is a cleaner option!

Yes, it should :)

Regards,

Peter

From ivan at biodec.com Tue Sep 15 07:30:33 2009
From: ivan at biodec.com (Ivan Rossi)
Date: Tue, 15 Sep 2009 13:30:33 +0200 (CEST)
Subject: [Biopython] Announcing Plone4bio 1.0
Message-ID:

----------------------
Plone4bio 1.0 released
----------------------

BioDec is pleased to announce the new stable release of plone4bio.

Plone4bio is a project to provide an integrated environment where it is
possible to manage and analyze biological sequences within the Plone
(http://plone.org) content management system.

What's new
----------

In this release we have added a major feature, namely a product that lets
Plone act as a web interface to a BioSQL database.

BioSQL (http://www.biosql.org/) is a generic relational model covering
sequences, features, sequence and feature annotation, a reference
taxonomy, and ontologies (or controlled vocabularies): a BioSQL database
(as you know) can be used seamlessly from BioPerl, BioPython, BioJava,
and BioRuby.

A BioSQL database is connected with a standard connection string to a
plone4bio adapter, and then the content of the BioSQL database can be
searched, using the Plone internal search engine and the plone
collections, it can be browsed, including the usual graphical mapping of
the features on the sequence, and in general handled by the standard
Plone CMS machinery.
Download and Project page ------------------------- The software is available at http://www.plone4bio.org The documentation is available at http://www.plone4bio.org/trac/wiki/Install The SVN repository is available at http://www.plone4bio.org/svn/ plone4bio mailing list: http://ackbar.biodec.com/cgi-bin/mailman/listinfo/p4b plone4bio demo site (read-only): http://p4bdemo.biodec.com Installation notes ------------------ The package to install to have a full plone4bio site running is plone4bio.buildout The plone4bio.base is just the package that defines a skeleton predictor: deriving from that it is possible to integrate any other application and visualize all the results together. biocomp.pscoils is an example predictor, encapsulating the pscoils algorithm by Fariselli et al. available at http://www.biocomp.unibo.it/ It is intended both as an example on how to integrate one's own predictor in the plone4bio framework and as a ready-to-use predictor for coiled-coils. Requirements: - python2.4 - python setup tools (Debian users: the python-setuptools package) - BioPython - PIL - BioPerl for some graphics Further information -------------------- Either the web site or the mailing list p4b at biodec.com For installation and documentation issues refer to README.txt and INSTALL.txt files from the archive or the script published on the plone4bio wiki site. plone4bio is published under the GPL. 
To contact BioDec S.r.l.:

- Email: info at biodec.com
- WWW: http://www.biodec.com
- Phone: (+39)-051-0548263
- Fax: (+39)-051-7459582

--
Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy
Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com

From biopython at maubp.freeserve.co.uk Tue Sep 15 11:08:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:08:13 +0100
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <77266.93142.qm@web52005.mail.re2.yahoo.com>
References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>
	<77266.93142.qm@web52005.mail.re2.yahoo.com>
Message-ID: <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com>

On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote:
>
>> That does look like a FASTQ file, and you probably know that it
>> came from a Solexa/Illumina machine. However, it could be an early
>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
>> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I
>> would guess this probably is an "fastq-illumina" file, but you
>> should double check this, as it does matter for poor quality reads.
>
> Because you created some doubts in my already confused mind:
> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp
> reads from pipeline v1.3 and v1.4, respectively. In the pipeline
> manuals they say that the scoring scheme is Phred. I know
> there is a lot of confusion about the terms, this is why I
> preferred to use the seqIO -I hope I did not mix the formats....

That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use
PHRED scores (with a FASTQ ASCII offset of 64), and in
Biopython we call this the "fastq-illumina" format.
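
As a rough illustration of what an ASCII offset of 64 means in practice, here is a minimal sketch in plain Python 3 (not the Python 2 used elsewhere in this thread, and with no Biopython needed). It decodes the first ten quality characters of the example read quoted earlier in the thread. The helper name decode_illumina_quality is made up for this illustration, and the sketch assumes the scores really are PHRED values with offset 64.

```python
# Assumption: qualities are PHRED scores stored with an ASCII offset of 64
# (the "fastq-illumina" variant). decode_illumina_quality is a hypothetical
# helper name used only for this illustration.
def decode_illumina_quality(quality_string, offset=64):
    """Return the integer scores encoded in one FASTQ quality line."""
    return [ord(char) - offset for char in quality_string]


# First ten quality characters of the example read quoted earlier in the thread.
scores = decode_illumina_quality("`W_a^a`T``")
print(scores)  # [32, 23, 31, 33, 30, 33, 32, 20, 32, 32]

# The long run of "B" characters at the end of that read decodes to score 2.
print(decode_illumina_quality("B"))  # [2]
```

When the file is parsed with SeqIO as "fastq-illumina", the same integers should be available as record.letter_annotations["phred_quality"].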
Peter

From natassa_g_2000 at yahoo.com Tue Sep 15 11:02:06 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Tue, 15 Sep 2009 08:02:06 -0700 (PDT)
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>
Message-ID: <77266.93142.qm@web52005.mail.re2.yahoo.com>

> That does look like a FASTQ file, and you probably know that it
> came from a Solexa/Illumina machine. However, it could be an early
> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I
> would guess this probably is an "fastq-illumina" file, but you
> should double check this, as it does matter for poor quality reads.

Because you created some doubts in my already confused mind:
The machine is indeed Solexa/Illumina. I have 55bp and 76 bp
reads from pipeline v1.3 and v1.4, respectively. In the pipeline
manuals they say that the scoring scheme is Phred. I know
there is a lot of confusion about the terms, this is why I
preferred to use the seqIO -I hope I did not mix the formats....

> If you prefer to work in Python, it should be easy to recreate
> a Biopython version of the same script. Which script are we
> talking about? Is it publicly available?

It is called shuffleSequences_fasta.pl and goes along with the
(free) distribution of velvet (Zerbino, EBI). The script is really
simple.
Thanks again, Anastasia From biopython at maubp.freeserve.co.uk Tue Sep 15 11:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:14:52 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <77266.93142.qm@web52005.mail.re2.yahoo.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> Message-ID: <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote: > >> If you prefer to work in Python, it should be easy to recreate >> a Biopython version of the same script. Which script are we >> talking about? Is it publicly available? > > It is called shuffleSequences_fasta.pl and goes along with the > (free) distribution of velvet (Zerbino, EBI). The script is really > simple. Oh right - you can see the scripts on Daniel's github repository, http://github.com/dzerbino/velvet Both scripts are very very simple minded, which means fixing the bug will actually be a big change: shuffleSequences_fasta.pl appears to assume every FASTA entry is exactly two lines (a safe assumption for short reads like 36bp from early Solexa/Illumina), but not a safe choice in general as wrapping in FASTA is normal. shuffleSequences_fastq.pl appears to assume every FASTQ entry is exactly four lines (a reasonable assumption, especially for short reads like 36bp reads from early Solexa/Illumina), but not a safe choice in general as FASTQ files can also be wrapped (even if it is discouraged). We should be able to mimic these in Biopython using SeqIO... 
Peter From biopython at maubp.freeserve.co.uk Tue Sep 15 12:43:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 17:43:04 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> Message-ID: <320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com> On Tue, Sep 15, 2009 at 4:14 PM, Peter wrote: > On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote: >> >>> If you prefer to work in Python, it should be easy to recreate >>> a Biopython version of the same script. Which script are we >>> talking about? Is it publicly available? >> >> It is called shuffleSequences_fasta.pl and goes along with the >> (free) distribution of velvet (Zerbino, EBI). The script is really >> simple. > > Oh right - you can see the scripts on Daniel's github repository, > http://github.com/dzerbino/velvet > > Both scripts are very very simple minded, which means fixing > the bug will actually be a big change: > > shuffleSequences_fasta.pl appears to assume every FASTA > entry is exactly two lines (a safe assumption for short reads > like 36bp from early Solexa/Illumina), but not a safe choice > in general as wrapping in FASTA is normal. > > shuffleSequences_fastq.pl appears to assume every FASTQ > entry is exactly four lines (a reasonable assumption, especially > for short reads like 36bp reads from early Solexa/Illumina), > but not a safe choice in general as FASTQ files can also be > wrapped (even if it is discouraged). > > We should be able to mimic these in Biopython using SeqIO... How about this? I'm using itertools.izip_longest from Python 2.6+ which should make sure the two input files have the same number of reads. 
Using itertools.izip would also work, but will silently ignore any
extra records in one file.

import itertools
from Bio import SeqIO

#Setup variables (could parse command line args here)
fileA = "SRR001666_1.fastq"
fileB = "SRR001666_2.fastq"
fileOut = "SRR001666_interleaved.fastq"
format = "fastq"

#Prepare the input (using iterators for memory efficiency)
recordsA = SeqIO.parse(open(fileA,"rU"), format)
recordsB = SeqIO.parse(open(fileB,"rU"), format)
records = itertools.izip_longest(recordsA, recordsB)

#Finally do the interleaved output:
handle = open(fileOut, "w")
count = SeqIO.write(records, handle, format)
handle.close()
print "%i records written to %s" % (count, fileOut)

Rather than using itertools, you could also write a simple generator
function to do the pairing explicitly. Assuming you are dealing with
paired end reads, it would make sense to explicitly check the IDs
match up as expected.

Peter

P.S. This uses the default output settings, so if you did this
for FASTA it would use line wrapping.

From cjfields at illinois.edu Tue Sep 15 13:43:26 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 15 Sep 2009 12:43:26 -0500
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com>
References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>
	<77266.93142.qm@web52005.mail.re2.yahoo.com>
	<320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com>
Message-ID:

On Sep 15, 2009, at 10:08 AM, Peter wrote:

> On Tue, Sep 15, 2009 at 4:02 PM, natassa
> wrote:
>>
>>> That does look like a FASTQ file, and you probably know that it
>>> came from a Solexa/Illumina machine. However, it could be an early
>>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
>>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
>>> scores ("fastq-illumina" in SeqIO).
From the read length (76bp) I >>> would guess this probably is an "fastq-illumina" file, but you >>> should double check this, as it does matter for poor quality reads. >> >> Because you created some doubts in my already confused mind: >> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp >> reads from pipeline v1.3 and v1.4, respectively. In the pipeline >> manuals they say that the scoring scheme is Phred. I know >> there is a lot of confusion about the terms, this is why I >> preferred to use the seqIO -I hope I did not mix the formats.... > > That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use > PHRED scores (with a FASTQ ASCII offset of 64), and in > Biopython we call this the "fastq-illumina" format. > > Peter I should add a very important caveat here. As I had mentioned to Peter I met with our local nextgen sequencing lead and was able to check the Illumina 1.4 pipeline manual. It indicates the ASCII offset for FASTQ is correct (64), but the quality score is calculated as (pg 122 of Genome Pipeline manual for 1.4): Q = 10*log10(p/(1-p)) Look familiar? Hint: it's not PHRED. I'm wondering if anyone else can confirm this, as it appears Illumina has switched back to using Solexa scores again. chris From cjfields at illinois.edu Tue Sep 15 15:27:57 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 15 Sep 2009 14:27:57 -0500 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> Message-ID: <16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu> On Sep 15, 2009, at 12:43 PM, Chris Fields wrote: > On Sep 15, 2009, at 10:08 AM, Peter wrote: > >> On Tue, Sep 15, 2009 at 4:02 PM, natassa >> wrote: >>> >>>> That does look like a FASTQ file, and you probably know that it >>>> came from a Solexa/Illumina machine. 
However, it could be an early >>>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), >>>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED >>>> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I >>>> would guess this probably is an "fastq-illumina" file, but you >>>> should double check this, as it does matter for poor quality reads. >>> >>> Because you created some doubts in my already confused mind: >>> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp >>> reads from pipeline v1.3 and v1.4, respectively. In the pipeline >>> manuals they say that the scoring scheme is Phred. I know >>> there is a lot of confusion about the terms, this is why I >>> preferred to use the seqIO -I hope I did not mix the formats.... >> >> That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use >> PHRED scores (with a FASTQ ASCII offset of 64), and in >> Biopython we call this the "fastq-illumina" format. >> >> Peter > > I should add a very important caveat here. As I had mentioned to > Peter I met with our local nextgen sequencing lead and was able to > check the Illumina 1.4 pipeline manual. It indicates the ASCII > offset for FASTQ is correct (64), but the quality score is > calculated as (pg 122 of Genome Pipeline manual for 1.4): > > Q = 10*log10(p/(1-p)) > > Look familiar? Hint: it's not PHRED. I'm wondering if anyone else > can confirm this, as it appears Illumina has switched back to using > Solexa scores again. > > chris Just got off the phone with Illumina customer support to double-check this, and I think it may be a false alarm, though I'm getting conflicting accounts (our local guys say it's solexa, not PHRED qual scores). According to Illumina tech support, qual scores coming off the 1.4 pipeline should be converted over to PHRED scores prior to output (what natassa mentions). 
The manual refers to the older (Solexa/Illumina 1.0) scoring b/c that
particular qual scoring option can be specified instead of PHRED.

If anyone out there using the 1.4 pipeline can confirm this that would
be most helpful, as all the Bio* toolkits and EMBOSS are updating
FASTQ parsing.

chris

From biopython at maubp.freeserve.co.uk Wed Sep 16 06:01:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 11:01:18 +0100
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu>
References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>
	<77266.93142.qm@web52005.mail.re2.yahoo.com>
	<320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com>
	<16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu>
Message-ID: <320fb6e00909160301q54d3fed7yf941f8a96581ba49@mail.gmail.com>

On Tue, Sep 15, 2009 at 8:27 PM, Chris Fields wrote:
> Just got off the phone with Illumina customer support to double-check this,
> and I think it may be a false alarm, though I'm getting conflicting accounts
> (our local guys say it's solexa, not PHRED qual scores).
>
> According to Illumina tech support, qual scores coming off the 1.4 pipeline
> should be converted over to PHRED scores prior to output (what natassa
> mentions). The manual refers to the older (Solexa/Illumina 1.0) scoring b/c
> that particular qual scoring option can be specified instead of PHRED.

I gather via seqanswers.com that there are still some (out of date)
references to the old Solexa scoring in the current Illumina pipeline
documentation, so I expect this is just a false alarm...
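
For anyone wondering how much the two scoring schemes actually differ, here is a small sketch in plain Python 3 (no Biopython required). It assumes the usual definitions: PHRED is -10*log10(e) and Solexa is -10*log10(e/(1-e)), where e is the probability that the base call is wrong. That matches the formula quoted from the manual only if p there means the probability that the call is correct, which is an assumption on my part. The function names are made up for this illustration.

```python
import math


def solexa_to_phred(solexa_q):
    """Convert a Solexa score to the equivalent PHRED score.

    Follows from e = 1 / (1 + 10**(Qsolexa / 10)) and Qphred = -10*log10(e).
    """
    return 10 * math.log10(10 ** (solexa_q / 10.0) + 1)


def phred_to_error_probability(phred_q):
    """Error probability implied by a PHRED score."""
    return 10 ** (-phred_q / 10.0)


# High scores are nearly identical under both schemes; low scores are not.
for solexa_q in (40, 20, 10, 0, -5):
    print("Solexa %3i -> PHRED %.2f" % (solexa_q, solexa_to_phred(solexa_q)))
```

This is why the distinction only really matters for poor quality reads: above about 10 the two scales agree to within half a unit, while a Solexa score of 0 corresponds to a PHRED score of about 3. I believe Bio.SeqIO.QualityIO has functions for this conversion too (phred_quality_from_solexa and solexa_quality_from_phred), but please check the documentation of your Biopython version.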
Peter

From biopython at maubp.freeserve.co.uk Wed Sep 16 06:44:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 11:44:09 +0100
Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp
In-Reply-To: <320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com>
References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com>
	<77266.93142.qm@web52005.mail.re2.yahoo.com>
	<320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com>
	<320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com>
Message-ID: <320fb6e00909160344u703b62d3t3d05b381539ebb8c@mail.gmail.com>

On Tue, Sep 15, 2009 at 5:43 PM, Peter wrote:
> Rather than using itertools, you could also write a simple generator
> function to do the pairing explicitly. Assuming you are dealing with
> paired end reads, it would make sense to explicitly check the IDs
> match up as expected.

I confess I didn't actually test that example (I don't have Python 2.6
on this machine), and I had misread the itertools.izip_longest
documentation - that won't actually work as is. Sorry :(

Instead, here is a simple interleaving using a generator function,
which I *have* tested on Python 2.5,

from Bio import SeqIO

#Setup variables (could parse command line args here)
fileA = "SRR001666_1.fastq"
fileB = "SRR001666_2.fastq"
fileOut = "SRR001666_interleaved.fastq"
format = "fastq"

#Setup the input
def interleave(iter1, iter2):
    while True:
        yield iter1.next()
        yield iter2.next()

recordsA = SeqIO.parse(open(fileA,"rU"), format)
recordsB = SeqIO.parse(open(fileB,"rU"), format)
records = interleave(recordsA, recordsB)

#Now the output
handle = open(fileOut, "w")
count = SeqIO.write(records, handle, format)
handle.close()
print "%i records written to %s" % (count, fileOut)

Note that this does not check the number of records in the two
files matches, nor does it do any explicit test on the record ids.
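
If those two checks matter to you, something along the following lines might do. This is only a sketch: it is written for Python 3 rather than the Python 2.5 used above, the names interleave_checked and pair_name are made up for this illustration, and it assumes the two mates of a pair share an identifier apart from a trailing /1 or /2 (as in the HWI-EAS293 read quoted earlier in the thread). Your files may follow a different naming convention. The demonstration uses simple stand-in objects with an id attribute, so it runs without Biopython.

```python
from collections import namedtuple
from itertools import zip_longest

_MISSING = object()  # sentinel marking that one input ran out early


def pair_name(record_id):
    """Strip a trailing /1 or /2 (assumed read-pair naming convention)."""
    if record_id.endswith(("/1", "/2")):
        return record_id[:-2]
    return record_id


def interleave_checked(iter1, iter2):
    """Yield records alternately, checking both the counts and the ids."""
    for rec1, rec2 in zip_longest(iter1, iter2, fillvalue=_MISSING):
        if rec1 is _MISSING or rec2 is _MISSING:
            raise ValueError("Inputs have different numbers of records")
        if pair_name(rec1.id) != pair_name(rec2.id):
            raise ValueError("Unpaired reads: %s and %s" % (rec1.id, rec2.id))
        yield rec1
        yield rec2


# Stand-in records for demonstration; SeqRecord objects also have an id.
Read = namedtuple("Read", "id")
forward = [Read("r1/1"), Read("r2/1")]
reverse = [Read("r1/2"), Read("r2/2")]
print([r.id for r in interleave_checked(forward, reverse)])
# ['r1/1', 'r1/2', 'r2/1', 'r2/2']
```

With real files you would pass the two SeqIO.parse(...) iterators to interleave_checked and hand the result to SeqIO.write(...) exactly as in the example above.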
Peter From peter at maubp.freeserve.co.uk Wed Sep 16 07:08:42 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 12:08:42 +0100 Subject: [Biopython] [Velvet-users] Shuffler for wrapped fastas In-Reply-To: <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> Message-ID: <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> Hi Velvet users, I've also CC'd the biopython mailing list (and added links to the velvet mailing list archive posts), as this conversation might be better off continued there. Peter wrote: http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000567.html >> Nice to see people using Biopython - although that doesn't look like the >> most efficient way to write the code (I say without timing it!). See also: >> http://lists.open-bio.org/pipermail/biopython/2009-September/005579.html >> >> And of course these Bio.SeqIO scripts would also work for FASTQ files >> if you just switch the format name. If you wanted to do more error checking, >> you could compare the record IDs (to check they are paired reads), and >> you should make sure both files have the same number of records. Tanya wrote: http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000568.html > Absolutely, that's why I said it was slow -- but easy and efficient > enough for moderate-size fasta files. If you look at the attached > script, I wrote the function shuffle_reads in a way that lets you import > it and change the format type (or you could change the command line args > to pass it in, of course, but I couldn't be bothered). Also easy enough > to check the headers match. However, fastq does require the latest > Biopython release, and not everyone has it. 
>
> Personally I don't use Biopython for shuffling fastq files because with
> millions of short reads, it's hugely inefficient (tenfold runtime
> difference) to go through creating a Biopython object for each read.
> So for short reads, I'd go with straightforward shuffling, plain text.

You are right that if performance really is a bottleneck, the Biopython
SeqIO system can be a limit due to the creation of SeqRecord objects.
However, it isn't quite as bad as you think - your code is making it
worse by using the format method on each record (which internally
calls SeqIO), instead of a single call to Bio.SeqIO.write() as
recommended in our documentation.

This version is about twice as fast as your original:

import sys
from Bio import SeqIO

def interleave(iter1, iter2):
    while True:
        yield iter1.next()
        yield iter2.next()

f1, f2 = open(sys.argv[1]), open(sys.argv[2])
outfile = open(sys.argv[3], 'w')
format = 'fasta' #or "fastq" or ...
records = interleave(SeqIO.parse(f1, format), SeqIO.parse(f2, format))
count = SeqIO.write(records, outfile, format)
outfile.close()

(It still assumes the input files are properly paired)

I agree that a lower level FASTA parser using python strings could
be another five times faster - but, as you are aware, this gives up
the generality of the SeqIO system whereby this could be used on
any supported file format.

Peter

From pzs at dcs.gla.ac.uk Wed Sep 16 12:14:20 2009
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Wed, 16 Sep 2009 17:14:20 +0100
Subject: [Biopython] Creating GenBank files
In-Reply-To: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
References: <4AA91A14.10602@dcs.gla.ac.uk>
	<320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
Message-ID: <4AB10EDC.3060908@dcs.gla.ac.uk>

Peter wrote:
> Yes, you must create a SeqRecord object with suitable SeqFeature objects,
> and then write it out with SeqIO in GenBank format. If all your features have
> trivial locations, this is pretty easy.
>

Thanks for this.
I've managed to get this to work, but encountered a few
minor issues.

I already have GenBank files created by CLC Genomics Workbench 3 but I
want to make these in a script. The CLC generated GenBank files look
like this:

LOCUS       Setd2-tagged           11750 bp    DNA     linear   UNA
FEATURES             Location/Qualifiers
     misc_feature    1..50
                     /label="Subcloning HA Upstream"
...(snip other features)

ORIGIN
        1 TTGGTGTGAG CTCTTTGTGT CTTGCCTAAG TATGTGCATC TGTCTTGTCT

...(snip sequence)


To do this in biopython, I need to create my feature thus:

sf = SeqFeature.SeqFeature(SeqFeature.FeatureLocation(0,50),
type="misc_feature", qualifiers = { "label" : [ "Subcloning HA Upstream" ]})

The issues I had were:

- In the docstring for SeqFeature, it says the attribute is "qualifier"
but it should be "qualifiers".

- My first stab at the qualifiers argument was to do

qualifiers = { "label" : "mylabel" }

but if I do that, it iterates over "mylabel" giving me one "label" for
each character! Maybe the qualifier printer should check it's being
given a list and not a string?

- I'd like to remove some of the extraneous header from the GenBank file:

DEFINITION  .
ACCESSION
VERSION
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .

Is this possible?

Sorry for the long message,

Peter

From biopython at maubp.freeserve.co.uk Wed Sep 16 12:31:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 17:31:27 +0100
Subject: [Biopython] Creating GenBank files
In-Reply-To: <4AB10EDC.3060908@dcs.gla.ac.uk>
References: <4AA91A14.10602@dcs.gla.ac.uk>
	<320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
	<4AB10EDC.3060908@dcs.gla.ac.uk>
Message-ID: <320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com>

On Wed, Sep 16, 2009 at 5:14 PM, Peter Saffrey wrote:
> Peter wrote:
>>
>> Yes, you must create a SeqRecord object with suitable SeqFeature objects,
>> and then write it out with SeqIO in GenBank format. If all your features
>> have trivial locations, this is pretty easy.
>
> Thanks for this.
I've managed to get this to work, but encountered a few
> minor issues.
>
> I already have GenBank files created by CLC Genomics Workbench 3 but I want
> to make these in a script. The CLC generated GenBank files look like this:
>
> LOCUS       Setd2-tagged           11750 bp    DNA     linear   UNA
> FEATURES             Location/Qualifiers
>      misc_feature    1..50
>                      /label="Subcloning HA Upstream"
> ...(snip other features)
>
> ORIGIN
>         1 TTGGTGTGAG CTCTTTGTGT CTTGCCTAAG TATGTGCATC TGTCTTGTCT
>
> ...(snip sequence)
>
>
> To do this in biopython, I need to create my feature thus:
>
> sf = SeqFeature.SeqFeature(SeqFeature.FeatureLocation(0,50),
> type="misc_feature", qualifiers = { "label" : [ "Subcloning HA Upstream" ]})
>
> The issues I had were:
>
> - In the docstring for SeqFeature, it says the attribute is "qualifier" but
> it should be "qualifiers".

I've fixed that in CVS - thanks for reporting it.

> - My first stab at the qualifiers argument was to do
>
> qualifiers = { "label" : "mylabel" }
>
> but if I do that, it iterates over "mylabel" giving me one "label" for each
> character! Maybe the qualifier printer should check it's being given a list
> and not a string?

As you have realised, based on what the GenBank (and other) parsers
do, the GenBank output code was expecting the qualifier values to be
a list (of strings). There are similar issues in the BioSQL code, and yes,
I agree we should cope with either here too.

> - I'd like to remove some of the extraneous header from the GenBank file:
>
> DEFINITION  .
> ACCESSION
> VERSION
> KEYWORDS    .
> SOURCE      .
>   ORGANISM  .
>             .
>
> Is this possible?
>

Why would you want to? They are there deliberately, as according to the
NCBI GenBank release notes (which pretty much is the official file
format definition) those are all mandatory keywords, so should be
present (even if with just a dot/period indicating no data).
I would regard the CLC Genomics Workbench 3 output as technically
out of spec.

>
> Sorry for the long message,
>

Not at all.

Peter C.

From biopython at maubp.freeserve.co.uk Wed Sep 16 12:44:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 17:44:15 +0100
Subject: [Biopython] Creating GenBank files
In-Reply-To: <320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com>
References: <4AA91A14.10602@dcs.gla.ac.uk>
	<320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com>
	<4AB10EDC.3060908@dcs.gla.ac.uk>
	<320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com>
Message-ID: <320fb6e00909160944r75aa2128y32005bea1b18b48@mail.gmail.com>

On Wed, Sep 16, 2009 at 5:31 PM, Peter wrote:
> On Wed, Sep 16, 2009 at 5:14 PM, Peter Saffrey wrote:
>>
>> - My first stab at the qualifiers argument was to do
>>
>> qualifiers = { "label" : "mylabel" }
>>
>> but if I do that, it iterates over "mylabel" giving me one "label" for each
>> character! Maybe the qualifier printer should check it's being given a list
>> and not a string?
>
> As you have realised, based on what the GenBank (and other) parsers
> do, the GenBank output code was expecting the qualifier values to be
> a list (of strings). There are similar issues in the BioSQL code, and yes,
> I agree we should cope with either here too.

Fixed in CVS.

Peter

From fchiu at newton.berkeley.edu Wed Sep 16 19:18:20 2009
From: fchiu at newton.berkeley.edu (Finsen Chiu)
Date: Wed, 16 Sep 2009 16:18:20 -0700
Subject: [Biopython] Permissions for Bio, BioSQL, Martel
Message-ID: <4AB1723C.2040004@newton.berkeley.edu>

Hi all,

What should the correct permission settings be for the Bio, BioSQL and
Martel directories and their subfolders/files? When I install biopython
as root, the directories and files created have permissions of either
600 or 400, which is not very helpful if I am installing as a sys admin
since other users can't import anything.
I also tried doing a umask 022 before installing and encountered write permission for non-root users, as followed: ====================================================================== ERROR: test_Clustalw_tool ---------------------------------------------------------------------- [Errno 13] Permission denied: 'Clustalw/temp horses.fasta' ====================================================================== ERROR: Ensure that we can write proper FASTA output files. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Motif.py", line 65, in test_FAoutput output_handle = open(self.FAout, "w") IOError: [Errno 13] Permission denied: 'Motif/fa.out' ====================================================================== ERROR: Ensure that we can write proper TransFac output files. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Motif.py", line 72, in test_TFoutput output_handle = open(self.TFout, "w") IOError: [Errno 13] Permission denied: 'Motif/tf.out' ====================================================================== ERROR: Ensure that we can write proper pfm output files. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Motif.py", line 79, in test_pfm_output output_handle = open(self.PFMout, "w") IOError: [Errno 13] Permission denied: 'Motif/fa.out' ====================================================================== ERROR: Reading and writing motifs to a file ---------------------------------------------------------------------- Traceback (most recent call last): File "test_NNGene.py", line 48, in test_motif output_handle = open(self.test_file, "w") IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt' ====================================================================== ERROR: Reading and writing schemas to a file. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_NNGene.py", line 81, in test_schema output_handle = open(self.test_file, "w") IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt' ====================================================================== ERROR: Reading and writing signatures to a file. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_NNGene.py", line 114, in test_signature output_handle = open(self.test_file, "w") IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt' ====================================================================== ERROR: Full template creation test ---------------------------------------------------------------------- Traceback (most recent call last): File "test_PopGen_SimCoal_nodepend.py", line 23, in test_template_full 'PopGen') File "~/biopython-1.50/build/lib.linux-i686-2.3/Bio/PopGen/SimCoal/Template.py", line 210, in generate_simcoal_from_template stream = open(out_dir + sep + 'tmp.par', 'w') IOError: [Errno 13] Permission denied: 'PopGen/tmp.par' ====================================================================== ERROR: test_SeqUtils ---------------------------------------------------------------------- [Errno 13] Permission denied: 'fasta.tmp' ====================================================================== How can I bypass these errors without giving non-root users write permission? Thanks, Finsen From biopython at maubp.freeserve.co.uk Wed Sep 16 19:35:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 00:35:34 +0100 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <4AB1723C.2040004@newton.berkeley.edu> References: <4AB1723C.2040004@newton.berkeley.edu> Message-ID: <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> Hi, What OS are you using? 
And did you try just the normal installation: python setup.py build python setup.py test sudo python setup.py install That would normally work on Linux/Unix. On Thu, Sep 17, 2009 at 12:18 AM, Finsen Chiu wrote: > Hi all, > > I also tried doing a umask 022 before installing and encountered write > permission for non-root users, as followed: > > ====================================================================== > ERROR: test_Clustalw_tool > ---------------------------------------------------------------------- > [Errno 13] Permission denied: 'Clustalw/temp horses.fasta' etc The above could happens if you did the build and test as sudo, and the temp test files were left behind (e.g. if interrupted). If you later rerun the tests as a normal user you can't delete them (because they belong to root). Peter From fchiu at newton.berkeley.edu Wed Sep 16 20:00:18 2009 From: fchiu at newton.berkeley.edu (Finsen Chiu) Date: Wed, 16 Sep 2009 17:00:18 -0700 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> Message-ID: <4AB17C12.60806@newton.berkeley.edu> I am using RHEL4 I built and installed as root and both went smoothly without interruption. Running the test as root is fine, but the permission denied errors came when running the test as a user. So, I wonder if I need to give users write permission to those files (which I am not wanting to do) or if those errors are negligible. Thanks, Finsen Peter wrote: > Hi, > > What OS are you using? And did you try just the normal installation: > > python setup.py build > python setup.py test > sudo python setup.py install > > That would normally work on Linux/Unix. 
> > On Thu, Sep 17, 2009 at 12:18 AM, Finsen Chiu wrote: > >> Hi all, >> >> I also tried doing a umask 022 before installing and encountered write >> permission for non-root users, as followed: >> >> ====================================================================== >> ERROR: test_Clustalw_tool >> ---------------------------------------------------------------------- >> [Errno 13] Permission denied: 'Clustalw/temp horses.fasta' >> > etc > > The above could happens if you did the build and test as sudo, and > the temp test files were left behind (e.g. if interrupted). If you later > rerun the tests as a normal user you can't delete them (because they > belong to root). > > Peter > > From biopython at maubp.freeserve.co.uk Thu Sep 17 05:15:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 10:15:06 +0100 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <4AB17C12.60806@newton.berkeley.edu> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> <4AB17C12.60806@newton.berkeley.edu> Message-ID: <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> On Thu, Sep 17, 2009 at 1:00 AM, Finsen Chiu wrote: > I am using RHEL4 > > I built and installed as root and both went smoothly without interruption. OK, good that it worked. But I don't normally do the build and test as root, just use sudo for the install at the end. > Running the test as root is fine, but the permission denied errors came when > running the test as a user. > > So, I wonder if I need to give users write permission to those files (which > I am not wanting to do) or if those errors are negligible. Switching the permissions would work, but the best solution to not to run the tests as root in the first place. If you do the build and test as a normal user, all the temp test files will have normal user permissions. 
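One way to check whether an earlier root-run test left files behind is to list everything under the test directory that a given user does not own. This is only a minimal sketch: the helper name `files_not_owned_by` is made up for illustration and is not part of Biopython, and it assumes a Unix-like system where file ownership is meaningful.

```python
import os


def files_not_owned_by(root, uid):
    """Return a sorted list of paths under root whose owner is not uid."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat looks at the entry itself and does not follow symlinks.
            if os.lstat(path).st_uid != uid:
                found.append(path)
    return sorted(found)
```

On Linux/Unix this could be called as `files_not_owned_by("Tests", os.getuid())`, where the `Tests` path is an assumption about where the test suite was run; any paths it returns are candidates for leftover root-owned temp files.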
[Normally there shouldn't be any left over temp files from the tests, so I guess a test failed or was interupted before it could clean up] Peter From jhcepas at gmail.com Thu Sep 17 13:58:38 2009 From: jhcepas at gmail.com (Jaime Huerta Cepas) Date: Thu, 17 Sep 2009 19:58:38 +0200 Subject: [Biopython] ETE: a python Environment for Tree Exploration Message-ID: Hi all, I have recently finished a software (ETE) that might be of your interest. ETE (Environment for tree Exploration) is a python toolkit to analyze, manipulate and visualize hierarchical trees. It allows to deal with any kind of tree, but it includes specific data types for loading phylogenetic and clustering trees. Besides many tree handling options it provides some analytic methods such as orthology/paralogy prediction or cluster validation. The toolkit is GPL and aims to be very flexible and configurable, so it can be used together with other toolkits such as BioPython. The development of this software responses to the needs of our own group during the last years, and I hope it can be now useful for the bioinformatics community. Bugs and comments are very welcome, as well as any idea for a better integration with Biopython :) program, documentation and examples can be found at http://ete.cgenomics.org hope it's useful for you! cheers, Jaime ** Summary of main ETE's features: ** General trees ========== Advanced node annotation, tree topology manipulation, automatic tree prunning, cut \& paste partitions, trees concatenation, random trees generation, iterate over leaves and descendants, pre and pos-order tree traversion, root and unroot options, advanced nodes search, get distances among nodes, detect midpoint outgroup, find farthest descendant node, find farthest node in the whole tree, detect first common ancestor among nodes, text mode visualization, newick rendering (several formats), extended newick format integration, support for built-in python operations: print, len, iter, in. 
Phylogenetic trees ================ Link to multiple sequence alignments, automatic species name detection, check node monophily, evolutionary events dating, detect orthology and paralogy relationships: species overlap and tree reconciliation methods, complete access API to the phylomeDB database, integrated visualization (show molecular sequences and evolutionary events). Clustering trees ============== link to numerical matrices, calculate inter and intra-cluster distances among clusters, calculate Silhouette and Dunn Indexes, integrated visualization (display numerical profiles in several formats). Treeview extension =============== Interactive Graphical User Interface (GUI), programmable drawing engine, independent node aspect editing, support drawing node extra features (text or external images), vector graphics rendering using PDF format. -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From carlos.borroto at gmail.com Fri Sep 18 12:59:58 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Fri, 18 Sep 2009 12:59:58 -0400 Subject: [Biopython] Searching for and downloading sequences using the history Message-ID: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> Hi all, I'm trying to download all of the EST from a specie, I'm following the example on the tutorial which seems to be exactly what I need. 
But I'm running into this problem:

>>> from Bio import Entrez
>>> Entrez.email = "carlos.borroto at gmail.com"
>>> dbname = "nucest"
>>> query_term = "Genus specie"
>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> len(search_results["IdList"])
20
>>> print search_results["Count"]
193951

So the assert statement is failing:

>>> gi_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(gi_list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

And most importantly, I'm not getting all of the IDs.

Does anyone know what I'm doing wrong?

Thanks in advance,
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020

From carlos.borroto at gmail.com Fri Sep 18 13:15:41 2009
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Fri, 18 Sep 2009 13:15:41 -0400
Subject: [Biopython] Searching for and downloading sequences using the history
In-Reply-To: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
Message-ID: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com>

On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto wrote:
> Hi all,
>
> I'm trying to download all of the EST from a specie, I'm following the
> example on the tutorial which seems to be exactly what I need.
But I > running into this problem: > >>>> from Bio import Entrez >>>> Entrez.email = "carlos.borroto at gmail.com" >>>> dbname = "nucest" >>>> query_term = "Genus specie" >>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y") >>>> search_results = Entrez.read(search_handle) >>>> search_handle.close() >>>> len(search_results["IdList"]) > 20 >>>> print search_results["Count"] > 193951 > > So the assert statement if failing: >>>> gi_list = search_results["IdList"] >>>> count = int(search_results["Count"]) >>>> assert count == len(gi_list) > Traceback (most recent call last): > ?File "", line 1, in > AssertionError > I just found this: http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html So I tested this: >>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951) >>> search_results = Entrez.read(search_handle) >>> search_handle.close() >>> print search_results["Count"] 193951 >>> len(search_results["IdList"]) 100000 Still not the complete list, maybe there is a maximum of result you can get and I see there is a retstart, so I'm guessing the only way to get all of the ids is dividing my search and using retstart. I'm right? I'm going to implement this I share it here. regards, -- Carlos Javier Borroto Baltimore, MD Phone: (410) 929 4020 From fredgca at hotmail.com Fri Sep 18 13:51:20 2009 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 18 Sep 2009 17:51:20 +0000 Subject: [Biopython] ETE: a python Environment for Tree Exploration In-Reply-To: References: Message-ID: Jaime, Thanks for sharing. Very useful! 
Frederico Arnoldi > From: biopython-request at lists.open-bio.org > Subject: Biopython Digest, Vol 81, Issue 21 > To: biopython at lists.open-bio.org > Date: Fri, 18 Sep 2009 12:00:02 -0400 > Date: Thu, 17 Sep 2009 19:58:38 +0200 > From: Jaime Huerta Cepas > Subject: [Biopython] ETE: a python Environment for Tree Exploration > To: biopython at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi all, > > I have recently finished a software (ETE) that might be of your interest. > > ETE (Environment for tree Exploration) is a python toolkit to analyze, > manipulate and visualize hierarchical trees. It allows to deal with any kind > of tree, but it includes specific data types for loading phylogenetic and > clustering trees. Besides many tree handling options it provides some > analytic methods such as orthology/paralogy prediction or cluster > validation. The toolkit is GPL and aims to be very flexible and > configurable, so it can be used together with other toolkits such as > BioPython. > > The development of this software responses to the needs of our own group > during the last years, and I hope it can be now useful for the > bioinformatics community. > Bugs and comments are very welcome, as well as any idea for a better > integration with Biopython :) > > program, documentation and examples can be found at http://ete.cgenomics.org > > hope it's useful for you! 
http://www.microsoft.com/brasil/windows/windowslive/products/hotmail.aspx From carlos.borroto at gmail.com Fri Sep 18 14:08:14 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Fri, 18 Sep 2009 14:08:14 -0400 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> Message-ID: <65d4b7fc0909181108x3a82e17eu9d078bc4402248ab@mail.gmail.com> On Fri, Sep 18, 2009 at 1:15 PM, Carlos Javier Borroto wrote: > On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto > wrote: >> Hi all, >> >> I'm trying to download all of the EST from a specie, I'm following the >> example on the tutorial which seems to be exactly what I need. But I >> running into this problem: >> > > I'm right? I'm going to implement this I share it here. > Well here is my implementation, I'm very new to biopython or even python, my programing skills aren't great either, but because what I did was mostly copy/paste from the tutorial, I'm feeling confident on sharing this code, any advise to make it better is highly welcome. It seems to be working just fine, but I haven't been able to run it to the end, cause I keep getting this sporadic ncbi servers errors: Going to download records 5601 to 5700 Traceback (most recent call last): File "ncbi-downloader.py", line 58, in webenv=webenv, query_key=query_key) File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 126, in efetch return _open(cgi, variables) File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 373, in _open raise IOError(data.strip()) IOError: Error: 156514070 is not available at this time. Error: 156514069 is not available at this time. Error: 156514068 is not available at this time. Error: 156514067 is not available at this time. 
But I guess it is only a matter of being a better citizen and doing this at the weekend or outside USA peak time.

Here is the code:

from Bio import Entrez

Entrez.email = "A.N.Other at example.com"
dbname = "code_name_of_the_db"
query_term = "query_term"

# Ask EGQuery how many records match in the chosen database.
handle = Entrez.egquery(term=query_term)
record = Entrez.read(handle)
handle.close()
for row in record["eGQueryResult"]:
    if row["DbName"] == dbname:
        egquery_count = int(row["Count"])

esearch_batch_size = 1000
out_handle = open("outfile.fasta", "w")
for esearch_start in range(0, egquery_count, esearch_batch_size):
    esearch_end = min(egquery_count, esearch_start + esearch_batch_size)
    print "Going to get IDs of records %i to %i" % (esearch_start + 1, esearch_end)
    search_handle = Entrez.esearch(db=dbname, term=query_term, usehistory="y",
                                   retstart=esearch_start,
                                   retmax=esearch_batch_size)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    # The last batch may hold fewer than esearch_batch_size IDs.
    assert len(gi_list) == esearch_end - esearch_start
    count = int(search_results["Count"])
    assert egquery_count == count
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    batch_size = 100
    # Fetch this window in smaller pieces, never running past esearch_end.
    for start in range(esearch_start, esearch_end, batch_size):
        end = min(esearch_end, start + batch_size)
        print "Going to download records %i to %i" % (start + 1, end)
        fetch_handle = Entrez.efetch(db=dbname, rettype="fasta",
                                     retstart=start, retmax=end - start,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
out_handle.close()

regards,
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020

From mmokrejs at ribosome.natur.cuni.cz Fri Sep 18 14:23:26 2009
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Fri, 18 Sep 2009 20:23:26 +0200
Subject: [Biopython] Searching for and downloading sequences using the history
In-Reply-To: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
Message-ID: <4AB3D01E.5000807@ribosome.natur.cuni.cz> Hi Carlos, I had a look what the Entrez.esearch object has in its properties and I see RetMax attribute. >>> search_results {u'Count': '9279', u'RetMax': '20', u'IdList': ['189229275', '189229274', '189229273', '189229272', '189229271', '189229 270', '189229269', '189229268', '189229267', '189229266', '189229265', '189229264', '189229263', '189229262', '189229261 ', '189229260', '189229199', '189229198', '189229197', '189229196'], u'TranslationStack': [{u'Count': '9279', u'Field': 'All Fields', u'Term': 'Genus[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'QueryTranslation': 'Genus[All Fields]', u'Erro rList': {u'FieldNotFound': [], u'PhraseNotFound': ['specie']}, u'TranslationSet': [], u'RetStart': '0', u'QueryKey': '1' , u'WebEnv': 'NCID_1_3207467_130.14.22.148_9001_1253297878'} >>> So, here we go: >>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y",RetMax=99999) >>> search_results = Entrez.read(search_handle) >>> search_handle.close() >>> len(search_results["IdList"]) 9279 >>> BTW: >>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y",RetMax=9999999999) >>> search_results = Entrez.read(search_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 141, in endElement raise RuntimeError(value) RuntimeError: Search Backend failed: NCBI C++ Exception: Error: CORELIB(CStringException::eConvert) "/pubmed_gen/rbuild/version/20090819/entrez/c++/src/corelib/ncbist r.cpp", line 411: --- Cannot convert string '9999999999' to int, overflow (m_Pos = 0) >>> Hope this helps, M. 
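Since a single ESearch call appears to cap how many IDs it returns, the thread's alternative is to walk the result set in windows using retstart and retmax. The following is a minimal sketch of just the window arithmetic; the helper name `batch_windows` is hypothetical, and it does no network access at all.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover records 0..total-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # Clip the final window so it never runs past the last record.
        yield start, min(batch_size, total - start)
```

Each pair could then be passed as the retstart and retmax arguments of Entrez.esearch or Entrez.efetch; whether the server honours every window size is something to confirm with the NCBI.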
Carlos Javier Borroto wrote: > Hi all, > > I'm trying to download all of the EST from a specie, I'm following the > example on the tutorial which seems to be exactly what I need. But I > running into this problem: > >>>> from Bio import Entrez >>>> Entrez.email = "carlos.borroto at gmail.com" >>>> dbname = "nucest" >>>> query_term = "Genus specie" >>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y") >>>> search_results = Entrez.read(search_handle) >>>> search_handle.close() >>>> len(search_results["IdList"]) > 20 >>>> print search_results["Count"] > 193951 > > So the assert statement if failing: >>>> gi_list = search_results["IdList"] >>>> count = int(search_results["Count"]) >>>> assert count == len(gi_list) > Traceback (most recent call last): > File "", line 1, in > AssertionError > > And most important I'm not getting all of the ids. > > Did someone knows what I'm doing wrong? > > thanks in advance From biopython at maubp.freeserve.co.uk Fri Sep 18 14:51:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 19:51:32 +0100 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> Message-ID: <320fb6e00909181151y2e22c06fvc795b9c435f6c01b@mail.gmail.com> On Fri, Sep 18, 2009 at 6:15 PM, Carlos Javier Borroto wrote: > On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto wrote: >> Hi all, >> >> I'm trying to download all of the EST from a specie, I'm following the >> example on the tutorial which seems to be exactly what I need. But I >> running into this problem: >> ... 
>
> I just found this:
> http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html
>
> So I tested this:
>>>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951)
>>>> search_results = Entrez.read(search_handle)
>>>> search_handle.close()
>>>> print search_results["Count"]
> 193951
>>>> len(search_results["IdList"])
> 100000
>
> Still not the complete list, maybe there is a maximum of result you
> can get and I see there is a retstart, so I'm guessing the only way to
> get all of the ids is dividing my search and using retstart.

OK, good - you found the retmax parameter. It looks like the NCBI still
limits the returned data to 100000 here. I don't know whether EFetch (via
the history) would also be limited to 100000, but this is still a pretty
large amount of EST data to try and download this way.

I would first suggest you refine your Entrez search to use "species
name[orgn]" rather than just "species name" (i.e. explicitly search
on the organism rather than all fields). Even better, search using an
NCBI taxonomy ID to be absolutely explicit. Either change may reduce
the dataset a bit.

Secondly, this seems like an awfully large amount of data to
try and download via Entrez. Email the NCBI to ask if this is
OK (and if so what batch size you should use for EFetch calls),
or if they have an alternative suggestion (e.g. some FTP site).

Peter

P.S. You could try wrapping each EFetch call in a
try/except in order to retry any individual retrieval which fails.
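The try/except idea in the P.S. above might look something like the following. This is only a sketch under stated assumptions: the helper name `call_with_retries` is invented here, the caller supplies a zero-argument function that performs one fetch, and IOError is caught because that is what the traceback earlier in this thread shows Bio.Entrez raising. The number of attempts and the delay are arbitrary defaults, not NCBI guidance.

```python
import time


def call_with_retries(fetch, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds, giving up after a fixed number of tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except IOError:
            if attempt == attempts:
                # Out of attempts: let the caller see the last error.
                raise
            # Pause before trying again so the server is not hammered.
            sleep(delay)
```

A caller could wrap one EFetch request in a small function or lambda that opens the handle, reads it, closes it and returns the data, then pass that function as `fetch`. Bounding the attempts keeps an unattended script from looping indefinitely when the server stays down.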
From fchiu at newton.berkeley.edu Fri Sep 18 16:32:26 2009 From: fchiu at newton.berkeley.edu (Finsen Chiu) Date: Fri, 18 Sep 2009 13:32:26 -0700 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> <4AB17C12.60806@newton.berkeley.edu> <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> Message-ID: <4AB3EE5A.7080606@newton.berkeley.edu> This has been resolved. For the record, here is the resolution: I built and tested as an user and that went fine. I saw that the proper permission for the Bio, BioSQL and Martel should be umask 002. So I proceeded with installing as root with umask 002. Thanks, Finsen Peter wrote: > On Thu, Sep 17, 2009 at 1:00 AM, Finsen Chiu wrote: > >> I am using RHEL4 >> >> I built and installed as root and both went smoothly without interruption. >> > > OK, good that it worked. But I don't normally do the build and test as root, > just use sudo for the install at the end. > > >> Running the test as root is fine, but the permission denied errors came when >> running the test as a user. >> >> So, I wonder if I need to give users write permission to those files (which >> I am not wanting to do) or if those errors are negligible. >> > > Switching the permissions would work, but the best solution to not to > run the tests as root in the first place. If you do the build and test as > a normal user, all the temp test files will have normal user permissions. 
> > [Normally there shouldn't be any left over temp files from the tests, > so I guess a test failed or was interupted before it could clean up] > > Peter > > From biopython at maubp.freeserve.co.uk Fri Sep 18 16:56:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 21:56:34 +0100 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181309j282e2cfbi193be5528cbafa5@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> <320fb6e00909181151y2e22c06fvc795b9c435f6c01b@mail.gmail.com> <65d4b7fc0909181309j282e2cfbi193be5528cbafa5@mail.gmail.com> Message-ID: <320fb6e00909181356u5f3eb543xe2cdac77b8677c0e@mail.gmail.com> On Fri, Sep 18, 2009 at 9:09 PM, Carlos Javier Borroto wrote: > > On Fri, Sep 18, 2009 at 2:51 PM, Peter wrote: >> I would first suggest you refine your Entrez search to use "species >> name[orgn]" rather than just "species name" (i.e. explicitly search >> on the organism rather than all fields). That may reduce things >> further. Even better, search using an NCBI taxonomy ID to be >> absolutely explicit. This may reduce the dataset a bit. > > Nice advise, I was thinking about using it, now I'm using something > like txid6945[Organism:noexp], but still I have 100000+ sequence to > download. That is what I meant - but still, you have a lot of sequences! >> Secondly, this seems like an awfully large amount of data to >> try and download via Entrez. Email the NCBI to ask if if this is >> OK (and if so what batch size you should use for EFetch calls), >> or if they have an alternative suggestion (e.g. some FTP site). > > But how could I know what is and what isn't "an awfully large amount > of data" to download via Entrez?, I'm gonna try writing to them and > see what they think. 
FTP site was my first option but is unresponsive
> right now, but I don't think they have this specific subset of
> sequences there.

Well over 100000 records sounds like a lot to me, but I agree, the
NCBI could provide more explicit guidance. It is a shame if the NCBI
doesn't provide what you want by FTP, as that would probably be easier.

>> P.S. You could try wrapping each EFetch call in a
>> try/except in order to retry any individual retrieval which fails.
>
> Great I just did it and it seems to be working fine!, here is what I did:
>
>     while True :
>         try :
>             fetch_handle = ...
>             data = fetch_handle.read()
>             fetch_handle.close()
>             out_handle.write(data)
>             break
>         except IOError :
>             print "Server error, going to try again from record %i" % (start+1)

That will loop forever if there is a problem - which is fine if you are
going to sit watching the script, but a very bad idea for automation.
I would limit it to, say, 3 attempts before giving up, and maybe add a
sleep of a few seconds too.

Peter

From peter at maubp.freeserve.co.uk Sat Sep 19 07:17:50 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Sat, 19 Sep 2009 12:17:50 +0100
Subject: [Biopython] Shuffler for wrapped fastas
In-Reply-To: <4AB0D386.2060406@stats.ox.ac.uk>
References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> <4AB0D386.2060406@stats.ox.ac.uk>
Message-ID: <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com>

Hi Tanya,

Sorry for the slight delay - your email didn't appear in my inbox
for a couple of days. Odd.
On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote: > > A single write is definitely better, though it's still so much slower than > plain text shuffling that it's not ideal for millions of short reads unless > we want to do something useful like convert the scores to Phred in the > process. In that case we'd be using 'format' anyway, I assume, unless there > is a neat trick to reformat a whole lot of reads at once. If guess you mean the SeqRecord's format method? It isn't intended for output of multiple records to a file, but rather is a convenience method for a single record. The (slow) approach using a loop and many calls to SeqRecord.format(...) is also less general than using a single call to Bio.SeqIO.write(...). Consider interleaved file formats or those with a header - the for loop won't work here. Using Bio.SeqIO and combining the parse and write functions already allows simple conversion of a range of sequence file formats, including the three FASTQ variants. This is covered in the tutorial and the wiki, http://biopython.org/wiki/SeqIO The soon to be released Biopython 1.52 will make this even easier (and in some cases like FASTQ conversion also faster) with the addition of a Bio.SeqIO.convert(...) function. > In general I find myself using Biopython for longer sequences (fasta or > fastq), because of the neatness and flexibility of Biopython, but sticking > to plain text for short reads because of the overheads. In some cases that is the best thing to do. If you haven't already done so, have a look at the FastqGeneralIterator function in Bio.SeqIO.QualityIO which returns a tuple of three strings (so no overhead from Seq and SeqRecord objects). > BTW, itertools.izip does exactly what your interleave method > does, so I'm not sure there's any need to rewrite it. No it doesn't. The builtin function zip, and itertools.izip both return tuples (pairs of entries). 
Consider:

>>> a = range(0,10,2)
>>> b = range(1, 10, 2)
>>> zip(a,b)
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Or, using itertools,

>>> import itertools
>>> list(itertools.izip(a,b))
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Using the (not very general) interlace function I wrote earlier:
http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html

>>> def interlace(iter1, iter2) :
...     while True :
...         yield iter1.next()
...         yield iter2.next()
...
>>> list(interlace(iter(a),iter(b)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

I hope that illustrates the difference. Here you get back ten items,
but with zip or izip you get five pairs of items. Via Google you
can easily find much more general interlace functions in Python.

Peter

From rodrigo_faccioli at uol.com.br Sun Sep 20 15:24:54 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Sun, 20 Sep 2009 16:24:54 -0300
Subject: [Biopython] Protein side-chain angles from a PDB file
Message-ID: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com>

Hi,

I have a question about how to calculate the protein side-chain angles from
a PDB file.

I've read that Bio.PDB.Vector evaluates angles from three atoms. However,
my question is how to choose these atoms from a PDB file.

Based on Peter Cook's web site I have calculated the phi-psi angles,
but I haven't seen anything there about side-chain angles.

Sorry if my question is very basic. I'm a computer scientist and a novice
in chemistry.

Thanks in advance.
--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From p.j.a.cock at googlemail.com Mon Sep 21 05:18:42 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 21 Sep 2009 10:18:42 +0100
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com>
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com>
Message-ID: <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com>

On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli wrote:
> Hi,
>
> I have a doubt about how can I calculate the protein side-chain angles from
> a PDB file?
>
> I've read the Bio.PDB.Vector evaluates the angles from three atoms. However,
> my question is how can I chose these atoms from a PDB file?
>
> I have based on peter cook web-site and I have calculated the phi-psi angles
> but I haven't seen about side-chain angles.

http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/

It's "Cock", not "Cook", but it's a common mistake ;)

> Sorry if my question is very basic. However, I'm a computer scientist novice
> in chemistry issue.

How are you defining the side chain angle?

From memory (without checking the details), you have the protein
backbone, which includes the "alpha carbon" (CA in PDB files) to which
the side chains are attached. I guess you want to measure the angle from
the side chain through the C-alpha to (either of) the neighbouring
backbone atoms. The point here is that, off the top of my head, there are
at least two possible angles you might be asking about.
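Whichever three atoms end up defining the angle, the arithmetic is the same. Here is a minimal sketch in plain Python, assuming each atom's coordinates are available as an (x, y, z) tuple; the function name `angle_between` is made up for illustration and is not part of Bio.PDB.

```python
import math


def angle_between(a, b, c):
    """Return the angle in degrees at point b formed by points a, b and c."""
    # Vectors pointing from the middle atom to each of the other two.
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("two of the points coincide")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))
```

With the C-alpha as the middle point `b`, the two other arguments would be the coordinates of the chosen side-chain atom and the chosen backbone neighbour. Which atoms those should be depends on the definition of side-chain angle being used.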
But in terms of the code, you'll just need to get the coordinates of the three atoms defining the angle (which probably will be the C-alpha and two others), which defines two vectors, then take their dot product and thus get the cosine of the angle. Looking at the Psi-Phi code should help here. Peter From golubchi at stats.ox.ac.uk Mon Sep 21 05:20:44 2009 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Mon, 21 Sep 2009 10:20:44 +0100 Subject: [Biopython] Shuffler for wrapped fastas In-Reply-To: <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com> References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> <4AB0D386.2060406@stats.ox.ac.uk> <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com> Message-ID: <4AB7456C.4020207@stats.ox.ac.uk> Hi Peter, Thanks for the pointer to FastqGeneralIterator, I'll definitely take a look. (Good point about interlacing as well.) Cheers, Tanya Peter wrote: > Hi Tanya, > > Sorry for the slight delay - your email didn't appear in my inbox > for a couple of days. Odd. > > On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote: >> A single write is definitely better, though it's still so much slower than >> plain text shuffling that it's not ideal for millions of short reads unless >> we want to do something useful like convert the scores to Phred in the >> process. In that case we'd be using 'format' anyway, I assume, unless there >> is a neat trick to reformat a whole lot of reads at once. > > If guess you mean the SeqRecord's format method? It isn't intended > for output of multiple records to a file, but rather is a convenience > method for a single record. The (slow) approach using a loop and > many calls to SeqRecord.format(...) is also less general than using > a single call to Bio.SeqIO.write(...). 
Consider interleaved file formats > or those with a header - the for loop won't work here. > > Using Bio.SeqIO and combining the parse and write functions already > allows simple conversion of a range of sequence file formats, including > the three FASTQ variants. This is covered in the tutorial and the wiki, > http://biopython.org/wiki/SeqIO > > The soon to be released Biopython 1.52 will make this even easier > (and in some cases like FASTQ conversion also faster) with the > addition of a Bio.SeqIO.convert(...) function. > >> In general I find myself using Biopython for longer sequences (fasta or >> fastq), because of the neatness and flexibility of Biopython, but sticking >> to plain text for short reads because of the overheads. > > In some cases that is the best thing to do. If you haven't already > done so, have a look at the FastqGeneralIterator function in > Bio.SeqIO.QualityIO which returns a tuple of three strings (so > no overhead from Seq and SeqRecord objects). > >> BTW, itertools.izip does exactly what your interleave method >> does, so I'm not sure there's any need to rewrite it. > > No it doesn't. The builtin function zip, and itertools.izip both > return tuples (pairs of entries). Consider: > >>>> a = range(0,10,2) >>>> b = range(1, 10, 2) >>>> zip(a,b) > [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)] > > Or, using itertools, > >>>> import itertools >>>> list(itertools.izip(a,b)) > [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)] > > Using the (not very general) interlace function I wrote earlier: > http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html > >>>> def interlace(iter1, iter2) : > ... while True : > ... yield iter1.next() > ... yield iter2.next() > ... >>>> list(interlace(iter(a),iter(b))) > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] > > I hope that illustrates the difference. Here you get back ten items, > but with zip or izip you get five pairs of iterms. Via Google you > can easily find much more general interlace functions in Python. 
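A note for anyone reading this on Python 3, offered as a sketch rather than as Peter's code: `itertools.izip` no longer exists there (the builtin `zip` is already lazy), iterators are advanced with `next(it)` instead of `it.next()`, and since Python 3.7 a `StopIteration` escaping from inside a generator is turned into a `RuntimeError`, so the `while True` loop above would not end cleanly. One way to get the same flat result is to loop over `zip`. Unlike the original, this version stops as soon as either input runs out.

```python
from itertools import chain


def interlace(iter1, iter2):
    """Yield items alternately from two iterables.

    Stops when the shorter input is exhausted (a simplification
    compared with the Python 2 generator quoted above).
    """
    for first, second in zip(iter1, iter2):
        yield first
        yield second


a = range(0, 10, 2)
b = range(1, 10, 2)

pairs = list(zip(a, b))                        # five (even, odd) tuples
flat = list(interlace(a, b))                   # ten single items
flat_chain = list(chain.from_iterable(zip(a, b)))  # same thing, no helper
```

The `chain.from_iterable(zip(...))` form gives the same flat sequence without defining a function, which may be enough for a one-off script.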
>
> Peter

From darnells at dnastar.com Mon Sep 21 11:03:01 2009
From: darnells at dnastar.com (Steve Darnell)
Date: Mon, 21 Sep 2009 10:03:01 -0500
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To: <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com>
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com>
Message-ID:

Rodrigo,

This is just to expand on Peter's comment. Your original question
implicitly mentioned two types of angles: bond and dihedral angles. A
bond angle can be calculated with three atoms, two vectors, and a dot
product (the first type mentioned). When you use the term phi and psi
angles, you are referring to dihedral (or torsion) angles (the angle
between two planes where the intersection is along the bond of
interest). It's more complicated to calculate, but relatively
straightforward:

http://en.wikipedia.org/wiki/Dihedral_angle

Were you originally asking about how to calculate the torsion angles
in the side chain? These are known as the chi angles and are used for
defining rotational conformations (rotamers). I'll stop here since
I'm guessing I misunderstood your original question.

Regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock
Sent: Monday, September 21, 2009 4:19 AM
To: Rodrigo faccioli
Cc: biopython
Subject: Re: [Biopython] Protein side-chain angles from a PDB file

On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli wrote:
> Hi,
>
> I have a doubt about how can I calculate the protein side-chain angles
> from a PDB file?
>
> I've read the Bio.PDB.Vector evaluates the angles from three atoms.
> However, my question is how can I chose these atoms from a PDB file?
> > I have based on peter cook web-site and I have calculated the phi-psi > angles but I haven't seen about side-chain angles. http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ Its "Cock", not "Cook", but its a common mistake ;) > Sorry if my question is very basic. However, I'm a computer scientist > novice in chemistry issue. How are you defining the side chain angle? From memory (without checking the details), you have the protein back bone which includes the "alpha carbon" (CA in PDB files) to which the side chains are attached. I guess you want to measure the angle of the side chain to the C-alpha to (either of the) neighbouring backbone atoms. The point here is off the top of my head there are at least two possible angles you might be asking about. But in terms of the code, you'll just need to get the coordinates of the three atoms defining the angle (which probably will be the C-alpha and two others), which defines two vectors, then take their dot product and thus get the cosine of the angle. Looking at the Psi-Phi code should help here. Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From rodrigo_faccioli at uol.com.br Mon Sep 21 15:23:37 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Mon, 21 Sep 2009 16:23:37 -0300 Subject: [Biopython] Protein side-chain angles from a PDB file In-Reply-To: References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> Message-ID: <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com> Steve Darnell and Peter Cock, I apologize your answers. I have complicated my question because I have not understood very well about these angles. However, I have understood more. Thus, I'll retype my original question: How can I calculate the rotational conformations (rotamers) when I have a PDB file? 
Therefore, if I understood better, I need to know which amino acid I'm working because the number of chi angles vary according to each amino acids. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Mon, Sep 21, 2009 at 12:03 PM, Steve Darnell wrote: > Rodrigo, > > This is just to expand on Peter's comment. Your original question > implicitly mentioned two types of angles: bond and dihedral angles. A bond > angle can be calculated with three atoms, two vectors, and a dot product > (the first type mentioned). When you use the term phi and psi angles, you > are mentioning dihedral (or torsion) angles (the angle betweeen two planes > where the intersection is along the bond of interest). It's more > complicated to calculate, but relatively straight forward: > > http://en.wikipedia.org/wiki/Dihedral_angle > > Were you originally asking about how to calculate the torsion angles in the > side chain? These are known as the chi angles and are used for defining > rotational conformations (rotomers). I'll stop here since I'm guessing I > misunderstood your original question. > > Regards, > Steve > > -- > Steve Darnell > DNASTAR, Inc. > Madison, WI USA > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto: > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > Sent: Monday, September 21, 2009 4:19 AM > To: Rodrigo faccioli > Cc: biopython > Subject: Re: [Biopython] Protein side-chain angles from a PDB file > > On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli < > rodrigo_faccioli at uol.com.br> wrote: > > Hi, > > > > I have a doubt about how can I calculate the protein side-chain angles > > from a PDB file? 
> > > > I've read the Bio.PDB.Vector evaluates the angles from three atoms. > > However, my question is how can I chose these atoms from a PDB file? > > > > I have based on peter cook web-site and I have calculated the phi-psi > > angles but I haven't seen about side-chain angles. > > http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ > > Its "Cock", not "Cook", but its a common mistake ;) > > > Sorry if my question is very basic. However, I'm a computer scientist > > novice in chemistry issue. > > How are you defining the side chain angle? From memory (without checking > the details), you have the protein back bone which includes the "alpha > carbon" (CA in PDB files) to which the side chains are attached. I guess you > want to measure the angle of the side chain to the C-alpha to (either of > the) neighbouring backbone atoms. > The point here is off the top of my head there are at least two possible > angles you might be asking about. > > But in terms of the code, you'll just need to get the coordinates of the > three atoms defining the angle (which probably will be the C-alpha and two > others), which defines two vectors, then take their dot product and thus get > the cosine of the angle. Looking at the Psi-Phi code should help here. 
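As a rough illustration of the dot-product recipe quoted above, here is a minimal sketch in plain Python. The function name `bond_angle` and the example coordinates are made up for illustration; with Bio.PDB you would instead take the coordinates from the three atoms of a parsed structure. As far as I know, Bio.PDB's Vector module also offers a `calc_angle` helper that should give the same answer.

```python
import math


def bond_angle(a, b, c):
    """Return the angle a-b-c in degrees, with b as the central atom.

    a, b and c are (x, y, z) coordinate tuples.
    """
    # Two vectors pointing from the central atom to its neighbours.
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of range.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))


# Hypothetical coordinates: a right angle at the origin.
right = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
# Three collinear points: a straight angle.
straight = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (-1.0, 0.0, 0.0))
```

Note that this is a bond angle (three atoms); the phi, psi and chi angles discussed in this thread are dihedrals and need four atoms.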
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From darnells at dnastar.com Mon Sep 21 17:18:54 2009
From: darnells at dnastar.com (Steve Darnell)
Date: Mon, 21 Sep 2009 16:18:54 -0500
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To: <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
Message-ID:

Rodrigo,

Perhaps this will get you in the right direction for your
implementation. It describes the math for calculating dihedrals and
defines the chi angle atom mappings for all of the amino acids. At
first glance it appears to be correct. Unfortunately, I don't have any
code to help you out.

http://www.math.fsu.edu/~quine/IntroMathBio_05/torsion_pdb/torsion_pdb.pdf

If you're interested in calculating the chi angles only once or
twice, you could try MolProbity.

http://molprobity.biochem.duke.edu/

Maybe someone else knows of another already implemented solution?

~Steve

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Rodrigo faccioli
Sent: Monday, September 21, 2009 2:24 PM
Cc: biopython
Subject: Re: [Biopython] Protein side-chain angles from a PDB file

Steve Darnell and Peter Cock,

I apologize your answers. I have complicated my question because I have
not understood very well about these angles. However, I have understood
more.

Thus, I'll retype my original question: How can I calculate the
rotational conformations (rotamers) when I have a PDB file?

Therefore, if I understood better, I need to know which amino acid I'm
working because the number of chi angles vary according to each amino
acids.
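To make the dihedral calculation concrete, here is a minimal sketch in plain Python using the atan2 form given on the Wikipedia page linked earlier in this thread. The function name `dihedral`, the small vector helpers and the test coordinates are assumptions for illustration only. For chi1 the four atoms are conventionally N, CA, CB and the gamma atom (CG for most residues that have one); the per-residue atom lists for the higher chi angles should be taken from a reference such as the FSU document mentioned above rather than from this sketch. I believe Bio.PDB's Vector module also has a `calc_dihedral` helper that does the same job on Vector objects.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond.

    Each point is an (x, y, z) tuple. The result lies between
    -180 and +180 degrees.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates with the central bond along the z axis.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

To get every chi angle of a residue you would call such a function once per chi, each time with the four atom coordinates that define it, which is why the residue type has to be known first.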
Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Mon, Sep 21, 2009 at 12:03 PM, Steve Darnell wrote: > Rodrigo, > > This is just to expand on Peter's comment. Your original question > implicitly mentioned two types of angles: bond and dihedral angles. A > bond angle can be calculated with three atoms, two vectors, and a dot > product (the first type mentioned). When you use the term phi and psi > angles, you are mentioning dihedral (or torsion) angles (the angle > betweeen two planes where the intersection is along the bond of > interest). It's more complicated to calculate, but relatively straight forward: > > http://en.wikipedia.org/wiki/Dihedral_angle > > Were you originally asking about how to calculate the torsion angles > in the side chain? These are known as the chi angles and are used for > defining rotational conformations (rotomers). I'll stop here since > I'm guessing I misunderstood your original question. > > Regards, > Steve > > -- > Steve Darnell > DNASTAR, Inc. > Madison, WI USA > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto: > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > Sent: Monday, September 21, 2009 4:19 AM > To: Rodrigo faccioli > Cc: biopython > Subject: Re: [Biopython] Protein side-chain angles from a PDB file > > On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli < > rodrigo_faccioli at uol.com.br> wrote: > > Hi, > > > > I have a doubt about how can I calculate the protein side-chain > > angles from a PDB file? > > > > I've read the Bio.PDB.Vector evaluates the angles from three atoms. > > However, my question is how can I chose these atoms from a PDB file? 
> > > > I have based on peter cook web-site and I have calculated the > > phi-psi angles but I haven't seen about side-chain angles. > > http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ > > Its "Cock", not "Cook", but its a common mistake ;) > > > Sorry if my question is very basic. However, I'm a computer > > scientist novice in chemistry issue. > > How are you defining the side chain angle? From memory (without > checking the details), you have the protein back bone which includes > the "alpha carbon" (CA in PDB files) to which the side chains are > attached. I guess you want to measure the angle of the side chain to > the C-alpha to (either of > the) neighbouring backbone atoms. > The point here is off the top of my head there are at least two > possible angles you might be asking about. > > But in terms of the code, you'll just need to get the coordinates of > the three atoms defining the angle (which probably will be the C-alpha > and two others), which defines two vectors, then take their dot > product and thus get the cosine of the angle. Looking at the Psi-Phi code should help here. 
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
_______________________________________________
Biopython mailing list - Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Tue Sep 22 12:38:21 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 17:38:21 +0100
Subject: [Biopython] Biopython 1.52 released
Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>

Dear all,

Those of you who signed up to our newsfeed will know this already,
but we are pleased to announce the release of Biopython 1.52:

http://news.open-bio.org/news/2009/09/biopython-release-152/

Thank you to all our developers, including David Winter for drafting
the release announcement, and everyone else who has contributed with
feedback, bug reports etc.

Could I also take this opportunity to remind you all we have an
application note out in the OUP journal Bioinformatics:

http://news.open-bio.org/news/2009/03/biopython-paper-published/
http://dx.doi.org/10.1093/bioinformatics/btp163

In any scientific publication using Biopython, we kindly request you
cite this, or another appropriate publication from this list:

http://biopython.org/wiki/Documentation#Papers

Thank you,

Peter

From rodrigo_faccioli at uol.com.br Wed Sep 23 09:34:29 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Wed, 23 Sep 2009 10:34:29 -0300
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To:
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
Message-ID: <3715adb70909230634s31f11744hf9b25439bfc6711b@mail.gmail.com>

Thank you. Your help was very important.
I'll read about MolProbity because after I calculate of the chi angle, I'll store them in PostgreSQL. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Mon, Sep 21, 2009 at 6:18 PM, Steve Darnell wrote: > Rodrigo, > > Perhaps this will get you in the right direction for your > implementation. It describes the math for calculating dihedrals and > defines the chi angle atom mappings for all of the amino acids. At > first glance it appears to be correct. Unfortunately, I don't have any > code to help you out. > > http://www.math.fsu.edu/~quine/IntroMathBio_05/torsion_pdb/torsion_pdb.p > df > > If you're only interested in calculating the chi angles only once or > twice, you could try MolProbity. > > http://molprobity.biochem.duke.edu/ > > Maybe some else knows of another already implemented solution? > > ~Steve > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org > [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Rodrigo > faccioli > Sent: Monday, September 21, 2009 2:24 PM > Cc: biopython > Subject: Re: [Biopython] Protein side-chain angles from a PDB file > > Steve Darnell and Peter Cock, > > I apologize your answers. I have complicated my question because I have > not understood very well about these angles. However, I have understood > more. > > Thus, I'll retype my original question: How can I calculate the > rotational conformations (rotamers) when I have a PDB file? > > Therefore, if I understood better, I need to know which amino acid I'm > working because the number of chi angles vary according to each amino > acids. 
> > > Thanks in advance, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL Intelligent System in > Structure Bioinformatics http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > > > On Mon, Sep 21, 2009 at 12:03 PM, Steve Darnell > wrote: > > > Rodrigo, > > > > This is just to expand on Peter's comment. Your original question > > implicitly mentioned two types of angles: bond and dihedral angles. A > > > bond angle can be calculated with three atoms, two vectors, and a dot > > product (the first type mentioned). When you use the term phi and psi > > > angles, you are mentioning dihedral (or torsion) angles (the angle > > betweeen two planes where the intersection is along the bond of > > interest). It's more complicated to calculate, but relatively > straight forward: > > > > http://en.wikipedia.org/wiki/Dihedral_angle > > > > Were you originally asking about how to calculate the torsion angles > > in the side chain? These are known as the chi angles and are used for > > > defining rotational conformations (rotomers). I'll stop here since > > I'm guessing I misunderstood your original question. > > > > Regards, > > Steve > > > > -- > > Steve Darnell > > DNASTAR, Inc. > > Madison, WI USA > > > > > > -----Original Message----- > > From: biopython-bounces at lists.open-bio.org [mailto: > > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > > Sent: Monday, September 21, 2009 4:19 AM > > To: Rodrigo faccioli > > Cc: biopython > > Subject: Re: [Biopython] Protein side-chain angles from a PDB file > > > > On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli < > > rodrigo_faccioli at uol.com.br> wrote: > > > Hi, > > > > > > I have a doubt about how can I calculate the protein side-chain > > > angles from a PDB file? 
> > > > > > I've read the Bio.PDB.Vector evaluates the angles from three atoms. > > > However, my question is how can I chose these atoms from a PDB > file? > > > > > > I have based on peter cook web-site and I have calculated the > > > phi-psi angles but I haven't seen about side-chain angles. > > > > http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ > > > > Its "Cock", not "Cook", but its a common mistake ;) > > > > > Sorry if my question is very basic. However, I'm a computer > > > scientist novice in chemistry issue. > > > > How are you defining the side chain angle? From memory (without > > checking the details), you have the protein back bone which includes > > the "alpha carbon" (CA in PDB files) to which the side chains are > > attached. I guess you want to measure the angle of the side chain to > > the C-alpha to (either of > > the) neighbouring backbone atoms. > > The point here is off the top of my head there are at least two > > possible angles you might be asking about. > > > > But in terms of the code, you'll just need to get the coordinates of > > the three atoms defining the angle (which probably will be the C-alpha > > > and two others), which defines two vectors, then take their dot > > product and thus get the cosine of the angle. Looking at the Psi-Phi > code should help here. > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Sep 24 05:59:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 10:59:47 +0100 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. 
Message-ID: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com>

Hi all,

I'm forwarding an interesting post from Dave to the BioPerl mailing list,
which should also be of interest here...

Peter

---------- Forwarded message ----------
From: Dave Messina
Date: Thu, Sep 24, 2009 at 10:38 AM
Subject: Re: [Bioperl-l] a Main Page proposal
To: Chris Fields
Cc: bioperl-l at lists.open-bio.org, Dave Clements, Peter, "Mark A. Jensen"

>
> Not to add yet more to the list, but I also think a concise list of
> projects using (or 'powered by') bioperl should be front-and-center; not a
> lot of users know when/where bioperl is used. This applies to the other
> bio* as well, particularly biopython (seeing it popping up more and more).
>

Along these lines, it'd be great to publicize not only BioPerl-*powered*
projects, but ones which interface with it, too.

Just this week, for example, there is this, which could go both on a static
page and in the newsfeed:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp554v1

MOODS: fast search for position weight matrix matches in DNA sequences.

Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E.
Department of Computer Science and Helsinki Institute for Information
Technology, University of Helsinki, Helsinki, Finland.

SUMMARY: MOODS (MOtif Occurrence Detection Suite) is a software package for
matching position weight matrices against DNA sequences. MOODS implements
state-of-the-art on-line matching algorithms, achieving considerably faster
scanning speed than with a simple brute-force search. MOODS is written in C++,
with bindings for the popular BioPerl and Biopython toolkits. It can easily be
adapted for different purposes and integrated into existing workflows. It can
also be used as a C++ library. AVAILABILITY: The package with documentation and
examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. The
source code is also available under the terms of a GNU General Public License
(GPL).
CONTACT: janne.h.korhonen at helsinki.fi. PMID: 19773334 [PubMed - as supplied by publisher] _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bartek at rezolwenta.eu.org Thu Sep 24 07:46:42 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 24 Sep 2009 13:46:42 +0200 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> Message-ID: <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> On Thu, Sep 24, 2009 at 11:59 AM, Peter wrote: > Hi all, > > I'm forwarding an interesting post from Dave to the BioPerl mailing list, which > should also be of interest here... > > Just this week, for example, there is this, which could go both on a static > page and in the newsfeed: > http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp554v1 > > MOODS: fast search for position weight matrix matches in DNA sequences. > > Korhonen J, Martinm?ki P, Pizzi C, Rastas P, Ukkonen E. > Department of Computer Science and Helsinki Institute for Information > Technology, > University of Helsinki, Helsinki, Finland. > > SUMMARY: MOODS (MOtif Occurrence Detection Suite) is a software package for > matching position weight matrices against DNA sequences. MOODS implements > state-of-the-art on-line matching algorithms, achieving considerably faster > scanning speed than with a simple brute-force search. MOODS is written in C++, > with bindings for the popular BioPerl and Biopython toolkits. It can easily be > adapted for different purposes and integrated into existing workflows. It can > also be used as a C++ library. AVAILABILITY: The package with documentation and > examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. 
The > source code is also available under the terms of a GNU General Public License > (GPL). CONTACT: janne.h.korhonen at helsinki.fi.

Hi all,

I've seen this paper. It is directly related to the Bio.Motif code.
They did a pretty good job of implementing an extremely efficient tool
for finding motif instances in DNA sequences. It's C++ and it beats my
pure Python, brute-force code hands down... Of course this
comes at the price of only being applicable to DNA (only unambiguous
alphabet etc.). Since they did the comparison, we have actually
incorporated the _pwm.c module written by Michiel, which is also much
faster and can be used for finding motifs in DNA.

I have compared their performance with our code on a single Drosophila
chromosome (20Mb); the results are similarly devastating to my old code:
their code takes ~1.1 sec (advanced look-ahead algorithms in C++)
while mine (pure Python) takes 350 secs. The code contributed recently
by Michiel (simple algorithm, but in C) takes 2.3 secs to finish.

Since they provide a Python interface (there is nothing biopython
related, despite their abstract), I was even thinking about
incorporating their code into Biopython, but it's GPL. Instead, I can
make the function using Michiel's code aware of the MOODS package:
i.e. use it if it is installed.

If we want to put it into the news, it would be worth mentioning that
(thanks to Michiel) we have made quite some progress on that front.

As a side note, I feel a little bit guilty about making biopython look
slow compared to other tools. In the paper, they show a comparison
between different tools (MOODS, bioperl, biopython) in terms of speed,
which shows biopython as by far the slowest. This is just because I
was not writing this code with speed in mind (I work on short
regulatory sequences...). Nonetheless, it can give the impression that
biopython is slow in general, which is not true.
I will try to extend Michiel's code to accept different alphabets and then maybe phase out the slow code of mine. Bartek From biopython at maubp.freeserve.co.uk Thu Sep 24 08:09:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 13:09:16 +0100 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> Message-ID: <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> On Thu, Sep 24, 2009 at 12:46 PM, Bartek Wilczynski wrote: > On Thu, Sep 24, 2009 at 11:59 AM, Peter wrote: >> Hi all, >> >> I'm forwarding an interesting post from Dave to the BioPerl mailing list, which >> should also be of interest here... >> > > Hi all, > > I've seen this paper. It is directly related to the Bio.Motif code. > They did a pretty good job of implementing an extremely efficient tool > for finding motif instances in DNA sequences. it's c++ and it beats my > pure python, brute-force code with both hands down... Of course this > come at a price of only being applicable to DNA (only unambiguous > alphabet etc.). Since they did the comparison, we have actually > incorporated the _pwm.c module written by Michiel, which is also much > faster and can be used for finding motifs in DNA. I hadn't looked at the table until you pointed this out. I think they have been negligent by not including the version numbers of the different packages tested (and this is a general point, not just about Biopython). > I have compared their performance with our code on a single Drosophila > chromosme (20Mb) the results are similarly devastating to my old code: > their code takes ~1.1 sec (advanced look-ahead algorithms in C++) > while mine (pure python) takes 350 secs. 
The code contributed recently > by Michiel (simple algorithm, but in C) takes 2.3secs to finish. Our C code looks pretty good then :) > since they provide python interface (there is nothing biopython > related, despite their abstract), I was even thinking about > incorporating their code into Biopython, but it's GPL, Instead, I can > make the function using Michiel's code aware of the MOODS package: > i.e. use it if it is installed. I'm not sure about that from an architectural point of view, especially if the two algorithms give different results or take different parameters. > If we want to put it into the news, It would be worth mentioning that > (thanks to Michiel) we have made quite some progress on that front. Good idea - why don't you check in an extra paragraph to the NEWS file section for Biopython 1.51 (or was it 1.52?). We can also update the news post too. In fact, if you wanted to you could write up a whole blog post to put up on our news server with timing etc. > As a side note, I feel a little bit guilty of making biopython look > slow compared to other tools. In the paper, they show a comparison > between different tools (MOODS, bioperl, biopython) in terms of speed, > which shows biopython as by far the slowest. This is just because I > was not writing this ?code with speed in mind (I work on short > regulatory sequences...). Nonetheless, it can make an impression that > biopython is slow in general, which is not true. I will try to extend > Michiel's code to accept different alphabets and then maybe phase out > the slow code of mine. Extending the C code to cover more cases sounds like a good idea. However, I would keep the pure python fallback for situations like Jython where C extensions are not available. Peter From chapmanb at 50mail.com Thu Sep 24 08:27:30 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 24 Sep 2009 08:27:30 -0400 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. 
In-Reply-To: <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> Message-ID: <20090924122730.GL13500@sobchak.mgh.harvard.edu> Peter and Bartek; [MOODS paper compared with Biopython] > > I was even thinking about > > incorporating their code into Biopython, but it's GPL, Instead, I can > > make the function using Michiel's code aware of the MOODS package: > > i.e. use it if it is installed. It may be worth contacting the authors with your interest in incorporating it. If it improves substantially upon the current C code from Michiel and could fit with your interface this makes sense. Many times people are not tied to GPL, and they may be willing to re-license for inclusion in Biopython. > > If we want to put it into the news, It would be worth mentioning that > > (thanks to Michiel) we have made quite some progress on that front. > > Good idea - why don't you check in an extra paragraph to the NEWS > file section for Biopython 1.51 (or was it 1.52?). We can also update > the news post too. In fact, if you wanted to you could write up a whole > blog post to put up on our news server with timing etc. A separate news post mentioning the C option speed and showing usage examples from both is a great idea. Responsiveness to new methods is the fun part of science. > > As a side note, I feel a little bit guilty of making biopython look > > slow compared to other tools. In the paper, they show a comparison > > between different tools (MOODS, bioperl, biopython) in terms of speed, > > which shows biopython as by far the slowest. This is just because I > > was not writing this ?code with speed in mind (I work on short > > regulatory sequences...). Nonetheless, it can make an impression that > > biopython is slow in general, which is not true. 
This is more a consequence of how scientific publication works. You have to get published and to do that you have to prove you are somehow that much better than other options, which results in trying to find flaws in those options. This would all work smoother if the authors came on the Biopython list mentioning the speed issues, you all had this discussion then and worked on incorporating their code as it was being developed. Then we'd have an integrated implementation today. Doing it after the fact is a bit more roundabout, but what can you do. Brad From bartek at rezolwenta.eu.org Thu Sep 24 08:51:04 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 24 Sep 2009 14:51:04 +0200 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <20090924122730.GL13500@sobchak.mgh.harvard.edu> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> <20090924122730.GL13500@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com> Hi, On Thu, Sep 24, 2009 at 2:27 PM, Brad Chapman wrote: > Peter and Bartek; > > [MOODS paper compared with Biopython] >> > I was even thinking about >> > incorporating their code into Biopython, but it's GPL, Instead, I can >> > make the function using Michiel's code aware of the MOODS package: >> > i.e. use it if it is installed. > > It may be worth contacting the authors with your interest in > incorporating it. If it improves substantially upon the current C > code from Michiel and could fit with your interface this makes > sense. Many times people are not tied to GPL, and they may be > willing to re-license for inclusion in Biopython. > Yes, I'll try to talk to them. > > A separate news post mentioning the C option speed and showing usage > examples from both is a great idea. 
Responsiveness to new methods is > the fun part of science. > I'll try to write that up and send it to the list. > This is more a consequence of how scientific publication works. You have > to get published and to do that you have to prove you are somehow that > much better than other options, which results in trying to find flaws in > those options. This would all work smoother if the authors came on the > Biopython list mentioning the speed issues, you all had this discussion > then and worked on incorporating their code as it was being developed. > Then we'd have an integrated implementation today. Doing it after the > fact is a bit more roundabout, but what can you do. Exactly. cheers Bartek From michael.koeris at gmail.com Thu Sep 24 11:01:40 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Thu, 24 Sep 2009 11:01:40 -0400 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation Message-ID: Hi, I was wondering if a very useful feature of the web API is implemented in the browser -> the ability to specifity the organism on top of the database. Many thanks Mike From biopython at maubp.freeserve.co.uk Thu Sep 24 11:09:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 16:09:48 +0100 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: References: Message-ID: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> On Thu, Sep 24, 2009 at 4:01 PM, Michael S. Koeris wrote: > Hi, > > I was wondering if a very useful feature of the web API is implemented in > the browser -> the ability to specifity the organism on top of the database. The NCBI BLAST website lets you specify an organism or use an Entrez query - you can do this via the QBlast API as well. See the mailing list archive, e.g. http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html Peter From michael.koeris at gmail.com Thu Sep 24 11:34:35 2009 From: michael.koeris at gmail.com (Michael S. 
Koeris) Date: Thu, 24 Sep 2009 11:34:35 -0400 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> Message-ID: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> On Sep 24, 2009, at 11:09 AM, Peter wrote: > On Thu, Sep 24, 2009 at 4:01 PM, Michael S. Koeris > wrote: >> Hi, >> >> I was wondering if a very useful feature of the web API is >> implemented in >> the browser -> the ability to specifity the organism on top of the >> database. > > The NCBI BLAST website lets you specify an organism or use an Entrez > query - you can do this via the QBlast API as well. See the mailing > list > archive, e.g. > > http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html I saw the option to put in an [ORGANISM] but I was hoping I could use the TaxonID because say I want to BLAST all bacteria or all archea From biopython at maubp.freeserve.co.uk Thu Sep 24 11:46:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 16:46:53 +0100 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> Message-ID: <320fb6e00909240846w828e2edhc7b65fbd98876919@mail.gmail.com> >> The NCBI BLAST website lets you specify an organism or use an Entrez >> query - you can do this via the QBlast API as well. See the mailing list >> archive, e.g. 
>> >> http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html > > I saw the option to put in an [ORGANISM] but I was hoping I could use the > TaxonID because say I want to BLAST all bacteria or all archea Yes, you can do that - you needed to read all of that thread: http://lists.open-bio.org/pipermail/biopython/2009-August/005476.html http://lists.open-bio.org/pipermail/biopython/2009-August/005477.html Peter From carlos.borroto at gmail.com Thu Sep 24 11:47:41 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 24 Sep 2009 11:47:41 -0400 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> Message-ID: <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com> On Thu, Sep 24, 2009 at 11:34 AM, Michael S. Koeris wrote: > I saw the option to put in an [ORGANISM] but I was hoping I could use the > TaxonID because say I want to BLAST all bacteria or all archea > I'm just doing exactly that, by putting on my entrez_query something like this: "txid6945[Organism:noexp]" I got that string by searching on the taxonomic database and then clicking to see all of the sequences of that taxon. I haven't tried to use only "txid6945" don't know what is the meaning of "[Organism:noexp]", but I can tell you this works. 
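A minimal sketch of how a taxon-restricted query like the one above might be built and handed to qblast. The helper names (taxon_entrez_query, blast_within_taxon) are made up for illustration, and the default program and database are assumptions; only the entrez_query argument of NCBIWWW.qblast and the "txid...[Organism]" syntax come from this thread.

```python
def taxon_entrez_query(taxon_id, expand=True):
    """Build an Entrez query string limiting a search to one NCBI taxon.

    With expand=False the ":noexp" qualifier quoted in this thread is used.
    (Hypothetical helper, not part of Biopython.)
    """
    field = "Organism" if expand else "Organism:noexp"
    return "txid%d[%s]" % (int(taxon_id), field)


def blast_within_taxon(sequence, taxon_id, program="blastn", database="nt"):
    """Run a remote BLAST search limited to one taxon (needs network access).

    The program and database defaults are assumptions for this sketch.
    """
    # Imported lazily so the query helper works without Biopython installed.
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(program, database, sequence,
                          entrez_query=taxon_entrez_query(taxon_id))
```

Calling taxon_entrez_query(6945) gives the plain organism form, and taxon_entrez_query(6945, expand=False) gives the ":noexp" form; the handle returned by blast_within_taxon could then be parsed with the usual BLAST XML parser.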
As a side note on blasting, I think there is a bug in the XML
generator from NCBI; I'm getting stuff like this:

>>> print blast_record.descriptions[0].title
gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, putative [Ixodes scapularis]

I have to do odd things like:

>>> print blast_record.descriptions[0].title.rsplit('>')[0]
gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes scapularis]

And it is not a bug in the Biopython parser, because I see the title is
wrong in the XML output.

Hope this helps
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020

From biopython at maubp.freeserve.co.uk Thu Sep 24 12:04:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 17:04:52 +0100
Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation
In-Reply-To: <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com>
References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com>
 <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com>
 <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com>
Message-ID: <320fb6e00909240904o1b07d934w60e36681ee20722b@mail.gmail.com>

On Thu, Sep 24, 2009 at 4:47 PM, Carlos Javier Borroto wrote:
> On Thu, Sep 24, 2009 at 11:34 AM, Michael S. Koeris
> wrote:
>> I saw the option to put in an [ORGANISM] but I was hoping I could use the
>> TaxonID because say I want to BLAST all bacteria or all archea
>
> I'm just doing exactly that, by putting on my entrez_query something like this:
> "txid6945[Organism:noexp]"
>
> I got that string by searching on the taxonomic database and then
> clicking to see all of the sequences of that taxon. I haven't tried to
> use only "txid6945" don't know what is the meaning of
> "[Organism:noexp]", but I can tell you this works.

Where did [Organism:noexp] come from? I guess it tells Entrez not to
expand the organism name or the hierarchy?
I would just use "txid6945[Organism]" or "txid6945[orgn]" which is
shorter and I think clearer. See also this blog post and the EInfo
entry in the tutorial:
http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/

>>> from Bio import Entrez
>>> record = Entrez.read(Entrez.einfo(db="nuccore"))
>>> for field in record["DbInfo"]["FieldList"] :
...     print "%(Name)s, %(FullName)s, %(Description)s" % field
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
PORG, Primary Organism, Scientific and common names of primary organism, and all higher levels of taxonomy

> As a side note on blasting, I think there is a bug on the XML
> generator from NCBI, I getting stuff like this:
>>>> print blast_record.descriptions[0].title
> gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes
> scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, putative
> [Ixodes scapularis]

The NCBI BLAST tools have a strange method of merging redundant entries
into a single entry, which results in these odd identifiers.

Peter

From cmckay at u.washington.edu Thu Sep 24 17:51:31 2009
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 24 Sep 2009 14:51:31 -0700
Subject: [Biopython] get back raw records with SeqIO?
Message-ID: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>

Hello all. Congratulations on the release of 1.52. I'm very pleased to
see the large file index feature included. And even more thrilled to
have more full featured support for writing genbank files with SeqIO.
Thanks!

Are there plans to preserve more information in the
in_genbank --> SeqIO --> out_genbank pipeline? For instance, at the
moment, AUTHORS, COMMENT, etc are lost.

I have a usage question about SeqIO. If I want to get back the raw
records from a file, can I do that with SeqIO? For example, to parse a
genbank file with many records, I do:

genbank_records = GenBank.Iterator(in_file_handle)

Can I use SeqIO similarly somehow? Can I tell it not to parse records?
My way works fine, but I presume that Bio.GenBank is going to be phased
out sometime.

Thanks!
Cedar

From biopython at maubp.freeserve.co.uk Fri Sep 25 05:50:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 10:50:51 +0100
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
Message-ID: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>

On Thu, Sep 24, 2009 at 10:51 PM, Cedar McKay wrote:
> Hello all. Congratulations on the release of 1.52. I'm very pleased to see
> the large file index feature included.
I hoped you would be - our mailing list discussion earlier in the year basically triggered including this in Biopython: http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html Were you able to update your script using the precursor index code to use the new Bio.SeqIO.index function? It should have been a drop in replacement ;) > And even more thrilled to have more full featured support for writing > genbank files with SeqIO. Thanks! I guess you missed that earlier - the GenBank output included features as of Biopython 1.51, but there have been a few tweaks since then. > Are there plans to preserve more information in the in_genbank > --> SeqIO --> out_genbank pipeline? For instance, at the moment, > AUTHORS, COMMENT, etc are lost. Like BioPerl, we are not expecting to offer a 100% round trip, but yes there are some bits (like the references) which still need doing. I haven't have the time or the need to follow up on those fields yet - but I would certainly review a patch if you wanted to work on that. > I have a use question about SeqIO. If I want to get back the raw records > from a file, can I do that with SeqIO? For example, to parse a genbank file > with many records, I do: > > genbank_records = GenBank.Iterator(in_file_handle) > > Can I use SeqIO similarly somehow? Can I tell it not to parse records? No, the SeqIO system does not break up files into chunks of raw text. One good reason for this is that it isn't possible for every file format (e.g. interlaced alignments). For some of the specific file formats it could be done. The mechanics of this is rather similar to what the new indexing code is doing internally (for those file formats where it is possible). Why do you want to do this? I'd like to understand the desired usage. > My way works fine, but I presume that Bio.GenBank is going to be > fazed out sometime. In the long term, perhaps we will phase out Bio.GenBank, but there is nothing planed. 
It currently does both SeqRecord parsing (called by Bio.SeqIO) and also a lower level more GenBank faithful record object. This still has its uses (especially while there is still room for improvement in GenBank output via SeqIO). Peter From biopython at maubp.freeserve.co.uk Fri Sep 25 08:29:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 13:29:22 +0100 Subject: [Biopython] Trimming FASTQ reads, was: [Velvet-users] Read length as a parameter? Message-ID: <320fb6e00909250529g15914649mde9e90683c85b975@mail.gmail.com> Hi all, I meant to forward this earlier, but it looks like I didn't. I've also just posted a related blog post on the topic: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ Peter ---------- Forwarded message ---------- From: Peter Date: Fri, Sep 25, 2009 at 11:46 AM Subject: Re: [Velvet-users] Read length as a parameter? To: Daniel Zerbino Cc: Dan Bolser , velvet-users at ebi.ac.uk Hi Velvet uses & Biopython fans, I've CC'd this to the Biopython list as the examples may be of interest there too. We are talking about scripts to pre-filter sequencing reads before analysis with another tool (in this case, the assembler velvet). The original thread is here: http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000578.html On Fri, Sep 25, 2009 at 10:45 AM, Daniel Zerbino wrote: > > Hello Yunchen and Dan, > > I'm afraid Velvet does not offer either length or k-mer frequency > pre-filtering, although the cov_cutoff is a k-mer frequency post-filtering. > > Given practical considerations, I don't think I can implement this in the > next months. > > However, what can be done is to have a simple script which does the > filtering then pipes them to velvet: > > my_filtering_script.xx my_reads.fa | velveth directory 21 -fasta - ... 
>
> In the case of k-mer frequency filtering, you could imagine preparing a
> pseudo-fastq file which assigns a score to each nucleotide based on the
> frequency of the k-mer ending at that position, then scripting a score
> filter which pipes into Velvet.
>
> As usual, if anyone is willing to put forward such scripts for the other
> users, I will be happy to put it in the package.
>
> Best regards,
>
> Daniel

Was that a challenge? ;) It probably won't be the fastest solution, but
it is very easy to do this with Biopython's SeqIO library.

#!/usr/bin/python
# This is a simple python script using Biopython 1.50+
# to read in FASTA records from stdin, trim to 21 letters,
# and write them to stdout.
import sys
from Bio import SeqIO
records = (rec[:21] for rec in SeqIO.parse(sys.stdin, "fasta"))
SeqIO.write(records, sys.stdout, "fasta")

This isn't (yet) a full script with command line arguments etc. You
could also do this with filenames, but to keep the examples short I'm
using stdin and stdout (not a problem for those happy at the command
line).

Because FASTA files are so simple, it would be fairly trivial to to
write a plain Python script (without using Biopython) which runs faster
than this. However (and this is a sales pitch), just by changing the
format name the above script would also work on other file formats. For
example, with Biopython 1.51+ this would work fine for FASTQ files too.

However, if speed is an issue (often the case with large next gen
sequencing files), then a lower level python script is also possible,
e.g.:

#!/usr/bin/python
# This is a fairly simple python script using Biopython 1.51+
# to read in FASTQ records from stdin, trim to 21 letters,
# and write them to stdout. It does not check the quality
# strings at all, and should therefore work on Sanger,
# Solexa or Illumina 1.3+ FASTQ files equally well.
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(sys.stdin) :
    print "@%s\n%s\n+\n%s" % (title, seq[:21], qual[:21])
    #The print statement will include a trailing newline

Both these examples are just four lines of code (two of which are
imports), pretty neat if I do say so myself ;)

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 10:04:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 15:04:00 +0100
Subject: [Biopython] Correcting short read errors based on k-mer coverage
Message-ID: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>

Hi all,

This email was an offshoot of this thread on the Velvet user's list,
and Dan suggested I could CC the Biopython mailing list. See also:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000581.html

The method Dan describes looks like an interesting computational
challenge, but should be possible in (Bio)python...

Peter

---------- Forwarded message ----------
From: Dan Bolser
Date: Fri, Sep 25, 2009 at 2:39 PM
Subject: Re: [Velvet-users] Read length as a parameter?
To: peter at maubp.freeserve.co.uk
Cc: Daniel Zerbino

2009/9/25 Peter :
> On Fri, Sep 25, 2009 at 11:56 AM, Dan Bolser wrote:
>>
>> Hi Peter,
>>
>> Thanks for the examples.
>>
>> Since your clearly keen to show off the power of bioperl, here *is* a
>> challenge ;-)
>>
>> 1) Construct a k-mer frequency distribution from the set of quality
>> trimmed reads.
>> 2) Correct full length reads by making point mutations that change
>> 'rare' k-mers into 'common' k-mers.
>> 3) Re-trim reads according to kmer frequency after correction.
>>
>> 4) (For extra credit), implement step 2 and 3 but include homo-polymer
>> length variability (indels) in the set of allowed correction
>> operations.
>>
>> I really think 'code jamboree' could be a lot of fun (given the rate
>> of technology change).
>>
>> I'd be seriously impressed at any reasonable 'module' to crack the above!
>
> Hi Dan,
>
> Did you mean to send this off list?
I figured it wasn't really relevant to velvet mailing list, but please feel free to cc biopython. > BioPerl/Biopython jokes aside, right now I don't understand exactly > what you are asking for - although with a little more background reading > I could probably work it out. Presumably all of this is ignoring the > FASTQ quality scores? e.g. it would be fine just to work with FASTA > files? In step (2) you want to edit the reads (giving a new FASTA file)? > What you want in step (3) is unclear. Sorry, I took quite some short cuts in my description. Please see: http://www.ncbi.nlm.nih.gov/pubmed/19056694 Step 1 uses quality to select high quality regions of reads. these reads are broken down into k-mers (say of length 21), and then you construct a k-mer frequency table. i.e. k-mer TATATATATATATATATATAT occurs 5000 times in my read set. Here you need to consider memory usage. In step 2 you take the full reads (ignoring qualities) and look at the k-mer frequency (average?) at each base. Some bases will have a very low k-mer frequency, indicating sequencing errors. Such bases can sometimes be unambiguously 'fixed' (changed to have mean k-mer frequency) making single base substitutions. Finally 'unfixable' reads (by the above definition) can be trimmed. HTH, Dan. From michael.koeris at gmail.com Fri Sep 25 11:25:00 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Fri, 25 Sep 2009 11:25:00 -0400 Subject: [Biopython] SeqIO parser error Message-ID: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com> Hi, I'm just getting acquainted with Sequence Objects and records and so forth. 
I tried some very basic code from the tutorial and I get an error when
I run this:

from Bio import Entrez, SeqIO

gi_list = ['224589821', '224514694', '164698032', '157812089', '157734174']
gi_str = ",".join(gi_list)
handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")

records = SeqIO.parse(handle, "gb")

for record in records:
    print "%s, length %i, with %i features" \
          %(record.name, len(record), len(record.features))

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer, do_features) :
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 381, in feed
    self._feed_misc_lines(consumer, misc_lines)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 1142, in _feed_misc_lines
    consumer.contig_location(contig_location)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 987, in contig_location
    self.location(content)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 689, in location
    self._set_location_info(parse_info, self._cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 797, in _set_location_info
    self._set_function(parse_info, cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 720, in _set_function
    self._set_ordering_info(function, cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 764, in _set_ordering_info
    feature_start = cur_feature.sub_features[0].location.start
AttributeError: 'PositionGap' object has no attribute 'start'

Any help is most appreciated.

Mike

--
Michael S. Koeris
michael.koeris at gmail.com

From biopython at maubp.freeserve.co.uk Fri Sep 25 11:42:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 16:42:39 +0100
Subject: [Biopython] Correcting short read errors based on k-mer coverage
In-Reply-To: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>
References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>
Message-ID: <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com>

Dan Bolser wrote:
> Step 1 uses quality to select high quality regions of reads. these
> reads are broken down into k-mers (say of length 21), and then you
> construct a k-mer frequency table. i.e. k-mer TATATATATATATATATATAT
> occurs 5000 times in my read set. Here you need to consider memory
> usage.

I just tried with a short read file from the NCBI SRA with ~7 million
reads of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had
~100 million k-mers in total, and found ~18 million different k-mers.
About half occurred only once.

My naive code to count the k-mers used a Python dictionary (k-mer
strings as the keys, integer counts as values). It took about 5 minutes
to run and about 1.5 GB of RAM.

What sized files are you hoping to run this on? Without knowing that,
it is hard to say if this simple dictionary approach will scale well.

Dan Bolser wrote:
> In step 2 you take the full reads (ignoring qualities) and look at the
> k-mer frequency (average?) at each base. Some bases will have a very
> low k-mer frequency, indicating sequencing errors.
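For illustration, the dictionary-based k-mer count described in this message might look roughly like the sketch below. This is not the code that was actually run; the function name count_kmers is hypothetical, and reads are assumed to be plain sequence strings (for example the seq values yielded by FastqGeneralIterator).

```python
from collections import defaultdict


def count_kmers(reads, k=21):
    """Count every k-mer in an iterable of read strings with a plain dict.

    Keys are k-mer strings and values are integer counts, as described
    in the thread. (Hypothetical helper, not the original script.)
    """
    counts = defaultdict(int)
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers,
        # e.g. 36 - 21 + 1 = 16 for a 36bp read with k=21.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts
```

The whole table lives in memory, so the RAM needed grows with the number of distinct k-mers rather than with the number of reads.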
Are you suggesting following the method of Chaisson et al 2009, described in section "Detecting and error correcting accurate read prefixes" of that paper - or something a little different? That section itself cites several related approaches to read correction. Peter From biopython at maubp.freeserve.co.uk Fri Sep 25 11:49:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 16:49:01 +0100 Subject: [Biopython] SeqIO parser error In-Reply-To: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com> References: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com> Message-ID: <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com> On Fri, Sep 25, 2009 at 4:25 PM, Michael S. Koeris wrote: > Hi, > > I'm just getting acquainted with Sequence Objects and records and so forth. > I tried some very basic code from the tutorial and I get an error when I run > this: > > from Bio import Entrez, SeqIO > > gi_list = ['224589821', '224514694', '164698032', '157812089', '157734174'] > gi_str = ",".join(gi_list) > handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb") > > records = SeqIO.parse(handle, "gb") > > for record in records: > ? ?print "%s, length %i, with %i features" \ > ? ? ? ? ?%(record.name, len(record), len(record.features)) > > Traceback (most recent call last): > ... > ? ?feature_start = cur_feature.sub_features[0].location.start > AttributeError: 'PositionGap' object has no attribute 'start' > > Any help is most appreciated. Hi Mike, You have found Bug 2745. Do you fancy testing the proposed fix? http://bugzilla.open-bio.org/show_bug.cgi?id=2745 As a workaround, you can ask the NCBI for full GenBank records, not CONTIG records (use rettype="gbwithparts"). However, since these are such large files (whole chromosomes) it might be better to download the whole human genome via FTP instead... 
Peter From biopython at maubp.freeserve.co.uk Fri Sep 25 12:34:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 17:34:34 +0100 Subject: [Biopython] Correcting short read errors based on k-mer coverage In-Reply-To: <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com> References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com> <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com> <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com> Message-ID: <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com> On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser wrote: > > 2009/9/25 Peter : >> >> I just tried with a short read file from the NCBI SRA with ~7 million reads >> of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had in total >> ~100 million kmers in total, and found about ~18 million different kmers. >> About half occurred only once. >> >> My naive code to count the kmers used a Python dictionary (k-mer >> strings as the keys, integer counts as values). It took about 5 minutes >> to run and about 1.5 GB of RAM. >> >> What sized files are you hoping to run this on? Without knowing that, >> it is hard to say if this simple dictionary approach will scale well. > > To warm up I'd want to try 125 million reads of ~50 bp. That might still be possible in RAM... just. Are you aware of any public datasets of that size? An NCBI SRA one for example? > Later I'd want about 100 times more. Right - that will certainly mean holding everything in memory isn't going to be an option! A simple SQLite database might work nicely though. >> Dan Bolser wrote: >>> In step 2 you take the full reads (ignoring qualities) and look at the >>> k-mer frequency (average?) at each base. Some bases will have a very >>> low k-mer frequency, indicating sequencing errors. 
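One way the SQLite idea mentioned above could be sketched, using only the standard library sqlite3 module. The table layout and the function name count_kmers_sqlite are assumptions made for this example, and issuing two statements per k-mer is chosen for simplicity, not speed.

```python
import sqlite3


def count_kmers_sqlite(reads, k=21, db_path=":memory:"):
    """Count k-mers into an SQLite table instead of an in-memory dict.

    Returns the open connection; pass a filename as db_path to keep the
    counts on disk. (Hypothetical sketch, not tuned for large inputs.)
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS kmers "
                 "(kmer TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    for read in reads:
        kmers = [(read[i:i + k],) for i in range(len(read) - k + 1)]
        # Make sure a row exists for each k-mer, then bump its count.
        conn.executemany("INSERT OR IGNORE INTO kmers VALUES (?, 0)", kmers)
        conn.executemany("UPDATE kmers SET n = n + 1 WHERE kmer = ?", kmers)
    conn.commit()
    return conn
```

A query such as "SELECT kmer FROM kmers WHERE n = 1" would then list the k-mers seen only once.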
>> >> Are you suggesting following the method of Chaisson et al 2009, >> described in section "Detecting and error correcting accurate read >> prefixes" of that paper - or something a little different? That section >> itself cites several related approaches to read correction. > > Yeah, I was thinking of the Chasson 2009 method. Since then I had a > couple of other methods brought to my attention on the Velvet mailing > list: > > Efficient frequency-based de novo short-read clustering for error > trimming in next-generation sequencing. > Qu W, Hashimoto S, Morishita S. > Genome Res. 2009 Jul;19(7):1309-15. Epub 2009 May 13. > PMID: 19439514 > http://www.ncbi.nlm.nih.gov/pubmed/19439514 > > SHREC: a short-read error correction method. > Schr?der J, Schr?der H, Puglisi SJ, Sinha R, Schmidt B. > Bioinformatics. 2009 Sep 1;25(17):2157-63. Epub 2009 Jun 19. > PMID: 19542152 > http://www.ncbi.nlm.nih.gov/pubmed/19542152 > > > So the result is looking more and more redundant... However, a python > one liner would be awesome! I doubt a few line python script for the whole task will be forthcoming, although parts of it may be more realistic (e.g. an SQLite based k-mer counter). This sort of thing (k-mer frequency based read correction and trimming) might be of interest to the EMBOSS project, who have expressed an interest in developing new command line tools for next generation sequencing data (e.g. simple quality score read filtering and trimming). Peter From thomas.e.keller at gmail.com Fri Sep 25 23:17:27 2009 From: thomas.e.keller at gmail.com (Thomas Keller) Date: Fri, 25 Sep 2009 22:17:27 -0500 Subject: [Biopython] Nexus.Tree fails to import nexus tree file Message-ID: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com> I apologize if this has been addressed, I looked online and it does not seem to be a general issue. I have several programs that generate nexus files consisting entirely of trees; there is no sequence information. 
Can the Nexus parser not read this type of nexus file? When I try to
open the file with:

from Bio.Nexus import Trees
tree_string=open('Analysis_tree_1a.tre').read()
tree=Trees.Tree(tree_string)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/thomas/Missing data in tree of Life projects/tree_base_dat/S2000/ in ()

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in __init__(self, tree, weight, rooted, name, data, values_are_support, max_support)
     70             # there's discrepancy whether newick allows semicolons et the end
     71             tree=tree.rstrip(';')
---> 72             self._add_subtree(parent_id=root.id,tree=self._parse(tree)[0])
     73
     74     def _parse(self,tree):

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in _parse(self, tree)
     96         else:
     97             closing=tree.rfind(')')
---> 98             val=self._get_values(tree[closing+1:])
     99             if not val:
    100                 val=[None]

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in _get_values(self, text)
    172                 values.append(nodecomment)
    173             else:
--> 174                 values=[float(t) for t in text.split(':') if t.strip()]
    175             return values
    176

ValueError: invalid literal for float(): ;END

The associated nexus file I am trying to read is the following:

#NEXUS
BEGIN TREES;
    TRANSLATE
        1  'Pleurochrysis_pseudoroscoffensis_HAP48',
        2  'Pleurochrysis_placolithoides_HAP59bis',
        3  'Pleurochrysis_carterae_Von_Stosch',
        4  'Pleurochrysis_roscoffensis_HAP32',
        5  'Pleurochrysis_scherffelii_HAP11',
        6  'Pleurochrysis_sp_Langue_du_chat',
        7  'Pleurochrysis_elongata_CCMP874',
        8  'Pleurochrysis_gayraliae_HAP10',
        9  'Hymenomonas_coronata_HAP58bis',
        10 'Pleurochrysis_elongata_HAP79',
        11 'Ochrosphaera_verrucosa_HAP82',
        12 'Pleurochrysis_carterae_HAP1',
        13 'Pleurochrysis_dentata_HAP6',
        14 'Pleurochrysis_sp_MBIC10443',
        15 'Pleurochrysis_sp_MBIC10549',
        16 'Jomonlithus_littoralis_JE5',
        17 'Pleurochrysis_sp_CCMP875',
        18 'Pleurochrysis_sp_CCMP300',
        19 'Gloeothamnion_sp_HAPG'
        ;
    TREE 'Fig._2' = [&R] (11,((((((4,1,8,3,18),(10,7)),(12,(5,19))),6),2),(13,(14,17),15)),(9,16));
END;

Please let me know if/what I am doing wrong. Is the tree nexus file
malformed in some way?

Cheers,
Thomas Keller

From biopython at maubp.freeserve.co.uk Sat Sep 26 06:23:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 26 Sep 2009 11:23:39 +0100
Subject: [Biopython] Nexus.Tree fails to import nexus tree file
In-Reply-To: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>
References: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>
Message-ID: <320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com>

On Sat, Sep 26, 2009 at 4:17 AM, Thomas Keller wrote:
> I apologize if this has been addressed, I looked online and it does
> not seem to be a general issue. I have several programs that generate
> nexus files consisting entirely of trees; there is no sequence
> information. Can the Nexus parser not read this type of nexus file?
> When I try to open the file with:
>
> from Bio.Nexus import Trees
> tree_string=open('Analysis_tree_1a.tre').read()
> tree=Trees.Tree(tree_string)

Use the above code if tree_string is JUST a Newick tree.
In your case, from the example you have a full NEXUS
file, so use the Bio.Nexus.Nexus parser.

Peter

From biopython at maubp.freeserve.co.uk Sat Sep 26 07:41:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 26 Sep 2009 12:41:55 +0100
Subject: [Biopython] SeqIO parser error
In-Reply-To: <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com>
References: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com> <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com>
Message-ID: <320fb6e00909260441s2d463116md2ae093982f955cb@mail.gmail.com>

On Fri, Sep 25, 2009 at 4:49 PM, Peter wrote:
>
> Hi Mike,
>
> You have found Bug 2745. Do you fancy testing the proposed fix?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 > That proposed fix has been checking into git now, so if anyone wants to test it you can grab the latest source code (e.g. via the github download link) and reinstall. See: http://biopython.org/wiki/SourceCode http://github.com/biopython/biopython Or, since this only affects a couple of files it would be possible to update them individually - although this is a bit more fiddly. I would normally only suggest this for Windows users who don't have a suitable C compiler installed. Peter From biopython at maubp.freeserve.co.uk Mon Sep 28 07:10:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 12:10:41 +0100 Subject: [Biopython] Correcting short read errors based on k-mer coverage In-Reply-To: <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com> References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com> <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com> <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com> <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com> Message-ID: <320fb6e00909280410w46499133tb26e63d1529939b@mail.gmail.com> On Fri, Sep 25, 2009 at 5:34 PM, Peter wrote: > On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser wrote: >> >> 2009/9/25 Peter : >>> >>> I just tried with a short read file from the NCBI SRA with ~7 million reads >>> of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had in total >>> ~100 million kmers in total, and found about ~18 million different kmers. >>> About half occurred only once. >>> >>> My naive code to count the kmers used a Python dictionary (k-mer >>> strings as the keys, integer counts as values). It took about 5 minutes >>> to run and about 1.5 GB of RAM. An alternative approach reduced the memory needed for this example from 1.25GB resident to about 0.8GB resident, while still taking about 5 mins. 
Instead of storing the kmers as strings, I encoded them as large integers (basically using 2 bits per letter instead of 8 bits). This means for kmers up to and including 32-mers, you need only a 64bit unsigned long. You can do this in Python, but my initial code was a bit slow - so I redid it as a Python C extension. The only problem here is what to do with ambiguous sequences - for example any N characters? This still used a Python dictionary to hold the (integer) encoded kmer sequences as keys, and their (integer) counts as values. As noted before, there are disk based options here like an SQLite database. Peter From biopython at maubp.freeserve.co.uk Mon Sep 28 09:02:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 14:02:33 +0100 Subject: [Biopython] Bit encoded sequences, was: Correcting short read errors based on k-mer coverage Message-ID: <320fb6e00909280602l17eb165du9194ca1b92e4620@mail.gmail.com> On Mon, Sep 28, 2009 at 1:18 PM, Dan Bolser wrote: > > 2009/9/28 Peter : >> >> An alternative approach reduced the memory needed for this example >> from 1.25GB resident to about 0.8GB resident, while still taking about >> 5 mins. Instead of storing the kmers as strings, I encoded them as large >> integers (basically using 2 bits per letter instead of 8 bits). This means >> for kmers up to and including 32-mers, you need only a 64bit unsigned >> long. You can do this in Python, but my initial code was a bit slow - so >> I redid it as a Python C extension. The only problem here is what to do >> with ambiguous sequences - for example any N characters? > > Sounds like a good solution... does BioPython have any modules for > hiding this kind of compressed sequence representation? i.e. using > some object to represent the string instead of a string, where the > object has this compression 'under the hood'? Not at the moment, no. 
However, a bit encoded Seq subclass for unambiguous DNA or RNA is something I had in mind while looking at encoding kmers. I think BioJava does something like this already. Another reason to look at this is with an eye on the future for Python 3, which makes unicode the default (although byte strings remain, we'll probably want to use them in the Seq object). > I think most people reserve this kind of compression for > 'non-ambiguous' strings only, and cludge the ambiguity codes[1] if > necessary. > > [1] http://droog.gs.washington.edu/parc/images/iupac.html For ambiguous DNA or RNA, you can use four bits for each bp (i.e. can it be an A, C, G, or T - thus you might encode this as 1000 = A, 0100 = C, ..., 1100 = K and 1111 = N). This requires 50% of the memory of the naive one byte for each bp scheme. For unambiguous DNA or RNA you just need two bits, say A = 00, C = 01, G = 10 and T=11. This mapping should make taking the complement very fast via bit flipping. Dealing with the reverse complement requires a little more thought (e.g. byte alignment issues if the sequence is not a multiple of four bp in length). For proteins things are less easy, you'd need at least five bits (2^5 = 32 combinations) which isn't ideal. You can have compression, but the byte boundaries may slow things down. Then there are things like gap characters, stop symbols, mixed case and any other ad-hoc additions. For ambiguous single case DNA/RNA we could do two bp per byte, which in itself may not be worth the hassle. There would be scope for (reverse) complement optimisations, however, if it allowed faster sequence matching things become more interesting... But certainly, for unambiguous single case DNA/RNA we could get four bp per byte, which seems a worthwhile improvement. >> This still used a Python dictionary to hold the (integer) encoded kmer >> sequences as keys, and their (integer) counts as values. As noted >> before, there are disk based options here like an SQLite database. 
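The two-bit scheme described above (A = 00, C = 01, G = 10, T = 11, with the complement obtained by flipping bits) can be sketched in pure Python as below. This is not the C extension mentioned in the thread; the function names are hypothetical, and silently skipping any k-mer that contains an ambiguous letter such as N is just one possible answer to the open question raised earlier.

```python
# Two bits per base; with this mapping A<->T and C<->G are bitwise inverses.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def encode_kmer(kmer):
    """Pack an unambiguous DNA k-mer into an integer (raises KeyError on N)."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def decode_kmer(value, k):
    """Unpack an integer back into a k-mer; k is needed to restore leading A's."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def complement(value, k):
    """Complement by flipping all 2*k bits."""
    return value ^ ((1 << (2 * k)) - 1)


def reverse_complement(value, k):
    """Complement, then reverse the order of the two-bit groups."""
    value = complement(value, k)
    result = 0
    for _ in range(k):
        result = (result << 2) | (value & 0b11)
        value >>= 2
    return result


def count_encoded_kmers(reads, k):
    """Dictionary of integer-encoded k-mer -> count."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            try:
                key = encode_kmer(read[i:i + k])
            except KeyError:
                continue  # assumption: drop k-mers with ambiguous letters
            counts[key] = counts.get(key, 0) + 1
    return counts


print(decode_kmer(reverse_complement(encode_kmer("AAC"), 3), 3))
```

Python integers are arbitrary precision, so this sketch is not limited to 32-mers, but it also gives none of the speed or memory guarantees of a fixed 64 bit unsigned long in C.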
> > Yeah, I was wondering about a Berkeley DB or similar. Berkeley DB is certainly a sensible option to look at. Python 2.x includes DBD wrappers, but sadly this has been dropped from the standard libraries in Python 3.x which is why I had a slighly leaning to trying SQLite first of all. > I wonder if there is any way to do this approximately and still get > good error correction statistics? (I'm thinking about the way BowTie > works using approximate hash matching to pre-filter alignments). I don't know exactly what BowTie does. > Any hints from the two papers? Those that I have looked at are vague about the implementation details, but I may just have not read them carefully enough. Peter From biopython at maubp.freeserve.co.uk Tue Sep 29 07:06:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Sep 2009 12:06:19 +0100 Subject: [Biopython] Deprecating Bio.Prosite and Bio.Enzyme In-Reply-To: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com> References: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com> Message-ID: <320fb6e00909290406w2743caceqbf91e99a242c16d8@mail.gmail.com> On Thu, Aug 20, 2009 at 10:48 AM, Peter wrote: > Hi all, > > Bio.Prosite and Bio.Enzyme were declared obsolete in Release 1.50, > being replaced by Bio.ExPASy.Prosite and Bio.ExPASy.Enzyme, > respectively. > > Are there any objections to deprecating Bio.Prosite and Bio.Enzyme > for the next release? Bio.Prosite and Bio.Enzyme were left as obsolete in Release 1.52, but have now been deprecated for the next release. Peter From lueck at ipk-gatersleben.de Tue Sep 29 08:50:04 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 29 Sep 2009 14:50:04 +0200 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions Message-ID: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> Hi everybody! 
Does someone knows an algorithm to search for sequence similarity by allowing missmatches at certain positions? E.g. Looking in a sequence database for ATGCTCGCGCTCGCTCGCGCA by allowing an missmatch at position [3] and [18]. I can do it via regular expressions but I guess it would be quite slow. Thanks for any hints! Stefanie From chapmanb at 50mail.com Tue Sep 29 09:22:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 29 Sep 2009 09:22:39 -0400 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> Message-ID: <20090929132239.GK29829@sobchak.mgh.harvard.edu> Hi Stefanie; > Does someone knows an algorithm to search for sequence similarity by allowing missmatches at certain positions? > > E.g. > Looking in a sequence database for > > ATGCTCGCGCTCGCTCGCGCA > > by allowing an missmatch at position [3] and [18]. dreg and fuzznuc in EMBOSS both do this: http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/dreg.html http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html The SHRiMP aligner also allows you to specify a seed with defined match and mismatch positions: http://compbio.cs.toronto.edu/shrimp/ See the '-s' option in the README. Hope this helps, Brad From mailinglist.honeypot at gmail.com Tue Sep 29 09:23:29 2009 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Tue, 29 Sep 2009 09:23:29 -0400 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> Message-ID: <3C37F38A-C7D6-4870-B08E-893656E0EB7C@gmail.com> Hi, On Sep 29, 2009, at 8:50 AM, Stefanie L?ck wrote: > Hi everybody! 
> > Does someone knows an algorithm to search for sequence similarity by > allowing missmatches at certain positions? > > E.g. > Looking in a sequence database for > > ATGCTCGCGCTCGCTCGCGCA > > by allowing an missmatch at position [3] and [18]. > > I can do it via regular expressions but I guess it would be quite > slow. You can use bowtie: http://bowtie-bio.sourceforge.net/index.shtml You can't tell it where to allow the mismatch, but you can tell it how many mismatches to allow. The output file is easy to parse, and it also informs you the position of the mismatch, and what nucleotide was changed to what in order to make the match. Pros: Insanely fast aligner. Cons: * You'll have to do a bit of work at the command line. * You need an index file for your "database" of sequences you are searching against (not querying with). There are several provided on the site, otherwise it's also quite easy to make your own (though requires a lot of memory. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From biopython at maubp.freeserve.co.uk Tue Sep 29 09:30:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Sep 2009 14:30:00 +0100 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00909290630xc1b3b76s71be90e78c05a643@mail.gmail.com> On Tue, Sep 29, 2009 at 1:50 PM, Stefanie L?ck wrote: > Hi everybody! > > Does someone knows an algorithm to search for sequence similarity by allowing missmatches at certain positions? > > E.g. > Looking in a sequence database for > > ATGCTCGCGCTCGCTCGCGCA > > by allowing an missmatch at position [3] and [18]. 
> > I can do it via regular expressions but I guess it would be quite slow. When you say "sequence database" do you mean a set of local files (e.g. a big FASTA files), a real database (e.g. BioSQL), or something else like an online database (e.g. GenBank)? I would have suggested you tried regular expressions, because they let you deal with the specific positions where you allow a missmatch. i.e. ATG.TCGCGCTCGCTCGC.CA as a regular expression? You want to look for ATGNTCGCGCTCGCTCGCNCA using IUPAC codes, which I think would work with something like fuzznuc from EMBOSS: http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html Peter From biopython at maubp.freeserve.co.uk Tue Sep 29 09:32:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Sep 2009 14:32:34 +0100 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions In-Reply-To: <20090929132239.GK29829@sobchak.mgh.harvard.edu> References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> <20090929132239.GK29829@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909290632w22e2589dj96062e47c816af61@mail.gmail.com> On Tue, Sep 29, 2009 at 2:22 PM, Brad Chapman wrote: > Hi Stefanie; > >> Does someone knows an algorithm to search for sequence similarity >> by allowing missmatches at certain positions? >> >> E.g. >> Looking in a sequence database for >> >> ATGCTCGCGCTCGCTCGCGCA >> >> by allowing an missmatch at position [3] and [18]. > > dreg and fuzznuc in EMBOSS both do this: > > http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/dreg.html > http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html I didn't know about EMBOSS dreg - that looks rather neat. We should add wrappers for these to Bio.Emboss.Applications ... > The SHRiMP aligner also allows you to specify a seed with defined > match and mismatch positions: > > http://compbio.cs.toronto.edu/shrimp/ > > See the '-s' option in the README. That's a neat trick. 
Peter From lueck at ipk-gatersleben.de Tue Sep 29 10:23:57 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 29 Sep 2009 16:23:57 +0200 Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de> <320fb6e00909290630xc1b3b76s71be90e78c05a643@mail.gmail.com> Message-ID: <00ef01ca4110$7b57e340$1022a8c0@ipkgatersleben.de> I mean big FASTA files. Thanks for all suggestions, I'll have a look on them and decide what to use! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Tuesday, September 29, 2009 3:30 PM Subject: Re: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions On Tue, Sep 29, 2009 at 1:50 PM, Stefanie L?ck wrote: > Hi everybody! > > Does someone knows an algorithm to search for sequence similarity by > allowing missmatches at certain positions? > > E.g. > Looking in a sequence database for > > ATGCTCGCGCTCGCTCGCGCA > > by allowing an missmatch at position [3] and [18]. > > I can do it via regular expressions but I guess it would be quite slow. When you say "sequence database" do you mean a set of local files (e.g. a big FASTA files), a real database (e.g. BioSQL), or something else like an online database (e.g. GenBank)? I would have suggested you tried regular expressions, because they let you deal with the specific positions where you allow a missmatch. i.e. ATG.TCGCGCTCGCTCGC.CA as a regular expression? 
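As a rough illustration of the regular expression idea quoted above, the sketch below turns a query plus a list of 0-based "free" positions into a pattern such as ATG.TCGCGCTCGCTCGC.CA and scans FASTA records with it. The helper names and the tiny FASTA reader are made up for this example (Bio.SeqIO could be used for the parsing instead), and nothing here has been timed on big FASTA files.

```python
import re
from io import StringIO


def build_pattern(query, wildcard_positions):
    """Return a regex string with '.' at each 0-based wildcard position."""
    wild = set(wildcard_positions)
    return "".join("." if i in wild else re.escape(base)
                   for i, base in enumerate(query))


def read_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA file handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if title is not None:
        yield title, "".join(chunks)


def find_hits(handle, query, wildcard_positions):
    """Yield (title, start, matched_text) for every hit, overlaps included."""
    # Wrapping the pattern in a lookahead lets finditer report overlapping hits.
    regex = re.compile("(?=(%s))" % build_pattern(query, wildcard_positions))
    for title, seq in read_fasta(handle):
        for match in regex.finditer(seq):
            yield title, match.start(), match.group(1)


# Toy "database": seq1 differs from the query only at positions 3 and 18.
demo = StringIO(">seq1\nTTATGATCGCGC\nTCGCTCGCTCAGG\n"
                ">seq2\nATGCTCGCGCTCGCTCGCGC\n")
hits = list(find_hits(demo, "ATGCTCGCGCTCGCTCGCGCA", [3, 18]))
print(hits)
```

This only searches the given strand; searching the reverse complement as well would need a second pass.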
You want to look for ATGNTCGCGCTCGCTCGCNCA using IUPAC codes, which I think would work with something like fuzznuc from EMBOSS: http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html Peter From biopython at maubp.freeserve.co.uk Tue Sep 29 14:44:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Sep 2009 19:44:07 +0100 Subject: [Biopython] Nexus.Tree fails to import nexus tree file In-Reply-To: <320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com> References: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com> <320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com> Message-ID: <320fb6e00909291144r4c50d7ddl3892b1e513995a1c@mail.gmail.com> On Sat, Sep 26, 2009 at 11:23 AM, Peter wrote: > On Sat, Sep 26, 2009 at 4:17 AM, Thomas Keller > wrote: >> I apologize if this has been addressed, I looked online and it does >> not seem to be a general issue. I have several programs that generate >> nexus files consisting entirely of trees; there is no sequence >> information. ?Can the Nexus parser not read this type of nexus file? >> When I try to open the a file with: >> >> from Bio.Nexus import Trees >> tree_string=open('Analysis_tree_1a.tre').read() >> tree=Trees.Tree(tree_string) > > Use the above code if tree_string is JUST a Newick tree. > In your case, from the example you have a full NEXUS > file, so use the Bio.Nexus.Nexus parser. Did you get Bio.Nexus to parse the tree for you? Also, would you mind telling us where you got the tree from (what software package) and if we could use it for a test case within Biopthon? Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Sep 29 15:09:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 29 Sep 2009 20:09:38 +0100 Subject: [Biopython] get back raw records with SeqIO? 
In-Reply-To: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com> References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu> <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com> Message-ID: <320fb6e00909291209r65e6c0f6nd682120591ef9a5f@mail.gmail.com> On Fri, Sep 25, 2009 at 10:50 AM, Peter wrote: > On Thu, Sep 24, 2009 at 10:51 PM, Cedar McKay wrote: >> Are there plans to preserve more information in the in_genbank >> --> SeqIO --> out_genbank pipeline? For instance, at the moment, >> AUTHORS, COMMENT, etc are lost. > > Like BioPerl, we are not expecting to offer a 100% round trip, but yes > there are some bits (like the references) which still need doing. I haven't > have the time or the need to follow up on those fields yet - but I would > certainly review a patch if you wanted to work on that. Hi Cedar, I've just added support for writing the COMMENT lines in SeqIO's GenBank output to the repository (which is now using git hosted on github). I'm hoping you'll give this code a quick test... Assuming you are runing Biopython 1.52, you only really need to update the Bio/SeqIO/InsdcIO.py file (e.g. download it via the source code browser) but it might be simpler to just grab the latest source and reinstall. See this wiki page for details: http://biopython.org/wiki/SourceCode Thanks, Peter From cmckay at u.washington.edu Wed Sep 30 19:14:15 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 30 Sep 2009 16:14:15 -0700 Subject: [Biopython] get back raw records with SeqIO? 
In-Reply-To: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com> References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu> <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com> Message-ID: <4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu> > I hoped you would be - our mailing list discussion earlier in the year > basically triggered including this in Biopython: > http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html > > Were you able to update your script using the precursor index code > to use the new Bio.SeqIO.index function? It should have been a drop > in replacement ;) My head isn't at that code at the moment, but I'll try to give it a whirl next week. > Why do you want to do this? I'd like to understand the desired usage. I didn't have a specific technical reason. It just seemed like everything was going towards using SeqIO and things like Bio.Fasta were being deprecated, so I wanted to get ahead of the curve there. But if Bio.Genbank is going to be around for a long time, I don't have any problem with doing it that way. Thanks again. C From italo.maia at gmail.com Tue Sep 1 05:22:14 2009 From: italo.maia at gmail.com (Italo Maia) Date: Tue, 1 Sep 2009 02:22:14 -0300 Subject: [Biopython] Phylogenetic trees with biopython? Message-ID: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com> Is it possible to create phylogenetic trees with biopython alone or i'll have to "phylip things up" a little? Phylip doesn't seem to allow execution with options, as blast does, per example, and that really botters me. : / -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB e Desktop Programador Java, Python Ubuntu User For Life! 
----------------------------------------------------- http://www.italomaia.com/ http://twitter.com/italomaia/ http://eusouolobomal.blogspot.com/ =========================== From stran104 at chapman.edu Tue Sep 1 06:09:23 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 31 Aug 2009 23:09:23 -0700 Subject: [Biopython] Phylogenetic trees with biopython? In-Reply-To: <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com> References: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com> <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com> Message-ID: <2a63cc350908312309g59b6cc17i5fb67625acb97fe@mail.gmail.com> As far as I know (which doesn't say much) Biopython does not wrap the Phylip programs. However, you can achieve this through some fairly simple scripting. Phylip allows for options to be specified in command files. Informally, these command files consists of the same keystrokes you would enter when running a Phylip program. A command file to run the program with the default options would look like: Y\n This corresponds to pressing Y to accept the default options and pressing enter for a line break. You can specify any other options here as well. You can also specify the name of your "infile" (sequence file) on the first line of the command file. Then to run phylip with a command file you might execute something like: phylip protpars < command_file_name Then to wrap this in Python just programatically generate your command files and use the os.system command to execute phylip. (e.g. os.system('phylip kitsch < protpars_commands') I hope this helps and good luck. On Mon, Aug 31, 2009 at 10:22 PM, Italo Maia wrote: > Is it possible to create phylogenetic trees with biopython alone or i'll > have to "phylip things up" a little? Phylip doesn't seem to allow execution > with options, as blast does, per example, and that really botters me. : / > > -- > "A arrog?ncia ? a arma dos fracos." 
> > =========================== > Italo Moreira Campelo Maia > Ci?ncia da Computa??o - UECE > Desenvolvedor WEB e Desktop > Programador Java, Python > Ubuntu User For Life! > ----------------------------------------------------- > http://www.italomaia.com/ > http://twitter.com/italomaia/ > http://eusouolobomal.blogspot.com/ > =========================== > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Matthew Strand stran104 at chapman.edu From stran104 at chapman.edu Tue Sep 1 06:12:26 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 31 Aug 2009 23:12:26 -0700 Subject: [Biopython] Phylogenetic trees with biopython? In-Reply-To: <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com> References: <800166920908312222j185d305ahfe374efc0a7edb1e@mail.gmail.com> <2a63cc350908312308m7b5e8644kc83b1ea3765e47e7@mail.gmail.com> Message-ID: <2a63cc350908312312w2f51b475w7e4ec7101d80572@mail.gmail.com> Shoot, I've fudged things up a bit and created a duplicate thread. Here is my original message, I don't know if someone may be able to delete the other thread, but hopefully so. On Mon, Aug 31, 2009 at 11:08 PM, Matthew Strand wrote: > As far as I know (which doesn't say much) Biopython does not wrap the > Phylip programs. However, you can achieve this through some fairly simple > scripting. Phylip allows for options to be specified in command files. > > Informally, these command files consists of the same keystrokes you would > enter when running a Phylip program. > > A command file to run the program with the default options would look like: > Y\n > This corresponds to pressing Y to accept the default options and pressing > enter for a line break. You can specify any other options here as well. You > can also specify the name of your "infile" (sequence file) on the first line > of the command file. 
> > Then to run phylip with a command file you might execute something like: > phylip protpars < command_file_name > > Then to wrap this in Python just programatically generate your command > files and use the os.system command to execute phylip. > (e.g. os.system('phylip kitsch < protpars_commands') > > I hope this helps and good luck. > > On Mon, Aug 31, 2009 at 10:22 PM, Italo Maia wrote: > >> Is it possible to create phylogenetic trees with biopython alone or i'll >> have to "phylip things up" a little? Phylip doesn't seem to allow >> execution >> with options, as blast does, per example, and that really botters me. : / >> >> -- >> "A arrog?ncia ? a arma dos fracos." >> >> =========================== >> Italo Moreira Campelo Maia >> Ci?ncia da Computa??o - UECE >> Desenvolvedor WEB e Desktop >> Programador Java, Python >> Ubuntu User For Life! >> ----------------------------------------------------- >> http://www.italomaia.com/ >> http://twitter.com/italomaia/ >> http://eusouolobomal.blogspot.com/ >> =========================== >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Matthew Strand > stran104 at chapman.edu > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From winda002 at student.otago.ac.nz Tue Sep 1 06:17:04 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 01 Sep 2009 18:17:04 +1200 Subject: [Biopython] Phylogenetic trees with biopython? Message-ID: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> > Is it possible to create phylogenetic trees with biopython alone or i'll > have to "phylip things up" a little? Phylip doesn't seem to allow execution > with options, as blast does, per example, and that really botters me. : Hi Italo, It depends on what exactly you want to do. 
If you want to run phylip programs as part of a biopython script then there are classes in Emboss.Applications for building up command lines to call the EMBOSS versions of enough of the phylip packages to make bootstrapped distance or parsimony trees. That would mean installing EMBOSS if you don't already have it but it makes automating phylip much, much easier. Those should let you define all the relevant arguments (if they don't it's easy to add them, so shout out) but they are for the 'old' versions of phylip, I'm sure you can still get the EMBOSS versions of these from their site but there are also 'new' versions which are meant to be a little faster but take different arguments (so the existing classes won't help you). I put up a branch on github which has classes for the new versions as well as for PhyML in Bio.Phylo here: http://github.com/dwinter/biopython/tree/phylo Hope that helps you out, David From biopython at maubp.freeserve.co.uk Tue Sep 1 09:14:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 10:14:31 +0100 Subject: [Biopython] IDLE problem In-Reply-To: <4A9C60B3.4040605@rockefeller.edu> References: <4A9C60B3.4040605@rockefeller.edu> Message-ID: <320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com> On Tue, Sep 1, 2009 at 12:45 AM, xiaoa wrote: > Hi, > > I am new to python and biopython. I ran into a problem when using > Entrez.esearch and efetch. ?My script worked fine when I used python 2.6.2 > command line (console), but it returned an empty line when I ran it in IDLE. > ?IDLE seems to be working, because I tested with 1. another python script > (no Entrez modules) and 2. even Entrez.einfo--worked fine. ?I am using > Windows Vista, 64-bit and Biopython 1.51 and Python 2.6.2. > Thanks in advance, > > Andrew Hi Andrew, Due to occasional network issues, Entrez scripts are not always 100% reproducible. Perhaps the NCBI was under very high load at the time? 
It is difficult to say any more without knowing what your script does.

Also, I see you say you are using 64-bit Windows Vista with Biopython 1.51 and Python 2.6.2 - how did you install Biopython? I've never used Vista, and we don't provide 64-bit installers.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 09:21:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 10:21:54 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>

On Tue, Sep 1, 2009 at 7:17 AM, David Winter wrote:
>
>> Is it possible to create phylogenetic trees with Biopython alone, or
>> will I have to "phylip things up" a little? Phylip doesn't seem to
>> allow execution with options, as blast does, for example, and that
>> really bothers me. : /
>
> Hi Italo,
>
> It depends on what exactly you want to do. If you want to run PHYLIP
> programs as part of a Biopython script then there are classes in
> Emboss.Applications for building up command lines to call the EMBOSS
> versions of enough of the PHYLIP packages to make bootstrapped distance
> or parsimony trees. That would mean installing EMBOSS if you don't
> already have it but it makes automating PHYLIP much, much easier.

Yes - we definitely recommend the EMBOSS versions of the PHYLIP tools because they support command line arguments, and the originals don't.

> Those should let you define all the relevant arguments (if they don't
> it's easy to add them, so shout out) but they are for the 'old' versions
> of PHYLIP. I'm sure you can still get the EMBOSS versions of these from
> their site but there are also 'new' versions which are meant to be a
> little faster but take different arguments (so the existing classes
> won't help you). I put up a branch on GitHub which has classes for the
> new versions as well as for PhyML in Bio.Phylo here:
> http://github.com/dwinter/biopython/tree/phylo

Yes, as David points out, Bio.Emboss.Applications has wrappers for the "old" versions from PHYLIP 3.572 (whose EMBOSS names start with e):
http://emboss.sourceforge.net/apps/release/6.1/embassy/phylip/

We should add the "new" versions from PHYLIP 3.6 (whose EMBOSS names start with f):
http://emboss.sourceforge.net/apps/release/6.1/embassy/phylipnew/

David - I would prefer we also put your new wrappers in Bio.Emboss.Applications, and would be happy to look at adding those to CVS now that Biopython 1.51 is out (I had forgotten about them actually - so thanks for the reminder).

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 14:00:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 15:00:32 +0100
Subject: [Biopython] IDLE problem
In-Reply-To: <4A9D269E.6080601@rockefeller.edu>
References: <4A9C60B3.4040605@rockefeller.edu> <320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com> <4A9D269E.6080601@rockefeller.edu>
Message-ID: <320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>

Hi Andrew,

Please keep the mailing list CC'd on replies.

On Tue, Sep 1, 2009 at 2:50 PM, xiaoa wrote:
> Hi Peter,
>
> I forgot to mention that although my OS is 64-bit, I installed 32-bit
> Python 2.6.2 because the IDLE for 64-bit Python 2.6.2 doesn't work in
> Vista. So everything is 32-bit. It seems odd that everything works fine
> on the command line but not in IDLE.
>
> Andrew

I see - if you are using the 32-bit version of Python, then our Windows installer might work. You may be the first person to report trying this on Windows Vista...

Right now I am not sure what could be going wrong with Entrez. If you can show us your script (ideally a cut-down example that shows the problem) we may be able to help.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 17:01:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 18:01:20 +0100
Subject: [Biopython] Removing deprecated module Bio.EUtils
Message-ID: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com>

Hi all,

The Bio.Entrez module has long been our preferred interface to the NCBI Entrez Utilities. It replaced the old Bio.EUtils module, which was officially deprecated in Biopython 1.48, released a year ago (Sept 2008). In line with our deprecation policy, I plan to remove Bio.EUtils in the next release. Are there any objections?

If anyone is still using the Bio.EUtils module in old code, please feel free to ask for tips on porting this to Bio.Entrez.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 17:05:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 18:05:27 +0100
Subject: [Biopython] Removing deprecated BLAST HTML parser
Message-ID: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com>

Hi all,

The old HTML BLAST parser in Bio.Blast.NCBIWWW was deprecated a year ago in Biopython 1.48, and in line with our deprecation policy I would like to remove it in the next release. Are there any objections?

The preferred BLAST output for parsing (as recommended by the NCBI themselves) is XML. We also have a parser for the plain text output, but this is not updated very frequently, and the NCBI have a history of making minor changes to the layout and breaking parsers.
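
As a rough illustration of why XML is the more robust format to parse, here is a minimal sketch that uses only the Python standard library rather than Biopython. The XML fragment is hand-written for illustration (the hit IDs and e-values are made up, not real BLAST results), and the helper name best_hits and its max_evalue cut-off are assumptions; only the element names (Hit, Hit_id, Hsp_evalue) follow the NCBI BLAST XML layout. In real scripts Bio.Blast.NCBIXML would normally do this work.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; these are not real BLAST results.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|1</Hit_id>
          <Hit_def>made-up hit one</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>example|2</Hit_id>
          <Hit_def>made-up hit two</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def best_hits(xml_text, max_evalue=1e-5):
    """Return (hit id, best e-value) pairs for hits at or below max_evalue.

    best_hits is a hypothetical helper for this sketch, not a Biopython function.
    """
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("Hit"):
        # Each hit may contain several HSPs; keep the smallest e-value.
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        if evalues and min(evalues) <= max_evalue:
            results.append((hit.findtext("Hit_id"), min(evalues)))
    return results


if __name__ == "__main__":
    for hit_id, evalue in best_hits(SAMPLE_XML):
        print(hit_id, evalue)
```

Because the fields are looked up by element name, small changes to spacing or column layout of the kind that break a plain text parser would not affect this approach.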

Peter

From winda002 at student.otago.ac.nz Tue Sep 1 22:38:04 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 10:38:04 +1200
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
Message-ID: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>

> David - I would prefer we also put your new wrappers in
> Bio.Emboss.Applications, and would be happy to look at adding
> those to CVS now that Biopython 1.51 is out (I had forgotten
> about them actually - so thanks for the reminder).
>
> Peter

Hi Peter,

I'd almost forgotten about them myself! I only put them in their own module because I had the PhyML wrapper as well, and that's not an EMBOSS application.

I suspect a wrapper for PhyML is probably not going to be widely useful (a normal run lasts at least several hours, and most people will want to look over their alignments by eye before they set it off). So I'll move the PHYLIP ones into Bio.Emboss.Applications, gather a few thoughts about other phylogenetic software including PhyML, and see what the dev list thinks about them.

david

From winda002 at student.otago.ac.nz Tue Sep 1 22:25:49 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 10:25:49 +1200
Subject: [Biopython] IDLE problem
In-Reply-To: <320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>
References: <4A9C60B3.4040605@rockefeller.edu> <320fb6e00909010214o6230851ckb120f3099a9c24c4@mail.gmail.com> <4A9D269E.6080601@rockefeller.edu> <320fb6e00909010700racbfe0fxe5284dae8bdd4446@mail.gmail.com>
Message-ID: <20090902102549.964215i7y6cuyru5@www.studentmail.otago.ac.nz>

>> I forgot to mention that although my OS is 64-bit, I installed 32-bit
>> Python 2.6.2 because the IDLE for 64-bit Python 2.6.2 doesn't work in
>> Vista. So everything is 32-bit. It seems odd that everything works
>> fine on the command line but not in IDLE.
>>
>> Andrew

Hi Andrew,

You don't connect to the internet via a proxy, do you? I've just been playing with esearch() in IDLE vs the command line in Vista and found that both worked fine with a direct connection, but when the system-wide internet options are set to a proxy, IDLE hangs at the point at which the command line asks for a username/password.

As I say, otherwise everything works fine for me, so if it's not that then I'm no help.

Cheers,
David

From italo.maia at gmail.com Wed Sep 2 02:37:05 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Tue, 1 Sep 2009 23:37:05 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
Message-ID: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>

Thank you everyone for your answers. I was about to give up and run my own wrappers over phylip (emboss for my Ubuntu is too old : /) when I just found out that clustal can create phylogenetic trees too. On the command line, with the options 4, 1 and 4, I just made a tree here, from a .aln file generated with clustalw. Does anyone dislike this approach? It seems easy/fast/efficient enough for me.

ps: I just made a simple frontend for blast, formatdb, clustalw and muscle. Right now I'm going to add phylogenetic trees, and then I'm finished. It's my graduation thesis, by the way.

--
"Arrogance is the weapon of the weak."

===========================
Italo Moreira Campelo Maia
Computer Science - UECE
Web and Desktop Developer
Java, Python Programmer
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From nuin at genedrift.org Wed Sep 2 02:46:48 2009
From: nuin at genedrift.org (Paulo Nuin)
Date: Tue, 1 Sep 2009 22:46:48 -0400
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
Message-ID: <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>

ClustalW trees are extremely "simple" and can only be generated with Neighbour Joining. Also, they are not based on the final sequence alignment created by the program; they are built as a guide for the alignment itself.
They have a huge probability of being "wrong", or of not representing the actual relationships. That will depend heavily on the type, distance and differences among the sequences you are using.

What do you need the trees for?

Paulo

From italo.maia at gmail.com Wed Sep 2 03:17:09 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 00:17:09 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com> <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org>
Message-ID: <800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>

I want to use them to study evolution, and maybe to guide some protein modelling. Would you suggest a better use?

From nuin at genedrift.org Wed Sep 2 03:38:51 2009
From: nuin at genedrift.org (Paulo Nuin)
Date: Tue, 1 Sep 2009 23:38:51 -0400
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com> <365E2A2E-C0B3-4249-B708-4409C097B581@genedrift.org> <800166920909012017o68216e1cxe72ec52ba2babac3@mail.gmail.com>
Message-ID: <9C71C598-20E1-424A-8476-30CB661F450F@genedrift.org>

So, ClustalW trees won't even scratch the surface. You should go the PHYLIP/EMBOSS route; I don't see another alternative while using Biopython. Otherwise you will have to create your own wrappers for some command line applications, like MrBayes, TreePuzzle, etc.

Paulo

From winda002 at student.otago.ac.nz Wed Sep 2 04:35:21 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 02 Sep 2009 16:35:21 +1200
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
Message-ID: <4A9DF609.5050906@student.otago.ac.nz>

> Thank you everyone for your answers.
> I was about to give up and run my own wrappers over phylip (emboss for
> my Ubuntu is too old : /)

It shouldn't matter how old your EMBOSS is - I think there are PHYLIP versions for every release of EMBOSS here:
ftp://emboss.open-bio.org/pub/EMBOSS/old/

EMBOSS doesn't come with the PHYLIP programs by default; you need to download them independently. There are no binaries for Ubuntu but they're very easy to compile (if I can do it...) - you do need the EMBOSS sources though.

> when I just found out that clustal can create phylogenetic trees too.
> On the command line, with the options 4, 1 and 4, I just made a tree
> here, from a .aln file generated with clustalw. Does anyone dislike
> this approach? It seems easy/fast/efficient enough for me.

As Paulo says, this might be easy/fast/efficient but there is no promise it will be accurate/powerful/useful ;). If you want to do it with existing Biopython tools then I think PHYLIP is probably going to be the way to go.

You might also want to look at PyCogent, which has controllers for some other phylogeny programs:
http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html

(I have no experience using those, so can't tell you much about them)

Cheers,
David

From italo.maia at gmail.com Wed Sep 2 22:13:45 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:13:45 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <4A9DF609.5050906@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com> <4A9DF609.5050906@student.otago.ac.nz>
Message-ID: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>

I'll try PyCogent too but, for now, I'll stay with clustalw, since it is the fastest way for me to get it done.
Anyway, the "clustering" option for ClustalCommandline seems to be buggy. In the _Option parameter, equate should be set to True, in order for it to work. 2009/9/2 David Winter > > Thank you everyone for your answers. I was about to give up and run my own >> wrappers over phylip(emboss for mey ubuntu is too old : /) >> > It shouldn't matter how old your emboss is - I think there are phylip > versions for every release of EMBOSS here: > ftp://emboss.open-bio.org/pub/EMBOSS/old/ > > EMBOSS doesn't come with the phylip programs by default, you need to > download them independently. There are no binaries for ubuntu but they're > very easy to compile (if I can do it...) - you do need the EMBOSS sources > though. > >> when i just found out that clustal can create phylogenetic trees too. On >> commandline, with the options 4,1 and 4, i just made a tree here, from a >> .aln file generated with clustalw. Does anyone dislike this approach? Seems >> like easy/fast/efficient enough for me. >> >> As Paulo says this might be easy/fast/efficient but there is no promise > it will be accurate/powerful/useful ;). If you want to do it with existing > biopython tools then I think phylip is probably going to be the way to go. > > You might also want to look at PyCogent which has controllers for some > other phylogeny programs: > http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html > > (I have no experience using those, so can't tell much about them) > > Cheers, > David > -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB e Desktop Programador Java, Python Ubuntu User For Life! 
----------------------------------------------------- http://www.italomaia.com/ http://twitter.com/italomaia/ http://eusouolobomal.blogspot.com/ =========================== From italo.maia at gmail.com Wed Sep 2 22:15:04 2009 From: italo.maia at gmail.com (Italo Maia) Date: Wed, 2 Sep 2009 19:15:04 -0300 Subject: [Biopython] Phylogenetic trees with biopython? In-Reply-To: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com> <4A9DF609.5050906@student.otago.ac.nz> <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com> Message-ID: <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com> ps: is there a work around, for this "wannabe" bug? 2009/9/2 Italo Maia > I'll try pycongent too but, for now, i'll leave it with clustalw for being > the fastest way for me to get it done. Anyway, the "clustering" option for > ClustalCommandline seems to be buggy. In the _Option parameter, equate > should be set to True, in order for it to work. > > 2009/9/2 David Winter > > >> Thank you everyone for your answers. I was about to give up and run my >>> own wrappers over phylip(emboss for mey ubuntu is too old : /) >>> >> It shouldn't matter how old your emboss is - I think there are phylip >> versions for every release of EMBOSS here: >> ftp://emboss.open-bio.org/pub/EMBOSS/old/ >> >> EMBOSS doesn't come with the phylip programs by default, you need to >> download them independently. There are no binaries for ubuntu but they're >> very easy to compile (if I can do it...) - you do need the EMBOSS sources >> though. >> >>> when i just found out that clustal can create phylogenetic trees too. On >>> commandline, with the options 4,1 and 4, i just made a tree here, from a >>> .aln file generated with clustalw. 
Does anyone dislike this approach? Seems >>> like easy/fast/efficient enough for me. >>> >>> As Paulo says this might be easy/fast/efficient but there is no promise >> it will be accurate/powerful/useful ;). If you want to do it with existing >> biopython tools then I think phylip is probably going to be the way to go. >> >> You might also want to look at PyCogent which has controllers for some >> other phylogeny programs: >> http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html >> >> (I have no experience using those, so can't tell much about them) >> >> Cheers, >> David >> > > > > -- > "A arrog?ncia ? a arma dos fracos." > > =========================== > Italo Moreira Campelo Maia > Ci?ncia da Computa??o - UECE > Desenvolvedor WEB e Desktop > Programador Java, Python > Ubuntu User For Life! > ----------------------------------------------------- > http://www.italomaia.com/ > http://twitter.com/italomaia/ > http://eusouolobomal.blogspot.com/ > =========================== > -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB e Desktop Programador Java, Python Ubuntu User For Life! ----------------------------------------------------- http://www.italomaia.com/ http://twitter.com/italomaia/ http://eusouolobomal.blogspot.com/ =========================== From italo.maia at gmail.com Wed Sep 2 22:51:49 2009 From: italo.maia at gmail.com (Italo Maia) Date: Wed, 2 Sep 2009 19:51:49 -0300 Subject: [Biopython] Phylogenetic trees with biopython? 
In-Reply-To: <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com> <4A9DF609.5050906@student.otago.ac.nz> <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com> <800166920909021515r106e08a1lf158c18f3d541079@mail.gmail.com> Message-ID: <800166920909021551r4aefa301ib6498997f4f45b7a@mail.gmail.com> Solved with a terrible ugly forloop. *for* par *in* self.cline.parameters: names = getattr(par, 'names', None) *if* names *is* not None: *if* "-clustering" *in* names: par.equate=True And clustalw can generate trees in phylip format, which is good news for me! Thank you guys! When i grab my hands in pycongent, i'll post something. And, by the way, emboss didn't seem to work fine in my ubuntu, even tough the package is in the repository. If i write "emboss" in the console, it won't work. @.o 2009/9/2 Italo Maia > ps: is there a work around, for this "wannabe" bug? > > 2009/9/2 Italo Maia > > I'll try pycongent too but, for now, i'll leave it with clustalw for being >> the fastest way for me to get it done. Anyway, the "clustering" option for >> ClustalCommandline seems to be buggy. In the _Option parameter, equate >> should be set to True, in order for it to work. >> >> 2009/9/2 David Winter >> >> >>> Thank you everyone for your answers. I was about to give up and run my >>>> own wrappers over phylip(emboss for mey ubuntu is too old : /) >>>> >>> It shouldn't matter how old your emboss is - I think there are phylip >>> versions for every release of EMBOSS here: >>> ftp://emboss.open-bio.org/pub/EMBOSS/old/ >>> >>> EMBOSS doesn't come with the phylip programs by default, you need to >>> download them independently. There are no binaries for ubuntu but they're >>> very easy to compile (if I can do it...) 
- you do need the EMBOSS sources >>> though. >>> >>>> when i just found out that clustal can create phylogenetic trees too. On >>>> commandline, with the options 4,1 and 4, i just made a tree here, from a >>>> .aln file generated with clustalw. Does anyone dislike this approach? Seems >>>> like easy/fast/efficient enough for me. >>>> >>>> As Paulo says this might be easy/fast/efficient but there is no promise >>> it will be accurate/powerful/useful ;). If you want to do it with existing >>> biopython tools then I think phylip is probably going to be the way to go. >>> >>> You might also want to look at PyCogent which has controllers for some >>> other phylogeny programs: >>> http://pycogent.sourceforge.net/examples/phylogeny_app_controllers.html >>> >>> (I have no experience using those, so can't tell much about them) >>> >>> Cheers, >>> David >>> >> >> >> >> -- >> "A arrog?ncia ? a arma dos fracos." >> >> =========================== >> Italo Moreira Campelo Maia >> Ci?ncia da Computa??o - UECE >> Desenvolvedor WEB e Desktop >> Programador Java, Python >> Ubuntu User For Life! >> ----------------------------------------------------- >> http://www.italomaia.com/ >> http://twitter.com/italomaia/ >> http://eusouolobomal.blogspot.com/ >> =========================== >> > > > > -- > "A arrog?ncia ? a arma dos fracos." > > =========================== > Italo Moreira Campelo Maia > Ci?ncia da Computa??o - UECE > Desenvolvedor WEB e Desktop > Programador Java, Python > Ubuntu User For Life! > ----------------------------------------------------- > http://www.italomaia.com/ > http://twitter.com/italomaia/ > http://eusouolobomal.blogspot.com/ > =========================== > -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB e Desktop Programador Java, Python Ubuntu User For Life! 
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Wed Sep 2 22:54:14 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 2 Sep 2009 19:54:14 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
Message-ID: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>

Is there any recipe to plot a phylip phylogenetic tree?

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Thu Sep 3 09:32:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 10:32:16 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
 <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
 <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
 <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
 <4A9DF609.5050906@student.otago.ac.nz>
 <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
Message-ID: <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>

On Wed, Sep 2, 2009 at 11:13 PM, Italo Maia wrote:
> I'll try pycongent too but, for now, i'll leave it with clustalw for being
> the fastest way for me to get it done. Anyway, the "clustering" option for
> ClustalCommandline seems to be buggy. In the _Option parameter, equate
> should be set to True, in order for it to work.
Could you show us a command line string that works, and a command line
string that doesn't?

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 3 09:34:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 10:34:16 +0100
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
Message-ID: <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>

On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
> Is there any recepie to plot a phylip phylogenetic tree?

If you mean as a simple text representation, then sort of. Try the
print methods of the Bio.Nexus tree objects.

Do you mean drawing a nice diagram (e.g. as a PDF or PNG file)? If
so, no, not at the moment. Some of the Google Summer of Code
project work included linking to NetworkX for graphics... this has
not yet been merged into Biopython.

Peter

From italo.maia at gmail.com Thu Sep 3 12:20:43 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Thu, 3 Sep 2009 09:20:43 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
 <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
Message-ID: <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>

Yep, I meant the pretty PNG way. I was thinking of something with
python-imaging, maybe. Anyway, thanks Peter. If it does not exist, I won't
waste time looking for it.

2009/9/3 Peter

> On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
> > Is there any recepie to plot a phylip phylogenetic tree?
>
> If you mean as a simple text representation, then sort of. Try the
> Bio.Nexus tree objects print methods.
>
> Do you meant draw a nice diagram (e.g. as a PDF or PNG file)? If
> so, no, not at the moment.
Some of the Google Summer of Code
> project work including linking to NetworkX for graphics... this has
> not yet been merged into Biopython.
>
> Peter
>

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Thu Sep 3 12:22:17 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Thu, 3 Sep 2009 09:22:17 -0300
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
 <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
 <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
 <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
 <4A9DF609.5050906@student.otago.ac.nz>
 <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
 <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
Message-ID: <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>

Not working:
/usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering NJ

Working:
/usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering=NJ

2009/9/3 Peter

> On Wed, Sep 2, 2009 at 11:13 PM, Italo Maia wrote:
> > I'll try pycongent too but, for now, i'll leave it with clustalw for being
> > the fastest way for me to get it done. Anyway, the "clustering" option for
> > ClustalCommandline seems to be buggy. In the _Option parameter, equate
> > should be set to True, in order for it to work.
>
> Could you show us a command line string that works, and a command line
> string that doesn't?
>
> Peter
>

--
"A arrogância é
a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Thu Sep 3 12:38:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 13:38:47 +0100
Subject: [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
 <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
 <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
 <800166920909011937v6ca5f878y551f45cc4c135f50@mail.gmail.com>
 <4A9DF609.5050906@student.otago.ac.nz>
 <800166920909021513l33b0f4edrebd918dece164065@mail.gmail.com>
 <320fb6e00909030232s49f5a2e8vb3dda5b1261fadd5@mail.gmail.com>
 <800166920909030522m1d9a49adn90b93f380eed6cc@mail.gmail.com>
Message-ID: <320fb6e00909030538m1daa8279x5254d0dac6f00832@mail.gmail.com>

On Thu, Sep 3, 2009 at 1:22 PM, Italo Maia wrote:
> Not working
> /usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering NJ
>
> Working
> /usr/bin/clustalw -infile=/home/myuser/Temp/coi.dnd -tree -outputtree=PHYLIP -tossgaps -clustering=NJ

OK, yes. Fixed in CVS. I also made the boot labels argument use an
equals (by eye all the rest looked fine). Could you confirm that works?
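
The only difference between the two command lines above is whether the option and its value are joined by an equals sign. As a minimal sketch of that rule in plain Python, independent of Biopython: `build_clustalw_args` is a hypothetical helper written for this illustration (it is not Biopython's wrapper and not part of clustalw), and the option names are just the ones quoted above.

```python
def build_clustalw_args(exe, infile, switches=(), options=None):
    """Return an argument list for a clustalw call.

    Hypothetical helper for illustration only. Switches are bare flags
    (-tree); valued options are always written as -name=value, which is
    the form reported to work in the message above.
    """
    args = [exe, "-infile=%s" % infile]
    # Bare flags such as -tree or -tossgaps take no value.
    args.extend("-%s" % name for name in switches)
    # Valued options are joined with "=", never passed as two words.
    for name, value in (options or {}).items():
        args.append("-%s=%s" % (name, value))
    return args


args = build_clustalw_args(
    "/usr/bin/clustalw",
    "/home/myuser/Temp/coi.dnd",
    switches=["tree", "tossgaps"],
    options={"outputtree": "PHYLIP", "clustering": "NJ"},
)
print(" ".join(args))
```

The resulting list could be handed to `subprocess.run(args)` on a machine where clustalw is installed; that step is left out here because it depends on the external program.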
Thanks

Peter

From jjkk73 at gmail.com Thu Sep 3 16:06:19 2009
From: jjkk73 at gmail.com (jorma kala)
Date: Thu, 3 Sep 2009 17:06:19 +0100
Subject: [Biopython] Question about efetch output format
Message-ID:

Hi,
I'm trying to retrieve a record from protein database (I found the record id
by running Entrez.esearch)

    handle = Entrez.efetch(db="protein", id='483329',mode='xml')
    print handle.read()

Although I specify xml mode, the result comes in a quite confusing format
using braces (I've pasted a snippet at the end of the email)

Do you know what I should do to get it in xml?
Many thanks

+++++++ Output

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Aspergillus flavus aflatoxin (aflR) gene, and translated products" ,
    source {
      org {
        taxname "Aspergillus flavus" ,
        db {
          {
            db "taxon" ,
            tag
              id 5059 } } ,
        orgname {
          name
            binomial {

From biopython at maubp.freeserve.co.uk Thu Sep 3 16:38:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Sep 2009 17:38:57 +0100
Subject: [Biopython] Question about efetch output format
In-Reply-To:
References:
Message-ID: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>

On Thu, Sep 3, 2009 at 5:06 PM, jorma kala wrote:
> Hi,
> I'm trying to retrieve a record from protein database (I found the record id
> by running Entrez.esearch)
>
>    handle = Entrez.efetch(db="protein", id='483329',mode='xml')
>    print handle.read()
> Although I specify xml mode, the result comes in a quite confusing format
> using braces (I've pasted a snippet at the end of the email)
>
> Do you know what I should do to get it in xml?
> Many thanks

You've got the default ASN.1 output. You need to use "retmode" not "mode",

from Bio import Entrez
handle = Entrez.efetch(db="protein", id='483329',retmode='xml')
print handle.read()

I thought both the Biopython documentation and the NCBI documentation
was clear on this - maybe you found a typo? Please let us know if there is
an error in any of the documentation or examples.
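
To make the point about the parameter name concrete, here is a minimal sketch that only builds the request URL with the standard library; nothing is sent over the network. `efetch_url` is a hypothetical helper for this illustration, and the base address is an assumption (the NCBI efetch endpoint as commonly documented today, which may not be what Biopython used in 2009).

```python
from urllib.parse import urlencode

# Assumed base address of NCBI's efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(**params):
    """Build an efetch URL from keyword arguments (hypothetical helper)."""
    # Sort the keys so the query string is reproducible.
    return EFETCH + "?" + urlencode(sorted(params.items()))


# "retmode" is the parameter name recommended in the reply above.
good = efetch_url(db="protein", id="483329", retmode="xml")
# "mode" is passed along as an ordinary key; per the reply above it did
# not change the output format, so the default ASN.1 came back.
bad = efetch_url(db="protein", id="483329", mode="xml")

print(good)
print(bad)
```

Comparing the two printed URLs shows that the keyword is forwarded verbatim, which is why a misspelt name is not reported as an error on the Python side.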
Thanks

Peter

From bartomas at gmail.com Fri Sep 4 08:21:04 2009
From: bartomas at gmail.com (bar tomas)
Date: Fri, 4 Sep 2009 09:21:04 +0100
Subject: [Biopython] Question about efetch output format
In-Reply-To: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>
References: <320fb6e00909030938t139e246eg1a4cb524653eb4a2@mail.gmail.com>
Message-ID:

Many thanks. My mistake, I must've copied it badly from the doc.

On Thu, Sep 3, 2009 at 5:38 PM, Peter wrote:

> On Thu, Sep 3, 2009 at 5:06 PM, jorma kala wrote:
> > Hi,
> > I'm trying to retrieve a record from protein database (I found the record id
> > by running Entrez.esearch)
> >
> > handle = Entrez.efetch(db="protein", id='483329',mode='xml')
> > print handle.read()
> > Although I specify xml mode, the result comes in a quite confusing format
> > using braces (I've pasted a snippet at the end of the email)
> >
> > Do you know what I should do to get it in xml?
> > Many thanks
>
> You've got the default ASN.1 output. You need to use "retmode" not "mode",
>
> from Bio import Entrez
> handle = Entrez.efetch(db="protein", id='483329',retmode='xml')
> print handle.read()
>
> I thought both the Biopython documentation and the NCBI documentation
> was clear on this - maybe you found a typo? Please let us know if there is
> an error in any of the documentation or examples.
>
> Thanks
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From bav853 at bham.ac.uk Fri Sep 4 12:38:18 2009
From: bav853 at bham.ac.uk (Bhima Auro van der Molen)
Date: Fri, 04 Sep 2009 13:38:18 +0100
Subject: [Biopython] Residue Depth module.
Message-ID: <4AA10A3A.7010303@bham.ac.uk>

Hi everyone

I have been trying to calculate residue depths in PDB files, using the
hsexpo.py script, which calls and makes use of the "msms" "pdb_to_xyz,"
"pdb_to_xyzn" programs and the ResidueDepth.py module which on my system
is located at:

/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py

I opened up the ResidueDepth.py file and when I was debugging it I found
that the:

from AbstractPropertyMap import AbstractPropertyMap

returns an error, however when I altered it to:

from Bio.PDB.AbstractPropertyMap import AbstractPropertyMap

it seemed to fix that specific problem.

However the consistent problem I am having each time I try and run the
hsexpo.py script with the option for RD or RDa, is the following error
message:

Traceback (most recent call last):
  File "hsexpo.py", line 101, in <module>
    d=ResidueDepth(m, pdbfile)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 134, in __init__
    surface=get_surface(pdb_file)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 85, in get_surface
    surface=_read_vertex_array(surface_file)
  File "/var/lib/python-support/python2.6/Bio/PDB/ResidueDepth.py", line 53, in _read_vertex_array
    fp=open(filename, "r")
IOError: [Errno 2] No such file or directory: '/tmp/tmpy8yHlH.vert'

When I look in the /tmp folder I can see a number of tmp* files but not
the one(s) that it is looking for specifically.

To ensure that all the required system files were in the right place,
i.e. a correct installation of msms with pdb_to_xyz etc, I moved the
binaries to /usr/bin and tested the binaries from my home directory..
these worked alright..

I should clarify that even though I am running Python 2.6, I encountered
the same problem in 2.5.2 and 2.5.4 as well.

If anyone can help me figure out why this is not working I'd be grateful..
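
One thing that may be worth ruling out (a guess, not a diagnosis) is whether the external programs are visible on the PATH of the Python process itself, as opposed to an interactive shell. A minimal sketch using only the standard library: `find_helpers` is a hypothetical function for this illustration, and the default program names are simply the ones mentioned in the message above, so they may need adjusting to whatever ResidueDepth.py actually invokes on a given installation.

```python
import shutil


def find_helpers(names=("msms", "pdb_to_xyz", "pdb_to_xyzn")):
    """Map each program name to its full path, or None if not on PATH.

    The default names are taken from the message above and are an
    assumption about what needs to be installed.
    """
    return {name: shutil.which(name) for name in names}


for name, path in find_helpers().items():
    print("%-12s %s" % (name, path or "NOT FOUND on PATH"))
```

If every program is reported with a path, the next thing to look at would be whether running the surface program by hand on the same PDB file leaves a `.vert` file behind.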
Thanks
Bhima

From italo.maia at gmail.com Fri Sep 4 14:26:10 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Fri, 4 Sep 2009 11:26:10 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
 <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
 <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
Message-ID: <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>

Does anyone have a link or doc explaining the phylip tree format? I think
I'll try making some simple plotting for it.

2009/9/3 Italo Maia

> Yeap, i meant the pretty png way. I was thinking of something with
> python-imaging, maybe. Anyway, thanks Peter. If it does not exist, i won't
> waste time looking for it.
>
> 2009/9/3 Peter
>
> On Wed, Sep 2, 2009 at 11:54 PM, Italo Maia wrote:
>> > Is there any recepie to plot a phylip phylogenetic tree?
>>
>> If you mean as a simple text representation, then sort of. Try the
>> Bio.Nexus tree objects print methods.
>>
>> Do you meant draw a nice diagram (e.g. as a PDF or PNG file)? If
>> so, no, not at the moment. Some of the Google Summer of Code
>> project work including linking to NetworkX for graphics... this has
>> not yet been merged into Biopython.
>>
>> Peter
>>
>
>
>
> --
> "A arrogância é a arma dos fracos."
>
> ===========================
> Italo Moreira Campelo Maia
> Ciência da Computação - UECE
> Desenvolvedor WEB e Desktop
> Programador Java, Python
> Ubuntu User For Life!
> -----------------------------------------------------
> http://www.italomaia.com/
> http://twitter.com/italomaia/
> http://eusouolobomal.blogspot.com/
> ===========================
>

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Fri Sep 4 14:29:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 15:29:57 +0100
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
 <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
 <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
 <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
Message-ID: <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>

On Fri, Sep 4, 2009 at 3:26 PM, Italo Maia wrote:
> Does anyone has a link or doc explaining the phylip tree format? I think
> i'll try making some simple ploting for it.

Do you mean the Newick tree format?
http://evolution.genetics.washington.edu/phylip/newicktree.html

Peter

P.S. There is a short example using Bio.Nexus.Trees in the current
tutorial.

From italo.maia at gmail.com Sat Sep 5 04:41:05 2009
From: italo.maia at gmail.com (Italo Maia)
Date: Sat, 5 Sep 2009 01:41:05 -0300
Subject: [Biopython] How to plot PHYLIP phylogenetic tree?!
In-Reply-To: <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>
References: <800166920909021554g76b89fd8y36bfeb13b9142f72@mail.gmail.com>
 <320fb6e00909030234m2a629e08gfa283e979bf4851e@mail.gmail.com>
 <800166920909030520p1e02508ava6576fbfae0ad50a@mail.gmail.com>
 <800166920909040726v47897f7et8267500eb5166a52@mail.gmail.com>
 <320fb6e00909040729q440ce158oa74734b29e8eba2@mail.gmail.com>
Message-ID: <800166920909042141y7a9790f1ka8f5c6e846296d0@mail.gmail.com>

Yeap, that's the one! Just found a library for parsing these newick trees.
Found them too late, actually. Just made my own image generator for newick
trees.
The output kind of looks like treeview trees. Big thanks Peter = ]

A sample output can be viewed here:
http://img215.imageshack.us/img215/954/outk.png

ps: drawing trees is a pain in the....gee!

2009/9/4 Peter

> On Fri, Sep 4, 2009 at 3:26 PM, Italo Maia wrote:
> > Does anyone has a link or doc explaining the phylip tree format? I think
> > i'll try making some simple ploting for it.
>
> Do you mean the Newick tree format?
> http://evolution.genetics.washington.edu/phylip/newicktree.html
>
> Peter
>
> P.S. There is a short example using Bio.Nexus.Trees in the current
> tutorial.
>

--
"A arrogância é a arma dos fracos."

===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB e Desktop
Programador Java, Python
Ubuntu User For Life!
-----------------------------------------------------
http://www.italomaia.com/
http://twitter.com/italomaia/
http://eusouolobomal.blogspot.com/
===========================

From aduran at fhcrc.org Sat Sep 5 18:13:21 2009
From: aduran at fhcrc.org (Duran, Alysha M)
Date: Sat, 5 Sep 2009 11:13:21 -0700
Subject: [Biopython] Fred Hutchinson Cancer Research Center - Systems
 Analyst/Programmer III/IV (AD-22564)
References: <455E7DBAAEF0814A9D51D640B8DEDB90010DA4BF1B@ISIS.fhcrc.org>
 <040346FA7309BD439C327F97D4C4D69B05F54166@ISIS.fhcrc.org>
Message-ID: <040346FA7309BD439C327F97D4C4D69B05F5416D@ISIS.fhcrc.org>

Systems Analyst/Programmer III/IV (AD-22564)

About Us

Fred Hutchinson Cancer Research Center, home of three Nobel laureates, is
an independent, nonprofit research institution dedicated to the
development and advancement of biomedical research to eliminate cancer and
other potentially fatal diseases. Recognized internationally for its
pioneering work in bone-marrow transplantation, the Center's four
scientific divisions collaborate to form a unique environment for
conducting basic and applied science.
The Hutchinson Center, in collaboration with its clinical and research
partners, the University of Washington and Children's Hospital and
Regional Medical Center, is the only National Cancer Institute-designated
comprehensive cancer center in the Pacific Northwest. Join us and make a
difference.

Responsibilities

We are seeking an experienced Programmer/Systems Analyst. The person will
join the Cancer Prevention Program providing support for the data analysis
of a large-scale genome-wide association study. This genome-wide scan is
an NCI-funded multiple-year project with the goal to identify
susceptibility genes associated with cancer risk and to investigate
interaction between genes and environmental factors. The
Programmer/Systems Analyst will work in a multidisciplinary research team.
He/she will provide programming support for management of high dimensional
data sets, implement quality control procedures, apply various software
applications, and assist with running statistical analysis on high
performance compute clusters. Furthermore, the person will prepare written
documentation for the data management and the results of data analysis.

Major Duties

In support of the research projects, the incumbent may perform one or more
of the following tasks in addition to other duties as assigned:

1. Participate in design and development of the data management system
   for studies.
2. Design, test, document, and maintain databases.
3. Develop, document, and maintain data cleaning procedures.
4. Implement and maintain standard datasets and reports.
5. Perform study-specific reporting.
6. Implement various software tools, such as PLINK, BeadStudio,
   GenomeStudio, or software packages in R.
7. Develop and/or maintain user interface programs and software tools.
8. Assist with the statistical analysis on high performance compute
   cluster.
Qualifications

The ideal candidate will possess the following qualifications: Bachelor's
degree in computer science or related field and two years' experience as a
Systems Analyst/Programmer III or equivalent; or one year as a Systems
Analyst/Programmer III or equivalent and a Master's Degree in a
job-related area. Experience with LINUX/Unix and R are required. Knowledge
of C, FORTRAN, PERL, PYTHON, Rmpi or mpi is desirable. Experience with
management of complex, high-dimensional genotype data is desirable.
Demonstrated ability to communicate effectively as part of a team.

Recommended Qualifications

Proficient use of database management and statistical software. Knowledge
of and experience in programming support of statistical analysis and
methods development, or other scientific research are desired. Experience
in writing program, system, and/or database documentation.

To Apply

For more information about the position and to apply, please visit the
Fred Hutchinson Cancer Research Center website at www.fhcrc.org and search
for Job# AD-22564.

Alysha M. Duran
Human Resources Specialist/Recruiter
Fred Hutchinson Cancer Research Center
Seattle Cancer Care Alliance
Phone: (206) 667-2720
Fax: (206) 667-4051
Email: aduran at fhcrc.org

Click here to search for open positions: www.fhcrc.org
Follow new job openings on Twitter: http://twitter.com/FHCRC_Jobs

From mitlox at op.pl Sun Sep 6 00:40:48 2009
From: mitlox at op.pl (xyz)
Date: Sun, 06 Sep 2009 10:40:48 +1000
Subject: [Biopython] DSSP and secondary structure
Message-ID: <4AA30510.4020105@op.pl>

Hello,
I have a solved structure (1E8W) with a ligand and I would like to know
which secondary structure elements are within 16 Å (cut-off) of the
ligand. I am not interested in coils.

From looking at the PDB file, the ligand is the last residue in chain A,
named QUE.

I wrote a little script (see below please) in order to test DSSP and it
works.
from Bio.PDB import *

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = PDBParser().get_structure(pdb_code, pdb_filename)
model=structure[0]
dssp=DSSP(model, pdb_filename, "./dsspcmbi")

for r in dssp:
    print r
print len(dssp)

Unfortunately, I do not know how I can find the secondary structure
elements within 16 Å of the ligand.

Thank you in advance.

Best regards

From srikrishnamohan at gmail.com Sun Sep 6 05:46:33 2009
From: srikrishnamohan at gmail.com (km)
Date: Sun, 6 Sep 2009 14:46:33 +0900
Subject: [Biopython] DSSP and secondary structure
In-Reply-To: <4AA30510.4020105@op.pl>
References: <4AA30510.4020105@op.pl>
Message-ID:

use pymol
KM

On Sun, Sep 6, 2009 at 9:40 AM, xyz wrote:

> Hello,
> I have a solved structure (1E8W) with a ligand and I would like to know
> which secondary structure are within 16A (cut off) of the ligand. I am no
> interested in coils.
>
> From looking at the PDB file, ligand is last residue in chain A, named QUE.
>
> I wrote a little script (see bellow please) in order to test DSSP and it
> works.
>
> from Bio.PDB import *
>
> pdb_code = "1E8W"
> pdb_filename = "1E8W.pdb"
>
> structure = PDBParser().get_structure(pdb_code, pdb_filename)
> model=structure[0]
> dssp=DSSP(model, pdb_filename, "./dsspcmbi")
>
> for r in dssp:
>     print r
> print len(dssp)
>
> Unfortunately, I do not know how can I find the secondary structures around
> 16A of the ligand.
>
> Thank you in advance.
>
> Best regards
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk Sun Sep 6 12:05:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 6 Sep 2009 13:05:19 +0100
Subject: [Biopython] DSSP and secondary structure
In-Reply-To: <4AA30510.4020105@op.pl>
References: <4AA30510.4020105@op.pl>
Message-ID: <320fb6e00909060505r4befa189g4c9ce617dfd35511@mail.gmail.com>

On Sun, Sep 6, 2009 at 1:40 AM, xyz wrote:
> Hello,
> I have a solved structure (1E8W) with a ligand and I would like to know
> which secondary structure are within 16A (cut off) of the ligand. I am no
> interested in coils.
>
> From looking at the PDB file, ligand is last residue in chain A, named QUE.
> ...
> Unfortunately, I do not know how can I find the secondary structures around
> 16A of the ligand.

There was a related thread back in March which may be helpful,
http://lists.open-bio.org/pipermail/biopython/2009-March/005021.html

If you are doing it just for this one protein, it may be easier to use
a PDB viewer - which would also help get a feel for the structure itself.

Peter

From yvan.strahm at bccs.uib.no Tue Sep 8 12:01:08 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 08 Sep 2009 14:01:08 +0200
Subject: [Biopython] IPI fetching
Message-ID: <4AA64784.7060506@bccs.uib.no>

Hello All,

I have a project with a bunch of IPI accession numbers and need to get
their FASTA sequences. Currently I am using the SRS web site to get these
sequences. I found an old thread on Biopython-dev about IPI parsing
(http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html),
so I tried to use ExPASy.get_sprot_raw to get the sequence, with no luck.

Does anyone know how I can use the IPI accession number directly in
Biopython?
Thanks for your help,
cheers,
yvan

From biopython at maubp.freeserve.co.uk Tue Sep 8 12:33:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 13:33:46 +0100
Subject: [Biopython] IPI fetching
In-Reply-To: <4AA64784.7060506@bccs.uib.no>
References: <4AA64784.7060506@bccs.uib.no>
Message-ID: <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:01 PM, Yvan Strahm wrote:
> Hello All,
>
> I have a project with a bunch of IPI access number and need to get their
> fasta sequence. Now I am using the SRS web site to get these sequence.
> I found a old thread on Biopython-dev about IPI parseing
> (http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html),
> so I tried to use ExPASy.get_sprot_raw to get the sequence with no luck.
>
> Does anyone know how I can use the IPI accession number directly in
> Biopython?

Can you give us a specific example of an IPI number and the FASTA
record you want back?

Peter

From yvan.strahm at bccs.uib.no Tue Sep 8 12:39:40 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 08 Sep 2009 14:39:40 +0200
Subject: [Biopython] IPI fetching
In-Reply-To: <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com>
References: <4AA64784.7060506@bccs.uib.no>
 <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com>
Message-ID: <4AA6508C.3030705@bccs.uib.no>

Peter wrote:
> On Tue, Sep 8, 2009 at 1:01 PM, Yvan Strahm wrote:
>> Hello All,
>>
>> I have a project with a bunch of IPI access number and need to get their
>> fasta sequence. Now I am using the SRS web site to get these sequence.
>> I found a old thread on Biopython-dev about IPI parseing
>> (http://portal.open-bio.org/pipermail/biopython-dev/2001-December/000771.html),
>> so I tried to use ExPASy.get_sprot_raw to get the sequence with no luck.
>>
>> Does anyone know how I can use the IPI accession number directly in
>> Biopython?
>
> Can you give us a specific example of an IPI number and the FASTA
> record you want back?
>
> Peter

IPI00109764

> ipi|IPI00109764|IPI00109764.2 DNA TOPOISOMERASE 1.
MSGDHLHNDSQIEADFRLNDSHKHKDKHKDREHRHKEHKKDKDKDREKSKHSNSEHKDSEKKHKEKEKTKHKDGSSEKHKDKHKDRDKERRKEEKIRAAG
DAKIKKEKENGFSSPPRIKDEPEDDGYFAPPKEDIKPLKRLRDEDDADYKPKKIKTEDIKKEKKRKSEEEEDGKLKKPKNKDKDKKVAEPDNKKKKPKKE
EEQKWKWWEEERYPEGIKWKFLEHKGPVFAPPYEPLPESVKFYYDGKVMKLSPKAEEVATFFAKMLDHEYTTKEIFRKNFFKDWRKEMTNDEKNTITNLS
KCDFTQMSQYFKAQSEARKQMSKEEKLKIKEENEKLLKEYGFCVMDNHRERIANFKIEPPGLFRGRGNHPKMGMLKRRIMPEDIIINCSKDAKVPSPPPG
HKWKEVRHDNKVTWLVSWTENIQGSIKYIMLNPSSRIKGEKDWQKYETARRLKKCVDKIRNQYREDWKSKEMKVRQRAVALYFIDKLALRAGNEKEEGET
ADTVGCCSLRVEHINLHPELDGQEYVVEFDFPGKDSIRYYNKVPVEKRVFKNLQLFMENKQPEDDLFDRLNTGILNKHLQDLMEGLTAKVFRTYNASITL
QQQLKELTAPDENVPAKILSYNRANRAVAILCNHQRAPPKTFEKSMMNLQSKIDAKKDQLADARRDLKSAKADAKVMKDAKTKKVVESKKKAVQRLEEQL
MKLEVQATDREENKQIALGTSKLNYLDPRITVAWCKKWGVPIEKIYNKTQREKFAWAIDMTDEDYEF

This particular entry has this Uniprot accession number:Q04750

From biopython at maubp.freeserve.co.uk Tue Sep 8 13:41:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:41:40 +0100
Subject: [Biopython] IPI fetching
In-Reply-To: <4AA6508C.3030705@bccs.uib.no>
References: <4AA64784.7060506@bccs.uib.no>
 <320fb6e00909080533g7d5e48d6k8b6f64417c81914a@mail.gmail.com>
 <4AA6508C.3030705@bccs.uib.no>
Message-ID: <320fb6e00909080641m4fbd9b8duf6e8a13557f2d9e7@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:39 PM, Yvan Strahm wrote:
>>
>> Can you give us a specific example of an IPI number and the FASTA
>> record you want back?
>
> IPI00109764
>
>> ipi|IPI00109764|IPI00109764.2 DNA TOPOISOMERASE 1.
> MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF
>
> This particular entry has this Uniprot accession number:Q04750

So if you can work out the uniprot accession number, then you can use
the Bio.ExPASy.get_sprot_raw() function to download the file in the
SwissProt/UniProt plain text format, e.g.
>>> from Bio import ExPASy
>>> from Bio import SeqIO
>>> record = SeqIO.read(ExPASy.get_sprot_raw("Q04750"), "swiss")
>>> print record.format("fasta")
>Q04750 RecName: Full=DNA topoisomerase 1; EC=5.99.1.2; AltName: Full=DNA topoisomerase I;
MSGDHLHNDSQIEADFRLNDSHKHKDKHKD...YEF

It looks like you should be able to get the sequence directly from
the EBI via the International Protein Index (IPI) identifier,
IPI00109764
http://www.ebi.ac.uk/IPI/IPIhelp.html

As per that old thread you referenced, Biopython should be able
to parse the "swiss" output from IPI. How about a quick and dirty
URL hack to access the EBI's SRS?

>>> import urllib
>>> from Bio import SeqIO
>>> ipi = "IPI00109764"
>>> url = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[IPI-acc:%s]+-ascii" % ipi
>>> record = SeqIO.read(urllib.urlopen(url), "swiss")
>>> print record.format("fasta")
>IPI00109764 DNA TOPOISOMERASE 1.
MSGDHLHNDSQIEADFRLNDSHKHKDKHKDRE...YEF

Done? With a little tweaking to the URL you can download this
directly as FASTA if you like (saves some bandwidth).

Peter

From schafer at rostlab.org Tue Sep 8 17:45:53 2009
From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=)
Date: Tue, 08 Sep 2009 13:45:53 -0400
Subject: [Biopython] Problem with pdb-file parsing
Message-ID: <4AA69851.2000605@rostlab.org>

Hi,

I don't know whether this is a bug or whether I did something wrong.
I am parsing the pdb structure 1a2d with the following code to get
the one-letter polypeptide sequence for chain A:

------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptide = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptide[0].get_sequence())

print sequence
------------------CODE----------------

This however gives me a sequence that is one amino acid shorter than
expected.
The structure contains one HETATM block within the ATOM block of chain A
(pos 117), which gets translated into an 'X' in the sequence. The
following amino acid at position 118 (VAL) seems to be missing.

So the resulting sequence around the X is:
...VEXMK...
To my understanding this should be:
...VEXVMK...

Is this behaviour intended? Is it a bug? The biopython version is 1.49
(Ubuntu jaunty).

Chris

From kelly.oakeson at utah.edu Wed Sep 9 03:22:51 2009
From: kelly.oakeson at utah.edu (Kelly F Oakeson)
Date: Tue, 8 Sep 2009 21:22:51 -0600
Subject: [Biopython] Biopython and Snow Leopard
Message-ID: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>

Hello list,
I am wondering if Biopython is compatible with Mac OS 10.6?

Kelly Oakeson
kelly.oakeson at utah.edu

From biopython at maubp.freeserve.co.uk Wed Sep 9 09:13:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Sep 2009 10:13:17 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
Message-ID: <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com>

On Wed, Sep 9, 2009 at 4:22 AM, Kelly F Oakeson wrote:
> Hello list,
> I am wondering of Biopython is compatible with Mac OS 10.6?

In theory yes, but I don't know if anyone has tested it yet.
Apple have updated the compiler (gcc), and probably the
system Python since Mac OS 10.5 Leopard.

On Leopard:

$ python --version
Python 2.5.2

$ gcc -v
Using built-in specs.
Target: i686-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5465~16/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --with-arch=apple --with-tune=generic --host=i686-apple-darwin9 --target=i686-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc.
build 5465)

So, try it and see? If you run into problems compiling, or running
the unit tests please let us know. Also in case it matters, was
this an update or a clean install of Snow Leopard?

Peter

From biopython at maubp.freeserve.co.uk Wed Sep 9 09:25:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Sep 2009 10:25:11 +0100
Subject: [Biopython] Problem with pdb-file parsing
In-Reply-To: <4AA69851.2000605@rostlab.org>
References: <4AA69851.2000605@rostlab.org>
Message-ID: <320fb6e00909090225g686c88fdlf1b3bbaf9c10701d@mail.gmail.com>

On Tue, Sep 8, 2009 at 6:45 PM, Christian Schäfer wrote:
> Hi,
>
> I don't know whether this is either a bug or I did something wrong.
> I am parsing the pdb structure 1a2d with the following code to get
> the one-letter polypeptide sequence for chain A:
>
> ------------------CODE----------------
> from Bio.PDB.PDBParser import PDBParser
> from Bio.PDB.Polypeptide import *
>
> parser = PDBParser()
> ppb = PPBuilder()
> structure = parser.get_structure('tmp', '1a2d.pdb')
> polypeptide = ppb.build_peptides(structure[0]['A'])
> sequence = str(polypeptide[0].get_sequence())
>
> print sequence
> ------------------CODE----------------
>
> This however gives me a sequence that is one aminoacid shorter than
> expected. The structure contains one HETATM block within the ATOM
> block of chain A (pos 117), which gets translated into a 'X' in the
> sequence. The following aminoacid at position 118 (VAL) seems to be
> missing.
>
> So the resulting sequence around the X is:
> ...VEXMK...
> To my understanding this should be:
> ...VEXVMK...
>
> Is this behaviour intended? Is it a bug? The biopython version is 1.49
> (Ubuntu jaunty).

I agree that does not seem to be sensible. I get the same behaviour
with the latest code in the repository (so updating to Biopython 1.51
won't help here). It looks like a bug in the builder code, since the
parser seems fine, and you can get the sequence in other ways, e.g.

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *
parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
for model in structure :
    for chain in model :
        #Try adjusting depending on if you expect just the 20
        #standard amino acids etc.
        #aminos = [to_one_letter_code.get(res.resname,"X") \
        #          for res in chain if res.resname != "HOH"]
        aminos = [to_one_letter_code.get(res.resname,"X") \
                  for res in chain if "CA" in res.child_dict]
        sequence = "".join(aminos)
        print sequence

Could you file this as a bug on Bugzilla please?
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

Thanks,

Peter

From lpritc at scri.ac.uk Wed Sep 9 09:35:55 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 09 Sep 2009 10:35:55 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
Message-ID:

Hi all,

I upgraded to 10.6 from 10.5.8 on my laptop, with a Python/Biopython
installation still in-place, and I haven't had any problems yet. This, of
course, doesn't mean that there aren't issues in the modules I haven't
used, or with compilation under 10.6.

Cheers,

L.

On 09/09/2009 04:22, "Kelly F Oakeson" wrote:

> Hello list,
> I am wondering of Biopython is compatible with Mac OS 10.6?
>
>
> Kelly Oakeson
> kelly.oakeson at utah.edu
>
>
>
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________ From kelly.oakeson at utah.edu Wed Sep 9 13:05:34 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 07:05:34 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> Message-ID: <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> Thanks Peter, I gave it a shot and it won't install for me. Here are my results: $ python --version Python 2.5.4 $ gcc -v Using built-in specs. Target: i686-apple-darwin10 Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 $python setup.py install running build running build_py creating build/lib.macosx-10.3-x86_64-2.5 creating build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/distance.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/DocSQL.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/EZRetrieve.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/File.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/FilteredReader.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/HotRand.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Index.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/kNN.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/listfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/LogisticRegression.py -> build/lib.macosx-10.3-x86_64-2.5/ Bio copying Bio/MarkovModel.py -> build/lib.macosx-10.3-x86_64-2.5/Bio 
copying Bio/mathfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/MaxEntropy.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/NaiveBayes.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/NetCatch.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/pairwise2.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/ParserSupport.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/PropertyManager.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/PubMed.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Search.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Seq.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/SeqFeature.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/SeqRecord.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/stringfns.py -> build/lib.macosx-10.3-x86_64-2.5/Bio copying Bio/Transcribe.py -> build/lib.macosx-10.3-x86_64-2.5/Bio . . . . . copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- x86_64-2.5/Bio/PopGen/SimCoal/data running build_ext building 'Bio.clistfns' extension creating build/temp.macosx-10.3-x86_64-2.5 creating build/temp.macosx-10.3-x86_64-2.5/Bio Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ MacOSX10.4u.sdk Please check your Xcode installation gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ Python.framework/Versions/2.5/include/python2.5 -c Bio/ clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ clistfnsmodule.o cc1: error: unrecognized command line option "-Wno-long-double" cc1: error: unrecognized command line option "-Wno-long-double" lipo: can't figure out the architecture type of: /var/tmp//ccQJ2KcH.out error: command 'gcc' failed with exit status 1 On Sep 9, 2009, at 3:13 AM, Peter wrote: > On Wed, Sep 9, 2009 at 4:22 AM, Kelly F > Oakeson wrote: >> Hello list, >> I am wondering of 
Biopython is compatible with Mac OS 10.6? > > In theory yes, but I don't know if anyone has tested it yet. > Apple have updated the compiler (gcc), and probably the > system Python since Mac OS 10.5 Leopard. > > On Leopard: > > $ python --version > Python 2.5.2 > > $ gcc -v > Using built-in specs. > Target: i686-apple-darwin9 > Configured with: /var/tmp/gcc/gcc-5465~16/src/configure > --disable-checking -enable-werror --prefix=/usr --mandir=/share/man > --enable-languages=c,objc,c++,obj-c++ > --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ > --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib > --build=i686-apple-darwin9 --with-arch=apple --with-tune=generic > --host=i686-apple-darwin9 --target=i686-apple-darwin9 > Thread model: posix > gcc version 4.0.1 (Apple Inc. build 5465) > > So, try it and see? If you run into problems compiling, or running > the unit tests please let us know. Also in case it matters, was > this an update or a clean install of Snow Leopard? > > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 13:20:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 14:20:44 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> Message-ID: <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> On Wed, Sep 9, 2009 at 2:05 PM, Kelly F Oakeson wrote: > Thanks Peter, > I gave it a shot and it won't install for me. Here are my results: > > $ python --version > Python 2.5.4 > > $ gcc -v > Using built-in specs. 
> Target: i686-apple-darwin10 > Configured with: /var/tmp/gcc/gcc-5646~6/src/configure > --disable-checking --enable-werror --prefix=/usr --mandir=/share/man > --enable-languages=c,objc,c++,obj-c++ > --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ > --with-slibdir=/usr/lib --build=i686-apple-darwin10 > --with-gxx-include-dir=/include/c++/4.2.1 > --program-prefix=i686-apple-darwin10- > --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 Did the last few lines get lost in the cut and paste? > $python setup.py install > running build > running build_py > creating build/lib.macosx-10.3-x86_64-2.5 > creating build/lib.macosx-10.3-x86_64-2.5/Bio > copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio ... > . > copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- > x86_64-2.5/Bio/PopGen/SimCoal/data > running build_ext > building 'Bio.clistfns' extension > creating build/temp.macosx-10.3-x86_64-2.5 > creating build/temp.macosx-10.3-x86_64-2.5/Bio > Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ > MacOSX10.4u.sdk > Please check your Xcode installation > gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - > fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - > fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ > Python.framework/Versions/2.5/include/python2.5 -c Bio/ > clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ > clistfnsmodule.o > cc1: error: unrecognized command line option "-Wno-long-double" > cc1: error: unrecognized command line option "-Wno-long-double" > lipo: can't figure out the architecture type of: /var/tmp//ccQJ2KcH.out > error: command 'gcc' failed with exit status 1 OK, as I feared, the C code isn't compiling. Have you got XCode installed? Which version? The message "Please check your Xcode installation" is troubling. Could you also double check the gcc version (see above). Was this a clean Snow Leopard install, or an update? 
Thanks, Peter From kelly.oakeson at utah.edu Wed Sep 9 13:46:45 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 07:46:45 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> Message-ID: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> On Sep 9, 2009, at 7:20 AM, Peter wrote: > On Wed, Sep 9, 2009 at 2:05 PM, Kelly F > Oakeson wrote: >> Thanks Peter, >> I gave it a shot and it won't install for me. Here are my results: >> >> $ python --version >> Python 2.5.4 >> >> $ gcc -v >> Using built-in specs. >> Target: i686-apple-darwin10 >> Configured with: /var/tmp/gcc/gcc-5646~6/src/configure >> --disable-checking --enable-werror --prefix=/usr --mandir=/share/man >> --enable-languages=c,objc,c++,obj-c++ >> --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ >> --with-slibdir=/usr/lib --build=i686-apple-darwin10 >> --with-gxx-include-dir=/include/c++/4.2.1 >> --program-prefix=i686-apple-darwin10- >> --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 > > Did the last few lines get lost in the cut and paste? Here is how it looks exactly on my screen: $ gcc -v Using built-in specs. Target: i686-apple-darwin10 Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable- checking --enable-werror --prefix=/usr --mandir=/share/man --enable- languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/ $/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx- include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- -- host=x86_64-apple-darwin10 --target=i686-apple-darwin10 Thread model: posix gcc version 4.2.1 (Apple Inc. 
build 5646) > >> $python setup.py install >> running build >> running build_py >> creating build/lib.macosx-10.3-x86_64-2.5 >> creating build/lib.macosx-10.3-x86_64-2.5/Bio >> copying Bio/__init__.py -> build/lib.macosx-10.3-x86_64-2.5/Bio > ... >> . >> copying Bio/PopGen/SimCoal/data/ssm_2d.par -> build/lib.macosx-10.3- >> x86_64-2.5/Bio/PopGen/SimCoal/data >> running build_ext >> building 'Bio.clistfns' extension >> creating build/temp.macosx-10.3-x86_64-2.5 >> creating build/temp.macosx-10.3-x86_64-2.5/Bio >> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ >> MacOSX10.4u.sdk >> Please check your Xcode installation >> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - >> fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused- >> madd - >> fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ >> Python.framework/Versions/2.5/include/python2.5 -c Bio/ >> clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ >> clistfnsmodule.o >> cc1: error: unrecognized command line option "-Wno-long-double" >> cc1: error: unrecognized command line option "-Wno-long-double" >> lipo: can't figure out the architecture type of: /var/tmp// >> ccQJ2KcH.out >> error: command 'gcc' failed with exit status 1 > > OK, as I feared, the C code isn't compiling. Have you got XCode > installed? Which version? The message "Please check your Xcode > installation" is troubling. > > Could you also double check the gcc version (see above). > > Was this a clean Snow Leopard install, or an update? > > Thanks, > > Peter I installed XCode from the snow leopard install DVD, It is Version 3.2 (1610). It was an update on a MacPro that hadn't had Biopython installed before. I installed Python 2.5.4 and then tried to install Biopython. I also updated to Python 2.6 on a MacbookPro, also running 10.6 and that seemed to have broken my previous Biopython install. 
Thanks for the help,

From biopython at maubp.freeserve.co.uk Wed Sep 9 13:57:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Sep 2009 14:57:11 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu>
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
 <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com>
 <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu>
 <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com>
 <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu>
Message-ID: <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com>

Hi Kelly,

From some Google searching this is a general Python issue. This blog
post looked helpful,

http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/

It suggests the simplest solution is to install the optional 10.4 SDK
on the system, which Snow Leopard does not install by default; it's an
optional install in the developer tools installer.

This fits with the error message you had,

> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
> Please check your Xcode installation

Can you give that a go please?
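
The -arch and -isysroot flags in the failing gcc command are recorded in
the Python installation's own build configuration rather than in
Biopython's setup.py. A minimal sketch, assuming a Python 3 interpreter
with the standard sysconfig module, of how one might check which SDK and
architectures an interpreter will request; parse_build_flags is a
hypothetical helper written for this illustration, not part of Python or
Biopython:

```python
import os
import shlex
import sysconfig


def parse_build_flags(flags):
    """Return (sdk_path, architectures) found in a compiler flag string.

    Hypothetical helper: it only looks for "-isysroot PATH" and
    "-arch NAME" pairs and ignores every other flag.
    """
    tokens = shlex.split(flags)
    sdk_path = None
    archs = []
    # Stop one token early so tokens[i + 1] always exists.
    for i, token in enumerate(tokens[:-1]):
        if token == "-isysroot":
            sdk_path = tokens[i + 1]
        elif token == "-arch":
            archs.append(tokens[i + 1])
    return sdk_path, archs


if __name__ == "__main__":
    # CFLAGS can be missing or empty on some platforms.
    cflags = sysconfig.get_config_var("CFLAGS") or ""
    sdk, archs = parse_build_flags(cflags)
    print("architectures:", archs or "compiler default")
    if sdk is None:
        print("no -isysroot flag recorded")
    elif os.path.isdir(sdk):
        print("SDK found:", sdk)
    else:
        print("SDK missing:", sdk)
```

If the reported SDK directory is missing, that would match the
"Compiling with an SDK that doesn't seem to exist" message quoted above.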
Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 13:59:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 14:59:17 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> Message-ID: <320fb6e00909090659y7895b8f2h1551460f8760244d@mail.gmail.com> On Wed, Sep 9, 2009 at 2:46 PM, Kelly F Oakeson wrote: > > I also updated to Python 2.6 on a MacbookPro, also running 10.6 and > that seemed to have broken my previous Biopython install. > Installing a new version of Python requires re-installing all the 3rd party python libraries for that new version of Python. So if you have Python 2.5 with Biopython working, and then installed Python 2.6, you would have to install Biopython again for Python 2.6 to use. In the meantime, you would still have the old Python 2.5+Biopython available. Peter From kelly.oakeson at utah.edu Wed Sep 9 15:46:02 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 09:46:02 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> Message-ID: Peter, I installed the 10.4 SDK and tried the install again. It still failed, here is the output: $ sudo python setup.py install Password: running install Numerical Python (NumPy) is not installed. 
This package is required for many Biopython features. Please install it before you install Biopython. You can install Biopython anyway, but anything dependent on NumPy will not work. If you do this, and later install NumPy, you should then re-install Biopython. You can find NumPy at http://numpy.scipy.org Do you want to continue this installation? (y/N) Y running build running build_py running build_ext building 'Bio.clistfns' extension gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ Python.framework/Versions/2.5/include/python2.5 -c Bio/ clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ clistfnsmodule.o cc1: error: unrecognized command line option "-Wno-long-double" cc1: error: unrecognized command line option "-Wno-long-double" lipo: can't figure out the architecture type of: /var/tmp//cc6kttGl.out error: command 'gcc' failed with exit status 1 $ arch i386 I do have the 64 bit kernel enabled, could that be causing the issue? Kelly Oakeson kelly.oakeson at utah.edu On Sep 9, 2009, at 7:57 AM, Peter wrote: > Hi Kelly, > > From some Google searching this is a general Python issue. This blog > post looked helpful, > > http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ > > It suggests the simplest solution is you should install the optional > 10.4 SDK on the system, which Snow Leopard does not install by default > ? it?s an optional install in the developer tools installer. > > This fits with the error message you had, > >> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/ >> MacOSX10.4u.sdk >> Please check your Xcode installation > > Can you give that a go please? 
>
> Peter

From biopython at maubp.freeserve.co.uk Wed Sep 9 16:05:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Sep 2009 17:05:13 +0100
Subject: [Biopython] Biopython and Snow Leopard
In-Reply-To:
References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu>
 <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com>
 <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu>
 <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com>
 <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu>
 <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com>
Message-ID: <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com>

On Wed, Sep 9, 2009 at 4:46 PM, Kelly F Oakeson wrote:
> Peter,
> I installed the 10.4 SDK and tried the install again. It still failed,
> here is the output:
>
> $ sudo python setup.py install
> Password:
> running install
>
> Numerical Python (NumPy) is not installed.
>
> This package is required for many Biopython features. Please install
> it before you install Biopython. You can install Biopython anyway, but
> anything dependent on NumPy will not work. If you do this, and later
> install NumPy, you should then re-install Biopython.
>
> You can find NumPy at http://numpy.scipy.org
>
> Do you want to continue this installation? (y/N) Y

As an aside, from following the NumPy mailing list, it can be installed
on Snow Leopard but there were some similar glitches and I don't
know how easy it is.

> running build
> running build_py
> running build_ext
> building 'Bio.clistfns' extension
> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5 -c Bio/clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/clistfnsmodule.o
> cc1: error: unrecognized command line option "-Wno-long-double"
> cc1: error: unrecognized command line option "-Wno-long-double"
> lipo: can't figure out the architecture type of: /var/tmp//cc6kttGl.out
> error: command 'gcc' failed with exit status 1
>
> $ arch
> i386
>
> I do have the 64 bit kernel enabled, could that be causing the issue?

Maybe - it looks like gcc has been called with "-arch ppc -arch i386",
which should probably be "-arch x86_64" (or left as the default?) as
you are running in full 64 bit mode (and you must have an Intel CPU,
not a PowerPC). Try:

gcc -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5 -c Bio/clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/clistfnsmodule.o

[all as one line] and see what gcc says then.
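
A rough sketch of the flag swap suggested above, assuming Python 3;
retarget_command is a hypothetical helper for experimenting by hand.
Dropping -Wno-long-double is an assumption based on the "unrecognized
command line option" error in the log, not something confirmed to fix
the build:

```python
import shlex


def retarget_command(command, arch="x86_64", drop=("-Wno-long-double",)):
    """Rewrite a compiler command line for a single architecture.

    Hypothetical helper: every "-arch NAME" pair is collapsed into one
    "-arch <arch>" pair, and any flag listed in `drop` is removed.
    Returns the new command as a list of arguments.
    """
    result = []
    skip_next = False    # True while skipping the value after "-arch"
    arch_added = False   # ensures the replacement pair is added once
    for token in shlex.split(command):
        if skip_next:
            skip_next = False
            continue
        if token == "-arch":
            skip_next = True
            if not arch_added:
                result.extend(["-arch", arch])
                arch_added = True
            continue
        if token in drop:
            continue
        result.append(token)
    return result


if __name__ == "__main__":
    original = ("gcc -arch ppc -arch i386 "
                "-isysroot /Developer/SDKs/MacOSX10.4u.sdk "
                "-Wno-long-double -c Bio/clistfnsmodule.c "
                "-o clistfnsmodule.o")
    print(" ".join(retarget_command(original)))
```

The result is returned as an argument list so it could be passed to
subprocess without going through a shell; printing it first makes it
easy to compare against the command that distutils generated.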
Peter From kelly.oakeson at utah.edu Wed Sep 9 16:16:49 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 10:16:49 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> Message-ID: Peter, It looks like I may have solved it! Taking the advice from the post on http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ I changed the makefile in /Library/Frameworks/Python.framework/ Versions/2.6/lib/python2.6/config/Makefile replacing all of the occurrences of 10.4u with 10.6. I then ran $python setup.py build $python setup.py test $sudo python setup.py install Everything worked just fine without any gcc errors. This worked on both my MacBook pro and MacPro both running the 64 bit kernel. Thanks for all of the help. On Sep 9, 2009, at 10:05 AM, Peter wrote: > On Wed, Sep 9, 2009 at 4:46 PM, Kelly F > Oakeson wrote: >> Peter, >> I installed the 10.4 SDK and tried the install again. It still >> failed, >> here is the output: >> >> $ sudo python setup.py install >> Password: >> running install >> >> Numerical Python (NumPy) is not installed. >> >> This package is required for many Biopython features. Please install >> it before you install Biopython. You can install Biopython anyway, >> but >> anything dependent on NumPy will not work. If you do this, and later >> install NumPy, you should then re-install Biopython. >> >> You can find NumPy at http://numpy.scipy.org >> >> Do you want to continue this installation? 
(y/N) Y > > As an aside, from following the NumPy mailing list, it can be > installed > on Snow Leopard but there were some similar glitches and I don't > know how easy it is. > >> running build >> running build_py >> running build_ext >> building 'Bio.clistfns' extension >> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - >> fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused- >> madd - >> fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ >> Python.framework/Versions/2.5/include/python2.5 -c Bio/ >> clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ >> clistfnsmodule.o >> cc1: error: unrecognized command line option "-Wno-long-double" >> cc1: error: unrecognized command line option "-Wno-long-double" >> lipo: can't figure out the architecture type of: /var/tmp// >> cc6kttGl.out >> error: command 'gcc' failed with exit status 1 >> >> $ arch >> i386 >> >> I do have the 64 bit kernel enabled, could that be causing the issue? > > Maybe - it looks like gcc has been called with "-arch ppc -arch i386", > which should probably be "-arch x86_64" (or left as the default?) as > you are running in full 64 bit mode (and you must have an Intel CPU, > not a PowerPC). Try: > > gcc -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.4u.sdk - > fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd - > fno-common -dynamic -DNDEBUG -g -O3 -I/Library/Frameworks/ > Python.framework/Versions/2.5/include/python2.5 -c Bio/ > clistfnsmodule.c -o build/temp.macosx-10.3-x86_64-2.5/Bio/ > clistfnsmodule.o > > [all as one line] and see what gcc says then. 
> > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 16:21:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:21:28 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> Message-ID: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> On Wed, Sep 9, 2009 at 5:16 PM, Kelly F Oakeson wrote: > Peter, > It looks like I may have solved it! Taking the advice from the post on > http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ > I changed the makefile in /Library/Frameworks/Python.framework/ > Versions/2.6/lib/python2.6/config/Makefile replacing all of the > occurrences of 10.4u with 10.6. > I then ran > $python setup.py build > $python setup.py test > $sudo python setup.py install > > Everything worked just fine without any gcc errors. This worked on > both my MacBook pro and MacPro both running the 64 bit kernel. That's great, but doesn't seem like a "proper" solution for us in the long run. Did you run the Biopython unit tests? It would be good to know they are all fine. 
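
Before hand-editing the Makefile it may help to see exactly which lines
refer to the old SDK. A minimal read-only sketch, assuming Python 3 and
the standard sysconfig module; find_sdk_references is a hypothetical
helper, and the "10.4u" search string is simply the one mentioned in the
message above:

```python
import sysconfig


def find_sdk_references(makefile_text, needle="10.4u"):
    """Return (line_number, line) pairs for lines containing `needle`.

    Hypothetical helper: it only reports matches and never modifies
    the Makefile.
    """
    matches = []
    for number, line in enumerate(makefile_text.splitlines(), start=1):
        if needle in line:
            matches.append((number, line.rstrip()))
    return matches


if __name__ == "__main__":
    # Path of the Makefile recorded when this interpreter was built.
    path = sysconfig.get_makefile_filename()
    try:
        with open(path) as handle:
            text = handle.read()
    except OSError:
        # Some installations do not ship the config Makefile.
        print("could not read", path)
    else:
        for number, line in find_sdk_references(text):
            print("%s:%d: %s" % (path, number, line))
```

Listing the matches first leaves the installed Python untouched, which
makes it easier to judge how invasive a search-and-replace would be.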
Peter From kelly.oakeson at utah.edu Wed Sep 9 16:26:04 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 10:26:04 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <320fb6e00909090213u698914aar6d0ee833d17254b8@mail.gmail.com> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> Message-ID: <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> Peter, I only ran the setup.py test, but here are the results: $ python setup.py test running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw. test_Cluster ... skipping. If you want to use Bio.Cluster, install NumPy first and then reinstall Biopython test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL. test_Emboss ... skipping. Install EMBOSS if you want to use Bio.EMBOSS. test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... 
ok test_GASelection ... ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF. test_GenBank ... ok test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics. test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... skipping. Install NumPy if you want to use Bio.KDTree. test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... ok test_LogisticRegression ... skipping. Install NumPy if you want to use Bio.LogisticRegression. test_MEME ... ok test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper. test_MarkovModel ... skipping. Install NumPy if you want to use Bio.MarkovModel. test_Medline ... ok test_Motif ... ok test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper. test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... skipping. Install NumPy if you want to use Bio.PDB. test_PDB_unit ... skipping. Install NumPy if you want to use Bio.PDB. test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. 
Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper. test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... skipping. Install NumPy if you want to use Bio.SVDSuperimposer. test_SeqIO ... ok test_SeqIO_FastaIO ... ok test_SeqIO_QualityIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper. test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... skipping. Install NumPy if you want to use Bio.kNN. test_lowess ... skipping. Install NumPy if you want to use Bio.Statistics.lowess. test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.PhdIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... 
ok ---------------------------------------------------------------------- Ran 124 tests in 62.858 seconds On Sep 9, 2009, at 10:21 AM, Peter wrote: > On Wed, Sep 9, 2009 at 5:16 PM, Kelly F > Oakeson wrote: >> Peter, >> It looks like I may have solved it! Taking the advice from the post >> on >> http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ >> I changed the makefile in /Library/Frameworks/Python.framework/ >> Versions/2.6/lib/python2.6/config/Makefile replacing all of the >> occurrences of 10.4u with 10.6. >> I then ran >> $python setup.py build >> $python setup.py test >> $sudo python setup.py install >> >> Everything worked just fine without any gcc errors. This worked on >> both my MacBook pro and MacPro both running the 64 bit kernel. > > That's great, but doesn't seem like a "proper" solution for us in > the long run. Did you run the Biopython unit tests? It would be > good to know they are all fine. > > Peter From biopython at maubp.freeserve.co.uk Wed Sep 9 16:35:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:35:10 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> Message-ID: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> On Wed, Sep 9, 2009 at 5:26 PM, Kelly F Oakeson wrote: > Peter, > I only ran the setup.py test, but here are the results: > $ python setup.py test > running test > test_Ace ... ok > test_AlignIO ... ok > test_BioSQL ... skipping. 
Enter your settings in Tests/setup_BioSQL.py > (not important if you do not plan to use BioSQL). > test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ > setup_BioSQL.py (not important if you do not plan to use BioSQL). > test_CAPS ... ok > test_Clustalw ... ok > test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you > want to use Bio.Clustalw. > test_Cluster ... skipping. If you want to use Bio.Cluster, install > NumPy first and then reinstall Biopython > ... > Bio.Wise.psw docstring test ... ok > Bio.Motif docstring test ... ok > ---------------------------------------------------------------------- > Ran 124 tests in 62.858 seconds Excellent - a clean bill of health. Do you want to try installing NumPy now, and then reinstalling Biopython to see if everything works? ;) Peter From lpritc at scri.ac.uk Wed Sep 9 16:41:20 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 09 Sep 2009 17:41:20 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: Message-ID: On 09/09/2009 17:40, "Leighton Pritchard" wrote: > Your solution seems to be the simplest one, Kelly. The alternative appears to > be to modify the build files. By which, of course, I mean for *the developers* to modify the build files. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. 
This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From kelly.oakeson at utah.edu Wed Sep 9 16:42:10 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 10:42:10 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <828374D5-C0F9-46AC-93DD-16529B5CD2BE@utah.edu> Peter, I will give it a shot once I get back from class, ah the joys of being a grad student. Kelly O. On Sep 9, 2009, at 10:35 AM, "Peter" wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... 
ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> --- >> ------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? ;) > > Peter From nuin at genedrift.org Wed Sep 9 16:39:58 2009 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 9 Sep 2009 12:39:58 -0400 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> Just another perspective from Snow Leopard install, I didn't have any errors compiling BioPython, after I installed the latest version of Xcode (including 10.4 support). All tests were fine too. 
Paulo On 2009-09-09, at 12:35 PM, Peter wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> ---------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? 
;) > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From lpritc at scri.ac.uk Wed Sep 9 16:40:33 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 09 Sep 2009 17:40:33 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> Message-ID: Hi all, Thanks to Google, it appears that you're not the only one to run into this problem: http://www.reddit.com/r/Python/comments/9gpuc/snow_leopard_and_python_compatibility_issues/ http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ http://www.allegro.cc/forums/print-thread/601429 And more details about the actual problem here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35961 Your solution seems to be the simplest one, Kelly. The alternative appears to be to modify the build files. Best, L. On 09/09/2009 17:21, "Peter" wrote: > On Wed, Sep 9, 2009 at 5:16 PM, Kelly F Oakeson wrote: >> Peter, >> It looks like I may have solved it! Taking the advice from the post on >> http://mtrichardson.com/2009/09/fixing-jinja2-and-pycrypto-and-probably-others-on-snow-leopard/ >> I changed the makefile in /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/config/Makefile replacing all of the >> occurrences of 10.4u with 10.6. >> I then ran >> $python setup.py build >> $python setup.py test >> $sudo python setup.py install >> >> Everything worked just fine without any gcc errors. This worked on >> both my MacBook pro and MacPro both running the 64 bit kernel. > > That's great, but doesn't seem like a "proper" solution for us in > the long run. Did you run the Biopython unit tests? It would be > good to know they are all fine.
> > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit http://www.messagelabs.com/email > ______________________________________________________________________ -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Wed Sep 9 16:49:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 9 Sep 2009 17:49:26 +0100 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> Message-ID: <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com> On Wed, Sep 9, 2009 at 5:39 PM, Paulo Nuin wrote: > > Just another perspective from Snow Leopard install, I didn't have any errors > compiling BioPython, after I installed the latest version of Xcode > (including 10.4 support). All tests were fine too. > > Paulo That's good. I wonder what was different on your machine compared to Kelly's?
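One way to compare the two machines, offered only as a sketch and not something anyone in this thread ran, would be to print the build settings that Python recorded when it was compiled, since these are what extension builds inherit. The variable names below are ordinary Makefile settings; the sysconfig module is in the standard library from Python 2.7 onwards (on the Python 2.6 discussed here the equivalent calls lived in distutils.sysconfig), and build_settings is just an illustrative helper name.

```python
import sysconfig

# Build variables that usually reveal which SDK and compiler flags are in use.
NAMES = ("MACOSX_DEPLOYMENT_TARGET", "CC", "CFLAGS", "LDFLAGS")


def build_settings(names=NAMES):
    """Return the requested build variables; unset ones come back as None."""
    return {name: sysconfig.get_config_var(name) for name in names}


if __name__ == "__main__":
    for name, value in sorted(build_settings().items()):
        print("%s = %s" % (name, value))
```

Running this on both machines and comparing the output (for example, looking for references to the 10.4u SDK in the flags) might show where the two installs differ.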
Peter From nuin at genedrift.org Wed Sep 9 16:53:34 2009 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 9 Sep 2009 12:53:34 -0400 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> <97C66059-B065-4C91-BC6A-6412519AB665@genedrift.org> <320fb6e00909090949i78a2292cj9f50144873389e@mail.gmail.com> Message-ID: <6053E0BA-407D-4765-B8E4-EB6CCFE56D0C@genedrift.org> Here is my gcc -v output: Using built-in specs. Target: i686-apple-darwin10 Configured with: /var/tmp/gcc/gcc-5646~6/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 Thread model: posix gcc version 4.2.1 (Apple Inc. build 5646) I updated from Leopard to SL, and I still have my Python 2.5 with BioPython installed on it. On SL the "default" Python version is 2.6, and the new installation was performed on it. Paulo On 2009-09-09, at 12:49 PM, Peter wrote: > On Wed, Sep 9, 2009 at 5:39 PM, Paulo Nuin wrote: >> >> Just another perspective from Snow Leopard install, I didn't have >> any errors >> compiling BioPython, after I installed the latest version of Xcode >> (including 10.4 support). All tests were fine too. >> >> Paulo > > That's good. I wonder what was different on your machine compared to > Kelly's?
> > Peter From kelly.oakeson at utah.edu Wed Sep 9 17:55:01 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Wed, 9 Sep 2009 11:55:01 -0600 Subject: [Biopython] Biopython and Snow Leopard In-Reply-To: <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> References: <896E5EF8-AE10-4679-AF95-FB4993D7CA18@utah.edu> <5119CA6F-3D0E-41CA-B0E8-61275F4A3FC9@utah.edu> <320fb6e00909090620n6ba586d1x5b89e7fba7c5af5e@mail.gmail.com> <09DC18DB-C91A-46E1-B649-7E1E08C2FD8B@utah.edu> <320fb6e00909090657v26931c00yef25aec495d5656c@mail.gmail.com> <320fb6e00909090905s69913edek813e4b4aaa836afa@mail.gmail.com> <320fb6e00909090921o24addeebnae93185166169a2d@mail.gmail.com> <046A9D5E-726B-413A-8C82-86637520B066@utah.edu> <320fb6e00909090935n56efc4b6v467ebac2d09f420c@mail.gmail.com> Message-ID: <7BF0A735-C28C-4AA3-831E-1FA41F2F7826@utah.edu> Peter, After installing NumPy everything tests out ok. Here are the results. $ python setup.py test running test test_Ace ... ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/ setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you want to use Bio.Clustalw. test_Cluster ... ok test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... skipping. Install MySQLdb if you want to use Bio.DocSQL. test_Emboss ... skipping. Install EMBOSS if you want to use Bio.EMBOSS. test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... ok test_GASelection ... 
ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... skipping. Install MySQLdb if you want to use Bio.GFF. test_GenBank ... ok test_GenomeDiagram ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsChromosome ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsDistribution ... skipping. Install reportlab if you want to use Bio.Graphics. test_GraphicsGeneral ... skipping. Install reportlab if you want to use Bio.Graphics. test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... ok test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... ok test_LogisticRegression ... ok test_MEME ... ok test_Mafft_tool ... skipping. Install MAFFT if you want to use the Bio.Align.Applications wrapper. test_MarkovModel ... ok test_Medline ... ok test_Motif ... ok test_Muscle_tool ... skipping. Install MUSCLE if you want to use the Bio.Align.Applications wrapper. test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... ok test_PDB_unit ... ok test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... skipping. Install PROBCONS if you want to use the Bio.Align.Applications wrapper. test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... 
ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... ok test_SeqIO ... ok test_SeqIO_FastaIO ... ok test_SeqIO_QualityIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... skipping. Install TCOFFEE if you want to use the Bio.Align.Applications wrapper. test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... ok test_lowess ... ok test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.PhdIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... ok Bio.Statistics.lowess docstring test ... ok ---------------------------------------------------------------------- Ran 125 tests in 63.582 seconds On Sep 9, 2009, at 10:35 AM, Peter wrote: > On Wed, Sep 9, 2009 at 5:26 PM, Kelly F > Oakeson wrote: >> Peter, >> I only ran the setup.py test, but here are the results: >> $ python setup.py test >> running test >> test_Ace ... ok >> test_AlignIO ... ok >> test_BioSQL ... skipping. Enter your settings in Tests/ >> setup_BioSQL.py >> (not important if you do not plan to use BioSQL). >> test_BioSQL_SeqIO ... skipping. 
Enter your settings in Tests/ >> setup_BioSQL.py (not important if you do not plan to use BioSQL). >> test_CAPS ... ok >> test_Clustalw ... ok >> test_Clustalw_tool ... skipping. Install clustalw or clustalw2 if you >> want to use Bio.Clustalw. >> test_Cluster ... skipping. If you want to use Bio.Cluster, install >> NumPy first and then reinstall Biopython >> ... >> Bio.Wise.psw docstring test ... ok >> Bio.Motif docstring test ... ok >> ---------------------------------------------------------------------- >> Ran 124 tests in 62.858 seconds > > Excellent - a clean bill of health. > > Do you want to try installing NumPy now, and then reinstalling > Biopython to see if everything works? ;) > > Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 09:50:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 10:50:50 +0100 Subject: [Biopython] Removing deprecated module Bio.EUtils In-Reply-To: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com> References: <320fb6e00909011001q5d99b62egc06c1303d6a8bc53@mail.gmail.com> Message-ID: <320fb6e00909100250u140382efvbe42a41aedfc3fee@mail.gmail.com> On Tue, Sep 1, 2009 at 6:01 PM, Peter wrote: > Hi all, > > The Bio.Entrez module has long been our preferred interface to the NCBI > Entrez Utilities. It replaced the old Bio.EUtils module which was officially > deprecated in Biopython 1.48, released a year ago (Sept 2008). > > In line with our deprecation policy, I plan to remove Bio.EUtils in the next > release. > > Are there any objections? If anyone is still using the Bio.EUtils module > in old code, please feel free to ask for tips on porting this to Bio.Entrez. Bio.EUtils has been removed now, and will not be included in future releases of Biopython.
Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 09:51:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 10:51:37 +0100 Subject: [Biopython] Removing deprecated BLAST HTML parser In-Reply-To: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com> References: <320fb6e00909011005k76cc8be9j2d3a58201ae54a8f@mail.gmail.com> Message-ID: <320fb6e00909100251m5fae2c6ax44d8d3a989f58a42@mail.gmail.com> On Tue, Sep 1, 2009 at 6:05 PM, Peter wrote: > Hi all, > > The old HTML BLAST parser in Bio.Blast.NCBIWWW was deprecated > a year ago in Biopython 1.48, and in line with our deprecation policy I > would like to remove this for the next release. > > Are there any objections? > > The preferred BLAST output for parsing (as recommended by the NCBI > themselves) is XML. We also have a parser for the plain text output, but > this is not updated very frequently and the NCBI have a history of > making minor changes to the layout and breaking parsers. The old HTML BLAST parser in Bio.Blast.NCBIWWW has been removed now, and will not be included in future releases of Biopython. Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 10:17:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 11:17:25 +0100 Subject: [Biopython] Deprecating Bio.EZRetrieve, NetCatch, FilteredReader and SGMLHandle In-Reply-To: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> References: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> Message-ID: <320fb6e00909100317s39b3b831rdeadc98f5a6995f5@mail.gmail.com> On Thu, Aug 20, 2009 at 10:51 AM, Peter wrote: > Hi all, > > The minor modules Bio.EZRetrieve, Bio.NetCatch, Bio.File.SGMLHandle, > Bio.FilteredReader were declared obsolete in Release 1.50. Are there > any objections to us deprecating them in the next release? These will be deprecated in the next release - if anyone is still using them, please speak up now. 
Thanks, Peter From pzs at dcs.gla.ac.uk Thu Sep 10 15:24:04 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 10 Sep 2009 16:24:04 +0100 Subject: [Biopython] Creating GenBank files Message-ID: <4AA91A14.10602@dcs.gla.ac.uk> I'm trying to create a GenBank file from a sequence and some annotation information. Can BioPython do this? I can't seem to find anything obvious in the documentation. If BioPython does not support this, can anybody recommend another API for doing this? I want to be able to generate genbank files from a script. Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 15:45:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 16:45:19 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <4AA91A14.10602@dcs.gla.ac.uk> References: <4AA91A14.10602@dcs.gla.ac.uk> Message-ID: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> On Thu, Sep 10, 2009 at 4:24 PM, Peter Saffrey wrote: > I'm trying to create a GenBank file from a sequence and some annotation > information. Can BioPython do this? I can't seem to find anything obvious in > the documentation. Yes, you must create a SeqRecord object with suitable SeqFeature objects, and then write it out with SeqIO in GenBank format. If all your features have trivial locations, this is pretty easy. For example, I've done this to make simple gene predictions based on ORF finding and selecting the most upstream start codon, then generating the corresponding SeqFeatures, and saving this as a GenBank file. 
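As a rough illustration of the SeqRecord plus SeqFeature approach described above, here is a minimal sketch. It assumes a recent Biopython release, where the molecule type is set as an annotation and the strand is given on the location; the sequence, identifiers, qualifiers and the helper names build_record and to_genbank are all made up for the example.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def build_record():
    """Build a small annotated record (every name and value here is invented)."""
    record = SeqRecord(
        Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
        id="EXAMPLE1",
        name="EXAMPLE1",
        description="Hypothetical example sequence",
    )
    # Recent Biopython releases need the molecule type to write GenBank output.
    record.annotations["molecule_type"] = "DNA"
    # Python-style coordinates: start is zero-based, end is exclusive,
    # so (0, 24) is written to the file as 1..24.
    cds = SeqFeature(
        FeatureLocation(0, 24, strand=+1),
        type="CDS",
        qualifiers={"gene": ["exampleA"], "product": ["hypothetical protein"]},
    )
    record.features.append(cds)
    return record


def to_genbank(record):
    """Return the record formatted as GenBank text."""
    handle = StringIO()
    SeqIO.write(record, handle, "genbank")
    return handle.getvalue()


if __name__ == "__main__":
    print(to_genbank(build_record()))
```

To save to disk instead, SeqIO.write can be given a filename or an open file handle in place of the StringIO object.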
Peter From biopython at maubp.freeserve.co.uk Thu Sep 10 15:46:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 10 Sep 2009 16:46:22 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> References: <4AA91A14.10602@dcs.gla.ac.uk> <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> Message-ID: <320fb6e00909100846v7a874fa7tacb515fc7c8d3866@mail.gmail.com> On Thu, Sep 10, 2009 at 4:45 PM, Peter wrote: > On Thu, Sep 10, 2009 at 4:24 PM, Peter Saffrey wrote: >> I'm trying to create a GenBank file from a sequence and some annotation >> information. Can BioPython do this? I can't seem to find anything obvious in >> the documentation. > > Yes, you must create a SeqRecord object with suitable SeqFeature objects, > and then write it out with SeqIO in GenBank format. If all your features have > trivial locations, this is pretty easy. > > For example, I've done this to make simple gene predictions based on ORF > finding and selecting the most upstream start codon, then generating the > corresponding SeqFeatures, and saving this as a GenBank file. P.S. You need Biopython 1.51 or later to be able to write out GenBank files with features. Peter From pedro.al at fenhi.uh.cu Thu Sep 10 21:07:20 2009 From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernandez) Date: Thu, 10 Sep 2009 17:07:20 -0400 Subject: [Biopython] Write chains... Message-ID: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu> Hi all!! How can I write out selected chains of a PDB file? For example, if a protein has chains A, B, C and D, I want to write only chains B and C. Thanks -- Lic. Yasser Almeida Hernández Center of Molecular Immunology (CIM) Nanobiology Group P.O.Box 16040, Havana 11600, Cuba Phone: (53-7) 271 7933, ext. 219 ---------------------------------------------------------------- Correo FENHI From michael.koeris at gmail.com Thu Sep 10 21:47:03 2009 From: michael.koeris at gmail.com (Michael S.
Koeris) Date: Thu, 10 Sep 2009 17:47:03 -0400 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly Message-ID: Hi, I am trying to parse out the nucleic acid accession numbers from an Entrez.efetch query made to the Gene database. For some reason the parser does not create a dictionary when I call handle.read on it, even when I force the query to return xml. It just generates a string. Any ideas if another parser needs to be utilized? Many thanks Mike From kellrott at gmail.com Thu Sep 10 22:06:39 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 10 Sep 2009 15:06:39 -0700 Subject: [Biopython] MetaGene and GreenGene Message-ID: I've added two modules, MetaGene and GreenGene, to my BioPython fork. (found at http://github.com/kellrott/biopython/ ) Both of these modules deal with tools/databases related to metagenomic research. The GreenGene module parses and stores the 16S RNA sample library found in the file at http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz and provides a query mechanism to look up sequences and studies. MetaGene parses output from MetaGeneAnnotator ( http://metagene.cb.k.u-tokyo.ac.jp/metagene/ ), a gene prediction program designed for prokaryotes and phage. It takes the predictions and produces SeqRecord objects of the predicted amino acid sequences. I would appreciate comments on additional functionality people would find useful for these packages. I would also request these modules be considered for the mainline BioPython distribution. Kyle From biopython at maubp.freeserve.co.uk Fri Sep 11 09:33:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 10:33:09 +0100 Subject: [Biopython] Write chains...
In-Reply-To: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu> References: <20090910170720.4hbr9c3e8ooggko0@correo.fenhi.uh.cu> Message-ID: <320fb6e00909110233s5f020db1m5bd5438842405701@mail.gmail.com> On Thu, Sep 10, 2009 at 10:07 PM, Yasser Almeida Hernandez wrote: > Hi all!! > > How can i write custom chains of a PDB. > For example a protein have A, B, C and D chains and i want > to write only the B and C chains.... > > Thanks You can do this with Bio.PDB - see pages 5 and 6 of the Bio.PDB documentation: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf i.e. Write a "Select class" which picks out only the chains you want (implement this in the accept_chain method). Here is a related example picking atoms: http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html Alternatively, for a more low-level solution, you could do it manually:

out = open("filtered.pdb", "w")
for line in open("input.pdb") :
    #Column 22 (index 21) of an ATOM/HETATM line holds the chain ID;
    #skip atoms which are not in chain B or C, copy every other line.
    if (line.startswith("ATOM") or line.startswith("HETATM")) \
    and line[21] not in "BC" :
        continue
    out.write(line)
out.close()

[I'd recommend you try and learn how to use Bio.PDB as it is much more powerful in the long run.] Peter From biopython at maubp.freeserve.co.uk Fri Sep 11 09:34:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 10:34:53 +0100 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: References: Message-ID: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> On Thu, Sep 10, 2009 at 10:47 PM, Michael S. Koeris wrote: > Hi, > > I am trying to parse out the nucleic acid accession numbers from an > Entrez.efetch query made to the Gene database. For some reason the parser > does not create a dictionary when I call handle.read on it, even when I > force the query to return xml. It just generates a string. Can you give us a tiny example script? Just fill in the missing bit here ;)

from Bio import Entrez
handle = Entrez.efetch(...)
record = Entrez.read(handle)
print record

Peter From michael.koeris at gmail.com Fri Sep 11 11:39:31 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Fri, 11 Sep 2009 07:39:31 -0400 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> Message-ID: <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> Hi Peter, I am using

handle = Entrez.efetch(db='gene', id=UID['Activin B'][0][0], rettype='xml')
testDic = handle.read()

UID['Activin B'][0][0] is the unique gene number I grabbed earlier through an Entrez.esearch and in this case is : 90 Many thanks Mike On Sep 11, 2009, at 5:34 AM, Peter wrote: > On Thu, Sep 10, 2009 at 10:47 PM, Michael S. Koeris > wrote: >> Hi, >> >> I am trying to parse out the nucleic acid accession numbers from an >> Entrez.efetch query made to the Gene database. For some reason the >> parser >> does not create a dictionary when I call handle.read on it, even >> when I >> force the query to return xml. It just generates a string. > > Can you give us a tiny example script? Just fill in the missing bit > here ;) > > from Bio import Entrez > handle = Entrez.efetch(...) > record = Entrez.read(handle) > print record > > Peter From biopython at maubp.freeserve.co.uk Fri Sep 11 11:48:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 12:48:13 +0100 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> Message-ID: <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> On Fri, Sep 11, 2009 at 12:39 PM, Michael S.
Koeris wrote: > Hi Peter, > > I am using > > handle = Entrez.efetch(db='gene', id=UID['Activin B'][0][0], rettype='xml') > testDic = handle.read() > > UID['Activin B'][0][0] is the unique gene number I grabbed earlier through > an Entrez.esearch and in this case is : 90 > > Many thanks > Mike You should be using retmode="xml", not rettype="xml". See: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html Is there a mistake in our documentation somewhere, or was this a typo? You are getting back an HTML error page, which of course our XML parser doesn't like. Also, handle.read() just gives you the raw text as a string - it is Entrez.read(handle) which parses the XML. Try:

from Bio import Entrez
record = Entrez.read(Entrez.efetch(db='gene', id='90', retmode='xml'))

Peter From biopython at maubp.freeserve.co.uk Fri Sep 11 13:37:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Sep 2009 14:37:15 +0100 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com> Message-ID: <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com> Hi Michael, I've CC'd this to the list. On Fri, Sep 11, 2009 at 1:51 PM, Michael S. Koeris wrote: > > Yes indeed that does help - go dyslexia.... Easily done. Actually, on looking a little closer the NCBI returned "XML presented with HTML" (full of &lt; and &gt; entities) - still quite unsuitable for parsing, but not actually an error page as I assumed. > what seems to happen though is that it's not a dictionary but a list > made up of multiple dictionaries is that right? Probably - the Bio.Entrez parser will turn the XML nested structure into lists and dictionaries as appropriate.
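As a rough illustration of digging values out of that kind of nested list/dictionary structure, here is a small sketch. It is written for Python 3 and assumes only that the parsed objects behave like ordinary lists and dicts; the helper name find_values and the sample data (including the key "Accession" and the fake accession strings) are made up for this example and are not the real Entrez Gene layout.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested lists/dicts."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the value may itself contain more matches
            for found in find_values(v, key):
                yield found
    elif isinstance(node, (list, tuple)):
        for item in node:
            for found in find_values(item, key):
                yield found


# Hypothetical data, NOT a real Entrez record: a list made up of dictionaries
parsed = [
    {"Id": "90", "Products": [{"Accession": "NM_fake1"}, {"Accession": "NP_fake1"}]},
    {"Id": "91", "Products": [{"Accession": "NM_fake2"}]},
]

accessions = list(find_values(parsed, "Accession"))
print(accessions)
```

Printing the top level keys of a real record first is probably the quickest way to find out which key names to search for.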
Going back to your original email, you just wanted "to parse out the nucleic acid accession numbers from an Entrez.efetch query made to the Gene database.", so I would actually suggest you should be using elink instead of efetch. See for example, http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html http://lists.open-bio.org/pipermail/biopython/2009-August/005472.html In your case something like this:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.elink(db="nuccore", dbfrom="gene",id="90", retmode="xml"))
>>> for db in data :
...     print "Links for", db["IdList"], "from database", db["DbFrom"]
...     for link in db["LinkSetDb"][0]["Link"] : print link["Id"]
...
Links for ['90'] from database gene
224589811
224514625
194387497
190194409
187169269
187169268
164694819
157724517
157696421
89161198
88958353
74230050
50504351
22450871
21707501
18097079
15668129
2295237
402184
338218

Peter From michael.koeris at gmail.com Fri Sep 11 13:52:40 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Fri, 11 Sep 2009 09:52:40 -0400 Subject: [Biopython] Formatting of the xml return from Entrez DB Gene is not parsed correctly In-Reply-To: <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com> References: <320fb6e00909110234v360db16anfdc60562b658037d@mail.gmail.com> <62983B26-3CBA-4375-9EED-148317130E34@gmail.com> <320fb6e00909110448k718f5541s344993ad477e0359@mail.gmail.com> <9D4DF5C0-D588-4621-B94C-3611BCAE79D0@gmail.com> <320fb6e00909110637o17983630xb5fed23a53ac76e7@mail.gmail.com> Message-ID: Hi Peter, that helps a lot. Indeed that's what I am really looking for. So the NCACs I get back appear to be in the order in which they appear in the GeneDB listing (from Chromosome to the various mRNA variants). Searching those then I can easily narrow it down further to the NM_* type listings I really need (since I am usually looking for the full length mRNA of all variants). Thanks!
Mike On Sep 11, 2009, at 9:37 AM, Peter wrote: > Hi Michael, > > I've CC'd this to the list. > > On Fri, Sep 11, 2009 at 1:51 PM, Michael S. Koeris > wrote: >> >> Yes indeed that does help - go dyslexia.... > > Easily done. Actually, on looking a little closer the NCBI returned > "XML presented with HTML" (full of < and > entities) - still > quite > unsuitable for parsing, but not actually an error page as I assumed. > >> what seems to happen though is that it's not a dictionary but a list >> made up of multiple dictionaries is that right? > > Probably - the Bio.Entrez parser will turn the XML nested structure > into > lists and dictionaries as appropriate. > > Going back to your original email, you just wanted "to parse out the > nucleic acid accession numbers from an Entrez.efetch query made > to the Gene database.", so I would actually suggest you should be > using elink instead of efetch. See for example, > > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html > http://lists.open-bio.org/pipermail/biopython/2009-August/005472.html > > In your case something like this: > >>>> from Bio import Entrez >>>> data = Entrez.read(Entrez.elink(db="nuccore", >>>> dbfrom="gene",id="90", retmode="xml")) >>>> for db in data : > ... print "Links for", db["IdList"], "from database", db["DbFrom"] > ... for link in db["LinkSetDb"][0]["Link"] : print link["Id"] > ... 
> Links for ['90'] from database gene > 224589811 > 224514625 > 194387497 > 190194409 > 187169269 > 187169268 > 164694819 > 157724517 > 157696421 > 89161198 > 88958353 > 74230050 > 50504351 > 22450871 > 21707501 > 18097079 > 15668129 > 2295237 > 402184 > 338218 > > Peter From alexanderdcastro at yahoo.com Mon Sep 14 04:22:21 2009 From: alexanderdcastro at yahoo.com (Alexander Castro) Date: Sun, 13 Sep 2009 21:22:21 -0700 (PDT) Subject: [Biopython] Need help in making a back table for the CodonTable object Message-ID: <655165.18001.qm@web38408.mail.mud.yahoo.com> Hello all, I apologize if this sounds elementary but how can I create a back table that will return more than one codon? For example, for table="Bacterial", codons CTT, CTG and CTA code for L (Leucine). How can I go back, given "L", to get CTT, CTG and CTA? The back_table dictionary only returns one value: output = bacterial_table.back_table["L"]. Any help would be greatly appreciated. Regards, Alexander Castro From biopython at maubp.freeserve.co.uk Mon Sep 14 09:05:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Sep 2009 10:05:45 +0100 Subject: [Biopython] Need help in making a back table for the CodonTable object In-Reply-To: <655165.18001.qm@web38408.mail.mud.yahoo.com> References: <655165.18001.qm@web38408.mail.mud.yahoo.com> Message-ID: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> On Mon, Sep 14, 2009 at 5:22 AM, Alexander Castro wrote: > Hello all, I apologize if this sounds elementary but how can I create a > back table that will return more than one codon? For example, for > table="Bacterial", codons CTT, CTG and CTA code for L (Leucine). > How can I go back, given "L", to get CTT, CTG and CTA? The > back_table dictionary only returns one value: > output = bacterial_table.back_table["L"]. > Any help would be greatly appreciated.
> Regards, > Alexander Castro I'm curious about what you are trying to do - we talked about the general problem of back-translation, and there was no consensus and therefore we didn't add a back translation method directly to the Seq object. As you are no doubt well aware, back translation is a one to many mapping, and even with ambiguous codons it is impossible to capture this fully with a simple string style representation of the nucleotide sequence. Anyway, the specific task of back translation of one amino acid to a list of codons is well defined. How about:

from Bio.Data.CodonTable import unambiguous_dna_by_name
table = unambiguous_dna_by_name["Bacterial"].forward_table
#Invert the codon -> amino acid mapping, keeping every codon per amino acid
back_table = {}
for codon, amino in table.iteritems() :
    try :
        back_table[amino].append(codon)
    except KeyError :
        back_table[amino] = [codon]
print back_table["L"]

Peter From dalke at dalkescientific.com Mon Sep 14 12:23:27 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 14 Sep 2009 14:23:27 +0200 Subject: [Biopython] Need help in making a back table for the CodonTable object In-Reply-To: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> References: <655165.18001.qm@web38408.mail.mud.yahoo.com> <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> Message-ID: On Sep 14, 2009, at 11:05 AM, Peter wrote: > Anyway, the specific task of back translation of one amino acid > to a list of codons is well defined.
How about:
>
> from Bio.Data.CodonTable import unambiguous_dna_by_name
> table = unambiguous_dna_by_name["Bacterial"].forward_table
> back_table = {}
> for codon, amino in table.iteritems() :
>     try :
>         back_table[amino].append(codon)
>     except KeyError :
>         back_table[amino] = [codon]
> print back_table["L"]

Quick pointer to 'defaultdict' added in Python 2.5:

import collections
#Missing keys start out as an empty list, so no try/except is needed
back_table = collections.defaultdict(list)
for codon, amino in table.iteritems() :
    back_table[amino].append(codon)
print back_table["L"]

If you really want a dictionary at the end instead of a defaultdict,

back_table = dict(back_table)

Andrew dalke at dalkescientific.com From alexanderdcastro at yahoo.com Mon Sep 14 14:12:41 2009 From: alexanderdcastro at yahoo.com (Alexander Castro) Date: Mon, 14 Sep 2009 07:12:41 -0700 (PDT) Subject: [Biopython] Need help in making a back table for the CodonTable object In-Reply-To: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> Message-ID: <780019.49409.qm@web38402.mail.mud.yahoo.com> Hello Peter, my goal was to replace one codon in a sequence with another to perform a "silent" mutation. I did not want to do this mutation manually, but rather offload the work to BioPython. Thanks for the code suggestion! I'll try it out as soon as I'm able. Regards, Alexander Castro --- On Mon, 9/14/09, Peter wrote: From: Peter Subject: Re: [Biopython] Need help in making a back table for the CodonTable object To: "Alexander Castro" Cc: biopython at lists.open-bio.org Date: Monday, September 14, 2009, 2:05 AM On Mon, Sep 14, 2009 at 5:22 AM, Alexander Castro wrote: > Hello all, I apologize if this sounds elementary but how can I create a > back table that will return more than one codon? For example, for > table="Bacterial", codons CTT, CTG and CTA code for L (Leucine). > How can I go back, given "L", to get CTT, CTG and CTA? The > back_table dictionary only returns one value: > output = bacterial_table.back_table["L"]. > Any help would be greatly appreciated.
> Regards, > Alexander Castro I'm curious about what you are trying to do - we talked about the general problem of back-translation, and there was no consensus and therefore we didn't add a back translation method directly to the Seq object. As you are no doubt well aware, back translation is a one to many mapping, and even with ambiguous codons it is impossible to capture this fully with a simple string style representation of the nucleotide sequence. Anyway, the specific task of back translation of one amino acid to a list of codons is well defined. How about:

from Bio.Data.CodonTable import unambiguous_dna_by_name
table = unambiguous_dna_by_name["Bacterial"].forward_table
back_table = {}
for codon, amino in table.iteritems() :
    try :
        back_table[amino].append(codon)
    except KeyError :
        back_table[amino] = [codon]
print back_table["L"]

Peter From biopython at maubp.freeserve.co.uk Mon Sep 14 14:56:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Sep 2009 15:56:07 +0100 Subject: [Biopython] Need help in making a back table for the CodonTable object In-Reply-To: <780019.49409.qm@web38402.mail.mud.yahoo.com> References: <320fb6e00909140205q4013c007wbc876b6d35d93fb8@mail.gmail.com> <780019.49409.qm@web38402.mail.mud.yahoo.com> Message-ID: <320fb6e00909140756y12364311q6f30eb24542cc3a2@mail.gmail.com> On Mon, Sep 14, 2009 at 3:12 PM, Alexander Castro wrote: > Hello Peter, my goal was replace one codon in a sequence with another > to perform a "silent" mutation. I did not want to do this mutation manually > and offload the work to BioPython. I see - leave most of the coding sequence as it was, but replace just one codon in order to have a synonymous mutation. This isn't just a simple back translation, but the reverse lookup table code I suggested should be useful here.
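For what it's worth, here is a rough sketch of how such a reverse lookup could drive a synonymous swap. It is written for Python 3 and deliberately avoids Biopython so that it runs on its own: the codon table is the standard genetic code typed out by hand (as far as I know the amino acid assignments of the Bacterial table are the same), stop codons are kept in the table as "*", and the helper names synonymous_codons and silent_mutation are made up for this example.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FORWARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Reverse lookup: amino acid -> list of all codons coding for it
BACK = defaultdict(list)
for codon, amino in FORWARD.items():
    BACK[amino].append(codon)


def synonymous_codons(codon):
    """Return the other codons coding for the same amino acid."""
    return [c for c in BACK[FORWARD[codon]] if c != codon]


def silent_mutation(seq, codon_index, new_codon):
    """Replace the codon at codon_index (0-based) if the swap is synonymous."""
    start = 3 * codon_index
    old_codon = seq[start:start + 3]
    if FORWARD[old_codon] != FORWARD[new_codon]:
        raise ValueError("%s -> %s would change the protein" % (old_codon, new_codon))
    return seq[:start] + new_codon + seq[start + 3:]


print(sorted(BACK["L"]))
print(silent_mutation("ATGCTTAAA", 1, "CTG"))
```

With Biopython installed, the forward_table from the earlier code could be used in place of the hand-typed FORWARD dictionary.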
Peter From natassa_g_2000 at yahoo.com Tue Sep 15 08:37:50 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Tue, 15 Sep 2009 01:37:50 -0700 (PDT) Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <4A9DF609.5050906@student.otago.ac.nz> Message-ID: <618545.48683.qm@web52012.mail.re2.yahoo.com> Hello, I have been using SeqIO to convert Illumina (v1.3+) *sequence.txt files (containing both quality and sequence info) to simple Fastas. This worked well until I tried it on new reads of 75 bp. I need to have them in a single line, so fiddling around with the code I guess I need to change the wrap=60 argument in the FastaIO/FastaWriter class to wrap=0 to make it work. Am I right? Are there any other bits of code that may be affected that I may have missed? I am sure this way of handling things is not a good one ;-) , so I was wondering if other people have had the same problem and how this class could be modified to address it in the future. Thanks, Anastasia Anastasia Gioti Post-Doc, Evolutionary Biology Department Uppsala University Norbyvägen 18D SE-752 36 UPPSALA anastasia.gioti at ebc.uu.se Tel: +46-18-471 2837 Fax: +46-18-471 6310 From biopython at maubp.freeserve.co.uk Tue Sep 15 10:11:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 11:11:05 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <618545.48683.qm@web52012.mail.re2.yahoo.com> References: <4A9DF609.5050906@student.otago.ac.nz> <618545.48683.qm@web52012.mail.re2.yahoo.com> Message-ID: <320fb6e00909150311k5e7b7f6p7abd9c16436a6ac8@mail.gmail.com> On Tue, Sep 15, 2009 at 9:37 AM, natassa wrote: > > Hallo, > I have been using SeqIO to convert Illumina (v1.3+) *sequence.txt files > (containing both quality and sequence info) to simple Fastas. Are you talking about Illumina FASTQ files? i.e. fastq-illumina in SeqIO? > This worked well until I tried it on new reads of 75 bp.
I need to have > them in a single line, so fiddling around with the code I guess I need to > change the wrap=60 argument in the FastaIO/FastaWriter class to > wrap=0 to make it work. Am I right? are there any other bits of code > that may be affected that I may have missed? Bio.SeqIO defaults to writing FASTA files with 60bp line wrapping. You want to output 75bp FASTA files without line wrapping? As an aside, why? Line wrapping is common and normal in FASTA files and in fact is more widely used than non-wrapping. If another software tool can't read line wrapped FASTA it has a bug in my opinion. > I am sure this way of handling things is not a good one;-) , so I was > wondering if other people have had the same problem and how this > class could be modified to adress it in the future. If you don't like the Bio.SeqIO.write(...) defaults, you can use the underlying writer which may offer some options. In the case of FASTA output, Bio.SeqIO.FastaIO allows you to set the wrapping. e.g.

from Bio import SeqIO
from Bio.SeqIO.FastaIO import FastaWriter
records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
handle = open("example.fasta", "w")
#wrap=80 is wide enough to keep 75bp reads on a single line
#(wrap=0 should switch the wrapping off altogether)
count = FastaWriter(handle, wrap=80).write_file(records)
handle.close()
print "Converted %i records" % count

Peter From biopython at maubp.freeserve.co.uk Tue Sep 15 12:19:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 13:19:25 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <922952.20788.qm@web52012.mail.re2.yahoo.com> References: <320fb6e00909150311k5e7b7f6p7abd9c16436a6ac8@mail.gmail.com> <922952.20788.qm@web52012.mail.re2.yahoo.com> Message-ID: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> On Tue, Sep 15, 2009 at 12:47 PM, natassa wrote: > > Hallo Peter, Hi again, Note I CC'd the mailing list again. >> >> Are you talking about Illumina FASTQ files? i.e. fastq-illumina in SeqIO?
>> > > I suppose so, I am quite confused with the names: > The format is (ex from a file):
> @HWI-EAS293:8:1:0:1311#0/1
> CGCCACTGTTTTTGAGGGACCGCGGGCAGCCGCGGATCCCCAACGCAAGCAGAGCTNNNNGGTTGAAATGACGCTC
> +HWI-EAS293:8:1:0:1311#0/1
> `W_a^a`T``aaa]YIRW^`_X_]XT_``a]U]VWP\^a````_Y_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> So this is fastq-illumina in SeqIO, right? That does look like a FASTQ file, and you probably know that it came from a Solexa/Illumina machine. However, it could be an early Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED scores ("fastq-illumina" in SeqIO). From the read length (76bp) I would guess this probably is a "fastq-illumina" file, but you should double check this, as it does matter for poor quality reads. >> Bio.SeqIO defaults to writing FASTA files with 60bp line wrapping. >> You want to output 75bp FASTA files without line wrapping? >> As an aside, why? > > Yes, that is what i want. I have paired-end reads in 2 separate files > that I need to combine in one single file for subsequent assembly by > the velvet program. There is a 3rd party perl script in velvet to do this, > and if I input to this program files converted to Fasta using the > default argument wrap=60, it does not behave correctly. [...] > >> Line wrapping is common and normal in >> FASTA files and in fact is more widely used than non-wrapping. >> If another software tool can't read line wrapped FASTA it has >> a bug in my opinion. > > You are most probably right, I can report the bug to the person who > wrote the script, but until now I thought a one-line format would be > the more common and convenient way. Please do report this bug in the perl-script, as it will help other people in future. If you prefer to work in Python, it should be easy to recreate a Biopython version of the same script. Which script are we talking about? Is it publicly available? >> If you don't like the Bio.SeqIO.write(...)
defaults, you can use the >> underlying writer which may offer some options. In the case of >> FASTA output, Bio.SeqIO.FastaIO allows you to set the wrapping. >> >> e.g. >> >> from Bio import SeqIO >> from Bio.SeqIO.FastaIO import FastaWriter >> records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina") >> handle = open("example.fasta", "w") >> count = FastaWriter(handle, wrap=80).write_file(records) >> handle.close() >> print "Converted %i records" % count >> > > Thanks, I will try this out, it should give the same result as directly > changing the FastaWriter arguments, but surely is a cleaner option! Yes, it should :) Regards, Peter From ivan at biodec.com Tue Sep 15 11:30:33 2009 From: ivan at biodec.com (Ivan Rossi) Date: Tue, 15 Sep 2009 13:30:33 +0200 (CEST) Subject: [Biopython] Announcing Plone4bio 1.0 Message-ID: ---------------------- Plone4bio 1.0 released ---------------------- BioDec is pleased to announce the new stable release of plone4bio. Plone4bio is a project to provide an integrated environment where it is possible to manage and analyze biological sequences within the Plone (http://plone.org) content management system. What's new ---------- In this release we have added a major feature, namely a product that let Plone act as a web interface to a BioSQL database. BioSQL (http://www.biosql.org/) is a generic relational model covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies (or controlled vocabularies): a BioSQL database (as you know) can be used seamlessly from BioPerl, BioPython, BioJava, and BioRuby. A BioSQL databases is connected with a standard connection string to a plone4bio adapter, and then the content of the BioSQL database can be searched, using the Plone internal search engine and the plone collections, it can be browsed, including the usual graphical mapping of the features on the sequence, and in general handled by the standard Plone CMS machinery. 
Download and Project page ------------------------- The software is available at http://www.plone4bio.org The documentation is available at http://www.plone4bio.org/trac/wiki/Install The SVN repository is available at http://www.plone4bio.org/svn/ plone4bio mailing list: http://ackbar.biodec.com/cgi-bin/mailman/listinfo/p4b plone4bio demo site (read-only): http://p4bdemo.biodec.com Installation notes ------------------ The package to install to have a full plone4bio site running is plone4bio.buildout The plone4bio.base is just the package that defines a skeleton predictor: deriving from that it is possible to integrate any other application and visualize all the results together. biocomp.pscoils is an example predictor, encapsulating the pscoils algorithm by Fariselli et al. available at http://www.biocomp.unibo.it/ It is intended both as an example on how to integrate one's own predictor in the plone4bio framework and as a ready-to-use predictor for coiled-coils. Requirements: - python2.4 - python setup tools (Debian users: the python-setuptools package) - BioPython - PIL - BioPerl for some graphics Further information -------------------- Either the web site or the mailing list p4b at biodec.com For installation and documentation issues refer to README.txt and INSTALL.txt files from the archive or the script published on the plone4bio wiki site. plone4bio is published under the GPL. 
To contact BioDec S.r.l.: - Email: info at biodec.com - WWW: http://www.biodec.com - Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 -- Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com From biopython at maubp.freeserve.co.uk Tue Sep 15 15:08:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:08:13 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <77266.93142.qm@web52005.mail.re2.yahoo.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> Message-ID: <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote: > >> That does look like a FASTQ file, and you probably know that it >> came from a Solexa/Illumina machine. However, it could be an early >> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), >> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED >> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I >> would guess this probably is an "fastq-illumina" file, but you >> should double check this, as it does matter for poor quality reads. > > Because you created some doubts in my already confused mind: > The machine is indeed Solexa/Illumina. I have 55bp and 76 bp > reads from pipeline v1.3 and v1.4, respectively. In the pipeline > manuals they say that the scoring scheme is Phred.? I know > there is a lot of confusion about the terms, this is why I > preferred to use the seqIO -I hope I did not mix the formats.... That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use PHRED scores (with a FASTQ ASCII offset of 64), and in Biopython we call this the "fastq-illumina" format. 
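To make the offset concrete, here is a minimal sketch (Python 3, standard library only) of turning such a quality string back into numbers. The function names are made up for this example, and it assumes the scores really are PHRED values stored with an ASCII offset of 64; the sample string is just a few characters of the kind seen in the quality line quoted earlier in this thread.

```python
def decode_quality(quality_string, offset=64):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_probability(score):
    """PHRED definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10.0)


scores = decode_quality("`W_aB")
print(scores)
print([round(phred_to_error_probability(q), 4) for q in scores])
```

A run of "B" characters therefore decodes to a score of 2 under this offset, i.e. very poor quality calls.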
Peter From natassa_g_2000 at yahoo.com Tue Sep 15 15:02:06 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Tue, 15 Sep 2009 08:02:06 -0700 (PDT) Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> Message-ID: <77266.93142.qm@web52005.mail.re2.yahoo.com> That does look like a FASTQ file, and you probably know that it came from a Solexa/Illumina machine. However, it could be an early Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED scores ("fastq-illumina" in SeqIO). From the read length (76bp) I would guess this probably is an "fastq-illumina" file, but you should double check this, as it does matter for poor quality reads. Because you created some doubts in my already confused mind: The machine is indeed Solexa/Illumina. I have 55bp and 76 bp reads from pipeline v1.3 and v1.4, respectively. In the pipeline manuals they say that the scoring scheme is Phred.? I know there is a lot of confusion about the terms, this is why I preferred to use the seqIO -I hope I did not mix the formats.... If you prefer to work in Python, it should be easy to recreate a Biopython version of the same script. Which script are we talking about? Is it publicly available? It is called shuffleSequences_fasta.pl and goes along with the (free) distribution of velvet (Zerbino, EBI). The script is really simple. 
Thanks again, Anastasia From biopython at maubp.freeserve.co.uk Tue Sep 15 15:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:14:52 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <77266.93142.qm@web52005.mail.re2.yahoo.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> Message-ID: <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote: > >> If you prefer to work in Python, it should be easy to recreate >> a Biopython version of the same script. Which script are we >> talking about? Is it publicly available? > > It is called shuffleSequences_fasta.pl and goes along with the > (free) distribution of velvet (Zerbino, EBI). The script is really > simple. Oh right - you can see the scripts on Daniel's github repository, http://github.com/dzerbino/velvet Both scripts are very very simple minded, which means fixing the bug will actually be a big change: shuffleSequences_fasta.pl appears to assume every FASTA entry is exactly two lines (a safe assumption for short reads like 36bp from early Solexa/Illumina), but not a safe choice in general as wrapping in FASTA is normal. shuffleSequences_fastq.pl appears to assume every FASTQ entry is exactly four lines (a reasonable assumption, especially for short reads like 36bp reads from early Solexa/Illumina), but not a safe choice in general as FASTQ files can also be wrapped (even if it is discouraged). We should be able to mimic these in Biopython using SeqIO... 
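To illustrate the point about line counts, here is a rough standard library sketch (Python 3) of interleaving two FASTA files without assuming each entry is exactly two lines. It is not a port of the perl scripts, just an outline: the function names are made up, and it assumes paired reads are named with a trailing /1 and /2 as in the example read shown in this thread. The output is written one sequence per line, which is what the two-line assumption needs.

```python
import io
from itertools import zip_longest


def read_fasta(handle):
    """Yield (title, sequence) pairs, joining any wrapped sequence lines."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def pair_key(title):
    """Read name without the /1 or /2 suffix (assumed naming scheme)."""
    name = title.split(None, 1)[0] if title.strip() else ""
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def interleave(handle_a, handle_b):
    """Yield records alternately from both files, checking that they pair up."""
    for rec_a, rec_b in zip_longest(read_fasta(handle_a), read_fasta(handle_b)):
        if rec_a is None or rec_b is None:
            raise ValueError("Input files have different numbers of reads")
        if pair_key(rec_a[0]) != pair_key(rec_b[0]):
            raise ValueError("Read names differ: %s vs %s" % (rec_a[0], rec_b[0]))
        yield rec_a
        yield rec_b


def write_unwrapped(records, handle):
    """Write each record as a title line plus one sequence line."""
    count = 0
    for title, seq in records:
        handle.write(">%s\n%s\n" % (title, seq))
        count += 1
    return count


# Tiny in-memory demonstration with a wrapped entry in each input
forward = io.StringIO(">r1/1\nACGT\nAC\n>r2/1\nGG\n")
reverse = io.StringIO(">r1/2\nTTTT\n>r2/2\nCC\nCC\n")
output = io.StringIO()
written = write_unwrapped(interleave(forward, reverse), output)
print(output.getvalue())
```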
Peter From biopython at maubp.freeserve.co.uk Tue Sep 15 16:43:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 17:43:04 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> Message-ID: <320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com> On Tue, Sep 15, 2009 at 4:14 PM, Peter wrote: > On Tue, Sep 15, 2009 at 4:02 PM, natassa wrote: >> >>> If you prefer to work in Python, it should be easy to recreate >>> a Biopython version of the same script. Which script are we >>> talking about? Is it publicly available? >> >> It is called shuffleSequences_fasta.pl and goes along with the >> (free) distribution of velvet (Zerbino, EBI). The script is really >> simple. > > Oh right - you can see the scripts on Daniel's github repository, > http://github.com/dzerbino/velvet > > Both scripts are very very simple minded, which means fixing > the bug will actually be a big change: > > shuffleSequences_fasta.pl appears to assume every FASTA > entry is exactly two lines (a safe assumption for short reads > like 36bp from early Solexa/Illumina), but not a safe choice > in general as wrapping in FASTA is normal. > > shuffleSequences_fastq.pl appears to assume every FASTQ > entry is exactly four lines (a reasonable assumption, especially > for short reads like 36bp reads from early Solexa/Illumina), > but not a safe choice in general as FASTQ files can also be > wrapped (even if it is discouraged). > > We should be able to mimic these in Biopython using SeqIO... How about this? I'm using itertools.izip_longest from Python 2.6+ which should make sure the two input files have the same number of reads. 
Using itertools.izip would also work, but will silently ignore any extra records in one file.

import itertools
from Bio import SeqIO
#Setup variables (could parse command line args here)
fileA = "SRR001666_1.fastq"
fileB = "SRR001666_2.fastq"
fileOut = "SRR001666_interleaved.fastq"
format = "fastq"
#Prepare the input (using iterators for memory efficiency)
recordsA = SeqIO.parse(open(fileA,"rU"), format)
recordsB = SeqIO.parse(open(fileB,"rU"), format)
records = itertools.izip_longest(recordsA, recordsB)
#Finally do the interleaved output:
handle = open(fileOut, "w")
count = SeqIO.write(records, handle, format)
handle.close()
print "%i records written to %s" % (count, fileOut)

Rather than using itertools, you could also write a simple generator function to do the pairing explicitly. Assuming you are dealing with paired end reads, it would make sense to explicitly check the IDs match up as expected.
Peter
P.S. This uses the default output settings, so if you did this for FASTA it would use line wrapping.
From cjfields at illinois.edu Tue Sep 15 17:43:26 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 15 Sep 2009 12:43:26 -0500 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> Message-ID: On Sep 15, 2009, at 10:08 AM, Peter wrote: > On Tue, Sep 15, 2009 at 4:02 PM, natassa > wrote: >> >>> That does look like a FASTQ file, and you probably know that it >>> came from a Solexa/Illumina machine. However, it could be an early >>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), >>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED >>> scores ("fastq-illumina" in SeqIO).
From the read length (76bp) I >>> would guess this probably is an "fastq-illumina" file, but you >>> should double check this, as it does matter for poor quality reads. >> >> Because you created some doubts in my already confused mind: >> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp >> reads from pipeline v1.3 and v1.4, respectively. In the pipeline >> manuals they say that the scoring scheme is Phred. I know >> there is a lot of confusion about the terms, this is why I >> preferred to use the seqIO -I hope I did not mix the formats.... > > That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use > PHRED scores (with a FASTQ ASCII offset of 64), and in > Biopython we call this the "fastq-illumina" format. > > Peter I should add a very important caveat here. As I had mentioned to Peter I met with our local nextgen sequencing lead and was able to check the Illumina 1.4 pipeline manual. It indicates the ASCII offset for FASTQ is correct (64), but the quality score is calculated as (pg 122 of Genome Pipeline manual for 1.4): Q = 10*log10(p/(1-p)) Look familiar? Hint: it's not PHRED. I'm wondering if anyone else can confirm this, as it appears Illumina has switched back to using Solexa scores again. chris From cjfields at illinois.edu Tue Sep 15 19:27:57 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 15 Sep 2009 14:27:57 -0500 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> Message-ID: <16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu> On Sep 15, 2009, at 12:43 PM, Chris Fields wrote: > On Sep 15, 2009, at 10:08 AM, Peter wrote: > >> On Tue, Sep 15, 2009 at 4:02 PM, natassa >> wrote: >>> >>>> That does look like a FASTQ file, and you probably know that it >>>> came from a Solexa/Illumina machine. 
However, it could be an early >>>> Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO), >>>> or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED >>>> scores ("fastq-illumina" in SeqIO). From the read length (76bp) I >>>> would guess this probably is an "fastq-illumina" file, but you >>>> should double check this, as it does matter for poor quality reads. >>> >>> Because you created some doubts in my already confused mind: >>> The machine is indeed Solexa/Illumina. I have 55bp and 76 bp >>> reads from pipeline v1.3 and v1.4, respectively. In the pipeline >>> manuals they say that the scoring scheme is Phred. I know >>> there is a lot of confusion about the terms, this is why I >>> preferred to use the seqIO -I hope I did not mix the formats.... >> >> That's fine then - the Solexa/Illumina 1.3 and 1.4 pipelines use >> PHRED scores (with a FASTQ ASCII offset of 64), and in >> Biopython we call this the "fastq-illumina" format. >> >> Peter > > I should add a very important caveat here. As I had mentioned to > Peter I met with our local nextgen sequencing lead and was able to > check the Illumina 1.4 pipeline manual. It indicates the ASCII > offset for FASTQ is correct (64), but the quality score is > calculated as (pg 122 of Genome Pipeline manual for 1.4): > > Q = 10*log10(p/(1-p)) > > Look familiar? Hint: it's not PHRED. I'm wondering if anyone else > can confirm this, as it appears Illumina has switched back to using > Solexa scores again. > > chris Just got off the phone with Illumina customer support to double-check this, and I think it may be a false alarm, though I'm getting conflicting accounts (our local guys say it's solexa, not PHRED qual scores). According to Illumina tech support, qual scores coming off the 1.4 pipeline should be converted over to PHRED scores prior to output (what natassa mentions). 
The manual refers to the older (Solexa/Illumina 1.0) scoring b/c that particular qual scoring option can be specified instead of PHRED. If anyone out there using the 1.4 pipeline can confirm this that would be most helpful, as all the Bio* toolkits and EMBOSS are updating FASTQ parsing. chris From biopython at maubp.freeserve.co.uk Wed Sep 16 10:01:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 11:01:18 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150808p3ebad107rcf872fdb9737baa3@mail.gmail.com> <16B3156A-FC16-4003-985F-582AAE5E134C@illinois.edu> Message-ID: <320fb6e00909160301q54d3fed7yf941f8a96581ba49@mail.gmail.com> On Tue, Sep 15, 2009 at 8:27 PM, Chris Fields wrote: > Just got off the phone with Illumina customer support to double-check this, > and I think it may be a false alarm, though I'm getting conflicting accounts > (our local guys say it's solexa, not PHRED qual scores). > > According to Illumina tech support, qual scores coming off the 1.4 pipeline > should be converted over to PHRED scores prior to output (what natassa > mentions). The manual refers to the older (Solexa/Illumina 1.0) scoring b/c > that particular qual scoring option can be specified instead of PHRED. I gather via seqanswers.com that there are still some (out of date) references to the old Solexa scoring in the current Illumina pipeline documentation, so I expect this is just a false alarm...
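For anyone trying to follow the PHRED versus Solexa distinction in this thread, here is a rough sketch of how the two scores relate. It assumes p is the estimated error probability, in which case the usual definitions carry a minus sign: PHRED is -10*log10(p) and Solexa is -10*log10(p/(1-p)). The function names are made up for illustration and are not part of Biopython.

```python
import math

def phred_quality(p_error):
    """PHRED quality from an error probability: Q = -10*log10(p)."""
    return -10.0 * math.log10(p_error)

def solexa_quality(p_error):
    """Solexa quality from an error probability: Q = -10*log10(p/(1-p))."""
    return -10.0 * math.log10(p_error / (1.0 - p_error))

def phred_from_solexa(q_solexa):
    """Convert a Solexa score to the PHRED score for the same error probability."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

# Compare the two scales for a poor, a mediocre and a good base call.
for p in (0.5, 0.1, 0.001):
    print("p=%g  PHRED=%.2f  Solexa=%.2f" % (p, phred_quality(p), solexa_quality(p)))
```

For small error probabilities the two scores are almost identical; they only drift apart for low quality bases, which is why the choice of format matters mainly for poor quality reads.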
Peter
From biopython at maubp.freeserve.co.uk Wed Sep 16 10:44:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 11:44:09 +0100 Subject: [Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp In-Reply-To: <320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com> References: <320fb6e00909150519j6d8efdf1redc6a4e8fc1fe217@mail.gmail.com> <77266.93142.qm@web52005.mail.re2.yahoo.com> <320fb6e00909150814k6620f759u8c512bfab43bf75@mail.gmail.com> <320fb6e00909150943v23aa743cpc481c31a91da7e3a@mail.gmail.com> Message-ID: <320fb6e00909160344u703b62d3t3d05b381539ebb8c@mail.gmail.com>
On Tue, Sep 15, 2009 at 5:43 PM, Peter wrote:
> Rather than using itertools, you could also write a simple generator
> function to do the pairing explicitly. Assuming you are dealing with
> paired end reads, it would make sense to explicitly check the IDs
> match up as expected.

I confess I didn't actually test that example (I don't have Python 2.6 on this machine), and I had misread the itertools.izip_longest documentation - that won't actually work as is. Sorry :(

Instead, here is a simple interleaving using a generator function, which I *have* tested on Python 2.5:

from Bio import SeqIO
#Setup variables (could parse command line args here)
fileA = "SRR001666_1.fastq"
fileB = "SRR001666_2.fastq"
fileOut = "SRR001666_interleaved.fastq"
format = "fastq"
#Setup the input
def interleave(iter1, iter2) :
    while True :
        yield iter1.next()
        yield iter2.next()
recordsA = SeqIO.parse(open(fileA,"rU"), format)
recordsB = SeqIO.parse(open(fileB,"rU"), format)
records = interleave(recordsA, recordsB)
#Now the output
handle = open(fileOut, "w")
count = SeqIO.write(records, handle, format)
handle.close()
print "%i records written to %s" % (count, fileOut)

Note that this does not check that the number of records in the two files matches, nor does it do any explicit test on the record ids.
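If you do want those checks, something along these lines might work. This is only a sketch: the names pair_name and interleave_checked are made up for illustration, and it assumes the paired reads are named with /1 and /2 suffixes, which may not be true for your files. It relies on the built-in next() function (Python 2.6 or later) and only needs each record to have an id attribute, as SeqRecord objects do.

```python
def pair_name(read_id):
    """Strip a trailing /1 or /2 (an assumed naming convention for read pairs)."""
    if read_id.endswith("/1") or read_id.endswith("/2"):
        return read_id[:-2]
    return read_id

def interleave_checked(iter1, iter2):
    """Yield records alternately, checking the record counts and IDs agree."""
    iter1, iter2 = iter(iter1), iter(iter2)
    done = object()  # sentinel marking an exhausted iterator
    while True:
        rec1 = next(iter1, done)
        rec2 = next(iter2, done)
        if rec1 is done and rec2 is done:
            return  # both inputs finished together
        if rec1 is done or rec2 is done:
            raise ValueError("The two inputs have different numbers of records")
        if pair_name(rec1.id) != pair_name(rec2.id):
            raise ValueError("Read IDs do not pair up: %s vs %s"
                             % (rec1.id, rec2.id))
        yield rec1
        yield rec2
```

It would then be used in place of the plain generator, e.g. records = interleave_checked(recordsA, recordsB), with the rest of the script unchanged.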
Peter From peter at maubp.freeserve.co.uk Wed Sep 16 11:08:42 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 12:08:42 +0100 Subject: [Biopython] [Velvet-users] Shuffler for wrapped fastas In-Reply-To: <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> Message-ID: <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> Hi Velvet users, I've also CC'd the biopython mailing list (and added links to the velvet mailing list archive posts), as this conversation might be better off continued there. Peter wrote: http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000567.html >> Nice to see people using Biopython - although that doesn't look like the >> most efficient way to write the code (I say without timing it!). See also: >> http://lists.open-bio.org/pipermail/biopython/2009-September/005579.html >> >> And of course these Bio.SeqIO scripts would also work for FASTQ files >> if you just switch the format name. If you wanted to do more error checking, >> you could compare the record IDs (to check they are paired reads), and >> you should make sure both files have the same number of records. Tanya wrote: http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000568.html > Absolutely, that's why I said it was slow -- but easy and efficient > enough for moderate-size fasta files. If you look at the attached > script, I wrote the function shuffle_reads in a way that lets you import > it and change the format type (or you could change the command line args > to pass it in, of course, but I couldn't be bothered). Also easy enough > to check the headers match. However, fastq does require the latest > Biopython release, and not everyone has it. 
> > Personally I don't use Biopython for shuffling fastq files because with > millions of short reads, it's hugely inefficient (tenfold runtime > difference) to go through creating a Biopython object for each read. > So for short reads, I'd go with straightforward shuffling, plain text.

You are right that if performance really is a bottleneck, the Biopython SeqIO system can be a limiting factor due to the creation of SeqRecord objects. However, it isn't quite as bad as you think - your code is making it worse by using the format method on each record (which internally calls SeqIO), instead of a single call to Bio.SeqIO.write() as recommended in our documentation. This version is about twice as fast as your original:

import sys
from Bio import SeqIO
def interleave(iter1, iter2) :
    while True :
        yield iter1.next()
        yield iter2.next()
f1, f2 = open(sys.argv[1]), open(sys.argv[2])
outfile = open(sys.argv[3], 'w')
format = 'fasta' #or "fastq" or ...
records = interleave(SeqIO.parse(f1, format), SeqIO.parse(f2, format))
count = SeqIO.write(records, outfile, format)
outfile.close()

(It still assumes the input files are properly paired)

I agree that a lower level FASTA parser using python strings could be another five times faster - but be aware that this gives up the generality of the SeqIO system whereby this could be used on any supported file format.
Peter
From pzs at dcs.gla.ac.uk Wed Sep 16 16:14:20 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Wed, 16 Sep 2009 17:14:20 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> References: <4AA91A14.10602@dcs.gla.ac.uk> <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> Message-ID: <4AB10EDC.3060908@dcs.gla.ac.uk> Peter wrote: > Yes, you must create a SeqRecord object with suitable SeqFeature objects, > and then write it out with SeqIO in GenBank format. If all your features have > trivial locations, this is pretty easy. > Thanks for this.
I've managed to get this to work, but encountered a few minor issues.

I already have GenBank files created by CLC Genomics Workbench 3 but I want to make these in a script. The CLC generated GenBank files look like this:

LOCUS       Setd2-tagged           11750 bp    DNA     linear   UNA
FEATURES             Location/Qualifiers
     misc_feature    1..50
                     /label="Subcloning HA Upstream"
...(snip other features)

ORIGIN
        1 TTGGTGTGAG CTCTTTGTGT CTTGCCTAAG TATGTGCATC TGTCTTGTCT

...(snip sequence)

To do this in biopython, I need to create my feature thus:

sf = SeqFeature.SeqFeature(SeqFeature.FeatureLocation(0,50),
         type="misc_feature", qualifiers = { "label" : [ "Subcloning HA Upstream" ]})

The issues I had were:

- In the docstring for SeqFeature, it says the attribute is "qualifier" but it should be "qualifiers".

- My first stab at the qualifiers argument was to do

qualifiers = { "label" : "mylabel" }

but if I do that, it iterates over "mylabel" giving me one "label" for each character! Maybe the qualifier printer should check it's being given a list and not a string?

- I'd like to remove some of the extraneous header from the GenBank file:

DEFINITION  .
ACCESSION
VERSION
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .

Is this possible?

Sorry for the long message,
Peter
From biopython at maubp.freeserve.co.uk Wed Sep 16 16:31:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 17:31:27 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <4AB10EDC.3060908@dcs.gla.ac.uk> References: <4AA91A14.10602@dcs.gla.ac.uk> <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> <4AB10EDC.3060908@dcs.gla.ac.uk> Message-ID: <320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com> On Wed, Sep 16, 2009 at 5:14 PM, Peter Saffrey wrote: > Peter wrote: >> >> Yes, you must create a SeqRecord object with suitable SeqFeature objects, >> and then write it out with SeqIO in GenBank format. If all your features >> have trivial locations, this is pretty easy. > > Thanks for this.
> I've managed to get this to work, but encountered a few
> minor issues.
>
> I already have GenBank files created by CLC Genomics Workbench 3 but I want
> to make these in a script. The CLC generated GenBank files look like this:
>
> LOCUS       Setd2-tagged           11750 bp    DNA     linear   UNA
> FEATURES             Location/Qualifiers
>      misc_feature    1..50
>                      /label="Subcloning HA Upstream"
> ...(snip other features)
>
> ORIGIN
>         1 TTGGTGTGAG CTCTTTGTGT CTTGCCTAAG TATGTGCATC TGTCTTGTCT
>
> ...(snip sequence)
>
>
> To do this in biopython, I need to create my feature thus:
>
> sf = SeqFeature.SeqFeature(SeqFeature.FeatureLocation(0,50),
> type="misc_feature", qualifiers = { "label" : [ "Subcloning HA Upstream" ]})
>
> The issues I had were:
>
> - In the docstring for SeqFeature, it says the attribute is "qualifier" but
> it should be "qualifiers".

I've fixed that in CVS - thanks for reporting it.

> - My first stab at the qualifiers argument was to do
>
> qualifiers = { "label" : "mylabel" }
>
> but if I do that, it iterates over "mylabel" giving me one "label" for each
> character! Maybe the qualifier printer should check it's being given a list
> and not a string?

As you have realised, based on what the GenBank (and other) parsers do, the GenBank output code was expecting the qualifier values to be a list (of strings). There are similar issues in the BioSQL code, and yes, I agree we should cope with either here too.

> - I'd like to remove some of the extraneous header from the GenBank file:
>
> DEFINITION  .
> ACCESSION
> VERSION
> KEYWORDS    .
> SOURCE      .
>   ORGANISM  .
>             .
>
> Is this possible?
>

Why would you want to? They are there deliberately, as according to the NCBI GenBank release notes (which pretty much is the official file format definition) those are all mandatory keywords, so should be present (even if with just a dot/period indicating no data).
I would regard the CLC Genomics Workbench 3 output as technically out of spec. > > Sorry for the long message, > Not at all. Peter C. From biopython at maubp.freeserve.co.uk Wed Sep 16 16:44:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 17:44:15 +0100 Subject: [Biopython] Creating GenBank files In-Reply-To: <320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com> References: <4AA91A14.10602@dcs.gla.ac.uk> <320fb6e00909100845wb072bd6rc09133d0c0c745bf@mail.gmail.com> <4AB10EDC.3060908@dcs.gla.ac.uk> <320fb6e00909160931y3669a524maea5be3e8cc3e1a7@mail.gmail.com> Message-ID: <320fb6e00909160944r75aa2128y32005bea1b18b48@mail.gmail.com> On Wed, Sep 16, 2009 at 5:31 PM, Peter wrote: > On Wed, Sep 16, 2009 at 5:14 PM, Peter Saffrey wrote: >> >> - My first stab at the qualifiers argument was to do >> >> qualifiers = { "label" : "mylabel" } >> >> but if I do that, it iterates over "mylabel" giving me one "label" for each >> character! Maybe the qualifier printer should check it's being given a list >> and not a string? > > As you have realised, based on what the GenBank (and other) parsers > do, the GenBank output code was expecting the qualifier values to be > a list (of strings). There are similar issues in the BioSQL code, and yes, > I agree we should cope with either here too. Fixed in CVS. Peter From fchiu at newton.berkeley.edu Wed Sep 16 23:18:20 2009 From: fchiu at newton.berkeley.edu (Finsen Chiu) Date: Wed, 16 Sep 2009 16:18:20 -0700 Subject: [Biopython] Permissions for Bio, BioSQL, Martel Message-ID: <4AB1723C.2040004@newton.berkeley.edu> Hi all, What should the correct permission settings be for the Bio, BioSQL, Martel directories and its sub folders/files? When I install biopython as root, the directories and files created have permissions of either 600 or 400, which is not very helpful if I am installing as a sys admin since other users can't import anything. 
I also tried doing a umask 022 before installing and encountered write permission errors for non-root users, as follows:

======================================================================
ERROR: test_Clustalw_tool
----------------------------------------------------------------------
[Errno 13] Permission denied: 'Clustalw/temp horses.fasta'
======================================================================
ERROR: Ensure that we can write proper FASTA output files.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Motif.py", line 65, in test_FAoutput
    output_handle = open(self.FAout, "w")
IOError: [Errno 13] Permission denied: 'Motif/fa.out'
======================================================================
ERROR: Ensure that we can write proper TransFac output files.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Motif.py", line 72, in test_TFoutput
    output_handle = open(self.TFout, "w")
IOError: [Errno 13] Permission denied: 'Motif/tf.out'
======================================================================
ERROR: Ensure that we can write proper pfm output files.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Motif.py", line 79, in test_pfm_output
    output_handle = open(self.PFMout, "w")
IOError: [Errno 13] Permission denied: 'Motif/fa.out'
======================================================================
ERROR: Reading and writing motifs to a file
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_NNGene.py", line 48, in test_motif
    output_handle = open(self.test_file, "w")
IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt'
======================================================================
ERROR: Reading and writing schemas to a file.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_NNGene.py", line 81, in test_schema
    output_handle = open(self.test_file, "w")
IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt'
======================================================================
ERROR: Reading and writing signatures to a file.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_NNGene.py", line 114, in test_signature
    output_handle = open(self.test_file, "w")
IOError: [Errno 13] Permission denied: 'NeuralNetwork/patternio.txt'
======================================================================
ERROR: Full template creation test
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PopGen_SimCoal_nodepend.py", line 23, in test_template_full
    'PopGen')
  File "~/biopython-1.50/build/lib.linux-i686-2.3/Bio/PopGen/SimCoal/Template.py", line 210, in generate_simcoal_from_template
    stream = open(out_dir + sep + 'tmp.par', 'w')
IOError: [Errno 13] Permission denied: 'PopGen/tmp.par'
======================================================================
ERROR: test_SeqUtils
----------------------------------------------------------------------
[Errno 13] Permission denied: 'fasta.tmp'
======================================================================

How can I bypass these errors without giving non-root users write permission?
Thanks, Finsen
From biopython at maubp.freeserve.co.uk Wed Sep 16 23:35:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 00:35:34 +0100 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <4AB1723C.2040004@newton.berkeley.edu> References: <4AB1723C.2040004@newton.berkeley.edu> Message-ID: <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com>
Hi, What OS are you using?
And did you try just the normal installation:

python setup.py build
python setup.py test
sudo python setup.py install

That would normally work on Linux/Unix.

On Thu, Sep 17, 2009 at 12:18 AM, Finsen Chiu wrote: > Hi all, > > I also tried doing a umask 022 before installing and encountered write > permission errors for non-root users, as follows: > > ====================================================================== > ERROR: test_Clustalw_tool > ---------------------------------------------------------------------- > [Errno 13] Permission denied: 'Clustalw/temp horses.fasta' etc

The above could happen if you did the build and test as root, and the temp test files were left behind (e.g. if interrupted). If you later rerun the tests as a normal user you can't delete them (because they belong to root).
Peter
From fchiu at newton.berkeley.edu Thu Sep 17 00:00:18 2009 From: fchiu at newton.berkeley.edu (Finsen Chiu) Date: Wed, 16 Sep 2009 17:00:18 -0700 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> Message-ID: <4AB17C12.60806@newton.berkeley.edu>
I am using RHEL4. I built and installed as root and both went smoothly without interruption. Running the test as root is fine, but the permission denied errors came when running the test as a user. So, I wonder if I need to give users write permission to those files (which I do not want to do) or if those errors are negligible. Thanks, Finsen

Peter wrote: > Hi, > > What OS are you using? And did you try just the normal installation: > > python setup.py build > python setup.py test > sudo python setup.py install > > That would normally work on Linux/Unix.
> > On Thu, Sep 17, 2009 at 12:18 AM, Finsen Chiu wrote: > >> Hi all, >> >> I also tried doing a umask 022 before installing and encountered write >> permission errors for non-root users, as follows: >> >> ====================================================================== >> ERROR: test_Clustalw_tool >> ---------------------------------------------------------------------- >> [Errno 13] Permission denied: 'Clustalw/temp horses.fasta' >> > etc > > The above could happen if you did the build and test as root, and > the temp test files were left behind (e.g. if interrupted). If you later > rerun the tests as a normal user you can't delete them (because they > belong to root). > > Peter > > From biopython at maubp.freeserve.co.uk Thu Sep 17 09:15:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 10:15:06 +0100 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <4AB17C12.60806@newton.berkeley.edu> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> <4AB17C12.60806@newton.berkeley.edu> Message-ID: <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> On Thu, Sep 17, 2009 at 1:00 AM, Finsen Chiu wrote: > I am using RHEL4 > > I built and installed as root and both went smoothly without interruption. OK, good that it worked. But I don't normally do the build and test as root, just use sudo for the install at the end. > Running the test as root is fine, but the permission denied errors came when > running the test as a user. > > So, I wonder if I need to give users write permission to those files (which > I do not want to do) or if those errors are negligible. Switching the permissions would work, but the best solution is not to run the tests as root in the first place. If you do the build and test as a normal user, all the temp test files will have normal user permissions.
[Normally there shouldn't be any left over temp files from the tests, so I guess a test failed or was interrupted before it could clean up]
Peter
From jhcepas at gmail.com Thu Sep 17 17:58:38 2009 From: jhcepas at gmail.com (Jaime Huerta Cepas) Date: Thu, 17 Sep 2009 19:58:38 +0200 Subject: [Biopython] ETE: a python Environment for Tree Exploration Message-ID:
Hi all,

I have recently finished a piece of software (ETE) that might be of interest to you.

ETE (Environment for Tree Exploration) is a python toolkit to analyze, manipulate and visualize hierarchical trees. It can handle any kind of tree, but it includes specific data types for loading phylogenetic and clustering trees. Besides many tree handling options it provides some analytic methods such as orthology/paralogy prediction or cluster validation. The toolkit is GPL and aims to be very flexible and configurable, so it can be used together with other toolkits such as BioPython.

The development of this software responds to the needs of our own group during the last years, and I hope it can now be useful for the bioinformatics community. Bugs and comments are very welcome, as well as any idea for a better integration with Biopython :)

Program, documentation and examples can be found at http://ete.cgenomics.org

Hope it's useful for you!

cheers,
Jaime

** Summary of main ETE's features: **

General trees
==========
Advanced node annotation, tree topology manipulation, automatic tree pruning, cut & paste partitions, trees concatenation, random trees generation, iterate over leaves and descendants, pre and post-order tree traversal, root and unroot options, advanced nodes search, get distances among nodes, detect midpoint outgroup, find farthest descendant node, find farthest node in the whole tree, detect first common ancestor among nodes, text mode visualization, newick rendering (several formats), extended newick format integration, support for built-in python operations: print, len, iter, in.

Phylogenetic trees
================
Link to multiple sequence alignments, automatic species name detection, check node monophyly, evolutionary events dating, detect orthology and paralogy relationships: species overlap and tree reconciliation methods, complete access API to the phylomeDB database, integrated visualization (show molecular sequences and evolutionary events).

Clustering trees
==============
Link to numerical matrices, calculate inter and intra-cluster distances among clusters, calculate Silhouette and Dunn Indexes, integrated visualization (display numerical profiles in several formats).

Treeview extension
===============
Interactive Graphical User Interface (GUI), programmable drawing engine, independent node aspect editing, support drawing node extra features (text or external images), vector graphics rendering using PDF format.

--
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88 PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================
From carlos.borroto at gmail.com Fri Sep 18 16:59:58 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Fri, 18 Sep 2009 12:59:58 -0400 Subject: [Biopython] Searching for and downloading sequences using the history Message-ID: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
Hi all,

I'm trying to download all of the ESTs from a species. I'm following the example in the tutorial, which seems to be exactly what I need.
But I am running into this problem:

>>> from Bio import Entrez
>>> Entrez.email = "carlos.borroto at gmail.com"
>>> dbname = "nucest"
>>> query_term = "Genus specie"
>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> len(search_results["IdList"])
20
>>> print search_results["Count"]
193951

So the assert statement is failing:

>>> gi_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(gi_list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

And most importantly, I'm not getting all of the ids. Does anyone know what I'm doing wrong?

thanks in advance
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020
From carlos.borroto at gmail.com Fri Sep 18 17:15:41 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Fri, 18 Sep 2009 13:15:41 -0400 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> Message-ID: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com>
On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto wrote:
> Hi all,
>
> I'm trying to download all of the ESTs from a species. I'm following the
> example in the tutorial, which seems to be exactly what I need.
> But I am running into this problem:
>
> >>> from Bio import Entrez
> >>> Entrez.email = "carlos.borroto at gmail.com"
> >>> dbname = "nucest"
> >>> query_term = "Genus specie"
> >>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y")
> >>> search_results = Entrez.read(search_handle)
> >>> search_handle.close()
> >>> len(search_results["IdList"])
> 20
> >>> print search_results["Count"]
> 193951
>
> So the assert statement is failing:
> >>> gi_list = search_results["IdList"]
> >>> count = int(search_results["Count"])
> >>> assert count == len(gi_list)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
>

I just found this: http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html

So I tested this:

>>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951)
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> print search_results["Count"]
193951
>>> len(search_results["IdList"])
100000

Still not the complete list. Maybe there is a maximum number of results you can get, and I see there is a retstart parameter, so I'm guessing the only way to get all of the ids is to divide my search into batches using retstart. Am I right? I'm going to implement this and will share it here.

regards,
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020
From fredgca at hotmail.com Fri Sep 18 17:51:20 2009 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Fri, 18 Sep 2009 17:51:20 +0000 Subject: [Biopython] ETE: a python Environment for Tree Exploration In-Reply-To: References: Message-ID:
Jaime, Thanks for sharing. Very useful!
Frederico Arnoldi > From: biopython-request at lists.open-bio.org > Subject: Biopython Digest, Vol 81, Issue 21 > To: biopython at lists.open-bio.org > Date: Fri, 18 Sep 2009 12:00:02 -0400 > Date: Thu, 17 Sep 2009 19:58:38 +0200 > From: Jaime Huerta Cepas > Subject: [Biopython] ETE: a python Environment for Tree Exploration > To: biopython at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Hi all, > > I have recently finished a software (ETE) that might be of your interest. > > ETE (Environment for tree Exploration) is a python toolkit to analyze, > manipulate and visualize hierarchical trees. It allows to deal with any kind > of tree, but it includes specific data types for loading phylogenetic and > clustering trees. Besides many tree handling options it provides some > analytic methods such as orthology/paralogy prediction or cluster > validation. The toolkit is GPL and aims to be very flexible and > configurable, so it can be used together with other toolkits such as > BioPython. > > The development of this software responses to the needs of our own group > during the last years, and I hope it can be now useful for the > bioinformatics community. > Bugs and comments are very welcome, as well as any idea for a better > integration with Biopython :) > > program, documentation and examples can be found at http://ete.cgenomics.org > > hope it's useful for you! 
> > cheers, > Jaime > > ** Summary of main ETE's features: ** > > General trees > ========== > Advanced node annotation, tree topology manipulation, automatic tree > prunning, cut \& paste partitions, trees concatenation, random trees > generation, iterate over leaves and descendants, pre and pos-order tree > traversion, root and unroot options, advanced nodes search, get distances > among nodes, detect midpoint outgroup, find farthest descendant node, find > farthest node in the whole tree, detect first common ancestor among nodes, > text mode visualization, newick rendering (several formats), extended newick > format integration, support for built-in python operations: print, len, > iter, in. > > Phylogenetic trees > ================ > Link to multiple sequence alignments, automatic species name detection, > check node monophily, evolutionary events dating, detect orthology and > paralogy relationships: species overlap and tree reconciliation methods, > complete access API to the phylomeDB database, integrated visualization > (show molecular sequences and evolutionary events). > > Clustering trees > ============== > link to numerical matrices, calculate inter and intra-cluster distances > among clusters, calculate Silhouette and Dunn Indexes, integrated > visualization (display numerical profiles in several formats). > > Treeview extension > =============== > Interactive Graphical User Interface (GUI), programmable drawing engine, > independent node aspect editing, support drawing node extra features (text > or external images), vector graphics rendering using PDF format. > > > -- > ========================= > Jaime Huerta-Cepas, Ph.D. > CRG-Centre for Genomic Regulation > Doctor Aiguader, 88 > PRBB Building > 08003 Barcelona, Spain > http://www.crg.es/comparative_genomics > ========================= > _________________________________________________________________ Voc? sabia que o Hotmail mudou? Clique e descubra as novidades. 
http://www.microsoft.com/brasil/windows/windowslive/products/hotmail.aspx From carlos.borroto at gmail.com Fri Sep 18 18:08:14 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Fri, 18 Sep 2009 14:08:14 -0400 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> Message-ID: <65d4b7fc0909181108x3a82e17eu9d078bc4402248ab@mail.gmail.com> On Fri, Sep 18, 2009 at 1:15 PM, Carlos Javier Borroto wrote: > On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto > wrote: >> Hi all, >> >> I'm trying to download all of the EST from a specie, I'm following the >> example on the tutorial which seems to be exactly what I need. But I >> running into this problem: >> > > I'm right? I'm going to implement this I share it here. > Well here is my implementation, I'm very new to biopython or even python, my programing skills aren't great either, but because what I did was mostly copy/paste from the tutorial, I'm feeling confident on sharing this code, any advise to make it better is highly welcome. It seems to be working just fine, but I haven't been able to run it to the end, cause I keep getting this sporadic ncbi servers errors: Going to download records 5601 to 5700 Traceback (most recent call last): File "ncbi-downloader.py", line 58, in webenv=webenv, query_key=query_key) File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 126, in efetch return _open(cgi, variables) File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", line 373, in _open raise IOError(data.strip()) IOError: Error: 156514070 is not available at this time. Error: 156514069 is not available at this time. Error: 156514068 is not available at this time. Error: 156514067 is not available at this time. 
But I guess it is only a matter of being a better citizen and doing this at the weekend or outside USA peak time.

Here is the code:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
dbname = "code_name_of_the_db"
query_term = "query_term"

# Ask EGQuery how many records match the query in the chosen database.
handle = Entrez.egquery(term=query_term)
record = Entrez.read(handle)
handle.close()
for row in record["eGQueryResult"]:
    if row["DbName"] == dbname:
        egquery_count = int(row["Count"])

esearch_batch_size = 1000
out_handle = open("outfile.fasta", "w")
# Outer loop: page through the IDs with retstart/retmax.
for esearch_start in range(0, egquery_count, esearch_batch_size):
    esearch_end = min(egquery_count, esearch_start + esearch_batch_size)
    print("Going to get IDs of records %i to %i" % (esearch_start + 1, esearch_end))
    search_handle = Entrez.esearch(db=dbname, term=query_term, usehistory="y",
                                   retstart=esearch_start, retmax=esearch_batch_size)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    # The final page may hold fewer than esearch_batch_size IDs.
    assert len(gi_list) == esearch_end - esearch_start
    count = int(search_results["Count"])
    assert egquery_count == count
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    batch_size = 100
    # Inner loop: fetch this page's records from the history in smaller batches.
    for start in range(esearch_start, esearch_end, batch_size):
        end = min(count, start + batch_size)
        print("Going to download records %i to %i" % (start + 1, end))
        fetch_handle = Entrez.efetch(db=dbname, rettype="fasta",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
out_handle.close()

Regards,
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020

From mmokrejs at ribosome.natur.cuni.cz Fri Sep 18 18:23:26 2009
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Fri, 18 Sep 2009 20:23:26 +0200
Subject: [Biopython] Searching for and downloading sequences using the history
In-Reply-To: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com>
Message-ID: <4AB3D01E.5000807@ribosome.natur.cuni.cz>

Hi Carlos,

I had a look at what the Entrez.esearch result contains and I see a RetMax entry.

>>> search_results
{u'Count': '9279', u'RetMax': '20',
 u'IdList': ['189229275', '189229274', '189229273', '189229272', '189229271',
             '189229270', '189229269', '189229268', '189229267', '189229266',
             '189229265', '189229264', '189229263', '189229262', '189229261',
             '189229260', '189229199', '189229198', '189229197', '189229196'],
 u'TranslationStack': [{u'Count': '9279', u'Field': 'All Fields',
                        u'Term': 'Genus[All Fields]', u'Explode': 'Y'}, 'GROUP'],
 u'QueryTranslation': 'Genus[All Fields]',
 u'ErrorList': {u'FieldNotFound': [], u'PhraseNotFound': ['specie']},
 u'TranslationSet': [], u'RetStart': '0', u'QueryKey': '1',
 u'WebEnv': 'NCID_1_3207467_130.14.22.148_9001_1253297878'}
>>>

So, here we go:

>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y",RetMax=99999)
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
>>> len(search_results["IdList"])
9279
>>>

BTW:

>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y",RetMax=9999999999)
>>> search_results = Entrez.read(search_handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 297, in read
    record = handler.run(handle)
  File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 141, in endElement
    raise RuntimeError(value)
RuntimeError: Search Backend failed: NCBI C++ Exception:
    Error: CORELIB(CStringException::eConvert) "/pubmed_gen/rbuild/version/20090819/entrez/c++/src/corelib/ncbistr.cpp", line 411: --- Cannot convert string '9999999999' to int, overflow (m_Pos = 0)
>>>

Hope this helps,
M.
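
For reference, if one esearch call will not return every ID (retmax appears to be capped on the NCBI side), the ID list can be requested in windows with retstart and retmax. A minimal sketch of just the window arithmetic, in Python 3 and without any network calls; batch_windows is a hypothetical helper name, not a Biopython function, and the batch size shown is only an example:

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last window is shortened so it never runs past the end.
        yield start, min(batch_size, total - start)


# Record count taken from the session quoted in this thread.
for retstart, retmax in batch_windows(193951, 100000):
    print("retstart=%i retmax=%i" % (retstart, retmax))
```

Each pair would then be passed as retstart and retmax to Entrez.esearch, in the same way as the other keyword arguments in the sessions above.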
Carlos Javier Borroto wrote: > Hi all, > > I'm trying to download all of the EST from a specie, I'm following the > example on the tutorial which seems to be exactly what I need. But I > running into this problem: > >>>> from Bio import Entrez >>>> Entrez.email = "carlos.borroto at gmail.com" >>>> dbname = "nucest" >>>> query_term = "Genus specie" >>>> search_handle = Entrez.esearch(db=dbname,term=query_term,usehistory="y") >>>> search_results = Entrez.read(search_handle) >>>> search_handle.close() >>>> len(search_results["IdList"]) > 20 >>>> print search_results["Count"] > 193951 > > So the assert statement if failing: >>>> gi_list = search_results["IdList"] >>>> count = int(search_results["Count"]) >>>> assert count == len(gi_list) > Traceback (most recent call last): > File "", line 1, in > AssertionError > > And most important I'm not getting all of the ids. > > Did someone knows what I'm doing wrong? > > thanks in advance From biopython at maubp.freeserve.co.uk Fri Sep 18 18:51:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 19:51:32 +0100 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> Message-ID: <320fb6e00909181151y2e22c06fvc795b9c435f6c01b@mail.gmail.com> On Fri, Sep 18, 2009 at 6:15 PM, Carlos Javier Borroto wrote: > On Fri, Sep 18, 2009 at 12:59 PM, Carlos Javier Borroto wrote: >> Hi all, >> >> I'm trying to download all of the EST from a specie, I'm following the >> example on the tutorial which seems to be exactly what I need. But I >> running into this problem: >> ... 
>
> I just found this:
> http://portal.open-bio.org/pipermail/biopython/2008-August/004451.html
>
> So I tested this:
> >>> search_handle = Entrez.esearch(db=dbname,term=query_term,retmax=193951)
> >>> search_results = Entrez.read(search_handle)
> >>> search_handle.close()
> >>> print search_results["Count"]
> 193951
> >>> len(search_results["IdList"])
> 100000
>
> Still not the complete list, maybe there is a maximum of result you
> can get and I see there is a retstart, so I'm guessing the only way to
> get all of the ids is dividing my search and using retstart.

OK, good - you found the retmax parameter. It looks like the NCBI still limit their return data to 100000 here - I don't know if EFetch (via the history) would also be limited to 100000 or not, but this is still a pretty large amount of EST data to try and download this way.

I would first suggest you refine your Entrez search to use "species name[orgn]" rather than just "species name" (i.e. explicitly search on the organism rather than all fields). That may reduce things further. Even better, search using an NCBI taxonomy ID to be absolutely explicit. This may reduce the dataset a bit.

Secondly, this seems like an awfully large amount of data to try and download via Entrez. Email the NCBI to ask if this is OK (and if so what batch size you should use for EFetch calls), or if they have an alternative suggestion (e.g. some FTP site).

Peter

P.S. You could try wrapping each EFetch call in a try/except in order to retry any individual retrieval which fails.
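
The retry idea in the P.S. could be sketched roughly as follows. This is Python 3 and deliberately generic: call_with_retry and its parameters are made-up names for illustration, not part of Biopython, and the defaults (three attempts, a pause of a few seconds) are only an assumption about what is polite to a shared server.

```python
import time


def call_with_retry(func, attempts=3, delay=5.0, errors=(IOError,)):
    """Call func() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except errors as err:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            print("Attempt %i failed (%s), trying again" % (attempt, err))
            time.sleep(delay)
```

Here func would be a small function of your own that makes one EFetch call and returns the data it read, so that a failed batch is retried a bounded number of times instead of aborting the whole download.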
From fchiu at newton.berkeley.edu Fri Sep 18 20:32:26 2009 From: fchiu at newton.berkeley.edu (Finsen Chiu) Date: Fri, 18 Sep 2009 13:32:26 -0700 Subject: [Biopython] Permissions for Bio, BioSQL, Martel In-Reply-To: <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> References: <4AB1723C.2040004@newton.berkeley.edu> <320fb6e00909161635yebd94dlf63f3efb8df299a1@mail.gmail.com> <4AB17C12.60806@newton.berkeley.edu> <320fb6e00909170215j791ed9efsd0a593e56d5bc26f@mail.gmail.com> Message-ID: <4AB3EE5A.7080606@newton.berkeley.edu> This has been resolved. For the record, here is the resolution: I built and tested as an user and that went fine. I saw that the proper permission for the Bio, BioSQL and Martel should be umask 002. So I proceeded with installing as root with umask 002. Thanks, Finsen Peter wrote: > On Thu, Sep 17, 2009 at 1:00 AM, Finsen Chiu wrote: > >> I am using RHEL4 >> >> I built and installed as root and both went smoothly without interruption. >> > > OK, good that it worked. But I don't normally do the build and test as root, > just use sudo for the install at the end. > > >> Running the test as root is fine, but the permission denied errors came when >> running the test as a user. >> >> So, I wonder if I need to give users write permission to those files (which >> I am not wanting to do) or if those errors are negligible. >> > > Switching the permissions would work, but the best solution to not to > run the tests as root in the first place. If you do the build and test as > a normal user, all the temp test files will have normal user permissions. 
> > [Normally there shouldn't be any left over temp files from the tests, > so I guess a test failed or was interupted before it could clean up] > > Peter > > From biopython at maubp.freeserve.co.uk Fri Sep 18 20:56:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 21:56:34 +0100 Subject: [Biopython] Searching for and downloading sequences using the history In-Reply-To: <65d4b7fc0909181309j282e2cfbi193be5528cbafa5@mail.gmail.com> References: <65d4b7fc0909180959m5bae40e3s37c88e9c46c385d0@mail.gmail.com> <65d4b7fc0909181015i606819e8n6e7c191636290c4f@mail.gmail.com> <320fb6e00909181151y2e22c06fvc795b9c435f6c01b@mail.gmail.com> <65d4b7fc0909181309j282e2cfbi193be5528cbafa5@mail.gmail.com> Message-ID: <320fb6e00909181356u5f3eb543xe2cdac77b8677c0e@mail.gmail.com> On Fri, Sep 18, 2009 at 9:09 PM, Carlos Javier Borroto wrote: > > On Fri, Sep 18, 2009 at 2:51 PM, Peter wrote: >> I would first suggest you refine your Entrez search to use "species >> name[orgn]" rather than just "species name" (i.e. explicitly search >> on the organism rather than all fields). That may reduce things >> further. Even better, search using an NCBI taxonomy ID to be >> absolutely explicit. This may reduce the dataset a bit. > > Nice advise, I was thinking about using it, now I'm using something > like txid6945[Organism:noexp], but still I have 100000+ sequence to > download. That is what I meant - but still, you have a lot of sequences! >> Secondly, this seems like an awfully large amount of data to >> try and download via Entrez. Email the NCBI to ask if if this is >> OK (and if so what batch size you should use for EFetch calls), >> or if they have an alternative suggestion (e.g. some FTP site). > > But how could I know what is and what isn't "an awfully large amount > of data" to download via Entrez?, I'm gonna try writing to them and > see what they think. 
FTP site was my first option but is unresponsive
> right now, but I don't think they have this specific subset of
> sequences there.

Well over 100000 records sounds like a lot to me, but I agree, the NCBI could provide more explicit guidance. It is a shame if the NCBI don't provide what you want by FTP, as that would probably be easier.

>> P.S. You could try wrapping each EFetch call in a
>> try/except in order to retry any individual retrieval which fails.
>
> Great I just did it and it seems to be working fine!, here is what I did:
>
>     while True :
>         try :
>             fetch_handle = ...
>             data = fetch_handle.read()
>             fetch_handle.close()
>             out_handle.write(data)
>             break
>         except IOError :
>             print "Server error, going to try again from record %i" % (start+1)

That will loop forever if there is a problem - which is fine if you are going to sit watching the script, but a very bad idea for automation. I would limit it to say 3 attempts before giving up, and maybe add a sleep of a few seconds too.

Peter

From peter at maubp.freeserve.co.uk Sat Sep 19 11:17:50 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Sat, 19 Sep 2009 12:17:50 +0100
Subject: [Biopython] Shuffler for wrapped fastas
In-Reply-To: <4AB0D386.2060406@stats.ox.ac.uk>
References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> <4AB0D386.2060406@stats.ox.ac.uk>
Message-ID: <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com>

Hi Tanya,

Sorry for the slight delay - your email didn't appear in my inbox for a couple of days. Odd.
On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote: > > A single write is definitely better, though it's still so much slower than > plain text shuffling that it's not ideal for millions of short reads unless > we want to do something useful like convert the scores to Phred in the > process. In that case we'd be using 'format' anyway, I assume, unless there > is a neat trick to reformat a whole lot of reads at once. If guess you mean the SeqRecord's format method? It isn't intended for output of multiple records to a file, but rather is a convenience method for a single record. The (slow) approach using a loop and many calls to SeqRecord.format(...) is also less general than using a single call to Bio.SeqIO.write(...). Consider interleaved file formats or those with a header - the for loop won't work here. Using Bio.SeqIO and combining the parse and write functions already allows simple conversion of a range of sequence file formats, including the three FASTQ variants. This is covered in the tutorial and the wiki, http://biopython.org/wiki/SeqIO The soon to be released Biopython 1.52 will make this even easier (and in some cases like FASTQ conversion also faster) with the addition of a Bio.SeqIO.convert(...) function. > In general I find myself using Biopython for longer sequences (fasta or > fastq), because of the neatness and flexibility of Biopython, but sticking > to plain text for short reads because of the overheads. In some cases that is the best thing to do. If you haven't already done so, have a look at the FastqGeneralIterator function in Bio.SeqIO.QualityIO which returns a tuple of three strings (so no overhead from Seq and SeqRecord objects). > BTW, itertools.izip does exactly what your interleave method > does, so I'm not sure there's any need to rewrite it. No it doesn't. The builtin function zip, and itertools.izip both return tuples (pairs of entries). 
Consider:

>>> a = range(0,10,2)
>>> b = range(1, 10, 2)
>>> zip(a,b)
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Or, using itertools,

>>> import itertools
>>> list(itertools.izip(a,b))
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Using the (not very general) interlace function I wrote earlier:
http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html

>>> def interlace(iter1, iter2) :
...     while True :
...         yield iter1.next()
...         yield iter2.next()
...
>>> list(interlace(iter(a),iter(b)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

I hope that illustrates the difference. Here you get back ten items, but with zip or izip you get five pairs of items. Via Google you can easily find much more general interlace functions in Python.

Peter

From rodrigo_faccioli at uol.com.br Sun Sep 20 19:24:54 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Sun, 20 Sep 2009 16:24:54 -0300
Subject: [Biopython] Protein side-chain angles from a PDB file
Message-ID: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com>

Hi,

I have a question about how I can calculate the protein side-chain angles from a PDB file.

I've read that Bio.PDB.Vector calculates angles from three atoms. However, my question is how can I choose these atoms from a PDB file?

I based my code on peter cook's web-site and have calculated the phi-psi angles, but I haven't seen anything about side-chain angles.

Sorry if my question is very basic. However, I'm a computer scientist and a novice in chemistry.

Thanks in advance.
-- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From p.j.a.cock at googlemail.com Mon Sep 21 09:18:42 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 21 Sep 2009 10:18:42 +0100 Subject: [Biopython] Protein side-chain angles from a PDB file In-Reply-To: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> Message-ID: <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli wrote: > Hi, > > I have a doubt about how can I calculate the protein side-chain angles from > a PDB file? > > I've read the Bio.PDB.Vector evaluates the angles from three atoms. However, > my question is how can I ?chose these atoms from a PDB file? > > I have based on peter cook web-site and I have calculated the phi-psi angles > but I haven't seen about side-chain angles. http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ Its "Cock", not "Cook", but its a common mistake ;) > Sorry if my question is very basic. However, I'm a computer scientist novice > in chemistry issue. How are you defining the side chain angle? From memory (without checking the details), you have the protein back bone which includes the "alpha carbon" (CA in PDB files) to which the side chains are attached. I guess you want to measure the angle of the side chain to the C-alpha to (either of the) neighbouring backbone atoms. The point here is off the top of my head there are at least two possible angles you might be asking about. 
But in terms of the code, you'll just need to get the coordinates of the three atoms defining the angle (which probably will be the C-alpha and two others), which defines two vectors, then take their dot product and thus get the cosine of the angle. Looking at the Psi-Phi code should help here. Peter From golubchi at stats.ox.ac.uk Mon Sep 21 09:20:44 2009 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Mon, 21 Sep 2009 10:20:44 +0100 Subject: [Biopython] Shuffler for wrapped fastas In-Reply-To: <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com> References: <4AAE3302.4010304@ebi.ac.uk> <16DCCE03-6439-4A84-8E32-7482BCB2D192@ebc.uu.se> <4AB0AE11.1000706@stats.ox.ac.uk> <320fb6e00909160254u6cc988a8rbd07c4892aeb3515@mail.gmail.com> <320fb6e00909160408g41c88c00w4048b9d6b58bc87b@mail.gmail.com> <4AB0D386.2060406@stats.ox.ac.uk> <320fb6e00909190417v30322978wc9b4132dc78dc909@mail.gmail.com> Message-ID: <4AB7456C.4020207@stats.ox.ac.uk> Hi Peter, Thanks for the pointer to FastqGeneralIterator, I'll definitely take a look. (Good point about interlacing as well.) Cheers, Tanya Peter wrote: > Hi Tanya, > > Sorry for the slight delay - your email didn't appear in my inbox > for a couple of days. Odd. > > On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote: >> A single write is definitely better, though it's still so much slower than >> plain text shuffling that it's not ideal for millions of short reads unless >> we want to do something useful like convert the scores to Phred in the >> process. In that case we'd be using 'format' anyway, I assume, unless there >> is a neat trick to reformat a whole lot of reads at once. > > If guess you mean the SeqRecord's format method? It isn't intended > for output of multiple records to a file, but rather is a convenience > method for a single record. The (slow) approach using a loop and > many calls to SeqRecord.format(...) is also less general than using > a single call to Bio.SeqIO.write(...). 
Consider interleaved file formats > or those with a header - the for loop won't work here. > > Using Bio.SeqIO and combining the parse and write functions already > allows simple conversion of a range of sequence file formats, including > the three FASTQ variants. This is covered in the tutorial and the wiki, > http://biopython.org/wiki/SeqIO > > The soon to be released Biopython 1.52 will make this even easier > (and in some cases like FASTQ conversion also faster) with the > addition of a Bio.SeqIO.convert(...) function. > >> In general I find myself using Biopython for longer sequences (fasta or >> fastq), because of the neatness and flexibility of Biopython, but sticking >> to plain text for short reads because of the overheads. > > In some cases that is the best thing to do. If you haven't already > done so, have a look at the FastqGeneralIterator function in > Bio.SeqIO.QualityIO which returns a tuple of three strings (so > no overhead from Seq and SeqRecord objects). > >> BTW, itertools.izip does exactly what your interleave method >> does, so I'm not sure there's any need to rewrite it. > > No it doesn't. The builtin function zip, and itertools.izip both > return tuples (pairs of entries). Consider: > >>>> a = range(0,10,2) >>>> b = range(1, 10, 2) >>>> zip(a,b) > [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)] > > Or, using itertools, > >>>> import itertools >>>> list(itertools.izip(a,b)) > [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)] > > Using the (not very general) interlace function I wrote earlier: > http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html > >>>> def interlace(iter1, iter2) : > ... while True : > ... yield iter1.next() > ... yield iter2.next() > ... >>>> list(interlace(iter(a),iter(b))) > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] > > I hope that illustrates the difference. Here you get back ten items, > but with zip or izip you get five pairs of iterms. Via Google you > can easily find much more general interlace functions in Python. 
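
For what it is worth, a more general version of the interlace idea can be built from the standard library. A small sketch for Python 3, where the built-in zip is already lazy (on Python 2, itertools.izip would play that role); the name interleave is purely illustrative:

```python
from itertools import chain


def interleave(*iterables):
    """Yield one item from each iterable in turn, stopping at the shortest."""
    # zip() groups the items into tuples; chain.from_iterable flattens
    # those tuples back into a single stream of items.
    return chain.from_iterable(zip(*iterables))
```

Unlike the two-argument interlace above, this accepts any number of iterables, and it stops cleanly as soon as the shortest one runs out.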
> > Peter From darnells at dnastar.com Mon Sep 21 15:03:01 2009 From: darnells at dnastar.com (Steve Darnell) Date: Mon, 21 Sep 2009 10:03:01 -0500 Subject: [Biopython] Protein side-chain angles from a PDB file In-Reply-To: <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> Message-ID: Rodrigo, This is just to expand on Peter's comment. Your original question implicitly mentioned two types of angles: bond and dihedral angles. A bond angle can be calculated with three atoms, two vectors, and a dot product (the first type mentioned). When you use the term phi and psi angles, you are mentioning dihedral (or torsion) angles (the angle betweeen two planes where the intersection is along the bond of interest). It's more complicated to calculate, but relatively straight forward: http://en.wikipedia.org/wiki/Dihedral_angle Were you originally asking about how to calculate the torsion angles in the side chain? These are known as the chi angles and are used for defining rotational conformations (rotomers). I'll stop here since I'm guessing I misunderstood your original question. Regards, Steve -- Steve Darnell DNASTAR, Inc. Madison, WI USA -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock Sent: Monday, September 21, 2009 4:19 AM To: Rodrigo faccioli Cc: biopython Subject: Re: [Biopython] Protein side-chain angles from a PDB file On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli wrote: > Hi, > > I have a doubt about how can I calculate the protein side-chain angles > from a PDB file? > > I've read the Bio.PDB.Vector evaluates the angles from three atoms. > However, my question is how can I ?chose these atoms from a PDB file? 
> > I have based on peter cook web-site and I have calculated the phi-psi > angles but I haven't seen about side-chain angles. http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/calculate/ Its "Cock", not "Cook", but its a common mistake ;) > Sorry if my question is very basic. However, I'm a computer scientist > novice in chemistry issue. How are you defining the side chain angle? From memory (without checking the details), you have the protein back bone which includes the "alpha carbon" (CA in PDB files) to which the side chains are attached. I guess you want to measure the angle of the side chain to the C-alpha to (either of the) neighbouring backbone atoms. The point here is off the top of my head there are at least two possible angles you might be asking about. But in terms of the code, you'll just need to get the coordinates of the three atoms defining the angle (which probably will be the C-alpha and two others), which defines two vectors, then take their dot product and thus get the cosine of the angle. Looking at the Psi-Phi code should help here. Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From rodrigo_faccioli at uol.com.br Mon Sep 21 19:23:37 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Mon, 21 Sep 2009 16:23:37 -0300 Subject: [Biopython] Protein side-chain angles from a PDB file In-Reply-To: References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> Message-ID: <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com> Steve Darnell and Peter Cock, I apologize your answers. I have complicated my question because I have not understood very well about these angles. However, I have understood more. Thus, I'll retype my original question: How can I calculate the rotational conformations (rotamers) when I have a PDB file? 
Therefore, if I understood better, I need to know which amino acid I'm working because the number of chi angles vary according to each amino acids. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Mon, Sep 21, 2009 at 12:03 PM, Steve Darnell wrote: > Rodrigo, > > This is just to expand on Peter's comment. Your original question > implicitly mentioned two types of angles: bond and dihedral angles. A bond > angle can be calculated with three atoms, two vectors, and a dot product > (the first type mentioned). When you use the term phi and psi angles, you > are mentioning dihedral (or torsion) angles (the angle betweeen two planes > where the intersection is along the bond of interest). It's more > complicated to calculate, but relatively straight forward: > > http://en.wikipedia.org/wiki/Dihedral_angle > > Were you originally asking about how to calculate the torsion angles in the > side chain? These are known as the chi angles and are used for defining > rotational conformations (rotomers). I'll stop here since I'm guessing I > misunderstood your original question. > > Regards, > Steve > > -- > Steve Darnell > DNASTAR, Inc. > Madison, WI USA > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto: > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > Sent: Monday, September 21, 2009 4:19 AM > To: Rodrigo faccioli > Cc: biopython > Subject: Re: [Biopython] Protein side-chain angles from a PDB file > > On Sun, Sep 20, 2009 at 8:24 PM, Rodrigo faccioli < > rodrigo_faccioli at uol.com.br> wrote: > > Hi, > > > > I have a doubt about how can I calculate the protein side-chain angles > > from a PDB file? 
> > Peter
> > _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From darnells at dnastar.com Mon Sep 21 21:18:54 2009
From: darnells at dnastar.com (Steve Darnell)
Date: Mon, 21 Sep 2009 16:18:54 -0500
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To: <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
Message-ID:

Rodrigo,

Perhaps this will get you going in the right direction for your implementation. It describes the math for calculating dihedrals and defines the chi angle atom mappings for all of the amino acids. At first glance it appears to be correct. Unfortunately, I don't have any code to help you out.

http://www.math.fsu.edu/~quine/IntroMathBio_05/torsion_pdb/torsion_pdb.pdf

If you're only interested in calculating the chi angles once or twice, you could try MolProbity.

http://molprobity.biochem.duke.edu/

Maybe someone else knows of another already implemented solution?

~Steve

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Rodrigo faccioli
Sent: Monday, September 21, 2009 2:24 PM
Cc: biopython
Subject: Re: [Biopython] Protein side-chain angles from a PDB file

Steve Darnell and Peter Cock,

I apologize your answers. I have complicated my question because I have not understood very well about these angles. However, I have understood more.

Thus, I'll retype my original question: How can I calculate the rotational conformations (rotamers) when I have a PDB file?

Therefore, if I understood better, I need to know which amino acid I'm working because the number of chi angles vary according to each amino acids.
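As a rough illustration of the two kinds of angle discussed in this thread, here is a minimal pure-Python sketch: a bond angle from three points (two vectors and a dot product) and a dihedral, or torsion, angle from four points. The function names, the partial CHI1_ATOMS table and the toy coordinates are assumptions made for illustration; they are not Biopython API. Chi1 is conventionally measured over N, CA, CB and the first gamma atom, which is why the residue type matters. With Bio.PDB the points could probably be taken from something like residue["N"].coord, and Bio.PDB also ships its own calc_angle and calc_dihedral helpers that work on Vector objects.

```python
import math

# Partial, illustrative table of the four atoms defining chi1.
# Glycine and alanine have no chi angles, so they are absent.
CHI1_ATOMS = {
    "SER": ("N", "CA", "CB", "OG"),
    "CYS": ("N", "CA", "CB", "SG"),
    "VAL": ("N", "CA", "CB", "CG1"),
    "LEU": ("N", "CA", "CB", "CG"),
}


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2 between the vectors p2->p1 and p2->p3."""
    v1 = _sub(p1, p2)
    v2 = _sub(p3, p2)
    cos_theta = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def dihedral(p1, p2, p3, p4):
    """Signed torsion angle in degrees about the p2-p3 bond."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Toy coordinates only, not taken from a real PDB file.
    print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The atan2 form is used for the dihedral because it keeps the sign of the angle, which a plain arccosine of the two plane normals would lose.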
> > Peter
>
_______________________________________________
Biopython mailing list - Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Tue Sep 22 16:38:21 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 17:38:21 +0100
Subject: [Biopython] Biopython 1.52 released
Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>

Dear all,

Those of you who signed up to our newsfeed will know this already, but we are pleased to announce the release of Biopython 1.52:

http://news.open-bio.org/news/2009/09/biopython-release-152/

Thank you to all our developers, including David Winter for drafting the release announcement, and everyone else who has contributed with feedback, bug reports etc.

Could I also take this opportunity to remind you all we have an application note out in the OUP journal Bioinformatics:

http://news.open-bio.org/news/2009/03/biopython-paper-published/
http://dx.doi.org/10.1093/bioinformatics/btp163

In any scientific publication using Biopython, we kindly request you cite this, or another appropriate publication from this list:

http://biopython.org/wiki/Documentation#Papers

Thank you,

Peter

From rodrigo_faccioli at uol.com.br Wed Sep 23 13:34:29 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Wed, 23 Sep 2009 10:34:29 -0300
Subject: [Biopython] Protein side-chain angles from a PDB file
In-Reply-To:
References: <3715adb70909201224h11f2ddc4x3fafcfabb54bb33c@mail.gmail.com> <320fb6e00909210218r1591cac3pee1bf6abcf8ad82b@mail.gmail.com> <3715adb70909211223y119b750o152d4754004a971f@mail.gmail.com>
Message-ID: <3715adb70909230634s31f11744hf9b25439bfc6711b@mail.gmail.com>

Thank you. Your help was very important.
I'll read about MolProbity because after I calculate of the chi angle, I'll store them in PostgreSQL. Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
From biopython at maubp.freeserve.co.uk Thu Sep 24 09:59:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 10:59:47 +0100 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences.
Message-ID: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com>

Hi all,

I'm forwarding an interesting post from Dave to the BioPerl mailing list, which should also be of interest here...

Peter

---------- Forwarded message ----------
From: Dave Messina
Date: Thu, Sep 24, 2009 at 10:38 AM
Subject: Re: [Bioperl-l] a Main Page proposal
To: Chris Fields
Cc: bioperl-l at lists.open-bio.org, Dave Clements, Peter, "Mark A. Jensen"

> Not to add yet more to the list, but I also think a concise list of
> projects using (or 'powered by') bioperl should be front-and-center; not a
> lot of users know when/where bioperl is used. This applies to the other
> bio* as well, particularly biopython (seeing it popping up more and more).

Along these lines, it'd be great to publicize not only BioPerl-*powered* projects, but ones which interface with it, too. Just this week, for example, there is this, which could go both on a static page and in the newsfeed:

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp554v1

MOODS: fast search for position weight matrix matches in DNA sequences.

Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E.
Department of Computer Science and Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland.

SUMMARY: MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art on-line matching algorithms, achieving considerably faster scanning speed than with a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits. It can easily be adapted for different purposes and integrated into existing workflows. It can also be used as a C++ library.

AVAILABILITY: The package with documentation and examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. The source code is also available under the terms of a GNU General Public License (GPL).
CONTACT: janne.h.korhonen at helsinki.fi. PMID: 19773334 [PubMed - as supplied by publisher] _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bartek at rezolwenta.eu.org Thu Sep 24 11:46:42 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 24 Sep 2009 13:46:42 +0200 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> Message-ID: <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> On Thu, Sep 24, 2009 at 11:59 AM, Peter wrote: > Hi all, > > I'm forwarding an interesting post from Dave to the BioPerl mailing list, which > should also be of interest here... > > Just this week, for example, there is this, which could go both on a static > page and in the newsfeed: > http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp554v1 > > MOODS: fast search for position weight matrix matches in DNA sequences. > > Korhonen J, Martinm?ki P, Pizzi C, Rastas P, Ukkonen E. > Department of Computer Science and Helsinki Institute for Information > Technology, > University of Helsinki, Helsinki, Finland. > > SUMMARY: MOODS (MOtif Occurrence Detection Suite) is a software package for > matching position weight matrices against DNA sequences. MOODS implements > state-of-the-art on-line matching algorithms, achieving considerably faster > scanning speed than with a simple brute-force search. MOODS is written in C++, > with bindings for the popular BioPerl and Biopython toolkits. It can easily be > adapted for different purposes and integrated into existing workflows. It can > also be used as a C++ library. AVAILABILITY: The package with documentation and > examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. 
> The source code is also available under the terms of a GNU General Public License
> (GPL). CONTACT: janne.h.korhonen at helsinki.fi.

Hi all,

I've seen this paper. It is directly related to the Bio.Motif code. They did a pretty good job of implementing an extremely efficient tool for finding motif instances in DNA sequences. It's C++ and it beats my pure Python, brute-force code hands down... Of course this comes at the price of only being applicable to DNA (unambiguous alphabet only, etc.). Since they did the comparison, we have actually incorporated the _pwm.c module written by Michiel, which is also much faster and can be used for finding motifs in DNA.

I have compared their performance with our code on a single Drosophila chromosome (20Mb), and the results are similarly devastating to my old code: their code takes ~1.1 secs (advanced look-ahead algorithms in C++) while mine (pure Python) takes 350 secs. The code contributed recently by Michiel (simple algorithm, but in C) takes 2.3 secs to finish.

Since they provide a Python interface (there is nothing Biopython related, despite their abstract), I was even thinking about incorporating their code into Biopython, but it's GPL. Instead, I can make the function using Michiel's code aware of the MOODS package, i.e. use it if it is installed.

If we want to put it into the news, it would be worth mentioning that (thanks to Michiel) we have made quite some progress on that front.

As a side note, I feel a little bit guilty of making Biopython look slow compared to other tools. In the paper, they show a comparison between different tools (MOODS, BioPerl, Biopython) in terms of speed, which shows Biopython as by far the slowest. This is just because I was not writing this code with speed in mind (I work on short regulatory sequences...). Nonetheless, it can give the impression that Biopython is slow in general, which is not true.
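For readers unfamiliar with the term, a brute-force scan simply scores every window of the sequence against the matrix, so its cost grows with sequence length times motif width. The sketch below is a plain-Python illustration with a made-up three-column log-odds matrix. It is not the Bio.Motif, _pwm.c or MOODS implementation, and the function name, matrix values and threshold are assumptions chosen only to show the idea.

```python
def scan_pwm(sequence, pwm, threshold):
    """Yield (position, score) for every window scoring >= threshold.

    pwm is a list of dicts, one per motif column, each mapping the
    letters A/C/G/T to a log-odds score. Windows that contain any
    other letter (for example N) are skipped.
    """
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = 0.0
        for column, letter in zip(pwm, window):
            if letter not in column:
                break  # ambiguous letter, give up on this window
            score += column[letter]
        else:
            # Only reached when no ambiguous letter was seen.
            if score >= threshold:
                yield start, score


# Hypothetical toy matrix that favours the motif "ACG".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

if __name__ == "__main__":
    for position, score in scan_pwm("TTACGNACGA", toy_pwm, 3.0):
        print(position, score)
```

Look-ahead methods of the kind the MOODS abstract mentions avoid scoring most windows in full, which is where much of the speed difference over a loop like this would be expected to come from.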
I will try to extend Michiel's code to accept different alphabets and then maybe phase out the slow code of mine. Bartek From biopython at maubp.freeserve.co.uk Thu Sep 24 12:09:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 13:09:16 +0100 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> Message-ID: <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> On Thu, Sep 24, 2009 at 12:46 PM, Bartek Wilczynski wrote: > On Thu, Sep 24, 2009 at 11:59 AM, Peter wrote: >> Hi all, >> >> I'm forwarding an interesting post from Dave to the BioPerl mailing list, which >> should also be of interest here... >> > > Hi all, > > I've seen this paper. It is directly related to the Bio.Motif code. > They did a pretty good job of implementing an extremely efficient tool > for finding motif instances in DNA sequences. it's c++ and it beats my > pure python, brute-force code with both hands down... Of course this > come at a price of only being applicable to DNA (only unambiguous > alphabet etc.). Since they did the comparison, we have actually > incorporated the _pwm.c module written by Michiel, which is also much > faster and can be used for finding motifs in DNA. I hadn't looked at the table until you pointed this out. I think they have been negligent by not including the version numbers of the different packages tested (and this is a general point, not just about Biopython). > I have compared their performance with our code on a single Drosophila > chromosme (20Mb) the results are similarly devastating to my old code: > their code takes ~1.1 sec (advanced look-ahead algorithms in C++) > while mine (pure python) takes 350 secs. 
The code contributed recently > by Michiel (simple algorithm, but in C) takes 2.3secs to finish. Our C code looks pretty good then :) > since they provide python interface (there is nothing biopython > related, despite their abstract), I was even thinking about > incorporating their code into Biopython, but it's GPL, Instead, I can > make the function using Michiel's code aware of the MOODS package: > i.e. use it if it is installed. I'm not sure about that from an architectural point of view, especially if the two algorithms give different results or take different parameters. > If we want to put it into the news, It would be worth mentioning that > (thanks to Michiel) we have made quite some progress on that front. Good idea - why don't you check in an extra paragraph to the NEWS file section for Biopython 1.51 (or was it 1.52?). We can also update the news post too. In fact, if you wanted to you could write up a whole blog post to put up on our news server with timing etc. > As a side note, I feel a little bit guilty of making biopython look > slow compared to other tools. In the paper, they show a comparison > between different tools (MOODS, bioperl, biopython) in terms of speed, > which shows biopython as by far the slowest. This is just because I > was not writing this ?code with speed in mind (I work on short > regulatory sequences...). Nonetheless, it can make an impression that > biopython is slow in general, which is not true. I will try to extend > Michiel's code to accept different alphabets and then maybe phase out > the slow code of mine. Extending the C code to cover more cases sounds like a good idea. However, I would keep the pure python fallback for situations like Jython where C extensions are not available. Peter From chapmanb at 50mail.com Thu Sep 24 12:27:30 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 24 Sep 2009 08:27:30 -0400 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. 
In-Reply-To: <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> Message-ID: <20090924122730.GL13500@sobchak.mgh.harvard.edu> Peter and Bartek; [MOODS paper compared with Biopython] > > I was even thinking about > > incorporating their code into Biopython, but it's GPL, Instead, I can > > make the function using Michiel's code aware of the MOODS package: > > i.e. use it if it is installed. It may be worth contacting the authors with your interest in incorporating it. If it improves substantially upon the current C code from Michiel and could fit with your interface this makes sense. Many times people are not tied to GPL, and they may be willing to re-license for inclusion in Biopython. > > If we want to put it into the news, It would be worth mentioning that > > (thanks to Michiel) we have made quite some progress on that front. > > Good idea - why don't you check in an extra paragraph to the NEWS > file section for Biopython 1.51 (or was it 1.52?). We can also update > the news post too. In fact, if you wanted to you could write up a whole > blog post to put up on our news server with timing etc. A separate news post mentioning the C option speed and showing usage examples from both is a great idea. Responsiveness to new methods is the fun part of science. > > As a side note, I feel a little bit guilty of making biopython look > > slow compared to other tools. In the paper, they show a comparison > > between different tools (MOODS, bioperl, biopython) in terms of speed, > > which shows biopython as by far the slowest. This is just because I > > was not writing this ?code with speed in mind (I work on short > > regulatory sequences...). Nonetheless, it can make an impression that > > biopython is slow in general, which is not true. 
This is more a consequence of how scientific publication works. You have to get published and to do that you have to prove you are somehow that much better than other options, which results in trying to find flaws in those options. This would all work smoother if the authors came on the Biopython list mentioning the speed issues, you all had this discussion then and worked on incorporating their code as it was being developed. Then we'd have an integrated implementation today. Doing it after the fact is a bit more roundabout, but what can you do. Brad From bartek at rezolwenta.eu.org Thu Sep 24 12:51:04 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 24 Sep 2009 14:51:04 +0200 Subject: [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences. In-Reply-To: <20090924122730.GL13500@sobchak.mgh.harvard.edu> References: <320fb6e00909240259y15374d42m1ce4bda0cf1c9d0a@mail.gmail.com> <8b34ec180909240446u212a6112tde9279eb96f5a70a@mail.gmail.com> <320fb6e00909240509h1efb0a46y222067c77b77aa68@mail.gmail.com> <20090924122730.GL13500@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180909240551v649769f3keeae64f1ef31a633@mail.gmail.com> Hi, On Thu, Sep 24, 2009 at 2:27 PM, Brad Chapman wrote: > Peter and Bartek; > > [MOODS paper compared with Biopython] >> > I was even thinking about >> > incorporating their code into Biopython, but it's GPL, Instead, I can >> > make the function using Michiel's code aware of the MOODS package: >> > i.e. use it if it is installed. > > It may be worth contacting the authors with your interest in > incorporating it. If it improves substantially upon the current C > code from Michiel and could fit with your interface this makes > sense. Many times people are not tied to GPL, and they may be > willing to re-license for inclusion in Biopython. > Yes, I'll try to talk to them. > > A separate news post mentioning the C option speed and showing usage > examples from both is a great idea. 
Responsiveness to new methods is > the fun part of science. > I'll try to write that up and send it to the list. > This is more a consequence of how scientific publication works. You have > to get published and to do that you have to prove you are somehow that > much better than other options, which results in trying to find flaws in > those options. This would all work smoother if the authors came on the > Biopython list mentioning the speed issues, you all had this discussion > then and worked on incorporating their code as it was being developed. > Then we'd have an integrated implementation today. Doing it after the > fact is a bit more roundabout, but what can you do. Exactly. cheers Bartek From michael.koeris at gmail.com Thu Sep 24 15:01:40 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Thu, 24 Sep 2009 11:01:40 -0400 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation Message-ID: Hi, I was wondering if a very useful feature of the web API is implemented in the browser -> the ability to specifity the organism on top of the database. Many thanks Mike From biopython at maubp.freeserve.co.uk Thu Sep 24 15:09:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 16:09:48 +0100 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: References: Message-ID: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> On Thu, Sep 24, 2009 at 4:01 PM, Michael S. Koeris wrote: > Hi, > > I was wondering if a very useful feature of the web API is implemented in > the browser -> the ability to specifity the organism on top of the database. The NCBI BLAST website lets you specify an organism or use an Entrez query - you can do this via the QBlast API as well. See the mailing list archive, e.g. http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html Peter From michael.koeris at gmail.com Thu Sep 24 15:34:35 2009 From: michael.koeris at gmail.com (Michael S. 
Koeris) Date: Thu, 24 Sep 2009 11:34:35 -0400 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> Message-ID: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> On Sep 24, 2009, at 11:09 AM, Peter wrote: > On Thu, Sep 24, 2009 at 4:01 PM, Michael S. Koeris > wrote: >> Hi, >> >> I was wondering if a very useful feature of the web API is >> implemented in >> the browser -> the ability to specifity the organism on top of the >> database. > > The NCBI BLAST website lets you specify an organism or use an Entrez > query - you can do this via the QBlast API as well. See the mailing > list > archive, e.g. > > http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html I saw the option to put in an [ORGANISM] but I was hoping I could use the TaxonID because say I want to BLAST all bacteria or all archea From biopython at maubp.freeserve.co.uk Thu Sep 24 15:46:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 16:46:53 +0100 Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation In-Reply-To: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com> Message-ID: <320fb6e00909240846w828e2edhc7b65fbd98876919@mail.gmail.com> >> The NCBI BLAST website lets you specify an organism or use an Entrez >> query - you can do this via the QBlast API as well. See the mailing list >> archive, e.g. 
>>
>> http://lists.open-bio.org/pipermail/biopython/2009-August/005474.html
>
> I saw the option to put in an [ORGANISM] but I was hoping I could use the
> TaxonID because say I want to BLAST all bacteria or all archea

Yes, you can do that - you needed to read all of that thread:

http://lists.open-bio.org/pipermail/biopython/2009-August/005476.html
http://lists.open-bio.org/pipermail/biopython/2009-August/005477.html

Peter

From carlos.borroto at gmail.com Thu Sep 24 15:47:41 2009
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Thu, 24 Sep 2009 11:47:41 -0400
Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation
In-Reply-To: <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com>
References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com> <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com>
Message-ID: <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com>

On Thu, Sep 24, 2009 at 11:34 AM, Michael S. Koeris wrote:
> I saw the option to put in an [ORGANISM] but I was hoping I could use the
> TaxonID because say I want to BLAST all bacteria or all archea
>

I'm doing exactly that, by putting something like this in my entrez_query:

"txid6945[Organism:noexp]"

I got that string by searching the taxonomy database and then clicking to see all of the sequences for that taxon. I haven't tried using only "txid6945", and I don't know what "[Organism:noexp]" means, but I can tell you this works.
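A minimal sketch of the approach described here: build the taxon restriction string and pass it as the entrez_query argument of NCBIWWW.qblast. The helper names taxon_filter and blast_within_taxon are made up for illustration. The reading of ":noexp" used below (match only records assigned directly to that taxon, without "exploding" the term to its descendants) is my understanding of Entrez rather than something stated in this thread, so treat it as an assumption. The search itself needs network access, so Biopython is only imported inside the function that runs it.

```python
def taxon_filter(taxid, include_descendants=True):
    """Return an Entrez query string restricting a search to one taxon.

    Assumption: plain [Organism] also covers the taxon's descendants,
    while [Organism:noexp] does not.
    """
    field = "Organism" if include_descendants else "Organism:noexp"
    return "txid%d[%s]" % (int(taxid), field)


def blast_within_taxon(sequence, taxid, program="blastp", database="nr"):
    """Run an online BLAST search limited to one taxon (needs network)."""
    from Bio.Blast import NCBIWWW  # imported lazily; requires Biopython

    return NCBIWWW.qblast(program, database, sequence,
                          entrez_query=taxon_filter(taxid))


if __name__ == "__main__":
    # Only builds the query strings; no search is submitted here.
    print(taxon_filter(6945))
    print(taxon_filter(6945, include_descendants=False))
```

For the "all bacteria or all archaea" case in the question, the NCBI Taxonomy identifiers would be 2 (Bacteria) and 2157 (Archaea), used with the descendant-inclusive form.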

As a side note on blasting, I think there is a bug in the XML generator
from NCBI. I am getting output like this:

>>> print(blast_record.descriptions[0].title)
gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, putative [Ixodes scapularis]

I have to do odd things like:

>>> print(blast_record.descriptions[0].title.rsplit('>')[0])
gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes scapularis]

It is not a bug in the Biopython parser, because I can see the title is
already wrong in the XML output.

Hope this helps
--
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020

From biopython at maubp.freeserve.co.uk Thu Sep 24 16:04:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 17:04:52 +0100
Subject: [Biopython] Blast.NCBIWWW qblast taxonID search space limitation
In-Reply-To: <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com>
References: <320fb6e00909240809m6fb34396x2a8ca690e33c2c86@mail.gmail.com>
 <47406501-1BFD-49FD-88A6-E1235F302879@gmail.com>
 <65d4b7fc0909240847u2b5c1f5do322264080188c246@mail.gmail.com>
Message-ID: <320fb6e00909240904o1b07d934w60e36681ee20722b@mail.gmail.com>

On Thu, Sep 24, 2009 at 4:47 PM, Carlos Javier Borroto wrote:
> On Thu, Sep 24, 2009 at 11:34 AM, Michael S. Koeris wrote:
>> I saw the option to put in an [ORGANISM] but I was hoping I could use the
>> TaxonID because, say, I want to BLAST all bacteria or all archaea.
>
> I'm doing exactly that, by putting something like this in my entrez_query:
> "txid6945[Organism:noexp]"
>
> I got that string by searching the taxonomy database and then
> clicking to see all of the sequences of that taxon. I haven't tried
> using only "txid6945", and I don't know what "[Organism:noexp]" means,
> but I can tell you this works.

Where did [Organism:noexp] come from? I guess it tells Entrez not to
expand the organism name or the hierarchy?
I would just use "txid6945[Organism]" or "txid6945[orgn]", which is
shorter and I think clearer. See also this blog post and the EInfo entry
in the tutorial:
http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/

>>> from Bio import Entrez
>>> record = Entrez.read(Entrez.einfo(db="nuccore"))
>>> for field in record["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
...
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, Substance Name, CAS chemical name or MEDLINE Substance Name
PROP, Properties, Classification by source qualifiers and molecule type
SQID, SeqID String, String identifier for sequence
GPRJ, Genome Project, Genome Project
SLEN, Sequence Length, Length of sequence
FKEY, Feature key, Feature annotated on sequence
PORG, Primary Organism, Scientific and common names of primary organism, and all higher levels of taxonomy

> As a side note on blasting, I think there is a bug in the XML generator
> from NCBI. I am getting output like this:
>>>> print(blast_record.descriptions[0].title)
>
gi|241564310|ref|XP_002401874.1| E1-E2 ATPase, putative [Ixodes
> scapularis] >gi|215501920|gb|EEC11414.1| E1-E2 ATPase, putative
> [Ixodes scapularis]

The NCBI BLAST tools have a strange method of merging redundant entries
into a single entry, which results in these odd identifiers.

Peter

From cmckay at u.washington.edu Thu Sep 24 21:51:31 2009
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 24 Sep 2009 14:51:31 -0700
Subject: [Biopython] get back raw records with SeqIO?
Message-ID: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>

Hello all. Congratulations on the release of 1.52. I'm very pleased to
see the large file index feature included. And even more thrilled to
have more full featured support for writing genbank files with SeqIO.
Thanks!

Are there plans to preserve more information in the in_genbank -->
SeqIO --> out_genbank pipeline? For instance, at the moment, AUTHORS,
COMMENT, etc are lost.

I have a usage question about SeqIO. If I want to get back the raw
records from a file, can I do that with SeqIO? For example, to parse a
genbank file with many records, I do:

genbank_records = GenBank.Iterator(in_file_handle)

Can I use SeqIO similarly somehow? Can I tell it not to parse records?
My way works fine, but I presume that Bio.GenBank is going to be phased
out sometime.

Thanks!
Cedar

From biopython at maubp.freeserve.co.uk Fri Sep 25 09:50:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 10:50:51 +0100
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
Message-ID: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>

On Thu, Sep 24, 2009 at 10:51 PM, Cedar McKay wrote:
> Hello all. Congratulations on the release of 1.52. I'm very pleased to see
> the large file index feature included.

I hoped you would be - our mailing list discussion earlier in the year
basically triggered including this in Biopython:
http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html

Were you able to update your script using the precursor index code to
use the new Bio.SeqIO.index function? It should have been a drop-in
replacement ;)

> And even more thrilled to have more full featured support for writing
> genbank files with SeqIO. Thanks!

I guess you missed that earlier - the GenBank output included features
as of Biopython 1.51, but there have been a few tweaks since then.

> Are there plans to preserve more information in the in_genbank
> --> SeqIO --> out_genbank pipeline? For instance, at the moment,
> AUTHORS, COMMENT, etc are lost.

Like BioPerl, we are not expecting to offer a 100% round trip, but yes,
there are some bits (like the references) which still need doing. I
haven't had the time or the need to follow up on those fields yet - but
I would certainly review a patch if you wanted to work on that.

> I have a usage question about SeqIO. If I want to get back the raw records
> from a file, can I do that with SeqIO? For example, to parse a genbank file
> with many records, I do:
>
> genbank_records = GenBank.Iterator(in_file_handle)
>
> Can I use SeqIO similarly somehow? Can I tell it not to parse records?

No, the SeqIO system does not break up files into chunks of raw text.
One good reason for this is that it isn't possible for every file
format (e.g. interlaced alignments). For some of the specific file
formats it could be done. The mechanics of this are rather similar to
what the new indexing code is doing internally (for those file formats
where it is possible).

Why do you want to do this? I'd like to understand the desired usage.

> My way works fine, but I presume that Bio.GenBank is going to be
> phased out sometime.

In the long term, perhaps we will phase out Bio.GenBank, but there is
nothing planned.
It currently provides both SeqRecord parsing (called by Bio.SeqIO) and a
lower-level, more GenBank-faithful record object. This still has its
uses (especially while there is still room for improvement in GenBank
output via SeqIO).

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 12:29:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 13:29:22 +0100
Subject: [Biopython] Trimming FASTQ reads, was: [Velvet-users] Read length as a parameter?
Message-ID: <320fb6e00909250529g15914649mde9e90683c85b975@mail.gmail.com>

Hi all,

I meant to forward this earlier, but it looks like I didn't. I've also
just posted a related blog post on the topic:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter

---------- Forwarded message ----------
From: Peter
Date: Fri, Sep 25, 2009 at 11:46 AM
Subject: Re: [Velvet-users] Read length as a parameter?
To: Daniel Zerbino
Cc: Dan Bolser, velvet-users at ebi.ac.uk

Hi Velvet users & Biopython fans,

I've CC'd this to the Biopython list as the examples may be of interest
there too. We are talking about scripts to pre-filter sequencing reads
before analysis with another tool (in this case, the assembler velvet).
The original thread is here:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000578.html

On Fri, Sep 25, 2009 at 10:45 AM, Daniel Zerbino wrote:
>
> Hello Yunchen and Dan,
>
> I'm afraid Velvet does not offer either length or k-mer frequency
> pre-filtering, although the cov_cutoff is a k-mer frequency post-filtering.
>
> Given practical considerations, I don't think I can implement this in the
> next months.
>
> However, what can be done is to have a simple script which does the
> filtering then pipes them to velvet:
>
> my_filtering_script.xx my_reads.fa | velveth directory 21 -fasta - ...
>
> In the case of k-mer frequency filtering, you could imagine preparing a
> pseudo-fastq file which assigns a score to each nucleotide based on the
> frequency of the k-mer ending at that position, then scripting a score
> filter which pipes into Velvet.
>
> As usual, if anyone is willing to put forward such scripts for the other
> users, I will be happy to put it in the package.
>
> Best regards,
>
> Daniel

Was that a challenge? ;) It probably won't be the fastest solution, but
it is very easy to do this with Biopython's SeqIO library.

#!/usr/bin/python
# This is a simple python script using Biopython 1.50+
# to read in FASTA records from stdin, trim to 21 letters,
# and write them to stdout.
import sys
from Bio import SeqIO
records = (rec[:21] for rec in SeqIO.parse(sys.stdin, "fasta"))
SeqIO.write(records, sys.stdout, "fasta")

This isn't (yet) a full script with command line arguments etc. You
could also do this with filenames, but to keep the examples short I'm
using stdin and stdout (not a problem for those happy at the command
line).

Because FASTA files are so simple, it would be fairly trivial to write
a plain Python script (without using Biopython) which runs faster than
this. However (and this is a sales pitch), just by changing the format
name the above script would also work on other file formats. For
example, with Biopython 1.51+ this would work fine for FASTQ files too.

However, if speed is an issue (often the case with large next gen
sequencing files), then a lower level python script is also possible,
e.g.:

#!/usr/bin/python
# This is a fairly simple python script using Biopython 1.51+
# to read in FASTQ records from stdin, trim to 21 letters,
# and write them to stdout. It does not check the quality
# strings at all, and should therefore work on Sanger,
# Solexa or Illumina 1.3+ FASTQ files equally well.
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(sys.stdin):
    print("@%s\n%s\n+\n%s" % (title, seq[:21], qual[:21]))
    # The print call will include a trailing newline

Both these examples are just four lines of code (two of which are
imports), pretty neat if I do say so myself ;)

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 14:04:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 15:04:00 +0100
Subject: [Biopython] Correcting short read errors based on k-mer coverage
Message-ID: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>

Hi all,

This email was an offshoot of this thread on the Velvet users' list,
and Dan suggested I could CC the Biopython mailing list. See also:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000581.html

The method Dan describes looks like an interesting computational
challenge, but should be possible in (Bio)python...

Peter

---------- Forwarded message ----------
From: Dan Bolser
Date: Fri, Sep 25, 2009 at 2:39 PM
Subject: Re: [Velvet-users] Read length as a parameter?
To: peter at maubp.freeserve.co.uk
Cc: Daniel Zerbino

2009/9/25 Peter :
> On Fri, Sep 25, 2009 at 11:56 AM, Dan Bolser wrote:
>>
>> Hi Peter,
>>
>> Thanks for the examples.
>>
>> Since you're clearly keen to show off the power of bioperl, here *is* a
>> challenge ;-)
>>
>> 1) Construct a k-mer frequency distribution from the set of quality
>> trimmed reads.
>> 2) Correct full length reads by making point mutations that change
>> 'rare' k-mers into 'common' k-mers.
>> 3) Re-trim reads according to k-mer frequency after correction.
>>
>> 4) (For extra credit), implement steps 2 and 3 but include homo-polymer
>> length variability (indels) in the set of allowed correction
>> operations.
>>
>> I really think a 'code jamboree' could be a lot of fun (given the rate
>> of technology change).
>>
>> I'd be seriously impressed at any reasonable 'module' to crack the above!
>
> Hi Dan,
>
> Did you mean to send this off list?

I figured it wasn't really relevant to the Velvet mailing list, but
please feel free to cc biopython.

> BioPerl/Biopython jokes aside, right now I don't understand exactly
> what you are asking for - although with a little more background reading
> I could probably work it out. Presumably all of this is ignoring the
> FASTQ quality scores? e.g. it would be fine just to work with FASTA
> files? In step (2) you want to edit the reads (giving a new FASTA file)?
> What you want in step (3) is unclear.

Sorry, I took quite a few shortcuts in my description. Please see:
http://www.ncbi.nlm.nih.gov/pubmed/19056694

Step 1 uses quality to select high quality regions of reads. These
reads are broken down into k-mers (say of length 21), and then you
construct a k-mer frequency table. For example, k-mer
TATATATATATATATATATAT occurs 5000 times in my read set. Here you need
to consider memory usage.

In step 2 you take the full reads (ignoring qualities) and look at the
k-mer frequency (average?) at each base. Some bases will have a very
low k-mer frequency, indicating sequencing errors. Such bases can
sometimes be unambiguously 'fixed' (changed to have mean k-mer
frequency) by making single base substitutions.

Finally, 'unfixable' reads (by the above definition) can be trimmed.

HTH,
Dan.

From michael.koeris at gmail.com Fri Sep 25 15:25:00 2009
From: michael.koeris at gmail.com (Michael S. Koeris)
Date: Fri, 25 Sep 2009 11:25:00 -0400
Subject: [Biopython] SeqIO parser error
Message-ID: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com>

Hi,

I'm just getting acquainted with Sequence Objects and records and so forth.
I tried some very basic code from the tutorial and I get an error when
I run this:

from Bio import Entrez, SeqIO

gi_list = ['224589821', '224514694', '164698032', '157812089', '157734174']
gi_str = ",".join(gi_list)
handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")

records = SeqIO.parse(handle, "gb")

for record in records:
    print("%s, length %i, with %i features"
          % (record.name, len(record), len(record.features)))

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer, do_features) :
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 381, in feed
    self._feed_misc_lines(consumer, misc_lines)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 1142, in _feed_misc_lines
    consumer.contig_location(contig_location)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 987, in contig_location
    self.location(content)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 689, in location
    self._set_location_info(parse_info, self._cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 797, in _set_location_info
    self._set_function(parse_info, cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 720, in _set_function
    self._set_ordering_info(function, cur_feature)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 764, in _set_ordering_info
    feature_start = cur_feature.sub_features[0].location.start
AttributeError: 'PositionGap' object has no attribute 'start'

Any help is most appreciated.

Mike

--
Michael S. Koeris
michael.koeris at gmail.com

From biopython at maubp.freeserve.co.uk Fri Sep 25 15:42:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 16:42:39 +0100
Subject: [Biopython] Correcting short read errors based on k-mer coverage
In-Reply-To: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>
References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>
Message-ID: <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com>

Dan Bolser wrote:
> Step 1 uses quality to select high quality regions of reads. These
> reads are broken down into k-mers (say of length 21), and then you
> construct a k-mer frequency table. For example, k-mer
> TATATATATATATATATATAT occurs 5000 times in my read set. Here you need
> to consider memory usage.

I just tried with a short read file from the NCBI SRA with ~7 million
reads of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had ~100
million k-mers in total, and found about 18 million different k-mers.
About half occurred only once.

My naive code to count the k-mers used a Python dictionary (k-mer
strings as the keys, integer counts as values). It took about 5 minutes
to run and about 1.5 GB of RAM.

What sized files are you hoping to run this on? Without knowing that,
it is hard to say if this simple dictionary approach will scale well.

Dan Bolser wrote:
> In step 2 you take the full reads (ignoring qualities) and look at the
> k-mer frequency (average?) at each base. Some bases will have a very
> low k-mer frequency, indicating sequencing errors.
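
To make the two ideas in this message concrete, here is a minimal pure-Python sketch of a dictionary-based k-mer count and of a per-base k-mer frequency lookup. It is not the code used for the timings above: the function names `count_kmers` and `per_base_support` are illustrative, and averaging the counts of all k-mers that cover a base is only one possible reading of "k-mer frequency (average?) at each base".

```python
from collections import defaultdict


def count_kmers(reads, k):
    """Return a dict mapping each k-mer string to its number of occurrences."""
    counts = defaultdict(int)
    for read in reads:
        # A read of length n contributes n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)


def per_base_support(read, counts, k):
    """For each base, average the counts of all k-mers that cover it."""
    n_kmers = len(read) - k + 1
    if n_kmers <= 0:
        # Read is shorter than k, so no k-mer covers any base.
        return [0.0] * len(read)
    kmer_counts = [counts.get(read[i:i + k], 0) for i in range(n_kmers)]
    support = []
    for pos in range(len(read)):
        first = max(0, pos - k + 1)      # first k-mer covering this base
        last = min(pos, n_kmers - 1)     # last k-mer covering this base
        window = kmer_counts[first:last + 1]
        support.append(sum(window) / float(len(window)))
    return support


reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, 3)
print(per_base_support("ACGTTC", counts, 3))
```

In this toy input the third read differs from the other two at one position, and the bases near that position get a lower average count than the bases at the start of the read. Deciding which low-support bases to correct or trim is left out of the sketch.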

Are you suggesting following the method of Chaisson et al 2009,
described in section "Detecting and error correcting accurate read
prefixes" of that paper - or something a little different? That section
itself cites several related approaches to read correction.

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 15:49:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 16:49:01 +0100
Subject: [Biopython] SeqIO parser error
In-Reply-To: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com>
References: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com>
Message-ID: <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com>

On Fri, Sep 25, 2009 at 4:25 PM, Michael S. Koeris wrote:
> Hi,
>
> I'm just getting acquainted with Sequence Objects and records and so forth.
> I tried some very basic code from the tutorial and I get an error when I run
> this:
>
> from Bio import Entrez, SeqIO
>
> gi_list = ['224589821', '224514694', '164698032', '157812089', '157734174']
> gi_str = ",".join(gi_list)
> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")
>
> records = SeqIO.parse(handle, "gb")
>
> for record in records:
>     print("%s, length %i, with %i features"
>           % (record.name, len(record), len(record.features)))
>
> Traceback (most recent call last):
> ...
>     feature_start = cur_feature.sub_features[0].location.start
> AttributeError: 'PositionGap' object has no attribute 'start'
>
> Any help is most appreciated.

Hi Mike,

You have found Bug 2745. Do you fancy testing the proposed fix?
http://bugzilla.open-bio.org/show_bug.cgi?id=2745

As a workaround, you can ask the NCBI for full GenBank records, not
CONTIG records (use rettype="gbwithparts"). However, since these are
such large files (whole chromosomes) it might be better to download the
whole human genome via FTP instead...
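
As a rough illustration of that workaround, the original script could be adapted along the following lines. This is a sketch rather than tested code from the thread: it assumes `Entrez.efetch` accepts `rettype="gbwithparts"` as described above, and the function names `fetch_full_genbank` and `describe`, the `email` parameter and the `retmode="text"` argument are additions made for the example.

```python
def describe(record):
    """Summarise a SeqRecord-like object on one line."""
    return "%s, length %i, with %i features" % (
        record.name, len(record), len(record.features))


def fetch_full_genbank(gi_list, email):
    """Yield SeqRecords for the given GI numbers (needs network access).

    The records are requested with rettype="gbwithparts" so that the
    full sequence and features are returned rather than CONTIG records.
    """
    # Imported lazily so that describe() can be used without Biopython.
    from Bio import Entrez, SeqIO
    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="nuccore", id=",".join(gi_list),
                           rettype="gbwithparts", retmode="text")
    try:
        for record in SeqIO.parse(handle, "gb"):
            yield record
    finally:
        handle.close()
```

A call such as `for rec in fetch_full_genbank(gi_list, "you@example.org"): print(describe(rec))` would then replace the loop in the original script. As noted above, these downloads are whole chromosomes, so they will be large.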
Peter From biopython at maubp.freeserve.co.uk Fri Sep 25 16:34:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 17:34:34 +0100 Subject: [Biopython] Correcting short read errors based on k-mer coverage In-Reply-To: <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com> References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com> <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com> <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com> Message-ID: <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com> On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser wrote: > > 2009/9/25 Peter : >> >> I just tried with a short read file from the NCBI SRA with ~7 million reads >> of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had in total >> ~100 million kmers in total, and found about ~18 million different kmers. >> About half occurred only once. >> >> My naive code to count the kmers used a Python dictionary (k-mer >> strings as the keys, integer counts as values). It took about 5 minutes >> to run and about 1.5 GB of RAM. >> >> What sized files are you hoping to run this on? Without knowing that, >> it is hard to say if this simple dictionary approach will scale well. > > To warm up I'd want to try 125 million reads of ~50 bp. That might still be possible in RAM... just. Are you aware of any public datasets of that size? An NCBI SRA one for example? > Later I'd want about 100 times more. Right - that will certainly mean holding everything in memory isn't going to be an option! A simple SQLite database might work nicely though. >> Dan Bolser wrote: >>> In step 2 you take the full reads (ignoring qualities) and look at the >>> k-mer frequency (average?) at each base. Some bases will have a very >>> low k-mer frequency, indicating sequencing errors. 
>>
>> Are you suggesting following the method of Chaisson et al 2009,
>> described in section "Detecting and error correcting accurate read
>> prefixes" of that paper - or something a little different? That section
>> itself cites several related approaches to read correction.
>
> Yeah, I was thinking of the Chaisson 2009 method. Since then I had a
> couple of other methods brought to my attention on the Velvet mailing
> list:
>
> Efficient frequency-based de novo short-read clustering for error
> trimming in next-generation sequencing.
> Qu W, Hashimoto S, Morishita S.
> Genome Res. 2009 Jul;19(7):1309-15. Epub 2009 May 13.
> PMID: 19439514
> http://www.ncbi.nlm.nih.gov/pubmed/19439514
>
> SHREC: a short-read error correction method.
> Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B.
> Bioinformatics. 2009 Sep 1;25(17):2157-63. Epub 2009 Jun 19.
> PMID: 19542152
> http://www.ncbi.nlm.nih.gov/pubmed/19542152
>
>
> So the result is looking more and more redundant... However, a python
> one liner would be awesome!

I doubt a few-line python script for the whole task will be
forthcoming, although parts of it may be more realistic (e.g. an SQLite
based k-mer counter).

This sort of thing (k-mer frequency based read correction and trimming)
might be of interest to the EMBOSS project, who have expressed an
interest in developing new command line tools for next generation
sequencing data (e.g. simple quality score read filtering and
trimming).

Peter

From thomas.e.keller at gmail.com Sat Sep 26 03:17:27 2009
From: thomas.e.keller at gmail.com (Thomas Keller)
Date: Fri, 25 Sep 2009 22:17:27 -0500
Subject: [Biopython] Nexus.Tree fails to import nexus tree file
Message-ID: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>

I apologize if this has been addressed; I looked online and it does not
seem to be a general issue. I have several programs that generate nexus
files consisting entirely of trees; there is no sequence information.
Can the Nexus parser not read this type of nexus file?
When I try to open a file with:

from Bio.Nexus import Trees
tree_string=open('Analysis_tree_1a.tre').read()
tree=Trees.Tree(tree_string)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/thomas/Missing data in tree of Life projects/tree_base_dat/S2000/ in ()

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in __init__(self, tree, weight, rooted, name, data, values_are_support, max_support)
     70             # there's discrepancy whether newick allows semicolons et the end
     71             tree=tree.rstrip(';')
---> 72             self._add_subtree(parent_id=root.id,tree=self._parse(tree)[0])
     73
     74     def _parse(self,tree):

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in _parse(self, tree)
     96         else:
     97             closing=tree.rfind(')')
---> 98             val=self._get_values(tree[closing+1:])
     99             if not val:
    100                 val=[None]

/var/lib/python-support/python2.6/Bio/Nexus/Trees.pyc in _get_values(self, text)
    172                     values.append(nodecomment)
    173         else:
--> 174             values=[float(t) for t in text.split(':') if t.strip()]
    175         return values
    176

ValueError: invalid literal for float(): ;END

The associated nexus file I am trying to read is the following:

#NEXUS

BEGIN TREES;

    TRANSLATE
        1  'Pleurochrysis_pseudoroscoffensis_HAP48',
        2  'Pleurochrysis_placolithoides_HAP59bis',
        3  'Pleurochrysis_carterae_Von_Stosch',
        4  'Pleurochrysis_roscoffensis_HAP32',
        5  'Pleurochrysis_scherffelii_HAP11',
        6  'Pleurochrysis_sp_Langue_du_chat',
        7  'Pleurochrysis_elongata_CCMP874',
        8  'Pleurochrysis_gayraliae_HAP10',
        9  'Hymenomonas_coronata_HAP58bis',
        10 'Pleurochrysis_elongata_HAP79',
        11 'Ochrosphaera_verrucosa_HAP82',
        12 'Pleurochrysis_carterae_HAP1',
        13 'Pleurochrysis_dentata_HAP6',
        14 'Pleurochrysis_sp_MBIC10443',
        15 'Pleurochrysis_sp_MBIC10549',
        16 'Jomonlithus_littoralis_JE5',
        17 'Pleurochrysis_sp_CCMP875',
        18 'Pleurochrysis_sp_CCMP300',
        19 'Gloeothamnion_sp_HAPG'
        ;
    TREE 'Fig._2' = [&R] (11,((((((4,1,8,3,18),(10,7)),(12,(5,19))),6),2),(13,(14,17),15)),(9,16));
END;

Please let me know if I am doing something wrong, and if so what. Is
the tree nexus file malformed in some way?

Cheers,
Thomas Keller

From biopython at maubp.freeserve.co.uk Sat Sep 26 10:23:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 26 Sep 2009 11:23:39 +0100
Subject: [Biopython] Nexus.Tree fails to import nexus tree file
In-Reply-To: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>
References: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>
Message-ID: <320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com>

On Sat, Sep 26, 2009 at 4:17 AM, Thomas Keller wrote:
> I apologize if this has been addressed; I looked online and it does
> not seem to be a general issue. I have several programs that generate
> nexus files consisting entirely of trees; there is no sequence
> information. Can the Nexus parser not read this type of nexus file?
> When I try to open a file with:
>
> from Bio.Nexus import Trees
> tree_string=open('Analysis_tree_1a.tre').read()
> tree=Trees.Tree(tree_string)

Use the above code if tree_string is JUST a Newick tree. In your case,
judging from the example, you have a full NEXUS file, so use the
Bio.Nexus.Nexus parser.

Peter

From biopython at maubp.freeserve.co.uk Sat Sep 26 11:41:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 26 Sep 2009 12:41:55 +0100
Subject: [Biopython] SeqIO parser error
In-Reply-To: <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com>
References: <1BD71883-D1BF-4766-BFC0-2CE0B237E9C6@gmail.com>
 <320fb6e00909250849l1b985a3aj5658aec7630a7264@mail.gmail.com>
Message-ID: <320fb6e00909260441s2d463116md2ae093982f955cb@mail.gmail.com>

On Fri, Sep 25, 2009 at 4:49 PM, Peter wrote:
>
> Hi Mike,
>
> You have found Bug 2745. Do you fancy testing the proposed fix?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2745
>

That proposed fix has been checked into git now, so if anyone wants to
test it you can grab the latest source code (e.g. via the github
download link) and reinstall. See:

http://biopython.org/wiki/SourceCode
http://github.com/biopython/biopython

Or, since this only affects a couple of files, it would be possible to
update them individually - although this is a bit more fiddly. I would
normally only suggest this for Windows users who don't have a suitable
C compiler installed.

Peter

From biopython at maubp.freeserve.co.uk Mon Sep 28 11:10:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 12:10:41 +0100
Subject: [Biopython] Correcting short read errors based on k-mer coverage
In-Reply-To: <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com>
References: <320fb6e00909250704o369a90dci8156f2f5169747e9@mail.gmail.com>
 <320fb6e00909250842j2e3a2cbdi8d602be801696fe4@mail.gmail.com>
 <2c8757af0909250916g5263b02eudb52aa5a03019e6e@mail.gmail.com>
 <320fb6e00909250934m69d95c40nd002e9652031c11d@mail.gmail.com>
Message-ID: <320fb6e00909280410w46499133tb26e63d1529939b@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 PM, Peter wrote:
> On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser wrote:
>>
>> 2009/9/25 Peter :
>>>
>>> I just tried with a short read file from the NCBI SRA with ~7 million reads
>>> of 36bp and k=21. Each 36bp read gives 16 k-mers, thus I had ~100 million
>>> k-mers in total, and found about 18 million different k-mers.
>>> About half occurred only once.
>>>
>>> My naive code to count the k-mers used a Python dictionary (k-mer
>>> strings as the keys, integer counts as values). It took about 5 minutes
>>> to run and about 1.5 GB of RAM.

An alternative approach reduced the memory needed for this example from
1.25GB resident to about 0.8GB resident, while still taking about 5 mins.
Instead of storing the kmers as strings, I encoded them as large integers (basically using 2 bits per letter instead of 8 bits). This means for kmers up to and including 32-mers, you need only a 64bit unsigned long. You can do this in Python, but my initial code was a bit slow - so I redid it as a Python C extension. The only problem here is what to do with ambiguous sequences - for example any N characters? This still used a Python dictionary to hold the (integer) encoded kmer sequences as keys, and their (integer) counts as values. As noted before, there are disk based options here like an SQLite database. Peter From biopython at maubp.freeserve.co.uk Mon Sep 28 13:02:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 14:02:33 +0100 Subject: [Biopython] Bit encoded sequences, was: Correcting short read errors based on k-mer coverage Message-ID: <320fb6e00909280602l17eb165du9194ca1b92e4620@mail.gmail.com> On Mon, Sep 28, 2009 at 1:18 PM, Dan Bolser wrote: > > 2009/9/28 Peter : >> >> An alternative approach reduced the memory needed for this example >> from 1.25GB resident to about 0.8GB resident, while still taking about >> 5 mins. Instead of storing the kmers as strings, I encoded them as large >> integers (basically using 2 bits per letter instead of 8 bits). This means >> for kmers up to and including 32-mers, you need only a 64bit unsigned >> long. You can do this in Python, but my initial code was a bit slow - so >> I redid it as a Python C extension. The only problem here is what to do >> with ambiguous sequences - for example any N characters? > > Sounds like a good solution... does BioPython have any modules for > hiding this kind of compressed sequence representation? i.e. using > some object to represent the string instead of a string, where the > object has this compression 'under the hood'? Not at the moment, no. 
However, a bit encoded Seq subclass for unambiguous DNA or RNA is
something I had in mind while looking at encoding kmers. I think BioJava
does something like this already. Another reason to look at this is with
an eye on the future for Python 3, which makes unicode the default
(although byte strings remain, we'll probably want to use them in the
Seq object).

> I think most people reserve this kind of compression for
> 'non-ambiguous' strings only, and cludge the ambiguity codes[1] if
> necessary.
>
> [1] http://droog.gs.washington.edu/parc/images/iupac.html

For ambiguous DNA or RNA, you can use four bits for each bp (i.e. can it
be an A, C, G, or T - thus you might encode this as 1000 = A, 0100 = C,
..., 1100 = M and 1111 = N). This requires 50% of the memory of the naive
one byte for each bp scheme.

For unambiguous DNA or RNA you just need two bits, say A = 00, C = 01,
G = 10 and T = 11. This mapping should make taking the complement very
fast via bit flipping. Dealing with the reverse complement requires a
little more thought (e.g. byte alignment issues if the sequence is not a
multiple of four bp in length).

For proteins things are less easy, you'd need at least five bits
(2^5 = 32 combinations) which isn't ideal. You can have compression, but
the byte boundaries may slow things down. Then there are things like gap
characters, stop symbols, mixed case and any other ad-hoc additions.

For ambiguous single case DNA/RNA we could do two bp per byte, which in
itself may not be worth the hassle. There would be scope for (reverse)
complement optimisations, however, if it allowed faster sequence matching
things become more interesting...

But certainly, for unambiguous single case DNA/RNA we could get four bp
per byte, which seems a worthwhile improvement.

>> This still used a Python dictionary to hold the (integer) encoded kmer
>> sequences as keys, and their (integer) counts as values. As noted
>> before, there are disk based options here like an SQLite database.
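
To make the two-bit idea concrete, here is a minimal pure-Python sketch.
It is not the C extension mentioned in this thread (that code was not
posted); the function names encode_kmer, decode_kmer, complement,
reverse_complement and count_kmers are made up for illustration, and
skipping any k-mer that contains an N is only one possible answer to the
ambiguity question raised above.

```python
# Two-bit mapping from the discussion: A = 00, C = 01, G = 10, T = 11.
# Upper case, unambiguous DNA only (an assumption of this sketch).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base (first base highest)."""
    value = 0
    for base in kmer:
        # Raises KeyError for N or any other letter outside A, C, G, T.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a string."""
    return "".join(
        BITS_TO_BASE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k)
    )


def complement(value, k):
    """Complement by flipping every bit: 00 <-> 11 and 01 <-> 10."""
    mask = (1 << (2 * k)) - 1
    return value ^ mask


def reverse_complement(value, k):
    """Flip the bits, then reverse the order of the two-bit groups."""
    remaining = complement(value, k)
    result = 0
    for _ in range(k):
        result = (result << 2) | (remaining & 3)
        remaining >>= 2
    return result


def count_kmers(reads, k):
    """Count k-mers in a dict keyed by the integer encoding."""
    counts = {}
    for read in reads:
        for start in range(len(read) - k + 1):
            try:
                key = encode_kmer(read[start:start + k])
            except KeyError:
                continue  # one possible policy: ignore k-mers with N
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, count_kmers(["ACGTN", "ACGT"], 4) counts ACGT twice and
ignores CGTN. For k up to 32 each key fits in 64 bits, although a plain
Python integer and dictionary will not show the full memory saving of a
packed C representation.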
>
> Yeah, I was wondering about a Berkeley DB or similar.

Berkeley DB is certainly a sensible option to look at. Python 2.x
includes Berkeley DB wrappers (the bsddb module), but sadly this has been
dropped from the standard libraries in Python 3.x, which is why I had a
slight leaning towards trying SQLite first of all.

> I wonder if there is any way to do this approximately and still get
> good error correction statistics? (I'm thinking about the way BowTie
> works using approximate hash matching to pre-filter alignments).

I don't know exactly what BowTie does.

> Any hints from the two papers?

Those that I have looked at are vague about the implementation details,
but I may just have not read them carefully enough.

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 29 11:06:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Sep 2009 12:06:19 +0100
Subject: [Biopython] Deprecating Bio.Prosite and Bio.Enzyme
In-Reply-To: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com>
References: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com>
Message-ID: <320fb6e00909290406w2743caceqbf91e99a242c16d8@mail.gmail.com>

On Thu, Aug 20, 2009 at 10:48 AM, Peter wrote:
> Hi all,
>
> Bio.Prosite and Bio.Enzyme were declared obsolete in Release 1.50,
> being replaced by Bio.ExPASy.Prosite and Bio.ExPASy.Enzyme,
> respectively.
>
> Are there any objections to deprecating Bio.Prosite and Bio.Enzyme
> for the next release?

Bio.Prosite and Bio.Enzyme were left as obsolete in Release 1.52,
but have now been deprecated for the next release.

Peter

From lueck at ipk-gatersleben.de Tue Sep 29 12:50:04 2009
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 29 Sep 2009 14:50:04 +0200
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
Message-ID: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>

Hi everybody!

Does someone know an algorithm to search for sequence similarity while
allowing mismatches at certain positions?

E.g.
Looking in a sequence database for

ATGCTCGCGCTCGCTCGCGCA

by allowing a mismatch at positions [3] and [18].

I can do it via regular expressions but I guess it would be quite slow.

Thanks for any hints!
Stefanie

From chapmanb at 50mail.com Tue Sep 29 13:22:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 29 Sep 2009 09:22:39 -0400
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090929132239.GK29829@sobchak.mgh.harvard.edu>

Hi Stefanie;

> Does someone knows an algorithm to search for sequence similarity by allowing missmatches at certain positions?
>
> E.g.
> Looking in a sequence database for
>
> ATGCTCGCGCTCGCTCGCGCA
>
> by allowing an missmatch at position [3] and [18].

dreg and fuzznuc in EMBOSS both do this:

http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/dreg.html
http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html

The SHRiMP aligner also allows you to specify a seed with defined
match and mismatch positions:

http://compbio.cs.toronto.edu/shrimp/

See the '-s' option in the README.

Hope this helps,
Brad

From mailinglist.honeypot at gmail.com Tue Sep 29 13:23:29 2009
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Tue, 29 Sep 2009 09:23:29 -0400
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
Message-ID: <3C37F38A-C7D6-4870-B08E-893656E0EB7C@gmail.com>

Hi,

On Sep 29, 2009, at 8:50 AM, Stefanie Lück wrote:

> Hi everybody!
>
> Does someone knows an algorithm to search for sequence similarity by
> allowing missmatches at certain positions?
>
> E.g.
> Looking in a sequence database for
>
> ATGCTCGCGCTCGCTCGCGCA
>
> by allowing an missmatch at position [3] and [18].
>
> I can do it via regular expressions but I guess it would be quite
> slow.

You can use bowtie:

http://bowtie-bio.sourceforge.net/index.shtml

You can't tell it where to allow the mismatch, but you can tell it how
many mismatches to allow. The output file is easy to parse, and it also
informs you of the position of the mismatch, and what nucleotide was
changed to what in order to make the match.

Pros: Insanely fast aligner.

Cons:
* You'll have to do a bit of work at the command line.
* You need an index file for your "database" of sequences you are
  searching against (not querying with). There are several provided on
  the site, otherwise it's also quite easy to make your own (though that
  requires a lot of memory).

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From biopython at maubp.freeserve.co.uk Tue Sep 29 13:30:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Sep 2009 14:30:00 +0100
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
In-Reply-To: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00909290630xc1b3b76s71be90e78c05a643@mail.gmail.com>

On Tue, Sep 29, 2009 at 1:50 PM, Stefanie Lück wrote:
> Hi everybody!
>
> Does someone knows an algorithm to search for sequence similarity by allowing missmatches at certain positions?
>
> E.g.
> Looking in a sequence database for
>
> ATGCTCGCGCTCGCTCGCGCA
>
> by allowing an missmatch at position [3] and [18].
>
> I can do it via regular expressions but I guess it would be quite slow.

When you say "sequence database" do you mean a set of local files (e.g.
big FASTA files), a real database (e.g. BioSQL), or something else like
an online database (e.g. GenBank)?

I would have suggested you try regular expressions, because they let
you deal with the specific positions where you allow a mismatch.
i.e. ATG.TCGCGCTCGCTCGC.CA as a regular expression?

You want to look for ATGNTCGCGCTCGCTCGCNCA using IUPAC codes, which
I think would work with something like fuzznuc from EMBOSS:
http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 29 13:32:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Sep 2009 14:32:34 +0100
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
In-Reply-To: <20090929132239.GK29829@sobchak.mgh.harvard.edu>
References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
	<20090929132239.GK29829@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909290632w22e2589dj96062e47c816af61@mail.gmail.com>

On Tue, Sep 29, 2009 at 2:22 PM, Brad Chapman wrote:
> Hi Stefanie;
>
>> Does someone knows an algorithm to search for sequence similarity
>> by allowing missmatches at certain positions?
>>
>> E.g.
>> Looking in a sequence database for
>>
>> ATGCTCGCGCTCGCTCGCGCA
>>
>> by allowing an missmatch at position [3] and [18].
>
> dreg and fuzznuc in EMBOSS both do this:
>
> http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/dreg.html
> http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html

I didn't know about EMBOSS dreg - that looks rather neat. We should add
wrappers for these to Bio.Emboss.Applications ...

> The SHRiMP aligner also allows you to specify a seed with defined
> match and mismatch positions:
>
> http://compbio.cs.toronto.edu/shrimp/
>
> See the '-s' option in the README.

That's a neat trick.

Peter

From lueck at ipk-gatersleben.de Tue Sep 29 14:23:57 2009
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 29 Sep 2009 16:23:57 +0200
Subject: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions
References: <009501ca4103$5dcb6ac0$1022a8c0@ipkgatersleben.de>
	<320fb6e00909290630xc1b3b76s71be90e78c05a643@mail.gmail.com>
Message-ID: <00ef01ca4110$7b57e340$1022a8c0@ipkgatersleben.de>

I mean big FASTA files. Thanks for all the suggestions, I'll have a look
at them and decide what to use!

Stefanie

----- Original Message -----
From: "Peter"
To: "Stefanie Lück"
Cc:
Sent: Tuesday, September 29, 2009 3:30 PM
Subject: Re: [Biopython] alignment/matching algorithm whichs allows missmatches at certain positions

On Tue, Sep 29, 2009 at 1:50 PM, Stefanie Lück wrote:
> Hi everybody!
>
> Does someone knows an algorithm to search for sequence similarity by
> allowing missmatches at certain positions?
>
> E.g.
> Looking in a sequence database for
>
> ATGCTCGCGCTCGCTCGCGCA
>
> by allowing an missmatch at position [3] and [18].
>
> I can do it via regular expressions but I guess it would be quite slow.

When you say "sequence database" do you mean a set of local files (e.g.
big FASTA files), a real database (e.g. BioSQL), or something else like
an online database (e.g. GenBank)?

I would have suggested you try regular expressions, because they let
you deal with the specific positions where you allow a mismatch.
i.e. ATG.TCGCGCTCGCTCGC.CA as a regular expression?
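
As a rough illustration of the regular expression approach suggested
here, the sketch below builds the pattern from the query and a set of
0-based wildcard positions, then scans plain sequence strings. The helper
names mismatch_pattern and find_hits and the toy sequence are assumptions
made for this example; for big FASTA files the sequence strings could come
from Bio.SeqIO.parse instead.

```python
import re


def mismatch_pattern(query, wildcard_positions):
    """Compile a regex for `query` that accepts any character at the
    given 0-based positions and requires an exact match elsewhere."""
    parts = []
    for index, base in enumerate(query):
        if index in wildcard_positions:
            parts.append(".")
        else:
            parts.append(re.escape(base))
    return re.compile("".join(parts))


def find_hits(pattern, sequence):
    """Return the start offsets of all matches, including overlapping ones."""
    # Wrapping the pattern in a lookahead makes each match zero-width, so
    # the scan resumes at the next character rather than after the hit.
    lookahead = re.compile("(?=%s)" % pattern.pattern)
    return [match.start() for match in lookahead.finditer(sequence)]


query = "ATGCTCGCGCTCGCTCGCGCA"
pattern = mismatch_pattern(query, set([3, 18]))

# Toy target: the query with substitutions at positions 3 and 18,
# padded with two extra bases on each side.
target = "GG" + "ATGATCGCGCTCGCTCGCTCA" + "TT"
hits = find_hits(pattern, target)
```

Here pattern.pattern is ATG.TCGCGCTCGCTCGC.CA and hits is [2]. Whether
this is fast enough for a large database would need measuring; the
dedicated tools mentioned in this thread may well scale better.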

You want to look for ATGNTCGCGCTCGCTCGCNCA using IUPAC codes, which
I think would work with something like fuzznuc from EMBOSS:
http://emboss.sourceforge.net/apps/release/6.1/emboss/apps/fuzznuc.html

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 29 18:44:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Sep 2009 19:44:07 +0100
Subject: [Biopython] Nexus.Tree fails to import nexus tree file
In-Reply-To: <320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com>
References: <41d5d14b0909252017v22f01a90y51b71b3a3d66531a@mail.gmail.com>
	<320fb6e00909260323oc34fa9axd5dac734e63bee3f@mail.gmail.com>
Message-ID: <320fb6e00909291144r4c50d7ddl3892b1e513995a1c@mail.gmail.com>

On Sat, Sep 26, 2009 at 11:23 AM, Peter wrote:
> On Sat, Sep 26, 2009 at 4:17 AM, Thomas Keller wrote:
>> I apologize if this has been addressed, I looked online and it does
>> not seem to be a general issue. I have several programs that generate
>> nexus files consisting entirely of trees; there is no sequence
>> information. Can the Nexus parser not read this type of nexus file?
>> When I try to open the a file with:
>>
>> from Bio.Nexus import Trees
>> tree_string=open('Analysis_tree_1a.tre').read()
>> tree=Trees.Tree(tree_string)
>
> Use the above code if tree_string is JUST a Newick tree.
> In your case, from the example you have a full NEXUS
> file, so use the Bio.Nexus.Nexus parser.

Did you get Bio.Nexus to parse the tree for you?

Also, would you mind telling us where you got the tree from
(what software package) and if we could use it for a test
case within Biopython?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 29 19:09:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 29 Sep 2009 20:09:38 +0100
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
	<320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
Message-ID: <320fb6e00909291209r65e6c0f6nd682120591ef9a5f@mail.gmail.com>

On Fri, Sep 25, 2009 at 10:50 AM, Peter wrote:
> On Thu, Sep 24, 2009 at 10:51 PM, Cedar McKay wrote:
>> Are there plans to preserve more information in the in_genbank
>> --> SeqIO --> out_genbank pipeline? For instance, at the moment,
>> AUTHORS, COMMENT, etc are lost.
>
> Like BioPerl, we are not expecting to offer a 100% round trip, but yes
> there are some bits (like the references) which still need doing. I haven't
> have the time or the need to follow up on those fields yet - but I would
> certainly review a patch if you wanted to work on that.

Hi Cedar,

I've just added support for writing the COMMENT lines in SeqIO's
GenBank output to the repository (which is now using git hosted
on github). I'm hoping you'll give this code a quick test...

Assuming you are running Biopython 1.52, you only really need to
update the Bio/SeqIO/InsdcIO.py file (e.g. download it via the
source code browser) but it might be simpler to just grab the latest
source and reinstall. See this wiki page for details:
http://biopython.org/wiki/SourceCode

Thanks,

Peter

From cmckay at u.washington.edu Wed Sep 30 23:14:15 2009
From: cmckay at u.washington.edu (Cedar McKay)
Date: Wed, 30 Sep 2009 16:14:15 -0700
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
	<320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
Message-ID: <4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu>

> I hoped you would be - our mailing list discussion earlier in the year
> basically triggered including this in Biopython:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html
>
> Were you able to update your script using the precursor index code
> to use the new Bio.SeqIO.index function? It should have been a drop
> in replacement ;)

My head isn't at that code at the moment, but I'll try to give it a
whirl next week.

> Why do you want to do this? I'd like to understand the desired usage.

I didn't have a specific technical reason. It just seemed like
everything was going towards using SeqIO and things like Bio.Fasta were
being deprecated, so I wanted to get ahead of the curve there. But if
Bio.GenBank is going to be around for a long time, I don't have any
problem with doing it that way.

Thanks again.
C