From winda002 at student.otago.ac.nz Wed Jul 1 02:13:17 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 01 Jul 2009 18:13:17 +1200
Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace
In-Reply-To: <320fb6e00906300247ve2f45eau4ef97d3f65bb50c5@mail.gmail.com>
References: <761477.83949.qm@web65501.mail.ac4.yahoo.com>
  <200906291716.06607.jblanca@btc.upv.es>
  <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com>
  <200906301031.06273.jblanca@btc.upv.es>
  <320fb6e00906300247ve2f45eau4ef97d3f65bb50c5@mail.gmail.com>
Message-ID: <4A4AFE7D.9020800@student.otago.ac.nz>

Peter Cock wrote:
> On Tue, Jun 30, 2009 at 9:31 AM, Jose Blanca wrote:
>
>>> What I was thinking of was a contig class as an alignment subclass,
>>> holding a list of SeqRecord objects and offsets.
>> I thought about that implementation and I created some code. The
>> problem I found with that approach is that the contig class code got
>> too messy.
>>
>
> A simple masked sequence class would also be useful for Roche SFF
> files which hold sequencing reads (of about 500bp) with start and end
> trim points. This is a use case separate from the location offset in an
> alignment - so I'm not convinced it makes sense to do both in one
> class.
>
> Perhaps having the contig class hold a list of (masked) SeqRecord
> objects, their offset, and their direction would work?
>
>
That sounds like the most intuitive way for the class to work from a
user's perspective.

>>> One important thing I think we should do BEFORE adding any contig
>>> class to Biopython, is get it working with at least one other contig file
>>>
>>>
>> Well, In fact my contig class is modeled after the caf file format.
>> The ace parsing was just an afterthought, my primary interest
>> was the caf format.
>>
>
> Well, as the CAF file format was an extension of the ACE format,
> perhaps a third contig format would be worth looking at before
> considering if a contig class would be sufficiently general.
>
I came across this page somewhere in my travels; it gives a quick
description of a few contig file formats:

http://www.cbcb.umd.edu/research/contig_representation.shtml

At a glance I think all of them could be treated with a similar approach
to the one described above.

David

From bugzilla-daemon at portal.open-bio.org Wed Jul 1 10:12:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Jul 2009 10:12:38 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200907011412.n61ECcLO022490@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2865


------- Comment #2 from cymon.cox at gmail.com 2009-07-01 10:12 EST -------
Following the email from David Gordon, the Consed author, via Gordon Roberston
(thanks Gordon) on the dev list
(http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006322.html) I've
made some changes to the PhdWriter and parser:

The writer no longer uses default values for the header COMMENTS (here we
differ from bioperl).

Peak location letter annotations are now optional in both the parsing and
writing.

Additional unit tests have been added for the examples of 454 and Solexa data
that David Gordon included in his message.

Note also: Currently we ignore comments in Phd files, i.e. lines beginning with
"#". Nothing special is done with the version number which is appended to the
identifier on the BEGIN_SEQUENCE line in phd_ball files.

Attached is a patch against biopython on github and I've pushed changes to my
assembly branch.

Cheers, C.


--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
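
[Illustration of the comment handling described in comment #2 above. This is
only a minimal sketch, assuming that a comment is any line whose first
character is "#". The helper name skip_comment_lines and the sample lines are
hypothetical; this is not the code in Bio/SeqIO/PhdIO.py or Bio.Sequencing.Phd.]

```python
def skip_comment_lines(lines):
    """Yield only the lines that do not start with '#' (hypothetical helper)."""
    for line in lines:
        if line.startswith("#"):
            # Treated as a comment, so it never reaches the parser proper.
            continue
        yield line


# Made-up PHD-like input; peak locations are left out, since they are optional.
example = [
    "# a comment line, ignored\n",
    "BEGIN_SEQUENCE example_read\n",
    "BEGIN_DNA\n",
    "a 30\n",
    "# another comment\n",
    "c 28\n",
    "END_DNA\n",
    "END_SEQUENCE\n",
]
kept = list(skip_comment_lines(example))
print("".join(kept))
```

Filtering at the line level like this would keep the rest of a parser unaware
of comments, which matches the "ignore" behaviour described, though the real
parser may well do it differently.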
From bugzilla-daemon at portal.open-bio.org Wed Jul 1 10:13:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Jul 2009 10:13:37 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200907011413.n61EDbDd022582@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2865


cymon.cox at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1333 is|0                           |1
           obsolete|                            |


From bugzilla-daemon at portal.open-bio.org Wed Jul 1 10:14:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Jul 2009 10:14:10 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200907011414.n61EEAbv022636@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2865


------- Comment #3 from cymon.cox at gmail.com 2009-07-01 10:14 EST -------
Created an attachment (id=1335)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1335&action=view)
Another patch


From anaryin at gmail.com Wed Jul 1 10:27:47 2009
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 1 Jul 2009 16:27:47 +0200
Subject: [Biopython-dev] [Bug 2867] New: Bio.PDB.PDBList.update_pdb calls invalid os.cmd
Message-ID:

I would introduce this new (and recommended) library instead of that
command:

http://docs.python.org/library/shutil.html

But since this is the first bug I'm replying to... I'm asking you first.

Cheers!

João [ ..
] Rodrigues
(Blog) http://doeidoei.wordpress.com
(MSN) always_asleep_ at hotmail.com
(Skype) rodrigues.jglm

From bugzilla-daemon at portal.open-bio.org Wed Jul 1 10:39:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Jul 2009 10:39:05 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200907011439.n61Ed5Ks024881@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2865


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-01 10:39 EST -------
(In reply to comment #2)
> Following the email from David Gordon the Consed author via Gordon Roberston
> (thanks Gordon) on the dev list
> (http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006322.html) Ive
> made some changes to the PhdWriter and parser:

Yep - coping with missing peak values sounds like it is required now.

> The writer no longer uses default values for the header COMMENTS (here we
> differ from bioperl).

Do you just leave out the comments? That seems better to me.

> Peak location letter annotations are now optional in both the parsing and
> writing.

Good.

> Additional unittest have been added for the examples of 454 and Solexa data
> that David Gordon included in his message.

I'll have to look at those later...

> Note also: Currently we ignore comments in Phd files, ie those beginning with
> "#". Nothing special is done with the version number which is appended to the
> identifier on the BEGIN_SEQUENCE line in phd_ball files.
>
> Attached is a patch against biopython on github and Ive pushed changes to my
> assembly branch.

I've done another partial merge, still leaving out the writer code. I'm not
going to commit that until next week at the earliest (when I'll be back at
work) as I want to give it a good test first. I'm not sure if this will make
it into Biopython 1.51 final or not. I will however try and add the new
example files and test cases before that.
[Don't feel you have to redo the patch - I can continue to pull bits out of it]

As part of my commit I added a doctest to Bio/SeqIO/PhdIO.py, which has made
me wonder if for SeqIO we should convert the PHRED sequence to upper case
(just because it would look nicer for PHRED to FASTQ conversions).

Thanks again,

Peter


From bugzilla-daemon at portal.open-bio.org Wed Jul 1 10:43:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Jul 2009 10:43:39 -0400
Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd
In-Reply-To:
Message-ID: <200907011443.n61EhdPb025132@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2867


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-01 10:43 EST -------
As João Rodrigues noted on the mailing list, the python shutil library would
be a sensible (and cross platform) way to move/rename a file.

I'm a little surprised that os.cmd ever worked - maybe it was present in an
old version of python... I'd have to check.
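
[Illustration of the shutil suggestion for Bug 2867. This is only a sketch of
moving a file with the standard library instead of a shell command; it is not
the actual Bio.PDB.PDBList fix. The helper name move_file, the file name
pdb1abc.ent and the "obsolete" directory are all made up for the example.]

```python
import os
import shutil
import tempfile


def move_file(old_path, new_path):
    """Move or rename a file without shelling out (hypothetical helper)."""
    target_dir = os.path.dirname(new_path)
    if target_dir and not os.path.isdir(target_dir):
        # Create the destination directory first; shutil.move will not.
        os.makedirs(target_dir)
    shutil.move(old_path, new_path)


# Small demonstration inside a scratch directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "pdb1abc.ent")
with open(source, "w") as handle:
    handle.write("HEADER    EXAMPLE\n")

destination = os.path.join(workdir, "obsolete", "pdb1abc.ent")
move_file(source, destination)

# Record the outcome before cleaning up the scratch directory.
moved = (os.path.exists(source), os.path.exists(destination))
print(moved)
shutil.rmtree(workdir)
```

Because shutil.move is plain Python, the same call should behave the same on
Windows and Unix, which is the cross platform point made in comment #1.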
From bugzilla-daemon at portal.open-bio.org Thu Jul 2 05:57:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Jul 2009 05:57:19 -0400
Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO
In-Reply-To:
Message-ID: <200907020957.n629vJk6014895@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2865


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-02 05:57 EST -------
(In reply to comment #4)
> (In reply to comment #2)
> > Following the email from David Gordon the Consed author via
> > Gordon Roberston (thanks Gordon) on the dev list
> > (http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006322.html)

I've checked in those two examples and extended the parsing unit tests now.
This showed a small issue with PHD "file names" with a space in them, which
I have resolved following our convention for FASTA files. This means
converting PHD to FASTA/FASTQ/QUAL works nicely.

Peter


From eric.talevich at gmail.com Thu Jul 2 16:59:08 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 2 Jul 2009 16:59:08 -0400
Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython
Message-ID: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>

Hi all,

While everyone was away in Stockholm having a great time, I added some
user-oriented documentation for my project to the Biopython wiki:
http://www.biopython.org/wiki/PhyloXML

What do you think? Any missing information, unclear wording, or outright
lies?

I also updated the project plan with some ideas for filling up the rest of
July:
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML

The code is there, too.
The useful files to look at are Bio/PhyloXML/*.py and
Tests/test_PhyloXML.py, if anyone would like to take a look.

I would greatly appreciate any comments on any of this.

Thanks!
Eric

From biopython at maubp.freeserve.co.uk Sat Jul 4 10:14:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Jul 2009 15:14:03 +0100
Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython
In-Reply-To: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>
References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>
Message-ID: <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com>

On Thu, Jul 2, 2009 at 9:59 PM, Eric Talevich wrote:
> Hi all,
>
> While everyone was away in Stockholm having a great time, I added some
> user-oriented documentation for my project to the Biopython wiki:
> http://www.biopython.org/wiki/PhyloXML
>
> What do you think? Any missing information, unclear wording, or outright
> lies?

The __repr__ thing isn't Biopython specific, it's just what Python does. For
simple objects, eval(repr(obj)) should recreate the object. Consider:

>>> print phx.other
[Other(tag=alignment, namespace=http://example.org/align)]

That is odd to me. It looks like "other" is a list, containing an "Other"
object, but with a funny __repr__ - I would have expected it to look more
like this:

>>> print phx.other
[Other(tag="alignment", namespace="http://example.org/align")]

i.e. using the repr of what I have assumed are string arguments.
Peter

From eric.talevich at gmail.com Sat Jul 4 12:28:45 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 4 Jul 2009 12:28:45 -0400
Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython
In-Reply-To: <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com>
References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>
  <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com>
Message-ID: <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com>

On Sat, Jul 4, 2009 at 10:14 AM, Peter wrote:

>
> The __repr__ thing isn't Biopython specific, its just what Python does. For
> simple objects, eval(repr(obj)) should recreate the object. Consider:
>
> >>> print phx.other
> [Other(tag=alignment, namespace=http://example.org/align)]
>
> That is odd to me. It looks like "other" is a list, containing an "Other"
> object, but with a funny __repr__ - I would have expected it to look more
> like this:
>
> >>> print phx.other
> [Other(tag="alignment", namespace="http://example.org/align")]
>
> i.e. using the repr of what I have assumed are string arguments.
>
> Peter
>

Hi Peter,

Thanks! Your interpretation of the example is correct. I'll change __repr__
to check if the attribute is a string and, if so, escape and quote it.

In the docs, I wrote that the representation is Biopython-style because by
default, Python does something a little different for complex objects:

>>> class Foo(object): pass
>>> Foo()
<__main__.Foo object at 0xb7cff22c>

But I noticed that Seq and other Biopython objects give a nicer
representation that actually works as a constructor, so I tried to match
that.

Cheers,
Eric

(P.S. - Sorry if the original message seemed a little terse or weird. I
watched the BOSC slides and I do appreciate the effort you all put into the
conference.)
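
[Illustration of the __repr__ change discussed above. This is a minimal
sketch, assuming a simplified stand-in class: the name Other and the
attributes tag and namespace come from the example output in this thread,
while the value attribute and the class body are hypothetical and are not
the actual Bio.PhyloXML code. It passes each simple attribute through
repr(), so strings come out quoted and escaped.]

```python
class Other(object):
    """Hypothetical stand-in for a PhyloXML element, for illustration only."""

    def __init__(self, tag, namespace=None, value=None):
        self.tag = tag
        self.namespace = namespace
        self.value = value

    def __repr__(self):
        # Show only simple (primitive) attributes that are set, each one
        # formatted with repr() via %r so that strings are quoted.
        pairs = []
        for name in ("tag", "namespace", "value"):
            attr = getattr(self, name)
            if attr is not None and isinstance(attr, (str, int, float, bool)):
                pairs.append("%s=%r" % (name, attr))
        return "%s(%s)" % (self.__class__.__name__, ", ".join(pairs))


other = Other("alignment", namespace="http://example.org/align")
print([other])

# With quoted strings, the text can be evaluated to rebuild an equal object.
clone = eval(repr(other))
print(clone.tag, clone.namespace)
```

Python's own repr() of a string uses single quotes by default, so the printed
form will differ slightly from the double-quoted version suggested earlier in
the thread, but it round-trips through eval() in the same way.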
From biopython at maubp.freeserve.co.uk Sat Jul 4 12:39:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 17:39:13 +0100 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> Message-ID: <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> On Sat, Jul 4, 2009 at 5:28 PM, Eric Talevich wrote: > On Sat, Jul 4, 2009 at 10:14 AM, Peter wrote: > >> >> The __repr__ thing isn't Biopython specific, its just what Python does. For >> simple objects, eval(repr(obj)) should recreate the object. Consider: >> >> >>> print phx.other >> [Other(tag=alignment, namespace=http://example.org/align)] >> >> That is odd to me. It looks like "other" is a list, containing an "Other" >> object, but with a funny __repr__ - I would have expected it to look more >> like this: >> >> >>> print phx.other >> [Other(tag="alignment", namespace="http://example.org/align")] >> >> i.e. using the repr of what I have assumed are string arguments. >> >> Peter >> > > Hi Peter, > > Thanks! Your interpretation of the example is correct. I'll change __repr__ > to check if the attribute is a string and, if so, escape and quote it. > > In the docs, I wrote that the representation is Biopython-style because by > default, Python does something a little different for complex objects: > >>>> class Foo(object): pass >>>> Foo() > <__main__.Foo object at 0xb7cff22c> Yes, that is the Python default for a user defined object. > But I noticed that Seq and other Biopython objects give a nicer > representation that actually works as a constructor, so I tried > to match that. 
I'd have to think of some more examples, but other Python modules try to have eval(repr(obj)) work for their (simpler) objects. If you can do it without risking a really long string, this is a good idea. You'll notice the Seq object repr actually uses a truncated sequence for long sequences - you won't want to accidentally get the whole thing printed at the python prompt! Likewise doing repr() on a SeqRecord doesn't give you the full object. Peter From eric.talevich at gmail.com Sat Jul 4 13:24:12 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 4 Jul 2009 13:24:12 -0400 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> Message-ID: <3f6baf360907041024j1a3495a0k997733ad12ca7d39@mail.gmail.com> On Sat, Jul 4, 2009 at 12:39 PM, Peter wrote: > On Sat, Jul 4, 2009 at 5:28 PM, Eric Talevich > wrote: > > On Sat, Jul 4, 2009 at 10:14 AM, Peter >wrote: > > > >> > >> The __repr__ thing isn't Biopython specific, its just what Python does. > For > >> simple objects, eval(repr(obj)) should recreate the object. Consider: > >> > >> >>> print phx.other > >> [Other(tag=alignment, namespace=http://example.org/align)] > >> > >> That is odd to me. It looks like "other" is a list, containing an > "Other" > >> object, but with a funny __repr__ - I would have expected it to look > more > >> like this: > >> > >> >>> print phx.other > >> [Other(tag="alignment", namespace="http://example.org/align")] > >> > >> i.e. using the repr of what I have assumed are string arguments. > >> > >> Peter > >> > > > > Hi Peter, > > > > Thanks! Your interpretation of the example is correct. 
I'll change
> __repr__
> > to check if the attribute is a string and, if so, escape and quote it.
>

Correction: since it's filtering for primitive types already, I'll just call
repr() on each attribute. I changed the wiki page examples to show this, and
I'll fix the code on Monday.

>
> If you can do it without risking a really long string, this is a good
> idea. You'll notice the Seq object repr actually uses a truncated
> sequence for long sequences - you won't want to accidentally
> get the whole thing printed at the python prompt! Likewise
> doing repr() on a SeqRecord doesn't give you the full object.
>
> Peter
>

OK, I'll add another check for long strings and truncate them like Seq does.
This isn't in the wiki examples yet, though.

-Eric

From eric.talevich at gmail.com Sat Jul 4 15:32:32 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 4 Jul 2009 15:32:32 -0400
Subject: [Biopython-dev] Biopython link on python.org wiki
Message-ID: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com>

Hi,

Is anyone on this list active on the python.org wiki? I noticed that the
"Scientific and Numeric" page, which gets a link on the front page of
python.org, did not mention Biopython. In a fit of enthusiasm I added a link
to biopython.org at the bottom, incorporating the existing pycluster item.
Would someone else more familiar with the landscape of scientific Python
software like to review this and perhaps incorporate it more appropriately
into the page?

http://wiki.python.org/moin/NumericAndScientific

Thanks,
Eric

From chapmanb at 50mail.com Sat Jul 4 15:38:43 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sat, 4 Jul 2009 15:38:43 -0400
Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython
In-Reply-To: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>
References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com>
Message-ID: <20090704193843.GA1206@kunkel>

Hi Eric;
Great stuff as always.
You are rocking on this; I was digging through your code at the end of the
week and am really happy with what you've put together.

> user-oriented documentation for my project to the Biopython wiki:
> http://www.biopython.org/wiki/PhyloXML
>
> What do you think? Any missing information, unclear wording, or outright
> lies?

What you have looks very good. A couple of thoughts on other things
that would be useful:

- In the usage section where you introduce clades, it might help to
  have a high-level diagram of a simple tree and the corresponding
  PhyloXML representation in terms of phylogeny and the clade
  parent/child relationship. Understanding this representation is
  important for newcomers and might ease them into using the classes.

- The examples in 'Using PhyloXML objects' are very good and to the
  extent you have time to expand this, more of these would be very
  useful. These real life type examples are the best way to help users
  discover the features of PhyloXML. Based on Christian's highlighted
  features on the PhyloXML page, a little brainstorming on some things
  to tackle:

  - Providing annotation data on a node of the tree.
  - Adding orthology relationships to the tree; generally providing
    high level node data.

  These would expose more of the extensive markup elements built into
  PhyloXML and help users discover them.

> I also updated the project plan with some ideas for filling up the rest of
> July:
> http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML

I really like the idea of exploring interoperability with other
Biopython tree representations and generalizing there. In addition to
the Tree class in Bio.Nexus, the PyCogent tree representation looks
generalized:

http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/core/tree.py?view=markup

Combining this with the PhyloXML examples above, maybe it would be
worthwhile to think through and document a more complicated pipeline.
Something like starting with a protein, identifying homologs, building
a tree, adding annotation data, and outputting to PhyloXML. This would
be a great starting place for working out how to interoperate, and also
give users a jumping off point for providing more phylogenies in PhyloXML.

Similarly, a PhyloXML to networkx (or other) display would also give
a nice interoperable use case for others to build off of.

Thanks for all your hard work on this,
Brad

From chapmanb at 50mail.com Sat Jul 4 16:11:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sat, 4 Jul 2009 16:11:00 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update
In-Reply-To: <4A4D052D.7010708@berkeley.edu>
References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com>
  <20090617120257.GG44321@sobchak.mgh.harvard.edu>
  <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu>
  <4A4D052D.7010708@berkeley.edu>
Message-ID: <20090704201059.GB29677@kunkel>

Hi Nick;
Thanks much for the update. I'm cc'ing in the Biopython dev list to
keep everyone there in the loop as well.

> I have worked out a number of better functions for searching xml
> database results, i.e. finding all elements with tags y that exist
> somewhere inside elements with tags x. This is much more flexible in
> the event that data of interest resides at different levels of a
> hierarchy, which I have found in some cases.

Awesome. Echoing what Hilmar mentioned, it would be good to step back
at this point and talk about integration with Biopython. A couple of
thoughts and suggestions along those lines:

- You've included code from Lagrange which worries me for two
  reasons. First, this overlaps with existing Biopython
  functionality in Bio.Nexus; we want to eliminate that as it's
  confusing for users of the package to find different
  non-compatible implementations. If the existing code doesn't work
  for you in some way, could you flesh out those issues on the
  Biopython dev list so we can work to resolve them?
  Secondly, lagrange is licensed under the GPL so practically it is not
  compatible with Biopython, which is licensed much more freely.

- You've settled on a flat system of coding with functions and no
  nesting inside of classes. This makes it difficult to distinguish
  the public API from internal functions. We could help make this
  more clear in a couple of ways:

  - Organizing related functionality into classes.
  - Prefixing internal functions with underscores to indicate they
    are not meant to be called by users.

- Starting to provide some user documentation, ideally centered
  around use cases. Often these help provide a way to think about
  the usability of the code and hint at ways to improve it.

Hope this is helpful and I'm happy to offer more specific
suggestions as you dig into it. Have a great 4th of July weekend,
Brad

> Stephen Smith wrote:
> > These look really great. Glad the lagrange tree code is working out. I
> > am very excited for the merging of the Biopython and the lagrange tree
> > classes. More details to come.
> > Stephen
> > ==================
> > Stephen A. Smith
> > Postdoctoral Researcher
> > NESCent: National Evolutionary Synthesis Center
> > page: http://blackrim.org
> > blog: http://blackrim.net/semaphoront
> > sasmith at nescent.org
> >
> >
> >
> > On Jun 24, 2009, at 12:47 AM, Nick Matzke wrote:
> >
> >> OK, here's the latest...
> >>
> >> New functions: a bunch of stuff dealing with phylogenetic trees, making
> >> use of the tree/node class in Stephen Smith's lagrange (GNU public
> >> license), which was superior to the half-baked (and not GPL) tree/node
> >> class I was using before GSoC started.
> >>
> >> =============
> >> read_ultrametric_Newick(newickstr):
> >> Read a Newick file into a tree object (a series of node objects links to
> >> parent and daughter nodes), also reading node ages and node labels if
> >> any.
> >> > >> list_leaves(phylo_obj): > >> Print out all of the leaves in above a node object > >> > >> treelength(node): > >> Gets the total branchlength above a given node by recursively adding > >> through tree. > >> > >> phylodistance(node1, node2): > >> Get the phylogenetic distance (branch length) between two nodes. > >> > >> get_distance_matrix(phylo_obj): > >> Get a matrix of all of the pairwise distances between the tips of a tree. > >> > >> get_mrca_array(phylo_obj): > >> Get a square list of lists (array) listing the mrca of each pair of > >> leaves (half-diagonal matrix) > >> > >> subset_tree(phylo_obj, list_to_keep): > >> Given a list of tips and a tree, remove all other tips and resulting > >> redundant nodes to produce a new smaller tree. > >> > >> prune_single_desc_nodes(node): > >> Follow a tree from the bottom up, pruning any nodes with only one > >> descendent > >> > >> find_new_root(node): > >> Search up tree from root and make new root at first divergence > >> > >> make_None_list_array(xdim, ydim): > >> Make a list of lists ("array") with the specified dimensions > >> > >> get_PD_to_mrca(node, mrca, PD): > >> Add up the phylogenetic distance from a node to the specified ancestor > >> (mrca). Find mrca with find_1st_match. > >> > >> find_1st_match(list1, list2): > >> Find the first match in two ordered lists. > >> > >> get_ancestors_list(node, anc_list): > >> Get the list of ancestors of a given node > >> > >> addup_PD(node, PD): > >> Adds the branchlength of the current node to the total PD measure. > >> > >> print_tree_outline_format(phylo_obj): > >> Prints the tree out in "outline" format (daughter clades are indented, > >> etc.) > >> > >> print_Node(node, rank): > >> Prints the node in question, and recursively all daughter nodes, > >> maintaining rank as it goes. > >> > >> lagrange_disclaimer(): > >> Just prints lagrange citation etc. in code using lagrange libraries. 
> >> ============= > >> > >> > >> > >> What's next: > >> > >> I'm going to spend the rest of this week following up on Brad's > >> suggestions to make the code more standard, with the priority of > >> figuring out how I can revise the current BioPython phylogeny class, to > >> resemble the better version in lagrange, so that there is a generic > >> flexible phylogeny/newick parser that can be used generally as well as > >> by my BioGeography package specifically. > >> > >> updated wiki/git: > >> http://biopython.org/wiki/BioGeography#June.2C_week_3:_Functions_to_read_user-specified_Newick_files_.28with_ages_and_internal_node_labels.29_and_generate_basic_summary_information. > >> > >> http://github.com/nmatzke/biopython/commits/Geography > >> > >> Cheers! > >> Nick > >> > >> > >> > >> > >> > >> Nick Matzke wrote: > >>> Sorry my update is slow, it is coming in a bit! Thanks, Nick > >>> > >>> Brad Chapman wrote: > >>>> Nick; > >>>> Thanks for the update -- hope y'all are having fun at the Evolution > >>>> meeting and have managed to meet up. > >>>> > >>>>> Basically this week I added functions to download & parse large > >>>>> numbers of records, get TaxonOccurrence gbifKeys, and search with > >>>>> those keys. Main functions: > >>>> > >>>> Good stuff. My main comment echoes a couple of things we discussed > >>>> earlier: > >>>> > >>>> - It is not clear to a user which functions are API functions to > >>>> call and which are used internally. Prefixing the internal > >>>> functions with underscores (_) and organizing these into classes > >>>> will help with this. > >>>> > >>>> - I still noticed some tempfile writing from what we discussed last > >>>> week. If you have problems using in memory file handles let us > >>>> know and we can discuss more. > >>>> > >>>> In general if your coding style is to get it out there and then > >>>> re-factor, that is cool. 
But please put some time into the > >>>> schedule for this so I know not to bug you before you've actually > >>>> had a chance to go through things a second time. Also, it's a good > >>>> idea to do this in segments as we go along. From experience, if you > >>>> build up too much code that needs rework it becomes more mentally > >>>> difficult to get into the rewriting. > >>>> > >>>>> An issue: > >>>>> > >>>>> Next week come functions to process phylogenetic trees. I have had > >>>>> issues with the current BioPython newick parser etc.; basically what > >>>>> exists appears to not accept node label information which is required > >>>>> to store e.g. branchlengths which are crucial for the sorts of things > >>>>> I have to do in the future. So unless there is a better suggestion I > >>>>> plan to upload modify & upload my own tree parsing/using functions. I > >>>>> am open to suggestions in this matter. > >>>> > >>>> We do not want to introduce duplicated code for Newick tree parsing in > >>>> Biopython. This is a good opportunity to engage the development list > >>>> to help figure out how to fix the current parser to do what you > >>>> need. If you are not sure how to get started, the best way is to get > >>>> together a small test file that demonstrates your problems, and post > >>>> it to the list. It would be more useful to everyone to have your > >>>> fixes in the main parser. > >>>> > >>>> Brad > >>>> > >>> > >> > >> -- > >> ==================================================== > >> Nicholas J. Matzke > >> Ph.D. Candidate, Graduate Student Researcher > >> Huelsenbeck Lab > >> Center for Theoretical Evolutionary Genomics > >> 4151 VLSB (Valley Life Sciences Building) > >> Department of Integrative Biology > >> University of California, Berkeley > >> > >> Lab websites: > >> http://ib.berkeley.edu/people/lab_detail.php?lab=54 > >> http://fisher.berkeley.edu/cteg/hlab.html > >> Dept. 
personal page: > >> http://ib.berkeley.edu/people/students/person_detail.php?person=370 > >> Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > >> Lab phone: 510-643-6299 > >> Dept. fax: 510-643-6264 > >> Cell phone: 510-301-0179 > >> Email: matzke at berkeley.edu > >> > >> Mailing address: > >> Department of Integrative Biology > >> 3060 VLSB #3140 > >> Berkeley, CA 94720-3140 > >> > >> ----------------------------------------------------- > >> "[W]hen people thought the earth was flat, they were wrong. When people > >> thought the earth was spherical, they were wrong. But if you think that > >> thinking the earth is spherical is just as wrong as thinking the earth > >> is flat, then your view is wronger than both of them put together." > >> > >> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > >> 14(1), 35-44. Fall 1989. > >> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > >> ==================================================== > >> _______________________________________________ > >> Wg-phyloinformatics mailing list > >> Wg-phyloinformatics at nescent.org > >> https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > > > > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. 
fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When people > thought the earth was spherical, they were wrong. But if you think that > thinking the earth is spherical is just as wrong as thinking the earth > is flat, then your view is wronger than both of them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics From biopython at maubp.freeserve.co.uk Sun Jul 5 04:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Jul 2009 09:48:04 +0100 Subject: [Biopython-dev] Biopython link on python.org wiki In-Reply-To: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com> References: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com> Message-ID: <320fb6e00907050148h55e38152xd31d752515746e7e@mail.gmail.com> On Sat, Jul 4, 2009 at 8:32 PM, Eric Talevich wrote: > Hi, > > Is anyone on this list active on the python.org wiki? I noticed that the > "Scientific and Numeric" page, which gets a link on the front page of > python.org, did not mention Biopython. In a fit of enthusiasm I add a link > to biopython.org at the bottom, incorporating the existing pycluster item. > Would someone else more familiar with landscape of scientific Python > software like to review this and perhaps incorporate it more appropriately > into the page? > > http://wiki.python.org/moin/NumericAndScientific > > Thanks, > Eric Good idea - thanks. 
Peter From biopython at maubp.freeserve.co.uk Sun Jul 5 04:52:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Jul 2009 09:52:38 +0100 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <20090704193843.GA1206@kunkel> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <20090704193843.GA1206@kunkel> Message-ID: <320fb6e00907050152o470ca5e3ja451c3ebc52f7c83@mail.gmail.com> On Sat, Jul 4, 2009 at 8:38 PM, Brad Chapman wrote: > I really like the idea of exploring interoperability with other > Biopython tree representations and generalizing there. In addition to > the Tree class in Bio.Nexus, the PyCogent tree representation looks > generalized: > > http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/core/tree.py?view=markup There is also Thomas Mailund's Newick Tree module, which provides yet another perspective on trees, and various things you can do with them (his visitor stuff is cool once you figure it out). If you haven't looked at this, it might be worth a play as well for ideas. I've actually used this more than Bio.Nexus as it predates it ;) http://www.daimi.au.dk/~mailund/newick.html Peter From bugzilla-daemon at portal.open-bio.org Sun Jul 5 06:20:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 5 Jul 2009 06:20:58 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907051020.n65AKwn4020321@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2870 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Jul 5 07:10:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 5 Jul 2009 07:10:02 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907051110.n65BA2qv021842@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement OS/Version|FreeBSD |All ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-05 07:10 EST ------- Three points... Which existing schema did you start from to generate this one, and how did you do it? This may be interesting for Hilmar if there are any subtle differences between the existing schemas. --------------------------------------------------------------- This new line in BioSeq.py isn't valid on Python 2.4: val = [(str(x) if isinstance(x, unicode) else x) for x in val] See http://www.python.org/dev/peps/pep-0308/ As a quick hack, I used: val = [_make_unicode_into_string(x) for x in val] where I had defined: def _make_unicode_into_string(text): if isinstance(text, unicode): return str(text) else: return text Not very elegant, but with that the BioSQL tests pass on my old desktop using Python 2.4 and MySQL. This machine doesn't have the SQLite bindings installed. --------------------------------------------------------------- In the long term, Tests/setup_BioSQL.py could automatically try to use SQLite (if available) when the user hasn't overridden it with their own local settings. Peter P.S. I filed BioSQL enhancement Bug 2870 for adding an SQLite schema to BioSQL itself. And I marked this bug as an enhancement too. 
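The "try SQLite if available" idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: choose_test_driver and its user_driver argument are hypothetical names and are not taken from Tests/setup_BioSQL.py. The only fact relied on is that the sqlite3 module ships with the standard library from Python 2.5 onwards, so a failed import is a reasonable signal that the bindings are missing.

```python
def choose_test_driver(user_driver=None):
    """Pick a database driver name for the test suite (hypothetical helper).

    The user's own local setting always wins; otherwise fall back to
    SQLite when its bindings can be imported, and to None when they cannot.
    """
    if user_driver:
        # The user has overridden the default, so respect that choice.
        return user_driver
    try:
        import sqlite3  # noqa: F401  (only checking that the import works)
    except ImportError:
        # No SQLite bindings on this machine (e.g. an old Python 2.4 setup).
        return None
    return "sqlite3"
```

A caller would then skip the BioSQL tests when the result is None, rather than failing.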
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Jul 6 11:06:47 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 6 Jul 2009 11:06:47 -0400 Subject: [Biopython-dev] GSoC Weekly Update 7: PhyloXML for Biopython Message-ID: <3f6baf360907060806o5cbc3e4ew8bd614b0a5f811c2@mail.gmail.com> Hi all, Previously (June 29--July 3) I: - Wrote serialization methods for each class, matching Parser - Also profiled the writer - Caught up on documentation -- http://www.biopython.org/wiki/PhyloXML This week (July 6--10) I will: - Address comments from last week's code/doc review - Enable Pythonic syntax sugar (__getitem__, __contains__, override __str__) - Unit tests for new code - Identify more Biopython objects to reuse or export to (improve the SeqRecord conversion) - Look specifically at interoperating with Nexus, Newick trees - Fill out the midterm evaluation Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jul 6 15:02:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 20:02:56 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? Message-ID: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> Hi all, There were many things I discussed with Biopython folks at BOSC 2009, and one of these was a conversation with Brad about some of Bio.Application - specifically the idea behind the ApplicationResult object. We basically agreed this was superfluous and could be deprecated. The only thing I've found useful in this object is the return code (an integer) when using Bio.Application.generic_run (which in itself seems a bit superfluous). 
Now, declaring ApplicationResult obsolete for Biopython 1.51 (with a deprecation in the following release) is fine except for the fact that this object gets used in the function generic_run. So we'd have to obsolete that too. [If anyone can see any other side effects of deprecating Bio.Application.ApplicationResult please speak up] Right now, generic_run waits for the sub-process to finish, and returns a tuple of: * An ApplicationResult object holding the return code (and a few other things which can also be found from the command line string object, like the expected output filenames). * Standard output as a StringIO handle (could be memory hungry!) * Standard error as a StringIO handle (could be memory hungry!) Personally when running a sub-process I have either wanted the stdout (and stderr) handles, OR the return code (and I don't care about stdout and stderr). I can't think of a situation off hand where I needed both. So for me, the Bio.Application.generic_run function isn't very helpful. In Python, there are several ways to run a tool, starting with something very simple like os.system(...) which will run and block until the task has finished, returning the return code (with some provisos on Windows). Next, there was a whole set of popen*() functions which generally returned handles. These are now all obsolete with Python 2.6, and subprocess should be used instead. If we want to deprecate Bio.Application.generic_run (in order to deprecate Bio.Application.ApplicationResult), then do we need a replacement? Or replacements? Possible helper functions that come to mind are: (a) Returns the return code (integer) only. This would basically be a cross-platform version of os.system using the subprocess module internally. (b) Returns the return code (integer) plus the stdout and stderr (which would have to be StringIO handles, with the data in memory). This would be a direct replacement for the current Bio.Application.generic_run function. 
(c) Returns the stdout (and stderr) handles. This basically is recreating a deprecated Python popen*() function, which seems silly. However, I'm tempted to say Biopython shouldn't be duplicating basic Python functionality, like wrapping the subprocess module in helper functions for typical situations. Instead we should just document using the current recommended Python best practice (which I believe to be to use the subprocess module). The downside is that using subprocess is a bit tricky for novices. Any thoughts? Peter From bartek at rezolwenta.eu.org Mon Jul 6 15:35:53 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 6 Jul 2009 21:35:53 +0200 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> Message-ID: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> On Mon, Jul 6, 2009 at 9:02 PM, Peter wrote: > Hi all, > Hi, > this object gets used in the function generic_run. So we'd have to > obsolete that too. [If anyone can see any other side effects of > deprecating Bio.Application.ApplicationResult please speak up] I'm fine with deprecating ApplicationReslut. > Personally when running a sub-process I have either wanted the stdout > (and stderr) handles, OR the return code (and I don't care about > stdout and stderr). I can't think of a situation off hand where I > needed both. So for me, the Bio.Application.generic_run function isn't > very helpful. > Well, I don't have too much experience with writing application wrappers, but I can easily think of a scenario where I first check whether the program returned the "right" error code and then if it's fine I would process the stdout. > If we want to deprecate Bio.Application.generic_run (in order to > deprecate Bio.Application.ApplicationResult), then do we need a > replacement? Or replacements? 
> > (b) Returns the return code (integer) plus the stdout and stderr > (which would have to be StringIO handles, with the data in memory). > This would be a direct replacement for the current > Bio.Application.generic_run function. That sounds like a good replacement. > However, I'm tempted to say Biopython shouldn't be duplicating basic > Python functionality, like wrapping the subprocess module in helper > functions for typical situations. Instead we should just document > using the current recommended Python best practice (which I believe to > be to use the subprocess module). The downside is that using subprocess > is a bit tricky for novices. > I don't have strong feelings about that, but my personal experience is that it helps to have some infrastructure (even if it provides a somewhat superfluous API layer over the bare Python libs), especially for people who may have limited experience with different platforms. I, for one, would find it useful if Biopython provided simple classes which allowed people to write cross-platform wrappers for command line tools. cheers Bartek From biopython at maubp.freeserve.co.uk Mon Jul 6 17:06:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 22:06:54 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> Message-ID: <320fb6e00907061406o2c5907e8q3c30676897728167@mail.gmail.com> On Mon, Jul 6, 2009 at 8:35 PM, Bartek Wilczynski wrote: > > I'm fine with deprecating ApplicationReslut. > >> Personally when running a sub-process I have either wanted the stdout >> (and stderr) handles, OR the return code (and I don't care about >> stdout and stderr). I can't think of a situation off hand where I >> needed both. 
So for me, the Bio.Application.generic_run function isn't >> very helpful. > > Well, I don't have too much experience with writing application wrappers, > but I can easily think of a scenario where I first check whether the program > returned the "right" error code and then if it's fine I would process > the stdout. True - but in practice I usually find it more productive to switch to the command line prompt and explore the failure there (rather than trying to diagnose things from within Python). I would be content for the script to tell me a command line failed with an error return code (and give me the command line string and the return code). >> If we want to deprecate Bio.Application.generic_run (in order to >> deprecate Bio.Application.ApplicationResult), then do we need a >> replacement? Or replacements? >> >> (b) Returns the return code (integer) plus the stdout and stderr >> (which would have to be StringIO handles, with the data in memory). >> This would be a direct replacement for the current >> Bio.Application.generic_run function. > > That sounds like a good replacement. Of the three examples I put forward, (b) certainly seemed most useful. Any other ideas? >> However, I'm tempted to say Biopython shouldn't be duplicating basic >> Python functionality, like wrapping the subprocess module in helper >> functions for typical situations. Instead we should just document >> using the current recommended Python best practice (which I believe to >> be to use the subprocess module). The downside is that using subprocess >> is a bit tricky for novices. >> > > I don't have strong feelings about that, but my personal experience is > that it helps to have some infrastructure (even if it provides a > somewhat superfluous API layer over the bare Python libs), especially > for people who may have limited experience with different platforms. 
> > I, for one, would find it useful if Biopython provided simple > classes which allowed people to write cross-platform wrappers > for command line tools. Do you feel option (b) above would fit those criteria? Peter From tiagoantao at gmail.com Mon Jul 6 17:34:01 2009 From: tiagoantao at gmail.com (Tiago Antão) Date: Mon, 6 Jul 2009 22:34:01 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> Message-ID: <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> On Mon, Jul 6, 2009 at 8:02 PM, Peter wrote: > Any thoughts? I am using generic_run (and the Bio.Application framework) for the new genepop code. But it would be trivial to change. The only thing that I need is the return code (not even stdout). The only thing that I need is to be informed of the new "best practice" that replaces generic_run and I will act accordingly. If you are interested, my use case is at: http://github.com/tiagoantao/biopython/blob/e1720bd4419ae5cf60ae5e1c7ec72828c6f6e6fe/Bio/PopGen/GenePop/Controller.py (_run_genepop and class _GenePopCommandline) Regards From biopython at maubp.freeserve.co.uk Mon Jul 6 17:51:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 22:51:37 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> Message-ID: <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> 2009/7/6 Tiago Antão : > > On Mon, Jul 6, 2009 at 8:02 PM, Peter wrote: >> Any thoughts? > > I am using generic_run (and the Bio.Application framework) for the new > genepop code. But it would be trivial to change. 
> > The only thing that I need is the return code (not even stdout). > > The only thing that I need is to be informed of the new "best > practice" that replaces generic_run and I will act accordingly. You wouldn't have to rush anything - I was only thinking of declaring it obsolete for 1.51 (with any replacement in place). The point of this discussion is to agree the "best practice". It sounds like this will be telling people to use subprocess for full control, but we may continue to provide one or two helper functions for very common use cases. Peter From chapmanb at 50mail.com Mon Jul 6 18:04:53 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 6 Jul 2009 18:04:53 -0400 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> Message-ID: <20090706220453.GI17086@sobchak.mgh.harvard.edu> Hi all; > > this object gets used in the function generic_run. So we'd have to > > obsolete that too. [If anyone can see any other side effects of > > deprecating Bio.Application.ApplicationResult please speak up] > > I'm fine with deprecating ApplicationReslut. Bartek, you just won the typo of the month contest hands down. > > If we want to deprecate Bio.Application.generic_run (in order to > > deprecate Bio.Application.ApplicationResult), then do we need a > > replacement? Or replacements? [...] > > However, I'm tempted to say Biopython shouldn't be duplicating basic > > Python functionality, like wrapping the subprocess module in helper > > functions for typical situations. Instead we should just document > > using the current recommended Python best practice (which I believe to > > be to use the subprocess module). The downside is that using subprocess > > is a bit tricky for novices. 
My vote is to document using subprocess and avoid creating our own wrapper. No one has to learn a Biopython specific API for running programs, and subprocess provides plenty of flexibility to get stdout, stderr and return codes. For places where we feel like using subprocess is tricky, additional documentation within Biopython should help those encountering it for the first time. This gives us more time to work on biology problems, and leaves the running programs problems up to the greater Python community. Brad From bugzilla-daemon at portal.open-bio.org Mon Jul 6 18:55:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 6 Jul 2009 18:55:52 -0400 Subject: [Biopython-dev] [Bug 2872] New: Genbank parser breaks on VectorNTI generated genbank file Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2872 Summary: Genbank parser breaks on VectorNTI generated genbank file Product: Biopython Version: 1.51b Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tsham at lbl.gov The GenBank parser dies while parsing VectorNTI generated genbank files. VectorNTI *sometimes* generates a file with no date string at position 65, which causes this. It is true that this is a non-standard genbank file, but since VectorNTI is a commonly used program, it would be nice for BioPython to handle this case. 
Sample session: >>> import Bio >>> Bio.__version__ '1.51b' >>> fh = open("pBbA1a-RFP.gb") >>> from Bio.GenBank import RecordParser >>> rp = RecordParser() >>> result = rp.parse(fh) Traceback (most recent call last): File "", line 1, in File "Bio/GenBank/__init__.py", line 172, in parse self._scanner.feed(handle, self._consumer) File "Bio/GenBank/Scanner.py", line 370, in feed self._feed_first_line(consumer, self.line) File "Bio/GenBank/Scanner.py", line 820, in _feed_first_line 'LOCUS line does not contain - at position 65 in date:\n' + line AssertionError: LOCUS line does not contain - at position 65 in date: LOCUS pBbA1a-RFP 4252 bp DNA circular >>> -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 6 18:56:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 6 Jul 2009 18:56:56 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907062256.n66MuuBH002457@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #1 from tsham at lbl.gov 2009-07-06 18:56 EST ------- Created an attachment (id=1338) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1338&action=view) test case file, vectorNTI generated genbank file Here is a sample file that breaks the parser. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
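The failure reported above comes from the parser insisting on a date at the end of the LOCUS line. One way a parser might tolerate such files is sketched below, purely as an illustration: locus_date_or_none is a hypothetical helper and is not how Bio/GenBank/Scanner.py is written. It assumes only that a standard GenBank date has the form DD-MMM-YYYY (for example 21-JUN-1999) and sits at the end of the line.

```python
import re
import warnings

# A standard GenBank date such as 21-JUN-1999, anchored at the end of the line.
_LOCUS_DATE = re.compile(r"(\d{2}-[A-Z]{3}-\d{4})$")


def locus_date_or_none(line):
    """Return the trailing date of a LOCUS line, or None if it is missing.

    Hypothetical helper: instead of raising an AssertionError on a
    non-standard line, it issues a warning and lets parsing continue.
    """
    match = _LOCUS_DATE.search(line.rstrip())
    if match is None:
        warnings.warn("LOCUS line has no date: %r" % line)
        return None
    return match.group(1)
```

With the line from the report, "LOCUS pBbA1a-RFP 4252 bp DNA circular", this would warn and return None rather than abort the whole parse.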
From eric.talevich at gmail.com Mon Jul 6 19:34:30 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 6 Jul 2009 19:34:30 -0400 Subject: [Biopython-dev] PhyloXML helper functions Message-ID: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> Hey all, I've been mulling a couple of methods for PhyloXML objects that I thought could deserve some discussion. 1. Singular properties for some plural attributes This goes back to the "confidences" issue: When I'm drilling down through a phyloXML-derived tree, I keep expecting certain attributes to be singular values when they're actually plural. Auto-completion catches it, of course, but the resulting code would seem more obvious if I used the singular name when I know the attribute consists of a list of one element. The attributes I had in mind for this are taxonomies (Clade class) and confidences (Clade and Phylogeny classes). Should any other attributes get this treatment? Here's an example getter method -- Rubyists may ignore the first line: @property def confidence(self): if len(self.confidences) > 1: raise RuntimeError, "More than one confidence item is available! Use foo.confidences" elif len(self.confidences) == 0: raise RuntimeError, "No confidence item is available! You fail" else: return self.confidences[0] Then this works as expected, similar to the way certain IO read() functions work elsewhere in Biopython. 2. A find() method on Clade and maybe Phylogeny objects The function definition and docstring would look like this: def find(cls, **kwargs): """Find all sub-nodes matching the given attributes. The first argument specifies the class of the sub-node. (Use Tree.PhyloElement to match any standard phyloXML type.) The arbitrary keyword arguments indicate the attribute name of the sub-node and the value to match. The result is an iterable through all matching objects. 
Example: >>> tree = PhyloXML.read('phyloxml_examples.xml').phylogenies[5] >>> matches = tree.clade.find(Taxonomy, code='OCTVU') >>> matches.next() Taxonomy(code='OCTVU', scientific_name='Octopus vulgaris') """ Enhancements: - The keyword argument could be a regular expression. Would that be useful? To handle numbers, I'd have to convert every sub-node attribute value to a string, and that would be weird -- or else find() would have to skip numerical attributes. - Non-keyword arguments (*args) could specify just the not-None existence of an attribute. Allowing regexes would make this unnecessary (e.g. name='.*') - If no regular arguments are needed, cls could default to PhyloElement or even "object" to match everything. - To enable arbitrary hairiness, this function could accept a function as the value of the keyword argument and return anything truthy. But at that point, the user could probably just roll their own find_node() function. However, it could still be useful to filter for numerical values. What do you think? Thanks, Eric From tiagoantao at gmail.com Tue Jul 7 03:55:58 2009 From: tiagoantao at gmail.com (Tiago Antão) Date: Tue, 7 Jul 2009 08:55:58 +0100 Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen In-Reply-To: <4A52985C.3000603@student.otago.ac.nz> References: <4A52985C.3000603@student.otago.ac.nz> Message-ID: <6d941f120907070055m2d34fcb1qe8b29e40d8d67880@mail.gmail.com> Hi David, [I am Ccing the biopython-dev mailing list, so that other biopython dev people can chip in] 2009/7/7 David WInter : > Is there any plan to support arlequin in Bio.Popgen? The script that I have Bio.PopGen currently supports Simcoal, so it should already support Arlequin (as Simcoal outputs arlequin). Unfortunately I never got round to making an Arlequin parser (which makes full sense, for a lot of reasons). > to have a go at getting it to work in that framework. That would be more than welcome. 
I personally have an interest in getting it up and running. Arlequin format support is an important thing. If you have little time, I can offer to help. If you prefer to go ahead alone you are also more than welcome to do it. Just don't make the same mistake that I made with the genepop parser, where I load the whole file into memory. I have discovered that there are a lot of people that have thousands of markers and thousands of individuals (loading such a file into memory is in some cases impossible). Using an iterator might be a solution. One might try to go to the Arlequin developers and ask for a specification of the format (as far as I know there is no publicly available specification). Code in Biopython has to have documentation and unit tests (a boring thing, but necessary). In this case, I would not mind doing that myself (in case you are uninterested) as I think Arlequin support is really a cool thing. I will sort out the git links, thanks for the info. BTW if you are doing any kind of frequency based statistics, we are adding support for genepop statistics (mainly a python wrapper to the application). You can now get things like Fst, Fis and the like from inside python. Feel free to write back with any comments you might have. Tiago From bartek at rezolwenta.eu.org Tue Jul 7 04:20:49 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 7 Jul 2009 10:20:49 +0200 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <20090706220453.GI17086@sobchak.mgh.harvard.edu> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> On Tue, Jul 7, 2009 at 12:04 AM, Brad Chapman wrote: >> I'm fine with deprecating ApplicationReslut. > > Bartek, you just won the typo of the month contest hands down. 
> Well, It's always motivating to see that people actually read your posts carefully ;) > My vote is to document using subprocess and avoid creating our own > wrapper. No one has to learn a Biopython specific API for running > programs, and subprocess provides plenty of flexibility to get stdout, > stderr and return codes. For places where we feel like using subprocess > is tricky, additional documentation within Biopython should help those > encountering it for the first time. This gives us more time to work > on biology problems, and leaves the running programs problems up to > the greater Python community. well, having such a documentation would be a great thing. I've just gone through the docs for subprocess module and it seems to be the layer unifying all those crazy different ways of spawning processes. It's a shame I somehow missed that it's there since python 2.4... So now, after doing my homework and checking what has been going on in python since 2004, I think that Brad's idea is better. We have dropped support for 2.3, so we can try to move from Application.generic_run to subprocess.Popen instead of trying to provide our own wrapper. We just need good docs. cheers Bartek From tiagoantao at gmail.com Tue Jul 7 04:33:49 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 7 Jul 2009 09:33:49 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> Message-ID: <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com> 2009/7/6 Peter : > The point of this discussion is to agree the "best practice". 
It > sounds like this will be telling people to use subprocess for full > control, but we may continue to provide one or two helper > functions for very common use cases. I tried to use Bio.Application and there was one part (maybe I am using it wrongly) that was kind of awkward: parameters (I've added my code below). The need to declare them explicitly plus the fact that in some cases parameters are always compulsory and really not parameters (granted a strange use case, but I have a fixed parameter for genepop, namely saying that the run is machine-controlled, batch mode). At the end of the day, I end up with a lot of boilerplate code (like below). _Argument(["command"], ["INTEGER(.INTEGER)*"], None, True, "GenePop option to be called"), _Argument(["mode"], ["Dont touch this"], None, True, "Should allways be batch"), _Argument(["input"], ["input"], None, True, "Input file"), _Argument(["Dememorization"], ["input"], None, False, "Dememorization step"), _Argument(["BatchNumber"], ["input"], None, False, "Number of MCMC batches"), _Argument(["BatchLength"], ["input"], None, False, "Length of MCMC chains"), _Argument(["HWtests"], ["input"], None, False, "Enumeration or MCMC"), -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From biopython at maubp.freeserve.co.uk Tue Jul 7 05:19:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 10:19:34 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com> Message-ID: <320fb6e00907070219y15b019b8x30607e33137edf2e@mail.gmail.com> 2009/7/7 Tiago Antão : > 2009/7/6 Peter : >> The point of this discussion is to agree the "best practice". It >> sounds like this will be telling people to use subprocess for full >> control, but we may continue to provide one or two helper >> functions for very common use cases. > > I tried to use Bio.Application and there was one part (maybe I am > using it wrongly) that was kind of awkward: parameters (I've > added my code below). > The need to declare them explicitly plus the fact that in some cases > parameters are always compulsory and really not parameters > (granted a strange use case, but I have a fixed parameter for > genepop, namely saying that the run is machine-controlled, batch > mode). If you have a fixed parameter, like "-mode batch" which must be present, it doesn't make sense to expose the mode setting to the user. Maybe you could do this by overriding the __str__ method in a subclass? > At the end of the day, I end up with a lot of boilerplate code (like below). The nature of the command line wrappers is there will be lots of boilerplate. On the bright side, once we get rid of ApplicationResult, we can probably get rid of the "input"/"output" thing too. Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 05:41:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 10:41:10 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> Message-ID: <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> On Tue, Jul 7, 2009 at 9:20 AM, Bartek Wilczynski wrote: > On Tue, Jul 7, 2009 at 12:04 AM, Brad Chapman wrote: > >>> I'm fine with deprecating ApplicationReslut. >> >> Bartek, you just won the typo of the month contest hands down. >> > Well, It's always motivating to see that people actually read your > posts carefully ;) I read it, but just chuckled to myself ;) Brad wrote: >> My vote is to document using subprocess and avoid creating our own >> wrapper. No one has to learn a Biopython specific API for running >> programs, and subprocess provides plenty of flexibility to get stdout, >> stderr and return codes. For places where we feel like using subprocess >> is tricky, additional documentation within Biopython should help those >> encountering it for the first time. This gives us more time to work >> on biology problems, and leaves the running programs problems up to >> the greater Python community. Exactly. I'm sure there will still be questions on the mailing list from people about using subprocess, but if our documentation is done well enough this shouldn't be too much of a burden. Bartek wrote: > Well, having such documentation would be a great thing. I've just gone > through the docs for the subprocess module and it seems to be the layer unifying > all those crazy different ways of spawning processes. It's a shame I somehow > missed that it has been there since Python 2.4... So now, after doing my homework and > checking what has been going on in python since 2004, I think that Brad's idea > is better.
We have dropped support for 2.3, so we can try to move from > Application.generic_run to subprocess.Popen instead of trying to > provide our own wrapper. We just need good docs. That seems unanimous so far: Deprecate Bio.Application.generic_run, and document using subprocess instead. Good :) Are you all happy with just marking Bio.Application.generic_run and Bio.Application.ApplicationResult as obsolete for Biopython 1.51, with the deprecation warning added in Biopython 1.52? We'll need to update the Tutorial too - which reminds me, could someone go over the "Alignment Tools" bit (currently section 6.3) to see if I've pitched this at about the right level? On re-reading it just now I found and fixed several typos. Peter From bugzilla-daemon at portal.open-bio.org Tue Jul 7 05:43:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 05:43:33 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907070943.n679hXi0025601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 05:43 EST ------- Yes, I would agree that we should be able to cope with a missing date (perhaps with a warning). Can we include this file in Biopython as a unit test? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
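Editorial sketch for the generic_run thread above: the consensus there is to document plain subprocess usage rather than maintain a Biopython wrapper. A minimal illustration of capturing the return code, stdout and stderr with subprocess.Popen might look like the following. The run_tool helper is a hypothetical name (not a Biopython function), and the Python interpreter stands in for a real command line tool so the example runs anywhere.

```python
import subprocess
import sys


def run_tool(args, stdin_text=None):
    """Run a command (given as an argument list) and return
    (return_code, stdout_text, stderr_text).

    Hypothetical helper for illustration only; not part of Biopython.
    """
    child = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # work with text rather than bytes
    )
    # communicate() feeds stdin, reads both pipes to the end and waits,
    # which avoids the deadlocks you can get reading the pipes by hand.
    out, err = child.communicate(stdin_text)
    return child.returncode, out, err


# Stand-in for a real alignment tool: the Python interpreter itself.
code, out, err = run_tool([sys.executable, "-c", "print('aligned')"])
```

Passing the command as a list avoids shell quoting issues; a fixed option such as "-mode batch" would simply be another pair of list items.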
From bugzilla-daemon at portal.open-bio.org Tue Jul 7 06:16:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 06:16:55 -0400 Subject: [Biopython-dev] [Bug 2873] New: import warnings.warn instead of warnings causes code to fail Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2873 Summary: import warnings.warn instead of warnings causes code to fail Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: trivial Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mlrodrigues at igc.gulbenkian.pt As of commit: http://github.com/biopython/biopython/commit/f2b2125dbbf57b1b1ac5a0259918acfc4e63abbe#diff-3 On github, line 39 (from warnings import warn) was inserted, but within the function the call is always written as warnings.warn() and not warn(). Changing line 39 to 'import warnings' solves the problem -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 06:44:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 06:44:10 -0400 Subject: [Biopython-dev] [Bug 2873] import warnings.warn instead of warnings causes code to fail In-Reply-To: Message-ID: <200907071044.n67AiA7X027715@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2873 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 06:44 EST ------- Thanks - import statement in Bio/PDB/PDBList.py now fixed in CVS, will be on github shortly.
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Tue Jul 7 08:51:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Jul 2009 08:51:52 -0400 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> Message-ID: <20090707125152.GL17086@sobchak.mgh.harvard.edu> Hi Eric; > I've been mulling a couple of methods for PhyloXML objects that I thought > could deserve some discussion. > > 1. Singular properties for some plural attributes > > This goes back to the "confidences" issue: When I'm drilling down through a > phyloXML-derived tree, I keep expecting certain attributes to be singular > values when they're actually plural. Auto-completion catches it, of course, > but the resulting code would seem more obvious if I used the singular name > when I know the attribute consists of a list of one element. I like the idea and implementation for cases where you can have multiple items, but have one most of the time. Very nice. > 2. A find() method on Clade and maybe Phylogeny objects [...] > Enhancements: > - The keyword argument could be a regular expression. Would that be useful? This seems useful. Often people use crazy naming convention hacks, and might want to pull out something like all proteins from a particular organism based on a common prefix in the name. > To handle numbers, I'd have to convert every sub-node attribute value to a > string, and that would be weird -- or else find() would have to skip > numerical attributes. Is this if you support regular expressions or either way? For the find, I think it's sufficient to define what you support and leave it at that set: any subset of searching will help people get their work done. 
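Editorial sketch of the find() idea discussed above, under stated assumptions: the Clade class here is a hypothetical stand-in (not the real Bio.PhyloXML class), keyword arguments are treated as regular expressions matched against string attributes only, numerical attributes are skipped rather than converted to strings, and cls defaults to object so that everything matches.

```python
import re


class Clade(object):
    """Hypothetical stand-in for a PhyloXML clade node."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades or []


def find(root, cls=object, **patterns):
    """Yield nodes under root (pre-order) that are instances of cls and
    whose string attributes match every regular expression in patterns.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        # Push children reversed so they are visited left to right.
        stack.extend(reversed(node.clades))
        if not isinstance(node, cls):
            continue
        for attr, pattern in patterns.items():
            value = getattr(node, attr, None)
            # Missing or non-string (e.g. numerical) values never match.
            if not isinstance(value, str) or not re.match(pattern, value):
                break
        else:
            yield node


tree = Clade(clades=[
    Clade(name="ECOLI_recA", branch_length=0.1),
    Clade(clades=[
        Clade(name="ECOLI_lexA", branch_length=0.2),
        Clade(name="BACSU_recA", branch_length=0.3),
    ]),
])

# All clades whose name starts with a common organism prefix.
ecoli = [node.name for node in find(tree, name=r"ECOLI_")]
```

This covers the "common prefix in the name" use case; anything hairier (such as filtering on numerical values) is left to a user-written function, as suggested in the thread.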
> - If no regular arguments are needed, cls could default to PhyloElement or > even "object" to match everything. I like the object default here. This fits with a simple use case of: find everything that matches this string of interest. > - To enable arbitrary hairiness, this function could accept a function as > the value of the keyword argument and return anything truthy. But at that > point, the user could probably just roll their own find_node() function. > However, it could still be useful to filter for numerical values. This is probably more than you need. For complicated cases I'd assume people are sophisticated enough to roll their own. Nice ideas, Brad From chapmanb at 50mail.com Tue Jul 7 09:02:48 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Jul 2009 09:02:48 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> Message-ID: <20090707130248.GM17086@sobchak.mgh.harvard.edu> Hi Stephen; > In reference to the lagrange code, I see the concern with the > licensing. I think that this could be corrected however with a simple > rewrite when conforming to the BioPython standards. We can require lagrange to be installed and use imports to grab the needed code. The other option is that y'all can explicitly relicense a subset of the code under the Biopython license. > I can see however > where the Bio.Nexus functionality might not be sufficient for tree > manipulation. 
I am not a contributor to the BioPython dev group so I > cannot speak to those specifics, but as a user I can see separating > out the tree functions from the Nexus package (and tree I/O in > general) as logically a phylogenetic tree structure has little to do > with the nexus file format. It can be somewhat awkward to deal with in > the current form. A more general implementation might be a Bio.Tree > package with I/O readers in Nexus and Newick and XML, etc. Definitely. Eric has been discussing this with regards to the PhyloXML project and we had been looking at other Tree representations: in PyCogent and Thomas Mailund's Newick module. Considering the lagrange tree model makes a lot of sense as well. What I'd like to see is a stab at a generalized Tree object that supports the operations you need and that the Bio.Nexus parser can produce, exactly as you describe. Eric and Nick, what do you think about coordinating on this? > Just a thought and I am happy to work on the tree code in whatever > capacity it would be helpful to Nick. Awesome. We're very open to generalizing the Tree representation in Biopython. What I'm trying to avoid is having multiple Nexus/Newick parsers; this is confusing to users and too much duplicated effort. It sounds like we're on the same page in coming together on something that will work for everyone. Brad > Take care, > Stephen > ================== > Stephen A. Smith > Postdoctoral Researcher > NESCent: National Evolutionary Synthesis Center > page: http://blackrim.org > blog: http://blackrim.net/semaphoront > sasmith at nescent.org > > > > On Jul 4, 2009, at 4:11 PM, Brad Chapman wrote: > > > Hi Nick; > > Thanks much for the update. I'm cc'ing in the Biopython dev list to > > keep everyone there in the loop as well. > > > >> I have worked out a number of better functions for searching xml > >> database results, i.e. finding all elements with tags y that exist > >> somewhere inside elements with tags x. 
This is much more flexible in > >> the event that data of interest resides at different levels of a > >> hierarchy, which I have found in some cases. > > > > Awesome. Echoing what Hilmar mentioned, it would be good to step back > > at this point and talk about integration with Biopython. A couple > > of thoughts and suggestions along those lines: > > > > - You've included code from Lagrange which worries me for two > > reasons. First, this overlaps with existing Biopython functionality > > in Bio.Nexus; we want to eliminate that as it's confusing for > > users of the package to find different non-compatible > > implementations. If the existing code doesn't work for you in some > > way, could you flesh out those issues on the Biopython dev list so we > > can work to resolve them? Secondly, lagrange is licensed under the > > GPL so practically it is not compatible with Biopython, which is > > licensed much more freely. > > > > - You've settled on a flat system of coding with functions and no > > nesting inside of classes. This makes it difficult to distinguish the > > public API from internal functions. We could help make this more > > clear in a couple of ways: > > > > - Organizing related functionality into classes. > > - Prefixing internal functions with underscores to indicate they > > are not meant to be called by users. > > - Starting to provide some user documentation, ideally centered > > around use cases. Often these help provide a way to think about > > the usability of the code and hint at ways to improve it. > > > > Hope this is helpful and I'm happy to offer more specific > > suggestions as you dig into it. Have a great 4th of July weekend, > > > > Brad > > > > > >> Stephen Smith wrote: > >>> These look really great. Glad the lagrange tree code is working > >>> out. I > >>> am very excited for the merging of the Biopython and the lagrange > >>> tree > >>> classes. More details to come. > >>> Stephen > >>> ================== > >>> Stephen A.
Smith > >>> Postdoctoral Researcher > >>> NESCent: National Evolutionary Synthesis Center > >>> page: http://blackrim.org > >>> blog: http://blackrim.net/semaphoront > >>> sasmith at nescent.org > >>> > >>> > >>> > >>> On Jun 24, 2009, at 12:47 AM, Nick Matzke wrote: > >>> > >>>> OK, here's the latest... > >>>> > >>>> New functions: a bunch of stuff dealing with phylogenetic trees, > >>>> making > >>>> use of the tree/node class in Stephen Smith's lagrange (GNU public > >>>> license), which was superior to the half-baked (and not GPL) tree/ > >>>> node > >>>> class I was using before GSoC started. > >>>> > >>>> ============= > >>>> read_ultrametric_Newick(newickstr): > >>>> Read a Newick file into a tree object (a series of node objects > >>>> links to > >>>> parent and daughter nodes), also reading node ages and node > >>>> labels if > >>>> any. > >>>> > >>>> list_leaves(phylo_obj): > >>>> Print out all of the leaves in above a node object > >>>> > >>>> treelength(node): > >>>> Gets the total branchlength above a given node by recursively > >>>> adding > >>>> through tree. > >>>> > >>>> phylodistance(node1, node2): > >>>> Get the phylogenetic distance (branch length) between two nodes. > >>>> > >>>> get_distance_matrix(phylo_obj): > >>>> Get a matrix of all of the pairwise distances between the tips of > >>>> a tree. > >>>> > >>>> get_mrca_array(phylo_obj): > >>>> Get a square list of lists (array) listing the mrca of each pair of > >>>> leaves (half-diagonal matrix) > >>>> > >>>> subset_tree(phylo_obj, list_to_keep): > >>>> Given a list of tips and a tree, remove all other tips and > >>>> resulting > >>>> redundant nodes to produce a new smaller tree. 
> >>>> > >>>> prune_single_desc_nodes(node): > >>>> Follow a tree from the bottom up, pruning any nodes with only one > >>>> descendent > >>>> > >>>> find_new_root(node): > >>>> Search up tree from root and make new root at first divergence > >>>> > >>>> make_None_list_array(xdim, ydim): > >>>> Make a list of lists ("array") with the specified dimensions > >>>> > >>>> get_PD_to_mrca(node, mrca, PD): > >>>> Add up the phylogenetic distance from a node to the specified > >>>> ancestor > >>>> (mrca). Find mrca with find_1st_match. > >>>> > >>>> find_1st_match(list1, list2): > >>>> Find the first match in two ordered lists. > >>>> > >>>> get_ancestors_list(node, anc_list): > >>>> Get the list of ancestors of a given node > >>>> > >>>> addup_PD(node, PD): > >>>> Adds the branchlength of the current node to the total PD measure. > >>>> > >>>> print_tree_outline_format(phylo_obj): > >>>> Prints the tree out in "outline" format (daughter clades are > >>>> indented, > >>>> etc.) > >>>> > >>>> print_Node(node, rank): > >>>> Prints the node in question, and recursively all daughter nodes, > >>>> maintaining rank as it goes. > >>>> > >>>> lagrange_disclaimer(): > >>>> Just prints lagrange citation etc. in code using lagrange > >>>> libraries. > >>>> ============= > >>>> > >>>> > >>>> > >>>> What's next: > >>>> > >>>> I'm going to spend the rest of this week following up on Brad's > >>>> suggestions to make the code more standard, with the priority of > >>>> figuring out how I can revise the current BioPython phylogeny > >>>> class, to > >>>> resemble the better version in lagrange, so that there is a generic > >>>> flexible phylogeny/newick parser that can be used generally as > >>>> well as > >>>> by my BioGeography package specifically. > >>>> > >>>> updated wiki/git: > >>>> http://biopython.org/wiki/BioGeography#June. > >>>> 2C_week_3:_Functions_to_read_user-specified_Newick_files_. > >>>> 28with_ages_and_internal_node_labels. 
> >>>> 29_and_generate_basic_summary_information. > >>>> > >>>> http://github.com/nmatzke/biopython/commits/Geography > >>>> > >>>> Cheers! > >>>> Nick > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> Nick Matzke wrote: > >>>>> Sorry my update is slow, it is coming in a bit! Thanks, Nick > >>>>> > >>>>> Brad Chapman wrote: > >>>>>> Nick; > >>>>>> Thanks for the update -- hope y'all are having fun at the > >>>>>> Evolution > >>>>>> meeting and have managed to meet up. > >>>>>> > >>>>>>> Basically this week I added functions to download & parse large > >>>>>>> numbers of records, get TaxonOccurrence gbifKeys, and search > >>>>>>> with > >>>>>>> those keys. Main functions: > >>>>>> > >>>>>> Good stuff. My main comment echoes a couple of things we > >>>>>> discussed > >>>>>> earlier: > >>>>>> > >>>>>> - It is not clear to a user which functions are API functions to > >>>>>> call and which are used internally. Prefixing the internal > >>>>>> functions with underscores (_) and organizing these into classes > >>>>>> will help with this. > >>>>>> > >>>>>> - I still noticed some tempfile writing from what we discussed > >>>>>> last > >>>>>> week. If you have problems using in memory file handles let us > >>>>>> know and we can discuss more. > >>>>>> > >>>>>> In general if your coding style is to get it out there and then > >>>>>> re-factor, that is cool. But please put some time into the > >>>>>> schedule for this so I know not to bug you before you've actually > >>>>>> had a chance to go through things a second time. Also, it's a > >>>>>> good > >>>>>> idea to do this in segments as we go along. From experience, if > >>>>>> you > >>>>>> build up too much code that needs rework it becomes more mentally > >>>>>> difficult to get into the rewriting. > >>>>>> > >>>>>>> An issue: > >>>>>>> > >>>>>>> Next week come functions to process phylogenetic trees. 
I > >>>>>>> have had > >>>>>>> issues with the current BioPython newick parser etc.; > >>>>>>> basically what > >>>>>>> exists appears to not accept node label information which is > >>>>>>> required > >>>>>>> to store e.g. branchlengths which are crucial for the sorts of > >>>>>>> things > >>>>>>> I have to do in the future. So unless there is a better > >>>>>>> suggestion I > >>>>>>> plan to upload modify & upload my own tree parsing/using > >>>>>>> functions. I > >>>>>>> am open to suggestions in this matter. > >>>>>> > >>>>>> We do not want to introduce duplicated code for Newick tree > >>>>>> parsing in > >>>>>> Biopython. This is a good opportunity to engage the development > >>>>>> list > >>>>>> to help figure out how to fix the current parser to do what you > >>>>>> need. If you are not sure how to get started, the best way is > >>>>>> to get > >>>>>> together a small test file that demonstrates your problems, and > >>>>>> post > >>>>>> it to the list. It would be more useful to everyone to have your > >>>>>> fixes in the main parser. > >>>>>> > >>>>>> Brad > >>>>>> > >>>>> > >>>> > >>>> -- > >>>> ==================================================== > >>>> Nicholas J. Matzke > >>>> Ph.D. Candidate, Graduate Student Researcher > >>>> Huelsenbeck Lab > >>>> Center for Theoretical Evolutionary Genomics > >>>> 4151 VLSB (Valley Life Sciences Building) > >>>> Department of Integrative Biology > >>>> University of California, Berkeley > >>>> > >>>> Lab websites: > >>>> http://ib.berkeley.edu/people/lab_detail.php?lab=54 > >>>> http://fisher.berkeley.edu/cteg/hlab.html > >>>> Dept. personal page: > >>>> http://ib.berkeley.edu/people/students/person_detail.php?person=370 > >>>> Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > >>>> Lab phone: 510-643-6299 > >>>> Dept. 
fax: 510-643-6264 > >>>> Cell phone: 510-301-0179 > >>>> Email: matzke at berkeley.edu > >>>> > >>>> Mailing address: > >>>> Department of Integrative Biology > >>>> 3060 VLSB #3140 > >>>> Berkeley, CA 94720-3140 > >>>> > >>>> ----------------------------------------------------- > >>>> "[W]hen people thought the earth was flat, they were wrong. When > >>>> people > >>>> thought the earth was spherical, they were wrong. But if you > >>>> think that > >>>> thinking the earth is spherical is just as wrong as thinking the > >>>> earth > >>>> is flat, then your view is wronger than both of them put together." > >>>> > >>>> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical > >>>> Inquirer, > >>>> 14(1), 35-44. Fall 1989. > >>>> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > >>>> ==================================================== > >>>> _______________________________________________ > >>>> Wg-phyloinformatics mailing list > >>>> Wg-phyloinformatics at nescent.org > >>>> https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > >>> > >>> > >> > From bugzilla-daemon at portal.open-bio.org Tue Jul 7 09:10:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 09:10:10 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907071310.n67DAATG001005@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #5 from chapmanb at 50mail.com 2009-07-07 09:10 EST ------- It's derived from the MySQL schema. I'll mention that on the BioSQL bug when I upload the schema there. Good catch with Python2.4.
Grrr old versions, I like those conditional expressions too much. I think test_BioSQL should default to the in-memory version of SQLite, so completely agreed. This is most likely to work out of the box on a default system. Do you want me to check this in with the 2.4 fix? Or should we wait until after 1.51? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 09:59:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 09:59:17 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907071359.n67DxHOr002748@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 09:59 EST ------- (In reply to comment #5) > It's derived from the MySQL schema. I'll mention that on the BioSQL > bug when I upload the schema there. OK > Good catch with Python2.4. Grrr old versions, I like those conditional > expressions too much. I haven't really used them, some of "my" machines are still on Python 2.4, but can see the appeal - especially within a list or generator comprehension. > I think test_BioSQL should default to the in-memory version of SQLite, so > completely agreed. This is most likely to work out of the box on a default > system. Good. > Do you want me to check this in with the 2.4 fix? Or should we wait > until after 1.51? At least wait until 1.51 is out, and we've had some feedback from Hilmar. I would prefer to wait until the SQLite schema is at least in the BioSQL repository, and ideally publicly released. I had the impression from Hilmar at BOSC that BioSQL 1.0.2 could be out later this year, so this may not take that long. 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Tue Jul 7 10:25:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 7 Jul 2009 10:25:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090707130248.GM17086@sobchak.mgh.harvard.edu> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman wrote: > Hi Stephen; > > We can require lagrange to be installed and use imports to > grab the needed code. The other option is that y'all can explicitly > relicense a subset of the code under the Biopython license. > Trivia: it looks like lagrange in turn depends on scipy, but quickly glancing through the code, I only see numpy functions being used. Since some other Biopython modules already depend on numpy, could the installation of lagrange for Bio.Geography be made simpler by just changing the import to numpy? > I can see however > > where the Bio.Nexus functionality might not be sufficient for tree > > manipulation. I am not a contributor to the BioPython dev group so I > > cannot speak to those specifics, but as a user I can see separating > > out the tree functions from the Nexus package (and tree I/O in > > general) as logically a phylogenetic tree structure has little to do > > with the nexus file format. It can be somewhat awkward to deal with in > > the current form. 
A more general implementation might be a Bio.Tree > > package with I/O readers in Nexus and Newick and XML, etc. > > Definitely. Eric has been discussing this with regards to the > PhyloXML project and we had been looking at other Tree > representations: in PyCogent and Thomas Mailund's Newick module. > Considering the lagrange tree model makes a lot of sense as well. > What I'd like to see is a stab at a generalized Tree object that > supports the operations you need and that the Bio.Nexus parser can > produce, exactly as you describe. Eric and Nick, what do you think > about coordinating on this? > Sounds great to me. My impression is that most tree representations are based on a recursive Node element with a few associated attributes and a number of useful methods; phyloXML has a Clade object roughly corresponding to that, but also a bunch of other element types for extensive annotation of the tree. So two options spring to mind: 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed by any phylogenetic tree representation, ever. (It's already pretty close.) Refactor Nexus and Newick to use these objects; merge the features of lagrange so the rest of the Biopython environment can benefit. Only export to external object structures that are something other than a straight phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy for crunching. 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let that be the Biopython default representation. Add a function in Bio.PhyloXML to export its enhanced tree structure to this simpler Bio.Tree representation. I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but otherwise be independent of that specific file format. It doesn't depend on any XML library directly, and both child nodes and XML node attributes appear as plain ol' object attributes in the tree. 
But the Nexus module looked like the parser was kind of tied to the tree representation, so I haven't reused any of that code yet. So #1 is my preference, but it puts the burden of inter-module compatibility on whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet little island for the rest of the summer. All the best, Eric From biopython at maubp.freeserve.co.uk Tue Jul 7 10:56:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 15:56:01 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <320fb6e00907070756hcb15f96pff5694ac4552ef32@mail.gmail.com> On Tue, Jul 7, 2009 at 3:25 PM, Eric Talevich wrote: > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman wrote: >> Hi Stephen; >> >> We can require lagrange to be installed and use imports to >> grab the needed code. The other option is that y'all can explicitly >> relicense a subset of the code under the Biopython license. > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > glancing through the code, I only see numpy functions being used. > Since some other Biopython modules already depend on numpy, > could the installation of lagrange for Bio.Geography be made > simpler by just changing the import to numpy? That sounds like a good idea to follow up with the lagrange team (making lagrange depend on numpy but not scipy). I think Brad is right to be asking questions about the lagrange code and their license.
How much code do you actually use from lagrange, and can we either get those bits re-licensed (or reimplemented) to include directly into Biopython? This may not be realistic, in which case a dependency on lagrange may be the best bet... Adding external python library dependencies in Biopython is generally discouraged, *especially* anything required at build time as this makes installation much more complicated. As I recall, we've been able to cut these down to just numpy (needed for several modules, but we can install without it), plus optional dependencies like database drivers (e.g. for BioSQL) and ReportLab (only used in Bio.Graphics). Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 11:12:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 16:12:02 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Stephen [I think] wrote: >> > I can see however >> > where the Bio.Nexus functionality might not be sufficient for tree >> > manipulation. I am not a contributor to the BioPython dev group so I >> > cannot speak to those specifics, but as a user I can see separating >> > out the tree functions from the Nexus package (and tree I/O in >> > general) as logically a phylogenetic tree structure has little to do >> > with the nexus file format. It can be somewhat awkward to deal with in >> > the current form.
A more general implementation might be a Bio.Tree >> > package with I/O readers in Nexus and Newick and XML, etc. Brad wrote: >> Definitely. Eric has been discussing this with regards to the >> PhyloXML project and we had been looking at other Tree >> representations: in PyCogent and Thomas Mailund's Newick module. >> Considering the lagrange tree model makes a lot of sense as well. >> What I'd like to see is a stab at a generalized Tree object that >> supports the operations you need and that the Bio.Nexus parser can >> produce, exactly as you describe. Eric and Nick, what do you think >> about coordinating on this? Eric wrote: > Sounds great to me. I also agree. Bio.Nexus has some good stuff that is a bit hidden, and has wider application - some kind of Bio.Tree module sounds sensible (ideally with I/O for Nexus, XML, etc). We might even move the phyloXML specific stuff to live under Bio.Tree.PhyloXML. > My impression is that most tree representations are based on a recursive > Node element with a few associated attributes and a number of useful > methods; phyloXML has a Clade object roughly corresponding to that, > but also a bunch of other element types for extensive annotation of > the tree. So two options spring to mind: > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed by > any phylogenetic tree representation, ever. (It's already pretty close.) > Refactor Nexus and Newick to use these objects; merge the features of > lagrange so the rest of the Biopython environment can benefit. Only export > to external object structures that are something other than a straight > phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy for > crunching. > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > that be the Biopython default representation. Add a function in Bio.PhyloXML > to export its enhanced tree structure to this simpler Bio.Tree > representation. 
I am unclear why you would need to have an entirely separate tree object structure (which then requires code to map between the two). Perhaps some specific examples of the "enhancements" would help? How about this variation on (2): Suppose Bio.Tree provided a simple tree object (holding a nested structure), with methods/functions for general operations like DFT, finding common ancestors, calculating branch lengths, collapsing internal nodes, etc. [and I would expect a lot of this could be borrowed from Bio.Nexus, and/or Thomas Mailund's Newick module]. Couldn't Bio.PhyloXML build on this using subclassed tree nodes? Do we even need different objects? What if each node class had an optional python dictionary for annotations? You could maybe key this off the PhyloXML names? > I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but > otherwise be independent of that specific file format. It doesn't depend on > any XML library directly, and both child nodes and XML node attributes > appear as plain ol' object attributes in the tree. But the Nexus module > looked like the parser was kind of tied to the tree representation, so I > haven't reused any of that code yet. So #1 is my preference, but it put the > burden of inter-module compatibility on whoever is maintaining Bio.Nexus, > whereas #2 leaves my code on a quiet little island for the rest of the > summer. We're going to need some input from the Bio.Nexus authors - Frank and Cymon (CC'd). 
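To make the variation on (2) a bit more concrete, here is a rough sketch of what a simple node with an optional annotations dictionary, plus a format-specific subclass, might look like. Everything here is hypothetical: TreeNode, depth_first and total_branch_length are illustrative names, not anything that exists in Bio.Nexus or Bio.PhyloXML, and the Clade below is only a stand-in for the real phyloXML class.

```python
class TreeNode(object):
    """Hypothetical minimal tree node: a name, a branch length, children,
    and an optional dictionary of annotations."""

    def __init__(self, name=None, branch_length=None, children=None,
                 annotations=None):
        self.name = name
        self.branch_length = branch_length
        # Copy the inputs so nodes never share a mutable default.
        self.children = list(children) if children else []
        self.annotations = dict(annotations) if annotations else {}

    def depth_first(self):
        """Yield this node and then all of its descendants (pre-order)."""
        yield self
        for child in self.children:
            for node in child.depth_first():
                yield node

    def total_branch_length(self):
        """Sum the branch lengths in this subtree; missing lengths count as 0."""
        return sum(node.branch_length or 0.0 for node in self.depth_first())


class Clade(TreeNode):
    """Hypothetical format-specific subclass adding a phyloXML-style field."""

    def __init__(self, confidences=None, **kwargs):
        TreeNode.__init__(self, **kwargs)
        self.confidences = list(confidences) if confidences else []


# Plain and subclassed nodes can be mixed in one tree.
tree = TreeNode(name="root", children=[
    Clade(name="A", branch_length=0.1, confidences=[0.95]),
    TreeNode(name="B", branch_length=0.2, annotations={"taxonomy": "E. coli"}),
])
names = [node.name for node in tree.depth_first()]
```

The point of the sketch is only that general operations live on the base class, so a subclassed node gets them for free, while anything a format needs beyond that can go either in the subclass or in the annotations dictionary.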
Peter From bugzilla-daemon at portal.open-bio.org Tue Jul 7 12:14:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:14:28 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071614.n67GESBG008148@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #3 from tsham at lbl.gov 2009-07-07 12:14 EST ------- Hi, The file is part of an unpublished work that is in preparation. I think it would be ok to include it in the unit test *after* it's been published, but not just yet. Or I could generate a test file that is similar to this file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 12:22:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:22:35 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071622.n67GMZoI008582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 12:22 EST ------- (In reply to comment #3) > Hi, > > The file is part of an unpublished work that is in preparation. I think it > would be ok to include it in the unit test *after* it's been published, but not > just yet. Or I could generate a test file that is similar to this file. > A realistic but similar file would be fine - thanks. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 7 12:38:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:38:08 -0400 Subject: [Biopython-dev] [Bug 2874] New: invalid class on warning module Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2874 Summary: invalid class on warning module Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mlrodrigues at igc.gulbenkian.pt /usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py:151: UserWarning: retrieving index file. Takes about 5 MB. warnings.warn("retrieving index file. Takes about 5 MB.") Traceback (most recent call last): File "get_pdb_structures.py", line 23, in get(pdblist, f, my_try) File "get_pdb_structures.py", line 16, in get x.download_entire_pdb(listfile=f) File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 288, in download_entire_pdb for pdb_code in entries: self.retrieve_pdb_file(pdb_code) File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 237, in retrieve_pdb_file RuntimeError) File "/usr/lib/python2.5/warnings.py", line 32, in warn assert issubclass(category, Warning) AssertionError -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 12:52:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:52:12 -0400 Subject: [Biopython-dev] [Bug 2874] invalid class on warning module In-Reply-To: Message-ID: <200907071652.n67GqCKX009821@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2874 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 12:52 EST ------- Fixed, thanks! 
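For anyone hitting the same traceback: the assertion fires because warnings.warn() was given an exception class (RuntimeError) as its category, and the category has to be a subclass of Warning, such as RuntimeWarning. A minimal sketch of the difference follows; the helper names and the message text are made up for illustration and are not the actual code in Bio/PDB/PDBList.py.

```python
import warnings


def warn_download_failed(pdb_code):
    """Hypothetical helper: warn that a PDB entry could not be retrieved."""
    # The category must be a Warning subclass: RuntimeWarning, not RuntimeError.
    warnings.warn("Could not retrieve PDB entry %s" % pdb_code, RuntimeWarning)


def category_is_valid(category):
    """Return True if category is acceptable as a warnings.warn() category."""
    return isinstance(category, type) and issubclass(category, Warning)
```

A check like category_is_valid() is cheap enough to use in a unit test without downloading anything.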
Downloading the 5MB index file in the unit tests seems like a bad idea, but clearly we need more unit test coverage as this error of mine actually affected three files in Bio.PDB, Bio/PDB/Dice.py Bio/PDB/MMCIF2Dict.py Bio/PDB/PDBList.py If you have any suggestions for further unit tests, please let us know. Regards, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 13:08:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 13:08:50 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071708.n67H8olB010408@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 13:08 EST ------- Hi Tim, This should be fixed in CVS (and will be on github soon), but I would still like to include an example in the unit tests. If you can also test this, that would be great. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From matzke at berkeley.edu Tue Jul 7 14:12:10 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 07 Jul 2009 11:12:10 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <4A538FFA.4030708@berkeley.edu> Hi all, I am just now back in town and would love to co-coordinate on this. I agree having multiple newick parsers etc. is undesirable, I just found I was forced to that this spring when BioPython didn't have what I need even for pretty standard Newick files. I have also made use of Mailund's newick parser in the past. I am booked this afternoon but will go through the thread more this evening and comment further. Cheers! Nick Eric Talevich wrote: > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman > wrote: > > Hi Stephen; > > We can require lagrange to be installed and use imports to > grab the needed code. The other option is that y'all can explicitly > relicense a subset of the code under the Biopython license. > > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > glancing through the code, I only see numpy functions being used. Since > some other Biopython modules already depend on numpy, could the > installation of lagrange for Bio.Geography be made simpler by just > changing the import to numpy? > > > I can see however > > where the Bio.Nexus functionality might not be sufficient for tree > > manipulation. 
I am not a contributor to the BioPython dev group so I > > cannot speak to those specifics, but as a user I can see separating > > out the tree functions from the Nexus package (and tree I/O in > > general) as logically a phylogenetic tree structure has little to do > > with the nexus file format. It can be somewhat awkward to deal > with in > > the current form. A more general implementation might be a Bio.Tree > > package with I/O readers in Nexus and Newick and XML, etc. > > Definitely. Eric has been discussing this with regards to the > PhyloXML project and we had been looking at other Tree > representations: in PyCogent and Thomas Mailund's Newick module. > Considering the lagrange tree model makes a lot of sense as well. > What I'd like to see is a stab at a generalized Tree object that > supports the operations you need and that the Bio.Nexus parser can > produce, exactly as you describe. Eric and Nick, what do you think > about coordinating on this? > > > Sounds great to me. My impression is that most tree representations are > based on a recursive Node element with a few associated attributes and a > number of useful methods; phyloXML has a Clade object roughly > corresponding to that, but also a bunch of other element types for > extensive annotation of the tree. So two options spring to mind: > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > by any phylogenetic tree representation, ever. (It's already pretty > close.) Refactor Nexus and Newick to use these objects; merge the > features of lagrange so the rest of the Biopython environment can > benefit. Only export to external object structures that are something > other than a straight phylogenetic tree -- e.g. networkx or graphviz for > plotting, numpy/scipy for crunching. > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > that be the Biopython default representation. 
Add a function in > Bio.PhyloXML to export its enhanced tree structure to this simpler > Bio.Tree representation. > > I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but > otherwise be independent of that specific file format. It doesn't depend > on any XML library directly, and both child nodes and XML node > attributes appear as plain ol' object attributes in the tree. But the > Nexus module looked like the parser was kind of tied to the tree > representation, so I haven't reused any of that code yet. So #1 is my > preference, but it put the burden of inter-module compatibility on > whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet > little island for the rest of the summer. > > All the best, > Eric -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From czmasek at burnham.org Tue Jul 7 14:46:32 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 7 Jul 2009 11:46:32 -0700 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <20090707125152.GL17086@sobchak.mgh.harvard.edu> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> <20090707125152.GL17086@sobchak.mgh.harvard.edu> Message-ID: <4A539808.201@burnham.org> Hi: I cannot really comment on the first point (since I don't know enough Python), but I totally agree with Brad on the issue of the find() methods -- very useful! Christian Brad Chapman wrote: > Hi Eric; > > >> I've been mulling a couple of methods for PhyloXML objects that I thought >> could deserve some discussion. >> >> 1. Singular properties for some plural attributes >> >> This goes back to the "confidences" issue: When I'm drilling down through a >> phyloXML-derived tree, I keep expecting certain attributes to be singular >> values when they're actually plural. Auto-completion catches it, of course, >> but the resulting code would seem more obvious if I used the singular name >> when I know the attribute consists of a list of one element. >> > > I like the idea and implementation for cases where you can have > multiple items, but have one most of the time. Very nice. > > >> 2. A find() method on Clade and maybe Phylogeny objects >> > [...] > >> Enhancements: >> - The keyword argument could be a regular expression. Would that be useful? >> > > This seems useful. Often people use crazy naming convention hacks, > and might want to pull out something like all proteins from a > particular organism based on a common prefix in the name. > > >> To handle numbers, I'd have to convert every sub-node attribute value to a >> string, and that would be weird -- or else find() would have to skip >> numerical attributes. 
>> > > Is this if you support regular expressions or either way? For the > find, I think it's sufficient to define what you support and leave > it at that set: any subset of searching will help people get their > work done. > > >> - If no regular arguments are needed, cls could default to PhyloElement or >> even "object" to match everything. >> > > I like the object default here. This fits with a simple use case of: > find everything that matches this string of interest. > > >> - To enable arbitrary hairiness, this function could accept a function as >> the value of the keyword argument and return anything truthy. But at that >> point, the user could probably just roll their own find_node() function. >> However, it could still be useful to filter for numerical values. >> > > This is probably more than you need. For complicated cases I'd > assume people are sophisticated enough to roll their own. > > Nice ideas, > Brad > From bugzilla-daemon at portal.open-bio.org Tue Jul 7 17:49:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 17:49:16 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907072149.n67LnGaJ019542@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #6 from tsham at lbl.gov 2009-07-07 17:49 EST ------- Created an attachment (id=1339) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1339&action=view) Test case vector nti generated genbank file. This file is ok to include in the unit test. It has the same problem as the other file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Tue Jul 7 18:49:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 23:49:27 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> Message-ID: <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> Brad wrote: >> My vote is to document using subprocess and avoid creating our own >> wrapper. No one has to learn a Biopython specific API for running >> programs, and subprocess provides plenty of flexibility to get stdout, >> stderr and return codes. For places where we feel like using subprocess >> is tricky, additional documentation within Biopython should help those >> encountering it for the first time. This gives us more time to work >> on biology problems, and leaves the running programs problems up to >> the greater Python community. Peter wrote: > Exactly. I'm sure there will still be questions on the mailing list from > people about using subprocess, but if our documentation is done > well enough this shouldn't be too much of a burden. > ... > That seems unanimous so far: Deprecate Bio.Application.generic_run, > and document using subprocess instead. Good :) I started trying to rewrite the tutorial sections using generic_run, and unfortunately it looks like a reasonably cross platform replacement for generic_run when all you want is the return code but you don't want the tool's output printed on screen becomes quite complex, e.g. 
    import subprocess
    import sys

    # cline is the command line wrapper object built earlier in the tutorial
    return_code = subprocess.call(str(cline),
                                  stdin=subprocess.PIPE,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE,
                                  shell=(sys.platform != "win32"))

We need to use pipes for stdout (and stderr) to stop the tool's output being printed to screen. Just using os.system(str(cline)) has the same problem. We needed to include the stdin as a pipe as a workaround for a Windows specific bug in subprocess if called from a GUI using Biopython, see http://bugs.python.org/issue1124861 and earlier mailing list posts. This may not be worth worrying about for the documentation examples, as it's a corner case and has been fixed in recent versions of Python. Finally, we need to use shell=True on Unix (but not Windows as I recall from looking at the Bio.Application code) as we are giving the command as a string (rather than a list of the tool and its arguments). Maybe we can make the command line wrapper object more list-like to make subprocess happy without needing to create a string? I'll try and test this on Windows, Mac and Linux tomorrow - but maybe we will want to include a replacement for Bio.Application.generic_run after all? (Would "simple_run", "run", or "call" be good names?)
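As a rough idea of what such a replacement could look like if the command were available as a list rather than a string, here is a small sketch. The name run_quietly is hypothetical and this is not an existing Biopython function. It uses Popen with communicate() instead of subprocess.call, because the subprocess documentation warns that call() with stdout or stderr pipes can block once the tool writes enough output to fill the pipe buffer.

```python
import subprocess
import sys


def run_quietly(args):
    """Hypothetical helper: run a command given as a list of arguments.

    Returns (return code, captured stdout, captured stderr) and prints
    nothing to the screen.
    """
    process = subprocess.Popen(args,
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    # communicate() reads both pipes to the end, so a chatty tool cannot
    # block on a full pipe buffer.
    stdout_data, stderr_data = process.communicate()
    return process.returncode, stdout_data, stderr_data


# Usage: run the current Python interpreter as a stand-in for a real tool.
return_code, out, err = run_quietly([sys.executable, "-c", "print('hello')"])
```

Passing a list means no shell is involved on either platform, so the shell=(sys.platform != "win32") switch would not be needed in this form.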
Peter From eric.talevich at gmail.com Wed Jul 8 00:09:43 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 00:09:43 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Message-ID: <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> On Tue, Jul 7, 2009 at 11:12 AM, Peter wrote: > Eric wrote: > > My impression is that most tree representations are based on a recursive > > Node element with a few associated attributes and a number of useful > > methods; phyloXML has a Clade object roughly corresponding to that, > > but also a bunch of other element types for extensive annotation of > > the tree. So two options spring to mind: > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > by > > any phylogenetic tree representation, ever. (It's already pretty close.) > > Refactor Nexus and Newick to use these objects; merge the features of > > lagrange so the rest of the Biopython environment can benefit. Only > export > > to external object structures that are something other than a straight > > phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy > for > > crunching. > > > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > > that be the Biopython default representation. Add a function in > Bio.PhyloXML > > to export its enhanced tree structure to this simpler Bio.Tree > > representation. 
> > I am unclear why would you need to have to have an entirely separate tree > object structure (which then requires code to map between the two). > Perhaps some specific examples of the "enhancements" would help? > The benefit of letting the tree object structures diverge is procrastination -- we could reconcile the two modules after GSoC is over, with stable features and test suites in place. But I could justifiably focus on integration for the remaining weeks if that's best for Biopython, since otherwise I'd probably be reimplementing a number of features already present in other modules. How about this variation on (2): > Suppose Bio.Tree provided a simple tree object (holding a nested > structure), > with methods/functions for general operations like DFT, finding common > ancestors, calculating branch lengths, collapsing internal nodes, etc. > [and I would expect a lot of this could be borrowed from Bio.Nexus, > and/or Thomas Mailund's Newick module]. Couldn't Bio.PhyloXML build > on this using subclassed tree nodes? > The Bioperl and Bioruby phyloXML projects were done this way, I think, but they already had access to Tree/Node objects within each project. Bio.PhyloXML.Tree objects could inherit from Bio.Tree objects if the Bio.Tree objects were designed in a compatible way... if we go this route I'll need to draft up a list of traps, like naming conventions ("annotations" is already an attribute of Bio.PhyloXML.Sequence) and class hierarchy (some functions rely on everything in the phyloXML spec being a subclass of PhyloElement). Do we even need different objects? What if each node class had an optional > python dictionary for annotations? You could maybe key this off the > PhyloXML > names? > > I bet this could be done without different objects. 
Bio.PhyloXML.Tree could be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could be renamed to TreeElement; and the Nexus and Newick parsers could reuse PhyloXML's Phylogeny and Clade elements, where Clade merges with the existing Node class(es). Even Clade by itself might be enough. For organizational purposes, format-specific tree elements could move to their own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some multiple-inheritance tricks could be used to smooth things over. Here is the phyloXML definition of Clade: http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h-1124608460 My implementation (trimmed):

class Clade(PhyloElement):
    """Describes a branch of the current phylogenetic tree.

    Used recursively, describes the topology of a phylogenetic tree.

    The parent branch length of a clade can be described with the
    'branch_length' attribute.

    Element 'confidence' is used to indicate the support for a clade/parent
    branch.

    Element 'events' is used to describe such events as gene-duplications at
    the root node/parent branch of a clade.

    Element 'width' is the branch width for this clade (including parent
    branch). Both 'color' and 'width' elements apply for the whole clade
    unless overwritten in sub-clades.
    """
    def __init__(self, branch_length=None, id_source=None,
            name=None, width=None, color=None, node_id=None, events=None,
            binary_characters=None, date=None,
            # Collections
            confidences=None, taxonomies=None, sequences=None,
            distributions=None, references=None, properties=None,
            clades=None, other=None,
            ):
        # set all keyword arguments to instance attributes; collections default to []
        ...
The same for Phylogeny: http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h535307528

class Phylogeny(PhyloElement):
    """A phylogenetic tree."""
    def __init__(self, rooted, rerootable=None, branch_length_unit=None,
            type=None, name=None, id=None, description=None, date=None,
            clade=None,
            # Collections
            confidences=None, clade_relations=None, sequence_relations=None,
            properties=None, other=None,
            ):
        assert isinstance(rooted, bool)
        # set keyword arguments to attributes; collections default to []
        ...

Sources: http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML/Tree.py http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html If we base the Bio.Tree objects off of these two classes, then I wouldn't even need an optional annotations dictionary on each object. Which makes sense, since I think the phyloXML format was designed to accommodate nearly all types of annotations that could reasonably be applied to phylogenetic trees. Assuming most of the Newick and Nexus annotations fit into this design, if a small number of annotations don't, they can be added to this constructor as more keyword arguments without much harm. (I know nothing about NeXML; should we keep an eye on that too? Glancing at the homepage, I don't see much about complex annotation types, which is probably good if we want to fit that format into this framework eventually.) 
Cheers, Eric From eric.talevich at gmail.com Wed Jul 8 00:45:16 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 00:45:16 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Message-ID: <3f6baf360907072145t235d43a6nb3633612c94a5244@mail.gmail.com> On Tue, Jul 7, 2009 at 11:12 AM, Peter wrote: > > Perhaps some specific examples of the "enhancements" would help? > > Sure. Here are the special phyloXML element types listed at phyloxml.org, with comments: Annotation -- attached to Sequence; has metadata BinaryCharacters -- "names and/or counts of binary characters present, gained, and lost at the root of a clade" BranchColor -- RGB, for graphics support CladeRelation -- typed relationship between two clades, e.g. multiple parents Date -- e.g. #mya, or name of period ("Silurian") Distribution, Point, Polygon -- geographic distribution of the items of a clade (species, sequences) DomainArchitecture, ProteinDomain -- like SeqFeature for a protein sequence Events -- e.g. one gene duplication on the current clade Property -- attach external references to a node, kind of meta Reference -- literature reference: doi or text description Sequence -- like SeqRecord; more specific annotation fields SequenceRelation -- typed relationship between two sequences, e.g. orthology Taxonomy -- with scientific name, common names, rank, id, code, URI Some of these could be adapted into generally useful Biopython objects, such as Taxonomy and Reference. 
A few are metadata related to the structure or interpretation of the tree, and a few are small classes that could be converted to dictionaries if necessary. The conversion between Sequence and SeqRecord could probably be made lossless, or close to it, and then it would be safe to just plug the Biopython object directly into the tree instead of using a PhyloXML-specific class. Cheers, Eric From bugzilla-daemon at portal.open-bio.org Wed Jul 8 06:16:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Jul 2009 06:16:09 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907081016.n68AG94P010653@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-08 06:16 EST ------- (In reply to comment #6) > Created an attachment (id=1339) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1339&action=view) [details] > Test case vector nti generated genbank file. > > This file is ok to include in the unit test. It has the same problem as the > other file. > Thanks - I've added that as a new unit test. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Wed Jul 8 08:36:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 08:36:00 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <4A538FFA.4030708@berkeley.edu> References: <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <4A538FFA.4030708@berkeley.edu> Message-ID: <20090708123600.GW17086@sobchak.mgh.harvard.edu> Hi Nick; > I am just now back in town and would love to co-coordinate on this. I > agree having multiple newick parsers etc. is undesirable, I just found I > was forced to that this spring when BioPython didn't have what I need > even for pretty standard Newick files. I have also made use of > Mailund's newick parser in the past. That sounds great. Eric is also on board from the PhyloXML side. For the parser, the right approach is to provide some example files that Bio.Nexus does not handle correctly, and work on improvements to that parser to bring it in line with what you need. Secondarily, we should work on parsing into a general tree structure that supports the questions you need to ask. This should allow us to avoid the lagrange code duplication and also have a more robust Nexus parser in Biopython. Thanks, Brad > > I am booked this afternoon but will go through the thread more this > evening and comment further. Cheers! > Nick > > Eric Talevich wrote: > > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman > > wrote: > > > > Hi Stephen; > > > > We can require lagrange to be installed and use imports to > > grab the needed code. The other option is that y'all can explicitly > > relicense a subset of the code under the Biopython license. 
> > > > > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > > glancing through the code, I only see numpy functions being used. Since > > some other Biopython modules already depend on numpy, could the > > installation of lagrange for Bio.Geography be made simpler by just > > changing the import to numpy? > > > > > I can see however > > > where the Bio.Nexus functionality might not be sufficient for tree > > > manipulation. I am not a contributor to the BioPython dev group so I > > > cannot speak to those specifics, but as a user I can see separating > > > out the tree functions from the Nexus package (and tree I/O in > > > general) as logically a phylogenetic tree structure has little to do > > > with the nexus file format. It can be somewhat awkward to deal > > with in > > > the current form. A more general implementation might be a Bio.Tree > > > package with I/O readers in Nexus and Newick and XML, etc. > > > > Definitely. Eric has been discussing this with regards to the > > PhyloXML project and we had been looking at other Tree > > representations: in PyCogent and Thomas Mailund's Newick module. > > Considering the lagrange tree model makes a lot of sense as well. > > What I'd like to see is a stab at a generalized Tree object that > > supports the operations you need and that the Bio.Nexus parser can > > produce, exactly as you describe. Eric and Nick, what do you think > > about coordinating on this? > > > > > > Sounds great to me. My impression is that most tree representations are > > based on a recursive Node element with a few associated attributes and a > > number of useful methods; phyloXML has a Clade object roughly > > corresponding to that, but also a bunch of other element types for > > extensive annotation of the tree. So two options spring to mind: > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > by any phylogenetic tree representation, ever. (It's already pretty > > close.) 
Refactor Nexus and Newick to use these objects; merge the > > features of lagrange so the rest of the Biopython environment can > > benefit. Only export to external object structures that are something > > other than a straight phylogenetic tree -- e.g. networkx or graphviz for > > plotting, numpy/scipy for crunching. > > > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > > that be the Biopython default representation. Add a function in > > Bio.PhyloXML to export its enhanced tree structure to this simpler > > Bio.Tree representation. > > > > I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but > > otherwise be independent of that specific file format. It doesn't depend > > on any XML library directly, and both child nodes and XML node > > attributes appear as plain ol' object attributes in the tree. But the > > Nexus module looked like the parser was kind of tied to the tree > > representation, so I haven't reused any of that code yet. So #1 is my > > preference, but it put the burden of inter-module compatibility on > > whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet > > little island for the rest of the summer. > > > > All the best, > > Eric > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. 
fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When people > thought the earth was spherical, they were wrong. But if you think that > thinking the earth is spherical is just as wrong as thinking the earth > is flat, then your view is wronger than both of them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== From chapmanb at 50mail.com Wed Jul 8 08:48:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 08:48:41 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> Message-ID: <20090708124841.GX17086@sobchak.mgh.harvard.edu> Hi all; > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > by any phylogenetic tree representation, ever. (It's already pretty close.) > > Refactor Nexus and Newick to use these objects; merge the features of > > lagrange so the rest of the Biopython environment can benefit. I am for this approach. It sounds like what people want is a tree that does everything, and re-implementations occur because representations are lacking in something. 
It would be nice to design this modularly -- with mixin classes for related add-on functionality -- as much as possible. This would allow lighter weight implementations in the future if that were desired. > The benefit of letting the tree object structures diverge is procrastination > -- we could reconcile the two modules after GSoC is over, with stable > features and test suites in place. But I could justifiably focus on > integration for the remaining weeks if that's best for Biopython, since > otherwise I'd probably be reimplementing a number of features already > present in other modules. My vote is for the integration work. Refactoring is hard work and best done early. It is easier to add functionality to a fully integrated PhyloXML parser in the future. > I bet this could be done without different objects. Bio.PhyloXML.Tree could > be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could > be renamed to TreeElement; and the Nexus and Newick parsers could reuse > PhyloXML's Phylogeny and Clade elements, where Clade merges with the > existing Node class(es). Even Clade by itself might be enough. For > organizational purposes, format-specific tree elements could move to their > own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some > multiple-inheritance tricks could be used to smooth things over. Yes, this sounds exactly right. Great stuff. > (I know nothing > about NeXML; should we keep an eye on that too? Glance at the homepage I > don't see much about complex annotation types, which is probably good if we > want to fit that format into this framework eventually.) PhyloXML plus Nexus/Newick is probably enough to stay reasonably general and keep our sanity. NeXML support would be great but practically is an additional project. The refactoring you've described is a good chunk to run with. 
Brad From chapmanb at 50mail.com Wed Jul 8 09:06:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 09:06:49 -0400 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> Message-ID: <20090708130649.GY17086@sobchak.mgh.harvard.edu> Hi Peter; > I started trying to rewrite the tutorial sections using generic_run, and > unfortunately it looks like a reasonably cross platform replacement for > generic_run when all you want is the return code but you don't want > the tool's output printed on screen becomes quite complex, e.g. > > import subprocess > return_code = subprocess.call(str(cline), > stdin=subprocess.PIPE, > stdout=subprocess.PIPE, > stderr=subprocess.PIPE, > shell=(sys.platform!="win32")) > > We need to use pipes for stdout (and stderr) to stop the tool's output > being printed to screen. Just using os.system(str(cline)) has the same > problem. How about adding a function like "run_arguments" to the commandlines that returns the commandline as a list. It sounds like we can drop the stdin workaround and provide a documentation item for older Windows versions from a GUI. It might be better to use Popen and wait to make it straightforward to learn to get stdout and stderr. 
So then we get:

import subprocess
child = subprocess.Popen(cline.run_arguments(),
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
return_code = child.wait()
print child.stdout.read()

This avoids the shell nastiness with the argument list, is as simple as it gets with subprocess, and gives users an easy path to getting stdout, stderr and the return codes. Also documenting how to avoid stdout and stderr entirely is useful:

import os
import subprocess
child = subprocess.Popen(cline.run_arguments(),
                         stdout=open(os.devnull, "w"),
                         stderr=subprocess.STDOUT)

Brad
From bugzilla-daemon at portal.open-bio.org Wed Jul 8 14:22:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Jul 2009 14:22:00 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200907081822.n68IM0Lc028503@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 eric.talevich at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1297 is|0                           |1
           obsolete|                            |

------- Comment #14 from eric.talevich at gmail.com 2009-07-08 14:21 EST -------
Created an attachment (id=1340) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1340&action=view) Adapted test_warnings to work with Py2.4-5
This patch is also available on my github branch for this bug: http://github.com/etal/biopython/tree/bug2820 I tested it with Python 2.4, 2.5 and 2.6 on Ubuntu, applied to the current biopython trunk. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
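A minimal sketch of the Popen approach from the generic_run thread above, written for current Python 3. The run_arguments() method is only a proposal in that thread, so this example passes a plain argument list and uses the Python interpreter itself as a stand-in for an external tool; both are assumptions for illustration. It calls communicate() rather than wait() followed by read(), because the subprocess documentation warns that wait() can deadlock when a child fills a pipe.

```python
import subprocess
import sys

# Stand-in for the proposed cline.run_arguments(): the command line as a
# list of strings. The "tool" is just the Python interpreter printing a line.
args = [sys.executable, "-c", "print('alignment done')"]

# Capture stdout and stderr. communicate() reads both pipes to the end and
# waits for the child, so large outputs cannot block the child process.
child = subprocess.Popen(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
stdout_text, stderr_text = child.communicate()
return_code = child.returncode

# Discard the tool's output entirely and keep only the return code.
quiet_code = subprocess.call(args,
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.STDOUT)
```

Passing a list and leaving shell at its default of False sidesteps the platform-specific shell handling discussed in the thread.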
From eric.talevich at gmail.com Wed Jul 8 14:58:52 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 14:58:52 -0400 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <20090707125152.GL17086@sobchak.mgh.harvard.edu> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> <20090707125152.GL17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907081158i649feb97t10cc52dc9a4454b6@mail.gmail.com> Hi Brad, On Tue, Jul 7, 2009 at 8:51 AM, Brad Chapman wrote: > > 2. A find() method on Clade and maybe Phylogeny objects > [...] > > Enhancements: > > - The keyword argument could be a regular expression. Would that be > useful? > > This seems useful. Often people use crazy naming convention hacks, > and might want to pull out something like all proteins from a > particular organism based on a common prefix in the name. > > > To handle numbers, I'd have to convert every sub-node attribute value to > a > > string, and that would be weird -- or else find() would have to skip > > numerical attributes. > > Is this if you support regular expressions or either way? For the > find, I think it's sufficient to define what you support and leave > it at that set: any subset of searching will help people get their > work done. >

I implemented it. Here's the signature and docstring:

def find(self, cls=None, **kwargs):
    """Find all sub-nodes matching the given attributes.

    The 'cls' argument specifies the class of the sub-node. Nodes that
    inherit from this type will also match. (The default, Tree.PhyloElement,
    matches any standard phyloXML type.)

    The arbitrary keyword arguments indicate the attribute name of the
    sub-node and the value to match: string, integer or boolean. Strings are
    evaluated as regular expression matches; integers are compared directly
    for equality, and booleans evaluate the attribute's truth value (True or
    False) before comparing. To handle nonzero floats, search with a boolean
    argument, then filter the result manually.

    If no keyword arguments are given, then just the class type is used for
    matching.

    The result is an iterable through all matching objects, by depth-first
    search. (Not necessarily the same order as the elements appear in the
    source file!)

    Example:

    >>> tree = PhyloXML.read('phyloxml_examples.xml').phylogenies[5]
    >>> matches = tree.clade.find(code='OCTVU')
    >>> matches.next()
    Taxonomy(code='OCTVU', scientific_name='Octopus vulgaris')
    """

Notes:
- Phylogeny.find just directly calls self.clade.find and returns the result.
- I still use PhyloElement instead of object for the default class. The recursive function uses __dict__ to walk the tree, so allowing any object to be searched leads to chaos (e.g. int.__dict__ has 55 keys). Restricting the search to Tree-related nodes still accommodates most use cases, I think.
- Depth-first search - if a node that matches has subnodes that also match, the higher node will be yielded first, then the first matching subnode, and so on. But: since the object dictionary doesn't keep XML node order, the order the matches are returned in isn't always what you'd expect. I think I can mitigate this somewhat, but still -- documented weirdness.

Thanks, Eric
From biopython at maubp.freeserve.co.uk Thu Jul 9 05:18:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Jul 2009 10:18:49 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <20090708130649.GY17086@sobchak.mgh.harvard.edu> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: > Hi Peter; > >> I started trying to rewrite the tutorial sections using generic_run, and >> unfortunately it looks like a reasonably cross platform replacement for >> generic_run when all you want is the return code but you don't want >> the tool's output printed on screen becomes quite complex, e.g. >> >> import subprocess >> return_code = subprocess.call(str(cline), >> stdin=subprocess.PIPE, >> stdout=subprocess.PIPE, >> stderr=subprocess.PIPE, >> shell=(sys.platform!="win32")) >> >> We need to use pipes for stdout (and stderr) to stop the tool's output >> being printed to screen. Just using os.system(str(cline)) has the same >> problem. > > How about adding a function like "run_arguments" to the > commandlines that returns the commandline as a list. That would be a simple alternative to my vague idea "Maybe we can make the command line wrapper object more list like to make subprocess happy without needing to create a string?", which may not be possible. Either way, this will require a bit of work on the Bio.Application parameter objects... > It sounds like we can drop the stdin workaround and provide a > documentation item for older Windows versions from a GUI. Yes, as I noted, this is a corner case. 
It is something any replacement for generic_run would still have to cater to, but it would just complicate an example. > It might be better to use Popen and wait to make it > straightforward to learn to get stdout and stderr. Yes, using subprocess.Popen explicitly rather than their helper function subprocess.call makes sense for our docs Peter P.S. Thanks Cymon for those minor corrections to the tutorial. The master file is a LaTeX document, Doc/Tutorial.tex, the command line tools pdflatex and hevea turn it into PDF and HTML which we include with the Biopython archives, and manually copy onto the website as well. From eric.talevich at gmail.com Thu Jul 9 15:46:53 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 9 Jul 2009 15:46:53 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090708124841.GX17086@sobchak.mgh.harvard.edu> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> On Wed, Jul 8, 2009 at 8:48 AM, Brad Chapman wrote: > Hi all; > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > > by any phylogenetic tree representation, ever. (It's already pretty > close.) > > > Refactor Nexus and Newick to use these objects; merge the features of > > > lagrange so the rest of the Biopython environment can benefit. > > I am for this approach. It sounds like what people want is a tree > that does everything, and re-implementations occur because > representations are lacking in something. 
> > It would be nice to design this modularly -- with mixin classes for > related add-on functionality -- as much as possible. This would > allow lighter weight implementations in the future if that were > desired. > OK. Here's the current file layout that needs merging, to illustrate: Bio/ PhyloXML/ __init__.py -- flat public API Tree.py Parser.py Writer.py Utils.py Exceptions.py Nexus/ Nexus.py Nodes.py Trees.py cnexus.c The proposal is to extract the Tree class hierarchy so that other modules can share it, and Biopython users can do I/O with trees as easily as they currently can with sequences ("from Bio import TreeIO; for tree in TreeIO.parse('example.xml', 'phyloxml'): ..."). Bio/ Tree/ Elements.py TreeIO.py -- read, write wrappers PhyloXML/ Parser.py Writer.py Utils.py Nexus/ Nexus.py cnexus.c In the above case, TreeIO.py is a new file containing wrappers for the read and parse functions in my PhyloXML module, and also Nexus and Newick, pending integration. The modules implementing each specific format remain where they are, under Bio/, but aren't expected to be imported directly by the end user. Alternatively, the individual modules that implement each format for I/O can be collected under a new TreeIO directory, with __init__ implementing the wrappers: Bio/ Tree/ Elements.py Utils.py? TreeIO/ __init__.py -- read, write wrappers PhyloXML.py -- Parser + Writer combined Nexus.py cnexus.c ... What do you think? Should I start writing a generalized Bio/Tree/Elements.py for PhyloXML to depend on? 
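The wrapper idea in the layouts above can be sketched as a small format registry. This is not code from the thread: the registry, the toy per-line "newick" parser, and the function names are all hypothetical stand-ins chosen to mirror the SeqIO-style parse/read calls mentioned in the proposal.

```python
import io


def _parse_newick_lines(handle):
    """Toy stand-in parser: yield one raw Newick string per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


# Hypothetical registry mapping format names to parser functions. Each
# format-specific module would contribute one entry here.
_PARSERS = {"newick": _parse_newick_lines}


def parse(handle, fmt):
    """Iterate over the trees in a handle, dispatching on the format name."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError("Unknown tree format: %r" % fmt)
    return parser(handle)


def read(handle, fmt):
    """Return the only tree in a handle; fail if there is not exactly one."""
    trees = list(parse(handle, fmt))
    if len(trees) != 1:
        raise ValueError("Expected one tree, found %d" % len(trees))
    return trees[0]


trees = list(parse(io.StringIO("(A,B);\n(C,(D,E));\n"), "newick"))
single = read(io.StringIO("(A,B);\n"), "newick")
```

With a registry like this, either directory layout works: the wrappers only need to import each format's parser function, wherever that module lives.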
-Eric From biopython at maubp.freeserve.co.uk Thu Jul 9 17:53:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Jul 2009 22:53:42 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> Message-ID: <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> On Thu, Jul 9, 2009 at 8:46 PM, Eric Talevich wrote: > The proposal is to extract the Tree class hierarchy so that other modules > can share it, and Biopython users can do I/O with trees as easily as they > currently can with sequences ("from Bio import TreeIO; for tree in > TreeIO.parse('example.xml', 'phyloxml'): ..."). > ... Yes :) > In the above case, TreeIO.py is a new file containing wrappers for the read > and parse functions in my PhyloXML module, and also Nexus and Newick, > pending integration. ... > > Alternatively, the individual modules that implement each format for I/O can > be collected under a new TreeIO directory, with __init__ implementing the > wrappers: ... Either idea sounds reasonable. However, for future extensibility, and also consistency with Bio.SeqIO and Bio.AlignIO, I would suggest we have Bio/TreeIO/__init__.py (i.e. as a folder containing as many wrappers or parsers as needed) rather than just using Bio/TreeIO.py (a single file). Note that the Nexus parser is much more than just a tree parser. 
NEXUS files can contain trees, but much more besides (including a multiple sequence alignment, and instructions to phylogenetic tools). In the short term for TreeIO and Nexus, I would just have Bio/TreeIO/NexusIO.py as a thin wrapper that calls Bio.Nexus and converts its trees into the standard trees (i.e. we don't have to make any changes to Bio.Nexus immediately). In the longer term, it would make sense for Bio.Nexus to start using the new tree objects - but we also have backwards compatibility to think about. Ideally we can get Frank and/or Cymon to look at this (rather than Nick or Eric - as this is their code, and Nick and Eric have more than enough work to do for their projects). [There are parallels here to how I did Bio.SeqIO (and AlignIO), often wrapping existing parsers by turning their format specific data structures into the common SeqRecord (or Alignment) objects. For example, to read/write alignments in NEXUS format Bio.AlignIO just calls Bio.Nexus internally.] Peter From chapmanb at 50mail.com Fri Jul 10 08:07:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 10 Jul 2009 08:07:34 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> Message-ID: <20090710120734.GD17086@sobchak.mgh.harvard.edu> Hi Eric; > > The proposal is to extract the Tree class hierarchy so that other modules > > can share it, and Biopython users can 
do I/O with trees as easily as they > > currently can with sequences ("from Bio import TreeIO; for tree in > > TreeIO.parse('example.xml', 'phyloxml'): ..."). > > ... Sounds great. For most of this I will defer to Peter's expert opinion. As he mentioned, basing this off of SeqIO/AlignIO makes a lot of sense. > > In the above case, TreeIO.py is a new file containing wrappers for the read > > and parse functions in my PhyloXML module, and also Nexus and Newick, > > pending integration. ... > > > > Alternatively, the individual modules that implement each format for I/O can > > be collected under a new TreeIO directory, with __init__ implementing the > > wrappers: ... > > Either idea sounds reasonable. However, for future extensibility, and > also consistency with Bio.SeqIO and Bio.AlignIO, I would suggest we > have Bio/TreeIO/__init__.py (i.e. as a folder containing as many > wrappers or parsers as needed) rather than just using Bio/TreeIO.py > (a single file). Agreed. The imports are the same but this gives added flexibility. > Note that the Nexus parser is much more than just a tree parser. > NEXUS files can contain trees, but much more besides (including a > multiple sequence alignment, and instructions to phylogenetic > tools). In the short term for TreeIO and Nexus, I would just have > Bio/TreeIO/NexusIO.py as a thin wrapper that calls Bio.Nexus and > converts its trees into the standard trees (i.e. we don't have to > make any changes to Bio.Nexus immediately). In the longer term, > it would make sense for Bio.Nexus to start using the new tree > objects - but we also have backwards compatibility to think about. Also agreed. We should get Bio.Nexus updated enough so that it can handle Nick's problem files, and from there apply a wrapper to push Nexus trees into a generic tree compatible with PhyloXML. This will force us to be general about the Tree implementation, but save some re-writing and maintain back-compatibility. 
Once the generic tree is hammered out and everyone is happy, then we can think about migrating Nexus to it. Seconding Peter's comments, this is probably another big job. So, in summary, the major deliverables are: - Generic tree representation plus a TreeIO structure - PhyloXML parser that uses this tree directly - Nexus parser that can handle problem files and parse into the generic tree. This will let us drop the lagrange duplication from Nick's code. Sounds like you have this well worked out, Brad From biopython at maubp.freeserve.co.uk Fri Jul 10 08:24:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Jul 2009 13:24:03 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090710120734.GD17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> <20090710120734.GD17086@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907100524xb4e1f1cx14f495c5fb658106@mail.gmail.com> On Fri, Jul 10, 2009 at 1:07 PM, Brad Chapman wrote: > So, in summary, the major deliverables are: > > - Generic tree representation plus a TreeIO structure > - PhyloXML parser that uses this tree directly > - Nexus parser that can handle problem files and parse into the > generic tree. This will let us drop the lagrange duplication from > Nick's code. > > Sounds like you have this well worked out, > Brad Sounds good. Note PhyloXML (which I gather is annotation rich) may not have to use the generic trees, it could use a subclass. 
If this means the generic trees can be less memory hungry that might be worthwhile... something to keep in mind at least. e.g. Consider a large Newick file with only taxa names and branch lengths, no branch colours, no bootstraps, no internal node names, etc. What specifically is wrong with the Bio.Nexus Newick parser? i.e. what files won't it parse that the lagrange code will? The only thing I am aware of is "naked" internal node labels (Bug 2788): http://bugzilla.open-bio.org/show_bug.cgi?id=2788 Peter From biopython at maubp.freeserve.co.uk Fri Jul 10 08:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Jul 2009 13:38:43 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> Message-ID: <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> On Mon, Jun 22, 2009 at 6:57 PM, Peter wrote: > > Once the beta release is out, we'll resume taking small changes > (especially for documentation additions or clarifications) with a > view to releasing Biopython 1.51 final in July (probably the second > week, after people get back from BOSC/ISMB). > OK, that didn't happen - too much to catch up on at work after being away at BOSC/ISMB for a week. Also I will be on holiday next week (graduation etc). I will have some limited internet access. I'm thinking of doing the final release of Biopython 1.51 the following week (i.e. the week starting 20th July). This will be after the annual EMBOSS release, and one little thing I want to sort out before we release Biopython 1.51 is mapping Solexa/PHRED scores in FASTQ files (specifically what to do with a PHRED score of zero which is usually a dummy value, but taken literally means "this read is wrong" or "worse than random"). 
After discussion with Peter Rice at BOSC/ISMB 2009, I plan to follow his plan for EMBOSS (map PHRED of zero to the lowest used Solexa score, -5). Once the EMBOSS release is out, I can use it for cross checking our FASTQ conversions. Also, we have the Bio.Application.generic_run code to retire, which basically means we label it as obsolete and update the tutorial to use subprocess (see other thread), but this requires cross platform testing. Peter From tiagoantao at gmail.com Fri Jul 10 18:52:41 2009 From: tiagoantao at gmail.com (Tiago Antão) Date: Fri, 10 Jul 2009 23:52:41 +0100 Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen In-Reply-To: <4A572392.3040902@student.otago.ac.nz> References: <4A52985C.3000603@student.otago.ac.nz> <6d941f120907070055m2d34fcb1qe8b29e40d8d67880@mail.gmail.com> <4A572392.3040902@student.otago.ac.nz> Message-ID: <6d941f120907101552y32cbd121ub9817f0b5e4292e@mail.gmail.com> Hi David, > Gee, I hope I haven't raised your hopes beyond my ability to deliver (both > in terms of time and skills). I've uploaded my Arlequin classes and > functions to a branch on github so you can see them (/Bio/PopGen/Arlequin/ > on http://github.com/dwinter/biopython/tree/arleq-branch) This is great, I took your code and created a new version (nothing more than also an initial sketch - Feel free to disagree/propose changes), you can find it here: http://github.com/tiagoantao/biopython/tree/arlequin Here are a few comments: 1. I've put indentation at 4 spaces, which I think is the biopython standard 2. I've split the code in Record (__init__.py) and your Seq code (on Utils.py) 3. Just one note, samples and haplotype tables might not be lists, but iterators. The problem is with very large files (like thousands of sequences) which do not fit in memory. While the current implementation is fine, the expectation is that what is there is just an iterator, not specifically an (in-memory) list. 
I think a list should be ok for arlequin genetic structures which I hope are always small... 4. I've put a copyright message with your name in both files ;) 5. I HAVE NOT TESTED THE CODE CHANGES. Just as a proposed startup draft concept OK, somebody has to do a parser to actually read the files in ;) . Which is the biggest piece of work to be done. I don't mind doing it (like in the next month or so - I have some free time now), but you can do it if you want. In case you decide to do it, I have just one major point to note: making a parser that is able to read big files (i.e., some files cannot be parsed into memory in one go). I made this mistake with the genepop parser and some people do complain about it. Some things cannot be read into memory as lists but have to be read as iterators (issue 3 above). I think a parser that is able to handle lots of files is also good to help in building a sound model to represent an arlequin record. As usual we will need test code and documentation for all this ;) > By the way, is there a plan to have generic representations of populations, > alleles etc in PopGen? It would make a parser for Arlequin files a much more > useful tool. I found a few threads about it on the mailing lists around the > birth of the module but not since. I am actually afraid of a single generic representation. My main issue with this is that I don't believe that it is possible to get it right. Many kinds of markers, type of data (frequency, gametic-phase, non-phased), population info (e.g. georeferencing). But after we get the genepop code and an arlequin parser fully working I don't mind revisiting this. But I would like to delay this discussion until the genepop code and (if we get it done) the arlequin code are in the production version. 
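The iterator point above can be illustrated with a generator that yields one record at a time instead of building a list. This is only a sketch under stated assumptions: the two-column "name sequence" layout is a made-up toy format, not the Arlequin file format, and iter_samples is a hypothetical name.

```python
import io


def iter_samples(handle):
    """Yield (name, sequence) pairs one at a time from a toy two-column file."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, sequence = line.split(None, 1)
        yield name, sequence


# Only one record is held in memory at a time while iterating, so the same
# loop works for a file with thousands of sequences.
data = io.StringIO("# toy data\nhap1 ACGT\nhap2 ACGA\n")
lengths = [len(seq) for _, seq in iter_samples(data)]
```

Callers that really need a list can still write list(iter_samples(handle)), so exposing an iterator does not take anything away for small files.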
Any comments would be most welcome, Tiago From bugzilla-daemon at portal.open-bio.org Mon Jul 13 10:44:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 13 Jul 2009 10:44:19 -0400 Subject: [Biopython-dev] [Bug 2879] New: missing __delitem__ in Bio.PDB.Entity.Entity Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2879 Summary: missing __delitem__ in Bio.PDB.Entity.Entity Product: Biopython Version: 1.51b Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P3 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: katja.luck at unistra.fr I realised that using the __delitem__ method in class Chain causes the following error message: ... File "/Library/Python/2.5/site-packages/Bio/PDB/Chain.py", line 79, in __delitem__ return Entity.__delitem__(self, id) AttributeError: class Entity has no attribute '__delitem__' And indeed, the class Entity doesn't have the method __delitem__ even though it is used in Chain. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
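The failure pattern in this report can be reproduced and repaired in a few lines. What follows is a simplified sketch, not the Bio.PDB source: the class names mirror the traceback, but the attributes (child_list, child_dict) and the detach_child helper are assumptions made for illustration. The point is only that a subclass delegating to Entity.__delitem__ needs the base class to define that method.

```python
class Entity(object):
    """Hypothetical stand-in for a container of children keyed by id."""

    def __init__(self, id):
        self.id = id
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        return self.child_dict[id]

    def __len__(self):
        return len(self.child_list)

    def detach_child(self, id):
        """Remove the child with the given id from both containers."""
        child = self.child_dict.pop(id)
        self.child_list.remove(child)

    def __delitem__(self, id):
        # Without this method, 'del chain[id]' fails with the
        # AttributeError shown in the report.
        return self.detach_child(id)


class Chain(Entity):
    def __delitem__(self, id):
        # A subclass can translate the id first, then delegate upward.
        return Entity.__delitem__(self, id)


chain = Chain("A")
chain.add(Entity(1))
chain.add(Entity(2))
del chain[2]
print(len(chain))
```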
From eric.talevich at gmail.com Mon Jul 13 11:21:20 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 13 Jul 2009 11:21:20 -0400
Subject: [Biopython-dev] GSoC Weekly Update 8: PhyloXML for Biopython
Message-ID: <3f6baf360907130821g6bbbe7a9s5c551156a11aeac1@mail.gmail.com>

Hi all,

Previously (July 6-10) I:

- Addressed some comments from last week's code/doc review
- Enabled Pythonic syntax sugar (dictionary emulation, specialized __str__ methods, singular properties for some plural attributes), plus tests
- Wrote Clade.find() for flexible searching
- Checked Py2.4 compatibility (it's slower, but it works)
- Started Bio.Tree, Bio.TreeIO modules (integration)

This week (July 13-17) I will:

Extend the core to the rest of the spec:
- Adding unit tests and classes to support the remaining (non-core) phyloXML elements
- Implement collapse_whitespace -- see the spec glossary
- Make Writer use the correct namespace prefixes
- "other" objects: assert the namespace is not phyloxml
- Use the schema document to validate the input file

Integrate with Biopython:
- Extract a Bio.Tree.BaseTree module from PhyloXML's tree classes
- Improve the SeqRecord conversion

Improve/revise documentation:
- Address remaining comments from code/doc review
- Revisit docstrings for all classes, functions, methods; consider enabling epydoc formatting

Questions:
- My serializer uses XML entity codes instead of unicode characters in the output -- is that OK? It still round-trips successfully with the parser.
- Is there anything to do for BioSQL compatibility, besides extracting sequences?

Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From eric.talevich at gmail.com Mon Jul 13 12:12:06 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Jul 2009 12:12:06 -0400 Subject: [Biopython-dev] Bio.Tree layout (Was: BioGeography update) Message-ID: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> Hi folks, On Fri, Jul 10, 2009 at 8:24 AM, Peter wrote: > On Fri, Jul 10, 2009 at 1:07 PM, Brad Chapman wrote: > > So, in summary, the major deliverables are: > > > > - Generic tree representation plus a TreeIO structure > > - PhyloXML parser that uses this tree directly > > - Nexus parser that can handle problem files and parse into the > > generic tree. This will let us drop the lagrange duplication from > > Nick's code. > > > > Sounds like you have this well worked out, > > Brad > > Sounds good. Note PhyloXML (which I gather is annotation rich) > may not have to use the generic trees, it could use a subclass. > If this means the generic trees can be less memory hungry that > might be worth while... something to keep in mind at least. e.g. > Consider a large Newick file with only taxa names and branch > lengths, no branch colours, no bootstraps, no internal node > names, etc. > > Peter > Hilmar Lapp just pointed me to the BioSQL PhyloDB extension: http://biosql.org/wiki/Extensions Should this schema be the basis of a Bio.Tree.BaseTree module? 
Here's the file layout I'm picturing:

Bio/Tree/
    BaseTree.py -- everything else derives from these classes
    PhyloXMLTree.py -- already on github
    NexusTree.py -- if necessary

The class structure I'm working on right now looks like:

# In BaseTree -- currently empty classes, pending Nexus integration
class TreeElement(object)
class TreeNode(TreeElement)

# In PhyloXMLTree
class PhyloElement(BaseTree.TreeElement)
class Clade(PhyloElement, BaseTree.TreeNode)
class ...(PhyloElement) -- all other phyloXML classes

Rather than treat BaseTree as the intersection of all the other Tree representations that rely on it, we could use PhyloDB as the reference point. What do you think? Should we come back to this in a week or two?

Eric

From matzke at berkeley.edu Mon Jul 13 14:34:42 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 13 Jul 2009 11:34:42 -0700
Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion
In-Reply-To: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu>
Message-ID: <4A5B7E42.40106@berkeley.edu>

Brad Chapman wrote: > Hi all; > >>> 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed >>> by any phylogenetic tree representation, ever. (It's already pretty close.) >>> Refactor Nexus and Newick to use these objects; merge the features of >>> lagrange so the rest of the Biopython environment can benefit. > > I am for this approach.
It sounds like what people want is a tree > that does everything, and re-implementations occur because > representations are lacking in something. Hi all -- thanks for this discussion about tree classes. Sorry it took me awhile to absorb all of this (and I may still be working on absorbing all of it...there is a lot to keep in my head!). PS: This also serves as my Monday update, basically I need to revise my schedule based on the decisions made after discussion of this thread. Here is a summary of the situation as I understand it. It may be a little long, apologies! (I was kind of hoping an easy solution would just appear, since really everything after this point in my GSoC project requires tree processing, and thus I at least have to get the decision made about which tree class to use.) I. Tree Class Options It sounds like we have 3 options being discussed: 1. making Bio.PhyloXML.Tree the super-duper tree class 2. improving Bio.Nexus.Trees 3. including the Lagrange tree class or suitably licensed/inspired version thereof. (Or there is #4, some combination) II. My Original Problem, Which is Probably Quite Small Really I think I kind of unintentionally kicked all of this off because I couldn't get Bio.Nexus.Trees to read what I considered pretty standard Newick files back when I was originally exploring this in the spring. Initially for my own scripts I used another newick parser & tree class I found online (Mailund's IIRC), then discovered a superior one in Lagrange and started using that. Thus in GSoC it was simplest to begin by importing the Lagrange parser, but that led to legitimate concerns about duplication/licensing etc. Reviewing my original issues from the spring, really the only problem I found with Bio.Nexus.Trees was with node labels, i.e. when an internal node is given e.g. a clade name, in addition to a branch length.
This is a standard output on a great many newick files in my experience, which seem to be correctly read by just about all the other programs I use (Mesquite, Dendroscope, etc.) so I impulsively abandoned Bio.Nexus.Trees at the time when I couldn't get it to work.

III. Bug Report

I did file a bug report back in March. This is outstanding as far as I know.

Bio.Nexus.Trees newick parser does not support internal node labels
http://bugzilla.open-bio.org/show_bug.cgi?id=2788

IV. Problem Examples

Below I have accumulated some cases that work/don't work:

=================
from Bio.Nexus import Trees

# This works
ts0 = "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;"
to0 = Trees.Tree(ts0)
print(to0)

# Gymnosperms tree with node labels; doesn't work
ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'
to1a = Trees.Tree(ts1a)

# Just Taxaceae; doesn't work
ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
to1b = Trees.Tree(ts1b)

# Just Taxaceae; this works; node labels deleted
ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
to1c = Trees.Tree(ts1c)

# This doesn't work (from bug report)
ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, t1:0.130208)F:0.0318288)D:0.0273876);"
to2 = Trees.Tree(ts2)
=================

But if I import the Lagrange tree class/parser, all of these work and my life is happy:

=================
import lagrange_newick  # This is lagrange's newick.py file, renamed to lagrange_newick.py
lt0 = lagrange_newick.parse(ts0)
lt1a = lagrange_newick.parse(ts1a)
lt1b = lagrange_newick.parse(ts1b)
lt2 = lagrange_newick.parse(ts2)
=================

V. The Functions I Need From a Tree Class

Basically my method of late has been to use the Lagrange Tree class, and then write my own standalone functions to do various necessary basic processing of trees. E.g.:

* subset tree based on list of taxa; update root and any now-redundant internal nodes left with 0 or 1 descendants
* extract a subtree to a new tree (cloned nodes so they don't refer to the old nodes, important in doing passes through tree)
* read/write to Newick
* print tree to screen in a readable format
* get distance (total branch length between 2 nodes)
* calculate many measures that can be done from the distances (total all-to-all distance matrix, tree length, mean phylogenetic distance, mean nearest-neighbor phylogenetic distance)
* several others I don't remember off the top of my head

In my list-o-functions approach, I would just write functions for the tree class I was using, but I think it has been made clear that really these functions should be methods of a certain Tree class.
Which requires a decision about what Tree class to use.

VI. What the current classes do.

I had never looked seriously at Bio.Nexus.Trees since I was just crashing it, but it actually looks like it does a bunch:

Bio.Nexus.Trees
===========
type(to1c)
to1c
dir(to1c)
['_Tree__values_are_support', '__doc__', '__init__', '__module__', '__str__', '_add_subtree', '_get_id', '_get_values', '_parse', '_walk', 'add', 'all_ids', 'branchlength2support', 'chain', 'collapse', 'collapse_genera', 'common_ancestor', 'convert_absolute_support', 'count_terminals', 'dataclass', 'display', 'distance', 'get_taxa', 'get_terminals', 'has_support', 'id', 'is_bifurcating', 'is_compatible', 'is_identical', 'is_internal', 'is_monophyletic', 'is_parent_of', 'is_preterminal', 'is_terminal', 'kill', 'link', 'max_support', 'merge_with_support', 'name', 'node', 'prune', 'randomize', 'root', 'root_with_outgroup', 'rooted', 'search_taxon', 'set_subtree', 'split', 'sum_branchlength', 'to_string', 'trace', 'unlink', 'unroot', 'weight']

# Node methods:
nd = to1c.node(1)
nd
type(nd)
dir(nd)
['__doc__', '__init__', '__module__', 'add_succ', 'data', 'get_data', 'get_id', 'get_prev', 'get_succ', 'id', 'prev', 'remove_succ', 'set_data', 'set_id', 'set_prev', 'set_succ', 'succ']

# Node data:
ndd = nd.get_data()
dir(ndd)
['__doc__', '__init__', '__module__', 'branchlength', 'comment', 'support', 'taxon']
===========

Lagrange Tree Class: (really class Node I guess, and the tree is referenced by the root Node)
=============
type(lt1b)
lt1b
dir(lt1b)
['__doc__', '__init__', '__module__', 'add_child', 'children', 'data', 'descendants', 'excluded_dists', 'find_descendant', 'graft', 'isroot', 'istip', 'iternodes', 'label', 'labelset_nodemap', 'leaf_distances', 'leaves', 'length', 'mrca', 'nchildren', 'order_subtrees_by_size', 'parent', 'prune', 'remove_child', 'rootpath', 'subtree_mapping', 'ultrametricize_dumbly']
=============

Bio.PhyloXML.Tree
=============
[not sure...perhaps someone could contribute the list of
methods/intended methods] ============= VII. I am Leaning Towards Bio.Nexus.Trees Based on current functionality and integration with BioPython, and what can be done in the short term, it looks to me like the best option is to mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as necessary. However if e.g. PhyloXML is working well enough that I can use that, that is an option. VIII. What I should do next Given what I now know, I probably should have just written a little function to strip node labels out of my Newick trees, and done everything based on the Bio.Nexus.Trees class. I could still do this and continue on my merry way without too much trouble. But given that my tree-based functions should probably be methods of some class...here are the questions I have: * Should I muck with Bio.Nexus.Trees and try to fix the node labels issue? My instinct was not to mess with other people's stuff, but that may be a poor instinct... * Should I implement my tree-based functions as methods of the Bio.Nexus.Trees class? * Should I delay on this whole issue while it is being discussed, and go back to issues more localized to my GSoC project, i.e. making my GBIF functions into methods of a GBIF records class? Thanks for reading! And sorry if this was more confusing than it had to be, I am definitely learning as I go here. Cheers, Nick > > It would be nice to design this modularly -- with mixin classes for > related add-on functionality -- as much as possible. This would > allow lighter weight implementations in the future if that were > desired. > >> The benefit of letting the tree object structures diverge is procrastination >> -- we could reconcile the two modules after GSoC is over, with stable >> features and test suites in place. But I could justifiably focus on >> integration for the remaining weeks if that's best for Biopython, since >> otherwise I'd probably be reimplementing a number of features already >> present in other modules.
> > My vote is for the integration work. Refactoring is hard work and > best done early. It is easier to add functionality to a fully integrated > PhyloXML parser in the future. > >> I bet this could be done without different objects. Bio.PhyloXML.Tree could >> be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could >> be renamed to TreeElement; and the Nexus and Newick parsers could reuse >> PhyloXML's Phylogeny and Clade elements, where Clade merges with the >> existing Node class(es). Even Clade by itself might be enough. For >> organizational purposes, format-specific tree elements could move to their >> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some >> multiple-inheritance tricks could be used to smooth things over. > > Yes, this sounds exactly right. Great stuff. > >> (I know nothing >> about NeXML; should we keep an eye on that too? Glance at the homepage I >> don't see much about complex annotation types, which is probably good if we >> want to fit that format into this framework eventually.) > > PhyloXML plus Nexus/Newick is probably enough to stay reasonably > general and keep our sanity. NeXML support would be great but > practically is an additional project. The refactoring you've described > is a good chunk to run with. > > Brad > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From eric.talevich at gmail.com Mon Jul 13 16:01:07 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Jul 2009 16:01:07 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <3f6baf360907131301v2096cef0o64c458ca1bfabc7c@mail.gmail.com> Hi Nick, On Mon, Jul 13, 2009 at 2:34 PM, Nick Matzke wrote: > > > Hi all -- thanks for this discussion about tree classes. Sorry it took me > awhile to absorb all of this (and I may still be working on absorbing all of > it...there is a lot to keep in my head!). > [...] > > I. Tree Class Options > > It sounds like we have 3 options being discussed: > > 1. making Bio.PhyloXML.Tree the super-duper tree class > 2. improving Bio.Nexus.Trees > 3. 
including the Lagrange tree class or suitably licensed/inspired version > thereof. > > (Or there is #4, some combination) > The last consensus we reached on Biopython-dev was to create two new modules, Bio.Tree and Bio.TreeIO, like so: 1. Extract a very basic Tree and Node class, looking at the intersection of the PhyloXML and Nexus class hierarchies, and put the result in Bio.Tree.BaseTree. I started on this today: http://github.com/etal/biopython/blob/phyloxml/Bio/Tree/BaseTree.py (It doesn't do anything yet besides set up a class hierarchy that we can use for generalizing existing code.) 2. Write wrappers for the existing PhyloXML and Nexus I/O functions. I'm putting that here: http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/__init__.py Again, it's only useful for PhyloXML parsing right now. Eventually we can connect Bio.Nexus to these two modules, but that's well outside the scope of my GSoC project. > Bio.PhyloXML.Tree > ============= > [not sure...perhaps someone could contribute the list of methods/intended > methods] > ============= > Not very many! My project is to implement the phyloXML spec, and the spec says nothing about methods, just about how to store data. As you've noted, Bio.Nexus has a lot of useful methods for phylogenetic trees, independent of the underlying file format. I'd like to separate the I/O code from the tree representations for Bio.Nexus and Bio.PhyloXML, leaving Bio.TreeIO with format-specific wrappers and Bio.Tree with common tree representations and methods for handling trees. Basically, I don't want to rewrite necessary methods from scratch, I want to use the ones Nexus already has. Since phyloXML is designed to store more kinds of annotations than Nexus, there are some additional Tree-based classes in Bio.Tree.PhyloXMLTree, with some methods for dealing with the additional annotations.
But the methods you want will be on Bio.Tree.BaseTree objects, and you shouldn't have to worry about phyloXML objects unless you want to add some additional phyloXML-specific annotations to your trees. > VIII. What I should do next > > Given what I now know, I probably should have just written a little > function to strip node labels out of my Newick trees, and done everything > based on the Bio.Nexus.Trees class. I could still do this and continue on > my merry way without too much trouble. > > But given that my tree-based functions should probably be methods of some > class...here are the questions I have: > > * Should I muck with Bio.Nexus.Trees and try to fix the node labels issue? > My instinct was not to mess with other people's stuff, but that may be a > poor instinct... > > * Should I implement my tree-based functions methods as methods of the > Bio.Nexus.Trees class? > > * Should I delay on this whole issue while it is being discussed, and go > back to issues more localized to my GSoC project, i.e. making my GBIF > functions into methods of a GBIF records class? > > It sounds like relying on the current Bio.Nexus is the best approach. I'll defer to the experts, but my guess is that if it's only a small change you need, then make a patch to Bio.Nexus.Trees for your own use and also upload the patch to Bugzilla to make it easier to use upstream. Integrating the functions into Bio.Nexus right now probably isn't necessary, since many of those methods will probably end up in Bio.Tree eventually anyway. For functions that could become Nexus methods, try arranging the argument list so that the object the method would belong to comes first. Then functions can be moved into classes by renaming the first argument to 'self', and nothing breaks. It's also possible to directly monkeypatch a class/object with functions structured that way, but I think that would be frowned upon in general... 
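The argument-ordering advice above can be shown in a few lines. This is a toy sketch: the Tree class, its branch_lengths attribute and the tree_length function are hypothetical names chosen for illustration, not anything from Bio.Nexus or Bio.Tree.

```python
class Tree(object):
    """Toy tree that only stores its branch lengths."""

    def __init__(self, branch_lengths):
        self.branch_lengths = list(branch_lengths)


# Standalone function: the object it works on comes first.
def tree_length(tree):
    return sum(tree.branch_lengths)


# The same code moved into a class: only the first argument is renamed.
class TreeWithMethods(Tree):
    def tree_length(self):
        return sum(self.branch_lengths)


plain = Tree([1, 2, 3])
rich = TreeWithMethods([1, 2, 3])
print(tree_length(plain), rich.tree_length())

# The monkeypatching route mentioned above also works with this
# argument order, although it is generally discouraged.
Tree.tree_length = tree_length
print(plain.tree_length())
```

Because the body is identical in both forms, callers can switch from tree_length(t) to t.tree_length() once the function has found its home class.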
Cheers, Eric From chapmanb at 50mail.com Mon Jul 13 17:39:05 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Jul 2009 17:39:05 -0400 Subject: [Biopython-dev] Bio.Tree layout (Was: BioGeography update) In-Reply-To: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> References: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> Message-ID: <20090713213905.GO17086@sobchak.mgh.harvard.edu> Hi Eric; > Hilmar Lapp just pointed me to the BioSQL PhyloDB extension: > http://biosql.org/wiki/Extensions > > Should this schema be the basis of a Bio.Tree.BaseTree module? Yes, that sounds perfect. PhyloDB has been kicked around quite a bit and will be a good base. Great idea. If you want someone to talk to in real life at UGa, Jamie Estill worked on PhyloDB during GSoC a couple of years back: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Command_Line_Topological_Query_Application_for_BioSQL http://jestill.myweb.uga.edu/ He's crazy smart and a nice guy, and was in my lab when I was down there. He's a great person to know. > Here's the file layout I'm picturing: > > Bio/Tree/ > BaseTree.py -- everything else derives from these classes > PhyloXMLTree.py -- already on github > NexusTree.py -- if necessary > > The class structure I'm working on right now looks like: > > # In BaseTree -- currently empty classes, pending Nexus integration > class TreeElement(object) > class TreeNode(TreeElement) > > # In PhyloXMLTree > class PhyloElement(BaseTree.TreeElement) > class Clade(PhyloElement, BaseTree.TreeNode) > class ...(PhyloElement) -- all other phyloXML classes > > Rather than treat BaseTree as the intersection of all the other Tree > representations that rely on it, we could use PhyloDB as the reference > point. What do you think? Should we come back to this in a week or two? 
I think PhyloDB is the right starting point, and then the implementations in Newick, lagrange and PyCogent and elsewhere are good references for the operations that people will want to do on the tree. I don't see any reason to wait on this; I'm excited about the generic tree representation and bringing these things together.
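For readers who want to see the quoted class structure as running code, here is a minimal sketch. It collapses the proposed BaseTree and PhyloXMLTree modules into one file, and the attributes (name, branch_length, children, confidence) and the get_terminals method are assumptions added for illustration; the original classes were still empty at this point.

```python
# --- would live in Bio/Tree/BaseTree.py in the proposed layout ---
class TreeElement(object):
    """Base class for every tree-related object."""


class TreeNode(TreeElement):
    """Generic node: a name, a branch length and child nodes."""

    def __init__(self, name=None, branch_length=None, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = list(children or [])

    def get_terminals(self):
        """Return the leaf nodes below (or at) this node."""
        if not self.children:
            return [self]
        terminals = []
        for child in self.children:
            terminals.extend(child.get_terminals())
        return terminals


# --- would live in Bio/Tree/PhyloXMLTree.py in the proposed layout ---
class PhyloElement(TreeElement):
    """Base class for phyloXML-specific objects."""


class Clade(PhyloElement, TreeNode):
    """phyloXML clade: a generic node plus format-specific annotation."""

    def __init__(self, name=None, branch_length=None, children=None,
                 confidence=None):
        TreeNode.__init__(self, name, branch_length, children)
        self.confidence = confidence


root = Clade(children=[
    Clade(name="Cephalotaxus", branch_length=125.0),
    Clade(name="TT1", branch_length=25.0, children=[
        Clade(name="Taxus", branch_length=100.0),
        Clade(name="Torreya", branch_length=100.0),
    ]),
])
print([leaf.name for leaf in root.get_terminals()])
```

Generic operations such as get_terminals sit on the base node, so a lighter Newick-only tree could reuse them without carrying the phyloXML annotations.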
Brad From matzke at berkeley.edu Mon Jul 13 17:40:15 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 13 Jul 2009 14:40:15 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090710120734.GD17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> <20090710120734.GD17086@sobchak.mgh.harvard.edu> Message-ID: <4A5BA9BF.2080605@berkeley.edu> Brad Chapman wrote: > Also agreed. We should get Bio.Nexus updated enough so that is can > handle Nick's problem files, and from there apply a wrapper to push > Nexus trees into a generic tree compatible with PhyloXML. This will > force us to be general about the Tree implementation, but save some > re-writing and maintain back-compatibility. Once the generic tree > is hammered out and everyone is happy, then we can think about > migrating Nexus to it. Seconding Peter's comments, this is probably > another big job. > > So, in summary, the major deliverables are: > > - Generic tree representation plus a TreeIO structure > - PhyloXML parser that uses this tree directly > - Nexus parser that can handle problem files and parse into the > generic tree. This will let us drop the lagrange duplication from > Nick's code. > > Sounds like you have this well worked out, > Brad Whoops I missed a few of these biopython-dev messages before, I have different filters shuttling things different places depending on the Subj. line. Eric filled me in. 
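As a stop-gap for the node-label problem discussed in this thread, a small helper that drops internal node labels from a Newick string might look like the sketch below. The function name strip_internal_labels is made up here, and the regular expression assumes unquoted labels; it would also remove a bare support value written directly after a closing parenthesis. Whether Bio.Nexus.Trees then accepts the result is not checked here.

```python
import re

# Anything directly after ')' up to the next structural character
# (':', ',', ')', '(' or ';') is treated as an internal node label.
_INTERNAL_LABEL = re.compile(r"\)[^():,;]+")


def strip_internal_labels(newick):
    """Return the Newick string with internal node labels removed.

    Leaf names and branch lengths are left untouched. Quoted labels
    containing structural characters are not handled.
    """
    return _INTERNAL_LABEL.sub(")", newick)


labelled = ("(Cephalotaxus:125.000000,(Taxus:100.000000,"
            "Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;")
print(strip_internal_labels(labelled))
```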
Here were some cases where the node labels blocked Bio.Nexus.Trees:

=================
from Bio.Nexus import Trees

# This works
ts0 = "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;"
to0 = Trees.Tree(ts0)
print(to0)

# Gymnosperms tree with node labels; doesn't work
ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'
to1a = Trees.Tree(ts1a)

# Just Taxaceae; doesn't work
ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
to1b = Trees.Tree(ts1b)

# Just Taxaceae; this works; node labels deleted
ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
to1c = Trees.Tree(ts1c)

# This doesn't work (from bug report)
ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662,
t1:0.130208)F:0.0318288)D:0.0273876);" to2 = Trees.Tree(ts2) ================= -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Mon Jul 13 18:02:24 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 13 Jul 2009 15:02:24 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <4A5BAEF0.9050504@berkeley.edu> Just updating one chunk of part I of the previous long message: Nick Matzke wrote: > > > I. Tree Class Options > > It sounds like we have 3 options being discussed: > > 1. making Bio.PhyloXML.Tree the super-duper tree class > 2. improving Bio.Nexus.Trees > 3. including the Lagrange tree class or suitably licensed/inspired > version thereof. > > (Or there is #4, some combination) > The last consensus we reached on Biopython-dev was to create two new > modules, Bio.Tree and Bio.TreeIO, like so: > > 1. Extract a very basic Tree and Node class, looking at the intersection > of the PhyloXML and Nexus class hierarchies, and put the result in > Bio.Tree.BaseTree. I started on this today: > http://github.com/etal/biopython/blob/phyloxml/Bio/Tree/BaseTree.py > > (It doesn't do anything yet besides set up a class heirarchy that we can > use for generalizing existing code.) > > 2. Write wrappers for the existing PhyloXML and Nexus I/O functions. 
I'm > putting that here: > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/__init__.py > > Again, it's only useful for PhyloXML parsing right now. Eventually we > can connect Bio.Nexus to these two modules, but that's well outside the > scope of my GSoC project. It sounds like, for my immediate purposes, Bio.Nexus.Trees is the solution for now, so I will reorganize my code accordingly. If/when Bio.Nexus.Trees accepts node labels I will remove the function that strips out node labels. Also, I have not forgotten previous comments from Brad et al. about bringing the other code up to spec. So I will update the BioGeography schedule and the overall organization I hope to have at the end (with classes/methods etc., instead of just a list-o-functions, which is how my original schedule was explicitly laid out), and post an update when done. Cheers! Nick > > > > > > II. My Original Problem, Which is Probably Quite Small Really > > I think I kind of unintentionally kicked all of this off because I > couldn't get Bio.Nexus.Trees to read what I considered pretty standard > Newick files back when I was originally exploring this in the spring. > Initially for my own scripts I used another newick parser & tree class I > found online (Mailund's IIRC), then discovered a superior one in > Lagrange and started using that. Thus in GSoC it was simplest to begin > by importing the Lagrange parser, but that led to legitimate concerns > about duplication/licensing etc. > > Reviewing my original issues from the spring, really the only problem I > found with Bio.Nexus.Trees was with node labels, i.e. when an internal > node is given e.g. a clade name, in addition to a branch length. This is a > standard output on a great many newick files in my experience, which > seem to be correctly read by just about all the other programs I use > (Mesquite, Dendroscope, etc.) so I impulsively abandoned Bio.Nexus.Trees > at the time when I couldn't get it to work. > > > > > > III. 
Bug Report > > I did file a bug report back in March. This is outstanding as far as I > know. > > Bio.Nexus.Trees newick parser does not support internal node labels > http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > > > > > > > > IV. Problem Examples > > > Below I have accumulated some cases that work/don't work: > > > ================= > from Bio.Nexus import Trees > > # This works > > ts0 = > "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;" > > to0 = Trees.Tree(ts0) > print to0 > > > > # Gymnosperms tree with node labels; doesn't work > ts1a = > '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;' > > to1a = Trees.Tree(ts1a) > > > > > # Just Taxaceae; doesn't work > ts1b = > '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;' > > to1b = Trees.Tree(ts1b) > > # Just Taxaceae; this works; node labels deleted > ts1c = > 
'(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;' > > to1c = Trees.Tree(ts1c) > > > > > # This doesn't work (from bug report) > ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, t1:0.130208)F:0.0318288)D:0.0273876);" > to2 = Trees.Tree(ts2) > ================= > > > > > But if I import the Lagrange tree class/parser, all of these work and my > life is happy: > > ================= > import lagrange_newick > # This is lagrange's newick.py file, renamed to lagrange_newick.py > > lt0 = lagrange_newick.parse(ts0) > lt1a = lagrange_newick.parse(ts1a) > lt1b = lagrange_newick.parse(ts1b) > lt2 = lagrange_newick.parse(ts2) > ================= > > > > > > > V. The Functions I Need From a Tree Class > > Basically my method of late has been to use the Lagrange Tree class, and > then write my own standalone functions to do various necessary basic > processing of trees. E.g.: > > * subset tree based on list of taxa; update root and any now-redundant > internal nodes left with 0 or 1 descendants > > * extract a subtree to a new tree (cloned nodes so they don't refer to > the old nodes, important in doing passes through tree) > > * read/write to Newick > > * print tree to screen in a readable format > > * get distance (total branch length between 2 nodes) > > * calculate many measures that can be done from the distances (total > all-to-all distance matrix, tree length, mean phylogenetic distance, > mean nearest-neighbor phylogenetic distance) > > * several others I don't remember off the top of my head > > > In my list-o-functions approach, I would just write functions for the > tree class I was using, but I think it has been made clear that really > these functions should be methods of a certain Tree class. Which > requires a decision about what Tree class to use. > > > > > > VI. 
What the current classes do. > > I had never looked seriously at Bio.Nexus.Trees since I was just > crashing it, but it actually looks like it does a bunch: > > Bio.Nexus.Trees > =========== > type(to1c) > > > to1c > > > dir(to1c) > > ['_Tree__values_are_support', > '__doc__', > '__init__', > '__module__', > '__str__', > '_add_subtree', > '_get_id', > '_get_values', > '_parse', > '_walk', > 'add', > 'all_ids', > 'branchlength2support', > 'chain', > 'collapse', > 'collapse_genera', > 'common_ancestor', > 'convert_absolute_support', > 'count_terminals', > 'dataclass', > 'display', > 'distance', > 'get_taxa', > 'get_terminals', > 'has_support', > 'id', > 'is_bifurcating', > 'is_compatible', > 'is_identical', > 'is_internal', > 'is_monophyletic', > 'is_parent_of', > 'is_preterminal', > 'is_terminal', > 'kill', > 'link', > 'max_support', > 'merge_with_support', > 'name', > 'node', > 'prune', > 'randomize', > 'root', > 'root_with_outgroup', > 'rooted', > 'search_taxon', > 'set_subtree', > 'split', > 'sum_branchlength', > 'to_string', > 'trace', > 'unlink', > 'unroot', > 'weight'] > > > # Node methods: > nd = to1c.node(1) > > nd > > > > type(nd) > > > dir(nd) > > ['__doc__', > '__init__', > '__module__', > 'add_succ', > 'data', > 'get_data', > 'get_id', > 'get_prev', > 'get_succ', > 'id', > 'prev', > 'remove_succ', > 'set_data', > 'set_id', > 'set_prev', > 'set_succ', > 'succ'] > > > # Node data: > ndd = nd.get_data() > > dir(ndd) > > ['__doc__', > '__init__', > '__module__', > 'branchlength', > 'comment', > 'support', > 'taxon'] > =========== > > > > > > > > Lagrange Tree Class: > (really class Node I guess, and the tree is reference by the root Node) > > ============= > type(lt1b) > > > lt1b > > > dir(lt1b) > > ['__doc__', > '__init__', > '__module__', > 'add_child', > 'children', > 'data', > 'descendants', > 'excluded_dists', > 'find_descendant', > 'graft', > 'isroot', > 'istip', > 'iternodes', > 'label', > 'labelset_nodemap', > 'leaf_distances', > 'leaves', > 
'length', > 'mrca', > 'nchildren', > 'order_subtrees_by_size', > 'parent', > 'prune', > 'remove_child', > 'rootpath', > 'subtree_mapping', > 'ultrametricize_dumbly'] > ============= > > > > > Bio.PhyloXML.Tree > ============= > [not sure...perhaps someone could contribute the list of > methods/intended methods] > ============= > > > > > VII. I am Leaning Towards Bio.Nexus.Trees > > Based on current functionality and integration with BioPython, and what > can be done in the short term, it looks to me like the best option is to > mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as > necessary. However if e.g. PhyloXML is working well enough that I can > use that, that is an option. > > > > > > VIII. What I should do next > > Given what I now know, I probably should have just written a little > function to strip node labels out of my Newick trees, and done > everything based on the Bio.Nexus.Trees class. I could still do this > and continue on my merry way without too much trouble. > > But given that my tree-based functions should probably be methods of > some class...here are the questions I have: > > * Should I muck with Bio.Nexus.Trees and try to fix the node labels > issue? My instinct was not to mess with other people's stuff, but that > may be a poor instinct... > > * Should I implement my tree-based functions methods as methods of the > Bio.Nexus.Trees class? > > * Should I delay on this whole issue while it is being discussed, and go > back to issues more localized to my GSoC project, i.e. making my GBIF > functions into methods of a GBIF records class? > > > Thanks for reading! And sorry if this was more confusing than it had to > be, I am definitely learning as I go here. > > Cheers, > Nick > > > > > > > > >> >> It would be nice to design this modularly -- with mixin classes for >> related add-on functionality -- as much as possible. This would >> allow lighter weight implementations in the future if that were >> desired. 
>> >>> The benefit of letting the tree object structures diverge is >>> procrastination >>> -- we could reconcile the two modules after GSoC is over, with stable >>> features and test suites in place. But I could justifiably focus on >>> integration for the remaining weeks if that's best for Biopython, since >>> otherwise I'd probably be reimplementing a number of features already >>> present in other modules. >> >> My vote is for the integration work. Refactoring is hard work and >> best done early. It is easier to add functionality to a fully integrated >> PhyloXML parser in the future. >> >>> I bet this could be done without different objects. Bio.PhyloXML.Tree >>> could >>> be moved to Bio.Tree or Bio.Tree.Elements; the base class >>> PhyloElement could >>> be renamed to TreeElement; and the Nexus and Newick parsers could reuse >>> PhyloXML's Phylogeny and Clade elements, where Clade merges with the >>> existing Node class(es). Even Clade by itself might be enough. For >>> organizational purposes, format-specific tree elements could move to >>> their >>> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some >>> multiple-inheritance tricks could be used to smooth things over. >> >> Yes, this sounds exactly right. Great stuff. >> >>> (I know nothing >>> about NeXML; should we keep an eye on that too? Glance at the homepage I >>> don't see much about complex annotation types, which is probably good >>> if we >>> want to fit that format into this framework eventually.) >> >> PhyloXML plus Nexus/Newick is probably enough to stay reasonably >> general and keep our sanity. NeXML support would be great but >> practically is an additional project. The refactoring you've described >> is a good chunk to run with. >> >> Brad >> > -- ==================================================== Nicholas J. Matzke Ph.D. 
Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From hlapp at gmx.net Tue Jul 14 03:41:23 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 14 Jul 2009 08:41:23 +0100 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <8D7EC898-7AAF-4140-B41C-4BB1F424150D@gmx.net> On Jul 13, 2009, at 7:34 PM, Nick Matzke wrote: > * Should I muck with Bio.Nexus.Trees and try to fix the node labels > issue? My instinct was not to mess with other people's stuff, but > that may be a poor instinct... Just my $0.02 - messing with other people's stuff is an inherent, and not infrequent, activity in distributed open-source development. I would in fact be rather merciless in doing so. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Tue Jul 14 08:24:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Jul 2009 08:24:35 -0400 Subject: [Biopython-dev] [Bug 2788] Bio.Nexus.Trees newick parser does not support internal node labels In-Reply-To: Message-ID: <200907141224.n6ECOZ9X014789@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 ------- Comment #3 from chapmanb at 50mail.com 2009-07-14 08:24 EST ------- Created an attachment (id=1342) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1342&action=view) Fix for internal node taxon labels Includes a fix and test cases for internal nodes labeled with taxon information. Please test this out on some files of interest and report any additional problem cases. I'd like to get a few more eyes on it before checking it in. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
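When trying the patch on files of interest, it may help to compare against the label-stripping workaround mentioned earlier in this thread. A minimal sketch of that workaround follows. The helper name strip_internal_labels is hypothetical (it is not part of Bio.Nexus), and the regular expression assumes unquoted labels that directly follow a closing parenthesis; it would also remove numeric support values written in that position.

```python
import re

# Hypothetical helper, not part of Bio.Nexus: match a closing parenthesis
# followed by any run of characters up to the next ":", ",", "(", ")" or ";".
_INTERNAL_LABEL = re.compile(r"\)[^:,;()]+")


def strip_internal_labels(newick):
    """Return the Newick string with internal node labels removed.

    Branch lengths such as ":25.000000" are kept. Assumes unquoted labels.
    """
    return _INTERNAL_LABEL.sub(")", newick)


# The Taxaceae example from the thread (ts1b), which has two labelled nodes.
ts1b = ("(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)"
        "TT1:25.000000)Taxaceae:90.000000;")
stripped = strip_internal_labels(ts1b)
print(stripped)
```

The stripped string keeps a colon and branch length after each closing parenthesis, which is the same shape as the unlabelled ts0 example that was reported to parse.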
From chapmanb at 50mail.com Tue Jul 14 08:35:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:35:34 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5BAEF0.9050504@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> Message-ID: <20090714123534.GQ17086@sobchak.mgh.harvard.edu> Hi Nick; Thanks for the comprehensive update. It sounds like your discussion with Eric resolved most of the questions about the tree representation. It's great to see y'all converging on this. > It sounds like for my immediate purposes, Bio.Nexus.Trees is the > solution for now, I will reorganize my code accordingly based on this. > If/when Bio.Nexus.Trees accepts node labels I will remove a function > stripping out node labels. Also I have not forgotten previous comments > from Brad et al. about bringing the other code up to specs. So I will > update the BioGeography schedule and overall organization I hope to have > at the end (with classes/methods etc., instead of just a > list-o-functions, which is how my original schedule was explicitly laid > out), and post an update when done. Agreed, and seconding Hilmar that the best thing about open source code is having others looking at your code. Conversely, feel free to dig in and fix current code where it is holding you up. To remove this blocking issue on Nexus and get us rolling again, I put together an initial fix. 
You can grab the patch from: http://bugzilla.open-bio.org/show_bug.cgi?id=2788 Let us know if this works for your files of interest. If this clears up the Nexus issue, it would be great to see the revised schedule incorporating the refactoring. Sounds like we are moving in the right direction. Good stuff. Thanks, Brad From matzke at berkeley.edu Tue Jul 14 15:08:56 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 14 Jul 2009 12:08:56 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <20090714123534.GQ17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> Message-ID: <4A5CD7C8.70009@berkeley.edu> Thanks for the fix!!! A big help. I am currently organizing my functions into several classes and making sure they work, basically the classes look like they will be something like: ========== GbifXml -- for processing GBIF XML results (all of the functions for searching/extracting stuff from xmltree structures) TreeSum -- for processing trees & getting summary statistics etc. Ranges -- Geographic range of a species (collection of points, results of classification of those points into regions), GIS-like functions for processing them Points -- geographic locations of individual collected specimens ========== Brad Chapman wrote: > Hi Nick; > Thanks for the comprehensive update. It sounds like your discussion > with Eric resolved most of the questions about the tree > representation. It's great to see y'all converging on this. 
> >> It sounds like for my immediate purposes, Bio.Nexus.Trees is the >> solution for now, I will reorganize my code accordingly based on this. >> If/when Bio.Nexus.Trees accepts node labels I will remove a function >> stripping out node labels. Also I have not forgotten previous comments >> from Brad et al. about bringing the other code up to specs. So I will >> update the BioGeography schedule and overall organization I hope to have >> at the end (with classes/methods etc., instead of just a >> list-o-functions, which is how my original schedule was explicitly laid >> out), and post an update when done. > > Agreed, and seconding Hilmar that the best thing about open source > code is having others looking at your code. Conversely, feel free to > dig in and fix current code where it is holding you up. To remove > this blocking issue on Nexus and get us rolling again, I > put together an initial fix. You can grab the patch from: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > > Let us know if this works for your files of interest. > > If this clears up the Nexus issue, it would be great to see the > revised schedule incorporating the refactoring. Sounds like we > are moving in the right direction. Good stuff. > > Thanks, > Brad > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From mjldehoon at yahoo.com Thu Jul 16 04:50:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 16 Jul 2009 01:50:35 -0700 (PDT) Subject: [Biopython-dev] Calculating motif scores Message-ID: <865356.89579.qm@web62406.mail.re1.yahoo.com> Hi everybody, I was looking for a way to calculate the position-weight matrix score for a given sequence. Motif.score_hit(sequence,position,normalized=0,masked=0) in Bio/Motif/_Motif.py does what I need, but it calculates the score at only one position. For speed reasons, I am looking for a function that can calculate the scores at all positions in a sequence. Something like score(pwm, sequence) returning a Numerical Python array of length len(sequence) - len(pwm) + 1, with the "score" function implemented in a C extension. Perhaps the position-weight matrix should be its own class, with "score" as one of its methods. Is there perhaps some other function that I can use for this? If not, I can contribute a C extension implementing this functionality. If so, are there any preferences on how this should be integrated with Bio.Motif? 
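The score(pwm, sequence) interface described above can be sketched in pure Python as follows. This only illustrates the sliding-window calculation; it is not the proposed C extension or the Bio.Motif API. The function name score_all_positions and the list-of-dicts matrix layout are assumptions made for the sketch, and a plain list stands in for the Numerical Python array.

```python
def score_all_positions(pwm, sequence):
    """Return one PWM score per start position in the sequence.

    pwm is assumed to be a list of dicts, one per motif column, mapping a
    letter to its weight (an illustrative layout, not Bio.Motif's own).
    The result has len(sequence) - len(pwm) + 1 entries, or none at all
    if the sequence is shorter than the motif.
    """
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for offset, column in enumerate(pwm):
            total += column[sequence[start + offset]]
        scores.append(total)
    return scores


# Toy two-column matrix that favours "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]
scores = score_all_positions(toy_pwm, "ACGAC")
print(scores)
```

For the five-letter toy sequence and two-column matrix this gives 5 - 2 + 1 = 4 scores, one per window.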
--Michiel From bartek at rezolwenta.eu.org Thu Jul 16 07:32:34 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Jul 2009 13:32:34 +0200 Subject: [Biopython-dev] Calculating motif scores In-Reply-To: <865356.89579.qm@web62406.mail.re1.yahoo.com> References: <865356.89579.qm@web62406.mail.re1.yahoo.com> Message-ID: <8b34ec180907160432j6647a8e3u20054b2f7781b978@mail.gmail.com> On Thu, Jul 16, 2009 at 10:50 AM, Michiel de Hoon wrote: > > Hi everybody, Hi > > I was looking for a way to calculate the position-weight matrix score for a given sequence. Motif.score_hit(sequence,position,normalized=0,masked=0) in Bio/Motif/_Motif.py does what I need, but it calculates the score at only one position. For speed reasons, I am looking for a function that can calculate the scores at all positions in a sequence. Something like > > score(pwm, sequence) > > returning a Numerical Python array of length len(sequence) - len(pwm) + 1, with the "score" function implemented in a C extension. Perhaps the position-weight matrix should be its own class, with "score" as one of its methods. > > Is there perhaps some other function that I can use for this? The function you are looking for is called search_pwm: search_pwm(self, sequence, normalized=0, masked=0, threshold=0.0, both=True) a generator function, returning found hits in a given sequence with the pwm score higher than the threshold > If not, I can contribute a C extension implementing this functionality. If so, are there any preferences on how this should be integrated with Bio.Motif? As you can see, the current function is a generator rather than returning a full array, because of the memory issues with searching large sequences for a few cases of a good motif. If you set the threshold to (-inf) you should get the results for all positions. Nonetheless, if you have a function in C doing just that, we could incorporate it into Biopython, for fast exhaustive searches on shorter sequences. 
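To illustrate the generator behaviour described here, a minimal sketch follows. It is not the Bio.Motif search_pwm implementation: the name iter_hits and the list-of-dicts matrix are hypothetical, and the normalized, masked and both options are left out.

```python
def iter_hits(pwm, sequence, threshold=0.0):
    """Yield (position, score) for each window scoring above threshold.

    Hypothetical stand-in for a thresholded PWM search. Because it is a
    generator, only one score is held at a time rather than a full array.
    """
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        score = sum(column[sequence[start + i]] for i, column in enumerate(pwm))
        if score > threshold:
            yield start, score


# Toy two-column matrix that favours "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]

# Default threshold: only the good windows are reported.
good_hits = list(iter_hits(toy_pwm, "ACGAC"))

# A threshold of minus infinity reports every position.
all_hits = list(iter_hits(toy_pwm, "ACGAC", threshold=float("-inf")))
print(good_hits)
print(len(all_hits))
```

The trade-off is the one stated above: the generator keeps memory use low when only a few hits matter, while an exhaustive scan produces one value per position.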
cheers Bartek From mjldehoon at yahoo.com Thu Jul 16 22:25:22 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 16 Jul 2009 19:25:22 -0700 (PDT) Subject: [Biopython-dev] Calculating motif scores Message-ID: <572083.29767.qm@web62405.mail.re1.yahoo.com> > The function you are looking for is called search_pwm: > > search_pwm(self, sequence, normalized=0, masked=0, > threshold=0.0, both=True) > a generator function, returning found hits in a given > sequence with the pwm score higher than the threshold OK, that comes close to what I had in mind. > Nonetheless, if you have a function in c doing just that, > we could incorporate it into biopython, for fast exhaustive > searches on shorter sequences. It doesn't have to be so short. I've been running these calculations for whole mammalian chromosomes. For the human chromosome 1, this would take 247249719 * 4 bytes = 943 MB to store the scores in a Numerical Python array. This can still be comfortably handled by today's computers. I'll upload a C version to CVS so you guys can have a look and try it out. How would you feel about having a separate PWM class in Bio.Motif? Some of the stuff currently in the class Motif is actually more about the PWM by itself; it may make sense to separate that out. --Michiel. From bugzilla-daemon at portal.open-bio.org Fri Jul 17 09:12:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 17 Jul 2009 09:12:09 -0400 Subject: [Biopython-dev] [Bug 2880] New: Two unit tests issues in 1.51b (t-coffee and mafft) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2880 Summary: Two unit tests issues in 1.51b (t-coffee and mafft) Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Unit Tests AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz biopython-1.51b # python setup.py test running test test_Ace ... 
ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... ok test_Cluster ... ok test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet ok test_Emboss ... ok test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_Fasta2 ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... ok test_GASelection ... ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... /var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/build/lib.linux-i686-2.6/Bio/Translate.py:23: DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and will be removed in a future release of Biopython. Please use the functions or object methods defined in Bio.Seq instead (described in the tutorial). If you want to continue to use this code, please get in contact with the Biopython developers via the mailing lists to avoid its permanent removal from Biopython. DeprecationWarning) ok test_GenBank ... ok test_GenomeDiagram ... ok test_GraphicsChromosome ... ok test_GraphicsDistribution ... ok test_GraphicsGeneral ... ok test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... ok test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... 
ok test_LogisticRegression ... ok test_MEME ... ok test_Mafft_tool ... FAIL test_MarkovModel ... ok test_Medline ... ok test_Motif ... ok test_Muscle_tool ... ok test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... ok test_PDB_unit ... ok test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... ok test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... ok test_SeqIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... Probably t-coffee is waiting for some data on its stdin. root 2987 6482 1 11:36 pts/8 00:03:17 python setup.py test root 20102 2987 0 11:45 pts/8 00:00:00 sh -c { t_coffee; } 2>&1 root 20121 20102 0 11:45 pts/8 00:00:00 t_coffee Further note that test_Mafft_tool failed as well. $ mafft checking nawk checking gawk prog=/usr/bin/gawk --------------------------------------------------------------------- MAFFT v6.240 (2007/04/04) Copyright (c) 2006 Kazutaka Katoh NAR 30:3059-3066, NAR 33:511-518 http://align.bmr.kyushu-u.ac.jp/mafft/software/ --------------------------------------------------------------------- Input file? (fasta format) @ Input file? (fasta format) @ quit quit: No such file. Input file? 
(fasta format)
@ exit
exit: No such file.
Input file? (fasta format)
@
Input file? (fasta format)
@ x
x: No such file.
Input file? (fasta format)
@ ^C
$

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From winda002 at student.otago.ac.nz Sat Jul 18 04:17:02 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Sat, 18 Jul 2009 20:17:02 +1200
Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen
Message-ID: <20090718201702.408455fp9qeau1ha@www.studentmail.otago.ac.nz>

Hi again Tiago,

Sorry about falling off the grid before I could get back to you about this.

Tiago Antão wrote:
>> I've uploaded my Arlequin classes and
>> functions to a branch on github so you can see them (/Bio/PopGen/Arlequin/
>> on http://github.com/dwinter/biopython/tree/arleq-branch)
>
> This is great, I took your code and created a new version (nothing
> more than also an initial sketch - Feel free to disagree/propose
> changes), you can find it here:
> http://github.com/tiagoantao/biopython/tree/arlequin

Yeah, all the changes you talk about seem sensible to me.

> OK, somebody has to do a parser to actually read the files in ;) .
> Which is the biggest piece of work to be done. I don't mind doing it
> (like in the next month or so - I have some free time now), but you
> can do it if you want. In case you decide to do it, I have just one
> major point to note: making a parser that is able to read big files
> (i.e., some files cannot be parsed into memory in one go). I made this
> mistake with the genepop parser and some people do complain about it.
> Some things cannot be read as lists to memory but have to be read as
> iterators (issue 3 above).
> I think a parser that is able to handle lots of files is also good to
> help in building a sound model to represent an arlequin record.
>
> As usual we will need test code and documentation for all this ;)

This is where I have to admit to not having the time or the skills to do this justice. I'm happy to provide what help I can (especially with the docs and tests, which are probably closer to my skill-set) but just couldn't promise to do the bulk of the work.

There might also be another option; a bit of searching on github found this:

http://github.com/ryanraaum/oldowan.arlequin/tree/master

Open (MIT license) code for dealing with Arlequin in Python. I'll contact the author and ask if he is interested in contributing (it can't hurt to ask, right?)

Cheers,
David

From bugzilla-daemon at portal.open-bio.org Sat Jul 18 07:37:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Jul 2009 07:37:46 -0400
Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft)
In-Reply-To:
Message-ID: <200907181137.n6IBbkhD025712@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2880

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 07:37 EST -------
Could you tell us the version numbers of t-coffee and mafft you have installed?

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Jul 18 12:07:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Jul 2009 12:07:19 -0400
Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft)
In-Reply-To:
Message-ID: <200907181607.n6IG7JnE000703@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2880

------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2009-07-18 12:07 EST -------
$ t_coffee
PROGRAM: T-COFFEE (Version_7.54)
[cut]

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sat Jul 18 14:21:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Jul 2009 14:21:09 -0400
Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft)
In-Reply-To:
Message-ID: <200907181821.n6IIL9if004943@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2880

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 14:21 EST -------
It looks like the T-Coffee test just hung? Could you try this in the Tests directory to confirm this:

python run_tests.py test_TCoffee_tool.py

Could you also try just running T-Coffee directly:

t_coffee

On my machine this prints out some stuff, and finishes. This is what seems to be hanging on your machine... I'm thinking that instead of calling "t_coffee" we could instead use "t_coffee -version" which finishes much more quickly. So could you also try:

t_coffee -version

I was using T-Coffee 7.81 on Linux and things worked. Even this is out of date, so I tried the latest version too, 7.97, and again it all looks fine.

-------------

Regarding MAFFT, what actually fails when you do this?:

run_tests.py test_Mafft_tool.py

I note you have MAFFT v6.240.
I have MAFFT v6.626b and the test passes. Again, this is also out of date. Could you try updating your copy of MAFFT? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 18 14:41:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 14:41:38 -0400 Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft) In-Reply-To: Message-ID: <200907181841.n6IIfc03005430@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 ------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2009-07-18 14:41 EST ------- # python run_tests.py test_TCoffee_tool.py test_TCoffee_tool ... ok ---------------------------------------------------------------------- Ran 1 test in 3.627 seconds # t_coffee PROGRAM: T-COFFEE (Version_7.54) -full_log S [0] -run_name S [0] -mem_mode S [0] mem -extend D [1] 1 -extend_mode S [0] very_fast_triplet -max_n_pair D [0] 10 -seq_name_for_quadruplet S [0] all -compact S [0] default [cut] # t_coffee -version PROGRAM: T-COFFEE (Version_7.54) # python run_tests.py test_Mafft_tool.py test_Mafft_tool ... FAIL ====================================================================== FAIL: Simple round-trip through app with clustal output ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/Tests/test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) AssertionError ---------------------------------------------------------------------- Ran 1 test in 1.598 seconds FAILED (failures = 1) # python setup.py test [cut] test_Seq_objs ... ok test_SubsMat ... 
ok test_SwissProt ... ok test_TCoffee_tool ... ok test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... ok test_lowess ... ok test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... ok Bio.Statistics.lowess docstring test ... ok ====================================================================== FAIL: Simple round-trip through app with clustal output ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) AssertionError ---------------------------------------------------------------------- Ran 123 tests in 342.671 seconds FAILED (failures = 1) # So this time t_coffee test passed, sorry for the noise. I will try to find time next week to upgrade mafft. Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
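The hang discussed in this bug (a command-line tool waiting for data on its stdin) and the suggested "t_coffee -version" probe can be illustrated with a small sketch. This is not the code used in the Biopython test suite; the helper name probe_tool and the 10 second default timeout are assumptions made here for illustration, and it is written for current Python 3 rather than the Python 2.6 seen in the logs.

```python
import subprocess
import sys


def probe_tool(cmd, timeout=10):
    """Run cmd with stdin closed and return its combined stdout/stderr text.

    Returns None if the program is not installed or does not finish
    within `timeout` seconds (hypothetical helper, for illustration only).
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # the tool can never block waiting for input
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # some tools print their banner on stderr
            timeout=timeout,            # a hung child is killed, not waited on forever
            universal_newlines=True,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout
```

A test module could then call probe_tool(["t_coffee", "-version"]) and skip the test when the result is None, rather than letting the whole test run stall.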
From bugzilla-daemon at portal.open-bio.org Sat Jul 18 15:35:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 15:35:32 -0400 Subject: [Biopython-dev] [Bug 2880] test_Mafft_tool.py unit test failure In-Reply-To: Message-ID: <200907181935.n6IJZWMB007312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Two unit tests issues in |test_Mafft_tool.py unit test |1.51b (t-coffee and mafft) |failure ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 15:35 EST ------- (In reply to comment #4) > # python run_tests.py test_TCoffee_tool.py > test_TCoffee_tool ... ok > ---------------------------------------------------------------------- > Ran 1 test in 3.627 seconds > > # t_coffee > > PROGRAM: T-COFFEE (Version_7.54) > -full_log S [0] > ... > [cut] > # t_coffee -version > PROGRAM: T-COFFEE (Version_7.54) OK - that all looks as I would hope. > # python run_tests.py test_Mafft_tool.py > test_Mafft_tool ... FAIL > ====================================================================== > FAIL: Simple round-trip through app with clustal output > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/Tests/test_Mafft_tool.py", > line 78, in test_Mafft_with_Clustalw_output > self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) > AssertionError > > ---------------------------------------------------------------------- > Ran 1 test in 1.598 seconds > > FAILED (failures = 1) > # python setup.py test > [cut] > test_Seq_objs ... ok > test_SubsMat ... ok > test_SwissProt ... ok > test_TCoffee_tool ... ok > test_UniGene ... ok > test_UniGene_obsolete ... 
ok > test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. > test_align ... ok > test_geo ... ok > test_interpro ... ok > test_kNN ... ok > test_lowess ... ok > test_pairwise2 ... ok > test_prodoc ... ok > test_property_manager ... ok > test_prosite ... ok > test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. > test_seq ... ok > test_translate ... ok > test_trie ... ok > test_triefind ... ok > Bio.Seq docstring test ... ok > Bio.SeqRecord docstring test ... ok > Bio.SeqIO docstring test ... ok > Bio.SeqIO.QualityIO docstring test ... ok > Bio.SeqIO.AceIO docstring test ... ok > Bio.SeqUtils docstring test ... ok > Bio.Align.Generic docstring test ... ok > Bio.AlignIO docstring test ... ok > Bio.AlignIO.StockholmIO docstring test ... ok > Bio.Application docstring test ... ok > Bio.KEGG.Compound docstring test ... ok > Bio.KEGG.Enzyme docstring test ... ok > Bio.Wise docstring test ... ok > Bio.Wise.psw docstring test ... ok > Bio.Motif docstring test ... ok > Bio.Statistics.lowess docstring test ... ok > ====================================================================== > FAIL: Simple round-trip through app with clustal output > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output > self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) > AssertionError > > ---------------------------------------------------------------------- > Ran 123 tests in 342.671 seconds > > FAILED (failures = 1) > # > > So this time t_coffee test passed, sorry for the noise. I will try > to find time next week to upgrade mafft. Thanks. I've retitled the bug to focus on the MAFFT issue. This may well be a problem with your old version of MAFFT - I know for example the the FASTA output is broken on some versions of MAFFT. 
Thanks,

Peter

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From eric.talevich at gmail.com Mon Jul 20 10:57:15 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 20 Jul 2009 10:57:15 -0400
Subject: [Biopython-dev] GSoC Weekly Update 9: PhyloXML for Biopython
Message-ID: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com>

Hi all,

Previously (July 13-17) I:

- Implemented "Collapse Whitespace Policy" -- the spec mentions this in the glossary but doesn't appear to say where it should be used, so I applied it willy-nilly. (Mainly on 'name' and 'desc'/'description' node text.)
- Made Writer use the normal namespace prefixes -- for human-readability, though it technically doesn't matter for parsing.
- Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it failed, probably due to element ordering.
- Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are all under Bio.Tree now, while TreeIO contains just a thin wrapper for Parser and Writer (still under Bio.PhyloXML). Three mostly empty base classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit from them. This made it possible to generalize the Utils.pretty_print function and move it to Bio.Tree.Utils. The other "utility", for dumping xml tag names, was added to PhyloXML's Parser near the other xml-related helpers.
- Checked that 'other' objects won't belong to the phyloXML namespace.
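For readers unfamiliar with the "collapse whitespace" idea in the first item above, here is a minimal sketch, assuming the term is meant in the XML Schema sense (tabs and line breaks become spaces, runs of spaces shrink to one, leading and trailing spaces are dropped). The function name collapse_whitespace is made up for this illustration and is not taken from the PhyloXML branch.

```python
import re

# XML Schema "collapse" only concerns space, tab, line feed and carriage return.
_XML_WHITESPACE_RUN = re.compile(r"[ \t\n\r]+")


def collapse_whitespace(text):
    """Collapse XML whitespace in text (illustrative helper, not Biopython API)."""
    # Replace every run of XML whitespace with a single space,
    # then remove any space left at either end.
    return _XML_WHITESPACE_RUN.sub(" ", text).strip(" ")
```

Applied to node text such as a 'name' or 'description' value that was wrapped and indented in the XML file, this yields a single clean line.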
This week (July 20-24) I will: Extend the core to the rest of the spec: - Adding unit tests and classes to support the remaining (non-core) phyloXML elements - Use the schema document to validate the input file -- or at least, make Writer use the correct sub-node ordering - Take a stab at phyloXML 1.10 support Work on documentation: - Address remaining comments from code/doc review - Revisit docstrings for all classes, functions, methods; consider enabling epydoc formatting Also: - Improve the SeqRecord conversion - Warnings: show the offending line at the previous level in the stack Remarks: I haven't done anything specifically for Nexus integration, though I'm looking at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree classes. I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes will mirror PhyloDB tables, and any methods from PhyloXML trees that only rely on those attributes will be moved to the base classes. Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML, and most of the base-class methods will operate on that attribute. Options: 1. Create two properties on PhyloXML's Clade and Phylogeny classes, called 'clade' and 'clades', that simply access the object's 'node' attribute. 2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The I/O functions currently treat tag_name<->attribute as the general case, with exceptions like pluralization scattered in, so making this change will be unpretty but not horrible. Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jul 20 13:57:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 18:57:59 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? 
Message-ID: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com>

Hi all at Biopython (and EMBOSS-dev CC'd),

Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. As I mentioned on the Biopython mailing list a week ago, in particular I'd like to make sure we agree on the various FASTQ variants. I'm waiting for EMBOSS to update the documentation on their website, but as I recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test this afternoon, they are using:

fastq - FASTQ where the qualities are ignored (useful for input?)
fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33
fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64
fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64

I was expecting "fastq" to be an EMBOSS input only format given how I had understood this to be interpreted (ignore the qualities). This makes sense for tasks like FASTQ to FASTA where the qualities can be ignored. I was however surprised that using "fastq" as an output format in EMBOSS seqret gives quality strings of double quote characters. This ASCII character (34) is outside the range used in the Solexa and Illumina 1.3+ FASTQ variants. If interpreted as a Sanger style FASTQ file this means a PHRED quality of one (meaning about random, a sensible default).

Enough background. The reason for this email was that (subject to confirmation), Biopython's "fastq" matches EMBOSS's "fastq-sanger", so I'd like to consider adding this as an alias in Bio.SeqIO. I resisted adding aliases initially, but we now have "gb" for "genbank" to make working with Entrez a little easier, so there is a precedent. In this case, it will make some of the test_Emboss.py code cleaner if I can just use "fastq-sanger" everywhere and have both Biopython and EMBOSS understand this.
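The offsets listed above can be made concrete with a short sketch. It is only an illustration of the arithmetic, not the Bio.SeqIO implementation; the names OFFSETS, decode_qualities and solexa_to_phred are assumptions made here. It shows why a double quote (ASCII 34) decodes to PHRED 1 under the Sanger convention but falls far below the usual range under the offset 64 variants, and how a Solexa score maps onto the PHRED scale.

```python
import math

# ASCII offsets for the three quality-carrying variants named in the message.
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores, offset 33
    "fastq-solexa": 64,    # Solexa scores, offset 64
    "fastq-illumina": 64,  # PHRED scores, offset 64
}


def decode_qualities(quality_string, variant):
    """Turn a FASTQ quality string into integer scores for the given variant."""
    offset = OFFSETS[variant]
    return [ord(char) - offset for char in quality_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (a float)."""
    # Solexa scores are log-odds, PHRED scores are log-probabilities:
    # Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)
```

For example, decode_qualities('"', "fastq-sanger") gives a single score of 1, while the same character read as "fastq-illumina" gives -30, which is why such a file only makes sense as Sanger style FASTQ.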
Peter From matzke at berkeley.edu Mon Jul 20 15:13:59 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 20 Jul 2009 12:13:59 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5CD7C8.70009@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> Message-ID: <4A64C1F7.5040503@berkeley.edu> Hi all, here is my weekly update... 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! 2. Code refactoring: this is basically the layout I've got going at the moment. (long outline & function descriptions below) 3. GbifXml is working, my next task is the TreeSum class which requires re-doing the functions which made use of the lagrange tree class. I've built these functions under several different tree classes since January and have gotten pretty good at tree logic so this shouldn't be too hard. 4. Philosophy question: If I build some functions that do something new with an e.g. ElementTree (XML tree) object, should I: (a) make these functions go in a subclass of the class for the original object (thus inheriting the methods of the original class, and basically adding new methods). E.g. basically extending the methods of ElementTree, with a subclass GbifElementTree; or: (b) make a class containing the object as an attribute, with e.g. GbifXml.xmltree containing an ElementTree attribute which then gets passed to the various functions. 
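The two options in the question above can be put side by side in a short sketch. Everything here is illustrative: the method name find_first and the sample XML string are invented for the example (only the TaxonOccurrence tag and gbifKey attribute come from the outline in this message), and neither class is the actual BioGeography code.

```python
import xml.etree.ElementTree as ET


class GbifElementTree(ET.ElementTree):
    """Option (a): subclass ElementTree and add new methods to it."""

    def find_first(self, tag):
        # Inherited iter() walks the tree in document order.
        for element in self.iter(tag):
            return element
        return None


class GbifXml(object):
    """Option (b): hold an ElementTree as an attribute and delegate to it."""

    def __init__(self, xmltree=None):
        self.xmltree = xmltree

    def find_first(self, tag):
        for element in self.xmltree.iter(tag):
            return element
        return None


# Hypothetical miniature record set, only to exercise both classes.
SAMPLE = (
    "<records>"
    "<TaxonOccurrence gbifKey='1'/>"
    "<TaxonOccurrence gbifKey='2'/>"
    "</records>"
)
root = ET.fromstring(SAMPLE)

subclassed = GbifElementTree(root)            # option (a)
wrapped = GbifXml(ET.ElementTree(root))       # option (b)
```

With (a) every ElementTree method (iter, find, write, and so on) is available directly on the object; with (b) callers reach those through the xmltree attribute, but the wrapper is free to change what it stores internally later on.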
I currently have (b) but the more I think about it, the more (a) makes more sense from a simplicity/usability/maintainability sense. Cheers! Nick ========== Class for accessing GBIF, downloading records, processing them, and extracting information from the xmltree in that class. class GbifXmlError(Exception): pass class GbifXml(): gbifxml is a class for holding and processing xmltrees of GBIF records. def __init__(self, xmltree=None): This is an instantiation class for setting up new objects of this class. def print_xmltree(self): Prints all the elements & subelements of the xmltree to screen (may require fix_ASCII to input file to succeed) def print_subelements(self, element): Takes an element from an XML tree and prints the subelements tag & text, and the within-tag items (key/value or whatnot) def element_items_to_dictionary(self, element_items): If the XML tree element has items encoded in the tag, e.g. key/value or whatever, this function puts them in a python dictionary and returns them. def extract_latlongs(self, element): Create a temporary pseudofile, extract lat longs to it, return results as string. Inspired by: http://www.skymind.com/~ocrow/python_string/ (Method 5: Write to a pseudo file) def extract_latlong_datum(self, element, file_str): Searches an element in an XML tree for lat/long information, and the complete name. Searches recursively, if there are subelements. def extract_taxonconceptkeys_tofile(self, element, outfh): Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete sname. Searches recursively, if there are subelements. Returns file at outfh. def extract_taxonconceptkeys_tolist(self, element, output_list): Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete name. Searches recursively, if there are subelements. Returns list. 
def extract_occurrence_elements(self, element, output_list): Returns a list of the elements, picking elements by TaxonOccurrence; this should return a list of elements equal to the number of hits. def find_to_elements_w_ancs(self, el_tag, anc_el_tag): Burrow into XML to get an element with tag el_tag, return only those el_tags underneath a particular parent element parent_el_tag def create_sub_xmltree(self, element): Create a subset xmltree (to avoid going back to irrelevant parents) def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, match_el_list): Recursively burrows down to find whatever elements with el_tag exist inside a parent_el_tag. def xml_burrow_up(self, element, anc_el_tag, found_anc): Burrow up xml to find anc_el_tag def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): Burrow up from element of interest, until a cousin is found with cousin_el_tag def return_parent_in_xmltree(self, child_to_search_for): Search through an xmltree to get the parent of child_to_search_for def return_parent_in_element(self, potential_parent, child_to_search_for, returned_parent): Search through an XML element to return parent of child_to_search_for def find_1st_matching_element(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag # Functions devoted to accessing/downloading GBIF records def access_gbif(url, params): # Helper function to access various GBIF services # # choose the URL ("url") from here: # http://data.gbif.org/ws/rest/occurrence # # params are a dictionary of key/value pairs # # "_open" is from Bio.Entrez._open, online here: # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open # # Get the handle of results # (looks like e.g.: > ) # (open with results_handle.read() ) def get_hits(params): Get the actual hits that are be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 
1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). def get_xml_hits(params): Returns hits like get_hits, but returns a parsed XML tree. def get_all_records_by_increment(params, inc, prefix_fn): Download all of the records in stages, store in list of elements. Increments of e.g. 100 to not overload server def get_record(key): Get a single record, return xmltree for it. def get_numhits(params): Get the number of hits that will be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records) It will return the LAST non-none instance (in a standard search result there should be only one, anyway). def extract_numhits(element): # Search an element of a parsed XML string and find the # number of hits, if it exists. Recursively searches, # if there are subelements. # def xmlstring_to_xmltree(xmlstring): Take the text string returned by GBIF and parse to an XML tree using ElementTree. Requires the intermediate step of saving to a temporary file (required to make ElementTree.parse work, apparently) class TreeSum() Summary statistics on trees (some of these now redundant with Nexus.Tree & will be eliminated. def read_ultrametric_Newick(newickstr): Read a Newick file into a tree object (a series of node objects links to parent and daughter nodes), also reading node ages and node labels if any. def list_leaves(phylo_obj): Print out all of the leaves in above a node object def treelength(node): Gets the total branchlength above a given node by recursively adding through tree. def phylodistance(node1, node2): Get the phylogenetic distance (branch length) between two nodes. def get_distance_matrix(phylo_obj): Get a matrix of all of the pairwise distances between the tips of a tree. 
def get_mrca_array(phylo_obj): Get a square list of lists (array) listing the mrca of each pair of leaves (half-diagonal matrix) def subset_tree(phylo_obj, list_to_keep): Given a list of tips and a tree, remove all other tips and resulting redundant nodes to produce a new smaller tree. def prune_single_desc_nodes(node): Follow a tree from the bottom up, pruning any nodes with only one descendent def find_new_root(node): Search up tree from root and make new root at first divergence def make_None_list_array(xdim, ydim): Make a list of lists ("array") with the specified dimensions def get_PD_to_mrca(node, mrca, PD): Add up the phylogenetic distance from a node to the specified ancestor (mrca). Find mrca with find_1st_match. def get_ancestors_list(node, anc_list): Get the list of ancestors of a given node def addup_PD(node, PD): Adds the branchlength of the current node to the total PD measure. def print_tree_outline_format(phylo_obj): Prints the tree out in "outline" format (daughter clades are indented, etc.) def print_Node(node, rank): Prints the node in question, and recursively all daughter nodes, maintaining rank as it goes. class Ranges(): Geographic range of a species (collection of points, results of classification of those points into regions), GIS-like functions for processing them. class Points(): geographic locations of individual collected specimens def readshpfile(fn): def summarize_shapefile(fn, output_option, outfn): def point_inside_polygon(x,y,poly): def shapefile_points_in_poly(pt_records, poly): def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly): ========== Here is a summary of the Nick Matzke wrote: > Thanks for the fix!!! A big help. 
I am currently organizing my > functions into several classes and making sure they work, basically the > classes look like they will be something like: > > ========== > GbifXml -- for processing GBIF XML results (all of the functions for > searching/extracting stuff from xmltree structures) > > TreeSum -- for processing trees & getting summary statistics etc. > > Ranges -- Geographic range of a species (collection of points, results > of classification of those points into regions), GIS-like functions for > processing them > Points -- geographic locations of individual collected specimens > ========== > > > Brad Chapman wrote: >> Hi Nick; >> Thanks for the comprehensive update. It sounds like your discussion >> with Eric resolved most of the questions about the tree >> representation. It's great to see y'all converging on this. >> >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the >>> solution for now, I will reorganize my code accordingly based on >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a >>> function stripping out node labels. Also I have not forgotten >>> previous comments from Brad et al. about bringing the other code up >>> to specs. So I will update the BioGeography schedule and overall >>> organization I hope to have at the end (with classes/methods etc., >>> instead of just a list-o-functions, which is how my original schedule >>> was explicitly laid out), and post an update when done. >> >> Agreed, and seconding Hilmar that the best thing about open source >> code is having others looking at your code. Conversely, feel free to >> dig in and fix current code where it is holding you up. To remove >> this blocking issue on Nexus and get us rolling again, I >> put together an initial fix. You can grab the patch from: >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 >> >> Let us know if this works for your files of interest. 
>> >> If this clears up the Nexus issue, it would be great to see the >> revised schedule incorporating the refactoring. Sounds like we are >> moving in the right direction. Good stuff. >> >> Thanks, >> Brad >> > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Mon Jul 20 15:48:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 20:48:44 +0100 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) Message-ID: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: > > Hi all, here is my weekly update... > > 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! > Cool. 
I haven't tried it personally though ;) Frank and/or Cymon - any comments regarding Brad checking this in? See Bug 2788 for details.

I gather you (Nick) are using this on ultrametric Newick trees - could you supply a sensibly sized example to use as a unit test? Initially this can be just to test Brad's patch to Bio.Nexus, but try and pick something you can build documentation examples around in future.

Thanks,

Peter

From matzke at berkeley.edu Mon Jul 20 15:56:18 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 20 Jul 2009 12:56:18 -0700
Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788)
In-Reply-To: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com>
References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com>
Message-ID: <4A64CBE2.9050605@berkeley.edu>

Peter wrote:
> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote:
>> Hi all, here is my weekly update...
>>
>> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!!
>>
>
> Cool. I haven't tried it personally though ;) Frank and/or Cymon - any
> comments regarding Brad checking this in? See Bug 2788 for details.
>
> I gather you (Nick) are using this on ultrametric Newick trees

yes

> - could
> you supply a sensibly sized example to use as a unit test? Initially
> this can be just to test Brad's patch to Bio.Nexus, but try and pick
> something you can build documentation examples around in future.

This is probably a reasonable size, a subset of my bigger tree. (Just gymnosperms; times are in millions of years.)
(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;

I am using this as a test case in my own script. I have put some effort
into reading up on unittest, but I don't quite get how it all fits
together yet.

Another important case I will try to come up with is the result of
pruning a tree; in my experience it is very easy to mess up the tree
and/or branch lengths when pruning.

>
> Thanks,
>
> Peter
>

-- 
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299 Dept.
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From fkauff at biologie.uni-kl.de Tue Jul 21 02:32:02 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 21 Jul 2009 08:32:02 +0200 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> Message-ID: <4A6560E2.4030502@biologie.uni-kl.de> Hi all, Peter wrote: > On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: > >> Hi all, here is my weekly update... >> >> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >> >> > > Cool. I haven't tried it personally though ;) Frank and/or Cymon - any > comments regarding Brad checking this in? See Bug 2788 for details. > > Not at all - you're most welcome. Thanks for dealing with it. Frank > I gather you (Nick) are using this on ultrametric Newick trees - could > you supply a sensibly sized example to use as a unit test? Initially > this can be just to test Brad's patch to Bio.Nexus, but try and pick > something you can build documentation examples around in future. 
> > Thanks, > > Peter > > From biopython at maubp.freeserve.co.uk Tue Jul 21 07:32:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 12:32:59 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907210432h26da39b2ka24ceb1194a1be1a@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). This > makes sense for tasks like FASTQ to FASTQ where the qualities can > be ignored. I meant of course, for FASTQ to FASTA conversion the qualities (and how they are encoded, Sanger versus Solexa versus Illumina 1.3+) can be ignored. Peter From chapmanb at 50mail.com Tue Jul 21 08:22:13 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Jul 2009 08:22:13 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A64C1F7.5040503@berkeley.edu> References: <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <20090721122213.GA96870@sobchak.mgh.harvard.edu> Hi Nick; > 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! Sweet. Glad to hear it. > 2. Code refactoring: this is basically the layout I've got going at the > moment. (long outline & function descriptions below) Is this checked in on GitHub? 
I pulled from the Geography branch but didn't get the new code.

The organization below looks great and really helps with clarity.
One additional suggestion I would make is to prefix functions and
classes which are not part of the public API with an underscore
(_internal_function). Just from the descriptions, I imagine some of
the functions like xml_burrow_up_cousin would not be called directly
by users.

> 3. GbifXml is working, my next task is the TreeSum class which requires
> re-doing the functions which made use of the lagrange tree class. I've
> built these functions under several different tree classes since January
> and have gotten pretty good at tree logic so this shouldn't be too hard.

Great. Have you had a look at Eric's generic Tree proposal, which he
was working on this week:

http://github.com/etal/biopython/tree/phyloxml/Bio/Tree

It would be great to propose general functionality there so it can
be rolled into PhyloXML and ultimately Nexus parsing as well.

> 4. Philosophy question: If I build some functions that do something new
> with an e.g. ElementTree (XML tree) object, should I:
>
> (a) make these functions go in a subclass of the class for the original
> object (thus inheriting the methods of the original class, and basically
> adding new methods). E.g. basically extending the methods of
> ElementTree, with a subclass GbifElementTree; or:
>
> (b) make a class containing the object as an attribute, with e.g.
> GbifXml.xmltree containing an ElementTree attribute which then gets
> passed to the various functions.
>
> I currently have (b) but the more I think about it, the more (a) makes
> more sense from a simplicity/usability/maintainability sense.

My vote would be for your (b) option. ElementTree is a pretty tricky
interface with overrides for attribute access, so inheriting from it
could be a bit tricky and more trouble than it's worth.
If you find yourself mirroring ElementTree functionality, you could always make the tree itself a public attribute and encourage users to call it directly. Brad > > Cheers! > Nick > > ========== > Class for accessing GBIF, downloading records, processing them, and > extracting information from the xmltree in that class. > > class GbifXmlError(Exception): pass > class GbifXml(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > > def __init__(self, xmltree=None): > > This is an instantiation class for setting up new objects of this > class. > > def print_xmltree(self): > > Prints all the elements & subelements of the xmltree to screen (may > require > fix_ASCII to input file to succeed) > > def print_subelements(self, element): > > Takes an element from an XML tree and prints the subelements tag & > text, and > the within-tag items (key/value or whatnot) > > > def element_items_to_dictionary(self, element_items): > > If the XML tree element has items encoded in the tag, e.g. key/value or > whatever, this function puts them in a python dictionary and returns > them. > > > > def extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > > > def extract_latlong_datum(self, element, file_str): > > Searches an element in an XML tree for lat/long information, and the > complete name. Searches recursively, if there are subelements. > > > > def extract_taxonconceptkeys_tofile(self, element, outfh): > > Searches an element in an XML tree for TaxonOccurrence gbifKeys, > and the complete sname. Searches recursively, if there are subelements. > Returns file at outfh. > > > > > def extract_taxonconceptkeys_tolist(self, element, output_list): > > Searches an element in an XML tree for TaxonOccurrence gbifKeys, > and the complete name. Searches recursively, if there are subelements. 
> Returns list. > > > > > > def extract_occurrence_elements(self, element, output_list): > > Returns a list of the elements, picking elements by > TaxonOccurrence; this should > return a list of elements equal to the number of hits. > > > > > def find_to_elements_w_ancs(self, el_tag, anc_el_tag): > > Burrow into XML to get an element with tag el_tag, return only > those el_tags underneath a particular parent element parent_el_tag > > > def create_sub_xmltree(self, element): > > Create a subset xmltree (to avoid going back to irrelevant parents) > > > > def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, > match_el_list): > > Recursively burrows down to find whatever elements with el_tag > exist inside a parent_el_tag. > > > def xml_burrow_up(self, element, anc_el_tag, found_anc): > > Burrow up xml to find anc_el_tag > > > > def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): > > Burrow up from element of interest, until a cousin is found with > cousin_el_tag > > > > def return_parent_in_xmltree(self, child_to_search_for): > > Search through an xmltree to get the parent of child_to_search_for > > > > def return_parent_in_element(self, potential_parent, > child_to_search_for, returned_parent): > > Search through an XML element to return parent of child_to_search_for > > > > def find_1st_matching_element(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag > > > > > # Functions devoted to accessing/downloading GBIF records > > def access_gbif(url, params): > > # Helper function to access various GBIF services > # > # choose the URL ("url") from here: > # http://data.gbif.org/ws/rest/occurrence > # > # params are a dictionary of key/value pairs > # > # "_open" is from Bio.Entrez._open, online here: > # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open > # > # Get the handle of results > # (looks like e.g.: > ) > > # (open with results_handle.read() ) > > 
> def get_hits(params): > > Get the actual hits that are be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > It will return the LAST non-none instance (in a standard search > result there > should be only one, anyway). > > > def get_xml_hits(params): > > Returns hits like get_hits, but returns a parsed XML tree. > > > def get_all_records_by_increment(params, inc, prefix_fn): > > Download all of the records in stages, store in list of elements. > Increments of e.g. 100 to not overload server > > def get_record(key): > > Get a single record, return xmltree for it. > > > def get_numhits(params): > > Get the number of hits that will be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > It will return the LAST non-none instance (in a standard search > result there > should be only one, anyway). > > def extract_numhits(element): > > # Search an element of a parsed XML string and find the > # number of hits, if it exists. Recursively searches, > # if there are subelements. > # > > def xmlstring_to_xmltree(xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file > (required to make > ElementTree.parse work, apparently) > > > > > class TreeSum() > > Summary statistics on trees (some of these now redundant with > Nexus.Tree & will be eliminated. > > def read_ultrametric_Newick(newickstr): > > Read a Newick file into a tree object (a series of node objects > links to parent and daughter nodes), also reading node ages and node > labels if any. > > > def list_leaves(phylo_obj): > > Print out all of the leaves in above a node object > > > > def treelength(node): > > Gets the total branchlength above a given node by recursively > adding through tree. 
> > > def phylodistance(node1, node2): > > Get the phylogenetic distance (branch length) between two nodes. > > > def get_distance_matrix(phylo_obj): > > Get a matrix of all of the pairwise distances between the tips of a > tree. > > > > def get_mrca_array(phylo_obj): > > Get a square list of lists (array) listing the mrca of each pair of > leaves > (half-diagonal matrix) > > > > def subset_tree(phylo_obj, list_to_keep): > > Given a list of tips and a tree, remove all other tips and > resulting redundant nodes to produce a new smaller tree. > > > def prune_single_desc_nodes(node): > > Follow a tree from the bottom up, pruning any nodes with only one > descendent > > > def find_new_root(node): > > Search up tree from root and make new root at first divergence > > > def make_None_list_array(xdim, ydim): > > Make a list of lists ("array") with the specified dimensions > > > def get_PD_to_mrca(node, mrca, PD): > > Add up the phylogenetic distance from a node to the specified > ancestor (mrca). Find mrca with find_1st_match. > > > > def get_ancestors_list(node, anc_list): > > Get the list of ancestors of a given node > > > > > def addup_PD(node, PD): > > Adds the branchlength of the current node to the total PD measure. > > > def print_tree_outline_format(phylo_obj): > > Prints the tree out in "outline" format (daughter clades are > indented, etc.) > > > def print_Node(node, rank): > > Prints the node in question, and recursively all daughter nodes, > maintaining rank as it goes. > > > > class Ranges(): > > Geographic range of a species (collection of points, results > of classification of those points into regions), GIS-like functions for > processing them. 
> > > class Points(): > > geographic locations of individual collected specimens > > > def readshpfile(fn): > > def summarize_shapefile(fn, output_option, outfn): > > def point_inside_polygon(x,y,poly): > > def shapefile_points_in_poly(pt_records, poly): > > def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly): > > ========== > > > Here is a summary of the > > Nick Matzke wrote: > > Thanks for the fix!!! A big help. I am currently organizing my > > functions into several classes and making sure they work, basically the > > classes look like they will be something like: > > > > ========== > > GbifXml -- for processing GBIF XML results (all of the functions for > > searching/extracting stuff from xmltree structures) > > > > TreeSum -- for processing trees & getting summary statistics etc. > > > > Ranges -- Geographic range of a species (collection of points, results > > of classification of those points into regions), GIS-like functions for > > processing them > > Points -- geographic locations of individual collected specimens > > ========== > > > > > > Brad Chapman wrote: > >> Hi Nick; > >> Thanks for the comprehensive update. It sounds like your discussion > >> with Eric resolved most of the questions about the tree > >> representation. It's great to see y'all converging on this. > >> > >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the > >>> solution for now, I will reorganize my code accordingly based on > >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a > >>> function stripping out node labels. Also I have not forgotten > >>> previous comments from Brad et al. about bringing the other code up > >>> to specs. So I will update the BioGeography schedule and overall > >>> organization I hope to have at the end (with classes/methods etc., > >>> instead of just a list-o-functions, which is how my original schedule > >>> was explicitly laid out), and post an update when done. 
> >> > >> Agreed, and seconding Hilmar that the best thing about open source > >> code is having others looking at your code. Conversely, feel free to > >> dig in and fix current code where it is holding you up. To remove > >> this blocking issue on Nexus and get us rolling again, I > >> put together an initial fix. You can grab the patch from: > >> > >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > >> > >> Let us know if this works for your files of interest. > >> > >> If this clears up the Nexus issue, it would be great to see the > >> revised schedule incorporating the refactoring. Sounds like we are > >> moving in the right direction. Good stuff. > >> > >> Thanks, > >> Brad > >> > > > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When people > thought the earth was spherical, they were wrong. But if you think that > thinking the earth is spherical is just as wrong as thinking the earth > is flat, then your view is wronger than both of them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > 14(1), 35-44. Fall 1989. 
> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== From chapmanb at 50mail.com Tue Jul 21 08:40:10 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Jul 2009 08:40:10 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> Message-ID: <20090721124010.GB96870@sobchak.mgh.harvard.edu> Hi Eric; Great stuff this week. I'm happy to see the generalized Tree interface coming together and appreciate you taking the time to look through PhyloDB for future compatibility with that. > - Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it > failed, probably due to element ordering. It would be nice to be able to pull off validation. I'm not a big stickler for XSD validation myself but have worked in the past with those who were and know that it can be a point of contention. Being able to cleanly validate will improve perception of the PhyloXML, and specifically the Biopython implementation. Hopefully that'll lead to greater use and adoption. > - Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are > all under Bio.Tree now, while TreeIO contains just a thin wrapper for > Parser and Writer (still under Bio.PhyloXML). Three mostly empty base > classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit > from them. This looks really nice -- thanks again. Do you think any of the functionality from the Nexus trees class would fit into here and be useful for examining PhyloXML trees? There is a whole ton of stuff there but a few that caught my eye beyond the total_branch_length function you had a skeleton for were: get_terminals, is_identical, common_ancestor, and distance. 
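
The kind of shared tree functionality listed above (terminals, total
branch length, common ancestor, distance) can be sketched in a few
lines. This is only an illustration under stated assumptions: the Node
class and the function bodies below are hypothetical stand-ins that
borrow the method names from the discussion, and they are not the
Bio.Nexus.Trees or Bio.Tree implementation. The toy data is the
Taxaceae corner of the gymnosperm tree Nick posted to the list.

```python
class Node(object):
    """Hypothetical minimal node: a label, a branch length, and children."""

    def __init__(self, name=None, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.parent = None
        self.children = list(children or [])
        for child in self.children:
            child.parent = self


def get_terminals(node):
    """Return the leaf nodes at or below `node`, left to right."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(get_terminals(child))
    return leaves


def total_branch_length(node):
    """Sum the branch lengths of every node strictly below `node`."""
    return sum(c.branch_length + total_branch_length(c) for c in node.children)


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def common_ancestor(a, b):
    """Return the most recent common ancestor of nodes `a` and `b`."""
    seen = set(id(n) for n in ancestors(a))
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")


def distance(a, b):
    """Sum the branch lengths along the path a -> MRCA -> b."""
    mrca = common_ancestor(a, b)
    total = 0.0
    for start in (a, b):
        n = start
        while n is not mrca:
            total += n.branch_length
            n = n.parent
    return total


# (Cephalotaxus:125,(Taxus:100,Torreya:100)TT1:25)Taxaceae:90
taxus = Node("Taxus", 100.0)
torreya = Node("Torreya", 100.0)
tt1 = Node("TT1", 25.0, [taxus, torreya])
cephalotaxus = Node("Cephalotaxus", 125.0)
taxaceae = Node("Taxaceae", 90.0, [cephalotaxus, tt1])
```

Because every function here only touches name, branch_length, parent
and children, the same code would work on any tree class exposing those
four attributes, which is the argument for putting such methods on a
shared base class.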
> I haven't done anything specifically for Nexus integration, though I'm > looking > at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree > classes. > I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes > will > mirror PhyloDB tables, and any methods from PhyloXML trees that only rely on > those attributes will be moved to the base classes. This sounds fine. If you want to dig into Nexus you are welcome, but certainly it's outside the scope of the proposal. > Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called > 'clade' in phyloXML, and most of the base-class methods will operate on that > attribute. Options: > > 1. Create two properties on PhyloXML's Clade and Phylogeny classes, > called > 'clade' and 'clades', that simply access the object's 'node' attribute. > > 2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The > I/O > functions currently treat tag_name<->attribute as the general case, with > exceptions like pluralization scattered in, so making this change will > be > unpretty but not horrible. I like option 1 -- make clade and clades references to the node/nodes attribute. I do prefer the node naming convention, but for the PhyloXML specific classes you should also be able to retrieve things with their clade nomenclature. Brad From cy at cymon.org Tue Jul 21 09:01:59 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 21 Jul 2009 14:01:59 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <20090721124010.GB96870@sobchak.mgh.harvard.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <20090721124010.GB96870@sobchak.mgh.harvard.edu> Message-ID: <7265d4f0907210601s35084ce2u77659ad909ea80fa@mail.gmail.com> Hi Eric, 2009/7/21 Brad Chapman > Hi Eric; ... > Do you think any of the > functionality from the Nexus trees class would fit into here and be > useful for examining PhyloXML trees? 
There is a whole ton of stuff > there but a few that caught my eye beyond the total_branch_length > function you had a skeleton for were: get_terminals, is_identical, > common_ancestor, and distance. > > > I haven't done anything specifically for Nexus integration, though I'm > > looking > > at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree > > classes. > You might also take a look at p4's tree representation and methods: http://code.google.com/p/p4-phylogenetics/source/browse/trunk/p4/Tree.py / Node.py / Tree_muck.py etc. Cheers, C. -- From biopython at maubp.freeserve.co.uk Tue Jul 21 09:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:05:35 +0100 Subject: [Biopython-dev] [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <4A65AF53.5090105@ebi.ac.uk> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> <4A65AF53.5090105@ebi.ac.uk> Message-ID: <320fb6e00907210605v7415b1b6id043af520c1bb8de@mail.gmail.com> Hi all, I've CC'd the Biopython-dev mailing list as this EMBOSS thread is becoming cross project. On Tue, Jul 21, 2009 at 1:06 PM, Peter Rice wrote: > > Peter wrote: >> Hi, >> >> One of the many things I talked to Peter Rice about in Sweden >> was the Pearson FASTA like output from needle and water (e.g. >> what EMBOSS calls the markx10 output format), and why it >> includes the EMBOSS header and footer lines (which start with >> a # character), which are not present in real FASTA output. >> >> Biopython can parse the pairwise -m 10 output from Bill >> Pearson's FASTA tools, so in theory we (Biopython) should >> be able to parse the markx10 output from EMBOSS needle >> and water. We could probably cope with the extra header >> and footer, but I think it would be best if EMBOSS could >> produce something more closely matching the real FASTA >> output. 
Unfortunately, it appears to be more than just the >> headers which upset our parser - even ignoring them, >> EMBOSS markx10 output still looks rather different to >> (current) FASTA -m 10 output. Was the markx10 output >> mimicking a particular (old) version of the FASTA tools? > > The source code documentation refers to FASTA 3.4 which > may be the last time I took a detailed look at the FASTA > alignment outputs. That might explain it - I've been using FASTA 3.5. > Can you send us some example files so we can check for > the significant differences? Sure. There are half a dozen FASTA -m 10 output files here: http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/ > We plan to install all the bio* projects so it would be helpful > to have a set of biopython parser scripts we can use to test > locally. We can add them to our routine QA tests and flag up > changes as soon as they appear. If you have (the latest) Biopython installed, and periodically run the unit tests (in particular, test_Emboss.py), that would be a good start. Right now I know that unit test works with EMBOSS 4.0.0 and 6.0.1 (which happens to be on two of the machines I use for testing), and mostly works with EMBOSS 6.1.0 (everything except the GenBank regression you were just looking into today). I'm considering extending test_Emboss.py in the future to take advantage of the new features in EMBOSS 6.1.0 onwards such as GFF and FASTQ support, or perhaps having a second test script (which will be conditional on the version of EMBOSS installed). >> Peter R. did say it would be simple to turn off this header and >> footer output, so I thought I would try this myself. It looks like >> this is handled in file ajax/ajalign.c by function alignWriteMark, >> but I don't see a switch to disable the headers and footers. > > You correctly found how to turn off the header. The footer is > reported for anything except pure sequence output. 
>
> For the next release I will add attributes to the list of alignment
> formats to say whether the header and footer are needed. That
> will allow us better control and reporting.
>
> Meanwhile, we are very happy to standardise the markx* outputs
> to make them easier to parse. Biopython is the first project to
> report problems with this. There are alternatives - specifying
> -aformat and using some other alignment format for all
> applications - but we like to conform and will do our best to fit
> what parsers expect.
>
> Also, of course, once we know we are being parsed we will do
> our best not to let the output change.

This isn't really a problem. Biopython can read EMBOSS's own
alignment formats (pairs and simple), so there is little need for
us to be able to parse EMBOSS's version of the FASTA output.
[Although at the moment we ignore all the header information,
if that formatting will be consistent, we could parse it too.]

However, at least one person wanted to parse EMBOSS markx10
output strongly enough that he wrote a modified version of our
FASTA -m 10 parser. Even so, I would rather have EMBOSS revise
its output to better match FASTA. See
http://bugzilla.open-bio.org/show_bug.cgi?id=2704

Peter C.

From bugzilla-daemon at portal.open-bio.org Tue Jul 21 09:25:50 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Jul 2009 09:25:50 -0400
Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format
In-Reply-To:
Message-ID: <200907211325.n6LDPouc006005@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2704

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 09:25 EST -------
I've started a conversation with Peter Rice at EMBOSS about making needle and
water output more FASTA like when using the markx10 format (and related FASTA
mimicking output modes).
See: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000583.html Later cross posted to Biopython-dev as well: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006425.html Hopefully the EMBOSS markx10 output will in future be close enough to the FASTA -m 10 output that Biopython will only need a single parser to read both. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Tue Jul 21 10:56:33 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Jul 2009 16:56:33 +0200 Subject: [Biopython-dev] Calculating motif scores In-Reply-To: <572083.29767.qm@web62405.mail.re1.yahoo.com> References: <572083.29767.qm@web62405.mail.re1.yahoo.com> Message-ID: <8b34ec180907210756p3340a097i1042856386242b55@mail.gmail.com> Hi, sorry for the delayed response. Busy time... On Fri, Jul 17, 2009 at 4:25 AM, Michiel de Hoon wrote: > > It doesn't have to be so short. I've been running these calculations for whole mammalian chromosomes. For the human chromosome 1, this would take > 247249719 * 4 bytes = 943 MB to store the scores in a Numerical Python array. This can still be comfortably handled by today's computers. Well, I'm not sure if this is an expected behavior for typical uses for a single function call to allocate that much memory. Especially that most people would be interested in the "hits" which exceed some significance threshold. Nonetheless, there will be cases where the user is interested in all scores for a sequence, even the negative ones. Then it is definitely better to provide him with an array rather than a generator. > > I'll upload a C version to CVS so you guys can have a look and try it out. > I took a brief look. It seems fine to me. I haven't done any testing yet though. I'll try to integrate it into a method of Bio.Motif. 
What do you think about: Motif.scanPWM(self, sequence) ?

> How would you feel about having a separate PWM class in Bio.Motif? Some of the stuff currently in the class Motif is actually more
> about the PWM by itself; it may make sense to separate that out.

Hmm, I think that your question connects directly to a bigger design
question which has popped up earlier in the discussion on Bio.Motif
suggestions:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005811.html

I'm not sure myself whether I would like to have different classes for
different motif types: consensus, alignment, regexp, pwm and hmm. I
understand, though, that this makes things simpler for people who only
use one of those types, so that they don't have to deal with the
complications of a motif possibly coming from different sources and
behaving (slightly) differently. I still think that it's useful to have
a Motif class that can be used in a similar way for different kinds of
motifs.

As for the PWM being a separate class and used by the motif: I don't
know. I'm using Bio.SubsMat.FreqTable for implementing the frequency
table, so I understand that the new PWM class would be basically a
"smarter" FreqTable. I'm not sure whether it solves any problems...
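
The trade-off discussed above (return every score as an array, or only
yield the hits above a threshold) can be made concrete with a small
pure-Python sketch. Everything here is an assumption for illustration:
log_odds_matrix, scan_pwm and iter_hits are hypothetical names, the
uniform 0.25 background and the pseudocount of 1 are arbitrary choices,
and this is neither the C routine mentioned in the thread nor a
Bio.Motif method.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append(dict(
            (base,
             math.log((column.get(base, 0) + pseudocount) / total / background, 2))
            for base in "ACGT"))
    return matrix


def scan_pwm(matrix, sequence):
    """Return the score of every window: one float per start position."""
    width = len(matrix)
    return [sum(matrix[i][sequence[start + i]] for i in range(width))
            for start in range(len(sequence) - width + 1)]


def iter_hits(matrix, sequence, threshold):
    """Lazily yield (position, score) for windows scoring >= threshold."""
    width = len(matrix)
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if score >= threshold:
            yield start, score


# Toy motif "ACG" built from eight identical sites (made-up data).
counts = [{"A": 8}, {"C": 8}, {"G": 8}]
pwm = log_odds_matrix(counts)
scores = scan_pwm(pwm, "TTACGTT")                      # all five windows
hits = list(iter_hits(pwm, "TTACGTT", threshold=0.0))  # only the match
```

The list version costs memory proportional to the sequence length (the
943 MB figure quoted above is 247249719 positions at 4 bytes each),
while the generator keeps only one score at a time but cannot be
indexed afterwards. The sketch handles only A, C, G and T; an ambiguous
base such as N would raise a KeyError.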
cheers
Bartek

From bugzilla-daemon at portal.open-bio.org Tue Jul 21 11:21:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Jul 2009 11:21:35 -0400
Subject: [Biopython-dev] [Bug 2882] New: Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2882

           Summary: Unnamed variable used to raise Exception in
                    /Bio/SwissProt/__init__.py
           Product: Biopython
           Version: 1.51b
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: sjcockell at gmail.com

When raising a ValueError on finding an unknown key in a SwissProt record,
Bio.SwissProt.__init__._read() references the undefined 'keyword' instead of
the expected 'key'.

Instead of raising a ValueError, a NameError is raised:

Traceback (most recent call last):
  File "goClass.py", line 31, in <module>
    main('tubulin')
  File "goClass.py", line 23, in main
    record = SwissProt.read(handle)
  File "[...]/biopython-1.51b/build/lib.macosx-10.5-i386-2.5/Bio/SwissProt/__init__.py", line 120, in read
    record = _read(handle)
  File "[...]/biopython-1.51b/build/lib.macosx-10.5-i386-2.5/Bio/SwissProt/__init__.py", line 236, in _read
    raise ValueError("Unknown keyword %s found" % keyword)
NameError: global name 'keyword' is not defined

Fixed by the following patch file:

"
240c240
< raise ValueError("Unknown keyword %s found" % keyword)
---
> raise ValueError("Unknown keyword %s found" % key)
"

Regards
Simon

-- 
http://fuzzierlogic.com
http://friendfeed.com/sjcockell
http://twitter.com/sjcockell

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
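
The failure mode in this report is easy to reproduce outside Biopython.
The sketch below is a stand-alone illustration, not the Bio.SwissProt
code: KNOWN_KEYS, read_line_buggy, read_line_fixed and error_type are
hypothetical names. It shows that the error message is built before the
ValueError exists, so the undefined name wins and a NameError escapes
instead.

```python
KNOWN_KEYS = ("ID", "AC", "DT")  # made-up subset of line codes, illustration only


def read_line_buggy(line):
    key = line[:2]
    if key not in KNOWN_KEYS:
        # Bug: `keyword` was never assigned, so evaluating the format
        # arguments raises NameError before ValueError is constructed.
        raise ValueError("Unknown keyword %s found" % keyword)
    return key


def read_line_fixed(line):
    key = line[:2]
    if key not in KNOWN_KEYS:
        # Fix from the patch: use the variable that actually holds the code.
        raise ValueError("Unknown keyword %s found" % key)
    return key


def error_type(func, line):
    """Return the name of the exception `func(line)` raises, or None."""
    try:
        func(line)
    except Exception as exc:
        return type(exc).__name__
    return None
```

A regression test along these lines needs an input line with an
unrecognised two-letter code, since the faulty branch is only reached
on malformed or unexpected records.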
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 11:22:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 11:22:24 -0400 Subject: [Biopython-dev] [Bug 2882] Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py In-Reply-To: Message-ID: <200907211522.n6LFMO0Z010044@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2882 ------- Comment #1 from sjcockell at gmail.com 2009-07-21 11:22 EST ------- Created an attachment (id=1345) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1345&action=view) Proposed Patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 21 11:35:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 11:35:54 -0400 Subject: [Biopython-dev] [Bug 2882] Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py In-Reply-To: Message-ID: <200907211535.n6LFZs52010429@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2882 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 11:35 EST ------- Thanks - fixed in CVS (will be on github within the hour). Did you have an example file which triggers this, or did you just spot the error from reading the code? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Tue Jul 21 12:03:44 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 21 Jul 2009 12:03:44 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A64C1F7.5040503@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <3f6baf360907210903l5167eefdl46f5cd969c2d164b@mail.gmail.com> Hi Nick, On Mon, Jul 20, 2009 at 3:13 PM, Nick Matzke wrote: > 4. Philosophy question: If I build some functions that do something new > with an e.g. ElementTree (XML tree) object, should I: > > (a) make these functions go in a subclass of the class for the original > object (thus inheriting the methods of the original class, and basically > adding new methods). E.g. basically extending the methods of ElementTree, > with a subclass GbifElementTree; or: > > (b) make a class containing the object as an attribute, with e.g. > GbifXml.xmltree containing an ElementTree attribute which then gets passed > to the various functions. > > I currently have (b) but the more I think about it, the more (a) makes more > sense from a simplicity/usability/maintainability sense. > > I have some ElementTree-related helper functions, too. Since we're still maintaining compatibility with Python 2.4 and xml.etree didn't enter the standard library until Py2.5, the ElementTree interface could potentially come from several different sources, with slightly different capabilities. It's a weird module in general... 
basically, I'm treating the library like a wild badger -- a function either relies on the ETree object structure, or it doesn't, and the ETree-specific functions live in their own area near the top of the file. The methods that do phyloXML-specific work call another function to extract what they need from a node, then carry on with ordinary, well-behaved Python objects. When Bio.Tree integration comes due, we could check how much our various ETree utilities overlap and maybe combine them into a separate module. For instance, I have a tree pretty-printer and a function for dumping a list of XML node tags, too. Summary: Integrating with Bio.Tree will involve some refactoring, and it would be easier if the ElementTree stuff was quarantined off a little bit. > def extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > Neat article! I was intrigued by this result so I tried to replicate it -- and my results were different, since newer Pythons have some string optimizations that weren't in place when the article was written. Adding strings together in a loop doesn't lead to quadratic time complexity anymore. Blogged it: http://etalog.blogspot.com/2009/07/faster-string-concatenation-in-python.html > def xmlstring_to_xmltree(xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file (required to > make > ElementTree.parse work, apparently) > > Did cStringIO work as a temp file handle? I wonder if this is a bug in Python. Overall, it's great to see Biopython is going to have such solid phylogenetics/geography support. Should be fun to work with in the future. 
Cheers, Eric From hlapp at gmx.net Tue Jul 21 13:12:00 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 21 Jul 2009 13:12:00 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> Message-ID: On Jul 20, 2009, at 10:57 AM, Eric Talevich wrote: > the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML Really? A clade is *not* a node in the sense it is normally used in phylogenetics, and I would suggest that PhyloXML is using "clade" synonymously with "node" it needs to change b/c using established terminology in conflicting ways isn't a good idea. A clade is a subtree of a tree, i.e., a node and all its descendent nodes (and the branches that connect them). Or more generally for an unrooted tree, it is any group of nodes (and branches connecting them) that can be completely separated from the rest of the tree by severing a single branch. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From eric.talevich at gmail.com Tue Jul 21 13:29:54 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 21 Jul 2009 13:29:54 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> Message-ID: <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> On Tue, Jul 21, 2009 at 1:03 PM, Hilmar Lapp wrote: > > On Jul 20, 2009, at 10:57 AM, Eric Talevich wrote: > > the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML >> > > > Really? 
A clade is *not* a node in the sense it is normally used in > phylogenetics, and I would suggest that PhyloXML is using "clade" > synonymously with "node" it needs to change b/c using established > terminology in conflicting ways isn't a good idea. > > A clade is a subtree of a tree, i.e., a node and all its descendent nodes > (and the branches that connect them). Or more generally for an unrooted > tree, it is any group of nodes (and branches connecting them) that can be > completely separated from the rest of the tree by severing a single branch. > > -hilmar > Interesting to know. Here's the documentation for the Clade type: Element Clade is used in a recursive manner to describe the topology of a phylogenetic tree. The parent branch length of a clade can be described either with the 'branch_length' element or the 'branch_length' attribute (it is not recommended to use both at the same time, though). Usage of the 'branch_length' attribute allows for a less verbose description. Element 'confidence' is used to indicate the support for a clade/parent branch. Element 'events' is used to describe such events as gene-duplications at the root node/parent branch of a clade. Element 'width' is the branch width for this clade (including parent branch). Both 'color' and 'width' elements apply for the whole clade unless overwritten in-sub clades. Attribute 'id_source' is used to link other elements to a clade (on the xml-level). It has a label (name), confidence value and branch length like most Node objects do, and even an attribute called node_id. I guess nodes and edges are implicit in the phyloXML representation, and everything *except* the clade class would be considered a sub-type of the traditional node. Then maybe Clade should inherit from Tree instead of Node, and offer an interface to implicit node and edge objects. For the purposes of reusing methods among Nexus, phyloXML, etc. 
trees, using Clade as a Node seems easiest in terms of having the right attributes available. The same mapping is being used in the BioRuby project, too: Phylogeny:Tree, Clade:Node. (Not sure about Bioperl.)

I'll hold off working on the BaseTree integration until we have consensus on this.

Best,
Eric

From hlapp at gmx.net Tue Jul 21 13:45:08 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 21 Jul 2009 13:45:08 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython
In-Reply-To: <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com>
References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com>
Message-ID:

On Jul 21, 2009, at 1:29 PM, Eric Talevich wrote:

> Element Clade is used in a recursive manner to describe the topology
> of a phylogenetic tree.

That's OK I guess on the topological level - a subtree of a clade is also a clade. I.e., the clade formed by a node A and all its descendants is contained within the clade formed by the parent of A and all of the parent's descendants.

But referring to or identifying a clade must be referring to an entire group of nodes, not only one. So attaching something to the clade semantically has to attach it to all nodes in the clade.
-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From czmasek at burnham.org Tue Jul 21 13:51:03 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Tue, 21 Jul 2009 10:51:03 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython
In-Reply-To: <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu>
References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu>
Message-ID: <4A660007.5090900@burnham.org>

Hi, Hilmar:

Hilmar Lapp wrote:
> A clade is a subtree of a tree, i.e., a node and all its descendent
> nodes (and the branches that connect them). Or more generally for an
> unrooted tree, it is any group of nodes (and branches connecting them)
> that can be completely separated from the rest of the tree by severing
> a single branch.
>
> -hilmar

Actually, that is how clade is being used. Like so:

A B C

The difference is that a clade can contain other clades, whereas a node cannot.

Chris

From czmasek at burnham.org Tue Jul 21 14:05:25 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Tue, 21 Jul 2009 11:05:25 -0700
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython
In-Reply-To:
References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com>
Message-ID: <4A660365.5060405@burnham.org>

> But referring to or identifying a clade must be referring to an entire
> group of nodes, not only one. So attaching something to the clade
> semantically has to attach it to all nodes in the clade.

Good point!
Predefined phyloXML elements are defined either to apply to the whole clade, as long as they are not "overwritten" by values in descendant clades (for example Taxonomy), or to apply only to the clade ("node" in this case) they are in, "branch_length" for example.

The property element (used for "custom" data) has an "applies_to" attribute to indicate where the data should be attached (values are: "phylogeny", "clade", "node", "parent_branch", ...).

Chris

> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>

From hlapp at gmx.net Tue Jul 21 14:24:27 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 21 Jul 2009 14:24:27 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython
In-Reply-To: <4A660365.5060405@burnham.org>
References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <4A660365.5060405@burnham.org>
Message-ID: <7265CC6F-2C0B-478A-ADAE-AD8B96ABE1EC@gmx.net>

On Jul 21, 2009, at 2:05 PM, Christian M Zmasek wrote:

> or are defined to only apply to the clade ("node" in this case) they
> are in, "branch_length" for example.

You do see how you are contradicting the previous definition here, right? *All* nodes in a clade are in that clade, and *all* branches.

My recommendation is to fix this in the phyloXML spec - there is a whole field of cladistics and I don't think it's a wise idea to re-apply their terminology in ways that are in contradiction.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Tue Jul 21 15:56:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 15:56:12 -0400 Subject: [Biopython-dev] [Bug 2880] test_Mafft_tool.py unit test failure In-Reply-To: Message-ID: <200907211956.n6LJuCXT018866@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 15:56 EST ------- (In reply to comment #5) > > I've retitled the bug to focus on the MAFFT issue. This may well be > a problem with your old version of MAFFT - I know for example the > the FASTA output is broken on some versions of MAFFT. > I was able to install MAFFT v6.240 on another machine, and worked out a simple fix. Basically this version produced a different CLUSTAL style header line. Should be fixed in CVS now. Thanks for the report, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 16:56:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 16:56:51 -0400 Subject: [Biopython-dev] [Bug 2874] invalid class on warning module In-Reply-To: Message-ID: <200907212056.n6LKupOV020686@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2874 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 16:56 EST ------- I thought I had already marked this bug as fixed... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 21 16:59:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 16:59:29 -0400 Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity In-Reply-To: Message-ID: <200907212059.n6LKxT1o020771@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2879 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 16:59 EST ------- Could you give a short but complete example showing the problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 17:09:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Jul 2009 17:09:33 -0400
Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd
In-Reply-To:
Message-ID: <200907212109.n6LL9XP6021073@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2867

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 17:09 EST -------
I've checked a fix for this into CVS, but have not tested it. Could you update and retry?

It might be simplest to reinstall all of Biopython from CVS or github, but you only need to update this one file,
/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py

The new version will be on github soon, or here soon:
http://biopython.org/SRC/biopython/Bio/PDB/PDBList.py

The differences are quite small:

RCS file: /home/repository/biopython/biopython/Bio/PDB/PDBList.py,v
retrieving revision 1.25
diff -r1.25 PDBList.py
37a38
> #TODO - Use os.path.join(...) instead of adding strings with os.sep
39a41
> import shutil
248d249
<
280c281
< os.cmd('mv %s %s'%(old_file,new_file))
---
> shutil.move(old_file, new_file)

i.e. The new version uses shutil.move(old_file, new_file) instead.

Thanks!

Peter
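The substance of the change for Bug 2867 can be tried in isolation. The sketch below is not the PDBList code; the helper name, directory layout and file name are made up for illustration. It shows that the os module has no `cmd` function, and that shutil.move does the same job as shelling out to `mv` without depending on a Unix shell.

```python
import os
import shutil
import tempfile


def move_obsolete(old_file, new_file):
    """Move old_file to new_file without shelling out to 'mv'."""
    shutil.move(old_file, new_file)


# Build a throw-away directory tree so the example touches nothing real.
workdir = tempfile.mkdtemp()
old_file = os.path.join(workdir, "pdb1abc.ent")  # hypothetical file name
new_dir = os.path.join(workdir, "obsolete")
os.mkdir(new_dir)
new_file = os.path.join(new_dir, "pdb1abc.ent")

with open(old_file, "w") as handle:
    handle.write("HEADER    EXAMPLE\n")

# The original call could never have worked: os has no 'cmd' attribute.
has_os_cmd = hasattr(os, "cmd")

move_obsolete(old_file, new_file)
moved = os.path.exists(new_file) and not os.path.exists(old_file)

shutil.rmtree(workdir)
```

Building the paths with os.path.join, as the TODO comment in the diff suggests, also avoids the manual os.sep string handling.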
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 04:17:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Jul 2009 04:17:06 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <200907220817.n6M8H6IQ008427@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #2 from katja.luck at unistra.fr 2009-07-22 04:17 EST -------
(In reply to comment #1)
> Could you give a short but complete example showing the problem?
>

example code:

from Bio.PDB.PDBParser import PDBParser

if '__main__' == __name__:
    parser = PDBParser(PERMISSIVE=1)
    PDBID = '1N7T'
    PDB_file = '/Network/Servers/sumba/Volumes/s/luck/pymol/1N7T.pdb'
    structure = parser.get_structure(PDBID,PDB_file)
    chain = structure[0]['A']
    print chain[66].get_id()
    chain.__delitem__(66)

command line output:

[carlit:/Users/katja] luck% python Python_scripts/PDZ_project/bug_example.py
(' ', 66, ' ')
Traceback (most recent call last):
  File "Python_scripts/PDZ_project/bug_example.py", line 14, in <module>
    chain.__delitem__(66)
  File "/Library/Python/2.5/site-packages/Bio/PDB/Chain.py", line 79, in __delitem__
    return Entity.__delitem__(self, id)
AttributeError: class Entity has no attribute '__delitem__'

Okay, I have now realised that I should use detach_child() rather than the private method __delitem__() for deleting residues from a chain, but I still thought it might be good to report this bug.
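The AttributeError in this report can be reproduced with toy classes, without a PDB file. In the sketch below, Entity, BrokenChain and WorkingChain are simplified stand-ins and not the Bio.PDB classes. The repair shown in WorkingChain is only one possible approach, assumed here for illustration; it is not a statement of how the project resolved the bug.

```python
class Entity(object):
    """Toy container that, like the reported class, defines no __delitem__."""

    def __init__(self):
        self.child_dict = {}

    def __getitem__(self, id):
        return self.child_dict[id]

    def detach_child(self, id):
        del self.child_dict[id]


class BrokenChain(Entity):
    def __delitem__(self, id):
        # Delegates to a parent method that does not exist.
        return Entity.__delitem__(self, id)


class WorkingChain(Entity):
    def __delitem__(self, id):
        # One possible repair (an assumption): reuse detach_child().
        return self.detach_child(id)


broken = BrokenChain()
broken.child_dict[66] = "residue 66"
try:
    del broken[66]  # 'del obj[key]' is dispatched to obj.__delitem__(key)
    broken_raised = False
except AttributeError:
    broken_raised = True

working = WorkingChain()
working.child_dict[66] = "residue 66"
del working[66]
still_present = 66 in working.child_dict
```

Note that the `del container[key]` statement and an explicit `container.__delitem__(key)` call reach the same method, so both fail in the same way when the parent class lacks it.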
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 05:14:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 05:14:04 -0400 Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity In-Reply-To: Message-ID: <200907220914.n6M9E4qq009918@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2879 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 05:14 EST ------- Thanks for the clarification. Note in python __delitem__ is a special method, and rather than this: chain.__delitem__(66) you would normally do: del chain[66] and this will internally call the special __delitem__ method. This is much like other special methods, e.g. str(object) will internally do object.__str__() for you. You wouldn't normally use these double underscore methods explicitly. In any case, I don't understand what Thomas intended the __delitem__ to do, and there may be a bug here. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jul 22 07:56:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 12:56:23 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output Message-ID: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Hi Peter R. et al, Up until now I had mostly been trying EMBOSS 6.1.0 with short read data. I've just noticed for longer reads EMBOSS wraps the sequences and qualities lines in FASTQ output (at 60 characters). There is an example of this at the end of the email. My understanding is that while line breaks are allowed in the sequences and qualities lines of a FASTQ file, they are discouraged as it can break simple minded parsers. 
Unfortunately right now I can't find any references/websites to back up this assertion (other than things I wrote myself since), but I was sure I read this on the MAQ site somewhere.

Several sites do simply talk about "the" sequence line and "the" quality line (indeed the early drafts of the wikipedia page had this assumption, which I fixed). This is natural if all you have ever worked with is short read data. Of course, 454 reads are hundreds of bases long, and even the latest Illumina reads now are in the range 70 to 100 bp (or so I hear), so this issue will become more common - so any existing parsers that can't cope with line breaks will soon get broken, and hopefully fixed.

For Biopython we should be able to cope with any strange line breaks in the sequence and quality lines on input, but on output we don't do any line wrapping. I felt this would result in more widely parseable output.

I wondered what your thought process was, and if you think it is worth removing the line wrapping on EMBOSS's FASTQ output (or indeed, if you have a good argument to convince me to make Biopython output FASTQ with line wrapping by default).

[I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as ideal for an OBF cross project mailing list, something we talked about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going to look into this?]

Regards,

Peter C.
(at Biopython)

e.g.

$ embossversion
Reports the current EMBOSS version number
6.1.0

$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

It is likely that email software will mangle the line breaks, but in my example file sanger_93.fastq the sequence and the quality are single line strings (of length 94).
Now let's let EMBOSS seqret read this in and write it out again:

$ seqret -filter -seq sanger_93.fastq -sformat fastq-sanger -osformat fastq-sanger
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDC
BA@?>=<;:9876543210/.-,+*)('&%$#"!

The new lines are real and not just from the email formatting - you can check this by piping the output through hexdump. It appears EMBOSS is using 60 character line wrapping.

Peter C.

From bugzilla-daemon at portal.open-bio.org Wed Jul 22 11:14:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Jul 2009 11:14:46 -0400
Subject: [Biopython-dev] [Bug 2883] New: Errors after unpickling of 1.49 seqrecords
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2883

           Summary: Errors after unpickling of 1.49 seqrecords
           Product: Biopython
           Version: 1.51b
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biodec.com

I get the same error with biopython 1.50b as well.
I get the same errors with both python2.4 and python2.5.

PROBLEM:
For testing purposes I have some cPickled seqrecords that I prepared with biopython-1.49. The unpickling doesn't produce any error at all, but if I try to:
1) print the unpickled seqrecord
2) use the unpickled seqrecord
I get errors.
1) =========================================================================
>>> print seqr
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 501, in __str__
    if self.letter_annotations :
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 170, in <lambda>
    fget=lambda self : self._per_letter_annotations,
AttributeError: 'SeqRecord' object has no attribute '_per_letter_annotations'

### This problem may be related to the one in bug #2838
===============================================================================

2)=============================================================================
>>> seqr.seq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 538, in __repr__
    % tuple(map(repr, (self.seq, self.id, self.name,
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 233, in <lambda>
    seq = property(fget=lambda self : self._seq,
AttributeError: 'SeqRecord' object has no attribute '_seq'
===============================================================================

As far as I can tell, old seqrecords didn't have any "_per_letter_annotations" or any "_seq" in SeqRecord class/instances.

Maybe I should split the two errors into two different bugs, but I prefer to keep them together because they are related to the same main problem of "unpickling an old seqrecord" (or maybe it is not a problem and I shouldn't try to unpickle old seqrecord instances with new biopython versions).

I didn't try the CVS code because I didn't find any related error in bugzilla.open-bio.org.
I've added a dump of a seqrecord generated with biopython-1.49 Thanks Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 11:15:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 11:15:32 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221515.n6MFFWaK023587@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #1 from andrea at biodec.com 2009-07-22 11:15 EST ------- Created an attachment (id=1346) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1346&action=view) Dump of a seqrecord generated with biopython 1.49 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 11:44:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 11:44:47 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221544.n6MFil8a025136@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 11:44 EST ------- It sounds like pickling and unpickling worked for you on Biopython 1.49, but I am not 100% sure that is what you meant. 
The good news is I can pickle/unpickle a new SeqRecord object:

>>> import pickle
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> s = Seq("ACGT", generic_dna)
>>> s2 = pickle.loads(pickle.dumps(s))
>>> s2
Seq('ACGT', DNAAlphabet())
>>> from Bio.SeqRecord import SeqRecord
>>> r = SeqRecord(s, id="test", letter_annotations={"dummy":[4,3,2,1]})
>>> print r
ID: test
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())
>>> r2 = pickle.loads(pickle.dumps(r))
>>> print r2
ID: test
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())

And this also works with cPickle:

>>> import cPickle
>>> s3 = cPickle.loads(cPickle.dumps(s))
>>> s3
Seq('ACGT', DNAAlphabet())
>>> r3 = cPickle.loads(cPickle.dumps(r))
>>> print r3
ID: test
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())

I would expect you to be able to pickle/unpickle new objects on your system too.

However, I can confirm trying to unpickle the example you attached to this bug also fails for me (using the latest Biopython from CVS).

As you may be aware, per-letter-annotation support was added in Biopython 1.50, and it is stored internally in a private property of the SeqRecord, _per_letter_annotations. The seq property is also now stored internally in a private property of the SeqRecord, _seq. This means if you unpickle a pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the _per_letter_annotations and _seq properties never get initialised. This causes the two errors you saw.

I don't think there is much we can do about this... not without making the SeqRecord even more complicated, e.g.
http://code.activestate.com/recipes/521901/

Peter

P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and shouldn't be relevant to the underlying SeqRecord object, or this issue.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 12:52:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 12:52:14 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221652.n6MGqEu7028407@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #3 from andrea at biodec.com 2009-07-22 12:52 EST ------- (In reply to comment #2) > It sounds like pickling and unpickling worked for you on Biopython 1.49, but I > am not 100% sure that is what you meant. Yes, that's true. it worked. > > The good news is I can pickle/unpickle a new SeqRecord object: > yes this i know, and it works also for me and also with cPickle. > I would expect you to be able to pickle/unpickle new objects on your system > too. sure > > However, I can confirm trying to unpickle the example you attached to this bug > also fails for me (using the latest Biopython from CVS). > > As you may be aware, per-letter-annotation support was added in Biopython 1.50 > which is stored internally by a private property of the SeqRecord, > _per_letter_annotations. The seq property is also now stored internally by a > private property of the SeqRecord, _seq. This means if you unpickle a > pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the > _per_letter_annotations and _seq properties never gets initialised. This causes > the two errors you saw. This is the problem. i've many example of seqrecord dump (that i use as a test) that due to the seqrecord modifications i cannot use anymore. - I've to convert in the new type. - or i've to design fully new tests that permit me to manage changing in the SeqRecord structure. 
> > I don't think there is much we can do about this... not without making the > SeqRecord even more complicated, e.g. > http://code.activestate.com/recipes/521901/ I understand. I thought SeqRecord was structurally stable, but it isn't. In that case I can only safely pickle strings, lists and dictionaries, so I will redesign my tests to handle only the data stored in the SeqRecord (representing it as a dictionary of dictionaries would be a good solution). > > > P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and shouldn't > be relevant to the underlying SeqRecord object, or this issue. > Yes, but in the last part of that bug there was a similar error, AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations', and I thought it was because DBSeqRecord didn't have that attribute and was out of sync with the new 1.50 SeqRecord... Thanks Andrea PS: I think you could close the bug.
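The "dictionary of dictionaries" idea mentioned above could look roughly like the following. It is only a sketch: the function name record_to_plain and the chosen keys are assumptions made for illustration, and a real SeqRecord carries more than this (feature locations, qualifiers and so on).

```python
import pickle


def record_to_plain(record_id, seq, annotations=None, features=None):
    """Pack record data into built-in types only (str, list, dict).

    The key names are illustrative assumptions, not a Biopython convention.
    """
    return {
        "id": record_id,
        "seq": str(seq),
        "annotations": dict(annotations or {}),
        # Each feature reduced to a plain dictionary.
        "features": [dict(f) for f in (features or [])],
    }


# Store the known-good result once...
expected = record_to_plain(
    "test",
    "ACGT",
    annotations={"source": "predictor-x"},
    features=[{"type": "helix", "start": 0, "end": 3}],
)
blob = pickle.dumps(expected)

# ...and later compare a fresh prediction against the stored copy.
loaded = pickle.loads(blob)
fresh = record_to_plain(
    "test",
    "ACGT",
    annotations={"source": "predictor-x"},
    features=[{"type": "helix", "start": 0, "end": 3}],
)
print(loaded == fresh)
```

Because only built-in types are pickled, the stored test data no longer depends on the internal layout of any class, at the cost of writing the conversion by hand.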
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 13:28:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 13:28:39 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221728.n6MHSdjt029734@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 13:28 EST ------- (In reply to comment #3) > > As you may be aware, per-letter-annotation support was added in Biopython > > 1.50 which is stored internally by a private property of the SeqRecord, > > _per_letter_annotations. The seq property is also now stored internally > > by a private property of the SeqRecord, _seq. This means if you unpickle > > a pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the > > _per_letter_annotations and _seq properties never gets initialised. This > > causes the two errors you saw. > > This is the problem. i've many example of seqrecord dump (that i use as a > test) that due to the seqrecord modifications i cannot use anymore. > - I've to convert in the new type. > - or i've to design fully new tests that permit me to > manage changing in the SeqRecord structure. You can probably hack the missing per letter annotation with something like record._per_letter_annotations = {}, but it looks like there is no obvious way to get at the sequence information in the unpickled record. Would you like to discuss your storage strategy on the mailing list? I'm curious what you are doing that made you choose to use pickle like this (instead of saving to a standard sequence file format, or BioSQL). > > I don't think there is much we can do about this... 
not without > > making the SeqRecord even more complicated, e.g. > > http://code.activestate.com/recipes/521901/ > > I understand. I thought SeqRecod was structurally stable. > But it isn't. In this sense i can only pickle strings, lists and > dictionaries... so i will redraw my tests to manage only SeqRecord > stored data (representing it as a dictionary of dictionaries it would > be a good solution). Pickling complex objects is usually fine, unless the class changes - like the SeqRecord did (and it may do in future, or more likely the SeqFeature object may). > > P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and > > shouldn't be relevant to the underlying SeqRecord object, or this issue. > > yes, but in the last part of the bug there was a similar error > AttributeError: 'DBSeqRecord' object has no attribute > _per_letter_annotations' and i thought it was due to the fact that > DBSeqRecord didn't have that attribute and it was out of sync with > respect to the new 1.50 seqrecord... Yes, part of Bug 2838 was that the DBSeqRecord got out of sync with the SeqRecord. > PS: i think you could close the bug. OK - marking as "won't fix". Sorry about this, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Wed Jul 22 15:25:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 20:25:25 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> Message-ID: <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> On Wed, Jul 22, 2009 at 7:16 PM, James Casbon wrote: > > A bit late to the party, but I put my sff parsing code into this fork > before reading this thread: > http://github.com/jamescasbon/biopython/tree/sff > > I have a test suite but not sure where all the other QualityIO tests > are so it can live with them > > It does work with the roche tools v2, but I have no paired end sff > files to test. Sounds interesting - github is being very slow for me right now, so I'll probably take a look tomorrow. I'll be interested to see how it compares to my rough code on Bug 2837 based on the code from Jose Blanca (this doesn't do paired end reads yet). http://bugzilla.open-bio.org/show_bug.cgi?id=2837 This is something I hope to work on for Biopython 1.52, once Biopython 1.51 final is out the door (later this month I hope). 
Peter From czmasek at burnham.org Thu Jul 23 00:43:08 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 22 Jul 2009 21:43:08 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <325F101D-1E7A-4BEA-BF2C-A3C18547063B@illinois.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <4A660007.5090900@burnham.org> <325F101D-1E7A-4BEA-BF2C-A3C18547063B@illinois.edu> Message-ID: <4A67EA5C.90709@burnham.org> Hi, Chris: > From that contained Clades fall out quite easily, as they would just > be deeper subtrees within that Clade that also have a clade 'root node'. I don't understand this sentence. Chris From pmr at ebi.ac.uk Thu Jul 23 04:08:51 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 23 Jul 2009 09:08:51 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Message-ID: <4A681A93.9030303@ebi.ac.uk> Peter C. wrote: > Hi Peter R. et al, > > For Biopython we should be able cope with any strange line breaks in > the sequences and qualities lines on input, but for output don't do > any line wrapping. I felt this would result in more widely parseable > output. I wondered what your thought process was, and if you think it > is worth removing the line wrapping on EMBOSS's FASTQ output (or > indeed, if you have a good argument to convince me to make Biopython > output FASTQ with line wrapping by default). There is also an issue with making the lines so long that brain-damaged parsers (those that read a line in C and fail to check it was a complete line) will fail. Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see whether any parsers would object. 
The obvious compromise is to increase the default line length in EMBOSS to, say, 500 so that anyone reading up to 512 characters will still be safe. Unfortunately some folk will then assume there will never be a line break. Alternatively, we could truly make everything fit on one line. Or we could double up the fastq outputs with and without line breaks (horrible problems with naming the output formats). I suspect this one-line thing is a simple attempt to avoid the "quality line starting with '@' or '+'" issue. > [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as > ideal for an OBF cross project mailing list, something we talked about > at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going > to look into this?] Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release but I will get back on to it. regards, Peter From bugzilla-daemon at portal.open-bio.org Thu Jul 23 04:47:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 04:47:42 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230847.n6N8lgYw029402@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #5 from andrea at biodec.com 2009-07-23 04:47 EST ------- > You can probably hack the missing per letter annotation with something > like record._per_letter_annotations = {}, Yes, I tried and it works... but there is no way to recover the seq... seqrecord.seq is not accessible anymore.... > but it looks like there is no > obvious way to get at the sequence information in the unpicked record. > > Would you like to discuss your storage strategy on the mailing list? Sure, which one? Discussion, development..... But are you sure it is necessary? > I'm curious what you are doing that made you choose to use pickle like > this (instead of saving to a standard sequence file format, or BioSQL). 
I'm using pickled objects only for testing purposes, so implementing a BioSQL system for that is too much (even if it is available for SQLite). Maybe saving the data in another format (certainly not FASTA), for example GenBank, could be another good solution, but it would add a possible "layer of failure" related to parsing problems (and I think unpickling a dictionary will not introduce this possible "layer of failure"). Were you thinking about GenBank format? Do you suggest something different? > > Sorry about this, Don't worry. I think you are developing the system in a way that will bring it to a better state, so it isn't a problem at all. Even better, thanks a lot. Andrea From biopython at maubp.freeserve.co.uk Thu Jul 23 05:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:14:52 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <4A681A93.9030303@ebi.ac.uk> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> <4A681A93.9030303@ebi.ac.uk> Message-ID: <320fb6e00907230214l6df7ff76j643e8ddc1f600054@mail.gmail.com> On Thu, Jul 23, 2009 at 9:08 AM, Peter Rice wrote: > Peter C. wrote: >> >> Hi Peter R. et al, >> >> For Biopython we should be able cope with any strange line breaks >> in the sequences and qualities lines on input, but for output don't do >> any line wrapping. I felt this would result in more widely parseable >> output. I wondered what your thought process was, and if you think >> it is worth removing the line wrapping on EMBOSS's FASTQ output >> (or indeed, if you have a good argument to convince me to make >> Biopython output FASTQ with line wrapping by default). 
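The policy quoted above (tolerate any line wrapping on input, write unwrapped output) could be sketched as follows. This is a minimal illustration, not Biopython's or EMBOSS's parser; read_fastq and write_fastq are hypothetical names. The sketch relies on the quality string having the same length as the sequence, so a quality line that happens to start with '@' or '+' is not mistaken for the start of a new record.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from a possibly line-wrapped FASTQ."""
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' at start of record: %r" % line)
        title = line[1:].rstrip()
        # Sequence lines never start with '+', so '+' marks the separator.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("Missing '+' separator line")
        seq = "".join(seq_parts)
        # Quality lines may start with '@' or '+', so count characters
        # instead of looking for a marker.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("Quality string shorter than sequence")
            chunk = line.rstrip("\r\n")
            qual_parts.append(chunk)
            qual_len += len(chunk)
        if qual_len != len(seq):
            raise ValueError("Quality string longer than sequence")
        yield title, seq, "".join(qual_parts)
        line = handle.readline()


def write_fastq(records, handle):
    """Write records with sequence and quality each on a single line."""
    for title, seq, qual in records:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


wrapped = "@read1\nACGT\nACGT\n+\n@III\n+III\n@read2\nGG\n+\nII\n"
records = list(read_fastq(io.StringIO(wrapped)))
out = io.StringIO()
write_fastq(records, out)
print(out.getvalue())
```

In the example input the two quality lines of read1 begin with '@' and '+', which is exactly the case that makes naive line-based FASTQ parsing fragile.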
> > There is also an issue with making the ines so long that brain-damaged > parsers (those that read a line in C and fail to check it was a complete > line) will fail. You mean a C parser with a finite string buffer (say 100 characters) which reads things line by line. Yes, that would be a bit brain dead too. I guess either way could break some parsers out there ;) > Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see > whether any parsers would object. I see - well I'm not objecting, and neither is the Biopython parser. > The obvious compromise is to increase the default line length in > EMBOSS to say 500 so that anyone reading up to 512 characters > will still be safe. Unfortunately some flk will then assume there will > never be a line break. That seems like a bad idea - especially as Roche 454 reads are in the region of 500+ bp, meaning some would wrap and some wouldn't. Even using a longer wrap like 1000 would probably just postpone the issue. If you are going to wrap, something short like 60 seems more sensible (often used in FASTA files too) given the historical 80 character width of a terminal window. People using early Solexa/Illumina machines will only see a single line, but as their read lengths are already in the range 70 to 100bp, I wonder what the latest Illumina pipelines output (wrt wrapping)? > Alternatively, we could truly make everything fit on one line. That's what Biopython currently does. But you are right - I hadn't considered brain dead parsers using fixed buffers. > Or we could double up the fastq outputs with and without line breaks > (horrible problems with naming the ouptut formats) I don't like that plan. For Biopython we could have a wrapping setting available for people who really need to specify this (as we do for FASTA already), with a sensible default value. > I suspect this one-line thing is a simple attempt to avoid the "quality line > starting with '@' or '+'" issue. Could be. 
I think the fact that @ and + are valid entries in the quality string is the second most annoying thing about the FASTQ format (after the lack of a clear format definition from Sanger, and the resulting variants from Solexa/Illumina etc). >> [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as >> ideal for an OBF cross project mailing list, something we talked >> about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) >> were going to look into this?] > > Yes indeed I was. Waylaid by the demands of the 6.1.0 EMOSS release > but I will get back on to it. Thanks! > regards, > > Peter Cheers, Peter C. From bugzilla-daemon at portal.open-bio.org Thu Jul 23 05:20:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 05:20:20 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230920.n6N9KKwC030688@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-23 05:20 EST ------- (In reply to comment #5) > > Would you like to discuss your storage strategy on the mailing list? > > Sure, which one? Discussion, developement..... But are you sure it is > necessary? I was thinking the main discussion list - but if this was just for your own testing, maybe we don't need to. > > I'm curious what you are doing that made you choose to use pickle like > > this (instead of saving to a standard sequence file format, or BioSQL). > > I'm using pickled object only for testing purposes. So implement a BioSQL > system for that is too much... (also if it is available for sql lite) > Maybe saving data in other format (for sure not fasta)... for example > GenBank it could be another good solution but i will add a possible > "layer of failure" related to parsing problems.... 
(And i think, unpickling > of dictionary will not introduce this possible "layer of failure"). > Were you thinking about GenBank format? Do you suggest something different? If your SeqRecord objects are all simply loaded from sequence files in the first place (and not modified), I would just keep the original file and re-parse it. If you have generated your own SeqRecords (or modified those from reading a file), then it makes sense to save them somehow. The choice of file format depends on the nature of annotation. The latest Biopython will now record the features in a GenBank file, making that a reasonable choice - but this does not cover per-letter-annotations. BioSQL has the same limitation. > > > Sorry about this, > > Don't worry. I think you are developing the system in a way that it > will bring it to a better state... so, it isn't a problem at all.... > even better thanks a lot. > > Andrea Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Thu Jul 23 05:34:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:34:26 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> Message-ID: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote: > > 2009/7/22 Peter : >> On Wed, Jul 22, 2009 at 7:16 PM, James Casbon wrote: >>> >>> A bit late to the party, but I put my sff parsing code into this fork >>> before reading this thread: >>> http://github.com/jamescasbon/biopython/tree/sff >> >> Sounds interesting - github is being very slow for me right now, >> so I'll probably take a look tomorrow. I'll be interested to see how >> it compares to my rough code on Bug 2837 based on the code >> from Jose Blanca (this doesn't do paired end reads yet). >> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 > > I don't think there is much in it really. You have a factored > BinaryFile class, I have classes for the components of the SFF file. > Both are based around struct. Github is working fine now - maybe my wireless network was just too slow at home last night? Jose's code uses seek/tell which means it has to have a handle to an actual file. He also used binary read mode - I'm not sure if this was essential or not. 
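For comparison with the seek/tell approach mentioned above, here is a rough sketch of struct-based reading that only ever calls read() on the handle, so it never needs to jump around in a real file. The 12-byte header layout and the names HEADER, read_exactly and parse_header are invented for this illustration; this is NOT the real SFF layout and not code from either implementation discussed here.

```python
import gzip
import io
import struct

# Hypothetical big-endian header: 4-byte magic, uint32 record count,
# uint16 name length, uint16 payload length (12 bytes in total).
HEADER = struct.Struct(">4sIHH")


def read_exactly(handle, size):
    """Read exactly `size` bytes using only handle.read(); no seek/tell."""
    chunks = []
    remaining = size
    while remaining:
        chunk = handle.read(remaining)
        if not chunk:
            raise ValueError("Unexpected end of data")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def parse_header(handle):
    """Parse the made-up header in a single forward pass."""
    magic, count, name_len, payload_len = HEADER.unpack(
        read_exactly(handle, HEADER.size)
    )
    if magic != b".xyz":
        raise ValueError("Not a recognised file: %r" % magic)
    # Variable-length field whose size was given in the fixed part.
    name = read_exactly(handle, name_len).decode("ascii")
    return count, name, payload_len


data = HEADER.pack(b".xyz", 3, 5, 100) + b"run01"

# Works on an in-memory handle...
print(parse_header(io.BytesIO(data)))

# ...and equally on a gzip stream.
compressed = gzip.compress(data)
print(parse_header(gzip.GzipFile(fileobj=io.BytesIO(compressed))))
```

Reading strictly forwards like this is what lets the same parser accept compressed or in-memory handles, at the price of not being able to jump straight to an index stored elsewhere in the file.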
James' code seems to make a single pass through the file handle, without using seek/tell to jump about. I think this is nicer, as it is consistent with the other SeqIO parsers, and should work on more types of handles (e.g. from gzip, StringIO, or even a network connection). It looks like you (James) construct Seq objects using the full untrimmed sequence as is. I was undecided on whether trimmed or untrimmed should be the default, but the idea of some kind of masked or trimmed Seq object had come up on the mailing list which might be useful here (and in contig alignments), i.e. something which acts like a Seq object giving the trimmed sequence, but which also contains the full sequence and trim positions. I also want to look at paired end reads in SFF files... Peter From bugzilla-daemon at portal.open-bio.org Thu Jul 23 05:56:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 05:56:50 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230956.n6N9uouv031896@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #7 from andrea at biodec.com 2009-07-23 05:56 EST ------- (In reply to comment #6) > (In reply to comment #5) > If your SeqRecord objects are all simply loaded from sequence files in the > first place (and not modified), I would just keep the original file and > re-parse it. > > If you have generated your own SeqRecords (or modified those from reading > a file), then it makes sense to save them somehow. The choice of file > format depends on the nature of annotation. The latest Biopython will now > record the features in a GenBank file, making that a reasonable choice - > but this does not cover per-letter-annotations. BioSQL has the same > limitation. Yes, I'm testing some predictors. 
I run predictions and compare the "newly predicted seqrecords" with the "previously correct predicted pickled seqrecords". I have them (the correct ones) only in pickled SeqRecord format. The correctly predicted records were in FASTA format before prediction; I parsed them (into SeqRecords), ran the prediction, and then pickled them (during prediction I add features and annotations to the SeqRecord). At the moment I don't use per-letter-annotation, although it seems interesting. I didn't find any example in the documentation (showing how the dictionary is populated...), so I really don't know how to use it, even though I have a "per position annotation" during prediction. Also, if the "per letter annotation" is not handled by the GenBank format or by BioSQL (which I use a lot), I have to wait!! I was also thinking of storing the PSSM information somewhere in the SeqRecord, but this would be a very big change (as would storing it in BioSQL)... but it's better to stop the discussion here or to move it... :-) Thanks Andrea From bugzilla-daemon at portal.open-bio.org Thu Jul 23 06:28:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 06:28:42 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907231028.n6NASgcX000743@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-23 06:28 EST ------- (In reply to comment #7) > ... but it's better to stop the discussion here or to move it... 
:-) Moving discussion to mailing list, see: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jul 23 07:08:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 12:08:09 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> Message-ID: <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> On Fri, Jul 10, 2009 at 1:38 PM, Peter wrote: > On Mon, Jun 22, 2009 at 6:57 PM, Peter wrote: >> >> Once the beta release is out, we'll resume taking small changes >> (especially for documentation additions or clarifications) with a >> view to releasing Biopython 1.51 final in July (probably the second >> week, after people get back from BOSC/ISMB). > > OK, that didn't happen - too much to catch up on at work after > being away at BOSC/ISMB for a week. Also I will be on holiday > next week (graduation etc). I will have some limited internet > access. I'm thinking of doing the final release of Biopython 1.51 > the following week (i.e. the week starting 20th July). > > This will be after the annual EMBOSS release, and one little thing > I want to sort out before we release Biopython 1.51 is mapping > Solexa/PHRED scores in FASTQ files (specifically what to do with > a PHRED score of zero which is usually a dummy value, but taken > literally means "this read is wrong" or "worst than random"). 
After > discussion with Peter Rice at BOSC/ISMB 2009, I plan to follow > his plan for EMBOSS (map PHRED of zero to the lowest used > Solexa score, -5). Once the EMBOSS release is out, I can use it > for cross checking our FASTQ conversions. The FASTQ checking is ongoing. I have updated our FASTQ code to map Solexa scores as I understood Peter Rice's description of the intended EMBOSS behaviour (this is for the corner case of very poor quality reads). However, due to a couple of minor bugs I found in EMBOSS 6.1.0 we'll either have to cross check against their CVS code, or hope they release EMBOSS 6.1.1 soon. Cross checking against MAQ would also be worthwhile, but while there are some patches about to fix a couple of MAQ FASTQ bugs and include Illumina to Sanger standard conversion, this isn't in their official repository yet. I guess I could cross check against BioPerl's new FASTQ support ... > Also, we have the Bio.Application.generic_run code to retire, > which basically means we label it as obsolete and update the > tutorial to use subprocess (see other thread), but this requires > cross platform testing. I still haven't got near my Windows machine to do this. I think this is important to get done in Biopython 1.51 as we are also introducing the extended set of command line wrappers. Nevertheless, a July release is still looking possible. Are there any other issues that would block the release? 
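The PHRED/Solexa corner case discussed above can be shown with a short sketch. It uses the usual definitions (PHRED = -10*log10(p), Solexa = -10*log10(p/(1-p)), where p is the error probability). The function names are made up here, and clamping other very low scores at -5 is an assumption of this sketch; the only rule taken from the message is that a PHRED score of zero maps to a Solexa score of -5.

```python
from math import log10


def solexa_from_phred(phred):
    """Convert a PHRED quality to a Solexa quality (sketch).

    PHRED 0 means p = 1, where the Solexa formula is undefined, so it is
    mapped to -5, the lowest Solexa score in use.
    """
    if phred < 0:
        raise ValueError("PHRED qualities cannot be negative")
    if phred == 0:
        return -5.0
    # Solexa = 10 * log10(10^(PHRED/10) - 1); the floor at -5 for other
    # very low scores is an assumption of this sketch.
    return max(-5.0, 10 * log10(10 ** (phred / 10.0) - 1))


def phred_from_solexa(solexa):
    """Convert a Solexa quality to a PHRED quality (sketch)."""
    # PHRED = 10 * log10(10^(Solexa/10) + 1)
    return 10 * log10(10 ** (solexa / 10.0) + 1)


for q in (0, 1, 10, 40):
    print(q, round(solexa_from_phred(q), 2))
print(round(phred_from_solexa(-5), 2))
```

For high qualities the two scales are almost identical (PHRED 40 is Solexa 39.9996), so the choice of mapping only matters for very poor reads. Going the other way, a Solexa score of -5 corresponds to a PHRED score of about 1.19, so a round trip through Solexa does not give back a PHRED score of zero.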
Peter From eric.talevich at gmail.com Thu Jul 23 11:59:32 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 23 Jul 2009 11:59:32 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML forBiopython In-Reply-To: References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <6CC117B53EF342238715843D2C185723@NewLife> Message-ID: <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> All, Thanks for the ongoing discussion and helpful links. I'm going to propose an object mapping here and see how it sits with everyone -- please correct any questionable statements. In raw XML, the clade designation looks reasonable. The attributes that blur the clade-node distinction are branch_length, confidence and node_id. In the first two, the attributes apply to an implicit root node, not the entire clade. (Stated this way, it makes much more sense in the XML representation to have branch_length as a child node, not an attribute.) The node_id clearly applies to the clade's root node, once it's understood that the node is implicit. http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h-1124608460 On Thu, Jul 23, 2009 at 2:08 AM, Chris Fields wrote: > On Jul 23, 2009, at 12:12 AM, Mark A. Jensen wrote: > >> FWIW, BioPerl has Trees and Nodes. That's it; maybe Branches later (if I >> get around to it, or convince Chase it would be a good project). > > Many of the existing generalized Tree object representations seem to be guided by the Nexus/Newick format, which is basically an s-expression. This format can represent a tree as a parenthetical expression, and a node as a token (comma-delimited, potentially combining a taxon label and branch length separated by a colon) within that expression. Edges or branches are implicit. 
So Trees, Nodes, branch lengths and labels are all we *really* need to find common ground on, but other, more expressive representations are certainly possible. I'm basing my BaseTree classes on the tables in BioSQL's PhyloDB extension (http://biosql.org/wiki/Extensions) -- which were probably in turn based on BioPerl's Tree objects, but have at least been given some extra effort towards generalization. The PhyloDB schema includes an Edge table definition, among other things. Question: The Node objects in PhyloDB have left_idx and right_idx attributes. It looks like nodes are being kept in a double-linked list, which seems like unusually low-level information to keep around since Perl, Python, Ruby and Java all have flexible array or list types that can keep track of element order efficiently. Is there a use for these indexes in general phylogenetics work that couldn't be handled by other language-specific constructs? >> In this scrap >> http://www.bioperl.org/wiki/Finding_all_clades_represented_in_a_tree >> I defined a clade as a "maximal set of leaf/tip taxa descended from a >> given single node", because that's really what the question poser wanted. >> You might expand that definition to include all branches and nodes between >> the "given node" and the tips. That would be synonyomous with "subtree". >> > > Yes, but some define clade slightly differently: > > http://en.wikipedia.org/wiki/Cladistics#Three_definitions_of_clade > Other representations are possible. > Helpful! It looks like phyloXML's interpretation is "branch-based". Note that in the spec, the Phylogeny element that the various Bio* projects have interpreted as the Tree type is defined to have exactly one Clade attribute -- presumably the root node of the tree. I'm not sure how to interpret a branch_length value for that clade; maybe it should be ignored or disallowed. >> I think I see the utility of a clade as an annotation entity: one wants to >> grant properties to subtrees ("Mammalia", e.g.). 
>> > The Clade node does have most of the important annotation types as its children -- Taxonomy, Sequence, Events, etc. Given how Nexus trees often label nodes with taxon names, the nearest phyloXML equivalent to a Node type might be Taxonomy. But in phyloXML, all of the Clade attributes and annotations apply to the root node, and potentially all sub-clades and sub-nodes that don't override this information. I don't think I'd map the basic Node type to anything but Clade for this reason. A "Node" (in BioPerl, or standard phylogenetics) can be *mapped* to a clade, >> or used to obtain a clade, *if* the tree is rooted (as Hilmar points out). >> It seems that for a rooted tree (i.e., where anc->desc relationships are >> defined), a "Clade" annotation that contained all the desired clade >> properties could be associated with the Node, because of the one-to-one >> mapping of nodes to clades in this case. In the case of an unrooted tree, a >> Clade could also be associated with a node, if the Clade also possessed a >> direction property. For example, in an unrooted tree, a Clade could be >> specified by Node + Branches of Node contained in Clade (which would be two >> of the three branches on an internal node). This would provide the direction >> of "descent". >> >> The 'rooted' and 'rerootable' attributes belong to Phylogeny, at the top of the tree. A Clade object should probably have easy access to this information for use in pruning or rerooting. This raises some questions about the role of the Phylogeny element -- is-it-really-a Tree? Or simply a wrapper with metadata about all the clades it contains, containing a single clade which is actually the top of the phylogenetic tree? In that case it could make sense for each clade to contain a direct or indirect reference to the phylogeny object, rather than the other way around. The mind reels. 
I was more comfortable calling it a Tree, as the other Bio* projects do, but then I haven't tried to integrate the Nexus tree classes yet. Conclusions: 1. A Clade is-a Tree, and also is-a Node for various operations. 2. For reusing base-class methods, a Clade should provide a 'node' attribute that behaves properly -- in most or all cases, the nodes will be the same as the list of sub-clades. 3. A Clade also needs to access some attributes of its original Phylogeny. Best regards, Eric From biopython at maubp.freeserve.co.uk Thu Jul 23 12:21:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 17:21:05 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML forBiopython In-Reply-To: <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <6CC117B53EF342238715843D2C185723@NewLife> <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> Message-ID: <320fb6e00907230921lf22fc67hafd3bc8998a4eb7e@mail.gmail.com> On Thu, Jul 23, 2009 at 4:59 PM, Eric Talevich wrote: > > Question: > The Node objects in PhyloDB have left_idx and right_idx attributes. It looks > like nodes are being kept in a double-linked list, which seems like > unusually low-level information to keep around since Perl, Python, Ruby and > Java all have flexible array or list types that can keep track of element > order efficiently. Is there a use for these indexes in general phylogenetics > work that couldn't be handled by other language-specific constructs? I would guess this is like the left/right indices used in BioSQL's taxon tree, see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html If they are being used the same way, they are an expensive-to-calculate second indexing scheme, which is useful for many tree operations. 
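If the left/right values are indeed nested-set indices, as guessed above, the idea can be sketched in a few lines. This is an illustration only, not the BioSQL or PhyloDB schema; the function names and the toy tree are made up. Each node gets a left value when a depth-first walk enters it and a right value when the walk leaves it, so every descendant's interval lies strictly inside its ancestor's interval.

```python
def assign_nested_set(children, root):
    """Return {node: (left, right)} from a depth-first numbering."""
    index = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter          # number assigned on the way in
        for child in children.get(node, ()):
            visit(child)
        counter += 1            # number assigned on the way out
        index[node] = (left, counter)

    visit(root)
    return index


def is_descendant(index, node, ancestor):
    """True if node lies strictly inside ancestor's interval."""
    n_left, n_right = index[node]
    a_left, a_right = index[ancestor]
    return a_left < n_left and n_right < a_right


def count_descendants(index, node):
    """Each descendant uses two numbers inside the interval."""
    left, right = index[node]
    return (right - left - 1) // 2


# Toy tree: root has children A and B; B has children C and D.
children = {"root": ["A", "B"], "B": ["C", "D"]}
index = assign_nested_set(children, "root")
print(index)
print(is_descendant(index, "C", "B"), is_descendant(index, "A", "B"))
print(count_descendants(index, "B"))
```

Queries such as "all nodes in this clade" then become simple range comparisons, which suits SQL well. The expensive part is that inserting or moving a node means renumbering much of the tree.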
Peter

From mjldehoon at yahoo.com Fri Jul 24 05:34:33 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 24 Jul 2009 02:34:33 -0700 (PDT)
Subject: [Biopython-dev] Calculating motif scores
Message-ID: <220087.67461.qm@web62406.mail.re1.yahoo.com>

> As for the PWM being a separate class and used by the motif:
> I don't know. I'm using Bio.SubsMat.FreqTable for implementing
> frequency table, so I understand that the new PWM class would
> be basically a "smarter" FreqTable. I'm not sure whether it
> solves any problems...

Wow, I didn't even know the Bio.SubsMat module existed.
As we have several different but related modules (Bio.Motif, Bio.SubstMat,
Bio.Align), I think we should define the purpose and scope of each of these
modules. Maybe a good way to start is the documentation. Bio.SubsMat is
currently divided into two chapters (14.4 and 16.2). I'll have a look at this
over the weekend to see if this can be cleaned up a bit.

--Michiel.

From jblanca at btc.upv.es Fri Jul 24 06:22:39 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Fri, 24 Jul 2009 12:22:39 +0200
Subject: [Biopython-dev] [Biopython] next-gen sequencing software
In-Reply-To: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com>
References: <200907241053.15954.jblanca@btc.upv.es>
 <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com>
Message-ID: <200907241222.39608.jblanca@btc.upv.es>

On Friday 24 July 2009 11:50:08 Peter wrote:
> Work on improving the Biopython alignment object and introducing a
> contig object is something I would like to see for the next release (once
> Biopython 1.51 is out).

I think that's quite necessary. Consider my code an experiment in that regard.
I will be very pleased to discuss the details of such a class. I think that my
experience with my contig implementation could be of some value.

> I'm sure there is other stuff in your code that would also be very useful.
>
> If you want to contribute code to Biopython it will have to be under our
> MIT style license, but in the meantime maybe you should stick an
> explicit license on your code?
>
> Peter

I'm aware of the biopython licence. I prefer the GPL, that's why when I
release code on my own I use it. But if some of my code could be useful to
the Biopython community I have no problem with releasing under the MIT.

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Fri Jul 24 06:40:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 11:40:44 +0100
Subject: [Biopython-dev] [Biopython] next-gen sequencing software
In-Reply-To: <200907241222.39608.jblanca@btc.upv.es>
References: <200907241053.15954.jblanca@btc.upv.es>
 <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com>
 <200907241222.39608.jblanca@btc.upv.es>
Message-ID: <320fb6e00907240340v4d4fdb9dge48458edfa085122@mail.gmail.com>

On Fri, Jul 24, 2009 at 11:22 AM, Jose Blanca wrote:
> On Friday 24 July 2009 11:50:08 Peter wrote:
>> Work on improving the Biopython alignment object and introducing a
>> contig object is something I would like to see for the next release (once
>> Biopython 1.51 is out).
>
> I think that's quite necessary. Consider my code an experiment in that regard.
> I will be very pleased to discuss the details of such a class. I think that my
> experience with my contig implementation could be of some value.

Absolutely :)

>> I'm sure there is other stuff in your code that would also be very useful.
>>
>> If you want to contribute code to Biopython it will have to be under our
>> MIT style license, but in the meantime maybe you should stick an
>> explicit license on your code?
>> >> Peter > > I'm aware of the biopython licence. I prefer the GPL, that's why when I > release code on my own I use it. But if some of my code could be useful to > the Biopython community I have no problem with releasing under the MIT. Great :) For reference, http://biopython.org/DIST/LICENSE Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 06:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:48:04 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > Hi all at Biopython (and EMBOSS-dev CC'd), > > Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. > As I mentioned on the Biopython mailing list a week ago, in particular I'd > like to make sure we agree on the various FASTQ variants. I'm waiting > for EMBOSS to update the documentation on their website, but as I > recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test > this afternoon, they are using: > > fastq - FASTQ where the qualities are ignored (useful for input?) > fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33 > fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64 > fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64 > > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). > ... I was however surprised that using "fastq" as an output format > in EMBOSS seqret gives quality strings of double quote characters. 
To be more precise, it looks like "fastq" as an output format in EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html In any case, it would still make sense to include "fastq-sanger" as an alias for the Sanger standard FASTQ files in Biopython's SeqIO, especially if BioPerl is also going to use that name (to be confirmed): http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 08:40:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 13:40:55 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> Message-ID: <320fb6e00907240540i17f7f3f0kdf144c79ccbfdae@mail.gmail.com> On Fri, Jul 24, 2009 at 11:48 AM, Peter wrote: > > To be more precise, it looks like "fastq" as an output format in > EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html Confirmed, http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000602.html > In any case, it would still make sense to include "fastq-sanger" as > an alias for the Sanger standard FASTQ files in Biopython's SeqIO, > especially if BioPerl is also going to use that name (to be confirmed): > http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Confirmed, BioPerl will support "fastq" or "fastq-sanger" to mean the Sanger standard FASTQ files: http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030691.html I've updated Biopython's SeqIO in CVS to support "fastq-sanger" as an alias for "fastq". 
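
As a rough summary of the format names discussed in this thread, the following sketch records each FASTQ variant with its score type and ASCII offset, and treats plain "fastq" as another name for the Sanger variant. The dictionary and function names are hypothetical; this is not how Biopython's SeqIO stores its format registry.

```python
# (score type, ASCII offset) for each variant named in this thread.
FASTQ_VARIANTS = {
    "fastq-sanger": ("phred", 33),    # Sanger standard
    "fastq-solexa": ("solexa", 64),   # early Solexa/Illumina
    "fastq-illumina": ("phred", 64),  # Illumina 1.3+
}

# Plain "fastq" is taken to mean the Sanger standard.
FASTQ_ALIASES = {"fastq": "fastq-sanger"}


def describe_variant(name):
    """Return (canonical name, score type, offset) for a format name."""
    canonical = FASTQ_ALIASES.get(name, name)
    score_type, offset = FASTQ_VARIANTS[canonical]
    return canonical, score_type, offset
```

For example, describe_variant("fastq") and describe_variant("fastq-sanger") give the same answer, which is the behaviour agreed between the three projects.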
Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 09:32:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:32:49 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS Message-ID: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Hi all, Peter Rice kindly said he will look into an OBF cross project mailing list, but in the meantime this has been cross posted to the Biopython, BioPerl, and EMBOSS development lists. On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields wrote: >> I'd like to get comparisons against BioPerl's new FASTQ support >> going too. To do this I'd need to know which (branch?) of BioPerl I >> should install, and I'd also like a trivial sample BioPerl script to do >> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say >> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning >> the Sanger Standard FASTQ). > > You would have to install svn (bioperl-live) if you want the refactored > fastq. ?That commit was within the last month. I've got SVN bioperl-live installed and apparently working :) >> i.e. Something like this four line Biopython script would be perfect: >> http://biopython.org/wiki/Reading_from_unix_pipes > > We use named parameters so it's a little more verbose. > > use Bio::SeqIO; > my $in ?= Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger'); > my $out = Bio::SeqIO->new(-format => 'fastq-solexa'); > while (my $seq = $in->next_seq) { $out->write_seq($seq) } > > Don't be surprised if there are still bugs lurking about, just let me know > and I'll fix 'em. I've got a bug report coming up in a second email, but the basics work :) e.g. 
Using this Sanger style FASTQ file, and converting it to Solexa style
http://biopython.org/SRC/biopython/Tests/Quality/example.fastq

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

This is a simple three record FASTQ file (in the Sanger format).

Using EMBOSS 6.1.0:

$ seqret -filter -sformat fastq-sanger -osformat fastq-solexa < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
ZZZZZZZZZZZXZVZZMVZRXRRRR

Using BioPerl:

$ perl bioperl_sanger2solexa.pl < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
ZZZZZZZZZZZXZVZZMVZRXRRRR

Using Biopython:

$ python biopython_sanger2solexa.py < example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
ZZRZZZZZZZZZZZZVZZZZZZZWW
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
ZZZZZZZZZZZVZZZZZLZZZRZWR
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
ZZZZZZZZZZZXZVZZMVZRXRRRR

They all agree, except that Biopython has followed the MAQ convention of
omitting the (optional) repeat of the captions on the plus lines.
This is something I'd already asked Peter Rice about for EMBOSS (but I think we got sidetracked): http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 09:53:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:53:40 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >> >> Don't be surprised if there are still bugs lurking about, just let me >> know and I'll fix 'em. > > I've got a bug report coming up in a second email, but the basics work :) I think I have found a bug in BioPerl's conversion from fastq-solexa to fastq-sanger concerning lower quality scores. Here is an artificial Solexa file using the Solexa scores from 40 down to -5 (which I believe to be the full range expected from an instrument). $ more solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; A Solexa quality of 40 maps to ASCII 40+64 = 104, "h" A Solexa quality of -5 maps to ASCII -5+64 = 59, ";" You should find this example has Solexa scores 40, 39, .., -4, -5. This file is in the Biopython repository under biopython/Tests/Quality Here is the conversion using MAQ (with the chomp fix from Tim Yu to remove an extra "!" 
character, see the maq-help mailing list for 10 July 2009):
http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help

$ perl fq_all2std.pl sol2std < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""

Here is the Biopython conversion, which is identical:

$ python biopython_solexa2sanger.py < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""

EMBOSS 6.1.0 has a rounding issue with negative Solexa scores, and the
last six qualities are up by one - Peter Rice is aware of this, and has a fix:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000596.html

$ seqret -filter -sformat fastq-solexa -osformat fastq-sanger < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
IHGFEDCBA@?>=<;:9876543210/.-,+*)(''&%%$$##"""

Now we come to BioPerl,

$ perl bioperl_solexa2sanger.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
IHGFEDCBA@?>=<;:9876543210/.-,+++*)(''&&&&%%%%

You look fine for the higher qualities, but there is something really
wrong for the lower scores (not just the negative ones).
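
For anyone wanting to check these numbers by hand, here is a minimal sketch of the two score conversions, using the standard relationships PHRED = 10*log10(10^(S/10) + 1) and Solexa = 10*log10(10^(Q/10) - 1), rounded to the nearest integer. The function names are hypothetical and this is not the code of any of the three projects. The floor of -5 for Solexa scores is an assumption taken from the 40 to -5 range used in the test file.

```python
import math

SANGER_OFFSET = 33
SOLEXA_OFFSET = 64
MIN_SOLEXA = -5  # assumed floor, from the 40 to -5 range in the test file


def solexa_to_phred(solexa):
    """Convert one Solexa score to the nearest integer PHRED score."""
    return int(round(10 * math.log10(10 ** (solexa / 10.0) + 1)))


def phred_to_solexa(phred):
    """Convert one PHRED score to the nearest integer Solexa score."""
    if phred <= 0:
        return MIN_SOLEXA  # the formula is undefined for PHRED 0
    solexa = 10 * math.log10(10 ** (phred / 10.0) - 1)
    return max(MIN_SOLEXA, int(round(solexa)))


def solexa_string_to_phred_scores(quality):
    """Decode a fastq-solexa quality string into PHRED integers."""
    return [solexa_to_phred(ord(c) - SOLEXA_OFFSET) for c in quality]


def sanger_string_to_solexa_string(quality):
    """Re-encode a fastq-sanger quality string as fastq-solexa."""
    return "".join(
        chr(phred_to_solexa(ord(c) - SANGER_OFFSET) + SOLEXA_OFFSET)
        for c in quality
    )


# Solexa scores 40 down to -5, as in solexa_faked.fastq.
faked_solexa = "".join(chr(s + SOLEXA_OFFSET) for s in range(40, -6, -1))
faked_phred = solexa_string_to_phred_scores(faked_solexa)
```

With these formulas the low end of faked_phred comes out as 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1, matching the MAQ and Biopython quality strings shown above.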
I'll leave you to double check the details, but here are the Sanger PHRED qualities decoded into integers (using Biopython to convert from "fastq-sanger" to "qual" output): $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4 $ perl fq_all2std.pl sol2std < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1 Peter C. P.S. This is the BioPerl script I am using here: $ more bioperl_solexa2sanger.pl use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-solexa'); my $out = Bio::SeqIO->new(-format => 'fastq-sanger'); while (my $seq = $in->next_seq) { $out->write_seq($seq) }; From biopython at maubp.freeserve.co.uk Fri Jul 24 11:12:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 16:12:57 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> Message-ID: <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> On Fri, Jul 24, 2009 at 2:53 PM, Peter wrote: > On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >>> >>> Don't be surprised if there are still bugs lurking about, just let me >>> know and I'll fix 'em. >> >> I've got a bug report coming up in a second email, but the basics work :) > > I think I have found a bug in BioPerl's conversion from fastq-solexa > to fastq-sanger concerning lower quality scores. Next up is an issue with BioPerl converting from Sanger to Illumina. 
In principle this is simple - the quality strings both use PHRED scores just with different offsets. With lower PHRED scores, everything is fine: $ more sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this is an example constructed by hand to cover a broad range of valid scores, and can be found in the Biopython repository under biopython/Tests/Quality $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN +Test PHRED qualities from 40 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ python biopython_sanger2illumina.py < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ So, BioPerl and Biopython (and EMBOSS) agree - apart from the repeating second title on the plus line. I understand that EMBOSS will in future omit the repeated title on the plus line: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html Now, here comes the problem. I believe FASTQ files directly from an Illumina 1.3+ pipeline will have PHRED scores in the range 0 to 40 (as in this example). However, much higher PHRED scores are possible during assembly / contig'ing and read mapping. For example, the tool MAQ will output Sanger style FASTQ files with PHRED scores in the range 0 to 93 inclusive. Now, in the Sanger FASTQ format, PHRED scores of 0 to 93 map onto ASCII values of 33 to 126 (! to ~). There is a reason for stopping at 126, since ASCII 127 is "delete". However, in the Illumina 1.3+ FASTQ format, PHRED scores of 0 to 93 would map to ASCII values of 64 to 157, which includes a lot of non printing characters. Working with such files at the command line or in an editor is a big problem. Clearly, Illumina never intended to include such high scores in their FASTQ files! 
Nevertheless, it is possible to write a FASTQ format following the Illumina 1.3+ encoding with these values. Biopython and EMBOSS attempt to do this - although I would regard throwing an error as equally acceptable. So, here is another hand constructed example of a Sanger style FASTQ file using the full quality range: $ more sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN + ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this example is in the Biopython repository under biopython/Tests/Quality Just to check: $ python biopython_sanger2qual.py < sanger_93.fastq >Test PHRED qualities from 93 to 0 inclusive 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 So, here we go - apologies for the expected line mangling: $ seqret -filter -sformat fastq-sanger -osformat fastq-illumina < sanger_93.fastq | hexdump -C -v 00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual| 00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to| 00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT| 00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |GACTGACTG.ACTGAC| 00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |TGACTGACTGACTGAC| 00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |TGACTGACTGAN.+Te| 00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |st..............| 000000a0 90 8f 8e 8d 8c 8b 8a 89 88 
87 86 85 84 83 82 81 |................| 000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|{zyxwvutsrq| 000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |ponmlkjihgfedcb.| 000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\[ZYXWVUTSR| 000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |QPONMLKJIHGFEDCB| 000000f0 41 40 0a |A at .| 000000f3 $ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C -v00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual| 00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to| 00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT| 00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |GACTGACTGAN.+...| 00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c |................| 000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c |.............~}|| 000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |{zyxwvutsrqponml| 000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |kjihgfedcba`_^]\| 000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |[ZYXWVUTSRQPONML| 000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |KJIHGFEDCBA at .| 000000ed Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to 64 in decimal, which after subtracting the Illumina offset of 64, gives PHRED scores of 93 to 0 as desired. 
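
The byte values in the hexdumps can be reproduced with a small sketch of the offset shift from Sanger (33) to Illumina 1.3+ (64). The function name is hypothetical, and raw bytes are returned (Python 3) because values above 126 are not printable ASCII.

```python
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64


def sanger_to_illumina_bytes(quality):
    """Shift a Sanger quality string to offset 64, returning raw bytes."""
    return bytes(ord(c) - SANGER_OFFSET + ILLUMINA_OFFSET for c in quality)


# PHRED 93 down to 0 in Sanger encoding, as in sanger_93.fastq.
sanger_93 = "".join(chr(q + SANGER_OFFSET) for q in range(93, -1, -1))
illumina_93 = sanger_to_illumina_bytes(sanger_93)

# Decode again to confirm the round trip.
round_trip = [b - ILLUMINA_OFFSET for b in illumina_93]
```

Here illumina_93 starts at byte 0x9d and ends at 0x40, and round_trip recovers 93 down to 0.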
Now to BioPerl, $ perl bioperl_sanger2illumina.pl < sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN +Test PHRED qualities from 93 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v ... BioPerl has output an invalid FASTQ file - it seems to omit the quality scores for the top scoring nucleotides at the start. The BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40 (in hex), giving 104 to 64 in decimal, giving PHRED values of 40 to 0. I think BioPerl should either throw an error, or output the non printing characters as done by Biopython and EMBOSS. Regards, Peter C. (@Biopython) From mjldehoon at yahoo.com Sat Jul 25 11:28:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 25 Jul 2009 08:28:35 -0700 (PDT) Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) Message-ID: <311853.75944.qm@web62407.mail.re1.yahoo.com> Hi everybody, Over the weekend I was looking at Bio.SubsMat and its documentation. There are a few points in Bio.SubstMat that would be handled differently in modern Python, but I'd thought I'd raise them here first before I make any changes: 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at the level of Bio.SubsMat). I think that these should be class variables of the Bio.SubsMat.SeqMat class. 2) The print_mat method. It would be more Pythonic to use __str__, __format__ for this, though the latter is only available for Python versions >= 2.6. 3) The __sum__ method. I guess that this was intended to be __add__? 4) The sum_letters attribute. 
To calculate the sum of all values for a given letter, currently the
following two functions are involved:

    def all_letters_sum(self):
        for letter in self.alphabet.letters:
            self.sum_letters[letter] = self.letter_sum(letter)

    def letter_sum(self,letter):
        assert letter in self.alphabet.letters
        sum = 0.
        for i in self.keys():
            if letter in i:
                if i[0] == i[1]:
                    sum += self[i]
                else:
                    sum += (self[i] / 2.)
        return sum

As you can see, the result is not returned, but stored in an attribute
called sum_letters. I suggest to replace this with the following:

    def sum(self):
        result = {}
        for letter in self.alphabet.letters:
            result[letter] = 0.0
        for pair, value in self:
            i1, i2 = pair
            if i1==i2:
                result[i1] += value
            else:
                result[i1] += value / 2
                result[i2] += value / 2
        return result

so without storing the result in an attribute.

Any comments, objections?

--Michiel

--- On Fri, 7/24/09, Michiel de Hoon wrote:

> From: Michiel de Hoon
> Subject: Re: [Biopython-dev] Calculating motif scores
> To: "Bartek Wilczynski"
> Cc: biopython-dev at biopython.org
> Date: Friday, July 24, 2009, 5:34 AM
>
> > As for the PWM being a separate class and used by the
> motif:
> > I don't know. I'm using Bio.SubsMat.FreqTable for
> implementing
> > frequency table, so I understand that the new PWM
> class would
> > be basically a "smarter" FreqTable. I'm not sure
> whether it
> > solves any problems...
>
> Wow, I didn't even know the Bio.SubsMat module existed.
> As we have several different but related modules
> (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should
> define the purpose and scope of each of these modules.
> Maybe a good way to start is the documentation. Bio.SubsMat
> is currently divided into two chapters (14.4 and 16.2). I'll
> have a look at this over the weekend to see if this can be
> cleaned up a bit.
>
> --Michiel.
>
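
The per-letter sum proposed in this message can be tried out as a standalone function. This is only a sketch, assuming the matrix behaves like a dictionary keyed by letter pairs; it is not the Bio.SubsMat implementation, and the function name and toy counts are hypothetical. Two details differ from the code as posted: it iterates over items() (iterating over a dictionary directly yields only the keys), and it divides by 2.0 so that integer counts are not truncated on Python 2.

```python
def letter_sums(matrix, letters):
    """Return {letter: sum}, splitting off-diagonal values between both letters."""
    result = dict((letter, 0.0) for letter in letters)
    for (i1, i2), value in matrix.items():
        if i1 == i2:
            result[i1] += value          # diagonal entry counts once
        else:
            result[i1] += value / 2.0    # half to each letter of the pair
            result[i2] += value / 2.0
    return result


# Hypothetical toy counts, purely for illustration.
toy_matrix = {("A", "A"): 4, ("A", "C"): 2, ("C", "C"): 6}
toy_sums = letter_sums(toy_matrix, "AC")
```

Returning a fresh dictionary, rather than filling a sum_letters attribute, keeps the function free of side effects, which is the point of the proposal.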
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From idoerg at gmail.com Sat Jul 25 16:57:59 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 25 Jul 2009 13:57:59 -0700 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: References: <311853.75944.qm@web62407.mail.re1.yahoo.com> Message-ID: I'm the author of subsmat IIRC. Everything sounds good, but I would not make 2.6 changes that will break on 2.5. Ubuntu still uses 2.5 and I imagine other linux distros do too. Thanks, Iddo Would code those in myself, but I'm moving. Iddo Friedberg http://iddo-friedberg.net/contact.html On Jul 25, 2009 8:35 AM, "Michiel de Hoon" wrote: Hi everybody, Over the weekend I was looking at Bio.SubsMat and its documentation. There are a few points in Bio.SubstMat that would be handled differently in modern Python, but I'd thought I'd raise them here first before I make any changes: 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at the level of Bio.SubsMat). I think that these should be class variables of the Bio.SubsMat.SeqMat class. 2) The print_mat method. It would be more Pythonic to use __str__, __format__ for this, though the latter is only available for Python versions >= 2.6. 3) The __sum__ method. I guess that this was intended to be __add__? 4) The sum_letters attribute. To calculate the sum of all values for a given letter, currently the following two functions are involved: def all_letters_sum(self): for letter in self.alphabet.letters: self.sum_letters[letter] = self.letter_sum(letter) def letter_sum(self,letter): assert letter in self.alphabet.letters sum = 0. for i in self.keys(): if letter in i: if i[0] == i[1]: sum += self[i] else: sum += (self[i] / 2.) 
return sum As you can see, the result is not returned, but stored in an attribute called sum_letters. I suggest to replace this with the following: def sum(self): result = {} for letter in self.alphabet.letters: result[letter] = 0.0 for pair, value in self: i1, i2 = pair if i1==i2: result[i1] += value else: result[i1] += value / 2 result[i2] += value / 2 return result so without storing the result in an attribute. Any comments, objections? --Michiel --- On Fri, 7/24/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [Biopython-dev] Calculating motif scores > To: "Bartek Wilczynski" > Cc: biopython-dev at biopython.org > Date: Friday, July 24, 2009, 5:34 AM > > > As for the PWM being a separate class and used by the > motif: > > I don't know. I'm using Bio.SubsMat.FreqTable for > implementing > > frequency table, so I understand that the new PWM > class would > > be basically a "smarter" FreqTable. I'm not sure > whether it > > solves any problems... > > Wow, I didn't even know the Bio.SubsMat module existed. > As we have several different but related modules > (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should > define the purpose and scope of each of these modules. > Maybe a good way to start is the documentation. Bio.SubsMat > is currently divided into two chapters (14.4 and 16.2). I'll > have a look at this over the weekend to see if this can be > cleaned up a bit. > > --Michiel. 
> > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Sat Jul 25 17:12:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Jul 2009 22:12:26 +0100 Subject: [Biopython-dev] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> Message-ID: <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote: > >> Now, here comes the problem. I believe FASTQ files directly >> from an Illumina 1.3+ pipeline will have PHRED scores in the >> range 0 to 40 (as in this example). However, much higher >> PHRED scores are possible during assembly / contig'ing >> and read mapping. For example, the tool MAQ will output >> Sanger style FASTQ files with PHRED scores in the range >> 0 to 93 inclusive. > > Is this behavior documented anywhere, specifically by Illumina (that values > can exceed 40)? If Illumina 1.3 is specified as being PHRED 0-40, and > another (non-Illumina) software package pushes that limit above the > specified range of Illumina values, I would consider that unfortunately yet > another variant. > > We can support it as Illumina 1.3, but my point is this may getting into a > grey area and may be something that Illumina doesn't/wouldn't support. > Reminds me a little of the multiple GFF2 variations (one of the main > reasons for a GFF3). 
I agree this is an grey area (high scores in Solexa/Illumina FASTQ files). >> Now, in the Sanger FASTQ format, PHRED scores of 0 to >> 93 map onto ASCII values of 33 to 126 (! to ~). There is a >> reason for stopping at 126, since ASCII 127 is "delete". >> >> However, in the Illumina 1.3+ FASTQ format, PHRED >> scores of 0 to 93 would map to ASCII values of 64 to >> 157, which includes a lot of non printing characters. >> Working with such files at the command line or in an >> editor is a big problem. Clearly, Illumina never intended >> to include such high scores in their FASTQ files! > > Exactly. > >> Nevertheless, it is possible to write a FASTQ format >> following the Illumina 1.3+ encoding with these values. >> Biopython and EMBOSS attempt to do this - although I >> would regard throwing an error as equally acceptable. >> >> So, here is another hand constructed example of a >> Sanger style FASTQ file using the full quality range: >> >> ... >> >> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree >> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to >> 64 in decimal, which after subtracting the Illumina offset of 64, gives >> PHRED scores of 93 to 0 as desired. >> >> Now to BioPerl, >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq >> ... >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v >> ... >> >> BioPerl has output an invalid FASTQ file - it seems to omit the >> quality scores for the top scoring nucleotides at the start. The >> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40 >> (in hex), giving 104 to 64 in decimal, giving PHRED values of >> 40 to 0. I think BioPerl should either throw an error, or output >> the non printing characters as done by Biopython and EMBOSS. > > If this is accepted as common practice between BioPython and EMBOSS > we will follow similarly. I do think it's worth at least a warning for the > reasons outlined above (e.g. 
it likely isn't Illumina's intent to support qual
> values outside the specified range). Might be worth checking into.

True. I think what EMBOSS and Biopython are doing is reasonable (although a warning in this situation makes sense). Equally, an error is a valid option.

However, one question is when would you issue the warning/error? For a PHRED score above 40? (Assuming we have a definitive reference for Illumina using just 0 to 40). How about if a problem character would result? Since ASCII 64+63=127, the first problem character would be for PHRED score 63. i.e. An Illumina FASTQ format file can hold PHRED scores in the range 0 to 62 without using problem characters. And likewise for a Solexa FASTQ file (Solexa scores up to 62).

> From this it could be summarized that converting to sanger format is least
> problematic, as possible issues may be encountered when converting to the
> other variants.

Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93 while using nice ASCII characters - this means it is suitable for both raw reads and processed data from assemblies or read mappings.

In my personal experience, Solexa/Illumina FASTQ files tend to get converted into the Sanger FASTQ format for downstream analysis (e.g. the MAQ tool, or the NCBI short read archive). i.e. Writing high quality reads (i.e. above PHRED 40) to Solexa or Illumina FASTQ files is unlikely.

> We'll need to fix the solexa quality calculations in the BioPerl
> parser as noted in your previous post; I'll work on that.

Great.
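
The offset arithmetic in this thread can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above (offsets 33 and 64, last printable character ASCII 126); the constant and function names are hypothetical and are not taken from Biopython, BioPerl or EMBOSS. The sketch takes the "throw an error" option for scores that would need a problem character; writing them anyway, as Biopython and EMBOSS are described as doing, would be the alternative.

```python
# Hypothetical helper names, for illustration only.
SANGER_OFFSET = 33     # Sanger FASTQ: PHRED 0 is stored as "!" (ASCII 33)
ILLUMINA_OFFSET = 64   # Illumina 1.3+ FASTQ: PHRED 0 is stored as "@" (ASCII 64)
LAST_PRINTABLE = 126   # "~"; ASCII 127 is "delete"


def max_printable_phred(offset):
    """Return the highest PHRED score that maps to ASCII 126 or below."""
    return LAST_PRINTABLE - offset


def sanger_to_illumina(quality_string):
    """Re-encode a Sanger quality string using the Illumina 1.3+ offset.

    Raises ValueError rather than emitting characters above ASCII 126.
    """
    limit = max_printable_phred(ILLUMINA_OFFSET)
    converted = []
    for char in quality_string:
        phred = ord(char) - SANGER_OFFSET
        if phred < 0 or phred > limit:
            raise ValueError("PHRED score %i is outside 0 to %i" % (phred, limit))
        converted.append(chr(phred + ILLUMINA_OFFSET))
    return "".join(converted)
```

With these definitions, max_printable_phred(SANGER_OFFSET) is 93 and max_printable_phred(ILLUMINA_OFFSET) is 62, matching the ranges discussed above.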
Peter

From biopython at maubp.freeserve.co.uk Sat Jul 25 17:18:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 25 Jul 2009 22:18:41 +0100
Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores)
In-Reply-To: <311853.75944.qm@web62407.mail.re1.yahoo.com>
References: <311853.75944.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00907251418s66499a4cy5491a27af5c1b458@mail.gmail.com>

On Sat, Jul 25, 2009 at 4:28 PM, Michiel de Hoon wrote:
> ...
>
> 2) The print_mat method. It would be more Pythonic to use __str__,
> __format__ for this, though the latter is only available for Python
> versions >= 2.6.

You can define a __format__ method on older versions of Python, it just won't do anything. For the SeqRecord and Alignment we have already added these, and also included a format method as an alias (principally to make the functionality available on pre-Python 2.6). Using the __format__ method requires some concept of format names...

The "print_mat" function sounds like it has similarities to the "pretty print" code for trees that has come up on the Tree/TreeIO thread. The existing Bio.Nexus tree object already has something like this as the "display" method.

I'd have to spend some time looking at the code in more detail to comment on the other issues.

Peter

From biopython at maubp.freeserve.co.uk Sat Jul 25 17:21:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 25 Jul 2009 22:21:16 +0100
Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788)
In-Reply-To: <4A6560E2.4030502@biologie.uni-kl.de>
References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> <4A6560E2.4030502@biologie.uni-kl.de>
Message-ID: <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com>

On Tue, Jul 21, 2009 at 7:32 AM, Frank Kauff wrote:
>
> Hi all,
>
> Peter wrote:
>> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote:
>>
>>> Hi all, here is my weekly update...
>>>
>>> 1.
Bug fix on Nexus.Tree class is working well so far. Thanks Brad!!
>>
>> Cool. I haven't tried it personally though ;) Frank and/or Cymon - any
>> comments regarding Brad checking this in? See Bug 2788 for details.
>
> Not at all - you're most welcome. Thanks for dealing with it.
>
> Frank

Sounds like you should probably check in that fix then Brad :)

Peter

From pmr at ebi.ac.uk Mon Jul 27 04:55:43 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 27 Jul 2009 09:55:43 +0100
Subject: [Biopython-dev] Open-bio cross-project issues
In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
Message-ID: <4A6D6B8F.9060108@ebi.ac.uk>

Peter C. wrote (to bioperl-l, biopython-l, emboss-dev):
> Hi all,
>
> Peter Rice kindly said he will look into an OBF cross project mailing
> list, but in the meantime this has been cross posted to the Biopython,
> BioPerl, and EMBOSS development lists.

There is a list already for this purpose - open-bio-l

I think we will also need a cross-project wiki space on the OBF site. Is there something already used by other projects or should we set something up?

I am cross-posting this to other OBF project lists to encourage developers interested in combining efforts to address common problems. This started with FASTQ short read formats, and open-bio-l (a low volume list) has also seen discussion of common test data sets.

Please sign up to open-bio-l (if you are not there already) and post suggestions for cross-project issues there.
The list subscription page is: http://lists.open-bio.org/mailman/listinfo/open-bio-l Please feel free to forward this to any other projects I may have missed (I picked the obvious addresses from the list.open-bio-org server) regards, Peter Rice From bugzilla-daemon at portal.open-bio.org Mon Jul 27 08:27:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 08:27:05 -0400 Subject: [Biopython-dev] [Bug 2788] Bio.Nexus.Trees newick parser does not support internal node labels In-Reply-To: Message-ID: <200907271227.n6RCR51v032090@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from chapmanb at 50mail.com 2009-07-27 08:27 EST ------- Patch verified and checked in with unit tests: Checking in Bio/Nexus/Trees.py; new revision: 1.19; previous revision: 1.18 Checking in Tests/test_Nexus.py; new revision: 1.9; previous revision: 1.8 Marking bug as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 10:48:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 27 Jul 2009 10:48:55 -0400
Subject: [Biopython-dev] [Bug 2887] New: set iteration order dependency in Bio.Data.CodonTable
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2887

           Summary: set iteration order dependency in Bio.Data.CodonTable
           Product: Biopython
           Version: 1.51b
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: fwereade at googlemail.com

Running under IronPython 2.0.1 with ironclad r515 (http://code.google.com/p/ironclad )

symptoms:
---------------------------------------------
from Bio.Data import CodonTable
  File "C:\dev\biopython-1.51b\Bio\Data\CodonTable.py", line 618, in C:\dev\biopython-1.51b\Bio\Data\CodonTable.py
    assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
AssertionError
---------------------------------------------

cause: set iteration order is different in IronPython (it may also be different in Jython and/or PyPy, and has the potential to change across CPython versions)

fix: make Bio.Data.CodonTable.py:618 read as follows
---------------------------------------------
assert set(list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values)) == set(['TGA', 'TAA', 'TAG', 'TAR', 'TRA'])
---------------------------------------------

better fix: as above, but for all similar lines (the preceding lines currently work under ipy)

just a thought: it might also be worth moving all the tests into the Tests directory, rather than running them inline every time.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
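
The order dependency reported in Bug 2887 can be illustrated with a small, self-contained sketch. It does not reproduce the actual CodonTable code or the change that was checked in; the helper name same_members is hypothetical. It shows two order-independent comparisons: comparing as sets (the reporter's suggestion) and comparing sorted lists (which keeps working with a list-returning API and also notices duplicates).

```python
def same_members(found, expected):
    """Compare two lists of codons ignoring order (duplicates still count)."""
    return sorted(found) == sorted(expected)


expected = ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']

# Stand-in for a result assembled by iterating over a set: the order of
# "found" depends on the Python implementation and must not be relied on.
found = list(set(expected))

# Both checks pass whatever order the set happened to yield.
assert set(found) == set(expected)
assert same_members(found, expected)
```

A plain `found == expected` comparison, as in the original inline assert, only passes when the interpreter happens to iterate the set in the expected order.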
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 11:06:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 11:06:30 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271506.n6RF6Uqj007530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-27 11:06 EST ------- Fixed in Bio/Data/CodonTable.py CVS revision 1.15 Thanks! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 27 11:08:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 11:08:11 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271508.n6RF8Be4007635@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-27 11:08 EST ------- Fixed in Bio/Data/CodonTable.py CVS revision 1.15 so marking bug as fixed. Note I opted to preserve the existing API (i.e. return lists), so didn't use your suggested fix. Please let us know if there are any other issues with IronPython. Thanks. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Mon Jul 27 12:18:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 17:18:11 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> Message-ID: <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> On Thu, Jul 23, 2009 at 12:08 PM, Peter wrote: > > The FASTQ checking is on going. I have updated our FASTQ > code to map Solexa scores as I understood Peter Rice's > description of the intended EMBOSS behaviour (this is for the > corner case of very poor quality reads). However, due to a > couple of minor bugs I found in EMBOSS 6.1.0 we'll either > have to cross check against their CVS code, or hope they > release EMBOSS 6.1.1 soon. > > Cross checking against MAQ would also be worthwhile, but > while there are some patches about to fix a couple of MAQ > FASTQ bugs and include Illumina to Sanger standard > conversion, this isn't in their official repository yet. > > I guess I could cross check against BioPerl's new FASTQ > support ... The FASTQ cross-validation is on going, as you may have gathered from the cross-project thread (now on open-bio-l) I did start testing against BioPerl SVN which uncovered some BioPerl problems, and a grey area of the format worth debate. See also: http://lists.open-bio.org/pipermail/open-bio-l/2009-July/ This is taking longer than I had expected, but think it will be worth the effort. Peter P.S. Anyone care to guess on how EMBOSS, BioPerl, and Biopython's FASTQ parsing stacks up in terms of run time? 
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 12:41:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 12:41:02 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271641.n6RGf2M8011652@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 ------- Comment #3 from fwereade at googlemail.com 2009-07-27 12:41 EST ------- Sweet! I think that's the fastest bugfix I've ever seen :-). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mhampton at d.umn.edu Mon Jul 27 13:05:05 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 27 Jul 2009 12:05:05 -0500 (CDT) Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: I am wondering if there is already an interface to the Phylip programs in biopython. I am pretty sure there is not, but I wanted to ask before doing a chunk of work on one. I know that AlignIO can read and write the phylip alignment files, but I think that is it. Assuming such a thing doesn't already exist, I will write some functions for calling various combinations of programs in phylip to make some common tasks easier. Mostly this will use the pexpect module. What is the most appropriate place to put such an interface within biopython? 
Thanks,
Marshall Hampton
Integrated Biosciences Program and the Department of Mathematics and Statistics
University of Minnesota Duluth

From biopython at maubp.freeserve.co.uk Mon Jul 27 13:24:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Jul 2009 18:24:57 +0100
Subject: [Biopython-dev] Phylip interface questions
In-Reply-To:
References:
Message-ID: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com>

On Mon, Jul 27, 2009 at 6:05 PM, Marshall Hampton wrote:
>
> I am wondering if there is already an interface to the Phylip programs in
> biopython. I am pretty sure there is not, but I wanted to ask before doing
> a chunk of work on one. I know that AlignIO can read and write the phylip
> alignment files, but I think that is it.
>
> Assuming such a thing doesn't already exist, I will write some functions for
> calling various combinations of programs in phylip to make some common tasks
> easier. Mostly this will use the pexpect module. What is the most
> appropriate place to put such an interface within biopython?

I really wouldn't go down the route of trying to wrap the original PHYLIP tools, it would involve piping simulated keypresses to stdin - very tricky (even if the python module pexpect is wonderful).

I would instead wrap the EMBOSS packaged versions of the PHYLIP suite, which have proper command line interfaces with switches etc. In this case, Bio/Emboss/Applications.py would be the file to look at. However, something I have been discussing with Peter Rice at EMBOSS is using their ACD files (which define the EMBOSS tools command line interfaces) to automatically generate the Biopython wrappers.
Peter From eric.talevich at gmail.com Mon Jul 27 13:56:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 27 Jul 2009 13:56:40 -0400 Subject: [Biopython-dev] GSoC Weekly Update 10: PhyloXML for Biopython Message-ID: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Hi folks, Previously (July 20-24) I: Finished implementing I/O methods, Tree classes and tests for all phyloXML elements. Changed Writer to preserve node order in the XML; output now validates under the phyloXML 1.00 schema (but 1.10 complains) Did some drastic code reorganization. - Bio.Tree: - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree classes - Made Clade inherit from BaseTree.Tree in addition to BaseTree.Node, and added the corresponding attributes - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML - Bio.TreeIO: - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new Bio.TreeIO module, and updated imports everywhere - Added wrappers for Nexus read/write; doesn't return Bio.Tree objects yet though Added/updated unit tests for all of this. Documented the code reorg on the Biopython wiki, adding Tree and TreeIO pages and fixing the examples on the PhyloXML page. Scrubbed docstrings and enabled epydoc processing. This week (July 27-31) I will: Finish implementing the phyloXML spec: - Scan "simple types" for restricted tokens; check strings in constructors - Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?) 
- Clean up and reorganize any code that needs it Enhancements (time permitting): - Improve the SeqRecord conversion - Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extension - Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree, Bioperl node objects, PyCogent, p4-phylogenetics - Tree method: build_index (set left_idx, right_idx on all nodes): - calculate left/right indexes for nested-set representation - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html - Export to networkx (http://networkx.lanl.gov/) -- also get graphviz export for free, via networkx.to_agraph() Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. - PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? 
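
The nested-set indexing mentioned under "build_index" above can be sketched independently of any Biopython class. The Node class and the function below are hypothetical stand-ins, not the Bio.Tree.BaseTree API; the sketch only shows the usual depth-first numbering, where each node receives a left index on the way down and a right index on the way back up, so that a node's descendants are exactly the nodes whose indexes fall between its own.

```python
class Node(object):
    """Hypothetical minimal tree node (not the Bio.Tree.BaseTree class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.left_idx = None
        self.right_idx = None


def build_index(node, counter=1):
    """Set left_idx/right_idx on node and its descendants, depth first.

    Returns the next unused index. Recursive, so very deep trees could
    hit Python's recursion limit.
    """
    node.left_idx = counter
    counter += 1
    for child in node.children:
        counter = build_index(child, counter)
    node.right_idx = counter
    return counter + 1


def is_descendant(node, ancestor):
    """True if node lies strictly inside ancestor's left/right interval."""
    return ancestor.left_idx < node.left_idx and node.right_idx < ancestor.right_idx


# Example tree: root -> (A, B -> (C, D))
c, d = Node("C"), Node("D")
a, b = Node("A"), Node("B", [c, d])
root = Node("root", [a, b])
build_index(root)
```

For this example the root spans indexes 1 to 10, A spans 2 to 3 and B spans 4 to 9, so C and D (5 to 6 and 7 to 8) are descendants of B but A is not.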
Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jul 27 15:43:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 20:43:09 +0100 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com> Message-ID: <320fb6e00907271243q16d7a5efnca5873faaee3937f@mail.gmail.com> On Mon, Jul 27, 2009 at 7:42 PM, Marshall Hampton wrote: > > Thanks Peter, I was unaware of the EMBOSS versions of PHYLIP. I don't > think using pexpect to wrap the originals is really that hard - I have some > working fine already - but now I see its almost pointless. I don't like the > EMBOSS dependence, but it sounds like you are already working on > getting rid of that. I'm not quite sure what you are saying. Biopython doesn't depend on EMBOSS, we just have some optional code to interact with EMBOSS. If you want to run the PHYLIP tools from Python, you are going to have to install PHYLIP or EMBOSS anyway. The EMBOSS version is (I think) far more useful, and there is lots of other useful stuff in EMBOSS as well, so I really don't see a problem with recommending EMBOSS. Right now we have parsers and wrappers for some of the EMBOSS tools, and I would like to have more. Generating the wrappers (semi) automatically would be a step forward as we currently have wrappers for only about ten of the EMBOSS tools (hand picked based on people actually wanting to use them from within Biopython). 
Peter

From mhampton at d.umn.edu Mon Jul 27 14:42:21 2009
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Mon, 27 Jul 2009 13:42:21 -0500 (CDT)
Subject: [Biopython-dev] Phylip interface questions
In-Reply-To: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com>
References: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com>
Message-ID:

Thanks Peter, I was unaware of the EMBOSS versions of PHYLIP. I don't think using pexpect to wrap the originals is really that hard - I have some working fine already - but now I see it's almost pointless. I don't like the EMBOSS dependence, but it sounds like you are already working on getting rid of that.

Cheers,
Marshall

On Mon, 27 Jul 2009, Peter wrote:

> On Mon, Jul 27, 2009 at 6:05 PM, Marshall Hampton wrote:
>>
>> I am wondering if there is already an interface to the Phylip programs in
>> biopython. I am pretty sure there is not, but I wanted to ask before doing
>> a chunk of work on one. I know that AlignIO can read and write the phylip
>> alignment files, but I think that is it.
>>
>> Assuming such a thing doesn't already exist, I will write some functions for
>> calling various combinations of programs in phylip to make some common tasks
>> easier. Mostly this will use the pexpect module. What is the most
>> appropriate place to put such an interface within biopython?
>
> I really wouldn't go down the route of trying to wrap the original PHYLIP
> tools, it would involve piping simulated keypresses to stdin - very tricky
> (even if the python module pexpect is wonderful).
>
> I would instead wrap the EMBOSS packaged versions of the PHYLIP
> suite, which have proper command line interfaces with switches etc.
> In this case, Bio/Emboss/Applications.py would be the file to look at.
> However, something I have been discussing with Peter Rice at EMBOSS
> is using their ACD files (which define the EMBOSS tools command line
> interfaces) to automatically generate the Biopython wrappers.
> > Peter > From chapmanb at 50mail.com Mon Jul 27 18:12:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:12:02 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython In-Reply-To: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> References: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Message-ID: <20090727221202.GC68751@sobchak.mgh.harvard.edu> Hi Eric; Thanks for taking on this reorganization into Tree/TreeIO. That is turning out really nice and provides a great general framework for plugging other phylogeny modules into. > - Bioperl's phyloXML driver was written for version 1.00 and might hurl > if given a v1.10 file -- so that's a potential problem if Biopython > defaults to writing v1.10 files. Should Writer take a option to specify the > file format version number? Right now it only writes valid phyloXML v1.00. I tend to agree with Mark and Hilmar's assessment; PhyloXML is in development right now so we want to push towards the latest version. Reading Christian's summary of changes: http://phyloxml.blogspot.com/2009/06/proposed-changes-and-additions-for.html it seems like much of this is fixes. It would be worth pinging BioPerl to be sure someone will handle updates to the latest version but otherwise I would go with what is easiest. You want to be careful not to get trapped in version purgatory. > - Above, I've listed more enhancements than I'll probably be able to finish > this week. Which should have higher priority? I know merging Bio.Nexus > and Bio.Tree would be the most useful, but since (1) Biopython > development still happens on CVS, not Git, and (2) another Tree-based > GSoC project is expected to land around the same time as mine, I think > doing the integration right now would be kind of painful. 
So I can focus > either on laying the groundwork in Bio.Tree.BaseTree, copying rather than > moving the relevant Nexus code, or else work mainly on exporting to other > useful object representations like networkx graphs, or any Biopython > classes I've missed (e.g. alignments). Suggestions? What are you most interested in? You've certainly earned the right to work on what you think may be most useful to you in the future. Any of the listed projects are a good step forward. If you really really want my votes, they are for adding common tree manipulation methods to the base Tree class and working towards PhyloDB storage compatibility. Brad From chapmanb at 50mail.com Mon Jul 27 18:44:06 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:44:06 -0400 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> Message-ID: <20090727224406.GE68751@sobchak.mgh.harvard.edu> Hi Peter; > The FASTQ cross-validation is on going, as you may have > gathered from the cross-project thread (now on open-bio-l) > I did start testing against BioPerl SVN which uncovered > some BioPerl problems, and a grey area of the format > worth debate. See also: > http://lists.open-bio.org/pipermail/open-bio-l/2009-July/ > > This is taking longer than I had expected, but think it will be > worth the effort. Glad you are tackling this -- fleshing out the incompatibilities is tough work but will save a lot of headaches for people in the future. > P.S. Anyone care to guess on how EMBOSS, BioPerl, and > Biopython's FASTQ parsing stacks up in terms of run time? We better be the fastest. 
Everyone knows that C code is bloated and slow. In terms of 1.51 and beyond, I've got two things: - SQLite support: I'd love to push this in now for 1.51. If we have a working version that people can test on, it'll encourage adoption for the next BioSQL release. - GFF parsing: The code is revamped to be more SeqIO like based on the discussion you, Michiel and I had earlier, and the documentation is in progress. I'll plan to get this in post-1.51 so people can work with it in git and find bugs. Brad From chapmanb at 50mail.com Mon Jul 27 18:34:16 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:34:16 -0400 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: <20090727223416.GD68751@sobchak.mgh.harvard.edu> Hi Marshall; > I am wondering if there is already an interface to the Phylip programs in > biopython. I am pretty sure there is not, but I wanted to ask before > doing a chunk of work on one. I know that AlignIO can read and write the > phylip alignment files, but I think that is it. > > Assuming such a thing doesn't already exist, I will write some functions > for calling various combinations of programs in phylip to make some common > tasks easier. Mostly this will use the pexpect module. What is the most > appropriate place to put such an interface within biopython? I did a lot of work with Phylip a while back, and generally interfacing with it is hideous looking but not impossible. I would create input files with the data for all of the menu items, and then feed this into the program. Then you need to handle renaming the generically named output files. 
Here's a chunk of it to give you the idea:

    pars_outfile = os.path.join(work_dir, "outgroup_phy.pars")
    pars_tree_outfile = os.path.join(work_dir, "outgroup_phy.parstree")
    hack_phylip_file = os.path.join(work_dir, "protpars.hack")
    hack_output = "%s\nM\nD\n%s\n13\n10\nO\n1\nY\n" % (align_file, num_boot)
    hack_handle = open(hack_phylip_file, "w")
    hack_handle.write(hack_output)
    hack_handle.close()
    cl = PhylipHackCommandline("protpars", hack_phylip_file)
    Application.generic_run(cl)
    os.rename("outfile", pars_outfile)
    os.rename("outtree", pars_tree_outfile)

I would second Peter in using the EMBOSS interfaces to Phylip. There are ones already in Biopython for protdist, neighbor, protpars, consense and seqboot:

http://github.com/biopython/biopython/blob/master/Bio/Emboss/Applications.py

Why do you prefer the pexpect module for running applications? From a quick glance, the subprocess module included in Python should let you do most of what you can do with pexpect and it doesn't require an extra install.

Finally, I am not as plugged in on the latest in phylogeny building but is Phylip still in favor? I know there has been a lot of work on Maximum Likelihood and Bayesian methods, like:

FastTree: http://www.microbesonline.org/fasttree/index.html
RAxML: http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
MrBayes: http://mrbayes.csit.fsu.edu/

In terms of Python support for these, Frank Kauff has some things to deal with RAxML:

http://www.lutzonilab.net/downloads/

The latest PyCogent had support for FastTree and I believe they also tackle RAxML:

http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/app/fasttree.py?revision=333&view=markup

Hope this helps.
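
On the subprocess point above, here is a minimal sketch of feeding prepared menu answers to a program's stdin with only the standard library. The function name run_with_answers is hypothetical, and the child process in the usage example is the Python interpreter itself echoing its input, purely as a stand-in: the real PHYLIP program names and the menu answers they expect are deliberately not assumed here.

```python
import subprocess
import sys


def run_with_answers(command, answers):
    """Run command, sending the answers (one per line) to its stdin.

    Returns whatever the program wrote to stdout. Raises RuntimeError if
    the program exits with a non-zero return code.
    """
    menu_input = "\n".join(answers) + "\n"
    child = subprocess.Popen(command,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             universal_newlines=True)
    stdout, _ = child.communicate(menu_input)
    if child.returncode != 0:
        raise RuntimeError("%r failed with return code %i"
                           % (command, child.returncode))
    return stdout


# Stand-in child process: a Python one-liner that copies stdin to stdout.
echo_command = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read())"]
output = run_with_answers(echo_command, ["infile.phy", "Y"])
```

This only covers the non-interactive case where all answers are known up front, which is the same situation as the prepared input file in the code chunk above; a program that needs its output inspected between answers is where pexpect still has the edge.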
Glad to have someone thinking about these questions,
Brad

From winda002 at student.otago.ac.nz Mon Jul 27 21:56:46 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 28 Jul 2009 13:56:46 +1200
Subject: [Biopython-dev] Phylip interface questions
In-Reply-To:
References:
Message-ID: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz>

Hi Marshall,

I am wondering if there is already an interface to the Phylip programs in biopython. I am pretty sure there is not, but I wanted to ask before doing a chunk of work on one. I know that AlignIO can read and write the phylip alignment files, but I think that is it.

Assuming such a thing doesn't already exist, I will write some functions for calling various combinations of programs in phylip to make some common tasks easier. Mostly this will use the pexpect module.

I wrote a few for my own use (I presumed no one else was doing stuff like this) which I've now uploaded as a module (Bio.Phylo) here:

http://github.com/dwinter/biopython/tree/phylo

They are for the 'new phylip' version ('f' prefixed not 'e') in EMBOSS's 'embassy' packages (which take different arguments than the classes already in the EMBOSS module...). They also depend on the cool stuff that Brad and Peter have done for applications in biopython 1.51. Hopefully they will cover some of the same ground that you want to, or at least prevent you having to start from scratch.

(There's also support for PhyML which is based on phylip's dnaml but is much faster.)

Cheers,
David

From biopython at maubp.freeserve.co.uk Tue Jul 28 05:17:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 10:17:06 +0100
Subject: [Biopython-dev] Next release plans?
In-Reply-To: <20090727224406.GE68751@sobchak.mgh.harvard.edu>
References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00907280217obb767b6wffdc4c029bbab651@mail.gmail.com>

On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote:
> In terms of 1.51 and beyond, I've got two things:
>
> - SQLite support: I'd love to push this in now for 1.51. If we have
>   a working version that people can test on, it'll encourage
>   adoption for the next BioSQL release.
>

I would be OK with including this once Hilmar adds the SQLite schema to the BioSQL repository. I'd prefer him to do a point release of BioSQL first, but as long as this is going to happen at some point that is fine. Let's bring this up again on the BioSQL mailing list...

If Hilmar isn't keen to rush, we *could* ship it anyway with Biopython, but it should then be clearly labelled as a prototype schema which may be subject to change.

> - GFF parsing: The code is revamped to be more SeqIO like based
>   on the discussion you, Michiel and I had earlier, and the
>   documentation is in progress. I'll plan to get this in post-1.51
>   so people can work with it in git and find bugs.

Definitely post-1.51, note that EMBOSS 6.1.0 now has some support for GFF and features in GenBank, so we can hopefully use that as a reference implementation. i.e. Once we add GFF parsing to SeqIO, this should let Biopython convert from GFF to SeqRecord objects to GenBank, and we can compare this to EMBOSS.
Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 05:26:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 10:26:30 +0100 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz> References: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00907280226n786ae91fy6df4ed1a73aa7bbe@mail.gmail.com> On Tue, Jul 28, 2009 at 2:56 AM, David Winter wrote: > Hi Marshall, > > Assuming such a thing doesn't already exist, I will write some functions for > calling various combinations of programs in phylip to make some common tasks > easier. ?Mostly this will use the pexpect module. > I wrote a few for my own use (I presumed no one else was doing stuff like > this) which I've now uploaded as module (Bio.Phylo) here: > > http://github.com/dwinter/biopython/tree/phylo > > They are for the 'new phylip' version ('f' prefixed not 'e') in EMBOSS's > 'embassy' packages (which take different arguments than the classes already > in the EMBOSS module...). They also depend on the cool stuff that Brad and > Peter have done for applications in biopython 1.51. Hopefully they will > cover some of the same ground that you want to, or at least prevent you > having to start from scratch. Cool. I would double check that _EmbossCommandLine is still appropriate - especially with regards the outfile parameter. I changed a few things recently for seqret (which doesn't have an outfile parameter). Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 07:19:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 12:19:15 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? 
Message-ID: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>

Hi all,

As a possible enhancement to Bio.SeqIO, I've been toying with the idea of introducing another function, essentially to provide the following functionality:

def convert(in_handle, in_format, out_handle, out_format, alphabet=None) :
    """Converts between two file formats, returns number of records."""
    records = parse(in_handle, in_format, alphabet)
    return write(records, out_handle, out_format)

As implied by this reference implementation above, this would be a convenience or helper function which would allow simple conversion scripts to save a line, e.g.

import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")

becomes:

import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")

Now some people might find that in itself a small improvement, but it does make the API a little more complex (feature creep). However, that isn't the real aim here. Having a function like this would allow a number of file format specific optimisations - instead of using SeqIO.parse to create SeqRecord objects which get converted by SeqIO.write as shown above.

For example, converting GenBank or EMBL to FASTA (or tab), we don't need most of the annotation, so creating all those SeqFeature objects is a waste of time and memory. The GenBank/EMBL parser already has (buried) an option to skip the features, and a Bio.SeqIO.convert function would be able to exploit this.

Likewise, converting any of the FASTQ formats to FASTA (which I think will be a fairly common task) can be sped up greatly by ignoring the quality scores, and even more so by never creating Seq and SeqRecord objects. I've tested this particular example, and it is massively faster (about five times faster in fact, which means it actually beats the current version of EMBOSS seqret - which is cool).

Likewise converting between FASTQ formats (in particular Solexa to Sanger, and Illumina to Sanger) are also going to be common tasks which are currently something of a bottle neck. Again, this can be made faster by avoiding using Seq and SeqRecord objects within a convert function. What I have in mind is a lookup table of special case optimised converters (e.g. FASTQ to FASTA). If there is no special case defined, the convert function would default to the SeqIO parse/write code shown above. We would need a good set of unit tests to ensure these optimised converters did produce exactly the same output as the parse/write solution. Of course, if we have bottlenecks in the SeqIO parsing and writing code, it would be worthwhile of course to fix them - rather than writing a special case converter. Maybe to avoid the gradual build up of too many specialised converters, we might ask as a rule of thumb that it be at least three times faster than using parse/write? Any thoughts? Would this all just make SeqIO too complicated? Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 08:07:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 13:07:23 +0100 Subject: [Biopython-dev] Next release plans? 
In-Reply-To: <20090727224406.GE68751@sobchak.mgh.harvard.edu> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: > Hi Peter; > >> The FASTQ cross-validation is on going, as you may have >> gathered from the cross-project thread (now on open-bio-l) >> I did start testing against BioPerl SVN which uncovered >> some BioPerl problems, and a grey area of the format >> worth debate. See also: >> http://lists.open-bio.org/pipermail/open-bio-l/2009-July/ >> >> This is taking longer than I had expected, but think it will be >> worth the effort. > > Glad you are tackling this -- fleshing out the incompatibilities > is tough work but will save a lot of headaches for people in > the future. Absolutely. >> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >> Biopython's FASTQ parsing stacks up in terms of run time? > > We better be the fastest. Everyone knows that C code is bloated > and slow. I'm pretty sure that was tongue in cheek, but if you were being mean you probably could describe some of the EMBOSS infrastructure as bloat. In any case, I'm sure that EMBOSS can be made faster now that speed matters here with next generation sequencing, see: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html And I've got bad news for you then - currently EMBOSS seqret is about twice as fast as CVS Biopython SeqIO (measuring parsing versus writing is a bit tricky). 
However, I have a cunning plan: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Peter From pmr at ebi.ac.uk Tue Jul 28 08:40:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 28 Jul 2009 13:40:43 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu> <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com> Message-ID: <4A6EF1CB.7000800@ebi.ac.uk> Peter wrote: > On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: >>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >>> Biopython's FASTQ parsing stacks up in terms of run time? >> >> We better be the fastest. Everyone knows that C code is bloated >> and slow. > > I pretty sure that was tongue in check, but if you were being mean > you probably could describe some of the EMBOSS infrastructure > as bloat. In any case, I'm sure that EMBOSS can be made faster > now that speed matters here with next generation sequencing, see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html EMBOSS code is indeed bloated and slow in some places - for example on output it constructs a sequence output object from the input sequence. However, it's C ... if we know what we're doing we can tell the machine to go faster. Unless the compiler decides it can optimise us away... Certainly this is a place where using reference-counted strings shows gains. We tend to avoid them in EMBOSS because early experience in optimising had them being deleted at the 'wrong' times and leaving us with no significant improvement in performance. 
Sequence output looks like a good place for them. We can also simplify the sequence output objects to avoid some of the reset operations when reusing the objects. > And I've got bad news for you then - currently EMBOSS seqret > is about twice as fast as CVS Biopython SeqIO (measuring parsing > versus writing is a bit tricky). However, I have a cunning plan: > http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Worse news, I can find some speedups in EMBOSS ... though the split is about 40% in output and 60% in input CPU time. I/O time is another issue where we could play with blocked reads ... though when I tried that some time ago it seemed the operating systems and file systems were doing a grand job and it was hard to get a consistent speed gain even for one specific system. regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 28 08:51:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 13:51:08 +0100 Subject: [Biopython-dev] FASTQ parsing speed in EMBOSS Message-ID: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com> I've retitled this and CC'ed it to the EMBOSS dev list - which is probably a better place for this now! On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice wrote: > Peter wrote: >> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: > >>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >>>> Biopython's FASTQ parsing stacks up in terms of run time? >>> >>> We better be the fastest. Everyone knows that C code is bloated >>> and slow. >> >> I pretty sure that was tongue in check, but if you were being mean >> you probably could describe some of the EMBOSS infrastructure >> as bloat. 
In any case, I'm sure that EMBOSS can be made faster >> now that speed matters here with next generation sequencing, see: >> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html > > EMBOSS code is indeed bloated and slow in some places - for example on > output it constructs a sequence output object from the input sequence. > However, it's C ... if we know what we're doing we can tell the machine > to go faster. Unless the compiler decides it can optimise us away... > > Certainly this is a place where using reference-counted strings shows > gains. We tend to avoid them in EMBOSS because early experience in > optimising had them being deleted at the 'wrong' times and leaving us > with no significant improvement in performance. Sequence output looks > like a good place for them. > > We can also simplify the sequence output objects to avoid some of the > reset operations when reusing the objects. > >> And I've got bad news for you then - currently EMBOSS seqret >> is about twice as fast as CVS Biopython SeqIO (measuring parsing >> versus writing is a bit tricky). However, I have a cunning plan: >> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html > > Worse news, I can find some speedups in EMBOSS ... though > the split is about 40% in output and 60% in input CPU time. Well, it is only bad news from the point of view of Biopython bragging rights ;) And with those speed ups, I guess my fast lower level Biopython FASTQ to FASTA script will now be about the same speed as seqret! See: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Nice work! > I/O time is another issue where we could play with blocked > reads ... though when I tried that some time ago it seemed > the operating systems and file systems were doing a grand > job and it was hard to get a consistent speed gain even for > one specific system. Maybe best avoided, given EMBOSS is truly cross platform. Peter C. 
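
[Editorial aside on the Bio.SeqIO.convert thread] The "lookup table of special case optimised converters" idea from the original proposal, with a generic fallback when no special case is registered, could be sketched roughly as below. This is only an illustration under stated assumptions: the names _fastq_to_fasta, _CONVERTERS and the fallback argument are hypothetical, the FASTQ reader assumes plain four-line records with no wrapping, and nothing here is taken from Biopython itself (in Biopython the fallback would presumably be the SeqIO parse/write pair shown in the proposal).

```python
import io


def _fastq_to_fasta(in_handle, out_handle):
    """Hypothetical fast path: copy title and sequence, ignore qualities.

    Assumes simple four-line FASTQ records (no line wrapping).
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of input
        sequence = in_handle.readline()
        in_handle.readline()  # '+' separator line, ignored
        in_handle.readline()  # quality string, ignored
        if not title.startswith("@"):
            raise ValueError("Expected '@' at start of FASTQ record")
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), sequence.rstrip()))
        count += 1
    return count


# Lookup table of special-case converters, keyed by (in_format, out_format).
_CONVERTERS = {("fastq", "fasta"): _fastq_to_fasta}


def convert(in_handle, in_format, out_handle, out_format, fallback=None):
    """Use a registered fast converter if there is one, else the fallback.

    'fallback' is a hypothetical hook standing in for the generic
    parse/write route; it is called with the same four arguments.
    """
    fast = _CONVERTERS.get((in_format, out_format))
    if fast is not None:
        return fast(in_handle, out_handle)
    if fallback is None:
        raise ValueError("No converter for %s to %s" % (in_format, out_format))
    return fallback(in_handle, in_format, out_handle, out_format)


# Small in-memory demonstration.
demo_in = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
demo_out = io.StringIO()
demo_count = convert(demo_in, "fastq", demo_out, "fasta")
```

Keeping the table keyed on the format pair means a new optimised converter is a one-line registration, and the unit tests suggested in the proposal can compare each entry against the generic route.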
From biopython at maubp.freeserve.co.uk Tue Jul 28 09:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 14:14:52 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> Message-ID: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> On Tue, Jul 28, 2009 at 12:19 PM, Peter wrote: > > Any thoughts? Would this all just make SeqIO too complicated? > The idea of the Bio.SeqIO.convert function was two fold: (1) Syntactic sugar (and for this alone I wouldn't add it) (2) Faster file format conversion (e.g. for scripts or pipelines) While we could clearly out perform EMBOSS 6.1.0 on FASTQ to FASTA, given the possible speed ups Peter Rice is reporting for EMBOSS seqret, it looks this will change shortly: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006496.html I don't see any real point in trying to compete with EMBOSS for simple file conversion if in general seqret will be faster (and on the next release of EMBOSS, it should be). The real benefit if using Bio.SeqIO for any file format conversion (rather than seqret), is this lets the user add their own conditional filters or modifications as needed. And for this, my proposed function Bio.SeqIO.convert() doesn't help in any way. So, unless anyone pipes up, I probably won't pursue this. Finally, if anyone is interested, this was idea for the high speed FASTQ to FASTA conversion - as a proof of principle script using standard input and standard output at the command line: #High performance FASTQ to FASTA conversion for short reads. #This uses the low level FASTQ parser in Biopython 1.50 or #later. This avoids Bio.SeqIO and the associated overheads #of object creation and decoding the FASTQ quality string. 
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
#This just returns tuples of three strings from FASTQ:
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write(">%s\n" % title)
    #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
    for i in range(0, len(sequence), 60):
        write(sequence[i:i+60] + "\n")

If you don't want line wrapping, the code is two lines shorter, and even faster:

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write(">%s\n%s\n" % (title, sequence))

Peter

From mjldehoon at yahoo.com Tue Jul 28 08:55:33 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 28 Jul 2009 05:55:33 -0700 (PDT) Subject: [Biopython-dev] Bio.SeqIO.convert function? Message-ID: <988956.8355.qm@web62404.mail.re1.yahoo.com>

> Of course, if we have bottlenecks in the SeqIO parsing
> and writing code, it would be worthwhile of course to fix
> them - rather than writing a special case converter. Maybe
> to avoid the gradual build up of too many specialised
> converters, we might ask as a rule of thumb that it be
> at least three times faster than using parse/write?
>

I have no fundamental objection, but we should first try to speed up the current GenBank parser and see if the specialized converter is still more than three times faster.

--Michiel

From biopython at maubp.freeserve.co.uk Tue Jul 28 09:19:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 14:19:45 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? 
In-Reply-To: <988956.8355.qm@web62404.mail.re1.yahoo.com> References: <988956.8355.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e00907280619y1493ec19vdf00543cb45fc8d5@mail.gmail.com> On Tue, Jul 28, 2009 at 1:55 PM, Michiel de Hoon wrote: > >> Of course, if we have bottlenecks in the SeqIO parsing >> and writing code, it would be worthwhile of course to fix >> them - rather than writing a special case converter. Maybe >> to avoid the gradual build up of too many specialised >> converters, we might ask as a rule of thumb that it be >> at least three times faster than using parse/write? > > I have no fundamental objection, but we should first try > to speed up the current GenBank parser and see if the > specialized converter is still more than three times faster. I can already in principle make the current GenBank parser up to four times faster - I was working on this before all the FASTQ stuff and would hope to see this in Biopython 1.52, http://bugzilla.open-bio.org/show_bug.cgi?id=2738 Even with a change like that to speed up feature location parsing, it would still be faster still to skip the features in a GenBank or EMBL file completely. Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 10:47:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 15:47:57 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> Message-ID: <320fb6e00907280747g29beec82lef221e297895a097@mail.gmail.com> On Tue, Jul 28, 2009 at 2:14 PM, Peter wrote: > Finally, if anyone is interested, this was idea for the high speed > FASTQ to FASTA conversion - as a proof of principle script > using standard input and standard output at the command line: > > #High performance FASTQ to FASTA conversion for short reads. 
> #This uses the low level FASTQ parser in Biopython 1.50 or
> #later. This avoids Bio.SeqIO and the associated overheads
> #of object creation and decoding the FASTQ quality string.
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> #This just returns tuples of three strings from FASTQ:
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>     write(">%s\n" % title)
>     #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
>     for i in range(0, len(sequence), 60):
>         write(sequence[i:i+60] + "\n")
>
> If you don't want line wrapping, the code is two lines shorter,
> and even faster:
>
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>     write(">%s\n%s\n" % (title, sequence))
>
> Peter

And here is a similar high performance script for mapping Solexa FASTQ to Sanger FASTQ,

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator, phred_quality_from_solexa
from string import maketrans
solexa = "".join(chr(64+q) for q in range(-5,62+1))
sanger = "".join(chr(int(round(33+phred_quality_from_solexa(q)))) \
                 for q in range(-5,62+1))
mapping = maketrans(solexa, sanger)
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write("@%s\n%s\n+\n%s\n" % (title, sequence, quality.translate(mapping)))

The same basic idea works equally well for mapping between any of the three FASTQ variants, and the speed is very similar to the FASTQ to FASTA script, taking about 1/5 of the time using SeqIO parse/write for this. I'm still investigating how to make the SeqIO parsing/writing faster.

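
[Editorial aside] The script above is written for Python 2, where maketrans lives in the string module. A minimal standard-library sketch of the same translation-table trick for Python 3 might look like the following. It is an illustration, not the code from the thread: the helper names phred_from_solexa and solexa_to_sanger are made up here, and the Solexa to PHRED conversion is written out directly as Q_phred = 10 * log10(10^(Q_solexa/10) + 1) rather than imported from Biopython.

```python
import math


def phred_from_solexa(solexa_quality):
    """Convert a Solexa quality score to the PHRED scale (as a float)."""
    return 10 * math.log10(10 ** (solexa_quality / 10.0) + 1)


# Solexa FASTQ stores scores -5..62 at ASCII offset 64;
# Sanger FASTQ stores PHRED scores at ASCII offset 33.
solexa_chars = "".join(chr(64 + q) for q in range(-5, 62 + 1))
sanger_chars = "".join(
    chr(33 + int(round(phred_from_solexa(q)))) for q in range(-5, 62 + 1)
)

# Built once, then applied to every quality string without decoding it.
SOLEXA_TO_SANGER = str.maketrans(solexa_chars, sanger_chars)


def solexa_to_sanger(quality_string):
    """Re-encode a Solexa FASTQ quality string as Sanger FASTQ."""
    return quality_string.translate(SOLEXA_TO_SANGER)
```

The point of the table is that each read costs a single str.translate call, with no per-base arithmetic and no intermediate list of integer scores.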
When I get an updated version of EMBOSS installed, I intend to profile it against these scripts ;) Peter From eric.talevich at gmail.com Tue Jul 28 11:49:29 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 28 Jul 2009 11:49:29 -0400 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> Message-ID: <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> Hi Peter, On Tue, Jul 28, 2009 at 9:14 AM, Peter wrote: > On Tue, Jul 28, 2009 at 12:19 PM, Peter > wrote: > > > > Any thoughts? Would this all just make SeqIO too complicated? > > > > The idea of the Bio.SeqIO.convert function was two fold: > (1) Syntactic sugar (and for this alone I wouldn't add it) > (2) Faster file format conversion (e.g. for scripts or pipelines) > > This would be nice if it was implemented in AlignIO and TreeIO, too. The naming is pretty intuitive, and the concept is general, so I don't think it makes the API any more difficult to understand. (Personally, I like having a sugary API to use inside ipython.) But the main reason I piped up was that some time ago, we observed that some popular Python libraries have functions that can accept either an open file handle or a file name, and do the right thing. The xml.etree module in the standard lib does this by checking if the 'file' argument has a 'read' method, and if not, trying to open it. I didn't see any reason for Bio.TreeIO to be any fussier than the standard library, so... http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NexusIO.py Implementing this for SeqIO.convert() (or ideally, read/parse/write on all the *IO modules) would make it very nice for files other than stdin and stdout -- otherwise, the user needs to open and maybe close two file handles before calling convert(). What do you think? 
Cheers, Eric From biopython at maubp.freeserve.co.uk Tue Jul 28 12:04:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 17:04:48 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> Message-ID: <320fb6e00907280904k11e0f197qb931d622474eeb69@mail.gmail.com> On Tue, Jul 28, 2009 at 4:49 PM, Eric Talevich wrote: > Hi Peter, > > On Tue, Jul 28, 2009 at 9:14 AM, Peter wrote: > >> On Tue, Jul 28, 2009 at 12:19 PM, Peter >> wrote: >> > >> > Any thoughts? Would this all just make SeqIO too complicated? >> > >> >> The idea of the Bio.SeqIO.convert function was two fold: >> (1) Syntactic sugar (and for this alone I wouldn't add it) >> (2) Faster file format conversion (e.g. for scripts or pipelines) >> > This would be nice if it was implemented in AlignIO and TreeIO, too. The > naming is pretty intuitive, and the concept is general, so I don't think it > makes the API any more difficult to understand. (Personally, I like having a > sugary API to use inside ipython.) OK - fair point. And yes, if we added it to Bio.SeqIO, it would make sense to add a similar function to Bio.AlignIO and the nascent Bio.TreeIO module too. If combined with allowing filenames in place of handles, then yes, it makes one line file conversion very convenient too. On the more general issue of filenames versus handles, I think I'll reply on a new thread though... Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 12:34:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 17:34:48 +0100 Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc? 
Message-ID: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com> Hi all, Eric just reopened an old debate - should Bio.SeqIO (and similar) support filenames as well as handles? In fact, this is something we originally discussed when planning SeqIO way back in Nov 2006. Michiel and I were at the time generally in favour of allowing filename/handles, but Iddo Friedberg (who at that time was basically in charge) and Chris Lasher didn't like this. It would have broken with the existing Biopython parsers which were all handle only. After a little debate, we opted to support just handles, knowing we could if need be later allow filenames instead. [Other things which with hindsight I am very glad Michiel, Iddo, Chris etc talked me out of were "guessing" the file format based on the filename or its contents.] I had written up a draft email on this topic a couple of months ago, to raise this issue (which I can't find right now) which went over some of the downsides - other than complicating what is currently a nice clean API. I never sent it because after thinking about it, I was happy with handles only. I guess I'll have to retype my objections as they come back to me. On the thread about a possible Bio.SeqIO.convert function, Eric wrote: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006501.html > But the main reason I piped up was that some time ago, we observed that > some popular Python libraries have functions that can accept either an > open file handle or a file name, and do the right thing. The xml.etree > module in the standard lib does this by checking if the 'file' argument > has a 'read' method, and if not, trying to open it. I didn't see any reason > for Bio.TreeIO to be any fussier than the standard library, so... > > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NexusIO.py First of all, I would argue Bio.TreeIO should be consistent with Bio.SeqIO and Bio.AlignIO with respect to handles vs filenames. 
If we do agree to support filenames or handles, then I would keep all the Bio.ModuleIO.SubModule code using handles only, and put the boilerplate (repeated) handle/filename code in the Bio.ModuleIO functions only. This is (a) less work, and (b) less code duplication. After all, the code in the modules under Bio.SeqIO (and similar) is rarely used directly. Other top level parsers, like Bio.Entrez.read() might then also deserve the filename/handle treatment. As a bonus, Bio.Nexus would cease to be an oddity as it does this already.

> Implementing this for SeqIO.convert() (or ideally, read/parse/write on all
> the *IO modules) would make it very nice for files other than stdin and
> stdout -- otherwise, the user needs to open and maybe close two file handles
> before calling convert().
>
> What do you think?

From an end user point of view, especially when working directly at the python prompt interactively, being able to give filenames would be nicer. This will also make lots of the examples in the tutorial shorter and simpler, because we don't have to do things like closing output handles (because the SeqIO.write() function would do it for us). There is a minor downside that Python beginners won't necessarily get to grips with handles so quickly. There is a cost, in that lots of parser code will need to check if it has a filename and if so open it. For output code this is a little more complex, as the writer function must also close the file afterwards.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 12:48:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 17:48:50 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? Message-ID: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com>

Hi Eric,

If you wanted a good multi-tree example file format for TreeIO, I would suggest plain Newick trees.
I am familiar with plain text files which contain one Newick tree per line (with a terminating semi-colon), although in principle they could be wrapped over many lines. The neighbour joining (NJ) tree software QuickJoin from Thomas Mailund can certainly output this kind of file. I would expect to be able to read and write such multi-tree Newick files using Bio.TreeIO. http://www.daimi.au.dk/~mailund/quick-join.html The obvious application of this (which I have used personally), was to generate bootstrap trees on multiple machines in a cluster (or cores on a single machine), e.g. 100 instances each of 10 bootstrap trees, giving in total 1000 trees (which are then used either to build a consensus, or allocate bootstrap support to the randomised master tree). I wrote some code in python to do this bootstrapping step using the splits defined by each edge (i.e. the two sets of nodes you get if the edge was severed), which I represented using bit arrays, for use as keys in a dictionary mapping the splits to the master tree's edges. Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 13:04:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 18:04:52 +0100 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: References: <311853.75944.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00907281004y1c597d68k5548840e3a792687@mail.gmail.com> On Sat, Jul 25, 2009 at 9:57 PM, Iddo Friedberg wrote: > I'm the author of subsmat IIRC. Everything sounds good, but I would not make > 2.6 changes that will break on 2.5. Ubuntu still uses 2.5 and I imagine > other linux distros do too. Plus we are still supporting Biopython on Python 2.4, having only recently dropped support for Python 2.3 ;) The current Ubuntu with long term support (LTS) is 8.04 (hardy), and that uses Python 2.5. However, the latest Ubuntu (jaunty) and the in development one (karmic) are already using Python 2.6. 
Biopython will often get used on clusters and servers (not just desktops), and these tend to get upgraded less often. Our cluster is still running Python 2.4 for example. Peter From chapmanb at 50mail.com Tue Jul 28 18:09:43 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Jul 2009 18:09:43 -0400 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> Message-ID: <20090728220943.GJ68751@sobchak.mgh.harvard.edu> Hi Peter; > As a possible enhancement to Bio.SeqIO, I've been toying with > the idea of introducing another function, essentially to provide > the following functionality: > > def convert(in_handle, in_format, out_handle, out_format, alphabet=None) : > """Converts between two file formats, returns number of records.""" > records = parse(in_handle, in_format, alphabet) > return write(records, out_handle, out_format) [...] > However, that isn't the real aim here. Having a function like this > would allow a number of file format specific optimisations - > instead of using SeqIO.parse to create SeqRecord objects > which get converted by SeqIO.write as shown above. I like this idea. To the extent in which we can optimize popular conversions, this gives us a standard place to put it. There is going to be lots of fastq to fasta conversion and being as fast as possible is good (notice my avoidance of any more potentially misconstrued jokes). Conversion lately seems to be getting worse, not better, with all of the alignment and annotation formats springing up. Extending this to AlignIO and TreeIO as Eric suggested is also great. So +1 from me, Brad From chapmanb at 50mail.com Tue Jul 28 18:17:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Jul 2009 18:17:26 -0400 Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc? 
In-Reply-To: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com> References: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com> Message-ID: <20090728221726.GK68751@sobchak.mgh.harvard.edu>

Hey all;

> Eric just reopened an old debate - should Bio.SeqIO (and similar)
> support filenames as well as handles?
>
> In fact, this is something we originally discussed when planning
> SeqIO way back in Nov 2006. Michiel and I were at the time generally
> in favour of allowing filename/handles, but Iddo Friedberg (who at that
> time was basically in charge) and Chris Lasher didn't like this. It would
> have broken with the existing Biopython parsers which were all handle
> only. After a little debate, we opted to support just handles, knowing we
> could if need be later allow filenames instead.

I am for file and handle support. Only dealing with handles is like so totally 2006. I did this in the GFF parser by necessity since Disco MapReduce needed files and the standard Biopython way is handles. Essentially, it checks for a read attribute and keeps track of needing to close the handle:

if hasattr(gff_file, "read"):
    need_close = False
    in_handle = gff_file
else:
    need_close = True
    in_handle = open(gff_file)

> There is a minor downside
> that Python beginners won't necessarily get to grips with handles so quickly.

Yes, that is the downside I see as well. The plus side of the same issue is that the learning curve is less steep.

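
[Editorial aside] The need_close bookkeeping described above can be wrapped up once so that each top level function does not repeat it. A minimal sketch, assuming a hypothetical helper named open_if_needed (not an existing Biopython function at the time of this thread), using only the standard library:

```python
import contextlib


@contextlib.contextmanager
def open_if_needed(handle_or_filename, mode="r"):
    """Yield a file handle, closing it afterwards only if we opened it.

    Anything with a 'read' or 'write' attribute is treated as an
    already-open handle and left for the caller to close.
    """
    if hasattr(handle_or_filename, "read") or hasattr(handle_or_filename, "write"):
        # Caller owns this handle, so do not close it here.
        yield handle_or_filename
    else:
        # We were given a filename: open it and guarantee it is closed.
        with open(handle_or_filename, mode) as handle:
            yield handle
```

A parse or write function would then do "with open_if_needed(source) as handle:" and use the handle as before, which covers both the reading case and the slightly trickier writing case where the file must be closed afterwards.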
Brad From bugzilla-daemon at portal.open-bio.org Tue Jul 28 20:57:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Jul 2009 20:57:54 -0400 Subject: [Biopython-dev] [Bug 2889] New: setup.py reads stdin even when stdin is not a terminal Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2889 Summary: setup.py reads stdin even when stdin is not a terminal Product: Biopython Version: 1.51b Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: sridhar.ratna at gmail.com setup.py files are *not* meant be using raw_input and other funky things that interferes with build automation. Please remove the use of raw_input() .. or, at least, use raw_input() only when stdin is a real terminal ("if sys.stdout.isatty()"). This way you could allow your package to built via automated build tools. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From matzke at berkeley.edu Wed Jul 29 00:33:46 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:33:46 -0700 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> <4A6560E2.4030502@biologie.uni-kl.de> <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com> Message-ID: <4A6FD12A.4020707@berkeley.edu> Peter wrote: > On Tue, Jul 21, 2009 at 7:32 AM, Frank Kauff wrote: >> Hi all, >> >> Peter wrote: >>> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: >>> >>>> Hi all, here is my weekly update... >>>> >>>> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >>> Cool. 
I haven't tried it personally though ;) Frank and/or Cymon - any >>> comments regarding Brad checking this in? See Bug 2788 for details. >> Not at all - you're most welcome. Thanks for dealing with it. >> >> Frank > > Sounds like you should proably check in that fix then Brad :) > > Peter Yeah, I used the revised module for a bunch more operations, including many of the tree methods. No crashes or huge issues once I "got" how everything worked. I did have to write my own methods for what should probably eventually be basic tree methods, like deep-copying the tree, subsetting the tree based on what occurs above a given node, etc. Thanks! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Wed Jul 29 03:43:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 08:43:58 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <20090728220943.GJ68751@sobchak.mgh.harvard.edu> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com> On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote: > Hi Peter; > >> As a possible enhancement to Bio.SeqIO, I've been toying with >> the idea of introducing another function, essentially to provide >> the following functionality: >>
>> def convert(in_handle, in_format, out_handle, out_format, alphabet=None) :
>>     """Converts between two file formats, returns number of records."""
>>     records = parse(in_handle, in_format, alphabet)
>>     return write(records, out_handle, out_format)
> [...] >> However, that isn't the real aim here. Having a function like this >> would allow a number of file format specific optimisations - >> instead of using SeqIO.parse to create SeqRecord objects >> which get converted by SeqIO.write as shown above. > > I like this idea. To the extent in which we can optimize popular > conversions, this gives us a standard place to put it. There is > going to be lots of fastq to fasta conversion and being as fast as > possible is good (notice my avoidance of any more potentially > misconstrued jokes). OK, assuming we press ahead with this, the Bio.SeqIO.convert() function would be the only public API addition, the internals would all be private. What I had in mind was Bio.SeqIO.convert() using a dictionary of functions (all with the same arguments), keyed on a tuple of (in_format, out_format).
I was thinking of using Bio/SeqIO/_convert.py for the individual functions (like GenBank/EMBL to FASTA/tab, or any FASTQ to FASTA/tab). Note I am expecting that in many cases it will be quite simple to handle several related conversions in one function, and this should avoid some code duplication. By marking these details as private, we can of course refine this scheme later. > Conversion lately seems to be getting worse, not better, with > all of the alignment and annotation formats springing up. > Extending this to AlignIO and TreeIO as Eric suggested is > also great. Whatever we do for Bio.SeqIO, we can follow the same pattern for Bio.AlignIO etc. > So +1 from me, > Brad And we basically had a +0 from Michiel, and a +1 from Eric. And I like the idea but am not convinced we need it. Maybe we should put the suggestion forward on the main discussion list for debate? Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 29 03:46:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 03:46:25 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290746.n6T7kPIe029876@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 03:46 EST ------- (In reply to comment #0) > setup.py files are *not* meant be using raw_input and other funky things > that interferes with build automation. Have you got a reference for that? I can see why it might cause a problem, but there is probably official guidance for this kind of thing. > Please remove the use of raw_input() .. or, at least, use raw_input() only > when stdin is a real terminal ("if sys.stdout.isatty()"). That makes sense. But what would you do if this is not the case? > This way you could allow your package to built via automated build tools. What tool has a problem? 
All the Linux packagers manage fine. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 04:42:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 04:42:25 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290842.n6T8gPQb032600@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #2 from sridhar.ratna at gmail.com 2009-07-29 04:42 EST ------- > ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 03:46 EST ------- > (In reply to comment #0) >> setup.py files are *not* meant be using raw_input and other funky things >> that interferes with build automation. > > Have you got a reference for that? I can see why it might have a problem, > but there is probably official guidance for this kind of thing. Ok, I'll ease up on my assertions .. what I meant was it is a good practice to keep the script execution simple. See http://mail.python.org/pipermail/distutils-sig/2009-July/012832.html (last paragraph) >> Please remove the use of raw_input() .. or, at least, use raw_input() only >> when stdin is a real terminal ("if sys.stdout.isatty()"). > > That makes sense. But what would you do if this is not the case? Since your package already makes use of setuptools, I suggest you to make use of the 'extras' features in setuptools: http://peak.telecommunity.com/DevCenter/setuptools#declaring-extras-optional-features-with-their-own-dependencies If Foo depends on your package .. but also requires the numpy component, then Foo would depend upon "biopython[numpy]". 
Zope namespace packages makes use of this feature extensively (eg: zope.component[zcml]) >> This way you could allow your package to built via automated build tools. > > What tool has a problem? All the Linux packagers manage fine. PyPM (ActiveState's Python Package Manager .. analogous to PPM for Perl) is the tool that has the problem with such packages .. the resolution being to kill the build process that takes more than X number of minutes (raw_input() implies infinite execution time for no stdin). This has the unfortunate consequence of such packages becoming not part of the repository. Even if this bug is not fixed, we could patch the setup.py - but ideally I prefer this to be done in the project itself (to keep things unsophisticated). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 05:45:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 05:45:23 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290945.n6T9jNHF002902@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 05:45 EST ------- (In reply to comment #2) > > (In reply to comment #0) > >> setup.py files are *not* meant be using raw_input and other funky > >> things that interferes with build automation. > > > > Have you got a reference for that? I can see why it might have a > > problem, but there is probably official guidance for this kind of > > thing. > > Ok, I'll ease up on my assertions .. what I meant was it is a good > practice to keep the script execution simple. 
See > http://mail.python.org/pipermail/distutils-sig/2009-July/012832.html > (last paragraph) > > >> Please remove the use of raw_input() .. or, at least, use raw_input() > >> only when stdin is a real terminal ("if sys.stdout.isatty()"). > > > > That makes sense. But what would you do if this is not the case? > > Since your package already makes use of setuptools, I suggest you to > make use of the 'extras' features in setuptools: The official way to install Biopython is "python setup.py install" (i.e. using distutils). We don't do anything special to support setuptools - but it seems to work. Unfortunately, using "extras_require" or "install_requires" to make setuptools happy causes ugly UserWarning messages from distutils. > >> This way you could allow your package to built via automated > >> build tools. > > > > What tool has a problem? All the Linux packagers manage fine. > > PyPM (ActiveState's Python Package Manager .. analogous to PPM for > Perl) is the tool that has the problem with such packages .. the > resolution being to kill the build process that takes more than X > number of minutes (raw_input() implies infinite execution time for no > stdin). This has the unfortunate consequence of such packages becoming > not part of the repository. > > Even if this bug is not fixed, we could patch the setup.py - but > ideally I prefer this to be done in the project itself (to keep > things unsophisticated). The yes/no prompt using raw_input is for solely for installing without NumPy (which is still useful, but only a subset of the full Biopython), and is only shown if NumPy is not installed. This is a compile time dependency for parts of Biopython. I've updated CVS and now setup.py will abort if NumPy is not installed and we don't appear to be running in a real terminal (based on your suggestion). Could you test this please? 
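A minimal sketch of the behaviour described above: prompt only when a real terminal is attached, and abort otherwise. It assumes Python 3 (input() rather than raw_input()) and checks stdin rather than stdout; the function names are hypothetical, and the actual logic in setup.py revision 1.170 may well differ.

```python
import io
import sys


def is_interactive(stream=None):
    """Return True if the stream looks like a real terminal."""
    if stream is None:
        stream = sys.stdin
    # Some environments have no stdin at all, or a stream without isatty().
    isatty = getattr(stream, "isatty", None)
    return bool(isatty is not None and isatty())


def proceed_without_numpy(numpy_installed, stream=None, ask=input):
    """Decide whether the build should continue (hypothetical helper)."""
    if numpy_installed:
        return True
    if not is_interactive(stream):
        # Nobody can answer a prompt, so abort instead of blocking forever.
        return False
    reply = ask("NumPy is not installed. Continue anyway? (y/N) ")
    return reply.strip().lower().startswith("y")


# An in-memory stream is never a terminal, so this aborts without prompting.
print(proceed_without_numpy(False, stream=io.StringIO()))
```

Passing the prompt function in as an argument keeps the decision testable without a terminal.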
Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 05:47:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 05:47:35 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290947.n6T9lZhF003020@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 05:47 EST ------- (In reply to comment #3) > > The yes/no prompt using raw_input is for solely for installing without > NumPy (which is still useful, but only a subset of the full Biopython), > and is only shown if NumPy is not installed. This is a compile time > dependency for parts of Biopython. > > I've updated CVS and now setup.py will abort if NumPy is not installed > and we don't appear to be running in a real terminal (based on your > suggestion). > > Could you test this please? You need setup.py CVS revision 1.170, which should also be available from github within the hour: http://github.com/biopython/biopython/tree/master I could attach the new setup.py to this bug if that would be easier for you. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From jblanca at btc.upv.es Wed Jul 29 08:54:07 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 29 Jul 2009 14:54:07 +0200 Subject: [Biopython-dev] [Biopython] Restriction enzyme digestion gels In-Reply-To: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> References: <4A702ACB.2080204@dcs.gla.ac.uk> <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> Message-ID: <200907291454.07300.jblanca@btc.upv.es> Hi: > There is nothing built into Biopython's graphics module for generating > fake gel images - so using matplot seems worth trying. However, I > would suggest you talk to Jose Blanca about his work first: > http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html > http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Once I needed a similar tool to represent AFLP data as a gel and I wrote the code to solve that issue. I haven't used it much because the project was cancelled due to external reasons, but the code worked. You can take a look at: http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/src/ If you have any problems with it, drop me a line. I'm sure that there will be bugs and the performance is not great, but it worked for me. At least I hope you can look at how the image is built using matplotlib. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From eric.talevich at gmail.com Wed Jul 29 11:49:22 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Jul 2009 11:49:22 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? 
In-Reply-To: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> Message-ID: <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> Hi Peter, On Tue, Jul 28, 2009 at 12:48 PM, Peter wrote: > Hi Eric, > > If you wanted a good multi-tree example file format for TreeIO, I would > suggest plain Newick trees. I am familiar with plain text files which > contain > one Newick tree per line (with a terminating semi-colon), although in > principle they could be wrapped over many lines. The neighbour joining > (NJ) tree software QuickJoin from Thomas Mailund can certainly output > this kind of file. I would expect to be able to read and write such > multi-tree > Newick files using Bio.TreeIO. > I was wondering about this in regard to Bio.Nexus. It looks like the class Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a Nexus file, which corresponds to a tree in Newick format plus a short preamble. The _tree method churns the preamble, then passes a CharBuffer (the Newick string) and some defaults to the Bio.Nexus.Trees.Tree constructor, which does the Newick parsing and creates a Tree object. After a quick glance at the Nexus original article/spec, it looks like the format is a bindle of simpler formats for various applications; most of these formats are unique to Nexus, but Newick is dropped into Nexus completely intact. So! I'm proposing that the Newick parser, currently stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the Nexus parser be changed to simply call the Newick parser from its new location. (A further refactoring of the Nexus parser would put the individual parsers for each block in separate classes or files, rather than mingled with the block-level parsing code. I can't guarantee I'll get around to that, though.) Does this bear any resemblance to your plan? 
The obvious application of this (which I have used personally), was to > generate bootstrap trees on multiple machines in a cluster (or cores on > a single machine), e.g. 100 instances each of 10 bootstrap trees, giving > in total 1000 trees (which are then used either to build a consensus, or > allocate bootstrap support to the randomised master tree). > Sounds like an incremental parse() function over these trees would be very useful for distributed bootstrap analysis etc. I don't see how Bio.Nexus currently supports this, though, beyond iterating over the 'trees' attribute, which is a list. How would a reasonable person go about this? Generate trees in Newick format rather than Nexus, run on the cluster, combine, distill, and only save the resulting master tree in Newick format (or even phyloXML)? If the Newick parser is separated from Nexus, then this wouldn't be too difficult to support. > I wrote some code in python to do this bootstrapping step using the > splits defined by each edge (i.e. the two sets of nodes you get if the > edge was severed), which I represented using bit arrays, for use as > keys in a dictionary mapping the splits to the master tree's edges. > > I would be interested to see this. Thanks, Eric P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the default argument for skiplist is a list with two characters in it. If skiplist is altered, this would persist across subsequent calls, wouldn't it? From biopython at maubp.freeserve.co.uk Wed Jul 29 12:16:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 17:16:57 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? 
In-Reply-To: <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> Message-ID: <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich wrote: > Hi Peter, > > On Tue, Jul 28, 2009 at 12:48 PM, Peter > wrote: >> >> Hi Eric, >> >> If you wanted a good multi-tree example file format for TreeIO, I would >> suggest plain Newick trees. I am familiar with plain text files which >> contain one Newick tree per line (with a terminating semi-colon), >> although in principle they could be wrapped over many lines. The >> neighbour joining (NJ) tree software QuickJoin from Thomas Mailund >> can certainly output this kind of file. I would expect to be able to read >> and write such multi-tree Newick files using Bio.TreeIO. > > I was wondering about this in regard to Bio.Nexus. It looks like the class > Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a > Nexus file, which corresponds to a tree in Newick format plus a short > preamble. The _tree method churns the preamble, then passes a CharBuffer > (the Newick string) and some defaults to the Bio.Nexus.Trees.Tree > constructor, which does the Newick parsing and creates a Tree object. > > After a quick glance at the Nexus original article/spec, it looks like the > format is a bindle of simpler formats for various applications; most of > these formats are unique to Nexus, but Newick is dropped into Nexus > completely intact. So! I'm proposing that the Newick parser, currently > stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the > Nexus parser be changed to simply call the Newick parser from its new > location. > > (A further refactoring of the Nexus parser would put the individual parsers > for each block in separate classes or files, rather than mingled with the > block-level parsing code. 
I can't guarantee I'll get around to that, > though.) > > Does this bear any resemblance to your plan? No - but probably only because I didn't fancy restructuring Bio.Nexus ;) We can already call the Newick tree parser directly, so it doesn't have to be moved (although we could do). [In case you hadn't seen it, the current version of the Tutorial has a tiny example using this at the end of a ClustalW example in the Alignment chapter.] Bio.TreeIO.parse() should be an iterator, returning complete tree objects one by one. I was thinking of having Bio.TreeIO.NewickIO just take a plain text file, split it up at the ";\n" characters (or similar) to get each tree as a string, which is passed to Bio.Nexus.Trees.Tree to parse it. I'd never read the original Nexus publication which describes the file format (my University didn't subscribe to that journal). However, it appears to have been digitised and made freely available since then: http://sysbio.oxfordjournals.org/cgi/reprint/46/4/590 It looks like the NEXUS format allows explicit handling of multiple trees within the NEXUS block structure. Note that this is quite different to the simple concatenated plain text Newick files I was talking about. i.e. the "nexus" and "newick" formats in Bio.TreeIO do both deal with Newick trees, but they are held in different container formats (i.e. a NEXUS file, or plain text). >> The obvious application of this (which I have used personally), was to >> generate bootstrap trees on multiple machines in a cluster (or cores on >> a single machine), e.g. 100 instances each of 10 bootstrap trees, giving >> in total 1000 trees (which are then used either to build a consensus, or >> allocate bootstrap support to the randomised master tree). > > Sounds like an incremental parse() function over these trees would be > very useful for distributed bootstrap analysis etc. Exactly. And Bio.TreeIO.read() would be for the special case where the file format contains exactly one tree. 
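A minimal sketch of the parse()/read() split discussed here, working on plain Newick strings only. It assumes that no semicolons appear inside quoted labels or comments, and it yields strings rather than tree objects (in the scheme above, each string would then be handed to Bio.Nexus.Trees.Tree). The function names are hypothetical, and ValueError is chosen to match the Bio.SeqIO convention.

```python
import io


def parse_newick_strings(handle):
    """Yield one Newick tree string at a time from a text handle.

    Trees may be wrapped over several lines. Only one tree is held in
    memory at a time.
    """
    pieces = []
    for line in handle:
        while ";" in line:
            head, line = line.split(";", 1)
            pieces.append(head.strip())
            tree = "".join(pieces)
            pieces = []
            if tree:
                yield tree + ";"
        pieces.append(line.strip())
    # Tolerate a final tree that lacks its terminating semicolon.
    tail = "".join(pieces)
    if tail:
        yield tail + ";"


def read_newick_string(handle):
    """Return the single tree in the handle, or raise ValueError."""
    trees = parse_newick_strings(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle") from None
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")


handle = io.StringIO("(A,B);\n(C,\nD);\n(E,F)")
for tree in parse_newick_strings(handle):
    print(tree)
```

Because parse_newick_strings is a generator, read_newick_string stops after seeing a second tree rather than loading the whole file.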
> I don't see how Bio.Nexus currently supports this, though, beyond > iterating over the 'trees' attribute, which is a list. As far as I know, Bio.Nexus just parses a whole file in one go. This means either Bio.TreeIO.NexusIO would call this and then loop over the list (very memory inefficient), or it would need a minimal Nexus parser just to spot the TREES block, and handle them only. > How would a reasonable person go about this? > Generate trees in Newick format rather than Nexus, run on the cluster, > combine, distill, and only save the resulting master tree in Newick format > (or even phyloXML)? If the Newick parser is separated from Nexus, then > this wouldn't be too difficult to support. For the example workflow I gave, I did everything with simple Newick files. At the very end, it might make sense to save the bootstrapped tree as phyloXML, or even as a full NEXUS file bundled up with the alignment. >> I wrote some code in python to do this bootstrapping step using the >> splits defined by each edge (i.e. the two sets of nodes you get if the >> edge was severed), which I represented using bit arrays, for use as >> keys in a dictionary mapping the splits to the master tree's edges. > > I would be interested to see this. I'm not actually sure where I put it... it should be on my old desktop at home somewhere. However, I can elaborate in that in addition NJ using quicktree, I also did parsimony bootstrap values, and drew my own colourful trees using reportlab. See the three supplementary figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 Peter > P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the > default argument for skiplist is a list with two characters in it. If > skiplist is altered, this would persist across subsequent calls, wouldn't > it? I don't understand what you are trying to say. 
If get_start_end is called with an argument (say skiplist=["a","b"]) then this will not affect subsequent calls, where the default will still be ['-','?']. From biopython at maubp.freeserve.co.uk Wed Jul 29 12:24:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 17:24:52 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> Message-ID: <320fb6e00907290924o58a63e01l37950046070c290e@mail.gmail.com> > The obvious application of this (which I have used personally), was to > generate bootstrap trees on multiple machines in a cluster (or cores on > a single machine), e.g. 100 instances each of 10 bootstrap trees, giving > in total 1000 trees (which are then used either to build a consensus, or > allocate bootstrap support to the randomised master tree). I hope it was clear anyway, but that last bit should have read: ... which are then used either to build a consensus [tree], or allocate bootstrap support to the original *non* randomised master tree [generated from the original alignment]. Peter From eric.talevich at gmail.com Wed Jul 29 13:59:27 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Jul 2009 13:59:27 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> Message-ID: <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> On Wed, Jul 29, 2009 at 12:16 PM, Peter wrote: > On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich > wrote: > > > > Does this bear any resemblance to your plan? 
> > No - but probably only because I didn't fancy restructuring Bio.Nexus ;) > We can already call the Newick tree parser directly, so it doesn't > have to be moved (although we could do). [In case you hadn't seen > it, the current version of the Tutorial has a tiny example using this > at the end of a ClustalW example in the Alignment chapter.] > > Bio.TreeIO.parse() should be an iterator, returning complete tree > objects one by one. I was thinking of having Bio.TreeIO.NewickIO > just take a plain text file, split it up at the ";\n" characters (or > similar) > to get each tree as a string, which is passed to Bio.Nexus.Trees.Tree > to parse it. > OK, I did this. http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py > > Sounds like an incremental parse() function over these trees would be > > very useful for distributed bootstrap analysis etc. > > Exactly. And Bio.TreeIO.read() would be for the special case where > the file format contains exactly one tree. > PhyloXML has a top-level object that contains multiple phylogenies, plus arbitrary 'other' data; PhyloXML.read() returns one of those object regardless of how many phylogenies it contains. Newick doesn't have a top-level container, so returning one tree and raising a RuntimeError if there isn't exactly one tree makes sense. But Nexus has a top-level container with (potentially) a bunch of other info -- should NexusIO.read() return the complete Nexus object, or just pretend to be a Newick wrapper and behave that way? As far as I know, Bio.Nexus just parses a whole file in one go. This > means either Bio.TreeIO.NexusIO would call this and then loop over > the list (very memory inefficient), or it would need a minimal Nexus > parser just to spot the TREES block, and handle them only. > That's what I pictured for a Bio.Nexus refactoring -- I don't know the right way to do it in a memory-efficient way, though, given that there are multiple types of blocks and they may be needed at different times. 
Maybe make an initial pass to index the file at the block level, then call incremental line-level parsers on the selected blocks? Or, simpler, factor out the efficient line-level parsers so that they can be accessed separately if need be -- basically the way Nexus._tree() works now -- and let the block-level parsing code call those specific parsers. >> I wrote some code in python to do this bootstrapping step using the > >> splits defined by each edge (i.e. the two sets of nodes you get if the > >> edge was severed), which I represented using bit arrays, for use as > >> keys in a dictionary mapping the splits to the master tree's edges. > > > > I would be interested to see this. > > I'm not actually sure where I put it... it should be on my old desktop > at home somewhere. However, I can elaborate in that in addition NJ > using quicktree, I also did parsimony bootstrap values, and drew my > own colourful trees using reportlab. See the three supplementary > figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 > > Hey, neat. I was about to start a project involving kinases and response regulators. How much trouble was it to draw trees in reportlab? Do you think it would be worth adding a tree-drawing module to Bio.Graphics? Eric From biopython at maubp.freeserve.co.uk Wed Jul 29 15:37:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 20:37:08 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? 
In-Reply-To: <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> Message-ID: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich wrote: >> >> Bio.TreeIO.parse() should be an iterator, returning complete tree >> objects one by one. I was thinking of having Bio.TreeIO.NewickIO >> just take a plain text file, split it up at the ";\n" characters (or >> similar) to get each tree as a string, which is passed to >> Bio.Nexus.Trees.Tree to parse it. > > OK, I did this. > > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py OK, I haven't run the code but have a couple of points. On a general point, are you intending to re-write parse and read functions for each tree format? For Bio.SeqIO all I do is write a iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also Bio.SeqIO.read() call this. I didn't use RuntimeError for SeqIO and AlignIO, I used ValueError. I figured the data in the handle didn't match the expectations, which was like saying it had a bad value. It would therefore be more consistent to do the same. The parsing code looks weird to me - but that is probably a style thing. Certainly I had to stare at it to work out what it was doing. It also has a bug - consider a Newick file containing one tree but with no trailing semi colon. On a more serious note, your output code creates a monster string of all the trees in memory! 
Don't do this as it ruins the whole memory benefit of using iterators to keep just one tree in memory at a time:

    lines = (t.to_string(plain_newick=True, plain=plain, **kwargs) for t in trees)
    file.write(';\n'.join(lines))

Instead handle the trees one by one:

    for t in trees:
        file.write(t.to_string(...) + ";\n")

(I'm assuming, like you did, that the to_string method won't add the trailing semicolon and newline.)

>> > Sounds like an incremental parse() function over these trees would be
>> > very useful for distributed bootstrap analysis etc.
>>
>> Exactly. And Bio.TreeIO.read() would be for the special case where
>> the file format contains exactly one tree.
>
> PhyloXML has a top-level object that contains multiple phylogenies, plus
> arbitrary 'other' data; PhyloXML.read() returns one of those object
> regardless of how many phylogenies it contains. Newick doesn't have a
> top-level container, so returning one tree and raising a RuntimeError if
> there isn't exactly one tree makes sense. But Nexus has a top-level
> container with (potentially) a bunch of other info -- should NexusIO.read()
> return the complete Nexus object, or just pretend to be a Newick wrapper and
> behave that way?

Ah. The top level information about all the trees may cause trouble for the TreeIO model I had in mind (which was *just* for trees). The advantage of this is a consistent API, the downside is that certain file format specific things cannot be supported nicely. I think this balance has worked nicely for SeqIO and AlignIO to date. So:

* Bio.TreeIO.read(...) would return one tree.
* Bio.TreeIO.parse(...) would iterate over trees one by one.
* Bio.TreeIO.write(...) would write trees out (ideally sequentially if the file format allows this).

Note I am assuming it is possible to write a PhyloXML tree with minimal (empty) top level annotation? You would need to do this in order to convert from a Nexus or Newick tree to a (minimal) PhyloXML tree.
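As a rough illustration of the one-tree-at-a-time write described above, here is a minimal sketch. DemoTree and write_trees are hypothetical names invented for this example and are not part of Biopython; to_string() is assumed to return Newick text without the trailing semicolon.

```python
import io


class DemoTree:
    """Hypothetical stand-in for a tree object (not a Biopython class)."""

    def __init__(self, newick):
        self.newick = newick

    def to_string(self):
        # Assumption: returns Newick text WITHOUT the trailing ';'
        return self.newick


def write_trees(trees, handle):
    """Write trees to handle one by one; return how many were written.

    `trees` may be any iterable, including a generator, so only one
    tree's text needs to be held in memory at a time.
    """
    count = 0
    for tree in trees:
        handle.write(tree.to_string() + ";\n")
        count += 1
    return count


# Usage: feed a generator so the trees are produced lazily.
out = io.StringIO()
written = write_trees((DemoTree(s) for s in ["(A,B)", "((A,B),C)"]), out)
```

Returning the count mirrors what I believe SeqIO.write() does, but treat that as a design suggestion rather than a requirement.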
So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO would only give you the trees - you'd not get the top level information. For parsing Nexus files, Bio.TreeIO would only give access to a subset of the data in a Nexus file - just the trees. In the same way, parsing a Nexus file with AlignIO only gives you the alignment. If you want any of the other data in a Nexus file, you have to use the Bio.Nexus module. If you (as a user) needed the top level annotation in a PhyloXML file, then I would say use Bio.PhyloXML (or what ever we are calling it) directly instead of Bio.TreeIO. >> As far as I know, Bio.Nexus just parses a whole file in one go. This >> means either Bio.TreeIO.NexusIO would call this and then loop over >> the list (very memory inefficient), or it would need a minimal Nexus >> parser just to spot the TREES block, and handle them only. > > That's what I pictured for a Bio.Nexus refactoring -- I don't know the right > way to do it in a memory-efficient way, though, given that there are > multiple types of blocks and they may be needed at different times. Maybe > make an initial pass to index the file at the block level, then call > incremental line-level parsers on the selected blocks? Or, simpler, factor > out the efficient line-level parsers so that they can be accessed separately > if need be -- basically the way Nexus._tree() works now -- and let the > block-level parsing code call those specific parsers. Maybe. Of course, in practice Nexus files may not be that big. I don't know if anyone uses them to store (for example) 1000 bootstrap trees. As Brad and I have noted before, spending time on refactoring Bio.Nexus is not the best use of your GSoC project time (plus we'd need to get Cymon and Frank much more involved, worry more about backwards compatibility etc). >> I'm not actually sure where I put it... it should be on my old desktop >> at home somewhere. 
However, I can elaborate in that in addition NJ >> using quicktree, I also did parsimony bootstrap values, and drew my >> own colourful trees using reportlab. See the three supplementary >> figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 > > Hey, neat. I was about to start a project involving kinases and response > regulators. Cool - email me off list if you want to chat more about this aspect. > How much trouble was it to draw trees in reportlab? Do you think > it would be worth adding a tree-drawing module to Bio.Graphics? I agree that tree drawing would be a nice addition to Bio.Graphics. But that code of mine as written would not be good enough. In the end it was a bit of a hack - it got the job done but had lots of special cases (e.g. to get colouring by species to work, and in particular the double bootstrap values caused me pain as I had to have two otherwise identical trees loaded). Even ignoring this, the basic code didn't use an object orientated approach which makes it a poor match to the rest of Bio.Graphics. Basically I would want to rewrite it from scratch before I felt it was fit for public reuse, and have never found the time. Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 29 16:57:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 16:57:38 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907292057.n6TKvcF9028919@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #5 from sridhar.ratna at gmail.com 2009-07-29 16:57 EST ------- Yup, that works. When run as a script (eg: via subprocess module), setup.py terminates when numpy is not installed. That is good enough fix. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 17:02:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 17:02:30 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907292102.n6TL2UgT029129@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 17:02 EST ------- (In reply to comment #5) > Yup, that works. When run as a script (eg: via subprocess module), setup.py > terminates when numpy is not installed. > > That is good enough fix. > Great. Thank you for your report, and taking the time to test this for us. Marking as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From czmasek at burnham.org Wed Jul 29 17:12:52 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Wed, 29 Jul 2009 14:12:52 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython In-Reply-To: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> References: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Message-ID: Hi, Eric: Looks good! Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. 
Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. This is a nice thought, but to be honest, I would not do it, especially since it is likely there will be more versions in the future (although, hopefully, just extending 1.10, as opposed to the removal and change of elements. - PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? This is fine! - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? Time permitting I would concentrate on exporting to other useful object representations and on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extensions. Christian ________________________________________ From: wg-phyloinformatics-bounces at nescent.org [wg-phyloinformatics-bounces at nescent.org] On Behalf Of Eric Talevich [eric.talevich at gmail.com] Sent: Monday, July 27, 2009 10:56 AM To: Phyloinformatics Group; BioPython-Dev Mailing List Subject: [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython Hi folks, Previously (July 20-24) I: Finished implementing I/O methods, Tree classes and tests for all phyloXML elements. 
Changed Writer to preserve node order in the XML; output now validates under the phyloXML 1.00 schema (but 1.10 complains) Did some drastic code reorganization. - Bio.Tree: - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree classes - Made Clade inherit from BaseTree.Tree in addition to BaseTree.Node, and added the corresponding attributes - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML - Bio.TreeIO: - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new Bio.TreeIO module, and updated imports everywhere - Added wrappers for Nexus read/write; doesn't return Bio.Tree objects yet though Added/updated unit tests for all of this. Documented the code reorg on the Biopython wiki, adding Tree and TreeIO pages and fixing the examples on the PhyloXML page. Scrubbed docstrings and enabled epydoc processing. This week (July 27-31) I will: Finish implementing the phyloXML spec: - Scan "simple types" for restricted tokens; check strings in constructors - Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?) - Clean up and reorganize any code that needs it Enhancements (time permitting): - Improve the SeqRecord conversion - Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extension - Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree, Bioperl node objects, PyCogent, p4-phylogenetics - Tree method: build_index (set left_idx, right_idx on all nodes): - calculate left/right indexes for nested-set representation - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html - Export to networkx (http://networkx.lanl.gov/) -- also get graphviz export for free, via networkx.to_agraph() Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. 
- PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From hlapp at gmx.net Wed Jul 29 21:55:47 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 29 Jul 2009 21:55:47 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> Message-ID: On Jul 29, 2009, at 3:37 PM, Peter wrote: > consider a Newick file containing one tree but with no trailing semi > colon That's actually not legal Newick format if you take it by the letter. 
Some programs out there are lenient and take it anyway, but some will actually balk and throw an error. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From eric.talevich at gmail.com Thu Jul 30 00:10:35 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 30 Jul 2009 00:10:35 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> Message-ID: <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> On Wed, Jul 29, 2009 at 3:37 PM, Peter wrote: > On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich > wrote: > >> > >> Bio.TreeIO.parse() should be an iterator, returning complete tree > >> objects one by one. I was thinking of having Bio.TreeIO.NewickIO > >> just take a plain text file, split it up at the ";\n" characters (or > >> similar) to get each tree as a string, which is passed to > >> Bio.Nexus.Trees.Tree to parse it. > > > > OK, I did this. > > > > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py > > OK, I haven't run the code but have a couple of points. > > On a general point, are you intending to re-write parse and read > functions for each tree format? For Bio.SeqIO all I do is write a > iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also > Bio.SeqIO.read() call this. > If the top-level TreeIO read function returns just the first parsed tree and raises a ValueError if 0 or >1 trees are available, then I can make the wrappers simpler and reduce some code duplication. 
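A minimal sketch of that arrangement, under some loud assumptions: the parse() and read() below are placeholders that work on plain Newick strings rather than tree objects, the splitting is naive (a ';' inside a quoted label would confuse it), and a final tree with no trailing semicolon is accepted leniently.

```python
import io


def parse(handle):
    """Yield one Newick string at a time, without the trailing ';'."""
    buf = []
    for line in handle:
        # A single line may hold several trees, or only part of one.
        while ";" in line:
            head, line = line.split(";", 1)
            buf.append(head)
            text = "".join(buf).strip()
            buf = []
            if text:
                yield text
        buf.append(line)
    # Lenient: accept a last tree that lacks the terminal semicolon.
    tail = "".join(buf).strip()
    if tail:
        yield tail


def read(handle):
    """Return the single tree in handle; ValueError if there are 0 or >1."""
    trees = parse(handle)
    first = next(trees, None)
    if first is None:
        raise ValueError("No trees found in handle")
    if next(trees, None) is not None:
        raise ValueError("More than one tree found in handle")
    return first


# Usage
two = list(parse(io.StringIO("(A,B);\n(C,D);\n")))
one = read(io.StringIO("(A,B);"))
```

With this shape each format only needs to supply the iterator, and read() stays format-independent.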
The parsing code looks weird to me - but that is probably a style > thing. Certainly I had to stare at it to work out what it was doing. > It also has a bug - consider a Newick file containing one tree but > with no trailing semi colon. > It is weird; I'll fix these issues in parse() and write(). (I only tested with a small 2-tree file.) Style: The "foo and bar or baz" is a Py2.4-friendly idiom that we can one day replace everywhere with the real ternary expression syntax introduced in Py2.5: "bar if foo else baz". I've been using it throughout my GSoC code, though it's not really necessary in this function. Hilmar says there's supposed to be a terminal semicolon; I didn't check what Biopython's parser does but I suppose this should duplicate that. >> > Sounds like an incremental parse() function over these trees would be > >> > very useful for distributed bootstrap analysis etc. > >> > >> Exactly. And Bio.TreeIO.read() would be for the special case where > >> the file format contains exactly one tree. > > > > PhyloXML has a top-level object that contains multiple phylogenies, plus > > arbitrary 'other' data; PhyloXML.read() returns one of those object > > regardless of how many phylogenies it contains. Newick doesn't have a > > top-level container, so returning one tree and raising a RuntimeError if > > there isn't exactly one tree makes sense. But Nexus has a top-level > > container with (potentially) a bunch of other info -- should > NexusIO.read() > > return the complete Nexus object, or just pretend to be a Newick wrapper > and > > behave that way? > > > Ah. The top level information about all the trees may cause trouble > for the TreeIO model I had in mind (which was *just* for trees). The > advantage of this is a consistent API, the downside is certain file > format specific things cannot be supported nicely. I think this balance > has worked nicely for SeqIO and AlignIO to date. So: > * Bio.TreeIO.read(...) would return one tree. > * Bio.TreeIO.parse(...) 
would iterate over trees one by one. > * Bio.TreeIO.write(...) would write trees out (ideally sequentially > if the file format allows this). > > Note I am assuming it is possible to write a PhyloXML tree with > minimal (empty) top level annotation? You would need to do this > in order to convert from a Nexus or Newick tree to a (minimal) > PhyloXML tree. > > So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO > would only give you the trees - you'd not get the top level information. > For parsing Nexus files, Bio.TreeIO would only give access to a > subset of the data in a Nexus file - just the trees. In the same way, > parsing a Nexus file with AlignIO only gives you the alignment. If > you want any of the other data in a Nexus file, you have to use the > Bio.Nexus module. > > If you (as a user) needed the top level annotation in a PhyloXML file, > then I would say use Bio.PhyloXML (or what ever we are calling it) > directly instead of Bio.TreeIO. > Within the last couple of weeks, I moved all of the PhyloXML I/O code to Bio.TreeIO.PhyloXMLIO, and the tree class definitions to Bio.Tree.PhyloXML -- so there is no Bio.PhyloXML module now, as far as imports and setup.py are concerned. Unlike Nexus, a phyloXML file really doesn't contain anything other than phylogenetic trees and their annotations, so I didn't see the need to clutter the Bio namespace further. Plan: TreeIO has read(), parse(), write(), and possibly convert(), which behave exactly like the corresponding AlignIO and SeqIO functions, but with trees. Under Bio.TreeIO we have wrappers for other formats, and these wrappers may have public functions that go beyond the shared TreeIO ones. In some cases this can lead to a specific read-like function that returns a single object containing one or more trees, plus other tree-related metadata. This function can either be called read() also, as it currently is in PhyloXMLIO, or we could choose another name like load(). 
For basic tree access:

    from Bio import TreeIO
    tree = TreeIO.read('example.xml', 'phyloxml')
    TreeIO.write([tree], 'example.nex', 'nexus')

For the connoisseur:

    from Bio.TreeIO import PhyloXMLIO
    phx = PhyloXMLIO.read('example.xml')
    if phx.other:
        # do something clever...

> Of course, in practice Nexus files may not be that big. I don't
> know if anyone uses them to store (for example) 1000 bootstrap trees.
> As Brad and I have noted before, spending time on refactoring Bio.Nexus
> is not the best use of your GSoC project time (plus we'd need to get
> Cymon and Frank much more involved, worry more about backwards
> compatibility etc).
>

This refactoring quest actually started because I was trying to figure out an object model for BaseTree that could support PhyloDB, reuse the Nexus tree methods with some resemblance to the original form, and still provide useful base classes for phyloXML. That was holding up everything else -- but I think it's under control now.

> I agree that tree drawing would be a nice addition to Bio.Graphics.
>
> But that code of mine as written would not be good enough. In the
> end it was a bit of a hack - it got the job done but had lots of special
> cases (e.g. to get colouring by species to work, and in particular the
> double bootstrap values caused me pain as I had to have two otherwise
> identical trees loaded). Even ignoring this, the basic code didn't use
> an object orientated approach which makes it a poor match to the
> rest of Bio.Graphics. Basically I would want to rewrite it from scratch
> before I felt it was fit for public reuse, and have never found the time.
>

Maybe it will be worth another shot after the Tree module settles down. If networkx export comes easily this week, that may also take care of visualization for some uses.
Cheers, Eric From biopython at maubp.freeserve.co.uk Thu Jul 30 05:13:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 10:13:29 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> Message-ID: <320fb6e00907300213s582c9313ya38b48d84993101b@mail.gmail.com> On Thu, Jul 30, 2009 at 5:10 AM, Eric Talevich wrote: > On Wed, Jul 29, 2009 at 3:37 PM, Peter wrote: > >> The parsing code looks weird to me - but that is probably a style >> thing. Certainly I had to stare at it to work out what it was doing. >> It also has a bug - consider a Newick file containing one tree but >> with no trailing semi colon. > > Hilmar says there's supposed to be a terminal semicolon; I didn't check > what Biopython's parser does but I suppose this should duplicate that. Hilmar is right, see http://evolution.genetics.washington.edu/phylip/newicktree.html However, in this case I would opt to support this variant anyway for input (but you must include the ";" on output). > Plan: > TreeIO has read(), parse(), write(), and possibly convert(), which behave > exactly like the corresponding AlignIO and SeqIO functions, but with trees. > Under Bio.TreeIO we have wrappers for other formats, and these wrappers may > have public functions that go beyond the shared TreeIO ones. Sounds good. > In some cases this can lead to a specific read-like function that returns a > single object containing one or more trees, plus other tree-related > metadata. 
> This function can either be called read() also, as it currently is
> in PhyloXMLIO, or we could choose another name like load().
>
> For basic tree access:
>
>     from Bio import TreeIO
>     tree = TreeIO.read('example.xml', 'phyloxml')
>     TreeIO.write([tree], 'example.nex', 'nexus')
>
> For the connoisseur:
>
>     from Bio.TreeIO import PhyloXMLIO
>     phx = PhyloXMLIO.read('example.xml')
>     if phx.other:
>         # do something clever...

Sounds OK to me at first glance.

> Of course, in practice Nexus files may not be that big. I don't
>> know if anyone uses them to store (for example) 1000 bootstrap trees.
>> As Brad and I have noted before, spending time on refactoring Bio.Nexus
>> is not the best use of your GSoC project time (plus we'd need to get
>> Cymon and Frank much more involved, worry more about backwards
>> compatibility etc).
>
> This refactoring quest actually started because I was trying to figure out
> an object model for BaseTree that could support PhyloDB, reuse the Nexus
> tree methods with some resemblance to the original form, and still provide
> useful base classes for phyloXML. That was holding up everything else --
> but I think it's under control now.

Cool.

>> I agree that tree drawing would be a nice addition to Bio.Graphics.
>> ...
>
> Maybe it will be worth another shot after the Tree module settles down. If
> networkx export comes easily this week, that may also take care of
> visualization for some uses.

Good point. In fact from memory, my tree PDF code was probably using Thomas Mailund's Newick parser (not Bio.Nexus which didn't exist when I first started work on trees).
http://www.birc.au.dk/~mailund/newick.html

Peter

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 13:20:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 13:20:28 -0400
Subject: [Biopython-dev] [Bug 2890] New: Getting setup.py to work in Jython
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2890

Summary: Getting setup.py to work in Jython
Product: Biopython
Version: 1.51b
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu

Currently setup.py fails in Jython because that implementation of Python does not support building C extensions. This can be avoided by adding the code:

    if os.name == 'java':
        EXTENSIONS = []
    else:
        EXTENSIONS = [
        ...continue with regular extension definition

This will not introduce bugs into main BioPython target platforms (CPython), and will allow for development on new platforms (Jython). Tested with Jython 2.5.0.

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:06:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:06:32 -0400
Subject: [Biopython-dev] [Bug 2891] New: Jython test_NCBITextParser fix+patch
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2891

Summary: Jython test_NCBITextParser fix+patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Unit Tests
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu

Jython is limited by JVM method sizes; overly large methods cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code length ...).
The test_NCBITextParser unit test contains a few methods that are larger than the JVM limit. This can be fixed by breaking some of the methods into multi-segment tests, so test_bt007 becomes test_bt007a and test_bt007b. A sample fix patch, tested with Jython2.5.0:

713c713
<     def test_bt007(self):
---
>     def test_bt007a(self):
1242a1243,1250
>
>     def test_bt007b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.10 output, three rounds (bt007)"
>
>         path = os.path.join('Blast', 'bt007')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
1891c1899
<     def test_bt009(self):
---
>     def test_bt009a(self):
2525a2534,2541
>
>
>     def test_bt009b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.10 output, two rounds (bt009)"
>
>         path = os.path.join('Blast', 'bt009')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
5635c5651
<     def test_bt047(self):
---
>     def test_bt047a(self):
6251a6268,6275
>
>     def test_bt047b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.11 output, two rounds (bt047)"
>
>         path = os.path.join('Blast', 'bt047')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
9959c9983
<     def test_bt060(self):
---
>     def test_bt060a(self):
10330a10355,10362
>
>     def test_bt060b(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
11000a11033,11041
>
>
>     def test_bt060c(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
11504a11546,11552
>
>     def test_bt060d(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
11812a11861,11866
>
>     def test_bt060e(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email -------
You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:40:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:40:33 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200907312040.n6VKeX4u029072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2890 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:40:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:40:35 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200907312040.n6VKeZDl029078@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2891 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:47:17 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:47:17 -0400
Subject: [Biopython-dev] [Bug 2892] New: Jython MatrixInfo.py fix+patch
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2892

Summary: Jython MatrixInfo.py fix+patch
Product: Biopython
Version: 1.51b
Platform: Other
OS/Version: Other
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: kellrott at ucsd.edu
BugsThisDependsOn: 2890,2891

Jython is subject to JVM method size limits; overly large methods cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code length ...). MatrixInfo creates several matrices at the base level of the module, which causes that exception in Jython. This can be fixed by putting each of the matrix definitions in a separate function and then calling those functions to define the variables. Attached is a patch that works in Jython2.5.0 and should have no effect on CPython.
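The pattern described above, in miniature. This is only a sketch: the name toy_matrix and its values are made up for illustration and are not taken from MatrixInfo, and the remark about why this helps on the JVM is my assumption rather than something I have measured.

```python
def gen_toy_matrix():
    # The dictionary literal now lives inside its own function body.
    # Assumption: each function compiles to a separate JVM method in
    # Jython, so no single method has to hold every matrix at once.
    return {
        ("A", "A"): 1,
        ("A", "B"): 0,
        ("B", "B"): 1,
    }


# The module-level name is bound exactly as before, so code that does
# `from module import toy_matrix` should not notice the difference.
toy_matrix = gen_toy_matrix()
```

The trade-off is one extra function call per matrix at import time, which should be negligible next to building the dictionaries themselves.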
9c9,10 < available_matrices = ['benner6', 'benner22', 'benner74', 'blosum100', --- > def gen_available_matrices(): > return [ 22c23,24 < benner6 = { --- > def gen_benner6(): > return { 78c80,81 < benner22 = { --- > def gen_benner22(): > return { 134c137,138 < benner74 = { --- > def gen_benner74(): > return { 190c194,195 < blosum100 = { --- > def gen_blosum100(): > return { 262c267,268 < blosum30 = { --- > def gen_blosum30(): > return { 334c340,341 < blosum35 = { --- > def gen_blosum35(): > return { 406c413,414 < blosum40 = { --- > def gen_blosum40(): > return { 478c486,487 < blosum45 = { --- > def gen_blosum45(): > return { 550c559,560 < blosum50 = { --- > def gen_blosum50(): > return { 622c632,633 < blosum55 = { --- > def gen_blosum55(): > return { 694c705,706 < blosum60 = { --- > def gen_blosum60(): > return { 766c778,779 < blosum62 = { --- > def gen_blosum62(): > return { 838c851,852 < blosum65 = { --- > def gen_blosum65(): > return { 910c924,925 < blosum70 = { --- > def gen_blosum70(): > return { 982c997,998 < blosum75 = { --- > def gen_blosum75(): > return { 1054c1070,1071 < blosum80 = { --- > def gen_blosum80(): > return { 1126c1143,1144 < blosum85 = { --- > def gen_blosum85(): > return { 1198c1216,1217 < blosum90 = { --- > def gen_blosum90(): > return { 1270c1289,1290 < blosum95 = { --- > def gen_blosum95(): > return { 1342c1362,1363 < feng = { --- > def gen_feng(): > return { 1398c1419,1420 < fitch = { --- > def gen_fitch(): > return { 1444c1466,1467 < genetic = { --- > def gen_genetic(): > return { 1500c1523,1524 < gonnet = { --- > def gen_gonnet(): > return { 1556c1580,1581 < grant = { --- > def gen_grant(): > return { 1612c1637,1638 < ident = { --- > def gen_ident(): > return { 1668c1694,1695 < johnson = { --- > def gen_johnson(): > return { 1724c1751,1752 < levin = { --- > def gen_levin(): > return { 1780c1808,1809 < mclach = { --- > def gen_mclach(): > return { 1836c1865,1866 < miyata = { --- > def gen_miyata(): > return { 1892c1922,1923 < nwsgappep = 
{ --- > def gen_nwsgappep(): > return { 1959c1990,1991 < pam120 = { --- > def gen_pam120(): > return { 2031c2063,2064 < pam180 = { --- > def gen_pam180(): > return { 2103c2136,2137 < pam250 = { --- > def gen_pam250(): > return { 2175c2209,2210 < pam30 = { --- > def gen_pam30(): > return { 2247c2282,2283 < pam300 = { --- > def gen_pam300(): > return { 2319c2355,2356 < pam60 = { --- > def gen_pam60(): > return { 2391c2428,2429 < pam90 = { --- > def gen_pam90(): > return { 2458c2496,2497 < rao = { --- > def gen_rao(): > return { 2514c2553,2554 < risler = { --- > def gen_risler(): > return { 2570c2610,2611 < structure = { --- > def gen_structure(): > return { 2624a2666,2707 > available_matrices = gen_available_matrices() > benner6 = gen_benner6() > benner22 = gen_benner22() > benner74 = gen_benner74() > blosum100 = gen_blosum100() > blosum30 = gen_blosum30() > blosum35 = gen_blosum35() > blosum40 = gen_blosum40() > blosum45 = gen_blosum45() > blosum50 = gen_blosum50() > blosum55 = gen_blosum55() > blosum60 = gen_blosum60() > blosum62 = gen_blosum62() > blosum65 = gen_blosum65() > blosum70 = gen_blosum70() > blosum75 = gen_blosum75() > blosum80 = gen_blosum80() > blosum85 = gen_blosum85() > blosum90 = gen_blosum90() > blosum95 = gen_blosum95() > feng = gen_feng() > fitch = gen_fitch() > genetic = gen_genetic() > gonnet = gen_gonnet() > grant = gen_grant() > ident = gen_ident() > johnson = gen_johnson() > levin = gen_levin() > mclach = gen_mclach() > miyata = gen_miyata() > nwsgappep = gen_nwsgappep() > pam120 = gen_pam120() > pam180 = gen_pam180() > pam250 = gen_pam250() > pam30 = gen_pam30() > pam300 = gen_pam300() > pam60 = gen_pam60() > pam90 = gen_pam90() > rao = gen_rao() > risler = gen_risler() > structure = gen_structure() > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
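The MatrixInfo.py patch above wraps each module-level literal in a gen_* function and then rebinds the original names at the bottom of the module, presumably so that no single compiled method has to hold all of the data at once. A minimal sketch of that pattern follows. The matrices here are tiny made-up stand-ins: gen_toy_ident, gen_toy_mismatch and their values are hypothetical and are not taken from MatrixInfo.py.

```python
# Hypothetical, tiny stand-ins for the real substitution matrices.
# Each literal lives in its own function, so it is compiled as its
# own (small) code object instead of being part of the module body.
def gen_available_matrices():
    return ["toy_ident", "toy_mismatch"]


def gen_toy_ident():
    return {
        ("A", "A"): 1,
        ("A", "C"): 0,
        ("C", "C"): 1,
    }


def gen_toy_mismatch():
    return {
        ("A", "A"): 0,
        ("A", "C"): 1,
        ("C", "C"): 0,
    }


# The public names are still bound once at import time, so callers that
# look up module attributes (e.g. MatrixInfo.blosum62 in the real module)
# should not need to change.
available_matrices = gen_available_matrices()
toy_ident = gen_toy_ident()
toy_mismatch = gen_toy_mismatch()
```

The design choice is that only the layout of the module changes: the data and the public attribute names stay the same, and the functions are called exactly once each.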
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:47:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:47:30 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200907312047.n6VKlU0e029268@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2892 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 16:47:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:47:31 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200907312047.n6VKlVuB029274@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2892 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:28:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:28:25 -0400 Subject: [Biopython-dev] [Bug 2893] New: Jython test_prosite fix+patch Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Summary: Jython test_prosite fix+patch Product: Biopython Version: 1.51b Platform: Other OS/Version: Other Status: NEW Severity: enhancement Priority: P2 Component: Unit Tests AssignedTo: biopython-dev at biopython.org ReportedBy: kellrott at ucsd.edu BugsThisDependsOn: 2890,2891,2892 Jython is limited by the JVM's method size limit; overly large methods cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code length ...). The test_prosite unit test contains a few methods that are larger than the JVM limit. This can be fixed by splitting those methods into smaller ones. This patch combined with other bug fixes brings Biopython to the point where "jython setup.py test" can complete without throwing exceptions: Ran 122 tests in 39.295 seconds FAILED (failures = 74) Patch tested with Jython 2.5.0 3742c3742 < def test_read1(self): --- > def test_read1a(self): 4096a4097,4103 > > > def test_read1b(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 4499a4507,4515 > > > > > def test_read1c(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 4934a4951,4956 > > def test_read1d(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 5190a5213,5218 > > def test_read1e(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 5611a5640,5645 > > def
test_read1f(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 5892a5927,5932 > > def test_read1g(self): > "Parsing Prosite record ps00107.txt" > filename = os.path.join('Prosite', 'ps00107.txt') > handle = open(filename) > record = Prosite.read(handle) 6417c6457 < def test_read4(self): --- > def test_read4a(self): 6617a6658,6663 > > def test_read4b(self): > "Parsing Prosite record ps00432.txt" > filename = os.path.join('Prosite', 'ps00432.txt') > handle = open(filename) > record = Prosite.read(handle) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:28:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:28:38 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200907312128.n6VLScnl030449@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2893 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
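The test_prosite patch in Bug 2893 splits one oversized test method into several smaller ones, each of which re-reads the same record. A minimal sketch of the same idea with a shared helper is given below. It assumes a stand-in parser: read_record, SplitReadTest and the record fields are hypothetical and are not part of Bio.Prosite or of the attached patch.

```python
import unittest


def read_record():
    # Hypothetical stand-in for opening and parsing a Prosite file;
    # it returns a plain dict so the sketch needs no data files.
    return {"accession": "EXAMPLE", "hits": ["h1", "h2", "h3", "h4"]}


class SplitReadTest(unittest.TestCase):
    """One logical test split into small methods, as in the patch."""

    def _load(self):
        # Shared by every split method, so the open/read lines are
        # written once instead of being repeated in each part.
        return read_record()

    def test_read1a(self):
        record = self._load()
        self.assertEqual(record["accession"], "EXAMPLE")

    def test_read1b(self):
        record = self._load()
        self.assertEqual(len(record["hits"]), 4)
        self.assertEqual(record["hits"][0], "h1")
```

Each split method stays independent, so a failure in one part does not hide failures in the others, at the cost of parsing the record once per part.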
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:28:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:28:39 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200907312128.n6VLSdCw030455@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2893 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:28:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:28:45 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200907312128.n6VLSj7I030464@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2893 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:59:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:59:34 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200907312159.n6VLxY4V031200@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-31 17:59 EST ------- Fixed in CVS, the other Jython fixes will take a little longer to review. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:59:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:59:38 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200907312159.n6VLxcBj031220@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 Bug 2891 depends on bug 2890, which changed state. Bug 2890 Summary: Getting setup.py to work in Jython http://bugzilla.open-bio.org/show_bug.cgi?id=2890 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:59:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:59:52 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200907312159.n6VLxqg3031243@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Bug 2893 depends on bug 2890, which changed state. Bug 2890 Summary: Getting setup.py to work in Jython http://bugzilla.open-bio.org/show_bug.cgi?id=2890 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:59:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 17:59:54 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200907312159.n6VLxs5H031255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 Bug 2892 depends on bug 2890, which changed state. Bug 2890 Summary: Getting setup.py to work in Jython http://bugzilla.open-bio.org/show_bug.cgi?id=2890 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From agrobertson at telus.net Wed Jul 1 00:27:21 2009 From: agrobertson at telus.net (Gordon Robertson) Date: Tue, 30 Jun 2009 17:27:21 -0700 Subject: [Biopython-dev] Fwd: ACE files at Biopython-dev References: Message-ID: <9FFF9C34-5253-45F4-B009-6193A5FEEA6B@telus.net> I flagged the current ACE discussion to the Consed author, David Gordon, and am forwarding his response. G Begin forwarded message: > From: David Gordon > Date: June 30, 2009 9:38:10 AM PDT > To: Gordon Robertson > Cc: Yaron Butterfield , David Gordon > > Subject: Re: ACE files at Biopython-dev > > Hi, Gordon, > > Could you add a comment from me to this thread, please? Here is the > text to add: > > > I am the author of Consed and briefly read this thread. > > I have one suggestion on this tool: that it create phd ball files > instead of phd files, particularly if the number of reads is more than > a few hundred. phd files are a leftover from the days of sequencing > when there were only a few thousand reads at most. The linux > operating system and software cannot handle millions of phd files in > the same directory, so Consed now typically uses a small number of phd > balls. Here is an example of a phd ball file that contains 2 reads > (typically a phd ball file will contain up to a million--more than > that becomes difficult to copy). Notice that there is a comment at > the beginning starting with "#" at the beginning of the line. Also > notice that the BEGIN_SEQUENCE line is slightly different due to the > "1" at the end--this is the version, which corresponds to the > extension on the end of a phd file name such as > HWI-EAS94_4_1_1_537_446.phd.1 > > Notice also that peak positions (which normally form a 3rd column > after the quality) are now optional, which helps keep the file size > down. For reads that you want to see the traces, you will need to > have peak positions. 
A 454 example follows: > > > > # solexa file ../solexa_dir/solexa_reads.fastq (beginning) > > BEGIN_SEQUENCE HWI-EAS94_4_1_1_537_446 1 > BEGIN_COMMENT > TIME: Wed Dec 24 11:21:50 2008 > CHEM: solexa > END_COMMENT > BEGIN_DNA > g 30 > c 30 > c 30 > a 30 > a 30 > t 30 > c 30 > a 30 > g 30 > g 30 > t 30 > t 30 > t 30 > c 30 > t 30 > c 30 > t 30 > g 30 > c 30 > a 30 > a 28 > g 23 > c 30 > c 30 > c 30 > c 30 > t 30 > t 30 > t 28 > a 22 > g 8 > c 22 > a 7 > g 15 > c 15 > t 15 > g 10 > a 10 > g 11 > c 15 > END_DNA > END_SEQUENCE > > BEGIN_SEQUENCE HWI-EAS94_4_1_1_602_99 1 > BEGIN_COMMENT > TIME: Wed Dec 24 11:21:50 2008 > CHEM: solexa > END_COMMENT > BEGIN_DNA > g 30 > c 30 > c 30 > a 30 > t 30 > g 30 > g 30 > c 30 > a 30 > c 30 > a 30 > t 30 > a 30 > t 30 > a 30 > t 30 > g 30 > a 30 > a 30 > g 30 > g 30 > t 30 > c 30 > a 30 > g 30 > a 30 > g 16 > g 30 > a 28 > c 22 > a 22 > a 22 > c 14 > t 15 > t 15 > g 5 > c 10 > t 15 > g 10 > t 5 > END_DNA > END_SEQUENCE > > > phd ball files for 454 reads (in which traces are displayed) have more > information. Here is an example: > > BEGIN_SEQUENCE EBE03TV04IHLTF.77-243 1 > > BEGIN_COMMENT > > CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF > QUALITY_LEVELS: 99 > TIME: Thu Jul 27 12:33:48 2000 > TRACE_ARRAY_MIN_INDEX: 0 > TRACE_ARRAY_MAX_INDEX: 4723 > CHEM: 454 > > END_COMMENT > > BEGIN_DNA > g 37 91 > g 37 110 > g 37 129 > g 37 148 > a 37 167 > t 37 186 > g 37 205 > a 37 224 > a 37 243 > a 37 262 > g 37 281 > g 37 300 > g 37 319 > . > . > . > a 26 4385 > t 26 4404 > c 26 4423 > t 30 4442 > c 33 4461 > g 33 4480 > g 33 4499 > t 33 4518 > g 33 4537 > g 36 4556 > t 36 4575 > a 33 4594 > g 33 4613 > g 33 4632 > t 36 4651 > g 26 4670 > a 22 4689 > END_DNA > > END_SEQUENCE > > (more BEGIN_SEQUENCE/END_SEQUENCE blocks to follow) > > > The line: > CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF > indicates both the sff file that the read came from as well as the > read name. > > When creating ace files, BS lines are now optional. 
BS lines really > only make sense when the assembly is phrap > > > > David Gordon > > > > On Tue, 30 Jun 2009, Gordon Robertson wrote: > >> David >> >> I thought I should flag with you that code for ACE files are being >> discussed now in BioPython. >> >> G >> >> Begin forwarded message: >> >>> From: biopython-dev-request at lists.open-bio.org >>> Date: June 30, 2009 1:39:04 AM PDT >>> To: biopython-dev at lists.open-bio.org >>> Subject: Biopython-dev Digest, Vol 77, Issue 30 >>> Reply-To: biopython-dev at lists.open-bio.org >>> Send Biopython-dev mailing list submissions to >>> biopython-dev at lists.open-bio.org >>> To subscribe or unsubscribe via the World Wide Web, visit >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> or, via email, send a message with subject or body 'help' to >>> biopython-dev-request at lists.open-bio.org >>> You can reach the person managing the list at >>> biopython-dev-owner at lists.open-bio.org >>> When replying, please edit your Subject line so it is more specific >>> than "Re: Contents of Biopython-dev digest..." >>> Today's Topics: >>> >>> 1. GSoC Weekly Update 6: PhyloXML for Biopython (Eric Talevich) >>> 2. Re: GSoC Weekly Update 6: PhyloXML for Biopython (Peter) >>> 3. Re: [Biopython] Bio.Sequencing.Ace (Peter Cock) >>> 4. Re: Bio.Sequencing (Peter) >>> 5. Re: [Biopython] Bio.Sequencing.Ace (Jose Blanca) >>> 6. 
Re: GSoC Weekly Update 6: PhyloXML for Biopython >>> (Bartek Wilczynski) >>> ---------------------------------------------------------------------- >>> >>> ------------------------------ >>> Message: 3 >>> Date: Tue, 30 Jun 2009 09:01:28 +0100 >>> From: Peter Cock >>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace >>> To: Jose Blanca >>> Cc: biopython-dev at lists.open-bio.org >>> Message-ID: >>> <320fb6e00906300101r3e3faa37l6a47295bd5e12538 at mail.gmail.com> >>> Content-Type: text/plain; charset=ISO-8859-1 >>> On Mon, Jun 29, 2009 at 4:16 PM, Jose Blanca >>> wrote: >>>>> Are you using Bio.Sequencing.Ace in your code, or did you write >>>>> a whole >>>>> new parser instead? >>>> I wrote one, because I wanted to be able to get one particular >>>> contig or just >>>> the contig or the read names. But I don't think that is a >>>> problem. I guess >>>> that the Biopython parser could be easily adapted to that. >>> I see. This touches on the indexing discussion - the same idea on >>> this thread would probably work on Ace files too: >>> http://lists.open-bio.org/pipermail/biopython/2009-June/005275.html >>>>> Now that I have been using Ace files in my own work, I've been >>>>> meaning >>>>> to look over your stuff. In some ways, a contig class can be >>>>> seen as a >>>>> generalisation of a multiple sequence alignment class. Certainly >>>>> this is >>>>> something we should improve in Biopython (as you might gather from >>>>> some of the enhancement bugs on bugzilla, I have lots of ideas >>>>> for the >>>>> current alignment class), and I'm sure you have some great ideas >>>>> too. >>>> I think that here is the main deviation from Biopython. The >>>> contig class is >>>> similar to an alignment class, in fact my contig classes should be >>>> compatible >>>> with your new alignment proposal API. >>> That's good. I agree that a specialised contig class that works like >>> the traditional multiple sequence alignment class would be nice.
>>> It would then make sense to have Bio.AlignIO handle contigs as >>> well as traditional multiple sequence alignments. >>>> alignment. >>>> seq1 +++++++++> >>>> seq2 +++++++++> >>>> seq3 +++++++++> >>>> contig >>>> seq1 ++++> >>>> seq2     +++++> >>>> seq3         ++++++> >>>> Basically every read has a different coordinate system in the >>>> contig case. >>>> What I've done is to create a class named LocatableSequence that >>>> is a >>>> container for sequence objects. It works like: >>>> >>> seq1 = 'ATCG' >>>> >>> locseq1 = locate_sequence(seq1, location=10) >>>> >>> locseq1[10] == 'A' >>>> In that way the contig is a list of LocatableSequences and the >>>> coordinate >>>> system transformations are done by the LocatableSequences, not by >>>> the contig. >>>> The LocatableSequences also allow for masks. >>>> The LocatableSequence works with any sequence-like object: strs, >>>> Seq, >>>> SeqRecord, lists, etc. >>>> There's also a Location class that represents a fragment of a >>>> sequence. My >>>> Location class is more limited than the one in the Biopython >>>> SeqFeature. In >>>> my case the start and end should be integers. I use this class to >>>> represent >>>> the region not masked in the sequence and the Location of the >>>> sequence inside >>>> the LocatableSequence. >>>> Take a look at Contig.py and at LocatableSequence.py, these are >>>> the most >>>> relevant classes for this. >>>> Best regards, >>> I'll have to make some time for looking at your code. >>> What I was thinking of was a contig class as an alignment subclass, >>> holding a list of SeqRecord objects and offsets. The consensus might >>> just be one element of this list - but could be handled specially. >>> This >>> sounds simpler than having to introduce a whole new object system, >>> related to but different to SeqFeature objects. However, I don't yet >>> have a sample implementation to demonstrate this.
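The offset idea discussed above (each read carries its own position, and lookups are made in contig coordinates) can be sketched in a few lines. This is only an illustration under stated assumptions: LocatedRead, covers and column are hypothetical names, not Jose Blanca's LocatableSequence code and not a Biopython API, and masks and read direction are left out.

```python
class LocatedRead(object):
    """Hypothetical wrapper: a sequence placed at an offset in a contig."""

    def __init__(self, sequence, location=0):
        self.sequence = sequence  # any indexable sequence-like object
        self.location = location  # contig coordinate of the first letter

    def covers(self, contig_index):
        """True if this read has a letter at the given contig position."""
        end = self.location + len(self.sequence)
        return self.location <= contig_index < end

    def __getitem__(self, contig_index):
        # Translate from contig coordinates to the read's own coordinates.
        if not self.covers(contig_index):
            raise IndexError("position %d is outside this read" % contig_index)
        return self.sequence[contig_index - self.location]


def column(reads, contig_index):
    """Return the letters of every read that covers one contig column."""
    return [read[contig_index] for read in reads if read.covers(contig_index)]


# Two overlapping reads; the first mirrors the 'ATCG' at location 10 example.
reads = [
    LocatedRead("ATCG", location=10),
    LocatedRead("TCGA", location=11),
]
```

With this layout the contig itself is just a list, and the coordinate translation lives in the per-read wrapper, which is the split of responsibilities described in the message.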
>>> One important thing I think we should do BEFORE adding any contig >>> class to Biopython, is get it working with at least one other >>> contig file >>> format in addition to Ace. I don't want to end up with a class which >>> is too specialised for how ace contigs work. >>> Peter >>> ------------------------------ >>> Message: 4 >>> Date: Tue, 30 Jun 2009 09:18:44 +0100 >>> From: Peter >>> Subject: Re: [Biopython-dev] Bio.Sequencing >>> To: Cymon Cox >>> Cc: BioPython-Dev Mailing List >>> Message-ID: >>> <320fb6e00906300118l78ca2a98kc25278e24ad433a1 at mail.gmail.com> >>> Content-Type: text/plain; charset=ISO-8859-1 >>> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: >>>> Hi Peter, >>>> 2009/6/29 Peter >>>>> Hi Cymon, >>>>> I've checked in some of your patch on Bug 2865 already, >>>>> recording the per-letter-annotation which I was planning to >>>>> do but hadn't got round to yet - thank you: >>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 >>>>> This means with the latest code you can now use Biopython >>>>> to convert a PHD output file into a FASTQ file (or a QUAL >>>>> file) which could be handy for doing meta assemblies. >>>> Yeah, that's nice. Conversely, the reason I wrote the Phd writer >>>> is that I >>>> want to 'fake' some Phd files from FASTA and QUAL files - should/ >>>> might be >>>> possible by using the default headers and equally spaced peak >>>> locations. The >>>> use-case is to fool Consed into displaying the trace (which it >>>> 'fakes') from >>>> a 454 Mira assembly ACE file output, but which it will only do if >>>> the Phd >>>> files are available. So I'm hoping to write the Phd files from >>>> the original >>>> FASTA/QUAL input files. Not sure if this is going to work, or if >>>> its a >>>> sensible thing to be trying... >>> That sounds reasonable - as long as you know you are faking it ;) >>>>> I did relatively recently update SeqIO for the Ace format to >>>>> record the qualities - but there is an issue here. 
Only the >>>>> nucleotides get given quality scores, but not the insertions >>>>> (gaps, shown as "*" in the Ace file consensus sequence). >>>>> Currently the Bio.SeqIO parser gives the gapped sequence. >>>>> This means to record the quality scores, we need to give >>>>> some null value to the gap characters (and I used None). >>>>> What I am wondering about is making the Bio.SeqIO Ace >>>>> parser just return the ungapped sequence (and the >>>>> associated PHRED quality scores). This means we could >>>>> then convert Ace files into FASTQ or QUAL files, and also >>>>> a simple Ace to FASTA conversion would give something >>>>> useful for downstream analysis (the ungapped consensus). >>>>> The gaps *are* important if you want to see how the >>>>> consensus was built up - in which case it makes sense to >>>>> think about each Ace contig as a kind of multiple sequence >>>>> alignment. See this earlier discussion with David Winter: >>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html >>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html >>>>> Any thoughts? >>>> I think it's probably unwise to return an ungapped sequence/qual >>>> by default >>>> if the contig in the ACE assembly is gapped. It would be nice if >>>> the parser >>>> had a switch ungapped=True, but that's not going to work with the >>>> SeqIO >>>> interface. >>> We can certainly add an optional ungapped argument to the parser >>> in Bio.SeqIO.AceIO - that would be a small improvement, meaning >>> the functionality would be there if you needed it (albeit a bit >>> hidden). >>> Several of the Bio.SeqIO parsers already have optional arguments. >>> I have sometimes wondered about letting the SeqIO functions take >>> a **kwargs argument, and passing these arbitrary options to the >>> underlying parser. This would allow for example passing wrap options >>> to the FASTA writer, or skipping the features when parsing GenBank >>> and EMBL.
On the other hand, it gets very complicated, and detracts >>> from the current simplicity of Bio.SeqIO (which I like). >>>> Second best option would be to have an easy way of getting the >>>> ungapped SeqRecord from the gapped SeqRecord - a function >>>> somewhere in Bio.Sequencing? >>> I've already suggested some kind of "ungapped" method for Seq >>> objects, and yes, having this at the SeqRecord level too would >>> solve this particular use case. Removing the per-letter-annotations >>> associated with the gaps would be straightforward. I'm not sure >>> what we would want to do with any features in the SeqRecord >>> (perhaps a corner case), but most likely any SeqFeature covering >>> a region containing a gap would be lost. >>>> Anyway, I assume (haven't checked) that currently if all the >>>> contigs are free of gaps then the SeqIO.AceIO will parse >>>> them into an Ungapped alphabet which can then be written >>>> to FASTA/QUAL etc. I think this is the right way to go, if >>>> the contigs have gaps the user needs to decide how to deal >>>> with them explicitly. >>> Yes, if the Ace contig has no gaps, it will have a nice integer >>> PHRED quality for each base, and could be saved as FASTQ >>> or QUAL (or FASTA). >>> The thing about "gaps" in contigs is that the consensus is >>> really the ungapped sequence. I'd have to check but I think >>> Newbler and CAP3 will output both FASTA and ACE files, >>> and in the FASTA files there are no insertions/gaps in the >>> contig sequences. >>> What I am thinking is Bio.SeqIO could return the ungapped >>> consensus sequences as SeqRecord objects (which can then >>> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO >>> could return contig-alignment objects (with the gaps, like >>> David's cookbook but in the long run with a contig class). >>> This has some merit, but breaks my current convention that >>> parsing an alignment file with SeqIO works by giving each >>> gapped sequence in each alignment in turn.
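The "ungapped" step discussed above (drop the "*" gap letters from an Ace consensus together with the placeholder quality values stored at the same positions) can be sketched with plain Python. The function name ungap and its signature are hypothetical. This is not an existing Seq or SeqRecord method, and features are not handled.

```python
def ungap(sequence, qualities, gap="*"):
    """Drop gap characters and the quality entries at the same positions.

    sequence  -- gapped consensus as a string
    qualities -- list of the same length; gap positions may hold None
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    # Keep only the (base, quality) pairs whose base is not a gap.
    kept = [(base, qual) for base, qual in zip(sequence, qualities) if base != gap]
    bases = "".join(base for base, _ in kept)
    quals = [qual for _, qual in kept]
    return bases, quals


# A gapped consensus with None as the quality of the gap position.
bases, quals = ungap("AC*GT", [30, 28, None, 25, 31])
```

After this step every remaining letter has an integer quality, which is the precondition mentioned in the thread for writing FASTQ or QUAL output.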
>>> Peter >>> ------------------------------ >>> Message: 5 >>> Date: Tue, 30 Jun 2009 10:31:06 +0200 >>> From: Jose Blanca >>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace >>> To: biopython-dev at lists.open-bio.org >>> Message-ID: <200906301031.06273.jblanca at btc.upv.es> >>> Content-Type: text/plain; charset="iso-8859-1" >>>> What I was thinking of was a contig class as an alignment subclass, >>>> holding a list of SeqRecord objects and offsets. The consensus >>>> might >>>> just be one element of this list - but could be handled >>>> specially. This >>>> sounds simpler than having to introduce a whole new object system, >>>> related to but different to SeqFeature objects. However, I don't >>>> yet >>>> have a sample implementation to demonstrate this. >>> I thought about that implementation and I created some code. The >>> problem I >>> found with that approach is that the contig class code got too >>> messy. Take >>> into account that besides the offset you also need the masks and >>> that some >>> sequences could be reversed. That's why I decided to split the >>> part that >>> calculates the offset and the mask into a separate class. >>>> One important thing I think we should do BEFORE adding any contig >>>> class to Biopython, is get it working with at least one other >>>> contig file >>>> format in addition to Ace. I don't want to end up with a class >>>> which >>>> is too specialised for how ace contigs work. >>>> Peter >>> Well, In fact my contig class is modeled after the caf file >>> format. The ace >>> parsing was just an afterthought, my primary interest was the caf >>> format. >>> -- >>> Jose M. 
Blanca Postigo >>> Instituto Universitario de Conservacion y >>> Mejora de la Agrodiversidad Valenciana (COMAV) >>> Universidad Politecnica de Valencia (UPV) >>> Edificio CPI (Ciudad Politecnica de la Innovacion), 8E >>> 46022 Valencia (SPAIN) >>> Tlf.:+34-96-3877000 (ext 88473) >>> ------------------------------ >>> >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> End of Biopython-dev Digest, Vol 77, Issue 30 >>> ********************************************* >> >> -- >> Gordon Robertson >> Canada's Michael Smith Genome Sciences Centre >> Vancouver BC Canada >> >> >> > -- Gordon Robertson Canada's Michael Smith Genome Sciences Centre Vancouver BC Canada
From bugzilla-daemon at portal.open-bio.org Wed Jul 1 14:13:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Jul 2009 10:13:37 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200907011413.n61EDbDd022582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1333 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Jul 1 14:14:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Jul 2009 10:14:10 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200907011414.n61EEAbv022636@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 ------- Comment #3 from cymon.cox at gmail.com 2009-07-01 10:14 EST ------- Created an attachment (id=1335) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1335&action=view) Another patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From anaryin at gmail.com Wed Jul 1 14:27:47 2009 From: anaryin at gmail.com (João Rodrigues) Date: Wed, 1 Jul 2009 16:27:47 +0200 Subject: [Biopython-dev] [Bug 2867] New: Bio.PDB.PDBList.update_pdb calls invalid os.cmd Message-ID: I would introduce this new (and recommended) library instead of that command: http://docs.python.org/library/shutil.html But since this is the first bug I'm replying to... I'm asking you first. Cheers! João [ .. 
] Rodrigues (Blog) http://doeidoei.wordpress.com (MSN) always_asleep_ at hotmail.com (Skype) rodrigues.jglm From bugzilla-daemon at portal.open-bio.org Wed Jul 1 14:39:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Jul 2009 10:39:05 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200907011439.n61Ed5Ks024881@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-01 10:39 EST ------- (In reply to comment #2) > Following the email from David Gordon the Consed author via Gordon Roberston > (thanks Gordon) on the dev list > (http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006322.html) Ive > made some changes to the PhdWriter and parser: Yep - coping with missing peak values sounds like it is required now. > The writer no longer uses default values for the header COMMENTS (here we > differ from bioperl). Do you just leave out the comments? That seems better to me. > Peak location letter annotations are now optional in both the parsing and > writing. Good. > Additional unittest have been added for the examples of 454 and Solexa data > that David Gordon included in his message. I'll have to look at those later... > Note also: Currently we ignore comments in Phd files, ie those beginning with > "#". Nothing special is done with the version number which is appended to the > identifier on the BEGIN_SEQUENCE line in phd_ball files. > > Attached is a patch against biopython on github and Ive pushed changes to my > assembly branch. I've done another partial merge, still leaving out the writer code. I'm not going to commit that until next week at the earliest (when I'll be back at work) as I want to give it a good test first. I'm not sure if this will make it into Biopython 1.51 final or not. I will however try and add the new example files and test cases before that. 
[Don't feel you have to redo the patch - I can continue to pull bits out of it] As part of my commit I added a doctest to Bio/SeqIO/PhdIO.py, which has made me wonder if for SeqIO we should convert the PHRED sequence to upper case (just because it would look nicer for PHRED to FASTQ conversions). Thanks again, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 1 14:43:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Jul 2009 10:43:39 -0400 Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd In-Reply-To: Message-ID: <200907011443.n61EhdPb025132@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2867 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-01 10:43 EST ------- As João Rodrigues noted on the mailing list, the Python shutil library would be a sensible (and cross-platform) way to move/rename a file. I'm a little surprised that os.cmd ever worked - maybe it was present in an old version of Python... I'd have to check. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
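The shutil suggestion for Bug 2867 could be sketched roughly as follows. This is only an illustration written for a current Python, not the actual Bio.PDB.PDBList code: the helper name move_into_place, the file name pdb1abc.ent and the obsolete/ab directory layout are all made up for the example. The only library calls it relies on are os.makedirs, the os.path helpers, tempfile.mkdtemp and shutil.move.

```python
import os
import shutil
import tempfile


def move_into_place(src_path, dest_dir):
    """Move src_path into dest_dir and return the new path.

    Hypothetical helper for illustration; the name and behaviour are
    assumptions, not part of Biopython.
    """
    # Create the target directory first; shutil.move does not do this for us.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    # shutil.move works across platforms (and across file systems),
    # unlike shelling out to a platform-specific "mv" command.
    shutil.move(src_path, dest_path)
    return dest_path


# Small demonstration in a scratch directory (file name is made up).
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "pdb1abc.ent")
with open(src, "w") as handle:
    handle.write("HEADER    EXAMPLE\n")

moved = move_into_place(src, os.path.join(work_dir, "obsolete", "ab"))
```

Because the move is done by the standard library rather than by an external command, the same call should behave the same way on Windows, Linux and Mac OS X.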
From bugzilla-daemon at portal.open-bio.org Thu Jul 2 09:57:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Jul 2009 05:57:19 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200907020957.n629vJk6014895@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-02 05:57 EST ------- (In reply to comment #4) > (In reply to comment #2) > > Following the email from David Gordon the Consed author via > > Gordon Roberston (thanks Gordon) on the dev list > > (http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006322.html) I've checked in those two examples and extended the parsing unit tests now. This showed a small issue with PHD "file names" with a space in them, which I have resolved following our convention for FASTA files. This means converting PHD to FASTA/FASTQ/QUAL works nicely. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Thu Jul 2 20:59:08 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 2 Jul 2009 16:59:08 -0400 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython Message-ID: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> Hi all, While everyone was away in Stockholm having a great time, I added some user-oriented documentation for my project to the Biopython wiki: http://www.biopython.org/wiki/PhyloXML What do you think? Any missing information, unclear wording, or outright lies? I also updated the project plan with some ideas for filling up the rest of July: http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML The code is there, too. 
The useful files to look at are Bio/PhyloXML/*.py and Tests/test_PhyloXML.py, if anyone would like to take a look. I would greatly appreciate any comments on any of this. Thanks! Eric From biopython at maubp.freeserve.co.uk Sat Jul 4 14:14:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 15:14:03 +0100 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> Message-ID: <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> On Thu, Jul 2, 2009 at 9:59 PM, Eric Talevich wrote: > Hi all, > > While everyone was away in Stockholm having a great time, I added some > user-oriented documentation for my project to the Biopython wiki: > http://www.biopython.org/wiki/PhyloXML > > What do you think? Any missing information, unclear wording, or outright > lies? The __repr__ thing isn't Biopython specific, its just what Python does. For simple objects, eval(repr(obj)) should recreate the object. Consider: >>> print phx.other [Other(tag=alignment, namespace=http://example.org/align)] That is odd to me. It looks like "other" is a list, containing an "Other" object, but with a funny __repr__ - I would have expected it to look more like this: >>> print phx.other [Other(tag="alignment", namespace="http://example.org/align")] i.e. using the repr of what I have assumed are string arguments. 
Peter From eric.talevich at gmail.com Sat Jul 4 16:28:45 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 4 Jul 2009 12:28:45 -0400 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> Message-ID: <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> On Sat, Jul 4, 2009 at 10:14 AM, Peter wrote: > > The __repr__ thing isn't Biopython specific, its just what Python does. For > simple objects, eval(repr(obj)) should recreate the object. Consider: > > >>> print phx.other > [Other(tag=alignment, namespace=http://example.org/align)] > > That is odd to me. It looks like "other" is a list, containing an "Other" > object, but with a funny __repr__ - I would have expected it to look more > like this: > > >>> print phx.other > [Other(tag="alignment", namespace="http://example.org/align")] > > i.e. using the repr of what I have assumed are string arguments. > > Peter > Hi Peter, Thanks! Your interpretation of the example is correct. I'll change __repr__ to check if the attribute is a string and, if so, escape and quote it. In the docs, I wrote that the representation is Biopython-style because by default, Python does something a little different for complex objects: >>> class Foo(object): pass >>> Foo() <__main__.Foo object at 0xb7cff22c> But I noticed that Seq and other Biopython objects give a nicer representation that actually works as a constructor, so I tried to match that. Cheers, Eric (P.S. - Sorry if the original message seemed a little terse or weird. I watched the BOSC slides and I do appreciate the effort you all put into the conference.) 
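The __repr__ point discussed above can be sketched as follows. This is a minimal stand-in written for a current Python, assuming a class with just the two attributes shown in the example output; it is not the real Bio.PhyloXML Other class, and the __eq__ method is added here only so the round trip can be checked.

```python
class Other(object):
    """Hypothetical stand-in for a PhyloXML 'Other' element (illustration only)."""

    def __init__(self, tag, namespace=None):
        self.tag = tag
        self.namespace = namespace

    def __repr__(self):
        # %r applies repr() to each attribute, so strings come out quoted
        # and escaped, and the result is valid constructor syntax.
        return "%s(tag=%r, namespace=%r)" % (
            self.__class__.__name__, self.tag, self.namespace)

    def __eq__(self, other):
        # Added for this sketch so that the round trip can be compared.
        return (isinstance(other, Other)
                and (self.tag, self.namespace) == (other.tag, other.namespace))


original = Other("alignment", "http://example.org/align")
text = repr(original)
# For a simple object like this, eval(repr(obj)) recreates the object.
rebuilt = eval(text)
```

Python's own repr() of a string uses single quotes here, so the text differs slightly from the double-quoted form suggested in the thread, but it is equally valid as a constructor call.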
From biopython at maubp.freeserve.co.uk Sat Jul 4 16:39:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 17:39:13 +0100 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> Message-ID: <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> On Sat, Jul 4, 2009 at 5:28 PM, Eric Talevich wrote: > On Sat, Jul 4, 2009 at 10:14 AM, Peter wrote: > >> >> The __repr__ thing isn't Biopython specific, its just what Python does. For >> simple objects, eval(repr(obj)) should recreate the object. Consider: >> >> >>> print phx.other >> [Other(tag=alignment, namespace=http://example.org/align)] >> >> That is odd to me. It looks like "other" is a list, containing an "Other" >> object, but with a funny __repr__ - I would have expected it to look more >> like this: >> >> >>> print phx.other >> [Other(tag="alignment", namespace="http://example.org/align")] >> >> i.e. using the repr of what I have assumed are string arguments. >> >> Peter >> > > Hi Peter, > > Thanks! Your interpretation of the example is correct. I'll change __repr__ > to check if the attribute is a string and, if so, escape and quote it. > > In the docs, I wrote that the representation is Biopython-style because by > default, Python does something a little different for complex objects: > >>>> class Foo(object): pass >>>> Foo() > <__main__.Foo object at 0xb7cff22c> Yes, that is the Python default for a user defined object. > But I noticed that Seq and other Biopython objects give a nicer > representation that actually works as a constructor, so I tried > to match that. 
I'd have to think of some more examples, but other Python modules try to have eval(repr(obj)) work for their (simpler) objects. If you can do it without risking a really long string, this is a good idea. You'll notice the Seq object repr actually uses a truncated sequence for long sequences - you won't want to accidentally get the whole thing printed at the python prompt! Likewise doing repr() on a SeqRecord doesn't give you the full object. Peter From eric.talevich at gmail.com Sat Jul 4 17:24:12 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 4 Jul 2009 13:24:12 -0400 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <320fb6e00907040714w6d7e9a6bq8c622572c0962b8c@mail.gmail.com> <3f6baf360907040928l528665dcv1b2f0f80a5d7dfa8@mail.gmail.com> <320fb6e00907040939s4363370x6f56999a18591a01@mail.gmail.com> Message-ID: <3f6baf360907041024j1a3495a0k997733ad12ca7d39@mail.gmail.com> On Sat, Jul 4, 2009 at 12:39 PM, Peter wrote: > On Sat, Jul 4, 2009 at 5:28 PM, Eric Talevich > wrote: > > On Sat, Jul 4, 2009 at 10:14 AM, Peter >wrote: > > > >> > >> The __repr__ thing isn't Biopython specific, its just what Python does. > For > >> simple objects, eval(repr(obj)) should recreate the object. Consider: > >> > >> >>> print phx.other > >> [Other(tag=alignment, namespace=http://example.org/align)] > >> > >> That is odd to me. It looks like "other" is a list, containing an > "Other" > >> object, but with a funny __repr__ - I would have expected it to look > more > >> like this: > >> > >> >>> print phx.other > >> [Other(tag="alignment", namespace="http://example.org/align")] > >> > >> i.e. using the repr of what I have assumed are string arguments. > >> > >> Peter > >> > > > > Hi Peter, > > > > Thanks! Your interpretation of the example is correct. 
I'll change > __repr__ > > to check if the attribute is a string and, if so, escape and quote it. > Correction: since it's filtering for primitive types already, I'll just call repr() on each attribute. I changed the wiki page examples to show this, and I'll fix the code on Monday. > > If you can do it without risking a really long string, this is a good > idea. You'll notice the Seq object repr actually uses a truncated > sequence for long sequences - you won't want to accidentally > get the whole thing printed at the python prompt! Likewise > doing repr() on a SeqRecord doesn't give you the full object. > > Peter > OK, I'll add another check for long strings and truncate them like Seq does. This isn't in the wiki examples yet, though. -Eric From eric.talevich at gmail.com Sat Jul 4 19:32:32 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 4 Jul 2009 15:32:32 -0400 Subject: [Biopython-dev] Biopython link on python.org wiki Message-ID: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com> Hi, Is anyone on this list active on the python.org wiki? I noticed that the "Scientific and Numeric" page, which gets a link on the front page of python.org, did not mention Biopython. In a fit of enthusiasm I add a link to biopython.org at the bottom, incorporating the existing pycluster item. Would someone else more familiar with landscape of scientific Python software like to review this and perhaps incorporate it more appropriately into the page? http://wiki.python.org/moin/NumericAndScientific Thanks, Eric From chapmanb at 50mail.com Sat Jul 4 19:38:43 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 4 Jul 2009 15:38:43 -0400 Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython In-Reply-To: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> Message-ID: <20090704193843.GA1206@kunkel> Hi Eric; Great stuff as always. 
You are rocking on this; I was digging through your code at the end of the week and was really happy with what you've put together. > user-oriented documentation for my project to the Biopython wiki: > http://www.biopython.org/wiki/PhyloXML > > What do you think? Any missing information, unclear wording, or outright > lies? What you have looks very good. A couple of thoughts on other things that would be useful: - In the usage section where you introduce clades, it might help to have a high-level diagram of a simple tree and the corresponding PhyloXML representation in terms of phylogeny and the clade parent/child relationship. Understanding this representation is important for newcomers and might ease them into using the classes. - The examples in 'Using PhyloXML objects' are very good and to the extent you have time to expand this, more of these would be very useful. These real-life examples are the best way to help users discover the features of PhyloXML. Based on Christian's highlighted features on the PhyloXML page, a little brainstorming on some things to tackle: - Providing annotation data on a node of the tree. - Adding orthology relationships to the tree; generally providing high-level node data. These would expose more of the extensive markup elements built into PhyloXML and help users discover them. > I also updated the project plan with some ideas for filling up the rest of > July: > http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML I really like the idea of exploring interoperability with other Biopython tree representations and generalizing there. In addition to the Tree class in Bio.Nexus, the PyCogent tree representation looks generalized: http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/core/tree.py?view=markup Combining this with the PhyloXML examples above, maybe it would be worthwhile to think through and document a more complicated pipeline.
Something like starting with a protein, identifying homologs, building a tree, adding annotation data, and outputting to PhyloXML. This would be a great starting place to how to interoperate, and also give users a jumping off point for providing more phylogenies in PhyloXML. Similarly, a PhyloXML to networkx (or other) display would also give a nice interoperable use case for others to build off of. Thanks for all your hard work on this, Brad From chapmanb at 50mail.com Sat Jul 4 20:11:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 4 Jul 2009 16:11:00 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <4A4D052D.7010708@berkeley.edu> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> Message-ID: <20090704201059.GB29677@kunkel> Hi Nick; Thanks much for the update. I'm cc'ing in the Biopython dev list to keep everyone there in the loop as well. > I have worked out a number of better functions for searching xml > database results, i.e. finding all elements with tags y that exist > somewhere inside elements with tags x. This is much more flexible in > the event that data of interest resides at different levels of a > hierarchy, which I have found in some cases. Awesome. Echoing what Hilmar mentioned, it would be good to step back and this point and talk about integration with Biopython. A couple of thoughts and suggestions along those lines: - You've included code from Lagrange which worries me for two reasons. First, this overlaps with existing Biopython functionality in Bio.Nexus; we want to eliminate that as it's confusing for users of the package to find different non-compatible implementations. If the existing code doesn't work for you in some way, could you flesh out those issues on the Biopython dev list so we can work to resolve them. 
Secondly, lagrange is licensed under the GPL so practically it is not compatible with Biopython, which is licensed much more freely. - You've settled on a flat system of coding with functions and no nesting inside of classes. This makes it difficult to distinguish the public API from internal functions. We could help make this more clear in a couple of ways: - Organizing related functionality into classes. - Prefixing internal functions with underscores to indicate they are not meant to be called by users. - Starting to provide some user documentation, ideally centered around use cases. Often these help provide a way to think about the usability of the code and hint at ways to improve it. Hope this is helpful and I'm happy to offer more specific suggestions as you dig into it. Have a great 4th of July weekend, Brad > Stephen Smith wrote: > > These look really great. Glad the lagrange tree code is working out. I > > am very excited for the merging of the Biopython and the lagrange tree > > classes. More details to come. > > Stephen > > ================== > > Stephen A. Smith > > Postdoctoral Researcher > > NESCent: National Evolutionary Synthesis Center > > page: http://blackrim.org > > blog: http://blackrim.net/semaphoront > > sasmith at nescent.org > > > > > > > > On Jun 24, 2009, at 12:47 AM, Nick Matzke wrote: > > > >> OK, here's the latest... > >> > >> New functions: a bunch of stuff dealing with phylogenetic trees, making > >> use of the tree/node class in Stephen Smith's lagrange (GNU public > >> license), which was superior to the half-baked (and not GPL) tree/node > >> class I was using before GSoC started. > >> > >> ============= > >> read_ultrametric_Newick(newickstr): > >> Read a Newick file into a tree object (a series of node objects linked to > >> parent and daughter nodes), also reading node ages and node labels if > >> any.
> >> > >> list_leaves(phylo_obj): > >> Print out all of the leaves in above a node object > >> > >> treelength(node): > >> Gets the total branchlength above a given node by recursively adding > >> through tree. > >> > >> phylodistance(node1, node2): > >> Get the phylogenetic distance (branch length) between two nodes. > >> > >> get_distance_matrix(phylo_obj): > >> Get a matrix of all of the pairwise distances between the tips of a tree. > >> > >> get_mrca_array(phylo_obj): > >> Get a square list of lists (array) listing the mrca of each pair of > >> leaves (half-diagonal matrix) > >> > >> subset_tree(phylo_obj, list_to_keep): > >> Given a list of tips and a tree, remove all other tips and resulting > >> redundant nodes to produce a new smaller tree. > >> > >> prune_single_desc_nodes(node): > >> Follow a tree from the bottom up, pruning any nodes with only one > >> descendent > >> > >> find_new_root(node): > >> Search up tree from root and make new root at first divergence > >> > >> make_None_list_array(xdim, ydim): > >> Make a list of lists ("array") with the specified dimensions > >> > >> get_PD_to_mrca(node, mrca, PD): > >> Add up the phylogenetic distance from a node to the specified ancestor > >> (mrca). Find mrca with find_1st_match. > >> > >> find_1st_match(list1, list2): > >> Find the first match in two ordered lists. > >> > >> get_ancestors_list(node, anc_list): > >> Get the list of ancestors of a given node > >> > >> addup_PD(node, PD): > >> Adds the branchlength of the current node to the total PD measure. > >> > >> print_tree_outline_format(phylo_obj): > >> Prints the tree out in "outline" format (daughter clades are indented, > >> etc.) > >> > >> print_Node(node, rank): > >> Prints the node in question, and recursively all daughter nodes, > >> maintaining rank as it goes. > >> > >> lagrange_disclaimer(): > >> Just prints lagrange citation etc. in code using lagrange libraries. 
> >> ============= > >> > >> > >> > >> What's next: > >> > >> I'm going to spend the rest of this week following up on Brad's > >> suggestions to make the code more standard, with the priority of > >> figuring out how I can revise the current BioPython phylogeny class, to > >> resemble the better version in lagrange, so that there is a generic > >> flexible phylogeny/newick parser that can be used generally as well as > >> by my BioGeography package specifically. > >> > >> updated wiki/git: > >> http://biopython.org/wiki/BioGeography#June.2C_week_3:_Functions_to_read_user-specified_Newick_files_.28with_ages_and_internal_node_labels.29_and_generate_basic_summary_information. > >> > >> http://github.com/nmatzke/biopython/commits/Geography > >> > >> Cheers! > >> Nick > >> > >> > >> > >> > >> > >> Nick Matzke wrote: > >>> Sorry my update is slow, it is coming in a bit! Thanks, Nick > >>> > >>> Brad Chapman wrote: > >>>> Nick; > >>>> Thanks for the update -- hope y'all are having fun at the Evolution > >>>> meeting and have managed to meet up. > >>>> > >>>>> Basically this week I added functions to download & parse large > >>>>> numbers of records, get TaxonOccurrence gbifKeys, and search with > >>>>> those keys. Main functions: > >>>> > >>>> Good stuff. My main comment echoes a couple of things we discussed > >>>> earlier: > >>>> > >>>> - It is not clear to a user which functions are API functions to > >>>> call and which are used internally. Prefixing the internal > >>>> functions with underscores (_) and organizing these into classes > >>>> will help with this. > >>>> > >>>> - I still noticed some tempfile writing from what we discussed last > >>>> week. If you have problems using in memory file handles let us > >>>> know and we can discuss more. > >>>> > >>>> In general if your coding style is to get it out there and then > >>>> re-factor, that is cool. 
But please put some time into the > >>>> schedule for this so I know not to bug you before you've actually > >>>> had a chance to go through things a second time. Also, it's a good > >>>> idea to do this in segments as we go along. From experience, if you > >>>> build up too much code that needs rework it becomes more mentally > >>>> difficult to get into the rewriting. > >>>> > >>>>> An issue: > >>>>> > >>>>> Next week come functions to process phylogenetic trees. I have had > >>>>> issues with the current BioPython newick parser etc.; basically what > >>>>> exists appears to not accept node label information which is required > >>>>> to store e.g. branchlengths which are crucial for the sorts of things > >>>>> I have to do in the future. So unless there is a better suggestion I > >>>>> plan to upload modify & upload my own tree parsing/using functions. I > >>>>> am open to suggestions in this matter. > >>>> > >>>> We do not want to introduce duplicated code for Newick tree parsing in > >>>> Biopython. This is a good opportunity to engage the development list > >>>> to help figure out how to fix the current parser to do what you > >>>> need. If you are not sure how to get started, the best way is to get > >>>> together a small test file that demonstrates your problems, and post > >>>> it to the list. It would be more useful to everyone to have your > >>>> fixes in the main parser. > >>>> > >>>> Brad > >>>> > >>> > >> > >> -- > >> ==================================================== > >> Nicholas J. Matzke > >> Ph.D. Candidate, Graduate Student Researcher > >> Huelsenbeck Lab > >> Center for Theoretical Evolutionary Genomics > >> 4151 VLSB (Valley Life Sciences Building) > >> Department of Integrative Biology > >> University of California, Berkeley > >> > >> Lab websites: > >> http://ib.berkeley.edu/people/lab_detail.php?lab=54 > >> http://fisher.berkeley.edu/cteg/hlab.html > >> Dept. 
fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When people > thought the earth was spherical, they were wrong. But if you think that > thinking the earth is spherical is just as wrong as thinking the earth > is flat, then your view is wronger than both of them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics From biopython at maubp.freeserve.co.uk Sun Jul 5 08:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Jul 2009 09:48:04 +0100 Subject: [Biopython-dev] Biopython link on python.org wiki In-Reply-To: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com> References: <3f6baf360907041232o79881a0ehf22a3fd1225014f4@mail.gmail.com> Message-ID: <320fb6e00907050148h55e38152xd31d752515746e7e@mail.gmail.com> On Sat, Jul 4, 2009 at 8:32 PM, Eric Talevich wrote: > Hi, > > Is anyone on this list active on the python.org wiki? I noticed that the > "Scientific and Numeric" page, which gets a link on the front page of > python.org, did not mention Biopython. In a fit of enthusiasm I add a link > to biopython.org at the bottom, incorporating the existing pycluster item. > Would someone else more familiar with landscape of scientific Python > software like to review this and perhaps incorporate it more appropriately > into the page? > > http://wiki.python.org/moin/NumericAndScientific > > Thanks, > Eric Good idea - thanks. 
Peter

From biopython at maubp.freeserve.co.uk Sun Jul 5 08:52:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Jul 2009 09:52:38 +0100
Subject: [Biopython-dev] GSoC code+documentation review: PhyloXML for Biopython
In-Reply-To: <20090704193843.GA1206@kunkel>
References: <3f6baf360907021359u122d22ayf88bc62059f7f150@mail.gmail.com> <20090704193843.GA1206@kunkel>
Message-ID: <320fb6e00907050152o470ca5e3ja451c3ebc52f7c83@mail.gmail.com>

On Sat, Jul 4, 2009 at 8:38 PM, Brad Chapman wrote:
> I really like the idea of exploring interoperability with other
> Biopython tree representations and generalizing there. In addition to
> the Tree class in Bio.Nexus, the PyCogent tree representation looks
> generalized:
>
> http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/core/tree.py?view=markup

There is also Thomas Mailund's Newick Tree module, which provides
yet another perspective on trees, and various things you can do with
them (his visitor stuff is cool once you figure it out). If you haven't
looked at this, it might be worth a play as well for ideas. I've actually
used this more than Bio.Nexus as it predates it ;)

http://www.daimi.au.dk/~mailund/newick.html

Peter

From bugzilla-daemon at portal.open-bio.org Sun Jul 5 10:20:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 5 Jul 2009 06:20:58 -0400
Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL
In-Reply-To:
Message-ID: <200907051020.n65AKwn4020321@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2866

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2870

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sun Jul 5 11:10:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 5 Jul 2009 07:10:02 -0400
Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL
In-Reply-To:
Message-ID: <200907051110.n65BA2qv021842@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2866

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
         OS/Version|FreeBSD                     |All

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-05 07:10 EST -------
Three points...

Which existing schema did you start from to generate this one, and how did
you do it? This may be interesting for Hilmar if there are any subtle
differences between the existing schemas.

---------------------------------------------------------------

This new line in BioSeq.py isn't valid on Python 2.4,

    val = [(str(x) if isinstance(x, unicode) else x) for x in val]

See http://www.python.org/dev/peps/pep-0308/

As a quick hack, I used:

    val = [_make_unicode_into_string(x) for x in val]

where I had defined:

    def _make_unicode_into_string(text) :
        if isinstance(text, unicode):
            return str(text)
        else :
            return text

Not very elegant, but with that the BioSQL tests pass on my old desktop using
Python 2.4 and MySQL. This machine doesn't have the SQLite bindings installed.

---------------------------------------------------------------

In the long term, Tests/setup_BioSQL.py could automatically try to use SQLite
(if available) when the user hasn't overridden it with their own local
settings.

Peter

P.S. I filed BioSQL enhancement Bug 2870 for adding an SQLite schema to BioSQL
itself. And I marked this bug as an enhancement too.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From eric.talevich at gmail.com Mon Jul 6 15:06:47 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 6 Jul 2009 11:06:47 -0400
Subject: [Biopython-dev] GSoC Weekly Update 7: PhyloXML for Biopython
Message-ID: <3f6baf360907060806o5cbc3e4ew8bd614b0a5f811c2@mail.gmail.com>

Hi all,

Previously (June 29--July 3) I:
- Wrote serialization methods for each class, matching Parser
- Also profiled the writer
- Caught up on documentation -- http://www.biopython.org/wiki/PhyloXML

This week (July 6--10) I will:
- Address comments from last week's code/doc review
- Enable Pythonic syntax sugar (__getitem__, __contains__, override __str__)
- Unit tests for new code
- Identify more Biopython objects to reuse or export to (improve the
  SeqRecord conversion)
- Look specifically at interoperating with Nexus, Newick trees
- Fill out the midterm evaluation

Cheers,
Eric

http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML

From biopython at maubp.freeserve.co.uk Mon Jul 6 19:02:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Jul 2009 20:02:56 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
Message-ID: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>

Hi all,

There were many things I discussed with Biopython folks at BOSC 2009,
and one of these was a conversation with Brad about some of
Bio.Application - specifically the idea behind the ApplicationResult
object. We basically agreed this was superfluous and could be
deprecated. The only thing I've found useful in this object is the
return code (an integer) when using Bio.Application.generic_run
(which in itself seems a bit superfluous).

Now, declaring ApplicationResult obsolete for Biopython 1.51 (with a
deprecation in the following release) is fine except for the fact that
this object gets used in the function generic_run. So we'd have to
obsolete that too. [If anyone can see any other side effects of
deprecating Bio.Application.ApplicationResult please speak up]

Right now, generic_run waits for the sub-process to finish, and
returns a tuple of:
* An ApplicationResult object holding the return code (and a few other
  things which can also be found from the command line string object,
  like the expected output filenames).
* Standard output as a StringIO handle (could be memory hungry!)
* Standard error as a StringIO handle (could be memory hungry!)

Personally when running a sub-process I have either wanted the stdout
(and stderr) handles, OR the return code (and I don't care about
stdout and stderr). I can't think of a situation off hand where I
needed both. So for me, the Bio.Application.generic_run function isn't
very helpful.

In Python, there are several ways to run a tool, starting with
something very simple like os.system(...) which will run and block
until the task finished, returning the return code (with some provisos
on Windows). Next, there were a whole set of popen*() functions which
generally returned handles. These are now all obsolete with Python
2.6, and subprocess should be used instead.

If we want to deprecate Bio.Application.generic_run (in order to
deprecate Bio.Application.ApplicationResult), then do we need a
replacement? Or replacements? Possible helper functions that come to
mind are:

(a) Returns the return code (integer) only. This would basically be a
cross-platform version of os.system using the subprocess module
internally.

(b) Returns the return code (integer) plus the stdout and stderr
(which would have to be StringIO handles, with the data in memory).
This would be a direct replacement for the current
Bio.Application.generic_run function.

(c) Returns the stdout (and stderr) handles. This basically is
recreating a deprecated Python popen*() function, which seems silly.

However, I'm tempted to say Biopython shouldn't be duplicating basic
Python functionality, like wrapping the subprocess module in helper
functions for typical situations. Instead we should just document
using the currently recommended Python best practice (which I believe
to be to use the subprocess module). The downside is that using
subprocess is a bit tricky for novices.

Any thoughts?

Peter

From bartek at rezolwenta.eu.org Mon Jul 6 19:35:53 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 6 Jul 2009 21:35:53 +0200
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com>
Message-ID: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com>

On Mon, Jul 6, 2009 at 9:02 PM, Peter wrote:
> Hi all,
>
Hi,

> this object gets used in the function generic_run. So we'd have to
> obsolete that too. [If anyone can see any other side effects of
> deprecating Bio.Application.ApplicationResult please speak up]

I'm fine with deprecating ApplicationReslut.

> Personally when running a sub-process I have either wanted the stdout
> (and stderr) handles, OR the return code (and I don't have about
> stdout and stderr). I can't think of a situation off hand where I
> needed both. So for me, the Bio.Application.generic_run function isn't
> very helpful.
>
Well, I don't have too much experience with writing application wrappers,
but I can easily think of the scenario when I first check whether the program
returned the "right" error code and then if it's fine I would process
the stdout.

> If we want to deprecate Bio.Application.generic_run (in order to
> deprecate Bio.Application.ApplicationResult), then do we need a
> replacement? Or replacements?
> > (b) Returns the return code (integer) plus the stdout and stderr > (which would have to be StringIO handles, with the data in memory). > This would be a direct replacement for the current > Bio.Application.generic_run function. That sounds like a good replacement. > However, I'm tempted to say Biopython shouldn't be duplicating basic > Python functionality, like wrapping the subprocess module in helper > functions for typical situations. Instead we should just document > using the current recommend Python best practice (which I believe to > be use the subprocess module). The downside is that using subprocess > is a bit tricky for novices. > I don't have strong feelings about that, but my personal experience is that it helps to have some infrastructure which (even if providing somewhat superfluous API layer over the bare python libs), especially for people who may have limited experience with different platforms. I, for one, would find it useful if biopython provided a simple classes which allowed people to write cross-platform wrappers for command line tools. cheers Bartek From biopython at maubp.freeserve.co.uk Mon Jul 6 21:06:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 22:06:54 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> Message-ID: <320fb6e00907061406o2c5907e8q3c30676897728167@mail.gmail.com> On Mon, Jul 6, 2009 at 8:35 PM, Bartek Wilczynski wrote: > > I'm fine with deprecating ApplicationReslut. > >> Personally when running a sub-process I have either wanted the stdout >> (and stderr) handles, OR the return code (and I don't have about >> stdout and stderr). I can't think of a situation off hand where I >> needed both. 
So for me, the Bio.Application.generic_run function isn't >> very helpful. > > Well, I don't have too much experience with writing application wrappers, > but I can easily think of the scenario when I first check whether the program > returned the "right" error code and then if it's fine I would process > the stdout. True - but in practice I usually find it more productive to switch to the command line prompt and explore the failure there (rather than trying to diagnose things from within Python). I would be content for the script to tell me a command line failed with an error return code (and give me the command line string and the return code). >> If we want to deprecate Bio.Application.generic_run (in order to >> deprecate Bio.Application.ApplicationResult), then do we need a >> replacement? Or replacements? >> >> (b) Returns the return code (integer) plus the stdout and stderr >> (which would have to be StringIO handles, with the data in memory). >> This would be a direct replacement for the current >> Bio.Application.generic_run function. > > That sounds like a good replacement. Of the three examples I put forward, (b) certainly seemed most useful. Any other ideas? >> However, I'm tempted to say Biopython shouldn't be duplicating basic >> Python functionality, like wrapping the subprocess module in helper >> functions for typical situations. Instead we should just document >> using the current recommend Python best practice (which I believe to >> be use the subprocess module). The downside is that using subprocess >> is a bit tricky for novices. >> > > I don't have strong feelings about that, but my personal experience is > that it helps to have some infrastructure which (even if providing > somewhat superfluous API layer over the bare python libs), especially > for people who may have limited experience with different platforms. 
> > I, for one, would find it useful if biopython provided a simple > classes which allowed people to write cross-platform wrappers > for command line tools. Do you feel option (b) above would fit that criteria? Peter From tiagoantao at gmail.com Mon Jul 6 21:34:01 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 6 Jul 2009 22:34:01 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> Message-ID: <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> On Mon, Jul 6, 2009 at 8:02 PM, Peter wrote: > Any thoughts? I am using generic_run (and the Bio.Application framework) for the new genepop code. But it would be trivial to change. The only thing that I need is the return code (not even stdout). The only thing that I need is to be informed of the new "best practice" that replaces generic_run and I will act accordingly If you are interested my use case is on: http://github.com/tiagoantao/biopython/blob/e1720bd4419ae5cf60ae5e1c7ec72828c6f6e6fe/Bio/PopGen/GenePop/Controller.py (_run_genepop and class _GenePopCommandline) Regards From biopython at maubp.freeserve.co.uk Mon Jul 6 21:51:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 22:51:37 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> Message-ID: <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> 2009/7/6 Tiago Ant?o : > > On Mon, Jul 6, 2009 at 8:02 PM, Peter wrote: >> Any thoughts? > > I am using generic_run (and the Bio.Application framework) for the new > genepop code. But it would be trivial to change. 
> > The only thing that I need is the return code (not even stdout). > > The only thing that I need is to be informed of the new "best > practice" that replaces generic_run and I will act accordingly You wouldn't have to rush anything - I was only thinking to declare it obsolete for 1.51 (with any replacement in place). The point of this discussion is to agree the "best practice". It sounds like this will be telling people to use subprocess for full control, but we may continue to provide one or two helper functions for very common usecases. Peter From chapmanb at 50mail.com Mon Jul 6 22:04:53 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 6 Jul 2009 18:04:53 -0400 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> Message-ID: <20090706220453.GI17086@sobchak.mgh.harvard.edu> Hi all; > > this object gets used in the function generic_run. So we'd have to > > obsolete that too. [If anyone can see any other side effects of > > deprecating Bio.Application.ApplicationResult please speak up] > > I'm fine with deprecating ApplicationReslut. Bartek, you just won the typo of the month contest hands down. > > If we want to deprecate Bio.Application.generic_run (in order to > > deprecate Bio.Application.ApplicationResult), then do we need a > > replacement? Or replacements? [...] > > However, I'm tempted to say Biopython shouldn't be duplicating basic > > Python functionality, like wrapping the subprocess module in helper > > functions for typical situations. Instead we should just document > > using the current recommend Python best practice (which I believe to > > be use the subprocess module). The downside is that using subprocess > > is a bit tricky for novices. 
My vote is to document using subprocess and avoid creating our own wrapper. No one has to learn a Biopython specific API for running programs, and subprocess provides plenty of flexibility to get stdout, stderr and return codes. For places where we feel like using subprocess is tricky, additional documentation within Biopython should help those encountering it for the first time. This gives us more time to work on biology problems, and leaves the running programs problems up to the greater Python community. Brad From bugzilla-daemon at portal.open-bio.org Mon Jul 6 22:55:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 6 Jul 2009 18:55:52 -0400 Subject: [Biopython-dev] [Bug 2872] New: Genbank parser breaks on VectorNTI generated genbank file Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2872 Summary: Genbank parser breaks on VectorNTI generated genbank file Product: Biopython Version: 1.51b Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tsham at lbl.gov The GenBank parser dies while parsing VectorNTI generated genbank files. VectorNTI *sometimes* generates a file with no date string at position 65, which causes this. It is true that this is a non-standard genbank file, but since VectorNTI is a commonly used program, it would be nice for BioPython to handle this case. 

Sample session:

>>> import Bio
>>> Bio.__version__
'1.51b'
>>> fh = open("pBbA1a-RFP.gb")
>>> from Bio.GenBank import RecordParser
>>> rp = RecordParser()
>>> result = rp.parse(fh)
Traceback (most recent call last):
  File "", line 1, in
  File "Bio/GenBank/__init__.py", line 172, in parse
    self._scanner.feed(handle, self._consumer)
  File "Bio/GenBank/Scanner.py", line 370, in feed
    self._feed_first_line(consumer, self.line)
  File "Bio/GenBank/Scanner.py", line 820, in _feed_first_line
    'LOCUS line does not contain - at position 65 in date:\n' + line
AssertionError: LOCUS line does not contain - at position 65 in date:
LOCUS pBbA1a-RFP 4252 bp DNA circular
>>>

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jul 6 22:56:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 6 Jul 2009 18:56:56 -0400
Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file
In-Reply-To:
Message-ID: <200907062256.n66MuuBH002457@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2872

------- Comment #1 from tsham at lbl.gov 2009-07-06 18:56 EST -------
Created an attachment (id=1338)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1338&action=view)
test case file, vectorNTI generated genbank file

Here is a sample file that breaks the parser.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
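
[Editorial sketch, not part of the original report.] One way to picture the
tolerant behaviour asked for in Bug 2872 is a check that looks for a trailing
DD-MMM-YYYY date on the LOCUS line and warns when it is absent, rather than
asserting on a fixed column. The helper name `locus_date` and the regular
expression are hypothetical; this is not the Bio.GenBank.Scanner code and
makes no claim about how the bug was eventually fixed.

```python
import re
import warnings

# Hypothetical pattern: a date such as 21-JUN-1999 at the end of the line.
_DATE_RE = re.compile(r"(\d{2}-[A-Z]{3}-\d{4})\s*$")


def locus_date(line):
    """Return the trailing date from a LOCUS line, or None if it is missing."""
    match = _DATE_RE.search(line)
    if match is None:
        # Warn and carry on instead of raising AssertionError.
        warnings.warn("LOCUS line has no date: %r" % line)
        return None
    return match.group(1)
```

Called on the VectorNTI line from the traceback above, this would return None
and emit a warning, while a LOCUS line that ends in a date returns the date
string.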

From eric.talevich at gmail.com Mon Jul 6 23:34:30 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 6 Jul 2009 19:34:30 -0400
Subject: [Biopython-dev] PhyloXML helper functions
Message-ID: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com>

Hey all,

I've been mulling a couple of methods for PhyloXML objects that I thought
could deserve some discussion.

1. Singular properties for some plural attributes

This goes back to the "confidences" issue: When I'm drilling down through a
phyloXML-derived tree, I keep expecting certain attributes to be singular
values when they're actually plural. Auto-completion catches it, of course,
but the resulting code would seem more obvious if I used the singular name
when I know the attribute consists of a list of one element.

The attributes I had in mind for this are taxonomies (Clade class) and
confidences (Clade and Phylogeny classes). Should any other attributes get
this treatment?

Here's an example getter method -- Rubyists may ignore the first line:

    @property
    def confidence(self):
        if len(self.confidences) > 1:
            raise RuntimeError, "More than one confidence item is available! Use foo.confidences"
        elif len(self.confidences) == 0:
            raise RuntimeError, "No confidence item is available! You fail"
        else:
            return self.confidences[0]

Then this works as expected, similar to the way certain IO read() functions
work elsewhere in Biopython.

2. A find() method on Clade and maybe Phylogeny objects

The function definition and docstring would look like this:

    def find(cls, **kwargs):
        """Find all sub-nodes matching the given attributes.

        The first argument specifies the class of the sub-node. (Use
        Tree.PhyloElement to match any standard phyloXML type.) The
        arbitrary keyword arguments indicate the attribute name of the
        sub-node and the value to match.

        The result is an iterable through all matching objects.

        Example:

        >>> tree = PhyloXML.read('phyloxml_examples.xml').phylogenies[5]
        >>> matches = tree.clade.find(Taxonomy, code='OCTVU')
        >>> matches.next()
        Taxonomy(code='OCTVU', scientific_name='Octopus vulgaris')
        """

Enhancements:
- The keyword argument could be a regular expression. Would that be useful?
  To handle numbers, I'd have to convert every sub-node attribute value to a
  string, and that would be weird -- or else find() would have to skip
  numerical attributes.
- Non-keyword arguments (*args) could specify just the not-None existence of
  an attribute. Allowing regexes would make this unnecessary (e.g. name='.*')
- If no regular arguments are needed, cls could default to PhyloElement or
  even "object" to match everything.
- To enable arbitrary hairiness, this function could accept a function as the
  value of the keyword argument and return anything truthy. But at that
  point, the user could probably just roll their own find_node() function.
  However, it could still be useful to filter for numerical values.

What do you think?

Thanks,
Eric

From tiagoantao at gmail.com Tue Jul 7 07:55:58 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 7 Jul 2009 08:55:58 +0100
Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen
In-Reply-To: <4A52985C.3000603@student.otago.ac.nz>
References: <4A52985C.3000603@student.otago.ac.nz>
Message-ID: <6d941f120907070055m2d34fcb1qe8b29e40d8d67880@mail.gmail.com>

Hi David,

[I am Ccing the biopython-dev mailing list, so that other biopython
dev people can chip in]

2009/7/7 David WInter :
> Is there any plan to support arlequin in Bio.Popgen? The script that I have

Bio.PopGen currently supports Simcoal, so it should already support
Arlequin (as Simcoal outputs arlequin). Unfortunately I never got
round to make an Arlequin parser (which makes full sense, for a lot of
reasons).

> to have a go at getting it to work in that framework.

That would be more than welcome.
I have personally an interest on getting it up and running. Arlequin format support is an important thing. If you have little time, I can offer to help. If you prefer to go ahead alone you are also more than welcome to do it. Just dont do the same mistake that I did with the genepop parser: where I load the whole file into memory. I have discovered that there are a lot of people that have thousands of markers and thousands of individuals (loading such a file into memory is in some cases impossible). Using an iterator might be a solution. One might try to go to the Arlequin developers and ask for a specification of the format (as far as I know there is no specification in public). Code on biopython has to have documentation and unit tests (a boring thing, but necessary). In this case, I would not mind doing that myself (in case you are uninterested) as I think Arlequin support is really a cool thing. I will sort out the git links, thanks for the info. BTW if you are doing any kind of frequency based statistics, we are adding support for genepop statistics (mainly a python wrapper to the application). You can now get things like Fst, Fis and the likes from inside python. Feel free to write back with any comments you might have. Tiago From bartek at rezolwenta.eu.org Tue Jul 7 08:20:49 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 7 Jul 2009 10:20:49 +0200 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <20090706220453.GI17086@sobchak.mgh.harvard.edu> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> Message-ID: <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> On Tue, Jul 7, 2009 at 12:04 AM, Brad Chapman wrote: >> I'm fine with deprecating ApplicationReslut. > > Bartek, you just won the typo of the month contest hands down. 
> Well, It's always motivating to see that people actually read your posts carefully ;) > My vote is to document using subprocess and avoid creating our own > wrapper. No one has to learn a Biopython specific API for running > programs, and subprocess provides plenty of flexibility to get stdout, > stderr and return codes. For places where we feel like using subprocess > is tricky, additional documentation within Biopython should help those > encountering it for the first time. This gives us more time to work > on biology problems, and leaves the running programs problems up to > the greater Python community. well, having such a documentation would be a great thing. I've just gone through the docs for subprocess module and it seems to be the layer unifying all those crazy different ways of spawning processes. It's a shame I somehow missed that it's there since python 2.4... So now, after doing my homework and checking what has been going on in python since 2004, I think that Brad's idea is better. We have dropped support for 2.3, so we can try to move from Application.generic_run to subprocess.Popen instead of trying to provide our own wrapper. We just need good docs. cheers Bartek From tiagoantao at gmail.com Tue Jul 7 08:33:49 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 7 Jul 2009 09:33:49 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> Message-ID: <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com> 2009/7/6 Peter : > The point of this discussion is to agree the "best practice". 
It
> sounds like this will be telling people to use subprocess for full
> control, but we may continue to provide one or two helper
> functions for very common usecases.

I tried to use Bio.Application and there was one part (maybe I am
using it wrongly) that was kind of awkward: parameters (I've added my
code below).
The need to declare them explicitly plus the fact that in some cases
parameters are always compulsory and really not parameters (granted a
strange use case, but I have a fixed parameter for genepop, namely
saying that the run is machine-controlled, batch mode).

At the end of the day, I end up with a lot of boilerplate code (like below).

    _Argument(["command"],
              ["INTEGER(.INTEGER)*"],
              None,
              True,
              "GenePop option to be called"),
    _Argument(["mode"],
              ["Dont touch this"],
              None,
              True,
              "Should allways be batch"),
    _Argument(["input"],
              ["input"],
              None,
              True,
              "Input file"),
    _Argument(["Dememorization"],
              ["input"],
              None,
              False,
              "Dememorization step"),
    _Argument(["BatchNumber"],
              ["input"],
              None,
              False,
              "Number of MCMC batches"),
    _Argument(["BatchLength"],
              ["input"],
              None,
              False,
              "Length of MCMC chains"),
    _Argument(["HWtests"],
              ["input"],
              None,
              False,
              "Enumeration or MCMC"),

--
"A man who dares to waste one hour of time has not discovered the value of life"
- Charles Darwin

From biopython at maubp.freeserve.co.uk Tue Jul 7 09:19:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 7 Jul 2009 10:19:34 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <6d941f120907061434w105480e9y4b754c46acce27ff@mail.gmail.com> <320fb6e00907061451s1908d8cfw335f4f11319d14bb@mail.gmail.com> <6d941f120907070133h6d4d9283v3282bc227dd12033@mail.gmail.com>
Message-ID: <320fb6e00907070219y15b019b8x30607e33137edf2e@mail.gmail.com>

2009/7/7 Tiago Antão :
> 2009/7/6 Peter :
>> The point of this discussion is to agree the "best practice". It
>> sounds like this will be telling people to use subprocess for full
>> control, but we may continue to provide one or two helper
>> functions for very common usecases.
>
> I tried to use Bio.Application and there was one part (maybe I am
> using it wrongly) that was kind of awkward: parameters (Ive
> added my code below).
> The need to declare them explicitly plus the fact that in some cases
> parameters are always compulsory and really not parameters
> (granted a strange use case, but I have a fixed parameter for
> genepop, namely saying that the run is machine-controlled, batch
> mode).

If you have a fixed parameter, like "-mode batch" which must be
present, it doesn't make sense to expose the mode setting to the
user. Maybe you could do this by subclassing the __str__ method?

> At the end of the day, I end up with a lot of biolerplate code (like below).

The nature of the command line wrappers is there will be lots of
boilerplate. On the bright side, once we get rid of ApplicationResult,
we can probably get rid of the "input"/"output" thing too.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 7 09:41:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 7 Jul 2009 10:41:10 +0100
Subject: [Biopython-dev] ApplicationResult and generic_run obsolete?
In-Reply-To: <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com>
Message-ID: <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com>

On Tue, Jul 7, 2009 at 9:20 AM, Bartek Wilczynski wrote:
> On Tue, Jul 7, 2009 at 12:04 AM, Brad Chapman wrote:
>
>>> I'm fine with deprecating ApplicationReslut.
>>
>> Bartek, you just won the typo of the month contest hands down.
>>
> Well, It's always motivating to see that people actually read your
> posts carefully ;)

I read it, but just chuckled to myself ;)

Brad wrote:
>> My vote is to document using subprocess and avoid creating our own
>> wrapper. No one has to learn a Biopython specific API for running
>> programs, and subprocess provides plenty of flexibility to get stdout,
>> stderr and return codes. For places where we feel like using subprocess
>> is tricky, additional documentation within Biopython should help those
>> encountering it for the first time. This gives us more time to work
>> on biology problems, and leaves the running programs problems up to
>> the greater Python community.

Exactly. I'm sure there will still be questions on the mailing list
from people about using subprocess, but if our documentation is done
well enough this shouldn't be too much of a burden.

Bartek wrote:
> well, having such a documentation would be a great thing. I've just gone
> through the docs for subprocess module and it seems to be the layer unifying
> all those crazy different ways of spawning processes. It's a shame I somehow
> missed that it's there since python 2.4... So now, after doing my homework and
> checking what has been going on in python since 2004, I think that Brad's idea
> is better.
We have dropped support for 2.3, so we can try to move from > Application.generic_run to subprocess.Popen instead of trying to > provide our own wrapper. We just need good docs. That seems unanimous so far: Deprecate Bio.Application.generic_run, and document using subprocess instead. Good :) Are you all happy with just marking Bio.Application.generic_run and Bio.Application.ApplicationResult as obsolete for Biopython 1.51, with the deprecation warning added in Biopython 1.52? We'll need to update the Tutorial too - which reminds me, could someone go over the "Alignment Tools" bit (currently section 6.3) to see if I've pitched this at about the right level? On re-reading it just now I found and fixed several typos. Peter From bugzilla-daemon at portal.open-bio.org Tue Jul 7 09:43:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 05:43:33 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907070943.n679hXi0025601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 05:43 EST ------- Yes, I would agree that we should be able to cope with a missing date (perhaps with a warning). Can we include this file in Biopython as a unit test? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
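[Editorial sketch, not part of the original thread.] For readers wondering what "document using subprocess instead" from the generic_run discussion above might look like in practice, here is a minimal sketch. The helper name run_tool is hypothetical (it is not a Biopython function), and the child command is just the running Python interpreter standing in for a real alignment tool; it only shows that subprocess.Popen already gives you the return code, stdout and stderr that ApplicationResult was wrapping.

```python
import subprocess
import sys


def run_tool(args):
    """Run a command (a list of arguments) and return (returncode, stdout, stderr).

    Hypothetical helper for illustration only; not part of Biopython.
    """
    child = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # get text with normalised line endings
    )
    # communicate() reads both pipes to the end and waits for the child,
    # which avoids the deadlocks you can hit reading the pipes by hand.
    out, err = child.communicate()
    return child.returncode, out, err


# Stand-in for a real command line tool: ask the interpreter to print a word.
returncode, out, err = run_tool([sys.executable, "-c", "print('hello')"])
```

Passing the arguments as a list (rather than one shell string) sidesteps quoting problems with file names containing spaces, which is probably the main thing the documentation would need to spell out.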
From bugzilla-daemon at portal.open-bio.org Tue Jul 7 10:16:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 06:16:55 -0400 Subject: [Biopython-dev] [Bug 2873] New: import warnings.warn instead of warnings causes code to fail Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2873 Summary: import warnings.warn instead of warnings causes code to fail Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: trivial Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mlrodrigues at igc.gulbenkian.pt As of commit: http://github.com/biopython/biopython/commit/f2b2125dbbf57b1b1ac5a0259918acfc4e63abbe#diff-3 On github, line 39 (from warnings import warn) was inserted, but within the function the call is always written as warnings.warn() and not warn(). Changing line 39 to 'import warnings' solves the problem. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 10:44:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 06:44:10 -0400 Subject: [Biopython-dev] [Bug 2873] import warnings.warn instead of warnings causes code to fail In-Reply-To: Message-ID: <200907071044.n67AiA7X027715@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2873 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 06:44 EST ------- Thanks - import statement in Bio/PDB/PDBList.py now fixed in CVS, will be on github shortly.
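[Editorial sketch, not part of the original thread.] The failure mode in bug 2873 is easy to reproduce in isolation. The snippet below is only an illustration under stated assumptions: the function name notify and the two source strings are made up, and each string is executed in its own namespace to mimic a separate module. With "from warnings import warn" only the name warn is bound, so a later warnings.warn() call raises NameError; with "import warnings" the same call works.

```python
import warnings

# Mimics the reported problem: only 'warn' is imported, but the code
# still calls warnings.warn().
BROKEN = """
from warnings import warn

def notify():
    warnings.warn("obsolete file")
"""

# Mimics the fix suggested in the report: import the module itself.
FIXED = """
import warnings

def notify():
    warnings.warn("obsolete file")
"""


def load(source):
    """Execute source in a fresh namespace (a stand-in for a module)."""
    namespace = {"__name__": "demo_module"}
    exec(source, namespace)
    return namespace["notify"]


try:
    load(BROKEN)()
    broken_error = None
except NameError as exc:
    # The name 'warnings' was never bound in the broken "module".
    broken_error = exc

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    load(FIXED)()  # issues a UserWarning instead of failing
```

Note that the error only shows up when the warning branch is actually executed, which is presumably why it slipped through until someone hit that code path.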
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Tue Jul 7 12:51:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Jul 2009 08:51:52 -0400 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> Message-ID: <20090707125152.GL17086@sobchak.mgh.harvard.edu> Hi Eric; > I've been mulling a couple of methods for PhyloXML objects that I thought > could deserve some discussion. > > 1. Singular properties for some plural attributes > > This goes back to the "confidences" issue: When I'm drilling down through a > phyloXML-derived tree, I keep expecting certain attributes to be singular > values when they're actually plural. Auto-completion catches it, of course, > but the resulting code would seem more obvious if I used the singular name > when I know the attribute consists of a list of one element. I like the idea and implementation for cases where you can have multiple items, but have one most of the time. Very nice. > 2. A find() method on Clade and maybe Phylogeny objects [...] > Enhancements: > - The keyword argument could be a regular expression. Would that be useful? This seems useful. Often people use crazy naming convention hacks, and might want to pull out something like all proteins from a particular organism based on a common prefix in the name. > To handle numbers, I'd have to convert every sub-node attribute value to a > string, and that would be weird -- or else find() would have to skip > numerical attributes. Is this if you support regular expressions or either way? For the find, I think it's sufficient to define what you support and leave it at that set: any subset of searching will help people get their work done. 
> - If no regular arguments are needed, cls could default to PhyloElement or > even "object" to match everything. I like the object default here. This fits with a simple use case of: find everything that matches this string of interest. > - To enable arbitrary hairiness, this function could accept a function as > the value of the keyword argument and return anything truthy. But at that > point, the user could probably just roll their own find_node() function. > However, it could still be useful to filter for numerical values. This is probably more than you need. For complicated cases I'd assume people are sophisticated enough to roll their own. Nice ideas, Brad From chapmanb at 50mail.com Tue Jul 7 13:02:48 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 7 Jul 2009 09:02:48 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> Message-ID: <20090707130248.GM17086@sobchak.mgh.harvard.edu> Hi Stephen; > In reference to the lagrange code, I see the concern with the > licensing. I think that this could be corrected however with a simple > rewrite when conforming to the BioPython standards. We can require lagrange to be installed and use imports to grab the needed code. The other option is that y'all can explicitly relicense a subset of the code under the Biopython license. > I can see however > where the Bio.Nexus functionality might not be sufficient for tree > manipulation. 
I am not a contributor to the BioPython dev group so I > cannot speak to those specifics, but as a user I can see separating > out the tree functions from the Nexus package (and tree I/O in > general) as logically a phylogenetic tree structure has little to do > with the nexus file format. It can be somewhat awkward to deal with in > the current form. A more general implementation might be a Bio.Tree > package with I/O readers in Nexus and Newick and XML, etc. Definitely. Eric has been discussing this with regards to the PhyloXML project and we had been looking at other Tree representations: in PyCogent and Thomas Mailund's Newick module. Considering the lagrange tree model makes a lot of sense as well. What I'd like to see is a stab at a generalized Tree object that supports the operations you need and that the Bio.Nexus parser can produce, exactly as you describe. Eric and Nick, what do you think about coordinating on this? > Just a thought and I am happy to work on the tree code in whatever > capacity it would be helpful to Nick. Awesome. We're very open to generalizing the Tree representation in Biopython. What I'm trying to avoid is having multiple Nexus/Newick parsers; this is confusing to users and too much duplicated effort. It sounds like we're on the same page in coming together on something that will work for everyone. Brad > Take care, > Stephen > ================== > Stephen A. Smith > Postdoctoral Researcher > NESCent: National Evolutionary Synthesis Center > page: http://blackrim.org > blog: http://blackrim.net/semaphoront > sasmith at nescent.org > > > > On Jul 4, 2009, at 4:11 PM, Brad Chapman wrote: > > > Hi Nick; > > Thanks much for the update. I'm cc'ing in the Biopython dev list to > > keep everyone there in the loop as well. > > > >> I have worked out a number of better functions for searching xml > >> database results, i.e. finding all elements with tags y that exist > >> somewhere inside elements with tags x. 
This is much more flexible in > >> the event that data of interest resides at different levels of a > >> hierarchy, which I have found in some cases. > > > > Awesome. Echoing what Hilmar mentioned, it would be good to step back > > at this point and talk about integration with Biopython. A couple > > of thoughts and suggestions along those lines: > > > > - You've included code from Lagrange which worries me for two > > reasons. First, this overlaps with existing Biopython functionality > > in Bio.Nexus; we want to eliminate that as it's confusing for > > users of the package to find different non-compatible > > implementations. If the existing code doesn't work for you in some > > way, could you flesh out those issues on the Biopython dev list so we > > can work to resolve them? Secondly, lagrange is licensed under the > > GPL so practically it is not compatible with Biopython, which is > > licensed much more freely. > > > > - You've settled on a flat system of coding with functions and no > > nesting inside of classes. This makes it difficult to tell the > > public API apart from internal functions. We could help make this more > > clear in a couple of ways: > > > > - Organizing related functionality into classes. > > - Prefixing internal functions with underscores to indicate they > > are not meant to be called by users. > > - Starting to provide some user documentation, ideally centered > > around use cases. Often these help provide a way to think about > > the usability of the code and hint at ways to improve it. > > > > Hope this is helpful and I'm happy to offer more specific > > suggestions as you dig into it. Have a great 4th of July weekend, > > > > Brad > > > > > >> Stephen Smith wrote: > >>> These look really great. Glad the lagrange tree code is working > >>> out. I > >>> am very excited for the merging of the Biopython and the lagrange > >>> tree > >>> classes. More details to come. > >>> Stephen > >>> ================== > >>> Stephen A.
Smith > >>> Postdoctoral Researcher > >>> NESCent: National Evolutionary Synthesis Center > >>> page: http://blackrim.org > >>> blog: http://blackrim.net/semaphoront > >>> sasmith at nescent.org > >>> > >>> > >>> > >>> On Jun 24, 2009, at 12:47 AM, Nick Matzke wrote: > >>> > >>>> OK, here's the latest... > >>>> > >>>> New functions: a bunch of stuff dealing with phylogenetic trees, > >>>> making > >>>> use of the tree/node class in Stephen Smith's lagrange (GNU public > >>>> license), which was superior to the half-baked (and not GPL) tree/ > >>>> node > >>>> class I was using before GSoC started. > >>>> > >>>> ============= > >>>> read_ultrametric_Newick(newickstr): > >>>> Read a Newick file into a tree object (a series of node objects > >>>> links to > >>>> parent and daughter nodes), also reading node ages and node > >>>> labels if > >>>> any. > >>>> > >>>> list_leaves(phylo_obj): > >>>> Print out all of the leaves in above a node object > >>>> > >>>> treelength(node): > >>>> Gets the total branchlength above a given node by recursively > >>>> adding > >>>> through tree. > >>>> > >>>> phylodistance(node1, node2): > >>>> Get the phylogenetic distance (branch length) between two nodes. > >>>> > >>>> get_distance_matrix(phylo_obj): > >>>> Get a matrix of all of the pairwise distances between the tips of > >>>> a tree. > >>>> > >>>> get_mrca_array(phylo_obj): > >>>> Get a square list of lists (array) listing the mrca of each pair of > >>>> leaves (half-diagonal matrix) > >>>> > >>>> subset_tree(phylo_obj, list_to_keep): > >>>> Given a list of tips and a tree, remove all other tips and > >>>> resulting > >>>> redundant nodes to produce a new smaller tree. 
> >>>> > >>>> prune_single_desc_nodes(node): > >>>> Follow a tree from the bottom up, pruning any nodes with only one > >>>> descendent > >>>> > >>>> find_new_root(node): > >>>> Search up tree from root and make new root at first divergence > >>>> > >>>> make_None_list_array(xdim, ydim): > >>>> Make a list of lists ("array") with the specified dimensions > >>>> > >>>> get_PD_to_mrca(node, mrca, PD): > >>>> Add up the phylogenetic distance from a node to the specified > >>>> ancestor > >>>> (mrca). Find mrca with find_1st_match. > >>>> > >>>> find_1st_match(list1, list2): > >>>> Find the first match in two ordered lists. > >>>> > >>>> get_ancestors_list(node, anc_list): > >>>> Get the list of ancestors of a given node > >>>> > >>>> addup_PD(node, PD): > >>>> Adds the branchlength of the current node to the total PD measure. > >>>> > >>>> print_tree_outline_format(phylo_obj): > >>>> Prints the tree out in "outline" format (daughter clades are > >>>> indented, > >>>> etc.) > >>>> > >>>> print_Node(node, rank): > >>>> Prints the node in question, and recursively all daughter nodes, > >>>> maintaining rank as it goes. > >>>> > >>>> lagrange_disclaimer(): > >>>> Just prints lagrange citation etc. in code using lagrange > >>>> libraries. > >>>> ============= > >>>> > >>>> > >>>> > >>>> What's next: > >>>> > >>>> I'm going to spend the rest of this week following up on Brad's > >>>> suggestions to make the code more standard, with the priority of > >>>> figuring out how I can revise the current BioPython phylogeny > >>>> class, to > >>>> resemble the better version in lagrange, so that there is a generic > >>>> flexible phylogeny/newick parser that can be used generally as > >>>> well as > >>>> by my BioGeography package specifically. > >>>> > >>>> updated wiki/git: > >>>> http://biopython.org/wiki/BioGeography#June. > >>>> 2C_week_3:_Functions_to_read_user-specified_Newick_files_. > >>>> 28with_ages_and_internal_node_labels. 
> >>>> 29_and_generate_basic_summary_information. > >>>> > >>>> http://github.com/nmatzke/biopython/commits/Geography > >>>> > >>>> Cheers! > >>>> Nick > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> Nick Matzke wrote: > >>>>> Sorry my update is slow, it is coming in a bit! Thanks, Nick > >>>>> > >>>>> Brad Chapman wrote: > >>>>>> Nick; > >>>>>> Thanks for the update -- hope y'all are having fun at the > >>>>>> Evolution > >>>>>> meeting and have managed to meet up. > >>>>>> > >>>>>>> Basically this week I added functions to download & parse large > >>>>>>> numbers of records, get TaxonOccurrence gbifKeys, and search > >>>>>>> with > >>>>>>> those keys. Main functions: > >>>>>> > >>>>>> Good stuff. My main comment echoes a couple of things we > >>>>>> discussed > >>>>>> earlier: > >>>>>> > >>>>>> - It is not clear to a user which functions are API functions to > >>>>>> call and which are used internally. Prefixing the internal > >>>>>> functions with underscores (_) and organizing these into classes > >>>>>> will help with this. > >>>>>> > >>>>>> - I still noticed some tempfile writing from what we discussed > >>>>>> last > >>>>>> week. If you have problems using in memory file handles let us > >>>>>> know and we can discuss more. > >>>>>> > >>>>>> In general if your coding style is to get it out there and then > >>>>>> re-factor, that is cool. But please put some time into the > >>>>>> schedule for this so I know not to bug you before you've actually > >>>>>> had a chance to go through things a second time. Also, it's a > >>>>>> good > >>>>>> idea to do this in segments as we go along. From experience, if > >>>>>> you > >>>>>> build up too much code that needs rework it becomes more mentally > >>>>>> difficult to get into the rewriting. > >>>>>> > >>>>>>> An issue: > >>>>>>> > >>>>>>> Next week come functions to process phylogenetic trees. 
I > >>>>>>> have had > >>>>>>> issues with the current BioPython newick parser etc.; > >>>>>>> basically what > >>>>>>> exists appears to not accept node label information which is > >>>>>>> required > >>>>>>> to store e.g. branchlengths which are crucial for the sorts of > >>>>>>> things > >>>>>>> I have to do in the future. So unless there is a better > >>>>>>> suggestion I > >>>>>>> plan to upload modify & upload my own tree parsing/using > >>>>>>> functions. I > >>>>>>> am open to suggestions in this matter. > >>>>>> > >>>>>> We do not want to introduce duplicated code for Newick tree > >>>>>> parsing in > >>>>>> Biopython. This is a good opportunity to engage the development > >>>>>> list > >>>>>> to help figure out how to fix the current parser to do what you > >>>>>> need. If you are not sure how to get started, the best way is > >>>>>> to get > >>>>>> together a small test file that demonstrates your problems, and > >>>>>> post > >>>>>> it to the list. It would be more useful to everyone to have your > >>>>>> fixes in the main parser. > >>>>>> > >>>>>> Brad > >>>>>> > >>>>> > >>>> > >>>> -- > >>>> ==================================================== > >>>> Nicholas J. Matzke > >>>> Ph.D. Candidate, Graduate Student Researcher > >>>> Huelsenbeck Lab > >>>> Center for Theoretical Evolutionary Genomics > >>>> 4151 VLSB (Valley Life Sciences Building) > >>>> Department of Integrative Biology > >>>> University of California, Berkeley > >>>> > >>>> Lab websites: > >>>> http://ib.berkeley.edu/people/lab_detail.php?lab=54 > >>>> http://fisher.berkeley.edu/cteg/hlab.html > >>>> Dept. personal page: > >>>> http://ib.berkeley.edu/people/students/person_detail.php?person=370 > >>>> Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > >>>> Lab phone: 510-643-6299 > >>>> Dept. 
fax: 510-643-6264 > >> Cell phone: 510-301-0179 > >> Email: matzke at berkeley.edu > >> > >> Mailing address: > >> Department of Integrative Biology > >> 3060 VLSB #3140 > >> Berkeley, CA 94720-3140 > >> > >> ----------------------------------------------------- > >> "[W]hen people thought the earth was flat, they were wrong. When > >> people > >> thought the earth was spherical, they were wrong. But if you think > >> that > >> thinking the earth is spherical is just as wrong as thinking the > >> earth > >> is flat, then your view is wronger than both of them put together." > >> > >> Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical > >> Inquirer, > >> 14(1), 35-44. Fall 1989. > >> http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > >> ==================================================== > >> _______________________________________________ > >> Wg-phyloinformatics mailing list > >> Wg-phyloinformatics at nescent.org > >> https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > _______________________________________________ > > Wg-phyloinformatics mailing list > > Wg-phyloinformatics at nescent.org > > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > From bugzilla-daemon at portal.open-bio.org Tue Jul 7 13:10:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 09:10:10 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907071310.n67DAATG001005@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #5 from chapmanb at 50mail.com 2009-07-07 09:10 EST ------- It's derived from the MySQL schema. I'll mention that on the BioSQL bug when I upload the schema there. Good catch with Python2.4. 
Grrr old versions, I like those conditional expressions too much. I think test_BioSQL should default to the in-memory version of SQLite, so completely agreed. This is most likely to work out of the box on a default system. Do you want me to check this in with the 2.4 fix? Or should we wait until after 1.51? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 13:59:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 09:59:17 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200907071359.n67DxHOr002748@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 09:59 EST ------- (In reply to comment #5) > It's derived from the MySQL schema. I'll mention that on the BioSQL > bug when I upload the schema there. OK > Good catch with Python2.4. Grrr old versions, I like those conditional > expressions too much. I haven't really used them, some of "my" machines are still on Python 2.4, but can see the appeal - especially within a list or generator comprehension. > I think test_BioSQL should default to the in-memory version of SQLite, so > completely agreed. This is most likely to work out of the box on a default > system. Good. > Do you want me to check this in with the 2.4 fix? Or should we wait > until after 1.51? At least wait until 1.51 is out, and we've had some feedback from Hilmar. I would prefer to wait until the SQLite schema is at least in the BioSQL repository, and ideally publicly released. I had the impression from Hilmar at BOSC that BioSQL 1.0.2 could be out later this year, so this may not take that long. 
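[Editorial sketch, not part of the original thread.] Two points from the bug 2866 exchange above can be shown together: conditional expressions ("x if cond else y") only arrived in Python 2.5, so code that must still run on 2.4 needs a plain if/else statement, and an in-memory SQLite database needs no server or file, which is why it is attractive as a test default. This is a toy illustration, not the test_BioSQL code: the function name open_database and the table "demo" are made up, and the sqlite3 module used here is itself only in the standard library from Python 2.5 onwards.

```python
import sqlite3


def open_database(path=None):
    """Open the SQLite database at path, or an in-memory one if path is empty.

    Hypothetical helper for illustration only.
    """
    # On Python 2.5+ this could be the one-liner:
    #     target = path if path else ":memory:"
    # The statement form below also parses on Python 2.4.
    if path:
        target = path
    else:
        target = ":memory:"
    return sqlite3.connect(target)


# No path given, so everything lives in memory and vanishes on close.
conn = open_database()
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO demo (name) VALUES (?)", ("test",))
rows = conn.execute("SELECT name FROM demo").fetchall()
conn.close()
```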
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Tue Jul 7 14:25:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 7 Jul 2009 10:25:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090707130248.GM17086@sobchak.mgh.harvard.edu> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman wrote: > Hi Stephen; > > We can require lagrange to be installed and use imports to > grab the needed code. The other option is that y'all can explicitly > relicense a subset of the code under the Biopython license. > Trivia: it looks like lagrange in turn depends on scipy, but quickly glancing through the code, I only see numpy functions being used. Since some other Biopython modules already depend on numpy, could the installation of lagrange for Bio.Geography be made simpler by just changing the import to numpy? > I can see however > > where the Bio.Nexus functionality might not be sufficient for tree > > manipulation. I am not a contributor to the BioPython dev group so I > > cannot speak to those specifics, but as a user I can see separating > > out the tree functions from the Nexus package (and tree I/O in > > general) as logically a phylogenetic tree structure has little to do > > with the nexus file format. It can be somewhat awkward to deal with in > > the current form. 
A more general implementation might be a Bio.Tree > > package with I/O readers in Nexus and Newick and XML, etc. > > Definitely. Eric has been discussing this with regards to the > PhyloXML project and we had been looking at other Tree > representations: in PyCogent and Thomas Mailund's Newick module. > Considering the lagrange tree model makes a lot of sense as well. > What I'd like to see is a stab at a generalized Tree object that > supports the operations you need and that the Bio.Nexus parser can > produce, exactly as you describe. Eric and Nick, what do you think > about coordinating on this? > Sounds great to me. My impression is that most tree representations are based on a recursive Node element with a few associated attributes and a number of useful methods; phyloXML has a Clade object roughly corresponding to that, but also a bunch of other element types for extensive annotation of the tree. So two options spring to mind: 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed by any phylogenetic tree representation, ever. (It's already pretty close.) Refactor Nexus and Newick to use these objects; merge the features of lagrange so the rest of the Biopython environment can benefit. Only export to external object structures that are something other than a straight phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy for crunching. 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let that be the Biopython default representation. Add a function in Bio.PhyloXML to export its enhanced tree structure to this simpler Bio.Tree representation. I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but otherwise be independent of that specific file format. It doesn't depend on any XML library directly, and both child nodes and XML node attributes appear as plain ol' object attributes in the tree. 
But the Nexus module looked like the parser was kind of tied to the tree representation, so I haven't reused any of that code yet. So #1 is my preference, but it puts the burden of inter-module compatibility on whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet little island for the rest of the summer. All the best, Eric From biopython at maubp.freeserve.co.uk Tue Jul 7 14:56:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 15:56:01 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <320fb6e00907070756hcb15f96pff5694ac4552ef32@mail.gmail.com> On Tue, Jul 7, 2009 at 3:25 PM, Eric Talevich wrote: > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman wrote: >> Hi Stephen; >> >> We can require lagrange to be installed and use imports to >> grab the needed code. The other option is that y'all can explicitly >> relicense a subset of the code under the Biopython license. > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > glancing through the code, I only see numpy functions being used. > Since some other Biopython modules already depend on numpy, > could the installation of lagrange for Bio.Geography be made > simpler by just changing the import to numpy? That sounds like a good idea to follow up with the lagrange team (making lagrange depend on numpy but not scipy). I think Brad is right to be asking questions about the lagrange code and their license.
How much code do you actually use from lagrange, and can we either get those bits re-licensed (or reimplemented) to include directly into Biopython? This may not be realistic, in which case a dependency on lagrange may be the best bet... Adding external python library dependencies in Biopython is generally discouraged, *especially* anything required at build time as this makes installation much more complicated. As I recall, we've been able to cut these down to just numpy (needed for several modules, but we can install without it), plus optional dependencies like database drivers (e.g. for BioSQL) and ReportLab (only used in Bio.Graphics). Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 15:12:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 16:12:02 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Stephen [I think] wrote: >> > I can see however >> > where the Bio.Nexus functionality might not be sufficient for tree >> > manipulation. I am not a contributor to the BioPython dev group so I >> > cannot speak to those specifics, but as a user I can see separating >> > out the tree functions from the Nexus package (and tree I/O in >> > general) as logically a phylogenetic tree structure has little to do >> > with the nexus file format. It can be somewhat awkward to deal with in >> > the current form.
A more general implementation might be a Bio.Tree >> > package with I/O readers in Nexus and Newick and XML, etc. Brad wrote: >> Definitely. Eric has been discussing this with regards to the >> PhyloXML project and we had been looking at other Tree >> representations: in PyCogent and Thomas Mailund's Newick module. >> Considering the lagrange tree model makes a lot of sense as well. >> What I'd like to see is a stab at a generalized Tree object that >> supports the operations you need and that the Bio.Nexus parser can >> produce, exactly as you describe. Eric and Nick, what do you think >> about coordinating on this? Eric wrote: > Sounds great to me. I also agree. Bio.Nexus has some good stuff that is a bit hidden, and has wider application - some kind of Bio.Tree module sounds sensible (ideally with I/O for Nexus, XML, etc). We might even move the phyloXML specific stuff to live under Bio.Tree.PhyloXML. > My impression is that most tree representations are based on a recursive > Node element with a few associated attributes and a number of useful > methods; phyloXML has a Clade object roughly corresponding to that, > but also a bunch of other element types for extensive annotation of > the tree. So two options spring to mind: > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed by > any phylogenetic tree representation, ever. (It's already pretty close.) > Refactor Nexus and Newick to use these objects; merge the features of > lagrange so the rest of the Biopython environment can benefit. Only export > to external object structures that are something other than a straight > phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy for > crunching. > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > that be the Biopython default representation. Add a function in Bio.PhyloXML > to export its enhanced tree structure to this simpler Bio.Tree > representation.
I am unclear why you would need an entirely separate tree object structure (which then requires code to map between the two). Perhaps some specific examples of the "enhancements" would help?

How about this variation on (2):

Suppose Bio.Tree provided a simple tree object (holding a nested structure), with methods/functions for general operations like DFT, finding common ancestors, calculating branch lengths, collapsing internal nodes, etc. [and I would expect a lot of this could be borrowed from Bio.Nexus, and/or Thomas Mailund's Newick module]. Couldn't Bio.PhyloXML build on this using subclassed tree nodes?

Do we even need different objects? What if each node class had an optional Python dictionary for annotations? You could maybe key this off the PhyloXML names?

> I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but
> otherwise be independent of that specific file format. It doesn't depend on
> any XML library directly, and both child nodes and XML node attributes
> appear as plain ol' object attributes in the tree. But the Nexus module
> looked like the parser was kind of tied to the tree representation, so I
> haven't reused any of that code yet. So #1 is my preference, but it put the
> burden of inter-module compatibility on whoever is maintaining Bio.Nexus,
> whereas #2 leaves my code on a quiet little island for the rest of the
> summer.

We're going to need some input from the Bio.Nexus authors - Frank and Cymon (CC'd).
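
As a rough illustration of the variation on (2) suggested above, here is a minimal sketch of a simple node object with an optional annotations dictionary. It reads "DFT" as depth-first traversal, which is an assumption. The names Node, add_child, depth_first, ancestors and common_ancestor are hypothetical; none of this is existing Bio.Nexus or Bio.PhyloXML code.

```python
class Node(object):
    """Hypothetical simple tree node; not an existing Biopython class."""

    def __init__(self, name=None, branch_length=None, annotations=None):
        self.name = name
        self.branch_length = branch_length
        self.parent = None
        self.children = []
        # Optional free-form annotations, e.g. keyed by phyloXML element names
        self.annotations = annotations if annotations is not None else {}

    def add_child(self, child):
        """Attach a child node and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def depth_first(self):
        """Yield this node and all of its descendants in pre-order."""
        yield self
        for child in self.children:
            for node in child.depth_first():
                yield node

    def ancestors(self):
        """Return the path from this node up to the root, inclusive."""
        path = []
        node = self
        while node is not None:
            path.append(node)
            node = node.parent
        return path


def common_ancestor(a, b):
    """Return the most recent common ancestor of two nodes, or None."""
    seen = set(id(n) for n in a.ancestors())
    for node in b.ancestors():
        if id(node) in seen:
            return node
    return None


# Small example tree: root -> (inner -> (A, B), C)
root = Node("root")
inner = root.add_child(Node("inner", branch_length=0.5))
leaf_a = inner.add_child(Node("A", branch_length=0.1,
                              annotations={"taxonomy": "Homo sapiens"}))
leaf_b = inner.add_child(Node("B", branch_length=0.2))
leaf_c = root.add_child(Node("C", branch_length=0.7))
```

A format-specific module could then subclass Node, or simply fill in the annotations dictionary, without a second tree structure to map to and from.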
Peter From bugzilla-daemon at portal.open-bio.org Tue Jul 7 16:14:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:14:28 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071614.n67GESBG008148@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #3 from tsham at lbl.gov 2009-07-07 12:14 EST ------- Hi, The file is part of an unpublished work that is in preparation. I think it would be ok to include it in the unit test *after* it's been published, but not just yet. Or I could generate a test file that is similar to this file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 16:22:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 12:22:35 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071622.n67GMZoI008582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 12:22 EST ------- (In reply to comment #3) > Hi, > > The file is part of an unpublished work that is in preparation. I think it > would be ok to include it in the unit test *after* it's been published, but not > just yet. Or I could generate a test file that is similar to this file. > A realistic but similar file would be fine - thanks. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 7 16:38:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 7 Jul 2009 12:38:08 -0400
Subject: [Biopython-dev] [Bug 2874] New: invalid class on warning module
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2874

Summary: invalid class on warning module
Product: Biopython
Version: 1.51b
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: mlrodrigues at igc.gulbenkian.pt

/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py:151: UserWarning: retrieving index file. Takes about 5 MB.
  warnings.warn("retrieving index file. Takes about 5 MB.")
Traceback (most recent call last):
  File "get_pdb_structures.py", line 23, in
    get(pdblist, f, my_try)
  File "get_pdb_structures.py", line 16, in get
    x.download_entire_pdb(listfile=f)
  File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 288, in download_entire_pdb
    for pdb_code in entries: self.retrieve_pdb_file(pdb_code)
  File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 237, in retrieve_pdb_file
    RuntimeError)
  File "/usr/lib/python2.5/warnings.py", line 32, in warn
    assert issubclass(category, Warning)
AssertionError

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Jul 7 16:52:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 7 Jul 2009 12:52:12 -0400
Subject: [Biopython-dev] [Bug 2874] invalid class on warning module
In-Reply-To:
Message-ID: <200907071652.n67GqCKX009821@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2874

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 12:52 EST -------
Fixed, thanks!
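
For readers following the traceback above: the AssertionError comes from passing RuntimeError as the category argument to warnings.warn, and RuntimeError is not a subclass of Warning. The actual fix is not shown in this thread; the sketch below only assumes that RuntimeWarning would be an acceptable category. The function name and message text are placeholders, not the Bio.PDB code.

```python
import warnings


def retrieve_with_warning():
    # Placeholder for the failing call site. The category must be a
    # subclass of Warning: RuntimeWarning is, RuntimeError is not.
    warnings.warn("hypothetical message: retrieval failed", RuntimeWarning)


# Record warnings instead of printing them, so the category can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    retrieve_with_warning()
```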
Downloading the 5MB index file in the unit tests seems like a bad idea, but clearly we need more unit test coverage as this error of mine actually affected three files in Bio.PDB, Bio/PDB/Dice.py Bio/PDB/MMCIF2Dict.py Bio/PDB/PDBList.py If you have any suggestions for further unit tests, please let us know. Regards, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 7 17:08:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 13:08:50 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907071708.n67H8olB010408@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-07 13:08 EST ------- Hi Tim, This should be fixed in CVS (and will be on github soon), but I would still like to include an example in the unit tests. If you can also test this, that would be great. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From matzke at berkeley.edu Tue Jul 7 18:12:10 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 07 Jul 2009 11:12:10 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> Message-ID: <4A538FFA.4030708@berkeley.edu> Hi all, I am just now back in town and would love to co-coordinate on this. I agree having multiple newick parsers etc. is undesirable, I just found I was forced to that this spring when BioPython didn't have what I need even for pretty standard Newick files. I have also made use of Mailund's newick parser in the past. I am booked this afternoon but will go through the thread more this evening and comment further. Cheers! Nick Eric Talevich wrote: > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman > wrote: > > Hi Stephen; > > We can require lagrange to be installed and use imports to > grab the needed code. The other option is that y'all can explicitly > relicense a subset of the code under the Biopython license. > > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > glancing through the code, I only see numpy functions being used. Since > some other Biopython modules already depend on numpy, could the > installation of lagrange for Bio.Geography be made simpler by just > changing the import to numpy? > > > I can see however > > where the Bio.Nexus functionality might not be sufficient for tree > > manipulation. 
I am not a contributor to the BioPython dev group so I > > cannot speak to those specifics, but as a user I can see separating > > out the tree functions from the Nexus package (and tree I/O in > > general) as logically a phylogenetic tree structure has little to do > > with the nexus file format. It can be somewhat awkward to deal > with in > > the current form. A more general implementation might be a Bio.Tree > > package with I/O readers in Nexus and Newick and XML, etc. > > Definitely. Eric has been discussing this with regards to the > PhyloXML project and we had been looking at other Tree > representations: in PyCogent and Thomas Mailund's Newick module. > Considering the lagrange tree model makes a lot of sense as well. > What I'd like to see is a stab at a generalized Tree object that > supports the operations you need and that the Bio.Nexus parser can > produce, exactly as you describe. Eric and Nick, what do you think > about coordinating on this? > > > Sounds great to me. My impression is that most tree representations are > based on a recursive Node element with a few associated attributes and a > number of useful methods; phyloXML has a Clade object roughly > corresponding to that, but also a bunch of other element types for > extensive annotation of the tree. So two options spring to mind: > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > by any phylogenetic tree representation, ever. (It's already pretty > close.) Refactor Nexus and Newick to use these objects; merge the > features of lagrange so the rest of the Biopython environment can > benefit. Only export to external object structures that are something > other than a straight phylogenetic tree -- e.g. networkx or graphviz for > plotting, numpy/scipy for crunching. > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > that be the Biopython default representation. 
Add a function in > Bio.PhyloXML to export its enhanced tree structure to this simpler > Bio.Tree representation. > > I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but > otherwise be independent of that specific file format. It doesn't depend > on any XML library directly, and both child nodes and XML node > attributes appear as plain ol' object attributes in the tree. But the > Nexus module looked like the parser was kind of tied to the tree > representation, so I haven't reused any of that code yet. So #1 is my > preference, but it put the burden of inter-module compatibility on > whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet > little island for the rest of the summer. > > All the best, > Eric -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From czmasek at burnham.org Tue Jul 7 18:46:32 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 7 Jul 2009 11:46:32 -0700 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <20090707125152.GL17086@sobchak.mgh.harvard.edu> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> <20090707125152.GL17086@sobchak.mgh.harvard.edu> Message-ID: <4A539808.201@burnham.org> Hi: I cannot really comment on the first point (since I don't know enough Python), but I totally agree with Brad on the issue of the find() methods -- very useful! Christian Brad Chapman wrote: > Hi Eric; > > >> I've been mulling a couple of methods for PhyloXML objects that I thought >> could deserve some discussion. >> >> 1. Singular properties for some plural attributes >> >> This goes back to the "confidences" issue: When I'm drilling down through a >> phyloXML-derived tree, I keep expecting certain attributes to be singular >> values when they're actually plural. Auto-completion catches it, of course, >> but the resulting code would seem more obvious if I used the singular name >> when I know the attribute consists of a list of one element. >> > > I like the idea and implementation for cases where you can have > multiple items, but have one most of the time. Very nice. > > >> 2. A find() method on Clade and maybe Phylogeny objects >> > [...] > >> Enhancements: >> - The keyword argument could be a regular expression. Would that be useful? >> > > This seems useful. Often people use crazy naming convention hacks, > and might want to pull out something like all proteins from a > particular organism based on a common prefix in the name. > > >> To handle numbers, I'd have to convert every sub-node attribute value to a >> string, and that would be weird -- or else find() would have to skip >> numerical attributes. 
>> > > Is this if you support regular expressions or either way? For the > find, I think it's sufficient to define what you support and leave > it at that set: any subset of searching will help people get their > work done. > > >> - If no regular arguments are needed, cls could default to PhyloElement or >> even "object" to match everything. >> > > I like the object default here. This fits with a simple use case of: > find everything that matches this string of interest. > > >> - To enable arbitrary hairiness, this function could accept a function as >> the value of the keyword argument and return anything truthy. But at that >> point, the user could probably just roll their own find_node() function. >> However, it could still be useful to filter for numerical values. >> > > This is probably more than you need. For complicated cases I'd > assume people are sophisticated enough to roll their own. > > Nice ideas, > Brad > From bugzilla-daemon at portal.open-bio.org Tue Jul 7 21:49:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 7 Jul 2009 17:49:16 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907072149.n67LnGaJ019542@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #6 from tsham at lbl.gov 2009-07-07 17:49 EST ------- Created an attachment (id=1339) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1339&action=view) Test case vector nti generated genbank file. This file is ok to include in the unit test. It has the same problem as the other file. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Tue Jul 7 22:49:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 23:49:27 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> Message-ID: <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> Brad wrote: >> My vote is to document using subprocess and avoid creating our own >> wrapper. No one has to learn a Biopython specific API for running >> programs, and subprocess provides plenty of flexibility to get stdout, >> stderr and return codes. For places where we feel like using subprocess >> is tricky, additional documentation within Biopython should help those >> encountering it for the first time. This gives us more time to work >> on biology problems, and leaves the running programs problems up to >> the greater Python community. Peter wrote: > Exactly. I'm sure there will still be questions on the mailing list from > people about using subprocess, but if our documentation is done > well enough this shouldn't be too much of a burden. > ... > That seems unanimous so far: Deprecate Bio.Application.generic_run, > and document using subprocess instead. Good :) I started trying to rewrite the tutorial sections using generic_run, and unfortunately it looks like a reasonably cross platform replacement for generic_run when all you want is the return code but you don't want the tool's output printed on screen becomes quite complex, e.g. 
import subprocess
import sys  # needed for the sys.platform check below
# Caveat: nothing reads from these pipes, so a tool that writes a lot of
# output can fill the pipe buffer and block.
return_code = subprocess.call(str(cline),
                              stdin=subprocess.PIPE,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE,
                              shell=(sys.platform != "win32"))

We need to use pipes for stdout (and stderr) to stop the tool's output being printed to screen. Just using os.system(str(cline)) has the same problem. We needed to include the stdin as a pipe as a workaround for a Windows specific bug in subprocess if called from a GUI using Biopython, see http://bugs.python.org/issue1124861 and earlier mailing list posts. This may not be worth worrying about for the documentation examples, as it's a corner case and has been fixed in recent versions of Python.

Finally, we need to use shell=True on Unix (but not Windows as I recall from looking at the Bio.Application code) as we are giving the command as a string (rather than a list of the tool and its arguments). Maybe we can make the command line wrapper object more list like to make subprocess happy without needing to create a string?

I'll try and test this on Windows, Mac and Linux tomorrow - but maybe we will want to include a replacement for Bio.Application.generic_run after all? (Would "simple_run", "run", or "call" be good names?)
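
On the idea of handing subprocess a list rather than a string, a minimal sketch follows. It assumes Python 3.3 or later (for subprocess.DEVNULL) and a command that is already split into a list of arguments. The helper name run_quietly is hypothetical and is not part of Bio.Application. Because the command is a list, shell=True is not needed on any platform, and because output is discarded rather than piped, a chatty tool cannot block on a full pipe.

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command given as a list of arguments and return its exit code.

    All three standard streams are connected to the null device, so
    nothing is printed to screen and no pipe can fill up.
    """
    return subprocess.call(args,
                           stdin=subprocess.DEVNULL,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


# Use the running Python interpreter as a stand-in for an external tool.
ok_code = run_quietly([sys.executable, "-c", "print('hello')"])
fail_code = run_quietly([sys.executable, "-c", "import sys; sys.exit(3)"])
```

If the tool's output is needed, subprocess.Popen with communicate() would be the usual alternative; this sketch only covers the "return code only" case described above.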
Peter From eric.talevich at gmail.com Wed Jul 8 04:09:43 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 00:09:43 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Message-ID: <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> On Tue, Jul 7, 2009 at 11:12 AM, Peter wrote: > Eric wrote: > > My impression is that most tree representations are based on a recursive > > Node element with a few associated attributes and a number of useful > > methods; phyloXML has a Clade object roughly corresponding to that, > > but also a bunch of other element types for extensive annotation of > > the tree. So two options spring to mind: > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > by > > any phylogenetic tree representation, ever. (It's already pretty close.) > > Refactor Nexus and Newick to use these objects; merge the features of > > lagrange so the rest of the Biopython environment can benefit. Only > export > > to external object structures that are something other than a straight > > phylogenetic tree -- e.g. networkx or graphviz for plotting, numpy/scipy > for > > crunching. > > > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > > that be the Biopython default representation. Add a function in > Bio.PhyloXML > > to export its enhanced tree structure to this simpler Bio.Tree > > representation. 
> > I am unclear why would you need to have to have an entirely separate tree > object structure (which then requires code to map between the two). > Perhaps some specific examples of the "enhancements" would help? > The benefit of letting the tree object structures diverge is procrastination -- we could reconcile the two modules after GSoC is over, with stable features and test suites in place. But I could justifiably focus on integration for the remaining weeks if that's best for Biopython, since otherwise I'd probably be reimplementing a number of features already present in other modules. How about this variation on (2): > Suppose Bio.Tree provided a simple tree object (holding a nested > structure), > with methods/functions for general operations like DFT, finding common > ancestors, calculating branch lengths, collapsing internal nodes, etc. > [and I would expect a lot of this could be borrowed from Bio.Nexus, > and/or Thomas Mailund's Newick module]. Couldn't Bio.PhyloXML build > on this using subclassed tree nodes? > The Bioperl and Bioruby phyloXML projects were done this way, I think, but they already had access to Tree/Node objects within each project. Bio.PhyloXML.Tree objects could inherit from Bio.Tree objects if the Bio.Tree objects were designed in a compatible way... if we go this route I'll need to draft up a list of traps, like naming conventions ("annotations" is already an attribute of Bio.PhyloXML.Sequence) and class hierarchy (some functions rely on everything in the phyloXML spec being a subclass of PhyloElement). Do we even need different objects? What if each node class had an optional > python dictionary for annotations? You could maybe key this off the > PhyloXML > names? > > I bet this could be done without different objects. 
Bio.PhyloXML.Tree could be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could be renamed to TreeElement; and the Nexus and Newick parsers could reuse PhyloXML's Phylogeny and Clade elements, where Clade merges with the existing Node class(es). Even Clade by itself might be enough. For organizational purposes, format-specific tree elements could move to their own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some multiple-inheritance tricks could be used to smooth things over.

Here is the phyloXML definition of Clade:
http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h-1124608460

My implementation (trimmed):

class Clade(PhyloElement):
    """Describes a branch of the current phylogenetic tree.

    Used recursively, describes the topology of a phylogenetic tree.

    The parent branch length of a clade can be described with the
    'branch_length' attribute.

    Element 'confidence' is used to indicate the support for a clade/parent
    branch.

    Element 'events' is used to describe such events as gene-duplications at
    the root node/parent branch of a clade.

    Element 'width' is the branch width for this clade (including parent
    branch). Both 'color' and 'width' elements apply for the whole clade
    unless overwritten in sub-clades.
    """
    def __init__(self, branch_length=None, id_source=None, name=None,
            width=None, color=None, node_id=None, events=None,
            binary_characters=None, date=None,
            # Collections
            confidences=None, taxonomies=None, sequences=None,
            distributions=None, references=None, properties=None,
            clades=None, other=None,
            ):
        # set all keyword arguments to instance attributes; collections default to []
        ...
The same for Phylogeny:
http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h535307528

class Phylogeny(PhyloElement):
    """A phylogenetic tree."""
    def __init__(self, rooted, rerootable=None, branch_length_unit=None,
            type=None, name=None, id=None, description=None, date=None,
            clade=None,
            # Collections
            confidences=None, clade_relations=None, sequence_relations=None,
            properties=None, other=None,
            ):
        assert isinstance(rooted, bool)
        # set keyword arguments to attributes; collections default to []
        ...

Sources:
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML/Tree.py
http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html

If we base the Bio.Tree objects off of these two classes, then I wouldn't even need an optional annotations dictionary on each object. Which makes sense, since I think the phyloXML format was designed to accommodate nearly all types of annotations that could reasonably be applied to phylogenetic trees. Assuming most of the Newick and Nexus annotations fit into this design, if a small number of annotations don't, they can be added to this constructor as more keyword arguments without much harm.

(I know nothing about NeXML; should we keep an eye on that too? Glancing at the homepage, I don't see much about complex annotation types, which is probably good if we want to fit that format into this framework eventually.)
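
The constructor pattern noted in the code comments above (keyword arguments become instance attributes, collections default to an empty list) could look roughly like the sketch below. This is an illustration only, not the code in the linked Tree.py: TreeElement, SimpleClade, is_terminal and the reduced argument list are all hypothetical.

```python
class TreeElement(object):
    """Hypothetical base class standing in for PhyloElement."""

    def __repr__(self):
        return "%s(name=%r)" % (self.__class__.__name__,
                                getattr(self, "name", None))


class SimpleClade(TreeElement):
    """Hypothetical, cut-down clade with two scalars and two collections."""

    def __init__(self, branch_length=None, name=None,
                 # Collections
                 confidences=None, clades=None):
        self.branch_length = branch_length
        self.name = name
        # Build a fresh list per instance. Writing clades=[] in the
        # signature instead would share one list between all instances.
        self.confidences = confidences if confidences is not None else []
        self.clades = clades if clades is not None else []

    def is_terminal(self):
        """Return True if this clade has no sub-clades."""
        return not self.clades


parent = SimpleClade(name="parent")
child = SimpleClade(name="child", branch_length=0.3)
parent.clades.append(child)
```

Using None as the default and creating the list inside __init__ keeps each object's collections independent, which matters once parsers start appending sub-clades.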
Cheers, Eric From eric.talevich at gmail.com Wed Jul 8 04:45:16 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 00:45:16 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> References: <25d4d3af0906141759x462c3c97gdc8697f21520f12@mail.gmail.com> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> Message-ID: <3f6baf360907072145t235d43a6nb3633612c94a5244@mail.gmail.com> On Tue, Jul 7, 2009 at 11:12 AM, Peter wrote: > > Perhaps some specific examples of the "enhancements" would help? > > Sure. Here are the special phyloXML element types listed at phyloxml.org, with comments: Annotation -- attached to Sequence; has metadata BinaryCharacters -- "names and/or counts of binary characters present, gained, and lost at the root of a clade" BranchColor -- RGB, for graphics support CladeRelation -- typed relationship between two clades, e.g. multiple parents Date -- e.g. #mya, or name of period ("Silurian") Distribution, Point, Polygon -- geographic distribution of the items of a clade (species, sequences) DomainArchitecture, ProteinDomain -- like SeqFeature for a protein sequence Events -- e.g. one gene duplication on the current clade Property -- attach external references to a node, kind of meta Reference -- literature reference: doi or text description Sequence -- like SeqRecord; more specific annotation fields SequenceRelation -- typed relationship between two sequences, e.g. orthology Taxonomy -- with scientific name, common names, rank, id, code, URI Some of these could be adapted into generally useful Biopython objects, such as Taxonomy and Reference. 
A few are metadata related to the structure or interpretation of the tree, and a few are small classes that could be converted to dictionaries if necessary. The conversion between Sequence and SeqRecord could probably be made lossless, or close to it, and then it would be safe to just plug the Biopython object directly into the tree instead of using a PhyloXML-specific class. Cheers, Eric From bugzilla-daemon at portal.open-bio.org Wed Jul 8 10:16:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Jul 2009 06:16:09 -0400 Subject: [Biopython-dev] [Bug 2872] Genbank parser breaks on VectorNTI generated genbank file In-Reply-To: Message-ID: <200907081016.n68AG94P010653@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2872 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-08 06:16 EST ------- (In reply to comment #6) > Created an attachment (id=1339) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1339&action=view) [details] > Test case vector nti generated genbank file. > > This file is ok to include in the unit test. It has the same problem as the > other file. > Thanks - I've added that as a new unit test. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Wed Jul 8 12:36:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 08:36:00 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <4A538FFA.4030708@berkeley.edu> References: <20090617120257.GG44321@sobchak.mgh.harvard.edu> <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <4A538FFA.4030708@berkeley.edu> Message-ID: <20090708123600.GW17086@sobchak.mgh.harvard.edu> Hi Nick; > I am just now back in town and would love to co-coordinate on this. I > agree having multiple newick parsers etc. is undesirable, I just found I > was forced to that this spring when BioPython didn't have what I need > even for pretty standard Newick files. I have also made use of > Mailund's newick parser in the past. That sounds great. Eric is also on board from the PhyloXML side. For the parser, the right approach is to provide some example files that Bio.Nexus does not handle correctly, and work on improvements to that parser to bring it in line with what you need. Secondarily, we should work on parsing into a general tree structure that supports the questions you need to ask. This should allow us to avoid the lagrange code duplication and also have a more robust Nexus parser in Biopython. Thanks, Brad > > I am booked this afternoon but will go through the thread more this > evening and comment further. Cheers! > Nick > > Eric Talevich wrote: > > On Tue, Jul 7, 2009 at 9:02 AM, Brad Chapman > > wrote: > > > > Hi Stephen; > > > > We can require lagrange to be installed and use imports to > > grab the needed code. The other option is that y'all can explicitly > > relicense a subset of the code under the Biopython license. 
> > > > > > Trivia: it looks like lagrange in turn depends on scipy, but quickly > > glancing through the code, I only see numpy functions being used. Since > > some other Biopython modules already depend on numpy, could the > > installation of lagrange for Bio.Geography be made simpler by just > > changing the import to numpy? > > > > > I can see however > > > where the Bio.Nexus functionality might not be sufficient for tree > > > manipulation. I am not a contributor to the BioPython dev group so I > > > cannot speak to those specifics, but as a user I can see separating > > > out the tree functions from the Nexus package (and tree I/O in > > > general) as logically a phylogenetic tree structure has little to do > > > with the nexus file format. It can be somewhat awkward to deal > > with in > > > the current form. A more general implementation might be a Bio.Tree > > > package with I/O readers in Nexus and Newick and XML, etc. > > > > Definitely. Eric has been discussing this with regards to the > > PhyloXML project and we had been looking at other Tree > > representations: in PyCogent and Thomas Mailund's Newick module. > > Considering the lagrange tree model makes a lot of sense as well. > > What I'd like to see is a stab at a generalized Tree object that > > supports the operations you need and that the Bio.Nexus parser can > > produce, exactly as you describe. Eric and Nick, what do you think > > about coordinating on this? > > > > > > Sounds great to me. My impression is that most tree representations are > > based on a recursive Node element with a few associated attributes and a > > number of useful methods; phyloXML has a Clade object roughly > > corresponding to that, but also a bunch of other element types for > > extensive annotation of the tree. So two options spring to mind: > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > by any phylogenetic tree representation, ever. (It's already pretty > > close.) 
Refactor Nexus and Newick to use these objects; merge the > > features of lagrange so the rest of the Biopython environment can > > benefit. Only export to external object structures that are something > > other than a straight phylogenetic tree -- e.g. networkx or graphviz for > > plotting, numpy/scipy for crunching. > > > > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let > > that be the Biopython default representation. Add a function in > > Bio.PhyloXML to export its enhanced tree structure to this simpler > > Bio.Tree representation. > > > > I wrote Bio.PhyloXML.Tree to use the naming conventions of phyloXML, but > > otherwise be independent of that specific file format. It doesn't depend > > on any XML library directly, and both child nodes and XML node > > attributes appear as plain ol' object attributes in the tree. But the > > Nexus module looked like the parser was kind of tied to the tree > > representation, so I haven't reused any of that code yet. So #1 is my > > preference, but it put the burden of inter-module compatibility on > > whoever is maintaining Bio.Nexus, whereas #2 leaves my code on a quiet > > little island for the rest of the summer. > > > > All the best, > > Eric > > -- > ==================================================== > Nicholas J. Matzke > Ph.D. Candidate, Graduate Student Researcher > Huelsenbeck Lab > Center for Theoretical Evolutionary Genomics > 4151 VLSB (Valley Life Sciences Building) > Department of Integrative Biology > University of California, Berkeley > > Lab websites: > http://ib.berkeley.edu/people/lab_detail.php?lab=54 > http://fisher.berkeley.edu/cteg/hlab.html > Dept. personal page: > http://ib.berkeley.edu/people/students/person_detail.php?person=370 > Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html > Lab phone: 510-643-6299 > Dept. 
fax: 510-643-6264 > Cell phone: 510-301-0179 > Email: matzke at berkeley.edu > > Mailing address: > Department of Integrative Biology > 3060 VLSB #3140 > Berkeley, CA 94720-3140 > > ----------------------------------------------------- > "[W]hen people thought the earth was flat, they were wrong. When people > thought the earth was spherical, they were wrong. But if you think that > thinking the earth is spherical is just as wrong as thinking the earth > is flat, then your view is wronger than both of them put together." > > Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, > 14(1), 35-44. Fall 1989. > http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm > ==================================================== From chapmanb at 50mail.com Wed Jul 8 12:48:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 08:48:41 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> Message-ID: <20090708124841.GX17086@sobchak.mgh.harvard.edu> Hi all; > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > by any phylogenetic tree representation, ever. (It's already pretty close.) > > Refactor Nexus and Newick to use these objects; merge the features of > > lagrange so the rest of the Biopython environment can benefit. I am for this approach. It sounds like what people want is a tree that does everything, and re-implementations occur because representations are lacking in something. 
It would be nice to design this modularly -- with mixin classes for related add-on functionality -- as much as possible. This would allow lighter weight implementations in the future if that were desired. > The benefit of letting the tree object structures diverge is procrastination > -- we could reconcile the two modules after GSoC is over, with stable > features and test suites in place. But I could justifiably focus on > integration for the remaining weeks if that's best for Biopython, since > otherwise I'd probably be reimplementing a number of features already > present in other modules. My vote is for the integration work. Refactoring is hard work and best done early. It is easier to add functionality to a fully integrated PhyloXML parser in the future. > I bet this could be done without different objects. Bio.PhyloXML.Tree could > be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could > be renamed to TreeElement; and the Nexus and Newick parsers could reuse > PhyloXML's Phylogeny and Clade elements, where Clade merges with the > existing Node class(es). Even Clade by itself might be enough. For > organizational purposes, format-specific tree elements could move to their > own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some > multiple-inheritance tricks could be used to smooth things over. Yes, this sounds exactly right. Great stuff. > (I know nothing > about NeXML; should we keep an eye on that too? Glance at the homepage I > don't see much about complex annotation types, which is probably good if we > want to fit that format into this framework eventually.) PhyloXML plus Nexus/Newick is probably enough to stay reasonably general and keep our sanity. NeXML support would be great but practically is an additional project. The refactoring you've described is a good chunk to run with. 
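A minimal sketch of the mixin idea raised earlier in this message, assuming a light core node with optional add-on classes. Every name here (Clade, walk, BranchLengthMixin, TerminalsMixin, RichClade) is hypothetical and chosen for illustration; none of it is the actual Bio.PhyloXML or Bio.Nexus code.

```python
class Clade(object):
    """Minimal recursive tree node (hypothetical, not the Bio.PhyloXML class)."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = list(clades or [])

    def walk(self):
        """Yield this node, then every descendant, depth first."""
        yield self
        for child in self.clades:
            for node in child.walk():
                yield node


class BranchLengthMixin(object):
    """Add-on: branch length summaries. Relies only on walk()."""

    def total_branch_length(self):
        # Nodes without a branch length count as zero.
        return sum(node.branch_length or 0 for node in self.walk())


class TerminalsMixin(object):
    """Add-on: access to the tips of the tree. Relies only on walk()."""

    def terminal_names(self):
        return [node.name for node in self.walk() if not node.clades]


class RichClade(BranchLengthMixin, TerminalsMixin, Clade):
    """Full-featured node assembled from the light core plus mixins."""


# The root carries the add-ons; the children can stay plain Clade objects.
tree = RichClade(clades=[Clade("A", 1.0), Clade("B", 2.5)])
```

Because the mixins depend only on walk(), a lighter implementation could drop either add-on without touching the core class, which is the flexibility being asked for.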
Brad From chapmanb at 50mail.com Wed Jul 8 13:06:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Jul 2009 09:06:49 -0400 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? In-Reply-To: <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> Message-ID: <20090708130649.GY17086@sobchak.mgh.harvard.edu> Hi Peter; > I started trying to rewrite the tutorial sections using generic_run, and > unfortunately it looks like a reasonably cross platform replacement for > generic_run when all you want is the return code but you don't want > the tool's output printed on screen becomes quite complex, e.g. > > import subprocess > return_code = subprocess.call(str(cline), > stdin=subprocess.PIPE, > stdout=subprocess.PIPE, > stderr=subprocess.PIPE, > shell=(sys.platform!="win32")) > > We need to use pipes for stdout (and stderr) to stop the tool's output > being printed to screen. Just using os.system(str(cline)) has the same > problem. How about adding a function like "run_arguments" to the commandlines that returns the commandline as a list. It sounds like we can drop the stdin workaround and provide a documentation item for older Windows versions from a GUI. It might be better to use Popen and wait to make it straightforward to learn to get stdout and stderr. 
So then we get: import subprocess child = subprocess.Popen(cline.run_arguments(), stdout=subprocess.PIPE, stderr=subprocess.PIPE) return_code = child.wait() print child.stdout.read() This avoids the shell nastiness with the argument list, is as simple as it gets with subprocess, and gives users an easy path to getting stdout, stderr and the return codes. Also documenting how to avoid stdout and stderr entirely is useful: import os import subprocess child = subprocess.Popen(cline.run_arguments(), stdout=open(os.devnull, "w"), stderr=subprocess.STDOUT) Brad From bugzilla-daemon at portal.open-bio.org Wed Jul 8 18:22:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Jul 2009 14:22:00 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200907081822.n68IM0Lc028503@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1297 is|0 |1 obsolete| | ------- Comment #14 from eric.talevich at gmail.com 2009-07-08 14:21 EST ------- Created an attachment (id=1340) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1340&action=view) Adapted test_warnings to work with Py2.4-5 This patch is also available on my github branch for this bug: http://github.com/etal/biopython/tree/bug2820 I tested it with Python 2.4, 2.5 and 2.6 on Ubuntu, applied to the current biopython trunk. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
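For readers on current Python, the Popen approach from the generic_run thread above might be sketched as follows. This is an illustration under assumptions, not the Biopython API: run_arguments() was only a proposal at this point, so a plain argument list stands in for it, and the helper names run_and_capture and run_silently are invented here. The sketch uses communicate() rather than wait() followed by read(), since communicate() drains both pipes while waiting.

```python
import os
import subprocess
import sys


def run_and_capture(args):
    """Run a command (list of arguments); return (return_code, stdout, stderr)."""
    child = subprocess.Popen(args,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    # communicate() reads both pipes to the end and then waits, so the child
    # cannot stall on a full pipe buffer.
    out, err = child.communicate()
    return child.returncode, out, err


def run_silently(args):
    """Run a command, discard stdout and stderr, return only the return code."""
    with open(os.devnull, "w") as devnull:
        return subprocess.call(args, stdout=devnull, stderr=subprocess.STDOUT)


# Stand-in for the proposed cline.run_arguments(): any list of strings works.
args = [sys.executable, "-c", "print('hello from the child process')"]
return_code, out, err = run_and_capture(args)
```

Passing a list instead of a string avoids the shell=True question entirely, which is the point Brad makes about "shell nastiness".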
From eric.talevich at gmail.com Wed Jul 8 18:58:52 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 8 Jul 2009 14:58:52 -0400 Subject: [Biopython-dev] PhyloXML helper functions In-Reply-To: <20090707125152.GL17086@sobchak.mgh.harvard.edu> References: <3f6baf360907061634u43d89bdcxa9bad9fb0babb350@mail.gmail.com> <20090707125152.GL17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907081158i649feb97t10cc52dc9a4454b6@mail.gmail.com> Hi Brad, On Tue, Jul 7, 2009 at 8:51 AM, Brad Chapman wrote: > > 2. A find() method on Clade and maybe Phylogeny objects > [...] > > Enhancements: > > - The keyword argument could be a regular expression. Would that be > useful? > > This seems useful. Often people use crazy naming convention hacks, > and might want to pull out something like all proteins from a > particular organism based on a common prefix in the name. > > > To handle numbers, I'd have to convert every sub-node attribute value to > a > > string, and that would be weird -- or else find() would have to skip > > numerical attributes. > > Is this if you support regular expressions or either way? For the > find, I think it's sufficient to define what you support and leave > it at that set: any subset of searching will help people get their > work done. > I implemented it. Here's the signature and docstring: def find(self, cls=None, **kwargs): """Find all sub-nodes matching the given attributes. The 'cls' argument specifies the class of the sub-node. Nodes that inherit from this type will also match. (The default, Tree.PhyloElement, matches any standard phyloXML type.) The arbitrary keyword arguments indicate the attribute name of the sub-node and the value to match: string, integer or boolean. Strings are evaluated as regular expression matches; integers are compared directly for equality, and booleans evaluate the attribute's truth value (True or False) before comparing.
To handle nonzero floats, search with a boolean argument, then filter the result manually. If no keyword arguments are given, then just the class type is used for matching. The result is an iterable through all matching objects, by depth-first search. (Not necessarily the same order as the elements appear in the source file!) Example: >>> tree = PhyloXML.read('phyloxml_examples.xml').phylogenies[5] >>> matches = tree.clade.find(code='OCTVU') >>> matches.next() Taxonomy(code='OCTVU', scientific_name='Octopus vulgaris') """ Notes: - Phylogeny.find just directly calls self.clade.find and returns the result. - I still use PhyloElement instead of object for the default class. The recursive function uses __dict__ to walk the tree, so allowing any object to be searched leads to chaos (e.g. int.__dict__ has 55 keys). Restricting the search to Tree-related nodes still accommodates most use cases, I think. - Depth-first search - if a node that matches has subnodes that also match, the higher node will be yielded first, then the first matching subnode, and so on. But: since the object dictionary doesn't keep XML node order, the order the matches are returned in isn't always what you'd expect. I think I can mitigate this somewhat, but still -- documented weirdness. Thanks, Eric From biopython at maubp.freeserve.co.uk Thu Jul 9 09:18:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Jul 2009 10:18:49 +0100 Subject: [Biopython-dev] ApplicationResult and generic_run obsolete? 
In-Reply-To: <20090708130649.GY17086@sobchak.mgh.harvard.edu> References: <320fb6e00907061202k3ae9141bkfcb46b3149592c68@mail.gmail.com> <8b34ec180907061235o2202e275k74632d5d9de3375c@mail.gmail.com> <20090706220453.GI17086@sobchak.mgh.harvard.edu> <8b34ec180907070120y7d99f43ejd05f2877380cbcc7@mail.gmail.com> <320fb6e00907070241s68d5f66bv7893e18668871ac5@mail.gmail.com> <320fb6e00907071549k608325c5taa38c2056c4b09d5@mail.gmail.com> <20090708130649.GY17086@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907090218u345950agd3a94c9be2bad7dd@mail.gmail.com> On Wed, Jul 8, 2009 at 2:06 PM, Brad Chapman wrote: > Hi Peter; > >> I started trying to rewrite the tutorial sections using generic_run, and >> unfortunately it looks like a reasonably cross platform replacement for >> generic_run when all you want is the return code but you don't want >> the tool's output printed on screen becomes quite complex, e.g. >> >> import subprocess >> return_code = subprocess.call(str(cline), >> stdin=subprocess.PIPE, >> stdout=subprocess.PIPE, >> stderr=subprocess.PIPE, >> shell=(sys.platform!="win32")) >> >> We need to use pipes for stdout (and stderr) to stop the tool's output >> being printed to screen. Just using os.system(str(cline)) has the same >> problem. > > How about adding a function like "run_arguments" to the > commandlines that returns the commandline as a list. That would be a simple alternative to my vague idea "Maybe we can make the command line wrapper object more list like to make subprocess happy without needing to create a string?", which may not be possible. Either way, this will require a bit of work on the Bio.Application parameter objects... > It sounds like we can drop the stdin workaround and provide a > documentation item for older Windows versions from a GUI. Yes, as I noted, this is a corner case.
It is something any replacement for generic_run would still have to cater to, but it would just complicate an example. > It might be better to use Popen and wait to make it > straightforward to learn to get stdout and stderr. Yes, using subprocess.Popen explicitly rather than their helper function subprocess.call makes sense for our docs Peter P.S. Thanks Cymon for those minor corrections to the tutorial. The master file is a LaTeX document, Doc/Tutorial.tex, the command line tools pdflatex and hevea turn it into PDF and HTML which we include with the Biopython archives, and manually copy onto the website as well. From eric.talevich at gmail.com Thu Jul 9 19:46:53 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 9 Jul 2009 15:46:53 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090708124841.GX17086@sobchak.mgh.harvard.edu> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> On Wed, Jul 8, 2009 at 8:48 AM, Brad Chapman wrote: > Hi all; > > > > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed > > > by any phylogenetic tree representation, ever. (It's already pretty > close.) > > > Refactor Nexus and Newick to use these objects; merge the features of > > > lagrange so the rest of the Biopython environment can benefit. > > I am for this approach. It sounds like what people want is a tree > that does everything, and re-implementations occur because > representations are lacking in something. 
> > It would be nice to design this modularly -- with mixin classes for > related add-on functionality -- as much as possible. This would > allow lighter weight implementations in the future if that were > desired. > OK. Here's the current file layout that needs merging, to illustrate: Bio/ PhyloXML/ __init__.py -- flat public API Tree.py Parser.py Writer.py Utils.py Exceptions.py Nexus/ Nexus.py Nodes.py Trees.py cnexus.c The proposal is to extract the Tree class hierarchy so that other modules can share it, and Biopython users can do I/O with trees as easily as they currently can with sequences ("from Bio import TreeIO; for tree in TreeIO.parse('example.xml', 'phyloxml'): ..."). Bio/ Tree/ Elements.py TreeIO.py -- read, write wrappers PhyloXML/ Parser.py Writer.py Utils.py Nexus/ Nexus.py cnexus.c In the above case, TreeIO.py is a new file containing wrappers for the read and parse functions in my PhyloXML module, and also Nexus and Newick, pending integration. The modules implementing each specific format remain where they are, under Bio/, but aren't expected to be imported directly by the end user. Alternatively, the individual modules that implement each format for I/O can be collected under a new TreeIO directory, with __init__ implementing the wrappers: Bio/ Tree/ Elements.py Utils.py? TreeIO/ __init__.py -- read, write wrappers PhyloXML.py -- Parser + Writer combined Nexus.py cnexus.c ... What do you think? Should I start writing a generalized Bio/Tree/Elements.py for PhyloXML to depend on? 
-Eric From biopython at maubp.freeserve.co.uk Thu Jul 9 21:53:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Jul 2009 22:53:42 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> Message-ID: <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> On Thu, Jul 9, 2009 at 8:46 PM, Eric Talevich wrote: > The proposal is to extract the Tree class hierarchy so that other modules > can share it, and Biopython users can do I/O with trees as easily as they > currently can with sequences ("from Bio import TreeIO; for tree in > TreeIO.parse('example.xml', 'phyloxml'): ..."). > ... Yes :) > In the above case, TreeIO.py is a new file containing wrappers for the read > and parse functions in my PhyloXML module, and also Nexus and Newick, > pending integration. ... > > Alternatively, the individual modules that implement each format for I/O can > be collected under a new TreeIO directory, with __init__ implementing the > wrappers: ... Either idea sounds reasonable. However, for future extensibility, and also consistency with Bio.SeqIO and Bio.AlignIO, I would suggest we have Bio/TreeIO/__init__.py (i.e. as a folder containing as many wrappers or parsers as needed) rather than just using Bio/TreeIO.py (a single file). Note that the Nexus parser is much more than just a tree parser.
NEXUS files can contain trees, but much more besides (including a multiple sequence alignment, and instructions to phylogenetic tools). In the short term for TreeIO and Nexus, I would just have Bio/TreeIO/NexusIO.py as a thin wrapper that calls Bio.Nexus and converts its trees into the standard trees (i.e. we don't have to make any changes to Bio.Nexus immediately). In the longer term, it would make sense for Bio.Nexus to start using the new tree objects - but we also have backwards compatibility to think about. Ideally we can get Frank and/or Cymon to look at this (rather than Nick or Eric - as this is their code, and Nick and Eric have more than enough work to do for their projects). [There are parallels here to how I did Bio.SeqIO (and AlignIO), often wrapping existing parsers by turning their format specific data structures into the common SeqRecord (or Alignment) objects. For example, to read/write alignments in NEXUS format Bio.AlignIO just calls Bio.Nexus internally.] Peter From chapmanb at 50mail.com Fri Jul 10 12:07:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 10 Jul 2009 08:07:34 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> Message-ID: <20090710120734.GD17086@sobchak.mgh.harvard.edu> Hi Eric; > > The proposal is to extract the Tree class hierarchy so that other modules > > can share it, and Biopython users can 
do I/O with trees as easily as they > > currently can with sequences ("from Bio import TreeIO; for tree in > > TreeIO.parse('example.xml', 'phyloxml'): ..."). > > ... Sounds great. For most of this I will defer to Peter's expert opinion. As he mentioned, basing this off of SeqIO/AlignIO makes a lot of sense. > > In the above case, TreeIO.py is a new file containing wrappers for the read > > and parse functions in my PhyloXML module, and also Nexus and Newick, > > pending integration. ... > > > > Alternatively, the individual modules that implement each format for I/O can > > be collected under a new TreeIO directory, with __init__ implementing the > > wrappers: ... > > Either idea sounds reasonable. However, for future extensibility, and > also consistency with Bio.SeqIO and Bio.AlignIO, I would suggest we > have Bio/TreeIO/__init__.py (i.e. as a folder containing as many > wrappers or parsers as needed) rather than just using Bio/TreeIO.py > (a single file). Agreed. The imports are the same but this gives added flexibility. > Note that the Nexus parser is much more than just a tree parser. > NEXUS files can contain trees, but much more besides (including a > multiple sequence alignment, and instructions to phylogenetic > tools). In the short term for TreeIO and Nexus, I would just have > Bio/TreeIO/NexusIO.py as a thin wrapper that calls Bio.Nexus and > converts its trees into the standard trees (i.e. we don't have to > make any changes to Bio.Nexus immediately). In the longer term, > it would make sense for Bio.Nexus to start using the new tree > objects - but we also have backwards compatibility to think about. Also agreed. We should get Bio.Nexus updated enough so that it can handle Nick's problem files, and from there apply a wrapper to push Nexus trees into a generic tree compatible with PhyloXML. This will force us to be general about the Tree implementation, but save some re-writing and maintain back-compatibility.
Once the generic tree is hammered out and everyone is happy, then we can think about migrating Nexus to it. Seconding Peter's comments, this is probably another big job. So, in summary, the major deliverables are: - Generic tree representation plus a TreeIO structure - PhyloXML parser that uses this tree directly - Nexus parser that can handle problem files and parse into the generic tree. This will let us drop the lagrange duplication from Nick's code. Sounds like you have this well worked out, Brad From biopython at maubp.freeserve.co.uk Fri Jul 10 12:24:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Jul 2009 13:24:03 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090710120734.GD17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> <20090710120734.GD17086@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907100524xb4e1f1cx14f495c5fb658106@mail.gmail.com> On Fri, Jul 10, 2009 at 1:07 PM, Brad Chapman wrote: > So, in summary, the major deliverables are: > > - Generic tree representation plus a TreeIO structure > - PhyloXML parser that uses this tree directly > - Nexus parser that can handle problem files and parse into the > generic tree. This will let us drop the lagrange duplication from > Nick's code. > > Sounds like you have this well worked out, > Brad Sounds good. Note PhyloXML (which I gather is annotation rich) may not have to use the generic trees, it could use a subclass.
If this means the generic trees can be less memory hungry that might be worthwhile... something to keep in mind at least. For example, consider a large Newick file with only taxa names and branch lengths, no branch colours, no bootstraps, no internal node names, etc. What specifically is wrong with the Bio.Nexus Newick parser? i.e. what files won't it parse that the lagrange code will? The only thing I am aware of is "naked" internal node labels (Bug 2788): http://bugzilla.open-bio.org/show_bug.cgi?id=2788 Peter From biopython at maubp.freeserve.co.uk Fri Jul 10 12:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Jul 2009 13:38:43 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> Message-ID: <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> On Mon, Jun 22, 2009 at 6:57 PM, Peter wrote: > > Once the beta release is out, we'll resume taking small changes > (especially for documentation additions or clarifications) with a > view to releasing Biopython 1.51 final in July (probably the second > week, after people get back from BOSC/ISMB). > OK, that didn't happen - too much to catch up on at work after being away at BOSC/ISMB for a week. Also I will be on holiday next week (graduation etc). I will have some limited internet access. I'm thinking of doing the final release of Biopython 1.51 the following week (i.e. the week starting 20th July). This will be after the annual EMBOSS release, and one little thing I want to sort out before we release Biopython 1.51 is mapping Solexa/PHRED scores in FASTQ files (specifically what to do with a PHRED score of zero which is usually a dummy value, but taken literally means "this read is wrong" or "worse than random").
After discussion with Peter Rice at BOSC/ISMB 2009, I plan to follow his plan for EMBOSS (map PHRED of zero to the lowest used Solexa score, -5). Once the EMBOSS release is out, I can use it for cross checking our FASTQ conversions. Also, we have the Bio.Application.generic_run code to retire, which basically means we label it as obsolete and update the tutorial to use subprocess (see other thread), but this requires cross platform testing. Peter From tiagoantao at gmail.com Fri Jul 10 22:52:41 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 10 Jul 2009 23:52:41 +0100 Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen In-Reply-To: <4A572392.3040902@student.otago.ac.nz> References: <4A52985C.3000603@student.otago.ac.nz> <6d941f120907070055m2d34fcb1qe8b29e40d8d67880@mail.gmail.com> <4A572392.3040902@student.otago.ac.nz> Message-ID: <6d941f120907101552y32cbd121ub9817f0b5e4292e@mail.gmail.com> Hi David, > Gee, I hope I haven't raised your hopes beyond my ability to deliver (both > in terms of time and skills). I've uploaded my Arlequin classes and > functions to a branch on github so you can see them (/Bio/PopGen/Arlequin/ > on http://github.com/dwinter/biopython/tree/arleq-branch) This is great, I took your code and created a new version (nothing more than also an initial sketch - Feel free to disagree/propose changes), you can find it here: http://github.com/tiagoantao/biopython/tree/arlequin Here are a few comments: 1. I've put indentation at 4 spaces, which I think is the biopython standard 2. I've split the code in Record (__init__.py) and your Seq code (on Utils.py) 3. Just one note, samples and haplotype tables, might not be lists, but iterators. The problem is with very large files (like thousands of sequences) which do not fit in memory. While the current implementation is fine, the expectation is that what is there is just an iterator, not specifically a (in memory) list. 
I think a list should be ok for arlequin genetic structures which I hope are always small... 4. I've put a copyright message with your name in both files ;) 5. I HAVE NOT TESTED THE CODE CHANGES. This is just a proposed starting draft. OK, somebody has to write a parser to actually read the files in ;) which is the biggest piece of work to be done. I don't mind doing it (like in the next month or so - I have some free time now), but you can do it if you want. In case you decide to do it, I have just one major point to note: make a parser that is able to read big files (i.e., some files cannot be parsed into memory in one go). I made this mistake with the genepop parser and some people do complain about it. Some things cannot be read into memory as lists but have to be read as iterators (issue 3 above). I think a parser that is able to handle lots of files is also good to help in building a sound model to represent an arlequin record. As usual we will need test code and documentation for all this ;) > By the way, is there a plan to have generic representations of populations, > alleles etc in PopGen? It would make a parser for Arlequin files a much more > useful tool. I found a few threads about it on the mailing lists around the > birth of the module but not since. I am actually afraid of a single generic representation. My main issue with this is that I don't believe that it is possible to get it right. There are many kinds of markers, types of data (frequency, gametic-phase, non-phased), and population info (e.g. georeferencing). But after we get the genepop code and an arlequin parser fully working I don't mind revisiting this. But I would like to delay this discussion until after the genepop code and (if we get it done) the arlequin code are in the production version.
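To make the list-versus-iterator point concrete, here is a small sketch of a parser that yields one record at a time instead of building a list. The file layout is a made-up toy format (a name line, data lines, then a blank line), NOT the Arlequin format, and iter_records is a hypothetical name.

```python
import io


def iter_records(handle):
    """Yield (name, lines) one record at a time from a blank-line separated file.

    The layout is a made-up toy format, not the Arlequin format.
    """
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if name is not None:
                yield name, lines
            name, lines = None, []
        elif name is None:
            name = line
        else:
            lines.append(line)
    if name is not None:  # final record without a trailing blank line
        yield name, lines


handle = io.StringIO(u"pop1\nACGT\nACGA\n\npop2\nTTGT\n")
records = iter_records(handle)
first = next(records)  # only the first record has been read so far
```

Callers that really want everything in memory can still write list(iter_records(handle)), while large files can be processed record by record.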
Any comments would be most welcome, Tiago From bugzilla-daemon at portal.open-bio.org Mon Jul 13 14:44:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 13 Jul 2009 10:44:19 -0400 Subject: [Biopython-dev] [Bug 2879] New: missing __delitem__ in Bio.PDB.Entity.Entity Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2879 Summary: missing __delitem__ in Bio.PDB.Entity.Entity Product: Biopython Version: 1.51b Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P3 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: katja.luck at unistra.fr I realised that using the __delitem__ method in class Chain causes the following error message: ... File "/Library/Python/2.5/site-packages/Bio/PDB/Chain.py", line 79, in __delitem__ return Entity.__delitem__(self, id) AttributeError: class Entity has no attribute '__delitem__' And indeed, the class Entity doesn't have the method __delitem__ even though it is used in Chain. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
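The failure in Bug 2879 follows a general pattern: a subclass delegates to a base class method that was never defined. The toy classes below reproduce the shape of the problem and one possible fix. They are stand-ins written for this illustration, not the real Bio.PDB code; the id translation in Chain and the child_list/child_dict layout are assumptions made for the sketch.

```python
class Entity(object):
    """Toy container keyed by child id (a stand-in, not the real Bio.PDB class)."""

    def __init__(self):
        self.child_list = []
        self.child_dict = {}

    def add(self, child_id, child):
        self.child_dict[child_id] = child
        self.child_list.append(child)

    def __len__(self):
        return len(self.child_list)

    def __getitem__(self, child_id):
        return self.child_dict[child_id]

    def __delitem__(self, child_id):
        """Remove a child; this is the method the subclass expects to exist."""
        child = self.child_dict.pop(child_id)
        self.child_list.remove(child)


class Chain(Entity):
    def _translate_id(self, child_id):
        """Allow a plain integer as shorthand for a (' ', number, ' ') id."""
        if isinstance(child_id, int):
            return (" ", child_id, " ")
        return child_id

    def __getitem__(self, child_id):
        return Entity.__getitem__(self, self._translate_id(child_id))

    def __delitem__(self, child_id):
        # Delegating only works if Entity actually defines __delitem__;
        # without it this line raises the AttributeError from the bug report.
        return Entity.__delitem__(self, self._translate_id(child_id))


chain = Chain()
chain.add((" ", 1, " "), "ALA")
chain.add((" ", 2, " "), "GLY")
del chain[1]
```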
From eric.talevich at gmail.com Mon Jul 13 15:21:20 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 13 Jul 2009 11:21:20 -0400
Subject: [Biopython-dev] GSoC Weekly Update 8: PhyloXML for Biopython
Message-ID: <3f6baf360907130821g6bbbe7a9s5c551156a11aeac1@mail.gmail.com>

Hi all,

Previously (July 6-10) I:

- Addressed some comments from last week's code/doc review
- Enabled Pythonic syntax sugar (dictionary emulation, specialized __str__ methods, singular properties for some plural attributes), plus tests
- Wrote Clade.find() for flexible searching
- Checked Py2.4 compatibility (it's slower, but it works)
- Started Bio.Tree, Bio.TreeIO modules (integration)

This week (July 13-17) I will:

Extend the core to the rest of the spec:
- Adding unit tests and classes to support the remaining (non-core) phyloXML elements
- Implement collapse_whitespace -- see the spec glossary
- Make Writer use the correct namespace prefixes
- "other" objects: assert the namespace is not phyloxml
- Use the schema document to validate the input file

Integrate with Biopython:
- Extract a Bio.Tree.BaseTree module from PhyloXML's tree classes
- Improve the SeqRecord conversion

Improve/revise documentation:
- Address remaining comments from code/doc review
- Revisit docstrings for all classes, functions, methods; consider enabling epydoc formatting

Questions:

- My serializer uses XML entity codes instead of unicode characters in the output -- is that OK? It still round-trips successfully with the parser.
- Is there anything to do for BioSQL compatibility, besides extracting sequences?
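
On the first question, a small sketch of why numeric character references are a safe choice. This uses the standard library's xml.etree.ElementTree rather than the phyloXML serializer discussed above (an assumption made to keep the example self-contained), and the element name and text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up element whose text contains non-ASCII characters.
elem = ET.Element("name")
elem.text = "Caf\u00e9 \u03b1-tubulin"

# Serializing to an ASCII-only encoding forces non-ASCII characters
# to be written as numeric character references (e.g. &#233;).
ascii_xml = ET.tostring(elem, encoding="us-ascii")

# Serializing to UTF-8 writes the characters directly as encoded bytes.
utf8_xml = ET.tostring(elem, encoding="utf-8")

# Both forms parse back to the same text, so the choice is cosmetic
# as far as a conforming XML parser is concerned.
restored_from_ascii = ET.fromstring(ascii_xml).text
restored_from_utf8 = ET.fromstring(utf8_xml).text

print(ascii_xml)
print(restored_from_ascii == restored_from_utf8 == elem.text)
```

The practical trade-off is readability of the raw file versus robustness when the file passes through tools that mishandle encodings.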
Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From eric.talevich at gmail.com Mon Jul 13 16:12:06 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Jul 2009 12:12:06 -0400 Subject: [Biopython-dev] Bio.Tree layout (Was: BioGeography update) Message-ID: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> Hi folks, On Fri, Jul 10, 2009 at 8:24 AM, Peter wrote: > On Fri, Jul 10, 2009 at 1:07 PM, Brad Chapman wrote: > > So, in summary, the major deliverables are: > > > > - Generic tree representation plus a TreeIO structure > > - PhyloXML parser that uses this tree directly > > - Nexus parser that can handle problem files and parse into the > > generic tree. This will let us drop the lagrange duplication from > > Nick's code. > > > > Sounds like you have this well worked out, > > Brad > > Sounds good. Note PhyloXML (which I gather is annotation rich) > may not have to use the generic trees, it could use a subclass. > If this means the generic trees can be less memory hungry that > might be worth while... something to keep in mind at least. e.g. > Consider a large Newick file with only taxa names and branch > lengths, no branch colours, no bootstraps, no internal node > names, etc. > > Peter > Hilmar Lapp just pointed me to the BioSQL PhyloDB extension: http://biosql.org/wiki/Extensions Should this schema be the basis of a Bio.Tree.BaseTree module? 
Here's the file layout I'm picturing:

Bio/Tree/
    BaseTree.py -- everything else derives from these classes
    PhyloXMLTree.py -- already on github
    NexusTree.py -- if necessary

The class structure I'm working on right now looks like:

# In BaseTree -- currently empty classes, pending Nexus integration
class TreeElement(object)
    class TreeNode(TreeElement)

# In PhyloXMLTree
class PhyloElement(BaseTree.TreeElement)
    class Clade(PhyloElement, BaseTree.TreeNode)
    class ...(PhyloElement) -- all other phyloXML classes

Rather than treat BaseTree as the intersection of all the other Tree representations that rely on it, we could use PhyloDB as the reference point. What do you think? Should we come back to this in a week or two?

Eric

From matzke at berkeley.edu Mon Jul 13 18:34:42 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 13 Jul 2009 11:34:42 -0700
Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion
In-Reply-To: <20090708124841.GX17086@sobchak.mgh.harvard.edu>
References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu>
Message-ID: <4A5B7E42.40106@berkeley.edu>

Brad Chapman wrote:
> Hi all;
>
>>> 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed
>>> by any phylogenetic tree representation, ever. (It's already pretty close.)
>>> Refactor Nexus and Newick to use these objects; merge the features of
>>> lagrange so the rest of the Biopython environment can benefit.
>
> I am for this approach. It sounds like what people want is a tree
> that does everything, and re-implementations occur because
> representations are lacking in something.

Hi all -- thanks for this discussion about tree classes. Sorry it took me a while to absorb all of this (and I may still be working on absorbing all of it...there is a lot to keep in my head!).

PS: This also serves as my Monday update; basically I need to revise my schedule based on the decisions made after discussion of this thread.

Here is a summary of the situation as I understand it. It may be a little long, apologies! (I was kind of hoping an easy solution would just appear, since really everything after this point in my GSoC project requires tree processing, and thus I have to at least have the decision made about which tree class to use.)

I. Tree Class Options

It sounds like we have 3 options being discussed:

1. making Bio.PhyloXML.Tree the super-duper tree class
2. improving Bio.Nexus.Trees
3. including the Lagrange tree class or suitably licensed/inspired version thereof.

(Or there is #4, some combination)

II. My Original Problem, Which is Probably Quite Small Really

I think I kind of unintentionally kicked all of this off because I couldn't get Bio.Nexus.Trees to read what I considered pretty standard Newick files back when I was originally exploring this in the spring. Initially for my own scripts I used another newick parser & tree class I found online (Mailund's IIRC), then discovered a superior one in Lagrange and started using that. Thus in GSoC it was simplest to begin by importing the Lagrange parser, but that led to legitimate concerns about duplication/licensing etc.

Reviewing my original issues from the spring, really the only problem I found with Bio.Nexus.Trees was with node labels, i.e. when an internal node is given e.g. a clade name, in addition to a branch length.
This is a standard output on a great many newick files in my experience, which seem to be correctly read by just about all the other programs I use (Mesquite, Dendroscope, etc.) so I impulsively abandoned Bio.Nexus.Trees at the time when I couldn't get it to work.

III. Bug Report

I did file a bug report back in March. This is outstanding as far as I know.

Bio.Nexus.Trees newick parser does not support internal node labels
http://bugzilla.open-bio.org/show_bug.cgi?id=2788

IV. Problem Examples

Below I have accumulated some cases that work/don't work:

=================
from Bio.Nexus import Trees

# This works
ts0 = "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;"
to0 = Trees.Tree(ts0)
print to0

# Gymnosperms tree with node labels; doesn't work
ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;'
to1a = Trees.Tree(ts1a)

# Just Taxaceae; doesn't work
ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;'
to1b = Trees.Tree(ts1b)

# Just Taxaceae; this works; node labels deleted
ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;'
to1c = Trees.Tree(ts1c)

# This doesn't work (from bug report)
ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, t1:0.130208)F:0.0318288)D:0.0273876);"
to2 = Trees.Tree(ts2)
=================

But if I import the Lagrange tree class/parser, all of these work and my life is happy:

=================
import lagrange_newick
# This is lagrange's newick.py file, renamed to lagrange_newick.py

lt1a = lagrange_newick.parse(ts1a)
lt1b = lagrange_newick.parse(ts1b)
lt1c = lagrange_newick.parse(ts1c)
lt2 = lagrange_newick.parse(ts2)
=================

V. The Functions I Need From a Tree Class

Basically my method of late has been to use the Lagrange Tree class, and then write my own standalone functions to do various necessary basic processing of trees. E.g.:

* subset tree based on list of taxa; update root and any now-redundant internal nodes left with 0 or 1 descendants
* extract a subtree to a new tree (cloned nodes so they don't refer to the old nodes, important in doing passes through tree)
* read/write to Newick
* print tree to screen in a readable format
* get distance (total branch length between 2 nodes)
* calculate many measures that can be done from the distances (total all-to-all distance matrix, tree length, mean phylogenetic distance, mean nearest-neighbor phylogenetic distance)
* several others I don't remember off the top of my head

In my list-o-functions approach, I would just write functions for the tree class I was using, but I think it has been made clear that really these functions should be methods of a certain Tree class.
Which requires a decision about what Tree class to use. VI. What the current classes do. I had never looked seriously at Bio.Nexus.Trees since I was just crashing it, but it actually looks like it does a bunch: Bio.Nexus.Trees =========== type(to1c) to1c dir(to1c) ['_Tree__values_are_support', '__doc__', '__init__', '__module__', '__str__', '_add_subtree', '_get_id', '_get_values', '_parse', '_walk', 'add', 'all_ids', 'branchlength2support', 'chain', 'collapse', 'collapse_genera', 'common_ancestor', 'convert_absolute_support', 'count_terminals', 'dataclass', 'display', 'distance', 'get_taxa', 'get_terminals', 'has_support', 'id', 'is_bifurcating', 'is_compatible', 'is_identical', 'is_internal', 'is_monophyletic', 'is_parent_of', 'is_preterminal', 'is_terminal', 'kill', 'link', 'max_support', 'merge_with_support', 'name', 'node', 'prune', 'randomize', 'root', 'root_with_outgroup', 'rooted', 'search_taxon', 'set_subtree', 'split', 'sum_branchlength', 'to_string', 'trace', 'unlink', 'unroot', 'weight'] # Node methods: nd = to1c.node(1) nd type(nd) dir(nd) ['__doc__', '__init__', '__module__', 'add_succ', 'data', 'get_data', 'get_id', 'get_prev', 'get_succ', 'id', 'prev', 'remove_succ', 'set_data', 'set_id', 'set_prev', 'set_succ', 'succ'] # Node data: ndd = nd.get_data() dir(ndd) ['__doc__', '__init__', '__module__', 'branchlength', 'comment', 'support', 'taxon'] =========== Lagrange Tree Class: (really class Node I guess, and the tree is reference by the root Node) ============= type(lt1b) lt1b dir(lt1b) ['__doc__', '__init__', '__module__', 'add_child', 'children', 'data', 'descendants', 'excluded_dists', 'find_descendant', 'graft', 'isroot', 'istip', 'iternodes', 'label', 'labelset_nodemap', 'leaf_distances', 'leaves', 'length', 'mrca', 'nchildren', 'order_subtrees_by_size', 'parent', 'prune', 'remove_child', 'rootpath', 'subtree_mapping', 'ultrametricize_dumbly'] ============= Bio.PhyloXML.Tree ============= [not sure...perhaps someone could contribute the list of 
methods/intended methods] ============= VII. I am Leaning Towards Bio.Nexus.Trees Based on current functionality and integration with BioPython, and what can be done in the short term, it looks to me like the best option is to mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as necessary. However if e.g. PhyloXML is working well enough that I can use that, that is an option. VIII. What I should do next Given what I now know, I probably should have just written a little function to strip node labels out of my Newick trees, and done everything based on the Bio.Nexus.Trees class. I could still do this and continue on my merry way without too much trouble. But given that my tree-based functions should probably be methods of some class...here are the questions I have: * Should I muck with Bio.Nexus.Trees and try to fix the node labels issue? My instinct was not to mess with other people's stuff, but that may be a poor instinct... * Should I implement my tree-based functions methods as methods of the Bio.Nexus.Trees class? * Should I delay on this whole issue while it is being discussed, and go back to issues more localized to my GSoC project, i.e. making my GBIF functions into methods of a GBIF records class? Thanks for reading! And sorry if this was more confusing than it had to be, I am definitely learning as I go here. Cheers, Nick > > It would be nice to design this modularly -- with mixin classes for > related add-on functionality -- as much as possible. This would > allow lighter weight implementations in the future if that were > desired. > >> The benefit of letting the tree object structures diverge is procrastination >> -- we could reconcile the two modules after GSoC is over, with stable >> features and test suites in place. But I could justifiably focus on >> integration for the remaining weeks if that's best for Biopython, since >> otherwise I'd probably be reimplementing a number of features already >> present in other modules. 
> > My vote is for the integration work. Refactoring is hard work and > best done early. It is easier to add functionality to a fully integrated > PhyloXML parser in the future. > >> I bet this could be done without different objects. Bio.PhyloXML.Tree could >> be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could >> be renamed to TreeElement; and the Nexus and Newick parsers could reuse >> PhyloXML's Phylogeny and Clade elements, where Clade merges with the >> existing Node class(es). Even Clade by itself might be enough. For >> organizational purposes, format-specific tree elements could move to their >> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some >> multiple-inheritance tricks could be used to smooth things over. > > Yes, this sounds exactly right. Great stuff. > >> (I know nothing >> about NeXML; should we keep an eye on that too? Glance at the homepage I >> don't see much about complex annotation types, which is probably good if we >> want to fit that format into this framework eventually.) > > PhyloXML plus Nexus/Newick is probably enough to stay reasonably > general and keep our sanity. NeXML support would be great but > practically is an additional project. The refactoring you've described > is a good chunk to run with. > > Brad > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From eric.talevich at gmail.com Mon Jul 13 20:01:07 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 13 Jul 2009 16:01:07 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <3f6baf360907131301v2096cef0o64c458ca1bfabc7c@mail.gmail.com> Hi Nick, On Mon, Jul 13, 2009 at 2:34 PM, Nick Matzke wrote: > > > Hi all -- thanks for this discussion about tree classes. Sorry it took me > awhile to absorb all of this (and I may still be working on absorbing all of > it...there is a lot to keep in my head!). > [...] > > I. Tree Class Options > > It sounds like we have 3 options being discussed: > > 1. making Bio.PhyloXML.Tree the super-duper tree class > 2. improving Bio.Nexus.Trees > 3. 
including the Lagrange tree class or suitably licensed/inspired version
> thereof.
>
> (Or there is #4, some combination)
>

The last consensus we reached on Biopython-dev was to create two new modules, Bio.Tree and Bio.TreeIO, like so:

1. Extract a very basic Tree and Node class, looking at the intersection of the PhyloXML and Nexus class hierarchies, and put the result in Bio.Tree.BaseTree. I started on this today:
http://github.com/etal/biopython/blob/phyloxml/Bio/Tree/BaseTree.py

(It doesn't do anything yet besides set up a class hierarchy that we can use for generalizing existing code.)

2. Write wrappers for the existing PhyloXML and Nexus I/O functions. I'm putting that here:
http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/__init__.py

Again, it's only useful for PhyloXML parsing right now. Eventually we can connect Bio.Nexus to these two modules, but that's well outside the scope of my GSoC project.

> Bio.PhyloXML.Tree
> =============
> [not sure...perhaps someone could contribute the list of methods/intended
> methods]
> =============
>

Not very many! My project is to implement the phyloXML spec, and the spec says nothing about methods, just about how to store data.

As you've noted, Bio.Nexus has a lot of useful methods for phylogenetic trees, independent of the underlying file format. I'd like to separate the I/O code from the tree representations for Bio.Nexus and Bio.PhyloXML, leaving Bio.TreeIO with format-specific wrappers and Bio.Tree with common tree representations and methods for handling trees. Basically, I don't want to rewrite necessary methods from scratch; I want to use the ones Nexus already has.

Since phyloXML is designed to store more kinds of annotations than Nexus, there are some additional Tree-based classes in Bio.Tree.PhyloXMLTree, with some methods for dealing with the additional annotations.
But the methods you want will be on Bio.Tree.BaseTree objects, and you shouldn't have to worry about phyloXML objects unless you want to add some additional phyloXML-specific annotations to your trees. > VIII. What I should do next > > Given what I now know, I probably should have just written a little > function to strip node labels out of my Newick trees, and done everything > based on the Bio.Nexus.Trees class. I could still do this and continue on > my merry way without too much trouble. > > But given that my tree-based functions should probably be methods of some > class...here are the questions I have: > > * Should I muck with Bio.Nexus.Trees and try to fix the node labels issue? > My instinct was not to mess with other people's stuff, but that may be a > poor instinct... > > * Should I implement my tree-based functions methods as methods of the > Bio.Nexus.Trees class? > > * Should I delay on this whole issue while it is being discussed, and go > back to issues more localized to my GSoC project, i.e. making my GBIF > functions into methods of a GBIF records class? > > It sounds like relying on the current Bio.Nexus is the best approach. I'll defer to the experts, but my guess is that if it's only a small change you need, then make a patch to Bio.Nexus.Trees for your own use and also upload the patch to Bugzilla to make it easier to use upstream. Integrating the functions into Bio.Nexus right now probably isn't necessary, since many of those methods will probably end up in Bio.Tree eventually anyway. For functions that could become Nexus methods, try arranging the argument list so that the object the method would belong to comes first. Then functions can be moved into classes by renaming the first argument to 'self', and nothing breaks. It's also possible to directly monkeypatch a class/object with functions structured that way, but I think that would be frowned upon in general... 
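
The "object comes first" suggestion above can be sketched like this. The Tree class here is a toy stand-in written for the example (it is not Bio.Nexus.Trees.Tree), and total_branch_length is a hypothetical function name.

```python
class Tree:
    """Toy tree holding only branch lengths; a stand-in, not a Biopython class."""

    def __init__(self, branch_lengths):
        self.branch_lengths = dict(branch_lengths)


# Step 1: a standalone function whose first argument is the tree it works on.
def total_branch_length(tree):
    return sum(tree.branch_lengths.values())


t = Tree({"A": 1.5, "B": 2.0, "AB": 0.5})
standalone_result = total_branch_length(t)


# Step 2: moving it into a class later is a cut-and-paste plus renaming
# the first argument to 'self'; the body does not change otherwise.
class TreeWithMethods(Tree):
    def total_branch_length(self):
        return sum(self.branch_lengths.values())


method_result = TreeWithMethods({"A": 1.5, "B": 2.0, "AB": 0.5}).total_branch_length()

# The monkeypatching variant mentioned above: attach the unchanged function
# to the existing class. It works, but hides where the method comes from.
Tree.total_branch_length = total_branch_length
patched_result = t.total_branch_length()

print(standalone_result, method_result, patched_result)
```

Callers written against the standalone function keep working after the move, since total_branch_length(t) and t.total_branch_length() compute the same thing.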
Cheers, Eric From chapmanb at 50mail.com Mon Jul 13 21:39:05 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Jul 2009 17:39:05 -0400 Subject: [Biopython-dev] Bio.Tree layout (Was: BioGeography update) In-Reply-To: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> References: <3f6baf360907130912x4e13e9b1j2fd9506520e69a3a@mail.gmail.com> Message-ID: <20090713213905.GO17086@sobchak.mgh.harvard.edu> Hi Eric; > Hilmar Lapp just pointed me to the BioSQL PhyloDB extension: > http://biosql.org/wiki/Extensions > > Should this schema be the basis of a Bio.Tree.BaseTree module? Yes, that sounds perfect. PhyloDB has been kicked around quite a bit and will be a good base. Great idea. If you want someone to talk to in real life at UGa, Jamie Estill worked on PhyloDB during GSoC a couple of years back: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Command_Line_Topological_Query_Application_for_BioSQL http://jestill.myweb.uga.edu/ He's crazy smart and a nice guy, and was in my lab when I was down there. He's a great person to know. > Here's the file layout I'm picturing: > > Bio/Tree/ > BaseTree.py -- everything else derives from these classes > PhyloXMLTree.py -- already on github > NexusTree.py -- if necessary > > The class structure I'm working on right now looks like: > > # In BaseTree -- currently empty classes, pending Nexus integration > class TreeElement(object) > class TreeNode(TreeElement) > > # In PhyloXMLTree > class PhyloElement(BaseTree.TreeElement) > class Clade(PhyloElement, BaseTree.TreeNode) > class ...(PhyloElement) -- all other phyloXML classes > > Rather than treat BaseTree as the intersection of all the other Tree > representations that rely on it, we could use PhyloDB as the reference > point. What do you think? Should we come back to this in a week or two? 
I think PhyloDB is the right starting point, and then the implementations in Newick, lagrange and PyCogene and elsewhere are good references for the operations that people will want to do on the tree. I don't see any reason to wait on this; I'm excited about the generic tree representation and bringing these things together.
Brad From matzke at berkeley.edu Mon Jul 13 21:40:15 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 13 Jul 2009 14:40:15 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] BioGeography update In-Reply-To: <20090710120734.GD17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <3f6baf360907091246o61fae0abn34f8cfac864a4bb1@mail.gmail.com> <320fb6e00907091453j4e114cbegb389f8df3517ba32@mail.gmail.com> <20090710120734.GD17086@sobchak.mgh.harvard.edu> Message-ID: <4A5BA9BF.2080605@berkeley.edu> Brad Chapman wrote: > Also agreed. We should get Bio.Nexus updated enough so that is can > handle Nick's problem files, and from there apply a wrapper to push > Nexus trees into a generic tree compatible with PhyloXML. This will > force us to be general about the Tree implementation, but save some > re-writing and maintain back-compatibility. Once the generic tree > is hammered out and everyone is happy, then we can think about > migrating Nexus to it. Seconding Peter's comments, this is probably > another big job. > > So, in summary, the major deliverables are: > > - Generic tree representation plus a TreeIO structure > - PhyloXML parser that uses this tree directly > - Nexus parser that can handle problem files and parse into the > generic tree. This will let us drop the lagrange duplication from > Nick's code. > > Sounds like you have this well worked out, > Brad Whoops I missed a few of these biopython-dev messages before, I have different filters shuttling things different places depending on the Subj. line. Eric filled me in. 
Here were some cases where the node labels blocked Bio.Nexus.Trees: ================= from Bio.Nexus import Trees # This works ts0 = "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;" to0 = Trees.Tree(ts0) print to0 # Gymnosperms tree with node labels; doesn't work ts1a = '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Gin kgo:275.000000)gymnosperm:75.000000;' to1a = Trees.Tree(ts1a) # Just Taxaceae; doesn't work ts1b = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;' to1b = Trees.Tree(ts1b) # Just Taxaceae; this works; node labels deleted ts1c = '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;' to1c = Trees.Tree(ts1c) # This doesn't work (from bug report) ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, 
t1:0.130208)F:0.0318288)D:0.0273876);" to2 = Trees.Tree(ts2) ================= -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Mon Jul 13 22:02:24 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 13 Jul 2009 15:02:24 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <4A5BAEF0.9050504@berkeley.edu> Just updating one chunk of part I of the previous long message: Nick Matzke wrote: > > > I. Tree Class Options > > It sounds like we have 3 options being discussed: > > 1. making Bio.PhyloXML.Tree the super-duper tree class > 2. improving Bio.Nexus.Trees > 3. including the Lagrange tree class or suitably licensed/inspired > version thereof. > > (Or there is #4, some combination) > The last consensus we reached on Biopython-dev was to create two new > modules, Bio.Tree and Bio.TreeIO, like so: > > 1. Extract a very basic Tree and Node class, looking at the intersection > of the PhyloXML and Nexus class hierarchies, and put the result in > Bio.Tree.BaseTree. I started on this today: > http://github.com/etal/biopython/blob/phyloxml/Bio/Tree/BaseTree.py > > (It doesn't do anything yet besides set up a class heirarchy that we can > use for generalizing existing code.) > > 2. Write wrappers for the existing PhyloXML and Nexus I/O functions. 
I'm > putting that here: > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/__init__.py > > Again, it's only useful for PhyloXML parsing right now. Eventually we > can connect Bio.Nexus to these two modules, but that's well outside the > scope of my GSoC project. It sounds like, for my immediate purposes, Bio.Nexus.Trees is the solution for now, so I will reorganize my code accordingly. If/when Bio.Nexus.Trees accepts node labels I will remove the function that strips out node labels. Also, I have not forgotten previous comments from Brad et al. about bringing the other code up to spec. So I will update the BioGeography schedule and the overall organization I hope to have at the end (with classes/methods etc., instead of just a list-o-functions, which is how my original schedule was explicitly laid out), and post an update when done. Cheers! Nick > > > > > > II. My Original Problem, Which is Probably Quite Small Really > > I think I kind of unintentionally kicked all of this off because I > couldn't get Bio.Nexus.Trees to read what I considered pretty standard > Newick files back when I was originally exploring this in the spring. > Initially for my own scripts I used another newick parser & tree class I > found online (Mailund's IIRC), then discovered a superior one in > Lagrange and started using that. Thus in GSoC it was simplest to begin > by importing the Lagrange parser, but that led to legitimate concerns > about duplication/licensing etc. > > Reviewing my original issues from the spring, really the only problem I > found with Bio.Nexus.Trees was with node labels, i.e. when an internal > node is given e.g. a clade name, in addition to a branch length. This is a > standard output on a great many newick files in my experience, which > seem to be correctly read by just about all the other programs I use > (Mesquite, Dendroscope, etc.) so I impulsively abandoned Bio.Nexus.Trees > at the time when I couldn't get it to work. > > > > > > III. 
Bug Report > > I did file a bug report back in March. This is outstanding as far as I > know. > > Bio.Nexus.Trees newick parser does not support internal node labels > http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > > > > > > > > IV. Problem Examples > > > Below I have accumulated some cases that work/don't work: > > > ================= > from Bio.Nexus import Trees > > # This works > > ts0 = > "(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, > Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;" > > to0 = Trees.Tree(ts0) > print to0 > > > > # Gymnosperms tree with node labels; doesn't work > ts1a = > '(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000;' > > to1a = Trees.Tree(ts1a) > > > > > # Just Taxaceae; doesn't work > ts1b = > '(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;' > > to1b = Trees.Tree(ts1b) > > # Just Taxaceae; this works; node labels deleted > ts1c = > 
'(Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)25.000000)90.000000;' > > to1c = Trees.Tree(ts1c) > > > > > # This doesn't work (from bug report) > ts2 = "(((t9:0.385832, (t8:0.445135,t4:0.41401)C:0.024032)B:0.041436, > t6:0.392496)A:0.0291131, t2:0.497673, ((t0:0.301171, > t7:0.482152)E:0.0268148, ((t5:0.0984167,t3:0.488578)G:0.0349662, > t1:0.130208)F:0.0318288)D:0.0273876);" > to2 = Trees.Tree(ts2) > ================= > > > > > But if I import the Lagrange tree class/parser, all of these work and my > life is happy: > > ================= > import lagrange_newick > # This is lagrange's newick.py file, renamed to lagrange_newick.py > > lt1 = lagrange_newick.parse(ts1) > lt1a = lagrange_newick.parse(ts1a) > lt1b = lagrange_newick.parse(ts1b) > lt2 = lagrange_newick.parse(ts2) > ================= > > > > > > > V. The Functions I Need From a Tree Class > > Basically my method of late has been to use the Lagrange Tree class, and > then write my own standalone functions to do various necessary basic > processing of trees. E.g.: > > * subset tree based on list of taxa; update root and any now-redundant > internal nodes left with 0 or 1 descendents > > * extract a subtree to a new tree (cloned nodes so they don't refer to > the old nodes, important in doing passes through tree) > > * read/write to Newick > > * print tree to screen in a readable format > > * get distance (total branch length between 2 nodes) > > * calculate many measures that can be done from the distances (total > all-to-all distance matrix, tree length, mean phylogenetic distance, > mean nearest-neighbor phylogenetic distance) > > * several others I don't remember off the top of my head > > > In my list-o-functions approach, I would just write functions for the > tree class I was using, but I think it has been made clear that really > these functions should be methods of a certain Tree class. Which > requires a decision about what Tree class to use. > > > > > > VI. 
What the current classes do. > > I had never looked seriously at Bio.Nexus.Trees since I was just > crashing it, but it actually looks like it does a bunch: > > Bio.Nexus.Trees > =========== > type(to1c) > > > to1c > > > dir(to1c) > > ['_Tree__values_are_support', > '__doc__', > '__init__', > '__module__', > '__str__', > '_add_subtree', > '_get_id', > '_get_values', > '_parse', > '_walk', > 'add', > 'all_ids', > 'branchlength2support', > 'chain', > 'collapse', > 'collapse_genera', > 'common_ancestor', > 'convert_absolute_support', > 'count_terminals', > 'dataclass', > 'display', > 'distance', > 'get_taxa', > 'get_terminals', > 'has_support', > 'id', > 'is_bifurcating', > 'is_compatible', > 'is_identical', > 'is_internal', > 'is_monophyletic', > 'is_parent_of', > 'is_preterminal', > 'is_terminal', > 'kill', > 'link', > 'max_support', > 'merge_with_support', > 'name', > 'node', > 'prune', > 'randomize', > 'root', > 'root_with_outgroup', > 'rooted', > 'search_taxon', > 'set_subtree', > 'split', > 'sum_branchlength', > 'to_string', > 'trace', > 'unlink', > 'unroot', > 'weight'] > > > # Node methods: > nd = to1c.node(1) > > nd > > > > type(nd) > > > dir(nd) > > ['__doc__', > '__init__', > '__module__', > 'add_succ', > 'data', > 'get_data', > 'get_id', > 'get_prev', > 'get_succ', > 'id', > 'prev', > 'remove_succ', > 'set_data', > 'set_id', > 'set_prev', > 'set_succ', > 'succ'] > > > # Node data: > ndd = nd.get_data() > > dir(ndd) > > ['__doc__', > '__init__', > '__module__', > 'branchlength', > 'comment', > 'support', > 'taxon'] > =========== > > > > > > > > Lagrange Tree Class: > (really class Node I guess, and the tree is reference by the root Node) > > ============= > type(lt1b) > > > lt1b > > > dir(lt1b) > > ['__doc__', > '__init__', > '__module__', > 'add_child', > 'children', > 'data', > 'descendants', > 'excluded_dists', > 'find_descendant', > 'graft', > 'isroot', > 'istip', > 'iternodes', > 'label', > 'labelset_nodemap', > 'leaf_distances', > 'leaves', > 
'length', > 'mrca', > 'nchildren', > 'order_subtrees_by_size', > 'parent', > 'prune', > 'remove_child', > 'rootpath', > 'subtree_mapping', > 'ultrametricize_dumbly'] > ============= > > > > > Bio.PhyloXML.Tree > ============= > [not sure...perhaps someone could contribute the list of > methods/intended methods] > ============= > > > > > VII. I am Leaning Towards Bio.Nexus.Trees > > Based on current functionality and integration with BioPython, and what > can be done in the short term, it looks to me like the best option is to > mod the Bio.Nexus.Trees module, inspired by the Lagrange Node class as > necessary. However if e.g. PhyloXML is working well enough that I can > use that, that is an option. > > > > > > VIII. What I should do next > > Given what I now know, I probably should have just written a little > function to strip node labels out of my Newick trees, and done > everything based on the Bio.Nexus.Trees class. I could still do this > and continue on my merry way without too much trouble. > > But given that my tree-based functions should probably be methods of > some class...here are the questions I have: > > * Should I muck with Bio.Nexus.Trees and try to fix the node labels > issue? My instinct was not to mess with other people's stuff, but that > may be a poor instinct... > > * Should I implement my tree-based functions methods as methods of the > Bio.Nexus.Trees class? > > * Should I delay on this whole issue while it is being discussed, and go > back to issues more localized to my GSoC project, i.e. making my GBIF > functions into methods of a GBIF records class? > > > Thanks for reading! And sorry if this was more confusing than it had to > be, I am definitely learning as I go here. > > Cheers, > Nick > > > > > > > > >> >> It would be nice to design this modularly -- with mixin classes for >> related add-on functionality -- as much as possible. This would >> allow lighter weight implementations in the future if that were >> desired. 
>> >>> The benefit of letting the tree object structures diverge is >>> procrastination >>> -- we could reconcile the two modules after GSoC is over, with stable >>> features and test suites in place. But I could justifiably focus on >>> integration for the remaining weeks if that's best for Biopython, since >>> otherwise I'd probably be reimplementing a number of features already >>> present in other modules. >> >> My vote is for the integration work. Refactoring is hard work and >> best done early. It is easier to add functionality to a fully integrated >> PhyloXML parser in the future. >> >>> I bet this could be done without different objects. Bio.PhyloXML.Tree >>> could >>> be moved to Bio.Tree or Bio.Tree.Elements; the base class >>> PhyloElement could >>> be renamed to TreeElement; and the Nexus and Newick parsers could reuse >>> PhyloXML's Phylogeny and Clade elements, where Clade merges with the >>> existing Node class(es). Even Clade by itself might be enough. For >>> organizational purposes, format-specific tree elements could move to >>> their >>> own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some >>> multiple-inheritance tricks could be used to smooth things over. >> >> Yes, this sounds exactly right. Great stuff. >> >>> (I know nothing >>> about NeXML; should we keep an eye on that too? Glance at the homepage I >>> don't see much about complex annotation types, which is probably good >>> if we >>> want to fit that format into this framework eventually.) >> >> PhyloXML plus Nexus/Newick is probably enough to stay reasonably >> general and keep our sanity. NeXML support would be great but >> practically is an additional project. The refactoring you've described >> is a good chunk to run with. >> >> Brad >> > 
From hlapp at gmx.net Tue Jul 14 07:41:23 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 14 Jul 2009 08:41:23 +0100 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5B7E42.40106@berkeley.edu> References: <4A4141AD.6070605@berkeley.edu> <4A41AFFE.1060400@berkeley.edu> <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> Message-ID: <8D7EC898-7AAF-4140-B41C-4BB1F424150D@gmx.net> On Jul 13, 2009, at 7:34 PM, Nick Matzke wrote: > * Should I muck with Bio.Nexus.Trees and try to fix the node labels > issue? My instinct was not to mess with other people's stuff, but > that may be a poor instinct... Just my $0.02 - messing with other people's stuff is an inherent, and not infrequent, activity in distributed open-source development. I would in fact be rather merciless in doing so. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Tue Jul 14 12:24:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Jul 2009 08:24:35 -0400 Subject: [Biopython-dev] [Bug 2788] Bio.Nexus.Trees newick parser does not support internal node labels In-Reply-To: Message-ID: <200907141224.n6ECOZ9X014789@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 ------- Comment #3 from chapmanb at 50mail.com 2009-07-14 08:24 EST ------- Created an attachment (id=1342) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1342&action=view) Fix for internal node taxon labels Includes a fix and test cases for internal nodes labeled with taxon information. Please test this out on some files of interest and report any additional problem cases. I'd like to get a few more eyes on it before checking it in. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
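[Editor's note] The workaround Nick describes earlier in the thread (stripping internal node labels from a Newick string before handing it to Bio.Nexus.Trees) could be sketched roughly as below. This is only an illustration, not the patch attached to bug 2788 and not Nick's actual function: the name strip_internal_labels and the keep_numeric option are assumptions made here, and quoted labels containing special characters are not handled. Unlike Nick's hand-edited ts1c example, this version keeps the ":" in front of the branch length.

```python
import re

# Hypothetical helper (not part of Biopython): matches whatever follows a
# closing parenthesis up to the next structural Newick character.
_LABEL_AFTER_CLADE = re.compile(r"\)([^():,;\[\]]+)")


def strip_internal_labels(newick, keep_numeric=True):
    """Remove internal node labels from a Newick string.

    Labels that parse as numbers are kept by default, on the assumption
    that they are support values rather than clade names.
    """
    def _replace(match):
        label = match.group(1).strip()
        if keep_numeric:
            try:
                float(label)
            except ValueError:
                return ")"          # a name such as "Taxaceae": drop it
            return ")" + label      # looks numeric: leave it in place
        return ")"

    return _LABEL_AFTER_CLADE.sub(_replace, newick)


ts1b = ("(Cephalotaxus:125.000000,(Taxus:100.000000,"
        "Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000;")
print(strip_internal_labels(ts1b))
```

The cleaned string could then be passed to Trees.Tree() as in the examples quoted above; whether the result parses still depends on the Biopython version in use.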
From chapmanb at 50mail.com Tue Jul 14 12:35:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:35:34 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5BAEF0.9050504@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> Message-ID: <20090714123534.GQ17086@sobchak.mgh.harvard.edu> Hi Nick; Thanks for the comprehensive update. It sounds like your discussion with Eric resolved most of the questions about the tree representation. It's great to see y'all converging on this. > It sounds like for my immediate purposes, Bio.Nexus.Trees is the > solution for now, I will reorganize my code accordingly based on this. > If/when Bio.Nexus.Trees accepts node labels I will remove a function > stripping out node labels. Also I have not forgotten previous comments > from Brad et al. about bringing the other code up to specs. So I will > update the BioGeography schedule and overall organization I hope to have > at the end (with classes/methods etc., instead of just a > list-o-functions, which is how my original schedule was explicitly laid > out), and post an update when done. Agreed, and seconding Hilmar that the best thing about open source code is having others looking at your code. Conversely, feel free to dig in and fix current code where it is holding you up. To remove this blocking issue on Nexus and get us rolling again, I put together an initial fix. 
You can grab the patch from: http://bugzilla.open-bio.org/show_bug.cgi?id=2788 Let us know if this works for your files of interest. If this clears up the Nexus issue, it would be great to see the revised schedule incorporating the refactoring. Sounds like we are moving in the right direction. Good stuff. Thanks, Brad From matzke at berkeley.edu Tue Jul 14 19:08:56 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 14 Jul 2009 12:08:56 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <20090714123534.GQ17086@sobchak.mgh.harvard.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> Message-ID: <4A5CD7C8.70009@berkeley.edu> Thanks for the fix!!! A big help. I am currently organizing my functions into several classes and making sure they work, basically the classes look like they will be something like: ========== GbifXml -- for processing GBIF XML results (all of the functions for searching/extracting stuff from xmltree structures) TreeSum -- for processing trees & getting summary statistics etc. Ranges -- Geographic range of a species (collection of points, results of classification of those points into regions), GIS-like functions for processing them Points -- geographic locations of individual collected specimens ========== Brad Chapman wrote: > Hi Nick; > Thanks for the comprehensive update. It sounds like your discussion > with Eric resolved most of the questions about the tree > representation. It's great to see y'all converging on this. 
> >> It sounds like for my immediate purposes, Bio.Nexus.Trees is the >> solution for now, I will reorganize my code accordingly based on this. >> If/when Bio.Nexus.Trees accepts node labels I will remove a function >> stripping out node labels. Also I have not forgotten previous comments >> from Brad et al. about bringing the other code up to specs. So I will >> update the BioGeography schedule and overall organization I hope to have >> at the end (with classes/methods etc., instead of just a >> list-o-functions, which is how my original schedule was explicitly laid >> out), and post an update when done. > > Agreed, and seconding Hilmar that the best thing about open source > code is having others looking at your code. Conversely, feel free to > dig in and fix current code where it is holding you up. To remove > this blocking issue on Nexus and get us rolling again, I > put together an initial fix. You can grab the patch from: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > > Let us know if this works for your files of interest. > > If this clears up the Nexus issue, it would be great to see the > revised schedule incorporating the refactoring. Sounds like we > are moving in the right direction. Good stuff. > > Thanks, > Brad > 
From mjldehoon at yahoo.com Thu Jul 16 08:50:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 16 Jul 2009 01:50:35 -0700 (PDT) Subject: [Biopython-dev] Calculating motif scores Message-ID: <865356.89579.qm@web62406.mail.re1.yahoo.com> Hi everybody, I was looking for a way to calculate the position-weight matrix score for a given sequence. Motif.score_hit(sequence,position,normalized=0,masked=0) in Bio/Motif/_Motif.py does what I need, but it calculates the score at only one position. For speed reasons, I am looking for a function that can calculate the scores at all positions in a sequence. Something like score(pwm, sequence) returning a Numerical Python array of length len(sequence) - len(pwm) + 1, with the "score" function implemented in a C extension. Perhaps the position-weight matrix should be its own class, with "score" as one of its methods. Is there perhaps some other function that I can use for this? If not, I can contribute a C extension implementing this functionality. If so, are there any preferences on how this should be integrated with Bio.Motif? 
--Michiel From bartek at rezolwenta.eu.org Thu Jul 16 11:32:34 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Jul 2009 13:32:34 +0200 Subject: [Biopython-dev] Calculating motif scores In-Reply-To: <865356.89579.qm@web62406.mail.re1.yahoo.com> References: <865356.89579.qm@web62406.mail.re1.yahoo.com> Message-ID: <8b34ec180907160432j6647a8e3u20054b2f7781b978@mail.gmail.com> On Thu, Jul 16, 2009 at 10:50 AM, Michiel de Hoon wrote: > > Hi everybody, Hi > > I was looking for a way to calculate the position-weight matrix score for a given sequence. Motif.score_hit(sequence,position,normalized=0,masked=0) in Bio/Motif/_Motif.py does what I need, but it calculates the score at only one position. For speed reasons, I am looking for a function that can calculate the scores at all positions in a sequence. Something like > > score(pwm, sequence) > > returning a Numerical Python array of length len(sequence) - len(pwm) + 1, with the "score" function implemented in a C extension. Perhaps the position-weight matrix should be its own class, with "score" as one of its methods. > > Is there perhaps some other function that I can use for this? The function you are looking for is called search_pwm: search_pwm(self, sequence, normalized=0, masked=0, threshold=0.0, both=True) a generator function, returning found hits in a given sequence with the pwm score higher than the threshold > If not, I can contribute a C extension implementing this functionality. If so, are there any preferences on how this should be integrated with Bio.Motif? As you can see, the current function is a generator rather than one returning a full array, because of the memory issues with searching large sequences for a few cases of a good motif. If you set the threshold to (-inf) you should get the results for all positions. Nonetheless, if you have a function in C doing just that, we could incorporate it into Biopython, for fast exhaustive searches on shorter sequences. 
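[Editor's note] The score(pwm, sequence) function Michiel asks for, returning one score per window in an array of length len(sequence) - len(pwm) + 1, can be sketched in plain NumPy as below. This is not Bio.Motif code and not the proposed C extension; the function name score_all_positions, the (motif_length, 4) matrix layout, the A/C/G/T column order and the restriction to unambiguous letters are all assumptions made for the illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order of the matrix


def score_all_positions(pwm, sequence):
    """Return the PWM score of every window of len(pwm) in sequence.

    pwm is assumed to be array-like with shape (motif_length, 4), holding
    one (e.g. log-odds) value per position and letter.
    """
    pwm = np.asarray(pwm, dtype=float)
    motif_length = pwm.shape[0]

    # Translate letters into column indices; -1 marks unsupported letters.
    lookup = np.full(256, -1, dtype=np.intp)
    for column, letter in enumerate(ALPHABET):
        lookup[ord(letter)] = column
    raw = np.frombuffer(sequence.upper().encode("ascii"), dtype=np.uint8)
    codes = lookup[raw]
    if (codes < 0).any():
        raise ValueError("sequence contains letters outside ACGT")

    n_windows = len(codes) - motif_length + 1
    if n_windows <= 0:
        return np.empty(0, dtype=float)

    # Add the contribution of each motif position to all windows at once.
    scores = np.zeros(n_windows, dtype=float)
    for j in range(motif_length):
        scores += pwm[j, codes[j:j + n_windows]]
    return scores


toy_pwm = [[1.0, 0.0, 0.0, 0.0],   # position 0 rewards A
           [0.0, 2.0, 0.0, 0.0]]   # position 1 rewards C
print(score_all_positions(toy_pwm, "ACAC"))
```

The loop runs over motif positions rather than sequence positions, so the per-base work stays inside NumPy. The result is a full array, which is the memory trade-off against a generator that Bartek describes.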
cheers Bartek From mjldehoon at yahoo.com Fri Jul 17 02:25:22 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 16 Jul 2009 19:25:22 -0700 (PDT) Subject: [Biopython-dev] Calculating motif scores Message-ID: <572083.29767.qm@web62405.mail.re1.yahoo.com> > The function you are looking for is called search_pwm: > > search_pwm(self, sequence, normalized=0, masked=0, > threshold=0.0, both=True) > a generator function, returning found hits in a given > sequence with the pwm score higher than the threshold OK, that comes close to what I had in mind. > Nonetheless, if you have a function in c doing just that, > we could incorporate it into biopython, for fast exhaustive > searches on shorter sequences. It doesn't have to be so short. I've been running these calculations for whole mammalian chromosomes. For the human chromosome 1, this would take 247249719 * 4 bytes = 943 MB to store the scores in a Numerical Python array. This can still be comfortably handled by today's computers. I'll upload a C version to CVS so you guys can have a look and try it out. How would you feel about having a separate PWM class in Bio.Motif? Some of the stuff currently in the class Motif is actually more about the PWM by itself; it may make sense to separate that out. --Michiel. --Michiel. From bugzilla-daemon at portal.open-bio.org Fri Jul 17 13:12:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 17 Jul 2009 09:12:09 -0400 Subject: [Biopython-dev] [Bug 2880] New: Two unit tests issues in 1.51b (t-coffee and mafft) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2880 Summary: Two unit tests issues in 1.51b (t-coffee and mafft) Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Unit Tests AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz biopython-1.51b # python setup.py test running test test_Ace ... 
ok test_AlignIO ... ok test_BioSQL ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_BioSQL_SeqIO ... skipping. Enter your settings in Tests/setup_BioSQL.py (not important if you do not plan to use BioSQL). test_CAPS ... ok test_Clustalw ... ok test_Clustalw_tool ... ok test_Cluster ... ok test_CodonTable ... ok test_CodonUsage ... ok test_Compass ... ok test_Crystal ... ok test_Dialign_tool ... skipping. Install DIALIGN2-2 if you want to use the Bio.Align.Applications wrapper. test_DocSQL ... /usr/lib/python2.6/site-packages/MySQLdb/__init__.py:34: DeprecationWarning: the sets module is deprecated from sets import ImmutableSet ok test_Emboss ... ok test_EmbossPrimer ... ok test_Entrez ... ok test_Enzyme ... ok test_FSSP ... ok test_Fasta ... ok test_Fasta2 ... ok test_File ... ok test_GACrossover ... ok test_GAMutation ... ok test_GAOrganism ... ok test_GAQueens ... ok test_GARepair ... ok test_GASelection ... ok test_GFF ... skipping. Environment is not configured for this test (not important if you do not plan to use Bio.GFF). test_GFF2 ... /var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/build/lib.linux-i686-2.6/Bio/Translate.py:23: DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and will be removed in a future release of Biopython. Please use the functions or object methods defined in Bio.Seq instead (described in the tutorial). If you want to continue to use this code, please get in contact with the Biopython developers via the mailing lists to avoid its permanent removal from Biopython. DeprecationWarning) ok test_GenBank ... ok test_GenomeDiagram ... ok test_GraphicsChromosome ... ok test_GraphicsDistribution ... ok test_GraphicsGeneral ... ok test_HMMCasino ... ok test_HMMGeneral ... ok test_HotRand ... ok test_IsoelectricPoint ... ok test_KDTree ... ok test_KEGG ... ok test_KeyWList ... ok test_Location ... ok test_LocationParser ... 
ok test_LogisticRegression ... ok test_MEME ... ok test_Mafft_tool ... FAIL test_MarkovModel ... ok test_Medline ... ok test_Motif ... ok test_Muscle_tool ... ok test_NCBIStandalone ... ok test_NCBITextParser ... ok test_NCBIXML ... ok test_NCBI_qblast ... ok test_NNExclusiveOr ... ok test_NNGene ... ok test_NNGeneral ... ok test_Nexus ... ok test_PDB ... ok test_PDB_unit ... ok test_ParserSupport ... ok test_Pathway ... ok test_Phd ... ok test_PopGen_FDist ... skipping. Install FDist if you want to use Bio.PopGen.FDist. test_PopGen_FDist_nodepend ... ok test_PopGen_GenePop ... ok test_PopGen_SimCoal ... skipping. Install SIMCOAL2 if you want to use Bio.PopGen.SimCoal. test_PopGen_SimCoal_nodepend ... ok test_Prank_tool ... skipping. Install PRANK if you want to use the Bio.Align.Applications wrapper. test_Probcons_tool ... ok test_ProtParam ... ok test_Restriction ... ok test_SCOP_Astral ... ok test_SCOP_Cla ... ok test_SCOP_Des ... ok test_SCOP_Dom ... ok test_SCOP_Hie ... ok test_SCOP_Raf ... ok test_SCOP_Residues ... ok test_SCOP_Scop ... ok test_SVDSuperimposer ... ok test_SeqIO ... ok test_SeqIO_features ... ok test_SeqIO_online ... ok test_SeqUtils ... ok test_Seq_objs ... ok test_SubsMat ... ok test_SwissProt ... ok test_TCoffee_tool ... Probably t-coffee is waiting for some data on its stdin. root 2987 6482 1 11:36 pts/8 00:03:17 python setup.py test root 20102 2987 0 11:45 pts/8 00:00:00 sh -c { t_coffee; } 2>&1 root 20121 20102 0 11:45 pts/8 00:00:00 t_coffee Further note that test_Mafft_tool failed as well. $ mafft checking nawk checking gawk prog=/usr/bin/gawk --------------------------------------------------------------------- MAFFT v6.240 (2007/04/04) Copyright (c) 2006 Kazutaka Katoh NAR 30:3059-3066, NAR 33:511-518 http://align.bmr.kyushu-u.ac.jp/mafft/software/ --------------------------------------------------------------------- Input file? (fasta format) @ Input file? (fasta format) @ quit quit: No such file. Input file? 
(fasta format) @ exit exit: No such file. Input file? (fasta format) @ Input file? (fasta format) @ x x: No such file. Input file? (fasta format) @ ^C $ -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From winda002 at student.otago.ac.nz Sat Jul 18 08:17:02 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Sat, 18 Jul 2009 20:17:02 +1200 Subject: [Biopython-dev] Arlequin sequence files in Bio.Popgen Message-ID: <20090718201702.408455fp9qeau1ha@www.studentmail.otago.ac.nz> Hi again Tiago, Sorry about falling off the grid before I could get back to you about this. Tiago Antão wrote: >> I've uploaded my Arlequin classes and >> functions to a branch on github so you can see them (/Bio/PopGen/Arlequin/ >> on http://github.com/dwinter/biopython/tree/arleq-branch) > > This is great, I took your code and created a new version (nothing > more than also an initial sketch - Feel free to disagree/propose > changes), you can find it here: > http://github.com/tiagoantao/biopython/tree/arlequin Yeah, all the changes you talk about seem sensible to me. > OK, somebody has to do a parser to actually read the files in ;) . > Which is the biggest piece of work to be done. I don't mind doing it > (like in the next month or so - I have some free time now), but you > can do it if you want. In case you decide to do it, I have just one > major point to note: making a parser that is able to read big files > (i.e., some files cannot be parsed into memory in one go). I made this > mistake with the genepop parser and some people do complain about it. > Some things cannot be read as lists to memory but have to be read as > iterators (issue 3 above). > I think a parser that is able to handle lots of files is also good to > help in building a sound model to represent an arlequin record. 
> > As usual we will need test code and documentation for all this ;) This is where I have to admit to not having the time or the skills to do this justice. I'm happy to provide what help I can (especially with the docs and tests, which are probably closer to my skill-set), but I just couldn't promise to do the bulk of the work. There might also be another option: a bit of searching on github found this: http://github.com/ryanraaum/oldowan.arlequin/tree/master Open (MIT license) code for dealing with Arlequin in Python. I'll contact the author and ask if he is interested in contributing (it can't hurt to ask, right?) Cheers, David From bugzilla-daemon at portal.open-bio.org Sat Jul 18 11:37:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 07:37:46 -0400 Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft) In-Reply-To: Message-ID: <200907181137.n6IBbkhD025712@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 07:37 EST ------- Could you tell us the version numbers of T-Coffee and MAFFT you have installed? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Jul 18 16:07:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 12:07:19 -0400 Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft) In-Reply-To: Message-ID: <200907181607.n6IG7JnE000703@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 ------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2009-07-18 12:07 EST ------- $ t_coffee PROGRAM: T-COFFEE (Version_7.54) [cut] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 18 18:21:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 14:21:09 -0400 Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft) In-Reply-To: Message-ID: <200907181821.n6IIL9if004943@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 14:21 EST ------- It looks like the T-Coffee test just hung? Could you try this in the Tests directory to confirm this: python run_tests.py test_TCoffee_tool.py Could you also try just running T-Coffee directly: t_coffee In my machine this prints out some stuff, and finishes. This it what seems to be hanging on your machine... I'm thinking that instead of calling "t_coffee" we could instead use "t_coffee -version" which finishes much more quickly. So could you also try: t_coffee -version I was using T-Coffee 7.81 on Linux and things worked. Even this is out of date, so I tried the latest version too, 7.97, and again it all looks fine. ------------- Regarding MAFFT, what actually fails when you do this?: run_tests.py test_Mafft_tool.py I note you have MAFFT v6.240. 
I have MAFFT v6.626b and the test passes. Again, this is also out of date. Could you try updating your copy of MAFFT? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 18 18:41:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 14:41:38 -0400 Subject: [Biopython-dev] [Bug 2880] Two unit tests issues in 1.51b (t-coffee and mafft) In-Reply-To: Message-ID: <200907181841.n6IIfc03005430@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 ------- Comment #4 from mmokrejs at ribosome.natur.cuni.cz 2009-07-18 14:41 EST ------- # python run_tests.py test_TCoffee_tool.py test_TCoffee_tool ... ok ---------------------------------------------------------------------- Ran 1 test in 3.627 seconds # t_coffee PROGRAM: T-COFFEE (Version_7.54) -full_log S [0] -run_name S [0] -mem_mode S [0] mem -extend D [1] 1 -extend_mode S [0] very_fast_triplet -max_n_pair D [0] 10 -seq_name_for_quadruplet S [0] all -compact S [0] default [cut] # t_coffee -version PROGRAM: T-COFFEE (Version_7.54) # python run_tests.py test_Mafft_tool.py test_Mafft_tool ... FAIL ====================================================================== FAIL: Simple round-trip through app with clustal output ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/Tests/test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) AssertionError ---------------------------------------------------------------------- Ran 1 test in 1.598 seconds FAILED (failures = 1) # python setup.py test [cut] test_Seq_objs ... ok test_SubsMat ... 
ok test_SwissProt ... ok test_TCoffee_tool ... ok test_UniGene ... ok test_UniGene_obsolete ... ok test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_align ... ok test_geo ... ok test_interpro ... ok test_kNN ... ok test_lowess ... ok test_pairwise2 ... ok test_prodoc ... ok test_property_manager ... ok test_prosite ... ok test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. test_seq ... ok test_translate ... ok test_trie ... ok test_triefind ... ok Bio.Seq docstring test ... ok Bio.SeqRecord docstring test ... ok Bio.SeqIO docstring test ... ok Bio.SeqIO.QualityIO docstring test ... ok Bio.SeqIO.AceIO docstring test ... ok Bio.SeqUtils docstring test ... ok Bio.Align.Generic docstring test ... ok Bio.AlignIO docstring test ... ok Bio.AlignIO.StockholmIO docstring test ... ok Bio.Application docstring test ... ok Bio.KEGG.Compound docstring test ... ok Bio.KEGG.Enzyme docstring test ... ok Bio.Wise docstring test ... ok Bio.Wise.psw docstring test ... ok Bio.Motif docstring test ... ok Bio.Statistics.lowess docstring test ... ok ====================================================================== FAIL: Simple round-trip through app with clustal output ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) AssertionError ---------------------------------------------------------------------- Ran 123 tests in 342.671 seconds FAILED (failures = 1) # So this time t_coffee test passed, sorry for the noise. I will try to find time next week to upgrade mafft. Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
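The two symptoms discussed in this bug (t_coffee apparently waiting for data on stdin, and the failing startswith check on the MAFFT output) can be illustrated with a small probe. This is only a minimal sketch in modern Python 3, not code from the thread or from the Biopython test suite: `probe_tool` and `has_mafft_clustal_header` are hypothetical helper names, and only the `-version` flag and the header string are taken from the messages above.

```python
import subprocess
import sys

# Header string copied from the failing assertion quoted in this thread.
MAFFT_CLUSTAL_HEADER = "CLUSTAL format alignment by MAFFT"


def probe_tool(cmd, timeout=10):
    """Run cmd with stdin closed and a time limit (hypothetical helper).

    Returns (returncode, combined_output), or None if the program is
    missing or does not finish within `timeout` seconds.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # the tool cannot block waiting for input
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # fold stderr into the captured output
            universal_newlines=True,    # return text rather than bytes
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.returncode, result.stdout


def has_mafft_clustal_header(text):
    """True if text starts with the header the unit test expects."""
    return text.startswith(MAFFT_CLUSTAL_HEADER)


# Assumed usage: probe_tool(["t_coffee", "-version"]) returns None when
# t_coffee is not installed or hangs, instead of blocking the whole test run.
```

Closing stdin and setting a timeout means a tool that prompts for input (as the old MAFFT does above) is reported as unavailable rather than stalling the run.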
From bugzilla-daemon at portal.open-bio.org Sat Jul 18 19:35:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Jul 2009 15:35:32 -0400 Subject: [Biopython-dev] [Bug 2880] test_Mafft_tool.py unit test failure In-Reply-To: Message-ID: <200907181935.n6IJZWMB007312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Two unit tests issues in |test_Mafft_tool.py unit test |1.51b (t-coffee and mafft) |failure ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-18 15:35 EST ------- (In reply to comment #4) > # python run_tests.py test_TCoffee_tool.py > test_TCoffee_tool ... ok > ---------------------------------------------------------------------- > Ran 1 test in 3.627 seconds > > # t_coffee > > PROGRAM: T-COFFEE (Version_7.54) > -full_log S [0] > ... > [cut] > # t_coffee -version > PROGRAM: T-COFFEE (Version_7.54) OK - that all looks as I would hope. > # python run_tests.py test_Mafft_tool.py > test_Mafft_tool ... FAIL > ====================================================================== > FAIL: Simple round-trip through app with clustal output > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/var/tmp/portage/sci-biology/biopython-1.51b/work/biopython-1.51b/Tests/test_Mafft_tool.py", > line 78, in test_Mafft_with_Clustalw_output > self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) > AssertionError > > ---------------------------------------------------------------------- > Ran 1 test in 1.598 seconds > > FAILED (failures = 1) > # python setup.py test > [cut] > test_Seq_objs ... ok > test_SubsMat ... ok > test_SwissProt ... ok > test_TCoffee_tool ... ok > test_UniGene ... ok > test_UniGene_obsolete ... 
ok > test_Wise ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. > test_align ... ok > test_geo ... ok > test_interpro ... ok > test_kNN ... ok > test_lowess ... ok > test_pairwise2 ... ok > test_prodoc ... ok > test_property_manager ... ok > test_prosite ... ok > test_psw ... skipping. Install Wise2 (dnal) if you want to use Bio.Wise. > test_seq ... ok > test_translate ... ok > test_trie ... ok > test_triefind ... ok > Bio.Seq docstring test ... ok > Bio.SeqRecord docstring test ... ok > Bio.SeqIO docstring test ... ok > Bio.SeqIO.QualityIO docstring test ... ok > Bio.SeqIO.AceIO docstring test ... ok > Bio.SeqUtils docstring test ... ok > Bio.Align.Generic docstring test ... ok > Bio.AlignIO docstring test ... ok > Bio.AlignIO.StockholmIO docstring test ... ok > Bio.Application docstring test ... ok > Bio.KEGG.Compound docstring test ... ok > Bio.KEGG.Enzyme docstring test ... ok > Bio.Wise docstring test ... ok > Bio.Wise.psw docstring test ... ok > Bio.Motif docstring test ... ok > Bio.Statistics.lowess docstring test ... ok > ====================================================================== > FAIL: Simple round-trip through app with clustal output > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Mafft_tool.py", line 78, in test_Mafft_with_Clustalw_output > self.assert_(stdout.read().startswith("CLUSTAL format alignment by MAFFT")) > AssertionError > > ---------------------------------------------------------------------- > Ran 123 tests in 342.671 seconds > > FAILED (failures = 1) > # > > So this time t_coffee test passed, sorry for the noise. I will try > to find time next week to upgrade mafft. Thanks. I've retitled the bug to focus on the MAFFT issue. This may well be a problem with your old version of MAFFT - I know, for example, that the FASTA output is broken on some versions of MAFFT. 
Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Jul 20 14:57:15 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 20 Jul 2009 10:57:15 -0400 Subject: [Biopython-dev] GSoC Weekly Update 9: PhyloXML for Biopython Message-ID: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> Hi all, Previously (July 13-17) I: - Implemented "Collapse Whitespace Policy" -- the spec mentions this in the glossary but doesn't appear to say where it should be used, so I applied it willy-nilly. (Mainly on 'name' and 'desc'/'description' node text.) - Made Writer use the normal namespace prefixes -- for human-readability, though it technically doesn't matter for parsing. - Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it failed, probably due to element ordering. - Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are all under Bio.Tree now, while TreeIO contains just a thin wrapper for Parser and Writer (still under Bio.PhyloXML). Three mostly empty base classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit from them. This made it possible to generalize the Utils.pretty_print function and move it to Bio.Tree.Utils. The other "utility", for dumping xml tag names, was added to PhyloXML's Parser near the other xml-related helpers. - Checked that 'other' objects won't belong to the phyloXML namespace. 
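For readers unfamiliar with the term, "collapse whitespace" can be sketched as below. This assumes the XML Schema meaning of "collapse" (tabs, line feeds and carriage returns become spaces, runs of spaces shrink to a single space, and leading and trailing spaces are dropped). The function name is hypothetical, and this is not necessarily how the PhyloXML parser in the branch implements it.

```python
import re

# The four characters XML Schema treats as whitespace: space, tab, LF, CR.
_XML_WHITESPACE_RUN = re.compile(r"[ \t\n\r]+")


def collapse_whitespace(text):
    """Collapse whitespace in node text (hypothetical helper).

    Every run of XML whitespace becomes one space, then leading and
    trailing spaces are removed.
    """
    return _XML_WHITESPACE_RUN.sub(" ", text).strip(" ")
```

Applied to 'name' or 'description' text that was pretty-printed across several lines, this yields a single-line string with single spaces between words.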
This week (July 20-24) I will: Extend the core to the rest of the spec: - Adding unit tests and classes to support the remaining (non-core) phyloXML elements - Use the schema document to validate the input file -- or at least, make Writer use the correct sub-node ordering - Take a stab at phyloXML 1.10 support Work on documentation: - Address remaining comments from code/doc review - Revisit docstrings for all classes, functions, methods; consider enabling epydoc formatting Also: - Improve the SeqRecord conversion - Warnings: show the offending line at the previous level in the stack Remarks: I haven't done anything specifically for Nexus integration, though I'm looking at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree classes. I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes will mirror PhyloDB tables, and any methods from PhyloXML trees that only rely on those attributes will be moved to the base classes. Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML, and most of the base-class methods will operate on that attribute. Options: 1. Create two properties on PhyloXML's Clade and Phylogeny classes, called 'clade' and 'clades', that simply access the object's 'node' attribute. 2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The I/O functions currently treat tag_name<->attribute as the general case, with exceptions like pluralization scattered in, so making this change will be unpretty but not horrible. Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jul 20 17:57:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 18:57:59 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? 
Message-ID: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Hi all at Biopython (and EMBOSS-dev CC'd), Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. As I mentioned on the Biopython mailing list a week ago, in particular I'd like to make sure we agree on the various FASTQ variants. I'm waiting for EMBOSS to update the documentation on their website, but as I recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test this afternoon, they are using: fastq - FASTQ where the qualities are ignored (useful for input?) fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33 fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64 fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64 I was expecting "fastq" to be an EMBOSS input only format given how I had understood this to be interpreted (ignore the qualities). This makes sense for tasks like FASTQ to FASTQ where the qualities can be ignored. I was however surprised that using "fastq" as an output format in EMBOSS seqret gives quality strings of double quote characters. This ASCII character (34) is outside the range used in the Solexa and Illumina 1.3+ FASTQ variants. If interpreted as a Sanger style FASTQ file this means a PHRED quality of one (meaning about random, a sensible default). Enough background. The reason for this email was that (subject to confirmation), Biopython's "fastq" matches EMBOSS's "fastq-sanger", so I'd like to consider adding this as an alias in Bio.SeqIO. I resisted adding aliases initially, but we now have "gb" for "genbank" to make working with Entrez a little easier, so there is a precedent. In this case, it will make some of the test_Emboss.py code cleaner if I can just use "fastq-sanger" everywhere and have both Biopython and EMBOSS understand this. 
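The arithmetic behind the double-quote observation can be sketched as follows. This is a minimal illustration only, not Biopython or EMBOSS code: the offsets are the ones listed above, and `decode_qualities` and `phred_to_error_probability` are hypothetical helper names.

```python
# ASCII offsets for the three FASTQ variants named in this message.
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores, offset 33
    "fastq-solexa": 64,    # Solexa scores (not PHRED), offset 64
    "fastq-illumina": 64,  # PHRED scores, offset 64
}


def decode_qualities(quality_string, variant="fastq-sanger"):
    """Subtract the variant's ASCII offset from each quality character.

    Only the offset is removed; no Solexa-to-PHRED conversion is done.
    """
    offset = OFFSETS[variant]
    return [ord(char) - offset for char in quality_string]


def phred_to_error_probability(phred):
    """Error probability implied by a PHRED score: 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10.0)
```

Under the Sanger offset a double quote (ASCII 34) decodes to 1, an error probability of roughly 0.79. Under an offset of 64 the same character would decode to -30, which is why it falls outside the range used by the Solexa and Illumina 1.3+ variants.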
Peter From matzke at berkeley.edu Mon Jul 20 19:13:59 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 20 Jul 2009 12:13:59 -0700 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A5CD7C8.70009@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <20090704201059.GB29677@kunkel> <86B6ADAE-8C12-4F06-B068-77CA5C577FF9@nescent.org> <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> Message-ID: <4A64C1F7.5040503@berkeley.edu> Hi all, here is my weekly update... 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! 2. Code refactoring: this is basically the layout I've got going at the moment. (long outline & function descriptions below) 3. GbifXml is working, my next task is the TreeSum class which requires re-doing the functions which made use of the lagrange tree class. I've built these functions under several different tree classes since January and have gotten pretty good at tree logic so this shouldn't be too hard. 4. Philosophy question: If I build some functions that do something new with an e.g. ElementTree (XML tree) object, should I: (a) make these functions go in a subclass of the class for the original object (thus inheriting the methods of the original class, and basically adding new methods). E.g. basically extending the methods of ElementTree, with a subclass GbifElementTree; or: (b) make a class containing the object as an attribute, with e.g. GbifXml.xmltree containing an ElementTree attribute which then gets passed to the various functions. 
I currently have (b) but the more I think about it, the more (a) makes more sense from a simplicity/usability/maintainability sense. Cheers! Nick ========== Class for accessing GBIF, downloading records, processing them, and extracting information from the xmltree in that class. class GbifXmlError(Exception): pass class GbifXml(): gbifxml is a class for holding and processing xmltrees of GBIF records. def __init__(self, xmltree=None): This is an instantiation class for setting up new objects of this class. def print_xmltree(self): Prints all the elements & subelements of the xmltree to screen (may require fix_ASCII to input file to succeed) def print_subelements(self, element): Takes an element from an XML tree and prints the subelements tag & text, and the within-tag items (key/value or whatnot) def element_items_to_dictionary(self, element_items): If the XML tree element has items encoded in the tag, e.g. key/value or whatever, this function puts them in a python dictionary and returns them. def extract_latlongs(self, element): Create a temporary pseudofile, extract lat longs to it, return results as string. Inspired by: http://www.skymind.com/~ocrow/python_string/ (Method 5: Write to a pseudo file) def extract_latlong_datum(self, element, file_str): Searches an element in an XML tree for lat/long information, and the complete name. Searches recursively, if there are subelements. def extract_taxonconceptkeys_tofile(self, element, outfh): Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete sname. Searches recursively, if there are subelements. Returns file at outfh. def extract_taxonconceptkeys_tolist(self, element, output_list): Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete name. Searches recursively, if there are subelements. Returns list. 
def extract_occurrence_elements(self, element, output_list): Returns a list of the elements, picking elements by TaxonOccurrence; this should return a list of elements equal to the number of hits. def find_to_elements_w_ancs(self, el_tag, anc_el_tag): Burrow into XML to get an element with tag el_tag, return only those el_tags underneath a particular parent element parent_el_tag def create_sub_xmltree(self, element): Create a subset xmltree (to avoid going back to irrelevant parents) def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, match_el_list): Recursively burrows down to find whatever elements with el_tag exist inside a parent_el_tag. def xml_burrow_up(self, element, anc_el_tag, found_anc): Burrow up xml to find anc_el_tag def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): Burrow up from element of interest, until a cousin is found with cousin_el_tag def return_parent_in_xmltree(self, child_to_search_for): Search through an xmltree to get the parent of child_to_search_for def return_parent_in_element(self, potential_parent, child_to_search_for, returned_parent): Search through an XML element to return parent of child_to_search_for def find_1st_matching_element(self, element, el_tag, return_element): Burrow down into the XML tree, retrieve the first element with the matching tag # Functions devoted to accessing/downloading GBIF records def access_gbif(url, params): # Helper function to access various GBIF services # # choose the URL ("url") from here: # http://data.gbif.org/ws/rest/occurrence # # params are a dictionary of key/value pairs # # "_open" is from Bio.Entrez._open, online here: # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open # # Get the handle of results # (looks like e.g.: > ) # (open with results_handle.read() ) def get_hits(params): Get the actual hits that are be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 
1000 records). It will return the LAST non-none instance (in a standard search result there should be only one, anyway). def get_xml_hits(params): Returns hits like get_hits, but returns a parsed XML tree. def get_all_records_by_increment(params, inc, prefix_fn): Download all of the records in stages, store in list of elements. Increments of e.g. 100 to not overload server. def get_record(key): Get a single record, return xmltree for it. def get_numhits(params): Get the number of hits that will be returned by a given search (this allows parsing & gradual downloading of searches larger than e.g. 1000 records). It will return the LAST non-none instance (in a standard search result there should be only one, anyway). def extract_numhits(element): # Search an element of a parsed XML string and find the # number of hits, if it exists. Recursively searches, # if there are subelements. # def xmlstring_to_xmltree(xmlstring): Take the text string returned by GBIF and parse to an XML tree using ElementTree. Requires the intermediate step of saving to a temporary file (required to make ElementTree.parse work, apparently) class TreeSum(): Summary statistics on trees (some of these are now redundant with Nexus.Tree & will be eliminated). def read_ultrametric_Newick(newickstr): Read a Newick file into a tree object (a series of node objects linked to parent and daughter nodes), also reading node ages and node labels if any. def list_leaves(phylo_obj): Print out all of the leaves above a node object def treelength(node): Gets the total branchlength above a given node by recursively adding through tree. def phylodistance(node1, node2): Get the phylogenetic distance (branch length) between two nodes. def get_distance_matrix(phylo_obj): Get a matrix of all of the pairwise distances between the tips of a tree. 
def get_mrca_array(phylo_obj): Get a square list of lists (array) listing the mrca of each pair of leaves (half-diagonal matrix) def subset_tree(phylo_obj, list_to_keep): Given a list of tips and a tree, remove all other tips and resulting redundant nodes to produce a new smaller tree. def prune_single_desc_nodes(node): Follow a tree from the bottom up, pruning any nodes with only one descendent def find_new_root(node): Search up tree from root and make new root at first divergence def make_None_list_array(xdim, ydim): Make a list of lists ("array") with the specified dimensions def get_PD_to_mrca(node, mrca, PD): Add up the phylogenetic distance from a node to the specified ancestor (mrca). Find mrca with find_1st_match. def get_ancestors_list(node, anc_list): Get the list of ancestors of a given node def addup_PD(node, PD): Adds the branchlength of the current node to the total PD measure. def print_tree_outline_format(phylo_obj): Prints the tree out in "outline" format (daughter clades are indented, etc.) def print_Node(node, rank): Prints the node in question, and recursively all daughter nodes, maintaining rank as it goes. class Ranges(): Geographic range of a species (collection of points, results of classification of those points into regions), GIS-like functions for processing them. class Points(): geographic locations of individual collected specimens def readshpfile(fn): def summarize_shapefile(fn, output_option, outfn): def point_inside_polygon(x,y,poly): def shapefile_points_in_poly(pt_records, poly): def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly): ========== Here is a summary of the Nick Matzke wrote: > Thanks for the fix!!! A big help. 
I am currently organizing my > functions into several classes and making sure they work, basically the > classes look like they will be something like: > > ========== > GbifXml -- for processing GBIF XML results (all of the functions for > searching/extracting stuff from xmltree structures) > > TreeSum -- for processing trees & getting summary statistics etc. > > Ranges -- Geographic range of a species (collection of points, results > of classification of those points into regions), GIS-like functions for > processing them > Points -- geographic locations of individual collected specimens > ========== > > > Brad Chapman wrote: >> Hi Nick; >> Thanks for the comprehensive update. It sounds like your discussion >> with Eric resolved most of the questions about the tree >> representation. It's great to see y'all converging on this. >> >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the >>> solution for now, I will reorganize my code accordingly based on >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a >>> function stripping out node labels. Also I have not forgotten >>> previous comments from Brad et al. about bringing the other code up >>> to specs. So I will update the BioGeography schedule and overall >>> organization I hope to have at the end (with classes/methods etc., >>> instead of just a list-o-functions, which is how my original schedule >>> was explicitly laid out), and post an update when done. >> >> Agreed, and seconding Hilmar that the best thing about open source >> code is having others looking at your code. Conversely, feel free to >> dig in and fix current code where it is holding you up. To remove >> this blocking issue on Nexus and get us rolling again, I >> put together an initial fix. You can grab the patch from: >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 >> >> Let us know if this works for your files of interest. 
>> >> If this clears up the Nexus issue, it would be great to see the >> revised schedule incorporating the refactoring. Sounds like we are >> moving in the right direction. Good stuff. >> >> Thanks, >> Brad >> > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Mon Jul 20 19:48:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 20:48:44 +0100 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) Message-ID: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: > > Hi all, here is my weekly update... > > 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! > Cool. 
I haven't tried it personally though ;) Frank and/or Cymon - any comments regarding Brad checking this in? See Bug 2788 for details. I gather you (Nick) are using this on ultrametric Newick trees - could you supply a sensibly sized example to use as a unit test? Initially this can be just to test Brad's patch to Bio.Nexus, but try and pick something you can build documentation examples around in future. Thanks, Peter From matzke at berkeley.edu Mon Jul 20 19:56:18 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 20 Jul 2009 12:56:18 -0700 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> Message-ID: <4A64CBE2.9050605@berkeley.edu> Peter wrote: > On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: >> Hi all, here is my weekly update... >> >> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >> > > Cool. I haven't tried it personally though ;) Frank and/or Cymon - any > comments regarding Brad checking this in? See Bug 2788 for details. > > I gather you (Nick) are using this on ultrametric Newick trees yes - could > you supply a sensibly sized example to use as a unit test? Initially > this can be just to test Brad's patch to Bio.Nexus, but try and pick > something you can build documentation examples around in future. This is probably a reasonable size: a subset of my bigger tree. (Just gymnosperms; times are in millions of years.) 
(((((Cephalotaxus:125.000000,(Taxus:100.000000,Torreya:100.000000)TT1:25.000000)Taxaceae:90.000000,((((((((Calocedrus:85.000000,Platycladus:85.000000)CP:5.000000,(Cupressus:85.000000,Juniperus:85.000000)CJ:5.000000)CJCP:5.000000,Chamaecyparis:95.000000)CCJCP:5.000000,(Thuja:7.870000,Thujopsis:7.870000)TT2:92.13)CJCPTT:30.000000,((Cryptomeria:120.000000,Taxodium:120.000000)CT:5.000000,Glyptostrobus:125.000000)CTG:5.000000)CupCallTax:5.830000,((Metasequoia:125.000000,Sequoia:125.000000)MS:5.000000,Sequoiadendron:130.000000)Sequoioid:5.830000)STCC:49.060001,Taiwania:184.889999)Taw+others:15.110000,Cunninghamia:200.000000)nonSci:15.000000)Tax+nonSci:10.000000,Sciadopitys:225.000000):25.000000,(((Abies:106.000000,Keteleeria:106.000000)AK:54.000000,(Pseudolarix:156.000000,Tsuga:156.000000)NTP:4.000000)NTPAK:24.000000,((Larix:87.000000,Pseudotsuga:87.000000)LP:81.000000,(Picea:155.000000,Pinus:155.000000)PPC:13.000000)Pinoideae:16.000000)Pinaceae:66.000000)Coniferales:25.000000,Ginkgo:275.000000)gymnosperm:75.000000; I am using this as a test case in my own script. I have put some effort into reading up on unittest, but I don't quite get how it all fits together yet. Another important case I will try to come up with is the result of pruning a tree; in my experience it is very easy to mess up the tree and/or branch lengths when pruning. > > Thanks, > > Peter > From fkauff at biologie.uni-kl.de Tue Jul 21 06:32:02 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 21 Jul 2009 08:32:02 +0200 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> Message-ID: <4A6560E2.4030502@biologie.uni-kl.de> Hi all, Peter wrote: > On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: > >> Hi all, here is my weekly update... >> >> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >> >> > > Cool. I haven't tried it personally though ;) Frank and/or Cymon - any > comments regarding Brad checking this in? See Bug 2788 for details. > > Not at all - you're most welcome. Thanks for dealing with it. Frank > I gather you (Nick) are using this on ultrametric Newick trees - could > you supply a sensibly sized example to use as a unit test? Initially > this can be just to test Brad's patch to Bio.Nexus, but try and pick > something you can build documentation examples around in future. 
> > Thanks, > > Peter > > From biopython at maubp.freeserve.co.uk Tue Jul 21 11:32:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 12:32:59 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907210432h26da39b2ka24ceb1194a1be1a@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). This > makes sense for tasks like FASTQ to FASTQ where the qualities can > be ignored. I meant of course, for FASTQ to FASTA conversion the qualities (and how they are encoded, Sanger versus Solexa versus Illumina 1.3+) can be ignored. Peter From chapmanb at 50mail.com Tue Jul 21 12:22:13 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Jul 2009 08:22:13 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A64C1F7.5040503@berkeley.edu> References: <20090707130248.GM17086@sobchak.mgh.harvard.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <20090721122213.GA96870@sobchak.mgh.harvard.edu> Hi Nick; > 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! Sweet. Glad to hear it. > 2. Code refactoring: this is basically the layout I've got going at the > moment. (long outline & function descriptions below) Is this checked in on GitHub? 
I pulled from the Geography branch but didn't get the new code. The organization below looks great and really helps with clarity. One additional suggestion I would make is to prefix functions which are not part of the public API with an underscore (_internal_function). Just from the descriptions, I imagine some of the functions like xml_burrow_up_cousin would not be called directly by users. > 3. GbifXml is working, my next task is the TreeSum class which requires > re-doing the functions which made use of the lagrange tree class. I've > built these functions under several different tree classes since January > and have gotten pretty good at tree logic so this shouldn't be too hard. Great. Have you had a look at Eric's generic Tree proposal, which he was working on this week: http://github.com/etal/biopython/tree/phyloxml/Bio/Tree It would be great to propose general functionality there so it can be rolled into PhyloXML and ultimately Nexus parsing as well. > 4. Philosophy question: If I build some functions that do something new > with an e.g. ElementTree (XML tree) object, should I: > > (a) make these functions go in a subclass of the class for the original > object (thus inheriting the methods of the original class, and basically > adding new methods). E.g. basically extending the methods of > ElementTree, with a subclass GbifElementTree; or: > > (b) make a class containing the object as an attribute, with e.g. > GbifXml.xmltree containing an ElementTree attribute which then gets > passed to the various functions. > > I currently have (b) but the more I think about it, the more (a) makes > more sense from a simplicity/usability/maintainability sense. My vote would be for your (b) option. ElementTree is a pretty tricky interface with overrides for attribute access, so inheriting from it could be a bit tricky and more trouble than it's worth.
If you find yourself mirroring ElementTree functionality, you could always make the tree itself a public attribute and encourage users to call it directly. Brad > > Cheers! > Nick > > ========== > Class for accessing GBIF, downloading records, processing them, and > extracting information from the xmltree in that class. > > class GbifXmlError(Exception): pass > class GbifXml(): > gbifxml is a class for holding and processing xmltrees of GBIF records. > > def __init__(self, xmltree=None): > > This is an instantiation class for setting up new objects of this > class. > > def print_xmltree(self): > > Prints all the elements & subelements of the xmltree to screen (may > require > fix_ASCII to input file to succeed) > > def print_subelements(self, element): > > Takes an element from an XML tree and prints the subelements tag & > text, and > the within-tag items (key/value or whatnot) > > > def element_items_to_dictionary(self, element_items): > > If the XML tree element has items encoded in the tag, e.g. key/value or > whatever, this function puts them in a python dictionary and returns > them. > > > > def extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > > > def extract_latlong_datum(self, element, file_str): > > Searches an element in an XML tree for lat/long information, and the > complete name. Searches recursively, if there are subelements. > > > > def extract_taxonconceptkeys_tofile(self, element, outfh): > > Searches an element in an XML tree for TaxonOccurrence gbifKeys, > and the complete sname. Searches recursively, if there are subelements. > Returns file at outfh. > > > > > def extract_taxonconceptkeys_tolist(self, element, output_list): > > Searches an element in an XML tree for TaxonOccurrence gbifKeys, > and the complete name. Searches recursively, if there are subelements. 
> Returns list. > > > > > > def extract_occurrence_elements(self, element, output_list): > > Returns a list of the elements, picking elements by > TaxonOccurrence; this should > return a list of elements equal to the number of hits. > > > > > def find_to_elements_w_ancs(self, el_tag, anc_el_tag): > > Burrow into XML to get an element with tag el_tag, return only > those el_tags underneath a particular parent element parent_el_tag > > > def create_sub_xmltree(self, element): > > Create a subset xmltree (to avoid going back to irrelevant parents) > > > > def xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, > match_el_list): > > Recursively burrows down to find whatever elements with el_tag > exist inside a parent_el_tag. > > > def xml_burrow_up(self, element, anc_el_tag, found_anc): > > Burrow up xml to find anc_el_tag > > > > def xml_burrow_up_cousin(element, cousin_el_tag, found_cousin): > > Burrow up from element of interest, until a cousin is found with > cousin_el_tag > > > > def return_parent_in_xmltree(self, child_to_search_for): > > Search through an xmltree to get the parent of child_to_search_for > > > > def return_parent_in_element(self, potential_parent, > child_to_search_for, returned_parent): > > Search through an XML element to return parent of child_to_search_for > > > > def find_1st_matching_element(self, element, el_tag, return_element): > > Burrow down into the XML tree, retrieve the first element with the > matching tag > > > > > # Functions devoted to accessing/downloading GBIF records > > def access_gbif(url, params): > > # Helper function to access various GBIF services > # > # choose the URL ("url") from here: > # http://data.gbif.org/ws/rest/occurrence > # > # params are a dictionary of key/value pairs > # > # "_open" is from Bio.Entrez._open, online here: > # http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open > # > # Get the handle of results > # (looks like e.g.: > ) > > # (open with results_handle.read() ) > > 
> def get_hits(params): > > Get the actual hits that are be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > It will return the LAST non-none instance (in a standard search > result there > should be only one, anyway). > > > def get_xml_hits(params): > > Returns hits like get_hits, but returns a parsed XML tree. > > > def get_all_records_by_increment(params, inc, prefix_fn): > > Download all of the records in stages, store in list of elements. > Increments of e.g. 100 to not overload server > > def get_record(key): > > Get a single record, return xmltree for it. > > > def get_numhits(params): > > Get the number of hits that will be returned by a given search > (this allows parsing & gradual downloading of searches larger > than e.g. 1000 records) > It will return the LAST non-none instance (in a standard search > result there > should be only one, anyway). > > def extract_numhits(element): > > # Search an element of a parsed XML string and find the > # number of hits, if it exists. Recursively searches, > # if there are subelements. > # > > def xmlstring_to_xmltree(xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file > (required to make > ElementTree.parse work, apparently) > > > > > class TreeSum() > > Summary statistics on trees (some of these now redundant with > Nexus.Tree & will be eliminated. > > def read_ultrametric_Newick(newickstr): > > Read a Newick file into a tree object (a series of node objects > links to parent and daughter nodes), also reading node ages and node > labels if any. > > > def list_leaves(phylo_obj): > > Print out all of the leaves in above a node object > > > > def treelength(node): > > Gets the total branchlength above a given node by recursively > adding through tree. 
> > > def phylodistance(node1, node2): > > Get the phylogenetic distance (branch length) between two nodes. > > > def get_distance_matrix(phylo_obj): > > Get a matrix of all of the pairwise distances between the tips of a > tree. > > > > def get_mrca_array(phylo_obj): > > Get a square list of lists (array) listing the mrca of each pair of > leaves > (half-diagonal matrix) > > > > def subset_tree(phylo_obj, list_to_keep): > > Given a list of tips and a tree, remove all other tips and > resulting redundant nodes to produce a new smaller tree. > > > def prune_single_desc_nodes(node): > > Follow a tree from the bottom up, pruning any nodes with only one > descendent > > > def find_new_root(node): > > Search up tree from root and make new root at first divergence > > > def make_None_list_array(xdim, ydim): > > Make a list of lists ("array") with the specified dimensions > > > def get_PD_to_mrca(node, mrca, PD): > > Add up the phylogenetic distance from a node to the specified > ancestor (mrca). Find mrca with find_1st_match. > > > > def get_ancestors_list(node, anc_list): > > Get the list of ancestors of a given node > > > > > def addup_PD(node, PD): > > Adds the branchlength of the current node to the total PD measure. > > > def print_tree_outline_format(phylo_obj): > > Prints the tree out in "outline" format (daughter clades are > indented, etc.) > > > def print_Node(node, rank): > > Prints the node in question, and recursively all daughter nodes, > maintaining rank as it goes. > > > > class Ranges(): > > Geographic range of a species (collection of points, results > of classification of those points into regions), GIS-like functions for > processing them. 
> > > class Points(): > > geographic locations of individual collected specimens > > > def readshpfile(fn): > > def summarize_shapefile(fn, output_option, outfn): > > def point_inside_polygon(x,y,poly): > > def shapefile_points_in_poly(pt_records, poly): > > def tablefile_points_in_poly(fh, ycol, xcol, namecol, poly): > > ========== > > > Here is a summary of the > > Nick Matzke wrote: > > Thanks for the fix!!! A big help. I am currently organizing my > > functions into several classes and making sure they work, basically the > > classes look like they will be something like: > > > > ========== > > GbifXml -- for processing GBIF XML results (all of the functions for > > searching/extracting stuff from xmltree structures) > > > > TreeSum -- for processing trees & getting summary statistics etc. > > > > Ranges -- Geographic range of a species (collection of points, results > > of classification of those points into regions), GIS-like functions for > > processing them > > Points -- geographic locations of individual collected specimens > > ========== > > > > > > Brad Chapman wrote: > >> Hi Nick; > >> Thanks for the comprehensive update. It sounds like your discussion > >> with Eric resolved most of the questions about the tree > >> representation. It's great to see y'all converging on this. > >> > >>> It sounds like for my immediate purposes, Bio.Nexus.Trees is the > >>> solution for now, I will reorganize my code accordingly based on > >>> this. If/when Bio.Nexus.Trees accepts node labels I will remove a > >>> function stripping out node labels. Also I have not forgotten > >>> previous comments from Brad et al. about bringing the other code up > >>> to specs. So I will update the BioGeography schedule and overall > >>> organization I hope to have at the end (with classes/methods etc., > >>> instead of just a list-o-functions, which is how my original schedule > >>> was explicitly laid out), and post an update when done. 
> >> > >> Agreed, and seconding Hilmar that the best thing about open source > >> code is having others looking at your code. Conversely, feel free to > >> dig in and fix current code where it is holding you up. To remove > >> this blocking issue on Nexus and get us rolling again, I > >> put together an initial fix. You can grab the patch from: > >> > >> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 > >> > >> Let us know if this works for your files of interest. > >> > >> If this clears up the Nexus issue, it would be great to see the > >> revised schedule incorporating the refactoring. Sounds like we are > >> moving in the right direction. Good stuff. > >> > >> Thanks, > >> Brad > >> > >
From chapmanb at 50mail.com Tue Jul 21 12:40:10 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Jul 2009 08:40:10 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> Message-ID: <20090721124010.GB96870@sobchak.mgh.harvard.edu> Hi Eric; Great stuff this week. I'm happy to see the generalized Tree interface coming together and appreciate you taking the time to look through PhyloDB for future compatibility with that. > - Tried XSD validation on the PhyloXML.Writer output using xmlstarlet -- it > failed, probably due to element ordering. It would be nice to be able to pull off validation. I'm not a big stickler for XSD validation myself but have worked in the past with those who were and know that it can be a point of contention. Being able to cleanly validate will improve perception of the PhyloXML, and specifically the Biopython implementation. Hopefully that'll lead to greater use and adoption. > - Created Bio.Tree and Bio.TreeIO modules. The PhyloXML tree classes are > all under Bio.Tree now, while TreeIO contains just a thin wrapper for > Parser and Writer (still under Bio.PhyloXML). Three mostly empty base > classes live in Bio.Tree.BaseTree and PhyloXML's tree classes now inherit > from them. This looks really nice -- thanks again. Do you think any of the functionality from the Nexus trees class would fit into here and be useful for examining PhyloXML trees? There is a whole ton of stuff there but a few that caught my eye beyond the total_branch_length function you had a skeleton for were: get_terminals, is_identical, common_ancestor, and distance.
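The tree methods named above are easy to picture on a toy structure. What follows is only a sketch: the `Node` class and the functions `get_terminals`, `total_branch_length`, `common_ancestor` and `distance` are hypothetical stand-ins written for illustration, not the Bio.Nexus.Trees or Bio.Tree API. The small tree is the Taxaceae corner of Nick's gymnosperm example.

```python
class Node(object):
    """Hypothetical minimal tree node: a label, a branch length, children."""

    def __init__(self, name=None, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self


def get_terminals(node):
    """Return the leaf nodes below (or at) node, left to right."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(get_terminals(child))
    return leaves


def total_branch_length(node):
    """Sum the branch lengths of node and everything below it."""
    return node.branch_length + sum(
        total_branch_length(child) for child in node.children)


def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def common_ancestor(a, b):
    """Return the most recent common ancestor, or None if there is none."""
    seen = set(id(n) for n in ancestors(a))
    for n in ancestors(b):
        if id(n) in seen:
            return n
    return None


def distance(a, b):
    """Branch-length distance; assumes a and b are in the same tree."""
    mrca = common_ancestor(a, b)
    total = 0.0
    for start in (a, b):
        n = start
        while n is not mrca:
            total += n.branch_length
            n = n.parent
    return total


# (Cephalotaxus:125,(Taxus:100,Torreya:100)TT1:25)Taxaceae:90
taxus = Node("Taxus", 100.0)
torreya = Node("Torreya", 100.0)
tt1 = Node("TT1", 25.0, [taxus, torreya])
cephalotaxus = Node("Cephalotaxus", 125.0)
taxaceae = Node("Taxaceae", 90.0, [cephalotaxus, tt1])
```

Because these functions touch only name, branch length, parent and children, they are the kind of method that could live on a shared base class, whatever the attributes end up being called.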
> I haven't done anything specifically for Nexus integration, though I'm > looking > at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree > classes. > I'm also looking at PhyloDB, the BioSQL extension. Plan: BaseTree classes > will > mirror PhyloDB tables, and any methods from PhyloXML trees that only rely on > those attributes will be moved to the base classes. This sounds fine. If you want to dig into Nexus you are welcome, but certainly it's outside the scope of the proposal. > Attribute naming will be tricky -- the 'node' in Nexus and PhyloDB is called > 'clade' in phyloXML, and most of the base-class methods will operate on that > attribute. Options: > > 1. Create two properties on PhyloXML's Clade and Phylogeny classes, > called > 'clade' and 'clades', that simply access the object's 'node' attribute. > > 2. Break phyloXML's naming convention, and call a 'clade' a 'node'. The > I/O > functions currently treat tag_name<->attribute as the general case, with > exceptions like pluralization scattered in, so making this change will > be > unpretty but not horrible. I like option 1 -- make clade and clades references to the node/nodes attribute. I do prefer the node naming convention, but for the PhyloXML specific classes you should also be able to retrieve things with their clade nomenclature. Brad From cy at cymon.org Tue Jul 21 13:01:59 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 21 Jul 2009 14:01:59 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <20090721124010.GB96870@sobchak.mgh.harvard.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <20090721124010.GB96870@sobchak.mgh.harvard.edu> Message-ID: <7265d4f0907210601s35084ce2u77659ad909ea80fa@mail.gmail.com> Hi Eric, 2009/7/21 Brad Chapman > Hi Eric; ... > Do you think any of the > functionality from the Nexus trees class would fit into here and be > useful for examining PhyloXML trees? 
There is a whole ton of stuff > there but a few that caught my eye beyond the total_branch_length > function you had a skeleton for were: get_terminals, is_identical, > common_ancestor, and distance. > > > I haven't done anything specifically for Nexus integration, though I'm > > looking > > at the Bio.Nexus Tree and Node classes while writing Bio.Tree.BaseTree > > classes. > You might also take a look at p4's tree representation and methods: http://code.google.com/p/p4-phylogenetics/source/browse/trunk/p4/Tree.py / Node.py / Tree_muck.py etc. Cheers, C. -- From biopython at maubp.freeserve.co.uk Tue Jul 21 13:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:05:35 +0100 Subject: [Biopython-dev] [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <4A65AF53.5090105@ebi.ac.uk> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> <4A65AF53.5090105@ebi.ac.uk> Message-ID: <320fb6e00907210605v7415b1b6id043af520c1bb8de@mail.gmail.com> Hi all, I've CC'd the Biopython-dev mailing list as this EMBOSS thread is becoming cross project. On Tue, Jul 21, 2009 at 1:06 PM, Peter Rice wrote: > > Peter wrote: >> Hi, >> >> One of the many things I talked to Peter Rice about in Sweden >> was the Pearson FASTA like output from needle and water (e.g. >> what EMBOSS calls the markx10 output format), and why it >> includes the EMBOSS header and footer lines (which start with >> a # character), which are not present in real FASTA output. >> >> Biopython can parse the pairwise -m 10 output from Bill >> Pearson's FASTA tools, so in theory we (Biopython) should >> be able to parse the markx10 output from EMBOSS needle >> and water. We could probably cope with the extra header >> and footer, but I think it would be best if EMBOSS could >> produce something more closely matching the real FASTA >> output. 
Unfortunately, it appears to be more than just the >> headers which upset our parser - even ignoring them, >> EMBOSS markx10 output still looks rather different to >> (current) FASTA -m 10 output. Was the markx10 output >> mimicking a particular (old) version of the FASTA tools? > > The source code documentation refers to FASTA 3.4 which > may be the last time I took a detailed look at the FASTA > alignment outputs. That might explain it - I've been using FASTA 3.5. > Can you send us some example files so we can check for > the significant differences? Sure. There are half a dozen FASTA -m 10 output files here: http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/ > We plan to install all the bio* projects so it would be helpful > to have a set of biopython parser scripts we can use to test > locally. We can add them to our routine QA tests and flag up > changes as soon as they appear. If you have (the latest) Biopython installed, and periodically run the unit tests (in particular, test_Emboss.py), that would be a good start. Right now I know that unit test works with EMBOSS 4.0.0 and 6.0.1 (which happens to be on two of the machines I use for testing), and mostly works with EMBOSS 6.1.0 (everything except the GenBank regression you were just looking into today). I'm considering extending test_Emboss.py in the future to take advantage of the new features in EMBOSS 6.1.0 onwards such as GFF and FASTQ support, or perhaps having a second test script (which will be conditional on the version of EMBOSS installed). >> Peter R. did say it would be simple to turn off this header and >> footer output, so I thought I would try this myself. It looks like >> this is handled in file ajax/ajalign.c by function alignWriteMark, >> but I don't see a switch to disable the headers and footers. > > You correctly found how to turn off the header. The footer is > reported for anything except pure sequence output. 
> > For the next release I will add attributes to the list of alignment > formats to say whether the header and footer are needed. That > will allow us better control and reporting. > > Meanwhile, we are very happy to standardise the markx* outputs > to make them easier to parse. Biopython is the first project to > report problems with this. There are alternatives - specifying > -aformat and using some other alignment format for all > applications - but we like to conform and will do our best to fit > what parsers expect. > > Also, of course, once we know we are being parsed we will do > our best not to let the output change. This isn't really a problem. Biopython can read EMBOSS's own alignment formats (pairs and simple), so there is little need for us to be able to parse EMBOSS's version of the FASTA output. [Although at the moment we ignore all the header information, if that formatting will be consistent, we could parse it too.] However, at least one person wanted to parse EMBOSS markx10 output strongly enough that he wrote a modified version of our FASTA -m 10 parser. I would rather, however, have EMBOSS revise its output to better match FASTA. See http://bugzilla.open-bio.org/show_bug.cgi?id=2704 Peter C. From bugzilla-daemon at portal.open-bio.org Tue Jul 21 13:25:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 09:25:50 -0400 Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format In-Reply-To: Message-ID: <200907211325.n6LDPouc006005@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2704 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 09:25 EST ------- I've started a conversation with Peter Rice at EMBOSS about making needle and water output more FASTA-like when using the markx10 format (and related FASTA-mimicking output modes).
See: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000583.html Later cross posted to Biopython-dev as well: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006425.html Hopefully the EMBOSS markx10 output will in future be close enough to the FASTA -m 10 output that Biopython will only need a single parser to read both. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Tue Jul 21 14:56:33 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Jul 2009 16:56:33 +0200 Subject: [Biopython-dev] Calculating motif scores In-Reply-To: <572083.29767.qm@web62405.mail.re1.yahoo.com> References: <572083.29767.qm@web62405.mail.re1.yahoo.com> Message-ID: <8b34ec180907210756p3340a097i1042856386242b55@mail.gmail.com> Hi, sorry for the delayed response. Busy time... On Fri, Jul 17, 2009 at 4:25 AM, Michiel de Hoon wrote: > > It doesn't have to be so short. I've been running these calculations for whole mammalian chromosomes. For the human chromosome 1, this would take > 247249719 * 4 bytes = 943 MB to store the scores in a Numerical Python array. This can still be comfortably handled by today's computers. Well, I'm not sure if this is an expected behavior for typical uses for a single function call to allocate that much memory. Especially that most people would be interested in the "hits" which exceed some significance threshold. Nonetheless, there will be cases where the user is interested in all scores for a sequence, even the negative ones. Then it is definitely better to provide him with an array rather than a generator. > > I'll upload a C version to CVS so you guys can have a look and try it out. > I took a brief look. It seems fine to me. I haven't done any testing yet though. I'll try to integrate it into a method of Bio.Motif. 
What do you think about: Motif.scanPWM(self, sequence) ? > How would you feel about having a separate PWM class in Bio.Motif? Some of the stuff currently in the class Motif is actually more > about the PWM by itself; it may make sense to separate that out. Hmm, I think that your question connects directly to a bigger design question which has popped up earlier in the discussion on Bio.Motif suggestions: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005811.html I'm not sure myself whether I would like to have different classes for different motif types: consensus, alignment, regexp, pwm and hmm. I understand, though, that this makes things simpler for people who only use one of those types, so that they don't have to deal with the complications of a motif possibly coming from different sources and behaving (slightly) differently. I still think that it's useful to have a Motif class that can be used in a similar way for different kinds of motifs. As for the PWM being a separate class and used by the motif: I don't know. I'm using Bio.SubsMat.FreqTable to implement the frequency table, so I understand that the new PWM class would be basically a "smarter" FreqTable. I'm not sure whether it solves any problems...
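The trade-off discussed in this thread (returning every score as one array versus yielding only the hits above a threshold) can be sketched in pure Python. This is an illustration only: `score_all`, `iter_hits` and the toy log-odds matrix are made up here and are not the Bio.Motif API or the C implementation mentioned above.

```python
def window_score(pwm, window):
    """Sum the per-position scores of one window (pwm: list of dicts)."""
    return sum(pwm[i][base] for i, base in enumerate(window))


def score_all(pwm, sequence):
    """Return the score at every position; memory grows with the sequence."""
    width = len(pwm)
    return [window_score(pwm, sequence[start:start + width])
            for start in range(len(sequence) - width + 1)]


def iter_hits(pwm, sequence, threshold):
    """Yield (position, score) only for windows scoring >= threshold."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        score = window_score(pwm, sequence[start:start + width])
        if score >= threshold:
            yield start, score


# Toy two-column log-odds matrix (invented values): one dict per position.
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]

all_scores = score_all(toy_pwm, "GATAT")
hits = list(iter_hits(toy_pwm, "GATAT", 2.0))
```

For scale, the memory figure quoted above checks out: 247249719 positions at 4 bytes each is about 943 MB when 1 MB is counted as 2**20 bytes. The generator form avoids that allocation but cannot hand back the sub-threshold scores.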
cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Jul 21 15:21:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 11:21:35 -0400 Subject: [Biopython-dev] [Bug 2882] New: Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2882 Summary: Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py Product: Biopython Version: 1.51b Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: sjcockell at gmail.com When raising a ValueError on finding an unknown key in a SwissProt record, Bio.SwissProt.__init__._read() references the undefined 'keyword' instead of the expected 'key'. Instead of raising a ValueError, a NameError is raised: Traceback (most recent call last): File "goClass.py", line 31, in main('tubulin') File "goClass.py", line 23, in main record = SwissProt.read(handle) File "[...]/biopython-1.51b/build/lib.macosx-10.5-i386-2.5/Bio/SwissProt/__init__.py", line 120, in read record = _read(handle) File "[...]/biopython-1.51b/build/lib.macosx-10.5-i386-2.5/Bio/SwissProt/__init__.py", line 236, in _read raise ValueError("Unknown keyword %s found" % keyword) NameError: global name 'keyword' is not defined Fixed by the following patch file: " 240c240 < raise ValueError("Unknown keyword %s found" % keyword) --- > raise ValueError("Unknown keyword %s found" % key) " Regards Simon -- http://fuzzierlogic.com http://friendfeed.com/sjcockell http://twitter.com/sjcockell
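The failure reported above is the generic case where the error path itself contains a bug. A stripped-down sketch shows why a NameError surfaces instead of the intended ValueError. The function names and the key set here are hypothetical and written for illustration; this is not the Bio.SwissProt source.

```python
KNOWN_KEYS = set(["ID", "AC", "DE"])  # hypothetical subset, for illustration


def check_key_buggy(key):
    """Mimics the reported bug: the message refers to an undefined name."""
    if key not in KNOWN_KEYS:
        # 'keyword' was never defined, so formatting the message fails
        # with NameError before the ValueError can be constructed.
        raise ValueError("Unknown keyword %s found" % keyword)


def check_key_fixed(key):
    """Mimics the patch: format the message with the variable in scope."""
    if key not in KNOWN_KEYS:
        raise ValueError("Unknown keyword %s found" % key)


def error_type(func, key):
    """Return the name of the exception func(key) raises, or None."""
    try:
        func(key)
    except Exception as err:
        return type(err).__name__
    return None
```

Bugs of this kind tend to hide because the branch only runs on unexpected input, so a unit test that feeds in an unknown key is the cheapest way to catch them.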
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 15:22:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 11:22:24 -0400 Subject: [Biopython-dev] [Bug 2882] Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py In-Reply-To: Message-ID: <200907211522.n6LFMO0Z010044@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2882 ------- Comment #1 from sjcockell at gmail.com 2009-07-21 11:22 EST ------- Created an attachment (id=1345) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1345&action=view) Proposed Patch From bugzilla-daemon at portal.open-bio.org Tue Jul 21 15:35:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 11:35:54 -0400 Subject: [Biopython-dev] [Bug 2882] Unnamed variable used to raise Exception in /Bio/SwissProt/__init__.py In-Reply-To: Message-ID: <200907211535.n6LFZs52010429@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2882 biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 11:35 EST ------- Thanks - fixed in CVS (will be on github within the hour). Did you have an example file which triggers this, or did you just spot the error from reading the code? Peter
From eric.talevich at gmail.com Tue Jul 21 16:03:44 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 21 Jul 2009 12:03:44 -0400 Subject: [Biopython-dev] BioGeography update/BioPython tree module discussion In-Reply-To: <4A64C1F7.5040503@berkeley.edu> References: <4A4D052D.7010708@berkeley.edu> <3f6baf360907070725s4bdd9c80qb8b79f6f19e9c82a@mail.gmail.com> <320fb6e00907070812l7b8f0ea9p190c8262b67d5e64@mail.gmail.com> <3f6baf360907072109y2aca6063l80cf84c1da53b6ce@mail.gmail.com> <20090708124841.GX17086@sobchak.mgh.harvard.edu> <4A5B7E42.40106@berkeley.edu> <4A5BAEF0.9050504@berkeley.edu> <20090714123534.GQ17086@sobchak.mgh.harvard.edu> <4A5CD7C8.70009@berkeley.edu> <4A64C1F7.5040503@berkeley.edu> Message-ID: <3f6baf360907210903l5167eefdl46f5cd969c2d164b@mail.gmail.com> Hi Nick, On Mon, Jul 20, 2009 at 3:13 PM, Nick Matzke wrote: > 4. Philosophy question: If I build some functions that do something new > with an e.g. ElementTree (XML tree) object, should I: > > (a) make these functions go in a subclass of the class for the original > object (thus inheriting the methods of the original class, and basically > adding new methods). E.g. basically extending the methods of ElementTree, > with a subclass GbifElementTree; or: > > (b) make a class containing the object as an attribute, with e.g. > GbifXml.xmltree containing an ElementTree attribute which then gets passed > to the various functions. > > I currently have (b) but the more I think about it, the more (a) makes more > sense from a simplicity/usability/maintainability sense. > > I have some ElementTree-related helper functions, too. Since we're still maintaining compatibility with Python 2.4 and xml.etree didn't enter the standard library until Py2.5, the ElementTree interface could potentially come from several different sources, with slightly different capabilities. It's a weird module in general... 
basically, I'm treating the library like a wild badger -- a function either relies on the ETree object structure, or it doesn't, and the ETree-specific functions live in their own area near the top of the file. The methods that do phyloXML-specific work call another function to extract what they need from a node, then carry on with ordinary, well-behaved Python objects. When Bio.Tree integration comes due, we could check how much our various ETree utilities overlap and maybe combine them into a separate module. For instance, I have a tree pretty-printer and a function for dumping a list of XML node tags, too. Summary: Integrating with Bio.Tree will involve some refactoring, and it would be easier if the ElementTree stuff was quarantined off a little bit. > def extract_latlongs(self, element): > > Create a temporary pseudofile, extract lat longs to it, > return results as string. > > Inspired by: http://www.skymind.com/~ocrow/python_string/ > (Method 5: Write to a pseudo file) > Neat article! I was intrigued by this result so I tried to replicate it -- and my results were different, since newer Pythons have some string optimizations that weren't in place when the article was written. Adding strings together in a loop doesn't lead to quadratic time complexity anymore. Blogged it: http://etalog.blogspot.com/2009/07/faster-string-concatenation-in-python.html > def xmlstring_to_xmltree(xmlstring): > > Take the text string returned by GBIF and parse to an XML tree using > ElementTree. > Requires the intermediate step of saving to a temporary file (required to > make > ElementTree.parse work, apparently) > > Did cStringIO work as a temp file handle? I wonder if this is a bug in Python. Overall, it's great to see Biopython is going to have such solid phylogenetics/geography support. Should be fun to work with in the future. 
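
[Editorial sketch, not part of the original message.] On the question of whether an in-memory file handle can replace the temporary file: with the standard library's xml.etree.ElementTree, both routes below avoid touching the disk. The XML snippet and its element and attribute names (records, record, lat, long) are invented for this example and are not GBIF's actual schema.

```python
import io
import xml.etree.ElementTree as ET

# Made-up XML text standing in for a server response.
xmlstring = u"<records><record lat='-41.3' long='174.8'/></records>"

# Route 1: wrap the string in an in-memory file object and use parse().
tree = ET.parse(io.StringIO(xmlstring))
root = tree.getroot()

# Route 2: skip the file-like object entirely and parse the string directly.
same_root = ET.fromstring(xmlstring)

# Pull the coordinates out of the first child element.
first = root[0]
latlong = (first.get("lat"), first.get("long"))
```

ET.parse returns an ElementTree wrapper while ET.fromstring returns the root Element, so which one is more convenient depends on whether the calling code expects a tree or a node.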
Cheers, Eric From hlapp at gmx.net Tue Jul 21 17:12:00 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 21 Jul 2009 13:12:00 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> Message-ID: On Jul 20, 2009, at 10:57 AM, Eric Talevich wrote: > the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML Really? A clade is *not* a node in the sense it is normally used in phylogenetics, and I would suggest that PhyloXML is using "clade" synonymously with "node" it needs to change b/c using established terminology in conflicting ways isn't a good idea. A clade is a subtree of a tree, i.e., a node and all its descendent nodes (and the branches that connect them). Or more generally for an unrooted tree, it is any group of nodes (and branches connecting them) that can be completely separated from the rest of the tree by severing a single branch. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From eric.talevich at gmail.com Tue Jul 21 17:29:54 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 21 Jul 2009 13:29:54 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> Message-ID: <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> On Tue, Jul 21, 2009 at 1:03 PM, Hilmar Lapp wrote: > > On Jul 20, 2009, at 10:57 AM, Eric Talevich wrote: > > the 'node' in Nexus and PhyloDB is called 'clade' in phyloXML >> > > > Really? 
A clade is *not* a node in the sense it is normally used in > phylogenetics, and I would suggest that PhyloXML is using "clade" > synonymously with "node" it needs to change b/c using established > terminology in conflicting ways isn't a good idea. > > A clade is a subtree of a tree, i.e., a node and all its descendent nodes > (and the branches that connect them). Or more generally for an unrooted > tree, it is any group of nodes (and branches connecting them) that can be > completely separated from the rest of the tree by severing a single branch. > > -hilmar > Interesting to know. Here's the documentation for the Clade type: Element Clade is used in a recursive manner to describe the topology of a phylogenetic tree. The parent branch length of a clade can be described either with the 'branch_length' element or the 'branch_length' attribute (it is not recommended to use both at the same time, though). Usage of the 'branch_length' attribute allows for a less verbose description. Element 'confidence' is used to indicate the support for a clade/parent branch. Element 'events' is used to describe such events as gene-duplications at the root node/parent branch of a clade. Element 'width' is the branch width for this clade (including parent branch). Both 'color' and 'width' elements apply for the whole clade unless overwritten in-sub clades. Attribute 'id_source' is used to link other elements to a clade (on the xml-level). It has a label (name), confidence value and branch length like most Node objects do, and even an attribute called node_id. I guess nodes and edges are implicit in the phyloXML representation, and everything *except* the clade class would be considered a sub-type of the traditional node. Then maybe Clade should inherit from Tree instead of Node, and offer an interface to implicit node and edge objects. For the purposes of reusing methods among Nexus, phyloXML, etc. 
trees, using Clade as a Node seems easiest in terms of having the right attributes available. The same mapping is being using in the BioRuby project, too: Phylogeny:Tree, Clade:Node. (Not sure about Bioperl.) I'll hold off working on the BaseTree integration until we have consensus on this. Best, Eric From hlapp at gmx.net Tue Jul 21 17:45:08 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 21 Jul 2009 13:45:08 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> Message-ID: On Jul 21, 2009, at 1:29 PM, Eric Talevich wrote: > Element Clade is used in a recursive manner to describe the topology > of a phylogenetic tree. That's OK I guess on the topological level - a subtree of a clade is also a clade. I.e., the clade formed a node A and all its descendants is contained within the clade formed by the parent of A and all of the parent's descendants. But referring to or identifying a clade must be referring to an entire group of nodes, not only one. So attaching something to the clade semantically has to attach it to all nodes in the clade. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From czmasek at burnham.org Tue Jul 21 17:51:03 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 21 Jul 2009 10:51:03 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> Message-ID: <4A660007.5090900@burnham.org> Hi, Hilmar: Hilmar Lapp wrote: > A clade is a subtree of a tree, i.e., a node and all its descendent > nodes (and the branches that connect them). Or more generally for an > unrooted tree, it is any group of nodes (and branches connecting them) > that can be completely separated from the rest of the tree by severing > a single branch. > > -hilmar Actually, that is how clade is being used. Like so: A B C The difference is, a clade can contain other clades, wheres as node cannot. Chris From czmasek at burnham.org Tue Jul 21 18:05:25 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 21 Jul 2009 11:05:25 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> Message-ID: <4A660365.5060405@burnham.org> > But referring to or identifying a clade must be referring to an entire > group of nodes, not only one. So attaching something to the clade > semantically has to attach it to all nodes in the clade. Good point! 
Predefined phyloXML elements are defined to either apply to the whole clade, as long as they are not "overwritten" by values in descendant clades (for example Taxonomy) or are defined to only apply to the clade ("node" in this case) they are in, "branch_length" for example. The property element (used for "custom" data), has a "applies_to" attribute to indicate where to data should be attached to (values are: "phylogeny", "clade", "node", "parent_branch", ...). Chris > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From hlapp at gmx.net Tue Jul 21 18:24:27 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 21 Jul 2009 14:24:27 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <4A660365.5060405@burnham.org> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <4A660365.5060405@burnham.org> Message-ID: <7265CC6F-2C0B-478A-ADAE-AD8B96ABE1EC@gmx.net> On Jul 21, 2009, at 2:05 PM, Christian M Zmasek wrote: > or are defined to only apply to the clade ("node" in this case) they > are in, "branch_length" for example. You do see how you are contradicting the previous definition here, right? *All* nodes in a clade are in that clade, and *all* branches. My recommendation is to fix this in the phyloXML spec - there is a whole field of cladistics and I don't think it's a wise idea to re- apply their terminology in ways that are in contradiction. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bugzilla-daemon at portal.open-bio.org Tue Jul 21 19:56:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 15:56:12 -0400 Subject: [Biopython-dev] [Bug 2880] test_Mafft_tool.py unit test failure In-Reply-To: Message-ID: <200907211956.n6LJuCXT018866@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2880 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 15:56 EST ------- (In reply to comment #5) > > I've retitled the bug to focus on the MAFFT issue. This may well be > a problem with your old version of MAFFT - I know for example the > the FASTA output is broken on some versions of MAFFT. > I was able to install MAFFT v6.240 on another machine, and worked out a simple fix. Basically this version produced a different CLUSTAL style header line. Should be fixed in CVS now. Thanks for the report, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 20:56:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 16:56:51 -0400 Subject: [Biopython-dev] [Bug 2874] invalid class on warning module In-Reply-To: Message-ID: <200907212056.n6LKupOV020686@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2874 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 16:56 EST ------- I thought I had already marked this bug as fixed... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jul 21 20:59:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Jul 2009 16:59:29 -0400 Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity In-Reply-To: Message-ID: <200907212059.n6LKxT1o020771@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2879 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 16:59 EST ------- Could you give a short but complete example showing the problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jul 21 21:09:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Jul 2009 17:09:33 -0400
Subject: [Biopython-dev] [Bug 2867] Bio.PDB.PDBList.update_pdb calls invalid os.cmd
In-Reply-To:
Message-ID: <200907212109.n6LL9XP6021073@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2867

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-21 17:09 EST -------
I've checked a fix for this into CVS, but have not tested it. Could you update
and retry?

It might be simplest to reinstall all of Biopython from CVS or github, but you
only need to update this one file,
/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py

The new version will be on github soon, or here soon:
http://biopython.org/SRC/biopython/Bio/PDB/PDBList.py

The differences are quite small:

RCS file: /home/repository/biopython/biopython/Bio/PDB/PDBList.py,v
retrieving revision 1.25
diff -r1.25 PDBList.py
37a38
> #TODO - Use os.path.join(...) instead of adding strings with os.sep
39a41
> import shutil
248d249
<
280c281
< os.cmd('mv %s %s'%(old_file,new_file))
---
> shutil.move(old_file, new_file)

i.e. The new version uses shutil.move(old_file, new_file) instead.

Thanks!

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
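
[Editorial sketch, not part of the original comment.] The os module has no cmd function, which is why the old line failed; shutil.move does the same job without shelling out to 'mv'. The snippet below is a self-contained illustration using a scratch directory. The file names are invented for the example and say nothing about how PDBList.py actually names its files.

```python
import os
import shutil
import tempfile

# Work in a throwaway directory so the example has no side effects elsewhere.
workdir = tempfile.mkdtemp()
old_file = os.path.join(workdir, "pdb1abc.ent")          # hypothetical name
new_file = os.path.join(workdir, "obsolete_pdb1abc.ent")  # hypothetical name

with open(old_file, "w") as handle:
    handle.write("HEADER\n")

# Portable replacement for shelling out to 'mv old new'.
shutil.move(old_file, new_file)

moved_ok = os.path.exists(new_file) and not os.path.exists(old_file)

# Tidy up the scratch directory.
shutil.rmtree(workdir)
```

Building the paths with os.path.join also matches the TODO added in the diff above, which suggests moving away from string concatenation with os.sep.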
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 08:17:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Jul 2009 04:17:06 -0400
Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity
In-Reply-To:
Message-ID: <200907220817.n6M8H6IQ008427@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2879

------- Comment #2 from katja.luck at unistra.fr 2009-07-22 04:17 EST -------
(In reply to comment #1)
> Could you give a short but complete example showing the problem?
>

example code:

from Bio.PDB.PDBParser import PDBParser

if '__main__' == __name__:
    parser = PDBParser(PERMISSIVE=1)
    PDBID = '1N7T'
    PDB_file = '/Network/Servers/sumba/Volumes/s/luck/pymol/1N7T.pdb'
    structure = parser.get_structure(PDBID,PDB_file)
    chain = structure[0]['A']
    print chain[66].get_id()
    chain.__delitem__(66)

command line output:

[carlit:/Users/katja] luck% python Python_scripts/PDZ_project/bug_example.py
(' ', 66, ' ')
Traceback (most recent call last):
  File "Python_scripts/PDZ_project/bug_example.py", line 14, in
    chain.__delitem__(66)
  File "/Library/Python/2.5/site-packages/Bio/PDB/Chain.py", line 79, in __delitem__
    return Entity.__delitem__(self, id)
AttributeError: class Entity has no attribute '__delitem__'

Okay, I now realised that I should rather use detach_child() than the private
method __delitem__() for deleting residues from a chain but still thought it
might be good to report this bug.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
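
[Editorial sketch, not part of the original comment.] As a general Python point related to the example above: the statement 'del container[key]' is the usual spelling, and the interpreter dispatches it to the container's __delitem__ method. The ToyChain class below is a hypothetical stand-in written for this illustration; it is not Bio.PDB's Chain or Entity and does not reproduce their behaviour.

```python
class ToyChain(object):
    """Hypothetical container keyed by residue number (not Bio.PDB.Chain)."""

    def __init__(self, residues):
        self._children = dict(residues)
        self.deleted = []  # records which keys went through __delitem__

    def __getitem__(self, res_id):
        return self._children[res_id]

    def __delitem__(self, res_id):
        # Called automatically for 'del chain[res_id]'.
        del self._children[res_id]
        self.deleted.append(res_id)

    def __len__(self):
        return len(self._children)


chain = ToyChain({65: "GLY", 66: "ALA", 67: "SER"})
del chain[66]  # dispatches to ToyChain.__delitem__(chain, 66)
```

After the del statement the container holds two entries and chain.deleted is [66], showing that the special method ran without being called by name.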
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 09:14:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 05:14:04 -0400 Subject: [Biopython-dev] [Bug 2879] missing __delitem__ in Bio.PDB.Entity.Entity In-Reply-To: Message-ID: <200907220914.n6M9E4qq009918@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2879 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 05:14 EST ------- Thanks for the clarification. Note in python __delitem__ is a special method, and rather than this: chain.__delitem__(66) you would normally do: del chain[66] and this will internally call the special __delitem__ method. This is much like other special methods, e.g. str(object) will internally do object.__str__() for you. You wouldn't normally use these double underscore methods explicitly. In any case, I don't understand what Thomas intended the __delitem__ to do, and there may be a bug here. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jul 22 11:56:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 12:56:23 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output Message-ID: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Hi Peter R. et al, Up until now I had mostly been trying EMBOSS 6.1.0 with short read data. I've just noticed for longer reads EMBOSS wraps the sequences and qualities lines in FASTQ output (at 60 characters). There is an example of this at the end of the email. My understanding is that while line breaks are allowed in the sequences and qualities lines of a FASTQ file, they are discouraged as it can break simple minded parsers. 
Unfortunately right now I can't find any references/websites to back up this
assertion (other than things I wrote myself since), but I was sure I read this
on the MAQ site somewhere. Several sites do simply talk about "the" sequence
line and "the" quality line (indeed the early drafts of the wikipedia page had
this assumption, which I fixed). This is natural if all you have ever worked
with is short read data.

Of course, 454 reads are hundreds of bases long, and even the latest Illumina
reads now are in the range 70 to 100 bp (or so I hear), so this issue will
become more common - so any existing parsers that can't cope with line breaks
will soon get broken, and hopefully fixed.

For Biopython we should be able to cope with any strange line breaks in the
sequences and qualities lines on input, but for output don't do any line
wrapping. I felt this would result in more widely parseable output.

I wondered what your thought process was, and if you think it is worth
removing the line wrapping on EMBOSS's FASTQ output (or indeed, if you have a
good argument to convince me to make Biopython output FASTQ with line wrapping
by default).

[I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as ideal
for an OBF cross project mailing list, something we talked about at BOSC/ISMB
2009. Am I right in thinking you (Peter Rice) were going to look into this?]

Regards,

Peter C.
(at Biopython)

e.g.

$ embossversion
Reports the current EMBOSS version number
6.1.0

$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

It is likely that email software will mangle the line breaks, but in my
example file sanger_93.fastq the sequence and the quality are single line
strings (of length 94).
Now let's let EMBOSS seqret read this in and write it out again:

$ seqret -filter -seq sanger_93.fastq -sformat fastq-sanger -osformat fastq-sanger
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDC
BA@?>=<;:9876543210/.-,+*)('&%$#"!

The new lines are real and not just from the email formatting - you can check
this by piping the output through hexdump. It appears EMBOSS is using 60
character line wrapping.

Peter C.

From bugzilla-daemon at portal.open-bio.org Wed Jul 22 15:14:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Jul 2009 11:14:46 -0400
Subject: [Biopython-dev] [Bug 2883] New: Errors after unpickling of 1.49 seqrecords
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2883

Summary: Errors after unpickling of 1.49 seqrecords
Product: Biopython
Version: 1.51b
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: andrea at biodec.com

I've the same error also with biopython 1.50b
I've the same errors either with python2.4 and python2.5

PROBLEM:
I've for testing purposes some cPickled seqrecords that i prepared with
biopython-1.49. The unpickling doesn't produce any error at all, but if i try
to:
1) print the unpickled seqrecord
2) use the unpickled seqrecord
i get errors.
1) =========================================================================
>>> print seqr
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 501, in __str__
    if self.letter_annotations :
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 170, in
    fget=lambda self : self._per_letter_annotations,
AttributeError: 'SeqRecord' object has no attribute '_per_letter_annotations'

### This problem maybe is related to the one of the bug #2838
===============================================================================

2)=============================================================================
>>> seqr.seq
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 538, in __repr__
    % tuple(map(repr, (self.seq, self.id, self.name,
  File "/usr/lib/python2.5/site-packages/biopython-1.51b-py2.5-linux-x86_64.egg/Bio/SeqRecord.py", line 233, in
    seq = property(fget=lambda self : self._seq,
AttributeError: 'SeqRecord' object has no attribute '_seq'
===============================================================================

According to me old seqrecords didn't have any "_per_letter_annotations" or
any "_seq" in SeqRecord class/instances.

Maybe i've to split the two errors in two different bugs but i prefer to keep
together because are related to the same main problem of "unpickling an old
seqrecord" (or maybe is not a problem and i haven't to try to unpickle old
seqrecord instance with new biopython versions)

I didn't try the CVS code because i didn't find any related error in
bugzilla.open-bio.org.
I've added a dump of a seqrecord generated with biopython-1.49 Thanks Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 15:15:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 11:15:32 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221515.n6MFFWaK023587@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #1 from andrea at biodec.com 2009-07-22 11:15 EST ------- Created an attachment (id=1346) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1346&action=view) Dump of a seqrecord generated with biopython 1.49 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 15:44:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 11:44:47 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221544.n6MFil8a025136@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 11:44 EST ------- It sounds like pickling and unpickling worked for you on Biopython 1.49, but I am not 100% sure that is what you meant. 
The good news is I can pickle/unpickle a new SeqRecord object:

>>> import pickle
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> s = Seq("ACGT", generic_dna)
>>> s2 = pickle.loads(pickle.dumps(s))
>>> s2
Seq('ACGT', DNAAlphabet())
>>> from Bio.SeqRecord import SeqRecord
>>> r = SeqRecord(s, id="test", letter_annotations={"dummy":[4,3,2,1]})
>>> print r
ID: test
Name:
Description:
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())
>>> r2 = pickle.loads(pickle.dumps(r))
>>> print r2
ID: test
Name:
Description:
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())

And this also works with cPickle:

>>> import cPickle
>>> s3 = cPickle.loads(cPickle.dumps(s))
>>> s3
Seq('ACGT', DNAAlphabet())
>>> r3 = cPickle.loads(cPickle.dumps(r))
>>> print r3
ID: test
Name:
Description:
Number of features: 0
Per letter annotation for: dummy
Seq('ACGT', DNAAlphabet())

I would expect you to be able to pickle/unpickle new objects on your system
too.

However, I can confirm trying to unpickle the example you attached to this bug
also fails for me (using the latest Biopython from CVS).

As you may be aware, per-letter-annotation support was added in Biopython 1.50
which is stored internally by a private property of the SeqRecord,
_per_letter_annotations. The seq property is also now stored internally by a
private property of the SeqRecord, _seq. This means if you unpickle a
pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the
_per_letter_annotations and _seq properties never gets initialised. This causes
the two errors you saw.

I don't think there is much we can do about this... not without making the
SeqRecord even more complicated, e.g.
http://code.activestate.com/recipes/521901/

Peter

P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and shouldn't
be relevant to the underlying SeqRecord object, or this issue.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 22 16:52:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 12:52:14 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221652.n6MGqEu7028407@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #3 from andrea at biodec.com 2009-07-22 12:52 EST ------- (In reply to comment #2) > It sounds like pickling and unpickling worked for you on Biopython 1.49, but I > am not 100% sure that is what you meant. Yes, that's true. it worked. > > The good news is I can pickle/unpickle a new SeqRecord object: > yes this i know, and it works also for me and also with cPickle. > I would expect you to be able to pickle/unpickle new objects on your system > too. sure > > However, I can confirm trying to unpickle the example you attached to this bug > also fails for me (using the latest Biopython from CVS). > > As you may be aware, per-letter-annotation support was added in Biopython 1.50 > which is stored internally by a private property of the SeqRecord, > _per_letter_annotations. The seq property is also now stored internally by a > private property of the SeqRecord, _seq. This means if you unpickle a > pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the > _per_letter_annotations and _seq properties never gets initialised. This causes > the two errors you saw. This is the problem. i've many example of seqrecord dump (that i use as a test) that due to the seqrecord modifications i cannot use anymore. - I've to convert in the new type. - or i've to design fully new tests that permit me to manage changing in the SeqRecord structure. 
> > I don't think there is much we can do about this... not without making the > SeqRecord even more complicated, e.g. > http://code.activestate.com/recipes/521901/ I understand. I thought SeqRecod was structurally stable. But it isn't. In this sense i can only pickle strings, lists and dictionaries... so i will redraw my tests to manage only SeqRecord stored data (representing it as a dictionary of dictionaries it would be a good solution). > > > P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and shouldn't > be relevant to the underlying SeqRecord object, or this issue. > yes, but in the last part of the bug there was a similar error AttributeError: 'DBSeqRecord' object has no attribute '_per_letter_annotations' and i thought it was due to the fact that DBSeqRecord didn't have that attribute and it was out of sync with respect to the new 1.50 seqrecord... Thanks Andrea PS: i think you could close the bug. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jul 22 17:28:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Jul 2009 13:28:39 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907221728.n6MHSdjt029734@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-22 13:28 EST ------- (In reply to comment #3) > > As you may be aware, per-letter-annotation support was added in Biopython > > 1.50 which is stored internally by a private property of the SeqRecord, > > _per_letter_annotations. The seq property is also now stored internally > > by a private property of the SeqRecord, _seq. This means if you unpickle > > a pre-Biopython 1.50 SeqRecord on Biopython 1.50 or later, the > > _per_letter_annotations and _seq properties never gets initialised. This > > causes the two errors you saw. > > This is the problem. i've many example of seqrecord dump (that i use as a > test) that due to the seqrecord modifications i cannot use anymore. > - I've to convert in the new type. > - or i've to design fully new tests that permit me to > manage changing in the SeqRecord structure. You can probably hack the missing per letter annotation with something like record._per_letter_annotations = {}, but it looks like there is no obvious way to get at the sequence information in the unpicked record. Would you like to discuss your storage strategy on the mailing list? I'm curious what you are doing that made you choose to use pickle like this (instead of saving to a standard sequence file format, or BioSQL). > > I don't think there is much we can do about this... 
not without > > making the SeqRecord even more complicated, e.g. > > http://code.activestate.com/recipes/521901/ > > I understand. I thought SeqRecod was structurally stable. > But it isn't. In this sense i can only pickle strings, lists and > dictionaries... so i will redraw my tests to manage only SeqRecord > stored data (representing it as a dictionary of dictionaries it would > be a good solution). Pickling complex objects is usually fine, unless the class changes - like the SeqRecord did (and it may do in future, or more likely the SeqFeature object may). > > P.S. Bug 2838 was a problem in the DBSeqRecord (used for BioSQL), and > > shouldn't be relevant to the underlying SeqRecord object, or this issue. > > yes, but in the last part of the bug there was a similar error > AttributeError: 'DBSeqRecord' object has no attribute > _per_letter_annotations' and i thought it was due to the fact that > DBSeqRecord didn't have that attribute and it was out of sync with > respect to the new 1.50 seqrecord... Yes, part of Bug 2838 was that the DBSeqRecord got out of sync with the SeqRecord. > PS: i think you could close the bug. OK - marking as "won't fix". Sorry about this, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
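For readers wondering what "making the class more complicated" could look like in general Python terms, one common pattern is a __setstate__ method that upgrades old pickled state on load. The sketch below uses a toy Record class (hypothetical, not the Biopython SeqRecord) and assumes the old layout kept the sequence under the plain attribute name "seq"; whether that matches any particular Biopython release is not confirmed here.

```python
import pickle


class Record(object):
    """Toy class whose private layout changed between two 'releases'."""

    def __init__(self, seq):
        self._seq = seq                      # new layout: private attribute
        self._per_letter_annotations = {}    # new layout: added later

    @property
    def seq(self):
        return self._seq

    def __setstate__(self, state):
        # Called by pickle with the stored instance dictionary.
        state = dict(state)
        if "seq" in state and "_seq" not in state:
            # Assumed old layout: sequence stored under the public name.
            state["_seq"] = state.pop("seq")
        # Backfill attributes that did not exist when the pickle was made.
        state.setdefault("_per_letter_annotations", {})
        self.__dict__.update(state)


# Simulate loading state written by the old layout.
old_state = {"seq": "ACGT", "id": "demo"}
upgraded = Record.__new__(Record)
upgraded.__setstate__(old_state)

# Objects created with the new layout still round-trip as usual.
fresh = pickle.loads(pickle.dumps(Record("GGCC")))
```

The cost is that every future layout change needs another branch in __setstate__, which is the kind of extra complexity the comment above is wary of.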
From biopython at maubp.freeserve.co.uk Wed Jul 22 19:25:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 20:25:25 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221140l2abea51fs343e40f3a058584a@mail.gmail.com> <20090522225432.GU84112@sobchak.mgh.harvard.edu> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> Message-ID: <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> On Wed, Jul 22, 2009 at 7:16 PM, James Casbon wrote: > > A bit late to the party, but I put my sff parsing code into this fork > before reading this thread: > http://github.com/jamescasbon/biopython/tree/sff > > I have a test suite but not sure where all the other QualityIO tests > are so it can live with them > > It does work with the roche tools v2, but I have no paired end sff > files to test. Sounds interesting - github is being very slow for me right now, so I'll probably take a look tomorrow. I'll be interested to see how it compares to my rough code on Bug 2837 based on the code from Jose Blanca (this doesn't do paired end reads yet). http://bugzilla.open-bio.org/show_bug.cgi?id=2837 This is something I hope to work on for Biopython 1.52, once Biopython 1.51 final is out the door (later this month I hope). 
Peter From czmasek at burnham.org Thu Jul 23 04:43:08 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 22 Jul 2009 21:43:08 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML for Biopython In-Reply-To: <325F101D-1E7A-4BEA-BF2C-A3C18547063B@illinois.edu> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <4A660007.5090900@burnham.org> <325F101D-1E7A-4BEA-BF2C-A3C18547063B@illinois.edu> Message-ID: <4A67EA5C.90709@burnham.org> Hi, Chris: > From that contained Clades fall out quite easily, as they would just > be deeper subtrees within that Clade that also have a clade 'root node'. I don't understand this sentence. Chris From pmr at ebi.ac.uk Thu Jul 23 08:08:51 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 23 Jul 2009 09:08:51 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Message-ID: <4A681A93.9030303@ebi.ac.uk> Peter C. wrote: > Hi Peter R. et al, > > For Biopython we should be able cope with any strange line breaks in > the sequences and qualities lines on input, but for output don't do > any line wrapping. I felt this would result in more widely parseable > output. I wondered what your thought process was, and if you think it > is worth removing the line wrapping on EMBOSS's FASTQ output (or > indeed, if you have a good argument to convince me to make Biopython > output FASTQ with line wrapping by default). There is also an issue with making the ines so long that brain-damaged parsers (those that read a line in C and fail to check it was a complete line) will fail. Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see whether any parsers would object. 
The obvious compromise is to increase the default line length in EMBOSS to say 500 so that anyone reading up to 512 characters will still be safe. Unfortunately some folk will then assume there will never be a line break. Alternatively, we could truly make everything fit on one line. Or we could double up the fastq outputs with and without line breaks (horrible problems with naming the output formats). I suspect this one-line thing is a simple attempt to avoid the "quality line starting with '@' or '+'" issue. > [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as > ideal for an OBF cross project mailing list, something we talked about > at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going > to look into this?] Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release but I will get back on to it. regards, Peter From bugzilla-daemon at portal.open-bio.org Thu Jul 23 08:47:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 04:47:42 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230847.n6N8lgYw029402@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #5 from andrea at biodec.com 2009-07-23 04:47 EST ------- > You can probably hack the missing per letter annotation with something > like record._per_letter_annotations = {}, Yes, I tried and it works... but there is no way to recover the seq... seqrecord.seq is not accessible anymore.... > but it looks like there is no > obvious way to get at the sequence information in the unpickled record. > > Would you like to discuss your storage strategy on the mailing list? Sure, which one? Discussion, development..... But are you sure it is necessary? > I'm curious what you are doing that made you choose to use pickle like > this (instead of saving to a standard sequence file format, or BioSQL).
I'm using pickled object only for testing purposes. So implement a BioSQL system for that is too much... (also if it is available for sql lite) Maybe saving data in other format (for sure not fasta)... for example GenBank it could be another good solution but i will add a possible "layer of failure" related to parsing problems.... (And i think, unpickling of dictionary will not introduce this possible "layer of failure"). Were you thinking about GenBank format? Do you suggest something different? > > Sorry about this, Don't worry. I think you are developing the system in a way that it will bring it to a better state... so, it isn't a problem at all.... even better thanks a lot. Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jul 23 09:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:14:52 +0100 Subject: [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <4A681A93.9030303@ebi.ac.uk> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> <4A681A93.9030303@ebi.ac.uk> Message-ID: <320fb6e00907230214l6df7ff76j643e8ddc1f600054@mail.gmail.com> On Thu, Jul 23, 2009 at 9:08 AM, Peter Rice wrote: > Peter C. wrote: >> >> Hi Peter R. et al, >> >> For Biopython we should be able cope with any strange line breaks >> in the sequences and qualities lines on input, but for output don't do >> any line wrapping. I felt this would result in more widely parseable >> output. I wondered what your thought process was, and if you think >> it is worth removing the line wrapping on EMBOSS's FASTQ output >> (or indeed, if you have a good argument to convince me to make >> Biopython output FASTQ with line wrapping by default). 
> > There is also an issue with making the ines so long that brain-damaged > parsers (those that read a line in C and fail to check it was a complete > line) will fail. You mean a C parser with a finite string buffer (say 100 characters) which reads things line by line. Yes, that would be a bit brain dead too. I guess either way could break some parsers out there ;) > Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see > whether any parsers would object. I see - well I'm not objecting, and neither is the Biopython parser. > The obvious compromise is to increase the default line length in > EMBOSS to say 500 so that anyone reading up to 512 characters > will still be safe. Unfortunately some flk will then assume there will > never be a line break. That seems like a bad idea - especially as Roche 454 reads are in the region of 500+ bp, meaning some would wrap and some wouldn't. Even using a longer wrap like 1000 would probably just postpone the issue. If you are going to wrap, something short like 60 seems more sensible (often used in FASTA files too) given the historical 80 character width of a terminal window. People using early Solexa/Illumina machines will only see a single line, but as their read lengths are already in the range 70 to 100bp, I wonder what the latest Illumina pipelines output (wrt wrapping)? > Alternatively, we could truly make everything fit on one line. That's what Biopython currently does. But you are right - I hadn't considered brain dead parsers using fixed buffers. > Or we could double up the fastq outputs with and without line breaks > (horrible problems with naming the ouptut formats) I don't like that plan. For Biopython we could have a wrapping setting available for people who really need to specify this (as we do for FASTA already), with a sensible default value. > I suspect this one-line thing is a simple attempt to avoid the "quality line > starting with '@' or '+'" issue. Could be. 
I think the fact that @ and + are valid entries in the quality string is the second most annoying thing about the FASTQ format (after the lack of a clear format definition from Sanger, and the resulting variants from Solexa/Illumina etc). >> [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as >> ideal for an OBF cross project mailing list, something we talked >> about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) >> were going to look into this?] > > Yes indeed I was. Waylaid by the demands of the 6.1.0 EMOSS release > but I will get back on to it. Thanks! > regards, > > Peter Cheers, Peter C. From bugzilla-daemon at portal.open-bio.org Thu Jul 23 09:20:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 05:20:20 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230920.n6N9KKwC030688@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-23 05:20 EST ------- (In reply to comment #5) > > Would you like to discuss your storage strategy on the mailing list? > > Sure, which one? Discussion, developement..... But are you sure it is > necessary? I was thinking the main discussion list - but if this was just for your own testing, maybe we don't need to. > > I'm curious what you are doing that made you choose to use pickle like > > this (instead of saving to a standard sequence file format, or BioSQL). > > I'm using pickled object only for testing purposes. So implement a BioSQL > system for that is too much... (also if it is available for sql lite) > Maybe saving data in other format (for sure not fasta)... for example > GenBank it could be another good solution but i will add a possible > "layer of failure" related to parsing problems.... 
(And i think, unpickling > of dictionary will not introduce this possible "layer of failure"). > Were you thinking about GenBank format? Do you suggest something different? If your SeqRecord objects are all simply loaded from sequence files in the first place (and not modified), I would just keep the original file and re-parse it. If you have generated your own SeqRecords (or modified those from reading a file), then it makes sense to save them somehow. The choice of file format depends on the nature of annotation. The latest Biopython will now record the features in a GenBank file, making that a reasonable choice - but this does not cover per-letter-annotations. BioSQL has the same limitation. > > > Sorry about this, > > Don't worry. I think you are developing the system in a way that it > will bring it to a better state... so, it isn't a problem at all.... > even better thanks a lot. > > Andrea Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
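Returning to the FASTQ line-wrapping thread earlier in this digest: the "quality line starting with '@' or '+'" problem can be handled on input by using the sequence length to decide when the quality block ends, rather than looking for the next '@'. The following is a minimal sketch of that idea using only the standard library; read_fastq is a hypothetical name and this is not the Biopython parser.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Sequence and quality may each be wrapped over several lines. The end
    of the quality block is found by length, so quality lines that start
    with '@' or '+' are not mistaken for record markers.
    """
    it = iter(lines)
    line = next(it, None)
    while line is not None:
        line = line.rstrip("\r\n")
        if not line:
            # Skip blank lines between records.
            line = next(it, None)
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]

        # Collect sequence lines up to the '+' separator line.
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' line for record %r" % title)
        seq = "".join(seq_parts)

        # Collect quality lines until as many characters as the sequence.
        qual = ""
        while len(qual) < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality for record %r" % title)
            qual += line.rstrip("\r\n")
        if len(qual) != len(seq):
            raise ValueError("quality longer than sequence in %r" % title)

        yield title, seq, qual
        line = next(it, None)


wrapped = ["@r1\n", "ACGT\n", "ACGT\n", "+\n", "@III\n", "+III\n",
           "@r2\n", "AC\n", "+r2\n", "!!\n"]
records = list(read_fastq(wrapped))
```

In the example input the two quality lines of r1 begin with '@' and '+', which is exactly the case a naive line-by-line parser would misread.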
From biopython at maubp.freeserve.co.uk Thu Jul 23 09:34:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:34:26 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00905221609w60e6a225p1911a08578f7ffd8@mail.gmail.com> <21f6d10e0905221652p5ef5593bu4bbd8e732c834641@mail.gmail.com> <320fb6e00905221710q2d8c583dr6693aa3574859655@mail.gmail.com> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> Message-ID: <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> On Wed, Jul 22, 2009 at 9:51 PM, James Casbon wrote: > > 2009/7/22 Peter : >> On Wed, Jul 22, 2009 at 7:16 PM, James Casbon wrote: >>> >>> A bit late to the party, but I put my sff parsing code into this fork >>> before reading this thread: >>> http://github.com/jamescasbon/biopython/tree/sff >> >> Sounds interesting - github is being very slow for me right now, >> so I'll probably take a look tomorrow. I'll be interested to see how >> it compares to my rough code on Bug 2837 based on the code >> from Jose Blanca (this doesn't do paired end reads yet). >> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 > > I don't think there is much in it really. ?You have a factored > BinaryFile class, I have classes for the components of the SFF file. > Both are based around struct. Github is working fine now - maybe my wireless network was just too slow at home last night? Jose's code uses seek/tell which means it has to have a handle to an actual file. He also used binary read mode - I'm not sure if this was essential or not. 
James' code seems to make a single pass though the file handle, without using seek/tell to jump about. I think this is nicer, as it is consistent with the other SeqIO parsers, and should work on more types of handles (e.g. from gzip, StringIO, or even a network connection). It looks like you (James) construct Seq objects using the full untrimmed sequence as is. I was undecided on if trimmed or untrimmed should be the default, but the idea of some kind of masked or trimmed Seq object had come up on the mailing list which might be useful here (and in contig alignments). i.e. something which acts like a Seq object giving the trimmed sequence, but which also contains the full sequence and trim positions. I also want to look at paired end reads in SFF files... Peter From bugzilla-daemon at portal.open-bio.org Thu Jul 23 09:56:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 05:56:50 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907230956.n6N9uouv031896@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #7 from andrea at biodec.com 2009-07-23 05:56 EST ------- (In reply to comment #6) > (In reply to comment #5) > If your SeqRecord objects are all simply loaded from sequence files in the > first place (and not modified), I would just keep the original file and > re-parse it. > > If you have generated your own SeqRecords (or modified those from reading > a file), then it makes sense to save them somehow. The choice of file > format depends on the nature of annotation. The latest Biopython will now > record the features in a GenBank file, making that a reasonable choice - > but this does not cover per-letter-annotations. BioSQL has the same > limitation. yes, i'm testing some predictors. 
I do prediction and i compare the "newly predicted seqrecords" with the "previously correct predicted pickled seqrecords". I've them (the correct ones) only in pickled seqrecord format. The correctly predicted seqrecord, before prediction were in fasta format, but after i parsed them (into seqrecord), i did prediction, and then i pickled them (during prediction i add to seqrecord features and annotations). Actually i don't use per-letter-annotation despite the fact it seems interesting. But i didn't find any example in documentation (that show how the dictionary is populated...) so i really don't know how to use it.... even if i've, during prediction, a "per position annotation". Also if the "per letter annotation" is not managed in the GenBank format or in the BioSQL format (that i use a lot) i've to wait!! I was thinking also to store the pssm information somewhere in the seqrecord.... but this would be a very big change... (and also manage to store it in BioSQL.... )... but it's better to stop the discussion here or to move it... :-) Thanks Andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jul 23 10:28:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Jul 2009 06:28:42 -0400 Subject: [Biopython-dev] [Bug 2883] Errors after unpickling of 1.49 seqrecords In-Reply-To: Message-ID: <200907231028.n6NASgcX000743@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2883 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-23 06:28 EST ------- (In reply to comment #7) > ... but it's better to stop the discussion here or to move it... 
:-) Moving discussion to mailing list, see: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jul 23 11:08:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 12:08:09 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> Message-ID: <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> On Fri, Jul 10, 2009 at 1:38 PM, Peter wrote: > On Mon, Jun 22, 2009 at 6:57 PM, Peter wrote: >> >> Once the beta release is out, we'll resume taking small changes >> (especially for documentation additions or clarifications) with a >> view to releasing Biopython 1.51 final in July (probably the second >> week, after people get back from BOSC/ISMB). > > OK, that didn't happen - too much to catch up on at work after > being away at BOSC/ISMB for a week. Also I will be on holiday > next week (graduation etc). I will have some limited internet > access. I'm thinking of doing the final release of Biopython 1.51 > the following week (i.e. the week starting 20th July). > > This will be after the annual EMBOSS release, and one little thing > I want to sort out before we release Biopython 1.51 is mapping > Solexa/PHRED scores in FASTQ files (specifically what to do with > a PHRED score of zero which is usually a dummy value, but taken > literally means "this read is wrong" or "worst than random"). 
After > discussion with Peter Rice at BOSC/ISMB 2009, I plan to follow > his plan for EMBOSS (map PHRED of zero to the lowest used > Solexa score, -5). Once the EMBOSS release is out, I can use it > for cross checking our FASTQ conversions. The FASTQ checking is on going. I have updated our FASTQ code to map Solexa scores as I understood Peter Rice's description of the intended EMBOSS behaviour (this is for the corner case of very poor quality reads). However, due to a couple of minor bugs I found in EMBOSS 6.1.0 we'll either have to cross check against their CVS code, or hope they release EMBOSS 6.1.1 soon. Cross checking against MAQ would also be worthwhile, but while there are some patches about to fix a couple of MAQ FASTQ bugs and include Illumina to Sanger standard conversion, this isn't in their official repository yet. I guess I could cross check against BioPerl's new FASTQ support ... > Also, we have the Bio.Application.generic_run code to retire, > which basically means we label it as obsolete and update the > tutorial to use subprocess (see other thread), but this requires > cross platform testing. I still haven't got near my Windows machine to do this. I think this is important to get done in Biopython 1.51 as we are also introducing the extended set of command line wrappers. Nevertheless, a July release is still looking possible. Are there any other issues that would block the release? 
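As background to the PHRED/Solexa corner case discussed above: a PHRED score is -10*log10(p) for error probability p, while a Solexa score is -10*log10(p/(1-p)), so a PHRED score of zero (p = 1) has no finite Solexa equivalent. A minimal sketch of the conversion with the "map to -5" rule follows. The function names are illustrative only, and applying the -5 floor to every very low score (not just exactly zero) is an assumption of this sketch, not a statement of what EMBOSS or Biopython do.

```python
import math


def phred_from_solexa(solexa):
    """Convert a Solexa score to the equivalent PHRED score (float)."""
    return 10 * math.log10(10 ** (solexa / 10.0) + 1)


def solexa_from_phred(phred, floor=-5.0):
    """Convert a PHRED score to a Solexa score (float).

    PHRED zero has no finite Solexa equivalent, so it is mapped to
    `floor`; other results are clamped so they never drop below it
    (assumption of this sketch).
    """
    if phred < 0:
        raise ValueError("PHRED scores cannot be negative")
    if phred == 0:
        return floor
    return max(floor, 10 * math.log10(10 ** (phred / 10.0) - 1))
```

For high scores the two scales are nearly identical; they only diverge noticeably below about 10, which is why the choice only matters for very poor quality reads.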
Peter From eric.talevich at gmail.com Thu Jul 23 15:59:32 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 23 Jul 2009 11:59:32 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML forBiopython In-Reply-To: References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <6CC117B53EF342238715843D2C185723@NewLife> Message-ID: <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> All, Thanks for the ongoing discussion and helpful links. I'm going to propose an object mapping here and see how it sits with everyone -- please correct any questionable statements. In raw XML, the clade designation looks reasonable. The attributes that blur the clade-node distinction are branch_length, confidence and node_id. In the first two, the attributes apply to an implicit root node, not the entire clade. (Stated this way, it makes much more sense in the XML representation to have branch_length as a child node, not an attribute.) The node_id clearly applies to the clade's root node, once it's understood that the node is implicit. http://www.phyloxml.org/documentation/version_100/phyloxml.xsd.html#h-1124608460 On Thu, Jul 23, 2009 at 2:08 AM, Chris Fields wrote: > On Jul 23, 2009, at 12:12 AM, Mark A. Jensen wrote: > >> FWIW, BioPerl has Trees and Nodes. That's it; maybe Branches later (if I >> get around to it, or convince Chase it would be a good project). > > Many of the existing generalized Tree object representations seem to be guided by the Nexus/Newick format, which is basically an s-expression. This format can represent a tree as a parenthetical expression, and a node as a token (comma-delimited, potentially combining a taxon label and branch length separated by a colon) within that expression. Edges or branches are implicit. 
So Trees, Nodes, branch lengths and labels are all we *really* need to find common ground on, but other, more expressive representations are certainly possible. I'm basing my BaseTree classes on the tables in BioSQL's PhyloDB extension (http://biosql.org/wiki/Extensions) -- which were probably in turn based on BioPerl's Tree objects, but have at least been given some extra effort towards generalization. The PhyloDB schema includes an Edge table definition, among other things. Question: The Node objects in PhyloDB have left_idx and right_idx attributes. It looks like nodes are being kept in a double-linked list, which seems like unusually low-level information to keep around since Perl, Python, Ruby and Java all have flexible array or list types that can keep track of element order efficiently. Is there a use for these indexes in general phylogenetics work that couldn't be handled by other language-specific constructs? In this scrap >> http://www.bioperl.org/wiki/Finding_all_clades_represented_in_a_tree >> I defined a clade as a "maximal set of leaf/tip taxa descended from a >> given single node", because that's really what the question poser wanted. >> You might expand that definition to include all branches and nodes between >> the "given node" and the tips. That would be synonymous with "subtree". >> > > Yes, but some define clade slightly differently: > > http://en.wikipedia.org/wiki/Cladistics#Three_definitions_of_clade > Helpful! It looks like phyloXML's interpretation is "branch-based". Note that in the spec, the Phylogeny element that the various Bio* projects have interpreted as the Tree type is defined to have exactly one Clade attribute -- presumably the root node of the tree. I'm not sure how to interpret a branch_length value for that clade; maybe it should be ignored or disallowed. I think I see the utility of a clade as an annotation entity: one wants to >> grant properties to subtrees ("Mammalia", e.g.).
>> > The Clade node does have most of the important annotation types as its children -- Taxonomy, Sequence, Events, etc. Given how Nexus trees often label nodes with taxon names, the nearest phyloXML equivalent to a Node type might be Taxonomy. But in phyloXML, all of the Clade attributes and annotations apply to the root node, and potentially all sub-clades and sub-nodes that don't override this information. I don't think I'd map the basic Node type to anything but Clade for this reason. A "Node" (in BioPerl, or standard phylogenetics) can be *mapped* to a clade, >> or used to obtain a clade, *if* the tree is rooted (as Hilmar points out). >> It seems that for a rooted tree (i.e., where anc->desc relationships are >> defined), a "Clade" annotation that contained all the desired clade >> properties could be associated with the Node, because of the one-to-one >> mapping of nodes to clades in this case. In the case of an unrooted tree, a >> Clade could also be associated with a node, if the Clade also possessed a >> direction property. For example, in an unrooted tree, a Clade could be >> specified by Node + Branches of Node contained in Clade (which would be two >> of the three branches on an internal node). This would provide the direction >> of "descent". >> >> The 'rooted' and 'rerootable' attributes belong to Phylogeny, at the top of the tree. A Clade object should probably have easy access to this information for use in pruning or rerooting. This raises some questions about the role of the Phylogeny element -- is-it-really-a Tree? Or simply a wrapper with metadata about all the clades it contains, containing a single clade which is actually the top of the phylogenetic tree? In that case it could make sense for each clade to contain a direct or indirect reference to the phylogeny object, rather than the other way around. The mind reels. 
I was more comfortable calling it a Tree, as the other Bio* projects do, but then I haven't tried to integrate the Nexus tree classes yet. Conclusions: 1. A Clade is-a Tree, and also is-a Node for various operations. 2. For reusing base-class methods, a Clade should provide a 'node' attribute that behaves properly -- in most or all cases, the nodes will be the same as the list of sub-clades. 3. A Clade also needs to access some attributes of its original Phylogeny. Best regards, Eric From biopython at maubp.freeserve.co.uk Thu Jul 23 16:21:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 17:21:05 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 9: PhyloXML forBiopython In-Reply-To: <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> References: <3f6baf360907200757j46641604p3622bea89e55e733@mail.gmail.com> <49C5A2AB-C177-4219-B568-71C310A7CE15@duke.edu> <3f6baf360907211029x4dc54eb3w42d772f2d6a3b7f1@mail.gmail.com> <6CC117B53EF342238715843D2C185723@NewLife> <3f6baf360907230859l400f0fcfm450985cff710375b@mail.gmail.com> Message-ID: <320fb6e00907230921lf22fc67hafd3bc8998a4eb7e@mail.gmail.com> On Thu, Jul 23, 2009 at 4:59 PM, Eric Talevich wrote: > > Question: > The Node objects in PhyloDB have left_idx and right_idx attributes. It looks > like nodes are being kept in a double-linked list, which seems like > unusually low-level information to keep around since Perl, Python, Ruby and > Java all have flexible array or list types that can keep track of element > order efficiently. Is there a use for these indexes in general phylogenetics > work that couldn't be handled by other language-specific constructs? I would guess this is like the left/right indices used in BioSQL's taxon tree, see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html If they are being used the same way, they are an expensive-to-calculate second indexing scheme, which is useful for many tree operations.
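If the left/right indices are indeed a nested-set numbering, as guessed above, the idea can be sketched in a few lines. This assumes a tree given as a plain dict mapping each node to its list of children; the function names are hypothetical and nothing here is taken from the PhyloDB or BioSQL code. Each node gets a (left, right) pair from one depth-first walk, after which "all descendants of X" becomes a simple range comparison, which is what makes the scheme attractive inside an SQL database.

```python
def assign_nested_set(tree, root):
    """Return {node: (left, right)} from one depth-first walk.

    `tree` maps a node name to a list of its children; leaves may be
    omitted from the mapping.
    """
    index = {}
    counter = [0]  # mutable counter shared with the nested function

    def visit(node):
        counter[0] += 1
        left = counter[0]
        for child in tree.get(node, []):
            visit(child)
        counter[0] += 1
        index[node] = (left, counter[0])

    visit(root)
    return index


def descendants(index, node):
    """Nodes whose (left, right) interval lies strictly inside node's."""
    left, right = index[node]
    return sorted(n for n, (l, r) in index.items() if left < l and r < right)


tree = {"root": ["A", "B"], "B": ["C", "D"]}
index = assign_nested_set(tree, "root")
```

The expensive part is that inserting or moving a node generally means renumbering much of the tree, which fits the "expensive to calculate" remark above.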
Peter From mjldehoon at yahoo.com Fri Jul 24 09:34:33 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 24 Jul 2009 02:34:33 -0700 (PDT) Subject: [Biopython-dev] Calculating motif scores Message-ID: <220087.67461.qm@web62406.mail.re1.yahoo.com> > As for the PWM being a separate class and used by the motif: > I don't know. I'm using Bio.SubsMat.FreqTable for implementing > frequency table, so I understand that the new PWM class would > be basically a "smarter" FreqTable. I'm not sure whether it > solves any problems... Wow, I didn't even know the Bio.SubsMat module existed. As we have several different but related modules (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should define the purpose and scope of each of these modules. Maybe a good way to start is the documentation. Bio.SubsMat is currently divided into two chapters (14.4 and 16.2). I'll have a look at this over the weekend to see if this can be cleaned up a bit. --Michiel. From jblanca at btc.upv.es Fri Jul 24 10:22:39 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 24 Jul 2009 12:22:39 +0200 Subject: [Biopython-dev] [Biopython] next-gen sequencing software In-Reply-To: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> References: <200907241053.15954.jblanca@btc.upv.es> <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> Message-ID: <200907241222.39608.jblanca@btc.upv.es> On Friday 24 July 2009 11:50:08 Peter wrote: > Work on improving the Biopython alignment object and introducing a > contig object is something I would like to see for the next release (once > Biopython 1.51 is out). I think that's quite necessary. Consider my code an experiment in that regard. I will be very please to discuss the details of such a class. I think that my experience with my contig implementation could be of some value. > I'm sure there is other stuff in your code that would also be very useful. 
> > If you want to contribute code to Biopython is will have to be under our > MIT style license, but in the meantime maybe you should stick an > an explicit license on your code? > > Peter I'm aware of the biopython licence. I prefer the GPL, that's why when I release code on my own I use it. But if some of my code could be useful to the Biopython community I have no problem with releasing under the MIT. -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Fri Jul 24 10:40:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:40:44 +0100 Subject: [Biopython-dev] [Biopython] next-gen sequencing software In-Reply-To: <200907241222.39608.jblanca@btc.upv.es> References: <200907241053.15954.jblanca@btc.upv.es> <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> <200907241222.39608.jblanca@btc.upv.es> Message-ID: <320fb6e00907240340v4d4fdb9dge48458edfa085122@mail.gmail.com> On Fri, Jul 24, 2009 at 11:22 AM, Jose Blanca wrote: > On Friday 24 July 2009 11:50:08 Peter wrote: >> Work on improving the Biopython alignment object and introducing a >> contig object is something I would like to see for the next release (once >> Biopython 1.51 is out). > > I think that's quite necessary. Consider my code an experiment in that regard. > I will be very please to discuss the details of such a class. I think that my > experience with my contig implementation could be of some value. Absolutely :) >> I'm sure there is other stuff in your code that would also be very useful. >> >> If you want to contribute code to Biopython is will have to be under our >> MIT style license, but in the meantime maybe you should stick an >> an explicit license on your code? 
>> >> Peter > > I'm aware of the biopython licence. I prefer the GPL, that's why when I > release code on my own I use it. But if some of my code could be useful to > the Biopython community I have no problem with releasing under the MIT. Great :) For reference, http://biopython.org/DIST/LICENSE Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 10:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:48:04 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > Hi all at Biopython (and EMBOSS-dev CC'd), > > Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. > As I mentioned on the Biopython mailing list a week ago, in particular I'd > like to make sure we agree on the various FASTQ variants. I'm waiting > for EMBOSS to update the documentation on their website, but as I > recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test > this afternoon, they are using: > > fastq - FASTQ where the qualities are ignored (useful for input?) > fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33 > fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64 > fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64 > > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). > ... I was however surprised that using "fastq" as an output format > in EMBOSS seqret gives quality strings of double quote characters. 
To be more precise, it looks like "fastq" as an output format in EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html In any case, it would still make sense to include "fastq-sanger" as an alias for the Sanger standard FASTQ files in Biopython's SeqIO, especially if BioPerl is also going to use that name (to be confirmed): http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 12:40:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 13:40:55 +0100 Subject: [Biopython-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> Message-ID: <320fb6e00907240540i17f7f3f0kdf144c79ccbfdae@mail.gmail.com> On Fri, Jul 24, 2009 at 11:48 AM, Peter wrote: > > To be more precise, it looks like "fastq" as an output format in > EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html Confirmed, http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000602.html > In any case, it would still make sense to include "fastq-sanger" as > an alias for the Sanger standard FASTQ files in Biopython's SeqIO, > especially if BioPerl is also going to use that name (to be confirmed): > http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Confirmed, BioPerl will support "fastq" or "fastq-sanger" to mean the Sanger standard FASTQ files: http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030691.html I've updated Biopython's SeqIO in CVS to support "fastq-sanger" as an alias for "fastq". 
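As a rough illustration of what these format names mean in practice, the sketch below decodes a quality string under each variant. The offsets and score types are the ones listed in the quoted message above; the table, the alias dictionary and decode_qualities are hypothetical names for this example, and this is not how Biopython's SeqIO is implemented.

```python
# Format name -> (ASCII offset, score type), as listed in the thread.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, "phred"),
    "fastq-solexa": (64, "solexa"),
    "fastq-illumina": (64, "phred"),
}

# "fastq" treated as another name for the Sanger standard variant.
ALIASES = {"fastq": "fastq-sanger"}


def decode_qualities(quality_string, format_name):
    """Return the integer scores stored in a quality string.

    Only the ASCII offset is removed here; Solexa scores are returned as
    Solexa scores, not converted to PHRED.
    """
    name = ALIASES.get(format_name, format_name)
    offset, _score_type = FASTQ_VARIANTS[name]
    return [ord(char) - offset for char in quality_string]


print(decode_qualities("I!", "fastq"))            # Sanger: PHRED 40 and 0
print(decode_qualities("h@", "fastq-illumina"))   # Illumina 1.3+: PHRED 40 and 0
print(decode_qualities("h;", "fastq-solexa"))     # Solexa scores 40 and -5
```

Keeping the alias in a separate lookup means "fastq" and "fastq-sanger" cannot drift apart.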
Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 13:32:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:32:49 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS Message-ID: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Hi all, Peter Rice kindly said he will look into an OBF cross project mailing list, but in the meantime this has been cross posted to the Biopython, BioPerl, and EMBOSS development lists. On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields wrote: >> I'd like to get comparisons against BioPerl's new FASTQ support >> going too. To do this I'd need to know which (branch?) of BioPerl I >> should install, and I'd also like a trivial sample BioPerl script to do >> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say >> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning >> the Sanger Standard FASTQ). > > You would have to install svn (bioperl-live) if you want the refactored > fastq. That commit was within the last month. I've got SVN bioperl-live installed and apparently working :) >> i.e. Something like this four line Biopython script would be perfect: >> http://biopython.org/wiki/Reading_from_unix_pipes > > We use named parameters so it's a little more verbose. > > use Bio::SeqIO; > my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger'); > my $out = Bio::SeqIO->new(-format => 'fastq-solexa'); > while (my $seq = $in->next_seq) { $out->write_seq($seq) } > > Don't be surprised if there are still bugs lurking about, just let me know > and I'll fix 'em. I've got a bug report coming up in a second email, but the basics work :) e.g.
Using this Sanger style FASTQ file, and converting it to Solexa style http://biopython.org/SRC/biopython/Tests/Quality/example.fastq $ more example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ;;;;;;;;;;;9;7;;.7;393333 This is simple three record FASTQ file (in the Sanger format). Using EMBOSS 6.1.0: $ seqret -filter -sformat fastq-sanger -osformat fastq-solexa < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using BioPerl: $ perl bioperl_sanger2solexa.pl < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using Biopython: $ python biopython_sanger2solexa.py < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ZZZZZZZZZZZXZVZZMVZRXRRRR They all agree, except that Biopython has followed the MAQ convention of omitting the (optional) repeat of the captions on the plus lines. 
This is something I'd already asked Peter Rice about for EMBOSS (but I think we got sidetracked): http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 13:53:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:53:40 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >> >> Don't be surprised if there are still bugs lurking about, just let me >> know and I'll fix 'em. > > I've got a bug report coming up in a second email, but the basics work :) I think I have found a bug in BioPerl's conversion from fastq-solexa to fastq-sanger concerning lower quality scores. Here is an artificial Solexa file using the Solexa scores from 40 down to -5 (which I believe to be the full range expected from an instrument). $ more solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; A Solexa quality of 40 maps to ASCII 40+64 = 104, "h" A Solexa quality of -5 maps to ASCII -5+64 = 59, ";" You should find this example has Solexa scores 40, 39, .., -4, -5. This file is in the Biopython repository under biopython/Tests/Quality Here is the conversion using MAQ (with the chomp fix from Tim Yu to remove an extra "!" 
character, see the maq-help mailing list for 10 July 2009): http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help $ perl fq_all2std.pl sol2std < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" Here is the Biopython conversion, which is identical: $ python biopython_solexa2sanger.py < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" EMBOSS 6.1.0 has a rounding issue with negative Solexa scores, and the last six qualities are up by one - Peter Rice is aware of this, and has a fix: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000596.html $ seqret -filter -sformat fastq-solexa -osformat fastq-sanger < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+*)(''&%%$$##""" Now we come to BioPerl, $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+++*)(''&&&&%%%% You look fine for the higher qualities, but there is something really wrong for the lower scores (not just the negative ones). 
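For reference, the MAQ and Biopython output above is consistent with the usual Solexa to PHRED formula, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded to the nearest integer. A minimal sketch, assuming that formula and plain round-to-nearest (the function names are made up for this example, and this is not the code of any of the tools discussed):

```python
from math import log10


def solexa_to_phred(solexa):
    """Convert a Solexa score to the nearest integer PHRED score."""
    return int(round(10 * log10(10 ** (solexa / 10.0) + 1)))


def solexa_to_sanger(quality_string):
    """Re-encode a Solexa FASTQ quality string (offset 64) as Sanger (offset 33)."""
    return "".join(
        chr(solexa_to_phred(ord(char) - 64) + 33) for char in quality_string
    )


# Same range as the faked example: Solexa scores 40 down to -5.
solexa_scores = list(range(40, -6, -1))
solexa_string = "".join(chr(score + 64) for score in solexa_scores)

phred_scores = [solexa_to_phred(score) for score in solexa_scores]
print(phred_scores[-16:])   # [10, 10, 9, 8, 7, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
print(solexa_to_sanger(solexa_string))
```

The two scales agree for high scores and only drift apart below about 10, so a rounding or truncation difference shows up at the low end of the range first.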
I'll leave you to double check the details, but here are the Sanger PHRED qualities decoded into integers (using Biopython to convert from "fastq-sanger" to "qual" output): $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4 $ perl fq_all2std.pl sol2std < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1 Peter C. P.S. This is the BioPerl script I am using here: $ more bioperl_solexa2sanger.pl use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-solexa'); my $out = Bio::SeqIO->new(-format => 'fastq-sanger'); while (my $seq = $in->next_seq) { $out->write_seq($seq) }; From biopython at maubp.freeserve.co.uk Fri Jul 24 15:12:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 16:12:57 +0100 Subject: [Biopython-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> Message-ID: <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> On Fri, Jul 24, 2009 at 2:53 PM, Peter wrote: > On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >>> >>> Don't be surprised if there are still bugs lurking about, just let me >>> know and I'll fix 'em. >> >> I've got a bug report coming up in a second email, but the basics work :) > > I think I have found a bug in BioPerl's conversion from fastq-solexa > to fastq-sanger concerning lower quality scores. Next up is an issue with BioPerl converting from Sanger to Illumina. 
In principle this is simple - the quality strings both use PHRED scores just with different offsets. With lower PHRED scores, everything is fine: $ more sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this is an example constructed by hand to cover a broad range of valid scores, and can be found in the Biopython repository under biopython/Tests/Quality $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN +Test PHRED qualities from 40 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ python biopython_sanger2illumina.py < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ So, BioPerl and Biopython (and EMBOSS) agree - apart from the repeating second title on the plus line. I understand that EMBOSS will in future omit the repeated title on the plus line: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html Now, here comes the problem. I believe FASTQ files directly from an Illumina 1.3+ pipeline will have PHRED scores in the range 0 to 40 (as in this example). However, much higher PHRED scores are possible during assembly / contig'ing and read mapping. For example, the tool MAQ will output Sanger style FASTQ files with PHRED scores in the range 0 to 93 inclusive. Now, in the Sanger FASTQ format, PHRED scores of 0 to 93 map onto ASCII values of 33 to 126 (! to ~). There is a reason for stopping at 126, since ASCII 127 is "delete". However, in the Illumina 1.3+ FASTQ format, PHRED scores of 0 to 93 would map to ASCII values of 64 to 157, which includes a lot of non printing characters. Working with such files at the command line or in an editor is a big problem. Clearly, Illumina never intended to include such high scores in their FASTQ files! 
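The arithmetic behind this is small enough to check directly. A sketch, assuming 126 ("~") is the highest printable ASCII code and using made-up helper names:

```python
HIGHEST_PRINTABLE = 126  # "~"; ASCII 127 is the delete control character


def ascii_range(offset, lowest_score, highest_score):
    """ASCII codes used when scores in the given range are stored with an offset."""
    return offset + lowest_score, offset + highest_score


def highest_printable_score(offset):
    """Largest score that still maps to a printable ASCII character."""
    return HIGHEST_PRINTABLE - offset


print(ascii_range(33, 0, 93))         # Sanger: (33, 126), i.e. "!" to "~"
print(ascii_range(64, 0, 93))         # Illumina 1.3+: (64, 157)
print(highest_printable_score(33))    # 93
print(highest_printable_score(64))    # 62
```

So with an offset of 64, any score above 62 lands on ASCII 127 or beyond.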
Nevertheless, it is possible to write a FASTQ format following the Illumina 1.3+ encoding with these values. Biopython and EMBOSS attempt to do this - although I would regard throwing an error as equally acceptable. So, here is another hand constructed example of a Sanger style FASTQ file using the full quality range: $ more sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN + ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this example is in the Biopython repository under biopython/Tests/Quality Just to check: $ python biopython_sanger2qual.py < sanger_93.fastq >Test PHRED qualities from 93 to 0 inclusive 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 So, here we go - apologies for the expected line mangling: $ seqret -filter -sformat fastq-sanger -osformat fastq-illumina < sanger_93.fastq | hexdump -C -v 00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual| 00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to| 00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT| 00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT| 00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |GACTGACTG.ACTGAC| 00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |TGACTGACTGACTGAC| 00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |TGACTGACTGAN.+Te| 00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |st..............| 000000a0 90 8f 8e 8d 8c 8b 8a 89 88 
87 86 85 84 83 82 81 |................|
000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|{zyxwvutsrq|
000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |ponmlkjihgfedcb.|
000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\[ZYXWVUTSR|
000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |QPONMLKJIHGFEDCB|
000000f0 41 40 0a |A@.|
000000f3

$ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |GACTGACTGAN.+...|
00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c |................|
000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c |.............~}||
000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |{zyxwvutsrqponml|
000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |kjihgfedcba`_^]\|
000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |[ZYXWVUTSRQPONML|
000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |KJIHGFEDCBA@.|
000000ed

Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to 64 in decimal, which after subtracting the Illumina offset of 64, gives PHRED scores of 93 to 0 as desired.
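The same re-encoding can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Biopython or EMBOSS code: the function name and the strict flag are invented here, with strict=True standing in for the "throw an error" behaviour mentioned earlier.

```python
def sanger_to_illumina(quality_string, strict=False):
    """Shift a Sanger quality string (offset 33) to Illumina 1.3+ bytes (offset 64).

    With strict=True (a hypothetical option), scores that would need a
    non-printable byte raise ValueError instead of being written.
    """
    encoded = bytearray()
    for char in quality_string:
        phred = ord(char) - 33
        value = phred + 64
        if strict and value > 126:
            raise ValueError("PHRED %d is not printable with offset 64" % phred)
        encoded.append(value)
    return bytes(encoded)


# Sanger quality string covering PHRED 93 down to 0, as in sanger_93.fastq.
sanger_93 = "".join(chr(33 + phred) for phred in range(93, -1, -1))
illumina = sanger_to_illumina(sanger_93)
print(hex(illumina[0]), hex(illumina[-1]))   # 0x9d 0x40
```

Returning bytes rather than a text string avoids any decision about how codes above 127 should be displayed, which is the part that makes such files awkward in a terminal or editor.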
Now to BioPerl, $ perl bioperl_sanger2illumina.pl < sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN +Test PHRED qualities from 93 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v ... BioPerl has output an invalid FASTQ file - it seems to omit the quality scores for the top scoring nucleotides at the start. The BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40 (in hex), giving 104 to 64 in decimal, giving PHRED values of 40 to 0. I think BioPerl should either throw an error, or output the non printing characters as done by Biopython and EMBOSS. Regards, Peter C. (@Biopython) From mjldehoon at yahoo.com Sat Jul 25 15:28:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 25 Jul 2009 08:28:35 -0700 (PDT) Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) Message-ID: <311853.75944.qm@web62407.mail.re1.yahoo.com> Hi everybody, Over the weekend I was looking at Bio.SubsMat and its documentation. There are a few points in Bio.SubstMat that would be handled differently in modern Python, but I'd thought I'd raise them here first before I make any changes: 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at the level of Bio.SubsMat). I think that these should be class variables of the Bio.SubsMat.SeqMat class. 2) The print_mat method. It would be more Pythonic to use __str__, __format__ for this, though the latter is only available for Python versions >= 2.6. 3) The __sum__ method. I guess that this was intended to be __add__? 4) The sum_letters attribute. 
To calculate the sum of all values for a given letter, currently the following two functions are involved:

    def all_letters_sum(self):
        for letter in self.alphabet.letters:
            self.sum_letters[letter] = self.letter_sum(letter)

    def letter_sum(self,letter):
        assert letter in self.alphabet.letters
        sum = 0.
        for i in self.keys():
            if letter in i:
                if i[0] == i[1]:
                    sum += self[i]
                else:
                    sum += (self[i] / 2.)
        return sum

As you can see, the result is not returned, but stored in an attribute called sum_letters. I suggest to replace this with the following:

    def sum(self):
        result = {}
        for letter in self.alphabet.letters:
            result[letter] = 0.0
        for pair, value in self:
            i1, i2 = pair
            if i1==i2:
                result[i1] += value
            else:
                result[i1] += value / 2
                result[i2] += value / 2
        return result

so without storing the result in an attribute.

Any comments, objections?

--Michiel

--- On Fri, 7/24/09, Michiel de Hoon wrote:
> From: Michiel de Hoon
> Subject: Re: [Biopython-dev] Calculating motif scores
> To: "Bartek Wilczynski"
> Cc: biopython-dev at biopython.org
> Date: Friday, July 24, 2009, 5:34 AM
>
> > As for the PWM being a separate class and used by the
> motif:
> > I don't know. I'm using Bio.SubsMat.FreqTable for
> implementing
> > frequency table, so I understand that the new PWM
> class would
> > be basically a "smarter" FreqTable. I'm not sure
> whether it
> > solves any problems...
>
> Wow, I didn't even know the Bio.SubsMat module existed.
> As we have several different but related modules
> (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should
> define the purpose and scope of each of these modules.
> Maybe a good way to start is the documentation. Bio.SubsMat
> is currently divided into two chapters (14.4 and 16.2). I'll
> have a look at this over the weekend to see if this can be
> cleaned up a bit.
>
> --Michiel.
>
>
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From idoerg at gmail.com Sat Jul 25 20:57:59 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 25 Jul 2009 13:57:59 -0700 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: References: <311853.75944.qm@web62407.mail.re1.yahoo.com> Message-ID: I'm the author of subsmat IIRC. Everything sounds good, but I would not make 2.6 changes that will break on 2.5. Ubuntu still uses 2.5 and I imagine other linux distros do too. Thanks, Iddo Would code those in myself, but I'm moving. Iddo Friedberg http://iddo-friedberg.net/contact.html On Jul 25, 2009 8:35 AM, "Michiel de Hoon" wrote: Hi everybody, Over the weekend I was looking at Bio.SubsMat and its documentation. There are a few points in Bio.SubstMat that would be handled differently in modern Python, but I'd thought I'd raise them here first before I make any changes: 1) The matrix types (NOTYPE = 0, ACCREP = 1, OBSFREQ = 2, SUBS = 3, EXPFREQ = 4, LO = 5) are now global variables (at the level of Bio.SubsMat). I think that these should be class variables of the Bio.SubsMat.SeqMat class. 2) The print_mat method. It would be more Pythonic to use __str__, __format__ for this, though the latter is only available for Python versions >= 2.6. 3) The __sum__ method. I guess that this was intended to be __add__? 4) The sum_letters attribute. To calculate the sum of all values for a given letter, currently the following two functions are involved: def all_letters_sum(self): for letter in self.alphabet.letters: self.sum_letters[letter] = self.letter_sum(letter) def letter_sum(self,letter): assert letter in self.alphabet.letters sum = 0. for i in self.keys(): if letter in i: if i[0] == i[1]: sum += self[i] else: sum += (self[i] / 2.) 
return sum As you can see, the result is not returned, but stored in an attribute called sum_letters. I suggest to replace this with the following: def sum(self): result = {} for letter in self.alphabet.letters: result[letter] = 0.0 for pair, value in self: i1, i2 = pair if i1==i2: result[i1] += value else: result[i1] += value / 2 result[i2] += value / 2 return result so without storing the result in an attribute. Any comments, objections? --Michiel --- On Fri, 7/24/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [Biopython-dev] Calculating motif scores > To: "Bartek Wilczynski" > Cc: biopython-dev at biopython.org > Date: Friday, July 24, 2009, 5:34 AM > > > As for the PWM being a separate class and used by the > motif: > > I don't know. I'm using Bio.SubsMat.FreqTable for > implementing > > frequency table, so I understand that the new PWM > class would > > be basically a "smarter" FreqTable. I'm not sure > whether it > > solves any problems... > > Wow, I didn't even know the Bio.SubsMat module existed. > As we have several different but related modules > (Bio.Motif, Bio.SubstMat, Bio.Align), I think we should > define the purpose and scope of each of these modules. > Maybe a good way to start is the documentation. Bio.SubsMat > is currently divided into two chapters (14.4 and 16.2). I'll > have a look at this over the weekend to see if this can be > cleaned up a bit. > > --Michiel. 
> > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Sat Jul 25 21:12:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Jul 2009 22:12:26 +0100 Subject: [Biopython-dev] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> Message-ID: <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote: > >> Now, here comes the problem. I believe FASTQ files directly >> from an Illumina 1.3+ pipeline will have PHRED scores in the >> range 0 to 40 (as in this example). However, much higher >> PHRED scores are possible during assembly / contig'ing >> and read mapping. For example, the tool MAQ will output >> Sanger style FASTQ files with PHRED scores in the range >> 0 to 93 inclusive. > > Is this behavior documented anywhere, specifically by Illumina (that values > can exceed 40)? If Illumina 1.3 is specified as being PHRED 0-40, and > another (non-Illumina) software package pushes that limit above the > specified range of Illumina values, I would consider that unfortunately yet > another variant. > > We can support it as Illumina 1.3, but my point is this may getting into a > grey area and may be something that Illumina doesn't/wouldn't support. > Reminds me a little of the multiple GFF2 variations (one of the main > reasons for a GFF3). 
I agree this is an grey area (high scores in Solexa/Illumina FASTQ files). >> Now, in the Sanger FASTQ format, PHRED scores of 0 to >> 93 map onto ASCII values of 33 to 126 (! to ~). There is a >> reason for stopping at 126, since ASCII 127 is "delete". >> >> However, in the Illumina 1.3+ FASTQ format, PHRED >> scores of 0 to 93 would map to ASCII values of 64 to >> 157, which includes a lot of non printing characters. >> Working with such files at the command line or in an >> editor is a big problem. Clearly, Illumina never intended >> to include such high scores in their FASTQ files! > > Exactly. > >> Nevertheless, it is possible to write a FASTQ format >> following the Illumina 1.3+ encoding with these values. >> Biopython and EMBOSS attempt to do this - although I >> would regard throwing an error as equally acceptable. >> >> So, here is another hand constructed example of a >> Sanger style FASTQ file using the full quality range: >> >> ... >> >> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree >> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to >> 64 in decimal, which after subtracting the Illumina offset of 64, gives >> PHRED scores of 93 to 0 as desired. >> >> Now to BioPerl, >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq >> ... >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v >> ... >> >> BioPerl has output an invalid FASTQ file - it seems to omit the >> quality scores for the top scoring nucleotides at the start. The >> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40 >> (in hex), giving 104 to 64 in decimal, giving PHRED values of >> 40 to 0. I think BioPerl should either throw an error, or output >> the non printing characters as done by Biopython and EMBOSS. > > If this is accepted as common practice between BioPython and EMBOSS > we will follow similarly. I do think it's worth at least a warning for the > reasons outlined above (e.g. 
it likely isn't Illumina's intent to support qual > values outside the specified range). Might be worth checking into. True. I think what EMBOSS and Biopython are doing is reasonable (although a warning in this situation makes sense). Equally, an error is a valid option. However, one question is when would you issue the warning/error? For a PHRED score above 40? (Assuming we have a definitive reference for Illumina using just 0 to 40). How about if a problem character would result? Since ASCII 64+63=127, the first problem character would be for PHRED score 63. i.e. An Illumina FASTQ format file can hold PHRED scores in the range 0 to 62 without using problem characters. And likewise for a Solexa FASTQ file (Solexa scores up to 62). > From this it could be summarized that converting to sanger format is least > problematic, as possible issues may be encountered when converting to the > other variants. Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93 while using nice ASCII characters - this means it is suitable for both raw reads and processed data from assemblies or read mappings. In my personal experience, Solexa/Illumina FASTQ files tend to get converted into the Sanger FASTQ format for downstream analysis (e.g. the MAQ tool, or the NCBI short read archive). i.e. Writing high quality reads (i.e. above PHRED 40) to Solexa or Illumina FASTQ files is unlikely. > We'll need to fix the solexa quality calculations in the BioPerl > parser as noted in your previous post; I'll work on that. Great.
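To make the offset arithmetic in this thread concrete, here is a minimal sketch in plain Python. It is not Biopython, BioPerl or EMBOSS code, and the function names are made up for illustration. It assumes only the figures quoted above: an ASCII offset of 33 for Sanger FASTQ, 64 for Illumina 1.3+ FASTQ, and ASCII 126 (~) as the last printable character.

```python
# Assumed offsets, taken from the discussion above.
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64
LAST_PRINTABLE = 126  # "~"; ASCII 127 is "delete"


def encode_phred(scores, offset):
    """Turn a list of PHRED scores into a quality string (hypothetical helper)."""
    return "".join(chr(score + offset) for score in scores)


def decode_phred(quality, offset):
    """Turn a quality string back into a list of PHRED scores."""
    return [ord(char) - offset for char in quality]


def max_printable_phred(offset):
    """Highest PHRED score that still maps to a printable ASCII character."""
    return LAST_PRINTABLE - offset


if __name__ == "__main__":
    print(max_printable_phred(SANGER_OFFSET))
    print(max_printable_phred(ILLUMINA_OFFSET))
    print(repr(encode_phred([40, 0], ILLUMINA_OFFSET)))
```

A warning or error policy could be hung off `max_printable_phred`, but whether the cut-off should be 40 or 62 is exactly the open question raised above.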
Peter From biopython at maubp.freeserve.co.uk Sat Jul 25 21:18:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Jul 2009 22:18:41 +0100 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: <311853.75944.qm@web62407.mail.re1.yahoo.com> References: <311853.75944.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00907251418s66499a4cy5491a27af5c1b458@mail.gmail.com> On Sat, Jul 25, 2009 at 4:28 PM, Michiel de Hoon wrote: > ... > > 2) The print_mat method. It would be more Pythonic to use __str__, > __format__ for this, though the latter is only available for Python > versions >= 2.6. You can define a __format__ method on older versions of Python, it just won't do anything. For the SeqRecord and Alignment we have already added these, and also included a format method as an alias (principally to make the functionality available on pre-Python 2.6). Using the __format__ method requires some concept of format names... The "print_mat" function sounds like it has similarities to the "pretty print" code for trees that has come up on the Tree/TreeIO thread. The existing Bio.Nexus tree object already has something like this as the "display" method. I'd have to spend some time looking at the code in more detail to comment on the other issues. Peter From biopython at maubp.freeserve.co.uk Sat Jul 25 21:21:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Jul 2009 22:21:16 +0100 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <4A6560E2.4030502@biologie.uni-kl.de> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> <4A6560E2.4030502@biologie.uni-kl.de> Message-ID: <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com> On Tue, Jul 21, 2009 at 7:32 AM, Frank Kauff wrote: > > Hi all, > > Peter wrote: >> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: >> >>> Hi all, here is my weekly update... >>> >>> 1.
Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >> >> Cool. I haven't tried it personally though ;) Frank and/or Cymon - any >> comments regarding Brad checking this in? See Bug 2788 for details. > > Not at all - you're most welcome. Thanks for dealing with it. > > Frank Sounds like you should probably check in that fix then Brad :) Peter From pmr at ebi.ac.uk Mon Jul 27 08:55:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 27 Jul 2009 09:55:43 +0100 Subject: [Biopython-dev] Open-bio cross-project issues In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <4A6D6B8F.9060108@ebi.ac.uk> Peter C. wrote (to bioperl-l, biopython-l, emboss-dev): > Hi all, > > Peter Rice kindly said he will look into an OBF cross project mailing > list, but in the meantime this has been cross posted to the Biopython, > BioPerl, and EMBOSS development lists. There is a list already for this purpose - open-bio-l. I think we will also need a cross-project wiki space on the OBF site. Is there something already used by other projects or should we set something up? I am cross-posting this to other OBF project lists to encourage developers interested in combining efforts to address common problems. This started with FASTQ short read formats, and open-bio-l (a low volume list) has also seen discussion of common test data sets. Please sign up to open-bio-l (if you are not there already) and post suggestions for cross-project issues there.
The list subscription page is: http://lists.open-bio.org/mailman/listinfo/open-bio-l Please feel free to forward this to any other projects I may have missed (I picked the obvious addresses from the lists.open-bio.org server) regards, Peter Rice From bugzilla-daemon at portal.open-bio.org Mon Jul 27 12:27:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 08:27:05 -0400 Subject: [Biopython-dev] [Bug 2788] Bio.Nexus.Trees newick parser does not support internal node labels In-Reply-To: Message-ID: <200907271227.n6RCR51v032090@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2788 chapmanb at 50mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from chapmanb at 50mail.com 2009-07-27 08:27 EST ------- Patch verified and checked in with unit tests:

Checking in Bio/Nexus/Trees.py; new revision: 1.19; previous revision: 1.18
Checking in Tests/test_Nexus.py; new revision: 1.9; previous revision: 1.8

Marking bug as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 14:48:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 10:48:55 -0400 Subject: [Biopython-dev] [Bug 2887] New: set iteration order dependency in Bio.Data.CodonTable Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2887

           Summary: set iteration order dependency in Bio.Data.CodonTable
           Product: Biopython
           Version: 1.51b
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: fwereade at googlemail.com

Running under IronPython 2.0.1 with ironclad r515 (http://code.google.com/p/ironclad )

symptoms:
---------------------------------------------
from Bio.Data import CodonTable
  File "C:\dev\biopython-1.51b\Bio\Data\CodonTable.py", line 618, in C:\dev\biopython-1.51b\Bio\Data\CodonTable.py
    assert list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values) == ['TGA', 'TAA', 'TAG', 'TAR', 'TRA']
AssertionError
---------------------------------------------

cause: set iteration order is different in IronPython (it may also be different in Jython and/or PyPy, and has the potential to change across CPython versions)

fix: make Bio.Data.CodonTable.py:618 read as follows
---------------------------------------------
assert set(list_ambiguous_codons(['TGA', 'TAA', 'TAG'],IUPACData.ambiguous_dna_values)) == set(['TGA', 'TAA', 'TAG', 'TAR', 'TRA'])
---------------------------------------------

better fix: as above, but for all similar lines (the preceding lines currently work under ipy)

just a thought: it might also be worth moving all the tests into the Tests directory, rather than running them inline every time.

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
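The failure mode in this report can be reproduced without Biopython. The sketch below uses two made-up helpers (they are not `list_ambiguous_codons`, and this is not necessarily how the bug was fixed in CVS) to contrast a function whose output order depends on set iteration with one that keeps first-seen order, and to show order-independent assertions that pass on any Python implementation.

```python
def unique_codons(codons):
    """Hypothetical helper: de-duplicates via a set, so output order is arbitrary."""
    return list(set(codons))


def unique_codons_stable(codons):
    """Same values, but in first-seen order, independent of set iteration order."""
    seen = set()
    ordered = []
    for codon in codons:
        if codon not in seen:
            seen.add(codon)
            ordered.append(codon)
    return ordered


if __name__ == "__main__":
    stops = ["TGA", "TAA", "TAG", "TAA", "TAR", "TRA", "TGA"]
    expected = ["TGA", "TAA", "TAG", "TAR", "TRA"]

    # Fragile: may pass on one interpreter and fail on another.
    #   assert unique_codons(stops) == expected

    # Robust: compare as sets, or sort both sides first.
    assert set(unique_codons(stops)) == set(expected)
    assert sorted(unique_codons(stops)) == sorted(expected)

    # Robust and list-preserving: make the function itself deterministic.
    assert unique_codons_stable(stops) == expected
```

Comparing as sets changes only the test; making the function deterministic keeps a list-returning API and lets the original style of assertion stand.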
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 15:06:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 11:06:30 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271506.n6RF6Uqj007530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-27 11:06 EST ------- Fixed in Bio/Data/CodonTable.py CVS revision 1.15 Thanks! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 27 15:08:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 11:08:11 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271508.n6RF8Be4007635@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-27 11:08 EST ------- Fixed in Bio/Data/CodonTable.py CVS revision 1.15 so marking bug as fixed. Note I opted to preserve the existing API (i.e. return lists), so didn't use your suggested fix. Please let us know if there are any other issues with IronPython. Thanks. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Jul 27 16:18:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 17:18:11 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> Message-ID: <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> On Thu, Jul 23, 2009 at 12:08 PM, Peter wrote: > > The FASTQ checking is on going. I have updated our FASTQ > code to map Solexa scores as I understood Peter Rice's > description of the intended EMBOSS behaviour (this is for the > corner case of very poor quality reads). However, due to a > couple of minor bugs I found in EMBOSS 6.1.0 we'll either > have to cross check against their CVS code, or hope they > release EMBOSS 6.1.1 soon. > > Cross checking against MAQ would also be worthwhile, but > while there are some patches about to fix a couple of MAQ > FASTQ bugs and include Illumina to Sanger standard > conversion, this isn't in their official repository yet. > > I guess I could cross check against BioPerl's new FASTQ > support ... The FASTQ cross-validation is on going, as you may have gathered from the cross-project thread (now on open-bio-l) I did start testing against BioPerl SVN which uncovered some BioPerl problems, and a grey area of the format worth debate. See also: http://lists.open-bio.org/pipermail/open-bio-l/2009-July/ This is taking longer than I had expected, but think it will be worth the effort. Peter P.S. Anyone care to guess on how EMBOSS, BioPerl, and Biopython's FASTQ parsing stacks up in terms of run time? 
From bugzilla-daemon at portal.open-bio.org Mon Jul 27 16:41:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Jul 2009 12:41:02 -0400 Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable In-Reply-To: Message-ID: <200907271641.n6RGf2M8011652@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2887 ------- Comment #3 from fwereade at googlemail.com 2009-07-27 12:41 EST ------- Sweet! I think that's the fastest bugfix I've ever seen :-). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mhampton at d.umn.edu Mon Jul 27 17:05:05 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 27 Jul 2009 12:05:05 -0500 (CDT) Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: I am wondering if there is already an interface to the Phylip programs in biopython. I am pretty sure there is not, but I wanted to ask before doing a chunk of work on one. I know that AlignIO can read and write the phylip alignment files, but I think that is it. Assuming such a thing doesn't already exist, I will write some functions for calling various combinations of programs in phylip to make some common tasks easier. Mostly this will use the pexpect module. What is the most appropriate place to put such an interface within biopython? 
Thanks, Marshall Hampton Integrated Biosciences Program and the Department of Mathematics and Statistics University of Minnesota Duluth From biopython at maubp.freeserve.co.uk Mon Jul 27 17:24:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 18:24:57 +0100 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com> On Mon, Jul 27, 2009 at 6:05 PM, Marshall Hampton wrote: > > I am wondering if there is already an interface to the Phylip programs in > biopython. I am pretty sure there is not, but I wanted to ask before doing > a chunk of work on one. I know that AlignIO can read and write the phylip > alignment files, but I think that is it. > > Assuming such a thing doesn't already exist, I will write some functions for > calling various combinations of programs in phylip to make some common tasks > easier. Mostly this will use the pexpect module. What is the most > appropriate place to put such an interface within biopython? I really wouldn't go down the route of trying to wrap the original PHYLIP tools, it would involve piping simulated keypresses to stdin - very tricky (even if the python module pexpect is wonderful). I would instead wrap the EMBOSS packaged versions of the PHYLIP suite, which have proper command line interfaces with switches etc. In this case, Bio/Emboss/Applications.py would be the file to look at. However, something I have been discussing with Peter Rice at EMBOSS is using their ACD files (which define the EMBOSS tools command line interfaces) to automatically generate the Biopython wrappers.
Peter From eric.talevich at gmail.com Mon Jul 27 17:56:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 27 Jul 2009 13:56:40 -0400 Subject: [Biopython-dev] GSoC Weekly Update 10: PhyloXML for Biopython Message-ID: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Hi folks, Previously (July 20-24) I: Finished implementing I/O methods, Tree classes and tests for all phyloXML elements. Changed Writer to preserve node order in the XML; output now validates under the phyloXML 1.00 schema (but 1.10 complains) Did some drastic code reorganization. - Bio.Tree: - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree classes - Made Clade inherit from BaseTree.Tree in addition to BaseTree.Node, and added the corresponding attributes - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML - Bio.TreeIO: - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new Bio.TreeIO module, and updated imports everywhere - Added wrappers for Nexus read/write; doesn't return Bio.Tree objects yet though Added/updated unit tests for all of this. Documented the code reorg on the Biopython wiki, adding Tree and TreeIO pages and fixing the examples on the PhyloXML page. Scrubbed docstrings and enabled epydoc processing. This week (July 27-31) I will: Finish implementing the phyloXML spec: - Scan "simple types" for restricted tokens; check strings in constructors - Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?) 
- Clean up and reorganize any code that needs it Enhancements (time permitting): - Improve the SeqRecord conversion - Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extension - Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree, Bioperl node objects, PyCogent, p4-phylogenetics - Tree method: build_index (set left_idx, right_idx on all nodes): - calculate left/right indexes for nested-set representation - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html - Export to networkx (http://networkx.lanl.gov/) -- also get graphviz export for free, via networkx.to_agraph() Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. - PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? 
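The build_index idea in the enhancements list (left_idx and right_idx for a nested-set representation) can be sketched with a toy tree. The `Node` class below is a stand-in invented for illustration; it is not the Bio.Tree.BaseTree API, and the numbering convention (starting at 1, one counter shared by left and right indexes) is an assumption.

```python
class Node:
    """Toy tree node (hypothetical; not the Bio.Tree.BaseTree class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.left_idx = None
        self.right_idx = None


def build_index(root):
    """Assign nested-set left/right indexes with one depth-first walk."""
    counter = 1

    def visit(node):
        nonlocal counter
        node.left_idx = counter      # numbered on the way down
        counter += 1
        for child in node.children:
            visit(child)
        node.right_idx = counter     # numbered again on the way back up
        counter += 1

    visit(root)
    return root


def is_descendant(node, ancestor):
    """With nested sets, ancestry is a pair of integer comparisons."""
    return ancestor.left_idx < node.left_idx and node.right_idx < ancestor.right_idx


if __name__ == "__main__":
    a, b, c = Node("A"), Node("B"), Node("C")
    inner = Node("AB", [a, b])
    root = build_index(Node("root", [inner, c]))
    for n in (root, inner, a, b, c):
        print(n.name, n.left_idx, n.right_idx)
```

The recursive walk is fine for a sketch; very deep trees would need an explicit stack to stay under Python's recursion limit.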
Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jul 27 19:43:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 20:43:09 +0100 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com> Message-ID: <320fb6e00907271243q16d7a5efnca5873faaee3937f@mail.gmail.com> On Mon, Jul 27, 2009 at 7:42 PM, Marshall Hampton wrote: > > Thanks Peter, I was unaware of the EMBOSS versions of PHYLIP. I don't > think using pexpect to wrap the originals is really that hard - I have some > working fine already - but now I see its almost pointless. I don't like the > EMBOSS dependence, but it sounds like you are already working on > getting rid of that. I'm not quite sure what you are saying. Biopython doesn't depend on EMBOSS, we just have some optional code to interact with EMBOSS. If you want to run the PHYLIP tools from Python, you are going to have to install PHYLIP or EMBOSS anyway. The EMBOSS version is (I think) far more useful, and there is lots of other useful stuff in EMBOSS as well, so I really don't see a problem with recommending EMBOSS. Right now we have parsers and wrappers for some of the EMBOSS tools, and I would like to have more. Generating the wrappers (semi) automatically would be a step forward as we currently have wrappers for only about ten of the EMBOSS tools (hand picked based on people actually wanting to use them from within Biopython). 
Peter From mhampton at d.umn.edu Mon Jul 27 18:42:21 2009 From: mhampton at d.umn.edu (Marshall Hampton) Date: Mon, 27 Jul 2009 13:42:21 -0500 (CDT) Subject: [Biopython-dev] Phylip interface questions In-Reply-To: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com> References: <320fb6e00907271024y6894bc7ch92d4411ee3784a48@mail.gmail.com> Message-ID: Thanks Peter, I was unaware of the EMBOSS versions of PHYLIP. I don't think using pexpect to wrap the originals is really that hard - I have some working fine already - but now I see it's almost pointless. I don't like the EMBOSS dependence, but it sounds like you are already working on getting rid of that. Cheers, Marshall On Mon, 27 Jul 2009, Peter wrote: > On Mon, Jul 27, 2009 at 6:05 PM, Marshall Hampton wrote: >> >> I am wondering if there is already an interface to the Phylip programs in >> biopython. I am pretty sure there is not, but I wanted to ask before doing >> a chunk of work on one. I know that AlignIO can read and write the phylip >> alignment files, but I think that is it. >> >> Assuming such a thing doesn't already exist, I will write some functions for >> calling various combinations of programs in phylip to make some common tasks >> easier. Mostly this will use the pexpect module. What is the most >> appropriate place to put such an interface within biopython? > > I really wouldn't go down the route of trying to wrap the original PHYLIP > tools, it would involve piping simulated keypresses to stdin - very tricky > (even if the python module pexpect is wonderful). > > I would instead wrap the EMBOSS packaged versions of the PHYLIP > suite, which have proper command line interfaces with switches etc. > In this case, Bio/Emboss/Applications.py would be the file to look at. > However, something I have been discussing with Peter Rice at EMBOSS > is using their ACD files (which define the EMBOSS tools command line > interfaces) to automatically generate the Biopython wrappers.
> > Peter > From chapmanb at 50mail.com Mon Jul 27 22:12:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:12:02 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython In-Reply-To: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> References: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Message-ID: <20090727221202.GC68751@sobchak.mgh.harvard.edu> Hi Eric; Thanks for taking on this reorganization into Tree/TreeIO. That is turning out really nice and provides a great general framework for plugging other phylogeny modules into. > - Bioperl's phyloXML driver was written for version 1.00 and might hurl > if given a v1.10 file -- so that's a potential problem if Biopython > defaults to writing v1.10 files. Should Writer take a option to specify the > file format version number? Right now it only writes valid phyloXML v1.00. I tend to agree with Mark and Hilmar's assessment; PhyloXML is in development right now so we want to push towards the latest version. Reading Christian's summary of changes: http://phyloxml.blogspot.com/2009/06/proposed-changes-and-additions-for.html it seems like much of this is fixes. It would be worth pinging BioPerl to be sure someone will handle updates to the latest version but otherwise I would go with what is easiest. You want to be careful not to get trapped in version purgatory. > - Above, I've listed more enhancements than I'll probably be able to finish > this week. Which should have higher priority? I know merging Bio.Nexus > and Bio.Tree would be the most useful, but since (1) Biopython > development still happens on CVS, not Git, and (2) another Tree-based > GSoC project is expected to land around the same time as mine, I think > doing the integration right now would be kind of painful. 
So I can focus > either on laying the groundwork in Bio.Tree.BaseTree, copying rather than > moving the relevant Nexus code, or else work mainly on exporting to other > useful object representations like networkx graphs, or any Biopython > classes I've missed (e.g. alignments). Suggestions? What are you most interested in? You've certainly earned the right to work on what you think may be most useful to you in the future. Any of the listed projects are a good step forward. If you really really want my votes, they are for adding common tree manipulation methods to the base Tree class and working towards PhyloDB storage compatibility. Brad From chapmanb at 50mail.com Mon Jul 27 22:44:06 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:44:06 -0400 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> Message-ID: <20090727224406.GE68751@sobchak.mgh.harvard.edu> Hi Peter; > The FASTQ cross-validation is on going, as you may have > gathered from the cross-project thread (now on open-bio-l) > I did start testing against BioPerl SVN which uncovered > some BioPerl problems, and a grey area of the format > worth debate. See also: > http://lists.open-bio.org/pipermail/open-bio-l/2009-July/ > > This is taking longer than I had expected, but think it will be > worth the effort. Glad you are tackling this -- fleshing out the incompatibilities is tough work but will save a lot of headaches for people in the future. > P.S. Anyone care to guess on how EMBOSS, BioPerl, and > Biopython's FASTQ parsing stacks up in terms of run time? We better be the fastest. 
Everyone knows that C code is bloated and slow. In terms of 1.51 and beyond, I've got two things: - SQLite support: I'd love to push this in now for 1.51. If we have a working version that people can test on, it'll encourage adoption for the next BioSQL release. - GFF parsing: The code is revamped to be more SeqIO like based on the discussion you, Michiel and I had earlier, and the documentation is in progress. I'll plan to get this in post-1.51 so people can work with it in git and find bugs. Brad From chapmanb at 50mail.com Mon Jul 27 22:34:16 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jul 2009 18:34:16 -0400 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: <20090727223416.GD68751@sobchak.mgh.harvard.edu> Hi Marshall; > I am wondering if there is already an interface to the Phylip programs in > biopython. I am pretty sure there is not, but I wanted to ask before > doing a chunk of work on one. I know that AlignIO can read and write the > phylip alignment files, but I think that is it. > > Assuming such a thing doesn't already exist, I will write some functions > for calling various combinations of programs in phylip to make some common > tasks easier. Mostly this will use the pexpect module. What is the most > appropriate place to put such an interface within biopython? I did a lot of work with Phylip a while back, and generally interfacing with it is hideous looking but not impossible. I would create input files with the data for all of the menu items, and then feed this into the program. Then you need to handle renaming the generically named output files. 
Here's a chunk of it to give you the idea:

    pars_outfile = os.path.join(work_dir, "outgroup_phy.pars")
    pars_tree_outfile = os.path.join(work_dir, "outgroup_phy.parstree")
    hack_phylip_file = os.path.join(work_dir, "protpars.hack")
    hack_output = "%s\nM\nD\n%s\n13\n10\nO\n1\nY\n" % (align_file, num_boot)
    hack_handle = open(hack_phylip_file, "w")
    hack_handle.write(hack_output)
    hack_handle.close()
    cl = PhylipHackCommandline("protpars", hack_phylip_file)
    Application.generic_run(cl)
    os.rename("outfile", pars_outfile)
    os.rename("outtree", pars_tree_outfile)

I would second Peter in using the EMBOSS interfaces to Phylip. There are ones already in Biopython for protdist, neighbor, protpars, consense and seqboot: http://github.com/biopython/biopython/blob/master/Bio/Emboss/Applications.py

Why do you prefer the pexpect module for running applications? From a quick glance, the subprocess module included in Python should let you do most of what you can do with pexpect and it doesn't require an extra install.

Finally, I am not as plugged in on the latest in phylogeny building but is Phylip still in favor? I know there has been a lot of work on Maximum Likelihood and Bayesian methods, like:

FastTree: http://www.microbesonline.org/fasttree/index.html
RAxML: http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
MrBayes: http://mrbayes.csit.fsu.edu/

In terms of Python support for these, Frank Kauff has some things to deal with RAxML: http://www.lutzonilab.net/downloads/

The latest PyCogent had support for FastTree and I believe they also tackle RAxML: http://pycogent.svn.sourceforge.net/viewvc/pycogent/trunk/cogent/app/fasttree.py?revision=333&view=markup

Hope this helps.
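On the subprocess point, a minimal sketch of feeding pre-written menu answers to a program over stdin follows. It assumes a modern Python 3 (`subprocess.run` with `capture_output`), which is newer than anything discussed in this thread, and it uses a tiny Python one-liner as a stand-in for the interactive program, since the PHYLIP binaries and their menu keys are not reproduced here. The helper name `run_with_menu_input` is made up.

```python
import subprocess
import sys


def run_with_menu_input(argv, answers):
    """Run a program, sending one menu answer per line on stdin.

    Returns whatever the program wrote to stdout. Hypothetical helper,
    not part of Biopython.
    """
    text = "\n".join(answers) + "\n"
    result = subprocess.run(
        argv,
        input=text,            # sent to the child's stdin
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str rather than bytes
        check=True,            # raise CalledProcessError on non-zero exit
    )
    return result.stdout


# Stand-in for an interactive tool: echoes its stdin in upper case.
STAND_IN = [sys.executable, "-c",
            "import sys; print(sys.stdin.read().upper(), end='')"]

if __name__ == "__main__":
    print(run_with_menu_input(STAND_IN, ["infile.phy", "m", "d", "y"]))
```

A real wrapper would also run in a scratch directory and rename the generically named output files afterwards, as described above.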
Glad to have someone thinking about these questions, Brad From winda002 at student.otago.ac.nz Tue Jul 28 01:56:46 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 28 Jul 2009 13:56:46 +1200 Subject: [Biopython-dev] Phylip interface questions In-Reply-To: References: Message-ID: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz> Hi Marshall, I am wondering if there is already an interface to the Phylip programs in biopython. I am pretty sure there is not, but I wanted to ask before doing a chunk of work on one. I know that AlignIO can read and write the phylip alignment files, but I think that is it. Assuming such a thing doesn't already exist, I will write some functions for calling various combinations of programs in phylip to make some common tasks easier. Mostly this will use the pexpect module. I wrote a few for my own use (I presumed no one else was doing stuff like this) which I've now uploaded as a module (Bio.Phylo) here: http://github.com/dwinter/biopython/tree/phylo They are for the 'new phylip' version ('f' prefixed not 'e') in EMBOSS's 'embassy' packages (which take different arguments than the classes already in the EMBOSS module...). They also depend on the cool stuff that Brad and Peter have done for applications in biopython 1.51. Hopefully they will cover some of the same ground that you want to, or at least prevent you having to start from scratch. (There's also support for PhyML which is based on phylip's dnaml but is much faster.) Cheers, David From biopython at maubp.freeserve.co.uk Tue Jul 28 09:17:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 10:17:06 +0100 Subject: [Biopython-dev] Next release plans?
In-Reply-To: <20090727224406.GE68751@sobchak.mgh.harvard.edu> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907280217obb767b6wffdc4c029bbab651@mail.gmail.com> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: > In terms of 1.51 and beyond, I've got two things: > > - SQLite support: I'd love to push this in now for 1.51. If we have > a working version that people can test on, it'll encourage > adoption for the next BioSQL release. > I would be OK with including this once Hilmar adds the SQLite schema to the BioSQL repository. I'd prefer him to do a point release of BioSQL first, but as long as this is going to happen at some point that is fine. Let's bring this up again on the BioSQL mailing list... If Hilmar isn't keen to rush, we *could* ship it anyway with Biopython, but it should then be clearly labelled as a prototype schema which may be subject to change. > - GFF parsing: The code is revamped to be more SeqIO like based > on the discussion you, Michiel and I had earlier, and the > documentation is in progress. I'll plan to get this in post-1.51 > so people can work with it in git and find bugs. Definitely post-1.51, note that EMBOSS 6.1.0 now has some support for GFF and features in GenBank, so we can hopefully use that as a reference implementation. i.e. Once we add GFF parsing to SeqIO, this should let Biopython convert from GFF to SeqRecord objects to GenBank, and we can compare this to EMBOSS.
Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 09:26:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 10:26:30 +0100
Subject: [Biopython-dev] Phylip interface questions
In-Reply-To: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz>
References: <20090728135646.11011zn7r4bfzqam@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00907280226n786ae91fy6df4ed1a73aa7bbe@mail.gmail.com>

On Tue, Jul 28, 2009 at 2:56 AM, David Winter wrote:
> Hi Marshall,
>
> Assuming such a thing doesn't already exist, I will write some functions for
> calling various combinations of programs in phylip to make some common tasks
> easier. Mostly this will use the pexpect module.
> I wrote a few for my own use (I presumed no one else was doing stuff like
> this) which I've now uploaded as module (Bio.Phylo) here:
>
> http://github.com/dwinter/biopython/tree/phylo
>
> They are for the 'new phylip' version ('f' prefixed not 'e') in EMBOSS's
> 'embassy' packages (which take different arguments than the classes already
> in the EMBOSS module...). They also depend on the cool stuff that Brad and
> Peter have done for applications in biopython 1.51. Hopefully they will
> cover some of the same ground that you want to, or at least prevent you
> having to start from scratch.

Cool. I would double check that _EmbossCommandLine is still appropriate - especially with regard to the outfile parameter. I changed a few things recently for seqret (which doesn't have an outfile parameter).

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 11:19:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 12:19:15 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
Message-ID: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>

Hi all,

As a possible enhancement to Bio.SeqIO, I've been toying with the idea of introducing another function, essentially to provide the following functionality:

def convert(in_handle, in_format, out_handle, out_format, alphabet=None) :
    """Converts between two file formats, returns number of records."""
    records = parse(in_handle, in_format, alphabet)
    return write(records, out_handle, out_format)

As implied by this reference implementation above, this would be a convenience or helper function which would allow simple conversion scripts to save a line, e.g.

import sys
from Bio import SeqIO
records = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(records, sys.stdout, "fastq")

becomes:

import sys
from Bio import SeqIO
SeqIO.convert(sys.stdin, "fastq-solexa", sys.stdout, "fastq")

Now some people might find that in itself a small improvement, but it does make the API a little more complex (feature creep).

However, that isn't the real aim here. Having a function like this would allow a number of file format specific optimisations - instead of using SeqIO.parse to create SeqRecord objects which get converted by SeqIO.write as shown above.

For example, converting GenBank or EMBL to FASTA (or tab), we don't need most of the annotation, so creating all those SeqFeature objects is a waste of time and memory. The GenBank/EMBL parser already has (buried) an option to skip the features, and a Bio.SeqIO.convert function would be able to exploit this.

Likewise, converting any of the FASTQ formats to FASTA (which I think will be a fairly common task) can be sped up greatly by ignoring the quality scores, and even more so by never creating Seq and SeqRecord objects. I've tested this particular example, and it is massively faster (about five times faster in fact, which means it actually beats the current version of EMBOSS seqret - which is cool).
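
A minimal sketch of how such a convert function might dispatch to format-specific fast paths and otherwise fall back to parse/write. The names `_FAST_CONVERTERS` and `_fastq_to_fasta` are hypothetical, the FASTQ reader assumes strict four-line records with no line wrapping, and the generic fallback assumes Biopython is installed. It is an illustration of the idea only, not the code that was benchmarked:

```python
import io


def _fastq_to_fasta(in_handle, out_handle):
    """Hypothetical fast path: copy titles and sequences, ignore qualities.

    Assumes strict four-line FASTQ records (no line wrapping).
    """
    count = 0
    while True:
        title_line = in_handle.readline()
        if not title_line:
            break  # end of input
        sequence = in_handle.readline().rstrip("\n")
        in_handle.readline()  # '+' separator line, ignored
        in_handle.readline()  # quality line, ignored
        if not title_line.startswith("@"):
            raise ValueError("FASTQ record should start with '@'")
        out_handle.write(">%s\n%s\n" % (title_line[1:].rstrip("\n"), sequence))
        count += 1
    return count


# Lookup table of special-case converters, keyed by (input, output) format.
_FAST_CONVERTERS = {
    ("fastq", "fasta"): _fastq_to_fasta,
}


def convert(in_handle, in_format, out_handle, out_format):
    """Convert between two file formats, returning the number of records."""
    fast = _FAST_CONVERTERS.get((in_format, out_format))
    if fast is not None:
        return fast(in_handle, out_handle)
    # No special case defined: fall back to the generic parse/write route.
    from Bio import SeqIO  # imported lazily so the fast path needs no Biopython
    return SeqIO.write(SeqIO.parse(in_handle, in_format), out_handle, out_format)


# Small demonstration using in-memory handles.
demo_out = io.StringIO()
demo_count = convert(io.StringIO("@r1\nACGT\n+\nIIII\n"), "fastq", demo_out, "fasta")
```

Any fast path added to such a table would need a unit test checking that it produces exactly the same output as the parse/write route.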
Likewise, conversions between FASTQ formats (in particular Solexa to Sanger, and Illumina to Sanger) are also going to be common tasks which are currently something of a bottleneck. Again, this can be made faster by avoiding using Seq and SeqRecord objects within a convert function.

What I have in mind is a lookup table of special case optimised converters (e.g. FASTQ to FASTA). If there is no special case defined, the convert function would default to the SeqIO parse/write code shown above. We would need a good set of unit tests to ensure these optimised converters did produce exactly the same output as the parse/write solution.

Of course, if we have bottlenecks in the SeqIO parsing and writing code, it would be worthwhile to fix them - rather than writing a special case converter. Maybe to avoid the gradual build up of too many specialised converters, we might ask as a rule of thumb that it be at least three times faster than using parse/write?

Any thoughts? Would this all just make SeqIO too complicated?

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 12:07:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 13:07:23 +0100
Subject: [Biopython-dev] Next release plans?
In-Reply-To: <20090727224406.GE68751@sobchak.mgh.harvard.edu>
References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com>

On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote:
> Hi Peter;
>
>> The FASTQ cross-validation is ongoing, as you may have
>> gathered from the cross-project thread (now on open-bio-l)
>> I did start testing against BioPerl SVN which uncovered
>> some BioPerl problems, and a grey area of the format
>> worth debate. See also:
>> http://lists.open-bio.org/pipermail/open-bio-l/2009-July/
>>
>> This is taking longer than I had expected, but think it will be
>> worth the effort.
>
> Glad you are tackling this -- fleshing out the incompatibilities
> is tough work but will save a lot of headaches for people in
> the future.

Absolutely.

>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and
>> Biopython's FASTQ parsing stacks up in terms of run time?
>
> We better be the fastest. Everyone knows that C code is bloated
> and slow.

I'm pretty sure that was tongue in cheek, but if you were being mean you probably could describe some of the EMBOSS infrastructure as bloat. In any case, I'm sure that EMBOSS can be made faster now that speed matters here with next generation sequencing, see:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html

And I've got bad news for you then - currently EMBOSS seqret is about twice as fast as CVS Biopython SeqIO (measuring parsing versus writing is a bit tricky).
However, I have a cunning plan: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Peter From pmr at ebi.ac.uk Tue Jul 28 12:40:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 28 Jul 2009 13:40:43 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <320fb6e00907100538x22518497pda4dbe816b5798f7@mail.gmail.com> <320fb6e00907230408v32d2e6efr34f86a9fc5162e11@mail.gmail.com> <320fb6e00907270918o502fd0c7mfa3add332433412f@mail.gmail.com> <20090727224406.GE68751@sobchak.mgh.harvard.edu> <320fb6e00907280507s4575c6f4v60409efd39a9c4aa@mail.gmail.com> Message-ID: <4A6EF1CB.7000800@ebi.ac.uk> Peter wrote: > On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: >>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >>> Biopython's FASTQ parsing stacks up in terms of run time? >> >> We better be the fastest. Everyone knows that C code is bloated >> and slow. > > I pretty sure that was tongue in check, but if you were being mean > you probably could describe some of the EMBOSS infrastructure > as bloat. In any case, I'm sure that EMBOSS can be made faster > now that speed matters here with next generation sequencing, see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html EMBOSS code is indeed bloated and slow in some places - for example on output it constructs a sequence output object from the input sequence. However, it's C ... if we know what we're doing we can tell the machine to go faster. Unless the compiler decides it can optimise us away... Certainly this is a place where using reference-counted strings shows gains. We tend to avoid them in EMBOSS because early experience in optimising had them being deleted at the 'wrong' times and leaving us with no significant improvement in performance. 
Sequence output looks like a good place for them. We can also simplify the sequence output objects to avoid some of the reset operations when reusing the objects. > And I've got bad news for you then - currently EMBOSS seqret > is about twice as fast as CVS Biopython SeqIO (measuring parsing > versus writing is a bit tricky). However, I have a cunning plan: > http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Worse news, I can find some speedups in EMBOSS ... though the split is about 40% in output and 60% in input CPU time. I/O time is another issue where we could play with blocked reads ... though when I tried that some time ago it seemed the operating systems and file systems were doing a grand job and it was hard to get a consistent speed gain even for one specific system. regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 28 12:51:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 13:51:08 +0100 Subject: [Biopython-dev] FASTQ parsing speed in EMBOSS Message-ID: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com> I've retitled this and CC'ed it to the EMBOSS dev list - which is probably a better place for this now! On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice wrote: > Peter wrote: >> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: > >>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >>>> Biopython's FASTQ parsing stacks up in terms of run time? >>> >>> We better be the fastest. Everyone knows that C code is bloated >>> and slow. >> >> I pretty sure that was tongue in check, but if you were being mean >> you probably could describe some of the EMBOSS infrastructure >> as bloat. 
In any case, I'm sure that EMBOSS can be made faster >> now that speed matters here with next generation sequencing, see: >> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html > > EMBOSS code is indeed bloated and slow in some places - for example on > output it constructs a sequence output object from the input sequence. > However, it's C ... if we know what we're doing we can tell the machine > to go faster. Unless the compiler decides it can optimise us away... > > Certainly this is a place where using reference-counted strings shows > gains. We tend to avoid them in EMBOSS because early experience in > optimising had them being deleted at the 'wrong' times and leaving us > with no significant improvement in performance. Sequence output looks > like a good place for them. > > We can also simplify the sequence output objects to avoid some of the > reset operations when reusing the objects. > >> And I've got bad news for you then - currently EMBOSS seqret >> is about twice as fast as CVS Biopython SeqIO (measuring parsing >> versus writing is a bit tricky). However, I have a cunning plan: >> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html > > Worse news, I can find some speedups in EMBOSS ... though > the split is about 40% in output and 60% in input CPU time. Well, it is only bad news from the point of view of Biopython bragging rights ;) And with those speed ups, I guess my fast lower level Biopython FASTQ to FASTA script will now be about the same speed as seqret! See: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Nice work! > I/O time is another issue where we could play with blocked > reads ... though when I tried that some time ago it seemed > the operating systems and file systems were doing a grand > job and it was hard to get a consistent speed gain even for > one specific system. Maybe best avoided, given EMBOSS is truly cross platform. Peter C. 
From biopython at maubp.freeserve.co.uk Tue Jul 28 13:14:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 14:14:52 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com>
Message-ID: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com>

On Tue, Jul 28, 2009 at 12:19 PM, Peter wrote:
>
> Any thoughts? Would this all just make SeqIO too complicated?
>

The idea of the Bio.SeqIO.convert function was twofold:
(1) Syntactic sugar (and for this alone I wouldn't add it)
(2) Faster file format conversion (e.g. for scripts or pipelines)

While we could clearly outperform EMBOSS 6.1.0 on FASTQ to FASTA, given the possible speed ups Peter Rice is reporting for EMBOSS seqret, it looks like this will change shortly:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006496.html

I don't see any real point in trying to compete with EMBOSS for simple file conversion if in general seqret will be faster (and on the next release of EMBOSS, it should be).

The real benefit of using Bio.SeqIO for any file format conversion (rather than seqret) is that this lets the user add their own conditional filters or modifications as needed. And for this, my proposed function Bio.SeqIO.convert() doesn't help in any way.

So, unless anyone pipes up, I probably won't pursue this.

Finally, if anyone is interested, this was my idea for the high speed FASTQ to FASTA conversion - as a proof of principle script using standard input and standard output at the command line:

#High performance FASTQ to FASTA conversion for short reads.
#This uses the low level FASTQ parser in Biopython 1.50 or
#later. This avoids Bio.SeqIO and the associated overheads
#of object creation and decoding the FASTQ quality string.
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
#This just returns tuples of three strings from FASTQ:
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write(">%s\n" % title)
    #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
    for i in range(0, len(sequence), 60):
        write(sequence[i:i+60] + "\n")

If you don't want line wrapping, the code is two lines shorter, and even faster:

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write(">%s\n%s\n" % (title, sequence))

Peter

From mjldehoon at yahoo.com Tue Jul 28 12:55:33 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Jul 2009 05:55:33 -0700 (PDT)
Subject: [Biopython-dev] Bio.SeqIO.convert function?
Message-ID: <988956.8355.qm@web62404.mail.re1.yahoo.com>

> Of course, if we have bottlenecks in the SeqIO parsing
> and writing code, it would be worthwhile to fix
> them - rather than writing a special case converter. Maybe
> to avoid the gradual build up of too many specialised
> converters, we might ask as a rule of thumb that it be
> at least three times faster than using parse/write?
>

I have no fundamental objection, but we should first try to speed up the current GenBank parser and see if the specialized converter is still more than three times faster.

--Michiel

From biopython at maubp.freeserve.co.uk Tue Jul 28 13:19:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 14:19:45 +0100
Subject: [Biopython-dev] Bio.SeqIO.convert function?
In-Reply-To: <988956.8355.qm@web62404.mail.re1.yahoo.com> References: <988956.8355.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e00907280619y1493ec19vdf00543cb45fc8d5@mail.gmail.com> On Tue, Jul 28, 2009 at 1:55 PM, Michiel de Hoon wrote: > >> Of course, if we have bottlenecks in the SeqIO parsing >> and writing code, it would be worthwhile of course to fix >> them - rather than writing a special case converter. Maybe >> to avoid the gradual build up of too many specialised >> converters, we might ask as a rule of thumb that it be >> at least three times faster than using parse/write? > > I have no fundamental objection, but we should first try > to speed up the current GenBank parser and see if the > specialized converter is still more than three times faster. I can already in principle make the current GenBank parser up to four times faster - I was working on this before all the FASTQ stuff and would hope to see this in Biopython 1.52, http://bugzilla.open-bio.org/show_bug.cgi?id=2738 Even with a change like that to speed up feature location parsing, it would still be faster still to skip the features in a GenBank or EMBL file completely. Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 14:47:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 15:47:57 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> Message-ID: <320fb6e00907280747g29beec82lef221e297895a097@mail.gmail.com> On Tue, Jul 28, 2009 at 2:14 PM, Peter wrote: > Finally, if anyone is interested, this was idea for the high speed > FASTQ to FASTA conversion - as a proof of principle script > using standard input and standard output at the command line: > > #High performance FASTQ to FASTA conversion for short reads. 
> #This uses the low level FASTQ parser in Biopython 1.50 or
> #later. This avoids Bio.SeqIO and the associated overheads
> #of object creation and decoding the FASTQ quality string.
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> #This just returns tuples of three strings from FASTQ:
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>     write(">%s\n" % title)
>     #Wrap at 60 characters (as done by Bio.SeqIO FASTA):
>     for i in range(0, len(sequence), 60):
>         write(sequence[i:i+60] + "\n")
>
> If you don't want line wrapping, the code is two lines shorter,
> and even faster:
>
> import sys
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> write = sys.stdout.write #avoid repeated attribute lookups
> for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
>     write(">%s\n%s\n" % (title, sequence))
>
> Peter

And here is a similar high performance script for mapping Solexa FASTQ to Sanger FASTQ:

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator, phred_quality_from_solexa
from string import maketrans
solexa = "".join(chr(64+q) for q in range(-5,62+1))
sanger = "".join(chr(int(round(33+phred_quality_from_solexa(q)))) \
                 for q in range(-5,62+1))
mapping = maketrans(solexa, sanger)
write = sys.stdout.write #avoid repeated attribute lookups
for title, sequence, quality in FastqGeneralIterator(sys.stdin) :
    write("@%s\n%s\n+\n%s\n" % (title, sequence, quality.translate(mapping)))

The same basic idea works equally well for mapping between any of the three FASTQ variants, and the speed is very similar to the FASTQ to FASTA script, taking about 1/5 of the time taken by SeqIO parse/write for this.

I'm still investigating how to make the SeqIO parsing/writing faster.
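
A minimal sketch of the same translation-table idea written for Python 3, where `string.maketrans` no longer exists and `str.maketrans` takes its place. To stay self-contained it does not import Biopython; the helper `phred_from_solexa` is a local stand-in (an assumption of this sketch) using the usual Solexa-to-PHRED formula, and `solexa_quality_to_sanger` is a hypothetical name:

```python
import math


def phred_from_solexa(solexa_quality):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (solexa_quality / 10.0) + 1)


# Solexa scores from -5 to 62 are stored with an ASCII offset of 64,
# Sanger (PHRED) scores with an ASCII offset of 33.
SOLEXA_RANGE = range(-5, 62 + 1)
solexa_chars = "".join(chr(64 + q) for q in SOLEXA_RANGE)
sanger_chars = "".join(chr(int(round(33 + phred_from_solexa(q))))
                       for q in SOLEXA_RANGE)

# One table built up front; each quality string is then mapped in a
# single call without decoding individual scores.
SOLEXA_TO_SANGER = str.maketrans(solexa_chars, sanger_chars)


def solexa_quality_to_sanger(quality):
    """Map a Solexa-encoded quality string to Sanger encoding."""
    return quality.translate(SOLEXA_TO_SANGER)
```

The point of the design is that the per-score arithmetic happens only 68 times when the table is built, regardless of how many reads are converted.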
When I get an updated version of EMBOSS installed, I intend to profile it against these scripts ;) Peter From eric.talevich at gmail.com Tue Jul 28 15:49:29 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 28 Jul 2009 11:49:29 -0400 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> Message-ID: <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> Hi Peter, On Tue, Jul 28, 2009 at 9:14 AM, Peter wrote: > On Tue, Jul 28, 2009 at 12:19 PM, Peter > wrote: > > > > Any thoughts? Would this all just make SeqIO too complicated? > > > > The idea of the Bio.SeqIO.convert function was two fold: > (1) Syntactic sugar (and for this alone I wouldn't add it) > (2) Faster file format conversion (e.g. for scripts or pipelines) > > This would be nice if it was implemented in AlignIO and TreeIO, too. The naming is pretty intuitive, and the concept is general, so I don't think it makes the API any more difficult to understand. (Personally, I like having a sugary API to use inside ipython.) But the main reason I piped up was that some time ago, we observed that some popular Python libraries have functions that can accept either an open file handle or a file name, and do the right thing. The xml.etree module in the standard lib does this by checking if the 'file' argument has a 'read' method, and if not, trying to open it. I didn't see any reason for Bio.TreeIO to be any fussier than the standard library, so... http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NexusIO.py Implementing this for SeqIO.convert() (or ideally, read/parse/write on all the *IO modules) would make it very nice for files other than stdin and stdout -- otherwise, the user needs to open and maybe close two file handles before calling convert(). What do you think? 
Cheers, Eric From biopython at maubp.freeserve.co.uk Tue Jul 28 16:04:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 17:04:48 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <320fb6e00907280614l78472493n8dca7b308b6b97df@mail.gmail.com> <3f6baf360907280849o68b03873n652f883f775f8f75@mail.gmail.com> Message-ID: <320fb6e00907280904k11e0f197qb931d622474eeb69@mail.gmail.com> On Tue, Jul 28, 2009 at 4:49 PM, Eric Talevich wrote: > Hi Peter, > > On Tue, Jul 28, 2009 at 9:14 AM, Peter wrote: > >> On Tue, Jul 28, 2009 at 12:19 PM, Peter >> wrote: >> > >> > Any thoughts? Would this all just make SeqIO too complicated? >> > >> >> The idea of the Bio.SeqIO.convert function was two fold: >> (1) Syntactic sugar (and for this alone I wouldn't add it) >> (2) Faster file format conversion (e.g. for scripts or pipelines) >> > This would be nice if it was implemented in AlignIO and TreeIO, too. The > naming is pretty intuitive, and the concept is general, so I don't think it > makes the API any more difficult to understand. (Personally, I like having a > sugary API to use inside ipython.) OK - fair point. And yes, if we added it to Bio.SeqIO, it would make sense to add a similar function to Bio.AlignIO and the nascent Bio.TreeIO module too. If combined with allowing filenames in place of handles, then yes, it makes one line file conversion very convenient too. On the more general issue of filenames versus handles, I think I'll reply on a new thread though... Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 16:34:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 17:34:48 +0100 Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc? 
Message-ID: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com>

Hi all,

Eric just reopened an old debate - should Bio.SeqIO (and similar) support filenames as well as handles?

In fact, this is something we originally discussed when planning SeqIO way back in Nov 2006. Michiel and I were at the time generally in favour of allowing filename/handles, but Iddo Friedberg (who at that time was basically in charge) and Chris Lasher didn't like this. It would have broken with the existing Biopython parsers which were all handle only. After a little debate, we opted to support just handles, knowing we could if need be later allow filenames instead.

[Other things which with hindsight I am very glad Michiel, Iddo, Chris etc talked me out of were "guessing" the file format based on the filename or its contents.]

I had written up a draft email on this topic a couple of months ago, to raise this issue (which I can't find right now) which went over some of the downsides - other than complicating what is currently a nice clean API. I never sent it because after thinking about it, I was happy with handles only. I guess I'll have to retype my objections as they come back to me.

On the thread about a possible Bio.SeqIO.convert function, Eric wrote:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006501.html
> But the main reason I piped up was that some time ago, we observed that
> some popular Python libraries have functions that can accept either an
> open file handle or a file name, and do the right thing. The xml.etree
> module in the standard lib does this by checking if the 'file' argument
> has a 'read' method, and if not, trying to open it. I didn't see any reason
> for Bio.TreeIO to be any fussier than the standard library, so...
>
> http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NexusIO.py

First of all, I would argue Bio.TreeIO should be consistent with Bio.SeqIO and Bio.AlignIO with respect to handles vs filenames.
If we do agree to support filenames or handles, then I would keep all the Bio.ModuleIO.SubModule code using handles only, and put the boilerplate (repeated) handle/filename code in the Bio.ModuleIO functions only. This is (a) less work, and (b) less code duplication. After all, the code in the modules under Bio.SeqIO (and similar) is rarely used directly.

Other top level parsers, like Bio.Entrez.read() might then also deserve the filename/handle treatment. As a bonus, Bio.Nexus would cease to be an oddity as it does this already.

> Implementing this for SeqIO.convert() (or ideally, read/parse/write on all
> the *IO modules) would make it very nice for files other than stdin and
> stdout -- otherwise, the user needs to open and maybe close two file handles
> before calling convert().
>
> What do you think?

From an end user point of view, especially when working directly at the python prompt interactively, being able to give filenames would be nicer. This will also make lots of the examples in the tutorial shorter and simpler, because we don't have to do things like closing output handles (because the SeqIO.write() function would do it for us). There is a minor downside that Python beginners won't necessarily get to grips with handles so quickly.

There is a cost, in that lots of parser code will need to check if it has a filename and if so open it. For output code this is a little more complex, as the writer function must also close the file afterwards.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 16:48:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 17:48:50 +0100
Subject: [Biopython-dev] Newick support in Bio.TreeIO?
Message-ID: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com>

Hi Eric,

If you wanted a good multi-tree example file format for TreeIO, I would suggest plain Newick trees.
I am familiar with plain text files which contain one Newick tree per line (with a terminating semi-colon), although in principle they could be wrapped over many lines. The neighbour joining (NJ) tree software QuickJoin from Thomas Mailund can certainly output this kind of file. I would expect to be able to read and write such multi-tree Newick files using Bio.TreeIO. http://www.daimi.au.dk/~mailund/quick-join.html The obvious application of this (which I have used personally), was to generate bootstrap trees on multiple machines in a cluster (or cores on a single machine), e.g. 100 instances each of 10 bootstrap trees, giving in total 1000 trees (which are then used either to build a consensus, or allocate bootstrap support to the randomised master tree). I wrote some code in python to do this bootstrapping step using the splits defined by each edge (i.e. the two sets of nodes you get if the edge was severed), which I represented using bit arrays, for use as keys in a dictionary mapping the splits to the master tree's edges. Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 17:04:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 18:04:52 +0100 Subject: [Biopython-dev] Bio.SubstMat (was: Re: Calculating motif scores) In-Reply-To: References: <311853.75944.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00907281004y1c597d68k5548840e3a792687@mail.gmail.com> On Sat, Jul 25, 2009 at 9:57 PM, Iddo Friedberg wrote: > I'm the author of subsmat IIRC. Everything sounds good, but I would not make > 2.6 changes that will break on 2.5. Ubuntu still uses 2.5 and I imagine > other linux distros do too. Plus we are still supporting Biopython on Python 2.4, having only recently dropped support for Python 2.3 ;) The current Ubuntu with long term support (LTS) is 8.04 (hardy), and that uses Python 2.5. However, the latest Ubuntu (jaunty) and the in development one (karmic) are already using Python 2.6. 
Biopython will often get used on clusters and servers (not just desktops), and these tend to get upgraded less often. Our cluster is still running Python 2.4 for example. Peter From chapmanb at 50mail.com Tue Jul 28 22:09:43 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Jul 2009 18:09:43 -0400 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> Message-ID: <20090728220943.GJ68751@sobchak.mgh.harvard.edu> Hi Peter; > As a possible enhancement to Bio.SeqIO, I've been toying with > the idea of introducing another function, essentially to provide > the following functionality: > > def convert(in_handle, in_format, out_handle, out_format, alphabet=None) : > """Converts between two file formats, returns number of records.""" > records = parse(in_handle, in_format, alphabet) > return write(records, out_handle, out_format) [...] > However, that isn't the real aim here. Having a function like this > would allow a number of file format specific optimisations - > instead of using SeqIO.parse to create SeqRecord objects > which get converted by SeqIO.write as shown above. I like this idea. To the extent in which we can optimize popular conversions, this gives us a standard place to put it. There is going to be lots of fastq to fasta conversion and being as fast as possible is good (notice my avoidance of any more potentially misconstrued jokes). Conversion lately seems to be getting worse, not better, with all of the alignment and annotation formats springing up. Extending this to AlignIO and TreeIO as Eric suggested is also great. So +1 from me, Brad From chapmanb at 50mail.com Tue Jul 28 22:17:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Jul 2009 18:17:26 -0400 Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc? 
In-Reply-To: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com>
References: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com>
Message-ID: <20090728221726.GK68751@sobchak.mgh.harvard.edu>

Hey all;

> Eric just reopened an old debate - should Bio.SeqIO (and similar)
> support filenames as well as handles?
>
> In fact, this is something we originally discussed when planning
> SeqIO way back in Nov 2006. Michiel and I were at the time generally
> in favour of allowing filename/handles, but Iddo Friedberg (who at that
> time was basically in charge) and Chris Lasher didn't like this. It would
> have broken with the existing Biopython parsers which were all handle
> only. After a little debate, we opted to support just handles, knowing we
> could if need be later allow filenames instead.

I am for file and handle support. Only dealing with handles is like so totally 2006. I did this in the GFF parser by necessity since Disco MapReduce needed files and the standard Biopython way is handles. Essentially, it checks for a read attribute and keeps track of needing to close the handle:

if hasattr(gff_file, "read"):
    need_close = False
    in_handle = gff_file
else:
    need_close = True
    in_handle = open(gff_file)

> There is a minor downside
> that Python beginners won't necessarily get to grips with handles so quickly.

Yes, that is the downside I see as well. The plus side of the same issue is that the learning curve is less steep.
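
A minimal sketch of how the need_close bookkeeping above might be carried through to the end of parsing. The function name `iter_lines` is hypothetical and simply yields lines rather than parsing GFF; the point is only that a handle is closed when, and only when, the function opened it itself:

```python
import io
import os
import tempfile


def iter_lines(source):
    """Yield lines (without newlines) from a filename or an open handle."""
    if hasattr(source, "read"):
        need_close = False  # the caller owns this handle
        in_handle = source
    else:
        need_close = True   # we open the file, so we must close it
        in_handle = open(source)
    try:
        for line in in_handle:
            yield line.rstrip("\n")
    finally:
        if need_close:
            in_handle.close()


# With an open handle: the handle is left open for the caller.
handle = io.StringIO("a\nb\n")
from_handle = list(iter_lines(handle))

# With a filename: the file is opened and closed internally.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\n")
from_file = list(iter_lines(tmp.name))
os.remove(tmp.name)
```

The try/finally means the file is also closed if the caller stops iterating early and the generator is discarded.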
Brad From bugzilla-daemon at portal.open-bio.org Wed Jul 29 00:57:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Jul 2009 20:57:54 -0400 Subject: [Biopython-dev] [Bug 2889] New: setup.py reads stdin even when stdin is not a terminal Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2889 Summary: setup.py reads stdin even when stdin is not a terminal Product: Biopython Version: 1.51b Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: sridhar.ratna at gmail.com setup.py files are *not* meant be using raw_input and other funky things that interferes with build automation. Please remove the use of raw_input() .. or, at least, use raw_input() only when stdin is a real terminal ("if sys.stdout.isatty()"). This way you could allow your package to built via automated build tools. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From matzke at berkeley.edu Wed Jul 29 04:33:46 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:33:46 -0700 Subject: [Biopython-dev] Bio.Nexus and internal node labels (Bug 2788) In-Reply-To: <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com> References: <320fb6e00907201248v7d5f057hc7a77b4357d74bb6@mail.gmail.com> <4A6560E2.4030502@biologie.uni-kl.de> <320fb6e00907251421o40e7931cj69caa7d5b00794a8@mail.gmail.com> Message-ID: <4A6FD12A.4020707@berkeley.edu> Peter wrote: > On Tue, Jul 21, 2009 at 7:32 AM, Frank Kauff wrote: >> Hi all, >> >> Peter wrote: >>> On Mon, Jul 20, 2009 at 8:13 PM, Nick Matzke wrote: >>> >>>> Hi all, here is my weekly update... >>>> >>>> 1. Bug fix on Nexus.Tree class is working well so far. Thanks Brad!! >>> Cool. 
I haven't tried it personally though ;) Frank and/or Cymon - any >>> comments regarding Brad checking this in? See Bug 2788 for details. >> Not at all - you're most welcome. Thanks for dealing with it. >> >> Frank > > Sounds like you should proably check in that fix then Brad :) > > Peter Yeah, I used the revised module for a bunch more operations, including many of the tree methods. No crashes or huge issues once I "got" how everything worked. I did have to write my own methods for what should probably eventually be basic tree methods, like deep-copying the tree, subsetting the tree based on what occurs above a given node, etc. Thanks! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From biopython at maubp.freeserve.co.uk Wed Jul 29 07:43:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 08:43:58 +0100 Subject: [Biopython-dev] Bio.SeqIO.convert function? In-Reply-To: <20090728220943.GJ68751@sobchak.mgh.harvard.edu> References: <320fb6e00907280419h71488002g119c49c6832826a1@mail.gmail.com> <20090728220943.GJ68751@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00907290043w34fba0dbx5a8b0b2c21d72af3@mail.gmail.com>

On Tue, Jul 28, 2009 at 11:09 PM, Brad Chapman wrote:
> Hi Peter;
>
>> As a possible enhancement to Bio.SeqIO, I've been toying with
>> the idea of introducing another function, essentially to provide
>> the following functionality:
>>
>> def convert(in_handle, in_format, out_handle, out_format, alphabet=None) :
>>     """Converts between two file formats, returns number of records."""
>>     records = parse(in_handle, in_format, alphabet)
>>     return write(records, out_handle, out_format)
> [...]
>> However, that isn't the real aim here. Having a function like this
>> would allow a number of file format specific optimisations -
>> instead of using SeqIO.parse to create SeqRecord objects
>> which get converted by SeqIO.write as shown above.
>
> I like this idea. To the extent in which we can optimize popular
> conversions, this gives us a standard place to put it. There is
> going to be lots of fastq to fasta conversion and being as fast as
> possible is good (notice my avoidance of any more potentially
> misconstrued jokes).

OK, assuming we press ahead with this, the Bio.SeqIO.convert() function would be the only public API addition, the internals would all be private. What I had in mind was Bio.SeqIO.convert() using a dictionary of functions (all with the same arguments), keyed on a tuple of (in_format, out_format).
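[Editor's sketch] The dictionary-of-functions idea can be illustrated roughly as follows. Everything here is hypothetical: the names _fastq_to_fasta and _converters, the fallback argument (standing in for the generic parse/write route), and the assumption that each FASTQ record occupies exactly four unwrapped lines. It is not the code that was proposed for Biopython, just one way such a dispatch could look.

```python
def _fastq_to_fasta(in_handle, out_handle):
    """Hypothetical fast path: copy titles and sequences, skip qualities.

    Assumes strictly four-line FASTQ records (no line wrapping).
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        seq = in_handle.readline()
        in_handle.readline()  # the '+' separator line, ignored
        in_handle.readline()  # the quality line, ignored
        out_handle.write(">" + title[1:].rstrip("\n") + "\n")
        out_handle.write(seq.rstrip("\n") + "\n")
        count += 1
    return count


# Optimised converters, keyed on (in_format, out_format).
_converters = {
    ("fastq", "fasta"): _fastq_to_fasta,
}


def convert(in_handle, in_format, out_handle, out_format, fallback=None):
    """Convert between two formats, returning the number of records.

    Uses an optimised function when one is registered; otherwise calls
    ``fallback`` (for example a generic parse-then-write routine).
    """
    key = (in_format, out_format)
    if key in _converters:
        return _converters[key](in_handle, out_handle)
    if fallback is None:
        raise ValueError("No converter from %s to %s" % key)
    return fallback(in_handle, in_format, out_handle, out_format)
```

Because callers only ever see convert(), entries can be added to or removed from the private dictionary later without changing the public interface.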
I was thinking of using Bio/SeqIO/_convert.py for the individual functions (like GenBank/EMBL to FASTA/tab, or any FASTQ to FASTA/tab). Note I am expecting that in many cases it will be quite simple to handle several related conversions in one function, and this should avoid some code duplication. By marking these details as private, we can of course refine this scheme later. > Conversion lately seems to be getting worse, not better, with > all of the alignment and annotation formats springing up. > Extending this to AlignIO and TreeIO as Eric suggested is > also great. Whatever we do for Bio.SeqIO, we can follow the same pattern for Bio.AlignIO etc. > So +1 from me, > Brad And we basically had a +0 from Michiel, and a +1 from Eric. And I like the idea but am not convinced we need it. Maybe we should put the suggestion forward on the main discussion list for debate? Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 29 07:46:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 03:46:25 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290746.n6T7kPIe029876@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 03:46 EST ------- (In reply to comment #0) > setup.py files are *not* meant be using raw_input and other funky things > that interferes with build automation. Have you got a reference for that? I can see why it might have a problem, but there is probably official guidance for this kind of thing. > Please remove the use of raw_input() .. or, at least, use raw_input() only > when stdin is a real terminal ("if sys.stdout.isatty()"). That makes sense. But what would you do if this is not the case? > This way you could allow your package to built via automated build tools. What tool has a problem?
All the Linux packagers manage fine. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 08:42:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 04:42:25 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290842.n6T8gPQb032600@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #2 from sridhar.ratna at gmail.com 2009-07-29 04:42 EST ------- > ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 03:46 EST ------- > (In reply to comment #0) >> setup.py files are *not* meant be using raw_input and other funky things >> that interferes with build automation. > > Have you got a reference for that? I can see why it might have a problem, > but there is probably official guidance for this kind of thing. Ok, I'll ease up on my assertions .. what I meant was it is a good practice to keep the script execution simple. See http://mail.python.org/pipermail/distutils-sig/2009-July/012832.html (last paragraph) >> Please remove the use of raw_input() .. or, at least, use raw_input() only >> when stdin is a real terminal ("if sys.stdout.isatty()"). > > That makes sense. But what would you do if this is not the case? Since your package already makes use of setuptools, I suggest you to make use of the 'extras' features in setuptools: http://peak.telecommunity.com/DevCenter/setuptools#declaring-extras-optional-features-with-their-own-dependencies If Foo depends on your package .. but also requires the numpy component, then Foo would depend upon "biopython[numpy]". 
Zope namespace packages makes use of this feature extensively (eg: zope.component[zcml]) >> This way you could allow your package to built via automated build tools. > > What tool has a problem? All the Linux packagers manage fine. PyPM (ActiveState's Python Package Manager .. analogous to PPM for Perl) is the tool that has the problem with such packages .. the resolution being to kill the build process that takes more than X number of minutes (raw_input() implies infinite execution time for no stdin). This has the unfortunate consequence of such packages becoming not part of the repository. Even if this bug is not fixed, we could patch the setup.py - but ideally I prefer this to be done in the project itself (to keep things unsophisticated). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 09:45:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 05:45:23 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290945.n6T9jNHF002902@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 05:45 EST ------- (In reply to comment #2) > > (In reply to comment #0) > >> setup.py files are *not* meant be using raw_input and other funky > >> things that interferes with build automation. > > > > Have you got a reference for that? I can see why it might have a > > problem, but there is probably official guidance for this kind of > > thing. > > Ok, I'll ease up on my assertions .. what I meant was it is a good > practice to keep the script execution simple. 
See > http://mail.python.org/pipermail/distutils-sig/2009-July/012832.html > (last paragraph) > > >> Please remove the use of raw_input() .. or, at least, use raw_input() > >> only when stdin is a real terminal ("if sys.stdout.isatty()"). > > > > That makes sense. But what would you do if this is not the case? > > Since your package already makes use of setuptools, I suggest you to > make use of the 'extras' features in setuptools: The official way to install Biopython is "python setup.py install" (i.e. using distutils). We don't do anything special to support setuptools - but it seems to work. Unfortunately, using "extras_require" or "install_requires" to make setuptools happy causes ugly UserWarning messages from distutils. > >> This way you could allow your package to built via automated > >> build tools. > > > > What tool has a problem? All the Linux packagers manage fine. > > PyPM (ActiveState's Python Package Manager .. analogous to PPM for > Perl) is the tool that has the problem with such packages .. the > resolution being to kill the build process that takes more than X > number of minutes (raw_input() implies infinite execution time for no > stdin). This has the unfortunate consequence of such packages becoming > not part of the repository. > > Even if this bug is not fixed, we could patch the setup.py - but > ideally I prefer this to be done in the project itself (to keep > things unsophisticated). The yes/no prompt using raw_input is for solely for installing without NumPy (which is still useful, but only a subset of the full Biopython), and is only shown if NumPy is not installed. This is a compile time dependency for parts of Biopython. I've updated CVS and now setup.py will abort if NumPy is not installed and we don't appear to be running in a real terminal (based on your suggestion). Could you test this please? 
Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 09:47:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 05:47:35 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907290947.n6T9lZhF003020@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 05:47 EST ------- (In reply to comment #3) > > The yes/no prompt using raw_input is for solely for installing without > NumPy (which is still useful, but only a subset of the full Biopython), > and is only shown if NumPy is not installed. This is a compile time > dependency for parts of Biopython. > > I've updated CVS and now setup.py will abort if NumPy is not installed > and we don't appear to be running in a real terminal (based on your > suggestion). > > Could you test this please? You need setup.py CVS revision 1.170, which should also be available from github within the hour: http://github.com/biopython/biopython/tree/master I could attach the new setup.py to this bug if that would be easier for you. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
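[Editor's sketch] The behaviour described in comments #3 and #4 of bug 2889 (prompt only when NumPy is missing and a real terminal is attached, otherwise abort) could look roughly like the following. This is not the actual setup.py change: the function names are hypothetical, the sketch is written for Python 3 (input() rather than raw_input()), and it checks sys.stdin rather than the sys.stdout mentioned in the report, on the assumption that stdin is what the prompt actually reads from.

```python
import sys


def can_prompt(stream=None):
    """Return True if ``stream`` (default: sys.stdin) looks like a terminal."""
    if stream is None:
        stream = sys.stdin
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


def continue_without_numpy(numpy_available, stream=None, ask=input):
    """Decide whether installation should go ahead.

    Hypothetical helper: returns True to continue, False to abort.
    The question is only asked when NumPy is missing and a terminal is
    attached, so automated builds never block waiting for an answer.
    """
    if numpy_available:
        return True
    if not can_prompt(stream):
        # No terminal to ask on: abort rather than wait forever.
        return False
    reply = ask("NumPy is not installed. Continue anyway? (y/N) ")
    return reply.strip().lower().startswith("y")
```

A build tool that runs the installer with stdin redirected from a file or pipe would then get an immediate failure instead of a hung process.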
From jblanca at btc.upv.es Wed Jul 29 12:54:07 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 29 Jul 2009 14:54:07 +0200 Subject: [Biopython-dev] [Biopython] Restriction enzyme digestion gels In-Reply-To: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> References: <4A702ACB.2080204@dcs.gla.ac.uk> <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> Message-ID: <200907291454.07300.jblanca@btc.upv.es> Hi: > There is nothing built into Biopython's graphics module for generating > fake gel images - so using matplot seems worth trying. However, I > would suggest you talk to Jose Blanca about his work first: > http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html > http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Once I needed a similar tool to represent AFLP data as a gel and I wrote the code to solve that issue. I haven't used it much because the project was cancelled due to external reasons, but the code worked. You can take a look at: http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/src/ If you have any problems with it, drop me a line. I'm sure that there will be bugs and the performance is not great, but it worked for me. At least I hope you can look at how the image is built using matplotlib. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From eric.talevich at gmail.com Wed Jul 29 15:49:22 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Jul 2009 11:49:22 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO?
In-Reply-To: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> Message-ID: <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> Hi Peter, On Tue, Jul 28, 2009 at 12:48 PM, Peter wrote: > Hi Eric, > > If you wanted a good multi-tree example file format for TreeIO, I would > suggest plain Newick trees. I am familiar with plain text files which > contain > one Newick tree per line (with a terminating semi-colon), although in > principle they could be wrapped over many lines. The neighbour joining > (NJ) tree software QuickJoin from Thomas Mailund can certainly output > this kind of file. I would expect to be able to read and write such > multi-tree > Newick files using Bio.TreeIO. > I was wondering about this in regard to Bio.Nexus. It looks like the class Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a Nexus file, which corresponds to a tree in Newick format plus a short preamble. The _tree method churns the preamble, then passes a CharBuffer (the Newick string) and some defaults to the Bio.Nexus.Trees.Tree constructor, which does the Newick parsing and creates a Tree object. After a quick glance at the Nexus original article/spec, it looks like the format is a bindle of simpler formats for various applications; most of these formats are unique to Nexus, but Newick is dropped into Nexus completely intact. So! I'm proposing that the Newick parser, currently stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the Nexus parser be changed to simply call the Newick parser from its new location. (A further refactoring of the Nexus parser would put the individual parsers for each block in separate classes or files, rather than mingled with the block-level parsing code. I can't guarantee I'll get around to that, though.) Does this bear any resemblance to your plan? 
The obvious application of this (which I have used personally), was to > generate bootstrap trees on multiple machines in a cluster (or cores on > a single machine), e.g. 100 instances each of 10 bootstrap trees, giving > in total 1000 trees (which are then used either to build a consensus, or > allocate bootstrap support to the randomised master tree). > Sounds like an incremental parse() function over these trees would be very useful for distributed bootstrap analysis etc. I don't see how Bio.Nexus currently supports this, though, beyond iterating over the 'trees' attribute, which is a list. How would a reasonable person go about this? Generate trees in Newick format rather than Nexus, run on the cluster, combine, distill, and only save the resulting master tree in Newick format (or even phyloXML)? If the Newick parser is separated from Nexus, then this wouldn't be too difficult to support. > I wrote some code in python to do this bootstrapping step using the > splits defined by each edge (i.e. the two sets of nodes you get if the > edge was severed), which I represented using bit arrays, for use as > keys in a dictionary mapping the splits to the master tree's edges. > > I would be interested to see this. Thanks, Eric P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the default argument for skiplist is a list with two characters in it. If skiplist is altered, this would persist across subsequent calls, wouldn't it? From biopython at maubp.freeserve.co.uk Wed Jul 29 16:16:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 17:16:57 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? 
In-Reply-To: <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> Message-ID: <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich wrote: > Hi Peter, > > On Tue, Jul 28, 2009 at 12:48 PM, Peter > wrote: >> >> Hi Eric, >> >> If you wanted a good multi-tree example file format for TreeIO, I would >> suggest plain Newick trees. I am familiar with plain text files which >> contain one Newick tree per line (with a terminating semi-colon), >> although in principle they could be wrapped over many lines. The >> neighbour joining (NJ) tree software QuickJoin from Thomas Mailund >> can certainly output this kind of file. I would expect to be able to read >> and write such multi-tree Newick files using Bio.TreeIO. > > I was wondering about this in regard to Bio.Nexus. It looks like the class > Bio.Nexus.Nexus calls its _tree method when it encounters a tree block in a > Nexus file, which corresponds to a tree in Newick format plus a short > preamble. The _tree method churns the preamble, then passes a CharBuffer > (the Newick string) and some defaults to the Bio.Nexus.Trees.Tree > constructor, which does the Newick parsing and creates a Tree object. > > After a quick glance at the Nexus original article/spec, it looks like the > format is a bindle of simpler formats for various applications; most of > these formats are unique to Nexus, but Newick is dropped into Nexus > completely intact. So! I'm proposing that the Newick parser, currently > stashed inside Bio.Nexus.Trees, be moved to Bio.TreeIO.NewickIO, and the > Nexus parser be changed to simply call the Newick parser from its new > location. > > (A further refactoring of the Nexus parser would put the individual parsers > for each block in separate classes or files, rather than mingled with the > block-level parsing code. 
I can't guarantee I'll get around to that, > though.) > > Does this bear any resemblance to your plan? No - but probably only because I didn't fancy restructuring Bio.Nexus ;) We can already call the Newick tree parser directly, so it doesn't have to be moved (although we could do). [In case you hadn't seen it, the current version of the Tutorial has a tiny example using this at the end of a ClustalW example in the Alignment chapter.] Bio.TreeIO.parse() should be an iterator, returning complete tree objects one by one. I was thinking of having Bio.TreeIO.NewickIO just take a plain text file, split it up at the ";\n" characters (or similar) to get each tree as a string, which is passed to Bio.Nexus.Trees.Tree to parse it. I'd never read the original Nexus publication which describes the file format (my University didn't subscribe to that journal). However, it appears to have been digitised and made freely available since then: http://sysbio.oxfordjournals.org/cgi/reprint/46/4/590 It looks like the NEXUS format allows explicit handling of multiple trees within the NEXUS block structure. Note that this is quite different to the simple concatenated plain text Newick files I was talking about. i.e. the "nexus" and "newick" formats in Bio.TreeIO do both deal with Newick trees, but they are held in different container formats (i.e. a NEXUS file, or plain text). >> The obvious application of this (which I have used personally), was to >> generate bootstrap trees on multiple machines in a cluster (or cores on >> a single machine), e.g. 100 instances each of 10 bootstrap trees, giving >> in total 1000 trees (which are then used either to build a consensus, or >> allocate bootstrap support to the randomised master tree). > > Sounds like an incremental parse() function over these trees would be > very useful for distributed bootstrap analysis etc. Exactly. And Bio.TreeIO.read() would be for the special case where the file format contains exactly one tree. 
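[Editor's sketch] The parse()/read() split described here, for plain text files of concatenated Newick trees, might be sketched as below. The function names are hypothetical and the sketch only yields the tree strings; handing each string to a real tree parser is left out. It also assumes that a semicolon never appears inside a quoted label or a comment, and the choice of ValueError for the "not exactly one tree" case is this sketch's own.

```python
def split_newick(handle):
    """Yield one Newick string per tree, without the trailing semicolon.

    Trees may be wrapped over several lines; only one tree's text is
    held in memory at a time.
    """
    parts = []
    for line in handle:
        while ";" in line:
            head, line = line.split(";", 1)
            parts.append(head.strip())
            tree = "".join(parts)
            parts = []
            if tree:
                yield tree
        parts.append(line.strip())
    # Tolerate a final tree that lacks its terminating semicolon.
    tree = "".join(parts)
    if tree:
        yield tree


def read_single(handle):
    """Return the only tree in the handle, or raise ValueError."""
    trees = split_newick(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle") from None
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")
```

With this shape, read_single() is just a thin wrapper over the iterator, so each new format would only need its own iterator function.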
> I don't see how Bio.Nexus currently supports this, though, beyond > iterating over the 'trees' attribute, which is a list. As far as I know, Bio.Nexus just parses a whole file in one go. This means either Bio.TreeIO.NexusIO would call this and then loop over the list (very memory inefficient), or it would need a minimal Nexus parser just to spot the TREES block, and handle them only. > How would a reasonable person go about this? > Generate trees in Newick format rather than Nexus, run on the cluster, > combine, distill, and only save the resulting master tree in Newick format > (or even phyloXML)? If the Newick parser is separated from Nexus, then > this wouldn't be too difficult to support. For the example workflow I gave, I did everything with simple Newick files. At the very end, it might make sense to save the bootstrapped tree as phyloXML, or even as a full NEXUS file bundled up with the alignment. >> I wrote some code in python to do this bootstrapping step using the >> splits defined by each edge (i.e. the two sets of nodes you get if the >> edge was severed), which I represented using bit arrays, for use as >> keys in a dictionary mapping the splits to the master tree's edges. > > I would be interested to see this. I'm not actually sure where I put it... it should be on my old desktop at home somewhere. However, I can elaborate in that in addition NJ using quicktree, I also did parsimony bootstrap values, and drew my own colourful trees using reportlab. See the three supplementary figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 Peter > P.S. On inspection, there's a possible bug in Bio.Nexus.get_start_end: the > default argument for skiplist is a list with two characters in it. If > skiplist is altered, this would persist across subsequent calls, wouldn't > it? I don't understand what you are trying to say. 
If the get_start_end is called with an argument (say skiplist=["a","b"]) then this will not affect subsequent calls where the default will still be ['-','?']. From biopython at maubp.freeserve.co.uk Wed Jul 29 16:24:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 17:24:52 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> Message-ID: <320fb6e00907290924o58a63e01l37950046070c290e@mail.gmail.com> > The obvious application of this (which I have used personally), was to > generate bootstrap trees on multiple machines in a cluster (or cores on > a single machine), e.g. 100 instances each of 10 bootstrap trees, giving > in total 1000 trees (which are then used either to build a consensus, or > allocate bootstrap support to the randomised master tree). I hope it was clear anyway, but that last bit should have read: ... which are then used either to build a consensus [tree], or allocate bootstrap support to the original *non* randomised master tree [generated from the original alignment]. Peter From eric.talevich at gmail.com Wed Jul 29 17:59:27 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Jul 2009 13:59:27 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> Message-ID: <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> On Wed, Jul 29, 2009 at 12:16 PM, Peter wrote: > On Wed, Jul 29, 2009 at 4:49 PM, Eric Talevich > wrote: > > > > Does this bear any resemblance to your plan?
> > No - but probably only because I didn't fancy restructuring Bio.Nexus ;) > We can already call the Newick tree parser directly, so it doesn't > have to be moved (although we could do). [In case you hadn't seen > it, the current version of the Tutorial has a tiny example using this > at the end of a ClustalW example in the Alignment chapter.] > > Bio.TreeIO.parse() should be an iterator, returning complete tree > objects one by one. I was thinking of having Bio.TreeIO.NewickIO > just take a plain text file, split it up at the ";\n" characters (or > similar) > to get each tree as a string, which is passed to Bio.Nexus.Trees.Tree > to parse it. > OK, I did this. http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py > > Sounds like an incremental parse() function over these trees would be > > very useful for distributed bootstrap analysis etc. > > Exactly. And Bio.TreeIO.read() would be for the special case where > the file format contains exactly one tree. > PhyloXML has a top-level object that contains multiple phylogenies, plus arbitrary 'other' data; PhyloXML.read() returns one of those object regardless of how many phylogenies it contains. Newick doesn't have a top-level container, so returning one tree and raising a RuntimeError if there isn't exactly one tree makes sense. But Nexus has a top-level container with (potentially) a bunch of other info -- should NexusIO.read() return the complete Nexus object, or just pretend to be a Newick wrapper and behave that way? As far as I know, Bio.Nexus just parses a whole file in one go. This > means either Bio.TreeIO.NexusIO would call this and then loop over > the list (very memory inefficient), or it would need a minimal Nexus > parser just to spot the TREES block, and handle them only. > That's what I pictured for a Bio.Nexus refactoring -- I don't know the right way to do it in a memory-efficient way, though, given that there are multiple types of blocks and they may be needed at different times. 
Maybe make an initial pass to index the file at the block level, then call incremental line-level parsers on the selected blocks? Or, simpler, factor out the efficient line-level parsers so that they can be accessed separately if need be -- basically the way Nexus._tree() works now -- and let the block-level parsing code call those specific parsers. >> I wrote some code in python to do this bootstrapping step using the > >> splits defined by each edge (i.e. the two sets of nodes you get if the > >> edge was severed), which I represented using bit arrays, for use as > >> keys in a dictionary mapping the splits to the master tree's edges. > > > > I would be interested to see this. > > I'm not actually sure where I put it... it should be on my old desktop > at home somewhere. However, I can elaborate in that in addition NJ > using quicktree, I also did parsimony bootstrap values, and drew my > own colourful trees using reportlab. See the three supplementary > figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 > > Hey, neat. I was about to start a project involving kinases and response regulators. How much trouble was it to draw trees in reportlab? Do you think it would be worth adding a tree-drawing module to Bio.Graphics? Eric From biopython at maubp.freeserve.co.uk Wed Jul 29 19:37:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 20:37:08 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? 
In-Reply-To: <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> Message-ID: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich wrote: >> >> Bio.TreeIO.parse() should be an iterator, returning complete tree >> objects one by one. I was thinking of having Bio.TreeIO.NewickIO >> just take a plain text file, split it up at the ";\n" characters (or >> similar) to get each tree as a string, which is passed to >> Bio.Nexus.Trees.Tree to parse it. > > OK, I did this. > > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py OK, I haven't run the code but have a couple of points. On a general point, are you intending to re-write parse and read functions for each tree format? For Bio.SeqIO all I do is write an iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also Bio.SeqIO.read() call this. I didn't use RuntimeError for SeqIO and AlignIO, I used ValueError. I figured the data in the handle didn't match the expectations, which was like saying it had a bad value. It would therefore be more consistent to do the same. The parsing code looks weird to me - but that is probably a style thing. Certainly I had to stare at it to work out what it was doing. It also has a bug - consider a Newick file containing one tree but with no trailing semicolon. On a more serious note, your output code creates a monster string of all the trees in memory!
Don't do this as it ruins the whole memory benefit of using iterators
to keep just one tree in memory at a time:

    lines = (t.to_string(plain_newick=True, plain=plain, **kwargs)
             for t in trees)
    file.write(';\n'.join(lines))

Instead handle the trees one by one:

    for t in trees:
        file.write(t.to_string(...) + ";\n")

(I'm assuming, as you did, that the to_string method won't add the
trailing semicolon and newline.)

>> > Sounds like an incremental parse() function over these trees would be
>> > very useful for distributed bootstrap analysis etc.
>>
>> Exactly. And Bio.TreeIO.read() would be for the special case where
>> the file format contains exactly one tree.
>
> PhyloXML has a top-level object that contains multiple phylogenies, plus
> arbitrary 'other' data; PhyloXML.read() returns one of those object
> regardless of how many phylogenies it contains. Newick doesn't have a
> top-level container, so returning one tree and raising a RuntimeError if
> there isn't exactly one tree makes sense. But Nexus has a top-level
> container with (potentially) a bunch of other info -- should NexusIO.read()
> return the complete Nexus object, or just pretend to be a Newick wrapper and
> behave that way?

Ah. The top level information about all the trees may cause trouble
for the TreeIO model I had in mind (which was *just* for trees). The
advantage of that model is a consistent API; the downside is that certain
file format specific things cannot be supported nicely. I think this
balance has worked nicely for SeqIO and AlignIO to date. So:

* Bio.TreeIO.read(...) would return one tree.
* Bio.TreeIO.parse(...) would iterate over trees one by one.
* Bio.TreeIO.write(...) would write trees out (ideally sequentially
  if the file format allows this).

Note I am assuming it is possible to write a PhyloXML tree with
minimal (empty) top level annotation? You would need to do this
in order to convert from a Nexus or Newick tree to a (minimal)
PhyloXML tree.
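As a rough illustration of the tree-by-tree writing suggested earlier in
this message, here is a minimal sketch. The function name write_newick and
its to_string parameter are hypothetical (neither is part of Bio.TreeIO or
Bio.Nexus); to_string stands in for whatever method turns a tree into a
Newick string without the trailing semicolon.

```python
import io


def write_newick(trees, handle, to_string=str):
    """Write trees to handle one at a time; return how many were written.

    Hypothetical helper: `to_string` is assumed to return a Newick string
    WITHOUT the trailing semicolon or newline.
    """
    count = 0
    for tree in trees:
        # Only the current tree's string is held in memory here.
        handle.write(to_string(tree) + ";\n")
        count += 1
    return count


# Usage: the input can be a generator, so trees are produced lazily.
buf = io.StringIO()
written = write_newick((t for t in ["(A,B)", "(C,D)"]), buf)
```

Because the loop never joins the strings together, memory use should stay
at roughly one tree regardless of how many trees the iterator yields.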
So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO would only give you the trees - you'd not get the top level information. For parsing Nexus files, Bio.TreeIO would only give access to a subset of the data in a Nexus file - just the trees. In the same way, parsing a Nexus file with AlignIO only gives you the alignment. If you want any of the other data in a Nexus file, you have to use the Bio.Nexus module. If you (as a user) needed the top level annotation in a PhyloXML file, then I would say use Bio.PhyloXML (or what ever we are calling it) directly instead of Bio.TreeIO. >> As far as I know, Bio.Nexus just parses a whole file in one go. This >> means either Bio.TreeIO.NexusIO would call this and then loop over >> the list (very memory inefficient), or it would need a minimal Nexus >> parser just to spot the TREES block, and handle them only. > > That's what I pictured for a Bio.Nexus refactoring -- I don't know the right > way to do it in a memory-efficient way, though, given that there are > multiple types of blocks and they may be needed at different times. Maybe > make an initial pass to index the file at the block level, then call > incremental line-level parsers on the selected blocks? Or, simpler, factor > out the efficient line-level parsers so that they can be accessed separately > if need be -- basically the way Nexus._tree() works now -- and let the > block-level parsing code call those specific parsers. Maybe. Of course, in practice Nexus files may not be that big. I don't know if anyone uses them to store (for example) 1000 bootstrap trees. As Brad and I have noted before, spending time on refactoring Bio.Nexus is not the best use of your GSoC project time (plus we'd need to get Cymon and Frank much more involved, worry more about backwards compatibility etc). >> I'm not actually sure where I put it... it should be on my old desktop >> at home somewhere. 
However, I can elaborate in that in addition NJ >> using quicktree, I also did parsimony bootstrap values, and drew my >> own colourful trees using reportlab. See the three supplementary >> figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0 > > Hey, neat. I was about to start a project involving kinases and response > regulators. Cool - email me off list if you want to chat more about this aspect. > How much trouble was it to draw trees in reportlab? Do you think > it would be worth adding a tree-drawing module to Bio.Graphics? I agree that tree drawing would be a nice addition to Bio.Graphics. But that code of mine as written would not be good enough. In the end it was a bit of a hack - it got the job done but had lots of special cases (e.g. to get colouring by species to work, and in particular the double bootstrap values caused me pain as I had to have two otherwise identical trees loaded). Even ignoring this, the basic code didn't use an object orientated approach which makes it a poor match to the rest of Bio.Graphics. Basically I would want to rewrite it from scratch before I felt it was fit for public reuse, and have never found the time. Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 29 20:57:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 16:57:38 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907292057.n6TKvcF9028919@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 ------- Comment #5 from sridhar.ratna at gmail.com 2009-07-29 16:57 EST ------- Yup, that works. When run as a script (eg: via subprocess module), setup.py terminates when numpy is not installed. That is good enough fix. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 29 21:02:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Jul 2009 17:02:30 -0400 Subject: [Biopython-dev] [Bug 2889] setup.py reads stdin even when stdin is not a terminal In-Reply-To: Message-ID: <200907292102.n6TL2UgT029129@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2889 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-29 17:02 EST ------- (In reply to comment #5) > Yup, that works. When run as a script (eg: via subprocess module), setup.py > terminates when numpy is not installed. > > That is good enough fix. > Great. Thank you for your report, and taking the time to test this for us. Marking as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From czmasek at burnham.org Wed Jul 29 21:12:52 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Wed, 29 Jul 2009 14:12:52 -0700 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython In-Reply-To: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> References: <3f6baf360907271056t2c84429fwe17739c93046105d@mail.gmail.com> Message-ID: Hi, Eric: Looks good! Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. 
Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. This is a nice thought, but to be honest, I would not do it, especially since it is likely there will be more versions in the future (although, hopefully, just extending 1.10, as opposed to the removal and change of elements. - PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? This is fine! - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? Time permitting I would concentrate on exporting to other useful object representations and on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extensions. Christian ________________________________________ From: wg-phyloinformatics-bounces at nescent.org [wg-phyloinformatics-bounces at nescent.org] On Behalf Of Eric Talevich [eric.talevich at gmail.com] Sent: Monday, July 27, 2009 10:56 AM To: Phyloinformatics Group; BioPython-Dev Mailing List Subject: [Wg-phyloinformatics] GSoC Weekly Update 10: PhyloXML for Biopython Hi folks, Previously (July 20-24) I: Finished implementing I/O methods, Tree classes and tests for all phyloXML elements. 
Changed Writer to preserve node order in the XML; output now validates under the phyloXML 1.00 schema (but 1.10 complains) Did some drastic code reorganization. - Bio.Tree: - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree classes - Made Clade inherit from BaseTree.Tree in addition to BaseTree.Node, and added the corresponding attributes - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML - Bio.TreeIO: - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new Bio.TreeIO module, and updated imports everywhere - Added wrappers for Nexus read/write; doesn't return Bio.Tree objects yet though Added/updated unit tests for all of this. Documented the code reorg on the Biopython wiki, adding Tree and TreeIO pages and fixing the examples on the PhyloXML page. Scrubbed docstrings and enabled epydoc processing. This week (July 27-31) I will: Finish implementing the phyloXML spec: - Scan "simple types" for restricted tokens; check strings in constructors - Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?) - Clean up and reorganize any code that needs it Enhancements (time permitting): - Improve the SeqRecord conversion - Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB extension - Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree, Bioperl node objects, PyCogent, p4-phylogenetics - Tree method: build_index (set left_idx, right_idx on all nodes): - calculate left/right indexes for nested-set representation - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html - Export to networkx (http://networkx.lanl.gov/) -- also get graphviz export for free, via networkx.to_agraph() Remarks: - Bioperl's phyloXML driver was written for version 1.00 and might hurl if given a v1.10 file -- so that's a potential problem if Biopython defaults to writing v1.10 files. Should Writer take a option to specify the file format version number? Right now it only writes valid phyloXML v1.00. 
- PhyloXMLIO also always writes branch_length as an XML node, not an attribute. This validates and will be handled safely by any sane parser, and fits better with the idea of an implicit root node in each clade object, I think. (The parser still handles an attribute properly.) Any objections? - Above, I've listed more enhancements than I'll probably be able to finish this week. Which should have higher priority? I know merging Bio.Nexus and Bio.Tree would be the most useful, but since (1) Biopython development still happens on CVS, not Git, and (2) another Tree-based GSoC project is expected to land around the same time as mine, I think doing the integration right now would be kind of painful. So I can focus either on laying the groundwork in Bio.Tree.BaseTree, copying rather than moving the relevant Nexus code, or else work mainly on exporting to other useful object representations like networkx graphs, or any Biopython classes I've missed (e.g. alignments). Suggestions? Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From hlapp at gmx.net Thu Jul 30 01:55:47 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 29 Jul 2009 21:55:47 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> Message-ID: On Jul 29, 2009, at 3:37 PM, Peter wrote: > consider a Newick file containing one tree but with no trailing semi > colon That's actually not legal Newick format if you take it by the letter. 
Some programs out there are lenient and take it anyway, but some will actually balk and throw an error. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From eric.talevich at gmail.com Thu Jul 30 04:10:35 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 30 Jul 2009 00:10:35 -0400 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> Message-ID: <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> On Wed, Jul 29, 2009 at 3:37 PM, Peter wrote: > On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich > wrote: > >> > >> Bio.TreeIO.parse() should be an iterator, returning complete tree > >> objects one by one. I was thinking of having Bio.TreeIO.NewickIO > >> just take a plain text file, split it up at the ";\n" characters (or > >> similar) to get each tree as a string, which is passed to > >> Bio.Nexus.Trees.Tree to parse it. > > > > OK, I did this. > > > > http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py > > OK, I haven't run the code but have a couple of points. > > On a general point, are you intending to re-write parse and read > functions for each tree format? For Bio.SeqIO all I do is write a > iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also > Bio.SeqIO.read() call this. > If the top-level TreeIO read function returns just the first parsed tree and raises a ValueError if 0 or >1 trees are available, then I can make the wrappers simpler and reduce some code duplication. 
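A minimal sketch of that idea, assuming plain Newick strings stand in for
tree objects: read() is a thin wrapper around the parse() iterator and
raises ValueError unless exactly one tree is found. Both functions are
illustrative only, not the actual Bio.TreeIO code. The parser naively
splits on ';' (so it would break on semicolons inside quoted labels) and
tolerates a missing final semicolon.

```python
import io


def parse(handle):
    """Yield one Newick string per tree, reading the handle line by line."""
    buf = ""
    for line in handle:
        buf += line.strip()
        while ";" in buf:
            tree, buf = buf.split(";", 1)
            if tree:
                yield tree
    if buf:
        # Lenient on input: accept a last tree with no trailing semicolon.
        yield buf


def read(handle):
    """Return the single tree in the handle, or raise ValueError."""
    trees = parse(handle)
    first = next(trees, None)
    if first is None:
        raise ValueError("No trees found in handle")
    if next(trees, None) is not None:
        raise ValueError("More than one tree found in handle")
    return first


# Usage with an in-memory handle.
single = read(io.StringIO("(A,(B,C));\n"))
```

With this layout a format wrapper only has to supply the iterator; the
"exactly one tree" check lives in one place.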
The parsing code looks weird to me - but that is probably a style > thing. Certainly I had to stare at it to work out what it was doing. > It also has a bug - consider a Newick file containing one tree but > with no trailing semi colon. > It is weird; I'll fix these issues in parse() and write(). (I only tested with a small 2-tree file.) Style: The "foo and bar or baz" is a Py2.4-friendly idiom that we can one day replace everywhere with the real ternary expression syntax introduced in Py2.5: "bar if foo else baz". I've been using it throughout my GSoC code, though it's not really necessary in this function. Hilmar says there's supposed to be a terminal semicolon; I didn't check what Biopython's parser does but I suppose this should duplicate that. >> > Sounds like an incremental parse() function over these trees would be > >> > very useful for distributed bootstrap analysis etc. > >> > >> Exactly. And Bio.TreeIO.read() would be for the special case where > >> the file format contains exactly one tree. > > > > PhyloXML has a top-level object that contains multiple phylogenies, plus > > arbitrary 'other' data; PhyloXML.read() returns one of those object > > regardless of how many phylogenies it contains. Newick doesn't have a > > top-level container, so returning one tree and raising a RuntimeError if > > there isn't exactly one tree makes sense. But Nexus has a top-level > > container with (potentially) a bunch of other info -- should > NexusIO.read() > > return the complete Nexus object, or just pretend to be a Newick wrapper > and > > behave that way? > > > Ah. The top level information about all the trees may cause trouble > for the TreeIO model I had in mind (which was *just* for trees). The > advantage of this is a consistent API, the downside is certain file > format specific things cannot be supported nicely. I think this balance > has worked nicely for SeqIO and AlignIO to date. So: > * Bio.TreeIO.read(...) would return one tree. > * Bio.TreeIO.parse(...) 
would iterate over trees one by one. > * Bio.TreeIO.write(...) would write trees out (ideally sequentially > if the file format allows this). > > Note I am assuming it is possible to write a PhyloXML tree with > minimal (empty) top level annotation? You would need to do this > in order to convert from a Nexus or Newick tree to a (minimal) > PhyloXML tree. > > So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO > would only give you the trees - you'd not get the top level information. > For parsing Nexus files, Bio.TreeIO would only give access to a > subset of the data in a Nexus file - just the trees. In the same way, > parsing a Nexus file with AlignIO only gives you the alignment. If > you want any of the other data in a Nexus file, you have to use the > Bio.Nexus module. > > If you (as a user) needed the top level annotation in a PhyloXML file, > then I would say use Bio.PhyloXML (or what ever we are calling it) > directly instead of Bio.TreeIO. > Within the last couple of weeks, I moved all of the PhyloXML I/O code to Bio.TreeIO.PhyloXMLIO, and the tree class definitions to Bio.Tree.PhyloXML -- so there is no Bio.PhyloXML module now, as far as imports and setup.py are concerned. Unlike Nexus, a phyloXML file really doesn't contain anything other than phylogenetic trees and their annotations, so I didn't see the need to clutter the Bio namespace further. Plan: TreeIO has read(), parse(), write(), and possibly convert(), which behave exactly like the corresponding AlignIO and SeqIO functions, but with trees. Under Bio.TreeIO we have wrappers for other formats, and these wrappers may have public functions that go beyond the shared TreeIO ones. In some cases this can lead to a specific read-like function that returns a single object containing one or more trees, plus other tree-related metadata. This function can either be called read() also, as it currently is in PhyloXMLIO, or we could choose another name like load(). 
For basic tree access:

    from Bio import TreeIO
    tree = TreeIO.read('example.xml', 'phyloxml')
    TreeIO.write([tree], 'example.nex', 'nexus')

For the connoisseur:

    from Bio.TreeIO import PhyloXMLIO
    phx = PhyloXMLIO.read('example.xml')
    if phx.other:
        pass  # do something clever...

> Of course, in practice Nexus files may not be that big. I don't
> know if anyone uses them to store (for example) 1000 bootstrap trees.
> As Brad and I have noted before, spending time on refactoring Bio.Nexus
> is not the best use of your GSoC project time (plus we'd need to get
> Cymon and Frank much more involved, worry more about backwards
> compatibility etc).
>

This refactoring quest actually started because I was trying to figure out
an object model for BaseTree that could support PhyloDB, reuse the Nexus
tree methods with some resemblance to the original form, and still provide
useful base classes for phyloXML. That was holding up everything else --
but I think it's under control now.

> I agree that tree drawing would be a nice addition to Bio.Graphics.
>
> But that code of mine as written would not be good enough. In the
> end it was a bit of a hack - it got the job done but had lots of special
> cases (e.g. to get colouring by species to work, and in particular the
> double bootstrap values caused me pain as I had to have two otherwise
> identical trees loaded). Even ignoring this, the basic code didn't use
> an object orientated approach which makes it a poor match to the
> rest of Bio.Graphics. Basically I would want to rewrite it from scratch
> before I felt it was fit for public reuse, and have never found the time.
>

Maybe it will be worth another shot after the Tree module settles down. If
networkx export comes easily this week, that may also take care of
visualization for some uses.
Cheers, Eric From biopython at maubp.freeserve.co.uk Thu Jul 30 09:13:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 10:13:29 +0100 Subject: [Biopython-dev] Newick support in Bio.TreeIO? In-Reply-To: <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> References: <320fb6e00907280948u2100feeerf27177b2729af99d@mail.gmail.com> <3f6baf360907290849n16c5297cq8804fccc2f55d14@mail.gmail.com> <320fb6e00907290916u2da71aa2oa1d668413fbbf1a9@mail.gmail.com> <3f6baf360907291059s60f29d22s2b2a4a3c31669d4e@mail.gmail.com> <320fb6e00907291237v52902738r528bdc3ebad034a9@mail.gmail.com> <3f6baf360907292110y652c3cc9wf77ae9269080ca5b@mail.gmail.com> Message-ID: <320fb6e00907300213s582c9313ya38b48d84993101b@mail.gmail.com> On Thu, Jul 30, 2009 at 5:10 AM, Eric Talevich wrote: > On Wed, Jul 29, 2009 at 3:37 PM, Peter wrote: > >> The parsing code looks weird to me - but that is probably a style >> thing. Certainly I had to stare at it to work out what it was doing. >> It also has a bug - consider a Newick file containing one tree but >> with no trailing semi colon. > > Hilmar says there's supposed to be a terminal semicolon; I didn't check > what Biopython's parser does but I suppose this should duplicate that. Hilmar is right, see http://evolution.genetics.washington.edu/phylip/newicktree.html However, in this case I would opt to support this variant anyway for input (but you must include the ";" on output). > Plan: > TreeIO has read(), parse(), write(), and possibly convert(), which behave > exactly like the corresponding AlignIO and SeqIO functions, but with trees. > Under Bio.TreeIO we have wrappers for other formats, and these wrappers may > have public functions that go beyond the shared TreeIO ones. Sounds good. > In some cases this can lead to a specific read-like function that returns a > single object containing one or more trees, plus other tree-related > metadata. 
> This function can either be called read() also, as it currently is
> in PhyloXMLIO, or we could choose another name like load().
>
> For basic tree access:
>
>     from Bio import TreeIO
>     tree = TreeIO.read('example.xml', 'phyloxml')
>     TreeIO.write([tree], 'example.nex', 'nexus')
>
> For the connoisseur:
>
>     from Bio.TreeIO import PhyloXMLIO
>     phx = PhyloXMLIO.read('example.xml')
>     if phx.other:
>         pass  # do something clever...

Sounds OK to me at first glance.

>> Of course, in practice Nexus files may not be that big. I don't
>> know if anyone uses them to store (for example) 1000 bootstrap trees.
>> As Brad and I have noted before, spending time on refactoring Bio.Nexus
>> is not the best use of your GSoC project time (plus we'd need to get
>> Cymon and Frank much more involved, worry more about backwards
>> compatibility etc).
>
> This refactoring quest actually started because I was trying to figure out
> an object model for BaseTree that could support PhyloDB, reuse the Nexus
> tree methods with some resemblance to the original form, and still provide
> useful base classes for phyloXML. That was holding up everything else --
> but I think it's under control now.

Cool.

>> I agree that tree drawing would be a nice addition to Bio.Graphics.
>> ...
>
> Maybe it will be worth another shot after the Tree module settles down. If
> networkx export comes easily this week, that may also take care of
> visualization for some uses.

Good point. In fact, from memory, my tree PDF code was probably using
Thomas Mailund's Newick parser (not Bio.Nexus, which didn't exist
when I first started work on trees).
http://www.birc.au.dk/~mailund/newick.html

Peter

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 17:20:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 13:20:28 -0400
Subject: [Biopython-dev] [Bug 2890] New: Getting setup.py to work in Jython
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2890

           Summary: Getting setup.py to work in Jython
           Product: Biopython
           Version: 1.51b
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: kellrott at ucsd.edu

Currently setup.py fails in Jython because that implementation of Python
does not support building C extensions. This can be avoided by adding the
code:

    if os.name == 'java':
        EXTENSIONS = []
    else:
        EXTENSIONS = [
            ...continue with regular extension definition

This will not introduce bugs into main BioPython target platforms (CPython),
and will allow for development on new platforms (Jython).
Tested with Jython 2.5.0.

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:06:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:06:32 -0400
Subject: [Biopython-dev] [Bug 2891] New: Jython test_NCBITextParser fix+patch
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           Summary: Jython test_NCBITextParser fix+patch
           Product: Biopython
           Version: 1.51b
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: kellrott at ucsd.edu

Jython is subject to JVM method size limits; overly large methods cause JVM
exceptions (java.lang.ClassFormatError: Invalid method Code length ...).
The test_NCBITextParser unit test contains a few methods that are larger than
the JVM limit. This can be fixed by breaking some of the methods into
multi-segment tests, so test_bt007 becomes test_bt007a and test_bt007b.

A sample fix patch, tested with Jython 2.5.0:

713c713
<     def test_bt007(self):
---
>     def test_bt007a(self):
1242a1243,1250
>
>     def test_bt007b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.10 output, three rounds (bt007)"
>
>         path = os.path.join('Blast', 'bt007')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
1891c1899
<     def test_bt009(self):
---
>     def test_bt009a(self):
2525a2534,2541
>
>
>     def test_bt009b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.10 output, two rounds (bt009)"
>
>         path = os.path.join('Blast', 'bt009')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
5635c5651
<     def test_bt047(self):
---
>     def test_bt047a(self):
6251a6268,6275
>
>     def test_bt047b(self):
>         "Test parsing PHI-BLAST, BLASTP 2.0.11 output, two rounds (bt047)"
>
>         path = os.path.join('Blast', 'bt047')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
9959c9983
<     def test_bt060(self):
---
>     def test_bt060a(self):
10330a10355,10362
>
>     def test_bt060b(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
11000a11033,11041
>
>
>     def test_bt060c(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
>
11504a11546,11552
>
>     def test_bt060d(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)
>         record = self.pb_parser.parse(handle)
11812a11861,11866
>
>     def test_bt060e(self):
>         "Test parsing BLASTP 2.0.12 output (bt060)"
>
>         path = os.path.join('Blast', 'bt060')
>         handle = open(path)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
-------
You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:40:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:40:33 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200907312040.n6VKeX4u029072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2890 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:40:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 31 Jul 2009 16:40:35 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200907312040.n6VKeZDl029078@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 kellrott at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2891 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:47:17 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:47:17 -0400
Subject: [Biopython-dev] [Bug 2892] New: Jython MatrixInfo.py fix+patch
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2892

           Summary: Jython MatrixInfo.py fix+patch
           Product: Biopython
           Version: 1.51b
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: kellrott at ucsd.edu
 BugsThisDependsOn: 2890,2891

Jython is subject to JVM method size limits; overly large methods cause JVM
exceptions (java.lang.ClassFormatError: Invalid method Code length ...).
MatrixInfo creates several matrices at the top level of the module, which
triggers that exception in Jython. This can be fixed by putting each of the
matrix definitions in a separate function and then calling those functions to
define the variables. Attached is a patch that works in Jython 2.5.0 and
should have no effect on CPython.
9c9,10
< available_matrices = ['benner6', 'benner22', 'benner74', 'blosum100',
---
> def gen_available_matrices():
>     return [
22c23,24
< benner6 = {
---
> def gen_benner6():
>     return {
78c80,81
< benner22 = {
---
> def gen_benner22():
>     return {
134c137,138
< benner74 = {
---
> def gen_benner74():
>     return {
190c194,195
< blosum100 = {
---
> def gen_blosum100():
>     return {
262c267,268
< blosum30 = {
---
> def gen_blosum30():
>     return {
334c340,341
< blosum35 = {
---
> def gen_blosum35():
>     return {
406c413,414
< blosum40 = {
---
> def gen_blosum40():
>     return {
478c486,487
< blosum45 = {
---
> def gen_blosum45():
>     return {
550c559,560
< blosum50 = {
---
> def gen_blosum50():
>     return {
622c632,633
< blosum55 = {
---
> def gen_blosum55():
>     return {
694c705,706
< blosum60 = {
---
> def gen_blosum60():
>     return {
766c778,779
< blosum62 = {
---
> def gen_blosum62():
>     return {
838c851,852
< blosum65 = {
---
> def gen_blosum65():
>     return {
910c924,925
< blosum70 = {
---
> def gen_blosum70():
>     return {
982c997,998
< blosum75 = {
---
> def gen_blosum75():
>     return {
1054c1070,1071
< blosum80 = {
---
> def gen_blosum80():
>     return {
1126c1143,1144
< blosum85 = {
---
> def gen_blosum85():
>     return {
1198c1216,1217
< blosum90 = {
---
> def gen_blosum90():
>     return {
1270c1289,1290
< blosum95 = {
---
> def gen_blosum95():
>     return {
1342c1362,1363
< feng = {
---
> def gen_feng():
>     return {
1398c1419,1420
< fitch = {
---
> def gen_fitch():
>     return {
1444c1466,1467
< genetic = {
---
> def gen_genetic():
>     return {
1500c1523,1524
< gonnet = {
---
> def gen_gonnet():
>     return {
1556c1580,1581
< grant = {
---
> def gen_grant():
>     return {
1612c1637,1638
< ident = {
---
> def gen_ident():
>     return {
1668c1694,1695
< johnson = {
---
> def gen_johnson():
>     return {
1724c1751,1752
< levin = {
---
> def gen_levin():
>     return {
1780c1808,1809
< mclach = {
---
> def gen_mclach():
>     return {
1836c1865,1866
< miyata = {
---
> def gen_miyata():
>     return {
1892c1922,1923
< nwsgappep = {
---
> def gen_nwsgappep():
>     return {
1959c1990,1991
< pam120 = {
---
> def gen_pam120():
>     return {
2031c2063,2064
< pam180 = {
---
> def gen_pam180():
>     return {
2103c2136,2137
< pam250 = {
---
> def gen_pam250():
>     return {
2175c2209,2210
< pam30 = {
---
> def gen_pam30():
>     return {
2247c2282,2283
< pam300 = {
---
> def gen_pam300():
>     return {
2319c2355,2356
< pam60 = {
---
> def gen_pam60():
>     return {
2391c2428,2429
< pam90 = {
---
> def gen_pam90():
>     return {
2458c2496,2497
< rao = {
---
> def gen_rao():
>     return {
2514c2553,2554
< risler = {
---
> def gen_risler():
>     return {
2570c2610,2611
< structure = {
---
> def gen_structure():
>     return {
2624a2666,2707
> available_matrices = gen_available_matrices()
> benner6 = gen_benner6()
> benner22 = gen_benner22()
> benner74 = gen_benner74()
> blosum100 = gen_blosum100()
> blosum30 = gen_blosum30()
> blosum35 = gen_blosum35()
> blosum40 = gen_blosum40()
> blosum45 = gen_blosum45()
> blosum50 = gen_blosum50()
> blosum55 = gen_blosum55()
> blosum60 = gen_blosum60()
> blosum62 = gen_blosum62()
> blosum65 = gen_blosum65()
> blosum70 = gen_blosum70()
> blosum75 = gen_blosum75()
> blosum80 = gen_blosum80()
> blosum85 = gen_blosum85()
> blosum90 = gen_blosum90()
> blosum95 = gen_blosum95()
> feng = gen_feng()
> fitch = gen_fitch()
> genetic = gen_genetic()
> gonnet = gen_gonnet()
> grant = gen_grant()
> ident = gen_ident()
> johnson = gen_johnson()
> levin = gen_levin()
> mclach = gen_mclach()
> miyata = gen_miyata()
> nwsgappep = gen_nwsgappep()
> pam120 = gen_pam120()
> pam180 = gen_pam180()
> pam250 = gen_pam250()
> pam30 = gen_pam30()
> pam300 = gen_pam300()
> pam60 = gen_pam60()
> pam90 = gen_pam90()
> rao = gen_rao()
> risler = gen_risler()
> structure = gen_structure()
>
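The pattern in the patch above can be sketched in a few lines. This is an illustration only, not the contents of MatrixInfo.py: `demo_matrix` and its values are hypothetical stand-ins for the real substitution matrices. The sketch assumes the motivation is the JVM's per-method bytecode limit (64 KB), and that Jython compiles all of a module's top-level statements into one JVM method, so that moving each large literal into its own function gives each one a separate, smaller method.

```python
# Sketch of the "wrap each large literal in a generator function" pattern.
# demo_matrix and its scores are hypothetical; the real module holds
# dozens of much larger dictionaries.

def gen_available_matrices():
    # Each function body is compiled separately from the module body.
    return ["demo_matrix"]


def gen_demo_matrix():
    # In the real module this literal would be a full substitution matrix.
    return {
        ("A", "A"): 4,
        ("A", "R"): -1,
        ("R", "R"): 5,
    }


# The module-level names are rebound at import time, so code that reads
# them as plain attributes keeps working unchanged.
available_matrices = gen_available_matrices()
demo_matrix = gen_demo_matrix()
```

Because the public names (`available_matrices`, `demo_matrix`) are still ordinary module attributes, callers should not need any changes under this approach.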
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:47:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:47:30 -0400
Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython
In-Reply-To:
Message-ID: <200907312047.n6VKlU0e029268@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2890

kellrott at ucsd.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |2892
              nThis|                            |

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 20:47:31 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 16:47:31 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200907312047.n6VKlVuB029274@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891

kellrott at ucsd.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |2892
              nThis|                            |

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:28:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:28:25 -0400
Subject: [Biopython-dev] [Bug 2893] New: Jython test_prosite fix+patch
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2893

           Summary: Jython test_prosite fix+patch
           Product: Biopython
           Version: 1.51b
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: kellrott at ucsd.edu
 BugsThisDependsOn: 2890,2891,2892

Jython is subject to the JVM's limit on method size; overly large methods
cause JVM exceptions (java.lang.ClassFormatError: Invalid method Code
length ...). The test_prosite unit test contains a few methods that are
larger than the JVM limit. This can be fixed by splitting each oversized
method into several smaller methods. This patch, combined with the other bug
fixes, brings Biopython to the point where "jython setup.py test" can
complete without throwing exceptions:

Ran 122 tests in 39.295 seconds
FAILED (failures = 74)

Patch tested with Jython 2.5.0.

3742c3742
<     def test_read1(self):
---
>     def test_read1a(self):
4096a4097,4103
>
>
>     def test_read1b(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
4499a4507,4515
>
>
>
>
>     def test_read1c(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
4934a4951,4956
>
>     def test_read1d(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
5190a5213,5218
>
>     def test_read1e(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
5611a5640,5645
>
>     def test_read1f(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
5892a5927,5932
>
>     def test_read1g(self):
>         "Parsing Prosite record ps00107.txt"
>         filename = os.path.join('Prosite', 'ps00107.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)
6417c6457
<     def test_read4(self):
---
>     def test_read4a(self):
6617a6658,6663
>
>     def test_read4b(self):
>         "Parsing Prosite record ps00432.txt"
>         filename = os.path.join('Prosite', 'ps00432.txt')
>         handle = open(filename)
>         record = Prosite.read(handle)

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:28:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:28:38 -0400
Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython
In-Reply-To:
Message-ID: <200907312128.n6VLScnl030449@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2890

kellrott at ucsd.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |2893
              nThis|                            |

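For readers following the Bug 2893 report above, here is a rough sketch of the method-splitting technique it describes: one long test method is cut into several shorter ones, each of which re-parses the record before making its share of the assertions. Everything in it is hypothetical: `read_record`, `SAMPLE` and the field values stand in for `Prosite.read` and the ps00107.txt data, and the shared `_parse` helper is a convenience that the actual patch does not use (it repeats the open/read lines in each new method).

```python
import io
import unittest

# Hypothetical three-line record standing in for a Prosite data file.
SAMPLE = "ID   EXAMPLE\nAC   PS00000\nDE   Hypothetical entry\n"


def read_record(handle):
    """Hypothetical stand-in for Prosite.read: parse 'KEY   value' lines."""
    record = {}
    for line in handle:
        key, value = line.rstrip("\n").split(None, 1)
        record[key] = value
    return record


class TestRead(unittest.TestCase):
    def _parse(self):
        # Each split-off test re-reads the record, as in the patch.
        return read_record(io.StringIO(SAMPLE))

    def test_read1a(self):
        # First slice of what used to be one long test_read1 method.
        record = self._parse()
        self.assertEqual(record["ID"], "EXAMPLE")

    def test_read1b(self):
        # Second slice: the remaining assertions get their own method,
        # so no single method body grows past the JVM limit.
        record = self._parse()
        self.assertEqual(record["AC"], "PS00000")
        self.assertEqual(record["DE"], "Hypothetical entry")
```

The cost of this layout is that the record is parsed once per slice rather than once overall, which is usually negligible for small test files.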
From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:28:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:28:39 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200907312128.n6VLSdCw030455@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891

kellrott at ucsd.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |2893
              nThis|                            |

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:28:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:28:45 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To:
Message-ID: <200907312128.n6VLSj7I030464@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892

kellrott at ucsd.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |2893
              nThis|                            |

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:59:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:59:34 -0400
Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython
In-Reply-To:
Message-ID: <200907312159.n6VLxY4V031200@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2890

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-07-31 17:59 EST -------
Fixed in CVS; the other Jython fixes will take a little longer to review.

Thanks!

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:59:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:59:38 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To:
Message-ID: <200907312159.n6VLxcBj031220@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891

Bug 2891 depends on bug 2890, which changed state.

Bug 2890 Summary: Getting setup.py to work in Jython
http://bugzilla.open-bio.org/show_bug.cgi?id=2890

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:59:52 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:59:52 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To:
Message-ID: <200907312159.n6VLxqg3031243@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2893

Bug 2893 depends on bug 2890, which changed state.

Bug 2890 Summary: Getting setup.py to work in Jython
http://bugzilla.open-bio.org/show_bug.cgi?id=2890

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

From bugzilla-daemon at portal.open-bio.org Fri Jul 31 21:59:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 31 Jul 2009 17:59:54 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To:
Message-ID: <200907312159.n6VLxs5H031255@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892

Bug 2892 depends on bug 2890, which changed state.

Bug 2890 Summary: Getting setup.py to work in Jython
http://bugzilla.open-bio.org/show_bug.cgi?id=2890

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED