From chapmanb at 50mail.com Mon Jan 4 08:16:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 4 Jan 2010 08:16:31 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
 <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <20100104131631.GG80812@sobchak.mgh.harvard.edu>

Hey Eric;
Happy New Year -- thanks for all the work on TreeIO. This sounds
great and looking forward to getting it in the main trunk. I'd like
to hear Peter's and others' thoughts, but just a few small comments
below.

> The tree annotations (e.g. id) aren't preserved perfectly during conversions
> -- I'll keep working on this, but I don't think it's a blocker. The taxon
> names of terminal nodes are kept as "clade" names in phyloXML for
> round-tripping. Tree topology and branch lengths seem OK.

Are the annotations often used in real life cases or is this more of
a fringe problem? I'm not as familiar with tree work, but know this
is a pain in sequence space. A good goal is to capture the most
common use cases and then integrate the other issues as feasible.

> Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
> incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
> This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
> as I imagine it:
> (1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
> reasonable (since the node IDs and adjacency list lookup are no longer
> needed)
> (2) Implement methods in Bio.Tree.Newick with the original argument lists,
> but triggering a deprecation warning indicating the newer replacement method
> (3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
> shims to duplicate the original API -- so test_Nexus.py should still pass,
> ideally (with deprecation warnings)
> (4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
> NexusIO and Bio.Tree methods.
> (5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
>
> I'm currently doing (1) and (2), with more emphasis on getting (1) right.
> Not all of the important methods have been ported, but I'm happy with the
> tree traversal methods.

Nice. This all sounds like a really good refactoring. It sounds like
1 can happen once this all gets merged with the main branch, and
could benefit from others being able to more easily look at it and
make suggestions.

> I noticed that in Tests/Nexus/, the example file for internal node labels is
> actually in Newick/NH format, not Nexus. That was briefly confusing, so
> maybe that file should be renamed.

Oops, I think that may have been me. No problem, rename away.

Brad

From eric.talevich at gmail.com Mon Jan 4 19:09:18 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 4 Jan 2010 16:09:18 -0800
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20100104131631.GG80812@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
 <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
 <20100104131631.GG80812@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf361001041609u7997dd61v441257dbfecdebd6@mail.gmail.com>

Hi Brad,

I hope the holidays treated you well.

On Mon, Jan 4, 2010 at 5:16 AM, Brad Chapman wrote:
>
> Are the annotations often used in real life cases or is this more of
> a fringe problem? I'm not as familiar with tree work, but know this
> is a pain in sequence space. A good goal is to capture the most
> common use cases and then integrate the other issues as feasible.
>

The data that TreeIO preserves round-trip are:

- Branching structure (topology)
- Branch lengths
- Clade/taxon names
- Rooted-ness (for the whole tree)
- Tree ID

The troublesome parts are:

- The "confidences" attribute in PhyloXML trees should map onto the "support" attribute in Nexus trees, but that's tricky -- the original Nexus attribute seemed content with a little ambiguity in what that attribute's numerical value actually meant (relative/absolute support), while PhyloXML uses a list of Confidence objects containing both a numerical value and a "type" string such as "bootstrap".
Currently that information is dropped when converting between PhyloXML and Nexus/Newick trees.
- Nexus also has a "comment" attribute for each node, while PhyloXML doesn't directly support that.
- The branch length of the root node/clade is None in PhyloXML, but 0.0 in Nexus. I prefer None because there is no meaningful branch leading to that node, but there might be a reason 0.0 was chosen for Nexus that I'm not aware of.
- The names of unlabeled internal nodes might change from None to "" in some cases, since None is the PhyloXML default and "" is the Nexus default.
- Since PhyloXML supports more structured taxonomic information on each node than Newick, it's possible to have a PhyloXML tree where a Clade has no name, but instead one or more Taxonomy objects containing the scientific name, common names, etc. -- so when this tree is converted to Newick format the taxonomy info is lost for those nodes. I could squash the Taxonomy object into a string for the sake of Nexus labels, but I think it would be safer (less surprising) to just write a cookbook entry on how to collapse PhyloXML Taxonomies into Clade names to aid format conversions.

If the support-vs-confidence issue can be resolved, then we can treat PhyloXML as a rough superset of Newick, in terms of annotation, and then it shouldn't be surprising to lose some annotation data in converting PhyloXML to Newick.

Cheers,
Eric

From biopython at maubp.freeserve.co.uk Tue Jan 5 12:50:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 17:50:25 +0000
Subject: [Biopython-dev] code credits
In-Reply-To: <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
References: <928490.72367.qm@web30708.mail.mud.yahoo.com>
 <320fb6e00912171454v2ce81fc5v93547951d7af84f8@mail.gmail.com>
 <320fb6e00912210357m32156fdax6639445cadd83217@mail.gmail.com>
 <20091221132339.GC21580@sobchak.mgh.harvard.edu>
 <320fb6e00912210634o77d9eb9ex21e4ec3630dd1ed6@mail.gmail.com>
 <320fb6e00912210848x449fd73al4e97d3c9e21cf4@mail.gmail.com>
 <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
Message-ID: <320fb6e01001050950r64dabb1dw67baafada72f5d1a@mail.gmail.com>

On Tue, Dec 22, 2009 at 12:14 PM, Peter wrote:
> On Mon, Dec 21, 2009 at 4:48 PM, Peter wrote:
>> So, how about a merger of (1) and (3)? i.e.
>>
>> * The CONTRIBUTORS file remains a single alphabetical list
>> of all contributors to date (no change).
>> * Entries in the NEWS file for new features etc may continue
>> to credit authors as appropriate.
>> * The NEWS file will include at the end of each release section
>> an alphabetical list of contributors for that release (with new
>> contributors flagged). This will be re-used in the release notice.
>
> I've done that in github - how do the NEWS and CONTRIB file look?
>
> http://github.com/biopython/biopython/commit/86d8d99aab894ab5f32a0e7a0c45d63a441da645
>
> I haven't automatically included email addresses for the new contributors
> since there is a risk of them being harvested for spam, so I figure that
> should be "opt in".

Thanks to those with feedback off list (e.g. sort order).

I've just updated the news post to include the list of names:
http://news.open-bio.org/news/2009/12/biopython-release-153/

I don't have time today, but at some point this week I want to do
another news post and email announcement describing this new
Sage-like policy for recognising contributors. If anyone would like
to compose a draft of the apparent consensus that would be very helpful.

If anyone would like to go back over the commit log for the recent
releases to update them as we've just done for 1.53, please go
ahead - but post an email here to avoid duplicated efforts.

Peter

P.S. Happy New Year!

From bugzilla-daemon at portal.open-bio.org Thu Jan 7 13:11:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 Jan 2010 13:11:47 -0500
Subject: [Biopython-dev] [Bug 2980] New: Bio.SeqIO can't parse EMBL CONTIG records
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2980

           Summary: Bio.SeqIO can't parse EMBL CONTIG records
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

While the GenBank parser has been updated to cope with CONTIG records (using
an UnknownSeq object), this has not been done for the EMBL parser.
As an example test case, consider:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rel_con_hum_01_r102.dat.gz

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Jan 8 06:50:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 Jan 2010 06:50:56 -0500
Subject: [Biopython-dev] [Bug 2980] Bio.SeqIO can't parse EMBL CONTIG records
In-Reply-To:
Message-ID: <201001081150.o08Bougb013879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2980

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-08 06:50 EST -------
Fixed in git

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mjldehoon at yahoo.com Fri Jan 8 11:26:29 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 8 Jan 2010 08:26:29 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <221209.41863.qm@web62404.mail.re1.yahoo.com>

I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric!

I have one suggestion though:
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython.
The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too). Thanks again, --Michiel. --- On Mon, 12/28/09, Eric Talevich wrote: > From: Eric Talevich > Subject: Re: [Biopython-dev] Code review request for phyloxml branch > To: "BioPython-Dev Mailing List" > Date: Monday, December 28, 2009, 8:51 PM > Hi folks, > > Here's an update on the status of Bio.Tree and TreeIO. I > think I've taken > care of most of the blockers since the last review in > September. > > First, some links: > http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/ > http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/ > http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py > http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py > http://biopython.org/wiki/PhyloXML > > Discussion: > > *TreeIO* > Conversion between Nexus, Newick and phyloXML tree file > formats works; the > read/parse/write functions for each IO format use the same > object types. > Neat! > > The tree annotations (e.g. id) aren't preserved perfectly > during conversions > -- I'll keep working on this, but I don't think it's a > blocker. The taxon > names of terminal nodes are kept as "clade" names in > phyloXML for > round-tripping. Tree topology and branch lengths seem OK. > > Under the hood: > -- PhyloXMLIO is from GSoC > -- NewickIO is ported from the Bio.Nexus.Trees parser. I > think it works the > same way. > -- NexusIO relies on Bio.Nexus.Nexus for parsing, then > converts the > resulting Nexus.Trees.Tree objects to Bio.Tree.Newick > objects. One day, when > Nexus.Trees is replaced by NewickIO in the main Nexus > parser, then this > conversion can be dropped and NexusIO will be very simple. 
> > *Tree* > The BaseTree object structure looks like this:* > > -- BaseTree.**Tree* contains global tree information, like > whether the tree > is rooted, and a reference to the root clade. The phyloXML > Phylogeny object > inherits from this.* > > -- BaseTree.**Subtree* contains local (clade- or > node-specific) information, > and references to each of its direct descendents, > recursively. The phyloXML > Clade object inherits from this. Nodes are implicit. I > could add references > to the ancestor of each sub-tree without too much > difficulty, but I haven't > needed them yet. > > The same methods (get_terminals et al.) generally apply to > both classes, so > I created a separate TreeMixin class from which both > BaseTree.Tree and > BaseTree.Subtree inherit. > > Bio.Tree.Newick contains simple subclasses of Tree and > Subtree, and an > incomplete set of shims that track Bio.Nexus.Trees.Tree > (minus the I/O). > This is to ease the deprecation and eventual replacement of > Bio.Nexus.Trees, > as I imagine it: > (1) Port methods from Nexus.Trees to Bio.Tree, simplifying > arguments where > reasonable (since the node IDs and adjacency list lookup > are no longer > needed) > (2) Implement methods in Bio.Tree.Newick with the original > argument lists, > but triggering a deprecation warning indicating the newer > replacement method > (3) Replace Nexus.Trees with an import of > Bio.Tree.Newick(IO) and a few more > shims to duplicate the original API -- so test_Nexus.py > should still pass, > ideally (with deprecation warnings) > (4) In Nexus.Nexus, replace all usage of Nexus.Trees with > proper usage of > NexusIO and Bio.Tree methods. > (5) Eventually delete Nexus.Trees and the shims in > Bio.Tree.Newick. > > I'm currently doing (1) and (2), with more emphasis on > getting (1) right. > Not all of the important methods have been ported, but I'm > happy with the > tree traversal methods. 
> * > Tests > *I created test_Tree.py to test the methods in > Bio.Tree.BaseTree; > test_PhyloXML.py tests Bio.Tree.PhyloXML objects and > Bio.TreeIO.PhyloXMLIO > parsing/writing. > > I noticed that in Tests/Nexus/, the example file for > internal node labels is > actually in Newick/NH format, not Nexus. That was briefly > confusing, so > maybe that file should be renamed. > > What do you think? > > All the best, > Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Fri Jan 8 12:00:12 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 17:00:12 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <221209.41863.qm@web62404.mail.re1.yahoo.com> References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> <221209.41863.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon wrote: > I am not an expert in this area, but the code looks very well done and well > organized. Thanks, Eric! > > I have one suggestion though: > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather > have everything under Bio.Tree. This makes it easier to understand what each > Bio.* module is about, and also agrees with the structure of the other modules > in Biopython. The only exception is Bio.Seq, for which there is a closely related > Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; > I'd rather have a single Bio.Seq there too). There is also Bio.AlignIO, which again might have been handled via Bio.Align with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was following the lead from BioPerl. 
I think there are some good points about making the code for the common object (tree, SeqRecord, Alignment) clearly separate from the code for parsing or writing it (although separate top level modules is perhaps overkill). However, I agree, this isn't universal in Biopython (e.g. Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO). So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing I don't like is that "Tree" could mean a class or a module (also a problem with other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python convention (PEP8) is to use lower case for the module ("tree") and title case for the class ("Tree"), something most of Biopython does not follow (and which we can't change without a lot of upheaval). Another option if we want to try and keep the existing module name style might be Bio.Trees containing a Tree class, or perhaps something different like Bio.Phylo instead? Peter From eric.talevich at gmail.com Fri Jan 8 13:22:11 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 8 Jan 2010 13:22:11 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> <221209.41863.qm@web62404.mail.re1.yahoo.com> <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> Message-ID: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> On Fri, Jan 8, 2010 at 12:00 PM, Peter Cock wrote: > > On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon wrote: > > I am not an expert in this area, but the code looks very well done and well > > organized. Thanks, Eric! > > > > I have one suggestion though: > > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather > > have everything under Bio.Tree. 
This makes it easier to understand what each > > Bio.* module is about, and also agrees with the structure of the other modules > > in Biopython. The only exception is Bio.Seq, for which there is a closely related > > Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; > > I'd rather have a single Bio.Seq there too). > > There is also Bio.AlignIO, which again might have been handled via Bio.Align > with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was > following the lead from BioPerl. Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava do something completely different. I had the impression that pairing modules Foo & FooIO was an emerging convention for organizing very general data types being fed by a variety of file formats, while a single module Foo indicated support for a particular program or source, like Entrez. But I think it would be even cleaner if each Foo simply had a Foo.IO (or foo.io) sub-module organizing the I/O for multiple file formats where applicable. The TreeIO.* namespace is not crowded -- just read, write, parse, convert. If that directory is moved under Bio.Tree and renamed to IO or io, then Bio.Tree would still seem reasonably intuitive if __init__.py contained: from io import * from utils import * Then "from Bio import Tree" would be enough for most uses. > I think there are some good points about making > the code for the common object (tree, SeqRecord, Alignment) clearly separate > from the code for parsing or writing it (although separate top level modules is > perhaps overkill). However, I agree, this isn't universal in Biopython (e.g. > Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO). PDB does its own thing, too -- and some consolidation there might be nice. > So I'm somewhat on the fence about the Bio.TreeIO name. 
However, one thing > I don't like is that "Tree" could mean a class or a module (also a problem with > other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python > convention (PEP8) is to use lower case for the module ("tree") and title case > for the class ("Tree"), something most of Biopython does not follow (and > which we can't change without a lot of upheaval). I could rename the modules inside Bio.Tree (or whatever we call it) to follow the PEP8 convention: Bio/Tree/ Bio/Tree/basetree.py Bio/Tree/io.py Bio/Tree/utils.py ... The Biopython convention seems to be that directory names are title case, file names are mostly title case if user-facing and lower case otherwise, and C extensions are lower case. Most of the time there won't be any need to import the sub-modules under Tree directly, so the inconsistency shouldn't be too jarring. > perhaps something different like Bio.Phylo instead? Sure, that sounds promising. Thanks! Eric From mjldehoon at yahoo.com Sat Jan 9 10:15:56 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Jan 2010 07:15:56 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> Message-ID: <863834.10061.qm@web62403.mail.re1.yahoo.com> --- On Fri, 1/8/10, Eric Talevich wrote: > Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava > do something completely different. > > I had the impression that pairing modules Foo & FooIO > was an emerging convention for organizing very general > data types being fed by a variety of file formats, while > a single module Foo indicated support > for a particular program or source, like Entrez. 
I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.

> But I think it would
> be even cleaner if each Foo simply had a Foo.IO (or foo.io)
> sub-module organizing the I/O for multiple file formats where
> applicable.

I agree.

> The TreeIO.* namespace is not crowded -- just read, write,
> parse, convert. If that directory is moved under Bio.Tree and
> renamed to IO or io, then Bio.Tree would still seem reasonably
> intuitive if __init__.py contained:
>
> from io import *
> from utils import *
>
> Then "from Bio import Tree" would be enough for most uses.

Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed.
For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user. > > perhaps something different like Bio.Phylo instead? > > Sure, that sounds promising. I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing. --Michiel From eric.talevich at gmail.com Sat Jan 9 18:38:29 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 9 Jan 2010 18:38:29 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <863834.10061.qm@web62403.mail.re1.yahoo.com> References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> Message-ID: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> Hi, Thanks for your comments. I've reorganized the modules like this: Bio/Phylo/ __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py IO/ __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py Now "from Bio import Phylo" works for the common cases, and "from Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct access to the parsers. I renamed TreeIO to Phylo/IO -- keeping it uppercase because io is a standard module in Py2.6+, Py2.7 changes the priority rules for absolute vs. relative imports, and Py2.4 doesn't support the new syntax for relative imports. I might change the other file names to lower case before the next merge, though... On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon wrote: > > Rather than importing *, can we import only those functions that a user would actually use? 
We should avoid importing stuff that is essentially used only locally in each sub-module.
>
> Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
>

I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.

Cheers,
Eric

From mjldehoon at yahoo.com Sat Jan 9 21:50:21 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 9 Jan 2010 18:50:21 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <274373.93315.qm@web62406.mail.re1.yahoo.com>

I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it.

One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

Thanks!

--Michiel

--- On Sat, 1/9/10, Eric Talevich wrote:

> From: Eric Talevich
> Subject: Re: [Biopython-dev] Code review request for phyloxml branch
> To: "Michiel de Hoon"
> Cc: "Peter Cock", "BioPython-Dev Mailing List"
> Date: Saturday, January 9, 2010, 6:38 PM
> Hi,
>
> Thanks for your comments. I've reorganized the modules like
> this:
>
> Bio/Phylo/
>     __init__.py, BaseTree.py, Newick.py,
> PhyloXML.py, Utils.py
>     IO/
>         __init__.py, NexusIO.py,
> NewickIO.py, PhyloXMLIO.py
>
> Now "from Bio import Phylo" works for the common cases, and
> "from
> Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct
> access to the
> parsers.
>
> I renamed TreeIO to Phylo/IO -- keeping it uppercase
> because io is a
> standard module in Py2.6+, Py2.7 changes the priority rules
> for
> absolute vs. relative imports, and Py2.4 doesn't support
> the new
> syntax for relative imports. I might change the other file
> names to
> lower case before the next merge, though...
>
> On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon
> wrote:
> >
> > Rather than importing *, can we import only those
> functions that a user would actually use? We should avoid
> importing stuff that is essentially used only locally in
> each sub-module.
> >
> > Another option is to have all functions that are
> intended to be used by the user in Bio.Phylo, and have those
> functions access (internally) any sub-module as needed. For
> example, a user would not notice that Bio.Phylo.read
> actually uses code from Bio.Phylo.io; the latter module
> would not be accessed directly by the user.
> >
>
> I'm trying to avoid having to update Phylo/__init__.py each
> time I add
> or rename a public function in Utils.py or IO. So, how
> about this:
> I've added "__all__" definitions to Utils.py and
> IO/__init__.py so
> that only the relevant public functions are loaded when
> Phylo/__init__.py imports * from those two sub-modules.
> Testing
> manually, this seems to do the right thing.
> > Cheers, > Eric > From eric.talevich at gmail.com Sun Jan 10 17:02:10 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 10 Jan 2010 17:02:10 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <274373.93315.qm@web62406.mail.re1.yahoo.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> Message-ID: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> On Sat, Jan 9, 2010 at 9:50 PM, Michiel de Hoon wrote: > I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. OK -- I pulled the latest from biopython/biopython on GitHub, merged my phyloxml branch into my master branch, and pushed it all back to biopython. Bio.Phylo is now part of Biopython! For documentation on the Biopython wiki, I moved the relevant parts of the Tree, TreeIO and PhyloXML pages to a new page for Bio.Phylo: http://biopython.org/wiki/Phylo It's a little rough at the moment, but I'll refine it this week. Some of the content can also be moved to separate cookbook entries. > One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this? I went over all the docstrings and comments again before merging; it should be free of Tree/TreeIO references now. Thanks for your help! 
Eric From biopython at maubp.freeserve.co.uk Mon Jan 11 06:04:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 11:04:03 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> Message-ID: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com> On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich wrote: > > I'm trying to avoid having to update Phylo/__init__.py each time I add > or rename a public function in Utils.py or IO. So, how about this: > I've added "__all__" definitions to Utils.py and IO/__init__.py so > that only the relevant public functions are loaded when > Phylo/__init__.py imports * from those two sub-modules. Testing > manually, this seems to do the right thing. Previously bits of Biopython have used __all__, and then abandoned this as a long term maintenance load. This was before my time, so I am not familiar with the full history, but it makes me wary about using __all__ here. Personally I don't see a big problem with having just explicit manual imports within Bio/Phylo/__init__.py if and when you decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py should be made available at the top level. In general I would think relatively few things should be exposed like that.
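For readers following the __all__ discussion above, here is a minimal sketch of what "import *" exposes with and without an __all__ definition. The module contents and the names pretty_print, helper, utils_with_all and utils_without_all are hypothetical stand-ins, not the actual Bio/Phylo/Utils.py code, and the throwaway modules are built in memory only so the example is self-contained.

```python
import sys
import types

# Hypothetical stand-in for a sub-module such as Bio/Phylo/Utils.py.
SOURCE = '''
import os  # implementation detail, not meant for users

# Only this name should be exposed by "import *".
__all__ = ["pretty_print"]

def pretty_print(tree):
    return "tree: %s" % (tree,)

def helper(tree):  # looks public, but is not listed in __all__
    return os.linesep.join(tree)
'''


def star_import(source, name):
    """Build a throwaway module from source and list what 'import *' yields."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module  # make it importable by name
    namespace = {}
    exec("from %s import *" % name, namespace)
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return sorted(namespace)


with_all = star_import(SOURCE, "utils_with_all")
without_all = star_import(
    SOURCE.replace('__all__ = ["pretty_print"]', ""), "utils_without_all"
)

print(with_all)     # ['pretty_print']
print(without_all)  # ['helper', 'os', 'pretty_print']
```

Without __all__, every name that does not start with an underscore leaks out, including the imported os module. The explicit alternative suggested above would be a line such as "from Utils import pretty_print" in the package __init__.py, which keeps the exposed names visible in one place at the cost of updating that file by hand.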
Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 06:37:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 11:37:42 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> Message-ID: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich wrote: > > OK -- I pulled the latest from biopython/biopython on GitHub, merged > my phyloxml branch into my master branch, and pushed it all back to > biopython. Bio.Phylo is now part of Biopython! Wow - that was quicker than I expected. As an aside, do you know why there seem to be three main branches in the history now? I guess this was the "original" master, your local master, and your phyloxml branch? One minor thing - test_Phylo.py needs to be tweaked to raise a MissingExternalDependencyError if NetworkX isn't installed. That way the run_tests.py script will treat it as a skipped test instead of a failed test. Alternatively, if this is just a small part of the test, maybe split test_Phylo.py into two files (e.g. add a new file test_Phylo_NeworkX.py which needs the dependency). And how's this for a draft entry in the NEWS file? New module Bio.Phylo includes support for reading, writing and working with phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by Eric Talevich on a Google Summer of Code 2009 project, under The National Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and Christian Zmasek. 
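The MissingExternalDependencyError suggestion above could look roughly like the following sketch. The require helper, its hint strings, and the fallback exception class are assumptions made for illustration; only the exception name itself comes from Biopython, and the fallback exists purely so the example runs where Biopython is not installed.

```python
import importlib

try:
    # Biopython defines this exception at the top level of the Bio package.
    from Bio import MissingExternalDependencyError
except ImportError:
    # Assumption: local fallback so the sketch runs without Biopython.
    class MissingExternalDependencyError(Exception):
        pass


def require(module_name, hint):
    """Import module_name, or signal that the whole test file should be skipped."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise MissingExternalDependencyError(hint)


# At the top of a hypothetical dependency-specific test file one might write:
#     networkx = require("networkx", "Install NetworkX to run these tests.")


def demo():
    """Show both outcomes using modules whose availability is predictable."""
    json_module = require("json", "never shown, json is in the standard library")
    try:
        require("no_such_module_for_demo", "Install the missing package.")
    except MissingExternalDependencyError as err:
        return json_module.__name__, str(err)


print(demo())
```

Raising at import time, before any test case is defined, is what lets a test runner treat the whole file as skipped rather than failed, as described in the message above.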
Peter From chapmanb at 50mail.com Mon Jan 11 08:18:40 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 11 Jan 2010 08:18:40 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> Message-ID: <20100111131840.GB46441@sobchak.mgh.harvard.edu> Hi all; > OK -- I pulled the latest from biopython/biopython on GitHub, merged > my phyloxml branch into my master branch, and pushed it all back to > biopython. Bio.Phylo is now part of Biopython! Awesome. Congrats Eric -- thanks for all the hard work on this during the summer, and getting it in shape for inclusion. Peter and Michiel, thanks for all the helpful feedback. Really happy to have this integrated, Brad From biopython at maubp.freeserve.co.uk Mon Jan 11 08:42:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 13:42:32 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com> References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com> Message-ID: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> On Mon, Jan 11, 2010 at 11:04 AM, Peter wrote: > On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich wrote: >> >> I'm trying to avoid having to update Phylo/__init__.py each time I add >> or rename a public function in Utils.py or IO. So, how about this: >> I've added "__all__" definitions to Utils.py and IO/__init__.py so >> that only the relevant public functions are loaded when >> Phylo/__init__.py imports * from those two sub-modules. 
Testing >> manually, this seems to do the right thing. > > Previously bits of Biopython have used __all__, and then > abandoned this as a long term maintenance load. This was before > my time, so I am not familiar with the full history, but it makes me > wary about using __all__ here. > > Personally I don't see a big problem with having just explicit > manual imports within Bio/Phylo/__init__.py if and when you > decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py > should be made available at the top level. In general I would > think relatively few things should be exposed like that. In fact, why even do this at all? What is wrong with leaving the IO functions (read, parse, write) as Bio.Phylo.IO.read etc e.g.

>>> from Bio import Phylo
>>> tree = Phylo.IO.read(open("int_node_labels.nwk"),"newick")

What is the benefit of having them also exposed under the Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means there are two ways to access them which is confusing. If we do want to use Bio.Phylo.IO instead of Bio.PhyloIO (or Bio.TreeIO) then thinking long term we may want to do something about Bio.SeqIO and Bio.AlignIO to match. We could move the Bio.AlignIO functionality under Bio.Align.IO (with a suitable transition period). We could move Bio.SeqIO to Bio.Seq.IO perhaps. Or we could even talk about introducing Bio.Sequences (or something) then move Bio.SeqIO to Bio.Sequences.IO, and move Bio.SeqUtils.* under there too, and perhaps even the Seq, SeqRecord and SeqFeature objects as well. On the other hand, all that upheaval would cause a lot of pain for end users, for relatively little gain.
Peter From mjldehoon at yahoo.com Mon Jan 11 10:02:46 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 11 Jan 2010 07:02:46 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> Message-ID: <107440.85746.qm@web62406.mail.re1.yahoo.com> --- On Mon, 1/11/10, Peter wrote: > What is wrong with leaving the IO functions > (read, parse, write) as Bio.Phylo.IO.read etc > e.g. > > >>> from Bio import Phylo > >>> tree = > Phylo.IO.read(open("int_node_labels.nwk"),"newick") > > What is the benefit of having them also exposed under the > Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means > there are two ways to access them which is confusing. If we use Bio.Phylo.IO.read directly, then for consistency we'd have to do the same for all other modules. Otherwise, we'd be guessing each time whether the read() and parse() functions are in Bio.SomeModule, or Bio.SomeModule.IO. For Bio.Phylo, a simple solution is to put whatever is in Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and remove Bio.Phylo.IO.__init__.py. Then there is only one way to access the read() etc. functions. [About doing the same for Bio.Seq and Bio.Align] > On the other hand, all that upheaval would cause a > lot of pain for end users, for relatively little gain. For new users, it may be confusing to have all those different modules dealing with sequences. At least, it was for me when I started with Biopython. Therefore, for a long term solution, I'd prefer a single Bio.Seq module that incorporates all (Seq, SeqRecord, SeqIO, SeqFeature). I agree that that may cause a lot of upheaval for end users, but a suitably long transition period may mitigate those concerns. I'd prefer that to being stuck with a less-than-optimal code organization forever. 
--Michiel From biopython at maubp.freeserve.co.uk Mon Jan 11 11:17:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:17:36 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <107440.85746.qm@web62406.mail.re1.yahoo.com> References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote: > > On Mon, 1/11/10, Peter wrote: >> What is the benefit of having them also exposed under the >> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means >> there are two ways to access them which is confusing. > > If we use Bio.Phylo.IO.read directly, then for consistency we'd have > to do the same for all other modules. Otherwise, we'd be guessing > each time whether the read() and parse() functions are in > Bio.SomeModule, or Bio.SomeModule.IO. Fair point. > For Bio.Phylo, a simple solution is to put whatever is in > Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and > remove Bio.Phylo.IO.__init__.py. Then there is only one > way to access the read() etc. functions. Or (if the functions are reasonably complex) keep the input/output code in a separate file, but make it explicit that it is not a public interface - e.g. use Bio/Phylo/_IO.py? > [About doing the same for Bio.Seq and Bio.Align] >> On the other hand, all that upheaval would cause a >> lot of pain for end users, for relatively little gain. > > For new users, it may be confusing to have all those > different modules dealing with sequences. At least, it > was for me when I started with Biopython. Therefore, > for a long term solution, I'd prefer a single Bio.Seq > module that incorporates all (Seq, SeqRecord, SeqIO, > SeqFeature). I agree that for a long term solution a single module makes sense here, although I'm not convinced that Bio.Seq is the best name.
We'd have to switch from a single file Bio/Seq.py to a folder with multiple files including Bio/Seq/__init__.py - I worry this may cause problems with updating existing Biopython installations. > I agree that that may cause a lot of upheaval for end > users, but a suitably long transition period may mitigate > those concerns. I'd prefer that to being stuck with a > less-than-optimal code organization forever. In principle I agree with that. Peter From eric.talevich at gmail.com Mon Jan 11 11:30:32 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 11 Jan 2010 11:30:32 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> Message-ID: <3f6baf361001110830y391ea21cs8315a266b8b4fb43@mail.gmail.com> On Mon, Jan 11, 2010 at 6:37 AM, Peter wrote: > On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich wrote: >> >> OK -- I pulled the latest from biopython/biopython on GitHub, merged >> my phyloxml branch into my master branch, and pushed it all back to >> biopython. Bio.Phylo is now part of Biopython! > > Wow - that was quicker than I expected. As an aside, do you know > why there seem to be three main branches in the history now? > I guess this was the "original" master, your local master, and your > phyloxml branch? Er, sorry if I jumped the gun. I was eager to get this done before the semester kicks in... 
anyway, these are the Git commands I used:

git checkout master
git pull upstream # remote: biopython master
git checkout phyloxml
git merge master # check that it merges cleanly
git checkout master
git merge phyloxml # fast-forward
git push upstream master
git push origin master # updating my own branches on github
git push origin phyloxml

It looks more reasonable in gitk; maybe the branches will separate again later on GitHub when they're no longer equivalent, or when I delete the phyloxml branch. > One minor thing - test_Phylo.py needs to be tweaked to raise a > MissingExternalDependencyError if NetworkX isn't installed. That > way the run_tests.py script will treat it as a skipped test instead of > a failed test. Alternatively, if this is just a small part of the test, > maybe split test_Phylo.py into two files (e.g. add a new file > test_Phylo_NeworkX.py which needs the dependency). I extracted test_Phylo_depend.py from test_Phylo and added tests at the top level for networkx and either pygraphviz or pydot (since those are also used by Bio/Phylo/Utils.py). > And how's this for a draft entry in the NEWS file? > > New module Bio.Phylo includes support for reading, writing and working with > phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by > Eric Talevich on a Google Summer of Code 2009 project, under The National > Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and > Christian Zmasek. Great, thanks!
Eric From eric.talevich at gmail.com Mon Jan 11 11:43:01 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 11 Jan 2010 11:43:01 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> Message-ID: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com> On Mon, Jan 11, 2010 at 11:17 AM, Peter wrote: > On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote: >> >> On Mon, 1/11/10, Peter wrote: >>> What is the benefit of having them also exposed under the >>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means >>> there are two ways to access them which is confusing. >> >> If we use Bio.Phylo.IO.read directly, then for consistency we'd have >> to do the same for all other modules. Otherwise, we'd be guessing >> each time whether the read() and parse() functions are in >> Bio.SomeModule, or Bio.SomeModule.IO. > > Fair point. > >> For Bio.Phylo, a simple solution is to put whatever is in >> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and >> remove Bio.Phylo.IO.__init__.py. Then there is only one >> way to access the read() etc. functions. > > Or (if the functions are reasonably complex) keep the > input/output code in a separate file, but make it explicit > that it is not a public interface - e.g. use Bio/Phylo/_IO.py? Something like this? 
Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py

This plays well with the expected import styles:

from Bio import Phylo  # most common
from Bio.Phylo import PhyloXML  # access the defined types
from Bio.Phylo import PhyloXMLIO  # special parsing

From biopython at maubp.freeserve.co.uk Mon Jan 11 12:11:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 17:11:29 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? In-Reply-To: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> Message-ID: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com> On Mon, Nov 23, 2009 at 2:43 PM, Peter wrote: > Dear all, > > Is there anyone on the dev mailing list willing to test the SFF > support I've been working on for Bio.SeqIO? The code is here, > a branch on github: > http://github.com/peterjc/biopython/tree/sff-seqio > > The important files are: > * Bio/SeqIO/SffIO.py > * Bio/SeqIO/__init__.py (defining the new format) > * Bio/SeqIO/_index.py (indexing SFF files) > > Plus unit test files: > * Tests/run_tests.py (to run the doctests) > * Tests/test_SeqIO_QualityIO.py > * Tests/test_SeqIO_index.py > * Tests/test_SeqIO.py > * Tests/Roche/* (for unit tests) > > Sebastian Bassi had a look last month and his feedback has > already helped (e.g. with error messages): > http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html > > I have been using this code myself in real work, for example > editing the trim points in an SFF file to take into account PCR > primer sequences, and filtering SFF reads, checking Roche > barcodes etc. > > Thanks, > > Peter > Hi all, I didn't want to rush the SFF support into Biopython 1.53, but it's been waiting "ready" for a while now. Any objections or comments about me merging this now?
Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jan 12 09:51:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jan 2010 14:51:58 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com> References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com> Message-ID: <320fb6e01001120651i6b3d661m83187659595ce9e4@mail.gmail.com> On Mon, Jan 11, 2010 at 4:43 PM, Eric Talevich wrote: > On Mon, Jan 11, 2010 at 11:17 AM, Peter wrote: >> Or (if the functions are reasonably complex) keep the >> input/output code in a separate file, but make it explicit >> that it is not a public interface - e.g. use Bio/Phylo/_IO.py? > > Something like this?
>
> Phylo/
>     BaseTree.py
>     Newick.py
>     PhyloXML.py
>     _IO.py
>     _Utils.py
>     PhyloXMLIO.py
>     NewickIO.py
>     NexusIO.py
>
> This plays well with the expected import styles:
>
> from Bio import Phylo  # most common
> from Bio.Phylo import PhyloXML  # access the defined types
> from Bio.Phylo import PhyloXMLIO  # special parsing

I'd forgotten Bio/Phylo/IO was a directory, and that the users may want to access PhyloXMLIO directly. That suggested structure looks reasonable... what do you think Michiel? Peter From kellrott at gmail.com Tue Jan 12 16:46:39 2010 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 12 Jan 2010 13:46:39 -0800 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> Message-ID: I've pulled from the main branch and fixed a few problems. I've tested the code against Sqlite, Python Mysql, and Jython Mysql. All three seem to be working right now.
Kyle On Thu, Dec 17, 2009 at 10:03 AM, Kyle Ellrott wrote: > > > Code can be found at http://github.com/kellrott/biopython >> >> Lovely. That's on your jython branch (along with lots of your other work)? >> > > Yes, but all of the zxJDBC work has been done in the past 2 weeks (just the > last three commits), so it should be easy to cherry-pick out the relevant > patches. > > Kyle > From biopython at maubp.freeserve.co.uk Tue Jan 12 16:51:34 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jan 2010 21:51:34 +0000 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> Message-ID: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott wrote: > I've pulled from the main branch and fixed a few problems. I've tested the > code against Sqlite, Python Mysql, and Jython Mysql. All three seem to be > working right now. > > Kyle Excellent - I had a play last month, and Jython Mysql seemed to work. Do you know if/how to get SQLite and/or PostgreSQL drivers installed under zxJDBC? Peter From kellrott at gmail.com Tue Jan 12 17:06:39 2010 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 12 Jan 2010 14:06:39 -0800 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> Message-ID: I haven't played with Postgre yet (don't even have it installed). Sqlite as a python package hasn't been standardized to Jython yet ( http://bugs.jython.org/issue1682864 ) One option is to call SQLite JDBC ( http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the existing SQLite code. But like zxJDBC, the jar would need to be in the CLASSPATH variable for the code to work.
Kyle On Tue, Jan 12, 2010 at 1:51 PM, Peter wrote: > On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott wrote: > > I've pulled from the main branch and fixed a few problems. I've tested > the > > code against Sqlite, Python Mysql, and Jython Mysql. All three seem to > be > > working right now. > > > > Kyle > > Excellent - I had a play last month, and Jython Mysql seemed to work. > Do you know if/how to get SQLite and/or PostgreSQL drivers installed > under zxJDBC? > > Peter > From biopython at maubp.freeserve.co.uk Wed Jan 13 06:22:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jan 2010 11:22:23 +0000 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> Message-ID: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott wrote: > I haven't played with Postgre yet (don't even have it installed). > Sqlite as a python package hasn't been standardized to Jython yet ( > http://bugs.jython.org/issue1682864 ) > > One option is to call SQLite JDBC ( > http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the > existing SQLite code. > But like zxJDBC, the jar would need to be in the CLASSPATH variable for the > code to work. I'm not 100% convinced that the details of your current approach are the best way forward: Specifically taking a user script that works on (C) Python using MySQL with MySQLdb as the driver, and when run on Jython automatically interpreting this to use the Java MySQL Connector/J with the org.gjt.mm.mysql.Driver (and so on for the PostgreSQL and SQLite drivers?)
It might be clearer if we just treat the different Jython/Java drivers as top level alternatives:

* MySQLdb (Python only, at least for now)
* psycopg, psycopg2, pgdb (Python only, at least for now)
* sqlite3 (currently Python only, maybe available on Jython later)
* org.gjt.mm.mysql.Driver (Jython only)
* Some Java PostgreSQL driver (Jython only)
* Some Java SQLite driver (Jython only)

This way we have a clean separation of all the different driver or database specific changes - although the user is required to make some minor changes to take an existing BioSQL on MySQL script to explicitly change the driver from MySQLdb to org.gjt.mm.mysql.Driver if they want to run it on Jython. We also won't have lots of "if jython" statements everywhere. What are your thoughts on this? Note there will be some similarities between all the MySQL adaptors, all the PostgreSQL adaptors, etc. I've just made a small improvement to file BioSQL/DBUtils.py to reduce the code duplication for the existing (C) Python PostgreSQL adaptors. Peter From biopython at maubp.freeserve.co.uk Wed Jan 13 09:10:21 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jan 2010 14:10:21 +0000 Subject: [Biopython-dev] Phasing out support for Python 2.4? Message-ID: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> Hi all, Biopython currently supports Python 2.4, 2.5 and 2.6 (and seems to work on the current Python 2.7 alpha). Is it time to start phasing out support for Python 2.4? Reasons for encouraging Python 2.5+ include the built in support for sqlite3 (which we can use in the BioSQL wrappers) and ElementTree (which we use for the phyloXML parser) both of which must currently be manually installed for Python 2.4. Also ReportLab is talking about dropping support for Python 2.4 (another optional dependency of Biopython). As far as I know, NumPy haven't yet talked about dropping support for Python 2.4.
I was thinking of the usual deprecation procedure, so we'd aim to have at least two releases and one year before actually dropping support for Python 2.4. At that point older Linux distributions which ship with Python 2.4 probably won't be supported anyway. e.g. The last version of Ubuntu to have Python 2.4 as the default was Ubuntu 6.06 LTS (Dapper Drake). The desktop edition support ended July 2009, but the server edition will be maintained until June 2011. Peter From eric.talevich at gmail.com Wed Jan 13 12:08:24 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 13 Jan 2010 12:08:24 -0500 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> Message-ID: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com> On Wed, Jan 13, 2010 at 9:10 AM, Peter wrote: > Hi all, > > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha). > > Is it time to start phasing out support for Python 2.4? > > Reasons for encouraging Python 2.5+ include the > built in support for sqlite3 (which we can use in the > BioSQL wrappers) and ElementTree (which we use > for the phyloXML parser) both of which must currently > be manually installed for Python 2.4.
Also, it appears that Python 2.7 will use absolute instead of relative imports by default: http://www.python.org/dev/peps/pep-0328/

For intra-package imports like in PDB/__init__.py, an import like this:

from PDBParser import PDBParser

could be future-proofed for Py2.5+:

from __future__ import absolute_import
from .PDBParser import PDBParser

But to make it work in both Py2.4 and Py2.7, it would need to be converted to an absolute import:

from Bio.PDB.PDBParser import PDBParser

Py2.5 introduced a number of other enticing syntax features, too: http://docs.python.org/dev/whatsnew/2.5.html

- context managers (with_statement)
- if-else expressions
- unified try-except-finally (I flagged this issue in the comments in Bio.Phylo)
- all() and any()
- passing values into generators -- could be useful for parsing, maybe

The enhancements to setuptools might help simplify the dependency handling in setup.py: http://docs.python.org/dev/whatsnew/2.5.html#pep-314-metadata-for-python-software-packages-v1-1 I'm also interested in the functools and ctypes modules, but don't have pressing use cases for them. (So, you can take that as a +1 from me.) Cheers, Eric From biopython at maubp.freeserve.co.uk Wed Jan 13 12:21:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jan 2010 17:21:23 +0000 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com> Message-ID: <320fb6e01001130921w49b56793h413aacd3027d6275@mail.gmail.com> On Wed, Jan 13, 2010 at 5:08 PM, Eric Talevich wrote: > On Wed, Jan 13, 2010 at 9:10 AM, Peter wrote: >> Hi all, >> >> Biopython currently supports Python 2.4, 2.5 and 2.6 >> (and seems to work on the current Python 2.7 alpha). >> >> Is it time to start phasing out support for Python 2.4?
>> >> Reasons for encouraging Python 2.5+ include the >> built in support for sqlite3 (which we can use in the >> BioSQL wrappers) and ElementTree (which we use >> for the phyloXML parser) both of which must currently >> be manually installed for Python 2.4. > > Also, it appears that Python 2.7 will use absolute instead > of relative imports by default: > http://www.python.org/dev/peps/pep-0328/ Thanks for the heads up on that. I think we'll just need to switch everything to absolute imports in order to cover Python 2.4 to 2.7 inclusive. > > (So, you can take that as a +1 from me.) > Good :) Peter From kellrott at gmail.com Wed Jan 13 12:37:53 2010 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 13 Jan 2010 09:37:53 -0800 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> Message-ID: My main thought was to make it so that users can write a single script that would work on any Python system (eventually IronPython as well). Because the current system expects the user to request a specific driver (MySQLdb) that happens to be system specific, it forces user code to be system specific. One alternative would be to use the strings you describe below, but in addition add special requests that would check the system and pull the appropriate driver automatically. 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, uses org.gjt.mm.mysql.Driver if in Jython. Otherwise, if the user wants to use a specific driver, they pass its name. Kyle On Wed, Jan 13, 2010 at 3:22 AM, Peter wrote: > On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott wrote: > > I haven't played with Postgre yet (don't even have it installed).
> > Sqlite as a python package hasn't been standardized to Jython yet ( > > http://bugs.jython.org/issue1682864 ) > > > > One option is to call SQLite JDBC ( > > http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather then reusing > the > > existing SQLite code. > > But like zxJDBC, the jar would need to be in the CLASSPATH variable for > the > > code to work. > > I'm not 100% convinced that the details of your current approach > are the best way forward: Specifically taking a user script that works > on (C) Python using MySQL with MySQLdb as the driver, and when > run on Jython automatically interpreting this to use the Java MySQL > Connector/J with the org.gjt.mm.mysql.Driver (and so on for the > PostgreSQL and SQLite drivers?) > > It might be clearer if we just treat the different Jython/Java drivers > as top level alternatives: > > * MySQLdb (Python only, at least for now) > * psycopg, psycopg2, pgdb (Python only, at least for now) > * sqlite3 (currently Python only, maybe available on Jython later) > * org.gjt.mm.mysql.Driver (Jython only) > * Some JAVA PostreSQL driver (Jython only) > * Some JAVA SQLite driver (Jython only) > > This way we have a clean separation of all the different driver > or database specific changes - although the user is required > to make some minor changes to take an existing BioSQL on > MySQL script to explicitly change the driver from MySQLdb > to org.gjt.mm.mysql.Driver if they want to run it on Jython. > We also won't have lots of "if jython" statements everywhere. > > What are your thoughts on this? > > Note there will be some similarities between all the MySQL > adaptors, all the PostgreSQL adaptors, etc. I've just made > a small improvement to file BioSQL/DBUtils.py to reduce > the code duplication for the existing (C) Python PostgreSQL > adaptors. 
> > Peter > From chapmanb at 50mail.com Thu Jan 14 07:52:44 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 14 Jan 2010 07:52:44 -0500 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> Message-ID: <20100114125244.GB59876@sobchak.mgh.harvard.edu> Hey Peter; Sounds great to me. Looking forward to being able to use conditional expressions, collections.defaultdict, functools, and the with statement. 2.5 had a lot of great stuff. Brad > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha). > > Is it time to start phasing out support for Python 2.4? > > Reasons for encouraging Python 2.5+ include the > built in support for sqlite3 (which we can use in the > BioSQL wrappers) and ElementTree (which we use > for the phyloXML parser) both of which must currently > be manually installed for Python 2.4. > > Also ReportLab is talking about dropping support > for Python 2.4 (another optional dependency of > Biopython). As far as I know, NumPy haven't yet > talked about dropping support for Python 2.4. > > I was thinking of the usual deprecation procedure, so > we'd aim to have at least two releases and one year > before actually dropping support for Python 2.4. At > that point older Linux distributions which ship with > Python 2.4 probably won't be supported anyway. > > e.g. The last version of Ubuntu to have Python 2.4 > as the default was Ubuntu 6.06 LTS (Dapper Drake). > The desktop edition support ended July 2009, but > the server edition will be maintaned until June 2011. 
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Thu Jan 14 09:52:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 14:52:24 +0000 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <20100114125244.GB59876@sobchak.mgh.harvard.edu> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <20100114125244.GB59876@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01001140652v1e11725esa6a2f91fafd0104b@mail.gmail.com> On Thu, Jan 14, 2010 at 12:52 PM, Brad Chapman wrote: > Hey Peter; > Sounds great to me. Looking forward to being able to use conditional > expressions, collections.defaultdict, functools, and the with > statement. 2.5 had a lot of great stuff. > > Brad I guess there are quite a few good things in Python 2.5+, although I think the jump from Python 2.3 to 2.4 was more important (generators and decorators). You'll have to restrain yourself from using the new toys in Biopython a little longer though Brad ;) Since this seems to have raised no immediate objections, I've sent a message to the main and announcement lists: http://lists.open-bio.org/pipermail/biopython/2010-January/006111.html http://lists.open-bio.org/pipermail/biopython-announce/2010-January/000064.html Assuming there are no objections, we can add a conditional deprecation warning to setup.py and do a news blog post (like we did for dropping Python 2.3 early last year): http://news.open-bio.org/news/2009/05/dropping-python23-support/ Peter From biopython at maubp.freeserve.co.uk Thu Jan 14 12:32:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 17:32:22 +0000 Subject: [Biopython-dev] [Biopython] Phasing out support for Python 2.4? 
In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJŠ wrote: > > Hi Peter, > I don't get this point much. What is the problem stating that with > python 2.5+ one does not need to install an extra dependency while > for 2.4 one needs _two_ modules? > I don't think I want BioSQL nor sqlite so why would I have to upgrade. > Would the requirement be in python language syntax incompatibility then > I would NOT object, but in this situation ... > Martin Hi Martin, This isn't just the issue of sqlite3 and ElementTree. There are several benefits to using more recent versions of Python, for example with an eye on the future for Python 3, and on a practical level it simplifies our testing to have one less version to worry about (especially once Python 2.7 is out, currently scheduled for June 2010). We've already had minor issues with developers using Python 2.5+ syntax unwittingly which broke on Python 2.4 (nothing major, and it was easily fixed once the problem was spotted). If we continue to insist on Python 2.4 support, it may prove problematic if future potential contributors have existing code written for Python 2.5+ which would require significant re-factoring. None of these concerns are pressing right now (and some are hypothetical), but I think you will agree that Python 2.4 is pretty old, and not widely used anymore. Having a clear plan in place for dropping it seems a sensible move, and once that happens we can start to take advantage of the language and library improvements Python 2.5 added. Are you personally using Python 2.4? If so, could you tell us a little more - for example, is this a university server which would be difficult to update?
Or do you require some other Python package which requires Python 2.4? Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 14 13:55:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 14 Jan 2010 13:55:18 -0500 Subject: [Biopython-dev] [Bug 2992] New: Adding Uniprot XML file format parsing to Biopython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2992 Summary: Adding Uniprot XML file format parsing to Biopython Product: Biopython Version: 1.53 Platform: All URL: http://github.com/apierleoni/biopython/tree/uniprotxml- branch OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biocomp.unibo.it Uniprot XML formatted files are much easier to parse then the swissprot flat file, and are widely used at EMBL either for uniprot, IPI and integr8 databases -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Thu Jan 14 13:57:58 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 14 Jan 2010 19:57:58 +0100 (CET) Subject: [Biopython-dev] New: Uniprot XML parser Message-ID: Hi Everyone, I've been using a lot biopython in the last couple of years, it is very useful to me. So now it's my turn to contribute and be helpful to someone else. I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 entries/min on a core2duo mainstream PC). The main improvements with the actual SwissProt flat file parser are a deeper parsing of comment fields, and a Seqrecord containing features. The parser is based on the ElementTree library and was successfully tested on the complete SwissProt database (v57.12). Thus I think it is ready to be released. 
I followed the rules to develop a new parser for SeqIO, filed an enhancement bug to bugzilla (bug 2992), and included the parser in a public biopython fork on github available at: http://github.com/apierleoni/biopython/tree/uniprotxml-branch the new parser is in the "uniprotxml-branch" branch, and the parser code is in Bio/SeqIO/UniprotIO.py The parser can be used from SeqIO using: iterator=SeqIO.parse(handle,'uniprot') I think this could be easily integrated in Biopython, unit test is still missing, but should be very easy to do. Anyhow any code review or suggestions are welcome. Andrea From p.j.a.cock at googlemail.com Thu Jan 14 14:16:49 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Jan 2010 19:16:49 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> On Thursday, January 14, 2010, Andrea Pierleoni wrote: > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features. > > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. 
> > I followed the rules to develop a new parser for SeqIO, filed an > enhancement bug to bugzilla (bug 2992), and included the parser in a > public biopython fork on github available at: > > http://github.com/apierleoni/biopython/tree/uniprotxml-branch > > the new parser is in the "uniprotxml-branch" branch, and the parser code > is in Bio/SeqIO/UniprotIO.py > > The parser can be used from SeqIO using: > > iterator=SeqIO.parse(handle,'uniprot') > > > I think this could be easily integrated in Biopython, unit test is still > missing, but should be very easy to do. > Anyhow any code review or suggestions are welcome. > > Andrea > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org Hi I'd spotted your branch on github - this looks like an excellent addition to Biopython :) What I would like to see is a few unit tests, specifically one using the same record in both XML (with the new parser) and the equivalent plain text SwissProt file (with the old parser) and check they agree. Also, I think you should check the start coordinates of the features are using python counting. Regards Peter From eric.talevich at gmail.com Thu Jan 14 15:03:35 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 14 Jan 2010 15:03:35 -0500 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <3f6baf361001141203i304146a4ld5683190a32b7ffe@mail.gmail.com> On Thu, Jan 14, 2010 at 1:57 PM, Andrea Pierleoni wrote: > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features.
> > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. Have you tried using this with Python 2.4? The ElementTree module wasn't added to the standard library until Python 2.5, so a simple "from xml.etree import ElementTree" may need some additional protection. It's also nice to let the user use a third-party implementation of ElementTree if they're stuck on Py2.4. An example of this is at the top of Bio.Phylo.PhyloXMLIO -- not pretty, but functional: http://github.com/biopython/biopython/blob/master/Bio/Phylo/PhyloXMLIO.py -Eric From p.j.a.cock at googlemail.com Thu Jan 14 18:04:36 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Jan 2010 23:04:36 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> Message-ID: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> On Thu, Jan 14, 2010 at 10:41 PM, Andrea Pierleoni wrote: > >> >> By default, copy the "swiss" parser. If that doesn't have the >> annotation, see if there is anything similar in the "genbank" >> parser (effectively our reference for rich annotation parsing). >> If in doubt, for now discard the data with a comment in the >> code - and then discuss it here. >> >> Peter >> > I'll take a look at both the swissprot and genbank parsers. > right now the annotation parsing shema is based on the xml schema. > eg. > > function text > > > is parsed in the annotations as: > > seqrecord.annotations['comment_function']=['function text'] > My reasoning is it should be (almost) transparent for users to switch from parsing the plain text SwissProt files ("swiss") to the XML form. There are also knock on implications for saving to BioSQL and file format conversions e.g. 
saving as a GenBank protein file (aka GenPept format). However, the comment parsing in the plain text "swiss" format is currently a little simplistic - partly to match what BioPerl did at the time. We can revisit that as part of this work. Peter From andrea at biocomp.unibo.it Fri Jan 15 05:35:39 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Fri, 15 Jan 2010 11:35:39 +0100 (CET) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> Message-ID: > > My reasoning is it should be (almost) transparent for > users to switch from parsing the plain text SwissProt > files ("swiss") to the XML form. This would be good > There are also knock > on implications for saving to BioSQL and file format > conversions e.g. saving as a GenBank protein file > (aka GenPept format). The returned Seqrecords are actually BioSQL-safe, since I can load them to a postgres biosql database. formatting the actual Seqrecord with 'genbank' dbxrefs, features, seq, keywords, source and names looks to be correctly reported, while there is no trace of the other annotations. I'll check it deeper. > > However, the comment parsing in the plain text "swiss" > format is currently a little simplistic - partly to match > what BioPerl did at the time. We can revisit that as > part of this work. > the main problem here are going to be the comment fields, that in the plain text predictors are parsed as a single string (this pushed me to wrote the new parser). I tried to keep comments parsing as simple as it can be, by just using lists of strings (good for BioSQL), but many comment types would be better parsed with a dictionary tree. 
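The "lists of strings" scheme described above can be sketched roughly as follows. This is only an illustration, not the code in Bio/SeqIO/UniprotIO.py: the XML snippet is a simplified, namespace-free stand-in (real UniProt files declare a namespace), and the helper name comments_to_annotations is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one UniProt entry (an assumption for illustration;
# real files declare an XML namespace and carry many more elements).
SNIPPET = """
<entry>
  <comment type="function"><text>function text</text></comment>
  <comment type="subunit"><text>Homodimer.</text></comment>
  <comment type="function"><text>second function note</text></comment>
</entry>
"""


def comments_to_annotations(entry):
    """Collect comment texts into a flat dict of lists of strings.

    Keys follow the 'comment_<type>' pattern mentioned in the thread.
    """
    annotations = {}
    for comment in entry.findall("comment"):
        key = "comment_" + comment.get("type")
        text = comment.findtext("text", default="").strip()
        # One list per comment type, so repeated types are all kept.
        annotations.setdefault(key, []).append(text)
    return annotations


annotations = comments_to_annotations(ET.fromstring(SNIPPET))
print(annotations["comment_function"])
```

A flat dict of string lists like this is easy to store in BioSQL, which is the trade-off against a nested dictionary tree that would keep more of the XML structure.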
As of now I left the option to get back the full XML for each comment, by calling: UniprotIO.UniprotIterator(handle,return_raw_comments=True) so every info in the XML file can be returned and the end user can decide how to parse those additional info. Anyhow I think it is better to discuss this when the unit test 'swiss'VS'uniprot' is ready. Andrea From p.j.a.cock at googlemail.com Fri Jan 15 06:08:32 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Jan 2010 11:08:32 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> Message-ID: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> On Fri, Jan 15, 2010 at 10:35 AM, Andrea Pierleoni wrote: >> >> However, the comment parsing in the plain text "swiss" >> format is currently a little simplistic - partly to match >> what BioPerl did at the time. We can revisit that as >> part of this work. >> > > the main problem here are going to be the comment fields, that in the > plain text predictors are parsed as a single string (this pushed me to > wrote the new parser). I tried to keep comments parsing as simple as it > can be, by just using lists of strings (good for BioSQL), but many comment > types would be better parsed with a dictionary tree. I think BioPerl now uses some kind of nest tree when parsing the SwissProt comment block, and I would like us to use something compatible (e.g. a dictionary tree) in the "swiss" parser (and thus also the XML parser) in such a way that we end up saving this in BioSQL the same way. 
> As of now I left the option to get back the full XML for each comment, by > calling: > > UniprotIO.UniprotIterator(handle,return_raw_comments=True) > > so every info in the XML file can be returned and the end user can decide > how to parse those additional info. > > Anyhow I think it is better to discuss this when the unit test > 'swiss'VS'uniprot' is ready. +1, good plan. Peter From bugzilla-daemon at portal.open-bio.org Fri Jan 15 07:38:49 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 15 Jan 2010 07:38:49 -0500 Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format In-Reply-To: Message-ID: <201001151238.o0FCcnB1017338@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2704 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-15 07:38 EST ------- According to the change log for the just released EMBOSS 6.2: Alignment output included headers only for EMBOSS-specific formats. The headers have been dropped from the FASTA MARKX0 through MARKX10 formats to allow standard FASTA suite parsers to use the EMBOSS versions of these outputs. See also: http://lists.open-bio.org/pipermail/emboss-dev/2009-August/000618.html Fingers crossed this means we will be able to parse their output with the "fasta-m10" parser in Bio.AlignIO. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Jan 18 08:01:15 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jan 2010 08:01:15 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions Message-ID: <20100118130115.GA48842@sobchak.mgh.harvard.edu> Hey all; After the Google groups discussion kicked off by Istvan last month, I've been thinking a bit about supplements to mailing list discussions. 
I'm agreed that mailman is not great for searching and archival purposes; we often see similar questions appear because finding and browsing the right thread from a past discussion is not intuitive. Google groups is okay, but doesn't offer a huge improvement over mailman. Additionally, reports indicate spamming is pretty bad, which creates additional moderation headaches. For handling "how do I do this biology task in Python" questions, what do people think about something entirely different like Stack Overflow? This presents a nice interface for asking questions, and the follow ups are voted up and down by utility so it's easy to see what the right answer is. Questions there are indexed well by search engines, so it's also more likely someone might be able to find a previous answer. There are actually a couple of questions on there with a Biopython tag: http://stackoverflow.com/questions/tagged/biopython From our point of view, we would need to adjust the documentation to point out Stack Overflow as a place to ask questions, and then monitor the biopython tag for new posts. Mailman is still a great option for implementation discussions, but Stack Overflow could open up question/answers to a larger audience and help supplement the cookbook and formal documentation. Brad From n.j.loman at bham.ac.uk Mon Jan 18 08:21:38 2010 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Mon, 18 Jan 2010 13:21:38 +0000 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100118130115.GA48842@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> Message-ID: <4B546062.3090802@bham.ac.uk> Brad Chapman wrote: > For handling "how do I do this biology task in Python" questions, what > do people think about something entirely different like Stack Overflow? > This presents a nice interface for asking questions, and the follow > ups are voted up and down by utility so it's easy to see what the > right answer is.
Questions there are indexed well by search engines, > so it's also more likely someone might be able to find a previous > answer. > Hi Brad Great suggestion, I have been thinking along the same lines. I really like the design of the Stack Exchange sites, it is a great way of exchanging Q&A information. It is worth mentioning that Stackoverflow is not the only site using the "Stack Exchange" format that is relevant. Here is a link to various other Stack Exchange sites: http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family Although there are Biopython questions in Stackoverflow, I wonder whether that is the correct place for questions, or whether it would be overall more productive to have a resource for bioinformatics? I think bioinformatics is the correct breadth of topic to keep a large enough community together whilst not being too off-topic. I have registered http://bioinformatics.stackexchange.com/ and will happily make you and anyone else who is interested an admin. Does the list think there could be enough community interest to justify a separate site like this? Cheers, Nick. From chapmanb at 50mail.com Mon Jan 18 09:20:10 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jan 2010 09:20:10 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <4B546062.3090802@bham.ac.uk> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> Message-ID: <20100118142010.GE48842@sobchak.mgh.harvard.edu> Hi Nick; > Great suggestion, I have been thinking along the same lines. I really > like the design of the Stack Exchange sites, it is a great way of > exchanging Q&A information. > > It is worth mentioning that Stackoverflow is not the only site using the > "Stack Exchange" format that is relevant. > > Here is a link to various other Stack Exchange sites: > http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family Awesome. Thanks for the pointer. 
Sounds like you have a great handle on this. > Although there are Biopython questions in Stackoverflow, I wonder > whether that is the correct place for questions, or whether it would be > overall more productive to have a resource for bioinformatics? I think > bioinformatics is the correct breadth of topic to keep a large enough > community together whilst not being too off-topic. > > I have registered http://bioinformatics.stackexchange.com/ and will > happily make you and anyone else who is interested an admin. > > Does the list think there could be enough community interest to justify > a separate site like this? It looks like there are a couple of Stack Exchange sites with similar aims for open source bioinformatics and chemistry: http://biostar.stackexchange.com/ http://blueobelisk.stackexchange.com/ If we go this way we might want to talk to the owners of these sites and integrate with them. My preference would be to go with the main StackOverflow site and carve out our niche with the tagging system. We build off of an existing community instead of needing to help grow one. Some of the more successful biology communities, like the one on Friendfeed, benefit from input outside of the standard community: http://friendfeed.com/the-life-scientists I think this would be less likely with a dedicated site, as that fortuitous crosstalk is prevented by other programmers never thinking to look at a bioinformatics only site. 
Happy to hear what others think, Brad From biopython at maubp.freeserve.co.uk Mon Jan 18 10:58:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 Jan 2010 15:58:27 +0000 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> Message-ID: <320fb6e01001180758t179f5ccdo99132e4b10b907bb@mail.gmail.com> On Wed, Jan 13, 2010 at 5:37 PM, Kyle Ellrott wrote: > My main thought was to make it so that users can write a single script that > would work on any Python system (eventually IronPython as well). Because > the current system expects the user to request a specific driver (MySQLdb) > that happens to be system specific, it forces user code to be system > specific. Yes, it does - as long as Jython or any other Python implementation doesn't support that driver. In the case of SQLite, it sounds like adding sqlite3 support to Jython is planned at least. > One alternative would be to use the strings you describe below, but in > addition add special requests that would check the system add pull the > appropriate driver automatically. > 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, use > org.gjt.mm.mysql.Driver if in Jython. > Otherwise, if the user wants to use a specific driver, they pass it's name. Maybe rather than specifying the driver, the user could specify the database back end (MySQL, PostgreSQL, SQLite, ...) and providing we know about this in advance, we can look up and try relevant drivers automatically. We could offer this in combination with the existing driver specifier. This seems cleaner than overloading the driver argument.
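A rough sketch of that back end lookup, for discussion only. The table CANDIDATE_DRIVERS and the function find_driver are hypothetical names, not part of BioSQL, and only the CPython driver modules named earlier in the thread are listed (the Jython JDBC class names would need separate handling):

```python
import importlib

# Hypothetical table mapping a database back end to candidate DB-API driver
# modules, in order of preference. The module names come from the thread;
# the table itself is an assumption for illustration.
CANDIDATE_DRIVERS = {
    "mysql": ["MySQLdb"],
    "postgresql": ["psycopg2", "psycopg", "pgdb"],
    "sqlite": ["sqlite3"],
}


def find_driver(backend, import_module=importlib.import_module):
    """Return the name of the first importable driver for a back end.

    import_module is injectable so the lookup can be exercised without
    having any of the real drivers installed.
    """
    try:
        candidates = CANDIDATE_DRIVERS[backend.lower()]
    except KeyError:
        raise ValueError("Unknown database back end: %r" % backend)
    for name in candidates:
        try:
            import_module(name)
        except ImportError:
            continue  # Driver not installed here; try the next candidate.
        return name
    raise ImportError("No driver available for back end %r" % backend)
```

A caller would ask for find_driver("PostgreSQL") and pass the returned name on as the existing driver argument, so the two ways of choosing a driver can coexist.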
Peter From biopython at maubp.freeserve.co.uk Mon Jan 18 11:33:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 Jan 2010 16:33:42 +0000 Subject: [Biopython-dev] EMBOSS eprimer3 parser Message-ID: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com> Hi all, Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in Biopython? I'd like someone to look over Leighton's proposed enhancements to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968 There are two main issues. First, the current code doesn't cope with multiple primer sets (so Leighton introduces read/parse functions in line with other modules for single or multiple sets of primers). This seems entirely sensible to me, and worthwhile in itself. Second, Leighton makes some changes to the primer record objects. I'm not so sure about the necessity here, even if it is backwards compatible, but I haven't really used this code. What do the rest of you think? Peter From istvan.albert at gmail.com Mon Jan 18 13:02:23 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Mon, 18 Jan 2010 13:02:23 -0500 Subject: [Biopython-dev] Biopython-dev Digest, Vol 84, Issue 14 In-Reply-To: References: Message-ID: On Mon, Jan 18, 2010 at 12:00 PM, wrote: > It looks like there are a couple of Stack Exchange sites with > similar aims for open source bioinformatics and chemistry: > > http://biostar.stackexchange.com/ > http://blueobelisk.stackexchange.com/ I am actually the original creator of http://biostar.stackexchange.com/ Created mainly to give my students a way to easily ask questions. Two things to keep in mind - it will cost money to run it, right now it is free due to it being in beta - it is not obvious that this service will actually be offered once beta concludes, or that it will be offered with the same conditions. That is pretty much what keeps me from investing more time into it. 
- making it a site like this only for biopython is too restrictive Other comments on using the stackoverflow main site: I think that because the site's focus is generic programming, most people looking for bioinformatics related information could easily get lost or not feel a connection. IMO the idea is fantastic, but it needs its own forum rather than being a small subset of unrelated topics. best, Istvan -- Istvan Albert http://www.personal.psu.edu/iua1 From biopython at maubp.freeserve.co.uk Tue Jan 19 05:49:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jan 2010 10:49:31 +0000 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function Message-ID: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> Hi Eric (and everyone else), I just spotted the to_adjacency_matrix function in utils: http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py The docstring says: > Create an adjacency matrix (NumPy array) from clades/branches in tree. > > Also returns a list of all clades in tree ("allclades"), where the position > of each clade in the list corresponds to a row and column of the numpy > array. So, a cell i,j in the array represents the length of the branch from > allclades[i] to allclades[j]. > > @return: tuple of (allclades, adjacency_matrix) where allclades is a list > and adjacency_matrix is a NumPy 2D array. It looks like your adjacency matrix starts as a numpy array of zeros, and then you set some edges to branch lengths. How do you tell apart a non-connection and a real connection of length zero? These do occur, for example if you have three identical sequences, then you might expect a single node with three children. However IIRC, in (some) NJ trees each node has two children by construction, so you get an extra node connected with a branch of length zero.
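The ambiguity can be reproduced with a small stand-alone sketch. This is not the Bio.Phylo code: plain nested lists stand in for the NumPy array so the example has no dependencies, and the node numbering and branch lengths are made up.

```python
def adjacency_matrix(n_nodes, branches):
    """Build a zero-initialised matrix and fill in branch lengths.

    branches maps (parent_index, child_index) to a branch length.
    """
    matrix = [[0.0] * n_nodes for _ in range(n_nodes)]
    for (parent, child), length in branches.items():
        matrix[parent][child] = length
    return matrix


# Made-up tree: node 0 is the root, and the branch 0 -> 1 has length zero,
# as can happen with the extra internal node in an NJ tree.
branches = {(0, 1): 0.0, (0, 2): 0.5, (1, 3): 0.2}
matrix = adjacency_matrix(4, branches)

# A real zero-length branch and a missing branch look identical.
print(matrix[0][1], matrix[0][3])
```

Here matrix[0][1] is a real branch and matrix[0][3] is no branch at all, yet both cells hold 0.0.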
Peter From eric.talevich at gmail.com Tue Jan 19 10:22:30 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 19 Jan 2010 10:22:30 -0500 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function In-Reply-To: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> Message-ID: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote: > Hi Eric (and everyone else), > > I just spotted the to_adjacency_matrix function in utils: > http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py > > The docstring says: > >> Create an adjacency matrix (NumPy array) from clades/branches in tree. > > >> Also returns a list of all clades in tree ("allclades"), where the position >> of each clade in the list corresponds to a row and column of the numpy >> array. So, a cell i,j in the array represents the length of the branch from >> allclades[i] to allclades[j]. >> >> @return: tuple of (allclades, adjacency_matrix) where allclades is a list >> and adjacency_matrix is a NumPy 2D array. > > It looks like your adjacency matrix starts as a numpy array of zeros, > and then you set some edges to branch lengths. How do you tell > apart a non-connection and a real connection of length zero? These > do occur, for example if you have three identical sequences, then > you might expect a single node with three children. However IIRC, > in (some) NJ trees each node has two children by construction, > so you get an extra node connected with a branch of length zero. Shoot, you're right. I can think of three reasonable mitigations: (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate adjacency -- this seems more standard in textbooks, actually. (b) Issue a warning or raise an error if the given tree contains a 0-length branch. (c) Delete the function. Which do you recommend?
The idea was to give mathematicians something to play with. For example, Chapter 2 of this report represents phylogenies this way, using 0 or 1 to indicate the presence of a branch: http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf Thanks for the heads-up, Eric From biopython at maubp.freeserve.co.uk Tue Jan 19 10:47:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jan 2010 15:47:39 +0000 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function In-Reply-To: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> Message-ID: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com> On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich wrote: > > On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote: >> Hi Eric (and everyone else), >> >> I just spotted the to_adjacency_matrix function in utils: >> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py >> >> The dostring says: >> >>> Create an adjacency matrix (NumPy array) from clades/branches in tree. >> ?> >>> Also returns a list of all clades in tree ("allclades"), where the position >>> of each clade in the list corresponds to a row and column of the numpy >>> array. So, a cell i,j in the array represents the length of the branch from >>> allclades[i] to allclades[j]. >>> >>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list >>> and adjacency_matrix is a NumPy 2D array. >> >> It looks like your adjacency matrix starts as a numpy array of zeros, >> and then you sets some edges to branch lengths. How do you tell >> apart a non-connection and a real connection of length zero? These >> do occur, for example if you have three identical sequences, then >> you might expect a single node with three children. 
However IIRC, >> in (some) NJ trees each node has two children by construction, >> so you get an extra node connected with a branch of length zero. > > Shoot, you're right. I can think of three reasonable mitigations: > (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate > adjacency -- this seems more standard in textbooks, actually. > (b) Issue a warning or raise an error if the given tree contains a > 0-length branch. > (c) Delete the function. > > Which do you recommend? > > The idea was to give mathematicians something to play with. For > example, Chapter 2 of this report represents phylogenies this way, > using 0 or 1 to indicate the presence of a branch: > http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf > > Thanks for the heads-up, > Eric I did wonder about further options, (d) Since the distances are floats, we can use a NA as a flag for no connection. However, this does not seem very useful. (e) Collapse nodes separated by a zero length branch while building the adjacency matrix. Or, raise an error (b) but provide a tree method to collapse nodes separated by a zero length branch which could be called to "clean up" a problematic tree before making the adjacency matrix. None of these options seem ideal :( I would say the boolean matrix (a) is safe but is of limited utility. Therefore (c), remove the function for now is probably best. It can always be re-added in a later release if a good solution is agreed. Peter P.S. Another potentially interesting thing would be a matrix using the bootstrap support values (where again you have a problem with zero bootstrap support vs no connection). I'm not sure if this has any practical uses though. 
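
Options (a) and (d) can be sketched in plain Python. This is only an illustration using nested lists, not the Bio.Phylo code: the function name adjacency_from_edges, the (i, j, length) edge tuples and the five-node sample tree are all made up for the example. NaN marks "no branch", so a real zero-length branch stays distinguishable from a missing one.

```python
import math


def adjacency_from_edges(n_nodes, edges):
    """Build a symmetric n x n matrix of branch lengths (hypothetical helper).

    edges is an iterable of (i, j, branch_length) tuples. Cells without a
    branch hold NaN rather than 0, so a zero-length branch is not mistaken
    for a missing connection.
    """
    matrix = [[float("nan")] * n_nodes for _ in range(n_nodes)]
    for i, j, length in edges:
        matrix[i][j] = length
        matrix[j][i] = length
    return matrix


# Made-up tree: node 0 is the root, node 1 an internal node attached by a
# zero-length branch (as can happen in a strictly bifurcating NJ tree).
edges = [(0, 1, 0.0), (0, 2, 0.1), (1, 3, 0.2), (1, 4, 0.2)]
lengths = adjacency_from_edges(5, edges)

# Option (a): a boolean matrix derived from the same data.
connected = [[not math.isnan(cell) for cell in row] for row in lengths]
```

The boolean matrix throws the branch lengths away, which is why it is safe but of limited use; the NaN version keeps them, at the cost of every consumer having to test for NaN explicitly.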
From eric.talevich at gmail.com Tue Jan 19 23:08:16 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 19 Jan 2010 23:08:16 -0500 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function In-Reply-To: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com> References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com> Message-ID: <3f6baf361001192008y244912aaieb7c8d2c0399903e@mail.gmail.com> On Tue, Jan 19, 2010 at 10:47 AM, Peter wrote: > On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich wrote: >> >> On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote: >>> Hi Eric (and everyone else), >>> >>> I just spotted the to_adjacency_matrix function in utils: >>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py >>> >>> It looks like your adjacency matrix starts as a numpy array of zeros, >>> and then you sets some edges to branch lengths. How do you tell >>> apart a non-connection and a real connection of length zero? >> >> Shoot, you're right. I can think of three reasonable mitigations: >> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate >> adjacency -- this seems more standard in textbooks, actually. >> (b) Issue a warning or raise an error if the given tree contains a >> 0-length branch. >> (c) Delete the function. >> >> Which do you recommend? >> .... > > I did wonder about further options, > > (d) Since the distances are floats, we can use a NA as > a flag for no connection. However, this does not seem > very useful. Or infinity -- I think that's reasonably common in graph algorithms that use a matrix representation. Anyway, I commented it out for now. The main problem is that I don't have a clear use case for the function at the moment, just a notion that it could be useful for some novel statistical analysis or possibly rooting an unrooted tree based on a molecular clock. 
I'll look at other libraries to see how they use adjacency matrices, if at all.

> (e) Collapse nodes separated by a zero length branch
> while building the adjacency matrix.
>
> Or, raise an error (b) but provide a tree method to collapse
> nodes separated by a zero length branch which could be
> called to "clean up" a problematic tree before making the
> adjacency matrix.

Should be easy enough for the user to do manually:

    for clade in tree.find_clades(branch_length=0):
        tree.collapse(clade)

I'm going to do some serious work on the wiki documentation soon so this sort of operation should be fairly apparent to users.

> P.S. Another potentially interesting thing would be a matrix using
> the bootstrap support values (where again you have a problem
> with zero bootstrap support vs no connection). I'm not sure if this
> has any practical uses though.

Well, the commented-out code is still visible if any brave scientist is interested in modifying it for this purpose.

I'm reading Joe Felsenstein's book right now, so I'll probably get the urge to add more mathy toys to Bio.Phylo soon. I'll check with the list before committing them to the trunk, though. ;)

From p.j.a.cock at googlemail.com Wed Jan 20 11:16:58 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 16:16:58 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
Message-ID: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>

On Fri, Jan 15, 2010 at 11:08 AM, Peter Cock wrote:
>> Anyhow I think it is better to discuss this when the unit test
>> 'swiss'VS'uniprot' is ready.
>
> +1, good plan.

Something I should have mentioned earlier (I forgot this wasn't checked in yet) was feature support in the existing "swiss" plain text parser - hopefully we can get that working nicely as part of this XML work:

http://bugzilla.open-bio.org/show_bug.cgi?id=2235

Peter

From andrea at biocomp.unibo.it Wed Jan 20 11:57:47 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Wed, 20 Jan 2010 17:57:47 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
Message-ID: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>

>
> Something I should have mentioned earlier (I forgot this wasn't
> checked in yet) was feature support in the existing "swiss" plain
> text parser - hopefully we can get that working nicely as part of
> this XML work:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> Peter
>

I know that the plain text swissprot parser can parse features, but last time I checked these features were not included in SeqRecords generated by Bio.SeqIO. If the two parsers have to report similar results, then the 'swiss' format in Bio.SeqIO must report features too.

I made a few changes to the original parser to map data as close as possible to the plain text parser (available on github).

However, the big issue is going to be the comment field:
- 1 big string in the plain text parser
- several annotation fields in the XML parser.

I think that obtaining the same results is going to be difficult.
It is hard to map the big string to many annotations (very error prone) and is also hard to map many annotations to a single string... Anyhow, unit testing is coming (thanks to Mauro) together with a detailed comparison between the two parsed seqrecords. Andrea From p.j.a.cock at googlemail.com Wed Jan 20 12:14:18 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 20 Jan 2010 17:14:18 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it> References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com> <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it> Message-ID: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com> On Wed, Jan 20, 2010 at 4:57 PM, Andrea Pierleoni wrote: >> >> Something I should have mentioned earlier (I forgot this wasn't >> checked in yet) was feature support in the existing "swiss" plain >> text parser - hopefully we can get that working nicely as part of >> this XML work: >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2235 >> >> Peter >> > > I know that the plain text swissprot parser can parse features, but > last time I checked these features were not included in SeqRecords > generated by Bio.SeqIO. > If the two parsers have to report similar results, than the 'swiss' > format in Bio.SeqIO must reports features too. Yes, there is an old patch on Bug 2235 to do this: http://bugzilla.open-bio.org/show_bug.cgi?id=2235 > I made a few changes to the original parser to map data as close as > possible to the plain text parser (available on github). 
>
> However the big issue is going to be the comment field:
> - 1 big string in the plain text parser
> - several annotation fields in the XML parser.
>
> I think that obtaining the same results is going to be difficult.
> It is hard to map the big string to many annotations (very error prone)
> and is also hard to map many annotations to a single string...
>
> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
> comparison between the two parsed seqrecords.

Great.

Peter

From andrea at biocomp.unibo.it Thu Jan 21 07:01:30 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 21 Jan 2010 13:01:30 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com> <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it> <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
Message-ID: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>

>> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>

As mentioned earlier, Mauro did a code review and added unit tests for the parser in Tests/test_Uniprot.py. The updated version is available on the github repository:
http://github.com/apierleoni/biopython

Since this version is mature enough, I spent some time comparing the output of this UniProt XML (UP) parser and the SwissProt (SP) plain text parser. This comparison was done using the Q13639 UniProt entry.
These are the main differences between the two generated SeqRecords:

- id: is the same (first accession)
- name: is the same
- description: UP reports the recommended name's full name value, while
  additional names and synonyms are in the annotations. SP reports a
  long string containing everything, parsed as is from the plain text.
- dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
  NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
      date_last_annotation_update in UP ---> modified in SP
      date_last_sequence_update in UP ---> sequence_modified in SP
      gene_name_primary in UP ---> gene_name in SP
          >>> SP.annotations['gene_name']
          'Name=HTR4;'
          >>> UP.annotations['gene_name_primary']
          'HTR4'
      ncbi_taxid in SP ---> UP dbxrefs, since it is mapped as a
      dbReference in the XML file
- - references: have some minor differences.
      The final semicolon and double quote are missing in UP for both
      the author and title fields.
      In UP, reference comments are reported as:
      "PublicationType | PublicationDate | Scope | Tissue"
      For the submission publication type, the db is reported in comments
      and not in the journal field.
- - comments: here come the big differences.
      SP has comments in a single string.
      UP comments are mapped to several annotation entries, using the
      comment type and attributes to build the annotation key.
      E.g.
      comment_function --> list of "function" type comment strings
      comment_subcellularlocation_location --> list of "location"
      strings in the subcellularlocation comment field

Comment trees in XML could easily be mapped to a comment dictionary
tree, but this would not be BioSQL safe.
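
The flat key naming described for the UP comments can be sketched roughly like this. It is an illustration only, not the code in the github branch: the helper name flatten_comments and the nested sample data (including the placeholder strings) are assumptions made for the example.

```python
def flatten_comments(tree, prefix="comment"):
    """Flatten a nested dict of comments into flat 'comment_*' keys.

    Hypothetical helper: leaves are lists of strings, inner nodes are dicts.
    Each flat key is built by joining the path with underscores, so the
    result is a simple key -> list-of-strings mapping.
    """
    flat = {}
    for key, value in tree.items():
        name = "%s_%s" % (prefix, key)
        if isinstance(value, dict):
            # Recurse into nested comment attributes, extending the key.
            flat.update(flatten_comments(value, name))
        else:
            flat.setdefault(name, []).extend(value)
    return flat


# Placeholder data, not taken from any real UniProt entry.
nested = {
    "function": ["placeholder function text"],
    "subcellularlocation": {"location": ["placeholder A", "placeholder B"]},
}
annotations = flatten_comments(nested)
```

Keeping every value a flat list of strings is what makes the result storable as plain annotation qualifiers, whereas the nested dictionary would need some serialisation step first.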
Andrea

From biopython at maubp.freeserve.co.uk Thu Jan 21 07:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it. I suggest we continue the discussion on the BioSQL list (for how to serialise structured annotation to BioSQL), and/or the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load into BioSQL the SwissProt DE lines (known as "swiss" format in both BioPerl and Biopython's SeqIO, and by EMBOSS) or the equivalent UniProt XML tags (which we are tentatively going to call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss") files and load them into BioSQL. Biopython currently treats the DE comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a TagTree, and for storing this in BioSQL this gets serialised as XML. I would like Biopython to handle this the same way (although rather than a Perl TagTree, we'd use a Python structure of course), and would appreciate clarification of what exactly was implemented (e.g. which bit of the BioPerl source code should we look at, and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or Open-Bio lists yet) has started work on parsing UniProt XML files for Biopython. Here the DE comment lines are already provided broken up with XML markup. Hopefully their nested structure matches what BioPerl was doing with the SwissProt DE lines.
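
To make the "Python structure serialised as XML" idea concrete, here is a rough sketch using only the standard library. It is not BioPerl's TagTree output and not an agreed BioSQL convention: the element names, the nested dict layout and both helper functions are assumptions for illustration, and the sample values are placeholders.

```python
import xml.etree.ElementTree as ET


def dict_to_element(tag, value):
    """Turn a nested dict into an ElementTree element (hypothetical helper)."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(dict_to_element(child_tag, child_value))
    else:
        element.text = str(value)
    return element


def element_to_dict(element):
    """Inverse of dict_to_element for trees without repeated sibling tags."""
    children = list(element)
    if not children:
        return element.text
    return {child.tag: element_to_dict(child) for child in children}


# Placeholder description, loosely shaped like structured DE line content.
description = {
    "RecName": {"Full": "Example protein", "Short": "EP"},
    "AltName": {"Full": "Another example name"},
}

# Serialise to a string that could be stored in a single text column.
xml_text = ET.tostring(dict_to_element("description", description),
                       encoding="unicode")

# Parse it back into the nested Python structure.
restored = element_to_dict(ET.fromstring(xml_text))
```

One limitation of this sketch is that a plain dict cannot hold repeated sibling tags (for instance several alternative names), so a real implementation would need lists or a dedicated tree class at that point.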
Regards, Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 21 08:13:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 08:13:09 -0500 Subject: [Biopython-dev] [Bug 2997] New: Ignore comments in SCOP parsable files Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2997 Summary: Ignore comments in SCOP parsable files Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: 2008 at thomas-holder.de I could not load SCOP parsable files with Bio.SCOP unless I removed the comment lines. The parser should just skip these lines. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 21 08:14:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 08:14:59 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001211314.o0LDExim005529@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #1 from 2008 at thomas-holder.de 2010-01-21 08:14 EST ------- Created an attachment (id=1432) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1432&action=view) patch to skip comment lines in SCOP parsable files -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
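
For readers following the bug, the requested behaviour amounts to something like the sketch below. This is not the attached patch and not the Bio.SCOP code: it assumes comment lines start with "#", and the function name iter_data_lines and the sample lines are made up for illustration.

```python
def iter_data_lines(lines):
    """Yield non-empty data lines, skipping '#' comment lines.

    Hypothetical helper; assumes comment lines begin with '#'.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line, ignore it
        line = line.rstrip("\n")
        if line:
            yield line


# Made-up input: comment header lines followed by tab-separated records.
sample = [
    "# example header line\n",
    "# another comment line\n",
    "1\tfield_a\tfield_b\n",
    "2\tfield_c\tfield_d\n",
]
records = [line.split("\t") for line in iter_data_lines(sample)]
```

Filtering in one small generator keeps the record parsing itself unchanged, since it only ever sees data lines.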
From mauro at biodec.com Thu Jan 21 15:09:28 2010 From: mauro at biodec.com (Mauro) Date: Thu, 21 Jan 2010 21:09:28 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it> References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com> <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it> <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com> <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it> Message-ID: <4B58B478.4000703@biodec.com> On 01/21/2010 01:01 PM, Andrea Pierleoni wrote: > >>> Anyhow, unit testing is coming (thanks to Mauro) together with a >>> detailed >>> comparison between the two parsed seqrecords. >> >> Great. >> >> Peter >> > > > As mentioned earlier, Mauro did a code review and added unit test for the > parser in Tests/test_Uniprot.py > the updated version is available on the github repository: > http://github.com/apierleoni/biopython > > Since this version is mature enough I sepnt some time comparing the input > from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser. > This comparison was done using the Q13639 UniProt entry. I made also a test for this case. Currently the test fails, you can see the report made by Andrea below. If we agree with differences between the seqrecord, I do the work to change the test. Mauro. > > This are the main differences between the two generated SeqRecords: > > - id: is the same (first accession) > - name: is the same > - description: UP reports the the recommended name , full name value, while > additional names and synonyms are in the annotations. 
SP reports a > long string containing everything parsed as it is form the plain > text. > - dbxrefs: UP reports all the dbxref of SP, adding DOI, MEDLINE, PubMed, > NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs > - seq: is the same > - features: missing in SP (I have to check with the Peter's patch) > - annotations: > - - identical annotations: accessions, keywords, taxonomy, organism > - - mapped annotations: > date_last_annotation_update in UP---> modified in SP > date_last_sequence_update in UP---> sequence_modified in SP > gene_name_primary in UP---> gene_name in SP > >>> SP.annotations['gene_name'] > 'Name=HTR4;' > >>> UP.annotations['gene_name_primary'] > 'HTR4' > ncbi_taxid in SP ---> UP dbxrefs since it is mapped as a > dbReference in the xmlfile > - - references: has some minor differences. > Final semicolon and double quote missing in UP for both author > and title fields. > In UP reference comments are reported as: > "PublicationType | PublicationDate | Scope | Tissue" > For submission publication type the db is reported in comments > and not in journal field. > - - comments: here comes the big differences. > SP has comments are on a single string. > UP comments are mapped to seceral annotation entries using comment > type and attributes to build the annotation key. > Eg. > comment_function --> list of "function" type comment strings > comment_subcellularlocation_location --> list of "location" > strings in the subcellularlocation comment field > > Comments tree in XML would be easily mapped to a comment dictionary > tree, but this would not be BioSQL safe. 
> > > Andrea > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Thu Jan 21 18:58:29 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 18:58:29 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001212358.o0LNwTIB022421@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2010-01-21 18:58 EST ------- Can you give an example of a SCOP file that contains such comment lines? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 22 03:42:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 03:42:28 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001220842.o0M8gSDv003709@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #3 from 2008 at thomas-holder.de 2010-01-22 03:42 EST ------- (In reply to comment #2) > Can you give an example of a SCOP file that contains such comment lines? I want to parse these files: http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.hie.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.com.scop.txt_1.75 They all start with 4 comment lines (release and copyright information). 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 22 06:08:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 06:08:34 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001221108.o0MB8YkZ008581@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2010-01-22 06:08 EST ------- Applied your patch; thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Fri Jan 22 07:18:32 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET) Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com> References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com> Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it> I think that the point here can be a little broader, since not only the swissprot DE lines carry complex and structured data. To define a common, language-independent way to store structured data into the comment and *_qualifier_value tables of the actual BioSQL schema could be very useful. XML looks like a good candidate to me, and the UniprotXML format can be used as reference or as a template to start from. 
Each Bio* project will then parse and report this structured data in its own programming language data structure. Andrea > Hi all, > > This is cross posted to try and ensure relevant people see it. > I suggest we continue the discussion on the BioSQL list > (for how to serialise structured annotation to BioSQL), and/or > the OpenBio list (for things like file format naming conventions). > > I am hoping we (Bio*) can be consistent in how we parse and load > into BioSQL the SwissProt DE lines (known as "swiss" format in > both BioPerl and Biopython's SeqIO, and by EMBOSS) or the > equivalent UniProt XML tags (which we are tentatively going to > call the "uniprot" format in Biopython's SeqIO - comments?). > > Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss") > files and load them into BioSQL. Biopython currently treats the DE > comment lines as a long string, as BioPerl used to: > > http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html > http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html > > I understand that BioPerl now turns the SwissProt DE lines into a > TagTree, and for storing this in BioSQL this gets serialised as XML. > I would like Biopython to handle this the same way (although rather > than a Perl TagTree, we'd use a Python structure of course), and > would appreciate clarification of what exactly was implemented > (e.g. which bit of the BioPerl source code should be look at, > and could you show a worked example?). > > Andrea Pierlenoin (CC'd - not sure if he is on the BioSQL or > Open-Bio lists yet) has started work on parsing UniProt XML > files for Biopython. Here the DE comment lines are already > provided broken up with XML markup. Hopefully their nested > structure matches what BioPerl was doing with the SwissProt > DE lines. 
> > Regards, > > Peter > From bugzilla-daemon at portal.open-bio.org Fri Jan 22 13:43:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 13:43:19 -0500 Subject: [Biopython-dev] [Bug 2998] New: mac error during build in 10.6.1 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2998 Summary: mac error during build in 10.6.1 Product: Biopython Version: 1.53 Platform: PC OS/Version: Mac OS Status: NEW Severity: major Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: emeryl at uw.edu When I download the file biopython-1.53.tar.gz, uncompress it, and run python setup.py build I get an error saying gcc4.0 failed with exit code 1, among many lines of errors. Looking more closely, it appears the build process is trying to use an older version of the SDK, which is not installed by Xcode tools by default. It is trying to use /Developer/SDKs/MacOSX10.4u.sdk. On a clean install of 10.6.1 (Snow Leopard) only the SDKs for 10.5 and 10.6 are installed by the Xcode tools installer without changing options. When I reinstall the Xcode tools and this time check a box to install 10.4 support, this 10.4 sdk is installed and the build works flawlessly. This would be a difficult fix to track down for many casual users of BioPython who do not understand the Xcode tools. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jan 22 14:15:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 14:15:59 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001221915.o0MJFxoa024953@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|major |normal Summary|mac error during build in |Document need XCode with |10.6.1 |10.4 SDK for Mac OS ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-22 14:15 EST ------- Snow Leopard has caused all sorts of trouble for compiling Python extensions (this is not specific to Biopython). This has been discussed on our mailing list, and simply installing the Mac OS 10.4 SDK option with XCode seems to be the best solution. I've just updated the download page to try and clarify this. Is that better? This is a wiki page so you can edit it: http://biopython.org/wiki/Download I'm leaving this bug open to remind us to add a similar note to the main installation document: http://github.com/biopython/biopython/blob/master/Doc/install/Installation.tex Do you have any other suggestions? Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jan 22 15:36:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 15:36:36 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001222036.o0MKaaZ4027368@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 ------- Comment #2 from emeryl at uw.edu 2010-01-22 15:36 EST ------- (In reply to comment #1) That's a good solution, but I added this small clarification also : You will need to have installed Apple's XCode tools including the optional 10.4 SDK (check the option for 10.4 support when installing Xcode tools). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 05:56:32 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 05:56:32 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001251056.o0PAuWDI010933@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-25 05:56 EST ------- (In reply to comment #2) > (In reply to comment #1) > > That's a good solution, but I added this small clarification also : > > You will need to have installed Apple's XCode tools including the optional 10.4 > SDK (check the option for 10.4 support when installing Xcode tools). 
> Thanks - I've now updated the main installation document in our repository (which we'll use to update the install PDF and HTML at the next release). Marking bug as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:16:27 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:16:27 -0500 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <201001260116.o0Q1GR1c002063@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:17:41 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:17:41 -0500 Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular In-Reply-To: Message-ID: <201001260117.o0Q1Hfdb002091@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:19:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:19:47 -0500 Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects In-Reply-To: Message-ID: <201001260119.o0Q1JlhK002189@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2597 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:27:14 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:27:14 -0500 Subject: [Biopython-dev] [Bug 2999] New: SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2999 Summary: SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz I do not know where is the problem coming from but if I parse a GenBank file with lowercased sequence (EST) and get it printed back through record.format("genbank") I receive all in uppercase. I think the upper/lower-casing should never be altered unless explicitly requested by the user. 
for _record in SeqIO.parse(_infile, options.format):
    # silly, imagine I hit "gi|14150838|gb|AAK54648.1|AF376133_1" from
    # a FASTA file :(
    if _record.id in _ids:
        _outfile.write(_record.format("fasta"))
    elif options.format == "genbank":
        if _record.annotations['gi'] in _ids:
            _outfile.write(_record.format("genbank"))
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:44:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:44:28 -0500 Subject: [Biopython-dev] [Bug 3000] New: Could SeqIO.parse() store the whole, unparsed multiline entry? Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3000 Summary: Could SeqIO.parse() store the whole, unparsed multiline entry? Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz Taking into account that the genbank file-format writing is not yet complete, I wonder whether you would optionally allow keeping, along with each parsed record, its unparsed multi-line representation. For example, I use biopython to filter out certain records from a fasta/genbank file by accession, gi, tissue (well, the last I haven't done yet;)). I do not change the format, I just ignore certain entries. I did not understand the Tutorial ("5.4.3 Getting your SeqRecord objects as formatted strings") well, but I iterate over the records and, once I have the record I want, to be on the safe side I would like to do record._print_original_blob() and get e.g. LOCUS .... ... // I do not have the record_iterator so cannot use the proposed out_handle.write(record.format("genbank")) approach.
Still, I suspect this will reformat the entry (currently I see trailing dot removed from KEYWORDS, no REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being re-ordered). I foresee this to depend on an optional argument to SeqIO.parse() specifying that a user wants to keep this in memory and merely that he/she understands this is probably not much useful for large chromosomes, etc. Similarly, I think until parsing/writing e.g. TITLE is fully available why couldn't you just store the whole multi-line thing in some variable? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 20:47:27 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:47:27 -0500 Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal In-Reply-To: Message-ID: <201001260147.o0Q1lRVk002782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2601 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jan 26 08:03:42 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 Jan 2010 08:03:42 -0500 Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or In-Reply-To: Message-ID: <201001261303.o0QD3gN8019546@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2999 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-26 08:03 EST ------- In many file formats (e.g. FASTA) mixed case is allowed and useful. The sequence in a GenBank file is (by convention) always lower case, but for historical reasons Biopython converts this to upper case on parsing (not sure why, but changing it would risk breaking existing scripts). However, I think we should convert to lower case on writing GenBank output. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 08:15:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 Jan 2010 08:15:38 -0500 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: Message-ID: <201001261315.o0QDFc4f020030@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3000 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-26 08:15 EST ------- (In reply to comment #0) > Taking into account the genbank file-format writing is not yet complete I > wonder whether you would allow to keep optionally along each parsed record > it's unparsed multi-line representation. You can probably do it already with the old Bio.GenBank iterator object (I think you use no parser object to get the raw text). 
Adding this to Bio.SeqIO doesn't seem a wonderful idea. The whole approach only makes sense for sequential file formats with no header (like FASTA, GenBank, EMBL, SwissProt) but not interlaced files (most alignments) or those with headers or XML formats. It also breaks completely the moment the user makes any modification to the SeqRecord object - and handling that cleanly would be tricky. > Still, I suspect this will > reformat the entry (currently I see trailing dot removed from KEYWORDS, no > REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being > re-ordered). Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly) different output. We do not guarantee a 100% round trip (even on simpler formats like FASTA). Even little things like line wrapping would make this very difficult. Regarding GenBank KEYWORDS, please file a bug. Regarding GenBank reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED) this is still covered by existing Bug 2294. Regarding GenBank source feature, please file a bug. > Similarly, I think until parsing/writing e.g. TITLE is fully available why > couldn't you just store the whole multi-line thing in some variable? The remaining unsupported bits of the ID line are covered by existing Bug 2294 and Bug 2578. Regarding the reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED) this is still covered by existing Bug 2294. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
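For the filtering use case in Bug 3000, one possible way to keep the original text of selected entries without going through a parse/write round trip is to split the file on record boundaries and never reformat anything. The following is only a rough sketch under stated assumptions: it is not Bio.SeqIO or Bio.GenBank code, the function names (iter_raw_genbank, locus_name, filter_raw) are made up for illustration, and it assumes every entry starts at a line beginning with LOCUS and ends at a // line. As noted above, this only makes sense for sequential formats.

```python
import io


def iter_raw_genbank(handle):
    """Yield each GenBank entry as the exact text read from the handle.

    Hypothetical helper: an entry is assumed to start at a line beginning
    with "LOCUS" and to end at the "//" terminator line. A trailing entry
    with no "//" line is silently dropped.
    """
    lines = []
    for line in handle:
        if not lines and not line.startswith("LOCUS"):
            continue  # ignore anything outside a LOCUS ... // block
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


def locus_name(raw_entry):
    """Return the second whitespace-separated field of the LOCUS line."""
    return raw_entry.split(None, 2)[1]


def filter_raw(handle, wanted):
    """Return the untouched text of entries whose LOCUS name is in wanted."""
    return [raw for raw in iter_raw_genbank(handle) if locus_name(raw) in wanted]


# Made-up two-entry example, just to exercise the functions above.
example = (
    "LOCUS       AB000001     12 bp    mRNA    linear   EST 01-JAN-2010\n"
    "KEYWORDS    EST.\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
    "LOCUS       AB000002     12 bp    mRNA    linear   EST 01-JAN-2010\n"
    "KEYWORDS    EST.\n"
    "ORIGIN\n"
    "        1 ttttacgtac gt\n"
    "//\n"
)
kept = filter_raw(io.StringIO(example), {"AB000002"})
```

Because the text is passed through unchanged, details such as the trailing dot on KEYWORDS and the lower-case sequence stay exactly as they were in the input. The price is that none of the SeqRecord annotation is available for filtering, only what can be read directly from the raw text.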
From jblanca at btc.upv.es Tue Jan 26 09:02:59 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 26 Jan 2010 15:02:59 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> Message-ID: <201001261502.59237.jblanca@btc.upv.es> Hi: I'm doing a pipeline to annotate sequences. I'm writing modules that add SeqFeatures and annotations to the sequences. Right now I'm storing the result as repr for the SeqRecords, but I would like to write gff files at the end. I've read the discussion regarding Brad's code and I've found it very interesting. I need to write those gff files so I could use Brad's code or my own, but it would be great if I could contribute to Biopython at the same time. At the moment I don't think there is a consensus about what a SeqFeature should represent and how. I think Peter made a proposal about adding parent and children properties; is this a good way to solve the problem? Best regards, -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jan 26 09:59:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 26 Jan 2010 14:59:35 +0000 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <201001261502.59237.jblanca@btc.upv.es> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> <201001261502.59237.jblanca@btc.upv.es> Message-ID: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> Hi Jose, On Tue, Jan 26, 2010 at 2:02 PM, Jose Blanca wrote: > Hi: > > I'm doing a pipeline to annotate sequences. I'm writing modules that add > SeqFeatures and annotations to the sequences. I've done a little of that too - but with GenBank files as the output. > Right now I'm storing the result as repr for the SeqRecords, but I would like > to write gff files at the end. I've read the discussion regarding Brad's code > and I've found it very interesting. > I need to write those gff files so I could use Brad's code or my own, but it > would be great if I could contribute to Biopython at the same time. > At the moment I don't think there is a consensus about what a SeqFeature should > represent and how. I think Peter made a proposal about adding parent and > children properties; is this a good way to solve the problem? > Best regards, Brad's code is using the SeqFeature differently to existing bits of Biopython, and adding a separate child/parent mechanism for the kind of usage required for GFF(3) looks like one way forward allowing us to keep full backward compatibility.
I'm actually going to see Brad in person next month at a workshop, and I'm hoping we can squeeze in a little in person debate on this then (assuming we don't settle it here on the mailing list first of course). Regards, Peter From dalloliogm at gmail.com Tue Jan 26 10:09:39 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 26 Jan 2010 16:09:39 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100118142010.GE48842@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> Message-ID: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman wrote: > Hi Nick; > Sorry for the late reply... I also use StackOverflow and I think that it is a great resource, and it would very good if we can become more represented there. At the moment there are a few questions on biopython on SO, but there are so few biopython users that people usually receive few answers and they prefer to ask their questions again in this list. I have answer to some questions tagged as 'bioinformatics' there, but lately I have not been using SO very much, and moreover the field of bioinformatics is so broad that sometimes it is very difficult to answer a technical question. > > Here is a link to various other Stack Exchange sites: > > > http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family > > Very interesting, thanks! I didn't know you could make Stack-Exchange websites so easily. How did you do that? Is there a free software behind, or do you have to pay some service provider? > It looks like there are a couple of Stack Exchange sites with > similar aims for open source bioinformatics and chemistry: > > http://biostar.stackexchange.com/ > http://blueobelisk.stackexchange.com/ > I agree, maybe it would be useful to collaborate with these websites. 
StackOverflow is great for programming-related questions; however, you can't use it to ask something which is not completely related, like the protocol for an experiment or which databases to use for an analysis. -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From dalloliogm at gmail.com Wed Jan 27 03:56:09 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 27 Jan 2010 09:56:09 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> Message-ID: <5aa3b3571001270056l5ae5bd76g1a70890c94fd430b@mail.gmail.com> On Tue, Jan 26, 2010 at 4:09 PM, Giovanni Marco Dall'Olio < dalloliogm at gmail.com> wrote: > > > > On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman wrote: > >> Hi Nick; >> > > Sorry for the late reply... I also use StackOverflow and I think that it is > a great resource, and it would very good if we can become more represented > there. > By the way, it is possible to get feeds for questions on StackOverflow. For example, this is the feed for the questions tagged 'biopython': - http://stackoverflow.com/feeds/tag/biopython We could add this rss to the biopython's friendfeed or twitter page (I barely know what I am talking about here), or to the blog/wiki/etc. Maybe there is also a way to notify this mailing list of the questions asked there. 
-- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From chapmanb at 50mail.com Wed Jan 27 08:33:22 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 27 Jan 2010 08:33:22 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> Message-ID: <20100127133322.GV83316@sobchak.mgh.harvard.edu> Giovanni; Thanks for the feedback on this. We've had a few positive responses and I think it's something that would be low effort to experiment with. I'm open to whether we do this on the main StackOverflow site, Nick's dedicated suggested site, or Blue Obelisk. The main criteria is that we are likely to have the website be freely available (and around) in the future. > Sorry for the late reply... I also use StackOverflow and I think that it is > a great resource, and it would very good if we can become more represented > there. > At the moment there are a few questions on biopython on SO, but there are so > few biopython users that people usually receive few answers and they prefer > to ask their questions again in this list. Yes, that's what we'd be hoping to change. The main thing is that we get folks interested in python bioinformatics programming looking there, and then suggest users ask questions there. The significant benefit is that the presentation of questions and answers gives you a historical resource that is easy to search and browse. > By the way, it is possible to get feeds for questions on StackOverflow. 
> For example, this is the feed for the questions tagged 'biopython': > - http://stackoverflow.com/feeds/tag/biopython > We could add this rss to the biopython's friendfeed or twitter page (I > barely know what I am talking about here), or to the blog/wiki/etc. > Maybe there is also a way to notify this mailing list of the questions asked > there. There are resources we could use to redirect the feed to Twitter: http://twitterfeed.com/ and the mailing list: http://www.feedmyinbox.com/ Agreed that we should do this to increase visibility. Brad From chapmanb at 50mail.com Wed Jan 27 08:41:25 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 27 Jan 2010 08:41:25 -0500 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> <201001261502.59237.jblanca@btc.upv.es> <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> Message-ID: <20100127134125.GW83316@sobchak.mgh.harvard.edu> Jose and Peter; > > Right now I'm storing the result as repr for the SeqRecords, but I would like > > to write gff files at the end. I've read the discussion regarding Brad's code > > and I've found it very interesting. > > I need to write those gff files so couldl use Brad's code or my own, but it > > would be great if I could contribute to Biopython at the same time. Awesome. Please do use my code for output and feel free to fork and make suggestions; I'm happy to integrate changes: http://github.com/chapmanb/bcbb/tree/master/gff > > At the time being I don't think a consensus about what a SeqFeature should > > represent and how. I think Peter made a proposal about adding a parent and > > children properties, is this a good way to solve the problem? 
> > Best regards, > > Brad's code is using the SeqFeature differently to existing bits of > Biopython, and adding a separate child/parent mechanism for the > kind of usage required for GFF(3) looks like one way forward allowing > use to keep full backward compatibility. I'm actually going to see Brad > in person next month at a workshop, and I'm hoping we can squeeze > in a little in person debate on this then (assuming we don't settle it > here on the mailing list first of course). What do you think we need to modify in the GFF parsing code to bring this in line? I'd really like to see this get into Biopython, but am not sure how to clear the blocking issues. If we can put together a list of specifics, I can try and put together time to tackle that. Brad From dalloliogm at gmail.com Wed Jan 27 08:41:24 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 27 Jan 2010 14:41:24 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100127133322.GV83316@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> <20100127133322.GV83316@sobchak.mgh.harvard.edu> Message-ID: <5aa3b3571001270541n2f047fe2qf42911b21e9494d8@mail.gmail.com> On Wed, Jan 27, 2010 at 2:33 PM, Brad Chapman wrote: > Giovanni; > Thanks for the feedback on this. We've had a few positive responses > and I think it's something that would be low effort to experiment with. > I'm open to whether we do this on the main StackOverflow site, > Nick's dedicated suggested site, or Blue Obelisk. The main criteria > is that we are likely to have the website be freely available (and > around) in the future. > Thanks to you for the proposal.. 
> There are resources we could use to redirect the feed to Twitter: > > http://twitterfeed.com/ > > and the mailing list: > > http://www.feedmyinbox.com/ > So, what if we use this to automatically send a notification to the biopython mailing list? The amount of traffic increased would be low, in the last three months there have only been 3 messages on biopython in StackOverflow. With an automatical notification, these questions may receive an answer a lot more quickly. When the traffic on StackOverflow grows too much, we can just inactivate the forwarding so it won't disturb the mailing list. > Agreed that we should do this to increase visibility. > > Brad > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From chapmanb at 50mail.com Thu Jan 28 15:35:05 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 28 Jan 2010 15:35:05 -0500 Subject: [Biopython-dev] OpenBio solution challenge: Project updates at BOSC 2010 Message-ID: <20100128203505.GG40046@sobchak.mgh.harvard.edu> Hello all; The BOSC 2010 organizing committee is hard at work getting prepared for this July's meeting in Boston: http://www.open-bio.org/wiki/BOSC_2010 One of the items we've traditionally had at the conference is a project update from each of the OpenBio affiliated groups. This year, we're thinking about organizing these talks around a central theme: the OpenBio solution challenge. We start with a biological question of general interest, and each of the project talks would focus around how you would solve that problem using your toolkit and programming language. This is meant to provide a challenge for OpenBio contributors, a nice tutorial style overview of various projects and approaches for other programmers, and a fun opportunity to compete and learn from other projects. 
Conference attendees will vote on their favorite solution, with the winner receiving fame and fortune (warning: fortune not guaranteed). For this to be successful, it of course requires interest and enthusiasm from y'all fine folks involved with the projects. Specifically: - Is there interest from your group in participating in the challenge? You'll want at least a few people to work on it, and someone to give a presentation at BOSC. - Do you have suggestions on a good theme or specific biological problem to tackle? We'll hope to pick something in a sweet spot that is challenging enough to be of interest, yet reasonable for presentation and preparation. Let's discuss ideas and get this together. Since the schedule for BOSC is developing rapidly, please give us an idea if you're interested by February 12th, and copy responses to the BOSC mailing list as a central place for discussion. bosc at open-bio.org Thanks, Brad, Michael, and the BOSC organizing committee From biopython at maubp.freeserve.co.uk Fri Jan 29 05:36:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Jan 2010 10:36:40 +0000 Subject: [Biopython-dev] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010 In-Reply-To: References: <20100128203505.GG40046@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com> Hi all, This is a great topic but should we continue it on just the one mailing list? Is there a suitable BOSC list, or how about the general Open Bio list? On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote: > > Brad, this sounds exciting! > > One thing strikes me, though - by asking for the sub-projects to propose > the "grand challenge" themselves the one thing you can guarantee is that > the "grand challenge" is solvable (or more likely, already solved!)
> > Other "grand challenge" kinds of meetings have an independent third party > pose the problem that has to be solved, and then all groups work toward a > solution and compare their results. This would, IMO, be more revealing of > the "state of the art" in each Open-Bio project, and point out where the > weaknesses are that we should be focusing on... Someone (for example, > you!) could act as the moderator to ensure that the "grand challenge" was > at least a reasonable one, within the scope of what an Open-Bio project > *should* be able to solve... > > Just my CAD $0.02 > > Mark One possible problem with having Brad act as moderator is his ties to Biopython (plus it would be a shame if we'd be one man down for trying to solve the challenges - grin). Having a project representative "sign off" on the challenge might work - or simply the whole of the BOSC committee which is quite balanced. Alternatively some kind of panel of challenges does seem a good way to reduce individual project bias (as suggested by Scooter), but there will still need to be a judging committee. I'm curious what kind of challenges the BOSC committee had in mind - would something like taking a newly sequenced bacterium and producing an automated annotation as a GenBank, EMBL, or GFF file be too ambitious for example? There are already several major projects to do this e.g.
RAST http://rast.nmpdr.org/ Peter (@Biopython) From bugzilla-daemon at portal.open-bio.org Sun Jan 31 15:30:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 31 Jan 2010 15:30:45 -0500 Subject: [Biopython-dev] [Bug 3004] New: Contribute PSL alignment format to biopython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3004 Summary: Contribute PSL alignment format to biopython Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: forgetta at gmail.com Hi Bio-pythonistas, I am interested in contributing code to biopython. I have developed a class to represent PSL output from the BLAT alignment program. I would like to contribute it to the AlignIO module. I have read through and agree to the guidelines stipulated on http://biopython.org/wiki/Contributing. I have never written unit tests before, but I am willing to learn. Thanks. Vince -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Jan 31 17:24:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 31 Jan 2010 17:24:53 -0500 Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in Bio.AlignIO In-Reply-To: Message-ID: <201001312224.o0VMOrha006787@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3004 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Contribute PSL alignment |PSL alignment format parsing |format to biopython |in Bio.AlignIO ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-31 17:24 EST ------- Hi Vince, This sounds interesting - I've been using BLAT's plain text BLAST output format with Biopython up until now. Have you ever used github? That would be one way to share your code. Or, just attach diff files, Python files, and example BLAT files to this bug. If you haven't already done so, signing up to our development mailing list would be a good idea. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
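To make the PSL discussion a bit more concrete, here is a minimal sketch of what reading one PSL alignment line could look like. To be clear, this is not Vince's code and not anything in Bio.AlignIO; the names PslRecord, parse_psl_line and iter_psl are hypothetical. It assumes the usual 21 tab-separated PSL columns, that the three block columns are comma-separated lists with a trailing comma, and that header lines can be recognised because their first column is not a number. The example line is made up.

```python
from collections import namedtuple

# The 21 PSL columns, in file order.
PSL_FIELDS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
)
PslRecord = namedtuple("PslRecord", PSL_FIELDS)

_TEXT_FIELDS = {"strand", "qName", "tName"}
_LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Turn one tab-separated PSL alignment line into a PslRecord."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(PSL_FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(PSL_FIELDS), len(parts)))
    values = []
    for name, text in zip(PSL_FIELDS, parts):
        if name in _LIST_FIELDS:
            # "10,20," -> [10, 20]; the empty piece after the last comma is skipped
            values.append([int(x) for x in text.split(",") if x])
        elif name in _TEXT_FIELDS:
            values.append(text)
        else:
            values.append(int(text))
    return PslRecord(*values)


def iter_psl(handle):
    """Yield a PslRecord per alignment line, skipping header and blank lines."""
    for line in handle:
        first = line.split("\t", 1)[0].strip()
        if first.isdigit():
            yield parse_psl_line(line)


# Made-up alignment: a 30 bp query in two blocks with a 10 bp gap in the target.
line = "\t".join([
    "30", "0", "0", "0", "0", "0", "1", "10", "+",
    "read1", "30", "0", "30", "chr1", "1000", "100", "140",
    "2", "10,20,", "0,10,", "100,120,",
]) + "\n"
rec = parse_psl_line(line)
```

A flat record like this says nothing yet about how the blocks would map onto an alignment object in Bio.AlignIO; that is a separate design question for the actual contribution.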
From bugzilla-daemon at portal.open-bio.org Sun Jan 31 19:21:51 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 31 Jan 2010 19:21:51 -0500 Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in Bio.AlignIO In-Reply-To: Message-ID: <201002010021.o110Lp9e009311@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3004 ------- Comment #2 from forgetta at gmail.com 2010-01-31 19:21 EST ------- Now on github: http://github.com/vforget/PyBLATPSL Vince -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Jan 4 13:16:31 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 4 Jan 2010 08:16:31 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> Message-ID: <20100104131631.GG80812@sobchak.mgh.harvard.edu> Hey Eric; Happy New Year -- thanks for all the work on TreeIO. This sounds great and looking forward to getting it in the main trunk. I'd like to hear Peter's and other's thoughts, but just a few small comments below. > The tree annotations (e.g. id) aren't preserved perfectly during conversions > -- I'll keep working on this, but I don't think it's a blocker. The taxon > names of terminal nodes are kept as "clade" names in phyloXML for > round-tripping. Tree topology and branch lengths seem OK. Are the annotations often used in real life cases or is this more of a fringe problem? I'm not as familiar with tree work, but know this is a pain in sequence space. A good goal is to capture the most common use cases and then integrate the other issues as feasible. 
> Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an > incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O). > This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees, > as I imagine it: > (1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where > reasonable (since the node IDs and adjacency list lookup are no longer > needed) > (2) Implement methods in Bio.Tree.Newick with the original argument lists, > but triggering a deprecation warning indicating the newer replacement method > (3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more > shims to duplicate the original API -- so test_Nexus.py should still pass, > ideally (with deprecation warnings) > (4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of > NexusIO and Bio.Tree methods. > (5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick. > > I'm currently doing (1) and (2), with more emphasis on getting (1) right. > Not all of the important methods have been ported, but I'm happy with the > tree traversal methods. Nice. This all sounds like a really good refactoring. It sounds like 1 can happen once this all gets merged with the main branch, and could benefit from others being able to more easily look at it and make suggestions. > I noticed that in Tests/Nexus/, the example file for internal node labels is > actually in Newick/NH format, not Nexus. That was briefly confusing, so > maybe that file should be renamed. Oops, I think that may have been me. No problem, rename away. 
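
To make step (2) of the quoted plan concrete, here is a minimal sketch of a deprecation shim. It is not the actual Bio.Tree.Newick code: the class names Tree and NewickTree and the methods get_terminals and get_taxa are placeholders chosen for illustration, on the assumption that the old method accepted a node ID that the new one no longer needs.

```python
import warnings


class Tree:
    """Hypothetical stand-in for the new-style tree class."""

    def __init__(self, terminals):
        self._terminals = list(terminals)

    def get_terminals(self):
        """New-style method: no node IDs or adjacency lookups required."""
        return list(self._terminals)


class NewickTree(Tree):
    """Hypothetical shim class that keeps an old-style method alive."""

    def get_taxa(self, node_id=None):
        """Old-style signature: warn, then delegate to the new method."""
        warnings.warn(
            "get_taxa() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the shim
        )
        # node_id is accepted only for backwards compatibility and is ignored.
        return self.get_terminals()
```

With a shim like this, old test code keeps passing while emitting DeprecationWarning, which is the behaviour step (3) hopes for from test_Nexus.py.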
Brad From eric.talevich at gmail.com Tue Jan 5 00:09:18 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 4 Jan 2010 16:09:18 -0800 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20100104131631.GG80812@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> <20100104131631.GG80812@sobchak.mgh.harvard.edu> Message-ID: <3f6baf361001041609u7997dd61v441257dbfecdebd6@mail.gmail.com> Hi Brad, I hope the holidays treated you well. On Mon, Jan 4, 2010 at 5:16 AM, Brad Chapman wrote: > > Are the annotations often used in real life cases or is this more of > a fringe problem? I'm not as familiar with tree work, but know this > is a pain in sequence space. A good goal is to capture the most > common use cases and then integrate the other issues as feasible. > The data that TreeIO preserves round-trip are: - Branching structure (topology) - Branch lengths - Clade/taxon names - Rooted-ness (for the whole tree) - Tree ID The troublesome parts are: - The "confidences" attribute in PhyloXML trees should map onto the "support" attribute in Nexus trees, but that's tricky -- the original Nexus attribute seemed content with a little ambiguity in what that attribute's numerical value actually meant (relative/absolute support), while PhyloXML uses a list of Confidence objects containing both a numerical value and a "type" string such as "bootstrap". Currently that information is dropped when converting between PhyloXML and Nexus/Newick trees. - Nexus also has a "comment" attribute for each node, while PhyloXML doesn't directly support that. - The branch length of the root node/clade is None in PhyloXML, but 0.0 in Nexus. I prefer None because there is no meaningful branch leading to that node, but there might be a reason 0.0 was chosen for Nexus that I'm not aware of. 
- The names of unlabeled internal nodes might change from None to "" in some cases, since None is the PhyloXML default and "" is the Nexus default. - Since PhyloXML supports more structured taxonomic information on each node than Newick, it's possible to have a PhyloXML tree where a Clade has no name, but instead one or more Taxonomy objects containing the scientific name, common names, etc. -- so when this tree is converted to Newick format the taxonomy info is lost for those nodes. I could squash the Taxonomy object into a string for the sake of Nexus labels, but I think it would be safer (less surprising) to just write a cookbook entry on how to collapse PhyloXML Taxonomies into Clade names to aid format conversions. If the support-vs-confidence issue can be resolved, then we can treat PhyloXML as a rough superset of Newick, in terms of annotation, and then it shouldn't be surprising to lose some annotation data in converting PhyloXML to Newick.
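
The lossy parts of the conversion listed above can be sketched roughly as follows. This uses small stand-in dataclasses, not the real PhyloXML classes, and the helper names label_for_newick and support_for_newick are invented for illustration. The policy shown (prefer the clade name, then a scientific name, then a common name; keep only a confidence of one chosen type) is just one possible choice, not a settled decision.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Confidence:
    """Stand-in for a typed PhyloXML confidence value."""
    value: float
    type: str = "unknown"


@dataclass
class Taxonomy:
    """Stand-in for structured taxonomic information on a clade."""
    scientific_name: Optional[str] = None
    common_name: Optional[str] = None


@dataclass
class Clade:
    """Stand-in for a PhyloXML-style clade; the defaults are None, not ''."""
    name: Optional[str] = None
    branch_length: Optional[float] = None
    confidences: List[Confidence] = field(default_factory=list)
    taxonomies: List[Taxonomy] = field(default_factory=list)


def label_for_newick(clade):
    """Collapse name/taxonomy info into one plain label ('' if none)."""
    if clade.name:
        return clade.name
    for tax in clade.taxonomies:
        if tax.scientific_name:
            return tax.scientific_name
        if tax.common_name:
            return tax.common_name
    return ""  # the Nexus-style default for an unlabeled node


def support_for_newick(clade, preferred_type="bootstrap"):
    """Reduce typed confidences to a single number, or None if no match."""
    for conf in clade.confidences:
        if conf.type == preferred_type:
            return conf.value
    return None  # refuse to guess what an untyped number would mean
```

Going the other way (Newick to PhyloXML) would need the reverse assumption, namely which "type" string to attach to a bare support value, which is the ambiguity described in the first bullet.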
Cheers, Eric From biopython at maubp.freeserve.co.uk Tue Jan 5 17:50:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Jan 2010 17:50:25 +0000 Subject: [Biopython-dev] code credits In-Reply-To: <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com> References: <928490.72367.qm@web30708.mail.mud.yahoo.com> <320fb6e00912171454v2ce81fc5v93547951d7af84f8@mail.gmail.com> <320fb6e00912210357m32156fdax6639445cadd83217@mail.gmail.com> <20091221132339.GC21580@sobchak.mgh.harvard.edu> <320fb6e00912210634o77d9eb9ex21e4ec3630dd1ed6@mail.gmail.com> <320fb6e00912210848x449fd73al4e97d3c9e21cf4@mail.gmail.com> <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com> Message-ID: <320fb6e01001050950r64dabb1dw67baafada72f5d1a@mail.gmail.com> On Tue, Dec 22, 2009 at 12:14 PM, Peter wrote: > On Mon, Dec 21, 2009 at 4:48 PM, Peter wrote: >> So, how about a merger of (1) and (3)? i.e. >> >> * The CONTRIBUTORS file remains a single alphabetical list >> of all contributors to date (no change). >> * Entries in the NEWS file for new features etc may continue >> to credit authors as appropriate. >> * The NEWS file will include at the end of each release section >> an alphabetical list of contributors for that release (with new >> contributors flagged). This will be re-used in the release notice. > > I've done that in github - how do the NEWS and CONTRIB file look? > > http://github.com/biopython/biopython/commit/86d8d99aab894ab5f32a0e7a0c45d63a441da645 > > I haven't automatically included email addresses for the new contributors > since there is a risk of them being harvested for spam, so I figure that > should be "opt in". Thanks to those with feedback off list (e.g. sort order). 
I've just updated the news post to include the list of names: http://news.open-bio.org/news/2009/12/biopython-release-153/ I don't have time today, but at some point this week I want to do another news post and email announcement describing this new Sage-like policy for recognising contributors. If anyone would like to compose a draft of the apparent consensus that would be very helpful. If anyone would like to go back over the commit log for the recent releases to update them as we've just done for 1.53, please go ahead - but post an email here to avoid duplicated efforts. Peter P.S. Happy New Year! From bugzilla-daemon at portal.open-bio.org Thu Jan 7 18:11:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 7 Jan 2010 13:11:47 -0500 Subject: [Biopython-dev] [Bug 2980] New: Bio.SeqIO can't parse EMBL CONTIG records Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2980 Summary: Bio.SeqIO can't parse EMBL CONTIG records Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk While the GenBank parser has been updated to cope with CONTIG records (using an UnknownSeq object), this has not been done for the EMBL parser. As an example test case, consider: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rel_con_hum_01_r102.dat.gz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Jan 8 11:50:56 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 8 Jan 2010 06:50:56 -0500 Subject: [Biopython-dev] [Bug 2980] Bio.SeqIO can't parse EMBL CONTIG records In-Reply-To: Message-ID: <201001081150.o08Bougb013879@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2980 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-08 06:50 EST ------- Fixed in git -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jan 8 16:26:29 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 8 Jan 2010 08:26:29 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> Message-ID: <221209.41863.qm@web62404.mail.re1.yahoo.com> I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric! I have one suggestion though: In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython. The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too). Thanks again, --Michiel. 
--- On Mon, 12/28/09, Eric Talevich wrote: > From: Eric Talevich > Subject: Re: [Biopython-dev] Code review request for phyloxml branch > To: "BioPython-Dev Mailing List" > Date: Monday, December 28, 2009, 8:51 PM > Hi folks, > > Here's an update on the status of Bio.Tree and TreeIO. I > think I've taken > care of most of the blockers since the last review in > September. > > First, some links: > http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/ > http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/ > http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py > http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py > http://biopython.org/wiki/PhyloXML > > Discussion: > > *TreeIO* > Conversion between Nexus, Newick and phyloXML tree file > formats works; the > read/parse/write functions for each IO format use the same > object types. > Neat! > > The tree annotations (e.g. id) aren't preserved perfectly > during conversions > -- I'll keep working on this, but I don't think it's a > blocker. The taxon > names of terminal nodes are kept as "clade" names in > phyloXML for > round-tripping. Tree topology and branch lengths seem OK. > > Under the hood: > -- PhyloXMLIO is from GSoC > -- NewickIO is ported from the Bio.Nexus.Trees parser. I > think it works the > same way. > -- NexusIO relies on Bio.Nexus.Nexus for parsing, then > converts the > resulting Nexus.Trees.Tree objects to Bio.Tree.Newick > objects. One day, when > Nexus.Trees is replaced by NewickIO in the main Nexus > parser, then this > conversion can be dropped and NexusIO will be very simple. > > *Tree* > The BaseTree object structure looks like this:* > > -- BaseTree.**Tree* contains global tree information, like > whether the tree > is rooted, and a reference to the root clade. 
The phyloXML > Phylogeny object > inherits from this.* > > -- BaseTree.**Subtree* contains local (clade- or > node-specific) information, > and references to each of its direct descendents, > recursively. The phyloXML > Clade object inherits from this. Nodes are implicit. I > could add references > to the ancestor of each sub-tree without too much > difficulty, but I haven't > needed them yet. > > The same methods (get_terminals et al.) generally apply to > both classes, so > I created a separate TreeMixin class from which both > BaseTree.Tree and > BaseTree.Subtree inherit. > > Bio.Tree.Newick contains simple subclasses of Tree and > Subtree, and an > incomplete set of shims that track Bio.Nexus.Trees.Tree > (minus the I/O). > This is to ease the deprecation and eventual replacement of > Bio.Nexus.Trees, > as I imagine it: > (1) Port methods from Nexus.Trees to Bio.Tree, simplifying > arguments where > reasonable (since the node IDs and adjacency list lookup > are no longer > needed) > (2) Implement methods in Bio.Tree.Newick with the original > argument lists, > but triggering a deprecation warning indicating the newer > replacement method > (3) Replace Nexus.Trees with an import of > Bio.Tree.Newick(IO) and a few more > shims to duplicate the original API -- so test_Nexus.py > should still pass, > ideally (with deprecation warnings) > (4) In Nexus.Nexus, replace all usage of Nexus.Trees with > proper usage of > NexusIO and Bio.Tree methods. > (5) Eventually delete Nexus.Trees and the shims in > Bio.Tree.Newick. > > I'm currently doing (1) and (2), with more emphasis on > getting (1) right. > Not all of the important methods have been ported, but I'm > happy with the > tree traversal methods. > * > Tests > *I created test_Tree.py to test the methods in > Bio.Tree.BaseTree; > test_PhyloXML.py tests Bio.Tree.PhyloXML objects and > Bio.TreeIO.PhyloXMLIO > parsing/writing. 
> > I noticed that in Tests/Nexus/, the example file for > internal node labels is > actually in Newick/NH format, not Nexus. That was briefly > confusing, so > maybe that file should be renamed. > > What do you think? > > All the best, > Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Fri Jan 8 17:00:12 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 17:00:12 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <221209.41863.qm@web62404.mail.re1.yahoo.com> References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> <221209.41863.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon wrote: > I am not an expert in this area, but the code looks very well done and well > organized. Thanks, Eric! > > I have one suggestion though: > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather > have everything under Bio.Tree. This makes it easier to understand what each > Bio.* module is about, and also agrees with the structure of the other modules > in Biopython. The only exception is Bio.Seq, for which there is a closely related > Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; > I'd rather have a single Bio.Seq there too). There is also Bio.AlignIO, which again might have been handled via Bio.Align with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was following the lead from BioPerl. I think there are some good points about making the code for the common object (tree, SeqRecord, Alignment) clearly separate from the code for parsing or writing it (although separate top level modules is perhaps overkill). 
However, I agree, this isn't universal in Biopython (e.g. Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO). So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing I don't like is that "Tree" could mean a class or a module (also a problem with other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python convention (PEP8) is to use lower case for the module ("tree") and title case for the class ("Tree"), something most of Biopython does not follow (and which we can't change without a lot of upheaval). Another option if we want to try and keep the existing module name style might be Bio.Trees containing a Tree class, or perhaps something different like Bio.Phylo instead? Peter From eric.talevich at gmail.com Fri Jan 8 18:22:11 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 8 Jan 2010 13:22:11 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> <221209.41863.qm@web62404.mail.re1.yahoo.com> <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com> Message-ID: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> On Fri, Jan 8, 2010 at 12:00 PM, Peter Cock wrote: > > On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon wrote: > > I am not an expert in this area, but the code looks very well done and well > > organized. Thanks, Eric! > > > > I have one suggestion though: > > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather > > have everything under Bio.Tree. This makes it easier to understand what each > > Bio.* module is about, and also agrees with the structure of the other modules > > in Biopython. The only exception is Bio.Seq, for which there is a closely related > > Bio.SeqIO and Bio.SeqRecord. 
(In my opinion, that is more for historical reasons; > > I'd rather have a single Bio.Seq there too). > > There is also Bio.AlignIO, which again might have been handled via Bio.Align > with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was > following the lead from BioPerl. Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava do something completely different. I had the impression that pairing modules Foo & FooIO was an emerging convention for organizing very general data types being fed by a variety of file formats, while a single module Foo indicated support for a particular program or source, like Entrez. But I think it would be even cleaner if each Foo simply had a Foo.IO (or foo.io) sub-module organizing the I/O for multiple file formats where applicable. The TreeIO.* namespace is not crowded -- just read, write, parse, convert. If that directory is moved under Bio.Tree and renamed to IO or io, then Bio.Tree would still seem reasonably intuitive if __init__.py contained: from io import * from utils import * Then "from Bio import Tree" would be enough for most uses. > I think there are some good points about making > the code for the common object (tree, SeqRecord, Alignment) clearly separate > from the code for parsing or writing it (although separate top level modules is > perhaps overkill). However, I agree, this isn't universal in Biopython (e.g. > Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO). PDB does its own thing, too -- and some consolidation there might be nice. > So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing > I don't like is that "Tree" could mean a class or a module (also a problem with > other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python > convention (PEP8) is to use lower case for the module ("tree") and title case > for the class ("Tree"), something most of Biopython does not follow (and > which we can't change without a lot of upheaval). 
I could rename the modules inside Bio.Tree (or whatever we call it) to follow the PEP8 convention: Bio/Tree/ Bio/Tree/basetree.py Bio/Tree/io.py Bio/Tree/utils.py ... The Biopython convention seems to be that directory names are title case, file names are mostly title case if user-facing and lower case otherwise, and C extensions are lower case. Most of the time there won't be any need to import the sub-modules under Tree directly, so the inconsistency shouldn't be too jarring. > perhaps something different like Bio.Phylo instead? Sure, that sounds promising. Thanks! Eric From mjldehoon at yahoo.com Sat Jan 9 15:15:56 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Jan 2010 07:15:56 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> Message-ID: <863834.10061.qm@web62403.mail.re1.yahoo.com> --- On Fri, 1/8/10, Eric Talevich wrote: > Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava > do something completely different. > > I had the impression that pairing modules Foo & FooIO > was an emerging convention for organizing very general > data types being fed by a variety of file formats, while > a single module Foo indicated support > for a particular program or source, like Entrez. I think a workable convention, which is already followed by many Biopython modules, is the following: 1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.). 2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff.
Typically, the user won't need to interact with the submodule directly. 3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information. This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object. > But I think it would > be even cleaner if each Foo simply had a Foo.IO (or foo.io) > sub-module organizing the I/O for multiple file formats where > applicable. I agree. > The TreeIO.* namespace is not crowded -- just read, write, > parse, convert. If that directory is moved under Bio.Tree and > renamed to IO or io, then Bio.Tree would still seem reasonably > intuitive if __init__.py contained: > > from io import * > from utils import * > > Then "from Bio import Tree" would be enough for most uses. Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module. Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those function access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user. > > perhaps something different like Bio.Phylo instead? > > Sure, that sounds promising. I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. 
On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing. --Michiel From eric.talevich at gmail.com Sat Jan 9 23:38:29 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 9 Jan 2010 18:38:29 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <863834.10061.qm@web62403.mail.re1.yahoo.com> References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> Message-ID: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> Hi, Thanks for your comments. I've reorganized the modules like this: Bio/Phylo/ __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py IO/ __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py Now "from Bio import Phylo" works for the common cases, and "from Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct access to the parsers. I renamed TreeIO to Phylo/IO -- keeping it uppercase because io is a standard module in Py2.6+, Py2.7 changes the priority rules for absolute vs. relative imports, and Py2.4 doesn't support the new syntax for relative imports. I might change the other file names to lower case before the next merge, though... On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon wrote: > > Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module. > > Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those function access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user. > I'm trying to avoid having to update Phylo/__init__.py each time I add or rename a public function in Utils.py or IO. 
So, how about this: I've added "__all__" definitions to Utils.py and IO/__init__.py so that only the relevant public functions are loaded when Phylo/__init__.py imports * from those two sub-modules. Testing manually, this seems to do the right thing. Cheers, Eric From mjldehoon at yahoo.com Sun Jan 10 02:50:21 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Jan 2010 18:50:21 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> Message-ID: <274373.93315.qm@web62406.mail.re1.yahoo.com> I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this? Thanks! --Michiel --- On Sat, 1/9/10, Eric Talevich wrote: > From: Eric Talevich > Subject: Re: [Biopython-dev] Code review request for phyloxml branch > To: "Michiel de Hoon" > Cc: "Peter Cock" , "BioPython-Dev Mailing List" > Date: Saturday, January 9, 2010, 6:38 PM > Hi, > > Thanks for your comments. I've reorganized the modules like > this: > > Bio/Phylo/ >     __init__.py, BaseTree.py, Newick.py, > PhyloXML.py, Utils.py >     IO/ >         __init__.py, NexusIO.py, > NewickIO.py, PhyloXMLIO.py > > Now "from Bio import Phylo" works for the common cases, and > "from > Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct > access to the > parsers. > > I renamed TreeIO to Phylo/IO -- keeping it uppercase > because io is a > standard module in Py2.6+, Py2.7 changes the priority rules > for > absolute vs. relative imports, and Py2.4 doesn't support > the new > syntax for relative imports. I might change the other file > names to > lower case before the next merge, though...
> > On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon > wrote: > > > > Rather than importing *, can we import only those > functions that a user would actually use? We should avoid > importing stuff that is essentially used only locally in > each sub-module. > > > > Another option is to have all functions that are > intended to be used by the user in Bio.Phylo, and have those > function access (internally) any sub-module as needed. For > example, a user would not notice that Bio.Phylo.read > actually uses code from Bio.Phylo.io; the latter module > would not be accessed directly by the user. > > > > I'm trying to avoid having to update Phylo/__init__.py each > time I add > or rename a public function in Utils.py or IO. So, how > about this: > I've added "__all__" definitions to Utils.py and > IO/__init__.py so > that only the relevant public functions are loaded when > Phylo/__init__.py imports * from those two sub-modules. > Testing > manually, this seems to do the right thing. > > Cheers, > Eric > From eric.talevich at gmail.com Sun Jan 10 22:02:10 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 10 Jan 2010 17:02:10 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <274373.93315.qm@web62406.mail.re1.yahoo.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> Message-ID: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> On Sat, Jan 9, 2010 at 9:50 PM, Michiel de Hoon wrote: > I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. OK -- I pulled the latest from biopython/biopython on GitHub, merged my phyloxml branch into my master branch, and pushed it all back to biopython. Bio.Phylo is now part of Biopython! 
For documentation on the Biopython wiki, I moved the relevant parts of the Tree, TreeIO and PhyloXML pages to a new page for Bio.Phylo:
http://biopython.org/wiki/Phylo

It's a little rough at the moment, but I'll refine it this week. Some of the content can also be moved to separate cookbook entries.

> One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

I went over all the docstrings and comments again before merging; it should be free of Tree/TreeIO references now.

Thanks for your help!
Eric

From biopython at maubp.freeserve.co.uk Mon Jan 11 11:04:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 11:04:03 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>

On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich wrote:
>
> I'm trying to avoid having to update Phylo/__init__.py each time I add
> or rename a public function in Utils.py or IO. So, how about this:
> I've added "__all__" definitions to Utils.py and IO/__init__.py so
> that only the relevant public functions are loaded when
> Phylo/__init__.py imports * from those two sub-modules. Testing
> manually, this seems to do the right thing.

Previously bits of Biopython have used __all__, and then abandoned this as a long term maintenance load. This was before my time, so I am not familiar with the full history, but it makes me wary about using __all__ here.
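For anyone following along, here is a minimal sketch of how "__all__" limits what a star import picks up. The module and function names below are made up for illustration and are not the actual Bio.Phylo code:

```python
import sys
import types

# Source for a hypothetical sub-module standing in for something like Utils.py.
FAKE_UTILS_SOURCE = '''
__all__ = ["pretty_print"]

def pretty_print(tree):
    return "tree: %s" % tree

def local_only():
    return "not exported"
'''

# Build the module in memory and register it so it can be imported by name.
fake_utils = types.ModuleType("fake_utils")
exec(FAKE_UTILS_SOURCE, fake_utils.__dict__)
sys.modules["fake_utils"] = fake_utils

# Simulate what a package __init__.py doing "from fake_utils import *" would see.
package_namespace = {}
exec("from fake_utils import *", package_namespace)

# Ignore the dunder entries that exec() adds to the namespace on its own.
exported = sorted(name for name in package_namespace if not name.startswith("__"))
print(exported)
```

Only the names listed in "__all__" end up in the importing namespace; local_only is still reachable through the module itself, just not via the star import.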
Personally I don't see a big problem with having just explicit manual imports within Bio/Phylo/__init__.py if and when you decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py should be made available at the top level. In general I would think relatively few things should be exposed like that. Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 11:37:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 11:37:42 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> Message-ID: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich wrote: > > OK -- I pulled the latest from biopython/biopython on GitHub, merged > my phyloxml branch into my master branch, and pushed it all back to > biopython. Bio.Phylo is now part of Biopython! Wow - that was quicker than I expected. As an aside, do you know why there seem to be three main branches in the history now? I guess this was the "original" master, your local master, and your phyloxml branch? One minor thing - test_Phylo.py needs to be tweaked to raise a MissingExternalDependencyError if NetworkX isn't installed. That way the run_tests.py script will treat it as a skipped test instead of a failed test. Alternatively, if this is just a small part of the test, maybe split test_Phylo.py into two files (e.g. add a new file test_Phylo_NeworkX.py which needs the dependency). And how's this for a draft entry in the NEWS file? New module Bio.Phylo includes support for reading, writing and working with phylogenetic trees from Newick, Nexus and PhyloXML files. 
This was work by Eric Talevich on a Google Summer of Code 2009 project, under The National Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and Christian Zmasek. Peter From chapmanb at 50mail.com Mon Jan 11 13:18:40 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 11 Jan 2010 08:18:40 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> Message-ID: <20100111131840.GB46441@sobchak.mgh.harvard.edu> Hi all; > OK -- I pulled the latest from biopython/biopython on GitHub, merged > my phyloxml branch into my master branch, and pushed it all back to > biopython. Bio.Phylo is now part of Biopython! Awesome. Congrats Eric -- thanks for all the hard work on this during the summer, and getting it in shape for inclusion. Peter and Michiel, thanks for all the helpful feedback. Really happy to have this integrated, Brad From biopython at maubp.freeserve.co.uk Mon Jan 11 13:42:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 13:42:32 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com> References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> <863834.10061.qm@web62403.mail.re1.yahoo.com> <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com> Message-ID: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> On Mon, Jan 11, 2010 at 11:04 AM, Peter wrote: > On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich wrote: >> >> I'm trying to avoid having to update Phylo/__init__.py each time I add >> or rename a public function in Utils.py or IO. 
So, how about this:
>> I've added "__all__" definitions to Utils.py and IO/__init__.py so
>> that only the relevant public functions are loaded when
>> Phylo/__init__.py imports * from those two sub-modules. Testing
>> manually, this seems to do the right thing.
>
> Previously bits of Biopython have used __all__, and then
> abandoned this as a long term maintenance load. This was before
> my time, so I am not familiar with the full history, but it makes me
> wary about using __all__ here.
>
> Personally I don't see a big problem with having just explicit
> manual imports within Bio/Phylo/__init__.py if and when you
> decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
> should be made available at the top level. In general I would
> think relatively few things should be exposed like that.

In fact, why even do this at all? What is wrong with leaving the IO functions (read, parse, write) as Bio.Phylo.IO.read etc e.g.

>>> from Bio import Phylo
>>> tree = Phylo.IO.read(open("int_node_labels.nwk"),"newick")

What is the benefit of having them also exposed under the Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means there are two ways to access them which is confusing.

If we do want to use Bio.Phylo.IO instead of Bio.PhyloIO (or Bio.TreeIO) then thinking long term we may want to do something about Bio.SeqIO and Bio.AlignIO to match.

We could move the Bio.AlignIO functionality under Bio.Align.IO (with a suitable transition period).

We could move Bio.SeqIO to Bio.Seq.IO perhaps. Or we could even talk about introducing Bio.Sequences (or something) then move Bio.SeqIO to Bio.Sequences.IO, and move Bio.SeqUtils.* under there too, and perhaps even the Seq, SeqRecord and SeqFeature objects as well.

On the other hand, all that upheaval would cause a lot of pain for end users, for relatively little gain.
Peter From mjldehoon at yahoo.com Mon Jan 11 15:02:46 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 11 Jan 2010 07:02:46 -0800 (PST) Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> Message-ID: <107440.85746.qm@web62406.mail.re1.yahoo.com> --- On Mon, 1/11/10, Peter wrote: > What is wrong with leaving the IO functions > (read, parse, write) as Bio.Phylo.IO.read etc > e.g. > > >>> from Bio import Phylo > >>> tree = > Phylo.IO.read(open("int_node_labels.nwk"),"newick") > > What is the benefit of having them also exposed under the > Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means > there are two ways to access them which is confusing. If we use Bio.Phylo.IO.read directly, then for consistency we'd have to do the same for all other modules. Otherwise, we'd be guessing each time whether the read() and parse() functions are in Bio.SomeModule, or Bio.SomeModule.IO. For Bio.Phylo, a simple solution is to put whatever is in Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and remove Bio.Phylo.IO.__init__.py. Then there is only one way to access the read() etc. functions. [About doing the same for Bio.Seq and Bio.Align] > On the other hand, all that upheaval would cause a > lot of pain for end users, for relatively little gain. For new users, it may be confusing to have all those different modules dealing with sequences. At least, it was for me when I started with Biopython. Therefore, for a long term solution, I'd prefer a single Bio.Seq module that incorporates all (Seq, SeqRecord, SeqIO, SeqFeature). I agree that that may cause a lot of upheaval for end users, but a suitably long transition period may mitigate those concerns. I'd prefer that to being stuck with a less-than-optimal code organization forever. 
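As a rough idea of what such a transition period could look like in code, the old entry point can stay around as a thin wrapper that warns and then forwards to the new one. The function names here are placeholders, not existing Biopython functions:

```python
import warnings


def new_read(handle, fmt):
    """Stand-in for a function living at its new, preferred location."""
    return "parsed %s as %s" % (handle, fmt)


def old_read(handle, fmt):
    """Old entry point kept during the transition; warns, then forwards."""
    warnings.warn(
        "old_read is deprecated; use new_read instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_read(handle, fmt)


# Show that the old name still works but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_read("example.nwk", "newick")

print(result)
print([str(w.message) for w in caught])
```

Existing scripts keep running unchanged, and users get a pointer to the replacement before the old name is finally removed.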
--Michiel From biopython at maubp.freeserve.co.uk Mon Jan 11 16:17:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:17:36 +0000 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <107440.85746.qm@web62406.mail.re1.yahoo.com> References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote: > > On Mon, 1/11/10, Peter wrote: >> What is the benefit of having them also exposed under the >> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means >> there are two ways to access them which is confusing. > > If we use Bio.Phylo.IO.read directly, then for consistency we'd have > to do the same for all other modules. Otherwise, we'd be guessing > each time whether the read() and parse() functions are in > Bio.SomeModule, or Bio.SomeModule.IO. Fair point. > For Bio.Phylo, a simple solution is to put whatever is in > Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and > remove Bio.Phylo.IO.__init__.py. Then there is only one > way to access the read() etc. functions. Or (if the functions are reasonably complex) keep the input/output code in a separate file, but make it explicit that it is not a public interface - e.g. use Bio/Phylo/_IO.py? > [About doing the same for Bio.Seq and Bio.Align] >> On the other hand, all that upheaval would cause a >> lot of pain for end users, for relatively little gain. > > For new users, it may be confusing to have all those > different modules dealing with sequences. At least, it > was for me when I started with Biopython. Therefore, > for a long term solution, I'd prefer a single Bio.Seq > module that incorporates all (Seq, SeqRecord, SeqIO, > SeqFeature). I agree that for a long term solution a single module make sense here, although I'm not convinced that Bio.Seq is the best name. 
We'd have to switch from a single file Bio/Seq.py to a folder with multiple files including Bio/Seq/__init__.py - I worry this may cause problems with updating existing Biopython installations. > I agree that that may cause a lot of upheaval for end > users, but a suitably long transition period may mitigate > those concerns. I'd prefer that to being stuck with a > less-than-optimal code organization forever. In principle I agree with that. Peter From eric.talevich at gmail.com Mon Jan 11 16:30:32 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 11 Jan 2010 11:30:32 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> <274373.93315.qm@web62406.mail.re1.yahoo.com> <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com> Message-ID: <3f6baf361001110830y391ea21cs8315a266b8b4fb43@mail.gmail.com> On Mon, Jan 11, 2010 at 6:37 AM, Peter wrote: > On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich wrote: >> >> OK -- I pulled the latest from biopython/biopython on GitHub, merged >> my phyloxml branch into my master branch, and pushed it all back to >> biopython. Bio.Phylo is now part of Biopython! > > Wow - that was quicker than I expected. As an aside, do you know > why there seem to be three main branches in the history now? > I guess this was the "original" master, your local master, and your > phyloxml branch? Er, sorry if I jumped the gun. I was eager to get this done before the semester kicks in... 
anyway, these are the Git commands I used:

    git checkout master
    git pull upstream  # remote: biopython master
    git checkout phyloxml
    git merge master  # check that it merges cleanly
    git checkout master
    git merge phyloxml  # fast-forward
    git push upstream master
    git push origin master  # updating my own branches on github
    git push origin phyloxml

It looks more reasonable in gitk; maybe the branches will separate again later on GitHub when they're no longer equivalent, or when I delete the phyloxml branch.

> One minor thing - test_Phylo.py needs to be tweaked to raise a
> MissingExternalDependencyError if NetworkX isn't installed. That
> way the run_tests.py script will treat it as a skipped test instead of
> a failed test. Alternatively, if this is just a small part of the test,
> maybe split test_Phylo.py into two files (e.g. add a new file
> test_Phylo_NeworkX.py which needs the dependency).

I extracted test_Phylo_depend.py from test_Phylo and added tests at the top level for networkx and either pygraphviz or pydot (since those are also used by Bio/Phylo/Utils.py).

> And how's this for a draft entry in the NEWS file?
>
> New module Bio.Phylo includes support for reading, writing and working with
> phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
> Eric Talevich on a Google Summer of Code 2009 project, under The National
> Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
> Christian Zmasek.

Great, thanks!
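For the dependency check mentioned above, the top of such a test file might look roughly like this. The exception class below is a local stand-in so the snippet runs on its own (in Biopython it would be the real MissingExternalDependencyError), and the helper name is just for illustration:

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Local stand-in for Biopython's exception of the same name."""


def require_module(name):
    """Import an optional dependency, or signal that the test should be skipped."""
    try:
        return importlib.import_module(name)
    except ImportError:
        raise MissingExternalDependencyError(
            "Install %s if you want to run this test." % name
        )


# A module from the standard library imports fine...
json_module = require_module("json")

# ...while a missing one is reported as a missing dependency, not a failure.
try:
    require_module("surely_not_an_installed_module_xyz")
    skipped = False
except MissingExternalDependencyError:
    skipped = True
print(skipped)
```

The point is only that the check happens at import time, before any test case runs, so the test runner can tell "dependency missing" apart from "test failed".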
Eric From eric.talevich at gmail.com Mon Jan 11 16:43:01 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 11 Jan 2010 11:43:01 -0500 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> Message-ID: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com> On Mon, Jan 11, 2010 at 11:17 AM, Peter wrote: > On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote: >> >> On Mon, 1/11/10, Peter wrote: >>> What is the benefit of having them also exposed under the >>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means >>> there are two ways to access them which is confusing. >> >> If we use Bio.Phylo.IO.read directly, then for consistency we'd have >> to do the same for all other modules. Otherwise, we'd be guessing >> each time whether the read() and parse() functions are in >> Bio.SomeModule, or Bio.SomeModule.IO. > > Fair point. > >> For Bio.Phylo, a simple solution is to put whatever is in >> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and >> remove Bio.Phylo.IO.__init__.py. Then there is only one >> way to access the read() etc. functions. > > Or (if the functions are reasonably complex) keep the > input/output code in a separate file, but make it explicit > that it is not a public interface - e.g. use Bio/Phylo/_IO.py? Something like this? 
Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py

This plays well with the expected import styles:

    from Bio import Phylo  # most common
    from Bio.Phylo import PhyloXML  # access the defined types
    from Bio.Phylo import PhyloXMLIO  # special parsing

From biopython at maubp.freeserve.co.uk Mon Jan 11 17:11:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 17:11:29 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
Message-ID: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>

On Mon, Nov 23, 2009 at 2:43 PM, Peter wrote:
> Dear all,
>
> Is there anyone on the dev mailing list willing to test the SFF
> support I've been working on for Bio.SeqIO? The code is here,
> a branch on github:
> http://github.com/peterjc/biopython/tree/sff-seqio
>
> The important files are:
> * Bio/SeqIO/SffIO.py
> * Bio/SeqIO/__init__.py (defining the new format)
> * Bio/SeqIO/_index.py (indexing SFF files)
>
> Plus unit test files:
> * Tests/run_tests.py (to run the doctests)
> * Tests/test_SeqIO_QualityIO.py
> * Tests/test_SeqIO_index.py
> * Tests/test_SeqIO.py
> * Tests/Roche/* (for unit tests)
>
> Sebastian Bassi had a look last month and his feedback has
> already helped (e.g. with error messages):
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html
>
> I have been using this code myself in real work, for example
> editing the trim points in an SFF file to take into account PCR
> primer sequences, and filtering SFF reads, checking Roche
> barcodes etc.
>
> Thanks,
>
> Peter
>

Hi all,

I didn't want to rush the SFF support into Biopython 1.53, but it's been waiting "ready" for a while now. Any objections or comments about me merging this now?
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Jan 12 14:51:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 14:51:58 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> <107440.85746.qm@web62406.mail.re1.yahoo.com> <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com> <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
Message-ID: <320fb6e01001120651i6b3d661m83187659595ce9e4@mail.gmail.com>

On Mon, Jan 11, 2010 at 4:43 PM, Eric Talevich wrote:
> On Mon, Jan 11, 2010 at 11:17 AM, Peter wrote:
>> Or (if the functions are reasonably complex) keep the
>> input/output code in a separate file, but make it explicit
>> that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
>
> Something like this?
>
> Phylo/
>     BaseTree.py
>     Newick.py
>     PhyloXML.py
>     _IO.py
>     _Utils.py
>     PhyloXMLIO.py
>     NewickIO.py
>     NexusIO.py
>
> This plays well with the expected import styles:
>
> from Bio import Phylo  # most common
> from Bio.Phylo import PhyloXML  # access the defined types
> from Bio.Phylo import PhyloXMLIO  # special parsing

I'd forgotten Bio/Phylo/IO was a directory, and that the users may want to access PhyloXMLIO directly. That suggested structure looks reasonable... what do you think Michiel?

Peter

From kellrott at gmail.com Tue Jan 12 21:46:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 13:46:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To:
References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
Message-ID:

I've pulled from the main branch and fixed a few problems. I've tested the code against Sqlite, Python Mysql, and Jython Mysql. All three seem to be working right now.
Kyle

On Thu, Dec 17, 2009 at 10:03 AM, Kyle Ellrott wrote:
>
> > Code can be found at http://github.com/kellrott/biopython
>>
>> Lovely. That's on your jython branch (along with lots of your other work)?
>>
>
> Yes, but all of the zxJDBC work has been done in the past 2 weeks (just the
> last three commits), so it should be easy to cherry-pick out the relevant
> patches.
>
> Kyle
>

From biopython at maubp.freeserve.co.uk Tue Jan 12 21:51:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 21:51:34 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To:
References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
Message-ID: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>

On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott wrote:
> I've pulled from the main branch and fixed a few problems. I've tested the
> code against Sqlite, Python Mysql, and Jython Mysql. All three seem to be
> working right now.
>
> Kyle

Excellent - I had a play last month, and Jython Mysql seemed to work. Do you know if/how to get SQLite and/or PostgreSQL drivers installed under zxJDBC?

Peter

From kellrott at gmail.com Tue Jan 12 22:06:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 14:06:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
Message-ID:

I haven't played with Postgre yet (don't even have it installed).
Sqlite as a python package hasn't been standardized to Jython yet ( http://bugs.jython.org/issue1682864 )

One option is to call SQLite JDBC ( http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the existing SQLite code.
But like zxJDBC, the jar would need to be in the CLASSPATH variable for the code to work.
Kyle

On Tue, Jan 12, 2010 at 1:51 PM, Peter wrote:
> On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott wrote:
> > I've pulled from the main branch and fixed a few problems. I've tested the
> > code against Sqlite, Python Mysql, and Jython Mysql. All three seem to be
> > working right now.
> >
> > Kyle
>
> Excellent - I had a play last month, and Jython Mysql seemed to work.
> Do you know if/how to get SQLite and/or PostgreSQL drivers installed
> under zxJDBC?
>
> Peter
>

From biopython at maubp.freeserve.co.uk Wed Jan 13 11:22:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 11:22:23 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To:
References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
Message-ID: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>

On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott wrote:
> I haven't played with Postgre yet (don't even have it installed).
> Sqlite as a python package hasn't been standardized to Jython yet (
> http://bugs.jython.org/issue1682864 )
>
> One option is to call SQLite JDBC (
> http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the
> existing SQLite code.
> But like zxJDBC, the jar would need to be in the CLASSPATH variable for the
> code to work.

I'm not 100% convinced that the details of your current approach are the best way forward: Specifically taking a user script that works on (C) Python using MySQL with MySQLdb as the driver, and when run on Jython automatically interpreting this to use the Java MySQL Connector/J with the org.gjt.mm.mysql.Driver (and so on for the PostgreSQL and SQLite drivers?)
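To make that concrete, the kind of automatic mapping in question would be something along these lines. This is only a sketch; the function and lookup table are invented for illustration and are not the code in the branch:

```python
import sys

# Illustrative table: which Jython/Java driver would replace a CPython one.
JYTHON_EQUIVALENTS = {
    "MySQLdb": "org.gjt.mm.mysql.Driver",
}


def resolve_driver(requested, platform=None):
    """Return the driver name to use for the running interpreter.

    On Jython (where sys.platform starts with "java") a known CPython driver
    name is swapped for its Java counterpart; otherwise it is left unchanged.
    """
    if platform is None:
        platform = sys.platform
    if platform.startswith("java") and requested in JYTHON_EQUIVALENTS:
        return JYTHON_EQUIVALENTS[requested]
    return requested


print(resolve_driver("MySQLdb", platform="linux2"))
print(resolve_driver("MySQLdb", platform="java1.6.0_17"))
```

The user script keeps asking for "MySQLdb" everywhere, and the substitution happens behind its back depending on the interpreter.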
It might be clearer if we just treat the different Jython/Java drivers as top level alternatives:

* MySQLdb (Python only, at least for now)
* psycopg, psycopg2, pgdb (Python only, at least for now)
* sqlite3 (currently Python only, maybe available on Jython later)
* org.gjt.mm.mysql.Driver (Jython only)
* Some Java PostgreSQL driver (Jython only)
* Some Java SQLite driver (Jython only)

This way we have a clean separation of all the different driver or database specific changes - although to run an existing BioSQL on MySQL script on Jython, the user would have to make a minor change, explicitly switching the driver from MySQLdb to org.gjt.mm.mysql.Driver. We also won't have lots of "if jython" statements everywhere.

What are your thoughts on this?

Note there will be some similarities between all the MySQL adaptors, all the PostgreSQL adaptors, etc. I've just made a small improvement to file BioSQL/DBUtils.py to reduce the code duplication for the existing (C) Python PostgreSQL adaptors.

Peter

From biopython at maubp.freeserve.co.uk Wed Jan 13 14:10:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 14:10:21 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
Message-ID: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>

Hi all,

Biopython currently supports Python 2.4, 2.5 and 2.6 (and seems to work on the current Python 2.7 alpha).

Is it time to start phasing out support for Python 2.4?

Reasons for encouraging Python 2.5+ include the built in support for sqlite3 (which we can use in the BioSQL wrappers) and ElementTree (which we use for the phyloXML parser) both of which must currently be manually installed for Python 2.4.

Also ReportLab is talking about dropping support for Python 2.4 (another optional dependency of Biopython). As far as I know, NumPy haven't yet talked about dropping support for Python 2.4.
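As an aside on the ElementTree point: code that has to run on Python 2.4 as well typically needs an import fallback along these lines (assuming the separately installed package is the usual "elementtree" one; the XML snippet is just a toy input):

```python
try:
    # Python 2.5 and later ship ElementTree in the standard library.
    from xml.etree import ElementTree
except ImportError:
    # Assumed fallback for Python 2.4: the separately installed package.
    from elementtree import ElementTree

# Either way, the same API is available afterwards.
root = ElementTree.fromstring("<phylogeny><clade><name>A</name></clade></phylogeny>")
clade_name = root.find("clade/name").text
print(clade_name)
```

Once Python 2.4 is dropped, the try/except can simply collapse to the first import.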
I was thinking of the usual deprecation procedure, so we'd aim to have at least two releases and one year before actually dropping support for Python 2.4. At that point older Linux distributions which ship with Python 2.4 probably won't be supported anyway.

e.g. The last version of Ubuntu to have Python 2.4 as the default was Ubuntu 6.06 LTS (Dapper Drake). The desktop edition support ended July 2009, but the server edition will be maintained until June 2011.

Peter

From eric.talevich at gmail.com Wed Jan 13 17:08:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 13 Jan 2010 12:08:24 -0500
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>

On Wed, Jan 13, 2010 at 9:10 AM, Peter wrote:
> Hi all,
>
> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha).
>
> Is it time to start phasing out support for Python 2.4?
>
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrappers) and ElementTree (which we use
> for the phyloXML parser) both of which must currently
> be manually installed for Python 2.4.
Also, it appears that Python 2.7 will use absolute instead of relative imports by default:
http://www.python.org/dev/peps/pep-0328/

For intra-package imports like in PDB/__init__.py, an import like this:

    from PDBParser import PDBParser

could be future-proofed for Py2.5+:

    from __future__ import absolute_import
    from .PDBParser import PDBParser

But to make it work in both Py2.4 and Py2.7, it would need to be converted to an absolute import:

    from Bio.PDB.PDBParser import PDBParser

Py2.5 introduced a number of other enticing syntax features, too:
http://docs.python.org/dev/whatsnew/2.5.html

- context managers (with_statement)
- if-else expressions
- unified try-except-finally (I flagged this issue in the comments in Bio.Phylo)
- all() and any()
- passing values into generators -- could be useful for parsing, maybe

The enhancements to distutils might help simplify the dependency handling in setup.py:
http://docs.python.org/dev/whatsnew/2.5.html#pep-314-metadata-for-python-software-packages-v1-1

I'm also interested in the functools and ctypes modules, but don't have pressing use cases for them.

(So, you can take that as a +1 from me.)

Cheers,
Eric

From biopython at maubp.freeserve.co.uk Wed Jan 13 17:21:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 17:21:23 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
Message-ID: <320fb6e01001130921w49b56793h413aacd3027d6275@mail.gmail.com>

On Wed, Jan 13, 2010 at 5:08 PM, Eric Talevich wrote:
> On Wed, Jan 13, 2010 at 9:10 AM, Peter wrote:
>> Hi all,
>>
>> Biopython currently supports Python 2.4, 2.5 and 2.6
>> (and seems to work on the current Python 2.7 alpha).
>>
>> Is it time to start phasing out support for Python 2.4?
>> >> Reasons for encouraging Python 2.5+ include the >> built in support for sqlite3 (which we can use in the >> BioSQL wrappers) and ElementTree (which we use >> for the phyloXML parser) both of which must currently >> be manually installed for Python 2.4. > > Also, it appears that Python 2.7 will use absolute instead > of relative imports by default: > http://www.python.org/dev/peps/pep-0328/ Thanks for the heads up on that. I think we'll just need to switch everything to absolute imports in order to cover Python 2.4 to 2.7 inclusive. > > (So, you can take that as a +1 from me.) > Good :) Peter From kellrott at gmail.com Wed Jan 13 17:37:53 2010 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 13 Jan 2010 09:37:53 -0800 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> Message-ID: My main thought was to make it so that users can write a single script that would work on any Python system (eventually IronPython as well). Because the current system expects the user to request a specific driver (MySQLdb) that happens to be system specific, it forces user code to be system specific. One alternative would be to use the strings you describe below, but in addition add special requests that would check the system add pull the appropriate driver automatically. 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, use org.gjt.mm.mysql.Driver if in Jython. Otherwise, if the user wants to use a specific driver, they pass it's name. Kyle On Wed, Jan 13, 2010 at 3:22 AM, Peter wrote: > On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott wrote: > > I haven't played with Postgre yet (don't even have it installed). 
> > Sqlite as a python package hasn't been standardized to Jython yet ( > > http://bugs.jython.org/issue1682864 ) > > > > One option is to call SQLite JDBC ( > > http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather then reusing > the > > existing SQLite code. > > But like zxJDBC, the jar would need to be in the CLASSPATH variable for > the > > code to work. > > I'm not 100% convinced that the details of your current approach > are the best way forward: Specifically taking a user script that works > on (C) Python using MySQL with MySQLdb as the driver, and when > run on Jython automatically interpreting this to use the Java MySQL > Connector/J with the org.gjt.mm.mysql.Driver (and so on for the > PostgreSQL and SQLite drivers?) > > It might be clearer if we just treat the different Jython/Java drivers > as top level alternatives: > > * MySQLdb (Python only, at least for now) > * psycopg, psycopg2, pgdb (Python only, at least for now) > * sqlite3 (currently Python only, maybe available on Jython later) > * org.gjt.mm.mysql.Driver (Jython only) > * Some JAVA PostreSQL driver (Jython only) > * Some JAVA SQLite driver (Jython only) > > This way we have a clean separation of all the different driver > or database specific changes - although the user is required > to make some minor changes to take an existing BioSQL on > MySQL script to explicitly change the driver from MySQLdb > to org.gjt.mm.mysql.Driver if they want to run it on Jython. > We also won't have lots of "if jython" statements everywhere. > > What are your thoughts on this? > > Note there will be some similarities between all the MySQL > adaptors, all the PostgreSQL adaptors, etc. I've just made > a small improvement to file BioSQL/DBUtils.py to reduce > the code duplication for the existing (C) Python PostgreSQL > adaptors. 
> > Peter > From chapmanb at 50mail.com Thu Jan 14 12:52:44 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 14 Jan 2010 07:52:44 -0500 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> Message-ID: <20100114125244.GB59876@sobchak.mgh.harvard.edu> Hey Peter; Sounds great to me. Looking forward to being able to use conditional expressions, collections.defaultdict, functools, and the with statement. 2.5 had a lot of great stuff. Brad > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha). > > Is it time to start phasing out support for Python 2.4? > > Reasons for encouraging Python 2.5+ include the > built in support for sqlite3 (which we can use in the > BioSQL wrappers) and ElementTree (which we use > for the phyloXML parser) both of which must currently > be manually installed for Python 2.4. > > Also ReportLab is talking about dropping support > for Python 2.4 (another optional dependency of > Biopython). As far as I know, NumPy haven't yet > talked about dropping support for Python 2.4. > > I was thinking of the usual deprecation procedure, so > we'd aim to have at least two releases and one year > before actually dropping support for Python 2.4. At > that point older Linux distributions which ship with > Python 2.4 probably won't be supported anyway. > > e.g. The last version of Ubuntu to have Python 2.4 > as the default was Ubuntu 6.06 LTS (Dapper Drake). > The desktop edition support ended July 2009, but > the server edition will be maintaned until June 2011. 
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Thu Jan 14 14:52:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 14:52:24 +0000 Subject: [Biopython-dev] Phasing out support for Python 2.4? In-Reply-To: <20100114125244.GB59876@sobchak.mgh.harvard.edu> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <20100114125244.GB59876@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01001140652v1e11725esa6a2f91fafd0104b@mail.gmail.com> On Thu, Jan 14, 2010 at 12:52 PM, Brad Chapman wrote: > Hey Peter; > Sounds great to me. Looking forward to being able to use conditional > expressions, collections.defaultdict, functools, and the with > statement. 2.5 had a lot of great stuff. > > Brad I guess there are quite a few good things in Python 2.5+, although I think the jump from Python 2.3 to 2.4 was more important (generators and decorators). You'll have to restrain yourself from using the new toys in Biopython a little longer though Brad ;) Since this seems to have raised no immediate objections, I've sent a message to the main and announcement lists: http://lists.open-bio.org/pipermail/biopython/2010-January/006111.html http://lists.open-bio.org/pipermail/biopython-announce/2010-January/000064.html Assuming there are no objections, we can add a conditional deprecation warning to setup.py and do a news blog post (like we did for dropping Python 2.3 early last year): http://news.open-bio.org/news/2009/05/dropping-python23-support/ Peter From biopython at maubp.freeserve.co.uk Thu Jan 14 17:32:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 17:32:22 +0000 Subject: [Biopython-dev] [Biopython] Phasing out support for Python 2.4? 
In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJŠ wrote: > > Hi Peter, > I don't get this point much. What is the problem stating that with > python 2.5+ one does not need to install an extra dependency while > for 2.4 one needs _two_ modules? > I don't think I want BioSQL nor sqlite so why would I have to upgrade. > Would the requirement be in python language syntax incompatibility then > I would NOT object, but in this situation ... > Martin Hi Martin, This isn't just the issue of sqlite3 and ElementTree. There are several benefits to using more recent versions of Python, for example with an eye on the future for Python 3, and on a practical level it simplifies our testing to have one less version to worry about (especially once Python 2.7 is out, currently scheduled for June 2010). We've already had minor issues with developers using Python 2.5+ syntax unwittingly which broke on Python 2.4 (nothing major, and it was easily fixed once the problem was spotted). If we continue to insist on Python 2.4 support, it may prove problematic if future potential contributors have existing code written for Python 2.5+ which would require significant re-factoring. None of these concerns are pressing right now (and some are hypothetical), but I think you will agree that Python 2.4 is pretty old, and not widely used anymore. Having a clear plan in place for dropping it seems a sensible move, and once that happens we can start to take advantage of the language and library improvements Python 2.5 added. Are you personally using Python 2.4? If so, could you tell us a little more - for example, is this a university server which would be difficult to update?
Or do you require some other Python package which requires Python 2.4? Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 14 18:55:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 14 Jan 2010 13:55:18 -0500 Subject: [Biopython-dev] [Bug 2992] New: Adding Uniprot XML file format parsing to Biopython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2992 Summary: Adding Uniprot XML file format parsing to Biopython Product: Biopython Version: 1.53 Platform: All URL: http://github.com/apierleoni/biopython/tree/uniprotxml-branch OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biocomp.unibo.it Uniprot XML formatted files are much easier to parse than the swissprot flat file, and are widely used at EMBL either for uniprot, IPI and integr8 databases -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Thu Jan 14 18:57:58 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 14 Jan 2010 19:57:58 +0100 (CET) Subject: [Biopython-dev] New: Uniprot XML parser Message-ID: Hi Everyone, I've been using a lot biopython in the last couple of years, it is very useful to me. So now it's my turn to contribute and be helpful to someone else. I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 entries/min on a core2duo mainstream PC). The main improvements with the actual SwissProt flat file parser are a deeper parsing of comment fields, and a Seqrecord containing features. The parser is based on the ElementTree library and was successfully tested on the complete SwissProt database (v57.12). Thus I think it is ready to be released.
I followed the rules to develop a new parser for SeqIO, filed an enhancement bug to bugzilla (bug 2992), and included the parser in a public biopython fork on github available at: http://github.com/apierleoni/biopython/tree/uniprotxml-branch the new parser is in the "uniprotxml-branch" branch, and the parser code is in Bio/SeqIO/UniprotIO.py The parser can be used from SeqIO using: iterator=SeqIO.parse(handle,'uniprot') I think this could be easily integrated in Biopython, unit test is still missing, but should be very easy to do. Anyhow any code review or suggestions are welcome. Andrea From p.j.a.cock at googlemail.com Thu Jan 14 19:16:49 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Jan 2010 19:16:49 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> On Thursday, January 14, 2010, Andrea Pierleoni wrote: > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features. > > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. 
> > I followed the rules to develop a new parser for SeqIO, filed an > enhancement bug to bugzilla (bug 2992), and included the parser in a > public biopython fork on github available at: > > http://github.com/apierleoni/biopython/tree/uniprotxml-branch > > the new parser is in the "uniprotxml-branch" branch, and the parser code > is in Bio/SeqIO/UniprotIO.py > > The parser can be used from SeqIO using: > > iterator=SeqIO.parse(handle,'uniprot') > > > I think this could be easily integrated in Biopython, unit test is still > missing, but should be very easy to do. > Anyhow any code review or suggestions are welcome. > > Andrea > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org Hi I'd spotted your branch on github - this looks like an excellent addition to Biopython :) What I would like to see is a few unit tests, specifically one using the same record in both XML (with the new parser) and the equivalent plain text SwissProt file (with the old parser) and check they agree. Also, I think you should check the start coordinates of the features are using python counting. Regards Peter From eric.talevich at gmail.com Thu Jan 14 20:03:35 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 14 Jan 2010 15:03:35 -0500 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <3f6baf361001141203i304146a4ld5683190a32b7ffe@mail.gmail.com> On Thu, Jan 14, 2010 at 1:57 PM, Andrea Pierleoni wrote: > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features.
> > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. Have you tried using this with Python 2.4? The ElementTree module wasn't added to the standard library until Python 2.5, so a simple "from xml.etree import ElementTree" may need some additional protection. It's also nice to let the user use a third-party implementation of ElementTree if they're stuck on Py2.4. An example of this is at the top of Bio.Phylo.PhyloXMLIO -- not pretty, but functional: http://github.com/biopython/biopython/blob/master/Bio/Phylo/PhyloXMLIO.py -Eric From p.j.a.cock at googlemail.com Thu Jan 14 23:04:36 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Jan 2010 23:04:36 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> Message-ID: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> On Thu, Jan 14, 2010 at 10:41 PM, Andrea Pierleoni wrote: > >> >> By default, copy the "swiss" parser. If that doesn't have the >> annotation, see if there is anything similar in the "genbank" >> parser (effectively our reference for rich annotation parsing). >> If in doubt, for now discard the data with a comment in the >> code - and then discuss it here. >> >> Peter >> > I'll take a look at both the swissprot and genbank parsers. > right now the annotation parsing shema is based on the xml schema. > eg. > > function text > > > is parsed in the annotations as: > > seqrecord.annotations['comment_function']=['function text'] > My reasoning is it should be (almost) transparent for users to switch from parsing the plain text SwissProt files ("swiss") to the XML form. There are also knock on implications for saving to BioSQL and file format conversions e.g. 
saving as a GenBank protein file (aka GenPept format). However, the comment parsing in the plain text "swiss" format is currently a little simplistic - partly to match what BioPerl did at the time. We can revisit that as part of this work. Peter From andrea at biocomp.unibo.it Fri Jan 15 10:35:39 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Fri, 15 Jan 2010 11:35:39 +0100 (CET) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> Message-ID: > > My reasoning is it should be (almost) transparent for > users to switch from parsing the plain text SwissProt > files ("swiss") to the XML form. This would be good > There are also knock > on implications for saving to BioSQL and file format > conversions e.g. saving as a GenBank protein file > (aka GenPept format). The returned Seqrecords are actually BioSQL-safe, since I can load them to a postgres biosql database. formatting the actual Seqrecord with 'genbank' dbxrefs, features, seq, keywords, source and names looks to be correctly reported, while there is no trace of the other annotations. I'll check it deeper. > > However, the comment parsing in the plain text "swiss" > format is currently a little simplistic - partly to match > what BioPerl did at the time. We can revisit that as > part of this work. > the main problem here are going to be the comment fields, that in the plain text predictors are parsed as a single string (this pushed me to wrote the new parser). I tried to keep comments parsing as simple as it can be, by just using lists of strings (good for BioSQL), but many comment types would be better parsed with a dictionary tree. 
As of now I left the option to get back the full XML for each comment, by calling: UniprotIO.UniprotIterator(handle,return_raw_comments=True) so every info in the XML file can be returned and the end user can decide how to parse those additional info. Anyhow I think it is better to discuss this when the unit test 'swiss'VS'uniprot' is ready. Andrea From p.j.a.cock at googlemail.com Fri Jan 15 11:08:32 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Jan 2010 11:08:32 +0000 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com> <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it> <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com> <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com> Message-ID: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com> On Fri, Jan 15, 2010 at 10:35 AM, Andrea Pierleoni wrote: >> >> However, the comment parsing in the plain text "swiss" >> format is currently a little simplistic - partly to match >> what BioPerl did at the time. We can revisit that as >> part of this work. >> > > the main problem here are going to be the comment fields, that in the > plain text predictors are parsed as a single string (this pushed me to > wrote the new parser). I tried to keep comments parsing as simple as it > can be, by just using lists of strings (good for BioSQL), but many comment > types would be better parsed with a dictionary tree. I think BioPerl now uses some kind of nest tree when parsing the SwissProt comment block, and I would like us to use something compatible (e.g. a dictionary tree) in the "swiss" parser (and thus also the XML parser) in such a way that we end up saving this in BioSQL the same way. 
> As of now I left the option to get back the full XML for each comment, by > calling: > > UniprotIO.UniprotIterator(handle,return_raw_comments=True) > > so every info in the XML file can be returned and the end user can decide > how to parse those additional info. > > Anyhow I think it is better to discuss this when the unit test > 'swiss'VS'uniprot' is ready. +1, good plan. Peter From bugzilla-daemon at portal.open-bio.org Fri Jan 15 12:38:49 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 15 Jan 2010 07:38:49 -0500 Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format In-Reply-To: Message-ID: <201001151238.o0FCcnB1017338@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2704 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-15 07:38 EST ------- According to the change log for the just released EMBOSS 6.2: Alignment output included headers only for EMBOSS-specific formats. The headers have been dropped from the FASTA MARKX0 through MARKX10 formats to allow standard FASTA suite parsers to use the EMBOSS versions of these outputs. See also: http://lists.open-bio.org/pipermail/emboss-dev/2009-August/000618.html Fingers crossed this means we will be able to parse their output with the "fasta-m10" parser in Bio.AlignIO. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Jan 18 13:01:15 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jan 2010 08:01:15 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions Message-ID: <20100118130115.GA48842@sobchak.mgh.harvard.edu> Hey all; After the Google groups discussion kicked off by Istvan last month, I've been thinking a bit about supplements to mailing list discussions. 
I'm agreed that mailman is not great for searching and archival purposes; we often see similar questions appear because finding and browsing the right thread from a past discussion is not intuitive. Google groups is okay, but doesn't offer a huge improvement over mailman. Additionally, reports indicate spamming is pretty bad, which creates additional moderation headaches. For handling "how do I do this biology task in Python" questions, what do people think about something entirely different like Stack Overflow? This presents a nice interface for asking questions, and the follow ups are voted up and down by utility so it's easy to see what the right answer is. Questions there are indexed well by search engines, so it's also more likely someone might be able to find a previous answer. There are actually a couple of questions on there with a Biopython tag: http://stackoverflow.com/questions/tagged/biopython From our point of view, we would need to adjust the documentation to point out Stack Overflow as a place to ask questions, and then monitor the biopython tag for new posts. Mailman is still a great option for implementation discussions, but Stack Overflow could open up question/answers to a larger audience and help supplement the cookbook and formal documentation. Brad From n.j.loman at bham.ac.uk Mon Jan 18 13:21:38 2010 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Mon, 18 Jan 2010 13:21:38 +0000 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100118130115.GA48842@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> Message-ID: <4B546062.3090802@bham.ac.uk> Brad Chapman wrote: > For handling "how do I do this biology task in Python" questions, what > do people think about something entirely different like Stack Overflow? > This presents a nice interface for asking questions, and the follow > ups are voted up and down by utility so it's easy to see what the > right answer is.
Questions there are indexed well by search engines, > so it's also more likely someone might be able to find a previous > answer. > Hi Brad Great suggestion, I have been thinking along the same lines. I really like the design of the Stack Exchange sites, it is a great way of exchanging Q&A information. It is worth mentioning that Stackoverflow is not the only site using the "Stack Exchange" format that is relevant. Here is a link to various other Stack Exchange sites: http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family Although there are Biopython questions in Stackoverflow, I wonder whether that is the correct place for questions, or whether it would be overall more productive to have a resource for bioinformatics? I think bioinformatics is the correct breadth of topic to keep a large enough community together whilst not being too off-topic. I have registered http://bioinformatics.stackexchange.com/ and will happily make you and anyone else who is interested an admin. Does the list think there could be enough community interest to justify a separate site like this? Cheers, Nick. From chapmanb at 50mail.com Mon Jan 18 14:20:10 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jan 2010 09:20:10 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <4B546062.3090802@bham.ac.uk> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> Message-ID: <20100118142010.GE48842@sobchak.mgh.harvard.edu> Hi Nick; > Great suggestion, I have been thinking along the same lines. I really > like the design of the Stack Exchange sites, it is a great way of > exchanging Q&A information. > > It is worth mentioning that Stackoverflow is not the only site using the > "Stack Exchange" format that is relevant. > > Here is a link to various other Stack Exchange sites: > http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family Awesome. Thanks for the pointer. 
Sounds like you have a great handle on this. > Although there are Biopython questions in Stackoverflow, I wonder > whether that is the correct place for questions, or whether it would be > overall more productive to have a resource for bioinformatics? I think > bioinformatics is the correct breadth of topic to keep a large enough > community together whilst not being too off-topic. > > I have registered http://bioinformatics.stackexchange.com/ and will > happily make you and anyone else who is interested an admin. > > Does the list think there could be enough community interest to justify > a separate site like this? It looks like there are a couple of Stack Exchange sites with similar aims for open source bioinformatics and chemistry: http://biostar.stackexchange.com/ http://blueobelisk.stackexchange.com/ If we go this way we might want to talk to the owners of these sites and integrate with them. My preference would be to go with the main StackOverflow site and carve out our niche with the tagging system. We build off of an existing community instead of needing to help grow one. Some of the more successful biology communities, like the one on Friendfeed, benefit from input outside of the standard community: http://friendfeed.com/the-life-scientists I think this would be less likely with a dedicated site, as that fortuitous crosstalk is prevented by other programmers never thinking to look at a bioinformatics only site. 
Happy to hear what others think, Brad From biopython at maubp.freeserve.co.uk Mon Jan 18 15:58:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 Jan 2010 15:58:27 +0000 Subject: [Biopython-dev] zxJDBC support for BioSQL In-Reply-To: References: <320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com> <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com> <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com> Message-ID: <320fb6e01001180758t179f5ccdo99132e4b10b907bb@mail.gmail.com> On Wed, Jan 13, 2010 at 5:37 PM, Kyle Ellrott wrote: > My main thought was to make it so that users can write a single script that > would work on any Python system (eventually IronPython as well). Because > the current system expects the user to request a specific driver (MySQLdb) > that happens to be system specific, it forces user code to be system > specific. Yes, it does - as long as Jython or any other Python implementation doesn't support that driver. In the case of SQLite, it sounds like adding sqlite3 support to Jython is planned at least. > One alternative would be to use the strings you describe below, but in > addition add special requests that would check the system and pull the > appropriate driver automatically. > 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, use > org.gjt.mm.mysql.Driver if in Jython. > Otherwise, if the user wants to use a specific driver, they pass its name. Maybe rather than specifying the driver, the user could specify the database back end (MySQL, PostgreSQL, SQLite, ...) and providing we know about this in advance, we can look up and try relevant drivers automatically. We could offer this in combination with the existing driver specifier. This seems cleaner than overloading the driver argument.
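[A minimal sketch of the back-end lookup idea in the paragraph above. The names CANDIDATE_DRIVERS and find_driver are hypothetical and are not part of BioSQL; the driver module names are the CPython ones mentioned in this thread, and the order in which they are tried is an assumption.]

```python
import importlib

# Hypothetical table: database back end -> driver modules to try, in order.
# Only the CPython driver names mentioned in this thread are listed.
CANDIDATE_DRIVERS = {
    "mysql": ["MySQLdb"],
    "postgresql": ["psycopg2", "psycopg", "pgdb"],
    "sqlite": ["sqlite3"],
}


def find_driver(backend, importer=importlib.import_module):
    """Return (name, module) for the first importable driver for backend.

    Raises ValueError for an unknown back end, and ImportError when none
    of the candidate drivers can be imported.
    """
    try:
        candidates = CANDIDATE_DRIVERS[backend.lower()]
    except KeyError:
        raise ValueError("Unknown database back end: %r" % backend)
    for name in candidates:
        try:
            return name, importer(name)
        except ImportError:
            continue  # this driver is not installed here, try the next one
    raise ImportError("No driver available for %s (tried %s)"
                      % (backend, ", ".join(candidates)))
```

[An explicit driver argument could still bypass the lookup entirely, so existing scripts that name MySQLdb would keep working. Jython-specific entries would need their own rows in the table; they are left out here because how they would be loaded is not settled in this thread.]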
Peter From biopython at maubp.freeserve.co.uk Mon Jan 18 16:33:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 Jan 2010 16:33:42 +0000 Subject: [Biopython-dev] EMBOSS eprimer3 parser Message-ID: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com> Hi all, Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in Biopython? I'd like someone to look over Leighton's proposed enhancements to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968 There are two main issues. First, the current code doesn't cope with multiple primer sets (so Leighton introduces read/parse functions in line with other modules for single or multiple sets of primers). This seems entirely sensible to me, and worthwhile in itself. Second, Leighton makes some changes to the primer record objects. I'm not so sure about the necessity here, even if it is backwards compatible, but I haven't really used this code. What do the rest of you think? Peter From istvan.albert at gmail.com Mon Jan 18 18:02:23 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Mon, 18 Jan 2010 13:02:23 -0500 Subject: [Biopython-dev] Biopython-dev Digest, Vol 84, Issue 14 In-Reply-To: References: Message-ID: On Mon, Jan 18, 2010 at 12:00 PM, wrote: > It looks like there are a couple of Stack Exchange sites with > similar aims for open source bioinformatics and chemistry: > > http://biostar.stackexchange.com/ > http://blueobelisk.stackexchange.com/ I am actually the original creator of http://biostar.stackexchange.com/ Created mainly to give my students a way to easily ask questions. Two things to keep in mind - it will cost money to run it, right now it is free due to it being in beta - it is not obvious that this service will actually be offered once beta concludes, or that it will be offered with the same conditions. That is pretty much what keeps me from investing more time into it. 
- making it a site like this only for biopython is too restrictive Other comments on using the stackoverflow main site: I think due to the site's focus being generic programming, most people looking for bioinformatics related information could easily get lost or not feel a connection. IMO the idea is fantastic, but it needs its own forum rather than being a small subset of unrelated topics. best, Istvan -- Istvan Albert http://www.personal.psu.edu/iua1 From biopython at maubp.freeserve.co.uk Tue Jan 19 10:49:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jan 2010 10:49:31 +0000 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function Message-ID: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> Hi Eric (and everyone else), I just spotted the to_adjacency_matrix function in utils: http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py The docstring says: > Create an adjacency matrix (NumPy array) from clades/branches in tree. > > Also returns a list of all clades in tree ("allclades"), where the position > of each clade in the list corresponds to a row and column of the numpy > array. So, a cell i,j in the array represents the length of the branch from > allclades[i] to allclades[j]. > > @return: tuple of (allclades, adjacency_matrix) where allclades is a list > and adjacency_matrix is a NumPy 2D array. It looks like your adjacency matrix starts as a numpy array of zeros, and then you set some edges to branch lengths. How do you tell apart a non-connection and a real connection of length zero? These do occur, for example if you have three identical sequences, then you might expect a single node with three children. However IIRC, in (some) NJ trees each node has two children by construction, so you get an extra node connected with a branch of length zero.
Peter From eric.talevich at gmail.com Tue Jan 19 15:22:30 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 19 Jan 2010 10:22:30 -0500 Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function In-Reply-To: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> Message-ID: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote: > Hi Eric (and everyone else), > > I just spotted the to_adjacency_matrix function in utils: > http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py > > The docstring says: > >> Create an adjacency matrix (NumPy array) from clades/branches in tree. >> >> Also returns a list of all clades in tree ("allclades"), where the position >> of each clade in the list corresponds to a row and column of the numpy >> array. So, a cell i,j in the array represents the length of the branch from >> allclades[i] to allclades[j]. >> >> @return: tuple of (allclades, adjacency_matrix) where allclades is a list >> and adjacency_matrix is a NumPy 2D array. > > It looks like your adjacency matrix starts as a numpy array of zeros, > and then you set some edges to branch lengths. How do you tell > apart a non-connection and a real connection of length zero? These > do occur, for example if you have three identical sequences, then > you might expect a single node with three children. However IIRC, > in (some) NJ trees each node has two children by construction, > so you get an extra node connected with a branch of length zero. Shoot, you're right. I can think of three reasonable mitigations: (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate adjacency -- this seems more standard in textbooks, actually. (b) Issue a warning or raise an error if the given tree contains a 0-length branch. (c) Delete the function. Which do you recommend?
The idea was to give mathematicians something to play with. For
example, Chapter 2 of this report represents phylogenies this way,
using 0 or 1 to indicate the presence of a branch:
http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf

Thanks for the heads-up,
Eric

From biopython at maubp.freeserve.co.uk Tue Jan 19 15:47:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 15:47:39 +0000
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
 <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
Message-ID: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>

On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich wrote:
>
> On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote:
>> Hi Eric (and everyone else),
>>
>> I just spotted the to_adjacency_matrix function in utils:
>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>
>> The docstring says:
>>
>>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>>
>>> Also returns a list of all clades in tree ("allclades"), where the position
>>> of each clade in the list corresponds to a row and column of the numpy
>>> array. So, a cell i,j in the array represents the length of the branch from
>>> allclades[i] to allclades[j].
>>>
>>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>>> and adjacency_matrix is a NumPy 2D array.
>>
>> It looks like your adjacency matrix starts as a numpy array of zeros,
>> and then you set some edges to branch lengths. How do you tell
>> apart a non-connection and a real connection of length zero? These
>> do occur, for example if you have three identical sequences, then
>> you might expect a single node with three children.
However IIRC, >> in (some) NJ trees each node has two children by construction, >> so you get an extra node connected with a branch of length zero. > > Shoot, you're right. I can think of three reasonable mitigations: > (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate > adjacency -- this seems more standard in textbooks, actually. > (b) Issue a warning or raise an error if the given tree contains a > 0-length branch. > (c) Delete the function. > > Which do you recommend? > > The idea was to give mathematicians something to play with. For > example, Chapter 2 of this report represents phylogenies this way, > using 0 or 1 to indicate the presence of a branch: > http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf > > Thanks for the heads-up, > Eric I did wonder about further options, (d) Since the distances are floats, we can use a NA as a flag for no connection. However, this does not seem very useful. (e) Collapse nodes separated by a zero length branch while building the adjacency matrix. Or, raise an error (b) but provide a tree method to collapse nodes separated by a zero length branch which could be called to "clean up" a problematic tree before making the adjacency matrix. None of these options seem ideal :( I would say the boolean matrix (a) is safe but is of limited utility. Therefore (c), remove the function for now is probably best. It can always be re-added in a later release if a good solution is agreed. Peter P.S. Another potentially interesting thing would be a matrix using the bootstrap support values (where again you have a problem with zero bootstrap support vs no connection). I'm not sure if this has any practical uses though. 
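
A minimal sketch of the ambiguity discussed in this thread, combining option (a) (a boolean matrix) with option (d) (a sentinel for "no connection"). It does not use Bio.Phylo: the toy tree is a hypothetical list of (parent, child, branch_length) tuples, the function name is made up for illustration, and plain nested lists stand in for the NumPy array.

```python
import math


def adjacency_matrices(edges, n):
    """Build a boolean adjacency matrix and a branch-length matrix.

    edges: iterable of (parent, child, length) with indices in range(n).
    Cells without a branch hold NaN in the length matrix, so a real
    branch of length 0.0 stays distinguishable from "no branch".
    """
    connected = [[False] * n for _ in range(n)]
    lengths = [[math.nan] * n for _ in range(n)]
    for parent, child, length in edges:
        connected[parent][child] = True
        lengths[parent][child] = float(length)
    return connected, lengths


# Toy tree: node 0 is the root; node 1 is an internal node joined to the
# root by a zero-length branch (as NJ can produce); nodes 2-4 are tips.
edges = [(0, 1, 0.0), (0, 2, 0.3), (1, 3, 0.1), (1, 4, 0.2)]
connected, lengths = adjacency_matrices(edges, 5)

# A zero-length branch is still reported as a connection...
print(connected[0][1], lengths[0][1])
# ...while two unconnected tips are not.
print(connected[2][3], math.isnan(lengths[2][3]))
```

With a matrix initialised to zeros instead of NaN, the cells [0][1] and [2][3] would both read 0.0, which is the problem raised above.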
From eric.talevich at gmail.com Wed Jan 20 04:08:16 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 19 Jan 2010 23:08:16 -0500
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
 <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
 <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
Message-ID: <3f6baf361001192008y244912aaieb7c8d2c0399903e@mail.gmail.com>

On Tue, Jan 19, 2010 at 10:47 AM, Peter wrote:
> On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich wrote:
>>
>> On Tue, Jan 19, 2010 at 5:49 AM, Peter wrote:
>>> Hi Eric (and everyone else),
>>>
>>> I just spotted the to_adjacency_matrix function in utils:
>>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>>
>>> It looks like your adjacency matrix starts as a numpy array of zeros,
>>> and then you set some edges to branch lengths. How do you tell
>>> apart a non-connection and a real connection of length zero?
>>
>> Shoot, you're right. I can think of three reasonable mitigations:
>> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
>> adjacency -- this seems more standard in textbooks, actually.
>> (b) Issue a warning or raise an error if the given tree contains a
>> 0-length branch.
>> (c) Delete the function.
>>
>> Which do you recommend?
>> ....
>
> I did wonder about further options,
>
> (d) Since the distances are floats, we can use a NA as
> a flag for no connection. However, this does not seem
> very useful.

Or infinity -- I think that's reasonably common in graph algorithms that
use a matrix representation.

Anyway, I commented it out for now. The main problem is that I don't have
a clear use case for the function at the moment, just a notion that it
could be useful for some novel statistical analysis or possibly rooting an
unrooted tree based on a molecular clock.
I'll look at other libraries to see how they use adjacency matrices, if
at all.

> (e) Collapse nodes separated by a zero length branch
> while building the adjacency matrix.
>
> Or, raise an error (b) but provide a tree method to collapse
> nodes separated by a zero length branch which could be
> called to "clean up" a problematic tree before making the
> adjacency matrix.

Should be easy enough for the user to do manually:

    for clade in tree.find_clades(branch_length=0):
        tree.collapse(clade)

I'm going to do some serious work on the wiki documentation soon so this
sort of operation should be fairly apparent to users.

> P.S. Another potentially interesting thing would be a matrix using
> the bootstrap support values (where again you have a problem
> with zero bootstrap support vs no connection). I'm not sure if this
> has any practical uses though.

Well, the commented-out code is still visible if any brave scientist is
interested in modifying it for this purpose.

I'm reading Joe Felsenstein's book right now, so I'll probably get the
urge to add more mathy toys to Bio.Phylo soon. I'll check with the list
before committing them to the trunk, though. ;)

From p.j.a.cock at googlemail.com Wed Jan 20 16:16:58 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 16:16:58 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
 <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
 <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
 <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
Message-ID: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>

On Fri, Jan 15, 2010 at 11:08 AM, Peter Cock wrote:
>> Anyhow I think it is better to discuss this when the unit test
>> 'swiss'VS'uniprot' is ready.
>
> +1, good plan.

Something I should have mentioned earlier (I forgot this wasn't
checked in yet) was feature support in the existing "swiss" plain
text parser - hopefully we can get that working nicely as part of
this XML work:

http://bugzilla.open-bio.org/show_bug.cgi?id=2235

Peter

From andrea at biocomp.unibo.it Wed Jan 20 16:57:47 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Wed, 20 Jan 2010 17:57:47 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
 <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
 <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
 <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
 <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
Message-ID: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>

>
> Something I should have mentioned earlier (I forgot this wasn't
> checked in yet) was feature support in the existing "swiss" plain
> text parser - hopefully we can get that working nicely as part of
> this XML work:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> Peter
>

I know that the plain text swissprot parser can parse features, but
last time I checked these features were not included in SeqRecords
generated by Bio.SeqIO.
If the two parsers have to report similar results, then the 'swiss'
format in Bio.SeqIO must report features too.

I made a few changes to the original parser to map data as close as
possible to the plain text parser (available on github).

However the big issue is going to be the comment field:
- 1 big string in the plain text parser
- several annotation fields in the XML parser.

I think that obtaining the same results is going to be difficult.
It is hard to map the big string to many annotations (very error prone)
and it is also hard to map many annotations to a single string...

Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
comparison between the two parsed seqrecords.

Andrea

From p.j.a.cock at googlemail.com Wed Jan 20 17:14:18 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 17:14:18 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
 <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
 <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
 <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
 <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
 <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>

On Wed, Jan 20, 2010 at 4:57 PM, Andrea Pierleoni wrote:
>>
>> Something I should have mentioned earlier (I forgot this wasn't
>> checked in yet) was feature support in the existing "swiss" plain
>> text parser - hopefully we can get that working nicely as part of
>> this XML work:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>>
>> Peter
>>
>
> I know that the plain text swissprot parser can parse features, but
> last time I checked these features were not included in SeqRecords
> generated by Bio.SeqIO.
> If the two parsers have to report similar results, then the 'swiss'
> format in Bio.SeqIO must report features too.

Yes, there is an old patch on Bug 2235 to do this:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

> I made a few changes to the original parser to map data as close as
> possible to the plain text parser (available on github).
>
> However the big issue is going to be the comment field:
> - 1 big string in the plain text parser
> - several annotation fields in the XML parser.
>
> I think that obtaining the same results is going to be difficult.
> It is hard to map the big string to many annotations (very error prone)
> and it is also hard to map many annotations to a single string...
>
> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
> comparison between the two parsed seqrecords.

Great.

Peter

From andrea at biocomp.unibo.it Thu Jan 21 12:01:30 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 21 Jan 2010 13:01:30 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
 <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
 <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
 <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
 <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
 <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
Message-ID: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>

>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>> detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>

As mentioned earlier, Mauro did a code review and added unit tests for the
parser in Tests/test_Uniprot.py.
The updated version is available on the github repository:
http://github.com/apierleoni/biopython

Since this version is mature enough, I spent some time comparing the output
from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
This comparison was done using the Q13639 UniProt entry.
These are the main differences between the two generated SeqRecords:

- id: is the same (first accession)
- name: is the same
- description: UP reports the recommended name (full name value), while
  additional names and synonyms are in the annotations. SP reports a
  long string containing everything parsed as it is from the plain
  text.
- dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
  NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
      date_last_annotation_update in UP ---> modified in SP
      date_last_sequence_update in UP ---> sequence_modified in SP
      gene_name_primary in UP ---> gene_name in SP
        >>> SP.annotations['gene_name']
        'Name=HTR4;'
        >>> UP.annotations['gene_name_primary']
        'HTR4'
      ncbi_taxid in SP ---> UP dbxrefs, since it is mapped as a
        dbReference in the XML file
- - references: there are some minor differences.
      The final semicolon and double quote are missing in UP for both
      the author and title fields.
      In UP, reference comments are reported as:
      "PublicationType | PublicationDate | Scope | Tissue"
      For the submission publication type, the db is reported in comments
      and not in the journal field.
- - comments: here come the big differences.
      SP has the comments in a single string.
      UP comments are mapped to several annotation entries, using the
      comment type and attributes to build the annotation key.
      E.g.
      comment_function --> list of "function" type comment strings
      comment_subcellularlocation_location --> list of "location"
        strings in the subcellularlocation comment field

The comments tree in XML could easily be mapped to a comment dictionary
tree, but this would not be BioSQL safe.
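
The two mappings described in this comparison can be sketched as below. This is only an illustration under stated assumptions, not the code in the github branch: both function names are hypothetical, the nested comment dictionary is invented placeholder data, and only the 'Name=HTR4;' / 'HTR4' values and the comment_* key naming come from the message above.

```python
def primary_gene_name(gn_value):
    """Extract the primary name from a SwissProt-style 'Name=...;' string."""
    for field in gn_value.split(";"):
        field = field.strip()
        if field.startswith("Name="):
            return field[len("Name="):]
    return None


def flatten_comments(tree, prefix="comment"):
    """Flatten a nested comment dict into flat keys holding lists of strings.

    Nested keys are joined with underscores, e.g.
    {"subcellularlocation": {"location": [...]}} becomes
    {"comment_subcellularlocation_location": [...]}.
    """
    flat = {}
    for key, value in tree.items():
        flat_key = "%s_%s" % (prefix, key)
        if isinstance(value, dict):
            flat.update(flatten_comments(value, flat_key))
        else:
            flat.setdefault(flat_key, []).extend(value)
    return flat


# Value taken from the SP annotation shown in the comparison.
gene = primary_gene_name("Name=HTR4;")

# Placeholder comment tree; the strings are made up for illustration.
nested = {
    "function": ["Example function text."],
    "subcellularlocation": {"location": ["Example location A", "Example location B"]},
}
flat = flatten_comments(nested)
print(gene)
print(sorted(flat))
```

Flat keys with list-of-string values avoid nested dictionaries in the annotations, which is the "BioSQL safe" concern mentioned above.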
Andrea

From biopython at maubp.freeserve.co.uk Thu Jan 21 12:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.
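
As a rough illustration of the idea of holding the DE-line content in a Python structure and serialising it as XML for storage, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is an assumption-laden example: the element names (including the "description" root) and the nesting are hypothetical and do not claim to reproduce what BioPerl's TagTree serialisation or the UniProt XML schema actually emit; only the RecName/AltName and Full/Short labels follow the DE line layout, and the protein names are placeholders.

```python
import xml.etree.ElementTree as ET


def build(parent, data):
    """Append nested dict/list/string data to an ElementTree element."""
    for key, value in data.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            child = ET.SubElement(parent, key)
            if isinstance(item, dict):
                build(child, item)
            else:
                child.text = str(item)


# Hypothetical Python structure for a DE block (placeholder names).
description = {
    "RecName": {"Full": "Example protein", "Short": ["EP-1", "EP1"]},
    "AltName": {"Full": "Example alternative name"},
}

root = ET.Element("description")
build(root, description)

# Serialise to a string (what would be stored), then parse it back.
xml_text = ET.tostring(root, encoding="unicode")
restored = ET.fromstring(xml_text)
print(xml_text)
```

The round trip through a string is the point of the sketch: whatever structure the Bio* projects agree on, each one can rebuild its own native data structure from the stored XML.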
Regards, Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 21 13:13:09 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 08:13:09 -0500 Subject: [Biopython-dev] [Bug 2997] New: Ignore comments in SCOP parsable files Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2997 Summary: Ignore comments in SCOP parsable files Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: 2008 at thomas-holder.de I could not load SCOP parsable files with Bio.SCOP unless I removed the comment lines. The parser should just skip these lines. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 21 13:14:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 08:14:59 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001211314.o0LDExim005529@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #1 from 2008 at thomas-holder.de 2010-01-21 08:14 EST ------- Created an attachment (id=1432) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1432&action=view) patch to skip comment lines in SCOP parsable files -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From mauro at biodec.com Thu Jan 21 20:09:28 2010
From: mauro at biodec.com (Mauro)
Date: Thu, 21 Jan 2010 21:09:28 +0100
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
References: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
 <4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
 <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
 <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
 <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
 <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
 <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
 <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
Message-ID: <4B58B478.4000703@biodec.com>

On 01/21/2010 01:01 PM, Andrea Pierleoni wrote:
>
>>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>>> detailed
>>> comparison between the two parsed seqrecords.
>>
>> Great.
>>
>> Peter
>>
>
>
> As mentioned earlier, Mauro did a code review and added unit tests for the
> parser in Tests/test_Uniprot.py.
> The updated version is available on the github repository:
> http://github.com/apierleoni/biopython
>
> Since this version is mature enough, I spent some time comparing the output
> from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
> This comparison was done using the Q13639 UniProt entry.

I also made a test for this case. Currently the test fails; you can see the
report made by Andrea below.
If we agree on the differences between the SeqRecords, I will do the work to
change the test.

Mauro.

>
> These are the main differences between the two generated SeqRecords:
>
> - id: is the same (first accession)
> - name: is the same
> - description: UP reports the recommended name (full name value), while
> additional names and synonyms are in the annotations.
SP reports a
> long string containing everything parsed as it is from the plain
> text.
> - dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
> NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
> - seq: is the same
> - features: missing in SP (I have to check with Peter's patch)
> - annotations:
> - - identical annotations: accessions, keywords, taxonomy, organism
> - - mapped annotations:
>       date_last_annotation_update in UP ---> modified in SP
>       date_last_sequence_update in UP ---> sequence_modified in SP
>       gene_name_primary in UP ---> gene_name in SP
>         >>> SP.annotations['gene_name']
>         'Name=HTR4;'
>         >>> UP.annotations['gene_name_primary']
>         'HTR4'
>       ncbi_taxid in SP ---> UP dbxrefs, since it is mapped as a
>         dbReference in the XML file
> - - references: there are some minor differences.
>       The final semicolon and double quote are missing in UP for both
>       the author and title fields.
>       In UP, reference comments are reported as:
>       "PublicationType | PublicationDate | Scope | Tissue"
>       For the submission publication type, the db is reported in comments
>       and not in the journal field.
> - - comments: here come the big differences.
>       SP has the comments in a single string.
>       UP comments are mapped to several annotation entries, using the
>       comment type and attributes to build the annotation key.
>       E.g.
>       comment_function --> list of "function" type comment strings
>       comment_subcellularlocation_location --> list of "location"
>         strings in the subcellularlocation comment field
>
> The comments tree in XML could easily be mapped to a comment dictionary
> tree, but this would not be BioSQL safe.
> > > Andrea > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Thu Jan 21 23:58:29 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 21 Jan 2010 18:58:29 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001212358.o0LNwTIB022421@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2010-01-21 18:58 EST ------- Can you give an example of a SCOP file that contains such comment lines? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 22 08:42:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 03:42:28 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001220842.o0M8gSDv003709@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 ------- Comment #3 from 2008 at thomas-holder.de 2010-01-22 03:42 EST ------- (In reply to comment #2) > Can you give an example of a SCOP file that contains such comment lines? I want to parse these files: http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.hie.scop.txt_1.75 http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.com.scop.txt_1.75 They all start with 4 comment lines (release and copyright information). 
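
A minimal sketch of the behaviour requested in this bug (skip comment lines before parsing records). It is not the attached patch and does not use Bio.SCOP: the helper name is hypothetical, the sample text is a made-up stand-in for a parsable file, and it assumes comment lines are the ones starting with "#" and that fields are tab-separated.

```python
import io


def iter_data_lines(handle):
    """Yield non-comment lines from a file-like object, without newlines."""
    for line in handle:
        if line.startswith("#"):
            continue  # header/comment line: skip instead of parsing it
        yield line.rstrip("\n")


# Made-up stand-in for the top of a parsable file: four comment lines
# followed by one tab-separated record.
example = io.StringIO(
    "# hypothetical header line 1\n"
    "# hypothetical header line 2\n"
    "# hypothetical header line 3\n"
    "# hypothetical header line 4\n"
    "46456\tcl\ta\t-\tAll alpha proteins\n"
)

records = [line.split("\t") for line in iter_data_lines(example)]
print(records)
```

Filtering in a small generator keeps the record parser itself unchanged, so files with and without header comments go through the same code path.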
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 22 11:08:34 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 06:08:34 -0500 Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files In-Reply-To: Message-ID: <201001221108.o0MB8YkZ008581@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2997 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2010-01-22 06:08 EST ------- Applied your patch; thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Fri Jan 22 12:18:32 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET) Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com> References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com> Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it> I think that the point here can be a little broader, since not only the swissprot DE lines carry complex and structured data. To define a common, language-independent way to store structured data into the comment and *_qualifier_value tables of the actual BioSQL schema could be very useful. XML looks like a good candidate to me, and the UniprotXML format can be used as reference or as a template to start from. 
Each Bio* project will then parse and report this structured data in its
own programming language data structure.

Andrea

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
> > Regards, > > Peter > From bugzilla-daemon at portal.open-bio.org Fri Jan 22 18:43:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 13:43:19 -0500 Subject: [Biopython-dev] [Bug 2998] New: mac error during build in 10.6.1 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2998 Summary: mac error during build in 10.6.1 Product: Biopython Version: 1.53 Platform: PC OS/Version: Mac OS Status: NEW Severity: major Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: emeryl at uw.edu When I download the file biopython-1.53.tar.gz, uncompress it, and run python setup.py build I get an error saying gcc4.0 failed with exit code 1, among many lines of errors. Looking more closely, it appears the build process is trying to use an older version of the SDK, which is not installed by Xcode tools by default. It is trying to use /Developer/SDKs/MacOSX10.4u.sdk. On a clean install of 10.6.1 (Snow Leopard) only the SDKs for 10.5 and 10.6 are installed by the Xcode tools installer without changing options. When I reinstall the Xcode tools and this time check a box to install 10.4 support, this 10.4 sdk is installed and the build works flawlessly. This would be a difficult fix to track down for many casual users of BioPython who do not understand the Xcode tools. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jan 22 19:15:59 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 14:15:59 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001221915.o0MJFxoa024953@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|major |normal Summary|mac error during build in |Document need XCode with |10.6.1 |10.4 SDK for Mac OS ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-22 14:15 EST ------- Snow Leopard has caused all sorts of trouble for compiling Python extensions (this is not specific to Biopython). This has been discussed on our mailing list, and simply installing the Mac OS 10.4 SDK option with XCode seems to be the best solution. I've just updated the download page to try and clarify this. Is that better? This is a wiki page so you can edit it: http://biopython.org/wiki/Download I'm leaving this bug open to remind us to add a similar note to the main installation document: http://github.com/biopython/biopython/blob/master/Doc/install/Installation.tex Do you have any other suggestions? Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jan 22 20:36:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 22 Jan 2010 15:36:36 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001222036.o0MKaaZ4027368@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 ------- Comment #2 from emeryl at uw.edu 2010-01-22 15:36 EST ------- (In reply to comment #1) That's a good solution, but I added this small clarification also : You will need to have installed Apple's XCode tools including the optional 10.4 SDK (check the option for 10.4 support when installing Xcode tools). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 25 10:56:32 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 05:56:32 -0500 Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for Mac OS In-Reply-To: Message-ID: <201001251056.o0PAuWDI010933@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2998 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-25 05:56 EST ------- (In reply to comment #2) > (In reply to comment #1) > > That's a good solution, but I added this small clarification also : > > You will need to have installed Apple's XCode tools including the optional 10.4 > SDK (check the option for 10.4 support when installing Xcode tools). 
> Thanks - I've now updated the main installation document in our repository (which we'll use to update the install PDF and HTML at the next release). Marking bug as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:16:27 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:16:27 -0500 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <201001260116.o0Q1GR1c002063@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:17:41 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:17:41 -0500 Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular In-Reply-To: Message-ID: <201001260117.o0Q1Hfdb002091@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:19:47 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:19:47 -0500 Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects In-Reply-To: Message-ID: <201001260119.o0Q1JlhK002189@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2597 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:27:14 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:27:14 -0500 Subject: [Biopython-dev] [Bug 2999] New: SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2999 Summary: SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz I do not know where is the problem coming from but if I parse a GenBank file with lowercased sequence (EST) and get it printed back through record.format("genbank") I receive all in uppercase. I think the upper/lower-casing should never be altered unless explicitly requested by the user. 
for _record in SeqIO.parse(_infile, options.format):
    # silly, imagine I hit "gi|14150838|gb|AAK54648.1|AF376133_1" from
    # a FASTA file :(
    if _record.id in _ids:
        _outfile.write(_record.format("fasta"))
    elif options.format == "genbank":
        if _record.annotations['gi'] in _ids:
            _outfile.write(_record.format("genbank"))

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:44:28 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:44:28 -0500 Subject: [Biopython-dev] [Bug 3000] New: Could SeqIO.parse() store the whole, unparsed multiline entry? Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3000 Summary: Could SeqIO.parse() store the whole, unparsed multiline entry? Product: Biopython Version: 1.53 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz Taking into account the genbank file-format writing is not yet complete I wonder whether you would allow to keep optionally along each parsed record it's unparsed multi-line representation. For example, I use biopython to filter-out certain records from a fasta/genbank file by accession, gi, tissue (well the last haven't done yet;)). I do not change the format, I just ignore certain entries. I did not understand the Tutorial ("5.4.3 Getting your SeqRecord objects as formatted strings") well but I iterate over the records and once having the record I want to be on the safe side and to record._print_original_blob() and get e.g. LOCUS .... ... // I do not have the record_iterator so cannot use the proposed out_handle.write(record.format("genbank")) approach.
Still, I suspect this will reformat the entry (currently I see trailing dot removed from KEYWORDS, no REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being re-ordered). I foresee this to depend on an optional argument to SeqIO.parse() specifying that a user wants to keep this in memory and merely that he/she understands this is probably not much useful for large chromosomes, etc. Similarly, I think until parsing/writing e.g. TITLE is fully available why couldn't you just store the whole multi-line thing in some variable? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 01:47:27 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 25 Jan 2010 20:47:27 -0500 Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal In-Reply-To: Message-ID: <201001260147.o0Q1lRVk002782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2601 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jan 26 13:03:42 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 Jan 2010 08:03:42 -0500 Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or In-Reply-To: Message-ID: <201001261303.o0QD3gN8019546@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2999 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-26 08:03 EST ------- In many file formats (e.g. FASTA) mixed case is allowed and useful. The sequence in a GenBank file is (by convention) always lower case, but for historical reasons Biopython converts this to upper case on parsing (not sure why, but changing it would risk breaking existing scripts). However, I think we should convert to lower case on writing GenBank output. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 26 13:15:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 26 Jan 2010 08:15:38 -0500 Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole, unparsed multiline entry? In-Reply-To: Message-ID: <201001261315.o0QDFc4f020030@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3000 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-26 08:15 EST ------- (In reply to comment #0) > Taking into account the genbank file-format writing is not yet complete I > wonder whether you would allow to keep optionally along each parsed record > it's unparsed multi-line representation. You can probably do it already with the old Bio.GenBank iterator object (I think you use no parser object to get the raw text). 
Adding this to Bio.SeqIO doesn't seem a wonderful idea. The whole approach only makes sense for sequential file formats with no header (like FASTA, GenBank, EMBL, SwissProt) but not interlaced files (most alignments) or those with headers or XML formats. It also breaks completely the moment the user makes any modification to the SeqRecord object - and handling that cleanly would be tricky. > Still, I suspect this will > reformat the entry (currently I see trailing dot removed from KEYWORDS, no > REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being > re-ordered). Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly) different output. We do not guarantee a 100% round trip (even on simpler formats like FASTA). Even little things like line wrapping would make this very difficult. Regarding GenBank KEYWORDS, please file a bug. Regarding GenBank reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED) this is still covered by existing Bug 2294. Regarding GenBank source feature, please file a bug. > Similarly, I think until parsing/writing e.g. TITLE is fully available why > couldn't you just store the whole multi-line thing in some variable? The remaining unsupported bits of the ID line are covered by existing Bug 2294 and Bug 2578. Regarding the reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED) this is still covered by existing Bug 2294. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
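For the filtering use case in Bug 3000, one way to keep each entry byte-for-byte is to split the file into raw text blocks before any parsing happens. The following is only a rough sketch under stated assumptions: the input is a plain GenBank flat file in which every record starts with a LOCUS line and ends with a "//" line, and the helper names iter_raw_genbank_records and locus_name are hypothetical, not part of Biopython. As noted above, this only makes sense for sequential formats with no file-level header.

```python
def iter_raw_genbank_records(handle):
    """Yield each GenBank record as the exact text read from the handle.

    Hypothetical helper, not a Biopython API. Assumes every record
    starts with a LOCUS line and ends with a line containing only "//".
    """
    lines = []
    for line in handle:
        if not lines and not line.startswith("LOCUS"):
            continue  # skip blank lines or other text between records
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


def locus_name(raw_record):
    """Return the second whitespace-separated token of the LOCUS line."""
    return raw_record.split(None, 2)[1]
```

A filter could then write the untouched text of the wanted entries, for example `out_handle.write(raw)` for each `raw` whose `locus_name(raw)` is in the set of wanted names. Filtering on other fields (gi, tissue) would need further parsing of the raw text, which this sketch does not attempt.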
From jblanca at btc.upv.es Tue Jan 26 14:02:59 2010 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 26 Jan 2010 15:02:59 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> Message-ID: <201001261502.59237.jblanca@btc.upv.es> Hi: I'm doing a pipeline to annotate sequences. I'm writing modules that add SeqFeatures and annotations to the sequences. Right now I'm storing the result as repr for the SeqRecords, but I would like to write gff files at the end. I've read the discussion regarding Brad's code and I've found it very interesting. I need to write those gff files so I could use Brad's code or my own, but it would be great if I could contribute to Biopython at the same time. At the time being I don't think there is a consensus about what a SeqFeature should represent and how. I think Peter made a proposal about adding a parent and children properties, is this a good way to solve the problem? Best regards, -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jan 26 14:59:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 26 Jan 2010 14:59:35 +0000 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <201001261502.59237.jblanca@btc.upv.es> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> <201001261502.59237.jblanca@btc.upv.es> Message-ID: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> Hi Jose, On Tue, Jan 26, 2010 at 2:02 PM, Jose Blanca wrote: > Hi: > > I'm doing a pipeline to annotate sequences. I'm writing modules that add > SeqFeatures and annotations to the sequences. I've done a little of that too - but with GenBank files as the output. > Right now I'm storing the result as repr for the SeqRecords, but I would like > to write gff files at the end. I've read the discussion regarding Brad's code > and I've found it very interesting. > I need to write those gff files so I could use Brad's code or my own, but it > would be great if I could contribute to Biopython at the same time. > At the time being I don't think there is a consensus about what a SeqFeature should > represent and how. I think Peter made a proposal about adding a parent and > children properties, is this a good way to solve the problem? > Best regards, Brad's code is using the SeqFeature differently to existing bits of Biopython, and adding a separate child/parent mechanism for the kind of usage required for GFF(3) looks like one way forward allowing us to keep full backward compatibility.
I'm actually going to see Brad in person next month at a workshop, and I'm hoping we can squeeze in a little in person debate on this then (assuming we don't settle it here on the mailing list first of course). Regards, Peter From dalloliogm at gmail.com Tue Jan 26 15:09:39 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 26 Jan 2010 16:09:39 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100118142010.GE48842@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> Message-ID: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman wrote: > Hi Nick; > Sorry for the late reply... I also use StackOverflow and I think that it is a great resource, and it would be very good if we could become more represented there. At the moment there are a few questions on biopython on SO, but there are so few biopython users that people usually receive few answers and they prefer to ask their questions again in this list. I have answered some questions tagged as 'bioinformatics' there, but lately I have not been using SO very much, and moreover the field of bioinformatics is so broad that sometimes it is very difficult to answer a technical question. > > Here is a link to various other Stack Exchange sites: > > > http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family > > Very interesting, thanks! I didn't know you could make Stack-Exchange websites so easily. How did you do that? Is there free software behind it, or do you have to pay some service provider? > It looks like there are a couple of Stack Exchange sites with > similar aims for open source bioinformatics and chemistry: > > http://biostar.stackexchange.com/ > http://blueobelisk.stackexchange.com/ > I agree, maybe it would be useful to collaborate with these websites.
StackOverflow is great for programming-related questions; however, you can't use it to ask something which is not completely related, like the protocol for an experiment or which databases to use for an analysis. -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From dalloliogm at gmail.com Wed Jan 27 08:56:09 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 27 Jan 2010 09:56:09 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> Message-ID: <5aa3b3571001270056l5ae5bd76g1a70890c94fd430b@mail.gmail.com> On Tue, Jan 26, 2010 at 4:09 PM, Giovanni Marco Dall'Olio < dalloliogm at gmail.com> wrote: > > > > On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman wrote: > >> Hi Nick; >> > > Sorry for the late reply... I also use StackOverflow and I think that it is > a great resource, and it would be very good if we could become more represented > there. > By the way, it is possible to get feeds for questions on StackOverflow. For example, this is the feed for the questions tagged 'biopython': - http://stackoverflow.com/feeds/tag/biopython We could add this rss to the biopython's friendfeed or twitter page (I barely know what I am talking about here), or to the blog/wiki/etc. Maybe there is also a way to notify this mailing list of the questions asked there.
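As a rough illustration of what a notification script would have to do with such a feed, here is a minimal sketch that pulls question titles and links out of feed text. It assumes the tag feed is served as Atom, which is worth checking against the real feed; the function name question_titles_and_links is made up for this example, and fetching the URL and sending mail are left out entirely.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (assumption: the tag feed is Atom).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def question_titles_and_links(atom_xml):
    """Return a list of (title, link) pairs, one per feed entry.

    Hypothetical helper for illustration; atom_xml is the feed text.
    """
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href", "") if link is not None else ""
        results.append((title.strip(), href))
    return results
```

Each pair could then be turned into a short message for the list, or handed to one of the existing feed-forwarding services instead of writing any code at all.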
-- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From chapmanb at 50mail.com Wed Jan 27 13:33:22 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 27 Jan 2010 08:33:22 -0500 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> Message-ID: <20100127133322.GV83316@sobchak.mgh.harvard.edu> Giovanni; Thanks for the feedback on this. We've had a few positive responses and I think it's something that would be low effort to experiment with. I'm open to whether we do this on the main StackOverflow site, Nick's suggested dedicated site, or Blue Obelisk. The main criterion is that we are likely to have the website be freely available (and around) in the future. > Sorry for the late reply... I also use StackOverflow and I think that it is > a great resource, and it would be very good if we could become more represented > there. > At the moment there are a few questions on biopython on SO, but there are so > few biopython users that people usually receive few answers and they prefer > to ask their questions again in this list. Yes, that's what we'd be hoping to change. The main thing is that we get folks interested in python bioinformatics programming looking there, and then suggest users ask questions there. The significant benefit is that the presentation of questions and answers gives you a historical resource that is easy to search and browse. > By the way, it is possible to get feeds for questions on StackOverflow.
> For example, this is the feed for the questions tagged 'biopython': > - http://stackoverflow.com/feeds/tag/biopython > We could add this rss to the biopython's friendfeed or twitter page (I > barely know what I am talking about here), or to the blog/wiki/etc. > Maybe there is also a way to notify this mailing list of the questions asked > there. There are resources we could use to redirect the feed to Twitter: http://twitterfeed.com/ and the mailing list: http://www.feedmyinbox.com/ Agreed that we should do this to increase visibility. Brad From chapmanb at 50mail.com Wed Jan 27 13:41:25 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 27 Jan 2010 08:41:25 -0500 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> References: <20091202125744.GA46415@sobchak.mgh.harvard.edu> <320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com> <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com> <201001261502.59237.jblanca@btc.upv.es> <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com> Message-ID: <20100127134125.GW83316@sobchak.mgh.harvard.edu> Jose and Peter; > > Right now I'm storing the result as repr for the SeqRecords, but I would like > > to write gff files at the end. I've read the discussion regarding Brad's code > > and I've found it very interesting. > > I need to write those gff files so I could use Brad's code or my own, but it > > would be great if I could contribute to Biopython at the same time. Awesome. Please do use my code for output and feel free to fork and make suggestions; I'm happy to integrate changes: http://github.com/chapmanb/bcbb/tree/master/gff > > At the time being I don't think there is a consensus about what a SeqFeature should > > represent and how. I think Peter made a proposal about adding a parent and > > children properties, is this a good way to solve the problem?
> > Best regards, > > Brad's code is using the SeqFeature differently to existing bits of > Biopython, and adding a separate child/parent mechanism for the > kind of usage required for GFF(3) looks like one way forward allowing > us to keep full backward compatibility. I'm actually going to see Brad > in person next month at a workshop, and I'm hoping we can squeeze > in a little in person debate on this then (assuming we don't settle it > here on the mailing list first of course). What do you think we need to modify in the GFF parsing code to bring this in line? I'd really like to see this get into Biopython, but am not sure how to clear the blocking issues. If we can put together a list of specifics, I can try and put together time to tackle that. Brad From dalloliogm at gmail.com Wed Jan 27 13:41:24 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 27 Jan 2010 14:41:24 +0100 Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions In-Reply-To: <20100127133322.GV83316@sobchak.mgh.harvard.edu> References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> <4B546062.3090802@bham.ac.uk> <20100118142010.GE48842@sobchak.mgh.harvard.edu> <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> <20100127133322.GV83316@sobchak.mgh.harvard.edu> Message-ID: <5aa3b3571001270541n2f047fe2qf42911b21e9494d8@mail.gmail.com> On Wed, Jan 27, 2010 at 2:33 PM, Brad Chapman wrote: > Giovanni; > Thanks for the feedback on this. We've had a few positive responses > and I think it's something that would be low effort to experiment with. > I'm open to whether we do this on the main StackOverflow site, > Nick's suggested dedicated site, or Blue Obelisk. The main criterion > is that we are likely to have the website be freely available (and > around) in the future. > Thanks to you for the proposal..
> There are resources we could use to redirect the feed to Twitter: > > http://twitterfeed.com/ > > and the mailing list: > > http://www.feedmyinbox.com/ > So, what if we use this to automatically send a notification to the biopython mailing list? The increase in traffic would be low: in the last three months there have only been 3 messages on biopython in StackOverflow. With an automatic notification, these questions may receive an answer a lot more quickly. If the traffic from StackOverflow grows too much, we can just disable the forwarding so it won't disturb the mailing list. > Agreed that we should do this to increase visibility. > > Brad > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From chapmanb at 50mail.com Thu Jan 28 20:35:05 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 28 Jan 2010 15:35:05 -0500 Subject: [Biopython-dev] OpenBio solution challenge: Project updates at BOSC 2010 Message-ID: <20100128203505.GG40046@sobchak.mgh.harvard.edu> Hello all; The BOSC 2010 organizing committee is hard at work getting prepared for this July's meeting in Boston: http://www.open-bio.org/wiki/BOSC_2010 One of the items we've traditionally had at the conference is a project update from each of the OpenBio affiliated groups. This year, we're thinking about organizing these talks around a central theme: the OpenBio solution challenge. We start with a biological question of general interest, and each of the project talks would focus around how you would solve that problem using your toolkit and programming language. This is meant to provide a challenge for OpenBio contributors, a nice tutorial style overview of various projects and approaches for other programmers, and a fun opportunity to compete and learn from other projects.
Conference attendees will vote on their favorite solution, with the winner receiving fame and fortune (warning: fortune not guaranteed). For this to be successful, it of course requires interest and enthusiasm from y'all fine folks involved with the projects. Specifically: - Is there interest from your group in participating in the challenge? You'll want at least a few people to work on it, and someone to give a presentation at BOSC. - Do you have suggestions on a good theme or specific biological problem to tackle? We'll hope to pick something in a sweet spot that is challenging enough to be of interest, yet reasonable for presentation and preparation. Let's discuss ideas and get this together. Since the schedule for BOSC is developing rapidly, please give us an idea if you're interested by February 12th, and copy responses to the BOSC mailing list as a central place for discussion. bosc at open-bio.org Thanks, Brad, Michael, and the BOSC organizing committee From biopython at maubp.freeserve.co.uk Fri Jan 29 10:36:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Jan 2010 10:36:40 +0000 Subject: [Biopython-dev] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010 In-Reply-To: References: <20100128203505.GG40046@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com> Hi all, This is a great topic but should be continue it on just the one mailing list? Is there a suitable BOSC list, or how about the general Open Bio list? On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote: > > Brad, this sounds exciting! > > One thing strikes me, though - by asking for the sub-projects to propose > the "grand challenge" themselves the one thing you can guarantee is that > the "grand challenge" is solvable (or more likely, already solved!) 
> > Other "grand challenge" kinds of meetings have an independent third party > pose the problem that has to be solved, and then all groups work toward a > solution and compare their results. This would, IMO, be more revealing of > the "state of the art" in each Open-Bio project, and point out where the > weaknesses are that we should be focusing on... Someone (for example, > you!) could act as the moderator to ensure that the "grand challenge" was > at least a reasonable one, within the scope of what an Open-Bio project > *should* be able to solve... > > Just my CAD $0.02 > > Mark One possible problem with having Brad act as moderator is his ties to Biopython (plus it would be a shame if we'd be one man down for trying to solve the challenges - grin). Having a project representative "sign off" on the challenge might work - or simply the whole of the BOSC committee which is quite balanced. Alternatively some kind of panel of challenges does seem a good way to reduce individual project bias (as suggested by Scooter), but there will still need to be a judging committee. I'm curious what kind of challenges the BOSC committee had in mind - would something like taking a newly sequenced bacterium and producing an automated annotation as a GenBank, EMBL, or GFF file be too ambitious for example? There are already several major projects to do this e.g.
RAST http://rast.nmpdr.org/ Peter (@Biopython) From bugzilla-daemon at portal.open-bio.org Sun Jan 31 20:30:45 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 31 Jan 2010 15:30:45 -0500 Subject: [Biopython-dev] [Bug 3004] New: Contribute PSL alignment format to biopython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3004 Summary: Contribute PSL alignment format to biopython Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: forgetta at gmail.com Hi Bio-pythonistas, I am interested in contributing code to biopython. I have developed a class to represent PSL output from the BLAT alignment program. I would like to contribute it to the AlignIO module. I have read through and agree to the guidelines stipulated on http://biopython.org/wiki/Contributing. I have never written unit tests before, but I am willing to learn. Thanks. Vince -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Jan 31 22:24:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 31 Jan 2010 17:24:53 -0500 Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in Bio.AlignIO In-Reply-To: Message-ID: <201001312224.o0VMOrha006787@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3004 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Contribute PSL alignment |PSL alignment format parsing |format to biopython |in Bio.AlignIO ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-01-31 17:24 EST ------- Hi Vince, This sounds interesting - I've been using BLAT's plain text BLAST output format with Biopython up until now. Have you ever used github? That would be one way to share your code. Or, just attach diff files, Python files, and example BLAT files to this bug. If you haven't already done so, signing up to our development mailing list would be a good idea. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.