From chapmanb at 50mail.com  Mon Jan  4 08:16:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 4 Jan 2010 08:16:31 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <20100104131631.GG80812@sobchak.mgh.harvard.edu>

Hey Eric;
Happy New Year -- thanks for all the work on TreeIO. This sounds
great and looking forward to getting it in the main trunk. I'd like
to hear Peter's and others' thoughts, but just a few small comments
below.

> The tree annotations (e.g. id) aren't preserved perfectly during conversions
> -- I'll keep working on this, but I don't think it's a blocker. The taxon
> names of terminal nodes are kept as "clade" names in phyloXML for
> round-tripping. Tree topology and branch lengths seem OK.

Are the annotations often used in real life cases or is this more of
a fringe problem? I'm not as familiar with tree work, but know this
is a pain in sequence space. A good goal is to capture the most
common use cases and then integrate the other issues as feasible.

> Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
> incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
> This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
> as I imagine it:
> (1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
> reasonable (since the node IDs and adjacency list lookup are no longer
> needed)
> (2) Implement methods in Bio.Tree.Newick with the original argument lists,
> but triggering a deprecation warning indicating the newer replacement method
> (3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
> shims to duplicate the original API -- so test_Nexus.py should still pass,
> ideally (with deprecation warnings)
> (4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
> NexusIO and Bio.Tree methods.
> (5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
> 
> I'm currently doing (1) and (2), with more emphasis on getting (1) right.
> Not all of the important methods have been ported, but I'm happy with the
> tree traversal methods.

Nice. This all sounds like a really good refactoring. It sounds like (1)
can happen once this all gets merged with the main branch, and
could benefit from others being able to more easily look at it and
make suggestions.
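
A minimal sketch of steps (1) and (2) from the plan quoted above, to
make the shim idea concrete. The class and method names here (Tree,
NewickTree, get_terminals, get_taxa) are stand-ins chosen for
illustration and are assumptions, not the actual code in the phyloxml
branch:

```python
import warnings


class Tree:
    """Stand-in for the new-style tree class (hypothetical)."""

    def __init__(self, terminals):
        self._terminals = list(terminals)

    def get_terminals(self):
        # Step (1): the ported method, with a simplified argument list.
        return list(self._terminals)


class NewickTree(Tree):
    """Stand-in for the compatibility subclass that carries the shims."""

    def get_taxa(self, node_id=None):
        # Step (2): keep the old-style signature so existing callers keep
        # working. node_id is accepted but ignored in this sketch.
        warnings.warn(
            "get_taxa() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_terminals()
```

With stacklevel=2 the warning is attributed to the caller of the shim,
which is usually the line a user needs to update.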

> I noticed that in Tests/Nexus/, the example file for internal node labels is
> actually in Newick/NH format, not Nexus. That was briefly confusing, so
> maybe that file should be renamed.

Oops, I think that may have been me. No problem, rename away.

Brad

From eric.talevich at gmail.com  Mon Jan  4 19:09:18 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 4 Jan 2010 16:09:18 -0800
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20100104131631.GG80812@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> 
	<20100104131631.GG80812@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf361001041609u7997dd61v441257dbfecdebd6@mail.gmail.com>

Hi Brad, I hope the holidays treated you well.

On Mon, Jan 4, 2010 at 5:16 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> Are the annotations often used in real life cases or is this more of
> a fringe problem? I'm not as familiar with tree work, but know this
> is a pain in sequence space. A good goal is to capture the most
> common use cases and then integrate the other issues as feasible.
>

The data that TreeIO preserves round-trip are:

 - Branching structure (topology)
 - Branch lengths
 - Clade/taxon names
 - Rooted-ness (for the whole tree)
 - Tree ID

The troublesome parts are:

 - The "confidences" attribute in PhyloXML trees should map onto the
"support" attribute in Nexus trees, but that's tricky -- the original Nexus
attribute seemed content with a little ambiguity in what that attribute's
numerical value actually meant (relative/absolute support), while PhyloXML
uses a list of Confidence objects containing both a numerical value and a
"type" string such as "bootstrap". Currently that information is dropped
when converting between PhyloXML and Nexus/Newick trees.
 - Nexus also has a "comment" attribute for each node, while PhyloXML
doesn't directly support that.
 - The branch length of the root node/clade is None in PhyloXML, but 0.0 in
Nexus. I prefer None because there is no meaningful branch leading to that
node, but there might be a reason 0.0 was chosen for Nexus that I'm not
aware of.
 - The names of unlabeled internal nodes might change from None to "" in
some cases, since None is the PhyloXML default and "" is the Nexus default.
 - Since PhyloXML supports more structured taxonomic information on each
node than Newick, it's possible to have a PhyloXML tree where a Clade has no
name, but instead one or more Taxonomy objects containing the scientific
name, common names, etc. -- so when this tree is converted to Newick format
the taxonomy info is lost for those nodes. I could squash the Taxonomy
object into a string for the sake of Nexus labels, but I think it would be
safer (less surprising) to just write a cookbook entry on how to collapse
PhyloXML Taxonomies into Clade names to aid format conversions.
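
The cookbook idea in the last item could look roughly like the sketch
below. The Clade and Taxonomy classes are simplified stand-ins defined
only for this example (their attribute names are assumptions, not the
branch's API), and the rule "scientific name first, then the first
common name" is one possible choice among several:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxonomy:
    # Hypothetical, reduced version of a PhyloXML taxonomy annotation.
    scientific_name: Optional[str] = None
    common_names: List[str] = field(default_factory=list)


@dataclass
class Clade:
    # Hypothetical, reduced version of a PhyloXML clade.
    name: Optional[str] = None
    taxonomies: List[Taxonomy] = field(default_factory=list)
    clades: List["Clade"] = field(default_factory=list)


def collapse_taxonomy_names(clade):
    """Fill in missing clade names from taxonomy data, recursively, in place."""
    if not clade.name:
        for taxonomy in clade.taxonomies:
            # Prefer the scientific name; fall back to the first common name.
            label = taxonomy.scientific_name or next(
                iter(taxonomy.common_names), None
            )
            if label:
                clade.name = label
                break
    for child in clade.clades:
        collapse_taxonomy_names(child)
    return clade
```

Existing clade names are left alone, so running this before writing
Newick only affects nodes that would otherwise come out unlabeled.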

If the support-vs-confidence issue can be resolved, then we can treat
PhyloXML as a rough superset of Newick, in terms of annotation, and then it
shouldn't be surprising to lose some annotation data in converting PhyloXML
to Newick.
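
One possible way to resolve the support-vs-confidence question,
sketched under stated assumptions: the Confidence class below is a
stand-in with just a value and a type string, and both the
"bootstrap" preference and the "unknown" fallback type are choices
made for this example, not decisions taken on the list:

```python
from dataclasses import dataclass


@dataclass
class Confidence:
    # Hypothetical, reduced version of a PhyloXML confidence entry.
    value: float
    type: str = "unknown"


def confidences_to_support(confidences, preferred_type="bootstrap"):
    """Pick one numeric support value from a list of Confidence objects.

    Returns None for an empty list. A value of the preferred type wins;
    otherwise the first entry is used.
    """
    if not confidences:
        return None
    for conf in confidences:
        if conf.type == preferred_type:
            return conf.value
    return confidences[0].value


def support_to_confidences(support, assumed_type="unknown"):
    """Wrap a bare support number in a one-element Confidence list."""
    if support is None:
        return []
    # The Nexus side does not say what the number means, so the type is
    # only whatever the caller assumes it to be.
    return [Confidence(value=float(support), type=assumed_type)]
```

The first direction is lossy (extra Confidence entries and their types
are dropped), which matches the "rough superset" view of PhyloXML.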

Cheers,
Eric

From biopython at maubp.freeserve.co.uk  Tue Jan  5 12:50:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 17:50:25 +0000
Subject: [Biopython-dev] code credits
In-Reply-To: <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
References: <bb02be080912171320u480fe461r1f517970f08e091b@mail.gmail.com>
	<928490.72367.qm@web30708.mail.mud.yahoo.com>
	<320fb6e00912171454v2ce81fc5v93547951d7af84f8@mail.gmail.com>
	<Pine.SOC.4.64.0912171946120.13591@ub.d.umn.edu>
	<320fb6e00912210357m32156fdax6639445cadd83217@mail.gmail.com>
	<20091221132339.GC21580@sobchak.mgh.harvard.edu>
	<320fb6e00912210634o77d9eb9ex21e4ec3630dd1ed6@mail.gmail.com>
	<320fb6e00912210848x449fd73al4e97d3c9e21cf4@mail.gmail.com>
	<320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
Message-ID: <320fb6e01001050950r64dabb1dw67baafada72f5d1a@mail.gmail.com>

On Tue, Dec 22, 2009 at 12:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Dec 21, 2009 at 4:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> So, how about a merger of (1) and (3)? i.e.
>>
>> * The CONTRIBUTORS file remains a single alphabetical list
>> of all contributors to date (no change).
>> * Entries in the NEWS file for new features etc may continue
>> to credit authors as appropriate.
>> * The NEWS file will include at the end of each release section
>> an alphabetical list of contributors for that release (with new
>> contributors flagged). This will be re-used in the release notice.
>
> I've done that in github - how do the NEWS and CONTRIB file look?
>
> http://github.com/biopython/biopython/commit/86d8d99aab894ab5f32a0e7a0c45d63a441da645
>
> I haven't automatically included email addresses for the new contributors
> since there is a risk of them being harvested for spam, so I figure that
> should be "opt in".

Thanks to those with feedback off list (e.g. sort order).

I've just updated the news post to include the list of names:
http://news.open-bio.org/news/2009/12/biopython-release-153/

I don't have time today, but at some point this week I want to
do another news post and email announcement describing
this new Sage-like policy for recognising contributors. If anyone
would like to compose a draft of the apparent consensus that
would be very helpful.

If anyone would like to go back over the commit log for the
recent releases to update them as we've just done for 1.53,
please go ahead - but post an email here to avoid duplicated
efforts.

Peter

P.S. Happy New Year!

From bugzilla-daemon at portal.open-bio.org  Thu Jan  7 13:11:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 Jan 2010 13:11:47 -0500
Subject: [Biopython-dev] [Bug 2980] New: Bio.SeqIO can't parse EMBL CONTIG
	records
Message-ID: <bug-2980-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2980

           Summary: Bio.SeqIO can't parse EMBL CONTIG records
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


While the GenBank parser has been updated to cope with CONTIG records
(using an UnknownSeq object), this has not been done for the EMBL parser.
As an example test case, consider:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rel_con_hum_01_r102.dat.gz


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Fri Jan  8 06:50:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 Jan 2010 06:50:56 -0500
Subject: [Biopython-dev] [Bug 2980] Bio.SeqIO can't parse EMBL CONTIG records
In-Reply-To: <bug-2980-42@http.bugzilla.open-bio.org/>
Message-ID: <201001081150.o08Bougb013879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2980


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-08 06:50 EST -------
Fixed in git



From mjldehoon at yahoo.com  Fri Jan  8 11:26:29 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 8 Jan 2010 08:26:29 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <221209.41863.qm@web62404.mail.re1.yahoo.com>

I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric!

I have one suggestion though:
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython. The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too).

Thanks again,

--Michiel.

--- On Mon, 12/28/09, Eric Talevich <eric.talevich at gmail.com> wrote:

> From: Eric Talevich <eric.talevich at gmail.com>
> Subject: Re: [Biopython-dev] Code review request for phyloxml branch
> To: "BioPython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Monday, December 28, 2009, 8:51 PM
> Hi folks,
> 
> Here's an update on the status of Bio.Tree and TreeIO. I
> think I've taken
> care of most of the blockers since the last review in
> September.
> 
> First, some links:
> http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
> http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
> http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
> http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
> http://biopython.org/wiki/PhyloXML
> 
> Discussion:
> 
> *TreeIO*
> Conversion between Nexus, Newick and phyloXML tree file
> formats works; the
> read/parse/write functions for each IO format use the same
> object types.
> Neat!
> 
> The tree annotations (e.g. id) aren't preserved perfectly
> during conversions
> -- I'll keep working on this, but I don't think it's a
> blocker. The taxon
> names of terminal nodes are kept as "clade" names in
> phyloXML for
> round-tripping. Tree topology and branch lengths seem OK.
> 
> Under the hood:
> -- PhyloXMLIO is from GSoC
> -- NewickIO is ported from the Bio.Nexus.Trees parser. I
> think it works the
> same way.
> -- NexusIO relies on Bio.Nexus.Nexus for parsing, then
> converts the
> resulting Nexus.Trees.Tree objects to Bio.Tree.Newick
> objects. One day, when
> Nexus.Trees is replaced by NewickIO in the main Nexus
> parser, then this
> conversion can be dropped and NexusIO will be very simple.
> 
> *Tree*
> The BaseTree object structure looks like this:
> 
> -- BaseTree.Tree contains global tree information, like
> whether the tree
> is rooted, and a reference to the root clade. The phyloXML
> Phylogeny object
> inherits from this.
> 
> -- BaseTree.Subtree contains local (clade- or
> node-specific) information,
> and references to each of its direct descendents,
> recursively. The phyloXML
> Clade object inherits from this. Nodes are implicit. I
> could add references
> to the ancestor of each sub-tree without too much
> difficulty, but I haven't
> needed them yet.
> 
> The same methods (get_terminals et al.) generally apply to
> both classes, so
> I created a separate TreeMixin class from which both
> BaseTree.Tree and
> BaseTree.Subtree inherit.
> 
> Bio.Tree.Newick contains simple subclasses of Tree and
> Subtree, and an
> incomplete set of shims that track Bio.Nexus.Trees.Tree
> (minus the I/O).
> This is to ease the deprecation and eventual replacement of
> Bio.Nexus.Trees,
> as I imagine it:
> (1) Port methods from Nexus.Trees to Bio.Tree, simplifying
> arguments where
> reasonable (since the node IDs and adjacency list lookup
> are no longer
> needed)
> (2) Implement methods in Bio.Tree.Newick with the original
> argument lists,
> but triggering a deprecation warning indicating the newer
> replacement method
> (3) Replace Nexus.Trees with an import of
> Bio.Tree.Newick(IO) and a few more
> shims to duplicate the original API -- so test_Nexus.py
> should still pass,
> ideally (with deprecation warnings)
> (4) In Nexus.Nexus, replace all usage of Nexus.Trees with
> proper usage of
> NexusIO and Bio.Tree methods.
> (5) Eventually delete Nexus.Trees and the shims in
> Bio.Tree.Newick.
> 
> I'm currently doing (1) and (2), with more emphasis on
> getting (1) right.
> Not all of the important methods have been ported, but I'm
> happy with the
> tree traversal methods.
> 
> *Tests*
> I created test_Tree.py to test the methods in
> Bio.Tree.BaseTree;
> test_PhyloXML.py tests Bio.Tree.PhyloXML objects and
> Bio.TreeIO.PhyloXMLIO
> parsing/writing.
> 
> I noticed that in Tests/Nexus/, the example file for
> internal node labels is
> actually in Newick/NH format, not Nexus. That was briefly
> confusing, so
> maybe that file should be renamed.
> 
> What do you think?
> 
> All the best,
> Eric
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 


From p.j.a.cock at googlemail.com  Fri Jan  8 12:00:12 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 8 Jan 2010 17:00:12 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <221209.41863.qm@web62404.mail.re1.yahoo.com>
References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
	<221209.41863.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>

On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> I am not an expert in this area, but the code looks very well done and well
> organized. Thanks, Eric!
>
> I have one suggestion though:
> In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
> have everything under Bio.Tree. This makes it easier to understand what each
> Bio.* module is about, and also agrees with the structure of the other modules
> in Biopython. The only exception is Bio.Seq, for which there is a closely related
> Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
> I'd rather have a single Bio.Seq there too).

There is also Bio.AlignIO, which again might have been handled via Bio.Align
with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
following the lead from BioPerl. I think there are some good points about making
the code for the common object (tree, SeqRecord, Alignment) clearly separate
from the code for parsing or writing it (although separate top level modules is
perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).

So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
I don't like is that "Tree" could mean a class or a module (also a problem with
other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
convention (PEP8) is to use lower case for the module ("tree") and title case
for the class ("Tree"), something most of Biopython does not follow (and
which we can't change without a lot of upheaval). Another option if we want
to try and keep the existing module name style might be Bio.Trees containing
a Tree class, or perhaps something different like Bio.Phylo instead?

Peter

From eric.talevich at gmail.com  Fri Jan  8 13:22:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 8 Jan 2010 13:22:11 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>
References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> 
	<221209.41863.qm@web62404.mail.re1.yahoo.com>
	<320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>
Message-ID: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>

On Fri, Jan 8, 2010 at 12:00 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > I am not an expert in this area, but the code looks very well done and well
> > organized. Thanks, Eric!
> >
> > I have one suggestion though:
> > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
> > have everything under Bio.Tree. This makes it easier to understand what each
> > Bio.* module is about, and also agrees with the structure of the other modules
> > in Biopython. The only exception is Bio.Seq, for which there is a closely related
> > Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
> > I'd rather have a single Bio.Seq there too).
>
> There is also Bio.AlignIO, which again might have been handled via Bio.Align
> with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
> following the lead from BioPerl.

Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava do
something completely different.

I had the impression that pairing modules Foo & FooIO was an emerging
convention for organizing very general data types being fed by a
variety of file formats, while a single module Foo indicated support
for a particular program or source, like Entrez. But I think it would
be even cleaner if each Foo simply had a Foo.IO (or foo.io) sub-module
organizing the I/O for multiple file formats where applicable.

The TreeIO.* namespace is not crowded -- just read, write, parse,
convert. If that directory is moved under Bio.Tree and renamed to IO
or io, then Bio.Tree would still seem reasonably intuitive if
__init__.py contained:

from io import *
from utils import *

Then "from Bio import Tree" would be enough for most uses.

> I think there are some good points about making
> the code for the common object (tree, SeqRecord, Alignment) clearly separate
> from the code for parsing or writing it (although separate top level modules is
> perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
> Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).

PDB does its own thing, too -- and some consolidation there might be nice.

> So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
> I don't like is that "Tree" could mean a class or a module (also a problem with
> other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
> convention (PEP8) is to use lower case for the module ("tree") and title case
> for the class ("Tree"), something most of Biopython does not follow (and
> which we can't change without a lot of upheaval).

I could rename the modules inside Bio.Tree (or whatever we call it) to
follow the PEP8 convention:

Bio/Tree/
Bio/Tree/basetree.py
Bio/Tree/io.py
Bio/Tree/utils.py ...

The Biopython convention seems to be that directory names are title
case, file names are mostly title case if user-facing and lower case
otherwise, and C extensions are lower case. Most of the time there
won't be any need to import the sub-modules under Tree directly, so
the inconsistency shouldn't be too jarring.

> perhaps something different like Bio.Phylo instead?

Sure, that sounds promising.


Thanks!
Eric

From mjldehoon at yahoo.com  Sat Jan  9 10:15:56 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 9 Jan 2010 07:15:56 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
Message-ID: <863834.10061.qm@web62403.mail.re1.yahoo.com>


--- On Fri, 1/8/10, Eric Talevich <eric.talevich at gmail.com> wrote:
> Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava
> do something completely different.
> 
> I had the impression that pairing modules Foo & FooIO
> was an emerging convention for organizing very general
> data types being fed by a variety of file formats, while
> a single module Foo indicated support
> for a particular program or source, like Entrez.

I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.
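
A minimal sketch of points 2) and 3), assuming a toy file format with
one record per non-blank line. The Record class and the error messages
are illustrative assumptions; only the read()/parse() split comes from
the convention described above:

```python
from io import StringIO


class Record:
    """Stand-in for the primary data structure (hypothetical)."""

    def __init__(self, text):
        self.text = text


def parse(handle):
    """Yield one Record per non-blank line of the handle."""
    for line in handle:
        line = line.strip()
        if line:
            yield Record(line)


def read(handle):
    """Return the single Record in the handle, or raise ValueError."""
    records = parse(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        # Exactly one record was present, which is what read() promises.
        return first
    raise ValueError("More than one record found in handle")


single = read(StringIO("(A,B);\n"))
```

Building read() on top of parse() keeps the format-specific logic in
one place; read() only adds the "exactly one record" check.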

> But I think it would
> be even cleaner if each Foo simply had a Foo.IO (or foo.io)
> sub-module organizing the I/O for multiple file formats where
> applicable.

I agree.

> The TreeIO.* namespace is not crowded -- just read, write,
> parse, convert. If that directory is moved under Bio.Tree and
> renamed to IO or io, then Bio.Tree would still seem reasonably
> intuitive if __init__.py contained:
> 
> from io import *
> from utils import *
> 
> Then "from Bio import Tree" would be enough for most uses.

Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
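
This second option can be sketched as a small dispatch table, under
the assumption that each format has its own private parser. The parser
functions and the toy line-based formats below are hypothetical and
stand in for per-format sub-modules; they are not the branch's code:

```python
from io import StringIO


def _parse_newick(handle):
    # Toy parser: every non-blank line is treated as one Newick tree.
    for line in handle:
        line = line.strip()
        if line:
            yield ("newick", line)


def _parse_nexus(handle):
    # Toy parser: only lines starting with "tree " count as trees.
    for line in handle:
        line = line.strip()
        if line.lower().startswith("tree "):
            yield ("nexus", line)


# Private registry; users only ever call parse() below.
_PARSERS = {"newick": _parse_newick, "nexus": _parse_nexus}


def parse(handle, format):
    """Public entry point that hides which parser does the work."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown format: %r" % format) from None
    return parser(handle)


trees = list(parse(StringIO("(A,B);\n"), "newick"))
```

Because parse() itself is not a generator function, an unknown format
fails at call time instead of on first iteration.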
 
> > perhaps something different like Bio.Phylo instead?
> 
> Sure, that sounds promising.

I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing.

--Michiel


From eric.talevich at gmail.com  Sat Jan  9 18:38:29 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 9 Jan 2010 18:38:29 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <863834.10061.qm@web62403.mail.re1.yahoo.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> 
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
Message-ID: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>

Hi,

Thanks for your comments. I've reorganized the modules like this:

Bio/Phylo/
    __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py
    IO/
        __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py

Now "from Bio import Phylo" works for the common cases, and "from
Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct access to the
parsers.

I renamed TreeIO to Phylo/IO -- keeping it uppercase because io is a
standard module in Py2.6+, Py2.7 changes the priority rules for
absolute vs. relative imports, and Py2.4 doesn't support the new
syntax for relative imports. I might change the other file names to
lower case before the next merge, though...

On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.
>
> Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
>

I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.
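
For illustration, here is a small self-contained sketch of how "__all__" limits what a star import picks up. The module name `demo_utils` and the functions inside it are hypothetical; the module is built in memory only so the example runs without any files on disk.

```python
import sys
import types

# Source of a throwaway module. Only public_func is listed in __all__.
source = '''
__all__ = ["public_func"]

def public_func():
    return "exported"

def _helper():
    return "private by naming convention"

def internal_only():
    return "public-looking, but not listed in __all__"
'''

# Build the module in memory and register it so "import" can find it.
demo_utils = types.ModuleType("demo_utils")
exec(source, demo_utils.__dict__)
sys.modules["demo_utils"] = demo_utils

# Star-import into a fresh namespace, as a package __init__.py would.
namespace = {}
exec("from demo_utils import *", namespace)

# Ignore dunder entries such as __builtins__ that exec() adds itself.
exported = sorted(name for name in namespace if not name.startswith("__"))
```

After this runs, `exported` holds only the name listed in "__all__"; `internal_only` is still reachable as `demo_utils.internal_only`, it is just not pulled in by the star import.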

Cheers,
Eric


From mjldehoon at yahoo.com  Sat Jan  9 21:50:21 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 9 Jan 2010 18:50:21 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <274373.93315.qm@web62406.mail.re1.yahoo.com>

I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

Thanks!

--Michiel



From eric.talevich at gmail.com  Sun Jan 10 17:02:10 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 10 Jan 2010 17:02:10 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <274373.93315.qm@web62406.mail.re1.yahoo.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> 
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
Message-ID: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>

On Sat, Jan 9, 2010 at 9:50 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it.

OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!

For documentation on the Biopython wiki, I moved the relevant parts of
the Tree, TreeIO and PhyloXML pages to a new page for Bio.Phylo:
http://biopython.org/wiki/Phylo

It's a little rough at the moment, but I'll refine it this week. Some
of the content can also be moved to separate cookbook entries.

> One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

I went over all the docstrings and comments again before merging; it
should be free of Tree/TreeIO references now.

Thanks for your help!
Eric

From biopython at maubp.freeserve.co.uk  Mon Jan 11 06:04:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 11:04:03 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
	<3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>

On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> I'm trying to avoid having to update Phylo/__init__.py each time I add
> or rename a public function in Utils.py or IO. So, how about this:
> I've added "__all__" definitions to Utils.py and IO/__init__.py so
> that only the relevant public functions are loaded when
> Phylo/__init__.py imports * from those two sub-modules. Testing
> manually, this seems to do the right thing.

Previously bits of Biopython have used __all__, and then
abandoned this as a long term maintenance load. This was before
my time, so I am not familiar with the full history, but it makes me
wary about using __all__ here.

Personally I don't see a big problem with having just explicit
manual imports within Bio/Phylo/__init__.py if and when you
decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
should be made available at the top level. In general I would
think relatively few things should be exposed like that.

Peter

From biopython at maubp.freeserve.co.uk  Mon Jan 11 06:37:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 11:37:42 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
Message-ID: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>

On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> OK -- I pulled the latest from biopython/biopython on GitHub, merged
> my phyloxml branch into my master branch, and pushed it all back to
> biopython. Bio.Phylo is now part of Biopython!

Wow - that was quicker than I expected. As an aside, do you know
why there seem to be three main branches in the history now?
I guess this was the "original" master, your local master, and your
phyloxml branch?

One minor thing - test_Phylo.py needs to be tweaked to raise a
MissingExternalDependencyError if NetworkX isn't installed. That
way the run_tests.py script will treat it as a skipped test instead of
a failed test. Alternatively, if this is just a small part of the test,
maybe split test_Phylo.py into two files (e.g. add a new file
test_Phylo_NetworkX.py which needs the dependency).
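
The skip mechanism could look roughly like this. `MissingExternalDependencyError` is defined locally as a stand-in so the sketch runs without Biopython installed; in a real test file it would be imported from Bio. The `require` helper is a hypothetical name, not an existing function.

```python
import importlib

class MissingExternalDependencyError(Exception):
    """Local stand-in for the Biopython exception of the same name."""

def require(module_name, hint):
    """Import and return a module, or signal that the test should be skipped."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise MissingExternalDependencyError(hint)

# At the top of a dependency-specific test file one would write, e.g.:
#     networkx = require("networkx", "Install NetworkX to run these tests.")
# Here a standard library module is used so the sketch always succeeds.
json_module = require("json", "json ships with Python, so this succeeds.")
```

Because the check runs at import time, the test runner sees the exception before any test case is collected and can report the whole file as skipped.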

And how's this for a draft entry in the NEWS file?

New module Bio.Phylo includes support for reading, writing and working with
phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
Eric Talevich on a Google Summer of Code 2009 project, under The National
Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
Christian Zmasek.

Peter

From chapmanb at 50mail.com  Mon Jan 11 08:18:40 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 11 Jan 2010 08:18:40 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
Message-ID: <20100111131840.GB46441@sobchak.mgh.harvard.edu>

Hi all;

> OK -- I pulled the latest from biopython/biopython on GitHub, merged
> my phyloxml branch into my master branch, and pushed it all back to
> biopython. Bio.Phylo is now part of Biopython!

Awesome. Congrats Eric -- thanks for all the hard work on this
during the summer, and getting it in shape for inclusion. Peter and
Michiel, thanks for all the helpful feedback. Really happy to have
this integrated,
Brad

From biopython at maubp.freeserve.co.uk  Mon Jan 11 08:42:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 13:42:32 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
	<3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>
Message-ID: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>

On Mon, Jan 11, 2010 at 11:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> I'm trying to avoid having to update Phylo/__init__.py each time I add
>> or rename a public function in Utils.py or IO. So, how about this:
>> I've added "__all__" definitions to Utils.py and IO/__init__.py so
>> that only the relevant public functions are loaded when
>> Phylo/__init__.py imports * from those two sub-modules. Testing
>> manually, this seems to do the right thing.
>
> Previously bits of Biopython have used __all__, and then
> abandoned this as a long term maintenance load. This was before
> my time, so I am not familiar with the full history, but it makes me
> wary about using __all__ here.
>
> Personally I don't see a big problem with having just explicit
> manual imports within Bio/Phylo/__init__.py if and when you
> decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
> should be made available at the top level. In general I would
> think relatively few things should be exposed like that.

In fact, why even do this at all? What is wrong with leaving
the IO functions (read, parse, write) as Bio.Phylo.IO.read etc
e.g.

>>> from Bio import Phylo
>>> tree = Phylo.IO.read(open("int_node_labels.nwk"),"newick")

What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.

If we do want to use Bio.Phylo.IO instead of Bio.PhyloIO
(or Bio.TreeIO) then thinking long term we may want to
do something about Bio.SeqIO and Bio.AlignIO to match.
We could move the Bio.AlignIO functionality under
Bio.Align.IO (with a suitable transition period). We could
move Bio.SeqIO to Bio.Seq.IO perhaps. Or we could
even talk about introducing Bio.Sequences (or something)
then move Bio.SeqIO to Bio.Sequences.IO, and move
Bio.SeqUtils.* under there too, and perhaps even the
Seq, SeqRecord and SeqFeature objects as well.
On the other hand, all that upheaval would cause a
lot of pain for end users, for relatively little gain.
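
For reference, the kind of transition shim this would involve can be sketched as follows. The function names `read` and `old_read` are hypothetical; nothing here is an existing Biopython API.

```python
import warnings

def read(handle, fmt):
    """Function at its new location (hypothetical placeholder body)."""
    return (handle, fmt)

def old_read(handle, fmt):
    """Old name kept during the transition period; warns, then delegates."""
    warnings.warn(
        "old_read is deprecated; use read instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this shim
    )
    return read(handle, fmt)

# Capture the warning to show that existing code keeps working meanwhile.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_read("handle", "fasta")
```

Old call sites keep returning the same result and only gain a warning, which is what makes a long transition period tolerable for end users.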

Peter

From mjldehoon at yahoo.com  Mon Jan 11 10:02:46 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 11 Jan 2010 07:02:46 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
Message-ID: <107440.85746.qm@web62406.mail.re1.yahoo.com>


--- On Mon, 1/11/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> What is wrong with leaving the IO functions
> (read, parse, write) as Bio.Phylo.IO.read etc
> e.g.
> 
> >>> from Bio import Phylo
> >>> tree =
> Phylo.IO.read(open("int_node_labels.nwk"),"newick")
> 
> What is the benefit of having them also exposed under the
> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
> there are two ways to access them which is confusing.

If we use Bio.Phylo.IO.read directly, then for consistency we'd have to do the same for all other modules. Otherwise, we'd be guessing each time whether the read() and parse() functions are in Bio.SomeModule, or Bio.SomeModule.IO.

For Bio.Phylo, a simple solution is to put whatever is in Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and remove Bio.Phylo.IO.__init__.py. Then there is only one way to access the read() etc. functions.

[About doing the same for Bio.Seq and Bio.Align]
> On the other hand, all that upheaval would cause a
> lot of pain for end users, for relatively little gain.

For new users, it may be confusing to have all those different modules dealing with sequences. At least, it was for me when I started with Biopython. Therefore, for a long term solution, I'd prefer a single Bio.Seq module that incorporates all (Seq, SeqRecord, SeqIO, SeqFeature).

I agree that that may cause a lot of upheaval for end users, but a suitably long transition period may mitigate those concerns. I'd prefer that to being stuck with a less-than-optimal code organization forever.

--Michiel


From biopython at maubp.freeserve.co.uk  Mon Jan 11 11:17:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 16:17:36 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <107440.85746.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>

On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote:
>
> On Mon, 1/11/10, Peter wrote:
>> What is the benefit of having them also exposed under the
>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
>> there are two ways to access them which is confusing.
>
> If we use Bio.Phylo.IO.read directly, then for consistency we'd have
> to do the same for all other modules. Otherwise, we'd be guessing
> each time whether the read() and parse() functions are in
> Bio.SomeModule, or Bio.SomeModule.IO.

Fair point.

> For Bio.Phylo, a simple solution is to put whatever is in
> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
> remove Bio.Phylo.IO.__init__.py. Then there is only one
> way to access the read() etc. functions.

Or (if the functions are reasonably complex) keep the
input/output code in a separate file, but make it explicit
that it is not a public interface - e.g. use Bio/Phylo/_IO.py?

> [About doing the same for Bio.Seq and Bio.Align]
>> On the other hand, all that upheaval would cause a
>> lot of pain for end users, for relatively little gain.
>
> For new users, it may be confusing to have all those
> different modules dealing with sequences. At least, it
> was for me when I started with Biopython. Therefore,
> for a long term solution, I'd prefer a single Bio.Seq
> module that incorporates all (Seq, SeqRecord, SeqIO,
> SeqFeature).

I agree that for a long term solution a single module
makes sense here, although I'm not convinced that
Bio.Seq is the best name. We'd have to switch from
a single file Bio/Seq.py to a folder with multiple files
including Bio/Seq/__init__.py - I worry this may cause
problems with updating existing Biopython installations.

> I agree that that may cause a lot of upheaval for end
> users, but a suitably long transition period may mitigate
> those concerns. I'd prefer that to being stuck with a
> less-than-optimal code organization forever.

In principle I agree with that.

Peter

From eric.talevich at gmail.com  Mon Jan 11 11:30:32 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 11 Jan 2010 11:30:32 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> 
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> 
	<320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>
Message-ID: <3f6baf361001110830y391ea21cs8315a266b8b4fb43@mail.gmail.com>

On Mon, Jan 11, 2010 at 6:37 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> OK -- I pulled the latest from biopython/biopython on GitHub, merged
>> my phyloxml branch into my master branch, and pushed it all back to
>> biopython. Bio.Phylo is now part of Biopython!
>
> Wow - that was quicker than I expected. As an aside, do you know
> why there seem to be three main branches in the history now?
> I guess this was the "original" master, your local master, and your
> phyloxml branch?

Er, sorry if I jumped the gun. I was eager to get this done before the
semester kicks in... anyway, these are the Git commands I used:

git checkout master
git pull upstream  # remote: biopython master
git checkout phyloxml
git merge master  # check that it merges cleanly
git checkout master
git merge phyloxml  # fast-forward
git push upstream master
git push origin master  # updating my own branches on github
git push origin phyloxml

It looks more reasonable in gitk; maybe the branches will separate
again later on GitHub when they're no longer equivalent, or when I
delete the phyloxml branch.

> One minor thing - test_Phylo.py needs to be tweaked to raise a
> MissingExternalDependencyError if NetworkX isn't installed. That
> way the run_tests.py script will treat it as a skipped test instead of
> a failed test. Alternatively, if this is just a small part of the test,
> maybe split test_Phylo.py into two files (e.g. add a new file
> test_Phylo_NetworkX.py which needs the dependency).

I extracted test_Phylo_depend.py from test_Phylo and added tests at
the top level for networkx and either pygraphviz or pydot (since those
are also used by Bio/Phylo/Utils.py).

> And how's this for a draft entry in the NEWS file?
>
> New module Bio.Phylo includes support for reading, writing and working with
> phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
> Eric Talevich on a Google Summer of Code 2009 project, under The National
> Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
> Christian Zmasek.

Great, thanks!

Eric

From eric.talevich at gmail.com  Mon Jan 11 11:43:01 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 11 Jan 2010 11:43:01 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> 
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
	<320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
Message-ID: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>

On Mon, Jan 11, 2010 at 11:17 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote:
>>
>> On Mon, 1/11/10, Peter wrote:
>>> What is the benefit of having them also exposed under the
>>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
>>> there are two ways to access them which is confusing.
>>
>> If we use Bio.Phylo.IO.read directly, then for consistency we'd have
>> to do the same for all other modules. Otherwise, we'd be guessing
>> each time whether the read() and parse() functions are in
>> Bio.SomeModule, or Bio.SomeModule.IO.
>
> Fair point.
>
>> For Bio.Phylo, a simple solution is to put whatever is in
>> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
>> remove Bio.Phylo.IO.__init__.py. Then there is only one
>> way to access the read() etc. functions.
>
> Or (if the functions are reasonably complex) keep the
> input/output code in a separate file, but make it explicit
> that it is not a public interface - e.g. use Bio/Phylo/_IO.py?

Something like this?

Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py

This plays well with the expected import styles:

from Bio import Phylo  # most common
from Bio.Phylo import PhyloXML  # access the defined types
from Bio.Phylo import PhyloXMLIO  # special parsing

From biopython at maubp.freeserve.co.uk  Mon Jan 11 12:11:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 17:11:29 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
Message-ID: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>

On Mon, Nov 23, 2009 at 2:43 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Dear all,
>
> Is there anyone on the dev mailing list willing to test the SFF
> support I've been working on for Bio.SeqIO? The code is here,
> a branch on github:
> http://github.com/peterjc/biopython/tree/sff-seqio
>
> The important files are:
> * Bio/SeqIO/SffIO.py
> * Bio/SeqIO/__init__.py (defining the new format)
> * Bio/SeqIO/_index.py (indexing SFF files)
>
> Plus unit test files:
> * Tests/run_tests.py (to run the doctests)
> * Tests/test_SeqIO_QualityIO.py
> * Tests/test_SeqIO_index.py
> * Tests/test_SeqIO.py
> * Tests/Roche/* (for unit tests)
>
> Sebastian Bassi had a look last month and his feedback has
> already helped (e.g. with error messages):
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html
>
> I have been using this code myself in real work, for example
> editing the trim points in an SFF file to take into account PCR
> primer sequences, and filtering SFF reads, checking Roche
> barcodes etc.
>
> Thanks,
>
> Peter
>

Hi all,

I didn't want to rush the SFF support into Biopython 1.53, but it's been
waiting "ready" for a while now. Any objections or comments about
me merging this now?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Tue Jan 12 09:51:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 14:51:58 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
	<320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
	<3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
Message-ID: <320fb6e01001120651i6b3d661m83187659595ce9e4@mail.gmail.com>

On Mon, Jan 11, 2010 at 4:43 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Mon, Jan 11, 2010 at 11:17 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Or (if the functions are reasonably complex) keep the
>> input/output code in a separate file, but make it explicit
>> that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
>
> Something like this?
>
> Phylo/
>     BaseTree.py
>     Newick.py
>     PhyloXML.py
>     _IO.py
>     _Utils.py
>     PhyloXMLIO.py
>     NewickIO.py
>     NexusIO.py
>
> This plays well with the expected import styles:
>
> from Bio import Phylo  # most common
> from Bio.Phylo import PhyloXML  # access the defined types
> from Bio.Phylo import PhyloXMLIO  # special parsing

I'd forgotten Bio/Phylo/IO was a directory, and that the users may
want to access PhyloXMLIO directly. That suggested structure
looks reasonable... what do you think Michiel?

Peter


From kellrott at gmail.com  Tue Jan 12 16:46:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 13:46:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
Message-ID: <bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>

I've pulled from the main branch and fixed a few problems.  I've tested the
code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to be
working right now.

Kyle

On Thu, Dec 17, 2009 at 10:03 AM, Kyle Ellrott <kellrott at gmail.com> wrote:

>
>  > Code can be found at http://github.com/kellrott/biopython
>>
>> Lovely. That's on your jython branch (along with lots of your other work)?
>>
>
> Yes, but all of the zxJDBC work has been done in the past 2 weeks (just the
> last three commits), so it should be easy to cherry-pick out the relevant
> patches.
>
> Kyle
>

From biopython at maubp.freeserve.co.uk  Tue Jan 12 16:51:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 21:51:34 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
Message-ID: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>

On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> I've pulled from the main branch and fixed a few problems.  I've tested the
> code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to be
> working right now.
>
> Kyle

Excellent - I had a play last month, and Jython Mysql seemed to work.
Do you know if/how to get SQLite and/or PostgreSQL drivers installed
under zxJDBC?

Peter


From kellrott at gmail.com  Tue Jan 12 17:06:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 14:06:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
Message-ID: <bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>

I haven't played with PostgreSQL yet (don't even have it installed).
Sqlite as a python package hasn't been standardized to Jython yet  (
http://bugs.jython.org/issue1682864 )

One option is to call SQLite JDBC (
http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the
existing SQLite code.
But like zxJDBC, the jar would need to be in the CLASSPATH variable for the
code to work.


Kyle

On Tue, Jan 12, 2010 at 1:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> > I've pulled from the main branch and fixed a few problems.  I've tested
> the
> > code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to
> be
> > working right now.
> >
> > Kyle
>
> Excellent - I had a play last month, and Jython Mysql seemed to work.
> Do you know if/how to get SQLite and/or PostgreSQL drivers installed
> under zxJDBC?
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Wed Jan 13 06:22:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 11:22:23 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
Message-ID: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>

On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> I haven't played with PostgreSQL yet (don't even have it installed).
> Sqlite as a python package hasn't been standardized to Jython yet  (
> http://bugs.jython.org/issue1682864 )
>
> One option is to call SQLite JDBC (
> http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the
> existing SQLite code.
> But like zxJDBC, the jar would need to be in the CLASSPATH variable for the
> code to work.

I'm not 100% convinced that the details of your current approach
are the best way forward: Specifically taking a user script that works
on (C) Python using MySQL with MySQLdb as the driver, and when
run on Jython automatically interpreting this to use the Java MySQL
Connector/J with the org.gjt.mm.mysql.Driver (and so on for the
PostgreSQL and SQLite drivers?)

It might be clearer if we just treat the different Jython/Java drivers
as top level alternatives:

* MySQLdb (Python only, at least for now)
* psycopg, psycopg2, pgdb (Python only, at least for now)
* sqlite3 (currently Python only, maybe available on Jython later)
* org.gjt.mm.mysql.Driver (Jython only)
* Some Java PostgreSQL driver (Jython only)
* Some Java SQLite driver (Jython only)

This way we have a clean separation of all the different driver
or database specific changes - although the user is required
to make some minor changes to take an existing BioSQL on
MySQL script to explicitly change the driver from MySQLdb
to org.gjt.mm.mysql.Driver if they want to run it on Jython.
We also won't have lots of "if jython" statements everywhere.
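
A rough sketch of what treating the drivers as top level alternatives could look like. The adaptor classes, the `ADAPTORS` table, the `get_adaptor` function and the `placeholder` attribute are all hypothetical and do not mirror the actual BioSQL code; only the driver names come from the list above.

```python
class GenericAdaptor:
    """Behaviour shared by every driver (hypothetical)."""
    placeholder = "%s"  # parameter marker used by MySQLdb and psycopg2

class Sqlite3Adaptor(GenericAdaptor):
    """sqlite3 uses '?' as its parameter marker."""
    placeholder = "?"

class MysqlConnectorJAdaptor(GenericAdaptor):
    """JDBC drivers reached through zxJDBC also use '?' markers."""
    placeholder = "?"

# One entry per driver name; no platform checks are needed at lookup time.
ADAPTORS = {
    "MySQLdb": GenericAdaptor,
    "psycopg2": GenericAdaptor,
    "sqlite3": Sqlite3Adaptor,
    "org.gjt.mm.mysql.Driver": MysqlConnectorJAdaptor,
}

def get_adaptor(driver):
    """Return an adaptor instance for the driver the user explicitly chose."""
    try:
        return ADAPTORS[driver]()
    except KeyError:
        raise ValueError("Unsupported driver: %r" % driver)
```

Driver-specific quirks then live in one small class per driver, and moving a script from (C) Python to Jython means changing only the driver string passed to `get_adaptor`.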

What are your thoughts on this?

Note there will be some similarities between all the MySQL
adaptors, all the PostgreSQL adaptors, etc. I've just made
a small improvement to file BioSQL/DBUtils.py to reduce
the code duplication for the existing (C) Python PostgreSQL
adaptors.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jan 13 09:10:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 14:10:21 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
Message-ID: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>

Hi all,

Biopython currently supports Python 2.4, 2.5 and 2.6
(and seems to work on the current Python 2.7 alpha).

Is it time to start phasing out support for Python 2.4?

Reasons for encouraging Python 2.5+ include the
built in support for sqlite3 (which we can use in the
BioSQL wrappers) and ElementTree (which we use
for the phyloXML parser) both of which must currently
be manually installed for Python 2.4.

Also ReportLab is talking about dropping support
for Python 2.4 (another optional dependency of
Biopython). As far as I know, NumPy haven't yet
talked about dropping support for Python 2.4.

I was thinking of the usual deprecation procedure, so
we'd aim to have at least two releases and one year
before actually dropping support for Python 2.4. At
that point older Linux distributions which ship with
Python 2.4 probably won't be supported anyway.

e.g. The last version of Ubuntu to have Python 2.4
as the default was Ubuntu 6.06 LTS (Dapper Drake).
The desktop edition support ended July 2009, but
the server edition will be maintained until June 2011.

Peter

From eric.talevich at gmail.com  Wed Jan 13 12:08:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 13 Jan 2010 12:08:24 -0500
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>

On Wed, Jan 13, 2010 at 9:10 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha).
>
> Is it time to start phasing out support for Python 2.4?
>
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrappers) and ElementTree (which we use
> for the phyloXML parser) both of which must currently
> be manually installed for Python 2.4.

Also, it appears that Python 2.7 will use absolute instead of relative
imports by default:
http://www.python.org/dev/peps/pep-0328/

For intra-package imports like in PDB/__init__.py, an import like this:
from PDBParser import PDBParser

could be future-proofed for Py2.5+:
from __future__ import absolute_import
from .PDBParser import PDBParser

But to make it work in both Py2.4 and Py2.7, it would need to be
converted to an absolute import:
from Bio.PDB.PDBParser import PDBParser


Py2.5 introduced a number of other enticing syntax features, too:
http://docs.python.org/dev/whatsnew/2.5.html
- context managers (with_statement)
- if-else expressions
- unified try-except-finally (I flagged this issue in the comments in Bio.Phylo)
- all() and any()
- passing values into generators -- could be useful for parsing, maybe
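
A small sketch of the features listed above. The function names are
made up for illustration and are not Biopython code. As written it
needs Python 2.6 or later; on 2.5 itself the "with" statement also
requires "from __future__ import with_statement".

```python
def strand_label(strand):
    # Conditional (if-else) expression.
    return "forward" if strand >= 0 else "reverse"


def is_dna(seq):
    # all() builtin with a generator expression.
    return all(base in "ACGTN" for base in seq.upper())


def safe_int(text, log):
    # Unified try/except/finally in a single statement.
    try:
        return int(text)
    except ValueError:
        return None
    finally:
        log.append(text)


def running_total():
    # Passing values into a generator: send() delivers the next value.
    total = 0
    while True:
        value = yield total
        total += value


def first_line(path):
    # Context manager: the file is closed when the block exits.
    with open(path) as handle:
        return handle.readline().rstrip("\n")
```

For the generator, the caller first calls next() to advance to the
first yield, then send(5) and send(3) return the running totals 5 and 8.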

The enhancements to distutils might help simplify the dependency
handling in setup.py:
http://docs.python.org/dev/whatsnew/2.5.html#pep-314-metadata-for-python-software-packages-v1-1

I'm also interested in the functools and ctypes modules, but don't
have pressing use cases for them.
(So, you can take that as a +1 from me.)

Cheers,
Eric

From biopython at maubp.freeserve.co.uk  Wed Jan 13 12:21:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 17:21:23 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
Message-ID: <320fb6e01001130921w49b56793h413aacd3027d6275@mail.gmail.com>

On Wed, Jan 13, 2010 at 5:08 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Wed, Jan 13, 2010 at 9:10 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> Biopython currently supports Python 2.4, 2.5 and 2.6
>> (and seems to work on the current Python 2.7 alpha).
>>
>> Is it time to start phasing out support for Python 2.4?
>>
>> Reasons for encouraging Python 2.5+ include the
>> built in support for sqlite3 (which we can use in the
>> BioSQL wrappers) and ElementTree (which we use
>> for the phyloXML parser) both of which must currently
>> be manually installed for Python 2.4.
>
> Also, it appears that Python 2.7 will use absolute instead
> of relative imports by default:
> http://www.python.org/dev/peps/pep-0328/

Thanks for the heads up on that. I think we'll just need
to switch everything to absolute imports in order to
cover Python 2.4 to 2.7 inclusive.

>
> (So, you can take that as a +1 from me.)
>

Good :)

Peter

From kellrott at gmail.com  Wed Jan 13 12:37:53 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Wed, 13 Jan 2010 09:37:53 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
	<320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
Message-ID: <bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>

My main thought was to make it so that users can write a single script that
would work on any Python system (eventually IronPython as well).  Because
the current system expects the user to request a specific driver (MySQLdb)
that happens to be system specific, it forces user code to be system
specific.
One alternative would be to use the strings you describe below, but in
addition add special requests that would check the system and pull the
appropriate driver automatically:
'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, and
org.gjt.mm.mysql.Driver if in Jython.
Otherwise, if the user wants to use a specific driver, they pass its name.
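
A minimal sketch of the alias idea, assuming a hypothetical table and
helper (AUTO_DRIVERS and resolve_driver are made-up names that do not
exist in BioSQL):

```python
import sys

# Hypothetical alias table: generic name -> (CPython driver, Jython driver).
AUTO_DRIVERS = {
    "MySQL": ("MySQLdb", "org.gjt.mm.mysql.Driver"),
    "autoMySQL": ("MySQLdb", "org.gjt.mm.mysql.Driver"),
}


def resolve_driver(name, on_jython=None):
    """Map a generic alias to a concrete driver; pass other names through."""
    if on_jython is None:
        # Jython reports a sys.platform value starting with "java".
        on_jython = sys.platform.startswith("java")
    if name in AUTO_DRIVERS:
        cpython_driver, jython_driver = AUTO_DRIVERS[name]
        if on_jython:
            return jython_driver
        return cpython_driver
    # Anything else is treated as an explicit driver request.
    return name
```

A script that asks for 'MySQL' would then run unchanged on either
interpreter, while an explicit name such as 'MySQLdb' is never rewritten.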

Kyle

On Wed, Jan 13, 2010 at 3:22 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> > I haven't played with PostgreSQL yet (don't even have it installed).
> > SQLite as a Python package hasn't been standardized on Jython yet
> > ( http://bugs.jython.org/issue1682864 )
> >
> > One option is to call SQLite JDBC
> > ( http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than
> > reusing the existing SQLite code.
> > But like zxJDBC, the jar would need to be in the CLASSPATH variable
> > for the code to work.
>
> I'm not 100% convinced that the details of your current approach
> are the best way forward: Specifically taking a user script that works
> on (C) Python using MySQL with MySQLdb as the driver, and when
> run on Jython automatically interpreting this to use the Java MySQL
> Connector/J with the org.gjt.mm.mysql.Driver (and so on for the
> PostgreSQL and SQLite drivers?)
>
> It might be clearer if we just treat the different Jython/Java drivers
> as top level alternatives:
>
> * MySQLdb (Python only, at least for now)
> * psycopg, psycopg2, pgdb (Python only, at least for now)
> * sqlite3 (currently Python only, maybe available on Jython later)
> * org.gjt.mm.mysql.Driver (Jython only)
> * Some Java PostgreSQL driver (Jython only)
> * Some JAVA SQLite driver (Jython only)
>
> This way we have a clean separation of all the different driver
> or database specific changes - although the user is required
> to make some minor changes to take an existing BioSQL on
> MySQL script to explicitly change the driver from MySQLdb
> to org.gjt.mm.mysql.Driver if they want to run it on Jython.
> We also won't have lots of "if jython" statements everywhere.
>
> What are your thoughts on this?
>
> Note there will be some similarities between all the MySQL
> adaptors, all the PostgreSQL adaptors, etc. I've just made
> a small improvement to file BioSQL/DBUtils.py to reduce
> the code duplication for the existing (C) Python PostgreSQL
> adaptors.
>
> Peter
>

From chapmanb at 50mail.com  Thu Jan 14 07:52:44 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 14 Jan 2010 07:52:44 -0500
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <20100114125244.GB59876@sobchak.mgh.harvard.edu>

Hey Peter;
Sounds great to me. Looking forward to being able to use conditional
expressions, collections.defaultdict, functools, and the with
statement. 2.5 had a lot of great stuff.

Brad

> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha).
> 
> Is it time to start phasing out support for Python 2.4?
> 
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrappers) and ElementTree (which we use
> for the phyloXML parser) both of which must currently
> be manually installed for Python 2.4.
> 
> Also ReportLab is talking about dropping support
> for Python 2.4 (another optional dependency of
> Biopython). As far as I know, NumPy haven't yet
> talked about dropping support for Python 2.4.
> 
> I was thinking of the usual deprecation procedure, so
> we'd aim to have at least two releases and one year
> before actually dropping support for Python 2.4. At
> that point older Linux distributions which ship with
> Python 2.4 probably won't be supported anyway.
> 
> e.g. The last version of Ubuntu to have Python 2.4
> as the default was Ubuntu 6.06 LTS (Dapper Drake).
> The desktop edition support ended July 2009, but
> the server edition will be maintained until June 2011.
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From biopython at maubp.freeserve.co.uk  Thu Jan 14 09:52:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 14:52:24 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <20100114125244.GB59876@sobchak.mgh.harvard.edu>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<20100114125244.GB59876@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01001140652v1e11725esa6a2f91fafd0104b@mail.gmail.com>

On Thu, Jan 14, 2010 at 12:52 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hey Peter;
> Sounds great to me. Looking forward to being able to use conditional
> expressions, collections.defaultdict, functools, and the with
> statement. 2.5 had a lot of great stuff.
>
> Brad

I guess there are quite a few good things in Python 2.5+,
although I think the jump from Python 2.3 to 2.4 was more
important (generator expressions and decorators). You'll have
to restrain yourself from using the new toys in Biopython a
little longer though Brad ;)

Since this seems to have raised no immediate objections,
I've sent a message to the main and announcement lists:

http://lists.open-bio.org/pipermail/biopython/2010-January/006111.html
http://lists.open-bio.org/pipermail/biopython-announce/2010-January/000064.html

Assuming there are no objections, we can add a conditional
deprecation warning to setup.py and do a news blog post
(like we did for dropping Python 2.3 early last year):
http://news.open-bio.org/news/2009/05/dropping-python23-support/
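
A minimal sketch of what such a conditional warning could look like.
The function name and message wording are assumptions, not the actual
setup.py code:

```python
import sys
import warnings


def check_python_version(version_info=None):
    """Warn on Python 2.4; return True if a warning was issued."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so any 2.4.x release is caught.
    if tuple(version_info[:2]) == (2, 4):
        warnings.warn(
            "Support for Python 2.4 is being phased out; "
            "please consider moving to Python 2.5 or later.",
            DeprecationWarning,
        )
        return True
    return False
```

Calling check_python_version() near the top of setup.py would leave
users on 2.5+ unaffected. Python 2.4 shows DeprecationWarning by
default; later versions hide that category unless warnings are enabled.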

Peter

From biopython at maubp.freeserve.co.uk  Thu Jan 14 12:32:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 17:32:22 +0000
Subject: [Biopython-dev] [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>
	<4B4F4071.7040601@fold.natur.cuni.cz>
Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com>

On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJŠ
<mmokrejs at fold.natur.cuni.cz> wrote:
>
> Hi Peter,
> I don't quite get this point. What is the problem with stating that with
> Python 2.5+ one does not need to install an extra dependency, while
> for 2.4 one needs _two_ modules?
> I don't think I want BioSQL or sqlite, so why would I have to upgrade?
> If the requirement were a Python language syntax incompatibility then
> I would NOT object, but in this situation ...
> Martin

Hi Martin,

This isn't just the issue of sqlite3 and ElementTree. There
are several benefits to using more recent versions of Python,
for example with an eye on the future for Python 3, and on
a practical level it simplifies our testing to have one less
version to worry about (especially once Python 2.7 is out,
currently scheduled for June 2010).

We've already had minor issues with developers using
Python 2.5+ syntax unwittingly which broke on Python
2.4 (nothing major, and it was easily fixed once the
problem was spotted). If we continue to insist on Python
2.4 support, it may prove problematic if future potential
contributors have existing code written for Python 2.5+
which would require significant re-factoring.

None of these concerns are pressing right now (and
some are hypothetical), but I think you will agree that
Python 2.4 is pretty old, and not widely used anymore.
Having a clear plan in place for dropping it seems a
sensible move, and once that happens we can start
to take advantage of the language and library
improvements Python 2.5 added.

Are you personally using Python 2.4? If so, could you
tell us a little more - for example, is this a university
server which would be difficult to update? Or do you
require some other Python package which requires
Python 2.4?

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Jan 14 13:55:18 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 14 Jan 2010 13:55:18 -0500
Subject: [Biopython-dev] [Bug 2992] New: Adding Uniprot XML file format
	parsing to Biopython
Message-ID: <bug-2992-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2992

           Summary: Adding Uniprot XML file format parsing to Biopython
           Product: Biopython
           Version: 1.53
          Platform: All
               URL: http://github.com/apierleoni/biopython/tree/uniprotxml-
                    branch
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biocomp.unibo.it


UniProt XML formatted files are much easier to parse than the SwissProt flat
file, and are widely used at EMBL for the UniProt, IPI and Integr8 databases.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From andrea at biocomp.unibo.it  Thu Jan 14 13:57:58 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 14 Jan 2010 19:57:58 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
Message-ID: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>

Hi Everyone,
I've been using Biopython a lot over the last couple of years, and it is very
useful to me. So now it's my turn to contribute and be helpful to someone
else.
I wrote a parser for the UniProt XML format that is reasonably fast (8000
entries/min on a mainstream Core 2 Duo PC). The main improvements over the
current SwissProt flat file parser are deeper parsing of the comment fields
and a SeqRecord containing features.

The parser is based on the ElementTree library and was successfully tested
on the complete SwissProt database (v57.12). Thus I think it is ready to
be released.

I followed the rules to develop a new parser for SeqIO, filed an
enhancement bug to bugzilla (bug 2992), and included the parser in a
public biopython fork on github available at:

http://github.com/apierleoni/biopython/tree/uniprotxml-branch

the new parser is in the "uniprotxml-branch" branch, and the parser code
is in Bio/SeqIO/UniprotIO.py

The parser can be used from SeqIO using:

iterator=SeqIO.parse(handle,'uniprot')


I think this could be easily integrated into Biopython. A unit test is still
missing, but should be very easy to write.
In any case, code review and suggestions are welcome.

Andrea


From p.j.a.cock at googlemail.com  Thu Jan 14 14:16:49 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 Jan 2010 19:16:49 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>

On Thursday, January 14, 2010, Andrea Pierleoni <andrea at biocomp.unibo.it> wrote:
> Hi Everyone,
> I've been using a lot biopython in the last couple of years, it is very
> useful to me. So now it's my turn to contribute and be helpful to someone
> else.
> I wrote a parser for the Uniprot XML format, that is reasonably fast (8000
> entries/min on a core2duo mainstream PC). The main improvements with the
> actual SwissProt flat file parser are a deeper parsing of comment fields,
> and a Seqrecord containing features.
>
> The parser is based on the ElementTree library and was successfully tested
> on the complete SwissProt database (v57.12). Thus I think it is ready to
> be released.
>
> I followed the rules to develop a new parser for SeqIO, filed an
> enhancement bug to bugzilla (bug 2992), and included the parser in a
> public biopython fork on github available at:
>
> http://github.com/apierleoni/biopython/tree/uniprotxml-branch
>
> the new parser is in the "uniprotxml-branch" branch, and the parser code
> is in Bio/SeqIO/UniprotIO.py
>
> The parser can be used from SeqIO using:
>
> iterator=SeqIO.parse(handle,'uniprot')
>
>
> I think this could be easily integrated in Biopython, unit test is still
> missing, but should be very easy to do.
> Anyhow any code review or suggestions are welcome.
>
> Andrea
>


Hi

I'd spotted your branch on github - this looks like an excellent
addition to Biopython :)

What I would like to see is a few unit tests, specifically one using
the same record in both XML (with the new parser) and the equivalent
plain text SwissProt file (with the old parser) and check they agree.

Also, I think you should check that the start coordinates of the
features use Python (zero-based) counting.
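
A minimal sketch of the conversion in question, assuming the XML gives
1-based inclusive begin/end positions. The helper name is made up for
illustration:

```python
def to_python_coords(begin, end):
    """Convert a 1-based inclusive range to a 0-based half-open (start, end)."""
    if begin < 1 or end < begin:
        raise ValueError("Expected 1 <= begin <= end, got %r..%r" % (begin, end))
    # Only the start shifts; the inclusive end equals the exclusive Python end.
    return begin - 1, end


seq = "MKTAYIAKQR"
start, stop = to_python_coords(2, 4)  # residues 2..4 in file numbering
fragment = seq[start:stop]            # "KTA"
```

A feature length computed as stop - start then matches the number of
residues, which is an easy thing to assert in a unit test.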

Regards

Peter


From eric.talevich at gmail.com  Thu Jan 14 15:03:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 14 Jan 2010 15:03:35 -0500
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
Message-ID: <3f6baf361001141203i304146a4ld5683190a32b7ffe@mail.gmail.com>

On Thu, Jan 14, 2010 at 1:57 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
> Hi Everyone,
> I've been using a lot biopython in the last couple of years, it is very
> useful to me. So now it's my turn to contribute and be helpful to someone
> else.
> I wrote a parser for the Uniprot XML format, that is reasonably fast (8000
> entries/min on a core2duo mainstream PC). The main improvements with the
> actual SwissProt flat file parser are a deeper parsing of comment fields,
> and a Seqrecord containing features.
>
> The parser is based on the ElementTree library and was successfully tested
> on the complete SwissProt database (v57.12). Thus I think it is ready to
> be released.

Have you tried using this with Python 2.4? The ElementTree module
wasn't added to the standard library until Python 2.5, so a simple
"from xml.etree import ElementTree" may need some additional
protection. It's also nice to let the user use a third-party
implementation of ElementTree if they're stuck on Py2.4.

An example of this is at the top of Bio.Phylo.PhyloXMLIO -- not
pretty, but functional:
http://github.com/biopython/biopython/blob/master/Bio/Phylo/PhyloXMLIO.py
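
A minimal sketch of that kind of guarded import. It is not the actual
PhyloXMLIO code, and it assumes the separately installed package is
importable as "elementtree" on Python 2.4:

```python
try:
    # Standard library location since Python 2.5.
    from xml.etree import ElementTree
except ImportError:
    # Python 2.4: fall back to the third-party elementtree package.
    from elementtree import ElementTree

# The rest of the parser only ever refers to the name ElementTree.
root = ElementTree.fromstring("<entry><name>TEST_HUMAN</name></entry>")
name = root.find("name").text
```

If neither import succeeds, the second ImportError propagates, which
tells the user that an ElementTree implementation is missing.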

-Eric

From p.j.a.cock at googlemail.com  Thu Jan 14 18:04:36 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 Jan 2010 23:04:36 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>

On Thu, Jan 14, 2010 at 10:41 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>
>>
>> By default, copy the "swiss" parser. If that doesn't have the
>> annotation, see if there is anything similar in the "genbank"
>> parser (effectively our reference for rich annotation parsing).
>> If in doubt, for now discard the data with a comment in the
>> code - and then discuss it here.
>>
>> Peter
>>
> I'll take a look at both the swissprot and genbank parsers.
> right now the annotation parsing scheme is based on the XML schema.
> eg.
> <comment type="function">
> <text>function text</text>
> </comment>
>
> is parsed in the annotations as:
>
> seqrecord.annotations['comment_function']=['function text']
>

My reasoning is it should be (almost) transparent for
users to switch from parsing the plain text SwissProt
files ("swiss") to the XML form. There are also knock
on implications for saving to BioSQL and file format
conversions e.g. saving as a GenBank protein file
(aka GenPept format).

However, the comment parsing in the plain text "swiss"
format is currently a little simplistic - partly to match
what BioPerl did at the time. We can revisit that as
part of this work.

Peter

From andrea at biocomp.unibo.it  Fri Jan 15 05:35:39 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 15 Jan 2010 11:35:39 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
Message-ID: <bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>


>
> My reasoning is it should be (almost) transparent for
> users to switch from parsing the plain text SwissProt
> files ("swiss") to the XML form.

This would be good

> There are also knock
> on implications for saving to BioSQL and file format
> conversions e.g. saving as a GenBank protein file
> (aka GenPept format).

The returned SeqRecords are actually BioSQL-safe,
since I can load them into a PostgreSQL BioSQL database.
When formatting a SeqRecord as 'genbank', the dbxrefs,
features, seq, keywords, source and names appear to be reported
correctly, while there is no trace of the other annotations.
I'll look into it further.

>
> However, the comment parsing in the plain text "swiss"
> format is currently a little simplistic - partly to match
> what BioPerl did at the time. We can revisit that as
> part of this work.
>

The main problem here is going to be the comment fields, which the
plain text parser reads as a single string (this is what pushed me to
write the new parser). I tried to keep comment parsing as simple as
possible by just using lists of strings (good for BioSQL), but many comment
types would be better parsed into a dictionary tree.
As of now I left the option to get back the full XML for each comment, by
calling:

UniprotIO.UniprotIterator(handle,return_raw_comments=True)

so all the information in the XML file can be returned and the end user
can decide how to parse it.
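
A minimal sketch of the "lists of strings" approach, using a simplified
XML snippet without namespaces. The function name and sample data are
made up for illustration and this is not the UniprotIO code:

```python
from xml.etree import ElementTree

# Simplified, namespace-free sample; the real files are richer than this.
SAMPLE = """<entry>
  <comment type="function"><text>function text</text></comment>
  <comment type="similarity"><text>Belongs to a test family</text></comment>
  <comment type="function"><text>second function note</text></comment>
</entry>"""


def comment_annotations(entry):
    """Collect <comment> texts into lists keyed as 'comment_<type>'."""
    annotations = {}
    for comment in entry.findall("comment"):
        text = comment.findtext("text")
        if text is None:
            # Comments without a plain <text> child are skipped here.
            continue
        key = "comment_" + comment.get("type", "unknown")
        annotations.setdefault(key, []).append(text)
    return annotations


annotations = comment_annotations(ElementTree.fromstring(SAMPLE))
```

Each key maps to a flat list of strings, so repeated comment types
simply append to the same list.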

Anyhow I think it is better to discuss this when the 'swiss' vs 'uniprot'
unit test is ready.

Andrea


From p.j.a.cock at googlemail.com  Fri Jan 15 06:08:32 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Jan 2010 11:08:32 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>

On Fri, Jan 15, 2010 at 10:35 AM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>>
>> However, the comment parsing in the plain text "swiss"
>> format is currently a little simplistic - partly to match
>> what BioPerl did at the time. We can revisit that as
>> part of this work.
>>
>
> The main problem here is going to be the comment fields, which the
> plain text parser reads as a single string (this is what pushed me to
> write the new parser). I tried to keep comment parsing as simple as
> possible by just using lists of strings (good for BioSQL), but many comment
> types would be better parsed into a dictionary tree.

I think BioPerl now uses some kind of nested tree when parsing the
SwissProt comment block, and I would like us to use something
compatible (e.g. a dictionary tree) in the "swiss" parser (and thus
also the XML parser) in such a way that we end up saving this in
BioSQL the same way.

> As of now I left the option to get back the full XML for each comment, by
> calling:
>
> UniprotIO.UniprotIterator(handle,return_raw_comments=True)
>
> so all the information in the XML file can be returned and the end user
> can decide how to parse it.
>
> Anyhow I think it is better to discuss this when the unit test
> 'swiss'VS'uniprot' is ready.

+1, good plan.

Peter

From bugzilla-daemon at portal.open-bio.org  Fri Jan 15 07:38:49 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 15 Jan 2010 07:38:49 -0500
Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format
In-Reply-To: <bug-2704-42@http.bugzilla.open-bio.org/>
Message-ID: <201001151238.o0FCcnB1017338@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2704


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-15 07:38 EST -------
According to the change log for the just released EMBOSS 6.2:

        Alignment output included headers only for EMBOSS-specific
        formats. The headers have been dropped from the FASTA MARKX0
        through MARKX10 formats to allow standard FASTA suite parsers to
        use the EMBOSS versions of these outputs.

See also:
http://lists.open-bio.org/pipermail/emboss-dev/2009-August/000618.html

Fingers crossed this means we will be able to parse their output
with the "fasta-m10" parser in Bio.AlignIO.



From chapmanb at 50mail.com  Mon Jan 18 08:01:15 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 18 Jan 2010 08:01:15 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
Message-ID: <20100118130115.GA48842@sobchak.mgh.harvard.edu>

Hey all;
After the Google groups discussion kicked off by Istvan last month,
I've been thinking a bit about supplements to mailing list
discussions. I agree that Mailman is not great for searching and
archival purposes; we often see similar questions appear because
finding and browsing the right thread from a past discussion is not
intuitive.

Google groups is okay, but doesn't offer a huge improvement over
mailman. Additionally, reports indicate spamming is pretty bad,
which creates additional moderation headaches.

For handling "how do I do this biology task in Python" questions, what
do people think about something entirely different like Stack Overflow?
This presents a nice interface for asking questions, and the follow
ups are voted up and down by utility so it's easy to see what the
right answer is. Questions there are indexed well by search engines,
so it's also more likely someone might be able to find a previous
answer.

There are actually a couple of questions on there with a Biopython
tag:

http://stackoverflow.com/questions/tagged/biopython

From our point of view, we would need to adjust the documentation to
point out Stack Overflow as a place to ask questions, and then
monitor the biopython tag for new posts.

Mailman is still a great option for implementation discussions, but
Stack Overflow could open up question/answers to a larger audience and
help supplement the cookbook and formal documentation.

Brad

From n.j.loman at bham.ac.uk  Mon Jan 18 08:21:38 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Mon, 18 Jan 2010 13:21:38 +0000
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
Message-ID: <4B546062.3090802@bham.ac.uk>

Brad Chapman wrote:
> For handling "how do I do this biology task in Python" questions, what
> do people think about something entirely different like Stack Overflow?
> This presents a nice interface for asking questions, and the follow
> ups are voted up and down by utility so it's easy to see what the
> right answer is. Questions there are indexed well by search engines,
> so it's also more likely someone might be able to find a previous
> answer.
>   
Hi Brad

Great suggestion, I have been thinking along the same lines. I really
like the design of the Stack Exchange sites, it is a great way of
exchanging Q&A information.

It is worth mentioning that Stackoverflow is not the only site using the
"Stack Exchange" format that is relevant.

Here is a link to various other Stack Exchange sites:
http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family

Although there are Biopython questions in Stackoverflow, I wonder
whether that is the correct place for questions, or whether it would be
overall more productive to have a resource for bioinformatics? I think
bioinformatics is the correct breadth of topic to keep a large enough
community together whilst not being too off-topic.

I have registered http://bioinformatics.stackexchange.com/ and will
happily make you and anyone else who is interested an admin.

Does the list think there could be enough community interest to justify
a separate site like this?

Cheers,

Nick.


From chapmanb at 50mail.com  Mon Jan 18 09:20:10 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 18 Jan 2010 09:20:10 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <4B546062.3090802@bham.ac.uk>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
	<4B546062.3090802@bham.ac.uk>
Message-ID: <20100118142010.GE48842@sobchak.mgh.harvard.edu>

Hi Nick;

> Great suggestion, I have been thinking along the same lines. I really
> like the design of the Stack Exchange sites, it is a great way of
> exchanging Q&A information.
> 
> It is worth mentioning that Stackoverflow is not the only site using the
> "Stack Exchange" format that is relevant.
> 
> Here is a link to various other Stack Exchange sites:
> http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family

Awesome. Thanks for the pointer. Sounds like you have a great handle
on this.

> Although there are Biopython questions in Stackoverflow, I wonder
> whether that is the correct place for questions, or whether it would be
> overall more productive to have a resource for bioinformatics? I think
> bioinformatics is the correct breadth of topic to keep a large enough
> community together whilst not being too off-topic.
> 
> I have registered http://bioinformatics.stackexchange.com/ and will
> happily make you and anyone else who is interested an admin.
> 
> Does the list think there could be enough community interest to justify
> a separate site like this?

It looks like there are a couple of Stack Exchange sites with
similar aims for open source bioinformatics and chemistry:

http://biostar.stackexchange.com/
http://blueobelisk.stackexchange.com/

If we go this way we might want to talk to the owners of these sites
and integrate with them.

My preference would be to go with the main StackOverflow site and
carve out our niche with the tagging system. We build off of an
existing community instead of needing to help grow one. Some of the
more successful biology communities, like the one on Friendfeed,
benefit from input outside of the standard community:

http://friendfeed.com/the-life-scientists

I think this would be less likely with a dedicated site, as that
fortuitous crosstalk is prevented by other programmers never
thinking to look at a bioinformatics only site.

Happy to hear what others think,
Brad

From biopython at maubp.freeserve.co.uk  Mon Jan 18 10:58:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 Jan 2010 15:58:27 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
	<320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
	<bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>
Message-ID: <320fb6e01001180758t179f5ccdo99132e4b10b907bb@mail.gmail.com>

On Wed, Jan 13, 2010 at 5:37 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> My main thought was to make it so that users can write a single script that
> would work on any Python system (eventually IronPython as well). Because
> the current system expects the user to request a specific driver (MySQLdb)
> that happens to be system specific, it forces user code to be system
> specific.

Yes, it does - as long as Jython or any other Python implementation
doesn't support that driver. In the case of SQLite, it sounds like adding
sqlite3 support to Jython is planned at least.

> One alternative would be to use the strings you describe below, but in
> addition add special requests that would check the system and pull the
> appropriate driver automatically.
> 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, use
> org.gjt.mm.mysql.Driver if in Jython.
> Otherwise, if the user wants to use a specific driver, they pass its name.

Maybe rather than specifying the driver, the user could specify the
database back end (MySQL, PostgreSQL, SQLite, ...) and providing
we know about this in advance, we can look up and try relevant
drivers automatically. We could offer this in combination with the
existing driver specifier. This seems cleaner than overloading the
driver argument.
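
As a rough sketch of that idea (the lookup table, the function name
find_driver and the order in which drivers are tried are all illustrative
assumptions, not existing BioSQL code), a back end name could be resolved
to the first driver module that imports on the running Python:

```python
import importlib

# Hypothetical table of driver modules to try for each back end, in order.
# The module names are real packages, but the selection and ordering here
# are assumptions made for illustration only.
CANDIDATE_DRIVERS = {
    "mysql": ["MySQLdb", "pymysql"],
    "postgresql": ["psycopg2", "pgdb"],
    "sqlite": ["sqlite3"],
}


def find_driver(backend=None, driver=None):
    """Return a DB-API module for an explicit driver or a named back end."""
    if driver is not None:
        # Existing behaviour: the caller names the driver module directly.
        return importlib.import_module(driver)
    if backend is None:
        raise ValueError("Give either a back end or a driver name")
    candidates = CANDIDATE_DRIVERS.get(backend.lower())
    if candidates is None:
        raise ValueError("Unknown database back end: %r" % (backend,))
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not available on this Python, try the next one
    raise ImportError(
        "No driver found for %s (tried %s)" % (backend, ", ".join(candidates))
    )
```

With this shape the driver argument keeps its current meaning, and a
Jython-specific driver would only need an extra entry in the table.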

Peter


From biopython at maubp.freeserve.co.uk  Mon Jan 18 11:33:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 Jan 2010 16:33:42 +0000
Subject: [Biopython-dev] EMBOSS eprimer3 parser
Message-ID: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>

Hi all,

Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in
Biopython? I'd like someone to look over Leighton's proposed enhancements
to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968

There are two main issues. First, the current code doesn't cope with multiple
primer sets (so Leighton introduces read/parse functions in line with other
modules for single or multiple sets of primers). This seems entirely sensible
to me, and worthwhile in itself.
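
For anyone not familiar with the read/parse convention, here is a
minimal sketch of it. This is not Leighton's code: the toy "format"
(records separated by blank lines) is a stand-in assumption so that the
example is self-contained.

```python
import io


def parse(handle):
    """Yield one record (a list of lines) per blank-line-separated block."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip():
            record.append(line)
        elif record:
            yield record
            record = []
    if record:
        yield record


def read(handle):
    """Return the only record in the handle, or raise ValueError."""
    records = parse(handle)
    first = next(records, None)
    if first is None:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first


# parse() copes with any number of records, read() insists on exactly one.
many = list(parse(io.StringIO("set one\n\nset two\n")))
single = read(io.StringIO("only set\n"))
```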

Second, Leighton makes some changes to the primer record objects.
I'm not so sure about the necessity here, even if it is backwards
compatible, but I haven't really used this code. What do the rest of
you think?

Peter

From istvan.albert at gmail.com  Mon Jan 18 13:02:23 2010
From: istvan.albert at gmail.com (Istvan Albert)
Date: Mon, 18 Jan 2010 13:02:23 -0500
Subject: [Biopython-dev] Biopython-dev Digest, Vol 84, Issue 14
In-Reply-To: <mailman.9.1263834003.27796.biopython-dev@lists.open-bio.org>
References: <mailman.9.1263834003.27796.biopython-dev@lists.open-bio.org>
Message-ID: <c878cd561001181002n34fa6bebvef7153f538b5bbc4@mail.gmail.com>

On Mon, Jan 18, 2010 at 12:00 PM,
<biopython-dev-request at lists.open-bio.org> wrote:


> It looks like there are a couple of Stack Exchange sites with
> similar aims for open source bioinformatics and chemistry:
>
> http://biostar.stackexchange.com/
> http://blueobelisk.stackexchange.com/

I am actually the original creator of
http://biostar.stackexchange.com/ Created mainly to give my students a
way to easily ask questions.

A few things to keep in mind:

- it will cost money to run it, right now it is free due to it being in beta
- it is not obvious that this service will actually be offered once
beta concludes, or that it will be offered with the same conditions.
That is pretty much what keeps me from investing more time into it.
- making it a site like this only for biopython is too restrictive

Other comments on using the main Stack Overflow site: because the
site's focus is generic programming, I think most people looking for
bioinformatics-related information could easily get lost or not feel
a connection.

IMO the idea is fantastic, but it needs its own forum rather than
being a small subset of unrelated topics.

best,

Istvan


-- 
Istvan Albert
http://www.personal.psu.edu/iua1

From biopython at maubp.freeserve.co.uk  Tue Jan 19 05:49:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 10:49:31 +0000
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
Message-ID: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>

Hi Eric (and everyone else),

I just spotted the to_adjacency_matrix function in utils:
http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py

The docstring says:

> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>
> Also returns a list of all clades in tree ("allclades"), where the position
> of each clade in the list corresponds to a row and column of the numpy
> array. So, a cell i,j in the array represents the length of the branch from
> allclades[i] to allclades[j].
>
> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
> and adjacency_matrix is a NumPy 2D array.

It looks like your adjacency matrix starts as a numpy array of zeros,
and then you set some edges to branch lengths. How do you tell
apart a non-connection and a real connection of length zero? These
do occur, for example if you have three identical sequences, then
you might expect a single node with three children. However IIRC,
in (some) NJ trees each node has two children by construction,
so you get an extra node connected with a branch of length zero.
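
To make the ambiguity concrete, here is a small sketch. It uses nested
lists rather than a NumPy array to stay dependency-free, and the node
numbering and branch lengths are made up for illustration.

```python
# Four nodes: 0 -> 1 is a real branch of length zero, and node 1 has
# two children (2 and 3). Each tuple is (parent, child, branch length).
branches = [(0, 1, 0.0), (1, 2, 0.1), (1, 3, 0.2)]
n = 4

# Start from all zeros, then fill in the branch lengths.
matrix = [[0.0] * n for _ in range(n)]
for parent, child, length in branches:
    matrix[parent][child] = length

# The real zero-length branch 0 -> 1 now looks exactly like the
# missing connection 0 -> 2.
ambiguous = matrix[0][1] == matrix[0][2]
```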

Peter

From eric.talevich at gmail.com  Tue Jan 19 10:22:30 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 19 Jan 2010 10:22:30 -0500
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
Message-ID: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>

On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Eric (and everyone else),
>
> I just spotted the to_adjacency_matrix function in utils:
> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>
> The docstring says:
>
>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>
>> Also returns a list of all clades in tree ("allclades"), where the position
>> of each clade in the list corresponds to a row and column of the numpy
>> array. So, a cell i,j in the array represents the length of the branch from
>> allclades[i] to allclades[j].
>>
>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>> and adjacency_matrix is a NumPy 2D array.
>
> It looks like your adjacency matrix starts as a numpy array of zeros,
> and then you set some edges to branch lengths. How do you tell
> apart a non-connection and a real connection of length zero? These
> do occur, for example if you have three identical sequences, then
> you might expect a single node with three children. However IIRC,
> in (some) NJ trees each node has two children by construction,
> so you get an extra node connected with a branch of length zero.

Shoot, you're right. I can think of three reasonable mitigations:
(a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
adjacency -- this seems more standard in textbooks, actually.
(b) Issue a warning or raise an error if the given tree contains a
0-length branch.
(c) Delete the function.
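
Options (a) and (b) might look roughly like this. It is only a sketch
over a plain list of (parent, child, length) tuples with made-up node
indices, not the Bio.Phylo tree objects, and both function names are
hypothetical.

```python
def boolean_adjacency(n, branches):
    """Option (a): True where a branch exists, whatever its length."""
    matrix = [[False] * n for _ in range(n)]
    for parent, child, _length in branches:
        matrix[parent][child] = True
    return matrix


def check_no_zero_branches(branches):
    """Option (b): refuse trees that contain a zero-length branch."""
    for parent, child, length in branches:
        if length == 0:
            raise ValueError(
                "Zero-length branch from node %d to node %d" % (parent, child)
            )


branches = [(0, 1, 0.0), (1, 2, 0.1), (1, 3, 0.2)]
adjacency = boolean_adjacency(4, branches)
```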

Which do you recommend?

The idea was to give mathematicians something to play with. For
example, Chapter 2 of this report represents phylogenies this way,
using 0 or 1 to indicate the presence of a branch:
http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf

Thanks for the heads-up,
Eric


From biopython at maubp.freeserve.co.uk  Tue Jan 19 10:47:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 15:47:39 +0000
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
	<3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
Message-ID: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>

On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi Eric (and everyone else),
>>
>> I just spotted the to_adjacency_matrix function in utils:
>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>
>> The docstring says:
>>
>>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>>
>>> Also returns a list of all clades in tree ("allclades"), where the position
>>> of each clade in the list corresponds to a row and column of the numpy
>>> array. So, a cell i,j in the array represents the length of the branch from
>>> allclades[i] to allclades[j].
>>>
>>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>>> and adjacency_matrix is a NumPy 2D array.
>>
>> It looks like your adjacency matrix starts as a numpy array of zeros,
>> and then you set some edges to branch lengths. How do you tell
>> apart a non-connection and a real connection of length zero? These
>> do occur, for example if you have three identical sequences, then
>> you might expect a single node with three children. However IIRC,
>> in (some) NJ trees each node has two children by construction,
>> so you get an extra node connected with a branch of length zero.
>
> Shoot, you're right. I can think of three reasonable mitigations:
> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
> adjacency -- this seems more standard in textbooks, actually.
> (b) Issue a warning or raise an error if the given tree contains a
> 0-length branch.
> (c) Delete the function.
>
> Which do you recommend?
>
> The idea was to give mathematicians something to play with. For
> example, Chapter 2 of this report represents phylogenies this way,
> using 0 or 1 to indicate the presence of a branch:
> http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf
>
> Thanks for the heads-up,
> Eric

I did wonder about further options,

(d) Since the distances are floats, we can use a NA as
a flag for no connection. However, this does not seem
very useful.

(e) Collapse nodes separated by a zero length branch
while building the adjacency matrix.

Or, raise an error (b) but provide a tree method to collapse
nodes separated by a zero length branch which could be
called to "clean up" a problematic tree before making the
adjacency matrix.

None of these options seem ideal :(

I would say the boolean matrix (a) is safe but is of limited utility.
Therefore (c), remove the function for now is probably best. It
can always be re-added in a later release if a good solution is
agreed.

Peter

P.S. Another potentially interesting thing would be a matrix using
the bootstrap support values (where again you have a problem
with zero bootstrap support vs no connection). I'm not sure if this
has any practical uses though.


From eric.talevich at gmail.com  Tue Jan 19 23:08:16 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 19 Jan 2010 23:08:16 -0500
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> 
	<3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> 
	<320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
Message-ID: <3f6baf361001192008y244912aaieb7c8d2c0399903e@mail.gmail.com>

On Tue, Jan 19, 2010 at 10:47 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>> Hi Eric (and everyone else),
>>>
>>> I just spotted the to_adjacency_matrix function in utils:
>>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>>
>>> It looks like your adjacency matrix starts as a numpy array of zeros,
>>> and then you set some edges to branch lengths. How do you tell
>>> apart a non-connection and a real connection of length zero?
>>
>> Shoot, you're right. I can think of three reasonable mitigations:
>> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
>> adjacency -- this seems more standard in textbooks, actually.
>> (b) Issue a warning or raise an error if the given tree contains a
>> 0-length branch.
>> (c) Delete the function.
>>
>> Which do you recommend?
>> ....
>
> I did wonder about further options,
>
> (d) Since the distances are floats, we can use a NA as
> a flag for no connection. However, this does not seem
> very useful.

Or infinity -- I think that's reasonably common in graph algorithms
that use a matrix representation.
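
A sketch of the infinity variant, again with nested lists and made-up
node indices rather than the real function (weighted_adjacency is a
hypothetical name):

```python
import math

INF = float("inf")


def weighted_adjacency(n, branches, missing=INF):
    """Branch lengths where connected, `missing` everywhere else."""
    matrix = [[missing] * n for _ in range(n)]
    for parent, child, length in branches:
        matrix[parent][child] = length
    return matrix


branches = [(0, 1, 0.0), (1, 2, 0.1), (1, 3, 0.2)]
matrix = weighted_adjacency(4, branches)

# A zero-length branch is now distinct from no connection at all.
zero_branch_present = not math.isinf(matrix[0][1])
no_branch = math.isinf(matrix[0][2])
```

The same function works with NaN as the sentinel, but then the test has
to be math.isnan(), since NaN never compares equal to itself.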

Anyway, I commented it out for now. The main problem is that I don't
have a clear use case for the function at the moment, just a notion
that it could be useful for some novel statistical analysis or
possibly rooting an unrooted tree based on a molecular clock. I'll
look at other libraries to see how they use adjacency matrices, if at
all.


> (e) Collapse nodes separated by a zero length branch
> while building the adjacency matrix.
>
> Or, raise an error (b) but provide a tree method to collapse
> nodes separated by a zero length branch which could be
> called to "clean up" a problematic tree before making the
> adjacency matrix.

Should be easy enough for the user to do manually:

for clade in tree.find_clades(branch_length=0):
    tree.collapse(clade)

I'm going to do some serious work on the wiki documentation soon so
this sort of operation should be fairly apparent to users.


> P.S. Another potentially interesting thing would be a matrix using
> the bootstrap support values (where again you have a problem
> with zero bootstrap support vs no connection). I'm not sure if this
> has any practical uses though.

Well, the commented-out code is still visible if any brave scientist
is interested in modifying it for this purpose. I'm reading Joe
Felsenstein's book right now, so I'll probably get the urge to add
more mathy toys to Bio.Phylo soon. I'll check with the list before
committing them to the trunk, though. ;)

From p.j.a.cock at googlemail.com  Wed Jan 20 11:16:58 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 16:16:58 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
Message-ID: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>

On Fri, Jan 15, 2010 at 11:08 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Anyhow I think it is better to discuss this when the unit test
>> 'swiss'VS'uniprot' is ready.
>
> +1, good plan.

Something I should have mentioned earlier (I forgot this wasn't
checked in yet) was feature support in the existing "swiss" plain
text parser - hopefully we can get that working nicely as part of
this XML work:

http://bugzilla.open-bio.org/show_bug.cgi?id=2235

Peter

From andrea at biocomp.unibo.it  Wed Jan 20 11:57:47 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Wed, 20 Jan 2010 17:57:47 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
Message-ID: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>

>
> Something I should have mentioned earlier (I forgot this wasn't
> checked in yet) was feature support in the existing "swiss" plain
> text parser - hopefully we can get that working nicely as part of
> this XML work:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> Peter
>

I know that the plain text swissprot parser can parse features, but
last time I checked these features were not included in SeqRecords
generated by Bio.SeqIO.
If the two parsers have to report similar results, then the 'swiss'
format in Bio.SeqIO must report features too.
I made a few changes to the original parser to map data as close as
possible to the plain text parser (available on github).

However the big issue is going to be the comment field:
- 1 big string in the plain text parser
- several annotation fields in the XML parser.

I think that obtaining the same results is going to be difficult.
It is hard to map the big string to many annotations (very error prone)
and is also hard to map many annotations to a single string...

Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
comparison between the two parsed seqrecords.

Andrea


From p.j.a.cock at googlemail.com  Wed Jan 20 12:14:18 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 17:14:18 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>

On Wed, Jan 20, 2010 at 4:57 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>>
>> Something I should have mentioned earlier (I forgot this wasn't
>> checked in yet) was feature support in the existing "swiss" plain
>> text parser - hopefully we can get that working nicely as part of
>> this XML work:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>>
>> Peter
>>
>
> I know that the plain text swissprot parser can parse features, but
> last time I checked these features were not included in SeqRecords
> generated by Bio.SeqIO.
> If the two parsers have to report similar results, then the 'swiss'
> format in Bio.SeqIO must report features too.

Yes, there is an old patch on Bug 2235 to do this:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

> I made a few changes to the original parser to map data as close as
> possible to the plain text parser (available on github).
>
> However the big issue is going to be the comment field:
> - 1 big string in the plain text parser
> - several annotation fields in the XML parser.
>
> I think that obtaining the same results is going to be difficult.
> It is hard to map the big string to many annotations (very error prone)
> and is also hard to map many annotations to a single string...
>
> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
> comparison between the two parsed seqrecords.

Great.

Peter

From andrea at biocomp.unibo.it  Thu Jan 21 07:01:30 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 21 Jan 2010 13:01:30 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
Message-ID: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>


>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>> detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>


As mentioned earlier, Mauro did a code review and added unit test for the
parser in Tests/test_Uniprot.py
the updated version is available on the github repository:
http://github.com/apierleoni/biopython

Since this version is mature enough I spent some time comparing the output
from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
This comparison was done using the Q13639 UniProt entry.

These are the main differences between the two generated SeqRecords:

- id:  is the same (first accession)
- name: is the same
- description: UP reports the recommended name (full name value), while
       additional names and synonyms are in the annotations. SP reports a
       long string containing everything, parsed as is from the plain
       text.
- dbxrefs: UP reports all the dbxref of SP, adding DOI, MEDLINE, PubMed,
       NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
       date_last_annotation_update in UP---> modified in SP
       date_last_sequence_update in UP---> sequence_modified in SP
       gene_name_primary in UP---> gene_name in SP
               >>> SP.annotations['gene_name']
               'Name=HTR4;'
               >>> UP.annotations['gene_name_primary']
               'HTR4'
       ncbi_taxid in SP ---> UP dbxrefs since it is mapped as a
                dbReference in the xmlfile
- - references: has some minor differences.
        Final semicolon and double quote missing in UP for both author
            and title fields.
        In UP reference comments are reported as:
            "PublicationType | PublicationDate | Scope | Tissue"
        For submission publication type the db is reported in comments
            and not in journal field.
- - comments: here come the big differences.
       SP has the comments in a single string.
       UP comments are mapped to several annotation entries using comment
          type and attributes to build the annotation key.
          Eg.
          comment_function --> list of  "function" type comment strings
          comment_subcellularlocation_location --> list of  "location"
               strings in the subcellularlocation comment field

       The comments tree in the XML could easily be mapped to a comment
       dictionary tree, but this would not be BioSQL safe.
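
To illustrate how the flat comment keys are built, here is a simplified
sketch. It is not the parser code from the repository: the XML fragment
is a made-up, namespace-free stand-in for a real UniProt entry, the
comment texts are placeholders, and flatten_comments is a hypothetical
name.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment; a real UniProt XML entry carries a
# namespace and many more elements.
XML = """<entry>
  <comment type="function"><text>Placeholder function text.</text></comment>
  <comment type="subcellular location">
    <subcellularLocation><location>Placeholder site A</location></subcellularLocation>
    <subcellularLocation><location>Placeholder site B</location></subcellularLocation>
  </comment>
</entry>"""


def flatten_comments(entry):
    """Map comment elements to flat annotation keys holding string lists."""
    annotations = {}
    for comment in entry.findall("comment"):
        # "subcellular location" -> "subcellularlocation"
        ctype = comment.get("type").replace(" ", "")
        text = comment.find("text")
        if text is not None:
            annotations.setdefault("comment_%s" % ctype, []).append(text.text)
        for location in comment.iter("location"):
            key = "comment_%s_location" % ctype
            annotations.setdefault(key, []).append(location.text)
    return annotations


annotations = flatten_comments(ET.fromstring(XML))
```

Every value stays a flat list of strings under a single key, which is
the property that a nested dictionary tree would lose.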


Andrea


From biopython at maubp.freeserve.co.uk  Thu Jan 21 07:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML
	in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).
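
To show the kind of thing I mean on the Python side, here is a sketch
of serialising a nested structure to XML. The element names, the nesting
and the example values are assumptions for illustration only; they are
not taken from BioPerl's TagTree output, which is exactly what needs
clarifying.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively turn nested dicts/lists/strings into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A list means the same tag repeated several times.
            items = child if isinstance(child, list) else [child]
            for item in items:
                elem.append(to_xml(key, item))
    else:
        elem.text = str(value)
    return elem


# Made-up description data in the spirit of the DE line fields.
description = {
    "RecName": {"Full": "Example protein"},
    "AltName": [{"Full": "Example synonym", "Short": "EX-1"}],
}
xml_text = ET.tostring(to_xml("description", description), encoding="unicode")
```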

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.

Regards,

Peter

From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 08:13:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 08:13:09 -0500
Subject: [Biopython-dev] [Bug 2997] New: Ignore comments in SCOP parsable
	files
Message-ID: <bug-2997-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997

           Summary: Ignore comments in SCOP parsable files
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: 2008 at thomas-holder.de


I could not load SCOP parsable files with Bio.SCOP unless I removed the comment
lines. The parser should just skip these lines.
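
A minimal sketch of the requested behaviour, assuming that comment lines
are the ones starting with '#'. This is not the attached patch, the
helper name skip_comments is hypothetical, and the sample lines are
placeholders rather than real SCOP data.

```python
import io


def skip_comments(handle, marker="#"):
    """Yield only the lines that do not start with the comment marker."""
    for line in handle:
        if line.startswith(marker):
            continue
        yield line


# Placeholder content standing in for a SCOP parsable file.
sample = io.StringIO("# header comment\nrecord one\n# another comment\nrecord two\n")
records = [line.rstrip("\n") for line in skip_comments(sample)]
```

Wrapping the handle like this would leave the existing record parsing
untouched, since it only ever sees the non-comment lines.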


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 08:14:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 08:14:59 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001211314.o0LDExim005529@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #1 from 2008 at thomas-holder.de  2010-01-21 08:14 EST -------
Created an attachment (id=1432)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1432&action=view)
patch to skip comment lines in SCOP parsable files


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mauro at biodec.com  Thu Jan 21 15:09:28 2010
From: mauro at biodec.com (Mauro)
Date: Thu, 21 Jan 2010 21:09:28 +0100
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
	<43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
Message-ID: <4B58B478.4000703@biodec.com>

On 01/21/2010 01:01 PM, Andrea Pierleoni wrote:
>
>>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>>> detailed
>>> comparison between the two parsed seqrecords.
>>
>> Great.
>>
>> Peter
>>
>
>
> As mentioned earlier, Mauro did a code review and added unit test for the
> parser in Tests/test_Uniprot.py
> the updated version is available on the github repository:
> http://github.com/apierleoni/biopython
>
> Since this version is mature enough I spent some time comparing the output
> from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
> This comparison was done using the Q13639 UniProt entry.

I also made a test for this case. Currently the test fails; you can see
the report made by Andrea below. If we agree on the differences between
the seqrecords, I will do the work to change the test.

Mauro.

>
> These are the main differences between the two generated SeqRecords:
>
> - id: is the same (first accession)
> - name: is the same
> - description: UP reports the full name value of the recommended name, while
>         additional names and synonyms are in the annotations. SP reports a
>         long string containing everything parsed as it is from the plain
>         text.
> - dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
>         NCBI Taxonomy and Swiss-Prot/TrEMBL dbxrefs
> - seq: is the same
> - features: missing in SP (I have to check with Peter's patch)
> - annotations:
> - - identical annotations: accessions, keywords, taxonomy, organism
> - - mapped annotations:
>         date_last_annotation_update in UP--->  modified in SP
>         date_last_sequence_update in UP--->  sequence_modified in SP
>         gene_name_primary in UP--->  gene_name in SP
>                 >>>  SP.annotations['gene_name']
>                 'Name=HTR4;'
>                 >>>  UP.annotations['gene_name_primary']
>                 'HTR4'
>         ncbi_taxid in SP --->  UP dbxrefs since it is mapped as a
>                  dbReference in the xmlfile
> - - references: has some minor differences.
>          Final semicolon and double quote missing in UP for both author
>              and title fields.
>          In UP reference comments are reported as:
> 	    "PublicationType | PublicationDate | Scope | Tissue"
> 	For submission publication type the db is reported in comments
>              and not in journal field.
> - - comments: here come the big differences.
>         SP has comments in a single string.
>         UP comments are mapped to several annotation entries using comment
>            type and attributes to build the annotation key.
>            Eg.
>            comment_function -->  list of  "function" type comment strings
>            comment_subcellularlocation_location -->  list of  "location"
>                 strings in the subcellularlocation comment field
>
>         The comments tree in XML could easily be mapped to a comment
>         dictionary tree, but this would not be BioSQL safe.
>
>
> Andrea
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
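
The key mapping in Andrea's report can be sketched as a small comparison
helper. This is only an illustration using plain dictionaries in place of
SeqRecord.annotations; the names UP_TO_SP_KEYS, normalise_sp_gene_name and
compare_annotations are hypothetical and not part of either parser.

```python
# Hypothetical mapping from UniProt XML (UP) annotation keys to the
# SwissProt plain text (SP) keys, as listed in the comparison above.
UP_TO_SP_KEYS = {
    "date_last_annotation_update": "modified",
    "date_last_sequence_update": "sequence_modified",
    "gene_name_primary": "gene_name",
}


def normalise_sp_gene_name(value):
    """Reduce an SP value such as 'Name=HTR4;' to the bare name 'HTR4'."""
    first_field = value.split(";")[0].strip()
    if first_field.startswith("Name="):
        return first_field[len("Name="):]
    return first_field


def compare_annotations(up_annotations, sp_annotations):
    """Return the UP keys whose mapped SP value is missing or different."""
    mismatches = []
    for up_key, sp_key in UP_TO_SP_KEYS.items():
        up_value = up_annotations.get(up_key)
        sp_value = sp_annotations.get(sp_key)
        if sp_key == "gene_name" and sp_value is not None:
            sp_value = normalise_sp_gene_name(sp_value)
        if up_value != sp_value:
            mismatches.append(up_key)
    return mismatches
```

A unit test could assert that compare_annotations() returns an empty list
once the agreed differences have been normalised away.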


From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 18:58:29 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 18:58:29 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001212358.o0LNwTIB022421@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp  2010-01-21 18:58 EST -------
Can you give an example of a SCOP file that contains such comment lines?



From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 03:42:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 03:42:28 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001220842.o0M8gSDv003709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #3 from 2008 at thomas-holder.de  2010-01-22 03:42 EST -------
(In reply to comment #2)
> Can you give an example of a SCOP file that contains such comment lines?

I want to parse these files:
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.hie.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.com.scop.txt_1.75

They all start with 4 comment lines (release and copyright information).
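
A minimal sketch of the requested behaviour, assuming the comment lines
start with "#". This is not the patch attached to the bug; the function name
iter_data_lines and the sample lines are made up for illustration.

```python
import io


def iter_data_lines(handle):
    """Yield non-comment, non-blank lines from a SCOP parsable file handle."""
    for line in handle:
        if line.startswith("#"):
            continue  # release and copyright header
        line = line.rstrip("\n")
        if line:
            yield line


# Made-up sample in the style of a dir.des.scop.txt file.
example = io.StringIO(
    "# hypothetical header line one\n"
    "# hypothetical header line two\n"
    "46456\tcl\ta\t-\tAll alpha proteins\n"
)
records = list(iter_data_lines(example))
```

Each yielded line can then be passed to the existing record parser unchanged.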



From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 06:08:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 06:08:34 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001221108.o0MB8YkZ008581@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp  2010-01-22 06:08 EST -------
Applied your patch; thanks.



From andrea at biocomp.unibo.it  Fri Jan 22 07:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think the point here can be made a little broader, since it is not only the
SwissProt DE lines that carry complex and structured data.
Defining a common, language-independent way to store structured data in
the comment and *_qualifier_value tables of the current BioSQL schema could
be very useful.
XML looks like a good candidate to me, and the UniProt XML format can be
used as a reference or as a template to start from.
Each Bio* project would then parse and report this structured data in its
own programming language's data structures.

Andrea


> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
>
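
The idea of serialising structured annotation as XML before storing it as a
single text value could be sketched as below. This is an illustration only:
the helper dict_to_element is hypothetical, the element names merely imitate
UniProt XML, the values are invented, and nothing here reflects what BioPerl
actually stores in BioSQL.

```python
import xml.etree.ElementTree as ET


def dict_to_element(tag, value):
    """Recursively convert nested dicts, lists and strings into an Element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            # A list becomes repeated child elements with the same tag.
            children = child_value if isinstance(child_value, list) else [child_value]
            for child in children:
                element.append(dict_to_element(child_tag, child))
    else:
        element.text = str(value)
    return element


# Hypothetical description structure; names and values are illustrative only.
description = {
    "recommendedName": {"fullName": "Example protein", "shortName": ["EP", "ExP"]},
}
xml_text = ET.tostring(dict_to_element("protein", description), encoding="unicode")
```

The resulting string could be stored in a text column, and each Bio* project
could parse it back into its own native structure.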


From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 13:43:19 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 13:43:19 -0500
Subject: [Biopython-dev] [Bug 2998] New: mac error during build in 10.6.1
Message-ID: <bug-2998-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998

           Summary: mac error during build in 10.6.1
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Mac OS
            Status: NEW
          Severity: major
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: emeryl at uw.edu


When I download the file biopython-1.53.tar.gz, uncompress it, and run 
python setup.py build
I get an error saying gcc4.0 failed with exit code 1, among many lines of
errors.  Looking more closely, it appears the build process is trying to use an
older version of the SDK, which is not installed by Xcode tools by default.  It
is trying to use /Developer/SDKs/MacOSX10.4u.sdk.  On a clean install of 10.6.1
(Snow Leopard) only the SDKs for 10.5 and 10.6 are installed by the Xcode tools
installer without changing options.  When I reinstall the Xcode tools and this
time check a box to install 10.4 support, this 10.4 sdk is installed and the
build works flawlessly.  This would be a difficult fix to track down for many
casual users of BioPython who do not understand the Xcode tools.
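
A quick way to check for the SDK before building could look like the sketch
below. It only tests the path named in this report; the helper sdk_installed
is hypothetical and not part of Biopython's setup.py.

```python
import os

# Path reported in this bug; other SDK locations are not checked here.
SDK_10_4 = "/Developer/SDKs/MacOSX10.4u.sdk"


def sdk_installed(path=SDK_10_4):
    """Return True if the given SDK directory exists on this machine."""
    return os.path.isdir(path)


if not sdk_installed():
    print("Mac OS X 10.4 SDK not found at %s" % SDK_10_4)
```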



From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 14:15:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 14:15:59 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001221915.o0MJFxoa024953@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |normal
            Summary|mac error during build in   |Document need XCode with
                   |10.6.1                      |10.4 SDK for Mac OS


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-22 14:15 EST -------
Snow Leopard has caused all sorts of trouble for compiling Python extensions
(this is not specific to Biopython).

This has been discussed on our mailing list, and simply installing the
Mac OS 10.4 SDK option with XCode seems to be the best solution. I've just
updated the download page to try and clarify this. Is that better? This
is a wiki page so you can edit it:
http://biopython.org/wiki/Download

I'm leaving this bug open to remind us to add a similar note to the main
installation document:
http://github.com/biopython/biopython/blob/master/Doc/install/Installation.tex

Do you have any other suggestions? Thanks.



From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 15:36:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 15:36:36 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001222036.o0MKaaZ4027368@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


------- Comment #2 from emeryl at uw.edu  2010-01-22 15:36 EST -------
(In reply to comment #1)

That's a good solution, but I also added this small clarification:

You will need to have installed Apple's XCode tools including the optional 10.4
SDK  (check the option for 10.4 support when installing Xcode tools).



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 05:56:32 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 05:56:32 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001251056.o0PAuWDI010933@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-25 05:56 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> 
> That's a good solution, but I also added this small clarification:
> 
> You will need to have installed Apple's XCode tools including the optional 10.4
> SDK  (check the option for 10.4 support when installing Xcode tools).
>

Thanks - I've now updated the main installation document in our repository
(which we'll use to update the install PDF and HTML at the next release).

Marking bug as fixed.

Peter



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:16:27 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:16:27 -0500
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260116.o0Q1GR1c002063@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:17:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:17:41 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not
	record molecule type or if circular
In-Reply-To: <bug-2578-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260117.o0Q1Hfdb002091@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:19:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:19:47 -0500
Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects
In-Reply-To: <bug-2597-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260119.o0Q1JlhK002189@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2597


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:27:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:27:14 -0500
Subject: [Biopython-dev] [Bug 2999] New: SeqIO.parse() or
	record.format("genbank") converts input sequence to uppercase or
Message-ID: <bug-2999-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2999

           Summary: SeqIO.parse() or record.format("genbank") converts input
                    sequence to uppercase or
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz


I do not know where the problem is coming from, but if I parse a GenBank file
with a lowercase sequence (EST) and print it back through
record.format("genbank"), I get it all in uppercase. I think the
upper/lower-casing should never be altered unless explicitly requested by the
user.

for _record in SeqIO.parse(_infile, options.format):
    # silly, imagine I hit "gi|14150838|gb|AAK54648.1|AF376133_1" from
    #   a FASTA file :(
    if _record.id in _ids:
        _outfile.write(_record.format("fasta"))
    elif options.format == "genbank":
        if _record.annotations['gi'] in _ids:
            _outfile.write(_record.format("genbank"))



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:44:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:44:28 -0500
Subject: [Biopython-dev] [Bug 3000] New: Could SeqIO.parse() store the whole,
	unparsed multiline entry?
Message-ID: <bug-3000-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000

           Summary: Could SeqIO.parse() store the whole, unparsed multiline
                    entry?
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz


Taking into account that GenBank file-format writing is not yet complete, I
wonder whether you would allow optionally keeping, along with each parsed
record, its unparsed multi-line representation. For example, I use biopython
to filter out certain records from a fasta/genbank file by accession, gi, or
tissue (well, the last I haven't done yet ;)). I do not change the format, I
just ignore certain entries.

I did not understand the Tutorial ("5.4.3  Getting your SeqRecord objects as
formatted strings") well, but I iterate over the records and, once I have the
record I want, to be on the safe side I would like to call something like
record._print_original_blob() and get e.g.

LOCUS ....
...
//

I do not have the record_iterator so cannot use the proposed
out_handle.write(record.format("genbank")) approach. Still, I suspect this will
reformat the entry (currently I see trailing dot removed from KEYWORDS, no
REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
re-ordered).

I foresee this depending on an optional argument to SeqIO.parse() specifying
that the user wants to keep this in memory and understands that this is
probably not very useful for large chromosomes, etc.

Similarly, until parsing/writing of e.g. TITLE is fully available, why
couldn't you just store the whole multi-line thing in some variable?



From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 20:47:27 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:47:27 -0500
Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal
In-Reply-To: <bug-2601-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260147.o0Q1lRVk002782@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2601


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz



From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 08:03:42 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 Jan 2010 08:03:42 -0500
Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or
	record.format("genbank") converts input sequence to uppercase or
In-Reply-To: <bug-2999-42@http.bugzilla.open-bio.org/>
Message-ID: <201001261303.o0QD3gN8019546@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2999


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-26 08:03 EST -------
In many file formats (e.g. FASTA) mixed case is allowed and useful.

The sequence in a GenBank file is (by convention) always lower case,
but for historical reasons Biopython converts this to upper case on
parsing (not sure why, but changing it would risk breaking existing
scripts).

However, I think we should convert to lower case on writing GenBank
output.

Peter
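
Converting to lower case on output could be sketched as follows. This is
not the Biopython writer, only an illustration of GenBank-style sequence
lines (60 letters per line in blocks of 10, with the position of the first
letter right-justified); the function name format_origin_lines is
hypothetical.

```python
def format_origin_lines(sequence, letters_per_line=60, block_size=10):
    """Format a sequence as lower case GenBank-style ORIGIN lines."""
    sequence = sequence.lower()
    lines = []
    for start in range(0, len(sequence), letters_per_line):
        chunk = sequence[start:start + letters_per_line]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        # The position of the first letter on the line is right-justified
        # in nine columns, followed by space-separated blocks.
        lines.append("%9d %s" % (start + 1, " ".join(blocks)))
    return lines
```

Lower-casing only at write time would leave the parsed Seq in upper case, so
existing scripts that rely on that would keep working.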



From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 08:15:38 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 Jan 2010 08:15:38 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201001261315.o0QDFc4f020030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-26 08:15 EST -------
(In reply to comment #0)
> Taking into account the genbank file-format writing is not yet complete I
> wonder whether you would allow to keep optionally along each parsed record
> its unparsed multi-line representation.

You can probably do it already with the old Bio.GenBank iterator object
(I think you use no parser object to get the raw text).

Adding this to Bio.SeqIO doesn't seem a wonderful idea. The whole approach
only makes sense for sequential file formats with no header (like FASTA,
GenBank, EMBL, SwissProt) but not interlaced files (most alignments) or
those with headers or XML formats. It also breaks completely the moment
the user makes any modification to the SeqRecord object - and handling
that cleanly would be tricky.

> Still, I suspect this will
> reformat the entry (currently I see trailing dot removed from KEYWORDS, no
> REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
> re-ordered).

Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly)
different output. We do not guarantee a 100% round trip (even on simpler
formats like FASTA). Even little things like line wrapping would make this
very difficult.

Regarding GenBank KEYWORDS, please file a bug.

Regarding GenBank reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED)
this is still covered by existing Bug 2294

Regarding GenBank source feature, please file a bug.

> Similarly, I think until parsing/writing e.g. TITLE is fully available why
> couldn't you just store the whole multi-line thing in some variable?

The remaining unsupported bits of the ID line are covered by existing
Bug 2294 and Bug 2578.

Regarding the reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED)
this is still covered by existing Bug 2294.

Peter
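
For the filtering use case in the original report, the raw text can be kept
without touching Bio.SeqIO by splitting the file on the "//" terminator.
This is a rough sketch for sequential GenBank files only, not the Bio.GenBank
iterator mentioned above; iter_raw_genbank_records and locus_name are
hypothetical helpers and the sample records are invented.

```python
import io


def iter_raw_genbank_records(handle):
    """Yield each record as unparsed text, from LOCUS through '//'."""
    lines = []
    for line in handle:
        if not lines and not line.startswith("LOCUS"):
            continue  # skip anything before the next LOCUS line
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # A trailing record without a '//' terminator is silently dropped.


def locus_name(raw_record):
    """Return the second field of the LOCUS line (the locus name)."""
    return raw_record.split("\n", 1)[0].split()[1]


# Invented two-record sample; real files carry many more header lines.
sample = io.StringIO(
    "LOCUS       EXAMPLE1   10 bp\nORIGIN\n        1 acgtacgtac\n//\n"
    "LOCUS       EXAMPLE2   4 bp\nORIGIN\n        1 acgt\n//\n"
)
wanted = {"EXAMPLE2"}
kept = [raw for raw in iter_raw_genbank_records(sample) if locus_name(raw) in wanted]
```

Because the text is never reparsed or rewritten, the kept records are
byte-for-byte what was in the input.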



From jblanca at btc.upv.es  Tue Jan 26 09:02:59 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 26 Jan 2010 15:02:59 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
Message-ID: <201001261502.59237.jblanca@btc.upv.es>

Hi:

I'm building a pipeline to annotate sequences. I'm writing modules that add
SeqFeatures and annotations to the sequences.
Right now I'm storing the result as the repr of the SeqRecords, but I would like
to write GFF files at the end. I've read the discussion regarding Brad's code
and I've found it very interesting.
I need to write those GFF files, so I could use Brad's code or my own, but it
would be great if I could contribute to Biopython at the same time.
At the moment I don't think there is a consensus about what a SeqFeature should
represent and how. I think Peter made a proposal about adding parent and
children properties; is this a good way to solve the problem?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Tue Jan 26 09:59:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 26 Jan 2010 14:59:35 +0000
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <201001261502.59237.jblanca@btc.upv.es>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
	<201001261502.59237.jblanca@btc.upv.es>
Message-ID: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>

Hi Jose,

On Tue, Jan 26, 2010 at 2:02 PM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
>
> I'm doing a pipeline to annotate sequences. I'm writing modules that add
> SeqFeatures and annotations to the sequences.

I've done a little of that too - but with GenBank files as the output.

> Right now I'm storing the result as repr for the SeqRecords, but I would like
> to write gff files at the end. I've read the discussion regarding Brad's code
> and I've found it very interesting.
> I need to write those gff files so could use Brad's code or my own, but it
> would be great if I could contribute to Biopython at the same time.
> At the moment I don't think there is a consensus about what a SeqFeature should
> represent and how. I think Peter made a proposal about adding a parent and
> children properties, is this a good way to solve the problem?
> Best regards,

Brad's code is using the SeqFeature differently to existing bits of
Biopython, and adding a separate child/parent mechanism for the
kind of usage required for GFF(3) looks like one way forward, allowing
us to keep full backward compatibility. I'm actually going to see Brad
in person next month at a workshop, and I'm hoping we can squeeze
in a little in person debate on this then (assuming we don't settle it
here on the mailing list first of course).

Regards,

Peter
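
To make the parent/child idea concrete, here is a rough sketch of how such a
relationship ends up in GFF3 output via the ID and Parent attributes. It is
not Brad's code and makes no use of SeqFeature; gff3_line and the feature
values are hypothetical, and reserved characters in attribute values are not
escaped.

```python
def gff3_line(seqid, source, feature_type, start, end, strand, attributes):
    """Build one GFF3 line; start and end are 1-based and inclusive."""
    attribute_text = ";".join("%s=%s" % (key, value) for key, value in attributes)
    # Nine tab-separated columns; score and phase are left undefined (".").
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", strand, ".", attribute_text]
    return "\t".join(columns)


lines = [
    "##gff-version 3",
    gff3_line("contig1", "my_pipeline", "gene", 100, 900, "+", [("ID", "gene1")]),
    gff3_line("contig1", "my_pipeline", "mRNA", 100, 900, "+",
              [("ID", "mrna1"), ("Parent", "gene1")]),
]
```

Whatever in-memory representation is chosen, a writer needs to recover this
ID/Parent pairing for each feature.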

From dalloliogm at gmail.com  Tue Jan 26 10:09:39 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 26 Jan 2010 16:09:39 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100118142010.GE48842@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu>
Message-ID: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>

On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Hi Nick;
>

Sorry for the late reply... I also use StackOverflow and I think that it is
a great resource, and it would be very good if we could become more represented
there.
At the moment there are a few questions on biopython on SO, but there are so
few biopython users there that people usually receive few answers and prefer
to ask their questions again on this list.
I have answered some questions tagged 'bioinformatics' there, but lately
I have not been using SO very much, and moreover the field of bioinformatics
is so broad that sometimes it is very difficult to answer a technical
question.


> > Here is a link to various other Stack Exchange sites:
> >
> http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family
>
>
Very interesting, thanks! I didn't know you could make Stack-Exchange
websites so easily. How did you do that? Is there free software behind it, or
do you have to pay some service provider?


> It looks like there are a couple of Stack Exchange sites with
> similar aims for open source bioinformatics and chemistry:
>
> http://biostar.stackexchange.com/
> http://blueobelisk.stackexchange.com/
>

I agree, maybe it would be useful to collaborate with these websites.
StackOverflow is great for programming-related questions; however, you can't
use it to ask something which is not completely related, like the protocol
for an experiment or which databases to use for an analysis.


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From dalloliogm at gmail.com  Wed Jan 27 03:56:09 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 27 Jan 2010 09:56:09 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu> 
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
Message-ID: <5aa3b3571001270056l5ae5bd76g1a70890c94fd430b@mail.gmail.com>

On Tue, Jan 26, 2010 at 4:09 PM, Giovanni Marco Dall'Olio <
dalloliogm at gmail.com> wrote:

>
>
>
> On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Hi Nick;
>>
>
> Sorry for the late reply... I also use StackOverflow and I think that it is
> a great resource, and it would be very good if we could become more represented
> there.
>


By the way, it is possible to get feeds for questions on StackOverflow.
For example, this is the feed for the questions tagged 'biopython':
- http://stackoverflow.com/feeds/tag/biopython
We could add this RSS feed to Biopython's FriendFeed or Twitter page (I
barely know what I am talking about here), or to the blog/wiki/etc.
Maybe there is also a way to notify this mailing list of the questions asked
there.


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From chapmanb at 50mail.com  Wed Jan 27 08:33:22 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 27 Jan 2010 08:33:22 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu>
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
Message-ID: <20100127133322.GV83316@sobchak.mgh.harvard.edu>

Giovanni;
Thanks for the feedback on this. We've had a few positive responses
and I think it's something that would be low effort to experiment with.
I'm open to whether we do this on the main StackOverflow site,
Nick's suggested dedicated site, or Blue Obelisk. The main criterion
is that the website is likely to be freely available (and
around) in the future.

> Sorry for the late reply... I also use StackOverflow and I think that it is
> a great resource, and it would be very good if we could become more represented
> there.
> At the moment there are a few questions on biopython on SO, but there are so
> few biopython users that people usually receive few answers and they prefer
> to ask their questions again in this list.

Yes, that's what we'd be hoping to change. The main thing is that we
get folks interested in python bioinformatics programming looking
there, and then suggest users ask questions there. The significant
benefit is that the presentation of questions and answers gives you 
a historical resource that is easy to search and browse.

> By the way, it is possible to get feeds for questions on StackOverflow.
> For example, this is the feed for the questions tagged 'biopython':
> - http://stackoverflow.com/feeds/tag/biopython
> We could add this RSS feed to Biopython's FriendFeed or Twitter page (I
> barely know what I am talking about here), or to the blog/wiki/etc.
> Maybe there is also a way to notify this mailing list of the questions asked
> there.

There are resources we could use to redirect the feed to Twitter:

http://twitterfeed.com/

and the mailing list:

http://www.feedmyinbox.com/

Agreed that we should do this to increase visibility.
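
As a rough sketch of the notification idea: assuming the tag feed is an
Atom document, a small script could list the questions in it and pass them
on to the list, the wiki, or anywhere else. The sample feed below is made
up for illustration, and list_questions is a hypothetical helper name;
fetching the real feed and sending the mail are left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (assumption: the tag feed is Atom).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up sample standing in for a downloaded feed document.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Questions tagged biopython (sample)</title>
  <entry>
    <id>http://example.org/questions/1</id>
    <title>How do I parse a GenBank file?</title>
    <link rel="alternate" href="http://example.org/questions/1"/>
  </entry>
  <entry>
    <id>http://example.org/questions/2</id>
    <title>Reading a Newick tree</title>
    <link rel="alternate" href="http://example.org/questions/2"/>
  </entry>
</feed>
"""


def list_questions(feed_xml):
    """Return a list of (title, url) pairs, one per feed entry."""
    root = ET.fromstring(feed_xml)
    questions = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href", "") if link is not None else ""
        questions.append((title.strip(), href))
    return questions


if __name__ == "__main__":
    for title, url in list_questions(SAMPLE_FEED):
        print("%s <%s>" % (title, url))
```

A script like this could be run periodically, remembering the entry ids it
has already seen so that each question is forwarded only once.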

Brad

From chapmanb at 50mail.com  Wed Jan 27 08:41:25 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 27 Jan 2010 08:41:25 -0500
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
	<201001261502.59237.jblanca@btc.upv.es>
	<320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>
Message-ID: <20100127134125.GW83316@sobchak.mgh.harvard.edu>

Jose and Peter;

> > Right now I'm storing the result as repr for the SeqRecords, but I would like
> > to write gff files at the end. I've read the discussion regarding Brad's code
> > and I've found it very interesting.
> > I need to write those gff files so I could use Brad's code or my own, but it
> > would be great if I could contribute to Biopython at the same time.

Awesome. Please do use my code for output and feel free to fork and
make suggestions; I'm happy to integrate changes:

http://github.com/chapmanb/bcbb/tree/master/gff

> > At the moment I don't think there is a consensus about what a SeqFeature should
> > represent and how. I think Peter made a proposal about adding parent and
> > children properties; is this a good way to solve the problem?
> > Best regards,
> 
> Brad's code is using the SeqFeature differently to existing bits of
> Biopython, and adding a separate child/parent mechanism for the
> kind of usage required for GFF(3) looks like one way forward allowing
> us to keep full backward compatibility. I'm actually going to see Brad
> in person next month at a workshop, and I'm hoping we can squeeze
> in a little in person debate on this then (assuming we don't settle it
> here on the mailing list first of course).

What do you think we need to modify in the GFF parsing code to bring
this in line? I'd really like to see this get into Biopython, but am
not sure how to clear the blocking issues. If we can put together a
list of specifics, I can try and put together time to tackle that.

Brad

From dalloliogm at gmail.com  Wed Jan 27 08:41:24 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 27 Jan 2010 14:41:24 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100127133322.GV83316@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu> 
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> 
	<20100127133322.GV83316@sobchak.mgh.harvard.edu>
Message-ID: <5aa3b3571001270541n2f047fe2qf42911b21e9494d8@mail.gmail.com>

On Wed, Jan 27, 2010 at 2:33 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Giovanni;
> Thanks for the feedback on this. We've had a few positive responses
> and I think it's something that would be low effort to experiment with.
> I'm open to whether we do this on the main StackOverflow site,
> Nick's suggested dedicated site, or Blue Obelisk. The main criterion
> is that the website is likely to be freely available (and
> around) in the future.
>

Thanks to you for the proposal..


> There are resources we could use to redirect the feed to Twitter:
>
> http://twitterfeed.com/
>
> and the mailing list:
>
> http://www.feedmyinbox.com/
>

So, what if we use this to automatically send a notification to the
biopython mailing list?
The increase in traffic would be low: in the last three months there
have only been 3 messages on biopython on StackOverflow.
With automatic notification, these questions may receive an answer a
lot more quickly.
If the traffic from StackOverflow grows too much, we can just deactivate the
forwarding so it won't disturb the mailing list.


> Agreed that we should do this to increase visibility.
>
> Brad
>


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From chapmanb at 50mail.com  Thu Jan 28 15:35:05 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 28 Jan 2010 15:35:05 -0500
Subject: [Biopython-dev] OpenBio solution challenge: Project updates at BOSC
	2010
Message-ID: <20100128203505.GG40046@sobchak.mgh.harvard.edu>

Hello all;
The BOSC 2010 organizing committee is hard at work getting prepared for this
July's meeting in Boston:

http://www.open-bio.org/wiki/BOSC_2010

One of the items we've traditionally had at the conference is a project 
update from each of the OpenBio affiliated groups. This year, we're thinking
about organizing these talks around a central theme: the OpenBio solution
challenge. We start with a biological question of general interest, and each
of the project talks would focus around how you would solve that problem 
using your toolkit and programming language.

This is meant to provide a challenge for OpenBio contributors, a nice tutorial
style overview of various projects and approaches for other programmers, and a
fun opportunity to compete and learn from other projects. Conference attendees
will vote on their favorite solution, with the winner receiving fame and
fortune (warning: fortune not guaranteed).

For this to be successful, it of course requires interest and enthusiasm from
y'all fine folks involved with the projects. Specifically:

- Is there interest from your group in participating in the challenge? You'll
  want at least a few people to work on it, and someone to give a presentation 
  at BOSC.

- Do you have suggestions on a good theme or specific biological problem to
  tackle? We'll hope to pick something in a sweet spot that is challenging 
  enough to be of interest, yet reasonable for presentation and preparation.

Let's discuss ideas and get this together. Since the schedule for BOSC is
developing rapidly, please give us an idea if you're interested by
February 12th, and copy responses to the BOSC mailing list as a central 
place for discussion.

bosc at open-bio.org

Thanks,
Brad, Michael, and the BOSC organizing committee

From biopython at maubp.freeserve.co.uk  Fri Jan 29 05:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Biopython-dev] [Bioperl-l] [MOBY-dev] OpenBio solution
	challenge: Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious for example? There are already several major projects
to do this e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)


From bugzilla-daemon at portal.open-bio.org  Sun Jan 31 15:30:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 15:30:45 -0500
Subject: [Biopython-dev] [Bug 3004] New: Contribute PSL alignment format to
	biopython
Message-ID: <bug-3004-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004

           Summary: Contribute PSL alignment format to biopython
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: forgetta at gmail.com


Hi Bio-pythonistas,

I am interested in contributing code to biopython. I have developed a class to
represent PSL output from the BLAT alignment program. I would like to
contribute it to the AlignIO module. I have read through and agree to the
guidelines stipulated on http://biopython.org/wiki/Contributing. I have never
written unit tests before, but I am willing to learn.

Thanks.

Vince


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sun Jan 31 17:24:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 17:24:53 -0500
Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in
	Bio.AlignIO
In-Reply-To: <bug-3004-42@http.bugzilla.open-bio.org/>
Message-ID: <201001312224.o0VMOrha006787@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Contribute PSL alignment    |PSL alignment format parsing
                   |format to biopython         |in Bio.AlignIO


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-31 17:24 EST -------
Hi Vince,

This sounds interesting - I've been using BLAT's plain text BLAST output
format with Biopython up until now.

Have you ever used github? That would be one way to share your code. Or,
just attach diff files, Python files, and example BLAT files to this bug.

If you haven't already done so, signing up to our development mailing
list would be a good idea.

Thanks,

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sun Jan 31 19:21:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 19:21:51 -0500
Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in
	Bio.AlignIO
In-Reply-To: <bug-3004-42@http.bugzilla.open-bio.org/>
Message-ID: <201002010021.o110Lp9e009311@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004


------- Comment #2 from forgetta at gmail.com  2010-01-31 19:21 EST -------
Now on github:

http://github.com/vforget/PyBLATPSL

Vince


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From chapmanb at 50mail.com  Mon Jan  4 13:16:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 4 Jan 2010 08:16:31 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <20100104131631.GG80812@sobchak.mgh.harvard.edu>

Hey Eric;
Happy New Year -- thanks for all the work on TreeIO. This sounds
great and looking forward to getting it in the main trunk. I'd like
to hear Peter's and others' thoughts, but just a few small comments
below.

> The tree annotations (e.g. id) aren't preserved perfectly during conversions
> -- I'll keep working on this, but I don't think it's a blocker. The taxon
> names of terminal nodes are kept as "clade" names in phyloXML for
> round-tripping. Tree topology and branch lengths seem OK.

Are the annotations often used in real life cases or is this more of
a fringe problem? I'm not as familiar with tree work, but know this
is a pain in sequence space. A good goal is to capture the most
common use cases and then integrate the other issues as feasible.

> Bio.Tree.Newick contains simple subclasses of Tree and Subtree, and an
> incomplete set of shims that track Bio.Nexus.Trees.Tree (minus the I/O).
> This is to ease the deprecation and eventual replacement of Bio.Nexus.Trees,
> as I imagine it:
> (1) Port methods from Nexus.Trees to Bio.Tree, simplifying arguments where
> reasonable (since the node IDs and adjacency list lookup are no longer
> needed)
> (2) Implement methods in Bio.Tree.Newick with the original argument lists,
> but triggering a deprecation warning indicating the newer replacement method
> (3) Replace Nexus.Trees with an import of Bio.Tree.Newick(IO) and a few more
> shims to duplicate the original API -- so test_Nexus.py should still pass,
> ideally (with deprecation warnings)
> (4) In Nexus.Nexus, replace all usage of Nexus.Trees with proper usage of
> NexusIO and Bio.Tree methods.
> (5) Eventually delete Nexus.Trees and the shims in Bio.Tree.Newick.
> 
> I'm currently doing (1) and (2), with more emphasis on getting (1) right.
> Not all of the important methods have been ported, but I'm happy with the
> tree traversal methods.

Nice. This all sounds like a really good refactoring. It sounds like 1 
can happen once this all gets merged with the main branch, and
could benefit from others being able to more easily look at it and
make suggestions.
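
To make steps (1) and (2) concrete, here is a minimal sketch of what such a
shim could look like. The class and method names (Tree, NewickTree,
get_leaf_names, get_taxa_old) are placeholders for illustration, not the
names used in the branch; only the use of the warnings module is standard
Python.

```python
import warnings


class Tree(object):
    """Stand-in for the new-style tree class (hypothetical)."""

    def __init__(self, leaf_names):
        self._leaf_names = list(leaf_names)

    def get_leaf_names(self):
        """New-style method with a simplified argument list."""
        return list(self._leaf_names)


class NewickTree(Tree):
    """Stand-in for the subclass that carries the compatibility shims."""

    def get_taxa_old(self, node_id=None):
        """Old-style signature kept only for backward compatibility.

        The node_id argument is accepted but ignored in this sketch.
        """
        warnings.warn(
            "get_taxa_old() is deprecated; use get_leaf_names() instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the shim
        )
        return self.get_leaf_names()


if __name__ == "__main__":
    warnings.simplefilter("always")
    tree = NewickTree(["A", "B", "C"])
    print(tree.get_taxa_old())
```

With shims like this, old test code keeps passing while the warnings show
which calls still need to be moved over to the new API.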

> I noticed that in Tests/Nexus/, the example file for internal node labels is
> actually in Newick/NH format, not Nexus. That was briefly confusing, so
> maybe that file should be renamed.

Oops, I think that may have been me. No problem, rename away.

Brad


From eric.talevich at gmail.com  Tue Jan  5 00:09:18 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 4 Jan 2010 16:09:18 -0800
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20100104131631.GG80812@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> 
	<20100104131631.GG80812@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf361001041609u7997dd61v441257dbfecdebd6@mail.gmail.com>

Hi Brad, I hope the holidays treated you well.

On Mon, Jan 4, 2010 at 5:16 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> Are the annotations often used in real life cases or is this more of
> a fringe problem? I'm not as familiar with tree work, but know this
> is a pain in sequence space. A good goal is to capture the most
> common use cases and then integrate the other issues as feasible.
>

The data that TreeIO preserves round-trip are:

 - Branching structure (topology)
 - Branch lengths
 - Clade/taxon names
 - Rooted-ness (for the whole tree)
 - Tree ID

The troublesome parts are:

 - The "confidences" attribute in PhyloXML trees should map onto the
"support" attribute in Nexus trees, but that's tricky -- the original Nexus
attribute seemed content with a little ambiguity in what that attribute's
numerical value actually meant (relative/absolute support), while PhyloXML
uses a list of Confidence objects containing both a numerical value and a
"type" string such as "bootstrap". Currently that information is dropped
when converting between PhyloXML and Nexus/Newick trees.
 - Nexus also has a "comment" attribute for each node, while PhyloXML
doesn't directly support that.
 - The branch length of the root node/clade is None in PhyloXML, but 0.0 in
Nexus. I prefer None because there is no meaningful branch leading to that
node, but there might be a reason 0.0 was chosen for Nexus that I'm not
aware of.
 - The names of unlabeled internal nodes might change from None to "" in
some cases, since None is the PhyloXML default and "" is the Nexus default.
 - Since PhyloXML supports more structured taxonomic information on each
node than Newick, it's possible to have a PhyloXML tree where a Clade has no
name, but instead one or more Taxonomy objects containing the scientific
name, common names, etc. -- so when this tree is converted to Newick format
the taxonomy info is lost for those nodes. I could squash the Taxonomy
object into a string for the sake of Nexus labels, but I think it would be
safer (less surprising) to just write a cookbook entry on how to collapse
PhyloXML Taxonomies into Clade names to aid format conversions.

If the support-vs-confidence issue can be resolved, then we can treat
PhyloXML as a rough superset of Newick, in terms of annotation, and then it
shouldn't be surprising to lose some annotation data in converting PhyloXML
to Newick.
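
A rough sketch of what that cookbook entry might do, using small stand-in
classes rather than the real PhyloXML objects. Taxonomy, Clade, and
collapse_taxonomies are hypothetical names chosen for illustration, and
preferring the scientific name over common names is just one possible
policy.

```python
class Taxonomy(object):
    """Stand-in for a PhyloXML taxonomy annotation (hypothetical)."""

    def __init__(self, scientific_name=None, common_names=()):
        self.scientific_name = scientific_name
        self.common_names = list(common_names)


class Clade(object):
    """Stand-in for a tree node with optional name and taxonomies."""

    def __init__(self, name=None, taxonomies=(), clades=()):
        self.name = name
        self.taxonomies = list(taxonomies)
        self.clades = list(clades)


def label_from_taxonomy(taxonomy):
    """Pick one string to stand for a taxonomy, or None if it is empty."""
    if taxonomy.scientific_name:
        label = taxonomy.scientific_name
    elif taxonomy.common_names:
        label = taxonomy.common_names[0]
    else:
        return None
    # Unquoted Newick labels cannot contain spaces, so use underscores.
    return label.replace(" ", "_")


def collapse_taxonomies(clade):
    """Fill in missing clade names from taxonomy data, recursively."""
    if clade.name is None:
        for taxonomy in clade.taxonomies:
            label = label_from_taxonomy(taxonomy)
            if label is not None:
                clade.name = label
                break
    for child in clade.clades:
        collapse_taxonomies(child)


if __name__ == "__main__":
    human = Clade(taxonomies=[Taxonomy(scientific_name="Homo sapiens")])
    mouse = Clade(taxonomies=[Taxonomy(common_names=["mouse"])])
    root = Clade(clades=[human, mouse])
    collapse_taxonomies(root)
    print([child.name for child in root.clades])
```

Doing this as an explicit step before writing Newick keeps the conversion
itself unsurprising: existing names are never overwritten, and nodes with
no taxonomy data keep a name of None.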

Cheers,
Eric


From biopython at maubp.freeserve.co.uk  Tue Jan  5 17:50:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 17:50:25 +0000
Subject: [Biopython-dev] code credits
In-Reply-To: <320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
References: <bb02be080912171320u480fe461r1f517970f08e091b@mail.gmail.com>
	<928490.72367.qm@web30708.mail.mud.yahoo.com>
	<320fb6e00912171454v2ce81fc5v93547951d7af84f8@mail.gmail.com>
	<Pine.SOC.4.64.0912171946120.13591@ub.d.umn.edu>
	<320fb6e00912210357m32156fdax6639445cadd83217@mail.gmail.com>
	<20091221132339.GC21580@sobchak.mgh.harvard.edu>
	<320fb6e00912210634o77d9eb9ex21e4ec3630dd1ed6@mail.gmail.com>
	<320fb6e00912210848x449fd73al4e97d3c9e21cf4@mail.gmail.com>
	<320fb6e00912220414t6429f1e5n792e5feeecbe633f@mail.gmail.com>
Message-ID: <320fb6e01001050950r64dabb1dw67baafada72f5d1a@mail.gmail.com>

On Tue, Dec 22, 2009 at 12:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Dec 21, 2009 at 4:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> So, how about a merger of (1) and (3)? i.e.
>>
>> * The CONTRIBUTORS file remains a single alphabetical list
>> of all contributors to date (no change).
>> * Entries in the NEWS file for new features etc may continue
>> to credit authors as appropriate.
>> * The NEWS file will include at the end of each release section
>> an alphabetical list of contributors for that release (with new
>> contributors flagged). This will be re-used in the release notice.
>
> I've done that in github - how do the NEWS and CONTRIB file look?
>
> http://github.com/biopython/biopython/commit/86d8d99aab894ab5f32a0e7a0c45d63a441da645
>
> I haven't automatically included email addresses for the new contributors
> since there is a risk of them being harvested for spam, so I figure that
> should be "opt in".

Thanks to those with feedback off list (e.g. sort order).

I've just updated the news post to include the list of names:
http://news.open-bio.org/news/2009/12/biopython-release-153/

I don't have time today, but at some point this week I want to
do another news post and email announcement describing
this new Sage-like policy for recognising contributors. If anyone
would like to compose a draft of the apparent consensus that
would be very helpful.

If anyone would like to go back over the commit log for the
recent releases to update them as we've just done for 1.53,
please go ahead - but post an email here to avoid duplicated
efforts.

Peter

P.S. Happy New Year!


From bugzilla-daemon at portal.open-bio.org  Thu Jan  7 18:11:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 7 Jan 2010 13:11:47 -0500
Subject: [Biopython-dev] [Bug 2980] New: Bio.SeqIO can't parse EMBL CONTIG
	records
Message-ID: <bug-2980-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2980

           Summary: Bio.SeqIO can't parse EMBL CONTIG records
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


While the GenBank parser has been updated to cope with CONTIG records
(using an UnknownSeq object), this has not been done for the EMBL parser.
As an example test case, consider:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rel_con_hum_01_r102.dat.gz


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Fri Jan  8 11:50:56 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 8 Jan 2010 06:50:56 -0500
Subject: [Biopython-dev] [Bug 2980] Bio.SeqIO can't parse EMBL CONTIG records
In-Reply-To: <bug-2980-42@http.bugzilla.open-bio.org/>
Message-ID: <201001081150.o08Bougb013879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2980


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-08 06:50 EST -------
Fixed in git


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From mjldehoon at yahoo.com  Fri Jan  8 16:26:29 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 8 Jan 2010 08:26:29 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
Message-ID: <221209.41863.qm@web62404.mail.re1.yahoo.com>

I am not an expert in this area, but the code looks very well done and well organized. Thanks, Eric!

I have one suggestion though:
In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather have everything under Bio.Tree. This makes it easier to understand what each Bio.* module is about, and also agrees with the structure of the other modules in Biopython. The only exception is Bio.Seq, for which there is a closely related Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons; I'd rather have a single Bio.Seq there too).

Thanks again,

--Michiel.

--- On Mon, 12/28/09, Eric Talevich <eric.talevich at gmail.com> wrote:

> From: Eric Talevich <eric.talevich at gmail.com>
> Subject: Re: [Biopython-dev] Code review request for phyloxml branch
> To: "BioPython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Monday, December 28, 2009, 8:51 PM
> Hi folks,
> 
> Here's an update on the status of Bio.Tree and TreeIO. I
> think I've taken
> care of most of the blockers since the last review in
> September.
> 
> First, some links:
> http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
> http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
> http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py
> http://github.com/etal/biopython/tree/phyloxml/Tests/test_Tree.py
> http://biopython.org/wiki/PhyloXML
> 
> Discussion:
> 
> *TreeIO*
> Conversion between Nexus, Newick and phyloXML tree file
> formats works; the
> read/parse/write functions for each IO format use the same
> object types.
> Neat!
> 
> The tree annotations (e.g. id) aren't preserved perfectly
> during conversions
> -- I'll keep working on this, but I don't think it's a
> blocker. The taxon
> names of terminal nodes are kept as "clade" names in
> phyloXML for
> round-tripping. Tree topology and branch lengths seem OK.
> 
> Under the hood:
> -- PhyloXMLIO is from GSoC
> -- NewickIO is ported from the Bio.Nexus.Trees parser. I
> think it works the
> same way.
> -- NexusIO relies on Bio.Nexus.Nexus for parsing, then
> converts the
> resulting Nexus.Trees.Tree objects to Bio.Tree.Newick
> objects. One day, when
> Nexus.Trees is replaced by NewickIO in the main Nexus
> parser, then this
> conversion can be dropped and NexusIO will be very simple.
> 
> *Tree*
> The BaseTree object structure looks like this:
> 
> -- *BaseTree.Tree* contains global tree information, like
> whether the tree
> is rooted, and a reference to the root clade. The phyloXML
> Phylogeny object
> inherits from this.
> 
> -- *BaseTree.Subtree* contains local (clade- or
> node-specific) information,
> and references to each of its direct descendents,
> recursively. The phyloXML
> Clade object inherits from this. Nodes are implicit. I
> could add references
> to the ancestor of each sub-tree without too much
> difficulty, but I haven't
> needed them yet.
> 
> The same methods (get_terminals et al.) generally apply to
> both classes, so
> I created a separate TreeMixin class from which both
> BaseTree.Tree and
> BaseTree.Subtree inherit.
> 
> Bio.Tree.Newick contains simple subclasses of Tree and
> Subtree, and an
> incomplete set of shims that track Bio.Nexus.Trees.Tree
> (minus the I/O).
> This is to ease the deprecation and eventual replacement of
> Bio.Nexus.Trees,
> as I imagine it:
> (1) Port methods from Nexus.Trees to Bio.Tree, simplifying
> arguments where
> reasonable (since the node IDs and adjacency list lookup
> are no longer
> needed)
> (2) Implement methods in Bio.Tree.Newick with the original
> argument lists,
> but triggering a deprecation warning indicating the newer
> replacement method
> (3) Replace Nexus.Trees with an import of
> Bio.Tree.Newick(IO) and a few more
> shims to duplicate the original API -- so test_Nexus.py
> should still pass,
> ideally (with deprecation warnings)
> (4) In Nexus.Nexus, replace all usage of Nexus.Trees with
> proper usage of
> NexusIO and Bio.Tree methods.
> (5) Eventually delete Nexus.Trees and the shims in
> Bio.Tree.Newick.
> 
> I'm currently doing (1) and (2), with more emphasis on
> getting (1) right.
> Not all of the important methods have been ported, but I'm
> happy with the
> tree traversal methods.
> 
> *Tests*
> I created test_Tree.py to test the methods in
> Bio.Tree.BaseTree;
> test_PhyloXML.py tests Bio.Tree.PhyloXML objects and
> Bio.TreeIO.PhyloXMLIO
> parsing/writing.
> 
> I noticed that in Tests/Nexus/, the example file for
> internal node labels is
> actually in Newick/NH format, not Nexus. That was briefly
> confusing, so
> maybe that file should be renamed.
> 
> What do you think?
> 
> All the best,
> Eric
> 
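
To make the design in the quoted message more concrete, here is a minimal sketch of a shared TreeMixin plus one deprecation shim as in step (2) of the plan. It is not the code from the phyloxml branch: the class bodies, the _root_clade helper, the NewickTree name and the get_taxa(node_id) signature are all stand-ins chosen for illustration.

```python
import warnings


class TreeMixin(object):
    """Traversal methods shared by whole trees and subtrees."""

    def _root_clade(self):
        # Hypothetical hook: each subclass says where traversal starts.
        raise NotImplementedError

    def get_terminals(self):
        """Return the leaf clades in depth-first, left-to-right order."""
        leaves = []
        stack = [self._root_clade()]
        while stack:
            node = stack.pop()
            if node.clades:
                # Push children reversed so the leftmost is visited first.
                stack.extend(reversed(node.clades))
            else:
                leaves.append(node)
        return leaves


class Subtree(TreeMixin):
    """Local (clade-specific) data plus references to direct descendants."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades or []

    def _root_clade(self):
        return self


class Tree(TreeMixin):
    """Global tree data plus a reference to the root clade."""

    def __init__(self, root, rooted=False):
        self.root = root
        self.rooted = rooted

    def _root_clade(self):
        return self.root


class NewickTree(Tree):
    """Subclass carrying an old-style method as a deprecation shim."""

    def get_taxa(self, node_id=None):
        # node_id is accepted only to keep the old argument list; this
        # sketch ignores it because node IDs are no longer needed.
        warnings.warn(
            "get_taxa() is deprecated; use get_terminals() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return [leaf.name for leaf in self.get_terminals()]
```

Because the traversal lives in the mixin, get_terminals() works identically whether it is called on a Tree or on any Subtree inside it, and the shim only has to translate the old return type.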


From p.j.a.cock at googlemail.com  Fri Jan  8 17:00:12 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 8 Jan 2010 17:00:12 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <221209.41863.qm@web62404.mail.re1.yahoo.com>
References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com>
	<221209.41863.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>

On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> I am not an expert in this area, but the code looks very well done and well
> organized. Thanks, Eric!
>
> I have one suggestion though:
> In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
> have everything under Bio.Tree. This makes it easier to understand what each
> Bio.* module is about, and also agrees with the structure of the other modules
> in Biopython. The only exception is Bio.Seq, for which there is a closely related
> Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
> I'd rather have a single Bio.Seq there too).

There is also Bio.AlignIO, which again might have been handled via Bio.Align
with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
following the lead from BioPerl. I think there are some good points about making
the code for the common object (tree, SeqRecord, Alignment) clearly separate
from the code for parsing or writing it (although separate top level modules is
perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).

So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
I don't like is that "Tree" could mean a class or a module (also a problem with
other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
convention (PEP8) is to use lower case for the module ("tree") and title case
for the class ("Tree"), something most of Biopython does not follow (and
which we can't change without a lot of upheaval). Another option if we want
to try and keep the existing module name style might be Bio.Trees containing
a Tree class, or perhaps something different like Bio.Phylo instead?

Peter


From eric.talevich at gmail.com  Fri Jan  8 18:22:11 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 8 Jan 2010 13:22:11 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>
References: <3f6baf360912281751g5152a945p951dbbbcbffbddb1@mail.gmail.com> 
	<221209.41863.qm@web62404.mail.re1.yahoo.com>
	<320fb6e01001080900p2235eaccrba83e24e5eb2dbfe@mail.gmail.com>
Message-ID: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>

On Fri, Jan 8, 2010 at 12:00 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> On Fri, Jan 8, 2010 at 4:26 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > I am not an expert in this area, but the code looks very well done and well
> > organized. Thanks, Eric!
> >
> > I have one suggestion though:
> > In the current layout, there's a Bio.Tree and a Bio.TreeIO module. I'd rather
> > have everything under Bio.Tree. This makes it easier to understand what each
> > Bio.* module is about, and also agrees with the structure of the other modules
> > in Biopython. The only exception is Bio.Seq, for which there is a closely related
> > Bio.SeqIO and Bio.SeqRecord. (In my opinion, that is more for historical reasons;
> > I'd rather have a single Bio.Seq there too).
>
> There is also Bio.AlignIO, which again might have been handled via Bio.Align
> with hindsight. One reason for this choice of naming (SeqIO and AlignIO) was
> following the lead from BioPerl.

Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava do
something completely different.

I had the impression that pairing modules Foo & FooIO was an emerging
convention for organizing very general data types being fed by a
variety of file formats, while a single module Foo indicated support
for a particular program or source, like Entrez. But I think it would
be even cleaner if each Foo simply had a Foo.IO (or foo.io) sub-module
organizing the I/O for multiple file formats where applicable.

The TreeIO.* namespace is not crowded -- just read, write, parse,
convert. If that directory is moved under Bio.Tree and renamed to IO
or io, then Bio.Tree would still seem reasonably intuitive if
__init__.py contained:

from io import *
from utils import *

Then "from Bio import Tree" would be enough for most uses.

> I think there are some good points about making
> the code for the common object (tree, SeqRecord, Alignment) clearly separate
> from the code for parsing or writing it (although separate top level modules is
> perhaps overkill). However, I agree, this isn't universal in Biopython (e.g.
> Bio.Motif handles a range of motif file formats but there is no Bio.MotifIO).

PDB does its own thing, too -- and some consolidation there might be nice.

> So I'm somewhat on the fence about the Bio.TreeIO name. However, one thing
> I don't like is that "Tree" could mean a class or a module (also a problem with
> other Biopython bits like "Seq", "SeqRecord", "Nexus"). Current Python
> convention (PEP8) is to use lower case for the module ("tree") and title case
> for the class ("Tree"), something most of Biopython does not follow (and
> which we can't change without a lot of upheaval).

I could rename the modules inside Bio.Tree (or whatever we call it) to
follow the PEP8 convention:

Bio/Tree/
Bio/Tree/basetree.py
Bio/Tree/io.py
Bio/Tree/utils.py ...

The Biopython convention seems to be that directory names are title
case, file names are mostly title case if user-facing and lower case
otherwise, and C extensions are lower case. Most of the time there
won't be any need to import the sub-modules under Tree directly, so
the inconsistency shouldn't be too jarring.

> perhaps something different like Bio.Phylo instead?

Sure, that sounds promising.


Thanks!
Eric


From mjldehoon at yahoo.com  Sat Jan  9 15:15:56 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 9 Jan 2010 07:15:56 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
Message-ID: <863834.10061.qm@web62403.mail.re1.yahoo.com>


--- On Fri, 1/8/10, Eric Talevich <eric.talevich at gmail.com> wrote:
> Yep, BioPerl has a TreeIO module, too. BioRuby and BioJava
> do something completely different.
> 
> I had the impression that pairing modules Foo & FooIO
> was an emerging convention for organizing very general
> data types being fed by a variety of file formats, while
> a single module Foo indicated support
> for a particular program or source, like Entrez.

I think a workable convention, which is already followed by many Biopython modules, is the following:

1) Bio.SomeStuff is a module containing everything related to SomeStuff, where SomeStuff is some broadly-defined field within bioinformatics (Cluster for clustering algorithms, Phylo for phylogenetics, PopGen for population genetics, Entrez for NCBI Entrez related stuff, etc.).

2) Parsing SomeStuff files, which can be in a variety of formats, is done by a read() function (to parse a single record), and/or a parse() function (to parse multiple records). The implementation details of these functions are hidden in a submodule of Bio.SomeStuff. Typically, the user won't need to interact with the submodule directly.

3) The read() / parse() functions return Bio.SomeStuff.Record objects, where Bio.SomeStuff.Record is a class that represents the primary data structure of SomeStuff information.

This general framework may not be suitable in all aspects for all Biopython modules, and can be modified as needed. For example, I can imagine that the most important data structure in Bio.Phylo is a Tree object rather than a Record object.
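
As a rough illustration of points 2) and 3), the read()/parse() pair could be sketched as below. This is only a toy under stated assumptions: the _PARSERS registry and the _parse_newick_strings helper are hypothetical, and the "records" are plain strings rather than real Tree or Record objects.

```python
from io import StringIO


def _parse_newick_strings(handle):
    """Toy parser: yield each ';'-terminated tree string from the handle."""
    for chunk in handle.read().split(";"):
        chunk = chunk.strip()
        if chunk:
            yield chunk + ";"


# Hypothetical registry mapping a format name to its parser; in a real
# package each parser would live in its own private sub-module.
_PARSERS = {"newick": _parse_newick_strings}


def parse(handle, format):
    """Return an iterator over all records in the handle."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown format: %r" % format)
    return parser(handle)


def read(handle, format):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle, format)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")
```

With this layout a user only ever calls read() or parse() with a format name; which sub-module does the parsing is an implementation detail that can change freely.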

> But I think it would
> be even cleaner if each Foo simply had a Foo.IO (or foo.io)
> sub-module organizing the I/O for multiple file formats where
> applicable.

I agree.

> The TreeIO.* namespace is not crowded -- just read, write,
> parse, convert. If that directory is moved under Bio.Tree and
> renamed to IO or io, then Bio.Tree would still seem reasonably
> intuitive if __init__.py contained:
> 
> from io import *
> from utils import *
> 
> Then "from Bio import Tree" would be enough for most uses.

Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.

Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
 
> > perhaps something different like Bio.Phylo instead?
> 
> Sure, that sounds promising.

I agree that Bio.Phylo is a good name. Note also that there already is a Tree class in Bio.Cluster (it represents hierarchical clustering trees). Having a Bio.Phylo.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees is not confusing. On the other hand, having a Bio.Tree.Tree class for phylogenetics trees and a Bio.Cluster.Tree class for hierarchical clustering trees could potentially be confusing.

--Michiel


From eric.talevich at gmail.com  Sat Jan  9 23:38:29 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 9 Jan 2010 18:38:29 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <863834.10061.qm@web62403.mail.re1.yahoo.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com> 
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
Message-ID: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>

Hi,

Thanks for your comments. I've reorganized the modules like this:

Bio/Phylo/
    __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py
    IO/
        __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py

Now "from Bio import Phylo" works for the common cases, and "from
Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct access to the
parsers.

I renamed TreeIO to Phylo/IO -- keeping it uppercase because io is a
standard module in Py2.6+, Py2.7 changes the priority rules for
absolute vs. relative imports, and Py2.4 doesn't support the new
syntax for relative imports. I might change the other file names to
lower case before the next merge, though...

On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Rather than importing *, can we import only those functions that a user would actually use? We should avoid importing stuff that is essentially used only locally in each sub-module.
>
> Another option is to have all functions that are intended to be used by the user in Bio.Phylo, and have those functions access (internally) any sub-module as needed. For example, a user would not notice that Bio.Phylo.read actually uses code from Bio.Phylo.io; the latter module would not be accessed directly by the user.
>

I'm trying to avoid having to update Phylo/__init__.py each time I add
or rename a public function in Utils.py or IO. So, how about this:
I've added "__all__" definitions to Utils.py and IO/__init__.py so
that only the relevant public functions are loaded when
Phylo/__init__.py imports * from those two sub-modules. Testing
manually, this seems to do the right thing.
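
For anyone who wants to check the effect of "__all__" on a star-import without touching the real package, here is a small self-contained sketch. The module name phylo_utils_demo and the functions in it are made up for the demonstration; only the import mechanics are the point.

```python
import sys
import types

# Build a throwaway module in memory so the example needs no files.
utils = types.ModuleType("phylo_utils_demo")
exec(
    "__all__ = ['pretty_print']\n"
    "def pretty_print(tree):\n"
    "    return _indent(str(tree))\n"
    "def _indent(text):\n"
    "    return '  ' + text\n"
    "def scratch_helper():\n"
    "    return None\n",
    utils.__dict__,
)
sys.modules["phylo_utils_demo"] = utils

# Star-import into a fresh namespace, as a package __init__.py would.
namespace = {}
exec("from phylo_utils_demo import *", namespace)

# Ignore __builtins__, which exec() adds to the namespace by itself.
public_names = sorted(n for n in namespace if not n.startswith("__"))
print(public_names)  # ['pretty_print']
```

Note that scratch_helper has no leading underscore and would have been imported if "__all__" were absent; with "__all__" defined, only the listed name comes through.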

Cheers,
Eric


From mjldehoon at yahoo.com  Sun Jan 10 02:50:21 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 9 Jan 2010 18:50:21 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <274373.93315.qm@web62406.mail.re1.yahoo.com>

I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it. One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

Thanks!

--Michiel

--- On Sat, 1/9/10, Eric Talevich <eric.talevich at gmail.com> wrote:

> From: Eric Talevich <eric.talevich at gmail.com>
> Subject: Re: [Biopython-dev] Code review request for phyloxml branch
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "Peter Cock" <p.j.a.cock at googlemail.com>, "BioPython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Saturday, January 9, 2010, 6:38 PM
> Hi,
> 
> Thanks for your comments. I've reorganized the modules like
> this:
> 
> Bio/Phylo/
>     __init__.py, BaseTree.py, Newick.py, PhyloXML.py, Utils.py
>     IO/
>         __init__.py, NexusIO.py, NewickIO.py, PhyloXMLIO.py
> 
> Now "from Bio import Phylo" works for the common cases, and
> "from
> Bio.Phylo.IO import PhyloXMLIO" etc. gives more direct
> access to the
> parsers.
> 
> I renamed TreeIO to Phylo/IO -- keeping it uppercase
> because io is a
> standard module in Py2.6+, Py2.7 changes the priority rules
> for
> absolute vs. relative imports, and Py2.4 doesn't support
> the new
> syntax for relative imports. I might change the other file
> names to
> lower case before the next merge, though...
> 
> On Sat, Jan 9, 2010 at 10:15 AM, Michiel de Hoon <mjldehoon at yahoo.com>
> wrote:
> >
> > Rather than importing *, can we import only those
> functions that a user would actually use? We should avoid
> importing stuff that is essentially used only locally in
> each sub-module.
> >
> > Another option is to have all functions that are
> intended to be used by the user in Bio.Phylo, and have those
> functions access (internally) any sub-module as needed. For
> example, a user would not notice that Bio.Phylo.read
> actually uses code from Bio.Phylo.io; the latter module
> would not be accessed directly by the user.
> >
> 
> I'm trying to avoid having to update Phylo/__init__.py each
> time I add
> or rename a public function in Utils.py or IO. So, how
> about this:
> I've added "__all__" definitions to Utils.py and
> IO/__init__.py so
> that only the relevant public functions are loaded when
> Phylo/__init__.py imports * from those two sub-modules.
> Testing
> manually, this seems to do the right thing.
> 
> Cheers,
> Eric
> 


From eric.talevich at gmail.com  Sun Jan 10 22:02:10 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 10 Jan 2010 17:02:10 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <274373.93315.qm@web62406.mail.re1.yahoo.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> 
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
Message-ID: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>

On Sat, Jan 9, 2010 at 9:50 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> I think that this code can now be included with Biopython, assuming that there will be some documentation on its usage to accompany it.

OK -- I pulled the latest from biopython/biopython on GitHub, merged
my phyloxml branch into my master branch, and pushed it all back to
biopython. Bio.Phylo is now part of Biopython!

For documentation on the Biopython wiki, I moved the relevant parts of
the Tree, TreeIO and PhyloXML pages to a new page for Bio.Phylo:
http://biopython.org/wiki/Phylo

It's a little rough at the moment, but I'll refine it this week. Some
of the content can also be moved to separate cookbook entries.

> One more small thing: I noticed when looking at the source code that some comments still refer to Bio.Tree rather than Bio.Phylo -- could you fix this?

I went over all the docstrings and comments again before merging; it
should be free of Tree/TreeIO references now.

Thanks for your help!
Eric


From biopython at maubp.freeserve.co.uk  Mon Jan 11 11:04:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 11:04:03 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
	<3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
Message-ID: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>

On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> I'm trying to avoid having to update Phylo/__init__.py each time I add
> or rename a public function in Utils.py or IO. So, how about this:
> I've added "__all__" definitions to Utils.py and IO/__init__.py so
> that only the relevant public functions are loaded when
> Phylo/__init__.py imports * from those two sub-modules. Testing
> manually, this seems to do the right thing.

Previously bits of Biopython have used __all__, and then
abandoned this as a long term maintenance load. This was before
my time, so I am not familiar with the full history, but it makes me
wary about using __all__ here.

Personally I don't see a big problem with having just explicit
manual imports within Bio/Phylo/__init__.py if and when you
decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
should be made available at the top level. In general I would
think relatively few things should be exposed like that.

Peter


From biopython at maubp.freeserve.co.uk  Mon Jan 11 11:37:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 11:37:42 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
Message-ID: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>

On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> OK -- I pulled the latest from biopython/biopython on GitHub, merged
> my phyloxml branch into my master branch, and pushed it all back to
> biopython. Bio.Phylo is now part of Biopython!

Wow - that was quicker than I expected. As an aside, do you know
why there seem to be three main branches in the history now?
I guess this was the "original" master, your local master, and your
phyloxml branch?

One minor thing - test_Phylo.py needs to be tweaked to raise a
MissingExternalDependencyError if NetworkX isn't installed. That
way the run_tests.py script will treat it as a skipped test instead of
a failed test. Alternatively, if this is just a small part of the test,
maybe split test_Phylo.py into two files (e.g. add a new file
test_Phylo_NetworkX.py which needs the dependency).
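
The dependency check suggested above could look roughly like this at the top of a test module. To keep the sketch runnable without Biopython installed, it defines a local stand-in exception class; in a real test file the class would be imported from Bio instead. The require_module helper is hypothetical, not an existing Biopython function.

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Local stand-in for the exception of the same name in Bio."""


def require_module(name):
    """Import and return the named module, or signal a missing dependency.

    Raising at import time is what would let a test runner report the
    whole test file as skipped rather than failed.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        raise MissingExternalDependencyError(
            "Install %s if you want to run these tests." % name
        )
```

A test file would then start with something like networkx = require_module("networkx"), before any test classes are defined.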

And how's this for a draft entry in the NEWS file?

New module Bio.Phylo includes support for reading, writing and working with
phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
Eric Talevich on a Google Summer of Code 2009 project, under The National
Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
Christian Zmasek.

Peter


From chapmanb at 50mail.com  Mon Jan 11 13:18:40 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 11 Jan 2010 08:18:40 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com>
Message-ID: <20100111131840.GB46441@sobchak.mgh.harvard.edu>

Hi all;

> OK -- I pulled the latest from biopython/biopython on GitHub, merged
> my phyloxml branch into my master branch, and pushed it all back to
> biopython. Bio.Phylo is now part of Biopython!

Awesome. Congrats Eric -- thanks for all the hard work on this
during the summer, and getting it in shape for inclusion. Peter and
Michiel, thanks for all the helpful feedback. Really happy to have
this integrated,
Brad


From biopython at maubp.freeserve.co.uk  Mon Jan 11 13:42:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 13:42:32 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>
References: <3f6baf361001081022me214cb0i58abfeaf30cd3be9@mail.gmail.com>
	<863834.10061.qm@web62403.mail.re1.yahoo.com>
	<3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com>
	<320fb6e01001110304g40c51fh686eddbfdf056f3e@mail.gmail.com>
Message-ID: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>

On Mon, Jan 11, 2010 at 11:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Jan 9, 2010 at 11:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> I'm trying to avoid having to update Phylo/__init__.py each time I add
>> or rename a public function in Utils.py or IO. So, how about this:
>> I've added "__all__" definitions to Utils.py and IO/__init__.py so
>> that only the relevant public functions are loaded when
>> Phylo/__init__.py imports * from those two sub-modules. Testing
>> manually, this seems to do the right thing.
>
> Previously bits of Biopython have used __all__, and then
> abandoned this as a long term maintenance load. This was before
> my time, so I am not familiar with the full history, but it makes me
> wary about using __all__ here.
>
> Personally I don't see a big problem with having just explicit
> manual imports within Bio/Phylo/__init__.py if and when you
> decide a new function/class/etc in Bio/Phylo/Utils.py or IO.py
> should be made available at the top level. In general I would
> think relatively few things should be exposed like that.

In fact, why even do this at all? What is wrong with leaving
the IO functions (read, parse, write) as Bio.Phylo.IO.read etc
e.g.

>>> from Bio import Phylo
>>> tree = Phylo.IO.read(open("int_node_labels.nwk"),"newick")

What is the benefit of having them also exposed under the
Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
there are two ways to access them which is confusing.

If we do want to use Bio.Phylo.IO instead of Bio.PhyloIO
(or Bio.TreeIO) then thinking long term we may want to
do something about Bio.SeqIO and Bio.AlignIO to match.
We could move the Bio.AlignIO functionality under
Bio.Align.IO (with a suitable transition period). We could
move Bio.SeqIO to Bio.Seq.IO perhaps. Or we could
even talk about introducing Bio.Sequences (or something)
then move Bio.SeqIO to Bio.Sequences.IO, and move
Bio.SeqUtils.* under there too, and perhaps even the
Seq, SeqRecord and SeqFeature objects as well.
On the other hand, all that upheaval would cause a
lot of pain for end users, for relatively little gain.

Peter


From mjldehoon at yahoo.com  Mon Jan 11 15:02:46 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 11 Jan 2010 07:02:46 -0800 (PST)
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
Message-ID: <107440.85746.qm@web62406.mail.re1.yahoo.com>


--- On Mon, 1/11/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> What is wrong with leaving the IO functions
> (read, parse, write) as Bio.Phylo.IO.read etc
> e.g.
> 
> >>> from Bio import Phylo
> >>> tree =
> Phylo.IO.read(open("int_node_labels.nwk"),"newick")
> 
> What is the benefit of having them also exposed under the
> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
> there are two ways to access them which is confusing.

If we use Bio.Phylo.IO.read directly, then for consistency we'd have to do the same for all other modules. Otherwise, we'd be guessing each time whether the read() and parse() functions are in Bio.SomeModule, or Bio.SomeModule.IO.

For Bio.Phylo, a simple solution is to put whatever is in Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and remove Bio.Phylo.IO.__init__.py. Then there is only one way to access the read() etc. functions.

[About doing the same for Bio.Seq and Bio.Align]
> On the other hand, all that upheaval would cause a
> lot of pain for end users, for relatively little gain.

For new users, it may be confusing to have all those different modules dealing with sequences. At least, it was for me when I started with Biopython. Therefore, for a long term solution, I'd prefer a single Bio.Seq module that incorporates all (Seq, SeqRecord, SeqIO, SeqFeature).

I agree that that may cause a lot of upheaval for end users, but a suitably long transition period may mitigate those concerns. I'd prefer that to being stuck with a less-than-optimal code organization forever.

--Michiel


From biopython at maubp.freeserve.co.uk  Mon Jan 11 16:17:36 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 16:17:36 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <107440.85746.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>

On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote:
>
> On Mon, 1/11/10, Peter wrote:
>> What is the benefit of having them also exposed under the
>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
>> there are two ways to access them which is confusing.
>
> If we use Bio.Phylo.IO.read directly, then for consistency we'd have
> to do the same for all other modules. Otherwise, we'd be guessing
> each time whether the read() and parse() functions are in
> Bio.SomeModule, or Bio.SomeModule.IO.

Fair point.

> For Bio.Phylo, a simple solution is to put whatever is in
> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
> remove Bio.Phylo.IO.__init__.py. Then there is only one
> way to access the read() etc. functions.

Or (if the functions are reasonably complex) keep the
input/output code in a separate file, but make it explicit
that it is not a public interface - e.g. use Bio/Phylo/_IO.py?

> [About doing the same for Bio.Seq and Bio.Align]
>> On the other hand, all that upheaval would cause a
>> lot of pain for end users, for relatively little gain.
>
> For new users, it may be confusing to have all those
> different modules dealing with sequences. At least, it
> was for me when I started with Biopython. Therefore,
> for a long term solution, I'd prefer a single Bio.Seq
> module that incorporates all (Seq, SeqRecord, SeqIO,
> SeqFeature).

I agree that for a long term solution a single module
makes sense here, although I'm not convinced that
Bio.Seq is the best name. We'd have to switch from
a single file Bio/Seq.py to a folder with multiple files
including Bio/Seq/__init__.py - I worry this may cause
problems with updating existing Biopython installations.

> I agree that that may cause a lot of upheaval for end
> users, but a suitably long transition period may mitigate
> those concerns. I'd prefer that to being stuck with a
> less-than-optimal code organization forever.

In principle I agree with that.

Peter


From eric.talevich at gmail.com  Mon Jan 11 16:30:32 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 11 Jan 2010 11:30:32 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>
References: <3f6baf361001091538g21d759ccpfe6547aab13f4123@mail.gmail.com> 
	<274373.93315.qm@web62406.mail.re1.yahoo.com>
	<3f6baf361001101402j6a84dbfcs6d1bc9801ada73a8@mail.gmail.com> 
	<320fb6e01001110337y4009a26ayf99bb58a1c9d9141@mail.gmail.com>
Message-ID: <3f6baf361001110830y391ea21cs8315a266b8b4fb43@mail.gmail.com>

On Mon, Jan 11, 2010 at 6:37 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sun, Jan 10, 2010 at 10:02 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> OK -- I pulled the latest from biopython/biopython on GitHub, merged
>> my phyloxml branch into my master branch, and pushed it all back to
>> biopython. Bio.Phylo is now part of Biopython!
>
> Wow - that was quicker than I expected. As an aside, do you know
> why there seem to be three main branches in the history now?
> I guess this was the "original" master, your local master, and your
> phyloxml branch?

Er, sorry if I jumped the gun. I was eager to get this done before the
semester kicks in... anyway, these are the Git commands I used:

git checkout master
git pull upstream  # remote: biopython master
git checkout phyloxml
git merge master  # check that it merges cleanly
git checkout master
git merge phyloxml  # fast-forward
git push upstream master
git push origin master  # updating my own branches on github
git push origin phyloxml

It looks more reasonable in gitk; maybe the branches will separate
again later on GitHub when they're no longer equivalent, or when I
delete the phyloxml branch.

> One minor thing - test_Phylo.py needs to be tweaked to raise a
> MissingExternalDependencyError if NetworkX isn't installed. That
> way the run_tests.py script will treat it as a skipped test instead of
> a failed test. Alternatively, if this is just a small part of the test,
> maybe split test_Phylo.py into two files (e.g. add a new file
> test_Phylo_NetworkX.py which needs the dependency).

I extracted test_Phylo_depend.py from test_Phylo and added tests at
the top level for networkx and either pygraphviz or pydot (since those
are also used by Bio/Phylo/Utils.py).
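
For reference, the skip-on-missing-dependency pattern Peter describes can be
sketched roughly as below. This is an illustration only, not the contents of
test_Phylo_depend.py: the helper name require_module is made up here, and a
local stand-in exception is defined so the sketch also runs where Biopython
is not importable.

```python
import importlib

try:
    # Biopython's own exception; per the thread, run_tests.py treats it
    # as "skip this test" rather than as a failure.
    from Bio import MissingExternalDependencyError
except ImportError:
    # Stand-in so this sketch runs without Biopython (assumption).
    class MissingExternalDependencyError(Exception):
        pass


def require_module(name, purpose):
    """Import an optional dependency, or raise so the test is skipped.

    'require_module' is a hypothetical helper, not part of Biopython.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        raise MissingExternalDependencyError(
            "Install %s if you want to use %s." % (name, purpose))
```

At the top of a test module this would be called once per optional package,
e.g. require_module("networkx", "Bio.Phylo graph export"), before any test
cases are defined.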

> And how's this for a draft entry in the NEWS file?
>
> New module Bio.Phylo includes support for reading, writing and working with
> phylogenetic trees from Newick, Nexus and PhyloXML files. This was work by
> Eric Talevich on a Google Summer of Code 2009 project, under The National
> Evolutionary Synthesis Center (NESCent), mentored by Brad Chapman and
> Christian Zmasek.

Great, thanks!

Eric


From eric.talevich at gmail.com  Mon Jan 11 16:43:01 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 11 Jan 2010 11:43:01 -0500
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com> 
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
	<320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
Message-ID: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>

On Mon, Jan 11, 2010 at 11:17 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Jan 11, 2010 at 3:02 PM, Michiel de Hoon wrote:
>>
>> On Mon, 1/11/10, Peter wrote:
>>> What is the benefit of having them also exposed under the
>>> Bio.Phylo namespace, e.g. as Bio.Phylo.read? This means
>>> there are two ways to access them which is confusing.
>>
>> If we use Bio.Phylo.IO.read directly, then for consistency we'd have
>> to do the same for all other modules. Otherwise, we'd be guessing
>> each time whether the read() and parse() functions are in
>> Bio.SomeModule, or Bio.SomeModule.IO.
>
> Fair point.
>
>> For Bio.Phylo, a simple solution is to put whatever is in
>> Bio.Phylo.IO.__init__.py in Bio.Phylo.__init__.py, and
>> remove Bio.Phylo.IO.__init__.py. Then there is only one
>> way to access the read() etc. functions.
>
> Or (if the functions are reasonably complex) keep the
> input/output code in a separate file, but make it explicit
> that it is not a public interface - e.g. use Bio/Phylo/_IO.py?

Something like this?

Phylo/
    BaseTree.py
    Newick.py
    PhyloXML.py
    _IO.py
    _Utils.py
    PhyloXMLIO.py
    NewickIO.py
    NexusIO.py

This plays well with the expected import styles:

from Bio import Phylo  # most common
from Bio.Phylo import PhyloXML  # access the defined types
from Bio.Phylo import PhyloXMLIO  # special parsing
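
To make the idea concrete, here is a rough sketch of how a package namespace
can re-export functions from a private _IO module, so that users only ever
see one public location for read(). The package name demo_phylo and the toy
read() function are invented for this example; it is not the actual
Bio.Phylo code.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk. "demo_phylo" and its contents are
# invented for illustration and are not the real Bio.Phylo sources.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_phylo")
os.mkdir(pkg)

files = {
    # Private module: the leading underscore marks it as non-public.
    "_IO.py": (
        "def read(text, fmt):\n"
        "    return (fmt, text.strip())\n"
    ),
    # Package namespace: the single public place to find read().
    "__init__.py": "from demo_phylo._IO import read\n",
}
for name, body in files.items():
    with open(os.path.join(pkg, name), "w") as handle:
        handle.write(body)

sys.path.insert(0, root)
demo_phylo = importlib.import_module("demo_phylo")

# Users call the function from the package namespace only.
result = demo_phylo.read("(A,B);\n", "newick")
print(result)
```

The function still lives in the private module (its __module__ attribute says
so), but the documented entry point is the package itself.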


From biopython at maubp.freeserve.co.uk  Mon Jan 11 17:11:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 17:11:29 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
Message-ID: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>

On Mon, Nov 23, 2009 at 2:43 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Dear all,
>
> Is there anyone on the dev mailing list willing to test the SFF
> support I've been working on for Bio.SeqIO? The code is here,
> a branch on github:
> http://github.com/peterjc/biopython/tree/sff-seqio
>
> The important files are:
> * Bio/SeqIO/SffIO.py
> * Bio/SeqIO/__init__.py (defining the new format)
> * Bio/SeqIO/_index.py (indexing SFF files)
>
> Plus unit test files:
> * Tests/run_tests.py (to run the doctests)
> * Tests/test_SeqIO_QualityIO.py
> * Tests/test_SeqIO_index.py
> * Tests/test_SeqIO.py
> * Tests/Roche/* (for unit tests)
>
> Sebastian Bassi had a look last month and his feedback has
> already helped (e.g. with error messages):
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html
>
> I have been using this code myself in real work, for example
> editing the trim points in an SFF file to take into account PCR
> primer sequences, and filtering SFF reads, checking Roche
> barcodes etc.
>
> Thanks,
>
> Peter
>

Hi all,

I didn't want to rush the SFF support into Biopython 1.53, but it's been
waiting "ready" for a while now. Any objections or comments about
me merging this now?

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Tue Jan 12 14:51:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 14:51:58 +0000
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
References: <320fb6e01001110542l4261544do1d5f7430a598f9bb@mail.gmail.com>
	<107440.85746.qm@web62406.mail.re1.yahoo.com>
	<320fb6e01001110817w173b3805wb15eff49dfc56394@mail.gmail.com>
	<3f6baf361001110843o2b1fa13fid3f169ca4accbdbd@mail.gmail.com>
Message-ID: <320fb6e01001120651i6b3d661m83187659595ce9e4@mail.gmail.com>

On Mon, Jan 11, 2010 at 4:43 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Mon, Jan 11, 2010 at 11:17 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Or (if the functions are reasonably complex) keep the
>> input/output code in a separate file, but make it explicit
>> that it is not a public interface - e.g. use Bio/Phylo/_IO.py?
>
> Something like this?
>
> Phylo/
>     BaseTree.py
>     Newick.py
>     PhyloXML.py
>     _IO.py
>     _Utils.py
>     PhyloXMLIO.py
>     NewickIO.py
>     NexusIO.py
>
> This plays well with the expected import styles:
>
> from Bio import Phylo  # most common
> from Bio.Phylo import PhyloXML  # access the defined types
> from Bio.Phylo import PhyloXMLIO  # special parsing

I'd forgotten Bio/Phylo/IO was a directory, and that the users may
want to access PhyloXMLIO directly. That suggested structure
looks reasonable... what do you think Michiel?

Peter


From kellrott at gmail.com  Tue Jan 12 21:46:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 13:46:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
Message-ID: <bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>

I've pulled from the main branch and fixed a few problems.  I've tested the
code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to be
working right now.

Kyle

On Thu, Dec 17, 2009 at 10:03 AM, Kyle Ellrott <kellrott at gmail.com> wrote:

>
>  > Code can be found at http://github.com/kellrott/biopython
>>
>> Lovely. That's on your jython branch (along with lots of your other work)?
>>
>
> Yes, but all of the zxJDBC work has been done in the past 2 weeks (just the
> last three commits), so it should be easy to cherry-pick out the relevant
> patches.
>
> Kyle
>


From biopython at maubp.freeserve.co.uk  Tue Jan 12 21:51:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 21:51:34 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
Message-ID: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>

On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> I've pulled from the main branch and fixed a few problems.  I've tested the
> code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to be
> working right now.
>
> Kyle

Excellent - I had a play last month, and Jython Mysql seemed to work.
Do you know if/how to get SQLite and/or PostgreSQL drivers installed
under zxJDBC?

Peter


From kellrott at gmail.com  Tue Jan 12 22:06:39 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 12 Jan 2010 14:06:39 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
Message-ID: <bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>

I haven't played with PostgreSQL yet (don't even have it installed).
Sqlite as a python package hasn't been standardized to Jython yet  (
http://bugs.jython.org/issue1682864 )

One option is to call SQLite JDBC (
http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the
existing SQLite code.
But like zxJDBC, the jar would need to be in the CLASSPATH variable for the
code to work.


Kyle

On Tue, Jan 12, 2010 at 1:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Jan 12, 2010 at 9:46 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> > I've pulled from the main branch and fixed a few problems.  I've tested
> the
> > code against Sqlite, Python Mysql, and Jython Mysql.  All three seem to
> be
> > working right now.
> >
> > Kyle
>
> Excellent - I had a play last month, and Jython Mysql seemed to work.
> Do you know if/how to get SQLite and/or PostgreSQL drivers installed
> under zxJDBC?
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Wed Jan 13 11:22:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 11:22:23 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
Message-ID: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>

On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> I haven't played with PostgreSQL yet (don't even have it installed).
> Sqlite as a python package hasn't been standardized to Jython yet  (
> http://bugs.jython.org/issue1682864 )
>
> One option is to call SQLite JDBC (
> http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing the
> existing SQLite code.
> But like zxJDBC, the jar would need to be in the CLASSPATH variable for the
> code to work.

I'm not 100% convinced that the details of your current approach
are the best way forward: Specifically taking a user script that works
on (C) Python using MySQL with MySQLdb as the driver, and when
run on Jython automatically interpreting this to use the Java MySQL
Connector/J with the org.gjt.mm.mysql.Driver (and so on for the
PostgreSQL and SQLite drivers?)

It might be clearer if we just treat the different Jython/Java drivers
as top level alternatives:

* MySQLdb (Python only, at least for now)
* psycopg, psycopg2, pgdb (Python only, at least for now)
* sqlite3 (currently Python only, maybe available on Jython later)
* org.gjt.mm.mysql.Driver (Jython only)
* Some Java PostgreSQL driver (Jython only)
* Some Java SQLite driver (Jython only)

This way we have a clean separation of all the different driver
or database specific changes - although the user is required
to make some minor changes to take an existing BioSQL on
MySQL script to explicitly change the driver from MySQLdb
to org.gjt.mm.mysql.Driver if they want to run it on Jython.
We also won't have lots of "if jython" statements everywhere.

What are your thoughts on this?

Note there will be some similarities between all the MySQL
adaptors, all the PostgreSQL adaptors, etc. I've just made
a small improvement to file BioSQL/DBUtils.py to reduce
the code duplication for the existing (C) Python PostgreSQL
adaptors.

Peter


From biopython at maubp.freeserve.co.uk  Wed Jan 13 14:10:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 14:10:21 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
Message-ID: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>

Hi all,

Biopython currently supports Python 2.4, 2.5 and 2.6
(and seems to work on the current Python 2.7 alpha).

Is it time to start phasing out support for Python 2.4?

Reasons for encouraging Python 2.5+ include the
built in support for sqlite3 (which we can use in the
BioSQL wrappers) and ElementTree (which we use
for the phyloXML parser) both of which must currently
be manually installed for Python 2.4.

Also ReportLab is talking about dropping support
for Python 2.4 (another optional dependency of
Biopython). As far as I know, NumPy haven't yet
talked about dropping support for Python 2.4.

I was thinking of the usual deprecation procedure, so
we'd aim to have at least two releases and one year
before actually dropping support for Python 2.4. At
that point older Linux distributions which ship with
Python 2.4 probably won't be supported anyway.

e.g. The last version of Ubuntu to have Python 2.4
as the default was Ubuntu 6.06 LTS (Dapper Drake).
The desktop edition support ended July 2009, but
the server edition will be maintained until June 2011.

Peter


From eric.talevich at gmail.com  Wed Jan 13 17:08:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 13 Jan 2010 12:08:24 -0500
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>

On Wed, Jan 13, 2010 at 9:10 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha).
>
> Is it time to start phasing out support for Python 2.4?
>
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrappers) and ElementTree (which we use
> for the phyloXML parser) both of which must currently
> be manually installed for Python 2.4.

Also, it appears that Python 2.7 will use absolute instead of relative
imports by default:
http://www.python.org/dev/peps/pep-0328/

For intra-package imports like in PDB/__init__.py, an import like this:
from PDBParser import PDBParser

could be future-proofed for Py2.5+:
from __future__ import absolute_import
from .PDBParser import PDBParser

But to make it work in both Py2.4 and Py2.7, it would need to be
converted to an absolute import:
from Bio.PDB.PDBParser import PDBParser


Py2.5 introduced a number of other enticing syntax features, too:
http://docs.python.org/dev/whatsnew/2.5.html
- context managers (with_statement)
- if-else expressions
- unified try-except-finally (I flagged this issue in the comments in Bio.Phylo)
- all() and any()
- passing values into generators -- could be useful for parsing, maybe
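
A quick sketch of what those features look like in practice. The function
names below are made up for the example and have nothing to do with existing
Biopython code. On Python 2.5 itself the with statement additionally needs
"from __future__ import with_statement" at the top of the module.

```python
import io


def first_record_length(handle):
    """Return the length of the first non-blank line, or 0 if none."""
    with handle:  # context manager closes the handle for us
        lines = [line.strip() for line in handle]
    non_blank = [line for line in lines if line]
    return len(non_blank[0]) if non_blank else 0  # if-else expression


def parse_int(text):
    """Unified try/except/finally in a single statement."""
    log = []
    try:
        value = int(text)
    except ValueError:
        value = None
    finally:
        log.append("parsed %r" % text)
    return value, log


def is_dna(seq):
    """all() over a generator expression."""
    return all(base in "ACGT" for base in seq)


def running_total():
    """Generator that receives values through send()."""
    total = 0
    while True:
        value = yield total
        total += value


totals = running_total()
next(totals)  # advance to the first yield before sending values
print(totals.send(3), totals.send(4))
print(first_record_length(io.StringIO(u"\nACGT\nAC\n")))
```

On Python 2.4 each of these needs a longer spelling (nested try blocks,
explicit close() calls, and/or tricks instead of the conditional expression).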

The enhancements to distutils (PEP 314 metadata) might help simplify the
dependency handling in setup.py:
http://docs.python.org/dev/whatsnew/2.5.html#pep-314-metadata-for-python-software-packages-v1-1

I'm also interested in the functools and ctypes modules, but don't
have pressing use cases for them.
(So, you can take that as a +1 from me.)

Cheers,
Eric


From biopython at maubp.freeserve.co.uk  Wed Jan 13 17:21:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 13 Jan 2010 17:21:23 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<3f6baf361001130908k68240017h195d5877841fafe4@mail.gmail.com>
Message-ID: <320fb6e01001130921w49b56793h413aacd3027d6275@mail.gmail.com>

On Wed, Jan 13, 2010 at 5:08 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Wed, Jan 13, 2010 at 9:10 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> Biopython currently supports Python 2.4, 2.5 and 2.6
>> (and seems to work on the current Python 2.7 alpha).
>>
>> Is it time to start phasing out support for Python 2.4?
>>
>> Reasons for encouraging Python 2.5+ include the
>> built in support for sqlite3 (which we can use in the
>> BioSQL wrappers) and ElementTree (which we use
>> for the phyloXML parser) both of which must currently
>> be manually installed for Python 2.4.
>
> Also, it appears that Python 2.7 will use absolute instead
> of relative imports by default:
> http://www.python.org/dev/peps/pep-0328/

Thanks for the heads up on that. I think we'll just need
to switch everything to absolute imports in order to
cover Python 2.4 to 2.7 inclusive.

>
> (So, you can take that as a +1 from me.)
>

Good :)

Peter


From kellrott at gmail.com  Wed Jan 13 17:37:53 2010
From: kellrott at gmail.com (Kyle Ellrott)
Date: Wed, 13 Jan 2010 09:37:53 -0800
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
	<320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
Message-ID: <bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>

My main thought was to make it so that users can write a single script that
would work on any Python system (eventually IronPython as well).  Because
the current system expects the user to request a specific driver (MySQLdb)
that happens to be system specific, it forces user code to be system
specific.
One alternative would be to use the strings you describe below, but in
addition add special requests that would check the system and pull the
appropriate driver automatically.
'autoMySQL' or 'MySQL' - use MySQLdb if in CPython, or
org.gjt.mm.mysql.Driver if in Jython.
Otherwise, if the user wants to use a specific driver, they pass its name.
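
Roughly what I have in mind, as a sketch only. The lookup table and the
function name resolve_driver are hypothetical and not part of BioSQL; the
driver strings are the ones mentioned in this thread, and the interpreter is
detected with platform.python_implementation().

```python
import platform

# Logical database name -> driver name per interpreter. The table and
# 'resolve_driver' are hypothetical, not BioSQL's actual API.
AUTO_DRIVERS = {
    "MySQL": {
        "CPython": "MySQLdb",
        "Jython": "org.gjt.mm.mysql.Driver",
    },
}
AUTO_DRIVERS["autoMySQL"] = AUTO_DRIVERS["MySQL"]


def resolve_driver(name, implementation=None):
    """Map a logical name to a concrete driver for this interpreter."""
    if implementation is None:
        implementation = platform.python_implementation()
    choices = AUTO_DRIVERS.get(name)
    if choices is None:
        # An explicit driver name is passed through unchanged.
        return name
    try:
        return choices[implementation]
    except KeyError:
        raise ValueError(
            "No default %s driver known for %s" % (name, implementation))
```

A script asking for "MySQL" would then run unchanged on both interpreters,
while a script naming a specific driver keeps full control.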

Kyle

On Wed, Jan 13, 2010 at 3:22 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Jan 12, 2010 at 10:06 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> > I haven't played with Postgre yet (don't even have it installed).
> > Sqlite as a python package hasn't been standardized to Jython yet  (
> > http://bugs.jython.org/issue1682864 )
> >
> > One option is to call SQLite JDBC (
> > http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC ) rather than reusing
> the
> > existing SQLite code.
> > But like zxJDBC, the jar would need to be in the CLASSPATH variable for
> the
> > code to work.
>
> I'm not 100% convinced that the details of your current approach
> are the best way forward: Specifically taking a user script that works
> on (C) Python using MySQL with MySQLdb as the driver, and when
> run on Jython automatically interpreting this to use the Java MySQL
> Connector/J with the org.gjt.mm.mysql.Driver (and so on for the
> PostgreSQL and SQLite drivers?)
>
> It might be clearer if we just treat the different Jython/Java drivers
> as top level alternatives:
>
> * MySQLdb (Python only, at least for now)
> * psycopg, psycopg2, pgdb (Python only, at least for now)
> * sqlite3 (currently Python only, maybe available on Jython later)
> * org.gjt.mm.mysql.Driver (Jython only)
> * Some Java PostgreSQL driver (Jython only)
> * Some Java SQLite driver (Jython only)
>
> This way we have a clean separation of all the different driver
> or database specific changes - although the user is required
> to make some minor changes to take an existing BioSQL on
> MySQL script to explicitly change the driver from MySQLdb
> to org.gjt.mm.mysql.Driver if they want to run it on Jython.
> We also won't have lots of "if jython" statements everywhere.
>
> What are your thoughts on this?
>
> Note there will be some similarities between all the MySQL
> adaptors, all the PostgreSQL adaptors, etc. I've just made
> a small improvement to file BioSQL/DBUtils.py to reduce
> the code duplication for the existing (C) Python PostgreSQL
> adaptors.
>
> Peter
>


From chapmanb at 50mail.com  Thu Jan 14 12:52:44 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 14 Jan 2010 07:52:44 -0500
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <20100114125244.GB59876@sobchak.mgh.harvard.edu>

Hey Peter;
Sounds great to me. Looking forward to being able to use conditional
expressions, collections.defaultdict, functools, and the with
statement. 2.5 had a lot of great stuff.
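
As a small taste of two of those, here is a sketch using
collections.defaultdict and functools.partial. The helper names are invented
for the example and are not existing Biopython functions.

```python
from collections import defaultdict
from functools import partial


def group_by_length(sequences):
    """Group sequences by length without checking for missing keys."""
    groups = defaultdict(list)  # a missing key starts as an empty list
    for seq in sequences:
        groups[len(seq)].append(seq)
    return dict(groups)


def count_base(seq, base):
    """Count one base, ignoring case."""
    return seq.upper().count(base)


# partial() fixes the 'base' argument, giving a one-argument function.
count_g = partial(count_base, base="G")

print(group_by_length(["AC", "GT", "A"]))
print(count_g("ggA"))
```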

Brad

> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha).
> 
> Is it time to start phasing out support for Python 2.4?
> 
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrappers) and ElementTree (which we use
> for the phyloXML parser) both of which must currently
> be manually installed for Python 2.4.
> 
> Also ReportLab is talking about dropping support
> for Python 2.4 (another optional dependency of
> Biopython). As far as I know, NumPy haven't yet
> talked about dropping support for Python 2.4.
> 
> I was thinking of the usual deprecation procedure, so
> we'd aim to have at least two releases and one year
> before actually dropping support for Python 2.4. At
> that point older Linux distributions which ship with
> Python 2.4 probably won't be supported anyway.
> 
> e.g. The last version of Ubuntu to have Python 2.4
> as the default was Ubuntu 6.06 LTS (Dapper Drake).
> The desktop edition support ended July 2009, but
> the server edition will be maintained until June 2011.
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From biopython at maubp.freeserve.co.uk  Thu Jan 14 14:52:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 14:52:24 +0000
Subject: [Biopython-dev] Phasing out support for Python 2.4?
In-Reply-To: <20100114125244.GB59876@sobchak.mgh.harvard.edu>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<20100114125244.GB59876@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01001140652v1e11725esa6a2f91fafd0104b@mail.gmail.com>

On Thu, Jan 14, 2010 at 12:52 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hey Peter;
> Sounds great to me. Looking forward to being able to use conditional
> expressions, collections.defaultdict, functools, and the with
> statement. 2.5 had a lot of great stuff.
>
> Brad

I guess there are quite a few good things in Python 2.5+,
although I think the jump from Python 2.3 to 2.4 was more
important (generator expressions and decorators). You'll have
to restrain yourself from using the new toys in Biopython a
little longer though Brad ;)

Since this seems to have raised no immediate objections,
I've sent a message to the main and announcement lists:

http://lists.open-bio.org/pipermail/biopython/2010-January/006111.html
http://lists.open-bio.org/pipermail/biopython-announce/2010-January/000064.html

Assuming there are no objections, we can add a conditional
deprecation warning to setup.py and do a news blog post
(like we did for dropping Python 2.3 early last year):
http://news.open-bio.org/news/2009/05/dropping-python23-support/
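
The conditional warning could look something like the sketch below. The
function name and the message wording are placeholders, not what is in
setup.py; the version is passed in as an argument only so the check can be
exercised without a Python 2.4 install.

```python
import sys
import warnings


def warn_if_python24(version_info=None):
    """Warn on Python 2.4; return True if a warning was issued.

    Hypothetical helper; the message text is a placeholder.
    """
    if version_info is None:
        version_info = sys.version_info
    if tuple(version_info[:2]) == (2, 4):
        warnings.warn(
            "Python 2.4 support is deprecated and will be dropped "
            "in a future release.",
            DeprecationWarning)
        return True
    return False
```

Note that DeprecationWarning is hidden by default on some Python versions, so
a setup script that wants the message to be seen may prefer a plain print or
a UserWarning.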

Peter


From biopython at maubp.freeserve.co.uk  Thu Jan 14 17:32:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 17:32:22 +0000
Subject: [Biopython-dev] [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
	<320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>
	<4B4F4071.7040601@fold.natur.cuni.cz>
Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com>

On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJŠ
<mmokrejs at fold.natur.cuni.cz> wrote:
>
> Hi Peter,
> I don't get this point much. What is the problem stating that with
> python 2.5+ one does not need to install an extra dependency while
> for 2.4 one needs _two_ modules?
> I don't think I want BioSQL nor sqlite so why would I have to upgrade.
> Would the requirement be in python language syntax incompatibility then
> I would NOT object, but in this situation ...
> Martin

Hi Martin,

This isn't just the issue of sqlite3 and ElementTree. There
are several benefits to using more recent versions of Python,
for example with an eye on the future for Python 3, and on
a practical level it simplifies our testing to have one less
version to worry about (especially once Python 2.7 is out,
currently scheduled for June 2010).

We've already had minor issues with developers using
Python 2.5+ syntax unwittingly which broke on Python
2.4 (nothing major, and it was easily fixed once the
problem was spotted). If we continue to insist on Python
2.4 support, it may prove problematic if future potential
contributors have existing code written for Python 2.5+
which would require significant re-factoring.

None of these concerns are pressing right now (and
some are hypothetical), but I think you will agree that
Python 2.4 is pretty old, and not widely used anymore.
Having a clear plan in place for dropping it seems a
sensible move, and once that happens we can start
to take advantage of the language and library
improvements Python 2.5 added.

Are you personally using Python 2.4? If so, could you
tell us a little more - for example, is this a university
server which would be difficult to update? Or do you
require some other Python package which requires
Python 2.4?

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Jan 14 18:55:18 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 14 Jan 2010 13:55:18 -0500
Subject: [Biopython-dev] [Bug 2992] New: Adding Uniprot XML file format
	parsing to Biopython
Message-ID: <bug-2992-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2992

           Summary: Adding Uniprot XML file format parsing to Biopython
           Product: Biopython
           Version: 1.53
          Platform: All
               URL: http://github.com/apierleoni/biopython/tree/uniprotxml-
                    branch
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biocomp.unibo.it


Uniprot XML formatted files are much easier to parse than the swissprot flat
file, and are widely used at EMBL for the uniprot, IPI and integr8 databases


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From andrea at biocomp.unibo.it  Thu Jan 14 18:57:58 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 14 Jan 2010 19:57:58 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
Message-ID: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>

Hi Everyone,
I've been using Biopython a lot over the last couple of years, and it is very
useful to me. So now it's my turn to contribute and be helpful to someone
else.
I wrote a parser for the Uniprot XML format that is reasonably fast (8000
entries/min on a core2duo mainstream PC). The main improvements over the
current SwissProt flat file parser are deeper parsing of the comment fields
and a SeqRecord containing features.

The parser is based on the ElementTree library and was successfully tested
on the complete SwissProt database (v57.12). Thus I think it is ready to
be released.
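
To give an idea of the approach (this is not the code in UniprotIO.py, only
a minimal sketch), ElementTree.iterparse can walk a UniProt XML file one
entry at a time. The sample records and the function name iter_entries are
invented for the example, and the namespace string is assumed to match the
one used in UniProt XML files.

```python
import io
from xml.etree import ElementTree

# Namespace assumed for UniProt XML elements.
NS = "{http://uniprot.org/uniprot}"

# Two made-up entries, trimmed to a few fields for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>X00001</accession>
    <name>DEMO1_HUMAN</name>
    <sequence length="4">MKTA</sequence>
  </entry>
  <entry dataset="Swiss-Prot">
    <accession>X00002</accession>
    <name>DEMO2_HUMAN</name>
    <sequence length="3">MSE</sequence>
  </entry>
</uniprot>
"""


def iter_entries(handle):
    """Yield (accession, name, sequence) for each <entry> element."""
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == NS + "entry":
            yield (
                elem.findtext(NS + "accession"),
                elem.findtext(NS + "name"),
                elem.findtext(NS + "sequence"),
            )
            elem.clear()  # release the children of entries already handled


for record in iter_entries(io.BytesIO(SAMPLE)):
    print(record)
```

Handling one entry per "end" event and clearing it afterwards keeps memory
use modest even on a file the size of the full SwissProt release.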

I followed the rules to develop a new parser for SeqIO, filed an
enhancement bug to bugzilla (bug 2992), and included the parser in a
public biopython fork on github available at:

http://github.com/apierleoni/biopython/tree/uniprotxml-branch

the new parser is in the "uniprotxml-branch" branch, and the parser code
is in Bio/SeqIO/UniprotIO.py

The parser can be used from SeqIO using:

iterator=SeqIO.parse(handle,'uniprot')


I think this could be easily integrated into Biopython. Unit tests are still
missing, but should be very easy to add.
In any case, code review and suggestions are welcome.

Andrea


From p.j.a.cock at googlemail.com  Thu Jan 14 19:16:49 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 Jan 2010 19:16:49 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>

On Thursday, January 14, 2010, Andrea Pierleoni <andrea at biocomp.unibo.it> wrote:
> Hi Everyone,
> I've been using a lot biopython in the last couple of years, it is very
> useful to me. So now it's my turn to contribute and be helpful to someone
> else.
> I wrote a parser for the Uniprot XML format, that is reasonably fast (8000
> entries/min on a core2duo mainstream PC). The main improvements with the
> actual SwissProt flat file parser are a deeper parsing of comment fields,
> and a Seqrecord containing features.
>
> The parser is based on the ElementTree library and was successfully tested
> on the complete SwissProt database (v57.12). Thus I think it is ready to
> be released.
>
> I followed the rules to develop a new parser for SeqIO, filed an
> enhancement bug to bugzilla (bug 2992), and included the parser in a
> public biopython fork on github available at:
>
> http://github.com/apierleoni/biopython/tree/uniprotxml-branch
>
> the new parser is in the "uniprotxml-branch" branch, and the parser code
> is in Bio/SeqIO/UniprotIO.py
>
> The parser can be used from SeqIO using:
>
> iterator=SeqIO.parse(handle,'uniprot')
>
>
> I think this could be easily integrated in Biopython, unit test is still
> missing, but should be very easy to do.
> Anyhow any code review or suggestions are welcome.
>
> Andrea
>


Hi

I'd spotted your branch on github - this looks like an excellent
addition to Biopython :)

What I would like to see is a few unit tests, specifically one using
the same record in both XML (with the new parser) and the equivalent
plain text SwissProt file (with the old parser) and check they agree.

Also, I think you should check that the start coordinates of the
features use Python-style (zero-based) counting.
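
For illustration, a minimal sketch of the conversion in question. It
assumes the XML gives 1-based, inclusive positions (as the plain text
SwissProt feature table does); the helper name to_python_coords is
hypothetical and not part of Biopython:

```python
def to_python_coords(begin, end):
    """Convert a 1-based inclusive range to a Python half-open range.

    Assumption: begin/end are 1-based, inclusive positions as written
    in the file. The result can be used directly for slicing.
    """
    return begin - 1, end


# A feature annotated as residues 2..4 of this toy sequence.
sequence = "MKTAYIAK"
start, end = to_python_coords(2, 4)
print(sequence[start:end])
```

With this convention, the slice length is simply end - start, which is
what Python users would expect of a feature location.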

Regards

Peter


From eric.talevich at gmail.com  Thu Jan 14 20:03:35 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 14 Jan 2010 15:03:35 -0500
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
Message-ID: <3f6baf361001141203i304146a4ld5683190a32b7ffe@mail.gmail.com>

On Thu, Jan 14, 2010 at 1:57 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
> Hi Everyone,
> I've been using a lot biopython in the last couple of years, it is very
> useful to me. So now it's my turn to contribute and be helpful to someone
> else.
> I wrote a parser for the Uniprot XML format, that is reasonably fast (8000
> entries/min on a core2duo mainstream PC). The main improvements with the
> actual SwissProt flat file parser are a deeper parsing of comment fields,
> and a Seqrecord containing features.
>
> The parser is based on the ElementTree library and was successfully tested
> on the complete SwissProt database (v57.12). Thus I think it is ready to
> be released.

Have you tried using this with Python 2.4? The ElementTree module
wasn't added to the standard library until Python 2.5, so a simple
"from xml.etree import ElementTree" may need some additional
protection. It's also nice to let the user use a third-party
implementation of ElementTree if they're stuck on Py2.4.

An example of this is at the top of Bio.Phylo.PhyloXMLIO -- not
pretty, but functional:
http://github.com/biopython/biopython/blob/master/Bio/Phylo/PhyloXMLIO.py
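
A minimal sketch of that kind of import protection (not copied from
PhyloXMLIO.py; the exact fallback order there may differ). It assumes
that users on Python 2.4 have installed the separately distributed
"elementtree" package:

```python
try:
    # Standard library location, available from Python 2.5 onwards.
    from xml.etree import ElementTree
except ImportError:
    try:
        # Third-party package of the same code for Python 2.4.
        from elementtree import ElementTree
    except ImportError:
        raise ImportError("No ElementTree implementation was found; "
                          "install the 'elementtree' package.")

# Whichever import succeeded, the rest of the code uses one name.
root = ElementTree.fromstring("<entry><name>TEST</name></entry>")
print(root.findtext("name"))
```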

-Eric


From p.j.a.cock at googlemail.com  Thu Jan 14 23:04:36 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 Jan 2010 23:04:36 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>

On Thu, Jan 14, 2010 at 10:41 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>
>>
>> By default, copy the "swiss" parser. If that doesn't have the
>> annotation, see if there is anything similar in the "genbank"
>> parser (effectively our reference for rich annotation parsing).
>> If in doubt, for now discard the data with a comment in the
>> code - and then discuss it here.
>>
>> Peter
>>
> I'll take a look at both the swissprot and genbank parsers.
> right now the annotation parsing scheme is based on the XML schema.
> eg.
> <comment type="function">
> <text>function text</text>
> </comment>
>
> is parsed in the annotations as:
>
> seqrecord.annotations['comment_function']=['function text']
>
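
The mapping described in the quoted text can be sketched as follows.
This is only an illustration with the standard library ElementTree,
not the code from UniprotIO.py; the function name
comments_to_annotations is hypothetical, and the XML namespace used by
real UniProt files is ignored for brevity:

```python
from xml.etree import ElementTree


def comments_to_annotations(entry_xml):
    """Collect <comment type="X"><text>...</text></comment> elements
    into a dict of the form {'comment_X': [text, ...]}."""
    annotations = {}
    entry = ElementTree.fromstring(entry_xml)
    for comment in entry.findall("comment"):
        key = "comment_" + comment.get("type", "unknown")
        text = comment.findtext("text")
        if text is not None:
            # Lists of strings, so repeated comment types accumulate.
            annotations.setdefault(key, []).append(text)
    return annotations


example = """<entry>
<comment type="function">
<text>function text</text>
</comment>
</entry>"""
print(comments_to_annotations(example))
```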

My reasoning is it should be (almost) transparent for
users to switch from parsing the plain text SwissProt
files ("swiss") to the XML form. There are also knock
on implications for saving to BioSQL and file format
conversions e.g. saving as a GenBank protein file
(aka GenPept format).

However, the comment parsing in the plain text "swiss"
format is currently a little simplistic - partly to match
what BioPerl did at the time. We can revisit that as
part of this work.

Peter


From andrea at biocomp.unibo.it  Fri Jan 15 10:35:39 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 15 Jan 2010 11:35:39 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
Message-ID: <bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>


>
> My reasoning is it should be (almost) transparent for
> users to switch from parsing the plain text SwissProt
> files ("swiss") to the XML form.

This would be good

> There are also knock
> on implications for saving to BioSQL and file format
> conversions e.g. saving as a GenBank protein file
> (aka GenPept format).

The returned SeqRecords are actually BioSQL-safe, since I can load
them into a PostgreSQL BioSQL database. When formatting a SeqRecord
as 'genbank', the dbxrefs, features, seq, keywords, source and names
appear to be reported correctly, while there is no trace of the other
annotations. I'll look into it further.

>
> However, the comment parsing in the plain text "swiss"
> format is currently a little simplistic - partly to match
> what BioPerl did at the time. We can revisit that as
> part of this work.
>

The main problem here is going to be the comment fields, which the
plain text parser parses as a single string (this is what pushed me to
write the new parser). I tried to keep comment parsing as simple as
possible by just using lists of strings (good for BioSQL), but many
comment types would be better parsed into a dictionary tree.
As of now I left the option to get back the full XML for each comment, by
calling:

UniprotIO.UniprotIterator(handle,return_raw_comments=True)

so all the information in the XML file can be returned and the end user
can decide how to parse it.

Anyhow, I think it is better to discuss this when the 'swiss' vs
'uniprot' unit test is ready.

Andrea


From p.j.a.cock at googlemail.com  Fri Jan 15 11:08:32 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Jan 2010 11:08:32 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>

On Fri, Jan 15, 2010 at 10:35 AM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>>
>> However, the comment parsing in the plain text "swiss"
>> format is currently a little simplistic - partly to match
>> what BioPerl did at the time. We can revisit that as
>> part of this work.
>>
>
> the main problem here are going to be the comment fields, that in the
> plain text predictors are parsed as a single string (this pushed me to
> wrote the new parser). I tried to keep comments parsing as simple as it
> can be, by just using lists of strings (good for BioSQL), but many comment
> types would be better parsed with a dictionary tree.

I think BioPerl now uses some kind of nested tree when parsing the
SwissProt comment block, and I would like us to use something
compatible (e.g. a dictionary tree) in the "swiss" parser (and thus
also the XML parser) in such a way that we end up saving this in
BioSQL the same way.
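
As a rough sketch of what a "dictionary tree" for one comment could
look like, here is a generic XML-to-nested-dict conversion. The
function name element_to_tree and the "#text" key are hypothetical
choices for this example, not an agreed Biopython or BioPerl layout,
and the sample XML is only illustrative:

```python
from xml.etree import ElementTree


def element_to_tree(element):
    """Turn an XML element into nested dicts and lists.

    Attributes become plain keys, the element's own text is stored
    under '#text', and each child tag maps to a list of sub-trees.
    Assumption: attribute names and child tag names do not clash.
    """
    node = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        node["#text"] = text
    for child in element:
        node.setdefault(child.tag, []).append(element_to_tree(child))
    return node


comment = ElementTree.fromstring(
    '<comment type="subcellular location">'
    '<subcellularLocation><location>Cytoplasm</location>'
    '</subcellularLocation></comment>')
print(element_to_tree(comment))
```

A structure like this keeps the nesting that a flat list of strings
loses, at the cost of being harder to store in BioSQL as-is.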

> As of now I left the option to get back the full XML for each comment, by
> calling:
>
> UniprotIO.UniprotIterator(handle,return_raw_comments=True)
>
> so every info in the XML file can be returned and the end user can decide
> how to parse those additional info.
>
> Anyhow I think it is better to discuss this when the unit test
> 'swiss'VS'uniprot' is ready.

+1, good plan.

Peter


From bugzilla-daemon at portal.open-bio.org  Fri Jan 15 12:38:49 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 15 Jan 2010 07:38:49 -0500
Subject: [Biopython-dev] [Bug 2704] Parser for the markx10 alignment format
In-Reply-To: <bug-2704-42@http.bugzilla.open-bio.org/>
Message-ID: <201001151238.o0FCcnB1017338@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2704


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-15 07:38 EST -------
According to the change log for the just released EMBOSS 6.2:

        Alignment output included headers only for EMBOSS-specific
        formats. The headers have been dropped from the FASTA MARKX0
        through MARKX10 formats to allow standard FASTA suite parsers to
        use the EMBOSS versions of these outputs.

See also:
http://lists.open-bio.org/pipermail/emboss-dev/2009-August/000618.html

Fingers crossed this means we will be able to parse their output
with the "fasta-m10" parser in Bio.AlignIO.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From chapmanb at 50mail.com  Mon Jan 18 13:01:15 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 18 Jan 2010 08:01:15 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
Message-ID: <20100118130115.GA48842@sobchak.mgh.harvard.edu>

Hey all;
After the Google groups discussion kicked off by Istvan last month,
I've been thinking a bit about supplements to mailing list
discussions. I agree that mailman is not great for searching and
archival purposes; we often see similar questions appear because
finding and browsing the right thread from a past discussion is not
intuitive.

Google groups is okay, but doesn't offer a huge improvement over
mailman. Additionally, reports indicate spamming is pretty bad,
which creates additional moderation headaches.

For handling "how do I do this biology task in Python" questions, what
do people think about something entirely different like Stack Overflow?
This presents a nice interface for asking questions, and the follow
ups are voted up and down by utility so it's easy to see what the
right answer is. Questions there are indexed well by search engines,
so it's also more likely someone might be able to find a previous
answer.

There are actually a couple of questions on there with a Biopython
tag:

http://stackoverflow.com/questions/tagged/biopython

From our point of view, we would need to adjust the documentation to
point out Stack Overflow as a place to ask questions, and then
monitor the biopython tag for new posts.

Mailman is still a great option for implementation discussions, but
Stack Overflow could open up question/answers to a larger audience and
help supplement the cookbook and formal documentation.

Brad


From n.j.loman at bham.ac.uk  Mon Jan 18 13:21:38 2010
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Mon, 18 Jan 2010 13:21:38 +0000
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
Message-ID: <4B546062.3090802@bham.ac.uk>

Brad Chapman wrote:
> For handling "how do I do this biology task in Python" questions, what
> do people think about something entirely different like Stack Overflow?
> This presents a nice interface for asking questions, and the follow
> ups are voted up and down by utility so it's easy to see what the
> right answer is. Questions there are indexed well by search engines,
> so it's also more likely someone might be able to find a previous
> answer.
>   
Hi Brad

Great suggestion, I have been thinking along the same lines. I really
like the design of the Stack Exchange sites, it is a great way of
exchanging Q&A information.

It is worth mentioning that Stackoverflow is not the only site using the
"Stack Exchange" format that is relevant.

Here is a link to various other Stack Exchange sites:
http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family

Although there are Biopython questions in Stackoverflow, I wonder
whether that is the correct place for questions, or whether it would be
overall more productive to have a resource for bioinformatics? I think
bioinformatics is the correct breadth of topic to keep a large enough
community together whilst not being too off-topic.

I have registered http://bioinformatics.stackexchange.com/ and will
happily make you and anyone else who is interested an admin.

Does the list think there could be enough community interest to justify
a separate site like this?

Cheers,

Nick.


From chapmanb at 50mail.com  Mon Jan 18 14:20:10 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 18 Jan 2010 09:20:10 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <4B546062.3090802@bham.ac.uk>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
	<4B546062.3090802@bham.ac.uk>
Message-ID: <20100118142010.GE48842@sobchak.mgh.harvard.edu>

Hi Nick;

> Great suggestion, I have been thinking along the same lines. I really
> like the design of the Stack Exchange sites, it is a great way of
> exchanging Q&A information.
> 
> It is worth mentioning that Stackoverflow is not the only site using the
> "Stack Exchange" format that is relevant.
> 
> Here is a link to various other Stack Exchange sites:
> http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family

Awesome. Thanks for the pointer. Sounds like you have a great handle
on this.

> Although there are Biopython questions in Stackoverflow, I wonder
> whether that is the correct place for questions, or whether it would be
> overall more productive to have a resource for bioinformatics? I think
> bioinformatics is the correct breadth of topic to keep a large enough
> community together whilst not being too off-topic.
> 
> I have registered http://bioinformatics.stackexchange.com/ and will
> happily make you and anyone else who is interested an admin.
> 
> Does the list think there could be enough community interest to justify
> a separate site like this?

It looks like there are a couple of Stack Exchange sites with
similar aims for open source bioinformatics and chemistry:

http://biostar.stackexchange.com/
http://blueobelisk.stackexchange.com/

If we go this way we might want to talk to the owners of these sites
and integrate with them.

My preference would be to go with the main StackOverflow site and
carve out our niche with the tagging system. We build off of an
existing community instead of needing to help grow one. Some of the
more successful biology communities, like the one on Friendfeed,
benefit from input outside of the standard community:

http://friendfeed.com/the-life-scientists

I think this would be less likely with a dedicated site, as that
fortuitous crosstalk is prevented by other programmers never
thinking to look at a bioinformatics only site.

Happy to hear what others think,
Brad


From biopython at maubp.freeserve.co.uk  Mon Jan 18 15:58:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 Jan 2010 15:58:27 +0000
Subject: [Biopython-dev] zxJDBC support for BioSQL
In-Reply-To: <bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>
References: <bb02be080912161739k69e63916rbb488a6d6f35948d@mail.gmail.com>
	<320fb6e00912170246p64956c9ft85c0d288c078e097@mail.gmail.com>
	<bb02be080912171003n58ba38dej8a9aeed15a289223@mail.gmail.com>
	<bb02be081001121346r516ef6edm3733fbc16c994ce4@mail.gmail.com>
	<320fb6e01001121351t5aa1a9adt95557dbbbdd8cce3@mail.gmail.com>
	<bb02be081001121406y172d415fgf2e2d8f3cd19d99a@mail.gmail.com>
	<320fb6e01001130322y6f61e905q1cf6a1763733e2a@mail.gmail.com>
	<bb02be081001130937o6cad0d28h1da86b2ca2606407@mail.gmail.com>
Message-ID: <320fb6e01001180758t179f5ccdo99132e4b10b907bb@mail.gmail.com>

On Wed, Jan 13, 2010 at 5:37 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
> My main thought was to make it so that users can write a single script that
> would work on any Python system (eventually IronPython as well). Because
> the current system expects the user to request a specific driver (MySQLdb)
> that happens to be system specific, it forces user code to be system
> specific.

Yes, it does - as long as Jython or any other Python implementation
doesn't support that driver. In the case of SQLite, it sounds like adding
sqlite3 support to Jython is planned at least.

> One alternative would be to use the strings you describe below, but in
> addition add special requests that would check the system and pull the
> appropriate driver automatically.
> 'autoMySQL' or 'MySQL' - uses MySQLdb if in CPython, use
> org.gjt.mm.mysql.Driver if in Jython.
> Otherwise, if the user wants to use a specific driver, they pass its name.

Maybe rather than specifying the driver, the user could specify the
database back end (MySQL, PostgreSQL, SQLite, ...) and providing
we know about this in advance, we can look up and try relevant
drivers automatically. We could offer this in combination with the
existing driver specifier. This seems cleaner than overloading the
driver argument.
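
A minimal sketch of such a lookup, assuming a hand-maintained table of
back ends. The CANDIDATE_DRIVERS table and the find_driver function
are hypothetical and not part of BioSQL; only the CPython module names
(MySQLdb, psycopg2, sqlite3) are real, and a Jython build would list
its own drivers first:

```python
import importlib

# Hypothetical table: back end name -> driver modules to try, in order.
CANDIDATE_DRIVERS = {
    "mysql": ["MySQLdb"],
    "postgresql": ["psycopg2"],
    "sqlite": ["sqlite3"],
}


def find_driver(backend):
    """Return the first importable driver module for a back end."""
    try:
        names = CANDIDATE_DRIVERS[backend.lower()]
    except KeyError:
        raise ValueError("Unknown database back end: %r" % backend)
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # Not installed on this system, try the next one.
    raise ImportError("No driver available for %s (tried %s)"
                      % (backend, ", ".join(names)))


driver = find_driver("SQLite")
print(driver.__name__)
```

User scripts would then name only the database, and the same script
could run unchanged wherever at least one listed driver is installed.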

Peter


From biopython at maubp.freeserve.co.uk  Mon Jan 18 16:33:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 Jan 2010 16:33:42 +0000
Subject: [Biopython-dev] EMBOSS eprimer3 parser
Message-ID: <320fb6e01001180833l6396cf23meb7e160fd6814e26@mail.gmail.com>

Hi all,

Who on the dev list makes heavy use of the EMBOSS eprimer3 parser in
Biopython? I'd like someone to look over Leighton's proposed enhancements
to this code: http://bugzilla.open-bio.org/show_bug.cgi?id=2968

There are two main issues. First, the current code doesn't cope with multiple
primer sets (so Leighton introduces read/parse functions in line with other
modules for single or multiple sets of primers). This seems entirely sensible
to me, and worthwhile in itself.

Second, Leighton makes some changes to the primer record objects.
I'm not so sure about the necessity here, even if it is backwards
compatible, but I haven't really used this code. What do the rest of
you think?

Peter


From istvan.albert at gmail.com  Mon Jan 18 18:02:23 2010
From: istvan.albert at gmail.com (Istvan Albert)
Date: Mon, 18 Jan 2010 13:02:23 -0500
Subject: [Biopython-dev] Biopython-dev Digest, Vol 84, Issue 14
In-Reply-To: <mailman.9.1263834003.27796.biopython-dev@lists.open-bio.org>
References: <mailman.9.1263834003.27796.biopython-dev@lists.open-bio.org>
Message-ID: <c878cd561001181002n34fa6bebvef7153f538b5bbc4@mail.gmail.com>

On Mon, Jan 18, 2010 at 12:00 PM,
<biopython-dev-request at lists.open-bio.org> wrote:


> It looks like there are a couple of Stack Exchange sites with
> similar aims for open source bioinformatics and chemistry:
>
> http://biostar.stackexchange.com/
> http://blueobelisk.stackexchange.com/

I am actually the original creator of
http://biostar.stackexchange.com/ Created mainly to give my students a
way to easily ask questions.

A few things to keep in mind:

- it will cost money to run it, right now it is free due to it being in beta
- it is not obvious that this service will actually be offered once
beta concludes, or that it will be offered with the same conditions.
That is pretty much what keeps me from investing more time into it.
- making it a site like this only for biopython is too restrictive

Other comments on using the main Stack Overflow site: because the
site's focus is general programming, I think most people looking for
bioinformatics-related information could easily get lost or not feel
a connection.

IMO the idea is fantastic, but it needs its own forum rather than
being a small subset of unrelated topics.

best,

Istvan


-- 
Istvan Albert
http://www.personal.psu.edu/iua1


From biopython at maubp.freeserve.co.uk  Tue Jan 19 10:49:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 10:49:31 +0000
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
Message-ID: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>

Hi Eric (and everyone else),

I just spotted the to_adjacency_matrix function in utils:
http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py

The docstring says:

> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>
> Also returns a list of all clades in tree ("allclades"), where the position
> of each clade in the list corresponds to a row and column of the numpy
> array. So, a cell i,j in the array represents the length of the branch from
> allclades[i] to allclades[j].
>
> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
> and adjacency_matrix is a NumPy 2D array.

It looks like your adjacency matrix starts as a numpy array of zeros,
and then you set some edges to branch lengths. How do you tell
apart a non-connection and a real connection of length zero? These
do occur, for example if you have three identical sequences, then
you might expect a single node with three children. However IIRC,
in (some) NJ trees each node has two children by construction,
so you get an extra node connected with a branch of length zero.

Peter


From eric.talevich at gmail.com  Tue Jan 19 15:22:30 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 19 Jan 2010 10:22:30 -0500
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
Message-ID: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>

On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Eric (and everyone else),
>
> I just spotted the to_adjacency_matrix function in utils:
> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>
> The docstring says:
>
>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>
>> Also returns a list of all clades in tree ("allclades"), where the position
>> of each clade in the list corresponds to a row and column of the numpy
>> array. So, a cell i,j in the array represents the length of the branch from
>> allclades[i] to allclades[j].
>>
>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>> and adjacency_matrix is a NumPy 2D array.
>
> It looks like your adjacency matrix starts as a numpy array of zeros,
> and then you sets some edges to branch lengths. How do you tell
> apart a non-connection and a real connection of length zero? These
> do occur, for example if you have three identical sequences, then
> you might expect a single node with three children. However IIRC,
> in (some) NJ trees each node has two children by construction,
> so you get an extra node connected with a branch of length zero.

Shoot, you're right. I can think of three reasonable mitigations:
(a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
adjacency -- this seems more standard in textbooks, actually.
(b) Issue a warning or raise an error if the given tree contains a
0-length branch.
(c) Delete the function.

Which do you recommend?

The idea was to give mathematicians something to play with. For
example, Chapter 2 of this report represents phylogenies this way,
using 0 or 1 to indicate the presence of a branch:
http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf
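
Option (a) can be sketched without any reference to branch lengths.
This is a plain-Python illustration using nested lists rather than a
NumPy array, and it is not the code from _utils.py; the function name
boolean_adjacency and the (parent, child, length) tuples are
assumptions made for the example:

```python
def boolean_adjacency(n_nodes, branches):
    """Build a symmetric True/False adjacency matrix.

    branches is a list of (parent_index, child_index, length) tuples;
    the length is ignored, so a zero-length branch still counts.
    """
    matrix = [[False] * n_nodes for _ in range(n_nodes)]
    for parent, child, _length in branches:
        matrix[parent][child] = True
        matrix[child][parent] = True
    return matrix


# Node 0 is the parent of nodes 1 and 2; the first branch has length 0.
branches = [(0, 1, 0.0), (0, 2, 0.3)]
for row in boolean_adjacency(3, branches):
    print(row)
```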

Thanks for the heads-up,
Eric


From biopython at maubp.freeserve.co.uk  Tue Jan 19 15:47:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 15:47:39 +0000
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com>
	<3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com>
Message-ID: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>

On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi Eric (and everyone else),
>>
>> I just spotted the to_adjacency_matrix function in utils:
>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>
>> The docstring says:
>>
>>> Create an adjacency matrix (NumPy array) from clades/branches in tree.
>>>
>>> Also returns a list of all clades in tree ("allclades"), where the position
>>> of each clade in the list corresponds to a row and column of the numpy
>>> array. So, a cell i,j in the array represents the length of the branch from
>>> allclades[i] to allclades[j].
>>>
>>> @return: tuple of (allclades, adjacency_matrix) where allclades is a list
>>> and adjacency_matrix is a NumPy 2D array.
>>
>> It looks like your adjacency matrix starts as a numpy array of zeros,
>> and then you sets some edges to branch lengths. How do you tell
>> apart a non-connection and a real connection of length zero? These
>> do occur, for example if you have three identical sequences, then
>> you might expect a single node with three children. However IIRC,
>> in (some) NJ trees each node has two children by construction,
>> so you get an extra node connected with a branch of length zero.
>
> Shoot, you're right. I can think of three reasonable mitigations:
> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
> adjacency -- this seems more standard in textbooks, actually.
> (b) Issue a warning or raise an error if the given tree contains a
> 0-length branch.
> (c) Delete the function.
>
> Which do you recommend?
>
> The idea was to give mathematicians something to play with. For
> example, Chapter 2 of this report represents phylogenies this way,
> using 0 or 1 to indicate the presence of a branch:
> http://www.metaheuristics.net/~mdorigo/HomePageDorigo/thesis/dea/CatanzaroDEA.pdf
>
> Thanks for the heads-up,
> Eric

I did wonder about further options,

(d) Since the distances are floats, we could use NaN as
a flag for no connection. However, this does not seem
very useful.

(e) Collapse nodes separated by a zero length branch
while building the adjacency matrix.

Or, raise an error (b) but provide a tree method to collapse
nodes separated by a zero length branch which could be
called to "clean up" a problematic tree before making the
adjacency matrix.

None of these options seem ideal :(

I would say the boolean matrix (a) is safe but is of limited utility.
Therefore (c), remove the function for now is probably best. It
can always be re-added in a later release if a good solution is
agreed.

Peter

P.S. Another potentially interesting thing would be a matrix using
the bootstrap support values (where again you have a problem
with zero bootstrap support vs no connection). I'm not sure if this
has any practical uses though.


From eric.talevich at gmail.com  Wed Jan 20 04:08:16 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 19 Jan 2010 23:08:16 -0500
Subject: [Biopython-dev] Bio.Phylo to_adjacency_matrix function
In-Reply-To: <320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
References: <320fb6e01001190249y57b598a5tebb463973066ff9d@mail.gmail.com> 
	<3f6baf361001190722n3a6ebaa5v7d4e5170c279bc87@mail.gmail.com> 
	<320fb6e01001190747h39e0647dh594dfe9f2ba74533@mail.gmail.com>
Message-ID: <3f6baf361001192008y244912aaieb7c8d2c0399903e@mail.gmail.com>

On Tue, Jan 19, 2010 at 10:47 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Jan 19, 2010 at 3:22 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On Tue, Jan 19, 2010 at 5:49 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>> Hi Eric (and everyone else),
>>>
>>> I just spotted the to_adjacency_matrix function in utils:
>>> http://github.com/biopython/biopython/blob/master/Bio/Phylo/_utils.py
>>>
>>> It looks like your adjacency matrix starts as a numpy array of zeros,
>>> and then you sets some edges to branch lengths. How do you tell
>>> apart a non-connection and a real connection of length zero?
>>
>> Shoot, you're right. I can think of three reasonable mitigations:
>> (a) Use a boolean or 0-1 matrix instead of branch lengths to indicate
>> adjacency -- this seems more standard in textbooks, actually.
>> (b) Issue a warning or raise an error if the given tree contains a
>> 0-length branch.
>> (c) Delete the function.
>>
>> Which do you recommend?
>> ....
>
> I did wonder about further options,
>
> (d) Since the distances are floats, we can use a NA as
> a flag for no connection. However, this does not seem
> very useful.

Or infinity -- I think that's reasonably common in graph algorithms
that use a matrix representation.
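
A small sketch of the infinity convention, again in plain Python with
nested lists instead of NumPy and not taken from _utils.py. The name
weighted_adjacency and the (parent, child, length) tuples are
hypothetical; cells are filled from parent to child only, following
the docstring's "from allclades[i] to allclades[j]":

```python
INF = float("inf")


def weighted_adjacency(n_nodes, branches, no_edge=INF):
    """Matrix of branch lengths, with no_edge marking absent branches."""
    matrix = [[no_edge] * n_nodes for _ in range(n_nodes)]
    for parent, child, length in branches:
        matrix[parent][child] = length
    return matrix


# Node 0 is the parent of nodes 1 and 2; the first branch has length 0.
matrix = weighted_adjacency(3, [(0, 1, 0.0), (0, 2, 0.3)])
print(matrix[0][1] == 0.0)   # a real branch of length zero
print(matrix[1][2] == INF)   # no branch at all
```

Unlike a matrix initialised with zeros, the two cases above remain
distinguishable.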

Anyway, I commented it out for now. The main problem is that I don't
have a clear use case for the function at the moment, just a notion
that it could be useful for some novel statistical analysis or
possibly rooting an unrooted tree based on a molecular clock. I'll
look at other libraries to see how they use adjacency matrices, if at
all.


> (e) Collapse nodes separated by a zero length branch
> while building the adjacency matrix.
>
> Or, raise an error (b) but provide a tree method to collapse
> nodes separated by a zero length branch which could be
> called to "clean up" a problematic tree before making the
> adjacency matrix.

Should be easy enough for the user to do manually:

for clade in tree.find_clades(branch_length=0):
    tree.collapse(clade)

I'm going to do some serious work on the wiki documentation soon so
this sort of operation should be fairly apparent to users.
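
To make the effect of such a clean-up concrete, here is a sketch on a
toy tree class that is independent of Bio.Phylo. The Node class and
collapse_zero_length function are hypothetical; only internal nodes
reached by a zero-length branch are removed, and terminal nodes are
always kept:

```python
class Node(object):
    """Toy tree node; branch_length is the branch leading to it."""

    def __init__(self, name, branch_length=None, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []


def collapse_zero_length(node):
    """Splice out internal nodes attached by a zero-length branch."""
    new_children = []
    for child in node.children:
        collapse_zero_length(child)  # clean the subtree first
        if child.branch_length == 0 and child.children:
            # Distances are unchanged because the removed branch is 0.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children


# A strictly binary tree with one extra zero-length internal node "X".
tree = Node("root", children=[
    Node("A", 0.1),
    Node("X", 0.0, [Node("B", 0.1), Node("C", 0.1)]),
])
collapse_zero_length(tree)
print([child.name for child in tree.children])
```

This reproduces the case Peter describes: a binary tree with an extra
zero-length branch becomes a single node with three children.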


> P.S. Another potentially interesting thing would be a matrix using
> the bootstrap support values (where again you have a problem
> with zero bootstrap support vs no connection). I'm not sure if this
> has any practical uses though.

Well, the commented-out code is still visible if any brave scientist
is interested in modifying it for this purpose. I'm reading Joe
Felsenstein's book right now, so I'll probably get the urge to add
more mathy toys to Bio.Phylo soon. I'll check with the list before
committing them to the trunk, though. ;)


From p.j.a.cock at googlemail.com  Wed Jan 20 16:16:58 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 16:16:58 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
Message-ID: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>

On Fri, Jan 15, 2010 at 11:08 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Anyhow I think it is better to discuss this when the unit test
>> 'swiss'VS'uniprot' is ready.
>
> +1, good plan.

Something I should have mentioned earlier (I forgot this wasn't
checked in yet) was feature support in the existing "swiss" plain
text parser - hopefully we can get that working nicely as part of
this XML work:

http://bugzilla.open-bio.org/show_bug.cgi?id=2235

Peter


From andrea at biocomp.unibo.it  Wed Jan 20 16:57:47 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Wed, 20 Jan 2010 17:57:47 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
Message-ID: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>

>
> Something I should have mentioned earlier (I forgot this wasn't
> checked in yet) was feature support in the existing "swiss" plain
> text parser - hopefully we can get that working nicely as part of
> this XML work:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> Peter
>

I know that the plain text swissprot parser can parse features, but
last time I checked these features were not included in SeqRecords
generated by Bio.SeqIO.
If the two parsers have to report similar results, then the 'swiss'
format in Bio.SeqIO must report features too.
I made a few changes to the original parser to map data as close as
possible to the plain text parser (available on github).

However, the big issue is going to be the comment field:
- 1 big string in the plain text parser
- several annotation fields in the XML parser.

I think that obtaining the same results is going to be difficult.
It is hard to map the big string to many annotations (very error prone)
and is also hard to map many annotations to a single string...

Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
comparison between the two parsed seqrecords.

Andrea


From p.j.a.cock at googlemail.com  Wed Jan 20 17:14:18 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Jan 2010 17:14:18 +0000
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
Message-ID: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>

On Wed, Jan 20, 2010 at 4:57 PM, Andrea Pierleoni
<andrea at biocomp.unibo.it> wrote:
>>
>> Something I should have mentioned earlier (I forgot this wasn't
>> checked in yet) was feature support in the existing "swiss" plain
>> text parser - hopefully we can get that working nicely as part of
>> this XML work:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>>
>> Peter
>>
>
> I know that the plain text swissprot parser can parse features, but
> last time I checked these features were not included in SeqRecords
> generated by Bio.SeqIO.
> If the two parsers have to report similar results, then the 'swiss'
> format in Bio.SeqIO must report features too.

Yes, there is an old patch on Bug 2235 to do this:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

> I made a few changes to the original parser to map data as close as
> possible to the plain text parser (available on github).
>
> However, the big issue is going to be the comment field:
> - 1 big string in the plain text parser
> - several annotation fields in the XML parser.
>
> I think that obtaining the same results is going to be difficult.
> It is hard to map the big string to many annotations (very error prone)
> and is also hard to map many annotations to a single string...
>
> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
> comparison between the two parsed seqrecords.

Great.

Peter


From andrea at biocomp.unibo.it  Thu Jan 21 12:01:30 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 21 Jan 2010 13:01:30 +0100 (CET)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>
	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>
	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>
	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>
	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>
	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>
	<320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
Message-ID: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>


>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>> detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>


As mentioned earlier, Mauro did a code review and added unit tests for the
parser in Tests/test_Uniprot.py.
The updated version is available in the GitHub repository:
http://github.com/apierleoni/biopython

Since this version is mature enough, I spent some time comparing the output
from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
This comparison was done using the Q13639 UniProt entry.

These are the main differences between the two generated SeqRecords:

- id:  is the same (first accession)
- name: is the same
- description: UP reports the recommended name (the full name value), while
       additional names and synonyms are in the annotations. SP reports a
       long string containing everything, parsed as it is from the plain
       text.
- dbxrefs: UP reports all the dbxref of SP, adding DOI, MEDLINE, PubMed,
       NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
       date_last_annotation_update in UP---> modified in SP
       date_last_sequence_update in UP---> sequence_modified in SP
       gene_name_primary in UP---> gene_name in SP
               >>> SP.annotations['gene_name']
               'Name=HTR4;'
               >>> UP.annotations['gene_name_primary']
               'HTR4'
       ncbi_taxid in SP ---> UP dbxrefs since it is mapped as a
                dbReference in the xmlfile
- - references: has some minor differences.
        Final semicolon and double quote missing in UP for both author
            and title fields.
        In UP reference comments are reported as:
            "PublicationType | PublicationDate | Scope | Tissue"
        For the submission publication type, the db is reported in comments
            and not in the journal field.
- - comments: here come the big differences.
       SP has the comments in a single string.
       UP comments are mapped to several annotation entries, using the
          comment type and attributes to build the annotation key.
          E.g.
          comment_function --> list of "function" type comment strings
          comment_subcellularlocation_location --> list of "location"
               strings in the subcellularlocation comment field

       The comment tree in the XML could easily be mapped to a nested
       comment dictionary, but this would not be BioSQL safe.
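
The flat key scheme described above can be sketched as follows. This is
only an illustration, not the parser code from the github branch: the
sample XML is a simplified, namespace-free fragment made up for the
example (real UniProt XML uses a namespace and many more elements), and
flatten_comments() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, made-up fragment; real UniProt XML is namespaced and richer.
SAMPLE = """<entry>
  <comment type="function"><text>Receptor for serotonin.</text></comment>
  <comment type="subcellular location">
    <subcellularLocation><location>Cell membrane</location></subcellularLocation>
    <subcellularLocation><location>Endosome</location></subcellularLocation>
  </comment>
</entry>"""


def flatten_comments(entry):
    """Map nested <comment> elements to flat 'comment_*' annotation keys."""
    annotations = defaultdict(list)
    for comment in entry.findall("comment"):
        # "subcellular location" -> "subcellularlocation"
        ctype = comment.get("type", "").replace(" ", "").lower()
        text = comment.find("text")
        if text is not None and text.text:
            annotations["comment_" + ctype].append(text.text.strip())
        for loc in comment.iter("location"):
            if loc.text:
                key = "comment_%s_location" % ctype
                annotations[key].append(loc.text.strip())
    return dict(annotations)


annotations = flatten_comments(ET.fromstring(SAMPLE))
for key in sorted(annotations):
    print(key, annotations[key])
```

Each value is a plain list of strings under a single string key, which
is what keeps this layout storable as simple key/value annotations,
unlike a nested dictionary tree.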


Andrea


From biopython at maubp.freeserve.co.uk  Thu Jan 21 12:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as XML
	in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.

Regards,

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 13:13:09 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 08:13:09 -0500
Subject: [Biopython-dev] [Bug 2997] New: Ignore comments in SCOP parsable
	files
Message-ID: <bug-2997-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997

           Summary: Ignore comments in SCOP parsable files
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: 2008 at thomas-holder.de


I could not load SCOP parsable files with Bio.SCOP unless I removed the comment
lines. The parser should just skip these lines.
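
A minimal sketch of the requested behaviour, assuming that comment
lines start with '#' and that records are tab-separated. The function
name iter_scop_records() and the sample lines are made up for
illustration; this is not the attached patch or the Bio.SCOP code.

```python
import io


def iter_scop_records(handle):
    """Yield tab-split fields per record, skipping '#' comment lines."""
    for line in handle:
        if line.startswith("#"):
            continue  # release / copyright header lines
        line = line.rstrip("\r\n")
        if line:
            yield line.split("\t")


# Made-up sample in the general style of a dir.des file.
sample = io.StringIO(
    "# dir.des.scop.txt\n"
    "# SCOP release (illustrative header)\n"
    "46456\tcl\ta\t-\tAll alpha proteins\n"
    "48724\tcl\tb\t-\tAll beta proteins\n"
)
for fields in iter_scop_records(sample):
    print(fields)
```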


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 13:14:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 08:14:59 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001211314.o0LDExim005529@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #1 from 2008 at thomas-holder.de  2010-01-21 08:14 EST -------
Created an attachment (id=1432)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1432&action=view)
patch to skip comment lines in SCOP parsable files




From mauro at biodec.com  Thu Jan 21 20:09:28 2010
From: mauro at biodec.com (Mauro)
Date: Thu, 21 Jan 2010 21:09:28 +0100
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To: <43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
References: <edbae272b396c786977edc3ad03fa114.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141116j206ae70cp895093f4b1ea77fc@mail.gmail.com>	<4ee07ada7c0060df57a527714e058cd5.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141348gb18a740hb4f6297d35c6638e@mail.gmail.com>	<c05ae32728b00c5a5e7cef583ba60753.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001141504h789a960fy6469cdc95081d7e8@mail.gmail.com>	<bfaf64694eebedfb0a759eec9e061eb6.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001150308o79dbc814j97a6c12d2eaf3aae@mail.gmail.com>	<320fb6e01001200816l1a82c45bu4445ebf724f6e3ad@mail.gmail.com>	<01ab20eb79b15658138f7625ea9b3eab.squirrel@lipid.biocomp.unibo.it>	<320fb6e01001200914o77ed010bj4377a8bf59d7f9ab@mail.gmail.com>
	<43de0312150f72ffeaae084a2fccc4a9.squirrel@lipid.biocomp.unibo.it>
Message-ID: <4B58B478.4000703@biodec.com>

On 01/21/2010 01:01 PM, Andrea Pierleoni wrote:
>
>>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>>> detailed
>>> comparison between the two parsed seqrecords.
>>
>> Great.
>>
>> Peter
>>
>
>
> As mentioned earlier, Mauro did a code review and added unit tests for the
> parser in Tests/test_Uniprot.py.
> The updated version is available in the GitHub repository:
> http://github.com/apierleoni/biopython
>
> Since this version is mature enough, I spent some time comparing the output
> from this UniProt XML (UP) parser and the SwissProt (SP) plain text parser.
> This comparison was done using the Q13639 UniProt entry.

I also made a test for this case. Currently the test fails; you can see
Andrea's report below. If we agree on the differences between
the SeqRecords, I will do the work to change the test.

Mauro.

>
> These are the main differences between the two generated SeqRecords:
>
> - id:  is the same (first accession)
> - name: is the same
> - description: UP reports the recommended name (the full name value), while
>         additional names and synonyms are in the annotations. SP reports a
>         long string containing everything, parsed as it is from the plain
>         text.
> - dbxrefs: UP reports all the dbxref of SP, adding DOI, MEDLINE, PubMed,
>         NCBI Taxonomy and Swiss-Prot/Trembl dbxrefs
> - seq: is the same
> - features: missing in SP (I have to check with Peter's patch)
> - annotations:
> - - identical annotations: accessions, keywords, taxonomy, organism
> - - mapped annotations:
>         date_last_annotation_update in UP--->  modified in SP
>         date_last_sequence_update in UP--->  sequence_modified in SP
>         gene_name_primary in UP--->  gene_name in SP
>                 >>>  SP.annotations['gene_name']
>                 'Name=HTR4;'
>                 >>>  UP.annotations['gene_name_primary']
>                 'HTR4'
>         ncbi_taxid in SP --->  UP dbxrefs since it is mapped as a
>                  dbReference in the xmlfile
> - - references: has some minor differences.
>          Final semicolon and double quote missing in UP for both author
>              and title fields.
>          In UP reference comments are reported as:
>              "PublicationType | PublicationDate | Scope | Tissue"
>          For the submission publication type, the db is reported in comments
>              and not in the journal field.
> - - comments: here come the big differences.
>         SP has the comments in a single string.
>         UP comments are mapped to several annotation entries, using the
>            comment type and attributes to build the annotation key.
>            E.g.
>            comment_function -->  list of "function" type comment strings
>            comment_subcellularlocation_location -->  list of "location"
>                 strings in the subcellularlocation comment field
>
>         The comment tree in the XML could easily be mapped to a nested
>         comment dictionary, but this would not be BioSQL safe.
>
>
> Andrea
>


From bugzilla-daemon at portal.open-bio.org  Thu Jan 21 23:58:29 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 21 Jan 2010 18:58:29 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001212358.o0LNwTIB022421@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp  2010-01-21 18:58 EST -------
Can you give an example of a SCOP file that contains such comment lines?




From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 08:42:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 03:42:28 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001220842.o0M8gSDv003709@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


------- Comment #3 from 2008 at thomas-holder.de  2010-01-22 03:42 EST -------
(In reply to comment #2)
> Can you give an example of a SCOP file that contains such comment lines?

I want to parse these files:
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.hie.scop.txt_1.75
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.com.scop.txt_1.75

They all start with 4 comment lines (release and copyright information).




From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 11:08:34 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 06:08:34 -0500
Subject: [Biopython-dev] [Bug 2997] Ignore comments in SCOP parsable files
In-Reply-To: <bug-2997-42@http.bugzilla.open-bio.org/>
Message-ID: <201001221108.o0MB8YkZ008581@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2997


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp  2010-01-22 06:08 EST -------
Applied your patch; thanks.




From andrea at biocomp.unibo.it  Fri Jan 22 12:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Biopython-dev] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think the point here can be a little broader, since the SwissProt DE
lines are not the only ones that carry complex, structured data.
Defining a common, language-independent way to store structured data in
the comment and *_qualifier_value tables of the current BioSQL schema could
be very useful.
XML looks like a good candidate to me, and the UniProt XML format can be
used as a reference or as a template to start from.
Each Bio* project would then parse and report this structured data in its
own programming language's data structures.
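
To make the idea concrete, here is a rough sketch of a round trip
between a nested Python structure and an XML string that could be
stored in a single text column. The helper names to_xml() and
from_xml(), the tag names and the sample values are all assumptions
made for this example; nothing here is an agreed BioSQL convention.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Turn a nested dict (lists allowed as values) into an Element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, sub in value.items():
            items = sub if isinstance(sub, list) else [sub]
            for item in items:
                elem.append(to_xml(key, item))
    else:
        elem.text = str(value)
    return elem


def from_xml(elem):
    """Inverse of to_xml(); every child tag maps to a list of values."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        result.setdefault(child.tag, []).append(from_xml(child))
    return result


# Illustrative values only.
description = {
    "recommendedName": {
        "fullName": "5-hydroxytryptamine receptor 4",
        "shortName": ["5-HT-4", "5-HT4"],
    }
}
xml_text = ET.tostring(to_xml("description", description), encoding="unicode")
print(xml_text)
print(from_xml(ET.fromstring(xml_text)))
```

Note that the round trip is not exact: on the way back every value
comes out wrapped in a list, because XML alone cannot tell a single
value from a list of one. Each Bio* project would have to agree on such
details for the stored XML to be interchangeable.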

Andrea


> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
>


From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 18:43:19 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 13:43:19 -0500
Subject: [Biopython-dev] [Bug 2998] New: mac error during build in 10.6.1
Message-ID: <bug-2998-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998

           Summary: mac error during build in 10.6.1
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Mac OS
            Status: NEW
          Severity: major
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: emeryl at uw.edu


When I download the file biopython-1.53.tar.gz, uncompress it, and run 
python setup.py build
I get an error saying gcc4.0 failed with exit code 1, among many lines of
errors.  Looking more closely, it appears the build process is trying to use an
older version of the SDK, which is not installed by Xcode tools by default.  It
is trying to use /Developer/SDKs/MacOSX10.4u.sdk.  On a clean install of 10.6.1
(Snow Leopard) only the SDKs for 10.5 and 10.6 are installed by the Xcode tools
installer without changing options.  When I reinstall the Xcode tools and this
time check a box to install 10.4 support, this 10.4 sdk is installed and the
build works flawlessly.  This would be a difficult fix to track down for many
casual users of BioPython who do not understand the Xcode tools.




From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 19:15:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 14:15:59 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001221915.o0MJFxoa024953@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |normal
            Summary|mac error during build in   |Document need XCode with
                   |10.6.1                      |10.4 SDK for Mac OS


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-22 14:15 EST -------
Snow Leopard has caused all sorts of trouble for compiling Python extensions
(this is not specific to Biopython).

This has been discussed on our mailing list, and simply installing the
Mac OS 10.4 SDK option with XCode seems to be the best solution. I've just
updated the download page to try and clarify this. Is that better? This
is a wiki page so you can edit it:
http://biopython.org/wiki/Download

I'm leaving this bug open to remind us to add a similar note to the main
installation document:
http://github.com/biopython/biopython/blob/master/Doc/install/Installation.tex

Do you have any other suggestions? Thanks.




From bugzilla-daemon at portal.open-bio.org  Fri Jan 22 20:36:36 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 22 Jan 2010 15:36:36 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001222036.o0MKaaZ4027368@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


------- Comment #2 from emeryl at uw.edu  2010-01-22 15:36 EST -------
(In reply to comment #1)

That's a good solution, but I added this small clarification also :

You will need to have installed Apple's XCode tools including the optional 10.4
SDK  (check the option for 10.4 support when installing Xcode tools).




From bugzilla-daemon at portal.open-bio.org  Mon Jan 25 10:56:32 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 05:56:32 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201001251056.o0PAuWDI010933@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-25 05:56 EST -------
(In reply to comment #2)
> (In reply to comment #1)
> 
> That's a good solution, but I added this small clarification also :
> 
> You will need to have installed Apple's XCode tools including the optional 10.4
> SDK  (check the option for 10.4 support when installing Xcode tools).
>

Thanks - I've now updated the main installation document in our repository
(which we'll use to update the install PDF and HTML at the next release).

Marking bug as fixed.

Peter




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:16:27 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:16:27 -0500
Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing,
	in particular location parsing
In-Reply-To: <bug-2738-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260116.o0Q1GR1c002063@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2738


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:17:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:17:41 -0500
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not
	record molecule type or if circular
In-Reply-To: <bug-2578-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260117.o0Q1Hfdb002091@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:19:47 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:19:47 -0500
Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects
In-Reply-To: <bug-2597-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260119.o0Q1JlhK002189@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2597


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:27:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:27:14 -0500
Subject: [Biopython-dev] [Bug 2999] New: SeqIO.parse() or
	record.format("genbank") converts input sequence to uppercase or
Message-ID: <bug-2999-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2999

           Summary: SeqIO.parse() or record.format("genbank") converts input
                    sequence to uppercase or
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz


I do not know where the problem comes from, but if I parse a GenBank file
with a lowercase sequence (an EST) and print it back through
record.format("genbank"), I get it all in uppercase. I think the case of the
sequence should never be altered unless the user explicitly requests it.

from Bio import SeqIO

# _infile, _outfile, _ids and options are set up elsewhere in the script.
for _record in SeqIO.parse(_infile, options.format):
    # silly, imagine I hit "gi|14150838|gb|AAK54648.1|AF376133_1" from
    #   a FASTA file :(
    if _record.id in _ids:
        _outfile.write(_record.format("fasta"))
    elif options.format == "genbank":
        if _record.annotations['gi'] in _ids:
            _outfile.write(_record.format("genbank"))




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:44:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:44:28 -0500
Subject: [Biopython-dev] [Bug 3000] New: Could SeqIO.parse() store the whole,
	unparsed multiline entry?
Message-ID: <bug-3000-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000

           Summary: Could SeqIO.parse() store the whole, unparsed multiline
                    entry?
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz


Given that GenBank file-format writing is not yet complete, I wonder whether
you would optionally allow keeping, alongside each parsed record, its
unparsed multi-line representation. For example, I use Biopython to filter
out certain records from a FASTA/GenBank file by accession, gi, or tissue
(well, I haven't done the last one yet ;)). I do not change the format, I
just ignore certain entries.

I did not understand the Tutorial ("5.4.3  Getting your SeqRecord objects as
formatted strings") well, but I iterate over the records and, once I have the
record I want, to be on the safe side I would like to call something like
record._print_original_blob() and get e.g.

LOCUS ....
...
//

I do not have the record_iterator so cannot use the proposed
out_handle.write(record.format("genbank")) approach. Still, I suspect this will
reformat the entry (currently I see trailing dot removed from KEYWORDS, no
REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
re-ordered).

I foresee this depending on an optional argument to SeqIO.parse() with which
the user asks to keep the raw text in memory, understanding that this is
probably not very useful for large chromosomes, etc.

Similarly, until parsing/writing of e.g. TITLE is fully available, why
couldn't you just store the whole multi-line thing in some variable?




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 01:47:27 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Jan 2010 20:47:27 -0500
Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal
In-Reply-To: <bug-2601-42@http.bugzilla.open-bio.org/>
Message-ID: <201001260147.o0Q1lRVk002782@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2601


mmokrejs at ribosome.natur.cuni.cz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmokrejs at ribosome.natur.cuni
                   |                            |.cz




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 13:03:42 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 Jan 2010 08:03:42 -0500
Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or
	record.format("genbank") converts input sequence to uppercase or
In-Reply-To: <bug-2999-42@http.bugzilla.open-bio.org/>
Message-ID: <201001261303.o0QD3gN8019546@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2999


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-26 08:03 EST -------
In many file formats (e.g. FASTA) mixed case is allowed and useful.

The sequence in a GenBank file is (by convention) always lower case,
but for historical reasons Biopython converts this to upper case on
parsing (not sure why, but changing it would risk breaking existing
scripts).

However, I think we should convert to lower case on writing GenBank
output.
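
For illustration, lower-casing on output could look like the following
sketch. It is plain Python rather than the Bio.SeqIO writer itself; the
function name format_origin is made up for this example, and the layout
(position right-justified in nine columns, then up to six blocks of ten
letters per line) follows the usual GenBank ORIGIN convention.

```python
def format_origin(sequence, lowercase=True):
    """Format a sequence as a GenBank-style ORIGIN block (illustrative only)."""
    if lowercase:
        # GenBank sequence data is lower case by convention.
        sequence = sequence.lower()
    lines = ["ORIGIN"]
    for start in range(0, len(sequence), 60):
        chunk = sequence[start:start + 60]
        # Split each 60-letter line into blocks of ten letters.
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append("%9d %s" % (start + 1, " ".join(blocks)))
    lines.append("//")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # An upper-case sequence, as held in memory after parsing.
    print(format_origin("ACGTACGTACGTACGT"), end="")
```

Passing lowercase=False keeps whatever case the sequence has in memory,
which is the behaviour the original report asks for.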

Peter




From bugzilla-daemon at portal.open-bio.org  Tue Jan 26 13:15:38 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 26 Jan 2010 08:15:38 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201001261315.o0QDFc4f020030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-26 08:15 EST -------
(In reply to comment #0)
> Taking into account the genbank file-format writing is not yet complete I
> wonder whether you would allow to keep optionally along each parsed record
> its unparsed multi-line representation.

You can probably do it already with the old Bio.GenBank iterator object
(I think you use no parser object to get the raw text).

Adding this to Bio.SeqIO doesn't seem a wonderful idea. The whole approach
only makes sense for sequential file formats with no header (like FASTA,
GenBank, EMBL, SwissProt) but not interlaced files (most alignments) or
those with headers or XML formats. It also breaks completely the moment
the user makes any modification to the SeqRecord object - and handling
that cleanly would be tricky.
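
To make the sequential-format case concrete, here is a minimal sketch of
keeping the raw text by hand for GenBank, outside Bio.SeqIO. It assumes each
record runs from a LOCUS line to a terminating // line; iter_raw_genbank and
locus_name are names invented for this example, not Biopython functions.

```python
import io


def iter_raw_genbank(handle):
    """Yield each GenBank record as unparsed text, LOCUS line through '//'."""
    lines = []
    for line in handle:
        if not lines and not line.startswith("LOCUS"):
            continue  # skip blank lines or other text between records
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # A trailing record without a closing '//' is deliberately dropped.


def locus_name(raw_record):
    """Return the second whitespace-separated token of the LOCUS line."""
    return raw_record.split(None, 2)[1]


if __name__ == "__main__":
    # Made-up two-record file, just to exercise the functions.
    example = io.StringIO(
        "LOCUS       AAA00001 8 bp\nORIGIN\n        1 acgtacgt\n//\n"
        "LOCUS       BBB00002 4 bp\nORIGIN\n        1 tttt\n//\n"
    )
    wanted = {"BBB00002"}
    for raw in iter_raw_genbank(example):
        if locus_name(raw) in wanted:
            print(raw, end="")  # written back exactly as it was read
```

Because the text is never parsed, the filtered output is byte-for-byte what
was in the input, at the cost of not having a SeqRecord to query.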

> Still, I suspect this will
> reformat the entry (currently I see trailing dot removed from KEYWORDS, no
> REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
> re-ordered).

Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly)
different output. We do not guarantee a 100% round trip (even on simpler
formats like FASTA). Even little things like line wrapping would make this
very difficult.
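
The line-wrapping point can be seen with a toy FASTA reader and writer. This
is a sketch in plain Python, not the Bio.SeqIO code; read_fasta, write_fasta
and the 60-column default are assumptions made for the example.

```python
def read_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Write (title, sequence) pairs, wrapping sequence lines at `width`."""
    out = []
    for title, seq in records:
        out.append(">" + title)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Input wrapped at 70 columns; the writer wraps at 60.
    original = ">seq1\n" + "A" * 70 + "\n" + "A" * 30 + "\n"
    rewritten = write_fasta(read_fasta(original))
    print(rewritten == original)                          # text differs
    print(read_fasta(rewritten) == read_fasta(original))  # data is the same
```

The parsed records are identical, but the text is not, because the original
line width is not part of what the parser keeps.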

Regarding GenBank KEYWORDS, please file a bug.

Regarding GenBank reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED)
this is still covered by existing Bug 2294.

Regarding GenBank source feature, please file a bug.

> Similarly, I think until parsing/writing e.g. TITLE is fully available why
> couldn't you just store the whole multi-line thing in some variable?

The remaining unsupported bits of the ID line are covered by existing
Bug 2294 and Bug 2578.

Regarding the reference lines (REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED)
this is still covered by existing Bug 2294.

Peter




From jblanca at btc.upv.es  Tue Jan 26 14:02:59 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 26 Jan 2010 15:02:59 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
Message-ID: <201001261502.59237.jblanca@btc.upv.es>

Hi:

I'm building a pipeline to annotate sequences. I'm writing modules that add
SeqFeatures and annotations to the sequences.
Right now I'm storing the results as the repr of the SeqRecords, but I would like
to write GFF files at the end. I've read the discussion regarding Brad's code
and I've found it very interesting.
I need to write those GFF files, so I could use Brad's code or my own, but it
would be great if I could contribute to Biopython at the same time.
At present I don't think there is a consensus about what a SeqFeature should
represent and how. I think Peter made a proposal about adding parent and
children properties; is this a good way to solve the problem?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Tue Jan 26 14:59:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 26 Jan 2010 14:59:35 +0000
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <201001261502.59237.jblanca@btc.upv.es>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
	<201001261502.59237.jblanca@btc.upv.es>
Message-ID: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>

Hi Jose,

On Tue, Jan 26, 2010 at 2:02 PM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
>
> I'm doing a pipeline to annotate sequences. I'm writing modules that add
> SeqFeatures and annotations to the sequences.

I've done a little of that too - but with GenBank files as the output.

> Right now I'm storing the result as repr for the SeqRecords, but I would like
> to write gff files at the end. I've read the discussion regarding Brad's code
> and I've found it very interesting.
> I need to write those gff files so could use Brad's code or my own, but it
> would be great if I could contribute to Biopython at the same time.
> At present I don't think there is a consensus about what a SeqFeature should
> represent and how. I think Peter made a proposal about adding a parent and
> children properties, is this a good way to solve the problem?
> Best regards,

Brad's code is using the SeqFeature differently to existing bits of
Biopython, and adding a separate child/parent mechanism for the
kind of usage required for GFF(3) looks like one way forward allowing
us to keep full backward compatibility. I'm actually going to see Brad
in person next month at a workshop, and I'm hoping we can squeeze
in a little in person debate on this then (assuming we don't settle it
here on the mailing list first of course).
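
To make the idea concrete, a parent/child link between features might be
sketched like this. The Feature class below is hypothetical: it is neither
Biopython's SeqFeature nor the API of Brad's GFF code, and only shows how
such links map onto the GFF3 ID and Parent attributes.

```python
class Feature:
    """Hypothetical feature with explicit parent/children links."""

    def __init__(self, feature_id, feature_type, start, end, strand="+"):
        self.id = feature_id
        self.type = feature_type
        self.start = start  # 1-based, inclusive, as in GFF3
        self.end = end
        self.strand = strand
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)


def gff3_lines(feature, seqid, source="example"):
    """Yield one GFF3 line for the feature, then lines for its descendants."""
    attributes = "ID=%s" % feature.id
    if feature.parent is not None:
        attributes += ";Parent=%s" % feature.parent.id
    # Nine tab-separated columns; score and phase are left as "." here.
    yield "\t".join([seqid, source, feature.type, str(feature.start),
                     str(feature.end), ".", feature.strand, ".", attributes])
    for child in feature.children:
        for line in gff3_lines(child, seqid, source):
            yield line


if __name__ == "__main__":
    gene = Feature("gene1", "gene", 100, 900)
    mrna = Feature("mrna1", "mRNA", 100, 900)
    exon = Feature("exon1", "exon", 100, 400)
    gene.add_child(mrna)
    mrna.add_child(exon)
    print("##gff-version 3")
    for line in gff3_lines(gene, "chr1"):
        print(line)
```

Keeping the links in separate parent/children attributes leaves any existing
use of nested sub-features untouched, which is the backward-compatibility
point above.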

Regards,

Peter


From dalloliogm at gmail.com  Tue Jan 26 15:09:39 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 26 Jan 2010 16:09:39 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100118142010.GE48842@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu>
Message-ID: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>

On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Hi Nick;
>

Sorry for the late reply... I also use StackOverflow and I think that it is
a great resource, and it would be very good if we could become better
represented there.
At the moment there are a few questions on biopython on SO, but there are so
few biopython users that people usually receive few answers and prefer
to ask their questions again on this list.
I have answered some questions tagged as 'bioinformatics' there, but lately
I have not been using SO very much, and moreover the field of bioinformatics
is so broad that sometimes it is very difficult to answer a technical
question.


> > Here is a link to various other Stack Exchange sites:
> >
> http://tumblr.marcosdecarvalho.com/post/252388387/the-stackexchange-family
>
>
Very interesting, thanks! I didn't know you could make Stack-Exchange
websites so easily. How did you do that? Is there free software behind it, or
do you have to pay a service provider?


> It looks like there are a couple of Stack Exchange sites with
> similar aims for open source bioinformatics and chemistry:
>
> http://biostar.stackexchange.com/
> http://blueobelisk.stackexchange.com/
>

I agree, maybe it would be useful to collaborate with these websites.
StackOverflow is great for programming-related questions; however, you can't
use it to ask something which is not completely related, like the protocol
for an experiment or which databases to use for an analysis.


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From dalloliogm at gmail.com  Wed Jan 27 08:56:09 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 27 Jan 2010 09:56:09 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu> 
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
Message-ID: <5aa3b3571001270056l5ae5bd76g1a70890c94fd430b@mail.gmail.com>

On Tue, Jan 26, 2010 at 4:09 PM, Giovanni Marco Dall'Olio <
dalloliogm at gmail.com> wrote:

>
>
>
> On Mon, Jan 18, 2010 at 3:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Hi Nick;
>>
>
> Sorry for the late reply... I also use StackOverflow and I think that it is
> a great resource, and it would be very good if we can become more represented
> there.
>


By the way, it is possible to get feeds for questions on StackOverflow.
For example, this is the feed for the questions tagged 'biopython':
- http://stackoverflow.com/feeds/tag/biopython
We could add this RSS feed to Biopython's FriendFeed or Twitter page (I
barely know what I am talking about here), or to the blog/wiki/etc.
Maybe there is also a way to notify this mailing list of the questions asked
there.
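
As a rough sketch of how such a feed could be turned into notifications:
assuming the feed is in Atom format, the standard library is enough to list
each question's title and link. The sample XML below is made up for the
example, list_entries is a name invented here, and fetching the feed and
mailing the list are left out.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds.
ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample standing in for the downloaded feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>How do I parse a FASTA file?</title>
    <link rel="alternate" href="http://example.org/questions/1"/>
  </entry>
</feed>"""


def list_entries(atom_xml):
    """Return (title, link) pairs for every entry in an Atom document."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        entries.append((title, href))
    return entries


if __name__ == "__main__":
    for title, href in list_entries(SAMPLE):
        print("%s <%s>" % (title, href))
```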


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From chapmanb at 50mail.com  Wed Jan 27 13:33:22 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 27 Jan 2010 08:33:22 -0500
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu>
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu>
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com>
Message-ID: <20100127133322.GV83316@sobchak.mgh.harvard.edu>

Giovanni;
Thanks for the feedback on this. We've had a few positive responses
and I think it's something that would be low effort to experiment with.
I'm open to whether we do this on the main StackOverflow site,
Nick's suggested dedicated site, or Blue Obelisk. The main criterion
is that the website is likely to remain freely available (and
around) in the future.

> Sorry for the late reply... I also use StackOverflow and I think that it is
> a great resource, and it would be very good if we can become more represented
> there.
> At the moment there are a few questions on biopython on SO, but there are so
> few biopython users that people usually receive few answers and they prefer
> to ask their questions again in this list.

Yes, that's what we'd be hoping to change. The main thing is that we
get folks interested in python bioinformatics programming looking
there, and then suggest users ask questions there. The significant
benefit is that the presentation of questions and answers gives you 
a historical resource that is easy to search and browse.

> By the way, it is possible to get feeds for questions on StackOverflow.
> For example, this is the feed for the questions tagged 'biopython':
> - http://stackoverflow.com/feeds/tag/biopython
> We could add this RSS feed to Biopython's FriendFeed or Twitter page (I
> barely know what I am talking about here), or to the blog/wiki/etc.
> Maybe there is also a way to notify this mailing list of the questions asked
> there.

There are resources we could use to redirect the feed to Twitter:

http://twitterfeed.com/

and the mailing list:

http://www.feedmyinbox.com/

Agreed that we should do this to increase visibility.

Brad


From chapmanb at 50mail.com  Wed Jan 27 13:41:25 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 27 Jan 2010 08:41:25 -0500
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>
References: <20091202125744.GA46415@sobchak.mgh.harvard.edu>
	<320fb6e00912081430q6db93d55l6de4a02baefd6c12@mail.gmail.com>
	<320fb6e00912081538o635347ceh8e10aa4863e538e9@mail.gmail.com>
	<201001261502.59237.jblanca@btc.upv.es>
	<320fb6e01001260659ra48dc71yd0f840d181556f9d@mail.gmail.com>
Message-ID: <20100127134125.GW83316@sobchak.mgh.harvard.edu>

Jose and Peter;

> > Right now I'm storing the result as repr for the SeqRecords, but I would like
> > to write gff files at the end. I've read the discussion regarding Brad's code
> > and I've found it very interesting.
> > I need to write those gff files so could use Brad's code or my own, but it
> > would be great if I could contribute to Biopython at the same time.

Awesome. Please do use my code for output and feel free to fork and
make suggestions; I'm happy to integrate changes:

http://github.com/chapmanb/bcbb/tree/master/gff

> > At present I don't think there is a consensus about what a SeqFeature should
> > represent and how. I think Peter made a proposal about adding a parent and
> > children properties, is this a good way to solve the problem?
> > Best regards,
> 
> Brad's code is using the SeqFeature differently to existing bits of
> Biopython, and adding a separate child/parent mechanism for the
> kind of usage required for GFF(3) looks like one way forward allowing
> us to keep full backward compatibility. I'm actually going to see Brad
> in person next month at a workshop, and I'm hoping we can squeeze
> in a little in person debate on this then (assuming we don't settle it
> here on the mailing list first of course).

What do you think we need to modify in the GFF parsing code to bring
this in line? I'd really like to see this get into Biopython, but am
not sure how to clear the blocking issues. If we can put together a
list of specifics, I can try and put together time to tackle that.

Brad


From dalloliogm at gmail.com  Wed Jan 27 13:41:24 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 27 Jan 2010 14:41:24 +0100
Subject: [Biopython-dev] Encouraging use of Stack Overflow for questions
In-Reply-To: <20100127133322.GV83316@sobchak.mgh.harvard.edu>
References: <20100118130115.GA48842@sobchak.mgh.harvard.edu> 
	<4B546062.3090802@bham.ac.uk>
	<20100118142010.GE48842@sobchak.mgh.harvard.edu> 
	<5aa3b3571001260709w5178b4cej738714804c8ccd8d@mail.gmail.com> 
	<20100127133322.GV83316@sobchak.mgh.harvard.edu>
Message-ID: <5aa3b3571001270541n2f047fe2qf42911b21e9494d8@mail.gmail.com>

On Wed, Jan 27, 2010 at 2:33 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Giovanni;
> Thanks for the feedback on this. We've had a few positive responses
> and I think it's something that would be low effort to experiment with.
> I'm open to whether we do this on the main StackOverflow site,
> Nick's suggested dedicated site, or Blue Obelisk. The main criterion
> is that the website is likely to remain freely available (and
> around) in the future.
>

Thank you for the proposal.


> There are resources we could use to redirect the feed to Twitter:
>
> http://twitterfeed.com/
>
> and the mailing list:
>
> http://www.feedmyinbox.com/
>

So, what if we use this to automatically send a notification to the
biopython mailing list?
The increase in traffic would be low: in the last three months there
have been only 3 questions on biopython on StackOverflow.
With automatic notification, these questions might receive an answer a
lot more quickly.
If the traffic from StackOverflow grows too much, we can just deactivate the
forwarding so it won't disturb the mailing list.


> Agreed that we should do this to increase visibility.
>
> Brad
>


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From chapmanb at 50mail.com  Thu Jan 28 20:35:05 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 28 Jan 2010 15:35:05 -0500
Subject: [Biopython-dev] OpenBio solution challenge: Project updates at BOSC
	2010
Message-ID: <20100128203505.GG40046@sobchak.mgh.harvard.edu>

Hello all;
The BOSC 2010 organizing committee is hard at work getting prepared for this
July's meeting in Boston:

http://www.open-bio.org/wiki/BOSC_2010

One of the items we've traditionally had at the conference is a project 
update from each of the OpenBio affiliated groups. This year, we're thinking
about organizing these talks around a central theme: the OpenBio solution
challenge. We start with a biological question of general interest, and each
of the project talks would focus around how you would solve that problem 
using your toolkit and programming language.

This is meant to provide a challenge for OpenBio contributors, a nice tutorial
style overview of various projects and approaches for other programmers, and a
fun opportunity to compete and learn from other projects. Conference attendees
will vote on their favorite solution, with the winner receiving fame and
fortune (warning: fortune not guaranteed).

For this to be successful, it of course requires interest and enthusiasm from
y'all fine folks involved with the projects. Specifically:

- Is there interest from your group in participating in the challenge? You'll
  want at least a few people to work on it, and someone to give a presentation 
  at BOSC.

- Do you have suggestions on a good theme or specific biological problem to
  tackle? We'll hope to pick something in a sweet spot that is challenging 
  enough to be of interest, yet reasonable for presentation and preparation.

Let's discuss ideas and get this together. Since the schedule for BOSC is
developing rapidly, please give us an idea if you're interested by
February 12th, and copy responses to the BOSC mailing list as a central 
place for discussion.

bosc at open-bio.org

Thanks,
Brad, Michael, and the BOSC organizing committee


From biopython at maubp.freeserve.co.uk  Fri Jan 29 10:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Biopython-dev] [Bioperl-l] [MOBY-dev] OpenBio solution
	challenge: Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results.  This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on...  Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious, for example? There are already several major projects
to do this e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)


From bugzilla-daemon at portal.open-bio.org  Sun Jan 31 20:30:45 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 15:30:45 -0500
Subject: [Biopython-dev] [Bug 3004] New: Contribute PSL alignment format to
	biopython
Message-ID: <bug-3004-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004

           Summary: Contribute PSL alignment format to biopython
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: forgetta at gmail.com


Hi Bio-pythonistas,

I am interested in contributing code to biopython. I have developed a class to
represent PSL output from the BLAT alignment program. I would like to
contribute it to the AlignIO module. I have read through and agree to the
guidelines stipulated on http://biopython.org/wiki/Contributing. I have never
written unit tests before, but I am willing to learn.

Thanks.

Vince




From bugzilla-daemon at portal.open-bio.org  Sun Jan 31 22:24:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 17:24:53 -0500
Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in
	Bio.AlignIO
In-Reply-To: <bug-3004-42@http.bugzilla.open-bio.org/>
Message-ID: <201001312224.o0VMOrha006787@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Contribute PSL alignment    |PSL alignment format parsing
                   |format to biopython         |in Bio.AlignIO


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2010-01-31 17:24 EST -------
Hi Vince,

This sounds interesting - I've been using BLAT's plain text BLAST output
format with Biopython up until now.

Have you ever used github? That would be one way to share your code. Or,
just attach diff files, Python files, and example BLAT files to this bug.

If you haven't already done so, signing up to our development mailing
list would be a good idea.

Thanks,

Peter

