[Biopython] SciPy paper: documenting statistical data structure design issues

Tue May 25 01:03:19 EDT 2010

On Mon, May 24, 2010 at 7:17 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> Hi Vincent,
>
> Thanks for letting us know. Statistics is central to many problems in
> computational biology, so this is important for us. What is the preferred
> way to contribute to this discussion? Should we join a mailing list or can
> we write something on a wiki?
>

All you need to contribute on the wiki is a SciPy account. Again, the
link is http://scipy.org/StatisticalDataStructures

Discussions on the pystatsmodels mailing list are, I am sure, relevant, but
it might be more beneficial to discuss this first on the biopython list, as
the discussions there sometimes get long and tend to be about economic-type
data. The Google group/mailing list is
http://groups.google.ca/group/pystatsmodels

I think a few good examples of a "typical" Biopython data set, and/or some of
the typical difficulties, would be good to have on the wiki. This might help
start collaboration between statsmodels and biopython on this subject. I
think there are few people who cross over between economics and
bioinformatics.

Also, if you know of other groups that would be interested, please share this
link/information.

> Thanks,
> --Michiel.
>
> --- On Mon, 5/24/10, Vincent Davis <vincent at vincentdavis.net> wrote:
>
> > From: Vincent Davis <vincent at vincentdavis.net>
> > Subject: [Biopython] SciPy paper: documenting statistical data structure design issues
> > To: "biopython" <biopython at lists.open-bio.org>
> > Date: Monday, May 24, 2010, 3:45 PM
> > "see the message below, cross-posted from pystatsmodels"
> >
> > We have been having some discussion on the pystatsmodels mailing list about
> > data objects, numpy arrays... I think it would be valuable for some
> > biopython users to contribute some comments, examples or ideas to the scipy
> > wiki that has been set up for this. I think at the heart of this is that
> > although almost anything can be done with a numpy array, we run into many
> > problems that are difficult to solve with the current tools for numpy
> > arrays. Because of this, I think some nice examples of the data design
> > problems that you have faced in biopython, and how they have been solved,
> > would be valuable.
> >
> > Thanks
> > Vincent
> >
> > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >
> > > For my SciPy talk and paper in a little over a month, I was hoping to
> > > render a somewhat coherent discussion of the design needs of
> > > statistical data structures, based on my experience developing pandas
> > > for quant finance research. I think these broadly fall into a few
> > > categories: implementation ease, usability (for the non-developer
> > > IPython-based console user), performance, and flexibility. Hopefully
> > > this will be useful information that will help guide future
> > > development efforts. What do you folks think?
> > >
> > > As part of this, I was thinking maybe we should start a wiki page (or
> > > pages) somewhere to start listing out the various design issues (big
> > > and small) where people can write their opinions and we can have a
> > > structured discussion (e-mail is a bit hard for this sort of thing).
> > > I'd also like to spend some time reading through other people's code
> > > (e.g. all of the larry code) and writing down what I think about their
> > > design choices in a constructive way.
> > >
> > > Part of what prompted my idea for a wiki was reading some of the larry
> > > code and wanting to share my thoughts on various parts of it. Of
> > > course I'm also prepared for other people to attack (and for me to
> > > have to defend) my own code. For most of these things there isn't a
> > > "right" and "wrong" and I am only interested in having constructive
> > > discussions and hearing people's perspectives. Here's an example: in
> > > pandas, when adding two differently labeled 2D arrays, the result has
> > > the *union* of all the labels. In la you get the intersection. There
> > > are certainly pros and cons for either approach (in my case I don't
> > > want to lose information, even if it's nulled out).
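
A minimal sketch of the union-versus-intersection behaviour described above,
assuming a current pandas release (the 2010 API may have differed). The frame
names and values are made up for illustration. The la (larry) API is not used
here; its intersection result is only emulated with pandas index operations.

```python
import pandas as pd

# Two small 2D labeled arrays that overlap only at row "r2", column "y".
# (Hypothetical data, for illustration only.)
a = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                 index=["r1", "r2"], columns=["x", "y"])
b = pd.DataFrame([[10.0, 20.0], [30.0, 40.0]],
                 index=["r2", "r3"], columns=["y", "z"])

# pandas aligns on the union of row and column labels; cells that exist
# in only one operand come back as NaN ("nulled out"), not dropped.
union_sum = a + b

# Emulated intersection: keep only labels present in both operands
# before adding, so nothing in the result is NaN.
rows = a.index.intersection(b.index)
cols = a.columns.intersection(b.columns)
intersection_sum = a.loc[rows, cols] + b.loc[rows, cols]

print(union_sum)
print(intersection_sum)
```

The union result keeps every label and marks the non-overlapping cells as
missing, while the intersection result is smaller but fully populated.
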
> > >
> > > We should also have a place where we document differences in
> > > performance for various operations. I spent a lot of time even before
> > > pandas was open-source obsessing over speed. I'd like to think I
> > > learned a few things, but I was operating in a bubble, so I might have
> > > missed really obvious speedups. I also learned lots of odd things
> > > about NumPy (did you know fancy indexing is a LOT slower than
> > > ndarray.take?). We should probably establish some apples-to-apples
> > > performance benchmarks to help people decide what to use for their
> > > applications if speed matters.
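
A rough sketch of what an apples-to-apples comparison of fancy indexing and
ndarray.take could look like. The array sizes and repeat count are arbitrary
assumptions, and no timings are claimed here: the gap reported above dates
from 2010 and may differ, or have narrowed, on current NumPy releases, so it
is worth measuring on your own setup.

```python
import timeit

import numpy as np

# Arbitrary sizes chosen for illustration only.
rng = np.random.default_rng(0)
values = rng.standard_normal(200_000)
indexer = rng.integers(0, values.size, size=50_000)

# Both forms select the same elements, so the comparison is like-for-like.
fancy = values[indexer]
taken = values.take(indexer)
assert np.array_equal(fancy, taken)

# Time each form over the same number of calls.
repeats = 100
t_fancy = timeit.timeit(lambda: values[indexer], number=repeats)
t_take = timeit.timeit(lambda: values.take(indexer), number=repeats)

print(f"fancy indexing: {t_fancy / repeats * 1e6:8.1f} us per call")
print(f"ndarray.take:   {t_take / repeats * 1e6:8.1f} us per call")
```
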
> > >
> > > Best,
> > > Wes
> >
Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>