[Biopython] SciPy paper: documenting statistical data structure design issues

Vincent Davis vincent at vincentdavis.net
Mon May 24 15:45:44 EDT 2010


"see the message below, cross posted from pystatsmodels"

We have been having some discussion on the pystatsmodels mailing list about
data objects, numpy arrays... I think it would be valuable for some
biopython users to contribute comments, examples, or ideas to the scipy
wiki that has been set up for this. At the heart of this is that,
although almost anything can be done with a numpy array, we run into many
problems that are difficult to solve with the current tools for numpy
arrays. Because of this, I think some good examples of the data design
problems you have faced in biopython, and how you solved them,
would be valuable.

Thanks
Vincent

On Sat, May 22, 2010 at 7:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> For my SciPy talk and paper in a little over a month, I was hoping to
> render a somewhat coherent discussion of the design needs of
> statistical data structures, based on my experience developing pandas
> for quant finance research. I think these broadly fall into a few
> categories: implementation ease, usability (for the non-developer
> IPython-based console user), performance, and flexibility. Hopefully
> this will be useful information that will help guide future
> development efforts. What do you folks think?
>
> As part of this, I was thinking maybe we should start a wiki page (or
> pages) somewhere to start listing out the various design issues (big
> and small) where people can write their opinions and we can have a
> structured discussion (e-mail is a bit hard for this sort of thing).
> I'd also like to spend some time reading through other people's code
> (e.g. all of the larry code) and writing down what I think about their
> design choices in a constructive way.
>
> Part of what prompted my idea for a wiki was reading some of the larry
> code and wanting to share my thoughts on various parts of it. Of
> course I'm also prepared for other people to attack (and for me to
> have to defend) my own code. For most of these things there isn't a
> "right" and "wrong" and I am only interested in having constructive
> discussions and hearing people's perspectives. Here's an example: in
> pandas when adding two different-labeled 2d arrays, the result has the
> *union* of all the labels. In la you get the intersection. There are
> certainly pros and cons for either approach (in my case I don't want to
> lose information, even if it's nulled out).
>
> We should also have a place where we document differences in
> performance for various operations. I spent a lot of time even before
> pandas was open-source obsessing over speed-- I'd like to think I
> learned a few things but I was operating in a bubble so I might have
> missed really obvious speedups. I also learned lots of odd things
> about NumPy (did you know fancy indexing is a LOT slower than
> ndarray.take?). We should probably establish some apples-to-apples
> performance benchmarks to help people decide what to use for their
> applications if speed matters.
>
> Best,
> Wes
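
The union-versus-intersection choice Wes describes for adding differently
labeled arrays can be illustrated with a small sketch. This is not the
pandas or la implementation; it uses plain dicts as 1-d labeled data, and
the helper name add_labeled and its "how" parameter are hypothetical,
chosen only for illustration.

```python
import math


def add_labeled(a, b, how="union"):
    """Add two label -> value mappings, aligning on labels.

    how="union": keep every label; a label missing on one side yields NaN.
    how="intersection": keep only labels present on both sides.
    (Hypothetical helper, not part of pandas or la.)
    """
    if how == "union":
        labels = sorted(set(a) | set(b))
    elif how == "intersection":
        labels = sorted(set(a) & set(b))
    else:
        raise ValueError(f"unknown how: {how!r}")
    # NaN propagates through addition, so unmatched labels stay "nulled out".
    return {k: a.get(k, math.nan) + b.get(k, math.nan) for k in labels}


x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"b": 10.0, "c": 20.0, "d": 30.0}

union = add_labeled(x, y, how="union")
inter = add_labeled(x, y, how="intersection")

print(union)  # {'a': nan, 'b': 12.0, 'c': 23.0, 'd': nan}
print(inter)  # {'b': 12.0, 'c': 23.0}
```

The union result keeps a record that labels "a" and "d" existed, at the
cost of introducing missing values; the intersection result is dense but
silently drops them.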

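The fancy-indexing versus ndarray.take comparison mentioned above is the
kind of apples-to-apples benchmark that is easy to write down. The sketch
below assumes NumPy is installed; the array shape and repeat count are
arbitrary choices, and the relative timings will depend on the NumPy
version and the machine, so no result is claimed here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 10))
rows = rng.integers(0, data.shape[0], size=5_000)

# Both expressions select the same rows; only the code path differs.
fancy = data[rows]
taken = data.take(rows, axis=0)
assert np.array_equal(fancy, taken)

# Time each selection over the same number of repetitions.
t_fancy = timeit.timeit(lambda: data[rows], number=200)
t_take = timeit.timeit(lambda: data.take(rows, axis=0), number=200)

print(f"fancy indexing: {t_fancy:.4f} s")
print(f"ndarray.take:   {t_take:.4f} s")
```

Checking that both paths return identical results before timing them keeps
the comparison honest.
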
Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>

