[Biopython] SciPy paper: documenting statistical data structure design issues

Mon May 24 20:04:54 UTC 2010

Sorry, I forgot the link:
http://scipy.org/StatisticalDataStructures

On Mon, May 24, 2010 at 1:45 PM, Vincent Davis <vincent at vincentdavis.net> wrote:

> "see the message below, cross posted from pystatsmodels"
>
> We have been having some discussion on the pystatsmodels mailing list about
> data objects, numpy arrays... I think it would be valuable for some
> Biopython users to contribute comments, examples, or ideas to the SciPy
> wiki page that has been set up for this. I think the heart of the matter is
> that although almost anything can be done with a numpy array, we run into
> many problems that are difficult to solve with the current tools for numpy
> arrays. Because of this, I think some good examples of the data design
> problems you have faced in Biopython, and how you solved them, would be
> valuable.
>
> Thanks
> Vincent
>
> On Sat, May 22, 2010 at 7:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>
>> For my SciPy talk and paper in a little over a month, I was hoping to
>> render a somewhat coherent discussion of the design needs of
>> statistical data structures, based on my experience developing pandas
>> for quant finance research. I think these broadly fall into a few
>> categories: implementation ease, usability (for the non-developer
>> IPython-based console user), performance, and flexibility. Hopefully
>> this will be useful information that will help guide future
>> development efforts. What do you folks think?
>>
>> As part of this, I was thinking maybe we should start a wiki page (or
>> pages) somewhere to start listing out the various design issues (big
>> and small) where people can write their opinions and we can have a
>> structured discussion (e-mail is a bit hard for this sort of thing).
>> I'd also like to spend some time reading through other people's code
>> (e.g. all of the larry code) and writing down what I think about their
>> design choices in a constructive way.
>>
>> Part of what prompted my idea for a wiki was reading some of the larry
>> code and wanting to share my thoughts on various parts of it. Of
>> course I'm also prepared for other people to attack (and for me to
>> have to defend) my own code. For most of these things there isn't a
>> "right" and "wrong" and I am only interested in having constructive
>> discussions and hearing people's perspectives. Here's an example: in
>> pandas when adding two different-labeled 2d arrays, the result has the
>> *union* of all the labels. In la you get the intersection. There are
>> certainly pros and cons for either approach (in my case I don't want to
>> lose information, even if it's nulled out).
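
To make the union-versus-intersection point concrete, here is a minimal
sketch. It assumes a current pandas release (the 2010 API may well have
differed), and the frame names and labels are made up for illustration. The
intersection result is emulated with DataFrame.align(join="inner"); it is not
a claim about how la implements it.

```python
import pandas as pd

# Two hypothetical 2d arrays whose row labels only partly overlap.
a = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["r2", "r3", "r4"])

# Default pandas arithmetic aligns on the union of the labels;
# labels present in only one operand come back as NaN.
union_sum = a + b

# Intersection-style result: keep only the labels both operands share,
# then add. No NaN appears, but r1 and r4 are dropped.
a_in, b_in = a.align(b, join="inner")
inner_sum = a_in + b_in

print(union_sum)
print(inner_sum)
```

The union keeps every label at the cost of missing values to handle later;
the intersection gives a dense result but silently drops unmatched rows.
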
>>
>> We should also have a place where we document differences in
>> performance for various operations. I spent a lot of time even before
>> pandas was open source obsessing over speed. I'd like to think I
>> learned a few things but I was operating in a bubble so I might have
>> missed really obvious speedups. I also learned lots of odd things
>> about NumPy (did you know fancy indexing is a LOT slower than
>> ndarray.take?). We should probably establish some apples-to-apples
>> performance benchmarks to help people decide what to use for their
>> applications if speed matters.
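
As a starting point for an apples-to-apples comparison, here is a minimal
sketch that times fancy indexing against ndarray.take on the same data. The
array size, index count, and repeat count are arbitrary assumptions, and no
particular speed ratio is claimed: the gap depends heavily on the NumPy
version, dtype, and hardware, so run it locally rather than trusting a quoted
number.

```python
import timeit

import numpy as np

# Arbitrary sizes chosen for illustration only.
rng = np.random.default_rng(0)
arr = rng.standard_normal(1_000_000)
idx = rng.integers(0, arr.size, size=100_000)

# Both forms select the same elements, so the comparison is fair.
fancy = arr[idx]
taken = arr.take(idx)
assert np.array_equal(fancy, taken)

# Time each form over the same number of repetitions.
repeats = 20
t_fancy = timeit.timeit(lambda: arr[idx], number=repeats)
t_take = timeit.timeit(lambda: arr.take(idx), number=repeats)

print(f"fancy indexing: {t_fancy / repeats * 1e3:.3f} ms per call")
print(f"ndarray.take:   {t_take / repeats * 1e3:.3f} ms per call")
```

Checking that both forms return identical results before timing them keeps
the benchmark honest, which is the main thing a shared benchmark page would
need to enforce.
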
>>
>> Best,
>> Wes

Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>