[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects

Thu Dec 4 12:06:24 EST 2008

I don't know if this is really the best email list for this,
although I'm not sure what other common list should be used.

We actually started a project like this many moons ago, but no one
contributed examples...

http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

We can start a common SVN repository for this if you like, or a GitHub
repository under OBF if that is more likely to garner contributions.

In terms of documentation, you are certainly welcome to make a
documentation repository, but I would argue a wiki or wiki-like solution
would be best for documentation.
Whether a common wiki can be maintained among the projects (or whether
the wiki farms could be merged someday) is something to contemplate too.

-jason

On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:

> Hi!
> My name is Giovanni, I come from biopython's mailing list.
>
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
>
> For example, let's say I want to write another fasta file parser, like
> SeqIO.FastaIO in biopython: I would have to test the script
> against some real fasta files, just to make sure that it doesn't parse
> them in a wrong way, or that it doesn't lose data.
> Or, let's say I want to write a script to calculate Fst statistics
> over some population genetics data: I will have to compare the results
> of my script against other programs, check if it gives me the right
> result for a set for which I already know the Fst value, and maybe
> devise some other kinds of checks to be sure my script doesn't do weird
> things, like losing input data on the way.
>
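The known-answer check described in the quoted paragraph above could be sketched roughly as follows. This is only a minimal illustration: it assumes a simplified textbook Fst, (H_T - H_S) / H_T, for a single biallelic locus and equally sized populations, which is not necessarily the estimator the proposed Biopython module would use. The function name `fst_biallelic` is hypothetical.

```python
import math


def fst_biallelic(freqs):
    """Return (H_T - H_S) / H_T for one biallelic locus.

    `freqs` holds the frequency of one allele in each population.
    Assumption for illustration: all populations have equal size.
    """
    if not freqs:
        raise ValueError("need at least one population")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    if h_total == 0:
        raise ValueError("locus is monomorphic; Fst is undefined")
    return (h_total - h_within) / h_total


# Two cases whose answers are known in advance:
# populations fixed for different alleles, and identical populations.
fixed_apart = fst_biallelic([1.0, 0.0])
identical = fst_biallelic([0.3, 0.3])

assert math.isclose(fixed_apart, 1.0)
assert math.isclose(identical, 0.0, abs_tol=1e-12)
```

A shared repository could store such inputs together with their expected values, so that implementations in any language can be run against the same cases.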
> So, the point is: what if we create a common repository for all this
> kind of testing data, to be shared among all the Bio* projects?
> Wouldn't it be good if all the Bio* fasta parsers were able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
>
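One way the "same files, same results" idea above might be checked across languages is to reduce each parser's output to a small, language-neutral summary and compare the summaries. The sketch below is an assumption about how that could look, not an existing Bio* convention: the toy parser `parse_fasta`, the helper `summarize`, and the inline `SHARED_FASTA` data are all hypothetical, and a real test would read a file from the shared repository and use each project's own parser.

```python
import hashlib
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records):
    """Reduce parsed records to id, length and a checksum of the sequence."""
    summary = []
    for header, seq in records:
        summary.append({
            "id": header.split()[0] if header else "",
            "length": len(seq),
            # Upper-case first so parsers that differ only in case agree.
            "md5": hashlib.md5(seq.upper().encode("ascii")).hexdigest(),
        })
    return summary


# Hypothetical stand-in for a file kept in the shared repository.
SHARED_FASTA = ">seq1 first test record\nACGT\nacgt\n>seq2\nTTGA\n"

summary = summarize(parse_fasta(io.StringIO(SHARED_FASTA)))

assert [rec["id"] for rec in summary] == ["seq1", "seq2"]
assert [rec["length"] for rec in summary] == [8, 4]
```

If the expected summaries were stored next to each data file, a parser in any language would only need to reproduce them, which also catches silently dropped records or truncated sequences.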
> I am proposing this because Tiago and I, on the biopython mailing list,
> would like to develop a module to calculate Fst statistics over SNP
> data, and there is no point in collecting some good test datasets and
> not sharing them with similar projects in other programming languages.
>
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different communities.
>
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but it is still empty, and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.
>
>
> -- 
> -----------------------------------------------------------
>
> My Blog on Bioinformatics (italian): http://bioinfoblog.it

Jason Stajich
jason at bioperl.org