From dalloliogm at gmail.com  Tue Oct 28 07:06:14 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 11:06:14 -0000
Subject: [Open-bio-l] a common repository for test datasets/use cases for
	all Bio* projects
Message-ID: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>

Hi!
My name is Giovanni, and I come from the Biopython mailing list.

I would like to make a proposal.
Every module/program written in bioinformatics needs to be tested
before it can be used to produce results that can be published.

For example, let's say I want to write another FASTA file parser, like
SeqIO.FastaIO in Biopython: I would have to test the script
against some real FASTA files, just to make sure that it doesn't parse
them incorrectly or lose data.
Or, let's say I want to write a script to calculate Fst statistics
over some population genetics data: I will have to compare the results
of my script against other programs, check that it gives me the right
result for a dataset for which I already know the Fst value, and maybe
devise some other kinds of checks to be sure my script doesn't do weird
things, like losing input data along the way.
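
To make the "known Fst value" check concrete, here is a minimal sketch in
Python. It assumes the textbook definition Fst = (H_T - H_S) / H_T for a
single biallelic locus with equally weighted populations; this is only an
illustration, not necessarily the estimator a real module would use, and the
function name fst_two_alleles is made up for this example.

```python
def fst_two_alleles(freqs):
    """Fst for one biallelic locus, given one allele frequency per population.

    Assumption: populations are weighted equally and
    Fst = (H_T - H_S) / H_T, with H = 2p(1 - p).
    """
    n = len(freqs)
    # Mean expected heterozygosity within populations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / n
    # Expected heterozygosity of the pooled population.
    p_bar = sum(freqs) / n
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("locus is monomorphic; Fst is undefined")
    return (h_t - h_s) / h_t


# Hand-computed cases that a shared test dataset could record as expected values.
assert abs(fst_two_alleles([0.2, 0.8]) - 0.36) < 1e-9  # H_S = 0.32, H_T = 0.5
assert abs(fst_two_alleles([0.5, 0.5]) - 0.0) < 1e-9   # identical populations
assert abs(fst_two_alleles([0.0, 1.0]) - 1.0) < 1e-9   # fixed differences
```

A shared repository could store the input frequencies together with such
hand-checked expected values, so that every implementation is tested against
the same numbers.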

So, the point is: what if we create a common repository for all this
kind of test data, to be shared with all the other Bio*
projects?
Wouldn't it be good if all the Bio* FASTA parsers were able to parse the
same files and give the same results, showing that they either all
work fine or all go wrong in the same way?
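
One way such a cross-project comparison could work is sketched below in
Python. Each parser reduces a test file to a language-neutral summary
(identifier, sequence length, MD5 of the sequence), and the summaries are
compared. The toy parser and the names parse_fasta and summarize are
assumptions made for this illustration; they are not the Biopython API.

```python
import hashlib
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The identifier is the first word of the header line.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(records):
    """Map each identifier to (sequence length, MD5 hex digest of the sequence)."""
    return {
        name: (len(seq), hashlib.md5(seq.encode("ascii")).hexdigest())
        for name, seq in records
    }


# The same two records, wrapped differently, must give the same summary.
wrapped = ">seq1 first test record\nACGT\nACGT\n>seq2\nGGCC\n"
unwrapped = ">seq1\nACGTACGT\n>seq2\nGG\nCC\n"
summary = summarize(parse_fasta(io.StringIO(wrapped)))
assert summary == summarize(parse_fasta(io.StringIO(unwrapped)))
```

The repository could ship each test file next to a summary like this, so a
parser written in any language can be checked without depending on the others.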

I am proposing this because Tiago and I, on the Biopython mailing list, would
like to develop a module to calculate Fst statistics over SNP data, and
there is no point in collecting good test datasets without sharing them
with similar projects in other programming languages.

The same goes for much of the documentation, like use cases: if we
collect a good base of use cases related to bioinformatics, it would
be easier to coordinate the efforts of all the Bio* projects and
compare the different approaches used to solve the same issue by the
different communities.

For now, I have created a simple git repository on GitHub:
- http://github.com/dalloliogm/bio-test-datasets-repository
but it is still empty, and GitHub may not be the ideal host for
such a project, since the free account has a 100 MB space limit.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (Italian): http://bioinfoblog.it
