[BioPython] a common repository for test datasets/use cases for all Bio* projects

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Tue Oct 28 10:46:39 UTC 2008


Hi,
I would like to make a proposal.
Every module or program written in bioinformatics needs to be tested
before it can be used to produce publishable results.

For example, let's say I want to write another FASTA file parser, like
SeqIO.FastaIO in Biopython: I would have to test the script
against some real FASTA files, just to make sure that it doesn't parse
them incorrectly or lose data.
Or, let's say I want to write a script to calculate Fst statistics
over some population genetics data: I will have to compare the results
of my script against other programs, check that it gives the right
result for a dataset whose Fst value I already know, and maybe
devise some other checks to be sure my script doesn't do weird
things, like losing input data along the way.
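
To make the second example concrete, here is a minimal sketch of such
a known-answer check. It is only an illustration: wright_fst and
fst_per_locus are hypothetical names, and I assume the simplest
textbook definition, Fst = (H_T - H_S) / H_T, for one biallelic locus
and equally sized subpopulations. It is not the module we plan to
write.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) for a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def wright_fst(freqs):
    """Fst = (H_T - H_S) / H_T for one locus.

    freqs holds the frequency of one allele in each subpopulation.
    Assumption: all subpopulations have the same size.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = expected_heterozygosity(p_bar)
    if h_t == 0.0:
        raise ValueError("monomorphic locus: Fst is undefined")
    h_s = sum(expected_heterozygosity(p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


def fst_per_locus(loci):
    """Return one result per input locus (None where Fst is undefined)."""
    results = []
    for freqs in loci:
        try:
            results.append(wright_fst(freqs))
        except ValueError:
            results.append(None)
    return results


def run_known_answer_checks():
    # Identical frequencies: no differentiation, so Fst must be 0.
    assert abs(wright_fst([0.5, 0.5]) - 0.0) < 1e-12
    # Populations fixed for different alleles: Fst must be 1.
    assert abs(wright_fst([1.0, 0.0]) - 1.0) < 1e-12
    # Hand-computed case: H_T = 0.5, H_S = 0.32, so Fst = 0.36.
    assert abs(wright_fst([0.2, 0.8]) - 0.36) < 1e-12
    # No input may be lost: one output per locus, even if undefined.
    loci = [[0.5, 0.5], [1.0, 1.0], [0.2, 0.8]]
    assert len(fst_per_locus(loci)) == len(loci)


if __name__ == "__main__":
    run_known_answer_checks()
    print("all known-answer checks passed")
```

The last check is the "losing input data" case: a monomorphic locus
is reported as None instead of being silently dropped.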

So, the point is: what if we create a common repository for all this
kind of testing data, to be shared with all the other Bio*
projects?
Wouldn't it be good if all the Bio* FASTA parsers could parse the
same files and give the same results, showing that either all of them
work correctly or all of them are wrong in the same way?
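
One way this comparison could work, as a rough sketch: each project
parses the same shared file and reduces the result to a digest over
(id, sequence) pairs, so that implementations in different languages
only have to compare one string. Everything here is an assumption
made for illustration: the tiny parse_fasta function stands in for a
real parser such as the Biopython one, and the "id, tab, uppercase
sequence, newline" canonical form is just one possible convention.

```python
import hashlib

# Stand-in for a file taken from the shared repository (hypothetical).
SHARED_FASTA = """>seq1 first test record
ACGTACGT
ACGT
>seq2
TTGGCCAA
"""


def parse_fasta(text):
    """Minimal FASTA parser returning a list of (id, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            # The id is the first word of the header line.
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def records_digest(records):
    """SHA-256 over a canonical text form of the parsed records."""
    h = hashlib.sha256()
    for name, seq in records:
        h.update(f"{name}\t{seq.upper()}\n".encode("ascii"))
    return h.hexdigest()


if __name__ == "__main__":
    records = parse_fasta(SHARED_FASTA)
    print(len(records), "records, digest", records_digest(records))
```

The repository could then store the expected record count and digest
next to each data file, and every Bio* project could assert that its
own parser reproduces them.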

I am proposing this because Tiago and I would like to develop a module
to calculate Fst statistics over SNP data, and there is no point in
collecting some good test datasets and not sharing them with
similar projects in other programming languages.

The same goes for much of the documentation, like use cases: if we
collect a good base of use cases related to bioinformatics, it would
be easier to coordinate the efforts of all the Bio* projects and
compare the different approaches used to solve the same issue by the
different communities.

At the moment, I have created a simple git repository on GitHub:
- http://github.com/dalloliogm/bio-test-datasets-repository
but it is still empty, and maybe GitHub is not the ideal host for
such a project, since the free account has a 100MB space limit.




-- 
-----------------------------------------------------------

My Blog on Bioinformatics (Italian): http://bioinfoblog.it


