From dalloliogm at gmail.com Tue Oct 28 07:06:14 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 11:06:14 -0000
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
Message-ID: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>

Hi!
My name is Giovanni, and I come from Biopython's mailing list.
I would like to make you a proposal.

Every module or program written in bioinformatics needs to be tested before it can be used to produce results that can be published.

For example, let's say I want to write another FASTA file parser, like SeqIO.FastaIO in Biopython: I would have to test the script against some real FASTA files, just to make sure that it doesn't parse them incorrectly or lose data.

Or, let's say I want to write a script to calculate Fst statistics over some population genetics data: I will have to compare the results of my script against other programs, check whether it gives the right result for a dataset for which I already know the Fst value, and maybe devise some other kinds of checks to be sure my script doesn't do weird things, like losing input data along the way.

So, the point is: what if we create a common repository for all this kind of testing data, to be shared with all the other Bio* projects?
Wouldn't it be good if all the Bio* FASTA parsers were able to parse the same files and give the same results, showing that they either all work correctly or all fail in the same way?

I am proposing this because Tiago and I, on the Biopython mailing list, would like to develop a module to calculate Fst statistics over SNP data, and there is no point in collecting some good test datasets and not sharing them with similar projects in other programming languages.
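As a rough illustration of the kind of check a shared dataset would allow, here is a minimal sketch in plain Python. It is not Biopython code: `parse_fasta`, `check_against_expected`, `SAMPLE` and `EXPECTED` are hypothetical names made up for this example, and the "expected" values stand in for metadata that a shared repository might store next to each test file.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_against_expected(handle, expected):
    """Return a list of problems found; an empty list means the checks passed."""
    records = list(parse_fasta(handle))
    problems = []
    if len(records) != expected["n_records"]:
        problems.append("record count: got %d, expected %d"
                        % (len(records), expected["n_records"]))
    total = sum(len(seq) for _, seq in records)
    if total != expected["total_residues"]:
        problems.append("residue count: got %d, expected %d"
                        % (total, expected["total_residues"]))
    return problems


# Hypothetical shared test file and the values every parser should reproduce.
SAMPLE = ">seq1 first test sequence\nACGTAC\nGTAC\n>seq2\nTTGA\n"
EXPECTED = {"n_records": 2, "total_residues": 14}

if __name__ == "__main__":
    print(check_against_expected(io.StringIO(SAMPLE), EXPECTED))
```

Any Bio* project could run the same comparison with its own parser in place of `parse_fasta`, which is what would make the expected values worth sharing.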
The same goes for much of the documentation, like use cases: if we collect a good base of use cases related to bioinformatics, it would be easier to coordinate the efforts of all the Bio* projects and to compare the different approaches the different communities use to solve the same problem.

At the moment, I have created a simple git repository on GitHub:
- http://github.com/dalloliogm/bio-test-datasets-repository

but it is still empty, and maybe GitHub is not the ideal hosting for such a project, since the free account has a 100MB space limit.

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it