From pmr at ebi.ac.uk Thu Dec 4 11:49:57 2008
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 04 Dec 2008 16:49:57 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <49380A35.10909@ebi.ac.uk>

Giovanni Marco Dall'Olio wrote:
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parser are able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
>
> I am doing this because me (and Tiago), in the biopython mailing list, would
> like to develop a module to calculate Fst statistics over SNP data, and
> there is no point of collecting some good test datasets and not sharing them
> with other similar projects in other programming languages.
>
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different comunities.
>
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but , it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.

The EMBOSS project on Open Bio has its own set of test cases for all
applications, and validation for source code documentation and
application documentation. Our tests run as perl scripts using scripts
and data that are distributed with EMBOSS.

We would be interested in joining a common effort.
regards,

Peter Rice

From jason at bioperl.org Thu Dec 4 12:06:24 2008
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 4 Dec 2008 09:06:24 -0800
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>

I don't know if this is really the best email list for this -- although
I am not sure what other common list should be used.

We actually started a project like this many moons ago, but no one
contributed examples...

http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

We can start a common SVN repository for this if you like, or a github
on OBF if that is more likely to garner contributions.

In terms of documentation - you are certainly welcome to make a
documentation repository, but I would argue a wiki or wiki-like
solution would be best for documentation.
Whether a common wiki can be maintained among the projects (or merge
the wikifarms someday) is something to contemplate too.

-jason

On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:

> Hi!
> My name is Giovanni, I come from biopython's mailing list.
>
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
>
> For example, let's say I want to write another fasta file parser, like
> SeqIO.FastaIO in biopython : I would have have to test the script
> against some real fasta files, just to make sure that it doesn't parse
> them in a wrong way, or that it losts data.
> Or, let's say I want to write a script to calculate Fst statistics
> over some population genetics data: I will have to compare the results
> of my scripts against other programs, check if it gives me the right
> result for a set for which I already know the Fst value, and maybe
> ideate some other kind of checks to be sure my script doesn't do weird
> things, like losing input data on the way.
>
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parser are able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
>
> I am doing this because me (and Tiago), in the biopython mailing
> list, would like to develop a module to calculate Fst statistics over
> SNP data, and there is no point of collecting some good test datasets
> and not sharing them with other similar projects in other programming
> languages.
>
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different comunities.
>
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but , it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.
>
>
> --
> -----------------------------------------------------------
>
> My Blog on Bioinformatics (italian): http://bioinfoblog.it
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l

Jason Stajich
jason at bioperl.org

From biopython at maubp.freeserve.co.uk Thu Dec 4 12:26:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:26:06 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <320fb6e00812040926g7ea92397r19af618d8b50d143@mail.gmail.com>

On Thu, Dec 4, 2008 at 5:06 PM, Jason Stajich wrote:
>
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

I think I suggested trying this list to Giovanni - it looked like the
best bet, although I suspect it has a fairly low subscriber count.

> We actually a started a project like this many moons ago, but no one
> contributed examples...
>
> http://code.open-bio.org/cgi/viewcvs.cgi/biodata/
>

That was before I started using Biopython, so I'd never seen that.

> We can start a common SVN repository for this if you like or a github on OBF
> if that is more likely to garner contributions.

Using an OBF repository would be nice, especially if developers from
all the Bio* projects with existing CVS/SVN accounts automatically had
write access to it. I've not really used git, but it might be more
open for new-comers.

> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.
> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

Given the OBF already has wiki software up and running, this does seem
like a good choice for documentation. The BioPerl wiki already has a
lot of useful stuff describing different file formats, and in most
cases the text is independent of BioPerl. It would make sense to take
these pages as a basis for a shared OBF wiki.

I would think that ideally the Bio* projects could have a page on each
file format describing how it is parsed with that tool kit, but citing
a shared file format description page (or even embedding it on the
fly).

Peter

From dalloliogm at gmail.com Thu Dec 4 13:06:08 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 4 Dec 2008 19:06:08 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>

On 12/4/08, Jason Stajich wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

Hi Jason,
thank you very much for the answer.
I was going to post this mail again to the other OpenBio lists to see
if there were more people interested, but I decided to wait a bit,
since I am not yet very confident with these concepts myself and
wanted to study them a bit more.
To be honest, I was going to discuss testing with some professional
programmers I met some months ago, who seemed very confident with the
concepts of testing, not to say obsessed with them.
For example, it is not clear to me how a use case could be written to
be most useful for all the different Bio::* projects, among other
things.
> We actually a started a project like this many moons ago, but no one
> contributed examples...
>
> http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

I see, thank you very much for the link.
On the biopython list they told me that a big issue is the license
under which the data is released. I have no problem contributing
examples under the GPL or without a license, but I understand other
people might.
Somebody told me that there were some interesting discussions on
scipy.org, but I couldn't find them.

> We can start a common SVN repository for this if you like or a github on
> OBF if that is more likely to garner contributions.

Well, to be honest I prefer git :).
But any revision control system is fine for me; moreover, these are
examples and not code, so the choice is less important.

What I think could be useful for this project is a ticket tracking
system, or rather a feature request system, to keep track of all the
things needed.
I once used a system called assembla:
- http://www.assembla.com/spaces/biotest/tickets
which seems very cool to use, but it is not open source, and maybe
bugzilla would suffice.

> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.

Well, basically I have three years ahead of me in which I will work in
the same field (I am a first-year PhD student in a population genetics
laboratory), and in theory I will have to write a lot of test cases
and controls anyway, which I don't mind contributing.
However, as I was saying before, I am not very experienced in writing
use cases, and it will take me a while (let's say some months) to
learn how to write them well.

> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

I agree; for example, BioPerl's wiki has many useful descriptions of
file formats that the other Bio* projects lack.

--

My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu Dec 4 13:43:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:43:51 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
Message-ID: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>

Giovanni wrote:
> For example I have it not clear how a use case could be written to be
> the best useful for all the different Bio::* projects, and other
> things.

In terms of use cases, I would imagine things like the following:

(1) Take a provided set of CDS nucleotide sequences in FASTA format,
translate them using NCBI codon table 11 (bacteria), and output the
results as a FASTA file of protein sequences.

(2) Take a provided set of protein sequences, and do pairwise
alignments between them all using the EMBOSS tool needle.

(3) Take a provided FASTA file of proteins, and run ClustalW on it
using the default settings. Take the multiple sequence alignment in
ClustalW format, and convert it into Stockholm format. Then build a
neighbour joining tree using the quick-tree program (which cannot read
in ClustalW files directly). Finally, load the tree file and produce a
cladogram where the taxon/leaf XXX is highlighted in red.

(4) Take a provided author name and keyword, and query the NCBI Entrez
web interface to get a list of matching papers. Download these
references (maybe as MedLine format, maybe as XML) and parse the
result into a CSV file for input into your reference manager (e.g.
EndNote - or generate a bibtex file for use with LaTeX).

(5) Take a provided species name, and use NCBI Entrez to download
all matching EST sequences to a FASTA format file.

(6) Take a provided FASTA file of proteins and use standalone NCBI
BLASTP to search them against the NR database using an expectation
threshold of 10^-6 and at most ten alignments per query. Parse the
results, and generate a new FASTA file of the protein sequences where
the description line includes the protein identifiers of closely
related entries found with BLAST. [A more sensible approach to
automatic annotation would be nice, but more complicated]

Ideally these would all need a short motivational section explaining
why you might want to do this particular task. There is probably a
balance between trivial and too complex.

These could be compiled on a shared OBF wiki, together with any input
files required.
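
To make use case (1) concrete, here is a minimal sketch in plain
Python, written without any of the Bio* toolkits so that it does not
favour one project. The function names (read_fasta, translate_cds,
format_fasta) are made up for this illustration. It assumes each
record is a complete, in-frame CDS, and as a simplification it does
not convert alternative start codons such as GTG or TTG to methionine,
which a real toolkit may do for table 11.

```python
from itertools import product

# Amino acids of NCBI translation table 11, listed in TCAG codon order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate_cds(nucleotides):
    """Translate an in-frame CDS, stopping at the first stop codon.

    Simplification (an assumption of this sketch): alternative start
    codons are translated with the ordinary table, not forced to "M".
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence lines."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + wrapped) + "\n"


if __name__ == "__main__":
    example = [">cds1 hypothetical example", "ATGGCCAAA", "TAA"]
    for header, seq in read_fasta(example):
        print(format_fasta(header, translate_cds(seq)), end="")
```

Each toolkit would replace these three functions with its own parser,
translation routine and writer; the point of a shared use case is that
the input file and the expected protein file stay the same.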
It would be up to the individual projects to write their own sample
code to do this task - perhaps hosted on the Bio* project specific
wiki pages, but linked to from the use case.

Potentially this would be a huge project, but it would also be a nice
resource [provided it was maintained and kept up to date as the
toolkits evolve]. Perhaps this is too ambitious?

> On the biopython list they told me that a big issue is the license
> with which the data is released. I don't have any inconvenience in
> contributing examples with a GPL or without license, but I understand
> other people could do.
> Somebody told me that there were some interesting discussions on
> scipy.org, but I couldn't find them.

Licensing and copyright are valid concerns. Also the different Bio*
projects use different licenses - and I suspect none of them are
compatible with the GPL. Any licence would have to allow all the Bio*
projects to copy the example files into their code with no strings
attached - ideally just "public domain" or MIT/BSD style.

I would like to see a general collection of real world samples of each
file format (these could be pointed to by any shared file format
documentation). Between all the Bio* projects we probably have a good
collection already - but the provenance of each file would have to be
looked at as well as the licence. In addition, artificial hand-edited
files which include valid but unusual content could be useful to test
the Bio* projects' parsers. I don't think this actually needs to be
in a repository, but that would be nice for tracking ownership.

I think it would be up to the individual projects to pull in any files
of interest for use in their own test suites (essentially copying
example files into their own repositories).

> What I think that could be useful for this project is a ticket
> tracking system, or better said a feature request system, to keep
> track of all the things needed.

The OBF already runs a bugzilla installation used by most of the Bio*
projects, which would probably be OK for this sort of thing.

Peter

From dalloliogm at gmail.com Wed Dec 10 06:31:13 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Dec 2008 12:31:13 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com> <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
Message-ID: <5aa3b3570812100331i1f1e34deic779820308c31a6a@mail.gmail.com>

On 12/4/08, Peter wrote:
> Giovanni wrote:
> > For example I have it not clear how a use case could be written to be
> > the best useful for all the different Bio::* projects, and other
> > things.
>
>
> In terms of use cases, I would imagine things like the following:
>
> (1) Take a provided set of CDS nucleotide sequences in FASTA format,
> translate them using NCBI codon table 11 (bacteria), and output the
> results as a FASTA file of protein sequences.
>
> (2) Take a provided set of protein sequences, and do pairwise
> alignments between them all using the EMBOSS tool needle.
>
> (3) Take a provided FASTA file of proteins, and run ClustalW on it
> using the default settings. Take the multiple sequence alignment in
> ClustalW format, and covert it into Stockholm format. Then build a
> neighbour joining tree using quick-tree program (which cannot read in
> ClustalW files directly). Finally, load the tree file and produce a
> cladogram where the taxon/leaf XXX is highlighted in red.
>
> (4) Take a provided author name and keyword, and query the NCBI Entrez
> webinterface to get a list of matching papers. Download these
> references (maybe as MedLine format, maybe as XML) and parse the
> result into a CSV file for input into your reference manager (e.g.
> EndNote - or generate a bibtex file for use with LaTeX).
>
> (5) Taking a provided species name, and use NCBI Entrez to download
> all matching EST sequences to a FASTA format file.
>
> (6) Take a provided FASTA file of proteins and use standalone NCBI
> BLASTP to search them against the NR database using a expectation
> threshold of 10^-6 and at most ten alignments per query. Parse the
> results, and generate a new FASTA file of the protein sequences where
> the description line includes the protein identifiers of closely
> related entries found with BLAST. [A more sensible approach to
> automatic annotation would be nice, but more complicated]

ok, these are good examples.
I would also add a title (e.g. 1: Translating a CDS sequence), just
for convenience.

We could also add some examples of the expected outputs.
Example: if you apply the procedure in case 1 to the file
COX1_cds.fasta, you obtain exactly the file COX1_protein.fasta.

Moreover, a possible approach is to write a script that executes the
same actions described in the use cases.
I saw people doing this to test web applications (zope). They wrote
some scripts using perl's LWP or python webbrowser libraries, to
execute all the actions that a user can do in a use case scenario.
However, this is too much work for now; better to leave it for later.

> Ideally these would all need a short motivational section explaining
> why you might want to do this particular task. There is probably a
> balance between trivial and too complex.
>
> These could be compiled on a shared OBF wiki, together with any input
> files required. It would be up to the individual projects to write
> their own sample code to do this task - perhaps hosted on the Bio*
> project specific wiki pages, but linked to from the use case.

Can we put it somewhere here:
- http://www.open-bio.org/wiki/Main_Page ?
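
The expected-output idea above (COX1_cds.fasta in, COX1_protein.fasta
out) could be checked with something like the following sketch. The
helper names fasta_records and compare_fasta are hypothetical. One
design choice here is an assumption rather than part of the proposal:
the comparison is record by record, ignoring line wrapping, letter
case and the free-text description, because different toolkits may
write the same sequences with different layout. A byte-for-byte
comparison would be stricter.

```python
def fasta_records(text):
    """Return a list of (identifier, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the identifier.
            parts = line[1:].split()
            records.append([parts[0] if parts else "", []])
        elif records:
            records[-1][1].append(line.upper())
    return [(ident, "".join(chunks)) for ident, chunks in records]


def compare_fasta(produced, expected):
    """Return a list of differences; an empty list means equivalent."""
    got, want = fasta_records(produced), fasta_records(expected)
    problems = []
    if len(got) != len(want):
        problems.append(f"record count differs: {len(got)} != {len(want)}")
    for (gid, gseq), (wid, wseq) in zip(got, want):
        if gid != wid:
            problems.append(f"identifier differs: {gid!r} != {wid!r}")
        elif gseq != wseq:
            problems.append(f"sequence differs for {gid!r}")
    return problems


if __name__ == "__main__":
    # Made-up sequences; in practice both texts would be read from files.
    expected = ">COX1 protein\nMFADRW\nLFST\n"
    produced = ">COX1 translated by some toolkit\nMFADRWLFST\n"
    print(compare_fasta(produced, expected) or "outputs are equivalent")
```

With real files, the two arguments would simply be the contents of the
toolkit's output and of the shared COX1_protein.fasta.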

> Potentially this would be a huge project, but it would also be a nice
> resource [provided it was maintained and kept up to date as the
> toolkits evolve]. Perhaps this is too ambitious?

maybe it is :).
It will be very difficult to keep up to date with everything, given
the speed at which new technologies come out nowadays.
However, I think that there could be many people interested in
contributing to it.
And for many researchers, it could be easier to contribute a
description of what they want to do with their data than to
contribute code.

> > On the biopython list they told me that a big issue is the license
> > with which the data is released. I don't have any inconvenience in
> > contributing examples with a GPL or without license, but I understand
> > other people could do.
> > Somebody told me that there were some interesting discussions on
> > scipy.org, but I couldn't find them.
>
>
> Licensing and copyright are valid concerns. Also the different Bio*
> projects use different licenses - and I suspect none of them are
> compatible with the GPL. Any licence would have to allow all the Bio*
> projects to copy the example files into their code with no strings
> attached - ideally just "public domain" or MIT/BSD style.

I agree on any open license, MIT/BSD should be ok.

> I would like to see a general collection of real world samples of each
> file format (these could be pointed to by any shared file format
> documentation). Between all the Bio* projects we probably have a good
> collection already - but the provenance of each file would have to be
> looked at as well as the licence. In addition, artificial hand edited
> files could be useful which include valid but unusual content to test
> the Bio* project's parsers. I don't think this actually needs to be
> in a repository, but that would be nice for tracking ownership.

ok. In any case, wikis usually have a versioning system, so there are
not many differences.
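
Whether the files end up on a wiki or in a repository, each one could
carry a small record of its licence and provenance, plus a checksum so
that a project which copies a file into its own test suite can confirm
the copy is unchanged. The following is only a sketch of that idea:
the SampleFile fields and the example values are invented for
illustration and are not an agreed format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleFile:
    """Hypothetical manifest entry for one shared test file."""
    name: str         # file name within the shared collection
    file_format: str  # e.g. "fasta" or "clustal"
    licence: str      # e.g. "public domain" or "MIT"
    provenance: str   # where the file came from, or who hand-edited it
    sha256: str       # hex digest of the file contents


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of some file contents."""
    return hashlib.sha256(data).hexdigest()


def verify(entry: SampleFile, data: bytes) -> bool:
    """Check that a local copy matches the checksum in the manifest."""
    return sha256_of(data) == entry.sha256


if __name__ == "__main__":
    contents = b">seq1\nACGT\n"  # stands in for the bytes of a real file
    entry = SampleFile(
        name="example.fasta",
        file_format="fasta",
        licence="public domain",
        provenance="hand-edited for illustration",
        sha256=sha256_of(contents),
    )
    print(entry.name, "unchanged:", verify(entry, contents))
```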
> I think it would be up to the individual projects to pull in any files > of interest for us in their own test suites (essentially coping > example files into their own repositories). > > > > What I think that could be useful for this project is a ticket > > tracking system, or better said a feature request system, to keep > > track of all the things needed. > > > The OBF already runs a bugzilla installation used by most of the Bio* > projects, which would probably be OK for this sort of thing. > > Peter > > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From pmr at ebi.ac.uk Thu Dec 4 16:49:57 2008 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 04 Dec 2008 16:49:57 +0000 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> Message-ID: <49380A35.10909@ebi.ac.uk> Giovanni Marco Dall'Olio wrote: > So, the point is.. what if we create a common repository for all this > kind of testing data, to be used in common with all the other Bio* > projects? > Wouldn't it be good if all the Bio* fasta parser are able to parse the > same files and give the same results, demonstrating that all of them > work fine or are wrong at the same time? > > I am doing this because me (and Tiago), in the biopython mailing list, would > like to develop a module to calculate Fst statistics over SNP data, and > there is no point of collecting some good test datasets and not sharing them > with other similar projects in other programming languages. 
> > The same goes for much of the documentation, like use cases: if we > collect a good base of use cases related to bioinformatics, it would > be easier to coordinate the efforts of all the Bio* projects and > compare the different approaches used to solve the same issue by the > different comunities. > > At the moment, I have created a simple git repository on github: > - http://github.com/dalloliogm/bio-test-datasets-repository > but , it is still empty and maybe github is not the ideal hosting for > such a project, since the free account has a 100MB space limit. The EMBOSS project on Open Bio has its own set of test cases for all applications, and validation for source code documentation and application documentation. Our tests run as perl scripts using scripts and data that are distributed with EMBOSS. We would be interested in joining a common effort. regards, Peter Rice From jason at bioperl.org Thu Dec 4 17:06:24 2008 From: jason at bioperl.org (Jason Stajich) Date: Thu, 4 Dec 2008 09:06:24 -0800 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> Message-ID: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> I don't know if this is really the best email list for this -- although not sure what other common list should be used. We actually a started a project like this many moons ago, but no one contributed examples... http://code.open-bio.org/cgi/viewcvs.cgi/biodata/ We can start a common SVN repository for this if you like or a github on OBF if that is more likely to garner contributions. In terms of documentation - you are certainly welcome to make a documentation repository but I would argue a wiki or wiki-like soln would be best for documentation. 
Whether a common wiki can be maintained among the projects (or merge the wikifarms someday) is something to contemplate too. -jason On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote: > Hi! > My name is Giovanni, I come from biopython's mailing list. > > I would like to make you a proposal. > Every module/program written in bioinformatics needs to be tested > before it can be used to produce results that can be published. > > For example, let's say I want to write another fasta file parser, like > SeqIO.FastaIO in biopython : I would have have to test the script > against some real fasta files, just to make sure that it doesn't parse > them in a wrong way, or that it losts data. > Or, let's say I want to write a script to calculate Fst statistics > over some population genetics data: I will have to compare the results > of my scripts against other programs, check if it gives me the right > result for a set for which I already know the Fst value, and maybe > ideate some other kind of checks to be sure my script doesn't do weird > things, like losing input data on the way. > > So, the point is.. what if we create a common repository for all this > kind of testing data, to be used in common with all the other Bio* > projects? > Wouldn't it be good if all the Bio* fasta parser are able to parse the > same files and give the same results, demonstrating that all of them > work fine or are wrong at the same time? > > I am doing this because me (and Tiago), in the biopython mailing > list, would > like to develop a module to calculate Fst statistics over SNP data, > and > there is no point of collecting some good test datasets and not > sharing them > with other similar projects in other programming languages. 
> > The same goes for much of the documentation, like use cases: if we > collect a good base of use cases related to bioinformatics, it would > be easier to coordinate the efforts of all the Bio* projects and > compare the different approaches used to solve the same issue by the > different comunities. > > At the moment, I have created a simple git repository on github: > - http://github.com/dalloliogm/bio-test-datasets-repository > but , it is still empty and maybe github is not the ideal hosting for > such a project, since the free account has a 100MB space limit. > > > -- > ----------------------------------------------------------- > > My Blog on Bioinformatics (italian): http://bioinfoblog.it > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l Jason Stajich jason at bioperl.org From biopython at maubp.freeserve.co.uk Thu Dec 4 17:26:06 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 17:26:06 +0000 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> Message-ID: <320fb6e00812040926g7ea92397r19af618d8b50d143@mail.gmail.com> On Thu, Dec 4, 2008 at 5:06 PM, Jason Stajich wrote: > > I don't know if this is really the best email list for this -- although not > sure what other common list should be used. I think I suggested trying this list to Giovanni - it looked like the best bet, although I suspect it has a fairly low subscriber count. > We actually a started a project like this many moons ago, but no one > contributed examples... > > http://code.open-bio.org/cgi/viewcvs.cgi/biodata/ > That was before I started using Biopython, so I'd never seen that. 
> We can start a common SVN repository for this if you like or a github on OBF > if that is more likely to garner contributions. Using an OBF repository would be nice, especially if developers from all the Bio* projects with existing CVS/SVN accounts automatically had write access to it. I've not really used git, but it might be more open for new-comers. > In terms of documentation - you are certainly welcome to make a > documentation repository but I would argue a wiki or wiki-like soln would be > best for documentation. > Whether a common wiki can be maintained among the projects (or merge the > wikifarms someday) is something to contemplate too. Given the OBF already has wiki software up and running, this does seem like a good choice for documentation. The BioPerl wiki already has a lot of useful stuff describing different file formats, and in most cases the text is independent of BioPerl. It would make sense to take these pages as a basis for a shared OBF wiki. I would think that ideally the Bio* projects could have a page on each file format describing how it is parsed with that tool kit, but citing a shared file format description page (or even embedding it on the fly). Peter From dalloliogm at gmail.com Thu Dec 4 18:06:08 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 4 Dec 2008 19:06:08 +0100 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> Message-ID: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com> On 12/4/08, Jason Stajich wrote: > I don't know if this is really the best email list for this -- although not > sure what other common list should be used. Hi Jason, thank you very much for the answer. 
I was going to post this mail again to the other OpenBio lists to see if more people were interested, but I decided to wait a bit, since I am not yet very confident with these concepts myself and wanted to study them a bit more. To be honest, I was going to discuss testing with some professional programmers I met some months ago, who seemed very confident with the concepts of testing, not to say obsessed with them. For example I have it not clear how a use case could be written to be the best useful for all the different Bio::* projects, and other things. > We actually a started a project like this many moons ago, but no one > contributed examples... > > http://code.open-bio.org/cgi/viewcvs.cgi/biodata/ I see, thank you very much for the link. On the biopython list they told me that a big issue is the license with which the data is released. I don't have any inconvenience in contributing examples with a GPL or without license, but I understand other people could do. Somebody told me that there were some interesting discussions on scipy.org, but I couldn't find them. > We can start a common SVN repository for this if you like or a github on > OBF if that is more likely to garner contributions. Well, to be honest I prefer git :). But any revision control system is fine for me; moreover, these are examples and not code, so the choice matters less. What I think that could be useful for this project is a ticket tracking system, or better said a feature request system, to keep track of all the things needed. I once used a system called assembla: - http://www.assembla.com/spaces/biotest/tickets which seems very nice to use, but it is not open source, and maybe bugzilla would suffice. > In terms of documentation - you are certainly welcome to make a > documentation repository but I would argue a wiki or wiki-like soln would be > best for documentation.
Well, basically I have three years in front of me in which I will work in the same field (I am a first year phd student in a population genetics laboratory) and in theory I will have to write a lot of test cases and controls anyway, which I don't mind contributing. However, as I was saying before I am not very experienced in writing use cases, and it will take me a bit (let's say some months) to learn how to write them well. > Whether a common wiki can be maintained among the projects (or merge the > wikifarms someday) is something to contemplate too. I agree, for example bioperl's wiki has many useful descriptions on file formats that the other bio*projects miss. > -jason > > > On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote: > > > > > > Hi! > > My name is Giovanni, I come from biopython's mailing list. > > > > I would like to make you a proposal. > > Every module/program written in bioinformatics needs to be tested > > before it can be used to produce results that can be published. > > > > For example, let's say I want to write another fasta file parser, like > > SeqIO.FastaIO in biopython : I would have have to test the script > > against some real fasta files, just to make sure that it doesn't parse > > them in a wrong way, or that it losts data. > > Or, let's say I want to write a script to calculate Fst statistics > > over some population genetics data: I will have to compare the results > > of my scripts against other programs, check if it gives me the right > > result for a set for which I already know the Fst value, and maybe > > ideate some other kind of checks to be sure my script doesn't do weird > > things, like losing input data on the way. > > > > So, the point is.. what if we create a common repository for all this > > kind of testing data, to be used in common with all the other Bio* > > projects? 
> > Wouldn't it be good if all the Bio* fasta parser are able to parse the > > same files and give the same results, demonstrating that all of them > > work fine or are wrong at the same time? > > > > I am doing this because me (and Tiago), in the biopython mailing list, > would > > like to develop a module to calculate Fst statistics over SNP data, and > > there is no point of collecting some good test datasets and not sharing > them > > with other similar projects in other programming languages. > > > > The same goes for much of the documentation, like use cases: if we > > collect a good base of use cases related to bioinformatics, it would > > be easier to coordinate the efforts of all the Bio* projects and > > compare the different approaches used to solve the same issue by the > > different comunities. > > > > At the moment, I have created a simple git repository on github: > > - > http://github.com/dalloliogm/bio-test-datasets-repository > > but , it is still empty and maybe github is not the ideal hosting for > > such a project, since the free account has a 100MB space limit. 
> > > > > > -- > > > ----------------------------------------------------------- > > > > My Blog on Bioinformatics (italian): http://bioinfoblog.it > > _______________________________________________ > > Open-Bio-l mailing list > > Open-Bio-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > > > Jason Stajich > jason at bioperl.org > > > > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Dec 4 18:43:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 18:43:51 +0000 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com> Message-ID: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com> Giovanni wrote: > For example I have it not clear how a use case could be written to be > the best useful for all the different Bio::* projects, and other > things. In terms of use cases, I would imagine things like the following: (1) Take a provided set of CDS nucleotide sequences in FASTA format, translate them using NCBI codon table 11 (bacteria), and output the results as a FASTA file of protein sequences. (2) Take a provided set of protein sequences, and do pairwise alignments between them all using the EMBOSS tool needle. (3) Take a provided FASTA file of proteins, and run ClustalW on it using the default settings. Take the multiple sequence alignment in ClustalW format, and covert it into Stockholm format. Then build a neighbour joining tree using quick-tree program (which cannot read in ClustalW files directly). Finally, load the tree file and produce a cladogram where the taxon/leaf XXX is highlighted in red. 
(4) Take a provided author name and keyword, and query the NCBI Entrez webinterface to get a list of matching papers. Download these references (maybe as MedLine format, maybe as XML) and parse the result into a CSV file for input into your reference manager (e.g. EndNote - or generate a bibtex file for use with LaTeX). (5) Taking a provided species name, and use NCBI Entrez to download all matching EST sequences to a FASTA format file. (6) Take a provided FASTA file of proteins and use standalone NCBI BLASTP to search them against the NR database using a expectation threshold of 10^-6 and at most ten alignments per query. Parse the results, and generate a new FASTA file of the protein sequences where the description line includes the protein identifiers of closely related entries found with BLAST. [A more sensible approach to automatic annotation would be nice, but more complicated] Ideally these would all need a short motivational section explaining why you might want to do this particular task. There is probably a balance between trivial and too complex. These could be compiled on a shared OBF wiki, together with any input files required. It would be up to the individual projects to write their own sample code to do this task - perhaps hosted on the Bio* project specific wiki pages, but linked to from the use case. Potentially this would be a huge project, but it would also be a nice resource [provided it was maintained and kept up to date as the toolkits evolve]. Perhaps this is too ambitious? > On the biopython list they told me that a big issue is the license > with which the data is released. I don't have any inconvenience in > contributing examples with a GPL or without license, but I understand > other people could do. > Somebody told me that there were some interesting discussions on > scipy.org, but I couldn't find them. Licensing and copyright are valid concerns. 
Also the different Bio* projects use different licenses - and I suspect none of them are compatible with the GPL. Any licence would have to allow all the Bio* projects to copy the example files into their code with no strings attached - ideally just "public domain" or MIT/BSD style. I would like to see a general collection of real world samples of each file format (these could be pointed to by any shared file format documentation). Between all the Bio* projects we probably have a good collection already - but the provenance of each file would have to be looked at as well as the licence. In addition, artificial hand edited files could be useful which include valid but unusual content to test the Bio* project's parsers. I don't think this actually needs to be in a repository, but that would be nice for tracking ownership. I think it would be up to the individual projects to pull in any files of interest for us in their own test suites (essentially coping example files into their own repositories). > What I think that could be useful for this project is a ticket > tracking system, or better said a feature request system, to keep > track of all the things needed. The OBF already runs a bugzilla installation used by most of the Bio* projects, which would probably be OK for this sort of thing. Peter
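Use case (1) in Peter's list is small enough to sketch without any toolkit. The following is a minimal, stdlib-only Python sketch under stated assumptions, not the Biopython, BioPerl or EMBOSS implementation. The function names (parse_fasta, translate, translate_fasta) and the toy record are hypothetical and were not part of the thread. The amino-acid string is the one NCBI lists for translation table 11 in TCAG codon order; the table's alternative start codons are ignored, so a CDS that starts with, say, GTG or TTG would begin with V or L here rather than M.

```python
from itertools import product

# Amino acids for NCBI translation table 11, listed in TCAG codon order.
# The assignments match the standard code; table 11 differs only in its
# permitted start codons, which this sketch deliberately ignores.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def translate(nucleotides):
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def translate_fasta(text, width=60):
    """Return protein FASTA text for CDS FASTA text, one record per input."""
    lines = []
    for description, seq in parse_fasta(text):
        protein = translate(seq)
        lines.append(">" + description)
        lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(lines) + "\n"


# Toy input invented for illustration: ATG GCC AAA TAA -> M A K stop
print(translate_fasta(">toy_cds hypothetical example\nATGGCCAAATAA\n"), end="")
```

Each Bio* project would presumably replace these helpers with its own parser and translation routine; the value of a shared repository would be that all of them run against the same input file.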
From dalloliogm at gmail.com Wed Dec 10 11:31:13 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 10 Dec 2008 12:31:13 +0100 Subject: [Open-bio-l] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com> References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com> <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org> <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com> <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com> Message-ID: <5aa3b3570812100331i1f1e34deic779820308c31a6a@mail.gmail.com> On 12/4/08, Peter wrote: > Giovanni wrote: > > For example I have it not clear how a use case could be written to be > > the best useful for all the different Bio::* projects, and other > > things. > > > In terms of use cases, I would imagine things like the following: > > (1) Take a provided set of CDS nucleotide sequences in FASTA format, > translate them using NCBI codon table 11 (bacteria), and output the > results as a FASTA file of protein sequences. > > (2) Take a provided set of protein sequences, and do pairwise > alignments between them all using the EMBOSS tool needle. > > (3) Take a provided FASTA file of proteins, and run ClustalW on it > using the default settings. Take the multiple sequence alignment in > ClustalW format, and covert it into Stockholm format. Then build a > neighbour joining tree using quick-tree program (which cannot read in > ClustalW files directly). Finally, load the tree file and produce a > cladogram where the taxon/leaf XXX is highlighted in red. > > (4) Take a provided author name and keyword, and query the NCBI Entrez > webinterface to get a list of matching papers. Download these > references (maybe as MedLine format, maybe as XML) and parse the > result into a CSV file for input into your reference manager (e.g. > EndNote - or generate a bibtex file for use with LaTeX). 
> > (5) Taking a provided species name, and use NCBI Entrez to download > all matching EST sequences to a FASTA format file. > > (6) Take a provided FASTA file of proteins and use standalone NCBI > BLASTP to search them against the NR database using a expectation > threshold of 10^-6 and at most ten alignments per query. Parse the > results, and generate a new FASTA file of the protein sequences where > the description line includes the protein identifiers of closely > related entries found with BLAST. [A more sensible approach to > automatic annotation would be nice, but more complicated] OK, these are good examples. I would also add a title to each (e.g. 1: Translating a CDS sequence), just for convenience. We could also add some examples of the expected outputs. Example: if you apply the procedure of case 1 to the file COX1_cds.fasta, you should obtain exactly the file COX1_protein.fasta. Moreover, a possible approach is to write a script that executes the same actions described in the use cases. I saw people doing this to test web applications (zope). They wrote scripts using perl's LWP or python webbrowser libraries to execute all the actions that a user can perform in a use case scenario. However, this is too much work for now; better to leave it for later. > Ideally these would all need a short motivational section explaining > why you might want to do this particular task. There is probably a > balance between trivial and too complex. > > These could be compiled on a shared OBF wiki, together with any input > files required. It would be up to the individual projects to write > their own sample code to do this task - perhaps hosted on the Bio* > project specific wiki pages, but linked to from the use case. Can we put it somewhere here: - http://www.open-bio.org/wiki/Main_Page ? > Potentially this would be a huge project, but it would also be a nice > resource [provided it was maintained and kept up to date as the > toolkits evolve]. Perhaps this is too ambitious?
Maybe it is :). It will be very difficult to keep up to date with everything, given the speed at which new technologies come out nowadays. However, I think that many people could be interested in contributing to it. And for many researchers, it could be easier to contribute a description of what they want to do with their data, rather than code. > > On the biopython list they told me that a big issue is the license > > with which the data is released. I don't have any inconvenience in > > contributing examples with a GPL or without license, but I understand > > other people could do. > > Somebody told me that there were some interesting discussions on > > scipy.org, but I couldn't find them. > > > Licensing and copyright are valid concerns. Also the different Bio* > projects use different licenses - and I suspect none of them are > compatible with the GPL. Any licence would have to allow all the Bio* > projects to copy the example files into their code with no strings > attached - ideally just "public domain" or MIT/BSD style. I agree with any open license; MIT/BSD should be OK. > I would like to see a general collection of real world samples of each > file format (these could be pointed to by any shared file format > documentation). Between all the Bio* projects we probably have a good > collection already - but the provenance of each file would have to be > looked at as well as the licence. In addition, artificial hand edited > files could be useful which include valid but unusual content to test > the Bio* project's parsers. I don't think this actually needs to be > in a repository, but that would be nice for tracking ownership. OK. In any case, wikis usually have a versioning system, so there is not much difference. > I think it would be up to the individual projects to pull in any files > of interest for us in their own test suites (essentially coping > example files into their own repositories).
> > > > What I think that could be useful for this project is a ticket > > tracking system, or better said a feature request system, to keep > > track of all the things needed. > > > The OBF already runs a bugzilla installation used by most of the Bio* > projects, which would probably be OK for this sort of thing. > > Peter > > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > -- My blog on bioinformatics (now in English): http://bioinfoblog.it
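Giovanni's suggestion of shipping an expected output with each use case (the procedure of case 1 applied to COX1_cds.fasta should give COX1_protein.fasta) could be checked mechanically. The following is only a rough Python sketch of such a check, assuming that two FASTA files count as equal when they hold the same identifiers and sequences in the same order, regardless of line wrapping, letter case or description text. The names read_fasta and fasta_differences are hypothetical, and the strings below are toy data, not the COX1 files mentioned in the thread.

```python
def read_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            # Keep only the first word of the description as identifier.
            records.append([fields[0] if fields else "", []])
        elif line and records:
            records[-1][1].append(line)
    return [(name, "".join(parts).upper()) for name, parts in records]


def fasta_differences(actual_text, expected_text):
    """Describe every record-level difference between two FASTA texts."""
    actual = read_fasta(actual_text)
    expected = read_fasta(expected_text)
    problems = []
    if len(actual) != len(expected):
        problems.append(
            "record count: got %d, expected %d" % (len(actual), len(expected))
        )
    for (got_id, got_seq), (want_id, want_seq) in zip(actual, expected):
        if got_id != want_id:
            problems.append("identifier: got %r, expected %r" % (got_id, want_id))
        elif got_seq != want_seq:
            problems.append("sequence differs for %r" % want_id)
    return problems


# Toy data invented for illustration; a real check would read the files
# produced by a toolkit and the reference file from the shared repository.
expected = ">p1 reference protein\nMAK\n"
print(fasta_differences(">p1\nma\nk\n", expected))
print(fasta_differences(">p1\nMAR\n", expected))
```

Comparing parsed records rather than raw bytes is a design choice: it lets toolkits that wrap lines at different widths still pass, at the cost of not testing the exact output formatting.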