[BioRuby] BioRuby & Google Summer of Code 2011

Fri Mar 25 12:19:36 EDT 2011

Dear All,
Our project is looking for students to participate in GSoC 2011, thanks to OBF and NESCENT.
Please feel free to forward this message to your university-ml, lab or local ruby group.

Use our ml to discuss ideas, and feel free to contact the mentors or any other member of the development team.

March 18-27: Would-be student participants discuss application ideas with mentoring organizations.
March 28: Student application period opens.
April 8 19:00 UTC Student application deadline.
from: http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timel

Our proposals
-links-
http://bioruby.open-bio.org/wiki/Google_Summer_of_Code#Proposal_2011
http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2011#BioRuby_forester

-text-
Proposal 2011

OBF
Support Next Generation Sequencing (NGS) in BioRuby

Rationale 
Processing and analyzing NGS data is challenging for a variety of reasons, in particular because the data sets are usually very large and contain a vast amount of information and a large amount of unknown data. Furthermore, there are many different approaches to NGS analysis, and several software tools need to be integrated to produce reliable results. Since this topic is so important for the BioRuby community, we started a sub-project, bioruby-ngs, for analyzing NGS data. The project is at an early stage of development, but notable results have been achieved quickly. Many topics still need to be addressed, in particular:
data and results reporting
workflow management
DSL for describing experimental designs
YALIMS (Yet Another LIMS), a simple web-based LIMS for raw dataset processing, with reporting and monitoring
Approach 
Due to the open nature of the project, the student will choose which feature he/she wants to develop and focus on. The student will learn the basic concepts of NGS data analysis and will work closely with a mentor to produce a working library that will be integrated into the BioRuby NGS project.
Difficulty and needed skills 
Medium to Hard depending on the topic selected.
The project requires
Ruby
Bash programming and knowledge of the Linux environment
Ruby on Rails 3.x
Mentors 
Raoul J.P. Bonnal, Francesco Strozzi
Project overview and updates 
[1]
Source code
https://github.com/helios/bioruby-ngs
BioRuby Wrapper for Command-Line Applications

Rationale 
The main reason for this project is the need to support different stand-alone applications that are critical for Next Generation Sequencing analyses. Binding directly to existing C/C++ source code or rewriting all the applications is impractical and a waste of resources. A quick solution is to use the stand-alone applications directly, integrating them into the BioRuby API. Some work has already been done in the BioRuby NGS project with this wrapper, but better support for I/O-demanding processes is required. Following this design pattern, it will also be possible to improve the support for other bioinformatics suites, such as EMBOSS, whose support in BioRuby is outdated at the time of this proposal.
Approach 
The student will become familiar with advanced meta-programming concepts in Ruby and will contribute to the definition of a DSL for this wrapping library. He/she will also build a parser to automatically define additional wrappers for the EMBOSS suite, starting from the ACD configuration files.
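To make the idea of a wrapping DSL more concrete, here is a minimal sketch of how meta-programming could turn option declarations into a command line. It is only an illustration: the CommandWrapper module, the program and option class methods, the Aligner class and the "my_aligner" program with its flags are all hypothetical names, not the API of the existing bioruby-ngs wrapper. It uses required keyword arguments, so it assumes a newer Ruby than 1.9.

```ruby
# Hypothetical sketch: none of these names come from bioruby-ngs.
module CommandWrapper
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Set or read the name of the external program, e.g. "my_aligner".
    def program(name = nil)
      @program = name if name
      @program
    end

    # Declared options, keyed by their Ruby-side name.
    def options
      @options ||= {}
    end

    # Declare an option and generate a reader/writer pair for it.
    # type is :value (flag followed by a value) or :switch (flag only).
    def option(name, flag:, type: :value)
      options[name] = { flag: flag, type: type }
      attr_accessor name
    end
  end

  # Build the argument vector; nothing is executed here.
  def to_argv(*args)
    argv = [self.class.program]
    self.class.options.each do |name, spec|
      value = public_send(name)
      next if value.nil? || value == false
      argv << spec[:flag]
      argv << value.to_s unless spec[:type] == :switch
    end
    argv + args.map(&:to_s)
  end
end

# A made-up tool described with the DSL.
class Aligner
  include CommandWrapper
  program "my_aligner"
  option :threads, flag: "-t"
  option :verbose, flag: "-v", type: :switch
end

aligner = Aligner.new
aligner.threads = 4
aligner.verbose = true
argv = aligner.to_argv("reads.fastq")
puts argv.join(" ")
```

Keeping the result as an array rather than a single string means it could later be handed to something like Open3.capture3(*argv) without going through a shell. A parser for ACD files could, in principle, emit the same option declarations automatically.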
Difficulty and needed skills 
Medium. Good Ruby knowledge and experience with meta-programming are required to achieve the goals.
The project requires
Ruby 1.9
Ruby Metaprogramming
Mentors 
Raoul J.P. Bonnal, Francesco Strozzi
Source code
https://github.com/helios/bioruby-ngs, wrapper branch
Represent bio-objects and related information with images

Rationale 
Most of the time, after a bioinformatics analysis, the resulting data needs to be re-processed into a graphical form, since we, as human beings, are more comfortable accessing results and data visually than browsing a huge table of interconnected information. Very often it is also difficult to extract the real biological meaning from a raw dataset. The main idea of this proposal is to define and attach graphical functions to BioRuby objects and, consequently, to the results computed by a generic process or pipeline. With this solution, it would be possible to explore them more naturally, and also to export and integrate the information into a web environment for sharing the knowledge and the results. For example, different objects storing alignment results could share the same interface and display their data in a common way. The same is true for other kinds of objects or computational procedures.
Approach 
The student and the mentor will define together a minimum set of features that need to be shared by the BioRuby objects and that could be visualized. Then the student will create a library/module to implement these graphical features within the BioRuby project. He/she will gain experience with Rubyvis as the graphical API and with Ruby on Rails for web visualization.
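As a rough illustration of what a shared graphical interface could look like, the sketch below uses a mixin that draws any object exposing a small data method. Everything in it is an assumption made for the example: the Drawable module, the plot_data contract and the HitScores class are hypothetical, and plain SVG strings stand in for Rubyvis to keep the sketch free of dependencies.

```ruby
# Hypothetical mixin; not part of BioRuby or Rubyvis.
# A host class only has to provide #plot_data, returning a Hash of
# label => non-negative number.
module Drawable
  BAR_HEIGHT = 20
  XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  # Render the host object's data as horizontal bars in an SVG string.
  def to_svg(width = 200)
    data = plot_data
    max = data.values.max.to_f
    bars = data.each_with_index.map do |(label, value), i|
      bar_width = max.zero? ? 0 : (value / max * width).round
      title = label.to_s.gsub(/[&<>"]/, XML_ESCAPES)
      %(<rect x="0" y="#{i * BAR_HEIGHT}" width="#{bar_width}" ) +
        %(height="#{BAR_HEIGHT - 2}"><title>#{title}</title></rect>)
    end
    %(<svg xmlns="http://www.w3.org/2000/svg" width="#{width}" ) +
      %(height="#{data.size * BAR_HEIGHT}">#{bars.join}</svg>)
  end
end

# A made-up result object: scores for a list of hits.
class HitScores
  include Drawable

  def initialize(scores)
    @scores = scores
  end

  def plot_data
    @scores
  end
end

svg = HitScores.new("hit1" => 50, "hit2" => 100).to_svg(200)
puts svg
```

Any other result class that implements plot_data would get the same to_svg method, which is the kind of common display interface the rationale describes. The resulting string could be embedded in a web page as-is.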
Difficulty and needed skills 
Medium/Hard. The student will need to define a graphical API and integrate the new code with the existing BioRuby modules. High-level coding skills will be required to create a clean API with clear documentation.
The project requires
Very good knowledge of Ruby (1.9) and design patterns
Basic concepts of graphics/visualization
Ruby on Rails basic knowledge
Mentors 
Raoul J.P. Bonnal, Christian Zmasek, Claudio Bustos (confirm)

Modular annotation knowledge base for BioRuby

Rationale 
Handling data sets coming from platforms for gene expression analysis or real-time PCR requires accessing the corresponding gene annotations several times during the measurements. This kind of information is normally stored in remote databases that provide the required knowledge and data. Problems arise when the available databases do not support a specific version of the data of interest, or when huge queries need to be submitted. A BioRuby knowledge base, designed to be modular and expandable over time, could solve these problems. A good compromise between performance and portability could be achieved by using embedded databases and accessing the data through a clean API.
Approach 
The student and the mentor will decide which platforms should be supported, based on their popularity. Then the student will retrieve the essential annotation and design a simple database schema to hold all the relevant non-redundant information. The schema will be flexible enough to allow the dataset to be linked to external databases or resources for subsequent analyses. After this phase of discovery and design, the student will build the database using SQLite and will write a Ruby library to access the data using the ActiveRecord ORM.
Difficulty and needed skills 
Medium. The student will need to define the core data to be included in the database and how this information will be organized and accessed by the end user. The Ruby library will be created using the powerful ActiveRecord paradigms, but good coding skills will be required to design an efficient API with clear documentation.
The project requires
Minimal SQL dialect
Good knowledge of Ruby
Experience in querying biological databases
Experience with annotation data
Mentors 
Raoul J.P. Bonnal, Francesco Strozzi
--------------

NESCENT

 BioRuby forester

Rationale 
Forester is a collection of software libraries, mostly written in Java, for comparative genomics and evolutionary biology research. A prominent example of a tool based on forester is the phylogenetic tree explorer Archaeopteryx. Most of forester's use-cases are associated with the use of evolutionary trees as tools for establishing (functional) relations between genes or proteins (for example protein function prediction with RIO) and comparing genome based features between different species. Therefore, it implements objects representing evolutionary trees overlaid with biological data from other sources (e.g. protein domain architectures), as well as algorithms operating on these, such as the automated inference of ancestral taxonomies on gene trees, which has proven useful in the functional interpretation of large gene trees.
Most of these methods are currently only accessible via the command-line or through the GUI of Archaeopteryx and therefore difficult or impossible to use from other computer programs or toolkits (such as BioRuby). Although forester is mostly written in Java, it also contains components in Ruby ("evoruby"). These implement operations on multiple sequence alignments (MSAs) that are crucial in the development of workflows for automated, large scale, phylogenetic inference, including I/O, and efficient MSA manipulation (such as deletion of all columns with a gap-portion larger than a given threshold, removal of short and/or redundant sequences).
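The MSA operations mentioned above can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: an alignment is modelled as a Hash of name => aligned sequence with "-" as the gap character, and the three function names are hypothetical, not taken from evoruby or BioRuby.

```ruby
# Hypothetical helpers; not the evoruby or BioRuby API.
# An alignment is a Hash of name => aligned sequence; all sequences have
# the same length and "-" marks a gap.

# Drop every column whose fraction of gaps exceeds max_gap_fraction.
def remove_gappy_columns(msa, max_gap_fraction)
  seqs = msa.values
  return msa.dup if seqs.empty?
  keep = (0...seqs.first.length).select do |col|
    gaps = seqs.count { |seq| seq[col] == "-" }
    gaps.to_f / seqs.size <= max_gap_fraction
  end
  msa.each_with_object({}) do |(name, seq), out|
    out[name] = keep.map { |col| seq[col] }.join
  end
end

# Drop sequences with fewer than min_residues non-gap characters.
def remove_short_sequences(msa, min_residues)
  msa.select { |_name, seq| seq.delete("-").length >= min_residues }
end

# Keep only the first of any group of identical aligned sequences.
def remove_redundant_sequences(msa)
  seen = {}
  msa.reject do |_name, seq|
    duplicate = seen.key?(seq)
    seen[seq] = true
    duplicate
  end
end

msa = {
  "seq1" => "AC-GT",
  "seq2" => "AC-GA",
  "seq3" => "A---A",
  "seq4" => "AC-GT"
}

trimmed = remove_gappy_columns(msa, 0.5)
filtered = remove_redundant_sequences(remove_short_sequences(trimmed, 3))
filtered.each { |name, seq| puts "#{name} #{seq}" }
```

In this toy alignment the third column is all gaps and is removed, seq3 is then too short, and seq4 duplicates seq1, so only seq1 (ACGT) and seq2 (ACGA) remain. Here "redundant" is taken to mean exact duplicates; a real implementation might use an identity threshold instead.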
Approach 
The goal would be to develop a framework for accessing forester's central algorithms and applications from within BioRuby. It is expected that this project will be implemented in the form of a BioRuby plugin, in order to avoid creating additional dependencies for the main BioRuby distribution. Full two-way access between the Java and Ruby languages can be accomplished by using JRuby as the underlying platform.
Depending on the level of experience and skills of a student, a project proposal could also include either or both of the following additional goals.
BioRuby and the "evoruby" components of forester partially overlap in functionality. You could incorporate MSA management functionality present in "evoruby" but missing in BioRuby into the BioRuby distribution. This would not only make that functionality immediately accessible to all BioRuby users, but would also allow a larger community of developers to participate in maintenance and future development of these components.
Display gene conversions. This would entail developing a parser for GENECONV output and using the newly developed BioRuby-forester link to display gene conversions directly within Archaeopteryx.
Challenges
The student needs to learn two disparate toolkits, BioRuby and forester.
The project involves two programming languages, Ruby and Java.
Need to understand the BioRuby plugin system.
Involved toolkits or projects 
BioRuby
BioRuby plugin system
RubyGems
JRuby
forester
Degree of difficulty and needed skills 
Expected difficulty: Medium. Proficiency in at least one of the two involved programming languages, Ruby and Java, is necessary. Experience/interest in molecular evolution or comparative genomics is required, and experience with BioRuby or forester will help.
Mentors 
Christian Zmasek, Pjotr Prins, Raoul J.P. Bonnal

--
Ra

linkedin: http://it.linkedin.com/in/raoulbonnal
twitter: http://twitter.com/ilpuccio
skype: ilpuccio
irc.freenode.net: Helius
github: https://github.com/helios