[Biopython] Feature selection techniques modules

Sun Feb 6 20:37:01 UTC 2011

Hello everyone,

I am an MSc student at the University of Patras, Greece, working in the
research field of bioinformatics. I recently became a member of the OBF and
I appreciate the open source work of your OBF project.

I had a discussion with Mr. Robert Buels about this year's GSoC, because I
am looking forward to applying and I found that the OBF would be the
organization most suitable for me. I have been browsing the projects
announced in previous years and found them very interesting. As this
year's potential projects have not been announced yet, I wanted to share
an idea of mine, briefly describe what I am thinking of doing, and ask
whether you think it is a good idea and worth submitting as an application
on this subject after March 28.

I think that feature selection (FS) techniques have become very important
in many bioinformatics applications. In many cases (such as protein
interaction prediction), you have to find a way to collect the set of
features that leads to the best classification performance. I looked through
the Biopython libraries and did not find anything related to applying FS
techniques to a dataset of features (such as the t-test, ANOVA, Wilcoxon,
CFS, etc.). Hence, I think that creating a library focused on FS
techniques would be a good idea. Moreover, that library could have a
hierarchical structure, as there are different types of FS techniques:
filter, wrapper and embedded techniques. Furthermore, each type is
divided into further groups (e.g. filter methods are divided into univariate
and multivariate methods, according to whether they consider feature
dependencies), and so on.
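
To make the univariate filter idea more concrete, here is a minimal sketch
in plain Python. It scores each feature independently with Welch's t
statistic between two classes and ranks the features by absolute score. The
function names (welch_t, rank_features) and the toy data are only
assumptions for illustration; nothing like this exists in Biopython yet.

```python
import math


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Unbiased sample variances (each sample needs at least two values).
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def rank_features(rows, labels):
    """Return feature indices sorted by decreasing |t| between two classes."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(rows[0])):
        a = [row[j] for row, y in zip(rows, labels) if y == classes[0]]
        b = [row[j] for row, y in zip(rows, labels) if y == classes[1]]
        scores.append((abs(welch_t(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0],
        [3.0, 4.5], [3.1, 3.5], [2.9, 4.2]]
labels = [0, 0, 0, 1, 1, 1]
print(rank_features(rows, labels))
```

Because each feature is scored on its own, this is a univariate method; a
multivariate filter such as CFS would also have to look at the dependencies
between features.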

Some of the methods I am thinking of implementing are:

T-test, ANOVA, Gamma, bivariate methods, CFS and MRMR, which are some
well-known filter feature selection techniques.
In wrapper and embedded methods, classifiers are used in the
process of feature selection, so we have techniques based on genetic
algorithms, random forests, logistic regression, decision tree learners,
Bayesian classifiers, etc. In this case, the existing Biopython modules
Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used.
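
As a rough illustration of the wrapper idea, the sketch below runs a greedy
forward selection in which a classifier scores each candidate feature
subset. To keep it self-contained I use a tiny nearest-centroid classifier
with leave-one-out accuracy as a stand-in; this is an assumption of the
sketch, and in a real module one of the Biopython classifiers mentioned
above could take its place. All function names here are hypothetical.

```python
def nearest_centroid_predict(train_rows, train_labels, query):
    """Predict the label whose class centroid is closest to query."""
    sums, counts = {}, {}
    for row, y in zip(train_rows, train_labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for k, v in enumerate(row):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    best_label, best_dist = None, None
    for y in sorted(sums):
        # Squared Euclidean distance from the query to the class centroid.
        dist = sum((s / counts[y] - q) ** 2 for s, q in zip(sums[y], query))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = y, dist
    return best_label


def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy using only the feature indices in subset."""
    projected = [[row[j] for j in subset] for row in rows]
    hits = 0
    for i in range(len(rows)):
        train_rows = projected[:i] + projected[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predicted = nearest_centroid_predict(train_rows, train_labels,
                                             projected[i])
        if predicted == labels[i]:
            hits += 1
    return hits / len(rows)


def forward_selection(rows, labels, max_features):
    """Greedily add the feature that most improves leave-one-out accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_features:
        candidate, candidate_score = None, best_score
        for j in remaining:
            score = loo_accuracy(rows, labels, selected + [j])
            if score > candidate_score:
                candidate, candidate_score = j, score
        if candidate is None:  # no remaining feature improves accuracy
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical toy data; labels must be a list so it can be sliced.
rows = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0],
        [3.0, 4.5], [3.1, 3.5], [2.9, 4.2]]
labels = [0, 0, 0, 1, 1, 1]
print(forward_selection(rows, labels, max_features=2))
```

The classifier is retrained for every candidate subset, which is what makes
wrapper methods more expensive than filters but also lets them account for
how features work together.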

More information on the techniques I describe can be found at the following
links:

http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf

New functions computing the above measures can be created. The calculation
can be done between vectors of features, between a feature vector and the
output vector, or over a large dataset (with many features) read
from a file, on which we want to perform feature selection.
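
A minimal sketch of the "feature vector against output vector" case could
look like the following. It reads a small comma-separated table from a file
handle and computes the Pearson correlation of every feature column with the
output column. The file layout (a header row, numeric values, an output
column named "label") and the function names are assumptions made only for
this example, and Pearson correlation is just one possible measure.

```python
import csv
import io
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def score_features(handle, output_column):
    """Read a CSV table with a header row; score each feature vs. the output."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames}
    for record in reader:
        for name, value in record.items():
            columns[name].append(float(value))
    target = columns.pop(output_column)
    return {name: pearson(values, target) for name, values in columns.items()}


# Hypothetical data; a real dataset would be opened with open(filename).
example = io.StringIO(
    "f1,f2,label\n"
    "1.0,5.0,0\n"
    "2.0,3.0,0\n"
    "3.0,4.0,1\n"
    "4.0,4.5,1\n"
)
for name, score in score_features(example, "label").items():
    print(name, round(score, 3))
```

The same pearson function can be called on two feature columns to measure
redundancy between features, which is the other case mentioned above.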

I am sending you this email to briefly present my idea. Please let me
know what you think about it and whether it is worth proposing as one of
my student applications for GSoC 2011 to the Open Bioinformatics
Foundation. If you would like any further details about my thinking, just
ask! :-)

Looking forward to hearing from you,
Chris Dim