[Bioperl-l] clustering algorithms in BioPerl

Kevin Clancy kpclancy@hotmail.com
Mon, 26 Feb 2001 13:03:13 -0700


>* Do you have any suggestions? I'm thinking in terms of
>	- Naming schemes
There are several naming schemes to consider here: you need to check the 
chip names, the sample names and the probe names on the individual chips. 
You should also consider what you will do if you start merging data 
from multiple sources. For example, if you merge two chips with common 
probes, how will you differentiate between them? And if a probe appears in 
multiple positions on one chip, how will you name the copies to keep them 
separate (add positional information, create a versioning system)?
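
One way to picture the positional-information option is a composite key per 
spot. This is only a rough sketch in Python (the same idea maps onto Perl 
hashes); the ProbeKey type, its fields, and the chip and gene names are all 
made up for illustration:

```python
from collections import defaultdict
from typing import NamedTuple


class ProbeKey(NamedTuple):
    """Hypothetical composite identifier for one spot on one chip."""
    chip: str   # chip (array) name
    probe: str  # probe name as given by the data source
    row: int    # positional information on the chip
    col: int

    def label(self) -> str:
        # A readable name that stays unique even for repeated probes.
        return f"{self.chip}:{self.probe}@{self.row},{self.col}"


def index_by_probe(keys):
    """Group spots by probe name so replicates and cross-chip copies stay linked."""
    groups = defaultdict(list)
    for key in keys:
        groups[key.probe].append(key)
    return dict(groups)


# Made-up example data: one probe spotted twice on chipA and once on chipB.
spots = [
    ProbeKey("chipA", "geneX", 1, 1),
    ProbeKey("chipA", "geneX", 7, 3),
    ProbeKey("chipB", "geneX", 2, 5),
    ProbeKey("chipB", "geneY", 2, 6),
]
by_probe = index_by_probe(spots)
for key in by_probe["geneX"]:
    print(key.label())
```

Keeping the chip name and position in the key means merged data sets never 
collide, while the grouping by probe name still lets you find the copies.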

>	- Particular algorithms which should be implemented as a priority
Start with the simple stuff, e.g. sorting and filtering. Make sure that you 
can deal well with Venn-like manipulations of data, e.g. intersections, 
unions, etc. From what I have seen, the main algorithms are hierarchical and 
non-hierarchical clustering. Hierarchical is more qualitative: you see a 
pretty tree, but is the tree really relevant? Non-hierarchical is more 
quantitative, but you don't see the relation of the clustered data with each 
other.
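
To make the "simple stuff" concrete, here is a minimal sketch of the 
Venn-like operations plus a filter-and-sort step, in Python. The probe names, 
signal values and the filter_and_sort helper are invented for the example:

```python
# Made-up probe sets from two chips.
chip_a = {"gene1", "gene2", "gene3", "gene4"}
chip_b = {"gene3", "gene4", "gene5"}

common = chip_a & chip_b   # intersection: probes on both chips
either = chip_a | chip_b   # union: probes on at least one chip
only_a = chip_a - chip_b   # difference: probes unique to chip A


def filter_and_sort(signal, threshold):
    """Keep probes whose signal is at least `threshold`, strongest first."""
    kept = {probe: value for probe, value in signal.items() if value >= threshold}
    return sorted(kept, key=kept.get, reverse=True)


# Made-up signal values for chip A.
signal = {"gene1": 0.4, "gene2": 2.5, "gene3": 1.1, "gene4": 3.0}
print(sorted(common), sorted(only_a))
print(filter_and_sort(signal, 1.0))
```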

One approach that I like is to use either a hierarchical or a non-hierarchical 
clustering algorithm and follow it with a separate type of test, e.g. a 
self-organizing map, a graphical display of the data in that cluster, etc. 
Then you can start to confirm your result by using independent tests.
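
However the second result is produced, one simple way to quantify how well 
two groupings agree is the Rand index (the fraction of item pairs that both 
groupings treat the same way). This is my own illustration rather than 
anything specific to microarrays, and the two label sets below are invented:

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair counts as agreement when both partitions put the two items
    together, or both keep them apart.
    """
    if set(labels_a) != set(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(sorted(labels_a), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Made-up cluster labels from two independent methods.
from_tree = {"gene1": 0, "gene2": 0, "gene3": 1, "gene4": 1, "gene5": 0}
from_map = {"gene1": "a", "gene2": "a", "gene3": "b", "gene4": "b", "gene5": "b"}
print(rand_index(from_tree, from_map))
```

A value near 1 says the two methods largely agree; a low value is a hint that 
the clusters deserve a closer look before you trust them.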

>	- Other possible applications, which I should keep in mind
Consider how to look at control data for a particular sample and compare it 
with the remaining data;
Ability to integrate new statistics into the module(s) easily;
Ability to use MGED or GEML XML data and convert it to your working chip 
data, and vice versa;
Possibly the ability to create a new chip, e.g. take all the transcribed 
regions of a genome, identify the best probes for each transcript, and plot 
these on a chip with appropriate controls.

>	- Pitfalls I should look out for
Size - expression experiments can get quite large;
Statistics - keeping track of several thousand statistical manipulations can 
get interesting;
History - ability to go back to a particular point in the analysis pathway 
is important.
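
For the history point, a minimal sketch of what I have in mind, in Python. 
The AnalysisHistory class, its method names and the two example steps are 
hypothetical. Note that it keeps a full copy of the data at every step, which 
runs straight into the size pitfall, so a real module might store only the 
step names and parameters and replay them instead:

```python
import copy
import math


class AnalysisHistory:
    """Hypothetical log of analysis steps with a snapshot after each one."""

    def __init__(self, data):
        self._steps = [("initial", {}, copy.deepcopy(data))]

    def apply(self, name, func, **params):
        """Run func on a copy of the latest data and record the result."""
        current = copy.deepcopy(self._steps[-1][2])
        result = func(current, **params)
        self._steps.append((name, params, result))
        return result

    def names(self):
        return [step[0] for step in self._steps]

    def rollback(self, name):
        """Discard everything after the latest step called `name`."""
        for i in range(len(self._steps) - 1, -1, -1):
            if self._steps[i][0] == name:
                del self._steps[i + 1:]
                return copy.deepcopy(self._steps[i][2])
        raise KeyError(name)


def log2_transform(data):
    return {probe: math.log2(value) for probe, value in data.items()}


def keep_at_least(data, min_value):
    return {probe: v for probe, v in data.items() if v >= min_value}


# Made-up intensities.
history = AnalysisHistory({"gene1": 8.0, "gene2": 2.0, "gene3": 0.5})
history.apply("log2", log2_transform)
filtered = history.apply("filter", keep_at_least, min_value=0.0)
restored = history.rollback("log2")
print(history.names(), sorted(filtered), restored)
```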

Good luck!


>From: Frank Gibbons <francis_gibbons@hms.harvard.edu>
>To: bioperl <bioperl-l@bioperl.org>
>Subject: [Bioperl-l] clustering algorithms in BioPerl
>Date: Fri, 23 Feb 2001 11:42:56 -0500
>
>Hi,
>
>I've been lurking for about a month. I've checked out the BioPerl homepage,
>including the list of projects. I notice that the bias is heavily towards
>sequence analysis (naturally).
>
>Right now I'm working on implementing a few clustering algorithms (priority
>#3 on the list of projects) in Perl, for use with DNA microarray data
>(priority #4 on the list!). The algorithms themselves are quite general, and
>have been around for a while, but I can find few references to
>implementations of them in Perl. (I have seen mention of Jong Park's
>Geanfammer package, as a possible source for clustering, but as far as I can
>see he implements only single-linkage clustering there.) I think they would
>be quite useful to the Perl community as a whole, and I would like to write
>them in as generic a way as possible, which is why I'm writing to the list
>now, having implemented only one particular algorithm, before I write any
>more!
>
>So, my questions are:
>
>* Is this appropriate for BioPerl in the first place? Would it be more
>suitable for CPAN? The algorithms are general, but my focus is on
>BioInformatics.
>
>* If so, does any one know of other work that may have been done in this
>area, on which I could build/integrate with?
>
>* Do you have any suggestions? I'm thinking in terms of
>	- Naming schemes
>	- Particular algorithms which should be implemented as a priority
>	- Other possible applications, which I should keep in mind
>	- Pitfalls I should look out for
>
>
>Thanks for your input,
>
>Frank Gibbons
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>PhD, Computational Biologist, Harvard Medical School
>Dept of Biological Chemistry and Molecular Pharmacology
>240 Longwood Avenue, C-125, Boston, MA 02115
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l@bioperl.org
>http://bioperl.org/mailman/listinfo/bioperl-l
