[Biojava-l] calculating properties of SymbolLists
Matthew Pocock
mrp@sanger.ac.uk
Mon, 15 May 2000 15:01:37 +0100
Gerald,
Gerald Loeffler wrote:
> Matthew Pocock wrote:
> >
> > Dear Gerald,
> >
> > Gerald Loeffler wrote:
> >
<snip/>
>
> > > public interface SymbolListPropertyCalculator {
> > >   /**
> > >    * Calculate a property of the given symbol list and return it as an
> > >    * Object.
> > >    * @param sl the symbol list whose property should be calculated (may
> > >    *           not be null, otherwise an IllegalArgumentException is thrown)
> > >    * @return the calculated property. Never returns null.
> > >    * @exception BioException if calculating the property was not
> > >    *            possible due to misuse of the method, e.g. because the
> > >    *            implementation requires that the symbol list be over a
> > >    *            specific Alphabet but sl does not fulfill this precondition.
> > >    */
> > >   Object calculateProperty(SymbolList sl) throws BioException;
> > > }
> > >
> >
> > Under what circumstances would you return something that needed to be an
> > Object? The properties that you have outlined are both double values. If
> > this interface is to be useful for things like GUIs or configuration
> > scripts, then maybe:
> >
> > /**
> >  * @throws IllegalAlphabetException if sl is over the wrong alphabet for this
> >  *         metric
> >  * @throws BioException if for any reason the metric could not be calculated
> >  */
> > double calculateProperty(SymbolList sl) throws IllegalAlphabetException,
> >     BioException;
> >
> > would be more appropriate.
>
> I had in mind that there are probably many properties of a sequence that
> are of a more complex nature and could hence not be represented as a
> simple floating point value, e.g. amino acid distribution (symbol
> distribution), hydrophobicity plot (as a function of symbol position),
> secondary structure class of the whole protein (with only 3 or 4 allowed
> values), ... the question is whether it makes sense to provide one
> interface to deal with all of these different kinds of properties...
>
There are two issues here: an interface for returning per-sequence information, and
an interface for representing position-dependent information, which I think we
already have (SymbolList and Alignment).
1) AA distribution can be cleanly represented by using the current EmissionState
objects in org.biojava.bio.dp although calculating the counts from a single
sequence could be wrapped up inside a static helper method.
2) Hydrophobicity is, as you say, a function of the symbol at a given position.
Use the 'TranslatedSymbolList', 'TranslationTable' and 'DoubleAlphabet' (all in
org.biojava.symbol) to build this functionality. Your particular TranslationTable
object will return the hydrophobicity for each amino acid, and the
TranslatedSymbolList will allow you to view an underlying protein sequence as a
series of DoubleResidue objects representing the hydrophobicity plot. The
particular translation table for 'hydrophobicity' could be made a public resource
in ProteinTools.
3) Secondary structure can be represented as a SymbolList over a small alphabet of
secondary-structure elements (nothing, alpha, beta, coil etc.). We should add this
as a standard alphabet. To tie a sequence and its secondary structure together,
create an Alignment object containing the two items.
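
A rough sketch of points 1) and 2), in plain Java rather than against the real
BioJava classes. Everything named here is made up for illustration: a String
stands in for a SymbolList, `PropertySketch`, `countSymbols` and `translate` are
hypothetical names, and the four-entry `HYDROPHOBICITY` table holds placeholder
values that should be checked against a published scale before any real use.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain-Java stand-ins, not the BioJava API.
class PropertySketch {
    // Hypothetical translation table: residue -> hydrophobicity.
    // Only four residues are listed; the values are placeholders.
    static final Map<Character, Double> HYDROPHOBICITY = new TreeMap<>();
    static {
        HYDROPHOBICITY.put('A', 1.8);
        HYDROPHOBICITY.put('I', 4.5);
        HYDROPHOBICITY.put('K', -3.9);
        HYDROPHOBICITY.put('R', -4.5);
    }

    // Point 1: a static helper that counts the symbols in a single sequence.
    static Map<Character, Integer> countSymbols(String seq) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : seq.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Point 2: a read-only view that translates each position on demand,
    // so the underlying sequence is never copied.
    static List<Double> translate(final String seq,
                                  final Map<Character, Double> table) {
        return new AbstractList<Double>() {
            @Override
            public Double get(int i) {
                Double value = table.get(seq.charAt(i));
                if (value == null) {
                    throw new IllegalArgumentException(
                        "No table entry for symbol " + seq.charAt(i));
                }
                return value;
            }

            @Override
            public int size() {
                return seq.length();
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(countSymbols("AKIAR"));
        System.out.println(translate("AKIAR", HYDROPHOBICITY));
    }
}
```

The view does its lookup per position, which is the same idea as viewing a
protein sequence through a translation table: swapping in a different table
gives a different plot over the same underlying sequence.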
>
> >
> > >
> > > Then one could add implementations like
> > >
> > > MolecularWeightCalculator (returns the MW in kD as a Double)
> > > IsoelectricPointCalculator (requires the symbol list to be over the
> > > protein alphabet, return the pI as a Double)
<snip/>
>
> > > 3) in which package should this stuff go?
> >
> > I would make the interface public to org.biojava.bio.sequence. These two
> > implementations could become public classes in sequence, but I would prefer
> > instances of them to be public static final properties of ProteinTools. I may
> > be off the wall here :-)
>
> so we would have both: public classes and (since the classes are
> essentially singletons), a static final member of ProteinTools for each
> of the classes?
Well, you could do that. Or - the interface can be public, but the particular
implementations for your metrics could be package-private and only accessible via
singleton instances in ProteinTools. It depends if you need to expose any extra
API to do things like modify a molecular-weight calculator.
If a particular metric is mutable, then you could make that metric class public,
but let it provide a singleton instance for the common cases.
But, the most important thing is to get the functionality out there. We can always
re-package it as-and-when.
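
For what it is worth, the "public interface, hidden implementation, singleton in
a tools class" arrangement could look roughly like the sketch below. None of
these names are BioJava API: `PropertyCalculator`, `IllegalAlphabetSketchException`,
`ChargedFraction` and `ProteinToolsSketch` are invented for the example, a String
stands in for a SymbolList, and the "charged fraction" metric is just a simple
stand-in for something like molecular weight or pI.

```java
// Illustration only: simplified stand-ins, not the real BioJava types.

// Hypothetical checked exception for "sequence is over the wrong alphabet".
class IllegalAlphabetSketchException extends Exception {
    IllegalAlphabetSketchException(String message) {
        super(message);
    }
}

// The part callers program against: a metric that returns a double.
interface PropertyCalculator {
    double calculateProperty(String seq) throws IllegalAlphabetSketchException;
}

// No public modifier: in a real code base this class would be visible only
// inside its package and reached through the tools class below.
class ChargedFraction implements PropertyCalculator {
    private static final String PROTEIN = "ACDEFGHIKLMNPQRSTVWY";
    private static final String CHARGED = "DEKR";

    @Override
    public double calculateProperty(String seq)
            throws IllegalAlphabetSketchException {
        if (seq == null || seq.isEmpty()) {
            throw new IllegalArgumentException("seq must be non-empty");
        }
        int charged = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            if (PROTEIN.indexOf(c) < 0) {
                throw new IllegalAlphabetSketchException(
                    "Not a protein symbol: " + c);
            }
            if (CHARGED.indexOf(c) >= 0) {
                charged++;
            }
        }
        return (double) charged / seq.length();
    }
}

// Stand-in for a tools class exposing one shared, immutable instance.
final class ProteinToolsSketch {
    static final PropertyCalculator CHARGED_FRACTION = new ChargedFraction();

    private ProteinToolsSketch() {
        // static members only
    }

    public static void main(String[] args) throws Exception {
        System.out.println(CHARGED_FRACTION.calculateProperty("AKDE"));
    }
}
```

Because the calculator holds no mutable state, a single shared instance is
safe. A metric with adjustable parameters would be the case for making the
class itself public.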
>
>
> cheers,
> gerald
Matthew