[Bioperl-l] Questions on Representing Protein Ambiguity

James Thompson tex at biosysadmin.com
Thu Sep 30 22:49:40 EDT 2004


Bioperl-ers,

I'm currently working on an implementation of a protein-specific Position
Specific Matrix based on the design in Bio::Matrix::PSM::SiteMatrix and
SiteMatrixI. I'm uncertain about the best way to calculate a consensus
sequence from a protein PSM, and I'd appreciate some input.

When dealing with nucleotides in a consensus sequence, it's possible to use
IUB ambiguity codes for ambiguous positions (for example, an 'S' maps to a 'C'
or 'G'). With protein alphabets it's not possible to represent all of the
combinations with a single letter, and I'm not certain exactly how to deal with
this.
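
To make the counting argument concrete, here is a small sketch. It is written in
Python purely for illustration, and none of the names in it come from Bioperl.
It lists the IUB/IUPAC nucleotide codes and compares how many non-empty residue
subsets each alphabet would need a symbol for, assuming the 20 standard amino
acids.

```python
# IUB/IUPAC nucleotide codes: every non-empty subset of {A, C, G, T}
# has its own single-letter symbol.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def nonempty_subsets(alphabet_size):
    """Number of distinct non-empty residue sets over an alphabet."""
    return 2 ** alphabet_size - 1


nucleotide_subsets = nonempty_subsets(4)   # one per IUPAC code above
protein_subsets = nonempty_subsets(20)     # assumes the 20 standard amino acids

print(len(IUPAC_NUCLEOTIDE), nucleotide_subsets, protein_subsets)
```

Four bases give 15 possible sets, which fits comfortably in the alphabet; twenty
amino acids give over a million, which cannot be covered by single letters.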

One solution (currently in ProtMatrix.pm) simply takes the most probable amino
acid at each position and puts an 'X' in any position where that probability
falls below a threshold. This isn't bad, but it loses the information about
amino acids that are above the threshold but not the most probable.
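
A minimal sketch of that approach, in Python for illustration only. This is not
the ProtMatrix.pm code; the function name, the column layout (one dict of
residue probabilities per position), and the example numbers are all made up
for the example.

```python
def consensus(columns, threshold=0.5):
    """Return one letter per position: the most probable residue,
    or 'X' when that residue's probability is below the threshold.

    columns is a hypothetical representation of a protein PSM:
    a list of dicts mapping residue letter -> probability.
    """
    letters = []
    for column in columns:
        residue, prob = max(column.items(), key=lambda item: item[1])
        letters.append(residue if prob >= threshold else "X")
    return "".join(letters)


# Invented example matrix, five positions long.
example_columns = [
    {"M": 0.9, "L": 0.1},
    {"E": 0.5, "S": 0.4, "D": 0.1},
    {"N": 0.8, "D": 0.2},
    {"I": 0.4, "A": 0.3, "P": 0.3},
    {"S": 1.0},
]

print(consensus(example_columns, threshold=0.5))  # prints MENXS
```

Note how the second position reports only 'E' even though 'S' is nearly as
likely, and the fourth collapses to 'X' although three residues share it.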

An alternative would be to borrow an idea from Perl's regex character classes
and represent multiple residues at a position inside of a set of brackets, like
this:

M[ES]N[IAP]S

However, this may not be compatible with what people expect out of a consensus
sequence.

One compromise that I could think of would be to leave the consensus sequence
code alone and make the regexp method of ProtMatrix take a threshold as an
argument. That way consensus sequences have one letter per position, but
there's still a way for a user to get at the extra information if desired.
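
Roughly what I have in mind, sketched in Python for illustration. The function
name, its signature, the column layout, and the choice of '.' for positions
with no residue above the threshold are all assumptions for the example, not
the existing ProtMatrix interface.

```python
import re


def psm_regexp(columns, threshold=0.25):
    """Build a regex string with one element per position.

    Residues at or above the threshold are kept, most probable first.
    One survivor becomes a plain letter, several become a bracketed
    character class, and none becomes '.' (any residue).
    """
    parts = []
    for column in columns:
        kept = sorted(
            (r for r, p in column.items() if p >= threshold),
            key=lambda r: (-column[r], r),
        )
        if not kept:
            parts.append(".")
        elif len(kept) == 1:
            parts.append(kept[0])
        else:
            parts.append("[" + "".join(kept) + "]")
    return "".join(parts)


# Invented example matrix: list of dicts, residue -> probability.
example_columns = [
    {"M": 0.9, "L": 0.1},
    {"E": 0.5, "S": 0.4, "D": 0.1},
    {"N": 0.8, "D": 0.2},
    {"I": 0.4, "A": 0.3, "P": 0.3},
    {"S": 1.0},
]

pattern = psm_regexp(example_columns, threshold=0.25)
print(pattern)                                   # prints M[ES]N[IAP]S
print(bool(re.fullmatch(pattern, "MSNPS")))      # prints True
```

Raising the threshold makes the pattern stricter, so the caller decides how
much ambiguity to expose while the consensus string stays one letter per
position.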

Does this compromise sound reasonable? Any input on the subject is greatly
appreciated. :)

Thanks,

James Thompson



