[Biopython-dev] Method to weight sequences in an alignment
Eric Talevich
eric.talevich at gmail.com
Mon Apr 9 18:25:04 UTC 2012
Folks,
I've written a function to weight sequences according to the simple scheme
used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment
objects or lists of plain strings, and could be added as a method with
minimal changes (for Python 2.5 compatibility, mainly). Any interest in
adding it to Biopython?
The code is below.
Cheers,
Eric
[*] Henikoff & Henikoff (1994): Position-based sequence weights.
http://www.ncbi.nlm.nih.gov/pubmed/7966282
----
def sequence_weights(aln):
"""Weight aligned sequences to emphasize more divergent members.
Returns a list of floating-point numbers between 0 and 1, corresponding
to
the proportional weight of each sequence in the alignment. The first
list
is the weight of the first sequence in the alignment, and so on. Weights
sum to 1.0.
Method: At each column position, award each different residue an equal
share of the weight, and then divide that weight equally among the
sequences sharing the same residue. For each sequence, sum the
contributions from each position to give a sequence weight.
See Henikoff & Henikoff (1994): Position-based sequence weights.
"""
def col_weight(column):
"""Represent the diversity at a position.
Award each different residue an equal share of the weight, and then
divide that weight equally among the sequences sharing the same
residue.
So, if in a position of a multiple alignment, r different residues
are represented, a residue represented in only one sequence
contributes
a score of 1/r to that sequence, whereas a residue represented in s
sequences contributes a score of 1/rs to each of the s sequences.
"""
# Skip columns with all gaps or unique inserts
if len([c for c in column if c not in '-.']) < 2:
return [0] * len(column)
# Count the number of occurrences of each residue type
# (Treat gaps as a separate, 21st character)
counts = Counter(column)
# Get residue weights: 1/rs, where
# r = nb. residue types, s = count of a particular residue type
n_residues = len(counts) # r
freqs = dict((aa, 1.0 / (n_residues * count))
for aa, count in counts.iteritems())
weights = [freqs[aa] for aa in column]
return weights
seq_weights = [0] * len(aln)
col_weights = map(col_weight, zip(*aln))
# Sum the contributions from each position along each sequence -> total
weight
for col in col_weights:
for idx, row_val in enumerate(col):
seq_weights[idx] += row_val
# Normalize
scale = 1.0 / sum(seq_weights)
seq_weights = [scale * wt for wt in seq_weights]
return seq_weights
More information about the Biopython-dev
mailing list