[Biopython-dev] Bio.motifs.matrix.PositionSpecificScoringMatrix.calculate - scoring ambiguous sequences

Mon Jun 13 14:58:45 UTC 2016

What do you think, Michiel?

Also related, earlier today I filed this issue:
https://github.com/biopython/biopython/issues/851

Peter

On Mon, Jun 13, 2016 at 3:26 PM, Sefa Kilic <sefa1 at umbc.edu> wrote:
> Hello all,
>
> I have been using the Bio.motifs PSSM search for a long time. Occasionally,
> I work with genome sequences containing ambiguous bases. Biopython currently
> does not support scoring sequences with ambiguous bases and I would like to
> propose a change to fix that.
>
> Currently, the "calculate" function in the PositionSpecificScoringMatrix
> class checks whether the alphabets of both the motif and the sequence are
> IUPAC.IUPACUnambiguousDNA. If they are not, a ValueError exception is
> raised.
>
> The code itself, however, tolerates ambiguous bases in the sequence by
> scoring them as NaN. That is, given a PSSM of length L, every L-mer
> subsequence that contains an ambiguous base is scored as NaN. I would like
> to extend this and score ambiguous sequences properly. For instance, if the
> base is Y (C or T), it should be scored as the average of scoring it as C
> and as T. If the base is N, it should be scored as the average over all
> four bases, [S(A) + S(T) + S(C) + S(G)] / 4.
>
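
The averaging rule proposed above can be sketched in plain Python. This is
only an illustration under stated assumptions, not the Bio.motifs or _pwm.c
implementation: the names IUPAC_EXPANSION, column_score and score_windows are
hypothetical, the PSSM is assumed to be a dict mapping each unambiguous base
to a list of per-position scores, and the toy matrix values are made up.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they cover.
IUPAC_EXPANSION = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def column_score(pssm, position, letter):
    """Average the column scores over every base that `letter` may stand for."""
    bases = IUPAC_EXPANSION[letter.upper()]
    return sum(pssm[base][position] for base in bases) / len(bases)


def score_windows(pssm, sequence):
    """Score every L-mer of `sequence` (hypothetical helper, not Bio.motifs API)."""
    length = len(pssm["A"])
    return [
        sum(column_score(pssm, i, sequence[start + i]) for i in range(length))
        for start in range(len(sequence) - length + 1)
    ]


# Toy log-odds matrix of length 2; the values are invented for illustration.
toy_pssm = {
    "A": [1.0, -1.0],
    "C": [-1.0, 2.0],
    "G": [0.0, 0.0],
    "T": [-2.0, 1.0],
}

print(score_windows(toy_pssm, "AY"))   # Y at position 1: (2.0 + 1.0) / 2
print(score_windows(toy_pssm, "ACN"))  # N at position 1: mean of all four bases
```

With this toy matrix, "AY" scores 1.0 + 1.5 = 2.5, and an unambiguous window
such as "AC" scores the same as it would without the averaging step, since a
single-base expansion averages to the base's own score.
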
> The change needs to be done on both Python and C (_pwm.c) sides. What do you
> think? If you agree, I can implement it and send a pull request.
>
> Cheers,