[Biopython-dev] Bio.Motif Suggestions

Tue Apr 21 11:29:39 UTC 2009

On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi,
>
> Some thoughts and a bit of a wishlist...

These are always welcome. I can make no promises on the timing of
making your wishes come true ;)

>>>
>> I think that once we start talking about gapped motifs, we are really
>> talking about multiple alignments on steroids. This hasn't been done so
>> far because you don't really need it for DNA motifs,
>
> It might not be required for the motifs you've been working with, but we've
> been doing profile-based searches for bipartite regulatory binding sites in
> DNA.  These sites have a variable-length spacer region, and so require
> gapped alignments for building motifs.  The spacer region consensus
> (depending on the level of identity required for the consensus) is usually
> composed of Ns.

Indeed, there are dyadic motifs for some transcription factors. So far
I have been working under the assumption that the gap is not too
variable (say 3-5 nucleotides), which you can fake by using multiple
PWMs with different gap sizes, e.g.:
CACnnnGTG
CACnnnnGTG
CACnnnnnGTG

But it is a workaround rather than a feature... I'd also be interested
in knowing about other applications where this assumption (small gaps)
is violated. Are there also motifs with multiple gaps? Implementing
this feature would probably require a separate subclass of Motif, since
the internal implementation of searching would need to be different.
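
As a rough illustration of the fixed-gap workaround described above
(this is not an existing Bio.Motif feature), one could expand a dyadic
site into one pattern per gap size and scan with each of them. The
helper names expand_gap_variants, matches_at and scan_variants are
hypothetical, and a lowercase "n" is assumed to mean "any base":

```python
def expand_gap_variants(left, right, min_gap, max_gap):
    """Return one fixed-length pattern per allowed gap size, e.g. CACnnnGTG."""
    return [left + "n" * gap + right for gap in range(min_gap, max_gap + 1)]


def matches_at(pattern, seq, start):
    """True if pattern matches seq at start; 'n' in the pattern matches any base."""
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(p == "n" or p == s for p, s in zip(pattern, window))


def scan_variants(seq, variants):
    """Yield (position, pattern) for every hit of every fixed-gap variant."""
    for pattern in variants:
        for start in range(len(seq) - len(pattern) + 1):
            if matches_at(pattern, seq, start):
                yield start, pattern


# Illustrative half-sites and test sequence (made up for this sketch).
variants = expand_gap_variants("CAC", "GTG", 3, 5)
hits = sorted(scan_variants("TTCACAAAAGTGTT", variants))
```

The sequence is scanned once per variant, so the cost grows with the
width of the gap range, which is one reason this stays a workaround.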

This is a very good feature request and I think it is worth
implementing, though currently I have no time to do it properly. If you
don't care too much about efficiency, I could quickly write this dyadic
subclass with an implementation based on two motif instances and a
variable gap.
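
A minimal sketch of what such a "two half-sites plus a variable gap"
search might look like, assuming log-odds scoring against a uniform
background. The DyadicMotif class, its arguments and the helper
functions are all hypothetical and are not part of Bio.Motif; the
half-site instances are made up for illustration:

```python
import math


def log_odds_matrix(instances, alphabet="ACGT", pseudocount=1.0):
    """Per-column log-odds scores (uniform background) from equal-length instances."""
    background = 1.0 / len(alphabet)
    matrix = []
    for col in range(len(instances[0])):
        counts = {a: pseudocount for a in alphabet}
        for inst in instances:
            counts[inst[col]] += 1
        total = sum(counts.values())
        matrix.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return matrix


def score(matrix, seq, start):
    """Sum of log-odds scores for the window of seq beginning at start."""
    return sum(column[seq[start + i]] for i, column in enumerate(matrix))


class DyadicMotif:
    """Hypothetical sketch: two half-site matrices separated by a variable gap."""

    def __init__(self, left_instances, right_instances, min_gap, max_gap):
        self.left = log_odds_matrix(left_instances)
        self.right = log_odds_matrix(right_instances)
        self.min_gap = min_gap
        self.max_gap = max_gap

    def search(self, seq, threshold):
        """Yield (position, gap, score) for placements scoring at least threshold."""
        n_left, n_right = len(self.left), len(self.right)
        for start in range(len(seq) - n_left + 1):
            left_score = score(self.left, seq, start)
            for gap in range(self.min_gap, self.max_gap + 1):
                right_start = start + n_left + gap
                if right_start + n_right > len(seq):
                    break  # larger gaps would run past the end as well
                total = left_score + score(self.right, seq, right_start)
                if total >= threshold:
                    yield start, gap, total


motif = DyadicMotif(["CAC", "CAC", "CAT"], ["GTG", "GTG", "ATG"], min_gap=3, max_gap=5)
hits = list(motif.search("TTCACAAAAGTGTT", threshold=6.0))
```

This is deliberately naive: every start position is scored against
every gap size, so it trades efficiency for simplicity.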

>
> I guess that this comes down to whether we choose to restrict the meaning of
> "motif" to an ungapped string of symbols (including ambiguity) representing
> nt/aa, or whether we want to permit the inclusion of variable-length gaps,
> regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
> C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
> C{,3}A{3,5}TTTT).  Although profile methods like HMMer can produce a
> consensus output that looks like an ungapped string of symbols to represent
> a motif, it doesn't capture important features of the HMM representation.
>

I think that you are touching on multiple issues here. I'll try to
answer them separately:
- Gapped alignments are one thing. If we have a gap in one sequence but
  not in the others (frequent in protein motifs, not so much in DNA
  motifs), we just need a way to sensibly use it when creating PWMs for
  searching.
- Dyadic motifs (gaps in otherwise ungapped alignments) are a different
  issue, since we have a gap in all instances, but it may have a
  variable length. See above.
- Regular expressions are a different way of describing motifs. I think
  that it is not the purpose of Bio.Motif to compete with regexps, but it
  would certainly be valuable to be able to create motifs from some
  sort of (simplified) regexps. This was, to some extent, discussed in a
  recent thread on Seq.startswith methods.
- HMM motifs are a totally different kind of beast. These guys introduce
  dependencies between positions (doable also with regexps) and there is
  currently no support for them in Bio.Motif. It would be cool to have
  support for them, but I'm not an expert here and it looks to me like a
  lot of work (also, the methods of Bio.Motif are probably not exactly
  right for HMMs).
- Finally, supporting PROSITE syntax seems to depend on the variable gap
  feature, but otherwise it's simply an important input format to
  support.
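
To make the link between PROSITE syntax and variable gaps concrete,
here is a small sketch that translates a simplified PROSITE pattern
into a Python regular expression. The function prosite_to_regex is
hypothetical (it is not a Bio.Motif call) and covers only a subset of
the syntax: single residues, x, [..], {..}, repeats like x(3) or
x(2,4), and the < and > anchors:

```python
import re

ELEMENT = re.compile(
    r"(<?)"                             # optional N-terminal anchor
    r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})"   # residue, x, allowed set or forbidden set
    r"(?:\((\d+)(?:,(\d+))?\))?"        # optional repeat: (n) or (n,m)
    r"(>?)"                             # optional C-terminal anchor
)


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                          # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {..} lists forbidden residues
        if low and high:
            core += "{%s,%s}" % (low, high)     # variable-length repeat
        elif low:
            core += "{%s}" % low                # fixed-length repeat
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


# The pattern quoted earlier in this thread.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
```

The x(2,4) and x(3,5) elements map directly onto regexp quantifiers,
which is exactly the variable gap that a fixed-width PWM cannot express.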

> I think the latter representations are more useful, even if harder to
> code/maintain.  I think that leaving them out would be a glaring hole in
> functionality, and that they're a target Biopython should aim for.

Usefulness is hard to define in the abstract, apart from a particular
problem, so this is arguable. It is certain that Bio.Motif is not a
complete suite for all kinds of motif analysis, but I don't know of any
tool that supports all these types of motifs with a single API (if you
know one, please tell me). We should have ambitious goals, but I
wouldn't call it a glaring hole not to have what is currently not
available elsewhere...

>
>> I think it would be great to be able to easily convert multiple
>> alignments into motifs. This would allow us to use the power of
>> Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The
>> question is how to design the API for these functions.
>
> I agree.  I think that there's another important question: what do we mean,
> and need to do, when we talk about converting an alignment into a motif?
> Consensus/majority and PSSM methods from a sequence alignment should be
> straightforward to implement in Python - even for gapped alignments.
> Including a representation of variable-length gaps might be a little more
> difficult, and storing an HMM representation may be too much to manage
> immediately.  That's still three different types of object - with likely
> different components to their interfaces - to be stored.  In their
> relationship to a source alignment, these representations could be
> properties of a single alignment, or independent Bio.Motif objects (perhaps
> each with a link back to their parent alignment).
>

> The results of searches are also likely to be qualitatively different,
> depending on the type of motif used for the search, and the results desired
> by the user.
>

> I think that, for anything other than simple searches (string search,
> regex), we'd be on a hiding to nothing by implementing search methods within
> Python.  It's not likely to be as fast as dedicated search packages, and it
> would be a headache for maintenance.  So, with apologies if I missed this

What do you mean by searching here? Searching for a known motif or
searching for a new motif? And which dedicated packages do you have in
mind?

> part of the discussion or documentation, it seems to me that Bio.Motif could
> be most powerful in the alignment/searching/comparison process as a 'broker'
> within BioPython, providing a consistent API for interface with external
> alignment/search/comparison applications that also permits programmatic
> manipulation of the profile/HMM/alignment.  E.g.
>
That's definitely an important field, though I'm not sure it is _the_
function of Bio.Motif.

I think that the most valuable thing would be to internalize some of
the complexity of the different ways of using motifs in bioinformatics.
My modest goal for now is making protein motifs first-class citizens
(meaning handling alphabets and gaps properly, etc.).

The next thing would be to make Bio.Motif cooperate nicely with
- Bio.Seq (e.g. seq.startswith etc.),
- Bio.Align (conversions to and from alignments),
which includes easy motif creation from simple formats like IUPAC and
simple regexps, and would correspond to the "broker" function if I
understand it correctly.
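
As one possible reading of "motif creation from IUPAC", the sketch
below expands the standard IUPAC nucleotide ambiguity codes into a
regular expression. The function name iupac_to_regex and the test
sequence are made up for illustration; nothing here is an existing
Bio.Motif call:

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand each IUPAC symbol into a literal base or a character class."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_DNA[symbol]  # raises KeyError for non-IUPAC symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iupac_to_regex("CACNNNGTG")
# re.finditer reports non-overlapping matches only.
positions = [m.start() for m in re.finditer(regex, "TTCACAAAGTGTT")]
```

A real implementation would more likely build a motif object from the
expanded columns rather than a regexp, but the mapping is the same.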

Then I think it would be really cool to have spaced motifs, although
here we need to be careful about performance.

> align = Bio.AlignIO.read(alignfilehandle)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>

I would guess that the first example is what would actually be used,
but it requires the functions on the Motif side to be available.

As for more specific things:
- I don't like the usage of PSSM and consensus here. These are just
  different ways of looking at a Motif.
- Also, the difference between HMMer and HMM is unclear to me (isn't
  HMMER a tool to make HMMs? Do we currently support HMMER in
  Biopython?). But I'm not too concerned about HMMs at the moment.

I would rather think of something like:

align = Bio.AlignIO.read(alignfilehandle)
motif = align.build_motif()

followed by:

motif.consensus()
motif.search_pwm(seq)
motif.search_instances(seq)
motif.weblogo()
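
To illustrate why consensus and PWM can be two views of one object,
here is a toy sketch built from ungapped, equal-length instances. The
SimpleMotif class is a hypothetical stand-in and not the Bio.Motif API;
build_motif() above is likewise only a proposal, and the instances are
made up:

```python
from collections import Counter


class SimpleMotif:
    """Hypothetical stand-in for a motif built from ungapped instances."""

    def __init__(self, instances, alphabet="ACGT"):
        self.alphabet = alphabet
        self.n = len(instances)
        length = len(instances[0])
        # One Counter of symbol counts per alignment column.
        self.counts = [Counter(inst[i] for inst in instances) for i in range(length)]

    def pwm(self, pseudocount=0.0):
        """Per-column symbol frequencies: the matrix view of the motif."""
        total = self.n + pseudocount * len(self.alphabet)
        return [
            {a: (col[a] + pseudocount) / total for a in self.alphabet}
            for col in self.counts
        ]

    def consensus(self, threshold=0.5, unknown="N"):
        """Most frequent symbol per column, or `unknown` if it falls below threshold."""
        result = []
        for col in self.counts:
            symbol, count = col.most_common(1)[0]
            result.append(symbol if count / self.n >= threshold else unknown)
        return "".join(result)


motif = SimpleMotif(["CACGTG", "CACGTG", "CATGTG", "TACGTG"])
strict = motif.consensus(threshold=0.9)   # columns below 90% identity become N
relaxed = motif.consensus()               # default 50% threshold
first_column = motif.pwm()[0]             # frequencies for the first position
```

Both methods read the same column counts, so neither needs to be a
separate kind of object.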

>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>
I don't understand your idea of separating consensus from pssm motifs. These
are not fundamentally different. HMMs though are really different.

> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>
Again, I'm not sure whether we have support for HMMer now (it was
mentioned on the mailing list once, but I don't know what happened to
it). But I agree it would be useful.

To summarize:
- Thanks for so much input; I especially appreciate the input on
  possible usages.
- I will work on the features I mentioned, in the direction of unifying
  the API for DNA and protein motifs, and I would definitely appreciate
  any help from others.
- The dyadic motifs (or, more generally, gapped motifs) are next, and
  require taking care of performance issues.
- HMM support is currently further down on my to-do list, mostly because
  it needs a rather different API. But once we have the "glue" functions
  for motifs, we can try to make similar "glue" functions for HMMs.

cheers
  Bartek