[Biopython-dev] Bio.Motif Suggestions

Tue Apr 21 04:34:25 EDT 2009

Hi,

Some thoughts and a bit of a wishlist...

On 20/04/2009 16:04, "Bartek Wilczynski" <bartek at rezolwenta.eu.org> wrote:

> On Mon, Apr 20, 2009 at 4:35 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>> 
>> What would a space in a motif mean?  Clearly something different from
>> a wildcard like N or X in nucleotide or protein sequences.  Does it
>> mean a gap of variable length?  If it means a gap of one character
>> then surely just using a "-" would be sensible (as used in multiple
>> sequence alignments), for which we have a gapped alphabet system
>> setup.
>> 
> I think that once we start talking about gapped motifs, we are really
> talking about multiple alignments on steroids. This hasn't been done so
> far because you don't really need it for DNA motifs,

It might not be required for the motifs you've been working with, but we've
been doing profile-based searches for bipartite regulatory binding sites in
DNA.  These sites have a variable-length spacer region, and so require
gapped alignments for building motifs.  The spacer region consensus
(depending on the level of identity required for the consensus) is usually
composed of Ns.  

I guess that this comes down to whether we choose to restrict the meaning of
"motif" to an ungapped string of symbols (including ambiguity) representing
nt/aa, or whether we want to permit the inclusion of variable-length gaps,
regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
C{,3}A{3,5}TTTT).  Although profile methods like HMMer can produce a
consensus output that looks like an ungapped string of symbols, that
consensus doesn't capture important features of the HMM representation.

I think the latter representations are more useful, even if harder to
code/maintain.  I think that leaving them out would be a glaring hole in
functionality, and that they're a target Biopython should aim for.
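
To make the PROSITE/regex-style representation concrete, here is a minimal
sketch using only Python's standard re module.  The function name
prosite_to_regex is hypothetical (it is not an existing Bio.Motif function),
and it handles only the simple elements used in the examples above: single
residues, x wildcards, [..] sets, {..} exclusions and (n) or (n,m) repeats.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Hypothetical helper for illustration; anchors (< and >) and other
    PROSITE extensions are deliberately not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split each element into a symbol and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("Unsupported element: %r" % element)
        symbol, low, high = match.groups()
        if symbol == "x":
            symbol = "."  # any residue
        elif symbol.startswith("{") and symbol.endswith("}"):
            symbol = "[^" + symbol[1:-1] + "]"  # excluded residues
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(symbol + repeat)
    return "".join(parts)


zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")

# A bipartite DNA site with a variable-length spacer, searched directly.
bipartite = re.compile("GAACC.{17,21}AAC")
sequence = "TT" + "GAACC" + "T" * 18 + "AAC" + "GG"
hit = bipartite.search(sequence)
```

Here hit.span() gives the location of the whole site, spacer included, which
an ungapped motif cannot report as a single hit.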

> I think it would be great to be able to easily convert multiple
> alignments into motifs. This would allow us to use the power of
> Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The
> question is how to design the API for these functions.

I agree.  I think that there's another important question: what do we mean,
and need to do, when we talk about converting an alignment into a motif?
Consensus/majority and PSSM methods from a sequence alignment should be
straightforward to implement in Python - even for gapped alignments.
Including a representation of variable-length gaps might be a little more
difficult, and storing an HMM representation may be too much to manage
immediately.  That's still three different types of object - with likely
different components to their interfaces - to be stored.  In their
relationship to a source alignment, these representations could be
properties of a single alignment, or independent Bio.Motif objects (perhaps
each with a link back to their parent alignment).
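
As a rough illustration of how little is needed for the consensus/count
matrix case on a gapped alignment, here is a sketch in plain Python.  The
alignment is just a list of equal-length strings, and the names
column_counts and consensus are my own, not existing Biopython functions.
The gap character is counted like any other symbol, which is an assumption
you may not want in practice.

```python
from collections import Counter


def column_counts(rows):
    """Return one Counter of symbols (gaps included) per alignment column."""
    if len(set(len(row) for row in rows)) != 1:
        raise ValueError("All rows of an alignment must have the same length")
    return [Counter(column) for column in zip(*rows)]


def consensus(rows, threshold=0.9, ambiguous="N"):
    """Majority consensus; columns below the threshold become `ambiguous`."""
    symbols = []
    for counts in column_counts(rows):
        symbol, count = counts.most_common(1)[0]
        if count / len(rows) >= threshold:
            symbols.append(symbol)
        else:
            symbols.append(ambiguous)
    return "".join(symbols)


rows = [
    "GAACC-TAAC",
    "GAACCATAAC",
    "GAACC-GAAC",
    "GAACCTTAAC",
]
counts = column_counts(rows)
strict = consensus(rows, threshold=0.9)
relaxed = consensus(rows, threshold=0.7)
```

The per-column counts are the raw material for a PSSM; turning them into
log-odds scores needs background frequencies and pseudocounts, which I've
left out here.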

The results of searches are also likely to be qualitatively different,
depending on the type of motif used for the search, and the results desired
by the user.  

I think that, for anything other than simple searches (string search,
regex), we'd be on a hiding to nothing by implementing search methods within
Python.  They're not likely to be as fast as dedicated search packages, and
they would be a headache to maintain.  So, with apologies if I missed this
part of the discussion or documentation, it seems to me that Bio.Motif could
be most powerful in the alignment/searching/comparison process as a 'broker'
within Biopython, providing a consistent API for interfacing with external
alignment/search/comparison applications that also permits programmatic
manipulation of the profile/HMM/alignment.  E.g.

align = Bio.AlignIO.read(alignfilehandle, "clustal")  # format argument is required; "clustal" is only an example
consensus = align.build_consensus(threshold=0.9)
pssm = align.build_pssm()
hmmer = align.build_hmmer()
hmm = align.build_hmm(order=3)

Or

consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
pssm = Bio.Motif.build_pssm_from_alignment(align)
hmmer = Bio.Motif.build_hmmer_from_alignment(align)
hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)

(which I don't think is as neat an interface, even if all
align.build_consensus does is call the Bio.Motif.consensus_from_alignment
method)

Followed by things like

pssm.consensus()
pssm.logo()
hmm.generate_sequence(length=100)
hmm.to_graphviz()

And then the consensus, pssm, hmm and hmmer objects could be used as input
to interfaces for the relevant applications.

Converting an alignment into an HMM for this purpose may itself benefit from
a call to HMMer's hmmbuild (with a Pythonic representation of the resulting
data structure), rather than implementation of an equivalent internal
function - even though I think an internal implementation would be useful,
too.
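
A sketch of what the 'broker' side of that might look like, using only the
standard library.  The function names are hypothetical, and I'm assuming
hmmbuild takes its two positional arguments as output HMM file followed by
alignment file, with no extra options; check this against your installed
HMMer version before relying on it.

```python
import shutil
import subprocess


def hmmbuild_command(hmm_out, alignment_file, executable="hmmbuild"):
    """Build the argument list for hmmbuild (assumed: output file, then input)."""
    return [executable, str(hmm_out), str(alignment_file)]


def run_hmmbuild(hmm_out, alignment_file, executable="hmmbuild"):
    """Run hmmbuild on an alignment file and return the path of the HMM file.

    Raises FileNotFoundError if the executable is not on the PATH, and
    subprocess.CalledProcessError if hmmbuild exits with an error.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError("Cannot find %r on the PATH" % executable)
    command = hmmbuild_command(hmm_out, alignment_file, executable)
    subprocess.run(command, check=True, capture_output=True, text=True)
    return hmm_out


command = hmmbuild_command("site.hmm", "site.sto")
```

Parsing the resulting HMM file into a Python object would be the second half
of the job, and is where the consistent API would really matter.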

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
