[Biopython-dev] Bio.Motif Suggestions

Tue Apr 21 11:59:32 EDT 2009

Hi,

Thanks for your suggestions.

To make a long story short:
- I mostly agree with your points
- I've updated the wiki page to include your requests
http://biopython.org/wiki/MotifDev
- I'll definitely spend some time working on particular requests and
then post specifically.

cheers
Bartek

On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi,
>
> Some thoughts and a bit of a wishlist...
>
> On 20/04/2009 16:04, "Bartek Wilczynski" <bartek at rezolwenta.eu.org> wrote:
>
>> On Mon, Apr 20, 2009 at 4:35 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>>
>>> What would a space in a motif mean?  Clearly something different from
>>> a wildcard like N or X in nucleotide or protein sequences.  Does it
>>> mean a gap of variable length?  If it means a gap of one character
>>> then surely just using a "-" would be sensible (as used in multiple
>>> sequence alignments), for which we have a gapped alphabet system
>>> setup.
>>>
>> I think that once we start talking about gapped motifs, we are really
>> talking about multiple alignments on steroids. This hasn't been done so
>> far because you don't really need it for DNA motifs,
>
> It might not be required for the motifs you've been working with, but we've
> been doing profile-based searches for bipartite regulatory binding sites in
> DNA.  These sites have a variable-length spacer region, and so require
> gapped alignments for building motifs.  The spacer region consensus
> (depending on the level of identity required for the consensus) is usually
> composed of Ns.
>
> I guess that this comes down to whether we choose to restrict the meaning of
> "motif" to an ungapped string of symbols (including ambiguity) representing
> nt/aa, or whether we want to permit the inclusion of variable-length gaps,
> regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
> C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
> C{,3}A{3,5}TTTT).  Although profile methods like HMMer can produce a
> consensus output that looks like an ungapped string of symbols to represent
> a motif, it doesn't capture important features of the HMM representation.
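
A minimal sketch of the regular-expression-like representation mentioned above, in plain Python. It assumes only the basic PROSITE elements (single residues, `x`, `[...]`, `{...}` and `(n)` / `(n,m)` repeats); the function names `prosite_to_regex` and `find_sites` are hypothetical and not part of Bio.Motif.

```python
import re

# One PROSITE element: a residue, x, [allowed] or {forbidden}, optionally
# followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        symbol, low, high = match.groups()
        if symbol in ("x", "X"):
            symbol = "."  # any residue
        elif symbol.startswith("{"):
            symbol = "[^" + symbol[1:-1] + "]"  # forbidden residues
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(symbol + repeat)
    return "".join(parts)


def find_sites(regex, sequence):
    """Return (start, end) for each non-overlapping match on the given strand."""
    return [match.span() for match in re.finditer(regex, sequence)]


zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zinc_finger)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Bipartite DNA site with an 18 nt spacer, searched on the forward strand only.
dna = "GG" + "GAACC" + "T" * 18 + "AAC" + "GG"
print(find_sites("GAACC.{17,21}AAC", dna))  # [(2, 28)]
```

This only reports non-overlapping hits on one strand; scoring, reverse-complement search and overlapping matches are left out.
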
>
> I think the latter representations are more useful, even if harder to
> code/maintain.  I think that leaving them out would be a glaring hole in
> functionality, and that they're a target Biopython should aim for.
>
>> I think it would be great to be able to easily convert multiple
>> alignments into motifs. This would allow us to use the power of
>> Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The
>> question is how to design the API for these functions.
>
> I agree.  I think that there's another important question: what do we mean,
> and need to do, when we talk about converting an alignment into a motif?
> Consensus/majority and PSSM methods from a sequence alignment should be
> straightforward to implement in Python - even for gapped alignments.
> Including a representation of variable-length gaps might be a little more
> difficult, and storing an HMM representation may be too much to manage
> immediately.  That's still three different types of object - with likely
> different components to their interfaces - to be stored.  In their
> relationship to a source alignment, these representations could be
> properties of a single alignment, or independent Bio.Motif objects (perhaps
> each with a link back to their parent alignment).
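
A minimal sketch of the consensus/majority approach for a gapped alignment, in plain Python. It assumes the alignment is given as a list of equal-length strings with `-` as the gap character and treats a gap as just another symbol; the names `column_counts` and `threshold_consensus` are hypothetical and not an existing Biopython API.

```python
from collections import Counter


def column_counts(sequences):
    """Return one Counter of symbols (gaps included) per alignment column."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*sequences)]


def threshold_consensus(sequences, threshold=0.9, ambiguous="N"):
    """Return the consensus string for the alignment.

    A column contributes its most common symbol if that symbol's frequency
    is at least `threshold`; otherwise it contributes `ambiguous`.
    Thresholds above 0.5 avoid ties between equally common symbols.
    """
    consensus = []
    for counts in column_counts(sequences):
        symbol, count = counts.most_common(1)[0]
        if count / len(sequences) >= threshold:
            consensus.append(symbol)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


aligned = [
    "GAACC-TTAAC",
    "GAACCATTAAC",
    "GAACC-TAAAC",
    "GAACC-TTAAC",
]
print(threshold_consensus(aligned, threshold=0.9))  # GAACCNTNAAC
print(threshold_consensus(aligned, threshold=0.7))  # GAACC-TTAAC
```

The per-column counts are also the starting point for a PSSM: dividing each count by the number of sequences (plus pseudocounts, if wanted) gives the position frequencies.
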
>
> The results of searches are also likely to be qualitatively different,
> depending on the type of motif used for the search, and the results desired
> by the user.
>
> I think that, for anything other than simple searches (string search,
> regex), we'd be on a hiding to nothing by implementing search methods within
> Python.  It's not likely to be as fast as dedicated search packages, and it
> would be a headache for maintenance.  So, with apologies if I missed this
> part of the discussion or documentation, it seems to me that Bio.Motif could
> be most powerful in the alignment/searching/comparison process as a 'broker'
> within Biopython, providing a consistent API for interfacing with external
> alignment/search/comparison applications that also permits programmatic
> manipulation of the profile/HMM/alignment.  E.g.
>
> align = Bio.AlignIO.read(alignfilehandle, alignformat)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>
> (which I don't think is as neat an interface, even if all
> align.build_consensus does is call the Bio.Motif.consensus_from_alignment
> method)
>
> Followed by things like
>
> pssm.consensus()
> pssm.logo()
> hmm.generate_sequence(length=100)
> hmm.to_graphviz()
>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>
> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>
> Cheers,
>
> L.
>
> --
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

-- 
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433