[Biopython-dev] Bio.Motif Suggestions

Tue Apr 21 09:50:01 EDT 2009

Hi Bartek,

It's a long one, this...  I expect many TLDR responses ;)

On 21/04/2009 12:29, "Bartek Wilczynski" <bartek at rezolwenta.eu.org> wrote:

> On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard <lpritc at scri.ac.uk>
> wrote:
>> Some thoughts and a bit of a wishlist...
> 
> These are always welcome. I can make no promises on timing of making
> your wishes come true ;)

No-one ever does :(  <grin>

> But it is a workaround rather than a feature... I'd be also interested
> in knowing about other applications where maybe this assumption (small gaps)
> is violated. Are there also motifs with multiple gaps?

Yes - it might be a stretch, but if you wanted to represent the organisation
of protein domains in a multi-domain protein (e.g. a transposase, or some
pathogen effectors) as motifs you might want to do this.

> Implementing this feature would probably require a separate
> subclass of Motif, since the internal implementation of searching would
> need to be different.

I'm not sure that this needs to be true.  A motif with no gaps can be
considered as a special case of a motif with an arbitrary number of gaps.
If the base implementation is that of a gapped motif (e.g. represented as
ACT.{5,10}CCC.{,4}TATCAT.{3}GGG) then the basic method of searching - and
here using the re module might work - doesn't need to be any different for
an ungapped variant representing a particular instance of the
multiply-gapped motif (ACTNNNNNNCCCNNNNTATCATNNNGGG), or for any other
ungapped sequence (e.g. ACTCCCTATCATGGG).
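
To make that concrete, here is a minimal sketch using only the standard re
module.  The helper name find_motif is made up for illustration and is not
part of Bio.Motif; the point is only that one function handles the gapped
pattern, an ungapped instance of it, and an unrelated ungapped sequence.

```python
import re

def find_motif(pattern, sequence):
    """Return (start, end, text) for each non-overlapping match of pattern."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, sequence)]

# The three motifs from the text; each one is just a regular expression.
gapped = "ACT.{5,10}CCC.{,4}TATCAT.{3}GGG"
instance = "ACTNNNNNNCCCNNNNTATCATNNNGGG"   # one instance of the gapped motif
ungapped = "ACTCCCTATCATGGG"

# A toy target (an assumption for this example) containing both sequences.
target = "TT" + instance + "TT" + ungapped + "TT"

for motif in (gapped, instance, ungapped):
    print(motif, find_motif(motif, target))
```

Note that re.finditer only reports non-overlapping matches, which may or may
not be what you want for motif searching.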

This may not be the case for more complex search algorithms, however.

Other classes of Motif may well be necessary, in any case...

> This is a very good feature request, I think it is worth implementing,
> though currently I have no time to do it properly.

I'm right there with you, unfortunately ;)

>> I guess that this comes down to whether we choose to restrict the meaning of
>> "motif" to an ungapped string of symbols (including ambiguity) representing
>> nt/aa, or whether we want to permit the inclusion of variable-length gaps
>> 
> I think that you are touching on multiple issues here.

I was trying to focus on one issue, but it does have lots of implications,
which you cover below.

The one issue I intended is this: A sequence motif can be represented in
more than one way, and those ways are not necessarily interchangeable -
either conceptually or in code.  An ungapped string of symbols isn't able to
represent the same information as a regular expression (which can express
ambiguity in repeat counts), which in turn isn't able to represent the same
information as a PSSM (which can represent probabilities at each position),
which in turn isn't able to represent the same information as an HMM (which
can represent variable-order dependency).

However, the things you want to do with that motif, such as use it to search
a set of candidate sequences or produce an example matching sequence for
test purposes, can be the same regardless of the coding or conceptual
representation of that motif.
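
As a rough sketch of what I mean, assuming a hypothetical common interface
(the class names BaseMotif, ConsensusMotif and RegexMotif are invented here
and are not the existing Bio.Motif API), two different representations can
sit behind the same search() call:

```python
import re
from abc import ABC, abstractmethod

class BaseMotif(ABC):
    """Hypothetical common interface; not the real Bio.Motif API."""

    @abstractmethod
    def search(self, sequence):
        """Return the start position of every match in sequence."""

class ConsensusMotif(BaseMotif):
    """Ungapped, unambiguous string of symbols."""

    def __init__(self, consensus):
        self.consensus = consensus

    def search(self, sequence):
        hits, pos = [], sequence.find(self.consensus)
        while pos != -1:
            hits.append(pos)
            pos = sequence.find(self.consensus, pos + 1)
        return hits

class RegexMotif(BaseMotif):
    """Motif held as a regular expression, so gaps may vary in length."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def search(self, sequence):
        # finditer reports non-overlapping matches only.
        return [m.start() for m in self.pattern.finditer(sequence)]

target = "GGACTAAACCCTTACTCCCGG"   # toy sequence, made up for this example
motifs = [ConsensusMotif("ACTCCC"), RegexMotif("ACT.{0,3}CCC")]
for motif in motifs:
    print(type(motif).__name__, motif.search(target))
```

The calling code never needs to know which representation it has been given,
even though the two objects cannot be converted into one another in general.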

We come back to this below, but for now this does lead on to...

> - gapped alignments are one thing. If we have a gap in one sequence
> but not in the others
> (frequent in protein motifs, not so much in DNA motifs) we just need a
> way to sensibly use it in creation of PWMs for searching
> - dyadic motifs (gaps in otherwise ungapped alignments) are a
> different issue, since we have a
> gap in all instances, but it may have a variable length. see above.

These are, I think, the same issue.

In your first example, PWMs will (mostly) work because the lengths of most
sequences are the same and there are few gaps.  However, unless you have a
way of varying the length of your PWM during a query of the target sequence,
the PWM need not match the gapped sequence strongly, potentially leading to
a false negative.  As an example:

ABCDE
AB-DE 
ABCDE
ABCDE

The PWM will be (shorthand) [A1][B1][C.75,-.25][D1][E1], and when applied to
the target sequence ABDE (which was in your alignment), will not produce as
high a score as it would for the other members of the alignment.  For the
alignment:

A-CDE
AB-DE 
ABC-E
ABCDE

The PWM is (shorthand) [A1][B.75,-.25][C.75,-.25][D.75,-.25][E1]

With corresponding poor scores (potential false negatives) for target
sequences ACDE, ABDE and ABCE.
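
The two examples above can be reproduced with a toy calculation.  This is a
deliberately simplified sketch: it treats '-' as an ordinary symbol and scores
by multiplying raw column frequencies, with no pseudocounts or log-odds, and
the function names column_frequencies and score are my own inventions.

```python
from collections import Counter

def column_frequencies(alignment):
    """One {symbol: frequency} dict per column; '-' counts as a symbol."""
    n = len(alignment)
    return [{sym: count / n for sym, count in Counter(col).items()}
            for col in zip(*alignment)]

def score(pwm, sequence):
    """Product of per-column frequencies, reading sequence left to right."""
    result = 1.0
    for column, symbol in zip(pwm, sequence):
        result *= column.get(symbol, 0.0)
    return result

one_gap = ["ABCDE", "AB-DE", "ABCDE", "ABCDE"]
many_gaps = ["A-CDE", "AB-DE", "ABC-E", "ABCDE"]

pwm1 = column_frequencies(one_gap)
pwm2 = column_frequencies(many_gaps)

# Best case: the gap has already been placed correctly in the target.
print(score(pwm1, "ABCDE"), score(pwm1, "AB-DE"))
print(score(pwm2, "ABCDE"), score(pwm2, "A-CDE"))

# Without gap placement, the raw target ABDE puts D against the C column.
print(score(pwm1, "ABDE"))
```

Even with the gap placed for us, the gapped members score a third of what the
full-length sequence does; with no gap placement the raw target scores zero in
this toy scheme.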

Without a way to (intelligently) place gaps in your target sequences, or
otherwise account for gaps when searching, the problem is the same whether
there is one gap or a dyadic motif.  The *practical* issue is different, in
that you can probably accept the odd false negative for a motif in which one
training sequence has a gap, but PWMs are poor candidates for alignments
with many gaps, as they can readily produce false negatives.

The key issue is that PWMs are fixed-length, and variable-length
representations are common, desirable, and difficult to express in a
fixed-width framework.

> -regular expressions are a different way of describing motifs.

That is true - they are intermediate between consensus sequences and PSSMs
in their ability to describe variation, but also have the capacity to
represent variable-length sequences.

> I think that it is not a purpose of Bio.Motif to compete with regexps, but it
> would be certainly valuable to be able to have a possibility of creating
> motifs from some sort of (simplified) regexps. This was, to some extent,
> discussed in a recent thread on Seq.startswith methods

I was involved in that discussion :D

I don't think that Bio.Motif needs to compete with the re module, but
instead could use its robust, stable code to implement a regular expression
representation of sequence motifs, seamlessly.

> -HMM motifs are totally different kind of beast. These guys introduce
> dependencies between positions (doable also with regexps) and there is
> currently no support for them in Bio.Motif. It would be cool to have
> support for them, but I'm not an expert here and it looks to me like a
> lot of work (also probably the methods of Bio.Motif are not exactly right for
> HMMs).

You're right about the dependencies - they're the important features I was
alluding to in my post - but I don't think that regular expressions are a
good way to approach the same problem; they don't encode the same
information.

> -finally, supporting PROSITE syntax seems to depend on the variable gap
> feature, but otherwise it's simply an important input format to support.

I wasn't suggesting PROSITE syntax as part of any desire for implementation
- though a PROSITE <-> regex/consensus translation would be useful, I think
- rather as an illustration that more people than me need variable length
spacers in their motifs.

>> I think the latter representations are more useful, even if harder to
>> code/maintain.  I think that leaving them out would be a glaring hole in
>> functionality
> 
> Usefulness is hard to define in abstract of a particular problem, so
> this is arguable. It is certain that Bio.Motif is not a complete suite for all
> kinds of motif analysis but I don't know of any tool that is supporting all
> these types of motifs with a single API (if you know one, please tell me).
> We should have ambitious goals, but I wouldn't call it a glaring hole not to
> have what is currently not available elsewhere...

I apologise for my poor wording.  What I meant was that it would seem odd if
support for motif representation was considered complete without
representing variable-length sequences.  Left alone, this would always
represent an obvious target for improvement (i.e. 'a glaring hole in
functionality').  No criticism was meant by it - I think you've done a great
job so far on Bio.Motif - and I apologise if I have caused offence.

>> I think that, for anything other than simple searches (string search,
>> regex), we'd be on a hiding to nothing by implementing search methods within
>> Python.  It's not likely to be as fast as dedicated search packages, and it
>> would be a headache for maintenance.

> What do you mean by searching here? Searching for a known motif or searching
> for a new motif? And what dedicated packages you have on your mind?

Searching for a known motif in a larger sequence.  Three packages - two
biologically-dedicated, one not - spring to mind.

The non-biologically-dedicated one is grep.  Representing ambiguity symbols
as combinations of bases, e.g. [ACT] . [TA], [^T] and so on - with FASTA
files where sequences are not punctuated by \n or \r - is highly effective
for finding sequence motifs representable by regular expressions.
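
The same idea is easy to sketch in Python rather than grep.  The mapping below
uses the standard IUPAC nucleotide ambiguity codes; the function names and the
toy FASTA record are assumptions made for this example only.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in motif.upper())

def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA, dropping newlines."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))

# Toy record: the motif below spans the line break in the file.
fasta = [">toy\n", "GGTATA\n", "AAGGCC\n"]
sequence = read_fasta_sequence(fasta)
pattern = iupac_to_regex("TATAWAR")
print(pattern, [m.start() for m in re.finditer(pattern, sequence)])
```

Joining the lines first matters: a line-by-line search would miss this hit,
which is why I mentioned sequences not punctuated by \n or \r.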

Dedicated 1: PSI-BLAST - takes PSSMs representing a sequence profile

Dedicated 2: HMMer - builds and uses an HMM representation of the sequence
profile.

There are others, but I'd have to think hard to recall them.  You could
consider HMMer versions 1, 2 and 3 as different, in a number of ways -
including their utility for nucleotide sequence representation...

>> it seems to me that Bio.Motif could
>> be most powerful in the alignment/searching/comparison process as a 'broker'
>> within BioPython, providing a consistent API for interface with external
>> alignment/search/comparison applications that also permits programmatic
>> manipulation of the profile/HMM/alignment.  E.g.

> I think that the most valuable thing would be to internalize some of
> the complexity of different ways of using motifs in bioinformatics. My modest
> goal for now is making protein motifs first class citizens (meaning handling
> alphabets and gaps properly etc. ).
> The next thing would be to make bio.motif cooperate nicely with
> - Bio.Seq (e.g seq.startswith etc.),
> - Bio.Align (conversions from-to alignments)
> which includes easy motif creation from simple formats like IUPAC and
> simple regexps and would correspond to the "broker" function if I understand
> it correctly.
> Then I think it would be really cool to have spaced motifs, although
> here we need to be careful about performance.

If I might suggest: the main role of the Bio.Motif module as you intend it
appears to be to represent motifs of biological sequences, and to provide
useful functionality for them.  Now, there are several ways of representing
these motifs both conceptually, and in code - and they're not all
interchangeable.  Some of them have a many -> one mapping (PSSM -> consensus
sequence), and some have no obvious mapping at all (HMM <-/-> PSSM).  There
is a decision to be made concerning how motifs are represented internally:
PSSM, regex and/or HMM.  PSSM has the clear benefit that, given a PSSM, you
can easily generate the consensus sequence and a regular expression of
fixed-length - but the mapping to a regular expression is not clear, and may
not produce the one that the user would prefer.  HMMs can't readily be
converted to other representations, and regular expressions can't be
expanded to PSSMs, or converted to consensus sequences (unless they have no
length ambiguities).  It is not just performance we need to think about, but
the very representation of a motif.
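
To illustrate why the PSSM -> regular expression mapping is not clear-cut,
here is a minimal sketch in which the result depends entirely on an arbitrary
cutoff.  The function pssm_to_regex, its threshold parameter and the PSSM
values are all invented for this example.

```python
def pssm_to_regex(pssm, threshold):
    """Fixed-length regex: each column keeps symbols with p >= threshold."""
    parts = []
    for column in pssm:
        kept = sorted(s for s, p in column.items() if p >= threshold)
        if not kept:
            raise ValueError("no symbol reaches the threshold in a column")
        parts.append(kept[0] if len(kept) == 1 else "[" + "".join(kept) + "]")
    return "".join(parts)

# Toy PSSM (made-up probabilities), one dict per position.
pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.4, "C": 0.4, "G": 0.15, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.3, "T": 0.6},
]

# The same PSSM gives different expressions for different cutoffs.
print(pssm_to_regex(pssm, 0.35))
print(pssm_to_regex(pssm, 0.2))
```

Neither output is wrong; which one the user would prefer depends on how
permissive they want the search to be, and the PSSM alone cannot tell us.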

Each of these representations is useful under different circumstances.  I
think it is worth avoiding a structure that enforces a single internal
representation and closes off future alternative representations.  Giving
the user sufficient flexibility/rope to hang themselves with in their choice
of internal representation is a Good Thing, in my opinion.

> As for more specific things:
> - I don't like the usage of PSSM and consensus here. these are just
> different ways of looking at a Motif.
> I don't understand your idea of separating consensus from pssm motifs. These
> are not fundamentally different. HMMs though are really different.

I see what you mean, but I think you're associating PSSM with Motif too
strongly.  A PSSM can be used to generate a consensus sequence, but the
resulting consensus sequence cannot be used to generate the corresponding
PSSM uniquely.  There is not a one-one mapping, and they do not describe the
same information.  Consensus sequences, for example, do not indicate the
probability of finding a particular symbol at any given position; PSSMs can.
Consensus sequences are fundamentally different from PSSMs in that they
don't quantify variability at any position.
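
A small sketch of the many -> one mapping, using made-up numbers and a
hypothetical consensus() helper that simply takes the most probable symbol in
each column:

```python
def consensus(pssm):
    """Most probable symbol in each column of a PSSM (list of dicts)."""
    return "".join(max(column, key=column.get) for column in pssm)

# Two toy PSSMs (invented values): one strongly conserved, one barely so.
strong = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
          {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}]
weak = [{"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
        {"A": 0.30, "C": 0.40, "G": 0.20, "T": 0.10}]

# Both collapse to the same consensus, so the PSSM cannot be recovered.
print(consensus(strong), consensus(weak))
```

Given only the consensus, there is no way to tell which of the two PSSMs it
came from.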

Consensus, regex, PSSM and HMM are all different ways of looking at a Motif,
but they're not all internally-compatible - which is my point.  If you build
a PSSM motif and make the alignment data nonrecoverable, you cannot
reconstruct a corresponding HMM representation, later, for example.  So you
would have to decide what kind of representation you use at motif
build-time, build all of them at once, or keep the alignment around to build
what you need later.  I'd prefer to choose at build time, but YMMV.

> -Also the difference between HMMer and HMM is unclear to me
> (isn't hmmer a tool to make HMMS? Do we support HMMER in Biopython currently?)
> But I'm not too concerned about HMMs at the moment.

There is a fair amount of flexibility in how you choose to define your HMM
for a motif, and not just in the order of the HMM.  There has been
corresponding variation in how HMMer represents its data internally, over
the years.  I meant to imply by the syntax that a HMMer-specific
representation could be called 'hmmer', but a generic internal HMM
representation could just be called 'hmm', to reflect this.  I'm not going
to insist on the convention, but it seems simple and obvious to me (again,
YMMV).

Sorry for the length and likely repetition, but I think these are issues
worth thinking about.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
