[Biopython] StockholmIO replaces "." with "-", why?

Peter biopython at maubp.freeserve.co.uk
Fri Apr 9 16:09:16 UTC 2010


Hi Bryan,

On Fri, Apr 9, 2010 at 4:55 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>
> Hello Peter,
> The HMMER suite of tools and the Pfam website use "-" to indicate that
> an HMM visited a deletion state, and "." to indicate that the HMM on a
> different sequence visited an insertion state, and this gap is just
> added to maintain alignment.
>
>>foo
> AA...BBB---CCC
>>bar
> AAbazBBBDDDCCC
>
> In this example, the sequence "foo" doesn't have the DDD section of
> the profile HMM,
> the second sequence has not only the full model, but also contains an
> insert, "baz" that is not part of the HMM, for example, an extra-long
> loop.
>
> I hope this helps...
> -Bryan

Yes, it does. I think this HMMER/PFAM convention should be noted
in the definition of the Stockholm format - that might have prevented
this problem in Biopython, since none of the examples I'd looked at
when writing the parser had this behaviour. Note your example is
more subtle than the difference between internal gaps and leading or
trailing padding described by Ivan earlier:
http://lists.open-bio.org/pipermail/biopython/2010-April/006396.html

Could you point out a suitable (small) example from PFAM we can
use for a unit test, or email me an example (off list)?

Now, as to how to deal with this: We could extend the Biopython
Alphabet objects to explicitly support multiple types of gaps (the
current setup only really copes with a single gap character). Using
this information we could handle special cases; for example,
converting Stockholm to PHYLIP would require merging both gap
characters into a dash. This doesn't sound that straightforward
though.
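
To make the Stockholm to PHYLIP case concrete, here is a rough sketch
in plain Python (no Biopython involved) of merging both gap characters
onto a dash. The function name merge_gaps and its parameters are made
up for illustration; this is not an existing Biopython API, just one
way the merging step could look.

```python
def merge_gaps(seq, gap_chars=".-", target="-"):
    """Return seq with every character in gap_chars replaced by target.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    # Build a translation table mapping each gap character to the target.
    table = str.maketrans({ch: target for ch in gap_chars})
    return seq.translate(table)


# The two rows from Bryan's example.
foo = "AA...BBB---CCC"
bar = "AAbazBBBDDDCCC"

print(merge_gaps(foo))
print(merge_gaps(bar))
```

Note that this is lossy: once "." and "-" are merged, the distinction
between padding and a deletion state cannot be recovered from the
output.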

Or, we can avoid explicit declarations about the sequence (just
ignore the Biopython Alphabet object capabilities and use one
of the generic alphabets), and leave the problem in the hands of
the end user. This is bound to cause some unpleasant surprises
one day, but might be the best solution.
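
As an idea of what "in the hands of the end user" might mean in
practice, here is a small sketch, again in plain Python, that works on
the raw aligned strings. It assumes the convention as Bryan described
it ("." pads a column where another sequence has an insertion) and
drops those insert columns, leaving only the model columns. The helper
names insert_columns and drop_columns are hypothetical.

```python
def insert_columns(rows):
    """Return indices of columns where any row contains '.'.

    Assumes '.' marks padding opposite an insertion, per the
    HMMER/Pfam convention described in this thread.
    """
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    return [i for i in range(length) if any(row[i] == "." for row in rows)]


def drop_columns(rows, cols):
    """Return copies of rows with the given column indices removed."""
    skip = set(cols)
    return ["".join(ch for i, ch in enumerate(row) if i not in skip)
            for row in rows]


rows = ["AA...BBB---CCC",   # foo
        "AAbazBBBDDDCCC"]   # bar
cols = insert_columns(rows)
print(cols)
print(drop_columns(rows, cols))
```

With Bryan's example the "baz" insert columns are removed, while the
"---" deletion in foo is kept because it is part of the model.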

Peter


