[Biopython-dev] Distance Matrix Parsers

Wed Jun 14 18:36:00 UTC 2006

> [I've added Chris in case he isn't on the dev-list]

Thanks Marc! I actually joined Dev list as the discussion got interesting.
Figured we'd move it to here eventually.

>> One general question about the architecture: Are you thinking of having a
>> generic "distance matrix object", and parsers/formats defined for several
>> different file formats?
>>
>
> Yes. I think that is what I am leaning towards. Now, I don't know if I'll
> be the implementor or not. It has been something on my to-do list for a
> while.

BioPython support for these formats with clean, testable code should be the
primary task, correct? I can help with this.  After we get working code,
refactoring for memory management can take place. I haven't done anything
along these lines and I'd have to rely on someone else's expertise for this.

> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses lower
> triangular form (missing out the diagonal distance zero entries) which has
> the significant advantage of using about half the disk space. This is
> significant once you get into thousands of taxa.

I guess we need to consider that storing the matrix in triangular form will
save some memory. However, I've emailed the SciPy/NumPy guys, and there is
currently no support for a compact triangular/symmetric matrix type; it
would have to be stored as a full square matrix. See more below.

>>>> So, make sure any parser can cope with both full symmetric, and lower
>>>> triangular forms - ideally without the user having to care.
>>>
>>> Phylip does ask you which to either read or write; this is a pain at
>>> times. So, having a parser figure this out would be nice.  However, the
>>> user should know about the choices.
>>
>> It's fairly easy for the parser to cope with either: For each line of
>> input, only use the "lower triangular" portion - just ignore any
>> remaining text which would be present for a full matrix (square) file, or
>> not present for a lower triangular file.

Well, we can save a lot of developer time by requiring the user to designate
this, with the default being a square matrix. Is it unreasonable to expect
the user to know whether his or her matrix is lower/upper-triangular or
square? Auto-detection seems to add a bit of risk: either the detection has
to be confirmed by the user (in which case, what's the point of
auto-detecting?), or we need a really well-tested auto-detector, which means
a lot more developer time.
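
For comparison, the "only use the lower triangular portion" idea quoted
above could be sketched roughly as below. This is only an illustration, not
an agreed design: the function name parse_phylip_distances is made up, and
it assumes whitespace-separated fields, names without embedded spaces, and
one line per taxon (no wrapped rows). Upper-triangular files are not
handled.

```python
def parse_phylip_distances(lines):
    """Parse a Phylip-style distance matrix from an iterable of lines.

    Hypothetical helper. Returns (names, rows), where rows[i] holds the
    i distances from taxon i to taxa 0..i-1 (lower triangle, no diagonal).
    """
    lines = [line for line in lines if line.strip()]
    count = int(lines[0].split()[0])  # first line gives the number of taxa
    names, rows = [], []
    for i, line in enumerate(lines[1:count + 1]):
        fields = line.split()
        names.append(fields[0])
        values = [float(x) for x in fields[1:]]
        if len(values) < i:
            raise ValueError("row %d has too few distances" % (i + 1))
        # Keep the lower triangular part; anything after it (present only
        # in a square file) is ignored.
        rows.append(values[:i])
    if len(names) != count:
        raise ValueError("expected %d taxa, found %d" % (count, len(names)))
    return names, rows


square = ["3", "A 0.0 1.0 2.0", "B 1.0 0.0 3.0", "C 2.0 3.0 0.0"]
lower = ["3", "A", "B 1.0", "C 2.0 3.0"]
print(parse_phylip_distances(square) == parse_phylip_distances(lower))
```

Note that this silently trusts the upper half of a square file to mirror
the lower half, which is part of the risk discussed above.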

> It should be fairly easy, but I don't understand why Phylip chokes on
> square versus lower triangular. Either way, the class should "internally"
> know what the format read in was, so you can ask it.  That way if you muck
> with it or create a new matrix and want to write that out, you can ask the
> class what it read in and then have the new one write it out in that
> format.

I think it makes sense for a Phylip triangular matrix and a Phylip square
matrix to be represented as the same type of object, for reasons of
consistency, as already discussed. As Marc pointed out, its original form
can simply be represented by an attribute of the object. It should also be
possible to write the matrix back out in either triangular or square format,
regardless of its original format. These would probably just be methods of
the object, such as .to_phylip_square() and .to_phylip_ltriangular().
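
A minimal sketch of what such an object might look like follows. Only the
two method names come from the paragraph above; the class name
DistanceMatrix, the source_format attribute, the constructor arguments, and
the column widths and precision are all assumptions made for illustration.
Names longer than 10 characters are not truncated here.

```python
class DistanceMatrix(object):
    """Hypothetical container: one object for square and triangular input."""

    def __init__(self, names, rows, source_format="square"):
        self.names = list(names)
        # rows[i] holds distances from taxon i to taxa 0..i-1.
        self.rows = [list(row) for row in rows]
        # Remember how the data arrived, so callers can ask later.
        self.source_format = source_format

    def distance(self, i, j):
        if i == j:
            return 0.0
        if i < j:
            i, j = j, i
        return self.rows[i][j]

    def _format(self, row_length):
        out = ["%5d" % len(self.names)]
        for i, name in enumerate(self.names):
            cells = ["%.4f" % self.distance(i, j)
                     for j in range(row_length(i))]
            out.append((name.ljust(10) + " " + " ".join(cells)).rstrip())
        return "\n".join(out) + "\n"

    def to_phylip_square(self):
        n = len(self.names)
        return self._format(lambda i: n)

    def to_phylip_ltriangular(self):
        return self._format(lambda i: i)


m = DistanceMatrix(["A", "B", "C"], [[], [1.0], [2.0, 3.0]],
                   source_format="ltriangular")
print(m.to_phylip_square())
print(m.to_phylip_ltriangular())
```

Either writer works regardless of source_format, so the attribute is purely
informational.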

>>>> This also raises the point about how to store the matrix in memory.
>>>> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
>>>> This is less flexible than the suggested list of lists, but for large
>>>> datasets would need much less memory.
>>>
>>> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at storing
>>> these things. But you lose that when you want to do pythonish things to
>>> it (like write it back out).
>>
>> It depends on our target audience.  My experience with two thousand taxa
>> means that I am slightly concerned about the memory, and would lean
>> towards storing the data using Numeric/NumPy.  This could be done within
>> a nice python object, with methods to write it out again in phylip format
>> etc - so it could still behave "nicely".
>
> I agree here and think that if the user has Numeric use that, otherwise
> use built-in types. So, maybe two "hidden" classes that do the correct
> thing.

This just recently popped up on the NumPy discussion list:
http://www.mail-archive.com/numpy-discussion@lists.sourceforge.net/msg00265.html

The summary of that is we can memory-map it using numpy.memmap. I've never
used this before, so I can't really comment. I'd guess that for small data
files, this is overkill. For large sets it might be reasonable. I suppose
two separate classes could be available, one for smaller matrices and one
for larger. Again, I think the user would be intelligent enough to make the
decision as to which to use.

Since the class for handling standard (smaller) matrices will be easier to
code, I propose writing this standard one first and getting it into
BioPython. For this class, I suggest just sticking with a regular nested
list, rather than using something from Numeric/NumPy.

After this class is created and submitted, we can go back and create a class
to deal with larger matrices that's a sub-class of the standard one. This
way, the API remains the same, regardless of the class, and we will only
have to rewrite the methods that need changing due to the way we'll need to
interact with the underlying data structure of the wrapped Numeric/NumPy
object. How does that sound?
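
To illustrate the subclassing idea, here is a rough sketch in which only the
storage accessors change between the two classes. All names are
hypothetical. To keep the example free of third-party dependencies, the
"large" class uses a packed array from the standard library's array module
as a stand-in for a Numeric/NumPy object; it is not a NumPy implementation.

```python
from array import array


class DistanceMatrix(object):
    """Standard class: lower triangle kept in a plain nested list."""

    def __init__(self, names):
        self.names = list(names)
        self._init_storage(len(self.names))

    # Storage hooks: the only methods a subclass needs to replace.
    def _init_storage(self, n):
        self._rows = [[0.0] * i for i in range(n)]

    def _get(self, i, j):  # callers guarantee i > j
        return self._rows[i][j]

    def _set(self, i, j, value):
        self._rows[i][j] = value

    # Public API, shared unchanged by every subclass.
    def get_distance(self, a, b):
        i, j = self.names.index(a), self.names.index(b)
        if i == j:
            return 0.0
        return self._get(max(i, j), min(i, j))

    def set_distance(self, a, b, value):
        i, j = self.names.index(a), self.names.index(b)
        if i == j:
            raise ValueError("the diagonal is fixed at zero")
        self._set(max(i, j), min(i, j), float(value))


class LargeDistanceMatrix(DistanceMatrix):
    """Same API; lower triangle packed into one flat array of doubles."""

    def _init_storage(self, n):
        self._flat = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        # Row i of the lower triangle starts at offset i * (i - 1) / 2.
        return i * (i - 1) // 2 + j

    def _get(self, i, j):
        return self._flat[self._index(i, j)]

    def _set(self, i, j, value):
        self._flat[self._index(i, j)] = value


for cls in (DistanceMatrix, LargeDistanceMatrix):
    m = cls(["A", "B", "C"])
    m.set_distance("C", "A", 2.5)
    print(cls.__name__, m.get_distance("A", "C"))
```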

>> I like clustal's "long name variant of Phylip distance format", as for my
>> datasets my gene/domain names are longer than 10 characters.  I may well
>> be in a minority here (for now).
>>
>> I suppose it would be "good practice" to follow the official (but not
>> overly precise) phylip definition on this issue.
>>
>> So your idea of defining two similar formats would resolve this.  In
>> terms of implementation, one could probably just subclass the other to
>> reduce the amount of duplicated code.
>
> Correct. subclassing is our friend (to a point).
>

I'm in agreement with using two separate types of objects to represent these
two formats. PhylipDist should represent the Phylip spec to the T. I'm not
familiar with the Clustal spec; is it formatted similarly, sans the
requirement of 10 characters max for the sequence name?

An editorial note: I'm very frustrated with Phylip's 10 character limit for
sequence names, too. I don't know the reasoning and history behind the
decisions on the format; all I know is that it is an uncomfortably
restrictive and seemingly arbitrary format. Why it has not been updated is
beyond me, unless, like these parsers for BioPython, it's just another
project waiting for someone to work on it.

>>> I am pretty sure that Phylip doesn't care about non-unique names so why
>>> error out? However, the class should have a means for the user to ask
>>> this question.
>>
>> Because the (truncated) taxa names are going to be used as tree node
>> names by any tree building program, they really should be unique.  I
>> would expect any tree program to throw an error in this case, which is
>> why I suggested we should try not to create such files in the first
>> place.
>
> Not exactly. I've been bitten in the butt by the truncation issue several
> times. I know TreeView X doesn't care about unique names and I think
> MacClade also doesn't care. Now, PAUP and Mesquite might care, or any
> Nexus-type system which lists the taxon names separately from the taxa
> in the TREES block (they use numbers for the taxa, which get mapped to
> TAXLABELS in the TAXA block). I believe it depends on how they decided to
> store these relationships.
>
> I guess we have three options here: 1) keep on trucking 2) raise a warning
> 3) raise an exception - something like Matrix.NonUniqueName exception so
> that you can specifically except the exception
>

I dislike option 1, unless we also provide the user the ability to check for
non-unique names, too. Remember the Zen of Python: "Explicit is better than
implicit."

I like option 3, though I don't know how to make it possible for code
outside the parser to catch the exception and tell the parser to continue.
We could have it throw the exception by default, but if the user provides a
flag when calling the parser, like allow_non_unique=True, we could have
logic in the parser that, if the flag is True, catches the exception and
continues.
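
Something along these lines might do it. This is a sketch only: the
exception name NonUniqueNameError and the helper check_names are made up,
and the 10-character truncation is assumed from the Phylip name limit
discussed earlier. Rather than catching the exception internally, the flag
here simply suppresses raising it, which has the same effect for the
caller.

```python
class NonUniqueNameError(ValueError):
    """Hypothetical exception for names that collide after truncation."""

    def __init__(self, duplicates):
        ValueError.__init__(
            self, "non-unique names: %s" % ", ".join(duplicates))
        self.duplicates = duplicates


def check_names(names, max_length=10, allow_non_unique=False):
    """Truncate names and report collisions.

    Returns (truncated_names, duplicates). Raises NonUniqueNameError on
    collisions unless allow_non_unique is True.
    """
    truncated = [name[:max_length] for name in names]
    seen = set()
    duplicates = []
    for name in truncated:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates and not allow_non_unique:
        raise NonUniqueNameError(duplicates)
    return truncated, duplicates


names = ["Homo_sapiens_1", "Homo_sapiens_2", "Mus"]
try:
    check_names(names)
except NonUniqueNameError as err:
    print("strict mode refused:", err.duplicates)
print(check_names(names, allow_non_unique=True))
```

Returning the duplicates even in the permissive case also gives the user a
way to ask whether names are unique, which covers the concern with option 1.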

>> Likewise an option to save matrices as either fully symmetric or lower
>> triangular.  I would lean towards using fully symmetric as the default as
>> it seems to be more common.
>
> Phylip's default seems to be a "Square" distance matrix, i.e.  fully
> symmetric. Keep this in mind when naming or documentation.

As I mentioned above, the same object would represent both types, and should
be equally capable of outputting itself as text in either format.

Chris