[Biopython-dev] [BioPython] Distance Matrix Parsers

Mon Jun 12 13:18:41 UTC 2006

[cross post]
On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices? My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices.  The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk space.
> This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate
the distance matrices (especially with DNA/RNA sequences of any
decently sized gene).

>
> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip does ask you which form to read or write, and this is a pain at
times. So, having a parser figure this out would be nice. However,
the user should still know about the choices.
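
A minimal sketch of what I mean, not a finished parser. It assumes one
matrix row per line (no wrapped rows), strict 10-column names, and that
the lower-triangular form omits the diagonal. The function name
parse_phylip_distances is made up for illustration. It works out the
layout from the number of values on each row and hands the answer back,
so the user can still find out which form was in the file.

```python
def parse_phylip_distances(text):
    """Return (names, full square matrix, layout) from PHYLIP-style text.

    Assumes one row per line and names in the first 10 columns.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    rows = lines[1:n + 1]
    if len(rows) != n:
        raise ValueError("expected %d rows, found %d" % (n, len(rows)))

    names, values = [], []
    for ln in rows:
        names.append(ln[:10].strip())
        values.append([float(x) for x in ln[10:].split()])

    counts = [len(v) for v in values]
    if counts == [n] * n:
        # Every row carries n distances: full symmetric ("square") form.
        return names, values, "square"
    if counts == list(range(n)):
        # Row i carries i distances: lower triangle without the diagonal.
        matrix = [[0.0] * n for _ in range(n)]
        for i, row in enumerate(values):
            for j, d in enumerate(row):
                matrix[i][j] = matrix[j][i] = d
        return names, matrix, "lower"
    raise ValueError("rows are neither square nor lower-triangular")
```

Either layout comes back as a full list of lists, and the third return
value tells the caller which one was read.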

>
> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
> This is less flexible than the suggested list of lists, but for large
> datasets would need much less memory.

I believe that SciPy (Numeric/NumPy/etc.) is more efficient at
storing these things. But you lose that when you want to do pythonish
things to it (like write it back out).
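
For what it is worth, the memory saving does not need a special matrix
type. Here is a rough sketch that packs the lower triangle into one flat
buffer of n*(n-1)/2 doubles. The class name PackedDistances and its
get/set methods are hypothetical, and I use the standard library array
module only so the example runs without extra packages; a 1-D
Numeric/NumPy array could presumably be indexed the same way.

```python
from array import array


class PackedDistances:
    """Symmetric matrix with a zero diagonal, stored as a packed lower triangle."""

    def __init__(self, n):
        self.n = n
        # n*(n-1)/2 doubles instead of n*n.
        self.data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n):
            raise IndexError("taxon index out of range")
        if i < j:
            i, j = j, i  # symmetry: always look up the lower triangle
        # Rows 1 .. i-1 hold 1 + 2 + ... + (i-1) = i*(i-1)/2 entries.
        return i * (i - 1) // 2 + j

    def get(self, i, j):
        if i == j:
            self._index(i, j)  # range check only
            return 0.0
        return self.data[self._index(i, j)]

    def set(self, i, j, d):
        if i == j:
            raise ValueError("the diagonal is fixed at zero")
        self.data[self._index(i, j)] = d
```

Writing it back out is then just a loop over get(i, j), so the
"pythonish" side does not have to suffer much.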

>
> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
> ignore this limitation and will use as many as needed for the full name.

ClustalW does the CORRECT thing: it truncates the name to 10
characters for Phylip output (alignments). And it does the CORRECT
thing for its distance matrix file.

In ClustalW's trees.c file:

void distance_matrix_output(FILE *ofile)

	/* left-justify to the maximum length of names in the current
	   alignment file and use a space as a separator */
	fprintf(ofile,"\n%-*s ",max_names,names[i]);

Spaces in names are bad in this case, but Phylip is okay with them,
since the first 10 characters are the taxon name.

> I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters).  I can supply some examples if you like.

By definition this isn't a variant of Phylip, but another format. So,  
one would need two parsers: PhylipDist and Dist (or ClustalDist).
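
The difference between the two comes down to how a row is split into
name and distances. A small sketch, assuming one row per line; the
function names split_phylip_row and split_clustal_row are made up here
and are not existing BioPython functions.

```python
def split_phylip_row(line):
    """Strict Phylip: the name is exactly the first 10 columns (spaces allowed)."""
    return line[:10].strip(), [float(x) for x in line[10:].split()]


def split_clustal_row(line):
    """ClustalW-style: the name is the first whitespace-delimited token, any length."""
    fields = line.split()
    return fields[0], [float(x) for x in fields[1:]]
```

The strict splitter copes with a name like "Homo sapie" that contains a
space, but it raises ValueError on a row whose name runs past column 10,
because the tail of the name lands among the numbers. The ClustalW-style
splitter handles long names but cannot accept spaces inside them.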

>
> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to max
> 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the
Phylip matrix file format, so why give the option? Make another
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names, so
why error out? However, the class should have a way for the user to
check for duplicates.
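
Something along these lines, as a sketch only. The helper names
phylip_names and duplicate_names are hypothetical. The width is a fixed
constant rather than an option, truncation gives a warning, and
duplicates are something the user can ask about instead of an error.

```python
import warnings
from collections import Counter

PHYLIP_NAME_WIDTH = 10  # fixed by the Phylip format, not a user option


def phylip_names(names):
    """Return names cut to 10 characters, warning if any were shortened."""
    cut = [name for name in names if len(name) > PHYLIP_NAME_WIDTH]
    if cut:
        warnings.warn("%d name(s) truncated to %d characters"
                      % (len(cut), PHYLIP_NAME_WIDTH))
    return [name[:PHYLIP_NAME_WIDTH] for name in names]


def duplicate_names(names):
    """Return the sorted names that occur more than once (empty list if none)."""
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

A writer could call phylip_names() itself and leave it to the caller to
run duplicate_names() on the result if they care.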

>
> Likewise an option to save matrices as either fully symmetric or lower
> triangular.  I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully
symmetric. Keep this in mind when naming things and writing documentation.
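
A sketch of a writer with square as the default and the lower triangle
as the alternative. The function name format_phylip_distances, the
"lower" keyword and the six-decimal number format are my own choices for
illustration, not taken from Phylip or any existing module. The matrix
is assumed to be a full list of lists.

```python
def format_phylip_distances(names, matrix, lower=False):
    """Return the matrix as text: square by default, lower-triangular on request."""
    n = len(names)
    lines = ["%5d" % n]
    for i, name in enumerate(names):
        # Lower-triangular rows hold only the entries left of the diagonal.
        row = matrix[i][:i] if lower else matrix[i]
        label = name[:10].ljust(10)  # names occupy exactly 10 columns
        lines.append(label + "".join(" %.6f" % d for d in row))
    return "\n".join(lines) + "\n"
```

Calling it with lower=True on a three-taxon matrix gives rows with zero,
one and two distances, which is the form that saves the disk space.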

>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.
>

Well, we do have things like SciPy and PyClustal, which make things  
more even.

Marc