[BioPython] Distance Matrix Parsers

Sat Jun 10 15:08:43 UTC 2006

Hi,

Bio.SubsMat has a parser for substitution matrices, in both lower triangular and square form. Feel free to recycle the code.

Best,

Iddo

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org

-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Peter
Sent: Sat 6/10/2006 3:10 AM
To: BioPython Mailing List
Subject: Re: [BioPython] Distance Matrix Parsers

Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP 
> (http://evolution.genetics.washington.edu/phylip.html),

I've done a very small amount of work with neighbour joining trees, 
using PHYLIP format distance matrices.  The closest I could find to a 
file format definition was this page:

http://evolution.genetics.washington.edu/phylip/doc/distance.html

Points to be aware of:

In my experience, most software tools write the distances as a full
symmetric matrix.  However, the "standard" explicitly discusses a lower
triangular form (omitting the zero-distance diagonal entries), which has
the significant advantage of using about half the disk space.  This
matters once you get into thousands of taxa.

So, make sure any parser can cope with both full symmetric, and lower 
triangular forms - ideally without the user having to care.
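
A minimal sketch of such a parser, under some loud assumptions: one
matrix row per line (no wrapped rows), taxon names that contain no
whitespace, and the function name parse_phylip_distances is made up for
illustration, not an existing Biopython call.  It detects the layout
from the row lengths, so the caller does not have to say which form the
file is in.

```python
def parse_phylip_distances(text):
    """Parse a PHYLIP-style distance matrix held in a string.

    Returns (names, matrix), where matrix is always a full square
    list of lists, whichever layout the input used.
    Assumes one row per line and names without whitespace.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of taxa
    if len(lines) - 1 != n:
        raise ValueError("expected %d rows, found %d" % (n, len(lines) - 1))

    names = []
    rows = []
    for line in lines[1:]:
        fields = line.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])

    lengths = [len(row) for row in rows]
    if lengths == [n] * n:
        # Full square matrix: use the rows as they are.
        return names, rows
    if lengths == list(range(n)):
        # Lower triangular without the diagonal: mirror into a square.
        matrix = [[0.0] * n for _ in range(n)]
        for i, row in enumerate(rows):
            for j, d in enumerate(row):
                matrix[i][j] = d
                matrix[j][i] = d
        return names, matrix
    raise ValueError("rows are neither square nor lower triangular")
```

Because names are split on whitespace rather than cut at a fixed
column, this sketch also accepts names longer than 10 characters, at
the cost of rejecting names with embedded spaces.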

This also raises the question of how to store the matrix in memory.
Does Numeric/NumPy have an efficient way of storing symmetric matrices?
An array would be less flexible than the suggested list of lists, but
for large datasets it would need much less memory.
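
As far as I know, NumPy has no dedicated packed type for symmetric
matrices, so one way to get the saving is to keep only the strict lower
triangle in a flat array and compute the offset by hand.  The sketch
below uses the standard library's array module rather than NumPy to
stay dependency-free; the class name CondensedDistances is hypothetical.

```python
from array import array


class CondensedDistances:
    """Symmetric distance matrix storing only the entries with i > j."""

    def __init__(self, n):
        self.n = n
        # n*(n-1)/2 doubles instead of n*n Python floats in nested lists.
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _check(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n):
            raise IndexError("index out of range")

    def _index(self, i, j):
        # Map (i, j) with i != j onto the flat lower-triangle offset.
        if i < j:
            i, j = j, i
        return i * (i - 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        self._check(i, j)
        if i == j:
            return 0.0  # the diagonal is implicit
        return self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        self._check(i, j)
        if i == j:
            if value != 0:
                raise ValueError("diagonal distances must be zero")
            return
        self._data[self._index(i, j)] = value
```

Usage is d[i, j] for both reading and writing; d[i, j] and d[j, i]
refer to the same stored value.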

Second point - the "official" PHYLIP distance matrix file format
truncates taxon names at 10 characters.  Some tools (e.g. clustalw)
ignore this limitation and use as many characters as the full name
needs.  I personally find this much nicer - after all, most gene
identifiers (e.g. GI numbers) are eight characters to start with, and if
you are dealing with multiple features in each gene, 10 characters is
tough going.

So, I would make sure you test the parser on this format variant (with 
names longer than 10 characters).  I can supply some examples if you like.

For writing matrices to file, the strict 10-character limit on taxon
names might best be handled as an option (default to a maximum of 10,
with a warning if any names are truncated, and an error if truncation
renders names non-unique?).
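
The name handling suggested above could look roughly like this.  The
function name phylip_names and its strict/width parameters are
assumptions made for the sketch, not part of any existing module.

```python
import warnings


def phylip_names(names, strict=True, width=10):
    """Return the names as they would be written to a PHYLIP file.

    strict=True cuts names to `width` characters, warns if anything was
    cut, and raises if the cut makes two different names identical.
    strict=False returns the names unchanged.
    """
    names = list(names)
    if not strict:
        return names
    short = [name[:width] for name in names]
    if len(set(short)) < len(set(names)):
        raise ValueError("truncation to %d characters makes names non-unique"
                         % width)
    if short != names:
        warnings.warn("some taxon names were truncated to %d characters"
                      % width)
    return short
```

Raising on collisions rather than warning seems the safer default,
since a file with duplicate taxon names cannot be mapped back to the
original data.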

Likewise, there could be an option to save matrices as either full
symmetric or lower triangular.  I would lean towards full symmetric as
the default, as it seems to be more common.
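
A sketch of a writer with that layout option, defaulting to the full
symmetric form.  The function name format_phylip_distances, the lower
flag, and the six-decimal number format are all choices made for this
example; names are padded to 10 characters but written in full if they
are longer.

```python
def format_phylip_distances(names, matrix, lower=False):
    """Format a square distance matrix as PHYLIP-style text.

    lower=False writes every row in full; lower=True writes only the
    entries below the diagonal, so row i carries i distances.
    """
    n = len(names)
    lines = ["%5d" % n]  # header line: number of taxa
    for i, name in enumerate(names):
        values = matrix[i][:i] if lower else matrix[i]
        fields = [name.ljust(10)] + ["%.6f" % d for d in values]
        # rstrip drops the padding on a row that has no distances
        lines.append(" ".join(fields).rstrip())
    return "\n".join(lines) + "\n"
```

For example, format_phylip_distances(names, matrix, lower=True) writes
the first taxon on a line by itself, followed by rows of increasing
length.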

> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.

I suspect that for serious tree building pure Python will not be
competitive with existing C/C++ code on speed - but nonetheless it could
be useful.

Peter

_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
