[BioPython] Distance Matrix Parsers
Peter
biopython at maubp.freeserve.co.uk
Sat Jun 10 10:10:02 UTC 2006
Chris Lasher wrote:
> Hi all, Are there any modules in BioPython to parse distance
> matrices? My poking around the BioPython modules and Google searching
> does not turn up any signs indicating there are distance matrix
> parsers, currently. Two particularly useful parsers would be a parser
> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
> (http://evolution.genetics.washington.edu/phylip.html),
I've done a very small amount of work with neighbour joining trees,
using PHYLIP format distance matrices. The closest I could find to a
file format definition was this page:
http://evolution.genetics.washington.edu/phylip/doc/distance.html
Points to be aware of:
In my experience, most software tools write the distances as a full
symmetric matrix. However, the "standard" explicitly discusses lower
triangular form (omitting the zero diagonal entries), which has the
advantage of using about half the disk space. This matters once you
get into thousands of taxa.
So, make sure any parser can cope with both full symmetric, and lower
triangular forms - ideally without the user having to care.
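
As a rough illustration of coping with both layouts, here is a minimal
sketch. It is not an existing BioPython API: the function name
parse_phylip_distances is made up, and it assumes one matrix row per
line and names that contain no whitespace. It decides row by row
whether the data looks full or lower triangular by counting the values.

```python
def parse_phylip_distances(handle):
    """Parse a PHYLIP-style distance matrix (hypothetical helper).

    Accepts full square rows (n values), lower triangular rows without
    the diagonal (i values on row i) or with it (i + 1 values).
    Assumes one row per line and whitespace-free names.
    """
    n = int(handle.readline().split()[0])
    names = []
    matrix = [[0.0] * n for _ in range(n)]
    i = 0
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if i >= n:
            raise ValueError("more rows than the header declared")
        values = [float(x) for x in fields[1:]]
        if len(values) not in (i, i + 1, n):
            raise ValueError("row %d has %d values" % (i + 1, len(values)))
        names.append(fields[0])
        # Only the part left of the diagonal is needed in either layout;
        # the upper triangle is filled in by symmetry.
        for j in range(i):
            matrix[i][j] = matrix[j][i] = values[j]
        i += 1
    if i != n:
        raise ValueError("expected %d rows, found %d" % (n, i))
    return names, matrix
```

Because only the values left of the diagonal are read, a full matrix
that is not actually symmetric would be silently symmetrised; a real
parser might want to check for that.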
This also raises the question of how to store the matrix in memory.
Does Numeric/NumPy have an efficient way of storing symmetric matrices?
A packed symmetric representation would be less flexible than the
suggested list of lists, but for large datasets would need much less
memory.
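
To make the memory point concrete, here is a minimal sketch of packed
storage that keeps only the n*(n-1)/2 entries below the diagonal in a
flat array. The class name CondensedDistances and its get/set methods
are hypothetical, and it uses the standard library array module rather
than Numeric/NumPy.

```python
from array import array


class CondensedDistances:
    """Hypothetical container storing only the strict lower triangle."""

    def __init__(self, n):
        self.n = n
        # n * (n - 1) / 2 doubles instead of n * n
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        # Row i (i >= 1) holds i entries and is preceded by
        # 0 + 1 + ... + (i - 1) = i * (i - 1) / 2 entries.
        if i < j:
            i, j = j, i
        return i * (i - 1) // 2 + j

    def get(self, i, j):
        if i == j:
            return 0.0  # the diagonal is implied, not stored
        return self._data[self._index(i, j)]

    def set(self, i, j, value):
        if i == j:
            if value != 0:
                raise ValueError("diagonal distances must be zero")
            return
        self._data[self._index(i, j)] = value
```

For 1000 taxa this holds 499500 values rather than 1000000, at the cost
of losing the convenient matrix[i][j] indexing of a list of lists.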
Second point - the "official" PHYLIP distance matrix file format
truncates the taxon names at 10 characters. Some tools (e.g. clustalw)
ignore this limitation and will use as many as needed for the full name.
I personally find this much nicer - after all, most gene identifiers
(e.g. GI numbers) are eight characters to start with, and if you are
dealing with multiple features in each gene, 10 characters is tough
going.
So, I would make sure you test the parser on this format variant (with
names longer than 10 characters). I can supply some examples if you like.
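
One possible way to accept both name styles is sketched below. The
helper split_row is hypothetical. It first assumes the relaxed
convention (the name ends at the first whitespace) and only falls back
to the strict 10 column rule when the remaining fields do not all parse
as numbers. That is a heuristic and can guess wrong, as the comment
notes.

```python
def split_row(line, strict_width=10):
    """Split one matrix row into (name, distances). Hypothetical helper.

    Heuristic: try the relaxed form first (name is the first
    whitespace-delimited field, of any length). If the rest is not
    purely numeric, assume the strict form where the name occupies the
    first `strict_width` columns and may contain spaces.
    Known weakness: a strict-format name such as "Sample 12" would be
    misread, since "12" parses as a number.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty row")
    try:
        return fields[0], [float(x) for x in fields[1:]]
    except ValueError:
        pass
    name = line[:strict_width].strip()
    return name, [float(x) for x in line[strict_width:].split()]
```

Trying the strict rule first would be worse: a long name like
"Sequence_012345" would be cut after 10 columns and its tail "12345"
taken as a distance.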
For writing matrices to file, the issue of following the strict
10-character name limit might best be handled as an option (default to
max 10, with a warning if any names are truncated, and an error if
truncation renders names non-unique?).
Likewise an option to save matrices as either fully symmetric or lower
triangular. I would lean towards using fully symmetric as the default
as it seems to be more common.
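
The two writing options could look something like the sketch below. The
function name format_phylip_distances and its keyword arguments are
invented for illustration, and the column widths and number format are
assumptions rather than anything PHYLIP mandates.

```python
import warnings


def format_phylip_distances(names, matrix, strict_names=True,
                            lower_triangular=False):
    """Return the matrix as PHYLIP-style text (hypothetical helper)."""
    if strict_names:
        out_names = [name[:10] for name in names]
        if len(set(out_names)) != len(out_names):
            raise ValueError("names are not unique after truncation")
        if out_names != list(names):
            warnings.warn("some names were truncated to 10 characters")
        width = 10
    else:
        # Relaxed variant: keep full names, which must then be
        # whitespace-free for a reader to find where they end.
        out_names = list(names)
        width = max((len(name) for name in out_names), default=0)
    lines = ["%5d" % len(out_names)]
    for i, name in enumerate(out_names):
        # Lower triangular form omits the diagonal and everything above it.
        row = matrix[i][:i] if lower_triangular else matrix[i]
        cells = " ".join("%.6f" % d for d in row)
        lines.append((name.ljust(width) + " " + cells).rstrip())
    return "\n".join(lines) + "\n"
```

The defaults here follow the suggestions above: strict names and a full
symmetric matrix unless the caller asks otherwise.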
> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
> distance matrix format. If not, would there be any interest in
> creating parsers for these matrices, other than my own? I think
> parsers for distance matrices could be very useful to the community.
I suspect that for serious tree building pure Python will not be
competitive with existing C/C++ code on speed - but nonetheless could
be useful.
Peter