[BioPython] Distance Matrix Parsers

Peter biopython at maubp.freeserve.co.uk
Sun Jun 25 21:37:53 UTC 2006


[Off topic, but has anyone else recently had valid messages bounced due
to a "suspicious header"?]

Hello List,

I recently wanted to load a "PHYLIP distance matrix file" created by
clustalw for my own research...

As discussed earlier, clustalw bends the official PHYLIP specification
by not truncating long names to 10 characters.  For my dataset I need
the long names to avoid ambiguity.

The attached code implements a fairly simple distance matrix class and
associated code to read (parse) and write PHYLIP style distance matrices.

There are options to control strict 10 character name truncation, and
the separator character(s) when writing files.
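
For readers without the attachment, here is a minimal sketch of how the
two options described above might work. It is not the attached
phylip_dst.py: the function names, the strict_names and separator
parameters, and the four decimal place output are all assumptions made
for illustration, and it assumes a square matrix with each row on a
single line.

```python
def parse_phylip_distances(text, strict_names=False):
    """Parse a square PHYLIP-style distance matrix held in a string.

    Hypothetical helper (not the attached phylip_dst.py).
    Assumes every taxon's row fits on one line.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:count + 1]:
        if strict_names:
            # Official PHYLIP: the name is exactly the first 10 columns.
            name, rest = line[:10].strip(), line[10:]
        else:
            # Relaxed (clustalw style): the name runs to the first whitespace.
            parts = line.split(None, 1)
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        values = [float(v) for v in rest.split()]
        if len(values) != count:
            raise ValueError("expected %d distances for %r, found %d"
                             % (count, name, len(values)))
        names.append(name)
        rows.append(values)
    return names, rows


def format_phylip_distances(names, rows, strict_names=False, separator=" "):
    """Write names and square rows back out as PHYLIP-style text."""
    out = ["%5d" % len(names)]
    for name, row in zip(names, rows):
        # Strict mode truncates and pads the name to exactly 10 characters.
        label = name[:10].ljust(10) if strict_names else name
        out.append(label + separator
                   + separator.join("%.4f" % v for v in row))
    return "\n".join(out) + "\n"
```

Note that truncating to 10 characters can make two long names
identical, which is exactly the ambiguity the relaxed mode avoids.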

Internally, I store the distances as a list of lists (of different
lengths) to mimic a lower triangular matrix.

For example, this matrix:

[[0.0, 0.1, 0.2],
 [0.1, 0.0, 0.5],
 [0.2, 0.5, 0.0]]

Is stored as this:

[[], [0.1], [0.2, 0.5]]

This may not be the best way to do this in terms of speed and memory usage.
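
The storage scheme above can be sketched in a few lines. Again, this is
only an illustration under my own assumptions, not the class from the
attachment; the function names are hypothetical. Row i holds the i
distances to the earlier entries, and the zero diagonal is implied
rather than stored.

```python
def to_lower_triangle(full):
    """Keep only the entries strictly below the diagonal of a square matrix."""
    return [list(row[:i]) for i, row in enumerate(full)]


def get_distance(triangle, i, j):
    """Look up the distance between entries i and j in triangular storage."""
    if i == j:
        return 0.0  # the diagonal is implied, not stored
    if i < j:
        i, j = j, i  # symmetry: only row > column is stored
    return triangle[i][j]


def to_full(triangle):
    """Rebuild the full symmetric matrix as a list of equal-length lists."""
    n = len(triangle)
    return [[get_distance(triangle, i, j) for j in range(n)]
            for i in range(n)]
```

For n entries this stores n*(n-1)/2 floats instead of n*n, at the cost
of an index swap on each lookup.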

There are some simple test cases included, but I have not pushed the
code very far and there may be problems.  Anyway - in case anyone is
interested, either in the short term or as a source of ideas for how
BioPython could support these files - here it is.

I'm sure someone more familiar with arrays (Numeric and NumPy) would be
able to make the class act more like an array - but the basics are there.

As far as I could see, neither Numeric nor NumPy has a specific
symmetric matrix / symmetric array class, which would be ideal.

Members of the list are welcome to use the code, but please contact me
before re-distributing it to anyone else.

Peter

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phylip_dst.py
Type: text/x-python
Size: 16528 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20060625/8d20b314/attachment-0002.py>

