[Biopython-dev] Distance Matrix Parsers

Marc Colosimo mcolosimo at mitre.org
Tue Jun 13 15:46:16 UTC 2006


[I've added Chris in case he isn't on the dev-list]

On Jun 12, 2006, at 5:57 PM, Peter wrote:

> [Send to the Dev list only - forward to the main discussion list if  
> you think best Marc]
>
> One general question about the architecture: Are you thinking of  
> having a generic "distance matrix object", and parsers/formats  
> defined for several different file formats?
>

Yes. I think that is what I am leaning towards. Now, I don't know if  
I'll be the implementor or not. It has been something on my to-do  
list for a while.

> Peter (me) wrote:
>>> In my experience, most software tools usually write the distances  
>>> as a
>>> full symmetric matrix.  However, the "standard" explicitly discusses
>>> lower triangular form (missing out the diagonal distance zero  
>>> entries)
>>> which has the significant advantage of using about half the disk   
>>> space. This is significant once you get into thousands of taxa.
>

Peter wrote:
> Marc Colosimo wrote:
>> This is still small potatoes compared to the input needed to
>> generate the distance matrices (especially with DNA/RNA sequences
>> of any decently sized gene).
>
> Regarding size of matrix file versus size of alignment file, that
> isn't always true.
>
> (*) The matrix file size goes as the square of the number of taxa,  
> the alignment file only linearly.
>
> (*) The matrix file is invariant with respect to the length of the  
> sequences/number of columns in the alignment.
>
> (*) The matrix file size goes linearly with the precision (number  
> of decimal places) used.
>
> As you are using "decently sized genes" then you will have large
> alignment files, but I would imagine you have at most hundreds of
> genes per alignment - not thousands (?).
>
> For my own examples, I have about two thousand domains (not full  
> genes) and the phylip distance matrix file was MUCH bigger than the  
> alignment file.

You got me on that boundary case. I just wanted to point out that is  
not always the case.
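
To put rough numbers on the scaling Peter describes, here is a
back-of-the-envelope sketch. The function names, the "two spaces plus
0.dddddd" value layout, and the 300-column alignment are assumptions
made for illustration; they are not taken from any real file.

```python
def square_matrix_bytes(n_taxa, decimals=6, name_width=10):
    """Rough size in bytes of a square PHYLIP-style distance matrix file."""
    # Assumed layout: each value is two spaces, then "0.", then the decimals.
    value_width = 2 + 2 + decimals
    line_bytes = name_width + n_taxa * value_width + 1  # +1 for the newline
    return n_taxa * line_bytes


def alignment_bytes(n_taxa, n_columns, name_width=10):
    """Rough size in bytes of an alignment with one sequence per line."""
    return n_taxa * (name_width + n_columns + 1)


# Two thousand taxa, as in Peter's example; 300 columns is a made-up length.
print(square_matrix_bytes(2000))   # grows as the square of n_taxa
print(alignment_bytes(2000, 300))  # grows linearly in n_taxa
```

Under these assumptions the square matrix is about 40 MB against
roughly 0.6 MB for the alignment, and a lower triangular file would be
about half the size of the square one.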


> Peter (me) wrote
>>> So, make sure any parser can cope with both full symmetric, and  
>>> lower
>>> triangular forms - ideally without the user having to care.
>
> Marc Colosimo wrote:
>> Phylip does ask you which form to read or write; this is a pain
>> at times. So, having a parser figure this out would be nice.
>> However, the user should know about the choices.
>
> It's fairly easy for the parser to cope with either: For each line  
> of input, only use the "lower triangular" portion - just ignore any  
> remaining text which would be present for a full matrix (square)  
> file, or not present for a lower triangular file.

It should be fairly easy, but I don't understand why Phylip chokes on  
square versus lower triangular. Either way, the class should  
"internally" know what the format read in was, so you can ask it.  
That way if you muck with it or create a new matrix and want to write  
that out, you can ask the class what it read in and then have the new  
one write it out in that format.
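
A minimal sketch of that idea, under some assumptions: names contain no
spaces and are separated from the distances by whitespace, each row
sits on a single line, and the function name and return shape are
hypothetical. It keeps only the lower triangle of each row, as Peter
suggests, and reports which layout it saw so that a caller can ask.

```python
def parse_distance_matrix(lines):
    """Parse a PHYLIP-style distance matrix from an iterable of text lines.

    Returns (names, rows, layout): rows[i] holds the i distances below the
    diagonal for taxon i, and layout is "square" or "lower".
    """
    records = [line.split() for line in lines if line.strip()]
    n = int(records[0][0])  # the first line gives the number of taxa
    if len(records) < n + 1:
        raise ValueError("expected %d rows, found %d" % (n, len(records) - 1))
    names, rows, widths = [], [], []
    for i, fields in enumerate(records[1:n + 1]):
        values = [float(x) for x in fields[1:]]
        if len(values) < i:
            raise ValueError("row %d has too few distances" % (i + 1))
        names.append(fields[0])
        rows.append(values[:i])  # keep the lower triangle, ignore the rest
        widths.append(len(values))
    # Call it square only if every row carried a full set of n values.
    layout = "square" if all(w == n for w in widths) else "lower"
    return names, rows, layout


square = ["3", "A 0.0 0.1 0.2", "B 0.1 0.0 0.3", "C 0.2 0.3 0.0"]
lower = ["3", "A", "B 0.1", "C 0.2 0.3"]
print(parse_distance_matrix(square))
print(parse_distance_matrix(lower))
```

Both calls give the same names and distances; only the reported layout
differs.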

>
> Peter wrote:
>>> This also raises the point about how to store the matrix in memory.
>>> Does Numeric/NumPy have an efficient way of storing symmetric
>>> matrices? This is less flexible than the suggested list of lists,
>>> but for large datasets would need much less memory.
>
> Marc Colosimo wrote:
>> I believe that SciPy (Numeric/NumPy, etc.) is more efficient at
>> storing these things. But you lose that when you want to do
>> pythonish things to it (like write it back out).
>
> It depends on our target audience.  My experience with two thousand  
> taxa means that I am slightly concerned about the memory, and would  
> lean towards storing the data using Numeric/NumPy.  This could be  
> done within a nice python object, with methods to write it out  
> again in phylip format etc - so it could still behave "nicely".

I agree here and think that if the user has Numeric use that,  
otherwise use built-in types. So, maybe two "hidden" classes that do  
the correct thing.
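
Something like the following sketch, perhaps. It uses NumPy (the
successor to Numeric) as the optional backend and falls back to plain
lists when the import fails. All class and function names are
hypothetical. Both backends store only the n*(n-1)/2 distances below
the diagonal in one flat sequence.

```python
try:
    import numpy
except ImportError:  # no array library installed: use built-in types
    numpy = None


class _BaseStore:
    """Flat storage for the lower triangle of a symmetric matrix."""

    def __init__(self, n):
        self.n = n
        self._data = self._allocate(n * (n - 1) // 2)

    def _index(self, i, j):
        if i < j:
            i, j = j, i  # symmetric: always address the lower triangle
        return i * (i - 1) // 2 + j

    def get(self, i, j):
        return 0.0 if i == j else float(self._data[self._index(i, j)])

    def set(self, i, j, value):
        if i == j:
            raise ValueError("diagonal distances are fixed at zero")
        self._data[self._index(i, j)] = value


class _ListStore(_BaseStore):
    def _allocate(self, size):
        return [0.0] * size


class _NumpyStore(_BaseStore):
    def _allocate(self, size):
        return numpy.zeros(size, dtype=float)


def make_store(n):
    """Pick the array-backed store when NumPy is importable."""
    return (_NumpyStore if numpy is not None else _ListStore)(n)


store = make_store(4)
store.set(2, 1, 0.5)
print(store.get(1, 2))  # same entry, addressed from the other side
```

The caller only ever sees make_store(), get() and set(), so which
"hidden" class is in use does not matter to them.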

>
> Peter wrote:
>>> Second point - the "official" PHYLIP distance matrix file format
>>> truncates the taxa names at 10 characters.  Some tools (e.g.  
>>> clustalw)
>>> ignore this limitation and will use as many as needed for the  
>>> full  name.
>
> Marc Colosimo wrote:
>> ...
>> By definition this isn't a variant of Phylip, but another format.  
>> So,  one would need two parsers: PhylipDist and Dist (or  
>> ClustalDist).
>
> That would be another way of looking at the issue, sure.  [See below]
>
> Peter wrote:
>>> For writing matrices to file, the issue of following the strict 10
>>> character taxa limit might best be handled as an option (default  
>>> to  max 10, with a warning if any names are truncated, and an  
>>> error if
>>> truncation renders names non-unique?).
>
> Marc Colosimo wrote:
>> DON'T give an option of 10 or more. That is NOT the definition of  
>> the  Phylip file Matrix structure, so why give the option? Make  
>> another  class that outputs the whole name (ClustalDist).
>
> I like clustal's "long name variant of Phylip distance format", as  
> for my datasets my gene/domain names are longer than 10  
> characters.  I may well be in a minority here (for now).
>
> I suppose it would be "good practice" to follow the official (but  
> not overly precise) phylip definition on this issue.
>
> So your idea of defining two similar formats would resolve this.   
> In terms of implementation, one could probably just subclass the  
> other to reduce the amount of duplicated code.

Correct. Subclassing is our friend (to a point).
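
As a rough sketch of how little the subclass would need to override
(class and method names are hypothetical, and the "%.6f" value format
is an assumption):

```python
import warnings


class PhylipDistWriter:
    """Strict variant: names are truncated or padded to 10 characters."""

    name_width = 10

    def format_name(self, name):
        if len(name) > self.name_width:
            warnings.warn("taxon name %r truncated to %r"
                          % (name, name[:self.name_width]))
        return name[:self.name_width].ljust(self.name_width)

    def format_row(self, name, distances):
        values = "".join("  %.6f" % d for d in distances)
        return self.format_name(name) + values


class ClustalDistWriter(PhylipDistWriter):
    """Long-name variant: the full name is written, then the distances."""

    def format_name(self, name):
        # Pad short names for alignment, but never truncate.
        return name.ljust(self.name_width)


print(ClustalDistWriter().format_row("Homo_sapiens_1", [0.0, 0.25]))
```

Only format_name() differs between the two; everything else is
inherited.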

>
>> I am pretty sure that Phylip doesn't care about non-unique names  
>> so  why error out? However, the class should have a means for the  
>> user to  ask this question.
>
> Because the (truncated) taxa names are going to be used as tree  
> node names by any tree building program, they really should be  
> unique.  I would expect any tree program to throw an error in this  
> case, which is why I suggested we should try not to create such  
> files in the first place.

Not exactly. I've been bitten in the butt by the truncation issue  
several times. I know TreeView X doesn't care about unique names and  
I think MacClade also doesn't care. Now, PAUP and Mesquite might care,
as might any Nexus-type system which lists the taxon names separately
from the taxa in the TREES block (they use numbers for the taxa, which
get mapped to TAXLABELS in the TAXA block). I believe it depends on
how they decided to store these relationships.

I guess we have three options here:
1) keep on trucking
2) raise a warning
3) raise an exception - something like Matrix.NonUniqueName exception
so that you can catch that exception specifically
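
The three options could sit behind one switch, roughly like this. The
exception name, the function name and the on_duplicate argument are
all hypothetical; nothing like this exists in Biopython yet.

```python
import warnings


class NonUniqueNameError(ValueError):
    """Raised when truncation makes two taxon names identical."""


def check_unique(names, width=10, on_duplicate="warn"):
    """Return the truncated names that occur more than once.

    on_duplicate is "ignore" (keep on trucking), "warn" or "raise".
    """
    seen, duplicates = set(), []
    for name in names:
        short = name[:width]
        if short in seen and short not in duplicates:
            duplicates.append(short)
        seen.add(short)
    if duplicates and on_duplicate == "raise":
        raise NonUniqueNameError("non-unique names: %s" % ", ".join(duplicates))
    if duplicates and on_duplicate == "warn":
        warnings.warn("non-unique names: %s" % ", ".join(duplicates))
    return duplicates


try:
    check_unique(["Homo_sapiens_1", "Homo_sapiens_2"], on_duplicate="raise")
except NonUniqueNameError as err:
    print("caught:", err)
```

Because the function always returns the list of clashes, the user can
also just ask the question without triggering any of the three
behaviours by passing on_duplicate="ignore".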

>
> Peter wrote:
>>> Likewise an option to save matrices as either fully symmetric or  
>>> lower
>>> triangular.  I would lean towards using fully symmetric as the  
>>> default
>>> as it seems to be more common.
>
> Marc Colosimo wrote:
>> Phylip's default seems to be a "Square" distance matrix, i.e.
>> fully symmetric. Keep this in mind when naming things or writing
>> documentation.
>
> Good point.
>
> Peter
>
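
For the writing side, a small sketch with square output as the default.
The function name, the "%.4f" precision and the 5-wide taxon count are
assumptions; rows[i] is taken to hold the i distances below the
diagonal for taxon i.

```python
def format_matrix(names, rows, square=True):
    """Return PHYLIP-style text for a lower-triangle list of rows."""
    n = len(names)
    lines = ["%5d" % n]
    for i, name in enumerate(names):
        if square:
            # Rebuild the full row by symmetry: d(i, j) == d(j, i), d(i, i) == 0.
            values = [rows[i][j] if j < i else 0.0 if j == i else rows[j][i]
                      for j in range(n)]
        else:
            values = rows[i]
        lines.append(name[:10].ljust(10) + "".join("  %.4f" % v for v in values))
    return "\n".join(lines) + "\n"


names = ["A", "B", "C"]
rows = [[], [0.1], [0.2, 0.3]]
print(format_matrix(names, rows))                # "Square" form
print(format_matrix(names, rows, square=False))  # lower triangular form
```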



