From mcolosimo at mitre.org  Mon Jun 12 08:38:18 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 08:38:18 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
	<128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org>

[cross-posting to biopython-dev]

Chris,

Oops, didn't notice this was on the general biopython mailing list. I  
think many of the developers also subscribe to this list, but just in  
case I'm cross posting this.

Iddo pointed out Bio.SubsMat; I didn't know what that module did. That
is one problem with names like that, and the API docs are helpful only
when you look at them
<http://biopython.org/DIST/docs/api/public/trees.html> (kudos to those
who add documentation).

Given Bio.SubsMat and the BioPerl Module, I would strongly consider  
combining the Bio.SubsMat and the PhylipDist into a new Bio.Matrix  
module. From a Phylo module, a function/class can always call the  
Bio.Matrix classes.

Marc

On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote:

>> I likewise didn't know about the Bio::Matrix::PhylipDist module.
>> Personally, I would opt for a Matrix object (since Python is an
>> OO language) and store it internally as a nested list. That way you
>> have the best of both worlds. The next question is the object
>> hierarchy. Here I would opt for a top level Matrix class (or module)
>> and then subclass that under Phylo. So, something like this:
>>
>> Bio.Matrix
>> Bio.Phylo.Matrix
>
> So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
> matrix is a type of matrix, so that hierarchy is immediately
> appealing, however, a phylogenetic matrix is not of much use in and of
> itself, so I can see the argument that it should be placed in a
> phylogeny package (which we have yet to write but as mentioned
> earlier, could be very useful).
>
>> and maybe things like the following (which isn't used/followed much
>> here in BioPython)
>>
>> Bio.Phylo.IO
>> Bio.Phylo.Parsers.PhylipDist
>> Bio.Phylo.Parsers.Newick
>> Bio.Phylo.Parsers.Nexus
>>
>> And/or have
>> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
>
> This is very very good, in my opinion. Thanks for doing the
> heavy-lifting of the brainwork on this! =-)
>
>> The next big question is what should Bio.Phylo.IO return? For
>> inspiration, we might want to look at Mesquite <http://
>> mesquiteproject.org/mesquite/mesquite.html>.
>
> I must give a better look at this site before commenting, but once
> again, thanks for bringing this to my awareness! What a helpful past
> couple of emails. I will be out for the weekend but will think more
> about this.
>
> As a sidenote, should this discussion be moved to biopython-dev or is
> it fine here?
>
> Thanks again Marc,
> Chris
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From mcolosimo at mitre.org  Mon Jun 12 09:18:41 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 09:18:41 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>

[cross post]
On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices? My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices.  The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk  
> space.
>   This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate  
the distance matrices (especially with DNA/RNA sequences of any
decently sized gene).

>
> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip does ask you which form to read or write; this is a pain at
times. So, having a parser figure this out would be nice. However,  
the user should know about the choices.

>
> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric  
> matrices?
>   This is less flexible than the suggested list of lists, but for  
> large
> datasets would need much less memory.

I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
storing these things. But you lose that when you want to do pythonish  
things to it (like write it back out).

>
> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
> ignore this limitation and will use as many as needed for the full  
> name.

ClustalW does the CORRECT thing, it truncates the name to 10  
characters for Phylip output (alignments). And it does the CORRECT  
thing for its  distance matrix file.

In ClustalW's trees.c file:

void distance_matrix_output(FILE *ofile)

	/* left justify to the maximum length of names in the current
	   alignment file and use a space as a separator */
	fprintf(ofile,"\n%-*s ",max_names,names[i]);

Spaces in names are bad in this case, but Phylip is okay with them,
since the first 10 characters are the taxon name.

>   I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough  
> going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters).  I can supply some examples if  
> you like.

By definition this isn't a variant of Phylip, but another format. So,  
one would need two parsers: PhylipDist and Dist (or ClustalDist).

>
> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to  
> max
> 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the  
Phylip file Matrix structure, so why give the option? Make another  
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names so  
why error out? However, the class should have a means for the user to  
ask this question.

>
> Likewise an option to save matrices as either fully symmetric or lower
> triangular.  I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully  
symmetric. Keep this in mind when naming or documenting.

>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.
>

Well, we do have things like SciPy and PyClustal, which make things  
more even.

Marc

From biopython-dev at maubp.freeserve.co.uk  Mon Jun 12 17:57:36 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jun 2006 22:57:36 +0100
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
Message-ID: <448DE350.8000403@maubp.freeserve.co.uk>

[Sent to the Dev list only - forward to the main discussion list if you 
think best Marc]

One general question about the architecture: Are you thinking of having 
a generic "distance matrix object", and parsers/formats defined for 
several different file formats?

Peter (me) wrote:
>>In my experience, most software tools usually write the distances as a
>>full symmetric matrix.  However, the "standard" explicitly discusses
>>lower triangular form (missing out the diagonal distance zero entries)
>>which has the significant advantage of using about half the disk  
>>space. This is significant once you get into thousands of taxa.

Marc Colosimo wrote:
> This is still small potatoes compared to the input needed to generate  
> the distance matrices (especially with DNA/RNA sequences of any  
> decently sized gene).

Regarding size of matrix file versus size of alignment file, that isn't 
always true.

(*) The matrix file size goes as the square of the number of taxa, the 
alignment file only linearly.

(*) The matrix file is invariant with respect to the length of the 
sequences/number of columns in the alignment.

(*) The matrix file size goes linearly with the precision (number of 
decimal places) used.

As you are using "decently sized genes" then you will have large 
alignment files, but I would imagine you have at most hundred of genes 
per alignment - not thousands (?).

For my own examples, I have about two thousand domains (not full genes) 
and the phylip distance matrix file was MUCH bigger than the alignment file.
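
A rough back-of-the-envelope version of the scaling argument above. The
field widths (10-character names, one separator per value, one byte per
alignment column) and the 300-column alignment are assumptions chosen
only for illustration, and the function names are hypothetical:

```python
def matrix_bytes(n_taxa, decimals=4, name_width=10, square=True):
    """Approximate size in bytes of a PHYLIP-style distance matrix file."""
    # Each value is written as a space, one digit, a point and the decimals.
    field = decimals + 3
    if square:
        entries = n_taxa * n_taxa
    else:
        # Lower triangle without the zero diagonal.
        entries = n_taxa * (n_taxa - 1) // 2
    # One name plus one newline per row, then all the distance fields.
    return n_taxa * (name_width + 1) + entries * field


def alignment_bytes(n_taxa, n_columns, name_width=10):
    """Approximate size in bytes of an alignment with one row per taxon."""
    return n_taxa * (name_width + n_columns + 1)


if __name__ == "__main__":
    print(matrix_bytes(2000))                # square matrix
    print(matrix_bytes(2000, square=False))  # lower triangular
    print(alignment_bytes(2000, 300))        # assumed 300 columns
```

Under these assumptions the square matrix for two thousand taxa is
about 28 MB, the lower triangular form about 14 MB, and the alignment
about 0.6 MB.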

Peter (me) wrote
>>So, make sure any parser can cope with both full symmetric, and lower
>>triangular forms - ideally without the user having to care.

Marc Colosimo wrote:
> Phylip does ask you which to either read or write; this is a pain at  
> times. So, having a parser figure this out would be nice. However,  
> the user should know about the choices.

It's fairly easy for the parser to cope with either: For each line of 
input, only use the "lower triangular" portion - just ignore any 
remaining text which would be present for a full matrix (square) file, 
or not present for a lower triangular file.
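
A minimal sketch of that approach, assuming one matrix row per line,
whitespace-separated fields, names without embedded spaces, and a lower
triangular form that omits the diagonal. Real PHYLIP files can break
these assumptions (fixed-width names, wrapped rows), and the function
name is hypothetical:

```python
def parse_phylip_dist(lines):
    """Read a square or lower-triangular matrix, keeping the lower triangle."""
    rows = [line for line in lines if line.strip()]
    n_taxa = int(rows[0].split()[0])
    names = []
    lower = []
    for i, line in enumerate(rows[1:n_taxa + 1]):
        fields = line.split()
        names.append(fields[0])
        # Row i needs exactly i distances. Anything after that is the
        # diagonal and upper part of a square file and is ignored.
        values = [float(x) for x in fields[1:i + 1]]
        if len(values) != i:
            raise ValueError("row %d has too few distances" % (i + 1))
        lower.append(values)
    return names, lower


if __name__ == "__main__":
    square = ["3", "A 0.0 0.1 0.2", "B 0.1 0.0 0.3", "C 0.2 0.3 0.0"]
    triangular = ["3", "A", "B 0.1", "C 0.2 0.3"]
    print(parse_phylip_dist(square) == parse_phylip_dist(triangular))
```

Because the extra columns are simply skipped, this sketch does not
record which form it read; a real parser would probably want to note
that as well.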

Peter wrote:
>>This also raises the point about how to store the matrix in memory.
>>Does Numeric/NumPy have an efficient way of storing symmetric  
>>matrices? This is less flexible than the suggested list of lists,
>>but for large datasets would need much less memory.

Marc Colosimo wrote:
> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
> storing these things. But you lose that when you want to do pythonish  
> things to it (like write it back out).

It depends on our target audience.  My experience with two thousand taxa 
means that I am slightly concerned about the memory, and would lean 
towards storing the data using Numeric/NumPy.  This could be done within 
a nice python object, with methods to write it out again in phylip 
format etc - so it could still behave "nicely".
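
One way such an object could look, sketched with the standard library's
array module as a stand-in for a Numeric/NumPy array (an assumption;
the thread does not settle on a backend). Only the lower triangle is
stored, in a flat buffer, and the class name is hypothetical:

```python
from array import array


class CondensedMatrix:
    """Symmetric distance matrix stored as a flat lower triangle."""

    def __init__(self, names):
        self.names = list(names)
        n = len(self.names)
        # n*(n-1)/2 doubles; the zero diagonal is implied, not stored.
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        # Map (i, j) with i > j onto the flat buffer, row by row.
        if i < j:
            i, j = j, i
        return i * (i - 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        return 0.0 if i == j else self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        if i == j:
            raise ValueError("diagonal distances are fixed at zero")
        self._data[self._index(i, j)] = value


if __name__ == "__main__":
    m = CondensedMatrix(["A", "B", "C"])
    m[0, 2] = 0.5
    print(m[2, 0], m[1, 1])
```

The caller indexes with m[i, j] in either order, so the storage layout
stays an internal detail that could later be swapped for another
backend.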

Peter wrote:
>>Second point - the "official" PHYLIP distance matrix file format
>>truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
>>ignore this limitation and will use as many as needed for the full  
>>name.

Marc Colosimo wrote:
> ...
> 
> By definition this isn't a variant of Phylip, but another format. So,  
> one would need two parsers: PhylipDist and Dist (or ClustalDist).

That would be another way of looking at the issue, sure.  [See below]

Peter wrote:
>>For writing matrices to file, the issue of following the strict 10
>>character taxa limit might best be handled as an option (default to  
>>max 10, with a warning if any names are truncated, and an error if
>>truncation renders names non-unique?).

Marc Colosimo wrote:
> DON'T give an option of 10 or more. That is NOT the definition of the  
> Phylip file Matrix structure, so why give the option? Make another  
> class that outputs the whole name (ClustalDist).

I like clustal's "long name variant of Phylip distance format", as for 
my datasets my gene/domain names are longer than 10 characters.  I may 
well be in a minority here (for now).

I suppose it would be "good practice" to follow the official (but not 
overly precise) phylip definition on this issue.

So your idea of defining two similar formats would resolve this.  In 
terms of implementation, one could probably just subclass the other to 
reduce the amount of duplicated code.
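
A sketch of how the two formats might share code through subclassing.
The class names, the four-decimal formatting and the single-space
separator are assumptions for illustration, not an agreed design:

```python
class PhylipDistWriter:
    """Formats rows with names cut or padded to exactly 10 characters."""

    def __init__(self, names):
        self.names = list(names)

    def name_width(self):
        return 10

    def format_name(self, name):
        width = self.name_width()
        return name[:width].ljust(width)

    def format_row(self, name, distances):
        cells = ["%.4f" % d for d in distances]
        return " ".join([self.format_name(name)] + cells)


class ClustalDistWriter(PhylipDistWriter):
    """Same layout, but names are kept whole and padded to the longest."""

    def name_width(self):
        return max(len(name) for name in self.names)


if __name__ == "__main__":
    names = ["short", "a_very_long_domain_name"]
    for writer in (PhylipDistWriter(names), ClustalDistWriter(names)):
        print(repr(writer.format_row(names[1], [0.0, 0.5])))
```

Only name_width() differs between the two classes, so everything else
is inherited unchanged.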

> I am pretty sure that Phylip doesn't care about non-unique names so  
> why error out? However, the class should have a means for the user to  
> ask this question.

Because the (truncated) taxa names are going to be used as tree node 
names by any tree building program, they really should be unique.  I 
would expect any tree program to throw an error in this case, which is 
why I suggested we should try not to create such files in the first place.
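
A small sketch of the check this implies: group the full names by their
first 10 characters and report any group with more than one member. The
function name and return shape are hypothetical:

```python
from collections import defaultdict


def truncation_collisions(names, width=10):
    """Map each truncated name to the full names that collapse onto it."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only the truncated names shared by two or more taxa.
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    taxa = ["gene_0001_domA", "gene_0001_domB", "gene_0002"]
    print(truncation_collisions(taxa))
```

A writer could call this before producing any output and decide what
to do if the result is not empty.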

Peter wrote:
>>Likewise an option to save matrices as either fully symmetric or lower
>>triangular.  I would lean towards using fully symmetric as the default
>>as it seems to be more common.

Marc Colosimo wrote:
> Phylip's default seems to be a "Square" distance matrix, i.e. fully  
> symmetric. Keep this in mind when naming or documentation.

Good point.

Peter


From mcolosimo at mitre.org  Tue Jun 13 11:46:16 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 13 Jun 2006 11:46:16 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <448DE350.8000403@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
Message-ID: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>

[I've added Chris in case he isn't on the dev-list]

On Jun 12, 2006, at 5:57 PM, Peter wrote:

> [Send to the Dev list only - forward to the main discussion list if  
> you think best Marc]
>
> One general question about the architecture: Are you thinking of  
> having a generic "distance matrix object", and parsers/formats  
> defined for several different file formats?
>

Yes. I think that is what I am leaning towards. Now, I don't know if  
I'll be the implementor or not. It has been something on my to-do  
list for a while.

> Peter (me) wrote:
>>> In my experience, most software tools usually write the distances  
>>> as a
>>> full symmetric matrix.  However, the "standard" explicitly discusses
>>> lower triangular form (missing out the diagonal distance zero  
>>> entries)
>>> which has the significant advantage of using about half the disk   
>>> space. This is significant once you get into thousands of taxa.
>

Peter wrote:
> Marc Colosimo wrote:
>> This is still small potatoes compared to the input needed to  
>> generate  the distance matrices (especially with DNA/RNA sequences  
>> of any  decently sized gene).
>
> Regarding size of matrix file versus size of alignment file, that  
> isn't always true.
>
> (*) The matrix file size goes as the square of the number of taxa,  
> the alignment file only linearly.
>
> (*) The matrix file is invariant with respect to the length of the  
> sequences/number of columns in the alignment.
>
> (*) The matrix file size goes linearly with the precision (number  
> of decimal places) used.
>
> As you are using "decently sized genes" then you will have large  
> alignment files, but I would imagine you have at most hundreds of  
> genes per alignment - not thousands (?).
>
> For my own examples, I have about two thousand domains (not full  
> genes) and the phylip distance matrix file was MUCH bigger than the  
> alignment file.

You got me on that boundary case. I just wanted to point out that is  
not always the case.


> Peter (me) wrote
>>> So, make sure any parser can cope with both full symmetric, and  
>>> lower
>>> triangular forms - ideally without the user having to care.
>
> Marc Colosimo wrote:
>> Phylip does ask you which to either read or write; this is a pain  
>> at  times. So, having a parser figure this out would be nice.  
>> However,  the user should know about the choices.
>
> It's fairly easy for the parser to cope with either: For each line  
> of input, only use the "lower triangular" portion - just ignore any  
> remaining text which would be present for a full matrix (square)  
> file, or not present for a lower triangular file.

It should be fairly easy, but I don't understand why Phylip chokes on  
square versus lower triangular. Either way, the class should  
"internally" know what the format read in was, so you can ask it.  
That way if you muck with it or create a new matrix and want to write  
that out, you can ask the class what it read in and then have the new  
one write it out in that format.

>
> Peter wrote:
>>> This also raises the point about how to store the matrix in memory.
>>> Does Numeric/NumPy have an efficient way of storing symmetric   
>>> matrices? This is less flexible than the suggested list of lists,
>>> but for large datasets would need much less memory.
>
> Marc Colosimo wrote:
>> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at   
>> storing these things. But you lose that when you want to do  
>> pythonish  things to it (like write it back out).
>
> It depends on our target audience.  My experience with two thousand  
> taxa means that I am slightly concerned about the memory, and would  
> lean towards storing the data using Numeric/NumPy.  This could be  
> done within a nice python object, with methods to write it out  
> again in phylip format etc - so it could still behave "nicely".

I agree here and think that if the user has Numeric use that,  
otherwise use built-in types. So, maybe two "hidden" classes that do  
the correct thing.

>
> Peter wrote:
>>> Second point - the "official" PHYLIP distance matrix file format
>>> truncates the taxa names at 10 characters.  Some tools (e.g.  
>>> clustalw)
>>> ignore this limitation and will use as many as needed for the  
>>> full  name.
>
> Marc Colosimo wrote:
>> ...
>> By definition this isn't a variant of Phylip, but another format.  
>> So,  one would need two parsers: PhylipDist and Dist (or  
>> ClustalDist).
>
> That would be another way of looking at the issue, sure.  [See below]
>
> Peter wrote:
>>> For writing matrices to file, the issue of following the strict 10
>>> character taxa limit might best be handled as an option (default  
>>> to  max 10, with a warning if any names are truncated, and an  
>>> error if
>>> truncation renders names non-unique?).
>
> Marc Colosimo wrote:
>> DON'T give an option of 10 or more. That is NOT the definition of  
>> the  Phylip file Matrix structure, so why give the option? Make  
>> another  class that outputs the whole name (ClustalDist).
>
> I like clustal's "long name variant of Phylip distance format", as  
> for my datasets my gene/domain names are longer than 10  
> characters.  I may well be in a minority here (for now).
>
> I suppose it would be "good practice" to follow the official (but  
> not overly precise) phylip definition on this issue.
>
> So your idea of defining two similar formats would resolve this.   
> In terms of implementation, one could probably just subclass the  
> other to reduce the amount of duplicated code.

Correct. Subclassing is our friend (to a point).

>
>> I am pretty sure that Phylip doesn't care about non-unique names  
>> so  why error out? However, the class should have a means for the  
>> user to  ask this question.
>
> Because the (truncated) taxa names are going to be used as tree  
> node names by any tree building program, they really should be  
> unique.  I would expect any tree program to throw an error in this  
> case, which is why I suggested we should try not to create such  
> files in the first place.

Not exactly. I've been bitten in the butt by the truncation issue  
several times. I know TreeView X doesn't care about unique names and  
I think MacClade also doesn't care. Now, PAUP and Mesquite might care,
as might any Nexus-type system that lists the taxon names separately
from the taxa in the TREES block (they use numbers for the taxa, which
get mapped to TAXLABELS in the TAXA block). I believe it depends on
how they decided to store these relationships.

I guess we have three options here:
1) keep on trucking
2) raise a warning
3) raise an exception - something like a Matrix.NonUniqueName exception,
so that you can catch that exception specifically

>
> Peter wrote:
>>> Likewise an option to save matrices as either fully symmetric or  
>>> lower
>>> triangular.  I would lean towards using fully symmetric as the  
>>> default
>>> as it seems to be more common.
>
> Marc Colosimo wrote:
>> Phylip's default seems to be a "Square" distance matrix, i.e.  
>> fully  symmetric. Keep this in mind when naming or documentation.
>
> Good point.
>
> Peter
>


From chris.lasher at gmail.com  Wed Jun 14 14:36:00 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 14 Jun 2006 14:36:00 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
Message-ID: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>

> [I've added Chris in case he isn't on the dev-list]

Thanks Marc! I actually joined Dev list as the discussion got interesting.
Figured we'd move it to here eventually.

>> One general question about the architecture: Are you thinking of having a
>> generic "distance matrix object", and parsers/formats defined for several
>> different file formats?
>>
>
> Yes. I think that is what I am leaning towards. Now, I don't know if I'll
> be the implementor or not. It has been something on my to-do list for a
> while.

BioPython support for these formats with clean, testable code should be the
primary task, correct? I can help with this.  After we get working code,
refactoring for memory management can take place. I haven't done anything
along these lines and I'd have to rely on someone else's expertise for this.

> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses lower
> triangular form (missing out the diagonal distance zero entries) which has
> the significant advantage of using about half the disk space. This is
> significant once you get into thousands of taxa.

I guess we need to consider that storing the matrix as a triangular form
will save some memory. However, I've emailed the SciPy/NumPy guys and there
is currently no support for a triangular/symmetric matrix; it would have to
be a square matrix. See more below.

>>>> So, make sure any parser can cope with both full symmetric, and lower
>>>> triangular forms - ideally without the user having to care.
>>>
>>> Phylip does ask you which to either read or write; this is a pain at
>>> times. So, having a parser figure this out would be nice.  However, the
>>> user should know about the choices.
>>
>> It's fairly easy for the parser to cope with either: For each line of
>> input, only use the "lower triangular" portion - just ignore any
>> remaining text which would be present for a full matrix (square) file, or
>> not present for a lower triangular file.

Well, we can save a lot of developer time by requiring the user to designate
this, with the default being a square matrix. Is it unreasonable to expect
the user to know whether his or her matrix is lower/upper-triangular or
square? Autodetection seems to add a bit of risk, e.g., either the detection
has to be confirmed by the user (in which case, what's the point of
auto-detect), or we have to have a really well tested auto-detector, i.e., a
lot more developer time.

> It should be fairly easy, but I don't understand why Phylip chokes on
> square versus lower triangular. Either way, the class should "internally"
> know what the format read in was, so you can ask it.  That way if you muck
> with it or create a new matrix and want to write that out, you can ask the
> class what it read in and then have the new one write it out in that
> format.

I think it makes sense for a Phylip triangular matrix and a Phylip square
matrix to be represented as the same type of object, for reasons of
consistency, as already discussed. As Marc pointed out, its original form
can simply be represented by an attribute of the object. It should also be
possible to write the matrix back out in either triangular or square format,
regardless of its original format. These would probably just be methods of
the object, such as .to_phylip_square() and .to_phylip_ltriangular()
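
A sketch of what such an object might look like. The two method names
come from the paragraph above; the class name, the source_format
attribute, the nested-list layout and the four-decimal output are
assumptions for illustration:

```python
class DistanceMatrix:
    """Distance matrix that remembers the form it was read from."""

    def __init__(self, names, lower, source_format="square"):
        self.names = list(names)
        # lower[i] holds the i distances from taxon i to taxa 0..i-1.
        self.lower = [list(row) for row in lower]
        self.source_format = source_format

    def distance(self, i, j):
        if i == j:
            return 0.0
        if i < j:
            i, j = j, i
        return self.lower[i][j]

    def _lines(self, row_length):
        yield "%5d" % len(self.names)
        for i, name in enumerate(self.names):
            cells = ["%.4f" % self.distance(i, j) for j in range(row_length(i))]
            yield " ".join([name[:10].ljust(10)] + cells)

    def to_phylip_square(self):
        n = len(self.names)
        return "\n".join(self._lines(lambda i: n)) + "\n"

    def to_phylip_ltriangular(self):
        return "\n".join(self._lines(lambda i: i)) + "\n"


if __name__ == "__main__":
    m = DistanceMatrix(["A", "B"], [[], [0.5]], source_format="ltriangular")
    print(m.to_phylip_square())
    print(m.to_phylip_ltriangular())
```

Either writer works regardless of source_format, which is kept only so
that calling code can ask how the data arrived.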

>>>> This also raises the point about how to store the matrix in memory.
>>>> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
>>>> This is less flexible than the suggested list of lists, but for large
>>>> datasets would need much less memory.
>>>
>>> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at storing
>>> these things. But you lose that when you want to do pythonish things to
>>> it (like write it back out).
>>
>> It depends on our target audience.  My experience with two thousand taxa
>> means that I am slightly concerned about the memory, and would lean
>> towards storing the data using Numeric/NumPy.  This could be done within
>> a nice python object, with methods to write it out again in phylip format
>> etc - so it could still behave "nicely".
>
> I agree here and think that if the user has Numeric use that, otherwise
> use built-in types. So, maybe two "hidden" classes that do the correct
> thing.

This just recently popped up on the NumPy discussion list:
http://www.mail-archive.com/numpy-discussion at lists.sourceforge.net/msg00265.html

The summary of that is we can memory-map it using numpy.memmap. I've never
used this before, so I can't really comment. I'd guess that for small data
files, this is overkill. For large sets it might be reasonable. I suppose
two separate classes could be available, one for smaller matrices and one
for larger. Again, I think the user would be intelligent enough to make the
decision as to which to use.

Since the class for handling standard (smaller) matrices will be easier to
code, I propose writing this standard one first and getting it into
BioPython. For this class, I suggest just sticking with a regular nested
list, rather than use something from Numeric/NumPy.

After this class is created and submitted, we can go back and create a class
to deal with larger matrices that's a sub-class of the standard one. This
way, the API remains the same, regardless of the class, and we will only
have to rewrite the methods that need changing due to the way we'll need to
interact with the underlying data structure of the wrapped Numeric/NumPy
object. How does that sound?

>> I like clustal's "long name variant of Phylip distance format", as for my
>> datasets my gene/domain names are longer than 10 characters.  I may well
>> be in a minority here (for now).
>>
>> I suppose it would be "good practice" to follow the official (but not
>> overly precise) phylip definition on this issue.
>>
>> So your idea of defining two similar formats would resolve this.  In
>> terms of implementation, one could probably just subclass the other to
>> reduce the amount of duplicated code.
>
> Correct. subclassing is our friend (to a point).
>

I'm in agreement with using two separate types of objects to represent these
two formats. PhylipDist should represent the Phylip spec to the T. I'm not
familiar with the Clustal spec; is it formatted similarly, sans the
requirement of 10 characters max for the sequence name?

An editorial note, I'm very frustrated with Phylip's 10 character limit for
sequence names, too. I don't know the reasoning and history behind the
decisions on the format; all I know is that it is an uncomfortably
restrictive and seemingly arbitrary format. Why it has not been updated is
beyond me, unless, like these parsers for BioPython, it's just another
project waiting for someone to work on it.

>>> I am pretty sure that Phylip doesn't care about non-unique names so why
>>> error out? However, the class should have a means for the user to ask
>>> this question.
>>
>> Because the (truncated) taxa names are going to be used as tree node
>> names by any tree building program, they really should be unique.  I
>> would expect any tree program to throw an error in this case, which is
>> why I suggested we should try not to create such files in the first
>> place.
>
> Not exactly. I've been bitten in the butt by the truncation issue several
> times. I know TreeView X doesn't care about unique names and I think
> MacClade also doesn't care. Now, PAUP and Mesquite might care, as might
> any Nexus-type system that lists the taxon names separately from the taxa
> in the TREES block (they use numbers for the taxa, which get mapped to
> TAXLABELS in the TAXA block). I believe it depends on how they decided to
> store these relationships.
>
> I guess we have three options here: 1) keep on trucking 2) raise a warning
> 3) raise an exception - something like a Matrix.NonUniqueName exception,
> so that you can catch that exception specifically
>

I dislike option 1, unless we also provide the user the ability to check for
non-unique names, too. Remember the Zen of Python: "Explicit is better than
implicit."

I like option 3, though I don't know how to make it possible for code
outside the parser to catch the exception and tell the parser to continue.
We could have it throw the exception by default, but if the user provides a
flag in calling the parser, like allow_non_unique=True, we could have logic
in the parser that, if the flag is True, catches the exception and continues.
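
A minimal sketch of that behaviour. The exception name and the
allow_non_unique flag are taken from this thread; the check_names
helper is hypothetical, and downgrading to a warning when the flag is
set is one possible reading of "catch the exception and continue":

```python
import warnings


class NonUniqueName(ValueError):
    """Raised when two taxa in a matrix share the same name."""


def check_names(names, allow_non_unique=False):
    """Return the duplicated names, raising unless duplicates are allowed."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates:
        message = "non-unique taxon names: %s" % ", ".join(duplicates)
        if not allow_non_unique:
            raise NonUniqueName(message)
        # The caller opted in, so report the problem and carry on.
        warnings.warn(message)
    return duplicates


if __name__ == "__main__":
    try:
        check_names(["A", "A", "B"])
    except NonUniqueName as error:
        print("refused:", error)
    print(check_names(["A", "A", "B"], allow_non_unique=True))
```

Calling code that wants strict behaviour can rely on the default, and
code that knows its downstream tools tolerate duplicates can pass the
flag explicitly.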

>> Likewise an option to save matrices as either fully symmetric or lower
>> triangular.  I would lean towards using fully symmetric as the default as
>> it seems to be more common.
>
> Phylip's default seems to be a "Square" distance matrix, i.e.  fully
> symmetric. Keep this in mind when naming or documentation.

As I mentioned above, the same object would represent both types, and should
be equally capable of outputting itself as text in either format.

Chris

From gvwilson at cs.utoronto.ca  Sun Jun 18 14:15:18 2006
From: gvwilson at cs.utoronto.ca (Greg Wilson)
Date: Sun, 18 Jun 2006 14:15:18 -0400
Subject: [Biopython-dev] ann: open source course on basic software
	development skills
Message-ID: <e74578$pq2$12@sea.gmane.org>

http://www.third-bit.com/swc is an open source course on basic software
development skills, aimed primarily at people with backgrounds in
science, engineering, and medicine who have little formal training in
programming, but find themselves doing a lot of it.  The course was
developed in part through support from the Python Software Foundation;
all of the material can be used and modified free of charge (but with
attribution).  If you have questions, would like to contribute material,
or have a success story you'd like to share, please contact Greg Wilson
(gvwilson at cs.utoronto.ca).

Thanks,
Greg


From chris.lasher at gmail.com  Mon Jun 19 13:49:34 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Mon, 19 Jun 2006 13:49:34 -0400
Subject: [Biopython-dev] Bugzilla
Message-ID: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>

I noticed that the Bugzilla for BioPython on Open Bio (
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks
the current version number. Also, I noticed there seem to be quite a
few open tickets. Is BioPython still using Open Bio's Bugzilla to
track bugs?

Chris

From mdehoon at c2b2.columbia.edu  Mon Jun 19 14:27:26 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 19 Jun 2006 14:27:26 -0400
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
Message-ID: <4496EC8E.7030603@c2b2.columbia.edu>

Chris Lasher wrote:
> Is BioPython still using Open Bio's Bugzilla to track bugs?

Yes.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 10:38:40 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 15:38:40 +0100
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
Message-ID: <44980870.5070309@maubp.freeserve.co.uk>

Chris Lasher wrote:
> I noticed that the Bugzilla for BioPython on Open Bio (
> http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks
> the current version number.

Good point - who can edit that list?

> Also, I noticed there seem to be quite a few open tickets.

I've dealt with a few of them...

> Is BioPython still using Open Bio's Bugzilla to track bugs?

As Michiel said, yes we are.  In fact, maybe we should log bugs for 
several of the recent issues on the mailing list...

Peter


From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 09:40:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 20 Jun 2006 14:40:52 +0100
Subject: [Biopython-dev] Floats and Double in Cluster
Message-ID: <4497FAE4.3040705@maubp.freeserve.co.uk>

One for Michiel, as the author of Bio.Cluster

I've just been building the latest CVS code on Windows with MSVC 6.0
(Microsoft Visual C++) and noticed there are a lot of warnings about
double to float conversion (with associated data loss) from
Bio/Cluster/ranlib.c (see attached output).

Does this matter?

I've also tried compiling the same code on Linux which I assume is using
the default gcc 4.0.2, and there are no such warnings (even with the
compiler option -Wall being specified).

I found that I could generally get rid of the warnings by adding
explicit (float) cast statements to the problem lines.

Thanks

Peter


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: build.txt
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060620/1edd8a20/attachment-0001.txt 

From mdehoon at c2b2.columbia.edu  Tue Jun 20 12:05:30 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 20 Jun 2006 12:05:30 -0400
Subject: [Biopython-dev] Floats and Double in Cluster
In-Reply-To: <4497FAE4.3040705@maubp.freeserve.co.uk>
References: <4497FAE4.3040705@maubp.freeserve.co.uk>
Message-ID: <44981CCA.2070006@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> One for Michiel, as the author of Bio.Cluster
> 
> I've just been building the latest CVS code on Windows with MSVC 6.0
> (Microsoft Visual C++) and noticed there are a lot of warnings about
> double to float conversion (with associated data loss) from
> Bio/Cluster/ranlib.c (see attached output).
> 
> Does this matter?
> 
Probably not. The ranlib library is quite old (maybe fifteen years or 
more). It was originally written in Fortran, and automatically converted 
into C. Such conversions are usually not so clean, so the resulting code 
tends to generate many warning messages. On the other hand, ranlib is a 
part of Numerical Python (the RandomArray module), so if there were a 
serious problem with it it would have been discovered by now.
I did find out recently though that there may be a licensing problem 
with ranlib. Since we're using only a small part of ranlib in 
Bio.Cluster, I'm planning to replace it with a different random number 
generator for the next version.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From chris.lasher at gmail.com  Tue Jun 20 12:35:14 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 20 Jun 2006 12:35:14 -0400
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <44980870.5070309@maubp.freeserve.co.uk>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
	<44980870.5070309@maubp.freeserve.co.uk>
Message-ID: <128a885f0606200935r619424f1jd983ad51eb36d7b@mail.gmail.com>

On 6/20/06, Peter <biopython-dev at maubp.freeserve.co.uk> wrote:
>  > Also, I noticed there seem to be quite a few open tickets.
>
> I've dealt with a few of them...

So some of those tickets are not still actually "open"?

Chris

From chris.lasher at gmail.com  Tue Jun 20 12:54:49 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 20 Jun 2006 12:54:49 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
	<128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
Message-ID: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>

Question for the knowledgeable: is this an appropriate realm to write
a Martel parser for?

Chris

From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 13:04:25 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 18:04:25 +0100
Subject: [Biopython-dev] Clustal unit test
Message-ID: <44982A99.9090400@maubp.freeserve.co.uk>

Another query Michiel

A minor point first of all, adding something like the following to the 
end of test_Cluster.py makes it easy to run this unit test on its own:

if __name__ == "__main__" :
     run_tests(module = "Bio.Cluster")

Secondly, the test works for me with BioPython 1.41, but using today's 
CVS the test fails (or at least, sits there at high CPU usage for so 
long I give up and kill it).  This was using Linux.

The only changes since BioPython 1.41 are those you checked in today 
with this comment:

 > C Clustering Library version 1.32. Bio.Cluster became
 > objected-oriented (somewhat).

Does the unit test still work for you?

Peter


From mcolosimo at mitre.org  Tue Jun 20 14:32:08 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 20 Jun 2006 14:32:08 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
	<128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
	<128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>
Message-ID: <6F683BD6-4359-40BF-A7E4-10C97D4032EF@mitre.org>

I would say, NO.
I think Martel is a Mack truck and you need a little Toyota pickup.  
Sure, Martel probably can do this, but it would be overkill, IMHO.   
Maybe Andrew could chime in on this.

Marc

On Jun 20, 2006, at 12:54 PM, Chris Lasher wrote:

> Question for the knowledgeable: is this an appropriate realm to write
> a Martel parser for?
>
> Chris
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From mdehoon at c2b2.columbia.edu  Tue Jun 20 16:56:22 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 20 Jun 2006 16:56:22 -0400
Subject: [Biopython-dev] Clustal unit test
In-Reply-To: <44982A99.9090400@maubp.freeserve.co.uk>
References: <44982A99.9090400@maubp.freeserve.co.uk>
Message-ID: <449860F6.7040807@c2b2.columbia.edu>

Peter wrote:
> A minor point first of all, adding something like the following to the 
> end of test_Cluster.py makes it easy to run this unit test on its own:
> 
> if __name__ == "__main__" :
>      run_tests(module = "Bio.Cluster")

OK I've added this in CVS.

> Secondly, the test works for me with BioPython 1.41, but using today's 
> CVS the test fails (or at least, sits there at high CPU usage for so 
> long I give up and kill it).  This was using Linux.
> 
I did make an update to Bio.Cluster but in my brilliance forgot to 
update the test scripts accordingly. CVS now contains updated 
test_Cluster.py and output/test_Cluster. Let me know if the test still 
fails with these updated files. Thanks for catching this.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From bsouthey at gmail.com  Tue Jun 20 13:47:53 2006
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 20 Jun 2006 12:47:53 -0500
Subject: [Biopython-dev] Floats and Double in Cluster
In-Reply-To: <44981CCA.2070006@c2b2.columbia.edu>
References: <4497FAE4.3040705@maubp.freeserve.co.uk>
	<44981CCA.2070006@c2b2.columbia.edu>
Message-ID: <bbcd77d00606201047v3a928efeo627a09304fdb79@mail.gmail.com>

Hi,
Actually just switch to the new numpy where ranlib is no longer used.
It uses  Mersenne Twister RNG from Jean-Sebastien Roy's random kit.
Robert Kern has also written additional code.

Of course, that really means moving BioPython to numpy from Numeric.
Is there a plan for this?

Bruce


On 6/20/06, Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu> wrote:
> Peter (BioPython Dev) wrote:
> > One for Michiel, as the author of Bio.Cluster
> >
> > I've just been building the latest CVS code on Windows with MSVC 6.0
> > (Microsoft Visual C++) and noticed there are a lot of warnings about
> > double to float conversion (with associated data loss) from
> > Bio/Cluster/ranlib.c (see attached output).
> >
> > Does this matter?
> >
> Probably not. The ranlib library is quite old (maybe fifteen years or
> more). It was originally written in Fortran, and automatically converted
> into C. Such conversions are usually not so clean, so the resulting code
> tends to generate many warning messages. On the other hand, ranlib is a
> part of Numerical Python (the RandomArray module), so if there were a
> serious problem with it it would have been discovered by now.
> I did find out recently though that there may be a licensing problem
> with ranlib. Since we're using only a small part of ranlib in
> Bio.Cluster, I'm planning to replace it with a different random number
> generator for the next version.
>
> --Michiel.
>
> --
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1130 St Nicholas Avenue
> New York, NY 10032

From mdehoon at c2b2.columbia.edu  Wed Jun 21 22:32:58 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 21 Jun 2006 22:32:58 -0400
Subject: [Biopython-dev] Biopython's XMl parser fails with NCBI blast
 changed XML output format
In-Reply-To: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>
References: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>
Message-ID: <449A015A.4080503@c2b2.columbia.edu>

I am not sure if the new XML format is really what NCBI wants it to be, 
since it does not agree with the Blast documentation. I asked NCBI about 
this; they have forwarded this question to their Blast developers, so 
hopefully we'll get a definite answer soon.
For the time being, I guess the only thing you can do is to download an 
older version of Blast and run your blast searches locally. Then, the 
blast XML output will be in the old format, and there should be no 
problem parsing them with the existing parser in Biopython.

--Michiel.

Rohini Damle wrote:
> Hi,
> I am trying to parse the blast output (XML formatted, using online NCBI's
> blast) I got as a result for 'short nearly exact matches' for my 50-55 
> short
> protein sequences.
> It looks like the XML format has changed and biopython's XML parser 
> fails to
> parse the blast records.
> can somebody show a way to fix this thing?
> Thank you
> Rohini Damle


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From dcoorna at dbm.ulb.ac.be  Mon Jun 26 06:57:25 2006
From: dcoorna at dbm.ulb.ac.be (david coornaert)
Date: Mon, 26 Jun 2006 12:57:25 +0200
Subject: [Biopython-dev] NCBI-XML blast parser
Message-ID: <449FBD95.5040308@dbm.ulb.ac.be>


I'm currently using this bio-python ncbiXML blast output parser

From the CVS copy I fetched, I see some comments about the useless
nature of Hsp_query_to and Hsp_hit_to.

Well, I need those, and can't reliably calculate them simply from
hsp_align_len (which is not included either), because I would have to
manage the maximum length of the query, hit and alignment strings, and
then take the strand into account to know whether to increase or decrease.

So I've worked out a parallel copy of the parser,
but I'd like to know why these are considered useless.
Should I commit these harmless changes? (hence CVS access)

???


-- 
===============================================
David Coornaert [PhD]   (dcoorna at dbm.ulb.ac.be)

Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles

Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041  Gosselies
BELGIQUE

Tél:  +3226509975
Fax:  +3226509998
===============================================


From biopython-dev at maubp.freeserve.co.uk  Mon Jun 26 08:05:28 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Jun 2006 13:05:28 +0100
Subject: [Biopython-dev] NCBI-XML blast parser
In-Reply-To: <449FBD95.5040308@dbm.ulb.ac.be>
References: <449FBD95.5040308@dbm.ulb.ac.be>
Message-ID: <449FCD88.6080704@maubp.freeserve.co.uk>

david coornaert wrote:
> I'm currently using this bio-python ncbiXML blast output parser
> 
> From the cvs I fetched I see some comments about useless nature of
>  Hsp_query_to
> and Hsp_hit_to
> 
> Well, I need those, and can't for sure calculate it simply from
> hsp_align_len (which is not included either)
> because I should I manage the max len of query, hit and align string)
> then take care of the strand to know wether to increase or decrease,
> 
> So I've worked out a parrallel copy of the parser,
> but I'd like to know  why are these considered useless  ?
> Should I commit these harmless changes ? (hence cvs access)
> 
> ???

Hi David

Could you file a bug and attach your patch to it please?  (Trying to 
send attachments to the mailing list can be a bit unreliable).  Then 
hopefully some of the group can at least try it out...

Out of interest, what version of Blast have you been using?  Online or 
standalone?

(If you've been following the list, we think the NCBI have changed the 
format returned for multiple queries in version 2.2.14)

Thanks

Peter


From dcoorna at dbm.ulb.ac.be  Mon Jun 26 08:44:04 2006
From: dcoorna at dbm.ulb.ac.be (david coornaert)
Date: Mon, 26 Jun 2006 14:44:04 +0200
Subject: [Biopython-dev] NCBI-XML blast parser
In-Reply-To: <449FCD88.6080704@maubp.freeserve.co.uk>
References: <449FBD95.5040308@dbm.ulb.ac.be>
	<449FCD88.6080704@maubp.freeserve.co.uk>
Message-ID: <449FD694.7030508@dbm.ulb.ac.be>

Peter wrote:
> Hi David
>
> Could you file a bug and attach your patch to it please?  (Trying to 
> send attachments to the mailing list can be a bit unreliable).  Then 
> hopefully some of the group can at least try it out...
>
>   
Well, I'm not sure about the bug procedure,
so here it is already.
I'll have a look at the list stuff quite soon and will submit as requested.

I wouldn't have qualified that as a bug; I was just wondering why
someone would consider these values useless. Sure, you can calculate
them, although it would be painful and ... well, since they are already
in the XML...
I simply added these (in red):

Bio/Blast/NCBIXML.py

line 289:
# No need for Hsp_query_to
def _end_Hsp_query_to(self):
    """offset of query at the end of the alignment (one-offset)
    """
    self._hsp.query_to = int(self._value)

def _end_Hsp_hit_from(self):
    """offset of the database at the start of the alignment (one-offset)
    """
    self._hsp.sbjct_start = int(self._value)

# No need for Hsp_hit_to
def _end_Hsp_hit_to(self):
    """offset of the database at the end of the alignment (one-offset)
    """
    self._hsp.sbjct_to = int(self._value)


Conversely, a real bug is the mess that is occurring regarding Frame and
Strand!

In a blastn output there must appear:

Strand = Plus / Plus
or
Strand = Plus / Minus (and so on)

while in a tblastx output there must appear:
Frame = +3/-1 (and so on)

(blastx must also present one Frame value)

Unfortunately, to find the appropriate strand in a blastn job, you need
to address the hsp.frame array, even though there's an hsp.strand array...


And all this stuff is useful! If it is the opposite strand you need
to swap query_start and query_to, for example...


> Out of interest, what version of Blast have you been using?  Online or 
> standalone?
>
>   
Well, I've seen the complaints regarding 2.2.14, hence I stuck to
2.2.13 standalone

=;B^)


-- 
===============================================
David Coornaert [PhD]   (dcoorna at dbm.ulb.ac.be)

Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles

Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041  Gosselies
BELGIQUE

Tél:  +3226509975
Fax:  +3226509998
===============================================


From chris.lasher at gmail.com  Tue Jun 27 19:33:57 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 19:33:57 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
	<44A1B23E.5080007@maubp.freeserve.co.uk>
	<128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
Message-ID: <128a885f0606271633g6fc66bb1h8173eca5b949a1c5@mail.gmail.com>

Oh brother... today's not my day. NOW it's back on BP-Dev...

Stupidly yours,
Chris

On 6/27/06, Chris Lasher <chris.lasher at gmail.com> wrote:
> [Oops! I didn't realize I was posting to the user list! Reverting it
> back to BP-Dev]
> This code looks very good, Peter!
>
> As far as licensing, I'm new to the game, but my guess is the
> BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly
> preferred for BioPython. You still retain copyright with the license,
> but the code is more "free" than under any version of the GPL.
>
> Chris
>
> On 6/27/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > Chris Lasher wrote:
> > > Hi Peter,
> > >
> > > Would you be up for licensing your code under the BioPython license?
> > > If not, I shouldn't  look at it, as I've started coding my own module
> > > for the project. From your description, your module sounds very good.
> > > =-)
> > >
> > > Chris
> >
> > I am quite happy to contribute the code to BioPython under the
> > appropriate license, so please go ahead.
> >
> > I've filed a bug on adding PHYLIP distance parsers to BioPython and
> > attached a slightly revised version of the code (added "fuzzy" equality
> > testing of matrices - mainly for testing):
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2034
> >
> > If anyone else really wants the code under some other license (GPL
> > maybe) I could probably be persuaded.
> >
> > Peter
> >
> > _______________________________________________
> > BioPython mailing list  -  BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>

From mcolosimo at mitre.org  Mon Jun 12 12:38:18 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 08:38:18 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
	<128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
	<8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
	<128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org>

[cross-posting to biopython-dev]

Chris,

Oops, didn't notice this was on the general biopython mailing list. I  
think many of the developers also subscribe to this list, but just in  
case I'm cross posting this.

Iddo pointed out Bio.SubsMat; I didn't know what that module did. That is
one problem with names like that, and the API docs are helpful only when
you look at them <http://biopython.org/DIST/docs/api/public/trees.html>
(kudos to those who add documentation).

Given Bio.SubsMat and the BioPerl Module, I would strongly consider  
combining the Bio.SubsMat and the PhylipDist into a new Bio.Matrix  
module. From a Phylo module, a function/class can always call the  
Bio.Matrix classes.

Marc

On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote:

>> I likewise didn't know about the Bio::Matrix::PhylipDist module.
>> Personally, I would opt for a Matrix Object (since Python is an
>> OO language) and store it internally as a nested list. That way you
>> have the best of both worlds. The next question is the object
>> hierarchy. Here I would opt for a top level Matrix class (or module)
>> and then subclass that under Phylo. So, something like this:
>>
>> Bio.Matrix
>> Bio.Phylo.Matrix
>
> So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic
> matrix is a type of matrix, so that hierarchy is immediately
> appealing, however, a phylogenetic matrix is not of much use in and of
> itself, so I can see the argument that it should be placed in a
> phylogeny package (which we have yet to write but as mentioned
> earlier, could be very useful).
>
>> and maybe things like the following (which isn't used/followed much
>> here in BioPython)
>>
>> Bio.Phylo.IO
>> Bio.Phylo.Parsers.PhylipDist
>> Bio.Phylo.Parsers.Newick
>> Bio.Phylo.Parsers.Nexus
>>
>> And/or have
>> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
>
> This is very very good, in my opinion. Thanks for doing the
> heavy-lifting of the brainwork on this! =-)
>
>> The next big question is what should Bio.Phylo.IO return? For
>> inspiration, we might want to look at Mesquite
>> <http://mesquiteproject.org/mesquite/mesquite.html>.
>
> I must give a better look at this site before commenting, but once
> again, thanks for bringing this to my awareness! What a helpful past
> couple of emails. I will be out for the weekend but will think more
> about this.
>
> As a sidenote, should this discussion be moved to biopython-dev or is
> it fine here?
>
> Thanks again Marc,
> Chris


From mcolosimo at mitre.org  Mon Jun 12 13:18:41 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 09:18:41 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>

[cross post]
On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices? My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices.  The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk space.
> This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate  
the distance matrices (especially with DNA/RNA sequences of any  
decently sized gene).

>
> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip asks you which form to read or write; this is a pain at  
times. So, having a parser figure this out would be nice. However,  
the user should know about the choices.

>
> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
> This is less flexible than the suggested list of lists, but for large
> datasets would need much less memory.

I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
storing these things. But you lose that when you want to do pythonish  
things to it (like write it back out).

>
> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
> ignore this limitation and will use as many as needed for the full name.

ClustalW does the CORRECT thing, it truncates the name to 10  
characters for Phylip output (alignments). And it does the CORRECT  
thing for its  distance matrix file.

In Clustalw's trees.c file

void distance_matrix_output(FILE *ofile)

	fprintf(ofile,"\n%-*s ",max_names,names[i]);  /* left justify to the  
maximum length of names in current alignment file and use a space as  
a sep */

spaces in names are bad in this case, but phylip is okay with them,  
since the first 10 characters are the taxon name.

> I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters).  I can supply some examples if you like.

By definition this isn't a variant of Phylip, but another format. So,  
one would need two parsers: PhylipDist and Dist (or ClustalDist).

>
> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to max
> 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the  
Phylip file Matrix structure, so why give the option? Make another  
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names so  
why error out? However, the class should have a means for the user to  
ask this question.

>
> Likewise an option to save matrices as either fully symmetric or lower
> triangular.  I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully  
symmetric. Keep this in mind when naming or documentation.

>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.
>

Well, we do have things like SciPy and PyClustal, which make things  
more even.

Marc


From biopython-dev at maubp.freeserve.co.uk  Mon Jun 12 21:57:36 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jun 2006 22:57:36 +0100
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
Message-ID: <448DE350.8000403@maubp.freeserve.co.uk>

[Sent to the Dev list only - forward to the main discussion list if you 
think best, Marc]

One general question about the architecture: Are you thinking of having 
a generic "distance matrix object", and parsers/formats defined for 
several different file formats?

Peter (me) wrote:
>>In my experience, most software tools usually write the distances as a
>>full symmetric matrix.  However, the "standard" explicitly discusses
>>lower triangular form (missing out the diagonal distance zero entries)
>>which has the significant advantage of using about half the disk  
>>space. This is significant once you get into thousands of taxa.

Marc Colosimo wrote:
> This is still small potatoes compared to the input needed to generate  
> the distance matrices (especially with DNA/RNA sequences of any  
> decently sized gene).

Regarding the size of the matrix file versus the size of the alignment 
file, that isn't always true.

(*) The matrix file size goes as the square of the number of taxa, the 
alignment file only linearly.

(*) The matrix file is invariant with respect to the length of the 
sequences/number of columns in the alignment.

(*) The matrix file size goes linearly with the precision (number of 
decimal places) used.

As you are using "decently sized genes" then you will have large 
alignment files, but I would imagine you have at most hundreds of genes 
per alignment - not thousands (?).

For my own examples, I have about two thousand domains (not full genes) 
and the phylip distance matrix file was MUCH bigger than the alignment file.
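
To put rough numbers on the three points above, here is a
back-of-envelope sketch. The function is hypothetical; it assumes a
square matrix, distances below 10 written with a fixed number of
decimals plus one separator, and one character per alignment column:

```python
def estimate_sizes(n_taxa, n_columns, decimals=4, name_width=10):
    """Rough character counts for a matrix file and an alignment file."""
    # Each distance is a digit, a point and the decimals, plus a separator.
    entry_width = decimals + 3
    # Square matrix: every row holds a name and n_taxa distances.
    matrix_chars = n_taxa * (name_width + n_taxa * entry_width)
    # Alignment: every row holds a name and one character per column.
    alignment_chars = n_taxa * (name_width + n_columns)
    return matrix_chars, alignment_chars


# 300 columns is an illustrative figure, not a measurement of my data.
matrix_chars, alignment_chars = estimate_sizes(2000, 300)
```

With those assumed figures the matrix comes out at about 28 million
characters against well under one million for the alignment, and
doubling the number of taxa roughly quadruples the matrix but only
doubles the alignment.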

Peter (me) wrote
>>So, make sure any parser can cope with both full symmetric, and lower
>>triangular forms - ideally without the user having to care.

Marc Colosimo wrote:
> Phylip does ask you which to either read or write; this is a pain at  
> times. So, having a parser figure this out would be nice. However,  
> the user should know about the choices.

It's fairly easy for the parser to cope with either: For each line of 
input, only use the "lower triangular" portion - just ignore any 
remaining text which would be present for a full matrix (square) file, 
or not present for a lower triangular file.
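
A minimal sketch of that idea, assuming whitespace-delimited names
without embedded spaces and one taxon per line (no wrapped rows). The
function name is made up for illustration and upper triangular input is
not handled:

```python
def parse_phylip_distances(lines):
    """Parse a PHYLIP-style distance matrix from an iterable of lines.

    Row i only ever uses its first i distances, so square and lower
    triangular files give the same result; extra fields are ignored.
    """
    rows = [line.split() for line in lines if line.strip()]
    n_taxa = int(rows[0][0])  # first line holds the number of taxa
    names = []
    lower = []  # lower[i] holds the i distances from taxon i to taxa 0..i-1
    for i, fields in enumerate(rows[1:n_taxa + 1]):
        names.append(fields[0])
        lower.append([float(x) for x in fields[1:1 + i]])
    return names, lower


square = ["3", "A 0.0 0.1 0.2", "B 0.1 0.0 0.3", "C 0.2 0.3 0.0"]
triangular = ["3", "A", "B 0.1", "C 0.2 0.3"]
print(parse_phylip_distances(square) == parse_phylip_distances(triangular))
```

The caller never has to say which layout the file uses, at the cost of
not checking that a square file really is symmetric.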

Peter wrote:
>>This also raises the point about how to store the matrix in memory.
>>Does Numeric/NumPy have an efficient way of storing symmetric  
>>matrices? This is less flexible than the suggested list of lists,
>>but for large datasets would need much less memory.

Marc Colosimo wrote:
> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
> storing these things. But you lose that when you want to do pythonish  
> things to it (like write it back out).

It depends on our target audience.  My experience with two thousand taxa 
means that I am slightly concerned about the memory, and would lean 
towards storing the data using Numeric/NumPy.  This could be done within 
a nice python object, with methods to write it out again in phylip 
format etc - so it could still behave "nicely".

Peter wrote:
>>Second point - the "official" PHYLIP distance matrix file format
>>truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
>>ignore this limitation and will use as many as needed for the full  
>>name.

Marc Colosimo wrote:
> ...
> 
> By definition this isn't a variant of Phylip, but another format. So,  
> one would need two parsers: PhylipDist and Dist (or ClustalDist).

That would be another way of looking at the issue, sure.  [See below]

Peter wrote:
>>For writing matrices to file, the issue of following the strict 10
>>character taxa limit might best be handled as an option (default to  
>>max 10, with a warning if any names are truncated, and an error if
>>truncation renders names non-unique?).

Marc Colosimo wrote:
> DON'T give an option of 10 or more. That is NOT the definition of the  
> Phylip file Matrix structure, so why give the option? Make another  
> class that outputs the whole name (ClustalDist).

I like clustal's "long name variant of Phylip distance format", as for 
my datasets my gene/domain names are longer than 10 characters.  I may 
well be in a minority here (for now).

I suppose it would be "good practice" to follow the official (but not 
overly precise) phylip definition on this issue.

So your idea of defining two similar formats would resolve this.  In 
terms of implementation, one could probably just subclass the other to 
reduce the amount of duplicated code.

> I am pretty sure that Phylip doesn't care about non-unique names so  
> why error out? However, the class should have a means for the user to  
> ask this question.

Because the (truncated) taxa names are going to be used as tree node 
names by any tree building program, they really should be unique.  I 
would expect any tree program to throw an error in this case, which is 
why I suggested we should try not to create such files in the first place.
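
As a sketch of how a writer could detect the problem before creating
such a file (the function name and the example taxa are hypothetical):

```python
from collections import Counter


def truncate_names(names, width=10):
    """Return (truncated names, sorted truncated names that collide)."""
    short = [name[:width] for name in names]
    counts = Counter(short)
    clashes = sorted(name for name, count in counts.items() if count > 1)
    return short, clashes


names = ["Homo_sapiens_1", "Homo_sapiens_2", "Mus_musculus"]
short, clashes = truncate_names(names)
print(short)    # the two Homo_sapiens entries both become "Homo_sapie"
print(clashes)
```

What to do with a non-empty clash list (warn, fail, or carry on) is a
separate policy decision.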

Peter wrote:
>>Likewise an option to save matrices as either fully symmetric or lower
>>triangular.  I would lean towards using fully symmetric as the default
>>as it seems to be more common.

Marc Colosimo wrote:
> Phylip's default seems to be a "Square" distance matrix, i.e. fully  
> symmetric. Keep this in mind when naming or documentation.

Good point.

Peter


From mcolosimo at mitre.org  Tue Jun 13 15:46:16 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 13 Jun 2006 11:46:16 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <448DE350.8000403@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
Message-ID: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>

[I've added Chris in case he isn't on the dev-list]

On Jun 12, 2006, at 5:57 PM, Peter wrote:

> [Send to the Dev list only - forward to the main discussion list if  
> you think best Marc]
>
> One general question about the architecture: Are you thinking of  
> having a generic "distance matrix object", and parsers/formats  
> defined for several different file formats?
>

Yes. I think that is what I am leaning towards. Now, I don't know if  
I'll be the implementor or not. It has been something on my to-do  
list for a while.

> Peter (me) wrote:
>>> In my experience, most software tools usually write the distances  
>>> as a
>>> full symmetric matrix.  However, the "standard" explicitly discusses
>>> lower triangular form (missing out the diagonal distance zero  
>>> entries)
>>> which has the significant advantage of using about half the disk   
>>> space. This is significant once you get into thousands of taxa.
>

Peter wrote:
> Marc Colosimo wrote:
>> This is still small potatoes compared to the input needed to  
>> generate  the distance matrices (especially with DNA/RNA sequences  
>> of any  decently sized gene).
>
> Regarding size of matrix file versus size of alignment file, that  
> isn't always true.
>
> (*) The matrix file size goes as the square of the number of taxa,  
> the alignment file only linearly.
>
> (*) The matrix file is invariant with respect to the length of the  
> sequences/number of columns in the alignment.
>
> (*) The matrix file size goes linearly with the precision (number  
> of decimal places) used.
>
> As you are using "decently sized genes" then you will have large  
> alignment files, but I would imagine you have at most hundreds of  
> genes per alignment - not thousands (?).
>
> For my own examples, I have about two thousand domains (not full  
> genes) and the phylip distance matrix file was MUCH bigger than the  
> alignment file.

You got me on that boundary case. I just wanted to point out that is  
not always the case.


> Peter (me) wrote
>>> So, make sure any parser can cope with both full symmetric, and  
>>> lower
>>> triangular forms - ideally without the user having to care.
>
> Marc Colosimo wrote:
>> Phylip does ask you which to either read or write; this is a pain  
>> at  times. So, having a parser figure this out would be nice.  
>> However,  the user should know about the choices.
>
> It's fairly easy for the parser to cope with either: For each line  
> of input, only use the "lower triangular" portion - just ignore any  
> remaining text which would be present for a full matrix (square)  
> file, or not present for a lower triangular file.

It should be fairly easy, but I don't understand why Phylip chokes on  
square versus lower triangular. Either way, the class should  
"internally" know what the format read in was, so you can ask it.  
That way if you muck with it or create a new matrix and want to write  
that out, you can ask the class what it read in and then have the new  
one write it out in that format.

>
> Peter wrote:
>>> This also raises the point about how to store the matrix in memory.
>>> Does Numeric/NumPy have an efficient way of storing symmetric   
>>> matrices? This is less flexible than the suggested list of lists,
>>> but for large datasets would need much less memory.
>
> Marc Colosimo wrote:
>> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at   
>> storing these things. But you lose that when you want to do  
>> pythonish  things to it (like write it back out).
>
> It depends on our target audience.  My experience with two thousand  
> taxa means that I am slightly concerned about the memory, and would  
> lean towards storing the data using Numeric/NumPy.  This could be  
> done within a nice python object, with methods to write it out  
> again in phylip format etc - so it could still behave "nicely".

I agree here and think that if the user has Numeric use that,  
otherwise use built-in types. So, maybe two "hidden" classes that do  
the correct thing.

>
> Peter wrote:
>>> Second point - the "official" PHYLIP distance matrix file format
>>> truncates the taxa names at 10 characters.  Some tools (e.g.  
>>> clustalw)
>>> ignore this limitation and will use as many as needed for the  
>>> full  name.
>
> Marc Colosimo wrote:
>> ...
>> By definition this isn't a variant of Phylip, but another format.  
>> So,  one would need two parsers: PhylipDist and Dist (or  
>> ClustalDist).
>
> That would be another way of looking at the issue, sure.  [See below]
>
> Peter wrote:
>>> For writing matrices to file, the issue of following the strict 10
>>> character taxa limit might best be handled as an option (default  
>>> to  max 10, with a warning if any names are truncated, and an  
>>> error if
>>> truncation renders names non-unique?).
>
> Marc Colosimo wrote:
>> DON'T give an option of 10 or more. That is NOT the definition of  
>> the  Phylip file Matrix structure, so why give the option? Make  
>> another  class that outputs the whole name (ClustalDist).
>
> I like clustal's "long name variant of Phylip distance format", as  
> for my datasets my gene/domain names are longer than 10  
> characters.  I may well be in a minority here (for now).
>
> I suppose it would be "good practice" to follow the official (but  
> not overly precise) phylip definition on this issue.
>
> So your idea of defining two similar formats would resolve this.   
> In terms of implementation, one could probably just subclass the  
> other to reduce the amount of duplicated code.

Correct. Subclassing is our friend (to a point).
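
A sketch of how little the subclass would need to override. The writer
class names, the single separating space and the four decimal places are
assumptions for illustration, not taken from either spec:

```python
class PhylipDistWriter:
    """Writes names padded or truncated to exactly 10 characters."""

    name_width = 10

    def format_name(self, name):
        return name[:self.name_width].ljust(self.name_width)

    def format_row(self, name, distances):
        values = " ".join("%.4f" % d for d in distances)
        return self.format_name(name) + " " + values


class ClustalDistWriter(PhylipDistWriter):
    """Same layout, but keeps the full name instead of truncating it."""

    def format_name(self, name):
        return name.ljust(self.name_width)


print(PhylipDistWriter().format_row("Homo_sapiens_1", [0.1]))
print(ClustalDistWriter().format_row("Homo_sapiens_1", [0.1]))
```

Only the name handling differs, so everything else is shared.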

>
>> I am pretty sure that Phylip doesn't care about non-unique names  
>> so  why error out? However, the class should have a means for the  
>> user to  ask this question.
>
> Because the (truncated) taxa names are going to be used as tree  
> node names by any tree building program, they really should be  
> unique.  I would expect any tree program to throw an error in this  
> case, which is why I suggested we should try not to create such  
> files in the first place.

Not exactly. I've been bitten in the butt by the truncation issue  
several times. I know TreeView X doesn't care about unique names and  
I think MacClade also doesn't care. Now, PAUP and Mesquite might care,  
as might any Nexus-type system which lists the taxon names separately  
from the taxa in the TREES block (they use numbers for the taxa, which  
get mapped to TAXLABELS in the TAXA block). I believe it depends on  
how they decided to store these relationships.

I guess we have three options here:
1) keep on trucking
2) raise a warning
3) raise an exception - something like Matrix.NonUniqueName exception  
so that you can specifically catch the exception
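
The three options could sit behind one switch. This is only a sketch:
the exception is named NonUniqueNameError here, and the on_duplicate
parameter is made up for illustration.

```python
import warnings


class NonUniqueNameError(ValueError):
    """Raised when two taxa share the same (possibly truncated) name."""


def check_unique(names, on_duplicate="raise"):
    """Apply one of three policies: "ignore", "warn" or "raise".

    Returns the list of duplicated names (empty if all are unique).
    """
    if on_duplicate not in ("ignore", "warn", "raise"):
        raise ValueError("unknown policy: %r" % on_duplicate)
    seen = set()
    duplicates = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates and on_duplicate == "raise":
        raise NonUniqueNameError("duplicate names: %s" % ", ".join(duplicates))
    if duplicates and on_duplicate == "warn":
        warnings.warn("duplicate names: %s" % ", ".join(duplicates))
    return duplicates


try:
    check_unique(["Homo_sapie", "Homo_sapie", "Mus_muscul"])
except NonUniqueNameError as error:
    print("caught:", error)
```

Because the exception has its own class, calling code can catch it
without swallowing unrelated errors.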

>
> Peter wrote:
>>> Likewise an option to save matrices as either fully symmetric or  
>>> lower
>>> triangular.  I would lean towards using fully symmetric as the  
>>> default
>>> as it seems to be more common.
>
> Marc Colosimo wrote:
>> Phylip's default seems to be a "Square" distance matrix, i.e.  
>> fully  symmetric. Keep this in mind when naming or documentation.
>
> Good point.
>
> Peter
>


From chris.lasher at gmail.com  Wed Jun 14 18:36:00 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 14 Jun 2006 14:36:00 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
Message-ID: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>

> [I've added Chris in case he isn't on the dev-list]

Thanks Marc! I actually joined Dev list as the discussion got interesting.
Figured we'd move it to here eventually.

>> One general question about the architecture: Are you thinking of having a
>> generic "distance matrix object", and parsers/formats defined for several
>> different file formats?
>>
>
> Yes. I think that is what I am leaning towards. Now, I don't know if I'll
> be the implementor or not. It has been something on my to-do list for a
> while.

BioPython support for these formats with clean, testable code should be the
primary task, correct? I can help with this.  After we get working code,
refactoring for memory management can take place. I haven't done anything
along these lines and I'd have to rely on someone else's expertise for this.

> In my experience, most software tools usually write the distances as a
> full symmetric matrix.  However, the "standard" explicitly discusses lower
> triangular form (missing out the diagonal distance zero entries) which has
> the significant advantage of using about half the disk space. This is
> significant once you get into thousands of taxa.

I guess we need to consider that storing the matrix as a triangular form
will save some memory. However, I've emailed the SciPy/NumPy guys and there
is currently no support for a triangular/symmetric matrix; it would have to
be a square matrix. See more below.
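
One way around this, sketched with only the standard library: keep the
strict lower triangle in a single flat array and compute the position
from the two indices. The class name is hypothetical, and this is not
something SciPy/NumPy provides for us.

```python
from array import array


class LowerTriangle:
    """Strict lower triangle of a symmetric matrix in one flat array."""

    def __init__(self, n_taxa):
        self.n_taxa = n_taxa
        # n*(n-1)/2 doubles instead of n*n; the zero diagonal is implied.
        self._data = array("d", [0.0]) * (n_taxa * (n_taxa - 1) // 2)

    def _index(self, i, j):
        if i < j:
            i, j = j, i  # symmetric, so always look in the lower half
        # Rows 1..i-1 hold 1 + 2 + ... + (i-1) = i*(i-1)/2 entries.
        return i * (i - 1) // 2 + j

    def get(self, i, j):
        return 0.0 if i == j else self._data[self._index(i, j)]

    def set(self, i, j, value):
        if i == j:
            raise ValueError("diagonal entries are fixed at zero")
        self._data[self._index(i, j)] = value


matrix = LowerTriangle(4)
matrix.set(1, 3, 0.7)
print(matrix.get(3, 1), len(matrix._data))
```

For four taxa this stores 6 values rather than 16, and the saving
approaches one half as the number of taxa grows.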

>>>> So, make sure any parser can cope with both full symmetric, and lower
>>>> triangular forms - ideally without the user having to care.
>>>
>>> Phylip does ask you which to either read or write; this is a pain at
>>> times. So, having a parser figure this out would be nice.  However, the
>>> user should know about the choices.
>>
>> It's fairly easy for the parser to cope with either: For each line of
>> input, only use the "lower triangular" portion - just ignore any
>> remaining text which would be present for a full matrix (square) file, or
>> not present for a lower triangular file.

Well, we can save a lot of developer time by requiring the user to designate
this, with the default being a square matrix. Is it unreasonable to expect
the user to know whether his or her matrix is lower/upper-triangular or
square? Autodetection seems to add a bit of risk, e.g., either the detection
has to be confirmed by the user (in which case, what's the point of
auto-detect), or we have to have a really well tested auto-detector, i.e., a
lot more developer time.

> It should be fairly easy, but I don't understand why Phylip chokes on
> square versus lower triangular. Either way, the class should "internally"
> know what the format read in was, so you can ask it.  That way if you muck
> with it or create a new matrix and want to write that out, you can ask the
> class what it read in and then have the new one write it out in that
> format.

I think it makes sense for a Phylip triangular matrix and a Phylip square
matrix to be represented as the same type of object, for reasons of
consistency, as already discussed. As Marc pointed out, its original form
can simply be represented by an attribute of the object. It should also be
possible to write the matrix back out in either triangular or square format,
regardless of its original format. These would probably just be methods of
the object, such as .to_phylip_square() and .to_phylip_ltriangular()
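
A sketch of how those two methods might look. The DistanceMatrix class,
the source_layout attribute and the exact column widths are assumptions
for illustration; only the two method names come from the suggestion
above.

```python
class DistanceMatrix:
    """Hypothetical container: names plus a full nested list of distances."""

    def __init__(self, names, rows, source_layout="square"):
        self.names = list(names)
        self.rows = [list(row) for row in rows]
        self.source_layout = source_layout  # layout the data was read from

    def _lines(self, lower_only):
        yield "%5d" % len(self.names)
        for i, (name, row) in enumerate(zip(self.names, self.rows)):
            # Row i of a lower triangular file carries only i distances.
            values = row[:i] if lower_only else row
            fields = [name[:10].ljust(10)] + ["%.4f" % d for d in values]
            yield " ".join(fields).rstrip()

    def to_phylip_square(self):
        return "\n".join(self._lines(lower_only=False)) + "\n"

    def to_phylip_ltriangular(self):
        return "\n".join(self._lines(lower_only=True)) + "\n"


matrix = DistanceMatrix(["A", "B"], [[0.0, 0.1], [0.1, 0.0]])
print(matrix.to_phylip_square())
print(matrix.to_phylip_ltriangular())
```

Both writers share one generator, so the original layout is just
information the object carries, not a constraint on output.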

>>>> This also raises the point about how to store the matrix in memory.
>>>> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
>>>> This is less flexible than the suggested list of lists, but for large
>>>> datasets would need much less memory.
>>>
>>> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at storing
>>> these things. But you lose that when you want to do pythonish things to
>>> it (like write it back out).
>>
>> It depends on our target audience.  My experience with two thousand taxa
>> means that I am slightly concerned about the memory, and would lean
>> towards storing the data using Numeric/NumPy.  This could be done within
>> a nice python object, with methods to write it out again in phylip format
>> etc - so it could still behave "nicely".
>
> I agree here and think that if the user has Numeric use that, otherwise
> use built-in types. So, maybe two "hidden" classes that do the correct
> thing.

This just recently popped up on the NumPy discussion list:
http://www.mail-archive.com/numpy-discussion at lists.sourceforge.net/msg00265.html

The summary of that is we can memory-map it using numpy.memmap. I've never
used this before, so I can't really comment. I'd guess that for small data
files, this is overkill. For large sets it might be reasonable. I suppose
two separate classes could be available, one for smaller matrices and one
for larger. Again, I think the user would be intelligent enough to make the
decision as to which to use.
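
For reference, a minimal sketch of what the memory-mapped route might
look like, assuming NumPy is installed. The file name and matrix size
are arbitrary, and I have not tried this on a large dataset:

```python
import os
import tempfile

import numpy as np

n_taxa = 4
path = os.path.join(tempfile.mkdtemp(), "distances.dat")

# mode="w+" creates the file; the array is backed by disk, not held in RAM.
matrix = np.memmap(path, dtype="float64", mode="w+", shape=(n_taxa, n_taxa))
matrix[1, 0] = matrix[0, 1] = 0.25
matrix.flush()  # push pending changes out to the file
del matrix

# Reopen read-only; data is paged in as it is accessed.
reopened = np.memmap(path, dtype="float64", mode="r", shape=(n_taxa, n_taxa))
print(reopened[0, 1])
```

The file on disk is the raw n x n block of doubles, so the shape and
dtype have to be stored or known separately.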

Since the class for handling standard (smaller) matrices will be easier to
code, I propose writing this standard one first and getting it into
BioPython. For this class, I suggest just sticking with a regular nested
list, rather than use something from Numeric/NumPy.

After this class is created and submitted, we can go back and create a class
to deal with larger matrices that's a sub-class of the standard one. This
way, the API remains the same, regardless of the class, and we will only
have to rewrite the methods that need changing due to the way we'll need to
interact with the underlying data structure of the wrapped Numeric/NumPy
object. How does that sound?

>> I like clustal's "long name variant of Phylip distance format", as for my
>> datasets my gene/domain names are longer than 10 characters.  I may well
>> be in a minority here (for now).
>>
>> I suppose it would be "good practice" to follow the official (but not
>> overly precise) phylip definition on this issue.
>>
>> So your idea of defining two similar formats would resolve this.  In
>> terms of implementation, one could probably just subclass the other to
>> reduce the amount of duplicated code.
>
> Correct. Subclassing is our friend (to a point).
>

I'm in agreement with using two separate types of objects to represent these
two formats. PhylipDist should represent the Phylip spec to the T. I'm not
familiar with the Clustal spec; is it formatted similarly, sans the
requirement of 10 characters max for the sequence name?

An editorial note, I'm very frustrated with Phylip's 10 character limit for
sequence names, too. I don't know the reasoning and history behind the
decisions on the format; all I know is that it is an uncomfortably
restrictive and seemingly arbitrary format. Why it has not been updated is
beyond me, unless, like these parsers for BioPython, it's just another
project waiting for someone to work on it.

>>> I am pretty sure that Phylip doesn't care about non-unique names so why
>>> error out? However, the class should have a means for the user to ask
>>> this question.
>>
>> Because the (truncated) taxa names are going to be used as tree node
>> names by any tree building program, they really should be unique.  I
>> would expect any tree program to throw an error in this case, which is
>> why I suggested we should try not to create such files in the first
>> place.
>
> Not exactly. I've been bitten in the butt by the truncation issue several
> times. I know TreeView X doesn't care about unique names and I think
> MacClade also doesn't care. Now, PAUP and Mesquite might care or any Nexus
> type-system which lists the taxon names separately from the taxons in the
> TREES block (they use numbers for the taxons which get mapped to TAXLABELS
> in the TAXA block. I believe it depends on how they decided to store these
> relationships).
>
> I guess we have three options here: 1) keep on trucking 2) raise a warning
> 3) raise an exception - something like Matrix.NonUniqueName exception so
> that you can specifically except  the exception
>

I dislike option 1, unless we also provide the user the ability to check for
non-unique names, too. Remember the Zen of Python: "Explicit is better than
implicit."

I like option 3, though I don't know how to make it possible for code
outside the parser to catch the exception and tell the parser to continue.
We could have it throw the exception by default, but if the user provides a
flag when calling the parser, like allow_non_unique=True, we could have logic
in the parser that catches the exception and continues.

>> Likewise an option to save matrices as either fully symmetric or lower
>> triangular.  I would lean towards using fully symmetric as the default as
>> it seems to be more common.
>
> Phylip's default seems to be a "Square" distance matrix, i.e.  fully
> symmetric. Keep this in mind when naming or documentation.

As I mentioned above, the same object would represent both types, and should
be equally capable of outputting itself as text in either format.

Chris


From gvwilson at cs.utoronto.ca  Sun Jun 18 18:15:18 2006
From: gvwilson at cs.utoronto.ca (Greg Wilson)
Date: Sun, 18 Jun 2006 14:15:18 -0400
Subject: [Biopython-dev] ann: open source course on basic software
	development skills
Message-ID: <e74578$pq2$12@sea.gmane.org>

http://www.third-bit.com/swc is an open source course on basic software
development skills, aimed primarily at people with backgrounds in
science, engineering, and medicine who have little formal training in
programming, but find themselves doing a lot of it.  The course was
developed in part through support from the Python Software Foundation;
all of the material can be used and modified free of charge (but with
attribution).  If you have questions, would like to contribute material,
or have a success story you'd like to share, please contact Greg Wilson
(gvwilson at cs.utoronto.ca).

Thanks,
Greg


From chris.lasher at gmail.com  Mon Jun 19 17:49:34 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Mon, 19 Jun 2006 13:49:34 -0400
Subject: [Biopython-dev] Bugzilla
Message-ID: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>

I noticed that the Bugzilla for BioPython on Open Bio (
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks
the current version number. Also, I noticed there seem to be quite a
few open tickets. Is BioPython still using Open Bio's Bugzilla to
track bugs?

Chris


From mdehoon at c2b2.columbia.edu  Mon Jun 19 18:27:26 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 19 Jun 2006 14:27:26 -0400
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
Message-ID: <4496EC8E.7030603@c2b2.columbia.edu>

Chris Lasher wrote:
> Is BioPython still using Open Bio's Bugzilla to track bugs?

Yes.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 14:38:40 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 15:38:40 +0100
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
Message-ID: <44980870.5070309@maubp.freeserve.co.uk>

Chris Lasher wrote:
> I noticed that the Bugzilla for BioPython on Open Bio (
> http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks
> the current version number.

Good point - who can edit that list?

> Also, I noticed there seem to be quite a few open tickets.

I've dealt with a few of them...

> Is BioPython still using Open Bio's Bugzilla to track bugs?

As Michiel said, yes we are.  In fact, maybe we should log bugs for 
several of the recent issues on the mailing list...

Peter


From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 13:40:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 20 Jun 2006 14:40:52 +0100
Subject: [Biopython-dev] Floats and Double in Cluster
Message-ID: <4497FAE4.3040705@maubp.freeserve.co.uk>

One for Michiel, as the author of Bio.Cluster

I've just been building the latest CVS code on Windows with MSVC 6.0
(Microsoft Visual C++) and noticed there are a lot of warnings about
double to float conversion (with associated data loss) from
Bio/Cluster/ranlib.c (see attached output).

Does this matter?

I've also tried compiling the same code on Linux which I assume is using
the default gcc 4.0.2, and there are no such warnings (even with the
compiler option -Wall being specified).

I found that I could generally get rid of the warnings by adding
explicit (float) cast statements to the problem lines.

Thanks

Peter


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: build.txt
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060620/1edd8a20/attachment-0002.txt>

From mdehoon at c2b2.columbia.edu  Tue Jun 20 16:05:30 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 20 Jun 2006 12:05:30 -0400
Subject: [Biopython-dev] Floats and Double in Cluster
In-Reply-To: <4497FAE4.3040705@maubp.freeserve.co.uk>
References: <4497FAE4.3040705@maubp.freeserve.co.uk>
Message-ID: <44981CCA.2070006@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> One for Michiel, as the author of Bio.Cluster
> 
> I've just been building the latest CVS code on Windows with MSVC 6.0
> (Microsoft Visual C++) and noticed there are a lot of warnings about
> double to float conversion (with associated data loss) from
> Bio/Cluster/ranlib.c (see attached output).
> 
> Does this matter?
> 
Probably not. The ranlib library is quite old (maybe fifteen years or 
more). It was originally written in Fortran, and automatically converted 
into C. Such conversions are usually not so clean, so the resulting code 
tends to generate many warning messages. On the other hand, ranlib is a 
part of Numerical Python (the RandomArray module), so if there were a 
serious problem with it it would have been discovered by now.
I did find out recently though that there may be a licensing problem 
with ranlib. Since we're using only a small part of ranlib in 
Bio.Cluster, I'm planning to replace it with a different random number 
generator for the next version.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From chris.lasher at gmail.com  Tue Jun 20 16:35:14 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 20 Jun 2006 12:35:14 -0400
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <44980870.5070309@maubp.freeserve.co.uk>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
	<44980870.5070309@maubp.freeserve.co.uk>
Message-ID: <128a885f0606200935r619424f1jd983ad51eb36d7b@mail.gmail.com>

On 6/20/06, Peter <biopython-dev at maubp.freeserve.co.uk> wrote:
>  > Also, I noticed there seem to be quite a few open tickets.
>
> I've dealt with a few of them...

So some of those tickets are not actually still "open"?

Chris


From chris.lasher at gmail.com  Tue Jun 20 16:54:49 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 20 Jun 2006 12:54:49 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
	<128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
Message-ID: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>

Question for the knowledgeable: is this an appropriate realm to write
a Martel parser for?

Chris


From biopython-dev at maubp.freeserve.co.uk  Tue Jun 20 17:04:25 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jun 2006 18:04:25 +0100
Subject: [Biopython-dev] Clustal unit test
Message-ID: <44982A99.9090400@maubp.freeserve.co.uk>

Another query Michiel

A minor point first of all, adding something like the following to the 
end of test_Cluster.py makes it easy to run this unit test on its own:

if __name__ == "__main__" :
     run_tests(module = "Bio.Cluster")

Secondly, the test works for me with BioPython 1.41, but using today's 
CVS the test fails (or at least, sits there at high CPU usage for so 
long I give up and kill it).  This was using Linux.

The only changes since BioPython 1.41 are those you checked in today 
with this comment:

 > C Clustering Library version 1.32. Bio.Cluster became
 > objected-oriented (somewhat).

Does the unit test still work for you?

Peter


From mcolosimo at mitre.org  Tue Jun 20 18:32:08 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 20 Jun 2006 14:32:08 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<448DE350.8000403@maubp.freeserve.co.uk>
	<2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
	<128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>
	<128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com>
Message-ID: <6F683BD6-4359-40BF-A7E4-10C97D4032EF@mitre.org>

I would say, NO.
I think Martel is a Mack truck and you need a little Toyota pickup.  
Sure, Martel probably can do this, but it would be overkill, IMHO.   
Maybe Andrew could chime in on this.

Marc

On Jun 20, 2006, at 12:54 PM, Chris Lasher wrote:

> Question for the knowledgeable: is this an appropriate realm to write
> a Martel parser for?
>
> Chris


From mdehoon at c2b2.columbia.edu  Tue Jun 20 20:56:22 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 20 Jun 2006 16:56:22 -0400
Subject: [Biopython-dev] Clustal unit test
In-Reply-To: <44982A99.9090400@maubp.freeserve.co.uk>
References: <44982A99.9090400@maubp.freeserve.co.uk>
Message-ID: <449860F6.7040807@c2b2.columbia.edu>

Peter wrote:
> A minor point first of all, adding something like the following to the 
> end of test_Cluster.py makes it easy to run this unit test on its own:
> 
> if __name__ == "__main__" :
>      run_tests(module = "Bio.Cluster")

OK I've added this in CVS.

> Secondly, the test works for me with BioPython 1.41, but using today's 
> CVS the test fails (or at least, sits there at high CPU usage for so 
> long I give up and kill it).  This was using Linux.
> 
I did make an update to Bio.Cluster but in my brilliance forgot to 
update the test scripts accordingly. CVS now contains updated 
test_Cluster.py and output/test_Cluster. Let me know if the test still 
fails with these updated files. Thanks for catching this.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From bsouthey at gmail.com  Tue Jun 20 17:47:53 2006
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 20 Jun 2006 12:47:53 -0500
Subject: [Biopython-dev] Floats and Double in Cluster
In-Reply-To: <44981CCA.2070006@c2b2.columbia.edu>
References: <4497FAE4.3040705@maubp.freeserve.co.uk>
	<44981CCA.2070006@c2b2.columbia.edu>
Message-ID: <bbcd77d00606201047v3a928efeo627a09304fdb79@mail.gmail.com>

Hi,
Actually just switch to the new numpy where ranlib is no longer used.
It uses the Mersenne Twister RNG from Jean-Sebastien Roy's random kit.
Robert Kern has also written additional code.

Of course, that really means moving BioPython to numpy from Numeric.
Is there a plan for this?

Bruce


On 6/20/06, Michiel Jan Laurens de Hoon <mdehoon at c2b2.columbia.edu> wrote:
> Peter (BioPython Dev) wrote:
> > One for Michiel, as the author of Bio.Cluster
> >
> > I've just been building the latest CVS code on Windows with MSVC 6.0
> > (Microsoft Visual C++) and noticed there are a lot of warnings about
> > double to float conversion (with associated data loss) from
> > Bio/Cluster/ranlib.c (see attached output).
> >
> > Does this matter?
> >
> Probably not. The ranlib library is quite old (maybe fifteen years or
> more). It was originally written in Fortran, and automatically converted
> into C. Such conversions are usually not so clean, so the resulting code
> tends to generate many warning messages. On the other hand, ranlib is a
> part of Numerical Python (the RandomArray module), so if there were a
> serious problem with it it would have been discovered by now.
> I did find out recently though that there may be a licensing problem
> with ranlib. Since we're using only a small part of ranlib in
> Bio.Cluster, I'm planning to replace it with a different random number
> generator for the next version.
>
> --Michiel.
>
> --
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1130 St Nicholas Avenue
> New York, NY 10032
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


From mdehoon at c2b2.columbia.edu  Thu Jun 22 02:32:58 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 21 Jun 2006 22:32:58 -0400
Subject: [Biopython-dev] Biopython's XMl parser fails with NCBI blast
 changed XML output format
In-Reply-To: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>
References: <d9fd76050606211206pa104f7dwdebfcb05dcab09d2@mail.gmail.com>
Message-ID: <449A015A.4080503@c2b2.columbia.edu>

I am not sure if the new XML format is really what NCBI wants it to be, 
since it does not agree with the Blast documentation. I asked NCBI about 
this; they have forwarded this question to their Blast developers, so 
hopefully we'll get a definite answer soon.
For the time being, I guess the only thing you can do is to download an 
older version of Blast and run your blast searches locally. Then, the 
blast XML output will be in the old format, and there should be no 
problem parsing them with the existing parser in Biopython.

--Michiel.

Rohini Damle wrote:
> Hi,
> I am trying to parse the blast output (XML formatted, using online NCBI's
> blast) I got as a result for 'short nearly exact matches' for my 50-55
> short protein sequences.
> It looks like the XML format has changed and biopython's XML parser
> fails to parse the blast records.
> can somebody show a way to fix this thing?
> Thank you
> Rohini Damle


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From dcoorna at dbm.ulb.ac.be  Mon Jun 26 10:57:25 2006
From: dcoorna at dbm.ulb.ac.be (david coornaert)
Date: Mon, 26 Jun 2006 12:57:25 +0200
Subject: [Biopython-dev] NCBI-XML blast parser
Message-ID: <449FBD95.5040308@dbm.ulb.ac.be>


I'm currently using the Biopython NCBI XML blast output parser.

From the CVS version I fetched, I see some comments about the useless
nature of Hsp_query_to and Hsp_hit_to.

Well, I need those, and I can't reliably calculate them from
hsp_align_len (which is not included either), because I would have to
manage the max length of the query, hit and alignment strings and
then take care of the strand to know whether to increase or decrease.

So I've worked out a parallel copy of the parser,
but I'd like to know why these are considered useless.
Should I commit these harmless changes? (That would need CVS access.)

???


-- 
===============================================
David Coornaert [PhD]   (dcoorna at dbm.ulb.ac.be)

Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles

Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041  Gosselies
BELGIQUE

Tél:  +3226509975
Fax:  +3226509998
===============================================


From biopython-dev at maubp.freeserve.co.uk  Mon Jun 26 12:05:28 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Jun 2006 13:05:28 +0100
Subject: [Biopython-dev] NCBI-XML blast parser
In-Reply-To: <449FBD95.5040308@dbm.ulb.ac.be>
References: <449FBD95.5040308@dbm.ulb.ac.be>
Message-ID: <449FCD88.6080704@maubp.freeserve.co.uk>

david coornaert wrote:
> I'm currently using the Biopython NCBI XML blast output parser.
> 
> From the CVS version I fetched, I see some comments about the useless
> nature of Hsp_query_to and Hsp_hit_to.
> 
> Well, I need those, and I can't reliably calculate them from
> hsp_align_len (which is not included either), because I would have to
> manage the max length of the query, hit and alignment strings and
> then take care of the strand to know whether to increase or decrease.
> 
> So I've worked out a parallel copy of the parser,
> but I'd like to know why these are considered useless.
> Should I commit these harmless changes? (That would need CVS access.)
> 
> ???

Hi David

Could you file a bug and attach your patch to it please?  (Trying to 
send attachments to the mailing list can be a bit unreliable).  Then 
hopefully some of the group can at least try it out...

Out of interest, what version of Blast have you been using?  Online or 
standalone?

(If you've been following the list, we think the NCBI have changed the 
format returned for multiple queries in version 2.2.14)

Thanks

Peter


From dcoorna at dbm.ulb.ac.be  Mon Jun 26 12:44:04 2006
From: dcoorna at dbm.ulb.ac.be (david coornaert)
Date: Mon, 26 Jun 2006 14:44:04 +0200
Subject: [Biopython-dev] NCBI-XML blast parser
In-Reply-To: <449FCD88.6080704@maubp.freeserve.co.uk>
References: <449FBD95.5040308@dbm.ulb.ac.be>
	<449FCD88.6080704@maubp.freeserve.co.uk>
Message-ID: <449FD694.7030508@dbm.ulb.ac.be>

Peter wrote:
> Hi David
>
> Could you file a bug and attach your patch to it please?  (Trying to 
> send attachments to the mailing list can be a bit unreliable).  Then 
> hopefully some of the group can at least try it out...
>
>   
Well, I'm not sure about the bug procedure,
so here it is already.
I'll have a look at the list stuff quite soon and will submit as requested.

I wouldn't have qualified that as a bug; I was just wondering why
someone would consider these values useless. Sure, you can calculate
them, although it would be painful and ... well, since they are
already in the XML...
I simply added these (in red):

Bio/Blast/NCBIXML.py

line 289:
    # No need for Hsp_query_to
    def _end_Hsp_query_to(self):
        """offset of query at the end of the alignment (one-offset)
        """
        self._hsp.query_to = int(self._value)

    def _end_Hsp_hit_from(self):
        """offset of the database at the start of the alignment (one-offset)
        """
        self._hsp.sbjct_start = int(self._value)

    # No need for Hsp_hit_to
    def _end_Hsp_hit_to(self):
        """offset of the database at the end of the alignment (one-offset)
        """
        self._hsp.sbjct_to = int(self._value)
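
To show how methods like these get called, here is a rough, self-contained
sketch of a SAX handler that dispatches on the closing tag name. It is an
illustration only, not the actual NCBIXML code: the HspHandler and Hsp
classes and the sample XML are made up, and it assumes that the hyphens in
tag names such as Hsp_query-to are mapped to underscores before the lookup.

```python
import xml.sax


class Hsp:
    """Bare container for the coordinates of one HSP (hypothetical)."""


class HspHandler(xml.sax.ContentHandler):
    """Call self._end_<tag>() whenever an element closes, if it exists."""

    def __init__(self):
        super().__init__()
        self.hsps = []
        self._hsp = None
        self._value = ""

    def startElement(self, name, attrs):
        self._value = ""  # start collecting text for this element
        if name == "Hsp":
            self._hsp = Hsp()

    def characters(self, content):
        self._value += content

    def endElement(self, name):
        # Assumption: '-' in tag names becomes '_' in the method name.
        method = getattr(self, "_end_" + name.replace("-", "_"), None)
        if method is not None:
            method()

    def _end_Hsp_query_from(self):
        self._hsp.query_start = int(self._value)

    def _end_Hsp_query_to(self):
        self._hsp.query_to = int(self._value)

    def _end_Hsp_hit_from(self):
        self._hsp.sbjct_start = int(self._value)

    def _end_Hsp_hit_to(self):
        self._hsp.sbjct_to = int(self._value)

    def _end_Hsp(self):
        self.hsps.append(self._hsp)


# Made-up fragment, for illustration only.
SAMPLE = b"""<Hsp>
  <Hsp_query-from>1</Hsp_query-from>
  <Hsp_query-to>120</Hsp_query-to>
  <Hsp_hit-from>340</Hsp_hit-from>
  <Hsp_hit-to>221</Hsp_hit-to>
</Hsp>"""

handler = HspHandler()
xml.sax.parseString(SAMPLE, handler)
hsp = handler.hsps[0]
print(hsp.query_start, hsp.query_to, hsp.sbjct_start, hsp.sbjct_to)
```

With this kind of dispatch, a tag without a matching _end_ method is
silently dropped, which is why the two "to" values never reached the HSP
object before.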


On the other hand, a real bug is the mess regarding Frame and
Strand!

In blastn output there must be:

Strand = Plus / Plus
or
Strand = Plus / Minus (and so on)

while in tblastx output there must be:
Frame = +3/-1 (and so on)

(blastx output must also present one Frame value)

Unfortunately, to find the appropriate strand in a blastn job, you need
to look at the hsp.frame array, even though there's an hsp.strand array...


And all this stuff is useful! If the hit is on the opposite strand, you
need to swap query_start and query_to, for example...
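
As a minimal sketch of that swap (the function names are made up, and it
assumes that a minus-strand span is reported with start > end and flagged
by a negative frame value):

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def ordered_span(start, end, frame):
    """Return (low, high) one-offset coordinates with low <= high.

    Assumption: spans on the minus strand (negative frame) arrive with
    start > end, so they are swapped here.
    """
    if frame < 0:
        start, end = end, start
    if start > end:
        raise ValueError("coordinates do not agree with the frame sign")
    return start, end


def extract(seq, start, end, frame):
    """Cut the aligned region out of seq, reverse-complemented if needed."""
    low, high = ordered_span(start, end, frame)
    fragment = seq[low - 1:high]  # one-offset, inclusive -> Python slice
    if frame < 0:
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment


print(extract("ATGCCGTA", 2, 5, 1))
print(extract("ATGCCGTA", 5, 2, -1))
```

Without the end coordinates and a reliable strand, neither the swap nor
the slice can be done without recomputing everything from the alignment
strings.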


> Out of interest, what version of Blast have you been using?  Online or 
> standalone?
>
>   
Well, I've seen the complaints regarding 2.2.14, hence I stuck to
2.2.13 standalone

=;B^)


-- 
===============================================
David Coornaert [PhD]   (dcoorna at dbm.ulb.ac.be)

Belgian Embnet Node (http://www.be.embnet.org)
Universite' Libre de Bruxelles

Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041  Gosselies
BELGIQUE

Te'l:  +3226509975
Fax:  +3226509998
===============================================


From chris.lasher at gmail.com  Tue Jun 27 23:33:57 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 19:33:57 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
	<448A9A7A.6050501@maubp.freeserve.co.uk>
	<CB52EC1C-51B4-4E5C-81A3-723D87C8CA36@mitre.org>
	<449F0231.2050308@maubp.freeserve.co.uk>
	<128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
	<44A1B23E.5080007@maubp.freeserve.co.uk>
	<128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
Message-ID: <128a885f0606271633g6fc66bb1h8173eca5b949a1c5@mail.gmail.com>

Oh brother... today's not my day. NOW it's back on BP-Dev...

Stupidly yours,
Chris

On 6/27/06, Chris Lasher <chris.lasher at gmail.com> wrote:
> [Oops! I didn't realize I was posting to the user list! Reverting it
> back to BP-Dev]
> This code looks very good, Peter!
>
> As far as licensing, I'm new to the game, but my guess is the
> BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly
> preferred for BioPython. You still retain copyright with the license,
> but the code is more "free" than under any version of the GPL.
>
> Chris
>
> On 6/27/06, Peter <biopython at maubp.freeserve.co.uk> wrote:
> > Chris Lasher wrote:
> > > Hi Peter,
> > >
> > > Would you be up for licensing your code under the BioPython license?
> > > If not, I shouldn't  look at it, as I've started coding my own module
> > > for the project. From your description, your module sounds very good.
> > > =-)
> > >
> > > Chris
> >
> > I am quite happy to contribute the code to BioPython under the
> > appropriate license, so please go ahead.
> >
> > I've filed a bug on adding PHYLIP distance parsers to BioPython and
> > attached a slightly revised version of the code (added "fuzzy" equality
> > testing of matrices - mainly for testing):
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2034
> >
> > If anyone else really wants the code under some other license (GPL
> > maybe) I could probably be persuaded.
> >
> > Peter
> >
> > _______________________________________________
> > BioPython mailing list  -  BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>