From mcolosimo at mitre.org Mon Jun 12 08:38:18 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 08:38:18 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org>
 <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com>
 <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org>
 <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com>
Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org>

[cross-posting to biopython-dev]

Chris,

Oops, didn't notice this was on the general biopython mailing list. I
think many of the developers also subscribe to this list, but just in
case I'm cross posting this.

Iddo pointed out Bio.SubsMat; I didn't know what that module did.
That's one problem with names like that, and the API docs are helpful
only when you look at them (kudos to those who add documentation).

Given Bio.SubsMat and the BioPerl module, I would strongly consider
combining Bio.SubsMat and PhylipDist into a new Bio.Matrix module.
From a Phylo module, a function/class can always call the Bio.Matrix
classes.

Marc

On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote:

>> I likewise didn't know about the Bio::Matrix::PhylipDist module.
>> Personally, I would opt for a Matrix object (since Python is an
>> OO language) and store it internally as a nested list. That way you
>> have the best of both worlds. The next question is the object
>> hierarchy. Here I would opt for a top level Matrix class (or module)
>> and then subclass that under Phylo. So, something like this:
>>
>> Bio.Matrix
>> Bio.Phylo.Matrix
>
> So is this more appropriate than Bio.Matrix.Phylo?
> A phylogenetic matrix is a type of matrix, so that hierarchy is
> immediately appealing; however, a phylogenetic matrix is not of much
> use in and of itself, so I can see the argument that it should be
> placed in a phylogeny package (which we have yet to write but, as
> mentioned earlier, could be very useful).
>
>> and maybe things like the following (which isn't used/followed much
>> here in BioPython)
>>
>> Bio.Phylo.IO
>> Bio.Phylo.Parsers.PhylipDist
>> Bio.Phylo.Parsers.Newick
>> Bio.Phylo.Parsers.Nexus
>>
>> And/or have
>> Bio.Phylo.Matrix.IO that uses the PhylipDist parser.
>
> This is very very good, in my opinion. Thanks for doing the
> heavy-lifting of the brainwork on this! =-)
>
>> The next big question is what should Bio.Phylo.IO return? For
>> inspiration, we might want to look at Mesquite
>> <mesquiteproject.org/mesquite/mesquite.html>.
>
> I must give a better look at this site before commenting, but once
> again, thanks for bringing this to my awareness! What a helpful past
> couple of emails. I will be out for the weekend but will think more
> about this.
>
> As a sidenote, should this discussion be moved to biopython-dev or is
> it fine here?
>
> Thanks again Marc,
> Chris
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From mcolosimo at mitre.org Mon Jun 12 09:18:41 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 12 Jun 2006 09:18:41 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID:

[cross post]

On Jun 10, 2006, at 6:10 AM, Peter wrote:

> Chris Lasher wrote:
>> Hi all, Are there any modules in BioPython to parse distance
>> matrices?
>> My poking around the BioPython modules and Google searching
>> does not turn up any signs indicating there are distance matrix
>> parsers, currently. Two particularly useful parsers would be a parser
>> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP
>> (http://evolution.genetics.washington.edu/phylip.html),
>
> I've done a very small amount of work with neighbour joining trees,
> using PHYLIP format distance matrices. The closest I could find to a
> file format definition was this page:
>
> http://evolution.genetics.washington.edu/phylip/doc/distance.html
>
> Points to be aware of:
>
> In my experience, most software tools usually write the distances as a
> full symmetric matrix. However, the "standard" explicitly discusses
> lower triangular form (missing out the diagonal distance zero entries)
> which has the significant advantage of using about half the disk
> space. This is significant once you get into thousands of taxa.

This is still small potatoes compared to the input needed to generate
the distance matrices (especially with DNA/RNA sequences of any
decently sized gene).

> So, make sure any parser can cope with both full symmetric, and lower
> triangular forms - ideally without the user having to care.

Phylip does ask you which to either read or write; this is a pain at
times. So, having a parser figure this out would be nice. However, the
user should know about the choices.

> This also raises the point about how to store the matrix in memory.
> Does Numeric/NumPy have an efficient way of storing symmetric
> matrices? This is less flexible than the suggested list of lists,
> but for large datasets would need much less memory.

I believe that SciPy (Numeric/NumPy/etc..) is more efficient at storing
these things. But you lose that when you want to do pythonish things to
it (like write it back out).

> Second point - the "official" PHYLIP distance matrix file format
> truncates the taxa names at 10 characters. Some tools (e.g.
> clustalw) ignore this limitation and will use as many as needed for
> the full name.

ClustalW does the CORRECT thing: it truncates the name to 10 characters
for Phylip output (alignments). And it does the CORRECT thing for its
distance matrix file. In Clustalw's trees.c file

    void distance_matrix_output(FILE *ofile)
        fprintf(ofile,"\n%-*s ",max_names,names[i]);
        /* left justify to the maximum length of names in current
           alignment file and use a space as a sep */

Spaces in names are bad in this case, but phylip is okay with them,
since the first 10 characters are the taxon name.

> I personally find this much nicer - after all most gene identifiers
> (e.g. GI numbers) are eight characters to start with, and if you are
> dealing with multiple features in each gene 10 characters is tough
> going.
>
> So, I would make sure you test the parser on this format variant (with
> names longer than 10 characters). I can supply some examples if
> you like.

By definition this isn't a variant of Phylip, but another format. So,
one would need two parsers: PhylipDist and Dist (or ClustalDist).

> For writing matrices to file, the issue of following the strict 10
> character taxa limit might best be handled as an option (default to
> max 10, with a warning if any names are truncated, and an error if
> truncation renders names non-unique?).

DON'T give an option of 10 or more. That is NOT the definition of the
Phylip file Matrix structure, so why give the option? Make another
class that outputs the whole name (ClustalDist).

I am pretty sure that Phylip doesn't care about non-unique names, so
why error out? However, the class should have a means for the user to
ask this question.

> Likewise an option to save matrices as either fully symmetric or lower
> triangular. I would lean towards using fully symmetric as the default
> as it seems to be more common.

Phylip's default seems to be a "Square" distance matrix, i.e. fully
symmetric. Keep this in mind when naming or documenting.
>
>> and a parser for the MEGA (http://www.megasoftware.net/mega.html)
>> distance matrix format. If not, would there be any interest in
>> creating parsers for these matrices, other than my own? I think
>> parsers for distance matrices could be very useful to the community.
>
> I suspect that for serious tree building pure python will not be
> competitive with existing C/C++ code on speed - but nonetheless could
> be useful.

Well, we do have things like SciPy and PyClustal, which make things
more even.

Marc

From biopython-dev at maubp.freeserve.co.uk Mon Jun 12 17:57:36 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jun 2006 22:57:36 +0100
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To:
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <448A9A7A.6050501@maubp.freeserve.co.uk>
Message-ID: <448DE350.8000403@maubp.freeserve.co.uk>

[Send to the Dev list only - forward to the main discussion list if you
think best Marc]

One general question about the architecture: Are you thinking of having
a generic "distance matrix object", and parsers/formats defined for
several different file formats?

Peter (me) wrote:
>>In my experience, most software tools usually write the distances as a
>>full symmetric matrix. However, the "standard" explicitly discusses
>>lower triangular form (missing out the diagonal distance zero entries)
>>which has the significant advantage of using about half the disk
>>space. This is significant once you get into thousands of taxa.

Marc Colosimo wrote:
> This is still small potatoes compared to the input needed to generate
> the distance matrices (especially with DNA/RNA sequences of any
> decently sized gene).

Regarding size of matrix file versus size of alignment file, that isn't
always true.

(*) The matrix file size goes as the square of the number of taxa, the
alignment file only linearly.

(*) The matrix file is invariant with respect to the length of the
sequences/number of columns in the alignment.

(*) The matrix file size goes linearly with the precision (number of
decimal places) used.

As you are using "decently sized genes" then you will have large
alignment files, but I would imagine you have at most hundreds of genes
per alignment - not thousands (?).

For my own examples, I have about two thousand domains (not full genes)
and the phylip distance matrix file was MUCH bigger than the alignment
file.

Peter (me) wrote
>>So, make sure any parser can cope with both full symmetric, and lower
>>triangular forms - ideally without the user having to care.

Marc Colosimo wrote:
> Phylip does ask you which to either read or write; this is a pain at
> times. So, having a parser figure this out would be nice. However,
> the user should know about the choices.

It's fairly easy for the parser to cope with either: For each line of
input, only use the "lower triangular" portion - just ignore any
remaining text which would be present for a full matrix (square) file,
or not present for a lower triangular file.

Peter wrote:
>>This also raises the point about how to store the matrix in memory.
>>Does Numeric/NumPy have an efficient way of storing symmetric
>>matrices? This is less flexible than the suggested list of lists,
>>but for large datasets would need much less memory.

Marc Colosimo wrote:
> I believe that SciPy (Numeric/NumPy/etc..) is more efficient at
> storing these things. But you lose that when you want to do pythonish
> things to it (like write it back out).

It depends on our target audience. My experience with two thousand taxa
means that I am slightly concerned about the memory, and would lean
towards storing the data using Numeric/NumPy. This could be done within
a nice python object, with methods to write it out again in phylip
format etc - so it could still behave "nicely".
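
As a rough illustration of the "only use the lower triangular portion"
approach, here is a minimal sketch in plain Python. The function name
parse_phylip_distances is hypothetical (it is not existing Biopython
code), and the sketch assumes whitespace-separated names and that each
taxon's row sits on a single line, which strict PHYLIP files do not
guarantee.

```python
def parse_phylip_distances(text):
    """Parse a square or lower-triangular distance matrix.

    Hypothetical helper for illustration only. Assumes one taxon per
    line and names without embedded spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])          # first line holds the taxon count
    if len(lines) - 1 < n:
        raise ValueError("expected %d rows, found %d" % (n, len(lines) - 1))

    names = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split()
        names.append(fields[0])
        # Row i only needs columns 0..i-1. Anything after that (the
        # diagonal and upper triangle of a square file) is ignored.
        values = fields[1:1 + i]
        if len(values) < i:
            raise ValueError("row %d is too short" % (i + 1))
        for j, value in enumerate(values):
            matrix[i][j] = matrix[j][i] = float(value)
    return names, matrix
```

Feeding it a square file or the matching lower-triangular file should
give back the same names and the same symmetric nested list, so the
caller does not have to say which form the file is in.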

Peter wrote:
>>Second point - the "official" PHYLIP distance matrix file format
>>truncates the taxa names at 10 characters. Some tools (e.g. clustalw)
>>ignore this limitation and will use as many as needed for the full
>>name.

Marc Colosimo wrote:
> ...
>
> By definition this isn't a variant of Phylip, but another format. So,
> one would need two parsers: PhylipDist and Dist (or ClustalDist).

That would be another way of looking at the issue, sure. [See below]

Peter wrote:
>>For writing matrices to file, the issue of following the strict 10
>>character taxa limit might best be handled as an option (default to
>>max 10, with a warning if any names are truncated, and an error if
>>truncation renders names non-unique?).

Marc Colosimo wrote:
> DON'T give an option of 10 or more. That is NOT the definition of the
> Phylip file Matrix structure, so why give the option? Make another
> class that outputs the whole name (ClustalDist).

I like clustal's "long name variant of Phylip distance format", as for
my datasets my gene/domain names are longer than 10 characters. I may
well be in a minority here (for now).

I suppose it would be "good practice" to follow the official (but not
overly precise) phylip definition on this issue.

So your idea of defining two similar formats would resolve this. In
terms of implementation, one could probably just subclass the other to
reduce the amount of duplicated code.

> I am pretty sure that Phylip doesn't care about non-unique names so
> why error out? However, the class should have a means for the user to
> ask this question.

Because the (truncated) taxa names are going to be used as tree node
names by any tree building program, they really should be unique. I
would expect any tree program to throw an error in this case, which is
why I suggested we should try not to create such files in the first
place.

Peter wrote:
>>Likewise an option to save matrices as either fully symmetric or lower
>>triangular.
>>I would lean towards using fully symmetric as the default
>>as it seems to be more common.

Marc Colosimo wrote:
> Phylip's default seems to be a "Square" distance matrix, i.e. fully
> symmetric. Keep this in mind when naming or documenting.

Good point.

Peter

From mcolosimo at mitre.org Tue Jun 13 11:46:16 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 13 Jun 2006 11:46:16 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <448DE350.8000403@maubp.freeserve.co.uk>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <448A9A7A.6050501@maubp.freeserve.co.uk>
 <448DE350.8000403@maubp.freeserve.co.uk>
Message-ID: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>

[I've added Chris in case he isn't on the dev-list]

On Jun 12, 2006, at 5:57 PM, Peter wrote:

> [Send to the Dev list only - forward to the main discussion list if
> you think best Marc]
>
> One general question about the architecture: Are you thinking of
> having a generic "distance matrix object", and parsers/formats
> defined for several different file formats?
>

Yes. I think that is what I am leaning towards. Now, I don't know if
I'll be the implementor or not. It has been something on my to-do list
for a while.

> Peter (me) wrote:
>>> In my experience, most software tools usually write the distances
>>> as a full symmetric matrix. However, the "standard" explicitly
>>> discusses lower triangular form (missing out the diagonal distance
>>> zero entries) which has the significant advantage of using about
>>> half the disk space. This is significant once you get into
>>> thousands of taxa.
>
> Marc Colosimo wrote:
>> This is still small potatoes compared to the input needed to
>> generate the distance matrices (especially with DNA/RNA sequences
>> of any decently sized gene).
>
> Regarding size of matrix file versus size of alignment file, that
> isn't always true.
>
> (*) The matrix file size goes as the square of the number of taxa,
> the alignment file only linearly.
>
> (*) The matrix file is invariant with respect to the length of the
> sequences/number of columns in the alignment.
>
> (*) The matrix file size goes linearly with the precision (number
> of decimal places) used.
>
> As you are using "decently sized genes" then you will have large
> alignment files, but I would imagine you have at most hundreds of
> genes per alignment - not thousands (?).
>
> For my own examples, I have about two thousand domains (not full
> genes) and the phylip distance matrix file was MUCH bigger than the
> alignment file.

You got me on that boundary case. I just wanted to point out that is
not always the case.

> Peter (me) wrote
>>> So, make sure any parser can cope with both full symmetric, and
>>> lower triangular forms - ideally without the user having to care.
>
> Marc Colosimo wrote:
>> Phylip does ask you which to either read or write; this is a pain
>> at times. So, having a parser figure this out would be nice.
>> However, the user should know about the choices.
>
> It's fairly easy for the parser to cope with either: For each line
> of input, only use the "lower triangular" portion - just ignore any
> remaining text which would be present for a full matrix (square)
> file, or not present for a lower triangular file.

It should be fairly easy, but I don't understand why Phylip chokes on
square versus lower triangular. Either way, the class should
"internally" know what the format read in was, so you can ask it. That
way if you muck with it or create a new matrix and want to write that
out, you can ask the class what it read in and then have the new one
write it out in that format.

> Peter wrote:
>>> This also raises the point about how to store the matrix in memory.
>>> Does Numeric/NumPy have an efficient way of storing symmetric
>>> matrices?
>>> This is less flexible than the suggested list of lists,
>>> but for large datasets would need much less memory.
>
> Marc Colosimo wrote:
>> I believe that SciPy (Numeric/NumPy/etc..) is more efficient at
>> storing these things. But you lose that when you want to do
>> pythonish things to it (like write it back out).
>
> It depends on our target audience. My experience with two thousand
> taxa means that I am slightly concerned about the memory, and would
> lean towards storing the data using Numeric/NumPy. This could be
> done within a nice python object, with methods to write it out
> again in phylip format etc - so it could still behave "nicely".

I agree here and think that if the user has Numeric, use that;
otherwise use built-in types. So, maybe two "hidden" classes that do
the correct thing.

> Peter wrote:
>>> Second point - the "official" PHYLIP distance matrix file format
>>> truncates the taxa names at 10 characters. Some tools (e.g.
>>> clustalw) ignore this limitation and will use as many as needed
>>> for the full name.
>
> Marc Colosimo wrote:
>> ...
>> By definition this isn't a variant of Phylip, but another format.
>> So, one would need two parsers: PhylipDist and Dist (or
>> ClustalDist).
>
> That would be another way of looking at the issue, sure. [See below]
>
> Peter wrote:
>>> For writing matrices to file, the issue of following the strict 10
>>> character taxa limit might best be handled as an option (default
>>> to max 10, with a warning if any names are truncated, and an
>>> error if truncation renders names non-unique?).
>
> Marc Colosimo wrote:
>> DON'T give an option of 10 or more. That is NOT the definition of
>> the Phylip file Matrix structure, so why give the option? Make
>> another class that outputs the whole name (ClustalDist).
>
> I like clustal's "long name variant of Phylip distance format", as
> for my datasets my gene/domain names are longer than 10
> characters. I may well be in a minority here (for now).
>
> I suppose it would be "good practice" to follow the official (but
> not overly precise) phylip definition on this issue.
>
> So your idea of defining two similar formats would resolve this.
> In terms of implementation, one could probably just subclass the
> other to reduce the amount of duplicated code.

Correct. Subclassing is our friend (to a point).

>> I am pretty sure that Phylip doesn't care about non-unique names
>> so why error out? However, the class should have a means for the
>> user to ask this question.
>
> Because the (truncated) taxa names are going to be used as tree
> node names by any tree building program, they really should be
> unique. I would expect any tree program to throw an error in this
> case, which is why I suggested we should try not to create such
> files in the first place.

Not exactly. I've been bitten in the butt by the truncation issue
several times. I know TreeView X doesn't care about unique names and I
think MacClade also doesn't care. Now, PAUP and Mesquite might care, as
might any Nexus-type system which lists the taxon names separately from
the taxa in the TREES block (they use numbers for the taxa, which get
mapped to TAXLABELS in the TAXA block. I believe it depends on how they
decided to store these relationships).

I guess we have three options here:
1) keep on trucking
2) raise a warning
3) raise an exception - something like a Matrix.NonUniqueName exception
   so that you can specifically catch the exception

> Peter wrote:
>>> Likewise an option to save matrices as either fully symmetric or
>>> lower triangular. I would lean towards using fully symmetric as
>>> the default as it seems to be more common.
>
> Marc Colosimo wrote:
>> Phylip's default seems to be a "Square" distance matrix, i.e.
>> fully symmetric. Keep this in mind when naming or documenting.
>
> Good point.
>
> Peter
>

From chris.lasher at gmail.com Wed Jun 14 14:36:00 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 14 Jun 2006 14:36:00 -0400
Subject: [Biopython-dev] Distance Matrix Parsers
In-Reply-To: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <448A9A7A.6050501@maubp.freeserve.co.uk>
 <448DE350.8000403@maubp.freeserve.co.uk>
 <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org>
Message-ID: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com>

> [I've added Chris in case he isn't on the dev-list]

Thanks Marc! I actually joined the Dev list as the discussion got
interesting. Figured we'd move it to here eventually.

>> One general question about the architecture: Are you thinking of having a
>> generic "distance matrix object", and parsers/formats defined for several
>> different file formats?
>>
>
> Yes. I think that is what I am leaning towards. Now, I don't know if I'll
> be the implementor or not. It has been something on my to-do list for a
> while.

BioPython support for these formats with clean, testable code should be
the primary task, correct? I can help with this. After we get working
code, refactoring for memory management can take place. I haven't done
anything along these lines and I'd have to rely on someone else's
expertise for this.

> In my experience, most software tools usually write the distances as a
> full symmetric matrix. However, the "standard" explicitly discusses lower
> triangular form (missing out the diagonal distance zero entries) which has
> the significant advantage of using about half the disk space. This is
> significant once you get into thousands of taxa.

I guess we need to consider that storing the matrix in triangular form
will save some memory. However, I've emailed the SciPy/NumPy guys and
there is currently no support for a triangular/symmetric matrix; it
would have to be a square matrix. See more below.
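
If a triangular layout is wanted anyway, the packing can be done by
hand on top of a flat array. A minimal sketch, assuming a symmetric
matrix with a zero diagonal; the class name PackedDistances and its
methods are made up for illustration, and it uses the standard-library
array module rather than anything from Numeric/NumPy:

```python
from array import array


class PackedDistances:
    """Hypothetical packed store for a symmetric, zero-diagonal matrix.

    Only the n*(n-1)/2 entries below the diagonal are kept, in one flat
    array of C doubles.
    """

    def __init__(self, n):
        self.n = n
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        if not (0 <= i < self.n and 0 <= j < self.n):
            raise IndexError("taxon index out of range")
        if i < j:                      # always address the lower triangle
            i, j = j, i
        return i * (i - 1) // 2 + j    # rows 1..i-1 hold 1+2+...+(i-1) cells

    def get(self, i, j):
        if i == j:
            return 0.0                 # the diagonal is implied, not stored
        return self._data[self._index(i, j)]

    def set(self, i, j, value):
        if i == j:
            raise ValueError("diagonal entries are fixed at zero")
        self._data[self._index(i, j)] = value
```

For n taxa this keeps n*(n-1)/2 doubles instead of n*n, at the price of
going through get/set rather than ordinary nested-list indexing.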

>>>> So, make sure any parser can cope with both full symmetric, and lower
>>>> triangular forms - ideally without the user having to care.
>>>
>>> Phylip does ask you which to either read or write; this is a pain at
>>> times. So, having a parser figure this out would be nice. However, the
>>> user should know about the choices.
>>
>> It's fairly easy for the parser to cope with either: For each line of
>> input, only use the "lower triangular" portion - just ignore any
>> remaining text which would be present for a full matrix (square) file, or
>> not present for a lower triangular file.

Well, we can save a lot of developer time by requiring the user to
designate this, with the default being a square matrix. Is it
unreasonable to expect the user to know whether his or her matrix is
lower/upper-triangular or square? Autodetection seems to add a bit of
risk, e.g., either the detection has to be confirmed by the user (in
which case, what's the point of auto-detect), or we have to have a
really well tested auto-detector, i.e., a lot more developer time.

> It should be fairly easy, but I don't understand why Phylip chokes on
> square versus lower triangular. Either way, the class should "internally"
> know what the format read in was, so you can ask it. That way if you muck
> with it or create a new matrix and want to write that out, you can ask the
> class what it read in and then have the new one write it out in that
> format.

I think it makes sense for a Phylip triangular matrix and a Phylip
square matrix to be represented as the same type of object, for reasons
of consistency, as already discussed. As Marc pointed out, its original
form can simply be represented by an attribute of the object. It should
also be possible to write the matrix back out in either triangular or
square format, regardless of its original format.
These would probably just be methods of the object, such as
.to_phylip_square() and .to_phylip_ltriangular().

>>>> This also raises the point about how to store the matrix in memory.
>>>> Does Numeric/NumPy have an efficient way of storing symmetric matrices?
>>>> This is less flexible than the suggested list of lists, but for large
>>>> datasets would need much less memory.
>>>
>>> I believe that SciPy (Numeric/NumPy/etc..) is more efficient at storing
>>> these things. But you lose that when you want to do pythonish things to
>>> it (like write it back out).
>>
>> It depends on our target audience. My experience with two thousand taxa
>> means that I am slightly concerned about the memory, and would lean
>> towards storing the data using Numeric/NumPy. This could be done within
>> a nice python object, with methods to write it out again in phylip format
>> etc - so it could still behave "nicely".
>
> I agree here and think that if the user has Numeric use that, otherwise
> use built-in types. So, maybe two "hidden" classes that do the correct
> thing.

This just recently popped up on the NumPy discussion list:
http://www.mail-archive.com/numpy-discussion at lists.sourceforge.net/msg00265.html

The summary of that is we can memory-map it using numpy.memmap. I've
never used this before, so I can't really comment. I'd guess that for
small data files, this is overkill. For large sets it might be
reasonable. I suppose two separate classes could be available, one for
smaller matrices and one for larger. Again, I think the user would be
intelligent enough to make the decision as to which to use.

Since the class for handling standard (smaller) matrices will be easier
to code, I propose writing this standard one first and getting it into
BioPython. For this class, I suggest just sticking with a regular nested
list, rather than use something from Numeric/NumPy.
After this class is created and submitted, we can go back and create a
class to deal with larger matrices that's a sub-class of the standard
one. This way, the API remains the same, regardless of the class, and we
will only have to rewrite the methods that need changing due to the way
we'll need to interact with the underlying data structure of the wrapped
Numeric/NumPy object. How does that sound?

>> I like clustal's "long name variant of Phylip distance format", as for my
>> datasets my gene/domain names are longer than 10 characters. I may well
>> be in a minority here (for now).
>>
>> I suppose it would be "good practice" to follow the official (but not
>> overly precise) phylip definition on this issue.
>>
>> So your idea of defining two similar formats would resolve this. In
>> terms of implementation, one could probably just subclass the other to
>> reduce the amount of duplicated code.
>
> Correct. Subclassing is our friend (to a point).
>

I'm in agreement with using two separate types of objects to represent
these two formats. PhylipDist should represent the Phylip spec to the T.
I'm not familiar with the Clustal spec; is it formatted similarly, sans
the requirement of 10 characters max for the sequence name?

An editorial note: I'm very frustrated with Phylip's 10 character limit
for sequence names, too. I don't know the reasoning and history behind
the decisions on the format; all I know is that it is an uncomfortably
restrictive and seemingly arbitrary format. Why it has not been updated
is beyond me, unless, like these parsers for BioPython, it's just
another project waiting for someone to work on it.

>>> I am pretty sure that Phylip doesn't care about non-unique names so why
>>> error out? However, the class should have a means for the user to ask
>>> this question.
>>
>> Because the (truncated) taxa names are going to be used as tree node
>> names by any tree building program, they really should be unique.
>> I would expect any tree program to throw an error in this case, which is
>> why I suggested we should try not to create such files in the first
>> place.
>
> Not exactly. I've been bitten in the butt by the truncation issue several
> times. I know TreeView X doesn't care about unique names and I think
> MacClade also doesn't care. Now, PAUP and Mesquite might care, as might
> any Nexus-type system which lists the taxon names separately from the
> taxa in the TREES block (they use numbers for the taxa, which get mapped
> to TAXLABELS in the TAXA block. I believe it depends on how they decided
> to store these relationships).
>
> I guess we have three options here: 1) keep on trucking 2) raise a warning
> 3) raise an exception - something like a Matrix.NonUniqueName exception so
> that you can specifically catch the exception
>

I dislike option 1, unless we also provide the user the ability to check
for non-unique names, too. Remember the Zen of Python: "Explicit is
better than implicit." I like option 3, though I don't know how to make
it possible for code outside the parser to catch the exception and tell
the parser to continue. We could have it throw the exception by default,
but if the user provides a flag in calling the parser, like
allow_non_unique=True, we could have logic in the parser that, if True,
skips raising the exception and continues.

>> Likewise an option to save matrices as either fully symmetric or lower
>> triangular. I would lean towards using fully symmetric as the default as
>> it seems to be more common.
>
> Phylip's default seems to be a "Square" distance matrix, i.e. fully
> symmetric. Keep this in mind when naming or documenting.

As I mentioned above, the same object would represent both types, and
should be equally capable of outputting itself as text in either format.
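
Pulling the pieces of this message together, the object could look
roughly like the sketch below. The method names .to_phylip_square() and
.to_phylip_ltriangular() and the allow_non_unique flag are the ones
suggested in this message; the class name DistanceMatrix, the exception
name NonUniqueNameError, and the column widths are assumptions made for
illustration, not existing BioPython code.

```python
class NonUniqueNameError(ValueError):
    """Hypothetical error: names clash after truncation to 10 characters."""


class DistanceMatrix:
    """Hypothetical object holding names plus a full nested-list matrix."""

    NAME_WIDTH = 10  # strict PHYLIP name length

    def __init__(self, names, rows, allow_non_unique=False):
        truncated = [name[:self.NAME_WIDTH] for name in names]
        if not allow_non_unique and len(set(truncated)) != len(truncated):
            raise NonUniqueNameError("names are not unique after truncation")
        self.names = list(names)
        self.rows = [list(row) for row in rows]

    def has_unique_names(self):
        """Let the user ask the uniqueness question explicitly."""
        truncated = [name[:self.NAME_WIDTH] for name in self.names]
        return len(set(truncated)) == len(truncated)

    def _format_row(self, i, ncols):
        name = self.names[i][:self.NAME_WIDTH].ljust(self.NAME_WIDTH)
        values = "".join("%10.6f" % d for d in self.rows[i][:ncols])
        return (name + values).rstrip()

    def to_phylip_square(self):
        n = len(self.names)
        lines = ["%5d" % n] + [self._format_row(i, n) for i in range(n)]
        return "\n".join(lines) + "\n"

    def to_phylip_ltriangular(self):
        n = len(self.names)
        # Row i carries only its first i distances (no diagonal).
        lines = ["%5d" % n] + [self._format_row(i, i) for i in range(n)]
        return "\n".join(lines) + "\n"
```

With this shape, the default is the strict behaviour (raise on clashing
truncated names), and a caller who knows the downstream tool tolerates
duplicates can pass allow_non_unique=True and still call
has_unique_names() later.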

Chris

From gvwilson at cs.utoronto.ca Sun Jun 18 14:15:18 2006
From: gvwilson at cs.utoronto.ca (Greg Wilson)
Date: Sun, 18 Jun 2006 14:15:18 -0400
Subject: [Biopython-dev] ann: open source course on basic software development skills
Message-ID:

http://www.third-bit.com/swc is an open source course on basic software
development skills, aimed primarily at people with backgrounds in
science, engineering, and medicine who have little formal training in
programming, but find themselves doing a lot of it. The course was
developed in part through support from the Python Software Foundation;
all of the material can be used and modified free of charge (but with
attribution). If you have questions, would like to contribute material,
or have a success story you'd like to share, please contact Greg Wilson
(gvwilson at cs.utoronto.ca).

Thanks,
Greg

From chris.lasher at gmail.com Mon Jun 19 13:49:34 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Mon, 19 Jun 2006 13:49:34 -0400
Subject: [Biopython-dev] Bugzilla
Message-ID: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>

I noticed that the Bugzilla for BioPython on Open Bio
( http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks
the current version number. Also, I noticed there seem to be quite a few
open tickets. Is BioPython still using Open Bio's Bugzilla to track
bugs?

Chris

From mdehoon at c2b2.columbia.edu Mon Jun 19 14:27:26 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 19 Jun 2006 14:27:26 -0400
Subject: [Biopython-dev] Bugzilla
In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com>
Message-ID: <4496EC8E.7030603@c2b2.columbia.edu>

Chris Lasher wrote:
> Is BioPython still using Open Bio's Bugzilla to track bugs?

Yes.
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 10:38:40 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jun 2006 15:38:40 +0100 Subject: [Biopython-dev] Bugzilla In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> Message-ID: <44980870.5070309@maubp.freeserve.co.uk> Chris Lasher wrote: > I noticed that the Bugzilla for BioPython on Open Bio ( > http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks > the current version number. Good point - who can edit that list? > Also, I noticed there seem to be quite a few open tickets. I've dealt with a few of them... > Is BioPython still using Open Bio's Bugzilla to track bugs? As Michiel said, yes we are. In fact, maybe we should log bugs for several of the recent issues on the mailing list... Peter From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 09:40:52 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 20 Jun 2006 14:40:52 +0100 Subject: [Biopython-dev] Floats and Double in Cluster Message-ID: <4497FAE4.3040705@maubp.freeserve.co.uk> One for Michiel, as the author of Bio.Cluster I've just been building the latest CVS code on Windows with MSVC 6.0 (Microsoft Visual C++) and noticed there are a lot of warnings about double to float conversion (with associated data loss) from Bio/Cluster/ranlib.c (see attached output). Does this matter? I've also tried compiling the same code on Linux which I assume is using the default gcc 4.0.2, and there are no such warnings (even with the compiler option -Wall being specified). I found that I could generally get rid of the warnings by adding explicit (float) cast statements to the problem lines. 
Thanks Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: build.txt Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060620/1edd8a20/attachment-0001.txt From mdehoon at c2b2.columbia.edu Tue Jun 20 12:05:30 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 20 Jun 2006 12:05:30 -0400 Subject: [Biopython-dev] Floats and Double in Cluster In-Reply-To: <4497FAE4.3040705@maubp.freeserve.co.uk> References: <4497FAE4.3040705@maubp.freeserve.co.uk> Message-ID: <44981CCA.2070006@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > One for Michiel, as the author of Bio.Cluster > > I've just been building the latest CVS code on Windows with MSVC 6.0 > (Microsoft Visual C++) and noticed there are a lot of warnings about > double to float conversion (with associated data loss) from > Bio/Cluster/ranlib.c (see attached output). > > Does this matter? > Probably not. The ranlib library is quite old (maybe fifteen years or more). It was originally written in Fortran, and automatically converted into C. Such conversions are usually not so clean, so the resulting code tends to generate many warning messages. On the other hand, ranlib is a part of Numerical Python (the RandomArray module), so if there were a serious problem with it it would have been discovered by now. I did find out recently though that there may be a licensing problem with ranlib. Since we're using only a small part of ranlib in Bio.Cluster, I'm planning to replace it with a different random number generator for the next version. --Michiel. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From chris.lasher at gmail.com Tue Jun 20 12:35:14 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 20 Jun 2006 12:35:14 -0400 Subject: [Biopython-dev] Bugzilla In-Reply-To: <44980870.5070309@maubp.freeserve.co.uk> References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> <44980870.5070309@maubp.freeserve.co.uk> Message-ID: <128a885f0606200935r619424f1jd983ad51eb36d7b@mail.gmail.com> On 6/20/06, Peter wrote: > > Also, I noticed there seem to be quite a few open tickets. > > I've dealt with a few of them... So some of those tickets are not still actually "open"? Chris From chris.lasher at gmail.com Tue Jun 20 12:54:49 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 20 Jun 2006 12:54:49 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> Message-ID: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> Question for the knowledgeable: is this an appropriate realm to write a Martel parser for? 
Chris From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 13:04:25 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jun 2006 18:04:25 +0100 Subject: [Biopython-dev] Clustal unit test Message-ID: <44982A99.9090400@maubp.freeserve.co.uk> Another query Michiel A minor point first of all, adding something like the following to the end of test_Cluster.py makes it easy to run this unit test on its own: if __name__ == "__main__" : run_tests(module = "Bio.Cluster") Secondly, the test works for me with BioPython 1.41, but using today's' CVS the test fails (or at least, sits there at high CPU usage for so long I give up and kill it). This was using Linux. The only changes since BioPython 1.41 are those you you checked in today with this comment: > C Clustering Library version 1.32. Bio.Cluster became > objected-oriented (somewhat). Does the unit test still work for you? Peter From mcolosimo at mitre.org Tue Jun 20 14:32:08 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 20 Jun 2006 14:32:08 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> Message-ID: <6F683BD6-4359-40BF-A7E4-10C97D4032EF@mitre.org> I would say, NO. I think Martel is a Mac Truck and you need a little Toyota pickup. Sure, Martel probably can do this, but it would be overkill, IMHO. Maybe Andrew could chime in on this. Marc On Jun 20, 2006, at 12:54 PM, Chris Lasher wrote: > Question for the knowledgeable: is this an appropriate realm to write > a Martel parser for? 
> > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From mdehoon at c2b2.columbia.edu Tue Jun 20 16:56:22 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 20 Jun 2006 16:56:22 -0400 Subject: [Biopython-dev] Clustal unit test In-Reply-To: <44982A99.9090400@maubp.freeserve.co.uk> References: <44982A99.9090400@maubp.freeserve.co.uk> Message-ID: <449860F6.7040807@c2b2.columbia.edu> Peter wrote: > A minor point first of all, adding something like the following to the > end of test_Cluster.py makes it easy to run this unit test on its own: > > if __name__ == "__main__" : > run_tests(module = "Bio.Cluster") OK I've added this in CVS. > Secondly, the test works for me with BioPython 1.41, but using today's' > CVS the test fails (or at least, sits there at high CPU usage for so > long I give up and kill it). This was using Linux. > I did make an update to Bio.Cluster but in my brilliance forgot to update the test scripts accordingly. CVS now contains updated test_Cluster.py and output/test_Cluster. Let me know if the test still fails with these updated files. Thanks for catching this. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From bsouthey at gmail.com Tue Jun 20 13:47:53 2006 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 20 Jun 2006 12:47:53 -0500 Subject: [Biopython-dev] Floats and Double in Cluster In-Reply-To: <44981CCA.2070006@c2b2.columbia.edu> References: <4497FAE4.3040705@maubp.freeserve.co.uk> <44981CCA.2070006@c2b2.columbia.edu> Message-ID: Hi, Actually just switch to the new numpy where ranlib is no longer used. It uses Mersenne Twister RNG from Jean-Sebastien Roy's random kit. Robert Kern has also written additional code. Of course, that really means moving BioPython to numpy from Numeric. 
Is there a plan for this? Bruce On 6/20/06, Michiel Jan Laurens de Hoon wrote: > Peter (BioPython Dev) wrote: > > One for Michiel, as the author of Bio.Cluster > > > > I've just been building the latest CVS code on Windows with MSVC 6.0 > > (Microsoft Visual C++) and noticed there are a lot of warnings about > > double to float conversion (with associated data loss) from > > Bio/Cluster/ranlib.c (see attached output). > > > > Does this matter? > > > Probably not. The ranlib library is quite old (maybe fifteen years or > more). It was originally written in Fortran, and automatically converted > into C. Such conversions are usually not so clean, so the resulting code > tends to generate many warning messages. On the other hand, ranlib is a > part of Numerical Python (the RandomArray module), so if there were a > serious problem with it it would have been discovered by now. > I did find out recently though that there may be a licensing problem > with ranlib. Since we're using only a small part of ranlib in > Bio.Cluster, I'm planning to replace it with a different random number > generator for the next version. > > --Michiel. > > -- > Michiel de Hoon > Center for Computational Biology and Bioinformatics > Columbia University > 1130 St Nicholas Avenue > New York, NY 10032 > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From mdehoon at c2b2.columbia.edu Wed Jun 21 22:32:58 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Wed, 21 Jun 2006 22:32:58 -0400 Subject: [Biopython-dev] Biopython's XMl parser fails with NCBI blast changed XML output format In-Reply-To: References: Message-ID: <449A015A.4080503@c2b2.columbia.edu> I am not sure if the new XML format is really what NCBI wants it to be, since it does not agree with the Blast documentation. 
I asked NCBI about this; they have forwarded this question to their Blast developers, so hopefully we'll get a definite answer soon. For the time being, I guess the only thing you can do is to download an older version of Blast and run your blast searches locally. Then, the blast XML output will be in the old format, and there should be no problem parsing it with the existing parser in Biopython. --Michiel. Rohini Damle wrote: > Hi, > I am trying to parse the blast output (XML formatted, using online NCBI's > blast) I got as a result for 'short nearly exact matches' for my 50-55 > short > protein sequences. > It looks like the XML format has changed and biopython's XML parser > fails to > parse the blast records. > can somebody show a way to fix this thing? > Thank you > Rohini Damle -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From dcoorna at dbm.ulb.ac.be Mon Jun 26 06:57:25 2006 From: dcoorna at dbm.ulb.ac.be (david coornaert) Date: Mon, 26 Jun 2006 12:57:25 +0200 Subject: [Biopython-dev] NCBI-XML blast parser Message-ID: <449FBD95.5040308@dbm.ulb.ac.be> I'm currently using the Biopython NCBI XML blast output parser. In the CVS code I fetched, I see some comments about the useless nature of Hsp_query_to and Hsp_hit_to. Well, I need those, and can't reliably calculate them simply from hsp_align_len (which is not included either), because I would have to manage the maximum length of the query, hit and alignment strings, then take care of the strand to know whether to increase or decrease. So I've worked out a parallel copy of the parser, but I'd like to know why these are considered useless. Should I commit these harmless changes (hence CVS access)? -- =============================================== David Coornaert [PhD] (dcoorna at dbm.ulb.ac.be) Belgian Embnet Node (http://www.be.embnet.org) Université
Libre de Bruxelles Laboratoire de Bioinformatique 12, Rue des Professeurs Jeener & Brachet 6041 Gosselies BELGIQUE T?l: +3226509975 Fax: +3226509998 =============================================== From biopython-dev at maubp.freeserve.co.uk Mon Jun 26 08:05:28 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jun 2006 13:05:28 +0100 Subject: [Biopython-dev] NCBI-XML blast parser In-Reply-To: <449FBD95.5040308@dbm.ulb.ac.be> References: <449FBD95.5040308@dbm.ulb.ac.be> Message-ID: <449FCD88.6080704@maubp.freeserve.co.uk> david coornaert wrote: > I'm currently using this bio-python ncbiXML blast output parser > >>From the cvs I fetched I see some comments about useless nature of > Hsp_query_to > and Hsp_hit_to > > Well, I need those, and can't for sure calculate it simply from > hsp_align_len (which is not included either) > because I should I manage the max len of query, hit and align string) > then take care of the strand to know wether to increase or decrease, > > So I've worked out a parrallel copy of the parser, > but I'd like to know why are these considered useless ? > Should I commit these harmless changes ? (hence cvs access) > > ??? Hi David Could you file a bug and attach your patch to it please? (Trying to send attachments to the mailing list can be a bit unreliable). Then hopefully some of the group can at least try it out... Out of interest, what version of Blast have you been using? Online or standalone? 
(If you've been following the list, we think the NCBI have changed the format returned for multiple queries in version 2.2.14) Thanks Peter From dcoorna at dbm.ulb.ac.be Mon Jun 26 08:44:04 2006 From: dcoorna at dbm.ulb.ac.be (david coornaert) Date: Mon, 26 Jun 2006 14:44:04 +0200 Subject: [Biopython-dev] NCBI-XML blast parser In-Reply-To: <449FCD88.6080704@maubp.freeserve.co.uk> References: <449FBD95.5040308@dbm.ulb.ac.be> <449FCD88.6080704@maubp.freeserve.co.uk> Message-ID: <449FD694.7030508@dbm.ulb.ac.be> Peter wrote: > Hi David > > Could you file a bug and attach your patch to it please? (Trying to > send attachments to the mailing list can be a bit unreliable). Then > hopefully some of the group can at least try it out... > > Well, I'm not sure about the bug procedure, so here it is already. I'll have a look at the list stuff quite soon and will submit as requested. I wouldn't have qualified that as a bug; I was just wondering why someone would consider these values useless. Sure, you can calculate them, although it would be painful and ... well, since it is already in the XML... I simply added these (in red):

Bio/Blast/NCBIXML.py line 289:

    # No need for Hsp_query_to
    def _end_Hsp_query_to(self):
        """offset of query at the end of the alignment (one-offset)
        """
        self._hsp.query_to = int(self._value)

    def _end_Hsp_hit_from(self):
        """offset of the database at the start of the alignment (one-offset)
        """
        self._hsp.sbjct_start = int(self._value)

    # No need for Hsp_hit_to
    def _end_Hsp_hit_to(self):
        """offset of the database at the end of the alignment (one-offset)
        """
        self._hsp.sbjct_to = int(self._value)

Conversely, a real bug is the mess that is occurring regarding Frame and Strand !!
In a blastn output must appear: Strand = Plus / Plus or Strand = Plus / Minus (and so on), while in a tblastx output must appear: Frame = +3/-1 (and so on); blastx must also present one Frame info. Unfortunately, to find the appropriate strand in a blastn job, you need to address the hsp.frame array, even though there's a hsp.strand array... And all this stuff is useful !! If it is the opposite strand you need to swap query_start and query_to, for example... > Out of interest, what version of Blast have you been using? Online or > standalone? > > Well, I've seen the complaints regarding 2.2.14, hence I stuck with 2.2.13 standalone =;B^) -- =============================================== David Coornaert [PhD] (dcoorna at dbm.ulb.ac.be) Belgian Embnet Node (http://www.be.embnet.org) Université Libre de Bruxelles Laboratoire de Bioinformatique 12, Rue des Professeurs Jeener & Brachet 6041 Gosselies BELGIQUE Tél: +3226509975 Fax: +3226509998 =============================================== From chris.lasher at gmail.com Tue Jun 27 19:33:57 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 27 Jun 2006 19:33:57 -0400 Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers In-Reply-To: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <449F0231.2050308@maubp.freeserve.co.uk> <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com> <44A1B23E.5080007@maubp.freeserve.co.uk> <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com> Message-ID: <128a885f0606271633g6fc66bb1h8173eca5b949a1c5@mail.gmail.com> Oh brother... today's not my day. NOW it's back on BP-Dev... Stupidly yours, Chris On 6/27/06, Chris Lasher wrote: > [Oops! I didn't realize I was posting to the user list! Reverting it > back to BP-Dev] > This code looks very good, Peter!
> > As far as licensing, I'm new to the game, but my guess is the > BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly > preferred for BioPython. You still retain copyright with the license, > but the code is more "free" than under any version of the GPL. > > Chris > > On 6/27/06, Peter wrote: > > Chris Lasher wrote: > > > Hi Peter, > > > > > > Would you be up for licensing your code under the BioPython license? > > > If not, I shouldn't look at it, as I've started coding my own module > > > for the project. From your description, your module sounds very good. > > > =-) > > > > > > Chris > > > > I am quite happy to contribute the code to BioPython under the > > appropriate license, so please go ahead. > > > > I've filed a bug on adding PHYLIP distance parsers to BioPython and > > attached a slightly revised version of the code (added "fuzzy" equality > > testing of matrices - mainly for testing): > > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2034 > > > > If anyone else really wants the code under some other license (GPL > > maybe) I could probably be persuaded.
> > > > Peter > > > > _______________________________________________ > > BioPython mailing list - BioPython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > From mcolosimo at mitre.org Mon Jun 12 12:38:18 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Mon, 12 Jun 2006 08:38:18 -0400 Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers In-Reply-To: <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <9BE2CFC6-BACE-4D98-86A0-99E9CFBA228A@mitre.org> <128a885f0606090859x608e733ela89fdb879e531dc8@mail.gmail.com> <8AC5BAA2-BA47-4772-88C7-DF4B2061A8E2@mitre.org> <128a885f0606091413o23088caesf4934a81f0cc0489@mail.gmail.com> Message-ID: <65DF4A7E-B365-4E61-93D4-156A36F6ED54@mitre.org> [cross-posting to biopython-dev] Chris, Oops, didn't notice this was on the general biopython mailing list. I think many of the developers also subscribe to this list, but just in case I'm cross posting this. Iddo pointed out the Bio.SubsMat, which I didn't know what that module did. One problem with names like that, but the API Docs are helpful only when you look at them (Kuddos for those who add documentation). Given Bio.SubsMat and the BioPerl Module, I would strongly consider combining the Bio.SubsMat and the PhylipDist into a new Bio.Matrix module. From a Phylo module, a function/class can always call the Bio.Matrix classes. Marc On Jun 9, 2006, at 5:13 PM, Chris Lasher wrote: >> I likewise didn't know about the Bio::Matrix::PhylipDist module. >> Personally, I would opt for a Matrix Object (since this is Python a >> OO language) and store it internally as a nested list. That way you >> have the best of both worlds. The next question is the object >> hierarchy. Here I would opt for a top level Matrix class (or module) >> and then subclass that under Phylo. 
So, something like this: >> >> Bio.Matrix >> Bio.Phylo.Matrix > > So is this more appropriate than Bio.Matrix.Phylo? A phylogenetic > matrix is a type of matrix, so that hierarchy is immediately > appealing, however, a phylogenetic matrix is not of much use in and of > itself, so I can see the argument that it should be placed in a > phylogeny package (which we have yet to write but as mentioned > earlier, could be very useful). > >> and maybe things like the following (which isn't used/followed much >> here in BioPython) >> >> Bio.Phylo.IO >> Bio.Phylo.Parsers.PhylipDist >> Bio.Phylo.Parsers.Newick >> Bio.Phylo.Parsers.Nexus >> >> And/or have >> Bio.Phylo.Matrix.IO that uses the PhylipDist parser. > > This is very very good, in my opinion. Thanks for doing the > heavy-lifting of the brainwork on this! =-) > >> The next big question is what should Bio.Phylo.IO return? For >> inspiration, we might want to look at Mesquite > mesquiteproject.org/mesquite/mesquite.html>. > > I must give a better look at this site before commenting, but once > again, thanks for bringing this to my awareness! What a helpful past > couple of emails. I will be out for the weekend but will think more > about this. > > As a sidenote, should this discussion be moved to biopython-dev or is > it fine here? 
> > Thanks again Marc, > Chris > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mcolosimo at mitre.org Mon Jun 12 13:18:41 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Mon, 12 Jun 2006 09:18:41 -0400 Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers In-Reply-To: <448A9A7A.6050501@maubp.freeserve.co.uk> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> Message-ID: [cross post] On Jun 10, 2006, at 6:10 AM, Peter wrote: > Chris Lasher wrote: >> Hi all, Are there any modules in BioPython to parse distance >> matrices? My poking around the BioPython modules and Google searching >> does not turn up any signs indicating there are distance matrix >> parsers, currently. Two particularly useful parsers would be a parser >> for the output of DNADIST/PROTDIST/RESTDIST from PHYLIP >> (http://evolution.genetics.washington.edu/phylip.html), > > I've done a very small amount of work with neighbour joining trees, > using PHYLIP format distance matrices. The closest I could find to a > file format definition was this page: > > http://evolution.genetics.washington.edu/phylip/doc/distance.html > > Points to be aware of: > > In my experience, most software tools usually write the distances as a > full symmetric matrix. However, the "standard" explicitly discusses > lower triangular form (missing out the diagonal distance zero entries) > which has the significant advantage of using about half the disk > space. > This is significant once you get into thousands of taxa. This is still small potatoes compared to the input needed to generate the distance matrixs (especially with DNA/RNA sequences of any decently sized gene). > > So, make sure any parser can cope with both full symmetric, and lower > triangular forms - ideally without the user having to care. 
Phylip does ask you which to either read or write; this is a pain at times. So, having a parser figure this out would be nice. However, the user should know about the choices. > > This also raises the point about how to store the matrix in memory. > Does Numeric/NumPy have an efficient way of storing symmetric > matrices? > This is less flexible than the suggested list of lists, but for > large > datasets would need much less memory. I believe that SciPy (Numeric/NumPy/etc..) is more efficient at storing these things. But you lose that when you want to do pythonish things to it (like write it back out). > > Second point - the "official" PHYLIP distance matrix file format > truncates the taxa names at 10 characters. Some tools (e.g. clustalw) > ignore this limitation and will use as many as needed for the full > name. ClustalW does the CORRECT thing, it truncates the name to 10 characters for Phylip output (alignments). And it does the CORRECT thing for its distance matrix file. In Clustalw's trees.c file void distance_matrix_output(FILE *ofile) fprintf(ofile,"\n%-*s ",max_names,names[i]); /* left justify to the maximum length of names in current alignment file and use a space as a sep */ spaces in names are bad in this case, but phylip is okay with them, since the first 10 characters are the taxon name. > I personally find this much nicer - after all most gene identifiers > (e.g. GI numbers) are eight characters to start with, and if you are > dealing with multiple features in each gene 10 characters is tough > going. > > So, I would make sure you test the parser on this format variant (with > names longer than 10 characters). I can supply some examples if > you like. By definition this isn't a variant of Phylip, but another format. So, one would need two parsers: PhylipDist and Dist (or ClustalDist). 
> > For writing matrices to file, the issue of following the strict 10 > character taxa limit might best be handled as an option (default to > max > 10, with a warning if any names are truncated, and an error if > truncation renders names non-unique?). DON'T give an option of 10 or more. That is NOT the definition of the Phylip file Matrix structure, so why give the option? Make another class that outputs the whole name (ClustalDist). I am pretty sure that Phylip doesn't care about non-unique names so why error out? However, the class should have a means for the user to ask this question. > > Likewise an option to save matrices as either fully symmetric or lower > triangular. I would lean towards using fully symmetric as the default > as it seems to be more common. Phylip's default seems to be a "Square" distance matrix, i.e. fully symmetric. Keep this in mind when naming or documentation. > >> and a parser for the MEGA (http://www.megasoftware.net/mega.html) >> distance matrix format. If not, would there be any interest in >> creating parsers for these matrices, other than my own? I think >> parsers for distance matrices could be very useful to the community. > > I suspect that for serious tree building pure python will not be > competitive with existing C/C++ code on speed - but non-the-less could > be useful. > Well, we do have things like SciPy and PyClustal, which make things more even. 
Marc From biopython-dev at maubp.freeserve.co.uk Mon Jun 12 21:57:36 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Jun 2006 22:57:36 +0100 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> Message-ID: <448DE350.8000403@maubp.freeserve.co.uk> [Send to the Dev list only - forward to the main discussion list if you think best Marc] One general question about the architecture: Are you thinking of having a generic "distance matrix object", and parsers/formats defined for several different file formats? Peter (me) wrote: >>In my experience, most software tools usually write the distances as a >>full symmetric matrix. However, the "standard" explicitly discusses >>lower triangular form (missing out the diagonal distance zero entries) >>which has the significant advantage of using about half the disk >>space. This is significant once you get into thousands of taxa. Marc Colosimo wrote: > This is still small potatoes compared to the input needed to generate > the distance matrices (especially with DNA/RNA sequences of any > decently sized gene). Regarding size of matrix file versus size of alignment file, that isn't always true.

(*) The matrix file size goes as the square of the number of taxa, the alignment file only linearly.

(*) The matrix file is invariant with respect to the length of the sequences/number of columns in the alignment.

(*) The matrix file size goes linearly with the precision (number of decimal places) used.

As you are using "decently sized genes" then you will have large alignment files, but I would imagine you have at most hundreds of genes per alignment - not thousands (?). For my own examples, I have about two thousand domains (not full genes) and the phylip distance matrix file was MUCH bigger than the alignment file.
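To put rough numbers on the scaling argument above, here is a small back-of-the-envelope sketch. The function names are made up for illustration, and the 300-column alignment in the usage note is purely an assumed figure, not taken from anyone's real data.

```python
def square_entries(n_taxa):
    """Number of distances written in a full (square) matrix."""
    return n_taxa * n_taxa


def lower_triangular_entries(n_taxa):
    """Number of distances strictly below the diagonal."""
    return n_taxa * (n_taxa - 1) // 2


def alignment_characters(n_taxa, n_columns):
    """Number of residue characters in an alignment of n_columns."""
    return n_taxa * n_columns
```

For two thousand taxa, square_entries(2000) gives 4,000,000 distances and lower_triangular_entries(2000) gives 1,999,000, whereas an assumed 300-column alignment has only alignment_characters(2000, 300) = 600,000 characters; each distance also takes several characters on disk, so the gap is larger still.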
Peter (me) wrote >>So, make sure any parser can cope with both full symmetric, and lower >>triangular forms - ideally without the user having to care. Marc Colosimo wrote: > Phylip does ask you which to either read or write; this is a pain at > times. So, having a parser figure this out would be nice. However, > the user should know about the choices. Its fairly easy for the parser to cope with either: For each line of input, only use the "lower triangular" portion - just ignore any remaining text which would be present for a full matrix (square) file, or not present for a lower triangular file. Peter wrote: >>This also raises the point about how to store the matrix in memory. >>Does Numeric/NumPy have an efficient way of storing symmetric >>matrices? This is less flexible than the suggested list of lists, >>but for large datasets would need much less memory. Marc Colosimo wrote: > I believe that SciPy (Numeric/NumPy/etc..) is more efficient at > storing these things. But you lose that when you want to do pythonish > things to it (like write it back out). It depends on our target audience. My experience with two thousand taxa means that I am slightly concerned about the memory, and would lean towards storing the data using Numeric/NumPy. This could be done within a nice python object, with methods to write it out again in phylip format etc - so it could still behave "nicely". Peter wrote: >>Second point - the "official" PHYLIP distance matrix file format >>truncates the taxa names at 10 characters. Some tools (e.g. clustalw) >>ignore this limitation and will use as many as needed for the full >>name. Marc Colosimo wrote: > ... > > By definition this isn't a variant of Phylip, but another format. So, > one would need two parsers: PhylipDist and Dist (or ClustalDist). That would be another way of looking at the issue, sure. 
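The "only use the lower triangular portion" approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed final API: parse_phylip_distances is a hypothetical name, each taxon is assumed to sit on a single line (no wrapped rows), and names are assumed to occupy the first 10 characters of the line.

```python
def parse_phylip_distances(lines, name_width=10):
    """Parse square or lower-triangular PHYLIP-style distances.

    Assumes one taxon per line (no wrapped rows) and names padded to
    `name_width` characters. Returns (names, matrix), where matrix is
    a full symmetric nested list.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    n_taxa = int(lines[0].split()[0])
    names = []
    matrix = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i, line in enumerate(lines[1:n_taxa + 1]):
        names.append(line[:name_width].strip())
        values = line[name_width:].split()
        # Use only the first i values (the lower triangle); anything
        # after that is present in square files and simply ignored.
        for j in range(i):
            matrix[i][j] = matrix[j][i] = float(values[j])
    return names, matrix
```

Because the diagonal and upper triangle are never read, the same function accepts both layouts without the caller saying which one it is. It does not record which layout it saw; that would need an extra return value or attribute.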
[See below] Peter wrote: >>For writing matrices to file, the issue of following the strict 10 >>character taxa limit might best be handled as an option (default to >>max 10, with a warning if any names are truncated, and an error if >>truncation renders names non-unique?). Marc Colosimo wrote: > DON'T give an option of 10 or more. That is NOT the definition of the > Phylip file Matrix structure, so why give the option? Make another > class that outputs the whole name (ClustalDist). I like clustal's "long name variant of Phylip distance format", as for my datasets my gene/domain names are longer than 10 characters. I may well be in a minority here (for now). I suppose if would be "good practice" to follow the official (but not overly precise) phylip definition on this issue. So your idea of defining two similar formats would resolve this. In terms of implementation, one could probably just subclass the other to reduce the amount of duplicated code. > I am pretty sure that Phylip doesn't care about non-unique names so > why error out? However, the class should have a means for the user to > ask this question. Because the (truncated) taxa names are going to be used as tree node names by any tree building program, they really should be unique. I would expect any tree program to throw an error in this case, which is why I suggested we should try not to create such files in the first place. Peter wrote: >>Likewise an option to save matrices as either fully symmetric or lower >>triangular. I would lean towards using fully symmetric as the default >>as it seems to be more common. Marc Colosimo wrote: > Phylip's default seems to be a "Square" distance matrix, i.e. fully > symmetric. Keep this in mind when naming or documentation. Good point. 
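As a rough illustration of the "one format subclasses the other" idea above, here is a sketch of two writers. Both class names are hypothetical, the "%.4f" precision and the "%5d" header width are arbitrary choices of mine, and the long-name class only approximates what clustalw writes (names padded to the longest name).

```python
class PhylipDistWriter(object):
    """Hypothetical writer: names truncated/padded to 10 characters."""

    name_width = 10

    def format_name(self, name):
        return name[:self.name_width].ljust(self.name_width)

    def format(self, names, matrix, square=True):
        """Return the matrix as text, square by default."""
        rows = ["%5d" % len(names)]
        for i, name in enumerate(names):
            # Square rows hold every distance; lower-triangular rows
            # hold only the i distances below the diagonal.
            count = len(names) if square else i
            values = " ".join("%.4f" % matrix[i][j] for j in range(count))
            rows.append((self.format_name(name) + " " + values).rstrip())
        return "\n".join(rows) + "\n"


class ClustalDistWriter(PhylipDistWriter):
    """Hypothetical long-name variant: pads to the longest name."""

    def format(self, names, matrix, square=True):
        self.name_width = max(len(name) for name in names)
        return PhylipDistWriter.format(self, names, matrix, square)
```

Only the name width differs between the two, so the subclass stays tiny; a uniqueness check on the truncated names could be added to the base class before writing.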
Peter From mcolosimo at mitre.org Tue Jun 13 15:46:16 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 13 Jun 2006 11:46:16 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <448DE350.8000403@maubp.freeserve.co.uk> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> Message-ID: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> [I've added Chris in case he isn't on the dev-list] On Jun 12, 2006, at 5:57 PM, Peter wrote: > [Sent to the Dev list only - forward to the main discussion list if > you think best Marc] > > One general question about the architecture: Are you thinking of > having a generic "distance matrix object", and parsers/formats > defined for several different file formats? > Yes. I think that is what I am leaning towards. Now, I don't know if I'll be the implementor or not. It has been something on my to-do list for a while. > Peter (me) wrote: >>> In my experience, most software tools usually write the distances >>> as a >>> full symmetric matrix. However, the "standard" explicitly discusses >>> lower triangular form (missing out the diagonal distance zero >>> entries) >>> which has the significant advantage of using about half the disk >>> space. This is significant once you get into thousands of taxa. > Peter wrote: > Marc Colosimo wrote: >> This is still small potatoes compared to the input needed to >> generate the distance matrices (especially with DNA/RNA sequences >> of any decently sized gene). > > Regarding size of matrix file versus size of alignment file, that > isn't always true. > > (*) The matrix file size goes as the square of the number of taxa, > the alignment file only linearly. > > (*) The matrix file is invariant with respect to the length of the > sequences/number of columns in the alignment. > > (*) The matrix file size goes linearly with the precision (number > of decimal places) used.
> > As you are using "decently sized genes" then you will have large > alignment files, but I would imagine you have at most hundreds of > genes per alignment - not thousands (?). > > For my own examples, I have about two thousand domains (not full > genes) and the phylip distance matrix file was MUCH bigger than the > alignment file. You got me on that boundary case. I just wanted to point out that is not always the case. > Peter (me) wrote >>> So, make sure any parser can cope with both full symmetric, and >>> lower >>> triangular forms - ideally without the user having to care. > > Marc Colosimo wrote: >> Phylip does ask you which to either read or write; this is a pain >> at times. So, having a parser figure this out would be nice. >> However, the user should know about the choices. > > It's fairly easy for the parser to cope with either: For each line > of input, only use the "lower triangular" portion - just ignore any > remaining text which would be present for a full matrix (square) > file, or not present for a lower triangular file. It should be fairly easy, but I don't understand why Phylip chokes on square versus lower triangular. Either way, the class should "internally" know what the format read in was, so you can ask it. That way if you muck with it or create a new matrix and want to write that out, you can ask the class what it read in and then have the new one write it out in that format. > > Peter wrote: >>> This also raises the point about how to store the matrix in memory. >>> Does Numeric/NumPy have an efficient way of storing symmetric >>> matrices? This is less flexible than the suggested list of lists, > >>but for large datasets would need much less memory. > > Marc Colosimo wrote: >> I believe that SciPy (Numeric/NumPy/etc..) is more efficient at >> storing these things. But you lose that when you want to do >> pythonish things to it (like write it back out). > > It depends on our target audience.
My experience with two thousand > taxa means that I am slightly concerned about the memory, and would > lean towards storing the data using Numeric/NumPy. This could be > done within a nice python object, with methods to write it out > again in phylip format etc - so it could still behave "nicely". I agree here and think that if the user has Numeric use that, otherwise use built-in types. So, maybe two "hidden" classes that do the correct thing. > > Peter wrote: >>> Second point - the "official" PHYLIP distance matrix file format >>> truncates the taxa names at 10 characters. Some tools (e.g. >>> clustalw) >>> ignore this limitation and will use as many as needed for the >>> full name. > > Marc Colosimo wrote: >> ... >> By definition this isn't a variant of Phylip, but another format. >> So, one would need two parsers: PhylipDist and Dist (or >> ClustalDist). > > That would be another way of looking at the issue, sure. [See below] > > Peter wrote: >>> For writing matrices to file, the issue of following the strict 10 >>> character taxa limit might best be handled as an option (default >>> to max 10, with a warning if any names are truncated, and an >>> error if >>> truncation renders names non-unique?). > > Marc Colosimo wrote: >> DON'T give an option of 10 or more. That is NOT the definition of >> the Phylip file Matrix structure, so why give the option? Make >> another class that outputs the whole name (ClustalDist). > > I like clustal's "long name variant of Phylip distance format", as > for my datasets my gene/domain names are longer than 10 > characters. I may well be in a minority here (for now). > > I suppose if would be "good practice" to follow the official (but > not overly precise) phylip definition on this issue. > > So your idea of defining two similar formats would resolve this. > In terms of implementation, one could probably just subclass the > other to reduce the amount of duplicated code. Correct. subclassing is our friend (to a point). 
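To make the subclassing idea concrete, here is a rough sketch, not a proposed final API. The class names PhylipDistParser and ClustalDistParser and the split_row hook are assumptions for illustration only. It reads either square or lower triangular input by keeping just the first i distances of row i, and it assumes one matrix row per line (wrapped rows are not handled).

```python
class PhylipDistParser:
    """Strict PHYLIP: the taxon name is the first 10 characters of a row."""

    name_width = 10

    def split_row(self, line):
        # Fixed-width name field, then whitespace-separated distances.
        name = line[:self.name_width].strip()
        values = line[self.name_width:].split()
        return name, values

    def parse(self, lines):
        """Return (names, lower) where lower[i] holds the i distances
        from taxon i to taxa 0 .. i-1."""
        lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
        n = int(lines[0].split()[0])
        names, lower = [], []
        for i, line in enumerate(lines[1:n + 1]):
            name, values = self.split_row(line)
            if len(values) < i:
                raise ValueError("row %d has too few distances" % (i + 1))
            names.append(name)
            # Anything past the lower triangle (square files) is ignored.
            lower.append([float(v) for v in values[:i]])
        return names, lower


class ClustalDistParser(PhylipDistParser):
    """Long-name variant: the name is the first whitespace-delimited token."""

    def split_row(self, line):
        fields = line.split()
        return fields[0], fields[1:]
```

Only the name handling differs between the two formats here, so the subclass overrides a single method.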
> >> I am pretty sure that Phylip doesn't care about non-unique names >> so why error out? However, the class should have a means for the >> user to ask this question. > > Because the (truncated) taxa names are going to be used as tree > node names by any tree building program, they really should be > unique. I would expect any tree program to throw an error in this > case, which is why I suggested we should try not to create such > files in the first place. Not exactly. I've been bitten in the butt by the truncation issue several times. I know TreeView X doesn't care about unique names and I think MacClade also doesn't care. Now, PAUP and Mesquite might care or any Nexus type-system which lists the taxon names separately from the taxa in the TREES block (they use numbers for the taxa which get mapped to TAXLABELS in the TAXA block. I believe it depends on how they decided to store these relationships). I guess we have three options here: 1) keep on trucking 2) raise a warning 3) raise an exception - something like Matrix.NonUniqueName exception so that you can specifically except the exception > > Peter wrote: >>> Likewise an option to save matrices as either fully symmetric or >>> lower >>> triangular. I would lean towards using fully symmetric as the >>> default >>> as it seems to be more common. > > Marc Colosimo wrote: >> Phylip's default seems to be a "Square" distance matrix, i.e. >> fully symmetric. Keep this in mind when naming or documentation. > > Good point.
> > Peter > From chris.lasher at gmail.com Wed Jun 14 18:36:00 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Wed, 14 Jun 2006 14:36:00 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> Message-ID: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> > [I've added Chris incase he isn't on the dev-list] Thanks Marc! I actually joined Dev list as the discussion got interesting. Figured we'd move it to here eventually. >> One general question about the architecture: Are you thinking of having a >> generic "distance matrix object", and parsers/formats defined for several >> different file formats? >> > > Yes. I think that is what I am leaning towards. Now, I don't know if I'll > be the implementor or not. It has been something on my to-do list for a > while. BioPython support for these formats with clean, testable code should be the primary task, correct? I can help with this. After we get working code, refactoring for memory management can take place. I haven't done anything along these lines and I'd have to rely on someone else's expertise for this. > In my experience, most software tools usually write the distances as a > full symmetric matrix. However, the "standard" explicitly discusses lower > triangular form (missing out the diagonal distance zero entries) which has > the significant advantage of using about half the disk space. This is > significant once you get into thousands of taxa. I guess we need to consider that storing the matrix as a triangular form will save some memory. However, I've emailed the SciPy/NumPy guys and there is currently no support for a triangular/symmetric matrix; it would have to be a square matrix. See more below. 
>>>> So, make sure any parser can cope with both full symmetric, and lower >>>> triangular forms - ideally without the user having to care. >>> >>> Phylip does ask you which to either read or write; this is a pain at >>> times. So, having a parser figure this out would be nice. However, the >>> user should know about the choices. >> >> Its fairly easy for the parser to cope with either: For each line of >> input, only use the "lower triangular" portion - just ignore any >> remaining text which would be present for a full matrix (square) file, or >> not present for a lower triangular file. Well, we can save a lot of developer time by requiring the user to designate this, with the default being a square matrix. Is it unreasonable to expect the user to know whether his or her matrix is lower/upper-triangular or square? Autodetection seems to add a bit of risk, e.g., either the detection has to be confirmed by the user (in which case, what's the point of auto-detect), or we have to have a really well tested auto-detector, i.e., a lot more developer time. > It should be fairly easy, but I don't understand why Philip chokes on > square versus lower triangular. Either way, the class should "internally" > know what the format read in was, so you can ask it. That way if you muck > with it or create a new matrix and want to write that out, you can ask the > class what it read in and then have the new one write it out in that > format. I think it makes sense for a Phylip triangular matrix and a Phylip square matrix to be represented as the same type of object, for reasons of consistency, as already discussed. As Marc pointed out, its original form can simply be represented by an attribute of the object. It should also be possible to write the matrix back out in either triangular or square format, regardless of its original format. 
These would probably just be methods of the object, such as .to_phylip_square() and .to_phylip_ltriangular() >>>> This also raises the point about how to store the matrix in memory. >>>> Does Numeric/NumPy have an efficient way of storing symmetric matrices? >>>> This is less flexible than the suggested list of lists, but for large >>>> datasets would need much less memory. >>> >>> I believe that SciPy (Numeric/NumPy/etc..) is more efficient at storing >>> these things. But you lose that when you want to do pythonish things to >>> it (like write it back out). >> >> It depends on our target audience. My experience with two thousand taxa >> means that I am slightly concerned about the memory, and would lean >> towards storing the data using Numeric/NumPy. This could be done within >> a nice python object, with methods to write it out again in phylip format >> etc - so it could still behave "nicely". > > I agree here and think that if the user has Numeric use that, otherwise > use built-in types. So, maybe two "hidden" classes that do the correct > thing. This just recently popped up on the NumPy discussion list: http://www.mail-archive.com/numpy-discussion at lists.sourceforge.net/msg00265.html The summary of that is we can memory-map it using numpy.memmap. I've never used this before, so I can't really comment. I'd guess that for small data files, this is overkill. For large sets it might be reasonable. I suppose two separate classes could be available, one for smaller matrices and one for larger. Again, I think the user would be intelligent enough to make the decision as to which to use. Since the class for handling standard (smaller) matrices will be easier to code, I propose writing this standard one first and getting it into BioPython. For this class, I suggest just sticking with a regular nested list, rather than use something from Numeric/NumPy. 
After this class is created and submitted, we can go back and create a class to deal with larger matrices that's a sub-class of the standard one. This way, the API remains the same, regardless of the class, and we will only have to rewrite the methods that need changing due to the way we'll need to interact with the underlying data structure of the wrapped Numeric/NumPy object. How does that sound? >> I like clustal's "long name variant of Phylip distance format", as for my >> datasets my gene/domain names are longer than 10 characters. I may well >> be in a minority here (for now). >> >> I suppose if would be "good practice" to follow the official (but not >> overly precise) phylip definition on this issue. >> >> So your idea of defining two similar formats would resolve this. In >> terms of implementation, one could probably just subclass the other to >> reduce the amount of duplicated code. > > Correct. subclassing is our friend (to a point). > I'm in agreement with using two separate types of objects to represent these two formats. PhylipDist should represent the Phylip spec to the T. I'm not familiar with the Clustal spec; is it formatted similarly, sans the requirement of 10 characters max for the sequence name? An editorial note, I'm very frustrated with Phylip's 10 character limit for sequence names, too. I don't know the reasoning and history behind the decisions on the format; all I know is that it is an uncomfortably restrictive and seemingly arbitrary format. Why it has not been updated is beyond me, unless, like these parsers for BioPython, it's just another project waiting for someone to work on it. >>> I am pretty sure that Phylip doesn't care about non-unique names so why >>> error out? However, the class should have a means for the user to ask >>> this question. >> >> Because the (truncated) taxa names are going to be used as tree node >> names by any tree building program, they really should be unique. 
I >> would expect any tree program to throw an error in this case, which is >> why I suggested we should try not to create such files in the first >> place. > > Not exactly. I've been bitten in the butt by the truncation issue several > times. I know TreeView X doesn't care about unique names and I think > MacClade also doesn't care. Now, PAUP and Mequite might care or any Nexus > type-system which lists the taxon names separately from the taxons in the > TREES block (they use numbers for the taxons which get mapped to TAXLABELS > in the TAXA block. I believe it depends on how they decided to store these > relationships). > > I guess we have three options here: 1) keep on trucking 2) raise a warning > 3) raise an exception - something like Matrix.NonUniqueName exception so > that you can specifically except the exception > I dislike option 1, unless we also provide the user the ability to check for non-unique names, too. Remember the Zen of Python: "Explicit is better than implicit." I like option 3, though I don't know how to make it possible for code outside the parser to catch the exception and tell the parser to continue. We could have it throw the exception by default, but if the user provides a flag in calling the parser, like allow_non_unique=True, we could have logic in the parser that, if True, catch the exception and continue. >> Likewise an option to save matrices as either fully symmetric or lower >> triangular. I would lean towards using fully symmetric as the default as >> it seems to be more common. > > Phylip's default seems to be a "Square" distance matrix, i.e. fully > symmetric. Keep this in mind when naming or documentation. As I mentioned above, the same object would represent both types, and should be equally capable of outputting itself as text in either format. 
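Something along these lines is what I have in mind for "one object, either output format". This is only a sketch: the class name DistanceMatrix, the source_format attribute and the to_source_format() helper are placeholders I am making up here, while the two to_phylip_* method names are the ones floated earlier in this message. Names are cut to 10 characters without any uniqueness check.

```python
class DistanceMatrix:
    """Hypothetical container; lower[i] holds distances to taxa 0 .. i-1."""

    def __init__(self, names, lower, source_format="square"):
        self.names = list(names)
        self.lower = [list(row) for row in lower]
        # Remember how the file looked when it was read in.
        self.source_format = source_format

    def distance(self, i, j):
        if i == j:
            return 0.0
        if i < j:
            i, j = j, i
        return self.lower[i][j]

    def _format(self, rows):
        lines = ["%5d" % len(self.names)]
        for name, row in zip(self.names, rows):
            cells = " ".join("%.4f" % d for d in row)
            # 10-character name field, as in strict PHYLIP.
            lines.append((name[:10].ljust(10) + cells).rstrip())
        return "\n".join(lines) + "\n"

    def to_phylip_square(self):
        n = len(self.names)
        rows = [[self.distance(i, j) for j in range(n)] for i in range(n)]
        return self._format(rows)

    def to_phylip_ltriangular(self):
        return self._format(self.lower)

    def to_source_format(self):
        # Write back in whatever layout the matrix was read from.
        if self.source_format == "ltriangular":
            return self.to_phylip_ltriangular()
        return self.to_phylip_square()
```

Because both writers go through the same lower-triangle storage, a matrix read from a square file can be written as lower triangular and vice versa.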
Chris From gvwilson at cs.utoronto.ca Sun Jun 18 18:15:18 2006 From: gvwilson at cs.utoronto.ca (Greg Wilson) Date: Sun, 18 Jun 2006 14:15:18 -0400 Subject: [Biopython-dev] ann: open source course on basic software development skills Message-ID: http://www.third-bit.com/swc is an open source course on basic software development skills, aimed primarily at people with backgrounds in science, engineering, and medicine who have little formal training in programming, but find themselves doing a lot of it. The course was developed in part through support from the Python Software Foundation; all of the material can be used and modified free of charge (but with attribution). If you have questions, would like to contribute material, or have a success story you'd like to share, please contact Greg Wilson (gvwilson at cs.utoronto.ca). Thanks, Greg From chris.lasher at gmail.com Mon Jun 19 17:49:34 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Mon, 19 Jun 2006 13:49:34 -0400 Subject: [Biopython-dev] Bugzilla Message-ID: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> I noticed that the Bugzilla for BioPython on Open Bio ( http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks the current version number. Also, I noticed there seem to be quite a few open tickets. Is BioPython still using Open Bio's Bugzilla to track bugs? Chris From mdehoon at c2b2.columbia.edu Mon Jun 19 18:27:26 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 19 Jun 2006 14:27:26 -0400 Subject: [Biopython-dev] Bugzilla In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> Message-ID: <4496EC8E.7030603@c2b2.columbia.edu> Chris Lasher wrote: > Is BioPython still using Open Bio's Bugzilla to track bugs? Yes. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 14:38:40 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jun 2006 15:38:40 +0100 Subject: [Biopython-dev] Bugzilla In-Reply-To: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> Message-ID: <44980870.5070309@maubp.freeserve.co.uk> Chris Lasher wrote: > I noticed that the Bugzilla for BioPython on Open Bio ( > http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython ) lacks > the current version number. Good point - who can edit that list? > Also, I noticed there seem to be quite a few open tickets. I've dealt with a few of them... > Is BioPython still using Open Bio's Bugzilla to track bugs? As Michiel said, yes we are. In fact, maybe we should log bugs for several of the recent issues on the mailing list... Peter From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 13:40:52 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 20 Jun 2006 14:40:52 +0100 Subject: [Biopython-dev] Floats and Double in Cluster Message-ID: <4497FAE4.3040705@maubp.freeserve.co.uk> One for Michiel, as the author of Bio.Cluster I've just been building the latest CVS code on Windows with MSVC 6.0 (Microsoft Visual C++) and noticed there are a lot of warnings about double to float conversion (with associated data loss) from Bio/Cluster/ranlib.c (see attached output). Does this matter? I've also tried compiling the same code on Linux which I assume is using the default gcc 4.0.2, and there are no such warnings (even with the compiler option -Wall being specified). I found that I could generally get rid of the warnings by adding explicit (float) cast statements to the problem lines. 
Thanks Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: build.txt URL: From mdehoon at c2b2.columbia.edu Tue Jun 20 16:05:30 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 20 Jun 2006 12:05:30 -0400 Subject: [Biopython-dev] Floats and Double in Cluster In-Reply-To: <4497FAE4.3040705@maubp.freeserve.co.uk> References: <4497FAE4.3040705@maubp.freeserve.co.uk> Message-ID: <44981CCA.2070006@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > One for Michiel, as the author of Bio.Cluster > > I've just been building the latest CVS code on Windows with MSVC 6.0 > (Microsoft Visual C++) and noticed there are a lot of warnings about > double to float conversion (with associated data loss) from > Bio/Cluster/ranlib.c (see attached output). > > Does this matter? > Probably not. The ranlib library is quite old (maybe fifteen years or more). It was originally written in Fortran, and automatically converted into C. Such conversions are usually not so clean, so the resulting code tends to generate many warning messages. On the other hand, ranlib is a part of Numerical Python (the RandomArray module), so if there were a serious problem with it it would have been discovered by now. I did find out recently though that there may be a licensing problem with ranlib. Since we're using only a small part of ranlib in Bio.Cluster, I'm planning to replace it with a different random number generator for the next version. --Michiel. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From chris.lasher at gmail.com Tue Jun 20 16:35:14 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 20 Jun 2006 12:35:14 -0400 Subject: [Biopython-dev] Bugzilla In-Reply-To: <44980870.5070309@maubp.freeserve.co.uk> References: <128a885f0606191049m62a66c4dm88fca1754f58ef0c@mail.gmail.com> <44980870.5070309@maubp.freeserve.co.uk> Message-ID: <128a885f0606200935r619424f1jd983ad51eb36d7b@mail.gmail.com> On 6/20/06, Peter wrote: > > Also, I noticed there seem to be quite a few open tickets. > > I've dealt with a few of them... So some of those tickets are not still actually "open"? Chris From chris.lasher at gmail.com Tue Jun 20 16:54:49 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 20 Jun 2006 12:54:49 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> Message-ID: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> Question for the knowledgeable: is this an appropriate realm to write a Martel parser for? 
Chris From biopython-dev at maubp.freeserve.co.uk Tue Jun 20 17:04:25 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jun 2006 18:04:25 +0100 Subject: [Biopython-dev] Clustal unit test Message-ID: <44982A99.9090400@maubp.freeserve.co.uk> Another query Michiel A minor point first of all, adding something like the following to the end of test_Cluster.py makes it easy to run this unit test on its own:
    if __name__ == "__main__" :
        run_tests(module = "Bio.Cluster")
Secondly, the test works for me with BioPython 1.41, but using today's CVS the test fails (or at least, sits there at high CPU usage for so long I give up and kill it). This was using Linux. The only changes since BioPython 1.41 are those you checked in today with this comment: > C Clustering Library version 1.32. Bio.Cluster became > objected-oriented (somewhat). Does the unit test still work for you? Peter From mcolosimo at mitre.org Tue Jun 20 18:32:08 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 20 Jun 2006 14:32:08 -0400 Subject: [Biopython-dev] Distance Matrix Parsers In-Reply-To: <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com> <448A9A7A.6050501@maubp.freeserve.co.uk> <448DE350.8000403@maubp.freeserve.co.uk> <2FC5D7B7-BDD5-4773-B0E7-82798D7BF585@mitre.org> <128a885f0606141136j2b0df7f8p23da61ffa439b899@mail.gmail.com> <128a885f0606200954m513acbcuc679b47a5729fd89@mail.gmail.com> Message-ID: <6F683BD6-4359-40BF-A7E4-10C97D4032EF@mitre.org> I would say, NO. I think Martel is a Mack truck and you need a little Toyota pickup. Sure, Martel probably can do this, but it would be overkill, IMHO. Maybe Andrew could chime in on this. Marc On Jun 20, 2006, at 12:54 PM, Chris Lasher wrote: > Question for the knowledgeable: is this an appropriate realm to write > a Martel parser for?
> > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From mdehoon at c2b2.columbia.edu Tue Jun 20 20:56:22 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 20 Jun 2006 16:56:22 -0400 Subject: [Biopython-dev] Clustal unit test In-Reply-To: <44982A99.9090400@maubp.freeserve.co.uk> References: <44982A99.9090400@maubp.freeserve.co.uk> Message-ID: <449860F6.7040807@c2b2.columbia.edu> Peter wrote: > A minor point first of all, adding something like the following to the > end of test_Cluster.py makes it easy to run this unit test on its own: > > if __name__ == "__main__" : > run_tests(module = "Bio.Cluster") OK I've added this in CVS. > Secondly, the test works for me with BioPython 1.41, but using today's' > CVS the test fails (or at least, sits there at high CPU usage for so > long I give up and kill it). This was using Linux. > I did make an update to Bio.Cluster but in my brilliance forgot to update the test scripts accordingly. CVS now contains updated test_Cluster.py and output/test_Cluster. Let me know if the test still fails with these updated files. Thanks for catching this. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From bsouthey at gmail.com Tue Jun 20 17:47:53 2006 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 20 Jun 2006 12:47:53 -0500 Subject: [Biopython-dev] Floats and Double in Cluster In-Reply-To: <44981CCA.2070006@c2b2.columbia.edu> References: <4497FAE4.3040705@maubp.freeserve.co.uk> <44981CCA.2070006@c2b2.columbia.edu> Message-ID: Hi, Actually just switch to the new numpy where ranlib is no longer used. It uses Mersenne Twister RNG from Jean-Sebastien Roy's random kit. Robert Kern has also written additional code. Of course, that really means moving BioPython to numpy from Numeric. 
Is there a plan for this? Bruce On 6/20/06, Michiel Jan Laurens de Hoon wrote: > Peter (BioPython Dev) wrote: > > One for Michiel, as the author of Bio.Cluster > > > > I've just been building the latest CVS code on Windows with MSVC 6.0 > > (Microsoft Visual C++) and noticed there are a lot of warnings about > > double to float conversion (with associated data loss) from > > Bio/Cluster/ranlib.c (see attached output). > > > > Does this matter? > > > Probably not. The ranlib library is quite old (maybe fifteen years or > more). It was originally written in Fortran, and automatically converted > into C. Such conversions are usually not so clean, so the resulting code > tends to generate many warning messages. On the other hand, ranlib is a > part of Numerical Python (the RandomArray module), so if there were a > serious problem with it it would have been discovered by now. > I did find out recently though that there may be a licensing problem > with ranlib. Since we're using only a small part of ranlib in > Bio.Cluster, I'm planning to replace it with a different random number > generator for the next version. > > --Michiel. > > -- > Michiel de Hoon > Center for Computational Biology and Bioinformatics > Columbia University > 1130 St Nicholas Avenue > New York, NY 10032 > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From mdehoon at c2b2.columbia.edu Thu Jun 22 02:32:58 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Wed, 21 Jun 2006 22:32:58 -0400 Subject: [Biopython-dev] Biopython's XMl parser fails with NCBI blast changed XML output format In-Reply-To: References: Message-ID: <449A015A.4080503@c2b2.columbia.edu> I am not sure if the new XML format is really what NCBI wants it to be, since it does not agree with the Blast documentation. 
I asked NCBI about this; they have forwarded this question to their Blast developers, so hopefully we'll get a definite answer soon. For the time being, I guess the only thing you can do is to download an older version of Blast and run your blast searches locally. Then, the blast XML output will be in the old format, and there should be no problem parsing them with the existing parser in Biopython. --Michiel. Rohini Damle wrote: > Hi, > I am trying to parse the blast output (XML formatted, using online NCBI's > blast) I got as a result for 'short nearly exact matches' for my 50-55 > short > protein sequences. > It looks like the XML format has changed and biopython's XML parser > fails to > parse the blast records. > can somebody show a way to fix this thing? > Thank you > Rohini Damle -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From dcoorna at dbm.ulb.ac.be Mon Jun 26 10:57:25 2006 From: dcoorna at dbm.ulb.ac.be (david coornaert) Date: Mon, 26 Jun 2006 12:57:25 +0200 Subject: [Biopython-dev] NCBI-XML blast parser Message-ID: <449FBD95.5040308@dbm.ulb.ac.be> I'm currently using this bio-python ncbiXML blast output parser From the CVS I fetched I see some comments about the useless nature of Hsp_query_to and Hsp_hit_to Well, I need those, and can't for sure calculate it simply from hsp_align_len (which is not included either) because I would have to manage the max len of the query, hit and align strings, then take care of the strand to know whether to increase or decrease, So I've worked out a parallel copy of the parser, but I'd like to know why are these considered useless ? Should I commit these harmless changes ? (hence cvs access) ??? -- =============================================== David Coornaert [PhD] (dcoorna at dbm.ulb.ac.be) Belgian Embnet Node (http://www.be.embnet.org) Université
Libre de Bruxelles Laboratoire de Bioinformatique 12, Rue des Professeurs Jeener & Brachet 6041 Gosselies BELGIQUE Tél: +3226509975 Fax: +3226509998 =============================================== From biopython-dev at maubp.freeserve.co.uk Mon Jun 26 12:05:28 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jun 2006 13:05:28 +0100 Subject: [Biopython-dev] NCBI-XML blast parser In-Reply-To: <449FBD95.5040308@dbm.ulb.ac.be> References: <449FBD95.5040308@dbm.ulb.ac.be> Message-ID: <449FCD88.6080704@maubp.freeserve.co.uk> david coornaert wrote: > I'm currently using this bio-python ncbiXML blast output parser > > From the cvs I fetched I see some comments about useless nature of > Hsp_query_to > and Hsp_hit_to > > Well, I need those, and can't for sure calculate it simply from > hsp_align_len (which is not included either) > because I should I manage the max len of query, hit and align string) > then take care of the strand to know wether to increase or decrease, > > So I've worked out a parrallel copy of the parser, > but I'd like to know why are these considered useless ? > Should I commit these harmless changes ? (hence cvs access) > > ??? Hi David Could you file a bug and attach your patch to it please? (Trying to send attachments to the mailing list can be a bit unreliable). Then hopefully some of the group can at least try it out... Out of interest, what version of Blast have you been using? Online or standalone?
(If you've been following the list, we think the NCBI have changed the
format returned for multiple queries in version 2.2.14)

Thanks

Peter

From dcoorna at dbm.ulb.ac.be Mon Jun 26 12:44:04 2006
From: dcoorna at dbm.ulb.ac.be (david coornaert)
Date: Mon, 26 Jun 2006 14:44:04 +0200
Subject: [Biopython-dev] NCBI-XML blast parser
In-Reply-To: <449FCD88.6080704@maubp.freeserve.co.uk>
References: <449FBD95.5040308@dbm.ulb.ac.be>
 <449FCD88.6080704@maubp.freeserve.co.uk>
Message-ID: <449FD694.7030508@dbm.ulb.ac.be>

Peter wrote:
> Hi David
>
> Could you file a bug and attach your patch to it please? (Trying to
> send attachments to the mailing list can be a bit unreliable). Then
> hopefully some of the group can at least try it out...

Well, I'm not sure about the bug procedure, so here it is already. I'll
have a look at the list stuff quite soon and will submit as requested.
I wouldn't have qualified this as a bug; I was just wondering why someone
would consider these values useless. Sure, you can calculate them,
although it would be painful and ... well, since they are already in the
XML...

I simply added these (in red):

Bio/Blast/NCBIXML.py line 289:

    # No need for Hsp_query_to
    def _end_Hsp_query_to(self):
        """offset of query at the end of the alignment (one-offset)
        """
        self._hsp.query_to = int(self._value)

    def _end_Hsp_hit_from(self):
        """offset of the database at the start of the alignment (one-offset)
        """
        self._hsp.sbjct_start = int(self._value)

    # No need for Hsp_hit_to
    def _end_Hsp_hit_to(self):
        """offset of the database at the end of the alignment (one-offset)
        """
        self._hsp.sbjct_to = int(self._value)

Conversely, a real bug is the mess that is occurring regarding Frame and
Strand!!
In a blastn output there must appear:

    Strand = Plus / Plus
or
    Strand = Plus / Minus
(and so on),

while in a tblastx output there must appear:

    Frame = +3/-1
(and so on); blastx must also present one Frame value.

Unfortunately, to find the appropriate strand in a blastn job, you need
to look at the hsp.frame array, even though there's an hsp.strand
array...
And all this stuff is useful!! If it is the opposite strand, you need to
swap query_start and query_to, for example...

> Out of interest, what version of Blast have you been using? Online or
> standalone?

Well, I've seen the complaints regarding 2.2.14, hence I stuck with
2.2.13 standalone =;B^)

--
===============================================
David Coornaert [PhD] (dcoorna at dbm.ulb.ac.be)
Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles
Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041 Gosselies
BELGIQUE
Tél: +3226509975
Fax: +3226509998
===============================================

From chris.lasher at gmail.com Tue Jun 27 23:33:57 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 27 Jun 2006 19:33:57 -0400
Subject: [Biopython-dev] [BioPython] Distance Matrix Parsers
In-Reply-To: <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
References: <128a885f0606081432k7dc9b988rdccbc3be03ca62b6@mail.gmail.com>
 <448A9A7A.6050501@maubp.freeserve.co.uk>
 <449F0231.2050308@maubp.freeserve.co.uk>
 <128a885f0606271434v4d5a40e9x1ceb0037d750f6a1@mail.gmail.com>
 <44A1B23E.5080007@maubp.freeserve.co.uk>
 <128a885f0606271632q2988f2d7y543dd441535f9808@mail.gmail.com>
Message-ID: <128a885f0606271633g6fc66bb1h8173eca5b949a1c5@mail.gmail.com>

Oh brother... today's not my day. NOW it's back on BP-Dev...

Stupidly yours,
Chris

On 6/27/06, Chris Lasher wrote:
> [Oops! I didn't realize I was posting to the user list! Reverting it
> back to BP-Dev]
> This code looks very good, Peter!
>
> As far as licensing, I'm new to the game, but my guess is the
> BioPython license (http://www.biopython.org/DIST/LICENSE ) is highly
> preferred for BioPython. You still retain copyright with the license,
> but the code is more "free" than under any version of the GPL.
>
> Chris
>
> On 6/27/06, Peter wrote:
> > Chris Lasher wrote:
> > > Hi Peter,
> > >
> > > Would you be up for licensing your code under the BioPython license?
> > > If not, I shouldn't look at it, as I've started coding my own module
> > > for the project. From your description, your module sounds very good.
> > > =-)
> > >
> > > Chris
> >
> > I am quite happy to contribute the code to BioPython under the
> > appropriate license, so please go ahead.
> >
> > I've filed a bug on adding PHYLIP distance parsers to BioPython and
> > attached a slightly revised version of the code (added "fuzzy" equality
> > testing of matrices - mainly for testing):
> >
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2034
> >
> > If anyone else really wants the code under some other license (GPL
> > maybe) I could probably be persuaded.
> >
> > Peter
> >
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>
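
[Editorial illustration] Peter mentions adding "fuzzy" equality testing
of matrices to the code attached to bug 2034. That code is not reproduced
in this thread, so what follows is only a minimal sketch of what such a
test might look like, assuming the distance matrix is held as a nested
list of floats (the storage suggested earlier in the thread). The function
name matrices_almost_equal and the tol parameter are hypothetical and are
not taken from the attached patch.

```python
def matrices_almost_equal(a, b, tol=1e-6):
    """Return True if two nested-list matrices agree within tol.

    Hypothetical helper for illustration only; it is not the code
    attached to the bug report mentioned above.
    """
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            # Compare element-wise with an absolute tolerance.
            if abs(x - y) > tol:
                return False
    return True


# Text distance files store values with a limited number of decimal
# places, so a matrix that was written out and parsed back can differ
# slightly from the one held in memory.
original = [[0.0, 0.123456789], [0.123456789, 0.0]]
round_tripped = [[0.0, 0.123457], [0.123457, 0.0]]

print(matrices_almost_equal(original, round_tripped))            # loose tolerance
print(matrices_almost_equal(original, round_tripped, tol=1e-9))  # strict tolerance
```

An absolute tolerance is the simplest choice for round-trip tests like
this; a relative tolerance might suit matrices whose distances span
several orders of magnitude.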