From ngoto at gen-info.osaka-u.ac.jp Thu Apr 1 09:41:27 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 1 Apr 2010 22:41:27 +0900
Subject: [BioRuby] FlatFile GFF
In-Reply-To:
References:
Message-ID: <20100401134130.0FA771CBC585@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Thu, 1 Apr 2010 11:33:27 +1100
Ben Woodcroft wrote:

> Hi,
>
> I have a conceptual question for the list. When I open a gff2 file using
> Bio::FlatFile, the next_entry method gives me all of the lines at once (in
> the form of a Bio::GFF::GFF2 object).
>
> f = Bio::FlatFile.open(Bio::GFF::GFF2,"some.gff2") => Bio::FlatFile
> g = f.next_entry => Bio::GFF::GFF2 object
> g.records => array of GFF2 records
>
> To me, this seems a little counter-intuitive. I expected to get info for a
> single line of the GFF file from FlatFile#next_entry

The design of the Bio::GFF classes was determined by their first
authors. I don't know much about what they had in mind, but I suppose
that, because GFF can have header lines, sequences in Fasta format, and
relation information spanning two or more lines, they thought it would
be easier to gather all the information in a file into a single object.

Because Bio::FlatFile supports many file formats, format-specific
details may sometimes be omitted and "normalized".

> The other problem is that the whole file must be parsed at the beginning,
> and this can cause memory problems when using large GFF files (e.g. the
> current WormBase gff2 is 2.6GB).

To overcome the problem, the Bio::GFF classes may need to be
reorganized. Bio::FlatFile is only a controller with an input buffer,
and format-specific things should be implemented in the format parser
and splitter classes.

For now, as a workaround, use Bio::GFF::GFF2::Record directly without
using Bio::FlatFile.

> To get around the problem I can use File.foreach('some.gff2') and then parse
> each line using Bio::GFF::GFF2. I'm not sure what the situation is with
> other file formats.
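The line-at-a-time workaround discussed above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `each_gff2_record` and the `Gff2Line` struct are hypothetical names, not BioRuby API, and the hand-rolled tab split merely stands in for a real parser (with BioRuby one would presumably hand each line to Bio::GFF::GFF2::Record instead, as suggested above). The point is that only one line is held in memory at a time.

```ruby
require 'stringio'

# Hypothetical stand-in for one parsed GFF2 feature line (not a BioRuby class).
Gff2Line = Struct.new(:seqname, :source, :feature, :start, :stop,
                      :score, :strand, :frame, :attributes)

# Yields one record per feature line of `io`, which may be any object that
# responds to #each_line (an open File, a StringIO, ...).
# Blank lines, '#' comment/header lines and short lines are skipped.
def each_gff2_record(io)
  io.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?('#')
    fields = line.split("\t", 9)
    next if fields.size < 8          # not a complete feature line
    yield Gff2Line.new(fields[0], fields[1], fields[2],
                       Integer(fields[3]), Integer(fields[4]),
                       fields[5], fields[6], fields[7], fields[8])
  end
end

if __FILE__ == $0
  # Small in-memory demonstration; a real file would be opened with File.open.
  sample = "##gff-version 2\nchrI\tcurated\tgene\t100\t200\t.\t+\t.\tGene \"abc\"\n"
  each_gff2_record(StringIO.new(sample)) do |rec|
    puts "#{rec.feature} #{rec.start}..#{rec.stop}"
  end
end
```

For a file on disk, the same method could be driven with `File.open('some.gff2') { |fh| each_gff2_record(fh) { |rec| ... } }`. Header lines, embedded Fasta sequences and multi-line relations are simply ignored here, which is exactly the format-specific information the single-object design keeps.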
>
> So, my question is, could we introduce a foreach method into FlatFile that
> iterates (without parsing all at once so it is light on memory) over the
> GFF/etc entries in the file? Ideally we could change next_entry, but that
> wouldn't be backwards compatible I don't think.

I'm not in favor, because this is basically not a Bio::FlatFile issue
but a Bio::GFF design problem, and modifying only Bio::FlatFile does not
solve it. In addition, the method name would be confusing, because we
already have Bio::FlatFile.foreach and Bio::FlatFile#each.

http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002156 (foreach)
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002168 (each)

I'm thinking of implementing another GFF parser frontend class that can
be specified as a file format.

ff = Bio::FlatFile.open(Bio::GFF::AltParser, "xxx.gff")

Alternatively, we could introduce optional parameters to Bio::FlatFile
that change the parameters passed to the parser and splitter classes
for the format.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From czmasek at burnham.org Fri Apr 2 15:37:14 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 2 Apr 2010 12:37:14 -0700
Subject: [BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation
In-Reply-To:
References: <4BB1387C.6090503@burnham.org>
Message-ID: <4BB6476A.20808@burnham.org>

Hi, Jure:

Indeed, you improved it a lot!

Clearly, you don't _need_ to discuss 'anticipated problems', if you
don't expect any.

I probably would point out that:
"- Extend the algorithm to support non-binary species tree as described
by Vernot in "Reconciliation with non-binary species tree"."
and
"- Extend the algorithm to support non-binary gene trees as described by
Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree
Reconstruction"."
are optional, especially the second one.

Regarding your obligations for the summer, please make sure that you
_only_ cancel them if/after this proposal has been approved and you
have been approved for it. As you know, less than half of all proposals
get accepted in the end, and each proposal has many students applying
for it.

Christian

Jure Triglav wrote:
> Hello all!
>
> Thank you for the thorough review Christian! I took your comments
> seriously and spent the last two days reviewing papers and reading up on
> various subjects related to the proposal. I've done a lot of work and I
> think I have refined it substantially. I hope I have now fully
> elaborated on the problem and proposed time-table. I would like to
> kindly ask you to review it again and point out any remaining issues.
> The only things from your list of requirements for the proposed
> time-table that I have trouble with are the anticipated problems and
> possible alternative approaches, since all of the developed algorithms
> seem rock solid to me (almost all of them have a mathematical proof
> included) and the only possible issues that I can think of are the
> incompatibilities of data structures between various algorithms (which
> we will address in the first week, as most are very similar) and coding
> errors (which we will fix, of course! :).
>
> Regarding my obligations for the summer, I would not hesitate to cancel
> them (or as it is, simply not apply for clinical practice, as I have no
> obligation yet), seeing as you consider them a serious issue. I am very
> motivated to do this project and would like to do everything possible to
> make it happen. Anyway, I can always apply for clinical practice in the
> next term, it really is not an issue.
>
> Best regards to all,
> Jure Triglav
>
> *The idea:*
>
> We would implement the simple and fast duplication inference algorithm
> described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm
> to infer gene duplication and speciation events on a gene tree").
> With several billion nucleotides sequenced daily (Edwards, Hansen and
> Stajich, 2009, "Bioinformatics - Tools and applications"), the
> determination of protein function is mostly done automatically without
> human intervention by finding the most similar sequences that already
> have determined protein function (microevolutionary approach). This way
> of automatically determining protein function neglects additional
> available information in the form of macroevolutionary relationships. By
> inferring these interesting relationships (speciation, duplication) from
> a species and gene tree using a simple algorithm, we can gain a better
> understanding of protein function.
> The importance of determining these evolutionary relationships stems
> from a relatively simple assumption, that if two similar genes are
> thought to be related by speciation, their function is more likely to be
> similar too. On the other hand, if we determine these two genes to be
> related by duplications, their function is more likely to be different,
> as gene duplications are powerful drivers in the evolution of new
> protein function. This is because the second copy of a gene is often
> free of selective pressure and accumulates mutations more rapidly than a
> single copy of a gene.
> The original algorithm proposed by Zmasek and Eddy supports rooted fully
> binary gene and species trees, but we have decided to expand on that
> scope by implementing support for unrooted gene trees (which are
> produced by some bioinformatics methods and thus need to be addressed),
> non-binary species trees (since a lot of species trees are non-binary,
> i.e. 64% of NCBI nodes have more than 2 children (Vernot, 2007)), and
> non-binary gene trees (which are also produced by some bioinformatics
> methods, but represent only uncertainties, as gene trees are inherently
> binary).
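The inference described in the paragraphs above can be sketched for the basic case of rooted, fully binary trees. This is a minimal sketch and not the proposal's or BioRuby's implementation: the `Node` class and the methods `number_preorder`, `leaves_by_name` and `infer_events` are hypothetical names, and it assumes that every external gene tree node is labelled with the name of its species. Each gene tree node is mapped to the lowest species tree node that covers both of its children, and it is called a duplication when that mapping equals the mapping of one of its children.

```ruby
# Minimal tree node; every name in this sketch is illustrative only.
class Node
  attr_reader :name, :children
  attr_accessor :parent, :order, :mapped, :event

  def initialize(name, *children)
    @name = name
    @children = children
    children.each { |child| child.parent = self }
  end

  def leaf?
    children.empty?
  end
end

# Numbers species tree nodes in preorder, so that every node receives a
# larger number than any of its ancestors.
def number_preorder(node, counter = [0])
  node.order = (counter[0] += 1)
  node.children.each { |child| number_preorder(child, counter) }
end

# Builds a species name => external species node lookup table.
def leaves_by_name(node, index = {})
  if node.leaf?
    index[node.name] = node
  else
    node.children.each { |child| leaves_by_name(child, index) }
  end
  index
end

# Postorder pass over a rooted binary gene tree. Sets g.mapped (the species
# tree node M(g)) on every node and g.event on every internal node.
def infer_events(g, species_leaves)
  if g.leaf?
    g.mapped = species_leaves.fetch(g.name)
    return g.mapped
  end
  a, b = g.children.map { |child| infer_events(child, species_leaves) }
  # Climb from whichever node has the larger preorder number until both
  # pointers meet at the last common ancestor in the species tree.
  until a.equal?(b)
    if a.order > b.order
      a = a.parent
    else
      b = b.parent
    end
  end
  g.mapped = a
  duplicated = g.children.any? { |child| child.mapped.equal?(a) }
  g.event = duplicated ? :duplication : :speciation
  a
end
```

A species tree ((human, mouse), fly) would be built as `Node.new('root', Node.new('hm', Node.new('human'), Node.new('mouse')), Node.new('fly'))`, numbered once with `number_preorder`, and then reused for any number of gene trees. Unrooted and non-binary trees are deliberately outside the scope of this sketch.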
>
> *Goals:*
>
> - Implement the algorithm as described by Zmasek and Eddy in "A simple
> algorithm to infer gene duplication and speciation events on a gene
> tree", or SDI, which is designed to work on rooted gene trees and fully
> binary gene and species trees.
> - Allow rooting of unrooted gene trees by minimizing the sum of
> duplications as described by Zmasek and Eddy in "RIO: Analyzing
> proteomes by automated phylogenomics using resampled inference of
> orthologs", thus extending the implementation to support unrooted gene
> trees.
> - Extend the algorithm to support non-binary species tree as described
> by Vernot in "Reconciliation with non-binary species tree".
> - Extend the algorithm to support non-binary gene trees as described by
> Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree
> Reconstruction".
>
> *The work:*
>
> Some terminology:
> - g: a node in the gene tree
> - p(g): the parent of node g in the gene tree
> - s: a node in the species tree
> - M: a mapping function that links nodes of a gene tree to nodes of a
> species tree
> - roofN: an adapted mapping function for non-binary species trees
> - e(p(g),g): the edge between the parent of g and g in the gene tree
> - polytomy: a species tree node with more than 2 children
>
> There are several milestones to be reached in developing this idea and
> this is the work plan I propose:
>
> 1. Development of unit tests with known species and gene trees (1 week).
>
> 2. Making or reusing necessary data structures, made easier by last
> year's GSoC contribution implementing phyloXML in BioRuby (1/2 week - 1
> week):
> - gene tree,
> - species tree,
> - tree node,
> - children(),
> - parent().
>
> 3. Developing checks for the correctness of input data for rooted fully
> binary trees SDI (1/2 week - 1 week):
> - making sure trees are rooted and binary,
> - making sure all species/gene tree nodes have at least one type of
> taxonomic data,
> - making a taxonomy base from a type of data present in all nodes
> (scientific or common name, taxonomy code, id),
> - making sure taxonomic data is unique throughout external nodes.
>
> 4. Implementation of the recursive M function (1 week):
> - traverse the gene tree in postorder (left subtree, right subtree, root),
> - find the nodes where M(parent) equals M(child 1) or M(child 2): such
> a node is a duplication. If M(parent) matches neither, the processed
> node is a speciation.
>
> 5. Milestone: finished implementation of SDI for rooted fully binary
> trees (1/2 week):
> - extensive testing,
> - polishing and writing documentation with RDoc,
> - cleaning up.
>
> 6. Milestone: Implementation of support for unrooted gene trees (1 week):
> - implement an algorithm which roots an unrooted gene tree by exploring
> all possible roots and selecting the one with minimum duplications,
> - calculating M is the most intensive step, so we only do it in full
> once, for one rooted gene tree,
> - by moving the root one node at a time, M does not have to be
> recalculated for every node of the gene tree, but only for two nodes:
>   - the first child of the previous root, if the new root is on a
>   branch of the first child of the previous root,
>   - the second child of the previous root, if the new root is on a
>   branch of the second child of the previous root,
>   - and the new root,
> - by traversing the whole gene tree one node at a time we explore all
> possible root placements and the resulting duplications,
> - from the group of trees with a minimal number of duplications, the
> shortest tree is chosen as the rooted tree,
> - the algorithm for this is written in pseudocode in "RIO: Analyzing
> proteomes by automated phylogenomics using resampled inference of
> orthologs" by Zmasek and Eddy as "Algorithm for speciation duplication
> inference combined with rooting", and needs to be translated to Ruby code.
>
> 7. Milestone: Implementing a duplication/loss inference algorithm for
> non-binary species trees (described by Vernot, 2008) (2 weeks):
> - implement function roofN(g), which returns all roots of subtrees of
> the parent of "s" (s = M(g)) in which descendants of g must be present,
> - if the intersection of roofN(left child of g) and roofN(right child of
> g) is not empty, then g is a required duplication,
> - else if the intersection of roofN(left child of g) and roofN(right
> child of g) is empty, then it is impossible to tell whether the event was
> a duplication or a deep coalescence, and the event is thus called a
> conditional duplication,
> - implement function N(g), which returns all of the children of M(g),
> where descendants of g were present in descendants of each element in N(g),
> - implement a prediction of gene loss events, assuming that minimizing
> gene losses gives the biologically most likely prediction, by taking
> into account the following rules, which minimize explicit losses by
> predicting each loss as close as possible to the root of the gene tree:
>   - Binary duplication loss: if p(g) is a required duplication and
>   M(p(g)) != M(g), then species in (N(p(g)) without roofN(g)) are lost
>   on edge e between p(g) and g.
>   - Skipped species loss: if M(g) != M(p(g)) and p(M(g)) != M(p(g)) then
>   we can infer a loss at every skipped species between M(p(g)) and M(g).
>   - Polytomy duplication loss: if M(p(g)) is a polytomy and p(g) is a
>   required duplication, then species (N(p(g)) without roofN(g)) are lost
>   at e(p(g),g).
>   - Polytomy speciation loss: if M(g) != M(p(g)) and M(g) is a polytomy,
>   then all children of M(g) should have a descendant of g.
> - the algorithm for this is written in pseudocode in "Reconciliation
> with non-binary species tree" by Vernot as "Algorithm 5.1", which has to
> be translated to Ruby code.
>
> 8. Milestone: Implementing support for non-binary gene trees (2 weeks):
> - gene trees are by definition binary, but some methods produce
> uncertainties which result in multifurcating gene trees,
> - we can support non-binary gene trees by expanding multifurcating nodes
> to arbitrary binary trees and then optimizing the generated tree for
> duplications and losses with a previously developed algorithm,
> - this approach is described in "A Hybrid Micro-Macroevolutionary
> Approach to Gene Tree Reconstruction" by Durand, which contains several
> pseudocode algorithms that can be ported to Ruby.
>
> 9. Finishing up (1-2 weeks):
> - extensive testing of all implemented algorithms,
> - polishing and writing documentation using RDoc,
> - cleaning up.
>
> *Why me?:*
>
> I like to set foot on unknown territory and challenge myself constantly.
> I have long searched for something that would connect my love of
> medicine to my love of programming, and now, thanks to GSoC and OBF, I
> think I found it - bioinformatics. I am at a stage of my medical study
> where I have to decide what my future will entail, and I am (now, after
> thinking about it for a long time) positive that bioinformatics will be
> a big part of it. What better way to get my future off to a good start
> than with a Google Summer of Code project? Based on this enthusiasm
> alone you can be assured that I'll work really hard on this project and
> that I will be happy to see it done. As this would be my first serious
> open source engagement, you also have a chance of forming a completely
> new addition to the open source world and making a good contributor out
> of me.
>
> *Previous experience:*
>
> 1. I have been working on a simulation of an analytical chemistry method
> for the past 2 years now, more specifically we have modeled laser
> ablation + inductively coupled plasma mass spectrometry with a simple
> model, which aids our elemental mapping projects.
> For the write-up of this project I was awarded a "Prešernovo
> priznanje" in 2008 (PDF upon request). This work entails several
> interesting components, from basics such as C# development, image
> input and output, multi-threaded programming and UI development, to
> complex themes such as genetic algorithms and neural networks, all of
> which I learned without much hassle as we worked on the project (source
> code upon request). This work is not yet open source, because we are in
> the finalizing stages of the paper and will release the source code
> after publication under an open source license.
>
> 2. I have programmed since I was a child and I have developed a wide
> spectrum of things in my lifetime (from a full CMS in PHP to an IRC
> robot, source code upon request), but I have little experience in fully
> open source projects, which I think so highly of.
>
> *Biography:*
>
> My name is Jure Triglav and I'm a 24 year old medical student from
> Ljubljana, Slovenia. I was born in the small town of Murska Sobota in
> Slovenia, where I went to grade school (graded excellent for all years,
> awarded "Zoisova štipendija" for the gifted, which I still hold) and
> high-school (excellent, finished as "Zlati maturant" in the company of
> about 200 of the best students in the country). I moved to Ljubljana in
> 2004 to study medicine. I am now in the last year of my medical study,
> which I find challenging and very interesting.
> My hobbies are all over the place, from book design to photography, from
> web design to typography, from guitar to poetry, from reading to
> programming, from traveling to sports.
>
> *Other obligations for the summer:*
>
> In question, will be cancelled upon request: I have 5-hour daily
> clinical practice every weekday in June, July and August, which is not
> nearly as serious as it sounds, especially since this is the summer
> rotation which is known for its laid back feel.
> These practices start at 8 am and finish at 1 pm, and for students are
> not really stressful or exhausting at all. I have in the past juggled
> many research obligations with clinical practice and my studies without
> hiccups, but I will not do so this summer and will dedicate 8 hours
> daily to Google Summer of Code, as I realize what a great opportunity
> this is and how much work is required. I have no other work, research
> or vacation obligations for the period of Google Summer of Code.
>
> *Contact information:*
>
> (I will provide additional contact information in the final application)
> Name: Jure Triglav
> E-mail: juretriglav at gmail.com
> IRC handle: x` on #obf-soc, #gsoc

From czmasek at burnham.org Fri Apr 2 18:42:03 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 2 Apr 2010 15:42:03 -0700
Subject: [BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation
In-Reply-To: <249F5CC6-629E-44A1-A161-5A9D76B0DF98@gmail.com>
References: <4BB1387C.6090503@burnham.org> <4BB6476A.20808@burnham.org> <249F5CC6-629E-44A1-A161-5A9D76B0DF98@gmail.com>
Message-ID: <4BB672BB.9070301@burnham.org>

Hi, Jure:

It looks good to me!

Christian

Jure Triglav wrote:
> Thank you Christian!
>
> I am glad that you find it improved and I will now submit this
> application to Google, then fine tune it some more until the deadline.
> Do you have any recommendations as to what could still be improved or
> added?
>
> Yes I agree, the extension to non-binary gene trees is the least defined
> problem of the proposed group of problems, so it is a good idea to
> somehow point that out and make its solution and implementation optional.
>
> And yes, the deadline for applying to clinical practice is a few weeks
> after April 26th, on which day the accepted GSoC proposals will be
> announced, so don't worry about that.
>
> Thank you for your reply and help!
> > Best regards, > Jure Triglav > > On Apr 2, 2010, at 8:37 PM, Christian M Zmasek wrote: > >> Hi, Jure: >> >> Indeed, you improved it a lot! >> >> Clearly, you don't _need_ to discuss 'anticipated problems', if you don't expect any. >> >> I probably would point out that: >> "- Extend the algorithm to support non-binary species tree as described >> by Vernot in "Reconciliation with non-binary species tree"." >> and >> "- Extend the algorithm to support non-binary gene trees as described by >> Durand in "A Hybrid Micro?Macroevolutionary Approach to Gene Tree >> Reconstruction"." >> >> are optional, especially the second one. >> >> Regarding my obligations for the summer, please make sure that you _only_ cancel them if/after this has proposal been approved and you have been approved for it. >> As you know, less than half of all proposals get accepted in the end, and each proposal has many students applying for it. >> >> Christian >> >> >> >> Jure Triglav wrote: >>> Hello all! >>> Thank you for the thorough review Christian! I took your comments seriously and spent the last two days reviewing papers and reading up on various subjects related to the proposal. I've done a lot of work and I think I have refined it substantially. I hope I have now fully elaborated on the problem and proposed time-table. I would like to kindly ask you to review it again and point out any remaining issues. >>> The only thing from your list of requirements for the proposed time-table that I have trouble with are the anticipated problems and possible alternative approaches, since all of the developed algorithms seem rock solid to me (almost all of them have mathematical proof included) and the only possible issue that I can think of, are the incompatibilities of data structures between various algorithms (which we will address in the first week, as most are very similar) and coding errors (which we will fix, of course! :). 
>>> Regarding my obligations for the summer, I would not hesitate to cancel them (or as it is, simply not apply for clinical practice, as I have no obligation yet), seeing as you consider them a serious issue. I am very motivated to do this project and would like to do everything possible to make it happen. Anyway, I can always apply for clinical practice on the next term, it really is not an issue. Best regards to all, >>> Jure Triglav >>> *The idea:* >>> We would implement the simple and fast duplication inference algorithm described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm to infer gene duplication and speciation events on a gene tree". With several billion nucleotides sequenced daily (Edwards, Hansen and Stajich, 2009, "Bioinformatics - Tools and applications"), the determination of protein function is mostly done automatically without human intervention by finding the most similar sequences that already have determined protein function (microevolutionary approach). This way of automatically determining protein function neglects additional available information in the form of macroevolutionary relationships. By inferring these interesting relationships (speciation, duplication) from a species and gene tree using a simple algorithm, we can gain a better understanding of protein function. >>> The importance of determining these evolutionary relationships stems from a relatively simple assumption, that if two similar genes are thought to be related by speciation, their function is more likely to be similar too. On the other hand, if we determine these two genes to be related by duplications, their function is more likely to be different, as gene duplications are powerful drivers in the evolution of new protein function. This is because the second copy of a gene is often free of selective pressure and accumulates mutations more rapidly than a single copy of a gene. 
>>> The original algorithm proposed by Zmasek and Eddy supports rooted fully binary gene and species trees, but we have decided to expand on that scope by implementing support for unrooted gene trees (which are produced by some bioinformatics methods and thus need to be addressed), non-binary species trees (since a lot of species trees are non-binary, i.e. 64% of NCBI nodes have more than 2 children (Vernot, 2007)), and non-binary gene trees (which are also produced by some bioinformatics methods, but represent only uncertainties, as gene trees are inherently binary). >>> *Goals:* >>> ** >>> *- *Implement the algorithm as described by Zmasek and Eddy in "A simple algorithm to infer gene duplication and speciation events on a gene tree", or SDI, which is designed to work on rooted gene trees and fully binary gene and species tree. - Allow rooting of unrooted gene trees by minimizing sum of duplications as described by Zmasek and Eddy in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs", and thus extending the implementation to support unrooted gene trees. >>> - Extend the algorithm to support non-binary species tree as described by Vernot in "Reconciliation with non-binary species tree". >>> - Extend the algorithm to support non-binary gene trees as described by Durand in "A Hybrid Micro?Macroevolutionary Approach to Gene Tree Reconstruction". >>> *The work:* >>> Some terminology: >>> - g: a node in the gene tree >>> - p(g): the parent of node g in the gene tree >>> - s: a node in the species tree >>> - M: a mapping function that links nodes of a gene tree to nodes of a species tree >>> - roofN: an adapted mapping function for non-binary species trees >>> - e(p(g),g): the edge between parent of g and g in the gene tree >>> - polytomy: a species tree node with more than 2 children >>> There are several milestones to be reached in developing this idea and this is the work plan I propose: >>> 1. 
Development of unit tests with known species and gene trees (1 week). >>> 2. Making or reusing necessary data structures, made easier by last years GSoC contribution implementing phyloXML in BioRuby (1/2 weeks - 1 week): >>> - gene tree, >>> - species tree, >>> - tree node, >>> - children(), >>> - parent(). >>> 3. Developing checks for the correctness of input data for rooted fully binary trees SDI (1/2 weeks - 1 week): >>> - making sure trees are rooted and binary, >>> - all species/gene tree nodes have at least on type of taxonomic data. >>> - making a taxonomy base from a type of data present in all nodes (scientific or common name, taxonomy code, id), >>> - making sure taxonomic data is unique throughout external nodes. >>> 4. Implementation of the recursive M function (1 week) >>> - traverse the gene tree in postorder (left subtree, right subtree, root), >>> - finding occurrences where M(parent) equals M(child 1 or 2) - this is representative for finding a duplication. If M(parent) matches neither, the processed node is a speciation. >>> 5. Milestone - finished implementation of SDI for rooted fully binary trees (1/2 week): >>> - Extensive testing, >>> - polishing and writing documentation with RDoc, >>> - cleaning up. >>> 6. Milestone: Implementation of support for unrooted gene trees (1 week): >>> - implement an algorithm which roots an unrooted gene tree by exploring all possible roots and selecting the one with minimum duplications, >>> - calculating M is the most intensive step, so we only do it once for one rooted gene tree, >>> - by moving the root one node at a time, M does not have to be calculated for every node of the gene tree, but only for two nodes: >>> - first child of previous root, if the new root is on a brach of first child of the previous root, >>> - second child of previous root, if the new root is on a branch of second child of the previous root, >>> - and the new root. 
>>> - traversing the whole gene tree one node at a time we explore all possible root placements and resulting duplications, >>> - from a group of trees with a minimal number of duplications, the shortest tree is chosen as the rooted tree, - the algorithm for this is written in pseudocode in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs" by Zmasek and Eddy as "Algorithm for speciation duplication inference combined with rooting", and needs to be translated to Ruby code. >>> 7. Milestone: Implementing an duplication/loss inference algorithm for non-binary species trees (described by Vernot, 2008) (2 weeks): >>> - implement function roofN(g), which returns all roots of subtrees of the parent of "s" (s = M(g)) in which descendants of g must be present, >>> - if the intersection of roofN(left child of g) and roofN(right child of g) is not NULL, then g is a required multiplication, >>> - else if the intersection of roofN(left child of g) and roofN(right child of g) is NULL, then it is impossible to tell whether the event was a duplication or a deep coalescence, the event is thus called a conditional duplication, >>> - implement function N(g), which returns all of the children of M(g), where descendants of g were present in descendants of each element in N(g), - implementing a prediction of gene loss events by assuming that minimizing gene losses gives the biologically most likely prediction by taking into account the following rules, which minimize explicit loses by predicting each loss as close as possible to the root of the gene tree: >>> - Binary duplication loss: if p(g) is a required duplication then, if M(p(g)) != M(g) then species in (N(p(g)) without roofN(g)) are lost on edge e between p(g) and g. >>> - Skipped species loss: if M(g) != M(p(g)) and p(M(g)) != M(p(g)) then we can infer a loss at every skipped species between M(p(g) and M(g). 
>>> - Polytomy duplication loss: if M(p(g) is a polytomy and p(g) is a required duplication, then species (N(p(g)) without roofN(g)) are lost at e(p(g),g). >>> - Polytomy speciation loss: if M(g) != M(p(g) and M(g) is a polytomy, then all children of M(g) should have a descendant of g. >>> - the algorithm for this is written in pseudocode in "Reconciliation with non-binary species tree" by Vernot as "Algorithm 5.1", which has to be translated to Ruby code. >>> 8. Milestone: Implementing support for non-binary gene trees (2 weeks) >>> - gene trees are by definition binary, but some methods produce uncertainties which result in multifurcating gene trees, >>> - we can support non-binary gene trees by expanding multifurcating nodes to arbitrary binary trees and then optimizing the generated tree for duplications and losses with a previously developed algorithm >>> - this approach is described in "A Hybrid Micro?Macroevolutionary Approach to Gene Tree Reconstruction" by Durand, which contains several pseudocode algorithms that can be ported to Ruby. >>> 9. Finishing up (1-2 weeks): >>> - Extensive testing of all implemented algorithms, >>> - polishing and writing documentation using RDoc, >>> - cleaning up. >>> *Why me?:* >>> I like to set foot on unknown territory and challenge myself constantly. I have long searched for something that would connect my love of medicine to my love of programming, and now, thanks to GSoC and OBF, I think I found it - bioinformatics. I am at a stage of my medical study, where I have to decide what my future will entail, and I am (now, after thinking about it for a long time) positive that bioinformatics will be a big part of it. What better way to get future off to a good start, than with a Google Summer of Code project? Based on this enthusiasm alone you can be assured that I'll work really hard on this project and that I will be happy to see it done. 
As this would be my first serious open source engagement, you also have a chance of forming a completely new addition to the open source world and making a good contributor out of me. >>> *Previous experience:* >>> 1. I have been working on a simulation of an analytical chemistry method for the past 2 years now, more specifically we have modeled laser ablation + inductively coupled plasma mass spectrometry with a simple model, which aids our elemental mapping projects. For the write-up of this project I have been awarded with a "Pre?ernovo priznanje" in 2008 (PDF upon request). This work entails several interesting components, from basics such as: C# development, image input, output, multi-threaded programming, UI development; to complex themes such as: genetic algorithms and neural networks. All of which I learned as we worked on the project without much hassle (source code upon request). This work is not yet open source, because we are in the finalizing stages of the paper and will release the source code after publication under an open source license. 2. I have programmed since I was a child and I have developed a wide specter of things in my lifetime (from a full CMS in PHP to an IRC robot, source code upon request), but I have little experience in fully open source projects, which I think so highly of. *Biography:* >>> My name is Jure Triglav and I'm a 24 year old medical student from Ljubljana, Slovenia. I was born in a small town of Murska Sobota in Slovenia, where I went to grade school (graded excellent for all years, awarded "Zoisova ?tipendija" for the gifted, which I still hold) and high-school (excellent, finished as "Zlati maturant" in the company of about 200 best students in the country). I moved to Ljubljana in 2004 to study medicine. I am now in the last year of my medical study which I find challenging and very interesting. 
My hobbies are all over the place, from book design to photography, from web design to typography, from guitar to poetry, from reading to programming, from traveling to sports. *Other obligations for the summer:* >>> In question, will be cancelled upon request: I have 5-hour daily clinical practice every weekday in June, July and August, which is not nearly as serious as it sounds, especially since this is the summer rotation which is known for its laid back feel. These practice sessions start at 8 am and finish at 1 pm, and for students are not really stressful or exhausting at all. I have in the past juggled many research obligations with clinical practice and my studies without hiccups, but I will not do this this summer and will dedicate 8 hours daily to Google Summer of Code, as I realize what a great opportunity this is and how much work is required. I have no other work, research or vacation obligations for the period of Google Summer of Code. >>> *Contact information: * >>> (I will provide additional contact information in the final application) >>> Name: Jure Triglav >>> E-mail: juretriglav at gmail.com >>> IRC handle: x` on #obf-soc, #gsoc > From monika.machunik at gmail.com Mon Apr 5 14:41:36 2010 From: monika.machunik at gmail.com (Monika Machunik) Date: Mon, 5 Apr 2010 21:41:36 +0300 Subject: [BioRuby] GSoC question (regarding SDI algorithm) Message-ID: Hello My name is Monika Machunik and I am planning to apply in this year's Summer of Code. I have read your idea description about "Implementation of algorithm to infer gene duplications in BioRuby", and, although my background does not include any biology, I got quite interested in this project (I could not find mentors' email addresses, so I'm posting it here..). I would like to briefly introduce myself to get your opinion on whether I would be suitable for this project. I have about a year of work experience in Java programming, including some internships and last year's GSoC.
Besides Java I know C++, some C, PHP, HTML, etc. I am not experienced in Ruby programming (at least have seen Ruby code;)), but I learn fast. Currently I am doing my Master's degree in Computer Science, so I have some knowledge about algorithms and data structures. I have never worked at the intersection of biology and CS, but this conjunction has always been intriguing to me. And now my thoughts about possible content of the workload. I have read the abstract of the article and, despite my lack of biological knowledge, I managed to comprehend it;). I think I also should have no problem with understanding the algorithm itself. Apart from implementing the algorithm, the project would involve getting familiar with BioRuby and understanding phyloXML to such an extent as to be able to write an algorithm operating on its ready structures. I am not sure if the algorithm should be implemented inside some existing software, or will it be a kind of standalone algorithm? If it should be accommodated inside some application, the project would probably involve doing that too... ...let it be all for now. Let me know if I have any chances in this project :) Best regards Monika Machunik From czmasek at burnham.org Mon Apr 5 19:17:58 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Mon, 5 Apr 2010 16:17:58 -0700 Subject: [BioRuby] GSOC 2010 preliminary proposal question In-Reply-To: <2CCBF4CA-B351-46C1-A566-14BC0E4F19D6@gmail.com> References: <4BB13F46.7010607@burnham.org> <4BB14149.3060606@burnham.org> <2CCBF4CA-B351-46C1-A566-14BC0E4F19D6@gmail.com> Message-ID: <4BBA6FA6.5080404@burnham.org> Hi, Sara: Your proposal looks good. I would expand the paragraph about yourself, i.e. add more details about your skills and previous programming experience. Christian Sara Rayburn wrote: > Hello Christian, > > Thanks so much for the advice. Here's a pdf of my first draft of the proposal. What do you think?
> > Thanks, > > Sara Rayburn > Center for Advanced Computer Studies > University of Louisiana at Lafayette > sararayburn at gmail.com From czmasek at burnham.org Mon Apr 5 19:24:17 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Mon, 5 Apr 2010 16:24:17 -0700 Subject: [BioRuby] GSoC question (regarding SDI algorithm) In-Reply-To: References: Message-ID: <4BBA7121.5090102@burnham.org> Hi, Monika: Thank you for your interest in this proposal. Please remember that student applications are due by April 9, 19:00 UTC -- so you do not have much time left. I think your lack of experience in Biology is not a problem. The idea is to implement the algorithm with the BioRuby toolkit (http://www.bioruby.org/). Some more advice: If you plan to apply, you need to write a very detailed plan on how you intend to accomplish this project. For each step you should list: 1. Goal/deliverable 2. Approach 3. Time estimation 4. Anticipated problems & possible alternative approaches Like so: A. Prior to coding (from ... to .... ) 1. Familiarize myself with BioRuby, set up GitHub repository 2. ... 3. 1 week 4. Not familiar with git, might need to... B. Week 1 (from ... to .... ) 1. Develop unit tests 2. Using manually created gene and species trees, I plan to... 3. 1 week 4. No problem anticipated Basically you also need to write a short CV, similar to a job application. Hope this helps, Christian Monika Machunik wrote: > Hello > > My name is Monika Machunik and I am planing to apply in this year's Summer > of Code. I have read your idea description about "Implementation of > algorithm to infer gene duplications in BioRuby", and, although my > background does not include any biology, I got quite interested in this > project (I could not find mentors' email addresses, so I'm posting it > here..). > > I would like to shortly introduce myself to get your opinion if I would be > suitable for this project.
> > I have about a year of work experience in Java programming, including some > internships and last year's GSoC. Besides Java I know C++, some C, Php, > HTML, etc. I am not experienced in Ruby programming (at least have seen Ruby > code;)), but I learn fast. Currently I am doing my Master degree in Computer > Science, so I have some knowlegde about algorithms and data structures. I > have never worked at the intersection of biology and CS, but this > conjunction has always been intriguing to me. > > And now my thoughts about possible content of the workload. > > I have read the abstract of the article and, despite of my lack of > biological knowledge, I managed to comprehend it;). I think I also should > have no problem with understanding the algorithm itself. Apart from > implementing the algorithm, the project would involve getting familiar with > BioRuby, understanding phyloXML in such extent to be able to write an > algorithm operating on its ready structures. > > I am not sure if the algorithm should be implemented inside some exisitng > software, or will it be a kind of standalone algorithm? If it should be > accomodated inside some application, the project would probably involve > doing that too... > > ...let it be all for now. Let me know if I have any chances in this project > :) > > Best regards > Monika Machunik > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From kpatil at science.uva.nl Tue Apr 6 11:58:10 2010 From: kpatil at science.uva.nl (K. Patil) Date: Tue, 6 Apr 2010 17:58:10 +0200 (CEST) Subject: [BioRuby] distributing bioruby Message-ID: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl> Hi, I would like to distribute bioruby with my own code. Is it allowed legally or there are some restrictions? 
More specifically I would like to have a subset of bioruby files (especially for file processing) inside my distribution. I would like to know if this is possible. best regards From ngoto at gen-info.osaka-u.ac.jp Tue Apr 6 12:41:19 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 7 Apr 2010 01:41:19 +0900 Subject: [BioRuby] distributing bioruby In-Reply-To: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl> References: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl> Message-ID: <20100406164119.979501CBC557@idnmail.gen-info.osaka-u.ac.jp> Hi, This may depend on the license of your software. See the file COPYING about the license of BioRuby, which is the same as Ruby's. http://github.com/bioruby/bioruby/blob/master/COPYING In addition, some files have different licenses. See the file LEGAL. http://github.com/bioruby/bioruby/blob/master/LEGAL Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 6 Apr 2010 17:58:10 +0200 (CEST) "K. Patil" wrote: > Hi, > > I would like to distribute bioruby with my own code. Is it allowed legally > or there are some restrictions? More specifically I would like to have a > subset of bioruby files (especially for file processing) inside my > distribution. > > I would like to know if this is possible. > > best regards > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From konstantin.s.stepanyuk at gmail.com Wed Apr 7 01:02:07 2010 From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk) Date: Wed, 7 Apr 2010 13:02:07 +0800 Subject: [BioRuby] GSoC project Message-ID: Hi All, My name is Kostya Stepanyuk, I'm an undergraduate student from Novosibirsk State University in Russia and I'm looking forward to participating in the 'Ruby 1.9.2 support of BioRuby' GSoC project.
I already have a background in bioinformatics since I have been participating in the Unipro UGENE (http://ugene.unipro.ru) open-source bioinformatics project for a long time. Also, I highly appreciate the Ruby programming language and I was very glad to learn that there is an open-source Ruby-based bioinformatics project. My motivation in participating in this project is to improve my knowledge of Ruby, to familiarize myself with your great project and to help BioRuby become higher-quality and more popular. I'm looking forward to contributing to your promising project! I'm going to send the full application as soon as possible. Thanks, Kostya. From pjotr.public14 at thebird.nl Wed Apr 7 01:49:30 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 7 Apr 2010 07:49:30 +0200 Subject: [BioRuby] GSoC project In-Reply-To: References: Message-ID: <20100407054930.GA10407@thebird.nl> Hi Konstantin, Not much time left. Leave us enough time to help comment. Pj. On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: > Hi All, > > My name is Kostya Stepanyuk, I'm an undergraduate student from > Novosibirsk State University in Russia and I'm a looking forward to > participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. > > I already have a background in bioinformatics since I have been > participating in Unipro UGENE (http://ugene.unipro.ru) open-source > bioinformatics project for a long time. Also, I highly appreciate Ruby > programming language and I was very glad to get to know that there is > an open-source ruby-based open-source bioinformatics project. > > My motivation in participating in this project is to improve my > knowledge of Ruby, to familiarize myself with your great project and > to help BioRuby become more qualitative and popular. I'm looking > forward to contribute to your promising project! > > I'm going to send the full application as soon as possible. > > Thanks, > Kostya.
From mkikkawa at gmail.com Wed Apr 7 21:21:46 2010 From: mkikkawa at gmail.com (Masahide Kikkawa) Date: Thu, 8 Apr 2010 10:21:46 +0900 Subject: [BioRuby] Bio::MEDLINE, authors Message-ID: Hi, I encountered a bug in Bio::MEDLINE. Here is the code:
======================================
require 'rubygems'
require 'bio'
item = Bio::PubMed.efetch([13016983])
ref = Bio::MEDLINE.new(item[0]).reference
p ref.authors
======================================
The result is: [", V. I. M. T. R. U. P. B", "SCHMIDT-NIELSEN, B."] Expected result is ["VIMTRUP B", "SCHMIDT-NIELSEN B."] Thanks. From konstantin.s.stepanyuk at gmail.com Thu Apr 8 04:55:25 2010 From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk) Date: Thu, 8 Apr 2010 16:55:25 +0800 Subject: [BioRuby] GSoC project In-Reply-To: <20100407054930.GA10407@thebird.nl> References: <20100407054930.GA10407@thebird.nl> Message-ID: Hi Pjotr and folks, here is my proposal written according to the scheme published on OBF GSoC page. It is quite compact since I have not yet dug deeply into the codebase and tests. So I will appreciate any help or suggestions, and I'm looking forward to contributing to your project during the GSoC. Thanks! Kostya. 1. Contact information Full Name: Konstantin Stepanyuk Address: Pirogova str. 20/1, app. 800, Novosibirsk, Zip code: 630090 Russian Federation. E-mail: konstantin.s.stepanyuk at gmail.com Phone: +7 923 247 2424 ICQ: 427601980 2. Motivation and goals. Bioinformatics is one of my primary fields of interest. I already have a solid background in bioinformatics since I have been participating in the Unipro UGENE (http://ugene.unipro.ru) open-source bioinformatics project for two years. My current research area at university includes local sequence alignment and genome assembly.
I highly appreciate the Ruby programming language and I was very glad to learn that there is an open-source Ruby-based bioinformatics project. I believe that cross-version compatibility of BioRuby is an important issue for the project, since the project is quite modern and promising. One of the main tasks in porting BioRuby to Ruby 1.9.2 is improving test coverage, since the project currently has rather few unit tests. It will make us more certain about introducing compatibility & conformance fixes. 3. My skills summary and work experience Programming languages: C++ (3 years), Java, Ruby, Python. Projects: * Unipro UGENE - free and open-source Integrated Bioinformatic Tools (http://ugene.unipro.ru). - Role: C++ and Qt developer for two years (Unipro LLC). - Implemented and tested several algorithms, such as Smith-Waterman local sequence alignment (and its SSE, CUDA and ATI Stream versions). * Apache Harmony - clean-room implementation of J2SE platform (http://harmony.apache.org). - Role: Intern at Intel Corporation - Implemented a tool for aggregating and reporting performance and statistical counters. 4. A project plan. I propose to divide the total work into two big milestones, according to the Google schedule. 1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks) 2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 July - 20 August (total 5 weeks). Each of these chunks of work is divided into several subparts: 1) - Evaluate test coverage (1 week). Consider integration of some tool into the build process to automate test coverage reporting. Create a concrete test plan which will be targeted to improve test coverage up to 90-100% - Write unit-tests according to the plan. Consider creating the stress-test suite. (6-7 weeks) 2) - Elaborate the list of incompatibilities with new version of Ruby (1 week) - Port the codebase (4 weeks) 5. My plans for the summer I plan for the GSoC project to be my primary occupation during the summer.
But I'm going to have a 1 week vacation in July. On 4/7/10, Pjotr Prins wrote: > Hi Konstantin, > > Not much time left. Leave us enough time to help comment. > > Pj. > > On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: >> Hi All, >> >> My name is Kostya Stepanyuk, I'm an undergraduate student from >> Novosibirsk State University in Russia and I'm a looking forward to >> participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. >> >> I already have a background in bioinformatics since I have been >> participating in Unipro UGENE (http://ugene.unipro.ru) open-source >> bioinformatics project for a long time. Also, I highly appreciate Ruby >> programming language and I was very glad to get to know that there is >> an open-source ruby-based open-source bioinformatics project. >> >> My motivation in participating in this project is to improve my >> knowledge of Ruby, to familiarize myself with your great project and >> to help BioRuby become more qualitative and popular. I'm looking >> forward to contribute to your promising project! >> >> I'm going to send the full application as soon as possible. >> >> Thanks, >> Kostya. >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Thu Apr 8 07:48:51 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 8 Apr 2010 20:48:51 +0900 Subject: [BioRuby] GSoC project In-Reply-To: References: <20100407054930.GA10407@thebird.nl> Message-ID: <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp> Hi Konstantin, In the project, Ruby programming skill is very important. Write more about your Ruby programming experiences. In addition, can you show URL to Ruby scripts you wrote? Please improve project plan more. For example: * Preparation. E.g.
to subscribe to ruby-core mailing list to check current status of Ruby 1.9.2, installing Ruby 1.9.2 and 1.8.7, etc. Note that Ruby 1.9.2 is now under feature freeze, and will be released on July 30 ([ruby-core:28665]). You will need to compile Ruby 1.9.2 svn version at least several times. (Optionally, in every week or every day, and to contribute bug fixes to Ruby 1.9.2). * About development environment and tools you will use. * About coverage check tool (rcov?). * Reading changes from Ruby 1.8.7 to 1.9.1 and from 1.9.1 to 1.9.2, and brush up the plan. http://svn.ruby-lang.org/repos/ruby/tags/v1_9_1_0/NEWS http://svn.ruby-lang.org/repos/ruby/trunk/NEWS * Extracting bioruby-1.4.0.tar.gz, looking at lib/ and test/unit (and test/functional), and checking existence of test files and directories corresponding to library main files. For example, you can find that lib/bio/db/genbank exists, but test/unit/bio/db/genbank does not exist. Of course, in some cases, test files exist but their contents are poor. Although it is very difficult, if you can, it is good to estimate the needed efforts. It is also good to prioritize classes/modules to write tests. I think Bio::GenBank and Bio::GenPept are high priority. * ... (Not all will be needed, and not limited to the above.) Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 8 Apr 2010 16:55:25 +0800 Konstantin Stepanyuk wrote: > Hi Pjotr and folks, > > here is my proposal written according to the scheme published on OBF > GSoC page. It is quite compact since I have not buried into the > codebase and tests deeply. So I will appreciate any help or > suggestions, and I'm looking forward to contribute to your project > during the GSoC. > > Thanks! > Kostya. > > 1.Contact information > > Full Name: Konstantin Stepanyuk > > Address: > Pirogova str. 20/1, app. 800, > Novosibirsk, > Zip code: 630090 > Russian Federation.
> > E-mail: konstantin.s.stepanyuk at gmail.com > Phone: +7 923 247 2424 > ICQ: 427601980 > > > 2. Motivation and goals. > Bioinformatics is one of my primary fields of interests. I already > have a solid background in bioinformatics since I have been > participating in Unipro UGENE (http://ugene.unipro.ru) open-source > bioinformatics project for two years. My existing research area in > university includes local sequence alignment and genome assembly. > > I highly appreciate Ruby programming language and I was very glad to > get to know that there is an open-source ruby-based open-source > bioinformatics project. > I believe that cross-version of BioRuby is an important issue for the > project, since the project is quite modern and perspective. The one of > the main tasks in porting BioRuby to version 1.9.2 is improving test > coverage, since currently project has quite little unit tests. It will > make us more certain about introducing compatibility & conformance > fixes. > > > 3. My skills summary and work experience > Programming languages: C++ (3 years), Java, Ruby, Python. > Projects: > * Unipro UGENE - free and open-source Integrated Bioinformatic Tools > (http://ugene.unipro.ru). > - Role: C++ and Qt developer for two years (Unipro LLC). > - Implemented and tested several algorithms, such as Smith-Waterman > local sequence alignment (and its SSE, CUDA and ATI Stream versions). > * Apache Harmony - clean-room implementation of J2SE platform > (http://harmony.apache.org). > - Role: Intern in Intel corporation > - Implemented tool for aggregating and reporting perfomance and > statistical counters. > > > 4. A project plan. > I propose to divide the total work into two big milestones, > accordingly to Google schedule. > > 1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks) > > 2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 > July - 20 August (total 5 weeks). 
> > Each of this chunks of work is divided into several subparts: > > 1) > - Evaluate test coverage (1 week). Consider integration of some tool > to build process to automate test coverage reporting. > Create concrete test plan which will be targeted to improve test > coverage up to 90-100% > - Write unit-tests according to the plan. Consider creating the > stress-test suite. (6-7 weeks) > > 2) > - Elaborate the list of incompatibilities with new version of Ruby (1 week) > - Port the codebase (4 weeks) > > 5. My plans for the summer > I plan that GSoC project will be my primary occupation during the > summer. But I'm going to a have a 1 week vacation in July. > > On 4/7/10, Pjotr Prins wrote: > > Hi Konstantin, > > > > Not much time left. Leave us enough time to help comment. > > > > Pj. > > > > On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: > >> Hi All, > >> > >> My name is Kostya Stepanyuk, I'm an undergraduate student from > >> Novosibirsk State University in Russia and I'm a looking forward to > >> participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. > >> > >> I already have a background in bioinformatics since I have been > >> participating in Unipro UGENE (http://ugene.unipro.ru) open-source > >> bioinformatics project for a long time. Also, I highly appreciate Ruby > >> programming language and I was very glad to get to know that there is > >> an open-source ruby-based open-source bioinformatics project. > >> > >> My motivation in participating in this project is to improve my > >> knowledge of Ruby, to familiarize myself with your great project and > >> to help BioRuby become more qualitative and popular. I'm looking > >> forward to contribute to your promising project! > >> > >> I'm going to send the full application as soon as possible. > >> > >> Thanks, > >> Kostya. 
From donttrustben at gmail.com Thu Apr 8 08:14:42 2010 From: donttrustben at gmail.com (Ben Woodcroft) Date: Thu, 8 Apr 2010 22:14:42 +1000 Subject: [BioRuby] Bio::GO::GeneAssociation issue/fix and new unit test file Message-ID: Hi, I had some problems parsing gene association files using Bio::FlatFile, caused because the parser was attempting to use the split method on a nil. The offending line was @db_reference = tmp[5].split(/\|/) # That seemed easy enough to fix, but then I noticed there weren't any test cases to test my changes against, so I made a new file test/unit/db/test_go.rb, including a simulation of one that was giving me problems. I've collected these changes in a new branch, and you can see the difference using the new github compare interface at http://github.com/wwood/bioruby/compare/36041377db...gene_association Is there any reason that the variables that correspond to arrays in GeneAssociation (@db_reference, @with, @db_object_synonym) are singular names, and not plural? It would be simple to add an alias_method db_references -> db_reference right? I also don't agree that the 'GO:' part of the identifier should be chopped off by default by the goid method - gene association files are not necessarily concerned with GO - there are other ontologies out there as well. I personally never look at GO identifiers without the 'GO:' bit, so I was surprised when I saw that. Sound OK?
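The nil problem described above can be illustrated with a minimal sketch. It is not the patch from the gene_association branch: `split_pipe_field` and `AssociationSketch` are hypothetical names invented for illustration, and the sample lines are made-up data. The sketch assumes that a missing column should simply become an empty list before splitting on '|', and it shows a plural alias next to the singular reader:

```ruby
# Hypothetical helper, not part of BioRuby: split a pipe-delimited column,
# treating a missing (nil) column as an empty list.
def split_pipe_field(field)
  field.to_s.split('|')
end

# Hypothetical stand-in for a gene association record (illustration only).
class AssociationSketch
  attr_reader :db_reference
  # Plural alias alongside the existing singular name.
  alias_method :db_references, :db_reference

  def initialize(line)
    tmp = line.chomp.split("\t")
    # tmp[5] is nil when the line has fewer than six columns.
    @db_reference = split_pipe_field(tmp[5])
  end
end

# Made-up line with only five tab-separated columns.
short = AssociationSketch.new("SGD\tS000000001\tGENE1\t\tGO:0005737")
p short.db_references

# Made-up line whose sixth column holds two references.
full = AssociationSketch.new("SGD\tS000000001\tGENE1\t\tGO:0005737\tPMID:1|PMID:2")
p full.db_reference
```

Whether an absent column should map to an empty array or to nil is a design choice for the maintainers; the sketch only shows one option.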
Thanks, ben From czmasek at burnham.org Thu Apr 8 17:39:44 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Thu, 8 Apr 2010 14:39:44 -0700 Subject: [BioRuby] GSoC question (regarding SDI algorithm) In-Reply-To: References: <4BBA7121.5090102@burnham.org> Message-ID: <4BBE4D20.4050706@burnham.org> Hi, Monika: Remember, the deadline is tomorrow! > Hello > > I think I should first explicitly ask this question: I am not > experienced in Ruby, but I believe I can improve myself enough during > the Community Bonding period. Does it make me ineligible to apply? No. Part of the goals for GSoC is for students to "learn new things". > If not, please continue with reading: ;) > > > Basically you also need to write a short CV, similar to a job > application. > > Where should I later submit this CV? Should include only education / > work experience, or also something like a cover letter? It will all be part of your application (i.e. one document). You need to write an "abstract" which can be considered a cover letter. > > And a question about the algorithm itself - would it need to be > accomodated in some application, or just be a separate BioRuby library? It would be part of BioRuby. > > Develop unit tests > > I hope this is not stupid question, but where is it possible to get the > following information about a gene tree: which nodes should receive > which annotations about 'duplication' and 'specialization', for the > duplication inference to be considered correct? I mean, as a > non-biologist I do not know what should be the correct output of the > algorithm... You should read (or at least have a look at) some of the references listed here: http://evogsoc2010.wordpress.com/2010/03/25/references-for-gene-duplications-proposal/ You can also have a look at my PhD thesis, which explains some of the background, especially chapter 1.3.2.1.
See: ftp://selab.janelia.org/pub/publications/Zmasek02/Zmasek02-phdthesis.pdf Furthermore, I can easily provide you with test gene trees which have duplications assigned. This is not a big issue. > > Regards > Monika Machunik > > > 2010/4/6 Christian M Zmasek > > > Hi, Monika: > > Thank you for you interest in this proposal. > Please remember that student applications are due by April 9, 19:00 > UTC -- so, you have not much time left. > > I think your lack of experience in Biology is not a problem. > > The idea is to implement the algorithm with the BioRuby toolkit > (http://www.bioruby.org/). > > Some more advice: > > If you plan to apply, you need to write a very detailed plan on how > you intend to accomplish this project. > > For each step you should list: > 1. Goal/deliverable > 2. Approach > 3. Time estimation > 4. Anticipated problems & possible alternative approaches > > Like so: > > A. Prior to coding (from ... to .... ) > 1. Familiarize myself with BioRuby, set up git hub repository > 2. ... > 3. 1 week > 4. Not familiar with git, might need to... > > B. Week 1 (from ... to .... ) > 1. Develop unit tests > 2. Using manually created gene and species trees, I plan to... > 3. 1 week > 4. No problem anticipated > > > Basically you also need to write a short CV, similar to a job > application. > > Hope this helps, > > Christian > > > Monika Machunik wrote: > > Hello > > My name is Monika Machunik and I am planing to apply in this > year's Summer > of Code. I have read your idea description about "Implementation of > algorithm to infer gene duplications in BioRuby", and, although my > background does not include any biology, I got quite interested > in this > project (I could not find mentors' email addresses, so I'm > posting it > here..). > > I would like to shortly introduce myself to get your opinion if > I would be > suitable for this project. 
> > I have about a year of work experience in Java programming, > including some > internships and last year's GSoC. Besides Java I know C++, some > C, Php, > HTML, etc. I am not experienced in Ruby programming (at least > have seen Ruby > code;)), but I learn fast. Currently I am doing my Master degree > in Computer > Science, so I have some knowlegde about algorithms and data > structures. I > have never worked at the intersection of biology and CS, but this > conjunction has always been intriguing to me. > > And now my thoughts about possible content of the workload. > > I have read the abstract of the article and, despite of my lack of > biological knowledge, I managed to comprehend it;). I think I > also should > have no problem with understanding the algorithm itself. Apart from > implementing the algorithm, the project would involve getting > familiar with > BioRuby, understanding phyloXML in such extent to be able to > write an > algorithm operating on its ready structures. > > I am not sure if the algorithm should be implemented inside some > exisitng > software, or will it be a kind of standalone algorithm? If it > should be > accomodated inside some application, the project would probably > involve > doing that too... > > ...let it be all for now. Let me know if I have any chances in > this project > :) > > Best regards > Monika Machunik > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > From czmasek at burnham.org Thu Apr 8 17:43:04 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Thu, 8 Apr 2010 14:43:04 -0700 Subject: [BioRuby] GSoC project In-Reply-To: References: <20100407054930.GA10407@thebird.nl> Message-ID: <4BBE4DE8.1090008@burnham.org> Hi, Konstantin: Your project plan is not detailed enough and partially vague (for example, what do you mean by "some tool"?) 
Christian Konstantin Stepanyuk wrote: > Hi Pjotr and folks, > > here is my proposal written according to the scheme published on OBF > GSoC page. It is quite compact since I have not buried into the > codebase and tests deeply. So I will appreciate any help or > suggestions, and I'm looking forward to contribute to your project > during the GSoC. > > Thanks! > Kostya. > > 1.Contact information > > Full Name: Konstantin Stepanyuk > > Address: > Pirogova str. 20/1, app. 800, > Novosibirsk, > Zip code: 630090 > Russian Federation. > > E-mail: konstantin.s.stepanyuk at gmail.com > Phone: +7 923 247 2424 > ICQ: 427601980 > > > 2. Motivation and goals. > Bioinformatics is one of my primary fields of interests. I already > have a solid background in bioinformatics since I have been > participating in Unipro UGENE (http://ugene.unipro.ru) open-source > bioinformatics project for two years. My existing research area in > university includes local sequence alignment and genome assembly. > > I highly appreciate Ruby programming language and I was very glad to > get to know that there is an open-source ruby-based open-source > bioinformatics project. > I believe that cross-version of BioRuby is an important issue for the > project, since the project is quite modern and perspective. The one of > the main tasks in porting BioRuby to version 1.9.2 is improving test > coverage, since currently project has quite little unit tests. It will > make us more certain about introducing compatibility & conformance > fixes. > > > 3. My skills summary and work experience > Programming languages: C++ (3 years), Java, Ruby, Python. > Projects: > * Unipro UGENE - free and open-source Integrated Bioinformatic Tools > (http://ugene.unipro.ru). > - Role: C++ and Qt developer for two years (Unipro LLC). > - Implemented and tested several algorithms, such as Smith-Waterman > local sequence alignment (and its SSE, CUDA and ATI Stream versions). 
> * Apache Harmony - clean-room implementation of J2SE platform > (http://harmony.apache.org). > - Role: Intern in Intel corporation > - Implemented tool for aggregating and reporting perfomance and > statistical counters. > > > 4. A project plan. > I propose to divide the total work into two big milestones, > accordingly to Google schedule. > > 1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks) > > 2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 > July - 20 August (total 5 weeks). > > Each of this chunks of work is divided into several subparts: > > 1) > - Evaluate test coverage (1 week). Consider integration of some tool > to build process to automate test coverage reporting. > Create concrete test plan which will be targeted to improve test > coverage up to 90-100% > - Write unit-tests according to the plan. Consider creating the > stress-test suite. (6-7 weeks) > > 2) > - Elaborate the list of incompatibilities with new version of Ruby (1 week) > - Port the codebase (4 weeks) > > 5. My plans for the summer > I plan that GSoC project will be my primary occupation during the > summer. But I'm going to a have a 1 week vacation in July. > > On 4/7/10, Pjotr Prins wrote: >> Hi Konstantin, >> >> Not much time left. Leave us enough time to help comment. >> >> Pj. >> >> On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: >>> Hi All, >>> >>> My name is Kostya Stepanyuk, I'm an undergraduate student from >>> Novosibirsk State University in Russia and I'm a looking forward to >>> participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. >>> >>> I already have a background in bioinformatics since I have been >>> participating in Unipro UGENE (http://ugene.unipro.ru) open-source >>> bioinformatics project for a long time. Also, I highly appreciate Ruby >>> programming language and I was very glad to get to know that there is >>> an open-source ruby-based open-source bioinformatics project. 
>>> >>> My motivation in participating in this project is to improve my >>> knowledge of Ruby, to familiarize myself with your great project and >>> to help BioRuby become more qualitative and popular. I'm looking >>> forward to contribute to your promising project! >>> >>> I'm going to send the full application as soon as possible. >>> >>> Thanks, >>> Kostya. >>> _______________________________________________ >>> BioRuby Project - http://www.bioruby.org/ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From konstantin.s.stepanyuk at gmail.com Fri Apr 9 01:19:04 2010 From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk) Date: Fri, 9 Apr 2010 12:19:04 +0700 Subject: [BioRuby] GSoC project In-Reply-To: <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp> References: <20100407054930.GA10407@thebird.nl> <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi Naohisa, Thanks for your comments! Updated version of application is below. Some quick comments are inline. > Write more about your Ruby programming experiences. > In addition, can you show URL to Ruby scripts you wrote? I have only basic knowledge of Ruby and its standard library. I wrote more about my Ruby experience in the plan below. Examples of the scripts: - solving the traveling salesman problem: http://paste2.org/p/764145 some handy small tools: - generator of random DNA sequences http://paste2.org/p/764147 - run through file tree and fix some string http://paste2.org/p/764146 > Please improve project plan more. For example: > * Preparation. E.g. 
to subscribe to ruby-core mailing list
> [skip]

I've added this to the plan, but I think I will do all of this during
the Community Bonding period.

> * Extracting bioruby-1.4.0.tar.gz, looking at lib/ and
>   test/unit (and test/functional), and checking existence
>   of test files and directories corresponding to library
>   main files.

I've already checked out the git repository and succeeded in running
the tests.

> Although it is very difficult, but if you can, it is
> good to estimate the needed efforts. It is also good
> to prioritize classes/modules to write tests. I think
> Bio::GenBank and Bio::GenPept are high priority.

I think that estimating the quality and coverage of the current tests
is quite a complex task that will require a lot of time, so I included
this phase in the development plan. I've already played with the tests,
and IMO there is room for improvement.

Thanks,
Kostya.

Updated application:

1. Contact information

Full Name: Konstantin Stepanyuk

Address:
Pirogova str. 20/1, app. 800,
Novosibirsk,
Zip code: 630090
Russian Federation.

E-mail: konstantin.s.stepanyuk at gmail.com
Phone: +7 923 247 2424
ICQ: 427601980

2. Motivation and goals.
Bioinformatics is one of my primary fields of interest. I already
have a solid background in bioinformatics since I have been
participating in the Unipro UGENE (http://ugene.unipro.ru) open-source
bioinformatics project for two years. My current research area at
university includes local sequence alignment and genome assembly.

I highly appreciate the Ruby programming language, and I was very glad
to learn that there is an open-source Ruby-based bioinformatics
project.
I believe that cross-version compatibility of BioRuby is an important
issue for the project, since the project is quite modern and promising.
One of the main tasks in porting BioRuby to Ruby 1.9.2 is improving
test coverage, since the project currently has rather few unit tests.
It will make us more certain about introducing compatibility &
conformance fixes.

3. My skills summary and work experience
Programming languages: C++ (3 years), Java, Ruby, Python.
Ruby experience: basic knowledge. Ability to write simple scripts not
larger than ~200 LoC. I've written some algorithms in Ruby, such as
sorts and Simulated Annealing for the traveling salesman problem,
several networking scripts (simple TCP servers/clients), and handy
'one-liners' for everyday tasks.
Projects:
* Unipro UGENE - free and open-source Integrated Bioinformatic Tools
  (http://ugene.unipro.ru).
  - Role: C++ and Qt developer for two years (Unipro LLC).
  - Implemented and tested several algorithms, such as Smith-Waterman
    local sequence alignment (and its SSE, CUDA and ATI Stream versions).
* Apache Harmony - clean-room implementation of J2SE platform
  (http://harmony.apache.org).
  - Role: Intern in Intel corporation
  - Implemented a tool for aggregating and reporting performance and
    statistical counters.

4. A project plan.
I propose to divide the total work into two big milestones,
according to the Google schedule. The plan also includes a preparation
phase, which will be performed during the Community Bonding time.

0) Preparation:
- Establishing the Ruby environment:
  * Installing the current versions of Ruby (1.8.7, 1.9.1) and checking
    out the Ruby repository to be able to regularly build the newest
    version.
  * Subscribing to the Ruby development mailing list to follow the
    current status of the project.
- Establishing the BioRuby environment:
  * Checking out the BioRuby codebase.
  * Choosing the right tools to work with BioRuby code. The Vim +
    Rakefiles way is surely reliable, but a high-level IDE such as
    JetBrains RubyMine will be considered.

1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks)

2) Porting the project codebase to be compatible with Ruby 1.9.2. 16
July - 20 August (total 5 weeks).
Each of these chunks of work is divided into several subparts:

1)
- Evaluate test coverage (1 week). This includes:
  * prioritizing classes/modules to write tests for.
  * measuring coverage. Rcov is the first candidate to use.
  * integrating the test coverage metrics into the build process will
    be considered.
- Write unit tests according to the plan. Consider creating a
  stress-test suite. (6-7 weeks)

2)
- Elaborate the list of incompatibilities with the new version of Ruby (1 week)
- Port the codebase (4 weeks)

5. My plans for the summer
I plan for the GSoC project to be my primary occupation during the
summer, but I'm going to have a one-week vacation in July.

On Thu, 8 Apr 2010 16:55:25 +0800
Konstantin Stepanyuk wrote:

> Hi Pjotr and folks,
>
> here is my proposal written according to the scheme published on OBF
> GSoC page. It is quite compact since I have not buried into the
> codebase and tests deeply. So I will appreciate any help or
> suggestions, and I'm looking forward to contribute to your project
> during the GSoC.
>
> Thanks!
> Kostya.
>
> 1.Contact information
>
> Full Name: Konstantin Stepanyuk
>
> Address:
> Pirogova str. 20/1, app. 800,
> Novosibirsk,
> Zip code: 630090
> Russian Federation.
>
> E-mail: konstantin.s.stepanyuk at gmail.com
> Phone: +7 923 247 2424
> ICQ: 427601980
>
>
> 2. Motivation and goals.
> Bioinformatics is one of my primary fields of interests. I already
> have a solid background in bioinformatics since I have been
> participating in Unipro UGENE (http://ugene.unipro.ru) open-source
> bioinformatics project for two years. My existing research area in
> university includes local sequence alignment and genome assembly.
>
> I highly appreciate Ruby programming language and I was very glad to
> get to know that there is an open-source ruby-based open-source
> bioinformatics project.
> I believe that cross-version of BioRuby is an important issue for the
> project, since the project is quite modern and perspective.
The one of > the main tasks in porting BioRuby to version 1.9.2 is improving test > coverage, since currently project has quite little unit tests. It will > make us more certain about introducing compatibility & conformance > fixes. > > > 3. My skills summary and work experience > Programming languages: C++ (3 years), Java, Ruby, Python. > Projects: > * Unipro UGENE - free and open-source Integrated Bioinformatic Tools > (http://ugene.unipro.ru). > - Role: C++ and Qt developer for two years (Unipro LLC). > - Implemented and tested several algorithms, such as Smith-Waterman > local sequence alignment (and its SSE, CUDA and ATI Stream versions). > * Apache Harmony - clean-room implementation of J2SE platform > (http://harmony.apache.org). > - Role: Intern in Intel corporation > - Implemented tool for aggregating and reporting perfomance and > statistical counters. > > > 4. A project plan. > I propose to divide the total work into two big milestones, > accordingly to Google schedule. > > 1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks) > > 2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 > July - 20 August (total 5 weeks). > > Each of this chunks of work is divided into several subparts: > > 1) > - Evaluate test coverage (1 week). Consider integration of some tool > to build process to automate test coverage reporting. > Create concrete test plan which will be targeted to improve test > coverage up to 90-100% > - Write unit-tests according to the plan. Consider creating the > stress-test suite. (6-7 weeks) > > 2) > - Elaborate the list of incompatibilities with new version of Ruby (1 week) > - Port the codebase (4 weeks) > > 5. My plans for the summer > I plan that GSoC project will be my primary occupation during the > summer. But I'm going to a have a 1 week vacation in July. > > On 4/7/10, Pjotr Prins wrote: > > Hi Konstantin, > > > > Not much time left. Leave us enough time to help comment. > > > > Pj. 
> > > > On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: > >> Hi All, > >> > >> My name is Kostya Stepanyuk, I'm an undergraduate student from > >> Novosibirsk State University in Russia and I'm a looking forward to > >> participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. > >> > >> I already have a background in bioinformatics since I have been > >> participating in Unipro UGENE (http://ugene.unipro.ru) open-source > >> bioinformatics project for a long time. Also, I highly appreciate Ruby > >> programming language and I was very glad to get to know that there is > >> an open-source ruby-based open-source bioinformatics project. > >> > >> My motivation in participating in this project is to improve my > >> knowledge of Ruby, to familiarize myself with your great project and > >> to help BioRuby become more qualitative and popular. I'm looking > >> forward to contribute to your promising project! > >> > >> I'm going to send the full application as soon as possible. > >> > >> Thanks, > >> Kostya. > >> _______________________________________________ > >> BioRuby Project - http://www.bioruby.org/ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From konstantin.s.stepanyuk at gmail.com Fri Apr 9 01:26:31 2010 From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk) Date: Fri, 9 Apr 2010 12:26:31 +0700 Subject: [BioRuby] GSoC project In-Reply-To: <4BBE4DE8.1090008@burnham.org> References: <20100407054930.GA10407@thebird.nl> <4BBE4DE8.1090008@burnham.org> Message-ID: Hi Christian, Thanks for your comments! I agree that my plan is very rough, but elaborating detailed plan requires lots of work with existing codebase, tests and new Ruby issues. 
So it seems almost impossible for me to write a plan today like

1) writing XXX tests for YYY method of ZZZ class (2 days).
2) fixing XXX issue from the Ruby changelog in classes ZZZ (3 days).

That is why I have set aside time in the overall project plan for
writing detailed testing and porting plans.

By 'some tool' I meant that I currently can't say which one will be
the most suitable. I'm not an expert in the Ruby language and its
tools (but not for long, I hope).

Thanks,
Kostya.

On Fri, Apr 9, 2010 at 4:43 AM, Christian M Zmasek wrote:

> Hi, Konstantin:
>
> Your project plan is not detailed enough and partially vague (for example,
> what do you mean by "some tool"?)
>
> Christian
>
>
>
> Konstantin Stepanyuk wrote:
>
>> Hi Pjotr and folks,
>>
>> here is my proposal written according to the scheme published on OBF
>> GSoC page. It is quite compact since I have not buried into the
>> codebase and tests deeply. So I will appreciate any help or
>> suggestions, and I'm looking forward to contribute to your project
>> during the GSoC.
>>
>> Thanks!
>> Kostya.
>>
>> 1.Contact information
>>
>> Full Name: Konstantin Stepanyuk
>>
>> Address:
>> Pirogova str. 20/1, app. 800,
>> Novosibirsk,
>> Zip code: 630090
>> Russian Federation.
>>
>> E-mail: konstantin.s.stepanyuk at gmail.com
>> Phone: +7 923 247 2424
>> ICQ: 427601980
>>
>>
>> 2. Motivation and goals.
>> Bioinformatics is one of my primary fields of interests. I already
>> have a solid background in bioinformatics since I have been
>> participating in Unipro UGENE (http://ugene.unipro.ru) open-source
>> bioinformatics project for two years. My existing research area in
>> university includes local sequence alignment and genome assembly.
>>
>> I highly appreciate Ruby programming language and I was very glad to
>> get to know that there is an open-source ruby-based open-source
>> bioinformatics project.
>> I believe that cross-version of BioRuby is an important issue for the >> project, since the project is quite modern and perspective. The one of >> the main tasks in porting BioRuby to version 1.9.2 is improving test >> coverage, since currently project has quite little unit tests. It will >> make us more certain about introducing compatibility & conformance >> fixes. >> >> >> 3. My skills summary and work experience >> Programming languages: C++ (3 years), Java, Ruby, Python. >> Projects: >> * Unipro UGENE - free and open-source Integrated Bioinformatic Tools >> (http://ugene.unipro.ru). >> - Role: C++ and Qt developer for two years (Unipro LLC). >> - Implemented and tested several algorithms, such as Smith-Waterman >> local sequence alignment (and its SSE, CUDA and ATI Stream versions). >> * Apache Harmony - clean-room implementation of J2SE platform >> (http://harmony.apache.org). >> - Role: Intern in Intel corporation >> - Implemented tool for aggregating and reporting perfomance and >> statistical counters. >> >> >> 4. A project plan. >> I propose to divide the total work into two big milestones, >> accordingly to Google schedule. >> >> 1) Improving test coverage of the project. 23 May - 16 July (total 8 >> weeks) >> >> 2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 >> July - 20 August (total 5 weeks). >> >> Each of this chunks of work is divided into several subparts: >> >> 1) >> - Evaluate test coverage (1 week). Consider integration of some tool >> to build process to automate test coverage reporting. >> Create concrete test plan which will be targeted to improve test >> coverage up to 90-100% >> - Write unit-tests according to the plan. Consider creating the >> stress-test suite. (6-7 weeks) >> >> 2) >> - Elaborate the list of incompatibilities with new version of Ruby (1 >> week) >> - Port the codebase (4 weeks) >> >> 5. My plans for the summer >> I plan that GSoC project will be my primary occupation during the >> summer. 
But I'm going to a have a 1 week vacation in July. >> >> On 4/7/10, Pjotr Prins wrote: >> >>> Hi Konstantin, >>> >>> Not much time left. Leave us enough time to help comment. >>> >>> Pj. >>> >>> On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote: >>> >>>> Hi All, >>>> >>>> My name is Kostya Stepanyuk, I'm an undergraduate student from >>>> Novosibirsk State University in Russia and I'm a looking forward to >>>> participate in 'Ruby 1.9.2 support of BioRuby' GSoC project. >>>> >>>> I already have a background in bioinformatics since I have been >>>> participating in Unipro UGENE (http://ugene.unipro.ru) open-source >>>> bioinformatics project for a long time. Also, I highly appreciate Ruby >>>> programming language and I was very glad to get to know that there is >>>> an open-source ruby-based open-source bioinformatics project. >>>> >>>> My motivation in participating in this project is to improve my >>>> knowledge of Ruby, to familiarize myself with your great project and >>>> to help BioRuby become more qualitative and popular. I'm looking >>>> forward to contribute to your promising project! >>>> >>>> I'm going to send the full application as soon as possible. >>>> >>>> Thanks, >>>> Kostya. 
>>>> _______________________________________________
>>>> BioRuby Project - http://www.bioruby.org/
>>>> BioRuby mailing list
>>>> BioRuby at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>>
>>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>
>
>

From sararayburn at gmail.com Fri Apr 9 08:39:25 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Fri, 9 Apr 2010 07:39:25 -0500
Subject: [BioRuby] GSOC 2010 Proposal Submitted
Message-ID: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>

Hello Christian,

I have submitted my proposal to implement the SDI algorithms for BioRuby.

Thanks so much for the feedback on my proposal draft. I look forward to
hearing the results later this month.

Regards,
Sara Rayburn
University of Louisiana
sararayburn at gmail.com

From czmasek at burnham.org Fri Apr 9 22:25:58 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 9 Apr 2010 19:25:58 -0700
Subject: [BioRuby] GSOC 2010 Proposal Submitted
In-Reply-To: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>
References: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>
Message-ID: <4BBFE1B6.2090905@burnham.org>

Hi, Sara:

It looks like you submitted your proposal to the wrong organization!

You submitted to NESCent but should have submitted to:
http://socghop.appspot.com/gsoc/org/show/google/gsoc2010/obf

Christian

Sara Rayburn wrote:
> Hello Christian,
>
> I have submitted my proposal to implement the SDI algorithms for BioRuby.
>
> Thanks so much for the feedback on my proposal draft. I look forward to hearing the results later this month.
> > Regards, > Sara Rayburn > University of Louisiana > sararayburn at gmail.com > From rutgeraldo at gmail.com Mon Apr 12 07:25:00 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Mon, 12 Apr 2010 12:25:00 +0100 Subject: [BioRuby] RDF Triples in BioRuby, a funding proposal to Google SoC In-Reply-To: <2bb9b24a1003150527p439c135dm1a164e6a5218835f@mail.gmail.com> References: <2bb9b24a1003100522p68330d6bu3f8e5f3a7f50dd6b@mail.gmail.com> <9081A9B5-611C-45C2-A099-44BAF1E524F4@hgc.jp> <2bb9b24a1003110222h4bd642adv31d1975c9edc0bba@mail.gmail.com> <2bb9b24a1003150527p439c135dm1a164e6a5218835f@mail.gmail.com> Message-ID: Hi all, here's a brief followup: we have received three student applications for this GSoC project. All three look fairly strong. Hopefully we will get funding! Rutger On Mon, Mar 15, 2010 at 1:27 PM, Rutger Vos wrote: > To follow up along more practical lines, I've had to deal with similar > design issues in Bio::Phylo (perl), TreeBASE and Mesquite (both java). > I've learned it makes sense to have: > > - a simple "annotation" object, with getters and setters for the > predicate namespace uri, the predicate string, and the value object > (either a literal or a uri), > > - a get_annotations method for all (fundamental) data objects in the > toolkit that returns a collection of these annotation object > > this way, when you serialize any bioruby object into rdf, you can add > as many other statements about that object as you want. > > Would a refactoring along those lines have a chance of being > acceptable to the bioruby community (of course subsequent to a more > detailed RFC, testing, discussion, proof of concept, etc.)? > > On Thursday, March 11, 2010, Rutger Vos wrote: >> Hi Toshiaki, >> >> great to hear there's already been a lot of discussion over this. 
>> (Well, I'd be surprised if there hadn't been :)) >> >> It looks to me like some fairly major bookkeeping would need to be >> implemented high up in the inheritance tree if *all* bioruby objects >> are to be serialized into RDF. It also would require all of bioruby to >> be ontologized in one fell swoop. >> >> It is perhaps more likely that subdomains are going to be ontologized >> more or less independently from one another (as you mention, >> references->RDF, or in my case phylogenetics->RDF) based implicitly on >> intermediate data formats (pubmed records and nexml, respectively). >> >> That is probably OK, we do things as needs arise. >> >> But what would be handy if the API was at least general enough so that >> this was extensible and we can make additional statements *about* >> objects when we serialize them to RDF. For example, in your pubmed >> turtle file, the subject is always >> . Is there a way, >> programmatically, where I can add additional statements about >> ? >> >> Rutger >> >> On Wed, Mar 10, 2010 at 2:21 PM, Toshiaki Katayama wrote: >>> Hi Rutger, >>> >>> Thank you for your inputs on GSoC 2010! >>> >>>> * is there a way to express triples in BioRuby? >>>> * if there is not, what would be a good design to express triples in >>>> BioRuby so that this would be more useful than just for NeXML? >>> >>> This is what we discussed during the pre-BioHackathon 2010. >>> >>> http://hackathon3.dbcls.jp/wiki/BioRuby >>> >>> My first idea was to make all BioRuby object have common output >>> method to render the object contents in various formats >>> (such as RDF/XML, Turtle, HTML, GFF, FASTA etc. if appropriate). >>> >>> Then, we tried to separate view from logic using erb, but as you >>> see in the above page, it still looks ugly. It is mainly because >>> view formatting itself requires some additional codes, specific >>> to each format. >>> >>> Therefore, we don't have a solid conclusion on this yet, unfortunately. 
>>> >>> Anyway, we already have PubMed to RDF converter written in Ruby as >>> the TogoWS REST API (http://togows.dbcls.jp/site/en/rest.html) at >>> >>> http://togows.dbcls.jp/entry/pubmed/16381885 >>> --> http://togows.dbcls.jp/entry/pubmed/16381885.ttl >>> >>> and, we are also trying to support KEGG to RDF conversion in this >>> framework as well. I think we can put the code in BioRuby when we finished. >>> >>> Your suggestions are welcome. :) >>> >>> Regards, >>> Toshiaki >>> >>> On 2010/03/10, at 22:22, Rutger Vos wrote: >>> >>>> Dear BioRuby-ites, >>>> >>>> my apologies that my first email to this list is so long and >>>> tangential. I am trying to find out how to express RDF triples in >>>> BioRuby. In this email I'm explaining why I care enough to try to get >>>> funding for someone to work on this. If you don't care about any of >>>> this, you can stop reading now. >>>> >>>> The National Evolutionary Synthesis Center (NESCent.org) is planning >>>> to be a mentoring organization for the Google Summer of Code 2010. I >>>> have submitted a project idea to this: to develop NeXML I/O and - >>>> probably more importantly for you - RDF capabilities for BioRuby. If >>>> funded, a student/coder will work on this full time over the summer, >>>> under the shared supervision of Jan Aerts and myself. Here is the >>>> link: http://tinyurl.com/biorubynexml >>>> >>>> NeXML is a data format for phylogenetic data that can be read and >>>> written in perl, python, java and (to some extent) c++ and javascript. >>>> RDF is the cool "new" thing (as per BioHackathon2010), but as far as I >>>> can tell BioRuby isn't completely up to speed for it, yet. >>>> >>>> (As an aside: you might ask yourself why there is something like NeXML >>>> when there is PhyloXML for BioRuby. 
The answer is that NeXML solves a >>>> different problem: PhyloXML started essentially as a next generation >>>> of New Hampshire eXtended (NHX) to meet the annotation needs of >>>> comparative genomics, things such as gene duplications and other >>>> molecular evolution events, on phylogenetic trees; NeXML started as a >>>> complete XML representation of the NEXUS format, providing other >>>> comparative data types such as categorical and continuous character >>>> state matrices, restriction site matrices, and so on, in addition to >>>> trees, taxa, sequence alignments. There is obviously some overlap >>>> between the formats, but I guess that is not unique in bioinformatics >>>> :)) >>>> >>>> NeXML has a semantic annotation facility that uses RDFa. This allows >>>> us to add additional metadata to a fundamental phylogenetic data >>>> object (a tree, taxon, character, etc.) to form a "triple": the >>>> fundamental data object is the triple Subject, and the Predicate and >>>> Object are added as RDFa attributes. Since NeXML can be transformed >>>> using a standard XSL stylesheet to RDF/XML, we can express a limitless >>>> number of statements about phylogenetics. H > > -- > Dr. Rutger A. Vos > School of Biological Sciences > Philip Lyle Building, Level 4 > University of Reading > Reading > RG6 6BX > United Kingdom > Tel: +44 (0) 118 378 7535 > http://www.nexml.org > http://rutgervos.blogspot.com > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From anurag08priyam at gmail.com Wed Apr 14 16:41:12 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 15 Apr 2010 02:11:12 +0530 Subject: [BioRuby] Patch for Bug 18019. Message-ID: Hello all, This is my start at being a part of the BioRuby developer community. 
The RubyForge bug tracking page shows bug 18019 (GenBank each_entry, last
entry is always nil) [1] to be open. I am attaching a patch for it. It's
very tiny. The fix was already suggested in a comment by Raoul Jean Pierre
Bonnal (the submitter of the bug). I have verified the solution and
created a patch for it. Or should I send a pull request on GitHub?

Patch (git format-patch):

>From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
From: Anurag Priyam
Date: Wed, 14 Apr 2010 22:58:45 +0530
Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil

---
 lib/bio/db/genbank/common.rb |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
index 545eac1..eaa760c 100644
--- a/lib/bio/db/genbank/common.rb
+++ b/lib/bio/db/genbank/common.rb
@@ -24,7 +24,7 @@ class NCBIDB
 #
 module Common

-  DELIMITER = RS = "\n//\n"
+  DELIMITER = RS = "\n//\n\n"
   TAGSIZE = 12

   def initialize(entry)
--
1.7.0


[1]
http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037

--
Anurag Priyam
2nd Year, Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From jan.aerts at gmail.com Wed Apr 14 16:44:49 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Wed, 14 Apr 2010 21:44:49 +0100
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References:
Message-ID:

Thanks for that, Anurag. Contributions to bioruby very much appreciated :-)

@Goto-san: can you merge that fix?

Cheers,
jan.

On 14 April 2010 21:41, Anurag Priyam wrote:

> Hello all,
>
> This is my start at being a part of the BioRuby developer community.
>
> The RubyForge bug tracking page shows bug 18019( GenBank each_entry, last
> entry is always nil )[1] to be open. I am attaching a patch for it. Its
> very
> tiny. The fix was already suggested in a comment by Raoul Jean Pierre
> Bonnal( the submitter of the bug ). I have verified the solution and
> created
> a patch for it. Or should I send a pull request on github?
> > Patch( git format-patch ): > > >From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001 > From: Anurag Priyam > Date: Wed, 14 Apr 2010 22:58:45 +0530 > Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil > > --- > lib/bio/db/genbank/common.rb | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb > index 545eac1..eaa760c 100644 > --- a/lib/bio/db/genbank/common.rb > +++ b/lib/bio/db/genbank/common.rb > @@ -24,7 +24,7 @@ class NCBIDB > # > module Common > > - DELIMITER = RS = "\n//\n" > + DELIMITER = RS = "\n//\n\n" > TAGSIZE = 12 > > def initialize(entry) > -- > 1.7.0 > > > [1] > > http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037 > > -- > Anurag Priyam > 2nd Year,Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > From ngoto at gen-info.osaka-u.ac.jp Wed Apr 14 21:34:53 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 15 Apr 2010 10:34:53 +0900 Subject: [BioRuby] Patch for Bug 18019. In-Reply-To: References: Message-ID: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> On Wed, 14 Apr 2010 21:44:49 +0100 Jan Aerts wrote: > Thanks for that, Anurag. Contributions to bioruby very much appreciated :-) > > @Goto-san: can you merge that fix? No, because the patch ignores reading of entries in the middle of the file. To parse files distributed from NCBI, the delimiter should be "\n//\n", and cannot be "\n//\n\n". Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Cheers, > jan. > > On 14 April 2010 21:41, Anurag Priyam wrote: > > > Hello all, > > > > This is my start at being a part of the BioRuby developer community. 
> > > > The RubyForge bug tracking page shows bug 18019( GenBank each_entry, last > > entry is always nil )[1] to be open. I am attaching a patch for it. Its > > very > > tiny. The fix was already suggested in a comment by Raoul Jean Pierre > > Bonnal( the submitter of the bug ). I have verified the solution and > > created > > a patch for it. Or should I send a pull request on github? > > > > Patch( git format-patch ): > > > > >From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001 > > From: Anurag Priyam > > Date: Wed, 14 Apr 2010 22:58:45 +0530 > > Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil > > > > --- > > lib/bio/db/genbank/common.rb | 2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb > > index 545eac1..eaa760c 100644 > > --- a/lib/bio/db/genbank/common.rb > > +++ b/lib/bio/db/genbank/common.rb > > @@ -24,7 +24,7 @@ class NCBIDB > > # > > module Common > > > > - DELIMITER = RS = "\n//\n" > > + DELIMITER = RS = "\n//\n\n" > > TAGSIZE = 12 > > > > def initialize(entry) > > -- > > 1.7.0 > > > > > > [1] > > > > http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037 > > > > -- > > Anurag Priyam > > 2nd Year,Mechanical Engineering, > > IIT Kharagpur. > > +91-9775550642 > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ngoto at gen-info.osaka-u.ac.jp Thu Apr 15 02:32:09 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 15 Apr 2010 15:32:09 +0900 Subject: [BioRuby] Patch for Bug 18019. 
In-Reply-To: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>

Hi Anurag,

Parsing of GenBank files is primarily tested with official
GenBank releases. (But currently there are no unit tests. I hope they
will be added during the GSoC project "Ruby 1.9.2 support of
BioRuby".)

The test is something like:

  # preparation of test data

  % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
  % gzip -dc gbvrt21.seq.gz > gbvrt21.seq

  # Counts the number of entries

  % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
      ff.each { |e| c += 1 }; p c' gbvrt21.seq
  #==> 1991

  # Checks if the number of entries is correct.

  % grep -c '^LOCUS' gbvrt21.seq
  #==> 1991

  # Executes with the monkey patch.
  # Be careful that this takes a very long time and a lot of memory!

  % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
      c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
      ff.each { |e| c += 1 }; p c' gbvrt21.seq
  #==> 1

It is apparent that the patch is wrong.

Splitting entries by using such a delimiter is simple and performs
well, but it can only work with correct data, which should always
end with the delimiter. Characters after the last delimiter in the
file are regarded as a single entry because we don't want to lose data.

The behavior can be changed. For example, when only white space is
read and then the end of file is reached without a delimiter, it could
be ignored and treated as EOF with no entries.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Apr 15 03:26:42 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 15 Apr 2010 16:26:42 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>

Hi Goto-san,

> Splitting entries by using such delimiter is simple and the performance
> is well, but it can only work with correct data which should always be
> ended with the delimiter. Characters after the last delimiter in the
> file is regarded as a single entry because we don't want to lose data.
>
> The behavior can be changed, for example, when getting only white
> spaces and then the end of file without delimiter, it is ignored and
> treated as EOF with no entries.

Because genbank and genpept format file downloaded from NCBI with entrez
usually ends with double new line characters,
the latter behavior is really desired.

$ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
 ff.each { |e| c += 1 }; p c' sequences.gb
#==> 4

I hope it becomes 3, as there are 3 entries.

$ grep LOCUS sequences.gb
LOCUS       A00002       194 bp    DNA     linear   PAT 10-FEB-1993
LOCUS       A00003       194 bp    DNA     linear   PAT 10-FEB-1993
LOCUS       X17276       556 bp    DNA     linear   MAM 26-FEB-1992

Actually this file has an excess newline at the end of each entry.
His patch will work in this case, although it is not right, as you mentioned.

Although no error is reported in this example because we don't do anything
with the entry, accessing the last entry (the fourth in this case) will
cause an error.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From anurag08priyam at gmail.com Thu Apr 15 04:30:42 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 15 Apr 2010 14:00:42 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

To parse the genbank files, ultimately IO#gets(sep_string=$/) is called. I
did a File.read on a small sequence file[1]. The last sequence of characters
is: ".... tctaga\n//\n\n". This shows how Ruby would see the file. Note the
two "\n" at the end. That was my rationale for the patch.

Now, with the current delimiter "\n//\n", what happens is that when we call
gets(delimiter) repeatedly, it returns "\n" as the last entry and nil
thereafter. This "\n" is the root cause of the problem, as it is returned to
Bio::FlatFile#next_entry and Bio::FlatFile#each_entry from either
Bio::Splitter::Default#get_entry or Bio::Splitter::Default#get_parsed_entry.

The checks employed later for the return value include checking for nil (
return nil unless r;; in next_entry ). I think we can include check
conditions for whitespace to avoid this? I believe Goto-san's mail also
implied something on the same line?

[1] http://home.cc.umanitoba.ca/~psgendb/X54090.gen.html
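
The behaviour discussed above can be reproduced without BioRuby. The following is a minimal sketch, assuming only that the splitter ends up calling `gets` with the delimiter as separator, as the message says; the toy record strings and the helper names `split_by_delimiter` and `drop_blank` are made up for illustration and are not part of BioRuby.

```ruby
require 'stringio'

# Toy stand-ins for GenBank text; only the record terminators matter here.
RELEASE_STYLE = "LOCUS A\n//\nLOCUS B\n//\n"      # ends exactly with "\n//\n"
ENTREZ_STYLE  = "LOCUS A\n//\n\nLOCUS B\n//\n\n"  # extra newline after each "//"

# Read chunks by calling gets(sep) repeatedly until it returns nil.
def split_by_delimiter(text, sep)
  io = StringIO.new(text)
  chunks = []
  while (chunk = io.gets(sep))
    chunks << chunk
  end
  chunks
end

# Hypothetical filter: drop chunks that contain nothing but whitespace.
def drop_blank(chunks)
  chunks.reject { |c| c.strip.empty? }
end

p split_by_delimiter(RELEASE_STYLE, "\n//\n").size             # → 2
p split_by_delimiter(ENTREZ_STYLE,  "\n//\n").last             # → "\n"
p split_by_delimiter(RELEASE_STYLE, "\n//\n\n").size           # → 1
p drop_blank(split_by_delimiter(ENTREZ_STYLE, "\n//\n")).size  # → 2
```

The third call shows why changing the delimiter to "\n//\n\n" collapses a release-style file into a single chunk, and the last call shows that ignoring whitespace-only chunks gives the expected count for the Entrez-style text.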

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Thu Apr 15 04:39:28 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 15 Apr 2010 14:09:28 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

>
> Because genbank and genpept format file downloaded from NCBI with entrez
> usually ends with double new line characters,
> the latter behavior is really desired.
>
> $ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"
>
> $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
>  ff.each { |e| c += 1 }; p c' sequences.gb
> #==> 4
> Hope it becomes 3. As there are 3 entries.
> $ grep LOCUS sequences.gb
> LOCUS       A00002       194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       A00003       194 bp    DNA     linear   PAT 10-FEB-1993
> LOCUS       X17276       556 bp    DNA     linear   MAM 26-FEB-1992
>
> Actually this file have an excess newline at each end of entry.
> And his patch will work in this case, despite it is not right as you
> mentioned.
>
> Although in this example no error is reported because we don't do anything
> with the
> entry, accessing the last entry (the fourth in this case) will cause error.
>

As I mentioned in my previous mail, the extra entry is caused by
a "\n". Even the "\n" gets parsed into a Bio::GenBank object.
No errors are raised. Here:

$ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each{ |e| puts e.entry_id};' sequences.gb
A00002
A00003
X17276
nil

$ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each{ |e| puts e.class};' sequences.gb
Bio::GenBank
Bio::GenBank
Bio::GenBank
Bio::GenBank

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Apr 15 05:07:52 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 15 Apr 2010 18:07:52 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi Anurag,

> I believe Goto-san's mail also implied something on the same line?

Goto-san's mail implied that your file is just incorrect, because there is no
such excess newline in the official GenBank releases, and that it is fine for
the bioruby library to return an extra entry containing nil on wrong input.

My opinion is that even if the file is not in exactly the same format as the
GenBank releases, the bioruby library should ignore the excess newline.
My previous mail was to explain why such files are frequently seen, showing
that the NCBI website creates such files, and to give an easy, reasonable-looking
way to obtain such a file reproducibly. (Though the tool and email
parameters were omitted.)

I am sure he knows well why the newline causes that problem, so there is no
need to explain the cause.
The most important issue is the decision of what the right behavior is.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From anurag08priyam at gmail.com Thu Apr 15 05:21:21 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 15 Apr 2010 14:51:21 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

On Thu, Apr 15, 2010 at 2:37 PM, Tomoaki NISHIYAMA <
tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> Hi Anurag,
>
> I believe Goto-san's mail also implied something on the same line?
>
>
> Goto-san's mail implied that your file is just incorrect, because there are
> no
> such excess newline in the official GenBank releases, and bioruby library
> is good to return extra entry containing nil on wrong input.
>

Thanks for the clarification. I had indeed got it wrong. Now I fully
understand the scenario :).

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Apr 15 22:34:14 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 16 Apr 2010 11:34:14 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <505810F2-849D-4442-BA72-2B14065F8CB7@kenroku.kanazawa-u.ac.jp>

Hi Goto-san,

How do you feel to change the DELIMITER to "\nLOCUS" with
DELIMITER_OVERRUN = 5, like the BLAST parsers.

This is not as dirty as to check for empty lines and works for both
the GenBank release files and the files obtained through Entrez.

$ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
 Bio::GenBank::DELIMITER_OVERRUN = 5; \
 c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
 ff.each { |e| c += 1 }; p c' gbvrt21.seq
1991
$ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \
 Bio::GenBank::DELIMITER_OVERRUN = 5; \
 c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
 ff.each { |e| c += 1 }; p c' sequences.gb
3

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan
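
The idea of splitting on the start of the next record rather than on the end of the current one can be sketched in plain Ruby. This is only an illustration under stated assumptions: the helper `split_on_record_start` and the toy input are hypothetical, and the "overrun" here simply means that the trailing 5 characters ("LOCUS") of a chunk are handed over to the next chunk; it is not a copy of BioRuby's splitter.

```ruby
require 'stringio'

# Split text into entries at each "\nLOCUS", giving the last `overrun`
# characters of a chunk ("LOCUS") back to the entry that follows.
def split_on_record_start(text, delimiter = "\nLOCUS", overrun = 5)
  io = StringIO.new(text)
  entries = []
  carry = ""
  while (chunk = io.gets(delimiter))
    chunk = carry + chunk
    carry = ""
    # Only a chunk that really ended at the delimiter has an overrun to move.
    carry = chunk.slice!(-overrun, overrun) if chunk.end_with?(delimiter)
    entries << chunk
  end
  entries
end

release_style = "LOCUS A\n//\nLOCUS B\n//\n"
entrez_style  = "LOCUS A\n//\n\nLOCUS B\n//\n\n"

p split_on_record_start(release_style).size  # → 2
p split_on_record_start(entrez_style).size   # → 2
p split_on_record_start(entrez_style).last   # → "LOCUS B\n//\n\n"
```

Because the trailing blank line stays attached to the entry before it, no whitespace-only chunk is produced for either style of input. A line inside an entry that happens to begin with "LOCUS" would also split the entry, which this sketch does not guard against.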

From anurag08priyam at gmail.com Fri Apr 16 16:29:07 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 17 Apr 2010 01:59:07 +0530
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
Message-ID:

I am unsure if it is needed or not. A simple combination of find and sed did
the job.

Here is the one liner that I used:
find . -name '*.rb' -exec sed -i 's/\t/ /g' {} ';'

A patch would have been too long, so I have sent a pull request on github
for review.

Diff:
http://github.com/yeban/bioruby/compare/master...tab

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Fri Apr 16 16:50:06 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 17 Apr 2010 02:20:06 +0530
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
In-Reply-To:
References:
Message-ID:

I think because of the different tab sizes used by everyone, I have got the
indentation wrong in a few files. Replacing only the tabs at the beginning of
the lines does not work either (in fact, it destroys indentation in every
file). Vim's :retab command seems to work properly though. Please could
someone point me in the right direction?

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Fri Apr 16 17:38:16 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 17 Apr 2010 03:08:16 +0530
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
In-Reply-To:
References:
Message-ID:

I think I have got the solution. Vim seems to expand tab nicely; preserves
the indentation as well.
So I used "vim -c" and "find" together to
do the job. Its a little hackish though.

find . -name "*.rb" -exec vim -c Retab {} ';' #open a file in vim, run :retab, save and quit

where, Retab command in vim has been defined as:

:command Retab :call Retab_save_quit() "call the function Retab_save_quit() when, Retab command is invoked

and, retab_save_quit() function was defined as:

function Retab_save_quit()
  retab "retab the file
  wq "save and quit
end

with following options set:
set shiftwidth=2
set softtabstop=2
set expandtab

I disabled my default vimrc to speed up the command, still it took almost
1:40 seconds. It had to go through 422 Ruby files :-|.

I have already disturbed all with a previous pull request :P, so I think I
should wait for comments first.
Here is the new diff:
http://github.com/yeban/bioruby/compare/master...tab_2

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Fri Apr 16 18:30:18 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 17 Apr 2010 04:00:18 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <505810F2-849D-4442-BA72-2B14065F8CB7@kenroku.kanazawa-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <505810F2-849D-4442-BA72-2B14065F8CB7@kenroku.kanazawa-u.ac.jp>
Message-ID:

This also works: Check if the string read from the genbank file is not a
whitespace sequence( patch below ).

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { |e| c += 1 }; p c' sequences.gb
3

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { |e| c += 1 }; p c' gbvrt21.seq
1991

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { |e| c += 1 }; p c' sample.gb #this was my sample file
1

>From 94e70a98a0643caf13acc0417b677073b8f7968d Mon Sep 17 00:00:00 2001
From: Anurag Priyam
Date: Sat, 17 Apr 2010 03:50:48 +0530
Subject: [PATCH] fixed bug 18019; redundant nil entry

---
 lib/bio/io/flatfile/splitter.rb |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/bio/io/flatfile/splitter.rb b/lib/bio/io/flatfile/splitter.rb
index a07b016..f068bc2 100644
--- a/lib/bio/io/flatfile/splitter.rb
+++ b/lib/bio/io/flatfile/splitter.rb
@@ -191,7 +191,7 @@ module Bio
           self.entry_start_pos = p0
           self.entry = e
           self.entry_ended_pos = p1
-          return entry
+          return entry unless entry =~ /^\s$/
         end
       end #class Defalult

--
1.7.0

--
Anurag Priyam
2nd Year,Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp Fri Apr 16 19:28:56 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sat, 17 Apr 2010 08:28:56 +0900
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
In-Reply-To:
References:
Message-ID: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Anurag,

Please do not replace the tabs inside string literals and
comment lines. These tabs have special meanings.
Tabs in string literals apparently have their own meanings.
Tabs in comments normally do not affect the behavior of programs,
but please keep them because they may be excerpts from other
literature, descriptions of data formats using tabs, or have some
other intention.

Did you execute all tests before committing?
It seems some tests would fail because of the changes in string literals. Some functions without tests or uncovered with tests might be silently broken. It seems parsing Ruby syntax or edit files by hand may be needed. Please avoid single big commit. This makes hard if we need to revert few files due to unexpected regressions. In this case, because changes in each file are independent each other, one file one commit would be good. (This is rare special case. Usually, a commit can contain changes across two or more files if they are strongly related, e.g. when we move a method from a file to another, the changes of two files should be included in a single commit.) Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > I think I have got the solution. Vim seems to expand tab nicely; preserves > the indentation as well. So I used "vim -c and "find" together to > do the job. Its a little hackish though. > > find . -name "*.rb" -exec vim -c Retab {} ';' #open a file in vim, run > :retab, save and quit > > where, Retab command in vim has been defined as: > > :command Retab :call Retab_save_quit() "call the function Retab_save_quit() > when, Retab command is invoked > > and, retab_save_quit() function was defined as: > > function Retab_save_quit() > retab "retab the file > wq "save and quit > end > > with following options set: > set shiftwidth=2 > set softtabstop=2 > set expandtab > > I disabled my default vimrc to speed up the command, still it took almost > 1:40 seconds. It had to go through 422 Ruby files :-|. > > I have already disturbed all with a previous pull request :P, so I think I > should wait for comments first. > Here is the new diff: > http://github.com/yeban/bioruby/compare/master...tab_2 > > > On Sat, Apr 17, 2010 at 2:20 AM, Anurag Priyam wrote: > > > I think because of the different tab sizes used by everyone, I have got the > > indentation wrong in few files. 
Replacing tabs at the beginning of the lines > > also does not work( infact, it destroys indentation in every file). Vim's > > :retab command seems to work properly though. Please could someone point me > > in the right direction? > > > > > > On Sat, Apr 17, 2010 at 1:59 AM, Anurag Priyam wrote: > > > >> I am unsure if it is needed or not. A simple combination of find and sed > >> did the job. > >> > >> Here is the one liner that I used: > >> find . -name '*.rb' -exec sed -i 's/\t/ /g' {} ';' > >> > >> A patch would have been too long, so I have sent a pull request on github > >> for review. > >> > >> Diff: > >> http://github.com/yeban/bioruby/compare/master...tab > >> > >> -- > >> Anurag Priyam > >> 2nd Year,Mechanical Engineering, > >> IIT Kharagpur. > >> +91-9775550642 > >> > > > > > > > > -- > > Anurag Priyam > > 2nd Year,Mechanical Engineering, > > IIT Kharagpur. > > +91-9775550642 > > > > > > -- > Anurag Priyam > 2nd Year,Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- Naohisa Goto From anurag08priyam at gmail.com Fri Apr 16 20:06:48 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Sat, 17 Apr 2010 05:36:48 +0530 Subject: [BioRuby] Code cleanup: expanded tabs into spaces. In-Reply-To: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp> References: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: > > Please do not replace the tabs inside string literals and > comment lines. These tabs have special meanings. > Tabs in string literals apparently have their own meanings. > Tabs in comments normally do not affect behavior of programs, > but please keep them becase these may be excerption from other > literature, description of data formats using tabs, or some > other intention. > > Ok. 
It had occurred to me to not replace tabs in string literals and comments but the easier solution does not seem to take care of that. I will try to come up with a better solution( unless, it turns to be hand editing :P ). Did you execute all tests before committing? It seems some > tests would fail because of the changes in string literals. > Some functions without tests or uncovered with tests might > be silently broken. It seems parsing Ruby syntax or edit > files by hand may be needed. > I had. Forgot to mention. I had 2 failures and 4 errors both before and after my changes. Please do have a look at the test output. I have truncated parts of traceback and added my comments in [# ] to explain the failure. 1) Error: test_log(Bio::FuncTestSOAPWSDL): Errno::ECONNREFUSED: Connection refused - connect(2) /usr/lib/ruby/1.8/net/http.rb:560:in `initialize' ... ./test/functional/bio/io/test_soapwsdl.rb:26:in `setup' 2) Error: test_set_log(Bio::FuncTestSOAPWSDL): Errno::ECONNREFUSED: Connection refused - connect(2) /usr/lib/ruby/1.8/net/http.rb:560:in `initialize' ... ./test/functional/bio/io/test_soapwsdl.rb:26:in `setup' 3) Error: test_set_wsdl(Bio::FuncTestSOAPWSDL): Errno::ECONNREFUSED: Connection refused - connect(2) /usr/lib/ruby/1.8/net/http.rb:560:in `initialize' ... ./test/functional/bio/io/test_soapwsdl.rb:26:in `setup' 4) Error: test_wsdl(Bio::FuncTestSOAPWSDL): Errno::ECONNREFUSED: Connection refused - connect(2) /usr/lib/ruby/1.8/net/http.rb:560:in `initialize' .... ./test/functional/bio/io/test_soapwsdl.rb:26:in `setup' [# Test 1-4: At college, we use proxy. This error could be because of proxy settings not being read or initialized properly. I have had such errors working with net/* libraries. I will investigate the exact reason thoroughly and get back to you.] 5) Failure: test_libxml(Bio::TestPhyloXMLWriter_Check_LibXML) [./test/unit/bio/db/test_phyloxml_writer.rb:31]: Error: libxml-ruby library is not present. Please install libxml-ruby library. 
It is needed for Bio::PhyloXML module. Unit test for PhyloXML will not be performed. is not true. 6) Failure: test_libxml(Bio::TestPhyloXML_Check_LibXML) [./test/unit/bio/db/test_phyloxml.rb:29]: Error: libxml-ruby library is not present. Please install libxml-ruby library. It is needed for Bio::PhyloXML module. Unit test for PhyloXML will not be performed. is not true. [# tests 5and 6 are not initialized properly. I have libxml-ruby installed. I had written a thin wrapper over libxml to parse NeXML and tests as a demonstration for my GSoC 2010 proposal. And it works for me.] 2687 tests, 19751 assertions, 2 failures, 4 errors rake aborted! Command failed with status (1): [/usr/bin/ruby1.8 -I"lib" "/usr/lib/ruby/ge...] [# I think the errors have not been caused by my changes. > Please avoid single big commit. This makes hard if we need > to revert few files due to unexpected regressions. > In this case, because changes in each file are independent > each other, one file one commit would be good. (This is rare > special case. Usually, a commit can contain changes across > two or more files if they are strongly related, e.g. when > we move a method from a file to another, the changes of two > files should be included in a single commit.) > > I understand that, it is a good practice. 90+ ruby files have mixed indentation problem :(. $pwd ~/src/bioruby $grep -sl "\t" **/*.rb | wc -l 97 That would mean 97 commits. If I prove my reasons for the failure of above tests are there chance that the changes will be merged? -- Anurag Priyam 2nd Year,Mechanical Engineering, IIT Kharagpur. +91-9775550642 From tomoakin at kenroku.kanazawa-u.ac.jp Fri Apr 16 23:09:07 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Sat, 17 Apr 2010 12:09:07 +0900 Subject: [BioRuby] Patch for Bug 18019. 
In-Reply-To: References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <505810F2-849D-4442-BA72-2B14065F8CB7@kenroku.kanazawa-u.ac.jp> Message-ID: <6F201A16-1EB2-4A42-A63B-29C7036C0974@kenroku.kanazawa-u.ac.jp> Hi Anurag, If you change that code, you need to check if it is right for all kinds of files processed with FlatFile. The regular expression > /^\s$/ will match any empty line whether it is the whole entry or just a part of it. I suspect this change will break parsing BLAST output or any file that contain internal blank lines. Did you check them? /\A\s*\z/ might work as you intend, though I feel this a dirty hack. The expression \A and \z are explained in http://ruby-doc.org/docs/ProgrammingRuby/html/language.html -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan On 2010/04/17, at 7:30, Anurag Priyam wrote: > This also works: > > Check if the string read from the genbank file is not a whitespace > sequence( patch below ). 
> > $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { | > e| c += 1 }; p c' sequences.gb > 3 > > $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { | > e| c += 1 }; p c' gbvrt21.seq > 1991 > > $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);ff.each { | > e| c += 1 }; p c' sample.gb #this was my sample file > 1 > > From 94e70a98a0643caf13acc0417b677073b8f7968d Mon Sep 17 00:00:00 2001 > From: Anurag Priyam > Date: Sat, 17 Apr 2010 03:50:48 +0530 > Subject: [PATCH] fixed bug 18019; redundant nil entry > > --- > lib/bio/io/flatfile/splitter.rb | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/lib/bio/io/flatfile/splitter.rb b/lib/bio/io/flatfile/ > splitter.rb > index a07b016..f068bc2 100644 > --- a/lib/bio/io/flatfile/splitter.rb > +++ b/lib/bio/io/flatfile/splitter.rb > @@ -191,7 +191,7 @@ module Bio > self.entry_start_pos = p0 > self.entry = e > self.entry_ended_pos = p1 > - return entry > + return entry unless entry =~ /^\s$/ > end > end #class Defalult > > -- > 1.7.0 > > > On Fri, Apr 16, 2010 at 8:04 AM, Tomoaki NISHIYAMA > wrote: > Hi Goto-san, > > How do you feel to change the DELIMITER to "\nLOCUS" with > DELIMITER_OVERRUN = 5, like the BLAST parsers. > > This is not as dirty as to check for empty lines and works for both > the GenBank release files and the files obtained through Entrez. 
> > $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \ > Bio::GenBank::DELIMITER_OVERRUN = 5; \ > > c = 0; ff = Bio::FlatFile.open(ARGV[0]); \ > ff.each { |e| c += 1 }; p c' gbvrt21.seq > 1991 > $ ruby -rbio -e 'Bio::GenBank::DELIMITER = "\nLOCUS"; \ > Bio::GenBank::DELIMITER_OVERRUN = 5; \ > > c = 0; ff = Bio::FlatFile.open(ARGV[0]); \ > ff.each { |e| c += 1 }; p c' sequences.gb > 3 > -- Tomoaki NISHIYAMA > > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > On 2010/04/15, at 15:32, Naohisa GOTO wrote: > > Hi Anurag, > > Parsing of GenBank files is primarily tested with official > GenBank releases. (But currently no unit tests. I hope they > would be added during the GSoC project "Ruby 1.9.2 support of > BioRuby".) > > The test is something like: > > # preparetion of test data > > % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz > % gzip -dc gbvrt21.seq.gz > gbvrt21.seq > > # Counts the number of entries > > % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \ > ff.each { |e| c += 1 }; p c' gbvrt21.seq > #==> 1991 > > # Checks if the number of entries is correct. > > % grep -c '^LOCUS' gbvrt21.seq > > #==> 1991 > > # Executes with the monkey patch. > # Be careful that this takes very long time and large memory! > > % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \ > c = 0; ff = Bio::FlatFile.open(ARGV[0]); \ > ff.each { |e| c += 1 }; p c' gbvrt21.seq > #==> 1 > > It is apparent that the patch is wrong. > > Splitting entries by using such delimiter is simple and the > performance > is well, but it can only work with correct data which should always be > ended with the delimiter. Characters after the last delimiter in the > file is regarded as a single entry because we don't want to lose data. > > The behavior can be changed, for example, when getting only white > spaces and then the end of file without delimiter, it is ignored and > treated as EOF with no entries. 
> > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > On Thu, 15 Apr 2010 10:34:53 +0900 > Naohisa GOTO wrote: > > On Wed, 14 Apr 2010 21:44:49 +0100 > Jan Aerts wrote: > > Thanks for that, Anurag. Contributions to bioruby very much > appreciated :-) > > @Goto-san: can you merge that fix? > > No, because the patch ignores reading of entries in the middle of > the file. > To parse files distributed from NCBI, the delimiter should be "\n// > \n", > and cannot be "\n//\n\n". > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > Cheers, > jan. > > On 14 April 2010 21:41, Anurag Priyam > wrote: > > Hello all, > > This is my start at being a part of the BioRuby developer community. > > The RubyForge bug tracking page shows bug 18019( GenBank > each_entry, last > entry is always nil )[1] to be open. I am attaching a patch for it. > Its > very > tiny. The fix was already suggested in a comment by Raoul Jean Pierre > Bonnal( the submitter of the bug ). I have verified the solution and > created > a patch for it. Or should I send a pull request on github? > > Patch( git format-patch ): > > From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001 > From: Anurag Priyam > Date: Wed, 14 Apr 2010 22:58:45 +0530 > Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil > > --- > lib/bio/db/genbank/common.rb | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/ > common.rb > index 545eac1..eaa760c 100644 > --- a/lib/bio/db/genbank/common.rb > +++ b/lib/bio/db/genbank/common.rb > @@ -24,7 +24,7 @@ class NCBIDB > # > module Common > > - DELIMITER = RS = "\n//\n" > + DELIMITER = RS = "\n//\n\n" > TAGSIZE = 12 > > def initialize(entry) > -- > 1.7.0 > > > [1] > > http://rubyforge.org/tracker/index.php? > func=detail&aid=18019&group_id=769&atid=3037 > > -- > Anurag Priyam > 2nd Year,Mechanical Engineering, > IIT Kharagpur. 
> +91-9775550642
>
> --
> Anurag Priyam
> 2nd Year,Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> <0001-fixed-bug-18019-redundant-nil-entry.patch>

From pjotr.public14 at thebird.nl Sat Apr 17 03:29:50 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 17 Apr 2010 09:29:50 +0200
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
In-Reply-To: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp>
References: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <20100417072950.GA16573@thebird.nl>

Hi Anurag,

On Sat, Apr 17, 2010 at 08:28:56AM +0900, Naohisa Goto wrote:
> Please avoid single big commit. This makes hard if we need
> to revert few files due to unexpected regressions.
> In this case, because changes in each file are independent
> each other, one file one commit would be good. (This is rare
> special case. Usually, a commit can contain changes across
> two or more files if they are strongly related, e.g. when
> we move a method from a file to another, the changes of two
> files should be included in a single commit.)

From the git suggestions I give my collaborators. Small commits not only help cherry-picking, but also help people understand what you have done.

== Using descriptive commits ==

When other people view your changes it is important the description is, eh, descriptive. So using a message like 'source change', or 'bug fix' is not really helpful - it is what one would expect. Better would be 'refactored variable names in func.c', or 'changed variable names to improve readability in func.c'.

! Use good patch descriptions when committing changes

My suggestion is to start the description with a keyword, a colon, followed by the description. So this could work:

  Admintool: show import/export of tasks/assignments display XP version

Look at it this way: if you were to read someone else's descriptions, would it be descriptive enough without starting to browse the code?

Related is to commit often, and do small commits. This helps with writing more focused descriptions.

! Commit after small changes that can be described together

When you need multiple keywords you should have used multiple commits(!)

Another tip is to mark patches that have all tests succeed. Before a commit run all the tests. When the tests are OK - that is the system is in a consistent state - I mark the patch by adding an ampersand. For example:

  git commit -a -m "Fixed missing link &"

That way it is clear when the system is broken in the patch record (some people say it should never be broken, but I think that defeats the purpose of small incremental patches).

! Add a marker to a commit message when all tests succeed

Pj.

From pjotr.public14 at thebird.nl Sat Apr 17 03:31:49 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 17 Apr 2010 09:31:49 +0200
Subject: [BioRuby] Code cleanup: expanded tabs into spaces.
In-Reply-To: References: <20100417082628.DDCE.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <20100417073149.GB16573@thebird.nl> On Sat, Apr 17, 2010 at 05:36:48AM +0530, Anurag Priyam wrote: > I understand that, it is a good practice. 90+ ruby files have mixed > indentation problem :(. That would be one commit - as long as it does not mix with your other changes. However, I can see people resisting this commit. Leave these types of commits to the maintainers, or create a special git branch and put it up for suggestion. Pj. From donttrustben at gmail.com Sat Apr 17 04:25:12 2010 From: donttrustben at gmail.com (Ben Woodcroft) Date: Sat, 17 Apr 2010 18:25:12 +1000 Subject: [BioRuby] Bio::GO::GeneAssociation issue/fix and new unit test file In-Reply-To: References: Message-ID: Hi, Not to be pushy, but is there any movement on this? Ignoring my suggestions for API changes (which aren't implemented), can the bug fixes be merged? Thanks, ben On 8 April 2010 22:14, Ben Woodcroft wrote: > Hi, > > I had some problems parsing gene association files using Bio::Flatfile, > caused because the parser was attempting to use the split method on a nil. > The offending line was > > @db_reference = tmp[5].split(/\|/) # > > That seemed easy enough to fix, but then I noticed there wasn't any test > cases to test my changes against, so I made a new file > test/unit/db/test_go.rb, including a simulation of one that was giving me > problems. I've collected these changes in a new branch, and you can see the > difference using the new github compare interface at > > http://github.com/wwood/bioruby/compare/36041377db...gene_association > > Is there any reason that the variables that correspond to arrays in > GeneAssociation (@db_reference, @with, @db_object_synonym) are singular > names, and not plural? It would be simple to add a alias_method > db_references -> db_reference right? 
> > I also don't agree that the 'GO:' part of the identifier be chopped off by > default by the goid method - gene association files are not necessarily > concerned with GO - there are other ontologies out there as well. I > personally never look at GO identifiers without the 'GO:' bit, so I was > surprised when I saw that. > > Sound OK? > Thanks, > ben > From anurag08priyam at gmail.com Sat Apr 17 12:14:40 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Sat, 17 Apr 2010 21:44:40 +0530 Subject: [BioRuby] Patch for Bug 18019. In-Reply-To: <6F201A16-1EB2-4A42-A63B-29C7036C0974@kenroku.kanazawa-u.ac.jp> References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <505810F2-849D-4442-BA72-2B14065F8CB7@kenroku.kanazawa-u.ac.jp> <6F201A16-1EB2-4A42-A63B-29C7036C0974@kenroku.kanazawa-u.ac.jp> Message-ID: > > If you change that code, you need to check if it is right for all kinds of > files processed with FlatFile. > > The regular expression > > /^\s$/ > > will match any empty line whether it is the whole entry or just a part of > it. > > I suspect this change will break parsing BLAST output or any file that > contain > internal blank lines. Did you check them? > I had not. It indeed caused errors. /\A\s*\z/ > might work as you intend, though I feel this a dirty hack. > I had desired the same effect. This works and passes all the tests. I too feel that my solution is hackish and I am unsure if it is going to be accepted or not. Should I create a patch out of it? But, it was so much fun coming up with my solutions based on the feedbacks. Even if my patch won't be accepted I learnt a lot and will continue to do so with further contributions to the community. I hope I am able to suggest better solutions to some other problems in the future :). Lessons learnt: Tests are very important! -- Anurag Priyam 2nd Year,Mechanical Engineering, IIT Kharagpur. 
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp Mon Apr 19 09:35:51 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 19 Apr 2010 22:35:51 +0900
Subject: [BioRuby] Bio::GO::GeneAssociation issue/fix and new unit test file
In-Reply-To:
References:
Message-ID: <20100419133553.908F71CBC59D@idnmail.gen-info.osaka-u.ac.jp>

Hi Ben,

On Sat, 17 Apr 2010 18:25:12 +1000
Ben Woodcroft wrote:

> Hi,
>
> Not to be pushy, but is there any movement on this? Ignoring my suggestions
> for API changes (which aren't implemented), can the bug fixes be merged?
> Thanks,
> ben

This suggests Mitsuteru Nakao, current maintainer of the classes, is too busy.

On 8 April 2010 22:14, Ben Woodcroft wrote:
> Hi,
>
> I had some problems parsing gene association files using Bio::Flatfile,
> caused because the parser was attempting to use the split method on a nil.
> The offending line was
>
> @db_reference = tmp[5].split(/\|/) #

The GO Annotation File Format 1.0 defines that each line has 15 tab-delimited fields (except comment lines), so in theory the split method would never be called on a nil. Of course, with real data it is very inconvenient to get such exceptions, and I agree that this should be fixed.

> That seemed easy enough to fix, but then I noticed there wasn't any test
> cases to test my changes against, so I made a new file
> test/unit/db/test_go.rb, including a simulation of one that was giving me
> problems. I've collected these changes in a new branch, and you can see the
> difference using the new github compare interface at
>
> http://github.com/wwood/bioruby/compare/36041377db...gene_association

The patch seems good and will be merged. A minor thing: there is no need to check for both nil and empty.

>> @db_reference = (tmp[5].nil? or tmp[5].empty?) ? [] : tmp[5].split(/\|/)

can be shortened to

  @db_reference = tmp[5] ? tmp[5].split(/\|/) : []

or

  @db_reference = tmp[5].to_s.split(/\|/)

> Is there any reason that the variables that correspond to arrays in
> GeneAssociation (@db_reference, @with, @db_object_synonym) are singular
> names, and not plural? It would be simple to add a alias_method
> db_references -> db_reference right?

I suppose these were picked from the older version of the "GO Annotation File Format 1.0 Guide".

http://web.archive.org/web/20030401212209/http://www.geneontology.org/doc/GO.annotation.html
http://web.archive.org/web/20040803050222/www.geneontology.org/GO.annotation.html
(Current version: http://www.geneontology.org/GO.format.gaf-1_0.shtml )

In the file format definition, each column is briefly described with words in singular form. The first authors of the class may have used the names as they were, only replacing colons and spaces with "_" and lower-casing.

I can agree to adding the aliases if you, an active user of the class, find the current method names confusing. Please propose better names.

> I also don't agree that the 'GO:' part of the identifier be chopped off by
> default by the goid method - gene association files are not necessarily
> concerned with GO - there are other ontologies out there as well. I
> personally never look at GO identifiers without the 'GO:' bit, so I was
> surprised when I saw that.

To avoid confusion, I think we should add a new method "go_id", which matches the above naming rule for the current format definition, and deprecate the method "goid" (with a warning message). (It seems the short name for the column was renamed from "GOid" to "GO ID" in 2004.)

> > Sound OK?
> > Thanks,
> > ben

Thank you.
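
A minimal sketch of the nil-safe forms suggested above, assuming a line that has fewer than 15 tab-separated fields. The helper name split_field and the sample line are made up for illustration and are not part of Bio::GO::GeneAssociation.

```ruby
# Hypothetical helper for illustration; not the actual BioRuby code.
def split_field(field)
  # nil.to_s is "", and "".split returns [], so a missing or empty
  # field yields an empty array without explicit nil/empty checks.
  field.to_s.split(/\|/)
end

# A made-up, truncated line: only three fields, so tmp[5] is nil.
tmp = "SGD\tS000000001\tABC1".split(/\t/)

missing = split_field(tmp[5])          # field absent from the line
empty   = split_field("")              # field present but empty
several = split_field("PMID:1|PMID:2") # pipe-separated values

p missing, empty, several
```

The ternary form `tmp[5] ? tmp[5].split(/\|/) : []` behaves the same way for these three inputs; the `to_s` form is simply shorter.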
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From jan.aerts at gmail.com Tue Apr 20 07:06:08 2010 From: jan.aerts at gmail.com (Jan Aerts) Date: Tue, 20 Apr 2010 12:06:08 +0100 Subject: [BioRuby] ruby best practices Message-ID: There's a very good book out now with Ruby Best Practices, including a free PDF. Highly recommended. See http://rubybestpractices.com/ jan. From ngoto at gen-info.osaka-u.ac.jp Tue Apr 20 23:32:24 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 21 Apr 2010 12:32:24 +0900 Subject: [BioRuby] ruby best practices In-Reply-To: References: Message-ID: <20100421033226.C403D1CBC3BD@idnmail.gen-info.osaka-u.ac.jp> FYI, Japanese translation of the book is now available. http://www.amazon.co.jp/exec/obidos/ASIN/4873114454/ (no PDF, only paper book) Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 20 Apr 2010 12:06:08 +0100 Jan Aerts wrote: > There's a very good book out now with Ruby Best Practices, including a free > PDF. Highly recommended. See http://rubybestpractices.com/ > > jan. From pjotr.public14 at thebird.nl Mon Apr 26 08:53:47 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 26 Apr 2010 14:53:47 +0200 Subject: [BioRuby] Alignment plugin Message-ID: <20100426125347.GB23619@thebird.nl> I am thinking of creating some new infrastructure for alignments. The Bioruby alignment architecture is not great. It contains a lot of useful functionality, but it is purely sequence organized. I did a writeup on the Bioruby blog - on ALN support and colorized HTML - if you remember. For completeness I checked the BioJAVA and BioPython implementations. The BioJAVA alignment classes are in a deep tree: biojava/alignment/src/main/java/org/biojava/bio/alignment the implementation troubles me. Partly it is JAVA itself - which makes code feel dispersed. Partly it is the implementation, which appears to be minimal. I guess it is a work in progress. 

The BioPython version looks like it is the best of the three. Some separation of responsibilities. Good documentation, and good validation and testing. I like that. Otherwise, functionally it is mostly comparable to BioRuby.

The trick of designing good alignment classes is to make them small and fork out responsibilities. The BioJAVA version does not contain much. The BioRuby version has everything in one place, including the kitchen sink. BioPython goes some way towards what it should be, but it does not look more extensible than what we have (and I don't want to use Python).

It sucks. I don't feel like replicating all other code. At the same time I want something cleaner.

The PAML output adds information for each column of an alignment. Besides, we deal with the translated alignment too. So PAML requires a dual alignment standard (NU+AA) with columnwise information (homology, evidence of positive selection). Add to that the phylogenetic tree. For my current work I am going to add column-wise and row-wise 'meta' information, which is used for output (both HTML and graphics).

I guess the best option is to write two BioRuby plugins. One for the new alignment storage and one for PAML alignments, which will include meta-info and output functionality. Questions:

* What is the way to store alignments - should gaps be represented as dashes?
* Should we use a String format?
* How do we handle multi-value fields (e.g. degenerates)?
* How do we handle quality scores (sequencers)?

I think the underlying storage format should not be String - as it allows toying with the data - say, by embedding HTML. Properties, like colors, should be added on top of the alignment structure, not within. We should also allow for (future) stronger type checking of nucleotides and amino acids.

If we can convert easily to the standard BioRuby alignment, old functionality can be retained. Though it may not always be that natural.
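
One way to read the storage idea above, as a rough sketch only: residues are kept as single characters, gaps are assumed to be written as dashes, and column-wise information (for example evidence of positive selection) sits in a separate structure on top of the alignment. The class names AlignmentRow and AnnotatedAlignment are hypothetical and do not exist in BioRuby.

```ruby
# Hypothetical sketch; these classes are not part of BioRuby.
GAP = '-'.freeze # assumption: gaps are represented as dashes

class AlignmentRow
  attr_reader :id, :residues

  def initialize(id, text)
    @id = id
    # Keep residues as a frozen array of single characters
    # rather than as one mutable String.
    @residues = text.chars.map(&:freeze).freeze
  end

  def to_s
    residues.join
  end
end

class AnnotatedAlignment
  attr_reader :rows, :column_meta

  def initialize(rows)
    lengths = rows.map { |r| r.residues.size }.uniq
    raise ArgumentError, 'rows differ in length' if lengths.size > 1
    @rows = rows
    # Column-wise annotation lives beside the residues, not inside them.
    @column_meta = Hash.new { |h, col| h[col] = {} }
  end

  def width
    rows.empty? ? 0 : rows.first.residues.size
  end

  # All residues found at one column, top row first.
  def column(col)
    rows.map { |r| r.residues[col] }
  end

  def annotate_column(col, key, value)
    @column_meta[col][key] = value
  end

  # Indices of columns that contain at least one gap.
  def gap_columns
    (0...width).select { |c| column(c).include?(GAP) }
  end
end

aln = AnnotatedAlignment.new([
  AlignmentRow.new('seq1', 'ATG-CA'),
  AlignmentRow.new('seq2', 'ATGTCA')
])
aln.annotate_column(3, :positive_selection, true)
```

Output code (HTML, graphics) could then read `column_meta` when rendering, so colors and similar properties never end up inside the sequence data itself.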

With Ruby a string type may be the most obvious choice (a list of lists of a special nucleotide object is probably overkill, though it should not be).

Anyone interested in participating?

With regard to plugins: for now I will merely create a separate

  pluginname/lib/bio/pluginname.rb

and add that to the include path. That should be OK for now. It will allow adding it as a gem too.

Pj.

From rutgeraldo at gmail.com Mon Apr 26 09:03:35 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Mon, 26 Apr 2010 14:03:35 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To: <20100426125347.GB23619@thebird.nl>
References: <20100426125347.GB23619@thebird.nl>
Message-ID:

Do you feel that these objects could/should also double as character-state matrices?

On Mon, Apr 26, 2010 at 1:53 PM, Pjotr Prins wrote:
> I am thinking of creating some new infrastructure for alignments.
>
> The Bioruby alignment architecture is not great. It contains a lot of useful
> functionality, but it is purely sequence organized. I did a writeup on the
> Bioruby blog - on ALN support and colorized HTML - if you remember.
>
> For completeness I checked the BioJAVA and BioPython implementations.
>
> The BioJAVA alignment classes are in a deep tree:
>
>   biojava/alignment/src/main/java/org/biojava/bio/alignment
>
> the implementation troubles me. Partly it is JAVA itself - which makes code
> feel dispersed. Partly it is the implementation, which appears to be minimal. I
> guess it is a work in progress.
>
> The BioPython version looks like it is the best of the three. Some
> separation of responsibilities. Good documentation, and good
> validation and testing. I like that. Otherwise, functionally it is
> mostly comparable to BioRuby.
>
> The trick of designing good alignment classes is to make them small and fork
> out responsibilities. The BioJAVA version does not contain much. The BioRuby
> version has everything in one place, including the kitchen sink. BioPython goes
> some way towards what it should be, but it does not look more
> extensible than what we have (and I don't want to use Python).
>
> It sucks. I don't feel like replicating all other code. At the same time I want
> something cleaner.
>
> The PAML output adds information for each column of an alignment.
> Besides we deal with the translated alignment too. So PAML requires a
> dual alignment standard (NU+AA) with columnwise information (homology,
> evidence of positive selection). Add to that the phylogenetic tree. For
> my current work I am going to add column-wise and row-wise 'meta'
> information, which is used for output (both HTML and graphics).
>
> I guess the best option is to write two BioRuby plugins. One for the
> new alignment storage and one for PAML alignments, which will include
> meta-info and output functionality. Questions:
>
> * What is the way to store alignments - should gaps be represented as dashes?
> * Should we use a String format?
> * How do we handle multi-value fields (e.g. degenerates)?
> * How do we handle quality scores (sequencers)?
>
> I think the underlying storage format should not be String - as it allows
> toying with the data - say, by embedding HTML. Properties, like
> colors, should be added on top of the alignment structure, not within.
> We should also allow for (future) stronger type checking of
> nucleotides and amino acids.
>
> If we can convert easily to the standard BioRuby alignment old
> functionality can be retained. Though it may not always be that
> natural.
>
> With Ruby a string type may be the most obvious choice (a list of
> lists of a special nucleotide object is probably overkill, though it
> should not be).
>
> Anyone interested in participating?
>
> With regard to plugins: for now I will merely create a separate
>
>   pluginname/lib/bio/pluginname.rb
>
> and add that to the include path. That should be OK for now. It will
> allow adding it as a gem too.
>
> Pj.
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From bonnalraoul at ingm.it Mon Apr 26 10:29:16 2010
From: bonnalraoul at ingm.it (Raoul Bonnal)
Date: Mon, 26 Apr 2010 16:29:16 +0200
Subject: [BioRuby] R: Alignment plugin
Message-ID: <9c999264-a327-4e93-b7dd-3e8473f73925@ingm.it>

I think we can start discussing the plug-in system; it will probably be discussed at BOSC 2010. I think we should take inspiration from Rails. I would like to adopt something similar to MVC, perhaps introducing a different pattern in place of the controller.

--
Raoul J.P. Bonnal
Life Science Informatics
Integrative Biology Program
Fondazione INGM
Via F. Sforza 28
20122 Milano, IT
phone: +39 02 006 623 26
fax: +39 02 006 623 46
http://www.ingm.it

> -----Original message-----
> From: bioruby-bounces at lists.open-bio.org [mailto:bioruby-bounces at lists.open-bio.org] On behalf of Pjotr Prins
> Sent: Monday, 26 April 2010 14:54
> Cc: bioruby at lists.open-bio.org
> Subject: [BioRuby] Alignment plugin
>
> I am thinking of creating some new infrastructure for alignments.
>
> The Bioruby alignment architecture is not great. It contains a lot of useful
> functionality, but it is purely sequence organized. I did a writeup on the
> Bioruby blog - on ALN support and colorized HTML - if you remember.
>
> For completeness I checked the BioJAVA and BioPython implementations.
>
> The BioJAVA alignment classes are in a deep tree:
>
>   biojava/alignment/src/main/java/org/biojava/bio/alignment
>
> the implementation troubles me. Partly it is JAVA itself - which makes code
> feel dispersed.
> > If we can convert easily to the standard BioRuby alignment old > functionality can be retained. Though it may not always be that > natural. > > With Ruby a string type may be the most obvious choice (a lists of > lists of a special nucleotide object is probably overkill, though it > should not be). > > Anyone interested in participating? > > With regard to plugins: for now I will merely create a separate > > pluginname/lib/bio/pluginname.rb > > and add that to the include path. That should be OK for now. It will > allow adding it as a gem too. > > Pj. > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr.public14 at thebird.nl Mon Apr 26 10:52:03 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 26 Apr 2010 16:52:03 +0200 Subject: [BioRuby] Alignment plugin In-Reply-To: References: <20100426125347.GB23619@thebird.nl> Message-ID: <20100426145203.GA25758@thebird.nl> On Mon, Apr 26, 2010 at 02:03:35PM +0100, Rutger Vos wrote: > Do you feel that these objects could/should also double as > character-state matrices? Yes. Rather than hacking a matrix on top of a sequence alignment - what I want to do is: A sequence is a list of nucleotides/aminoacid/degenerates (whatever). Each sequence has a name and other properties (like gaps). Each nucleotide/aminoacid has properties by itself. Do gaps have properties? A 'column' may have properties. Any type of matrix can be derived from the internal structure. I don't think the underlying storage pattern is a matrix. Something else, I want sequence storage to be transparently open to other back-ends. I.e. memory storage, SQL storage, other storage etc. I want to get away from storing everything in memory. With big data it is simply not a great idea (though not so relevant for alignments, probably). Pj. 
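The structure described above (a sequence as a list of residues that carry their own properties, with storage that can be swapped for another back-end) might look roughly like the following. This is only a minimal sketch under stated assumptions: the names Residue, MemoryStore and Sequence are hypothetical and are not part of BioRuby, and the only back-end shown keeps everything in memory.

```ruby
# Hypothetical sketch; none of these classes exist in BioRuby.

# One position in a sequence: a symbol plus free-form properties.
Residue = Struct.new(:symbol, :properties) do
  def gap?
    symbol == '-'
  end
end

# Simplest possible back-end: keeps residues in an Array.
# An SQL- or file-backed store would offer the same three methods.
class MemoryStore
  def initialize(residues)
    @residues = residues
  end

  def fetch(index)
    @residues.fetch(index)
  end

  def length
    @residues.length
  end

  def each(&block)
    @residues.each(&block)
  end
end

# A named sequence that only talks to its store through
# fetch/length/each, so the back-end can be swapped.
class Sequence
  include Enumerable

  attr_reader :name, :properties

  def initialize(name, store, properties = {})
    @name = name
    @store = store
    @properties = properties
  end

  # Convenience constructor: one Residue object per character.
  def self.from_string(name, string, properties = {})
    residues = string.each_char.map { |c| Residue.new(c, {}) }
    new(name, MemoryStore.new(residues), properties)
  end

  def [](index)
    @store.fetch(index)
  end

  def length
    @store.length
  end

  def each(&block)
    @store.each(&block)
  end

  # The String view is derived; it is not the storage format.
  def to_s
    map(&:symbol).join
  end
end

seq = Sequence.from_string('seq1', 'AC-GT', :source => 'example')
seq[3].properties[:quality] = 40   # property on a single residue
puts seq.to_s                      # AC-GT
puts seq.count(&:gap?)             # 1
```

Because Sequence includes Enumerable, a matrix or any other view can be derived from the residues without the residues themselves being stored as a matrix.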
From pjotr.public14 at thebird.nl Mon Apr 26 11:04:10 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 26 Apr 2010 17:04:10 +0200
Subject: [BioRuby] Alignment plugin
In-Reply-To: <20100426145203.GA25758@thebird.nl>
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl>
Message-ID: <20100426150410.GA26465@thebird.nl>

Maybe we should start defining a basic sequence object. What would we want from it, what should be core and what should be mixed in?

Alignments and secondary structures should build on that.

I think we have a chance of doing it right - after all our experience, and that of the other Bio* projects.

Pj.

From jan.aerts at gmail.com Mon Apr 26 11:15:59 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Mon, 26 Apr 2010 16:15:59 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To: <20100426150410.GA26465@thebird.nl>
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl>
Message-ID:

True. Also keep in the back of your head that one of the GSoC projects this year (hopefully) will focus on adding annotation functionality to BioRuby using RDF. So any object can have arbitrary annotations (although we might decide to use a standardized vocabulary).

jan.

On 26 April 2010 16:04, Pjotr Prins wrote:
> Maybe we should start defining a basic sequence object. What would we
> want from it, what should be core and what should be mixed in?
>
> Alignments and secondary structures should build on that.
>
> I think we have a chance of doing it right - after all our
> experience, and that of the other Bio* projects.
>
> Pj.
From pjotr.public14 at thebird.nl Mon Apr 26 11:34:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 26 Apr 2010 17:34:59 +0200
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl>
Message-ID: <20100426153459.GA26923@thebird.nl>

On Mon, Apr 26, 2010 at 04:15:59PM +0100, Jan Aerts wrote:
> True. Also keep in the back of your head that one of the GSoC projects this
> year (hopefully) will focus on adding annotation functionality to bioruby
> using RDF. So any object can have arbitrary annotations (although we might
> decide to use a standardized vocabulary).

We have some smart people here :-). Another thing is that parts of sequences can have meaning, like ORF, promoter, restriction site, etc. Again with possible annotations.

Current implementations always start with the sequence string and build 'annotation' on top. That way you see a lot of repetition in the libraries.

Standards have emerged that do something with (partial) sequence annotations. These standards again show patterns of use. And again there is repetition at that level.

In BioRuby there is repetition between AA sequence handlers and NUC sequence handlers. There is a lot of repetition in input parsers and output writers.

My main problem, at this point, is to think up an OOP design that is strong enough to address most needs, but simple enough so we can use some form of generics.

Pj.
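One way to avoid repeating the annotation bookkeeping in both the AA and NUC handlers is to write it once as a mixin. The following is a minimal sketch under assumptions, not an existing BioRuby design: Feature, Annotatable, NucleotideSequence and AminoAcidSequence are hypothetical names, positions are taken to be 0-based, and the example sequence and notes are made up.

```ruby
# Hypothetical sketch; Feature and Annotatable are illustrative names.

# A feature is a range on a sequence plus a type and free annotations.
Feature = Struct.new(:range, :type, :annotations)

# Included by nucleotide and amino acid classes alike, so the
# range bookkeeping is written only once.
module Annotatable
  def features
    @features ||= []
  end

  def add_feature(range, type, annotations = {})
    feature = Feature.new(range, type, annotations)
    features << feature
    feature
  end

  # Features covering a 0-based position, optionally filtered by type.
  def features_at(position, type = nil)
    features.select do |f|
      f.range.include?(position) && (type.nil? || f.type == type)
    end
  end
end

class NucleotideSequence
  include Annotatable
  attr_reader :residues

  def initialize(residues)
    @residues = residues
  end
end

class AminoAcidSequence
  include Annotatable
  attr_reader :residues

  def initialize(residues)
    @residues = residues
  end
end

# Made-up example data.
nuc = NucleotideSequence.new('TTGACAATGGCCTAA')
nuc.add_feature(0..5, :promoter, :note => 'example only')
nuc.add_feature(6..14, :orf)

puts nuc.features_at(7).map(&:type).inspect
```

Since a feature is an object with its own annotations hash, arbitrary annotations (for example from a standardized vocabulary) can be attached to a range without touching the sequence string.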
From rutgeraldo at gmail.com Mon Apr 26 11:40:11 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Mon, 26 Apr 2010 16:40:11 +0100 Subject: [BioRuby] Alignment plugin In-Reply-To: <20100426150410.GA26465@thebird.nl> References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> Message-ID: On Mon, Apr 26, 2010 at 4:04 PM, Pjotr Prins wrote: > Maybe we should start defining a basic sequence object. What would we > want from it, what should be core and what should be mixed in? > > Alignments and secondary structures should build on that. In the interest of learning from other Bio* projects ;-) it should be noted that there is a bit of a mismatch between sequences as standalone objects on the one hand, and rows within character state matrices on the other, especially when you consider types of data beyond molecular sequences (e.g. morphological character state data). Within a matrix there are columns such that every cell in a sequence now becomes a concrete instance of one of a limited set of character states for that character/column. Especially for morphological data there could be very esoteric ambiguity mappings from one state in that column to another. Imagine an alignment with unique mappings a la the IUPAC single character codings for each column. The upshot might be that you'd need a mapping object for each cell, though you'd use an immutable class for molecular data. -- Dr. Rutger A. 
Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From diapriid at gmail.com Mon Apr 26 11:41:39 2010
From: diapriid at gmail.com (Matt)
Date: Mon, 26 Apr 2010 11:41:39 -0400
Subject: [BioRuby] Alignment plugin
In-Reply-To: <20100426145203.GA25758@thebird.nl>
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl>
Message-ID:

> Each nucleotide/aminoacid has properties by itself. Do gaps have
> properties?
>

Not sure if you mean a single gap ("-") or any gap between nucleotides. If the latter then it would be nice if gaps had properties like

* length
* at_beginning_sequence (precedes all bases)
* at_end_of_sequence (found at end of all bases)

Another distinction: gaps found in a typical MSA may indicate real gaps (as inferred from evolutionary events in an alignment), or missing data, depending on how sloppy/precise a person is.

cheers,
Matt

From pjotr.public14 at thebird.nl Mon Apr 26 12:30:55 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 26 Apr 2010 18:30:55 +0200
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl>
Message-ID: <20100426163055.GC28231@thebird.nl>

Hi Rutger,

On Mon, Apr 26, 2010 at 04:40:11PM +0100, Rutger Vos wrote:
> On Mon, Apr 26, 2010 at 4:04 PM, Pjotr Prins wrote:
> > Maybe we should start defining a basic sequence object. What would we
> > want from it, what should be core and what should be mixed in?
> >
> > Alignments and secondary structures should build on that.
>
> In the interest of learning from other Bio* projects ;-) it should be
> noted that there is a bit of a mismatch between sequences as
> standalone objects on the one hand, and rows within character state
> matrices on the other, especially when you consider types of data
> beyond molecular sequences (e.g. morphological character state data).

Yes.

> Within a matrix there are columns such that every cell in a sequence
> now becomes a concrete instance of one of a limited set of character
> states for that character/column. Especially for morphological data
> there could be very esoteric ambiguity mappings from one state in that
> column to another. Imagine an alignment with unique mappings a la the
> IUPAC single character codings for each column. The upshot might be
> that you'd need a mapping object for each cell, though you'd use an
> immutable class for molecular data.

I think I understand what you mean here. The way I see it is that the sequences are immutable lists of nucleotides/amino acids. State can be at row, column or individual matrix point level.

I guess it is impossible to impose the way people want to use the data structure. Either they use state as a loose component (could be a matrix) projected on the sequences, or (if our format allows it) they could maintain state at each of the three levels (row, column, point).

In my case I would like to add state into the data structure (one advantage could be that it would be relatively easy to export, also to RDF). We have an alignment:

  aln = Alignment.new(sequences)

I would like to annotate columns 1..4 as having high homology

  aln.column(1..4, :homology=>HIGH)

maybe I want to remove a part of sequence 3 and mark it as such

  aln.delete(3, 20..30)
  aln.sequence(3, :position=>20..30, :deleted=>true)

or indicate an ORF

  aln.sequence(3, :position=>40..65, :orf=>true)

and fetch information, like quality scores

  sequence = aln.sequence(3)
  quality = sequence.quality(:position=>40..65)

Any variations thereof.
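The calls above could be backed by something as small as the following. This is a sketch under assumptions rather than a proposal for the final API: Alignment here is a hypothetical class (not Bio::Alignment), AnnotatedSequence is an invented helper, indices are taken to be 0-based, and the quality values are made up.

```ruby
# Hypothetical sketch of the calls above; not an existing BioRuby API.
HIGH = :high

class AnnotatedSequence
  attr_reader :residues, :annotations

  def initialize(residues, quality = nil)
    @residues = residues.dup.freeze
    # One quality value per residue; nil where unknown.
    @quality = quality || Array.new(residues.length)
    @annotations = []
  end

  # Row-wise state: a range of positions plus arbitrary properties.
  def annotate(range, properties)
    @annotations << [range, properties]
    self
  end

  # sequence.quality(:position => 40..65) returns the scores in that range.
  def quality(options = {})
    range = options.fetch(:position, 0...@residues.length)
    @quality[range]
  end
end

class Alignment
  attr_reader :column_annotations

  def initialize(sequences)
    @sequences = sequences
    @column_annotations = []
  end

  # aln.column(1..4, :homology => HIGH) stores column-wise state.
  def column(range, properties)
    @column_annotations << [range, properties]
    self
  end

  # aln.sequence(3) fetches a row; with options it annotates that row.
  def sequence(index, options = nil)
    seq = @sequences.fetch(index)
    return seq if options.nil?

    properties = options.dup
    range = properties.delete(:position)
    seq.annotate(range, properties)
  end
end

# Made-up data: two short rows, quality scores for the first only.
rows = [
  AnnotatedSequence.new('ATGGCCATTG', [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]),
  AnnotatedSequence.new('ATG-CCATTG')
]
aln = Alignment.new(rows)
aln.column(1..4, :homology => HIGH)
aln.sequence(0, :position => 0..2, :orf => true)
puts aln.sequence(0).quality(:position => 2..4).inspect
```

The sketch leaves out aln.delete on purpose; how deletion should interact with existing annotations is a separate design question.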
State would be maintained inside Alignment(Column), Sequence or Nucleotide/Aminoacid.

Pj.

From biopython at maubp.freeserve.co.uk Mon Apr 26 12:44:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 17:44:04 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl>
Message-ID:

On Mon, Apr 26, 2010 at 4:41 PM, Matt wrote:
>
>> Each nucleotide/aminoacid has properties by itself. Do gaps have
>> properties?
>>
>
> Not sure if you mean a single gap ("-") or any gap between
> nucleotides. If the latter then it would be nice if gaps had
> properties like
>
> * length
> * at_beginning_sequence (precedes all bases)
> * at_end_of_sequence (found at end of all bases)
>
> Another distinction: gaps found in a typical MSA may indicate real
> gaps (as inferred from evolutionary events in an alignment), or
> missing data, depending on how sloppy/precise a person is.
>
> cheers,
> Matt

It is worse than that - you have leading padding, trailing padding and insertions which are all often represented with the same character. Then there are special cases like HMMER which uses two gap characters (- and .) depending on the model state:
http://lists.open-bio.org/pipermail/biopython/2010-April/006400.html

Then you have the other meaning of dot (.) as in PHYLIP format and some visualisations meaning same as the first sequence.

Plus, to keep life interesting, some formats (e.g. ACE) use the asterisk (*) as the gap character (usually a stop symbol when working with protein sequences).
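To make the distinction between leading padding, trailing padding and internal gaps concrete, here is a rough sketch. It rests on assumptions: GAP_CHARACTERS is a small illustrative table built only from the formats mentioned above (not an exhaustive or authoritative list), GapRun and gap_runs are hypothetical names, and the PHYLIP "same as first sequence" dot is deliberately not handled.

```ruby
# Hypothetical sketch; the format table is illustrative, not exhaustive.
GAP_CHARACTERS = {
  :default => ['-'],
  :hmmer   => ['-', '.'],  # two gap symbols, depending on model state
  :ace     => ['*']        # asterisk is the gap here, not a stop symbol
}.freeze

# A run of consecutive gap characters within one alignment row.
GapRun = Struct.new(:start, :length, :kind)

# Returns the gap runs of a row, each tagged :leading, :trailing or
# :internal according to where it sits in the row.
def gap_runs(row, format = :default)
  gaps = GAP_CHARACTERS.fetch(format)
  runs = []
  index = 0
  while index < row.length
    if gaps.include?(row[index])
      start = index
      index += 1 while index < row.length && gaps.include?(row[index])
      kind = if start.zero?
               :leading
             elsif index == row.length
               :trailing
             else
               :internal
             end
      runs << GapRun.new(start, index - start, kind)
    else
      index += 1
    end
  end
  runs
end

gap_runs('--AC-.GT*', :hmmer).each do |run|
  puts "#{run.kind} gap at #{run.start}, length #{run.length}"
end
```

Whether a run means a real indel or missing data cannot be read off the characters; that would have to be recorded as a property of the run.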
Peter

From diapriid at gmail.com Mon Apr 26 13:00:06 2010
From: diapriid at gmail.com (Matt)
Date: Mon, 26 Apr 2010 13:00:06 -0400
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl>
Message-ID:

Another conundrum I just thought of: in some cases gaps ("-") are used both to infer evolutionary events (aligning columns) and as space fillers for data that are contiguous with aligned partitions but not themselves considered to be aligned (e.g. a partition in an MSA containing a loop in some structural alignments). Again, this might not be best practice, but it does happen.

While I don't know how things are modeled right now (i.e. this may already be the case), it seems that gaps should not be properties of sequences, but rather properties of MSAs, as they only really exist when two or more sequences are being compared.

M

On Mon, Apr 26, 2010 at 12:44 PM, Peter wrote:
> On Mon, Apr 26, 2010 at 4:41 PM, Matt wrote:
>>
>>> Each nucleotide/aminoacid has properties by itself. Do gaps have
>>> properties?
>>>
>>
>> Not sure if you mean a single gap ("-") or any gap between
>> nucleotides. If the latter then it would be nice if gaps had
>> properties like
>>
>> * length
>> * at_beginning_sequence (precedes all bases)
>> * at_end_of_sequence (found at end of all bases)
>>
>> Another distinction: gaps found in a typical MSA may indicate real
>> gaps (as inferred from evolutionary events in an alignment), or
>> missing data, depending on how sloppy/precise a person is.
>>
>> cheers,
>> Matt
>
> It is worse than that - you have leading padding, trailing padding
> and insertions which are all often represented with the same
> character. Then there are special cases like HMMER which
> uses two gap characters (- and .) depending on the model state:
> http://lists.open-bio.org/pipermail/biopython/2010-April/006400.html
>
> Then you have the other meaning of dot (.)
> as in PHYLIP format and some visualisations meaning same as the first sequence.
>
> Plus, to keep life interesting, some formats (e.g. ACE) use the
> asterisk (*) as the gap character (usually a stop symbol when
> working with protein sequences).
>
> Peter
>

From rutgeraldo at gmail.com Mon Apr 26 13:25:31 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Mon, 26 Apr 2010 18:25:31 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To: <20100426163055.GC28231@thebird.nl>
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl>
Message-ID:

What you describe below is not what I meant, though it's also very important w.r.t. preserving the provenance of annotations. We're thinking in a number of different directions and so the requirements are starting to creep in :-)

What I meant is, by long-winded example, the following: imagine you're studying the phylogeny of lemurs, and you want to look at morphological and behavioral characters. Here's what a character state matrix might look like:

D._madagascariensis  -  0
H._aureus            4  1
H._simus             6  ?
H._griseus           ?  1

The first column captures the number of teeth in the lower-jaw toothcomb. Some lemurs use the incisors of the lower jaw as a grooming apparatus, and they have (I believe) either 4 or 6 teeth in that "comb". D._madagascariensis does not have this apparatus at all, so its state for this column could be coded as "-", conceptually a bit like a gap in an alignment, interpreted as "does not apply". We simply have no data for H._griseus, so we code it as "?", meaning "missing".

The second column captures activity pattern, such that "0" means "nocturnal", and "1" means "diurnal". You can imagine that we might not know when H._simus is active, so a state "?" could be valid for this column, but a state "-" definitely isn't: the animals are either nocturnal or diurnal (or we don't know exactly which one of the two applies).
To some extent, a matrix with such characters would be like an alignment, and in many cases you would analyze this data using the same tools for phylogenetic inference, like paup, phylip, mrbayes, etc. Also, the same data formats (nexus/nexml, phylip) describe both these matrices and alignments. So it would make sense to implement them as objects within the same class hierarchy, and the projects where I've looked at the insides (Bio::Phylo, Mesquite, DendroPy, CIPRES, JEBL) all do this, though not all in the same way.

BioPerl does not really do this in that it has no explicit concept of categorical character state matrices beyond molecular ones. It's hard to see how something like this could be retrofitted elegantly into BioPerl, which is why I am ringing the alarm bells now :-)

The problem that needs to be solved is to come up with a way to describe for each column which state symbols are allowable (and potentially annotate them) without creating a baroque beast. That beast can stay in its cage anyway for the 90% of the time where we're dealing with molecular data, where all columns have the same semantics and for which we have no further annotations per column.

The way I've dealt with this in the past is to create an object that has a map where every key is a state symbol, and the values are lists of zero or more other possible states that the key can stand for (i.e. N maps onto A, C, G, T but "-" maps onto an empty list). In extreme cases, such as the morphological matrix I described, you would have one such object attached to every column in the matrix. But for molecular data the object would be a singleton for the whole alignment.

If you buy this line of thinking (YMMV), you might agree that a single sequence may need complicated helper objects and coordinate systems to keep track of the sort of mapping semantics that come into play once the sequence becomes homologized with others as building blocks for alignments/matrices.
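The mapping object described above can be sketched in a few lines. This is an illustration under assumptions, not the Bio::Phylo implementation: StateSet and CharacterMatrix are hypothetical names, unambiguous symbols are assumed to map onto themselves, only a few IUPAC codes are listed, and "?" in the toothcomb column is assumed to stand for either observed tooth count.

```ruby
# Hypothetical sketch; StateSet and CharacterMatrix are illustrative names.

# Maps every allowed symbol onto the list of states it can stand for.
class StateSet
  def initialize(mapping)
    @mapping = mapping.freeze
  end

  def valid?(symbol)
    @mapping.key?(symbol)
  end

  # Fundamental states behind a symbol; an empty list for a gap.
  def resolve(symbol)
    @mapping.fetch(symbol)
  end
end

# Molecular data: one shared object for every column (subset of IUPAC).
DNA_STATES = StateSet.new(
  'A' => ['A'], 'C' => ['C'], 'G' => ['G'], 'T' => ['T'],
  'R' => ['A', 'G'], 'Y' => ['C', 'T'],
  'N' => ['A', 'C', 'G', 'T'],
  '-' => []
)

# Morphological data: one object per column.
TOOTHCOMB = StateSet.new('4' => ['4'], '6' => ['6'], '?' => ['4', '6'], '-' => [])
ACTIVITY  = StateSet.new('0' => ['0'], '1' => ['1'], '?' => ['0', '1'])

class CharacterMatrix
  def initialize(column_states)
    @column_states = column_states
    @rows = {}
  end

  # Rejects any cell whose symbol is not allowed in its column.
  def add_row(name, cells)
    cells.each_with_index do |cell, column|
      next if @column_states.fetch(column).valid?(cell)

      raise ArgumentError, "state #{cell.inspect} is not allowed in column #{column}"
    end
    @rows[name] = cells
  end

  def resolve(name, column)
    @column_states.fetch(column).resolve(@rows.fetch(name).fetch(column))
  end
end

lemurs = CharacterMatrix.new([TOOTHCOMB, ACTIVITY])
lemurs.add_row('D._madagascariensis', ['-', '0'])
lemurs.add_row('H._aureus', ['4', '1'])
lemurs.add_row('H._simus', ['6', '?'])
lemurs.add_row('H._griseus', ['?', '1'])

# Same StateSet instance reused for all four columns.
dna = CharacterMatrix.new(Array.new(4, DNA_STATES))
dna.add_row('seq1', %w[A C N -])

puts lemurs.resolve('H._simus', 1).inspect
puts dna.resolve('seq1', 2).inspect
```

With this layout a "-" in the activity column is rejected when the row is added, while the molecular case pays for only one StateSet object no matter how many columns there are.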
I hope all this makes some amount of sense :-)

Rutger

> I think I understand what you mean here. The way I see it is that the
> sequences are immutable lists of nucleotides/amino acids. State can
> be at row, column or individual matrix point level.
>
> I guess it is impossible to impose the way people want to use the data
> structure. Either they use state as a loose component (could be a
> matrix) projected on the sequences, or (if our format allows it) they
> could maintain state at each of the three levels (row, column, point).
>
> In my case I would like to add state into the data structure (one
> advantage could be that it would be relatively easy to export, also to
> RDF). We have an alignment:
>
>   aln = Alignment.new(sequences)
>
> I would like to annotate columns 1..4 as having high homology
>
>   aln.column(1..4, :homology=>HIGH)
>
> maybe I want to remove a part of sequence 3 and mark it as such
>
>   aln.delete(3, 20..30)
>   aln.sequence(3, :position=>20..30, :deleted=>true)
>
> or indicate an ORF
>
>   aln.sequence(3, :position=>40..65, :orf=>true)
>
> and fetch information, like quality scores
>
>   sequence = aln.sequence(3)
>   quality = sequence.quality(:position=>40..65)
>
> Any variations thereof. State would be maintained inside
> Alignment(Column), Sequence or Nucleotide/Aminoacid.
>
> Pj.
>

--
Dr. Rutger A.
Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From pjotr.public14 at thebird.nl Mon Apr 26 13:38:30 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 26 Apr 2010 19:38:30 +0200
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl>
Message-ID: <20100426173830.GA30424@thebird.nl>

On Mon, Apr 26, 2010 at 06:25:31PM +0100, Rutger Vos wrote:
> If you buy this line of thinking (YMMV), you might agree that a single
> sequence may need complicated helper objects and coordinate systems to
> keep track of the sort of mapping semantics that come into play once
> the sequence becomes homologized with others as building blocks for
> alignments/matrices.
>
> I hope all this makes some amount of sense :-)

Ah, it does make sense, coming from phylogeny. And it is in fact very useful for some other project I have in mind, that has to do with mosaic evolution.

Maybe Ruby is not the greatest language for this. You would want built-in type checking and generics. And objects passed into the structure.

However, I think these ideas can be combined. The only real difference I see is that sequences have some type of objects (annotations) that apply to a range of nucleotides/amino acids.

I think the concept of range is valuable. And rather than some elaborate scheme of calculating/mapping positions for every operation, I would champion immutable values that just get copied to a new structure. E.g. if one would like to delete some data we would not do that in place and try to correct every single position of linked helper objects every time. The heuristic should be to create a new copy with its helpers, that is correct.

Pj.
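The copy-instead-of-patch heuristic can be illustrated as follows. This is a sketch under assumptions, not an agreed design: ImmutableSequence is a hypothetical class, features are plain [Range, Hash] pairs with 0-based inclusive positions, and the deleted range is assumed to lie inside the sequence.

```ruby
# Hypothetical sketch; ImmutableSequence is an illustrative name.
class ImmutableSequence
  attr_reader :residues, :features

  # features: Array of [Range, Hash] pairs with 0-based positions.
  def initialize(residues, features = [])
    @residues = residues.dup.freeze
    @features = features.map { |range, props| [range, props.dup.freeze].freeze }.freeze
    freeze
  end

  # Returns a NEW sequence without the given range; the receiver is
  # untouched. Features are rebuilt for the copy: kept, shifted,
  # clipped, or dropped when they fall entirely inside the deletion.
  def delete(range)
    first = range.min
    last = range.max
    removed = last - first + 1
    kept = @residues[0...first] + @residues[(last + 1)..-1]
    rebuilt = @features.map do |span, props|
      b = new_begin(span.min, first, last, removed)
      e = new_end(span.max, first, last, removed)
      e < b ? nil : [b..e, props]
    end
    ImmutableSequence.new(kept, rebuilt.compact)
  end

  private

  # New start of a feature: unchanged before the deletion, shifted
  # after it, snapped to the cut when it started inside the deletion.
  def new_begin(pos, first, last, removed)
    return pos if pos < first

    pos > last ? pos - removed : first
  end

  # New end of a feature, by the same rules seen from the other side.
  def new_end(pos, first, last, removed)
    return pos if pos < first

    pos > last ? pos - removed : first - 1
  end
end

seq = ImmutableSequence.new(
  'ABCDEFGHIJ',
  [[1..2, { :orf => true }], [4..7, { :homology => :high }], [3..5, { :note => 'gone' }]]
)
shorter = seq.delete(3..5)

puts shorter.residues
puts shorter.features.inspect
puts seq.residues
```

All position arithmetic happens once, while the copy is built, so no helper object ever has to be corrected in place afterwards.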
From biopython at maubp.freeserve.co.uk Mon Apr 26 13:56:26 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Apr 2010 18:56:26 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl>
Message-ID:

On Mon, Apr 26, 2010 at 6:25 PM, Rutger Vos wrote:
>
> What you describe below is not what I meant, though it's also very
> important w.r.t. preserving the provenance of annotations. We're
> thinking in a number of different directions and so the requirements
> are starting to creep in :-)
>
> What I meant is, by long-winded example, the following: imagine you're
> studying the phylogeny of lemurs, and you want to look at
> morphological and behavioral characters. Here's what a character state
> matrix might look like:
>
> D._madagascariensis  -  0
> H._aureus            4  1
> H._simus             6  ?
> H._griseus           ?  1
>
> The first column captures the number of teeth in the lower-jaw
> toothcomb. Some lemurs use the incisors of the lower jaw as a grooming
> apparatus, and they have (I believe) either 4 or 6 teeth in that
> "comb". D._madagascariensis does not have this apparatus at all, so
> its state for this column could be coded as "-", conceptually a bit
> like a gap in an alignment, interpreted as "does not apply".

Or perhaps as a zero?

> To some extent, a matrix with such characters would be like an
> alignment, and in many cases you would analyze this data using the
> same tools for phylogenetic inference, like paup, phylip, mrbayes,
> etc. Also, the same data formats (nexus/nexml, phylip) describe both
> these matrices and alignments.

In these file formats, am I right in thinking the non-sequence based characteristics are all still encoded by single letters? e.g. single digits. If so, that still allows the data to be held as simple strings.
Peter

From rutgeraldo at gmail.com Mon Apr 26 13:59:01 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Mon, 26 Apr 2010 18:59:01 +0100
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl>
Message-ID:

> In these file formats, am I right in thinking the non-sequence based
> characteristics are all still encoded by single letters? e.g. single
> digits. If so, that still allows the data to be held as simple strings.

They could be floats.

From diapriid at gmail.com Mon Apr 26 14:14:35 2010
From: diapriid at gmail.com (Matt)
Date: Mon, 26 Apr 2010 14:14:35 -0400
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl>
Message-ID:

Or letters.

On Mon, Apr 26, 2010 at 1:59 PM, Rutger Vos wrote:
>> In these file formats, am I right in thinking the non-sequence based
>> characteristics are all still encoded by single letters? e.g. single
>> digits. If so, that still allows the data to be held as simple strings.
>
> They could be floats.

From czmasek at burnham.org Tue Apr 27 00:47:57 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Mon, 26 Apr 2010 21:47:57 -0700
Subject: [BioRuby] Welcome to the 2010 Phyloinformatics Summer of Code (BioRuby/duplications)
Message-ID: <4BD66C7D.8010902@burnham.org>

Hi, Sara:

Congratulations on having been selected as a student for the 2010 Google Summer of Code!

The next several weeks until May 24 are called the "community bonding period."
Although you won't be spending your full time on your project yet during that time, you should get comfortable with setting aside several hours each week to get the most out of the summer, with a focus on the following.

1) Please subscribe yourself to the BioRuby mailing list at the following URL:
http://lists.open-bio.org/mailman/listinfo/bioruby
and introduce yourself and the project. This is your first action item, and the sooner you can accomplish it the smoother we can get the rest set up and communicated. You can use your favorite email (it need not be your gmail address).

2) Familiarize yourself with the Git distributed version control system, as well as GitHub. See:
http://git-scm.com/
http://github.com/

3) If you haven't done so yet, set yourself up with the BioRuby code base. You need to set up your own GitHub repository and clone the BioRuby main trunk (from http://github.com/bioruby/bioruby). For an example of a cloned fork, see (from Diana):
http://github.com/latvianlinuxgirl/bioruby/

4) Install the Archaeopteryx tree viewer, in order to display phyloXML formatted evolutionary trees (Archaeopteryx can also execute the SDI algorithms directly), available at:
http://www.phylosoft.org/archaeopteryx/

5) Obtain the forester Java source, which contains an implementation of the SDI algorithms. Instructions for this can be found at (for the Eclipse IDE):
http://aptxevo.wordpress.com/2010/04/19/importing-forester-into-eclipse-ide/
Also see:
http://www.phylosoft.org/forester/applications/sdi/

6) I strongly recommend that you set up a blog (Blogger? WordPress?) where you can report your progress and discuss issues and problems. For a very good example (again, from Diana) see:
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

7) Reading, reading, reading...
Please obtain the papers listed at: http://evogsoc2010.wordpress.com/2010/03/25/references-for-gene-duplications-proposal/ (needless to say, you don't need to bother with the paper written in German). Also, obtain copies of: "Programming Ruby 1.9: The Pragmatic Programmers' Guide (Facets of Ruby)" (http://www.amazon.com/Programming-Ruby-1-9-Pragmatic-Programmers/dp/1934356085/) and "The Ruby Programming Language" (http://www.amazon.com/Ruby-Programming-Language-David-Flanagan/dp/0596516177/) and "Ruby Best Practices" (http://www.amazon.com/Ruby-Best-Practices-Gregory-Brown/dp/0596523009/) Strangely, this is available for free at: http://rubybestpractices.com/ 8) Finally, update yourself on the details of your project and related efforts. As you do this, together with your mentor review and revise your project plan to become what you will actually be working off of come the week of May 24. This will probably be the part that requires the most time. There'll be more communication on this in the near future. *You should be completely done with these tasks* and ready to go and commit code by May 24. Again, congratulations, welcome to our Summer of Code, and I'm looking forward to working with you! Christian PS: Parts of this text are copied from Hilmar's instructions. Thank you! From pjotr.public14 at thebird.nl Tue Apr 27 03:38:13 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 27 Apr 2010 09:38:13 +0200 Subject: [BioRuby] Alignment plugin In-Reply-To: References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl> <20100426163055.GC28231@thebird.nl> Message-ID: <20100427073813.GA2961@thebird.nl> Hi Peter, On Mon, Apr 26, 2010 at 06:56:26PM +0100, Peter wrote: > In these file format, am I right in thinking the non-sequence based > characteristics are all still encoded by single letters? e.g. single > digits. If so, that still allows the data to be held as simple strings. 
Strings have an elegance to them, in particular because so much built-in functionality of Ruby and Python allows for concise, readable code (like regular expressions). However, there are shortcomings, in particular where we aim to test correctness and add annotations. I would rather have a nucleotide as an object than as a string character.

I think we can have it both ways. But don't try to enforce a String on something that is conceptually something else.

BTW The GSoC Perl alignment project has been approved and they are also rethinking rather fundamental implementation details.

I don't intend to replace Bio::Sequence, but rather come up with a parallel alternative.

Pj.

From pjotr.public14 at thebird.nl Tue Apr 27 03:41:14 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 27 Apr 2010 09:41:14 +0200
Subject: [BioRuby] Alignment plugin
In-Reply-To:
References: <20100426125347.GB23619@thebird.nl> <20100426145203.GA25758@thebird.nl> <20100426150410.GA26465@thebird.nl>
Message-ID: <20100427074114.GB2961@thebird.nl>

Jan, I notice you have the Range concept in Bio::Graphics, e.g.

http://bio-graphics.rubyforge.org/classes/Bio/Feature.html

You introduce names for identifying features. I am thinking of real objects.

Pj.

On Mon, Apr 26, 2010 at 04:15:59PM +0100, Jan Aerts wrote:
> True. Also keep in the back of your head that one of the GSoC projects this
> year (hopefully) will focus on adding annotation functionality to bioruby
> using RDF. So any object can have arbitrary annotations (although we might
> decide to use a standardized vocabulary).
>
> jan.
>
> On 26 April 2010 16:04, Pjotr Prins wrote:
>
> > Maybe we should start defining a basic sequence object. What would we
> > want from it, what should be core and what should be mixed in?
> >
> > Alignments and secondary structures should build on that.
> >
> > I think we have a chance of doing it right - after all our
> > experience, and that of the other Bio* projects.
> >
> > Pj.
> > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > From sararayburn at gmail.com Tue Apr 27 23:30:19 2010 From: sararayburn at gmail.com (Sara Rayburn) Date: Tue, 27 Apr 2010 22:30:19 -0500 Subject: [BioRuby] Welcome to the 2010 Phyloinformatics Summer of Code (BioRuby/duplications) In-Reply-To: <4BD66C7D.8010902@burnham.org> References: <4BD66C7D.8010902@burnham.org> Message-ID: <536BAA78-CD71-4AE7-8FF7-527AD8E8E93E@gmail.com> Hello all, Thank you for selecting me to participate in Google Summer of Code with BioRuby! I'm looking forward to working with the BioRuby community and to getting involved with open source development. This summer I'll be implementing a gene duplication inference algorithm for both binary and non-binary species trees. My first steps are going to be those in the list below, starting with this email introducing myself. Again, thanks, and I'm looking forward to working on the project. Sara Rayburn On Apr 26, 2010, at 11:47 PM, Christian M Zmasek wrote: > Hi, Sara: > > Congratulations to having been selected as a student to the 2010 Google Summer of Code! > > The next several weeks until May 24 are called the "community bonding period." Although you won't be spending your full time on your project yet during that time, you get comfortable with setting aside several hours each week to get the most out of the summer, with a focus on the following. > > > 1) Please subscribe yourself to the BioRuby mailing list at the following URL: http://lists.open-bio.org/mailman/listinfo/bioruby > and introduce yourself and the project. > This is your first action item, and the sooner you can accomplish it the smoother we can get the rest set up and communicated. You can use your favorite email (it need not be your gmail address). 
> > 2) Familiarize yourself with the Git distributed version control system, as well as GitHub. > See: > http://git-scm.com/ > http://github.com/ > > 3) If you haven't done so yet, set yourself up with the BioRuby code base. You need to set up your own GitHub repository and clone the BioRuby main trunk (from http://github.com/bioruby/bioruby), for an example of a cloned fork, see (from Diana): http://github.com/latvianlinuxgirl/bioruby/ > > 4) Install the Archaeopteryx tree viewer, in order to display phyloXML formatted evolutionary trees (Archaeopteryx also allows to directly execute the SDI algorithms), available at: > http://www.phylosoft.org/archaeopteryx/ > > 5) Obtain the forester Java source which contains an implementation of the SDI algorithms, instructions for this can be found at (for the Eclipse IDE): > http://aptxevo.wordpress.com/2010/04/19/importing-forester-into-eclipse-ide/ > Also see: > http://www.phylosoft.org/forester/applications/sdi/ > > 6) I strongly recommend you to set up a blog (blogger? wordpress?) where you can report your progress and discuss issues and problems. > For a very good example (again, from Diana) see: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > > 7) Reading, reading, reading... > Please obtain the papers listed at: > http://evogsoc2010.wordpress.com/2010/03/25/references-for-gene-duplications-proposal/ > (needless to say, you don't need to bother with the paper written in German). 
> > Also, obtain copies of: > > "Programming Ruby 1.9: The Pragmatic Programmers' Guide (Facets of Ruby)" > (http://www.amazon.com/Programming-Ruby-1-9-Pragmatic-Programmers/dp/1934356085/) > > and > > "The Ruby Programming Language" > (http://www.amazon.com/Ruby-Programming-Language-David-Flanagan/dp/0596516177/) > > and > > "Ruby Best Practices" > (http://www.amazon.com/Ruby-Best-Practices-Gregory-Brown/dp/0596523009/) > Strangely, this is available for free at: http://rubybestpractices.com/ > > 8) Finally, update yourself on the details of your project and related efforts. As you do this, together with your mentor review and revise your project plan to become what you will actually be working off of come the week of May 24. This will probably be the part that requires the most time. There'll be more communication on this in the near future. > > > *You should be completely done with these tasks* and ready to go and > commit code by May 24. > > > Again, congratulations, welcome to our Summer of Code, and I'm looking > forward to working with you! > > Christian > > > PS: Parts of this text are copied from Hilmar's instructions. Thank you! From k.hayashi.info at gmail.com Wed Apr 28 00:12:42 2010 From: k.hayashi.info at gmail.com (Kazuhiro Hayashi) Date: Wed, 28 Apr 2010 13:12:42 +0900 Subject: [BioRuby] participation in GSoC 2010 Message-ID: Hi all: My name is Kazuhiro Hayashi. I'm a graduate student at The University of Tokyo, majoring in Computational Biology. The proposal which I submitted to GSoC 2010 was accepted yesterday. The topic of the proposal is "Ruby 1.9.2 support of BioRuby". I would like to make BioRuby work in Ruby 1.9.2 . Currently, a lot of classes in BioRuby lack unit tests. First, I'll make them in order to confirm behaviors of the classes. Then, modify the classes as they work in Both Ruby 1.8.7 and 1.9.2 . I'll work on the documentation too. the abstract of my proposal is here. 
http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2010/obf/t127230761332 I'm glad I can work on this project as one of BioRuby developers. Thank you for selecting me. Kazuhiro -- Kazuhiro Hayashi Department of Computational Biology, The University of Tokyo email: k_hayashi at cb.k.u-tokyo.ac.jp tel: 04-7136-3988 From pjotr.public14 at thebird.nl Wed Apr 28 01:39:13 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 28 Apr 2010 07:39:13 +0200 Subject: [BioRuby] participation in GSoC 2010 In-Reply-To: References: Message-ID: <20100428053913.GA17564@thebird.nl> Welcome Kazuhiro, I am glad you take on this job. My main concern would be we don't sprinkle 'if-then' blocks throughout the code base. I think the challenge is to have one code base for Ruby 1.8, 1.9 and JRuby without having code exceptions. That should be the top priority. Where it is not possible to avoid two code paths, find ways of isolating the issue in one single 'architecture' file - e.g. in ./lib/bio/ruby1.8.rb and ./lib/bio/ruby1.9.rb. Only in the few instances there are real performance concerns I would diverge from such a strategy. I don't know how they handle it in Rails, but I would take hints from there. Pj. On Wed, Apr 28, 2010 at 01:12:42PM +0900, Kazuhiro Hayashi wrote: > Hi all: > > My name is Kazuhiro Hayashi. > I'm a graduate student at The University of Tokyo, majoring in > Computational Biology. > The proposal which I submitted to GSoC 2010 was accepted yesterday. > > The topic of the proposal is "Ruby 1.9.2 support of BioRuby". > I would like to make BioRuby work in Ruby 1.9.2 . > Currently, a lot of classes in BioRuby lack unit tests. > First, I'll make them in order to confirm behaviors of the classes. > Then, modify the classes as they work in Both Ruby 1.8.7 and 1.9.2 . > I'll work on the documentation too. > > the abstract of my proposal is here. 
> http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2010/obf/t127230761332 > > I'm glad I can work on this project as one of BioRuby developers. > Thank you for selecting me. > > Kazuhiro > > -- > Kazuhiro Hayashi > Department of Computational Biology, The University of Tokyo > email: k_hayashi at cb.k.u-tokyo.ac.jp > tel: 04-7136-3988 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From czmasek at burnham.org Wed Apr 28 14:23:11 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 28 Apr 2010 11:23:11 -0700 Subject: [BioRuby] participation in GSoC 2010 In-Reply-To: <20100428053913.GA17564@thebird.nl> References: <20100428053913.GA17564@thebird.nl> Message-ID: <4BD87D0F.60606@burnham.org> Hi: This is probably a naive question, but I was wondering if the plan is to drop 1.8 support eventually, or is the idea to maintain support for 1.8 indefinitely? Christian Pjotr Prins wrote: > Welcome Kazuhiro, > > I am glad you take on this job. My main concern would be we don't > sprinkle 'if-then' blocks throughout the code base. I think the > challenge is to have one code base for Ruby 1.8, 1.9 and JRuby without > having code exceptions. That should be the top priority. > > Where it is not possible to avoid two code paths, find ways of > isolating the issue in one single 'architecture' file - e.g. in > ./lib/bio/ruby1.8.rb and ./lib/bio/ruby1.9.rb. > > Only in the few instances there are real performance concerns I would > diverge from such a strategy. > > I don't know how they handle it in Rails, but I would take hints from > there. > > Pj. > > On Wed, Apr 28, 2010 at 01:12:42PM +0900, Kazuhiro Hayashi wrote: >> Hi all: >> >> My name is Kazuhiro Hayashi. >> I'm a graduate student at The University of Tokyo, majoring in >> Computational Biology. 
>> The proposal which I submitted to GSoC 2010 was accepted yesterday.
>>
>> The topic of the proposal is "Ruby 1.9.2 support of BioRuby".
>> I would like to make BioRuby work in Ruby 1.9.2 .
>> Currently, a lot of classes in BioRuby lack unit tests.
>> First, I'll make them in order to confirm behaviors of the classes.
>> Then, modify the classes as they work in Both Ruby 1.8.7 and 1.9.2 .
>> I'll work on the documentation too.
>>
>> the abstract of my proposal is here.
>> http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2010/obf/t127230761332
>>
>> I'm glad I can work on this project as one of BioRuby developers.
>> Thank you for selecting me.
>>
>> Kazuhiro
>>
>> --
>> Kazuhiro Hayashi
>> Department of Computational Biology, The University of Tokyo
>> email: k_hayashi at cb.k.u-tokyo.ac.jp
>> tel: 04-7136-3988
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From bonnalraoul at ingm.it Thu Apr 29 05:35:53 2010
From: bonnalraoul at ingm.it (Raoul Bonnal)
Date: Thu, 29 Apr 2010 11:35:53 +0200
Subject: [BioRuby] R: participation in GSoC 2010
Message-ID: <6480b5b3-c265-408e-9629-9a74bf5812e7@ingm.it>

> -----Original Message-----
> From: bioruby-bounces at lists.open-bio.org
> [mailto:bioruby-bounces at lists.open-bio.org] On behalf of Christian M Zmasek
> Sent: Wednesday, 28 April 2010 20:23
> To: bioruby at lists.open-bio.org
> Subject: Re: [BioRuby] participation in GSoC 2010
>
> Hi:
>
> This is probably a naive question, but I was wondering if the plan is to
> drop 1.8 support eventually, or is the idea to maintain support for 1.8
> indefinitely?
Indefinitely doesn't make sense, but there is an official roadmap for 1.8.8:

http://redmine.ruby-lang.org/projects/roadmap/ruby-18

The pure 1.8 version is 1.8.6. 1.8.7 has some backports from 1.9, and many more backports will be in 1.8.8. The idea is to reduce the difficulty of migrating to 1.9.

Would it be reasonable to drop 1.8 support in one year?

From kpatil at science.uva.nl Thu Apr 29 07:44:19 2010
From: kpatil at science.uva.nl (K. Patil)
Date: Thu, 29 Apr 2010 13:44:19 +0200 (CEST)
Subject: [BioRuby] feature request: tree visualization
In-Reply-To: <6480b5b3-c265-408e-9629-9a74bf5812e7@ingm.it>
References: <6480b5b3-c265-408e-9629-9a74bf5812e7@ingm.it>
Message-ID: <1835.139.19.75.1.1272541459.squirrel@webmail.science.uva.nl>

Hi,

Recently I came across a nice Python library that allows customized visualization of trees (especially in a phylogenetic context):

http://pypi.python.org/pypi/ete2/

Is there anything similar for BioRuby, or something planned for the future?

best

From k.hayashi.info at gmail.com Thu Apr 29 10:41:14 2010
From: k.hayashi.info at gmail.com (Kazuhiro Hayashi)
Date: Thu, 29 Apr 2010 23:41:14 +0900
Subject: [BioRuby] participation in GSoC 2010
In-Reply-To: <4BD87D0F.60606@burnham.org>
References: <20100428053913.GA17564@thebird.nl> <4BD87D0F.60606@burnham.org>
Message-ID:

Hi,

Thank you for the replies.

Honestly, I'm not sure what kinds of tests are used in software development. I am studying testing and will consider the best approach during the community bonding period.

At the moment, I am planning to keep the code for 1.8.7, 1.9.2 and, if possible, JRuby in a single code base. I don't understand what the 'architecture' file is. Could you explain it in a little more detail?

Kazuhiro

2010/4/29 Christian M Zmasek :
> Hi:
>
> This is probably a naive question, but I was wondering if the plan is to
> drop 1.8 support eventually, or is the idea to maintain support for 1.8
> indefinitely?
> > Christian > > > Pjotr Prins wrote: >> >> Welcome Kazuhiro, >> >> I am glad you take on this job. My main concern would be we don't >> sprinkle 'if-then' blocks throughout the code base. I think the >> challenge is to have one code base for Ruby 1.8, 1.9 and JRuby without >> having code exceptions. That should be the top priority. >> Where it is not possible to avoid two code paths, find ways of >> isolating the issue in one single 'architecture' file - e.g. in >> ./lib/bio/ruby1.8.rb and ./lib/bio/ruby1.9.rb. >> >> Only in the few instances there are real performance concerns I would >> diverge from such a strategy. >> >> I don't know how they handle it in Rails, but I would take hints from >> there. >> >> Pj. >> >> On Wed, Apr 28, 2010 at 01:12:42PM +0900, Kazuhiro Hayashi wrote: >>> >>> Hi all: >>> >>> My name is Kazuhiro Hayashi. >>> I'm a graduate student at The University of Tokyo, majoring in >>> Computational Biology. >>> The proposal which I submitted to GSoC 2010 was accepted yesterday. >>> >>> The topic of the proposal is "Ruby 1.9.2 support of BioRuby". >>> I would like to make BioRuby work in Ruby 1.9.2 . >>> Currently, a lot of classes in BioRuby lack unit tests. >>> First, I'll make them in order to confirm behaviors of the classes. >>> Then, modify the classes as they work in Both Ruby 1.8.7 and 1.9.2 . >>> I'll work on the documentation too. >>> >>> the abstract of my proposal is here. >>> >>> http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2010/obf/t127230761332 >>> >>> I'm glad I can work on this project as one of BioRuby developers. >>> Thank you for selecting me. 
>>>
>>> Kazuhiro
>>>
>>> --
>>> Kazuhiro Hayashi
>>> Department of Computational Biology, The University of Tokyo
>>> email: k_hayashi at cb.k.u-tokyo.ac.jp
>>> tel: 04-7136-3988
>>> _______________________________________________
>>> BioRuby Project - http://www.bioruby.org/
>>> BioRuby mailing list
>>> BioRuby at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>

--
Kazuhiro Hayashi
Department of Computational Biology, The University of Tokyo
email: k_hayashi at cb.k.u-tokyo.ac.jp
tel: 04-7136-3988

From pjotr.public14 at thebird.nl Fri Apr 30 04:10:29 2010
From: pjotr.public14 at thebird.nl (pjotr.public14 at thebird.nl)
Date: Fri, 30 Apr 2010 10:10:29 +0200
Subject: [BioRuby] participation in GSoC 2010
In-Reply-To: <20100430080825.GA6555@thebird.nl>
References: <20100428053913.GA17564@thebird.nl> <4BD87D0F.60606@burnham.org> <20100430080825.GA6555@thebird.nl>
Message-ID: <20100430081029.GA7221@thebird.nl>

Hi Kazuhiro,

Please *reply* to the list.

On Thu, Apr 29, 2010 at 11:41:14PM +0900, Kazuhiro Hayashi wrote:
> At the moment, I am planning to put the code for 1.8.7 ,1.9.2 and ,if
> possible, JRuby only in one code base.
> I don't understand what the 'architecture' file is.
> Could you tell me it in a little more detail?

All 1.8 stuff goes into one file. All 1.9 in another. So there is clear separation. When running Ruby 1.8, only that file gets 'required'. In pseudo-code:

  if ruby_version<1.9
    if !isjvm?
      require 'bio/ruby-1.8'
    else
      require 'bio/ruby-jvm'
    end
  else
    require 'bio/ruby-1.9'
  end

Implementation-specific stuff will go into these files (if possible). Say you have a different println implementation. Rather than sprinkling the code base with:

  if ruby_version<1.9
    if !isjvm?
      println_1 ...
    else
      println_2 ...
    end
  else
    println_3 ...
  end

you would 'hide' that in the architecture files. So you just get one call in the source tree:

  println_arch ...

with the implementation in the different 'architecture' files.

Pj.

From bonnalraoul at ingm.it Fri Apr 30 10:47:37 2010
From: bonnalraoul at ingm.it (Raoul Bonnal)
Date: Fri, 30 Apr 2010 16:47:37 +0200
Subject: [BioRuby] Illumina Annotation
Message-ID:

Hello,

I'm going to add support for Illumina annotation, inspired by the R package lumiHumanIDMapping. Is anyone interested?

--
Raoul J.P. Bonnal
Life Science Informatics
Integrative Biology Program
Fondazione INGM
Via F. Sforza 28
20122 Milano, IT
phone: +39 02 006 623 26
fax: +39 02 006 623 46
http://www.ingm.it

From anurag08priyam at gmail.com Fri Apr 30 13:20:02 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 30 Apr 2010 22:50:02 +0530
Subject: [BioRuby] GSoC 2010 - NeXML parser/serializer and RDF API for BioRuby
Message-ID:

Hello all,

I have been selected for Google Summer of Code 2010, under NESCENT and the mentorship of Rutger Vos and Jan Aerts, to "Develop an API for NeXML I/O, and, RDF triples for BioRuby" [1]. Soon I will post drafts detailing the implementation to the list for feedback. During the development phase I plan to post updates on the list for review. This will help me produce better code and design an API that is acceptable to developers and users alike.

Looking forward to your support and an exciting "Summer of Code" :).

[1] http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2010/nescent/t127230761223

--
Anurag Priyam
2nd Year, Mechanical Engineering, IIT Kharagpur.
+91-9775550642

From donttrustben at gmail.com Thu Apr 1 00:33:27 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Thu, 1 Apr 2010 11:33:27 +1100
Subject: [BioRuby] FlatFile GFF
Message-ID:

Hi,

I have a conceptual question for the list.
When I open a gff2 file using Bio::FlatFile, the next_entry method gives me all of the lines at once (in the form of a Bio::GFF::GFF2 object).

  f = Bio::FlatFile.open(Bio::GFF::GFF2, "some.gff2")  # => Bio::FlatFile
  g = f.next_entry                                     # => Bio::GFF::GFF2 object
  g.records                                            # => array of GFF2 records

To me, this seems a little counter-intuitive. I expected to get info for a single line of the GFF file from FlatFile#next_entry.

The other problem is that the whole file must be parsed at the beginning, and this can cause memory problems when using large GFF files (e.g. the current WormBase gff2 is 2.6GB).

To get around the problem I can use File.foreach('some.gff2') and then parse each line using Bio::GFF::GFF2. I'm not sure what the situation is with other file formats.

So, my question is, could we introduce a foreach method into FlatFile that iterates (without parsing all at once, so it is light on memory) over the GFF/etc entries in the file? Ideally we could change next_entry, but I don't think that would be backwards compatible.

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From ngoto at gen-info.osaka-u.ac.jp Thu Apr 1 13:41:27 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 1 Apr 2010 22:41:27 +0900
Subject: [BioRuby] FlatFile GFF
In-Reply-To:
References:
Message-ID: <20100401134130.0FA771CBC585@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Thu, 1 Apr 2010 11:33:27 +1100 Ben Woodcroft wrote:
> Hi,
>
> I have a conceptual question for the list. When I open a gff2 file using
> Bio::FlatFile, the next_entry method gives me all of the lines at once (in
> the form of a Bio::GFF::GFF2 object).
>
> f = Bio::FlatFile.open(Bio::GFF::GFF2,"some.gff2") => Bio::FlatFile
> g = f.next_entry => Bio::GFF::GFF2 object
> g.records => array of GFF2 records
>
> To me, this seems a little counter-intuitive. I expected to get info for a
> single line of the GFF file from FlatFile#next_entry

The design of the Bio::GFF classes was determined by their first authors. I don't know much about their reasoning, but I suppose that, because GFF can have header lines, sequences in FASTA format, and relations spanning two or more lines, they may have found it easier to gather all the information in a file into a single object. Because Bio::FlatFile supports many file formats, format-specific details are sometimes omitted or "normalized".

> The other problem is that the whole file must be parsed at the beginning,
> and this can cause memory problems when using large GFF files (e.g. the
> current WormBase gff2 is 2.6GB).

To overcome the problem, the Bio::GFF classes may need to be reorganized. Bio::FlatFile is only a controller with an input buffer, and format-specific things should be implemented in the format parser and splitter classes. For now, as a workaround, use Bio::GFF::GFF2::Record directly without using Bio::FlatFile.

> To get around the problem I can use File.foreach('some.gff2') and then parse
> each line using Bio::GFF::GFF2. I'm not sure what the situation is with
> other file formats.
>
> So, my question is, could we introduce a foreach method into FlatFile that
> iterates (without parsing all at once so it is light on memory) over the
> GFF/etc entries in the file? Ideally we could change next_entry, but that
> wouldn't be backwards compatible I don't think.

I'm against this, because it is basically not a Bio::FlatFile issue but a Bio::GFF design problem, and modifying only Bio::FlatFile does not solve it. Also, the method name would be confusing, because we already have Bio::FlatFile.foreach and Bio::FlatFile#each.
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002156 (foreach)
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002168 (each)

I'm thinking of implementing another GFF parser frontend class that can be specified as a file format.

  ff = Bio::FlatFile.open(Bio::GFF::AltParser, "xxx.gff")

Alternatively, optional parameters could be introduced to Bio::FlatFile that change the parameters passed to the parser and splitter classes for the format.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From czmasek at burnham.org Fri Apr 2 19:37:14 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 2 Apr 2010 12:37:14 -0700
Subject: [BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation
In-Reply-To:
References: <4BB1387C.6090503@burnham.org>
Message-ID: <4BB6476A.20808@burnham.org>

Hi, Jure:

Indeed, you improved it a lot!

Clearly, you don't _need_ to discuss 'anticipated problems' if you don't expect any.

I probably would point out that:

"- Extend the algorithm to support non-binary species tree as described by Vernot in "Reconciliation with non-binary species tree"."

and

"- Extend the algorithm to support non-binary gene trees as described by Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction"."

are optional, especially the second one.

Regarding your obligations for the summer, please make sure that you _only_ cancel them if/after this proposal has been approved and you have been approved for it. As you know, less than half of all proposals get accepted in the end, and each proposal has many students applying for it.

Christian

Jure Triglav wrote:
> Hello all!
>
> Thank you for the thorough review Christian! I took your comments
> seriously and spent the last two days reviewing papers and reading up on
> various subjects related to the proposal. I've done a lot of work and I
> think I have refined it substantially.
I hope I have now fully > elaborated on the problem and proposed time-table. I would like to > kindly ask you to review it again and point out any remaining issues. > The only thing from your list of requirements for the proposed > time-table that I have trouble with are the anticipated problems and > possible alternative approaches, since all of the developed algorithms > seem rock solid to me (almost all of them have mathematical proof > included) and the only possible issue that I can think of, are the > incompatibilities of data structures between various algorithms (which > we will address in the first week, as most are very similar) and coding > errors (which we will fix, of course! :). > > Regarding my obligations for the summer, I would not hesitate to cancel > them (or as it is, simply not apply for clinical practice, as I have no > obligation yet), seeing as you consider them a serious issue. I am very > motivated to do this project and would like to do everything possible to > make it happen. Anyway, I can always apply for clinical practice on the > next term, it really is not an issue. > > Best regards to all, > Jure Triglav > > *The idea:* > > We would implement the simple and fast duplication inference algorithm > described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm > to infer gene duplication and speciation events on a gene tree". With > several billion nucleotides sequenced daily (Edwards, Hansen and > Stajich, 2009, "Bioinformatics - Tools and applications"), the > determination of protein function is mostly done automatically without > human intervention by finding the most similar sequences that already > have determined protein function (microevolutionary approach). This way > of automatically determining protein function neglects additional > available information in the form of macroevolutionary relationships. 
By > inferring these interesting relationships (speciation, duplication) from > a species and gene tree using a simple algorithm, we can gain a better > understanding of protein function. > The importance of determining these evolutionary relationships stems > from a relatively simple assumption, that if two similar genes are > thought to be related by speciation, their function is more likely to be > similar too. On the other hand, if we determine these two genes to be > related by duplications, their function is more likely to be different, > as gene duplications are powerful drivers in the evolution of new > protein function. This is because the second copy of a gene is often > free of selective pressure and accumulates mutations more rapidly than a > single copy of a gene. > The original algorithm proposed by Zmasek and Eddy supports rooted fully > binary gene and species trees, but we have decided to expand on that > scope by implementing support for unrooted gene trees (which are > produced by some bioinformatics methods and thus need to be addressed), > non-binary species trees (since a lot of species trees are non-binary, > i.e. 64% of NCBI nodes have more than 2 children (Vernot, 2007)), and > non-binary gene trees (which are also produced by some bioinformatics > methods, but represent only uncertainties, as gene trees are inherently > binary). > > *Goals:* > ** > *- *Implement the algorithm as described by Zmasek and Eddy in "A simple > algorithm to infer gene duplication and speciation events on a gene > tree", or SDI, which is designed to work on rooted gene trees and fully > binary gene and species tree. > - Allow rooting of unrooted gene trees by minimizing sum of duplications > as described by Zmasek and Eddy in "RIO: Analyzing proteomes by > automated phylogenomics using resampled inference of orthologs", and > thus extending the implementation to support unrooted gene trees. 
> - Extend the algorithm to support non-binary species tree as described
> by Vernot in "Reconciliation with non-binary species tree".
> - Extend the algorithm to support non-binary gene trees as described by
> Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree
> Reconstruction".
>
> *The work:*
>
> Some terminology:
> - g: a node in the gene tree
> - p(g): the parent of node g in the gene tree
> - s: a node in the species tree
> - M: a mapping function that links nodes of a gene tree to nodes of a
> species tree
> - roofN: an adapted mapping function for non-binary species trees
> - e(p(g),g): the edge between parent of g and g in the gene tree
> - polytomy: a species tree node with more than 2 children
>
> There are several milestones to be reached in developing this idea and
> this is the work plan I propose:
>
> 1. Development of unit tests with known species and gene trees (1 week).
>
> 2. Making or reusing necessary data structures, made easier by last
> year's GSoC contribution implementing phyloXML in BioRuby (1/2 week to 1
> week):
> - gene tree,
> - species tree,
> - tree node,
> - children(),
> - parent().
>
> 3. Developing checks for the correctness of input data for rooted fully
> binary trees SDI (1/2 week to 1 week):
> - making sure trees are rooted and binary,
> - all species/gene tree nodes have at least one type of taxonomic data,
> - making a taxonomy base from a type of data present in all nodes
> (scientific or common name, taxonomy code, id),
> - making sure taxonomic data is unique throughout external nodes.
>
> 4. Implementation of the recursive M function (1 week)
> - traverse the gene tree in postorder (left subtree, right subtree, root),
> - finding occurrences where M(parent) equals M(child 1 or 2) - this is
> representative for finding a duplication. If M(parent) matches neither,
> the processed node is a speciation.
>
> 5.
Milestone - finished implementation of SDI for rooted fully binary
> trees (1/2 week):
> - Extensive testing,
> - polishing and writing documentation with RDoc,
> - cleaning up.
>
> 6. Milestone: Implementation of support for unrooted gene trees (1 week):
> - implement an algorithm which roots an unrooted gene tree by exploring
> all possible roots and selecting the one with minimum duplications,
> - calculating M is the most intensive step, so we only do it once for
> one rooted gene tree,
> - by moving the root one node at a time, M does not have to be
> calculated for every node of the gene tree, but only for two nodes:
>   - first child of previous root, if the new root is on a branch of first
>   child of the previous root,
>   - second child of previous root, if the new root is on a branch of
>   second child of the previous root,
>   - and the new root.
> - traversing the whole gene tree one node at a time we explore all
> possible root placements and resulting duplications,
> - from a group of trees with a minimal number of duplications, the
> shortest tree is chosen as the rooted tree,
> - the algorithm for this is written in pseudocode in "RIO: Analyzing
> proteomes by automated phylogenomics using resampled inference of
> orthologs" by Zmasek and Eddy as "Algorithm for speciation duplication
> inference combined with rooting", and needs to be translated to Ruby code.
>
> 7.
Milestone: Implementing a duplication/loss inference algorithm for
> non-binary species trees (described by Vernot, 2008) (2 weeks):
> - implement function roofN(g), which returns all roots of subtrees of
> the parent of "s" (s = M(g)) in which descendants of g must be present,
> - if the intersection of roofN(left child of g) and roofN(right child of
> g) is not NULL, then g is a required duplication,
> - else if the intersection of roofN(left child of g) and roofN(right
> child of g) is NULL, then it is impossible to tell whether the event was
> a duplication or a deep coalescence; the event is thus called a
> conditional duplication,
> - implement function N(g), which returns all of the children of M(g),
> where descendants of g were present in descendants of each element in N(g),
> - implementing a prediction of gene loss events by assuming that
> minimizing gene losses gives the biologically most likely prediction by
> taking into account the following rules, which minimize explicit losses
> by predicting each loss as close as possible to the root of the gene tree:
>   - Binary duplication loss: if p(g) is a required duplication then, if
>   M(p(g)) != M(g) then species in (N(p(g)) without roofN(g)) are lost on
>   edge e between p(g) and g.
>   - Skipped species loss: if M(g) != M(p(g)) and p(M(g)) != M(p(g)) then
>   we can infer a loss at every skipped species between M(p(g)) and M(g).
>   - Polytomy duplication loss: if M(p(g)) is a polytomy and p(g) is a
>   required duplication, then species (N(p(g)) without roofN(g)) are lost
>   at e(p(g),g).
>   - Polytomy speciation loss: if M(g) != M(p(g)) and M(g) is a polytomy,
>   then all children of M(g) should have a descendant of g.
> - the algorithm for this is written in pseudocode in "Reconciliation
> with non-binary species tree" by Vernot as "Algorithm 5.1", which has to
> be translated to Ruby code.
>
> 8.
Milestone: Implementing support for non-binary gene trees (2 weeks)
> - gene trees are by definition binary, but some methods produce
> uncertainties which result in multifurcating gene trees,
> - we can support non-binary gene trees by expanding multifurcating nodes
> to arbitrary binary trees and then optimizing the generated tree for
> duplications and losses with a previously developed algorithm,
> - this approach is described in "A Hybrid Micro-Macroevolutionary
> Approach to Gene Tree Reconstruction" by Durand, which contains several
> pseudocode algorithms that can be ported to Ruby.
>
> 9. Finishing up (1-2 weeks):
> - Extensive testing of all implemented algorithms,
> - polishing and writing documentation using RDoc,
> - cleaning up.
>
> *Why me?:*
>
> I like to set foot on unknown territory and challenge myself constantly.
> I have long searched for something that would connect my love of
> medicine to my love of programming, and now, thanks to GSoC and OBF, I
> think I found it - bioinformatics. I am at a stage of my medical study
> where I have to decide what my future will entail, and I am (now, after
> thinking about it for a long time) positive that bioinformatics will be
> a big part of it. What better way to get my future off to a good start
> than with a Google Summer of Code project? Based on this enthusiasm
> alone you can be assured that I'll work really hard on this project and
> that I will be happy to see it done. As this would be my first serious
> open source engagement, you also have a chance of forming a completely
> new addition to the open source world and making a good contributor out
> of me.
>
> *Previous experience:*
>
> 1. I have been working on a simulation of an analytical chemistry method
> for the past 2 years now; more specifically, we have modeled laser
> ablation + inductively coupled plasma mass spectrometry with a simple
> model, which aids our elemental mapping projects.
For the write-up of
> this project I was awarded a "Prešernovo priznanje" in 2008
> (PDF upon request). This work entails several interesting components,
> from basics such as C# development, image input and output, multi-threaded
> programming and UI development, to complex themes such as genetic
> algorithms and neural networks. I learned all of these as we worked on
> the project without much hassle (source code upon request). This work is
> not yet open source, because we are in the finalizing stages of the
> paper and will release the source code after publication under an open
> source license.
>
> 2. I have programmed since I was a child and I have developed a wide
> spectrum of things in my lifetime (from a full CMS in PHP to an IRC
> robot, source code upon request), but I have little experience in fully
> open source projects, which I think so highly of.
>
> *Biography:*
>
> My name is Jure Triglav and I'm a 24 year old medical student from
> Ljubljana, Slovenia. I was born in the small town of Murska Sobota in
> Slovenia, where I went to grade school (graded excellent for all years,
> awarded "Zoisova štipendija" for the gifted, which I still hold) and
> high-school (excellent, finished as "Zlati maturant" in the company of
> about 200 best students in the country). I moved to Ljubljana in 2004 to
> study medicine. I am now in the last year of my medical study which I
> find challenging and very interesting.
> My hobbies are all over the place, from book design to photography, from
> web design to typography, from guitar to poetry, from reading to
> programming, from traveling to sports.
>
> *Other obligations for the summer:*
>
> In question, will be cancelled upon request: I have 5-hour daily
> clinical practice every weekday in June, July and August, which is not
> nearly as serious as it sounds, especially since this is the summer
> rotation which is known for its laid back feel.
These practice sessions start at
> 8 am and finish at 1 pm, and for students are not really stressful or
> exhausting at all. I have in the past juggled many research obligations
> with clinical practice and my studies without hiccups, but I will not do
> this this summer and will dedicate 8 hours daily to Google Summer of
> Code, as I realize what a great opportunity this is and how much work is
> required. I have no other work, research or vacation obligations for the
> period of Google Summer of Code.
>
> *Contact information:*
>
> (I will provide additional contact information in the final application)
> Name: Jure Triglav
> E-mail: juretriglav at gmail.com
> IRC handle: x` on #obf-soc, #gsoc
>
>

From czmasek at burnham.org Fri Apr 2 22:42:03 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 2 Apr 2010 15:42:03 -0700
Subject: [BioRuby] Beta application for review: BioRuby - Simple duplication inference implementation
In-Reply-To: <249F5CC6-629E-44A1-A161-5A9D76B0DF98@gmail.com>
References: <4BB1387C.6090503@burnham.org> <4BB6476A.20808@burnham.org> <249F5CC6-629E-44A1-A161-5A9D76B0DF98@gmail.com>
Message-ID: <4BB672BB.9070301@burnham.org>

Hi, Jure:

It looks good to me!

Christian

Jure Triglav wrote:
> Thank you Christian!
>
> I am glad that you find it improved and I will now submit this application to Google, then fine tune it some more until the deadline. Do you have any recommendations as to what could still be improved or added?
>
> Yes I agree, the extension to non-binary gene trees is the least defined problem of the proposed group of problems, so it is a good idea to somehow point that out and make its solution and implementation optional.
>
> And yes, the deadline for applying to clinical practice is a few weeks after April 26th, on which day the accepted GSoC proposals will be announced, so don't worry about that.
>
> Thank you for your reply and help!
>
> Best regards,
> Jure Triglav
>
> On Apr 2, 2010, at 8:37 PM, Christian M Zmasek wrote:
>
>> Hi, Jure:
>>
>> Indeed, you improved it a lot!
>>
>> Clearly, you don't _need_ to discuss 'anticipated problems', if you don't expect any.
>>
>> I probably would point out that:
>> "- Extend the algorithm to support non-binary species tree as described
>> by Vernot in "Reconciliation with non-binary species tree"."
>> and
>> "- Extend the algorithm to support non-binary gene trees as described by
>> Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree
>> Reconstruction"."
>>
>> are optional, especially the second one.
>>
>> Regarding my obligations for the summer, please make sure that you _only_ cancel them if/after this proposal has been approved and you have been approved for it.
>> As you know, less than half of all proposals get accepted in the end, and each proposal has many students applying for it.
>>
>> Christian
>>
>>
>> Jure Triglav wrote:
>>> Hello all!
>>> Thank you for the thorough review Christian! I took your comments seriously and spent the last two days reviewing papers and reading up on various subjects related to the proposal. I've done a lot of work and I think I have refined it substantially. I hope I have now fully elaborated on the problem and proposed time-table. I would like to kindly ask you to review it again and point out any remaining issues.
>>> The only things from your list of requirements for the proposed time-table that I have trouble with are the anticipated problems and possible alternative approaches, since all of the developed algorithms seem rock solid to me (almost all of them have mathematical proof included) and the only possible issues that I can think of are the incompatibilities of data structures between various algorithms (which we will address in the first week, as most are very similar) and coding errors (which we will fix, of course! :).
>>> Regarding my obligations for the summer, I would not hesitate to cancel them (or as it is, simply not apply for clinical practice, as I have no obligation yet), seeing as you consider them a serious issue. I am very motivated to do this project and would like to do everything possible to make it happen. Anyway, I can always apply for clinical practice in the next term, it really is not an issue.
>>> Best regards to all,
>>> Jure Triglav
>>> *The idea:*
>>> We would implement the simple and fast duplication inference algorithm described by Zmasek and Eddy (Zmasek and Eddy, 2001, "A simple algorithm to infer gene duplication and speciation events on a gene tree"). With several billion nucleotides sequenced daily (Edwards, Hansen and Stajich, 2009, "Bioinformatics - Tools and applications"), the determination of protein function is mostly done automatically without human intervention by finding the most similar sequences that already have determined protein function (microevolutionary approach). This way of automatically determining protein function neglects additional available information in the form of macroevolutionary relationships. By inferring these interesting relationships (speciation, duplication) from a species and gene tree using a simple algorithm, we can gain a better understanding of protein function.
>>> The importance of determining these evolutionary relationships stems from a relatively simple assumption, that if two similar genes are thought to be related by speciation, their function is more likely to be similar too. On the other hand, if we determine these two genes to be related by duplication, their function is more likely to be different, as gene duplications are powerful drivers in the evolution of new protein function. This is because the second copy of a gene is often free of selective pressure and accumulates mutations more rapidly than a single copy of a gene.
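
The inference described above can be sketched in a few lines of Ruby. This is only a minimal illustration for rooted, fully binary trees, not the proposed BioRuby implementation: `Node`, `lca`, `sdi` and the example trees are hypothetical names and data, and BioRuby's own tree classes are deliberately not used. Each internal gene-tree node is mapped to the lowest common ancestor of the species its two children map to; if that species node is the same as the mapping of either child, the node is labelled a duplication, otherwise a speciation.

```ruby
# Minimal stand-in tree node for illustration only; this is NOT BioRuby's Bio::Tree API.
class Node
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    @parent = nil
    children.each { |child| child.parent = self }
  end

  def leaf?
    children.empty?
  end

  # Number of edges between this node and the root.
  def depth
    parent ? parent.depth + 1 : 0
  end
end

# Lowest common ancestor of two nodes of the same (species) tree.
def lca(a, b)
  a = a.parent while a.depth > b.depth
  b = b.parent while b.depth > a.depth
  until a.equal?(b)
    a = a.parent
    b = b.parent
  end
  a
end

# Postorder traversal of a rooted binary gene tree.
# species_of maps each gene-tree leaf name to its species-tree leaf (an assumed input format).
# Returns the species node that g maps to and records one event per internal node.
def sdi(g, species_of, events)
  return species_of.fetch(g.name) if g.leaf?

  left, right = g.children.map { |child| sdi(child, species_of, events) }
  m = lca(left, right)
  events[g.name] = (m.equal?(left) || m.equal?(right)) ? :duplication : :speciation
  m
end

# Species tree: ((human, mouse)mammals, fly)root
human = Node.new('human')
mouse = Node.new('mouse')
fly   = Node.new('fly')
Node.new('root', [Node.new('mammals', [human, mouse]), fly])

# Gene tree: ((h1, m1)a, (h2, f1)b)r
gene_tree = Node.new('r', [
  Node.new('a', [Node.new('h1'), Node.new('m1')]),
  Node.new('b', [Node.new('h2'), Node.new('f1')])
])
species_of = { 'h1' => human, 'm1' => mouse, 'h2' => human, 'f1' => fly }

events = {}
sdi(gene_tree, species_of, events)
events.each { |name, event| puts "#{name}: #{event}" }
```

In this toy example nodes a and b come out as speciations and the root r as a duplication, because r maps to the same species node (the species root) as its child b.
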
>>> The original algorithm proposed by Zmasek and Eddy supports rooted fully binary gene and species trees, but we have decided to expand on that scope by implementing support for unrooted gene trees (which are produced by some bioinformatics methods and thus need to be addressed), non-binary species trees (since a lot of species trees are non-binary, i.e. 64% of NCBI nodes have more than 2 children (Vernot, 2007)), and non-binary gene trees (which are also produced by some bioinformatics methods, but represent only uncertainties, as gene trees are inherently binary).
>>> *Goals:*
>>> - Implement the algorithm as described by Zmasek and Eddy in "A simple algorithm to infer gene duplication and speciation events on a gene tree", or SDI, which is designed to work on rooted gene trees and fully binary gene and species trees.
>>> - Allow rooting of unrooted gene trees by minimizing the sum of duplications as described by Zmasek and Eddy in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs", thus extending the implementation to support unrooted gene trees.
>>> - Extend the algorithm to support non-binary species trees as described by Vernot in "Reconciliation with non-binary species tree".
>>> - Extend the algorithm to support non-binary gene trees as described by Durand in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction".
>>> *The work:*
>>> Some terminology:
>>> - g: a node in the gene tree
>>> - p(g): the parent of node g in the gene tree
>>> - s: a node in the species tree
>>> - M: a mapping function that links nodes of a gene tree to nodes of a species tree
>>> - roofN: an adapted mapping function for non-binary species trees
>>> - e(p(g),g): the edge between parent of g and g in the gene tree
>>> - polytomy: a species tree node with more than 2 children
>>> There are several milestones to be reached in developing this idea and this is the work plan I propose:
>>> 1.
Development of unit tests with known species and gene trees (1 week).
>>> 2. Making or reusing necessary data structures, made easier by last year's GSoC contribution implementing phyloXML in BioRuby (1/2 week - 1 week):
>>> - gene tree,
>>> - species tree,
>>> - tree node,
>>> - children(),
>>> - parent().
>>> 3. Developing checks for the correctness of input data for rooted fully binary trees SDI (1/2 week - 1 week):
>>> - making sure trees are rooted and binary,
>>> - all species/gene tree nodes have at least one type of taxonomic data,
>>> - making a taxonomy base from a type of data present in all nodes (scientific or common name, taxonomy code, id),
>>> - making sure taxonomic data is unique throughout external nodes.
>>> 4. Implementation of the recursive M function (1 week)
>>> - traverse the gene tree in postorder (left subtree, right subtree, root),
>>> - finding occurrences where M(parent) equals M(child 1 or 2) - this is representative for finding a duplication. If M(parent) matches neither, the processed node is a speciation.
>>> 5. Milestone - finished implementation of SDI for rooted fully binary trees (1/2 week):
>>> - Extensive testing,
>>> - polishing and writing documentation with RDoc,
>>> - cleaning up.
>>> 6. Milestone: Implementation of support for unrooted gene trees (1 week):
>>> - implement an algorithm which roots an unrooted gene tree by exploring all possible roots and selecting the one with minimum duplications,
>>> - calculating M is the most intensive step, so we only do it once for one rooted gene tree,
>>> - by moving the root one node at a time, M does not have to be calculated for every node of the gene tree, but only for two nodes:
>>>   - first child of previous root, if the new root is on a branch of first child of the previous root,
>>>   - second child of previous root, if the new root is on a branch of second child of the previous root,
>>>   - and the new root.
>>> - traversing the whole gene tree one node at a time we explore all possible root placements and resulting duplications,
>>> - from a group of trees with a minimal number of duplications, the shortest tree is chosen as the rooted tree,
>>> - the algorithm for this is written in pseudocode in "RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs" by Zmasek and Eddy as "Algorithm for speciation duplication inference combined with rooting", and needs to be translated to Ruby code.
>>> 7. Milestone: Implementing a duplication/loss inference algorithm for non-binary species trees (described by Vernot, 2008) (2 weeks):
>>> - implement function roofN(g), which returns all roots of subtrees of the parent of "s" (s = M(g)) in which descendants of g must be present,
>>> - if the intersection of roofN(left child of g) and roofN(right child of g) is not NULL, then g is a required duplication,
>>> - else if the intersection of roofN(left child of g) and roofN(right child of g) is NULL, then it is impossible to tell whether the event was a duplication or a deep coalescence; the event is thus called a conditional duplication,
>>> - implement function N(g), which returns all of the children of M(g), where descendants of g were present in descendants of each element in N(g),
>>> - implementing a prediction of gene loss events by assuming that minimizing gene losses gives the biologically most likely prediction by taking into account the following rules, which minimize explicit losses by predicting each loss as close as possible to the root of the gene tree:
>>>   - Binary duplication loss: if p(g) is a required duplication then, if M(p(g)) != M(g) then species in (N(p(g)) without roofN(g)) are lost on edge e between p(g) and g.
>>>   - Skipped species loss: if M(g) != M(p(g)) and p(M(g)) != M(p(g)) then we can infer a loss at every skipped species between M(p(g)) and M(g).
>>>   - Polytomy duplication loss: if M(p(g)) is a polytomy and p(g) is a required duplication, then species (N(p(g)) without roofN(g)) are lost at e(p(g),g).
>>>   - Polytomy speciation loss: if M(g) != M(p(g)) and M(g) is a polytomy, then all children of M(g) should have a descendant of g.
>>> - the algorithm for this is written in pseudocode in "Reconciliation with non-binary species tree" by Vernot as "Algorithm 5.1", which has to be translated to Ruby code.
>>> 8. Milestone: Implementing support for non-binary gene trees (2 weeks)
>>> - gene trees are by definition binary, but some methods produce uncertainties which result in multifurcating gene trees,
>>> - we can support non-binary gene trees by expanding multifurcating nodes to arbitrary binary trees and then optimizing the generated tree for duplications and losses with a previously developed algorithm,
>>> - this approach is described in "A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction" by Durand, which contains several pseudocode algorithms that can be ported to Ruby.
>>> 9. Finishing up (1-2 weeks):
>>> - Extensive testing of all implemented algorithms,
>>> - polishing and writing documentation using RDoc,
>>> - cleaning up.
>>> *Why me?:*
>>> I like to set foot on unknown territory and challenge myself constantly. I have long searched for something that would connect my love of medicine to my love of programming, and now, thanks to GSoC and OBF, I think I found it - bioinformatics. I am at a stage of my medical study where I have to decide what my future will entail, and I am (now, after thinking about it for a long time) positive that bioinformatics will be a big part of it. What better way to get my future off to a good start than with a Google Summer of Code project? Based on this enthusiasm alone you can be assured that I'll work really hard on this project and that I will be happy to see it done.
As this would be my first serious open source engagement, you also have a chance of forming a completely new addition to the open source world and making a good contributor out of me.
>>> *Previous experience:*
>>> 1. I have been working on a simulation of an analytical chemistry method for the past 2 years now; more specifically, we have modeled laser ablation + inductively coupled plasma mass spectrometry with a simple model, which aids our elemental mapping projects. For the write-up of this project I was awarded a "Prešernovo priznanje" in 2008 (PDF upon request). This work entails several interesting components, from basics such as C# development, image input and output, multi-threaded programming and UI development, to complex themes such as genetic algorithms and neural networks. I learned all of these as we worked on the project without much hassle (source code upon request). This work is not yet open source, because we are in the finalizing stages of the paper and will release the source code after publication under an open source license.
>>> 2. I have programmed since I was a child and I have developed a wide spectrum of things in my lifetime (from a full CMS in PHP to an IRC robot, source code upon request), but I have little experience in fully open source projects, which I think so highly of.
>>> *Biography:*
>>> My name is Jure Triglav and I'm a 24 year old medical student from Ljubljana, Slovenia. I was born in the small town of Murska Sobota in Slovenia, where I went to grade school (graded excellent for all years, awarded "Zoisova štipendija" for the gifted, which I still hold) and high-school (excellent, finished as "Zlati maturant" in the company of about 200 best students in the country). I moved to Ljubljana in 2004 to study medicine. I am now in the last year of my medical study which I find challenging and very interesting.
My hobbies are all over the place, from book design to photography, from web design to typography, from guitar to poetry, from reading to programming, from traveling to sports.
>>> *Other obligations for the summer:*
>>> In question, will be cancelled upon request: I have 5-hour daily clinical practice every weekday in June, July and August, which is not nearly as serious as it sounds, especially since this is the summer rotation which is known for its laid back feel. These practice sessions start at 8 am and finish at 1 pm, and for students are not really stressful or exhausting at all. I have in the past juggled many research obligations with clinical practice and my studies without hiccups, but I will not do this this summer and will dedicate 8 hours daily to Google Summer of Code, as I realize what a great opportunity this is and how much work is required. I have no other work, research or vacation obligations for the period of Google Summer of Code.
>>> *Contact information:*
>>> (I will provide additional contact information in the final application)
>>> Name: Jure Triglav
>>> E-mail: juretriglav at gmail.com
>>> IRC handle: x` on #obf-soc, #gsoc
>

From monika.machunik at gmail.com Mon Apr 5 18:41:36 2010
From: monika.machunik at gmail.com (Monika Machunik)
Date: Mon, 5 Apr 2010 21:41:36 +0300
Subject: [BioRuby] GSoC question (regarding SDI algorithm)
Message-ID:

Hello

My name is Monika Machunik and I am planning to apply in this year's Summer of Code. I have read your idea description about "Implementation of algorithm to infer gene duplications in BioRuby", and, although my background does not include any biology, I got quite interested in this project (I could not find mentors' email addresses, so I'm posting it here..).

I would like to briefly introduce myself to get your opinion on whether I would be suitable for this project.

I have about a year of work experience in Java programming, including some internships and last year's GSoC.
Besides Java I know C++, some C, PHP, HTML, etc. I am not experienced in Ruby programming (at least I have seen Ruby code;)), but I learn fast. Currently I am doing my Master's degree in Computer Science, so I have some knowledge about algorithms and data structures. I have never worked at the intersection of biology and CS, but this conjunction has always been intriguing to me.

And now my thoughts about possible content of the workload.

I have read the abstract of the article and, despite my lack of biological knowledge, I managed to comprehend it;). I think I also should have no problem with understanding the algorithm itself. Apart from implementing the algorithm, the project would involve getting familiar with BioRuby and understanding phyloXML to such an extent as to be able to write an algorithm operating on its ready structures.

I am not sure if the algorithm should be implemented inside some existing software, or will it be a kind of standalone algorithm? If it should be accommodated inside some application, the project would probably involve doing that too...

...let it be all for now. Let me know if I have any chances in this project :)

Best regards
Monika Machunik

From czmasek at burnham.org Mon Apr 5 23:17:58 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Mon, 5 Apr 2010 16:17:58 -0700
Subject: [BioRuby] GSOC 2010 preliminary proposal question
In-Reply-To: <2CCBF4CA-B351-46C1-A566-14BC0E4F19D6@gmail.com>
References: <4BB13F46.7010607@burnham.org> <4BB14149.3060606@burnham.org> <2CCBF4CA-B351-46C1-A566-14BC0E4F19D6@gmail.com>
Message-ID: <4BBA6FA6.5080404@burnham.org>

Hi, Sara:

Your proposal looks good. I would expand the paragraph about yourself, i.e. add more details about your skills and previous programming experience.

Christian

Sara Rayburn wrote:
> Hello Christian,
>
> Thanks so much for the advice. Here's a pdf of my first draft of the proposal. What do you think?
>
> Thanks,
>
> Sara Rayburn
> Center for Advanced Computer Studies
> University of Louisiana at Lafayette
> sararayburn at gmail.com

From czmasek at burnham.org Mon Apr 5 23:24:17 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Mon, 5 Apr 2010 16:24:17 -0700
Subject: [BioRuby] GSoC question (regarding SDI algorithm)
In-Reply-To:
References:
Message-ID: <4BBA7121.5090102@burnham.org>

Hi, Monika:

Thank you for your interest in this proposal.

Please remember that student applications are due by April 9, 19:00 UTC -- so you do not have much time left.

I think your lack of experience in Biology is not a problem.

The idea is to implement the algorithm with the BioRuby toolkit (http://www.bioruby.org/).

Some more advice:
If you plan to apply, you need to write a very detailed plan on how you intend to accomplish this project.

For each step you should list:
1. Goal/deliverable
2. Approach
3. Time estimation
4. Anticipated problems & possible alternative approaches

Like so:

A. Prior to coding (from ... to .... )
1. Familiarize myself with BioRuby, set up GitHub repository
2. ...
3. 1 week
4. Not familiar with git, might need to...

B. Week 1 (from ... to .... )
1. Develop unit tests
2. Using manually created gene and species trees, I plan to...
3. 1 week
4. No problem anticipated

Basically you also need to write a short CV, similar to a job application.

Hope this helps,

Christian

Monika Machunik wrote:
> Hello
>
> My name is Monika Machunik and I am planning to apply in this year's Summer
> of Code. I have read your idea description about "Implementation of
> algorithm to infer gene duplications in BioRuby", and, although my
> background does not include any biology, I got quite interested in this
> project (I could not find mentors' email addresses, so I'm posting it
> here..).
>
> I would like to briefly introduce myself to get your opinion on whether I
> would be suitable for this project.
>
> I have about a year of work experience in Java programming, including some
> internships and last year's GSoC. Besides Java I know C++, some C, PHP,
> HTML, etc. I am not experienced in Ruby programming (at least I have seen
> Ruby code;)), but I learn fast. Currently I am doing my Master's degree in
> Computer Science, so I have some knowledge about algorithms and data
> structures. I have never worked at the intersection of biology and CS, but
> this conjunction has always been intriguing to me.
>
> And now my thoughts about possible content of the workload.
>
> I have read the abstract of the article and, despite my lack of
> biological knowledge, I managed to comprehend it;). I think I also should
> have no problem with understanding the algorithm itself. Apart from
> implementing the algorithm, the project would involve getting familiar with
> BioRuby and understanding phyloXML to such an extent as to be able to write
> an algorithm operating on its ready structures.
>
> I am not sure if the algorithm should be implemented inside some existing
> software, or will it be a kind of standalone algorithm? If it should be
> accommodated inside some application, the project would probably involve
> doing that too...
>
> ...let it be all for now. Let me know if I have any chances in this project
> :)
>
> Best regards
> Monika Machunik
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From kpatil at science.uva.nl Tue Apr 6 15:58:10 2010
From: kpatil at science.uva.nl (K. Patil)
Date: Tue, 6 Apr 2010 17:58:10 +0200 (CEST)
Subject: [BioRuby] distributing bioruby
Message-ID: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl>

Hi,

I would like to distribute bioruby with my own code. Is this legally allowed, or are there some restrictions?
More specifically I would like to have a subset of bioruby files (especially for file processing) inside my distribution.

I would like to know if this is possible.

best regards

From ngoto at gen-info.osaka-u.ac.jp Tue Apr 6 16:41:19 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 7 Apr 2010 01:41:19 +0900
Subject: [BioRuby] distributing bioruby
In-Reply-To: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl>
References: <41920.139.19.75.1.1270569490.squirrel@webmail.science.uva.nl>
Message-ID: <20100406164119.979501CBC557@idnmail.gen-info.osaka-u.ac.jp>

Hi,

This may depend on the license of your software.

See the file COPYING about the license of BioRuby, which is the same as Ruby's.
http://github.com/bioruby/bioruby/blob/master/COPYING

In addition, some files have different licenses. See the file LEGAL.
http://github.com/bioruby/bioruby/blob/master/LEGAL

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 6 Apr 2010 17:58:10 +0200 (CEST) "K. Patil" wrote:
> Hi,
>
> I would like to distribute bioruby with my own code. Is this legally
> allowed, or are there some restrictions? More specifically I would like to
> have a subset of bioruby files (especially for file processing) inside my
> distribution.
>
> I would like to know if this is possible.
>
> best regards

From konstantin.s.stepanyuk at gmail.com Wed Apr 7 05:02:07 2010
From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk)
Date: Wed, 7 Apr 2010 13:02:07 +0800
Subject: [BioRuby] GSoC project
Message-ID:

Hi All,

My name is Kostya Stepanyuk, I'm an undergraduate student from Novosibirsk State University in Russia and I'm looking forward to participating in the 'Ruby 1.9.2 support of BioRuby' GSoC project.
I already have a background in bioinformatics since I have been participating in the Unipro UGENE (http://ugene.unipro.ru) open-source bioinformatics project for a long time. Also, I highly appreciate the Ruby programming language and I was very glad to learn that there is an open-source Ruby-based bioinformatics project.

My motivation in participating in this project is to improve my knowledge of Ruby, to familiarize myself with your great project and to help BioRuby improve in quality and popularity. I'm looking forward to contributing to your promising project!

I'm going to send the full application as soon as possible.

Thanks,
Kostya.

From pjotr.public14 at thebird.nl Wed Apr 7 05:49:30 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 7 Apr 2010 07:49:30 +0200
Subject: [BioRuby] GSoC project
In-Reply-To:
References:
Message-ID: <20100407054930.GA10407@thebird.nl>

Hi Konstantin,

Not much time left. Leave us enough time to help comment.

Pj.

On Wed, Apr 07, 2010 at 01:02:07PM +0800, Konstantin Stepanyuk wrote:
> Hi All,
>
> My name is Kostya Stepanyuk, I'm an undergraduate student from
> Novosibirsk State University in Russia and I'm looking forward to
> participating in the 'Ruby 1.9.2 support of BioRuby' GSoC project.
>
> I already have a background in bioinformatics since I have been
> participating in the Unipro UGENE (http://ugene.unipro.ru) open-source
> bioinformatics project for a long time. Also, I highly appreciate the Ruby
> programming language and I was very glad to learn that there is
> an open-source Ruby-based bioinformatics project.
>
> My motivation in participating in this project is to improve my
> knowledge of Ruby, to familiarize myself with your great project and
> to help BioRuby improve in quality and popularity. I'm looking
> forward to contributing to your promising project!
>
> I'm going to send the full application as soon as possible.
>
> Thanks,
> Kostya.
> _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From mkikkawa at gmail.com Thu Apr 8 01:21:46 2010 From: mkikkawa at gmail.com (Masahide Kikkawa) Date: Thu, 8 Apr 2010 10:21:46 +0900 Subject: [BioRuby] Bio::MEDLINE, authors Message-ID: Hi, I encountered a bug in Bio::MEDLINE, Here is the code: ====================================== require 'rubygems' require 'bio' item = Bio::PubMed.efetch([13016983]) ref = Bio::MEDLINE.new(item[0]).reference p ref.authors ====================================== The result is: [", V. I. M. T. R. U. P. B", "SCHMIDT-NIELSEN, B."] Expected result is ["VIMTRUP B", "SCHMIDT-NIELSEN B."] Thanks. From konstantin.s.stepanyuk at gmail.com Thu Apr 8 08:55:25 2010 From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk) Date: Thu, 8 Apr 2010 16:55:25 +0800 Subject: [BioRuby] GSoC project In-Reply-To: <20100407054930.GA10407@thebird.nl> References: <20100407054930.GA10407@thebird.nl> Message-ID: Hi Pjotr and folks, here is my proposal written according to the scheme published on OBF GSoC page. It is quite compact since I have not buried into the codebase and tests deeply. So I will appreciate any help or suggestions, and I'm looking forward to contribute to your project during the GSoC. Thanks! Kostya. 1.Contact information Full Name: Konstantin Stepanyuk Address: Pirogova str. 20/1, app. 800, Novosibirsk, Zip code: 630090 Russian Federation. E-mail: konstantin.s.stepanyuk at gmail.com Phone: +7 923 247 2424 ICQ: 427601980 2. Motivation and goals. Bioinformatics is one of my primary fields of interests. I already have a solid background in bioinformatics since I have been participating in Unipro UGENE (http://ugene.unipro.ru) open-source bioinformatics project for two years. My existing research area in university includes local sequence alignment and genome assembly. 

I highly appreciate the Ruby programming language, and I was very glad to learn that there is an open-source, Ruby-based bioinformatics project.
I believe that cross-version compatibility of BioRuby is an important issue for the project, since the project is quite modern and promising. One of the main tasks in porting BioRuby to Ruby 1.9.2 is improving test coverage, since the project currently has rather few unit tests. Better coverage will make us more confident when introducing compatibility and conformance fixes.

3. My skills summary and work experience
Programming languages: C++ (3 years), Java, Ruby, Python.
Projects:
* Unipro UGENE - free and open-source Integrated Bioinformatic Tools (http://ugene.unipro.ru).
  - Role: C++ and Qt developer for two years (Unipro LLC).
  - Implemented and tested several algorithms, such as Smith-Waterman local sequence alignment (and its SSE, CUDA and ATI Stream versions).
* Apache Harmony - clean-room implementation of J2SE platform (http://harmony.apache.org).
  - Role: Intern at Intel Corporation
  - Implemented a tool for aggregating and reporting performance and statistical counters.

4. A project plan.
I propose to divide the total work into two big milestones, according to the Google schedule.

1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks)

2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 July - 20 August (total 5 weeks).

Each of these chunks of work is divided into several subparts:

1)
- Evaluate test coverage (1 week). Consider integrating some tool into the build process to automate test coverage reporting. Create a concrete test plan targeted at improving test coverage up to 90-100%.
- Write unit tests according to the plan. Consider creating a stress-test suite. (6-7 weeks)

2)
- Elaborate the list of incompatibilities with the new version of Ruby (1 week)
- Port the codebase (4 weeks)

5. My plans for the summer
I plan for the GSoC project to be my primary occupation during the summer.
But I'm going to have a 1-week vacation in July.

From ngoto at gen-info.osaka-u.ac.jp Thu Apr 8 11:48:51 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 8 Apr 2010 20:48:51 +0900
Subject: [BioRuby] GSoC project
In-Reply-To:
References: <20100407054930.GA10407@thebird.nl>
Message-ID: <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp>

Hi Konstantin,

In the project, Ruby programming skill is very important. Write more about your Ruby programming experiences. In addition, can you show URLs of Ruby scripts you wrote?

Please improve the project plan more. For example:

* Preparation. E.g. subscribing to the ruby-core mailing list to check the current status of Ruby 1.9.2, installing Ruby 1.9.2 and 1.8.7, etc. Note that Ruby 1.9.2 is now under feature freeze, and will be released on July 30 ([ruby-core:28665]). You will need to compile the Ruby 1.9.2 svn version at least several times. (Optionally, every week or every day, and contribute bug fixes to Ruby 1.9.2.)
* About the development environment and tools you will use.
* About the coverage check tool (rcov?).
* Reading the changes from Ruby 1.8.7 to 1.9.1 and from 1.9.1 to 1.9.2, and brushing up the plan.
  http://svn.ruby-lang.org/repos/ruby/tags/v1_9_1_0/NEWS
  http://svn.ruby-lang.org/repos/ruby/trunk/NEWS
* Extracting bioruby-1.4.0.tar.gz, looking at lib/ and test/unit (and test/functional), and checking the existence of test files and directories corresponding to library main files. For example, you can find that lib/bio/db/genbank exists, but test/unit/bio/db/genbank does not exist. Of course, in some cases, test files exist but their contents are poor. Although it is very difficult, if you can, it is good to estimate the needed effort. It is also good to prioritize the classes/modules to write tests for. I think Bio::GenBank and Bio::GenPept are high priority.
* ...

(Not all will be needed, and not limited to the above.)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From donttrustben at gmail.com Thu Apr 8 12:14:42 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Thu, 8 Apr 2010 22:14:42 +1000
Subject: [BioRuby] Bio::GO::GeneAssociation issue/fix and new unit test file
Message-ID:

Hi,

I had some problems parsing gene association files using Bio::FlatFile, caused by the parser attempting to call the split method on nil. The offending line was

@db_reference = tmp[5].split(/\|/) #

That seemed easy enough to fix, but then I noticed there weren't any test cases to test my changes against, so I made a new file test/unit/db/test_go.rb, including a simulation of one that was giving me problems.

I've collected these changes in a new branch, and you can see the difference using the new github compare interface at
http://github.com/wwood/bioruby/compare/36041377db...gene_association

Is there any reason that the variables that correspond to arrays in GeneAssociation (@db_reference, @with, @db_object_synonym) have singular names, and not plural? It would be simple to add an alias_method db_references -> db_reference, right?

I also don't agree that the 'GO:' part of the identifier should be chopped off by default by the goid method - gene association files are not necessarily concerned with GO - there are other ontologies out there as well. I personally never look at GO identifiers without the 'GO:' bit, so I was surprised when I saw that.

Sound OK?
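
For illustration only, the nil-safe split and the plural aliases discussed above might look roughly like the following sketch. It does not use BioRuby and is not the patch from the branch: AssociationLine and split_multi are hypothetical names, and the column positions 7 and 10 for @with and @db_object_synonym are an assumption about the tab-separated gene association layout (only tmp[5] comes from the message above).

```ruby
# Hypothetical stand-in for a gene association record. This is NOT the
# BioRuby class, only an illustration of the nil-safe parsing pattern.
class AssociationLine
  attr_reader :db_reference, :with, :db_object_synonym

  def initialize(line)
    # A limit of -1 keeps trailing empty columns instead of dropping them.
    tmp = line.chomp.split("\t", -1)
    @db_reference      = split_multi(tmp[5])
    @with              = split_multi(tmp[7])   # assumed column position
    @db_object_synonym = split_multi(tmp[10])  # assumed column position
  end

  # Plural aliases for the array-valued fields; the singular names stay
  # available, so existing callers keep working.
  alias_method :db_references, :db_reference
  alias_method :db_object_synonyms, :db_object_synonym

  private

  # nil.to_s is "", and "".split('|') is [], so a missing or empty
  # column yields an empty array instead of raising NoMethodError.
  def split_multi(field)
    field.to_s.split('|')
  end
end

# A truncated line with only five columns no longer raises.
short = AssociationLine.new("SGD\tS000000001\tGENE1\t\tGO:0000001")
p short.db_references
```

Returning an empty array for a missing column keeps callers free of nil checks; whether that is the right behaviour for the real parser is a design decision for the maintainers.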

Thanks,
ben

From czmasek at burnham.org Thu Apr 8 21:39:44 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 8 Apr 2010 14:39:44 -0700
Subject: [BioRuby] GSoC question (regarding SDI algorithm)
In-Reply-To:
References: <4BBA7121.5090102@burnham.org>
Message-ID: <4BBE4D20.4050706@burnham.org>

Hi, Monika:

Remember, the deadline is tomorrow!

> Hello
>
> I think I should first explicitly ask this question: I am not
> experienced in Ruby, but I believe I can improve myself enough during
> the Community Bonding period. Does it make me ineligible to apply?

No. Part of the goals for GSoC is for students to "learn new things".

> If not, please continue with reading: ;)
>
> > Basically you also need to write a short CV, similar to a job application.
>
> Where should I later submit this CV? Should include only education /
> work experience, or also something like a cover letter?

It will all be part of your application (i.e. one document). You need to write an "abstract", which can be considered a cover letter.

> And a question about the algorithm itself - would it need to be
> accomodated in some application, or just be a separate BioRuby library?

It would be part of BioRuby.

> > Develop unit tests
>
> I hope this is not stupid question, but where is it possible to get the
> following information about a gene tree: which nodes should receive
> which annotations about 'duplication' and 'specialization', for the
> duplication inference to be considered correct? I mean, as a
> non-biologist I do not know what should be the correct output of the
> algorithm...

You should read (or at least have a look at) some of the references listed here:
http://evogsoc2010.wordpress.com/2010/03/25/references-for-gene-duplications-proposal/

You can also have a look at my PhD thesis, which explains some of the background, especially chapter 1.3.2.1.
See: ftp://selab.janelia.org/pub/publications/Zmasek02/Zmasek02-phdthesis.pdf Furthermore, I can easily provide you with test gene trees which have duplications assigned. This is not a big issue. > > Regards > Monika Machunik > > > 2010/4/6 Christian M Zmasek > > > Hi, Monika: > > Thank you for you interest in this proposal. > Please remember that student applications are due by April 9, 19:00 > UTC -- so, you have not much time left. > > I think your lack of experience in Biology is not a problem. > > The idea is to implement the algorithm with the BioRuby toolkit > (http://www.bioruby.org/). > > Some more advice: > > If you plan to apply, you need to write a very detailed plan on how > you intend to accomplish this project. > > For each step you should list: > 1. Goal/deliverable > 2. Approach > 3. Time estimation > 4. Anticipated problems & possible alternative approaches > > Like so: > > A. Prior to coding (from ... to .... ) > 1. Familiarize myself with BioRuby, set up git hub repository > 2. ... > 3. 1 week > 4. Not familiar with git, might need to... > > B. Week 1 (from ... to .... ) > 1. Develop unit tests > 2. Using manually created gene and species trees, I plan to... > 3. 1 week > 4. No problem anticipated > > > Basically you also need to write a short CV, similar to a job > application. > > Hope this helps, > > Christian > > > Monika Machunik wrote: > > Hello > > My name is Monika Machunik and I am planing to apply in this > year's Summer > of Code. I have read your idea description about "Implementation of > algorithm to infer gene duplications in BioRuby", and, although my > background does not include any biology, I got quite interested > in this > project (I could not find mentors' email addresses, so I'm > posting it > here..). > > I would like to shortly introduce myself to get your opinion if > I would be > suitable for this project. 
> > I have about a year of work experience in Java programming, > including some > internships and last year's GSoC. Besides Java I know C++, some > C, Php, > HTML, etc. I am not experienced in Ruby programming (at least > have seen Ruby > code;)), but I learn fast. Currently I am doing my Master degree > in Computer > Science, so I have some knowlegde about algorithms and data > structures. I > have never worked at the intersection of biology and CS, but this > conjunction has always been intriguing to me. > > And now my thoughts about possible content of the workload. > > I have read the abstract of the article and, despite of my lack of > biological knowledge, I managed to comprehend it;). I think I > also should > have no problem with understanding the algorithm itself. Apart from > implementing the algorithm, the project would involve getting > familiar with > BioRuby, understanding phyloXML in such extent to be able to > write an > algorithm operating on its ready structures. > > I am not sure if the algorithm should be implemented inside some > exisitng > software, or will it be a kind of standalone algorithm? If it > should be > accomodated inside some application, the project would probably > involve > doing that too... > > ...let it be all for now. Let me know if I have any chances in > this project > :) > > Best regards > Monika Machunik > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > From czmasek at burnham.org Thu Apr 8 21:43:04 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Thu, 8 Apr 2010 14:43:04 -0700 Subject: [BioRuby] GSoC project In-Reply-To: References: <20100407054930.GA10407@thebird.nl> Message-ID: <4BBE4DE8.1090008@burnham.org> Hi, Konstantin: Your project plan is not detailed enough and partially vague (for example, what do you mean by "some tool"?) 

Christian

From konstantin.s.stepanyuk at gmail.com Fri Apr 9 05:19:04 2010
From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk)
Date: Fri, 9 Apr 2010 12:19:04 +0700
Subject: [BioRuby] GSoC project
In-Reply-To: <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp>
References: <20100407054930.GA10407@thebird.nl> <20100408114851.EA2341CBC3D0@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi Naohisa,

Thanks for your comments! An updated version of the application is below. Some quick comments are inline.

> Write more about your Ruby programming experiences.
> In addition, can you show URL to Ruby scripts you wrote?

I have only basic knowledge of Ruby and its standard library. I wrote more about my Ruby experience in the plan below.
Examples of the scripts:
- solving the traveling salesman problem: http://paste2.org/p/764145
Some handy small tools:
- generator of random DNA sequences: http://paste2.org/p/764147
- run through a file tree and fix some string: http://paste2.org/p/764146

> Please improve project plan more. For example:
> * Preparation. E.g. to subscribe to ruby-core mailing list
> [skip]

I've added this to the plan, but I think I will do all of this during the Community Bonding period.

> * Extracting bioruby-1.4.0.tar.gz, looking at lib/ and
> test/unit (and test/functional), and checking existence
> of test files and directories corresponding to library
> main files.

I've already checked out the git repository and succeeded in running the tests.

> Although it is very difficult, but if you can, it is
> good to estimate the needed efforts. It is also good
> to prioritize classes/modules to write tests. I think
> Bio::GenBank and Bio::GenPept are high priority.

I think that estimating the quality and coverage of the current tests is quite a complex task which will require a lot of time, so I included this phase in the development plan. I've already played with the tests, and IMO there is room for improvement.

Thanks,
Kostya.

Updated application:

1. Contact information

Full Name: Konstantin Stepanyuk

Address:
Pirogova str. 20/1, app. 800,
Novosibirsk,
Zip code: 630090
Russian Federation.

E-mail: konstantin.s.stepanyuk at gmail.com
Phone: +7 923 247 2424
ICQ: 427601980

2. Motivation and goals.
Bioinformatics is one of my primary fields of interest. I already have a solid background in bioinformatics, since I have been participating in the Unipro UGENE (http://ugene.unipro.ru) open-source bioinformatics project for two years. My current research area at university includes local sequence alignment and genome assembly.

I highly appreciate the Ruby programming language, and I was very glad to learn that there is an open-source, Ruby-based bioinformatics project.
I believe that cross-version compatibility of BioRuby is an important issue for the project, since the project is quite modern and promising. One of the main tasks in porting BioRuby to Ruby 1.9.2 is improving test coverage, since the project currently has rather few unit tests. Better coverage will make us more confident when introducing compatibility and conformance fixes.

3. My skills summary and work experience
Programming languages: C++ (3 years), Java, Ruby, Python.
Ruby experience: basic knowledge. Ability to write simple scripts no larger than ~200 LoC. I've written some algorithms in Ruby, such as sorts and simulated annealing for the traveling salesman problem, several networking scripts (simple TCP servers/clients), and handy 'one-liners' for everyday tasks.
Projects:
* Unipro UGENE - free and open-source Integrated Bioinformatic Tools (http://ugene.unipro.ru).
  - Role: C++ and Qt developer for two years (Unipro LLC).
  - Implemented and tested several algorithms, such as Smith-Waterman local sequence alignment (and its SSE, CUDA and ATI Stream versions).
* Apache Harmony - clean-room implementation of J2SE platform (http://harmony.apache.org).
  - Role: Intern at Intel Corporation
  - Implemented a tool for aggregating and reporting performance and statistical counters.

4. A project plan.
I propose to divide the total work into two big milestones, according to the Google schedule. The plan also includes a preparation phase, which will be carried out during the Community Bonding period.

0) Preparation:
- Establishing the Ruby environment:
  * Installing the current versions of Ruby (1.8.7, 1.9.1), and checking out the Ruby repository to be able to regularly build the newest version.
  * Subscribing to the Ruby development mailing list to check the current status of the project.
- Establishing the BioRuby environment:
  * Checking out the BioRuby codebase.
  * Choosing the right tools to work with the BioRuby code. The Vim + Rakefiles way is surely reliable, but using some high-level IDE such as JetBrains RubyMine will be considered.

1) Improving test coverage of the project. 23 May - 16 July (total 8 weeks)

2) Porting the project codebase to be compatible with Ruby 1.9.2. 16 July - 20 August (total 5 weeks).

Each of these chunks of work is divided into several subparts:

1)
- Evaluate test coverage (1 week). This includes:
  * prioritizing the classes/modules to write tests for.
  * measuring coverage. Rcov is the first candidate to use.
  * integrating the test coverage metrics into the build process will be considered.
- Write unit tests according to the plan. Consider creating a stress-test suite. (6-7 weeks)

2)
- Elaborate the list of incompatibilities with the new version of Ruby (1 week)
- Port the codebase (4 weeks)

5. My plans for the summer
I plan for the GSoC project to be my primary occupation during the summer. But I'm going to have a 1-week vacation in July.

From konstantin.s.stepanyuk at gmail.com Fri Apr 9 05:26:31 2010
From: konstantin.s.stepanyuk at gmail.com (Konstantin Stepanyuk)
Date: Fri, 9 Apr 2010 12:26:31 +0700
Subject: [BioRuby] GSoC project
In-Reply-To: <4BBE4DE8.1090008@burnham.org>
References: <20100407054930.GA10407@thebird.nl> <4BBE4DE8.1090008@burnham.org>
Message-ID:

Hi Christian,

Thanks for your comments! I agree that my plan is very rough, but elaborating a detailed plan requires a lot of work with the existing codebase, the tests, and the new Ruby issues.
So it looks like almost impossible for me to write today a plan like

1) writing XXX tests for YYY method of ZZZ class (2 days).
2) fixing XXX issue from Ruby changelog in classes ZZZ (3 days).

So I have a time for writing detailed testing and porting plans in the
overall project plan.

By 'some tool' I mentioned that I currently can't say which one will
be the most suitable.. I'm not a big guy in Ruby language and tools
(but not for a long time, I hope).

Thanks,
Kostya.

On Fri, Apr 9, 2010 at 4:43 AM, Christian M Zmasek wrote:

> Hi, Konstantin:
>
> Your project plan is not detailed enough and partially vague (for example,
> what do you mean by "some tool"?)
>
> Christian

From sararayburn at gmail.com Fri Apr 9 12:39:25 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Fri, 9 Apr 2010 07:39:25 -0500
Subject: [BioRuby] GSOC 2010 Proposal Submitted
Message-ID: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>

Hello Christian,

I have submitted my proposal to implement the SDI algorithms for BioRuby.

Thanks so much for the feedback on my proposal draft. I look forward to hearing the results later this month.

Regards,
Sara Rayburn
University of Louisiana
sararayburn at gmail.com

From czmasek at burnham.org Sat Apr 10 02:25:58 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Fri, 9 Apr 2010 19:25:58 -0700
Subject: [BioRuby] GSOC 2010 Proposal Submitted
In-Reply-To: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>
References: <590D5674-16F0-484C-B5AB-17DABC03C7D7@gmail.com>
Message-ID: <4BBFE1B6.2090905@burnham.org>

Hi, Sara:

It looks like you submitted your proposal to the wrong organization!

You submitted to Nescent but should have submitted to:
http://socghop.appspot.com/gsoc/org/show/google/gsoc2010/obf

Christian

Sara Rayburn wrote:
> Hello Christian,
>
> I have submitted my proposal to implement the SDI algorithms for BioRuby.
>
> Thanks so much for the feedback on my proposal draft. I look forward to hearing the results later this month.
> > Regards, > Sara Rayburn > University of Louisiana > sararayburn at gmail.com > From rutgeraldo at gmail.com Mon Apr 12 11:25:00 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Mon, 12 Apr 2010 12:25:00 +0100 Subject: [BioRuby] RDF Triples in BioRuby, a funding proposal to Google SoC In-Reply-To: <2bb9b24a1003150527p439c135dm1a164e6a5218835f@mail.gmail.com> References: <2bb9b24a1003100522p68330d6bu3f8e5f3a7f50dd6b@mail.gmail.com> <9081A9B5-611C-45C2-A099-44BAF1E524F4@hgc.jp> <2bb9b24a1003110222h4bd642adv31d1975c9edc0bba@mail.gmail.com> <2bb9b24a1003150527p439c135dm1a164e6a5218835f@mail.gmail.com> Message-ID: Hi all, here's a brief followup: we have received three student applications for this GSoC project. All three look fairly strong. Hopefully we will get funding! Rutger On Mon, Mar 15, 2010 at 1:27 PM, Rutger Vos wrote: > To follow up along more practical lines, I've had to deal with similar > design issues in Bio::Phylo (perl), TreeBASE and Mesquite (both java). > I've learned it makes sense to have: > > - a simple "annotation" object, with getters and setters for the > predicate namespace uri, the predicate string, and the value object > (either a literal or a uri), > > - a get_annotations method for all (fundamental) data objects in the > toolkit that returns a collection of these annotation object > > this way, when you serialize any bioruby object into rdf, you can add > as many other statements about that object as you want. > > Would a refactoring along those lines have a chance of being > acceptable to the bioruby community (of course subsequent to a more > detailed RFC, testing, discussion, proof of concept, etc.)? > > On Thursday, March 11, 2010, Rutger Vos wrote: >> Hi Toshiaki, >> >> great to hear there's already been a lot of discussion over this. 
>> (Well, I'd be surprised if there hadn't been :)) >> >> It looks to me like some fairly major bookkeeping would need to be >> implemented high up in the inheritance tree if *all* bioruby objects >> are to be serialized into RDF. It also would require all of bioruby to >> be ontologized in one fell swoop. >> >> It is perhaps more likely that subdomains are going to be ontologized >> more or less independently from one another (as you mention, >> references->RDF, or in my case phylogenetics->RDF) based implicitly on >> intermediate data formats (pubmed records and nexml, respectively). >> >> That is probably OK, we do things as needs arise. >> >> But what would be handy if the API was at least general enough so that >> this was extensible and we can make additional statements *about* >> objects when we serialize them to RDF. For example, in your pubmed >> turtle file, the subject is always >> . Is there a way, >> programmatically, where I can add additional statements about >> ? >> >> Rutger >> >> On Wed, Mar 10, 2010 at 2:21 PM, Toshiaki Katayama wrote: >>> Hi Rutger, >>> >>> Thank you for your inputs on GSoC 2010! >>> >>>> * is there a way to express triples in BioRuby? >>>> * if there is not, what would be a good design to express triples in >>>> BioRuby so that this would be more useful than just for NeXML? >>> >>> This is what we discussed during the pre-BioHackathon 2010. >>> >>> http://hackathon3.dbcls.jp/wiki/BioRuby >>> >>> My first idea was to make all BioRuby object have common output >>> method to render the object contents in various formats >>> (such as RDF/XML, Turtle, HTML, GFF, FASTA etc. if appropriate). >>> >>> Then, we tried to separate view from logic using erb, but as you >>> see in the above page, it still looks ugly. It is mainly because >>> view formatting itself requires some additional codes, specific >>> to each format. >>> >>> Therefore, we don't have a solid conclusion on this yet, unfortunately. 
>>> >>> Anyway, we already have PubMed to RDF converter written in Ruby as >>> the TogoWS REST API (http://togows.dbcls.jp/site/en/rest.html) at >>> >>> http://togows.dbcls.jp/entry/pubmed/16381885 >>> --> http://togows.dbcls.jp/entry/pubmed/16381885.ttl >>> >>> and, we are also trying to support KEGG to RDF conversion in this >>> framework as well. I think we can put the code in BioRuby when we finished. >>> >>> Your suggestions are welcome. :) >>> >>> Regards, >>> Toshiaki >>> >>> On 2010/03/10, at 22:22, Rutger Vos wrote: >>> >>>> Dear BioRuby-ites, >>>> >>>> my apologies that my first email to this list is so long and >>>> tangential. I am trying to find out how to express RDF triples in >>>> BioRuby. In this email I'm explaining why I care enough to try to get >>>> funding for someone to work on this. If you don't care about any of >>>> this, you can stop reading now. >>>> >>>> The National Evolutionary Synthesis Center (NESCent.org) is planning >>>> to be a mentoring organization for the Google Summer of Code 2010. I >>>> have submitted a project idea to this: to develop NeXML I/O and - >>>> probably more importantly for you - RDF capabilities for BioRuby. If >>>> funded, a student/coder will work on this full time over the summer, >>>> under the shared supervision of Jan Aerts and myself. Here is the >>>> link: http://tinyurl.com/biorubynexml >>>> >>>> NeXML is a data format for phylogenetic data that can be read and >>>> written in perl, python, java and (to some extent) c++ and javascript. >>>> RDF is the cool "new" thing (as per BioHackathon2010), but as far as I >>>> can tell BioRuby isn't completely up to speed for it, yet. >>>> >>>> (As an aside: you might ask yourself why there is something like NeXML >>>> when there is PhyloXML for BioRuby. 
The answer is that NeXML solves a >>>> different problem: PhyloXML started essentially as a next generation >>>> of New Hampshire eXtended (NHX) to meet the annotation needs of >>>> comparative genomics, things such as gene duplications and other >>>> molecular evolution events, on phylogenetic trees; NeXML started as a >>>> complete XML representation of the NEXUS format, providing other >>>> comparative data types such as categorical and continuous character >>>> state matrices, restriction site matrices, and so on, in addition to >>>> trees, taxa, sequence alignments. There is obviously some overlap >>>> between the formats, but I guess that is not unique in bioinformatics >>>> :)) >>>> >>>> NeXML has a semantic annotation facility that uses RDFa. This allows >>>> us to add additional metadata to a fundamental phylogenetic data >>>> object (a tree, taxon, character, etc.) to form a "triple": the >>>> fundamental data object is the triple Subject, and the Predicate and >>>> Object are added as RDFa attributes. Since NeXML can be transformed >>>> using a standard XSL stylesheet to RDF/XML, we can express a limitless >>>> number of statements about phylogenetics. H > > -- > Dr. Rutger A. Vos > School of Biological Sciences > Philip Lyle Building, Level 4 > University of Reading > Reading > RG6 6BX > United Kingdom > Tel: +44 (0) 118 378 7535 > http://www.nexml.org > http://rutgervos.blogspot.com > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From anurag08priyam at gmail.com Wed Apr 14 20:41:12 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 15 Apr 2010 02:11:12 +0530 Subject: [BioRuby] Patch for Bug 18019. Message-ID: Hello all, This is my start at being a part of the BioRuby developer community. 

The RubyForge bug tracking page shows bug 18019 (GenBank each_entry, last
entry is always nil) [1] to be open. I am attaching a patch for it. It's very
tiny. The fix was already suggested in a comment by Raoul Jean Pierre
Bonnal (the submitter of the bug). I have verified the solution and created
a patch for it. Or should I send a pull request on github?

Patch (git format-patch):

>From ac82213651e5f5761d32cc9c658188d060c2e75a Mon Sep 17 00:00:00 2001
From: Anurag Priyam
Date: Wed, 14 Apr 2010 22:58:45 +0530
Subject: [PATCH] fixed bug 18019: last entry of each_entry is nil

---
 lib/bio/db/genbank/common.rb | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/bio/db/genbank/common.rb b/lib/bio/db/genbank/common.rb
index 545eac1..eaa760c 100644
--- a/lib/bio/db/genbank/common.rb
+++ b/lib/bio/db/genbank/common.rb
@@ -24,7 +24,7 @@ class NCBIDB
 #
 module Common
 
-  DELIMITER = RS = "\n//\n"
+  DELIMITER = RS = "\n//\n\n"
   TAGSIZE = 12
 
   def initialize(entry)
-- 
1.7.0

[1] http://rubyforge.org/tracker/index.php?func=detail&aid=18019&group_id=769&atid=3037

-- 
Anurag Priyam
2nd Year, Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From jan.aerts at gmail.com Wed Apr 14 20:44:49 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Wed, 14 Apr 2010 21:44:49 +0100
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References:
Message-ID:

Thanks for that, Anurag. Contributions to bioruby very much appreciated :-)

@Goto-san: can you merge that fix?

Cheers,
jan.

From ngoto at gen-info.osaka-u.ac.jp Thu Apr 15 01:34:53 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 15 Apr 2010 10:34:53 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References:
Message-ID: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>

On Wed, 14 Apr 2010 21:44:49 +0100
Jan Aerts wrote:

> Thanks for that, Anurag. Contributions to bioruby very much appreciated :-)
>
> @Goto-san: can you merge that fix?

No, because the patch ignores reading of entries in the middle of the file.
To parse files distributed from NCBI, the delimiter should be "\n//\n",
and cannot be "\n//\n\n".

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Thu Apr 15 06:32:09 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 15 Apr 2010 15:32:09 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>

Hi Anurag,

Parsing of GenBank files is primarily tested with official GenBank
releases. (But currently there are no unit tests. I hope they will be added
during the GSoC project "Ruby 1.9.2 support of BioRuby".)

The test is something like:

 # preparation of test data
 % wget ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrt21.seq.gz
 % gzip -dc gbvrt21.seq.gz > gbvrt21.seq

 # Counts the number of entries
 % ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
   ff.each { |e| c += 1 }; p c' gbvrt21.seq
 #==> 1991

 # Checks if the number of entries is correct.
 % grep -c '^LOCUS' gbvrt21.seq
 #==> 1991

 # Executes with the monkey patch.
 # Be careful that this takes a very long time and a lot of memory!
 % ruby -rbio -e 'Bio::GenBank::DELIMITER = "\n//\n\n"; \
   c = 0; ff = Bio::FlatFile.open(ARGV[0]); \
   ff.each { |e| c += 1 }; p c' gbvrt21.seq
 #==> 1

It is apparent that the patch is wrong.

Splitting entries by using such a delimiter is simple and performs
well, but it can only work with correct data, which should always
end with the delimiter. Characters after the last delimiter in the
file are regarded as a single entry because we don't want to lose data.

The behavior can be changed: for example, when only white space is
read and then the end of file is reached without a delimiter, it could be
ignored and treated as EOF with no entries.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Apr 15 07:26:42 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 15 Apr 2010 16:26:42 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>

Hi Goto-san,

> Splitting entries by using such delimiter is simple and the performance
> is well, but it can only work with correct data which should always be
> ended with the delimiter. Characters after the last delimiter in the
> file is regarded as a single entry because we don't want to lose data.
>
> The behavior can be changed, for example, when getting only white
> spaces and then the end of file without delimiter, it is ignored and
> treated as EOF with no entries.

Because genbank and genpept format files downloaded from NCBI with entrez
usually end with double newline characters,
the latter behavior is really desired.

$ wget -O sequences.gb "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4"

$ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\
 ff.each { |e| c += 1 }; p c' sequences.gb
#==> 4

I hope it becomes 3, as there are 3 entries.

$ grep LOCUS sequences.gb
LOCUS A00002 194 bp DNA linear PAT 10-FEB-1993
LOCUS A00003 194 bp DNA linear PAT 10-FEB-1993
LOCUS X17276 556 bp DNA linear MAM 26-FEB-1992

Actually this file has an excess newline at the end of each entry.
And his patch will work in this case, although it is not right, as you
mentioned.

Although in this example no error is reported because we don't do
anything with the entry, accessing the last entry (the fourth in this case)
will cause an error.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From anurag08priyam at gmail.com Thu Apr 15 08:30:42 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 15 Apr 2010 14:00:42 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

To parse the genbank files, ultimately IO#gets(sep_string=$/) is called. I
did a File.read on a small sequence file [1]. The last characters
are: ".... tctaga\n//\n\n". This shows how Ruby would see the file. Note the
two "\n" at the end. That was my rationale for the patch.

Now, with the current delimiter "\n//\n", what happens is that when we call
gets(delimiter) repeatedly, it returns "\n" as the last entry and nil
thereafter. This "\n" is the root cause of the problem, as it is returned to
Bio::FlatFile#next_entry and Bio::FlatFile#each_entry, from either
Bio::Splitter::Default#get_entry or Bio::Splitter::Default#get_parsed_entry.
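
The behaviour described above can be reproduced without BioRuby. The following is only a minimal sketch, not BioRuby's actual splitter code: it calls Ruby's own IO#gets with the "\n//\n" separator on a StringIO. The two-record sample input and the method names split_entries and drop_blank are hypothetical, made up for this illustration. The sample ends each record with an extra newline, as the Entrez download discussed in this thread does.

```ruby
require 'stringio'

# Same record separator that Bio::NCBIDB::Common uses in the patch context.
DELIMITER = "\n//\n"

# Hypothetical helper: read delimiter-terminated chunks until gets returns nil.
def split_entries(io, delimiter)
  chunks = []
  while (chunk = io.gets(delimiter))
    chunks << chunk
  end
  chunks
end

# Hypothetical helper: discard chunks that contain only whitespace.
def drop_blank(chunks)
  chunks.reject { |chunk| chunk.strip.empty? }
end

# Made-up input: two records, each followed by one excess newline.
sample = "LOCUS A00002\nORIGIN\n//\n\n" \
         "LOCUS A00003\nORIGIN\n//\n\n"

chunks = split_entries(StringIO.new(sample), DELIMITER)
p chunks.size              # => 3
p chunks.last              # => "\n"
p drop_blank(chunks).size  # => 2
```

Dropping whitespace-only chunks is just one possible way to deal with the trailing newline; whether the library should do so is the open question in this thread.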
The checks employed later for the return value, include checking for nil ( return nil unless r;; in next_entry ). I think we can include check conditions for whitespace to avoid this? I believe Goto-san's mail also implied something on the same line? [1] http://home.cc.umanitoba.ca/~psgendb/X54090.gen.html On Thu, Apr 15, 2010 at 12:56 PM, Tomoaki NISHIYAMA < tomoakin at kenroku.kanazawa-u.ac.jp> wrote: > Hi Goto-san, > > > Splitting entries by using such delimiter is simple and the performance >> is well, but it can only work with correct data which should always be >> ended with the delimiter. Characters after the last delimiter in the >> file is regarded as a single entry because we don't want to lose data. >> >> The behavior can be changed, for example, when getting only white >> spaces and then the end of file without delimiter, it is ignored and >> treated as EOF with no entries. >> > > > Because genbank and genpept format file downloaded from NCBI with entrez > usually ends with double new line characters, > the latter behavior is really desired. > > $ wget -O sequences.gb " > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4 > " > > $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\ > ff.each { |e| c += 1 }; p c' sequences.gb > #==> 4 > Hope it becomes 3. As there are 3 entries. > $ grep LOCUS sequences.gb > LOCUS A00002 194 bp DNA linear PAT > 10-FEB-1993 > LOCUS A00003 194 bp DNA linear PAT > 10-FEB-1993 > LOCUS X17276 556 bp DNA linear MAM > 26-FEB-1992 > > Actually this file have an excess newline at each end of entry. > And his patch will work in this case, despite it is not right as you > mentioned. > > Although in this example no error is reported because we don't do anything > with the > entry, accessing the last entry (the fourth in this case) will cause error. 
> -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- Anurag Priyam 2nd Year,Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Thu Apr 15 08:39:28 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 15 Apr 2010 14:09:28 +0530 Subject: [BioRuby] Patch for Bug 18019. In-Reply-To: <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp> References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp> <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp> Message-ID: > > Because genbank and genpept format file downloaded from NCBI with entrez > usually ends with double new line characters, > the latter behavior is really desired. > > $ wget -O sequences.gb " > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&cmd=text&rettype=gp&id=2,3,4 > " > > $ ruby -rbio -e 'c = 0; ff = Bio::FlatFile.open(ARGV[0]);\ > ff.each { |e| c += 1 }; p c' sequences.gb > #==> 4 > Hope it becomes 3. As there are 3 entries. > $ grep LOCUS sequences.gb > LOCUS A00002 194 bp DNA linear PAT > 10-FEB-1993 > LOCUS A00003 194 bp DNA linear PAT > 10-FEB-1993 > LOCUS X17276 556 bp DNA linear MAM > 26-FEB-1992 > > Actually this file have an excess newline at each end of entry. > And his patch will work in this case, despite it is not right as you > mentioned. > > Although in this example no error is reported because we don't do anything > with the > entry, accessing the last entry (the fourth in this case) will cause error. > As, I mentioned in my previous mail, the cause for the extra entry is cause by a "\n". Even the "\n" gets parsed into Bio::GenBank object. 
No errors are raised. Here:

$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each{ |e| puts e.entry_id};' sequences.gb
A00002
A00003
X17276
nil

$ ruby -rbio -e 'ff = Bio::FlatFile.open(ARGV[0]); ff.each{ |e| puts e.class};' sequences.gb
Bio::GenBank
Bio::GenBank
Bio::GenBank
Bio::GenBank

--
Anurag Priyam
2nd Year, Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Apr 15 09:07:52 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 15 Apr 2010 18:07:52 +0900
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
 <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
 <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi Anurag,

> I believe Goto-san's mail also implied something on the same line?

Goto-san's mail implied that your file is simply incorrect, because
there are no such excess newlines in the official GenBank releases,
and that it is acceptable for the bioruby library to return an extra
entry containing nil on wrong input.

My opinion is that even if the file is not in exactly the same format
as the GenBank releases, the bioruby library should ignore the excess
newline.

My previous mail was meant to explain why such files are frequently
seen, by showing that the NCBI website creates such files and by giving
an easy, reasonable-looking way to obtain one reproducibly.
(Though the tools and email parameters were omitted.)

I am sure he knows well why the newline causes that problem, so there
is no need to explain the cause.
The most important issue is deciding what the right behavior is.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From anurag08priyam at gmail.com Thu Apr 15 09:21:21 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 15 Apr 2010 14:51:21 +0530
Subject: [BioRuby] Patch for Bug 18019.
In-Reply-To:
References: <20100415013454.C600F1CBC419@idnmail.gen-info.osaka-u.ac.jp>
 <20100415063210.EDF761CBC3A5@idnmail.gen-info.osaka-u.ac.jp>
 <47333BD4-949A-4CEC-AF00-60C6CBFB1CB0@kenroku.kanazawa-u.ac.jp>