From chapmanb at 50mail.com Sun Apr 1 15:13:56 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 15:13:56 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To:
References:
Message-ID: <87zkavtgcr.fsf@fastmail.fm>

Lenna;
Thanks for the introduction and glad to hear about your interest in the
variant project. I'm looking forward to seeing your proposal.

The workflow for the variant project involves a biologist querying a VCF
or GVF file with variants from an experiment. They should be able to
easily subset and filter by file components:

- Variant type: Homozygous/Heterozygous variants
- Metrics: depth, strand bias, allele frequency..
- Variants annotated in coding regions causing amino acid changes

As well as rapid subsetting by chromosomal region.

My suggestion would be to leverage external tools as much as possible to
do file manipulation and focus on an API that lets users filter and
extract information pre-contained in the INFO file.

Hope this is helpful as a place to get started. We can provide
additional feedback once you have your proposal ready. Thanks again,
Brad

> Hi all,
>
> I realize time is short, but I am still in the planning phase of my
> GSoC proposal! I wanted to take a moment to formally introduce myself
> to the dev list.
>
> I am affiliated with Purdue University, located in Indiana, USA and
> best known for engineering (Neil Armstrong is a famous graduate). I
> hold a bachelor of arts in biology from Mount Holyoke College in
> Massachusetts. I have extensive wet lab experience with genetics; I'm
> currently working in a lab genotyping mice (the research is intestinal
> lipid metabolism). In August, I begin a PhD in interdisciplinary life
> science at Purdue, and I anticipate that my research will fall
> somewhere in the field of bioinformatics/computational biology. I hope
> to use biopython extensively!
> > In my spare time, other than programming, I enjoy ballroom dance, > science fiction novels, board games, and sailing. > > I've been programming for about 6 years and using python for 4; other > languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL > (primarily MySQL and SQLite), and C++/C. I place a high value on > object oriented design and execution. > > I understand the basics of formal grammar and have some experience > with lex/flex as well as PLY (python lex/yacc). My work so far with > biopython has been on the CIF parsing module. One of my primary goals > for the genomic variants project would be to implement as much > polymorphism and abstraction as possible, for the benefit of both > users and future developers. > > I'm working on a proposal for the genomic variants project, and while > I understand the basics of molecular biology and genetics, I lack > firsthand experience with the type of workflow that would occur in the > context of genomic variants. If anyone can supply a few examples, it > would be greatly appreciated. > > I hope to have a proposal draft ready for feedback by Monday. > > Regards, > > Lenna Peterson > github.com/lennax > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From chapmanb at 50mail.com Sun Apr 1 15:28:32 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 15:28:32 -0400 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> Message-ID: <87wr5ztfof.fsf@fastmail.fm> Bow; > Thank you for the comments and suggestions. I've added a little bit > more details to my personal profile and put it up front. My project > details have also been broken down into single weeks. And I've edited > the commenting permission. Thanks for the updates, this is coming along well. 
My most general suggestion is to spend more time expanding the week-by-week timeline. As an example, take this weekly goal: * Write iterator and random-access parser for EMBOSS water It would be great to see more specific plans for what exactly you deliver and implement during the week. Something like: - Write iterator for EMBOSS water, expanding test suite to ensure produced AlignIO objects are compatible with previous BLAST and HMMER iterators. - Expand index functionality to handle EMBOSS water format for random access. Test edge cases: initial records, final records, empty records. - Document 'water' parsing with a use case emphasizing differences from BLAST and HMMER searching. Peter probably has more specific thoughts on the actual content but it's important to think through things in this manner. This will make it easier to approach weeks during the summer since you'll already have tasks broken down, and will also demonstrate you've thought about potential problems and roadblocks and have solutions to overcome them. > As for my other obligations, I didn't mean to give that impression. I > added a little bite more detail about the project itself, but I'm not > sure about the time that I should write. I estimate that at most, for > each week day, I spend 8 hours doing my Master's project in my lab's > campus. Since the project started, I usually use the remainder of the > time (~6 hours/day) for my own personal programming projects. I plan > to use the personal programming time slot for my GSoC instead, if > accepted. Should I be this thorough in the proposal? This is exactly my worry. You're proposing working two full time jobs all summer long. Not to denigrate your work ethic, but 80 hour weeks are hard and leave you no time for important things like having a life outside of work. My suggestion would be to see if you can scale back your Master's commitments for the summer if accepted into GSoC. 
This would definitely improve your proposal since reviewers will worry about
the time commitment.

Hope this all helps,
Brad

From chapmanb at 50mail.com Sun Apr 1 16:30:26 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 01 Apr 2012 16:30:26 -0400
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <4F74855B.9000603@med.nyu.edu>
References: <4F74855B.9000603@med.nyu.edu>
Message-ID: <87obrbtct9.fsf@fastmail.fm>

Andrew;
Thanks for putting this together. It looks great, is well integrated
with AlignIO and it's awesome to see a test suite.

I dug through the code and my small suggestions would be:

- Could you refactor some of the larger functions into separate smaller
  components? A couple of these spread over a ton of lines and it can be
  a bit difficult to follow the logic throughout:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172
  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399

  As a practical example, here you have a large block which checks the
  SQLite index matches the MAF file and everything looks okay:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199

  This would be clearer if factored into something like:

    if os.path.isfile(sqlite_file):
        try:
            self._record_count = self._verify_record_count(con)
        except ...

- Would you be able to put together a small example for the
  Cookbook or Tutorial documentation? This would be a great way to help
  others get started with the functionality and advertise it.

Thanks again for this,
Brad

> Hi all,
>
> I would like to start a discussion about what is needed to make the
> AlignIO.MafIO parser and indexer ready for the next release. If anyone
> is unfamiliar with MAF (Multiple Alignment Format), it is the file
> format that eukaryote genome-to-genome multiple alignments produced by
> multiz are stored in.
> > The exact specs are here: > http://genome.ucsc.edu/FAQ/FAQformat.html#format5 > > Some use cases are discussed in this paper, which implements (I believe) > most of the same functionality of the MafIO class in Galaxy: > http://www.ncbi.nlm.nih.gov/pubmed/21775304 > > The branch of my biopython fork that contains the class: > https://github.com/polyatail/biopython/tree/alignio-maf > > The class is implemented as a reader/writer compatible with the AlignIO > API, but implements its own indexer (MafIO.MafIndex) based on > SeqIO.index_db(). At the time, this seemed like the best way to > implement this, as MAF is explicitly designed for genome-to-genome > alignments while other formats are not. If we can assume a MAF file > contains such an alignment, we can index it by genome coordinates and > allow random access to intervals. > > This is especially useful since it is often desirable to retrieve the > spliced multiple alignment of a multi-exonic transcript, which can be > used to determine sequence conservation, construct a phylogenetic tree > for a particular gene, or pull out orthologs of a large number of genes > at once. > > The code consists of the reader, writer, and indexer classes in > AlignIO/MaFIO.py, test files in Tests/MAF, and unit tests specific to > the indexer in Tests/test_MafIO_index.py. I would really appreciate any > feedback and suggestions, and if anyone has an opportunity to use this > feature it would be great to get some feedback on its operation. > > > Thanks! 
> Andrew > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Sun Apr 1 21:40:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 2 Apr 2012 01:40:27 +0000 Subject: [Biopython-dev] [Biopython - Feature #3336] (New) Make Phylo.draw more customizable Message-ID: Issue #3336 has been reported by Eric Talevich. ---------------------------------------- Feature #3336: Make Phylo.draw more customizable https://redmine.open-bio.org/issues/3336 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: On and off the mailing lists, I've received requests to make the plots rendered by Phylo.draw more customizable. For example: http://lists.open-bio.org/pipermail/biopython/2012-March/007851.html Since Phylo.draw is based on matplotlib/pyplot, it should be possible for essentially everything about the plot to be customizable by the user using pyplot's standard mechanisms -- e.g. adjust the font sizes with rcParams["font.size"]. Other requested features: * Accept **kwargs in Phylo.draw, and pass it along to pyplot -- but where? * Format the confidence/support values differently (currently everything is treated as a float), including or perhaps with the addition of arbitrary branch labels (e.g. estimated number of mutations on a branch) * Return a mapping of clade objects to a tuple or dict of pyplot elements (LineCollection, PatchCollection, etc.) ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From arklenna at gmail.com Sun Apr 1 22:10:45 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 1 Apr 2012 22:10:45 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To: <87zkavtgcr.fsf@fastmail.fm>
References: <87zkavtgcr.fsf@fastmail.fm>
Message-ID:

Hi Brad,

Thank you so much for your suggestions. My initial evaluation of the
strengths of existing software has led me to strongly agree with your
recommendation to focus on the usability of the API.

I submit this draft of my proposal to the dev list for feedback:

https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit

Lenna

On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman wrote:
>
> Lenna;
> Thanks for the introduction and glad to hear about your interest in the
> variant project. I'm looking forward to seeing your proposal.
>
> The workflow for the variant project involves a biologist querying a VCF
> or GVF file with variants from an experiment. They should be able to
> easily subset and filter by file components:
>
> - Variant type: Homozygous/Heterozygous variants
> - Metrics: depth, strand bias, allele frequency..
> - Variants annotated in coding regions causing amino acid changes
>
> As well as rapid subsetting by chromosomal region.
>
> My suggestion would be to leverage external tools as much as possible to
> do file manipulation and focus on an API that lets users filter and
> extract information pre-contained in the INFO file.
>
> Hope this is helpful as a place to get started. We can provide
> additional feedback once you have your proposal ready. Thanks again,
> Brad
>
>> Hi all,
>>
>> I realize time is short, but I am still in the planning phase of my
>> GSoC proposal! I wanted to take a moment to formally introduce myself
>> to the dev list.
>> >> I am affiliated with Purdue University, located in Indiana, USA and >> best known for engineering (Neil Armstrong is a famous graduate). I >> hold a bachelor of arts in biology from Mount Holyoke College in >> Massachusetts. I have extensive wet lab experience with genetics; I'm >> currently working in a lab genotyping mice (the research is intestinal >> lipid metabolism). In August, I begin a PhD in interdisciplinary life >> science at Purdue, and I anticipate that my research will fall >> somewhere in the field of bioinformatics/computational biology. I hope >> to use biopython extensively! >> >> In my spare time, other than programming, I enjoy ballroom dance, >> science fiction novels, board games, and sailing. >> >> I've been programming for about 6 years and using python for 4; other >> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL >> (primarily MySQL and SQLite), and C++/C. I place a high value on >> object oriented design and execution. >> >> I understand the basics of formal grammar and have some experience >> with lex/flex as well as PLY (python lex/yacc). My work so far with >> biopython has been on the CIF parsing module. One of my primary goals >> for the genomic variants project would be to implement as much >> polymorphism and abstraction as possible, for the benefit of both >> users and future developers. >> >> I'm working on a proposal for the genomic variants project, and while >> I understand the basics of molecular biology and genetics, I lack >> firsthand experience with the type of workflow that would occur in the >> context of genomic variants. If anyone can supply a few examples, it >> would be greatly appreciated. >> >> I hope to have a proposal draft ready for feedback by Monday. 
>>
>> Regards,
>>
>> Lenna Peterson
>> github.com/lennax
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From p.j.a.cock at googlemail.com Mon Apr 2 04:26:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Apr 2012 09:26:16 +0100
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
Message-ID:

On Sun, Apr 1, 2012 at 9:30 PM, Brad Chapman wrote:
>
> Andrew;
> Thanks for putting this together. It looks great, is well integrated
> with AlignIO and it's awesome to see a test suite.

Indeed, +1 on tests :)

Apologies for not replying earlier - this was flagged in my
email client all of last week.

> I dug through the code and my small suggestions would be:
>
> - Could you refactor some of the larger functions into separate smaller
>   components? A couple of these spread over a ton of lines and it can be
>   a bit difficult to follow the logic throughout:
>
> ...
>
>   As a practical example, here you have a large block which checks the
>   SQLite index matches the MAF file and everything looks okay:

Maybe I should do the same with the SeqIO SQLite code.

> - Would you be able to put together a small example for the
>   Cookbook or Tutorial documentation? This would be a great way to help
>   others get started with the functionality and advertise it.

He already has - very organised :)
http://biopython.org/wiki/Multiple_Alignment_Format

Is there any more about reverse complemented sequences
and how they are handled in simple iterators, and more
so when indexing? What I'm getting at here is the non-typical
treatment of start and end being relative to the reverse
complemented sequence for minus strand alignments. Here
most tools/formats always count from the first base on the
forward strand.
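
To make the coordinate convention above concrete, here is a minimal
sketch, assuming the field meanings given in the UCSC MAF format
description: each "s" line carries start, size, strand and srcSize, start
is zero-based, and for minus-strand rows start is counted along the
reverse-complemented source sequence. The function name maf_to_forward is
hypothetical and is not part of MafIO or Biopython; it only illustrates
the arithmetic.

```python
def maf_to_forward(start, size, strand, src_size):
    """Convert a MAF row to forward-strand coordinates.

    Returns a zero-based, half-open (start, end) pair on the forward
    strand. Hypothetical helper for illustration only.
    """
    if strand == "+":
        # Plus-strand rows already count from the first forward base.
        return start, start + size
    if strand == "-":
        # Minus-strand rows count from the start of the reverse
        # complement, so mirror the interval using the source length.
        return src_size - (start + size), src_size - start
    raise ValueError("strand must be '+' or '-', got %r" % (strand,))


if __name__ == "__main__":
    # A 5-base block starting at offset 10 on the minus strand of a
    # 100-base source covers forward-strand positions 85 to 90.
    print(maf_to_forward(10, 5, "-", 100))
```

With this convention a plus-strand row is returned unchanged, while the
minus-strand example above maps to (85, 90) on the forward strand.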
Peter From andrew.sczesnak at med.nyu.edu Mon Apr 2 20:15:18 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 02 Apr 2012 20:15:18 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: <87obrbtct9.fsf@fastmail.fm> References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> Message-ID: <4F7A4116.5000602@med.nyu.edu> Hi Brad, Thank you for the feedback. I've tried to work on some of your suggestions and will continue doing so. > - Could you refactor some of the larger functions into separate smaller > components? A couple of these spread over a ton of lines and it can be > a bit difficult to follow the logic throughout: Definitely--I see what you mean. I split __init__ into a couple functions. I'm still worried about the 100 lines of get_spliced(). It's big mostly because I overdid it on the comments, but hopefully that helps explain the logic enough that someone else could work on it without pulling their hair out. > - Would you be able to put together a small example for the > Cookbook or Tutorial documentation? This would be a great way to help > others get started with the functionality and advertise it. Absolutely. I have a few more ideas for cool demos that integrate with other parts of Biopython. What's the best place to put draft text for the tutorial? Thanks, Andrew From andrew.sczesnak at med.nyu.edu Mon Apr 2 20:33:51 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 02 Apr 2012 20:33:51 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> Message-ID: <4F7A456F.3020306@med.nyu.edu> Hi Peter, Thank you for the feedback. I will try to make sure this code is well tested before the next release. > Is there any more about reverse complemented sequences > and how they are handled, for in simple iterators, but more > so when indexing? 
What I'm getting at here is the non-typical > treatment of start and end being relative to the reverse > complemented sequence for minus strand alignments. Here > most tools/formats always count from the first base on the > forward strand. I'm not sure I'm understanding you, but I hope I am. In theory it seems like strandedness would be an issue, however in practice the reference species in a multiz MAF file is always the plus strand. To make sure the user isn't trying to pass a MAF file containing blocks with mixed strands to MafIndex.get_spliced(), there's a check in there to make sure all strands for the reference species are the same. We also assume that coordinates specified in a block are always in the ascending direction (i.e. they are given as 'start' and 'size' and we assume the coordinates are [start, start + size]). There could be an issue, however, if the best alignment for a particular species swaps strands between alignment blocks and/or exons of a transcript. However, it might be safe to say that the user is interested in the best alignment however it occurs, and not necessarily strand consistency. WRT MultipleSeqAlignment objects produced by get_spliced(), all annotation properties are lost upon slicing, so it is up to the user to keep track of what's what. I do remember we had talked about a way to maintain these annotations, even after slicing. Any thoughts? Thanks, Andrew From p.j.a.cock at googlemail.com Tue Apr 3 05:03:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 10:03:55 +0100 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: Message-ID: On Wed, Mar 21, 2012 at 3:27 PM, Peter Cock wrote: > Hello all, > > I'm pleased to see that the GSoC SearchIO project idea I put up > has sparked some interest: > > http://biopython.org/wiki/Google_Summer_of_Code > > ... Just a reminder that the GSoC application deadline is this Friday, 6 April. 
The application website has been open since 26 March, so I would encourage you to upload your current proposal soon in case there are server load problems on the last day (you will still be able to revise the proposal after uploading it). http://www.google-melange.com/gsoc/homepage/google/gsoc2012 Also, in particular for those of you interested in the SearchIO project which I would mentor, I will be away Thursday 5 and Friday 6 April, so you will not be able to ask me for any last minute feedback. Good luck, Peter From chapmanb at 50mail.com Tue Apr 3 09:06:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 03 Apr 2012 09:06:36 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: <4F7A4116.5000602@med.nyu.edu> References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> <4F7A4116.5000602@med.nyu.edu> Message-ID: <87hax1hsmb.fsf@fastmail.fm> Andrew; > Definitely--I see what you mean. I split __init__ into a couple > functions. I'm still worried about the 100 lines of get_spliced(). It's > big mostly because I overdid it on the comments, but hopefully that > helps explain the logic enough that someone else could work on it > without pulling their hair out. Definitely agreed. It's well-commented which makes it much easier for others to dig in. Thanks for taking a look at the refactoring. > Absolutely. I have a few more ideas for cool demos that integrate with > other parts of Biopython. What's the best place to put draft text for > the tutorial? Apologies that I'd totally missed your cookbook entry. That looks great, but more documentation is always better. If you are okay with LaTeX, the Tutorial is in Doc/Tutorial.tex so you can edit directly. The wiki is also a good place for docs if you prefer to go that way. Thanks again for all the work on this. 
Looking forward to having it in,
Brad

From chapmanb at 50mail.com Tue Apr 3 10:53:33 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 10:53:33 -0400
Subject: [Biopython-dev] GSoC genomic variant proposal
In-Reply-To:
References: <87zkavtgcr.fsf@fastmail.fm>
Message-ID: <87r4w4hno2.fsf@fastmail.fm>

Lenna;
Thanks for getting this together, that's a great start. I left some
specific comments but my general suggestion is to get more detailed
about the code specifics. During the summer, you use the weekly timeline
as a todo list, so having lots of details makes the process so much
easier. Instead of seeing a general item like: "Implement X" you want
"Implement X by extending API from last week to support get_Y using
sqlite3 index table. Test cases A, B, C and D to avoid...". Having this
kind of checklist todo helps make it easy to get started each week and
ensure everything is on track.

The additional benefit for selection is that it helps convince reviewers
you've thought about the technical details and foreseen any potential
problems. Hope this helps,
Brad

> Hi Brad,
>
> Thank you so much for your suggestions. My initial evaluation of the
> strengths of existing software has led me to strongly agree with your
> recommendation to focus on the usability of the API.
>
> I submit this draft of my proposal to the dev list for feedback:
>
> https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit
>
>
> Lenna
>
>
> On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman wrote:
> >
> > Lenna;
> > Thanks for the introduction and glad to hear about your interest in the
> > variant project. I'm looking forward to seeing your proposal.
> >
> > The workflow for the variant project involves a biologist querying a VCF
> > or GVF file with variants from an experiment.
They should be able to > > easily subset and filter by file components: > > > > - Variant type: Homozygous/Heterozygous variants > > - Metrics: depth, strand bias, allele frequency.. > > - Variants annotated in coding regions causing amino acid changes > > > > As well as rapid subsetting by chromosomal region. > > > > My syggestion would be to leverage external tools as much as possible to > > do file manipulation and focus on an API that lets users filter and > > extract information pre-contained in the INFO file. > > > > Hope this is helpful as a place to get started. We can provide > > additional feedback once you have your proposal ready. Thanks again, > > Brad > > > >> Hi all, > >> > >> I realize time is short, but I am still in the planning phase of my > >> GSoC proposal! I wanted to take a moment to formally introduce myself > >> to the dev list. > >> > >> I am affiliated with Purdue University, located in Indiana, USA and > >> best known for engineering (Neil Armstrong is a famous graduate). I > >> hold a bachelor of arts in biology from Mount Holyoke College in > >> Massachusetts. I have extensive wet lab experience with genetics; I'm > >> currently working in a lab genotyping mice (the research is intestinal > >> lipid metabolism). In August, I begin a PhD in interdisciplinary life > >> science at Purdue, and I anticipate that my research will fall > >> somewhere in the field of bioinformatics/computational biology. I hope > >> to use biopython extensively! > >> > >> In my spare time, other than programming, I enjoy ballroom dance, > >> science fiction novels, board games, and sailing. > >> > >> I've been programming for about 6 years and using python for 4; other > >> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL > >> (primarily MySQL and SQLite), and C++/C. I place a high value on > >> object oriented design and execution. 
> >> > >> I understand the basics of formal grammar and have some experience > >> with lex/flex as well as PLY (python lex/yacc). My work so far with > >> biopython has been on the CIF parsing module. One of my primary goals > >> for the genomic variants project would be to implement as much > >> polymorphism and abstraction as possible, for the benefit of both > >> users and future developers. > >> > >> I'm working on a proposal for the genomic variants project, and while > >> I understand the basics of molecular biology and genetics, I lack > >> firsthand experience with the type of workflow that would occur in the > >> context of genomic variants. If anyone can supply a few examples, it > >> would be greatly appreciated. > >> > >> I hope to have a proposal draft ready for feedback by Monday. > >> > >> Regards, > >> > >> Lenna Peterson > >> github.com/lennax > >> _______________________________________________ > >> Biopython-dev mailing list > >> Biopython-dev at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython-dev From w.arindrarto at gmail.com Tue Apr 3 11:22:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 3 Apr 2012 17:22:04 +0200 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: <87wr5ztfof.fsf@fastmail.fm> References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> <87wr5ztfof.fsf@fastmail.fm> Message-ID: On Sun, Apr 1, 2012 at 21:28, Brad Chapman wrote: > > Bow; > >> Thank you for the comments and suggestions. I've added a little bit >> more details to my personal profile and put it up front. My project >> details have also been broken down into single weeks. And I've edited >> the commenting permission. > > Thanks for the updates, this is coming along well. My most general > suggestion is to spend more time expanding the week-by-week > timeline. 
As an example, take this weekly goal:
>
> * Write iterator and random-access parser for EMBOSS water
>
> It would be great to see more specific plans for what exactly you
> deliver and implement during the week. Something like:
>
> - Write iterator for EMBOSS water, expanding test suite to ensure
>   produced AlignIO objects are compatible with previous BLAST and HMMER
>   iterators.
>
> - Expand index functionality to handle EMBOSS water format for random
>   access. Test edge cases: initial records, final records, empty
>   records.
>
> - Document 'water' parsing with a use case emphasizing differences from
>   BLAST and HMMER searching.
>
> Peter probably has more specific thoughts on the actual content but it's
> important to think through things in this manner. This will make it
> easier to approach weeks during the summer since you'll already have
> tasks broken down, and will also demonstrate you've thought about
> potential problems and roadblocks and have solutions to overcome them.

Thanks for the additional feedback, Brad. I am in the process of adding
more detailed descriptions of my weekly tasks.

>> As for my other obligations, I didn't mean to give that impression. I
>> added a little bit more detail about the project itself, but I'm not
>> sure about the time that I should write. I estimate that at most, for
>> each week day, I spend 8 hours doing my Master's project in my lab's
>> campus. Since the project started, I usually use the remainder of the
>> time (~6 hours/day) for my own personal programming projects. I plan
>> to use the personal programming time slot for my GSoC instead, if
>> accepted. Should I be this thorough in the proposal?
>
> This is exactly my worry. You're proposing working two full time jobs
> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
> hard and leave you no time for important things like having a life
> outside of work.
My suggestion would be to see if you can scale back > your Master's commitments for the summer if accepted into GSoC. This > would definitely improve your proposal since reviewers will worry about > the time commitment. > > Hope this all helps, > Brad Ah, that's ok, I understand your concern :). I talked with my supervisor yesterday regarding this and he understood that I can scale back the time spent for my current project if accepted. I've revised this detail as well in the proposal. Thanks again, Bow From p.j.a.cock at googlemail.com Tue Apr 3 11:32:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 16:32:08 +0100 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> <87wr5ztfof.fsf@fastmail.fm> Message-ID: On Tue, Apr 3, 2012 at 4:22 PM, Wibowo Arindrarto wrote: > On Sun, Apr 1, 2012 at 21:28, Brad Chapman wrote: >> >> This is exactly my worry. You're proposing working two full time jobs >> all summer long. Not to denigrate your work ethic, but 80 hour weeks are >> hard and leave you no time for important things like having a life >> outside of work. My suggestion would be to see if you can scale back >> your Master's commitments for the summer if accepted into GSoC. This >> would definitely improve your proposal since reviewers will worry about >> the time commitment. >> >> Hope this all helps, >> Brad > > Ah, that's ok, I understand your concern :). I talked with my > supervisor yesterday regarding this and he understood that I can scale > back the time spent for my current project if accepted. I've revised > this detail as well in the proposal. > > Thanks again, > Bow Excellent - I'm pleased your supervisor is being supportive. 
That should help address this concern :) Peter From mjldehoon at yahoo.com Tue Apr 3 14:27:26 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 3 Apr 2012 11:27:26 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: Message-ID: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> While I think that the SearchIO module is a good idea, you may want to consider choosing a different name for this module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, roughly speaking the class definitions are in the former and the parser is in the latter module. I don't quite understand why these two are separated into distinct modules, as to me conceptually the two belong together. Bio.SearchIO in my understanding will combine both the parsers and the class definitions, which is a good thing, but then I would prefer a name without "IO" in it. Best, -Michiel. --- On Tue, 4/3/12, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] GSoC SearchIO project > To: "Biopython-Dev Mailing List" > Date: Tuesday, April 3, 2012, 5:03 AM > On Wed, Mar 21, 2012 at 3:27 PM, > Peter Cock > wrote: > > Hello all, > > > > I'm pleased to see that the GSoC SearchIO project idea > I put up > > has sparked some interest: > > > > http://biopython.org/wiki/Google_Summer_of_Code > > > > ... > > Just a reminder that the GSoC application deadline is this > Friday, > 6 April. The application website has been open since 26 > March, > so I would encourage you to upload your current proposal > soon > in case there are server load problems on the last day (you > will > still be able to revise the proposal after uploading it). > http://www.google-melange.com/gsoc/homepage/google/gsoc2012 > > Also, in particular for those of you interested in the > SearchIO > project which I would mentor, I will be away Thursday 5 and > Friday 6 April, so you will not be able to ask me for any > last > minute feedback. 
> > Good luck, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Apr 3 15:44:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 20:44:48 +0100 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> References: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> Message-ID: On Tue, Apr 3, 2012 at 7:27 PM, Michiel de Hoon wrote: > While I think that the SearchIO module is a good idea, you > may want to consider choosing a different name for this > module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, > roughly speaking the class definitions are in the former and > the parser is in the latter module. I don't quite understand > why these two are separated into distinct modules, as to > me conceptually the two belong together. Bio.SearchIO in > my understanding will combine both the parsers and the > class definitions, which is a good thing, but then I would > prefer a name without "IO" in it. > > Best, > -Michiel. Yes, I was thinking to have both the parsers and the new objects under the name module namespace. The reason for using SearchIO (despite not being PEP8 compatible - something I regret in the naming of SeqIO and the pattern it set) is to match SeqIO and AlignIO and BioPerl. Anyone familiar with BioPerl will immediately see what it is for - and some of the student applicants have already used BioPerl's SearchIO. Personally I find this quite a compelling argument. That said, the name SearchIO isn't the clearest in the the world for a newcomer - however I haven't come up with anything significantly better myself. Perhaps there is a better name out there, which would justify breaking the pattern? I've considered pairwise and palign, but neither feels right. 
Given a clean slate (Biopython 2?), then yes, I would agree with consolidating Bio.Align and Bio.AlignIO as one namespace, probably "align" (lower case). The situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO isn't quite so simple - perhaps "seq" (lower case)? Then (in the absence of any other ideas), SearchIO would become "search" (lower case). Peter From redmine at redmine.open-bio.org Tue Apr 3 17:13:13 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 3 Apr 2012 21:13:13 +0000 Subject: [Biopython-dev] [Biopython - Bug #3337] (New) 'Bio.trie.trie' is not picklable Message-ID: Issue #3337 has been reported by Sergei Lebedev. ---------------------------------------- Bug #3337: 'Bio.trie.trie' is not picklable https://redmine.open-bio.org/issues/3337 Author: Sergei Lebedev Status: New Priority: Normal Assignee: Category: Target version: URL: Is there any reason for this, or has nobody had the need (or time) to implement the pickle interface? ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Wed Apr 4 04:46:47 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Wed, 4 Apr 2012 10:46:47 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Hi, is there any news on this? May I help somehow? But I have to admit that I barely speak Perl and have no experience with BioPerl. If someone tells me where to look I might still try it. Matthias 2012/3/29 Peter Cock : > On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >> Hi, >> >> Is it possible to get the property if a genome is circular / linear >> from SeqIO applied to genbank files? I could not find it.
>> >> There is also a related bugreport: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >> >> I used the old parser before and switched to SeqIO which I really like >> for the possibilities to parse different formats... but I really need >> the information. > > Does anyone happen to have a BioPerl + BioSQL setup installed > and working? IIRC checking that to make sure however we > store the circular was compatible was the only real hurdle. > > Peter From arklenna at gmail.com Wed Apr 4 20:04:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 4 Apr 2012 20:04:30 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: <87r4w4hno2.fsf@fastmail.fm> References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> Message-ID: On Tue, Apr 3, 2012 at 10:53 AM, Brad Chapman wrote: > > Lenna; > Thanks for getting this together, that's a great start. I left some > specific comments but my general suggestion is to get more detailed > about the code specifics. During the summer, you use the weekly timeline > as a todo list so having lots of details make the process so much > easier. Instead of seeing a general item like: "Implement X" you want > "Implement X by extending API from last week to support get_Y using > sqlite3 index table. Test cases A, B, C and D to avoid...". > > Having these kind of checklist todos helps make it easy to get started > each week and ensure everything is on track. The additional benefit for > selection is that is helps convince reviewers you've thought about the > technical details and forseen any potential problems. > > Hope this helps, > Brad > Hi all, I'm linking to a revision of my GSoC proposal: https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit Thank you to everyone for your feedback. Peter, I didn't realize Biopython has never been tested on IronPython. As I have no familiarity with .NET or Windows, I'll have to rescind my offer to test it. Sorry to get your hopes up! 
Reece, I've revised the prose sections and almost completely rewritten the timeline. This version provides more information about my background, a more detailed description of the overall project, and more specific goals. Brad, I've tried to go into as much detail as my knowledge of VCF and GVF structure allows. I laid out a more specific structure for both the backend and frontend structures for the data. I've revised the unit tests to be more specific and less dependent on interaction with other modules and I've tried to anticipate some cases that may produce unexpected behavior. I also highlighted specific places where the design should be generalizable. James, I hope my revised project description is more focused. Regarding CNV etc., I did not mean to specifically exclude them by mentioning SNPs, and I've reworded that paragraph to be more general. I get the impression that CNV and other structural variants are considerably more complex to represent and manipulate. I'd be more than happy to read more about breakpoint theory etc. and to prototype any specific workflows you might suggest. Lenna From eric.talevich at gmail.com Wed Apr 4 22:53:10 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 4 Apr 2012 22:53:10 -0400 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices Message-ID: Hi all, I'm considering some enhancements to the Phylo.draw function to make it more customizable for power users. Since the function is based on matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the user; however, I'm not fully versed in what pyplot is capable of. Relevant feature request in Redmine: https://redmine.open-bio.org/issues/3336 Ideas: 1. Make the draw function return a mapping of clades to a collection of pyplot graphical elements -- the objects emitted by pyplot during each step of rendering the plot. 
Each clade in the tree is mapped to a horizontal line, a vertical line, a text label (taxon name, normally), and another text label for the branch (confidence/support, normally). The user can then set the attributes of these objects as they wish, minimizing the need for further extensions to Phylo.draw. Example: {: { "hline": , "vline": , "taxon_label": , "branch_label": }, ... If the user needs access to the figure or axis object as well, it's already easy enough to create these beforehand and pass the 'axis' object to Phylo.draw. 2. Add an argument 'branch_labels' to Phylo.draw. This will accept either (a) a dict which maps the tree's Clade objects to string labels, or (b) a function which accepts a Clade object and returns a string. Default: a function that formats the clade's 'confidence' or 'confidences' attribute, matching the current behavior. Examples: >>> draw(mytree, branch_labels={mytree.root: "Root", ...}) >>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence) >>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank) 3. Accept **kwargs in Phylo.draw; pass it right along to pyplot at some point. Question: What basic pyplot function accepts **kwargs? pyplot.figure and pyplot.set_subplot don't seem appropriate. An alternative is to use pyplot.rcParams, either leaving it all to the user or treating the **kwargs keys as the corresponding entries in rcParams. Syntax gets a little tricky. (Not a top priority for me, actually, since rcParams works.) Thoughts? All clear?
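To make idea 2 concrete, here is a minimal sketch of how a 'branch_labels' argument that is either a dict or a function might be normalized into a single label function before drawing. This is only an illustration, not the code in Phylo.draw: the Clade class below is a hypothetical stand-in for the real Bio.Phylo class, and make_label_func and default_label are assumed names.

```python
class Clade(object):
    """Hypothetical stand-in for a Bio.Phylo clade, for illustration only."""

    def __init__(self, name=None, confidence=None):
        self.name = name
        self.confidence = confidence


def default_label(clade):
    """Format the clade's confidence, or return an empty string if unset."""
    if clade.confidence is None:
        return ""
    return str(clade.confidence)


def make_label_func(branch_labels=None):
    """Turn a dict, a callable, or None into a clade -> string function."""
    if branch_labels is None:
        return default_label
    if isinstance(branch_labels, dict):
        # Clades missing from the dict simply get no label
        return lambda clade: branch_labels.get(clade, "")
    if callable(branch_labels):
        return branch_labels
    raise TypeError("branch_labels must be a dict or a callable")


root = Clade("root")
leaf = Clade("A", confidence=95)
print(make_label_func({root: "Root"})(root))
print(make_label_func(lambda clade: "%d" % clade.confidence)(leaf))
```

Normalizing once up front means the drawing loop only ever calls one function per clade, whichever form the user passed in.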
Thanks, Eric From chapmanb at 50mail.com Thu Apr 5 06:47:09 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 05 Apr 2012 06:47:09 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> Message-ID: <871uo2cv6a.fsf@fastmail.fm> Lenna; > I'm linking to a revision of my GSoC proposal: > > https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit > > Thank you to everyone for your feedback. This is coming along great, thanks for all the work on it. I've added a couple of specific suggestions about iterative parsing, which PyVCF does, and using external tools to make the coding region evaluation work easier. One other practical suggestion: you should add a link to the latest version of your google doc at the top of your proposal on the GSoC Melange site. You won't be able to edit there after Friday but can update your google document in case of reviewer suggestions. Thanks again and best of luck during the review process, Brad > > > Peter, > > I didn't realize Biopython has never been tested on IronPython. As I > have no familiarity with .NET or Windows, I'll have to rescind my > offer to test it. Sorry to get your hopes up! > > > Reece, > > I've revised the prose sections and almost completely rewritten the > timeline. This version provides more information about my background, > a more detailed description of the overall project, and more specific > goals. > > > Brad, > > I've tried to go into as much detail as my knowledge of VCF and GVF > structure allows. I laid out a more specific structure for both the > backend and frontend structures for the data. I've revised the unit > tests to be more specific and less dependent on interaction with other > modules and I've tried to anticipate some cases that may produce > unexpected behavior. I also highlighted specific places where the > design should be generalizable. 
> > > James, > > I hope my revised project description is more focused. Regarding CNV > etc., I did not mean to specifically exclude them by mentioning SNPs, > and I've reworded that paragraph to be more general. I get the > impression that CNV and other structural variants are considerably > more complex to represent and manipulate. I'd be more than happy to > read more about breakpoint theory etc. and to prototype any specific > workflows you might suggest. > > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From arklenna at gmail.com Thu Apr 5 22:50:52 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 5 Apr 2012 22:50:52 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: <871uo2cv6a.fsf@fastmail.fm> References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> <871uo2cv6a.fsf@fastmail.fm> Message-ID: On Thu, Apr 5, 2012 at 6:47 AM, Brad Chapman wrote: > > Lenna; > >> I'm linking to a revision of my GSoC proposal: >> >> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit >> >> Thank you to everyone for your feedback. > > This is coming along great, thanks for all the work on it. I've added a > couple of specific suggestions about iterative parsing, which PyVCF > does, and using external tools to make the coding region evaluation work > easier. > > One other practical suggestion: you should add a link to the latest > version of your google doc at the top of your proposal on the GSoC > Melange site. You won't be able to edit there after Friday but can > update your google document in case of reviewer suggestions. > > Thanks again and best of luck during the review process, > Brad > Brad - Thank you again for your detailed feedback. As per your suggestion, I have updated my proposal on GSoC Melange to include a link to the latest version of my proposal. 
Lenna From mjldehoon at yahoo.com Sat Apr 7 00:43:56 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 6 Apr 2012 21:43:56 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project Message-ID: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> --- On Tue, 4/3/12, Peter Cock wrote: > The reason for using SearchIO (despite not being PEP8 > compatible - something I regret in the naming of SeqIO > and the pattern it set) is to match SeqIO and AlignIO and > BioPerl. Anyone familiar with BioPerl will immediately see > what it is for - and some of the student applicants have > already used BioPerl's SearchIO. Personally I find this > quite a compelling argument. Sorry but I am not convinced. I doubt that somebody familiar with BioPerl's Align and AlignIO modules will have trouble finding the parser in Biopython if in Biopython there is only a Bio.Align module. Also this means that some modules in Biopython are split up in Module and ModuleIO, whereas most others are not. In this particular case, for consistency you would have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a clean module organization in Biopython instead of strictly following what BioPerl did. > That said, the name SearchIO isn't the clearest in the > the world for a newcomer - however I haven't come up > with anything significantly better myself. Perhaps there > is a better name out there, which would justify breaking > the pattern? I've considered pairwise and palign, but > neither feels right. How about including this module as a submodule in Bio.Align? If we think of Bio.Align as a general module for alignments, then pairwise alignments fit in it too. It depends a bit on the exact API, but I expect that we can come up with something elegant. > Given a clean slate (Biopython 2?), then yes, I would > agree with consolidating Bio.Align and Bio.AlignIO as > one namespace, probable "align" (lower case). 
The > situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO > isn't quite so simple - perhaps "seq" (lower case)? There are two steps here: consolidation of some modules, and changing the names of modules to comply with PEP8. The consolidation can happen without waiting for a Biopython 2, as long as there are clear deprecating warnings in the modules that will be removed. Compliance with PEP8 is a bit trickier, since it means relearning all module names, and some systems (Windows?) may not distinguish between lower and upper case. > Then (in the absence of any other ideas), SearchIO > would become "search" (lower case). If we already know now that we will drop the IO from SearchIO at some point, then SearchIO doesn't seem to be a good name. Best, -Michiel. From eric.talevich at gmail.com Sat Apr 7 12:13:16 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 7 Apr 2012 12:13:16 -0400 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> References: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> Message-ID: On Sat, Apr 7, 2012 at 12:43 AM, Michiel de Hoon wrote: > --- On Tue, 4/3/12, Peter Cock wrote: > > The reason for using SearchIO (despite not being PEP8 > > compatible - something I regret in the naming of SeqIO > > and the pattern it set) is to match SeqIO and AlignIO and > > BioPerl. Anyone familiar with BioPerl will immediately see > > what it is for - and some of the student applicants have > > already used BioPerl's SearchIO. Personally I find this > > quite a compelling argument. > > Sorry but I am not convinced. I doubt that somebody familiar with > BioPerl's Align and AlignIO modules will have trouble finding the parser in > Biopython if in Biopython there is only a Bio.Align module. Also this means > that some modules in Biopython are split up in Module and ModuleIO, whereas > most others are not. 
In this particular case, for consistency you would > have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a > clean module organization in Biopython instead of strictly following what > BioPerl did. > How about Bio.Search, for now? We had a similar discussion at the end of GSoC 2009, when we decided to merge Tree and TreeIO (names inspired by BioPerl) to create Phylo (because not all trees are phylogenies, although there is also a Perl module called Bio::Phylo). Since the *IO namespaces have only 4 public functions, plus a IO.py module for each supported I/O format, it's not too cluttered. Likewise, at the end of this GSoC it may be more clear whether the new sub-package should have a different name. (SearchIO seems to have been plenty effective at drawing attention to the project.) But in any case, I support putting all the new work under one sub-package, rather than two. > That said, the name SearchIO isn't the clearest in the > > the world for a newcomer - however I haven't come up > > with anything significantly better myself. Perhaps there > > is a better name out there, which would justify breaking > > the pattern? I've considered pairwise and palign, but > > neither feels right. > > How about including this module as a submodule in Bio.Align? If we think > of Bio.Align as a general module for alignments, then pairwise alignments > fit in it too. It depends a bit on the exact API, but I expect that we can > come up with something elegant. > > Does anything in Bio.Align already operate on SeqFeature objects? Given that BLAST or HMMer output could be interpreted as (1) a series of annotated features/regions on target sequences, or (2) a series of pairwise alignments [*], perhaps it would be most effective to support those aspects separately, through (1) Bio.Search or Bio.Feature [**], and (2) Bio.Align or Bio.AlignIO. [*] The multiple sequence alignment produced by HMMer is in a format we already handle (Stockholm). 
Some people want to convert BLAST output to a multiple sequence alignment, too, and while I suppose we could support that in a literal sense, the result would be worse than the output of pretty much any other alignment program so I don't think we should. [**] A Bio.Feature module could involve GFF parsing and the variant parsers, too. It would contain I/O functions that emit SeqFeatures, of course. From redmine at redmine.open-bio.org Sat Apr 7 13:31:37 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 7 Apr 2012 17:31:37 +0000 Subject: [Biopython-dev] [Biopython - Feature #3338] (New) Convert a protein alignment and nucleotide sequences to codon alignment Message-ID: Issue #3338 has been reported by Eric Talevich. ---------------------------------------- Feature #3338: Convert a protein alignment and nucleotide sequences to codon alignment https://redmine.open-bio.org/issues/3338 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: As discussed on the mailing list: http://lists.open-bio.org/pipermail/biopython/2012-April/007913.html This could be implemented in two ways: 1. Wrap PAL2NAL (pal2nal.pl) under Bio.Align.Applications 2. Implement this functionality directly in Python While PAL2NAL has some convenience features like aligning protein sequences to CDS sequences that don't exactly match, it would be straightforward (and simpler for the user, in most cases) to implement a fussier version of it from scratch somewhere in Biopython. So, where would be put this function? Related: * From a codon alignment, it would again be straightforward to calculate dN/dS ratios for pairs of sequences, much like PAML's yn00 (although that program does more stuff, too). Do we want to do that? Where? * Are there ways Biopython could support codon alignments better, as distinct from nucleotide alignments? 
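As an illustration of option 2 (a "fussy" pure-Python version), the sketch below threads an ungapped nucleotide sequence onto one gapped protein sequence, working on plain strings. It is only a sketch under stated assumptions: thread_codons is a hypothetical name, it requires the nucleotide length to be exactly three times the number of residues, and it does not check that the codons actually translate to the given residues (PAL2NAL is more forgiving about mismatches).

```python
def thread_codons(gapped_protein, nucleotide_seq, gap="-"):
    """Thread an ungapped nucleotide sequence onto a gapped protein sequence.

    Each residue consumes one codon (3 nucleotides); each protein gap becomes
    three gap characters. Raises ValueError if the lengths do not agree.
    Assumption: no translation check is done, only length bookkeeping.
    """
    n_residues = len(gapped_protein) - gapped_protein.count(gap)
    if len(nucleotide_seq) != 3 * n_residues:
        raise ValueError("Expected %d nucleotides, got %d"
                         % (3 * n_residues, len(nucleotide_seq)))
    codons = []
    pos = 0
    for residue in gapped_protein:
        if residue == gap:
            codons.append(gap * 3)
        else:
            codons.append(nucleotide_seq[pos:pos + 3])
            pos += 3
    return "".join(codons)


print(thread_codons("M-K", "ATGAAA"))  # prints ATG---AAA
```

Applying this to every row of a protein alignment, paired with its CDS, would give the rows of the codon alignment.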
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Sat Apr 7 14:42:02 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 7 Apr 2012 14:42:02 -0400 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich wrote: > Hi all, > > I'm considering some enhancements to the Phylo.draw function to make it > more customizable for power users. Since the function is based on > matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the > user; however, I'm not fully versed in what pyplot is capable of. > > Relevant feature request in Redmine: > https://redmine.open-bio.org/issues/3336 > > Ideas: [...] > 2. Add an argument 'branch_labels' to Phylo.draw. This will accept either > (a) a dict which maps the tree's Clade objects to string labels, or (b) a > function which accepts a Clade object and returns a string. Default: a > function that formats the clade's 'confidence' or 'confidences' attribute, > matching the current behavior. > > Examples: > >>> draw(mytree, branch_labels={mytree.root: "Root", ...}) > >>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence) > >>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank) > > Just committed this feature: https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d From lgautier at gmail.com Sun Apr 8 13:16:31 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Sun, 08 Apr 2012 19:16:31 +0200 Subject: [Biopython-dev] Sphinx documentation online ? 
Message-ID: <4F81C7EF.7030505@gmail.com> Hi, I have seen emails exchanges and issues on the tracker regarding moving the documentation to Sphinx, but I could not find an instance of the documentation for biopython online (I was looking for one to cross-reference it with documentation I am writing). Is this still work-in-progress, or is there an instance online and I missed it ? Best, Laurent From eric.talevich at gmail.com Sun Apr 8 15:25:00 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 8 Apr 2012 15:25:00 -0400 Subject: [Biopython-dev] Sphinx documentation online ? In-Reply-To: <4F81C7EF.7030505@gmail.com> References: <4F81C7EF.7030505@gmail.com> Message-ID: On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier wrote: > Hi, > > I have seen emails exchanges and issues on the tracker regarding moving > the documentation to Sphinx, but I could not find an instance of the > documentation for biopython online (I was looking for one to > cross-reference it with documentation I am writing). > > Is this still work-in-progress, or is there an instance online and I > missed it ? > > Hi Laurent, I proposed this a while ago and played with Sphinx a little bit, but didn't get very far. We're still using Epydoc for our generated API documentation: http://biopython.org/DIST/docs/api/ I do hope to get back to this at some point, or perhaps assist someone else with migrating Biopython to Sphinx. -Eric From lgautier at gmail.com Sun Apr 8 16:46:45 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Sun, 08 Apr 2012 22:46:45 +0200 Subject: [Biopython-dev] Sphinx documentation online ? 
In-Reply-To: References: <4F81C7EF.7030505@gmail.com> Message-ID: <4F81F935.9030702@gmail.com> On 2012-04-08 21:25, Eric Talevich wrote: > On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier > wrote: > > Hi, > > I have seen emails exchanges and issues on the tracker regarding > moving the documentation to Sphinx, but I could not find an > instance of the documentation for biopython online (I was looking > for one to cross-reference it with documentation I am writing). > > Is this still work-in-progress, or is there an instance online and > I missed it ? > > > Hi Laurent, > > I proposed this a while ago and played with Sphinx a little bit, but > didn't get very far. We're still using Epydoc for our generated API > documentation: > http://biopython.org/DIST/docs/api/ > > I do hope to get back to this at some point, or perhaps assist someone > else with migrating Biopython to Sphinx. > > -Eric > > Hi Eric, Thanks for the answer. I did see the Epydoc, but I was after Sphinx to be able to cross-reference documentations (see http://sphinx.pocoo.org/ext/intersphinx.html ). I'll do with it for the time being. Best, Laurent From eric.talevich at gmail.com Mon Apr 9 14:25:04 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 9 Apr 2012 14:25:04 -0400 Subject: [Biopython-dev] Method to weight sequences in an alignment Message-ID: Folks, I've written a function to weight sequences according to the simple scheme used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment objects or lists of plain strings, and could be added as a method with minimal changes (for Python 2.5 compatibility, mainly). Any interest in adding it to Biopython? The code is below. Cheers, Eric [*] Henikoff & Henikoff (1994): Position-based sequence weights. http://www.ncbi.nlm.nih.gov/pubmed/7966282

----

from collections import Counter


def sequence_weights(aln):
    """Weight aligned sequences to emphasize more divergent members.

    Returns a list of floating-point numbers between 0 and 1, corresponding to
    the proportional weight of each sequence in the alignment. The first item
    in the list is the weight of the first sequence in the alignment, and so
    on. Weights sum to 1.0.

    Method: At each column position, award each different residue an equal
    share of the weight, and then divide that weight equally among the
    sequences sharing the same residue. For each sequence, sum the
    contributions from each position to give a sequence weight.

    See Henikoff & Henikoff (1994): Position-based sequence weights.
    """
    def col_weight(column):
        """Represent the diversity at a position.

        Award each different residue an equal share of the weight, and then
        divide that weight equally among the sequences sharing the same
        residue.

        So, if in a position of a multiple alignment, r different residues
        are represented, a residue represented in only one sequence
        contributes a score of 1/r to that sequence, whereas a residue
        represented in s sequences contributes a score of 1/rs to each of
        the s sequences.
        """
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in '-.']) < 2:
            return [0] * len(column)
        # Count the number of occurrences of each residue type
        # (Treat gaps as a separate, 21st character)
        counts = Counter(column)
        # Get residue weights: 1/rs, where
        # r = nb. residue types, s = count of a particular residue type
        n_residues = len(counts)  # r
        freqs = dict((aa, 1.0 / (n_residues * count))
                     for aa, count in counts.items())
        weights = [freqs[aa] for aa in column]
        return weights

    seq_weights = [0] * len(aln)
    col_weights = map(col_weight, zip(*aln))
    # Sum the contributions from each position along each sequence -> total weight
    for col in col_weights:
        for idx, row_val in enumerate(col):
            seq_weights[idx] += row_val
    # Normalize
    scale = 1.0 / sum(seq_weights)
    seq_weights = [scale * wt for wt in seq_weights]
    return seq_weights

From mjldehoon at yahoo.com Mon Apr 9 19:27:31 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 9 Apr 2012 16:27:31 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: Message-ID: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> Hi Eric, Peter, > How about Bio.Search, for now? I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells users something about what the module is for. Bio.Search could be anything (search PubMed? search the Entrez databases? search Google? anyway Bio.Search does not suggest that this module is about pairwise alignments). But Peter previously mentioned that he doesn't like Bio.Pairwise; can we convince you? >> How about including this module as a submodule in Bio.Align? > Does anything in Bio.Align already operate on SeqFeature objects? I was more thinking to have this module as a submodule in Bio.Align for the purpose of module organization rather than reusing or integrating it with Bio.Align. However, if we can make use of Bio.Align, then that could be a good thing. Best, -Michiel.
From chapmanb at 50mail.com Mon Apr 9 20:58:19 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 09 Apr 2012 20:58:19 -0400 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> Message-ID: <87lim4h07o.fsf@fastmail.fm> Michiel; > Hi Eric, Peter, > > > How about Bio.Search, for now? > > I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells > users something about what the module is for. Bio.Search could be > anything (search PubMed? search the Entrez databases? search Google? > anyway Bio.Search does not suggest that this module is about pairwise > alignments). But Peter previously mentioned that he doesn't like > Bio.Pairwise; can we convince you? I agree with Peter on this one. The module is primarily about searching a sequence database with an input via multiple methods, not about pairwise alignment of two sequences with is what Bio.Align.Pairwise suggests to me. Brad From redmine at redmine.open-bio.org Tue Apr 10 16:29:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 Apr 2012 20:29:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3340] (New) Example using Bio.Clustalw in Tutorial Message-ID: Issue #3340 has been reported by Peter Cock. ---------------------------------------- Bug #3340: Example using Bio.Clustalw in Tutorial https://redmine.open-bio.org/issues/3340 Author: Peter Cock Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example. 
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu Apr 12 12:01:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 12 Apr 2012 17:01:47 +0100 Subject: [Biopython-dev] BOSC 2012 - Biopython Update Message-ID: Hello all, The BOSC abstract deadline (tomorrow) has rather crept up on me, despite Nomi's reminder emails (My excuse is I've been thinking more about GSoC!). For anyone thinking of submitting a talk, the abstract limit is just a page - see: http://www.open-bio.org/wiki/BOSC_2012 I'm hoping to attend BOSC, but will probably not be at ISMB 2012. I'd be delighted for another Biopython developer to give the project update talk (and as in previous years, we'll help out with the abstract, slides, etc).
Anyone interested? Giving a talk can be very helpful in getting travel funding ;) I know Eric might be a candidate as he will be in Long Beach (congratulations on getting your ISMB poster accepted Eric!). Note that dedicated "Bioinformatics Open Source Project Updates" track is new this year. The talks are likely to be at the shorter end of the talk length range specified (i.e. closer to 5 minutes than 20 mins) but that will partly depend on quite how full the final schedule turns out to be. The idea (speaking with my BOSC hat on) with the update talks is to try to highlight what is new and exciting, with only a minimal introduction for the higher profile projects - most of the audience will know roughly what BioPerl etc are, and won't be interested to hear it again ;) So for the Biopython talk we'd probably want to cover things like GSoC, work with PyPy and Python3, major new functionality, any Biopython papers, etc, and a bit on future plans. The talk should be short but sweet :) Regards, Peter From redmine at redmine.open-bio.org Thu Apr 12 14:52:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 12 Apr 2012 18:52:35 +0000 Subject: [Biopython-dev] [Biopython - Feature #3341] (New) Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)" Message-ID: Issue #3341 has been reported by Martin Mokrejš. ---------------------------------------- Feature #3341: Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)" https://redmine.open-bio.org/issues/3341 Author: Martin Mokrejš Status: New Priority: Normal Assignee: Category: Target version: URL: Some files have 3 extra lines per record in the SFF file.
One such file is already in biopython test data: biopython/Tests/Roche/E3MFGYR02_random_10_reads.sff biopython/Tests/Roche/paired.sff The three lines "Run Name:, Analysis Name:, Full Path:" are not parsed into the object and later on, are not written out. Hence, sff round trip read in -> write out breaks (biopython-1.58). These three lines somehow do not appear in every SFF file, and so far I haven't seen these in files extracted from SRA. Seems these only appear in original Roche SFF files. >E3MFGYR02JWQ7T Run Prefix: R_2008_01_09_16_16_00_ Region #: 2 XY Location: 3946_2103 Run Name: R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331 Analysis Name: /data/2008_02_08/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe Full Path: /data/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe Read Header Len: 32 Name Length: 14 # of Bases: 265 Clip Qual Left: 5 Clip Qual Right: 264 Clip Adap Left: 0 Clip Adap Right: 0 ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Thu Apr 12 18:37:12 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 12 Apr 2012 18:37:12 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Thu, Apr 12, 2012 at 12:01 PM, Peter Cock wrote: > Hello all, > > The BOSC abstract deadline (tomorrow) has rather crept up on me, > despite Nomi's reminder emails (My excuse is I've been thinking > more about GSoC!). 
For anyone thinking of submitting a talk, the > abstract limit is just a page - see: > http://www.open-bio.org/wiki/BOSC_2012 > > I'm hoping to attend BOSC, but will probably not be at ISMB 2012. > I'd be delighted for another Biopython developer to give the project > update talk (and as in previous years, we'll help out with the abstract, > slides, etc). Anyone interested? Giving a talk can be very helpful in > getting travel funding ;) > > I know Eric might be a candidate as he will be in Long Beach > (congratulations on getting your ISMB poster accepted Eric!). > > Note that dedicated "Bioinformatics Open Source Project Updates" > track is new this year. The talks are likely to be at the shorter end of > the talk length range specified (i.e. closer to 5 minutes than 20 mins) > but that will partly depend on quite how full the final schedule turns > out to be. > > The idea (speaking with my BOSC hat on) with the update talks is > to try to highlight what is new and exciting, with only a minimal > introduction for the higher profile projects - most of the audience > will know roughly what BioPerl etc are, and won't be interested > to hear it again ;) > > So for the Biopython talk we'd probably want to cover things like > GSoC, work with PyPy and Python3, major new functionality, any > Biopython papers, etc, and a bit on future plans. The talk should be > short but sweet :) > > Regards, > > Peter OK, here are some potential talking points I scraped from past announcements: * SeqIO.index_db: Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to carry the index_db concept to other modules. 
* Installation improvements: pip support (v.1.57); easy_install will automatically handle the numpy dependency (v.1.59, Feb '12) * Portability: Python 3 compatibility (except for a couple C extension modules); still supporting Jython; now mostly supporting Pypy (except for modules that use numpy or C extensions) * Merged Brandon Invergo's independent project pypaml under Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip support (v.1.59) and the existing support for phylogeny I/O under Phylo, we can now easily assemble and run complete workflows involving PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and Bio.Phylo.Applications.PhymlCommandline.) * GenomeDiagram improvements: New, pretty features. Eye candy for the slides. * TogoWS * Next release & future plans: - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student - Brad's GFF parser - Deeper future: see the other mailing list thread * GSoC 2011 results: - Mikael Trellet -- Interface - Michele Silva -- Mocapy++ Python module; also ported two applications to Biopython - Justinas D. -- Python-based extension system for Mocapy++ * Summer of Struct: João and Eric are working to refactor and merge the vast amount of Bio.PDB-related code produced during previous GSoCs. (Includes a planned SeqIO-style API for structures in PDB, mmCIF and PDBML formats.) Improvements have been trickling in since the last BOSC; here comes the flood. From chapmanb at 50mail.com Thu Apr 12 20:23:03 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 12 Apr 2012 20:23:03 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: <877gxkh448.fsf@fastmail.fm> Eric and Peter; Eric -- I'm glad you're taking this on. It'll be great to have a Biopython presentation at BOSC. The points you mentioned all sound great, although I would drop some of the more boring ones like the installation stuff (I can pick on that, since it's mine).
My only other suggestions is to focus the talk around the people who've provided the improvements. One of the awesome things about Biopython is the wide contributor base and we still manage to pull everything into a coherent package thanks to Peter's guiding hand. It would be cool to emphasize this community as part of the update. Thanks again for doing this, Brad > > Hello all, > > > > The BOSC abstract deadline (tomorrow) has rather crept up on me, > > despite Nomi's reminder emails (My excuse is I've been thinking > > more about GSoC!). For anyone thinking of submitting a talk, the > > abstract limit is just a page - see: > > http://www.open-bio.org/wiki/BOSC_2012 > > > > I'm hoping to attend BOSC, but will probably not be at ISMB 2012. > > I'd be delighted for another Biopython developer to give the project > > update talk (and as in previous years, we'll help out with the abstract, > > slides, etc). Anyone interested? Giving a talk can be very helpful in > > getting travel funding ;) > > > > I know Eric might be a candidate as he will be in Long Beach > > (congratulations on getting your ISMB poster accepted Eric!). > > > > Note that dedicated "Bioinformatics Open Source Project Updates" > > track is new this year. The talks are likely to be at the shorter end of > > the talk length range specified (i.e. closer to 5 minutes than 20 mins) > > but that will partly depend on quite how full the final schedule turns > > out to be. > > > > The idea (speaking with my BOSC hat on) with the update talks is > > to try to highlight what is new and exciting, with only a minimal > > introduction for the higher profile projects - most of the audience > > will know roughly what BioPerl etc are, and won't be interested > > to hear it again ;) > > > > So for the Biopython talk we'd probably want to cover things like > > GSoC, work with PyPy and Python3, major new functionality, any > > Biopython papers, etc, and a bit on future plans. 
The talk should be > > short but sweet :) > > > > Regards, > > > > Peter > > > OK, here are some potential talking points I scraped from past announcements: > > * SeqIO.index_db: > Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to > carry the index_db concept to other modules. > > * Installation improvements: > pip support (v.1.57); easy_install will automatically handle the numpy > dependency (v.1.59, Feb '12) > > * Portability: > Python 3 compatibility (except for a couple C extension modules); > still supporting Jython; now mostly supporting Pypy (except for > modules that use numpy or C extensions) > > * Merged Brandon Invergo's independent project pypaml under > Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip > support (v.1.59) and the existing support for phylogeny I/O under > Phylo, we can now easily assemble and run complete workflows involving > PAML. > (Similarly for PhyML, with SeqIO's "phylip-relaxed" and > Bio.Phylo.Applications.PhymlCommandline.) > > * GenomeDiagram improvements: > New, pretty features. Eye candy for the slides. > > * TogoWS > > * Next release & future plans: > - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student > - Brad's GFF parser > - Deeper future: see the other mailing list thread > > * GSoC 2011 results: > - Mikael Trellet -- Interface > - Michele Silva -- Mocapy++ Python module; also ported two > applications to Biopython > - Justinas D. -- Python-based extension system for Mocapy++ > > * Summer of Struct: > João and Eric are working to refactor and merge the vast amount of > Bio.PDB-related code produced during previous GSoCs. (Includes a > planned SeqIO-style API for structures in PDB, mmCIF and PDBML > formats.) Improvements have been trickling in since the last BOSC; > here comes the flood.
> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From arklenna at gmail.com Thu Apr 12 23:26:35 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 12 Apr 2012 23:26:35 -0400 Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31) In-Reply-To: References: Message-ID: On Thu, Mar 29, 2012 at 10:05 AM, Peter Cock wrote: > Hi Lenna, > > Have you tried your branch on Windows yet? > > It worked for me under my Python 2.5 setup using mingw32, > > C:\repositories\biopython>c:\python26\python setup.py install > ... > building 'Bio.PDB.mmCIF.MMCIFlex' extension > creating build\temp.win32-2.5\Release\bio\pdb > creating build\temp.win32-2.5\Release\bio\pdb\mmcif > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio > -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o > build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o > lex.yy.c:1046: warning: 'yyunput' defined but not used > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio > -Ic:\python25\include -Ic:\python25\PC -c > Bio/PDB/mmCIF/MMCIFlexmodule.c -o > build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o > writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def > C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s > build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o > build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o > build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def > -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o > build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd > ... > > That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not: > > C:\repositories\biopython>c:\python26\python setup.py install > ... 
> building 'Bio.PDB.mmCIF.MMCIFlex' extension > C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo > /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC > /TcBio/PDB/mmCIF/lex.yy.c > /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj > lex.yy.c > Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include > file: 'unistd.h': No such file or directory > error: command '"C:\Program Files\Microsoft Visual Studio > 9.0\VC\BIN\cl.exe"' failed with exit status 2 > > The same with Python 2.7 and the Microsoft compiler. Switching > from this in Bio/PDB/mmCIF.yy.c: > > #include > > to this: > > #include > > lets it compile (although with some warnings) and test_MMCIF.py passes. > If should be conditional of course, but I'm unclear if that is the appropriate > fix or not though. > > Peter Hi Peter, I installed flex on my Windows VM and used it to generate lex.yy.c. It puts #include inside an #ifdef so it may work with MSVC. It produces a working module for both Debian and Mac OS X (I do get "defined but not used" warnings for generated functions). I've cherry-picked it into my pull request. I know you're quite busy right now with BOSC and GSoC, but let me know if you get a chance to test it on MSVC. Lenna From p.j.a.cock at googlemail.com Fri Apr 13 07:31:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Apr 2012 12:31:30 +0100 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich wrote: > > OK, here are some potential talking points I scraped from past announcements: > > * SeqIO.index_db: > Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to > carry the index_db concept to other modules. Biopython 1.57 was already covered at BOSC 2011. 
> * Installation improvements: > pip support (v.1.57); easy_install will automatically handle the numpy > dependency (v.1.59, Feb '12) Brad commented on this, perhaps a line in the abstract? > * Portability: > Python 3 compatibility (except for a couple C extension modules); > still supporting Jython; now mostly supporting Pypy (except for > modules that use numpy or C extensions) This is something I would want to cover. > * Merged Brandon Invergo's independent project pypaml under > Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip > support (v.1.59) and the existing support for phylogeny I/O under > Phylo, we can now easily assemble and run complete workflows involving > PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and > Bio.Phylo.Applications.PhymlCommandline.) Yep. > * GenomeDiagram improvements: > New, pretty features. Eye candy for the slides. Yep. Maybe even an example in the abstract? > * TogoWS Yep. > * Next release & future plans: > - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student > - Brad's GFF parser > - Deeper future: see the other mailing list thread Good points - although I don't want to over promise ;) > * GSoC 2011 results: > - Mikael Trellet -- Interface > - Michele Silva -- Mocapy++ Python module; also ported two > applications to Biopython > - Justinas D. -- Python-based extension system for Mocapy++ We should have a summary of what they did somewhere, perhaps as an OBF blog post? I'm hoping to get this year's GSoC students to write weekly progress reports on a blog or at least by email to the mailing list. > * Summer of Struct: > João and Eric are working to refactor and merge the vast amount of > Bio.PDB-related code produced during previous GSoCs. (Includes a > planned SeqIO-style API for structures in PDB, mmCIF and PDBML > formats.) Improvements have been trickling in since the last BOSC; > here comes the flood. :) Here's a draft abstract - note we have to fit in a page.
Having a logo or some eye catching image is very effective for standing out in the abstract book (on screen or on paper). Comments welcome - but keep in mind the one page limit. Eric - feel free to turn this into a Google Doc if you prefer. Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.pdf Type: application/pdf Size: 199737 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.tex Type: application/x-tex Size: 5037 bytes Desc: not available URL: From eric.talevich at gmail.com Fri Apr 13 10:31:08 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Apr 2012 10:31:08 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: Thanks for this. I'll keep it as LaTeX, since it already looks nice. 1. Several parts say "[to be revised prior to BOSC]" -- I take it we have the option of updating our abstract shortly before BOSC, and this is a note to the conference organizers that we intend to do so? To save space and reduce distraction, should this be a footnote instead? 2. To save space: Do we need the line "Bioinformatics Open Source Conference (BOSC) ..." after the author names? 3. Again to save space, and make room to cite the Phylo paper: can we drop the citation for TogoWS, and add a few words of description in the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) 4. How do you feel about dropping inline citations, and just have a list of \nocite references at the bottom? In a one-page abstract, it should be easy enough for readers to figure out what's what. 
-E On Fri, Apr 13, 2012 at 7:31 AM, Peter Cock wrote: > On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich wrote: >> >> OK, here are some potential talking points I scraped from past announcements: >> >> * SeqIO.index_db: >> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to >> carry the index_db concept to other modules. > > Biopython 1.57 was already covered at BOSC 2011. > >> * Installation improvements: >> pip support (v.1.57); easy_install will automatically handle the numpy >> dependency (v.1.59, Feb '12) > > Brad commented on this, perhaps a line in the abstract? > >> * Portability: >> Python 3 compatibility (except for a couple C extension modules); >> still supporting Jython; now mostly supporting Pypy (except for >> modules that use numpy or C extensions) > > This is something I would want to cover. > >> * Merged Brandon Invergo's independent project pypaml under >> Bio.Phylo.PAML ((v.1.58, Aug '11). With SeqIO's new sequential Phylip >> support (v.1.59) and the existing support for phylogeny I/O under >> Phylo, we can now easily assemble and run complete workflows involving >> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and >> Bio.Phylo.Applications.PhymlCommandline.) > > Yep. > >> * GenomeDiagram improvements: >> New, pretty features. Eye candy for the slides. > > Yep. Maybe even an example in the abstract? > >> * TogoWS > > Yep. > >> * Next release & future plans: >> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student >> - Brad's GFF parser >> - Deeper future: see the other mailing list thread > > Good points - although I don't want to over promise ;) > >> * GSoC 2011 results: >> - Mikael Trellet -- Interface >> - Michele Silva -- Mocapy++ Python module; also ported two >> applications to Biopython >> - Justinas D. -- Python-based extension system for Mocapy++ > > We should have a summary of what they did somewhere, perhaps > as an OBF blog post? 
I'm hoping to get this year's GSoC students > to write weekly progress reports on a blog or at least by email to > the mailing list. > >> * Summer of Struct: >> João and Eric are working to refactor and merge the vast amount of >> Bio.PDB-related code produced during previous GSoCs. (Includes a >> planned SeqIO-style API for structures in PDB, mmCIF and PDBML >> formats.) Improvements have been trickling in since the last BOSC; >> here comes the flood. > > :) > > Here's a draft abstract - note we have to fit in a page. Having a logo > or some eye catching image is very effective for standing out in the > abstract book (on screen or on paper). > > Comments welcome - but keep in mind the one page limit. > > Eric - feel free to turn this into a Google Doc if you prefer. > > Peter From p.j.a.cock at googlemail.com Fri Apr 13 10:42:37 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Apr 2012 15:42:37 +0100 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich wrote: > Thanks for this. I'll keep it as LaTeX, since it already looks nice. > > 1. Several parts say "[to be revised prior to BOSC]" -- I take it we > have the option of updating our abstract shortly before BOSC, and this > is a note to the conference organizers that we intend to do so? To > save space and reduce distraction, should this be a footnote instead? It is common for BOSC abstracts to be revised following review prior to acceptance (almost like a tiny paper), and yes, that was my intention. Do you think something like [to be revised during abstract review] might be clearer? I think this makes a lot of sense for the project update talks in particular - by that stage for example we'll have the GSoC students selected. > 2. To save space: Do we need the line "Bioinformatics Open Source > Conference (BOSC) ..." after the author names?
I like it to make the page self contained, useful if we post it as a lone PDF file. The text could be smaller certainly if required - likewise the logo could be shrunk a little. > 3. Again to save space, and make room to cite the Phylo paper: can we > drop the citation for TogoWS, and add a few words of description in > the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) Fair point, I was thinking in terms of audience recognition. PAML and HMMer are quite well known and relatively old/mature. If the Phylo paper is accepted in time to be added to abstract then of course we'd want to include it. But right now using a couple of lines for a 'submitted' citation seemed overkill to me. But if you can get it to fit nicely, please go ahead. > 4. How do you feel about dropping inline citations, and just have a > list of \nocite references at the bottom? In a one-page abstract, it > should be easy enough for readers to figure out what's what. If you prefer, or use the [1] style? Peter From eric.talevich at gmail.com Fri Apr 13 11:40:06 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Apr 2012 11:40:06 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 10:42 AM, Peter Cock wrote: > On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich wrote: >> Thanks for this. I'll keep it as LaTeX, since it already looks nice. >> >> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we >> have the option of updating our abstract shortly before BOSC, and this >> is a note to the conference organizers that we intend to do so? To >> save space and reduce distraction, should this be a footnote instead? > > It is common for BOSC abstracts to be revised following review prior to > acceptance (almost like a tiny paper), and yes, that was my intention. > Do you think something like [to be revised during abstract review] > might be clearer? 
I think this makes a lot of sense for the project > update talks in particular - but that stage for example we'll have the > GSoC students selected. > >> 2. To save space: Do we need the line "Bioinformatics Open Source >> Conference (BOSC) ..." after the author names? > > I like it to make the page self contained, useful if we post it as a lone > PDF file. The text could be smaller certainly if required - likewise the > logo could be shrunk a little. > >> 3. Again to save space, and make room to cite the Phylo paper: can we >> drop the citation for TogoWS, and add a few words of description in >> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) > > Fair point, I was thinking in terms of audience recognition. PAML > and HMMer are quite well known and relatively old/mature. > > If the Phylo paper is accepted in time to be added to abstract then > of course we'd want to include it. But right now using a couple of > lines for a 'submitted' citation seemed overkill to me. But if you can > get it to fit nicely, please go ahead. > >> 4. How do you feel about dropping inline citations, and just have a >> list of \nocite references at the bottom? In a one-page abstract, it >> should be easy enough for readers to figure out what's what. > > If you prefer, or use the [1] style? > > Peter Here's an updated draft. How does it look? -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.pdf Type: application/pdf Size: 262728 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Biopython_BOSC_abstract_2012_draft.tex Type: application/x-tex Size: 5573 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Apr 13 11:57:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Apr 2012 16:57:27 +0100 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich wrote: > > Here's an updated draft. How does it look? Looks fine to me - anyone else? A fresh pair of eyes would be good. Also does anyone else want to be named as a talk co-author (and promise to contribute with slides/figures/help for preparing the talk)? Or should we just put "Eric et al" since he'll be the one on stage? Peter From anaryin at gmail.com Fri Apr 13 12:02:04 2012 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 13 Apr 2012 18:02:04 +0200 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: Third paragraph: 'summer' should read 'Summer'. Good to me! I can help with the slides/figures/help, particularly on the refactoring part of Bio.PDB to Bio.Struct. Let me know when and I can easily get on Skype. cheers! João From zhigang.wu at email.ucr.edu Fri Apr 13 12:25:34 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Fri, 13 Apr 2012 09:25:34 -0700 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: Probably I caught a grammar mistake. Should we correct "Biopython 1.60 is expected *to have been* released by BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"? Probably I was wrong. I am not a native speaker. :-) Zhigang On Fri, Apr 13, 2012 at 8:57 AM, Peter Cock wrote: > On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich > wrote: > > > > Here's an updated draft. How does it look? > > Looks fine to me - anyone else? A fresh pair of eyes would be good.
> > Also does anyone else want to be named as a talk co-author (and > promise to contribute with slides/figures/help for preparing the talk)? > Or should we just put "Eric et al" since he'll be the one on stage? > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Fri Apr 13 12:31:53 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 13 Apr 2012 12:31:53 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 12:25 PM, Zhigang Wu wrote: > Probably I caught a grammar mistake. > > Should we correct "Biopython 1.60 is expected *to have been* released by > BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"? > > Probably I was wrong. I am not a native speaker. :-) > > Zhigang > Hi Zhigang, Actually, either way is correct - the original way is called the future perfect tense. Here's a description of the grammar if you are interested: http://www.englishpage.com/verbpage/futureperfect.html Lenna From eric.talevich at gmail.com Fri Apr 13 13:17:31 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Apr 2012 13:17:31 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 11:57 AM, Peter Cock wrote: > On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich wrote: >> >> Here's an updated draft. How does it look? > > Looks fine to me - anyone else? A fresh pair of eyes would be good. > > Also does anyone else want to be named as a talk co-author (and > promise to contribute with slides/figures/help for preparing the talk)? > Or should we just put "Eric et al" since he'll be the one on stage? > > Peter I added João as the fourth author and submitted it.
Cheers, Eric From p.j.a.cock at googlemail.com Fri Apr 13 15:32:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Apr 2012 20:32:32 +0100 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 6:17 PM, Eric Talevich wrote: > > I added João as the fourth author and submitted it. > > Cheers, > Eric Thanks Eric, If there are any other comments or changes, we'll try to integrate them along with any reviewers' comments. Peter From tiagoantao at gmail.com Mon Apr 16 05:35:21 2012 From: tiagoantao at gmail.com (Tiago Antão) Date: Mon, 16 Apr 2012 10:35:21 +0100 Subject: [Biopython-dev] plink phasing and others Message-ID: Hi, During the last few months I have been in a hell hole writing code like mad. Maybe some of this code is of interest to share. I currently have: 1. Code to parse plink output. Pretty trivial stuff, but I bet lots of people are doing this 2. Code to process admixture results. Admixture is far less used than STRUCTURE 3. Code to deal with phasing formats. Beagle, PHASE and shapeit 4. PCA 5. Some gene ontology stuff My GO stuff is pretty specific, so I guess it might not be of interest. All the other components cover fairly widely used things. Admixture and PCA are standard popgen analyses. Admixture code could probably be changed to also support STRUCTURE. I am not sure but PCA might only work on Linux. Plink and phasing are of more general interest. These would be outside Bio.PopGen. None of this code has unusual requirements, with one exception: admixture and PCA require matplotlib. So that people have an understanding of the impact of these things, I put the number of scholar citations: plink - 3315 smartpca - 1673 admixture - 57 structure - 7448 beagle - >300 fastphase - 1935 Unfortunately there is little code to do automated analysis using these tools.
I could start migrating some of this code to biopython (would have to write documentation, and comment the code better ;) ) -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From p.j.a.cock at googlemail.com Mon Apr 16 06:26:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 11:26:30 +0100 Subject: [Biopython-dev] plink phasing and others In-Reply-To: References: Message-ID: 2012/4/16 Tiago Antão : > Hi, > > During the last few months I have been in a hell hole writing code > like mad. Maybe some of this code is of interest to share. > > I currently have: > > 1. Code to parse plink output. Pretty trivial stuff, but I bet lots of > people are doing this > 2. Code to process admixture results. Admixture is far less used than STRUCTURE > 3. Code to deal with phasing formats. Beagle, PHASE and shapeit > 4. PCA > 5. Some gene ontology stuff > > My GO stuff is pretty specific, so I guess it might not be of interest. > All the other components cover fairly widely used things. > Admixture and PCA are standard popgen analyses. Admixture code could > probably be changed to also support STRUCTURE. I am not sure but PCA > might only work on Linux. > Plink and phasing are of more general interest. These would be outside > Bio.PopGen. > > None of this code has unusual requirements, with one > exception: admixture and PCA require matplotlib. > > So that people have an understanding of the impact of these things, I > put the number of scholar citations: > plink - 3315 > smartpca - 1673 > admixture - 57 > structure - 7448 > beagle - >300 > fastphase - 1935 > > Unfortunately there is little code to do automated analysis using these tools. > > I could start migrating some of this code to biopython (would have to > write documentation, and comment the code better ;) ) Sounds good. The GO stuff would/should be more general than just PopGen, and I know other people are looking at this on branches.
When you said PCA, that was principal component analysis, right?
What are you adding on top of NumPy/SciPy/matplotlib?

Peter

From tiagoantao at gmail.com  Mon Apr 16 08:05:34 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 16 Apr 2012 13:05:34 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Peter Cock :
> Sounds good. The GO stuff would/should be more general than just
> PopGen, and I know other people are looking at this on branches.

What I do here is things like tree traversal (e.g. find all parent
nodes) and stuff like that. After that I do enrichment analysis
(Fisher exact test, FDR, that stuff). Nothing of real interest for
now. I think we can ignore my code here (for now).

> When you said PCA, that was principal component analysis, right?

Yep, I am using eigenstrat/smartpca.

> What are you adding on top of NumPy/SciPy/matplotlib?

PCA plots and admixture plots. Here is an example of both:
http://2.bp.blogspot.com/-6J6Gsas4uIs/TuELU3Gf4ZI/AAAAAAAAEWQ/CymvlzkX6hQ/s1600/PIIS0002929711004885.gr2_lrg.hi.jpg

Top: PCA
Bottom: admixture

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com  Mon Apr 16 09:50:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 14:50:18 +0100
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 4:26 AM, Lenna Peterson wrote:
>
> Hi Peter,
>
> I installed flex on my Windows VM and used it to generate lex.yy.c. It
> puts #include inside an #ifdef so it may work with MSVC. It
> produces a working module for both Debian and Mac OS X (I do get
> "defined but not used" warnings for generated functions). I've
> cherry-picked it into my pull request.
>

I've now tested that on my Windows machine (and Mac and Linux),
and applied the changes to the master branch. Thanks!
We must remember to drop an email to the Debian and RedHat
packaging teams since their old patch to setup.py isn't needed
now (they could control the flex problem by declaring it a build
time dependency).

Peter

From tiagoantao at gmail.com  Mon Apr 16 11:00:13 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 16 Apr 2012 16:00:13 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

Just a few practical things:

1. we still do not allow matplotlib dependencies, correct?
2. to what part of the name space should plink and phasing be added?
3. Are we on epydoc or sphinx? Or moving from one to the other?
doctest is acceptable right?
4. What is the current best way to run external applications? There
was an application wrapper class in the past...

From p.j.a.cock at googlemail.com  Mon Apr 16 11:18:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:18:10 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Tiago Antão :
> Just a few practical things:
>
> 1. we still do not allow matplotlib dependencies, correct?

They would be run time dependencies, right? Not compile/build time?
We already have things like 'soft' dependencies on ReportLab
and NetworkX, and even matplotlib. It does complicate the unit tests
a bit to skip anything gracefully.

>
> 2. to what part of the name space should plink and phasing be added?

Unclear to me right now.

> 3. Are we on epydoc or sphinx? Or moving from one to the other?
> doctest is acceptable right?

We're still using LaTeX for the tutorial, and epydoc for the API docs.
Using doctest is acceptable and encouraged for documentation,
but be wary of cross platform differences. If you have a doctest
which has dependencies see test_wise.py rather than adding it
to run_tests.py

> 4. What is the current best way to run external applications? There
> was an application wrapper class in the past...
For simple Unix style applications controlled via the command line, use the Bio.Application framework as in Bio.Align.Applications or Bio.Sequencing.Applications, Bio.Phylo.Applications, or Bio.Emboss.Applications (etc?). Peter From p.j.a.cock at googlemail.com Mon Apr 16 11:20:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 16:20:59 +0100 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices In-Reply-To: References: Message-ID: On Sat, Apr 7, 2012 at 7:42 PM, Eric Talevich wrote: > On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich wrote: > >> Hi all, >> >> I'm considering some enhancements to the Phylo.draw function to make it >> more customizable for power users. Since the function is based on >> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the >> user; however, I'm not fully versed in what pyplot is capable of. >> >> Relevant feature request in Redmine: >> https://redmine.open-bio.org/issues/3336 >> >> Ideas: > > [...] > > Just committed this feature: > https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d Hi Eric, That seems to have caused a test failure on one of our buildslaves: ====================================================================== ERROR: Run the tree layout algorithm, but don't display it. 
---------------------------------------------------------------------- Traceback (most recent call last): File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py", line 51, in test_draw Phylo.draw(dollo, do_show=False) File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py", line 366, in draw fig = plt.figure() File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py", line 270, in figure **kwargs) File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py", line 120, in new_figure_manager backend_wx._create_wx_app() File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py", line 1377, in _create_wx_app wxapp = wx.PySimpleApp() File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 8078, in __init__ wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt) File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 7946, in __init__ raise SystemExit(msg) SystemExit: Unable to access the X Display, is $DISPLAY set properly? ---------------------------------------------------------------------- http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio Interestingly the same machine is passing the tests under other Python versions. That would seem to rule out the $DISPLAY environment variable being the cause. My hunch would be this is something about the Python 2.6 install, perhaps it is missing some library (wxPython maybe). 
Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7 have the same version of matplotlib installed, but only one is failing the test: $ python2.5 Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib Traceback (most recent call last): File "", line 1, in ImportError: No module named matplotlib $ python2.6 Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib >>> matplotlib.__version__ '1.0.0' $ python2.7 Python 2.7 (r27:82500, Jul 13 2010, 14:02:41) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib >>> matplotlib.__version__ '1.0.0' Peter From tiagoantao at gmail.com Mon Apr 16 11:31:50 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 16 Apr 2012 16:31:50 +0100 Subject: [Biopython-dev] plink phasing and others In-Reply-To: References: Message-ID: 2012/4/16 Peter Cock : > For simple Unix style applications controlled via the command line, > use the Bio.Application framework as in Bio.Align.Applications or > Bio.Sequencing.Applications, Bio.Phylo.Applications, or > Bio.Emboss.Applications (etc?). I wonder if people never had the need to abstract the computing infrastructure? The current code does local (blocking) execution, but we see environments with BAS or grids where other models are used. I am not suggesting any specific solution, but the current approach seems to me not very scalable. No? 
-- "Liberty for wolves is death to the lambs" - Isaiah Berlin From p.j.a.cock at googlemail.com Mon Apr 16 12:08:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 17:08:20 +0100 Subject: [Biopython-dev] plink phasing and others In-Reply-To: References: Message-ID: 2012/4/16 Tiago Ant?o : > 2012/4/16 Peter Cock : >> For simple Unix style applications controlled via the command line, >> use the Bio.Application framework as in Bio.Align.Applications or >> Bio.Sequencing.Applications, Bio.Phylo.Applications, or >> Bio.Emboss.Applications (etc?). > > I wonder if people never had the need to abstract the computing > infrastructure? The current code does local (blocking) execution, but > we see environments with BAS or grids where other models are used. I > am not suggesting any specific solution, but the current approach > seems to me not very scalable. No? I use the current framework with an SGE cluster, str(cline_object) gives the command line string to submit as the jobs. It would be nice to have some documented examples using this in combination with multiprocessing or something... but I find most of the tools I call are already multi-threaded. Peter From andrew.sczesnak at med.nyu.edu Mon Apr 16 12:48:41 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 Apr 2012 12:48:41 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: References: Message-ID: <4F8C4D69.4040009@med.nyu.edu> Hi Eric, I was playing with Bio.Cluster recently and noticed that trees generated by that module are not compatible with Bio.Phylo. I think it would be useful if output from Cluster could be manipulated with Phylo. At first glance, it doesn't seem like it would be that tricky to add a method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and I would be happy to work on this. 
Before making an attempt, I wanted to get your feedback on whether you think this would be useful and if you had anything similar in the works already. Best, Andrew From eric.talevich at gmail.com Mon Apr 16 18:15:14 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 16 Apr 2012 18:15:14 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: <4F8C4D69.4040009@med.nyu.edu> References: <4F8C4D69.4040009@med.nyu.edu> Message-ID: On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak wrote: > Hi Eric, > > I was playing with Bio.Cluster recently and noticed that trees generated by > that module are not compatible with Bio.Phylo. I think it would be useful if > output from Cluster could be manipulated with Phylo. > > At first glance, it doesn't seem like it would be that tricky to add a > method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and > I would be happy to work on this. Before making an attempt, I wanted to get > your feedback on whether you think this would be useful and if you had > anything similar in the works already. > > > Best, > Andrew Hi Andrew, Interesting idea. It would be simple enough to add a "from_cluster" function or class method to either Phylo/BaseTree.py or Phylo/_utils.py. But as every scientist knows, just because we can doesn't necessarily mean we should. Do you have a specific use case in mind? If the main idea is to use Bio.Cluster to generate trees based on a measure of sequence distance, we could probably do more to support that. This code might also be worth posting on wiki "Phylo cookbook" page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes on it while we consider merging it into the main distribution. 
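
A minimal sketch of what such a "from_cluster" style conversion might look like, offered as an illustration rather than as code from either module. It assumes the node convention described for Bio.Cluster trees (non-negative left/right values are item indices, negative values refer to earlier nodes, with -1 meaning the first node, and the last node is the root). The Node namedtuple and the function name cluster_to_newick are hypothetical stand-ins, and branch lengths are deliberately left out. The Newick string it returns could then be handed to Phylo.read via StringIO.

```python
from collections import namedtuple

# Hypothetical stand-in for a clustering node: two children plus the
# distance at which they were joined (unused in this topology-only sketch).
Node = namedtuple("Node", ["left", "right", "distance"])


def cluster_to_newick(nodes, names):
    """Return a Newick string (topology only) for a list of cluster nodes.

    Assumed convention: a child reference >= 0 is an index into `names`;
    a child reference < 0 points at nodes[-ref - 1]; the last node is the root.
    """
    def render(ref):
        if ref >= 0:
            return names[ref]
        node = nodes[-ref - 1]
        return "(%s,%s)" % (render(node.left), render(node.right))

    # The root is the last node, i.e. reference -len(nodes).
    return render(-len(nodes)) + ";"


# Three items: b and c are joined first, then a joins that cluster.
example_names = ["a", "b", "c"]
example_nodes = [Node(1, 2, 0.1), Node(0, -1, 0.5)]
example_newick = cluster_to_newick(example_nodes, example_names)
```

Going through Newick keeps the sketch independent of either module's internals; a direct conversion to Clade objects would avoid the round trip through text.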
-Eric From andrew.sczesnak at med.nyu.edu Mon Apr 16 18:47:25 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 Apr 2012 18:47:25 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: References: <4F8C4D69.4040009@med.nyu.edu> Message-ID: <4F8CA17D.4080907@med.nyu.edu> Eric, I can describe two use cases from my own experience. First, the MAF parser I've been working on can pull the multiple alignment of some gene between a bunch of genomes. Thinking of recipes for the cookbook, I thought it would be neat to walk the user through constructing a distance matrix by hand (though you're right--more could be done to support this), clustering with Bio.Cluster and visualizing the result with Bio.Phylo. I like this example because it integrates several different parts of BioPython along with a lesson about inferring distances between sequences. Second, for another project, I've been generating distance matrices based on the shared gene content of bacterial genomes and the presence-or-absence of orthologous groups in each. Presently, I ferry the matrices to a clustering program and then visualize the resulting trees in yet another tool. Looking into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and the incompatibility of their tree objects. I wonder, what would be the most elegant way of bridging the gap? Best, Andrew On 04/16/2012 06:15 PM, Eric Talevich wrote: > On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak > wrote: >> Hi Eric, >> >> I was playing with Bio.Cluster recently and noticed that trees generated by >> that module are not compatible with Bio.Phylo. I think it would be useful if >> output from Cluster could be manipulated with Phylo. >> >> At first glance, it doesn't seem like it would be that tricky to add a >> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and >> I would be happy to work on this. 
Before making an attempt, I wanted to get >> your feedback on whether you think this would be useful and if you had >> anything similar in the works already. >> >> >> Best, >> Andrew > > Hi Andrew, > > Interesting idea. It would be simple enough to add a "from_cluster" > function or class method to either Phylo/BaseTree.py or > Phylo/_utils.py. But as every scientist knows, just because we can > doesn't necessarily mean we should. Do you have a specific use case in > mind? > > If the main idea is to use Bio.Cluster to generate trees based on a > measure of sequence distance, we could probably do more to support > that. This code might also be worth posting on wiki "Phylo cookbook" > page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes > on it while we consider merging it into the main distribution. > > -Eric From eric.talevich at gmail.com Tue Apr 17 00:17:26 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 17 Apr 2012 00:17:26 -0400 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices In-Reply-To: References: Message-ID: On Mon, Apr 16, 2012 at 11:20 AM, Peter Cock wrote: > Hi Eric, > > That seems to have caused a test failure on one of our buildslaves: > > ====================================================================== > ERROR: Run the tree layout algorithm, but don't display it. > ---------------------------------------------------------------------- > Traceback (most recent call last): > ?File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py", > line 51, in test_draw > ? ?Phylo.draw(dollo, do_show=False) > ?File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py", > line 366, in draw > ? ?fig = plt.figure() > ?File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py", > line 270, in figure > ? ?**kwargs) > ?File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py", > line 120, in new_figure_manager > ? 
?backend_wx._create_wx_app() > ?File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py", > line 1377, in _create_wx_app > ? ?wxapp = wx.PySimpleApp() > ?File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", > line 8078, in __init__ > ? ?wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt) > ?File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", > line 7946, in __init__ > ? ?raise SystemExit(msg) > SystemExit: Unable to access the X Display, is $DISPLAY set properly? > > ---------------------------------------------------------------------- > > http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio > http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio > > Interestingly the same machine is passing the tests under other Python versions. > That would seem to rule out the $DISPLAY environment variable being the cause. > My hunch would be this is something about the Python 2.6 install, perhaps it > is missing some library (wxPython maybe). > > Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7 > have the same version of matplotlib installed, but only one is failing the test: > > $ python2.5 > Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55) > [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> import matplotlib > Traceback (most recent call last): > ?File "", line 1, in > ImportError: No module named matplotlib > > $ python2.6 > Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) > [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
>>>> import matplotlib
>>>> matplotlib.__version__
> '1.0.0'
>
> $ python2.7
> Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import matplotlib
>>>> matplotlib.__version__
> '1.0.0'
>
>
> Peter

Actually, it was this commit which added new unit tests:
https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8

On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
sure how to debug this, exactly. Do you know a way to prevent
matplotlib from attempting to launch the Wx app, beyond turning off
interactive mode as the test already does?

One idea is to specify a matplotlib backend other than wx. For
example, using this import approach in test_Phylo_depend.py might do
the trick:

try:
    import matplotlib
except ImportError:
    raise MissingExternalDependencyError(
            "Install matplotlib if you want to use Bio.Phylo._utils.")
else:
    # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
    # properly set up on the build machine. Instead, use the simpler postscript
    # backend -- we're not going to display or save the plot anyway, so it
    # doesn't matter much, as long as it's not Wx. I guess.
    matplotlib.use("ps")
    from matplotlib import pyplot


Would you be able to test this on the errant buildbot machine without
having to commit this to the trunk?

Thanks,
Eric

From p.j.a.cock at googlemail.com  Tue Apr 17 05:31:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 10:31:05 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices
In-Reply-To:
References:
Message-ID:

On Tue, Apr 17, 2012 at 5:17 AM, Eric Talevich wrote:
>
> Actually, it was this commit which added new unit tests:
> https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8
>

OK - thanks for checking.
> On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not > sure how to debug this, exactly. Do you know a way to prevent > matplotlib from attempting to launch the Wx app, beyond turn off > interactive mode as the test already does? Not sure. > One idea is to specify a matplotlib backend other than wx. For > example, using this import approach in test_Phylo_depend.py might do > the trick: > > try: > ? ?import matplotlib > except ImportError: > ? ?raise MissingExternalDependencyError( > ? ? ? ? ? ?"Install matplotlib if you want to use Bio.Phylo._utils.") > else: > ? ?# Don't use the Wx backend for matplotlib, b/c that depends on Wx being > ? ?# properly set up on the build machine. Instead, use the simpler postscript > ? ?# backend -- we're not going to display or save the plot anyway, so it > ? ?# doesn't matter much, as long as it's not Wx. I guess. > ? ?matplotlib.use("ps") > ? ?from matplotlib import pyplot > > > Would you be able to test this on the errant buildbot machine without > having to commit this to the trunk? Yes, that works (this buildbot is one of 'my' servers so I can run this directly). Please check that fix in. Thanks, Peter From p.j.a.cock at googlemail.com Tue Apr 17 11:23:22 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 Apr 2012 16:23:22 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond Message-ID: On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock wrote: > > Here are some things that I think are strong > candidates for 1.60 (not an exclusive list!) > > ... > > BGZF support: Low level module like Python's gzip, > support in SeqIO for indexing BGZF compressed files, > ... I've just rebased my bgzf branch, which I think is ready to apply to the trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3. https://github.com/peterjc/biopython/tree/bgzf2 Would anyone like to review this please? 
There are unittests and
plenty of docstrings - but so far nothing in the Tutorial though.

I wrote a blog post late last year explaining what this allows,
and this branch includes the changes to Bio.SeqIO to index
BGZF compressed sequence files that it discussed:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

The probable next step after this is combining it with Andrew
Sczesnak's work on indexing MAF files (they can get pretty big)
as explored by 'I.J.' (who as far as I know hasn't signed up
to the biopython-dev list, BCC'd).

Also it would be interesting to explore doing the (de)compression
of blocks on worker threads to take advantage of multiple cores.

Another idea would be to switch from a plain dictionary to an
ordered dictionary for holding cached decompressed blocks,
giving a way to drop the oldest block (although perhaps not as
clever as dropping the least recently used block, the overhead
is lower). That would require including our own OrderedDict
class on the older Python platforms.

Peter

[*] PyPy testing is complicated by running out of file handles,
an existing issue not something directly from this work. Part of
this is down to different GC under PyPy.

From eric.talevich at gmail.com  Tue Apr 17 11:25:35 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 11:25:35 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8CA17D.4080907@med.nyu.edu>
References: <4F8C4D69.4040009@med.nyu.edu> <4F8CA17D.4080907@med.nyu.edu>
Message-ID:

Andrew,

It would be useful to have a quick and portable function for
distance-based tree estimation in Bio.Phylo, since otherwise it's
necessary to use one of the wrappers for external programs in
Bio.Phylo.Applications. (And currently, only PhyML is wrapped.)

Does the hierarchical clustering algorithm in Bio.Cluster correspond
to any common tree-estimation algorithm, e.g. UPGMA?
If so, then it would make a lot of sense to provide the glue for using it that way. If you have done some work in this direction, I would be happy to see it. -Eric On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak wrote: > Eric, > > I can describe two use cases from my own experience. First, the MAF parser > I've been working on can pull the multiple alignment of some gene between a > bunch of genomes. Thinking of recipes for the cookbook, I thought it would > be neat to walk the user through constructing a distance matrix by hand > (though you're right--more could be done to support this), clustering with > Bio.Cluster and visualizing the result with Bio.Phylo. I like this example > because it integrates several different parts of BioPython along with a > lesson about inferring distances between sequences. > > Second, for another project, I've been generating distance matrices based on > the shared gene content of bacterial genomes and the presence-or-absence of > orthologous groups in each. Presently, I ferry the matrices to a clustering > program and then visualize the resulting trees in yet another tool. Looking > into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and > the incompatibility of their tree objects. > > I wonder, what would be the most elegant way of bridging the gap? > > > Best, > Andrew > From bioinformed at gmail.com Tue Apr 17 12:11:37 2012 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 17 Apr 2012 12:11:37 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: Message-ID: On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock wrote: > On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock > wrote: > > > > Here are some things that I think are strong > > candidates for 1.60 (not an exclusive list!) > > > > ... > > > > BGZF support: Low level module like Python's gzip, > > support in SeqIO for indexing BGZF compressed files, > > ... 
> > I've just rebased my bgzf branch, which I think is ready to apply to the > trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3. > https://github.com/peterjc/biopython/tree/bgzf2 > > Would anyone like to review this please? There are unittests and > plenty of docstrings - but so far nothing in the Tutorial though. > > Hi Peter, I've implemented code to create BAM/tabix style index files and perform lookups, so it has been high on my list to test and validate your BGZF code (rather having to write my own). I'm notoriously short on time, but this is in the critical path for several projects and I'm going to work on it over the next week or so. -Kevin From redmine at redmine.open-bio.org Tue Apr 17 21:29:29 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Apr 2012 01:29:29 +0000 Subject: [Biopython-dev] [Biopython - Bug #3333] PhyloXML writer fails to include is_aligned attribute with mol_seq elements References: Message-ID: Issue #3333 has been updated by Eric Talevich. The answer is: I'm an idiot. The mol_seq attribute was first defined as a complex attribute in the writer (via _handle_complex), but then further down redefined as a simple attribute. Fix: https://github.com/biopython/biopython/commit/a93c9892268274c4969131a1d401bb8ee235524a ---------------------------------------- Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements https://redmine.open-bio.org/issues/3333 Author: Eric Talevich Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: First reported here: http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html Steps to reproduce: 1. Load a tree, convert to PhyloXML
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
2. Add a sequence
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
3. Verify that the sequence information has been set -- mol_seq has is_aligned set
print tree
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
print tree.format('phyloxml')
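...
<clade>
  <name>c</name>
  <branch_length>1.0</branch_length>
  <sequence type="dna">
    <mol_seq>AAA</mol_seq>
  </sequence>
</clade>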
...
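
For comparison, a small sketch of the element one would expect to see once is_aligned is written out. It is built with the standard library's xml.etree.ElementTree rather than with Bio.Phylo, so it says nothing about how the writer itself works. The bare (un-namespaced) tag names and the lowercase "false" spelling of the boolean are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Build <sequence type="dna"><mol_seq is_aligned="false">AAA</mol_seq></sequence>
# by hand, to show where the missing attribute belongs.
sequence_elem = ET.Element("sequence", {"type": "dna"})
mol_seq_elem = ET.SubElement(sequence_elem, "mol_seq", {"is_aligned": "false"})
mol_seq_elem.text = "AAA"

# encoding="unicode" makes tostring return text instead of bytes.
expected_xml = ET.tostring(sequence_elem, encoding="unicode")
```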
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Apr 17 21:52:03 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Apr 2012 01:52:03 +0000 Subject: [Biopython-dev] [Biopython - Bug #3333] (Closed) PhyloXML writer fails to include is_aligned attribute with mol_seq elements References: Message-ID: Issue #3333 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 ---------------------------------------- Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements https://redmine.open-bio.org/issues/3333 Author: Eric Talevich Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: First reported here: http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html Steps to reproduce: 1. Load a tree, convert to PhyloXML
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
2. Add a sequence
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
3. Verify that the sequence information has been set -- mol_seq has is_aligned set
print tree
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
print tree.format('phyloxml')
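...
<clade>
  <name>c</name>
  <branch_length>1.0</branch_length>
  <sequence type="dna">
    <mol_seq>AAA</mol_seq>
  </sequence>
</clade>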
...
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Apr 19 00:27:49 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 19 Apr 2012 04:27:49 +0000 Subject: [Biopython-dev] [Biopython - Feature #3342] (New) Phylo.root_with_outgroup: set the length of the outgroup branch Message-ID: Issue #3342 has been reported by Eric Talevich. ---------------------------------------- Feature #3342: Phylo.root_with_outgroup: set the length of the outgroup branch https://redmine.open-bio.org/issues/3342 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: Add an option to the root_with_outgroup method to specify the length of the branch leading from the new root to the outgroup. This should not change the total tree length, i.e. this length is subtracted from the branch on the other side of the root. This option makes it possible to root the tree in other ways that split the outgroup branch, leaving a bifurcating rather than trifurcating root. I've attached a patch that implements this feature, plus unit tests for it. HOWEVER: A sane API for this method would look like: >>> tree.root_with_outgroup("apple", "orange", outgroup_branch_length=0.4) The original function definition included *args for specifying the outgroup taxa in one shot (instead of requiring a separate call to common_ancestor). But while Python 3 permits keyword-only arguments (a defined keyword argument after *args or just *), Python 2 does not. So I made the function calling style shown above work in a very weird way: the function definition has **kwargs instead of outgroup_branch_length=None, and the necessary keyword argument is pulled out of kwargs inside the body of the function. 
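
The **kwargs workaround described here can be sketched roughly as follows. This is a stand-alone illustration of the argument handling only, not the attached patch: the function body is a hypothetical placeholder that just returns what it parsed, and the check for unexpected keywords is an addition that the description above does not mention.

```python
def root_with_outgroup(*outgroup_targets, **kwargs):
    """Accept outgroup taxa positionally plus one keyword-only style option.

    Hypothetical stand-in: it only demonstrates how a keyword argument
    can follow *args on Python 2, where the keyword-only syntax
    (def f(*args, option=None)) is not available.
    """
    # Pull the one supported option out of kwargs, defaulting to None.
    outgroup_branch_length = kwargs.pop("outgroup_branch_length", None)
    # Anything left over is a misspelled or unsupported keyword.
    if kwargs:
        raise TypeError("unexpected keyword argument(s): %s"
                        % ", ".join(sorted(kwargs)))
    return outgroup_targets, outgroup_branch_length


parsed = root_with_outgroup("apple", "orange", outgroup_branch_length=0.4)
```

Rejecting leftover keywords recovers some of the safety that a real keyword argument in the signature would give, since a typo in the option name would otherwise be silently ignored.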
The name of this argument is given in the docstring, so it's still
partly discoverable.

Are we cool with this? Or, can anyone think of a better way to handle this?

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com  Fri Apr 20 04:39:02 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Apr 2012 09:39:02 +0100
Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33)
In-Reply-To:
References:
Message-ID:

I've had a quick look on GitHub and it isn't obvious to me how to
get pull request emails CC'd to our dev mailing list... but anyway,
Lenna has been busy:

Peter

---------- Forwarded message ----------
From: Lenna Peterson
Date: Thu, Apr 19, 2012 at 11:35 PM
Subject: [biopython] Feature: Python implementation of MMCIF parser (#33)
To: Peter Cock

I've written a PLY (Python lex-yacc) module that is superimposable
with the C MMCIF module. I've also partially rewritten the C MMCIF
module to be object-oriented.

### Changed files ###

* MMCIFlexmodule.c: Now object-oriented (open file in constructor,
close file in destructor, etc). Docstrings! Added file IO exception.
* MMCIF2Dict.py: Minor changes for new object oriented API
* MMCIFParser: Changed all uses of map() to list comprehensions (more
compatible with 3)

### New files ###

* MMCIFlex.py: PLY-based module for tokenizing input.

### What it needs ###

Addition of PLY dependency to setup.py. I'm not quite sure how to
handle this, as PLY wouldn't be necessary on a platform with C Python.
Thoughts?

Which non-CPython implementations are worth testing?

New C module tested on Python 2.6 on Mac OS X and Debian. I hope it
still works on Windows.
On my machine, the C module processes a 30,000 line test file in 10-15 ms; the Python module takes ~150 ms. You can merge this Pull Request by running: git pull https://github.com/lennax/biopython MMCIF2 Or you can view, comment on it, or merge it online at: https://github.com/biopython/biopython/pull/33 -- Commit Summary -- * Ply test in progress. * Quoted values with spaces are being broken. * Removed hard inclusion of ply. * Fixed quoted strings with spaces. * Changed Parser call to 2Dict. Semicolons break. * Changed Parser call to 2Dict. Semicolons break. * Lexes full file w/o error, FIXME loops * Tweak: comment handling * Changed token "NAME" to "TAG" * Using IUCr grammar. FIXME quote/semi * Fixed quoted strings. * Semicolon text field fixed, FIXME included \n * Fixed semi newlines. * non-eol temp fix, doesn't match single chars * Lexes full CIF file with no noticed errors. * Added timing. * Added states to lexer. * Lex loops into [header, [items], ...]; \d hacks. * Enforced semicolon rule. * Yacc works. * Re-added values to lexer state 'loop' * FIXME syntax error/hangs on full file. * Lexer gathers values, added parse precedence. * Minor lex cleanup. * Testing exclusionary lex redo. * Streamlined rules, no loop yet. * Still won't yacc 30k line file. * Merge branch 'master' of git://github.com/biopython/biopython into ply2 * Added __name__ __main__ check. * Parser redo, still doesn't parse 30k line file. * Added comments to tokenizer. * Fixed lex module's callability from yacc. * Fixed DATA token failure. * Multiple improvements, still no 30k. * Moved lexer arguments to constructor. * Moved data input to constructor, added docs * Validated to pep8. * Merge branch 'master' of git://github.com/biopython/biopython into ply2 * Add MMCIF2Dict from ply branch. * Remove flex header dependency of CIF parser. * Update MMCIFParser call of MMCIF2Dict. * PLY lexer works with MMCIF2Dict. * Cleanup. * Cleaned up import. * Updated docstring. * Subclassed dict.
* Restored MMCIFParser call to MMCIF2Dict. * Removed main() from lex input. * Restored newline. * Fix C prototype warnings. * Modifying python lexer to be substitutable w/ C. * Make header for generated C. * Import C lexer or Python lexer. * Improvements and documentation. * Uncomment GLOBAL token definition. * PLY lexer and C lexer should be interchangeable. * Improve error reporting of import. * Turn on ply lex optimize. * Call instance of Python lexer. * Working on implementing class in C module. * Start unit test for MMCIF. * Minimal unit test for MMCIFParser. * Revert to old generated C; manually added noyywrap * Manually added function prototypes to generated C. * Merge branch 'ply2' into dev * Merge branch 'ply' into dev * Merge branch 'c-dev' into dev * Merge branch 'master' of git://github.com/biopython/biopython into dev * Cleaning up old files. * More cleanup. * Merging Parser from MMCIFlex branch. * Parser and unit test for PyCIFRW * Python and C lexer APIs are now identical. * Add copyright and license notices. * Merge branch 'master' of git://github.com/biopython/biopython into dev * Trying GnuWin32 flex-generated C. * Win flex generated with new mmcif.lex * GnuWin32 flex generated C, used dos2unix for CRLF * Added correct author to flex C module. * Merge branch 'master' of git://github.com/biopython/biopython into dev * Merge branch 'master' of git://github.com/biopython/biopython into dev * Change map() to list comprehensions for 3 compat. * Renamed python lexer to match C module. * Added file IO exception to C module. * Tweak lexer module import. * Prep Python CIF lexer for pull request. * Whitespace tweaks. 
-- File Changes -- M Bio/PDB/MMCIF2Dict.py (20) M Bio/PDB/MMCIFParser.py (8) A Bio/PDB/mmCIF/MMCIFlex.py (253) M Bio/PDB/mmCIF/MMCIFlexmodule.c (122) -- Patch Links -- https://github.com/biopython/biopython/pull/33.patch https://github.com/biopython/biopython/pull/33.diff --- Reply to this email directly or view it on GitHub: https://github.com/biopython/biopython/pull/33 From andrew.sczesnak at med.nyu.edu Fri Apr 20 18:28:43 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 20 Apr 2012 18:28:43 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: References: <4F8C4D69.4040009@med.nyu.edu> <4F8CA17D.4080907@med.nyu.edu> Message-ID: <4F91E31B.9030101@med.nyu.edu> Eric, If my understanding is correct, UPGMA is slang for agglomerative average-linkage hierarchical clustering which is implemented along with single- and complete-linkage in the module. There's no equivalent of neighbor-joining or maximum-likelihood and Bio.Cluster probably isn't that fast with large numbers of nodes so wrappers are still useful. We could probably add an NJ implementation for small matrices pretty easily if you think it's worthwhile. Either way, the glue could be useful for visualizing relationships between genes/samples in microarrays (what I gather Bio.Cluster is intended for). Andrew On 04/17/2012 11:25 AM, Eric Talevich wrote: > Andrew, > > It would be useful to have a quick and portable function for > distance-based tree estimation in Bio.Phylo, since otherwise it's > necessary to use one of the wrappers for external programs in > Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does > the hierarchical clustering algorithm in Bio.Cluster correspond to any > common tree-estimation algorithm, e.g. UPGMA? If so, then it would > make a lot of sense to provide the glue for using it that way. If you > have done some work in this direction, I would be happy to see it.
> > -Eric > > > On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak > wrote: >> Eric, >> >> I can describe two use cases from my own experience. First, the MAF parser >> I've been working on can pull the multiple alignment of some gene between a >> bunch of genomes. Thinking of recipes for the cookbook, I thought it would >> be neat to walk the user through constructing a distance matrix by hand >> (though you're right--more could be done to support this), clustering with >> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example >> because it integrates several different parts of BioPython along with a >> lesson about inferring distances between sequences. >> >> Second, for another project, I've been generating distance matrices based on >> the shared gene content of bacterial genomes and the presence-or-absence of >> orthologous groups in each. Presently, I ferry the matrices to a clustering >> program and then visualize the resulting trees in yet another tool. Looking >> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and >> the incompatibility of their tree objects. >> >> I wonder, what would be the most elegant way of bridging the gap? >> >> >> Best, >> Andrew >> From andrew.sczesnak at med.nyu.edu Fri Apr 20 18:35:59 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 20 Apr 2012 18:35:59 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: Message-ID: <4F91E4CF.8040602@med.nyu.edu> Peter, My colleague was writing some code using MafIndex and commented how long it took her to download, decompress and index the human multiz alignments from UCSC. It seems like it'd be great to keep the files compressed... perhaps if the code works well enough we can convince UCSC to host bgzip'd copies (or maybe them available on one of our institutions servers). Is I.J. interested in joining the community? 
I'd like to look into adding BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could you put me in touch? Andrew On 04/17/2012 11:23 AM, Peter Cock wrote: > On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock wrote: >> >> Here are some things that I think are strong >> candidates for 1.60 (not an exclusive list!) >> >> ... >> >> BGZF support: Low level module like Python's gzip, >> support in SeqIO for indexing BGZF compressed files, >> ... > > I've just rebased my bgzf branch, which I think is ready to apply to the > trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3. > https://github.com/peterjc/biopython/tree/bgzf2 > > Would anyone like to review this please? There are unittests and > plenty of docstrings - but so far nothing in the Tutorial though. > > I wrote a blog post late last year explaining what this allows, and > this branch includes the changes to Bio.SeqIO to index BGZF > compressed sequence files that it discussed: > http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html > > The probable next step after this is combining it with Andrew Sczesnak's > work on indexing MAF files (they can get pretty big) as explored by 'I.J.' > (who as far as I know hasn't signed up to the biopython-dev list, BCC'd). > > Also it would be interesting to explore doing the (de)compression of > blocks on worker threads to take advantage of multiple cores. > > Another idea would be to switch from a plain dictionary to an > ordered dictionary for holding cached decompressed blocks, > giving a way to drop the oldest block (although not perhaps as > clever as dropping the least recently used block, the overhead is > lower). That would require including our own OrderedDict class > on the older Python platforms. > > Peter > > [*] PyPy testing is complicated by running out of file handles, > an existing issue not something directly from this work. Part > of this is down to different GC under PyPy.
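The ordered-dictionary idea in the quoted message can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the bgzf branch: the class name `BlockCache`, its methods, and the choice of block start offsets as keys are all hypothetical. It drops the oldest inserted block (first in, first out), which is the cheaper policy the message contrasts with least-recently-used eviction.

```python
from collections import OrderedDict


class BlockCache(object):
    """Hypothetical FIFO cache of decompressed blocks, keyed by offset."""

    def __init__(self, max_blocks=3):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, offset):
        """Return the cached block for this offset, or None on a miss."""
        return self._blocks.get(offset)

    def put(self, offset, data):
        """Store a block, evicting the oldest one if the cache is full."""
        if offset not in self._blocks and len(self._blocks) >= self.max_blocks:
            # popitem(last=False) removes the entry inserted earliest.
            self._blocks.popitem(last=False)
        self._blocks[offset] = data

    def offsets(self):
        """Return cached offsets, oldest first."""
        return list(self._blocks)
```

Because lookups never reorder the entries, a cache hit costs no more than with a plain dictionary; a least-recently-used policy would have to move the entry on every read.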
From arklenna at gmail.com Fri Apr 20 20:57:21 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 20 Apr 2012 20:57:21 -0400 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Fri, Apr 20, 2012 at 4:39 AM, Peter Cock wrote: > I've had a quick look on GitHub and it isn't obvious to me how to get > pull request emails CC'd to our dev mailing list... but anyway, Lenna > has been busy: > > Peter > > ---------- Forwarded message ---------- > From: Lenna Peterson > > Date: Thu, Apr 19, 2012 at 11:35 PM > Subject: [biopython] Feature: Python implementation of MMCIF parser (#33) > To: Peter Cock > > > I've written a PLY (Python lex-yacc) module that is superimposable > with the C MMCIF module. > > I've also partially rewritten the C MMCIF module to be object-oriented. > > ### Changed files ### > > * MMCIFlexmodule.c: Now object-oriented (open file in constructor, > close file in destructor, etc). Docstrings! Added file IO exception. > * MMCIF2Dict.py: Minor changes for new object oriented API > * MMCIFParser: Changed all uses of map() to list comprehensions (more > compatible with 3) > > ### New files ### > > * MMCIFlex.py: PLY-based module for tokenizing input. > > ### What it needs ### > Addition of PLY dependency to setup.py. > I'm not quite sure how to handle this, as PLY wouldn't be necessary on > a platform with C Python. Thoughts? Which non-CPython implementations > are worth testing? > > > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it > still works on Windows. > On my machine, the C module processes a 30,000 line test file in 10-15 > ms; the Python module takes ~150 ms. I've started testing the PLY lexer on PyPy. NumPyPy now implements more functions needed by PDB; the only things I found to be missing are random and linalg. This eliminates Superimposer, FragmentMapper, and Vector. 
I played around with trying to spoof "import numpy" to automatically import numpypy (code here: https://gist.github.com/2432815) but I don't think that's wise yet. My last commit to this branch was a few changes to allow the MMCIF parser to work on NumPy. PyPy won't run `setup.py test` due to global numpy failure, but if I install this branch and `pypy test_MMCIF.py`, it passes. Anybody with more PyPy and/or package structuring experience have thoughts? Lenna From p.j.a.cock at googlemail.com Sat Apr 21 06:32:33 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 21 Apr 2012 11:32:33 +0100 Subject: [Biopython-dev] [biopython] Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Saturday, April 21, 2012, Lenna Peterson wrote: > > > ### What it needs ### > > Addition of PLY dependency to setup.py. > > I'm not quite sure how to handle this, as PLY wouldn't be necessary on > > a platform with C Python. Thoughts? Which non-CPython implementations > > are worth testing? Basically Jython (which we've tried to support for a while) and PyPy (which I would like to officially support in future). Although a pure python setup can be useful in other settings, e.g. Windows development without the compilers otherwise needed. However, neither of those have NumPy (yet), which we need for the PDB module that would use the MMCIF parser. > > > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it > > still works on Windows. > > On my machine, the C module processes a 30,000 line test file in 10-15 > > ms; the Python module takes ~150 ms. That's a factor of ten slower, but still sounds fast enough perhaps that we don't really need the C code for usability. > > I've started testing the PLY lexer on PyPy. NumPyPy now implements > more functions needed by PDB; the only things I found to be missing > are random and linalg. This eliminates Superimposer, FragmentMapper, > and Vector. 
> > I played around with trying to spoof "import numpy" to automatically > import numpypy (code here: https://gist.github.com/2432815) but I > don't think that's wise yet. > > My last commit to this branch was a few changes to allow the MMCIF > parser to work on NumPy. PyPy won't run `setup.py test` due to global > numpy failure, but if I install this branch and `pypy test_MMCIF.py`, > it passes. > > Anybody with more PyPy and/or package structuring experience have thoughts? I filed a few bugs on missing code in PyPy's NumPy re-implementation (now called numpypy), good to hear they are getting closer to being enough for us to run Bio.PDB on it. Thank you for exploring this. Right now with in you shoes for MMCIF parsing I would focus on the parser failures with certain input files - there is an open bug on RedMine https://redmine.open-bio.org/issues/2626 and the Issue of multiple models (Eric can probably advise here), https://redmine.open-bio.org/issues/2943 And I must close this bug now your earlier work has been checked in - https://redmine.open-bio.org/issues/2619 Thanks! Peter > From redmine at redmine.open-bio.org Sat Apr 21 06:39:15 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 21 Apr 2012 10:39:15 +0000 Subject: [Biopython-dev] [Biopython - Bug #2619] (Closed) Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py References: Message-ID: Issue #2619 has been updated by Peter Cock. 
Status changed from New to Closed % Done changed from 0 to 100 Fixed with Lenna's work - see this commit and its parents: https://github.com/biopython/biopython/commit/e5ebb85d0614a34e59e7c2118a366512dc4d1320 ---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes needed to enable it are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py from line 486 on. A general workaround might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Apr 21 14:05:01 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 21 Apr 2012 18:05:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #2626] Bio.PDB mmCIFParser parse exceptions References: Message-ID: Issue #2626 has been updated by Lenna Peterson. File mmCifParseCheck.py added I've attempted to rescue this code from overzealous "text formatting". Attached version appeared to work on one test file; haven't tested the example broken files yet.
----------------------------------------
Bug #2626: Bio.PDB mmCIFParser parse exceptions
https://redmine.open-bio.org/issues/2626

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Other
Target version: 1.48
URL:

I recently ran the mmCIFParser object over all of PDB's mmCIF files and found a large number of files failed to parse correctly (a short script at the end to demonstrate). Of ~50k mmCIF files, 3891 files failed to parse and another 1980 were missing fields in the mmCIF dictionary.

A few examples of files that failed to parse:
http://www.rcsb.org/pdb/files/1alw.cif.gz
http://www.rcsb.org/pdb/files/1det.cif.gz
http://www.rcsb.org/pdb/files/1tmy.cif.gz

A few with missing fields:
http://www.rcsb.org/pdb/files/1mfl.cif.gz
http://www.rcsb.org/pdb/files/1tfj.cif.gz
http://www.rcsb.org/pdb/files/1zn8.cif.gz

The problem seems to be that an error in one mmCIF table, like an extra field, propagates through the rest of the parse.

x86_64 gentoo linux 2008, src BioPython install

__CODE__
import sys
from Bio.PDB import *

if len(sys.argv) != 2:
    print "usage: mmCifParseCheck.py "
    sys.exit(0)

structFile = sys.argv[1]
resultString = "";

#parse to structure object
numRes = 0
parser=MMCIFParser()
try:
    structure=parser.get_structure('test',structFile)
    for model in structure:
        for chain in model:
            for residue in chain:
                if(residue.id[0][:2] != "H_"):
                    numRes += 1
except:
    resultString += "parse to structure object failed\n";
else:
    resultString += "parse to structure object succeeded\n";

#parse whole mmCIF file to dict
try:
    mmcif_dict=MMCIF2Dict.MMCIF2Dict(structFile)
except:
    resultString += "parse to dict failed\n";
else:
    resultString += "parse to dict succeeded\n";

#get a required entry
try:
    id = mmcif_dict['_entry.id']
except:
    resultString += "key lookup failed\n";
else:
    resultString += "key lookup succeeded\n";

print resultString
print "number of non-het residues " + str(numRes)

-- You have received this notification because
you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Apr 21 14:16:07 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 21 Apr 2012 18:16:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Lenna Peterson. Did this commit close this bug? https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: In Progress Priority: Normal Assignee: Konstantin Okonechnikov Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From arklenna at gmail.com Sun Apr 22 02:48:10 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 22 Apr 2012 02:48:10 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) Message-ID: I've implemented the parser changes (written by Paul Bathen; see bug report) to allow the MMCIF parser to handle multiple models. Models are now accessed by a string key of their model number, rather than an arbitrary index (structure['1'] versus structure[0]). 
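To make the two access styles above concrete, here is a minimal sketch that does not use Bio.PDB at all. `index_models` is a hypothetical helper, not part of Biopython: it walks a per-atom column of model numbers (such as `_atom_site.pdbx_PDB_model_num` in an mmCIF file) and records, for each model number found in the file, the 0-based position at which that model first appears.

```python
from collections import OrderedDict


def index_models(model_nums):
    """Map each model number (as a string key) to its 0-based position.

    model_nums: per-atom model numbers in file order, e.g. the
    _atom_site.pdbx_PDB_model_num column of an mmCIF file.
    Hypothetical helper for illustration; not part of Bio.PDB.
    """
    positions = OrderedDict()
    for num in model_nums:
        key = str(num)
        if key not in positions:
            # First atom of a new model: assign the next sequential index.
            positions[key] = len(positions)
    return positions


# A file whose models are numbered 1, 2 and 4 (number 3 is skipped).
models = index_models(["1", "1", "2", "2", "4"])
```

Here `models["4"]` is 2: the key is the number written in the file, the value is what a purely positional index would be. The two only diverge when a file skips or does not start its model numbers at the expected value.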
I updated the MMCIF unit test for the new model access method and added a test file with multiple models. I'm not sure if there is documentation to be updated re: accessing the models. issue: https://redmine.open-bio.org/issues/2943 pull request: https://github.com/biopython/biopython/pull/34 - Lenna From MatatTHC at gmx.de Sun Apr 22 06:06:28 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sun, 22 Apr 2012 12:06:28 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Hi, since this bug seems to be of low priority I decided to try my best to help a bit and search the web a bit. It seems that the property is stored in PrimarySeq or Seq in bioperl. See for instance: http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm Or also: http://bugzilla.open-bio.org/show_bug.cgi?id=2578 This seems to be realised as boolean variable or function. Regards, Matthias 2012/4/4 Matthias Bernt : > Hi, > > are there any news on this? May I help somehow? But I have to admit > that I barely speak perl and have no experience with bioperl. If > someone tells me where to look I might still try it. > > Matthias > > 2012/3/29 Peter Cock : >> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >>> Hi, >>> >>> Is it possible to get the property if a genome is circular / linear >>> from SeqIO applied to genbank files? I could not find it. >>> >>> There is also a related bugreport: >>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >>> >>> I used the old parser before and switched to SeqIO which I really like >>> for the possibilities to parse different formats... but I really need >>> the information. >> >> Does anyone happen to have a BioPerl + BioSQL setup installed >> and working? IIRC checking that to make sure however we >> store the circular was compatible was the only real hurdle. 
>> >> Peter From redmine at redmine.open-bio.org Sun Apr 22 14:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:46:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Eric Talevich. Assignee deleted (Konstantin Okonechnikov) Yes it did, thanks. I'll close this bug now. ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: In Progress Priority: Normal Assignee: Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Apr 22 14:48:39 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:48:39 +0000 Subject: [Biopython-dev] [Biopython - Bug #2951] (Closed) PDBParser assigns model 0 to first model no matter what... References: Message-ID: Issue #2951 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 Closed with this commit, as pointed out just now by Lenna Peterson: https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 ---------------------------------------- Bug #2951: PDBParser assigns model 0 to first model no matter what... 
https://redmine.open-bio.org/issues/2951 Author: TallPaul empty Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.52 URL: I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 then increments that. This means someone thinking they are studying model X is actually studying X+1, and that assumes that authors always use sequential model numbers without skips. If authors CAN skip model numbers, i.e., MODEL 2, then MODEL 4, then MODEL 5... then in biopython these would be models 0, 1, and 2 in the structure... yuck. If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists. See lines 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106 Paul -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Apr 22 14:49:43 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:49:43 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] (Closed) Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Eric Talevich. Status changed from In Progress to Closed % Done changed from 20 to 100 Closed the blocker, too. Thanks again to Konstantin. ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: Closed Priority: Normal Assignee: Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output.
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From arklenna at gmail.com Mon Apr 23 01:35:23 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 23 Apr 2012 01:35:23 -0400 Subject: [Biopython-dev] pull request: Bio.SCOP.Raf chem dict updater Message-ID: I've adapted Hongbo Zhu's code to extract the three to one letter codes directly from the PDB Chemical Component dictionary. Existing calls of `from Raf import to_one_letter_code` should work as expected. pull request: https://github.com/biopython/biopython/pull/35 issue: https://redmine.open-bio.org/issues/3169 Lenna From redmine at redmine.open-bio.org Mon Apr 23 13:00:15 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 23 Apr 2012 17:00:15 +0000 Subject: [Biopython-dev] [Biopython - Bug #2943] (Closed) MMCIFParser only handling a single model. References: Message-ID: Issue #2943 has been updated by Peter Cock. Status changed from New to Closed % Done changed from 0 to 100 This should be working on the trunk now ready for Biopython 1.60 - thanks Lenna. See this commit and those preceding it: https://github.com/biopython/biopython/commit/2ac67cd14682a4bbad9e09654485914f9495138d If we've missed anything please reopen this bug. Thanks Paul! ---------------------------------------- Bug #2943: MMCIFParser only handling a single model. https://redmine.open-bio.org/issues/2943 Author: TallPaul empty Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.52 URL: MMCIFParser as-written only handles a single model in a protein. 
Any protein that has multiple models with repeating chains and residues will get an exception since the residue ID will already exist.

Please make the following changes in MMCIFParser.py:

Change the __doc__ setting:

#Optional __DOC__ change if the new MMCIFlex is not used nor the changes
#to MMCIF2Dict based on the new MMCIFlex.
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Regardless of the DOC changes:

Insert the following model_list line

occupancy_list=mmcif_dict["_atom_site.occupancy"]
fieldname_list=mmcif_dict["_atom_site.group_PDB"]
#Added by Paul T. Bathen Nov 2009
model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
try:

Make the following changes:

#Modified by Paul T. Bathen Nov 2009: comment out this line
#current_model_id=0
structure_builder=self._structure_builder
structure_builder.init_structure(structure_id)
#Modified by Paul T. Bathen Nov 2009: comment out this line
#structure_builder.init_model(current_model_id)
structure_builder.init_seg(" ")
#Added by Paul T. Bathen Nov 2009
current_model_id = -1

Make the following changes in the for loop:

#Note by Paul T. Bathen: should MMCIFParser include
#the HOH and WAT stmts in PDBParser immediately below?
#if fieldname=="HETATM":
#    if resname=="HOH" or resname=="WAT":
#        hetero_flag="W"
#    else:
#        hetero_flag="H"
if fieldname=="HETATM":
    hetatm_flag="H"
else:
    hetatm_flag=" "
#Added by Paul T. Bathen Nov 2009
model_id = model_list[i]
if current_model_id != model_id:
    current_model_id = model_id
    structure_builder.init_model(current_model_id)
#end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed with the same number of models, chains, and residues.

Paul

-- You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon Apr 23 13:02:01 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Apr 2012 18:02:01 +0100 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson wrote: > I've implemented the parser changes (written by Paul Bathen; see bug > report) to allow the MMCIF parser to handle multiple models. > > Models are now accessed by a string key of their model number, rather > than an arbitrary index (structure['1'] versus structure[0]). > > I updated the MMCIF unit test for the new model access method and > added a test file with multiple models. > > I'm not sure if there is documentation to be updated re: accessing the models. > > issue: https://redmine.open-bio.org/issues/2943 > pull request: https://github.com/biopython/biopython/pull/34 I've applied that to the trunk, thank you, but on reading this, why are the model keys strings and not integers? Does MMCIF allow odd keys or something? Peter From eric.talevich at gmail.com Mon Apr 23 16:10:27 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 23 Apr 2012 16:10:27 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Mon, Apr 23, 2012 at 1:02 PM, Peter Cock wrote: > On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson wrote: >> I've implemented the parser changes (written by Paul Bathen; see bug >> report) to allow the MMCIF parser to handle multiple models. >> >> Models are now accessed by a string key of their model number, rather >> than an arbitrary index (structure['1'] versus structure[0]). >> >> I updated the MMCIF unit test for the new model access method and >> added a test file with multiple models. 
>> >> I'm not sure if there is documentation to be updated re: accessing the models. >> >> issue: https://redmine.open-bio.org/issues/2943 >> pull request: https://github.com/biopython/biopython/pull/34 > > I've applied that to the trunk, thank you, but on reading this, why are the > model keys strings and not integers? Does MMCIF allow odd keys or > something? > Ack, I didn't look at that closely enough. Check out this patch to see the current situation: https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 The models associated with a structure are numbered with a sequential integer id, starting from 0. It's always been like that in our PDB parser and we haven't changed it. To ensure that model numbers specified in the PDB file are preserved when writing the PDB back to file, the above patch introduced a new attribute on the Model object called serial_num (also an integer, equal to model.id unless specified otherwise). That attribute is only used when writing a new PDB file; Model.__getitem__ still uses Model.id as before. Perhaps that's surprising now that we read the serial numbers, but it kept backward compatibility. Plus, it preserves list-like behavior (item access via integers), even though the models are actually stored in a dict. So! In the mmCIF parser, the calls to structure_builder.init_model should be given two arguments instead of one: an integer id counting from 0, and then another integer (probably) containing the model "serial number" specified in the mmCIF file. In the event that an mmCIF file doesn't specify the model number, the serial number should be the same as the sequential id. Cool? This will also help us convert between PDB and mmCIF formats in the future. As for accessing the models by their serial number, using string keys seems like an effective workaround, but still obviously a workaround rather than an ideal situation. 
Let's discuss that a little more, perhaps file another bug when we've reached some consensus. Best, Eric From eric.talevich at gmail.com Mon Apr 23 16:32:11 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 23 Apr 2012 16:32:11 -0400 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Fri, Apr 20, 2012 at 8:57 PM, Lenna Peterson wrote: > > I've started testing the PLY lexer on PyPy. NumPyPy now implements > more functions needed by PDB; the only things I found to be missing > are random and linalg. This eliminates Superimposer, FragmentMapper, > and Vector. > > I played around with trying to spoof "import numpy" to automatically > import numpypy (code here: https://gist.github.com/2432815) but I > don't think that's wise yet. > > My last commit to this branch was a few changes to allow the MMCIF > parser to work on NumPy. PyPy won't run `setup.py test` due to global > numpy failure, but if I install this branch and `pypy test_MMCIF.py`, > it passes. > > Anybody with more PyPy and/or package structuring experience have thoughts? 
>
> Lenna

Would it be more or less error-prone to simply replace every numpy import with this (after testing each module on PyPy):

    try:
        import numpy
    except:
        import numpypy as numpy

Or similarly, use this as one of our compatibility utilities:

    from Bio import numpy
    # Some conditional junk in Bio/__init__.py or setup.py to reveal this
    # module to PyPy and CPython as needed

In either case, here's the relatively short list of modules that would need to be modified:

    Bio/Affy/CelFile.py
    Bio/Cluster/__init__.py
    Bio/KDTree/KDTree.py
    Bio/LogisticRegression.py
    Bio/MarkovModel.py
    Bio/MaxEntropy.py
    Bio/NaiveBayes.py
    Bio/PDB/Atom.py
    Bio/PDB/FragmentMapper.py
    Bio/PDB/MMCIFParser.py
    Bio/PDB/NeighborSearch.py
    Bio/PDB/PDBParser.py
    Bio/PDB/ResidueDepth.py
    Bio/PDB/Superimposer.py
    Bio/PDB/Vector.py
    Bio/SVDSuperimposer/SVDSuperimposer.py
    Bio/Statistics/lowess.py
    Bio/SubsMat/__init__.py
    Bio/kNN.py

From p.j.a.cock at googlemail.com Mon Apr 23 16:47:02 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Apr 2012 21:47:02 +0100 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID:

On Mon, Apr 23, 2012 at 9:32 PM, Eric Talevich wrote:
>
> Would it be more or less error-prone to simply replace every numpy
> import with this (after testing each module on PyPy):
>
> try:
>     import numpy
> except:
>     import numpypy as numpy
>

Maybe, but right now do any of our NumPy using modules pass under PyPy? I don't believe so... but I haven't tried a PyPy nightly build lately.

It was unfortunate that originally PyPy's micronumpy pretended to be numpy, so that you'd write "import numpy" and think it worked but be surprised later when something fundamental like the dot function was missing, or 2D arrays. That led to a few nasty try/import lines in our unit tests.

Let's wait and see how PyPy's numpy support improves before rushing to change any of our numpy imports. I am hopeful that Bio.PDB will be fine in their next release, whereas things using the NumPy C API will probably not be.

Peter

From arklenna at gmail.com Mon Apr 23 19:05:03 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 23 Apr 2012 19:05:03 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID:

On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote:
>
> Ack, I didn't look at that closely enough. Check out this patch to see
> the current situation:
> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9
>
> The models associated with a structure are numbered with a sequential
> integer id, starting from 0. It's always been like that in our PDB
> parser and we haven't changed it. To ensure that model numbers
> specified in the PDB file are preserved when writing the PDB back to
> file, the above patch introduced a new attribute on the Model object
> called serial_num (also an integer, equal to model.id unless specified
> otherwise). That attribute is only used when writing a new PDB file;
> Model.__getitem__ still uses Model.id as before.
>
> Perhaps that's surprising now that we read the serial numbers, but it
> kept backward compatibility. Plus, it preserves list-like behavior
> (item access via integers), even though the models are actually stored
> in a dict.
>
> So!
>
> In the mmCIF parser, the calls to structure_builder.init_model should
> be given two arguments instead of one: an integer id counting from 0,
> and then another integer (probably) containing the model "serial
> number" specified in the mmCIF file. In the event that an mmCIF file
> doesn't specify the model number, the serial number should be the same
> as the sequential id.
>
> Cool? This will also help us convert between PDB and mmCIF formats in
> the future.

Got it. I'm working on implementing the serial_number/model_number dichotomy for MMCIF.
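For illustration only, here is a minimal sketch of the numbering scheme described in the quoted text: each new value in the model column gets a sequential id counting from 0, paired with the integer serial number read from the file. The function name number_models and the sample column are hypothetical and are not taken from MMCIFParser; the case of a file with no model column is not covered.

```python
def number_models(model_column):
    """Return (model_id, serial_num) pairs in order of first appearance.

    model_id is a sequential integer counting from 0; serial_num is
    int() of the value found in the (hypothetical) model column.
    """
    pairs = []
    current_serial = None
    for raw in model_column:
        serial_num = int(raw)
        if serial_num != current_serial:
            # A new model starts here: hand out the next sequential id.
            current_serial = serial_num
            pairs.append((len(pairs), serial_num))
    return pairs


# One entry per atom row, as strings, the way a CIF column would be read.
column = ["1", "1", "1", "2", "2", "5"]
print(number_models(column))  # [(0, 1), (1, 2), (2, 5)]
```

Each pair would correspond to one init_model call with two arguments, as suggested above.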
> As for accessing the models by their serial number, using string keys > seems like an effective workaround, but still obviously a workaround > rather than an ideal situation. Let's discuss that a little more, > perhaps file another bug when we've reached some consensus. Er, I made and then lost (still haven't *quite* gotten the hang of git rebase) a patch that applied int() to the MMCIF model numbers. I'll add that back so both model and serial numbers are ints. Lenna From arklenna at gmail.com Tue Apr 24 00:25:12 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 00:25:12 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: > > Ack, I didn't look at that closely enough. Check out this patch to see > the current situation: > https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > > The models associated with a structure are numbered with a sequential > integer id, starting from 0. It's always been like that in our PDB > parser and we haven't changed it. To ensure that model numbers > specified in the PDB file are preserved when writing the PDB back to > file, the above patch introduced a new attribute on the Model object > called serial_num (also an integer, equal to model.id unless specified > otherwise). That attribute is only used when writing a new PDB file; > Model.__getitem__ still uses Model.id as before. > > Perhaps that's surprising now that we read the serial numbers, but it > kept backward compatibility. Plus, it preserves list-like behavior > (item access via integers), even though the models are actually stored > in a dict. > > So! 
> > In the mmCIF parser, the calls to structure_builder.init_model should > be given two arguments instead of one: an integer id counting from 0, > and then another integer (probably) containing the model "serial > number" specified in the mmCIF file. In the event that an mmCIF file > doesn't specify the model number, the serial number should be the same > as the sequential id. > > Cool? This will also help us convert between PDB and mmCIF formats in > the future. > > As for accessing the models by their serial number, using string keys > seems like an effective workaround, but still obviously a workaround > rather than an ideal situation. Let's discuss that a little more, > perhaps file another bug when we've reached some consensus. > > Best, > Eric Hi Eric, I believe I've implemented the model_id/serial_id system found in PDB: https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d Please let me know if you think that looks right. I couldn't find an mmCIF file without a model column to test, but I believe in that case it will assign model_id and serial_id to 0. Would that be the correct behavior? I also modified the unit test to check the model serial_num. https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 Currently serial_num is int() of the CIF model column. Regarding access by string serial_num, I am concerned that the int/string access would be too subtle (structure[0] == structure['1']; structure[1] == structure['2']?). Perhaps an accessor function? i.e. structure.get_model('1') Let me know if you think I should write get_model() or something along those lines. 
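To make the question concrete, a rough sketch of what such an accessor might look like, assuming models stay keyed by their sequential id and carry a serial_num attribute. SketchModel, SketchStructure and this get_model are hypothetical stand-ins written for this example; they are not existing Bio.PDB classes or methods.

```python
class SketchModel:
    """Hypothetical stand-in for a model with an id and a serial number."""

    def __init__(self, id, serial_num=None):
        self.id = id
        # Fall back to the sequential id when no serial number is given.
        self.serial_num = id if serial_num is None else serial_num


class SketchStructure:
    """Hypothetical stand-in for a structure holding models by id."""

    def __init__(self):
        self.child_dict = {}

    def add(self, model):
        self.child_dict[model.id] = model

    def __getitem__(self, id):
        # Integer access by sequential id, as before.
        return self.child_dict[id]

    def get_model(self, serial_num):
        """Look a model up by serial number (accepts '1' or 1)."""
        serial_num = int(serial_num)
        for model in self.child_dict.values():
            if model.serial_num == serial_num:
                return model
        raise KeyError(serial_num)


structure = SketchStructure()
structure.add(SketchModel(0, serial_num=1))
structure.add(SketchModel(1, serial_num=2))
print(structure[0] is structure.get_model("1"))  # True
```

Keeping the lookup in a named method leaves structure[0] unambiguous.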
Lenna From eric.talevich at gmail.com Tue Apr 24 11:38:50 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 24 Apr 2012 11:38:50 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: >> >> Ack, I didn't look at that closely enough. Check out this patch to see >> the current situation: >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 >> >> The models associated with a structure are numbered with a sequential >> integer id, starting from 0. It's always been like that in our PDB >> parser and we haven't changed it. To ensure that model numbers >> specified in the PDB file are preserved when writing the PDB back to >> file, the above patch introduced a new attribute on the Model object >> called serial_num (also an integer, equal to model.id unless specified >> otherwise). That attribute is only used when writing a new PDB file; >> Model.__getitem__ still uses Model.id as before. >> >> Perhaps that's surprising now that we read the serial numbers, but it >> kept backward compatibility. Plus, it preserves list-like behavior >> (item access via integers), even though the models are actually stored >> in a dict. >> >> So! >> >> In the mmCIF parser, the calls to structure_builder.init_model should >> be given two arguments instead of one: an integer id counting from 0, >> and then another integer (probably) containing the model "serial >> number" specified in the mmCIF file. In the event that an mmCIF file >> doesn't specify the model number, the serial number should be the same >> as the sequential id. >> >> Cool? This will also help us convert between PDB and mmCIF formats in >> the future. 
>> >> As for accessing the models by their serial number, using string keys >> seems like an effective workaround, but still obviously a workaround >> rather than an ideal situation. Let's discuss that a little more, >> perhaps file another bug when we've reached some consensus. >> >> Best, >> Eric > > > Hi Eric, > > I believe I've implemented the model_id/serial_id system found in PDB: > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > Please let me know if you think that looks right. I couldn't find an > mmCIF file without a model column to test, but I believe in that case > it will assign model_id and serial_id to 0. Would that be the correct > behavior? > > I also modified the unit test to check the model serial_num. > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > Currently serial_num is int() of the CIF model column. Regarding > access by string serial_num, I am concerned that the int/string access > would be too subtle (structure[0] == structure['1']; structure[1] == > structure['2']?). Perhaps an accessor function? i.e. > structure.get_model('1') > > Let me know if you think I should write get_model() or something along > those lines. > > Lenna I left another nitpick on b453a, but besides that it looks exactly right to me. The string/int distinction would indeed be weird, especially for newer Python users coming from Perl or Javascript. I don't see a direct analogue for get_model(serial_num) in the other Entities (Residue, Chain, Model, Structure), so I'm inclined to put off the decision for now (i.e. leave it out of this patch set). 
-Eric From p.j.a.cock at googlemail.com Tue Apr 24 11:58:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 16:58:10 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <4F91E4CF.8040602@med.nyu.edu> References: <4F91E4CF.8040602@med.nyu.edu> Message-ID: On Fri, Apr 20, 2012 at 11:35 PM, Andrew Sczesnak wrote: > Peter, > > My colleague was writing some code using MafIndex and commented how long it > took her to download, decompress and index the human multiz alignments from > UCSC. It seems like it'd be great to keep the files compressed... perhaps if > the code works well enough we can convince UCSC to host bgzip'd copies (or > maybe them available on one of our institutions servers). That does sound good - it is a perfect example of where BGZF is a more useful alternative to standard GZIP. Some numbers on how much of a size penalty it imposes would help though... > Is I.J. interested in joining the community? I'd like to look into adding > BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could > you put me in touch? Perhaps he's just busy at the moment (BCC'd again)? It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py and I'm willing to do this myself for MAF (while going over your index work - something I want to do anyway). The only potential catch is avoiding offset arithmetic. Peter From arklenna at gmail.com Tue Apr 24 13:56:37 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 13:56:37 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich wrote: > > On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: > > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: > >> > >> Ack, I didn't look at that closely enough. 
Check out this patch to see > >> the current situation: > >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > >> > >> The models associated with a structure are numbered with a sequential > >> integer id, starting from 0. It's always been like that in our PDB > >> parser and we haven't changed it. To ensure that model numbers > >> specified in the PDB file are preserved when writing the PDB back to > >> file, the above patch introduced a new attribute on the Model object > >> called serial_num (also an integer, equal to model.id unless specified > >> otherwise). That attribute is only used when writing a new PDB file; > >> Model.__getitem__ still uses Model.id as before. > >> > >> Perhaps that's surprising now that we read the serial numbers, but it > >> kept backward compatibility. Plus, it preserves list-like behavior > >> (item access via integers), even though the models are actually stored > >> in a dict. > >> > >> So! > >> > >> In the mmCIF parser, the calls to structure_builder.init_model should > >> be given two arguments instead of one: an integer id counting from 0, > >> and then another integer (probably) containing the model "serial > >> number" specified in the mmCIF file. In the event that an mmCIF file > >> doesn't specify the model number, the serial number should be the same > >> as the sequential id. > >> > >> Cool? This will also help us convert between PDB and mmCIF formats in > >> the future. > >> > >> As for accessing the models by their serial number, using string keys > >> seems like an effective workaround, but still obviously a workaround > >> rather than an ideal situation. Let's discuss that a little more, > >> perhaps file another bug when we've reached some consensus. 
> >> > >> Best, > >> Eric > > > > > > Hi Eric, > > > > I believe I've implemented the model_id/serial_id system found in PDB: > > > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > > > Please let me know if you think that looks right. I couldn't find an > > mmCIF file without a model column to test, but I believe in that case > > it will assign model_id and serial_id to 0. Would that be the correct > > behavior? > > > > I also modified the unit test to check the model serial_num. > > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > > > Currently serial_num is int() of the CIF model column. Regarding > > access by string serial_num, I am concerned that the int/string access > > would be too subtle (structure[0] == structure['1']; structure[1] == > > structure['2']?). Perhaps an accessor function? i.e. > > structure.get_model('1') > > > > Let me know if you think I should write get_model() or something along > > those lines. > > > > Lenna > > I left another nitpick on b453a, but besides that it looks exactly right to me. > > The string/int distinction would indeed be weird, especially for newer > Python users coming from Perl or Javascript. I don't see a direct > analogue for get_model(serial_num) in the other Entities (Residue, > Chain, Model, Structure), so I'm inclined to put off the decision for > now (i.e. leave it out of this patch set). > > -Eric Eric, Okay, I've changed the bad model num generic warning to a PDBConstructionException. New pull request to get MMCIF to the same state as PDB: https://github.com/biopython/biopython/pull/36 So are chains accessed by 0, 1, 2 or by A, B, C? 
Lenna

From anaryin at gmail.com Tue Apr 24 13:59:10 2012 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 24 Apr 2012 19:59:10 +0200 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID:

Hi Lenna,

IMO, chains should be accessed by A, B, C; numeric access doesn't make sense for them.

Congrats on the GSOC application and on the good work so far!

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On 24 April 2012 19:56, Lenna Peterson wrote:

> On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich > wrote: > > > > On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson > wrote: > > > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich < > eric.talevich at gmail.com> wrote: > > >> > > >> Ack, I didn't look at that closely enough. Check out this patch to see > > >> the current situation: > > >> > https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > > >> > > >> The models associated with a structure are numbered with a sequential > > >> integer id, starting from 0. It's always been like that in our PDB > > >> parser and we haven't changed it. To ensure that model numbers > > >> specified in the PDB file are preserved when writing the PDB back to > > >> file, the above patch introduced a new attribute on the Model object > > >> called serial_num (also an integer, equal to model.id unless > specified > > >> otherwise). That attribute is only used when writing a new PDB file; > > >> Model.__getitem__ still uses Model.id as before. > > >> > > >> Perhaps that's surprising now that we read the serial numbers, but it > > >> kept backward compatibility. Plus, it preserves list-like behavior > > >> (item access via integers), even though the models are actually stored > > >> in a dict. > > >> > > >> So!
> > >> > > >> In the mmCIF parser, the calls to structure_builder.init_model should > > >> be given two arguments instead of one: an integer id counting from 0, > > >> and then another integer (probably) containing the model "serial > > >> number" specified in the mmCIF file. In the event that an mmCIF file > > >> doesn't specify the model number, the serial number should be the same > > >> as the sequential id. > > >> > > >> Cool? This will also help us convert between PDB and mmCIF formats in > > >> the future. > > >> > > >> As for accessing the models by their serial number, using string keys > > >> seems like an effective workaround, but still obviously a workaround > > >> rather than an ideal situation. Let's discuss that a little more, > > >> perhaps file another bug when we've reached some consensus. > > >> > > >> Best, > > >> Eric > > > > > > > > > Hi Eric, > > > > > > I believe I've implemented the model_id/serial_id system found in PDB: > > > > > > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > > > > > Please let me know if you think that looks right. I couldn't find an > > > mmCIF file without a model column to test, but I believe in that case > > > it will assign model_id and serial_id to 0. Would that be the correct > > > behavior? > > > > > > I also modified the unit test to check the model serial_num. > > > > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > > > > > Currently serial_num is int() of the CIF model column. Regarding > > > access by string serial_num, I am concerned that the int/string access > > > would be too subtle (structure[0] == structure['1']; structure[1] == > > > structure['2']?). Perhaps an accessor function? i.e. > > > structure.get_model('1') > > > > > > Let me know if you think I should write get_model() or something along > > > those lines. 
> > > > > > Lenna > > > > I left another nitpick on b453a, but besides that it looks exactly right > to me. > > > > The string/int distinction would indeed be weird, especially for newer > > Python users coming from Perl or Javascript. I don't see a direct > > analogue for get_model(serial_num) in the other Entities (Residue, > > Chain, Model, Structure), so I'm inclined to put off the decision for > > now (i.e. leave it out of this patch set). > > > > -Eric > > > Eric, > > Okay, I've changed the bad model num generic warning to a > PDBConstructionException. > > New pull request to get MMCIF to the same state as PDB: > https://github.com/biopython/biopython/pull/36 > > So are chains accessed by 0, 1, 2 or by A, B, C? > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Tue Apr 24 14:20:16 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 24 Apr 2012 14:20:16 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 1:56 PM, Lenna Peterson wrote: > On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich wrote: >> >> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: >> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: >> >> >> >> Ack, I didn't look at that closely enough. Check out this patch to see >> >> the current situation: >> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 >> >> >> >> The models associated with a structure are numbered with a sequential >> >> integer id, starting from 0. It's always been like that in our PDB >> >> parser and we haven't changed it. 
To ensure that model numbers >> >> specified in the PDB file are preserved when writing the PDB back to >> >> file, the above patch introduced a new attribute on the Model object >> >> called serial_num (also an integer, equal to model.id unless specified >> >> otherwise). That attribute is only used when writing a new PDB file; >> >> Model.__getitem__ still uses Model.id as before. >> >> >> >> Perhaps that's surprising now that we read the serial numbers, but it >> >> kept backward compatibility. Plus, it preserves list-like behavior >> >> (item access via integers), even though the models are actually stored >> >> in a dict. >> >> >> >> So! >> >> >> >> In the mmCIF parser, the calls to structure_builder.init_model should >> >> be given two arguments instead of one: an integer id counting from 0, >> >> and then another integer (probably) containing the model "serial >> >> number" specified in the mmCIF file. In the event that an mmCIF file >> >> doesn't specify the model number, the serial number should be the same >> >> as the sequential id. >> >> >> >> Cool? This will also help us convert between PDB and mmCIF formats in >> >> the future. >> >> >> >> As for accessing the models by their serial number, using string keys >> >> seems like an effective workaround, but still obviously a workaround >> >> rather than an ideal situation. Let's discuss that a little more, >> >> perhaps file another bug when we've reached some consensus. >> >> >> >> Best, >> >> Eric >> > >> > >> > Hi Eric, >> > >> > I believe I've implemented the model_id/serial_id system found in PDB: >> > >> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d >> > >> > Please let me know if you think that looks right. I couldn't find an >> > mmCIF file without a model column to test, but I believe in that case >> > it will assign model_id and serial_id to 0. Would that be the correct >> > behavior? >> > >> > I also modified the unit test to check the model serial_num. 
>> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6
>> >
>> > Currently serial_num is int() of the CIF model column. Regarding
>> > access by string serial_num, I am concerned that the int/string access
>> > would be too subtle (structure[0] == structure['1']; structure[1] ==
>> > structure['2']?). Perhaps an accessor function? i.e.
>> > structure.get_model('1')
>> >
>> > Let me know if you think I should write get_model() or something along
>> > those lines.
>> >
>> > Lenna
>>
>> I left another nitpick on b453a, but besides that it looks exactly right to me.
>>
>> The string/int distinction would indeed be weird, especially for newer
>> Python users coming from Perl or Javascript. I don't see a direct
>> analogue for get_model(serial_num) in the other Entities (Residue,
>> Chain, Model, Structure), so I'm inclined to put off the decision for
>> now (i.e. leave it out of this patch set).
>>
>> -Eric
>
>
> Eric,
>
> Okay, I've changed the bad model num generic warning to a
> PDBConstructionException.
>
> New pull request to get MMCIF to the same state as PDB:
> https://github.com/biopython/biopython/pull/36
>
> So are chains accessed by 0, 1, 2 or by A, B, C?
>
> Lenna

Cool, I just merged the pull request. Thanks!

As João said, chains are accessed by the letter ID via __getitem__ (implemented in Bio.PDB.Entity). You can get at them either way through the child_list and child_dict attributes, too. Kind of a thrill.

I suppose we could eventually refactor the Entity-based classes to use a single data structure (OrderedDict, namedtuple, numpy array with named columns/rows?) in place of child_dict and child_list, and clean up some of the redundant accessors.
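As a rough sketch of the single-structure idea, assuming an OrderedDict is the container chosen: one mapping keeps insertion order and id lookup together, and the child_list and child_dict views are derived from it. OrderedEntity and Child are hypothetical names for this example, not the Bio.PDB.Entity implementation.

```python
from collections import OrderedDict


class Child:
    """Hypothetical child object; only the id matters here."""

    def __init__(self, id):
        self.id = id


class OrderedEntity:
    """Hypothetical entity storing children in one ordered mapping."""

    def __init__(self, id):
        self.id = id
        self._children = OrderedDict()

    def add(self, child):
        if child.id in self._children:
            raise ValueError("%r defined twice" % (child.id,))
        self._children[child.id] = child

    def __getitem__(self, id):
        # Lookup by id, e.g. model["A"] for a chain.
        return self._children[id]

    def __iter__(self):
        # Iterate over children in the order they were added.
        return iter(self._children.values())

    def __len__(self):
        return len(self._children)

    @property
    def child_list(self):
        # Derived view, kept for backward compatibility.
        return list(self._children.values())

    @property
    def child_dict(self):
        # Derived view, kept for backward compatibility.
        return dict(self._children)


model = OrderedEntity(0)
for chain_id in ("B", "A", "C"):
    model.add(Child(chain_id))
print([chain.id for chain in model])  # ['B', 'A', 'C']
print(model["A"].id)                  # A
```

Whether the derived views are cheap enough for large structures is left open here.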
-E

From anaryin at gmail.com Tue Apr 24 14:25:15 2012 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 24 Apr 2012 20:25:15 +0200 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID:

I cannot agree more with Eric on this. Child dict and child list should definitely be refactored into a single structure that is easier to understand (and use). Also because we should take care of that memory leak... (try running the parser over a lot of PDBs and you will see memory going up).

Cheers,

João

From p.j.a.cock at googlemail.com Tue Apr 24 16:07:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 21:07:03 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu> References: <4F91E4CF.8040602@med.nyu.edu> <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu> Message-ID:

On Tue, Apr 24, 2012 at 7:24 PM, Irwin Jungreis wrote:
> Hello Andrew and Peter.
>

Hi again Irwin,

> The size penalty of bgz versus gzip for .maf files is quite small. For
> example, compressing the 6-way C. elegans alignment .maf files is 108.9 MB
> with gzip and 112 MB with bgz, a difference of less than 3%. (Each is
> smaller than the uncompressed file by a factor of about 4 or 5.)

That's good - and given the nature of the MAF format in line with what I was hoping for - see also the overheads I got for FASTA, SwissProt and UniProt XML here: http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html

> I am not very familiar with biopython, so I've been using my own utilities.
> To work with alignments I create an index file consisting of a 32-byte
> record for each maf block. Each record contains the block start on the
> reference species chromosome, the block length on the reference species, and
> the virtual offset of the block start in the .maf file. I then have a
> utility that will extract the alignment for a given set of spliced regions,
> e.g., chrX:11568015-11569059+chrX:11569364-11569395 on the '-' strand, and
> output it as a list of pairs (assembly name, base string).
>
> I'd be happy to share, but I have no idea how this would fit into the
> existing biopython infrastructure.
>
> Best,
> Irwin

Ah - I must have misinterpreted your earlier email (off list). I'd assumed you were using Andrew's Biopython branch which indexes MAF files using an SQLite database of offsets. But in practice the principle is the same - BGZF lets you have good compression of MAF files and random access. Thank you for clarifying this.

If you use Python at all perhaps you'd have some feedback on Andrew's indexing plans? That would be great - Andrew's done a great job explaining the proposed code usage here: http://biopython.org/wiki/Multiple_Alignment_Format

Regards,

Peter

From redmine at redmine.open-bio.org Tue Apr 24 22:33:04 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 25 Apr 2012 02:33:04 +0000 Subject: [Biopython-dev] [Biopython - Feature #3344] (New) Bio.PDB.Entity classes need a __contains__ method Message-ID:

Issue #3344 has been reported by Eric Talevich.

----------------------------------------
Feature #3344: Bio.PDB.Entity classes need a __contains__ method
https://redmine.open-bio.org/issues/3344

Author: Eric Talevich
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

The various objects constructed by Bio.PDB have list-like and dict-like behaviors, for the most part. However, not all of the relevant magic methods have been implemented. (E.g. `residue["CA"]` works, but `"CA" in residue` does not.) We could do more to support the list-like and dict-like behaviors, but let's start with __contains__.
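A minimal sketch of the requested behaviour, assuming children are stored in a dict keyed by id alongside an ordered list. SketchEntity and SketchAtom are hypothetical classes written for this example and are not the actual Bio.PDB code.

```python
class SketchAtom:
    """Hypothetical child object; only the id matters here."""

    def __init__(self, id):
        self.id = id


class SketchEntity:
    """Hypothetical container with dict-like lookup and membership tests."""

    def __init__(self, id):
        self.id = id
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        # residue["CA"] style access.
        return self.child_dict[id]

    def __contains__(self, id):
        # "CA" in residue style membership test, by child id.
        return id in self.child_dict


residue = SketchEntity(("", 10, " "))
residue.add(SketchAtom("N"))
residue.add(SketchAtom("CA"))
print("CA" in residue)  # True
print("CB" in residue)  # False
```

Testing membership by id keeps `in` consistent with what `__getitem__` accepts.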
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Apr 25 23:36:04 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 26 Apr 2012 03:36:04 +0000 Subject: [Biopython-dev] [Biopython - Bug #3169] (Closed) to_one_letter_code in Bio.SCOP.Raf is old References: Message-ID: Issue #3169 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 We've committed this fix now: https://github.com/biopython/biopython/pull/35 ---------------------------------------- Bug #3169: to_one_letter_code in Bio.SCOP.Raf is old https://redmine.open-bio.org/issues/3169 Author: Hongbo Zhu Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.56 URL: Hi, The dictionary to_one_letter_code in Bio.SCOP.Raf is a bit old now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some new three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL does not use the table since v1.73. Rather, PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75 "Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55." The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html . I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code '?'. 
Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached in the end). Therefore, I suggest to update the to_one_letter_code dictionary in Bio.SCOP.Raf. Best regards, hongbo zhu to_one_letter_code = { '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K', '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G', '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A', '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F', '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T', '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG', '10C':'C','125':'U','126':'U','127':'U','128':'N', '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A', '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N', '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F', '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X', '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I', '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N', '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N', '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L', '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P', '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X', '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T', '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H', '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A', '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G', '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W', '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X', '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C', '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N', '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C', '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E', '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U', '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C', '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K', '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G', '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A', '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U', '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A', 
'8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F', '9NR':'R','9NV':'V','A ':'A','A1P':'N','A23':'A', 'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A', 'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A', 'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A', 'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X', 'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D', 'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X', 'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G', 'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A', 'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D', 'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A', 'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K', 'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K', 'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R', 'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D', 'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D', 'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D', 'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T', 'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K', 'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A', 'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D', 'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X', 'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y', 'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C', 'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G', 'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X', 'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A', 'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U', 'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W', 'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C ':'C', 'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C', 'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C', 'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C', 'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C', 'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X', 'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C', 'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C', 'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C', 
'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E', 'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X', 'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L', 'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C', 'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U', 'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG', 'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG', 'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E', 'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C', 'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C', 'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C', 'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C', 'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C', 'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S', 'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C', 'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X', 'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C', 'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N', 'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X', 'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A', 'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S', 'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C', 'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C', 'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C', 'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G', 'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A', 'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U', 'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V', 'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N', 'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L', 'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K', 'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T', 'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P', 'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N', 'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T', 'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V', 'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A', 'E ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C', 
'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M', 'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A', 'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N', 'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U', 'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G', 'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F', 'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K', 'G ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G', 'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G', 'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N', 'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X', 'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G', 'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X', 'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G', 'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G', 'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G', 'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C', 'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U', 'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X', 'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H', 'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H', 'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R', 'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A', 'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S', 'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W', 'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P', 'I ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A', 'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG', 'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I', 'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I', 'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K', 'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C', 'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K', 'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K', 'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K', 'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N', 'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L', 'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X', 'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U', 
'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q', 'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X', 'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G', 'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K', 'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G', 'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A', 'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R', 'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K', 'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N', 'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U', 'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG', 'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G', 'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A', 'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L', 'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N', 'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P', 'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G', 'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M', 'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N ':'N', 'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G', 'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N', 'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X', 'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N', 'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L', 'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G', 'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N', 'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y', 'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C', 'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N', 'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C', 'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I', 'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G', 'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R', 'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T', 'OTY':'Y','OXX':'D','P ':'G','P1L':'C','P1P':'N', 'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y', 'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F', 'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F', 'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F', 
'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X', 'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D', 'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X', 'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F', 'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A', 'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F', 'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X', 'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N', 'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C', 'PYY':'N','QLG':'QLG','QUO':'G','R ':'A','R1A':'C', 'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C', 'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N', 'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A', 'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G', 'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C', 'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G', 'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S', 'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S', 'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C', 'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C', 'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C', 'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T', 'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG', 'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X', 'SYS':'C','T ':'T','T11':'F','T23':'T','T2S':'T', 'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T', 'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T', 'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X', 'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N', 'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T', 'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T', 'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G', 'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N', 'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U', 'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W', 'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W', 'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K', 'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W', 'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T', 
'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y', 'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y', 'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N', 'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U ':'U', 'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U', 'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U', 'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N', 'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U', 'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U', 'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U', 'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K', 'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X', 'X ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A', 'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X', 'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N', 'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T', 'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G', 'XX1':'K','XXY':'THG','XYG':'DYG','Y ':'A','YCM':'C', 'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z ':'C', 'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U', 'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' } -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Apr 26 23:59:13 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 27 Apr 2012 03:59:13 +0000 Subject: [Biopython-dev] [Biopython - Bug #3346] (New) patch for legacy parser to support BLASTX 2.2.25+ Message-ID: Issue #3346 has been reported by John Comeau. ---------------------------------------- Bug #3346: patch for legacy parser to support BLASTX 2.2.25+ https://redmine.open-bio.org/issues/3346 Author: John Comeau Status: New Priority: Normal Assignee: Category: Target version: URL: it may also work with 2.2.26+, I have not tested. patched parser passes regression tests as per Peter Cock's instructions. 
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From andrew.sczesnak at med.nyu.edu Fri Apr 27 15:57:19 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 27 Apr 2012 15:57:19 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: <4F91E4CF.8040602@med.nyu.edu> Message-ID: <4F9AFA1F.6030103@med.nyu.edu> Peter, > It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py > and I'm willing to do this myself for MAF (while going over your index work - > something I want to do anyway). The only potential catch is avoiding offset > arithmetic. I have no problem with you doing this if you're willing. It would be great to have some code review of MafIndex as well. Best, Andrew From MatatTHC at gmx.de Sat Apr 28 03:15:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sat, 28 Apr 2012 09:15:35 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Dear developers, I would like to suggest a quick "fix" for the problem. Currently the parser just returns true by default for the circular property. This is a wrong piece of information for all circular sequences. Furthermore, it's not possible to tell whether the parser returned true because that is its default value or because it's really in the data. So I suggest returning None if the parser does not parse the information. What do you think? This should be possible with minimal effort.
The user could then implement a workaround on their own (like using the old parser as fallback, or just searching the first line of t) Regards, Matthias 2012/4/22 Matthias Bernt : > Hi, > > since this bug seems to be of low priority I decided to try my best to > help a bit and search the web a bit. > It seems that the property is stored in PrimarySeq or Seq in bioperl. > See for instance: > > http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm > http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm > > Or also: > http://bugzilla.open-bio.org/show_bug.cgi?id=2578 > > This seems to be realised as boolean variable or function. > > Regards, > Matthias > > 2012/4/4 Matthias Bernt : >> Hi, >> >> are there any news on this? May I help somehow? But I have to admit >> that I barely speak perl and have no experience with bioperl. If >> someone tells me where to look I might still try it. >> >> Matthias >> >> 2012/3/29 Peter Cock : >>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >>>> Hi, >>>> >>>> Is it possible to get the property if a genome is circular / linear >>>> from SeqIO applied to genbank files? I could not find it. >>>> >>>> There is also a related bugreport: >>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >>>> >>>> I used the old parser before and switched to SeqIO which I really like >>>> for the possibilities to parse different formats... but I really need >>>> the information. >>> >>> Does anyone happen to have a BioPerl + BioSQL setup installed >>> and working? IIRC checking that to make sure however we >>> store the circular was compatible was the only real hurdle.
>>> >>> Peter From w.arindrarto at gmail.com Sat Apr 28 08:08:35 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 28 Apr 2012 14:08:35 +0200 Subject: [Biopython-dev] Google Summer of Code Project: SearchIO in Biopython Message-ID: Hello everyone, This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of Code students who will work on Biopython over this summer. I will be working with Peter to add support for parsing search outputs from programs like BLAST and HMMER to Biopython, so that it's easier to extract information from their outputs. Having used some of these programs quite a lot myself, I'm really looking forward to implementing the feature. However, I do understand that it won't be just me who will use the module, but also many other Biopython users. So for everyone who is interested in having a say or offering input or critiques along the way, feel free to do so :). The official coding period starts in about a month from now. Until then, I will be doing all the preparatory work required so that coding will proceed as smoothly as possible. This will include preparing the test cases and preparing the SearchIO attribute / object naming convention as well as discussing anything related to its proposed implementation. Finally, here are some links related to the project that might interest you. 1. My main biopython branch for development: https://github.com/bow/biopython/tree/searchio. Since I will be building on top of Peter's SearchIO branch ( https://github.com/peterjc/biopython/tree/search-io-test), right now it only contains Peter's branch rebased against the latest master. 2. My GSoC proposal, which outlines my plans and timeline for the project: http://bit.ly/searchio-proposal 3. The proposed SearchIO naming convention (not 100% complete as of now, but will be filled along the way): http://bit.ly/searchio-terms.
One of the main goals of the project is to implement a common interface for BLAST et al, which requires SearchIO to have common attribute names that refer to different search output attributes. The link contains my proposed naming convention, which is still very open to change and discussion. Feel free to comment on the document and add your own ideas. 4. My blog, in which I will write weekly posts about the project's progress: http://bow.web.id/blog 5. An extra repo for all other auxiliary files and scripts that don't go into Biopython's code: https://github.com/bow/gsoc. That's it for now. Thanks for taking the time to read it :). I'm looking forward to a productive summer with Biopython. Have a nice weekend, Bow From p.j.a.cock at googlemail.com Sun Apr 29 07:00:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 29 Apr 2012 12:00:42 +0100 Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython In-Reply-To: References: Message-ID: Hi Bow, Thanks for updating the list. I'm replying just on the dev list as I'm focusing on implementation discussion in this reply. On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto wrote: > 1. My main biopython branch for development: > https://github.com/bow/biopython/tree/searchio. Since I will be building on > top of Peter's SearchIO branch ( > https://github.com/peterjc/biopython/tree/search-io-test), right now it > only contains Peter's branch rebased against the latest master. Just to be clear - you don't have to start from that branch ;) http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html As I said before, that may not be the best approach. The idea behind that code was to focus on the HSPs (in BLAST terms), and for the low level parsers to iterate over each HSP. Higher level wrappers can then batch these up by query/subject, or into the larger grouping of all the results for one query - which was the exposed high level Bio.SearchIO.parse function.
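As a rough illustration of the "low-level HSP iterator plus higher-level batching" idea just described, here is a small sketch. It is not code from the search-io-test branch: the function names iter_hsps and batch_by_query are hypothetical, and it assumes the default 12-column BLAST tabular layout (query id first, subject id second, e-value eleventh) with all HSPs for one query on consecutive lines.

```python
import io
from itertools import groupby
from operator import itemgetter

# Two queries; q1 has two HSPs against different subjects.
SAMPLE = (
    "q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "q1\ts2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t100\n"
    "q2\ts1\t95.0\t50\t2\t0\t1\t50\t1\t50\t1e-20\t150\n"
)


def iter_hsps(handle):
    """Low-level parser: yield one (query, subject, evalue) tuple per HSP line."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        yield fields[0], fields[1], float(fields[10])


def batch_by_query(hsps):
    """Higher-level wrapper: group consecutive HSPs that share a query id."""
    for query, group in groupby(hsps, key=itemgetter(0)):
        yield query, list(group)


for query, hsps in batch_by_query(iter_hsps(io.StringIO(SAMPLE))):
    print(query, len(hsps))
```

The same wrapper could group by (query, subject) instead by changing the key to itemgetter(0, 1), which corresponds to the query+match grouping discussed further down.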
That branch introduced a SearchResult object which was essentially something like a list or dict (like an OrderedDict in some ways), with some (unnecessary?) error checking for consistent contents (all from the same query). It also introduced a TopMatches object which was essentially a list (again, with some error checking for consistent contents). The advantage of using simple objects (OrderedDict and list) is simplicity and hopefully performance. But specific classes have the advantage of allowing more user friendly str/repr etc. The idea on this branch of focusing on iteration over the HSPs at the low level was it allowed a lot of flexibility, and the low level parser could be used in conjunction with indexing to seek to a particular HSP and parse it, or go to the results for a particular query+match and parse its HSPs (not implemented on my old branch, but that was the plan). However, while this makes perfect sense for say the BLAST tabular output, it isn't quite such a good match for all the possible datatypes. For instance, BLAST plain text/html includes an e-value for a query/subject combination which is calculated from all the HSPs for that query/subject (taking into account order etc - I'd have to check the O'Reilly BLAST book for the details). This isn't in the tabular output, but the point is that it isn't a property of the individual HSPs, but of the match (group of HSPs). I think we need to consider the other main formats, and if all their important information lies at the HSP level or not. Perhaps iteration at the query+match level (groups of HSPs) would be best overall? Bow - If some of that doesn't make sense, I can try to clarify by email on the list, and/or we can talk about it at our next video chat.
Also see if you can get the BLAST book from your library - it will probably be quite useful in this project even though it describes the 'legacy' BLAST suite: "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell Publisher: O'Reilly Media, Released: July 2003 Regards, Peter From w.arindrarto at gmail.com Sun Apr 29 12:42:14 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 29 Apr 2012 18:42:14 +0200 Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython In-Reply-To: References: Message-ID: On Sun, Apr 29, 2012 at 13:00, Peter Cock wrote: > > Hi Bow, > > Thanks for updating the list. I'm replying just on the dev list > as I'm focusing on implementation discussion in this reply. > > On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto > wrote: > > 1. My main biopython branch for development: > > https://github.com/bow/biopython/tree/searchio. Since I will be building > > on > > top of Peter's SearchIO branch ( > > https://github.com/peterjc/biopython/tree/search-io-test), right now it > > only contains Peter's branch rebased against the latest master. > > Just to be clear - you don't have to start from that branch ;) > http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html Ok :). I wasn't so sure about how much code from your previous branch that I will end up using, so I decided to rebase everything and then see later how much of it can be used. But it's also easier to start clean :). > As I said before, that may not be the best approach. The idea > behind that code was to focus on the HSPs (in BLAST terms), > and for the low level parsers to iterate over each HSP. Higher > level wrappers can then batch these up by query/subject, or > into the larger grouping of all the results for one query - > which was the exposed high level Bio.SearchIO.parse > function. 
> > That branch introduced a SearchResult object which was > essentially something like a list or dict (like an OrderedDict > in some ways), with some (unnecessary?) error checking for > consistent contents (all from the same query). It also introduced > a TopMatches object which was essentially list list (again, > with some error checking for consistent contents). > > The advantage of using simple objects (OrderedDict > and list) is simplicity and hopefully performance. But > specific classes have the advantage of allowing more > user friendly str/repr etc. > > The idea on this branch of focusing on iteration over the > HSPs at the low level was it allowed a lot of flexibility, and > the low level parser could be used in conjunction with > indexing to see to a particular HSP and parse it, or goto > the results for a particular query+match and parse its > HSPs (not implemented on my old branch, but that was > the plan). > > However, while this makes perfect sense for say the BLAST > tabular output, it isn't quite such a good match for all the > possible datatypes. > > For instance, BLAST plain text/html includes an e-value for > a query/subject combination which is calculated from all the > HSPs for that query/subject (taking into account order etc - > I'd have to check the O'Reilly BLAST book for the details). > This isn't in the tabular output, but the point is that it isn't a > property of the individual HSPs, but of the match (group of > HSPs). > > I think we need to consider the other main formats, and if > all their important information lies at the HSP level or not. > Perhaps iteration at the query+match level (groups of > HSPs) would be best overall? > > Bow - If some of that doesn't make sense, I can try to clarify > by email on the list, and/or we can talk about it at our next > video chat.
Also see if you can get the BLAST book from > your library - it will probably be quite useful in this project > even though it describes the 'legacy' BLAST suite: > > "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell > Publisher: O'Reilly Media, Released: July 2003 > > Regards, > > Peter I think I got the gist of it (please correct me if I'm wrong). Some information about the search, such as the sequence-wide e-value, may not be present at the HSP level. Ignoring it could let us focus on a perhaps simpler and more flexible implementation with better performance, but at the cost of usefulness of the data itself since we are throwing away information. What I have in mind now is actually closer to iteration on the query+subject level. To be clear first, the hierarchy of the objects that I propose is this:
* Search object, to represent the entire search session.
* Result object, to represent a search with one query against the database. Depending on the number of queries, we could have one to several Result objects contained in a Search.
* Hit object, to represent a sequence hit. Depending on the search, we could also have multiple Hits in one Result object.
* and finally, HSP object, to represent individual alignments.
Iteration is done on the Results level, so the information is parsed on the search query level, not just on single HSPs (I wrote a very short description about what I'm planning the objects to be in here as well: http://bit.ly/searchio-terms). I suppose if we aim for maximum information parsing over performance and simplicity of the format-specific parsers, this is the way to go. There are other formats, too, that contain sequence-level search information not present in the alignment (e.g. HMMER text output). What do you think about this? Thanks for the BLAST book suggestion. I'll see if I can find it in my library in the meantime.
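The Search / Result / Hit / HSP hierarchy proposed above could be sketched roughly as follows. This is only an illustration of the nesting, not the eventual SearchIO design: the shared _Container base class, the constructor signatures, and the attribute names (evalue, bitscore, query, format) are all assumptions made for the example.

```python
class HSP(object):
    """One individual alignment (hypothetical attributes)."""

    def __init__(self, evalue, bitscore):
        self.evalue = evalue
        self.bitscore = bitscore


class _Container(object):
    """Shared list-like behaviour for the three container levels."""

    def __init__(self, items=None):
        self._items = list(items or [])

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


class Hit(_Container):
    """One database sequence hit, holding its HSPs."""

    def __init__(self, id, hsps=None, evalue=None):
        _Container.__init__(self, hsps)
        self.id = id
        # Hit-level value (e.g. a combined e-value) that is not an HSP property
        self.evalue = evalue


class Result(_Container):
    """All hits for a single query."""

    def __init__(self, query, hits=None):
        _Container.__init__(self, hits)
        self.query = query


class Search(_Container):
    """The whole search session, holding one Result per query."""

    def __init__(self, results=None, format=None):
        _Container.__init__(self, results)
        self.format = format


hit = Hit("subject1", [HSP(1e-30, 200.0), HSP(1e-5, 50.0)], evalue=1e-32)
search = Search([Result("query1", [hit])], format="blast-tab")
for result in search:
    print(result.query, len(result))
```

Keeping a hit-level evalue separate from the per-HSP values is the point of iterating above the HSP level: that number has a place to live even when it cannot be derived from any single HSP.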
regards, Bow From p.j.a.cock at googlemail.com Mon Apr 30 05:49:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 30 Apr 2012 10:49:27 +0100 Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython In-Reply-To: References: Message-ID: On Sun, Apr 29, 2012 at 5:42 PM, Wibowo Arindrarto wrote: > > I think I got the gist of it (please correct me if I'm wrong). Some > information about the search, such as the sequence-wide e-value, may > not be present in the HSP level. Ignoring them could let us focus on a > perhaps simpler and more flexible implementation with better > performance, but at the cost of usefulness of the data itself since we > are throwing away information. Yes. > What I have in mind now is actually closer to iteration on the > query+subject level. To be clear first, the hierarchy of the objects > that I propose is this: > > * Search object, to represent the entire search session. > * Result object, to represent a search with one query against the > database. Depending on the number of queries, we could have one to > several Result objects contained in a Search. > * Hit object, to represent a sequence hit. Depending on the search, we > could also have multiple Hits in one Result object. > * and finally, HSP object, to represent individual alignments. > > Iteration is done on the Results level, so the information is parsed > on the search query level, not just a single HSPs (I wrote a very > short description about what I'm planning the objects to be in here as > well: http://bit.ly/searchio-terms). I suppose if we aim for maximum > information parsing over performance and simplicity of the > format-specific parsers, this is the way to go. There are other > formats, too, that contains sequence-level search information not > present in the alignment (e.g. HMMER text output). What do you think > about this? That sounds good.
If iteration is done on the Results level, when/how would your
Search object be used?

Peter

From w.arindrarto at gmail.com Mon Apr 30 06:08:52 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 30 Apr 2012 12:08:52 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

>> What I have in mind now is actually closer to iteration on the
>> query+subject level. To be clear first, the hierarchy of the objects
>> that I propose is this:
>>
>> * Search object, to represent the entire search session.
>> * Result object, to represent a search with one query against the
>> database. Depending on the number of queries, we could have one to
>> several Result objects contained in a Search.
>> * Hit object, to represent a sequence hit. Depending on the search, we
>> could also have multiple Hits in one Result object.
>> * and finally, HSP object, to represent individual alignments.
>>
>> Iteration is done on the Results level, so the information is parsed
>> on the search query level, not just a single HSPs (I wrote a very
>> short description about what I'm planning the objects to be in here as
>> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
>> information parsing over performance and simplicity of the
>> format-specific parsers, this is the way to go. There are other
>> formats, too, that contains sequence-level search information not
>> present in the alignment (e.g. HMMER text output). What do you think
>> about this?
>
> That sounds good.
>
> If iteration is done on the Results level, when/how would your
> Search object be used?
>
> Peter

I'm thinking of using the Search object as the object returned by
SearchIO.parse or SearchIO.read. That way, we can store attributes
common to the different search queries in it.
For example:

>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>> search.format
'blast-xml'
>>> search.algorithm
'blastx'
>>> search.version
'2.2.26+'
>>> search.database
'refseq_protein'
>>> search.results

And iteration over the results would be done like this (for example):

>>> for result in search.results:
...     print result.query, len(result)

Additionally, we can also define __iter__ and next for Search so we can
just do the following:

>>> for result in search:
...     print result.query, len(result)

What do you think?

Bow

From p.j.a.cock at googlemail.com Mon Apr 30 06:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 11:57:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Apr 30, 2012 at 11:08 AM, Wibowo Arindrarto wrote:
>
> I'm thinking of using the Search object as the object returned by
> SearchIO.parse or SearchIO.read. That way, we can store attributes
> common to the different search queries in it. For example:
>
>>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>>> search.format
> 'blast-xml'
>>>> search.algorithm
> 'blastx'
>>>> search.version
> '2.2.26+'
>>>> search.database
> 'refseq_protein'
>>>> search.results
>
>
> And iteration over the results would be done like this (for example):
>>>> for result in search.results:
> ...     print result.query, len(result)
>
> Additionally, we can also define __iter__ and next for Search so we can
> just do the following:
>>>> for result in search:
> ...     print result.query, len(result)
>
> What do you think?

I think you'll get in a mess with multiple iterators all sharing the
same handle and competing over using it - but maybe I'm not grasping
what you have in mind.
Initially keep it simple: The primary public API would be for result in Bio.SearchIO.parse(...): print result.query, print len(result) where each iteration gives a complete result set for one query. Peter P.S. With SearchIO subject to name space discussions ;)
From chapmanb at 50mail.com Sun Apr 1 19:28:32 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 15:28:32 -0400 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> Message-ID: <87wr5ztfof.fsf@fastmail.fm> Bow; > Thank you for the comments and suggestions.
I've added a little bit > more details to my personal profile and put it up front. My project > details have also been broken down into single weeks. And I've edited > the commenting permission. Thanks for the updates, this is coming along well. My most general suggestion is to spend more time expanding the week-by-week timeline. As an example, take this weekly goal: * Write iterator and random-access parser for EMBOSS water It would be great to see more specific plans for what exactly you deliver and implement during the week. Something like: - Write iterator for EMBOSS water, expanding test suite to ensure produced AlignIO objects are compatible with previous BLAST and HMMER iterators. - Expand index functionality to handle EMBOSS water format for random access. Test edge cases: initial records, final records, empty records. - Document 'water' parsing with a use case emphasizing differences from BLAST and HMMER searching. Peter probably has more specific thoughts on the actual content but it's important to think through things in this manner. This will make it easier to approach weeks during the summer since you'll already have tasks broken down, and will also demonstrate you've thought about potential problems and roadblocks and have solutions to overcome them. > As for my other obligations, I didn't mean to give that impression. I > added a little bite more detail about the project itself, but I'm not > sure about the time that I should write. I estimate that at most, for > each week day, I spend 8 hours doing my Master's project in my lab's > campus. Since the project started, I usually use the remainder of the > time (~6 hours/day) for my own personal programming projects. I plan > to use the personal programming time slot for my GSoC instead, if > accepted. Should I be this thorough in the proposal? This is exactly my worry. You're proposing working two full time jobs all summer long. 
Not to denigrate your work ethic, but 80 hour weeks are hard and leave you no time for important things like having a life outside of work. My suggestion would be to see if you can scale back your Master's commitments for the summer if accepted into GSoC. This would definitely improve your proposal since reviewers will worry about the time commitment. Hope this all helps, Brad From chapmanb at 50mail.com Sun Apr 1 20:30:26 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 16:30:26 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: <4F74855B.9000603@med.nyu.edu> References: <4F74855B.9000603@med.nyu.edu> Message-ID: <87obrbtct9.fsf@fastmail.fm> Andrew; Thanks for putting this together. It looks great, is well integrated with AlignIO and it's awesome to see a test suite. I dug through the code and my small suggestions would be: - Could you refactor some of the larger functions into separate smaller components? A couple of these spread over a ton of lines and it can be a bit difficult to follow the logic throughout: https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172 https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399 As a practical example, here you have a large block which checks the SQLite index matches the MAF file and everything looks okay: https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199 This would be clearer if factored into something like: if os.path.isfile(sqlite_file): try: self._record_count = self._verify_record_count(con) except ... - Would you be able to put together a small example for the Cookbook or Tutorial documentation? This would be a great way to help others get started with the functionality and advertise it. Thanks again for this, Brad > Hi all, > > I would like to start a discussion about what is needed to make the > AlignIO.MafIO parser and indexer ready for the next release. 
If anyone > is unfamiliar with MAF (Multiple Alignment Format), it is the file > format that eukaryote genome-to-genome multiple alignments produced by > multiz are stored in. > > The exact specs are here: > http://genome.ucsc.edu/FAQ/FAQformat.html#format5 > > Some use cases are discussed in this paper, which implements (I believe) > most of the same functionality of the MafIO class in Galaxy: > http://www.ncbi.nlm.nih.gov/pubmed/21775304 > > The branch of my biopython fork that contains the class: > https://github.com/polyatail/biopython/tree/alignio-maf > > The class is implemented as a reader/writer compatible with the AlignIO > API, but implements its own indexer (MafIO.MafIndex) based on > SeqIO.index_db(). At the time, this seemed like the best way to > implement this, as MAF is explicitly designed for genome-to-genome > alignments while other formats are not. If we can assume a MAF file > contains such an alignment, we can index it by genome coordinates and > allow random access to intervals. > > This is especially useful since it is often desirable to retrieve the > spliced multiple alignment of a multi-exonic transcript, which can be > used to determine sequence conservation, construct a phylogenetic tree > for a particular gene, or pull out orthologs of a large number of genes > at once. > > The code consists of the reader, writer, and indexer classes in > AlignIO/MaFIO.py, test files in Tests/MAF, and unit tests specific to > the indexer in Tests/test_MafIO_index.py. I would really appreciate any > feedback and suggestions, and if anyone has an opportunity to use this > feature it would be great to get some feedback on its operation. > > > Thanks! 
> Andrew > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Mon Apr 2 01:40:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 2 Apr 2012 01:40:27 +0000 Subject: [Biopython-dev] [Biopython - Feature #3336] (New) Make Phylo.draw more customizable Message-ID: Issue #3336 has been reported by Eric Talevich. ---------------------------------------- Feature #3336: Make Phylo.draw more customizable https://redmine.open-bio.org/issues/3336 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: On and off the mailing lists, I've received requests to make the plots rendered by Phylo.draw more customizable. For example: http://lists.open-bio.org/pipermail/biopython/2012-March/007851.html Since Phylo.draw is based on matplotlib/pyplot, it should be possible for essentially everything about the plot to be customizable by the user using pyplot's standard mechanisms -- e.g. adjust the font sizes with rcParams["font.size"]. Other requested features: * Accept **kwargs in Phylo.draw, and pass it along to pyplot -- but where? * Format the confidence/support values differently (currently everything is treated as a float), including or perhaps with the addition of arbitrary branch labels (e.g. estimated number of mutations on a branch) * Return a mapping of clade objects to a tuple or dict of pyplot elements (LineCollection, PatchCollection, etc.) ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From arklenna at gmail.com Mon Apr 2 02:10:45 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 1 Apr 2012 22:10:45 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: <87zkavtgcr.fsf@fastmail.fm> References: <87zkavtgcr.fsf@fastmail.fm> Message-ID: Hi Brad, Thank you so much for your suggestions. My initial evaluation of the strengths of existing software has led me to strongly agree with your recommendation to focus on the usability of the API. I submit this draft of my proposal to the dev list for feedback: https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit Lenna On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman wrote: > > Lenna; > Thanks for the introduction and glad to hear about your interest in the > variant project. I'm looking forward to seeing your proposal. > > The workflow for the variant project involves a biologist querying a VCF > or GVF file with variants from an experiment. They should be able to > easily subset and filter by file components: > > - Variant type: Homozygous/Heterozygous variants > - Metrics: depth, strand bias, allele frequency.. > - Variants annotated in coding regions causing amino acid changes > > As well as rapid subsetting by chromosomal region. > > My syggestion would be to leverage external tools as much as possible to > do file manipulation and focus on an API that lets users filter and > extract information pre-contained in the INFO file. > > Hope this is helpful as a place to get started. We can provide > additional feedback once you have your proposal ready. Thanks again, > Brad > >> Hi all, >> >> I realize time is short, but I am still in the planning phase of my >> GSoC proposal! I wanted to take a moment to formally introduce myself >> to the dev list. 
>> >> I am affiliated with Purdue University, located in Indiana, USA and >> best known for engineering (Neil Armstrong is a famous graduate). I >> hold a bachelor of arts in biology from Mount Holyoke College in >> Massachusetts. I have extensive wet lab experience with genetics; I'm >> currently working in a lab genotyping mice (the research is intestinal >> lipid metabolism). In August, I begin a PhD in interdisciplinary life >> science at Purdue, and I anticipate that my research will fall >> somewhere in the field of bioinformatics/computational biology. I hope >> to use biopython extensively! >> >> In my spare time, other than programming, I enjoy ballroom dance, >> science fiction novels, board games, and sailing. >> >> I've been programming for about 6 years and using python for 4; other >> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL >> (primarily MySQL and SQLite), and C++/C. I place a high value on >> object oriented design and execution. >> >> I understand the basics of formal grammar and have some experience >> with lex/flex as well as PLY (python lex/yacc). My work so far with >> biopython has been on the CIF parsing module. One of my primary goals >> for the genomic variants project would be to implement as much >> polymorphism and abstraction as possible, for the benefit of both >> users and future developers. >> >> I'm working on a proposal for the genomic variants project, and while >> I understand the basics of molecular biology and genetics, I lack >> firsthand experience with the type of workflow that would occur in the >> context of genomic variants. If anyone can supply a few examples, it >> would be greatly appreciated. >> >> I hope to have a proposal draft ready for feedback by Monday. 
>>
>> Regards,
>>
>> Lenna Peterson
>> github.com/lennax
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From p.j.a.cock at googlemail.com Mon Apr 2 08:26:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 2 Apr 2012 09:26:16 +0100
Subject: [Biopython-dev] MAF Parser/Indexer
In-Reply-To: <87obrbtct9.fsf@fastmail.fm>
References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm>
Message-ID:

On Sun, Apr 1, 2012 at 9:30 PM, Brad Chapman wrote:
>
> Andrew;
> Thanks for putting this together. It looks great, is well integrated
> with AlignIO and it's awesome to see a test suite.

Indeed, +1 on tests :)

Apologies for not replying earlier - this was flagged in my
email client all of last week.

> I dug through the code and my small suggestions would be:
>
> - Could you refactor some of the larger functions into separate smaller
>   components? A couple of these spread over a ton of lines and it can be
>   a bit difficult to follow the logic throughout:
>
> ...
>
>   As a practical example, here you have a large block which checks the
>   SQLite index matches the MAF file and everything looks okay:

Maybe I should do the same with the SeqIO SQLite code.

> - Would you be able to put together a small example for the
>   Cookbook or Tutorial documentation? This would be a great way to help
>   others get started with the functionality and advertise it.

He already has - very organised :)
http://biopython.org/wiki/Multiple_Alignment_Format

Is there any more about reverse complemented sequences
and how they are handled, in simple iterators but more
so when indexing? What I'm getting at here is the non-typical
treatment of start and end being relative to the reverse
complemented sequence for minus strand alignments. Here
most tools/formats always count from the first base on the
forward strand.
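To make the coordinate question concrete, here is a small sketch of the conversion, assuming the UCSC MAF convention that for a "-" strand row the start is counted along the reverse-complemented source sequence and that srcSize is the full length of that source. The function name maf_to_forward_coords is hypothetical and is not part of MafIO.

```python
from typing import Tuple


def maf_to_forward_coords(start: int, size: int, strand: str,
                          src_size: int) -> Tuple[int, int]:
    """Return a 0-based, half-open (start, end) on the forward strand.

    start, size and src_size correspond to the start, size and srcSize
    fields of a MAF "s" line (an assumption about how they were parsed).
    """
    if strand == "+":
        # Plus-strand rows already count from the first forward base.
        return start, start + size
    if strand == "-":
        # Minus-strand rows count from the start of the reverse
        # complement, so flip the interval around the sequence length.
        forward_end = src_size - start
        return forward_end - size, forward_end
    raise ValueError("strand must be '+' or '-', got %r" % strand)


# A 5-base block starting at offset 10 of a 100-base source sequence.
plus_interval = maf_to_forward_coords(10, 5, "+", 100)
minus_interval = maf_to_forward_coords(10, 5, "-", 100)
print(plus_interval, minus_interval)
```

An index keyed on forward-strand intervals like these would let minus-strand rows be looked up with the same coordinates most other tools report, at the cost of converting back when writing MAF.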
Peter From andrew.sczesnak at med.nyu.edu Tue Apr 3 00:15:18 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 02 Apr 2012 20:15:18 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: <87obrbtct9.fsf@fastmail.fm> References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> Message-ID: <4F7A4116.5000602@med.nyu.edu> Hi Brad, Thank you for the feedback. I've tried to work on some of your suggestions and will continue doing so. > - Could you refactor some of the larger functions into separate smaller > components? A couple of these spread over a ton of lines and it can be > a bit difficult to follow the logic throughout: Definitely--I see what you mean. I split __init__ into a couple functions. I'm still worried about the 100 lines of get_spliced(). It's big mostly because I overdid it on the comments, but hopefully that helps explain the logic enough that someone else could work on it without pulling their hair out. > - Would you be able to put together a small example for the > Cookbook or Tutorial documentation? This would be a great way to help > others get started with the functionality and advertise it. Absolutely. I have a few more ideas for cool demos that integrate with other parts of Biopython. What's the best place to put draft text for the tutorial? Thanks, Andrew From andrew.sczesnak at med.nyu.edu Tue Apr 3 00:33:51 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 02 Apr 2012 20:33:51 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> Message-ID: <4F7A456F.3020306@med.nyu.edu> Hi Peter, Thank you for the feedback. I will try to make sure this code is well tested before the next release. > Is there any more about reverse complemented sequences > and how they are handled, for in simple iterators, but more > so when indexing? 
What I'm getting at here is the non-typical > treatment of start and end being relative to the reverse > complemented sequence for minus strand alignments. Here > most tools/formats always count from the first base on the > forward strand. I'm not sure I'm understanding you, but I hope I am. In theory it seems like strandedness would be an issue, however in practice the reference species in a multiz MAF file is always the plus strand. To make sure the user isn't trying to pass a MAF file containing blocks with mixed strands to MafIndex.get_spliced(), there's a check in there to make sure all strands for the reference species are the same. We also assume that coordinates specified in a block are always in the ascending direction (i.e. they are given as 'start' and 'size' and we assume the coordinates are [start, start + size]). There could be an issue, however, if the best alignment for a particular species swaps strands between alignment blocks and/or exons of a transcript. However, it might be safe to say that the user is interested in the best alignment however it occurs, and not necessarily strand consistency. WRT MultipleSeqAlignment objects produced by get_spliced(), all annotation properties are lost upon slicing, so it is up to the user to keep track of what's what. I do remember we had talked about a way to maintain these annotations, even after slicing. Any thoughts? Thanks, Andrew From p.j.a.cock at googlemail.com Tue Apr 3 09:03:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 10:03:55 +0100 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: Message-ID: On Wed, Mar 21, 2012 at 3:27 PM, Peter Cock wrote: > Hello all, > > I'm pleased to see that the GSoC SearchIO project idea I put up > has sparked some interest: > > http://biopython.org/wiki/Google_Summer_of_Code > > ... Just a reminder that the GSoC application deadline is this Friday, 6 April. 
The application website has been open since 26 March, so I would encourage you to upload your current proposal soon in case there are server load problems on the last day (you will still be able to revise the proposal after uploading it). http://www.google-melange.com/gsoc/homepage/google/gsoc2012 Also, in particular for those of you interested in the SearchIO project which I would mentor, I will be away Thursday 5 and Friday 6 April, so you will not be able to ask me for any last minute feedback. Good luck, Peter From chapmanb at 50mail.com Tue Apr 3 13:06:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 03 Apr 2012 09:06:36 -0400 Subject: [Biopython-dev] MAF Parser/Indexer In-Reply-To: <4F7A4116.5000602@med.nyu.edu> References: <4F74855B.9000603@med.nyu.edu> <87obrbtct9.fsf@fastmail.fm> <4F7A4116.5000602@med.nyu.edu> Message-ID: <87hax1hsmb.fsf@fastmail.fm> Andrew; > Definitely--I see what you mean. I split __init__ into a couple > functions. I'm still worried about the 100 lines of get_spliced(). It's > big mostly because I overdid it on the comments, but hopefully that > helps explain the logic enough that someone else could work on it > without pulling their hair out. Definitely agreed. It's well-commented which makes it much easier for others to dig in. Thanks for taking a look at the refactoring. > Absolutely. I have a few more ideas for cool demos that integrate with > other parts of Biopython. What's the best place to put draft text for > the tutorial? Apologies that I'd totally missed your cookbook entry. That looks great, but more documentation is always better. If you are okay with LaTeX, the Tutorial is in Doc/Tutorial.tex so you can edit directly. The wiki is also a good place for docs if you prefer to go that way. Thanks again for all the work on this. 
Looking forward to having it in, Brad From chapmanb at 50mail.com Tue Apr 3 14:53:33 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 03 Apr 2012 10:53:33 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: References: <87zkavtgcr.fsf@fastmail.fm> Message-ID: <87r4w4hno2.fsf@fastmail.fm> Lenna; Thanks for getting this together, that's a great start. I left some specific comments but my general suggestion is to get more detailed about the code specifics. During the summer, you use the weekly timeline as a todo list so having lots of details make the process so much easier. Instead of seeing a general item like: "Implement X" you want "Implement X by extending API from last week to support get_Y using sqlite3 index table. Test cases A, B, C and D to avoid...". Having these kind of checklist todos helps make it easy to get started each week and ensure everything is on track. The additional benefit for selection is that is helps convince reviewers you've thought about the technical details and forseen any potential problems. Hope this helps, Brad > Hi Brad, > > Thank you so much for your suggestions. My initial evaluation of the > strengths of existing software has led me to strongly agree with your > recommendation to focus on the usability of the API. > > I submit this draft of my proposal to the dev list for feedback: > > https://docs.google.com/document/d/116FDQLtNnYWnm0kojad4YmQrM3cjOO8D2Vr82aW6xyA/edit > > > Lenna > > > On Sun, Apr 1, 2012 at 3:13 PM, Brad Chapman wrote: > > > > Lenna; > > Thanks for the introduction and glad to hear about your interest in the > > variant project. I'm looking forward to seeing your proposal. > > > > The workflow for the variant project involves a biologist querying a VCF > > or GVF file with variants from an experiment. 
They should be able to > > easily subset and filter by file components: > > > > - Variant type: Homozygous/Heterozygous variants > > - Metrics: depth, strand bias, allele frequency.. > > - Variants annotated in coding regions causing amino acid changes > > > > As well as rapid subsetting by chromosomal region. > > > > My syggestion would be to leverage external tools as much as possible to > > do file manipulation and focus on an API that lets users filter and > > extract information pre-contained in the INFO file. > > > > Hope this is helpful as a place to get started. We can provide > > additional feedback once you have your proposal ready. Thanks again, > > Brad > > > >> Hi all, > >> > >> I realize time is short, but I am still in the planning phase of my > >> GSoC proposal! I wanted to take a moment to formally introduce myself > >> to the dev list. > >> > >> I am affiliated with Purdue University, located in Indiana, USA and > >> best known for engineering (Neil Armstrong is a famous graduate). I > >> hold a bachelor of arts in biology from Mount Holyoke College in > >> Massachusetts. I have extensive wet lab experience with genetics; I'm > >> currently working in a lab genotyping mice (the research is intestinal > >> lipid metabolism). In August, I begin a PhD in interdisciplinary life > >> science at Purdue, and I anticipate that my research will fall > >> somewhere in the field of bioinformatics/computational biology. I hope > >> to use biopython extensively! > >> > >> In my spare time, other than programming, I enjoy ballroom dance, > >> science fiction novels, board games, and sailing. > >> > >> I've been programming for about 6 years and using python for 4; other > >> languages with which I'm familiar include Perl/CGI, HTML/CSS, PHP, SQL > >> (primarily MySQL and SQLite), and C++/C. I place a high value on > >> object oriented design and execution. 
> >> > >> I understand the basics of formal grammar and have some experience > >> with lex/flex as well as PLY (python lex/yacc). My work so far with > >> biopython has been on the CIF parsing module. One of my primary goals > >> for the genomic variants project would be to implement as much > >> polymorphism and abstraction as possible, for the benefit of both > >> users and future developers. > >> > >> I'm working on a proposal for the genomic variants project, and while > >> I understand the basics of molecular biology and genetics, I lack > >> firsthand experience with the type of workflow that would occur in the > >> context of genomic variants. If anyone can supply a few examples, it > >> would be greatly appreciated. > >> > >> I hope to have a proposal draft ready for feedback by Monday. > >> > >> Regards, > >> > >> Lenna Peterson > >> github.com/lennax > >> _______________________________________________ > >> Biopython-dev mailing list > >> Biopython-dev at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython-dev From w.arindrarto at gmail.com Tue Apr 3 15:22:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 3 Apr 2012 17:22:04 +0200 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: <87wr5ztfof.fsf@fastmail.fm> References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> <87wr5ztfof.fsf@fastmail.fm> Message-ID: On Sun, Apr 1, 2012 at 21:28, Brad Chapman wrote: > > Bow; > >> Thank you for the comments and suggestions. I've added a little bit >> more details to my personal profile and put it up front. My project >> details have also been broken down into single weeks. And I've edited >> the commenting permission. > > Thanks for the updates, this is coming along well. My most general > suggestion is to spend more time expanding the week-by-week > timeline. 
As an example, take this weekly goal:
>
> * Write iterator and random-access parser for EMBOSS water
>
> It would be great to see more specific plans for what exactly you
> deliver and implement during the week. Something like:
>
> - Write iterator for EMBOSS water, expanding test suite to ensure
>   produced AlignIO objects are compatible with previous BLAST and HMMER
>   iterators.
>
> - Expand index functionality to handle EMBOSS water format for random
>   access. Test edge cases: initial records, final records, empty
>   records.
>
> - Document 'water' parsing with a use case emphasizing differences from
>   BLAST and HMMER searching.
>
> Peter probably has more specific thoughts on the actual content but it's
> important to think through things in this manner. This will make it
> easier to approach weeks during the summer since you'll already have
> tasks broken down, and will also demonstrate you've thought about
> potential problems and roadblocks and have solutions to overcome them.

Thanks for the further feedback, Brad. I am in the process of adding more
detailed descriptions of my weekly tasks.

>> As for my other obligations, I didn't mean to give that impression. I
>> added a little bit more detail about the project itself, but I'm not
>> sure about the time that I should write. I estimate that at most, for
>> each week day, I spend 8 hours doing my Master's project in my lab's
>> campus. Since the project started, I usually use the remainder of the
>> time (~6 hours/day) for my own personal programming projects. I plan
>> to use the personal programming time slot for my GSoC instead, if
>> accepted. Should I be this thorough in the proposal?
>
> This is exactly my worry. You're proposing working two full time jobs
> all summer long. Not to denigrate your work ethic, but 80 hour weeks are
> hard and leave you no time for important things like having a life
> outside of work.
My suggestion would be to see if you can scale back > your Master's commitments for the summer if accepted into GSoC. This > would definitely improve your proposal since reviewers will worry about > the time commitment. > > Hope this all helps, > Brad Ah, that's ok, I understand your concern :). I talked with my supervisor yesterday regarding this and he understood that I can scale back the time spent for my current project if accepted. I've revised this detail as well in the proposal. Thanks again, Bow From p.j.a.cock at googlemail.com Tue Apr 3 15:32:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 16:32:08 +0100 Subject: [Biopython-dev] GSoC Student Applicant In-Reply-To: References: <874ntgtca7.fsf@fastmail.fm> <87r4wa6fxx.fsf@fastmail.fm> <87wr5ztfof.fsf@fastmail.fm> Message-ID: On Tue, Apr 3, 2012 at 4:22 PM, Wibowo Arindrarto wrote: > On Sun, Apr 1, 2012 at 21:28, Brad Chapman wrote: >> >> This is exactly my worry. You're proposing working two full time jobs >> all summer long. Not to denigrate your work ethic, but 80 hour weeks are >> hard and leave you no time for important things like having a life >> outside of work. My suggestion would be to see if you can scale back >> your Master's commitments for the summer if accepted into GSoC. This >> would definitely improve your proposal since reviewers will worry about >> the time commitment. >> >> Hope this all helps, >> Brad > > Ah, that's ok, I understand your concern :). I talked with my > supervisor yesterday regarding this and he understood that I can scale > back the time spent for my current project if accepted. I've revised > this detail as well in the proposal. > > Thanks again, > Bow Excellent - I'm pleased your supervisor is being supportive. 
That should help address this concern :) Peter From mjldehoon at yahoo.com Tue Apr 3 18:27:26 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 3 Apr 2012 11:27:26 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: Message-ID: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> While I think that the SearchIO module is a good idea, you may want to consider choosing a different name for this module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, roughly speaking the class definitions are in the former and the parser is in the latter module. I don't quite understand why these two are separated into distinct modules, as to me conceptually the two belong together. Bio.SearchIO in my understanding will combine both the parsers and the class definitions, which is a good thing, but then I would prefer a name without "IO" in it. Best, -Michiel. --- On Tue, 4/3/12, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] GSoC SearchIO project > To: "Biopython-Dev Mailing List" > Date: Tuesday, April 3, 2012, 5:03 AM > On Wed, Mar 21, 2012 at 3:27 PM, > Peter Cock > wrote: > > Hello all, > > > > I'm pleased to see that the GSoC SearchIO project idea > I put up > > has sparked some interest: > > > > http://biopython.org/wiki/Google_Summer_of_Code > > > > ... > > Just a reminder that the GSoC application deadline is this > Friday, > 6 April. The application website has been open since 26 > March, > so I would encourage you to upload your current proposal > soon > in case there are server load problems on the last day (you > will > still be able to revise the proposal after uploading it). > http://www.google-melange.com/gsoc/homepage/google/gsoc2012 > > Also, in particular for those of you interested in the > SearchIO > project which I would mentor, I will be away Thursday 5 and > Friday 6 April, so you will not be able to ask me for any > last > minute feedback. 
> > Good luck, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Apr 3 19:44:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 20:44:48 +0100 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> References: <1333477646.41091.YahooMailClassic@web161201.mail.bf1.yahoo.com> Message-ID: On Tue, Apr 3, 2012 at 7:27 PM, Michiel de Hoon wrote: > While I think that the SearchIO module is a good idea, you > may want to consider choosing a different name for this > module. For Bio.Seq/Bio.SeqIO and Bio.Align/Bio.AlignIO, > roughly speaking the class definitions are in the former and > the parser is in the latter module. I don't quite understand > why these two are separated into distinct modules, as to > me conceptually the two belong together. Bio.SearchIO in > my understanding will combine both the parsers and the > class definitions, which is a good thing, but then I would > prefer a name without "IO" in it. > > Best, > -Michiel. Yes, I was thinking to have both the parsers and the new objects under the name module namespace. The reason for using SearchIO (despite not being PEP8 compatible - something I regret in the naming of SeqIO and the pattern it set) is to match SeqIO and AlignIO and BioPerl. Anyone familiar with BioPerl will immediately see what it is for - and some of the student applicants have already used BioPerl's SearchIO. Personally I find this quite a compelling argument. That said, the name SearchIO isn't the clearest in the the world for a newcomer - however I haven't come up with anything significantly better myself. Perhaps there is a better name out there, which would justify breaking the pattern? I've considered pairwise and palign, but neither feels right. 
Given a clean slate (Biopython 2?), then yes, I would agree with consolidating Bio.Align and Bio.AlignIO as one namespace, probable "align" (lower case). The situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO isn't quite so simple - perhaps "seq" (lower case)? Then (in the absence of any other ideas), SearchIO would become "search" (lower case). Peter From redmine at redmine.open-bio.org Tue Apr 3 21:13:13 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 3 Apr 2012 21:13:13 +0000 Subject: [Biopython-dev] [Biopython - Bug #3337] (New) 'Bio.trie.trie' is not picklable Message-ID: Issue #3337 has been reported by Sergei Lebedev. ---------------------------------------- Bug #3337: 'Bio.trie.trie' is not picklable https://redmine.open-bio.org/issues/3337 Author: Sergei Lebedev Status: New Priority: Normal Assignee: Category: Target version: URL: Is there any reason for this, or nobody just had the need (or time) to implement pickle interface? ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Wed Apr 4 08:46:47 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Wed, 4 Apr 2012 10:46:47 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Hi, are there any news on this? May I help somehow? But I have to admit that I barely speak perl and have no experience with bioperl. If someone tells me where to look I might still try it. Matthias 2012/3/29 Peter Cock : > On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >> Hi, >> >> Is it possible to get the property if a genome is circular / linear >> from SeqIO applied to genbank files? I could not find it. 
>> >> There is also a related bugreport: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >> >> I used the old parser before and switched to SeqIO which I really like >> for the possibilities to parse different formats... but I really need >> the information. > > Does anyone happen to have a BioPerl + BioSQL setup installed > and working? IIRC checking that to make sure however we > store the circular was compatible was the only real hurdle. > > Peter From arklenna at gmail.com Thu Apr 5 00:04:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 4 Apr 2012 20:04:30 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: <87r4w4hno2.fsf@fastmail.fm> References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> Message-ID: On Tue, Apr 3, 2012 at 10:53 AM, Brad Chapman wrote: > > Lenna; > Thanks for getting this together, that's a great start. I left some > specific comments but my general suggestion is to get more detailed > about the code specifics. During the summer, you use the weekly timeline > as a todo list so having lots of details make the process so much > easier. Instead of seeing a general item like: "Implement X" you want > "Implement X by extending API from last week to support get_Y using > sqlite3 index table. Test cases A, B, C and D to avoid...". > > Having these kind of checklist todos helps make it easy to get started > each week and ensure everything is on track. The additional benefit for > selection is that is helps convince reviewers you've thought about the > technical details and forseen any potential problems. > > Hope this helps, > Brad > Hi all, I'm linking to a revision of my GSoC proposal: https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit Thank you to everyone for your feedback. Peter, I didn't realize Biopython has never been tested on IronPython. As I have no familiarity with .NET or Windows, I'll have to rescind my offer to test it. Sorry to get your hopes up! 
Reece, I've revised the prose sections and almost completely rewritten the timeline. This version provides more information about my background, a more detailed description of the overall project, and more specific goals. Brad, I've tried to go into as much detail as my knowledge of VCF and GVF structure allows. I laid out a more specific structure for both the backend and frontend structures for the data. I've revised the unit tests to be more specific and less dependent on interaction with other modules and I've tried to anticipate some cases that may produce unexpected behavior. I also highlighted specific places where the design should be generalizable. James, I hope my revised project description is more focused. Regarding CNV etc., I did not mean to specifically exclude them by mentioning SNPs, and I've reworded that paragraph to be more general. I get the impression that CNV and other structural variants are considerably more complex to represent and manipulate. I'd be more than happy to read more about breakpoint theory etc. and to prototype any specific workflows you might suggest. Lenna From eric.talevich at gmail.com Thu Apr 5 02:53:10 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 4 Apr 2012 22:53:10 -0400 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices Message-ID: Hi all, I'm considering some enhancements to the Phylo.draw function to make it more customizable for power users. Since the function is based on matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the user; however, I'm not fully versed in what pyplot is capable of. Relevant feature request in Redmine: https://redmine.open-bio.org/issues/3336 Ideas: 1. Make the draw function return a mapping of clades to a collection of pyplot graphical elements -- the objects emitted by pyplot during each step of rendering the plot. 
Each clade in the tree is mapped to a horizontal line, a vertical line, a text label (taxon name, normally), and another text label for the branch (confidence/support, normally). The user can then set the attributes of these objects as they wish, minimizing the need for further extensions to Phylo.draw.

Example:

{: { "hline": , "vline": , "taxon_label": , "branch_label": }, ...

If the user needs access to the figure or axis object as well, it's already easy enough to create these beforehand and pass the 'axis' object to Phylo.draw.

2. Add an argument 'branch_labels' to Phylo.draw. This will accept either (a) a dict which maps the tree's Clade objects to string labels, or (b) a function which accepts a Clade object and returns a string. Default: a function that formats the clade's 'confidence' or 'confidences' attribute, matching the current behavior.

Examples:

>>> draw(mytree, branch_labels={mytree.root: "Root", ...})
>>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence)
>>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank)

3. Accept **kwargs in Phylo.draw; pass it right along to pyplot at some point. Question: What basic pyplot function accepts **kwargs? pyplot.figure and pyplot.set_subplot don't seem appropriate. An alternative is to use pyplot.rcParams, either leaving it all to the user or treating the **kwargs keys as the corresponding entries in rcParams. Syntax gets a little tricky. (Not a top priority for me, actually, since rcParams works.)

Thoughts? All clear?
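[Editor's sketch] Idea 2 above (a 'branch_labels' argument that is either a dict or a function) can be reduced to a single lookup function before any drawing happens. The following is only a minimal illustration under stated assumptions, not the code in Phylo.draw: the names make_label_func and FakeClade are hypothetical, and the default formatting of 'confidence' is a guess at "the current behavior".

```python
def make_label_func(branch_labels=None):
    """Return a function that maps a clade to its branch label string.

    Hypothetical helper: accepts None, a dict of clade -> label,
    or a callable taking a clade and returning a string.
    """
    if branch_labels is None:
        # Assumed default: show the clade's confidence if it has one
        def default_label(clade):
            conf = getattr(clade, "confidence", None)
            return "" if conf is None else str(conf)
        return default_label
    if isinstance(branch_labels, dict):
        # Clades missing from the dict simply get an empty label
        return lambda clade: branch_labels.get(clade, "")
    if callable(branch_labels):
        return branch_labels
    raise TypeError("branch_labels must be a dict or a callable")


class FakeClade(object):
    """Hypothetical stand-in for a Bio.Phylo clade, for illustration only."""

    def __init__(self, name, confidence=None):
        self.name = name
        self.confidence = confidence


root = FakeClade("root")
inner = FakeClade("inner", confidence=95)
print(make_label_func()(inner))                                  # default
print(make_label_func({root: "Root"})(root))                     # dict form
print(make_label_func(lambda c: "%d" % c.confidence)(inner))     # function form
```

Normalizing both forms to one callable up front means the drawing loop only ever calls a function per clade, whichever form the user passed.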
Thanks, Eric From chapmanb at 50mail.com Thu Apr 5 10:47:09 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 05 Apr 2012 06:47:09 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> Message-ID: <871uo2cv6a.fsf@fastmail.fm> Lenna; > I'm linking to a revision of my GSoC proposal: > > https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit > > Thank you to everyone for your feedback. This is coming along great, thanks for all the work on it. I've added a couple of specific suggestions about iterative parsing, which PyVCF does, and using external tools to make the coding region evaluation work easier. One other practical suggestion: you should add a link to the latest version of your google doc at the top of your proposal on the GSoC Melange site. You won't be able to edit there after Friday but can update your google document in case of reviewer suggestions. Thanks again and best of luck during the review process, Brad > > > Peter, > > I didn't realize Biopython has never been tested on IronPython. As I > have no familiarity with .NET or Windows, I'll have to rescind my > offer to test it. Sorry to get your hopes up! > > > Reece, > > I've revised the prose sections and almost completely rewritten the > timeline. This version provides more information about my background, > a more detailed description of the overall project, and more specific > goals. > > > Brad, > > I've tried to go into as much detail as my knowledge of VCF and GVF > structure allows. I laid out a more specific structure for both the > backend and frontend structures for the data. I've revised the unit > tests to be more specific and less dependent on interaction with other > modules and I've tried to anticipate some cases that may produce > unexpected behavior. I also highlighted specific places where the > design should be generalizable. 
> > > James, > > I hope my revised project description is more focused. Regarding CNV > etc., I did not mean to specifically exclude them by mentioning SNPs, > and I've reworded that paragraph to be more general. I get the > impression that CNV and other structural variants are considerably > more complex to represent and manipulate. I'd be more than happy to > read more about breakpoint theory etc. and to prototype any specific > workflows you might suggest. > > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From arklenna at gmail.com Fri Apr 6 02:50:52 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 5 Apr 2012 22:50:52 -0400 Subject: [Biopython-dev] GSoC genomic variant proposal In-Reply-To: <871uo2cv6a.fsf@fastmail.fm> References: <87zkavtgcr.fsf@fastmail.fm> <87r4w4hno2.fsf@fastmail.fm> <871uo2cv6a.fsf@fastmail.fm> Message-ID: On Thu, Apr 5, 2012 at 6:47 AM, Brad Chapman wrote: > > Lenna; > >> I'm linking to a revision of my GSoC proposal: >> >> https://docs.google.com/document/d/1txW9bwMYC6avlJxqs7x-mB-M09gxy2Uc6XqhR04y6ac/edit >> >> Thank you to everyone for your feedback. > > This is coming along great, thanks for all the work on it. I've added a > couple of specific suggestions about iterative parsing, which PyVCF > does, and using external tools to make the coding region evaluation work > easier. > > One other practical suggestion: you should add a link to the latest > version of your google doc at the top of your proposal on the GSoC > Melange site. You won't be able to edit there after Friday but can > update your google document in case of reviewer suggestions. > > Thanks again and best of luck during the review process, > Brad > Brad - Thank you again for your detailed feedback. As per your suggestion, I have updated my proposal on GSoC Melange to include a link to the latest version of my proposal. 
Lenna From mjldehoon at yahoo.com Sat Apr 7 04:43:56 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 6 Apr 2012 21:43:56 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project Message-ID: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> --- On Tue, 4/3/12, Peter Cock wrote: > The reason for using SearchIO (despite not being PEP8 > compatible - something I regret in the naming of SeqIO > and the pattern it set) is to match SeqIO and AlignIO and > BioPerl. Anyone familiar with BioPerl will immediately see > what it is for - and some of the student applicants have > already used BioPerl's SearchIO. Personally I find this > quite a compelling argument. Sorry but I am not convinced. I doubt that somebody familiar with BioPerl's Align and AlignIO modules will have trouble finding the parser in Biopython if in Biopython there is only a Bio.Align module. Also this means that some modules in Biopython are split up in Module and ModuleIO, whereas most others are not. In this particular case, for consistency you would have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a clean module organization in Biopython instead of strictly following what BioPerl did. > That said, the name SearchIO isn't the clearest in the > the world for a newcomer - however I haven't come up > with anything significantly better myself. Perhaps there > is a better name out there, which would justify breaking > the pattern? I've considered pairwise and palign, but > neither feels right. How about including this module as a submodule in Bio.Align? If we think of Bio.Align as a general module for alignments, then pairwise alignments fit in it too. It depends a bit on the exact API, but I expect that we can come up with something elegant. > Given a clean slate (Biopython 2?), then yes, I would > agree with consolidating Bio.Align and Bio.AlignIO as > one namespace, probable "align" (lower case). 
The > situation with Bio.Seq, Bio.SeqRecord and Bio.SeqIO > isn't quite so simple - perhaps "seq" (lower case)? There are two steps here: consolidation of some modules, and changing the names of modules to comply with PEP8. The consolidation can happen without waiting for a Biopython 2, as long as there are clear deprecating warnings in the modules that will be removed. Compliance with PEP8 is a bit trickier, since it means relearning all module names, and some systems (Windows?) may not distinguish between lower and upper case. > Then (in the absence of any other ideas), SearchIO > would become "search" (lower case). If we already know now that we will drop the IO from SearchIO at some point, then SearchIO doesn't seem to be a good name. Best, -Michiel. From eric.talevich at gmail.com Sat Apr 7 16:13:16 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 7 Apr 2012 12:13:16 -0400 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> References: <1333773836.9513.YahooMailClassic@web161204.mail.bf1.yahoo.com> Message-ID: On Sat, Apr 7, 2012 at 12:43 AM, Michiel de Hoon wrote: > --- On Tue, 4/3/12, Peter Cock wrote: > > The reason for using SearchIO (despite not being PEP8 > > compatible - something I regret in the naming of SeqIO > > and the pattern it set) is to match SeqIO and AlignIO and > > BioPerl. Anyone familiar with BioPerl will immediately see > > what it is for - and some of the student applicants have > > already used BioPerl's SearchIO. Personally I find this > > quite a compelling argument. > > Sorry but I am not convinced. I doubt that somebody familiar with > BioPerl's Align and AlignIO modules will have trouble finding the parser in > Biopython if in Biopython there is only a Bio.Align module. Also this means > that some modules in Biopython are split up in Module and ModuleIO, whereas > most others are not. 
In this particular case, for consistency you would > have to create a Bio.Search and a Bio.SearchIO module. I'd rather have a > clean module organization in Biopython instead of strictly following what > BioPerl did. > How about Bio.Search, for now? We had a similar discussion at the end of GSoC 2009, when we decided to merge Tree and TreeIO (names inspired by BioPerl) to create Phylo (because not all trees are phylogenies, although there is also a Perl module called Bio::Phylo). Since the *IO namespaces have only 4 public functions, plus a IO.py module for each supported I/O format, it's not too cluttered. Likewise, at the end of this GSoC it may be more clear whether the new sub-package should have a different name. (SearchIO seems to have been plenty effective at drawing attention to the project.) But in any case, I support putting all the new work under one sub-package, rather than two. > That said, the name SearchIO isn't the clearest in the > > the world for a newcomer - however I haven't come up > > with anything significantly better myself. Perhaps there > > is a better name out there, which would justify breaking > > the pattern? I've considered pairwise and palign, but > > neither feels right. > > How about including this module as a submodule in Bio.Align? If we think > of Bio.Align as a general module for alignments, then pairwise alignments > fit in it too. It depends a bit on the exact API, but I expect that we can > come up with something elegant. > > Does anything in Bio.Align already operate on SeqFeature objects? Given that BLAST or HMMer output could be interpreted as (1) a series of annotated features/regions on target sequences, or (2) a series of pairwise alignments [*], perhaps it would be most effective to support those aspects separately, through (1) Bio.Search or Bio.Feature [**], and (2) Bio.Align or Bio.AlignIO. [*] The multiple sequence alignment produced by HMMer is in a format we already handle (Stockholm). 
Some people want to convert BLAST output to a multiple sequence alignment, too, and while I suppose we could support that in a literal sense, the result would be worse than the output of pretty much any other alignment program so I don't think we should. [**] A Bio.Feature module could involve GFF parsing and the variant parsers, too. It would contain I/O functions that emit SeqFeatures, of course. From redmine at redmine.open-bio.org Sat Apr 7 17:31:37 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 7 Apr 2012 17:31:37 +0000 Subject: [Biopython-dev] [Biopython - Feature #3338] (New) Convert a protein alignment and nucleotide sequences to codon alignment Message-ID: Issue #3338 has been reported by Eric Talevich. ---------------------------------------- Feature #3338: Convert a protein alignment and nucleotide sequences to codon alignment https://redmine.open-bio.org/issues/3338 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: As discussed on the mailing list: http://lists.open-bio.org/pipermail/biopython/2012-April/007913.html This could be implemented in two ways: 1. Wrap PAL2NAL (pal2nal.pl) under Bio.Align.Applications 2. Implement this functionality directly in Python While PAL2NAL has some convenience features like aligning protein sequences to CDS sequences that don't exactly match, it would be straightforward (and simpler for the user, in most cases) to implement a fussier version of it from scratch somewhere in Biopython. So, where would we put this function? Related: * From a codon alignment, it would again be straightforward to calculate dN/dS ratios for pairs of sequences, much like PAML's yn00 (although that program does more stuff, too). Do we want to do that? Where? * Are there ways Biopython could support codon alignments better, as distinct from nucleotide alignments?
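[Editor's sketch] Option 2 in the feature request (a "fussier" pure-Python version) might look roughly like the following. This is a minimal sketch under stated assumptions, not PAL2NAL's algorithm and not an existing Biopython function: the name protein_to_codon_alignment is hypothetical, inputs are plain strings, each CDS is assumed to be ungapped and exactly three times the length of its ungapped protein (no stop codon), and no check is made that the codons actually translate to the aligned residues.

```python
def protein_to_codon_alignment(protein_rows, cds_seqs, gap="-"):
    """Thread ungapped CDS sequences onto gapped protein alignment rows.

    Hypothetical helper: each protein gap becomes three nucleotide gaps,
    and each residue consumes the next codon of the matching CDS.
    """
    if len(protein_rows) != len(cds_seqs):
        raise ValueError("Need exactly one CDS per aligned protein sequence")
    codon_rows = []
    for prot, cds in zip(protein_rows, cds_seqs):
        n_residues = sum(1 for aa in prot if aa != gap)
        # The "fussy" part: refuse anything that is not an exact fit
        if len(cds) != 3 * n_residues:
            raise ValueError("CDS length %d does not match %d residues"
                             % (len(cds), n_residues))
        pos = 0
        codons = []
        for aa in prot:
            if aa == gap:
                codons.append(gap * 3)
            else:
                codons.append(cds[pos:pos + 3])
                pos += 3
        codon_rows.append("".join(codons))
    return codon_rows


for row in protein_to_codon_alignment(["MK-L", "M-QL"],
                                      ["ATGAAACTG", "ATGCAGCTT"]):
    print(row)
```

Raising on any length mismatch keeps the function simple; tolerating inexact matches is the convenience that PAL2NAL adds, per the request above.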
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Sat Apr 7 18:42:02 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 7 Apr 2012 14:42:02 -0400 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich wrote: > Hi all, > > I'm considering some enhancements to the Phylo.draw function to make it > more customizable for power users. Since the function is based on > matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the > user; however, I'm not fully versed in what pyplot is capable of. > > Relevant feature request in Redmine: > https://redmine.open-bio.org/issues/3336 > > Ideas: [...] > 2. Add an argument 'branch_labels' to Phylo.draw. This will accept either > (a) a dict which maps the tree's Clade objects to string labels, or (b) a > function which accepts a Clade object and returns a string. Default: a > function that formats the clade's 'confidence' or 'confidences' attribute, > matching the current behavior. > > Examples: > >>> draw(mytree, branch_labels={mytree.root: "Root", ...}) > >>> draw(mytree, branch_labels=lambda clade: "%d" % clade.confidence) > >>> draw(mytree, branch_labels=lambda clade: clade.taxonomy.rank) > > Just committed this feature: https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d From lgautier at gmail.com Sun Apr 8 17:16:31 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Sun, 08 Apr 2012 19:16:31 +0200 Subject: [Biopython-dev] Sphinx documentation online ? 
Message-ID: <4F81C7EF.7030505@gmail.com> Hi, I have seen emails exchanges and issues on the tracker regarding moving the documentation to Sphinx, but I could not find an instance of the documentation for biopython online (I was looking for one to cross-reference it with documentation I am writing). Is this still work-in-progress, or is there an instance online and I missed it ? Best, Laurent From eric.talevich at gmail.com Sun Apr 8 19:25:00 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 8 Apr 2012 15:25:00 -0400 Subject: [Biopython-dev] Sphinx documentation online ? In-Reply-To: <4F81C7EF.7030505@gmail.com> References: <4F81C7EF.7030505@gmail.com> Message-ID: On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier wrote: > Hi, > > I have seen emails exchanges and issues on the tracker regarding moving > the documentation to Sphinx, but I could not find an instance of the > documentation for biopython online (I was looking for one to > cross-reference it with documentation I am writing). > > Is this still work-in-progress, or is there an instance online and I > missed it ? > > Hi Laurent, I proposed this a while ago and played with Sphinx a little bit, but didn't get very far. We're still using Epydoc for our generated API documentation: http://biopython.org/DIST/docs/api/ I do hope to get back to this at some point, or perhaps assist someone else with migrating Biopython to Sphinx. -Eric From lgautier at gmail.com Sun Apr 8 20:46:45 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Sun, 08 Apr 2012 22:46:45 +0200 Subject: [Biopython-dev] Sphinx documentation online ? 
In-Reply-To: References: <4F81C7EF.7030505@gmail.com> Message-ID: <4F81F935.9030702@gmail.com> On 2012-04-08 21:25, Eric Talevich wrote: > On Sun, Apr 8, 2012 at 1:16 PM, Laurent Gautier > wrote: > > Hi, > > I have seen emails exchanges and issues on the tracker regarding > moving the documentation to Sphinx, but I could not find an > instance of the documentation for biopython online (I was looking > for one to cross-reference it with documentation I am writing). > > Is this still work-in-progress, or is there an instance online and > I missed it ? > > > Hi Laurent, > > I proposed this a while ago and played with Sphinx a little bit, but > didn't get very far. We're still using Epydoc for our generated API > documentation: > http://biopython.org/DIST/docs/api/ > > I do hope to get back to this at some point, or perhaps assist someone > else with migrating Biopython to Sphinx. > > -Eric > > Hi Eric, Thanks for the answer. I did see the Epydoc, but I was after Sphinx to be able to cross-reference documentations (see http://sphinx.pocoo.org/ext/intersphinx.html ). I'll make do with it for the time being. Best, Laurent From eric.talevich at gmail.com Mon Apr 9 18:25:04 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 9 Apr 2012 14:25:04 -0400 Subject: [Biopython-dev] Method to weight sequences in an alignment Message-ID: Folks, I've written a function to weight sequences according to the simple scheme used in PSI-BLAST [*]. It operates on Bio.Align.MultipleSeqAlignment objects or lists of plain strings, and could be added as a method with minimal changes (for Python 2.5 compatibility, mainly). Any interest in adding it to Biopython? The code is below. Cheers, Eric [*] Henikoff & Henikoff (1994): Position-based sequence weights. http://www.ncbi.nlm.nih.gov/pubmed/7966282

----

from collections import Counter  # needs Python 2.7 or later


def sequence_weights(aln):
    """Weight aligned sequences to emphasize more divergent members.

    Returns a list of floating-point numbers between 0 and 1, corresponding
    to the proportional weight of each sequence in the alignment. The first
    item in the list is the weight of the first sequence in the alignment,
    and so on. Weights sum to 1.0.

    Method: At each column position, award each different residue an equal
    share of the weight, and then divide that weight equally among the
    sequences sharing the same residue. For each sequence, sum the
    contributions from each position to give a sequence weight.

    See Henikoff & Henikoff (1994): Position-based sequence weights.
    """
    def col_weight(column):
        """Represent the diversity at a position.

        Award each different residue an equal share of the weight, and then
        divide that weight equally among the sequences sharing the same
        residue.

        So, if in a position of a multiple alignment, r different residues
        are represented, a residue represented in only one sequence
        contributes a score of 1/r to that sequence, whereas a residue
        represented in s sequences contributes a score of 1/rs to each of
        the s sequences.
        """
        # Skip columns with all gaps or unique inserts
        if len([c for c in column if c not in '-.']) < 2:
            return [0] * len(column)
        # Count the number of occurrences of each residue type
        # (Treat gaps as a separate, 21st character)
        counts = Counter(column)
        # Get residue weights: 1/rs, where
        # r = number of residue types, s = count of a particular residue type
        n_residues = len(counts)    # r
        # items() rather than iteritems(), so this also runs on Python 3
        freqs = dict((aa, 1.0 / (n_residues * count))
                     for aa, count in counts.items())
        weights = [freqs[aa] for aa in column]
        return weights

    seq_weights = [0] * len(aln)
    col_weights = map(col_weight, zip(*aln))
    # Sum the contributions from each position along each sequence
    # -> total weight
    for col in col_weights:
        for idx, row_val in enumerate(col):
            seq_weights[idx] += row_val
    # Normalize
    scale = 1.0 / sum(seq_weights)
    seq_weights = [scale * wt for wt in seq_weights]
    return seq_weights

From mjldehoon at yahoo.com Mon Apr 9 23:27:31 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 9 Apr 2012 16:27:31 -0700 (PDT) Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: Message-ID: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> Hi Eric, Peter, > How about Bio.Search, for now? I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells users something about what the module is for. Bio.Search could be anything (search PubMed? search the Entrez databases? search Google? anyway Bio.Search does not suggest that this module is about pairwise alignments). But Peter previously mentioned that he doesn't like Bio.Pairwise; can we convince you? >> How about including this module as a submodule in Bio.Align? > Does anything in Bio.Align already operate on SeqFeature objects? I was more thinking to have this module as a submodule in Bio.Align for the purpose of module organization rather than reusing or integrating it with Bio.Align. However, if we can make use of Bio.Align, then that could be a good thing. Best, -Michiel.

From chapmanb at 50mail.com Tue Apr 10 00:58:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 09 Apr 2012 20:58:19 -0400
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
Message-ID: <87lim4h07o.fsf@fastmail.fm>

Michiel;

> Hi Eric, Peter,
>
> > How about Bio.Search, for now?
>
> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> users something about what the module is for. Bio.Search could be
> anything (search PubMed? search the Entrez databases? search Google?
> anyway Bio.Search does not suggest that this module is about pairwise
> alignments). But Peter previously mentioned that he doesn't like
> Bio.Pairwise; can we convince you?

I agree with Peter on this one. The module is primarily about searching a sequence database with an input via multiple methods, not about pairwise alignment of two sequences, which is what Bio.Align.Pairwise suggests to me.

Brad

From redmine at redmine.open-bio.org Tue Apr 10 20:29:09 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 10 Apr 2012 20:29:09 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] (New) Example using Bio.Clustalw in Tutorial
Message-ID:

Issue #3340 has been reported by Peter Cock.

----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version:
URL:

The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.
----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Thu Apr 12 16:01:47 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 12 Apr 2012 17:01:47 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
Message-ID:

Hello all,

The BOSC abstract deadline (tomorrow) has rather crept up on me, despite Nomi's reminder emails (My excuse is I've been thinking more about GSoC!). For anyone thinking of submitting a talk, the abstract limit is just a page - see:
http://www.open-bio.org/wiki/BOSC_2012

I'm hoping to attend BOSC, but will probably not be at ISMB 2012. I'd be delighted for another Biopython developer to give the project update talk (and as in previous years, we'll help out with the abstract, slides, etc).
Anyone interested? Giving a talk can be very helpful in getting travel funding ;)

I know Eric might be a candidate as he will be in Long Beach (congratulations on getting your ISMB poster accepted Eric!).

Note that the dedicated "Bioinformatics Open Source Project Updates" track is new this year. The talks are likely to be at the shorter end of the talk length range specified (i.e. closer to 5 minutes than 20 mins) but that will partly depend on quite how full the final schedule turns out to be.

The idea (speaking with my BOSC hat on) with the update talks is to try to highlight what is new and exciting, with only a minimal introduction for the higher profile projects - most of the audience will know roughly what BioPerl etc are, and won't be interested to hear it again ;)

So for the Biopython talk we'd probably want to cover things like GSoC, work with PyPy and Python3, major new functionality, any Biopython papers, etc, and a bit on future plans. The talk should be short but sweet :)

Regards,

Peter

From redmine at redmine.open-bio.org Thu Apr 12 18:52:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 12 Apr 2012 18:52:35 +0000
Subject: [Biopython-dev] [Biopython - Feature #3341] (New) Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)"
Message-ID:

Issue #3341 has been reported by Martin Mokrejš.

----------------------------------------
Feature #3341: Improve SffIO to parse 3 extra lines present in some SFF files ("Run Name:, Analysis Name:, Full Path:)"
https://redmine.open-bio.org/issues/3341

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

Some files have 3 extra lines per record in the SFF file.
Two such files are already in the Biopython test data:

biopython/Tests/Roche/E3MFGYR02_random_10_reads.sff
biopython/Tests/Roche/paired.sff

The three lines "Run Name:, Analysis Name:, Full Path:" are not parsed into the object and, later on, are not written out. Hence, the SFF round trip (read in -> write out) breaks (biopython-1.58). These three lines do not appear in every SFF file, and so far I haven't seen them in files extracted from SRA. It seems they only appear in original Roche SFF files.

>E3MFGYR02JWQ7T
Run Prefix: R_2008_01_09_16_16_00_
Region #: 2
XY Location: 3946_2103

Run Name: R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331
Analysis Name: /data/2008_02_08/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe
Full Path: /data/R_2008_02_08_17_05_24_build11_mlabrecque_100707593662420SV11007SID4607RunID24947331/D_2008_02_08_23_45_24_d41_AnalysisPipe

Read Header Len: 32
Name Length: 14
# of Bases: 265
Clip Qual Left: 5
Clip Qual Right: 264
Clip Adap Left: 0
Clip Adap Right: 0

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From eric.talevich at gmail.com Thu Apr 12 22:37:12 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 12 Apr 2012 18:37:12 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Thu, Apr 12, 2012 at 12:01 PM, Peter Cock wrote:
> Hello all,
>
> The BOSC abstract deadline (tomorrow) has rather crept up on me,
> despite Nomi's reminder emails (My excuse is I've been thinking
> more about GSoC!).
For anyone thinking of submitting a talk, the > abstract limit is just a page - see: > http://www.open-bio.org/wiki/BOSC_2012 > > I'm hoping to attend BOSC, but will probably not be at ISMB 2012. > I'd be delighted for another Biopython developer to give the project > update talk (and as in previous years, we'll help out with the abstract, > slides, etc). Anyone interested? Giving a talk can be very helpful in > getting travel funding ;) > > I know Eric might be a candidate as he will be in Long Beach > (congratulations on getting your ISMB poster accepted Eric!). > > Note that dedicated "Bioinformatics Open Source Project Updates" > track is new this year. The talks are likely to be at the shorter end of > the talk length range specified (i.e. closer to 5 minutes than 20 mins) > but that will partly depend on quite how full the final schedule turns > out to be. > > The idea (speaking with my BOSC hat on) with the update talks is > to try to highlight what is new and exciting, with only a minimal > introduction for the higher profile projects - most of the audience > will know roughly what BioPerl etc are, and won't be interested > to hear it again ;) > > So for the Biopython talk we'd probably want to cover things like > GSoC, work with PyPy and Python3, major new functionality, any > Biopython papers, etc, and a bit on future plans. The talk should be > short but sweet :) > > Regards, > > Peter OK, here are some potential talking points I scraped from past announcements: * SeqIO.index_db: Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to carry the index_db concept to other modules. 

* Installation improvements:
pip support (v.1.57); easy_install will automatically handle the numpy dependency (v.1.59, Feb '12)

* Portability:
Python 3 compatibility (except for a couple C extension modules); still supporting Jython; now mostly supporting PyPy (except for modules that use numpy or C extensions)

* Merged Brandon Invergo's independent project pypaml under Bio.Phylo.PAML (v.1.58, Aug '11). With SeqIO's new sequential Phylip support (v.1.59) and the existing support for phylogeny I/O under Phylo, we can now easily assemble and run complete workflows involving PAML.
(Similarly for PhyML, with SeqIO's "phylip-relaxed" and Bio.Phylo.Applications.PhymlCommandline.)

* GenomeDiagram improvements:
New, pretty features. Eye candy for the slides.

* TogoWS

* Next release & future plans:
- Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
- Brad's GFF parser
- Deeper future: see the other mailing list thread

* GSoC 2011 results:
- Mikael Trellet -- Interface
- Michele Silva -- Mocapy++ Python module; also ported two applications to Biopython
- Justinas D. -- Python-based extension system for Mocapy++

* Summer of Struct:
João and Eric are working to refactor and merge the vast amount of Bio.PDB-related code produced during previous GSoCs. (Includes a planned SeqIO-style API for structures in PDB, mmCIF and PDBML formats.) Improvements have been trickling in since the last BOSC; here comes the flood.

From chapmanb at 50mail.com Fri Apr 13 00:23:03 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 12 Apr 2012 20:23:03 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID: <877gxkh448.fsf@fastmail.fm>

Eric and Peter;

Eric -- I'm glad you're taking this on. It'll be great to have a Biopython presentation at BOSC. The points you mentioned all sound great, although I would drop some of the more boring ones like the installation stuff (I can pick on that, since it's mine).

My only other suggestion is to focus the talk around the people who've provided the improvements. One of the awesome things about Biopython is the wide contributor base, and we still manage to pull everything into a coherent package thanks to Peter's guiding hand. It would be cool to emphasize this community as part of the update.

Thanks again for doing this,
Brad

> > Hello all,
> >
> > The BOSC abstract deadline (tomorrow) has rather crept up on me,
> > despite Nomi's reminder emails (My excuse is I've been thinking
> > more about GSoC!). For anyone thinking of submitting a talk, the
> > abstract limit is just a page - see:
> > http://www.open-bio.org/wiki/BOSC_2012
> >
> > I'm hoping to attend BOSC, but will probably not be at ISMB 2012.
> > I'd be delighted for another Biopython developer to give the project
> > update talk (and as in previous years, we'll help out with the abstract,
> > slides, etc). Anyone interested? Giving a talk can be very helpful in
> > getting travel funding ;)
> >
> > I know Eric might be a candidate as he will be in Long Beach
> > (congratulations on getting your ISMB poster accepted Eric!).
> >
> > Note that dedicated "Bioinformatics Open Source Project Updates"
> > track is new this year. The talks are likely to be at the shorter end of
> > the talk length range specified (i.e. closer to 5 minutes than 20 mins)
> > but that will partly depend on quite how full the final schedule turns
> > out to be.
> >
> > The idea (speaking with my BOSC hat on) with the update talks is
> > to try to highlight what is new and exciting, with only a minimal
> > introduction for the higher profile projects - most of the audience
> > will know roughly what BioPerl etc are, and won't be interested
> > to hear it again ;)
> >
> > So for the Biopython talk we'd probably want to cover things like
> > GSoC, work with PyPy and Python3, major new functionality, any
> > Biopython papers, etc, and a bit on future plans. The talk should be
> > short but sweet :)
> >
> > Regards,
> >
> > Peter
>
>
> OK, here are some potential talking points I scraped from past announcements:
>
> * SeqIO.index_db:
> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
> carry the index_db concept to other modules.
>
> * Installation improvements:
> pip support (v.1.57); easy_install will automatically handle the numpy
> dependency (v.1.59, Feb '12)
>
> * Portability:
> Python 3 compatibility (except for a couple C extension modules);
> still supporting Jython; now mostly supporting Pypy (except for
> modules that use numpy or C extensions)
>
> * Merged Brandon Invergo's independent project pypaml under
> Bio.Phylo.PAML ((v.1.58, Aug '11). With SeqIO's new sequential Phylip
> support (v.1.59) and the existing support for phylogeny I/O under
> Phylo, we can now easily assemble and run complete workflows involving
> PAML.
> (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
> Bio.Phylo.Applications.PhymlCommandline.)
>
> * GenomeDiagram improvements:
> New, pretty features. Eye candy for the slides.
>
> * TogoWS
>
> * Next release & future plans:
> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
> - Brad's GFF parser
> - Deeper future: see the other mailing list thread
>
> * GSoC 2011 results:
> - Mikael Trellet -- Interface
> - Michele Silva -- Mocapy++ Python module; also ported two
> applications to Biopython
> - Justinas D. -- Python-based extension system for Mocapy++
>
> * Summer of Struct:
> João and Eric are working to refactor and merge the vast amount of
> Bio.PDB-related code produced during previous GSoCs. (Includes a
> planned SeqIO-style API for structures in PDB, mmCIF and PBDML
> formats.) Improvements have been trickling in since the last BOSC;
> here comes the flood.
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From arklenna at gmail.com Fri Apr 13 03:26:35 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 12 Apr 2012 23:26:35 -0400
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:
Message-ID:

On Thu, Mar 29, 2012 at 10:05 AM, Peter Cock wrote:
> Hi Lenna,
>
> Have you tried your branch on Windows yet?
>
> It worked for me under my Python 2.5 setup using mingw32,
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> creating build\temp.win32-2.5\Release\bio\pdb
> creating build\temp.win32-2.5\Release\bio\pdb\mmcif
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c Bio/PDB/mmCIF/lex.yy.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> lex.yy.c:1046: warning: 'yyunput' defined but not used
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IBio
> -Ic:\python25\include -Ic:\python25\PC -c
> Bio/PDB/mmCIF/MMCIFlexmodule.c -o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> writing build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> C:\cygwin\usr\bin\gcc.exe -mno-cygwin -shared -s
> build\temp.win32-2.5\Release\bio\pdb\mmcif\lex.yy.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\mmciflexmodule.o
> build\temp.win32-2.5\Release\bio\pdb\mmcif\MMCIFlex.def
> -Lc:\python25\libs -Lc:\python25\PCBuild -lpython25 -lmsvcr71 -o
> build\lib.win32-2.5\Bio\PDB\mmCIF\MMCIFlex.pyd
> ...
>
> That worked fine and test_MMCIF.py is happy. However, MSVC v9 is not:
>
> C:\repositories\biopython>c:\python26\python setup.py install
> ...
> building 'Bio.PDB.mmCIF.MMCIFlex' extension
> C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\cl.exe /c /nologo
> /Ox /MD /W3 /GS- /DNDEBUG -IBio -Ic:\python26\include -Ic:\python26\PC
> /TcBio/PDB/mmCIF/lex.yy.c
> /Fobuild\temp.win32-2.6\Release\Bio/PDB/mmCIF/lex.yy.obj
> lex.yy.c
> Bio/PDB/mmCIF/lex.yy.c(12) : fatal error C1083: Cannot open include
> file: 'unistd.h': No such file or directory
> error: command '"C:\Program Files\Microsoft Visual Studio
> 9.0\VC\BIN\cl.exe"' failed with exit status 2
>
> The same with Python 2.7 and the Microsoft compiler. Switching
> from this in Bio/PDB/mmCIF.yy.c:
>
> #include
>
> to this:
>
> #include
>
> lets it compile (although with some warnings) and test_MMCIF.py passes.
> If should be conditional of course, but I'm unclear if that is the appropriate
> fix or not though.
>
> Peter

Hi Peter,

I installed flex on my Windows VM and used it to generate lex.yy.c. It puts #include inside an #ifdef so it may work with MSVC. It produces a working module for both Debian and Mac OS X (I do get "defined but not used" warnings for generated functions). I've cherry-picked it into my pull request.

I know you're quite busy right now with BOSC and GSoC, but let me know if you get a chance to test it on MSVC.

Lenna

From p.j.a.cock at googlemail.com Fri Apr 13 11:31:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 12:31:30 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich wrote:
>
> OK, here are some potential talking points I scraped from past announcements:
>
> * SeqIO.index_db:
> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to
> carry the index_db concept to other modules.

Biopython 1.57 was already covered at BOSC 2011.

> * Installation improvements:
> pip support (v.1.57); easy_install will automatically handle the numpy
> dependency (v.1.59, Feb '12)

Brad commented on this, perhaps a line in the abstract?

> * Portability:
> Python 3 compatibility (except for a couple C extension modules);
> still supporting Jython; now mostly supporting Pypy (except for
> modules that use numpy or C extensions)

This is something I would want to cover.

> * Merged Brandon Invergo's independent project pypaml under
> Bio.Phylo.PAML ((v.1.58, Aug '11). With SeqIO's new sequential Phylip
> support (v.1.59) and the existing support for phylogeny I/O under
> Phylo, we can now easily assemble and run complete workflows involving
> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and
> Bio.Phylo.Applications.PhymlCommandline.)

Yep.

> * GenomeDiagram improvements:
> New, pretty features. Eye candy for the slides.

Yep. Maybe even an example in the abstract?

> * TogoWS

Yep.

> * Next release & future plans:
> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student
> - Brad's GFF parser
> - Deeper future: see the other mailing list thread

Good points - although I don't want to over promise ;)

> * GSoC 2011 results:
> - Mikael Trellet -- Interface
> - Michele Silva -- Mocapy++ Python module; also ported two
> applications to Biopython
> - Justinas D. -- Python-based extension system for Mocapy++

We should have a summary of what they did somewhere, perhaps as an OBF blog post? I'm hoping to get this year's GSoC students to write weekly progress reports on a blog or at least by email to the mailing list.

> * Summer of Struct:
> João and Eric are working to refactor and merge the vast amount of
> Bio.PDB-related code produced during previous GSoCs. (Includes a
> planned SeqIO-style API for structures in PDB, mmCIF and PBDML
> formats.) Improvements have been trickling in since the last BOSC;
> here comes the flood.

:)

Here's a draft abstract - note we have to fit in a page.
Having a logo or some eye catching image is very effective for standing out in the abstract book (on screen or on paper). Comments welcome - but keep in mind the one page limit. Eric - feel free to turn this into a Google Doc if you prefer. Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.pdf Type: application/pdf Size: 199737 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.tex Type: application/x-tex Size: 5037 bytes Desc: not available URL: From eric.talevich at gmail.com Fri Apr 13 14:31:08 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Apr 2012 10:31:08 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: Thanks for this. I'll keep it as LaTeX, since it already looks nice. 1. Several parts say "[to be revised prior to BOSC]" -- I take it we have the option of updating our abstract shortly before BOSC, and this is a note to the conference organizers that we intend to do so? To save space and reduce distraction, should this be a footnote instead? 2. To save space: Do we need the line "Bioinformatics Open Source Conference (BOSC) ..." after the author names? 3. Again to save space, and make room to cite the Phylo paper: can we drop the citation for TogoWS, and add a few words of description in the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) 4. How do you feel about dropping inline citations, and just have a list of \nocite references at the bottom? In a one-page abstract, it should be easy enough for readers to figure out what's what. 
-E On Fri, Apr 13, 2012 at 7:31 AM, Peter Cock wrote: > On Thu, Apr 12, 2012 at 11:37 PM, Eric Talevich wrote: >> >> OK, here are some potential talking points I scraped from past announcements: >> >> * SeqIO.index_db: >> Introduced v.1.57 (Apr 2011), with improvements since then. Ideas to >> carry the index_db concept to other modules. > > Biopython 1.57 was already covered at BOSC 2011. > >> * Installation improvements: >> pip support (v.1.57); easy_install will automatically handle the numpy >> dependency (v.1.59, Feb '12) > > Brad commented on this, perhaps a line in the abstract? > >> * Portability: >> Python 3 compatibility (except for a couple C extension modules); >> still supporting Jython; now mostly supporting Pypy (except for >> modules that use numpy or C extensions) > > This is something I would want to cover. > >> * Merged Brandon Invergo's independent project pypaml under >> Bio.Phylo.PAML ((v.1.58, Aug '11). With SeqIO's new sequential Phylip >> support (v.1.59) and the existing support for phylogeny I/O under >> Phylo, we can now easily assemble and run complete workflows involving >> PAML. (Similarly for PhyML, with SeqIO's "phylip-relaxed" and >> Bio.Phylo.Applications.PhymlCommandline.) > > Yep. > >> * GenomeDiagram improvements: >> New, pretty features. Eye candy for the slides. > > Yep. Maybe even an example in the abstract? > >> * TogoWS > > Yep. > >> * Next release & future plans: >> - Restored mmCIF support, via Lenna Peterson, a prospective GSoC student >> - Brad's GFF parser >> - Deeper future: see the other mailing list thread > > Good points - although I don't want to over promise ;) > >> * GSoC 2011 results: >> - Mikael Trellet -- Interface >> - Michele Silva -- Mocapy++ Python module; also ported two >> applications to Biopython >> - Justinas D. -- Python-based extension system for Mocapy++ > > We should have a summary of what they did somewhere, perhaps > as an OBF blog post? 
I'm hoping to get this year's GSoC students
> to write weekly progress reports on a blog or at least by email to
> the mailing list.
>
>> * Summer of Struct:
>> João and Eric are working to refactor and merge the vast amount of
>> Bio.PDB-related code produced during previous GSoCs. (Includes a
>> planned SeqIO-style API for structures in PDB, mmCIF and PBDML
>> formats.) Improvements have been trickling in since the last BOSC;
>> here comes the flood.
>
> :)
>
> Here's a draft abstract - note we have to fit in a page. Having a logo
> or some eye catching image is very effective for standing out in the
> abstract book (on screen or on paper).
>
> Comments welcome - but keep in mind the one page limit.
>
> Eric - feel free to turn this into a Google Doc if you prefer.
>
> Peter

From p.j.a.cock at googlemail.com Fri Apr 13 14:42:37 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 15:42:37 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich wrote:
> Thanks for this. I'll keep it as LaTeX, since it already looks nice.
>
> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we
> have the option of updating our abstract shortly before BOSC, and this
> is a note to the conference organizers that we intend to do so? To
> save space and reduce distraction, should this be a footnote instead?

It is common for BOSC abstracts to be revised following review prior to acceptance (almost like a tiny paper), and yes, that was my intention. Do you think something like [to be revised during abstract review] might be clearer? I think this makes a lot of sense for the project update talks in particular - by that stage, for example, we'll have the GSoC students selected.

> 2. To save space: Do we need the line "Bioinformatics Open Source
> Conference (BOSC) ..." after the author names?
I like it to make the page self contained, useful if we post it as a lone PDF file. The text could be smaller certainly if required - likewise the logo could be shrunk a little. > 3. Again to save space, and make room to cite the Phylo paper: can we > drop the citation for TogoWS, and add a few words of description in > the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) Fair point, I was thinking in terms of audience recognition. PAML and HMMer are quite well known and relatively old/mature. If the Phylo paper is accepted in time to be added to abstract then of course we'd want to include it. But right now using a couple of lines for a 'submitted' citation seemed overkill to me. But if you can get it to fit nicely, please go ahead. > 4. How do you feel about dropping inline citations, and just have a > list of \nocite references at the bottom? In a one-page abstract, it > should be easy enough for readers to figure out what's what. If you prefer, or use the [1] style? Peter From eric.talevich at gmail.com Fri Apr 13 15:40:06 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Apr 2012 11:40:06 -0400 Subject: [Biopython-dev] BOSC 2012 - Biopython Update In-Reply-To: References: Message-ID: On Fri, Apr 13, 2012 at 10:42 AM, Peter Cock wrote: > On Fri, Apr 13, 2012 at 3:31 PM, Eric Talevich wrote: >> Thanks for this. I'll keep it as LaTeX, since it already looks nice. >> >> 1. Several parts say "[to be revised prior to BOSC]" -- I take it we >> have the option of updating our abstract shortly before BOSC, and this >> is a note to the conference organizers that we intend to do so? To >> save space and reduce distraction, should this be a footnote instead? > > It is common for BOSC abstracts to be revised following review prior to > acceptance (almost like a tiny paper), and yes, that was my intention. > Do you think something like [to be revised during abstract review] > might be clearer? 
I think this makes a lot of sense for the project > update talks in particular - but that stage for example we'll have the > GSoC students selected. > >> 2. To save space: Do we need the line "Bioinformatics Open Source >> Conference (BOSC) ..." after the author names? > > I like it to make the page self contained, useful if we post it as a lone > PDF file. The text could be smaller certainly if required - likewise the > logo could be shrunk a little. > >> 3. Again to save space, and make room to cite the Phylo paper: can we >> drop the citation for TogoWS, and add a few words of description in >> the main text where it's mentioned? (We don't cite PAML, HMMer, etc.) > > Fair point, I was thinking in terms of audience recognition. PAML > and HMMer are quite well known and relatively old/mature. > > If the Phylo paper is accepted in time to be added to abstract then > of course we'd want to include it. But right now using a couple of > lines for a 'submitted' citation seemed overkill to me. But if you can > get it to fit nicely, please go ahead. > >> 4. How do you feel about dropping inline citations, and just have a >> list of \nocite references at the bottom? In a one-page abstract, it >> should be easy enough for readers to figure out what's what. > > If you prefer, or use the [1] style? > > Peter Here's an updated draft. How does it look? -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2012_draft.pdf Type: application/pdf Size: 262728 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Biopython_BOSC_abstract_2012_draft.tex
Type: application/x-tex
Size: 5573 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Fri Apr 13 15:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 16:57:27 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich wrote:
>
> Here's an updated draft. How does it look?

Looks fine to me - anyone else? A fresh pair of eyes would be good.

Also does anyone else want to be named as a talk co-author (and promise to contribute with slides/figures/help for preparing the talk)? Or should we just put "Eric et al" since he'll be the one on stage?

Peter

From anaryin at gmail.com Fri Apr 13 16:02:04 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 13 Apr 2012 18:02:04 +0200
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

Third paragraph: 'summer' should read 'Summer'.

Looks good to me! I can help with the slides/figures/help, particularly on the refactoring part of Bio.PDB to Bio.Struct. Let me know when and I can easily get on Skype.

cheers!

João

From zhigang.wu at email.ucr.edu Fri Apr 13 16:25:34 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Fri, 13 Apr 2012 09:25:34 -0700
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

I may have caught a grammar mistake.

Should we correct "Biopython 1.60 is expected *to have been* released by BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"?

I may be wrong. I am not a native speaker. :-)

Zhigang

On Fri, Apr 13, 2012 at 8:57 AM, Peter Cock wrote:
> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich
> wrote:
> >
> > Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From arklenna at gmail.com Fri Apr 13 16:31:53 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 13 Apr 2012 12:31:53 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 12:25 PM, Zhigang Wu wrote:
> Probably I caught a grammar mistake.
>
> Should we correct "Biopython 1.60 is expected *to have been* released by
> BOSC 2012" to "Biopython 1.60 is expected *to be* released by BOSC 2012"?
>
> Probably I was wrong. I am not a native speaker. :-)
>
> Zhigang
>

Hi Zhigang,

Actually, either way is correct - the original way is called the future perfect tense. Here's a description of the grammar if you are interested: http://www.englishpage.com/verbpage/futureperfect.html

Lenna

From eric.talevich at gmail.com Fri Apr 13 17:17:31 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 13 Apr 2012 13:17:31 -0400
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 11:57 AM, Peter Cock wrote:
> On Fri, Apr 13, 2012 at 4:40 PM, Eric Talevich wrote:
>>
>> Here's an updated draft. How does it look?
>
> Looks fine to me - anyone else? A fresh pair of eyes would be good.
>
> Also does anyone else want to be named as a talk co-author (and
> promise to contribute with slides/figures/help for preparing the talk)?
> Or should we just put "Eric et al" since he'll be the one on stage?
>
> Peter

I added João as the fourth author and submitted it.
Cheers,
Eric

From p.j.a.cock at googlemail.com Fri Apr 13 19:32:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 13 Apr 2012 20:32:32 +0100
Subject: [Biopython-dev] BOSC 2012 - Biopython Update
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 6:17 PM, Eric Talevich wrote:
>
> I added João as the fourth author and submitted it.
>
> Cheers,
> Eric

Thanks Eric,

If there are any other comments or changes, we'll try to integrate
them along with any reviewers' comments.

Peter

From tiagoantao at gmail.com Mon Apr 16 09:35:21 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 16 Apr 2012 10:35:21 +0100
Subject: [Biopython-dev] plink phasing and others
Message-ID:

Hi,

During the last few months I have been in a hell hole writing code
like mad. Maybe some of this code is of interest to share.

I currently have:

1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
people are doing this
2. Code to process admixture results. Admixture is far less used than STRUCTURE
3. Code to deal with phasing formats. Beagle, PHASE and shapeit
4. PCA
5. Some gene ontology stuff

My GO stuff is pretty specific, so I guess it might not be of interest.
All the other components are of fairly widely used things.
Admixture and PCA are standard popgen analysis. Admixture code could
probably be changed to also support STRUCTURE. I am not sure but PCA
might only work on linux.
Plink and phasing are of more general interest. These would be out of
Bio.PopGen.

There is no strange requirement to any of these code with one
exception: admixture and PCA require matplotlib.

So that people have an understanding of the impact of these things, I
put the number of scholar citations:
plink - 3315
smartpca - 1673
admixture - 57
structure - 7448
beagle - >300
fastphase - 1935

Unfortunately there is little code to do automated analysis using these tools.
I could start migrating some of this code to biopython (would have to
write documentation, and comment the code better ;) )

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Mon Apr 16 10:26:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 11:26:30 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Tiago Antão:
> Hi,
>
> During the last few months I have been in a hell hole writing code
> like mad. Maybe some of this code is of interest to share.
>
> I currently have:
>
> 1. Code to parse plink output. Pretty trivial stuff, but I bet lots of
> people are doing this
> 2. Code to process admixture results. Admixture is far less used than STRUCTURE
> 3. Code to deal with phasing formats. Beagle, PHASE and shapeit
> 4. PCA
> 5. Some gene ontology stuff
>
> My GO stuff is pretty specific, so I guess it might not be of interest.
> All the other components are of fairly widely used things.
> Admixture and PCA are standard popgen analysis. Admixture code could
> probably be changed to also support STRUCTURE. I am not sure but PCA
> might only work on linux.
> Plink and phasing are of more general interest. These would be out of
> Bio.PopGen.
>
> There is no strange requirement to any of these code with one
> exception: admixture and PCA require matplotlib.
>
> So that people have an understanding of the impact of these things, I
> put the number of scholar citations:
> plink - 3315
> smartpca - 1673
> admixture - 57
> structure - 7448
> beagle - >300
> fastphase - 1935
>
> Unfortunately there is little code to do automated analysis using these tools.
>
> I could start migrating some of this code to biopython (would have to
> write documentation, and comment the code better ;) )

Sounds good. The GO stuff would/should be more general than just
PopGen, and I know other people are looking at this on branches.
When you said PCA, that was principal component analysis, right?
What are you adding on top of NumPy/SciPy/matplotlib?

Peter

From tiagoantao at gmail.com Mon Apr 16 12:05:34 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 16 Apr 2012 13:05:34 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Peter Cock:
> Sounds good. The GO stuff would/should be more general than just
> PopGen, and I know other people are looking at this on branches.

What I do here is things like tree traversing (e.g. find all parent
nodes) and stuff like that. After that I do enrichment analysis
(Fisher's exact test, FDR, that stuff). Nothing of real interest for now.
I think we can ignore my code here (for now).

> When you said PCA, that was principal component analysis, right?

Yep, I am using eigenstrat/smartpca.

> What are you adding on top of NumPy/SciPy/matplotlib?

PCA plots and admixture plots. Here is an example of both:
http://2.bp.blogspot.com/-6J6Gsas4uIs/TuELU3Gf4ZI/AAAAAAAAEWQ/CymvlzkX6hQ/s1600/PIIS0002929711004885.gr2_lrg.hi.jpg
Top: PCA
Bottom: admixture

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Mon Apr 16 13:50:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 14:50:18 +0100
Subject: [Biopython-dev] [biopython] Fix flex library dependency of MMCIFlex; closes 2619 (#31)
In-Reply-To:
References:
Message-ID:

On Fri, Apr 13, 2012 at 4:26 AM, Lenna Peterson wrote:
>
> Hi Peter,
>
> I installed flex on my Windows VM and used it to generate lex.yy.c. It
> puts #include inside an #ifdef so it may work with MSVC. It
> produces a working module for both Debian and Mac OS X (I do get
> "defined but not used" warnings for generated functions). I've
> cherry-picked it into my pull request.
>

I've now tested that on my Windows machine (and Mac and Linux),
and applied the changes to the master branch. Thanks!
We must remember to drop an email to the Debian and Red Hat
packaging teams since their old patch to setup.py isn't needed
now (they could control the flex problem by declaring it a
build time dependency).

Peter

From tiagoantao at gmail.com Mon Apr 16 15:00:13 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 16 Apr 2012 16:00:13 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

Just a few practical things:

1. we still do not allow matplotlib dependencies, correct?
2. to what part of the name space should plink and phasing be added?
3. Are we on epydoc or sphinx? Or moving from one to the other?
doctest is acceptable right?
4. What is the current best way to run external applications? There
was an application wrapper class in the past...

From p.j.a.cock at googlemail.com Mon Apr 16 15:18:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 16:18:10 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Tiago Antão:
> Just a few practical things:
>
> 1. we still do not allow matplotlib dependencies, correct?

They would be run time dependencies, right? Not compile/build time?
We already have things like 'soft' dependencies on ReportLab and
NetworkX, and even matplotlib. It does complicate the unit tests
a bit to skip anything gracefully.

>
> 2. to what part of the name space should plink and phasing be added?

Unclear to me right now.

> 3. Are we on epydoc or sphinx? Or moving from one to the other?
> doctest is acceptable right?

We're still using LaTeX for the tutorial, and epydoc for the API docs.
Using doctest is acceptable and encouraged for documentation,
but be wary of cross platform differences. If you have a doctest
which has dependencies see test_wise.py rather than adding it
to run_tests.py.

> 4. What is the current best way to run external applications? There
> was an application wrapper class in the past...
For simple Unix style applications controlled via the command line, use the Bio.Application framework as in Bio.Align.Applications or Bio.Sequencing.Applications, Bio.Phylo.Applications, or Bio.Emboss.Applications (etc?). Peter From p.j.a.cock at googlemail.com Mon Apr 16 15:20:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 16:20:59 +0100 Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices In-Reply-To: References: Message-ID: On Sat, Apr 7, 2012 at 7:42 PM, Eric Talevich wrote: > On Wed, Apr 4, 2012 at 10:53 PM, Eric Talevich wrote: > >> Hi all, >> >> I'm considering some enhancements to the Phylo.draw function to make it >> more customizable for power users. Since the function is based on >> matplotlib/pylab/pyplot, it's possible for quite a bit to be left to the >> user; however, I'm not fully versed in what pyplot is capable of. >> >> Relevant feature request in Redmine: >> https://redmine.open-bio.org/issues/3336 >> >> Ideas: > > [...] > > Just committed this feature: > https://github.com/biopython/biopython/commit/72990549a1b769ab19ab0bd33a8c35fdf031ac2d Hi Eric, That seems to have caused a test failure on one of our buildslaves: ====================================================================== ERROR: Run the tree layout algorithm, but don't display it. 
---------------------------------------------------------------------- Traceback (most recent call last): File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py", line 51, in test_draw Phylo.draw(dollo, do_show=False) File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py", line 366, in draw fig = plt.figure() File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py", line 270, in figure **kwargs) File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py", line 120, in new_figure_manager backend_wx._create_wx_app() File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py", line 1377, in _create_wx_app wxapp = wx.PySimpleApp() File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 8078, in __init__ wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt) File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py", line 7946, in __init__ raise SystemExit(msg) SystemExit: Unable to access the X Display, is $DISPLAY set properly? ---------------------------------------------------------------------- http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio Interestingly the same machine is passing the tests under other Python versions. That would seem to rule out the $DISPLAY environment variable being the cause. My hunch would be this is something about the Python 2.6 install, perhaps it is missing some library (wxPython maybe). 
Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7 have the same version of matplotlib installed, but only one is failing the test: $ python2.5 Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib Traceback (most recent call last): File "", line 1, in ImportError: No module named matplotlib $ python2.6 Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib >>> matplotlib.__version__ '1.0.0' $ python2.7 Python 2.7 (r27:82500, Jul 13 2010, 14:02:41) [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import matplotlib >>> matplotlib.__version__ '1.0.0' Peter From tiagoantao at gmail.com Mon Apr 16 15:31:50 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 16 Apr 2012 16:31:50 +0100 Subject: [Biopython-dev] plink phasing and others In-Reply-To: References: Message-ID: 2012/4/16 Peter Cock : > For simple Unix style applications controlled via the command line, > use the Bio.Application framework as in Bio.Align.Applications or > Bio.Sequencing.Applications, Bio.Phylo.Applications, or > Bio.Emboss.Applications (etc?). I wonder if people never had the need to abstract the computing infrastructure? The current code does local (blocking) execution, but we see environments with BAS or grids where other models are used. I am not suggesting any specific solution, but the current approach seems to me not very scalable. No? 
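To make the question concrete, here is a rough sketch of what a non-blocking local model could look like. It uses only the Python standard library and assumes nothing about Bio.Application beyond each job being available as an argument list; the helper name run_all is hypothetical and is not part of Biopython. A cluster or grid back end would replace the body of run_one with a job submission.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each argv list as a child process, several at a time.

    Hypothetical helper: returns (returncode, stdout) pairs in the same
    order as the input commands.
    """
    def run_one(argv):
        # Each worker thread blocks on its own child process only.
        result = subprocess.run(argv, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
        return result.returncode, result.stdout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if jobs finish out of order.
        return list(pool.map(run_one, commands))


# Stand-in jobs: trivial Python one-liners instead of real tools.
demo_jobs = [[sys.executable, "-c", "print(%d * 2)" % n] for n in range(3)]
for returncode, output in run_all(demo_jobs):
    print(returncode, output.strip())
```

A command line held as a single string can be turned into such an argument list with shlex.split.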
--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Mon Apr 16 16:08:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 17:08:20 +0100
Subject: [Biopython-dev] plink phasing and others
In-Reply-To:
References:
Message-ID:

2012/4/16 Tiago Antão:
> 2012/4/16 Peter Cock:
>> For simple Unix style applications controlled via the command line,
>> use the Bio.Application framework as in Bio.Align.Applications or
>> Bio.Sequencing.Applications, Bio.Phylo.Applications, or
>> Bio.Emboss.Applications (etc?).
>
> I wonder if people never had the need to abstract the computing
> infrastructure? The current code does local (blocking) execution, but
> we see environments with BAS or grids where other models are used. I
> am not suggesting any specific solution, but the current approach
> seems to me not very scalable. No?

I use the current framework with an SGE cluster, str(cline_object)
gives the command line string to submit as the jobs. It would be
nice to have some documented examples using this in combination
with multiprocessing or something... but I find most of the tools
I call are already multi-threaded.

Peter

From andrew.sczesnak at med.nyu.edu Mon Apr 16 16:48:41 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Mon, 16 Apr 2012 12:48:41 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To:
References:
Message-ID: <4F8C4D69.4040009@med.nyu.edu>

Hi Eric,

I was playing with Bio.Cluster recently and noticed that trees generated by
that module are not compatible with Bio.Phylo. I think it would be useful if
output from Cluster could be manipulated with Phylo.

At first glance, it doesn't seem like it would be that tricky to add a
method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and
I would be happy to work on this.
Before making an attempt, I wanted to get your feedback on whether you think this would be useful and if you had anything similar in the works already. Best, Andrew From eric.talevich at gmail.com Mon Apr 16 22:15:14 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 16 Apr 2012 18:15:14 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: <4F8C4D69.4040009@med.nyu.edu> References: <4F8C4D69.4040009@med.nyu.edu> Message-ID: On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak wrote: > Hi Eric, > > I was playing with Bio.Cluster recently and noticed that trees generated by > that module are not compatible with Bio.Phylo. I think it would be useful if > output from Cluster could be manipulated with Phylo. > > At first glance, it doesn't seem like it would be that tricky to add a > method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and > I would be happy to work on this. Before making an attempt, I wanted to get > your feedback on whether you think this would be useful and if you had > anything similar in the works already. > > > Best, > Andrew Hi Andrew, Interesting idea. It would be simple enough to add a "from_cluster" function or class method to either Phylo/BaseTree.py or Phylo/_utils.py. But as every scientist knows, just because we can doesn't necessarily mean we should. Do you have a specific use case in mind? If the main idea is to use Bio.Cluster to generate trees based on a measure of sequence distance, we could probably do more to support that. This code might also be worth posting on wiki "Phylo cookbook" page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes on it while we consider merging it into the main distribution. 
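As a rough illustration of what such a "from_cluster" conversion might involve, here is a minimal sketch that turns a Bio.Cluster-style node list into a Newick string, which Phylo.read could then parse. The Node namedtuple is a hypothetical stand-in, not the Bio.Cluster class itself, and the function name is made up; the sketch assumes the Bio.Cluster numbering in which items are 0..n-1 and the cluster formed by node i is referred to as -(i+1). Branch lengths are deliberately left out, since how node distances should map onto branch lengths is a separate decision.

```python
from collections import namedtuple

# Hypothetical stand-in for a Bio.Cluster node: two children and the
# distance between them. Children >= 0 are items, children < 0 are clusters.
Node = namedtuple("Node", ["left", "right", "distance"])


def cluster_to_newick(nodes, names):
    """Return a topology-only Newick string for a Bio.Cluster-style tree."""
    def render(index):
        if index >= 0:
            # Non-negative indices refer to the original items (leaves).
            return names[index]
        # Negative index -(i+1) refers to the cluster formed by nodes[i].
        node = nodes[-index - 1]
        return "(%s,%s)" % (render(node.left), render(node.right))

    # The last node joins everything, so it is the root.
    return render(-len(nodes)) + ";"


nodes = [Node(1, 2, 0.1), Node(0, -1, 0.5)]
print(cluster_to_newick(nodes, ["a", "b", "c"]))  # prints (a,(b,c));
```

The recursion is fine for small trees; a very deep tree would need an iterative version.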
-Eric From andrew.sczesnak at med.nyu.edu Mon Apr 16 22:47:25 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 Apr 2012 18:47:25 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: References: <4F8C4D69.4040009@med.nyu.edu> Message-ID: <4F8CA17D.4080907@med.nyu.edu> Eric, I can describe two use cases from my own experience. First, the MAF parser I've been working on can pull the multiple alignment of some gene between a bunch of genomes. Thinking of recipes for the cookbook, I thought it would be neat to walk the user through constructing a distance matrix by hand (though you're right--more could be done to support this), clustering with Bio.Cluster and visualizing the result with Bio.Phylo. I like this example because it integrates several different parts of BioPython along with a lesson about inferring distances between sequences. Second, for another project, I've been generating distance matrices based on the shared gene content of bacterial genomes and the presence-or-absence of orthologous groups in each. Presently, I ferry the matrices to a clustering program and then visualize the resulting trees in yet another tool. Looking into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and the incompatibility of their tree objects. I wonder, what would be the most elegant way of bridging the gap? Best, Andrew On 04/16/2012 06:15 PM, Eric Talevich wrote: > On Mon, Apr 16, 2012 at 12:48 PM, Andrew Sczesnak > wrote: >> Hi Eric, >> >> I was playing with Bio.Cluster recently and noticed that trees generated by >> that module are not compatible with Bio.Phylo. I think it would be useful if >> output from Cluster could be manipulated with Phylo. >> >> At first glance, it doesn't seem like it would be that tricky to add a >> method of converting Bio.Cluster tree objects to Bio.Phylo tree objects, and >> I would be happy to work on this. 
Before making an attempt, I wanted to get
>> your feedback on whether you think this would be useful and if you had
>> anything similar in the works already.
>>
>>
>> Best,
>> Andrew
>
> Hi Andrew,
>
> Interesting idea. It would be simple enough to add a "from_cluster"
> function or class method to either Phylo/BaseTree.py or
> Phylo/_utils.py. But as every scientist knows, just because we can
> doesn't necessarily mean we should. Do you have a specific use case in
> mind?
>
> If the main idea is to use Bio.Cluster to generate trees based on a
> measure of sequence distance, we could probably do more to support
> that. This code might also be worth posting on wiki "Phylo cookbook"
> page (http://www.biopython.org/wiki/Phylo_cookbook) to get more eyes
> on it while we consider merging it into the main distribution.
>
> -Eric

From eric.talevich at gmail.com Tue Apr 17 04:17:26 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 00:17:26 -0400
Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices
In-Reply-To:
References:
Message-ID:

On Mon, Apr 16, 2012 at 11:20 AM, Peter Cock wrote:
> Hi Eric,
>
> That seems to have caused a test failure on one of our buildslaves:
>
> ======================================================================
> ERROR: Run the tree layout algorithm, but don't display it.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/buildslave/BuildBot/lin2664/build/Tests/test_Phylo_depend.py",
> line 51, in test_draw
>     Phylo.draw(dollo, do_show=False)
>   File "/home/buildslave/BuildBot/lin2664/build/build/lib.linux-x86_64-2.6/Bio/Phylo/_utils.py",
> line 366, in draw
>     fig = plt.figure()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/pyplot.py",
> line 270, in figure
>     **kwargs)
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wxagg.py",
> line 120, in new_figure_manager
>     backend_wx._create_wx_app()
>   File "/usr/local/lib/python2.6/site-packages/matplotlib/backends/backend_wx.py",
> line 1377, in _create_wx_app
>     wxapp = wx.PySimpleApp()
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 8078, in __init__
>     wx.App.__init__(self, redirect, filename, useBestVisual, clearSigInt)
>   File "/usr/local/lib/python2.6/site-packages/wx-2.8-gtk2-unicode/wx/_core.py",
> line 7946, in __init__
>     raise SystemExit(msg)
> SystemExit: Unable to access the X Display, is $DISPLAY set properly?
>
> ----------------------------------------------------------------------
>
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/534/steps/shell/logs/stdio
> http://testing.open-bio.org:8010/builders/Linux%2064%20-%20Python%202.6/builds/535/steps/shell/logs/stdio
>
> Interestingly the same machine is passing the tests under other Python versions.
> That would seem to rule out the $DISPLAY environment variable being the cause.
> My hunch would be this is something about the Python 2.6 install, perhaps it
> is missing some library (wxPython maybe).
>
> Logged in as the buildslave on this machine I can see that both Python 2.6 & 2.7
> have the same version of matplotlib installed, but only one is failing the test:
>
> $ python2.5
> Python 2.5.5 (r255:77872, Jan 14 2011, 17:09:55)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import matplotlib
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named matplotlib
>
> $ python2.6
> Python 2.6.6 (r266:84292, Aug 31 2010, 16:21:14)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import matplotlib
>>>> matplotlib.__version__
> '1.0.0'
>
> $ python2.7
> Python 2.7 (r27:82500, Jul 13 2010, 14:02:41)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import matplotlib
>>>> matplotlib.__version__
> '1.0.0'
>
>
> Peter

Actually, it was this commit which added new unit tests:
https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8

On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
sure how to debug this, exactly. Do you know a way to prevent
matplotlib from attempting to launch the Wx app, beyond turning off
interactive mode as the test already does?

One idea is to specify a matplotlib backend other than wx. For
example, using this import approach in test_Phylo_depend.py might do
the trick:

try:
    import matplotlib
except ImportError:
    raise MissingExternalDependencyError(
            "Install matplotlib if you want to use Bio.Phylo._utils.")
else:
    # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
    # properly set up on the build machine. Instead, use the simpler postscript
    # backend -- we're not going to display or save the plot anyway, so it
    # doesn't matter much, as long as it's not Wx. I guess.
    matplotlib.use("ps")
    from matplotlib import pyplot


Would you be able to test this on the errant buildbot machine without
having to commit this to the trunk?

Thanks,
Eric

From p.j.a.cock at googlemail.com Tue Apr 17 09:31:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 10:31:05 +0100
Subject: [Biopython-dev] Enhancements to Phylo.draw; pyplot best practices
In-Reply-To:
References:
Message-ID:

On Tue, Apr 17, 2012 at 5:17 AM, Eric Talevich wrote:
>
> Actually, it was this commit which added new unit tests:
> https://github.com/biopython/biopython/commit/a5f995bc1d9e113d88195cc7ae6f389984d762d8
>

OK - thanks for checking.

> On my machine with Python 2.6 and Ubuntu, the test passes, so I'm not
> sure how to debug this, exactly. Do you know a way to prevent
> matplotlib from attempting to launch the Wx app, beyond turning off
> interactive mode as the test already does?

Not sure.

> One idea is to specify a matplotlib backend other than wx. For
> example, using this import approach in test_Phylo_depend.py might do
> the trick:
>
> try:
>     import matplotlib
> except ImportError:
>     raise MissingExternalDependencyError(
>             "Install matplotlib if you want to use Bio.Phylo._utils.")
> else:
>     # Don't use the Wx backend for matplotlib, b/c that depends on Wx being
>     # properly set up on the build machine. Instead, use the simpler postscript
>     # backend -- we're not going to display or save the plot anyway, so it
>     # doesn't matter much, as long as it's not Wx. I guess.
>     matplotlib.use("ps")
>     from matplotlib import pyplot
>
>
> Would you be able to test this on the errant buildbot machine without
> having to commit this to the trunk?

Yes, that works (this buildbot is one of 'my' servers so I can
run this directly). Please check that fix in.

Thanks,

Peter

From p.j.a.cock at googlemail.com Tue Apr 17 15:23:22 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 17 Apr 2012 16:23:22 +0100
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
Message-ID:

On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock wrote:
>
> Here are some things that I think are strong
> candidates for 1.60 (not an exclusive list!)
>
> ...
>
> BGZF support: Low level module like Python's gzip,
> support in SeqIO for indexing BGZF compressed files,
> ...

I've just rebased my bgzf branch, which I think is ready to apply to the
trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
https://github.com/peterjc/biopython/tree/bgzf2

Would anyone like to review this please?
There are unittests and plenty of docstrings - but so far nothing
in the Tutorial though.

I wrote a blog post late last year explaining what this allows,
and this branch includes the changes to Bio.SeqIO to index
BGZF compressed sequence files that the post discussed:
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

The probable next step after this is combining it with Andrew
Sczesnak's work on indexing MAF files (they can get pretty
big) as explored by 'I.J.' (who as far as I know hasn't signed up
to the biopython-dev list, BCC'd).

Also it would be interesting to explore doing the (de)compression
of blocks on worker threads to take advantage of multiple cores.

Another idea would be to switch from a plain dictionary to an
ordered dictionary for holding cached decompressed blocks,
giving a way to drop the oldest block (although not perhaps as
clever as dropping the least recently used block, the overhead
is lower). That would require including our own OrderedDict
class on the older Python platforms.

Peter

[*] PyPy testing is complicated by running out of file handles,
an existing issue not something directly from this work. Part
of this is down to different GC under PyPy.

From eric.talevich at gmail.com Tue Apr 17 15:25:35 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 11:25:35 -0400
Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo
In-Reply-To: <4F8CA17D.4080907@med.nyu.edu>
References: <4F8C4D69.4040009@med.nyu.edu> <4F8CA17D.4080907@med.nyu.edu>
Message-ID:

Andrew,

It would be useful to have a quick and portable function for
distance-based tree estimation in Bio.Phylo, since otherwise it's
necessary to use one of the wrappers for external programs in
Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
the hierarchical clustering algorithm in Bio.Cluster correspond to any
common tree-estimation algorithm, e.g. UPGMA?
If so, then it would make a lot of sense to provide the glue for using it that way. If you have done some work in this direction, I would be happy to see it. -Eric On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak wrote: > Eric, > > I can describe two use cases from my own experience. First, the MAF parser > I've been working on can pull the multiple alignment of some gene between a > bunch of genomes. Thinking of recipes for the cookbook, I thought it would > be neat to walk the user through constructing a distance matrix by hand > (though you're right--more could be done to support this), clustering with > Bio.Cluster and visualizing the result with Bio.Phylo. I like this example > because it integrates several different parts of BioPython along with a > lesson about inferring distances between sequences. > > Second, for another project, I've been generating distance matrices based on > the shared gene content of bacterial genomes and the presence-or-absence of > orthologous groups in each. Presently, I ferry the matrices to a clustering > program and then visualize the resulting trees in yet another tool. Looking > into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and > the incompatibility of their tree objects. > > I wonder, what would be the most elegant way of bridging the gap? > > > Best, > Andrew > From bioinformed at gmail.com Tue Apr 17 16:11:37 2012 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Tue, 17 Apr 2012 12:11:37 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: Message-ID: On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock wrote: > On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock > wrote: > > > > Here are some things that I think are strong > > candidates for 1.60 (not an exclusive list!) > > > > ... > > > > BGZF support: Low level module like Python's gzip, > > support in SeqIO for indexing BGZF compressed files, > > ... 
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
>

Hi Peter,

I've implemented code to create BAM/tabix style index files and perform
lookups, so it has been high on my list to test and validate your BGZF code
(rather than having to write my own). I'm notoriously short on time, but
this is in the critical path for several projects and I'm going to work on
it over the next week or so.

-Kevin

From redmine at redmine.open-bio.org Wed Apr 18 01:29:29 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Apr 2012 01:29:29 +0000
Subject: [Biopython-dev] [Biopython - Bug #3333] PhyloXML writer fails to include is_aligned attribute with mol_seq elements
References:
Message-ID:

Issue #3333 has been updated by Eric Talevich.

The answer is: I'm an idiot. The mol_seq attribute was first defined as a
complex attribute in the writer (via _handle_complex), but then further
down redefined as a simple attribute.

Fix:
https://github.com/biopython/biopython/commit/a93c9892268274c4969131a1d401bb8ee235524a

----------------------------------------
Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements
https://redmine.open-bio.org/issues/3333

Author: Eric Talevich
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

First reported here:
http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html

Steps to reproduce:

1. Load a tree, convert to PhyloXML
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
2. Add a sequence
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
3. Verify that the sequence information has been set -- mol_seq has is_aligned set
print tree
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
print tree.format('phyloxml')
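...
<clade>
  <name>c</name>
  <branch_length>1.0</branch_length>
  <sequence type="dna">
    <mol_seq>AAA</mol_seq>
  </sequence>
</clade>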
...
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Apr 18 01:52:03 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Apr 2012 01:52:03 +0000 Subject: [Biopython-dev] [Biopython - Bug #3333] (Closed) PhyloXML writer fails to include is_aligned attribute with mol_seq elements References: Message-ID: Issue #3333 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 ---------------------------------------- Bug #3333: PhyloXML writer fails to include is_aligned attribute with mol_seq elements https://redmine.open-bio.org/issues/3333 Author: Eric Talevich Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: First reported here: http://lists.open-bio.org/pipermail/biopython/2012-March/007847.html Steps to reproduce: 1. Load a tree, convert to PhyloXML
from Bio import Phylo
from StringIO import StringIO
tree = Phylo.read(StringIO('(a,(b,c));'), 'newick').as_phyloxml()
2. Add a sequence
from Bio.Phylo import PhyloXML as PX
tree.clade[0].sequences = [PX.Sequence(type='dna', mol_seq=PX.MolSeq('AAA', is_aligned=False))]
3. Verify that the sequence information has been set -- mol_seq has is_aligned set
print tree
Clade(branch_length=1.0, name='a')
    Sequence(type='dna')
        MolSeq(value='AAA', is_aligned=False)
4. View the PhyloXML representation -- mol_seq is missing the is_aligned attribute!
print tree.format('phyloxml')
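...
<clade>
  <name>c</name>
  <branch_length>1.0</branch_length>
  <sequence type="dna">
    <mol_seq>AAA</mol_seq>
  </sequence>
</clade>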
...
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Apr 19 04:27:49 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 19 Apr 2012 04:27:49 +0000 Subject: [Biopython-dev] [Biopython - Feature #3342] (New) Phylo.root_with_outgroup: set the length of the outgroup branch Message-ID: Issue #3342 has been reported by Eric Talevich. ---------------------------------------- Feature #3342: Phylo.root_with_outgroup: set the length of the outgroup branch https://redmine.open-bio.org/issues/3342 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: Add an option to the root_with_outgroup method to specify the length of the branch leading from the new root to the outgroup. This should not change the total tree length, i.e. this length is subtracted from the branch on the other side of the root. This option makes it possible to root the tree in other ways that split the outgroup branch, leaving a bifurcating rather than trifurcating root. I've attached a patch that implements this feature, plus unit tests for it. HOWEVER: A sane API for this method would look like: >>> tree.root_with_outgroup("apple", "orange", outgroup_branch_length=0.4) The original function definition included *args for specifying the outgroup taxa in one shot (instead of requiring a separate call to common_ancestor). But while Python 3 permits keyword-only arguments (a defined keyword argument after *args or just *), Python 2 does not. So I made the function calling style shown above work in a very weird way: the function definition has **kwargs instead of outgroup_branch_length=None, and the necessary keyword argument is pulled out of kwargs inside the body of the function. 
The name of this argument is given in the docstring, so it's still partly discoverable. Are we cool with this? Or, can anyone think of a better way to handle this? ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Fri Apr 20 08:39:02 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Apr 2012 09:39:02 +0100 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: I've had a quick look on GitHub and it isn't obvious to me how to get pull request emails CC'd to our dev mailing list... but anyway, Lenna has been busy: Peter ---------- Forwarded message ---------- From: Lenna Peterson Date: Thu, Apr 19, 2012 at 11:35 PM Subject: [biopython] Feature: Python implementation of MMCIF parser (#33) To: Peter Cock I've written a PLY (Python lex-yacc) module that is superimposable with the C MMCIF module. I've also partially rewritten the C MMCIF module to be object-oriented. ### Changed files ### * MMCIFlexmodule.c: Now object-oriented (open file in constructor, close file in destructor, etc). Docstrings! Added file IO exception. * MMCIF2Dict.py: Minor changes for new object oriented API * MMCIFParser: Changed all uses of map() to list comprehensions (more compatible with 3) ### New files ### * MMCIFlex.py: PLY-based module for tokenizing input. ### What it needs ### Addition of PLY dependency to setup.py. I'm not quite sure how to handle this, as PLY wouldn't be necessary on a platform with C Python. Thoughts? Which non-CPython implementations are worth testing? New C module tested on Python 2.6 on Mac OS X and Debian. I hope it still works on Windows. 
On my machine, the C module processes a 30,000 line test file in 10-15 ms; the Python module takes ~150 ms. You can merge this Pull Request by running: ?git pull https://github.com/lennax/biopython MMCIF2 Or you can view, comment on it, or merge it online at: ?https://github.com/biopython/biopython/pull/33 -- Commit Summary -- * Ply test in progress. * Quoted values with spaces are being broken. * Removed hard inclusion of ply. * Fixed quoted strings with spaces. * Changed Parser call to 2Dict. Semicolons break. * Changed Parser call to 2Dict. Semicolons break. * Lexes full file w/o error, FIXME loops * Tweak: comment handling * Changed token "NAME" to "TAG" * Using IUCr grammar. FIXME quote/semi * Fixed quoted strings. * Semicolon text field fixed, FIXME included \n * Fixed semi newlines. * non-eol temp fix, doesn't match single chars * Lexes full CIF file with no noticed errors. * Added timing. * Added states to lexer. * Lex loops into [header, [items], ...]; \d hacks. * Enforced semicolon rule. * Yacc works. * Re-added values to lexer state 'loop' * FIXME syntax error/hangs on full file. * Lexer gathers values, added parse precedence. * Minor lex cleanup. * Testing exclusionary lex redo. * Streamlined rules, no loop yet. * Still won't yacc 30k line file. * Merge branch 'master' of git://github.com/biopython/biopython into ply2 * Added __name__ __main__ check. * Parser redo, still doesn't parse 30k line file. * Added comments to tokenizer. * Fixed lex module's callability from yacc. * Fixed DATA token failure. * Multiple improvements, still no 30k. * Moved lexer arguments to constructor. * Moved data input to constructor, added docs * Validated to pep8. * Merge branch 'master' of git://github.com/biopython/biopython into ply2 * Add MMCIF2Dict from ply branch. * Remove flex header dependency of CIF parser. * Update MMCIFParser call of MMCIF2Dict. * PLY lexer works with MMCIF2Dict. * Cleanup. * Cleaned up import. * Updated docstring. * Subclassed dict. 
* Restored MMCIFParser call to MMCIF2Dict. * Removed main() from lex input. * Restored newline. * Fix C prototype warnings. * Modifying python lexer to be substitutable w/ C. * Make header for generated C. * Import C lexer or Python lexer. * Improvements and documentation. * Uncomment GLOBAL token definition. * PLY lexer and C lexer should be interchangeable. * Improve error reporting of import. * Turn on ply lex optimize. * Call instance of Python lexer. * Working on implementing class in C module. * Start unit test for MMCIF. * Minimal unit test for MMCIFParser. * Revert to old generated C; manually added noyywrap * Manually added function prototypes to generated C. * Merge branch 'ply2' into dev * Merge branch 'ply' into dev * Merge branch 'c-dev' into dev * Merge branch 'master' of git://github.com/biopython/biopython into dev * Cleaning up old files. * More cleanup. * Merging Parser from MMCIFlex branch. * Parser and unit test for PyCIFRW * Python and C lexer APIs are now identical. * Add copyright and license notices. * Merge branch 'master' of git://github.com/biopython/biopython into dev * Trying GnuWin32 flex-generated C. * Win flex generated with new mmcif.lex * GnuWin32 flex generated C, used dos2unix for CRLF * Added correct author to flex C module. * Merge branch 'master' of git://github.com/biopython/biopython into dev * Merge branch 'master' of git://github.com/biopython/biopython into dev * Change map() to list comprehensions for 3 compat. * Renamed python lexer to match C module. * Added file IO exception to C module. * Tweak lexer module import. * Prep Python CIF lexer for pull request. * Whitespace tweaks. 
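The "Import C lexer or Python lexer" step in the commit list above can be sketched roughly as follows. This is an assumption about the general pattern, not the code in the pull request; the helper name and the module names in the usage line are hypothetical:

```python
import importlib


def load_first_available(module_names):
    """Return the first module in module_names that imports cleanly."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as err:
            # Remember why each candidate failed so the final error is useful.
            errors.append("%s: %s" % (name, err))
    raise ImportError("no lexer backend available (%s)" % "; ".join(errors))


# Hypothetical usage: prefer a compiled backend, fall back to pure Python.
# Here "json" merely stands in for a module that is known to exist.
backend = load_first_available(["_hypothetical_c_lexer", "json"])
```

Listing the compiled module first keeps the fast path on CPython while letting Jython or PyPy fall through to the pure-Python implementation.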
-- File Changes -- M Bio/PDB/MMCIF2Dict.py (20) M Bio/PDB/MMCIFParser.py (8) A Bio/PDB/mmCIF/MMCIFlex.py (253) M Bio/PDB/mmCIF/MMCIFlexmodule.c (122) -- Patch Links -- ?https://github.com/biopython/biopython/pull/33.patch ?https://github.com/biopython/biopython/pull/33.diff --- Reply to this email directly or view it on GitHub: https://github.com/biopython/biopython/pull/33 From andrew.sczesnak at med.nyu.edu Fri Apr 20 22:28:43 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 20 Apr 2012 18:28:43 -0400 Subject: [Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo In-Reply-To: References: <4F8C4D69.4040009@med.nyu.edu> <4F8CA17D.4080907@med.nyu.edu> Message-ID: <4F91E31B.9030101@med.nyu.edu> Eric, If my understanding is correct, UPGMA is slang for agglomerative average-linkage hierarchical clustering which is implemented along with single- and complete-linkage in the module. There's no equivalent of neighbor-joining or maximum-likelihood and Bio.Cluster probably isn't that fast with large numbers of nodes so wrappers are still useful. We could probably add an NJ implementation for small matrices pretty easily if you think it's worthwhile. Either way, the glue could be useful for visualizing relationships between genes/samples in microarrays (what I gather Bio.Cluster is intended for). Andrew On 04/17/2012 11:25 AM, Eric Talevich wrote: > Andrew, > > It would be useful to have a quick and portable function for > distance-based tree estimation in Bio.Phylo, since otherwise it's > necessary to use one of the wrappers for external programs in > Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does > the hierarchical clustering algorithm in Bio.Cluster correspond to any > common tree-estimation algorithm, e.g. UPGMA? If so, then it would > make a lot of sense to provide the glue for using it that way. If you > have done some work in this direction, I would be happy to see it. 
> > -Eric > > > On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak > wrote: >> Eric, >> >> I can describe two use cases from my own experience. First, the MAF parser >> I've been working on can pull the multiple alignment of some gene between a >> bunch of genomes. Thinking of recipes for the cookbook, I thought it would >> be neat to walk the user through constructing a distance matrix by hand >> (though you're right--more could be done to support this), clustering with >> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example >> because it integrates several different parts of BioPython along with a >> lesson about inferring distances between sequences. >> >> Second, for another project, I've been generating distance matrices based on >> the shared gene content of bacterial genomes and the presence-or-absence of >> orthologous groups in each. Presently, I ferry the matrices to a clustering >> program and then visualize the resulting trees in yet another tool. Looking >> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and >> the incompatibility of their tree objects. >> >> I wonder, what would be the most elegant way of bridging the gap? >> >> >> Best, >> Andrew >> From andrew.sczesnak at med.nyu.edu Fri Apr 20 22:35:59 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 20 Apr 2012 18:35:59 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: Message-ID: <4F91E4CF.8040602@med.nyu.edu> Peter, My colleague was writing some code using MafIndex and commented how long it took her to download, decompress and index the human multiz alignments from UCSC. It seems like it'd be great to keep the files compressed... perhaps if the code works well enough we can convince UCSC to host bgzip'd copies (or maybe them available on one of our institutions servers). Is I.J. interested in joining the community? 
I'd like to look into adding BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could you put me in touch?

Andrew

On 04/17/2012 11:23 AM, Peter Cock wrote:
> On Sat, Feb 18, 2012 at 9:39 AM, Peter Cock wrote:
>>
>> Here are some things that I think are strong
>> candidates for 1.60 (not an exclusive list!)
>>
>> ...
>>
>> BGZF support: Low level module like Python's gzip,
>> support in SeqIO for indexing BGZF compressed files,
>> ...
>
> I've just rebased my bgzf branch, which I think is ready to apply to the
> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
> https://github.com/peterjc/biopython/tree/bgzf2
>
> Would anyone like to review this please? There are unittests and
> plenty of docstrings - but so far nothing in the Tutorial though.
>
> I wrote a blog post late last year explaining what this allows, and
> this branch includes the changes to Bio.SeqIO to index BGZF
> compressed sequence files this discussed:
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
>
> The probable next step after this is combining it with Andrew Sczesnak's
> work on indexing MAF files (they can get pretty big) as explored by 'I.J.'
> (who as far as I know hasn't signed up to the biopython-dev list, BCC'd).
>
> Also it would be interesting to explore doing the (de)compression of
> blocks on worker threads to take advantage of multiple cores.
>
> Another idea would be to switch from a plain dictionary to an
> ordered dictionary for holding cached decompressed blocks,
> giving a way to drop the oldest block (although not perhaps as
> clever as dropping the least recently used block, the overhead is
> lower). That would require including our own OrderedDict class
> on the older Python platforms.
>
> Peter
>
> [*] PyPy testing is complicated by running out of file handles,
> an existing issue not something directly from this work. Part
> of this is down to different GC under PyPy.
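The ordered-dictionary idea in the quoted message can be sketched as below. This is only an illustration under stated assumptions: the class name, the `max_blocks` parameter and keying blocks by their start offset are all hypothetical, and nothing here is taken from the bgzf branch itself.

```python
from collections import OrderedDict


class BlockCache(object):
    """Hypothetical bounded cache that drops the oldest inserted block first."""

    def __init__(self, max_blocks=100):
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()

    def get(self, start_offset):
        # Plain lookup: the order is NOT updated on access, so this is
        # first-in-first-out rather than least-recently-used.
        return self._blocks.get(start_offset)

    def put(self, start_offset, data):
        if (start_offset not in self._blocks
                and len(self._blocks) >= self.max_blocks):
            # popitem(last=False) removes the entry that was inserted first.
            self._blocks.popitem(last=False)
        self._blocks[start_offset] = data


cache = BlockCache(max_blocks=2)
cache.put(0, b"first block")
cache.put(100, b"second block")
cache.put(200, b"third block")  # evicts the block stored at offset 0
```

Eviction here costs one `popitem` call per insert once the cache is full. A least-recently-used variant would also have to reorder entries on every `get`, which is the extra overhead the message alludes to.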
From arklenna at gmail.com Sat Apr 21 00:57:21 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 20 Apr 2012 20:57:21 -0400 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Fri, Apr 20, 2012 at 4:39 AM, Peter Cock wrote: > I've had a quick look on GitHub and it isn't obvious to me how to get > pull request emails CC'd to our dev mailing list... but anyway, Lenna > has been busy: > > Peter > > ---------- Forwarded message ---------- > From: Lenna Peterson > > Date: Thu, Apr 19, 2012 at 11:35 PM > Subject: [biopython] Feature: Python implementation of MMCIF parser (#33) > To: Peter Cock > > > I've written a PLY (Python lex-yacc) module that is superimposable > with the C MMCIF module. > > I've also partially rewritten the C MMCIF module to be object-oriented. > > ### Changed files ### > > * MMCIFlexmodule.c: Now object-oriented (open file in constructor, > close file in destructor, etc). Docstrings! Added file IO exception. > * MMCIF2Dict.py: Minor changes for new object oriented API > * MMCIFParser: Changed all uses of map() to list comprehensions (more > compatible with 3) > > ### New files ### > > * MMCIFlex.py: PLY-based module for tokenizing input. > > ### What it needs ### > Addition of PLY dependency to setup.py. > I'm not quite sure how to handle this, as PLY wouldn't be necessary on > a platform with C Python. Thoughts? Which non-CPython implementations > are worth testing? > > > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it > still works on Windows. > On my machine, the C module processes a 30,000 line test file in 10-15 > ms; the Python module takes ~150 ms. I've started testing the PLY lexer on PyPy. NumPyPy now implements more functions needed by PDB; the only things I found to be missing are random and linalg. This eliminates Superimposer, FragmentMapper, and Vector. 
I played around with trying to spoof "import numpy" to automatically import numpypy (code here: https://gist.github.com/2432815) but I don't think that's wise yet. My last commit to this branch was a few changes to allow the MMCIF parser to work on NumPy. PyPy won't run `setup.py test` due to global numpy failure, but if I install this branch and `pypy test_MMCIF.py`, it passes. Anybody with more PyPy and/or package structuring experience have thoughts? Lenna From p.j.a.cock at googlemail.com Sat Apr 21 10:32:33 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 21 Apr 2012 11:32:33 +0100 Subject: [Biopython-dev] [biopython] Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Saturday, April 21, 2012, Lenna Peterson wrote: > > > ### What it needs ### > > Addition of PLY dependency to setup.py. > > I'm not quite sure how to handle this, as PLY wouldn't be necessary on > > a platform with C Python. Thoughts? Which non-CPython implementations > > are worth testing? Basically Jython (which we've tried to support for a while) and PyPy (which I would like to officially support in future). Although a pure python setup can be useful in other settings, e.g. Windows development without the compilers otherwise needed. However, neither of those have NumPy (yet), which we need for the PDB module that would use the MMCIF parser. > > > New C module tested on Python 2.6 on Mac OS X and Debian. I hope it > > still works on Windows. > > On my machine, the C module processes a 30,000 line test file in 10-15 > > ms; the Python module takes ~150 ms. That's a factor of ten slower, but still sounds fast enough perhaps that we don't really need the C code for usability. > > I've started testing the PLY lexer on PyPy. NumPyPy now implements > more functions needed by PDB; the only things I found to be missing > are random and linalg. This eliminates Superimposer, FragmentMapper, > and Vector. 
>
> I played around with trying to spoof "import numpy" to automatically
> import numpypy (code here: https://gist.github.com/2432815) but I
> don't think that's wise yet.
>
> My last commit to this branch was a few changes to allow the MMCIF
> parser to work on NumPy. PyPy won't run `setup.py test` due to global
> numpy failure, but if I install this branch and `pypy test_MMCIF.py`,
> it passes.
>
> Anybody with more PyPy and/or package structuring experience have thoughts?

I filed a few bugs on missing code in PyPy's NumPy re-implementation (now called numpypy), good to hear they are getting closer to being enough for us to run Bio.PDB on it.

Thank you for exploring this. Right now, in your shoes, for MMCIF parsing I would focus on the parser failures with certain input files - there is an open bug on RedMine
https://redmine.open-bio.org/issues/2626

and the issue of multiple models (Eric can probably advise here),
https://redmine.open-bio.org/issues/2943

And I must close this bug now that your earlier work has been checked in -
https://redmine.open-bio.org/issues/2619

Thanks!

Peter

From redmine at redmine.open-bio.org Sat Apr 21 10:39:15 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 21 Apr 2012 10:39:15 +0000
Subject: [Biopython-dev] [Biopython - Bug #2619] (Closed) Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py
References:
Message-ID:

Issue #2619 has been updated by Peter Cock.
Status changed from New to Closed % Done changed from 0 to 100 Fixed with Lenna's work - see this commit and its parents: https://github.com/biopython/biopython/commit/e5ebb85d0614a34e59e7c2118a366512dc4d1320 ---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes need to enable are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Apr 21 18:05:01 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 21 Apr 2012 18:05:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #2626] Bio.PDB mmCIFParser parse exceptions References: Message-ID: Issue #2626 has been updated by Lenna Peterson. File mmCifParseCheck.py added I've attempted to rescue this code from overzealous "text formatting". Attached version appeared to work on one test file; haven't tested the example broken files yet. 
----------------------------------------
Bug #2626: Bio.PDB mmCIFParser parse exceptions
https://redmine.open-bio.org/issues/2626

Author: Chris Oldfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Other
Target version: 1.48
URL:

I recently ran the mmCIFParser object over all of PDB's mmCIF files and found a large number of files failed to parse correctly (a short script at the end to demonstrate). Of ~50k mmCIF files, 3891 files failed to parse and another 1980 were missing fields in the mmCIF dictionary.

A few examples of files that failed to parse:
http://www.rcsb.org/pdb/files/1alw.cif.gz
http://www.rcsb.org/pdb/files/1det.cif.gz
http://www.rcsb.org/pdb/files/1tmy.cif.gz

A few with missing fields:
http://www.rcsb.org/pdb/files/1mfl.cif.gz
http://www.rcsb.org/pdb/files/1tfj.cif.gz
http://www.rcsb.org/pdb/files/1zn8.cif.gz

The problem seems to be that an error in one mmCIF table, like an extra field, propagates through the rest of the parse.

x86_64 gentoo linux 2008, src BioPython install

__CODE__
import sys
from Bio.PDB import *

if len(sys.argv) != 2:
    print "usage: mmCifParseCheck.py "
    sys.exit(0)

structFile = sys.argv[1]
resultString = "";

#parse to structure object
numRes = 0
parser=MMCIFParser()
try:
    structure=parser.get_structure('test',structFile)
    for model in structure:
        for chain in model:
            for residue in chain:
                if(residue.id[0][:2] != "H_"):
                    numRes += 1
except:
    resultString += "parse to structure object failed\n";
else:
    resultString += "parse to structure object succeeded\n";

#parse whole mmCIF file to dict
try:
    mmcif_dict=MMCIF2Dict.MMCIF2Dict(structFile)
except:
    resultString += "parse to dict failed\n";
else:
    resultString += "parse to dict succeeded\n";

#get a required entry
try:
    id = mmcif_dict['_entry.id']
except:
    resultString += "key lookup failed\n";
else:
    resultString += "key lookup succeeded\n";

print resultString
print "number of non-het residues " + str(numRes)

--
You have received this notification because
you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Apr 21 18:16:07 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 21 Apr 2012 18:16:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Lenna Peterson. Did this commit close this bug? https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: In Progress Priority: Normal Assignee: Konstantin Okonechnikov Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From arklenna at gmail.com Sun Apr 22 06:48:10 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 22 Apr 2012 02:48:10 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) Message-ID: I've implemented the parser changes (written by Paul Bathen; see bug report) to allow the MMCIF parser to handle multiple models. Models are now accessed by a string key of their model number, rather than an arbitrary index (structure['1'] versus structure[0]). 
I updated the MMCIF unit test for the new model access method and added a test file with multiple models. I'm not sure if there is documentation to be updated re: accessing the models. issue: https://redmine.open-bio.org/issues/2943 pull request: https://github.com/biopython/biopython/pull/34 - Lenna From MatatTHC at gmx.de Sun Apr 22 10:06:28 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sun, 22 Apr 2012 12:06:28 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Hi, since this bug seems to be of low priority I decided to try my best to help a bit and search the web a bit. It seems that the property is stored in PrimarySeq or Seq in bioperl. See for instance: http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm Or also: http://bugzilla.open-bio.org/show_bug.cgi?id=2578 This seems to be realised as boolean variable or function. Regards, Matthias 2012/4/4 Matthias Bernt : > Hi, > > are there any news on this? May I help somehow? But I have to admit > that I barely speak perl and have no experience with bioperl. If > someone tells me where to look I might still try it. > > Matthias > > 2012/3/29 Peter Cock : >> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >>> Hi, >>> >>> Is it possible to get the property if a genome is circular / linear >>> from SeqIO applied to genbank files? I could not find it. >>> >>> There is also a related bugreport: >>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >>> >>> I used the old parser before and switched to SeqIO which I really like >>> for the possibilities to parse different formats... but I really need >>> the information. >> >> Does anyone happen to have a BioPerl + BioSQL setup installed >> and working? IIRC checking that to make sure however we >> store the circular was compatible was the only real hurdle. 
>> >> Peter From redmine at redmine.open-bio.org Sun Apr 22 18:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:46:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Eric Talevich. Assignee deleted (Konstantin Okonechnikov) Yes it did, thanks. I'll close this bug now. ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: In Progress Priority: Normal Assignee: Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Apr 22 18:48:39 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:48:39 +0000 Subject: [Biopython-dev] [Biopython - Bug #2951] (Closed) PDBParser assigns model 0 to first model no matter what... References: Message-ID: Issue #2951 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 Closed with this commit, as pointed out just now by Lenna Peterson: https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 ---------------------------------------- Bug #2951: PDBParser assigns model 0 to first model no matter what... 
https://redmine.open-bio.org/issues/2951 Author: TallPaul empty Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.52 URL: I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 then increments that. This means someone thinking they are studying model X is actually studying X+1, and that assumes that authors always use sequential model numbers without skips. If authors CAN skip model number, ie, MODEL 2, then MODEL 4, then MODEL 5... then in biopython these be models 0,1, and 2 in the structure... yuck. If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists. See lines 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106 Paul -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Apr 22 18:49:43 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 22 Apr 2012 18:49:43 +0000 Subject: [Biopython-dev] [Biopython - Bug #2950] (Closed) Bio.PDBIO.save writes MODEL records without model id References: Message-ID: Issue #2950 has been updated by Eric Talevich. Status changed from In Progress to Closed % Done changed from 20 to 100 Closed the blocker, too. Thanks again to Konstantin. ---------------------------------------- Bug #2950: Bio.PDBIO.save writes MODEL records without model id https://redmine.open-bio.org/issues/2950 Author: Barry Finzel Status: Closed Priority: Normal Assignee: Category: Main Distribution Target version: Not Applicable URL: The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. 
Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From arklenna at gmail.com Mon Apr 23 05:35:23 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 23 Apr 2012 01:35:23 -0400 Subject: [Biopython-dev] pull request: Bio.SCOP.Raf chem dict updater Message-ID: I've adapted Hongbo Zhu's code to extract the three to one letter codes directly from the PDB Chemical Component dictionary. Existing calls of `from Raf import to_one_letter_code` should work as expected. pull request: https://github.com/biopython/biopython/pull/35 issue: https://redmine.open-bio.org/issues/3169 Lenna From redmine at redmine.open-bio.org Mon Apr 23 17:00:15 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 23 Apr 2012 17:00:15 +0000 Subject: [Biopython-dev] [Biopython - Bug #2943] (Closed) MMCIFParser only handling a single model. References: Message-ID: Issue #2943 has been updated by Peter Cock. Status changed from New to Closed % Done changed from 0 to 100 This should be working on the trunk now ready for Biopython 1.60 - thanks Lenna. See this commit and those preceding it: https://github.com/biopython/biopython/commit/2ac67cd14682a4bbad9e09654485914f9495138d If we've missed anything please reopen this bug. Thanks Paul! ---------------------------------------- Bug #2943: MMCIFParser only handling a single model. https://redmine.open-bio.org/issues/2943 Author: TallPaul empty Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.52 URL: MMCIFParser as-written only handles a single model in a protein. 
Any protein that has multiple models with repeating chains and residues will get an exception since the residue ID will already exist. Please make the following changes in MMCIFParser.py:

Change the __doc__ setting:

    #Optional __DOC__ change if the new MMCIFlex is not used nor the changes
    #to MMCIF2Dict based on the new MMCIFlex.
    #Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
    __doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Regardless of the DOC changes: Insert the following model_list line

    occupancy_list=mmcif_dict["_atom_site.occupancy"]
    fieldname_list=mmcif_dict["_atom_site.group_PDB"]
    #Added by Paul T. Bathen Nov 2009
    model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
    try:

Make the following changes:

    #Modified by Paul T. Bathen Nov 2009: comment out this line
    #current_model_id=0
    structure_builder=self._structure_builder
    structure_builder.init_structure(structure_id)
    #Modified by Paul T. Bathen Nov 2009: comment out this line
    #structure_builder.init_model(current_model_id)
    structure_builder.init_seg(" ")
    #Added by Paul T. Bathen Nov 2009
    current_model_id = -1

Make the following changes in the for loop:

    #Note by Paul T. Bathen: should MMCIFParser include
    #the HOH and WAT stmts in PDBParser immediately below?
    #if fieldname=="HETATM":
    #    if resname=="HOH" or resname=="WAT":
    #        hetero_flag="W"
    #    else:
    #        hetero_flag="H"
    if fieldname=="HETATM":
        hetatm_flag="H"
    else:
        hetatm_flag=" "
    #Added by Paul T. Bathen Nov 2009
    model_id = model_list[i]
    if current_model_id != model_id:
        current_model_id = model_id
        structure_builder.init_model(current_model_id)
    #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed with the same number of models, chains, and residues.

Paul

-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon Apr 23 17:02:01 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Apr 2012 18:02:01 +0100 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson wrote: > I've implemented the parser changes (written by Paul Bathen; see bug > report) to allow the MMCIF parser to handle multiple models. > > Models are now accessed by a string key of their model number, rather > than an arbitrary index (structure['1'] versus structure[0]). > > I updated the MMCIF unit test for the new model access method and > added a test file with multiple models. > > I'm not sure if there is documentation to be updated re: accessing the models. > > issue: https://redmine.open-bio.org/issues/2943 > pull request: https://github.com/biopython/biopython/pull/34 I've applied that to the trunk, thank you, but on reading this, why are the model keys strings and not integers? Does MMCIF allow odd keys or something? Peter From eric.talevich at gmail.com Mon Apr 23 20:10:27 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 23 Apr 2012 16:10:27 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Mon, Apr 23, 2012 at 1:02 PM, Peter Cock wrote: > On Sun, Apr 22, 2012 at 7:48 AM, Lenna Peterson wrote: >> I've implemented the parser changes (written by Paul Bathen; see bug >> report) to allow the MMCIF parser to handle multiple models. >> >> Models are now accessed by a string key of their model number, rather >> than an arbitrary index (structure['1'] versus structure[0]). >> >> I updated the MMCIF unit test for the new model access method and >> added a test file with multiple models. 
>> >> I'm not sure if there is documentation to be updated re: accessing the models. >> >> issue: https://redmine.open-bio.org/issues/2943 >> pull request: https://github.com/biopython/biopython/pull/34 > > I've applied that to the trunk, thank you, but on reading this, why are the > model keys strings and not integers? Does MMCIF allow odd keys or > something? > Ack, I didn't look at that closely enough. Check out this patch to see the current situation: https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 The models associated with a structure are numbered with a sequential integer id, starting from 0. It's always been like that in our PDB parser and we haven't changed it. To ensure that model numbers specified in the PDB file are preserved when writing the PDB back to file, the above patch introduced a new attribute on the Model object called serial_num (also an integer, equal to model.id unless specified otherwise). That attribute is only used when writing a new PDB file; Model.__getitem__ still uses Model.id as before. Perhaps that's surprising now that we read the serial numbers, but it kept backward compatibility. Plus, it preserves list-like behavior (item access via integers), even though the models are actually stored in a dict. So! In the mmCIF parser, the calls to structure_builder.init_model should be given two arguments instead of one: an integer id counting from 0, and then another integer (probably) containing the model "serial number" specified in the mmCIF file. In the event that an mmCIF file doesn't specify the model number, the serial number should be the same as the sequential id. Cool? This will also help us convert between PDB and mmCIF formats in the future. As for accessing the models by their serial number, using string keys seems like an effective workaround, but still obviously a workaround rather than an ideal situation. 
Let's discuss that a little more, perhaps file another bug when we've reached some consensus. Best, Eric From eric.talevich at gmail.com Mon Apr 23 20:32:11 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 23 Apr 2012 16:32:11 -0400 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID: On Fri, Apr 20, 2012 at 8:57 PM, Lenna Peterson wrote: > > I've started testing the PLY lexer on PyPy. NumPyPy now implements > more functions needed by PDB; the only things I found to be missing > are random and linalg. This eliminates Superimposer, FragmentMapper, > and Vector. > > I played around with trying to spoof "import numpy" to automatically > import numpypy (code here: https://gist.github.com/2432815) but I > don't think that's wise yet. > > My last commit to this branch was a few changes to allow the MMCIF > parser to work on NumPy. PyPy won't run `setup.py test` due to global > numpy failure, but if I install this branch and `pypy test_MMCIF.py`, > it passes. > > Anybody with more PyPy and/or package structuring experience have thoughts? 
>
> Lenna

Would it be more or less error-prone to simply replace every numpy import with this (after testing each module on PyPy):

    try:
        import numpy
    except:
        import numpypy as numpy

Or similarly, use this as one of our compatibility utilities:

    from Bio import numpy
    # Some conditional junk in Bio/__init__.py or setup.py to reveal this module to PyPy and CPython as needed

In either case, here's the relatively short list of modules that would need to be modified:

    Bio/Affy/CelFile.py
    Bio/Cluster/__init__.py
    Bio/KDTree/KDTree.py
    Bio/LogisticRegression.py
    Bio/MarkovModel.py
    Bio/MaxEntropy.py
    Bio/NaiveBayes.py
    Bio/PDB/Atom.py
    Bio/PDB/FragmentMapper.py
    Bio/PDB/MMCIFParser.py
    Bio/PDB/NeighborSearch.py
    Bio/PDB/PDBParser.py
    Bio/PDB/ResidueDepth.py
    Bio/PDB/Superimposer.py
    Bio/PDB/Vector.py
    Bio/SVDSuperimposer/SVDSuperimposer.py
    Bio/Statistics/lowess.py
    Bio/SubsMat/__init__.py
    Bio/kNN.py

From p.j.a.cock at googlemail.com Mon Apr 23 20:47:02 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Apr 2012 21:47:02 +0100 Subject: [Biopython-dev] Fwd: Feature: Python implementation of MMCIF parser (#33) In-Reply-To: References: Message-ID:

On Mon, Apr 23, 2012 at 9:32 PM, Eric Talevich wrote:
>
> Would it be more or less error-prone to simply replace every numpy
> import with this (after testing each module on PyPy):
>
> try:
>     import numpy
> except:
>     import numpypy as numpy
>

Maybe, but right now do any of our NumPy using modules pass under PyPy? I don't believe so... but I haven't tried a PyPy nightly build lately.

It was unfortunate that originally PyPy's micronumpy pretended to be numpy, so that you'd write "import numpy" and think it worked but be surprised later when something fundamental like the dot function was missing, or 2D arrays. That led to a few nasty try/import lines in our unit tests.

Let's wait and see how PyPy's numpy support improves before rushing to change any of our numpy imports.
I am hopeful that Bio.PDB will be fine in their next release, whereas things using the NumPy C API will probably not be. Peter From arklenna at gmail.com Mon Apr 23 23:05:03 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 23 Apr 2012 19:05:03 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: > > Ack, I didn't look at that closely enough. Check out this patch to see > the current situation: > https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > > The models associated with a structure are numbered with a sequential > integer id, starting from 0. It's always been like that in our PDB > parser and we haven't changed it. To ensure that model numbers > specified in the PDB file are preserved when writing the PDB back to > file, the above patch introduced a new attribute on the Model object > called serial_num (also an integer, equal to model.id unless specified > otherwise). That attribute is only used when writing a new PDB file; > Model.__getitem__ still uses Model.id as before. > > Perhaps that's surprising now that we read the serial numbers, but it > kept backward compatibility. Plus, it preserves list-like behavior > (item access via integers), even though the models are actually stored > in a dict. > > So! > > In the mmCIF parser, the calls to structure_builder.init_model should > be given two arguments instead of one: an integer id counting from 0, > and then another integer (probably) containing the model "serial > number" specified in the mmCIF file. In the event that an mmCIF file > doesn't specify the model number, the serial number should be the same > as the sequential id. > > Cool? This will also help us convert between PDB and mmCIF formats in > the future. Got it. I'm working on implementing the serial_number/model_number dichotomy for MMCIF.
> As for accessing the models by their serial number, using string keys > seems like an effective workaround, but still obviously a workaround > rather than an ideal situation. Let's discuss that a little more, > perhaps file another bug when we've reached some consensus. Er, I made and then lost (still haven't *quite* gotten the hang of git rebase) a patch that applied int() to the MMCIF model numbers. I'll add that back so both model and serial numbers are ints. Lenna From arklenna at gmail.com Tue Apr 24 04:25:12 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 00:25:12 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: > > Ack, I didn't look at that closely enough. Check out this patch to see > the current situation: > https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > > The models associated with a structure are numbered with a sequential > integer id, starting from 0. It's always been like that in our PDB > parser and we haven't changed it. To ensure that model numbers > specified in the PDB file are preserved when writing the PDB back to > file, the above patch introduced a new attribute on the Model object > called serial_num (also an integer, equal to model.id unless specified > otherwise). That attribute is only used when writing a new PDB file; > Model.__getitem__ still uses Model.id as before. > > Perhaps that's surprising now that we read the serial numbers, but it > kept backward compatibility. Plus, it preserves list-like behavior > (item access via integers), even though the models are actually stored > in a dict. > > So! 
> > In the mmCIF parser, the calls to structure_builder.init_model should > be given two arguments instead of one: an integer id counting from 0, > and then another integer (probably) containing the model "serial > number" specified in the mmCIF file. In the event that an mmCIF file > doesn't specify the model number, the serial number should be the same > as the sequential id. > > Cool? This will also help us convert between PDB and mmCIF formats in > the future. > > As for accessing the models by their serial number, using string keys > seems like an effective workaround, but still obviously a workaround > rather than an ideal situation. Let's discuss that a little more, > perhaps file another bug when we've reached some consensus. > > Best, > Eric Hi Eric, I believe I've implemented the model_id/serial_id system found in PDB: https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d Please let me know if you think that looks right. I couldn't find an mmCIF file without a model column to test, but I believe in that case it will assign model_id and serial_id to 0. Would that be the correct behavior? I also modified the unit test to check the model serial_num. https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 Currently serial_num is int() of the CIF model column. Regarding access by string serial_num, I am concerned that the int/string access would be too subtle (structure[0] == structure['1']; structure[1] == structure['2']?). Perhaps an accessor function? i.e. structure.get_model('1') Let me know if you think I should write get_model() or something along those lines. 
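
For illustration, the id/serial_num distinction and the accessor idea raised in this message can be sketched as below. This is a toy stand-in under stated assumptions, not Biopython code: `ToyModel`, `ToyStructure`, `add_model` and `get_model` are hypothetical names, and only the behaviour described in the thread (sequential ids from 0, `serial_num` defaulting to the id, item access by id) is modelled.

```python
# Toy stand-ins for illustration only; these are NOT the Bio.PDB classes.
class ToyModel:
    def __init__(self, model_id, serial_num=None):
        self.id = model_id
        # Fall back to the sequential id when the file gives no model number.
        self.serial_num = model_id if serial_num is None else serial_num


class ToyStructure:
    def __init__(self):
        self._models = {}  # keyed by sequential id (0, 1, 2, ...)

    def add_model(self, serial_num=None):
        model = ToyModel(len(self._models), serial_num)
        self._models[model.id] = model
        return model

    def __getitem__(self, model_id):
        # Item access keeps using the sequential id, as described in the thread.
        return self._models[model_id]

    def get_model(self, serial_num):
        # Hypothetical accessor: look a model up by its serial number from the file.
        wanted = int(serial_num)
        for model in self._models.values():
            if model.serial_num == wanted:
                return model
        raise KeyError(serial_num)


structure = ToyStructure()
for file_model_number in ("2", "4", "5"):  # model numbers with gaps, as strings
    structure.add_model(int(file_model_number))
```

With this sketch, `structure[0]` is the model the file called MODEL 2, and `structure.get_model("4")` returns the model whose sequential id is 1, so the two numbering schemes never share the `[]` operator.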
Lenna From eric.talevich at gmail.com Tue Apr 24 15:38:50 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 24 Apr 2012 11:38:50 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: >> >> Ack, I didn't look at that closely enough. Check out this patch to see >> the current situation: >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 >> >> The models associated with a structure are numbered with a sequential >> integer id, starting from 0. It's always been like that in our PDB >> parser and we haven't changed it. To ensure that model numbers >> specified in the PDB file are preserved when writing the PDB back to >> file, the above patch introduced a new attribute on the Model object >> called serial_num (also an integer, equal to model.id unless specified >> otherwise). That attribute is only used when writing a new PDB file; >> Model.__getitem__ still uses Model.id as before. >> >> Perhaps that's surprising now that we read the serial numbers, but it >> kept backward compatibility. Plus, it preserves list-like behavior >> (item access via integers), even though the models are actually stored >> in a dict. >> >> So! >> >> In the mmCIF parser, the calls to structure_builder.init_model should >> be given two arguments instead of one: an integer id counting from 0, >> and then another integer (probably) containing the model "serial >> number" specified in the mmCIF file. In the event that an mmCIF file >> doesn't specify the model number, the serial number should be the same >> as the sequential id. >> >> Cool? This will also help us convert between PDB and mmCIF formats in >> the future. 
>> >> As for accessing the models by their serial number, using string keys >> seems like an effective workaround, but still obviously a workaround >> rather than an ideal situation. Let's discuss that a little more, >> perhaps file another bug when we've reached some consensus. >> >> Best, >> Eric > > > Hi Eric, > > I believe I've implemented the model_id/serial_id system found in PDB: > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > Please let me know if you think that looks right. I couldn't find an > mmCIF file without a model column to test, but I believe in that case > it will assign model_id and serial_id to 0. Would that be the correct > behavior? > > I also modified the unit test to check the model serial_num. > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > Currently serial_num is int() of the CIF model column. Regarding > access by string serial_num, I am concerned that the int/string access > would be too subtle (structure[0] == structure['1']; structure[1] == > structure['2']?). Perhaps an accessor function? i.e. > structure.get_model('1') > > Let me know if you think I should write get_model() or something along > those lines. > > Lenna I left another nitpick on b453a, but besides that it looks exactly right to me. The string/int distinction would indeed be weird, especially for newer Python users coming from Perl or Javascript. I don't see a direct analogue for get_model(serial_num) in the other Entities (Residue, Chain, Model, Structure), so I'm inclined to put off the decision for now (i.e. leave it out of this patch set). 
-Eric From p.j.a.cock at googlemail.com Tue Apr 24 15:58:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 16:58:10 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <4F91E4CF.8040602@med.nyu.edu> References: <4F91E4CF.8040602@med.nyu.edu> Message-ID: On Fri, Apr 20, 2012 at 11:35 PM, Andrew Sczesnak wrote: > Peter, > > My colleague was writing some code using MafIndex and commented how long it > took her to download, decompress and index the human multiz alignments from > UCSC. It seems like it'd be great to keep the files compressed... perhaps if > the code works well enough we can convince UCSC to host bgzip'd copies (or > maybe them available on one of our institutions servers). That does sound good - it is a perfect example of where BGZF is a more useful alternative to standard GZIP. Some numbers on how much of a size penalty it imposes would help though... > Is I.J. interested in joining the community? I'd like to look into adding > BGZF to MafIO and wouldn't want to duplicate I.J.'s effort. If not, could > you put me in touch? Perhaps he's just busy at the moment (BCC'd again)? It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py and I'm willing to do this myself for MAF (while going over your index work - something I want to do anyway). The only potential catch is avoiding offset arithmetic. Peter From arklenna at gmail.com Tue Apr 24 17:56:37 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 13:56:37 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich wrote: > > On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: > > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: > >> > >> Ack, I didn't look at that closely enough. 
Check out this patch to see > >> the current situation: > >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > >> > >> The models associated with a structure are numbered with a sequential > >> integer id, starting from 0. It's always been like that in our PDB > >> parser and we haven't changed it. To ensure that model numbers > >> specified in the PDB file are preserved when writing the PDB back to > >> file, the above patch introduced a new attribute on the Model object > >> called serial_num (also an integer, equal to model.id unless specified > >> otherwise). That attribute is only used when writing a new PDB file; > >> Model.__getitem__ still uses Model.id as before. > >> > >> Perhaps that's surprising now that we read the serial numbers, but it > >> kept backward compatibility. Plus, it preserves list-like behavior > >> (item access via integers), even though the models are actually stored > >> in a dict. > >> > >> So! > >> > >> In the mmCIF parser, the calls to structure_builder.init_model should > >> be given two arguments instead of one: an integer id counting from 0, > >> and then another integer (probably) containing the model "serial > >> number" specified in the mmCIF file. In the event that an mmCIF file > >> doesn't specify the model number, the serial number should be the same > >> as the sequential id. > >> > >> Cool? This will also help us convert between PDB and mmCIF formats in > >> the future. > >> > >> As for accessing the models by their serial number, using string keys > >> seems like an effective workaround, but still obviously a workaround > >> rather than an ideal situation. Let's discuss that a little more, > >> perhaps file another bug when we've reached some consensus. 
> >> > >> Best, > >> Eric > > > > > > Hi Eric, > > > > I believe I've implemented the model_id/serial_id system found in PDB: > > > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > > > Please let me know if you think that looks right. I couldn't find an > > mmCIF file without a model column to test, but I believe in that case > > it will assign model_id and serial_id to 0. Would that be the correct > > behavior? > > > > I also modified the unit test to check the model serial_num. > > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > > > Currently serial_num is int() of the CIF model column. Regarding > > access by string serial_num, I am concerned that the int/string access > > would be too subtle (structure[0] == structure['1']; structure[1] == > > structure['2']?). Perhaps an accessor function? i.e. > > structure.get_model('1') > > > > Let me know if you think I should write get_model() or something along > > those lines. > > > > Lenna > > I left another nitpick on b453a, but besides that it looks exactly right to me. > > The string/int distinction would indeed be weird, especially for newer > Python users coming from Perl or Javascript. I don't see a direct > analogue for get_model(serial_num) in the other Entities (Residue, > Chain, Model, Structure), so I'm inclined to put off the decision for > now (i.e. leave it out of this patch set). > > -Eric Eric, Okay, I've changed the bad model num generic warning to a PDBConstructionException. New pull request to get MMCIF to the same state as PDB: https://github.com/biopython/biopython/pull/36 So are chains accessed by 0, 1, 2 or by A, B, C? 
Lenna From anaryin at gmail.com Tue Apr 24 17:59:10 2012 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 24 Apr 2012 19:59:10 +0200 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: Hi Lenna, IMO, chains should be accessed by A, B, C; numeric access doesn't make sense for them. Congrats on the GSOC application and on the good work so far! Cheers, João [...] Rodrigues http://nmr.chem.uu.nl/~joao On 24 April 2012 19:56, Lenna Peterson wrote: > On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich > wrote: > > > > On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson > wrote: > > > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich < > eric.talevich at gmail.com> wrote: > > >> > > >> Ack, I didn't look at that closely enough. Check out this patch to see > > >> the current situation: > > >> > https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 > > >> > > >> The models associated with a structure are numbered with a sequential > > >> integer id, starting from 0. It's always been like that in our PDB > > >> parser and we haven't changed it. To ensure that model numbers > > >> specified in the PDB file are preserved when writing the PDB back to > > >> file, the above patch introduced a new attribute on the Model object > > >> called serial_num (also an integer, equal to model.id unless > specified > > >> otherwise). That attribute is only used when writing a new PDB file; > > >> Model.__getitem__ still uses Model.id as before. > > >> > > >> Perhaps that's surprising now that we read the serial numbers, but it > > >> kept backward compatibility. Plus, it preserves list-like behavior > > >> (item access via integers), even though the models are actually stored > > >> in a dict. > > >> > > >> So!
> > >> > > >> In the mmCIF parser, the calls to structure_builder.init_model should > > >> be given two arguments instead of one: an integer id counting from 0, > > >> and then another integer (probably) containing the model "serial > > >> number" specified in the mmCIF file. In the event that an mmCIF file > > >> doesn't specify the model number, the serial number should be the same > > >> as the sequential id. > > >> > > >> Cool? This will also help us convert between PDB and mmCIF formats in > > >> the future. > > >> > > >> As for accessing the models by their serial number, using string keys > > >> seems like an effective workaround, but still obviously a workaround > > >> rather than an ideal situation. Let's discuss that a little more, > > >> perhaps file another bug when we've reached some consensus. > > >> > > >> Best, > > >> Eric > > > > > > > > > Hi Eric, > > > > > > I believe I've implemented the model_id/serial_id system found in PDB: > > > > > > > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d > > > > > > Please let me know if you think that looks right. I couldn't find an > > > mmCIF file without a model column to test, but I believe in that case > > > it will assign model_id and serial_id to 0. Would that be the correct > > > behavior? > > > > > > I also modified the unit test to check the model serial_num. > > > > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 > > > > > > Currently serial_num is int() of the CIF model column. Regarding > > > access by string serial_num, I am concerned that the int/string access > > > would be too subtle (structure[0] == structure['1']; structure[1] == > > > structure['2']?). Perhaps an accessor function? i.e. > > > structure.get_model('1') > > > > > > Let me know if you think I should write get_model() or something along > > > those lines. 
> > > > > > Lenna > > > > I left another nitpick on b453a, but besides that it looks exactly right > to me. > > > > The string/int distinction would indeed be weird, especially for newer > > Python users coming from Perl or Javascript. I don't see a direct > > analogue for get_model(serial_num) in the other Entities (Residue, > > Chain, Model, Structure), so I'm inclined to put off the decision for > > now (i.e. leave it out of this patch set). > > > > -Eric > > > Eric, > > Okay, I've changed the bad model num generic warning to a > PDBConstructionException. > > New pull request to get MMCIF to the same state as PDB: > https://github.com/biopython/biopython/pull/36 > > So are chains accessed by 0, 1, 2 or by A, B, C? > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Tue Apr 24 18:20:16 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 24 Apr 2012 14:20:16 -0400 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 1:56 PM, Lenna Peterson wrote: > On Tue, Apr 24, 2012 at 11:38 AM, Eric Talevich wrote: >> >> On Tue, Apr 24, 2012 at 12:25 AM, Lenna Peterson wrote: >> > On Mon, Apr 23, 2012 at 4:10 PM, Eric Talevich wrote: >> >> >> >> Ack, I didn't look at that closely enough. Check out this patch to see >> >> the current situation: >> >> https://github.com/biopython/biopython/commit/abdab1a1132ec811f9636f8ba805bbb6cda6dbe9 >> >> >> >> The models associated with a structure are numbered with a sequential >> >> integer id, starting from 0. It's always been like that in our PDB >> >> parser and we haven't changed it. 
To ensure that model numbers >> >> specified in the PDB file are preserved when writing the PDB back to >> >> file, the above patch introduced a new attribute on the Model object >> >> called serial_num (also an integer, equal to model.id unless specified >> >> otherwise). That attribute is only used when writing a new PDB file; >> >> Model.__getitem__ still uses Model.id as before. >> >> >> >> Perhaps that's surprising now that we read the serial numbers, but it >> >> kept backward compatibility. Plus, it preserves list-like behavior >> >> (item access via integers), even though the models are actually stored >> >> in a dict. >> >> >> >> So! >> >> >> >> In the mmCIF parser, the calls to structure_builder.init_model should >> >> be given two arguments instead of one: an integer id counting from 0, >> >> and then another integer (probably) containing the model "serial >> >> number" specified in the mmCIF file. In the event that an mmCIF file >> >> doesn't specify the model number, the serial number should be the same >> >> as the sequential id. >> >> >> >> Cool? This will also help us convert between PDB and mmCIF formats in >> >> the future. >> >> >> >> As for accessing the models by their serial number, using string keys >> >> seems like an effective workaround, but still obviously a workaround >> >> rather than an ideal situation. Let's discuss that a little more, >> >> perhaps file another bug when we've reached some consensus. >> >> >> >> Best, >> >> Eric >> > >> > >> > Hi Eric, >> > >> > I believe I've implemented the model_id/serial_id system found in PDB: >> > >> > https://github.com/lennax/biopython/commit/b453a2968d18e157aac1f99f9f3cfeb4c09bc77d >> > >> > Please let me know if you think that looks right. I couldn't find an >> > mmCIF file without a model column to test, but I believe in that case >> > it will assign model_id and serial_id to 0. Would that be the correct >> > behavior? >> > >> > I also modified the unit test to check the model serial_num. 
>> > https://github.com/lennax/biopython/commit/b0443e788438b8ff72979c7a3bc0e531d4cd5cf6 >> > >> > Currently serial_num is int() of the CIF model column. Regarding >> > access by string serial_num, I am concerned that the int/string access >> > would be too subtle (structure[0] == structure['1']; structure[1] == >> > structure['2']?). Perhaps an accessor function? i.e. >> > structure.get_model('1') >> > >> > Let me know if you think I should write get_model() or something along >> > those lines. >> > >> > Lenna >> >> I left another nitpick on b453a, but besides that it looks exactly right to me. >> >> The string/int distinction would indeed be weird, especially for newer >> Python users coming from Perl or Javascript. I don't see a direct >> analogue for get_model(serial_num) in the other Entities (Residue, >> Chain, Model, Structure), so I'm inclined to put off the decision for >> now (i.e. leave it out of this patch set). >> >> -Eric > > > Eric, > > Okay, I've changed the bad model num generic warning to a > PDBConstructionException. > > New pull request to get MMCIF to the same state as PDB: > https://github.com/biopython/biopython/pull/36 > > So are chains accessed by 0, 1, 2 or by A, B, C? > > Lenna Cool, I just merged the pull request. Thanks! As Jo?o said, chains are accessed by the letter ID via __getitem__ (implemented in Bio.PDB.Entity). You can get at them either way through the child_list and child_dict attributes, too. Kind of a thrill. I suppose we could eventually refactor the Entity-based classes to use a single data structure (OrderedDict, namedtuple, numpy array with named columns/rows?) in place of child_dict and child_list, and clean up some of the redundant accessors. 
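To make the two access routes concrete, here is a minimal sketch. ToyEntity is a made-up stand-in, not the real Bio.PDB.Entity, and the structure id "1ABC" is a placeholder; only the names id, serial_num, child_list and child_dict come from the discussion above.

```python
class ToyEntity:
    """Hypothetical stand-in for Bio.PDB.Entity; not the real class."""

    def __init__(self, id):
        self.id = id
        self.child_list = []   # children in insertion order
        self.child_dict = {}   # the same children, keyed by their id

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        # Lookup goes through the id, never through the position.
        return self.child_dict[id]


structure = ToyEntity("1ABC")      # placeholder structure id
model = ToyEntity(0)               # models: sequential integer id, from 0
model.serial_num = 1               # model number as written in the file
structure.add(model)
for letter in "AB":
    model.add(ToyEntity(letter))   # chains: letter id

chain_by_id = structure[0]["A"]                  # via __getitem__
chain_by_position = structure[0].child_list[0]   # via child_list
```

Both lookups reach the same chain object here. Keeping the list and the dict in step is the redundancy that a single ordered mapping would remove.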
-E From anaryin at gmail.com Tue Apr 24 18:25:15 2012 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 24 Apr 2012 20:25:15 +0200 Subject: [Biopython-dev] pull request: Handle MMCIF with multiple models (closes 2943) In-Reply-To: References: Message-ID: I cannot agree more with Eric on this. Child dict and child list should definitely be refactored into a single structure that is easier to understand (and use). Also because we should take care of that memory leak... (try running the parser over a lot of PDBs and you will see memory going up). Cheers, João From p.j.a.cock at googlemail.com Tue Apr 24 20:07:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 21:07:03 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu> References: <4F91E4CF.8040602@med.nyu.edu> <67433BFC-673B-4F49-A582-6F419FD6E0B7@csail.mit.edu> Message-ID: On Tue, Apr 24, 2012 at 7:24 PM, Irwin Jungreis wrote: > Hello Andrew and Peter. > Hi again Irwin, > The size penalty of bgz versus gzip for .maf files is quite small. For > example, compressing the 6-way C. elegans alignment .maf files is 108.9 MB > with gzip and 112 MB with bgz, a difference of less than 3%. (Each is > smaller than the uncompressed file by a factor of about 4 or 5.) That's good, and given the nature of the MAF format, in line with what I was hoping for. See also the overheads I got for FASTA, SwissProt and UniProt XML here: http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html > I am not very familiar with biopython, so I've been using my own utilities. > To work with alignments I create an index file consisting of a 32-byte > record for each maf block. Each record contains the block start on the > reference species chromosome, the block length on the reference species, and > the virtual offset of the block start in the .maf file.
I then have a > utility that will extract the alignment for a given set of spliced regions, > e.g., chrX:11568015-11569059+chrX:11569364-11569395 on the '-' strand, and > output it as a list of pairs (assembly name, base string). > > I'd be happy to share, but I have no idea how this would fit into the > existing biopython infrastructure. > > Best, > Irwin Ah - I must have misinterpreted your earlier email (off list). I'd assumed you were using Andrew's Biopython branch which indexes MAF files using an SQLite database of offsets. But in practice the principle is the same - BGZF lets you have good compression of MAF files and random access. Thank you for clarifying this. If you use Python at all perhaps you'd have some feedback on Andrew's indexing plans? That would be great - Andrew's done a great job explaining the proposed code usage here: http://biopython.org/wiki/Multiple_Alignment_Format Regards, Peter From redmine at redmine.open-bio.org Wed Apr 25 02:33:04 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 25 Apr 2012 02:33:04 +0000 Subject: [Biopython-dev] [Biopython - Feature #3344] (New) Bio.PDB.Entity classes need a __contains__ method Message-ID: Issue #3344 has been reported by Eric Talevich. ---------------------------------------- Feature #3344: Bio.PDB.Entity classes need a __contains__ method https://redmine.open-bio.org/issues/3344 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: The various objects constructed by Bio.PDB have list-like and dict-like behaviors, for the most part. However, not all of the relevant magic methods have been implemented. (E.g. `residue["CA"]` works, but `"CA" in residue` does not.) We could do more to support the list-like and dict-like behaviors, but let's start with __contains__.
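As a rough illustration of the request, here is a minimal sketch of a dict-backed container gaining `__contains__`. ToyEntity and the residue id tuple are hypothetical and only mimic the shape of Bio.PDB.Entity; this is not the actual patch.

```python
class ToyEntity:
    """Hypothetical dict-backed container; not the real Bio.PDB.Entity."""

    def __init__(self, id):
        self.id = id
        self.child_dict = {}

    def add(self, child):
        self.child_dict[child.id] = child

    def __getitem__(self, id):
        return self.child_dict[id]

    def __contains__(self, id):
        # Membership is answered by id, mirroring what __getitem__ accepts.
        return id in self.child_dict


residue = ToyEntity((" ", 10, " "))   # made-up residue id
residue.add(ToyEntity("CA"))

has_ca = "CA" in residue
has_cb = "CB" in residue
```

Without `__contains__`, Python's `in` operator falls back to `__iter__`, or failing that to integer-indexed `__getitem__`, and neither of those tests child ids, which is presumably why the explicit method is wanted.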
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Apr 26 03:36:04 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 26 Apr 2012 03:36:04 +0000 Subject: [Biopython-dev] [Biopython - Bug #3169] (Closed) to_one_letter_code in Bio.SCOP.Raf is old References: Message-ID: Issue #3169 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 We've committed this fix now: https://github.com/biopython/biopython/pull/35 ---------------------------------------- Bug #3169: to_one_letter_code in Bio.SCOP.Raf is old https://redmine.open-bio.org/issues/3169 Author: Hongbo Zhu Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.56 URL: Hi, The dictionary to_one_letter_code in Bio.SCOP.Raf is a bit old now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some new three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL does not use the table since v1.73. Rather, PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75 "Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55." The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html . I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code '?'. 
Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached in the end). Therefore, I suggest to update the to_one_letter_code dictionary in Bio.SCOP.Raf. Best regards, hongbo zhu to_one_letter_code = { '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K', '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G', '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A', '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F', '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T', '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG', '10C':'C','125':'U','126':'U','127':'U','128':'N', '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A', '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N', '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F', '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X', '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I', '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N', '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N', '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L', '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P', '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X', '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T', '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H', '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A', '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G', '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W', '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X', '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C', '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N', '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C', '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E', '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U', '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C', '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K', '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G', '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A', '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U', '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A', 
'8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F', '9NR':'R','9NV':'V','A ':'A','A1P':'N','A23':'A', 'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A', 'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A', 'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A', 'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X', 'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D', 'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X', 'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G', 'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A', 'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D', 'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A', 'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K', 'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K', 'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R', 'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D', 'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D', 'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D', 'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T', 'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K', 'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A', 'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D', 'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X', 'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y', 'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C', 'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G', 'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X', 'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A', 'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U', 'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W', 'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C ':'C', 'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C', 'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C', 'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C', 'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C', 'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X', 'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C', 'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C', 'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C', 
'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E', 'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X', 'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L', 'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C', 'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U', 'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG', 'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG', 'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E', 'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C', 'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C', 'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C', 'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C', 'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C', 'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S', 'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C', 'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X', 'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C', 'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N', 'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X', 'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A', 'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S', 'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C', 'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C', 'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C', 'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G', 'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A', 'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U', 'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V', 'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N', 'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L', 'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K', 'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T', 'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P', 'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N', 'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T', 'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V', 'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A', 'E ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C', 
'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M', 'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A', 'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N', 'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U', 'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G', 'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F', 'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K', 'G ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G', 'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G', 'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N', 'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X', 'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G', 'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X', 'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G', 'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G', 'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G', 'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C', 'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U', 'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X', 'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H', 'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H', 'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R', 'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A', 'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S', 'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W', 'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P', 'I ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A', 'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG', 'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I', 'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I', 'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K', 'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C', 'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K', 'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K', 'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K', 'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N', 'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L', 'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X', 'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U', 
'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q', 'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X', 'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G', 'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K', 'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G', 'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A', 'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R', 'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K', 'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N', 'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U', 'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG', 'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G', 'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A', 'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L', 'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N', 'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P', 'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G', 'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M', 'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N ':'N', 'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G', 'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N', 'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X', 'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N', 'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L', 'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G', 'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N', 'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y', 'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C', 'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N', 'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C', 'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I', 'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G', 'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R', 'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T', 'OTY':'Y','OXX':'D','P ':'G','P1L':'C','P1P':'N', 'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y', 'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F', 'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F', 'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F', 
'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X', 'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D', 'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X', 'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F', 'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A', 'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F', 'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X', 'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N', 'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C', 'PYY':'N','QLG':'QLG','QUO':'G','R ':'A','R1A':'C', 'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C', 'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N', 'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A', 'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G', 'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C', 'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G', 'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S', 'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S', 'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C', 'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C', 'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C', 'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T', 'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG', 'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X', 'SYS':'C','T ':'T','T11':'F','T23':'T','T2S':'T', 'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T', 'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T', 'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X', 'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N', 'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T', 'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T', 'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G', 'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N', 'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U', 'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W', 'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W', 'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K', 'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W', 'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T', 
'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y', 'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y', 'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N', 'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U ':'U', 'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U', 'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U', 'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N', 'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U', 'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U', 'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U', 'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K', 'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X', 'X ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A', 'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X', 'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N', 'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T', 'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G', 'XX1':'K','XXY':'THG','XYG':'DYG','Y ':'A','YCM':'C', 'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z ':'C', 'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U', 'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' } -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri Apr 27 03:59:13 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 27 Apr 2012 03:59:13 +0000 Subject: [Biopython-dev] [Biopython - Bug #3346] (New) patch for legacy parser to support BLASTX 2.2.25+ Message-ID: Issue #3346 has been reported by John Comeau. ---------------------------------------- Bug #3346: patch for legacy parser to support BLASTX 2.2.25+ https://redmine.open-bio.org/issues/3346 Author: John Comeau Status: New Priority: Normal Assignee: Category: Target version: URL: it may also work with 2.2.26+, I have not tested. patched parser passes regression tests as per Peter Cock's instructions. 
---------------------------------------- From andrew.sczesnak at med.nyu.edu Fri Apr 27 19:57:19 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 27 Apr 2012 15:57:19 -0400 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: References: <4F91E4CF.8040602@med.nyu.edu> Message-ID: <4F9AFA1F.6030103@med.nyu.edu> Peter, > It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py > and I'm willing to do this myself for MAF (while going over your index work - > something I want to do anyway). The only potential catch is avoiding offset > arithmetic. I have no problem with you doing this if you're willing. It would be great to have some code review of MafIndex as well. Best, Andrew From MatatTHC at gmx.de Sat Apr 28 07:15:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sat, 28 Apr 2012 09:15:35 +0200 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: Dear developers, I would like to suggest a quick "fix" for the problem. Currently the parser just returns true by default for the circular property. This is a wrong piece of information for all circular sequences. Furthermore, it's not possible to tell whether the parser returned true because that is its default value or because the value really came from the data. So I suggest returning None if the parser does not parse the information. What do you think? This should be possible with minimal effort.
The user could then implement a workaround on their own (like using the old parser as fallback, or just searching the first line of t) Regards, Matthias 2012/4/22 Matthias Bernt : > Hi, > > since this bug seems to be of low priority I decided to try my best to > help a bit and search the web a bit. > It seems that the property is stored in PrimarySeq or Seq in bioperl. > See for instance: > > http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm > http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/PrimarySeq.pm > > Or also: > http://bugzilla.open-bio.org/show_bug.cgi?id=2578 > > This seems to be realised as boolean variable or function. > > Regards, > Matthias > > 2012/4/4 Matthias Bernt : >> Hi, >> >> is there any news on this? May I help somehow? But I have to admit >> that I barely speak perl and have no experience with bioperl. If >> someone tells me where to look I might still try it. >> >> Matthias >> >> 2012/3/29 Peter Cock : >>> On Thu, Mar 29, 2012 at 3:38 PM, Matthias Bernt wrote: >>>> Hi, >>>> >>>> Is it possible to get the property if a genome is circular / linear >>>> from SeqIO applied to genbank files? I could not find it. >>>> >>>> There is also a related bugreport: >>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 >>>> >>>> I used the old parser before and switched to SeqIO which I really like >>>> for the possibilities to parse different formats... but I really need >>>> the information. >>> >>> Does anyone happen to have a BioPerl + BioSQL setup installed >>> and working? IIRC checking that to make sure however we >>> store the circular was compatible was the only real hurdle.
>>> >>> Peter From w.arindrarto at gmail.com Sat Apr 28 12:08:35 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 28 Apr 2012 14:08:35 +0200 Subject: [Biopython-dev] Google Summer of Code Project: SearchIO in Biopython Message-ID: Hello everyone, This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of Code students who will work on Biopython over this summer. I will be working with Peter to add support for parsing search outputs from programs like BLAST and HMMER to Biopython, so that it's easier to extract information from their outputs. Having used some of these programs quite a lot myself, I'm really looking forward to implementing the feature. However, I do understand that it won't be just me who will use the module, but also many other Biopython users. So for everyone who is interested in offering input or critiques along the way, feel free to do so :). The official coding period starts about a month from now. Until then, I will be doing all the preparatory work required so that coding will proceed as smoothly as possible. This will include preparing the test cases and preparing the SearchIO attribute / object naming convention as well as discussing anything related to its proposed implementation. Finally, here are some links related to the project that might interest you. 1. My main biopython branch for development: https://github.com/bow/biopython/tree/searchio. Since I will be building on top of Peter's SearchIO branch ( https://github.com/peterjc/biopython/tree/search-io-test), right now it only contains Peter's branch rebased against the latest master. 2. My GSoC proposal, which outlines my plans and timeline for the project: http://bit.ly/searchio-proposal 3. The proposed SearchIO naming convention (not 100% complete as of now, but will be filled in along the way): http://bit.ly/searchio-terms.
One of the main goals of the project is to implement a common interface for BLAST et al, which requires SearchIO to have common attribute names that refer to different search output attributes. The link contains my proposed naming convention, which is still very open to change and discussion. Feel free to comment on the document and add your own ideas. 4. My blog, in which I will write weekly posts about the project's progress: http://bow.web.id/blog 5. An extra repo for all other auxiliary files and scripts that don't go into Biopython's code: https://github.com/bow/gsoc. That's it for now. Thanks for taking the time to read it :). I'm looking forward to a productive summer with Biopython. Have a nice weekend, Bow From p.j.a.cock at googlemail.com Sun Apr 29 11:00:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 29 Apr 2012 12:00:42 +0100 Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython In-Reply-To: References: Message-ID: Hi Bow, Thanks for updating the list. I'm replying just on the dev list as I'm focusing on implementation discussion in this reply. On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto wrote: > 1. My main biopython branch for development: > https://github.com/bow/biopython/tree/searchio. Since I will be building on > top of Peter's SearchIO branch ( > https://github.com/peterjc/biopython/tree/search-io-test), right now it > only contains Peter's branch rebased against the latest master. Just to be clear - you don't have to start from that branch ;) http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html As I said before, that may not be the best approach. The idea behind that code was to focus on the HSPs (in BLAST terms), and for the low level parsers to iterate over each HSP. Higher level wrappers can then batch these up by query/subject, or into the larger grouping of all the results for one query - which was the exposed high level Bio.SearchIO.parse function.
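The layering described above can be sketched roughly as follows. This is not the code on the search-io-test branch: the rows are made-up (query id, subject id, e-value) tuples, iter_hsps and iter_results are hypothetical names, and the sketch assumes HSPs for the same query arrive consecutively, as they would when read in file order from a tabular file.

```python
from itertools import groupby
from operator import itemgetter

# Made-up tabular rows: (query id, subject id, e-value)
rows = [
    ("q1", "s1", 1e-30),
    ("q1", "s1", 2e-5),
    ("q1", "s2", 1e-10),
    ("q2", "s1", 3e-8),
]


def iter_hsps(rows):
    """Low level: yield one HSP at a time."""
    for row in rows:
        yield row


def iter_results(hsp_iter):
    """Higher level: batch consecutive HSPs that share a query id."""
    for query_id, group in groupby(hsp_iter, key=itemgetter(0)):
        yield query_id, list(group)


results = list(iter_results(iter_hsps(rows)))
```

Grouping by `itemgetter(0, 1)` instead would batch at the query+subject level, using the same low level iterator.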
That branch introduced a SearchResult object which was essentially something like a list or dict (like an OrderedDict in some ways), with some (unnecessary?) error checking for consistent contents (all from the same query). It also introduced a TopMatches object which was essentially a list (again, with some error checking for consistent contents). The advantage of using simple objects (OrderedDict and list) is simplicity and hopefully performance. But specific classes have the advantage of allowing more user friendly str/repr etc. The idea on this branch of focusing on iteration over the HSPs at the low level was that it allowed a lot of flexibility, and the low level parser could be used in conjunction with indexing to seek to a particular HSP and parse it, or go to the results for a particular query+match and parse its HSPs (not implemented on my old branch, but that was the plan). However, while this makes perfect sense for, say, the BLAST tabular output, it isn't quite such a good match for all the possible datatypes. For instance, BLAST plain text/html includes an e-value for a query/subject combination which is calculated from all the HSPs for that query/subject (taking into account order etc - I'd have to check the O'Reilly BLAST book for the details). This isn't in the tabular output, but the point is that it isn't a property of the individual HSPs, but of the match (group of HSPs). I think we need to consider the other main formats, and whether all their important information lies at the HSP level or not. Perhaps iteration at the query+match level (groups of HSPs) would be best overall? Bow - If some of that doesn't make sense, I can try to clarify by email on the list, and/or we can talk about it at our next video chat.
Also see if you can get the BLAST book from your library - it will probably be quite useful in this project even though it describes the 'legacy' BLAST suite: "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell Publisher: O'Reilly Media, Released: July 2003 Regards, Peter From w.arindrarto at gmail.com Sun Apr 29 16:42:14 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 29 Apr 2012 18:42:14 +0200 Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython In-Reply-To: References: Message-ID: On Sun, Apr 29, 2012 at 13:00, Peter Cock wrote: > > Hi Bow, > > Thanks for updating the list. I'm replying just on the dev list > as I'm focusing on implementation discussion in this reply. > > On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto > wrote: > > 1. My main biopython branch for development: > > https://github.com/bow/biopython/tree/searchio. Since I will be building > > on > > top of Peter's SearchIO branch ( > > https://github.com/peterjc/biopython/tree/search-io-test), right now it > > only contains Peter's branch rebased against the latest master. > > Just to be clear - you don't have to start from that branch ;) > http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html Ok :). I wasn't so sure about how much code from your previous branch that I will end up using, so I decided to rebase everything and then see later how much of it can be used. But it's also easier to start clean :). > As I said before, that may not be the best approach. The idea > behind that code was to focus on the HSPs (in BLAST terms), > and for the low level parsers to iterate over each HSP. Higher > level wrappers can then batch these up by query/subject, or > into the larger grouping of all the results for one query - > which was the exposed high level Bio.SearchIO.parse > function. 
>
> That branch introduced a SearchResult object which was
> essentially something like a list or dict (like an OrderedDict
> in some ways), with some (unnecessary?) error checking for
> consistent contents (all from the same query). It also introduced
> a TopMatches object which was essentially a list (again,
> with some error checking for consistent contents).
>
> The advantage of using simple objects (OrderedDict
> and list) is simplicity and hopefully performance. But
> specific classes have the advantage of allowing more
> user friendly str/repr etc.
>
> The idea on this branch of focusing on iteration over the
> HSPs at the low level was it allowed a lot of flexibility, and
> the low level parser could be used in conjunction with
> indexing to seek to a particular HSP and parse it, or go to
> the results for a particular query+match and parse its
> HSPs (not implemented on my old branch, but that was
> the plan).
>
> However, while this makes perfect sense for say the BLAST
> tabular output, it isn't quite such a good match for all the
> possible datatypes.
>
> For instance, BLAST plain text/html includes an e-value for
> a query/subject combination which is calculated from all the
> HSPs for that query/subject (taking into account order etc -
> I'd have to check the O'Reilly BLAST book for the details).
> This isn't in the tabular output, but the point is that it isn't a
> property of the individual HSPs, but of the match (group of
> HSPs).
>
> I think we need to consider the other main formats, and if
> all their important information lies at the HSP level or not.
> Perhaps iteration at the query+match level (groups of
> HSPs) would be best overall?
>
> Bow - If some of that doesn't make sense, I can try to clarify
> by email on the list, and/or we can talk about it at our next
> video chat.
> Also see if you can get the BLAST book from
> your library - it will probably be quite useful in this project
> even though it describes the 'legacy' BLAST suite:
>
> "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
> Publisher: O'Reilly Media, Released: July 2003
>
> Regards,
>
> Peter

I think I got the gist of it (please correct me if I'm wrong). Some information about the search, such as the sequence-wide e-value, may not be present at the HSP level. Ignoring it could let us focus on a perhaps simpler and more flexible implementation with better performance, but at the cost of the usefulness of the data itself, since we would be throwing away information.

What I have in mind now is actually closer to iteration on the query+subject level. To be clear first, the hierarchy of the objects that I propose is this:

* Search object, to represent the entire search session.
* Result object, to represent a search with one query against the database. Depending on the number of queries, we could have one to several Result objects contained in a Search.
* Hit object, to represent a sequence hit. Depending on the search, we could also have multiple Hits in one Result object.
* and finally, HSP object, to represent individual alignments.

Iteration is done on the Results level, so the information is parsed on the search query level, not just for a single HSP (I wrote a very short description of what I'm planning the objects to be here as well: http://bit.ly/searchio-terms). I suppose if we aim for maximum information parsing over performance and simplicity of the format-specific parsers, this is the way to go. There are other formats, too, that contain sequence-level search information not present in the alignment (e.g. HMMER text output). What do you think about this?

Thanks for the BLAST book suggestion. I'll see if I can find it in my library in the meantime.
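To make the proposed Search / Result / Hit / HSP hierarchy a bit more concrete, here is a rough sketch in modern Python. Everything in it is an assumption for illustration only: the attribute names (evalue, bitscore, id, hits, hsps) and the use of dataclasses are not taken from any existing Biopython code. It only shows how a match-level value such as the sequence-wide e-value could live on the Hit rather than on an individual HSP.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HSP:
    """One individual alignment (hypothetical attributes)."""
    evalue: float
    bitscore: float


@dataclass
class Hit:
    """One matched database sequence, grouping its HSPs."""
    id: str
    hsps: List[HSP] = field(default_factory=list)
    # Match-level value: belongs to the group of HSPs, not to any one HSP.
    evalue: Optional[float] = None

    def __len__(self):
        return len(self.hsps)


@dataclass
class Result:
    """All hits for a single query."""
    query: str
    hits: List[Hit] = field(default_factory=list)

    def __len__(self):
        return len(self.hits)

    def __iter__(self):
        return iter(self.hits)


@dataclass
class Search:
    """The whole search session, holding attributes shared by all queries."""
    format: str
    algorithm: str
    results: List[Result] = field(default_factory=list)

    def __iter__(self):
        return iter(self.results)


# Build a tiny in-memory example by hand (no file parsing involved).
hit = Hit(
    id="subject_1",
    hsps=[HSP(evalue=1e-20, bitscore=95.0), HSP(evalue=1e-5, bitscore=40.2)],
    evalue=1e-25,
)
result = Result(query="query_1", hits=[hit])
search = Search(format="blast-xml", algorithm="blastx", results=[result])
```

With this layout a format that only reports HSP-level data simply leaves Hit.evalue as None, while formats that report a per-match value have a natural place to store it.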
regards,
Bow

From p.j.a.cock at googlemail.com Mon Apr 30 09:49:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 10:49:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

On Sun, Apr 29, 2012 at 5:42 PM, Wibowo Arindrarto wrote:
>
> I think I got the gist of it (please correct me if I'm wrong). Some
> information about the search, such as the sequence-wide e-value, may
> not be present in the HSP level. Ignoring them could let us focus on a
> perhaps simpler and more flexible implementation with better
> performance, but at the cost of usefulness of the data itself since we
> are throwing away information.

Yes.

> What I have in mind now is actually closer to iteration on the
> query+subject level. To be clear first, the hierarchy of the objects
> that I propose is this:
>
> * Search object, to represent the entire search session.
> * Result object, to represent a search with one query against the
> database. Depending on the number of queries, we could have one to
> several Result objects contained in a Search.
> * Hit object, to represent a sequence hit. Depending on the search, we
> could also have multiple Hits in one Result object.
> * and finally, HSP object, to represent individual alignments.
>
> Iteration is done on the Results level, so the information is parsed
> on the search query level, not just a single HSP (I wrote a very
> short description about what I'm planning the objects to be in here as
> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
> information parsing over performance and simplicity of the
> format-specific parsers, this is the way to go. There are other
> formats, too, that contain sequence-level search information not
> present in the alignment (e.g. HMMER text output). What do you think
> about this?

That sounds good.
If iteration is done on the Results level, when/how would your Search object be used?

Peter

From w.arindrarto at gmail.com Mon Apr 30 10:08:52 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 30 Apr 2012 12:08:52 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

>> What I have in mind now is actually closer to iteration on the
>> query+subject level. To be clear first, the hierarchy of the objects
>> that I propose is this:
>>
>> * Search object, to represent the entire search session.
>> * Result object, to represent a search with one query against the
>> database. Depending on the number of queries, we could have one to
>> several Result objects contained in a Search.
>> * Hit object, to represent a sequence hit. Depending on the search, we
>> could also have multiple Hits in one Result object.
>> * and finally, HSP object, to represent individual alignments.
>>
>> Iteration is done on the Results level, so the information is parsed
>> on the search query level, not just a single HSP (I wrote a very
>> short description about what I'm planning the objects to be in here as
>> well: http://bit.ly/searchio-terms). I suppose if we aim for maximum
>> information parsing over performance and simplicity of the
>> format-specific parsers, this is the way to go. There are other
>> formats, too, that contain sequence-level search information not
>> present in the alignment (e.g. HMMER text output). What do you think
>> about this?
>
> That sounds good.
>
> If iteration is done on the Results level, when/how would your
> Search object be used?
>
> Peter

I'm thinking of using the Search object as the object returned by SearchIO.parse or SearchIO.read. That way, we can store attributes common to the different search queries in it.
For example:

>>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>>> search.format
'blast-xml'
>>> search.algorithm
'blastx'
>>> search.version
'2.2.26+'
>>> search.database
'refseq_protein'
>>> search.results

And iteration over the results would be done like this (for example):

>>> for result in search.results:
...     print(result.query, len(result))

Additionally, we can also define __iter__ and next for Search so we can just do the following:

>>> for result in search:
...     print(result.query, len(result))

What do you think?

Bow

From p.j.a.cock at googlemail.com Mon Apr 30 10:57:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Apr 2012 11:57:27 +0100
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Apr 30, 2012 at 11:08 AM, Wibowo Arindrarto wrote:
>
> I'm thinking of using the Search object as the object returned by
> SearchIO.parse or SearchIO.read. That way, we can store attributes
> common to the different search queries in it. For example:
>
> >>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
> >>> search.format
> 'blast-xml'
> >>> search.algorithm
> 'blastx'
> >>> search.version
> '2.2.26+'
> >>> search.database
> 'refseq_protein'
> >>> search.results
>
>
> And iteration over the results would be done like this (for example):
> >>> for result in search.results:
> ...     print(result.query, len(result))
>
> Additionally, we can also define __iter__ and next for Search so we can
> just do the following:
> >>> for result in search:
> ...     print(result.query, len(result))
>
> What do you think?

I think you'll get in a mess with multiple iterators all sharing the same handle and competing over using it - but maybe I'm not grasping what you have in mind.
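The shared-handle concern can be illustrated with a small standalone sketch. It uses only the Python standard library; read_lines, first and second are made-up names for illustration and have nothing to do with the actual SearchIO code. Two iterators built on the same handle both advance the same file position, so each one only sees part of the data.

```python
import io


def read_lines(handle):
    """Yield stripped lines from a handle, advancing its position as it goes."""
    for line in handle:
        yield line.rstrip("\n")


# One handle, two iterators that both read from it.
handle = io.StringIO("query1\nquery2\nquery3\nquery4\n")
first = read_lines(handle)
second = read_lines(handle)

# Alternate between the two iterators.
seen_by_first = [next(first)]
seen_by_second = [next(second)]
seen_by_first.append(next(first))
seen_by_second.append(next(second))

print(seen_by_first)   # ['query1', 'query3']
print(seen_by_second)  # ['query2', 'query4']
```

Neither iterator sees all four records, which is the kind of mess a Search object exposing both search.results and its own __iter__ over one file handle could run into unless the results are cached or the handle is re-opened.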
Initially keep it simple: The primary public API would be

for result in Bio.SearchIO.parse(...):
    print(result.query, len(result))

where each iteration gives a complete result set for one query.

Peter

P.S. With SearchIO subject to name space discussions ;)
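A minimal sketch of that "one complete result set per iteration" shape, assuming BLAST-tabular-like input where the first column is the query ID and all rows for one query are consecutive. Both iter_hsp_rows and parse_by_query are hypothetical helpers written for this example, not the Bio.SearchIO API: a low-level generator yields one row per HSP, and a thin wrapper batches consecutive rows by query using itertools.groupby.

```python
import io
from itertools import groupby


def iter_hsp_rows(handle):
    """Low-level iterator: yield one list of columns per HSP row."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if line and not line.startswith("#"):
            yield line.split("\t")


def parse_by_query(handle):
    """Yield (query_id, rows), batching consecutive rows of the same query."""
    rows = iter_hsp_rows(handle)
    for query_id, group in groupby(rows, key=lambda row: row[0]):
        # Materialise the group before yielding; groupby reuses the iterator.
        yield query_id, list(group)


# Made-up three-column data: query ID, subject ID, percent identity.
example = io.StringIO(
    "q1\ts1\t98.5\n"
    "q1\ts2\t77.0\n"
    "q2\ts1\t60.1\n"
)
results = list(parse_by_query(example))
for query_id, hsp_rows in results:
    print(query_id, len(hsp_rows))
```

Only one consumer ever reads from the handle here, so the competing-iterator problem does not arise, and the caller still gets everything for a query in a single step.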