From mdipierro at cs.depaul.edu Sun May 1 00:51:23 2011 From: mdipierro at cs.depaul.edu (Massimo Di Pierro) Date: Sat, 30 Apr 2011 23:51:23 -0500 Subject: [Biopython-dev] biopython web interface In-Reply-To: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> References: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> Message-ID: Hello Andrea I am a looking at something a little different than what you are doing but we should definitely collaborate. I am trying to identify tasks that are not domain specific that could benefit more than one scientific community. It seems to me all scientific communities have data, have program (in python or not it irrelevant to me) and have a workflow. They all need: 1) a tool to post the data online in a semi-automated fashion 2) a tool to share data easily (both via web interface and scripting via web service) with access control 3) a way to annotate the data as in a CMS 4) a mechanism to connect data with a workflow so that certain programs are executed automatically when new data is uploaded in the system. The programs may require user input so it should possible to somehow register a task (a program) by describing what input data it needs and what user input it needs and the system should automatically generate an interface. 5) an interface to local clusters and grid resources to submit computing jobs to I do not have the resources or the expertise to build an interface specific for biopython but I think we should collaborate because if what I am going is general enough (and I am not sure it is unless we talk more about it) it could be used to create an interface to biopython with minimal programming. I understand your focus is on algorithms but I need to start on data. It is my experience it is very difficult to automate the workflow of algorithms if there is no standard exchange format for the data. The first thing I would need to understand are: - does biopython handle some standard file formats? What do they contain? how can they be recognized? Can you send me a few example? - is there a graph of which algorithms run on which file types? - what are the most common algorithms? Can you point me to the source? I like to think of the system as something that will represent the workflow as a graph. Each file type is a node. An algorithm is a link. If a node is an image or a csv file or an xml file or a movie or a vtk file, etc. the system will be able to represent it (show it). Links "define" the file type. As long as you have a standard, you will be able to register your algorithms and the system will know what to do. The all graph is built automatically without programming by introspecting your folders and identifying your files. You will be able to annotate your folders using a markup language to augment the information. In my approach starting from the data is critical. My approach does not fly if you do not have standard file formats. Massimo P.S. Sei italiano? On Apr 30, 2011, at 12:03 PM, Andrea Pierleoni wrote: > >> >> Message: 3 >> Date: Fri, 29 Apr 2011 08:34:34 -0500 >> From: Massimo Di Pierro >> Subject: [Biopython-dev] biopython web interface >> To: >> Message-ID: <57629245-F184-4143-8B18-80E69BC2C351 at cs.depaul.edu> >> Content-Type: text/plain; charset="us-ascii" >> >> Hello everybody, >> >> I am new to biopython and I have some silly questions. >> >> Does biopython have a web interface? >> If not, would you be interested in help developing one? >> What kind of features would you be interested in? 
>> >> Reason for my question: I am a physicist and a professor of CS. I am >> working with a few different groups to build a unified platform to bring >> scientific data online. The main idea is that of having a tool that >> requires no programming and scientists can use to introspect an existing >> directory and turn it into dynamical web pages. Those pages can then be >> edited and re-oreganized like a CMS. The system should be able to >> recognize basic file types, group, tag and categorize them. It should them >> be possible to register algorithms, run them on the server, create a >> workflow. The system will also have an interface for mobile. >> >> Here is a first prototype for physics data that interface with the >> National Energy Research Computing Center: >> http://tests.web2py.com/nersc >> >> Since we are doing this it would be great to have as many community on >> board as possible so that we can write specs that are broad enough. >> We can do all the work or you can help us if you want. >> >> So, if you have a wish list please share it with me. >> >> Personally, I need to be educated on biopython since I do not fully >> understand what are the basic file types it handles, what are the most >> popular algorithms it provides, nor I am familiar with the typical usage >> workflow. >> >> Massimo >> >> >> > > > Hi Massimo, > BioPython itself is a python library, but a web interface would enable many > functions to biological scientist with no programming expertise. > There are some parts of the library that cope well with a > web-interface/server, > in particular the BioSQL modules. > The BioSQL schema is a relational database model to store biological data. > I do have working code for using the BioPython BioSQL functions (and more) > with > the web2py DAL, and I'm working on a complete web2py-based opensource > webserver to store and manage biological sequences/entities. > If you (or any other) are interested and want to contribute, let me know. > There are many things in common between what I'm doing and what you want > to do, > so maybe its a good idea to work together. > > Andrea Pierleoni > > > From bjclavijo at gmail.com Mon May 2 09:52:15 2011 From: bjclavijo at gmail.com (Bernardo Clavijo) Date: Mon, 2 May 2011 10:52:15 -0300 Subject: [Biopython-dev] biopython web interface In-Reply-To: References: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> Message-ID: Hello Massimo... first of all... thanks for web2py, which is my tool of choice for web apps :D Here goes my 2 cents about all this: 1) I you're looking for a standard format, we should me talking about sequence files ( fasta / gff ). This approach will be very restrictive, but i guess it's a starting point. 2) you should look at galaxy, in some point I was hoping to integrate a web2py programming module directly there (don't know how yet, and i'm in many things at once, so it's more like a dream than a project). Galaxy has a fex tutorials and videos that should point you in the right direction. 3) Sadly, standard data representation has been an issue for some time for the bioinformatics community, the REST / web services approach has gain some momentum and some apps talk to each other in some way, but we still have not much of a standard way to represent all the data. 
Ontologies are a strong point also (check http://www.obofoundry.org/ ) with sequence ontology being a great one IMHO pointing on how the data should be represented (it's recommended, even when not enforced, to use SO when creating gff3 files). 4) So far, the one tool to "standard biological data saving" I've found useful was the Chado DB schema, which BTW didn't enforce or even define how to handle a lot of situations, but is more of a framework on which to base your own data representation. I guess that's not what you're looking for, but surely an interesting approach and a lot of lessons learned there. I'm currently building a web interface for some of our projects saving genomic and proteomic data on a Chado DB ( http://gmod.org/wiki/Chado ) using web2py, but it's at least rough and in a pre-alpha (as in a PoC) state. Some other folks here have been doing the same kind of projects, hopefully someone with a better and less specific approach. If it suits you, just contact me and i'll provide you all the direction and ideas my limited knowledge could generate. I'm a little dispersed man most of the time, so maybe not your ideal adviser, but I have the will. Greets and thanks again for web2py Bernardo Clavijo PD: please folks correct all my bad ideas for Massimo to have a real view and not my mess On Sun, May 1, 2011 at 1:51 AM, Massimo Di Pierro wrote: > Hello Andrea > > I am a looking at something a little different than what you are doing but we should definitely collaborate. > I am trying to identify tasks that are not domain specific that could benefit more than one scientific community. > > It seems to me all scientific communities have data, have program (in python or not it irrelevant to me) and have a workflow. > They all need: > 1) a tool to post the data online in a semi-automated fashion > 2) a tool to share data easily (both via web interface and scripting via web service) with access control > 3) a way to annotate the data as in a CMS > 4) a mechanism to connect data with a workflow so that certain programs are executed automatically when new data is uploaded in the system. The programs may require user input so it should possible to somehow register a task (a program) by describing what input data it needs and what user input it needs and the system should automatically generate an interface. > 5) an interface to local clusters and grid resources to submit computing jobs to > > I do not have the resources or the expertise to build an interface specific for biopython but I think we should collaborate because if what I am going is general enough (and I am not sure it is unless we talk more about it) it could be used to create an interface to biopython with minimal programming. > > I understand your focus is on algorithms but I need to start on data. It is my experience it is very difficult to automate the workflow of algorithms if there is no standard exchange format for the data. > > The first thing I would need to understand are: > - does biopython handle some standard file formats? What do they contain? how can they be recognized? Can you send me a few example? > - is there a graph of which algorithms run on which file types? > - what are the most common algorithms? Can you point me to the source? > > I like to think of the system as something that will represent the workflow as a graph. Each file type is a node. An algorithm is a link. > If a node is an image or a csv file or an xml file or a movie or a vtk file, etc. the system will be able to represent it (show it). 
> Links "define" the file type. As long as you have a standard, you will be able to register your algorithms and the system will know what to do. > > The all graph is built automatically without programming by introspecting your folders and identifying your files. You will be able to annotate your folders using a markup language to augment the information. > > In my approach starting from the data is critical. My approach does not fly if you do not have standard file formats. > > Massimo > > > > > > > > P.S. Sei italiano? > > On Apr 30, 2011, at 12:03 PM, Andrea Pierleoni wrote: > >> >>> >>> Message: 3 >>> Date: Fri, 29 Apr 2011 08:34:34 -0500 >>> From: Massimo Di Pierro >>> Subject: [Biopython-dev] biopython web interface >>> To: >>> Message-ID: <57629245-F184-4143-8B18-80E69BC2C351 at cs.depaul.edu> >>> Content-Type: text/plain; charset="us-ascii" >>> >>> Hello everybody, >>> >>> I am new to biopython and I have some silly questions. >>> >>> Does biopython have a web interface? >>> If not, would you be interested in help developing one? >>> What kind of features would you be interested in? >>> >>> Reason for my question: I am a physicist and a professor of CS. I am >>> working with a few different groups to build a unified platform to bring >>> scientific data online. The main idea is that of having a tool that >>> requires no programming and scientists can use to introspect an existing >>> directory and turn it into dynamical web pages. Those pages can then be >>> edited and re-oreganized like a CMS. The system should be able to >>> recognize basic file types, group, tag and categorize them. It should them >>> be possible to register algorithms, run them on the server, create a >>> workflow. The system will also have an interface for mobile. >>> >>> Here is a first prototype for physics data that interface with the >>> National Energy Research Computing Center: >>> http://tests.web2py.com/nersc >>> >>> Since we are doing this it would be great to have as many community on >>> board as possible so that we can write specs that are broad enough. >>> We can do all the work or you can help us if you want. >>> >>> So, if you have a wish list please share it with me. >>> >>> Personally, I need to be educated on biopython since I do not fully >>> understand what are the basic file types it handles, what are the most >>> popular algorithms it provides, nor I am familiar with the typical usage >>> workflow. >>> >>> Massimo >>> >>> >>> >> >> >> Hi Massimo, >> BioPython itself is a python library, but a web interface would enable many >> functions to biological scientist with no programming expertise. >> There are some parts of the library that cope well with a >> web-interface/server, >> in particular the BioSQL modules. >> The BioSQL schema is a relational database model to store biological data. >> I do have working code for using the BioPython BioSQL functions (and more) >> with >> the web2py DAL, and I'm working on a complete web2py-based opensource >> webserver to store and manage biological sequences/entities. >> If you (or any other) are interested and want to contribute, let me know. >> There are ?many things in common between what I'm doing and what you want >> to do, >> so maybe its a good idea to work together. 
>> >> Andrea Pierleoni >> >> >> > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue May 3 05:24:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 May 2011 10:24:08 +0100 Subject: [Biopython-dev] Interesting BLAST 2.2.25+ XML behaviour In-Reply-To: References: Message-ID: Hello all, I've CC'd the BioPerl, BioRuby, BioJava and Biopython development mailing lists to make sure you're aware of this, but can we continue any discussion on the cross-project open-bio-l mailing list please? I noticed that recent versions of BLAST are not using a single block for each query, which was the historical behaviour and assumed by the Biopython BLAST XML parser. This may be a bug in BLAST. See link below for an example. Has anyone else noticed this, and has it been reported to the NCBI yet? Thanks, Peter (Not for the first time, I wish there was a public bug tracker for BLAST, or at least a private bug tracker so we could talk about issues with an NCBI assigned reference number.) ---------- Forwarded message ---------- From: Peter Cock Date: Wed, Apr 20, 2011 at 6:08 PM Subject: Interesting BLAST 2.2.25+ XML behaviour To: Biopython-Dev Mailing List Hi all, Have a look at this XML file from a FASTA vs FASTA search using blastp from ?BLAST 2.2.25+ (current release), which is a test file I created for the BLAST+ wrappers in Galaxy: https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml I just put it though the Biopython BLAST XML parser, and was surprised not to get four records back (since as you might guess from the filename, there were four queries). It appears this version of BLAST+ is incrementing the iteration counter for each match... or something like that. Has anyone else noticed this? I wonder if it is accidental... Peter From updates at feedmyinbox.com Wed May 4 00:37:19 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 4 May 2011 00:37:19 -0400 Subject: [Biopython-dev] 5/4 active questions tagged biopython - Stack Overflow Message-ID: // Finding/Replacing substrings with annotations in an ASCII file in Python // May 3, 2011 at 9:14 AM http://stackoverflow.com/questions/5870012/finding-replacing-substrings-with-annotations-in-an-ascii-file-in-python Hello Everyone, I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to extract motif sequences from a database and use the information to annotate a sequence alignment file. The alignment file is plain text, so the annotation will not be anything elaborate, at best simply replacing the extracted sequences with asterisks in the alignment file itself. I have a script which scans the database file, extracts all sequences I need, and writes them to an output file. What I need is, given a query, to read these sequences and match them to their corresponding substrings in the ASCII alignment files. Finally, for every occurrence of a motif sequence (substring of a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *. 
The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database): motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r') align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+') finalmotifs = motif_file.readlines() seqalign = align_file.readlines() for line in seqalign: if motif[i] in seqalign: # I have stored all motifs in a list called "motif" replace(motif, '*****') But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone see why this is happening? I suspect that the problem may lie in the fact that my ASCII file is basically just one very long list of amino acids, and Python cannot know how to replace a particular substring hidden within a very long string. -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Wed May 4 06:21:08 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 12:21:08 +0200 Subject: [Biopython-dev] Benchmarking PDBParser Message-ID: Hello all, Following a few discussions, I'm tempted to benchmark the current implementation of the PDBParser and see how it fares against an old implementation (I think I'll use 1.48 since older versions need Numerical Python). The main objective is to see if the recent developments have a significant impact in its speed. I thought of downloading the entire PDB but since it would take several days, I downloaded the CATH domain list instead. Those are just protein ATOM records, without any header, but since all modifications were essentially dealing with ATOM records, etc, I think it might be as valid. I'll be running tests today and tomorrow and I'll put the results up somewhere later on. I'm also making the scripts available so it is easy to benchmark it later on. Thoughts or suggestions? Cheers, Jo?o From p.j.a.cock at googlemail.com Wed May 4 06:39:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 May 2011 11:39:19 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Wed, May 4, 2011 at 11:21 AM, Jo?o Rodrigues wrote: > Hello all, > > Following a few discussions, I'm tempted to benchmark the current > implementation of the PDBParser and see how it fares against an old > implementation (I think I'll use 1.48 since older versions need Numerical > Python). The main objective is to see if the recent developments have a > significant impact in its speed. > > I thought of downloading the entire PDB but since it would take several > days, I downloaded the CATH domain list instead. Those are just protein ATOM > records, without any header, but since all modifications were essentially > dealing with ATOM records, etc, I think it might be as valid. > > I'll be running tests today and tomorrow and I'll put the results up > somewhere later on. I'm also making the scripts available so it is easy to > benchmark it later on. > > Thoughts or suggestions? > > Cheers, > > Jo?o That sounds like a good idea. 
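As a rough illustration of the timing runs being discussed, a minimal benchmark sketch could look like the following (the directory name is hypothetical, and this is only a sketch, not the actual script being prepared):

    import glob
    import time
    from Bio.PDB import PDBParser

    # Hypothetical folder holding the downloaded structure files.
    pdb_files = glob.glob("benchmark_data/*.pdb")

    parser = PDBParser(PERMISSIVE=1)
    start = time.time()
    for filename in pdb_files:
        # Reuse one parser object; only get_structure() is timed per file.
        parser.get_structure(filename, filename)
    total = time.time() - start

    print("Total time spent: %.3fs" % total)
    print("Average time per structure: %.3fms" % (1000.0 * total / len(pdb_files)))

Running the same loop against two Biopython checkouts gives directly comparable totals, which is essentially what the figures quoted later in this thread report.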
While you are at it, you could try both the strict and permissive modes - I wonder what proportion of the current PDB has problems in the data? Peter From anaryin at gmail.com Wed May 4 06:42:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 12:42:12 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: I was not planning on using the PDB database, but I might as well download it then. Adding that to the list. I'm also planning on removing all elements and check the impact of finding the elements. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao From anaryin at gmail.com Wed May 4 09:23:39 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 15:23:39 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Just a word of advice. I tried to download the whole PDB with PDBList.py and I ran into an error. Their server shut me down due to too many connections. Perhaps adding an exception catcher like the one we have for NCBI servers would be useful? Preliminary results show some degradation of speed.. ==> benchmark_CATH-biopython_149.time <== Total time spent: 530.686s Average time per structure: 46.839ms ==> benchmark_CATH-biopython_current.time <== Total time spent: 686.176s Average time per structure: 60.563ms I'll write a full summary when I finish downloading the PDB and testing it. From chad.a.davis at gmail.com Wed May 4 09:55:04 2011 From: chad.a.davis at gmail.com (Chad Davis) Date: Wed, 4 May 2011 15:55:04 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: I'd be very interested in this as well. I'm working on some modifications (in the alpha stages still) to the BioPerl PDB parser (based on the Perl Data Language, analogous to NumPy) and would be interested to compare all of them (BioPython old and new, BioPerl old and new). In my experience, downloading the PDB, just the divided structures, works best with rsync, and I believe it should only take several hours, not several days, the first time. It should be as easy as: rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb Other options: http://www.wwpdb.org/downloads.html Chad On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: > Just a word of advice. I tried to download the whole PDB with PDBList.py and > I ran into an error. Their server shut me down due to too many connections. > Perhaps adding an exception catcher like the one we have for NCBI servers > would be useful? > > Preliminary results show some degradation of speed.. > > ==> benchmark_CATH-biopython_149.time <== > Total time spent: 530.686s > Average time per structure: 46.839ms > > ==> benchmark_CATH-biopython_current.time <== > Total time spent: 686.176s > Average time per structure: 60.563ms > > I'll write a full summary when I finish downloading the PDB and testing it. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From anaryin at gmail.com Wed May 4 09:57:40 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 15:57:40 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hey Chad, That's exactly what I ended up doing and it is done ;) Pretty quick, I was hoping for a day or so! Best, Jo?o [...] 
Rodrigues http://nmr.chem.uu.nl/~joao On Wed, May 4, 2011 at 3:55 PM, Chad Davis wrote: > I'd be very interested in this as well. > I'm working on some modifications (in the alpha stages still) to the > BioPerl PDB parser (based on the Perl Data Language, analogous to > NumPy) and would be interested to compare all of them (BioPython old > and new, BioPerl old and new). > > In my experience, downloading the PDB, just the divided structures, > works best with rsync, and I believe it should only take several > hours, not several days, the first time. It should be as easy as: > > rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb > > Other options: > http://www.wwpdb.org/downloads.html > > Chad > > > On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: > > Just a word of advice. I tried to download the whole PDB with PDBList.py > and > > I ran into an error. Their server shut me down due to too many > connections. > > Perhaps adding an exception catcher like the one we have for NCBI servers > > would be useful? > > > > Preliminary results show some degradation of speed.. > > > > ==> benchmark_CATH-biopython_149.time <== > > Total time spent: 530.686s > > Average time per structure: 46.839ms > > > > ==> benchmark_CATH-biopython_current.time <== > > Total time spent: 686.176s > > Average time per structure: 60.563ms > > > > I'll write a full summary when I finish downloading the PDB and testing > it. > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From redmine at redmine.open-bio.org Wed May 4 17:56:27 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 4 May 2011 21:56:27 +0000 Subject: [Biopython-dev] [Biopython - Feature #3194] (In Progress) Bio.Phylo export to 'ape' via Rpy2 References: Message-ID: Issue #3194 has been updated by Eric Talevich. Status changed from New to In Progress Assignee changed from Eric Talevich to Biopython Dev Mailing List % Done changed from 0 to 20 Estimated time set to 0.50 I added a cookbook entry for this on the Biopython wiki: http://www.biopython.org/wiki/Phylo_cookbook#Convert_to_an_.27ape.27_tree.2C_via_Rpy2 Good enough? Trying it in ipython, it works as advertised, except after calling r.plot() the R plot window won't close until I exit ipython. Further calls to plot() update the window; it just doesn't close. ---------------------------------------- Feature #3194: Bio.Phylo export to 'ape' via Rpy2 https://redmine.open-bio.org/issues/3194 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: There are many more packages for working with phylogenetic data in R, and most of these operate on the basic tree object defined in the ape package. Let's support interoperability through Rpy2. The trivial way to do this is serialize a tree to a Newick string, then feed that to the read.tree() function. Maybe we can build the tree object in R directly and retain the tree annotations that Newick doesn't handle. See: http://ape.mpl.ird.fr/ http://rpy.sourceforge.net/rpy2.html -- You have received this notification because you have either subscribed to it, or are involved in it. 
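To make the "trivial way" described in issue #3194 above concrete, a minimal sketch could look like this; it assumes Rpy2 and the R 'ape' package are installed, and the input file name is hypothetical:

    from cStringIO import StringIO
    from Bio import Phylo
    from rpy2.robjects import r
    from rpy2.robjects.packages import importr

    ape = importr("ape")                        # load the R 'ape' package via Rpy2

    tree = Phylo.read("example.nwk", "newick")  # hypothetical input tree
    handle = StringIO()
    Phylo.write(tree, handle, "newick")         # serialize the tree to a Newick string

    ape_tree = r["read.tree"](text=handle.getvalue())
    r["plot"](ape_tree)

Building the ape tree object directly in R, as the issue suggests, would preserve annotations that this Newick round trip drops.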
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 4 18:25:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 4 May 2011 22:25:11 +0000 Subject: [Biopython-dev] [Biopython - Feature #3194] Bio.Phylo export to 'ape' via Rpy2 References: Message-ID: Issue #3194 has been updated by Eric Talevich. File feat3194.diff added Estimated time changed from 0.50 to 1.00 Patch based on the cookbook entry. ---------------------------------------- Feature #3194: Bio.Phylo export to 'ape' via Rpy2 https://redmine.open-bio.org/issues/3194 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: There are many more packages for working with phylogenetic data in R, and most of these operate on the basic tree object defined in the ape package. Let's support interoperability through Rpy2. The trivial way to do this is serialize a tree to a Newick string, then feed that to the read.tree() function. Maybe we can build the tree object in R directly and retain the tree annotations that Newick doesn't handle. See: http://ape.mpl.ird.fr/ http://rpy.sourceforge.net/rpy2.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Fri May 6 03:45:53 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 09:45:53 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hello all, I'd love to come with results but I ran into some problems. The parser is consuming too much memory after a while (>2GB) and I can't get reliable timings then because of swapping.. Therefore, I'll just take a random sample of 8000 structures and use it as a benchmark. I'll post the results today, shall I put it up on the wiki? This could be an interesting thing to post for both users and future developments. Best, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Wed, May 4, 2011 at 3:57 PM, Jo?o Rodrigues wrote: > Hey Chad, > > That's exactly what I ended up doing and it is done ;) Pretty quick, I was > hoping for a day or so! > > Best, > > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > On Wed, May 4, 2011 at 3:55 PM, Chad Davis wrote: > >> I'd be very interested in this as well. >> I'm working on some modifications (in the alpha stages still) to the >> BioPerl PDB parser (based on the Perl Data Language, analogous to >> NumPy) and would be interested to compare all of them (BioPython old >> and new, BioPerl old and new). >> >> In my experience, downloading the PDB, just the divided structures, >> works best with rsync, and I believe it should only take several >> hours, not several days, the first time. It should be as easy as: >> >> rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb >> >> Other options: >> http://www.wwpdb.org/downloads.html >> >> Chad >> >> >> On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: >> > Just a word of advice. I tried to download the whole PDB with PDBList.py >> and >> > I ran into an error. Their server shut me down due to too many >> connections. >> > Perhaps adding an exception catcher like the one we have for NCBI >> servers >> > would be useful? >> > >> > Preliminary results show some degradation of speed.. 
>> > >> > ==> benchmark_CATH-biopython_149.time <== >> > Total time spent: 530.686s >> > Average time per structure: 46.839ms >> > >> > ==> benchmark_CATH-biopython_current.time <== >> > Total time spent: 686.176s >> > Average time per structure: 60.563ms >> > >> > I'll write a full summary when I finish downloading the PDB and testing >> it. >> > _______________________________________________ >> > Biopython-dev mailing list >> > Biopython-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > >> > > From anaryin at gmail.com Fri May 6 03:54:20 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 09:54:20 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() Message-ID: Hello all, The PDBParser is sometimes a bit too loud, making meaningful output drown in dozens of warnings messages. This is partly (mostly) my fault because of the element guessing addition. Therefore, I'd suggest adding a QUIET argument (bool) to PDBParser that would supress all warnings. Of course, default is False. It might come handy for batch processing of proteins. I've added it to my pdb_enhancements branch so you can take a look: https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Fri May 6 04:18:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 09:18:44 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 8:45 AM, Jo?o Rodrigues wrote: > Hello all, > > I'd love to come with results but I ran into some problems. The parser is > consuming too much memory after a while (>2GB) and I can't get reliable > timings then because of swapping.. Therefore, I'll just take a random sample > of 8000 structures and use it as a benchmark. Memory bloat is bad - it sounds like a garbage collection problem. Are you recreating the parser object each time? > I'll post the results today, shall I put it up on the wiki? This could be an > interesting thing to post for both users and future developments. I'd like to see the script and the results, so maybe the wiki is better. Peter From anaryin at gmail.com Fri May 6 04:24:04 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 10:24:04 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: > > Memory bloat is bad - it sounds like a garbage collection problem. > Are you recreating the parser object each time? > No. I'm just calling get_structure at each step of the for loop. It's a bit irregular also, sometimes it drops from 1GB to 300MB, stays stable for a while and then spikes again. My guess is that all the data structures holding the parser structures consume quite a lot and probably there's no decent GC to clear the previous structure in time, so it accumulates. Is there any way I can profile the script to see who's keeping the most memory throughout the run? > > > I'll post the results today, shall I put it up on the wiki? This could be > an > > interesting thing to post for both users and future developments. > > I'd like to see the script and the results, so maybe the wiki is better. > Will do. 
Jo?o From p.j.a.cock at googlemail.com Fri May 6 04:29:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 09:29:19 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 9:24 AM, Jo?o Rodrigues wrote: >> Memory bloat is bad - it sounds like a garbage collection problem. >> Are you recreating the parser object each time? > > No. I'm just calling get_structure at each step of the for loop. It's a bit > irregular also, sometimes it drops from 1GB to 300MB, stays stable for a > while and then spikes again. My guess is that all the data structures > holding the parser structures consume quite a lot and probably there's no > decent GC to clear the previous structure in time, so it accumulates. > You could do an explicit clear once per PDB file to test this hypothesis: import gc gc.collect() Peter From p.j.a.cock at googlemail.com Fri May 6 05:25:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 10:25:50 +0100 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() Message-ID: On Fri, May 6, 2011 at 8:54 AM, Jo?o Rodrigues wrote: > Hello all, > > The PDBParser is sometimes a bit too loud, making meaningful output drown in > dozens of warnings messages. This is partly (mostly) my fault because of the > element guessing addition. Therefore, I'd suggest adding a QUIET argument > (bool) to PDBParser that would supress all warnings. Of course, default is > False. It might come handy for batch processing of proteins. > > I've added it to my pdb_enhancements branch so you can take a look: > > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > I had a look and Joao and I have been having a little discussion with the github comments feature. There are two ways to solve this, (1) Have a flag which controls issuing the warning (2) Filter out PDBConstructionWarning messages The first approach is messy as the flag needs to passed down to any relevant object (or done as a global which is nasty). The second approach requires a temporary warnings filter, which I think would easily done with the context manager warnings.catch_warnings() in Python 2.5+ I'd also like to use this in the unit tests, where currently we have to save the filter list, add a temporary filter, then restore the filter list. This generally works, but there are some stray warnings that are not being silenced. Given we've already officially dropped support for Python 2.4, I don't anticipate any protests. I guess before making such a change on the trunk, Tiago or I should turn off the Python 2.4 buildbot buildslaves... Peter From p.j.a.cock at googlemail.com Fri May 6 05:31:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 10:31:50 +0100 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 10:25 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 8:54 AM, Jo?o Rodrigues wrote: >> Hello all, >> >> The PDBParser is sometimes a bit too loud, making meaningful output drown in >> dozens of warnings messages. This is partly (mostly) my fault because of the >> element guessing addition. Therefore, I'd suggest adding a QUIET argument >> (bool) to PDBParser that would supress all warnings. Of course, default is >> False. It might come handy for batch processing of proteins. 
>> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> > > I had a look and Joao and I have been having a little > discussion with the github comments feature. > > There are two ways to solve this, > (1) Have a flag which controls issuing the warning > (2) Filter out PDBConstructionWarning messages > > The first approach is messy as the flag needs to passed > down to any relevant object (or done as a global which is > nasty). > > The second approach requires a temporary warnings filter, > which I think would easily done with the context manager > warnings.catch_warnings() in Python 2.5+ Arhh, Jaoa just pointed out warnings.catch_warnings() is in Python 2.6+ so we have to wait a while longer before we can use that :( Peter From redmine at redmine.open-bio.org Fri May 6 11:57:48 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 May 2011 15:57:48 +0000 Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py References: Message-ID: Issue #2619 has been updated by Eric Talevich. Flex takes a .lex file and generates a .c file. The .c file is the important thing to compile, not .lex. Looking at the generated C in lex.yy.c, I'd guess the same thing can be compiled all the platforms we support (though I haven't confirmed). As a short-term solution, can we check in lex.yy.c and include that with the distribution, in order to eliminate the flex dependency? ---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes need to enable are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri May 6 12:05:54 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 May 2011 16:05:54 +0000 Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py References: Message-ID: Issue #2619 has been updated by Peter Cock. Eric, we need two things: (1) The flex binary to convert our lex file into C, which as you point out we might be able to do in advance (assuming this version of flex is unimportant). Detecting the flex binary is pretty easy on Unix like platforms. See comment 4. (2) The flex headers to compile the C code. This can probably be solved, perhaps looking at similar issues in NumPy. 
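For reference, the work-around suggested in the original report (condition the compile on the platform) would amount to something like the sketch below in setup.py; the module and source paths are assumptions based on the report, not a tested patch:

    # Sketch only: skip the MMCIFlex extension on Windows, assuming the
    # flex-generated lex.yy.c has been checked into the source tree.
    import sys
    from distutils.core import Extension

    EXTENSIONS = []
    if sys.platform != "win32":
        EXTENSIONS.append(
            Extension("Bio.PDB.mmCIF.MMCIFlex",
                      ["Bio/PDB/mmCIF/lex.yy.c",
                       "Bio/PDB/mmCIF/MMCIFlexmodule.c"]))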
---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes need to enable are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Fri May 6 12:20:54 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 6 May 2011 12:20:54 -0400 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 5:31 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 10:25 AM, Peter Cock > wrote: > > > > The second approach requires a temporary warnings filter, > > which I think would easily done with the context manager > > warnings.catch_warnings() in Python 2.5+ > > Arhh, Jaoa just pointed out warnings.catch_warnings() is > in Python 2.6+ so we have to wait a while longer before > we can use that :( > > Fortunately we've already worked around it in test_PDB.py, by monkeypatching: https://github.com/biopython/biopython/blob/master/Tests/test_PDB.py See the method test_1_warnings. Replace the function warnings.showwarnings with a new function that just collects warning objects in a list rather than printing them. Then, before the outer function ends, swap back the original showwarnings function. -E From eric.talevich at gmail.com Fri May 6 12:23:33 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 6 May 2011 12:23:33 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues wrote: > Hello all, > > The PDBParser is sometimes a bit too loud, making meaningful output drown > in > dozens of warnings messages. This is partly (mostly) my fault because of > the > element guessing addition. Therefore, I'd suggest adding a QUIET argument > (bool) to PDBParser that would supress all warnings. Of course, default is > False. It might come handy for batch processing of proteins. > > I've added it to my pdb_enhancements branch so you can take a look: > > > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > > Since the PERMISSIVE argument is already an integer, could we consolidate these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? 
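A minimal sketch of the monkey-patching approach described above (note the standard-library hook is warnings.showwarning, singular) might look like this; the input file name is hypothetical:

    import warnings
    from Bio.PDB import PDBParser

    captured = []
    original_hook = warnings.showwarning

    def capture(message, category, filename, lineno, file=None, line=None):
        # Collect the warning instead of printing it.
        captured.append((category, message))

    warnings.showwarning = capture
    try:
        parser = PDBParser(PERMISSIVE=1)
        structure = parser.get_structure("demo", "demo.pdb")  # hypothetical file
    finally:
        # Always restore the original hook, as done in test_PDB.py.
        warnings.showwarning = original_hook

    print("Captured %i construction warnings" % len(captured))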
From p.j.a.cock at googlemail.com Fri May 6 12:25:40 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 17:25:40 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 5:23 PM, Eric Talevich wrote: > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues wrote: > >> Hello all, >> >> The PDBParser is sometimes a bit too loud, making meaningful output drown >> in >> dozens of warnings messages. This is partly (mostly) my fault because of >> the >> element guessing addition. Therefore, I'd suggest adding a QUIET argument >> (bool) to PDBParser that would supress all warnings. Of course, default is >> False. It might come handy for batch processing of proteins. >> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> >> > Since the PERMISSIVE argument is already an integer, could we consolidate > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? > I'm OK with that, Peter From redmine at redmine.open-bio.org Sat May 7 14:52:44 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 7 May 2011 18:52:44 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. Here's a branch with Bio.Phylo converted to rst: https://github.com/etal/biopython/tree/rst_docstrings The main deviation from the Numpy guidelines is using:
:Parameters:
instead of:
Parameters
----------
This is because Epydoc only understands the former, so the latter produces something ugly in the generated docs. It will be easy enough to change, if we want, when we switch to Sphinx. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 12:33:03 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 16:33:03 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] (New) deprecated genbank localstion parser doesn't indicate what replaces it Message-ID: Issue #3227 has been reported by Mark Diekhans. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: High Assignee: Category: Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 16:23:26 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 20:23:26 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Peter Cock. Category set to Main Distribution Assignee set to Biopython Dev Mailing List Default assignee was lost... restoring to dev mailing list. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. 
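For anyone hitting the same question as this bug report, a minimal sketch of reading GenBank features without touching LocationParser (the file name is hypothetical) is:

    from Bio import SeqIO

    # Bio.SeqIO's GenBank parser builds SeqFeature objects with their locations
    # (joins, fuzzy ends, strand) already parsed, so there is no need to call
    # Bio.GenBank.LocationParser directly.
    record = SeqIO.read("example.gbk", "genbank")  # hypothetical input file
    for feature in record.features:
        print(feature.type)
        print(feature.location)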
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 18:59:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 22:59:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Mark Diekhans. hanks Peter! I am more than happy to change code to use the new parser. My bug report is that the module deception just says "(DEPRECATED)" and doesn't give one a clue as to how to get the same functionality. This is a request for better documentation, not continued support of this code. Mark ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 19:24:33 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 23:24:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Peter Cock. The new GenBank/EMBL parser will use the new location parsing automatically. If you were using this (via Bio.GenBank or via Bio.SeqIO) you wouldn't have needed to change anything. The only people affected by the deprecation would be people using Bio.GenBank.LocationParser directly. Right now, the new location parsing code isn't really designed to be used on its own. In order to try and help you, I need to know what you were using Bio.GenBank.LocationParser for. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 19:34:45 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 23:34:45 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] (In Progress) Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. 
Status changed from New to In Progress % Done changed from 0 to 20 Thanks for the merge, Peter: https://github.com/biopython/biopython/commit/f617101dfaf358d38e90ed778c98588ee7775c72 So building the Biopython API documentation with Epydoc now depends on docutils. Next step: grep each module for 'epytext' and port those that need it. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon May 9 12:17:53 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 May 2011 16:17:53 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Peter Cock. Eric, Do you have an HTML sample of the Bio.Phylo API docs from Sphinx? You could just email me a zip file if there isn't an easier way to show it. Alternatively, how would I use Sphinx to generate this myself? Thanks. Peter ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Mon May 9 12:37:50 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 9 May 2011 18:37:50 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hey Peter, I've only had the chance to test this today. The parsing seems to be working just fine and the RAM consumption is stable at < 100 MB. I'll see the results tomorrow. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Fri, May 6, 2011 at 10:29 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 9:24 AM, Jo?o Rodrigues wrote: > >> Memory bloat is bad - it sounds like a garbage collection problem. > >> Are you recreating the parser object each time? > > > > No. I'm just calling get_structure at each step of the for loop. It's a > bit > > irregular also, sometimes it drops from 1GB to 300MB, stays stable for a > > while and then spikes again. My guess is that all the data structures > > holding the parser structures consume quite a lot and probably there's no > > decent GC to clear the previous structure in time, so it accumulates. > > > > You could do an explicit clear once per PDB file to test this hypothesis: > > import gc > gc.collect() > > Peter > From redmine at redmine.open-bio.org Mon May 9 13:10:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 May 2011 17:10:01 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. Peter, I haven't tried using Sphinx on Bio.Phylo yet, actually. It seems to require writing a few "stub" files with commands for pulling in doctrings from the selected module... I'll tinker with it and maybe post a branch on Github if it goes well. -Eric ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon May 9 22:44:36 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 May 2011 02:44:36 +0000 Subject: [Biopython-dev] [Biopython - Feature #3219] (In Progress) Port Biopython documentation to Sphinx References: Message-ID: Issue #3219 has been updated by Eric Talevich. Status changed from New to In Progress % Done changed from 0 to 20 Here's a branch where I'm testing Sphinx: https://github.com/etal/biopython/tree/sphinx-demo There's not much there yet, so don't panic. For reference, DendroPy has a good example of Sphinx in action: https://github.com/jeetsukumaran/DendroPy/tree/master/doc/source/ ---------------------------------------- Feature #3219: Port Biopython documentation to Sphinx https://redmine.open-bio.org/issues/3219 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: Currently we use Epydoc for the API reference documentation, and LaTeX (to PDF via pdflatex, and HTML via hevea) for the tutorial. There's some material on the wiki to consider, too. A number of Python projects, including CPython, now use Sphinx for documentation. Content is written in reStructuredText format, and can be pulled from both standalone .rst files and Python docstrings. This offers several advantages: (i) API documentation will be prettier and easier to navigate; (ii) the Tutorial will be easier to edit for those not fluent in LaTeX; (iii) Since the API reference and Tutorial will be written in the same markup, potentially even pulling from some shared sources, it will be easier to address redundant or overlapping portions between the two, avoiding inconsistencies. See: http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ Mailing list discussion: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007977.html Numpy's approach: http://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon May 9 22:51:51 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 May 2011 02:51:51 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. I posted about Sphinx on the parent issue. For this bug, I reckon the best approach is to convert the rest of the docstrings to reStructuredText, removing Epytext markup wherever we find it. 
Going further, we could try using "restructuredtext" instead of "plaintext" as the default format when running Epydoc, and fix any errors or abominations that appear. If we can get that to work, then we'll know it's all safe to pull into Sphinx with the automodule command. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu May 12 06:08:07 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 12 May 2011 10:08:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3229] (New) PDBParser fails when occupancy of atom is -1.0 Message-ID: Issue #3229 has been reported by Jo?o Rodrigues. ---------------------------------------- Bug #3229: PDBParser fails when occupancy of atom is -1.0 https://redmine.open-bio.org/issues/3229 Author: Jo?o Rodrigues Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: PDBID 3NH3 has occupancy values of -1.0 (seems to be an unique case in the PDB). ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu May 12 06:08:07 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 12 May 2011 10:08:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3229] (New) PDBParser fails when occupancy of atom is -1.0 Message-ID: Issue #3229 has been reported by Jo?o Rodrigues. ---------------------------------------- Bug #3229: PDBParser fails when occupancy of atom is -1.0 https://redmine.open-bio.org/issues/3229 Author: Jo?o Rodrigues Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: PDBID 3NH3 has occupancy values of -1.0 (seems to be an unique case in the PDB). -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Thu May 12 09:59:09 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 12 May 2011 15:59:09 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: First results: http://www.biopython.org/wiki/PDBParser Comments? From eric.talevich at gmail.com Thu May 12 22:26:42 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 12 May 2011 22:26:42 -0400 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Thu, May 12, 2011 at 9:59 AM, Jo?o Rodrigues wrote: > First results: http://www.biopython.org/wiki/PDBParser > > Comments? > Cool. So the atom_element additions did slow the parser down noticeably. The warnings may have caused some tiny slowdown, presumably when handling PDB files with inconsistencies, but I personally am not concerned about that. 
I think atom element assignment could be sped up in either of two ways: (a) Try to optimize Atom._assign_element for speed, somehow (b) Store only the atom field as a string during parsing. Change Atom.element and Atom.mass to be properties that parse the atom field to determine the element type on demand (i.e. self._get_element checks if self._element exists yet; if not, parse the string and set self._element; self._get_mass is basically identical to _assign_atom_mass). The lazy loading approach (b) would be faster if you're not using the element/mass values at all, but probably a little slower if you need those values from every atom in a structure. -E From updates at feedmyinbox.com Fri May 13 00:38:20 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 13 May 2011 00:38:20 -0400 Subject: [Biopython-dev] 5/13 active questions tagged biopython - Stack Overflow Message-ID: // renumber residues in a protein structure file (pdb) // May 12, 2011 at 3:54 PM http://stackoverflow.com/questions/5983689/renumber-residues-in-a-protein-structure-file-pdb Hi I am currently involved in making a website aimed at combining all papillomavirus information in a single place. As part of the effort we are curating all known files on public servers (e.g. genbank) One of the issues I ran into was that many (~50%) of all solved structures are not numbered according to the protein. I.e. a subdomain was crystallized (amino acid 310-450) however the crystallographer deposited this as residue 1-140. I was wondering whether anyone knows of a way to renumber the entire pdb file. I have found ways to renumber the sequence (identified by seqres), however this does not update the helix and sheet information. I would appreciate it if you had any suggestions? Thanks -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Fri May 13 02:35:27 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 13 May 2011 08:35:27 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Assigning the element on demand would be too slow, specially when working with modelling structures or other element-less 'formats'. Id replace your option B for a function to assign elements that could be called once, at will, from any entity subclass. On the other hand, optimizing the process probably will help but not by much i would say. Does anyone have ideas on this? Maybe a dictionary with all possible options of atom fullnames? A third issue here is also the overhead that parsing the header brings. It completely kills performance.. There is a flag in the parser called get_header that is useless at the moment. A first step would be to make usable. At least we would have an option to skip the slow part. Perhaps then it would be nice to look at parse_pdb_header and see if we can optimize it. Im curious to see the performance of my branch there because i added more parsing options there too. 
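As an aside, here is a minimal sketch of the lazy-property idea in option (b) above. It is purely illustrative: the class below is not Bio.PDB's Atom, and the one-line element guess is only a stand-in for the real _assign_element logic.

    class LazyAtom(object):
        """Toy example: compute the element on first access, then cache it."""
        def __init__(self, fullname):
            self.fullname = fullname   # four-character PDB atom name, e.g. " CA "
            self._element = None

        @property
        def element(self):
            if self._element is None:
                # Crude stand-in guess: first alphabetic character of the name.
                stripped = self.fullname.strip()
                self._element = next((c for c in stripped if c.isalpha()), "X").upper()
            return self._element

A mass property could cache its value the same way, so repeated access costs nothing after the first lookup, which is the trade-off Eric describes.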
Cheers, Jo?o No dia 13 de Mai de 2011 04:27, "Eric Talevich" escreveu: > On Thu, May 12, 2011 at 9:59 AM, Jo?o Rodrigues wrote: > >> First results: http://www.biopython.org/wiki/PDBParser >> >> Comments? >> > > Cool. So the atom_element additions did slow the parser down noticeably. The > warnings may have caused some tiny slowdown, presumably when handling PDB > files with inconsistencies, but I personally am not concerned about that. > > I think atom element assignment could be sped up in either of two ways: > (a) Try to optimize Atom._assign_element for speed, somehow > (b) Store only the atom field as a string during parsing. Change > Atom.element and Atom.mass to be properties that parse the atom field to > determine the element type on demand (i.e. self._get_element checks if > self._element exists yet; if not, parse the string and set self._element; > self._get_mass is basically identical to _assign_atom_mass). > > The lazy loading approach (b) would be faster if you're not using the > element/mass values at all, but probably a little slower if you need those > values from every atom in a structure. > > -E From updates at feedmyinbox.com Fri May 13 04:31:03 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 13 May 2011 04:31:03 -0400 Subject: [Biopython-dev] 5/13 biopython Questions - BioStar Message-ID: // Bio.GenBank.LocationParserError // May 10, 2011 at 1:50 PM http://biostar.stackexchange.com/questions/8203/bio-genbank-locationparsererror Hi all, I'm scanning through all of GenBank's bacterial genomes using biopython. I've been getting an occasional error recently parsing location data. Specifically: File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table consumer.location(location_string) File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 975, in location raise LocationParserError(location_line) Bio.GenBank.LocationParserError: order(join(649703..649712,649751..649752),650047..650049) My code is a simple loop through all filenames I feed in at the command line: [...] try: contig = SeqIO.parse(open(gb_file,"r"), "genbank") except: sys.stderr.write("ERROR: Parsing gbk file "+gb_file+"!\n") sys.exit(1) sys.stderr.write("Loading genome " + str(counter) + " of "+str(len(sys.argv)-1)+" ("+gb_file+")\n") for gb_record in contig: [...] This is in the Aeropyrum pernix K1 genome, NC_000854.gbk. I don't see anything wrong with the location data. Can anyone help? Thanks, -Morgan // making all protein sequence lengths same // May 4, 2011 at 3:31 AM http://biostar.stackexchange.com/questions/8033/making-all-protein-sequence-lengths-same Is there any code in perl / python to make all protein sequences of same length, otherwise my phylogenetic tool MEGA is not working on them ? 
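(On that last question: one naive way to pad protein sequences to a common length with gap characters is sketched below using Biopython. The file names are placeholders, and whether simple padding, rather than a proper alignment, is appropriate depends entirely on the data.)

    from Bio import SeqIO
    from Bio.Seq import Seq

    records = list(SeqIO.parse("proteins.fasta", "fasta"))
    longest = max(len(r.seq) for r in records)
    for r in records:
        # pad the shorter sequences with trailing gap characters
        r.seq = Seq(str(r.seq) + "-" * (longest - len(r.seq)))
    SeqIO.write(records, "padded.fasta", "fasta")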
-- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From redmine at redmine.open-bio.org Fri May 13 05:07:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 13 May 2011 09:07:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #3197] SeqIO parse error with some genbank files References: Message-ID: Issue #3197 has been updated by Peter Cock. Another example from http://biostar.stackexchange.com/questions/8203/bio-genbank-locationparsererror Aeropyrum pernix K1 genome, NC_000854.gbk ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeropyrum_pernix_K1_uid57757/NC_000854.gbk >>> from Bio import SeqIO >>> r = SeqIO.read("NC_000854.gbk", "gb") ... Bio.GenBank.LocationParserError: Combinations of "join" and "order" within the same location (nested operators) are illegal: order(join(649703..649712,649751..649752),650047..650049) I have reported this GenBank file to the NCBI via gb-admin at ncbi.nlm.nih.gov ---------------------------------------- Bug #3197: SeqIO parse error with some genbank files https://redmine.open-bio.org/issues/3197 Author: Cedar McKay Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.56 URL: I've found a file that seems to choke SeqIO genbank parsing. I downloaded this file straight from NCBI, so it should be a good file. I've found a couple of other files that do the same thing. I reproduced this bug on another machine, also with biopython 1.56. I am able to successfully parse other genbank files. Maybe it has something to do with that very long location? Please let me know if I can provide any other information! Thanks! 
Cedar >>> from Bio import SeqIO >>> record = SeqIO.read('./Acorus_americanus_NC_010093.gb', 'genbank') Traceback (most recent call last): File "", line 1, in File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 597, in read first = iterator.next() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table consumer.location(location_string) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 975, in location raise LocationParserError(location_line) Bio.GenBank.LocationParserError: order(join(42724..42726,43455..43457),43464..43469,43476..43481,43557..43562,43569..43574,43578..43583,43677..43682,44434..44439) -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Fri May 13 15:35:00 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 13 May 2011 21:35:00 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Hello all, Not to let this die. I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the _handle_pdb_exception method and replace it by the warnings module. This was done in two commits in my branch: https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 Sorry to be insistent, but I don't really wish QUIET to live long if we have such an elegant alternative. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/5/6 Peter Cock > On Fri, May 6, 2011 at 5:23 PM, Eric Talevich > wrote: > > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues > wrote: > > > >> Hello all, > >> > >> The PDBParser is sometimes a bit too loud, making meaningful output > drown > >> in > >> dozens of warnings messages. This is partly (mostly) my fault because of > >> the > >> element guessing addition. Therefore, I'd suggest adding a QUIET > argument > >> (bool) to PDBParser that would supress all warnings. Of course, default > is > >> False. It might come handy for batch processing of proteins. > >> > >> I've added it to my pdb_enhancements branch so you can take a look: > >> > >> > >> > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > >> > >> > > Since the PERMISSIVE argument is already an integer, could we consolidate > > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? 
> > > > I'm OK with that, > > Peter > From eric.talevich at gmail.com Fri May 13 15:46:01 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 May 2011 15:46:01 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Looks good to me. I can't guarantee I'll be able to merge this right away since I'm going to be traveling for the next week. Anyone else want to try it? -Eric On 5/13/11, Jo?o Rodrigues wrote: > Hello all, > > Not to let this die. > > I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the > _handle_pdb_exception method and replace it by the warnings module. > > This was done in two commits in my branch: > > https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 > https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 > > > Sorry to be insistent, but I don't really wish QUIET to live long if we have > such an elegant alternative. > > Cheers, > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > 2011/5/6 Peter Cock > >> On Fri, May 6, 2011 at 5:23 PM, Eric Talevich >> wrote: >> > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues >> wrote: >> > >> >> Hello all, >> >> >> >> The PDBParser is sometimes a bit too loud, making meaningful output >> drown >> >> in >> >> dozens of warnings messages. This is partly (mostly) my fault because >> >> of >> >> the >> >> element guessing addition. Therefore, I'd suggest adding a QUIET >> argument >> >> (bool) to PDBParser that would supress all warnings. Of course, default >> is >> >> False. It might come handy for batch processing of proteins. >> >> >> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> >> >> >> >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> >> >> >> >> > Since the PERMISSIVE argument is already an integer, could we >> > consolidate >> > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? >> > >> >> I'm OK with that, >> >> Peter >> > From andrew.sczesnak at med.nyu.edu Fri May 13 17:26:58 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 13 May 2011 17:26:58 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer Message-ID: <4DCDA222.9050807@med.nyu.edu> Hi All, I'd like to contribute MAF parser/writer classes to Bio.AlignIO. MAF is an alignment format used for whole genome alignments, as in the 30-way (or more) multiz alignments at UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ A description of the format is available here: http://genome.ucsc.edu/FAQ/FAQformat#format5 The value of this format to most users will come from the ability to extract sequences from an arbitrary number of species that align to a particular sequence range in a particular genome, at random. We should be able to say, report the alignment of 50 genomes to the human HOX locus fairly quickly (say <1s). An iterator and writer class will certainly be useful, but to implement the aforementioned functionality, some API changes are probably necessary. I think the most straightforward way of accomplishing this is to add an additional, searchable SQLite table to SeqIO's index_db(). The present table, offset_data, translates a unique sequence identifier to the file offset and is more suited to multifasta or other sequence files. 
Another table might store chromosome, start, and end positions to allow a set of alignment records falling within a particular sequence range on a chromosome to be extracted with an SQL query (obscured from the user). This table would remain empty in formats where no search functionality is implemented. Also necessary, a search() function on top of the index_db() UserDict, accessible as in: from AlignIO.MafIO import MafIndexer indexer = MafIndexer("mm9") index = SeqIO.index_db (index_file, maf_file, "maf", \ key_function = MafIndexer.index) for i in index.search ("chr5", 5000, 10000): print i where the output is a series of MultipleSeqAlignment objects with sequences falling within the searched range. When used with other formats, the function could perform a quick "key LIKE '%key%'" SQL query to retrieve multiple records with similar names. As a note, the MafIndexer callback function above is necessary to choose which species in the alignment the index is generated for. Some quick code implementing these additions loads the index of a 3.6GB MAF file in ~500ms and retrieves a 40kb alignment in about 1.6s, leaving some room for optimization. Does anyone have any thoughts on how index_db() should be developed, and if these changes ought to be implemented in SeqIO or an AlignIO index API be created? Thanks, -- Andrew Sczesnak Bioinformatician, Littman Lab Howard Hughes Medical Institute New York University School of Medicine 540 First Avenue New York, NY 10016 p: (212) 263-6921 f: (212) 263-1498 e: andrew.sczesnak at med.nyu.edu From p.j.a.cock at googlemail.com Fri May 13 18:27:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 May 2011 23:27:52 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 13, 2011 at 8:46 PM, Eric Talevich wrote: > Looks good to me. I can't guarantee I'll be able to merge this right > away since I'm going to be traveling for the next week. Anyone else > want to try it? > -Eric If get time this weekend, I'll look at it. After all, I did apply the quiet change to the trunk... Peter From p.j.a.cock at googlemail.com Fri May 13 18:30:34 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 May 2011 23:30:34 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > alignment format used for whole genome alignments, as in the 30-way (or > more) multiz alignments at UCSC: > > http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ > > A description of the format is available here: > > http://genome.ucsc.edu/FAQ/FAQformat#format5 > I've spoken to Andrew briefly before this, and I'm keen to get the core functionality of parsing and writing MAF alignments added to AlignIO. His other ideas for indexing these alignments are much more interesting - and part of a more general topic related to things like Ace alignments, or SAM/BAM alignments. Ideally we can come up with something that will work for more than just MAF alignments. 
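To make the indexing proposal above a little more concrete, here is a rough sketch of the kind of SQLite interval table and overlap query such a search() might run. This is illustrative only: the table and column names are invented and are not the real Bio.SeqIO.index_db() schema.

    import sqlite3

    con = sqlite3.connect("maf_index.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS intervals "
                "(chrom TEXT, start INTEGER, end INTEGER, offset INTEGER)")

    def search(con, chrom, start, end):
        # Yield file offsets of alignment blocks overlapping chrom:start-end.
        cursor = con.execute(
            "SELECT offset FROM intervals "
            "WHERE chrom = ? AND start < ? AND end > ?",
            (chrom, end, start))
        for (offset,) in cursor:
            yield offset

Each offset returned would then be handed to the MAF parser to pull out that one alignment block.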
Peter From p.j.a.cock at googlemail.com Sat May 14 07:30:07 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 May 2011 12:30:07 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: Hi Andrews, I've had a look at those example files you linked to now. On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > The value of this format to most users will come from the ability to extract > sequences from an arbitrary number of species that align to a particular > sequence range in a particular genome, at random. ?We should be able to > say, report the alignment of 50 genomes to the human HOX locus fairly > quickly (say <1s). ?An iterator and writer class will certainly be useful, > but to implement the aforementioned functionality, some API changes are > probably necessary. I had previously considered a cross-format Bio.AlignIO index on alignment number (i.e. 0, 1, 2, ... n-1 if the file contains n alignments). That would work on PHYLIP, Stockholm, Clustalw, etc, even FASTA if your alignment all have the same number of entries. It could also be used with MAF. However, I don't think it is useful. Of the current file formats supported in AlignIO, in my experience only PHYLIP files regularly contain more than one alignment, and since these are used for bootstrapping random access is not required (iteration is enough). And presumably for MAF, there is no reason to want to access the alignments by this index number either. With something like SAM/BAM (or other assembly formats like ACE or the MIRA alignment format also called MAF), you can have multiple alignments (the contigs or chromosomes) each with many entries (supporting reads). Here there is a clear single reference coordinate system, that of the (gapped) reference contigs/chromosomes. This also means each alignment has a clear name (the name of the reference contig/chromosome), so this name and coordinates can be used for indexing (as in samtools). With MAF however, things are not so easy - any of the sequences could be used as a reference (e.g. human chr 1, or mouse chr 2), and any region of a sequence might be in more than one alignment. I'm beginning to suspect what Andrew has in mind is going to be MAF specific - so it won't be top level functionality in Bio.AlignIO, but rather tucked away in Bio.AlignIO.MafIO instead. Peter From anaryin at gmail.com Sat May 14 08:59:38 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Sat, 14 May 2011 14:59:38 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Thanks and sorry for the double commit! No dia 14 de Mai de 2011 00:27, "Peter Cock" escreveu: > On Fri, May 13, 2011 at 8:46 PM, Eric Talevich wrote: >> Looks good to me. I can't guarantee I'll be able to merge this right >> away since I'm going to be traveling for the next week. Anyone else >> want to try it? >> -Eric > > If get time this weekend, I'll look at it. After all, I did apply > the quiet change to the trunk... 
> > Peter From pgarland at gmail.com Sat May 14 21:13:28 2011 From: pgarland at gmail.com (Phillip Garland) Date: Sat, 14 May 2011 18:13:28 -0700 Subject: [Biopython-dev] GEO SOFT parser Message-ID: Hello, I've created a new parser for GEO SOFT files- a fairly simple line-orientated format used by NCBI's Gene Expression Omnibus for holding gene expression data, information about the experimental platform used to generate the data, and associated metadata. At the moment if parses platform (GPL), series (GSE), sample (GSM), and dataset (GDS) files into objects, with access to the metadata, and data table entries. It's accessible through my github biopython repo: https://github.com/pgarland/biopython git://github.com/pgarland/biopython.git Branch: new-geo-soft-parser All the changed files are in the Bio/Geo directory. The existing parser has the virtue of being simple and short. The parser I've written is less parsimonious, but should handle everything specified by NCBI, as well as some unspecified quirks, and documents what GEO SOFT files are expected to contain. I'm taking a look at Sean Davis's GEOquery Bioconductor package for ideas for the interface. There is a class for each GEO record type: GSM, GPL, GSE, and GDS. After instantiating each of these, you can call the parse method on the resulting object to parse the file, e.g.: >>> from Bio import Geo >>> gds858 = Geo.GDS() >>> gds858.parse('GDS858_full.soft') Each object has a dictionary named 'meta' that contains the file's metadata: >>> gds858.meta['channel_count'] 1 Each attribute has a hook to hang a function to perform additional parsing of a value, but most values are stored as strings. There is also a parseMeta() method if you just need the file's metadata (the entity attributes and data table column descriptions) and not the data table. There is also a rudimentary __str__ method to print the metadata. For files that can have data tables (GSM, GPL, and GDS files), there is currently just one method for accessing values: getTableValue() that takes an ID and a column name and returns the associated value: >>> gds858.getTableValue(1007_s_at, 'GSM14498') 3736.9000000000001 but I will implement other methods to provide more convenient access to the data table. Right now, the data table is just an 2D array and can be accessed like any 2D array: gds858.table[0][2] '3736.900' There are dictionaries for converting between IDs and column names and rows and columns: >>> gds858.idDict['1007_s_at'] 0 >>> gds858.columnDict['GSM14498'] 2 It is possible that the underlying representation of the data table could change though. On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive, parsing single files is more than fast enough, but I haven't benchmarked it or looked at RAM consumption. If it's a problem for computers with less RAM or use cases that require having a lot of GEO SOFT objects in memory, I can take a look at changing the data table representation. If this parser is incorporated in BioPython, I'm happy to maintain it. The code is well-commented, but I still need to write the documentation. I've tested it on a few files of each type, but I still need to write unit tests. Since SOFT files can be fairly large- a few MB gzipped, 10's of MB unzipped, it seems undesirable to package them with the biopython source code. I could make the unit test optional and have interested users supply their own files and/or have the test download files from NCBI and unzip them. 
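For the optional unit test idea just mentioned, a small sketch of how a test could skip itself when the big file is absent. The test class, file name and assertion below are hypothetical and follow the GDS API described in this message; unittest.SkipTest needs Python 2.7+, so an early return would be the portable fallback.

    import os
    import unittest

    class GeoBigFileTest(unittest.TestCase):
        def test_parse_full_gds(self):
            path = "GDS858_full.soft"          # user-supplied, not shipped
            if not os.path.exists(path):
                raise unittest.SkipTest("large GEO SOFT file not available")
            from Bio import Geo
            gds = Geo.GDS()
            gds.parse(path)
            self.assertEqual(gds.meta["channel_count"], 1)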
~ Phillip From updates at feedmyinbox.com Sun May 15 00:38:06 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 15 May 2011 00:38:06 -0400 Subject: [Biopython-dev] 5/15 active questions tagged biopython - Stack Overflow Message-ID: <7ba32bcb32923a5ff3d48ac9122b3bed@74.63.51.88> // Receive DNA-Sequence by range in biopython // May 14, 2011 at 5:33 PM http://stackoverflow.com/questions/6004926/receive-dna-sequence-by-range-in-biopython Hi, i need to use a protein-prediction tool called Mutation-Taster (http://www.mutationtaster.org/). Since the input format for the batch query needs a piece of sequence surrounding the mutation, and i have only the position of the mutation within a chromosome, i need the surrounding pieces. So far i am using biopython and i tried to find a way to receive the DNA-Sequence from the NCBI Entrez databases. I want assign the chromosome number, nucleic start and end position within the chromosome to receive the dna-sequence for example in fasta format. I would not mind if it is possible in another programming language. Thanks in advance for your help -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From p.j.a.cock at googlemail.com Sun May 15 10:40:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 15 May 2011 15:40:24 +0100 Subject: [Biopython-dev] GEO SOFT parser In-Reply-To: References: Message-ID: On Sun, May 15, 2011 at 2:13 AM, Phillip Garland wrote: > Hello, > > I've created a new parser for GEO SOFT files- a fairly simple > line-orientated format used by NCBI's Gene Expression Omnibus for > holding gene expression data, information about the experimental > platform used to generate the data, and associated metadata. At the > moment if parses platform (GPL), series (GSE), sample (GSM), and > dataset (GDS) files into objects, with access to the metadata, and > data table entries. > > It's accessible through my github biopython repo: > https://github.com/pgarland/biopython > git://github.com/pgarland/biopython.git > > Branch: > new-geo-soft-parser > > All the changed files are in the Bio/Geo directory. > > The existing parser has the virtue of being simple and short. The > parser I've written is less parsimonious, but should handle everything > specified by NCBI, as well as some unspecified quirks, and documents > what GEO SOFT files are expected to contain. That sounds good, the current GEO parser was very minimal. > I'm taking a look at Sean > Davis's GEOquery Bioconductor package for ideas for the interface. Great - I would have encouraged you to look at Sean's R interface for ideas. https://github.com/biopython/biopython/tree/master/Tests/Geo > There is a class for each GEO record type: GSM, GPL, GSE, and GDS. > After instantiating each of these, you can call the parse method on > the resulting object to parse the file, e.g.: > >>>> from Bio import Geo >>>> gds858 = Geo.GDS() >>>> gds858.parse('GDS858_full.soft') We may want to use read rather than parse for consistency with the other newish parsers in Biopython, where parse gives an iterator while read gives a single object. 
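For reference, this is how the read/parse split already behaves in Bio.SeqIO (the file names below are placeholders), and the same pattern could carry over to the new GEO classes:

    from Bio import SeqIO

    # read(): the file must contain exactly one record, returned directly
    record = SeqIO.read("single.gbk", "genbank")

    # parse(): an iterator over however many records the file holds
    for record in SeqIO.parse("many.gbk", "genbank"):
        print record.id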
> > Each object has a dictionary named 'meta' that contains the file's metadata: > >>>> gds858.meta['channel_count'] > 1 > > Each attribute has a hook to hang a function to perform additional > parsing of a value, but most values are stored as strings. > > There is also a parseMeta() method if you just need the file's > metadata (the entity attributes and data table column descriptions) > and not the data table. > > There is also a rudimentary __str__ method to print the metadata. > > For files that can have data tables (GSM, GPL, and GDS files), there > is currently just one method for accessing values: getTableValue() > that takes an ID and a column name and returns the associated value: > >>>> gds858.getTableValue(1007_s_at, 'GSM14498') > 3736.9000000000001 > > but I will implement other methods to provide more convenient access > to the data table. > > Right now, the data table is just an 2D array and can be accessed like > any 2D array: > > gds858.table[0][2] > '3736.900' > > There are dictionaries for converting between IDs and column names and > rows and columns: > >>>> gds858.idDict['1007_s_at'] > 0 > >>>> gds858.columnDict['GSM14498'] > 2 > > It is possible that the underlying representation of the data table > could change though. One possibility is a full load versus iterate over the rows approach. The later would be useful if you only wanted some of the data (e.g. particular genes), and didn't have enough RAM to load it all in full. > On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive, > parsing single files is more than fast enough, but I haven't > benchmarked it or looked at RAM consumption. If it's a problem for > computers with less RAM or use cases that require having a lot of GEO > SOFT objects in memory, I can take a look at changing the data table > representation. > > If this parser is incorporated in BioPython, I'm happy to maintain it. Excellent :) > The code is well-commented, but I still need to write the > documentation. I've tested it on a few files of each type, but I still > need to write unit tests. Since SOFT files can be fairly large- ?a few > MB gzipped, 10's of MB unzipped, it seems undesirable to package them > with the biopython source code. We have a selection of small samples already in the repository under Tests/GEO - so at very least you can write unit tests using them. Also, for an online tests, it would be nice to try Entrez with the new GEO parser (IIRC, our old parser didn't work nicely with some of the live data). > I could make the unit test optional > and have interested users supply their own files and/or have the test > download files from NCBI and unzip them. We've touched on the need for "big data" tests which would be more targeted at Biopython developers than end users, but not addressed any framework for this. e.g. SeqIO indexing of large sequence files. Peter From chapmanb at 50mail.com Sun May 15 11:39:59 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 15 May 2011 11:39:59 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <20110515153959.GC2530@kunkel> Andrew and Peter; Thanks for working on MAF parsing and interval access in general. A few thoughts below: > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > > alignment format used for whole genome alignments, as in the 30-way (or > > more) multiz alignments at UCSC: [...] 
> > The value of this format to most users will come from the ability to > > extract sequences from an arbitrary number of species that align to > > a particular sequence range in a particular genome, at random. We > I've spoken to Andrew briefly before this, and I'm keen to get > the core functionality of parsing and writing MAF alignments > added to AlignIO. His other ideas for indexing these alignments > are much more interesting - and part of a more general topic > related to things like Ace alignments, or SAM/BAM alignments. We may want to take a look at the interval access functionality in bx-python and MAF parsing tied in with this: https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/maf.py Here is a worked example: http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/ It would be useful to have an API that queries across bx-python intervals, BAM intervals and other formats. Brad From Andrew.Sczesnak at med.nyu.edu Sun May 15 15:59:02 2011 From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew) Date: Sun, 15 May 2011 15:59:02 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu>, Message-ID: > With something like SAM/BAM (or other assembly formats like ACE or the > MIRA alignment format also called MAF), you can have multiple > alignments (the contigs or chromosomes) each with many entries > (supporting reads). Here there is a clear single reference coordinate > system, that of the (gapped) reference contigs/chromosomes. This also > means each alignment has a clear name (the name of the reference > contig/chromosome), so this name and coordinates can be used for > indexing (as in samtools). > > With MAF however, things are not so easy - any of the sequences could > be used as a reference (e.g. human chr 1, or mouse chr 2), and any > region of a sequence might be in more than one alignment. > > I'm beginning to suspect what Andrew has in mind is going to be MAF > specific - so it won't be top level functionality in Bio.AlignIO, but > rather tucked away in Bio.AlignIO.MafIO instead. > > Peter I agree, the fact that this particular format does not explicitly define the reference sequence is problematic. Based on the spec, we ought to be prepared for a multiz MAF file with several different reference sequences. However, practically speaking, the files out there in the world _do_ have a reference sequence, which appears in all alignments and is the first listed sequence. While I think there is definitely some trickyness to how this parser will have to interact with any API, my feeling is that these portions ought to be confined to MafIO, while a more general API lives in AlignIO or elsewhere. This isn't much different from a format like SFF, I think. Andrew ------------------------------------------------------------ This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain information that is proprietary, confidential, and exempt from disclosure under applicable law. Any unauthorized review, use, disclosure, or distribution is prohibited. If you have received this email in error please notify the sender by return email and delete the original message. Please note, the recipient should check this email and any attachments for the presence of viruses. The organization accepts no liability for any damage caused by any virus transmitted by this email. 
================================= From Andrew.Sczesnak at med.nyu.edu Sun May 15 16:14:42 2011 From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew) Date: Sun, 15 May 2011 16:14:42 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <20110515153959.GC2530@kunkel> References: <4DCDA222.9050807@med.nyu.edu> , <20110515153959.GC2530@kunkel> Message-ID: Hi Brad, > We may want to take a look at the interval access functionality in > bx-python and MAF parsing tied in with this: > > https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py > https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/maf.py The interval indexing scheme in bx-python seems really nice. By dropping intervals into bins, a la UCSC MySQL tables, and using a compact file format instead of SQLite, I'm sure it's quite fast. > It would be useful to have an API that queries across bx-python intervals, > BAM intervals and other formats. I agree, I think it would be great if we could implement some sort of API for indexing and accessing intervals in SAM/BAM, MAF, ACE, and really, any format that can be made to report an offset and set of interval coordinates. Even a multifasta can have interval information in the header that a user could extract and pass to the indexer with a callback function. Gene annotation files, like GFF, have this information too. What would make the most sense here? Would a more general interval indexing and searching module be too much? I feel like a task I'm always performing is searching various files by chromosome, start, and stop. Example: A BED file of ChIP-Seq peaks called by MACS--are there any peaks overlapping gene X? Example: How many alignments are there in an RNA-Seq BAM file that overlap rRNA and tRNA annotations in a GFF file, presumably from contaminating RNA? Andrew ------------------------------------------------------------ This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain information that is proprietary, confidential, and exempt from disclosure under applicable law. Any unauthorized review, use, disclosure, or distribution is prohibited. If you have received this email in error please notify the sender by return email and delete the original message. Please note, the recipient should check this email and any attachments for the presence of viruses. The organization accepts no liability for any damage caused by any virus transmitted by this email. ================================= From p.j.a.cock at googlemail.com Sun May 15 16:24:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 15 May 2011 21:24:21 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Sun, May 15, 2011 at 8:59 PM, Sczesnak, Andrew wrote: >> With something like SAM/BAM (or other assembly formats like ACE or the >> MIRA alignment format also called MAF), you can have multiple >> alignments (the contigs or chromosomes) each with many entries >> (supporting reads). Here there is a clear single reference coordinate >> system, that of the (gapped) reference contigs/chromosomes. This also >> means each alignment has a clear name (the name of the reference >> contig/chromosome), so this name and coordinates can be used for >> indexing (as in samtools). >> >> With MAF however, things are not so easy - any of the sequences could >> be used as a reference (e.g. 
human chr 1, or mouse chr 2), and any >> region of a sequence might be in more than one alignment. >> >> I'm beginning to suspect what Andrew has in mind is going to be MAF >> specific - so it won't be top level functionality in Bio.AlignIO, but >> rather tucked away in Bio.AlignIO.MafIO instead. >> >> Peter > > I agree, the fact that this particular format does not explicitly define the > reference sequence is problematic. ?Based on the spec, we ought to be > prepared for a multiz MAF file with several different reference sequences. > However, practically speaking, the files out there in the world _do_ have a > reference sequence, which appears in all alignments and is the first listed > sequence. That may be a very useful simplifying assumption. Would you expect each position on the reference to appear in one and only one alignment block in the MAF file? Or, might a given region appear in multiple blocks? > While I think there is definitely some trickyness to how this > parser will have to interact with any API, my feeling is that these portions > ought to be confined to MafIO, while a more general API lives in AlignIO or > elsewhere. > >?This isn't much different from a format like SFF, I think. > What did you mean here? SFF is just another sequence file format as far as Bio.SeqIO goes, other than being binary it isn't exceptional. Peter From p.j.a.cock at googlemail.com Mon May 16 07:14:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 12:14:05 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > alignment format used for whole genome alignments, as in the 30-way (or > more) multiz alignments at UCSC: > > http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ > > A description of the format is available here: > > http://genome.ucsc.edu/FAQ/FAQformat#format5 > I started work on merging the basic parser/writer into Biopython on this new branch, https://github.com/peterjc/biopython/tree/alignio-maf As I think I mentioned by email before, there were some PEP8 formatting changes (removing spaces before brackets). Another little thing rather than MultipleSeqAlignment(alphabet) you should use MultipleSeqAlignment([], alphabet) to create an empty alignment. The former works with a deprecation warning to help transition from the old alignment object. Note that by hooking up "maf" in AlignIO as an output format, it will get exercised by some of the unit tests, in particular test_AlignIO.py - and that showed some problems. On a functional level your code was not preserving the order of the records within each alignment. By using a dictionary the order becomes Python implementation specific, meaning it cannot be assumed in unit tests (i.e. C Python vs Jython vs IronPython vs PyPy could all store dictionary elements in a different order). Also it was also breaking test_AlignIO.py, so I changed that. Do you think we should follow the speciesOrder directive if present? Note that right now, test_AlignIO.py is still not passing (which is a major reason why I haven't merged this to the trunk). Currently the issue is to do with how you are parsing species names, assuming database.chromosome is not possible in general. 
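A tiny sketch of the kind of defensive identifier splitting being discussed here; the helper name is made up and the fallback behaviour is only one possible choice:

    def split_src(src):
        # UCSC-style MAF identifiers look like "mm9.chr10" (database.chromosome),
        # but a bare name with no dot is legal too, so don't assume the dot.
        if "." in src:
            database, chrom = src.split(".", 1)
            return database, chrom
        return None, src

    # split_src("mm9.chr10")  -> ("mm9", "chr10")
    # split_src("human_hoxa") -> (None, "human_hoxa")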
Also I think we may need to do something rigorous with start/end co-ordinates and strand in either the Seq or SeqRecord object. They could be updated automatically during slicing and taking reverse complement... they might not survive addition though. Peter From p.j.a.cock at googlemail.com Mon May 16 09:53:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 14:53:32 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 12:14 PM, Peter Cock wrote: > > I started work on merging the basic parser/writer into Biopython > on this new branch, > > https://github.com/peterjc/biopython/tree/alignio-maf > > As I think I mentioned by email before, there were some PEP8 > formatting changes (removing spaces before brackets). > > ... > > Note that right now, test_AlignIO.py is still not passing (which > is a major reason why I haven't merged this to the trunk). > Currently the issue is to do with how you are parsing species > names, assuming database.chromosome is not possible in general. I've changed it to preserve the identifier as is for the SeqRecord id field, got all the test suite passing, and added a couple of small MAF files from the BioPerl test suite (which highlighted some more issues). Do you think it makes sense to automatically promote any dots (periods) in the sequence to the letter of that position in the first sequence? This is something I'd been thinking we should do in the PHYLIP parser as well. See the MAF/humor.maf example. Peter From andrew.sczesnak at med.nyu.edu Mon May 16 13:03:39 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 13:03:39 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <4DD158EB.4080709@med.nyu.edu> On 05/16/2011 09:53 AM, Peter Cock wrote: > On Mon, May 16, 2011 at 12:14 PM, Peter Cock wrote: > > Do you think it makes sense to automatically promote any dots > (periods) in the sequence to the letter of that position in the first > sequence? This is something I'd been thinking we should do in > the PHYLIP parser as well. See the MAF/humor.maf example. > > Peter Yeah, that sounds right to me. The issue again is going to be the lack of an explicitly defined reference sequence. Are we going to make the assumption that the sequence appearing first in an alignment bundle is the reference? Andrew From andrew.sczesnak at med.nyu.edu Mon May 16 13:26:46 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 13:26:46 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <4DD15E56.60201@med.nyu.edu> On 05/16/2011 07:14 AM, Peter Cock wrote: > Do you think we should follow the speciesOrder directive if > present? Yeah, why not. I started working on this and the problem was, as defined in the spec, the species is just "hg19" or "mm9," yet the records are in species.chromosome format. Should we enforce that the species in a speciesOrder directive must exactly match a sequence identifier, or add a split and do some checks to make sure a record matches only one species in speciesOrder? > Also I think we may need to do something rigorous with start/end > co-ordinates and strand in either the Seq or SeqRecord object. > They could be updated automatically during slicing and taking > reverse complement... they might not survive addition though. 
This is interesting. I wonder if it makes sense to preserve this information if a SeqRecord is going to be maniuplated outside a MultipleSeqAlignment object. Could this be accomplished by migrating the annotation information to a SeqFeature? Andrew From p.j.a.cock at googlemail.com Mon May 16 13:54:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 18:54:24 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD158EB.4080709@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:03 PM, Andrew Sczesnak wrote: > On 05/16/2011 09:53 AM, Peter Cock wrote: >> >> Do you think it makes sense to automatically promote any dots >> (periods) in the sequence to the letter of that position in the first >> sequence? This is something I'd been thinking we should do in >> the PHYLIP parser as well. See the MAF/humor.maf example. >> >> Peter > > Yeah, that sounds right to me. ?The issue again is going to be the lack of > an explicitly defined reference sequence. ?Are we going to make the > assumption that the sequence appearing first in an alignment bundle > is the reference? That is my assumption for how dots have been used in alignment formats. If you have some MAF examples using dots, that would be great. Regarding PHYLIP, I looked at this and dots/periods have been explicitly forbidden since the very earliest versions of PHYLIP, so I've made them raise an error instead: https://github.com/biopython/biopython/commit/b41975bb8363171add80d19903861f3d8cffe405 Peter From p.j.a.cock at googlemail.com Mon May 16 13:58:23 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 18:58:23 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD15E56.60201@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:26 PM, Andrew Sczesnak wrote: > On 05/16/2011 07:14 AM, Peter Cock wrote: >> >> Do you think we should follow the speciesOrder directive if >> present? > > Yeah, why not. ?I started working on this and the problem was, as defined in > the spec, the species is just "hg19" or "mm9," yet the records are in > species.chromosome format. ?Should we enforce that the species in a > speciesOrder directive must exactly match a sequence identifier, or add a > split and do some checks to make sure a record matches only one species in > speciesOrder? That is a subtlety I missed - maybe it is simpler to ignore speciesOrder after all. I presume it is intended a graphical output directive really. >> Also I think we may need to do something rigorous with start/end >> co-ordinates and strand in either the Seq or SeqRecord object. >> They could be updated automatically during slicing and taking >> reverse complement... they might not survive addition though. > > This is interesting. ?I wonder if it makes sense to preserve this > information if a SeqRecord is going to be maniuplated outside a > MultipleSeqAlignment object. ?Could this be accomplished by > migrating the annotation information to a SeqFeature? I'm not sure how using a SeqFeature would work here. Also consider that someone might manipulate the alignment directly, e.g. alignment[:,10:60] to pull out fifty columns. That seems like a use case where the start/end co-ordinates should be updated nicely. Note that internally this calls record[10:60] for each row of the alignment, so using SeqRecord objects. 
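As a sketch of what "updated nicely" could mean for the start co-ordinate when columns are sliced: the helper below assumes a "start" key in record.annotations (the MAF parser in this thread keeps its per-sequence co-ordinates in .annotations, though the exact key is an assumption here) and ignores strand for simplicity.

    def sliced_start(record, col_start, gap="-"):
        # New start = old start + number of ungapped letters skipped before
        # the first kept column.  Strand would need separate handling.
        skipped = len(str(record.seq[:col_start]).replace(gap, ""))
        return record.annotations["start"] + skipped

    # e.g. after sub_aln = alignment[:, 10:60], each row's new start would
    # be sliced_start(old_row, 10)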
Peter From p.j.a.cock at googlemail.com Mon May 16 14:22:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 19:22:05 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:54 PM, Peter Cock wrote: > On Mon, May 16, 2011 at 6:03 PM, Andrew Sczesnak wrote: >> On 05/16/2011 09:53 AM, Peter Cock wrote: >>> >>> Do you think it makes sense to automatically promote any dots >>> (periods) in the sequence to the letter of that position in the first >>> sequence? This is something I'd been thinking we should do in >>> the PHYLIP parser as well. See the MAF/humor.maf example. >>> >>> Peter >> >> Yeah, that sounds right to me. ?The issue again is going to be the lack of >> an explicitly defined reference sequence. ?Are we going to make the >> assumption that the sequence appearing first in an alignment bundle >> is the reference? > > That is my assumption for how dots have been used in alignment > formats. Done on my branch: https://github.com/peterjc/biopython/commit/746d0c30b85753bb40c140b2b964e3256259414b > > If you have some MAF examples using dots, that would be great. > You'll see I have one example (from BioPerl's unit tests), but more would still be appreciated. Peter From andrew.sczesnak at med.nyu.edu Mon May 16 16:30:23 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 16:30:23 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: <4DD1895F.3050303@med.nyu.edu> On 05/16/2011 01:54 PM, Peter Cock wrote: > That is my assumption for how dots have been used in alignment > formats. > > If you have some MAF examples using dots, that would be great. I added a snippet of mouse chromosome 10 from UCSC, but it doesn't have dots. I've actually never come across one with dots. Added support for a 'track' line at the beginning of a file as well, among some other small changes. https://github.com/polyatail/biopython/commits/alignio-maf Andrew From p.j.a.cock at googlemail.com Mon May 16 16:45:38 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 21:45:38 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD1895F.3050303@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> <4DD1895F.3050303@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 9:30 PM, Andrew Sczesnak wrote: > On 05/16/2011 01:54 PM, Peter Cock wrote: >> >> That is my assumption for how dots have been used in alignment >> formats. >> >> If you have some MAF examples using dots, that would be great. > > I added a snippet of mouse chromosome 10 from UCSC, but it doesn't have > dots. ?I've actually never come across one with dots. > > Added support for a 'track' line at the beginning of a file as well, among > some other small changes. > > https://github.com/polyatail/biopython/commits/alignio-maf > Generally I'm happy, although after editing the BioPerl unit test, perhaps we should rename it? And did you mean to alter the newline at the end of the file? https://github.com/polyatail/biopython/commit/d423d423cc87efeb8a27a9332927e42d1beacdf2 Also, could you rewrite this to avoid the use of handle.tell? Not all handle objects support that (right?), and we shouldn't need it. 
https://github.com/polyatail/biopython/commit/111cf69d7e435203a781f05f9f317bc9ced03560 Peter From andrew.sczesnak at med.nyu.edu Mon May 16 16:33:53 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 16:33:53 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> Message-ID: <4DD18A31.6030804@med.nyu.edu> On 05/16/2011 01:58 PM, Peter Cock wrote: > That is a subtlety I missed - maybe it is simpler to ignore speciesOrder > after all. I presume it is intended a graphical output directive really. Fine by me. If need be we can add this later. >> This is interesting. I wonder if it makes sense to preserve this >> information if a SeqRecord is going to be maniuplated outside a >> MultipleSeqAlignment object. Could this be accomplished by >> migrating the annotation information to a SeqFeature? > > I'm not sure how using a SeqFeature would work here. Hmm, well, strand is manipulated in a SeqFeature when .reverse_complement() is run, right? I thought that might take care of that. Though truthfully I haven't looked too much at that code. > Also consider that someone might manipulate the alignment > directly, e.g. alignment[:,10:60] to pull out fifty columns. That > seems like a use case where the start/end co-ordinates should > be updated nicely. Note that internally this calls record[10:60] > for each row of the alignment, so using SeqRecord objects. That's true. Is there a more general way to implement this? By dragging the coordinate information out of .annotations and into fields that aren't MAF-specific or something. Andrew From p.j.a.cock at googlemail.com Mon May 16 16:53:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 21:53:55 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD18A31.6030804@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> <4DD18A31.6030804@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 9:33 PM, Andrew Sczesnak wrote: > On 05/16/2011 01:58 PM, Peter Cock wrote: >> >> That is a subtlety I missed - maybe it is simpler to ignore speciesOrder >> after all. I presume it is intended a graphical output directive really. > > Fine by me. ?If need be we can add this later. > >>> This is interesting. ?I wonder if it makes sense to preserve this >>> information if a SeqRecord is going to be maniuplated outside a >>> MultipleSeqAlignment object. ?Could this be accomplished by >>> migrating the annotation information to a SeqFeature? >> >> I'm not sure how using a SeqFeature would work here. > > Hmm, well, strand is manipulated in a SeqFeature when .reverse_complement() > is run, right? ?I thought that might take care of that. ?Though truthfully I > haven't looked too much at that code. The SeqFeature is for describing (part of) a SeqRecord, and both have a reverse_complement method for when you want to flip the sequence and all the features on it. >> Also consider that someone might manipulate the alignment >> directly, e.g. alignment[:,10:60] to pull out fifty columns. That >> seems like a use case where the start/end co-ordinates should >> be updated nicely. Note that internally this calls record[10:60] >> for each row of the alignment, so using SeqRecord objects. > > That's true. ?Is there a more general way to implement this? ?By dragging > the coordinate information out of .annotations and into fields that aren't > MAF-specific or something. 
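For background to the question above, each MAF "s" line already carries the coordinate information under discussion (source, start, size, strand, source size) next to the aligned text. A minimal sketch of capturing that in a SeqRecord's annotations -- the key names are illustrative, not necessarily the ones used on the branch:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def s_line_to_record(line):
    # e.g. "s hg18.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG"
    parts = line.split()
    assert parts[0] == "s" and len(parts) == 7
    src, start, size, strand, src_size, text = parts[1:]
    record = SeqRecord(Seq(text), id=src, description="")
    record.annotations["start"] = int(start)
    record.annotations["size"] = int(size)
    record.annotations["strand"] = strand  # "+" or "-"
    record.annotations["srcSize"] = int(src_size)
    return record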
That's what I was suggesting - the existing fasta-m10 parser can also collect start/end/strand information, and there are obvious potential uses with things like BLAST and HMMER too. One idea might be to introduce a SeqRecord subclass - I'm not sure yet. Peter From p.j.a.cock at googlemail.com Tue May 17 06:02:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 May 2011 11:02:08 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 13, 2011 at 8:35 PM, Jo?o Rodrigues wrote: > Hello all, > > Not to let this die. > > I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the > _handle_pdb_exception method and replace it by the warnings module. > > This was done in two commits in my branch: > > https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 > https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 > Is getting ride of _handle_PDB_exception a good idea for performance? If I have understood your code, you just raise a warning in all cases. Then, you have a filter that either promotes the warning to an exception (permissive=0), or silences the warning (permissive=2). Also, do we want to have the same three options for all the recoverable errors? e.g. Currently, missing elements never raise an exception. Peter From anaryin at gmail.com Tue May 17 06:14:25 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 12:14:25 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Hey, That's something I noticed too. Some errors still have PDBConstructionException as a base class, while most of them have PDBConstructionWarning. Only these latter are regulated by the new scheme. I believe they were also raised before, but inside the _handle_pdb_exception function IIRC. Regarding performance, that's something we can easily check with the benchmarks. The difference is not big, the PDB branch and 1.57+ differ just in that particular detail. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Tue, May 17, 2011 at 12:02 PM, Peter Cock wrote: > On Fri, May 13, 2011 at 8:35 PM, Jo?o Rodrigues wrote: > > Hello all, > > > > Not to let this die. > > > > I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove > the > > _handle_pdb_exception method and replace it by the warnings module. > > > > This was done in two commits in my branch: > > > > > https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 > > > https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 > > > > Is getting ride of _handle_PDB_exception a good idea for performance? > If I have understood your code, you just raise a warning in all cases. > Then, you have a filter that either promotes the warning to an exception > (permissive=0), or silences the warning (permissive=2). > > Also, do we want to have the same three options for all the recoverable > errors? e.g. Currently, missing elements never raise an exception. > > Peter > From p.j.a.cock at googlemail.com Tue May 17 06:30:39 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 May 2011 11:30:39 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Tue, May 17, 2011 at 11:14 AM, Jo?o Rodrigues wrote: > Hey, > > That's something I noticed too. 
Some errors still have > PDBConstructionException as a base class, while most of them have > PDBConstructionWarning. Only these latter are regulated by the new scheme. I > believe they were also raised before, but inside the _handle_pdb_exception > function IIRC. For backwards compatibility, we still want to use PDBConstructionException for exceptions (i.e. when permissive=0, or for non-recoverable errors) and PDBConstructionWarning for warnings (i.e. when permissive=1 or 2). The filter action may need to convert any PDBConstructionWarning to a PDBConstructionException. > Regarding performance, that's something we can easily check with the > benchmarks. The difference is not big, the PDB branch and 1.57+ differ just > in that particular detail. So you don't think this is worth worrying about? OK - if the code is cleaner this way that's a good justification. Peter From updates at feedmyinbox.com Tue May 17 07:05:32 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 17 May 2011 07:05:32 -0400 Subject: [Biopython-dev] 5/17 biopython Questions - BioStar Message-ID: <61130ea2043be0a2b73113a897fbcd9c@74.63.51.88> // [python] Uniprot ID to Gene name // May 16, 2011 at 4:36 AM http://biostar.stackexchange.com/questions/8323/python-uniprot-id-to-gene-name Hi, I've got a huge list of Uniprot IDs and I want to get the matching gene names. Do you know how to do that in python ? (I'm currently searching with Biopython...) Thanks ! Yo. -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Tue May 17 07:19:37 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 13:19:37 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Ok, the conversion from warning to exception is something I'll look into then. I also found an annoying problem in the Atom class, when assigning elements: there is an "import warnings" in the function... This is also likely killing a bit the performance.. We can more thoroughly see about the speed once I finish the PDB benchmark. Cheers, Jo?o From p.j.a.cock at googlemail.com Tue May 17 07:47:11 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 May 2011 12:47:11 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Tue, May 17, 2011 at 12:19 PM, Jo?o Rodrigues wrote: > Ok, the conversion from warning to exception is something I'll look into > then. > > I also found an annoying problem in the Atom class, when assigning elements: > there is an "import warnings" in the function... This is also likely killing > a bit the performance.. We can make the import top level then. > We can more thoroughly see about the speed once I finish the PDB benchmark. 
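For reference, the behaviour being discussed -- silencing the parser's complaints, or turning them into hard errors -- can already be approximated from user code with the warnings module, since the recoverable problems are reported as PDBConstructionWarning. A sketch, where the file name 1abc.pdb is just a placeholder:

import warnings
from Bio.PDB import PDBParser
from Bio.PDB.PDBExceptions import PDBConstructionWarning

parser = PDBParser(PERMISSIVE=1)

# Silence the construction warnings (the effect QUIET / PERMISSIVE=2 is after):
warnings.filterwarnings("ignore", category=PDBConstructionWarning)
structure = parser.get_structure("example", "1abc.pdb")

# Alternatively, promote them to errors, close to the strict PERMISSIVE=0 idea:
# warnings.filterwarnings("error", category=PDBConstructionWarning)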
> > Cheers, > > Jo?o > From anaryin at gmail.com Tue May 17 08:12:53 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 14:12:53 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: https://github.com/JoaoRodrigues/biopython/commit/2a694502f6fd116b36d8d2d15b3d4ba23ab92fe8 From anaryin at gmail.com Tue May 17 08:21:52 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 14:21:52 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Regarding the missing element never raising an exception, here's what I propose: Change the wording of the warnings in the Atom._assign_element method so that they signal that the element was missing and it either was auto-assigned or it couldn't be assigned at all. Right now we have: if putative_element.capitalize() in IUPACData.atom_weights: msg = "Used element %r for Atom (name=%s) with given element %r" \ % (putative_element, self.name, element) element = putative_element else: msg = "Could not assign element %r for Atom (name=%s) with given element %r" \ % (putative_element, self.name, element) element = "" warnings.warn(msg, PDBConstructionWarning) I would suggest changing these two messages to make them more verbose. Setting PERMISSIVE to 0 still converts these into exceptions, but the message might not be that explicit that the element was missing to begin with. Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Tue, May 17, 2011 at 2:12 PM, Jo?o Rodrigues wrote: > > https://github.com/JoaoRodrigues/biopython/commit/2a694502f6fd116b36d8d2d15b3d4ba23ab92fe8 > From updates at feedmyinbox.com Wed May 18 01:06:16 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 18 May 2011 01:06:16 -0400 Subject: [Biopython-dev] 5/18 active questions tagged biopython - Stack Overflow Message-ID: <4aa0c9bbf3ae94272896628b51675707@74.63.51.88> // Writing out a list of strings to a file // May 17, 2011 at 3:14 PM http://stackoverflow.com/questions/6035904/writing-out-a-list-of-strings-to-a-file I have a list of abbreviations Letters = ['Ala', 'Asx', 'Cys', ... 'Glx'] I want to output this to a text file that will look this like: #Letters Ala, Asx, Cys, ..... Glx Noob programmer here! I always forget the simplest things! ah please help and thanks! import Bio from Bio import Seq from Bio.Seq import Alphabet output = 'alphabetSoupOutput.txt' fh = open(output, 'w') ThreeLetterProtein = '#Three Letter Protein' Letters = Bio.Alphabet.ThreeLetterProtein.letters fh.write(ThreeLetterProtein + '\n') #Don't know what goes here fh.close() // BioPython Alphabet Soup // May 17, 2011 at 2:28 AM http://stackoverflow.com/questions/6027064/biopython-alphabet-soup Biopython noob here, I'm trying to create a program that uses the Biopython package Alphabet and alphabet module IUPAC to write out the letters of the classes listed to a file called alphabetSoupOuput.txt. ThreeLetterProtein IUPACProtein unambiguous_dna ambiguous_dna ExtendedIUPACProtein ExtendedIUPACDNA Each group of letters should be written to its single line in the output file and the letters should be separated by commas. The line before each group of letters should contain a label that describes the letters and has a # in the first position of that line, e.g. Three Letter Protein Ala, Asx, Cys, ..., Glx Protein Letters A, C, D, E, ..., Y How can I do this? 
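As an aside, the digest question above is straightforward with the Bio.Alphabet classes it mentions; a sketch against that API (the output file name is taken from the question itself):

from Bio.Alphabet import ThreeLetterProtein
from Bio.Alphabet import IUPAC

alphabets = [
    ("Three Letter Protein", ThreeLetterProtein.letters),
    ("Protein Letters", IUPAC.IUPACProtein.letters),
    ("Unambiguous DNA Letters", IUPAC.IUPACUnambiguousDNA.letters),
]

out = open("alphabetSoupOutput.txt", "w")
for label, letters in alphabets:
    out.write("#%s\n" % label)
    out.write(", ".join(letters) + "\n")
out.close()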
-- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From updates at feedmyinbox.com Wed May 18 06:54:24 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 18 May 2011 06:54:24 -0400 Subject: [Biopython-dev] 5/18 biopython Questions - BioStar Message-ID: <257a8a7da87497cf70829245fd325ed9@74.63.51.88> // Converting GenBank to FASTA in protein form // May 18, 2011 at 12:31 AM http://biostar.stackexchange.com/questions/8377/converting-genbank-to-fasta-in-protein-form So i have a sequence that is a .gb file. What I want to do is parse and change the format of the file. I've figured out how to parse it to FASTA format, although the sequence that is in the FASTA format is nucleic and i want it to be PROTEIN. kind of stuck here... any ideas? import Bio from Bio import SeqUtils from Bio import Seq from Bio import SeqIO handle = 'sequence.gb' output = 'sequence.fasta' data = Bio.SeqIO.parse(handle, 'gb') fh = open(output, 'w') for record in data: convert = Bio.SeqIO.write(record, output, 'fasta') dna = record.seq mrna = dna.transcribe() protein = mrna.translate() // Extracting data from classes in Python // May 17, 2011 at 2:31 PM http://biostar.stackexchange.com/questions/8371/extracting-data-from-classes-in-python How can I extract data from a class in Python? >>>Bio.Alphabet.RNAAlphabet How can I extract, say for example, the letters of that Alphabet from that object in Byophthon? // Working with Alphabet Soup // May 17, 2011 at 1:23 PM http://biostar.stackexchange.com/questions/8370/working-with-alphabet-soup Biopython noob here, I'm trying to create a program that uses the Biopython package Alphabet and alphabet module IUPAC to write out the letters of the classes listed to a file called alphabetSoupOuput.txt. ThreeLetterProtein IUPACProtein unambiguous_dna ambiguous_dna ExtendedIUPACProtein ExtendedIUPACDNA Each group of letters should be written to its single line in the output file and the letters should be separated by commas. The line before each group of letters should contain a label that describes the letters and has a # in the first position of that line, e.g. Three Letter Protein Ala, Asx, Cys, ..., Glx Protein Letters A, C, D, E, ..., Y How can I do this? -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From sbassi at clubdelarazon.org Wed May 18 14:37:42 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 11:37:42 -0700 Subject: [Biopython-dev] SNP data into Biopython Message-ID: Hello, I wonder if would be OK to create a parser for SNP data provided by 23andme for Biopython. I could use https://github.com/ngopal/23andMe as a base. What do you think? 
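For context, a 23andMe raw download is a plain tab-separated text file: "#" comment lines followed by four columns (rsid, chromosome, position, genotype, with "--" for no-calls). A minimal sketch of the kind of parser being proposed -- not the code from the repository linked above, and the input file name is just an example:

def parse_23andme(handle):
    # Yield (rsid, chromosome, position, genotype) tuples, skipping comments.
    for line in handle:
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue
        rsid, chrom, position, genotype = line.split("\t")
        yield rsid, chrom, int(position), genotype

handle = open("genome_john_doe.txt")
snps = dict((rsid, genotype) for rsid, chrom, pos, genotype in parse_23andme(handle))
handle.close()
print(len(snps))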
From tiagoantao at gmail.com Wed May 18 14:45:21 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 18 May 2011 12:45:21 -0600 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: Hi, On Wed, May 18, 2011 at 12:37 PM, Sebastian Bassi wrote: > I wonder if would be OK to create a parser for SNP data provided by > 23andme for Biopython. I could use https://github.com/ngopal/23andMe > as a base. > What do you think? Are you thinking in also using the sql part of that code? I actually use a similar strategy in my project to parse HapMap data (interPopula). I just wonder what other people would think about having SQL code outside Bio.SQL? I personally have no feelings about it, but I thought I should raise the issue... Tiago From sbassi at clubdelarazon.org Wed May 18 15:00:09 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 12:00:09 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Tiago Ant?o : > Are you thinking in also using the sql part of that code? I actually I didn't think in persistence yet. Just parsing it to make some operations. I could think on persistence on a second iteration. -- Sebasti?n Bassi. Lic. en Biotecnologia. Curso de Python en un d?a: http://bit.ly/cursopython From p.j.a.cock at googlemail.com Wed May 18 15:13:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 May 2011 20:13:50 +0100 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Tiago Ant?o : > Hi, > > On Wed, May 18, 2011 at 12:37 PM, Sebastian Bassi > wrote: >> I wonder if would be OK to create a parser for SNP data provided by >> 23andme for Biopython. I could use https://github.com/ngopal/23andMe >> as a base. >> What do you think? Double check with the original author about reusing his code, but that could be good. Maybe under Bio/SNP/23andme.py where the Bio.SNP namespace could be extended in future? > Are you thinking in also using the sql part of that code? I actually > use a similar strategy in my project to parse HapMap data > (interPopula). I just wonder what other people would think about > having SQL code outside Bio.SQL? I personally have no feelings about > it, but I thought I should raise the issue... Tiago - Do you mean the BioSQL module (no dot)? That is specifically for the BioSQL.org schema, and there are other things under Bio.* which use SQL. Peter From redmine at redmine.open-bio.org Wed May 18 15:29:10 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:29:10 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (New) need to update info on Python version support Message-ID: Issue #3232 has been reported by Walter Gillett. 
---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 18 15:29:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:29:11 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (New) need to update info on Python version support Message-ID: Issue #3232 has been reported by Walter Gillett. ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 18 15:42:39 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 May 2011 20:42:39 +0100 Subject: [Biopython-dev] Biopython specific warning classes Message-ID: Hi all, I've been thinking we should introduce some specific warning classes to Biopython, in particular: ParserWarning, for any "dodgy" input files, such as invalid GenBank LOCUS lines, and so on. The existing PDB parser warning should become a subclass of this. WriterWarning, for things like "data loss", e.g. record IDs getting truncated in PHYLIP output. Perhaps even a base class BiopythonWarning, which would be useful for people wanting to ignore all the Biopython issued warnings - it might be helpful in our unit tests too. Currently (apart from the PDB module), we tend to use the default UserWarning which makes filtering the warnings as an end user (or a unit test writer) quite hard. Any thoughts? Or better name suggestions? 
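A sketch of the sort of hierarchy being proposed here, and of how an end user (or a unit test) could then filter on it; the class names follow the email above rather than whatever was eventually committed:

import warnings

class BiopythonWarning(Warning):
    """Base class for all warnings issued by Biopython (proposed)."""
    pass

class ParserWarning(BiopythonWarning):
    """Dodgy input files, e.g. an invalid GenBank LOCUS line (proposed)."""
    pass

class WriterWarning(BiopythonWarning):
    """Possible data loss on output, e.g. truncated PHYLIP record IDs (proposed)."""
    pass

# Ignore everything Biopython might warn about:
warnings.simplefilter("ignore", BiopythonWarning)

# ...or only the parser warnings, keeping writer warnings visible:
warnings.simplefilter("ignore", ParserWarning)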
Regards, Peter From redmine at redmine.open-bio.org Wed May 18 15:49:50 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:49:50 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (Closed) need to update info on Python version support References: Message-ID: Issue #3232 has been updated by Peter Cock. Status changed from New to Closed % Done changed from 0 to 100 Applied in changeset commit:28af0e85272acc87adb9060a008c99d28ea6c17b. ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 18 15:54:21 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:54:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] need to update info on Python version support References: Message-ID: Issue #3232 has been updated by Peter Cock. Online docs updated too, thanks for reporting this! http://biopython.org/DIST/docs/install/Installation.html http://biopython.org/DIST/docs/install/Installation.pdf ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From tiagoantao at gmail.com Wed May 18 16:24:34 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 18 May 2011 14:24:34 -0600 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Peter Cock : > Tiago - Do you mean the BioSQL module (no dot)? That is > specifically for the BioSQL.org schema, and there are other > things under Bio.* which use SQL. Ah, interesting. 
I was thinking in donating my HapMap code, but the HapMap project is always changing the directory structure (and file format!) of the site, and that renders my code (which does automatic download of data) quite unstable. :( Tiago From sbassi at clubdelarazon.org Wed May 18 19:03:51 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 16:03:51 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Peter Cock : > Double check with the original author about reusing his code, > but that could be good. Maybe under Bio/SNP/23andme.py > where the Bio.SNP namespace could be extended in future? I've just asked and this is his reply: """ Thanks for your email. I'm flattered. Yes, you may include my code in biopython. I only ask two things: add my name to the list of biopython participants/contributors http://biopython.org/wiki/Participants http://biopython.org/SRC/biopython/CONTRIB add my name to the top of the python class which uses the code, stating a portion of the code came from me (assuming each python class in biopython has a comment header where each developer lists his/her name) I'm very glad you found the code useful. I'm traveling a lot these days and may not have immediate access to the internet, but please don't hesitate to shoot me an email-- I'll do my best to reply in a timely manner. Thanks, Nikhil Gopal """ From p.j.a.cock at googlemail.com Wed May 18 19:20:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 19 May 2011 00:20:02 +0100 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: On Thu, May 19, 2011 at 12:03 AM, Sebastian Bassi wrote: > 2011/5/18 Peter Cock : >> Double check with the original author about reusing his code, >> but that could be good. Maybe under Bio/SNP/23andme.py >> where the Bio.SNP namespace could be extended in future? > > I've just asked and this is his reply: > > """ > Thanks for your email. I'm flattered. Yes, you may include my code in > biopython. I only ask two things: > add my name to the list of biopython participants/contributors > http://biopython.org/wiki/Participants > http://biopython.org/SRC/biopython/CONTRIB > add my name to the top of the python class which uses the code, > stating a portion of the code came from me (assuming each python class > in biopython has a comment header where each developer lists his/her > name) > I'm very glad you found the code useful. I'm traveling a lot these > days and may not have immediate access to the internet, but please > don't hesitate to shoot me an email-- I'll do my best to reply in a > timely manner. > > Thanks, > > Nikhil Gopal > """ Assuming you asked him specifically about putting the code under the Biopython license, those terms are fine. We'd have done most of that anyway - although the wiki participants page is usually self edited. Are you happy to look at this then Sebastian? I've not worked with SNP data first hand - hopefully Tiago or others can look things over when you have something ready to merge. Regards, Peter From sbassi at clubdelarazon.org Thu May 19 02:33:50 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 23:33:50 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: On Wed, May 18, 2011 at 4:20 PM, Peter Cock wrote: > Are you happy to look at this then Sebastian? 
I've not worked with > SNP data first hand - hopefully Tiago or others can look things > over when you have something ready to merge. OK. But be patient since my github-foo is new. From updates at feedmyinbox.com Sat May 21 04:28:33 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sat, 21 May 2011 04:28:33 -0400 Subject: [Biopython-dev] 5/21 biopython Questions - BioStar Message-ID: <53ca752aa480a76bcdc8a9070c62642a@74.63.51.88> // Massive pairwise comparison using biopython // May 20, 2011 at 6:51 PM http://biostar.stackexchange.com/questions/8456/massive-pairwise-comparison-using-biopython Hi, I have a data-set of ~7500 sequences, avg. length ~1700 bases. I need to perform pairwise analysis on the entire set. I have a biopython script to perform this analysis in parallel. My understanding is that the comparison will need to run on an MPI cluster. What are my options for doing this and where could I run the job? Thanks, Peter -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From updates at feedmyinbox.com Mon May 23 04:28:23 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Mon, 23 May 2011 04:28:23 -0400 Subject: [Biopython-dev] 5/23 biopython Questions - BioStar Message-ID: <3f6c56051f15a35ea736b3b079ba44e4@74.63.51.88> // Fragile-X BioInformatics Project // May 23, 2011 at 1:52 AM http://biostar.stackexchange.com/questions/8495/fragile-x-bioinformatics-project Hey guys, I'm looking for some advice. I have a project due in a couple weeks that needs to utilize python(and biopython) to create some sort of computational biology tool that will be used to analyze either GEO samples, DNA sequences, etc. I plan on creating a program that will analyze GEO samples of Fragile-X patients, but don't really know what else I can do. Any suggestions? I don't work in a lab and therefore don't have much experience with this. ANY suggestions would help Please and thanks! -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From redmine at redmine.open-bio.org Mon May 23 11:30:21 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 23 May 2011 15:30:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong Message-ID: Issue #3234 has been reported by Walter Gillett. ---------------------------------------- Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong https://redmine.open-bio.org/issues/3234 Author: Walter Gillett Status: New Priority: Normal Assignee: Walter Gillett Category: Target version: URL: Spun off from #2947, see that bug for discussion. 
Initial state probabilities should be set explicitly, rather than using the probability of transitioning from a state back to itself, which is incorrect. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon May 23 12:03:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 May 2011 17:03:13 +0100 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: On Mon, May 23, 2011 at 4:30 PM, wrote: > > Issue #3234 has been reported by Walter Gillett. > > ---------------------------------------- > Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong > https://redmine.open-bio.org/issues/3234 > > Author: Walter Gillett > Status: New > Priority: Normal > Assignee: Walter Gillett > Category: > Target version: > URL: > > > Spun off from #2947, see that bug for discussion. Initial state probabilities > should be set explicitly, rather than using the probability of transitioning > from a state back to itself, which is incorrect. Would anyone more familiar with HMMs that I am like to volunteer to review Walter's changes? e.g. Philip (CC'd)? Walter's sent a pull request via github: https://github.com/biopython/biopython/pull/6 This consists of two commits, the first an unrelated minor change to the ignore file for people using the NetBeans IDE: https://github.com/wgillett/biopython/commit/50659de2f0cfa3f0bc913b4ea88d6a001b543d98 Secondly his fix for Bug 3234, https://github.com/wgillett/biopython/commit/a60ac226ceed21fd856ff1ec1dbea2782e2172ae Thanks, Peter From pgarland at gmail.com Tue May 24 04:34:03 2011 From: pgarland at gmail.com (Phillip Garland) Date: Tue, 24 May 2011 01:34:03 -0700 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: The patch looks correct to me. ~Phillip On Mon, May 23, 2011 at 9:03 AM, Peter Cock wrote: > On Mon, May 23, 2011 at 4:30 PM, ? wrote: >> >> Issue #3234 has been reported by Walter Gillett. >> >> ---------------------------------------- >> Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong >> https://redmine.open-bio.org/issues/3234 >> >> Author: Walter Gillett >> Status: New >> Priority: Normal >> Assignee: Walter Gillett >> Category: >> Target version: >> URL: >> >> >> Spun off from #2947, see that bug for discussion. Initial state probabilities >> should be set explicitly, rather than using the probability of transitioning >> from a state back to itself, which is incorrect. > > Would anyone more familiar with HMMs that I am like to volunteer to review > Walter's changes? e.g. Philip (CC'd)? 
> > Walter's sent a pull request via github: > https://github.com/biopython/biopython/pull/6 > > This consists of two commits, the first an unrelated minor change to the > ignore file for people using the NetBeans IDE: > https://github.com/wgillett/biopython/commit/50659de2f0cfa3f0bc913b4ea88d6a001b543d98 > > Secondly his fix for Bug 3234, > https://github.com/wgillett/biopython/commit/a60ac226ceed21fd856ff1ec1dbea2782e2172ae > > Thanks, > > Peter > From p.j.a.cock at googlemail.com Tue May 24 05:07:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 May 2011 10:07:15 +0100 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: On Tue, May 24, 2011 at 9:34 AM, Phillip Garland wrote: > The patch looks correct to me. > > ~Phillip Thank you both, I've applied the change: https://github.com/biopython/biopython/commit/152f469179d4a142858a04c02169f8d1fc5f8c83 Peter From redmine at redmine.open-bio.org Tue May 24 12:13:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 24 May 2011 16:13:11 +0000 Subject: [Biopython-dev] [Biopython - Feature #3236] (New) Make Biopython work in PyPy 1.5 Message-ID: Issue #3236 has been reported by Eric Talevich. ---------------------------------------- Feature #3236: Make Biopython work in PyPy 1.5 https://redmine.open-bio.org/issues/3236 Author: Eric Talevich Status: New Priority: Low Assignee: Category: Target version: URL: PyPy is now roughly as production-ready as Jython: http://morepypy.blogspot.com/2011/04/pypy-15-released-catching-up.html Let's make Biopython work on PyPy 1.5. To make the pure-Python core of Biopython work, I did this: * Download and unpack the pre-compiled Linux tarball from pypy.org * Copy the header file @marshal.h@ from the CPython 2.X installation into the @pypy-c-.../include/@ directory * pypy setup.py build; pypy setup.py install * Delete pypy-c-.../site-packages/Bio/cpairwise2*.so Benchmarking a script that leans heavily on Bio.pairwise2, I see about a 2x speedup between Pypy 1.5 and CPython 2.6 -- yes, that's with the compiled C extension @cpairwise2@ in the CPython 2.6 installation. Numpy isn't available on PyPy yet, and it may be some time before it does. Observations from @pypy setup.py test@: * test_BioSQL triggers tons of RuntimeWarnings related to sqlite3 functions * test_BioSQL_SeqIO fails -- attempts to retrieve P01892 instead of Q29899 (?) * test_Restriction triggers a TypeError, somehow (also causing test_CAPS to err) * test_Entrez fails with many noisy errors -- looks related to expat, may be just my installation * importing @Bio.trie@ fails, probably due to a @marshal.h@ issue with compilation ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Tue May 24 16:20:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 May 2011 21:20:55 +0100 Subject: [Biopython-dev] Fwd: [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: Message-ID: Eric, Could you take a look at the second of these commits from Aaron please? 
https://github.com/habnabit/biopython/commit/533c5b0a8fd4656ef937e5e0816d2714f82ecf07 I've already applied the first one with a cherry-pick, https://github.com/biopython/biopython/tree/7bec999af556be28d1a50dac9687d62f6c200b38 Thanks, Peter ---------- Forwarded message ---------- From: habnabit Date: Tue, May 24, 2011 at 9:10 PM Subject: [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) To: p.j.a.cock at googlemail.com Hi! This is a very small set of changes for biopython; it should be evident from the diff and commit message what the intent is. -- Reply to this email directly or view it on GitHub: https://github.com/biopython/biopython/pull/7 From eric.talevich at gmail.com Wed May 25 10:33:36 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 10:33:36 -0400 Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: Message-ID: Thanks for these patches, Aaron! And thanks for merging the first one, Peter. The second set looks safe to me. A couple thoughts: 1. It might be more intuitive to accept a format string directly as the format_branchlength argument, e.g. Phylo.write(tree,?outfile,?'newick', format_branchlength='%.0e') Since the branch length is always supposed to be a numeric type or None, format strings alone should be sufficient to do whatever the user wants, right? Alternatively, the switch in _info_factory could go: if format_branchlength is None: fmt_bl = lambda bl: '%1.5f' % bl elif isinstance(format_branchlength, basestring): fmt_bl = lambda bl: format_branchlength % bl elif callable(format_branchlength): fmt_bl = format_branchlength else: raise WTF 2. Out of curiousity, is there a certain program out there that uses branch length in a different format? I hadn't considered this before, but I can see how scientific notation would be useful sometimes if the target program can handle it. I can merge this if we have agreement on these. Cheers, Eric On Tue, May 24, 2011 at 4:20 PM, Peter Cock wrote: > > Eric, > > Could you take a look at the second of these commits from Aaron please? > https://github.com/habnabit/biopython/commit/533c5b0a8fd4656ef937e5e0816d2714f82ecf07 > > I've already applied the first one with a cherry-pick, > https://github.com/biopython/biopython/tree/7bec999af556be28d1a50dac9687d62f6c200b38 > > Thanks, > > Peter > > > ---------- Forwarded message ---------- > From: habnabit > Date: Tue, May 24, 2011 at 9:10 PM > Subject: [biopython] Bugfix in test_Phylo; branch length formatter for > Newick trees (#7) > To: p.j.a.cock at googlemail.com > > > Hi! > > This is a very small set of changes for biopython; it should be > evident from the diff and commit message what the intent is. > > -- > Reply to this email directly or view it on GitHub: > https://github.com/biopython/biopython/pull/7 From eric.talevich at gmail.com Wed May 25 16:47:38 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 16:47:38 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Tue, May 17, 2011 at 8:21 AM, Jo?o Rodrigues wrote: > Regarding the missing element never raising an exception, here's what I > propose: > > Change the wording of the warnings in the Atom._assign_element method so > that they signal that the element was missing and it either was > auto-assigned or it couldn't be assigned at all. > I agree. 
Just prefixing the existing messages with "Missing or unexpected element: " would probably be fine, I think. Cheers, Eric From eric.talevich at gmail.com Wed May 25 17:03:23 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 17:03:23 -0400 Subject: [Biopython-dev] Biopython specific warning classes In-Reply-To: References: Message-ID: On Wed, May 18, 2011 at 3:42 PM, Peter Cock wrote: > Hi all, > > I've been thinking we should introduce some specific warning > classes to Biopython, in particular: > > ParserWarning, for any "dodgy" input files, such as invalid > GenBank LOCUS lines, and so on. The existing PDB parser > warning should become a subclass of this. > This would fit well with what PDB and Phylo already do. My docstring for PhyloXMLWarning says it's for non-compliance with the format's specification. An alternate way to do this (but less easily scaled for SeqIO) is to have warnings for each format, triggered whenever the spec for that format is violated. WriterWarning, for things like "data loss", e.g. record IDs > getting truncated in PHYLIP output. > I'm not sure whether this would be handy or tedious -- a lot of formats could conceivably lose some data in a SeqRecord, and adding checks to each writer might be too much. Maybe just document these things well somewhere. Perhaps even a base class BiopythonWarning, which would > be useful for people wanting to ignore all the Biopython issued > warnings - it might be helpful in our unit tests too. > We should make sure these are very easy to use, to avoid making the scheme complicated, like: >>> from Bio import BiopythonWarning or >>> from Bio.Warnings import BiopythonWarning, ParserWarning, WriterWarning >>> warnings.simplefilter('ignore', ParserWarning) I guess it's not so bad. Currently (apart from the PDB module), we tend to use > the default UserWarning which makes filtering the warnings > as an end user (or a unit test writer) quite hard. > Yeah, I think it would be better to reserve UserWarning for the user's application code, rather than emitting them from the Biopython library. -Eric From p.j.a.cock at googlemail.com Thu May 26 04:38:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 May 2011 09:38:15 +0100 Subject: [Biopython-dev] Biopython specific warning classes In-Reply-To: References: Message-ID: On Wed, May 25, 2011 at 10:03 PM, Eric Talevich wrote: > On Wed, May 18, 2011 at 3:42 PM, Peter Cock > wrote: >> >> Hi all, >> >> I've been thinking we should introduce some specific warning >> classes to Biopython, in particular: >> >> ParserWarning, for any "dodgy" input files, such as invalid >> GenBank LOCUS lines, and so on. The existing PDB parser >> warning should become a subclass of this. > > This would fit well with what PDB and Phylo already do. My docstring > for PhyloXMLWarning says it's for non-compliance with the format's > specification. The warning in the GenBank file are also for clearly non-compliant files. > An alternate way to do this (but less easily scaled for SeqIO) is to have > warnings for each format, triggered whenever the spec for that format is > violated. Once we have the base classes of BiopythonWarning and ParserWarning in place, you could introduce more subclasses - but it seems less and less useful. >> WriterWarning, for things like "data loss", e.g. record IDs >> getting truncated in PHYLIP output. 
> > I'm not sure whether this would be handy or tedious -- a lot of formats > could conceivably lose some data in a SeqRecord, and adding checks to each > writer might be too much. Maybe just document these things well somewhere. There are a couple of existing warnings of this kind, but I agree they should be used sparingly. >> Perhaps even a base class BiopythonWarning, which would >> be useful for people wanting to ignore all the Biopython issued >> warnings - it might be helpful in our unit tests too. > > We should make sure these are very easy to use, to avoid making the scheme > complicated, like: > >>>> from Bio import BiopythonWarning > > or > >>>> from Bio.Warnings import BiopythonWarning, ParserWarning, WriterWarning >>>> warnings.simplefilter('ignore', ParserWarning) > > I guess it's not so bad. Yes, to ignore any Biopython warnings you do: from Bio import BiopythonWarning warnings.simplefilter('ignore', BiopythonWarning) or, to ignore just our parser warnings: from Bio import ParserWarning warnings.simplefilter('ignore', ParserWarning) That seems easy to me ;) >> Currently (apart from the PDB module), we tend to use >> the default UserWarning which makes filtering the warnings >> as an end user (or a unit test writer) quite hard. > > Yeah, I think it would be better to reserve UserWarning for the user's > application code, rather than emitting them from the Biopython library. OK then - I'll work on this. Peter From p.j.a.cock at googlemail.com Thu May 26 07:02:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 May 2011 12:02:49 +0100 Subject: [Biopython-dev] Biopython specific warning classes In-Reply-To: References: Message-ID: On Thu, May 26, 2011 at 9:38 AM, Peter Cock wrote: > > OK then - I'll work on this. > I've made a start on this with a BiopythonWarning and BiopythonParserWarning, but have not yet gone over the whole code base to use these consistently. If anyone want to tackle their own modules first, that would be helpful. Peter From eric.talevich at gmail.com Thu May 26 23:57:14 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 26 May 2011 23:57:14 -0400 Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> Message-ID: Aaron & folks, I've committed the original patch and another based on this discussion. https://github.com/biopython/biopython/commit/cc48ad211266cb9ac118df15889597912c79a994 On Wed, May 25, 2011 at 10:44 AM, Aaron Gallagher wrote: > On May 25, 2011, at 7:33 AM, Eric Talevich wrote: > > > 1. [...] > > Since the branch length is always supposed to be a numeric type or > > None, format strings alone should be sufficient to do whatever the > > user wants, right? > > Maybe this is more sensible; I've been struggling to come up with a use > case of a full callable though it seemed to make sense when I was > implementing it. > > > Alternatively, the switch in _info_factory could go: [...] > > I'm not a huge fan of implementing APIs like this in python, really. It is > seeming more and more like the most sensible thing is to just specify a > format string. > I changed the format_branch_length argument to take a simple format string instead of a function: https://github.com/biopython/biopython/commit/decd2a19fa3631cc34aaaf4c79d3af96c26fa1d9 > > 2. Out of curiousity, is there a certain program out there that uses > > branch length in a different format? 
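A short usage example for the argument as committed above; the input file name and the chosen precision are only illustrative:

from Bio import Phylo

tree = Phylo.read("example.nwk", "newick")

# The default writer rounds branch lengths to five decimal places;
# a format string keeps more precision or switches to scientific notation:
Phylo.write(tree, "default.nwk", "newick")
Phylo.write(tree, "precise.nwk", "newick", format_branch_length="%1.6e")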
I hadn't considered this before, > > but I can see how scientific notation would be useful sometimes if the > > target program can handle it. > > The issue in my case was not so much needing a different format (though the > tools I work on /do/ support scientific notation) so much as that the Newick > trees I generate have precision down to 1e-6. Round-tripping them through > biopython was truncating branches with very small lengths. > Good to know. The format for confidences is also hard-coded ("%1.2f"), do you suppose that should be given the same treatment? Thanks again, Eric From p.j.a.cock at googlemail.com Fri May 27 09:52:56 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 May 2011 14:52:56 +0100 Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> <7B5DB32C-25FB-43F6-A3CB-15848A975418@gmail.com> Message-ID: On Fri, May 27, 2011 at 2:48 PM, Erick Matsen wrote: > Hello everyone-- > > > Hope you don't mind my chiming into this discussion. > >> Good to know. The format for confidences is also hard-coded ("%1.2f"), do >> you suppose that should be given the same treatment? > > I think this would be entirely appropriate. There are some cases (eg > bootstrap) where the confidence is actually a count, and being able to > express it as such might be convenient. > > I have one related point to discuss if you don't mind. In > > https://github.com/biopython/biopython/blob/master/Bio/Phylo/NewickIO.py#L246 > > trees without confidence values get written out as trees with confidence > values of zero. These are of course two different things. > > I realize that if we want to write out a tree without confidence values > we can specify branchlengths_only, but it would seem to me that the most > natural behavior would be to just write out confidence values when they > are specified. > > In particular, it surprises me that reading a tree and then writing it > with the default settings changes the meaning of the tree. > > I realize that changing the behavior like this might not be possible > because this is a large group project, but I thought I would point it > out. > > Thank you for your great work here! > > Erick That is a very good point. Can we use None for no confidence value? Peter From eric.talevich at gmail.com Fri May 27 10:30:08 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 27 May 2011 10:30:08 -0400 Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> <7B5DB32C-25FB-43F6-A3CB-15848A975418@gmail.com> Message-ID: On Fri, May 27, 2011 at 9:52 AM, Peter Cock wrote: > On Fri, May 27, 2011 at 2:48 PM, Erick Matsen wrote: > > Hello everyone-- > > > > > > Hope you don't mind my chiming into this discussion. > > > >> Good to know. The format for confidences is also hard-coded ("%1.2f"), > do > >> you suppose that should be given the same treatment? > > > > I think this would be entirely appropriate. There are some cases (eg > > bootstrap) where the confidence is actually a count, and being able to > > express it as such might be convenient. > OK, this should be easy enough to fix. > > I have one related point to discuss if you don't mind. 
In > > > > https://github.com/biopython/biopython/blob/master/Bio/Phylo/ > >> >> Peter >> > NewickIO.py#L246 > > > > trees without confidence values get written out as trees with confidence > > values of zero. These are of course two different things. > > > > I realize that if we want to write out a tree without confidence values > > we can specify branchlengths_only, but it would seem to me that the most > > natural behavior would be to just write out confidence values when they > > are specified. > > > > In particular, it surprises me that reading a tree and then writing it > > with the default settings changes the meaning of the tree. > > > > I realize that changing the behavior like this might not be possible > > because this is a large group project, but I thought I would point it > > out. > > > > Thank you for your great work here! > > > > Erick > > That is a very good point. Can we use None for no confidence value? > > Yes, that should be the case, and also NewickIO should not add confidence values of 0.0 during serialization where clade.confidence is None. This probably deserves another test in test_Phylo.py. I don't see a problem with changing this behavior in Bio.Phylo, as long as it's still creating Newick files that work with other widely-used software. -Eric From mikael.trellet at gmail.com Tue May 31 07:46:50 2011 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Tue, 31 May 2011 13:46:50 +0200 Subject: [Biopython-dev] GSoC 2011 - Interface analysis module - Week 1 Message-ID: Hi there, As mentioned in the title, you will find in this email a sum up of my first week of coding for the Google Summer of Code 2011. I will begin with a reminder of the original plan proposed to Google and I will continue with what I did and what obstacles I encountered. Please don't hesitate to post some comments, your remarks are one of the main motivation for this mail (which will be I think the first one of a weekly report) ! Week 1 [24th - 31th May] 1. Add a the new Interface module backbone in current Bio.PDB code base 1. Evaluate possible code reuse and call it into the new module 2. Try simple calculations to be sure that there is stability between the different modules (parsing for example) and functions 1. Define a stable benchmark of few PDB files of complexes to run some unit tests for each step of the project Unfortunately, one of the main part of my first week was to try to solve some troubles I had by using github directly on my Dropbox folder. I worked on several computer so I wanted to have everything synchronized, but this synchronization didn't seem to be very compatible with dropbox. I have to say that it was certainly the way I used it which were wrong, I decided finally (but also lately) to keep only one main working directory and to ssh it if I need. We began to think of an easy way to add the Interface as a new part of the SMCRA scheme. The idea was to have this new scheme = SM-I-CRA. Unfortunately the Interface object is not as well defined as just a child of model and a parent of chains. Indeed, the main part of the interface is residues, and even residues pairs. We want to keep the information of the chain but we can't keep them as they are defined actually, since we will get some overlaps, duplication and miscompatibility between the chains of our model and the chains of our interface. In the same way, our try to link the creation of the interface with existing modules as StructureBuilder and Model wasn't successful. 
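For illustration, a minimal sketch of the underlying calculation (this is not the code from the commits linked below, only the existing Bio.PDB machinery that the _build_interface method described below also relies on; the PDB file name and the 5.0 Angstrom cutoff are placeholders):

    from Bio.PDB import PDBParser, Selection, NeighborSearch

    parser = PDBParser()
    structure = parser.get_structure("complex", "complex.pdb")  # placeholder file
    model = structure[0]

    # Collect every atom in the model and index it for fast neighbour lookups.
    atoms = Selection.unfold_entities(model, "A")
    ns = NeighborSearch(atoms)

    # All residue pairs closer than 5.0 A; keep only pairs bridging two chains.
    interface_pairs = []
    for res1, res2 in ns.search_all(5.0, level="R"):
        if res1.get_parent().id != res2.get_parent().id:
            interface_pairs.append((res1, res2))

    print("%i inter-chain residue pairs" % len(interface_pairs))
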
So, we decided to simplify a bit the concept in adding the classes related to the Interface in an independent way. Obviously links will exist between the different levels of SMCRA but Interface would be considered now as a parallel entity, not integrated completely in the SMCRA scheme. End of the story, some keyboards uses now. About the coding part. I had two new classes in Bio.PDB : Interface.py and InterfaceBuilder.py For the impatient people, this is the two links of my commits : https://github.com/mtrellet/biopython/commit/4cfa4359d0f927609c076ed7b66f37add5aabdfb https://github.com/mtrellet/biopython/commit/194efe37ac8f88d688e0cf528f1fb896c8441866 Interface.py is the definition of the Interface object inherited from Entity with the following methods : *__init__*(self, id), *add*(self, entity) and *get_chains*(self). The *add* module overrides the add method of Entity in order to have an easy way to class residues according to their respective chains. The *get_chains* modules returns the chains involved in the interface defined by the Interface object. The second class created is InterfaceBuilder.py which deals directly with the interface building (hard to guess..!) We find these different modules : *__init__*(self, model, id=None, threshold=5.0, include_waters=False, *chains), *_unpack_chains*(self, list_of_tuples), *get_interface*(self), *_add_residue*(self, residue), * _build_interface*(self, model, id, threshold, include_waters=False, *chains) *__init__* : In order to initialize an interface you need to provide the model for which you want to calculate the interface, that's the only mandatory argument. *_unpack_chains*: Method used by __init__ so as to create self.chain_list, variable read in many parts of the class. It transforms a list of tuples (given by the user) in a list of characters representing the chains which will be involved in the definition of the interface. *get_interface: *Returns simply the interface *_add_residue: *Allows the user to add some specific residues to his interface *_build_interface: *The machinery to build the interface, it uses NeighborSearch and Selection in order to define the interface depending on the arguments given by the user. It was maybe a bit long and with too many details (or perhaps not details enough), as I already said, don't hesitate to make suggestions, for both my work and my report ! You should receive a dozen of these, so any comment is welcomed ! Cheers, -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands From mdipierro at cs.depaul.edu Sun May 1 04:51:23 2011 From: mdipierro at cs.depaul.edu (Massimo Di Pierro) Date: Sat, 30 Apr 2011 23:51:23 -0500 Subject: [Biopython-dev] biopython web interface In-Reply-To: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> References: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> Message-ID: Hello Andrea I am a looking at something a little different than what you are doing but we should definitely collaborate. I am trying to identify tasks that are not domain specific that could benefit more than one scientific community. It seems to me all scientific communities have data, have program (in python or not it irrelevant to me) and have a workflow. 
They all need: 1) a tool to post the data online in a semi-automated fashion 2) a tool to share data easily (both via web interface and scripting via web service) with access control 3) a way to annotate the data as in a CMS 4) a mechanism to connect data with a workflow so that certain programs are executed automatically when new data is uploaded in the system. The programs may require user input so it should possible to somehow register a task (a program) by describing what input data it needs and what user input it needs and the system should automatically generate an interface. 5) an interface to local clusters and grid resources to submit computing jobs to I do not have the resources or the expertise to build an interface specific for biopython but I think we should collaborate because if what I am going is general enough (and I am not sure it is unless we talk more about it) it could be used to create an interface to biopython with minimal programming. I understand your focus is on algorithms but I need to start on data. It is my experience it is very difficult to automate the workflow of algorithms if there is no standard exchange format for the data. The first thing I would need to understand are: - does biopython handle some standard file formats? What do they contain? how can they be recognized? Can you send me a few example? - is there a graph of which algorithms run on which file types? - what are the most common algorithms? Can you point me to the source? I like to think of the system as something that will represent the workflow as a graph. Each file type is a node. An algorithm is a link. If a node is an image or a csv file or an xml file or a movie or a vtk file, etc. the system will be able to represent it (show it). Links "define" the file type. As long as you have a standard, you will be able to register your algorithms and the system will know what to do. The all graph is built automatically without programming by introspecting your folders and identifying your files. You will be able to annotate your folders using a markup language to augment the information. In my approach starting from the data is critical. My approach does not fly if you do not have standard file formats. Massimo P.S. Sei italiano? On Apr 30, 2011, at 12:03 PM, Andrea Pierleoni wrote: > >> >> Message: 3 >> Date: Fri, 29 Apr 2011 08:34:34 -0500 >> From: Massimo Di Pierro >> Subject: [Biopython-dev] biopython web interface >> To: >> Message-ID: <57629245-F184-4143-8B18-80E69BC2C351 at cs.depaul.edu> >> Content-Type: text/plain; charset="us-ascii" >> >> Hello everybody, >> >> I am new to biopython and I have some silly questions. >> >> Does biopython have a web interface? >> If not, would you be interested in help developing one? >> What kind of features would you be interested in? >> >> Reason for my question: I am a physicist and a professor of CS. I am >> working with a few different groups to build a unified platform to bring >> scientific data online. The main idea is that of having a tool that >> requires no programming and scientists can use to introspect an existing >> directory and turn it into dynamical web pages. Those pages can then be >> edited and re-oreganized like a CMS. The system should be able to >> recognize basic file types, group, tag and categorize them. It should them >> be possible to register algorithms, run them on the server, create a >> workflow. The system will also have an interface for mobile. 
>> >> Here is a first prototype for physics data that interface with the >> National Energy Research Computing Center: >> http://tests.web2py.com/nersc >> >> Since we are doing this it would be great to have as many community on >> board as possible so that we can write specs that are broad enough. >> We can do all the work or you can help us if you want. >> >> So, if you have a wish list please share it with me. >> >> Personally, I need to be educated on biopython since I do not fully >> understand what are the basic file types it handles, what are the most >> popular algorithms it provides, nor I am familiar with the typical usage >> workflow. >> >> Massimo >> >> >> > > > Hi Massimo, > BioPython itself is a python library, but a web interface would enable many > functions to biological scientist with no programming expertise. > There are some parts of the library that cope well with a > web-interface/server, > in particular the BioSQL modules. > The BioSQL schema is a relational database model to store biological data. > I do have working code for using the BioPython BioSQL functions (and more) > with > the web2py DAL, and I'm working on a complete web2py-based opensource > webserver to store and manage biological sequences/entities. > If you (or any other) are interested and want to contribute, let me know. > There are many things in common between what I'm doing and what you want > to do, > so maybe its a good idea to work together. > > Andrea Pierleoni > > > From bjclavijo at gmail.com Mon May 2 13:52:15 2011 From: bjclavijo at gmail.com (Bernardo Clavijo) Date: Mon, 2 May 2011 10:52:15 -0300 Subject: [Biopython-dev] biopython web interface In-Reply-To: References: <3a649ae478daf0c2e544dc573a15f3b5.squirrel@lipid.biocomp.unibo.it> Message-ID: Hello Massimo... first of all... thanks for web2py, which is my tool of choice for web apps :D Here goes my 2 cents about all this: 1) I you're looking for a standard format, we should me talking about sequence files ( fasta / gff ). This approach will be very restrictive, but i guess it's a starting point. 2) you should look at galaxy, in some point I was hoping to integrate a web2py programming module directly there (don't know how yet, and i'm in many things at once, so it's more like a dream than a project). Galaxy has a fex tutorials and videos that should point you in the right direction. 3) Sadly, standard data representation has been an issue for some time for the bioinformatics community, the REST / web services approach has gain some momentum and some apps talk to each other in some way, but we still have not much of a standard way to represent all the data. Ontologies are a strong point also (check http://www.obofoundry.org/ ) with sequence ontology being a great one IMHO pointing on how the data should be represented (it's recommended, even when not enforced, to use SO when creating gff3 files). 4) So far, the one tool to "standard biological data saving" I've found useful was the Chado DB schema, which BTW didn't enforce or even define how to handle a lot of situations, but is more of a framework on which to base your own data representation. I guess that's not what you're looking for, but surely an interesting approach and a lot of lessons learned there. I'm currently building a web interface for some of our projects saving genomic and proteomic data on a Chado DB ( http://gmod.org/wiki/Chado ) using web2py, but it's at least rough and in a pre-alpha (as in a PoC) state. 
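To make the fasta half of point 1 concrete, Biopython's SeqIO already reads such files in a uniform way; a minimal sketch (the file name is only an example, and gff is not one of the SeqIO formats, so that side would need a separate parser):

    from Bio import SeqIO

    # Iterate over the records of a FASTA file and print basic information.
    handle = open("example.fasta")
    for record in SeqIO.parse(handle, "fasta"):
        print("%s\t%i letters" % (record.id, len(record)))
    handle.close()
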
Some other folks here have been doing the same kind of projects, hopefully someone with a better and less specific approach. If it suits you, just contact me and i'll provide you all the direction and ideas my limited knowledge could generate. I'm a little dispersed man most of the time, so maybe not your ideal adviser, but I have the will. Greets and thanks again for web2py Bernardo Clavijo PD: please folks correct all my bad ideas for Massimo to have a real view and not my mess On Sun, May 1, 2011 at 1:51 AM, Massimo Di Pierro wrote: > Hello Andrea > > I am a looking at something a little different than what you are doing but we should definitely collaborate. > I am trying to identify tasks that are not domain specific that could benefit more than one scientific community. > > It seems to me all scientific communities have data, have program (in python or not it irrelevant to me) and have a workflow. > They all need: > 1) a tool to post the data online in a semi-automated fashion > 2) a tool to share data easily (both via web interface and scripting via web service) with access control > 3) a way to annotate the data as in a CMS > 4) a mechanism to connect data with a workflow so that certain programs are executed automatically when new data is uploaded in the system. The programs may require user input so it should possible to somehow register a task (a program) by describing what input data it needs and what user input it needs and the system should automatically generate an interface. > 5) an interface to local clusters and grid resources to submit computing jobs to > > I do not have the resources or the expertise to build an interface specific for biopython but I think we should collaborate because if what I am going is general enough (and I am not sure it is unless we talk more about it) it could be used to create an interface to biopython with minimal programming. > > I understand your focus is on algorithms but I need to start on data. It is my experience it is very difficult to automate the workflow of algorithms if there is no standard exchange format for the data. > > The first thing I would need to understand are: > - does biopython handle some standard file formats? What do they contain? how can they be recognized? Can you send me a few example? > - is there a graph of which algorithms run on which file types? > - what are the most common algorithms? Can you point me to the source? > > I like to think of the system as something that will represent the workflow as a graph. Each file type is a node. An algorithm is a link. > If a node is an image or a csv file or an xml file or a movie or a vtk file, etc. the system will be able to represent it (show it). > Links "define" the file type. As long as you have a standard, you will be able to register your algorithms and the system will know what to do. > > The all graph is built automatically without programming by introspecting your folders and identifying your files. You will be able to annotate your folders using a markup language to augment the information. > > In my approach starting from the data is critical. My approach does not fly if you do not have standard file formats. > > Massimo > > > > > > > > P.S. Sei italiano? 
> > On Apr 30, 2011, at 12:03 PM, Andrea Pierleoni wrote: > >> >>> >>> Message: 3 >>> Date: Fri, 29 Apr 2011 08:34:34 -0500 >>> From: Massimo Di Pierro >>> Subject: [Biopython-dev] biopython web interface >>> To: >>> Message-ID: <57629245-F184-4143-8B18-80E69BC2C351 at cs.depaul.edu> >>> Content-Type: text/plain; charset="us-ascii" >>> >>> Hello everybody, >>> >>> I am new to biopython and I have some silly questions. >>> >>> Does biopython have a web interface? >>> If not, would you be interested in help developing one? >>> What kind of features would you be interested in? >>> >>> Reason for my question: I am a physicist and a professor of CS. I am >>> working with a few different groups to build a unified platform to bring >>> scientific data online. The main idea is that of having a tool that >>> requires no programming and scientists can use to introspect an existing >>> directory and turn it into dynamical web pages. Those pages can then be >>> edited and re-oreganized like a CMS. The system should be able to >>> recognize basic file types, group, tag and categorize them. It should them >>> be possible to register algorithms, run them on the server, create a >>> workflow. The system will also have an interface for mobile. >>> >>> Here is a first prototype for physics data that interface with the >>> National Energy Research Computing Center: >>> http://tests.web2py.com/nersc >>> >>> Since we are doing this it would be great to have as many community on >>> board as possible so that we can write specs that are broad enough. >>> We can do all the work or you can help us if you want. >>> >>> So, if you have a wish list please share it with me. >>> >>> Personally, I need to be educated on biopython since I do not fully >>> understand what are the basic file types it handles, what are the most >>> popular algorithms it provides, nor I am familiar with the typical usage >>> workflow. >>> >>> Massimo >>> >>> >>> >> >> >> Hi Massimo, >> BioPython itself is a python library, but a web interface would enable many >> functions to biological scientist with no programming expertise. >> There are some parts of the library that cope well with a >> web-interface/server, >> in particular the BioSQL modules. >> The BioSQL schema is a relational database model to store biological data. >> I do have working code for using the BioPython BioSQL functions (and more) >> with >> the web2py DAL, and I'm working on a complete web2py-based opensource >> webserver to store and manage biological sequences/entities. >> If you (or any other) are interested and want to contribute, let me know. >> There are ?many things in common between what I'm doing and what you want >> to do, >> so maybe its a good idea to work together. >> >> Andrea Pierleoni >> >> >> > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue May 3 09:24:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 May 2011 10:24:08 +0100 Subject: [Biopython-dev] Interesting BLAST 2.2.25+ XML behaviour In-Reply-To: References: Message-ID: Hello all, I've CC'd the BioPerl, BioRuby, BioJava and Biopython development mailing lists to make sure you're aware of this, but can we continue any discussion on the cross-project open-bio-l mailing list please? 
I noticed that recent versions of BLAST are not using a single block for each query, which was the historical behaviour and assumed by the Biopython BLAST XML parser. This may be a bug in BLAST. See link below for an example. Has anyone else noticed this, and has it been reported to the NCBI yet? Thanks, Peter (Not for the first time, I wish there was a public bug tracker for BLAST, or at least a private bug tracker so we could talk about issues with an NCBI assigned reference number.) ---------- Forwarded message ---------- From: Peter Cock Date: Wed, Apr 20, 2011 at 6:08 PM Subject: Interesting BLAST 2.2.25+ XML behaviour To: Biopython-Dev Mailing List Hi all, Have a look at this XML file from a FASTA vs FASTA search using blastp from ?BLAST 2.2.25+ (current release), which is a test file I created for the BLAST+ wrappers in Galaxy: https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml I just put it though the Biopython BLAST XML parser, and was surprised not to get four records back (since as you might guess from the filename, there were four queries). It appears this version of BLAST+ is incrementing the iteration counter for each match... or something like that. Has anyone else noticed this? I wonder if it is accidental... Peter From updates at feedmyinbox.com Wed May 4 04:37:19 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 4 May 2011 00:37:19 -0400 Subject: [Biopython-dev] 5/4 active questions tagged biopython - Stack Overflow Message-ID: // Finding/Replacing substrings with annotations in an ASCII file in Python // May 3, 2011 at 9:14 AM http://stackoverflow.com/questions/5870012/finding-replacing-substrings-with-annotations-in-an-ascii-file-in-python Hello Everyone, I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to extract motif sequences from a database and use the information to annotate a sequence alignment file. The alignment file is plain text, so the annotation will not be anything elaborate, at best simply replacing the extracted sequences with asterisks in the alignment file itself. I have a script which scans the database file, extracts all sequences I need, and writes them to an output file. What I need is, given a query, to read these sequences and match them to their corresponding substrings in the ASCII alignment files. Finally, for every occurrence of a motif sequence (substring of a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *. The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database): motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r') align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+') finalmotifs = motif_file.readlines() seqalign = align_file.readlines() for line in seqalign: if motif[i] in seqalign: # I have stored all motifs in a list called "motif" replace(motif, '*****') But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone see why this is happening? I suspect that the problem may lie in the fact that my ASCII file is basically just one very long list of amino acids, and Python cannot know how to replace a particular substring hidden within a very long string. 
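The most likely reason the alignment file ends up empty is that it is opened in 'w+' mode, which truncates it on open, and that replace() is never actually applied to anything (str.replace returns a new string and has to be called on one). A minimal corrected sketch, assuming the motifs are stored one per line and writing the result to a new file (the output name is invented here) instead of overwriting the alignment:

    # Read the motifs (one per line) and the alignment text.
    motifs = [line.strip() for line in open('/users/myfolder/final motifs_11SGLOBULIN')]
    alignment = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs').read()

    # Mask each motif with a run of asterisks of the same length so that
    # the positions in the alignment are preserved.
    for motif in motifs:
        if motif:
            alignment = alignment.replace(motif, '*' * len(motif))

    out = open('/Users/myfolder/alignmentfiles/11sglobulin_annotated.seqs', 'w')
    out.write(alignment)
    out.close()
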
-- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Wed May 4 10:21:08 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 12:21:08 +0200 Subject: [Biopython-dev] Benchmarking PDBParser Message-ID: Hello all, Following a few discussions, I'm tempted to benchmark the current implementation of the PDBParser and see how it fares against an old implementation (I think I'll use 1.48 since older versions need Numerical Python). The main objective is to see if the recent developments have a significant impact in its speed. I thought of downloading the entire PDB but since it would take several days, I downloaded the CATH domain list instead. Those are just protein ATOM records, without any header, but since all modifications were essentially dealing with ATOM records, etc, I think it might be as valid. I'll be running tests today and tomorrow and I'll put the results up somewhere later on. I'm also making the scripts available so it is easy to benchmark it later on. Thoughts or suggestions? Cheers, Jo?o From p.j.a.cock at googlemail.com Wed May 4 10:39:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 May 2011 11:39:19 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Wed, May 4, 2011 at 11:21 AM, Jo?o Rodrigues wrote: > Hello all, > > Following a few discussions, I'm tempted to benchmark the current > implementation of the PDBParser and see how it fares against an old > implementation (I think I'll use 1.48 since older versions need Numerical > Python). The main objective is to see if the recent developments have a > significant impact in its speed. > > I thought of downloading the entire PDB but since it would take several > days, I downloaded the CATH domain list instead. Those are just protein ATOM > records, without any header, but since all modifications were essentially > dealing with ATOM records, etc, I think it might be as valid. > > I'll be running tests today and tomorrow and I'll put the results up > somewhere later on. I'm also making the scripts available so it is easy to > benchmark it later on. > > Thoughts or suggestions? > > Cheers, > > Jo?o That sounds like a good idea. While you are at it, you could try both the strict and permissive modes - I wonder what proportion of the current PDB has problems in the data? Peter From anaryin at gmail.com Wed May 4 10:42:12 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 12:42:12 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: I was not planning on using the PDB database, but I might as well download it then. Adding that to the list. I'm also planning on removing all elements and check the impact of finding the elements. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao From anaryin at gmail.com Wed May 4 13:23:39 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 15:23:39 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Just a word of advice. 
I tried to download the whole PDB with PDBList.py and I ran into an error. Their server shut me down due to too many connections. Perhaps adding an exception catcher like the one we have for NCBI servers would be useful? Preliminary results show some degradation of speed.. ==> benchmark_CATH-biopython_149.time <== Total time spent: 530.686s Average time per structure: 46.839ms ==> benchmark_CATH-biopython_current.time <== Total time spent: 686.176s Average time per structure: 60.563ms I'll write a full summary when I finish downloading the PDB and testing it. From chad.a.davis at gmail.com Wed May 4 13:55:04 2011 From: chad.a.davis at gmail.com (Chad Davis) Date: Wed, 4 May 2011 15:55:04 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: I'd be very interested in this as well. I'm working on some modifications (in the alpha stages still) to the BioPerl PDB parser (based on the Perl Data Language, analogous to NumPy) and would be interested to compare all of them (BioPython old and new, BioPerl old and new). In my experience, downloading the PDB, just the divided structures, works best with rsync, and I believe it should only take several hours, not several days, the first time. It should be as easy as: rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb Other options: http://www.wwpdb.org/downloads.html Chad On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: > Just a word of advice. I tried to download the whole PDB with PDBList.py and > I ran into an error. Their server shut me down due to too many connections. > Perhaps adding an exception catcher like the one we have for NCBI servers > would be useful? > > Preliminary results show some degradation of speed.. > > ==> benchmark_CATH-biopython_149.time <== > Total time spent: 530.686s > Average time per structure: 46.839ms > > ==> benchmark_CATH-biopython_current.time <== > Total time spent: 686.176s > Average time per structure: 60.563ms > > I'll write a full summary when I finish downloading the PDB and testing it. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From anaryin at gmail.com Wed May 4 13:57:40 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 4 May 2011 15:57:40 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hey Chad, That's exactly what I ended up doing and it is done ;) Pretty quick, I was hoping for a day or so! Best, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Wed, May 4, 2011 at 3:55 PM, Chad Davis wrote: > I'd be very interested in this as well. > I'm working on some modifications (in the alpha stages still) to the > BioPerl PDB parser (based on the Perl Data Language, analogous to > NumPy) and would be interested to compare all of them (BioPython old > and new, BioPerl old and new). > > In my experience, downloading the PDB, just the divided structures, > works best with rsync, and I believe it should only take several > hours, not several days, the first time. It should be as easy as: > > rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb > > Other options: > http://www.wwpdb.org/downloads.html > > Chad > > > On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: > > Just a word of advice. I tried to download the whole PDB with PDBList.py > and > > I ran into an error. Their server shut me down due to too many > connections. 
> > Perhaps adding an exception catcher like the one we have for NCBI servers > > would be useful? > > > > Preliminary results show some degradation of speed.. > > > > ==> benchmark_CATH-biopython_149.time <== > > Total time spent: 530.686s > > Average time per structure: 46.839ms > > > > ==> benchmark_CATH-biopython_current.time <== > > Total time spent: 686.176s > > Average time per structure: 60.563ms > > > > I'll write a full summary when I finish downloading the PDB and testing > it. > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From redmine at redmine.open-bio.org Wed May 4 21:56:27 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 4 May 2011 21:56:27 +0000 Subject: [Biopython-dev] [Biopython - Feature #3194] (In Progress) Bio.Phylo export to 'ape' via Rpy2 References: Message-ID: Issue #3194 has been updated by Eric Talevich. Status changed from New to In Progress Assignee changed from Eric Talevich to Biopython Dev Mailing List % Done changed from 0 to 20 Estimated time set to 0.50 I added a cookbook entry for this on the Biopython wiki: http://www.biopython.org/wiki/Phylo_cookbook#Convert_to_an_.27ape.27_tree.2C_via_Rpy2 Good enough? Trying it in ipython, it works as advertised, except after calling r.plot() the R plot window won't close until I exit ipython. Further calls to plot() update the window; it just doesn't close. ---------------------------------------- Feature #3194: Bio.Phylo export to 'ape' via Rpy2 https://redmine.open-bio.org/issues/3194 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: There are many more packages for working with phylogenetic data in R, and most of these operate on the basic tree object defined in the ape package. Let's support interoperability through Rpy2. The trivial way to do this is serialize a tree to a Newick string, then feed that to the read.tree() function. Maybe we can build the tree object in R directly and retain the tree annotations that Newick doesn't handle. See: http://ape.mpl.ird.fr/ http://rpy.sourceforge.net/rpy2.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 4 22:25:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 4 May 2011 22:25:11 +0000 Subject: [Biopython-dev] [Biopython - Feature #3194] Bio.Phylo export to 'ape' via Rpy2 References: Message-ID: Issue #3194 has been updated by Eric Talevich. File feat3194.diff added Estimated time changed from 0.50 to 1.00 Patch based on the cookbook entry. ---------------------------------------- Feature #3194: Bio.Phylo export to 'ape' via Rpy2 https://redmine.open-bio.org/issues/3194 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: There are many more packages for working with phylogenetic data in R, and most of these operate on the basic tree object defined in the ape package. Let's support interoperability through Rpy2. The trivial way to do this is serialize a tree to a Newick string, then feed that to the read.tree() function. 
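For illustration, a rough sketch of that trivial route (independent of the cookbook entry and the attached patch; it assumes rpy2 plus the R ape package, the input file name is a placeholder, and importr exposes read.tree() as read_tree):

    from cStringIO import StringIO

    from Bio import Phylo
    import rpy2.robjects as robjects
    from rpy2.robjects.packages import importr

    # Serialize a Bio.Phylo tree to a Newick string in memory.
    tree = Phylo.read("example.nwk", "newick")  # placeholder input file
    handle = StringIO()
    Phylo.write(tree, handle, "newick")

    # Hand the Newick string to ape's read.tree() via rpy2.
    ape = importr("ape")
    ape_tree = ape.read_tree(text=handle.getvalue())

    # Plot with R; as noted above, the plot window may stay open until
    # the Python session ends.
    robjects.r.plot(ape_tree)
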
Maybe we can build the tree object in R directly and retain the tree annotations that Newick doesn't handle. See: http://ape.mpl.ird.fr/ http://rpy.sourceforge.net/rpy2.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Fri May 6 07:45:53 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 09:45:53 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hello all, I'd love to come with results but I ran into some problems. The parser is consuming too much memory after a while (>2GB) and I can't get reliable timings then because of swapping.. Therefore, I'll just take a random sample of 8000 structures and use it as a benchmark. I'll post the results today, shall I put it up on the wiki? This could be an interesting thing to post for both users and future developments. Best, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Wed, May 4, 2011 at 3:57 PM, Jo?o Rodrigues wrote: > Hey Chad, > > That's exactly what I ended up doing and it is done ;) Pretty quick, I was > hoping for a day or so! > > Best, > > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > On Wed, May 4, 2011 at 3:55 PM, Chad Davis wrote: > >> I'd be very interested in this as well. >> I'm working on some modifications (in the alpha stages still) to the >> BioPerl PDB parser (based on the Perl Data Language, analogous to >> NumPy) and would be interested to compare all of them (BioPython old >> and new, BioPerl old and new). >> >> In my experience, downloading the PDB, just the divided structures, >> works best with rsync, and I believe it should only take several >> hours, not several days, the first time. It should be as easy as: >> >> rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb >> >> Other options: >> http://www.wwpdb.org/downloads.html >> >> Chad >> >> >> On Wed, May 4, 2011 at 15:23, Jo?o Rodrigues wrote: >> > Just a word of advice. I tried to download the whole PDB with PDBList.py >> and >> > I ran into an error. Their server shut me down due to too many >> connections. >> > Perhaps adding an exception catcher like the one we have for NCBI >> servers >> > would be useful? >> > >> > Preliminary results show some degradation of speed.. >> > >> > ==> benchmark_CATH-biopython_149.time <== >> > Total time spent: 530.686s >> > Average time per structure: 46.839ms >> > >> > ==> benchmark_CATH-biopython_current.time <== >> > Total time spent: 686.176s >> > Average time per structure: 60.563ms >> > >> > I'll write a full summary when I finish downloading the PDB and testing >> it. >> > _______________________________________________ >> > Biopython-dev mailing list >> > Biopython-dev at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > >> > > From anaryin at gmail.com Fri May 6 07:54:20 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 09:54:20 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() Message-ID: Hello all, The PDBParser is sometimes a bit too loud, making meaningful output drown in dozens of warnings messages. This is partly (mostly) my fault because of the element guessing addition. Therefore, I'd suggest adding a QUIET argument (bool) to PDBParser that would supress all warnings. Of course, default is False. 
It might come handy for batch processing of proteins. I've added it to my pdb_enhancements branch so you can take a look: https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao From p.j.a.cock at googlemail.com Fri May 6 08:18:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 09:18:44 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 8:45 AM, Jo?o Rodrigues wrote: > Hello all, > > I'd love to come with results but I ran into some problems. The parser is > consuming too much memory after a while (>2GB) and I can't get reliable > timings then because of swapping.. Therefore, I'll just take a random sample > of 8000 structures and use it as a benchmark. Memory bloat is bad - it sounds like a garbage collection problem. Are you recreating the parser object each time? > I'll post the results today, shall I put it up on the wiki? This could be an > interesting thing to post for both users and future developments. I'd like to see the script and the results, so maybe the wiki is better. Peter From anaryin at gmail.com Fri May 6 08:24:04 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 6 May 2011 10:24:04 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: > > Memory bloat is bad - it sounds like a garbage collection problem. > Are you recreating the parser object each time? > No. I'm just calling get_structure at each step of the for loop. It's a bit irregular also, sometimes it drops from 1GB to 300MB, stays stable for a while and then spikes again. My guess is that all the data structures holding the parser structures consume quite a lot and probably there's no decent GC to clear the previous structure in time, so it accumulates. Is there any way I can profile the script to see who's keeping the most memory throughout the run? > > > I'll post the results today, shall I put it up on the wiki? This could be > an > > interesting thing to post for both users and future developments. > > I'd like to see the script and the results, so maybe the wiki is better. > Will do. Jo?o From p.j.a.cock at googlemail.com Fri May 6 08:29:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 09:29:19 +0100 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 9:24 AM, Jo?o Rodrigues wrote: >> Memory bloat is bad - it sounds like a garbage collection problem. >> Are you recreating the parser object each time? > > No. I'm just calling get_structure at each step of the for loop. It's a bit > irregular also, sometimes it drops from 1GB to 300MB, stays stable for a > while and then spikes again. My guess is that all the data structures > holding the parser structures consume quite a lot and probably there's no > decent GC to clear the previous structure in time, so it accumulates. > You could do an explicit clear once per PDB file to test this hypothesis: import gc gc.collect() Peter From p.j.a.cock at googlemail.com Fri May 6 09:25:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 10:25:50 +0100 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() Message-ID: On Fri, May 6, 2011 at 8:54 AM, Jo?o Rodrigues wrote: > Hello all, > > The PDBParser is sometimes a bit too loud, making meaningful output drown in > dozens of warnings messages. 
This is partly (mostly) my fault because of the > element guessing addition. Therefore, I'd suggest adding a QUIET argument > (bool) to PDBParser that would supress all warnings. Of course, default is > False. It might come handy for batch processing of proteins. > > I've added it to my pdb_enhancements branch so you can take a look: > > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > I had a look and Joao and I have been having a little discussion with the github comments feature. There are two ways to solve this, (1) Have a flag which controls issuing the warning (2) Filter out PDBConstructionWarning messages The first approach is messy as the flag needs to passed down to any relevant object (or done as a global which is nasty). The second approach requires a temporary warnings filter, which I think would easily done with the context manager warnings.catch_warnings() in Python 2.5+ I'd also like to use this in the unit tests, where currently we have to save the filter list, add a temporary filter, then restore the filter list. This generally works, but there are some stray warnings that are not being silenced. Given we've already officially dropped support for Python 2.4, I don't anticipate any protests. I guess before making such a change on the trunk, Tiago or I should turn off the Python 2.4 buildbot buildslaves... Peter From p.j.a.cock at googlemail.com Fri May 6 09:31:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 10:31:50 +0100 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 10:25 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 8:54 AM, Jo?o Rodrigues wrote: >> Hello all, >> >> The PDBParser is sometimes a bit too loud, making meaningful output drown in >> dozens of warnings messages. This is partly (mostly) my fault because of the >> element guessing addition. Therefore, I'd suggest adding a QUIET argument >> (bool) to PDBParser that would supress all warnings. Of course, default is >> False. It might come handy for batch processing of proteins. >> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> > > I had a look and Joao and I have been having a little > discussion with the github comments feature. > > There are two ways to solve this, > (1) Have a flag which controls issuing the warning > (2) Filter out PDBConstructionWarning messages > > The first approach is messy as the flag needs to passed > down to any relevant object (or done as a global which is > nasty). > > The second approach requires a temporary warnings filter, > which I think would easily done with the context manager > warnings.catch_warnings() in Python 2.5+ Arhh, Jaoa just pointed out warnings.catch_warnings() is in Python 2.6+ so we have to wait a while longer before we can use that :( Peter From redmine at redmine.open-bio.org Fri May 6 15:57:48 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 May 2011 15:57:48 +0000 Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py References: Message-ID: Issue #2619 has been updated by Eric Talevich. Flex takes a .lex file and generates a .c file. The .c file is the important thing to compile, not .lex. 
Looking at the generated C in lex.yy.c, I'd guess the same thing can be compiled all the platforms we support (though I haven't confirmed). As a short-term solution, can we check in lex.yy.c and include that with the distribution, in order to eliminate the flex dependency? ---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes need to enable are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri May 6 16:05:54 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 May 2011 16:05:54 +0000 Subject: [Biopython-dev] [Biopython - Bug #2619] Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py References: Message-ID: Issue #2619 has been updated by Peter Cock. Eric, we need two things: (1) The flex binary to convert our lex file into C, which as you point out we might be able to do in advance (assuming this version of flex is unimportant). Detecting the flex binary is pretty easy on Unix like platforms. See comment 4. (2) The flex headers to compile the C code. This can probably be solved, perhaps looking at similar issues in NumPy. ---------------------------------------- Bug #2619: Bio.PDB.MMCIFParser component MMCIFlex commented out in setup.py https://redmine.open-bio.org/issues/2619 Author: Chris Oldfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.48 URL: MMCIFParser is a documented feature of Bio.PDB, but it is broken by default because the MMCIFlex build is commented out in the distribution setup.py. According to http://osdir.com/ml/python.bio.devel/2006-02/msg00038.html this is because it doesn't compile on Windows. Though the function is documented, the changes need to enable are not, so this seems like an installation bug to me. The fix on linux is to uncomment setup.py lines 486 on. A general work around might be to condition the compile on the os.sys.platform variable. I'd offer a diff, but I'm new to biopython and python in general, so please forgive my ignorance. Source install of version 1.48, gentoo linux 2008, x86_64. -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Fri May 6 16:20:54 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 6 May 2011 12:20:54 -0400 Subject: [Biopython-dev] Python 2.4 / Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 5:31 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 10:25 AM, Peter Cock > wrote: > > > > The second approach requires a temporary warnings filter, > > which I think would easily done with the context manager > > warnings.catch_warnings() in Python 2.5+ > > Arhh, Jaoa just pointed out warnings.catch_warnings() is > in Python 2.6+ so we have to wait a while longer before > we can use that :( > > Fortunately we've already worked around it in test_PDB.py, by monkeypatching: https://github.com/biopython/biopython/blob/master/Tests/test_PDB.py See the method test_1_warnings. Replace the function warnings.showwarnings with a new function that just collects warning objects in a list rather than printing them. Then, before the outer function ends, swap back the original showwarnings function. -E From eric.talevich at gmail.com Fri May 6 16:23:33 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 6 May 2011 12:23:33 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues wrote: > Hello all, > > The PDBParser is sometimes a bit too loud, making meaningful output drown > in > dozens of warnings messages. This is partly (mostly) my fault because of > the > element guessing addition. Therefore, I'd suggest adding a QUIET argument > (bool) to PDBParser that would supress all warnings. Of course, default is > False. It might come handy for batch processing of proteins. > > I've added it to my pdb_enhancements branch so you can take a look: > > > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > > Since the PERMISSIVE argument is already an integer, could we consolidate these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? From p.j.a.cock at googlemail.com Fri May 6 16:25:40 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 6 May 2011 17:25:40 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 6, 2011 at 5:23 PM, Eric Talevich wrote: > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues wrote: > >> Hello all, >> >> The PDBParser is sometimes a bit too loud, making meaningful output drown >> in >> dozens of warnings messages. This is partly (mostly) my fault because of >> the >> element guessing addition. Therefore, I'd suggest adding a QUIET argument >> (bool) to PDBParser that would supress all warnings. Of course, default is >> False. It might come handy for batch processing of proteins. >> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> >> > Since the PERMISSIVE argument is already an integer, could we consolidate > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? 
> I'm OK with that, Peter From redmine at redmine.open-bio.org Sat May 7 18:52:44 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 7 May 2011 18:52:44 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. Here's a branch with Bio.Phylo converted to rst: https://github.com/etal/biopython/tree/rst_docstrings The main deviation from the Numpy guidelines is using:
:Parameters:
instead of:
Parameters
----------
This is because Epydoc only understands the former, so the latter produces something ugly in the generated docs. It will be easy enough to change, if we want, when we switch to Sphinx. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 16:33:03 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 16:33:03 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] (New) deprecated genbank localstion parser doesn't indicate what replaces it Message-ID: Issue #3227 has been reported by Mark Diekhans. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: High Assignee: Category: Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 20:23:26 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 20:23:26 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Peter Cock. Category set to Main Distribution Assignee set to Biopython Dev Mailing List Default assignee was lost... restoring to dev mailing list. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 22:59:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 22:59:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Mark Diekhans. hanks Peter! I am more than happy to change code to use the new parser. My bug report is that the module deception just says "(DEPRECATED)" and doesn't give one a clue as to how to get the same functionality. This is a request for better documentation, not continued support of this code. Mark ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 23:24:33 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 23:24:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3227] deprecated genbank localstion parser doesn't indicate what replaces it References: Message-ID: Issue #3227 has been updated by Peter Cock. The new GenBank/EMBL parser will use the new location parsing automatically. If you were using this (via Bio.GenBank or via Bio.SeqIO) you wouldn't have needed to change anything. The only people affected by the deprecation would be people using Bio.GenBank.LocationParser directly. Right now, the new location parsing code isn't really designed to be used on its own. In order to try and help you, I need to know what you were using Bio.GenBank.LocationParser for. ---------------------------------------- Bug #3227: deprecated genbank localstion parser doesn't indicate what replaces it https://redmine.open-bio.org/issues/3227 Author: Mark Diekhans Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Module LocationParser says: Code used for parsing GenBank/EMBL feature location strings (DEPRECATED) but it doesn't indicate what the replace is for this module. I am happy to make changes as biopython evolves, but some guidance as to how to change would be very helpful -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun May 8 23:34:45 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 8 May 2011 23:34:45 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] (In Progress) Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. 
Status changed from New to In Progress % Done changed from 0 to 20 Thanks for the merge, Peter: https://github.com/biopython/biopython/commit/f617101dfaf358d38e90ed778c98588ee7775c72 So building the Biopython API documentation with Epydoc now depends on docutils. Next step: grep each module for 'epytext' and port those that need it. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon May 9 16:17:53 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 May 2011 16:17:53 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Peter Cock. Eric, Do you have an HTML sample of the Bio.Phylo API docs from Sphinx? You could just email me a zip file if there isn't an easier way to show it. Alternatively, how would I use Sphinx to generate this myself? Thanks. Peter ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Mon May 9 16:37:50 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 9 May 2011 18:37:50 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Hey Peter, I've only had the chance to test this today. The parsing seems to be working just fine and the RAM consumption is stable at < 100 MB. I'll see the results tomorrow. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Fri, May 6, 2011 at 10:29 AM, Peter Cock wrote: > On Fri, May 6, 2011 at 9:24 AM, Jo?o Rodrigues wrote: > >> Memory bloat is bad - it sounds like a garbage collection problem. > >> Are you recreating the parser object each time? > > > > No. I'm just calling get_structure at each step of the for loop. It's a > bit > > irregular also, sometimes it drops from 1GB to 300MB, stays stable for a > > while and then spikes again. My guess is that all the data structures > > holding the parser structures consume quite a lot and probably there's no > > decent GC to clear the previous structure in time, so it accumulates. > > > > You could do an explicit clear once per PDB file to test this hypothesis: > > import gc > gc.collect() > > Peter > From redmine at redmine.open-bio.org Mon May 9 17:10:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 9 May 2011 17:10:01 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. Peter, I haven't tried using Sphinx on Bio.Phylo yet, actually. It seems to require writing a few "stub" files with commands for pulling in doctrings from the selected module... I'll tinker with it and maybe post a branch on Github if it goes well. -Eric ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue May 10 02:44:36 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 May 2011 02:44:36 +0000 Subject: [Biopython-dev] [Biopython - Feature #3219] (In Progress) Port Biopython documentation to Sphinx References: Message-ID: Issue #3219 has been updated by Eric Talevich. Status changed from New to In Progress % Done changed from 0 to 20 Here's a branch where I'm testing Sphinx: https://github.com/etal/biopython/tree/sphinx-demo There's not much there yet, so don't panic. For reference, DendroPy has a good example of Sphinx in action: https://github.com/jeetsukumaran/DendroPy/tree/master/doc/source/ ---------------------------------------- Feature #3219: Port Biopython documentation to Sphinx https://redmine.open-bio.org/issues/3219 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: Currently we use Epydoc for the API reference documentation, and LaTeX (to PDF via pdflatex, and HTML via hevea) for the tutorial. There's some material on the wiki to consider, too. A number of Python projects, including CPython, now use Sphinx for documentation. Content is written in reStructuredText format, and can be pulled from both standalone .rst files and Python docstrings. This offers several advantages: (i) API documentation will be prettier and easier to navigate; (ii) the Tutorial will be easier to edit for those not fluent in LaTeX; (iii) Since the API reference and Tutorial will be written in the same markup, potentially even pulling from some shared sources, it will be easier to address redundant or overlapping portions between the two, avoiding inconsistencies. See: http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ Mailing list discussion: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007977.html Numpy's approach: http://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue May 10 02:51:51 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 10 May 2011 02:51:51 +0000 Subject: [Biopython-dev] [Biopython - Feature #3220] Port Biopython docstrings to reStructuredText References: Message-ID: Issue #3220 has been updated by Eric Talevich. I posted about Sphinx on the parent issue. For this bug, I reckon the best approach is to convert the rest of the docstrings to reStructuredText, removing Epytext markup wherever we find it. 
Going further, we could try using "restructuredtext" instead of "plaintext" as the default format when running Epydoc, and fix any errors or abominations that appear. If we can get that to work, then we'll know it's all safe to pull into Sphinx with the automodule command. ---------------------------------------- Feature #3220: Port Biopython docstrings to reStructuredText https://redmine.open-bio.org/issues/3220 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The first part of the effort to port Biopython's documentation to Sphinx is to convert our API docs from Epytext to reStructuredText. Plain text will generally work. Epydoc already supports using reStructuredText as a markup language instead of the default Epytext, so this isn't as painful as it sounds. This can be done one module at a time, changing the format declaration at the top from:
__docformat__ = "epytext en"
to:
__docformat__ = "restructuredtext en"
And changing any Epytext markup in the docstrings to valid rST. Note that this adds the dependency of Docutils when generating API docs, in addition to the current dependency on Epydoc. Since documentation is normally built ahead of the time when packaging stable Biopython releases, this shouldn't be a problem for end users, and may be a small inconvenience for developers who want to work on the documentation. See: http://epydoc.sourceforge.net/manual-othermarkup.html http://docutils.sourceforge.net/rst.html -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu May 12 10:08:07 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 12 May 2011 10:08:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3229] (New) PDBParser fails when occupancy of atom is -1.0 Message-ID: Issue #3229 has been reported by Jo?o Rodrigues. ---------------------------------------- Bug #3229: PDBParser fails when occupancy of atom is -1.0 https://redmine.open-bio.org/issues/3229 Author: Jo?o Rodrigues Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: PDBID 3NH3 has occupancy values of -1.0 (seems to be an unique case in the PDB). ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu May 12 10:08:07 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 12 May 2011 10:08:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3229] (New) PDBParser fails when occupancy of atom is -1.0 Message-ID: Issue #3229 has been reported by Jo?o Rodrigues. ---------------------------------------- Bug #3229: PDBParser fails when occupancy of atom is -1.0 https://redmine.open-bio.org/issues/3229 Author: Jo?o Rodrigues Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: PDBID 3NH3 has occupancy values of -1.0 (seems to be an unique case in the PDB). -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Thu May 12 13:59:09 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 12 May 2011 15:59:09 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: First results: http://www.biopython.org/wiki/PDBParser Comments? From eric.talevich at gmail.com Fri May 13 02:26:42 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 12 May 2011 22:26:42 -0400 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: On Thu, May 12, 2011 at 9:59 AM, Jo?o Rodrigues wrote: > First results: http://www.biopython.org/wiki/PDBParser > > Comments? > Cool. So the atom_element additions did slow the parser down noticeably. The warnings may have caused some tiny slowdown, presumably when handling PDB files with inconsistencies, but I personally am not concerned about that. 
I think atom element assignment could be sped up in either of two ways: (a) Try to optimize Atom._assign_element for speed, somehow (b) Store only the atom field as a string during parsing. Change Atom.element and Atom.mass to be properties that parse the atom field to determine the element type on demand (i.e. self._get_element checks if self._element exists yet; if not, parse the string and set self._element; self._get_mass is basically identical to _assign_atom_mass). The lazy loading approach (b) would be faster if you're not using the element/mass values at all, but probably a little slower if you need those values from every atom in a structure. -E From updates at feedmyinbox.com Fri May 13 04:38:20 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 13 May 2011 00:38:20 -0400 Subject: [Biopython-dev] 5/13 active questions tagged biopython - Stack Overflow Message-ID: // renumber residues in a protein structure file (pdb) // May 12, 2011 at 3:54 PM http://stackoverflow.com/questions/5983689/renumber-residues-in-a-protein-structure-file-pdb Hi I am currently involved in making a website aimed at combining all papillomavirus information in a single place. As part of the effort we are curating all known files on public servers (e.g. genbank) One of the issues I ran into was that many (~50%) of all solved structures are not numbered according to the protein. I.e. a subdomain was crystallized (amino acid 310-450) however the crystallographer deposited this as residue 1-140. I was wondering whether anyone knows of a way to renumber the entire pdb file. I have found ways to renumber the sequence (identified by seqres), however this does not update the helix and sheet information. I would appreciate it if you had any suggestions? Thanks -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Fri May 13 06:35:27 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 13 May 2011 08:35:27 +0200 Subject: [Biopython-dev] Benchmarking PDBParser In-Reply-To: References: Message-ID: Assigning the element on demand would be too slow, specially when working with modelling structures or other element-less 'formats'. Id replace your option B for a function to assign elements that could be called once, at will, from any entity subclass. On the other hand, optimizing the process probably will help but not by much i would say. Does anyone have ideas on this? Maybe a dictionary with all possible options of atom fullnames? A third issue here is also the overhead that parsing the header brings. It completely kills performance.. There is a flag in the parser called get_header that is useless at the moment. A first step would be to make usable. At least we would have an option to skip the slow part. Perhaps then it would be nice to look at parse_pdb_header and see if we can optimize it. Im curious to see the performance of my branch there because i added more parsing options there too. 
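For reference, a minimal sketch of the lazy-loading idea in option (b); the class, helper name and mass table below are invented for illustration and are not the actual Bio.PDB code:

ATOM_MASSES = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "H": 1.008}

class LazyAtom(object):
    def __init__(self, fullname):
        self.fullname = fullname   # atom name field as read from the PDB line, e.g. " CA "
        self._element = None       # not worked out until first requested

    def _guess_element(self):
        # Crude guess: first alphabetic character of the stripped atom name.
        for char in self.fullname.strip():
            if char.isalpha():
                return char.upper()
        return ""

    @property
    def element(self):
        if self._element is None:            # parsed once, only on demand
            self._element = self._guess_element()
        return self._element

    @property
    def mass(self):
        return ATOM_MASSES.get(self.element, float("nan"))

atom = LazyAtom(" CA ")
print(atom.element)   # element only assigned at this point -> "C"
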
Cheers, Jo?o No dia 13 de Mai de 2011 04:27, "Eric Talevich" escreveu: > On Thu, May 12, 2011 at 9:59 AM, Jo?o Rodrigues wrote: > >> First results: http://www.biopython.org/wiki/PDBParser >> >> Comments? >> > > Cool. So the atom_element additions did slow the parser down noticeably. The > warnings may have caused some tiny slowdown, presumably when handling PDB > files with inconsistencies, but I personally am not concerned about that. > > I think atom element assignment could be sped up in either of two ways: > (a) Try to optimize Atom._assign_element for speed, somehow > (b) Store only the atom field as a string during parsing. Change > Atom.element and Atom.mass to be properties that parse the atom field to > determine the element type on demand (i.e. self._get_element checks if > self._element exists yet; if not, parse the string and set self._element; > self._get_mass is basically identical to _assign_atom_mass). > > The lazy loading approach (b) would be faster if you're not using the > element/mass values at all, but probably a little slower if you need those > values from every atom in a structure. > > -E From updates at feedmyinbox.com Fri May 13 08:31:03 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 13 May 2011 04:31:03 -0400 Subject: [Biopython-dev] 5/13 biopython Questions - BioStar Message-ID: // Bio.GenBank.LocationParserError // May 10, 2011 at 1:50 PM http://biostar.stackexchange.com/questions/8203/bio-genbank-locationparsererror Hi all, I'm scanning through all of GenBank's bacterial genomes using biopython. I've been getting an occasional error recently parsing location data. Specifically: File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/usr/lib/pymodules/python2.7/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table consumer.location(location_string) File "/usr/lib/pymodules/python2.7/Bio/GenBank/__init__.py", line 975, in location raise LocationParserError(location_line) Bio.GenBank.LocationParserError: order(join(649703..649712,649751..649752),650047..650049) My code is a simple loop through all filenames I feed in at the command line: [...] try: contig = SeqIO.parse(open(gb_file,"r"), "genbank") except: sys.stderr.write("ERROR: Parsing gbk file "+gb_file+"!\n") sys.exit(1) sys.stderr.write("Loading genome " + str(counter) + " of "+str(len(sys.argv)-1)+" ("+gb_file+")\n") for gb_record in contig: [...] This is in the Aeropyrum pernix K1 genome, NC_000854.gbk. I don't see anything wrong with the location data. Can anyone help? Thanks, -Morgan // making all protein sequence lengths same // May 4, 2011 at 3:31 AM http://biostar.stackexchange.com/questions/8033/making-all-protein-sequence-lengths-same Is there any code in perl / python to make all protein sequences of same length, otherwise my phylogenetic tool MEGA is not working on them ? 
-- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From redmine at redmine.open-bio.org Fri May 13 09:07:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 13 May 2011 09:07:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #3197] SeqIO parse error with some genbank files References: Message-ID: Issue #3197 has been updated by Peter Cock. Another example from http://biostar.stackexchange.com/questions/8203/bio-genbank-locationparsererror Aeropyrum pernix K1 genome, NC_000854.gbk ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Aeropyrum_pernix_K1_uid57757/NC_000854.gbk >>> from Bio import SeqIO >>> r = SeqIO.read("NC_000854.gbk", "gb") ... Bio.GenBank.LocationParserError: Combinations of "join" and "order" within the same location (nested operators) are illegal: order(join(649703..649712,649751..649752),650047..650049) I have reported this GenBank file to the NCBI via gb-admin at ncbi.nlm.nih.gov ---------------------------------------- Bug #3197: SeqIO parse error with some genbank files https://redmine.open-bio.org/issues/3197 Author: Cedar McKay Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.56 URL: I've found a file that seems to choke SeqIO genbank parsing. I downloaded this file straight from NCBI, so it should be a good file. I've found a couple of other files that do the same thing. I reproduced this bug on another machine, also with biopython 1.56. I am able to successfully parse other genbank files. Maybe it has something to do with that very long location? Please let me know if I can provide any other information! Thanks! 
Cedar >>> from Bio import SeqIO >>> record = SeqIO.read('./Acorus_americanus_NC_010093.gb', 'genbank') Traceback (most recent call last): File "", line 1, in File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 597, in read first = iterator.next() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table consumer.location(location_string) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 975, in location raise LocationParserError(location_line) Bio.GenBank.LocationParserError: order(join(42724..42726,43455..43457),43464..43469,43476..43481,43557..43562,43569..43574,43578..43583,43677..43682,44434..44439) -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From anaryin at gmail.com Fri May 13 19:35:00 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 13 May 2011 21:35:00 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Hello all, Not to let this die. I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the _handle_pdb_exception method and replace it by the warnings module. This was done in two commits in my branch: https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 Sorry to be insistent, but I don't really wish QUIET to live long if we have such an elegant alternative. Cheers, Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao 2011/5/6 Peter Cock > On Fri, May 6, 2011 at 5:23 PM, Eric Talevich > wrote: > > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues > wrote: > > > >> Hello all, > >> > >> The PDBParser is sometimes a bit too loud, making meaningful output > drown > >> in > >> dozens of warnings messages. This is partly (mostly) my fault because of > >> the > >> element guessing addition. Therefore, I'd suggest adding a QUIET > argument > >> (bool) to PDBParser that would supress all warnings. Of course, default > is > >> False. It might come handy for batch processing of proteins. > >> > >> I've added it to my pdb_enhancements branch so you can take a look: > >> > >> > >> > https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 > >> > >> > > Since the PERMISSIVE argument is already an integer, could we consolidate > > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? 
> > > > I'm OK with that, > > Peter > From eric.talevich at gmail.com Fri May 13 19:46:01 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 May 2011 15:46:01 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Looks good to me. I can't guarantee I'll be able to merge this right away since I'm going to be traveling for the next week. Anyone else want to try it? -Eric On 5/13/11, Jo?o Rodrigues wrote: > Hello all, > > Not to let this die. > > I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the > _handle_pdb_exception method and replace it by the warnings module. > > This was done in two commits in my branch: > > https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 > https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 > > > Sorry to be insistent, but I don't really wish QUIET to live long if we have > such an elegant alternative. > > Cheers, > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > 2011/5/6 Peter Cock > >> On Fri, May 6, 2011 at 5:23 PM, Eric Talevich >> wrote: >> > On Fri, May 6, 2011 at 3:54 AM, Jo?o Rodrigues >> wrote: >> > >> >> Hello all, >> >> >> >> The PDBParser is sometimes a bit too loud, making meaningful output >> drown >> >> in >> >> dozens of warnings messages. This is partly (mostly) my fault because >> >> of >> >> the >> >> element guessing addition. Therefore, I'd suggest adding a QUIET >> argument >> >> (bool) to PDBParser that would supress all warnings. Of course, default >> is >> >> False. It might come handy for batch processing of proteins. >> >> >> >> I've added it to my pdb_enhancements branch so you can take a look: >> >> >> >> >> >> >> https://github.com/JoaoRodrigues/biopython/commit/5405d8a4cc555bcfce6ad0915db62a131cee9493 >> >> >> >> >> > Since the PERMISSIVE argument is already an integer, could we >> > consolidate >> > these by letting (PERMISSIVE=2) behave as (PERMISSIVE=1, QUIET=1) ? >> > >> >> I'm OK with that, >> >> Peter >> > From andrew.sczesnak at med.nyu.edu Fri May 13 21:26:58 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Fri, 13 May 2011 17:26:58 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer Message-ID: <4DCDA222.9050807@med.nyu.edu> Hi All, I'd like to contribute MAF parser/writer classes to Bio.AlignIO. MAF is an alignment format used for whole genome alignments, as in the 30-way (or more) multiz alignments at UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ A description of the format is available here: http://genome.ucsc.edu/FAQ/FAQformat#format5 The value of this format to most users will come from the ability to extract sequences from an arbitrary number of species that align to a particular sequence range in a particular genome, at random. We should be able to say, report the alignment of 50 genomes to the human HOX locus fairly quickly (say <1s). An iterator and writer class will certainly be useful, but to implement the aforementioned functionality, some API changes are probably necessary. I think the most straightforward way of accomplishing this is to add an additional, searchable SQLite table to SeqIO's index_db(). The present table, offset_data, translates a unique sequence identifier to the file offset and is more suited to multifasta or other sequence files. 
Another table might store chromosome, start, and end positions to allow a set of alignment records falling within a particular sequence range on a chromosome to be extracted with an SQL query (obscured from the user). This table would remain empty in formats where no search functionality is implemented. Also necessary, a search() function on top of the index_db() UserDict, accessible as in: from AlignIO.MafIO import MafIndexer indexer = MafIndexer("mm9") index = SeqIO.index_db (index_file, maf_file, "maf", \ key_function = MafIndexer.index) for i in index.search ("chr5", 5000, 10000): print i where the output is a series of MultipleSeqAlignment objects with sequences falling within the searched range. When used with other formats, the function could perform a quick "key LIKE '%key%'" SQL query to retrieve multiple records with similar names. As a note, the MafIndexer callback function above is necessary to choose which species in the alignment the index is generated for. Some quick code implementing these additions loads the index of a 3.6GB MAF file in ~500ms and retrieves a 40kb alignment in about 1.6s, leaving some room for optimization. Does anyone have any thoughts on how index_db() should be developed, and if these changes ought to be implemented in SeqIO or an AlignIO index API be created? Thanks, -- Andrew Sczesnak Bioinformatician, Littman Lab Howard Hughes Medical Institute New York University School of Medicine 540 First Avenue New York, NY 10016 p: (212) 263-6921 f: (212) 263-1498 e: andrew.sczesnak at med.nyu.edu From p.j.a.cock at googlemail.com Fri May 13 22:27:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 May 2011 23:27:52 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 13, 2011 at 8:46 PM, Eric Talevich wrote: > Looks good to me. I can't guarantee I'll be able to merge this right > away since I'm going to be traveling for the next week. Anyone else > want to try it? > -Eric If get time this weekend, I'll look at it. After all, I did apply the quiet change to the trunk... Peter From p.j.a.cock at googlemail.com Fri May 13 22:30:34 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 May 2011 23:30:34 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > alignment format used for whole genome alignments, as in the 30-way (or > more) multiz alignments at UCSC: > > http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ > > A description of the format is available here: > > http://genome.ucsc.edu/FAQ/FAQformat#format5 > I've spoken to Andrew briefly before this, and I'm keen to get the core functionality of parsing and writing MAF alignments added to AlignIO. His other ideas for indexing these alignments are much more interesting - and part of a more general topic related to things like Ace alignments, or SAM/BAM alignments. Ideally we can come up with something that will work for more than just MAF alignments. 
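As a rough sketch of the coordinate table idea using plain sqlite3 (the table layout, column names and offsets are invented for illustration and are not the actual SeqIO.index_db() schema):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE maf_blocks (chrom TEXT, start INTEGER, stop INTEGER, file_offset INTEGER)")
con.execute("CREATE INDEX idx_range ON maf_blocks (chrom, start, stop)")

# In real use these rows would be written while scanning the MAF file once,
# recording where each alignment block sits on the chosen reference genome.
blocks = [("chr5", 4800, 5200, 1024),
          ("chr5", 9000, 12000, 20480),
          ("chr7", 100, 900, 40960)]
con.executemany("INSERT INTO maf_blocks VALUES (?, ?, ?, ?)", blocks)

def search(chrom, start, stop):
    # A block overlaps the query if it starts before the query ends
    # and ends after the query starts.
    cursor = con.execute(
        "SELECT file_offset FROM maf_blocks WHERE chrom=? AND start<? AND stop>?",
        (chrom, stop, start))
    for (file_offset,) in cursor:
        yield file_offset   # the caller would seek() here and parse one block

print(list(search("chr5", 5000, 10000)))   # -> [1024, 20480]
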
Peter From p.j.a.cock at googlemail.com Sat May 14 11:30:07 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 May 2011 12:30:07 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: Hi Andrews, I've had a look at those example files you linked to now. On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > The value of this format to most users will come from the ability to extract > sequences from an arbitrary number of species that align to a particular > sequence range in a particular genome, at random. ?We should be able to > say, report the alignment of 50 genomes to the human HOX locus fairly > quickly (say <1s). ?An iterator and writer class will certainly be useful, > but to implement the aforementioned functionality, some API changes are > probably necessary. I had previously considered a cross-format Bio.AlignIO index on alignment number (i.e. 0, 1, 2, ... n-1 if the file contains n alignments). That would work on PHYLIP, Stockholm, Clustalw, etc, even FASTA if your alignment all have the same number of entries. It could also be used with MAF. However, I don't think it is useful. Of the current file formats supported in AlignIO, in my experience only PHYLIP files regularly contain more than one alignment, and since these are used for bootstrapping random access is not required (iteration is enough). And presumably for MAF, there is no reason to want to access the alignments by this index number either. With something like SAM/BAM (or other assembly formats like ACE or the MIRA alignment format also called MAF), you can have multiple alignments (the contigs or chromosomes) each with many entries (supporting reads). Here there is a clear single reference coordinate system, that of the (gapped) reference contigs/chromosomes. This also means each alignment has a clear name (the name of the reference contig/chromosome), so this name and coordinates can be used for indexing (as in samtools). With MAF however, things are not so easy - any of the sequences could be used as a reference (e.g. human chr 1, or mouse chr 2), and any region of a sequence might be in more than one alignment. I'm beginning to suspect what Andrew has in mind is going to be MAF specific - so it won't be top level functionality in Bio.AlignIO, but rather tucked away in Bio.AlignIO.MafIO instead. Peter From anaryin at gmail.com Sat May 14 12:59:38 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Sat, 14 May 2011 14:59:38 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Thanks and sorry for the double commit! No dia 14 de Mai de 2011 00:27, "Peter Cock" escreveu: > On Fri, May 13, 2011 at 8:46 PM, Eric Talevich wrote: >> Looks good to me. I can't guarantee I'll be able to merge this right >> away since I'm going to be traveling for the next week. Anyone else >> want to try it? >> -Eric > > If get time this weekend, I'll look at it. After all, I did apply > the quiet change to the trunk... 
> > Peter From pgarland at gmail.com Sun May 15 01:13:28 2011 From: pgarland at gmail.com (Phillip Garland) Date: Sat, 14 May 2011 18:13:28 -0700 Subject: [Biopython-dev] GEO SOFT parser Message-ID: Hello, I've created a new parser for GEO SOFT files- a fairly simple line-orientated format used by NCBI's Gene Expression Omnibus for holding gene expression data, information about the experimental platform used to generate the data, and associated metadata. At the moment if parses platform (GPL), series (GSE), sample (GSM), and dataset (GDS) files into objects, with access to the metadata, and data table entries. It's accessible through my github biopython repo: https://github.com/pgarland/biopython git://github.com/pgarland/biopython.git Branch: new-geo-soft-parser All the changed files are in the Bio/Geo directory. The existing parser has the virtue of being simple and short. The parser I've written is less parsimonious, but should handle everything specified by NCBI, as well as some unspecified quirks, and documents what GEO SOFT files are expected to contain. I'm taking a look at Sean Davis's GEOquery Bioconductor package for ideas for the interface. There is a class for each GEO record type: GSM, GPL, GSE, and GDS. After instantiating each of these, you can call the parse method on the resulting object to parse the file, e.g.: >>> from Bio import Geo >>> gds858 = Geo.GDS() >>> gds858.parse('GDS858_full.soft') Each object has a dictionary named 'meta' that contains the file's metadata: >>> gds858.meta['channel_count'] 1 Each attribute has a hook to hang a function to perform additional parsing of a value, but most values are stored as strings. There is also a parseMeta() method if you just need the file's metadata (the entity attributes and data table column descriptions) and not the data table. There is also a rudimentary __str__ method to print the metadata. For files that can have data tables (GSM, GPL, and GDS files), there is currently just one method for accessing values: getTableValue() that takes an ID and a column name and returns the associated value: >>> gds858.getTableValue(1007_s_at, 'GSM14498') 3736.9000000000001 but I will implement other methods to provide more convenient access to the data table. Right now, the data table is just an 2D array and can be accessed like any 2D array: gds858.table[0][2] '3736.900' There are dictionaries for converting between IDs and column names and rows and columns: >>> gds858.idDict['1007_s_at'] 0 >>> gds858.columnDict['GSM14498'] 2 It is possible that the underlying representation of the data table could change though. On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive, parsing single files is more than fast enough, but I haven't benchmarked it or looked at RAM consumption. If it's a problem for computers with less RAM or use cases that require having a lot of GEO SOFT objects in memory, I can take a look at changing the data table representation. If this parser is incorporated in BioPython, I'm happy to maintain it. The code is well-commented, but I still need to write the documentation. I've tested it on a few files of each type, but I still need to write unit tests. Since SOFT files can be fairly large- a few MB gzipped, 10's of MB unzipped, it seems undesirable to package them with the biopython source code. I could make the unit test optional and have interested users supply their own files and/or have the test download files from NCBI and unzip them. 
~ Phillip From updates at feedmyinbox.com Sun May 15 04:38:06 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sun, 15 May 2011 00:38:06 -0400 Subject: [Biopython-dev] 5/15 active questions tagged biopython - Stack Overflow Message-ID: <7ba32bcb32923a5ff3d48ac9122b3bed@74.63.51.88> // Receive DNA-Sequence by range in biopython // May 14, 2011 at 5:33 PM http://stackoverflow.com/questions/6004926/receive-dna-sequence-by-range-in-biopython Hi, i need to use a protein-prediction tool called Mutation-Taster (http://www.mutationtaster.org/). Since the input format for the batch query needs a piece of sequence surrounding the mutation, and i have only the position of the mutation within a chromosome, i need the surrounding pieces. So far i am using biopython and i tried to find a way to receive the DNA-Sequence from the NCBI Entrez databases. I want assign the chromosome number, nucleic start and end position within the chromosome to receive the dna-sequence for example in fasta format. I would not mind if it is possible in another programming language. Thanks in advance for your help -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From p.j.a.cock at googlemail.com Sun May 15 14:40:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 15 May 2011 15:40:24 +0100 Subject: [Biopython-dev] GEO SOFT parser In-Reply-To: References: Message-ID: On Sun, May 15, 2011 at 2:13 AM, Phillip Garland wrote: > Hello, > > I've created a new parser for GEO SOFT files- a fairly simple > line-orientated format used by NCBI's Gene Expression Omnibus for > holding gene expression data, information about the experimental > platform used to generate the data, and associated metadata. At the > moment if parses platform (GPL), series (GSE), sample (GSM), and > dataset (GDS) files into objects, with access to the metadata, and > data table entries. > > It's accessible through my github biopython repo: > https://github.com/pgarland/biopython > git://github.com/pgarland/biopython.git > > Branch: > new-geo-soft-parser > > All the changed files are in the Bio/Geo directory. > > The existing parser has the virtue of being simple and short. The > parser I've written is less parsimonious, but should handle everything > specified by NCBI, as well as some unspecified quirks, and documents > what GEO SOFT files are expected to contain. That sounds good, the current GEO parser was very minimal. > I'm taking a look at Sean > Davis's GEOquery Bioconductor package for ideas for the interface. Great - I would have encouraged you to look at Sean's R interface for ideas. https://github.com/biopython/biopython/tree/master/Tests/Geo > There is a class for each GEO record type: GSM, GPL, GSE, and GDS. > After instantiating each of these, you can call the parse method on > the resulting object to parse the file, e.g.: > >>>> from Bio import Geo >>>> gds858 = Geo.GDS() >>>> gds858.parse('GDS858_full.soft') We may want to use read rather than parse for consistency with the other newish parsers in Biopython, where parse gives an iterator while read gives a single object. 
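For context, this is the existing Bio.SeqIO convention being referred to (the filenames below are placeholders):

from Bio import SeqIO

# parse() gives an iterator over zero or more records...
for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id, len(record))

# ...while read() expects exactly one record and returns it directly.
record = SeqIO.read("single.gbk", "genbank")
print(record.description)
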
> > Each object has a dictionary named 'meta' that contains the file's metadata: > >>>> gds858.meta['channel_count'] > 1 > > Each attribute has a hook to hang a function to perform additional > parsing of a value, but most values are stored as strings. > > There is also a parseMeta() method if you just need the file's > metadata (the entity attributes and data table column descriptions) > and not the data table. > > There is also a rudimentary __str__ method to print the metadata. > > For files that can have data tables (GSM, GPL, and GDS files), there > is currently just one method for accessing values: getTableValue() > that takes an ID and a column name and returns the associated value: > >>>> gds858.getTableValue(1007_s_at, 'GSM14498') > 3736.9000000000001 > > but I will implement other methods to provide more convenient access > to the data table. > > Right now, the data table is just an 2D array and can be accessed like > any 2D array: > > gds858.table[0][2] > '3736.900' > > There are dictionaries for converting between IDs and column names and > rows and columns: > >>>> gds858.idDict['1007_s_at'] > 0 > >>>> gds858.columnDict['GSM14498'] > 2 > > It is possible that the underlying representation of the data table > could change though. One possibility is a full load versus iterate over the rows approach. The later would be useful if you only wanted some of the data (e.g. particular genes), and didn't have enough RAM to load it all in full. > On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive, > parsing single files is more than fast enough, but I haven't > benchmarked it or looked at RAM consumption. If it's a problem for > computers with less RAM or use cases that require having a lot of GEO > SOFT objects in memory, I can take a look at changing the data table > representation. > > If this parser is incorporated in BioPython, I'm happy to maintain it. Excellent :) > The code is well-commented, but I still need to write the > documentation. I've tested it on a few files of each type, but I still > need to write unit tests. Since SOFT files can be fairly large- ?a few > MB gzipped, 10's of MB unzipped, it seems undesirable to package them > with the biopython source code. We have a selection of small samples already in the repository under Tests/GEO - so at very least you can write unit tests using them. Also, for an online tests, it would be nice to try Entrez with the new GEO parser (IIRC, our old parser didn't work nicely with some of the live data). > I could make the unit test optional > and have interested users supply their own files and/or have the test > download files from NCBI and unzip them. We've touched on the need for "big data" tests which would be more targeted at Biopython developers than end users, but not addressed any framework for this. e.g. SeqIO indexing of large sequence files. Peter From chapmanb at 50mail.com Sun May 15 15:39:59 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 15 May 2011 11:39:59 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <20110515153959.GC2530@kunkel> Andrew and Peter; Thanks for working on MAF parsing and interval access in general. A few thoughts below: > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > > alignment format used for whole genome alignments, as in the 30-way (or > > more) multiz alignments at UCSC: [...] 
> > The value of this format to most users will come from the ability to > > extract sequences from an arbitrary number of species that align to > > a particular sequence range in a particular genome, at random. We > I've spoken to Andrew briefly before this, and I'm keen to get > the core functionality of parsing and writing MAF alignments > added to AlignIO. His other ideas for indexing these alignments > are much more interesting - and part of a more general topic > related to things like Ace alignments, or SAM/BAM alignments. We may want to take a look at the interval access functionality in bx-python and MAF parsing tied in with this: https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/maf.py Here is a worked example: http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/ It would be useful to have an API that queries across bx-python intervals, BAM intervals and other formats. Brad From Andrew.Sczesnak at med.nyu.edu Sun May 15 19:59:02 2011 From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew) Date: Sun, 15 May 2011 15:59:02 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu>, Message-ID: > With something like SAM/BAM (or other assembly formats like ACE or the > MIRA alignment format also called MAF), you can have multiple > alignments (the contigs or chromosomes) each with many entries > (supporting reads). Here there is a clear single reference coordinate > system, that of the (gapped) reference contigs/chromosomes. This also > means each alignment has a clear name (the name of the reference > contig/chromosome), so this name and coordinates can be used for > indexing (as in samtools). > > With MAF however, things are not so easy - any of the sequences could > be used as a reference (e.g. human chr 1, or mouse chr 2), and any > region of a sequence might be in more than one alignment. > > I'm beginning to suspect what Andrew has in mind is going to be MAF > specific - so it won't be top level functionality in Bio.AlignIO, but > rather tucked away in Bio.AlignIO.MafIO instead. > > Peter I agree, the fact that this particular format does not explicitly define the reference sequence is problematic. Based on the spec, we ought to be prepared for a multiz MAF file with several different reference sequences. However, practically speaking, the files out there in the world _do_ have a reference sequence, which appears in all alignments and is the first listed sequence. While I think there is definitely some trickyness to how this parser will have to interact with any API, my feeling is that these portions ought to be confined to MafIO, while a more general API lives in AlignIO or elsewhere. This isn't much different from a format like SFF, I think. Andrew ------------------------------------------------------------ This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain information that is proprietary, confidential, and exempt from disclosure under applicable law. Any unauthorized review, use, disclosure, or distribution is prohibited. If you have received this email in error please notify the sender by return email and delete the original message. Please note, the recipient should check this email and any attachments for the presence of viruses. The organization accepts no liability for any damage caused by any virus transmitted by this email. 
================================= From Andrew.Sczesnak at med.nyu.edu Sun May 15 20:14:42 2011 From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew) Date: Sun, 15 May 2011 16:14:42 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <20110515153959.GC2530@kunkel> References: <4DCDA222.9050807@med.nyu.edu> , <20110515153959.GC2530@kunkel> Message-ID: Hi Brad, > We may want to take a look at the interval access functionality in > bx-python and MAF parsing tied in with this: > > https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py > https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/maf.py The interval indexing scheme in bx-python seems really nice. By dropping intervals into bins, a la UCSC MySQL tables, and using a compact file format instead of SQLite, I'm sure it's quite fast. > It would be useful to have an API that queries across bx-python intervals, > BAM intervals and other formats. I agree, I think it would be great if we could implement some sort of API for indexing and accessing intervals in SAM/BAM, MAF, ACE, and really, any format that can be made to report an offset and set of interval coordinates. Even a multifasta can have interval information in the header that a user could extract and pass to the indexer with a callback function. Gene annotation files, like GFF, have this information too. What would make the most sense here? Would a more general interval indexing and searching module be too much? I feel like a task I'm always performing is searching various files by chromosome, start, and stop. Example: A BED file of ChIP-Seq peaks called by MACS--are there any peaks overlapping gene X? Example: How many alignments are there in an RNA-Seq BAM file that overlap rRNA and tRNA annotations in a GFF file, presumably from contaminating RNA? Andrew ------------------------------------------------------------ This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain information that is proprietary, confidential, and exempt from disclosure under applicable law. Any unauthorized review, use, disclosure, or distribution is prohibited. If you have received this email in error please notify the sender by return email and delete the original message. Please note, the recipient should check this email and any attachments for the presence of viruses. The organization accepts no liability for any damage caused by any virus transmitted by this email. ================================= From p.j.a.cock at googlemail.com Sun May 15 20:24:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 15 May 2011 21:24:21 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Sun, May 15, 2011 at 8:59 PM, Sczesnak, Andrew wrote: >> With something like SAM/BAM (or other assembly formats like ACE or the >> MIRA alignment format also called MAF), you can have multiple >> alignments (the contigs or chromosomes) each with many entries >> (supporting reads). Here there is a clear single reference coordinate >> system, that of the (gapped) reference contigs/chromosomes. This also >> means each alignment has a clear name (the name of the reference >> contig/chromosome), so this name and coordinates can be used for >> indexing (as in samtools). >> >> With MAF however, things are not so easy - any of the sequences could >> be used as a reference (e.g. 
human chr 1, or mouse chr 2), and any >> region of a sequence might be in more than one alignment. >> >> I'm beginning to suspect what Andrew has in mind is going to be MAF >> specific - so it won't be top level functionality in Bio.AlignIO, but >> rather tucked away in Bio.AlignIO.MafIO instead. >> >> Peter > > I agree, the fact that this particular format does not explicitly define the > reference sequence is problematic. ?Based on the spec, we ought to be > prepared for a multiz MAF file with several different reference sequences. > However, practically speaking, the files out there in the world _do_ have a > reference sequence, which appears in all alignments and is the first listed > sequence. That may be a very useful simplifying assumption. Would you expect each position on the reference to appear in one and only one alignment block in the MAF file? Or, might a given region appear in multiple blocks? > While I think there is definitely some trickyness to how this > parser will have to interact with any API, my feeling is that these portions > ought to be confined to MafIO, while a more general API lives in AlignIO or > elsewhere. > >?This isn't much different from a format like SFF, I think. > What did you mean here? SFF is just another sequence file format as far as Bio.SeqIO goes, other than being binary it isn't exceptional. Peter From p.j.a.cock at googlemail.com Mon May 16 11:14:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 12:14:05 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DCDA222.9050807@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak wrote: > Hi All, > > I'd like to contribute MAF parser/writer classes to Bio.AlignIO. ?MAF is an > alignment format used for whole genome alignments, as in the 30-way (or > more) multiz alignments at UCSC: > > http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/ > > A description of the format is available here: > > http://genome.ucsc.edu/FAQ/FAQformat#format5 > I started work on merging the basic parser/writer into Biopython on this new branch, https://github.com/peterjc/biopython/tree/alignio-maf As I think I mentioned by email before, there were some PEP8 formatting changes (removing spaces before brackets). Another little thing rather than MultipleSeqAlignment(alphabet) you should use MultipleSeqAlignment([], alphabet) to create an empty alignment. The former works with a deprecation warning to help transition from the old alignment object. Note that by hooking up "maf" in AlignIO as an output format, it will get exercised by some of the unit tests, in particular test_AlignIO.py - and that showed some problems. On a functional level your code was not preserving the order of the records within each alignment. By using a dictionary the order becomes Python implementation specific, meaning it cannot be assumed in unit tests (i.e. C Python vs Jython vs IronPython vs PyPy could all store dictionary elements in a different order). Also it was also breaking test_AlignIO.py, so I changed that. Do you think we should follow the speciesOrder directive if present? Note that right now, test_AlignIO.py is still not passing (which is a major reason why I haven't merged this to the trunk). Currently the issue is to do with how you are parsing species names, assuming database.chromosome is not possible in general. 
Also I think we may need to do something rigorous with start/end co-ordinates and strand in either the Seq or SeqRecord object. They could be updated automatically during slicing and taking reverse complement... they might not survive addition though. Peter From p.j.a.cock at googlemail.com Mon May 16 13:53:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 14:53:32 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 12:14 PM, Peter Cock wrote: > > I started work on merging the basic parser/writer into Biopython > on this new branch, > > https://github.com/peterjc/biopython/tree/alignio-maf > > As I think I mentioned by email before, there were some PEP8 > formatting changes (removing spaces before brackets). > > ... > > Note that right now, test_AlignIO.py is still not passing (which > is a major reason why I haven't merged this to the trunk). > Currently the issue is to do with how you are parsing species > names, assuming database.chromosome is not possible in general. I've changed it to preserve the identifier as is for the SeqRecord id field, got all the test suite passing, and added a couple of small MAF files from the BioPerl test suite (which highlighted some more issues). Do you think it makes sense to automatically promote any dots (periods) in the sequence to the letter of that position in the first sequence? This is something I'd been thinking we should do in the PHYLIP parser as well. See the MAF/humor.maf example. Peter From andrew.sczesnak at med.nyu.edu Mon May 16 17:03:39 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 13:03:39 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <4DD158EB.4080709@med.nyu.edu> On 05/16/2011 09:53 AM, Peter Cock wrote: > On Mon, May 16, 2011 at 12:14 PM, Peter Cock wrote: > > Do you think it makes sense to automatically promote any dots > (periods) in the sequence to the letter of that position in the first > sequence? This is something I'd been thinking we should do in > the PHYLIP parser as well. See the MAF/humor.maf example. > > Peter Yeah, that sounds right to me. The issue again is going to be the lack of an explicitly defined reference sequence. Are we going to make the assumption that the sequence appearing first in an alignment bundle is the reference? Andrew From andrew.sczesnak at med.nyu.edu Mon May 16 17:26:46 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 13:26:46 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> Message-ID: <4DD15E56.60201@med.nyu.edu> On 05/16/2011 07:14 AM, Peter Cock wrote: > Do you think we should follow the speciesOrder directive if > present? Yeah, why not. I started working on this and the problem was, as defined in the spec, the species is just "hg19" or "mm9," yet the records are in species.chromosome format. Should we enforce that the species in a speciesOrder directive must exactly match a sequence identifier, or add a split and do some checks to make sure a record matches only one species in speciesOrder? > Also I think we may need to do something rigorous with start/end > co-ordinates and strand in either the Seq or SeqRecord object. > They could be updated automatically during slicing and taking > reverse complement... they might not survive addition though. 
This is interesting. I wonder if it makes sense to preserve this information if a SeqRecord is going to be maniuplated outside a MultipleSeqAlignment object. Could this be accomplished by migrating the annotation information to a SeqFeature? Andrew From p.j.a.cock at googlemail.com Mon May 16 17:54:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 18:54:24 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD158EB.4080709@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:03 PM, Andrew Sczesnak wrote: > On 05/16/2011 09:53 AM, Peter Cock wrote: >> >> Do you think it makes sense to automatically promote any dots >> (periods) in the sequence to the letter of that position in the first >> sequence? This is something I'd been thinking we should do in >> the PHYLIP parser as well. See the MAF/humor.maf example. >> >> Peter > > Yeah, that sounds right to me. ?The issue again is going to be the lack of > an explicitly defined reference sequence. ?Are we going to make the > assumption that the sequence appearing first in an alignment bundle > is the reference? That is my assumption for how dots have been used in alignment formats. If you have some MAF examples using dots, that would be great. Regarding PHYLIP, I looked at this and dots/periods have been explicitly forbidden since the very earliest versions of PHYLIP, so I've made them raise an error instead: https://github.com/biopython/biopython/commit/b41975bb8363171add80d19903861f3d8cffe405 Peter From p.j.a.cock at googlemail.com Mon May 16 17:58:23 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 18:58:23 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD15E56.60201@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:26 PM, Andrew Sczesnak wrote: > On 05/16/2011 07:14 AM, Peter Cock wrote: >> >> Do you think we should follow the speciesOrder directive if >> present? > > Yeah, why not. ?I started working on this and the problem was, as defined in > the spec, the species is just "hg19" or "mm9," yet the records are in > species.chromosome format. ?Should we enforce that the species in a > speciesOrder directive must exactly match a sequence identifier, or add a > split and do some checks to make sure a record matches only one species in > speciesOrder? That is a subtlety I missed - maybe it is simpler to ignore speciesOrder after all. I presume it is intended a graphical output directive really. >> Also I think we may need to do something rigorous with start/end >> co-ordinates and strand in either the Seq or SeqRecord object. >> They could be updated automatically during slicing and taking >> reverse complement... they might not survive addition though. > > This is interesting. ?I wonder if it makes sense to preserve this > information if a SeqRecord is going to be maniuplated outside a > MultipleSeqAlignment object. ?Could this be accomplished by > migrating the annotation information to a SeqFeature? I'm not sure how using a SeqFeature would work here. Also consider that someone might manipulate the alignment directly, e.g. alignment[:,10:60] to pull out fifty columns. That seems like a use case where the start/end co-ordinates should be updated nicely. Note that internally this calls record[10:60] for each row of the alignment, so using SeqRecord objects. 
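To make that concrete, this rough, untested sketch is the sort of bookkeeping a caller has to do by hand at the moment (it assumes the row annotations carry MAF style "start" and "size" keys, and it ignores gaps and strand entirely):

    def take_columns(alignment, start_col, end_col):
        """Slice out columns and shift each row's start (illustrative only).

        Assumes the sliced region is ungapped and every row is on the
        plus strand - exactly the bookkeeping it would be nicer to have
        the SeqRecord/alignment objects do for us automatically.
        """
        sub = alignment[:, start_col:end_col]
        for old, new in zip(alignment, sub):
            if "start" in old.annotations:
                new.annotations["start"] = old.annotations["start"] + start_col
                new.annotations["size"] = end_col - start_col
        return sub

If the objects tracked this themselves during slicing, none of that would be needed.
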
Peter From p.j.a.cock at googlemail.com Mon May 16 18:22:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 19:22:05 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 6:54 PM, Peter Cock wrote: > On Mon, May 16, 2011 at 6:03 PM, Andrew Sczesnak wrote: >> On 05/16/2011 09:53 AM, Peter Cock wrote: >>> >>> Do you think it makes sense to automatically promote any dots >>> (periods) in the sequence to the letter of that position in the first >>> sequence? This is something I'd been thinking we should do in >>> the PHYLIP parser as well. See the MAF/humor.maf example. >>> >>> Peter >> >> Yeah, that sounds right to me. ?The issue again is going to be the lack of >> an explicitly defined reference sequence. ?Are we going to make the >> assumption that the sequence appearing first in an alignment bundle >> is the reference? > > That is my assumption for how dots have been used in alignment > formats. Done on my branch: https://github.com/peterjc/biopython/commit/746d0c30b85753bb40c140b2b964e3256259414b > > If you have some MAF examples using dots, that would be great. > You'll see I have one example (from BioPerl's unit tests), but more would still be appreciated. Peter From andrew.sczesnak at med.nyu.edu Mon May 16 20:30:23 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 16:30:23 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> Message-ID: <4DD1895F.3050303@med.nyu.edu> On 05/16/2011 01:54 PM, Peter Cock wrote: > That is my assumption for how dots have been used in alignment > formats. > > If you have some MAF examples using dots, that would be great. I added a snippet of mouse chromosome 10 from UCSC, but it doesn't have dots. I've actually never come across one with dots. Added support for a 'track' line at the beginning of a file as well, among some other small changes. https://github.com/polyatail/biopython/commits/alignio-maf Andrew From p.j.a.cock at googlemail.com Mon May 16 20:45:38 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 21:45:38 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD1895F.3050303@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD158EB.4080709@med.nyu.edu> <4DD1895F.3050303@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 9:30 PM, Andrew Sczesnak wrote: > On 05/16/2011 01:54 PM, Peter Cock wrote: >> >> That is my assumption for how dots have been used in alignment >> formats. >> >> If you have some MAF examples using dots, that would be great. > > I added a snippet of mouse chromosome 10 from UCSC, but it doesn't have > dots. ?I've actually never come across one with dots. > > Added support for a 'track' line at the beginning of a file as well, among > some other small changes. > > https://github.com/polyatail/biopython/commits/alignio-maf > Generally I'm happy, although after editing the BioPerl unit test, perhaps we should rename it? And did you mean to alter the newline at the end of the file? https://github.com/polyatail/biopython/commit/d423d423cc87efeb8a27a9332927e42d1beacdf2 Also, could you rewrite this to avoid the use of handle.tell? Not all handle objects support that (right?), and we shouldn't need it. 
https://github.com/polyatail/biopython/commit/111cf69d7e435203a781f05f9f317bc9ced03560 Peter From andrew.sczesnak at med.nyu.edu Mon May 16 20:33:53 2011 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Mon, 16 May 2011 16:33:53 -0400 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> Message-ID: <4DD18A31.6030804@med.nyu.edu> On 05/16/2011 01:58 PM, Peter Cock wrote: > That is a subtlety I missed - maybe it is simpler to ignore speciesOrder > after all. I presume it is intended a graphical output directive really. Fine by me. If need be we can add this later. >> This is interesting. I wonder if it makes sense to preserve this >> information if a SeqRecord is going to be maniuplated outside a >> MultipleSeqAlignment object. Could this be accomplished by >> migrating the annotation information to a SeqFeature? > > I'm not sure how using a SeqFeature would work here. Hmm, well, strand is manipulated in a SeqFeature when .reverse_complement() is run, right? I thought that might take care of that. Though truthfully I haven't looked too much at that code. > Also consider that someone might manipulate the alignment > directly, e.g. alignment[:,10:60] to pull out fifty columns. That > seems like a use case where the start/end co-ordinates should > be updated nicely. Note that internally this calls record[10:60] > for each row of the alignment, so using SeqRecord objects. That's true. Is there a more general way to implement this? By dragging the coordinate information out of .annotations and into fields that aren't MAF-specific or something. Andrew From p.j.a.cock at googlemail.com Mon May 16 20:53:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 May 2011 21:53:55 +0100 Subject: [Biopython-dev] MAF Parser/Writer/Indexer In-Reply-To: <4DD18A31.6030804@med.nyu.edu> References: <4DCDA222.9050807@med.nyu.edu> <4DD15E56.60201@med.nyu.edu> <4DD18A31.6030804@med.nyu.edu> Message-ID: On Mon, May 16, 2011 at 9:33 PM, Andrew Sczesnak wrote: > On 05/16/2011 01:58 PM, Peter Cock wrote: >> >> That is a subtlety I missed - maybe it is simpler to ignore speciesOrder >> after all. I presume it is intended a graphical output directive really. > > Fine by me. ?If need be we can add this later. > >>> This is interesting. ?I wonder if it makes sense to preserve this >>> information if a SeqRecord is going to be maniuplated outside a >>> MultipleSeqAlignment object. ?Could this be accomplished by >>> migrating the annotation information to a SeqFeature? >> >> I'm not sure how using a SeqFeature would work here. > > Hmm, well, strand is manipulated in a SeqFeature when .reverse_complement() > is run, right? ?I thought that might take care of that. ?Though truthfully I > haven't looked too much at that code. The SeqFeature is for describing (part of) a SeqRecord, and both have a reverse_complement method for when you want to flip the sequence and all the features on it. >> Also consider that someone might manipulate the alignment >> directly, e.g. alignment[:,10:60] to pull out fifty columns. That >> seems like a use case where the start/end co-ordinates should >> be updated nicely. Note that internally this calls record[10:60] >> for each row of the alignment, so using SeqRecord objects. > > That's true. ?Is there a more general way to implement this? ?By dragging > the coordinate information out of .annotations and into fields that aren't > MAF-specific or something. 
That's what I was suggesting - the existing fasta-m10 parser can also collect start/end/strand information, and there are obvious potential uses with things like BLAST and HMMER too. One idea might be to introduce a SeqRecord subclass - I'm not sure yet. Peter From p.j.a.cock at googlemail.com Tue May 17 10:02:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 May 2011 11:02:08 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Fri, May 13, 2011 at 8:35 PM, João Rodrigues wrote: > Hello all, > > Not to let this die. > > I've added PERMISSIVE=2 to PDBParser. I also changed the code to remove the > _handle_pdb_exception method and replace it by the warnings module. > > This was done in two commits in my branch: > > https://github.com/JoaoRodrigues/biopython/commit/5b44defc3eb0a3505668ac77b59c8980630e6b07 > https://github.com/JoaoRodrigues/biopython/commit/7383e068e41dd624458b3904fcd61a04c3f319c4 > Is getting rid of _handle_PDB_exception a good idea for performance? If I have understood your code, you just raise a warning in all cases. Then, you have a filter that either promotes the warning to an exception (permissive=0), or silences the warning (permissive=2). Also, do we want to have the same three options for all the recoverable errors? e.g. Currently, missing elements never raise an exception. Peter From anaryin at gmail.com Tue May 17 10:14:25 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 12:14:25 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Hey, That's something I noticed too. 
Some errors still have > PDBConstructionException as a base class, while most of them have > PDBConstructionWarning. Only these latter are regulated by the new scheme. I > believe they were also raised before, but inside the _handle_pdb_exception > function IIRC. For backwards compatibility, we still want to use PDBConstructionException for exceptions (i.e. when permissive=0, or for non-recoverable errors) and PDBConstructionWarning for warnings (i.e. when permissive=1 or 2). The filter action may need to convert any PDBConstructionWarning to a PDBConstructionException. > Regarding performance, that's something we can easily check with the > benchmarks. The difference is not big, the PDB branch and 1.57+ differ just > in that particular detail. So you don't think this is worth worrying about? OK - if the code is cleaner this way that's a good justification. Peter From updates at feedmyinbox.com Tue May 17 11:05:32 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Tue, 17 May 2011 07:05:32 -0400 Subject: [Biopython-dev] 5/17 biopython Questions - BioStar Message-ID: <61130ea2043be0a2b73113a897fbcd9c@74.63.51.88> // [python] Uniprot ID to Gene name // May 16, 2011 at 4:36 AM http://biostar.stackexchange.com/questions/8323/python-uniprot-id-to-gene-name Hi, I've got a huge list of Uniprot IDs and I want to get the matching gene names. Do you know how to do that in python ? (I'm currently searching with Biopython...) Thanks ! Yo. -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From anaryin at gmail.com Tue May 17 11:19:37 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 13:19:37 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Ok, the conversion from warning to exception is something I'll look into then. I also found an annoying problem in the Atom class, when assigning elements: there is an "import warnings" in the function... This is also likely killing a bit the performance.. We can more thoroughly see about the speed once I finish the PDB benchmark. Cheers, Jo?o From p.j.a.cock at googlemail.com Tue May 17 11:47:11 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 17 May 2011 12:47:11 +0100 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Tue, May 17, 2011 at 12:19 PM, Jo?o Rodrigues wrote: > Ok, the conversion from warning to exception is something I'll look into > then. > > I also found an annoying problem in the Atom class, when assigning elements: > there is an "import warnings" in the function... This is also likely killing > a bit the performance.. We can make the import top level then. > We can more thoroughly see about the speed once I finish the PDB benchmark. 
> > Cheers, > > Jo?o > From anaryin at gmail.com Tue May 17 12:12:53 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 14:12:53 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: https://github.com/JoaoRodrigues/biopython/commit/2a694502f6fd116b36d8d2d15b3d4ba23ab92fe8 From anaryin at gmail.com Tue May 17 12:21:52 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 17 May 2011 14:21:52 +0200 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: Regarding the missing element never raising an exception, here's what I propose: Change the wording of the warnings in the Atom._assign_element method so that they signal that the element was missing and it either was auto-assigned or it couldn't be assigned at all. Right now we have: if putative_element.capitalize() in IUPACData.atom_weights: msg = "Used element %r for Atom (name=%s) with given element %r" \ % (putative_element, self.name, element) element = putative_element else: msg = "Could not assign element %r for Atom (name=%s) with given element %r" \ % (putative_element, self.name, element) element = "" warnings.warn(msg, PDBConstructionWarning) I would suggest changing these two messages to make them more verbose. Setting PERMISSIVE to 0 still converts these into exceptions, but the message might not be that explicit that the element was missing to begin with. Jo?o [...] Rodrigues http://nmr.chem.uu.nl/~joao On Tue, May 17, 2011 at 2:12 PM, Jo?o Rodrigues wrote: > > https://github.com/JoaoRodrigues/biopython/commit/2a694502f6fd116b36d8d2d15b3d4ba23ab92fe8 > From updates at feedmyinbox.com Wed May 18 05:06:16 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 18 May 2011 01:06:16 -0400 Subject: [Biopython-dev] 5/18 active questions tagged biopython - Stack Overflow Message-ID: <4aa0c9bbf3ae94272896628b51675707@74.63.51.88> // Writing out a list of strings to a file // May 17, 2011 at 3:14 PM http://stackoverflow.com/questions/6035904/writing-out-a-list-of-strings-to-a-file I have a list of abbreviations Letters = ['Ala', 'Asx', 'Cys', ... 'Glx'] I want to output this to a text file that will look this like: #Letters Ala, Asx, Cys, ..... Glx Noob programmer here! I always forget the simplest things! ah please help and thanks! import Bio from Bio import Seq from Bio.Seq import Alphabet output = 'alphabetSoupOutput.txt' fh = open(output, 'w') ThreeLetterProtein = '#Three Letter Protein' Letters = Bio.Alphabet.ThreeLetterProtein.letters fh.write(ThreeLetterProtein + '\n') #Don't know what goes here fh.close() // BioPython Alphabet Soup // May 17, 2011 at 2:28 AM http://stackoverflow.com/questions/6027064/biopython-alphabet-soup Biopython noob here, I'm trying to create a program that uses the Biopython package Alphabet and alphabet module IUPAC to write out the letters of the classes listed to a file called alphabetSoupOuput.txt. ThreeLetterProtein IUPACProtein unambiguous_dna ambiguous_dna ExtendedIUPACProtein ExtendedIUPACDNA Each group of letters should be written to its single line in the output file and the letters should be separated by commas. The line before each group of letters should contain a label that describes the letters and has a # in the first position of that line, e.g. Three Letter Protein Ala, Asx, Cys, ..., Glx Protein Letters A, C, D, E, ..., Y How can I do this? 
-- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From updates at feedmyinbox.com Wed May 18 10:54:24 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 18 May 2011 06:54:24 -0400 Subject: [Biopython-dev] 5/18 biopython Questions - BioStar Message-ID: <257a8a7da87497cf70829245fd325ed9@74.63.51.88> // Converting GenBank to FASTA in protein form // May 18, 2011 at 12:31 AM http://biostar.stackexchange.com/questions/8377/converting-genbank-to-fasta-in-protein-form So i have a sequence that is a .gb file. What I want to do is parse and change the format of the file. I've figured out how to parse it to FASTA format, although the sequence that is in the FASTA format is nucleic and i want it to be PROTEIN. kind of stuck here... any ideas? import Bio from Bio import SeqUtils from Bio import Seq from Bio import SeqIO handle = 'sequence.gb' output = 'sequence.fasta' data = Bio.SeqIO.parse(handle, 'gb') fh = open(output, 'w') for record in data: convert = Bio.SeqIO.write(record, output, 'fasta') dna = record.seq mrna = dna.transcribe() protein = mrna.translate() // Extracting data from classes in Python // May 17, 2011 at 2:31 PM http://biostar.stackexchange.com/questions/8371/extracting-data-from-classes-in-python How can I extract data from a class in Python? >>>Bio.Alphabet.RNAAlphabet How can I extract, say for example, the letters of that Alphabet from that object in Byophthon? // Working with Alphabet Soup // May 17, 2011 at 1:23 PM http://biostar.stackexchange.com/questions/8370/working-with-alphabet-soup Biopython noob here, I'm trying to create a program that uses the Biopython package Alphabet and alphabet module IUPAC to write out the letters of the classes listed to a file called alphabetSoupOuput.txt. ThreeLetterProtein IUPACProtein unambiguous_dna ambiguous_dna ExtendedIUPACProtein ExtendedIUPACDNA Each group of letters should be written to its single line in the output file and the letters should be separated by commas. The line before each group of letters should contain a label that describes the letters and has a # in the first position of that line, e.g. Three Letter Protein Ala, Asx, Cys, ..., Glx Protein Letters A, C, D, E, ..., Y How can I do this? -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From sbassi at clubdelarazon.org Wed May 18 18:37:42 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 11:37:42 -0700 Subject: [Biopython-dev] SNP data into Biopython Message-ID: Hello, I wonder if would be OK to create a parser for SNP data provided by 23andme for Biopython. I could use https://github.com/ngopal/23andMe as a base. What do you think? 
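For reference, the raw file 23andme gives its customers is (as far as I can tell) just a tab separated text file with "#" comment lines and four columns (rsid, chromosome, position, genotype), so leaving aside the SQL layer in that repository, a first pass at the parser could be as small as this untested sketch:

    def parse_23andme(handle):
        """Iterate over (rsid, chromosome, position, genotype) tuples.

        Untested sketch for the raw tab separated 23andme export; lines
        starting with '#' are header/comment lines and are skipped.
        """
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            rsid, chrom, position, genotype = line.split("\t")
            yield rsid, chrom, int(position), genotype

    # e.g. for rsid, chrom, pos, genotype in parse_23andme(open("genome.txt")): ...
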
From tiagoantao at gmail.com Wed May 18 18:45:21 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 18 May 2011 12:45:21 -0600 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: Hi, On Wed, May 18, 2011 at 12:37 PM, Sebastian Bassi wrote: > I wonder if would be OK to create a parser for SNP data provided by > 23andme for Biopython. I could use https://github.com/ngopal/23andMe > as a base. > What do you think? Are you thinking in also using the sql part of that code? I actually use a similar strategy in my project to parse HapMap data (interPopula). I just wonder what other people would think about having SQL code outside Bio.SQL? I personally have no feelings about it, but I thought I should raise the issue... Tiago From sbassi at clubdelarazon.org Wed May 18 19:00:09 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 12:00:09 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Tiago Ant?o : > Are you thinking in also using the sql part of that code? I actually I didn't think in persistence yet. Just parsing it to make some operations. I could think on persistence on a second iteration. -- Sebasti?n Bassi. Lic. en Biotecnologia. Curso de Python en un d?a: http://bit.ly/cursopython From p.j.a.cock at googlemail.com Wed May 18 19:13:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 May 2011 20:13:50 +0100 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Tiago Ant?o : > Hi, > > On Wed, May 18, 2011 at 12:37 PM, Sebastian Bassi > wrote: >> I wonder if would be OK to create a parser for SNP data provided by >> 23andme for Biopython. I could use https://github.com/ngopal/23andMe >> as a base. >> What do you think? Double check with the original author about reusing his code, but that could be good. Maybe under Bio/SNP/23andme.py where the Bio.SNP namespace could be extended in future? > Are you thinking in also using the sql part of that code? I actually > use a similar strategy in my project to parse HapMap data > (interPopula). I just wonder what other people would think about > having SQL code outside Bio.SQL? I personally have no feelings about > it, but I thought I should raise the issue... Tiago - Do you mean the BioSQL module (no dot)? That is specifically for the BioSQL.org schema, and there are other things under Bio.* which use SQL. Peter From redmine at redmine.open-bio.org Wed May 18 19:29:10 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:29:10 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (New) need to update info on Python version support Message-ID: Issue #3232 has been reported by Walter Gillett. 
---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 18 19:29:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:29:11 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (New) need to update info on Python version support Message-ID: Issue #3232 has been reported by Walter Gillett. ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 18 19:42:39 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 May 2011 20:42:39 +0100 Subject: [Biopython-dev] Biopython specific warning classes Message-ID: Hi all, I've been thinking we should introduce some specific warning classes to Biopython, in particular: ParserWarning, for any "dodgy" input files, such as invalid GenBank LOCUS lines, and so on. The existing PDB parser warning should become a subclass of this. WriterWarning, for things like "data loss", e.g. record IDs getting truncated in PHYLIP output. Perhaps even a base class BiopythonWarning, which would be useful for people wanting to ignore all the Biopython issued warnings - it might be helpful in our unit tests too. Currently (apart from the PDB module), we tend to use the default UserWarning which makes filtering the warnings as an end user (or a unit test writer) quite hard. Any thoughts? Or better name suggestions? 
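To be concrete, here is the sort of thing I have in mind (untested, and all of the names are up for debate):

    import warnings

    # e.g. defined near the top of Bio/__init__.py (names not final):
    class BiopythonWarning(Warning):
        """Base class for all warnings issued by Biopython."""
        pass

    class ParserWarning(BiopythonWarning):
        """Issued when parsing dodgy but recoverable input files."""
        pass

    class WriterWarning(BiopythonWarning):
        """Issued when writing a record loses information (e.g. truncated ids)."""
        pass

    # which would let end users (and our own unit tests) silence just the
    # warnings Biopython raises, rather than every UserWarning:
    warnings.simplefilter("ignore", BiopythonWarning)
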
Regards, Peter From redmine at redmine.open-bio.org Wed May 18 19:49:50 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:49:50 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] (Closed) need to update info on Python version support References: Message-ID: Issue #3232 has been updated by Peter Cock. Status changed from New to Closed % Done changed from 0 to 100 Applied in changeset commit:28af0e85272acc87adb9060a008c99d28ea6c17b. ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 18 19:54:21 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 May 2011 19:54:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3232] need to update info on Python version support References: Message-ID: Issue #3232 has been updated by Peter Cock. Online docs updated too, thanks for reporting this! http://biopython.org/DIST/docs/install/Installation.html http://biopython.org/DIST/docs/install/Installation.pdf ---------------------------------------- Bug #3232: need to update info on Python version support https://redmine.open-bio.org/issues/3232 Author: Walter Gillett Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The BioPython installation instructions (http://biopython.org/DIST/docs/install/Installation.html , section 2 "Installing Python") say: > "Biopython is designed to work with Python 2.4 or later (but not Python 3 yet)" but Open Bio news (http://news.open-bio.org/news/2010/11/dropping-python24-support/) says: > the forthcoming Biopython 1.56 release is planned to be our last release to support Python 2.4 since 1.57 has been released, we should update the installation instructions to indicate that Python 2.5 is now the required minimum version. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From tiagoantao at gmail.com Wed May 18 20:24:34 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 18 May 2011 14:24:34 -0600 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Peter Cock : > Tiago - Do you mean the BioSQL module (no dot)? That is > specifically for the BioSQL.org schema, and there are other > things under Bio.* which use SQL. Ah, interesting. 
I was thinking in donating my HapMap code, but the HapMap project is always changing the directory structure (and file format!) of the site, and that renders my code (which does automatic download of data) quite unstable. :( Tiago From sbassi at clubdelarazon.org Wed May 18 23:03:51 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 16:03:51 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: 2011/5/18 Peter Cock : > Double check with the original author about reusing his code, > but that could be good. Maybe under Bio/SNP/23andme.py > where the Bio.SNP namespace could be extended in future? I've just asked and this is his reply: """ Thanks for your email. I'm flattered. Yes, you may include my code in biopython. I only ask two things: add my name to the list of biopython participants/contributors http://biopython.org/wiki/Participants http://biopython.org/SRC/biopython/CONTRIB add my name to the top of the python class which uses the code, stating a portion of the code came from me (assuming each python class in biopython has a comment header where each developer lists his/her name) I'm very glad you found the code useful. I'm traveling a lot these days and may not have immediate access to the internet, but please don't hesitate to shoot me an email-- I'll do my best to reply in a timely manner. Thanks, Nikhil Gopal """ From p.j.a.cock at googlemail.com Wed May 18 23:20:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 19 May 2011 00:20:02 +0100 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: On Thu, May 19, 2011 at 12:03 AM, Sebastian Bassi wrote: > 2011/5/18 Peter Cock : >> Double check with the original author about reusing his code, >> but that could be good. Maybe under Bio/SNP/23andme.py >> where the Bio.SNP namespace could be extended in future? > > I've just asked and this is his reply: > > """ > Thanks for your email. I'm flattered. Yes, you may include my code in > biopython. I only ask two things: > add my name to the list of biopython participants/contributors > http://biopython.org/wiki/Participants > http://biopython.org/SRC/biopython/CONTRIB > add my name to the top of the python class which uses the code, > stating a portion of the code came from me (assuming each python class > in biopython has a comment header where each developer lists his/her > name) > I'm very glad you found the code useful. I'm traveling a lot these > days and may not have immediate access to the internet, but please > don't hesitate to shoot me an email-- I'll do my best to reply in a > timely manner. > > Thanks, > > Nikhil Gopal > """ Assuming you asked him specifically about putting the code under the Biopython license, those terms are fine. We'd have done most of that anyway - although the wiki participants page is usually self edited. Are you happy to look at this then Sebastian? I've not worked with SNP data first hand - hopefully Tiago or others can look things over when you have something ready to merge. Regards, Peter From sbassi at clubdelarazon.org Thu May 19 06:33:50 2011 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 18 May 2011 23:33:50 -0700 Subject: [Biopython-dev] SNP data into Biopython In-Reply-To: References: Message-ID: On Wed, May 18, 2011 at 4:20 PM, Peter Cock wrote: > Are you happy to look at this then Sebastian? 
I've not worked with > SNP data first hand - hopefully Tiago or others can look things > over when you have something ready to merge. OK. But be patient since my github-foo is new. From updates at feedmyinbox.com Sat May 21 08:28:33 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Sat, 21 May 2011 04:28:33 -0400 Subject: [Biopython-dev] 5/21 biopython Questions - BioStar Message-ID: <53ca752aa480a76bcdc8a9070c62642a@74.63.51.88> // Massive pairwise comparison using biopython // May 20, 2011 at 6:51 PM http://biostar.stackexchange.com/questions/8456/massive-pairwise-comparison-using-biopython Hi, I have a data-set of ~7500 sequences, avg. length ~1700 bases. I need to perform pairwise analysis on the entire set. I have a biopython script to perform this analysis in parallel. My understanding is that the comparison will need to run on an MPI cluster. What are my options for doing this and where could I run the job? Thanks, Peter -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From updates at feedmyinbox.com Mon May 23 08:28:23 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Mon, 23 May 2011 04:28:23 -0400 Subject: [Biopython-dev] 5/23 biopython Questions - BioStar Message-ID: <3f6c56051f15a35ea736b3b079ba44e4@74.63.51.88> // Fragile-X BioInformatics Project // May 23, 2011 at 1:52 AM http://biostar.stackexchange.com/questions/8495/fragile-x-bioinformatics-project Hey guys, I'm looking for some advice. I have a project due in a couple weeks that needs to utilize python(and biopython) to create some sort of computational biology tool that will be used to analyze either GEO samples, DNA sequences, etc. I plan on creating a program that will analyze GEO samples of Fragile-X patients, but don't really know what else I can do. Any suggestions? I don't work in a lab and therefore don't have much experience with this. ANY suggestions would help Please and thanks! -- Website: http://biostar.stackexchange.com/questions/tagged/biopython Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/687953/851dd4cd10a2537cf271a85dfd1566976527e0cd/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From redmine at redmine.open-bio.org Mon May 23 15:30:21 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 23 May 2011 15:30:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong Message-ID: Issue #3234 has been reported by Walter Gillett. ---------------------------------------- Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong https://redmine.open-bio.org/issues/3234 Author: Walter Gillett Status: New Priority: Normal Assignee: Walter Gillett Category: Target version: URL: Spun off from #2947, see that bug for discussion. 
Initial state probabilities should be set explicitly, rather than using the probability of transitioning from a state back to itself, which is incorrect. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon May 23 16:03:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 May 2011 17:03:13 +0100 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: On Mon, May 23, 2011 at 4:30 PM, wrote: > > Issue #3234 has been reported by Walter Gillett. > > ---------------------------------------- > Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong > https://redmine.open-bio.org/issues/3234 > > Author: Walter Gillett > Status: New > Priority: Normal > Assignee: Walter Gillett > Category: > Target version: > URL: > > > Spun off from #2947, see that bug for discussion. Initial state probabilities > should be set explicitly, rather than using the probability of transitioning > from a state back to itself, which is incorrect. Would anyone more familiar with HMMs that I am like to volunteer to review Walter's changes? e.g. Philip (CC'd)? Walter's sent a pull request via github: https://github.com/biopython/biopython/pull/6 This consists of two commits, the first an unrelated minor change to the ignore file for people using the NetBeans IDE: https://github.com/wgillett/biopython/commit/50659de2f0cfa3f0bc913b4ea88d6a001b543d98 Secondly his fix for Bug 3234, https://github.com/wgillett/biopython/commit/a60ac226ceed21fd856ff1ec1dbea2782e2172ae Thanks, Peter From pgarland at gmail.com Tue May 24 08:34:03 2011 From: pgarland at gmail.com (Phillip Garland) Date: Tue, 24 May 2011 01:34:03 -0700 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: The patch looks correct to me. ~Phillip On Mon, May 23, 2011 at 9:03 AM, Peter Cock wrote: > On Mon, May 23, 2011 at 4:30 PM, ? wrote: >> >> Issue #3234 has been reported by Walter Gillett. >> >> ---------------------------------------- >> Bug #3234: Bio.HMM Viterbi algorithm: initial state probabilities are wrong >> https://redmine.open-bio.org/issues/3234 >> >> Author: Walter Gillett >> Status: New >> Priority: Normal >> Assignee: Walter Gillett >> Category: >> Target version: >> URL: >> >> >> Spun off from #2947, see that bug for discussion. Initial state probabilities >> should be set explicitly, rather than using the probability of transitioning >> from a state back to itself, which is incorrect. > > Would anyone more familiar with HMMs that I am like to volunteer to review > Walter's changes? e.g. Philip (CC'd)? 
> > Walter's sent a pull request via github: > https://github.com/biopython/biopython/pull/6 > > This consists of two commits, the first an unrelated minor change to the > ignore file for people using the NetBeans IDE: > https://github.com/wgillett/biopython/commit/50659de2f0cfa3f0bc913b4ea88d6a001b543d98 > > Secondly his fix for Bug 3234, > https://github.com/wgillett/biopython/commit/a60ac226ceed21fd856ff1ec1dbea2782e2172ae > > Thanks, > > Peter > From p.j.a.cock at googlemail.com Tue May 24 09:07:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 May 2011 10:07:15 +0100 Subject: [Biopython-dev] [Biopython - Bug #3234] (New) Bio.HMM Viterbi algorithm: initial state probabilities are wrong In-Reply-To: References: Message-ID: On Tue, May 24, 2011 at 9:34 AM, Phillip Garland wrote: > The patch looks correct to me. > > ~Phillip Thank you both, I've applied the change: https://github.com/biopython/biopython/commit/152f469179d4a142858a04c02169f8d1fc5f8c83 Peter From redmine at redmine.open-bio.org Tue May 24 16:13:11 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 24 May 2011 16:13:11 +0000 Subject: [Biopython-dev] [Biopython - Feature #3236] (New) Make Biopython work in PyPy 1.5 Message-ID: Issue #3236 has been reported by Eric Talevich. ---------------------------------------- Feature #3236: Make Biopython work in PyPy 1.5 https://redmine.open-bio.org/issues/3236 Author: Eric Talevich Status: New Priority: Low Assignee: Category: Target version: URL: PyPy is now roughly as production-ready as Jython: http://morepypy.blogspot.com/2011/04/pypy-15-released-catching-up.html Let's make Biopython work on PyPy 1.5. To make the pure-Python core of Biopython work, I did this: * Download and unpack the pre-compiled Linux tarball from pypy.org * Copy the header file @marshal.h@ from the CPython 2.X installation into the @pypy-c-.../include/@ directory * pypy setup.py build; pypy setup.py install * Delete pypy-c-.../site-packages/Bio/cpairwise2*.so Benchmarking a script that leans heavily on Bio.pairwise2, I see about a 2x speedup between Pypy 1.5 and CPython 2.6 -- yes, that's with the compiled C extension @cpairwise2@ in the CPython 2.6 installation. Numpy isn't available on PyPy yet, and it may be some time before it does. Observations from @pypy setup.py test@: * test_BioSQL triggers tons of RuntimeWarnings related to sqlite3 functions * test_BioSQL_SeqIO fails -- attempts to retrieve P01892 instead of Q29899 (?) * test_Restriction triggers a TypeError, somehow (also causing test_CAPS to err) * test_Entrez fails with many noisy errors -- looks related to expat, may be just my installation * importing @Bio.trie@ fails, probably due to a @marshal.h@ issue with compilation ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Tue May 24 20:20:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 May 2011 21:20:55 +0100 Subject: [Biopython-dev] Fwd: [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: Message-ID: Eric, Could you take a look at the second of these commits from Aaron please? 
https://github.com/habnabit/biopython/commit/533c5b0a8fd4656ef937e5e0816d2714f82ecf07 I've already applied the first one with a cherry-pick, https://github.com/biopython/biopython/tree/7bec999af556be28d1a50dac9687d62f6c200b38 Thanks, Peter ---------- Forwarded message ---------- From: habnabit Date: Tue, May 24, 2011 at 9:10 PM Subject: [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) To: p.j.a.cock at googlemail.com Hi! This is a very small set of changes for biopython; it should be evident from the diff and commit message what the intent is. -- Reply to this email directly or view it on GitHub: https://github.com/biopython/biopython/pull/7 From eric.talevich at gmail.com Wed May 25 14:33:36 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 10:33:36 -0400 Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7) In-Reply-To: References: Message-ID: Thanks for these patches, Aaron! And thanks for merging the first one, Peter. The second set looks safe to me. A couple thoughts: 1. It might be more intuitive to accept a format string directly as the format_branchlength argument, e.g. Phylo.write(tree,?outfile,?'newick', format_branchlength='%.0e') Since the branch length is always supposed to be a numeric type or None, format strings alone should be sufficient to do whatever the user wants, right? Alternatively, the switch in _info_factory could go: if format_branchlength is None: fmt_bl = lambda bl: '%1.5f' % bl elif isinstance(format_branchlength, basestring): fmt_bl = lambda bl: format_branchlength % bl elif callable(format_branchlength): fmt_bl = format_branchlength else: raise WTF 2. Out of curiousity, is there a certain program out there that uses branch length in a different format? I hadn't considered this before, but I can see how scientific notation would be useful sometimes if the target program can handle it. I can merge this if we have agreement on these. Cheers, Eric On Tue, May 24, 2011 at 4:20 PM, Peter Cock wrote: > > Eric, > > Could you take a look at the second of these commits from Aaron please? > https://github.com/habnabit/biopython/commit/533c5b0a8fd4656ef937e5e0816d2714f82ecf07 > > I've already applied the first one with a cherry-pick, > https://github.com/biopython/biopython/tree/7bec999af556be28d1a50dac9687d62f6c200b38 > > Thanks, > > Peter > > > ---------- Forwarded message ---------- > From: habnabit > Date: Tue, May 24, 2011 at 9:10 PM > Subject: [biopython] Bugfix in test_Phylo; branch length formatter for > Newick trees (#7) > To: p.j.a.cock at googlemail.com > > > Hi! > > This is a very small set of changes for biopython; it should be > evident from the diff and commit message what the intent is. > > -- > Reply to this email directly or view it on GitHub: > https://github.com/biopython/biopython/pull/7 From eric.talevich at gmail.com Wed May 25 20:47:38 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 16:47:38 -0400 Subject: [Biopython-dev] Adding QUIET argument to PDBParser() In-Reply-To: References: Message-ID: On Tue, May 17, 2011 at 8:21 AM, Jo?o Rodrigues wrote: > Regarding the missing element never raising an exception, here's what I > propose: > > Change the wording of the warnings in the Atom._assign_element method so > that they signal that the element was missing and it either was > auto-assigned or it couldn't be assigned at all. > I agree. 
Just prefixing the existing messages with "Missing or unexpected element: " would probably be fine, I think. Cheers, Eric From eric.talevich at gmail.com Wed May 25 21:03:23 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 25 May 2011 17:03:23 -0400 Subject: [Biopython-dev] Biopython specific warning classes In-Reply-To: References: Message-ID: On Wed, May 18, 2011 at 3:42 PM, Peter Cock wrote: > Hi all, > > I've been thinking we should introduce some specific warning > classes to Biopython, in particular: > > ParserWarning, for any "dodgy" input files, such as invalid > GenBank LOCUS lines, and so on. The existing PDB parser > warning should become a subclass of this. > This would fit well with what PDB and Phylo already do. My docstring for PhyloXMLWarning says it's for non-compliance with the format's specification. An alternate way to do this (but less easily scaled for SeqIO) is to have warnings for each format, triggered whenever the spec for that format is violated. WriterWarning, for things like "data loss", e.g. record IDs > getting truncated in PHYLIP output. > I'm not sure whether this would be handy or tedious -- a lot of formats could conceivably lose some data in a SeqRecord, and adding checks to each writer might be too much. Maybe just document these things well somewhere. Perhaps even a base class BiopythonWarning, which would > be useful for people wanting to ignore all the Biopython issued > warnings - it might be helpful in our unit tests too. > We should make sure these are very easy to use, to avoid making the scheme complicated, like: >>> from Bio import BiopythonWarning or >>> from Bio.Warnings import BiopythonWarning, ParserWarning, WriterWarning >>> warnings.simplefilter('ignore', ParserWarning) I guess it's not so bad. Currently (apart from the PDB module), we tend to use > the default UserWarning which makes filtering the warnings > as an end user (or a unit test writer) quite hard. > Yeah, I think it would be better to reserve UserWarning for the user's application code, rather than emitting them from the Biopython library. -Eric From p.j.a.cock at googlemail.com Thu May 26 08:38:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 May 2011 09:38:15 +0100 Subject: [Biopython-dev] Biopython specific warning classes In-Reply-To: References: Message-ID: On Wed, May 25, 2011 at 10:03 PM, Eric Talevich wrote: > On Wed, May 18, 2011 at 3:42 PM, Peter Cock > wrote: >> >> Hi all, >> >> I've been thinking we should introduce some specific warning >> classes to Biopython, in particular: >> >> ParserWarning, for any "dodgy" input files, such as invalid >> GenBank LOCUS lines, and so on. The existing PDB parser >> warning should become a subclass of this. > > This would fit well with what PDB and Phylo already do. My docstring > for PhyloXMLWarning says it's for non-compliance with the format's > specification. The warning in the GenBank file are also for clearly non-compliant files. > An alternate way to do this (but less easily scaled for SeqIO) is to have > warnings for each format, triggered whenever the spec for that format is > violated. Once we have the base classes of BiopythonWarning and ParserWarning in place, you could introduce more subclasses - but it seems less and less useful. >> WriterWarning, for things like "data loss", e.g. record IDs >> getting truncated in PHYLIP output. 
>> WriterWarning, for things like "data loss", e.g. record IDs getting truncated in PHYLIP output.
>
> I'm not sure whether this would be handy or tedious -- a lot of formats could conceivably lose some data in a SeqRecord, and adding checks to each writer might be too much. Maybe just document these things well somewhere.

There are a couple of existing warnings of this kind, but I agree they should be used sparingly.

>> Perhaps even a base class BiopythonWarning, which would be useful for people wanting to ignore all the Biopython issued warnings - it might be helpful in our unit tests too.
>
> We should make sure these are very easy to use, to avoid making the scheme complicated, like:
>
>>>> from Bio import BiopythonWarning
>
> or
>
>>>> from Bio.Warnings import BiopythonWarning, ParserWarning, WriterWarning
>>>> warnings.simplefilter('ignore', ParserWarning)
>
> I guess it's not so bad.

Yes, to ignore any Biopython warnings you do:

    import warnings
    from Bio import BiopythonWarning
    warnings.simplefilter('ignore', BiopythonWarning)

or, to ignore just our parser warnings:

    import warnings
    from Bio import ParserWarning
    warnings.simplefilter('ignore', ParserWarning)

That seems easy to me ;)

>> Currently (apart from the PDB module), we tend to use the default UserWarning which makes filtering the warnings as an end user (or a unit test writer) quite hard.
>
> Yeah, I think it would be better to reserve UserWarning for the user's application code, rather than emitting them from the Biopython library.

OK then - I'll work on this.

Peter

From p.j.a.cock at googlemail.com Thu May 26 11:02:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 26 May 2011 12:02:49 +0100
Subject: [Biopython-dev] Biopython specific warning classes
In-Reply-To:
References:
Message-ID:

On Thu, May 26, 2011 at 9:38 AM, Peter Cock wrote:
>
> OK then - I'll work on this.
>

I've made a start on this with a BiopythonWarning and BiopythonParserWarning, but have not yet gone over the whole code base to use these consistently. If anyone wants to tackle their own modules first, that would be helpful.

Peter

From eric.talevich at gmail.com Fri May 27 03:57:14 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 26 May 2011 23:57:14 -0400
Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7)
In-Reply-To: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com>
References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com>
Message-ID:

Aaron & folks,

I've committed the original patch and another based on this discussion.
https://github.com/biopython/biopython/commit/cc48ad211266cb9ac118df15889597912c79a994

On Wed, May 25, 2011 at 10:44 AM, Aaron Gallagher wrote:
> On May 25, 2011, at 7:33 AM, Eric Talevich wrote:
>
>> 1. [...]
>> Since the branch length is always supposed to be a numeric type or None, format strings alone should be sufficient to do whatever the user wants, right?
>
> Maybe this is more sensible; I've been struggling to come up with a use case of a full callable, though it seemed to make sense when I was implementing it.
>
>> Alternatively, the switch in _info_factory could go: [...]
>
> I'm not a huge fan of implementing APIs like this in Python, really. It is seeming more and more like the most sensible thing is to just specify a format string.

I changed the format_branch_length argument to take a simple format string instead of a function:
https://github.com/biopython/biopython/commit/decd2a19fa3631cc34aaaf4c79d3af96c26fa1d9
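With that change, a round-trip that preserves very small branch lengths should look something like this (file names are made up for illustration):

    from Bio import Phylo

    tree = Phylo.read("example.nwk", "newick")
    # Use scientific notation so tiny branch lengths are not truncated
    # by the default '%1.5f' formatting.
    Phylo.write(tree, "out.nwk", "newick", format_branch_length="%.6e")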
>> 2. Out of curiosity, is there a certain program out there that uses branch length in a different format? I hadn't considered this before, but I can see how scientific notation would be useful sometimes if the target program can handle it.
>
> The issue in my case was not so much needing a different format (though the tools I work on /do/ support scientific notation) so much as that the Newick trees I generate have precision down to 1e-6. Round-tripping them through biopython was truncating branches with very small lengths.

Good to know. The format for confidences is also hard-coded ("%1.2f"), do you suppose that should be given the same treatment?

Thanks again,
Eric

From p.j.a.cock at googlemail.com Fri May 27 13:52:56 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 27 May 2011 14:52:56 +0100
Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7)
In-Reply-To:
References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> <7B5DB32C-25FB-43F6-A3CB-15848A975418@gmail.com>
Message-ID:

On Fri, May 27, 2011 at 2:48 PM, Erick Matsen wrote:
> Hello everyone--
>
> Hope you don't mind my chiming into this discussion.
>
>> Good to know. The format for confidences is also hard-coded ("%1.2f"), do you suppose that should be given the same treatment?
>
> I think this would be entirely appropriate. There are some cases (e.g. bootstrap) where the confidence is actually a count, and being able to express it as such might be convenient.
>
> I have one related point to discuss if you don't mind. In
> https://github.com/biopython/biopython/blob/master/Bio/Phylo/NewickIO.py#L246
> trees without confidence values get written out as trees with confidence values of zero. These are of course two different things.
>
> I realize that if we want to write out a tree without confidence values we can specify branchlengths_only, but it would seem to me that the most natural behavior would be to just write out confidence values when they are specified.
>
> In particular, it surprises me that reading a tree and then writing it with the default settings changes the meaning of the tree.
>
> I realize that changing the behavior like this might not be possible because this is a large group project, but I thought I would point it out.
>
> Thank you for your great work here!
>
> Erick

That is a very good point. Can we use None for no confidence value?

Peter
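If the confidence format gets the same treatment, writing bootstrap support as whole counts might look roughly like this (a sketch only: the format_confidence keyword mirrors format_branch_length and is the proposal under discussion here; file names are made up):

    from Bio import Phylo

    tree = Phylo.read("bootstrapped.nwk", "newick")
    # Write bootstrap support as integer counts rather than '%1.2f' floats.
    Phylo.write(tree, "out.nwk", "newick", format_confidence="%d")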
From eric.talevich at gmail.com Fri May 27 14:30:08 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 27 May 2011 10:30:08 -0400
Subject: [Biopython-dev] [biopython] Bugfix in test_Phylo; branch length formatter for Newick trees (#7)
In-Reply-To:
References: <3B2A0BA4-3B13-4DEE-ADFB-E7253857E8DA@gmail.com> <7B5DB32C-25FB-43F6-A3CB-15848A975418@gmail.com>
Message-ID:

On Fri, May 27, 2011 at 9:52 AM, Peter Cock wrote:
> On Fri, May 27, 2011 at 2:48 PM, Erick Matsen wrote:
>> Hello everyone--
>>
>> Hope you don't mind my chiming into this discussion.
>>
>>> Good to know. The format for confidences is also hard-coded ("%1.2f"), do you suppose that should be given the same treatment?
>>
>> I think this would be entirely appropriate. There are some cases (e.g. bootstrap) where the confidence is actually a count, and being able to express it as such might be convenient.

OK, this should be easy enough to fix.

>> I have one related point to discuss if you don't mind. In
>> https://github.com/biopython/biopython/blob/master/Bio/Phylo/NewickIO.py#L246
>> trees without confidence values get written out as trees with confidence values of zero. These are of course two different things.
>>
>> I realize that if we want to write out a tree without confidence values we can specify branchlengths_only, but it would seem to me that the most natural behavior would be to just write out confidence values when they are specified.
>>
>> In particular, it surprises me that reading a tree and then writing it with the default settings changes the meaning of the tree.
>>
>> I realize that changing the behavior like this might not be possible because this is a large group project, but I thought I would point it out.
>>
>> Thank you for your great work here!
>>
>> Erick
>
> That is a very good point. Can we use None for no confidence value?
>
> Peter

Yes, that should be the case, and also NewickIO should not add confidence values of 0.0 during serialization where clade.confidence is None. This probably deserves another test in test_Phylo.py.

I don't see a problem with changing this behavior in Bio.Phylo, as long as it's still creating Newick files that work with other widely-used software.

-Eric
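A sketch of the kind of regression test this could get in test_Phylo.py (illustrative only; the real test may well check the behaviour differently):

    import unittest
    try:
        from StringIO import StringIO  # Python 2, current at the time
    except ImportError:
        from io import StringIO        # Python 3

    from Bio import Phylo

    class NoneConfidenceRoundTrip(unittest.TestCase):
        def test_none_confidence_not_written_as_zero(self):
            # A Newick tree whose clades carry no confidence values...
            tree = Phylo.read(StringIO("(A:0.1,B:0.2):0.0;"), "newick")
            out = StringIO()
            Phylo.write(tree, out, "newick")
            # ...should not gain "0.00" labels on internal nodes when written.
            self.assertTrue(")0.00" not in out.getvalue())

    if __name__ == "__main__":
        unittest.main()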
From mikael.trellet at gmail.com Tue May 31 11:46:50 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Tue, 31 May 2011 13:46:50 +0200
Subject: [Biopython-dev] GSoC 2011 - Interface analysis module - Week 1
Message-ID:

Hi there,

As mentioned in the title, this email is a summary of my first week of coding for the Google Summer of Code 2011. I will begin with a reminder of the original plan proposed to Google, and then describe what I did and what obstacles I encountered. Please don't hesitate to post some comments; your remarks are one of the main motivations for this mail (which will, I think, be the first of a weekly report)!

Week 1 [24th - 31st May]

1. Add the new Interface module backbone to the current Bio.PDB code base
   1. Evaluate possible code reuse and call it from the new module
   2. Try simple calculations to be sure that there is stability between the different modules (parsing, for example) and functions
2. Define a stable benchmark of a few PDB files of complexes to run some unit tests for each step of the project

Unfortunately, one of the main parts of my first week was spent trying to solve some trouble I had using GitHub directly in my Dropbox folder. I worked on several computers, so I wanted to have everything synchronized, but this synchronization didn't seem to be very compatible with Dropbox. I have to admit it was probably the way I used it that was wrong; I finally (if belatedly) decided to keep only one main working directory and to ssh into it when needed.

We began to think of an easy way to add the Interface as a new part of the SMCRA scheme. The idea was to have this new scheme = SM-I-CRA. Unfortunately, the Interface object is not as well defined as just a child of Model and a parent of Chains. Indeed, the main part of the interface is residues, and even residue pairs. We want to keep the chain information, but we can't keep the chains as they are currently defined, since we would get overlaps, duplication and incompatibility between the chains of our model and the chains of our interface. In the same way, our attempt to link the creation of the interface to existing modules such as StructureBuilder and Model wasn't successful.

So, we decided to simplify the concept a bit by adding the classes related to the Interface in an independent way. Links will obviously exist between the different levels of SMCRA, but the Interface is now considered a parallel entity, not completely integrated into the SMCRA scheme. End of the story, now on to the keyboard.

About the coding part: I added two new classes in Bio.PDB, Interface.py and InterfaceBuilder.py.

For the impatient, here are the two links to my commits:
https://github.com/mtrellet/biopython/commit/4cfa4359d0f927609c076ed7b66f37add5aabdfb
https://github.com/mtrellet/biopython/commit/194efe37ac8f88d688e0cf528f1fb896c8441866

Interface.py is the definition of the Interface object, inherited from Entity, with the following methods: *__init__*(self, id), *add*(self, entity) and *get_chains*(self). The *add* method overrides the add method of Entity in order to provide an easy way to classify residues according to their respective chains. The *get_chains* method returns the chains involved in the interface defined by the Interface object.

The second class created is InterfaceBuilder.py, which deals directly with building the interface (hard to guess..!). It has the following methods:

*__init__*(self, model, id=None, threshold=5.0, include_waters=False, *chains)
*_unpack_chains*(self, list_of_tuples)
*get_interface*(self)
*_add_residue*(self, residue)
*_build_interface*(self, model, id, threshold, include_waters=False, *chains)

*__init__*: In order to initialize an interface you need to provide the model for which you want to calculate the interface; that's the only mandatory argument.
*_unpack_chains*: Method used by __init__ to create self.chain_list, a variable read in many parts of the class. It transforms a list of tuples (given by the user) into a list of characters representing the chains which will be involved in the definition of the interface.
*get_interface*: Simply returns the interface.
*_add_residue*: Allows the user to add specific residues to the interface.
*_build_interface*: The machinery to build the interface; it uses NeighborSearch and Selection in order to define the interface depending on the arguments given by the user.

This was maybe a bit long and too detailed (or perhaps not detailed enough); as I already said, don't hesitate to make suggestions, for both my work and my report! You should receive a dozen of these, so any comment is welcome!

Cheers,

--
Mikael TRELLET,
Computational structural biology group, Utrecht University
Bijvoet Center, The Netherlands
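Going purely by the method signatures described above, usage might look roughly like this (the code lives in Mikael's fork, so the import path, argument handling and defaults here are assumptions, and the PDB file name is made up):

    from Bio.PDB import PDBParser
    from Bio.PDB.InterfaceBuilder import InterfaceBuilder

    # Any two-chain complex would do as input.
    structure = PDBParser().get_structure("complex", "complex.pdb")
    model = structure[0]

    # Collect interface residues within the 5.0 A distance threshold.
    builder = InterfaceBuilder(model, threshold=5.0, include_waters=False)
    interface = builder.get_interface()
    print(interface.get_chains())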