From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:00:54 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:00:54 -0500
Subject: [Biopython-dev] [Bug 3173] New: Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

           Summary: Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
           Product: Biopython
           Version: 1.55
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: jp.verta at gmail.com

I'm running Biopython 1.55, Python 2.6 and EMBOSS version 6.3.1 on Mac OS X 10.6 Snow Leopard. The Bio.Emboss.Primer3 parser seems to be incompatible with the newer version 2.2.3 of the Whitehead Primer3 program and the corresponding EMBOSS eprimer3 program output. The parser output for the reverse primer seems to contain all Primer class members (primer.reverse_tm, primer.reverse_gc etc.) except the reverse primer sequence (primer.reverse_seq). Yet the eprimer3 output seems identical to that of old versions (see the attached output.pr3 files).

Here is some example code for designing primers for a set of FASTA sequences.

    def design_primers(fasta_file, output_file):
        from Bio import SeqIO
        from Bio.Emboss.Applications import Primer3Commandline
        from Bio.Emboss import Primer3

        output = open(output_file, "w")
        output.write("name,forward_primer,reverse_primer,forward_tm,reverse_tm,product_size\n")
        for seq_record in SeqIO.parse(fasta_file, "fasta"):
            if not(seq_record):
                break
            # Write the current record to a temporary FASTA file for eprimer3.
            seq_handle = open("sequence", "w")
            seq_handle.write(">" + str(seq_record.id) + "\n" + str(seq_record.seq) + "\n")
            seq_handle.close()
            primer_cl = Primer3Commandline(sequence="sequence")
            primer_cl.explainflag = True
            primer_cl.osizeopt = 20
            primer_cl.psizeopt = 200
            primer_cl.otm = 65
            primer_cl.maxtm = 70
            primer_cl.mintm = 60
            # Required number of Gs or Cs at the 3' end of the primer.
            primer_cl.gcclamp = 1
            primer_cl.outfile = "output.pr3"
            primer_cl()
            # Parse the eprimer3 output and report the first primer pair.
            output_handle = open("output.pr3", "r")
            primer_record = Primer3.read(output_handle)
            output_handle.close()
            if len(primer_record.primers) > 0:
                primer = primer_record.primers[0]
                output.write("%s,%s,%s,%s,%s,%s\n" % (seq_record.id,
                             primer.forward_seq, primer.reverse_seq,
                             primer.forward_tm, primer.reverse_tm, primer.size))
            else:
                print("No primers found for %s" % seq_record.id)
        output.close()

This code, when executed on a file of FASTA sequences, gives an output file with the sequence id, the forward and reverse primer sequences, their tm values and the product size, separated by commas. When I execute it with Primer3 2.2.3 and the compatible eprimer3 version, the field for the reverse primer sequence appears blank.

I will attach the Primer3-2.2.3 compatible eprimer3 file to this report.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

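[Editorial sketch, not part of the original report.] One way to confirm the symptom across a whole run is to scan the CSV file written by the script above for rows whose reverse_primer field is empty. The following is a minimal sketch using only the standard library. The helper name find_blank_reverse_primers is hypothetical, the column names are taken from the header line the script writes, and the two data rows are made-up illustrations rather than real eprimer3 results.

```python
import csv
import io


def find_blank_reverse_primers(handle):
    """Return the names of rows whose reverse_primer field is blank."""
    blank = []
    for row in csv.DictReader(handle):
        value = (row.get("reverse_primer") or "").strip()
        # A field formatted from None with "%s" appears as the text "None";
        # count that as blank too (an assumption about how it might show up).
        if value in ("", "None"):
            blank.append(row["name"])
    return blank


# Made-up rows in the same layout as the script's output file.
example = io.StringIO(
    "name,forward_primer,reverse_primer,forward_tm,reverse_tm,product_size\n"
    "seq1,ACGTACGTACGTACGTACGT,,64.9,65.1,200\n"
    "seq2,ACGTACGTACGTACGTACGA,TTGCATGCATGCATGCATGC,65.0,64.8,198\n"
)
print(find_blank_reverse_primers(example))
```

Running the same check on output produced with the old and the new eprimer3 versions would show whether every record is affected or only some.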
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:03:17 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 1 Feb 2011 13:03:17 -0500 Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3 In-Reply-To: Message-ID: <201102011803.p11I3HtF008419@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3173 ------- Comment #1 from jp.verta at gmail.com 2011-02-01 13:03 EST ------- Created an attachment (id=1565) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1565&action=view) Emboss eprimer3.c file for Primer3 version 2.2.3 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:05:04 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 1 Feb 2011 13:05:04 -0500 Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3 In-Reply-To: Message-ID: <201102011805.p11I54md008512@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3173 ------- Comment #2 from jp.verta at gmail.com 2011-02-01 13:05 EST ------- Created an attachment (id=1566) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1566&action=view) Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program output -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:06:00 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 1 Feb 2011 13:06:00 -0500 Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3 In-Reply-To: Message-ID: <201102011806.p11I60De008626@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3173 jp.verta at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1566|Example output of Primer3 |Example output of Primer3 description|version 1.1.4 compatible |version 2.2.3 compatible |Emboss eprimer3 program |Emboss eprimer3 program |output |output -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:06:47 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 1 Feb 2011 13:06:47 -0500 Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3 In-Reply-To: Message-ID: <201102011806.p11I6lVr008664@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3173 ------- Comment #3 from jp.verta at gmail.com 2011-02-01 13:06 EST ------- Created an attachment (id=1567) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1567&action=view) Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program output -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:07:44 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 1 Feb 2011 13:07:44 -0500 Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3 In-Reply-To: Message-ID: <201102011807.p11I7ih8008712@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3173 jp.verta at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1566|application/octet-stream |text/plain mime type| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Feb 1 15:39:30 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Feb 2011 20:39:30 +0000 Subject: [Biopython-dev] [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu> Message-ID: On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote: > On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote: >> >> Peter, how hard do you think it would be to have SeqIO only convert >> from the fastq encoding to phred scores on demand? Most of the time >> when dealing with fastq I do not need any conversion at all and use >> the FastqGeneralIterator to just pull out the name, sequence and >> quality. >> >> You've done a lot of nice work with the correct conversions and it >> would be great to expose that directly though on-demand conversion >> as Alan is suggesting. 
Ideally you would use SeqIO as normal with >> fastq files, but the quality score would not be converted to solexa >> during parsing using letter_annotations["solexa_quality"] was >> accessed. > > I actually implemented a proof of concept that does that. In order > to not alter the SeqRecord behaviour, it was a new object which > acted like a list of integers in many respects. The data is held > as a FASTQ encoded string, and decoded (and then cached) on > demand only. On output if it was already in the right encoding > the string could be used as is, otherwise the conversion could > be done very quickly with a precomputed table and the string > translate() method (without having to go via a list of integers). > It seemed to work, but I wasn't convinced about the benefits > (given the complexity). I'd really want some real world FASTQ > benchmarks to try it on... something you might have in the form > of your scripts and the real data they were written for? > > I'm pretty sure this code is in a local git branch on one of my > machines (probably at home), but I don't think I pushed it to > github. I should do that... Found it and pushed it: https://github.com/peterjc/biopython/tree/fastq-tricks Note there are unit test failures (e.g. as currently implemented there is no range checking on the characters in the quality strings at parse time). We may want to continue this on the dev mailing list... Peter From p.j.a.cock at googlemail.com Thu Feb 3 07:04:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Feb 2011 12:04:08 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Message-ID: On Wed, Jan 26, 2011 at 7:44 PM, Peter Cock wrote: > > I'm currently looking at trimming 5' and 3' PCR primer sequences - > which could equally be used for barcodes etc. I'd probably wrap this > as a Galaxy tool (using Biopython). 
>

If anyone is interested, see this thread on the Galaxy-dev mailing list:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html

In terms of SFF output, I'm only writing one SFF file so the issues Jacob is concerned about (when writing one SFF file per barcode) do not apply.

On Fri, Jan 28, 2011 at 12:34 PM, Brad Chapman wrote:
>
> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
> Brad

The computationally interesting part is matching the primer/adapter/barcode to the read (both of which may contain IUPAC ambiguity codes), which as you point out can be replaced once you have a working framework for the input, output, trimming, etc.

Currently I'm using regular expressions, which is fast enough for my own needs - and this task could easily be parallelised by breaking up the input reads. Beyond that perhaps something based on Hamming distances (number of mismatches, no gaps) or Levenshtein (edit distance) searches might be quicker. I guess speed is more of an issue with Illumina than with 454 due to the number of reads?

Brad - you mentioned using approximate matches with gaps. Did you find gapped matches made a big difference to the number of matches found? i.e. is it worthwhile on your data?

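[Editorial sketch, not code from this thread.] The two matching strategies mentioned above, an exact regular-expression search and an ungapped mismatch count, could look roughly like this. All function names are hypothetical. The sketch assumes ambiguity codes appear only in the primer, so an ambiguous base in the read (such as N) simply counts as a mismatch; handling ambiguity on both sides would need a different comparison.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer containing ambiguity codes into a regular expression."""
    parts = []
    for letter in primer.upper():
        bases = IUPAC[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def mismatches(primer, window):
    """Count positions where the read base is not allowed by the primer."""
    return sum(1 for p, b in zip(primer.upper(), window.upper())
               if b not in IUPAC[p])


def find_primer(primer, read, max_mismatches=0):
    """Return the first start offset of the primer in the read, or -1."""
    if max_mismatches == 0:
        # Exact search (allowing for ambiguity codes) via the regex engine.
        match = primer_to_regex(primer).search(read.upper())
        return match.start() if match else -1
    # Ungapped scan: slide the primer along the read and count mismatches.
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        if mismatches(primer, window) <= max_mismatches:
            return start
    return -1


print(find_primer("ACRT", "GGACGTTT"))
print(find_primer("ACRT", "GGACCTTT", max_mismatches=1))
```

The sliding-window scan costs on the order of read length times primer length per read, which is why the regular expression is kept as the fast path for the exact case. Gapped matching, as in Brad's pairwise2 approach, is outside the scope of this sketch.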
Peter

From bugzilla-daemon at portal.open-bio.org Thu Feb 3 17:47:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 17:47:04 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102032247.p13Ml4QY029111@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

walter_gillett at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |walter_gillett at hotmail.com

------- Comment #4 from walter_gillett at hotmail.com 2011-02-03 17:47 EST -------
Short answer:

The fix looks good - I have dug into the logic in detail and stepped through the example. However, it appears to me that there is still a bug in this line of code in the viterbi method:

    for cur_state in self.transitions_from(main_state):

In this context, "cur_state" is a state prior to "main_state", so what we really need here is the set of states that lead to main_state, not the set of states that can be reached from main_state. This bug won't cause trouble in practice unless you use a non-ergodic HMM, that is, a model in which some state transitions are disallowed. (The variable names here are confusing; it would be better to rename main_state to cur_state and cur_state to previous_state, or something like that.)

This bug is unrelated to the problem originally reported, other than appearing in the same part of the code, so perhaps it should be handled in a separate ticket. I would be happy to code up a fix if that makes sense.

Longer answer:

I had spent a bunch of time recently investigating this - I should have noted that in bugzilla to avoid duplication of effort. But it still seems worthwhile writing down my notes to document this better, so I'll do that here.

There was an error in the Viterbi algorithm termination logic, as implemented in the method MarkovModel#viterbi.
The Viterbi probabilities were being multiplied by the log-probability of a transition back to an end state (state 0). This was incorrect because in log space the log-probabilities should be added, not multiplied. Peter's fix removes that multiplication, thus dropping the end state transition entirely (which Durbin considers optional, so that's fine; and it was causing trouble). With the bug fixed, the most probable state path to generate 6 tails (in the example model described by the bug reporter) becomes "uuuuuu" as expected - no final "f".

At a higher level, there was (in versions 1.56 and prior, but no longer in trunk) an important undocumented (as far as I can see) requirement that the model always starts in state 0. The bug reporter complained that the results of the Viterbi path calculation are wrong because "apparently they depend upon the order of the state alphabet," which was true. In the example model, providing the state alphabet ["f", "u"] causes the system to start in state f. Since there is a big penalty in his example for switching states, you get "ff" as the most likely state path for the output sequence [tails, tails], even though the unfair coin is much more likely than the fair coin to yield tails.

Looks like Peter's fix treats all starting states as equally probable, there is no longer a special start state. That's reasonable, although the coding is a little confusing:

    # v_{0}(0) = 0
    viterbi_probs[(state_letters[0], -1)] = 0
    # v_{k}(0) = 0 for k > 0
    for state_letter in state_letters[1:]:
        viterbi_probs[(state_letter, -1)] = 0

because it could now more naturally be done in two lines of code rather than three. Possibly it's useful to keep the assignment for state 0 separate in case we want to change it.

A good long-term improvement would be to have a special hidden start state like the "MagicalState" used by BioJava (see http://www.biojava.org/wiki/BioJava:Tutorial:Simple_HMMs_with_BioJava).
That would make it possible to specify a probability distribution for what the initial state should be, a typical HMM feature (see Durbin's book, for example).

From bugzilla-daemon at portal.open-bio.org Thu Feb 3 19:16:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 19:16:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102040016.p140GdsK031389@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

------- Comment #5 from pgarland at gmail.com 2011-02-03 19:16 EST -------
FWIW, I think the right thing with respect to begin states is to require the user to explicitly specify a begin state in the state alphabet, e.g.:

    class coin:
        def __init__(self):
            self.begin_state_name = "begin"
            self.letters = ["u", "f"]

Having the user specify the name should reduce the chance of naming conflicts, and makes it easier for the user to understand what is going on if they print viterbi_probs, or are trying to debug a problem.

The user should also be required to explicitly set the initial probabilities. There should be three methods for this: one that takes a list of initial probabilities, one that makes all initial states equally probable, and one that lets the user set the probability for each state individually, e.g.:

    MarkovModelBuilder.set_initial_probabilities([0.01, 0.99])
    MarkovModelBuilder.set_initial_probabilities_equal()
    MarkovModelBuilder.set_initial_probability("u", 0.01)

The first and third methods would raise an exception if the probabilities did not sum to 1.0.

Alternatively, the initial probabilities could be specified when defining the state alphabet:

    def __init__(self):
        self.begin_state_name = "begin"
        self.letters = [{'name': "u", 'init_prob': 0.01},
                        {'name': "f", 'init_prob': 0.99}]

This has the advantage of making the code more concise and readable, because the state's declaration and specification are kept together. It has the disadvantage of adding an unnecessary layer of indirection when all the states have equal initial probabilities. To make things less tedious for the user, there could either be a flag specifying that all states have an equal initial probability:

    def __init__(self):
        self.begin_state_name = "begin"
        self.initial_probabilities_equal = True
        self.letters = [{'name': "u"}, {'name': "f"}]

or again, a method could be provided:

    MarkovModelBuilder.set_initial_probabilities_equal()

Because specifying the begin state name and the initial probabilities would be required, any of these changes would break the current API.

Similar features should be provided for users who want to constrain the end state, but not specifying the end state should not raise an exception.

I agree the variable names "main_state" and "cur_state" are confusing and should be changed.

~Phillip

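[Editorial sketch, not Bio.HMM code.] The points raised in comments #4 and #5 (adding log-probabilities, looking up the states that lead into the current state, starting from an explicit initial distribution, and applying no end-state transition) can be put together in a small stand-alone Viterbi function. The function signature, the dictionary layout and the coin probabilities below are all assumptions chosen for illustration; they are not the bug reporter's model and not the Biopython API. The sketch also assumes at least one state path has non-zero probability.

```python
import math


def viterbi(observations, states, initial, transitions, emissions):
    """Return (log_probability, state_path) for the most probable path.

    initial[s] is P(first state = s); transitions[(a, b)] is P(a -> b), and a
    missing key means that transition is disallowed; emissions[(s, o)] is
    P(o | s).
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Predecessor lookup: the states that can lead *into* each state.
    transitions_to = dict((s, []) for s in states)
    for (source, target) in transitions:
        transitions_to[target].append(source)

    # Initialisation from the explicit initial distribution.
    scores = dict((s, log(initial[s]) + log(emissions[(s, observations[0])]))
                  for s in states)
    back_pointers = []

    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur_state in states:
            best_prev, best = None, float("-inf")
            for prev_state in transitions_to[cur_state]:
                # Log-probabilities are added, never multiplied.
                candidate = (scores[prev_state]
                             + log(transitions[(prev_state, cur_state)]))
                if candidate > best:
                    best_prev, best = prev_state, candidate
            new_scores[cur_state] = best + log(emissions[(cur_state, obs)])
            pointers[cur_state] = best_prev
        scores = new_scores
        back_pointers.append(pointers)

    # Termination: pick the best final state; no end-state transition.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], "".join(path)


# Illustrative fair ("f") versus unfair ("u") coin with made-up numbers.
states = ["f", "u"]
initial = {"f": 0.5, "u": 0.5}
transitions = {("f", "f"): 0.95, ("f", "u"): 0.05,
               ("u", "u"): 0.95, ("u", "f"): 0.05}
emissions = {("f", "H"): 0.5, ("f", "T"): 0.5,
             ("u", "H"): 0.1, ("u", "T"): 0.9}
print(viterbi("TTTTTT", states, initial, transitions, emissions)[1])
```

With these numbers six tails decode to "uuuuuu", and because the starting distribution is explicit the result does not change when the state list is given as ["u", "f"] instead.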
From clementsgalaxy at gmail.com Thu Feb 3 20:01:01 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Thu, 3 Feb 2011 17:01:01 -0800 Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands Message-ID: We are pleased to announce the *2011 Galaxy Community Conference*, being held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature two full days of presentations and discussion on extending Galaxy to use new tools and data sources, deploying Galaxy at your organization, and best practices for using Galaxy to further your own and your community's research. See http://galaxy.psu.edu/gcc2011/* for complete details. * *About Galaxy: *Galaxy is an open, web-based platform for *accessible, reproducible, and transparent* computational biomedical research. - *Accessibility:* Galaxy enables users without programming experience to easily specify parameters and run tools and workflows. - *Reproducibility:* Galaxy captures all information necessary so that any user can repeat and understand a complete computational analysis. - *Transparency:* Galaxy enables users to share and publish analyses via the web and create Pages--interactive, web-based documents that describe a complete analysis. Galaxy is open source for all organizations. The public Galaxy service ( http://usegalaxy.org) makes analysis tools, genomic data, tutorial demonstrations, persistent workspaces, and publication services available to any scientist that has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs. *Conference Overview: * This event aims to engage a broader community of developers, data producers, tool creators, and core facility and other research hub staff to become an active part of the Galaxy community. 
We'll cover defining resources in the Galaxy framework, increasing their visibility and making them easier to use and integrate with other resources, how to extend Galaxy to use custom data sources and custom tools, and best practices for using Galaxy in your organization. Additional topics include, but are not limited to:

* Talks submitted by the Galaxy community
* Integration of tools (including NGS analysis tools) and distributed job management
* Deployment of Galaxy instances on local resources and on the Cloud
* Management of large datasets with the Galaxy Library System
* Using the Galaxy LIMS functionality at NGS sequencing facilities
* Visualizing Data without leaving Galaxy
* Performing reproducible research
* Performing and sharing complex analyses with Workflows
* An "Introduction to Galaxy" session, offered on May 24, for Galaxy newcomers.

*Registration:*

The conference fee is €100 on or before April 24, and €120 after that. The meeting is being held at the Conference Centre De Werelt in Lunteren, The Netherlands, which is also the conference hotel. You are encouraged to register early, as space at the hotel (and at the "Intro to Galaxy" session) is limited and is likely to fill up before the conference itself does. See http://galaxy.psu.edu/gcc2011/Register.html

*Abstract Submission:*

Abstracts are now being accepted for short oral presentations. Proposals on any topic of interest to the Galaxy community are welcome and encouraged. The abstract submission deadline is the end of February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html

*Sponsors*

The 2011 Galaxy Community Conference is co-sponsored by the US National Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative institute of the bioinformatics groups in the Netherlands.
Together, these groups perform cutting-edge research, develop novel tools and support platforms, create an e-science infrastructure and educate the next generations of bioinformaticians. We are looking forward to a great conference and hope to see you in the Netherlands! The Galaxy and NBIC Teams -- http://galaxy.psu.edu/gcc2011/ http://getgalaxy.org http://usegalaxy.org/ From bugzilla-daemon at portal.open-bio.org Fri Feb 4 05:05:18 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 05:05:18 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041005.p14A5Ij0019705@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 05:05 EST ------- (In reply to comment #4) > > Looks like Peter's fix treats all starting states as equally probable, > there is no longer a special start state. That's reasonable, although the > coding is a little confusing: > It was Phillip's fix. (In reply to comment #5) > FWIW, I think the right thing with respect to begin states is to require the > user to explicitly specify an begin state in the state alphabet, e.g.: > class coin: > def __init__(self): > self.begin_state_name = "begin" > self.letters = ["u", "f"] If we go that route, we'll need to make very clear the differences between a HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It must make sense in many cases to use a biological sequence alphabet, but in general adding HMM attributes to the class does not make sense. We really need someone to volunteer to take over this code (and sort out the overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some documentation for the tutorial, and sort out these remaining issues. Are either of you interested? 
> > I agree the variable names "main_state" and "cur_state" are confusing and > should be changed. > I'll happily merge/cherry-pick a simple diff to do that only if you do that on github, or apply a patch if you upload it here. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 08:46:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 08:46:02 -0500 Subject: [Biopython-dev] [Bug 3175] New: Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3175 Summary: Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 Product: Biopython Version: 1.54 Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: aaron.tin.long.lun at gmail.com When parsing genbank files using Bio.SeqIO as described in the Biopython Cookbook, the presence of a caret in the position of a feature in the annotation (e.g. CDS 1000..1001^1002) raises a LocationParserError, leading to "Syntax error at or near `Tokens('caret')' token". Appears to occur regardless of the type of the feature, whether it is normal/reverse complement, etc. Found in BioPython 1.54 on a Dell dimension 2400 running Kubuntu 10.10. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 08:49:23 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 08:49:23 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041349.p14DnN75028633@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #1 from aaron.tin.long.lun at gmail.com 2011-02-04 08:49 EST ------- Created an attachment (id=1568) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1568&action=view) Crash-inducing file for the GenBank parser Example file, modified from the human mitochondrial genome, with a caret introduced in line 96. Causes the crash described in the bug description. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 09:20:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 09:20:33 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041420.p14EKX5n030354@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 09:20 EST ------- Hi Aaron, The example in attachment #1568 from comment #1 is invalid. The feature location join(16024^16026..16569,1..576) is wrong since the caret should be used in the form [i]^[i+1], i.e. consecutive numbers. 
See: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

That example should probably be a between location like join((16024.16026)..16569,1..576)

However, the example in the original bug report, 1000..1001^1002, looks possible (but unprecedented to my knowledge) and that also fails with the latest Biopython GenBank parsing code (much changed since Biopython 1.54). I don't really understand how that usefully differs from 1000..1001 or 1000..1002 though.

Was that from a GenBank file from the NCBI? If so what accession please, or a URL?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:00:49 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 10:00:49 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041500.p14F0naj032533@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

georg.lipps at fhnw.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

------- Comment #7 from georg.lipps at fhnw.ch 2011-02-04 10:00 EST -------
Yes, the code seems to work now. The probability of attaining the first state is now the transition probability of remaining in the same state (here 0.95).

I like the suggestion in comment #5 to explicitly state a begin state with the corresponding transition probabilities.

A big THANKS for fixing,

Georg

From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:19:11 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 10:19:11 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041519.p14FJBme001095@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #8 from walter_gillett at hotmail.com 2011-02-04 10:19 EST ------- I'll volunteer to do all of that (OK with you, Phillip?). Walter (In reply to comment #6) > (In reply to comment #5) > > FWIW, I think the right thing with respect to begin states is to require the > > user to explicitly specify an begin state in the state alphabet, e.g.: > > class coin: > > def __init__(self): > > self.begin_state_name = "begin" > > self.letters = ["u", "f"] > > If we go that route, we'll need to make very clear the differences between a > HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It > must make sense in many cases to use a biological sequence alphabet, but in > general adding HMM attributes to the class does not make sense. > > We really need someone to volunteer to take over this code (and sort out the > overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some > documentation for the tutorial, and sort out these remaining issues. Are either > of you interested? > > > > > I agree the variable names "main_state" and "cur_state" are confusing and > > should be changed. > > > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > github, or apply a patch if you upload it here. > > Thanks, > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 11:12:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 11:12:33 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041612.p14GCXfW004211@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 11:12 EST ------- > > > > I agree the variable names "main_state" and "cur_state" are confusing and > > should be changed. > > > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > github, or apply a patch if you upload it here. I could have phrased that better: I mean a simple patch/diff to do the rename only would be easy for me to review and check in. (In reply to comment #8) > I'll volunteer to do all of that (OK with you, Phillip?). > > Walter That's OK with me. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 12:25:18 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 12:25:18 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041725.p14HPIhY008673@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #3 from aaron.tin.long.lun at gmail.com 2011-02-04 12:25 EST ------- Hi Peter, Thanks for the quick reply. 
I originally encountered the caret in the GenBank entry for the chromosome II assembly of the human genome (accession number NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is rare, because I parsed through the complete sequences of 15 other chromosomes before my program crashed. Hope that helps. Cheers, Aaron -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 12:43:37 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 12:43:37 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041743.p14HhbbY009388@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 12:43 EST ------- (In reply to comment #3) > Hi Peter, > Thanks for the quick reply. I originally encountered the caret in the GenBank > entry for the chromosome II assembly of the human genome (accession number > NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at > the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene > e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is > rare, because I parsed through the complete sequences of 15 other chromosomes > before my program crashed. Hope that helps. > Cheers, > Aaron > Where on the FTP site? Its a big place and I don't work with human genomes... 
Looking via the Entrez website, it seems NT_022221.13 is only 3519312bp, so this can't match the GenBank file you are looking at: http://www.ncbi.nlm.nih.gov/nuccore/NT_022221.13?report=gbwithparts Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:05:42 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:05:42 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041805.p14I5gxS010298@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 13:05 EST ------- (In reply to comment #4) > > Where on the FTP site? Its a big place and I don't work with human genomes... > Nevermind, I tried downloading a few candidates and found it - you actually meant NT_015926.15 which is in this file (whose first entry is NT_022221.13) ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz It seems that Google doesn't index this site - I can understand why but it would have been useful. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:15:05 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:15:05 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041815.p14IF5Bx010832@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #6 from aaron.tin.long.lun at gmail.com 2011-02-04 13:15 EST ------- Hi Peter, Yeah, sorry about the mix-up, I'm not used to dealing with more than one sequence record per file. The caret should be present in the FTP-sourced file. Interestingly, it is not present in the Nucleotide annotation for the same accession number, which suggests that they've updated it in the two/three months since the data was pushed onto the FTP site. Cheers, Aaron -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:20:12 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:20:12 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041820.p14IKCDN011161@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #7 from aaron.tin.long.lun at gmail.com 2011-02-04 13:20 EST ------- NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in my file. What I said about Nucleotide still applies, though. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 23:23:02 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 23:23:02 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102050423.p154N2fO013565@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

------- Comment #10 from pgarland at gmail.com 2011-02-04 23:23 EST -------
(In reply to comment #8)
> I'll volunteer to do all of that (OK with you, Phillip?).
>
> Walter

Sure. WRT my earlier comment, I realized that it's simpler for both the
implementer and the user if the only user-visible change necessary to specify
begin states is to add a variable to HiddenMarkovBuilder to hold the name of
the begin state, and then let users use set_transition_score to specify
transition probabilities from begin states. Then the relevant methods, e.g.
_all_blank, allow_transition, allow_all_transitions, set_transition_score, etc.,
have to be altered to forbid transitions to, or emissions from, the begin state.
And get_markov_model would raise an exception if a begin state hasn't been
specified or if there isn't at least one transition from the begin state.

So all users would have to do is (using the example from the bug report):

...
build.begin_state_name = "begin"
build.set_transition_score("begin", "u", 0.01)
build.set_transition_score("begin", "f", 0.99)
...

~Phillip
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 02:04:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Feb 2011 02:04:46 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102050704.p1574kup024068@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #11 from walter_gillett at yahoo.com 2011-02-05 02:04 EST ------- Sounds good. (I had been thinking about trying to preserve backward compatibility for existing clients of this class. If we require that the caller sets a begin state then all existing clients will break since none of them currently does that. But the previous fix has already broken compatibility in any case, and that was probably necessary since prior to the fix, the results were incorrect.) A possible variation would be to handle the transition from the begin state to the first real state with special-case code, so that the begin state would not be included in the set of real states. The upside would be that the methods you mention would not have to change, and we wouldn't be cluttering the state alphabet with a begin state that isn't real, which I think was a concern mentioned in comment #6 (if I understood it properly). The downside is having to add that special-case code. Not sure yet whether this is a good idea or not. Walter (In reply to comment #10) > (In reply to comment #8) > > I'll volunteer to do all of that (OK with you, Phillip?). > > > > Walter > > Sure. WRT my earlier comment, I realized that it's simpler for both the > implementer and the user if the only user-visible change necessary to specify > begin states is to add a variable to HiddenMarkovBuilder to hold the name of > the begin state, and then let users use set_transition_score to specify > transition probabilities from begin states. Then the relevant methods, e.g. 
> _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc > have to be altered to forbid transitions to, or emissions from the begin state. > And get_markov_model would raise an exception if a begin state hasn't been > specified or if there isn't at least one transition from the begin state. > > So all users would have to do is (using the example from the bug report): > > ... > build.begin_state_name = "begin" > build.set_transition_score("begin", "u", 0.01) > build.set_transition_score("begin", "f", 0.99) > ... > > ~Phillip > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Feb 5 22:23:39 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Feb 2011 22:23:39 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102060323.p163NdIu013858@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #12 from pgarland at gmail.com 2011-02-05 22:23 EST ------- (In reply to comment #11) > Sounds good. > > (I had been thinking about trying to preserve backward compatibility for > existing clients of this class. If we require that the caller sets a begin > state then all existing clients will break since none of them currently does > that. But the previous fix has already broken compatibility in any case, and > that was probably necessary since prior to the fix, the results were > incorrect.) I don't think it's worth it to worry about preserving complete backward compatibility. Right now there are two classes of code: 1) Code that manually sets up a begin state and the appropriate transitions. All these people would need to do is add one line of code specifying the begin state, and the rest of their code would work as before. 
For these users, we could print an error message instructing them to set the begin_state_name variable (and document the change too!). 2) Code that does not set up a begin state, as in the bug report. Even with the applied bug fix, this code only returns a correct state sequence when all possible start states should be equally probable. In all other cases the users are possibly getting an incorrect result without being aware of it. To my mind, this is worse than breaking backward compatibility. We could maintain backward compatibility by having a default model for the initial state (e.g. equally probable, or assign random probabilities), but unless that's the model the user should be assuming for their sequence, they'll still be silently returned an incorrect result. > A possible variation would be to handle the transition from the begin state to > the first real state with special-case code, so that the begin state would not > be included in the set of real states. The upside would be that the methods you > mention would not have to change, and we wouldn't be cluttering the state > alphabet with a begin state that isn't real, which I think was a concern > mentioned in comment #6 (if I understood it properly). The downside is having > to add that special-case code. Not sure yet whether this is a good idea or not. > > Walter I hadn't thought of that approach. It could be a good way to go. I think the tradeoffs would be: A) Of the existing code, changes would be localized to the viterbi method, which would become slightly more complex. B) This approach makes it trivial to guarantee that no state can transition to the begin state. C) One new public method would have to be added, for users to set initial probabilities. D) Having to use the new method would require more, though not complex, changes to existing user code, but would have the benefit of making it as explicit as possible how the model is initialized. 
All in all, your idea of keeping the begin state separate looks like the way to go. ~ Phillip > (In reply to comment #10) > > (In reply to comment #8) > > > I'll volunteer to do all of that (OK with you, Phillip?). > > > > > > Walter > > > > Sure. WRT my earlier comment, I realized that it's simpler for both the > > implementer and the user if the only user-visible change necessary to specify > > begin states is to add a variable to HiddenMarkovBuilder to hold the name of > > the begin state, and then let users use set_transition_score to specify > > transition probabilities from begin states. Then the relevant methods, e.g. > > _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc > > have to be altered to forbid transitions to, or emissions from the begin state. > > And get_markov_model would raise an exception if a begin state hasn't been > > specified or if there isn't at least one transition from the begin state. > > > > So all users would have to do is (using the example from the bug report): > > > > ... > > build.begin_state_name = "begin" > > build.set_transition_score("begin", "u", 0.01) > > build.set_transition_score("begin", "f", 0.99) > > ... > > > > ~Phillip > > > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
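A minimal sketch of the "begin state kept separate" variant discussed in comments #11 and #12, assuming plain dictionaries rather than the Bio.HMM classes. The function name viterbi and the parameters initial, transitions and emissions are hypothetical and are not part of Biopython; only the 0.01/0.99 begin probabilities come from the thread, and the remaining numbers are invented for illustration.

```python
import math


def _log(p):
    """Natural log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, initial, transitions, emissions):
    """Return (log probability, most likely state path).

    initial[s] is the probability of starting in state s. It plays the role
    of the begin state but is kept outside the set of real states.
    """
    # Initialisation uses the begin probabilities, not a transition score.
    best = {
        s: _log(initial.get(s, 0.0)) + _log(emissions[s].get(observations[0], 0.0))
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            # Look at the states that lead TO cur, not those reachable from it.
            prev = max(
                states,
                key=lambda p: best[p] + _log(transitions[p].get(cur, 0.0)),
            )
            new_best[cur] = (
                best[prev]
                + _log(transitions[prev].get(cur, 0.0))
                + _log(emissions[cur].get(obs, 0.0))
            )
            pointers[cur] = prev
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Illustrative two-coin model: "u" (unfair) and "f" (fair).
score, path = viterbi(
    "HTHT",
    ["u", "f"],
    initial={"u": 0.01, "f": 0.99},
    transitions={"u": {"u": 0.95, "f": 0.05}, "f": {"f": 0.95, "u": 0.05}},
    emissions={"u": {"H": 0.9, "T": 0.1}, "f": {"H": 0.5, "T": 0.5}},
)
print(path)
```

With this layout no state can transition back to the begin state, because the begin probabilities never appear in the transitions table.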
From bugzilla-daemon at portal.open-bio.org Sun Feb 6 01:46:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Feb 2011 01:46:56 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102060646.p166kuqY018550@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #13 from walter_gillett at yahoo.com 2011-02-06 01:46 EST ------- I forked biopython, tested and checked in and pushed some improvements to variable naming and comments in the viterbi method, and submitted a pull request for your review. Thanks, Walter (In reply to comment #8) > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > > github, or apply a patch if you upload it here. > > > > Thanks, > > > > Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Feb 7 07:23:56 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 7 Feb 2011 07:23:56 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Message-ID: <20110207122356.GC18733@sobchak.mgh.harvard.edu> Peter; > The computationally interesting part is matching the primer/adapter/ > barcode to the read (both of which may contain IUPAC ambiguity codes), > which as you point out can be replaced once you have a working > framework for the input, output, trimming, etc. Absolutely. I'd be very happy if you wanted to take the framework in the script and generalize it for different matching. Let me know what I can do to help. > Currently I'm using regular expressions, which is fast enough for my > own needs - and this task could easily be parallelised by breaking > up the input reads. 
Beyond that perhaps something based on
> Hamming distances (edit distance - number of mismatches) or
> Levenshtein searches might be quicker. I guess speed is more of
> an issue with Illumina than with 454 due to the number of reads?
>
> Brad - you mentioned using approximate matches with gaps. Did you
> find gapped matches made a big difference to the number of matches
> found? i.e. is it worthwhile on your data?

A large majority of the barcodes are found with exact matching via a
dictionary lookup, so the gapped/mismatch alignments are only
necessary for the barcodes with sequencing errors. For Illumina reads
gaps aren't as common, so the mismatch alignments are more useful but
I tried to make it general so as to catch as many cases as possible.

Brad

From bugzilla-daemon at portal.open-bio.org Tue Feb 8 11:31:38 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:31:38 -0500
Subject: [Biopython-dev] [Bug 3176] New: Bio SeqIO 'genbank' parse failure
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3176

           Summary: Bio SeqIO 'genbank' parse failure
           Product: Biopython
           Version: 1.56
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: sschmidt at tuebingen.mpg.de

Hi,

the parser stumbles over a Genbank file that contains a feature without values:

___START GenBank File____
LOCUS       someVector______        6127 bp    DNA     circular     1-OCT-2009
SOURCE
  ORGANISM
COMMENT     none
FEATURES             Location/Qualifiers
     misc_structure  1564..1566
                     /ApEinfo_label=ErrorInBioPythonBecauseNoValue
                     /ApEinfo_fwdcolor=
                     /ApEinfo_revcolor=
                     /vntifkey="88"
                     /label=Stop\codon
BASE COUNT       15 a       16 c       16 g       13 t
ORIGIN
        1 gagttccgcg ttacataact tacggtaaat ggcccgcctg gctgaccgcc caacgacccc
//
__END GenBank file___

The relevant error message:

  File "/sw/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 525, in parse
    for r in i:
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 437, in parse_records
    record = self.parse(handle, do_features)
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 420, in parse
    if self.feed(handle, consumer, do_features):
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 392, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 188, in parse_features
    features.append(self.parse_feature(feature_key, feature_lines))
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 268, in parse_feature
    elif value[0]=='"':
IndexError: string index out of range

From bugzilla-daemon at portal.open-bio.org Tue Feb 8 11:45:40 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:45:40 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081645.p18GjeR4025608@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3176

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 11:45 EST -------
Where is this problem file coming from? I'm pretty sure neither the NCBI nor
EMBL/DDBJ use feature qualifiers like that. See:
http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html

If you are creating the file, why not use /key="" or /key - the latter
form is used in real GenBank files, e.g. /pseudo
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:25:13 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:25:13 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081825.p18IPDgO029696@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3176

------- Comment #2 from sschmidt at tuebingen.mpg.de 2011-02-08 13:25 EST -------
The file is the product of ApE
(http://biologylabs.utah.edu/jorgensen/wayned/ape/). I agree that this format
is 'unusual', but the crash could easily be avoided by checking whether a
value is defined at all.

From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:28:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:28:18 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081828.p18ISIGG029796@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3176

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 13:28 EST -------
Created an attachment (id=1569)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1569&action=view)
Handle funny feature annotation

Could you test the following patch? Ask if you need help with that - I can
stick it on a github branch if that is easier.
From anaryin at gmail.com Tue Feb 8 13:33:56 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 8 Feb 2011 19:33:56 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms()
Message-ID:

Dear All,

I've been working on the above-mentioned functions following really great
feedback from Eric, Kristian, and Peter. I've also been using them routinely
and I've had no problems yet, so they should be stable enough. Therefore I
think they can be cherry-picked from my pdb_enhancements branch and added to
the main branch.

Let me know what you think.

Cheers,

João

From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:54:28 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:54:28 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081854.p18IsSbo030923@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3176

------- Comment #4 from sschmidt at tuebingen.mpg.de 2011-02-08 13:54 EST -------
Hmm, I patched the code and got the same error message. What about handling
this problem at Bio/GenBank/Scanner.py directly?
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 17:25:58 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 17:25:58 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102082225.p18MPwXR006718@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1569 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 17:25 EST ------- (From update of attachment 1569) Sorry, must have uploaded the wrong patch - this was a work in progress for the GenBank between location bug. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Feb 9 05:47:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 05:47:33 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091047.p19AlX92029443@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-09 05:47 EST ------- Committed: https://github.com/biopython/biopython/commit/07b6c12cf18d41749918e29b1bbc4a58a18e1180 Can you try the trunk? See http://www.biopython.org/wiki/SourceCode -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 09:19:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 09:19:46 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091419.p19EJkjK011310@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #7 from sschmidt at tuebingen.mpg.de 2011-02-09 09:19 EST ------- (using 07b6c12cf18d41749918e29b1bbc4a58a18e1180) works like a charm. Thanks Peter, should've come up with a similar solution -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Feb 9 09:20:22 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 09:20:22 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091420.p19EKMsg011354@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 sschmidt at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from sschmidt at tuebingen.mpg.de 2011-02-09 09:20 EST ------- done -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
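For readers following this thread, a rough sketch of the kind of guard suggested in comment #2: check that a qualifier value exists before indexing into it. The helper split_qualifier is hypothetical; it is not the code in Bio/GenBank/Scanner.py and not necessarily what the commit referenced in comment #6 does.

```python
def split_qualifier(line):
    """Split a feature qualifier line such as '/key=value' into (key, value).

    The value is None for a bare '/key' (e.g. '/pseudo') and '' for a
    '/key=' line with nothing after the equals sign.
    """
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("Not a qualifier line: %r" % line)
    key, sep, value = text[1:].partition("=")
    if not sep:
        return key, None
    # Guard before indexing: value[0] on an empty string raises IndexError,
    # which is the failure shown in the traceback in the original report.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return key, value


for example in ['/ApEinfo_fwdcolor=', '/vntifkey="88"', '/pseudo']:
    print(split_qualifier(example))
```

Returning None for a bare key and an empty string for '/key=' keeps the two cases distinguishable, which may or may not matter to a caller.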
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 09:05:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 09:05:33 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102101405.p1AE5Xkl029071@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3175

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-10 09:05 EST -------
(In reply to comment #7)
> NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in
> my file. What I said about Nucleotide still applies, though.
>

Yes, you're right. My mistake, NT_015926.15 was the last good record.

Had you noticed this was the last gene in this record? It runs right up to the
end of the sequence and beyond (missing the rightmost end, i.e. the 5' start
of the gene since it is on the reverse strand).

From the FTP site:

LOCUS       NT_022184           68452323 bp    DNA     linear   CON 28-OCT-2010
DEFINITION  Homo sapiens chromosome 2 genomic contig, GRCh37.p2 reference
            primary assembly.
...
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452073^68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452072^68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="GeneID:28916"
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

If we look at the record via Entrez,
http://www.ncbi.nlm.nih.gov/nuccore/NT_022184.15?report=gbwithparts

     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

So this appears to have been updated to avoid the funny caret location, but I
think they made a mistake - surely the CDS should be
complement(68451760..>68452073) not complement(<68451760..68452073) as stated?

Have you contacted the NCBI about this? If not, I will.

I believe that the caret location in the FTP GenBank file is invalid and
Biopython is right to reject it (but I would like to confirm this with the
NCBI). For now the simplest solution is for you to manually edit that feature.

Thanks,

Peter
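To make the location forms in this thread easier to tell apart, here is a small sketch that uses regular expressions to flag a range whose end carries a caret. The function classify_location and its labels are hypothetical; this is not how Biopython parses locations, and it handles only simple locations with an optional complement(...) wrapper.

```python
import re

# A position is digits with an optional fuzzy marker, e.g. "<68451760".
_POS = r"[<>]?\d+"
_RANGE = re.compile(r"^(%s)\.\.(%s)$" % (_POS, _POS))
_BETWEEN = re.compile(r"^(\d+)\^(\d+)$")
_RANGE_WITH_CARET = re.compile(r"^(%s)\.\.(\d+)\^(\d+)$" % _POS)


def classify_location(location):
    """Return (kind, strand) for a simple feature location string."""
    strand = 1
    if location.startswith("complement(") and location.endswith(")"):
        location = location[len("complement("):-1]
        strand = -1
    if _RANGE.match(location):
        kind = "range"
    elif _BETWEEN.match(location):
        kind = "between"
    elif _RANGE_WITH_CARET.match(location):
        # The form from the FTP file, e.g. <68451760..68452072^68452073
        kind = "range-with-caret"
    else:
        kind = "other"
    return kind, strand


for loc in [
    "complement(<68451760..68452072^68452073)",
    "complement(<68451760..68452073)",
    "1000..1001^1002",
]:
    print(loc, classify_location(loc))
```

A pre-scan like this could be used to locate the offending features in a large file before editing them by hand.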
From p.j.a.cock at googlemail.com Thu Feb 10 10:10:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Feb 2011 15:10:19 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110207122356.GC18733@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> Message-ID: On Mon, Feb 7, 2011 at 12:23 PM, Brad Chapman wrote: > Peter; > >> The computationally interesting part is matching the primer/adapter/ >> barcode to the read (both of which may contain IUPAC ambiguity codes), >> which as you point out can be replaced once you have a working >> framework for the input, output, trimming, etc. > > Absolutely. I'd be very happy if you wanted to take the framework in > the script and generalize it for different matching. Let me know > what I can do to help. Do you have (or can you point me at) any good sample data with barcodes, or custom adapters or primer sequences? e.g. some SRA numbers you've been using. >> Currently I'm using regular expressions, which is fast enough for my >> own needs - and this task could easily be parallelised by breaking >> up the input reads. Beyond that perhaps something based on >> Hamming distances (edit distance - number of mismatches) or >> Levenshtein searches might be quicker. I guess speed is more of >> an issue with Illumina than with 454 due to the number of reads? I originally had three separate tools (with shared code) for working with FASTA, FASTQ and SFF reads, which I have recently combined into one single tool that does all three. Code here if anyone wants to look at it. https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ seq_primer_clip.py - Python script seq_primer_clip.xml - Galaxy wrapper seq_primer_clip.txt - readme file This is still a work in progress... 
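For anyone skimming the thread, the regular expression approach mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code in seq_primer_clip.py: the function names are made up for the example, only the primer (not the read) is allowed to contain IUPAC ambiguity codes, and no mismatches are tolerated.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer containing IUPAC codes into a regular expression."""
    parts = []
    for letter in primer.upper():
        bases = IUPAC[letter]
        # Unambiguous letters match themselves; others become a character class.
        parts.append(bases if len(bases) == 1 else "[%s]" % bases)
    return re.compile("".join(parts))


def clip_forward_primer(read, primer):
    """Return the read after the first primer match, or None if no match.

    Hypothetical helper: assumes the read itself has no ambiguity codes.
    """
    match = primer_to_regex(primer).search(read.upper())
    if match is None:
        return None
    return read[match.end():]


if __name__ == "__main__":
    print(primer_to_regex("ACRTW").pattern)
    print(clip_forward_primer("TTACGTAGGGCC", "ACRTW"))
```

Splitting the input reads into chunks and running this per chunk would be one way to parallelise it, as suggested above.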
Peter From bugzilla-daemon at portal.open-bio.org Thu Feb 10 15:02:42 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Feb 2011 15:02:42 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102102002.p1AK2g6g017745@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #14 from walter_gillett at yahoo.com 2011-02-10 15:02 EST ------- I have checked in a fix on my github branch to the bug mentioned in comment #4: in the Viterbi recursion to determine state path probabilities, we must consider states that lead *to* the current state, not those that are reachable *from* it. See comments for this checkin: https://github.com/wgillett/biopython/commit/f8b0b94ad7ffadbf9aa923bc6273822328cb9f01 . Forgot to mention in the comments that I also fixed a bug in the allow_transition method and added a unit test for that method. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Feb 10 18:07:21 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Feb 2011 18:07:21 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102102307.p1AN7Lu0025588@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #9 from aaron.tin.long.lun at gmail.com 2011-02-10 18:07 EST ------- Thanks Peter, will do so. Cheers, Aaron -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
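To make the point of comment #14 on Bug 2947 concrete, here is a small self-contained Viterbi sketch. It is not the Bio.HMM code and not the fix on the github branch; the function signature and the dictionary layout of the model are assumptions made for the example. The line to look at is the inner maximisation, which ranges over the states that lead *to* the current state.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, most likely state path).

    trans_p[a][b] is the probability of moving from state a to state b;
    a missing entry means the transition is not allowed.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[s]: best log probability of any path ending in state s.
    scores = {s: log(start_p.get(s, 0)) + log(emit_p[s].get(observations[0], 0))
              for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for current in states:
            # Key point: maximise over predecessors, i.e. over every prev
            # with an allowed transition prev -> current.
            best_prev = max(
                states,
                key=lambda prev: scores[prev] + log(trans_p[prev].get(current, 0)))
            new_scores[current] = (scores[best_prev]
                                   + log(trans_p[best_prev].get(current, 0))
                                   + log(emit_p[current].get(symbol, 0)))
            pointers[current] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path
```

With an asymmetric model (for instance A can move to B but B cannot move back to A), looking at the states reachable *from* the current state instead would pick up the wrong transition probabilities, which is the bug described above.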
From p.j.a.cock at googlemail.com Fri Feb 11 04:30:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Feb 2011 09:30:02 +0000 Subject: [Biopython-dev] Fwd: [GitHub] Viterbi algorithm bug fix: consider states that lead *to* the current state, not reachable *from* it [biopython/biopython GH-3] In-Reply-To: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail> References: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail> Message-ID: Hi Brad, Do you want to look at this HMM fix too? http://bugzilla.open-bio.org/show_bug.cgi?id=2947 Also who else is getting the github pull requests? We should probably send them to the dev list, but I can't find the settings right now on GitHub... Peter ---------- Forwarded message ---------- From: GitHub Date: Thu, Feb 10, 2011 at 7:58 PM Subject: [GitHub] Viterbi algorithm bug fix: consider states that lead *to* the current state, not reachable *from* it [biopython/biopython GH-3] To: p.j.a.cock at googlemail.com wgillett wants someone to pull from wgillett:master: Bug fix related to bug #2947. Please review and commit if it's OK. Thanks, Walter Gillett View Pull Request: https://github.com/biopython/biopython/pull/3 From chapmanb at 50mail.com Mon Feb 14 08:01:10 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 14 Feb 2011 08:01:10 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> Message-ID: <20110214130110.GA12340@sobchak.mgh.harvard.edu> Peter; > Do you have (or can you point me at) any good sample data with > barcodes, or custom adapters or primer sequences? e.g. some SRA > numbers you've been using. This is a subset of two lanes from a barcoded flowcell for testing purposes: http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz It has 12 barcoded samples, using the Illumina barcodes. 
The sequences are in this YAML file: https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml > I originally had three separate tools (with shared code) for working > with FASTA, FASTQ and SFF reads, which I have recently combined > into one single tool that does all three. Code here if anyone wants to > look at it. > > https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ Very nice. It would be great to get something general for barcode splitting as a Galaxy tool. Thanks for looking at this, Brad From p.j.a.cock at googlemail.com Mon Feb 14 08:19:45 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Feb 2011 13:19:45 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110214130110.GA12340@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> <20110214130110.GA12340@sobchak.mgh.harvard.edu> Message-ID: On Mon, Feb 14, 2011 at 1:01 PM, Brad Chapman wrote: > Peter; > >> Do you have (or can you point me at) any good sample data with >> barcodes, or custom adapters or primer sequences? e.g. some SRA >> numbers you've been using. > > This is a subset of two lanes from a barcoded flowcell for testing > purposes: > > http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz > > It has 12 barcoded samples, using the Illumina barcodes. The > sequences are in this YAML file: > > https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml > Great :) >> I originally had three separate tools (with shared code) for working >> with FASTA, FASTQ and SFF reads, which I have recently combined >> into one single tool that does all three. Code here if anyone wants to >> look at it. >> >> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ > > Very nice. It would be great to get something general for barcode > splitting as a Galaxy tool. 
Thanks for looking at this,
> Brad

Yes - assuming what they have already isn't good enough (at the very
least the Galaxy barcode wrapper for fastx currently only handles
fastq-solexa, but I think that can be fixed).

http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html

I've been focused on the PCR case where my sequences have got IUPAC
ambiguity characters. For barcodes that shouldn't be an issue, but
instead you may have more than one barcode and will want one output
file per barcode (although not usually as complicated as Kevin's
setup). I need to learn more about how Galaxy handles multiple
outputs before commenting on that.

Peter

From tiagoantao at gmail.com Wed Feb 16 11:40:10 2011
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 16 Feb 2011 16:40:10 +0000
Subject: [Biopython-dev] New URL for integration testing
Message-ID:

Hello all,

Buildbot integration testing has been moved to a, hopefully, more
stable location. If you are interested, please have a look at:
http://testing.open-bio.org/

The old URL at events.open-bio.org is no more.

Regards,
Tiago

--
"If you want to get laid, go to college. If you want an education,
go to the library." - Frank Zappa

From anaryin at gmail.com Thu Feb 17 07:59:16 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 17 Feb 2011 13:59:16 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:

Hey Kristian,

To Tests/test_pdb.py? Just to make sure that the renumbering acts on
both accordingly? I agree.
João

From krother at rubor.de Thu Feb 17 07:54:38 2011
From: krother at rubor.de (Kristian Rother)
Date: Thu, 17 Feb 2011 13:54:38 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:

Hi Joao,

I think we should add a simple test function that ensures consistency
of child_dict and child_list upon renumbering. Let me know if you'd
prefer me to explain in Python what I mean.

Kristian

> Dear All,
>
> I've been working on the above-mentioned functions following really great
> feedback from Eric, Kristian, and Peter. I've been also using them
> routinely
> and I've had no problems yet so they should be stable enough. Therefore I
> think they can be cherry-picked from my pdb_enhancements branch and added
> to
> the main branch. Let me know what you think.
>
> Cheers,
>
> João
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From b.invergo at gmail.com Tue Feb 22 11:40:01 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 22 Feb 2011 17:40:01 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID:

Hi everyone,

I've been toiling away on the PAML API and I think it's finally ready
for review. If anyone's willing to give my code a review, here's my
branch:
https://github.com/brandoninvergo/biopython/tree/paml-branch
(the API is in Bio/Phylo/PAML, as suggested before, and the tests are
in Tests, with their supporting files in Tests/PAML)

I'll also post a message to the Biopython user list to see if anyone
would be willing to give it a test drive.

Some notes:
- I've implemented Codeml, Baseml/Basemlg and Yn00. I have not yet
done anything with Mcmctree because I am completely ignorant about
what information to extract from the output files.
The other two programs in the package, Evolver and Chi2, do not accept
commandline options and are instead operated by a rudimentary
commandline interface, so they aren't really compatible with
scripting.
- Chi2 is useful, though, because it provides a chi^2 CDF, which you
can use in performing maximum likelihood ratio tests, an important
part of using the PAML programs. Since Python doesn't have a chi^2
cumulative distribution function in its standard library, I ported
the original C code rather than writing a function which simply calls
the original, with the permission of Ziheng Yang (the original
author; this is mentioned in the code's comments, but he required no
other licensing/copyright verbiage to be included). This was no easy
task, considering the C code was littered with goto statements.
Anyway, this will prevent the user from having to install/import an
outside package to do the tests (I personally had been using Rpy2 to
call the R function pchisq()... complete overkill). Let me know if
this is ok or if this causes some kind of conflict.
- The output of the programs varies widely with the combinatorics of
the parameters and possibly between versions. I tried to include all
possible output files in the Tests/PAML directory and I wrote test
cases to check that they're properly parsed (with the testing of
future versions in mind). So, that Tests/PAML folder has a lot more
in it than the usual test folders, but I felt there was no other
option. I tried to make it organized.

I think those are the main points for now. I'd assume that there's
more work to be done before I should perform a pull request, so I'll
simply ask for your comments for now if you have the time.
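As a rough illustration of the likelihood ratio test mentioned in the notes above, here is a small standard-library sketch. It is not the port of the PAML chi2 C code in the branch: it swaps in a plain series expansion of the regularised incomplete gamma function, and the names chi2_sf and likelihood_ratio_test are made up for this example.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-square with df degrees of freedom.

    Uses the series expansion of the regularised lower incomplete gamma
    function, which is adequate for the modest statistics typical of LRTs.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Sum of half_x**n / (a * (a + 1) * ... * (a + n)) for n = 0, 1, 2, ...
    term = total = 1.0 / a
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) for two nested models.

    lnl_null and lnl_alt are the log-likelihoods of the simpler and the
    richer model; df is the difference in the number of free parameters.
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(statistic, df)
```

For two degrees of freedom the tail probability reduces to exp(-x/2), which gives a convenient sanity check on the series.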
Cheers, Brandon Invergo On Sun, Jan 16, 2011 at 4:09 PM, Peter Cock wrote: > On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote: >> Hi everyone, >> A quick question about style: since the name "codeml" is based on a >> program which is always spelled either in all caps or in all >> lower-case, what would be the best way to write the class name >> regarding capitalization? Stick with the usual camel-case convention, >> "Codeml", anyway? > > I'd go with Codeml for a class name (or something like > CodemlResult or whatever). Neither CODEML nor codeml > seem good class names in Python. > >> Things are progressing nicely. I've already taken care of a lot of the >> minor tasks and improvements... > > Sounds good :) > > Peter > From clementsgalaxy at gmail.com Tue Feb 22 12:16:12 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Tue, 22 Feb 2011 09:16:12 -0800 Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands In-Reply-To: References: Message-ID: Hello all, Just a reminder that the abstract submission deadline for the Galaxy Community Conference is next Monday, February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html for details. Cheers, Dave C. On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote: > We are pleased to announce the *2011 Galaxy Community Conference*, being > held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature > two full days of presentations and discussion on extending Galaxy to use new > tools and data sources, deploying Galaxy at your organization, and best > practices for using Galaxy to further your own and your community's > research. See http://galaxy.psu.edu/gcc2011/* for complete details. > * > *About Galaxy: > *Galaxy is an open, web-based platform for *accessible, reproducible, and > transparent* computational biomedical research. > > - *Accessibility:* Galaxy enables users without programming experience > to easily specify parameters and run tools and workflows. 
> - *Reproducibility:* Galaxy captures all information necessary so that
> any user can repeat and understand a complete computational analysis.
> - *Transparency:* Galaxy enables users to share and publish analyses
> via the web and create Pages--interactive, web-based documents that describe
> a complete analysis.
>
> Galaxy is open source for all organizations. The public Galaxy service (
> http://usegalaxy.org) makes analysis tools, genomic data,
> tutorial demonstrations, persistent workspaces, and publication services
> available to any scientist that has access to the Internet. Local
> Galaxy servers can be set up by downloading the Galaxy application and
> customizing it to meet particular needs.
>
> *Conference Overview:*
>
> This event aims to engage a broader community of developers, data
> producers, tool creators, and core facility and other research hub staff to
> become an active part of the Galaxy community. We'll cover defining
> resources in the Galaxy framework, increasing their visibility and making
> them easier to use and integrate with other resources, how to extend Galaxy
> to use custom data sources and custom tools, and best practices for using
> Galaxy in your organization.
>
> Additional topics include, but are not limited to:
> * Talks submitted by the Galaxy community
> * Integration of tools (including NGS analysis tools) and distributed job
> management
> * Deployment of Galaxy instances on local resources and on the Cloud
> * Management of large datasets with the Galaxy Library System
> * Using the Galaxy LIMS functionality at NGS sequencing facilities
> * Visualizing Data without leaving Galaxy
> * Performing reproducible research
> * Performing and sharing complex analyses with Workflows
> * An "Introduction to Galaxy" session, offered on May 24, for Galaxy
> newcomers.
>
> *Registration:*
>
> The conference fee is €100 on or before April 24, and €120 after that.
The > meeting is being held at the Conference Centre De Werelt in Lunteren, The > Netherlands, which is also the conference hotel. You are encouraged to > register early, as space at the hotel (and at the "Intro to Galaxy" session) > is limited and is likely to fill up before the conference itself does. See > http://galaxy.psu.edu/gcc2011/Register.html > * > Abstract Submission: > * > Abstracts are now being accepted for short oral presentations. Proposals > on any topic of interest to the Galaxy community are welcome and > encouraged. The abstract submission deadline is the end of February 28. > See http://galaxy.psu.edu/gcc2011/Abstracts.html > * * > *Sponsors > * > The 2011 Galaxy Community Conference is co-sponsored by the US National > Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands > Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a > collaborative institute of the bioinformatics groups in the Netherlands. > Together, these groups perform cutting-edge research, develop novel tools > and support platforms, create an e-science infrastructure and educate the > next generations of bioinformaticians. > > We are looking forward to a great conference and hope to see you in the > Netherlands! 
>
> The Galaxy and NBIC Teams
>
> --
> http://galaxy.psu.edu/gcc2011/
> http://getgalaxy.org
> http://usegalaxy.org/
>

--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/

From bugzilla-daemon at portal.open-bio.org Tue Feb 22 13:06:48 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Feb 2011 13:06:48 -0500
Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml
In-Reply-To:
Message-ID: <201102221806.p1MI6mvd015443@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3170

------- Comment #1 from b.invergo at gmail.com 2011-02-22 13:06 EST -------
I've forked the repository on github and I've created a branch containing the
new code:
https://github.com/brandoninvergo/biopython/tree/paml-branch

From p.j.a.cock at googlemail.com Wed Feb 23 04:24:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Feb 2011 09:24:21 +0000
Subject: [Biopython-dev] [Biopython] Biopython library for muliple sequence alignment
In-Reply-To: <001501cbd324$c70a8570$551f9050$@jp>
References: <001501cbd324$c70a8570$551f9050$@jp>
Message-ID:

On Wed, Feb 23, 2011 at 6:42 AM, Rojan Shrestha wrote:
> Hello:
>
> I want to do multiple sequence alignment using CLUSTW. Instead of
> standalone, I would like to use in my own program through biopython. I would
> like to know that whether biopython has clustw function or not. It would be
> very good if somebody gives information about this.
>
> Regards,
>
> Rojan

Hello Rojan,

Biopython (and BioPerl too I believe) doesn't have any multiple
sequence alignment code itself. Biopython does have pairwise sequence
alignment code (with a fast implementation in C).
Instead (again, like BioPerl) Biopython has a wrapper and parser for
calling the ClustalW command line tool from within your script and
loading its output. Similarly for other alignment tools like Muscle.

If you really want to be able to modify the multiple sequence
alignment code itself, some of these command line tools are open
source. Also, I *think* that BioJava has some code for this. I don't
know what BioRuby does.

Peter

P.S. You only really need to ask this on the Biopython Discussion
List. Since you included the OBF cross project list I have tried to
comment on how the other projects handle this as well.

From updates at feedmyinbox.com Wed Feb 23 04:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 active questions tagged biopython - Stack Overflow
Message-ID: <64da3e945fd7631143a0bbd0fdd84e55@74.63.51.88>

// Biopython CodonTable error?
// February 18, 2011 at 3:02 PM

http://stackoverflow.com/questions/5045967/biopython-codontable-error

Hello,

I am writing some code intended to translate ambiguous DNA codes into
possible amino acids and I am seeing some strange translation from the
Biopython 1.56 package. It appears to be translating ambiguous DNA
codes to 'J' which does not exist as a code for anything. I am running
python 2.6.1 on Mac OS 10.6.6.

For example:

>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'

or

>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'

I have looked through the Bio.Data.CodonTable source and Bio.Seq
source and I cannot find a reason why this would be happening.

Any ideas?

Thanks!
Mark

--

Website:
http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active

Account Login:
https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email

Unsubscribe here:
http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email

--

This email was carefully delivered by FeedMyInbox.com.
PO Box 682532 Franklin, TN 37068

From updates at feedmyinbox.com Wed Feb 23 04:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 biopython Questions - BioStar
Message-ID:

// MuscleCommandline not writing file
// February 22, 2011 at 2:34 PM

http://biostar.stackexchange.com/questions/5787/musclecommandline-not-writing-file

I'm trying to work through the Biopython tutorial on multiple sequence
alignment and get an error whenever I try to use subprocess:

child = subprocess.Popen(str(cline),
                         stdout = subprocess.PIPE,
                         stderr = subprocess.PIPE,
                         shell = (sys.platform!="win32"))

I get this error:

Traceback (most recent call last):
  File "", line 2, in
    stdout = subprocess.PIPE)
  File "C:\Python27\lib\subprocess.py", line 672, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 882, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

I've gone so far as to copy and paste the tutorial into the
interpreter and no luck. Neither ClustalW nor Muscle is writing the
alignment files (I tried the deprecated MultipleAlignCL as well with
no luck). I'm using Python v2.7 and Biopython v1.55 and have tried
reinstalling both.

Any advice?
--

Website:
http://biostar.stackexchange.com/questions/tagged/biopython

From chapmanb at 50mail.com Wed Feb 23 08:11:51 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 23 Feb 2011 08:11:51 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID: <20110223131151.GE4922@sobchak.mgh.harvard.edu>

Brandon;

> I've been toiling away on the PAML API and I think it's finally ready
> for review. If anyone's willing to give my code a review, here's my
> branch:
> https://github.com/brandoninvergo/biopython/tree/paml-branch

This is awesome; thanks much for all the work getting this together.
It's really great to see the extensive tests. I'm also impressed
with your story of porting over 'goto' statements; it's been a while
since those have entered my mind:

10 PRINT "CHI SQUARE FOREVER"
20 FLASH
30 GOTO 10

A couple of more general thoughts about your code:

- There looks to be a lot of shared functionality between codeml,
  baseml and yn00 in setting up the control files. Would it be
  possible to create a base class that these all inherit from? This
  would make the code much easier to maintain over time as formats
  change.

- Your 'read' functions get pretty deeply nested, especially the
  codeml parser. What do you think about creating an internal class
  to split some of the parsing logic into individual functions? A
  nice example is the GenBank/Scanner.py code. Having functions like
  parse_header/parse_features makes it much easier for someone not
  deeply familiar with your code to start to make guesses at where
  different functionality exists.
This way, if the format changes others can provide patches and feedback to you. Overall this is great and all the work is much appreciated. Brad From chapmanb at 50mail.com Thu Feb 24 13:26:26 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 24 Feb 2011 13:26:26 -0500 Subject: [Biopython-dev] BOSC 2011 topic organizers and Codefest Message-ID: <20110224182626.GM20125@sobchak.mgh.harvard.edu> Hi all; This year the Bioinformatics Open Source Conference (BOSC) will be taking place in Vienna, Austria on July 15-16th. This is a yearly opportunity for open source bioinformatics developers to get together in person and discuss on-going projects. Nomi Harris, Peter Rice and the other organizing committee members are already hard at work planning for the conference: http://www.open-bio.org/wiki/BOSC_2011 The call for abstracts opens next Monday, and extends through April 18th, and we've been brainstorming potential session topics. This year we've tried to focus each of the sessions around a particular biological problem or computational approach. We hope this will draw some interesting parallels between work being done in different groups, and encourage even more collaboration. We are actively looking for community members who are interested in heading up the organization of a topic. The general idea is to build a cohesive set of talks within a session. How you'd like to do this is completely flexible but some of the ideas we've been discussing are: - Having a short introductory talk to provide an overview of an area, framing the different talks within this context. - Forgoing individual question/answer and instead combining this time into a longer panel-style discussion with all of the speakers. This would help stimulate back and forth between the different projects and the audience. If you are interested in a particular topic and would like to help with the organization, please send an e-mail to the BOSC mailing list: bosc at lists.open-bio.org. 
We're also open to new topic suggestions, and will look to add one or
two more topics to our current list.

Finally, there will be a two day coding session prior to BOSC as a
follow up to last year's fun and productive Codefest:

http://www.open-bio.org/wiki/Codefest_2011

The Metalab, a unique hacker space in Vienna, has kindly agreed to
host us for the two days. If you are at all interested, please add
your name to the attendees list on the wiki. Since the Metalab
organizers don't know us personally, we'd like to demonstrate there
is interest and that we'll really show up with a bunch of
bioinformatics hackers. More details will be in the works as the
summer draws closer.

Looking forward to the sound of music,
Brad

From b.invergo at gmail.com Fri Feb 25 11:57:19 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Fri, 25 Feb 2011 17:57:19 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID:

Hi Brad,

Thanks for your response! It's taken me a day or two to think about
what you wrote (also balancing a PhD with the hobby projects at the
moment...)

> It's really great to see the extensive tests. I'm also impressed
> with your story of porting over 'goto' statements; it's been a while
> since those have entered my mind:

To be honest, I forgot they existed. Seeing them immediately made the
computer scientist in me cringe. They really confused the whole
structure of the program but in the end they were solved quite easily
with some carefully placed loops and conditional blocks!

> - There looks to be a lot of shared functionality between codeml,
>   baseml and yn00 in setting up the control files. Would it be
>   possible to create a base class that these all inherit from? This
>   would make the code much easier to maintain over time as formats
>   change.
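One possible shape for the shared base class suggested in the quoted text, as a rough sketch only. The class names, the small subsets of option names and the "name = value" control file layout with "*" comments are assumptions made for illustration; this is not the code in the paml-branch.

```python
class PamlBase(object):
    """Hypothetical shared base: option storage and control file I/O."""

    # Subclasses list the option names their program accepts.
    valid_options = ()

    def __init__(self):
        self._options = dict.fromkeys(self.valid_options)

    def set_option(self, name, value):
        if name not in self._options:
            raise KeyError("Unknown option for %s: %s"
                           % (type(self).__name__, name))
        self._options[name] = value

    def write_ctl(self, handle):
        # Assumed layout: one "name = value" pair per line, unset options skipped.
        for name in self.valid_options:
            value = self._options[name]
            if value is not None:
                handle.write("%s = %s\n" % (name, value))

    def read_ctl(self, handle):
        for line in handle:
            line = line.split("*", 1)[0].strip()  # assumed: "*" starts a comment
            if "=" in line:
                name, value = [part.strip() for part in line.split("=", 1)]
                self.set_option(name, value)


class Codeml(PamlBase):
    # Illustrative subset of options only.
    valid_options = ("seqfile", "treefile", "outfile", "NSsites")


class Yn00(PamlBase):
    # Illustrative subset of options only.
    valid_options = ("seqfile", "outfile", "icode")
```

With this layout each program class only declares its options, and a change to the control file format is made in one place.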
This is a really good idea and I'm a bit disappointed that I didn't
see it myself! Indeed, most of the functionality is just copied/pasted
between the classes, with only some variation in the
read/write_ctl_file functions for codeml and baseml. So, writing a
base class would really simplify things. I do have one question,
though, since this is my first time organizing my code in a
large-scale Python project. Where would be the best place to implement
this base paml class? In __init__.py or in its own paml.py file? I
know the end result would be the same but I figure I should start
learning some of these best practices.

> - Your 'read' functions get pretty deeply nested, especially the
>   codeml parser. What do you think about creating an internal class
>   to split some of the parsing logic into individual functions? A
>   nice example is the GenBank/Scanner.py code. Having functions like
>   parse_header/parse_features makes it much easier for someone not
>   deeply familiar with your code to start to make guesses at where
>   different functionality exists. This way, if the format changes
>   others can provide patches and feedback to you.

I'm not so sure about this mainly because of the way the output files
are formatted. For example, the most common usage of codeml (the most
common program of the bunch) is to run with several "NSsites" models.
If you do this, the output file is separated into segments which are
headed by a line that says something like "Model 2:
PositiveSelection", and the model parameters are printed out below.
However, if you only run with one model, which is also a common usage,
you no longer have these convenient headers and instead at the very
top of the output file is a completely different indication of which
model was used, but which is inconveniently missing if only model 0
was run.
In other cases, such as amino acid sequence analysis, pairwise nucleotide sequence or multiple gene analyses, there's no header whatsoever indicating which kind of output file you're looking at. Instead, you just have to search for particular data patterns to parse. This mess is precisely why I had to include so many different output files for the unittesting (codeml is the main culprit; baseml is moderately bad; yn00 isn't a problem) So, because I would potentially end up scanning almost the entire file just to figure out what's going on, I think just parsing-as-you-go, using elif statements to short-circuit and skip further evaluations of a line after a match has been found, would be the better option. Perhaps the files aren't long enough to be able to make an appeal for computational efficiency but at the same time, I hesitate to read through the file multiple times unnecessarily. I agree, though, that this makes the read() function quite long. For that, though, I tried to provide descriptive comments before each parsing case, describing exactly what the next block of code is meant to parse and also including a specific example line which should be parsed by it. That said, I will take another look at the output files to see if there could be another way of implementing it. Without a doubt, the parsing is the most difficult part of implementing this module; the rest of it is quite trivial. So, best to do it right! > Overall this is great and all the work is much appreciated. Thanks! It's been a fun side project for me. Cheers, Brandon ps - I still haven't sent a message to the main Biopython list while I consider implementing at least the first suggestion above, since it would involve large changes that might cause me to accidentally break something! 
I'll wait until I'm a bit more confident that it's close to the final product.

From updates at feedmyinbox.com Mon Feb 28 04:21:17 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Mon, 28 Feb 2011 04:21:17 -0500
Subject: [Biopython-dev] 2/28 active questions tagged biopython - Stack Overflow
Message-ID: <348d58cdbd9ae31e700023c354ca3ce6@74.63.51.88>

// Convert nested dictionary/xml to flat file for sqlite
// February 27, 2011 at 11:25 AM

http://stackoverflow.com/questions/5134334/convert-nested-dictionary-xml-to-flat-file-for-sqlite

Hiya- I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask... (Btw, much of this is new to me- not all, just most.)

Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite.

Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed. Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process. This is about where my practical understanding ends.

That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null? Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...

Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.

Many thanks,
chris

From chapmanb at 50mail.com Mon Feb 28 11:35:21 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Feb 2011 11:35:21 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID: <20110228163521.GF9652@sobchak.mgh.harvard.edu>

Brandon;

[pypaml branch:
https://github.com/brandoninvergo/biopython/tree/paml-branch]

[base class]
> This is a really good idea and I'm a bit disappointed that I didn't
> see it myself! Indeed, most of the functionality is just copied/pasted
> between the classes, with only some variation in the
> read/write_ctl_file functions for codeml and baseml. So, writing a
> base class would really simplify things. I do have one question,
> though, since this is my first time organizing my code in a
> large-scale Python project. Where would be the best place to implement
> this base paml class? In __init__.py or in its own paml.py file? I
> know the end result would be the same but I figure I should start
> learning some of these best practices.

It's always easier to get perspective on code when you haven't been directly in the middle of it.
Even if you don't have someone to do code reviews, stepping away from a project and coming back later will often lead to a bunch of insights.

For the base class, I would follow Eric and Peter's example and use files in the same directory with an underscore: something like _shared.py or _base.py.

[read functions]
> This mess is precisely why I had to include so many different
> output files for the unittesting (codeml is the main culprit; baseml
> is moderately bad; yn00 isn't a problem)

I definitely feel your pain on this. This is exactly why your work doing this is appreciated; you'll save someone a lot of headache later on.

> So, because I would potentially end up scanning almost the entire file
> just to figure out what's going on, I think just parsing-as-you-go,
> using elif statements to short-circuit and skip further evaluations of
> a line after a match has been found, would be the better option.
> Perhaps the files aren't long enough to be able to make an appeal for
> computational efficiency but at the same time, I hesitate to read
> through the file multiple times unnecessarily. I agree, though, that
> this makes the read() function quite long. For that, though, I tried
> to provide descriptive comments before each parsing case, describing
> exactly what the next block of code is meant to parse and also
> including a specific example line which should be parsed by it.

The issue really is that deeply nested code is hard to read, long functions are hard to read, and when you combine them together it just makes it very difficult for others to follow your logic. I don't think you necessarily have to make multiple passes to parse it in a more structured way, but what you would want to focus on is making the flow through the function simpler.

The way I would normally attack this is to break components into smaller, more re-usable functions.
Here's a concrete example from the start of the codeml parser:

https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py

siteclass_re = re.match("Site-class models:\s*(.*)", line)
if siteclass_re is not None:
    siteclass_model = siteclass_re.group(1)
    if siteclass_model == "":
        multi_models = True
        continue
    results["site-class model"] = siteclass_model
    if siteclass_model == "NearlyNeutral":
        current_model = 1
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "PositiveSelection":
        current_model = 2
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "discrete (4 categories)":
        current_model = 3
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta (4 categories)":
        current_model = 7
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta&w>1 (5 categories)":
        current_model = 8
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]

You could refactor this something along the lines of:

import re

class _CodemlParser:
    def __init__(self):
        # "NSsites" must exist before models are added to it
        self.results = {"NSsites": {}}
        self.flags = dict(multi_models = False)

    def read(self, results_handle):
        for line in results_handle:
            siteclass_re = re.match("Site-class models:\s*(.*)", line)
            if siteclass_re is not None:
                self._siteclass_parse(siteclass_re)

    def _add_siteclass_model(self, siteclass_model):
        self.results["site-class model"] = siteclass_model
        name_to_num = {"NearlyNeutral": 1,
                       "PositiveSelection": 2,
                       "discrete (4 categories)": 3,
                       "beta (4 categories)": 7,
                       "beta&w>1 (5 categories)": 8}
        current_model = name_to_num[siteclass_model]
        self.results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in self.results["NSsites"]:
            del self.results["NSsites"][0]

    def _siteclass_parse(self, siteclass_re):
        siteclass_model = siteclass_re.group(1)
        if siteclass_model == "":
            self.flags["multi_models"] = True
        else:
            self._add_siteclass_model(siteclass_model)

You are not changing the parsing strategy, but now you've got individual functions handling each of the steps so it's clear that the _siteclass_parse either sets multi_models or adds details about the single model. Then you can dig into the _add_siteclass_model function to see what it is doing. To the reader, each individual unit can be read and understood separately.

This type of refactoring work is useful generally. I have to do it all the time in my work and discover new tricks and approaches.

Hope this is helpful and thanks again for all the work on this,
Brad

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:00:54 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:00:54 -0500
Subject: [Biopython-dev] [Bug 3173] New: Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

           Summary: Bio.Emboss.Primer3 parser incompatibility with Primer3
                    version 2.2.3
           Product: Biopython
           Version: 1.55
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: jp.verta at gmail.com

I'm running Biopython 1.55, Python 2.6 and EMBOSS version 6.3.1 on MacOS X 10.6 Snow Leopard.

The Bio.Emboss.Primer3 parser seems to be incompatible with the newer version 2.2.3 of the Whitehead Primer3 program and the corresponding Emboss eprimer3 program output. The parser output for the reverse primer seems to contain all Primer class members (primer.reverse_tm, primer.reverse_gc etc.) except the reverse primer sequence (primer.reverse_seq).
Yet the eprimer3 output seems identical to that of old versions (see the output.pr3 files attached).

Here is example code for designing primers for a set of fasta sequences.

>>>
def design_primers(fasta_file, output_file):
    from Bio import SeqIO
    from Bio.Emboss.Applications import Primer3Commandline
    from Bio.Emboss import Primer3

    output = open(output_file, "w")
    output.write("name,forward_primer,reverse_primer,forward_tm,reverse_tm,product_size\n")

    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        if not(seq_record):
            break
        open("sequence", "w").write(">"+str(seq_record.id)+"\n"+str(seq_record.seq)+"\n")
        primer_cl = Primer3Commandline(sequence="sequence")
        primer_cl.explainflag = True
        primer_cl.osizeopt=20
        primer_cl.psizeopt=200
        primer_cl.otm=65
        primer_cl.maxtm=70
        primer_cl.mintm=60
        primer_cl.gcclamp=1 #required number of Gs or Cs at the 3' end of the primer
        primer_cl.outfile = "output.pr3"
        primer_cl()
        output_handle = open("output.pr3","r")
        primer_record = Primer3.read(output_handle)
        if len(primer_record.primers) > 0:
            primer = primer_record.primers[0]
            output.write("%s,%s,%s,%s,%s,%s\n" % (seq_record.id, primer.forward_seq, primer.reverse_seq, primer.forward_tm,primer.reverse_tm,primer.size))
        else:
            print "No primers found for %s" % seq_record.id
>>>

This code, when executed on a file of fasta sequences, gives an output file with forward and reverse primer id, sequence, tm and size separated by commas. When I execute it with the Primer3-2.2.3 and compatible eprimer3 versions, the field for the reverse primer sequence appears blank.

I will attach the Primer3-2.2.3 compatible eprimer3 file to this report.

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:03:17 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:03:17 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011803.p11I3HtF008419@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

------- Comment #1 from jp.verta at gmail.com 2011-02-01 13:03 EST -------
Created an attachment (id=1565)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1565&action=view)
Emboss eprimer3.c file for Primer3 version 2.2.3

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:05:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:05:04 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011805.p11I54md008512@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

------- Comment #2 from jp.verta at gmail.com 2011-02-01 13:05 EST -------
Created an attachment (id=1566)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1566&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program output

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:06:00 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:00 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I60De008626@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

jp.verta at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #1566|Example output of Primer3   |Example output of Primer3
        description|version 1.1.4 compatible    |version 2.2.3 compatible
                   |Emboss eprimer3 program     |Emboss eprimer3 program
                   |output                      |output

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:06:47 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:47 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I6lVr008664@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

------- Comment #3 from jp.verta at gmail.com 2011-02-01 13:06 EST -------
Created an attachment (id=1567)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1567&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program output

From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:07:44 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:07:44 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011807.p11I7ih8008712@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3173

jp.verta at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #1566|application/octet-stream    |text/plain
          mime type|                            |

From biopython at maubp.freeserve.co.uk Tue Feb 1 20:39:30 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 20:39:30 +0000
Subject: [Biopython-dev] [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu>
Message-ID:

On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote:
> On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote:
>>
>> Peter, how hard do you think it would be to have SeqIO only convert
>> from the fastq encoding to phred scores on demand? Most of the time
>> when dealing with fastq I do not need any conversion at all and use
>> the FastqGeneralIterator to just pull out the name, sequence and
>> quality.
>>
>> You've done a lot of nice work with the correct conversions and it
>> would be great to expose that directly though on-demand conversion
>> as Alan is suggesting.
>> Ideally you would use SeqIO as normal with
>> fastq files, but the quality score would not be converted to solexa
>> during parsing using letter_annotations["solexa_quality"] was
>> accessed.
>
> I actually implemented a proof of concept that does that. In order
> to not alter the SeqRecord behaviour, it was a new object which
> acted like a list of integers in many respects. The data is held
> as a FASTQ encoded string, and decoded (and then cached) on
> demand only. On output if it was already in the right encoding
> the string could be used as is, otherwise the conversion could
> be done very quickly with a precomputed table and the string
> translate() method (without having to go via a list of integers).
> It seemed to work, but I wasn't convinced about the benefits
> (given the complexity). I'd really want some real world FASTQ
> benchmarks to try it on... something you might have in the form
> of your scripts and the real data they were written for?
>
> I'm pretty sure this code is in a local git branch on one of my
> machines (probably at home), but I don't think I pushed it to
> github. I should do that...

Found it and pushed it:
https://github.com/peterjc/biopython/tree/fastq-tricks

Note there are unit test failures (e.g. as currently implemented there is no range checking on the characters in the quality strings at parse time).

We may want to continue this on the dev mailing list...

Peter

From p.j.a.cock at googlemail.com Thu Feb 3 12:04:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 12:04:08 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu>
References: <20110128123418.GD7866@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jan 26, 2011 at 7:44 PM, Peter Cock wrote:
>
> I'm currently looking at trimming 5' and 3' PCR primer sequences -
> which could equally be used for barcodes etc. I'd probably wrap this
> as a Galaxy tool (using Biopython).
>

If anyone is interested, see this thread on the Galaxy-dev mailing list:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html

In terms of SFF output, I'm only writing one SFF file so the issues Jacob is concerned about (when writing one SFF file per barcode) do not apply.

On Fri, Jan 28, 2011 at 12:34 PM, Brad Chapman wrote:
>
> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
> Brad

The computationally interesting part is matching the primer/adapter/barcode to the read (both of which may contain IUPAC ambiguity codes), which as you point out can be replaced once you have a working framework for the input, output, trimming, etc.

Currently I'm using regular expressions, which is fast enough for my own needs - and this task could easily be parallelised by breaking up the input reads. Beyond that perhaps something based on Hamming distances (edit distance - number of mismatches) or Levenshtein searches might be quicker. I guess speed is more of an issue with Illumina than with 454 due to the number of reads?

Brad - you mentioned using approximate matches with gaps. Did you find gapped matches made a big difference to the number of matches found? i.e. is it worthwhile on your data?
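
A minimal sketch of the regular-expression matching mentioned above, under some simplifying assumptions: ambiguity codes appear only in the primer (not in the read), and only an exact match at the 5' end is trimmed. The names IUPAC, primer_to_regex, trim_five_prime and mismatches are made up for illustration; this is not the code from the Galaxy tool or from barcode_sort_trim.py.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a regex matching every sequence the ambiguous primer covers."""
    parts = []
    for letter in primer.upper():
        bases = IUPAC[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def trim_five_prime(read, primer):
    """Remove the primer if it matches at the start of the read.

    Returns the read unchanged when there is no exact match.
    """
    match = primer_to_regex(primer).match(read.upper())
    return read[match.end():] if match else read


def mismatches(read_prefix, primer):
    """Hamming-style count of positions not allowed by the primer code."""
    return sum(1 for base, code in zip(read_prefix.upper(), primer.upper())
               if base not in IUPAC[code])
```

For example, trim_five_prime("ACGTTTGACCA", "ACRT") strips the first four bases because R allows A or G, while mismatches("ACCT", "ACRT") counts one mismatch and could serve as a cheap filter before any gapped alignment is attempted.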

Peter

From bugzilla-daemon at portal.open-bio.org Thu Feb 3 22:47:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 17:47:04 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102032247.p13Ml4QY029111@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

walter_gillett at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |walter_gillett at hotmail.com

------- Comment #4 from walter_gillett at hotmail.com 2011-02-03 17:47 EST -------
Short answer:

The fix looks good - I have dug into the logic in detail and stepped through the example. However, it appears to me that there is still a bug in this line of code in the viterbi method:

for cur_state in self.transitions_from(main_state):

In this context, "cur_state" is a state prior to "main_state", so what we really need here is the set of states that lead to main_state, not the set of states that can be reached from main_state. This bug won't cause trouble in practice unless you use a non-ergodic HMM, that is, a model in which some state transitions are disallowed. (The variable names here are confusing; it would be better to rename main_state to cur_state and cur_state to previous_state, or something like that.)

This bug is unrelated to the problem originally reported, other than appearing in the same part of the code, so perhaps it should be handled in a separate ticket. I would be happy to code up a fix if that makes sense.

Longer answer:

I had spent a bunch of time recently investigating this - should have noted that in bugzilla to avoid duplication of effort. But it still seems worthwhile writing down my notes to document this better, so I'll do that here.

There was an error in the Viterbi algorithm termination logic, as implemented in the method MarkovModel#viterbi.
The Viterbi probabilities were being multiplied by the log-probability of a transition back to an end state (state 0). This was incorrect because in log space the log-probabilities should be added, not multiplied. Peter's fix removes that multiplication, thus dropping the end state transition entirely (which Durbin considers optional, so that's fine; and it was causing trouble). With the bug fixed, the most probable state path to generate 6 tails (in the example model described by the bug reporter) becomes "uuuuuu" as expected - no final "f".

At a higher level, there was (in versions 1.56 and prior, but no longer in trunk) an important undocumented (as far as I can see) requirement that the model always starts in state 0. The bug reporter complained that the results of the Viterbi path calculation are wrong because "apparently they depend upon the order of the state alphabet," which was true. In the example model, providing the state alphabet ["f", "u"] causes the system to start in state f. Since there is a big penalty in his example for switching states, you get "ff" as the most likely state path for the output sequence [tails, tails], even though the unfair coin is much more likely than the fair coin to yield tails.

Looks like Peter's fix treats all starting states as equally probable, there is no longer a special start state. That's reasonable, although the coding is a little confusing:

# v_{0}(0) = 0
viterbi_probs[(state_letters[0], -1)] = 0
# v_{k}(0) = 0 for k > 0
for state_letter in state_letters[1:]:
    viterbi_probs[(state_letter, -1)] = 0

because it could now more naturally be done in two lines of code rather than three. Possibly it's useful to keep the assignment for state 0 separate in case we want to change it.

A good long-term improvement would be to have a special hidden start state like the "MagicalState" used by BioJava (see http://www.biojava.org/wiki/BioJava:Tutorial:Simple_HMMs_with_BioJava).
That would make it possible to specify a probability distribution for what the initial state should be, a typical HMM feature (see Durbin's book, for example).

From bugzilla-daemon at portal.open-bio.org Fri Feb 4 00:16:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 19:16:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102040016.p140GdsK031389@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

------- Comment #5 from pgarland at gmail.com 2011-02-03 19:16 EST -------
FWIW, I think the right thing with respect to begin states is to require the user to explicitly specify a begin state in the state alphabet, e.g.:

class coin:
    def __init__(self):
        self.begin_state_name = "begin"
        self.letters = ["u", "f"]

Having the user specify the name should reduce the chance of naming conflicts, and makes it easier for the user to understand what is going on if they print viterbi_probs, or are trying to debug a problem.

The user should also be required to explicitly set the initial probabilities. There should be three methods for this, one that takes a list of initial probabilities, one that makes all initial states equally probable, and one that lets the user set the probability for each state individually.
e.g.:

MarkovModelBuilder.set_initial_probabilities([0.01, 0.99])
MarkovModelBuilder.set_initial_probabilities_equal()
MarkovModelBuilder.set_initial_probability("u", 0.01)

The first and third methods would raise an exception if the probabilities did not sum to 1.0.

Alternatively, the initial probabilities could be specified when defining the state alphabet:

def __init__(self):
    self.begin_state_name = "begin"
    self.letters = [{'name': "u", 'init_prob': 0.01},
                    {'name': "f", 'init_prob': 0.99}]

This has the advantage of making the code more concise and readable, because the state's declaration and specification are kept together. It has the disadvantage of adding an unnecessary layer of indirection when all the states have equal initial probabilities. To make things less tedious for the user, there could either be a flag specifying that all states have an equal initial probability:

def __init__(self):
    self.begin_state_name = "begin"
    self.initial_probabilities_equal = True
    self.letters = [{'name': "u"}, {'name': "f"}]

or again, a method could be provided:

MarkovModelBuilder.set_initial_probabilities_equal()

Because specifying the begin state name and the initial probabilities would be required, any of these changes would break the current API.

Similar features should be provided for users who want to constrain the end state, but not specifying the end state should not raise an exception.

I agree the variable names "main_state" and "cur_state" are confusing and should be changed.

~Phillip
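
To make the points in comments #4 and #5 concrete, here is a small standalone sketch of a log-space Viterbi calculation with an explicit initial distribution. It is not the Bio.HMM code and does not use its API: the function name, the argument layout and the coin probabilities below are assumptions chosen for illustration. Log-probabilities are added rather than multiplied, and for each state the loop looks at the states that lead to it.

```python
import math


def viterbi(observations, states, initial_probs, transition_probs, emission_probs):
    """Return (most probable state path, its log-probability).

    Hypothetical signature, not the Bio.HMM API. Transitions missing
    from transition_probs are treated as disallowed (probability 0).
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialisation from an explicit initial distribution, so the
    # result does not depend on the order of the state alphabet.
    scores = {s: log(initial_probs[s]) + log(emission_probs[s][observations[0]])
              for s in states}
    backpointers = []

    for symbol in observations[1:]:
        new_scores = {}
        pointers = {}
        for cur_state in states:
            # Pick the best previous state leading *to* cur_state.
            best_prev = max(
                states,
                key=lambda prev: scores[prev]
                + log(transition_probs.get((prev, cur_state), 0.0)),
            )
            # In log space the terms are summed, never multiplied.
            new_scores[cur_state] = (
                scores[best_prev]
                + log(transition_probs.get((best_prev, cur_state), 0.0))
                + log(emission_probs[cur_state][symbol])
            )
            pointers[cur_state] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Termination without an end-state transition, then traceback.
    last_state = max(states, key=lambda s: scores[s])
    path = [last_state]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last_state]


# Made-up fair ("f") and unfair ("u") coin model; "T" is tails.
INITIAL = {"f": 0.5, "u": 0.5}
TRANSITIONS = {("f", "f"): 0.9, ("f", "u"): 0.1,
               ("u", "u"): 0.9, ("u", "f"): 0.1}
EMISSIONS = {"f": {"H": 0.5, "T": 0.5}, "u": {"H": 0.1, "T": 0.9}}
```

With these numbers, calling viterbi(["T", "T"], ["f", "u"], INITIAL, TRANSITIONS, EMISSIONS) gives the path ["u", "u"], and the same path comes back if the state list is given as ["u", "f"].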

From clementsgalaxy at gmail.com Fri Feb 4 01:01:01 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Thu, 3 Feb 2011 17:01:01 -0800
Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands
Message-ID:

We are pleased to announce the *2011 Galaxy Community Conference*, being held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature two full days of presentations and discussion on extending Galaxy to use new tools and data sources, deploying Galaxy at your organization, and best practices for using Galaxy to further your own and your community's research. See http://galaxy.psu.edu/gcc2011/ for complete details.

*About Galaxy:*
Galaxy is an open, web-based platform for *accessible, reproducible, and transparent* computational biomedical research.

- *Accessibility:* Galaxy enables users without programming experience to easily specify parameters and run tools and workflows.
- *Reproducibility:* Galaxy captures all information necessary so that any user can repeat and understand a complete computational analysis.
- *Transparency:* Galaxy enables users to share and publish analyses via the web and create Pages--interactive, web-based documents that describe a complete analysis.

Galaxy is open source for all organizations. The public Galaxy service (http://usegalaxy.org) makes analysis tools, genomic data, tutorial demonstrations, persistent workspaces, and publication services available to any scientist that has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs.

*Conference Overview:*
This event aims to engage a broader community of developers, data producers, tool creators, and core facility and other research hub staff to become an active part of the Galaxy community.
We'll cover defining resources in the Galaxy framework, increasing their visibility and making them easier to use and integrate with other resources, how to extend Galaxy to use custom data sources and custom tools, and best practices for using Galaxy in your organization. Additional topics include, but are not limited to:

* Talks submitted by the Galaxy community
* Integration of tools (including NGS analysis tools) and distributed job management
* Deployment of Galaxy instances on local resources and on the Cloud
* Management of large datasets with the Galaxy Library System
* Using the Galaxy LIMS functionality at NGS sequencing facilities
* Visualizing Data without leaving Galaxy
* Performing reproducible research
* Performing and sharing complex analyses with Workflows
* An "Introduction to Galaxy" session, offered on May 24, for Galaxy newcomers.

*Registration:*
The conference fee is €100 on or before April 24, and €120 after that. The meeting is being held at the Conference Centre De Werelt in Lunteren, The Netherlands, which is also the conference hotel. You are encouraged to register early, as space at the hotel (and at the "Intro to Galaxy" session) is limited and is likely to fill up before the conference itself does. See http://galaxy.psu.edu/gcc2011/Register.html

*Abstract Submission:*
Abstracts are now being accepted for short oral presentations. Proposals on any topic of interest to the Galaxy community are welcome and encouraged. The abstract submission deadline is the end of February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html

*Sponsors*
The 2011 Galaxy Community Conference is co-sponsored by the US National Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative institute of the bioinformatics groups in the Netherlands.
Together, these groups perform cutting-edge research, develop novel tools and support platforms, create an e-science infrastructure and educate the next generations of bioinformaticians.

We are looking forward to a great conference and hope to see you in the Netherlands!

The Galaxy and NBIC Teams

--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/

From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:05:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 05:05:18 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041005.p14A5Ij0019705@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 05:05 EST -------
(In reply to comment #4)
>
> Looks like Peter's fix treats all starting states as equally probable,
> there is no longer a special start state. That's reasonable, although the
> coding is a little confusing:
>

It was Phillip's fix.

(In reply to comment #5)
> FWIW, I think the right thing with respect to begin states is to require the
> user to explicitly specify an begin state in the state alphabet, e.g.:
> class coin:
>     def __init__(self):
>         self.begin_state_name = "begin"
>         self.letters = ["u", "f"]

If we go that route, we'll need to make very clear the differences between a HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It must make sense in many cases to use a biological sequence alphabet, but in general adding HMM attributes to the class does not make sense.

We really need someone to volunteer to take over this code (and sort out the overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some documentation for the tutorial, and sort out these remaining issues. Are either of you interested?
> > I agree the variable names "main_state" and "cur_state" are confusing and > should be changed. > I'll happily merge/cherry-pick a simple diff to do that only if you do that on github, or apply a patch if you upload it here. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:46:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 08:46:02 -0500 Subject: [Biopython-dev] [Bug 3175] New: Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3175 Summary: Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 Product: Biopython Version: 1.54 Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: aaron.tin.long.lun at gmail.com When parsing genbank files using Bio.SeqIO as described in the Biopython Cookbook, the presence of a caret in the position of a feature in the annotation (e.g. CDS 1000..1001^1002) raises a LocationParserError, leading to "Syntax error at or near `Tokens('caret')' token". Appears to occur regardless of the type of the feature, whether it is normal/reverse complement, etc. Found in BioPython 1.54 on a Dell dimension 2400 running Kubuntu 10.10. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:49:23 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 08:49:23 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041349.p14DnN75028633@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #1 from aaron.tin.long.lun at gmail.com 2011-02-04 08:49 EST ------- Created an attachment (id=1568) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1568&action=view) Crash-inducing file for the GenBank parser Example file, modified from the human mitochondrial genome, with a caret introduced in line 96. Causes the crash described in the bug description. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 14:20:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 09:20:33 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041420.p14EKX5n030354@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 09:20 EST ------- Hi Aaron, The example in attachment #1568 from comment #1 is invalid. The feature location join(16024^16026..16569,1..576) is wrong since the caret should be used in the form [i]^[i+1], i.e. consecutive numbers. 
See: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html That example should probably be a between location like join((16024.16026)..16569,1..576) However, the example in the original bug report, 1000..1001^1002, looks possible (but unprecedented to my knowledge) and that also fails with the latest Biopython GenBank parsing code (much changed since Biopython 1.54). I don't really understand how that usefully differs from 1000..1001 or 1000..1002 though. Was that from a GenBank file from the NCBI? If so, what accession please, or a URL? Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 15:00:49 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 10:00:49 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041500.p14F0naj032533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 georg.lipps at fhnw.ch changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #7 from georg.lipps at fhnw.ch 2011-02-04 10:00 EST ------- Yes, the code seems to work now. The probability of attaining the first state is now the transition probability of remaining in the same state (here 0.95). I like the suggestion of comment #5 to explicitly state a begin state with the corresponding transition probabilities. A big THANKS for fixing, Georg -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 15:19:11 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 10:19:11 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041519.p14FJBme001095@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #8 from walter_gillett at hotmail.com 2011-02-04 10:19 EST ------- I'll volunteer to do all of that (OK with you, Phillip?). Walter (In reply to comment #6) > (In reply to comment #5) > > FWIW, I think the right thing with respect to begin states is to require the > > user to explicitly specify an begin state in the state alphabet, e.g.: > > class coin: > > def __init__(self): > > self.begin_state_name = "begin" > > self.letters = ["u", "f"] > > If we go that route, we'll need to make very clear the differences between a > HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It > must make sense in many cases to use a biological sequence alphabet, but in > general adding HMM attributes to the class does not make sense. > > We really need someone to volunteer to take over this code (and sort out the > overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some > documentation for the tutorial, and sort out these remaining issues. Are either > of you interested? > > > > > I agree the variable names "main_state" and "cur_state" are confusing and > > should be changed. > > > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > github, or apply a patch if you upload it here. > > Thanks, > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 16:12:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 11:12:33 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102041612.p14GCXfW004211@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 11:12 EST ------- > > > > I agree the variable names "main_state" and "cur_state" are confusing and > > should be changed. > > > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > github, or apply a patch if you upload it here. I could have phrased that better: I mean a simple patch/diff to do the rename only would be easy for me to review and check in. (In reply to comment #8) > I'll volunteer to do all of that (OK with you, Phillip?). > > Walter That's OK with me. Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 17:25:18 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 12:25:18 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041725.p14HPIhY008673@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #3 from aaron.tin.long.lun at gmail.com 2011-02-04 12:25 EST ------- Hi Peter, Thanks for the quick reply. 
I originally encountered the caret in the GenBank entry for the chromosome II assembly of the human genome (accession number NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is rare, because I parsed through the complete sequences of 15 other chromosomes before my program crashed. Hope that helps. Cheers, Aaron -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 17:43:37 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 12:43:37 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041743.p14HhbbY009388@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 12:43 EST ------- (In reply to comment #3) > Hi Peter, > Thanks for the quick reply. I originally encountered the caret in the GenBank > entry for the chromosome II assembly of the human genome (accession number > NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at > the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene > e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is > rare, because I parsed through the complete sequences of 15 other chromosomes > before my program crashed. Hope that helps. > Cheers, > Aaron > Where on the FTP site? Its a big place and I don't work with human genomes... 
Looking via the Entrez website, it seems NT_022221.13 is only 3519312bp, so this can't match the GenBank file you are looking at: http://www.ncbi.nlm.nih.gov/nuccore/NT_022221.13?report=gbwithparts Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:05:42 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:05:42 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041805.p14I5gxS010298@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 13:05 EST ------- (In reply to comment #4) > > Where on the FTP site? Its a big place and I don't work with human genomes... > Nevermind, I tried downloading a few candidates and found it - you actually meant NT_015926.15 which is in this file (whose first entry is NT_022221.13) ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz It seems that Google doesn't index this site - I can understand why but it would have been useful. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:15:05 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:15:05 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041815.p14IF5Bx010832@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #6 from aaron.tin.long.lun at gmail.com 2011-02-04 13:15 EST ------- Hi Peter, Yeah, sorry about the mix-up, I'm not used to dealing with more than one sequence record per file. The caret should be present in the FTP-sourced file. Interestingly, it is not present in the Nucleotide annotation for the same accession number, which suggests that they've updated it in the two/three months since the data was pushed onto the FTP site. Cheers, Aaron -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:20:12 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 13:20:12 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102041820.p14IKCDN011161@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #7 from aaron.tin.long.lun at gmail.com 2011-02-04 13:20 EST ------- NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in my file. What I said about Nucleotide still applies, though. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
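Editorial note on bug 3175: comment #2 states that a caret location should join consecutive positions, i.e. [i]^[i+1]. A minimal Python sketch of that single rule follows. The helper name is_valid_caret_site and its regular expression are illustrative assumptions, not part of Biopython's GenBank parser, and the sketch deliberately treats compound locations such as 1000..1001^1002 as out of scope rather than trying to interpret them.

```python
import re

# Hypothetical helper, not Biopython API: matches a bare "i^j" site location.
_CARET_SITE = re.compile(r"^(\d+)\^(\d+)$")


def is_valid_caret_site(location):
    """Return True only if location is "i^j" with j equal to i + 1."""
    match = _CARET_SITE.match(location)
    if match is None:
        # Not a bare caret site (e.g. a range or a compound location).
        return False
    left, right = int(match.group(1)), int(match.group(2))
    return right == left + 1


print(is_valid_caret_site("16024^16025"))  # consecutive positions
print(is_valid_caret_site("16024^16026"))  # the caret part of attachment 1568
```

A check like this would only flag the malformed attachment; whether the parser should reject, warn about, or tolerate such locations is a separate design decision that the thread leaves open.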
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 04:23:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Feb 2011 23:23:02 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102050423.p154N2fO013565@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #10 from pgarland at gmail.com 2011-02-04 23:23 EST ------- (In reply to comment #8) > I'll volunteer to do all of that (OK with you, Phillip?). > > Walter Sure. WRT my earlier comment, I realized that it's simpler for both the implementer and the user if the only user-visible change necessary to specify begin states is to add a variable to HiddenMarkovBuilder to hold the name of the begin state, and then let users use set_transition_score to specify transition probabilities from begin states. Then the relevant methods, e.g. _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc have to be altered to forbid transitions to, or emissions from the begin state. And get_markov_model would raise an exception if a begin state hasn't been specified or if there isn't at least one transition from the begin state. So all users would have to do is (using the example from the bug report): ... build.begin_state_name = "begin" build.set_transition_score("begin", "u", 0.01) build.set_transition_score("begin", "f", 0.99) ... ~Phillip -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
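Editorial note on comment #10: the rules Phillip describes (a named begin state, transitions out of it set with set_transition_score, no transitions into it, and an error if it is missing or has no outgoing transition) can be sketched as below. BeginStateSketch and check_begin_state are hypothetical names for illustration only; this is not the Bio.HMM builder, and begin_state_name is a proposed attribute, not an existing one.

```python
class BeginStateSketch(object):
    """Hypothetical stand-in for the builder behaviour proposed in comment #10."""

    def __init__(self, states):
        self.states = list(states)
        self.begin_state_name = None  # the user must set this explicitly
        self.transition_prob = {}     # (from_state, to_state) -> probability

    def set_transition_score(self, from_state, to_state, probability):
        # Nothing may transition into the begin state.
        if self.begin_state_name is not None and to_state == self.begin_state_name:
            raise ValueError("transitions to the begin state are not allowed")
        if to_state not in self.states:
            raise KeyError("unknown state: %r" % (to_state,))
        self.transition_prob[(from_state, to_state)] = probability

    def check_begin_state(self):
        # The checks the proposal assigns to get_markov_model.
        if self.begin_state_name is None:
            raise ValueError("no begin state specified")
        has_outgoing = any(
            src == self.begin_state_name for (src, _dst) in self.transition_prob
        )
        if not has_outgoing:
            raise ValueError("no transition from the begin state")


build = BeginStateSketch(["u", "f"])
build.begin_state_name = "begin"
build.set_transition_score("begin", "u", 0.01)
build.set_transition_score("begin", "f", 0.99)
build.check_begin_state()  # passes: begin state set and has outgoing transitions
```

The three calls on build mirror the user-facing lines quoted in the comment; everything else is an assumption about how the validation might be arranged.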
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 07:04:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Feb 2011 02:04:46 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102050704.p1574kup024068@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #11 from walter_gillett at yahoo.com 2011-02-05 02:04 EST ------- Sounds good. (I had been thinking about trying to preserve backward compatibility for existing clients of this class. If we require that the caller sets a begin state then all existing clients will break since none of them currently does that. But the previous fix has already broken compatibility in any case, and that was probably necessary since prior to the fix, the results were incorrect.) A possible variation would be to handle the transition from the begin state to the first real state with special-case code, so that the begin state would not be included in the set of real states. The upside would be that the methods you mention would not have to change, and we wouldn't be cluttering the state alphabet with a begin state that isn't real, which I think was a concern mentioned in comment #6 (if I understood it properly). The downside is having to add that special-case code. Not sure yet whether this is a good idea or not. Walter (In reply to comment #10) > (In reply to comment #8) > > I'll volunteer to do all of that (OK with you, Phillip?). > > > > Walter > > Sure. WRT my earlier comment, I realized that it's simpler for both the > implementer and the user if the only user-visible change necessary to specify > begin states is to add a variable to HiddenMarkovBuilder to hold the name of > the begin state, and then let users use set_transition_score to specify > transition probabilities from begin states. Then the relevant methods, e.g. 
> _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc > have to be altered to forbid transitions to, or emissions from the begin state. > And get_markov_model would raise an exception if a begin state hasn't been > specified or if there isn't at least one transition from the begin state. > > So all users would have to do is (using the example from the bug report): > > ... > build.begin_state_name = "begin" > build.set_transition_score("begin", "u", 0.01) > build.set_transition_score("begin", "f", 0.99) > ... > > ~Phillip > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Feb 6 03:23:39 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Feb 2011 22:23:39 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102060323.p163NdIu013858@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #12 from pgarland at gmail.com 2011-02-05 22:23 EST ------- (In reply to comment #11) > Sounds good. > > (I had been thinking about trying to preserve backward compatibility for > existing clients of this class. If we require that the caller sets a begin > state then all existing clients will break since none of them currently does > that. But the previous fix has already broken compatibility in any case, and > that was probably necessary since prior to the fix, the results were > incorrect.) I don't think it's worth it to worry about preserving complete backward compatibility. Right now there are two classes of code: 1) Code that manually sets up a begin state and the appropriate transitions. All these people would need to do is add one line of code specifying the begin state, and the rest of their code would work as before. 
For these users, we could print an error message instructing them to set the begin_state_name variable (and document the change too!). 2) Code that does not set up a begin state, as in the bug report. Even with the applied bug fix, this code only returns a correct state sequence when all possible start states should be equally probable. In all other cases the users are possibly getting an incorrect result without being aware of it. To my mind, this is worse than breaking backward compatibility. We could maintain backward compatibility by having a default model for the initial state (e.g. equally probable, or assign random probabilities), but unless that's the model the user should be assuming for their sequence, they'll still be silently returned an incorrect result. > A possible variation would be to handle the transition from the begin state to > the first real state with special-case code, so that the begin state would not > be included in the set of real states. The upside would be that the methods you > mention would not have to change, and we wouldn't be cluttering the state > alphabet with a begin state that isn't real, which I think was a concern > mentioned in comment #6 (if I understood it properly). The downside is having > to add that special-case code. Not sure yet whether this is a good idea or not. > > Walter I hadn't thought of that approach. It could be a good way to go. I think the tradeoffs would be: A) Of the existing code, changes would be localized to the viterbi method, which would become slightly more complex. B) This approach makes it trivial to guarantee that no state can transition to the begin state. C) One new public method would have to be added, for users to set initial probabilities. D) Having to use the new method would require more, though not complex, changes to existing user code, but would have the benefit of making it as explicit as possible how the model is initialized. 
All in all, your idea of keeping the begin state separate looks like the way to go. ~ Phillip > (In reply to comment #10) > > (In reply to comment #8) > > > I'll volunteer to do all of that (OK with you, Phillip?). > > > > > > Walter > > > > Sure. WRT my earlier comment, I realized that it's simpler for both the > > implementer and the user if the only user-visible change necessary to specify > > begin states is to add a variable to HiddenMarkovBuilder to hold the name of > > the begin state, and then let users use set_transition_score to specify > > transition probabilities from begin states. Then the relevant methods, e.g. > > _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc > > have to be altered to forbid transitions to, or emissions from the begin state. > > And get_markov_model would raise an exception if a begin state hasn't been > > specified or if there isn't at least one transition from the begin state. > > > > So all users would have to do is (using the example from the bug report): > > > > ... > > build.begin_state_name = "begin" > > build.set_transition_score("begin", "u", 0.01) > > build.set_transition_score("begin", "f", 0.99) > > ... > > > > ~Phillip > > > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
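Editorial note on comments #11 and #12: the variation discussed there keeps the begin state out of the state alphabet and supplies initial probabilities separately. A minimal, self-contained Viterbi sketch of that idea follows. It is not the Bio.HMM implementation; the function signature, the reading of "u" and "f" as unfair and fair coins, and the emission numbers are assumptions for illustration (only the 0.95 stay probability and the 0.01/0.99 start probabilities echo figures mentioned in the thread).

```python
import math


def viterbi(observations, states, initial, transition, emission):
    """Return the most probable state path for a non-empty observation list.

    Assumes every probability used is strictly positive, so math.log is safe.
    """
    # Start scores come from the separate initial distribution.
    score = dict(
        (s, math.log(initial[s]) + math.log(emission[s][observations[0]]))
        for s in states
    )
    back_pointers = []
    for symbol in observations[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            # Best predecessor of cur under the scores from the previous step.
            prev = max(states, key=lambda p: score[p] + math.log(transition[p][cur]))
            new_score[cur] = (score[prev] + math.log(transition[prev][cur])
                              + math.log(emission[cur][symbol]))
            pointers[cur] = prev
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative numbers only.
STATES = ["u", "f"]
TRANSITION = {"u": {"u": 0.95, "f": 0.05}, "f": {"u": 0.05, "f": 0.95}}
EMISSION = {"u": {"H": 0.9, "T": 0.1}, "f": {"H": 0.5, "T": 0.5}}

skewed = {"u": 0.01, "f": 0.99}
uniform = {"u": 0.5, "f": 0.5}
print(viterbi(["H"], STATES, skewed, TRANSITION, EMISSION))
print(viterbi(["H"], STATES, uniform, TRANSITION, EMISSION))
```

Running the same single observation under the two initial distributions gives different answers, which is the point made in comment #12: a silently assumed uniform start can return a path the user's model would not.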
From bugzilla-daemon at portal.open-bio.org Sun Feb 6 06:46:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Feb 2011 01:46:56 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102060646.p166kuqY018550@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #13 from walter_gillett at yahoo.com 2011-02-06 01:46 EST ------- I forked biopython, tested and checked in and pushed some improvements to variable naming and comments in the viterbi method, and submitted a pull request for your review. Thanks, Walter (In reply to comment #8) > > I'll happily merge/cherry-pick a simple diff to do that only if you do that on > > github, or apply a patch if you upload it here. > > > > Thanks, > > > > Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Feb 7 12:23:56 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 7 Feb 2011 07:23:56 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Message-ID: <20110207122356.GC18733@sobchak.mgh.harvard.edu> Peter; > The computationally interesting part is matching the primer/adapter/ > barcode to the read (both of which may contain IUPAC ambiguity codes), > which as you point out can be replaced once you have a working > framework for the input, output, trimming, etc. Absolutely. I'd be very happy if you wanted to take the framework in the script and generalize it for different matching. Let me know what I can do to help. > Currently I'm using regular expressions, which is fast enough for my > own needs - and this task could easily be parallelised by breaking > up the input reads. 
Beyond that perhaps something based on > Hamming distances (edit distance - number of mismatches) or > Levenshtein searches might be quicker. I guess speed is more of > an issue with Illumina than with 454 due to the number of reads? > > Brad - you mentioned using approximate matches with gaps. Did you > find gapped matches made a bit difference to the number of matches > found? i.e. is it worthwhile on your data? A large majority of the barcodes are found with exact matching via a dictionary lookup, so the gapped/mismatch alignments are only necessary for the barcodes with sequencing errors. For Illumina reads gaps aren't as common, so the mismatch alignments are more useful but I tried to make it general so as to catch as many cases as possible. Brad From bugzilla-daemon at portal.open-bio.org Tue Feb 8 16:31:38 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 11:31:38 -0500 Subject: [Biopython-dev] [Bug 3176] New: Bio SeqIO 'genbank' parse failure Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3176 Summary: Bio SeqIO 'genbank' parse failure Product: Biopython Version: 1.56 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: sschmidt at tuebingen.mpg.de Hi, the parser stumbles over a Genbank file that contains a feature without values: ___START GenBank File____ LOCUS someVector______ 6127 bp DNA circular 1-OCT-2009 SOURCE ORGANISM COMMENT none FEATURES Location/Qualifiers misc_structure 1564..1566 /ApEinfo_label=ErrorInBioPythonBecauseNoValue /ApEinfo_fwdcolor= /ApEinfo_revcolor= /vntifkey="88" /label=Stop\codon BASE COUNT 15 a 16 c 16 g 13 t ORIGIN 1 gagttccgcg ttacataact tacggtaaat ggcccgcctg gctgaccgcc caacgacccc // __END GenBank file___ The relevant error message: File "/sw/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File 
"/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 188, in parse_features features.append(self.parse_feature(feature_key, feature_lines)) File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 268, in parse_feature elif value[0]=='"': IndexError: string index out of range -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 8 16:45:40 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 11:45:40 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102081645.p18GjeR4025608@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 11:45 EST ------- Where is this problem file coming from? I'm pretty sure the NCBI (nor EMBL/DDBJ) do not use feature qualifiers like that. See: http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html If you are creating the file, why not use /key="" or /key - the later form is used in real GenBank files, e.g. /pseudo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:25:13 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 13:25:13 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102081825.p18IPDgO029696@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #2 from sschmidt at tuebingen.mpg.de 2011-02-08 13:25 EST ------- The file is the product of ApE (http://biologylabs.utah.edu/jorgensen/wayned/ape/). I agree that this format is 'unusual', but the crash could easily be avoided by checking whether a value is defined at all. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:28:18 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 13:28:18 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102081828.p18ISIGG029796@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 13:28 EST ------- Created an attachment (id=1569) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1569&action=view) Handle funny feature annotation Could you test the following patch? Ask if you need help with that - I can stick it on a github branch if that is easier. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
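Editorial note on bug 3176: the traceback ends at value[0]=='"', which raises IndexError when a qualifier such as /ApEinfo_fwdcolor= has nothing after the equals sign. The check suggested in comment #2 can be sketched as below. parse_qualifier is a hypothetical stand-alone function written for illustration; it is neither the code in Bio/GenBank/Scanner.py nor the fix that was eventually committed, and it handles single-line qualifiers only.

```python
def parse_qualifier(line):
    """Split a one-line feature qualifier into (key, value).

    Returns None as the value for flag-style qualifiers such as /pseudo,
    and an empty string for "/key=" with nothing after the equals sign.
    """
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("not a qualifier line: %r" % (line,))
    key, sep, value = text[1:].partition("=")
    if not sep:
        return key, None
    if not value:
        # Guard: never index into an empty string.
        return key, ""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return key, value


print(parse_qualifier("/ApEinfo_fwdcolor="))
print(parse_qualifier('/vntifkey="88"'))
print(parse_qualifier("/pseudo"))
```

Testing for the empty string before looking at value[0] is the whole of the guard; how an empty value should then be represented in the parsed record is a choice this sketch makes arbitrarily.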
From anaryin at gmail.com Tue Feb 8 18:33:56 2011 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 8 Feb 2011 19:33:56 +0100 Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms() Message-ID: Dear All, I've been working on the above-mentioned functions following really great feedback from Eric, Kristian, and Peter. I've been also using them routinely and I've had no problems yet so they should be stable enough. Therefore I think they can be cherry-picked from my pdb_enhancements branch and added to the main branch. Let me know what you think. Cheers, João From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:54:28 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 13:54:28 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102081854.p18IsSbo030923@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #4 from sschmidt at tuebingen.mpg.de 2011-02-08 13:54 EST ------- Hmm, I patched the code and same error message. What about handling this problem at Bio/GenBank/Scanner.py directly? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 22:25:58 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Feb 2011 17:25:58 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102082225.p18MPwXR006718@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1569 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 17:25 EST ------- (From update of attachment 1569) Sorry, must have uploaded the wrong patch - this was a work in progress for the GenBank between location bug. From bugzilla-daemon at portal.open-bio.org Wed Feb 9 10:47:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 05:47:33 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091047.p19AlX92029443@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-09 05:47 EST ------- Committed: https://github.com/biopython/biopython/commit/07b6c12cf18d41749918e29b1bbc4a58a18e1180 Can you try the trunk? See http://www.biopython.org/wiki/SourceCode
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 14:19:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 09:19:46 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091419.p19EJkjK011310@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 ------- Comment #7 from sschmidt at tuebingen.mpg.de 2011-02-09 09:19 EST ------- (using 07b6c12cf18d41749918e29b1bbc4a58a18e1180) works like a charm. Thanks Peter, should've come up with a similar solution From bugzilla-daemon at portal.open-bio.org Wed Feb 9 14:20:22 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Feb 2011 09:20:22 -0500 Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure In-Reply-To: Message-ID: <201102091420.p19EKMsg011354@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3176 sschmidt at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from sschmidt at tuebingen.mpg.de 2011-02-09 09:20 EST ------- done
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 14:05:33 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Feb 2011 09:05:33 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102101405.p1AE5Xkl029071@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-10 09:05 EST ------- (In reply to comment #7) > NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in > my file. What I said about Nucleotide still applies, though. > Yes, you're right. My mistake, NT_015926.15 was the last good record. Had you noticed this was the last gene in this record? It runs right up to the end of the sequence and beyond (missing the rightmost end, i.e. the 5' start of the gene, since it is on the reverse strand). From the FTP site:

LOCUS       NT_022184           68452323 bp    DNA     linear   CON 28-OCT-2010
DEFINITION  Homo sapiens chromosome 2 genomic contig, GRCh37.p2 reference
            primary assembly.
...
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452073^68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452072^68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="GeneID:28916"
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

If we look at the record via Entrez, http://www.ncbi.nlm.nih.gov/nuccore/NT_022184.15?report=gbwithparts

     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

So this appears to have been updated to avoid the funny caret location, but I think they made a mistake - surely the CDS should be complement(68451760..>68452073) not complement(<68451760..68452073) as stated? Have you contacted the NCBI about this? If not, I will. I believe that the caret location in the FTP GenBank file is invalid and Biopython is right to reject it (but I would like to confirm this with the NCBI). For now the simplest solution is for you to manually edit that feature. Thanks, Peter
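As a rough illustration of the manual edit suggested above, the following sketch rewrites a location of the form `A..B^C` (as in the FTP file) to `A..C` (as shown via Entrez). The helper name `fix_caret_location` and the regular expression are assumptions made for this example; they are not part of Biopython, and a plain between-location such as `123^124` is deliberately left alone.

```python
import re

# Matches "..B^C" (optionally with a < or > fuzzy marker after the dots),
# i.e. a caret that follows a range end, as in "68451760..68452073^68452074".
_CARET_AFTER_RANGE = re.compile(r"(\.\.[<>]?)\d+\^(\d+)")


def fix_caret_location(location):
    """Rewrite 'A..B^C' as 'A..C'; other location strings are returned unchanged."""
    return _CARET_AFTER_RANGE.sub(r"\1\2", location)


if __name__ == "__main__":
    for loc in ("complement(68451760..68452073^68452074)",
                "complement(<68451760..68452072^68452073)",
                "123^124"):
        print(loc, "->", fix_caret_location(loc))
```

Whether `A..C` is the biologically correct reading of the original annotation is exactly the open question for the NCBI, so this only mirrors what the Entrez view shows.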
From p.j.a.cock at googlemail.com Thu Feb 10 15:10:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Feb 2011 15:10:19 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110207122356.GC18733@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> Message-ID: On Mon, Feb 7, 2011 at 12:23 PM, Brad Chapman wrote: > Peter; > >> The computationally interesting part is matching the primer/adapter/ >> barcode to the read (both of which may contain IUPAC ambiguity codes), >> which as you point out can be replaced once you have a working >> framework for the input, output, trimming, etc. > > Absolutely. I'd be very happy if you wanted to take the framework in > the script and generalize it for different matching. Let me know > what I can do to help. Do you have (or can you point me at) any good sample data with barcodes, or custom adapters or primer sequences? e.g. some SRA numbers you've been using. >> Currently I'm using regular expressions, which is fast enough for my >> own needs - and this task could easily be parallelised by breaking >> up the input reads. Beyond that perhaps something based on >> Hamming distances (edit distance - number of mismatches) or >> Levenshtein searches might be quicker. I guess speed is more of >> an issue with Illumina than with 454 due to the number of reads? I originally had three separate tools (with shared code) for working with FASTA, FASTQ and SFF reads, which I have recently combined into one single tool that does all three. Code here if anyone wants to look at it. https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ seq_primer_clip.py - Python script seq_primer_clip.xml - Galaxy wrapper seq_primer_clip.txt - readme file This is still a work in progress... 
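For readers following the regular-expression approach mentioned above, here is a minimal sketch of turning a primer with IUPAC ambiguity codes into a compiled pattern and clipping a read with it. It is not the code from seq_primer_clip.py; the names `primer_to_regex` and `clip_primer` are hypothetical, and it only handles ambiguity codes in the primer, not in the read.

```python
import re

# IUPAC nucleotide codes mapped to the unambiguous bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer into a regex, using a character class per ambiguous base."""
    parts = []
    for letter in primer.upper():
        bases = IUPAC[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def clip_primer(read, primer):
    """Return the read after the first primer match, or None if there is no match."""
    match = primer_to_regex(primer).search(read.upper())
    if match is None:
        return None
    return read[match.end():]


if __name__ == "__main__":
    print(primer_to_regex("ACRT").pattern)
    print(clip_primer("GGACGTTTT", "ACRT"))
```

Exact matching like this tolerates no mismatches; a Hamming or Levenshtein based search, as discussed in the thread, would be the next step if sequencing errors in the primer region matter.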
Peter From bugzilla-daemon at portal.open-bio.org Thu Feb 10 20:02:42 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Feb 2011 15:02:42 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201102102002.p1AK2g6g017745@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #14 from walter_gillett at yahoo.com 2011-02-10 15:02 EST ------- I have checked in a fix on my github branch to the bug mentioned in comment #4: in the Viterbi recursion to determine state path probabilities, we must consider states that lead *to* the current state, not those that are reachable *from* it. See comments for this checkin: https://github.com/wgillett/biopython/commit/f8b0b94ad7ffadbf9aa923bc6273822328cb9f01 . Forgot to mention in the comments that I also fixed a bug in the allow_transition method and added a unit test for that method. From bugzilla-daemon at portal.open-bio.org Thu Feb 10 23:07:21 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Feb 2011 18:07:21 -0500 Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54 In-Reply-To: Message-ID: <201102102307.p1AN7Lu0025588@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3175 ------- Comment #9 from aaron.tin.long.lun at gmail.com 2011-02-10 18:07 EST ------- Thanks Peter, will do so. Cheers, Aaron
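On the Viterbi point in Bug 2947 comment #14 above, a small textbook-style sketch may help show why the recursion has to look at predecessors. This is not the Bio.HMM code or its API; the function `viterbi`, the dictionary layout and the two-state example model are all assumptions chosen for illustration, and plain probabilities are used instead of logs, so it would underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for a small HMM."""
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for cur in states:
            # Key step: maximise over states that lead TO cur (trans_p[prev][cur]),
            # not over states reachable FROM cur (trans_p[cur][...]).
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][cur])
            best[t][cur] = (best[t - 1][prev] * trans_p[prev][cur]
                            * emit_p[cur][observations[t]])
            back[t][cur] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state model with asymmetric transitions.
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

if __name__ == "__main__":
    print(viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT))
```

Because the transition matrix here is not symmetric, swapping `trans_p[p][cur]` for `trans_p[cur][p]` changes the scores, which is the kind of error the comment describes.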
From p.j.a.cock at googlemail.com Fri Feb 11 09:30:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Feb 2011 09:30:02 +0000 Subject: [Biopython-dev] Fwd: [GitHub] Viterbi algorithm bug fix: consider states that lead *to* the current state, not reachable *from* it [biopython/biopython GH-3] In-Reply-To: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail> References: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail> Message-ID: Hi Brad, Do you want to look at this HMM fix too? http://bugzilla.open-bio.org/show_bug.cgi?id=2947 Also who else is getting the github pull requests? We should probably send them to the dev list, but I can't find the settings right now on GitHub... Peter ---------- Forwarded message ---------- From: GitHub Date: Thu, Feb 10, 2011 at 7:58 PM Subject: [GitHub] Viterbi algorithm bug fix: consider states that lead *to* the current state, not reachable *from* it [biopython/biopython GH-3] To: p.j.a.cock at googlemail.com wgillett wants someone to pull from wgillett:master: Bug fix related to bug #2947. Please review and commit if it's OK. Thanks, Walter Gillett View Pull Request: https://github.com/biopython/biopython/pull/3 From chapmanb at 50mail.com Mon Feb 14 13:01:10 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 14 Feb 2011 08:01:10 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> Message-ID: <20110214130110.GA12340@sobchak.mgh.harvard.edu> Peter; > Do you have (or can you point me at) any good sample data with > barcodes, or custom adapters or primer sequences? e.g. some SRA > numbers you've been using. This is a subset of two lanes from a barcoded flowcell for testing purposes: http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz It has 12 barcoded samples, using the Illumina barcodes. 
The sequences are in this YAML file: https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml > I originally had three separate tools (with shared code) for working > with FASTA, FASTQ and SFF reads, which I have recently combined > into one single tool that does all three. Code here if anyone wants to > look at it. > > https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ Very nice. It would be great to get something general for barcode splitting as a Galaxy tool. Thanks for looking at this, Brad From p.j.a.cock at googlemail.com Mon Feb 14 13:19:45 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Feb 2011 13:19:45 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110214130110.GA12340@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> <20110207122356.GC18733@sobchak.mgh.harvard.edu> <20110214130110.GA12340@sobchak.mgh.harvard.edu> Message-ID: On Mon, Feb 14, 2011 at 1:01 PM, Brad Chapman wrote: > Peter; > >> Do you have (or can you point me at) any good sample data with >> barcodes, or custom adapters or primer sequences? e.g. some SRA >> numbers you've been using. > > This is a subset of two lanes from a barcoded flowcell for testing > purposes: > > http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz > > It has 12 barcoded samples, using the Illumina barcodes. The > sequences are in this YAML file: > > https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml > Great :) >> I originally had three separate tools (with shared code) for working >> with FASTA, FASTQ and SFF reads, which I have recently combined >> into one single tool that does all three. Code here if anyone wants to >> look at it. >> >> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/ > > Very nice. It would be great to get something general for barcode > splitting as a Galaxy tool. 
Thanks for looking at this, > Brad Yes - assuming what they have already isn't good enough (at the very least the Galaxy barcode wrapper for fastx currently only handles fastq-solexa but I think that can be fixed). http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html I've been focused on the PCR case where my sequences have got IUPAC ambiguity characters. For barcodes that shouldn't be an issue, but instead you may have more than one barcode and will want one output file per barcode (although not usually as complicated as Kevin's setup). I need to learn more about how Galaxy handles multiple outputs before commenting on that. Peter From tiagoantao at gmail.com Wed Feb 16 16:40:10 2011 From: tiagoantao at gmail.com (Tiago Antão) Date: Wed, 16 Feb 2011 16:40:10 +0000 Subject: [Biopython-dev] New URL for integration testing Message-ID: Hello all, Buildbot integration testing has been moved to a, hopefully, more stable location. If you are interested, please have a look at: http://testing.open-bio.org/ The old URL at events.open-bio.org is no more. Regards, Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From anaryin at gmail.com Thu Feb 17 12:59:16 2011 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 17 Feb 2011 13:59:16 +0100 Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms() In-Reply-To: References: Message-ID: Hey Kristian, To Tests/test_pdb.py? Just to make sure that the renumbering acts on both accordingly? I agree.
João From krother at rubor.de Thu Feb 17 12:54:38 2011 From: krother at rubor.de (Kristian Rother) Date: Thu, 17 Feb 2011 13:54:38 +0100 Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(), remove_disordered_atoms() In-Reply-To: References: Message-ID: Hi Joao, I think we should add a simple test function that ensures consistency of child_dict and child_list upon renumbering. Let me know if you'd prefer me to explain in Python what I mean. Kristian > Dear All, > > I've been working on the above-mentioned functions following really great > feedback from Eric, Kristian, and Peter. I've been also using them > routinely > and I've had no problems yet so they should be stable enough. Therefore I > think they can be cherry-picked from my pdb_enhancements branch and added > to > the main branch. Let me know what you think. > > Cheers, > > João > From b.invergo at gmail.com Tue Feb 22 16:40:01 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Tue, 22 Feb 2011 17:40:01 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Hi everyone, I've been toiling away on the PAML API and I think it's finally ready for review. If anyone's willing to give my code a review, here's my branch: https://github.com/brandoninvergo/biopython/tree/paml-branch (the API is in Bio/Phylo/PAML, as suggested before, and the tests are in Tests, with their supporting files in Tests/PAML) I'll also post a message to the Biopython user list to see if anyone would be willing to give it a test drive. Some notes: - I've implemented Codeml, Baseml/Basemlg and Yn00. I have not yet done anything with Mcmctree because I am completely ignorant about what information to extract from the output files.
The other two programs in the package, Evolver and Chi2, do not accept commandline options and are instead operated by a rudimentary commandline interface, so they aren't really compatible with scripting. - Chi2 is useful, though, because it provides a chi^2 CDF, which you can use in performing maximum likelihood ratio tests, an important part of using the PAML programs. Since Python doesn't have a chi^2 cumulative distribution function in its standard library, I ported the original C code rather than writing a function which simply calls the original, with the permission of Ziheng Yang (the original author; this is mentioned in the code's comments, but he required no other licensing/copyright verbiage to be included). This was no easy task, considering the C code was littered with goto statements. Anyway, this will prevent the user from having to install/import an outside package to do the tests (I personally had been using Rpy2 to call the R function pchisq()....complete overkill). Let me know if this is ok or if this causes some kind of conflict. - The output of the programs varies widely with the combinatorics of the parameters and possibly between versions. I tried to include all possible output files in the Tests/PAML directory and I wrote test cases to check that they're properly parsed (with the testing of future versions in mind). So, that Tests/PAML folder has a lot more in it than the usual test folders, but I felt there was no other option. I tried to make it organized. I think those are the main points for now. I'd assume that there's more work to be done before I should perform a pull request, so I'll simply ask for your comments for now if you have the time.
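To make the likelihood ratio test use case above concrete, here is a small independent sketch of a chi-square upper-tail probability in pure Python. It is not the ported PAML Chi2 code discussed in the message; the names `chi2_sf` and `lrt_p_value` are made up for this example, it relies on `math.erfc` from the Python 3 standard library, and it only supports integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: start from the df = 1 case, erfc(sqrt(x/2)), then add the
    # terms (x/2)^(k - 1/2) * exp(-x/2) / Gamma(k + 1/2) for k = 1, 2, ...
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total


def lrt_p_value(lnl_null, lnl_alt, df):
    """P-value for a likelihood ratio test given two log-likelihoods."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf(statistic, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods for two nested models differing by 2 parameters.
    print(lrt_p_value(-1000.0, -997.0, 2))
```

The log-likelihood values in the usage lines are invented; in practice they would come from the parsed output of the null and alternative model runs.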
Cheers, Brandon Invergo On Sun, Jan 16, 2011 at 4:09 PM, Peter Cock wrote: > On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote: >> Hi everyone, >> A quick question about style: since the name "codeml" is based on a >> program which is always spelled either in all caps or in all >> lower-case, what would be the best way to write the class name >> regarding capitalization? Stick with the usual camel-case convention, >> "Codeml", anyway? > > I'd go with Codeml for a class name (or something like > CodemlResult or whatever). Neither CODEML nor codeml > seem good class names in Python. > >> Things are progressing nicely. I've already taken care of a lot of the >> minor tasks and improvements... > > Sounds good :) > > Peter > From clementsgalaxy at gmail.com Tue Feb 22 17:16:12 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Tue, 22 Feb 2011 09:16:12 -0800 Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands In-Reply-To: References: Message-ID: Hello all, Just a reminder that the abstract submission deadline for the Galaxy Community Conference is next Monday, February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html for details. Cheers, Dave C. On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote: > We are pleased to announce the *2011 Galaxy Community Conference*, being > held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature > two full days of presentations and discussion on extending Galaxy to use new > tools and data sources, deploying Galaxy at your organization, and best > practices for using Galaxy to further your own and your community's > research. See http://galaxy.psu.edu/gcc2011/ for complete details. > > *About Galaxy:* > Galaxy is an open, web-based platform for *accessible, reproducible, and > transparent* computational biomedical research. > > - *Accessibility:* Galaxy enables users without programming experience > to easily specify parameters and run tools and workflows.
> - *Reproducibility:* Galaxy captures all information necessary so that > any user can repeat and understand a complete computational analysis. > - *Transparency:* Galaxy enables users to share and publish analyses > via the web and create Pages--interactive, web-based documents that describe > a complete analysis. > > Galaxy is open source for all organizations. The public Galaxy service ( > http://usegalaxy.org) makes analysis tools, genomic data, > tutorial demonstrations, persistent workspaces, and publication services > available to any scientist that has access to the Internet. Local > Galaxy servers can be set up by downloading the Galaxy application and > customizing it to meet particular needs. > > *Conference Overview:* > > This event aims to engage a broader community of developers, data > producers, tool creators, and core facility and other research hub staff to > become an active part of the Galaxy community. We'll cover defining > resources in the Galaxy framework, increasing their visibility and making > them easier to use and integrate with other resources, how to extend Galaxy > to use custom data sources and custom tools, and best practices for using > Galaxy in your organization. > > Additional topics include, but are not limited to: > * Talks submitted by the Galaxy community > * Integration of tools (including NGS analysis tools) and distributed job > management > * Deployment of Galaxy instances on local resources and on the Cloud > * Management of large datasets with the Galaxy Library System > * Using the Galaxy LIMS functionality at NGS sequencing facilities > * Visualizing Data without leaving Galaxy > * Performing reproducible research > * Performing and sharing complex analyses with Workflows > * An "Introduction to Galaxy" session, offered on May 24, for Galaxy > newcomers. > > *Registration:* > > The conference fee is €100 on or before April 24, and €120 after that.
The > meeting is being held at the Conference Centre De Werelt in Lunteren, The > Netherlands, which is also the conference hotel. You are encouraged to > register early, as space at the hotel (and at the "Intro to Galaxy" session) > is limited and is likely to fill up before the conference itself does. See > http://galaxy.psu.edu/gcc2011/Register.html > > *Abstract Submission:* > > Abstracts are now being accepted for short oral presentations. Proposals > on any topic of interest to the Galaxy community are welcome and > encouraged. The abstract submission deadline is the end of February 28. > See http://galaxy.psu.edu/gcc2011/Abstracts.html > > *Sponsors* > > The 2011 Galaxy Community Conference is co-sponsored by the US National > Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands > Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a > collaborative institute of the bioinformatics groups in the Netherlands. > Together, these groups perform cutting-edge research, develop novel tools > and support platforms, create an e-science infrastructure and educate the > next generations of bioinformaticians. > > We are looking forward to a great conference and hope to see you in the > Netherlands!
> > The Galaxy and NBIC Teams > > -- > http://galaxy.psu.edu/gcc2011/ > http://getgalaxy.org > http://usegalaxy.org/ > -- http://galaxy.psu.edu/gcc2011/ http://getgalaxy.org http://usegalaxy.org/ From bugzilla-daemon at portal.open-bio.org Tue Feb 22 18:06:48 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Feb 2011 13:06:48 -0500 Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml In-Reply-To: Message-ID: <201102221806.p1MI6mvd015443@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3170 ------- Comment #1 from b.invergo at gmail.com 2011-02-22 13:06 EST ------- I've forked the repository on github and I've created a branch containing the new code: https://github.com/brandoninvergo/biopython/tree/paml-branch From p.j.a.cock at googlemail.com Wed Feb 23 09:24:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 23 Feb 2011 09:24:21 +0000 Subject: [Biopython-dev] [Biopython] Biopython library for muliple sequence alignment In-Reply-To: <001501cbd324$c70a8570$551f9050$@jp> References: <001501cbd324$c70a8570$551f9050$@jp> Message-ID: On Wed, Feb 23, 2011 at 6:42 AM, Rojan Shrestha wrote: > Hello: > > I want to do multiple sequence alignment using CLUSTW. Instead of > standalone, I would like to use in my own program through biopython. I would > like to know that whether biopython has clustw function or not. It would be > very good if somebody gives information about this. > > Regards, > > Rojan Hello Rojan, Biopython (and BioPerl too I believe) doesn't have any multiple sequence alignment code itself. Biopython does have pairwise sequence alignment code (with a fast implementation in C).
Instead (again, like BioPerl) Biopython has a wrapper and parser for calling the ClustalW command line tool from within your script and loading its output. Similarly for other alignment tools like Muscle. If you really want to be able to modify the multiple sequence alignment code itself, some of these command line tools are open source. Also, I *think* that BioJava has some code for this. I don't know what BioRuby does. Peter P.S. You only really need to ask this on the Biopython Discussion List. Since you included the OBF cross project list I have tried to comment on how the other projects handle this as well. From updates at feedmyinbox.com Wed Feb 23 09:26:36 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 23 Feb 2011 04:26:36 -0500 Subject: [Biopython-dev] 2/23 active questions tagged biopython - Stack Overflow Message-ID: <64da3e945fd7631143a0bbd0fdd84e55@74.63.51.88> // Biopython CodonTable error? // February 18, 2011 at 3:02 PM http://stackoverflow.com/questions/5045967/biopython-codontable-error Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids and I am seeing some strange translation from the Biopython 1.56 package. It appears to be translating ambiguous DNA codes to 'J' which does not exist as a code for anything. I am running python 2.6.1 on Mac OS 10.6.6. For example: >>>from Bio.Seq import * >>>translate('ARAWTAGKAMTA') 'XJXJ' or >>>from Bio.Seq import Seq >>>c = Seq('ARAWTAGKAMTA') >>>c.translate().tostring() 'XJXJ' I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas? Thanks!
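One possible reading of the 'XJXJ' result above: in the extended IUPAC protein alphabet, J stands for "leucine or isoleucine", and both WTA (ATA or TTA) and MTA (ATA or CTA) can only encode Ile or Leu. The sketch below reproduces that reasoning in plain Python with the standard genetic code; it does not use Biopython, the function `translate_ambiguous` is a hypothetical name, and how Bio.Seq arrives at its answer internally is an assumption here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Ambiguous amino acid codes: B = Asp/Asn, Z = Glu/Gln, J = Ile/Leu.
AMBIGUOUS_PROTEIN = {frozenset("DN"): "B", frozenset("EQ"): "Z", frozenset("IL"): "J"}


def translate_ambiguous(dna):
    """Translate DNA with IUPAC codes, using B/Z/J where they fit and X otherwise."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3].upper()
        # Expand the codon into every unambiguous codon it could stand for.
        options = product(*(IUPAC_DNA[base] for base in codon))
        residues = frozenset(CODON_TABLE["".join(option)] for option in options)
        if len(residues) == 1:
            protein.append(next(iter(residues)))
        else:
            protein.append(AMBIGUOUS_PROTEIN.get(residues, "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate_ambiguous("ARAWTAGKAMTA"))
```

Under this reading ARA (Lys or Arg) and GKA (Gly or Val) have no dedicated ambiguity code and fall back to X, which matches the pattern reported in the question.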
Mark -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active From updates at feedmyinbox.com Wed Feb 23 09:26:36 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Wed, 23 Feb 2011 04:26:36 -0500 Subject: [Biopython-dev] 2/23 biopython Questions - BioStar Message-ID: // MuscleCommandline not writing file // February 22, 2011 at 2:34 PM http://biostar.stackexchange.com/questions/5787/musclecommandline-not-writing-file I'm trying to work through the Biopython tutorial on multiple sequence alignment and get an error whenever I try to use subprocess: child = subprocess.Popen(str(cline), stdout = subprocess.PIPE, stderr = subprocess.PIPE, shell = (sys.platform!="win32")) I get this error: Traceback (most recent call last): File "", line 2, in stdout = subprocess.PIPE) File "C:\Python27\lib\subprocess.py", line 672, in __init__ errread, errwrite) File "C:\Python27\lib\subprocess.py", line 882, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified I've gone so far as to copy and paste the tutorial into the interpreter and no luck. Neither ClustalW nor Muscle is writing the alignment files (I tried the deprecated MultipleAlignCL as well with no luck). I'm using Python v2.7 and Biopython v1.55 and have tried reinstalling both. Any advice?
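A `WindowsError: [Error 2]` from `subprocess.Popen` usually means the program named at the start of the command line could not be found, so one thing worth checking is whether the alignment tool is actually on the PATH. The sketch below shows that check; it assumes Python 3 (`shutil.which` and `subprocess.run` are not available in the Python 2.7 used in the question), and `run_tool` is a hypothetical helper, not part of Biopython.

```python
import shutil
import subprocess


def run_tool(args):
    """Run a command given as a list of strings, failing early if it is not installed.

    Returns (return code, stdout text, stderr text).
    """
    exe = shutil.which(args[0])
    if exe is None:
        raise FileNotFoundError(
            "%r was not found on PATH; install it or give the full path "
            "to the executable" % args[0])
    result = subprocess.run([exe] + list(args[1:]),
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    try:
        # "muscle" is only an example name; it may not be installed here.
        print(run_tool(["muscle", "-version"]))
    except FileNotFoundError as err:
        print("Tool missing:", err)
```

If the executable is installed but not on the PATH, passing its full path as the first list item (or as the command name given to the Biopython wrapper) is the usual workaround.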
-- Website: http://biostar.stackexchange.com/questions/tagged/biopython From chapmanb at 50mail.com Wed Feb 23 13:11:51 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 23 Feb 2011 08:11:51 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: <20110223131151.GE4922@sobchak.mgh.harvard.edu> Brandon; > I've been toiling away on the PAML API and I think it's finally ready > for review. If anyone's willing to give my code a review, here's my > branch: > https://github.com/brandoninvergo/biopython/tree/paml-branch This is awesome; thanks much for all the work getting this together. It's really great to see the extensive tests. I'm also impressed with your story of porting over 'goto' statements; it's been a while since those have entered my mind:
10 PRINT "CHI SQUARE FOREVER"
20 FLASH
30 GOTO 10
A couple of more general thoughts about your code: - There looks to be a lot of shared functionality between codeml, baseml and yn00 in setting up the control files. Would it be possible to create a base class that these all inherit from? This would make the code much easier to maintain over time as formats change. - Your 'read' functions get pretty deeply nested, especially the codeml parser. What do you think about creating an internal class to split some of the parsing logic into individual functions? A nice example is the GenBank/Scanner.py code. Having functions like parse_header/parse_features makes it much easier for someone not deeply familiar with your code to start to make guesses at where different functionality exists.
This way, if the format changes others can provide patches and feedback to you.

Overall this is great and all the work is much appreciated.

Brad

From chapmanb at 50mail.com Thu Feb 24 18:26:26 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 24 Feb 2011 13:26:26 -0500
Subject: [Biopython-dev] BOSC 2011 topic organizers and Codefest
Message-ID: <20110224182626.GM20125@sobchak.mgh.harvard.edu>

Hi all;

This year the Bioinformatics Open Source Conference (BOSC) will be taking place in Vienna, Austria on July 15-16th. This is a yearly opportunity for open source bioinformatics developers to get together in person and discuss on-going projects. Nomi Harris, Peter Rice and the other organizing committee members are already hard at work planning for the conference:

http://www.open-bio.org/wiki/BOSC_2011

The call for abstracts opens next Monday, and extends through April 18th, and we've been brainstorming potential session topics. This year we've tried to focus each of the sessions around a particular biological problem or computational approach. We hope this will draw some interesting parallels between work being done in different groups, and encourage even more collaboration.

We are actively looking for community members who are interested in heading up the organization of a topic. The general idea is to build a cohesive set of talks within a session. How you'd like to do this is completely flexible but some of the ideas we've been discussing are:

- Having a short introductory talk to provide an overview of an area, framing the different talks within this context.

- Forgoing individual question/answer and instead combining this time into a longer panel-style discussion with all of the speakers. This would help stimulate back and forth between the different projects and the audience.

If you are interested in a particular topic and would like to help with the organization, please send an e-mail to the BOSC mailing list: bosc at lists.open-bio.org.
We're also open to new topic suggestions, and will look to add one or two more topics to our current list.

Finally, there will be a two day coding session prior to BOSC as a follow up to last year's fun and productive Codefest:

http://www.open-bio.org/wiki/Codefest_2011

The Metalab, a unique hacker space in Vienna, has kindly agreed to host us for the two days. If you are at all interested, please add your name to the attendees list on the wiki. Since the Metalab organizers don't know us personally, we'd like to demonstrate there is interest and that we'll really show up with a bunch of bioinformatics hackers. More details will be in the works as the summer draws closer.

Looking forward to the sound of music,
Brad

From b.invergo at gmail.com Fri Feb 25 16:57:19 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Fri, 25 Feb 2011 17:57:19 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID:

Hi Brad,

Thanks for your response! It's taken me a day or two to think about what you wrote (also balancing a PhD with the hobby projects at the moment...)

> It's really great to see the extensive tests. I'm also impressed
> with your story of porting over 'goto' statements; it's been a while
> since those have entered my mind:

To be honest, I forgot they existed. Seeing them immediately made the computer scientist in me cringe. They really confused the whole structure of the program but in the end they were solved quite easily with some carefully placed loops and conditional blocks!

> - There looks to be a lot of shared functionality between codeml,
>   baseml and yn00 in setting up the control files. Would it be
>   possible to create a base class that these all inherit from? This
>   would make the code much easier to maintain over time as formats
>   change.

This is a really good idea and I'm a bit disappointed that I didn't see it myself! Indeed, most of the functionality is just copied/pasted between the classes, with only some variation in the read/write_ctl_file functions for codeml and baseml. So, writing a base class would really simplify things. I do have one question, though, since this is my first time organizing my code in a large-scale Python project. Where would be the best place to implement this base paml class? In __init__.py or in its own paml.py file? I know the end result would be the same but I figure I should start learning some of these best practices.

> - Your 'read' functions get pretty deeply nested, especially the
>   codeml parser. What do you think about creating an internal class
>   to split some of the parsing logic into individual functions? A
>   nice example is the GenBank/Scanner.py code. Having functions like
>   parse_header/parse_features makes it much easier for someone not
>   deeply familiar with your code to start to make guesses at where
>   different functionality exists. This way, if the format changes
>   others can provide patches and feedback to you.

I'm not so sure about this, mainly because of the way the output files are formatted. For example, the most common usage of codeml (the most common program of the bunch) is to run with several "NSsites" models. If you do this, the output file is separated into segments which are headed by a line that says something like "Model 2: PositiveSelection", and the model parameters are printed out below. However, if you only run with one model, which is also a common usage, you no longer have these convenient headers and instead at the very top of the output file is a completely different indication of which model was used, but which is inconveniently missing if only model 0 was run.
In other cases, such as amino acid sequence analysis, pairwise nucleotide sequence or multiple gene analyses, there's no header whatsoever indicating which kind of output file you're looking at. Instead, you just have to search for particular data patterns to parse.

This mess is precisely why I had to include so many different output files for the unittesting (codeml is the main culprit; baseml is moderately bad; yn00 isn't a problem).

So, because I would potentially end up scanning almost the entire file just to figure out what's going on, I think just parsing-as-you-go, using elif statements to short-circuit and skip further evaluations of a line after a match has been found, would be the better option. Perhaps the files aren't long enough to be able to make an appeal for computational efficiency but at the same time, I hesitate to read through the file multiple times unnecessarily. I agree, though, that this makes the read() function quite long. For that, though, I tried to provide descriptive comments before each parsing case, describing exactly what the next block of code is meant to parse and also including a specific example line which should be parsed by it.

That said, I will take another look at the output files to see if there could be another way of implementing it. Without a doubt, the parsing is the most difficult part of implementing this module; the rest of it is quite trivial. So, best to do it right!

> Overall this is great and all the work is much appreciated.

Thanks! It's been a fun side project for me.

Cheers,
Brandon

ps - I still haven't sent a message to the main Biopython list while I consider implementing at least the first suggestion above, since it would involve large changes that might cause me to accidentally break something!
I'll wait until I'm a bit more confident that it's close to the final product.

From updates at feedmyinbox.com Mon Feb 28 09:21:17 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Mon, 28 Feb 2011 04:21:17 -0500
Subject: [Biopython-dev] 2/28 active questions tagged biopython - Stack Overflow
Message-ID: <348d58cdbd9ae31e700023c354ca3ce6@74.63.51.88>

// Convert nested dictionary/xml to flat file for sqlite
// February 27, 2011 at 11:25 AM

http://stackoverflow.com/questions/5134334/convert-nested-dictionary-xml-to-flat-file-for-sqlite

Hiya- I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask... (Btw, much of this is new to me- not all, just most.)

Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite.

Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed. Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process. This is about where my practical understanding ends.

That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null? Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...

Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
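The process described above (one row per paper for single-valued fields, one child table per multi-valued field keyed on PMID, popping keys as they are handled) could be sketched roughly as follows. This is only an illustration under assumptions: the record shape with MEDLINE-style keys ("PMID", "TI", "AB", "AU", "MH") is assumed rather than taken from the post, and the table names and the `flatten_citations` function are hypothetical.

```python
import sqlite3

# Multi-valued fields, each mapped to its own child table (names are made up).
LIST_FIELDS = {"AU": "author", "MH": "mesh_heading"}


def flatten_citations(records, conn):
    """Store citation dicts in normalized tables.

    Returns {pmid: leftover_dict} holding every key that was NOT consumed,
    so it is easy to see what still needs a table of its own.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS paper "
        "(pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    for table in LIST_FIELDS.values():
        cur.execute(
            "CREATE TABLE IF NOT EXISTS %s "
            "(pmid TEXT REFERENCES paper(pmid), position INTEGER, value TEXT)"
            % table
        )
    leftovers = {}
    for record in records:
        rec = dict(record)  # copy, so the caller's dictionary is untouched
        pmid = rec.pop("PMID")
        # Missing single-valued fields (e.g. no abstract) are stored as NULL.
        cur.execute(
            "INSERT INTO paper VALUES (?, ?, ?)",
            (pmid, rec.pop("TI", None), rec.pop("AB", None)),
        )
        for key, table in LIST_FIELDS.items():
            for position, value in enumerate(rec.pop(key, [])):
                cur.execute(
                    "INSERT INTO %s VALUES (?, ?, ?)" % table,
                    (pmid, position, value),
                )
        if rec:
            leftovers[pmid] = rec
    conn.commit()
    return leftovers
```

Working on a copy and returning the unconsumed keys gives the "see what's been done" view without destroying the original data; the `position` column keeps author order, which matters if affiliation is tied to the first author.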
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.

Many thanks,
chris

From chapmanb at 50mail.com Mon Feb 28 16:35:21 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Feb 2011 11:35:21 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID: <20110228163521.GF9652@sobchak.mgh.harvard.edu>

Brandon;

[pypaml branch: https://github.com/brandoninvergo/biopython/tree/paml-branch]

[base class]

> This is a really good idea and I'm a bit disappointed that I didn't
> see it myself! Indeed, most of the functionality is just copied/pasted
> between the classes, with only some variation in the
> read/write_ctl_file functions for codeml and baseml. So, writing a
> base class would really simplify things. I do have one question,
> though, since this is my first time organizing my code in a
> large-scale Python project. Where would be the best place to implement
> this base paml class? In __init__.py or in its own paml.py file? I
> know the end result would be the same but I figure I should start
> learning some of these best practices.

It's always easier to get perspective on code when you haven't been directly in the middle of it.
Even if you don't have someone to do code reviews, stepping away from a project and coming back later will often lead to a bunch of insights.

For the base class, I would follow Eric and Peter's example and use files in the same directory with an underscore: something like _shared.py or _base.py.

[read functions]

> This mess is precisely why I had to include so many different
> output files for the unittesting (codeml is the main culprit; baseml
> is moderately bad; yn00 isn't a problem)

I definitely feel your pain on this. This is exactly why your work doing this is appreciated; you'll save someone a lot of headache later on.

> So, because I would potentially end up scanning almost the entire file
> just to figure out what's going on, I think just parsing-as-you-go,
> using elif statements to short-circuit and skip further evaluations of
> a line after a match has been found, would be the better option.
> Perhaps the files aren't long enough to be able to make an appeal for
> computational efficiency but at the same time, I hesitate to read
> through the file multiple times unnecessarily. I agree, though, that
> this makes the read() function quite long. For that, though, I tried
> to provide descriptive comments before each parsing case, describing
> exactly what the next block of code is meant to parse and also
> including a specific example line which should be parsed by it.

The issue really is that deeply nested code is hard to read, long functions are hard to read, and when you combine them together it just makes it very difficult for others to follow your logic. I don't think you necessarily have to make multiple passes to parse it in a more structured way, but what you would want to focus on is making the flow through the function simpler. The way I would normally attack this is to break components into smaller, more re-usable functions.
Here's a concrete example from the start of the codeml parser:

https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py

    siteclass_re = re.match("Site-class models:\s*(.*)", line)
    if siteclass_re is not None:
        siteclass_model = siteclass_re.group(1)
        if siteclass_model == "":
            multi_models = True
            continue
        results["site-class model"] = siteclass_model
        if siteclass_model == "NearlyNeutral":
            current_model = 1
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "PositiveSelection":
            current_model = 2
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "discrete (4 categories)":
            current_model = 3
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "beta (4 categories)":
            current_model = 7
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "beta&w>1 (5 categories)":
            current_model = 8
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]

You could refactor this something along the lines of:

    import re

    class _CodemlParser:
        def __init__(self):
            # "NSsites" must exist before models are added to it
            self.results = {"NSsites": {}}
            self.flags = dict(multi_models = False)

        def read(self, results_handle):
            for line in results_handle:
                siteclass_re = re.match(r"Site-class models:\s*(.*)", line)
                if siteclass_re is not None:
                    self._siteclass_parse(siteclass_re)

        def _add_siteclass_model(self, siteclass_model):
            self.results["site-class model"] = siteclass_model
            name_to_num = {"NearlyNeutral": 1,
                           "PositiveSelection": 2,
                           "discrete (4 categories)": 3,
                           "beta (4 categories)": 7,
                           "beta&w>1 (5 categories)": 8}
            current_model = name_to_num[siteclass_model]
            self.results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in self.results["NSsites"]:
                del self.results["NSsites"][0]

        def _siteclass_parse(self, siteclass_re):
            # An empty model name means several models follow
            siteclass_model = siteclass_re.group(1)
            if siteclass_model == "":
                self.flags["multi_models"] = True
            else:
                self._add_siteclass_model(siteclass_model)

You are not changing the parsing strategy, but now you've got individual functions handling each of the steps so it's clear that the _siteclass_parse either sets multi_models or adds details about the single model. Then you can dig into the _add_siteclass_model function to see what it is doing. To the reader, each individual unit can be read and understood separately.

This type of refactoring work is useful generally. I have to do it all the time in my work and discover new tricks and approaches. Hope this is helpful and thanks again for all the work on this,

Brad
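The base-class suggestion from earlier in this thread (shared control-file handling for codeml, baseml and yn00, kept in something like _base.py) might look roughly like the sketch below. Everything here is an assumption for illustration: the class name `_PamlBase`, the option names, and the simplification that a control file is a list of "name = value" lines with "*" starting a comment. It is not the code from the paml-branch.

```python
import os
import tempfile


class _PamlBase:
    """Hypothetical shared base: option storage plus ctl-file read/write."""

    # Each subclass declares the options it understands.
    default_options = {}

    def __init__(self):
        self.options = dict(self.default_options)

    def set_options(self, **kwargs):
        for name, value in kwargs.items():
            if name not in self.options:
                raise KeyError("Unknown option: %s" % name)
            self.options[name] = value

    def write_ctl_file(self, path):
        # Unset (None) options are simply left out of the file.
        with open(path, "w") as handle:
            for name, value in sorted(self.options.items()):
                if value is not None:
                    handle.write("%s = %s\n" % (name, value))

    def read_ctl_file(self, path):
        with open(path) as handle:
            for line in handle:
                line = line.split("*", 1)[0].strip()  # drop comments
                if "=" not in line:
                    continue
                name, value = [part.strip() for part in line.split("=", 1)]
                if name in self.options:
                    self.options[name] = value  # values stay as strings


class Codeml(_PamlBase):
    default_options = {"seqfile": None, "NSsites": None, "model": None}


class Baseml(_PamlBase):
    default_options = {"seqfile": None, "model": None, "kappa": None}


if __name__ == "__main__":
    ctl = os.path.join(tempfile.mkdtemp(), "codeml.ctl")
    cml = Codeml()
    cml.set_options(seqfile="alignment.phy", NSsites="0 1 2")
    cml.write_ctl_file(ctl)
    print(open(ctl).read())
```

With this layout each program-specific class only lists its own options, and any format quirks can be handled by overriding read_ctl_file or write_ctl_file in that one subclass.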