From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:00:54 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:00:54 -0500
Subject: [Biopython-dev] [Bug 3173] New: Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
Summary: Bio.Emboss.Primer3 parser incompatibility with Primer3
version 2.2.3
Product: Biopython
Version: 1.55
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Other
AssignedTo: biopython-dev at biopython.org
ReportedBy: jp.verta at gmail.com
I'm running Biopython 1.55, Python 2.6 and EMBOSS version 6.3.1 on MacOS X 10.6
Snow Leopard.
The Bio.Emboss.Primer3 parser seems to be incompatible with the newer version
2.2.3 of the Whitehead Primer3 program and the corresponding EMBOSS eprimer3
program output. The parser output for the reverse primer contains all Primer
class members (primer.reverse_tm, primer.reverse_gc etc.) except the reverse
primer sequence (primer.reverse_seq). Yet the eprimer3 output seems identical
to that of old versions (see the attached output.pr3 files).
Here is example code for designing primers for a set of FASTA sequences.
>>>
def design_primers(fasta_file, output_file):
    from Bio import SeqIO
    from Bio.Emboss.Applications import Primer3Commandline
    from Bio.Emboss import Primer3
    output = open(output_file, "w")
    output.write("name,forward_primer,reverse_primer,forward_tm,reverse_tm,product_size\n")
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        if not(seq_record):
            break
        open("sequence",
             "w").write(">"+str(seq_record.id)+"\n"+str(seq_record.seq)+"\n")
        primer_cl = Primer3Commandline(sequence="sequence")
        primer_cl.explainflag = True
        primer_cl.osizeopt=20
        primer_cl.psizeopt=200
        primer_cl.otm=65
        primer_cl.maxtm=70
        primer_cl.mintm=60
        # required number of Gs or Cs at the 3' end of the primer
        primer_cl.gcclamp=1
        primer_cl.outfile = "output.pr3"
        primer_cl()
        output_handle = open("output.pr3","r")
        primer_record = Primer3.read(output_handle)
        if len(primer_record.primers) > 0:
            primer = primer_record.primers[0]
            output.write("%s,%s,%s,%s,%s,%s\n" % (seq_record.id,
                primer.forward_seq, primer.reverse_seq,
                primer.forward_tm,primer.reverse_tm,primer.size))
        else:
            print "No primers found for %s" % seq_record.id
>>>
This code, when executed on a file of FASTA sequences, gives an output file
with the sequence id, the forward and reverse primer sequences and tm, and the
product size, separated by commas. When I execute it with Primer3 2.2.3 and the
compatible eprimer3 version, the field for the reverse primer sequence is blank.
I will attach the Primer3-2.2.3 compatible eprimer3 file to this report.
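One way to confirm that the reverse sequence is present in the file but lost
by the parser is to pull the sequences out of the raw report text and compare
them with primer.forward_seq and primer.reverse_seq. The sketch below is only
an illustration: the function name primer_sequences is made up, and the row
layout it expects ("FORWARD PRIMER" or "REVERSE PRIMER" followed by numeric
columns, with the sequence as the last field) is an assumption about the
eprimer3 report, not something taken from the attachments.

```python
def primer_sequences(report_text):
    """Collect primer sequences from raw eprimer3 report text.

    Assumed row shape (hypothetical, check against your own output file):
        FORWARD PRIMER  <start>  <len>  <tm>  <gc%>  <sequence>
    """
    found = {"FORWARD": [], "REVERSE": []}
    for line in report_text.splitlines():
        fields = line.split()
        # Only rows whose first two fields are "<DIRECTION> PRIMER" count.
        if len(fields) >= 3 and fields[0] in found and fields[1] == "PRIMER":
            found[fields[0]].append(fields[-1])
    return found
```

If this finds as many REVERSE sequences as FORWARD ones while the parsed
records have an empty reverse_seq, the problem is in the parser rather than in
the eprimer3 output.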
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:03:17 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:03:17 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011803.p11I3HtF008419@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #1 from jp.verta at gmail.com 2011-02-01 13:03 EST -------
Created an attachment (id=1565)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1565&action=view)
Emboss eprimer3.c file for Primer3 version 2.2.3
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:05:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:05:04 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011805.p11I54md008512@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #2 from jp.verta at gmail.com 2011-02-01 13:05 EST -------
Created an attachment (id=1566)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1566&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program
output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:06:00 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:00 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I60De008626@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
jp.verta at gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #1566|Example output of Primer3   |Example output of Primer3
        description|version 1.1.4 compatible    |version 2.2.3 compatible
                   |Emboss eprimer3 program     |Emboss eprimer3 program
                   |output                      |output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:06:47 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:47 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I6lVr008664@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #3 from jp.verta at gmail.com 2011-02-01 13:06 EST -------
Created an attachment (id=1567)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1567&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program
output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 13:07:44 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:07:44 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011807.p11I7ih8008712@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
jp.verta at gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #1566|application/octet-stream    |text/plain
          mime type|                            |
From biopython at maubp.freeserve.co.uk Tue Feb 1 15:39:30 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 20:39:30 +0000
Subject: [Biopython-dev] [Biopython] internal function to convert
illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>
<97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
<20110201160304.GH17835@sobchak.mgh.harvard.edu>
Message-ID:
On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote:
> On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote:
>>
>> Peter, how hard do you think it would be to have SeqIO only convert
>> from the fastq encoding to phred scores on demand? Most of the time
>> when dealing with fastq I do not need any conversion at all and use
>> the FastqGeneralIterator to just pull out the name, sequence and
>> quality.
>>
>> You've done a lot of nice work with the correct conversions and it
>> would be great to expose that directly though on-demand conversion
>> as Alan is suggesting. Ideally you would use SeqIO as normal with
>> fastq files, but the quality score would not be converted to solexa
>> during parsing using letter_annotations["solexa_quality"] was
>> accessed.
>
> I actually implemented a proof of concept that does that. In order
> to not alter the SeqRecord behaviour, it was a new object which
> acted like a list of integers in many respects. The data is held
> as a FASTQ encoded string, and decoded (and then cached) on
> demand only. On output if it was already in the right encoding
> the string could be used as is, otherwise the conversion could
> be done very quickly with a precomputed table and the string
> translate() method (without having to go via a list of integers).
> It seemed to work, but I wasn't convinced about the benefits
> (given the complexity). I'd really want some real world FASTQ
> benchmarks to try it on... something you might have in the form
> of your scripts and the real data they were written for?
>
> I'm pretty sure this code is in a local git branch on one of my
> machines (probably at home), but I don't think I pushed it to
> github. I should do that...
Found it and pushed it:
https://github.com/peterjc/biopython/tree/fastq-tricks
Note there are unit test failures (e.g. as currently implemented
there is no range checking on the characters in the quality strings
at parse time). We may want to continue this on the dev mailing list...
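The on-demand decoding described above can be sketched roughly as follows.
This is not the code in the fastq-tricks branch: the class name LazyQualities
and its methods are made up for illustration, and only the Sanger (offset 33)
and Illumina 1.3+ (offset 64) PHRED encodings are handled, since Solexa scores
need a different, non-linear mapping.

```python
# Precomputed table mapping Illumina 1.3+ characters (PHRED + 64) onto
# Sanger characters (PHRED + 33) for PHRED scores 0..62.
ILLUMINA_TO_SANGER = str.maketrans(
    "".join(chr(q + 64) for q in range(63)),
    "".join(chr(q + 33) for q in range(63)),
)


class LazyQualities:
    """Hypothetical list-like view of a FASTQ quality string."""

    def __init__(self, encoded, offset=33):
        self._encoded = encoded   # raw quality string as read from the file
        self._offset = offset     # 33 for Sanger, 64 for Illumina 1.3+
        self._decoded = None      # filled in on first access, then reused

    def _decode(self):
        if self._decoded is None:
            self._decoded = [ord(c) - self._offset for c in self._encoded]
        return self._decoded

    def __len__(self):
        return len(self._encoded)

    def __getitem__(self, index):
        return self._decode()[index]

    def __iter__(self):
        return iter(self._decode())

    def as_sanger(self):
        """Return the Sanger-encoded string without building a list of ints."""
        if self._offset == 33:
            return self._encoded  # already in the right encoding
        return self._encoded.translate(ILLUMINA_TO_SANGER)
```

Like the branch, this sketch does no range checking on the quality characters
at construction time; an invalid character only shows up as an odd score when
the values are decoded.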
Peter
From p.j.a.cock at googlemail.com Thu Feb 3 07:04:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 12:04:08 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
Message-ID:
On Wed, Jan 26, 2011 at 7:44 PM, Peter Cock wrote:
>
> I'm currently looking at trimming 5' and 3' PCR primer sequences -
> which could equally be used for barcodes etc. I'd probably wrap this
> as a Galaxy tool (using Biopython).
>
If anyone is interested, see this thread on the Galaxy-dev mailing list:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html
In terms of SFF output, I'm only writing one SFF file so the issues
Jacob is concerned about (when writing one SFF file per barcode)
do not apply.
On Fri, Jan 28, 2011 at 12:34 PM, Brad Chapman wrote:
>
> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
> Brad
The computationally interesting part is matching the primer/adapter/
barcode to the read (both of which may contain IUPAC ambiguity codes),
which as you point out can be replaced once you have a working
framework for the input, output, trimming, etc.
Currently I'm using regular expressions, which is fast enough for my
own needs - and this task could easily be parallelised by breaking
up the input reads. Beyond that perhaps something based on
Hamming distances (edit distance - number of mismatches) or
Levenshtein searches might be quicker. I guess speed is more of
an issue with Illumina than with 454 due to the number of reads?
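As a rough illustration of the two approaches mentioned here, the sketch
below turns a primer containing IUPAC ambiguity codes into a regular
expression, and also counts mismatches position by position (a Hamming-style
comparison in which two codes match if their base sets overlap). It is not the
code from either script in this thread; the function names are made up, and
the regular expression only expands ambiguity codes in the primer, not in the
read.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer with ambiguity codes into a regular expression."""
    return re.compile("".join("[%s]" % IUPAC[base] for base in primer.upper()))


def mismatches(primer, window):
    """Count positions where the two codes share no possible base."""
    return sum(
        1
        for p, r in zip(primer.upper(), window.upper())
        if not set(IUPAC[p]) & set(IUPAC[r])
    )


def trim_5prime(read, primer, max_mismatches=0):
    """Remove the primer from the start of the read if it matches."""
    window = read[:len(primer)]
    if len(window) == len(primer) and mismatches(primer, window) <= max_mismatches:
        return read[len(primer):]
    return read
```

The mismatch count only considers substitutions at a fixed offset, so gapped
matches would still need something like an alignment or a Levenshtein search.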
Brad - you mentioned using approximate matches with gaps. Did you
find gapped matches made a big difference to the number of matches
found? i.e. is it worthwhile on your data?
Peter
From bugzilla-daemon at portal.open-bio.org Thu Feb 3 17:47:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 17:47:04 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102032247.p13Ml4QY029111@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
walter_gillett at hotmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |walter_gillett at hotmail.com
------- Comment #4 from walter_gillett at hotmail.com 2011-02-03 17:47 EST -------
Short answer:
The fix looks good - I have dug into the logic in detail and stepped through
the example. However, it appears to me that there is still a bug in this line
of code in the viterbi method:
for cur_state in self.transitions_from(main_state):
In this context, "cur_state" is a state prior to "main_state", so what we
really need here is the set of states that lead to main_state, not the set of
states that can be reached from main_state. This bug won't cause trouble in
practice unless you use a non-ergodic HMM, that is, a model in which some state
transitions are disallowed. (The variable names here are confusing; it would be
better to rename main_state to cur_state and cur_state to previous_state, or
something like that.) This bug is unrelated to the problem originally reported,
other than appearing in the same part of the code, so perhaps it should be
handled in a separate ticket.
I would be happy to code up a fix if that makes sense.
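To make the distinction concrete, here is a small sketch using a plain
dictionary of transition probabilities. The two functions are standalone
stand-ins written for this illustration (in Bio.HMM, transitions_from is a
method on the model), and the three-entry model is invented.

```python
def transitions_from(transition_probs, state):
    """States that can be reached from `state` in one step."""
    return sorted(
        dst for (src, dst), p in transition_probs.items()
        if src == state and p > 0
    )


def transitions_to(transition_probs, state):
    """States from which `state` can be reached in one step."""
    return sorted(
        src for (src, dst), p in transition_probs.items()
        if dst == state and p > 0
    )


# A non-ergodic (left-to-right) model: "a" may move to "b", never back.
model = {("a", "a"): 0.5, ("a", "b"): 0.5, ("b", "b"): 1.0}
```

For state "b" the predecessors are "a" and "b", but transitions_from only
returns "b", so a Viterbi recursion that loops over transitions_from would
never consider the path that arrives in "b" from "a". In a fully connected
model the two sets are identical, which is why the bug stays hidden there.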
Longer answer:
I had spent a bunch of time recently investigating this - should have noted
that in bugzilla to avoid duplication of effort. But it still seems worthwhile
to write down my notes to document this better, so I'll do that here.
There was an error in the Viterbi algorithm termination logic, as implemented in
the method MarkovModel#viterbi. The Viterbi probabilities were being multiplied
by the log-probability of a transition back to an end state (state 0). This was
incorrect because in log space the log-probabilities should be added, not
multiplied. Peter's fix removes that multiplication, thus dropping the end
state transition entirely (which Durbin considers optional, so that's fine; and
it was causing trouble). With the bug fixed, the most probable state path to
generate 6 tails (in the example model described by the bug reporter) becomes
"uuuuuu" as expected - no final "f".
At a higher level, there was (in versions 1.56 and prior, but no longer in
trunk) an important undocumented (as far as I can see) requirement that the
model always starts in state 0. The bug reporter complained that the results of
the Viterbi path calculation are wrong because "apparently they depend upon the
order of the state alphabet," which was true. In the example model, providing
the state alphabet ["f", "u"] causes the system to start in state f. Since
there is a big penalty in his example for switching states, you get "ff" as the
most likely state path for the output sequence [tails, tails], even though the
unfair coin is much more likely than the fair coin to yield tails.
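The points above (log-probabilities are added, never multiplied, and the
result should not depend on the order of the state alphabet) can be checked
with a minimal standalone Viterbi sketch. This is not the Bio.HMM
implementation. The emission probabilities and the uniform start distribution
are assumptions made for the example; only the 0.95 probability of staying in
the same state comes from this bug discussion.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state path, computed entirely in log space."""
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in states:
            # Log-probabilities are summed along the path.
            prev = max(states, key=lambda p: scores[p] + log_trans[p][cur])
            new_scores[cur] = scores[prev] + log_trans[prev][cur] + log_emit[cur][obs]
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)
    # No end-state transition is applied; just take the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), scores[last]


log = math.log
# Assumed coin model: "f" is fair, "u" is unfair and favours tails ("T").
log_start = {"f": log(0.5), "u": log(0.5)}
log_trans = {"f": {"f": log(0.95), "u": log(0.05)},
             "u": {"u": log(0.95), "f": log(0.05)}}
log_emit = {"f": {"H": log(0.5), "T": log(0.5)},
            "u": {"H": log(0.1), "T": log(0.9)}}
```

With these numbers, viterbi("TTTTTT", ["f", "u"], log_start, log_trans,
log_emit) and the same call with the states given as ["u", "f"] both return
the path "uuuuuu". Passing a non-uniform log_start is how a start-state
distribution would enter the calculation.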
Looks like Peter's fix treats all starting states as equally probable, there is
no longer a special start state. That's reasonable, although the coding is a
little confusing:
# v_{0}(0) = 0
viterbi_probs[(state_letters[0], -1)] = 0
# v_{k}(0) = 0 for k > 0
for state_letter in state_letters[1:]:
    viterbi_probs[(state_letter, -1)] = 0
because it could now more naturally be done in two lines of code rather than
three. Possibly it's useful to keep the assignment for state 0 separate in case
we want to change it.
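For comparison, the shorter form would look something like the sketch below,
shown next to the existing three-line form. The state_letters value is just an
example alphabet; the variable names three_line and two_line are only there so
the two results can be compared.

```python
state_letters = ["f", "u"]  # example state alphabet

# Existing form: state 0 is initialised separately from the rest.
three_line = {}
three_line[(state_letters[0], -1)] = 0
for state_letter in state_letters[1:]:
    three_line[(state_letter, -1)] = 0

# Shorter form: every state gets the same starting score in one loop.
two_line = {}
for state_letter in state_letters:
    two_line[(state_letter, -1)] = 0
```

Both dictionaries end up with the same keys and values.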
A good long-term improvement would be to have a special hidden start state like
the "MagicalState" used by BioJava (see
http://www.biojava.org/wiki/BioJava:Tutorial:Simple_HMMs_with_BioJava). That
would make it possible to specify a probability distribution for what the
initial state should be, a typical HMM feature (see Durbin's book, for
example).
From bugzilla-daemon at portal.open-bio.org Thu Feb 3 19:16:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 19:16:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102040016.p140GdsK031389@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #5 from pgarland at gmail.com 2011-02-03 19:16 EST -------
FWIW, I think the right thing with respect to begin states is to require the
user to explicitly specify a begin state in the state alphabet, e.g.:
class coin:
    def __init__(self):
        self.begin_state_name = "begin"
        self.letters = ["u", "f"]
Having the user specify the name should reduce the chance of naming conflicts,
and makes it easier for the user to understand what is going on if they print
viterbi_probs, or are trying to debug a problem.
The user should also be required to explicitly set the initial probabilities.
There should be three methods for this, one that takes a list of initial
probabilities, one that makes all initial states equally probable, and one that
lets the user set the probability for each state individually. e.g:
MarkovModelBuilder.set_initial_probabilities([0.01, 0.99])
MarkovModelBuilder.set_initial_probabilities_equal()
MarkovModelBuilder.set_initial_probability("u", 0.01)
The first and third methods would raise an exception if the probabilities did
not sum to 1.0.
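A rough sketch of how those three methods might behave is given below. The
class InitialProbabilities is hypothetical and stands apart from the real
MarkovModelBuilder; the method names follow the proposal above, while the
choice to defer the sum check in the per-state setter until every state has a
value is an assumption of this sketch.

```python
class InitialProbabilities:
    """Hypothetical holder for initial state probabilities (not Bio.HMM)."""

    def __init__(self, states):
        self.states = list(states)
        self.probs = {}

    @staticmethod
    def _check_sum(values):
        if abs(sum(values) - 1.0) > 1e-9:
            raise ValueError("initial probabilities must sum to 1.0")

    def set_initial_probabilities(self, values):
        """Set all probabilities at once, in state-alphabet order."""
        if len(values) != len(self.states):
            raise ValueError("need exactly one probability per state")
        self._check_sum(values)
        self.probs = dict(zip(self.states, values))

    def set_initial_probabilities_equal(self):
        """Make every state equally likely as the initial state."""
        p = 1.0 / len(self.states)
        self.probs = {state: p for state in self.states}

    def set_initial_probability(self, state, value):
        """Set one state's probability; check the sum once all are set."""
        if state not in self.states:
            raise KeyError(state)
        self.probs[state] = value
        if len(self.probs) == len(self.states):
            self._check_sum(self.probs.values())
```

Checking the sum on every single call to set_initial_probability would make
it impossible to fill in the states one at a time, which is why the check
waits until all states have a value.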
Alternatively, the initial probabilities could be specified when defining the
state alphabet:
def __init__(self):
    self.begin_state_name = "begin"
    self.letters = [{'name': "u", 'init_prob': 0.01},
                    {'name': "f", 'init_prob': 0.99}]
This has the advantage of making the code more concise and readable, because
the state's declaration and specification are kept together. It has the
disadvantage of adding an unnecessary layer of indirection when all the states
have equal initial probabilities. To make things less tedious for the user,
there could either be a flag specifying that all states have an equal initial
probability:
def __init__(self):
    self.begin_state_name = "begin"
    self.initial_probabilities_equal = True
    self.letters = [{'name': "u"}, {'name': "f"}]
or again, a method could be provided:
MarkovModelBuilder.set_initial_probabilities_equal()
Because specifying the begin state name and the initial probabilities would be
required, any of these changes would break the current API.
Similar features should be provided for users who want to constrain the end
state, but not specifying the end state should not raise an exception.
I agree the variable names "main_state" and "cur_state" are confusing and
should be changed.
~Phillip
From clementsgalaxy at gmail.com Thu Feb 3 20:01:01 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Thu, 3 Feb 2011 17:01:01 -0800
Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren,
The Netherlands
Message-ID:
We are pleased to announce the *2011 Galaxy Community Conference*, being
held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature two
full days of presentations and discussion on extending Galaxy to use new
tools and data sources, deploying Galaxy at your organization, and best
practices for using Galaxy to further your own and your community's
research. See http://galaxy.psu.edu/gcc2011/ for complete details.
*About Galaxy:*
Galaxy is an open, web-based platform for *accessible, reproducible, and
transparent* computational biomedical research.
- *Accessibility:* Galaxy enables users without programming experience to
easily specify parameters and run tools and workflows.
- *Reproducibility:* Galaxy captures all information necessary so that
any user can repeat and understand a complete computational analysis.
- *Transparency:* Galaxy enables users to share and publish analyses via
the web and create Pages--interactive, web-based documents that describe a
complete analysis.
Galaxy is open source for all organizations. The public Galaxy service (
http://usegalaxy.org) makes analysis tools, genomic data,
tutorial demonstrations, persistent workspaces, and publication services
available to any scientist that has access to the Internet. Local
Galaxy servers can be set up by downloading the Galaxy application and
customizing it to meet particular needs.
*Conference Overview:*
This event aims to engage a broader community of developers, data producers,
tool creators, and core facility and other research hub staff to become an
active part of the Galaxy community. We'll cover defining resources in the
Galaxy framework, increasing their visibility and making them easier to use
and integrate with other resources, how to extend Galaxy to use custom data
sources and custom tools, and best practices for using Galaxy in your
organization.
Additional topics include, but are not limited to:
* Talks submitted by the Galaxy community
* Integration of tools (including NGS analysis tools) and distributed job
management
* Deployment of Galaxy instances on local resources and on the Cloud
* Management of large datasets with the Galaxy Library System
* Using the Galaxy LIMS functionality at NGS sequencing facilities
* Visualizing Data without leaving Galaxy
* Performing reproducible research
* Performing and sharing complex analyses with Workflows
* An "Introduction to Galaxy" session, offered on May 24, for Galaxy
newcomers.
*Registration:*
The conference fee is €100 on or before April 24, and €120 after that. The
meeting is being held at the Conference Centre De Werelt in Lunteren, The
Netherlands, which is also the conference hotel. You are encouraged to
register early, as space at the hotel (and at the "Intro to Galaxy" session)
is limited and is likely to fill up before the conference itself does. See
http://galaxy.psu.edu/gcc2011/Register.html
*Abstract Submission:*
Abstracts are now being accepted for short oral presentations. Proposals on
any topic of interest to the Galaxy community are welcome and encouraged.
The abstract submission deadline is the end of February 28. See
http://galaxy.psu.edu/gcc2011/Abstracts.html
*Sponsors*
The 2011 Galaxy Community Conference is co-sponsored by the US National
Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands
Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative
institute of the bioinformatics groups in the Netherlands. Together, these
groups perform cutting-edge research, develop novel tools and support
platforms, create an e-science infrastructure and educate the next
generations of bioinformaticians.
We are looking forward to a great conference and hope to see you in the
Netherlands!
The Galaxy and NBIC Teams
--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 05:05:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 05:05:18 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041005.p14A5Ij0019705@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 05:05 EST -------
(In reply to comment #4)
>
> Looks like Peter's fix treats all starting states as equally probable,
> there is no longer a special start state. That's reasonable, although the
> coding is a little confusing:
>
It was Phillip's fix.
(In reply to comment #5)
> FWIW, I think the right thing with respect to begin states is to require the
> user to explicitly specify an begin state in the state alphabet, e.g.:
> class coin:
> def __init__(self):
> self.begin_state_name = "begin"
> self.letters = ["u", "f"]
If we go that route, we'll need to make very clear the differences between a
HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It
must make sense in many cases to use a biological sequence alphabet, but in
general adding HMM attributes to the class does not make sense.
We really need someone to volunteer to take over this code (and sort out the
overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some
documentation for the tutorial, and sort out these remaining issues. Are either
of you interested?
>
> I agree the variable names "main_state" and "cur_state" are confusing and
> should be changed.
>
I'll happily merge/cherry-pick a simple diff to do that only if you do that on
github, or apply a patch if you upload it here.
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 08:46:02 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 08:46:02 -0500
Subject: [Biopython-dev] [Bug 3175] New: Caret in genbank files leads to
GenBank Parser crash in Biopython 1.54
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
Summary: Caret in genbank files leads to GenBank Parser crash in
Biopython 1.54
Product: Biopython
Version: 1.54
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: aaron.tin.long.lun at gmail.com
When parsing genbank files using Bio.SeqIO as described in the Biopython
Cookbook, the presence of a caret in the position of a feature in the
annotation (e.g. CDS 1000..1001^1002) raises a LocationParserError, leading
to "Syntax error at or near `Tokens('caret')' token". Appears to occur
regardless of the type of the feature, whether it is normal/reverse complement,
etc. Found in BioPython 1.54 on a Dell dimension 2400 running Kubuntu 10.10.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 08:49:23 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 08:49:23 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041349.p14DnN75028633@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #1 from aaron.tin.long.lun at gmail.com 2011-02-04 08:49 EST -------
Created an attachment (id=1568)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1568&action=view)
Crash-inducing file for the GenBank parser
Example file, modified from the human mitochondrial genome, with a caret
introduced in line 96. Causes the crash described in the bug description.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 09:20:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 09:20:33 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041420.p14EKX5n030354@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 09:20 EST -------
Hi Aaron,
The example in attachment #1568 from comment #1 is invalid. The feature
location join(16024^16026..16569,1..576) is wrong since the caret should be
used in the form [i]^[i+1], i.e. consecutive numbers. See:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
That example should probably be a between location like
join((16024.16026)..16569,1..576)
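The consecutive-numbers rule can be checked mechanically before a file is
handed to a parser. The following is only a sketch with a made-up function
name; it looks at the location string alone and does not know the sequence
length, so any special case for sites spanning the origin of a circular
molecule is not handled.

```python
import re

# Two integers separated by a caret, e.g. "1001^1002".
CARET = re.compile(r"(\d+)\^(\d+)")


def invalid_carets(location):
    """Return the (i, j) caret pairs in a location where j is not i + 1."""
    return [
        (int(a), int(b))
        for a, b in CARET.findall(location)
        if int(b) != int(a) + 1
    ]
```

Applied to join(16024^16026..16569,1..576) this reports the pair
(16024, 16026), while 1000..1001^1002 passes the check because 1001 and 1002
are adjacent.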
However, the example in the original bug report, 1000..1001^1002, looks
possible (but unprecedented to my knowledge) and that also fails with the
latest Biopython GenBank parsing code (much changed since Biopython 1.54). I
don't really understand how that usefully differs from 1000..1001 or 1000..1002
though.
Was that from a GenBank file from the NCBI? If so what accession please, or a
URL?
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:00:49 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 10:00:49 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041500.p14F0naj032533@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
georg.lipps at fhnw.ch changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME
------- Comment #7 from georg.lipps at fhnw.ch 2011-02-04 10:00 EST -------
Yes,
the code seems to work now.
The probability of attaining the first state is now the transition probability
of remaining in the same state (here 0.95).
I like the suggestion of comment #5 to explicitly state a begin state with
the corresponding transition probabilities.
A big THANKS for fixing,
Georg
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:19:11 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 10:19:11 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041519.p14FJBme001095@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #8 from walter_gillett at hotmail.com 2011-02-04 10:19 EST -------
I'll volunteer to do all of that (OK with you, Phillip?).
Walter
(In reply to comment #6)
> (In reply to comment #5)
> > FWIW, I think the right thing with respect to begin states is to require the
> > user to explicitly specify an begin state in the state alphabet, e.g.:
> > class coin:
> > def __init__(self):
> > self.begin_state_name = "begin"
> > self.letters = ["u", "f"]
>
> If we go that route, we'll need to make very clear the differences between a
> HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It
> must make sense in many cases to use a biological sequence alphabet, but in
> general adding HMM attributes to the class does not make sense.
>
> We really need someone to volunteer to take over this code (and sort out the
> overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some
> documentation for the tutorial, and sort out these remaining issues. Are either
> of you interested?
>
> >
> > I agree the variable names "main_state" and "cur_state" are confusing and
> > should be changed.
> >
>
> I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> github, or apply a patch if you upload it here.
>
> Thanks,
>
> Peter
>
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 11:12:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 11:12:33 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041612.p14GCXfW004211@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 11:12 EST -------
> >
> > I agree the variable names "main_state" and "cur_state" are confusing and
> > should be changed.
> >
>
> I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> github, or apply a patch if you upload it here.
I could have phrased that better: I mean a simple patch/diff to do the rename
only would be easy for me to review and check in.
(In reply to comment #8)
> I'll volunteer to do all of that (OK with you, Phillip?).
>
> Walter
That's OK with me. Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 12:25:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 12:25:18 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041725.p14HPIhY008673@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #3 from aaron.tin.long.lun at gmail.com 2011-02-04 12:25 EST -------
Hi Peter,
Thanks for the quick reply. I originally encountered the caret in the GenBank
entry for the chromosome II assembly of the human genome (accession number
NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at
the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene
e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is
rare, because I parsed through the complete sequences of 15 other chromosomes
before my program crashed. Hope that helps.
Cheers,
Aaron
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 12:43:37 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 12:43:37 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041743.p14HhbbY009388@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 12:43 EST -------
(In reply to comment #3)
> Hi Peter,
> Thanks for the quick reply. I originally encountered the caret in the GenBank
> entry for the chromosome II assembly of the human genome (accession number
> NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at
> the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene
> e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is
> rare, because I parsed through the complete sequences of 15 other chromosomes
> before my program crashed. Hope that helps.
> Cheers,
> Aaron
>
Where on the FTP site? It's a big place and I don't work with human genomes...
Looking via the Entrez website, it seems NT_022221.13 is only 3519312bp,
so this can't match the GenBank file you are looking at:
http://www.ncbi.nlm.nih.gov/nuccore/NT_022221.13?report=gbwithparts
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:05:42 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:05:42 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041805.p14I5gxS010298@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 13:05 EST -------
(In reply to comment #4)
>
> Where on the FTP site? It's a big place and I don't work with human genomes...
>
Never mind, I tried downloading a few candidates and found it - you actually
meant NT_015926.15 which is in this file (whose first entry is NT_022221.13)
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz
It seems that Google doesn't index this site - I can understand why but it
would have been useful.
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:15:05 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:15:05 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041815.p14IF5Bx010832@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #6 from aaron.tin.long.lun at gmail.com 2011-02-04 13:15 EST -------
Hi Peter,
Yeah, sorry about the mix-up, I'm not used to dealing with more than one
sequence record per file. The caret should be present in the FTP-sourced file.
Interestingly, it is not present in the Nucleotide annotation for the same
accession number, which suggests that they've updated it in the two/three
months since the data was pushed onto the FTP site.
Cheers,
Aaron
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:20:12 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:20:12 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041820.p14IKCDN011161@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #7 from aaron.tin.long.lun at gmail.com 2011-02-04 13:20 EST -------
NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in my
file. What I said about Nucleotide still applies, though.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 23:23:02 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 23:23:02 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102050423.p154N2fO013565@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #10 from pgarland at gmail.com 2011-02-04 23:23 EST -------
(In reply to comment #8)
> I'll volunteer to do all of that (OK with you, Phillip?).
>
> Walter
Sure. WRT my earlier comment, I realized that it's simpler for both the
implementer and the user if the only user-visible change necessary to specify
begin states is to add a variable to HiddenMarkovBuilder to hold the name of
the begin state, and then let users use set_transition_score to specify
transition probabilities from begin states. Then the relevant methods, e.g.
_all_blank, allow_transition, allow_all_transitions, set_transition_score, etc
have to be altered to forbid transitions to, or emissions from the begin state.
And get_markov_model would raise an exception if a begin state hasn't been
specified or if there isn't at least one transition from the begin state.
So all users would have to do is (using the example from the bug report):
...
build.begin_state_name = "begin"
build.set_transition_score("begin", "u", 0.01)
build.set_transition_score("begin", "f", 0.99)
...
~Phillip
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 02:04:46 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 5 Feb 2011 02:04:46 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102050704.p1574kup024068@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #11 from walter_gillett at yahoo.com 2011-02-05 02:04 EST -------
Sounds good.
(I had been thinking about trying to preserve backward compatibility for
existing clients of this class. If we require that the caller sets a begin
state then all existing clients will break since none of them currently does
that. But the previous fix has already broken compatibility in any case, and
that was probably necessary since prior to the fix, the results were
incorrect.)
A possible variation would be to handle the transition from the begin state to
the first real state with special-case code, so that the begin state would not
be included in the set of real states. The upside would be that the methods you
mention would not have to change, and we wouldn't be cluttering the state
alphabet with a begin state that isn't real, which I think was a concern
mentioned in comment #6 (if I understood it properly). The downside is having
to add that special-case code. Not sure yet whether this is a good idea or not.
Walter
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 22:23:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 5 Feb 2011 22:23:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102060323.p163NdIu013858@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #12 from pgarland at gmail.com 2011-02-05 22:23 EST -------
(In reply to comment #11)
> Sounds good.
>
> (I had been thinking about trying to preserve backward compatibility for
> existing clients of this class. If we require that the caller sets a begin
> state then all existing clients will break since none of them currently does
> that. But the previous fix has already broken compatibility in any case, and
> that was probably necessary since prior to the fix, the results were
> incorrect.)
I don't think it's worth it to worry about preserving complete backward
compatibility. Right now there are two classes of code:
1) Code that manually sets up a begin state and the appropriate transitions.
All these people would need to do is add one line of code specifying the begin
state, and the rest of their code would work as before. For these users, we
could print an error message instructing them to set the begin_state_name
variable (and document the change too!).
2) Code that does not set up a begin state, as in the bug report. Even with the
applied bug fix, this code only returns a correct state sequence when all
possible start states should be equally probable. In all other cases the users
are possibly getting an incorrect result without being aware of it. To my mind,
this is worse than breaking backward compatibility. We could maintain backward
compatibility by having a default model for the initial state (e.g. equally
probable, or assign random probabilities), but unless that's the model the user
should be assuming for their sequence, they'll still be silently returned an
incorrect result.
> A possible variation would be to handle the transition from the begin state to
> the first real state with special-case code, so that the begin state would not
> be included in the set of real states. The upside would be that the methods you
> mention would not have to change, and we wouldn't be cluttering the state
> alphabet with a begin state that isn't real, which I think was a concern
> mentioned in comment #6 (if I understood it properly). The downside is having
> to add that special-case code. Not sure yet whether this is a good idea or not.
>
> Walter
I hadn't thought of that approach. It could be a good way to go. I think the
tradeoffs would be:
A) Of the existing code, changes would be localized to the viterbi method,
which would become slightly more complex.
B) This approach makes it trivial to guarantee that no state can transition to
the begin state.
C) One new public method would have to be added, for users to set initial
probabilities.
D) Having to use the new method would require more, though not complex, changes
to existing user code, but would have the benefit of making it as explicit as
possible how the model is initialized.
All in all, your idea of keeping the begin state separate looks like the way to
go.
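To illustrate the separate-begin-state idea, here is a minimal sketch of Viterbi decoding in which the initial distribution is passed in explicitly instead of being a state in the alphabet. It is not the Bio.HMM code: the function name, the argument layout and the transition/emission numbers are assumptions made up for the example; only the 0.01/0.99 begin probabilities come from comment #10. It assumes a non-empty observation list.

```python
import math


def _log(p):
    # log(0) is treated as minus infinity so forbidden moves never win.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, initial, transition, emission):
    """Most probable state path, with the begin distribution kept separate.

    initial[s]       -- P(first state is s); not part of the state alphabet
    transition[r][s] -- P(next state is s | current state is r)
    emission[s][o]   -- P(observing o | state s)
    """
    # The first column uses the explicit initial distribution.
    scores = {s: _log(initial[s]) + _log(emission[s][observations[0]])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[r] + _log(transition[r][s]))
            new_scores[s] = (scores[prev] + _log(transition[prev][s])
                             + _log(emission[s][obs]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-coin model: "u" is an unfair coin, "f" a fair one.
states = ["u", "f"]
transition = {"u": {"u": 0.9, "f": 0.1}, "f": {"u": 0.1, "f": 0.9}}
emission = {"u": {"H": 0.9, "T": 0.1}, "f": {"H": 0.5, "T": 0.5}}
observations = ["H", "H"]

skewed = viterbi(observations, states, {"u": 0.01, "f": 0.99}, transition, emission)
uniform = viterbi(observations, states, {"u": 0.5, "f": 0.5}, transition, emission)
```

With these made-up numbers the two calls return different paths for the same observations, which is the silent-wrong-answer risk described under 2) when a uniform default is assumed.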
~ Phillip
From bugzilla-daemon at portal.open-bio.org Sun Feb 6 01:46:56 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 6 Feb 2011 01:46:56 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102060646.p166kuqY018550@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #13 from walter_gillett at yahoo.com 2011-02-06 01:46 EST -------
I forked Biopython, then tested, committed and pushed some improvements to
variable naming and comments in the viterbi method, and submitted a pull
request for your review. Thanks,
Walter
(In reply to comment #8)
> > I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> > github, or apply a patch if you upload it here.
> >
> > Thanks,
> >
> > Peter
From chapmanb at 50mail.com Mon Feb 7 07:23:56 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 7 Feb 2011 07:23:56 -0500
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To:
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
Message-ID: <20110207122356.GC18733@sobchak.mgh.harvard.edu>
Peter;
> The computationally interesting part is matching the primer/adapter/
> barcode to the read (both of which may contain IUPAC ambiguity codes),
> which as you point out can be replaced once you have a working
> framework for the input, output, trimming, etc.
Absolutely. I'd be very happy if you wanted to take the framework in
the script and generalize it for different matching. Let me know
what I can do to help.
> Currently I'm using regular expressions, which is fast enough for my
> own needs - and this task could easily be parallelised by breaking
> up the input reads. Beyond that perhaps something based on
> Hamming distances (edit distance - number of mismatches) or
> Levenshtein searches might be quicker. I guess speed is more of
> an issue with Illumina than with 454 due to the number of reads?
>
> Brad - you mentioned using approximate matches with gaps. Did you
> find gapped matches made a big difference to the number of matches
> found? i.e. is it worthwhile on your data?
A large majority of the barcodes are found with exact matching
via a dictionary lookup, so the gapped/mismatch alignments are only
necessary for the barcodes with sequencing errors. For Illumina
reads gaps aren't as common, so the mismatch alignments are more
useful but I tried to make it general so as to catch as many cases
as possible.
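A rough sketch of that two-step lookup, for anyone following along: exact dictionary match first, then a mismatch-tolerant fallback. This is not the script under discussion; the function names, the `max_mismatches` parameter and the sample barcodes are made up for illustration, it assumes a non-empty barcode table with all barcodes the same length, and it handles mismatches only (no gaps).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the sample name for the barcode at the start of read, or None.

    barcodes maps barcode sequence -> sample name.
    """
    length = len(next(iter(barcodes)))
    prefix = read[:length]
    # Fast path: exact match via dictionary lookup.
    if prefix in barcodes:
        return barcodes[prefix]
    if len(prefix) < length:
        return None  # read is shorter than a barcode
    # Slow path: allow a few mismatches, but only accept an unambiguous hit.
    hits = [name for seq, name in barcodes.items()
            if hamming(prefix, seq) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode table.
barcodes = {"ACGT": "sample1", "TTAA": "sample2"}
exact = assign_barcode("ACGTGGGG", barcodes)
fuzzy = assign_barcode("ACGAGGGG", barcodes)
```

Most reads should take the fast path, so the cost of the fallback only matters for the minority of reads with sequencing errors.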
Brad
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 11:31:38 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:31:38 -0500
Subject: [Biopython-dev] [Bug 3176] New: Bio SeqIO 'genbank' parse failure
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
Summary: Bio SeqIO 'genbank' parse failure
Product: Biopython
Version: 1.56
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: sschmidt at tuebingen.mpg.de
Hi,
the parser stumbles over a Genbank file that contains a feature without values:
___START GenBank File____
LOCUS someVector______ 6127 bp DNA circular 1-OCT-2009
SOURCE
ORGANISM
COMMENT none
FEATURES Location/Qualifiers
     misc_structure  1564..1566
                     /ApEinfo_label=ErrorInBioPythonBecauseNoValue
                     /ApEinfo_fwdcolor=
                     /ApEinfo_revcolor=
                     /vntifkey="88"
                     /label=Stop\codon
BASE COUNT 15 a 16 c 16 g 13 t
ORIGIN
1 gagttccgcg ttacataact tacggtaaat ggcccgcctg gctgaccgcc caacgacccc
//
__END GenBank file___
The relevant error message:
  File "/sw/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 525, in parse
    for r in i:
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 437, in parse_records
    record = self.parse(handle, do_features)
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 420, in parse
    if self.feed(handle, consumer, do_features):
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 392, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 188, in parse_features
    features.append(self.parse_feature(feature_key, feature_lines))
  File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 268, in parse_feature
    elif value[0]=='"':
IndexError: string index out of range
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 11:45:40 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:45:40 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081645.p18GjeR4025608@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 11:45 EST -------
Where is this problem file coming from? I'm pretty sure that neither the NCBI
nor EMBL/DDBJ use feature qualifiers like that.
See: http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html
If you are creating the file, why not use /key="" or /key - the latter form is
used in real GenBank files, e.g. /pseudo
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:25:13 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:25:13 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081825.p18IPDgO029696@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #2 from sschmidt at tuebingen.mpg.de 2011-02-08 13:25 EST -------
The file is the product of ApE
(http://biologylabs.utah.edu/jorgensen/wayned/ape/).
I agree that this format is 'unusual', but the crash could easily be avoided by
checking whether a value is defined at all.
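A minimal sketch of that kind of check, written as a standalone function. It is not the Bio/GenBank/Scanner.py code or any patch for it; the function name, the return convention and the quote stripping are assumptions for illustration.

```python
def parse_qualifier(line):
    """Split a feature qualifier line into (key, value).

    The value is None for a bare flag such as /pseudo and an empty string
    for a key with nothing after the equals sign, e.g. /ApEinfo_fwdcolor=
    """
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("not a qualifier line: %r" % line)
    key, sep, value = text[1:].partition("=")
    if not sep:
        return key, None
    # Guard before indexing: value[0] on an empty string raises IndexError,
    # which is the failure shown in the traceback.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return key, value


empty = parse_qualifier("/ApEinfo_fwdcolor=")
quoted = parse_qualifier('/vntifkey="88"')
flag = parse_qualifier("/pseudo")
```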
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:28:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:28:18 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081828.p18ISIGG029796@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 13:28 EST -------
Created an attachment (id=1569)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1569&action=view)
Handle funny feature annotation
Could you test the following patch? Ask if you need help with that - I can
stick it on a github branch if that is easier.
From anaryin at gmail.com Tue Feb 8 13:33:56 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 8 Feb 2011 19:33:56 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
Message-ID:
Dear All,
I've been working on the above-mentioned functions following really great
feedback from Eric, Kristian, and Peter. I've also been using them routinely
and I've had no problems yet, so they should be stable enough. Therefore I
think they can be cherry-picked from my pdb_enhancements branch and added to
the main branch. Let me know what you think.
Cheers,
João
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 13:54:28 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:54:28 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081854.p18IsSbo030923@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #4 from sschmidt at tuebingen.mpg.de 2011-02-08 13:54 EST -------
Hmm,
I patched the code and get the same error message. What about handling this
problem in Bio/GenBank/Scanner.py directly?
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 17:25:58 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 17:25:58 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102082225.p18MPwXR006718@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1569 is|0 |1
obsolete| |
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 17:25 EST -------
(From update of attachment 1569)
Sorry, must have uploaded the wrong patch - this was a work in progress for the
GenBank between location bug.
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 05:47:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 05:47:33 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091047.p19AlX92029443@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-09 05:47 EST -------
Committed:
https://github.com/biopython/biopython/commit/07b6c12cf18d41749918e29b1bbc4a58a18e1180
Can you try the trunk? See
http://www.biopython.org/wiki/SourceCode
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 09:19:46 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 09:19:46 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091419.p19EJkjK011310@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #7 from sschmidt at tuebingen.mpg.de 2011-02-09 09:19 EST -------
(using 07b6c12cf18d41749918e29b1bbc4a58a18e1180)
works like a charm.
Thanks, Peter. I should have come up with a similar solution myself.
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 09:20:22 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 09:20:22 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091420.p19EKMsg011354@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
sschmidt at tuebingen.mpg.de changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #8 from sschmidt at tuebingen.mpg.de 2011-02-09 09:20 EST -------
done
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 09:05:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 09:05:33 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102101405.p1AE5Xkl029071@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-10 09:05 EST -------
(In reply to comment #7)
> NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in
> my file. What I said about Nucleotide still applies, though.
>
Yes, you're right. My mistake, NT_015926.15 was the last good record.
Had you noticed this was the last gene in this record? It runs right up to
the end of the sequence and beyond (missing the rightmost end, i.e. the 5'
start of the gene, since it is on the reverse strand). From the FTP site:
LOCUS NT_022184 68452323 bp DNA linear CON 28-OCT-2010
DEFINITION Homo sapiens chromosome 2 genomic contig, GRCh37.p2 reference
primary assembly.
...
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452073^68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452072^68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="GeneID:28916"
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
If we look at the record via Entrez,
http://www.ncbi.nlm.nih.gov/nuccore/NT_022184.15?report=gbwithparts
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
So this appears to have been updated to avoid the funny caret location,
but I think they made a mistake - surely the CDS should be
complement(68451760..>68452073) not complement(<68451760..68452073)
as stated?
Have you contacted the NCBI about this? If not, I will.
I believe that the caret location in the FTP GenBank file is invalid and
Biopython is right to reject it (but I would like to confirm this with the
NCBI). For now the simplest solution is for you to manually edit that feature.
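To make the distinction concrete, here is a small sketch that tells a plain range apart from a "between" site and flags the mixed form seen in the FTP file. It is not Biopython's location parser. It assumes, as argued above and still to be confirmed with the NCBI, that a caret is only meaningful as `x^y` with adjacent positions; the function name and return labels are made up, and it ignores join(), single positions and circular wrap-around.

```python
import re

# start..end, optionally with < or > fuzzy markers
RANGE = re.compile(r"^<?(\d+)\.\.>?(\d+)$")
# a site between two bases, written x^y
BETWEEN = re.compile(r"^(\d+)\^(\d+)$")


def classify_location(location):
    """Return 'range', 'between' or 'invalid' for a simple location string."""
    loc = location.strip()
    # Look inside a single complement(...) wrapper if present.
    if loc.startswith("complement(") and loc.endswith(")"):
        loc = loc[len("complement("):-1]
    if RANGE.match(loc):
        return "range"
    match = BETWEEN.match(loc)
    # Assumption: a between-site must sit between two adjacent positions.
    if match and int(match.group(2)) == int(match.group(1)) + 1:
        return "between"
    return "invalid"


ftp_cds = classify_location("complement(<68451760..68452072^68452073)")
entrez_cds = classify_location("complement(<68451760..68452073)")
```

Under that assumption the FTP version of the CDS location is classified as invalid, while the Entrez version is an ordinary range.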
Thanks,
Peter
From p.j.a.cock at googlemail.com Thu Feb 10 10:10:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Feb 2011 15:10:19 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110207122356.GC18733@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
Message-ID:
On Mon, Feb 7, 2011 at 12:23 PM, Brad Chapman wrote:
> Peter;
>
>> The computationally interesting part is matching the primer/adapter/
>> barcode to the read (both of which may contain IUPAC ambiguity codes),
>> which as you point out can be replaced once you have a working
>> framework for the input, output, trimming, etc.
>
> Absolutely. I'd be very happy if you wanted to take the framework in
> the script and generalize it for different matching. Let me know
> what I can do to help.
Do you have (or can you point me at) any good sample data with
barcodes, or custom adapters or primer sequences? e.g. some SRA
numbers you've been using.
>> Currently I'm using regular expressions, which is fast enough for my
>> own needs - and this task could easily be parallelised by breaking
>> up the input reads. Beyond that perhaps something based on
>> Hamming distances (edit distance - number of mismatches) or
>> Levenshtein searches might be quicker. I guess speed is more of
>> an issue with Illumina than with 454 due to the number of reads?
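
As a rough illustration of the regular-expression approach described above (a sketch only, not the code in seq_primer_clip.py), a primer containing IUPAC ambiguity codes can be expanded into a character-class pattern. The function names are made up for this example, and it assumes the read itself contains only unambiguous bases.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer containing IUPAC codes into a regular expression."""
    parts = []
    for letter in primer.upper():
        bases = IUPAC[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def clip_after_primer(read, primer):
    """Return the read with everything up to and including the first primer hit removed.

    Returns None when the primer is not found.
    """
    match = primer_to_regex(primer).search(read.upper())
    if match is None:
        return None
    return read[match.end():]


print(clip_after_primer("TTACGGATCCAAGT", "GGRTCY"))  # AAGT
```

This only handles exact matches; allowing mismatches would need something like the Hamming-distance idea mentioned in the same paragraph.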
I originally had three separate tools (with shared code) for working
with FASTA, FASTQ and SFF reads, which I have recently combined
into one single tool that does all three. Code here if anyone wants to
look at it.
https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
seq_primer_clip.py - Python script
seq_primer_clip.xml - Galaxy wrapper
seq_primer_clip.txt - readme file
This is still a work in progress...
Peter
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 15:02:42 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 15:02:42 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102102002.p1AK2g6g017745@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #14 from walter_gillett at yahoo.com 2011-02-10 15:02 EST -------
I have checked in a fix on my github branch to the bug mentioned in comment #4:
in the Viterbi recursion to determine state path probabilities, we must
consider states that lead *to* the current state, not those that are reachable
*from* it. See comments for this checkin:
https://github.com/wgillett/biopython/commit/f8b0b94ad7ffadbf9aa923bc6273822328cb9f01
. Forgot to mention in the comments that I also fixed a bug in the
allow_transition method and added a unit test for that method.
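
For readers who do not want to open the commit, the point about the recursion can be sketched as follows. This is a generic textbook Viterbi in plain Python written for this note, not the Bio.HMM code or the actual patch; the function signature and the dictionary layout are assumptions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a sequence of observations.

    trans_p[a][b] is the probability of moving from state a to state b;
    a missing entry means the transition is not allowed.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log probability of the best path ending in state s at time t.
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for current in states:
            # Look at every state that can lead *to* `current`,
            # not the states reachable *from* it.
            score, previous = max(
                (best[t - 1][prev] + log(trans_p[prev].get(current, 0)), prev)
                for prev in states
            )
            best[t][current] = score + log(emit_p[current].get(observations[t], 0))
            back[t][current] = previous
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Asymmetric toy model: A can move to B, but B can never return to A.
states = ["A", "B"]
start_p = {"A": 1.0}
trans_p = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p))  # ['A', 'B', 'B']
```

The toy model is deliberately asymmetric, because with symmetric transitions the "to" and "from" readings give the same answer and the bug stays hidden.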
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 18:07:21 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 18:07:21 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102102307.p1AN7Lu0025588@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #9 from aaron.tin.long.lun at gmail.com 2011-02-10 18:07 EST -------
Thanks Peter, will do so.
Cheers,
Aaron
From p.j.a.cock at googlemail.com Fri Feb 11 04:30:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Feb 2011 09:30:02 +0000
Subject: [Biopython-dev] Fwd: [GitHub] Viterbi algorithm bug fix: consider
states that lead *to* the current state,
not reachable *from* it [biopython/biopython GH-3]
In-Reply-To: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail>
References: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail>
Message-ID:
Hi Brad,
Do you want to look at this HMM fix too?
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
Also who else is getting the github pull requests? We should
probably send them to the dev list, but I can't find the settings
right now on GitHub...
Peter
---------- Forwarded message ----------
From: GitHub
Date: Thu, Feb 10, 2011 at 7:58 PM
Subject: [GitHub] Viterbi algorithm bug fix: consider states that lead
*to* the current state, not reachable *from* it [biopython/biopython
GH-3]
To: p.j.a.cock at googlemail.com
wgillett wants someone to pull from wgillett:master:
Bug fix related to bug #2947. Please review and commit if it's OK. Thanks,
Walter Gillett
View Pull Request: https://github.com/biopython/biopython/pull/3
From chapmanb at 50mail.com Mon Feb 14 08:01:10 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 14 Feb 2011 08:01:10 -0500
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To:
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
Message-ID: <20110214130110.GA12340@sobchak.mgh.harvard.edu>
Peter;
> Do you have (or can you point me at) any good sample data with
> barcodes, or custom adapters or primer sequences? e.g. some SRA
> numbers you've been using.
This is a subset of two lanes from a barcoded flowcell for testing
purposes:
http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz
It has 12 barcoded samples, using the Illumina barcodes. The
sequences are in this YAML file:
https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml
> I originally had three separate tools (with shared code) for working
> with FASTA, FASTQ and SFF reads, which I have recently combined
> into one single tool that does all three. Code here if anyone wants to
> look at it.
>
> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
Very nice. It would be great to get something general for barcode
splitting as a Galaxy tool. Thanks for looking at this,
Brad
From p.j.a.cock at googlemail.com Mon Feb 14 08:19:45 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 14 Feb 2011 13:19:45 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110214130110.GA12340@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
<20110214130110.GA12340@sobchak.mgh.harvard.edu>
Message-ID:
On Mon, Feb 14, 2011 at 1:01 PM, Brad Chapman wrote:
> Peter;
>
>> Do you have (or can you point me at) any good sample data with
>> barcodes, or custom adapters or primer sequences? e.g. some SRA
>> numbers you've been using.
>
> This is a subset of two lanes from a barcoded flowcell for testing
> purposes:
>
> http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz
>
> It has 12 barcoded samples, using the Illumina barcodes. The
> sequences are in this YAML file:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml
>
Great :)
>> I originally had three separate tools (with shared code) for working
>> with FASTA, FASTQ and SFF reads, which I have recently combined
>> into one single tool that does all three. Code here if anyone wants to
>> look at it.
>>
>> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
>
> Very nice. It would be great to get something general for barcode
> splitting as a Galaxy tool. Thanks for looking at this,
> Brad
Yes - assuming what they have already isn't good enough (at the
very least the Galaxy barcode wrapper for fastx currently only
handles fastq-solexa, but I think that can be fixed).
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html
I've been focused on the PCR case where my sequences have
got IUPAC ambiguity characters. For barcodes that shouldn't be
an issue, but instead you may have more than one barcode and
will want one output file per barcode (although not usually as
complicated as Kevin's setup). I need to learn more about how
Galaxy handles multiple outputs before commenting on that.
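
To sketch the barcode case in code (purely illustrative, and independent of how Galaxy would handle the outputs): the example below assumes the barcode sits at the 5' end of each read and allows a fixed number of mismatches. The function names and the "unassigned" bin are made up for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_by_barcode(reads, barcodes, max_mismatches=1):
    """Group reads by the barcode found at their 5' end.

    reads: iterable of sequence strings.
    barcodes: dict mapping sample name -> barcode sequence.
    Reads matching no barcode (or two barcodes equally well) go to "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        hits = []
        for name, barcode in barcodes.items():
            prefix = read[:len(barcode)]
            if len(prefix) == len(barcode):
                distance = hamming(prefix, barcode)
                if distance <= max_mismatches:
                    hits.append((distance, name))
        hits.sort()
        if hits and (len(hits) == 1 or hits[0][0] < hits[1][0]):
            name = hits[0][1]
            # Trim the barcode off before storing the read.
            bins[name].append(read[len(barcodes[name]):])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


barcodes = {"s1": "ACGT", "s2": "TTAG"}
reads = ["ACGTGGGG", "ACGAGGGG", "TTAGCCCC", "GGGGGGGG"]
bins = split_by_barcode(reads, barcodes)
print(bins)
```

Writing one output file per barcode is then a loop over the returned dictionary.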
Peter
From tiagoantao at gmail.com Wed Feb 16 11:40:10 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 16 Feb 2011 16:40:10 +0000
Subject: [Biopython-dev] New URL for integration testing
Message-ID:
Hello all,
Buildbot integration testing has been moved to a, hopefully, more
stable location. If you are interested, please have a look at:
http://testing.open-bio.org/
The old URL at events.open-bio.org is no more.
Regards,
Tiago
--
"If you want to get laid, go to college.  If you want an education, go
to the library." - Frank Zappa
From anaryin at gmail.com Thu Feb 17 07:59:16 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 17 Feb 2011 13:59:16 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:
Hey Kristian,
To Tests/test_pdb.py ? Just to make sure that the renumbering acts on both
accordingly? I agree.
João
From krother at rubor.de Thu Feb 17 07:54:38 2011
From: krother at rubor.de (Kristian Rother)
Date: Thu, 17 Feb 2011 13:54:38 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:
Hi Joao,
I think we should add a simple test function that ensures consistency of
child_dict and child_list upon renumbering. Let me know if you'd prefer me
to explain in Python what I mean.
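
Something along these lines is the kind of check meant here. It is a sketch only: it relies on nothing more than the child_list, child_dict and id attributes that Bio.PDB entities carry, and uses a tiny stand-in class (FakeEntity, invented for this example) so that it runs without a structure file.

```python
def check_children_consistent(entity):
    """Raise AssertionError if child_list and child_dict disagree.

    Works on any object exposing `child_list`, `child_dict` and children
    with an `id` attribute, which is how Bio.PDB entities are laid out.
    """
    ids_in_list = [child.id for child in entity.child_list]
    assert len(ids_in_list) == len(set(ids_in_list)), "duplicate child ids"
    assert set(ids_in_list) == set(entity.child_dict), "dict keys differ from list ids"
    for child in entity.child_list:
        assert entity.child_dict[child.id] is child, "dict entry points elsewhere"


class FakeEntity(object):
    """Tiny stand-in used only to exercise the check without Bio.PDB."""

    def __init__(self, id):
        self.id = id
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child


chain = FakeEntity("A")
for number in (10, 11, 12):
    chain.add(FakeEntity((" ", number, " ")))
check_children_consistent(chain)  # passes: list and dict agree

# A naive renumbering that only touches child.id leaves child_dict stale.
for new_number, residue in enumerate(chain.child_list, start=1):
    residue.id = (" ", new_number, " ")
try:
    check_children_consistent(chain)
    stale = False
except AssertionError:
    stale = True
print(stale)  # True
```

In Tests/test_pdb.py the same check would be run on a real chain before and after calling the renumbering function.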
Kristian
> Dear All,
>
> I've been working on the above-mentioned functions following really great
> feedback from Eric, Kristian, and Peter. I've been also using them
> routinely
> and I've had no problems yet so they should be stable enough. Therefore I
> think they can be cherry-picked from my pdb_enhancements branch and added
> to
> the main branch. Let me know what you think.
>
> Cheers,
>
> João
>
From b.invergo at gmail.com Tue Feb 22 11:40:01 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 22 Feb 2011 17:40:01 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To:
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID:
Hi everyone,
I've been toiling away on the PAML API and I think it's finally ready
for review. If anyone's willing to give my code a review, here's my
branch:
https://github.com/brandoninvergo/biopython/tree/paml-branch
(the API is in Bio/Phylo/PAML, as suggested before, and the tests are
in Tests, with their supporting files in Tests/PAML)
I'll also post a message to the Biopython user list to see if anyone
would be willing to give it a test drive.
Some notes:
- I've implemented Codeml, Baseml/Basemlg and Yn00. I have not yet
done anything with Mcmctree because I am completely ignorant about
what information to extract from the output files. The other two
programs in the package, Evolver and Chi2, do not accept commandline
options and are instead operated by a rudimentary commandline
interface, so they aren't really compatible with scripting.
- Chi2 is useful, though, because it provides a chi^2 CDF, which you
can use in performing maximum likelihood ratio tests, an important
part of using the PAML programs. Since Python doesn't have a chi^2
cumulative distribution function in its standard library, I ported the
original C code rather than writing a function which simply calls the
original, with the permission of Ziheng Yang (the original author;
this is mentioned in the code's comments, but he required no other
licensing/copyright verbiage to be included). This was no easy task,
considering the C code was littered with goto statements. Anyway, this
will prevent the user from having to install/import an outside package
to do the tests (I personally had been using Rpy2 to call the R
function pchisq()... complete overkill). Let me know if this is OK or
if this causes some kind of conflict.
- The output of the programs varies widely with the combinatorics of
the parameters and possibly between versions. I tried to include all
possible output files in the Tests/PAML directory and I wrote test
cases to check that they're properly parsed (with the testing of
future versions in mind). So, that Tests/PAML folder has a lot more in
it than the usual test folders, but I felt there was no other option.
I tried to make it organized.
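
For anyone unfamiliar with how the chi^2 CDF is used in a likelihood ratio test, here is a small self-contained sketch. It is not the routine ported from PAML; it is an independent illustration using a series expansion of the regularized incomplete gamma function, and the names chi2_sf and lrt_pvalue are made up for this example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = df/2 and z = x/2. Fine for typical test statistics;
    very small p-values lose precision because of the final subtraction.
    """
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0
    total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1)) * total
    return max(0.0, 1.0 - lower)


def lrt_pvalue(lnl_null, lnl_alt, df):
    """p-value of a likelihood ratio test between two nested models."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf(statistic, df)


# Made-up log-likelihoods: 2 * 3.0 = 6.0 on 2 degrees of freedom.
print(round(lrt_pvalue(-1000.0, -997.0, 2), 4))  # 0.0498
```

With two degrees of freedom the tail probability is simply exp(-x/2), which gives a quick sanity check on the series.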
I think those are the main points for now. I'd assume that there's
more work to be done before I should perform a pull request, so I'll
simply ask for your comments for now if you have the time.
Cheers,
Brandon Invergo
On Sun, Jan 16, 2011 at 4:09 PM, Peter Cock wrote:
> On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote:
>> Hi everyone,
>> A quick question about style: since the name "codeml" is based on a
>> program which is always spelled either in all caps or in all
>> lower-case, what would be the best way to write the class name
>> regarding capitalization? Stick with the usual camel-case convention,
>> "Codeml", anyway?
>
> I'd go with Codeml for a class name (or something like
> CodemlResult or whatever). Neither CODEML nor codeml
> seem good class names in Python.
>
>> Things are progressing nicely. I've already taken care of a lot of the
>> minor tasks and improvements...
>
> Sounds good :)
>
> Peter
>
From clementsgalaxy at gmail.com Tue Feb 22 12:16:12 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Tue, 22 Feb 2011 09:16:12 -0800
Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren,
The Netherlands
In-Reply-To:
References:
Message-ID:
Hello all,
Just a reminder that the abstract submission deadline for the Galaxy
Community Conference is next Monday, February 28. See
http://galaxy.psu.edu/gcc2011/Abstracts.html for details.
Cheers,
Dave C.
On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote:
> We are pleased to announce the *2011 Galaxy Community Conference*, being
> held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature
> two full days of presentations and discussion on extending Galaxy to use new
> tools and data sources, deploying Galaxy at your organization, and best
> practices for using Galaxy to further your own and your community's
> research. See http://galaxy.psu.edu/gcc2011/ for complete details.
>
> *About Galaxy:*
> Galaxy is an open, web-based platform for *accessible, reproducible, and
> transparent* computational biomedical research.
>
> - *Accessibility:* Galaxy enables users without programming experience
> to easily specify parameters and run tools and workflows.
> - *Reproducibility:* Galaxy captures all information necessary so that
> any user can repeat and understand a complete computational analysis.
> - *Transparency:* Galaxy enables users to share and publish analyses
> via the web and create Pages--interactive, web-based documents that describe
> a complete analysis.
>
> Galaxy is open source for all organizations. The public Galaxy service (
> http://usegalaxy.org) makes analysis tools, genomic data,
> tutorial demonstrations, persistent workspaces, and publication services
> available to any scientist that has access to the Internet. Local
> Galaxy servers can be set up by downloading the Galaxy application and
> customizing it to meet particular needs.
>
> *Conference Overview:*
> This event aims to engage a broader community of developers, data
> producers, tool creators, and core facility and other research hub staff to
> become an active part of the Galaxy community. We'll cover defining
> resources in the Galaxy framework, increasing their visibility and making
> them easier to use and integrate with other resources, how to extend Galaxy
> to use custom data sources and custom tools, and best practices for using
> Galaxy in your organization.
>
> Additional topics include, but are not limited to:
> * Talks submitted by the Galaxy community
> * Integration of tools (including NGS analysis tools) and distributed job
> management
> * Deployment of Galaxy instances on local resources and on the Cloud
> * Management of large datasets with the Galaxy Library System
> * Using the Galaxy LIMS functionality at NGS sequencing facilities
> * Visualizing Data without leaving Galaxy
> * Performing reproducible research
> * Performing and sharing complex analyses with Workflows
> * An "Introduction to Galaxy" session, offered on May 24, for Galaxy
> newcomers.
>
> *Registration:*
> The conference fee is €100 on or before April 24, and €120 after that. The
> meeting is being held at the Conference Centre De Werelt in Lunteren, The
> Netherlands, which is also the conference hotel. You are encouraged to
> register early, as space at the hotel (and at the "Intro to Galaxy" session)
> is limited and is likely to fill up before the conference itself does. See
> http://galaxy.psu.edu/gcc2011/Register.html
>
> *Abstract Submission:*
> Abstracts are now being accepted for short oral presentations. Proposals
> on any topic of interest to the Galaxy community are welcome and
> encouraged. The abstract submission deadline is the end of February 28.
> See http://galaxy.psu.edu/gcc2011/Abstracts.html
>
> *Sponsors*
> The 2011 Galaxy Community Conference is co-sponsored by the US National
> Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands
> Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a
> collaborative institute of the bioinformatics groups in the Netherlands.
> Together, these groups perform cutting-edge research, develop novel tools
> and support platforms, create an e-science infrastructure and educate the
> next generations of bioinformaticians.
>
> We are looking forward to a great conference and hope to see you in the
> Netherlands!
>
> The Galaxy and NBIC Teams
>
> --
> http://galaxy.psu.edu/gcc2011/
> http://getgalaxy.org
> http://usegalaxy.org/
>
--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/
From bugzilla-daemon at portal.open-bio.org Tue Feb 22 13:06:48 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Feb 2011 13:06:48 -0500
Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml
In-Reply-To:
Message-ID: <201102221806.p1MI6mvd015443@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3170
------- Comment #1 from b.invergo at gmail.com 2011-02-22 13:06 EST -------
I've forked the repository on github and I've created a branch containing the
new code:
https://github.com/brandoninvergo/biopython/tree/paml-branch
From p.j.a.cock at googlemail.com Wed Feb 23 04:24:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Feb 2011 09:24:21 +0000
Subject: [Biopython-dev] [Biopython] Biopython library for muliple
sequence alignment
In-Reply-To: <001501cbd324$c70a8570$551f9050$@jp>
References: <001501cbd324$c70a8570$551f9050$@jp>
Message-ID:
On Wed, Feb 23, 2011 at 6:42 AM, Rojan Shrestha wrote:
> Hello:
>
> I want to do multiple sequence alignment using ClustalW. Instead of running it
> standalone, I would like to use it in my own program through Biopython. I would
> like to know whether Biopython has a ClustalW function or not. It would be
> very good if somebody gives information about this.
>
> Regards,
>
> Rojan
Hello Rojan,
Biopython (and BioPerl too I believe) doesn't have any multiple sequence
alignment code itself. Biopython does have pairwise sequence alignment
code (with a fast implementation in C).
Instead (again, like BioPerl) Biopython has a wrapper and parser for
calling the ClustalW command line tool from within your script and
loading its output. Similarly for other alignment tools like Muscle.
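
As a rough picture of what "calling the command line tool from within your script" amounts to, here is a sketch that uses only the Python standard library rather than the Biopython wrapper itself. The executable name clustalw2 and the file names are assumptions for this example; the function names are made up.

```python
import shutil
import subprocess


def build_clustalw_command(infile, outfile, executable="clustalw2"):
    """Return the argument list for a basic ClustalW alignment run."""
    return [executable, "-INFILE=" + infile, "-OUTFILE=" + outfile, "-ALIGN"]


def run_clustalw(infile, outfile, executable="clustalw2"):
    """Run ClustalW if it is installed; return True on success, False otherwise."""
    if shutil.which(executable) is None:
        # The tool is not on PATH, so there is nothing to call.
        return False
    command = build_clustalw_command(infile, outfile, executable)
    return subprocess.call(command) == 0


# Placeholder file names; nothing is executed by this line.
print(build_clustalw_command("opuntia.fasta", "opuntia.aln"))
```

The resulting alignment file can then be loaded with a parser for the Clustal format.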
If you really want to be able to modify the multiple sequence alignment
code itself, some of these command line tools are open source. Also,
I *think* that BioJava has some code for this.
I don't know what BioRuby does.
Peter
P.S. You only really need to ask this on the Biopython Discussion List.
Since you included the OBF cross project list I have tried to comment
on how the other projects handle this as well.
From updates at feedmyinbox.com Wed Feb 23 04:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 active questions tagged biopython - Stack
Overflow
Message-ID: <64da3e945fd7631143a0bbd0fdd84e55@74.63.51.88>
// Biopython CodonTable error?
// February 18, 2011 at 3:02 PM
http://stackoverflow.com/questions/5045967/biopython-codontable-error
Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids and I am seeing some strange translation from the Biopython 1.56 package. It appears to be translating ambiguous DNA codes to 'J' which does not exist as a code for anything. I am running python 2.6.1 on Mac OS 10.6.6.
For example:
>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'
or
>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'
I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas?
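
One likely reading, offered as a sketch rather than a description of Biopython's internals: 'J' is the IUPAC one-letter code for "leucine or isoleucine", just as 'B' covers Asp/Asn and 'Z' covers Glu/Gln. The stand-alone example below expands each ambiguous codon under the standard genetic code and reproduces the reported output; all names in it are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG ordering.
CODON_TABLE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}

IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Protein ambiguity codes: B = Asp/Asn, Z = Glu/Gln, J = Leu/Ile.
AMBIGUOUS_PROTEIN = {frozenset("DN"): "B", frozenset("EQ"): "Z", frozenset("IL"): "J"}


def translate_ambiguous(dna):
    """Translate DNA that may contain IUPAC ambiguity codes."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3].upper()
        # Every concrete codon this ambiguous codon could stand for.
        options = product(*(IUPAC_DNA[base] for base in codon))
        residues = frozenset(CODON_TABLE["".join(o)] for o in options)
        if len(residues) == 1:
            protein.append(next(iter(residues)))
        else:
            protein.append(AMBIGUOUS_PROTEIN.get(residues, "X"))
    return "".join(protein)


print(translate_ambiguous("ARAWTAGKAMTA"))  # XJXJ
```

Here WTA stands for ATA (Ile) or TTA (Leu), and MTA for ATA (Ile) or CTA (Leu), so both come out as 'J'; ARA and GKA each cover two unrelated amino acids and fall back to 'X'.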
Thanks!
Mark
--
Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active
Account Login:
https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email
Unsubscribe here:
http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email
--
This email was carefully delivered by FeedMyInbox.com.
PO Box 682532 Franklin, TN 37068
From updates at feedmyinbox.com Wed Feb 23 04:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 biopython Questions - BioStar
Message-ID:
// MuscleCommandline not writing file
// February 22, 2011 at 2:34 PM
http://biostar.stackexchange.com/questions/5787/musclecommandline-not-writing-file
I'm trying to work through the Biopython tutorial on multiple sequence alignment and get an error whenever I try to use subprocess:
child = subprocess.Popen(str(cline),
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         shell=(sys.platform != "win32"))
I get this error:
Traceback (most recent call last):
File "", line 2, in
stdout = subprocess.PIPE)
File "C:\Python27\lib\subprocess.py", line 672, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 882, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
I've gone so far as to copy and paste the tutorial into the interpreter and no luck. Neither ClustalW nor Muscle are writing the alignment files (I tried the deprecated MultipleAlignCL as well with no luck).
I'm using Python v2.7 and Biopython v1.55 and have tried reinstalling both. Any advice?
--
Website: http://biostar.stackexchange.com/questions/tagged/biopython
From chapmanb at 50mail.com Wed Feb 23 08:11:51 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 23 Feb 2011 08:11:51 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Brandon;
> I've been toiling away on the PAML API and I think it's finally ready
> for review. If anyone's willing to give my code a review, here's my
> branch:
> https://github.com/brandoninvergo/biopython/tree/paml-branch
This is awesome; thanks much for all the work getting this together.
It's really great to see the extensive tests. I'm also impressed
with your story of porting over 'goto' statements; it's been a while
since those have entered my mind:
10 PRINT "CHI SQUARE FOREVER"
20 FLASH
30 GOTO 10
A couple of more general thoughts about your code:
- There looks to be a lot of shared functionality between codeml,
baseml and yn00 in setting up the control files. Would it be
possible to create a base class that these all inherit from? This
would make the code much easier to maintain over time as formats
change.
- Your 'read' functions get pretty deeply nested, especially the
codeml parser. What do you think about creating an internal class
to split some of the parsing logic into individual functions? A
nice example is the GenBank/Scanner.py code. Having functions like
parse_header/parse_features makes it much easier for someone not
deeply familiar with your code to start to make guesses at where
different functionality exists. This way, if the format changes
others can provide patches and feedback to you.
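
To illustrate the first suggestion, here is a minimal sketch of what a shared base class might look like. None of these class or method names come from the branch under review; they are hypothetical, and only the "name = value" control file layout and the option names (seqfile, NSsites, model, seqtype, icode, weighting) are taken from PAML itself.

```python
class PamlBase(object):
    """Hypothetical shared base class; names are illustrative only."""

    # Subclasses list the control-file options they understand.
    default_options = {}

    def __init__(self, alignment=None, working_dir=None):
        self.alignment = alignment
        self.working_dir = working_dir
        self._options = dict(self.default_options)

    def set_option(self, name, value):
        if name not in self._options:
            raise KeyError("unknown option: %s" % name)
        self._options[name] = value

    def ctl_lines(self):
        """Render 'name = value' lines for every option that has been set."""
        lines = []
        if self.alignment is not None:
            lines.append("seqfile = %s" % self.alignment)
        for name in sorted(self._options):
            value = self._options[name]
            if value is not None:
                lines.append("%s = %s" % (name, value))
        return lines


class CodemlSketch(PamlBase):
    default_options = {"NSsites": None, "model": None, "seqtype": None}


class Yn00Sketch(PamlBase):
    default_options = {"icode": None, "weighting": None}


cml = CodemlSketch(alignment="alignment.phylip")
cml.set_option("NSsites", "0 1 2")
print("\n".join(cml.ctl_lines()))
```

Each program then only declares its own option list, and a change to the control file format is made in one place.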
Overall this is great and all the work is much appreciated.
Brad
From chapmanb at 50mail.com Thu Feb 24 13:26:26 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 24 Feb 2011 13:26:26 -0500
Subject: [Biopython-dev] BOSC 2011 topic organizers and Codefest
Message-ID: <20110224182626.GM20125@sobchak.mgh.harvard.edu>
Hi all;
This year the Bioinformatics Open Source Conference (BOSC) will be taking
place in Vienna, Austria on July 15-16th. This is a yearly opportunity for
open source bioinformatics developers to get together in person and discuss
on-going projects. Nomi Harris, Peter Rice and the other organizing committee
members are already hard at work planning for the conference:
http://www.open-bio.org/wiki/BOSC_2011
The call for abstracts opens next Monday, and extends through April
18th, and we've been brainstorming potential session topics. This
year we've tried to focus each of the sessions around a particular
biological problem or computational approach. We hope this will draw
some interesting parallels between work being done in different groups,
and encourage even more collaboration.
We are actively looking for community members who are interested in heading up
the organization of a topic. The general idea is to build a cohesive set of talks
within a session. How you'd like to do this is completely flexible but some of
the ideas we've been discussing are:
- Having a short introductory talk to provide an overview of an area, framing
the different talks within this context.
- Forgoing individual question/answer and instead combining this time into a
longer panel-style discussion with all of the speakers. This would help
stimulate back and forth between the different projects and the
audience.
If you are interested in a particular topic and would like to help with the
organization, please send an e-mail to the BOSC mailing list:
bosc at lists.open-bio.org. We're also open to new topic suggestions, and will
look to add one or two more topics to our current list.
Finally, there will be a two day coding session prior to BOSC as a follow up
to last year's fun and productive Codefest:
http://www.open-bio.org/wiki/Codefest_2011
The Metalab, a unique hacker space in Vienna, has kindly agreed to host
us for the two days. If you are at all interested, please add your name
to the attendees list on the wiki. Since the Metalab organizers don't
know us personally, we'd like to demonstrate there is interest and that we'll
really show up with a bunch of bioinformatics hackers. More details will be in
the works as the summer draws closer.
Looking forward to the sound of music,
Brad
From b.invergo at gmail.com Fri Feb 25 11:57:19 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Fri, 25 Feb 2011 17:57:19 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
<20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID:
Hi Brad,
Thanks for your response! It's taken me a day or two to think about
what you wrote (also balancing a PhD with the hobby projects at the
moment...)
> It's really great to see the extensive tests. I'm also impressed
> with your story of porting over 'goto' statements; it's been a while
> since those have entered my mind:
To be honest, I forgot they existed. Seeing them immediately made the
computer scientist in me cringe. They really confused the whole
structure of the program but in the end they were solved quite easily
with some carefully placed loops and conditional blocks!
> - There looks to be a lot of shared functionality between codeml,
>   baseml and yn00 in setting up the control files. Would it be
>   possible to create a base class that these all inherit from? This
>   would make the code much easier to maintain over time as formats
>   change.
This is a really good idea and I'm a bit disappointed that I didn't
see it myself! Indeed, most of the functionality is just copied/pasted
between the classes, with only some variation in the
read/write_ctl_file functions for codeml and baseml. So, writing a
base class would really simplify things. I do have one question,
though, since this is my first time organizing my code in a
large-scale Python project. Where would be the best place to implement
this base paml class? In __init__.py or in its own paml.py file? I
know the end result would be the same but I figure I should start
learning some of these best practices.
> - Your 'read' functions get pretty deeply nested, especially the
>   codeml parser. What do you think about creating an internal class
>   to split some of the parsing logic into individual functions? A
>   nice example is the GenBank/Scanner.py code. Having functions like
>   parse_header/parse_features makes it much easier for someone not
>   deeply familiar with your code to start to make guesses at where
>   different functionality exists. This way, if the format changes
>   others can provide patches and feedback to you.
I'm not so sure about this mainly because of the way the output files
are formatted. For example, the most common usage of codeml (the most
common program of the bunch) is to run with several "NSsites"
models. If you do this, the output file is separated into segments
which are headed by a line that says something like "Model 2:
PositiveSelection", and the model parameters are printed out below.
However, if you only run with one model, which is also a common usage,
you no longer have these convenient headers and instead at the very
top of the output file is a completely different indication of which
model was used, but which is inconveniently missing if only model 0
was run. In other cases, such as amino acid sequence analysis,
pairwise nucleotide sequence or multiple gene analyses, there's no
header whatsoever indicating which kind of output file you're looking
at. Instead, you just have to search for particular data patterns to
parse. This mess is precisely why I had to include so many different
output files for the unittesting (codeml is the main culprit; baseml
is moderately bad; yn00 isn't a problem).
So, because I would potentially end up scanning almost the entire file
just to figure out what's going on, I think just parsing-as-you-go,
using elif statements to short-circuit and skip further evaluations of
a line after a match has been found, would be the better option.
Perhaps the files aren't long enough to be able to make an appeal for
computational efficiency but at the same time, I hesitate to read
through the file multiple times unnecessarily. I agree, though, that
this makes the read() function quite long. For that, though, I tried
to provide descriptive comments before each parsing case, describing
exactly what the next block of code is meant to parse and also
including a specific example line which should be parsed by it.
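To make the parse-as-you-go idea concrete, here is a minimal sketch of a single-pass parser in which an elif chain stops testing a line once one pattern has matched. The "Model N: Name" header is the one described above; the "lnL = ..." line and the function name parse_as_you_go are simplified stand-ins invented for illustration, not real codeml output or the actual pypaml code. It assumes Python 3.8 or later for the := operator.

```python
import re

# "Model N: Name" is the segment header described above. The "lnL = ..." line
# is a simplified, hypothetical stand-in and not real codeml output.
MODEL_RE = re.compile(r"Model (\d+):\s*(\S+)")
LNL_RE = re.compile(r"lnL\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_as_you_go(lines):
    """Parse in one pass; the elif chain skips later tests once a line matches."""
    results = {"models": {}}
    current = None
    for line in lines:
        if (m := MODEL_RE.match(line)):
            # A model header starts a new segment.
            current = int(m.group(1))
            results["models"][current] = {"description": m.group(2)}
        elif (m := LNL_RE.match(line)) and current is not None:
            # A value line is attached to the most recent model header.
            results["models"][current]["lnL"] = float(m.group(1))
    return results


example = ["Model 2: PositiveSelection", "lnL = -1234.5", "unrelated line"]
parsed = parse_as_you_go(example)
```

Each pattern is only evaluated if the ones before it failed, so no line is scanned more often than necessary and the file is read once.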
That said, I will take another look at the output files to see if
there could be another way of implementing it. Without a doubt, the
parsing is the most difficult part of implementing this module; the
rest of it is quite trivial. So, best to do it right!
> Overall this is great and all the work is much appreciated.
Thanks! It's been a fun side project for me.
Cheers,
Brandon
ps - I still haven't sent a message to the main Biopython list while I
consider implementing at least the first suggestion above, since it
would involve large changes that might cause me to accidentally break
something! I'll wait until I'm a bit more confident that it's close to
the final product
From updates at feedmyinbox.com Mon Feb 28 04:21:17 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Mon, 28 Feb 2011 04:21:17 -0500
Subject: [Biopython-dev] 2/28 active questions tagged biopython - Stack
Overflow
Message-ID: <348d58cdbd9ae31e700023c354ca3ce6@74.63.51.88>
// Convert nested dictionary/xml to flat file for sqlite
// February 27, 2011 at 11:25 AM
http://stackoverflow.com/questions/5134334/convert-nested-dictionary-xml-to-flat-file-for-sqlite
Hiya-
I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite. Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed.
Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null?
Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.
Many thanks,
chris
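One way the process described in the question could look is sketched below, using only the standard sqlite3 module. The record keys (PMID, TI, AB, AU, MH) and the table layout are assumptions made for illustration; real parsed records may use different keys. Popping each handled key from a copy of the record leaves behind exactly the fields that have not been dealt with yet.

```python
import sqlite3

# Hypothetical records shaped loosely like parsed citation dictionaries;
# the keys PMID, TI, AB, AU and MH are illustrative assumptions.
records = [
    {"PMID": "1", "TI": "First paper", "AB": "An abstract.",
     "AU": ["Smith J", "Lee K"], "MH": ["Humans", "Genomics"]},
    {"PMID": "2", "TI": "Second paper", "AU": ["Lee K"]},  # no abstract, no MeSH
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE paper  (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT);
    CREATE TABLE author (pmid TEXT REFERENCES paper(pmid), position INTEGER, name TEXT);
    CREATE TABLE mesh   (pmid TEXT REFERENCES paper(pmid), term TEXT);
""")

leftovers = {}
for original in records:
    rec = dict(original)  # work on a copy so the input stays untouched
    pmid = rec.pop("PMID")
    # One-per-paper fields; a missing key is stored as NULL via pop(key, None).
    conn.execute("INSERT INTO paper VALUES (?, ?, ?)",
                 (pmid, rec.pop("TI", None), rec.pop("AB", None)))
    # Repeating fields go into child tables that use the PMID as foreign key.
    for position, name in enumerate(rec.pop("AU", []), start=1):
        conn.execute("INSERT INTO author VALUES (?, ?, ?)", (pmid, position, name))
    for term in rec.pop("MH", []):
        conn.execute("INSERT INTO mesh VALUES (?, ?)", (pmid, term))
    # Whatever is still in rec has not been captured by any table yet.
    if rec:
        leftovers[pmid] = sorted(rec)

conn.commit()
author_count = conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
null_abstracts = conn.execute(
    "SELECT COUNT(*) FROM paper WHERE abstract IS NULL").fetchone()[0]
```

Inspecting leftovers after the loop shows which keys still need a table, which addresses the wish to see what has and has not been handled.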
From chapmanb at 50mail.com Mon Feb 28 11:35:21 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Feb 2011 11:35:21 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu>
<20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID: <20110228163521.GF9652@sobchak.mgh.harvard.edu>
Brandon;
[pypaml branch: https://github.com/brandoninvergo/biopython/tree/paml-branch]
[base class]
> This is a really good idea and I'm a bit disappointed that I didn't
> see it myself! Indeed, most of the functionality is just copied/pasted
> between the classes, with only some variation in the
> read/write_ctl_file functions for codeml and baseml. So, writing a
> base class would really simplify things. I do have one question,
> though, since this is my first time organizing my code in a
> large-scale Python project. Where would be the best place to implement
> this base paml class? In __init__.py or in its own paml.py file? I
> know the end result would be the same but I figure I should start
> learning some of these best practices.
It's always easier to get perspective on code when you haven't been
directly in the middle of it. Even if you don't have someone to do
code reviews, stepping away from a project and coming back later
will often lead to a bunch of insights.
For the base class, I would follow Eric and Peter's example and use
files in the same directory with an underscore: something like _shared.py
or _base.py.
[read functions]
> This mess is precisely why I had to include so many different
> output files for the unittesting (codeml is the main culprit; baseml
> is moderately bad; yn00 isn't a problem)
I definitely feel your pain on this. This is exactly why your work
doing this is appreciated; you'll save someone a lot of headache
later on.
> So, because I would potentially end up scanning almost the entire file
> just to figure out what's going on, I think just parsing-as-you-go,
> using elif statements to short-circuit and skip further evaluations of
> a line after a match has been found, would be the better option.
> Perhaps the files aren't long enough to be able to make an appeal for
> computational efficiency but at the same time, I hesitate to read
> through the file multiple times unnecessarily. I agree, though, that
> this makes the read() function quite long. For that, though, I tried
> to provide descriptive comments before each parsing case, describing
> exactly what the next block of code is meant to parse and also
> including a specific example line which should be parsed by it.
The issue really is that deeply nested code is hard to read,
long functions are hard to read, and when you combine them together
it just makes it very difficult for others to follow your logic.
I don't think you necessarily have to make multiple passes to parse it
in a more structured way, but what you would want to focus on is making
the flow through the function simpler. The way I would normally attack
this is to break components into smaller more re-usable functions.
Here's a concrete example from the start of the codeml parser:
https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py
siteclass_re = re.match("Site-class models:\s*(.*)", line)
if siteclass_re is not None:
    siteclass_model = siteclass_re.group(1)
    if siteclass_model == "":
        multi_models = True
        continue
    results["site-class model"] = siteclass_model
    if siteclass_model == "NearlyNeutral":
        current_model = 1
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "PositiveSelection":
        current_model = 2
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "discrete (4 categories)":
        current_model = 3
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta (4 categories)":
        current_model = 7
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
    elif siteclass_model == "beta&w>1 (5 categories)":
        current_model = 8
        results["NSsites"][current_model] = \
                {"description":siteclass_model}
        if 0 in results["NSsites"]:
            del results["NSsites"][0]
You could refactor this something along the lines of:
import re

class _CodemlParser:
    def __init__(self):
        # "NSsites" must exist before models are stored under it.
        self.results = {"NSsites": {}}
        self.flags = dict(multi_models=False)

    def read(self, results_handle):
        for line in results_handle:
            siteclass_re = re.match(r"Site-class models:\s*(.*)", line)
            if siteclass_re is not None:
                self._siteclass_parse(siteclass_re)

    def _add_siteclass_model(self, siteclass_model):
        self.results["site-class model"] = siteclass_model
        # One lookup table replaces the chain of near-identical elif branches.
        name_to_num = {"NearlyNeutral": 1,
                       "PositiveSelection": 2,
                       "discrete (4 categories)": 3,
                       "beta (4 categories)": 7,
                       "beta&w>1 (5 categories)": 8}
        current_model = name_to_num[siteclass_model]
        self.results["NSsites"][current_model] = {"description": siteclass_model}
        if 0 in self.results["NSsites"]:
            del self.results["NSsites"][0]

    def _siteclass_parse(self, siteclass_re):
        siteclass_model = siteclass_re.group(1)
        if siteclass_model == "":
            # An empty model name means several models follow.
            self.flags["multi_models"] = True
        else:
            self._add_siteclass_model(siteclass_model)
You are not changing the parsing strategy, but now you've got
individual functions handling each of the steps so it's clear that
the _siteclass_parse either sets multi_models or adds details about
the single model. Then you can dig into the _add_siteclass_model
function to see what it is doing. To the reader, each individual
unit can be read and understood separately.
This type of refactoring work is useful generally. I have to do it all
the time in my work and discover new tricks and approaches. Hope this
is helpful and thanks again for all the work on this,
Brad
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:00:54 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:00:54 -0500
Subject: [Biopython-dev] [Bug 3173] New: Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
Summary: Bio.Emboss.Primer3 parser incompatibility with Primer3
version 2.2.3
Product: Biopython
Version: 1.55
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Other
AssignedTo: biopython-dev at biopython.org
ReportedBy: jp.verta at gmail.com
I'm running Biopython 1.55, Python 2.6 and EMBOSS version 6.3.1 on MacOS X 10.6
Snow Leopard.
The Bio.Emboss.Primer3 parser seems to be incompatible with the newer version
2.2.3 of the Whitehead Primer3 program and the corresponding Emboss eprimer3
program output. The parser output for the reverse primer seems to contain all
Primer class members (primer.reverse_tm, primer.reverse_gc etc.) except the
reverse primer sequence (primer.reverse_seq). Yet the eprimer3 output seems
identical to that of old versions (see the attached output.pr3 files).
Here is example code for designing primers for a set of FASTA sequences.
>>>
def design_primers(fasta_file, output_file):
    from Bio import SeqIO
    from Bio.Emboss.Applications import Primer3Commandline
    from Bio.Emboss import Primer3
    output = open(output_file, "w")
    output.write("name,forward_primer,reverse_primer,forward_tm,reverse_tm,product_size\n")
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        if not(seq_record):
            break
        open("sequence", "w").write(">"+str(seq_record.id)+"\n"+str(seq_record.seq)+"\n")
        primer_cl = Primer3Commandline(sequence="sequence")
        primer_cl.explainflag = True
        primer_cl.osizeopt=20
        primer_cl.psizeopt=200
        primer_cl.otm=65
        primer_cl.maxtm=70
        primer_cl.mintm=60
        primer_cl.gcclamp=1 #required number of Gs or Cs at the 3' end of the primer
        primer_cl.outfile = "output.pr3"
        primer_cl()
        output_handle = open("output.pr3","r")
        primer_record = Primer3.read(output_handle)
        if len(primer_record.primers) > 0:
            primer = primer_record.primers[0]
            output.write("%s,%s,%s,%s,%s,%s\n" % (seq_record.id,
                primer.forward_seq, primer.reverse_seq,
                primer.forward_tm,primer.reverse_tm,primer.size))
        else:
            print "No primers found for %s" % seq_record.id
>>>
This code, when executed on a file of FASTA sequences, gives an output file
with forward and reverse primer id, sequence, tm and size separated by commas.
When I execute it with the Primer3-2.2.3 and compatible eprimer3 versions, the
field for the reverse primer sequence appears blank.
I will attach the Primer3-2.2.3 compatible eprimer3 file to this report.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:03:17 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:03:17 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011803.p11I3HtF008419@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #1 from jp.verta at gmail.com 2011-02-01 13:03 EST -------
Created an attachment (id=1565)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1565&action=view)
Emboss eprimer3.c file for Primer3 version 2.2.3
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:05:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:05:04 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011805.p11I54md008512@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #2 from jp.verta at gmail.com 2011-02-01 13:05 EST -------
Created an attachment (id=1566)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1566&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program
output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:06:00 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:00 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I60De008626@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
jp.verta at gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1566   |Example output of Primer3   |Example output of Primer3
description        |version 1.1.4 compatible    |version 2.2.3 compatible
                   |Emboss eprimer3 program     |Emboss eprimer3 program
                   |output                      |output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:06:47 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:06:47 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011806.p11I6lVr008664@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
------- Comment #3 from jp.verta at gmail.com 2011-02-01 13:06 EST -------
Created an attachment (id=1567)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1567&action=view)
Example output of Primer3 version 1.1.4 compatible Emboss eprimer3 program
output
From bugzilla-daemon at portal.open-bio.org Tue Feb 1 18:07:44 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 1 Feb 2011 13:07:44 -0500
Subject: [Biopython-dev] [Bug 3173] Bio.Emboss.Primer3 parser
incompatibility with Primer3 version 2.2.3
In-Reply-To:
Message-ID: <201102011807.p11I7ih8008712@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3173
jp.verta at gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1566   |application/octet-stream    |text/plain
mime type          |                            |
From biopython at maubp.freeserve.co.uk Tue Feb 1 20:39:30 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 20:39:30 +0000
Subject: [Biopython-dev] [Biopython] internal function to convert
illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu>
<97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
<20110201160304.GH17835@sobchak.mgh.harvard.edu>
Message-ID:
On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote:
> On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote:
>>
>> Peter, how hard do you think it would be to have SeqIO only convert
>> from the fastq encoding to phred scores on demand? Most of the time
>> when dealing with fastq I do not need any conversion at all and use
>> the FastqGeneralIterator to just pull out the name, sequence and
>> quality.
>>
>> You've done a lot of nice work with the correct conversions and it
>> would be great to expose that directly through on-demand conversion
>> as Alan is suggesting. Ideally you would use SeqIO as normal with
>> fastq files, but the quality score would not be converted to solexa
>> during parsing, only when letter_annotations["solexa_quality"] was
>> accessed.
>
> I actually implemented a proof of concept that does that. In order
> to not alter the SeqRecord behaviour, it was a new object which
> acted like a list of integers in many respects. The data is held
> as a FASTQ encoded string, and decoded (and then cached) on
> demand only. On output if it was already in the right encoding
> the string could be used as is, otherwise the conversion could
> be done very quickly with a precomputed table and the string
> translate() method (without having to go via a list of integers).
> It seemed to work, but I wasn't convinced about the benefits
> (given the complexity). I'd really want some real world FASTQ
> benchmarks to try it on... something you might have in the form
> of your scripts and the real data they were written for?
>
> I'm pretty sure this code is in a local git branch on one of my
> machines (probably at home), but I don't think I pushed it to
> github. I should do that...
Found it and pushed it:
https://github.com/peterjc/biopython/tree/fastq-tricks
Note there are unit test failures (e.g. as currently implemented
there is no range checking on the characters in the quality strings
at parse time). We may want to continue this on the dev mailing list...
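As a rough illustration of the idea in the quoted text (this is not the code in the fastq-tricks branch), the sketch below keeps the quality string encoded, decodes it to integers only on first access, and re-encodes between offsets with a precomputed table and str.translate(). The class name LazyQuality and the table name are hypothetical, and only the simple offset 64 to offset 33 PHRED case is covered; Solexa scores need a non-linear mapping that is left out here. It is written for Python 3.

```python
class LazyQuality:
    """Hold a FASTQ quality string; decode to integers on first use, then cache."""

    def __init__(self, encoded, offset=33):
        self._encoded = encoded
        self._offset = offset
        self._decoded = None  # filled in on first access

    def _scores(self):
        if self._decoded is None:
            self._decoded = [ord(ch) - self._offset for ch in self._encoded]
        return self._decoded

    def __len__(self):
        return len(self._encoded)  # no decoding needed for the length

    def __getitem__(self, index):
        return self._scores()[index]

    def __iter__(self):
        return iter(self._scores())

    def as_string(self):
        return self._encoded  # output in the same encoding needs no conversion


# Precomputed table: PHRED 0..62 at offset 64 mapped to the same scores at offset 33.
OFFSET64_TO_OFFSET33 = str.maketrans(
    "".join(chr(64 + q) for q in range(63)),
    "".join(chr(33 + q) for q in range(63)),
)

illumina_quals = "hhhhB"  # PHRED 40, 40, 40, 40, 2 at offset 64
sanger_quals = illumina_quals.translate(OFFSET64_TO_OFFSET33)
lazy = LazyQuality(sanger_quals)
```

The translate() call converts the whole string without ever building a list of integers, which is the shortcut described in the quoted message.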
Peter
From p.j.a.cock at googlemail.com Thu Feb 3 12:04:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 12:04:08 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
Message-ID:
On Wed, Jan 26, 2011 at 7:44 PM, Peter Cock wrote:
>
> I'm currently looking at trimming 5' and 3' PCR primer sequences -
> which could equally be used for barcodes etc. I'd probably wrap this
> as a Galaxy tool (using Biopython).
>
If anyone is interested, see this thread on the Galaxy-dev mailing list:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html
In terms of SFF output, I'm only writing one SFF file so the issues
Jacob is concerned about (when writing one SFF file per barcode)
do not apply.
On Fri, Jan 28, 2011 at 12:34 PM, Brad Chapman wrote:
>
> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
> Brad
The computationally interesting part is matching the primer/adapter/
barcode to the read (both of which may contain IUPAC ambiguity codes),
which as you point out can be replaced once you have a working
framework for the input, output, trimming, etc.
Currently I'm using regular expressions, which is fast enough for my
own needs - and this task could easily be parallelised by breaking
up the input reads. Beyond that perhaps something based on
Hamming distances (edit distance - number of mismatches) or
Levenshtein searches might be quicker. I guess speed is more of
an issue with Illumina than with 454 due to the number of reads?
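For illustration only (this is neither my Galaxy tool nor Brad's script), here is a minimal sketch of both approaches: turning a primer with IUPAC ambiguity codes into a regular expression, and counting mismatches Hamming-style over a fixed window. The function names are hypothetical, and for simplicity only the primer may contain ambiguity codes; an ambiguous base in the read counts as a mismatch unless the primer position is N.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer with ambiguity codes into a regular expression."""
    parts = []
    for code in primer.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def mismatches(primer, window):
    """Count positions where the read base is not allowed by the primer code."""
    if len(primer) != len(window):
        raise ValueError("primer and window must be the same length")
    return sum(base not in IUPAC[code]
               for code, base in zip(primer.upper(), window.upper()))


def trim_5prime(read, primer, max_mismatches=0):
    """Remove the primer from the start of the read if it matches well enough."""
    window = read[:len(primer)]
    if len(window) == len(primer) and mismatches(primer, window) <= max_mismatches:
        return read[len(primer):]
    return read


pattern = primer_to_regex("ACRT")
```

The regular expression gives exact matching under the ambiguity codes, while the mismatch count allows a tolerance; neither handles gaps, which is where something like pairwise2 comes in.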
Brad - you mentioned using approximate matches with gaps. Did you
find gapped matches made a big difference to the number of matches
found? i.e. is it worthwhile on your data?
Peter
From bugzilla-daemon at portal.open-bio.org Thu Feb 3 22:47:04 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 17:47:04 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102032247.p13Ml4QY029111@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
walter_gillett at hotmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |walter_gillett at hotmail.com
------- Comment #4 from walter_gillett at hotmail.com 2011-02-03 17:47 EST -------
Short answer:
The fix looks good - I have dug into the logic in detail and stepped through
the example. However, it appears to me that there is still a bug in this line
of code in the viterbi method:
for cur_state in self.transitions_from(main_state):
In this context, "cur_state" is a state prior to "main_state", so what we
really need here is the set of states that lead to main_state, not the set of
states that can be reached from main_state. This bug won't cause trouble in
practice unless you use a non-ergodic HMM, that is, a model in which some state
transitions are disallowed. (The variable names here are confusing; it would
be better to rename main_state to cur_state and cur_state to previous_state,
or something like that.) This bug is unrelated to the problem originally reported,
other than appearing in the same part of the code, so perhaps it should be
handled in a separate ticket.
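To show the difference between the two lookups, here is a small sketch with a hypothetical non-ergodic model. The transition set and the helper names are made up for illustration and are not the Bio.HMM data structures.

```python
from collections import defaultdict

# Hypothetical non-ergodic model: the allowed (from_state, to_state) transitions.
allowed = {("a", "a"), ("a", "b"), ("b", "c"), ("c", "c")}


def build_successors(transitions):
    """Map each state to the states that can be reached from it."""
    successors = defaultdict(list)
    for source, target in sorted(transitions):
        successors[source].append(target)
    return dict(successors)


def build_predecessors(transitions):
    """Map each state to the states that lead to it (what Viterbi needs)."""
    predecessors = defaultdict(list)
    for source, target in sorted(transitions):
        predecessors[target].append(source)
    return dict(predecessors)


successors = build_successors(allowed)
predecessors = build_predecessors(allowed)
```

For state "b" the successors are ["c"] but the predecessors are ["a"], so maximising over the wrong set considers a path that cannot occur and misses the one that can. In an ergodic model both sets contain every state, which is why the problem stays hidden there.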
I would be happy to code up a fix if that makes sense.
Longer answer:
I had spent a bunch of time recently investigating this - should have noted
that in bugzilla to avoid duplication of effort. But still seems worthwhile
writing down my notes to document this better, so I'll do that here.
There was an error in the Viterbi algorithm termination logic, as implemented in
the method MarkovModel#viterbi. The Viterbi probabilities were being multiplied
by the log-probability of a transition back to an end state (state 0). This was
incorrect because in log space the log-probabilities should be added, not
multiplied. Peter's fix removes that multiplication, thus dropping the end
state transition entirely (which Durbin considers optional, so that's fine; and
it was causing trouble). With the bug fixed, the most probable state path to
generate 6 tails (in the example model described by the bug reporter) becomes
"uuuuuu" as expected - no final "f".
At a higher level, there was (in versions 1.56 and prior, but no longer in
trunk) an important undocumented (as far as I can see) requirement that the
model always starts in state 0. The bug reporter complained that the results of
the Viterbi path calculation are wrong because "apparently they depend upon the
order of the state alphabet," which was true. In the example model, providing
the state alphabet ["f", "u"] causes the system to start in state f. Since
there is a big penalty in his example for switching states, you get "ff" as the
most likely state path for the output sequence [tails, tails], even though the
unfair coin is much more likely than the fair coin to yield tails.
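The points about log space and starting states can be sketched with a minimal Viterbi implementation that adds log-probabilities at every step and takes an explicit initial distribution. This is not the Bio.HMM code, and the coin model parameters below are assumptions chosen for illustration rather than the values from the original report.

```python
import math


def viterbi(observations, states, log_init, log_trans, log_emit):
    """Return the most probable state path and its log-probability."""
    # v[s] is the best log-probability of any path that ends in state s.
    v = {s: log_init[s] + log_emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        step, new_v = {}, {}
        for cur in states:
            # In log space, probabilities combine by addition, not multiplication.
            prev = max(states, key=lambda p: v[p] + log_trans[p][cur])
            step[cur] = prev
            new_v[cur] = v[prev] + log_trans[prev][cur] + log_emit[cur][obs]
        back.append(step)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    return "".join(reversed(path)), v[last]


# Assumed parameters: fair coin "f", unfair coin "u", costly state switches.
states = ["f", "u"]
log_init = {s: math.log(0.5) for s in states}  # all starts equally probable
log_trans = {p: {c: math.log(0.9 if p == c else 0.1) for c in states}
             for p in states}
log_emit = {"f": {"H": math.log(0.5), "T": math.log(0.5)},
            "u": {"H": math.log(0.1), "T": math.log(0.9)}}

path, score = viterbi("TTTTTT", states, log_init, log_trans, log_emit)
```

With these numbers six tails give the path "uuuuuu", and reordering the states list does not change the answer because no state is treated as a special start state.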
Looks like Peter's fix treats all starting states as equally probable, there is
no longer a special start state. That's reasonable, although the coding is a
little confusing:
# v_{0}(0) = 0
viterbi_probs[(state_letters[0], -1)] = 0
# v_{k}(0) = 0 for k > 0
for state_letter in state_letters[1:]:
    viterbi_probs[(state_letter, -1)] = 0
because it could now more naturally be done in two lines of code rather than
three. Possibly it's useful to keep the assignment for state 0 separate in case
we want to change it.
A good long-term improvement would be to have a special hidden start state like
the "MagicalState" used by BioJava (see
http://www.biojava.org/wiki/BioJava:Tutorial:Simple_HMMs_with_BioJava). That
would make it possible to specify a probability distribution for what the
initial state should be, a typical HMM feature (see Durbin's book, for
example).
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 00:16:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 3 Feb 2011 19:16:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102040016.p140GdsK031389@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #5 from pgarland at gmail.com 2011-02-03 19:16 EST -------
FWIW, I think the right thing with respect to begin states is to require the
user to explicitly specify a begin state in the state alphabet, e.g.:
class coin:
    def __init__(self):
        self.begin_state_name = "begin"
        self.letters = ["u", "f"]
Having the user specify the name should reduce the chance of naming conflicts,
and makes it easier for the user to understand what is going on if they print
viterbi_probs, or are trying to debug a problem.
The user should also be required to explicitly set the initial probabilities.
There should be three methods for this, one that takes a list of initial
probabilities, one that makes all initial states equally probable, and one that
lets the user set the probability for each state individually. e.g:
MarkovModelBuilder.set_initial_probabilities([0.01, 0.99])
MarkovModelBuilder.set_initial_probabilities_equal()
MarkovModelBuilder.set_initial_probability("u", 0.01)
The first and third methods would raise an exception if the probabilities
did not sum to 1.0
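A rough sketch of how those three setters might behave is given below. The class InitialProbabilities and its method names are hypothetical and not part of Bio.HMM or MarkovModelBuilder. One assumption differs slightly from the description: because a per-state setter cannot know whether more values are coming, the sum check for individually set values is deferred to a separate validate() call.

```python
import math


class InitialProbabilities:
    """Hypothetical helper for illustration; not part of Bio.HMM."""

    def __init__(self, states):
        self.states = list(states)
        self.probs = {}

    def set_all(self, values):
        """Set every probability from a list ordered like the states."""
        if len(values) != len(self.states):
            raise ValueError("need one probability per state")
        if not math.isclose(sum(values), 1.0, abs_tol=1e-9):
            raise ValueError("initial probabilities must sum to 1.0")
        self.probs = dict(zip(self.states, values))

    def set_equal(self):
        """Make all initial states equally probable."""
        share = 1.0 / len(self.states)
        self.probs = {state: share for state in self.states}

    def set_one(self, state, value):
        """Set a single state's probability; call validate() when done."""
        if state not in self.states:
            raise KeyError(state)
        self.probs[state] = value

    def validate(self):
        """Raise ValueError unless the stored probabilities sum to 1.0."""
        total = sum(self.probs.get(state, 0.0) for state in self.states)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError("initial probabilities sum to %r, not 1.0" % total)


initial = InitialProbabilities(["u", "f"])
initial.set_all([0.01, 0.99])
```

The tolerance in math.isclose avoids rejecting valid inputs such as three values of 1/3 whose floating-point sum is not exactly 1.0.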
Alternatively, the initial probabilities could be specified when defining the
state alphabet:
def __init__(self):
    self.begin_state_name = "begin"
    self.letters = [{'name': "u", 'init_prob': 0.01},
                    {'name': "f", 'init_prob': 0.99}]
This has the advantage of making the code more concise and readable, because
the state's declaration and specification are kept together. It has the
disadvantage of adding an unnecessary layer of indirection when all the states
have equal initial probabilities. To make things less tedious for the user,
there could either be a flag specifying that all states have an equal initial
probability:
def __init__(self):
    self.begin_state_name = "begin"
    self.initial_probabilities_equal = True
    self.letters = [{'name': "u"}, {'name': "f"}]
or again, a method could be provided:
MarkovModelBuilder.set_initial_probabilities_equal()
Because specifying the begin state name and the initial probabilities would be
required, any of these changes would break the current API.
Similar features should be provided for users who want to constrain the end
state, but not specifying the end state should not raise an exception.
I agree the variable names "main_state" and "cur_state" are confusing and
should be changed.
~Phillip
From clementsgalaxy at gmail.com Fri Feb 4 01:01:01 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Thu, 3 Feb 2011 17:01:01 -0800
Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren,
The Netherlands
Message-ID:
We are pleased to announce the *2011 Galaxy Community Conference*, being
held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature two
full days of presentations and discussion on extending Galaxy to use new
tools and data sources, deploying Galaxy at your organization, and best
practices for using Galaxy to further your own and your community's
research. See http://galaxy.psu.edu/gcc2011/ for complete details.
*About Galaxy:*
Galaxy is an open, web-based platform for *accessible, reproducible, and
transparent* computational biomedical research.
- *Accessibility:* Galaxy enables users without programming experience to
easily specify parameters and run tools and workflows.
- *Reproducibility:* Galaxy captures all information necessary so that
any user can repeat and understand a complete computational analysis.
- *Transparency:* Galaxy enables users to share and publish analyses via
the web and create Pages--interactive, web-based documents that describe a
complete analysis.
Galaxy is open source for all organizations. The public Galaxy service (
http://usegalaxy.org) makes analysis tools, genomic data,
tutorial demonstrations, persistent workspaces, and publication services
available to any scientist that has access to the Internet. Local
Galaxy servers can be set up by downloading the Galaxy application and
customizing it to meet particular needs.
*Conference Overview:*
This event aims to engage a broader community of developers, data producers,
tool creators, and core facility and other research hub staff to become an
active part of the Galaxy community. We'll cover defining resources in the
Galaxy framework, increasing their visibility and making them easier to use
and integrate with other resources, how to extend Galaxy to use custom data
sources and custom tools, and best practices for using Galaxy in your
organization.
Additional topics include, but are not limited to:
* Talks submitted by the Galaxy community
* Integration of tools (including NGS analysis tools) and distributed job
management
* Deployment of Galaxy instances on local resources and on the Cloud
* Management of large datasets with the Galaxy Library System
* Using the Galaxy LIMS functionality at NGS sequencing facilities
* Visualizing Data without leaving Galaxy
* Performing reproducible research
* Performing and sharing complex analyses with Workflows
* An "Introduction to Galaxy" session, offered on May 24, for Galaxy
newcomers.
*Registration:*
The conference fee is €100 on or before April 24, and €120 after that. The
meeting is being held at the Conference Centre De Werelt in Lunteren, The
Netherlands, which is also the conference hotel. You are encouraged to
register early, as space at the hotel (and at the "Intro to Galaxy" session)
is limited and is likely to fill up before the conference itself does. See
http://galaxy.psu.edu/gcc2011/Register.html
*Abstract Submission:*
Abstracts are now being accepted for short oral presentations. Proposals on
any topic of interest to the Galaxy community are welcome and encouraged.
The abstract submission deadline is the end of February 28. See
http://galaxy.psu.edu/gcc2011/Abstracts.html
*Sponsors*
The 2011 Galaxy Community Conference is co-sponsored by the US National
Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands
Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative
institute of the bioinformatics groups in the Netherlands. Together, these
groups perform cutting-edge research, develop novel tools and support
platforms, create an e-science infrastructure and educate the next
generations of bioinformaticians.
We are looking forward to a great conference and hope to see you in the
Netherlands!
The Galaxy and NBIC Teams
--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 10:05:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 05:05:18 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041005.p14A5Ij0019705@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 05:05 EST -------
(In reply to comment #4)
>
> Looks like Peter's fix treats all starting states as equally probable,
> there is no longer a special start state. That's reasonable, although the
> coding is a little confusing:
>
It was Phillip's fix.
(In reply to comment #5)
> FWIW, I think the right thing with respect to begin states is to require the
> user to explicitly specify a begin state in the state alphabet, e.g.:
> class coin:
>     def __init__(self):
>         self.begin_state_name = "begin"
>         self.letters = ["u", "f"]
If we go that route, we'll need to make very clear the differences between a
HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It
must make sense in many cases to use a biological sequence alphabet, but in
general adding HMM attributes to the class does not make sense.
We really need someone to volunteer to take over this code (and sort out the
overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some
documentation for the tutorial, and sort out these remaining issues. Are either
of you interested?
>
> I agree the variable names "main_state" and "cur_state" are confusing and
> should be changed.
>
I'll happily merge/cherry-pick a simple diff to do that only if you do that on
github, or apply a patch if you upload it here.
Thanks,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:46:02 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 08:46:02 -0500
Subject: [Biopython-dev] [Bug 3175] New: Caret in genbank files leads to
GenBank Parser crash in Biopython 1.54
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
Summary: Caret in genbank files leads to GenBank Parser crash in
Biopython 1.54
Product: Biopython
Version: 1.54
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: aaron.tin.long.lun at gmail.com
When parsing genbank files using Bio.SeqIO as described in the Biopython
Cookbook, the presence of a caret in the position of a feature in the
annotation (e.g. CDS 1000..1001^1002) raises a LocationParserError, leading
to "Syntax error at or near `Tokens('caret')' token". Appears to occur
regardless of the type of the feature, whether it is normal/reverse complement,
etc. Found in BioPython 1.54 on a Dell dimension 2400 running Kubuntu 10.10.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 13:49:23 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 08:49:23 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041349.p14DnN75028633@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #1 from aaron.tin.long.lun at gmail.com 2011-02-04 08:49 EST -------
Created an attachment (id=1568)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1568&action=view)
Crash-inducing file for the GenBank parser
Example file, modified from the human mitochondrial genome, with a caret
introduced in line 96. Causes the crash described in the bug description.
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 14:20:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 09:20:33 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041420.p14EKX5n030354@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 09:20 EST -------
Hi Aaron,
The example in attachment #1568 from comment #1 is invalid. The feature
location join(16024^16026..16569,1..576) is wrong since the caret should be
used in the form [i]^[i+1], i.e. consecutive numbers. See:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
That example should probably be a between location like
join((16024.16026)..16569,1..576)
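The consecutive-number rule described above can be checked mechanically. What follows is a minimal sketch in plain Python, not Biopython code: the helper name check_caret_positions and the regular expression are assumptions made for illustration, and it inspects only the i^j pairs, not the rest of the location syntax.

```python
import re

# Hypothetical helper, not part of Biopython: matches every "i^j" pair
# that appears in a feature location string.
CARET_PAIR = re.compile(r"(\d+)\^(\d+)")


def check_caret_positions(location):
    """Return (i, j, is_consecutive) for each i^j pair in the location."""
    return [
        (int(i), int(j), int(j) == int(i) + 1)
        for i, j in CARET_PAIR.findall(location)
    ]


# The location from attachment 1568: 16024 and 16026 are not consecutive.
print(check_caret_positions("join(16024^16026..16569,1..576)"))
# -> [(16024, 16026, False)]

# The location from the original report: 1001 and 1002 are consecutive.
print(check_caret_positions("1000..1001^1002"))
# -> [(1001, 1002, True)]
```

A check like this only says whether the caret pair itself follows the [i]^[i+1] form; whether a parser should accept such a pair as the end point of a range is a separate question.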
However, the example in the original bug report, 1000..1001^1002, looks
possible (but unprecedented to my knowledge) and that also fails with the
latest Biopython GenBank parsing code (much changed since Biopython 1.54). I
don't really understand how that usefully differs from 1000..1001 or 1000..1002
though.
Was that from a GenBank file from the NCBI? If so what accession please, or a
URL?
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 15:00:49 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 10:00:49 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041500.p14F0naj032533@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
georg.lipps at fhnw.ch changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |WORKSFORME
------- Comment #7 from georg.lipps at fhnw.ch 2011-02-04 10:00 EST -------
Yes,
the code seems to work now.
The probability of attaining the first state is now the transition probability
of remaining in the same state (here 0.95).
I like the suggestion of comment #5 to explicitly state a begin state with
the corresponding transition probabilities.
A big THANKS for fixing,
Georg
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 15:19:11 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 10:19:11 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041519.p14FJBme001095@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #8 from walter_gillett at hotmail.com 2011-02-04 10:19 EST -------
I'll volunteer to do all of that (OK with you, Phillip?).
Walter
(In reply to comment #6)
> (In reply to comment #5)
> > FWIW, I think the right thing with respect to begin states is to require the
> > user to explicitly specify a begin state in the state alphabet, e.g.:
> > class coin:
> >     def __init__(self):
> >         self.begin_state_name = "begin"
> >         self.letters = ["u", "f"]
>
> If we go that route, we'll need to make very clear the differences between a
> HMM Alphabet (of states) and a biological sequence alphabet (Bio.Alphabet). It
> must make sense in many cases to use a biological sequence alphabet, but in
> general adding HMM attributes to the class does not make sense.
>
> We really need someone to volunteer to take over this code (and sort out the
> overlap between Bio.MarkovModel and/or Bio.HMM.MarkovModel), write some
> documentation for the tutorial, and sort out these remaining issues. Are either
> of you interested?
>
> >
> > I agree the variable names "main_state" and "cur_state" are confusing and
> > should be changed.
> >
>
> I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> github, or apply a patch if you upload it here.
>
> Thanks,
>
> Peter
>
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 16:12:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 11:12:33 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102041612.p14GCXfW004211@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 11:12 EST -------
> >
> > I agree the variable names "main_state" and "cur_state" are confusing and
> > should be changed.
> >
>
> I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> github, or apply a patch if you upload it here.
I could have phrased that better: I mean a simple patch/diff to do the rename
only would be easy for me to review and check in.
(In reply to comment #8)
> I'll volunteer to do all of that (OK with you, Phillip?).
>
> Walter
That's OK with me. Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 17:25:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 12:25:18 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041725.p14HPIhY008673@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #3 from aaron.tin.long.lun at gmail.com 2011-02-04 12:25 EST -------
Hi Peter,
Thanks for the quick reply. I originally encountered the caret in the GenBank
entry for the chromosome II assembly of the human genome (accession number
NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at
the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene
e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is
rare, because I parsed through the complete sequences of 15 other chromosomes
before my program crashed. Hope that helps.
Cheers,
Aaron
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 17:43:37 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 12:43:37 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041743.p14HhbbY009388@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 12:43 EST -------
(In reply to comment #3)
> Hi Peter,
> Thanks for the quick reply. I originally encountered the caret in the GenBank
> entry for the chromosome II assembly of the human genome (accession number
> NT_022221.13, downloaded from NCBI's FTP site yesterday); it can be found at
> the very end of the annotation, for the V_segments/CDS of the IGKV2-40 gene
> e.g. CDS complement(<68451760..68452072^68452073). I suspect that it is
> rare, because I parsed through the complete sequences of 15 other chromosomes
> before my program crashed. Hope that helps.
> Cheers,
> Aaron
>
Where on the FTP site? It's a big place and I don't work with human genomes...
Looking via the Entrez website, it seems NT_022221.13 is only 3519312bp,
so this can't match the GenBank file you are looking at:
http://www.ncbi.nlm.nih.gov/nuccore/NT_022221.13?report=gbwithparts
Thanks,
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:05:42 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:05:42 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041805.p14I5gxS010298@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-04 13:05 EST -------
(In reply to comment #4)
>
> Where on the FTP site? It's a big place and I don't work with human genomes...
>
Never mind, I tried downloading a few candidates and found it - you actually
meant NT_015926.15 which is in this file (whose first entry is NT_022221.13)
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_02/hs_ref_GRCh37.p2_chr2.gbk.gz
It seems that Google doesn't index this site - I can understand why but it
would have been useful.
Peter
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:15:05 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:15:05 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041815.p14IF5Bx010832@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #6 from aaron.tin.long.lun at gmail.com 2011-02-04 13:15 EST -------
Hi Peter,
Yeah, sorry about the mix-up, I'm not used to dealing with more than one
sequence record per file. The caret should be present in the FTP-sourced file.
Interestingly, it is not present in the Nucleotide annotation for the same
accession number, which suggests that they've updated it in the two/three
months since the data was pushed onto the FTP site.
Cheers,
Aaron
From bugzilla-daemon at portal.open-bio.org Fri Feb 4 18:20:12 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 13:20:12 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102041820.p14IKCDN011161@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #7 from aaron.tin.long.lun at gmail.com 2011-02-04 13:20 EST -------
NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in my
file. What I said about Nucleotide still applies, though.
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 04:23:02 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Feb 2011 23:23:02 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102050423.p154N2fO013565@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #10 from pgarland at gmail.com 2011-02-04 23:23 EST -------
(In reply to comment #8)
> I'll volunteer to do all of that (OK with you, Phillip?).
>
> Walter
Sure. WRT my earlier comment, I realized that it's simpler for both the
implementer and the user if the only user-visible change necessary to specify
begin states is to add a variable to HiddenMarkovBuilder to hold the name of
the begin state, and then let users use set_transition_score to specify
transition probabilities from begin states. Then the relevant methods, e.g.
_all_blank, allow_transition, allow_all_transitions, set_transition_score, etc
have to be altered to forbid transitions to, or emissions from the begin state.
And get_markov_model would raise an exception if a begin state hasn't been
specified or if there isn't at least one transition from the begin state.
So all users would have to do is (using the example from the bug report):
...
build.begin_state_name = "begin"
build.set_transition_score("begin", "u", 0.01)
build.set_transition_score("begin", "f", 0.99)
...
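The checks proposed above could look roughly like the following. This is only a sketch under stated assumptions: ToyMarkovBuilder is a hypothetical stand-in written for this illustration, not the existing Bio.HMM builder class, and it models only the transition table (forbidding emissions from the begin state is left out).

```python
class ToyMarkovBuilder:
    """Hypothetical stand-in for a builder; not the Bio.HMM API."""

    def __init__(self, states):
        self.states = list(states)
        self.begin_state_name = None  # set by the user, as in the proposal
        self.transitions = {}

    def set_transition_score(self, from_state, to_state, probability):
        # Nothing may transition into the begin state.
        if self.begin_state_name is not None and to_state == self.begin_state_name:
            raise ValueError("transitions to the begin state are not allowed")
        known_sources = set(self.states)
        if self.begin_state_name is not None:
            known_sources.add(self.begin_state_name)
        if from_state not in known_sources or to_state not in self.states:
            raise KeyError("unknown state in transition")
        self.transitions[(from_state, to_state)] = probability

    def get_markov_model(self):
        # Refuse to build a model without an explicit, usable begin state.
        if self.begin_state_name is None:
            raise ValueError("begin_state_name has not been set")
        if not any(src == self.begin_state_name for src, _ in self.transitions):
            raise ValueError("no transition leaves the begin state")
        return dict(self.transitions)


build = ToyMarkovBuilder(["u", "f"])
build.begin_state_name = "begin"
build.set_transition_score("begin", "u", 0.01)
build.set_transition_score("begin", "f", 0.99)
model = build.get_markov_model()
```

Here the returned "model" is just a copy of the transition dictionary; a real implementation would return a model object.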
~Phillip
From bugzilla-daemon at portal.open-bio.org Sat Feb 5 07:04:46 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 5 Feb 2011 02:04:46 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102050704.p1574kup024068@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #11 from walter_gillett at yahoo.com 2011-02-05 02:04 EST -------
Sounds good.
(I had been thinking about trying to preserve backward compatibility for
existing clients of this class. If we require that the caller sets a begin
state then all existing clients will break since none of them currently does
that. But the previous fix has already broken compatibility in any case, and
that was probably necessary since prior to the fix, the results were
incorrect.)
A possible variation would be to handle the transition from the begin state to
the first real state with special-case code, so that the begin state would not
be included in the set of real states. The upside would be that the methods you
mention would not have to change, and we wouldn't be cluttering the state
alphabet with a begin state that isn't real, which I think was a concern
mentioned in comment #6 (if I understood it properly). The downside is having
to add that special-case code. Not sure yet whether this is a good idea or not.
Walter
(In reply to comment #10)
> (In reply to comment #8)
> > I'll volunteer to do all of that (OK with you, Phillip?).
> >
> > Walter
>
> Sure. WRT my earlier comment, I realized that it's simpler for both the
> implementer and the user if the only user-visible change necessary to specify
> begin states is to add a variable to HiddenMarkovBuilder to hold the name of
> the begin state, and then let users use set_transition_score to specify
> transition probabilities from begin states. Then the relevant methods, e.g.
> _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc
> have to be altered to forbid transitions to, or emissions from the begin state.
> And get_markov_model would raise an exception if a begin state hasn't been
> specified or if there isn't at least one transition from the begin state.
>
> So all users would have to do is (using the example from the bug report):
>
> ...
> build.begin_state_name = "begin"
> build.set_transition_score("begin", "u", 0.01)
> build.set_transition_score("begin", "f", 0.99)
> ...
>
> ~Phillip
>
From bugzilla-daemon at portal.open-bio.org Sun Feb 6 03:23:39 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 5 Feb 2011 22:23:39 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102060323.p163NdIu013858@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #12 from pgarland at gmail.com 2011-02-05 22:23 EST -------
(In reply to comment #11)
> Sounds good.
>
> (I had been thinking about trying to preserve backward compatibility for
> existing clients of this class. If we require that the caller sets a begin
> state then all existing clients will break since none of them currently does
> that. But the previous fix has already broken compatibility in any case, and
> that was probably necessary since prior to the fix, the results were
> incorrect.)
I don't think it's worth it to worry about preserving complete backward
compatibility. Right now there are two classes of code:
1) Code that manually sets up a begin state and the appropriate transitions.
All these people would need to do is add one line of code specifying the begin
state, and the rest of their code would work as before. For these users, we
could print an error message instructing them to set the begin_state_name
variable (and document the change too!).
2) Code that does not set up a begin state, as in the bug report. Even with the
applied bug fix, this code only returns a correct state sequence when all
possible start states should be equally probable. In all other cases the users
are possibly getting an incorrect result without being aware of it. To my mind,
this is worse than breaking backward compatibility. We could maintain backward
compatibility by having a default model for the initial state (e.g. equally
probable, or assign random probabilities), but unless that's the model the user
should be assuming for their sequence, they'll still be silently returned an
incorrect result.
> A possible variation would be to handle the transition from the begin state to
> the first real state with special-case code, so that the begin state would not
> be included in the set of real states. The upside would be that the methods you
> mention would not have to change, and we wouldn't be cluttering the state
> alphabet with a begin state that isn't real, which I think was a concern
> mentioned in comment #6 (if I understood it properly). The downside is having
> to add that special-case code. Not sure yet whether this is a good idea or not.
>
> Walter
I hadn't thought of that approach. It could be a good way to go. I think the
tradeoffs would be:
A) Of the existing code, changes would be localized to the viterbi method,
which would become slightly more complex.
B) This approach makes it trivial to guarantee that no state can transition to
the begin state.
C) One new public method would have to be added, for users to set initial
probabilities.
D) Having to use the new method would require more, though not complex, changes
to existing user code, but would have the benefit of making it as explicit as
possible how the model is initialized.
All in all, your idea of keeping the begin state separate looks like the way to
go.
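To make the tradeoff concrete, here is a minimal, self-contained Viterbi sketch in which the begin state is kept out of the state alphabet and is represented by a separate table of initial probabilities. It is not the Bio.HMM implementation: the function name, the argument layout, and the emission probabilities are assumptions for illustration. Only the 0.01/0.99 initial probabilities and the 0.95 self-transition come from this thread.

```python
import math


def viterbi(observations, states, initial, transition, emission):
    """Return (best state path, log probability) for the observations."""
    # The first column is scored from the initial distribution, which
    # plays the role of the begin state; no real state transitions into it.
    scores = {
        s: math.log(initial[s]) + math.log(emission[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in states:
            best_prev = max(
                states, key=lambda p: scores[p] + math.log(transition[p][cur])
            )
            new_scores[cur] = (
                scores[best_prev]
                + math.log(transition[best_prev][cur])
                + math.log(emission[cur][obs])
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


STATES = ["u", "f"]  # unfair and fair coin
INITIAL = {"u": 0.01, "f": 0.99}
UNIFORM = {"u": 0.5, "f": 0.5}
TRANSITION = {"u": {"u": 0.95, "f": 0.05}, "f": {"u": 0.05, "f": 0.95}}
# Assumed emission probabilities, chosen only for this example.
EMISSION = {"u": {"H": 0.9, "T": 0.1}, "f": {"H": 0.5, "T": 0.5}}

print(viterbi("H", STATES, INITIAL, TRANSITION, EMISSION)[0])  # -> ['f']
print(viterbi("H", STATES, UNIFORM, TRANSITION, EMISSION)[0])  # -> ['u']
```

With a single observed head, the explicit initial probabilities pick the fair coin while the uniform default picks the unfair one, which is the kind of silent difference described under point 2 above.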
~ Phillip
> (In reply to comment #10)
> > (In reply to comment #8)
> > > I'll volunteer to do all of that (OK with you, Phillip?).
> > >
> > > Walter
> >
> > Sure. WRT my earlier comment, I realized that it's simpler for both the
> > implementer and the user if the only user-visible change necessary to specify
> > begin states is to add a variable to HiddenMarkovBuilder to hold the name of
> > the begin state, and then let users use set_transition_score to specify
> > transition probabilities from begin states. Then the relevant methods, e.g.
> > _all_blank, allow_transition, allow_all_transitions, set_transition_score, etc
> > have to be altered to forbid transitions to, or emissions from the begin state.
> > And get_markov_model would raise an exception if a begin state hasn't been
> > specified or if there isn't at least one transition from the begin state.
> >
> > So all users would have to do is (using the example from the bug report):
> >
> > ...
> > build.begin_state_name = "begin"
> > build.set_transition_score("begin", "u", 0.01)
> > build.set_transition_score("begin", "f", 0.99)
> > ...
> >
> > ~Phillip
> >
>
From bugzilla-daemon at portal.open-bio.org Sun Feb 6 06:46:56 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 6 Feb 2011 01:46:56 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102060646.p166kuqY018550@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #13 from walter_gillett at yahoo.com 2011-02-06 01:46 EST -------
I forked Biopython, then tested, committed, and pushed some improvements to
variable naming and comments in the viterbi method, and submitted a pull
request for your review. Thanks,
Walter
(In reply to comment #8)
> > I'll happily merge/cherry-pick a simple diff to do that only if you do that on
> > github, or apply a patch if you upload it here.
> >
> > Thanks,
> >
> > Peter
From chapmanb at 50mail.com Mon Feb 7 12:23:56 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 7 Feb 2011 07:23:56 -0500
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To:
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
Message-ID: <20110207122356.GC18733@sobchak.mgh.harvard.edu>
Peter;
> The computationally interesting part is matching the primer/adapter/
> barcode to the read (both of which may contain IUPAC ambiguity codes),
> which as you point out can be replaced once you have a working
> framework for the input, output, trimming, etc.
Absolutely. I'd be very happy if you wanted to take the framework in
the script and generalize it for different matching. Let me know
what I can do to help.
> Currently I'm using regular expressions, which is fast enough for my
> own needs - and this task could easily be parallelised by breaking
> up the input reads. Beyond that perhaps something based on
> Hamming distances (edit distance - number of mismatches) or
> Levenshtein searches might be quicker. I guess speed is more of
> an issue with Illumina than with 454 due to the number of reads?
>
> Brad - you mentioned using approximate matches with gaps. Did you
> find gapped matches made a big difference to the number of matches
> found? i.e. is it worthwhile on your data?
A large majority of the barcodes are found with exact matching
via a dictionary lookup, so the gapped/mismatch alignments are only
necessary for the barcodes with sequencing errors. For Illumina
reads gaps aren't as common, so the mismatch alignments are more
useful but I tried to make it general so as to catch as many cases
as possible.
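The two-stage idea described above (exact dictionary lookup first, a more tolerant comparison only for the leftovers) might be sketched as follows. This is not the script under discussion: the function names, the barcode table, and the fixed-length, start-of-read barcode layout are all assumptions, and the fallback counts mismatches only, so it does not handle gaps.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the sample name for a read, or None if no barcode fits.

    Assumes every barcode has the same length and sits at the read start.
    """
    length = len(next(iter(barcodes)))
    prefix = read[:length]
    # Stage 1: exact match via dictionary lookup (the common case).
    if prefix in barcodes:
        return barcodes[prefix]
    # Stage 2: tolerate a few mismatches, but only if the hit is unambiguous.
    candidates = [
        name
        for seq, name in barcodes.items()
        if len(prefix) == len(seq) and hamming(prefix, seq) <= max_mismatches
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical barcode table for illustration.
BARCODES = {"ACGT": "sample1", "TGCA": "sample2"}

print(assign_barcode("ACGTTTTT", BARCODES))  # exact match -> sample1
print(assign_barcode("ACGATTTT", BARCODES))  # one mismatch -> sample1
print(assign_barcode("GGGGTTTT", BARCODES))  # too different -> None
```

Because most reads take the dictionary path, the slower comparison runs only on reads with sequencing errors in the barcode.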
Brad
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 16:31:38 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:31:38 -0500
Subject: [Biopython-dev] [Bug 3176] New: Bio SeqIO 'genbank' parse failure
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
Summary: Bio SeqIO 'genbank' parse failure
Product: Biopython
Version: 1.56
Platform: Macintosh
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: sschmidt at tuebingen.mpg.de
Hi,
the parser stumbles over a Genbank file that contains a feature without values:
___START GenBank File____
LOCUS someVector______ 6127 bp DNA circular 1-OCT-2009
SOURCE
ORGANISM
COMMENT none
FEATURES Location/Qualifiers
misc_structure 1564..1566
/ApEinfo_label=ErrorInBioPythonBecauseNoValue
/ApEinfo_fwdcolor=
/ApEinfo_revcolor=
/vntifkey="88"
/label=Stop\codon
BASE COUNT 15 a 16 c 16 g 13 t
ORIGIN
1 gagttccgcg ttacataact tacggtaaat ggcccgcctg gctgaccgcc caacgacccc
//
__END GenBank file___
The relevant error message:
File "/sw/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 525, in
parse
for r in i:
File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 437, in
parse_records
record = self.parse(handle, do_features)
File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 420, in
parse
if self.feed(handle, consumer, do_features):
File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 392, in
feed
self._feed_feature_table(consumer, self.parse_features(skip=False))
File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 188, in
parse_features
features.append(self.parse_feature(feature_key, feature_lines))
File "/sw/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 268, in
parse_feature
elif value[0]=='"':
IndexError: string index out of range
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 16:45:40 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 11:45:40 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081645.p18GjeR4025608@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 11:45 EST -------
Where is this problem file coming from? I'm pretty sure that neither the NCBI
nor EMBL/DDBJ use feature qualifiers like that.
See: http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html
If you are creating the file, why not use /key="" or /key - the latter form is
used in real GenBank files, e.g. /pseudo
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:25:13 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:25:13 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081825.p18IPDgO029696@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #2 from sschmidt at tuebingen.mpg.de 2011-02-08 13:25 EST -------
The file is the product of ApE
(http://biologylabs.utah.edu/jorgensen/wayned/ape/).
I agree that this format is 'unusual', but the crash could easily be avoided
by checking whether a value is defined at all.
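A guard of the kind suggested here might look like the sketch below. It is a standalone illustration, not the code in Bio/GenBank/Scanner.py and not the attached patch: the function name split_qualifier and its return convention are assumptions.

```python
def split_qualifier(line):
    """Split a '/key=value' qualifier line into (key, value).

    Returns None as the value for a bare flag such as /pseudo, and an
    empty string for '/key=' with nothing after the equals sign.
    """
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("not a qualifier line: %r" % line)
    key, sep, value = text[1:].partition("=")
    if not sep:
        return key, None  # bare flag, no equals sign at all
    if value == "":
        return key, ""  # empty value: check before indexing value[0]
    if len(value) >= 2 and value[0] == '"' and value.endswith('"'):
        value = value[1:-1]  # strip the surrounding quotes
    return key, value


print(split_qualifier("/ApEinfo_fwdcolor="))  # -> ('ApEinfo_fwdcolor', '')
print(split_qualifier('/vntifkey="88"'))      # -> ('vntifkey', '88')
print(split_qualifier("/pseudo"))             # -> ('pseudo', None)
```

The empty-value test comes before any value[0] access, which is the step that raises the IndexError in the traceback from the report.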
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:28:18 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:28:18 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081828.p18ISIGG029796@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 13:28 EST -------
Created an attachment (id=1569)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1569&action=view)
Handle funny feature annotation
Could you test the following patch? Ask if you need help with that - I can
stick it on a github branch if that is easier.
From anaryin at gmail.com Tue Feb 8 18:33:56 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 8 Feb 2011 19:33:56 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
Message-ID:
Dear All,
I've been working on the above-mentioned functions following really great
feedback from Eric, Kristian, and Peter. I've also been using them routinely
and have had no problems so far, so they should be stable enough. Therefore I
think they can be cherry-picked from my pdb_enhancements branch and added to
the main branch. Let me know what you think.
Cheers,
João
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 18:54:28 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 13:54:28 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102081854.p18IsSbo030923@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #4 from sschmidt at tuebingen.mpg.de 2011-02-08 13:54 EST -------
Hmm,
I patched the code and got the same error message. What about handling this
problem directly in Bio/GenBank/Scanner.py?
From bugzilla-daemon at portal.open-bio.org Tue Feb 8 22:25:58 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Feb 2011 17:25:58 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102082225.p18MPwXR006718@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1569 is|0                           |1
           obsolete|                            |
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-08 17:25 EST -------
(From update of attachment 1569)
Sorry, must have uploaded the wrong patch - this was a work in progress for the
GenBank between location bug.
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 10:47:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 05:47:33 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091047.p19AlX92029443@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-09 05:47 EST -------
Committed:
https://github.com/biopython/biopython/commit/07b6c12cf18d41749918e29b1bbc4a58a18e1180
Can you try the trunk? See
http://www.biopython.org/wiki/SourceCode
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 14:19:46 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 09:19:46 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091419.p19EJkjK011310@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
------- Comment #7 from sschmidt at tuebingen.mpg.de 2011-02-09 09:19 EST -------
(using 07b6c12cf18d41749918e29b1bbc4a58a18e1180)
works like a charm.
Thanks Peter, I should've come up with a similar solution myself.
From bugzilla-daemon at portal.open-bio.org Wed Feb 9 14:20:22 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Feb 2011 09:20:22 -0500
Subject: [Biopython-dev] [Bug 3176] Bio SeqIO 'genbank' parse failure
In-Reply-To:
Message-ID: <201102091420.p19EKMsg011354@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3176
sschmidt at tuebingen.mpg.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #8 from sschmidt at tuebingen.mpg.de 2011-02-09 09:20 EST -------
done
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 14:05:33 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 09:05:33 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102101405.p1AE5Xkl029071@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2011-02-10 09:05 EST -------
(In reply to comment #7)
> NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in
> my file. What I said about Nucleotide still applies, though.
>
Yes, you're right. My mistake, NT_015926.15 was the last good record.
Had you noticed this was the last gene in this record? It runs right up to
the end of the sequence and beyond (missing the right most end, i.e. the 5'
start of the gene since it is on the reverse strand). From the FTP site:
LOCUS       NT_022184           68452323 bp    DNA     linear   CON 28-OCT-2010
DEFINITION  Homo sapiens chromosome 2 genomic contig, GRCh37.p2 reference
            primary assembly.
...
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452073^68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452072^68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="GeneID:28916"
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
If we look at the record via Entrez,
http://www.ncbi.nlm.nih.gov/nuccore/NT_022184.15?report=gbwithparts
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
So this appears to have been updated to avoid the funny caret location,
but I think they made a mistake - surely the CDS should be
complement(68451760..>68452073) not complement(<68451760..68452073)
as stated?
Have you contacted the NCBI about this? If not, I will.
I believe that the caret location in the FTP GenBank file is invalid and
Biopython is right to reject it (but I would like to confirm this with the
NCBI). For now the simplest solution is for you to manually edit that feature.
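If there are many such features, the manual edit could be scripted. The
sketch below assumes the intended fix is the one seen in the Entrez copy of
the record, where A..B^C became A..C; that is an assumption about the NCBI's
intent, not something confirmed here. The name drop_caret is hypothetical.

```python
import re

# Matches 'start..end^next' inside a location string, e.g.
# 68451760..68452073^68452074
CARET_END = re.compile(r"(\d+)\.\.(\d+)\^(\d+)")


def drop_caret(location):
    """Rewrite 'A..B^C' as 'A..C', leaving everything else untouched."""
    return CARET_END.sub(
        lambda m: "%s..%s" % (m.group(1), m.group(3)), location
    )
```

Locations without a caret pass through unchanged, so the function can be
applied to every feature line in the file.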
Thanks,
Peter
From p.j.a.cock at googlemail.com Thu Feb 10 15:10:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Feb 2011 15:10:19 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110207122356.GC18733@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
Message-ID:
On Mon, Feb 7, 2011 at 12:23 PM, Brad Chapman wrote:
> Peter;
>
>> The computationally interesting part is matching the primer/adapter/
>> barcode to the read (both of which may contain IUPAC ambiguity codes),
>> which as you point out can be replaced once you have a working
>> framework for the input, output, trimming, etc.
>
> Absolutely. I'd be very happy if you wanted to take the framework in
> the script and generalize it for different matching. Let me know
> what I can do to help.
Do you have (or can you point me at) any good sample data with
barcodes, or custom adapters or primer sequences? e.g. some SRA
numbers you've been using.
>> Currently I'm using regular expressions, which is fast enough for my
>> own needs - and this task could easily be parallelised by breaking
>> up the input reads. Beyond that perhaps something based on
>> Hamming distances (edit distance - number of mismatches) or
>> Levenshtein searches might be quicker. I guess speed is more of
>> an issue with Illumina than with 454 due to the number of reads?
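As a rough illustration of the regular-expression approach described in the
quoted text, the sketch below expands IUPAC ambiguity codes in a primer into
character classes and clips everything up to and including the first match.
The names primer_to_regex and clip_forward_primer are hypothetical and are
not taken from seq_primer_clip.py; ambiguity codes in the read itself are not
handled.

```python
import re

# IUPAC nucleotide codes mapped to regular expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Compile a primer with ambiguity codes into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def clip_forward_primer(read, primer):
    """Return the read after the first primer match, or None if absent."""
    match = primer_to_regex(primer).search(read.upper())
    if match is None:
        return None
    return read[match.end():]
```

Because each read is handled independently, the input could be split into
chunks and processed in parallel as suggested above.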
I originally had three separate tools (with shared code) for working
with FASTA, FASTQ and SFF reads, which I have recently combined
into one single tool that does all three. Code here if anyone wants to
look at it.
https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
seq_primer_clip.py - Python script
seq_primer_clip.xml - Galaxy wrapper
seq_primer_clip.txt - readme file
This is still a work in progress...
Peter
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 20:02:42 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 15:02:42 -0500
Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path
In-Reply-To:
Message-ID: <201102102002.p1AK2g6g017745@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
------- Comment #14 from walter_gillett at yahoo.com 2011-02-10 15:02 EST -------
I have checked in a fix on my github branch to the bug mentioned in comment #4:
in the Viterbi recursion to determine state path probabilities, we must
consider states that lead *to* the current state, not those that are reachable
*from* it. See comments for this checkin:
https://github.com/wgillett/biopython/commit/f8b0b94ad7ffadbf9aa923bc6273822328cb9f01
I forgot to mention in the comments that I also fixed a bug in the
allow_transition method and added a unit test for that method.
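For readers unfamiliar with the recursion, here is a generic textbook Viterbi
sketch in plain Python. It is not the Bio.HMM code and does not use its API;
the function signature and the nested-dictionary layout are assumptions made
for this example. The point is the inner maximisation, which runs over
predecessor states with a transition *into* the current state.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability)."""
    # best[t][s] = (probability of the best path ending in s, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for cur in states:
            # Maximise over states p that lead TO cur: trans_p[p][cur].
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(cur, 0.0), p)
                for p in states
            )
            column[cur] = (prob * emit_p[cur][obs], prev)
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]
```

A model with one-way transitions (A can reach B, but B cannot return to A)
makes the difference between the two directions visible.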
From bugzilla-daemon at portal.open-bio.org Thu Feb 10 23:07:21 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Feb 2011 18:07:21 -0500
Subject: [Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank
Parser crash in Biopython 1.54
In-Reply-To:
Message-ID: <201102102307.p1AN7Lu0025588@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
------- Comment #9 from aaron.tin.long.lun at gmail.com 2011-02-10 18:07 EST -------
Thanks Peter, will do so.
Cheers,
Aaron
From p.j.a.cock at googlemail.com Fri Feb 11 09:30:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Feb 2011 09:30:02 +0000
Subject: [Biopython-dev] Fwd: [GitHub] Viterbi algorithm bug fix: consider
states that lead *to* the current state,
not reachable *from* it [biopython/biopython GH-3]
In-Reply-To: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail>
References: <4d54436221d2a_250b3fff6ec2b2f0780@fe2.rs.github.com.tmail>
Message-ID:
Hi Brad,
Do you want to look at this HMM fix too?
http://bugzilla.open-bio.org/show_bug.cgi?id=2947
Also who else is getting the github pull requests? We should
probably send them to the dev list, but I can't find the settings
right now on GitHub...
Peter
---------- Forwarded message ----------
From: GitHub
Date: Thu, Feb 10, 2011 at 7:58 PM
Subject: [GitHub] Viterbi algorithm bug fix: consider states that lead
*to* the current state, not reachable *from* it [biopython/biopython
GH-3]
To: p.j.a.cock at googlemail.com
wgillett wants someone to pull from wgillett:master:
Bug fix related to bug #2947. Please review and commit if it's OK. Thanks,
Walter Gillett
View Pull Request: https://github.com/biopython/biopython/pull/3
From chapmanb at 50mail.com Mon Feb 14 13:01:10 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 14 Feb 2011 08:01:10 -0500
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To:
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
Message-ID: <20110214130110.GA12340@sobchak.mgh.harvard.edu>
Peter;
> Do you have (or can you point me at) any good sample data with
> barcodes, or custom adapters or primer sequences? e.g. some SRA
> numbers you've been using.
This is a subset of two lanes from a barcoded flowcell for testing
purposes:
http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz
It has 12 barcoded samples, using the Illumina barcodes. The
sequences are in this YAML file:
https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml
> I originally had three separate tools (with shared code) for working
> with FASTA, FASTQ and SFF reads, which I have recently combined
> into one single tool that does all three. Code here if anyone wants to
> look at it.
>
> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
Very nice. It would be great to get something general for barcode
splitting as a Galaxy tool. Thanks for looking at this,
Brad
From p.j.a.cock at googlemail.com Mon Feb 14 13:19:45 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 14 Feb 2011 13:19:45 +0000
Subject: [Biopython-dev] Sequential SFF IO
In-Reply-To: <20110214130110.GA12340@sobchak.mgh.harvard.edu>
References:
<20110128123418.GD7866@sobchak.mgh.harvard.edu>
<20110207122356.GC18733@sobchak.mgh.harvard.edu>
<20110214130110.GA12340@sobchak.mgh.harvard.edu>
Message-ID:
On Mon, Feb 14, 2011 at 1:01 PM, Brad Chapman wrote:
> Peter;
>
>> Do you have (or can you point me at) any good sample data with
>> barcodes, or custom adapters or primer sequences? e.g. some SRA
>> numbers you've been using.
>
> This is a subset of two lanes from a barcoded flowcell for testing
> purposes:
>
> http://chapmanb.s3.amazonaws.com/110106_FC70BUKAAXX.tar.gz
>
> It has 12 barcoded samples, using the Illumina barcodes. The
> sequences are in this YAML file:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info.yaml
>
Great :)
>> I originally had three separate tools (with shared code) for working
>> with FASTA, FASTQ and SFF reads, which I have recently combined
>> into one single tool that does all three. Code here if anyone wants to
>> look at it.
>>
>> https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
>
> Very nice. It would be great to get something general for barcode
> splitting as a Galaxy tool. Thanks for looking at this,
> Brad
Yes - assuming what they have already isn't good enough (at the
very least, the Galaxy barcode wrapper for fastx currently only
handles fastq-solexa, but I think that can be fixed).
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html
I've been focused on the PCR case where my sequences have
got IUPAC ambiguity characters. For barcodes that shouldn't be
an issue, but instead you may have more than one barcode and
will want one output file per barcode (although not usually as
complicated as Kevin's setup). I need to learn more about how
Galaxy handles multiple outputs before commenting on that.
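For the barcode case, a mismatch-tolerant assignment could look something
like the sketch below. It assumes barcodes sit at the start of each read and
that reads matching no barcode, or matching two equally well, are left
unassigned. The names hamming and assign_barcode are hypothetical, and
writing one output file per barcode is left to the caller.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the name of the best-matching barcode, or None.

    barcodes maps sample names to barcode sequences.
    """
    hits = []
    for name, code in barcodes.items():
        prefix = read[:len(code)]
        if len(prefix) == len(code):
            hits.append((hamming(prefix, code), name))
    if not hits:
        return None
    hits.sort()
    best_distance, best_name = hits[0]
    if best_distance > max_mismatches:
        return None
    if len(hits) > 1 and hits[1][0] == best_distance:
        return None  # ambiguous: two barcodes fit equally well
    return best_name
```

The returned name can then be used as a key into a dictionary of open output
handles, one per sample.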
Peter
From tiagoantao at gmail.com Wed Feb 16 16:40:10 2011
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 16 Feb 2011 16:40:10 +0000
Subject: [Biopython-dev] New URL for integration testing
Message-ID:
Hello all,
Buildbot integration testing has been moved to a, hopefully, more
stable location. If you are interested, please have a look at:
http://testing.open-bio.org/
The old URL at events.open-bio.org is no more.
Regards,
Tiago
--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa
From anaryin at gmail.com Thu Feb 17 12:59:16 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 17 Feb 2011 13:59:16 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:
Hey Kristian,
To Tests/test_pdb.py? Just to make sure that the renumbering acts on both
accordingly? I agree.
João
From krother at rubor.de Thu Feb 17 12:54:38 2011
From: krother at rubor.de (Kristian Rother)
Date: Thu, 17 Feb 2011 13:54:38 +0100
Subject: [Biopython-dev] New functions in Bio.PDB: renumber_residues(),
remove_disordered_atoms()
In-Reply-To:
References:
Message-ID:
Hi Joao,
I think we should add a simple test function that ensures consistency of
child_dict and child_list upon renumbering. Let me know if you'd prefer me
to explain in Python what I mean.
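One possible shape for such a check, written against the child_list and
child_dict attributes named above and assuming each child exposes its
identifier as .id. The function name children_consistent is hypothetical and
this is not code from the pdb_enhancements branch.

```python
def children_consistent(entity):
    """True if child_list and child_dict describe the same children."""
    ids = [child.id for child in entity.child_list]
    if len(ids) != len(set(ids)):
        return False  # two children share an id
    if set(ids) != set(entity.child_dict):
        return False  # stale or missing dictionary keys
    # Every key must point at the very object stored in the list.
    return all(
        entity.child_dict[child.id] is child for child in entity.child_list
    )
```

A test in Tests/test_pdb.py could call this on each chain before and after
renumbering and assert that it stays true.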
Kristian
> Dear All,
>
> I've been working on the above-mentioned functions following really great
> feedback from Eric, Kristian, and Peter. I've been also using them
> routinely
> and I've had no problems yet so they should be stable enough. Therefore I
> think they can be cherry-picked from my pdb_enhancements branch and added
> to
> the main branch. Let me know what you think.
>
> Cheers,
>
> João
From b.invergo at gmail.com Tue Feb 22 16:40:01 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 22 Feb 2011 17:40:01 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To:
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID:
Hi everyone,
I've been toiling away on the PAML API and I think it's finally ready
for review. If anyone's willing to give my code a review, here's my
branch:
https://github.com/brandoninvergo/biopython/tree/paml-branch
(the API is in Bio/Phylo/PAML, as suggested before, and the tests are
in Tests, with their supporting files in Tests/PAML)
I'll also post a message to the Biopython user list to see if anyone
would be willing to give it a test drive.
Some notes:
- I've implemented Codeml, Baseml/Basemlg and Yn00. I have not yet
done anything with Mcmctree because I am completely ignorant about
what information to extract from the output files. The other two
programs in the package, Evolver and Chi2, do not accept commandline
options and are instead operated by a rudimentary commandline
interface, so they aren't really compatible with scripting.
- Chi2 is useful, though, because it provides a chi^2 CDF, which you
can use in performing maximum likelihood ratio tests, an important
part of using the PAML programs. Since Python doesn't have a chi^2
cumulative distribution function in its standard library, I ported the
original C code rather than writing a function which simply calls the
original, with the permission of Ziheng Yang (the original author;
this is mentioned in the code's comments, but he required no other
licensing/copyright verbiage to be included). This was no easy task,
considering the C code was littered with goto statements. Anyway, this
will prevent the user from having to install/import an outside package
to do the tests (I personally had been using Rpy2 to call the R
function pchisq()....complete overkill). Let me know if this is ok or
if this causes some kind of conflict.
- The output of the programs varies widely with the combinatorics of
the parameters and possibly between versions. I tried to include all
possible output files in the Tests/PAML directory and I wrote test
cases to check that they're properly parsed (with the testing of
future versions in mind). So, that Tests/PAML folder has a lot more in
it than the usual test folders, but I felt there was no other option.
I tried to make it organized.
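To make the likelihood ratio test mentioned in the notes concrete, here is a
small pure-Python sketch. It is not the port of the PAML C code described
above: it swaps in a plain series expansion of the regularised lower
incomplete gamma function, which is adequate for illustration but not tuned
for large test statistics. The names chi2_cdf and likelihood_ratio_test are
hypothetical.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the series for the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a = df / 2.0
    z = x / 2.0
    # gamma(a, z) = z**a * exp(-z) * sum_n z**n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def likelihood_ratio_test(lnL_null, lnL_alt, df):
    """Return (test statistic, p-value) for two nested models."""
    statistic = 2.0 * (lnL_alt - lnL_null)
    return statistic, 1.0 - chi2_cdf(statistic, df)
```

Here df is the difference in the number of free parameters between the two
models, and the log-likelihoods would come from the parsed program output.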
I think those are the main points for now. I'd assume that there's
more work to be done before I should perform a pull request, so I'll
simply ask for your comments for now if you have the time.
Cheers,
Brandon Invergo
On Sun, Jan 16, 2011 at 4:09 PM, Peter Cock wrote:
> On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote:
>> Hi everyone,
>> A quick question about style: since the name "codeml" is based on a
>> program which is always spelled either in all caps or in all
>> lower-case, what would be the best way to write the class name
>> regarding capitalization? Stick with the usual camel-case convention,
>> "Codeml", anyway?
>
> I'd go with Codeml for a class name (or something like
> CodemlResult or whatever). Neither CODEML nor codeml
> seem good class names in Python.
>
>> Things are progressing nicely. I've already taken care of a lot of the
>> minor tasks and improvements...
>
> Sounds good :)
>
> Peter
>
From clementsgalaxy at gmail.com Tue Feb 22 17:16:12 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Tue, 22 Feb 2011 09:16:12 -0800
Subject: [Biopython-dev] Galaxy Community Conference, May 25-26, Lunteren,
The Netherlands
In-Reply-To:
References:
Message-ID:
Hello all,
Just a reminder that the abstract submission deadline for the Galaxy
Community Conference is next Monday, February 28. See
http://galaxy.psu.edu/gcc2011/Abstracts.html for details.
Cheers,
Dave C.
On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote:
> We are pleased to announce the *2011 Galaxy Community Conference*, being
> held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature
> two full days of presentations and discussion on extending Galaxy to use new
> tools and data sources, deploying Galaxy at your organization, and best
> practices for using Galaxy to further your own and your community's
> research. See http://galaxy.psu.edu/gcc2011/ for complete details.
>
> *About Galaxy:*
> Galaxy is an open, web-based platform for *accessible, reproducible, and
> transparent* computational biomedical research.
>
> - *Accessibility:* Galaxy enables users without programming experience
> to easily specify parameters and run tools and workflows.
> - *Reproducibility:* Galaxy captures all information necessary so that
> any user can repeat and understand a complete computational analysis.
> - *Transparency:* Galaxy enables users to share and publish analyses
> via the web and create Pages--interactive, web-based documents that describe
> a complete analysis.
>
> Galaxy is open source for all organizations. The public Galaxy service (
> http://usegalaxy.org) makes analysis tools, genomic data,
> tutorial demonstrations, persistent workspaces, and publication services
> available to any scientist that has access to the Internet. Local
> Galaxy servers can be set up by downloading the Galaxy application and
> customizing it to meet particular needs.
>
> *Conference Overview:*
> This event aims to engage a broader community of developers, data
> producers, tool creators, and core facility and other research hub staff to
> become an active part of the Galaxy community. We'll cover defining
> resources in the Galaxy framework, increasing their visibility and making
> them easier to use and integrate with other resources, how to extend Galaxy
> to use custom data sources and custom tools, and best practices for using
> Galaxy in your organization.
>
> Additional topics include, but are not limited to:
> * Talks submitted by the Galaxy community
> * Integration of tools (including NGS analysis tools) and distributed job
> management
> * Deployment of Galaxy instances on local resources and on the Cloud
> * Management of large datasets with the Galaxy Library System
> * Using the Galaxy LIMS functionality at NGS sequencing facilities
> * Visualizing Data without leaving Galaxy
> * Performing reproducible research
> * Performing and sharing complex analyses with Workflows
> * An "Introduction to Galaxy" session, offered on May 24, for Galaxy
> newcomers.
>
> *Registration:*
> The conference fee is €100 on or before April 24, and €120 after that. The
> meeting is being held at the Conference Centre De Werelt in Lunteren, The
> Netherlands, which is also the conference hotel. You are encouraged to
> register early, as space at the hotel (and at the "Intro to Galaxy" session)
> is limited and is likely to fill up before the conference itself does. See
> http://galaxy.psu.edu/gcc2011/Register.html
>
> *Abstract Submission:*
> Abstracts are now being accepted for short oral presentations. Proposals
> on any topic of interest to the Galaxy community are welcome and
> encouraged. The abstract submission deadline is the end of February 28.
> See http://galaxy.psu.edu/gcc2011/Abstracts.html
>
> *Sponsors*
> The 2011 Galaxy Community Conference is co-sponsored by the US National
> Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands
> Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a
> collaborative institute of the bioinformatics groups in the Netherlands.
> Together, these groups perform cutting-edge research, develop novel tools
> and support platforms, create an e-science infrastructure and educate the
> next generations of bioinformaticians.
>
> We are looking forward to a great conference and hope to see you in the
> Netherlands!
>
> The Galaxy and NBIC Teams
>
> --
> http://galaxy.psu.edu/gcc2011/
> http://getgalaxy.org
> http://usegalaxy.org/
>
--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/
From bugzilla-daemon at portal.open-bio.org Tue Feb 22 18:06:48 2011
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 22 Feb 2011 13:06:48 -0500
Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml
In-Reply-To:
Message-ID: <201102221806.p1MI6mvd015443@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=3170
------- Comment #1 from b.invergo at gmail.com 2011-02-22 13:06 EST -------
I've forked the repository on github and I've created a branch containing the
new code:
https://github.com/brandoninvergo/biopython/tree/paml-branch
From p.j.a.cock at googlemail.com Wed Feb 23 09:24:21 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Feb 2011 09:24:21 +0000
Subject: [Biopython-dev] [Biopython] Biopython library for muliple
sequence alignment
In-Reply-To: <001501cbd324$c70a8570$551f9050$@jp>
References: <001501cbd324$c70a8570$551f9050$@jp>
Message-ID:
On Wed, Feb 23, 2011 at 6:42 AM, Rojan Shrestha wrote:
> Hello:
>
> I want to do multiple sequence alignment using CLUSTW. Instead of
> standalone, I would like to use in my own program through biopython. I would
> like to know that whether biopython has clustw function or not. It would be
> very good if somebody gives information about this.
>
> Regards,
>
> Rojan
Hello Rojan,
Biopython (and BioPerl too, I believe) doesn't have any multiple sequence
alignment code itself. Biopython does have pairwise sequence alignment
code (with a fast implementation in C).
Instead (again, like BioPerl) Biopython has a wrapper and parser for
calling the ClustalW command line tool from within your script and
loading its output. Similarly for other alignment tools like Muscle.
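A short sketch of what calling the wrapper looks like, assuming a Biopython
release of this period in which Bio.Align.Applications provides
ClustalwCommandline (it may not be available in every release) and assuming
the ClustalW executable is installed under the name clustalw2. The helper
name align_with_clustalw is hypothetical.

```python
def align_with_clustalw(fasta_path, clustalw_exe="clustalw2"):
    """Run ClustalW on a FASTA file and load the resulting alignment."""
    # Imported here so that defining the function does not need Biopython.
    import os
    from Bio import AlignIO
    from Bio.Align.Applications import ClustalwCommandline

    cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
    stdout, stderr = cline()  # runs the external program
    # Assumption: ClustalW's default output is <input stem>.aln
    aln_path = os.path.splitext(fasta_path)[0] + ".aln"
    return AlignIO.read(aln_path, "clustal")
```

The returned alignment object can then be iterated over record by record or
written out in another format with AlignIO.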
If you really want to be able to modify the multiple sequence alignment
code itself, some of these command line tools are open source. Also,
I *think* that BioJava has some code for this.
I don't know what BioRuby does.
Peter
P.S. You only really need to ask this on the Biopython Discussion List.
Since you included the OBF cross project list I have tried to comment
on how the other projects handle this as well.
From updates at feedmyinbox.com Wed Feb 23 09:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 active questions tagged biopython - Stack
Overflow
Message-ID: <64da3e945fd7631143a0bbd0fdd84e55@74.63.51.88>
// Biopython CodonTable error?
// February 18, 2011 at 3:02 PM
http://stackoverflow.com/questions/5045967/biopython-codontable-error
Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids and I am seeing some strange translation from the Biopython 1.56 package. It appears to be translating ambiguous DNA codes to 'J' which does not exist as a code for anything. I am running python 2.6.1 on Mac OS 10.6.6.
For example:
>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'
or
>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'
I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas?
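One likely reading of this output: in the extended IUPAC protein alphabet, J
stands for "leucine or isoleucine", just as B and Z stand for Asx and Glx.
The sketch below does not use Biopython. It expands each ambiguous codon
against the standard genetic code to show which amino acids are possible; the
function name possible_amino_acids is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# IUPAC ambiguity codes and the bases each one covers.
AMBIGUOUS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_amino_acids(codon):
    """Set of amino acids an ambiguous codon could encode."""
    choices = [AMBIGUOUS[base] for base in codon.upper()]
    return {CODON_TABLE["".join(p)] for p in product(*choices)}
```

For the sequence in the question, WTA and MTA can each encode either I or L,
which would explain the J, while ARA (K or R) and GKA (G or V) have no
dedicated ambiguity letter and would fall back to X.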
Thanks!
Mark
--
Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=active
Account Login:
https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email
Unsubscribe here:
http://www.feedmyinbox.com/feeds/unsubscribe/630208/9a33fac9c8e89861715f609a2333362c8425e495/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email
--
This email was carefully delivered by FeedMyInbox.com.
PO Box 682532 Franklin, TN 37068
From updates at feedmyinbox.com Wed Feb 23 09:26:36 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Wed, 23 Feb 2011 04:26:36 -0500
Subject: [Biopython-dev] 2/23 biopython Questions - BioStar
Message-ID:
// MuscleCommandline not writing file
// February 22, 2011 at 2:34 PM
http://biostar.stackexchange.com/questions/5787/musclecommandline-not-writing-file
I'm trying to work through the Biopython tutorial on multiple sequence alignment and get an error whenever I try to use subprocess:
child = subprocess.Popen(str(cline),
                         stdout = subprocess.PIPE,
                         stderr = subprocess.PIPE,
                         shell = (sys.platform!="win32"))
I get this error:
Traceback (most recent call last):
File "", line 2, in
stdout = subprocess.PIPE)
File "C:\Python27\lib\subprocess.py", line 672, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 882, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
I've gone so far as to copy and paste the tutorial into the interpreter and no luck. Neither ClustalW nor Muscle is writing the alignment files (I tried the deprecated MultipleAlignCL as well with no luck).
I'm using Python v2.7 and Biopython v1.55 and have tried reinstalling both. Any advice?
--
Website: http://biostar.stackexchange.com/questions/tagged/biopython
From chapmanb at 50mail.com Wed Feb 23 13:11:51 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 23 Feb 2011 08:11:51 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
Message-ID: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
Brandon;
> I've been toiling away on the PAML API and I think it's finally ready
> for review. If anyone's willing to give my code a review, here's my
> branch:
> https://github.com/brandoninvergo/biopython/tree/paml-branch
This is awesome; thanks much for all the work getting this together.
It's really great to see the extensive tests. I'm also impressed
with your story of porting over 'goto' statements; it's been a while
since those have entered my mind:
10 PRINT "CHI SQUARE FOREVER"
20 FLASH
30 GOTO 10
A couple of more general thoughts about your code:
- There looks to be a lot of shared functionality between codeml,
baseml and yn00 in setting up the control files. Would it be
possible to create a base class that these all inherit from? This
would make the code much easier to maintain over time as formats
change.
- Your 'read' functions get pretty deeply nested, especially the
codeml parser. What do you think about creating an internal class
to split some of the parsing logic into individual functions? A
nice example is the GenBank/Scanner.py code. Having functions like
parse_header/parse_features makes it much easier for someone not
deeply familiar with your code to start to make guesses at where
different functionality exists. This way, if the format changes
others can provide patches and feedback to you.
Overall this is great and all the work is much appreciated.
Brad
From chapmanb at 50mail.com Thu Feb 24 18:26:26 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 24 Feb 2011 13:26:26 -0500
Subject: [Biopython-dev] BOSC 2011 topic organizers and Codefest
Message-ID: <20110224182626.GM20125@sobchak.mgh.harvard.edu>
Hi all;
This year the Bioinformatics Open Source Conference (BOSC) will be taking
place in Vienna, Austria on July 15-16th. This is a yearly opportunity for
open source bioinformatics developers to get together in person and discuss
on-going projects. Nomi Harris, Peter Rice and the other organizing committee
members are already hard at work planning for the conference:
http://www.open-bio.org/wiki/BOSC_2011
The call for abstracts opens next Monday, and extends through April
18th, and we've been brainstorming potential session topics. This
year we've tried to focus each of the sessions around a particular
biological problem or computational approach. We hope this will draw
some interesting parallels between work being done in different groups,
and encourage even more collaboration.
We are actively looking for community members who are interested in heading up
the organization of a topic. The general idea is to build a cohesive set of talks
within a session. How you'd like to do this is completely flexible but some of
the ideas we've been discussing are:
- Having a short introductory talk to provide an overview of an area, framing
the different talks within this context.
- Forgoing individual question/answer and instead combining this time into a
longer panel-style discussion with all of the speakers. This would help
stimulate back and forth between the different projects and the
audience.
If you are interested in a particular topic and would like to help with the
organization, please send an e-mail to the BOSC mailing list:
bosc at lists.open-bio.org. We're also open to new topic suggestions, and will
look to add one or two more topics to our current list.
Finally, there will be a two day coding session prior to BOSC as a follow up
to last year's fun and productive Codefest:
http://www.open-bio.org/wiki/Codefest_2011
The Metalab, a unique hacker space in Vienna, has kindly agreed to host
us for the two days. If you are at all interested, please add your name
to the attendees list on the wiki. Since the Metalab organizers don't
know us personally, we'd like to demonstrate there is interest and that we'll
really show up with a bunch of bioinformatics hackers. More details will be in
the works as the summer draws closer.
Looking forward to the sound of music,
Brad
From b.invergo at gmail.com Fri Feb 25 16:57:19 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Fri, 25 Feb 2011 17:57:19 +0100
Subject: [Biopython-dev] pypaml
In-Reply-To: <20110223131151.GE4922@sobchak.mgh.harvard.edu>
References:
<20110114154035.GC30193@sobchak.mgh.harvard.edu>
<20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID:
Hi Brad,
Thanks for your response! It's taken me a day or two to think about
what you wrote (also balancing a PhD with the hobby projects at the
moment...)
> It's really great to see the extensive tests. I'm also impressed
> with your story of porting over 'goto' statements; it's been a while
> since those have entered my mind:
To be honest, I forgot they existed. Seeing them immediately made the
computer scientist in me cringe. They really confused the whole
structure of the program but in the end they were solved quite easily
with some carefully placed loops and conditional blocks!
> - There looks to be a lot of shared functionality between codeml,
>   baseml and yn00 in setting up the control files. Would it be
>   possible to create a base class that these all inherit from? This
>   would make the code much easier to maintain over time as formats
>   change.
This is a really good idea and I'm a bit disappointed that I didn't
see it myself! Indeed, most of the functionality is just copied/pasted
between the classes, with only some variation in the
read/write_ctl_file functions for codeml and baseml. So, writing a
base class would really simplify things. I do have one question,
though, since this is my first time organizing my code in a
large-scale Python project. Where would be the best place to implement
this base paml class? In __init__.py or in its own paml.py file? I
know the end result would be the same but I figure I should start
learning some of these best practices.
> - Your 'read' functions get pretty deeply nested, especially the
>   codeml parser. What do you think about creating an internal class
>   to split some of the parsing logic into individual functions? A
>   nice example is the GenBank/Scanner.py code. Having functions like
>   parse_header/parse_features makes it much easier for someone not
>   deeply familiar with your code to start to make guesses at where
>   different functionality exists. This way, if the format changes
>   others can provide patches and feedback to you.
I'm not so sure about this mainly because of the way the output files
are formatted. For example, the most common usage of codeml (the most
common program of the bunch) is to run with several "NSsites"
models. If you do this, the output file is separated into segments
which are headed by a line that says something like "Model 2:
PositiveSelection", and the model parameters are printed out below.
However, if you only run with one model, which is also a common usage,
you no longer have these convenient headers and instead at the very
top of the output file is a completely different indication of which
model was used, but which is inconveniently missing if only model 0
was run. In other cases, such as amino acid sequence analysis,
pairwise nucleotide sequence or multiple gene analyses, there's no
header whatsoever indicating which kind of output file you're looking
at. Instead, you just have to search for particular data patterns to
parse. This mess is precisely why I had to include so many different
output files for the unittesting (codeml is the main culprit; baseml
is moderately bad; yn00 isn't a problem)
So, because I would potentially end up scanning almost the entire file
just to figure out what's going on, I think just parsing-as-you-go,
using elif statements to short-circuit and skip further evaluations of
a line after a match has been found, would be the better option.
Perhaps the files aren't long enough to be able to make an appeal for
computational efficiency but at the same time, I hesitate to read
through the file multiple times unnecessarily. I agree, though, that
this makes the read() function quite long. For that, though, I tried
to provide descriptive comments before each parsing case, describing
exactly what the next block of code is meant to parse and also
including a specific example line which should be parsed by it.
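A minimal sketch of the parse-as-you-go approach described above, under simplifying assumptions. Only the "Model 2: PositiveSelection" header comes from this message; the "kappa = ..." line format and the name parse_as_you_go are hypothetical, chosen only to illustrate how an if/elif chain stops testing a line after the first match. It uses assignment expressions, so it assumes Python 3.8 or later.

```python
import re

MODEL_RE = re.compile(r"Model (\d+):\s*(.*)")
# Hypothetical parameter line, for illustration only.
KAPPA_RE = re.compile(r"kappa\s*=\s*([\d.]+)")


def parse_as_you_go(lines):
    """Read the lines once; each line is tested until one pattern matches."""
    models = {}
    current = None  # number of the model segment we are currently inside
    for line in lines:
        if (match := MODEL_RE.match(line)):
            # A segment header such as "Model 2: PositiveSelection".
            current = int(match.group(1))
            models[current] = {"description": match.group(2).strip()}
        elif (match := KAPPA_RE.match(line)) and current is not None:
            # Only reached when the header pattern did not match.
            models[current]["kappa"] = float(match.group(1))
        # Lines matching no pattern are skipped.
    return models
```

Because the patterns are tried in order and the chain stops at the first hit, the file is read exactly once, at the cost of one long function whose branches all share the same local state.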
That said, I will take another look at the output files to see if
there could be another way of implementing it. Without a doubt, the
parsing is the most difficult part of implementing this module; the
rest of it is quite trivial. So, best to do it right!
> Overall this is great and all the work is much appreciated.
Thanks! It's been a fun side project for me.
Cheers,
Brandon
ps - I still haven't sent a message to the main Biopython list while I
consider implementing at least the first suggestion above, since it
would involve large changes that might cause me to accidentally break
something! I'll wait until I'm a bit more confident that it's close to
the final product
From updates at feedmyinbox.com Mon Feb 28 09:21:17 2011
From: updates at feedmyinbox.com (Feed My Inbox)
Date: Mon, 28 Feb 2011 04:21:17 -0500
Subject: [Biopython-dev] 2/28 active questions tagged biopython - Stack
Overflow
Message-ID: <348d58cdbd9ae31e700023c354ca3ce6@74.63.51.88>
// Convert nested dictionary/xml to flat file for sqlite
// February 27, 2011 at 11:25 AM
http://stackoverflow.com/questions/5134334/convert-nested-dictionary-xml-to-flat-file-for-sqlite
Hiya-
I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite. Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed.
Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null?
Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.
Many thanks,
chris
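The process outlined in the question can be sketched roughly as follows, using Python's standard sqlite3 module. This is only an illustration under stated assumptions: the record keys ("PMID", "TI", "AB", "AU"), the table layout, and the function name flatten_citations are hypothetical and would need to be adapted to the structure of the parsed citations. Keys are popped from a copy of each record, so whatever is returned is exactly what has not been stored yet.

```python
import sqlite3


def flatten_citations(records, conn):
    """Store one-per-paper fields and authors; return the unhandled keys."""
    conn.execute("CREATE TABLE IF NOT EXISTS paper "
                 "(pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS author "
                 "(pmid TEXT REFERENCES paper(pmid), position INTEGER, name TEXT)")
    leftovers = {}
    for record in records:
        rec = dict(record)  # shallow copy, so the caller's dictionary is untouched
        pmid = rec.pop("PMID")
        # Fields with one value per paper; a missing field is stored as NULL.
        conn.execute("INSERT INTO paper VALUES (?, ?, ?)",
                     (pmid, rec.pop("TI", None), rec.pop("AB", None)))
        # Repeating field: one row per author, linked back by PMID.
        for position, name in enumerate(rec.pop("AU", []), start=1):
            conn.execute("INSERT INTO author VALUES (?, ?, ?)",
                         (pmid, position, name))
        leftovers[pmid] = rec  # keys that still need a table of their own
    conn.commit()
    return leftovers
```

Other repeating fields (MeSH headings, references and so on) would follow the same pattern as the author table, each popped in turn until the leftovers are empty.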
From chapmanb at 50mail.com Mon Feb 28 16:35:21 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Feb 2011 11:35:21 -0500
Subject: [Biopython-dev] pypaml
In-Reply-To:
References: <20110114154035.GC30193@sobchak.mgh.harvard.edu>
<20110223131151.GE4922@sobchak.mgh.harvard.edu>
Message-ID: <20110228163521.GF9652@sobchak.mgh.harvard.edu>
Brandon;
[pypaml branch: https://github.com/brandoninvergo/biopython/tree/paml-branch]
[base class]
> This is a really good idea and I'm a bit disappointed that I didn't
> see it myself! Indeed, most of the functionality is just copied/pasted
> between the classes, with only some variation in the
> read/write_ctl_file functions for codeml and baseml. So, writing a
> base class would really simplify things. I do have one question,
> though, since this is my first time organizing my code in a
> large-scale Python project. Where would be the best place to implement
> this base paml class? In __init__.py or in its own paml.py file? I
> know the end result would be the same but I figure I should start
> learning some of these best practices.
It's always easier to get perspective on code when you haven't been
directly in the middle of it. Even if you don't have someone to do
code reviews, stepping away from a project and coming back later
will often lead to a bunch of insights.
For the base class, I would follow Eric and Peter's example and use
files in the same directory with an underscore: something like _shared.py
or _base.py.
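As a rough sketch of what such a shared module (say, _base.py) might hold: the class name _PamlBase, its methods, the "name = value" control file layout, and the option names other than NSsites are all assumptions made for illustration, not the code in the branch.

```python
class _PamlBase:
    """Hypothetical shared base: stores options and formats a control file."""

    # Each subclass overrides this with its own option names and defaults.
    default_options = {}

    def __init__(self, working_dir="."):
        self.working_dir = working_dir
        self._options = dict(self.default_options)

    def set_option(self, name, value):
        """Set an option, rejecting names the program does not define."""
        if name not in self._options:
            raise KeyError("Unknown option: %s" % name)
        self._options[name] = value

    def format_ctl(self):
        """Return control file text, one line per option that has a value."""
        lines = ["%s = %s" % (name, value)
                 for name, value in sorted(self._options.items())
                 if value is not None]
        return "\n".join(lines) + "\n"


class Codeml(_PamlBase):
    # Illustrative subset of options only.
    default_options = {"NSsites": None, "verbose": 0}


class Yn00(_PamlBase):
    # Illustrative subset of options only.
    default_options = {"icode": 0, "verbose": 0}
```

With this layout, only the option tables differ between programs, so a change to how control files are written is made once in the base class.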
[read functions]
> This mess is precisely why I had to include so many different
> output files for the unittesting (codeml is the main culprit; baseml
> is moderately bad; yn00 isn't a problem)
I definitely feel your pain on this. This is exactly why your work
doing this is appreciated; you'll save someone a lot of headache
later on.
> So, because I would potentially end up scanning almost the entire file
> just to figure out what's going on, I think just parsing-as-you-go,
> using elif statements to short-circuit and skip further evaluations of
> a line after a match has been found, would be the better option.
> Perhaps the files aren't long enough to be able to make an appeal for
> computational efficiency but at the same time, I hesitate to read
> through the file multiple times unnecessarily. I agree, though, that
> this makes the read() function quite long. For that, though, I tried
> to provide descriptive comments before each parsing case, describing
> exactly what the next block of code is meant to parse and also
> including a specific example line which should be parsed by it.
The issue really is that deeply nested code is hard to read,
long functions are hard to read, and when you combine them together
it just makes it very difficult for others to follow your logic.
I don't think you necessarily have to make multiple passes to parse it
in a more structured way, but what you would want to focus on is making
the flow through the function simpler. The way I would normally attack
this is to break components into smaller more re-usable functions.
Here's a concrete example from the start of the codeml parser:
https://github.com/brandoninvergo/biopython/blob/paml-branch/Bio/Phylo/PAML/codeml.py
    siteclass_re = re.match("Site-class models:\s*(.*)", line)
    if siteclass_re is not None:
        siteclass_model = siteclass_re.group(1)
        if siteclass_model == "":
            multi_models = True
            continue
        results["site-class model"] = siteclass_model
        if siteclass_model == "NearlyNeutral":
            current_model = 1
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "PositiveSelection":
            current_model = 2
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "discrete (4 categories)":
            current_model = 3
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "beta (4 categories)":
            current_model = 7
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
        elif siteclass_model == "beta&w>1 (5 categories)":
            current_model = 8
            results["NSsites"][current_model] = \
                    {"description":siteclass_model}
            if 0 in results["NSsites"]:
                del results["NSsites"][0]
You could refactor this something along the lines of:
    import re

    class _CodemlParser:
        def __init__(self):
            # "NSsites" must exist before _add_siteclass_model fills it in.
            self.results = {"NSsites": {}}
            self.flags = dict(multi_models = False)

        def read(self, results_handle):
            for line in results_handle:
                siteclass_re = re.match(r"Site-class models:\s*(.*)", line)
                if siteclass_re is not None:
                    self._siteclass_parse(siteclass_re)

        def _add_siteclass_model(self, siteclass_model):
            self.results["site-class model"] = siteclass_model
            name_to_num = {"NearlyNeutral": 1,
                           "PositiveSelection": 2,
                           "discrete (4 categories)": 3,
                           "beta (4 categories)": 7,
                           "beta&w>1 (5 categories)": 8}
            current_model = name_to_num[siteclass_model]
            self.results["NSsites"][current_model] = {"description":siteclass_model}
            if 0 in self.results["NSsites"]:
                del self.results["NSsites"][0]

        def _siteclass_parse(self, siteclass_re):
            # An empty model name means several models follow in the file.
            siteclass_model = siteclass_re.group(1)
            if siteclass_model == "":
                self.flags["multi_models"] = True
            else:
                self._add_siteclass_model(siteclass_model)
You are not changing the parsing strategy, but now you've got
individual functions handling each of the steps so it's clear that
the _siteclass_parse either sets multi_models or adds details about
the single model. Then you can dig into the _add_siteclass_model
function to see what it is doing. To the reader, each individual
unit can be read and understood separately.
This type of refactoring work is useful generally. I have to do it all
the time in my work and discover new tricks and approaches. Hope this
is helpful and thanks again for all the work on this,
Brad