From p.j.a.cock at googlemail.com Wed Aug 1 05:27:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 10:27:14 +0100
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu>
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu>
Message-ID:

On Wed, Aug 1, 2012 at 1:37 AM, Zachary Charlop-Powers wrote:
> Hello Biopython,
>
> I am writing about a small feature that I would like to see implemented (and could possibly help to implement it: I haven't contributed before and am not sure exactly how tough this will be). When using Genome Diagram to draw features you can specify which strand to put a feature on. If the strand is positive it will go above the track in the positive-facing direction and if negative it will go below the track in the negative facing direction (see http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc200). That's a great behavior.

Yep - all fine so far.

> However if you use strand="None", Genome Diagram will draw the features inline with the track and always in the positive direction. For myself, and probably others, keeping the direction of the features is immensely useful as you can often get a sense of operon structure in prokaryote genomes just by looking at the genes. Of course the forward and the minus strands can be drawn but condensing small sections of genes to a single track saves space when making images.
>
> So, would it be possible to change the default behavior of Genome Diagram to draw features inline (strand="None"), but to preserve their orientation?

I think I know what you mean - that kind of picture is quite common e.g. for viruses - but only where there are no overlapping genes on opposite strands. GenomeDiagram was written originally primarily for bacteria, where overlapping genes on opposite strands are more common, which may explain the design choices made.

Currently strand controls both orientation (for arrows, no effect on box sigils) and vertical placement (above, below, or straddling the line). Basically you want to override the vertical placement only? Note this is sigil dependent - it makes sense for the arrow, but not the default box (which was originally the only sigil supported).

The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer.

If you need this right now, a careful hack in get_feature_sigil is the way to proceed.

The question is how to most cleanly expose this to the user while not breaking anything else (e.g. cross links), and ideally allow for a related option which Leighton and I have considered (but not had a pressing need to implement) for frame specific placement. i.e. Rather than treating the vertical drawing spaces as two regions (above the axis line for the forward strand, below the line for the reverse strand), treat it as six regions (three frames above and below the axis line). I'm picturing something a bit like the view in the Artemis annotation editor.

One question which constrains this design choice is would you want to mix these placements on the same track? I think yes - using plain strandless BOX features (at the bottom of the z-order stack) is a really useful way to highlight a region of interest (which could have multiple genes drawn on top of it).

That suggests this setting might be best at the GenomeDiagram feature level. Perhaps a new attribute/argument 'strand_mode',

(a) ignore strand for vertical placement (what you want)
(b) divide vertical space in two (current behaviour)
(c) divide vertical space in six (frame specific placement)

Hmm. Leighton?

Peter

P.S. Frame specific placement would work best with an overhaul of how we draw multi-fragment features like genes with exons.
Here a whole new sigil class for linking sub-parts of a feature might make sense. That is again something we only chatted about so far, but would make GenomeDiagram more useful for drawing eukaryotic annotation.

From p.j.a.cock at googlemail.com Wed Aug 1 06:43:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 11:43:59 +0100
Subject: [Biopython-dev] back_table in Bio.Data.CodonTable
In-Reply-To:
References:
Message-ID:

On Tue, Jul 31, 2012 at 8:07 PM, Jeff Hussmann wrote:
> It seems desirable to have each amino acid's list of codons be given in a deterministic order. I have been sorting lexicographically using the ordering 'TCAG'. This is referred to as the 'conventional ordering' in CodonTable.__str__.

Lexical sorting (i.e. using Python's sort on a list of codons) seems best, it is simple and predictable.

> The most flexible solution would be to take the ordering from self.nucleotide_alphabet.letters, but this would give 'GATC' for any CodonTable using IUPAC.unambiguous_dna as its nucleotide alphabet. Are there any Biopython-wide conventions here?

I'm not sure why the alphabets used that particular order over another.

Peter

From Leighton.Pritchard at hutton.ac.uk Wed Aug 1 06:53:19 2012
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Wed, 1 Aug 2012 10:53:19 +0000
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu>
Message-ID: <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk>

Hi all,

On 1 Aug 2012, at Wednesday, August 1, 10:27, Peter Cock wrote:
> On Wed, Aug 1, 2012 at 1:37 AM, Zachary Charlop-Powers wrote:
>> However if you use strand="None", Genome Diagram will draw the features inline with the track and always in the positive direction.
>> For myself, and probably others, keeping the direction of the features is immensely useful as you can often get a sense of operon structure in prokaryote genomes just by looking at the genes.

That's true. I find it easiest to identify operon structure in that way (i.e. visually and approximately) by noting where the features swap between positive and negative strands. Other approaches might include colouring positive/negative/None strand features differently.

>> Of course the forward and the minus strands can be drawn but condensing small sections of genes to a single track saves space when making images.

It doesn't, if the single track is the same height as before - what differs is whether the features on that track are half, or full, track height.

>> So, would it be possible to change the default behavior of Genome Diagram to draw features inline (strand="None"), but to preserve their orientation?

I think there's a better way to get what you're after. Changing the default setting here would modify more than whether the arrow spans the whole track, and it would also mean that GenomeDiagram does not respect the strand data of features by default. I think that's a bad thing.

> I think I know what you mean - that kind of picture is quite common e.g. for viruses - but only where there are no overlapping genes on opposite strands. GenomeDiagram was written originally primarily for bacteria, where overlapping genes on opposite strands are more common, which may explain the design choices made.

My original choice was made for a combination of reasons:

- I wanted to respect the strand information in the source data
- The 'box' sigil was easiest to draw, and was the first to be available (this carries no inherent directional information as an image)

The overlapping gene issue is relevant but, since the resolution of a drawn image is often such that boxes slightly overlap even when there is no feature overlap, it didn't feature in my consideration.

> Currently strand controls both orientation (for arrows, no effect on box sigils) and vertical placement (above, below, or straddling the line). Basically you want to override the vertical placement only? Note this is sigil dependent - it makes sense for the arrow, but not the default box (which was originally the only sigil supported).

That's how I understand Zachary's suggestion: to draw an arrow with orientation preserved, but across the positive and negative strands of the track.

> The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer. If you need this right now, a careful hack in get_feature_sigil is the way to proceed. The question is how to most cleanly expose this to the user while not breaking anything else (e.g. cross links), and ideally allow for a related option which Leighton and I have considered [...]

My original plan was to have more sigils available, implemented as draw_X() functions in the AbstractDrawer module. This would seem to be a good case for a draw_large_arrow() (or somesuch) function. The issue then would be a slight change to the prototypes for the existing draw_box and draw_arrow functions. Basically, we'd pass the overall bounding box and strand (x0, x1, btm, ctr, top, strand) information to the new functions, and let them decide where to place the sigil - above, below, or straddling the centre line.

Then, we could choose whether draw_arrow() takes an additional argument (e.g. straddle=True) for the behaviour that Zachary wants, or whether we use a new sigil ('large_arrow'), which could have its own function - just like that of draw_arrow() - but would probably be better implemented by just passing the straddle=True (or whatever) argument.

This way, the change is transparent to the user, except for perhaps choosing 'large_arrow' rather than 'arrow' as a sigil.
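
The placement logic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names sigil_vertical_span() and arrow_orientation() are hypothetical and are not part of AbstractDrawer or GenomeDiagram; only the (btm, ctr, top, strand) inputs and the straddle flag come from the discussion.

```python
# Hypothetical sketch: these helpers are NOT part of Bio.Graphics.GenomeDiagram.
# They only illustrate a sigil deciding its own vertical placement from the
# full track bounding box plus the feature strand.


def sigil_vertical_span(btm, ctr, top, strand, straddle=False):
    """Return the (low, high) vertical extent a sigil should occupy.

    btm, ctr and top bound the whole track; strand is +1, -1 or None.
    """
    if straddle or strand is None:
        # Strand-less features, and sigils asked to straddle the axis,
        # use the full track height.
        return btm, top
    if strand == 1:
        return ctr, top  # forward strand: above the centre line
    if strand == -1:
        return btm, ctr  # reverse strand: below the centre line
    raise ValueError("strand must be +1, -1 or None, got %r" % (strand,))


def arrow_orientation(strand):
    """Arrows point left only for the reverse strand, otherwise right."""
    return "left" if strand == -1 else "right"


# A reverse-strand feature drawn as a straddling arrow keeps its direction
# but spans the whole track height.
span = sigil_vertical_span(0.0, 5.0, 10.0, strand=-1, straddle=True)
direction = arrow_orientation(-1)
```

With this split, orientation depends only on the strand, while vertical placement depends on the strand and on whether the sigil chooses to straddle, which is the separation the thread is asking for.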

> That suggests this setting might be best at the GenomeDiagram feature level. Perhaps a new attribute/argument 'strand_mode',
>
> (a) ignore strand for vertical placement (what you want)
> (b) divide vertical space in two (current behaviour)
> (c) divide vertical space in six (frame specific placement)
>
> Hmm. Leighton?

I'm choosing to leave frame-specificity out of the discussion, for now ;)

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk
w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

________________________________________________________
This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796

From p.j.a.cock at googlemail.com Wed Aug 1 07:05:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 12:05:51 +0100
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To: <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk>
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk>
Message-ID:

On Wed, Aug 1, 2012 at 11:53 AM, Leighton Pritchard wrote:
>
> It doesn't, if the single track is the same height as before - what differs is whether the features on that track are half, or full, track height.

Yes, but once you've configured the arrows to straddle the axis, you can then allocate less vertical space to that track. i.e. it needs less space.

>> The question is how to most cleanly expose this to the user while not breaking anything else (e.g. cross links), and ideally allow for a related option which Leighton and I have considered [...]
>
> My original plan was to have more sigils available, implemented as draw_X() functions in the AbstractDrawer module. This would seem to be a good case for a draw_large_arrow() (or somesuch) function. The issue then would be a slight change to the prototypes for the existing draw_box and draw_arrow functions. Basically, we'd pass the overall bounding box and strand (x0, x1, btm, ctr, top, strand) information to the new functions, and let them decide where to place the sigil - above, below, or straddling the centre line.
>
> Then, we could choose whether draw_arrow() takes an additional argument (e.g. straddle=True) for the behaviour that Zachary wants, or whether we use a new sigil ('large_arrow'), which could have its own function - just like that of draw_arrow() - but would probably be better implemented by just passing the straddle=True (or whatever) argument.
>
> This way, the change is transparent to the user, except for perhaps choosing 'large_arrow' rather than 'arrow' as a sigil.

That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known.

That could work quite well, and the end user API is quite clean.

Peter

From Leighton.Pritchard at hutton.ac.uk Wed Aug 1 07:23:48 2012
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Wed, 1 Aug 2012 11:23:48 +0000
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk>
Message-ID: <93ED1DEB-1C9B-4D34-A898-D326ED5F8C2F@hutton.ac.uk>

On 1 Aug 2012, at Wednesday, August 1, 12:05, Peter Cock wrote:
> On Wed, Aug 1, 2012 at 11:53 AM, Leighton Pritchard wrote:
>> It doesn't, if the single track is the same height as before - what differs is the whether the features on that track are half, or full, track height.
>
> Yes, but once you've configured the arrows to straddle the axis, you can then allocate less vertical space to that track. i.e. it needs less space.

I understand that - and maybe I'm being (over) pedantic - but you can allocate less vertical space to the track in either case: the question is what kind of feature representation gives you the desired information legibly at those settings ;)

> That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known.
>
> That could work quite well, and the end user API is quite clean.

This option gets my vote.

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk
w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

From zcharlop at mail.rockefeller.edu Wed Aug 1 10:27:32 2012
From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers)
Date: Wed, 1 Aug 2012 14:27:32 +0000
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu>
Message-ID: <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>

Leighton, Peter,

I love that we're not in the same timezone; I ask a question when I leave work and - lo and behold - when I return in the morning there is a well thought out response. Thank you both.

> The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer.
>
> If you need this right now, a careful hack in get_feature_sigil is the way to proceed.

I will take a look at this for a quick hack for some drawing I am working on.

> That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known.
>
> That could work quite well, and the end user API is quite clean.

> This option gets my vote.
>
> L.

If you are both in agreement that this option is desirable and that it can be implemented in the sigil style, now we face the question of coding it. Would either of you consider working on it? If not this might be a problem I could tackle with a small amount of mentoring. Please let me know - I am happy to take a stab at it.

best regards,
zach cp

From p.j.a.cock at googlemail.com Wed Aug 1 13:15:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 18:15:31 +0100
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To: <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>
Message-ID:

On Wed, Aug 1, 2012 at 3:27 PM, Zachary Charlop-Powers wrote:
> Leighton, Peter,
>
> I love that we're not in the same timezone; I ask a question when I leave work and - lo and behold - when I return in the morning there is a well thought out response. Thank you both.

:)

Peter wrote:
>>> The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer.
>>>
>>> If you need this right now, a careful hack in get_feature_sigil is the way to proceed.

Zachary wrote:
> I will take a look at this for a quick hack for some drawing I am working on.

I hope you found any effort spent useful for understanding the codebase... even if it doesn't turn out to be needed (see below).

Peter wrote:
>>> That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known.
>>>
>>> That could work quite well, and the end user API is quite clean.

Leighton wrote:
>> This option gets my vote.
>>
>> L.

Zachary wrote:
> If you are both in agreement that this option is desirable and that it can be implemented in the sigil style, now we face the question of coding it. Would either of you consider working on it? If not this might be a problem I could tackle with a small amount of mentoring. Please let me know - I am happy to take a stab at it.

I had a go this afternoon (a quiet moment between rushes - grin), and it wasn't as bad as I feared. This is on a git branch at the moment,

https://github.com/peterjc/biopython/tree/gd-big

Thus far, just two commits.
The first refactors the current code to move the strand handling into the sigil code (but should, I hope, have no side effects):

https://github.com/peterjc/biopython/commit/d9c416be7dd2c7081bd66bd553c9feb0174ecc13

The second commit implements the new axis straddling arrow (for both linear and circular diagrams) plus a minimal test:

https://github.com/peterjc/biopython/commit/b58903d5c455416028a8ae410b2063d536448d59

To match the current sigil argument names BOX and ARROW, I have provisionally called BIGARROW. Any better ideas?

Also, to match the current arrow's behaviour, strand-less features get an arrow pointing to the right (like a forward strand arrow). Leighton and I had a little debate about this - with hindsight, the original arrow sigil might have raised an error or drawn a box in this situation - but I'm not willing to change this and break existing code.

It would be great if you (Zachary) could give this a test, both to look for regressions (anything that broke) and try the new sigil out. Are you familiar with git, and installing Biopython from source?

Regards,

Peter

From zcharlop at mail.rockefeller.edu Wed Aug 1 18:10:55 2012
From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers)
Date: Wed, 1 Aug 2012 22:10:55 +0000
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>
Message-ID: <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu>

Peter wrote:
> It would be great if you (Zachary) could give this a test, both to look for regressions (anything that broke) and try the new sigil out. Are you familiar with git, and installing Biopython from source?

Just reran my previous image-generation scripts with your Biopython. I used sigil="BIGARROW" instead of "ARROW" and it worked like a charm. Awesome.

Would you want to add the "BIGARROW" option to the tutorial?

best,
zach cp

From p.j.a.cock at googlemail.com Wed Aug 1 18:33:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 23:33:14 +0100
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To: <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu>
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu>
Message-ID:

On Wed, Aug 1, 2012 at 11:10 PM, Zachary Charlop-Powers wrote:
>> Peter wrote:
>>
>> It would be great if you (Zachary) could give this a test, both to look for regressions (anything that broke) and try the new sigil out. Are you familiar with git, and installing Biopython from source?
>
> Just reran my previous image-generation scripts with your Biopython. I used sigil="BIGARROW" instead of "ARROW" and it worked like a charm. Awesome.

Great. Thanks for quickly testing this.

> Would you want to add the "BIGARROW" option to the tutorial?

Yes, if/when we merge this (and I'll try to talk to Leighton about it tomorrow), then I would also want to update the Tutorial to describe this new feature. There is almost no point writing new code if we don't document it.

Peter

From tiagoantao at gmail.com Wed Aug 1 23:39:43 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 1 Aug 2012 20:39:43 -0700
Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Linux - Python 3.1
Message-ID:

I am currently away from office. I will respond back on the 20th of August.

Regards,
Tiago

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From Leighton.Pritchard at hutton.ac.uk Thu Aug 2 03:42:47 2012
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 2 Aug 2012 07:42:47 +0000
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>
Message-ID:

Hi,

On 1 Aug 2012, at Wednesday, August 1, 18:15, Peter Cock wrote:
> On Wed, Aug 1, 2012 at 3:27 PM, Zachary Charlop-Powers wrote:
>> Leighton, Peter,
>>
>> I love that we're not in the same timezone; I ask a question when I leave work and - lo and behold - when I return in the morning there is a well thought out response. Thank you both.

No worries.

> I had a go this afternoon (a quiet moment between rushes - grin),

Good job getting it done so quickly!

> and it wasn't as bad as I feared. [...] To match the current sigil argument names BOX and ARROW, I have provisionally called BIGARROW. Any better ideas?

BIGARROW sounds fine to me. I like literal names.

> Leighton and I had a little debate about this - with hindsight, the original arrow sigil might have raised an error or drawn a box in this situation - but I'm not willing to change this and break existing code.

Likewise - now it's been there so long, I think it would be inconsistent at this point to change it. Arguably, the default setting has to choose a direction simply because (single-headed) arrows have a direction. For those figures where you're being precise, users can use a box for a feature with no direction; if it's pointing the wrong way, users can set the feature strand. Left-to-right as a default is arbitrary, though.

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk
w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

From p.j.a.cock at googlemail.com Thu Aug 2 12:12:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 2 Aug 2012 17:12:54 +0100
Subject: [Biopython-dev] Genome Diagram Default Behavior
In-Reply-To:
References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu>
Message-ID:

On Thu, Aug 2, 2012 at 8:42 AM, Leighton Pritchard wrote:
> Peter wrote:
>>
>> To match the current sigil argument names BOX and ARROW, I have provisionally called BIGARROW. Any better ideas?
>
> BIGARROW sounds fine to me.
> I like literal names.

Great. Checked into the master, and I updated the Tutorial and the Proux et al 2002 Figure 6 reproduction example to use this:

Before (cross-links with strand specific ARROW sigil):
http://biopython.org/DIST/docs/tutorial/images/three_track_cl2.png

After (cross-links with strand straddling BIGARROW sigil):
http://biopython.org/DIST/docs/tutorial/images/three_track_cl2a.png

Original (I don't know what was used to draw this):
http://dx.doi.org/10.1128/JB.184.21.6026-6036.2002

Regards,

Peter

From clements at galaxyproject.org Fri Aug 3 19:23:25 2012
From: clements at galaxyproject.org (Dave Clements)
Date: Fri, 3 Aug 2012 16:23:25 -0700
Subject: [Biopython-dev] Galaxy is Hiring Postdocs
Message-ID:

Hello all,

The Galaxy Project, a highly successful high throughput data analysis platform for Life Sciences with over 23,000 users worldwide, is hiring:

The Taylor Lab in Biology and Mathematics & Computer Science at Emory University is looking for *postdoctoral scholars* to work on the Galaxy Project. Postdoctoral applicants should have expertise in Bioinformatics and Computational Biology and research interests that complement but extend the lab's current interests: The Galaxy project; distributed and high-performance computing for data intensive science; vertebrate functional genomics; and genomics and epigenomic mechanisms of gene regulation, the role of transcription factors and chromatin structure in global gene expression, development, and differentiation. See the announcement for full details (http://bx.mathcs.emory.edu/joining/postdocs/).

The Nekrutenko Lab at the Huck Institutes of Life Sciences at Penn State is seeking *highly opinionated and biologically inclined* *Postdoctoral researchers* within the Galaxy Project to develop best practices for analysis of next-generation sequencing data in all areas of Life Sciences where NGS is used.
Successful candidates will join a vibrant research group at the core of the Galaxy Project and will work on setting trends in modern data-driven life-sciences. Please send your CV and names/e-mail addresses of three references to jobs at galaxyproject.org.

Thanks,

Dave C.

--
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/

From arklenna at gmail.com Tue Aug 7 01:11:04 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 7 Aug 2012 01:11:04 -0400
Subject: [Biopython-dev] GSoC python variant update
Message-ID:

Full post: http://arklenna.tumblr.com/post/28890255191/

Summary:

* I'm working on the coordinate mapper Reece contributed: http://biopython.org/pipermail/biopython/2010-June/006598.html
* I'm representing intron locations relative to CDS coords using the HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html I'd like to know if there are other common ways of representing such positions.
* In order to customize the display of positions (e.g. 0-based or 1-based), I'm using a class as a configuration container. I've read on StackOverflow that attempts to use globals or a singleton class are discouraged in Python, but I have not found practical suggestions for how to implement module-wide configurations. Suggestions are welcome.
* Any advice about circular genomes or strandedness is also welcome.
* This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, etc. Are there other Biopython objects that store sequence coordinates and thus should be mappable?

Regards,

Lenna

From mjldehoon at yahoo.com Tue Aug 7 02:40:13 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 6 Aug 2012 23:40:13 -0700 (PDT)
Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
Message-ID: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com>

Dear all,

Currently Bio.Motif has some support for writing TRANSFAC files but not for reading TRANSFAC files.
I would like to add such a parser to Bio.Motif. Do you all agree that it fits in this module? Note that the TRANSFAC files very much look like EMBL files, and therefore contain much more information than what is currently in a Bio.Motif._Motif.Motif object (the object to be generated by Bio.Motif.read(handle, "transfac")). Perhaps the easiest is to add an attribute .annotations to Bio.Motif._Motif.Motif objects, and use it as a dictionary to store the EMBL-like annotations under their 2-letter keys. On a related note, currently Bio.Motif._Motif.Motif objects also perform functions that are more appropriate for a separate PWM (position-weight matrix) class within Bio.Motif. It may be a good idea to have a separate PWM class for this functionality. Best, -Michiel. From bartek at rezolwenta.eu.org Tue Aug 7 03:18:43 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 7 Aug 2012 09:18:43 +0200 Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi Michiel, On Tue, Aug 7, 2012 at 8:40 AM, Michiel de Hoon wrote: > Dear all, > > Currently Bio.Motif has some support for writing TRANSFAC files but not for reading TRANSFAC files. I would like to add such a parser to Bio.Motif. Do you all agree that it fits in this module? Note that the TRANSFAC files very much look like EMBL files, and therefore contain much more information than what is currently in a Bio.Motif._Motif.Motif object (the object to be generated by Bio.Motif.read(handle, "transfac")). Perhaps the easiest is to add an attribute .annotations to Bio.Motif._Motif.Motif objects, and use it as a dictionary to store the EMBL-like annotations under their 2-letter keys. > That would certainly be a valuable addition. I didn't add it as a format because it might get a bit confusing for users. 
TRANSFAC itself (trademarked, afaik) is distributed by the BIObase company and is not available unless you pay them for a license (you have to register even for the "publicly available" one, which comes with a license too). If you do, then you get access to a number of interconnected datasets, including information about what they call "matrices", "sites", "transcription factors" and "classes". I think that if we want to support their filetypes, we should probably decide whether to support the matrix file only or maybe the other ones as well. The confusing part is that many programs use "transfac-like" formats, i.e. files very similar to the part in the "matrix" file that corresponds to the PWM itself. (For example see http://www.benoslab.pitt.edu/stamp/help.html). > On a related note, currently Bio.Motif._Motif.Motif objects also perform functions that are more appropriate for a separate PWM (position-weight matrix) class within Bio.Motif. It may be a good idea to have a separate PWM class for this functionality. Currently, the Bio.Motif.Motif class represents something sequence-like. It can be seen either as a set of instances (.add_instance(), .search_instance()) or as a PWM (.log_odds(), search_pwm(), etc). It can hold some annotation (e.g. a name); however, in my mind, it is the core of the functionality for "motif" analysis. I can imagine other types of motifs (we discussed regExp or HMM based motifs) that could subclass Motif, but I think this should be the role of the Motif class. Then comes the thing with annotations. I would rather vote for something more similar to SeqRecord and Seq, where a new class (MotifRecord?) would hold all the annotation data from TRANSFAC or somesuch DB, and the Motif would remain more sequence-like. With respect to moving the PWM-related functionality to a separate class, I'm not sure. I think it is valuable to be able to load instances from a file and then convert them to a PWM.
It could be done with separate classes, but I'm not sure it would be easier then... best Bartek -- Bartek Wilczynski From mjldehoon at yahoo.com Tue Aug 7 04:39:15 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Aug 2012 01:39:15 -0700 (PDT) Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: Message-ID: <1344328755.85288.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi Bartek, Thanks for your reply. --- On Tue, 8/7/12, Bartek Wilczynski wrote: > If you do, then you get access to a number of interconnected > datasets, including information about what they call "matrices", > "sites" and "transcription factors" and "classes". I think that if > we want to support their filetypes, we probably should think whether > we should support the matrix file only or maybe the other ones as > well. I would suggest supporting just the matrices for now. > The confusing part is that many programs use "transfac-like" > formats, i.e. files very similar to the part in the "matrix" > file that corresponds to the PWM itself. (For example see > http://www.benoslab.pitt.edu/stamp/help.html). This also means that if Bio.Motif can parse TRANSFAC files, then it can parse the transfac-like formats, at least to some degree. Personally I am actually more interested in the SwissRegulon database, which uses a transfac-like format. > Then comes the thing with annotations. I would rather > vote for something more similar to SeqRecord and Seq, > where a new class (MotifRecord?) would hold all the > annotation data from TRANSFAC or somesuch DB, and the > Motif would remain more sequence-like. Are you suggesting that MotifRecord subclasses Bio.Motif._Motif.Motif? For example we could have a Bio.Motif.Parsers.TRANSFAC.Motif class that subclasses Bio.Motif._Motif.Motif. Then Bio.Motif._Motif.Motif remains sequence-like, and Bio.Motif.Parsers.TRANSFAC.Motif takes care of the annotations.
Alternatively we could say that Bio.Motif.Parsers.TRANSFAC.read returns a Bio.Motif.Parsers.TRANSFAC.Record object that contains the motif information as an attribute (so record.motif would be an instance of Bio.Motif._Motif.Motif). Best, -Michiel From mjldehoon at yahoo.com Tue Aug 7 10:47:00 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Aug 2012 07:47:00 -0700 (PDT) Subject: [Biopython-dev] Fw: Re: Parsing TRANSFAC matrices with Bio.Motif Message-ID: <1344350820.11922.YahooMailClassic@web164006.mail.gq1.yahoo.com> Forwarding Bartek's email to the list .. I am pretty much OK with his suggestions, but feel free to comment or suggest other solutions before we start implementing this. Best, -Michiel. --- On Tue, 8/7/12, Bartek Wilczynski wrote: > From: Bartek Wilczynski > Subject: Re: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif > To: "Michiel de Hoon" > Date: Tuesday, August 7, 2012, 5:16 AM > On Tue, Aug 7, 2012 at 10:39 AM, > Michiel de Hoon > wrote: > > Hi Bartek, > > > > Thanks for your reply. > > > > --- On Tue, 8/7/12, Bartek Wilczynski > wrote: > >> If you do, then you get access to a number of > interconnected > >> datasets, including information about what they > call "matrices", > >> "sites" and "transcription factors" and "classes". > I think that if > >> we want to support their filetypes, we probably > should think whether > >> we should support the matrix file only or maybe the > other ones asa > >> well. > > > > I would suggest to just support the matrices for now. > > > I'm fine with that. Some links between the files might be > less > usefule, but that might be added later. > > >> The confusing part is that many programs use > "transfac-like" > >> formats, i.e. files very similar to the part in the > "matrix" > >> file that corresponds to the PWM itself. (For > example see > >> http://www.benoslab.pitt.edu/stamp/help.html). 
> > > > This also means that if Bio.Motif can parse TRANSFAC > files, then it > > can parse the transfac-like formats, at least to some > degree. Personally I am actually more interested in the > SwissRegulon database, which uses a transfac-like format > > > > In principle yes, but there are slight variants making > things "almost > working". That's the main reason I didn't put the code I was > using > myself into biopython repository, as it might cause some > weird > breakages. For examples, some formats drop the P0 column > (the > "transfac-like" in STAMP, for one) which makes it impossible > to figure > out whether you are interpreting the numbers right unless > you agree on > some ordering of nucleotides. I would suggest that we should > support > databases named directly and, maybe, think about generic > methods for > "raw PSSM" files, that would require the user to give the > nucleotide > order... > > >> Then comes the thing with annotations. I would > rather > >> vote for something more similar to SeqRecord and > Seq, > >> where a new class (MotifRecord?) would hold all > the > >> annotation data from TRANSFAC or somesuch DB, and > the > >> Motif would remain more sequence-like. > > > > Are you suggesting that MotifRecord subclasses > Bio.Motif._Motif.Motif? > > For example we could have a > Bio.Motif.Parsers.TRANSFAC.Motif class that subclasses > Bio.Motif._Motif.Motif. Then? Bio.Motif._Motif.Motif > remains sequence-like, and Bio.Motif.Parsers.TRANSFAC.Motif > takes care of the annotations. > > > > Alternatively we could say that > Bio.Motif.Parsers.TRANSFAC.read returns a > Bio.Motif.Parsers.TRANSFAC.Record object that contains the > motif information as an attribute (so record.motif would be > an instance of Bio.Motif._Motif.Motif). > > > > For me, personally, the version where transfac motif is a > subclass of > Motif is a more useful one. It is simpler, and it adds > annotations as > attributes of a motif. 
However, if we decided that we want > the whole > TRANSFAC db with all it's annotations, the more natural way > would be > to have separate classes for instances and motifs and maybe > even > separate record classes representing a database record > (there might be > more transfac records referencing the same matrix). I don't > think that > there is so much need for supporting all the stuff from > TRANSFAC (I > don't know anybody who would be using all their annotations, > people > seem to care only about matrices anyway) so I'd vote for the > simpler > way of subclassing Motif. > > best > Bartek > -- > Bartek Wilczynski > From w.arindrarto at gmail.com Tue Aug 7 13:56:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 7 Aug 2012 19:56:26 +0200 Subject: [Biopython-dev] GSoC Project Update -- 11 Message-ID: Hello everyone, I have just posted my latest update on my project here: http://bow.web.id/blog/2012/08/back-on-the-main-branch/ It's been taking quite a while since I posted my last update since there has been a considerable change to the SearchIO object model I'm using. The details are in my blog post, but to keep it short, it was because the previous model (QueryResult, Hit, and HSP) was inadequate in handling files that have multiple sequences in their HSP (so far seen in files output by BLAT and Exonerate). In my previous updates, I've been using simple Python lists to store attributes related to these multiple sequences, but that turned out to be problematic as it may make the object have inconsistent attributes. After trying out several different implementations and discussing them with Peter, we've finally settled on a new model. The new model changes the HSP object into a container that stores a new object: HSPFragment. HSPFragment represents a single, contiguous alignment of the hit and query sequence. It only stores the sequence, coordinates, frames, and strands. 
Other attributes made by the search program (such as evalues or scores) are stored in the HSP object. This change required some modifications on all of the current parsers, but from a user's perspective working with file formats other than BLAT or Exonerate, the changes should be minimum. Aside from this, there's also a small update on the main API which lets it accept keyword arguments. The arguments modify behaviors of the parser, and they are different for each parser. Currently, this is only used by the BLAST tabular parser, but I imagine more parsers will use this in the future. Finally, having settled on a firmer object model, I'll be spending the rest of my time to focus on the documentation. There may still be small fixes to the code, but I expect nothing as major as this one. regards, Bow From chapmanb at 50mail.com Wed Aug 8 09:55:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 08 Aug 2012 09:55:36 -0400 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <874nodh4iv.fsf@fastmail.fm> Lenna; This all sounds great and will be a nice practical addition to Biopython. Thanks for taking it on. Some specific thoughts on your questions: > * I'm representing intron locations relative to CDS coords using the > HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html > I'd like to know if there are other common ways of representing such > positions. I don't know of one myself, so it's great to be following a standard rather than reinventing something. Nice work. > * In order to customize the display of positions (e.g. 0-based or > 1-based), I'm using a class as a configuration container. I've read on > StackOverflow that attempts to use globals or a singleton class are > discouraged in Python, but I have not found practical suggestions for > how to implement module-wide configurations. Suggestions are welcome. With configuration items like this, you have two choices: - A global variable. 
- Pass the configuration to every function that needs it. There are tradeoffs with both approaches, but for this case I agree with your decision to use globals. Most people will want 0-based/Biopython style but it gives those who don't a knob to switch over. > * Any advice about circular genomes or strandedness is also welcome. Circular handling is an unresolved issue in Biopython: https://redmine.open-bio.org/issues/2578 It's a bit tricky, especially with features that span the origin. I'd prioritize handling strandedness since you're going to have plenty of reverse strand coding sequences. You're mapping not only within the coding region but also back to the original sequence on the reverse strand. So in your g2c mapping, the original gene goes from e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best place to get started is to pick a reverse strand gene and then work through the mappings, thinking through the orientations. I find drawing it out to be the easiest way. > * This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, > etc. Are there other Biopython objects that store sequence coordinates > and thus should be mappable? That sounds like a great start. Thanks again for this, Brad From p.j.a.cock at googlemail.com Wed Aug 8 10:33:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Aug 2012 15:33:05 +0100 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: <874nodh4iv.fsf@fastmail.fm> References: <874nodh4iv.fsf@fastmail.fm> Message-ID: On Wed, Aug 8, 2012 at 2:55 PM, Brad Chapman wrote: >Lenna wrote: >> * Any advice about circular genomes or strandedness is also welcome. > > Circular handling is an unresolved issue in Biopython: > > https://redmine.open-bio.org/issues/2578 > > It's a bit tricky, especially with features that span the origin. > > I'd prioritize handling strandedness since you're going to have plenty > of reverse strand coding sequences. 
You're mapping not only within the > coding region but also back to the original sequence on the reverse > strand. So in your g2c mapping, the original gene goes from > e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best > place to get started is to pick a reverse strand gene and then work > through the mappings, thinking through the orientations. I find drawing > it out to be the easiest way. And then think about mixed strand genes, e.g. transpliced tRNA is a good example - there is a GenBank example in our unit tests. Peter From lgautier at gmail.com Wed Aug 8 12:37:35 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Wed, 08 Aug 2012 18:37:35 +0200 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <502295CF.3020103@gmail.com> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro > Lenna; > This all sounds great and will be a nice practical addition to > Biopython. Thanks for taking it on. Some specific thoughts on your questions: > >> >* I'm representing intron locations relative to CDS coords using the >> >HGVS standards:http://www.hgvs.org/mutnomen/refseq_figure.html >> >I'd like to know if there are other common ways of representing such >> >positions. > I don't know of one myself, so it's great to be following a standard > rather than reinventing something. Nice work. > >> >* In order to customize the display of positions (e.g. 0-based or >> >1-based), I'm using a class as a configuration container. I've read on >> >StackOverflow that attempts to use globals or a singleton class are >> >discouraged in Python, but I have not found practical suggestions for >> >how to implement module-wide configurations. Suggestions are welcome. Module-wide configuration can be implemented as variables, as long as they are declared before the functions using them. 
If considering a package rather than a single module, options can be stored in a module dedicated to options (since Python modules are singletons). > With configuration items like this, you have two choices: > > - A global variable. > - Pass the configuration to every function that needs it. > > There are tradeoffs with both approaches, but for this case I agree with > your decision to use globals. Most people will want 0-based/Biopython > style but it gives those who don't a knob to switch over. I'd argue that allowing a switch is an invitation to spectacular issues down the road. An easy, yet frightening, example would be the case where third-party code (such as a module) changes this without you knowing. Another scary thought is that this would amount to bringing the infamous Perl variable "$[" to Python. Go explain again that folks should use Python for its elegance and simplicity after that. Best, L. From arklenna at gmail.com Wed Aug 8 14:44:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 8 Aug 2012 14:44:33 -0400 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: <502295CF.3020103@gmail.com> References: <502295CF.3020103@gmail.com> Message-ID: On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier wrote: > On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro > >> >>> >* In order to customize the display of positions (e.g. 0-based or >>> >1-based), I'm using a class as a configuration container. I've read on >>> >StackOverflow that attempts to use globals or a singleton class are >>> >discouraged in Python, but I have not found practical suggestions for >>> >how to implement module-wide configurations. Suggestions are welcome. > > > Module-wide configuration can be implemented as variables, as long as they > are declared before the functions using them. > If considering a package rather than a single module, options can be stored > in a module dedicated to options (since Python modules are singletons).
> Hi Laurent, I really like the idea of a configuration module. I will definitely move in that direction. > >> With configuration items like this, you have two choices: >> >> - A global variable. >> - Pass the configuration to every function that needs it. >> >> There are tradeoffs with both approaches, but for this case I agree with >> your decision to use globals. Most people will want 0-based/Biopython >> style but it gives those who don't a knob to switch over. > > > I'd argue that allowing to switch is an invitation to spectacular issues > down the road. > An easy, yet frightening, example would be the case where using third-party > code (such a module) changes this without you knowing. > > An other scary thought is that this would amount to bringing the infamous > Perl variable "$[" to Python. Go explain again that folks should Python for > its elegance and simplicity after that. > > Yikes. My approach will not be comparable to $[. For starters, it wouldn't modify the behavior of every sequence-like object. My current thought would be to store the 0-based position in an attribute `pos`, have a property `pos_str` that returns `pos` + `Config.index`. For representations, `__str__` will return `pos_str`, and `__repr__` will return `pos` (always 0-based). Math would always use the 0-based position. I intend to keep the influence of the hypothetical mapping Config module limited to Biopython Seq* objects. It should also be possible to make a kill switch, namely, a version of the Config module where all of the settings are neutral to adding (i.e. `def __add__(self, other): return other`). Please let me know if this would not fully address your concerns. 
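A minimal sketch of the scheme described above, under stated assumptions: `Config` and `MapPosition` are hypothetical names used only for illustration, and only `pos`, `pos_str` and `Config.index` come from the description. The `__repr__` here wraps the 0-based value in the class name, which is a small variation on "return `pos`".

```python
class Config(object):
    """Hypothetical configuration container (not an existing Biopython class)."""
    index = 1  # display offset: 1 gives 1-based output, 0 gives 0-based output


class MapPosition(object):
    """Hypothetical position class; `pos` is always stored 0-based."""

    def __init__(self, pos):
        self.pos = pos

    @property
    def pos_str(self):
        # Display value only: the stored 0-based position shifted by the offset.
        return str(self.pos + Config.index)

    def __str__(self):
        return self.pos_str

    def __repr__(self):
        # Always shows the raw 0-based value, whatever Config.index is.
        return "%s(%r)" % (self.__class__.__name__, self.pos)

    def __add__(self, offset):
        # Arithmetic only ever touches the 0-based value.
        return MapPosition(self.pos + offset)


p = MapPosition(9)
print(str(p))    # affected by Config.index
print(repr(p))   # not affected by Config.index
```

Note that `str(p)` changes whenever any code reassigns `Config.index`, while `repr(p)` and arithmetic do not.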
Cheers, Lenna From lgautier at gmail.com Wed Aug 8 17:58:26 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Wed, 08 Aug 2012 23:58:26 +0200 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: <502295CF.3020103@gmail.com> Message-ID: <5022E102.9010509@gmail.com> On 2012-08-08 20:44, Lenna Peterson wrote: > On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier wrote: >> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro >> >>>>> * In order to customize the display of positions (e.g. 0-based or >>>>> 1-based), I'm using a class as a configuration container. I've read on >>>>> StackOverflow that attempts to use globals or a singleton class are >>>>> discouraged in Python, but I have not found practical suggestions for >>>>> how to implement module-wide configurations. Suggestions are welcome. >> >> Module-wide configuration can be implemented as variables, as long as they >> are declared before the functions using them. >> If considering a package rather than a single module, options can be stored >> in a module dedicated to options (since Python modules are singletons). >> > Hi Laurent, > > I really like the idea of a configuration module. I will definitely > move in that direction. > >>> With configuration items like this, you have two choices: >>> >>> - A global variable. >>> - Pass the configuration to every function that needs it. >>> >>> There are tradeoffs with both approaches, but for this case I agree with >>> your decision to use globals. Most people will want 0-based/Biopython >>> style but it gives those who don't a knob to switch over. >> >> I'd argue that allowing to switch is an invitation to spectacular issues >> down the road. >> An easy, yet frightening, example would be the case where using third-party >> code (such a module) changes this without you knowing. >> >> An other scary thought is that this would amount to bringing the infamous >> Perl variable "$[" to Python. 
Go explain again that folks should Python for >> its elegance and simplicity after that. >> >> > Yikes. My approach will not be comparable to $[. For starters, it > wouldn't modify the behavior of every sequence-like object. > > My current thought would be to store the 0-based position in an > attribute `pos`, have a property `pos_str` that returns `pos` + > `Config.index`. For representations, `__str__` will return `pos_str`, > and `__repr__` will return `pos` (always 0-based). Math would always > use the 0-based position. > > I intend to keep the influence of the hypothetical mapping Config > module limited to Biopython Seq* objects. It should also be possible > to make a kill switch, namely, a version of the Config module where > all of the settings are neutral to adding (i.e. `def __add__(self, > other): return other`). What about making the design decision that string representations are 1-based then, and going beyond making a kill switch by just killing the switch? You'd document it, and folks who want 0-based positions would cook their own function(s). I think that configuration modules can be very useful for an application (an example here: http://flask.pocoo.org/snippets/2/ ), but I am more reserved about their use in a library. But do not let me stop you from pursuing this; I am only expressing an opinion. One last point though. Let me describe a possible scenario: 3rd-party module "foo" is using the Biopython Seq* part, and its author thinks that Config.index should be 1, so he/she sets it accordingly. An early line in foo.py is: from somewhere.in.biopython.seq import config config.index = 1 There is another piece of code (let's call it bar.py), written by someone else or by the same person at a different time. Now the hype is all about 0-based indexes, so the author sets it to be sure: from somewhere.in.biopython.seq import config config.index = 0 To complete the scenario bar.py is using foo.py, or the other way around.
The dependency of one on the other does not even have to be direct. Now config.index will be what the last piece of code sets it to, although other parts of the code might assume it is set to something else. That sort of situation is not prevented from happening with any sort of module in Python (e.g., import sys; sys.stdout = sys.stderr), but people know they should not do it. Here the config.index would appear as something people are free to change if they like. Again, that's just an opinion. Others might differ. Best, Laurent > > Please let me know if this would not fully address your concerns. > > Cheers, > > Lenna From arklenna at gmail.com Wed Aug 8 18:39:48 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 8 Aug 2012 18:39:48 -0400 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: <5022E102.9010509@gmail.com> References: <502295CF.3020103@gmail.com> <5022E102.9010509@gmail.com> Message-ID: On Wed, Aug 8, 2012 at 5:58 PM, Laurent Gautier wrote: > On 2012-08-08 20:44, Lenna Peterson wrote: >> >> On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier >> wrote: >>> >>> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro >>> >>>>>> * In order to customize the display of positions (e.g. 0-based or >>>>>> 1-based), I'm using a class as a configuration container. I've read on >>>>>> StackOverflow that attempts to use globals or a singleton class are >>>>>> discouraged in Python, but I have not found practical suggestions for >>>>>> how to implement module-wide configurations. Suggestions are welcome. >>> >>> >>> Module-wide configuration can be implemented as variables, as long as >>> they >>> are declared before the functions using them. >>> If considering a package rather than a single module, options can be >>> stored >>> in a module dedicated to options (since Python modules are singletons). >>> >> Hi Laurent, >> >> I really like the idea of a configuration module. I will definitely >> move in that direction.
>> >>>> With configuration items like this, you have two choices: >>>> >>>> - A global variable. >>>> - Pass the configuration to every function that needs it. >>>> >>>> There are tradeoffs with both approaches, but for this case I agree with >>>> your decision to use globals. Most people will want 0-based/Biopython >>>> style but it gives those who don't a knob to switch over. >>> >>> >>> I'd argue that allowing to switch is an invitation to spectacular issues >>> down the road. >>> An easy, yet frightening, example would be the case where using >>> third-party >>> code (such a module) changes this without you knowing. >>> >>> An other scary thought is that this would amount to bringing the infamous >>> Perl variable "$[" to Python. Go explain again that folks should Python >>> for >>> its elegance and simplicity after that. >>> >>> >> Yikes. My approach will not be comparable to $[. For starters, it >> wouldn't modify the behavior of every sequence-like object. >> >> My current thought would be to store the 0-based position in an >> attribute `pos`, have a property `pos_str` that returns `pos` + >> `Config.index`. For representations, `__str__` will return `pos_str`, >> and `__repr__` will return `pos` (always 0-based). Math would always >> use the 0-based position. >> >> I intend to keep the influence of the hypothetical mapping Config >> module limited to Biopython Seq* objects. It should also be possible >> to make a kill switch, namely, a version of the Config module where >> all of the settings are neutral to adding (i.e. `def __add__(self, >> other): return other`). > > > What about making the design decision that string representations that are > 1-based then, and go beyond making a kill switch by just kill the switch ? > You'd document it, folks that want 0-based positions would cook their own > function(s). 
> > I think that configuration modules can be very useful for an application (an > example here: > http://flask.pocoo.org/snippets/2/ ), but I am more reserved about its use > in a library. > > But do not let me stop you from pursuing this; I am only expressing an > opinion. One last point though. > Let me describe a possible scenario: > > 3rd-party module "foo" is using the Biopython Seq* part, and its author > thinks that Config.index should at 1 one, so he/she sets it accordingly. > An early line in foo.py is: > from somewhere.in.biopython.seq import config > config.index = 1 > > There is an other piece of code (let's call it bar.py), written by someone > else or by the same person at a different time. Now the hype is all about > 0-based indexes, so the author sets it to be sure: > from somewhere.in.biopython.seq import config > config.index = 0 > > To complete the scenario bar.py is using foo.py, or the other way around. > The requirement for one an other does not even have to be direct. Now > config.index will be what the last piece of code sets it to, although other > parts of the code might assume it is set to something else. > > That sort of situation is not prevented from happening with any sort of > module in Python (e.g., import sys; sys.stdout = sys.stderr), but people > know they should not do it. Here the config.index would appear as something > people should change if they like. > > Again, that's just an opinion. Others might differ. > > Best, > > > Laurent > > >> >> Please let me know if this would not fully address your concerns. >> >> Cheers, >> >> Lenna > > Laurent, I must thank you again for your foresight. I am realizing I may have gotten carried away with configurability. My initial goal with the index setting was to enable both GenBank and HGVS representations of genomic positions; a much simpler and safer approach would be to have `to_genbank()` and `to_hgvs()` methods. A user could set the relevant objects' __str__ to either of those. 
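A minimal sketch of that simpler approach, under stated assumptions: `GenomePosition` and `HgvsPosition` are hypothetical names, only the method names `to_genbank()` and `to_hgvs()` come from the paragraph above, and the output strings are simplified (a bare 1-based number for GenBank, a `g.` prefix on a 1-based number for HGVS genomic coordinates). Python looks up `__str__` on the type rather than on the instance, so the sketch binds it on a subclass instead of assigning it to an individual object.

```python
class GenomePosition(object):
    """Hypothetical genomic position; `pos` is stored 0-based."""

    def __init__(self, pos):
        self.pos = pos

    def to_genbank(self):
        # GenBank/INSDC location strings count from 1.
        return str(self.pos + 1)

    def to_hgvs(self):
        # HGVS genomic coordinates also count from 1 and carry a "g." prefix.
        return "g.%d" % (self.pos + 1)

    def __repr__(self):
        # Always the raw 0-based value.
        return "%s(%d)" % (self.__class__.__name__, self.pos)


class HgvsPosition(GenomePosition):
    # Opt in to HGVS display for this class only; nothing global changes.
    __str__ = GenomePosition.to_hgvs


p = GenomePosition(9)
print(p.to_genbank())
print(p.to_hgvs())
print(str(HgvsPosition(9)))
```

With this layout there is no shared setting for third-party code to flip: each caller asks for the representation it wants.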
Cheers, Lenna From p.j.a.cock at googlemail.com Thu Aug 9 05:07:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 9 Aug 2012 10:07:15 +0100 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: <5022E102.9010509@gmail.com> References: <502295CF.3020103@gmail.com> <5022E102.9010509@gmail.com> Message-ID: On Wed, Aug 8, 2012 at 10:58 PM, Laurent Gautier wrote: > > What about making the design decision that string representations that are > 1-based then, and go beyond making a kill switch by just kill the switch ? > You'd document it, folks that want 0-based positions would cook their own > function(s). > > I think that configuration modules can be very useful for an application ... I agree that a module-level config setting is unwise. However, I'd much prefer the string representation to be 0-based for consistency, both internal to the module and with most of Biopython. (The restriction module uses 1-based counting, which I find very annoying.) You could still provide something like a format method to give a string in common representations (e.g. GenBank/EMBL/INSDC style location strings). Peter From mjldehoon at yahoo.com Thu Aug 9 07:07:20 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 9 Aug 2012 04:07:20 -0700 (PDT) Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts Message-ID: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi guys, In the Motif class in Bio.Motif._Motif, there is an attribute self.has_instances to identify whether the attribute self.instances is defined. I think that we can remove the self.has_instances attribute from the code and simply set self.instances=None when it is undefined. Same thing for self.counts and self.has_counts. Any objections? Best, -Michiel.
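The proposal above could look roughly like this. `SketchMotif` is a hypothetical stand-in written for illustration, not the actual Bio.Motif._Motif.Motif class; only the attribute names `instances` and `counts` come from the message, and the example data is made up.

```python
class SketchMotif(object):
    """Hypothetical stand-in for a motif object (not the Bio.Motif code)."""

    def __init__(self, instances=None, counts=None):
        # None means "undefined", so no has_instances/has_counts flags are kept.
        self.instances = instances
        self.counts = counts


# Made-up example data: one motif built from sites, one from a count matrix.
from_sites = SketchMotif(instances=["CACGTG", "CACGTA"])
from_matrix = SketchMotif(counts={"A": [0, 2], "C": [2, 0], "G": [0, 0], "T": [0, 0]})

for motif in (from_sites, from_matrix):
    # Compare against None explicitly: an empty list is defined but falsy.
    if motif.instances is not None:
        print("built from %d instances" % len(motif.instances))
    else:
        print("built from a count matrix")
```

Callers that previously tested `motif.has_instances` would test `motif.instances is not None` instead.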
From bartek at rezolwenta.eu.org Thu Aug 9 08:26:33 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 9 Aug 2012 14:26:33 +0200 Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts In-Reply-To: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Thu, Aug 9, 2012 at 1:07 PM, Michiel de Hoon wrote: > Hi guys, > > In the Motif class in Bio.Motif._Motif, there is an attribute self.has_instances to identify whether the attributes self.instances is defined. I think that we can remove the self.has_instances attribute from the code and simply set self.instances=None when it is undefined. Same thing for self.counts and self.has_counts. > Any objections? Makes sense to me. +1 -- Bartek Wilczynski From mjldehoon at yahoo.com Thu Aug 9 12:00:14 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 9 Aug 2012 09:00:14 -0700 (PDT) Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts In-Reply-To: Message-ID: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com> OK, done. Thanks! -Michiel. --- On Thu, 8/9/12, Bartek Wilczynski wrote: > From: Bartek Wilczynski > Subject: Re: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts > To: "Michiel de Hoon" > Cc: biopython-dev at biopython.org > Date: Thursday, August 9, 2012, 8:26 AM > On Thu, Aug 9, 2012 at 1:07 PM, > Michiel de Hoon > wrote: > > Hi guys, > > > > In the Motif class in Bio.Motif._Motif, there is an > attribute self.has_instances to identify whether the > attributes self.instances is defined. I think that we can > remove the self.has_instances attribute from the code and > simply set self.instances=None when it is undefined. Same > thing for self.counts and self.has_counts. > > Any objections? > > Makes sense to me. 
+1
>
> --
> Bartek Wilczynski
>

From tiagoantao at gmail.com Thu Aug 9 23:04:53 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 9 Aug 2012 20:04:53 -0700
Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Linux 64 - Python 2.7
Message-ID:

I am currently away from office. I will respond back on the 20th of August.

Regards,
Tiago
--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Fri Aug 10 04:33:43 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 10 Aug 2012 09:33:43 +0100
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Thu, Aug 9, 2012 at 5:00 PM, Michiel de Hoon wrote:
> OK, done. Thanks!
> -Michiel.

You'll also need to update the example in the Tutorial, quote:

    The arnt and srf motifs can both do the same things for us,
    but they use different internal representations of the motif.
    We can tell that by inspecting the \verb|has_counts| and
    has_instances properties:

    >>> arnt.has_instances
    True
    >>> srf.has_instances
    False
    >>> srf.has_counts
    True

This means test_Tutorial.py is failing (across all platforms).
Presumably we would suggest switching these to something like:

    >>> arnt.instances is None
    False

etc?
In fact, given the old methods were documented like this,
I would be happier if we could phase them out with
a deprecation warning via a read-only property method,

    @property
    def has_instances(self):
        """Does this motif have instances (DEPRECATED)."""
        import warnings
        from Bio import BiopythonDeprecationWarning
        warnings.warn("Check if motif.instances is None or not instead",
                      BiopythonDeprecationWarning)
        return self.instances is not None

(untested, but something like that)

Peter

From p.j.a.cock at googlemail.com Fri Aug 10 16:04:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 10 Aug 2012 21:04:54 +0100
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To:
References: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Fri, Aug 10, 2012 at 9:33 AM, Peter Cock wrote:
> On Thu, Aug 9, 2012 at 5:00 PM, Michiel de Hoon wrote:
>> OK, done. Thanks!
>> -Michiel.
>
> You'll also need to update the example in the Tutorial, quote:
>
> The arnt and srf motifs can both do the same things for us,
> but they use different internal representations of the motif.
> We can tell that by inspecting the \verb|has_counts| and
> has_instances properties:
>
> >>> arnt.has_instances
> True
> >>> srf.has_instances
> False
> >>> srf.has_counts
> True
>
> This means test_Tutorial.py is failing (across all platforms).
> Presumably we would suggest switching these to something like:
>
> >>> arnt.instances is None
> False

Fixed:
https://github.com/biopython/biopython/commit/b866e74dc9b6162517588ea4c0e4d1ecde5ed87c

> etc?
> In fact, given the old methods were documented like
> this, I would be happier if we could phase them out with
> a deprecation warning via a read-only property method,
>
>     @property
>     def has_instances(self):
>         """Does this motif have instances (DEPRECATED)."""
>         import warnings
>         from Bio import BiopythonDeprecationWarning
>         warnings.warn("Check if motif.instances is None or not instead",
>                       BiopythonDeprecationWarning)
>         return self.instances is not None
>
> (untested, but something like that)

Done:
https://github.com/biopython/biopython/commit/fd2223d118227c921524e070c803b97bc979a70f

Although since that won't work on old Biopython either (you'd
get an AttributeError), perhaps we should label these new
backwards compatible properties as obsolete with a pending
deprecation warning for the next release (delay the deprecation)?

Peter

From mjldehoon at yahoo.com Fri Aug 10 23:48:29 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 10 Aug 2012 20:48:29 -0700 (PDT)
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To:
Message-ID: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com>

Hi Peter,

--- On Fri, 8/10/12, Peter Cock wrote:
> > This means test_Tutorial.py is failing (across all platforms).
> > Presumably we would suggest switching these to something like:
> >
> >     >>> arnt.instances is None
> >     False
>
> Fixed:
> https://github.com/biopython/biopython/commit/b866e74dc9b6162517588ea4c0e4d1ecde5ed87c

Thanks for fixing this! Sorry I missed doing this when I was making these changes.

> Although since that won't work on old Biopython either (you'd
> get an AttributeError), perhaps we should label these new
> backwards compatible properties as obsolete with a pending
> deprecation warning for the next release (delay the
> deprecation)?
>

I think we are being way too careful. Requiring proper deprecation warnings each time we make a change in Biopython will slow down its development and improvement.
In the past when making changes to the existing code, we have gotten very few complaints; also in this case I doubt that anybody will miss has_counts, has_instances.

Best,
-Michiel.

From mjldehoon at yahoo.com Sat Aug 11 00:25:05 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 10 Aug 2012 21:25:05 -0700 (PDT)
Subject: [Biopython-dev] Bio.Motif AlignAce parser
Message-ID: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Hi guys,

Looking some more at the parsers in Bio.Motif.

In the Record class in Bio/Motif/Parsers/AlignAce.py, we have an attribute self.current_motif that points to the motif currently being parsed by the parser (or, after the parser finishes, the last motif that was parsed). As far as I can tell, using a temporary variable current_motif within the read() function would be sufficient; we don't need to store it in the record.

I would also suggest that the read() function strip() all lines. Currently the end-of-line markers are kept. For example the version and the command line are stored as "AlignACE 4.0 05/13/04\n" and "./AlignACE -i test.fa \n" respectively.

The version of the AlignACE program is stored in record.ver. The MEME and Mast parsers in Bio.Motif instead use record.version. For consistency I would suggest using record.version in the AlignACE parser as well.

The command line is stored in record.cmd_line. The MEME parser uses record.command instead. I think both are fine, but I would also prefer this to be consistent.

Then there are two attributes param_dict and seq_dict. The former is a dictionary that stores the parameters used in the run. The latter is not a dictionary but a list of sequence-related information. Since we usually don't put the type of the object in the attribute names, I would suggest calling these simply parameters and sequences. For comparison, the Mast parser uses record.sequences for an analogous attribute; MEME uses record.sequence_names.
For consistency I would suggest to use record.sequences for all three. This would create some backward-incompatible changes that may confuse users. Now currently the parsers are located in Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast. I would prefer Bio.Motif.AlignAce, Bio.Motif.MEME, Bio.Motif.Mast. Currently to parse the AlignAce output one would do >>> from Bio.Motif.Parsers import AlignAce >>> record = AlignAce.read(handle) >>> record If we move the parsers one level up, this would be >>> from Bio.Motif import AlignAce >>> record = AlignAce.read(handle) >>> record which looks a bit more straightforward to me. In addition, this allows us to put a deprecation warning on the Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast modules as a whole, and we won't have to put deprecation warnings on each change separately. Any comments, objections? Best, -Michiel. From p.j.a.cock at googlemail.com Sat Aug 11 06:50:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 11 Aug 2012 11:50:07 +0100 Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts In-Reply-To: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Saturday, August 11, 2012, Michiel de Hoon wrote: > Hi Peter, > > > Although since that won't work on old Biopython either > > (you'd > > get an AttributeError), perhaps we should label these new > > backwards compatible properties as obsolete with a pending > > deprecation warning for the next release (delay the > > deprecation)? > > > > I think we are being way too careful. Requiring proper deprecation > warnings each time we make a change in Biopython will slow down its > development and improvement. In the past when making changes to the > existing code, we have gotten very few complaints; also in this case I > doubt that anybody will miss has_counts, has_instances. 
>
> Best,
> -Michiel.
>

In this case you're probably right about it not causing too much inconvenience - this is a relatively new module after all.

Peter

From arklenna at gmail.com Mon Aug 13 01:00:41 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 13 Aug 2012 01:00:41 -0400
Subject: [Biopython-dev] GSoC python variant update 10
Message-ID:

Link: http://arklenna.tumblr.com/post/29317968106/

Post:

Following extensive [discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html) on the dev list of the pros and cons of configuration classes/modules, I have refactored my [coordinate mapper](https://gist.github.com/3172753) to keep configuration as isolated as possible.

All mapping functions use base 0 internally. Transformation to and from 1-based coords is handled by custom MapPosition objects (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details.

There are two different ways a user can convert a coordinate from HGVS:

    # ... assuming cm is an instance of CoordinateMapper

    # Manually construct position from HGVS
    CDS_coord = CDSPosition.from_hgvs("6+1")
    genomic_coord = cm.c2g(CDS_coord)
    print genomic_coord.to_hgvs()

    # Pass dialect argument to mapping function
    genomic_coord = cm.c2g("6+1", dialect="HGVS")
    print genomic_coord.to_hgvs()

Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation:

    # Set MapPositions to print as HGVS by default
    def use_hgvs(self):
        return str(self.to_hgvs())
    MapPosition.__str__ = use_hgvs

The [version](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf) as of this writing is passing tests using base 0. I have not yet implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my list. I'm hoping to have time for strand and mixed strand, too.
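As a rough illustration of what mapping in base 0 means, here is a minimal sketch covering forward-strand exons only. The function name `c2g_sketch` and the exon layout are hypothetical and are not taken from the gist; the real mapper handles much more (introns in HGVS notation, dialects, and eventually strand):

```python
def c2g_sketch(cds_pos, exons):
    """Map a 0-based CDS offset to a 0-based genomic position.

    exons: list of (start, end) pairs, 0-based and end-exclusive,
    forward strand only, in transcript order (hypothetical layout).
    """
    remaining = cds_pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            # The position falls inside this exon
            return start + remaining
        # Otherwise skip this exon and keep counting in the next one
        remaining -= length
    raise IndexError("CDS position %d is past the last exon" % cds_pos)


# Two exons; in 1-based GenBank terms these would be 25..30 and 36..40
exons = [(24, 30), (35, 40)]
print(c2g_sketch(0, exons))
print(c2g_sketch(6, exons))
```

Keeping everything 0-based and end-exclusive inside the function means the arithmetic needs no `+ 1` or `- 1` corrections; those only appear when converting to or from a 1-based dialect at the edges.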
Cheers, Lenna From bartek at rezolwenta.eu.org Mon Aug 13 09:12:35 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 13 Aug 2012 15:12:35 +0200 Subject: [Biopython-dev] Bio.Motif AlignAce parser In-Reply-To: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Sounds great to me. Bartek On Sat, Aug 11, 2012 at 6:25 AM, Michiel de Hoon wrote: > Hi guys, > > Looking some more at the parsers in Bio.Motif. > > In the Record class in Bio/Motif/Parsers/AlignAce.py, we have an attribute self.current_motif that points to the motif currently being parsed by the parser (or, after the parser finishes, the last motif that was parsed). As far as I can tell this, using a temporary variable current_motif within the read() function would be sufficient; we don't need to store it in the record. > > I would also suggest for the read() function to strip() all lines. Currently the end-of-line markers are kept. For example the version and the command line are stored as "AlignACE 4.0 05/13/04\n" and "./AlignACE -i test.fa \n" respectively. > > The version of the AlignACE program is stored in record.ver. The MEME and Mast parsers in Bio.Motif instead use record.version. For consistency I would suggest to use record.version also in the AlignACE parser. > > The command line is stored in record.cmd_line. The MEME parser uses record.command instead. I think both are fine, but I would also prefer this to be consistent. > > Then there are two attributes param_dict and seq_dict. The former is a dictionary that stores the parameters used in the run. The latter is not a dictionary but a list of sequence-related information. Since usually we don't put the type of the object in the attribute names, I would suggest to call these simply parameters and sequences. For comparison, the Mast parser uses record.sequences for an analogous attribute; MEME uses record.sequence_names. 
For consistency I would suggest to use record.sequences for all three. > > This would create some backward-incompatible changes that may confuse users. Now currently the parsers are located in Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast. I would prefer Bio.Motif.AlignAce, Bio.Motif.MEME, Bio.Motif.Mast. Currently to parse the AlignAce output one would do >>>> from Bio.Motif.Parsers import AlignAce >>>> record = AlignAce.read(handle) >>>> record > > If we move the parsers one level up, this would be >>>> from Bio.Motif import AlignAce >>>> record = AlignAce.read(handle) >>>> record > > which looks a bit more straightforward to me. In addition, this allows us to put a deprecation warning on the Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast modules as a whole, and we won't have to put deprecation warnings on each change separately. > > Any comments, objections? > > Best, > -Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Bartek Wilczynski From arnaud.poret at gmail.com Mon Aug 13 10:07:39 2012 From: arnaud.poret at gmail.com (Arnaud Poret) Date: Mon, 13 Aug 2012 16:07:39 +0200 Subject: [Biopython-dev] obo parser Message-ID: Hi everyone, I'm a newcomer and I'm writing an obo parser for importing ontologies into python. I'm not sure, but has already BioPython an obo parser? If yes, I'm reinventing the wheel... If no, I'll be glad to contribute. From tiagoantao at gmail.com Mon Aug 13 23:23:01 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 13 Aug 2012 20:23:01 -0700 Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Windows XP - Python 2.5 Message-ID: I am currently away from office. I will respond back on the 20th of August. 
Regards,
Tiago
--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Tue Aug 14 07:06:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Aug 2012 12:06:32 +0100
Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior
Message-ID:

On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote:
> On Thu, Aug 2, 2012 at 8:42 AM, Leighton Pritchard wrote:
>> Peter wrote:
>>>
>>> To match the current sigil argument names BOX and ARROW, I have
>>> provisionally called it BIGARROW. Any better ideas?
>>>
>>
>> BIGARROW sounds fine to me. I like literal names.
>>
>
> Great. Checked into the master, and I updated the Tutorial and
> the Proux et al 2002 Figure 6 reproduction example to use this:
>
> Before (cross-links with strand specific ARROW sigil):
> http://biopython.org/DIST/docs/tutorial/images/three_track_cl2.png
>
> After (cross-links with strand straddling BIGARROW sigil):
> http://biopython.org/DIST/docs/tutorial/images/three_track_cl2a.png
>
> Original (I don't know what was used to draw this):
> http://dx.doi.org/10.1128/JB.184.21.6026-6036.2002
>
> Regards,
>
> Peter

Further to that work, I updated some older code for a JAGGY sigil, and also an OCTO sigil (names open to suggestions). These are on my gd-sigils branch, which has documentation in the tutorial, including this image of the expanded sigil set:

https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png

This is a slight simplification of the old JAGGY code in that it does not (yet) allow control of the teeth length (e.g. to have just teeth on one end). I am thinking this could be exposed like the existing arrow specific options.

I originally created the JAGGY sigil for marking a break point in a contig/scaffold. For instance, you might want to mark a run of NNNNN bases in a scaffold with a jaggy sigil (straddling both strands) as a clear visual marker to explain why there were no genes.
Other sigil ideas I pondered include an OVAL, which should be quite easy for the linear diagrams, but rather more work to implement for circular diagrams due to the distorted curves.

Peter

From p.j.a.cock at googlemail.com Tue Aug 14 15:49:23 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Aug 2012 20:49:23 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <87lim4h07o.fsf@fastmail.fm>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm>
Message-ID:

On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote:
> Michiel;
>> Hi Eric, Peter,
>>
>> > How about Bio.Search, for now?
>>
>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
>> users something about what the module is for. Bio.Search could be
>> anything (search PubMed? search the Entrez databases? search Google?
>> anyway Bio.Search does not suggest that this module is about pairwise
>> alignments). But Peter previously mentioned that he doesn't like
>> Bio.Pairwise; can we convince you?
>
> I agree with Peter on this one. The module is primarily about searching
> a sequence database with an input via multiple methods, not about
> pairwise alignment of two sequences, which is what Bio.Align.Pairwise
> suggests to me.
>
> Brad

One potential problem with Bio.Search (on top of the concerns raised here about vagueness) that Bow and I were just talking about during our weekly GSoC video call is the existence of Bio/Search.py, which is obsolete and long overdue for removal. I have just deprecated it (something I forgot to do before the last release):

https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8

We'd earlier talked about using Bio.Search as the namespace.
I was worried about the potential existence on a user's machine of both Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py (aka SearchIO, the new module) and which would take precedence when doing: from Bio import Search Given how Python module installations work, that seems highly likely to occur. The good news is that the package would take priority - see http://www.python.org/doc/essays/packages.html >>>> What If I Have a Module and a Package With The Same Name? >>>> >>>> You may have a directory (on sys.path) which has both a module >>>> spam.py and a subdirectory spam that contains an __init__.py >>>> (without the __init__.py, a directory is not recognized as a package). >>>> In this case, the subdirectory has precedence, and importing spam >>>> will ignore the spam.py file, loading the package spam instead. If >>>> you want the module spam.py to have precedence, it must be >>>> placed in a directory that comes earlier in sys.path. So there is no technical reason to avoid Bio.Search as an option for the Bio.SearchIO namespace. We could then have Bio.Search.Applications for command line wrappers, consistent with Bio.Phylo.Applications, Bio.Motif.Applications and Bio.Align.Applications. Of course, Bio.Search is still perhaps too broad a name... but on balance perhaps it is still better than Bio.SearchIO? Regards, Peter From tiagoantao at gmail.com Tue Aug 14 16:39:12 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 14 Aug 2012 21:39:12 +0100 Subject: [Biopython-dev] jython/testing Message-ID: Hi, I have been trying to use biopython with jython 2.7 alpha 2. Here follows a report. There are still a few problems (with SeqIO only): test_SeqIO ... ERROR test_SeqIO_QualityIO ... FAIL test_SeqIO_index ... 
FAIL The errors are something like (all the same kind of stuff really): SeqIO ====================================================================== ERROR: test_SeqIO ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 341, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/home/tr353/local/jython/Lib/unittest/loader.py", line 91, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/tr353/local/jython/Lib/unittest/loader.py", line 91, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 627, in check_simple_write_read(records) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 352, in check_simple_write_read records2 = list(SeqIO.parse(handle=handle, format=format)) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 352, in check_simple_write_read records2 = list(SeqIO.parse(handle=handle, format=format)) File "/home/tr353/tmp/biopython/Bio/SeqIO/__init__.py", line 537, in parse for r in i: File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 828, in SffIterator header_length, index_offset, index_length, number_of_reads, \ File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 285, in _sff_file_header magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \ error: unpack str size does not match format SeqIO_QualityIO ====================================================================== ERROR: test_E3MFGYR02 (test_SeqIO_QualityIO.TestWriteRead) Write and read back E3MFGYR02_random_10_reads.sff ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 551, in test_E3MFGYR02 self.check(os.path.join("Roche", "E3MFGYR02_random_10_reads.sff"), "sff", File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 477, in check 
write_read(filename, format, f) File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 52, in write_read records2 = list(SeqIO.parse(handle,out_format)) File "/home/tr353/tmp/biopython/Bio/SeqIO/__init__.py", line 537, in parse for r in i: File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 828, in SffIterator header_length, index_offset, index_length, number_of_reads, \ File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 285, in _sff_file_header magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \ error: unpack str size does not match format SeqIO.index ====================================================================== ERROR: test_sff_Roche_greek_sff_get_raw (test_SeqIO_index.IndexDictTests) Index sff file Roche/greek.sff get_raw ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tr353/tmp/biopython/Tests/test_SeqIO_index.py", line 430, in f = lambda x : x.get_raw_check(fn, fmt, alpha, c) File "/home/tr353/tmp/biopython/Tests/test_SeqIO_index.py", line 301, in get_raw_check rec2 = SeqIO.SffIO._sff_read_seq_record(handle, File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 561, in _sff_read_seq_record read_header_length, name_length, seq_len, clip_qual_left, \ error: unpack str size does not match format I suppose this is because of issues with the alpha version of jython 2.7. Tiago PS - I do not have all external dependencies installed on my machine, so a few modules are untested. -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From p.j.a.cock at googlemail.com Wed Aug 15 07:18:50 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 15 Aug 2012 12:18:50 +0100 Subject: [Biopython-dev] jython/testing In-Reply-To: References: Message-ID: On Tue, Aug 14, 2012 at 9:39 PM, Tiago Ant?o wrote: > Hi, > > I have been trying to use biopython with jython 2.7 alpha 2. Here > follows a report. 
> > > There are still a few problems (with SeqIO only): > test_SeqIO ... ERROR > test_SeqIO_QualityIO ... FAIL > test_SeqIO_index ... FAIL > > The errors are something like (all the same kind of stuff really): > > ... I see that on my machine too. From looking at the tracebacks and the associated code, the failures all involve BytesIO (or StringIO depending on the Python version). Note that BytesIO is new in Python 2.6, and thus also new in Jython 2.7 compared to Jython 2.5. This is enough to demonstrate a bug in Jython 2.7a2, which explains some if not all of our unit test failures: Expected behaviour: $ python Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from io import BytesIO >>> raw = open("Roche/E3MFGYR02_random_10_reads.sff", "rb").read() >>> raw == BytesIO(raw).read() True >>> len(raw) 17592 >>> quit() Broken behaviour: $ ~/jython2.7a2/jython Jython 2.7a2 (default:9c148a201233, May 24 2012, 15:49:00) [Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_33 Type "help", "copyright", "credits" or "license" for more information. >>> from io import BytesIO >>> raw = open("Roche/E3MFGYR02_random_10_reads.sff", "rb").read() >>> raw == BytesIO(raw).read() False >>> len(raw) 17592 >>> len(BytesIO(raw).read()) 51577 >>> BytesIO(raw).read()[:100] "bytearray(b'.sff\\x00\\x00\\x00\\x01\\x00\\x00\\x00\\x00\\x00\\x00A\\xb8\\x00\\x00\\x02\\xfc\\x00\\x00\\x00\\n\\x01\\xb8\\" >>> raw[:100] '.sff\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00A\xb8\x00\x00\x02\xfc\x00\x00\x00\n\x01\xb8\x00\x04\x01\x90\x01TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT' >>> quit() I will report this. 
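A self-contained version of the same round-trip check, using an in-memory byte string instead of the SFF file, might look like the sketch below. The sample bytes are made up, and this is not necessarily the test case attached to the bug report:

```python
from io import BytesIO

# Made-up binary payload containing NUL bytes and every possible byte value
raw = b".sff\x00\x00\x00\x01" + bytes(bytearray(range(256)))

# Wrap the bytes in a BytesIO handle and read them straight back
round_tripped = BytesIO(raw).read()

# On a correct implementation both comparisons are True:
# the length is unchanged and the content is byte-for-byte identical
print(len(raw) == len(round_tripped))
print(raw == round_tripped)
```
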
Peter

From p.j.a.cock at googlemail.com Wed Aug 15 07:26:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 15 Aug 2012 12:26:19 +0100
Subject: [Biopython-dev] jython/testing
In-Reply-To:
References:
Message-ID:

On Wed, Aug 15, 2012 at 12:18 PM, Peter Cock wrote:
> On Tue, Aug 14, 2012 at 9:39 PM, Tiago Antão wrote:
>> Hi,
>>
>> I have been trying to use biopython with jython 2.7 alpha 2. Here
>> follows a report.
>>
>>
>> There are still a few problems (with SeqIO only):
>> test_SeqIO ... ERROR
>> test_SeqIO_QualityIO ... FAIL
>> test_SeqIO_index ... FAIL
>>
>> The errors are something like (all the same kind of stuff really):
>>
>> ...
>
> I see that on my machine too. From looking at the tracebacks and
> the associated code, the failures all involve BytesIO (or StringIO
> depending on the Python version). Note that BytesIO is new in
> Python 2.6, and thus also new in Jython 2.7 compared to Jython 2.5.
>
> This is enough to demonstrate a bug in Jython 2.7a2, which explains
> some if not all of our unit test failures:
>
> ...
>
> I will report this.

Filed as http://bugs.jython.org/issue1959 with a shorter test case.

Peter

From arklenna at gmail.com Thu Aug 16 21:58:46 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 16 Aug 2012 21:58:46 -0400
Subject: [Biopython-dev] GSoC Python variant (penultimate) update
Message-ID:

Post: http://arklenna.tumblr.com/post/29592108099/

I have been considering how to handle gene strandedness. As long as I'm correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features.
GenBank: join(complement(25..30), 36..40)
Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)

(please click through to post for monospaced font)

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
      <----------------                ------------->
       5  4  3  2  1  0                 6  7  8  9 10

I have to admit that it wasn't until I read a BioStar [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/) earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I've made a mistake in the above code.

`c2g` yields the correct genome position, but not the strand. I still need to integrate strand information into my `GenomePosition` object and/or partially merge it with `ExactLocation`.

This weekend I intend to expand documentation and write a brief cookbook entry.

Cheers,
Lenna

From arnaud.poret at gmail.com Fri Aug 17 03:38:28 2012
From: arnaud.poret at gmail.com (Arnaud Poret)
Date: Fri, 17 Aug 2012 09:38:28 +0200
Subject: [Biopython-dev] obo parser
Message-ID:

Hi everyone,

I'm a newcomer and I'm writing an obo parser for importing ontologies into python. I'm not sure, but does BioPython already have an obo parser?

If yes, I'm reinventing the wheel...

If no, I'll be glad to contribute.

Arnaud.

From p.j.a.cock at googlemail.com Fri Aug 17 04:15:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 17 Aug 2012 09:15:10 +0100
Subject: [Biopython-dev] obo parser
In-Reply-To:
References:
Message-ID:

On Mon, Aug 13, 2012 at 3:07 PM, Arnaud Poret wrote:
> Hi everyone,
>
> I'm a newcomer and I'm writing an obo parser for importing ontologies
> into python. I'm not sure, but does BioPython already have an obo parser?
>
> If yes, I'm reinventing the wheel...
>
> If no, I'll be glad to contribute.

There does seem to be interest; questions about ontologies, GO and OBO crop up every so often.
There were some people actually working on this too, but it has gone quiet. e.g. http://lists.open-bio.org/pipermail/biopython-dev/2012-February/009384.html http://lists.open-bio.org/pipermail/biopython-dev/2011-July/009031.html Chris Lasher's repository has vanished, but Eric's older work is still online (CC'd): https://github.com/kellrott/biopython/tree/gosupport Eric & Chris - where do things stand? Regards, Peter From p.j.a.cock at googlemail.com Fri Aug 17 04:21:01 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Aug 2012 09:21:01 +0100 Subject: [Biopython-dev] [GSoC] GSoC Python variant (penultimate) update In-Reply-To: References: Message-ID: On Fri, Aug 17, 2012 at 2:58 AM, Lenna Peterson wrote: > > I have to admit that it wasn't until I read a BioStar > [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/) > earlier this week that I fully understood the relationship between > plus/minus forward/reverse sense/antisense coding/template strands. So > please let me know as soon as possible if I've made a mistake in the > above code. Given this is nice and fresh in your mind, can you suggest any clarifications to the Biopython Tutorial section talking about this issue? The section on transcription & translation starting: "Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide: ..." Hmm. That should probably say "I want to try to clarify...". Peter From p.j.a.cock at googlemail.com Fri Aug 17 12:42:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Aug 2012 17:42:57 +0100 Subject: [Biopython-dev] BioSQL tests Message-ID: Dear all, I realised this week that I didn't have a working BioSQL test setup under either MySQL or PostgreSQL, and the buildbot machines are not testing these either. 
Therefore I have refactored the BioSQL unit tests as follows:

First I turned my print-and-compare test_BioSQL_SeqIO.py script into proper unittest based tests, so that all the BioSQL tests could be combined in one file, test_BioSQL.py.

This allowed a further reorganisation to allow any one machine to test all the supported back ends one after the other - previously the setup only tested one backend (defaulting to SQLite3). We now have three test scripts named after the backend library used to connect to the database:

test_BioSQL_MySQLdb.py
test_BioSQL_psycopg2.py
test_BioSQL_sqlite3.py

Subsequently I modified our TravisCI configuration to install the required dependencies to run all these tests. The default usernames and passwords for MySQLdb and postgresql are set to match those under TravisCI. Local users would probably have to adjust these values (in the same way they used to prior to the refactoring).

Note that psycopg2 only works on C Python 2 & 3 for now (there is a PyPy alternative I have not looked into). MySQLdb only works on C Python 2 (there is a problem installing it under Python 3.2).

This did show I'd broken using BioSQL under MySQLdb, at least under this particular version, fixed now:
https://github.com/biopython/biopython/commit/4a67d851d1eda0a138b604c8aeffc151d331a29b

So the good news is that now TravisCI will run the BioSQL tests on all three database backends, on several versions of Python (but just on Linux).
http://travis-ci.org/biopython/biopython/

What I have not addressed is if/how we should deal with test database settings under buildbot - perhaps by environment variable overrides?

If anyone would like to look into using MySQLdb and/or psycopg2 under PyPy and Jython, that would be useful too.
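The environment variable override idea raised above could be approached roughly as follows. This is only a sketch under stated assumptions: the BIOSQL_TEST_ prefix, the load_db_settings function, and the default values are hypothetical names chosen for illustration, not what the Biopython test scripts actually define.

```python
import os

# Hypothetical defaults for illustration only; the real test scripts
# carry their own host, user, password and database names.
DEFAULTS = {
    "DBHOST": "localhost",
    "DBUSER": "root",
    "DBPASSWD": "",
    "TESTDB": "biosql_test",
}


def load_db_settings(environ=None, prefix="BIOSQL_TEST_"):
    """Return test database settings, letting environment variables win.

    The prefix is an assumed naming convention, not an existing one.
    """
    if environ is None:
        environ = os.environ
    settings = dict(DEFAULTS)
    for key in DEFAULTS:
        value = environ.get(prefix + key)
        if value is not None:
            # A buildbot slave could export e.g. BIOSQL_TEST_DBUSER
            # without anyone editing the test script itself.
            settings[key] = value
    return settings
```

With this shape, a machine that sets nothing keeps the built-in defaults, while a build slave only has to export the values that differ on that host.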
Regards, Peter From arklenna at gmail.com Mon Aug 20 00:22:36 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 20 Aug 2012 00:22:36 -0400 Subject: [Biopython-dev] GSoC python variant final update Message-ID: Post: http://arklenna.tumblr.com/post/29808300789/ The coordinate mapper, with updated documentation, is now located on this branch: https://github.com/lennax/biopython/tree/f_loc4 It awaits the merging of Peter's f_loc4 branch. I've written an entry on coordinate mapping for the Cookbook: http://biopython.org/wiki/Coordinate_mapping Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit It's been a great experience working with this project this summer. Thank you to everyone involved. Cheers, Lenna From mjldehoon at yahoo.com Mon Aug 20 08:38:37 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 20 Aug 2012 05:38:37 -0700 (PDT) Subject: [Biopython-dev] Bio.Cluster in the main Biopython documentation Message-ID: <1345466317.39160.YahooMailClassic@web164003.mail.gq1.yahoo.com> Dear all, Previously the documentation for Bio.Cluster was only available as a separate PDF on the Biopython website. I have now integrated this documentation into the Biopython Tutorial. The new tutorial is already uploaded to the repository, and will be visible at http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html once the nightly build is done. Since the documentation for Bio.Cluster contains many references to the literature, I started using the LaTeX \cite command, which are understood and formatted properly by Hevea. While at it, I also converted the references I could find in other parts of the Tutorial to \cite references. This creates a list of references at the end of the Tutorial. Please let us know if you don't like this approach. 
The documentation for Bio.Cluster is fairly long, and while modifying it for inclusion into the Tutorial some mistakes may have crept in. Please let me know if you find any such mistakes (or feel free to fix them yourself, if it is clear what the text should be). For now we can leave the PDF with the separate description of Bio.Cluster on the website as is for users of Biopython 1.60, but once the next version of Biopython is out I would like to replace it with a PDF referring to the main Tutorial. Thanks, -Michiel. From chapmanb at 50mail.com Mon Aug 20 08:45:49 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Aug 2012 08:45:49 -0400 Subject: [Biopython-dev] [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: <87harxzq82.fsf@fastmail.fm> Lenna; Thanks for the documentation and getting that all code moved into a branch. This looks great and looking forward to having it merged when Peter's work goes in. Thanks also for all the great work this summer and good luck on the first day of PhD school, Brad > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping > > Additionally, at Peter's suggestion, I've written a clarification of > strand as it relates to transcription and translation. It's available > here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit > > It's been a great experience working with this project this summer. > Thank you to everyone involved. 
>
> Cheers,
>
> Lenna
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From redmine at redmine.open-bio.org Tue Aug 21 06:27:14 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 10:27:14 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] (New) PDBParser fails to parse PDBs produced by PatchDock
Message-ID:

Issue #3379 has been reported by David Cain.

----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379

Author: David Cain
Status: New
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.57
URL:

I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs.

h3. Background

Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file.

h3. Why PDBParser fails

Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse.

PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand.

h3. How to fix the problem

Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files.
However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files?

h3. Potential change to @PDBParser._parse_coordinates@?

If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure.

If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 06:36:07 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 10:36:07 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by Peter Cock.

If as I understood you, PatchDock is producing invalid PDB files, have you raised the issue with them too?

I accept that out of practicality, a little lenience in our parsers can be helpful, and may be appropriate in this case.

Do you have any sample data files you could share - for example a valid PDB file before processing, and the problematic PDB file after processing with PatchDock?
--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 07:08:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 11:08:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by João Rodrigues.

Disclaimer: I am a HADDOCK team member and therefore in direct competition with PATCHDOCK.

I totally disagree with this. This is not compliant with the PDB format at all: "Each file should terminate with a line containing only the word END". Having data beyond END is just bad practice in my opinion. There are two statements to close a chain/model - ENDMDL and TER - and these should be used.

Sorry to be a pain, but if we are fixing this it's just encouraging a bad practice.. standards are there to be respected.

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 07:21:57 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 11:21:57 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by Peter Cock.

Given Joao's comments, lenience does not sound appropriate in this case.

If the parser's current behaviour is to silently ignore data after an END line, that seems less than ideal. How about we add a clear error/warning to the parser if there is content in the file after an END line? i.e. Treat it as an exception in strict mode, treat it as a warning in permissive mode (and continue to ignore anything after the END line)?

A sample file would be helpful to verify this, and could even be used for a unit test (with your permission).

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 07:26:48 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 11:26:48 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by David Cain.
I completely agree with João, actually- disrespecting the file spec is a bad idea. I just figured I'd bring this to discussion.

I very much think a warning of some sort should be raised, though. Half the structure silently failing to parse is a big problem. I think your solution is perfect, and I'd be very happy to write the unit test.

I'll upload a sample file in just a bit.

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 08:05:37 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 12:05:37 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by David Cain.

File complex.1.pdb added

I ran PatchDock's antigen-antibody complex mode on an antigen and antibody file (2fgw and 5ebx) that individually parse without warnings. (Note that I chose these files at random; their docking is useful only as an example).

I've attached the complex file produced by @PatchDock/transOutput.pl@ (only the top-scoring conformation considered). As you can see, the @CONECT@ and @END@ records of the antibody will stop the rest of the file from being parsed.
I'd be happy to take a stab at writing the error/warning message for premature @END@/@CONECT@ records in addition to the unit test that checks for this behavior.

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Tue Aug 21 08:35:07 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Tue, 21 Aug 2012 12:35:07 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by João Rodrigues.

Agreed with Peter that it should raise an exception/warning. This is really pure concatenation of the two PDBs.

If you could have a go at it, I could test it too. Thanks David.
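The behaviour settled on in this thread for Bug #3379 (an exception in strict mode, a warning in permissive mode, and in both cases nothing after END added to the structure) might look roughly like the sketch below. It is not the Bio.PDB implementation: check_trailing_records, PDBFormatError and PDBFormatWarning are hypothetical names standing in for whatever the parser would really use, and only the END record is handled here, not CONECT.

```python
import warnings


class PDBFormatError(Exception):
    """Hypothetical stand-in for the parser's exception type."""


class PDBFormatWarning(UserWarning):
    """Hypothetical stand-in for PDBConstructionWarning."""


def check_trailing_records(lines, permissive=True):
    """Return the lines up to and including END.

    Non-blank content after END raises PDBFormatError when
    permissive is False, and otherwise emits a single
    PDBFormatWarning; the trailing content is never returned.
    """
    kept = []
    seen_end = False
    for number, line in enumerate(lines, start=1):
        if seen_end:
            if not line.strip():
                continue  # blank padding after END is harmless
            message = "Line %d: content found after END record" % number
            if not permissive:
                raise PDBFormatError(message)
            warnings.warn(message, PDBFormatWarning)
            break  # warn once, keep ignoring the rest
        kept.append(line)
        # ENDMDL does not match here; only a bare END record does.
        if line.strip() == "END":
            seen_end = True
    return kept
```

Warning once and then stopping keeps the existing "ignore the trailer" behaviour intact, so files that parse today would still parse; the only change a user sees is that a silently dropped ligand now produces a message.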
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Tue Aug 21 12:01:21 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 21 Aug 2012 18:01:21 +0200 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm> Message-ID: On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock wrote: > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote: >> Michiel; >>> Hi Eric, Peter, >>> >>> > How about Bio.Search, for now? >>> >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells >>> users something about what the module is for. Bio.Search could be >>> anything (search PubMed? search the Entrez databases? search Google? >>> anyway Bio.Search does not suggest that this module is about pairwise >>> alignments). But Peter previously mentioned that he doesn't like >>> Bio.Pairwise; can we convince you? >> >> I agree with Peter on this one. The module is primarily about searching >> a sequence database with an input via multiple methods, not about >> pairwise alignment of two sequences with is what Bio.Align.Pairwise >> suggests to me. >> >> Brad > > On potential problem with Bio.Search (on top of concerns raised > here about vagueness) Bow and I were just talking about during > our weekly GSoC video call was the existence of Bio/Search.py > which is obsolete and long overdue for removal. I have just > deprecated it (something I forgot to do before the last release): > https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8 > > We'd earlier talked about using Bio.Search as the namespace. 
I was > worried about the potential existence on a user's machine of both > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py > (aka SearchIO, the new module) and which would take precedence > when doing: from Bio import Search > > Given how Python module installations work, that seems highly > likely to occur. The good news is that the package would take > priority - see http://www.python.org/doc/essays/packages.html > >>>>> What If I Have a Module and a Package With The Same Name? >>>>> >>>>> You may have a directory (on sys.path) which has both a module >>>>> spam.py and a subdirectory spam that contains an __init__.py >>>>> (without the __init__.py, a directory is not recognized as a package). >>>>> In this case, the subdirectory has precedence, and importing spam >>>>> will ignore the spam.py file, loading the package spam instead. If >>>>> you want the module spam.py to have precedence, it must be >>>>> placed in a directory that comes earlier in sys.path. > > So there is no technical reason to avoid Bio.Search as an > option for the Bio.SearchIO namespace. We could then > have Bio.Search.Applications for command line wrappers, > consistent with Bio.Phylo.Applications, Bio.Motif.Applications > and Bio.Align.Applications. > > Of course, Bio.Search is still perhaps too broad a name... but > on balance perhaps it is still better than Bio.SearchIO? > > Regards, > > Peter Hi everyone, If I may add my two cents, for now I am in favor of putting the module under Bio.Search. It is not the best name out there (it does sound a bit vague), but it's the one that seem to be the most intuitive (until a better alternative comes out). There were some other alternatives that I and Peter have discussed, but they seem less appealing for us. You're free to add your thoughts on these of course :) : - Bio.SeqSearch. This sounds ok, but when you consider we have Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes quite confusing quickly. 
- Bio.PSearch ('p' for pairwise). This one seemed the least intuitive among the three options, so I'm not so big on this.

For now, I'm still writing everything (code, docstrings, tutorial) using SearchIO. I suppose it's better if we could agree on a more suitable name, though.

On another note, I'm also in favor of using the Bio.Phylo module skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence search-related application wrappers under Applications (I actually prefer 'app' for better PEP8 compliance, but that's another discussion) and perhaps even refactor our remote search calls (e.g. the 'qblast' module) under Bio.Search as well.

cheers,
Bow

From w.arindrarto at gmail.com Tue Aug 21 12:09:07 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 21 Aug 2012 18:09:07 +0200
Subject: [Biopython-dev] GSoC Project Update -- 10
In-Reply-To:
References:
Message-ID:

Hi everyone,

I've just posted my last entry for my Google Summer of Code project this year: http://bow.web.id/blog/2012/08/summers-over/

I want to say thank you to the Biopython community, especially Peter for mentoring me this summer :), to OBF for accepting my proposal, and to anyone who has helped and given me valuable input throughout the project :).

It's been a priceless learning experience, and I only hope that my code will be useful in return.

There are still some things to do before the code is merge-ready and even more when the code is included in an official release, so I'll still be around.

cheers,
Bow

From mictadlo at gmail.com Tue Aug 21 20:55:30 2012
From: mictadlo at gmail.com (Mic)
Date: Wed, 22 Aug 2012 10:55:30 +1000
Subject: [Biopython-dev] [BioRuby] Final GSoC report
In-Reply-To:
References:
Message-ID:

Hi,

Python is able to connect to D with help of http://pyd.dsource.org/ .
Maybe it would be something for Biopython

Cheers,
Mic

On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni wrote:
> http://blog.mpthecoder.com/post/29910330225/final-gsoc-report
>
> *Summary*
>
> Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end
> of the summer. At least in GSoC terms. Should I say end of the project? I
> don't think so. The tools can still be improved, and the Ruby bindings
> should follow.
>
> The major changes since the last release include the following:
>
> - filtering functionality has been moved to a separate utility:
> gff3-filter, along with a new language for specifying filtering
> expressions,
> - conversion to table format of selected fields has been moved to a
> separate utility: gff3-select. However, the --select option is still
> part of gff3-filter,
> - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files
> for CDS and mRNA records and features,
> - man pages for utilities.
>
> The original idea was to create a GFF3/GTF parser in D and Ruby bindings.
> The Ruby bindings part didn't work out because there is still no support
> for D shared libraries in Linux, but instead there are now a few useful
> command-line tools for processing GFF3 which can be used without
> programming knowledge.
>
> To me, the summer was fun, challenging, and a great experience. I even got
> to meet my mentor in person, and other community members too, and to make
> my first steps in bioinformatics. I even gave a small presentation at the
> EU-codefest. What a summer it was!
>
> Thanks to everybody who made it possible: Google, Open Bioinformatics
> Foundation and my mentor Pjotr Prins.
> > -- > Marjan > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From p.j.a.cock at googlemail.com Wed Aug 22 04:42:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 09:42:03 +0100 Subject: [Biopython-dev] [GSoC] GSoC Project Update -- 10 In-Reply-To: References: Message-ID: On Tue, Aug 21, 2012 at 5:09 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I've just posted my last entry for my Google Summer of Code project > this year: http://bow.web.id/blog/2012/08/summers-over/ > > I want to say thank you to the Biopython community, especially Peter > for mentoring me this summer :), to OBF for accepting my proposal, and > to anyone who has helped and given me valuable inputs for me > throughout the project :). > > It's been a priceless learning experience, and I only hope that my > code will be useful in return. > > There are still some things to do before the code is merge-ready and > even more when the code is included in an official release, so I'll > still be around. > > cheers, > Bow Thank you Bow, It has been a pleasure to mentor you, and I'm excited about getting this (and Lenna's and other branches) into Biopython. Now, back to the module naming discussion... ;) http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009868.html http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009888.html Peter From p.j.a.cock at googlemail.com Wed Aug 22 07:07:11 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 12:07:11 +0100 Subject: [Biopython-dev] Beta code in the official releases? 
Message-ID:

Hi all,

One of the ideas I discussed with Bow during this GSoC project was introducing a new warning, something like Bio.BiopythonBetaCode (the exact name isn't important), to be used to label new experimental modules for which we *expect* there to be changes in the next release.

The idea is to combine the simplicity of distribution and installation of the 'monolithic' Biopython library with some of the flexibility offered by a more modular approach. This would be particularly helpful for those on Windows, where installing a Biopython branch from git is quite a daunting task.

The idea is that in one of the next releases you'd be able to try Bio.SearchIO (or Bio.Struct or GFF or Variants or ...) and see something like this:

>>> from Bio import SearchIO
Bio/SearchIO/__init__.py:16: BiopythonBetaCode: Bio.SearchIO is in beta, and likely to change
  warnings.warn("Bio.SearchIO is in beta, and likely to change", BiopythonBetaCode)

By using a specific warning class, any keen beta tester can silence all the BiopythonBetaCode warnings if they wish to.

Is anyone familiar enough with Linux packaging policies to have any thoughts on how they would treat this? Provided we only use this for self-contained modules, they could potentially split the beta-modules into a sub-package (in the same way that Biopython and its BioSQL support are split in Debian).

I envision using this as a way to encourage wider 'beta testing' of self-contained modules which are close to a stable release. Does anyone think this is a good idea? Are there any downsides I'm overlooking?
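To make the "silence one warning class" point concrete, here is a minimal sketch using only the standard warnings module. The class BiopythonBetaCode is defined locally as an assumption (the proposal explicitly leaves the final name open), and import_beta_module is a hypothetical stand-in for the warning a beta module would emit at import time.

```python
import warnings


class BiopythonBetaCode(Warning):
    """Hypothetical warning category for modules that are still in beta."""


def import_beta_module(name):
    """Stand-in for the warning a beta module would raise when imported."""
    warnings.warn("%s is in beta, and likely to change" % name, BiopythonBetaCode)
    return name


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                     # show everything...
    warnings.simplefilter("ignore", BiopythonBetaCode)  # ...except beta notices
    import_beta_module("Bio.SearchIO")
    warnings.warn("unrelated notice", DeprecationWarning)

# Only the unrelated warning is recorded; the beta notice was filtered out.
remaining = [w.category.__name__ for w in caught]
```

Because the filter is keyed on the category, a beta tester can keep deprecation and other warnings visible while hiding only the beta notices.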
Thanks,

Peter

From p.j.a.cock at googlemail.com Wed Aug 22 07:10:56 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Aug 2012 12:10:56 +0100
Subject: [Biopython-dev] [BioRuby] [GSoC] Final GSoC report
In-Reply-To: <20120822104352.GA11847@thebird.nl>
References: <20120822104352.GA11847@thebird.nl>
Message-ID:

On Wed, Aug 22, 2012 at 11:43 AM, Pjotr Prins wrote:
> Yes, linking to D from an interpreted language is not hard, basically
> it is the same calling convention as that of C. So a D shared library
> looks the same as a C shared library to the calling code - all
> existing foreign function interfaces (FFI) work. That is the good
> news.

How do things stand from a cross-platform perspective? i.e. When might this be doable on Linux, Mac OS X, and Windows? (and other Unix-like platforms of potential interest)

> The bad news, as Artem points out, is that there is a problem in the
> D garbage collector. Items get collected, which should not. This will
> be fixed sooner or later. The commitment is there, and it is moving
> up the priority list.

Is there a D issue/bug tracker for this?

Thanks,

Peter

From chapmanb at 50mail.com Wed Aug 22 20:42:09 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 22 Aug 2012 20:42:09 -0400
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To:
References:
Message-ID: <877gsq8mn2.fsf@fastmail.fm>

Peter;
+1. I'm for making the process of getting new code into Biopython a bit quicker and this seems like a nice step in that direction. For code that has been well designed, tested and documented, this will help speed the transition into releases and get more eyes on it quicker, while allowing some potential breaking changes as beta functionality gets finalized.
Thanks for the good suggestion, Brad > Hi all, > > One of the ideas I discussed with Bow during this GSoC > project was introducing a new warning, something like > Bio.BiopythonBetaCode (the exact name isn't important), > to be used to label new experimental modules for which > we *expect* there to be changes in the next release. > > The idea is to combine the simplicity of distribution and > installation of the 'monolithic' Biopython library with some > of the flexibility offered by a more modular approach. > This would be particularly helpful for those on Windows, > where installing a Biopython branch from git is quite a > daunting task. > > The idea is that in one of the next releases you'd be able > to try Bio.SearchIO (or Bio.Struct or GFF or Variants or ...) > and see something like this: > >>>> from Bio import SearchIO > Bio/SearchIO/__init__.py:16: BiopythonBetaCode: Bio.SearchIO is in > beta, and likely to change > warnings.warn("Bio.SearchIO is in beta, and likely to change", > BiopythonBetaCode) > > By using a specific warning class, any keen beta tester can > silence all the BiopythonBetaCode warnings if they wished to. > > Is anyone familiar enough with Linux packaging polices to > have any thoughts on how they would treat this? Provided > we only use this for self contained modules, they could > potentially split the beta-modules into a sub-package (in the > same way that Biopython and its BioSQL support are split > in Debian). > > I envision using this as a way to encourage wider 'beta testing' > of self contained modules which are close to a stable release. > Does anyone think this is a good idea? Are there any downsides > I'm overlooking? 
>
> Thanks,
>
> Peter

From redmine at redmine.open-bio.org Mon Aug 27 00:24:16 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 27 Aug 2012 04:24:16 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by David Cain.

Regarding "pure concatenation," I wasn't exaggerating when I said really ugly Perl scripts. =)

I created a "pull request on the Biopython GitHub repository":https://github.com/biopython/biopython/pull/60. Could you give me some feedback on my solution? If the devs agree on a certain behavior, I'll start writing some unit tests.

----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379

From Andrew.Sczesnak at med.nyu.edu Wed Aug 29 13:54:08 2012
From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew)
Date: Wed, 29 Aug 2012 17:54:08 +0000
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <877gsq8mn2.fsf@fastmail.fm> References: , <877gsq8mn2.fsf@fastmail.fm> Message-ID: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> +1 It's been over a year since I first submit my MAF code! ________________________________________ From: biopython-dev-bounces at lists.open-bio.org [biopython-dev-bounces at lists.open-bio.org] on behalf of Brad Chapman [chapmanb at 50mail.com] Sent: Wednesday, August 22, 2012 8:42 PM To: Peter Cock; Biopython-Dev Mailing List Subject: Re: [Biopython-dev] Beta code in the official releases? Peter; +1. I'm for making the process of getting new code into Biopython a bit quicker and this seems like a nice step in that direction. With code has been well designed tested and documented, this will help speed the transition into releases and get more eyes on it quicker, while allowing some potential breaking changes as beta functionality gets finalized. Thanks for the good suggestion, Brad > Hi all, > > One of the ideas I discussed with Bow during this GSoC > project was introducing a new warning, something like > Bio.BiopythonBetaCode (the exact name isn't important), > to be used to label new experimental modules for which > we *expect* there to be changes in the next release. > > The idea is to combine the simplicity of distribution and > installation of the 'monolithic' Biopython library with some > of the flexibility offered by a more modular approach. > This would be particularly helpful for those on Windows, > where installing a Biopython branch from git is quite a > daunting task. > > The idea is that in one of the next releases you'd be able > to try Bio.SearchIO (or Bio.Struct or GFF or Variants or ...) 
> and see something like this: > >>>> from Bio import SearchIO > Bio/SearchIO/__init__.py:16: BiopythonBetaCode: Bio.SearchIO is in > beta, and likely to change > warnings.warn("Bio.SearchIO is in beta, and likely to change", > BiopythonBetaCode) > > By using a specific warning class, any keen beta tester can > silence all the BiopythonBetaCode warnings if they wished to. > > Is anyone familiar enough with Linux packaging polices to > have any thoughts on how they would treat this? Provided > we only use this for self contained modules, they could > potentially split the beta-modules into a sub-package (in the > same way that Biopython and its BioSQL support are split > in Debian). > > I envision using this as a way to encourage wider 'beta testing' > of self contained modules which are close to a stable release. > Does anyone think this is a good idea? Are there any downsides > I'm overlooking? > > Thanks, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Thu Aug 30 04:16:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 Aug 2012 09:16:13 +0100 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On Tue, Aug 14, 2012 at 12:06 PM, Peter Cock wrote: > On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote: > > Further to that work, I updated some older code for a JAGGY > sigil, and also an OCTO sigil (names open to suggestions), > which are on my gd-sigils branch which has documentation > in the tutorial, including this image of the expanded sigil set: > https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png > > 
This is a slight simplification of the old JAGGY code in that it
> does not (yet) allow control of the teeth length (e.g. to have just
> teeth on one end). I am thinking this could be exposed like
> the existing arrow specific options.
>
> I originally created the JAGGY sigil for marking a break point
> in a contig/scaffold. For instance, you might want to mark a
> run of NNNNN bases in a scaffold with a jaggy sigil (straddling
> both strands) as a clear visual marker to explain why there
> were no genes.
>
> Other sigil ideas I pondered include an OVAL, which should
> be quite easy for the linear diagrams, but rather more work to
> implement for circular diagrams due to the distorted curves.
>
> Peter

Do people think (either of) these two sigils are worth adding to the main branch?

Potentially they can be generalised - the JAGGY sigil in particular would be much more flexible if the head & tail teeth presence (or tooth length?) could be controlled. e.g. to draw a sigil with a flat edge on the left, and a jagged edge on the right.

Peter

From Leighton.Pritchard at hutton.ac.uk Thu Aug 30 04:51:50 2012
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 30 Aug 2012 08:51:50 +0000
Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior
In-Reply-To:
References:
Message-ID:

On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote:

On Tue, Aug 14, 2012 at 12:06 PM, Peter Cock wrote:
On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote:

Further to that work, I updated some older code for a JAGGY sigil, and also an OCTO sigil (names open to suggestions), which are on my gd-sigils branch which has documentation in the tutorial, including this image of the expanded sigil set:
https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png

[...]

Do people think (either of) these two sigils are worth adding to the main branch?

Yes - I do.

L.
--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk
w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

From p.j.a.cock at googlemail.com Thu Aug 30 06:18:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 30 Aug 2012 11:18:57 +0100
Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior
In-Reply-To:
References:
Message-ID:

On Thu, Aug 30, 2012 at 9:51 AM, Leighton Pritchard wrote:
>
> On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote:
>> Do people think (either of) these two sigils are worth adding
>> to the main branch?
>
> Yes - I do.
>
> L.

Done. Branch rebased and applied to master.
Peter From p.j.a.cock at googlemail.com Thu Aug 30 07:46:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 Aug 2012 12:46:05 +0100 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On Thu, Aug 30, 2012 at 11:18 AM, Peter Cock wrote: > On Thu, Aug 30, 2012 at 9:51 AM, Leighton Pritchard > wrote: >> >> On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote: >>> Do people think (either of) these two sigils are worth adding >>> to the main branch? >> >> Yes - I do. >> >> L. > > Done. Branch rebased and applied to master. > > Peter And you can see the example in the Tutorial here, http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#sec:gd_sigils (These sigils all work on circular diagrams too, see the examples made by test_GenomeDiagram.py) Peter From zcharlop at mail.rockefeller.edu Wed Aug 1 00:37:27 2012 From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers) Date: Wed, 1 Aug 2012 00:37:27 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior Message-ID: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> Hello Biopython, I am writing about a small feature that I would like to see implemented (and could possibly help to implement it: I haven't contributed before and am not sure exactly how tough this will be). When using Genome Diagram to draw features you can specify which strand to put a feature on. If the strand is positive it will go above the track in the positive-facing direction and if negative it will go below the track in the negative facing direction. (seehttp://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc200) . That's a great behavior. However if you use strand="None", Genome Diagram will draw the features inline with the track and always in the positive direction. 
For myself, and probably others, keeping the direction of the features is immensely useful as you can often get a sense of operon structure in prokaryote genomes just by looking at the genes. Of course the forward and the minus strands can be drawn but condensing small sections of genes to a single track saves space when making images. So, would it be possible to change the default behavior of Genome Diagram to draw features inline (strand="None"), but to preserve their orientation? best, zach cp From p.j.a.cock at googlemail.com Wed Aug 1 09:27:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Aug 2012 10:27:14 +0100 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> Message-ID: On Wed, Aug 1, 2012 at 1:37 AM, Zachary Charlop-Powers wrote: > Hello Biopython, > > I am writing about a small feature that I would like to see implemented > (and could possibly help to implement it: I haven't contributed before and > am not sure exactly how tough this will be). When using Genome Diagram to > draw features you can specify which strand to put a feature on. If the > strand is positive it will go above the track in the positive-facing > direction and if negative it will go below the track in the negative facing > direction. (seehttp://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc200) . That's a > great behavior. Yep - all fine so far. > However if you use strand="None", Genome Diagram will draw > the features inline with the track and always in the positive direction. > For myself, and probably others, keeping the direction of the features is > immensely useful as you can often get a sense of operon structure in > prokaryote genomes just by looking at the genes. Of course the forward and > the minus strands can be drawn but condensing small sections of genes to a > single track saves space when making images. 
>
> So, would it be possible to change the default behavior of Genome Diagram
> to draw features inline (strand="None"), but to preserve their orientation?

I think I know what you mean - that kind of picture is quite common e.g. for viruses - but only where there are no overlapping genes on opposite strands. GenomeDiagram was written originally primarily for bacteria, where overlapping genes on opposite strands are more common, which may explain the design choices made.

Currently strand controls both orientation (for arrows, no effect on box sigils) and vertical placement (above, below, or straddling the line). Basically you want to override the vertical placement only? Note this is sigil dependent - it makes sense for the arrow, but not the default box (which was originally the only sigil supported).

The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer. If you need this right now, a careful hack in get_feature_sigil is the way to proceed.

The question is how to most cleanly expose this to the user while not breaking anything else (e.g. cross links), and ideally allow for a related option which Leighton and I have considered (but not had a pressing need to implement) for frame specific placement. i.e. Rather than treating the vertical drawing spaces as two regions (above the axis line for the forward strand, below the line for the reverse strand), treat it as six regions (three frames above and below the axis line). I'm picturing something a bit like the view in the Artemis annotation editor.

One question which constrains this design choice is would you want to mix these placements on the same track?
I think yes - using plain strandless BOX features (at the bottom of the z-order stack) is a really useful way to highlight a region of interest (which could have multiple genes drawn on top of it).

That suggests this setting might be best at the GenomeDiagram feature level. Perhaps a new attribute/argument 'strand_mode',

(a) ignore strand for vertical placement (what you want)
(b) divide vertical space in two (current behaviour)
(c) divide vertical space in six (frame specific placement)

Hmm. Leighton?

Peter

P.S. Frame specific placement would work best with an overhaul of how we draw multi-fragment features like genes with exons. Here a whole new sigil class for linking sub-parts of a feature might make sense. That is again something we only chatted about so far, but would make GenomeDiagram more useful for drawing eukaryotic annotation.

From p.j.a.cock at googlemail.com Wed Aug 1 10:43:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Aug 2012 11:43:59 +0100
Subject: [Biopython-dev] back_table in Bio.Data.CodonTable
In-Reply-To:
References:
Message-ID:

On Tue, Jul 31, 2012 at 8:07 PM, Jeff Hussmann wrote:
> It seems desirable to have each amino acid's list of codons be given
> in a deterministic order. I have been sorting lexicographically using
> the ordering 'TCAG'. This is referred to as the 'conventional
> ordering' in CodonTable.__str__.

Lexical sorting (i.e. using Python's sort on a list of codons) seems best, it is simple and predictable.

> The most flexible solution would be
> to take the ordering from self.nucleotide_alphabet.letters, but this
> would give 'GATC' for any CodonTable using IUPAC.unambiguous_dna as
> its nucleotide alphabet. Are there any Biopython-wide conventions
> here?

I'm not sure why the alphabets used that particular order over another.
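The two orderings being compared can be illustrated with a short sketch. The helper sort_codons is hypothetical (it is not part of Bio.Data.CodonTable); it ranks each base by its position in a caller-supplied order, so 'TCAG' gives the conventional ordering while Python's plain sort gives alphabetical order.

```python
def sort_codons(codons, order="TCAG"):
    """Sort codons base by base, ranking each base by its index in `order`."""
    rank = {base: index for index, base in enumerate(order)}
    return sorted(codons, key=lambda codon: [rank[base] for base in codon])


# The six leucine codons of the standard genetic code, deliberately shuffled.
leucine = ["CTG", "TTA", "CTT", "TTG", "CTC", "CTA"]

conventional = sort_codons(leucine)  # 'TCAG' ordering
plain = sorted(leucine)              # Python's default string ordering
```

Here conventional comes out as TTA, TTG, CTT, CTC, CTA, CTG, while plain comes out as CTA, CTC, CTG, CTT, TTA, TTG. Either choice is deterministic; they differ only in which convention the reader expects.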
Peter From Leighton.Pritchard at hutton.ac.uk Wed Aug 1 10:53:19 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 1 Aug 2012 10:53:19 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> Message-ID: <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk> Hi all, On 1 Aug 2012, at Wednesday, August 1, 10:27, Peter Cock wrote: On Wed, Aug 1, 2012 at 1:37 AM, Zachary Charlop-Powers wrote: However if you use strand="None", Genome Diagram will draw the features inline with the track and always in the positive direction. For myself, and probably others, keeping the direction of the features is immensely useful as you can often get a sense of operon structure in prokaryote genomes just by looking at the genes. That's true. I find it easiest to identify operon structure in that way (i.e. visually and approximately) by noting where the features swap between positive and negative strands. Other approaches might include colouring positive/negative/None strand features differently. Of course the forward and the minus strands can be drawn but condensing small sections of genes to a single track saves space when making images. It doesn't, if the single track is the same height as before - what differs is the whether the features on that track are half, or full, track height. So, would it be possible to change the default behavior of Genome Diagram to draw features inline (strand="None"), but to preserve their orientation? I think there's a better way to get what you're after. Changing the default setting here would modify more than whether the arrow spans the whole track, and it would also mean that GenomeDiagram does not respect the strand data of features by default. I think that's a bad thing. I think I know what you mean - that kind of picture is quite common e.g. for viruses - but only where there are no overlapping genes on opposite strands. 
GenomeDiagram was written originally primarily for bacteria, where overlapping genes on opposite strands are more common, which may explain the design choices made. My original choice was made for a combination of reasons: - I wanted to respect the strand information in the source data - The 'box' sigil was easiest to draw, and was the first to be available (this carries no inherent directional information as an image) The overlapping gene issue is relevant but, since the resolution of a drawn image is often such that boxes slightly overlap even when there is no feature overlap, it didn't feature in my consideration. Currently strand controls both orientation (for arrows, no effect on box sigils) and vertical placement (above, below, or straddling the line). Basically you want to override the vertical placement only? Note this is sigil dependent - it makes sense for the arrow, but not the default box (which was originally the only sigil supported). That's how I understand Zachary's suggestion: to draw an arrow with orientation preserved, but across the positive and negative strands of the track. The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer. If you need this right now, a careful hack in get_feature_sigil is the way to proceed. The question is how to most cleanly expose this to the user while not breaking anything else (e.g. cross links), and ideally allow for a related option which Leighton and I have considered [...] My original plan was to have more sigils available, implemented as draw_X() functions in the AbstractDrawer module. This would seem to be a good case for a draw_large_arrow() (or somesuch) function. The issue then would be a slight change to the prototypes for the existing draw_box and draw_arrow functions.
Basically, we'd pass the overall bounding box and strand (x0, x1, btm, ctr, top, strand) information to the new functions, and let them decide where to place the sigil - above, below, or straddling the centre line. Then, we could choose whether draw_arrow() takes an additional argument (e.g. straddle=True) for the behaviour that Zachary wants, or whether we use a new sigil ('large_arrow'), which could have its own function - just like that of draw_arrow() - but would probably be better implemented by just passing the straddle=True (or whatever) argument. This way, the change is transparent to the user, except for perhaps choosing 'large_arrow' rather than 'arrow' as a sigil. That suggests this setting might be best at the GenomeDiagram feature level. Perhaps a new attribute/argument 'strand_mode', (a) ignore strand for vertical placement (what you want) (b) divide vertical space in two (current behaviour) (c) divide vertical space in six (frame specific placement) Hmm. Leighton? I'm choosing to leave frame-specificity out of the discussion, for now ;) Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. 
If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From p.j.a.cock at googlemail.com Wed Aug 1 11:05:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Aug 2012 12:05:51 +0100 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk> References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk> Message-ID: On Wed, Aug 1, 2012 at 11:53 AM, Leighton Pritchard wrote: > > It doesn't, if the single track is the same height as before - what differs > is the whether the features on that track are half, or full, track height. Yes, but once you've configured the arrows to straddle the axis, you can then allocate less vertical space to that track. i.e. it needs less space. >> The question is how to most cleanly expose this to the user while >> not breaking anything else (e.g. cross links), and ideally allow for >> a related option which Leighton and I have considered [?] > > My original plan was to have more sigils available, implemented as draw_X() > functions in the AbstractDrawer module. This would seem to be a good case > for a draw_large_arrow() (or somesuch) function. The issue then would be a > slight change to the prototypes for the existing draw_box and draw_arrow > functions. 
Basically, we'd pass the overall bounding box and strand (x0, x1, > btm, ctr, top, strand) information to the new functions, and let them decide > where to place the sigil - above, below, or straddling the centre line. > > Then, we could choose whether draw_arrow() takes an additional argument > (e.g. straddle=True) for the behaviour that Zachary wants, or whether we use > a new sigil ('large_arrow'), which could have its own function - just like > that of draw_arrow() - but would probably be better implemented by just > passing the straddle=True (or whatever) argument. > > This way, the change is transparent to the user, except for perhaps choosing > 'large_arrow' rather than 'arrow' as a sigil. That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known. That could work quite well, and the end user API is quite clean. Peter From Leighton.Pritchard at hutton.ac.uk Wed Aug 1 11:23:48 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 1 Aug 2012 11:23:48 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <089BCE07-D9CB-4657-800C-8E0ACABED1A9@hutton.ac.uk> Message-ID: <93ED1DEB-1C9B-4D34-A898-D326ED5F8C2F@hutton.ac.uk> On 1 Aug 2012, at Wednesday, August 1, 12:05, Peter Cock wrote: On Wed, Aug 1, 2012 at 11:53 AM, Leighton Pritchard > wrote: It doesn't, if the single track is the same height as before - what differs is the whether the features on that track are half, or full, track height. Yes, but once you've configured the arrows to straddle the axis, you can then allocate less vertical space to that track. i.e. it needs less space. 
I understand that - and maybe I'm being (over) pedantic - but you can allocate less vertical space to the track in either case: the question is what kind of feature representation gives you the desired information legibly at those settings ;) That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known. That could work quite well, and the end user API is quite clean. This option gets my vote. L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827
From zcharlop at mail.rockefeller.edu Wed Aug 1 14:27:32 2012 From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers) Date: Wed, 1 Aug 2012 14:27:32 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> Message-ID: <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> Leighton, Peter, I love that we're not in the same timezone; I ask a question when I leave work and - lo,and, behold - when I return in the morning there is a well thought out response. Thank you both. The good news is the underlying drawing code can do this - the arrow drawing is just given a bounding box and the requested orientation (left or right) argument set by the get_feature_sigil method of the LinearDrawer or CircularDrawer. If you need this right now, a careful hack in get_feature_sigil is the way to proceed. I will take a look at this for a quick hack for some drawing I am working on. That was another idea I was considering. Under this model, the sigils could be given the full strand straddling bounding box, and decide if they will use all of this (i.e. the new 'large_arrow', or the current sigils when strand-less), or just half as in the stranded current 'arrow' and 'box' sigils where the strand is known. That could work quite well, and the end user API is quite clean. This option gets my vote. L. If you are both in agreement that this option is desirable and that it can be implemented in the sigil style, now we face the question of coding it. Would either of you consider working on it? If not this might be a problem I could tackle with a small amount of mentoring. Please let me know - I am happy to take a stab at it.
best regards, zach cp From p.j.a.cock at googlemail.com Wed Aug 1 17:15:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Aug 2012 18:15:31 +0100 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> Message-ID: On Wed, Aug 1, 2012 at 3:27 PM, Zachary Charlop-Powers wrote: > Leighton, > Peter, > > I love that we're not in the same timezone; I ask a question when I leave > work and - lo,and, behold - when I return in the morning there is a well > thought out response. Thank you both. :) Peter wrote: >>> The good news is the underlying drawing code can do this - the >>> arrow drawing is just given a bounding box and the requested >>> orientation (left or right) argument set by the get_feature_sigil >>> method of the LinearDrawer or CircularDrawer. >>> >>> If you need this right now, a careful hack in get_feature_sigil is >>> the way to proceed. Zachary wrote: > I will take a look at this for a quick hack for some drawing I am > working on. I hope you found any effort spent useful for understanding the codebase... even if it doesn't turn out to be needed (see below). Peter wrote: >>> That was another idea I was considering. Under this model, the sigils >>> could be given the full strand straddling bounding box, and decide if >>> they will use all of this (i.e. the new 'large_arrow', or the current sigils >>> when strand-less), or just half as in the stranded current 'arrow' and >>> 'box' sigils where the strand is known. >>> >>> That could work quite well, and the end user API is quite clean. Leighton wrote: >> This option gets my vote. >> >> L. Zachary wrote: > If you are both in agreement that this option is desirable and that it can > be implemented in the sigil style, now we face the question of coding it. > Would either of you consider working on it? 
If not this might be a problem I > could tackle with a small amount of mentoring. Please let me know - I am > happy to take a stab at it. I had a go this afternoon (a quiet moment between rushes - grin), and it wasn't as bad as I feared. This is on a git branch at the moment, https://github.com/peterjc/biopython/tree/gd-big Thus far, just two commits. The first refactors the current code to move the strand handling into the sigil code (but should, I hope, have no side effects): https://github.com/peterjc/biopython/commit/d9c416be7dd2c7081bd66bd553c9feb0174ecc13 The second commit implements the new axis straddling arrow (for both linear and circular diagrams) plus a minimal test: https://github.com/peterjc/biopython/commit/b58903d5c455416028a8ae410b2063d536448d59 To match the current sigil argument names BOX and ARROW, I have provisionally called it BIGARROW. Any better ideas? Also, to match the current arrow's behaviour, strand-less features get an arrow pointing to the right (like a forward strand arrow). Leighton and I had a little debate about this - with hindsight, the original arrow sigil might have raised an error or drawn a box in this situation - but I'm not willing to change this and break existing code. It would be great if you (Zachary) could give this a test, both to look for regressions (anything that broke) and try the new sigil out. Are you familiar with git, and installing Biopython from source?
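As a rough sketch of the idea discussed in this thread (each sigil is handed the full track bounding box plus the strand, and decides for itself how much of the box to use), something like the following would capture the placement rule. The function name sigil_extent, its return value, and the orientation strings are assumptions made for illustration; this is not the code from the gd-big branch. The names btm, ctr, top, strand and straddle are borrowed from the discussion above.

```python
def sigil_extent(btm, ctr, top, strand, straddle=False):
    """Return (y0, y1, orientation) for a sigil drawn on one track.

    btm, ctr, top -- bottom, centre line and top of the track
    strand        -- +1, -1, or None/0 for a strand-less feature
    straddle      -- if True, use the full height whatever the strand
    """
    # Strand-less features point right, matching the existing arrow.
    orientation = "left" if strand == -1 else "right"
    if straddle or strand not in (1, -1):
        return btm, top, orientation      # full track height
    if strand == 1:
        return ctr, top, orientation      # upper half only
    return btm, ctr, orientation          # lower half only


forward = sigil_extent(0, 5, 10, strand=1)
reverse = sigil_extent(0, 5, 10, strand=-1)
big_reverse = sigil_extent(0, 5, 10, strand=-1, straddle=True)
strandless = sigil_extent(0, 5, 10, strand=None)
print(forward, reverse, big_reverse, strandless)
```

The point of the sketch is that orientation and vertical placement are decided separately, so an axis straddling arrow can still point left for a reverse strand feature.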
Regards, Peter From zcharlop at mail.rockefeller.edu Wed Aug 1 22:10:55 2012 From: zcharlop at mail.rockefeller.edu (Zachary Charlop-Powers) Date: Wed, 1 Aug 2012 22:10:55 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> Message-ID: <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu> Peter wrote: It would be great if you (Zachary) could give this a test, both to look for regressions (anything that broke) and try the new sigil out. Are you familiar with git, and installing Biopython from source? Just reran my previous image-generation scripts with your BioPython. I used sigil="BIGARROW" instead of "ARROW" and it worked like a charm. Awesome. Would you want to add the "BIGARROW" option to the tutorial? best, zach cp From p.j.a.cock at googlemail.com Wed Aug 1 22:33:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Aug 2012 23:33:14 +0100 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu> References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> <4A304FFF-48C9-42D8-9A26-1A9FBDA9AAA7@rockefeller.edu> Message-ID: On Wed, Aug 1, 2012 at 11:10 PM, Zachary Charlop-Powers wrote: >> Peter wrote: >> >> It would be great if you (Zachary) could give this a test, both to look >> for regressions (anything that broke) and try the new sigil out. Are >> you familiar with git, and installing Biopython from source? >> > > Just reran my previous image-generation scripts with your BioPython. > I used sigil="BIGARROW" instead of "ARROW" and it worked like a > charm. Awesome. Great. Thanks for quickly testing this. > > Would you want to add the "BIGARROW" option to the tutorial? 
> Yes, if/when we merge this (and I'll try to talk to Leighton about it tomorrow), then I would also want to update the Tutorial to describe this new feature. There is almost no point writing new code if we don't document it. Peter From tiagoantao at gmail.com Thu Aug 2 03:39:43 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 1 Aug 2012 20:39:43 -0700 Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Linux - Python 3.1 Message-ID: I am currently away from office. I will respond back on the 20th of August. Regards, Tiago -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From Leighton.Pritchard at hutton.ac.uk Thu Aug 2 07:42:47 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 2 Aug 2012 07:42:47 +0000 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> Message-ID: Hi, On 1 Aug 2012, at Wednesday, August 1, 18:15, Peter Cock wrote: On Wed, Aug 1, 2012 at 3:27 PM, Zachary Charlop-Powers > wrote: Leighton, Peter, I love that we're not in the same timezone; I ask a question when I leave work and - lo,and, behold - when I return in the morning there is a well thought out response. Thank you both. No worries. I had a go this afternoon (a quite moment between rushes - grin), Good job getting it done so quickly! and it wasn't as bad as I feared. [?] To match the current sigil argument names BOX and ARROW, I have provisionally called BIGARROW. Any better ideas? BIGARROW sounds fine to me. I like literal names. Leighton and I had a little debate about this - with hindsight, the original arrow sigil might have raised an error or drawn a box in this situation - but I'm not willing to change this and break existing code. Likewise - now it's been there so long, I think it would be inconsistent at this point to change it. 
Arguably, the default setting has to choose a direction simply because (single-headed) arrows have a direction. For those figures where you're being precise, users can use a box for a feature with no direction; if it's pointing the wrong way, users can set the feature strand. Left-to-right as a default is arbitrary, though. Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827
From p.j.a.cock at googlemail.com Thu Aug 2 16:12:54 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Aug 2012 17:12:54 +0100 Subject: [Biopython-dev] Genome Diagram Default Behavior In-Reply-To: References: <2054694E-0D60-4F16-A7EE-ABC8AD59F344@rockefeller.edu> <5D18E2FE-3756-44E1-9DB3-4BAC690DFD78@rockefeller.edu> Message-ID: On Thu, Aug 2, 2012 at 8:42 AM, Leighton Pritchard wrote: >Peter wrote: >> >> To match the current sigil argument names BOX and ARROW, I have >> provisionally called BIGARROW. Any better ideas? >> > > BIGARROW sounds fine to me. I like literal names. > Great. Checked into the master, and I updated the Tutorial and the Proux et al 2002 Figure 6 reproduction example to use this: Before (cross-links with strand specific ARROW sigil): http://biopython.org/DIST/docs/tutorial/images/three_track_cl2.png After (cross-links with strand straddling BIGARROW sigil): http://biopython.org/DIST/docs/tutorial/images/three_track_cl2a.png Original (I don't know what was used to draw this): http://dx.doi.org/10.1128/JB.184.21.6026-6036.2002 Regards, Peter From clements at galaxyproject.org Fri Aug 3 23:23:25 2012 From: clements at galaxyproject.org (Dave Clements) Date: Fri, 3 Aug 2012 16:23:25 -0700 Subject: [Biopython-dev] Galaxy is Hiring Postdocs Message-ID: Hello all, The Galaxy Project , a highly successful high throughput data analysis platform for Life Sciences with over 23,000 users worldwide , is hiring: The Taylor Lab in Biologyand Mathematics & Computer Science at Emory Universityis looking for *postdoctoral scholars * to work on the Galaxy Project.
Postdoctoral applicants should have expertise in Bioinformatics and Computational Biology and research interests that complement but extend the lab's current interests: The Galaxy project; distributed and high-performance computing for data intensive science; vertebrate functional genomics; and genomics and epigenomic mechanisms of gene regulation, the role of transcription factors and chromatin structure in global gene expression, development, and differentiation. See the announcement for full details (http://bx.mathcs.emory.edu/joining/postdocs/). The Nekrutenko Lab at the Huck Institutes of Life Sciences at Penn State is seeking *highly opinionated and biologically inclined* *Postdoctoral researchers* within the Galaxy Project to develop best practices for analysis of next-generation sequencing data in all areas of Life Sciences where NGS is used. Successful candidates will join a vibrant research group at the core of the Galaxy Project and will work on setting trends in modern data-driven life-sciences. Please send your CV and names/e-mail addresses of three references to jobs at galaxyproject.org. Thanks, Dave C. -- http://galaxyproject.org/ http://getgalaxy.org/ http://usegalaxy.org/ http://galaxyproject.org/wiki/ From arklenna at gmail.com Tue Aug 7 05:11:04 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 7 Aug 2012 01:11:04 -0400 Subject: [Biopython-dev] GSoC python variant update Message-ID: Full post: http://arklenna.tumblr.com/post/28890255191/ Summary: * I'm working on the coordinate mapper Reece contributed: http://biopython.org/pipermail/biopython/2010-June/006598.html * I'm representing intron locations relative to CDS coords using the HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html I'd like to know if there are other common ways of representing such positions. * In order to customize the display of positions (e.g. 0-based or 1-based), I'm using a class as a configuration container.
I've read on StackOverflow that attempts to use globals or a singleton class are discouraged in Python, but I have not found practical suggestions for how to implement module-wide configurations. Suggestions are welcome. * Any advice about circular genomes or strandedness is also welcome. * This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, etc. Are there other Biopython objects that store sequence coordinates and thus should be mappable? Regards, Lenna From mjldehoon at yahoo.com Tue Aug 7 06:40:13 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 6 Aug 2012 23:40:13 -0700 (PDT) Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif Message-ID: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com> Dear all, Currently Bio.Motif has some support for writing TRANSFAC files but not for reading TRANSFAC files. I would like to add such a parser to Bio.Motif. Do you all agree that it fits in this module? Note that the TRANSFAC files very much look like EMBL files, and therefore contain much more information than what is currently in a Bio.Motif._Motif.Motif object (the object to be generated by Bio.Motif.read(handle, "transfac")). Perhaps the easiest is to add an attribute .annotations to Bio.Motif._Motif.Motif objects, and use it as a dictionary to store the EMBL-like annotations under their 2-letter keys. On a related note, currently Bio.Motif._Motif.Motif objects also perform functions that are more appropriate for a separate PWM (position-weight matrix) class within Bio.Motif. It may be a good idea to have a separate PWM class for this functionality. Best, -Michiel. 
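To make the .annotations suggestion concrete, here is a minimal sketch of reading a TRANSFAC-style matrix record into a dictionary keyed by the 2-letter line codes, with the count rows kept separately. Everything here is an assumption for illustration: the function read_transfac_like is hypothetical (it is not part of Bio.Motif), the record below is made up, and real files have more line types and variants than this handles.

```python
def read_transfac_like(lines):
    """Split a TRANSFAC-style record into annotations and count rows.

    Returns (annotations, counts): annotations maps each 2-letter key
    to a list of values; counts holds one {letter: count} dict per
    matrix row, using the column order given on the P0/PO header line.
    """
    annotations = {}
    counts = []
    letters = None
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue                      # blank or separator line
        if line.startswith("//"):
            break                         # end of record
        key, _, value = line.partition(" ")
        value = value.strip()
        if key in ("P0", "PO"):
            letters = value.split()       # column order, e.g. A C G T
        elif key.isdigit():
            if letters is None:
                raise ValueError("matrix row found before the header line")
            fields = value.split()[:len(letters)]
            counts.append(dict(zip(letters, (float(x) for x in fields))))
        else:
            annotations.setdefault(key, []).append(value)
    return annotations, counts


# A made-up record in the general style of a TRANSFAC matrix entry.
record = """\
AC  M_EXAMPLE
ID  example_motif
XX
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
XX
//
""".splitlines()

annotations, counts = read_transfac_like(record)
print(annotations)
print(counts)
```

Keeping the annotations in a plain dictionary leaves the question of where that dictionary should live (on the motif itself, or on a separate record object) open.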
From bartek at rezolwenta.eu.org Tue Aug 7 07:18:43 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 7 Aug 2012 09:18:43 +0200 Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1344321613.96095.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi Michiel, On Tue, Aug 7, 2012 at 8:40 AM, Michiel de Hoon wrote: > Dear all, > > Currently Bio.Motif has some support for writing TRANSFAC files but not for reading TRANSFAC files. I would like to add such a parser to Bio.Motif. Do you all agree that it fits in this module? Note that the TRANSFAC files very much look like EMBL files, and therefore contain much more information than what is currently in a Bio.Motif._Motif.Motif object (the object to be generated by Bio.Motif.read(handle, "transfac")). Perhaps the easiest is to add an attribute .annotations to Bio.Motif._Motif.Motif objects, and use it as a dictionary to store the EMBL-like annotations under their 2-letter keys. > That would certainly be a valuable addition. I didn't add it as a format because it might get a bit confusing for users. TRANSFAC itself (trademarked, afaik) is distributed by the BIObase company and is not available unless you pay them for a license (you have to register even for the "publicly available" one, which comes with a license too). If you do, then you get access to a number of interconnected datasets, including information about what they call "matrices", "sites" and "transcription factors" and "classes". I think that if we want to support their filetypes, we probably should think whether we should support the matrix file only or maybe the other ones as well. The confusing part is that many programs use "transfac-like" formats, i.e. files very similar to the part in the "matrix" file that corresponds to the PWM itself. (For example see http://www.benoslab.pitt.edu/stamp/help.html).
> On a related note, currently Bio.Motif._Motif.Motif objects also perform functions that are more appropriate for a separate PWM (position-weight matrix) class within Bio.Motif. It may be a good idea to have a separate PWM class for this functionality. Currently, Bio.Motif.Motif class represents something sequence-like. It can either be seen a set of instances (.add_instance(), .search_instance()) or as a PWM (.log_odds(), search_pwm(), etc), It can hold some annotation part (i.e. name etc), however, in my mind, it is the core of the functionality for "motif" analysis. I can imagine other types of motifs (we discussed regExp or HMM based motifs) that could subclass Motif, but I think this should be the role of the Motif class. Then comes the thing with annotations. I would rather vote for something more similar to SeqRecord and Seq, where a new class (MotifRecord?) would hold all the annotation data from TRANSFAC or somesuch DB, and the Motif would remain more sequence-like. With respect to moving the PWM-related functionality to a separate class, I'm not sure. I think it is valuable to be able to load instances from a file and then convert them to a PWM. It could be done with separate classes, but I'm not sure it would be easier then... best Bartek -- Bartek Wilczynski From mjldehoon at yahoo.com Tue Aug 7 08:39:15 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Aug 2012 01:39:15 -0700 (PDT) Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: Message-ID: <1344328755.85288.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi Bartek, Thanks for your reply. --- On Tue, 8/7/12, Bartek Wilczynski wrote: > If you do, then you get access to a number of interconnected > datasets, including information about what they call "matrices", >?"sites" and "transcription factors" and "classes". 
I think that if > we want to support their filetypes, we probably should think whether > we should support the matrix file only or maybe the other ones asa > well. I would suggest to just support the matrices for now. > The confusing part is that many programs use "transfac-like" > formats, i.e. files very similar to the part in the "matrix" > file that corresponds to the PWM itself. (For example see > http://www.benoslab.pitt.edu/stamp/help.html). This also means that if Bio.Motif can parse TRANSFAC files, then it can parse the transfac-like formats, at least to some degree. Personally I am actually more interested in the SwissRegulon database, which uses a transfac-like format > Then comes the thing with annotations. I would rather > vote for something more similar to SeqRecord and Seq, > where a new class (MotifRecord?) would hold all the > annotation data from TRANSFAC or somesuch DB, and the > Motif would remain more sequence-like. Are you suggesting that MotifRecord subclasses Bio.Motif._Motif.Motif? For example we could have a Bio.Motif.Parsers.TRANSFAC.Motif class that subclasses Bio.Motif._Motif.Motif. Then Bio.Motif._Motif.Motif remains sequence-like, and Bio.Motif.Parsers.TRANSFAC.Motif takes care of the annotations. Alternatively we could say that Bio.Motif.Parsers.TRANSFAC.read returns a Bio.Motif.Parsers.TRANSFAC.Record object that contains the motif information as an attribute (so record.motif would be an instance of Bio.Motif._Motif.Motif). Best, -Michiel From mjldehoon at yahoo.com Tue Aug 7 14:47:00 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Aug 2012 07:47:00 -0700 (PDT) Subject: [Biopython-dev] Fw: Re: Parsing TRANSFAC matrices with Bio.Motif Message-ID: <1344350820.11922.YahooMailClassic@web164006.mail.gq1.yahoo.com> Forwarding Bartek's email to the list .. I am pretty much OK with his suggestions, but feel free to comment or suggest other solutions before we start implementing this. Best, -Michiel. 
--- On Tue, 8/7/12, Bartek Wilczynski wrote: > From: Bartek Wilczynski > Subject: Re: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif > To: "Michiel de Hoon" > Date: Tuesday, August 7, 2012, 5:16 AM > On Tue, Aug 7, 2012 at 10:39 AM, > Michiel de Hoon > wrote: > > Hi Bartek, > > > > Thanks for your reply. > > > > --- On Tue, 8/7/12, Bartek Wilczynski > wrote: > >> If you do, then you get access to a number of > interconnected > >> datasets, including information about what they > call "matrices", > >> "sites" and "transcription factors" and "classes". > I think that if > >> we want to support their filetypes, we probably > should think whether > >> we should support the matrix file only or maybe the > other ones asa > >> well. > > > > I would suggest to just support the matrices for now. > > > I'm fine with that. Some links between the files might be > less > usefule, but that might be added later. > > >> The confusing part is that many programs use > "transfac-like" > >> formats, i.e. files very similar to the part in the > "matrix" > >> file that corresponds to the PWM itself. (For > example see > >> http://www.benoslab.pitt.edu/stamp/help.html). > > > > This also means that if Bio.Motif can parse TRANSFAC > files, then it > > can parse the transfac-like formats, at least to some > degree. Personally I am actually more interested in the > SwissRegulon database, which uses a transfac-like format > > > > In principle yes, but there are slight variants making > things "almost > working". That's the main reason I didn't put the code I was > using > myself into biopython repository, as it might cause some > weird > breakages. For examples, some formats drop the P0 column > (the > "transfac-like" in STAMP, for one) which makes it impossible > to figure > out whether you are interpreting the numbers right unless > you agree on > some ordering of nucleotides. 
I would suggest that we should > support > databases named directly and, maybe, think about generic > methods for > "raw PSSM" files, that would require the user to give the > nucleotide > order... > > >> Then comes the thing with annotations. I would > rather > >> vote for something more similar to SeqRecord and > Seq, > >> where a new class (MotifRecord?) would hold all > the > >> annotation data from TRANSFAC or somesuch DB, and > the > >> Motif would remain more sequence-like. > > > > Are you suggesting that MotifRecord subclasses > Bio.Motif._Motif.Motif? > > For example we could have a > Bio.Motif.Parsers.TRANSFAC.Motif class that subclasses > Bio.Motif._Motif.Motif. Then? Bio.Motif._Motif.Motif > remains sequence-like, and Bio.Motif.Parsers.TRANSFAC.Motif > takes care of the annotations. > > > > Alternatively we could say that > Bio.Motif.Parsers.TRANSFAC.read returns a > Bio.Motif.Parsers.TRANSFAC.Record object that contains the > motif information as an attribute (so record.motif would be > an instance of Bio.Motif._Motif.Motif). > > > > For me, personally, the version where transfac motif is a > subclass of > Motif is a more useful one. It is simpler, and it adds > annotations as > attributes of a motif. However, if we decided that we want > the whole > TRANSFAC db with all it's annotations, the more natural way > would be > to have separate classes for instances and motifs and maybe > even > separate record classes representing a database record > (there might be > more transfac records referencing the same matrix). I don't > think that > there is so much need for supporting all the stuff from > TRANSFAC (I > don't know anybody who would be using all their annotations, > people > seem to care only about matrices anyway) so I'd vote for the > simpler > way of subclassing Motif. 
> > best > Bartek > -- > Bartek Wilczynski > From w.arindrarto at gmail.com Tue Aug 7 17:56:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 7 Aug 2012 19:56:26 +0200 Subject: [Biopython-dev] GSoC Project Update -- 11 Message-ID: Hello everyone, I have just posted my latest update on my project here: http://bow.web.id/blog/2012/08/back-on-the-main-branch/ It's been taking quite a while since I posted my last update since there has been a considerable change to the SearchIO object model I'm using. The details are in my blog post, but to keep it short, it was because the previous model (QueryResult, Hit, and HSP) was inadequate in handling files that have multiple sequences in their HSP (so far seen in files output by BLAT and Exonerate). In my previous updates, I've been using simple Python lists to store attributes related to these multiple sequences, but that turned out to be problematic as it may make the object have inconsistent attributes. After trying out several different implementations and discussing them with Peter, we've finally settled on a new model. The new model changes the HSP object into a container that stores a new object: HSPFragment. HSPFragment represents a single, contiguous alignment of the hit and query sequence. It only stores the sequence, coordinates, frames, and strands. Other attributes made by the search program (such as evalues or scores) are stored in the HSP object. This change required some modifications on all of the current parsers, but from a user's perspective working with file formats other than BLAT or Exonerate, the changes should be minimum. Aside from this, there's also a small update on the main API which lets it accept keyword arguments. The arguments modify behaviors of the parser, and they are different for each parser. Currently, this is only used by the BLAST tabular parser, but I imagine more parsers will use this in the future. 
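
To make the new model a bit more concrete, here is a minimal sketch of the container relationship described above. It is an illustration only, not the code on the branch: apart from the class names HSP and HSPFragment, every attribute and argument name below is an assumption.

```python
class HSPFragment(object):
    """A single contiguous alignment of hit and query (illustrative only).

    Stores just the sequences, coordinates, frames and strands.
    All attribute names here are hypothetical.
    """

    def __init__(self, query_seq, hit_seq, query_range, hit_range,
                 query_strand=0, hit_strand=0, query_frame=0, hit_frame=0):
        self.query_seq = query_seq
        self.hit_seq = hit_seq
        self.query_range = query_range  # (start, end), 0-based half-open
        self.hit_range = hit_range
        self.query_strand = query_strand
        self.hit_strand = hit_strand
        self.query_frame = query_frame
        self.hit_frame = hit_frame


class HSP(object):
    """Container of HSPFragment objects (illustrative only).

    Values computed by the search program, such as the e-value or
    score, live here rather than on the individual fragments.
    """

    def __init__(self, fragments, evalue=None, score=None):
        self.fragments = list(fragments)
        self.evalue = evalue
        self.score = score

    def __len__(self):
        return len(self.fragments)

    def __iter__(self):
        return iter(self.fragments)

    def __getitem__(self, index):
        return self.fragments[index]


# A two-block alignment, as might come from BLAT or Exonerate output.
hsp = HSP(
    [
        HSPFragment("ACGTA", "ACGTA", (0, 5), (100, 105)),
        HSPFragment("TTGCA", "TTGCA", (5, 10), (250, 255)),
    ],
    evalue=1e-10,
    score=42.0,
)
```

A BLAST-style HSP would simply hold one fragment, which is why the change should be barely visible for those formats.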
Finally, having settled on a firmer object model, I'll be spending the rest of my time to focus on the documentation. There may still be small fixes to the code, but I expect nothing as major as this one. regards, Bow From chapmanb at 50mail.com Wed Aug 8 13:55:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 08 Aug 2012 09:55:36 -0400 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <874nodh4iv.fsf@fastmail.fm> Lenna; This all sounds great and will be a nice practical addition to Biopython. Thanks for taking it on. Some specific thoughts on your questions: > * I'm representing intron locations relative to CDS coords using the > HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html > I'd like to know if there are other common ways of representing such > positions. I don't know of one myself, so it's great to be following a standard rather than reinventing something. Nice work. > * In order to customize the display of positions (e.g. 0-based or > 1-based), I'm using a class as a configuration container. I've read on > StackOverflow that attempts to use globals or a singleton class are > discouraged in Python, but I have not found practical suggestions for > how to implement module-wide configurations. Suggestions are welcome. With configuration items like this, you have two choices: - A global variable. - Pass the configuration to every function that needs it. There are tradeoffs with both approaches, but for this case I agree with your decision to use globals. Most people will want 0-based/Biopython style but it gives those who don't a knob to switch over. > * Any advice about circular genomes or strandedness is also welcome. Circular handling is an unresolved issue in Biopython: https://redmine.open-bio.org/issues/2578 It's a bit tricky, especially with features that span the origin. I'd prioritize handling strandedness since you're going to have plenty of reverse strand coding sequences. 
You're mapping not only within the coding region but also back to the
original sequence on the reverse strand. So in your g2c mapping, the
original gene goes from e1 -> s1 -> e0 -> s0 as you read 5' to 3' across
the sequence. The best place to get started is to pick a reverse strand
gene and then work through the mappings, thinking through the
orientations. I find drawing it out to be the easiest way.

> * This mapper will work for SeqRecords, SeqFeatures, FeatureLocations,
> etc. Are there other Biopython objects that store sequence coordinates
> and thus should be mappable?

That sounds like a great start. Thanks again for this,
Brad

From p.j.a.cock at googlemail.com Wed Aug 8 14:33:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Aug 2012 15:33:05 +0100
Subject: [Biopython-dev] GSoC python variant update
In-Reply-To: <874nodh4iv.fsf@fastmail.fm>
References: <874nodh4iv.fsf@fastmail.fm>
Message-ID:

On Wed, Aug 8, 2012 at 2:55 PM, Brad Chapman wrote:
> Lenna wrote:
>> * Any advice about circular genomes or strandedness is also welcome.
>
> Circular handling is an unresolved issue in Biopython:
>
> https://redmine.open-bio.org/issues/2578
>
> It's a bit tricky, especially with features that span the origin.
>
> I'd prioritize handling strandedness since you're going to have plenty
> of reverse strand coding sequences. You're mapping not only within the
> coding region but also back to the original sequence on the reverse
> strand. So in your g2c mapping, the original gene goes from
> e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best
> place to get started is to pick a reverse strand gene and then work
> through the mappings, thinking through the orientations. I find drawing
> it out to be the easiest way.

And then think about mixed strand genes. A trans-spliced tRNA is a good
example - there is a GenBank example in our unit tests.
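
For anyone who prefers code to a drawing, the e1 -> s1 -> e0 -> s0 ordering Brad describes can be sketched roughly as below. This is a hedged illustration, not the GSoC mapper: the function name genomic_to_cds and the exon representation (0-based, half-open (start, end) tuples in genomic coordinates) are assumptions, and it only handles genes whose exons all lie on one strand.

```python
def genomic_to_cds(gpos, exons, strand):
    """Map a 0-based genomic position to a 0-based CDS position.

    exons  -- list of (start, end) tuples, 0-based half-open, genomic coords
    strand -- +1 or -1; all exons are assumed to share this strand
    Returns None for positions outside the exons (e.g. intronic).
    Hypothetical helper for illustration only.
    """
    ordered = sorted(exons)
    if strand == -1:
        # Reading 5' to 3' on the reverse strand visits the exons
        # from the highest genomic coordinate to the lowest.
        ordered = ordered[::-1]
    offset = 0  # CDS bases consumed by the exons already walked
    for start, end in ordered:
        if start <= gpos < end:
            if strand == 1:
                return offset + (gpos - start)
            # Reverse strand: the first base read is at end - 1.
            return offset + (end - 1 - gpos)
        offset += end - start
    return None


# Two exons: exon 0 is [s0, e0) = [10, 20), exon 1 is [s1, e1) = [30, 40).
exons = [(10, 20), (30, 40)]
first_reverse = genomic_to_cds(39, exons, -1)  # e1 - 1 is read first
last_reverse = genomic_to_cds(10, exons, -1)   # s0 is read last
in_intron = genomic_to_cds(25, exons, -1)
```

Mixed strand genes would need a strand per exon and an explicit exon order rather than a sort, which this sketch deliberately leaves out.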
Peter From lgautier at gmail.com Wed Aug 8 16:37:35 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Wed, 08 Aug 2012 18:37:35 +0200 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <502295CF.3020103@gmail.com> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro > Lenna; > This all sounds great and will be a nice practical addition to > Biopython. Thanks for taking it on. Some specific thoughts on your questions: > >> >* I'm representing intron locations relative to CDS coords using the >> >HGVS standards:http://www.hgvs.org/mutnomen/refseq_figure.html >> >I'd like to know if there are other common ways of representing such >> >positions. > I don't know of one myself, so it's great to be following a standard > rather than reinventing something. Nice work. > >> >* In order to customize the display of positions (e.g. 0-based or >> >1-based), I'm using a class as a configuration container. I've read on >> >StackOverflow that attempts to use globals or a singleton class are >> >discouraged in Python, but I have not found practical suggestions for >> >how to implement module-wide configurations. Suggestions are welcome. Module-wide configuration can be implemented as variables, as long as they are declared before the functions using them. If considering a package rather than a single module, options can be stored in a module dedicated to options (since Python modules are singletons). > With configuration items like this, you have two choices: > > - A global variable. > - Pass the configuration to every function that needs it. > > There are tradeoffs with both approaches, but for this case I agree with > your decision to use globals. Most people will want 0-based/Biopython > style but it gives those who don't a knob to switch over. I'd argue that allowing to switch is an invitation to spectacular issues down the road. 
An easy, yet frightening, example would be the case where third-party
code (such as a module) changes this without you knowing.

Another scary thought is that this would amount to bringing the infamous
Perl variable "$[" to Python. Go explain again that folks should use
Python for its elegance and simplicity after that.

Best,

L.

From arklenna at gmail.com Wed Aug 8 18:44:33 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 8 Aug 2012 14:44:33 -0400
Subject: [Biopython-dev] GSoC python variant update
In-Reply-To: <502295CF.3020103@gmail.com>
References: <502295CF.3020103@gmail.com>
Message-ID:

On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier wrote:
> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro
>
>>
>>> >* In order to customize the display of positions (e.g. 0-based or
>>> >1-based), I'm using a class as a configuration container. I've read on
>>> >StackOverflow that attempts to use globals or a singleton class are
>>> >discouraged in Python, but I have not found practical suggestions for
>>> >how to implement module-wide configurations. Suggestions are welcome.
>
>
> Module-wide configuration can be implemented as variables, as long as they
> are declared before the functions using them.
> If considering a package rather than a single module, options can be stored
> in a module dedicated to options (since Python modules are singletons).
>

Hi Laurent,

I really like the idea of a configuration module. I will definitely
move in that direction.

>
>> With configuration items like this, you have two choices:
>>
>> - A global variable.
>> - Pass the configuration to every function that needs it.
>>
>> There are tradeoffs with both approaches, but for this case I agree with
>> your decision to use globals. Most people will want 0-based/Biopython
>> style but it gives those who don't a knob to switch over.
>
>
> I'd argue that allowing to switch is an invitation to spectacular issues
> down the road.
> An easy, yet frightening, example would be the case where using third-party > code (such a module) changes this without you knowing. > > An other scary thought is that this would amount to bringing the infamous > Perl variable "$[" to Python. Go explain again that folks should Python for > its elegance and simplicity after that. > > Yikes. My approach will not be comparable to $[. For starters, it wouldn't modify the behavior of every sequence-like object. My current thought would be to store the 0-based position in an attribute `pos`, have a property `pos_str` that returns `pos` + `Config.index`. For representations, `__str__` will return `pos_str`, and `__repr__` will return `pos` (always 0-based). Math would always use the 0-based position. I intend to keep the influence of the hypothetical mapping Config module limited to Biopython Seq* objects. It should also be possible to make a kill switch, namely, a version of the Config module where all of the settings are neutral to adding (i.e. `def __add__(self, other): return other`). Please let me know if this would not fully address your concerns. Cheers, Lenna From lgautier at gmail.com Wed Aug 8 21:58:26 2012 From: lgautier at gmail.com (Laurent Gautier) Date: Wed, 08 Aug 2012 23:58:26 +0200 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: References: <502295CF.3020103@gmail.com> Message-ID: <5022E102.9010509@gmail.com> On 2012-08-08 20:44, Lenna Peterson wrote: > On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier wrote: >> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro >> >>>>> * In order to customize the display of positions (e.g. 0-based or >>>>> 1-based), I'm using a class as a configuration container. I've read on >>>>> StackOverflow that attempts to use globals or a singleton class are >>>>> discouraged in Python, but I have not found practical suggestions for >>>>> how to implement module-wide configurations. Suggestions are welcome. 
>> >> Module-wide configuration can be implemented as variables, as long as they >> are declared before the functions using them. >> If considering a package rather than a single module, options can be stored >> in a module dedicated to options (since Python modules are singletons). >> > Hi Laurent, > > I really like the idea of a configuration module. I will definitely > move in that direction. > >>> With configuration items like this, you have two choices: >>> >>> - A global variable. >>> - Pass the configuration to every function that needs it. >>> >>> There are tradeoffs with both approaches, but for this case I agree with >>> your decision to use globals. Most people will want 0-based/Biopython >>> style but it gives those who don't a knob to switch over. >> >> I'd argue that allowing to switch is an invitation to spectacular issues >> down the road. >> An easy, yet frightening, example would be the case where using third-party >> code (such a module) changes this without you knowing. >> >> An other scary thought is that this would amount to bringing the infamous >> Perl variable "$[" to Python. Go explain again that folks should Python for >> its elegance and simplicity after that. >> >> > Yikes. My approach will not be comparable to $[. For starters, it > wouldn't modify the behavior of every sequence-like object. > > My current thought would be to store the 0-based position in an > attribute `pos`, have a property `pos_str` that returns `pos` + > `Config.index`. For representations, `__str__` will return `pos_str`, > and `__repr__` will return `pos` (always 0-based). Math would always > use the 0-based position. > > I intend to keep the influence of the hypothetical mapping Config > module limited to Biopython Seq* objects. It should also be possible > to make a kill switch, namely, a version of the Config module where > all of the settings are neutral to adding (i.e. `def __add__(self, > other): return other`). 
What about making the design decision that string representations that are 1-based then, and go beyond making a kill switch by just kill the switch ? You'd document it, folks that want 0-based positions would cook their own function(s). I think that configuration modules can be very useful for an application (an example here: http://flask.pocoo.org/snippets/2/ ), but I am more reserved about its use in a library. But do not let me stop you from pursuing this; I am only expressing an opinion. One last point though. Let me describe a possible scenario: 3rd-party module "foo" is using the Biopython Seq* part, and its author thinks that Config.index should at 1 one, so he/she sets it accordingly. An early line in foo.py is: from somewhere.in.biopython.seq import config config.index = 1 There is an other piece of code (let's call it bar.py), written by someone else or by the same person at a different time. Now the hype is all about 0-based indexes, so the author sets it to be sure: from somewhere.in.biopython.seq import config config.index = 0 To complete the scenario bar.py is using foo.py, or the other way around. The requirement for one an other does not even have to be direct. Now config.index will be what the last piece of code sets it to, although other parts of the code might assume it is set to something else. That sort of situation is not prevented from happening with any sort of module in Python (e.g., import sys; sys.stdout = sys.stderr), but people know they should not do it. Here the config.index would appear as something people should change if they like. Again, that's just an opinion. Others might differ. Best, Laurent > > Please let me know if this would not fully address your concerns. 
> > Cheers, > > Lenna From arklenna at gmail.com Wed Aug 8 22:39:48 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 8 Aug 2012 18:39:48 -0400 Subject: [Biopython-dev] GSoC python variant update In-Reply-To: <5022E102.9010509@gmail.com> References: <502295CF.3020103@gmail.com> <5022E102.9010509@gmail.com> Message-ID: On Wed, Aug 8, 2012 at 5:58 PM, Laurent Gautier wrote: > On 2012-08-08 20:44, Lenna Peterson wrote: >> >> On Wed, Aug 8, 2012 at 12:37 PM, Laurent Gautier >> wrote: >>> >>> On 2012-08-08 18:00, biopython-dev-request at lists.open-bio.org wro >>> >>>>>> * In order to customize the display of positions (e.g. 0-based or >>>>>> 1-based), I'm using a class as a configuration container. I've read on >>>>>> StackOverflow that attempts to use globals or a singleton class are >>>>>> discouraged in Python, but I have not found practical suggestions for >>>>>> how to implement module-wide configurations. Suggestions are welcome. >>> >>> >>> Module-wide configuration can be implemented as variables, as long as >>> they >>> are declared before the functions using them. >>> If considering a package rather than a single module, options can be >>> stored >>> in a module dedicated to options (since Python modules are singletons). >>> >> Hi Laurent, >> >> I really like the idea of a configuration module. I will definitely >> move in that direction. >> >>>> With configuration items like this, you have two choices: >>>> >>>> - A global variable. >>>> - Pass the configuration to every function that needs it. >>>> >>>> There are tradeoffs with both approaches, but for this case I agree with >>>> your decision to use globals. Most people will want 0-based/Biopython >>>> style but it gives those who don't a knob to switch over. >>> >>> >>> I'd argue that allowing to switch is an invitation to spectacular issues >>> down the road. 
>>> An easy, yet frightening, example would be the case where using >>> third-party >>> code (such a module) changes this without you knowing. >>> >>> An other scary thought is that this would amount to bringing the infamous >>> Perl variable "$[" to Python. Go explain again that folks should Python >>> for >>> its elegance and simplicity after that. >>> >>> >> Yikes. My approach will not be comparable to $[. For starters, it >> wouldn't modify the behavior of every sequence-like object. >> >> My current thought would be to store the 0-based position in an >> attribute `pos`, have a property `pos_str` that returns `pos` + >> `Config.index`. For representations, `__str__` will return `pos_str`, >> and `__repr__` will return `pos` (always 0-based). Math would always >> use the 0-based position. >> >> I intend to keep the influence of the hypothetical mapping Config >> module limited to Biopython Seq* objects. It should also be possible >> to make a kill switch, namely, a version of the Config module where >> all of the settings are neutral to adding (i.e. `def __add__(self, >> other): return other`). > > > What about making the design decision that string representations that are > 1-based then, and go beyond making a kill switch by just kill the switch ? > You'd document it, folks that want 0-based positions would cook their own > function(s). > > I think that configuration modules can be very useful for an application (an > example here: > http://flask.pocoo.org/snippets/2/ ), but I am more reserved about its use > in a library. > > But do not let me stop you from pursuing this; I am only expressing an > opinion. One last point though. > Let me describe a possible scenario: > > 3rd-party module "foo" is using the Biopython Seq* part, and its author > thinks that Config.index should at 1 one, so he/she sets it accordingly. 
> An early line in foo.py is: > from somewhere.in.biopython.seq import config > config.index = 1 > > There is an other piece of code (let's call it bar.py), written by someone > else or by the same person at a different time. Now the hype is all about > 0-based indexes, so the author sets it to be sure: > from somewhere.in.biopython.seq import config > config.index = 0 > > To complete the scenario bar.py is using foo.py, or the other way around. > The requirement for one an other does not even have to be direct. Now > config.index will be what the last piece of code sets it to, although other > parts of the code might assume it is set to something else. > > That sort of situation is not prevented from happening with any sort of > module in Python (e.g., import sys; sys.stdout = sys.stderr), but people > know they should not do it. Here the config.index would appear as something > people should change if they like. > > Again, that's just an opinion. Others might differ. > > Best, > > > Laurent > > >> >> Please let me know if this would not fully address your concerns. >> >> Cheers, >> >> Lenna > > Laurent, I must thank you again for your foresight. I am realizing I may have gotten carried away with configurability. My initial goal with the index setting was to enable both GenBank and HGVS representations of genomic positions; a much simpler and safer approach would be to have `to_genbank()` and `to_hgvs()` methods. A user could set the relevant objects' __str__ to either of those. 
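
A rough sketch of that simpler approach, for illustration only. The class name GenomePosition is hypothetical, and the two output formats are simplified assumptions (a bare 1-based number for GenBank style, a "g." prefix for HGVS genomic style), not code from my branch.

```python
class GenomePosition(object):
    """A single position, always stored 0-based (hypothetical class)."""

    def __init__(self, pos):
        self.pos = pos  # 0-based; all arithmetic uses this value

    def to_genbank(self):
        """Simplified GenBank-style string: 1-based, no prefix."""
        return str(self.pos + 1)

    def to_hgvs(self):
        """Simplified HGVS genomic-style string: 1-based, 'g.' prefix."""
        return "g.%d" % (self.pos + 1)

    def __repr__(self):
        # The repr always shows the stored 0-based value.
        return "GenomePosition(%d)" % self.pos


position = GenomePosition(99)
genbank_text = position.to_genbank()
hgvs_text = position.to_hgvs()

# Opting in to one representation for str() is an explicit choice made
# in the user's own code, not a shared module-level setting.
GenomePosition.__str__ = GenomePosition.to_hgvs
default_text = str(position)
```

There is no shared config.index to fight over in this version, so the foo.py/bar.py scenario does not arise unless someone rebinds __str__ on the class, which is the same kind of thing as reassigning sys.stdout.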
Cheers,

Lenna

From p.j.a.cock at googlemail.com Thu Aug 9 09:07:15 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 9 Aug 2012 10:07:15 +0100
Subject: [Biopython-dev] GSoC python variant update
In-Reply-To: <5022E102.9010509@gmail.com>
References: <502295CF.3020103@gmail.com> <5022E102.9010509@gmail.com>
Message-ID:

On Wed, Aug 8, 2012 at 10:58 PM, Laurent Gautier wrote:
>
> What about making the design decision that string representations are
> 1-based then, and go beyond making a kill switch by just killing the switch?
> You'd document it, folks that want 0-based positions would cook their own
> function(s).
>
> I think that configuration modules can be very useful for an application ...

I agree that a module level config setting is unwise.

However, I'd much prefer the string representation was 0-based
for consistency, both internal to the module and with most of
Biopython. (The restriction module uses 1-based counting, which
I find very annoying.)

You could still provide something like a format method to
give a string in common representations (e.g. GenBank/EMBL/INSDC
style location strings).

Peter

From mjldehoon at yahoo.com Thu Aug 9 11:07:20 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 9 Aug 2012 04:07:20 -0700 (PDT)
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
Message-ID: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Hi guys,

In the Motif class in Bio.Motif._Motif, there is an attribute self.has_instances to identify whether the attribute self.instances is defined. I think that we can remove the self.has_instances attribute from the code and simply set self.instances=None when it is undefined. Same thing for self.counts and self.has_counts.
Any objections?

Best,
-Michiel.
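
As a small illustration of the proposal above, the sketch below uses a stand-in class, not the real Bio.Motif._Motif.Motif. The constructor arguments and the example data are assumptions made up for the example; only the idea of using None in place of the has_instances and has_counts flags comes from the message.

```python
class Motif(object):
    """Stand-in for the real Motif class (illustrative only)."""

    def __init__(self, instances=None, counts=None):
        # None means "not defined", replacing the separate
        # has_instances / has_counts boolean flags.
        self.instances = instances
        self.counts = counts


# One motif built from instances, one built from a count matrix.
arnt = Motif(instances=["CACGTG", "CACGTA"])
srf = Motif(counts={"A": [2, 9], "C": [1, 0], "G": [0, 1], "T": [7, 0]})

arnt_has_instances = arnt.instances is not None
srf_has_instances = srf.instances is not None
srf_has_counts = srf.counts is not None
```

Callers then test `motif.instances is None` directly instead of consulting a second attribute that has to be kept in sync.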
From bartek at rezolwenta.eu.org Thu Aug 9 12:26:33 2012
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 9 Aug 2012 14:26:33 +0200
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1344510440.89823.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID:

On Thu, Aug 9, 2012 at 1:07 PM, Michiel de Hoon wrote:
> Hi guys,
>
> In the Motif class in Bio.Motif._Motif, there is an attribute self.has_instances to identify whether the attribute self.instances is defined. I think that we can remove the self.has_instances attribute from the code and simply set self.instances=None when it is undefined. Same thing for self.counts and self.has_counts.
> Any objections?

Makes sense to me. +1

--
Bartek Wilczynski

From mjldehoon at yahoo.com Thu Aug 9 16:00:14 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 9 Aug 2012 09:00:14 -0700 (PDT)
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To:
Message-ID: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>

OK, done. Thanks!
-Michiel.

--- On Thu, 8/9/12, Bartek Wilczynski wrote:

> From: Bartek Wilczynski
> Subject: Re: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Thursday, August 9, 2012, 8:26 AM
> On Thu, Aug 9, 2012 at 1:07 PM, Michiel de Hoon wrote:
> > Hi guys,
> >
> > In the Motif class in Bio.Motif._Motif, there is an
> > attribute self.has_instances to identify whether the
> > attribute self.instances is defined. I think that we can
> > remove the self.has_instances attribute from the code and
> > simply set self.instances=None when it is undefined. Same
> > thing for self.counts and self.has_counts.
> > Any objections?
>
> Makes sense to me.
> +1
>
> --
> Bartek Wilczynski
>

From tiagoantao at gmail.com Fri Aug 10 03:04:53 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 9 Aug 2012 20:04:53 -0700
Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Linux 64 - Python 2.7
Message-ID:

I am currently away from office. I will respond back on the 20th of August.

Regards,
Tiago

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Fri Aug 10 08:33:43 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 10 Aug 2012 09:33:43 +0100
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Thu, Aug 9, 2012 at 5:00 PM, Michiel de Hoon wrote:
> OK, done. Thanks!
> -Michiel.

You'll also need to update the example in the Tutorial, quote:

    The arnt and srf motifs can both do the same things for us,
    but they use different internal representations of the motif.
    We can tell that by inspecting the \verb|has_counts| and
    has_instances properties:

    >>> arnt.has_instances
    True
    >>> srf.has_instances
    False
    >>> srf.has_counts
    True

This means test_Tutorial.py is failing (across all platforms).
Presumably we would suggest switching these to something like:

    >>> arnt.instances is None
    False

etc?
In fact, given the old methods were documented like
this, I would be happier if we could phase them out with
a deprecation warning via a read-only property method:

    @property
    def has_instances(self):
        """Does this motif have instances (DEPRECATED)."""
        import warnings
        from Bio import BiopythonDeprecationWarning
        warnings.warn("Check if motif.instances is None or not instead",
                      BiopythonDeprecationWarning)
        return self.instances is not None

(untested, but something like that)

Peter

From p.j.a.cock at googlemail.com Fri Aug 10 20:04:54 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 10 Aug 2012 21:04:54 +0100
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To:
References: <1344528014.32936.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Fri, Aug 10, 2012 at 9:33 AM, Peter Cock wrote:
> On Thu, Aug 9, 2012 at 5:00 PM, Michiel de Hoon wrote:
>> OK, done. Thanks!
>> -Michiel.
>
> You'll also need to update the example in the Tutorial, quote:
>
>     The arnt and srf motifs can both do the same things for us,
>     but they use different internal representations of the motif.
>     We can tell that by inspecting the \verb|has_counts| and
>     has_instances properties:
>
>     >>> arnt.has_instances
>     True
>     >>> srf.has_instances
>     False
>     >>> srf.has_counts
>     True
>
> This means test_Tutorial.py is failing (across all platforms).
> Presumably we would suggest switching these to something like:
>
>     >>> arnt.instances is None
>     False

Fixed:
https://github.com/biopython/biopython/commit/b866e74dc9b6162517588ea4c0e4d1ecde5ed87c

> etc?
> In fact, given the old methods were documented like
> this, I would be happier if we could phase them out with
> a deprecation warning via a read-only property method:
>
>     @property
>     def has_instances(self):
>         """Does this motif have instances (DEPRECATED)."""
>         import warnings
>         from Bio import BiopythonDeprecationWarning
>         warnings.warn("Check if motif.instances is None or not instead",
>                       BiopythonDeprecationWarning)
>         return self.instances is not None
>
> (untested, but something like that)

Done:
https://github.com/biopython/biopython/commit/fd2223d118227c921524e070c803b97bc979a70f

Although since that won't work on old Biopython either (you'd
get an AttributeError), perhaps we should label these new
backwards compatible properties as obsolete with a pending
deprecation warning for the next release (delay the deprecation)?

Peter

From mjldehoon at yahoo.com Sat Aug 11 03:48:29 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 10 Aug 2012 20:48:29 -0700 (PDT)
Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts
In-Reply-To:
Message-ID: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com>

Hi Peter,

--- On Fri, 8/10/12, Peter Cock wrote:
> > This means test_Tutorial.py is failing (across all platforms).
> > Presumably we would suggest switching these to something like:
> >
> >     >>> arnt.instances is None
> >     False
>
> Fixed:
> https://github.com/biopython/biopython/commit/b866e74dc9b6162517588ea4c0e4d1ecde5ed87c

Thanks for fixing this! Sorry I missed doing this when I was making these changes.

> Although since that won't work on old Biopython either (you'd
> get an AttributeError), perhaps we should label these new
> backwards compatible properties as obsolete with a pending
> deprecation warning for the next release (delay the
> deprecation)?
>

I think we are being way too careful. Requiring proper deprecation warnings each time we make a change in Biopython will slow down its development and improvement.
In the past when making changes to the existing code, we have gotten very few complaints; also in this case I doubt that anybody will miss has_counts, has_instances. Best, -Michiel. From mjldehoon at yahoo.com Sat Aug 11 04:25:05 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 10 Aug 2012 21:25:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Motif AlignAce parser Message-ID: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi guys, Looking some more at the parsers in Bio.Motif. In the Record class in Bio/Motif/Parsers/AlignAce.py, we have an attribute self.current_motif that points to the motif currently being parsed by the parser (or, after the parser finishes, the last motif that was parsed). As far as I can tell this, using a temporary variable current_motif within the read() function would be sufficient; we don't need to store it in the record. I would also suggest for the read() function to strip() all lines. Currently the end-of-line markers are kept. For example the version and the command line are stored as "AlignACE 4.0 05/13/04\n" and "./AlignACE -i test.fa \n" respectively. The version of the AlignACE program is stored in record.ver. The MEME and Mast parsers in Bio.Motif instead use record.version. For consistency I would suggest to use record.version also in the AlignACE parser. The command line is stored in record.cmd_line. The MEME parser uses record.command instead. I think both are fine, but I would also prefer this to be consistent. Then there are two attributes param_dict and seq_dict. The former is a dictionary that stores the parameters used in the run. The latter is not a dictionary but a list of sequence-related information. Since usually we don't put the type of the object in the attribute names, I would suggest to call these simply parameters and sequences. For comparison, the Mast parser uses record.sequences for an analogous attribute; MEME uses record.sequence_names. 
For consistency I would suggest to use record.sequences for all three. This would create some backward-incompatible changes that may confuse users. Now currently the parsers are located in Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast. I would prefer Bio.Motif.AlignAce, Bio.Motif.MEME, Bio.Motif.Mast. Currently to parse the AlignAce output one would do >>> from Bio.Motif.Parsers import AlignAce >>> record = AlignAce.read(handle) >>> record If we move the parsers one level up, this would be >>> from Bio.Motif import AlignAce >>> record = AlignAce.read(handle) >>> record which looks a bit more straightforward to me. In addition, this allows us to put a deprecation warning on the Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast modules as a whole, and we won't have to put deprecation warnings on each change separately. Any comments, objections? Best, -Michiel. From p.j.a.cock at googlemail.com Sat Aug 11 10:50:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 11 Aug 2012 11:50:07 +0100 Subject: [Biopython-dev] Bio.Motif._Motif has_instances, has_counts In-Reply-To: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1344656909.14019.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Saturday, August 11, 2012, Michiel de Hoon wrote: > Hi Peter, > > > Although since that won't work on old Biopython either > > (you'd > > get an AttributeError), perhaps we should label these new > > backwards compatible properties as obsolete with a pending > > deprecation warning for the next release (delay the > > deprecation)? > > > > I think we are being way too careful. Requiring proper deprecation > warnings each time we make a change in Biopython will slow down its > development and improvement. In the past when making changes to the > existing code, we have gotten very few complaints; also in this case I > doubt that anybody will miss has_counts, has_instances. 
>
> Best,
> -Michiel.
>

In this case you're probably right about it not causing too much
inconvenience - this is a relatively new module after all.

Peter

From arklenna at gmail.com Mon Aug 13 05:00:41 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 13 Aug 2012 01:00:41 -0400
Subject: [Biopython-dev] GSoC python variant update 10
Message-ID:

Link: http://arklenna.tumblr.com/post/29317968106/

Post:

Following extensive
[discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html)
on the dev list of the pros and cons of configuration classes/modules,
I have refactored my [coordinate mapper](https://gist.github.com/3172753)
to keep configuration as isolated as possible.

All mapping functions use base 0 internally. Transformation to and from
1-based coords is handled by custom MapPosition objects. (They are
currently separate from the Seq* positions but could probably subclass
ExactPosition.) The MapPosition objects have to_dialect and from_dialect
methods that automatically handle conversion between bases and other
formatting details.

There are two different ways a user can convert a coordinate from HGVS:

    # ... assuming cm is an instance of CoordinateMapper

    # Manually construct position from HGVS
    CDS_coord = CDSPosition.from_hgvs("6+1")
    genomic_coord = cm.c2g(CDS_coord)
    print(genomic_coord.to_hgvs())

    # Pass dialect argument to mapping function
    genomic_coord = cm.c2g("6+1", dialect="HGVS")
    print(genomic_coord.to_hgvs())

Furthermore, the inheritance hierarchy is designed to allow a user to
set a default string representation:

    # Set MapPositions to print as HGVS by default
    def use_hgvs(self):
        return str(self.to_hgvs())

    MapPosition.__str__ = use_hgvs

The [version](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf)
as of this writing is passing tests using base 0. I have not yet
implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my
list. I'm hoping to have time for strand and mixed strand, too.
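
To make the base-0 versus base-1 distinction concrete, here is a minimal sketch of a position object that stores a 0-based integer and converts to and from a 1-based string form. It is not the MapPosition code from the gist: the class name `SketchPosition` and its behaviour are assumptions made purely for illustration, it assumes plain positive positions, and intron offsets such as "6+1" are deliberately not handled.

```python
class SketchPosition(object):
    """Hypothetical stand-in: holds a 0-based position internally."""

    def __init__(self, pos):
        self.pos = int(pos)  # always 0-based inside the object

    @classmethod
    def from_hgvs(cls, text):
        # The 1-based dialect counts from 1, so shift down by one on input.
        return cls(int(text) - 1)

    def to_hgvs(self):
        # Shift back up by one when formatting for the 1-based dialect.
        return str(self.pos + 1)

    def __str__(self):
        # Default string form is the internal 0-based value.
        return str(self.pos)


p = SketchPosition.from_hgvs("6")
print(p)            # internal 0-based form
print(p.to_hgvs())  # 1-based form, as originally given
```

Keeping a single internal convention and converting only at the edges means the mapping arithmetic never has to ask which base a number is in.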
Cheers, Lenna From bartek at rezolwenta.eu.org Mon Aug 13 13:12:35 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 13 Aug 2012 15:12:35 +0200 Subject: [Biopython-dev] Bio.Motif AlignAce parser In-Reply-To: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1344659105.5874.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Sounds great to me. Bartek On Sat, Aug 11, 2012 at 6:25 AM, Michiel de Hoon wrote: > Hi guys, > > Looking some more at the parsers in Bio.Motif. > > In the Record class in Bio/Motif/Parsers/AlignAce.py, we have an attribute self.current_motif that points to the motif currently being parsed by the parser (or, after the parser finishes, the last motif that was parsed). As far as I can tell this, using a temporary variable current_motif within the read() function would be sufficient; we don't need to store it in the record. > > I would also suggest for the read() function to strip() all lines. Currently the end-of-line markers are kept. For example the version and the command line are stored as "AlignACE 4.0 05/13/04\n" and "./AlignACE -i test.fa \n" respectively. > > The version of the AlignACE program is stored in record.ver. The MEME and Mast parsers in Bio.Motif instead use record.version. For consistency I would suggest to use record.version also in the AlignACE parser. > > The command line is stored in record.cmd_line. The MEME parser uses record.command instead. I think both are fine, but I would also prefer this to be consistent. > > Then there are two attributes param_dict and seq_dict. The former is a dictionary that stores the parameters used in the run. The latter is not a dictionary but a list of sequence-related information. Since usually we don't put the type of the object in the attribute names, I would suggest to call these simply parameters and sequences. For comparison, the Mast parser uses record.sequences for an analogous attribute; MEME uses record.sequence_names. 
For consistency I would suggest to use record.sequences for all three. > > This would create some backward-incompatible changes that may confuse users. Now currently the parsers are located in Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast. I would prefer Bio.Motif.AlignAce, Bio.Motif.MEME, Bio.Motif.Mast. Currently to parse the AlignAce output one would do >>>> from Bio.Motif.Parsers import AlignAce >>>> record = AlignAce.read(handle) >>>> record > > If we move the parsers one level up, this would be >>>> from Bio.Motif import AlignAce >>>> record = AlignAce.read(handle) >>>> record > > which looks a bit more straightforward to me. In addition, this allows us to put a deprecation warning on the Bio.Motif.Parsers.AlignAce, Bio.Motif.Parsers.MEME, and Bio.Motif.Parsers.Mast modules as a whole, and we won't have to put deprecation warnings on each change separately. > > Any comments, objections? > > Best, > -Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Bartek Wilczynski From arnaud.poret at gmail.com Mon Aug 13 14:07:39 2012 From: arnaud.poret at gmail.com (Arnaud Poret) Date: Mon, 13 Aug 2012 16:07:39 +0200 Subject: [Biopython-dev] obo parser Message-ID: Hi everyone, I'm a newcomer and I'm writing an obo parser for importing ontologies into python. I'm not sure, but has already BioPython an obo parser? If yes, I'm reinventing the wheel... If no, I'll be glad to contribute. From tiagoantao at gmail.com Tue Aug 14 03:23:01 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 13 Aug 2012 20:23:01 -0700 Subject: [Biopython-dev] Away Re: buildbot failure in Biopython on Windows XP - Python 2.5 Message-ID: I am currently away from office. I will respond back on the 20th of August. 
Regards,
Tiago

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From p.j.a.cock at googlemail.com Tue Aug 14 11:06:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Aug 2012 12:06:32 +0100
Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior
Message-ID:

On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote:
> On Thu, Aug 2, 2012 at 8:42 AM, Leighton Pritchard wrote:
>>Peter wrote:
>>>
>>> To match the current sigil argument names BOX and ARROW, I have
>>> provisionally called it BIGARROW. Any better ideas?
>>>
>>
>> BIGARROW sounds fine to me. I like literal names.
>>
>
> Great. Checked into the master, and I updated the Tutorial and
> the Proux et al 2002 Figure 6 reproduction example to use this:
>
> Before (cross-links with strand specific ARROW sigil):
> http://biopython.org/DIST/docs/tutorial/images/three_track_cl2.png
>
> After (cross-links with strand straddling BIGARROW sigil):
> http://biopython.org/DIST/docs/tutorial/images/three_track_cl2a.png
>
> Original (I don't know what was used to draw this):
> http://dx.doi.org/10.1128/JB.184.21.6026-6036.2002
>
> Regards,
>
> Peter

Further to that work, I updated some older code for a JAGGY sigil,
and also an OCTO sigil (names open to suggestions). Both are on my
gd-sigils branch, which has documentation in the tutorial, including
this image of the expanded sigil set:

https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png

This is a slight simplification of the old JAGGY code in that it
does not (yet) allow control of the teeth length (e.g. to have teeth
on just one end). I am thinking this could be exposed like the
existing arrow specific options.

I originally created the JAGGY sigil for marking a break point in
a contig/scaffold. For instance, you might want to mark a run of
NNNNN bases in a scaffold with a jaggy sigil (straddling both
strands) as a clear visual marker to explain why there were no
genes.
Other sigil ideas I pondered include an OVAL, which should be
quite easy for the linear diagrams, but rather more work to
implement for circular diagrams due to the distorted curves.

Peter

From p.j.a.cock at googlemail.com Tue Aug 14 19:49:23 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Aug 2012 20:49:23 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <87lim4h07o.fsf@fastmail.fm>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm>
Message-ID:

On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote:
> Michiel;
>> Hi Eric, Peter,
>>
>> > How about Bio.Search, for now?
>>
>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
>> users something about what the module is for. Bio.Search could be
>> anything (search PubMed? search the Entrez databases? search Google?
>> anyway Bio.Search does not suggest that this module is about pairwise
>> alignments). But Peter previously mentioned that he doesn't like
>> Bio.Pairwise; can we convince you?
>
> I agree with Peter on this one. The module is primarily about searching
> a sequence database with an input via multiple methods, not about
> pairwise alignment of two sequences, which is what Bio.Align.Pairwise
> suggests to me.
>
> Brad

One potential problem with Bio.Search (on top of the concerns raised
here about vagueness), which Bow and I were just talking about during
our weekly GSoC video call, is the existence of Bio/Search.py, which is
obsolete and long overdue for removal. I have just deprecated it
(something I forgot to do before the last release):
https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8

We'd earlier talked about using Bio.Search as the namespace.
I was worried about the potential existence on a user's machine of both Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py (aka SearchIO, the new module) and which would take precedence when doing: from Bio import Search Given how Python module installations work, that seems highly likely to occur. The good news is that the package would take priority - see http://www.python.org/doc/essays/packages.html >>>> What If I Have a Module and a Package With The Same Name? >>>> >>>> You may have a directory (on sys.path) which has both a module >>>> spam.py and a subdirectory spam that contains an __init__.py >>>> (without the __init__.py, a directory is not recognized as a package). >>>> In this case, the subdirectory has precedence, and importing spam >>>> will ignore the spam.py file, loading the package spam instead. If >>>> you want the module spam.py to have precedence, it must be >>>> placed in a directory that comes earlier in sys.path. So there is no technical reason to avoid Bio.Search as an option for the Bio.SearchIO namespace. We could then have Bio.Search.Applications for command line wrappers, consistent with Bio.Phylo.Applications, Bio.Motif.Applications and Bio.Align.Applications. Of course, Bio.Search is still perhaps too broad a name... but on balance perhaps it is still better than Bio.SearchIO? Regards, Peter From tiagoantao at gmail.com Tue Aug 14 20:39:12 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 14 Aug 2012 21:39:12 +0100 Subject: [Biopython-dev] jython/testing Message-ID: Hi, I have been trying to use biopython with jython 2.7 alpha 2. Here follows a report. There are still a few problems (with SeqIO only): test_SeqIO ... ERROR test_SeqIO_QualityIO ... FAIL test_SeqIO_index ... 
FAIL The errors are something like (all the same kind of stuff really): SeqIO ====================================================================== ERROR: test_SeqIO ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 341, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/home/tr353/local/jython/Lib/unittest/loader.py", line 91, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/tr353/local/jython/Lib/unittest/loader.py", line 91, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 627, in check_simple_write_read(records) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 352, in check_simple_write_read records2 = list(SeqIO.parse(handle=handle, format=format)) File "/home/tr353/tmp/biopython/Tests/test_SeqIO.py", line 352, in check_simple_write_read records2 = list(SeqIO.parse(handle=handle, format=format)) File "/home/tr353/tmp/biopython/Bio/SeqIO/__init__.py", line 537, in parse for r in i: File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 828, in SffIterator header_length, index_offset, index_length, number_of_reads, \ File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 285, in _sff_file_header magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \ error: unpack str size does not match format SeqIO_QualityIO ====================================================================== ERROR: test_E3MFGYR02 (test_SeqIO_QualityIO.TestWriteRead) Write and read back E3MFGYR02_random_10_reads.sff ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 551, in test_E3MFGYR02 self.check(os.path.join("Roche", "E3MFGYR02_random_10_reads.sff"), "sff", File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 477, in check 
write_read(filename, format, f) File "/home/tr353/tmp/biopython/Tests/test_SeqIO_QualityIO.py", line 52, in write_read records2 = list(SeqIO.parse(handle,out_format)) File "/home/tr353/tmp/biopython/Bio/SeqIO/__init__.py", line 537, in parse for r in i: File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 828, in SffIterator header_length, index_offset, index_length, number_of_reads, \ File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 285, in _sff_file_header magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \ error: unpack str size does not match format SeqIO.index ====================================================================== ERROR: test_sff_Roche_greek_sff_get_raw (test_SeqIO_index.IndexDictTests) Index sff file Roche/greek.sff get_raw ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tr353/tmp/biopython/Tests/test_SeqIO_index.py", line 430, in f = lambda x : x.get_raw_check(fn, fmt, alpha, c) File "/home/tr353/tmp/biopython/Tests/test_SeqIO_index.py", line 301, in get_raw_check rec2 = SeqIO.SffIO._sff_read_seq_record(handle, File "/home/tr353/tmp/biopython/Bio/SeqIO/SffIO.py", line 561, in _sff_read_seq_record read_header_length, name_length, seq_len, clip_qual_left, \ error: unpack str size does not match format I suppose this is because of issues with the alpha version of jython 2.7. Tiago PS - I do not have all external dependencies installed on my machine, so a few modules are untested. -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From p.j.a.cock at googlemail.com Wed Aug 15 11:18:50 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 15 Aug 2012 12:18:50 +0100 Subject: [Biopython-dev] jython/testing In-Reply-To: References: Message-ID: On Tue, Aug 14, 2012 at 9:39 PM, Tiago Ant?o wrote: > Hi, > > I have been trying to use biopython with jython 2.7 alpha 2. Here > follows a report. 
>
>
> There are still a few problems (with SeqIO only):
> test_SeqIO ... ERROR
> test_SeqIO_QualityIO ... FAIL
> test_SeqIO_index ... FAIL
>
> The errors are something like (all the same kind of stuff really):
>
> ...

I see that on my machine too. From looking at the tracebacks and
the associated code, the failures all involve BytesIO (or StringIO
depending on the Python version). Note that BytesIO is new in
Python 2.6, and thus also new in Jython 2.7 compared to Jython 2.5.

This is enough to demonstrate a bug in Jython 2.7a2, which explains
some if not all of our unit test failures:

Expected behaviour:

    $ python
    Python 2.7.2 (default, Jun 20 2012, 16:23:33)
    [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from io import BytesIO
    >>> raw = open("Roche/E3MFGYR02_random_10_reads.sff", "rb").read()
    >>> raw == BytesIO(raw).read()
    True
    >>> len(raw)
    17592
    >>> quit()

Broken behaviour:

    $ ~/jython2.7a2/jython
    Jython 2.7a2 (default:9c148a201233, May 24 2012, 15:49:00)
    [Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_33
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from io import BytesIO
    >>> raw = open("Roche/E3MFGYR02_random_10_reads.sff", "rb").read()
    >>> raw == BytesIO(raw).read()
    False
    >>> len(raw)
    17592
    >>> len(BytesIO(raw).read())
    51577
    >>> BytesIO(raw).read()[:100]
    "bytearray(b'.sff\\x00\\x00\\x00\\x01\\x00\\x00\\x00\\x00\\x00\\x00A\\xb8\\x00\\x00\\x02\\xfc\\x00\\x00\\x00\\n\\x01\\xb8\\"
    >>> raw[:100]
    '.sff\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00A\xb8\x00\x00\x02\xfc\x00\x00\x00\n\x01\xb8\x00\x04\x01\x90\x01TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT'
    >>> quit()

I will report this.
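
The sessions above depend on the Roche SFF test file. The same round-trip idea can be sketched without any input file, as below. This is only an illustration, not the test case attached to any bug report, and the helper name `bytesio_round_trips` is made up here.

```python
from io import BytesIO


def bytesio_round_trips(raw):
    """Return True if BytesIO hands back exactly the bytes it was given."""
    data = BytesIO(raw).read()
    # A correct implementation returns identical bytes, hence equal length.
    return data == raw and len(data) == len(raw)


# Every possible byte value, including NULs and high bytes like those
# found in a binary SFF header.
raw = bytes(bytearray(range(256)))
print(bytesio_round_trips(raw))
```

On an interpreter showing the broken behaviour described above, where `read()` returns something other than the original bytes, the helper would be expected to return False.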
Peter From p.j.a.cock at googlemail.com Wed Aug 15 11:26:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 15 Aug 2012 12:26:19 +0100 Subject: [Biopython-dev] jython/testing In-Reply-To: References: Message-ID: On Wed, Aug 15, 2012 at 12:18 PM, Peter Cock wrote: > On Tue, Aug 14, 2012 at 9:39 PM, Tiago Ant?o wrote: >> Hi, >> >> I have been trying to use biopython with jython 2.7 alpha 2. Here >> follows a report. >> >> >> There are still a few problems (with SeqIO only): >> test_SeqIO ... ERROR >> test_SeqIO_QualityIO ... FAIL >> test_SeqIO_index ... FAIL >> >> The errors are something like (all the same kind of stuff really): >> >> ... > > I see that on my machine too. From looking at the tracebacks and > the associated code, the failures all involve BytesIO (or StringIO > depending on the Python version). Note that BytesIO is new in > Python 2.6, and thus also new in Jython 2.7 compared to Jython 2.5. > > This is enough to demonstrate a bug in Jython 2.7a2, which explains > some if not all of our unit test failures: > > ... > > I will report this. Filed as http://bugs.jython.org/issue1959 with a shorter test case. Peter From arklenna at gmail.com Fri Aug 17 01:58:46 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 16 Aug 2012 21:58:46 -0400 Subject: [Biopython-dev] GSoC Python variant (penultimate) update Message-ID: Post: http://arklenna.tumblr.com/post/29592108099/ I have been considering how to handle gene strandedness. As long as I'm correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features. 
GenBank: join(complement(25..30), 36..40)
Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)

(please click through to post for monospaced font)

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
      <----------------                -------------->
       5  4  3  2  1  0                 6  7  8  9 10

I have to admit that it wasn't until I read a BioStar
[post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/)
earlier this week that I fully understood the relationship between
plus/minus, forward/reverse, sense/antisense and coding/template
strands. So please let me know as soon as possible if I've made a
mistake in the above code.

`c2g` yields the correct genome position, but not the strand. I still
need to integrate strand information into my `GenomePosition` object
and/or partially merge it with `ExactLocation`.

This weekend I intend to expand documentation and write a brief
cookbook entry.

Cheers,

Lenna

From arnaud.poret at gmail.com Fri Aug 17 07:38:28 2012
From: arnaud.poret at gmail.com (Arnaud Poret)
Date: Fri, 17 Aug 2012 09:38:28 +0200
Subject: [Biopython-dev] obo parser
Message-ID:

Hi everyone,

I'm a newcomer and I'm writing an OBO parser for importing ontologies
into Python. I'm not sure, but does Biopython already have an OBO
parser?

If yes, I'm reinventing the wheel...

If no, I'll be glad to contribute.

Arnaud.

From p.j.a.cock at googlemail.com Fri Aug 17 08:15:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 17 Aug 2012 09:15:10 +0100
Subject: [Biopython-dev] obo parser
In-Reply-To:
References:
Message-ID:

On Mon, Aug 13, 2012 at 3:07 PM, Arnaud Poret wrote:
> Hi everyone,
>
> I'm a newcomer and I'm writing an OBO parser for importing ontologies
> into Python. I'm not sure, but does Biopython already have an OBO
> parser?
>
> If yes, I'm reinventing the wheel...
>
> If no, I'll be glad to contribute.

There does seem to be interest; questions about ontologies, GO and
OBO crop up every so often.
There were some people actually working on this too, but it has gone quiet. e.g. http://lists.open-bio.org/pipermail/biopython-dev/2012-February/009384.html http://lists.open-bio.org/pipermail/biopython-dev/2011-July/009031.html Chris Lasher's repository has vanished, but Eric's older work is still online (CC'd): https://github.com/kellrott/biopython/tree/gosupport Eric & Chris - where do things stand? Regards, Peter From p.j.a.cock at googlemail.com Fri Aug 17 08:21:01 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Aug 2012 09:21:01 +0100 Subject: [Biopython-dev] [GSoC] GSoC Python variant (penultimate) update In-Reply-To: References: Message-ID: On Fri, Aug 17, 2012 at 2:58 AM, Lenna Peterson wrote: > > I have to admit that it wasn't until I read a BioStar > [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/) > earlier this week that I fully understood the relationship between > plus/minus forward/reverse sense/antisense coding/template strands. So > please let me know as soon as possible if I've made a mistake in the > above code. Given this is nice and fresh in your mind, can you suggest any clarifications to the Biopython Tutorial section talking about this issue? The section on transcription & translation starting: "Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide: ..." Hmm. That should probably say "I want to try to clarify...". Peter From p.j.a.cock at googlemail.com Fri Aug 17 16:42:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Aug 2012 17:42:57 +0100 Subject: [Biopython-dev] BioSQL tests Message-ID: Dear all, I realised this week that I didn't have a working BioSQL test setup under either MySQL or PostgreSQL, and the buildbot machines are not testing these either. 
Therefore I have re-factored the BioSQL unit tests as follows: First I turned my print-and-compare test_BioSQL_SeqIO.py script into proper UnitTest based tests, so that all the BioSQL tests could be combined in one file, test_BioSQL.py. This allowed a further reorganisation to allow any one machine to test all the supported back ends one after the other - previously the setup only tested one backend (defaulting to SQLite3). We now have three test scripts named after the backend library used to connect to the database: test_BioSQL_MySQLdb.py test_BioSQL_psycopg2.py test_BioSQL_sqlite3.py Subsequently I modified our TravisCI configuration to install the required dependencies to run all these tests. The default usernames and passwords for MySQLdb and postgresql are set to match those under TravisCI. Local users would probably have to adjust these values (in the same way they used to prior to the refactoring). Note that psycopg2 only works on C Python 2 & 3 for now (there is a PyPy alternative I have not looked into). MySQLdb only works on C Python 2 (there is a problem installing it under Python 3.2). This did show I'd broken using BioSQL under MySQLdb, at least under this particular version, fixed now: https://github.com/biopython/biopython/commit/4a67d851d1eda0a138b604c8aeffc151d331a29b So the good news is that now TravisCI will run the BioSQL tests on all three database backends, on several versions of Python (but just on Linux). http://travis-ci.org/biopython/biopython/ What I have not addressed is if/how we should deal with test database setting under buildbot - perhaps by environment variable overrides? If anyone would like to look into using MySQLdb and/or psycopg2 under PyPy and Jython, that would also be useful too. 
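
As one possible shape for the environment variable idea, here is a minimal sketch. Nothing in it is taken from the actual test scripts: the setting names, the default values and the `BIOSQL_TEST_` prefix are all hypothetical placeholders.

```python
import os

# Hypothetical defaults; the real values would live in the test scripts.
DEFAULTS = {
    "DBHOST": "localhost",
    "DBUSER": "biosql_user",
    "DBPASSWD": "",
    "TESTDB": "biosql_test",
}


def load_db_settings(environ=None, prefix="BIOSQL_TEST_"):
    """Return test database settings, letting environment variables win."""
    if environ is None:
        environ = os.environ
    settings = dict(DEFAULTS)
    for key in DEFAULTS:
        value = environ.get(prefix + key)
        if value is not None:
            # An explicitly set variable overrides the built-in default.
            settings[key] = value
    return settings


# A buildbot worker might export BIOSQL_TEST_DBUSER before running the tests.
print(load_db_settings({"BIOSQL_TEST_DBUSER": "buildbot"}))
```

Under this scheme a machine with no variables set keeps the defaults, so local users and TravisCI would see no change in behaviour.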
Regards, Peter From arklenna at gmail.com Mon Aug 20 04:22:36 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 20 Aug 2012 00:22:36 -0400 Subject: [Biopython-dev] GSoC python variant final update Message-ID: Post: http://arklenna.tumblr.com/post/29808300789/ The coordinate mapper, with updated documentation, is now located on this branch: https://github.com/lennax/biopython/tree/f_loc4 It awaits the merging of Peter's f_loc4 branch. I've written an entry on coordinate mapping for the Cookbook: http://biopython.org/wiki/Coordinate_mapping Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit It's been a great experience working with this project this summer. Thank you to everyone involved. Cheers, Lenna From mjldehoon at yahoo.com Mon Aug 20 12:38:37 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 20 Aug 2012 05:38:37 -0700 (PDT) Subject: [Biopython-dev] Bio.Cluster in the main Biopython documentation Message-ID: <1345466317.39160.YahooMailClassic@web164003.mail.gq1.yahoo.com> Dear all, Previously the documentation for Bio.Cluster was only available as a separate PDF on the Biopython website. I have now integrated this documentation into the Biopython Tutorial. The new tutorial is already uploaded to the repository, and will be visible at http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html once the nightly build is done. Since the documentation for Bio.Cluster contains many references to the literature, I started using the LaTeX \cite command, which are understood and formatted properly by Hevea. While at it, I also converted the references I could find in other parts of the Tutorial to \cite references. This creates a list of references at the end of the Tutorial. Please let us know if you don't like this approach. 
The documentation for Bio.Cluster is fairly long, and while modifying it for inclusion into the Tutorial some mistakes may have crept in. Please let me know if you find any such mistakes (or feel free to fix them yourself, if it is clear what the text should be). For now we can leave the PDF with the separate description of Bio.Cluster on the website as is for users of Biopython 1.60, but once the next version of Biopython is out I would like to replace it with a PDF referring to the main Tutorial. Thanks, -Michiel. From chapmanb at 50mail.com Mon Aug 20 12:45:49 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Aug 2012 08:45:49 -0400 Subject: [Biopython-dev] [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: <87harxzq82.fsf@fastmail.fm> Lenna; Thanks for the documentation and getting that all code moved into a branch. This looks great and looking forward to having it merged when Peter's work goes in. Thanks also for all the great work this summer and good luck on the first day of PhD school, Brad > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping > > Additionally, at Peter's suggestion, I've written a clarification of > strand as it relates to transcription and translation. It's available > here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit > > It's been a great experience working with this project this summer. > Thank you to everyone involved. 
> > Cheers, > > Lenna > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From redmine at redmine.open-bio.org Tue Aug 21 10:27:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 10:27:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] (New) PDBParser fails to parse PDBs produced by PatchDock Message-ID: Issue #3379 has been reported by David Cain. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. 
However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
From redmine at redmine.open-bio.org Tue Aug 21 10:36:07 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 10:36:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by Peter Cock. If as I understood you, PatchDock is producing invalid PDB files, have you raised the issue with them too? I accept that out of practicality, a little lenience in our parsers can be helpful, and may be appropriate in this case. Do you have any sample data files you could share - for example a valid PDB file before processing, and the problematic PDB file after processing with PatchDock? ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files.
Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. 
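The clean-up script mentioned in the report could look roughly like the sketch below. It assumes the complex really is a plain concatenation of two PDB files, drops any CONECT, MASTER or END record that appears before the last ATOM/HETATM line, and makes sure the result ends with a single END. The function name clean_concatenated_pdb is hypothetical and not part of Bio.PDB. The sketch does not renumber atom serials or resolve clashing chain identifiers, and it throws away the receptor's CONECT records, which may or may not be acceptable.

```python
def clean_concatenated_pdb(lines):
    """Return PDB lines with premature trailer records removed.

    Hypothetical helper (not part of Bio.PDB). Assumes `lines` holds two
    PDB files that were concatenated without any other changes.
    """
    coord_records = ("ATOM  ", "HETATM")
    trailer_records = ("CONECT", "MASTER", "END   ")

    lines = [line.rstrip("\n") for line in lines]

    # Index of the last coordinate record; -1 if there are none.
    last_coord = max(
        (i for i, line in enumerate(lines)
         if line[:6].ljust(6) in coord_records),
        default=-1,
    )

    cleaned = []
    for i, line in enumerate(lines):
        record = line[:6].ljust(6)
        # Trailer records before the last coordinate line belong to the
        # first input file and would make a parser stop too early.
        if i < last_coord and record in trailer_records:
            continue
        cleaned.append(line)

    # Terminate the file with END if it is not already the last record.
    if not cleaned or cleaned[-1][:6].ljust(6) != "END   ":
        cleaned.append("END")
    return cleaned
```

Typical use would be to read the complex with open(...).readlines(), pass the list through clean_concatenated_pdb, write the result to a new file, and hand that file to PDBParser. ENDMDL records are left alone because only an exact END record is treated as a trailer.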
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Aug 21 11:08:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 11:08:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by João Rodrigues. Disclaimer: I am a HADDOCK team member and therefore in direct competition with PATCHDOCK. I totally disagree with this. This is not compliant with the PDB format at all: "Each file should terminate with a line containing only the word END". Having data beyond END is just bad practice in my opinion. There are two statements to close a chain/model - ENDMDL and TER - and these should be used. Sorry to be a pain, but if we fix this we are just encouraging a bad practice. Standards are there to be respected. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse.
PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Aug 21 11:21:57 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 11:21:57 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by Peter Cock. Given Joao's comments, lenience does not sound appropriate in this case. If the parser's current behaviour is to silently ignore data after an END line, that seems less than ideal. How about we add a clear error/warning to the parser if there is content in the file after an END line? i.e. Treat it as an exception in strict mode, treat it as a warning in permissive mode (and continue to ignore anything after the END line)? A sample file would be helpful to verify this, and could even be used for a unit test (with your permission). ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. 
This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Aug 21 11:26:48 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 11:26:48 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by David Cain. 
I completely agree with João, actually: disrespecting the file spec is a bad idea. I just figured I'd bring this up for discussion. I very much think a warning of some sort should be raised, though. Half the structure silently failing to parse is a big problem. I think your solution is perfect, and I'd be very happy to write the unit test. I'll upload a sample file in just a bit. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts).
Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Aug 21 12:05:37 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 12:05:37 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by David Cain. File complex.1.pdb added I ran PatchDock's antigen-antibody complex mode on an antigen and antibody file (2fgw and 5ebx) that individually parse without warnings. (Note that I chose these files at random; their docking is useful only as an example). I've attached the complex file produced by @PatchDock/transOutput.pl@) (only the top-scoring conformation considered). As you can see, the @CONECT@ and @END@ records of the antibody will stop the rest of the file from being parsed. 
I'd be happy to take a stab at writing the error/warning message for premature @END@/@CONECT@ records in addition to the unit test that checks for this behavior. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? 
If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue Aug 21 12:35:07 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 Aug 2012 12:35:07 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by João Rodrigues. Agreed with Peter that it should raise an exception/warning. This is really pure concatenation of the two PDBs. If you could have a go at it, I could test it too. Thanks David. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files.
Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. 
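The strict/permissive behaviour proposed earlier in this thread (an exception in strict mode, a warning in permissive mode, with data after END still ignored) could be sketched as below. This is only an illustration and not the Bio.PDB implementation: the function check_after_end, its permissive argument, and the TrailingDataWarning and TrailingDataError classes are hypothetical stand-ins, used so the example runs without Biopython installed. A real patch would presumably reuse the existing PDBConstructionWarning and PDBConstructionException classes instead.

```python
import warnings


class TrailingDataWarning(UserWarning):
    """Hypothetical stand-in for a parser warning."""


class TrailingDataError(ValueError):
    """Hypothetical stand-in for a parser exception."""


def check_after_end(lines, permissive=True):
    """Report coordinate records that follow an END record.

    Returns True if nothing suspicious follows END. If ATOM/HETATM
    records are found after END, warns and returns False in permissive
    mode, or raises TrailingDataError in strict mode.
    """
    end_seen_at = None
    for number, line in enumerate(lines, start=1):
        record = line[:6].strip()
        if end_seen_at is None:
            # ENDMDL is a different record and does not count here.
            if record == "END":
                end_seen_at = number
            continue
        if record in ("ATOM", "HETATM"):
            message = (
                "coordinate record on line %d follows END on line %d; "
                "it will be ignored" % (number, end_seen_at)
            )
            if not permissive:
                raise TrailingDataError(message)
            warnings.warn(message, TrailingDataWarning)
            return False
    return True
```

The check only looks at END, as in the proposal; whether a premature CONECT block should be reported the same way is a separate decision that the sample file attached to the issue would help settle.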
-- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Tue Aug 21 16:01:21 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 21 Aug 2012 18:01:21 +0200 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm> Message-ID: On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock wrote: > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote: >> Michiel; >>> Hi Eric, Peter, >>> >>> > How about Bio.Search, for now? >>> >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells >>> users something about what the module is for. Bio.Search could be >>> anything (search PubMed? search the Entrez databases? search Google? >>> anyway Bio.Search does not suggest that this module is about pairwise >>> alignments). But Peter previously mentioned that he doesn't like >>> Bio.Pairwise; can we convince you? >> >> I agree with Peter on this one. The module is primarily about searching >> a sequence database with an input via multiple methods, not about >> pairwise alignment of two sequences with is what Bio.Align.Pairwise >> suggests to me. >> >> Brad > > On potential problem with Bio.Search (on top of concerns raised > here about vagueness) Bow and I were just talking about during > our weekly GSoC video call was the existence of Bio/Search.py > which is obsolete and long overdue for removal. I have just > deprecated it (something I forgot to do before the last release): > https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8 > > We'd earlier talked about using Bio.Search as the namespace. 
I was > worried about the potential existence on a user's machine of both > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py > (aka SearchIO, the new module) and which would take precedence > when doing: from Bio import Search > > Given how Python module installations work, that seems highly > likely to occur. The good news is that the package would take > priority - see http://www.python.org/doc/essays/packages.html > >>>>> What If I Have a Module and a Package With The Same Name? >>>>> >>>>> You may have a directory (on sys.path) which has both a module >>>>> spam.py and a subdirectory spam that contains an __init__.py >>>>> (without the __init__.py, a directory is not recognized as a package). >>>>> In this case, the subdirectory has precedence, and importing spam >>>>> will ignore the spam.py file, loading the package spam instead. If >>>>> you want the module spam.py to have precedence, it must be >>>>> placed in a directory that comes earlier in sys.path. > > So there is no technical reason to avoid Bio.Search as an > option for the Bio.SearchIO namespace. We could then > have Bio.Search.Applications for command line wrappers, > consistent with Bio.Phylo.Applications, Bio.Motif.Applications > and Bio.Align.Applications. > > Of course, Bio.Search is still perhaps too broad a name... but > on balance perhaps it is still better than Bio.SearchIO? > > Regards, > > Peter Hi everyone, If I may add my two cents, for now I am in favor of putting the module under Bio.Search. It is not the best name out there (it does sound a bit vague), but it's the one that seems to be the most intuitive (until a better alternative comes up). There were some other alternatives that Peter and I have discussed, but they seem less appealing to us. You're free to add your thoughts on these of course :) : - Bio.SeqSearch. This sounds ok, but when you consider we have Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes quite confusing quickly.
- Bio.PSearch ('p' for pairwise). This one seemed the least intuitive among the three options, so I'm not so big on this. For now, I'm still writing everything (code, docstrings, tutorial) using SearchIO. I suppose it's better if we could agree on a more suitable name, though. On another note, I'm also in favor of using the Bio.Phylo module skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence search-related application wrappers under Applications (I actually prefer 'app' for better PEP8 compliance, but that's another discussion) and perhaps even refactor our remote search calls (e.g. the 'qblast' module) under Bio.Search as well. cheers, Bow From w.arindrarto at gmail.com Tue Aug 21 16:09:07 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 21 Aug 2012 18:09:07 +0200 Subject: [Biopython-dev] GSoC Project Update -- 10 In-Reply-To: References: Message-ID: Hi everyone, I've just posted my last entry for my Google Summer of Code project this year: http://bow.web.id/blog/2012/08/summers-over/ I want to say thank you to the Biopython community, especially Peter for mentoring me this summer :), to OBF for accepting my proposal, and to anyone who has helped and given me valuable input throughout the project :). It's been a priceless learning experience, and I only hope that my code will be useful in return. There are still some things to do before the code is merge-ready and even more when the code is included in an official release, so I'll still be around. cheers, Bow From mictadlo at gmail.com Wed Aug 22 00:55:30 2012 From: mictadlo at gmail.com (Mic) Date: Wed, 22 Aug 2012 10:55:30 +1000 Subject: [Biopython-dev] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: Hi, Python is able to connect to D with the help of http://pyd.dsource.org/ .
Maybe it would be something for Biopython. Cheers, Mic On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni wrote: > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > *Summary* > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end > of the summer. At least in GSoC terms. Should I say end of the project? I > don't think so. The tools can still be improved, and the Ruby bindings > should follow. > > The major changes since the last release include the following: > > - filtering functionality has been moved to a separate utility: > gff3-filter, along with a new language for specifying filtering > expressions, > - conversion to table format of selected fields has been moved to a > separate utility: gff3-select. However, the --select option is still > part of > gff3-filter, > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files > for CDS and mRNA records and features, > - man pages for utilities. > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > The Ruby bindings part didn't work out because there is still no support > for D shared libraries in Linux, but instead there are now a few useful > command-line tools for processing GFF3 which can be used without > programming knowledge. > > To me, the summer was fun, challenging, and a great experience. I even got > to meet my mentor in person, and other community members too, and to make > my first steps in bioinformatics. I even gave a small presentation at the > EU-codefest. What a summer it was! > > Thanks to everybody who made it possible: Google, Open Bioinformatics > Foundation and my mentor Pjotr Prins.
> > -- > Marjan > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From p.j.a.cock at googlemail.com Wed Aug 22 08:42:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 09:42:03 +0100 Subject: [Biopython-dev] [GSoC] GSoC Project Update -- 10 In-Reply-To: References: Message-ID: On Tue, Aug 21, 2012 at 5:09 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I've just posted my last entry for my Google Summer of Code project > this year: http://bow.web.id/blog/2012/08/summers-over/ > > I want to say thank you to the Biopython community, especially Peter > for mentoring me this summer :), to OBF for accepting my proposal, and > to anyone who has helped and given me valuable inputs for me > throughout the project :). > > It's been a priceless learning experience, and I only hope that my > code will be useful in return. > > There are still some things to do before the code is merge-ready and > even more when the code is included in an official release, so I'll > still be around. > > cheers, > Bow Thank you Bow, It has been a pleasure to mentor you, and I'm excited about getting this (and Lenna's and other branches) into Biopython. Now, back to the module naming discussion... ;) http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009868.html http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009888.html Peter From p.j.a.cock at googlemail.com Wed Aug 22 11:07:11 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 12:07:11 +0100 Subject: [Biopython-dev] Beta code in the official releases? 
Message-ID: Hi all, One of the ideas I discussed with Bow during this GSoC project was introducing a new warning, something like Bio.BiopythonBetaCode (the exact name isn't important), to be used to label new experimental modules for which we *expect* there to be changes in the next release. The idea is to combine the simplicity of distribution and installation of the 'monolithic' Biopython library with some of the flexibility offered by a more modular approach. This would be particularly helpful for those on Windows, where installing a Biopython branch from git is quite a daunting task. The idea is that in one of the next releases you'd be able to try Bio.SearchIO (or Bio.Struct or GFF or Variants or ...) and see something like this: >>> from Bio import SearchIO Bio/SearchIO/__init__.py:16: BiopythonBetaCode: Bio.SearchIO is in beta, and likely to change warnings.warn("Bio.SearchIO is in beta, and likely to change", BiopythonBetaCode) By using a specific warning class, any keen beta tester can silence all the BiopythonBetaCode warnings if they wish to. Is anyone familiar enough with Linux packaging policies to have any thoughts on how they would treat this? Provided we only use this for self-contained modules, they could potentially split the beta-modules into a sub-package (in the same way that Biopython and its BioSQL support are split in Debian). I envision using this as a way to encourage wider 'beta testing' of self-contained modules which are close to a stable release. Does anyone think this is a good idea? Are there any downsides I'm overlooking?
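For illustration, here is a minimal sketch of the warning-class idea using only the standard library. BiopythonBetaCode is the placeholder name from the proposal, and import_beta_module is a hypothetical stand-in for the warnings.warn call that a beta module's __init__.py would make at import time; neither is an existing Biopython API.

```python
import warnings


class BiopythonBetaCode(Warning):
    """Placeholder warning class for experimental modules (name not final)."""


def import_beta_module():
    # Hypothetical stand-in for the top of Bio/SearchIO/__init__.py.
    warnings.warn("Bio.SearchIO is in beta, and likely to change",
                  BiopythonBetaCode)


def count_beta_warnings(silence):
    """Run the fake import and return how many beta warnings were shown."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from a known filter state
        if silence:
            # The one line a keen beta tester would add to their own script.
            warnings.simplefilter("ignore", BiopythonBetaCode)
        import_beta_module()
    return sum(issubclass(w.category, BiopythonBetaCode) for w in caught)


print(count_beta_warnings(silence=False))
print(count_beta_warnings(silence=True))
```

Because the filter matches on the class, silencing the beta warnings this way should leave any other warnings (deprecations, for example) untouched.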
Thanks, Peter From p.j.a.cock at googlemail.com Wed Aug 22 11:10:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 12:10:56 +0100 Subject: [Biopython-dev] [BioRuby] [GSoC] Final GSoC report In-Reply-To: <20120822104352.GA11847@thebird.nl> References: <20120822104352.GA11847@thebird.nl> Message-ID: On Wed, Aug 22, 2012 at 11:43 AM, Pjotr Prins wrote: > Yes, linking to D from an interpreted language is not hard, basically > it is the same calling convention as that of C. So a D shared library > looks the same as a C shared library to the calling code - all > existing foreign function interfaces (FFI) work. That is the good > news. How do things stand from a cross-platform perspective? i.e. When might this be doable on Linux, Mac OS X, and Windows? (and other Unix-like platforms of potential interest) > The bad news, as Artem points out, is that there is a problem in the > D garbage collector. Items get collected, which should not. This will > be fixed sooner or later. The commitment is there, and it is moving > up the priority list. Is there a D issue/bug tracker for this? Thanks, Peter From chapmanb at 50mail.com Thu Aug 23 00:42:09 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 22 Aug 2012 20:42:09 -0400 Subject: [Biopython-dev] Beta code in the official releases? In-Reply-To: References: Message-ID: <877gsq8mn2.fsf@fastmail.fm> Peter; +1. I'm for making the process of getting new code into Biopython a bit quicker and this seems like a nice step in that direction. When code has been well designed, tested and documented, this will help speed the transition into releases and get more eyes on it quicker, while allowing some potential breaking changes as beta functionality gets finalized.
Thanks for the good suggestion, Brad From redmine at redmine.open-bio.org Mon Aug 27 04:24:16 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 27 Aug 2012 04:24:16 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by David Cain. Regarding "pure concatenation," I wasn't exaggerating when I said really ugly Perl scripts. =) I created a "pull request on the Biopython GitHub repository":https://github.com/biopython/biopython/pull/60. Could you give me some feedback on my solution? If the devs agree on a certain behavior, I'll start writing some unit tests. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but that PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were: the only thing that changes is ATOM coordinates.
This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world, the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use Biopython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From Andrew.Sczesnak at med.nyu.edu Wed Aug 29 17:54:08 2012 From: Andrew.Sczesnak at med.nyu.edu (Sczesnak, Andrew) Date: Wed, 29 Aug 2012 17:54:08 +0000 Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <877gsq8mn2.fsf@fastmail.fm> References: , <877gsq8mn2.fsf@fastmail.fm> Message-ID: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> +1 It's been over a year since I first submitted my MAF code! _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Thu Aug 30 08:16:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 Aug 2012 09:16:13 +0100 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On Tue, Aug 14, 2012 at 12:06 PM, Peter Cock wrote: > On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote: > > Further to that work, I updated some older code for a JAGGY > sigil, and also an OCTO sigil (names open to suggestions), > which are on my gd-sigils branch which has documentation > in the tutorial, including this image of the expanded sigil set: > https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png > > 
This is a slight simplification of the old JAGGY code in that it > does not (yet) allow control of the teeth length (e.g. to have just > teeth on one end). I am thinking this could be exposed like > the existing arrow specific options. > > I originally created the JAGGY sigil for marking a break point > in a contig/scaffold. For instance, you might want to mark a > run of NNNNN bases in a scaffold with a jaggy sigil (straddling > both strands) as a clear visual marker to explain why there > were no genes. > > Other sigil ideas I pondered include an OVAL, which should > be quite easy for the linear diagrams, but rather more work to > implement for circular diagrams due to the distorted curves. > > Peter Do people think (either of) these two sigils are worth adding to the main branch? Potentially they can be generalised - the JAGGY sigil in particular would be much more flexible if the head & tail teeth presence (or tooth length?) could be controlled. e.g. to draw a sigil with a flat edge on the left, and a jagged edge on the right. Peter From Leighton.Pritchard at hutton.ac.uk Thu Aug 30 08:51:50 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 30 Aug 2012 08:51:50 +0000 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote: On Tue, Aug 14, 2012 at 12:06 PM, Peter Cock wrote: On Thu, Aug 2, 2012 at 5:12 PM, Peter Cock wrote: Further to that work, I updated some older code for a JAGGY sigil, and also an OCTO sigil (names open to suggestions), which are on my gd-sigils branch which has documentation in the tutorial, including this image of the expanded sigil set: https://github.com/peterjc/biopython/blob/e09e264dd73953554609498c15b67d86686592fb/Doc/images/GD_sigils.png [...] Do people think (either of) these two sigils are worth adding to the main branch? Yes - I do. L.
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 From p.j.a.cock at googlemail.com Thu Aug 30 10:18:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 Aug 2012 11:18:57 +0100 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On Thu, Aug 30, 2012 at 9:51 AM, Leighton Pritchard wrote: > > On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote: >> Do people think (either of) these two sigils are worth adding >> to the main branch? > > Yes - I do. > > L. Done. Branch rebased and applied to master.
Peter From p.j.a.cock at googlemail.com Thu Aug 30 11:46:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 Aug 2012 12:46:05 +0100 Subject: [Biopython-dev] Genome Diagram Sigils, was: Default Behavior In-Reply-To: References: Message-ID: On Thu, Aug 30, 2012 at 11:18 AM, Peter Cock wrote: > On Thu, Aug 30, 2012 at 9:51 AM, Leighton Pritchard > wrote: >> >> On 30 Aug 2012, at Thursday, August 30, 09:16, Peter Cock wrote: >>> Do people think (either of) these two sigils are worth adding >>> to the main branch? >> >> Yes - I do. >> >> L. > > Done. Branch rebased and applied to master. > > Peter And you can see the example in the Tutorial here, http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#sec:gd_sigils (These sigils all work on circular diagrams too, see the examples made by test_GenomeDiagram.py) Peter
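The generalisation floated earlier in this thread (separate control over the head and tail teeth of the JAGGY sigil) can be pictured with a small geometry sketch. This is not the code on the gd-sigils branch: the function name jaggy_outline and its parameters (teeth, depth, head, tail) are hypothetical, and the function only computes the vertex list that one might hand to a polygon-drawing primitive.

```python
def jaggy_outline(x0, x1, y0, y1, teeth=3, depth=5.0, head=True, tail=True):
    """Return polygon vertices (counter-clockwise) for a jagged box.

    head/tail switch the zig-zag on the right/left edge on or off;
    an edge without teeth is left flat.
    """
    def edge(x, inward, y_start, y_end, jagged):
        # Vertices strictly between the two corners of one vertical edge.
        if not jagged:
            return []
        step = (y_end - y_start) / (2.0 * teeth)
        points = []
        for i in range(1, 2 * teeth):
            # Odd steps are notches cut into the box, even steps sit on the edge.
            dx = inward * depth if i % 2 else 0.0
            points.append((x + dx, y_start + i * step))
        return points

    outline = [(x0, y0), (x1, y0)]           # bottom edge, left to right
    outline += edge(x1, -1, y0, y1, head)    # right edge, going up
    outline += [(x1, y1), (x0, y1)]          # top edge, right to left
    outline += edge(x0, +1, y1, y0, tail)    # left edge, going down
    return outline


# Flat edge on the left, jagged edge on the right.
for vertex in jaggy_outline(0, 100, 0, 10, teeth=2, depth=4, tail=False):
    print(vertex)
```

Keeping head and tail as independent switches mirrors the "flat edge on the left, jagged edge on the right" case mentioned above; a circular diagram would additionally need these vertices mapped onto the track's arc.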