From w.arindrarto at gmail.com Tue May 1 10:52:38 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 1 May 2012 16:52:38 +0200
Subject: [Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Apr 30, 2012 at 12:57, Peter Cock wrote:
> On Mon, Apr 30, 2012 at 11:08 AM, Wibowo Arindrarto wrote:
>>
>> I'm thinking of using the Search object as the object returned by
>> SearchIO.parse or SearchIO.read. That way, we can store attributes
>> common to the different search queries in it. For example:
>>
>> >>> search = SearchIO.parse('blast_result.xml', 'blast-xml')
>> >>> search.format
>> 'blast-xml'
>> >>> search.algorithm
>> 'blastx'
>> >>> search.version
>> '2.2.26+'
>> >>> search.database
>> 'refseq_protein'
>> >>> search.results
>>
>>
>> And iteration over the results would be done like this (for example):
>> >>> for result in search.results:
>> ...     print(result.query, len(result))
>>
>> Additionally, we can also define __iter__ and next for Search so we can
>> just do the following:
>> >>> for result in search:
>> ...     print(result.query, len(result))
>>
>> What do you think?
>
> I think you'll get in a mess with multiple iterators all sharing the
> same handle and competing over using it - but maybe I'm not
> grasping what you have in mind.
>
> Initially keep it simple: The primary public API would be
>
> for result in Bio.SearchIO.parse(...):
>     print(result.query, len(result))
>
> where each iteration gives a complete result set for one query.
>
> Peter
>
> P.S. With SearchIO subject to name space discussions ;)

Hmm... OK. I'll stick to the simpler API for the initial implementation,
and see later whether it's feasible to add more details :) (and perhaps
change the namespace too, as touched on earlier).
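The simple API Peter describes, and the shared-handle problem he warns
about, can be illustrated with a small sketch. Everything here is an
assumption made for illustration: the two-column text format, the parse()
function and the tuple it yields are hypothetical and are not the
Bio.SearchIO API.

```python
import io


def parse(handle):
    """Yield (query_id, hits) once per query from a toy two-column format.

    Hypothetical stand-in for a real search-output parser: each line is
    'query_id<TAB>hit_id', and lines for one query are contiguous.
    """
    current_query, hits = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        query_id, hit_id = line.split("\t")
        if query_id != current_query:
            if current_query is not None:
                yield current_query, hits
            current_query, hits = query_id, []
        hits.append(hit_id)
    if current_query is not None:
        yield current_query, hits


TOY_OUTPUT = "q1\thitA\nq1\thitB\nq2\thitC\nq3\thitD\n"

# One iterator, one handle: each step is a complete result set for one query.
results = list(parse(io.StringIO(TOY_OUTPUT)))

# Two iterators on the same handle compete for lines.
shared = io.StringIO(TOY_OUTPUT)
first, second = parse(shared), parse(shared)
from_first = next(first)    # reads ahead to the first q2 line
from_second = next(second)  # starts after that line, so it never sees q1 or q2
```

With a single generator the caller gets q1, q2 and q3 in order. With two
generators on one handle, the second one silently starts at q3, which is
the kind of mess a single parse() iterator avoids.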
Bow

From w.arindrarto at gmail.com Wed May 2 04:17:19 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 2 May 2012 10:17:19 +0200
Subject: [Biopython-dev] HMMER (+ BLAT) wrappers
Message-ID:

Hi everyone,

The past week I've been trying to generate some test cases for BLAST,
HMMER, et al. I was writing some short scripts to automate the test
case generation when I realized that Biopython doesn't have wrappers
for HMMER and BLAT, so I decided to write them. The code is here:
https://github.com/bow/gsoc/blob/master/hmmer/_HMMER.py and here:
https://github.com/bow/gsoc/blob/master/blat/_BLAT.py.

If it is of general interest to Biopython, I'd love to submit a pull
request for these wrappers. They were primarily written for test case
generation, but I imagine they won't require that many tweaks to make
them suitable for inclusion in Biopython. However, before I can do that,
there are some issues that I think need to be discussed:

1. Where should the wrappers be put? I noticed that different wrappers
are located in different directories according to their 'theme' (e.g.
BLAST wrappers in Bio.Blast.Applications and the ClustalW wrapper in
Bio.Align.Applications). For the HMMER wrapper, should it be put
inside Bio.Motif.Applications? For the BLAT wrapper, should I create a
new Bio.Blat folder just for it? Yesterday I thought maybe it would be
easier if all application wrappers were put inside the same directory
(e.g. all in Bio.Applications), so maybe that's a viable option for
future releases?

2. How should shared options among slightly different programs be
handled? We can rely on creating abstract subclasses for them, but I
find it easier to simply create lists and then combine them in the
different programs. The current HMMER wrapper employs both of these
approaches, but I think it needs to stick to just one approach to make
the code easier to understand.

3. Is there a convention for naming the command line arguments?
For example, if the command line option is '--domE', should I name the
Python variable 'domE', 'dome', 'dom_e', or 'dom_E'?

4. For the HMMER wrapper, there are some flags that are exclusive to
each other (i.e. the user can only choose one of the flags). If the
user chooses both, HMMER doesn't show any error messages, but nothing
is run. Should the wrapper check for such mutually exclusive flags
when it's created as well?

5. For BLAT, the installed suite includes a program that runs a BLAT
server to handle search requests from different clients. It doesn't
seem to be a typical program that should be wrapped by Biopython, but
I might be wrong. Should a wrapper for the server be included as well?

cheers,
Bow

From chris.mit7 at gmail.com Wed May 2 10:50:05 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Wed, 2 May 2012 10:50:05 -0400
Subject: [Biopython-dev] HMMER (+ BLAT) wrappers
In-Reply-To:
References:
Message-ID:

Hey Bow,

I think it would be better to have an option to send the query to the
local server should one be running as opposed to wrapping a gfServer
that would be local for the duration of a given python process. This
would allow for cases where someone has their BLAT queries split up in
a script and not incur the loading time for the database multiple
times. The gfServer/gfQuery setup is also rather a pain to use from my
experience (it's all relative paths). I also think using the -pslx
output would be a better default since -psl doesn't provide you with
the sequence alignments.

Chris

> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From w.arindrarto at gmail.com Wed May 2 11:21:33 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 2 May 2012 17:21:33 +0200
Subject: [Biopython-dev] HMMER (+ BLAT) wrappers
In-Reply-To:
References:
Message-ID:

On Wed, May 2, 2012 at 4:50 PM, Chris Mitchell wrote:
> Hey Bow,
>
> I think it would be better to have an option to send the query to the local
> server should one be running as opposed to wrapping a gfServer that would be
> local for the duration of a given python process. This would allow for
> cases where someone has their BLAT queries split up in a script and not
> incur the loading time for the database multiple times. The
> gfServer/gfQuery setup is also rather a pain to use from my experience (it's
> all relative paths). I also think using the -pslx output would be a better
> default since -psl doesn't provide you with the sequence alignments.
>
> Chris

Hi Chris,

You are talking about 'gfServer query ...', right? That does make
gfServer usable enough to be wrapped. Thanks for pointing that out; I
overlooked that gfServer can also do queries. I admit that the current
BLAT wrapper is very minimal, but like I said, it shouldn't take that
much time to write wrappers for the rest of the executables in the
suite :).

As for the file format, I actually prefer leaving the options as they
are (i.e. the program's default) to keep surprises for users to a
minimum (even though I agree that the pslx output is more informative).
Bow

From redmine at redmine.open-bio.org Wed May 2 15:04:38 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 2 May 2012 19:04:38 +0000
Subject: [Biopython-dev] [Biopython - Bug #3348] (New) Documentation error
Message-ID:

Issue #3348 has been reported by Patrick P.

----------------------------------------
Bug #3348: Documentation error
https://redmine.open-bio.org/issues/3348

Author: Patrick P
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

4.3.3 Sequence
[...]
For example consider a (short) gene sequence with location 5:18 on the
reverse strand, which in GenBank/EMBL notation using 1-based counting
would be complement(4..18), like this:
[...]
-------------------------
                                                                         vvv
                                                                          v
 ---> in GenBank/EMBL notation using 1-based counting would be complement(6..18)
                                                                          ^
                                                                         ^^^
----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Thu May 3 13:50:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 May 2012 18:50:55 +0100
Subject: [Biopython-dev] SeqIO circular
In-Reply-To:
References:
Message-ID:

On Saturday, April 28, 2012, Matthias Bernt wrote:
> Dear developers,
>
> I would like to suggest a quick "fix" for the problem. Currently the
> parser just returns true per default for the circular property. This
> is a wrong piece of information for all circular sequences.
> Furthermore it's not possible to detect if the parser did return true
> because it is its default value or if it's really from the data. So I
> suggest to return None if the parser does not parse the information.
>
> What do you think? This should be possible with minimal effort.

The parsing side of this is trivial - the only piece missing is how
best to present the information in the SeqRecord for BioSQL
compatibility (and perhaps some extra work on our BioSQL bindings).
That requires someone to test where BioPerl stores this in BioSQL (as
that is the reference implementation). Without that, a "quick fix" will
most likely create a bug in our BioSQL support - in that we wouldn't
store the circular field in the same way as the other Bio*
implementations.

Peter

From redmine at redmine.open-bio.org Fri May 4 15:33:54 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 4 May 2012 19:33:54 +0000
Subject: [Biopython-dev] [Biopython - Bug #3349] (New) Bio.PDB.Entity.copy child handling errors.
Message-ID:

Issue #3349 has been reported by Alexander Ford.
----------------------------------------
Bug #3349: Bio.PDB.Entity.copy child handling errors.
https://redmine.open-bio.org/issues/3349

Author: Alexander Ford
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL: https://github.com/asford/biopython/tree/entity_copy_fix

The Bio.PDB.Entity.copy function, introduced in revision 0d8299a9, does
not properly handle the entity child list. Iteration over the child_dict
results in a loss of child ordering and the explicit call to
detach_child results in a destructive modification of the copied
entity's child elements' parent reference.

The copy function should instead instantiate empty child_list and
child_dict elements in the copy object and then add copies of each
element in the source object's child_list via the Entity.add function.
This will appropriately update the copy's child_dict and the child's
parent reference while preserving child ordering.
----------------------------------------

From p.j.a.cock at googlemail.com Sat May 5 05:04:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 5 May 2012 10:04:58 +0100
Subject: [Biopython-dev] Fwd: [biopython] Fixing Entity copy method to preserve child ordering. (#37)
In-Reply-To:
References:
Message-ID:

Who wants to review this one?

Peter

---------- Forwarded message ----------
From: *asford*
Date: Friday, May 4, 2012
Subject: [biopython] Fixing Entity copy method to preserve child ordering. (#37)
To: Peter Cock

Pull request to fix issue #3349.
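The copy behaviour requested in bug #3349 can be sketched with a toy
class. This is a minimal illustration under stated assumptions, not the
code in the pull request: the Entity class below is a hypothetical
stand-in with only the child_list, child_dict, parent and add names taken
from the report, and clearing the copy's own parent is an assumption.

```python
from copy import copy


class Entity:
    """Toy stand-in for a Bio.PDB-style container; not the real class."""

    def __init__(self, id):
        self.id = id
        self.parent = None
        self.child_list = []
        self.child_dict = {}

    def add(self, child):
        """Attach a child, updating the list, the dict and the child's parent."""
        child.parent = self
        self.child_list.append(child)
        self.child_dict[child.id] = child

    def copy(self):
        """Copy this entity and its children, preserving child order."""
        shallow = copy(self)
        # Start from empty containers so the source's children are untouched.
        shallow.child_list = []
        shallow.child_dict = {}
        shallow.parent = None  # assumption: the copy starts out detached
        # Walk child_list (ordered), not child_dict, and re-add copies.
        for child in self.child_list:
            shallow.add(child.copy())
        return shallow


chain = Entity("A")
for residue_id in (3, 1, 2):
    chain.add(Entity(residue_id))

clone = chain.copy()
clone_order = [child.id for child in clone.child_list]
```

Because the children are re-added through add(), the clone's child_dict
and each copied child's parent reference are set in one place, and the
source entity's children keep pointing at the source.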
You can merge this Pull Request by running:

  git pull https://github.com/asford/biopython entity_copy_fix

Or you can view, comment on it, or merge it online at:

  https://github.com/biopython/biopython/pull/37

-- Commit Summary --

* Fixing Entity copy method to preserve child ordering.

-- File Changes --

M Bio/PDB/Entity.py (11)

-- Patch Links --

  https://github.com/biopython/biopython/pull/37.patch
  https://github.com/biopython/biopython/pull/37.diff

---
Reply to this email directly or view it on GitHub:
https://github.com/biopython/biopython/pull/37

From eric.talevich at gmail.com Sat May 5 12:09:55 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 5 May 2012 12:09:55 -0400
Subject: [Biopython-dev] Fwd: [biopython] Fixing Entity copy method to preserve child ordering. (#37)
In-Reply-To:
References:
Message-ID:

I'll check it out today.

From p.j.a.cock at googlemail.com Sun May 6 07:09:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 6 May 2012 12:09:30 +0100
Subject: [Biopython-dev] Fwd: 2012 SciPy Bioinformatics Workshop
In-Reply-To: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com>
References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com>
Message-ID:

Dear Biopythoneers,

Are any of us planning to attend the SciPy meeting? The 2012 SciPy
Bioinformatics Workshop is crying out for a Biopython related talk...
and from the email below it sounds like they're not just looking for a
developer's perspective, but also how Python is being used in
bioinformatics.

It is quite close after BOSC and ISMB, but July 19 doesn't actually
clash:
http://www.open-bio.org/wiki/BOSC_2012

SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it
clashes with the planned CodeFest too:
http://www.open-bio.org/wiki/EU_Codefest_2012

July is definitely conference season...

Peter

---------- Forwarded message ----------
From: *Chris Mueller*
Date: Thursday, May 3, 2012
Subject: [Numpy-discussion] 2012 SciPy Bioinformatics Workshop
To: "chris.mueller at lab7.io"

We are pleased to announce the 2012 SciPy Bioinformatics Workshop held
in conjunction with SciPy 2012 this July in Austin, TX.

Python in biology is not dead yet... in fact, it's alive and well!
Remember just a few short years ago when BioPerl ruled the world?
Just one minor paradigm shift* later and Python now has a commanding
presence in bioinformatics. From Python bindings to common tools all
the way to entire Python-based informatics platforms, Python is used
everywhere** in modern bioinformatics.

If you use Python for bioinformatics or just want to learn more about
how it's being used, join us at the 2012 SciPy Bioinformatics Workshop.
We will have speakers from both academia and industry showcasing how
Python is enabling biologists to effectively work with large, complex
data sets.

The workshop will be held the evening of July 19 from 5-6:30.

More information about SciPy is available on the conference site:
http://conference.scipy.org/scipy2012/

!! Participate !!

Are you using Python in bioinformatics? We'd love to have you share
your story. We are looking for 3-4 speakers to share their experiences
using Python for bioinformatics.

Please contact Chris Mueller at chris.mueller [at] lab7.io and Ray
Roberts at rroberts [at] enthought.com to volunteer. Please include a
brief description or link to a paper/topic which you would like to
discuss. Presentations will last for 15 minutes each and will be
followed by a panel Q&A.

--
* That would be next generation sequencing
** Yes, we aRe awaRe of that otheR language used eveRywhere, but let's
celebRate Python Right now.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

From tiagoantao at gmail.com Sun May 6 07:16:36 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sun, 6 May 2012 12:16:36 +0100
Subject: [Biopython-dev] [Biopython] Fwd: 2012 SciPy Bioinformatics Workshop
In-Reply-To:
References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com>
Message-ID:

Hi,

On Sun, May 6, 2012 at 12:09 PM, Peter Cock wrote:
> SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it
> clashes with the planned CodeFest too:
> http://www.open-bio.org/wiki/EU_Codefest_2012

Are any people from here going to the codefest?

Tiago

From p.j.a.cock at googlemail.com Mon May 7 04:37:38 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 May 2012 09:37:38 +0100
Subject: [Biopython-dev] [Biopython] Fwd: 2012 SciPy Bioinformatics Workshop
In-Reply-To:
References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com>
Message-ID:

On Sun, May 6, 2012 at 12:16 PM, Tiago Antão wrote:
> Hi,
>
> On Sun, May 6, 2012 at 12:09 PM, Peter Cock wrote:
>> SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it
>> clashes with the planned CodeFest too:
>> http://www.open-bio.org/wiki/EU_Codefest_2012
>
> Are any people from here going to the codefest?
>
> Tiago

Brad is going to the pre-BOSC CodeFest in California,
http://www.open-bio.org/wiki/Codefest_2012

I'm not sure if we have any Biopython folk signed up for the
post-BOSC EU CodeFest in Italy yet.
http://www.open-bio.org/wiki/EU_Codefest_2012

I aim to attend one of the CodeFests - trying to firm up summer
travel plans now...
Peter

From arklenna at gmail.com Sun May 6 17:26:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 6 May 2012 17:26:30 -0400
Subject: [Biopython-dev] GSoC python variant update
Message-ID:

Hi all,

I've written a few new posts on my blog; here's the latest:

http://arklenna.tumblr.com/post/22542372076/spot-isa-dog

I will attach a UML diagram and include the part of the post
addressing the diagram. Click through to the full post for a bonus
Einstein quote!

-------

My main goals include, but are not limited to:

* Make the structure parser and file-format agnostic: an abstracted
  OO design should allow anything to be slotted in (for example,
  Marjan's C GFF parser?)
* Maintain encapsulation: limit how much each object can see of
  objects above and below it
* Allow extension at multiple levels: some existing parsers may
  process data in different ways; this structure should allow handling
  both raw data and data in various formats.

The `Variant` object's constructor allows an end user to change the
default parsers. Practical implementation details of `parse()` and
`write()` will need to be finessed - for example, ways to help the
user sift through immense quantities of data. I'm still in the process
of comparing the data contained in VCF/GVF files as well as the APIs
of PyVCF and BCBio.GFF.

`Parser` and `Writer` are both abstract classes that will define all
methods found in known parsers/writers with `NotImplementedError`s.
I'm speculating on whether a Variant-specific exception would be
useful, but a custom message should suffice.

Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper`
would each inherit from both `Parser` and `Writer`. As the name
implies, they would serve as the adapter between the generic `Variant`
and the specific parser.

I anticipate that this structure could easily be extended to allow
intermediate storage in DBs as well as innumerable
sorting/comparing/filtering methods inside `Variant`.
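The structure described above could look roughly like the following
sketch. It is only an illustration under assumptions: ToyVCFWrapper is a
hypothetical adapter that reads five tab-separated columns itself rather
than delegating to PyVCF, and the method names and record layout are
invented for the example.

```python
import io


class Parser:
    """Interface only: names the methods a parser wrapper must provide."""

    def parse(self, handle):
        raise NotImplementedError("parse() is not implemented by this wrapper")


class Writer:
    """Interface only: names the methods a writer wrapper must provide."""

    def write(self, records, handle):
        raise NotImplementedError("write() is not implemented by this wrapper")


class ToyVCFWrapper(Parser, Writer):
    """Hypothetical adapter; a real one would delegate to PyVCF instead."""

    def parse(self, handle):
        # Read only the first five tab-separated columns of each data line.
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}

    def write(self, records, handle):
        for rec in records:
            handle.write("{chrom}\t{pos}\t.\t{ref}\t{alt}\n".format(**rec))


class Variant:
    """Generic front end; the constructor lets the user swap the defaults."""

    def __init__(self, parser=None, writer=None):
        self.parser = parser if parser is not None else ToyVCFWrapper()
        self.writer = writer if writer is not None else ToyVCFWrapper()
        self.records = []

    def parse(self, handle):
        self.records = list(self.parser.parse(handle))
        return self.records

    def write(self, handle):
        self.writer.write(self.records, handle)


variant = Variant()
variant.parse(io.StringIO("#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\n"))
output = io.StringIO()
variant.write(output)
```

A wrapper that inherits from Parser and Writer but forgets to override a
method fails with the custom NotImplementedError message as soon as that
method is called, which is the behaviour the abstract classes are meant
to provide.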
-------

I would appreciate any and all feedback about the overall structure.
Namespace is definitely flexible. I'd also appreciate any specific
genomic variant workflows, and if somebody can point me to smallish
sample files of the same data in both VCF and GVF, I'd be eternally
grateful.

Regards,

Lenna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Variant_UML.png
Type: image/png
Size: 23313 bytes
Desc: not available
URL:

From chapmanb at 50mail.com Mon May 7 20:24:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 07 May 2012 20:24:39 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To:
References:
Message-ID: <87mx5jfrjs.fsf@fastmail.fm>

Lenna;
This all looks great for a top level overview of the classes. This
should give you sufficient flexibility to work on the different file
types. Another approach is to avoid some of the inheritance and have
parse/write dispatch to VCF or GFF specific classes based on the
filetype:

if filetype == "vcf":
    variant_handler = PyVCFVariants()
elif filetype == "gvf":
    variant_handler = GVFVariants()
variant_handler.parse(*args)

Avoiding layers can be nice to simplify the architecture, as long as it
gives you the flexibility you need.

My suggestion for digging more in the API design would be to start
playing with some VCF files and getting comfortable with the data they
have and where it would go in Biopython objects. VCF is much more
widely used than GVF so it's a good practical place to start.

Thanks for all this work and best of luck on finals,
Brad

> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From casbon at gmail.com Tue May 8 04:57:57 2012
From: casbon at gmail.com (James Casbon)
Date: Tue, 8 May 2012 09:57:57 +0100
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To: <87mx5jfrjs.fsf@fastmail.fm>
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID:

On 8 May 2012 01:24, Brad Chapman wrote:
>
> Lenna;
> This all looks great for a top level overview of the classes. This
> should give you sufficient flexibility to work on the different file
> types. Another approach is to avoid some of the inheritance and have
> parse/write dispatch to VCF or GFF specific classes based on the
> filetype:
>
> if filetype == "vcf":
>     variant_handler = PyVCFVariants()
> elif filetype == "gvf":
>     variant_handler = GVFVariants()
> variant_handler.parse(*args)
>
> Avoiding layers can be nice to simplify the architecture, as long as it
> gives you the flexibility you need.

Hi Lenna,

This looks like a good start, but I would agree with Brad that layers
of inheritance aren't always the best way to proceed with Python.

Specific feedback: why does the Variant have parse/write methods when
you state that you will use adaptation from the general variation class
to the actual parser? I'm also slightly worried this could be pretty
slow when dealing with the volume of data you get from a VCF file.

As for the points in your blog post... I have plenty of data; do we
know any SNP callers capable of creating GVF files? If so, I can give
you both formats.

The simplest variant workflows would be to filter and then score on
some metric. Filter would be to remove noise, so a quality threshold is
the simplest one. The metric used depends on the experimental setup.
For case/control, a Fisher's test is quite easy, or for a single
population an HWE test is fairly simple.
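The filter-then-score workflow described above can be sketched in a few
lines. This is a minimal illustration under assumptions: the record
layout (a dict with a "qual" key), the threshold of 30 and the function
names are hypothetical, and the HWE check is a plain chi-square
goodness-of-fit statistic rather than an exact test.

```python
def filter_by_quality(records, min_qual=30.0):
    """Keep only the variant records whose quality reaches the threshold."""
    return [rec for rec in records if rec["qual"] >= min_qual]


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic for departure from Hardy-Weinberg equilibrium.

    Compares observed genotype counts at one biallelic site with the
    counts expected from the observed allele frequencies.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("need at least one genotyped sample")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


calls = [
    {"pos": 100, "qual": 12.0},
    {"pos": 250, "qual": 48.5},
    {"pos": 300, "qual": 30.0},
]
kept = filter_by_quality(calls)

in_equilibrium = hwe_chi_square(25, 50, 25)  # counts match expectation
no_hets = hwe_chi_square(50, 0, 50)          # strong heterozygote deficit
```

For a biallelic site the statistic has one degree of freedom, so values
far above about 3.84 suggest a departure from equilibrium at the 5%
level.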
Hope this helps,

--
James
http://casbon.me/

From w.arindrarto at gmail.com Wed May 9 12:24:43 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 9 May 2012 18:24:43 +0200
Subject: [Biopython-dev] GSoC Project Update -- 1
Message-ID:

Hi everyone,

I just posted my latest blog update here:
http://bow.web.id/blog/2012/05/warming-up-for-the-coding-period/

To summarize, I've spent most of my time getting to know the programs
I will support better. This has been done by:

1. Playing around with the programs to see how many different outputs
I can generate.
2. Writing scripts to automate test case generation for each of the programs.
3. Writing wrappers (for programs not yet wrapped by Biopython: FASTA,
HMMER, and BLAT) to ease writing the test case generators.
4. Continuing to complete my proposed SearchIO object naming scheme
(http://bit.ly/searchio-terms)

The test cases, their generators, and the wrappers I've written are
available in my non-Biopython gsoc repo here:
http://github.com/bow/gsoc/.

Additionally, I've used the generated test case to improve a recent
bug report and submitted a fix for the next release.

For the coming weeks before coding starts, I'm planning to play around
more with XML and SQLite as I will use them in the code. I might start
to add more skeleton code to my current development branch as well
(https://github.com/bow/biopython).

cheers,
Bow

From arklenna at gmail.com Wed May 9 20:16:18 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 9 May 2012 20:16:18 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To: <20120508114043.GC14359@thebird.nl>
References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl>
Message-ID:

I think my UML diagram may need a legend, or perhaps it should just be
abandoned. I've written some skeleton code to try to avoid confusion
about the pesky OO terms that have slightly different meanings for
every language.
https://gist.github.com/2649676

Regarding concerns about inheritance:

I think the UML diagram implies 3 levels of inheritance. The only
inheritance I intended was from abstract interfaces like Parser or
Writer, that only contain non-implemented methods. Because I can't
guarantee that all future parsers will have common attribute and method
names, the only solution I can see is to write an interface and inherit
from that to make wrappers for each parser.

Thank you to Eric for this link:
(https://en.wikipedia.org/wiki/Fragile_base_class). The page states
that the best way to avoid problems is to use an interface. Also thank
you to Pjotr for the article about mixins
(http://www.cs.utexas.edu/~lin/papers/aop03.pdf). I believe I'm using
inheritance in a safe and helpful manner.

James, I hope my clarification and skeleton code answer any questions
you have about the implementation.

Brad, I am using if statements to determine which parser to use, but I
am still calling wrappers that inherit from an interface.

Eric, I looked at the structure of PDBParser. Is the idea that a user
might pass in an instance of StructureBuilder that already contained
some structure and add to it? Or is there another purpose that isn't
jumping out at me? In my skeleton code, I used the example of
StructureBuilder, but I'm not sure if there's an advantage to passing
the object rather than the object's name.

And finally, Brad and James, I will do my best to get more conversant
with VCF etc. If I'm not a user, I can't be a capable developer.

Looking forward to any more structural feedback!

Cheers,

Lenna

From eric.talevich at gmail.com Thu May 10 09:36:49 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 May 2012 09:36:49 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To:
References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl>
Message-ID:

On Wed, May 9, 2012 at 8:16 PM, Lenna Peterson wrote:
> I looked at the structure of PDBParser.
Is the idea that a user might pass > in an instance of StructureBuilder that already contained some structure > and add to it? Or is there another purpose that isn't jumping out at me? In > my skeleton code, I used the example of StructureBuilder, but I'm not sure > if there's an advantage to passing the object rather than the object's name. > > My understanding of the producer/consumer design in Bio.PDB (I didn't write it) is that the logic for parsing the given file format is contained in the *Parser class, and the logic for building the target object is in the *Builder class. This is useful if the target object is somewhat complex to build, as is the case with PDB's Structure/Model/Chain/Residue/Atom hierarchy -- the parser just passes raw values along to the appropriate method on the StructureBuilder class. (The Internet also points out that this design is super useful if "producing" and "consuming" are asynchronous, which is not the case here... yet?) Regarding the shared interface, I think we've generally achieved this throughout most of Biopython by just remembering to implement the required methods on each parser and writer class -- just "parse" and "write", usually. Essentially, it's your design minus the common base class that enforces the interface; an error in the implementation would result in an AttributeError rather than a NotImplementedError. This works because (1) Python uses duck typing, unlike C++ and Java; (2) in Biopython, each file format is usually implemented by one dedicated person who can keep it all in their head, and we don't add new file formats very rapidly; (3) we maintain pretty good coverage with our unit tests, and certainly add unit tests for new parsers. Given all that, I think your design is superior, and it's quite clear how it all works from the way you've written it. 
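To make the trade-off above concrete, here is a minimal sketch of the two styles. All class names are hypothetical and are not taken from Biopython or from Lenna's skeleton code. It only shows which exception a forgotten "parse" method would produce with and without a common base class.

```python
class Parser(object):
    """Hypothetical abstract interface: subclasses must override parse()."""

    def parse(self, handle):
        raise NotImplementedError(
            "%s must implement parse()" % type(self).__name__)


class GoodParser(Parser):
    """Implements the interface properly."""

    def parse(self, handle):
        return [line.strip() for line in handle]


class ForgetfulParser(Parser):
    """Inherits the interface but forgets to implement parse()."""


class DuckParser(object):
    """No base class and no parse(): pure duck typing gone wrong."""


def describe_failure(parser):
    """Report how a call to parse() fails (or 'ok' if it works)."""
    try:
        parser.parse(["a\n"])
    except NotImplementedError:
        return "NotImplementedError"
    except AttributeError:
        return "AttributeError"
    return "ok"
```

With the base class, the missing method fails with a message naming the offending class; without it, the failure is a plain AttributeError at the call site. Either way the mistake only surfaces when the method is called, so unit tests remain the real safety net.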
As for the difference between passing an instance of the *Builder object versus a reference to the *Builder class (did I get that right?), it requires slightly less code from the user to pass a reference to the class. Also, if you set the object-or-class as a default argument, remember that objects are mutable, so you risk hitting one of Python's most infamous gotchas (default arguments are only evaluated once, so the second time you use the parser, you'll be adding to the original object instead of starting with a fresh copy).

Cheers,
Eric

From w.arindrarto at gmail.com Fri May 11 12:08:25 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 11 May 2012 18:08:25 +0200
Subject: [Biopython-dev] Biopython wrappers' behavior
Message-ID:

Hi everyone,

There has been a recent discussion on Github (here: https://github.com/bow/biopython/commit/b0b1a460149d4a68f76ebde916471628cecfe4e7#-P0) regarding the way our command line wrappers are supposed to work. It started as a question on how to handle incompatible parameters and boils down to how much complexity we want to have in our wrappers.

To give you an illustration:

We have wrappers for BLAST, which raise exceptions if two incompatible parameters are used at the same time. This mimics BLAST's behavior, since it will also show errors if given that same combination of parameters without our wrapper. However, the way other programs handle incompatible parameters is not always the same as BLAST's. For example, HMMER doesn't show any errors, but nothing gets run either, and EMBOSS seems to use the last parameter it sees, ignoring previous ones. I have not tested this for all available programs and parameters in each suite, but it seems reasonable to extrapolate the behavior to the rest of the programs in their respective suite.

The question is, how should our wrappers handle this? Should we:

* Raise errors whenever incompatible parameters are used (as seen in BLAST's wrappers)? Or perhaps just give warnings?
This is an extra layer of complexity, but it would help users figure out if something goes unexpected when using our wrappers. * Leave it as it is and not worry about incompatible parameters at all? Perhaps we could also report a bug / feature request to the respective programs' authors and expect their default behavior to change? * (other ideas...)? I personally favor mimicking the programs' behavior as close as possible. If it gives errors, we should handle it with our code, if not then we leave it as it is, even if it results in some unexpected behavior, but this is just me. What do you think? How should our wrappers handle incompatible parameters? Bow From eric.talevich at gmail.com Sat May 12 14:38:27 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 12 May 2012 14:38:27 -0400 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: On Fri, May 11, 2012 at 12:08 PM, Wibowo Arindrarto wrote: > Hi everyone, > > There has been a recent discussion on Github (here: > > https://github.com/bow/biopython/commit/b0b1a460149d4a68f76ebde916471628cecfe4e7#-P0 > ) > regarding the way our command line wrappers are supposed to work. It > started as a question on how to handle incompatible parameters and > boils down to how much complexity we want to have in our wrappers. > > To give you an illustration: > > We have wrappers for BLAST, which raise exceptions if two incompatible > parameters are used at the same time. This mimics BLAST's behavior, > since it will also show errors if given that same combination of > parameters sans using our wrapper. However, the way other programs > handle incompatible parameters are not always the same as BLAST's. For > example, HMMER doesn't show any errors but nothing still gets run, and > EMBOSS seems to use the last parameter it sees, ignoring previous > ones. 
I have not tested this for all available programs and parameters > in each suite, but it seems reasonable to extrapolate the behavior to > the rest of the programs in their respective suite. > > The question is, how should our wrappers handle this? Should we: > > * Raise errors whenever incompatible parameters are used (as seen in > BLAST's wrappers)? Or perhaps just give warnings? This is an extra > layer of complexity, but it would help users figure out if something > goes unexpected when using our wrappers. > * Leave it as it is and not worry about incompatible parameters at > all? Perhaps we could also report a bug / feature request to the > respective programs' authors and expect their default behavior to > change? > * (other ideas...)? > > I personally favor mimicking the programs' behavior as close as > possible. If it gives errors, we should handle it with our code, if > not then we leave it as it is, even if it results in some unexpected > behavior, but this is just me. What do you think? How should our > wrappers handle incompatible parameters? There are certain motivations that apply to command-line tools but not Python object-based wrappers. The first thing that comes to mind is the use of scripts and aliases on the command line, where an existing setting "--foo" can be reversed/nullified by adding the "--no-foo" later in the command line. Examples -- say these are set globally in /etc/profile: % alias ourwater="water -brief" % ourwater -nobrief % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo I think EMBOSS handles this situation in the most Unix-friendly way, while BLAST is being fussy and HMMer is... still in development. In any case, this situation doesn't apply in Python/Biopython. If we want to reverse or reset an attribute on an object, we assign a new value to it, problem solved. 
>>> some_cmd = SomeCommandlineWrapper(foo=True) >>> some_cmd.foo = False So, I would support these behaviors in general: 1. If conflicting options are specified together in the constructor (__init__), raise an exception: >>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom! 2. Where it's possible and intuitive, only use one attribute to specify boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just have 'foo', and let the 'nofoo' switch set that attribute to False. When building the command line for execution, sort it out again. I'm not sure about the easiest way to do this with Bio.Applications, but maybe we should come up with a standard mechanism for it. From w.arindrarto at gmail.com Mon May 14 15:29:44 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 14 May 2012 21:29:44 +0200 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: > There are certain motivations that apply to command-line tools but not > Python object-based wrappers. The first thing that comes to mind is the use > of scripts and aliases on the command line, where an existing setting > "--foo" can be reversed/nullified by adding the "--no-foo" later in the > command line. > > Examples -- say these are set globally in /etc/profile: > > % alias ourwater="water -brief" > % ourwater -nobrief > > % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" > % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo > > > I think EMBOSS handles this situation in the most Unix-friendly way, while > BLAST is being fussy and HMMer is... still in development. > > In any case, this situation doesn't apply in Python/Biopython. If we want to > reverse or reset an attribute on an object, we assign a new value to it, > problem solved. > >>>> some_cmd = SomeCommandlineWrapper(foo=True) >>>> some_cmd.foo = False > > So, I would support these behaviors in general: > > 1. 
If conflicting options are specified together in the constructor
> (__init__), raise an exception:
>
>>>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom!
>
> 2. Where it's possible and intuitive, only use one attribute to specify
> boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just have
> 'foo', and let the 'nofoo' switch set that attribute to False. When building
> the command line for execution, sort it out again. I'm not sure about the
> easiest way to do this with Bio.Applications, but maybe we should come up
> with a standard mechanism for it.
>

Hi Eric,

Thanks for the explanation! It never occurred to me that we should consider custom command-line aliases as well, but that makes sense now.

For your first point, there's already an incompatibility-checking mechanism implemented in the Bio.Blast.Applications module. It's currently tied to the _validate method in Bio.Blast.Applications, but it seems doable to generalize this into a method of Bio.Application.AbstractCommandline, so it's available to the rest of the command line wrappers (EMBOSS wrappers being some of them).

As per your second point, I can imagine three general ways to do this (off the top of my head):

1. Implement a method in AbstractCommandline to override one parameter setting with its opposing parameter. This is perhaps similar to the _validate_incompatibilities method in Bio.Blast.Applications, only instead of raising an exception it deletes one of the parameters.

2. Implement a new _AbstractParameter subclass that can handle two different incompatible parameters (this is perhaps too complicated).

3. Implement an incompatibility-checking mechanism in AbstractCommandline.__str__, so that a parameter can override its opposite (e.g. foo and nofoo). This keeps both opposing parameters stored as object attributes (so __repr__ will reveal them both), but the overridden one won't be passed on to the console, since the __call__ method relies on __str__.
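As a rough illustration of the third option, here is a minimal sketch. ToyCommandline and its overridden_by table are hypothetical and do not use or mirror Bio.Application; the point is only that both switches stay visible in __repr__ while __str__ drops the overridden one.

```python
class ToyCommandline(object):
    """Hypothetical stand-in for a command line wrapper (not Bio.Application)."""

    # Maps a switch name to the switch that overrides it when both are set.
    overridden_by = {"foo": "nofoo"}

    def __init__(self, program, **switches):
        self.program = program
        self.switches = dict(switches)

    def __repr__(self):
        # Show everything the user set, including overridden switches.
        args = ", ".join(
            "%s=%r" % item for item in sorted(self.switches.items()))
        return "ToyCommandline(%r, %s)" % (self.program, args)

    def __str__(self):
        # Only switches that are on and not overridden reach the console.
        active = [name for name, on in sorted(self.switches.items()) if on]
        kept = [name for name in active
                if self.overridden_by.get(name) not in active]
        return " ".join([self.program] + ["-" + name for name in kept])
```

For example, str(ToyCommandline("prog", foo=True, nofoo=True)) gives "prog -nofoo", while repr() still lists both foo=True and nofoo=True.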
Bow From eric.talevich at gmail.com Mon May 14 15:53:50 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 14 May 2012 15:53:50 -0400 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: On Mon, May 14, 2012 at 3:29 PM, Wibowo Arindrarto wrote: > > There are certain motivations that apply to command-line tools but not > > Python object-based wrappers. The first thing that comes to mind is the > use > > of scripts and aliases on the command line, where an existing setting > > "--foo" can be reversed/nullified by adding the "--no-foo" later in the > > command line. > > > > Examples -- say these are set globally in /etc/profile: > > > > % alias ourwater="water -brief" > > % ourwater -nobrief > > > > % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" > > % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo > > > > > > I think EMBOSS handles this situation in the most Unix-friendly way, > while > > BLAST is being fussy and HMMer is... still in development. > > > > In any case, this situation doesn't apply in Python/Biopython. If we > want to > > reverse or reset an attribute on an object, we assign a new value to it, > > problem solved. > > > >>>> some_cmd = SomeCommandlineWrapper(foo=True) > >>>> some_cmd.foo = False > > > > So, I would support these behaviors in general: > > > > 1. If conflicting options are specified together in the constructor > > (__init__), raise an exception: > > > >>>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom! > > > > 2. Where it's possible and intuitive, only use one attribute to specify > > boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just > have > > 'foo', and let the 'nofoo' switch set that attribute to False. When > building > > the command line for execution, sort it out again. I'm not sure about the > > easiest way to do this with Bio.Applications, but maybe we should come up > > with a standard mechanism for it. 
> > > > Hi Eric, > > Thanks for the explanation! It never occured to me we should consider > custom command-line aliases as well, but that makes sense now. > > For your first point, there's already an incompatibility-checking > mechanism implemented in the Bio.Blast.Applications module. It's > currently tied-up to Bio.Blast.Applications's _validate method, but it > seems doable to generalize this into a method of > Bio.Application.AbstractCommandline, so it's available to the rest of > the command line wrappers (EMBOSS wrappers being some of them). > > As per your second point, I can imagine three general ways to do this > (on top of my head): > > 1. Implement a method to override one parameter setting with its > opposing parameter in AbstractCommandline. This is perhaps similar to > Bio.Blast.Application's _validate_incompatibilities method, only > instead of raising an exception it deletes one of the parameters. > > 2. Implement a new _AbstractParameter subclass that can handle two > different incompatible parameters (this is perhaps too complicated) > > 3. Implement an incompatibility checking mechanism in > AbstractCommandline.__str__, to define parameters that can override > its pair (e.g. foo and nofoo). This will keep the opposing parameters > stored as the object attribute (so a __repr__ will reveal them both), > but it won't get passed on to the console as the __call__ method > relies on __str__. > > Here's a fourth to consider, similar to your #1 (not to disagree with any of your suggestions): add an "_AntiSwitch" class to Bio.Applications, which includes a reference or string name of the attribute/parameter it nullifies. Would that be easier to specify when writing the application wrapper? 
From clements at galaxyproject.org Mon May 14 16:57:19 2012
From: clements at galaxyproject.org (Dave Clements)
Date: Mon, 14 May 2012 13:57:19 -0700
Subject: [Biopython-dev] 2012 Galaxy Community Conference
Message-ID:

Hello all,

We are pleased to announce that early registration for the 2012 Galaxy Community Conference (GCC2012, http://galaxyproject.org/GCC2012) is now open.

GCC2012 will be held July 25-27, at the UIC Forum, in Chicago, Illinois. The conference will feature two full days of presentations, discussions, lightning talks, and breakouts. We have also added a new full day of training this year, featuring 3 parallel tracks with four workshops each, covering seven to twelve different topics (please vote on topics by Friday May 18: http://bit.ly/GCC2012TDSurvey).

The Galaxy Community Conference is for:
* Sequencing core facility staff
* Bioinformatics core staff
* Bioinformatics tool and workflow developers
* Bioinformatics focused principal investigators and researchers
* Data producers
* Power bioinformatics users

This event is about integrating, analyzing, and sharing the diverse and very large datasets that are now typical in biomedical research. GCC2012 is an opportunity to share best practices with, and learn from, a large community of researchers and support staff who are facing the challenges of data-intensive biology. Galaxy is an open web-based platform for data intensive biomedical research (http://galaxyproject.org) that is widely used and deployed at research organizations of all sizes and around the world.

Registration is very affordable, especially for post-docs and students. *You can save 36% to 42% by registering on or before June 11*.

Conference lodging can also be booked. Low-cost rooms have been reserved on the UIC campus. You can also stay at the official conference hotel, at a substantial discount. There are a limited number of rooms available in both, and you are encouraged to register early.

Thanks, and hope to see you in Chicago!
Dave Clements, on behalf of the GCC2012 Organizing Committee PS: Please help get the word out. A flyer and graphics are at http://wiki.g2.bx.psu.edu/Events/GCC2012/Promotion. -- http://galaxyproject.org/GCC2012 http://galaxyproject.org/ http://getgalaxy.org/ http://usegalaxy.org/ http://galaxyproject.org/wiki/ From erikclarke at gmail.com Tue May 15 12:44:32 2012 From: erikclarke at gmail.com (Erik Clarke) Date: Tue, 15 May 2012 09:44:32 -0700 Subject: [Biopython-dev] GEO library revamp Message-ID: Hi all, I saw on the wiki that the BioPython GEO library was in need of some TLC. I agree; a recent effort to use the parser for a project in our lab was stymied by its lack of flexibility (it seems to be particularly poor at reading GEO datasets, for instance). In response, we've developed a basic GEO module in Python loosely based on GEOQuery and the existing Geo module. Currently, our module is capable of downloading and parsing all four major GEO record types and providing rudimentary pretty-print output of the data. It also provides a representation of a GDS file in a form amenable to statistical analysis using SciPy. I've included a method that finds the enriched genes in a given subset as a demonstration. Since it was an internal project before this, I would appreciate any feedback in terms of usability, bugs, etc that we may not have caught. It's still under active development as I flesh out some of the missing features (better pretty-printing, bug fixes, complete unit-test coverage, etc). In any case, my development branch of BioPython is here: https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the new code is in the Bio/Geo folder (Records.py will replace Record.py). I've tried to make it as well-commented as possible. I have not yet tested it on Python < 2.7, but I plan on doing so. If this is of interest to anybody, I would be more than happy to tweak it as people saw fit and hopefully one day replace the current GEO parser. 
Cheers, Erik Clarke The Scripps Research Institute La Jolla, CA From w.arindrarto at gmail.com Tue May 15 16:32:08 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 15 May 2012 22:32:08 +0200 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: >> As per your second point, I can imagine three general ways to do this >> (on top of my head): >> >> 1. Implement a method to override one parameter setting with its >> opposing parameter in AbstractCommandline. This is perhaps similar to >> Bio.Blast.Application's _validate_incompatibilities method, only >> instead of raising an exception it deletes one of the parameters. >> >> 2. Implement a new _AbstractParameter subclass that can handle two >> different incompatible parameters (this is perhaps too complicated) >> >> 3. Implement an incompatibility checking mechanism in >> AbstractCommandline.__str__, to define parameters that can override >> its pair (e.g. foo and nofoo). This will keep the opposing parameters >> stored as the object attribute (so a __repr__ will reveal them both), >> but it won't get passed on to the console as the __call__ method >> relies on __str__. >> > > Here's a fourth to consider, similar to your #1 (not to disagree with any of > your suggestions): add an "_AntiSwitch" class to Bio.Applications, which > includes a reference or string name of the attribute/parameter it nullifies. > Would that be easier to specify when writing the application wrapper? > That seems doable :). I can't imagine it being too hard to implement technically, as this will only be used for options that can be grouped under one name (i.e. opposing boolean parameters). 
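One way the "_AntiSwitch" idea might look is sketched below. Switch, AntiSwitch, and build_command are hypothetical names made up for illustration; nothing here is an existing Bio.Application class. The sketch assumes the anti-switch simply turns the attribute it names back off, and that disabled options are omitted from the command line.

```python
class Switch(object):
    """Hypothetical boolean option, e.g. -brief."""

    def __init__(self, name):
        self.name = name

    def apply(self, state):
        # Turn this option on.
        state[self.name] = True


class AntiSwitch(object):
    """Hypothetical option that nullifies another one, e.g. -nobrief."""

    def __init__(self, name, nullifies):
        self.name = name
        self.nullifies = nullifies  # name of the attribute it turns off

    def apply(self, state):
        # Turn the named option off instead of storing a second attribute.
        state[self.nullifies] = False


def build_command(program, options, given):
    """Apply the given option names in order and render the command line."""
    known = dict((opt.name, opt) for opt in options)
    state = {}
    for name in given:
        known[name].apply(state)
    enabled = sorted(name for name, on in state.items() if on)
    return " ".join([program] + ["-" + name for name in enabled])
```

With options = [Switch("brief"), AntiSwitch("nobrief", "brief")], passing ["brief", "nobrief"] renders just "water", and the last option given wins. A real wrapper might instead need to emit the negative flag explicitly when the program's default is "on".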
From redmine at redmine.open-bio.org Wed May 16 05:32:51 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 16 May 2012 09:32:51 +0000
Subject: [Biopython-dev] [Biopython - Bug #3353] (New) Bio.KEGG.Enzymes change
Message-ID:

Issue #3353 has been reported by Thomas van Gurp.

----------------------------------------
Bug #3353: Bio.KEGG.Enzymes change
https://redmine.open-bio.org/issues/3353

Author: Thomas van Gurp
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

When retrieving an enzyme from [[http://soap.genome.jp/KEGG.wsdl]] using

    ec_handle = client.service.bget(ec)

and

    Bio.KEGG.Enzyme.parse(ec_handle.split('\n'))

there is an error in the way pathways are parsed. The layout changed from:

    PATHWAY PATH: MAP00130 Ubiquinone biosynthesis

to

    PATHWAY ec00030 Pentose phosphate pathway
            ec00480 Glutathione metabolism

A minor modification in the code fixes this issue:

    elif keyword=="PATHWAY ":
        if data[:5]=='PATH:':
            # Old layout: "PATH: MAP00130 Ubiquinone biosynthesis"
            path, map_id, name = data.split(None,2)
            pathway = (path[:-1], map_id, name)
        else:
            # New layout: "ec00030 Pentose phosphate pathway"
            map_id, name = data.split(None,1)
            pathway = ('PATH', map_id, name)
        record.pathway.append(pathway)

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Wed May 16 06:19:14 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 16 May 2012 10:19:14 +0000
Subject: [Biopython-dev] [Biopython - Bug #3354] (New) Legacy blast XML parser returns prematurely StopIteration
Message-ID:

Issue #3354 has been reported by Martin Mokrejš.
----------------------------------------
Bug #3354: Legacy blast XML parser returns prematurely StopIteration
https://redmine.open-bio.org/issues/3354

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

Hi,
I am parsing some blast 2.2.24 XML output, and the last record I get is the one from iteration 124. I see that this entry is followed by a new section, which is probably the culprit. I will try a newer legacy BLAST, but still, could Biopython perhaps cope with this bug in the XML input?
blastall -p blastn -A 4 -i SRR068315.fasta -d my_targets.fasta -F 0 -S 1 -r 2 -e 10e-30 -m 7




  blastn
  blastn 2.2.24 [Aug-08-2010]
  ~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs",  Nucleic Acids Res. 25:3389-3402.
  my_targets.fasta
  lcl|1_0
  FYUQ5C204IQCOE length=283 xy=3463_2076 region=4 run=R_2009_07_08_19_30_38_
  318
  
    
      1e-29
      2
      -3
      5
      2
      F
    
  
  
[cut]
    
      124
      lcl|124_0
      FYUQ5C204JXGMI length=44 xy=3954_2264 region=4 run=R_2009_07_08_19_30_38_
      350
      
        
          22
          9262
          0
          0
          0.41
          0.625
          0.78
        
      
      No hits found
    
    
      1
      
        
          22
          9262
          0
          0
          0.41
          0.625
          0.78
        
      
    
    
      125
      lcl|125_0
      FYUQ5C204JFG82 length=173 xy=3749_2948 region=4 run=R_2009_07_08_19_30_38_
      208
      
        
          22
          9262
          0
          0
          0.41
          0.625
          0.78
        
      
      No hits found
    
    
      126
      lcl|126_0
      FYUQ5C204I2D3A length=146 xy=3600_2628 region=4 run=R_2009_07_08_19_30_38_
      205
      
        
          22
          9262
          0
          0
          0.41
          0.625
          0.78
        
      
      No hits found
    


Grepping for the iteration numbers, I see a few more cases like that further on in the XML file:

      234
      1
      235
      236

      345
      1
      346
      347

      450
      1
      451
      452

      555
      1
      556
      557

      655
      1
      656
      657

      759
      1
      760
      761

      859
      1
      860
      861

      956
      1
      957
      958

      1050
      1
      1051
      1052

      1145
      1
      1146
      1147

      1239
      1
      1240
      1241

      1333
      1
      1334
      1335

      1430
      1
      1431
      1432

      1523
      1
      1524
      1525

      1610
      1
      1611
      1612

      1703
      1
      1704
      1705

      1792
      1
      1793
      1794

      1881
      1
      1882
      1883


After that, the problem does not occur again through the end of the XML file, at:
     25698
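The check done by hand above could be scripted roughly as follows. This is a minimal sketch, assuming the iteration numbers appear in the XML as <Iteration_iter-num> elements, one per line (the tag name comes from the legacy BLAST XML format and is an assumption here); find_resets is a hypothetical helper, not part of Biopython.

```python
import re

# Assumed tag name for the iteration counter in legacy BLAST XML output.
ITER_NUM = re.compile(r"<Iteration_iter-num>(\d+)</Iteration_iter-num>")


def find_resets(lines):
    """Return (previous, current) pairs where the iteration number
    fails to increase, e.g. 124 followed by 1."""
    resets = []
    previous = None
    for line in lines:
        match = ITER_NUM.search(line)
        if match is None:
            continue  # not an iteration-number line
        current = int(match.group(1))
        if previous is not None and current <= previous:
            resets.append((previous, current))
        previous = current
    return resets
```

Calling find_resets(open("blast_output.xml")) on a file like the one described (the file name is illustrative) would list each place where the numbering drops back, which is where the parser appears to stop.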
I am attaching the XML file with the entries after roughly the last problematic place removed, and with the two "closing" XML lines added so that the file should be valid XML again.

From p.j.a.cock at googlemail.com Wed May 16 13:07:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 May 2012 18:07:18 +0100
Subject: [Biopython-dev] GEO library revamp
In-Reply-To:
References:
Message-ID:

On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote:
> Hi all,
> I saw on the wiki that the BioPython GEO library was in need of some TLC.
> I agree; a recent effort to use the parser for a project in our lab was
> stymied by its lack of flexibility (it seems to be particularly poor at
> reading GEO datasets, for instance).
>
> In response, we've developed a basic GEO module in Python loosely based on
> GEOQuery and the existing Geo module. Currently, our module is capable of
> downloading and parsing all four major GEO record types and providing
> rudimentary pretty-print output of the data. It also provides a
> representation of a GDS file in a form amenable to statistical analysis
> using SciPy. I've included a method that finds the enriched genes in a given
> subset as a demonstration.
>
> Since it was an internal project before this, I would appreciate any
> feedback in terms of usability, bugs, etc that we may not have caught. It's
> still under active development as I flesh out some of the missing features
> (better pretty-printing, bug fixes, complete unit-test coverage, etc).
> > In any case, my development branch of BioPython is here: > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the > new code is in the Bio/Geo folder (Records.py will replace Record.py). I've > tried to make it as well-commented as possible. I have not yet tested it on > Python < 2.7, but I plan on doing so. > > If this is of interest to anybody, I would be more than happy to tweak it > as people saw fit and hopefully one day replace the current GEO parser. > > Cheers, > Erik Clarke > The Scripps Research Institute > La Jolla, CA Hi Erik, That does sound promising. Switching to using numpy seems very sensible :) As you'll have read on the "Project Ideas" list on the wiki, I was thinking we should draw inspiration from Sean Davis' GEOquery http://www.bioconductor.org/packages/bioc/html/GEOquery.html in R/Bioconductor - which I had previously used from Python via rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ Sean sometimes posts here on the Biopython lists, so it would be great if he could comment on your work. Peter From sdavis2 at mail.nih.gov Wed May 16 13:20:11 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 16 May 2012 13:20:11 -0400 Subject: [Biopython-dev] GEO library revamp In-Reply-To: References: Message-ID: On Wed, May 16, 2012 at 1:07 PM, Peter Cock wrote: > On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote: > > Hi all, > > I saw on the wiki that the BioPython GEO library was in need of some TLC. > > I agree; a recent effort to use the parser for a project in our lab was > > stymied by its lack of flexibility (it seems to be particularly poor at > > reading GEO datasets, for instance). > > > > In response, we've developed a basic GEO module in Python loosely based > on > > GEOQuery and the existing Geo module. Currently, our module is capable of > > downloading and parsing all four major GEO record types and providing > > rudimentary pretty-print output of the data. 
It also provides a > > representation of a GDS file in a form amenable to statistical analysis > > using SciPy. I've included a method that finds the enriched genes in a > given > > subset as a demonstration. > > > > Since it was an internal project before this, I would appreciate any > > feedback in terms of usability, bugs, etc that we may not have caught. > It's > > still under active development as I flesh out some of the missing > features > > (better pretty-printing, bug fixes, complete unit-test coverage, etc). > > > > In any case, my development branch of BioPython is here: > > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all > of the > > new code is in the Bio/Geo folder (Records.py will replace Record.py). > I've > > tried to make it as well-commented as possible. I have not yet tested it > on > > Python < 2.7, but I plan on doing so. > > > > If this is of interest to anybody, I would be more than happy to tweak it > > as people saw fit and hopefully one day replace the current GEO parser. > > > > Cheers, > > Erik Clarke > > The Scripps Research Institute > > La Jolla, CA > > Hi Erik, > > That does sound promising. Switching to using numpy seems > very sensible :) > > As you'll have read on the "Project Ideas" list on the wiki, I was > thinking we should draw inspiration from Sean Davis' GEOquery > http://www.bioconductor.org/packages/bioc/html/GEOquery.html > in R/Bioconductor - which I had previously used from Python via > rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ > > Sean sometimes posts here on the Biopython lists, so it would > be great if he could comment on your work. > > I'm looking forward to taking a look. It will be great to have a native python implementation. In the short term, Erik, you might take a look at some of the tests in the GEOquery package for some (high level) edge cases that I have stumbled onto over the years. 
Sean From erikclarke at gmail.com Wed May 16 14:51:45 2012 From: erikclarke at gmail.com (Erik Clarke) Date: Wed, 16 May 2012 11:51:45 -0700 Subject: [Biopython-dev] GEO library revamp In-Reply-To: References: Message-ID: <38363A1D-A317-4D8E-9AD4-A1EDE4ECC064@gmail.com> Thanks Sean, I'll definitely have a look at those. I'm looking forward to hearing your thoughts or critiques of the implementation. -Erik On May 16, 2012, at 10:20 AM, Sean Davis wrote: > > > On Wed, May 16, 2012 at 1:07 PM, Peter Cock wrote: > On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote: > > Hi all, > > I saw on the wiki that the BioPython GEO library was in need of some TLC. > > I agree; a recent effort to use the parser for a project in our lab was > > stymied by its lack of flexibility (it seems to be particularly poor at > > reading GEO datasets, for instance). > > > > In response, we've developed a basic GEO module in Python loosely based on > > GEOQuery and the existing Geo module. Currently, our module is capable of > > downloading and parsing all four major GEO record types and providing > > rudimentary pretty-print output of the data. It also provides a > > representation of a GDS file in a form amenable to statistical analysis > > using SciPy. I've included a method that finds the enriched genes in a given > > subset as a demonstration. > > > > Since it was an internal project before this, I would appreciate any > > feedback in terms of usability, bugs, etc that we may not have caught. It's > > still under active development as I flesh out some of the missing features > > (better pretty-printing, bug fixes, complete unit-test coverage, etc). > > > > In any case, my development branch of BioPython is here: > > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the > > new code is in the Bio/Geo folder (Records.py will replace Record.py). I've > > tried to make it as well-commented as possible. 
I have not yet tested it on > > Python < 2.7, but I plan on doing so. > > > > If this is of interest to anybody, I would be more than happy to tweak it > > as people saw fit and hopefully one day replace the current GEO parser. > > > > Cheers, > > Erik Clarke > > The Scripps Research Institute > > La Jolla, CA > > Hi Erik, > > That does sound promising. Switching to using numpy seems > very sensible :) > > As you'll have read on the "Project Ideas" list on the wiki, I was > thinking we should draw inspiration from Sean Davis' GEOquery > http://www.bioconductor.org/packages/bioc/html/GEOquery.html > in R/Bioconductor - which I had previously used from Python via > rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ > > Sean sometimes posts here on the Biopython lists, so it would > be great if he could comment on your work. > > > I'm looking forward to taking a look. It will be great to have a native python implementation. In the short term, Erik, you might take a look at some of the tests in the GEOquery package for some (high level) edge cases that I have stumbled onto over the years. > > Sean From w.arindrarto at gmail.com Wed May 16 15:36:28 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 16 May 2012 21:36:28 +0200 Subject: [Biopython-dev] GSoC Project Update -- 2 Message-ID: Hi everyone, I just posted my latest GSoC blog update here: http://bow.web.id/blog/2012/05/the-final-preparations/ To summarize, I spent the last week playing with XML and SQLite, and in extension SeqIO's index and index_db. I didn't write as much as real code the week before (mostly on online tutorials). Additionally, I started writing some of the SearchIO main methods, improved the test case generation time, and added more entries to the SearchIO terms table (http://bit.ly/searchio-terms). Finally, from this day onwards, I'm starting coding for the actual SearchIO implementation. 
The weekly plan will follow my proposed timeline (http://bit.ly/searchio-proposal) and I'll be writing mostly on my main SearchIO branch (https://github.com/bow/biopython/tree/searchio/Bio/SearchIO). cheers, Bow P.S. I also updated my blog last week so that the GSoC entries can be tracked through its own feed. The feed is available here: http://bow.web.id/feed/atom-gsoc.xml From arklenna at gmail.com Wed May 16 16:01:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 16 May 2012 16:01:30 -0400 Subject: [Biopython-dev] GSoC python variant update 2 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 Brief summary of this post: I don't think `SeqFeature` or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if this structure should be associated with `Seq`, i.e. by naming it `SeqVariant`, and would like feedback on this question. It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome. This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21. Regards, Lenna From p.j.a.cock at googlemail.com Wed May 16 16:47:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 16 May 2012 21:47:21 +0100 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: Message-ID: On Wed, May 16, 2012 at 9:01 PM, Lenna Peterson wrote: > Hi all, > > Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 > > Brief summary of this post: > > It could be very difficult to make PyVCF compatible with Python 2.5. What makes you worry? 
You mention argparse in the blog post,
but that is for parsing command line arguments - and so is
not really relevant for a library like Biopython (unless you are
planning a bunch of command line tools too?).

Peter

From chapmanb at 50mail.com  Wed May 16 20:19:01 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 May 2012 20:19:01 -0400
Subject: [Biopython-dev] GSoC python variant update 2
In-Reply-To:
References:
Message-ID: <871umju0ay.fsf@fastmail.fm>

Lenna;
Thanks for the update on your thinking. Sounds like you are right on
track.

> I don't think `SeqFeature` or an extension thereof would be
> appropriate for storing Variant data; therefore, I intend to make a
> new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if
> this structure should be associated with `Seq`, i.e. by naming it
> `SeqVariant`, and would like feedback on this question.

I'm agreed about SeqFeature. Would you consider using _Record/_Call
directly? Then you could provide functionality to convert this to/from
basic SeqFeatures if needed. An advantage of using these structures
explicitly is that you could plug in compatible APIs, like Aaron
Quinlan's CyVCF:

https://github.com/arq5x/cyvcf

I don't think we should add a new representation class unless we
explicitly need to store additional information.

> It could be very difficult to make PyVCF compatible with Python
> 2.5. Therefore, I am planning to write my project to be compatible
> with Python 2.6 and delaying its inclusion in the main Biopython
> branch until a future 2.6+ Biopython release. Alternate suggestions
> are welcome.

I'm agreed with this. I don't think 2.5 is as entrenched as 2.4 was, so
I think we could move on a deprecation path for it. It's more important
to be forward compatible with 3.x, and 2.6+ should make that easier.
Thanks again for sharing all your thoughts and digging into this, Brad From arklenna at gmail.com Fri May 18 00:35:10 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 18 May 2012 00:35:10 -0400 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: <871umju0ay.fsf@fastmail.fm> References: <871umju0ay.fsf@fastmail.fm> Message-ID: On Wed, May 16, 2012 at 4:47 PM, Peter Cock wrote: >> >> It could be very difficult to make PyVCF compatible with Python 2.5. > > What makes you worry? You mention argparse in the blog post, > but that is for parsing command line arguments - and so is > not really relevant for a library like Biopython (unless you are > planning a bunch of command line tools too?). > > Peter The absences that caught my eye were `with` and `next()`. The PyVCF developers aren't planning to implement 2.5 compatibility (https://github.com/jamescasbon/PyVCF/issues/30) and I don't have expertise in that transition. On Wed, May 16, 2012 at 8:19 PM, Brad Chapman wrote: > > >> I don't think `SeqFeature` or an extension thereof would be >> appropriate for storing Variant data; therefore, I intend to make a >> new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if >> this structure should be associated with `Seq`, i.e. by naming it >> `SeqVariant`, and would like feedback on this question. > > I'm agreed about SeqFeature. Would you consider using _Record/_Call > directly? Then you could provide functionality to convert this to/from > basic SeqFeatures if needed. An advantage of using these structures > explicitly is that you could plug in compatible APIs, like Aaron > Quinlan's CyVCF: > > https://github.com/arq5x/cyvcf > > I don't think we should add a new representation class unless we > explicitly need to store additional information. > The reason I suggested a new representation class is so data from all parsers can be stored in the same way. 
As far as I can tell, GVF doesn't store all of the information stored in VCF (for example, the headers). My concern was unexpected behavior if I tried to store GVF data in the exact same object used by VCF. On the other hand, your GFF parser outputs to SeqRecords/SeqFeatures, so if the PyVCF wrapper can output to SeqRecords as well, I probably wouldn't have to worry about an intermediate structure. I'll start by having the PyVCF wrapper use _Record and _Call to keep things simple. In any case, if I do end up writing an interface/new structure, I would definitely write it to allow substitution of CyVCF or other parsers. Lenna From p.j.a.cock at googlemail.com Sat May 19 08:02:35 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 19 May 2012 13:02:35 +0100 Subject: [Biopython-dev] Python 2.5 support? Message-ID: Hello all, I'm curious how many of you (on the dev list) are still using Python 2.5, and if the time has come to start deprecating our support for it? One good reason would be if it helps us with the Python 3 migration (where currently we are restricted to a subset of features available on or back ported to the older Python releases). Here at work, we are mostly using Python 2.6, although the system default Python on many of our servers is Python 2.4 (CentOS boxes). So dropping Python 2.5 won't cause me personally a problem. A quick review of the main Linux distributions and their current long term support platforms might be prudent at this point. One issue is Jython support, which has to date targetted the C python 2.5 feature set - their new alpha release of Jython 2.7 now brings with it some of the new functionality in C Python 2.6 & 2.7, which is helpful. 
Regards,

Peter

From reece at harts.net  Sun May 20 12:01:54 2012
From: reece at harts.net (Reece Hart)
Date: Sun, 20 May 2012 09:01:54 -0700
Subject: [Biopython-dev] GSoC python variant update 2
In-Reply-To:
References: <871umju0ay.fsf@fastmail.fm>
Message-ID:

On Wed, May 16, 2012 at 1:01 PM, Lenna Peterson wrote:

> I don't think `SeqFeature` or an extension thereof would be appropriate
> for storing Variant data; therefore, I intend to make a new structure based
> on `_Record` and `_Call` in PyVCF.
>
> Brad> I don't think we should add a new representation class unless we
> Brad> explicitly need to store additional information.
>
> The reason I suggested a new representation class is so data from all
> parsers can be stored in the same way.

Lenna makes a very sound point. A Variant class should be able to represent
all variant types, and therefore represent *only* the salient features of a
generalized variant. It should not be specific to a particular format.

For instance, _Record expects a CHROM, but this immediately eliminates its
use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and
FORMAT are not intrinsic properties of a variant. Don't get me wrong --
it's exactly right for a *VCF* variant. However, _Record was never intended
to be the variant abstraction that I think we should be aiming for at this
time. Being VCF-specific isn't bad, but let's make sure the name accurately
reflects the level of abstraction.

Here's a counter example:

variant = < ref_ac, var_type, loc, pre, post, rpt_count >
  ref_ac -- accession
  var_type -- type of variant/coordinate system (genomic, cds, protein)
  pre -- "before" seq (aka reference); empty if insertion
  post -- "after" seq (alt); empty if deletion or repeat
  rpt_count -- min, max count for repeats

I implemented variants roughly this way once (
http://bitbucket.org/reece/bio-hgvs-perl). This structure is agnostic
regarding peculiarities of a particular format. I show it as an example,
not a proposal.
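As a rough Python sketch of the tuple above: the field names are Reece's, but the use of a namedtuple, the (start, end) form of loc, None as the rpt_count of a non-repeat, the classify() helper and the accession string are all assumptions made only for illustration, not anything proposed in this thread.

```python
from collections import namedtuple

# Fields follow the <ref_ac, var_type, loc, pre, post, rpt_count> layout.
# Assumptions: loc is a 0-based, half-open (start, end) pair and
# rpt_count is either None or a (min, max) pair.
Variant = namedtuple("Variant", "ref_ac var_type loc pre post rpt_count")


def classify(variant):
    """Name the kind of change using only the format-neutral fields."""
    if variant.rpt_count is not None:
        return "repeat"
    if not variant.pre and variant.post:
        return "insertion"  # empty "before" sequence
    if variant.pre and not variant.post:
        return "deletion"   # empty "after" sequence
    return "substitution"


# "NM_EXAMPLE.1" is a made-up accession, not a real transcript.
snv = Variant("NM_EXAMPLE.1", "cds", (99, 100), "A", "G", None)
ins = Variant("NM_EXAMPLE.1", "cds", (100, 100), "", "T", None)
rpt = Variant("NM_EXAMPLE.1", "cds", (200, 203), "CAG", "", (10, 35))
```

Nothing here refers to CHROM, QUAL, FILTER or any other VCF column, which is the point being made: a format-specific parser would fill in such a tuple, and keep its own extras elsewhere.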
> Therefore, I am planning to write my project to be compatible with Python
> 2.6 and delaying its inclusion in the main Biopython branch until a future
> 2.6+ Biopython release.
>

Has anyone ever polled to see what versions of python people are using? I
wonder whether we should care about 2.6 even (never mind 2.5). My guess is
that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least
it's ascending). I would be content to focus exclusively on 2.7 and 3.0.

-Reece

From mictadlo at gmail.com  Sun May 20 19:04:04 2012
From: mictadlo at gmail.com (Mic)
Date: Mon, 21 May 2012 09:04:04 +1000
Subject: [Biopython-dev] Python 2.5 support?
In-Reply-To:
References:
Message-ID:

I am using 2.7.x. On CentOS or Debian, for example, I install the latest
Python version in a different location and set PYTHONPATH to match. The
reason for 2.7.x is that a multiprocessing bug has been fixed in that
version.

Cheers,
Mic

On Sat, May 19, 2012 at 10:02 PM, Peter Cock wrote:

> Hello all,
>
> I'm curious how many of you (on the dev list) are still
> using Python 2.5, and if the time has come to start
> deprecating our support for it? One good reason
> would be if it helps us with the Python 3 migration
> (where currently we are restricted to a subset of
> features available on or back ported to the older
> Python releases).
>
> Here at work, we are mostly using Python 2.6,
> although the system default Python on many of
> our servers is Python 2.4 (CentOS boxes). So
> dropping Python 2.5 won't cause me personally
> a problem.
>
> A quick review of the main Linux distributions
> and their current long term support platforms
> might be prudent at this point.
>
> One issue is Jython support, which has to date
> targeted the C python 2.5 feature set - their
> new alpha release of Jython 2.7 now brings
> with it some of the new functionality in C
> Python 2.6 & 2.7, which is helpful.
> > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Sun May 20 21:35:08 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 20 May 2012 21:35:08 -0400 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: <871umju0ay.fsf@fastmail.fm> Message-ID: <87ehqe1flf.fsf@fastmail.fm> Lenna and Reece; >> The reason I suggested a new representation class is so data from all >> parsers can be stored in the same way. > > Lenna makes a very sound point. A Variant class should be able to represent > all variant types, and therefore represent *only* the salient features of a > generalized variant. It should not be specific to a particular format. I'm in agreement with you. My thought process is along the lines of: you'll help get to a general representation by exploring the deficiencies of the more specific ones. I think it's hard to invent a fully general scheme from outside. > For instance, _Record expects a CHROM, but this immediately eliminates its > use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and > FORMAT are not intrinsic properties of a variant. Don't get me wrong -- > it's exactly right for a *VCF* variant. However, _Record was never intended > to be the variant abstraction that I think we should be aiming for at this > time. Being VCF-specific isn't bad, but let's make sure the name accurately > reflects the level of abstraction. Also agreed, although you can fit a wide variety of things into this general scheme. 
Ignoring all of the specific naming it's:

- the reference name (chromosome or space or contig or whatever you
  want to call it)
- position
- identifier
- ref/alt seqs (or pre/post)
- key-value pairs associated with the variant
- genotypes associated with the variant (also with key-value pairs)

The real difference between this and your bio-hgvs-perl example is what
you expose as top level from the key-value pairs. VCF exposes QUAL and
FILTER (and I guess identifier too) while you had different choices that
were more right for your particular problem.

This is all brainstorming, rather than a specific suggestion. If I have
to think up something specific, I guess the right thing to do is make it
easy to build a custom object representation that makes coding easy for
specific problem sets from the more generic key/value information.

> Has anyone ever polled to see what versions of python people are using? I
> wonder whether we should care about 2.6 even (never mind 2.5). My guess is
> that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least
> it's ascending). I would be content to focus exclusively on 2.7 and
> 3.0.

I'm agreed, although practically dropping 2.6 support in Biopython won't
happen for a while. Unless there are 2.7 features that we really need, it
shouldn't be too hard to support both. I only miss the multiple context
manager support for with statements, and haven't let myself get hooked
on ordered dicts or dictionary comprehensions yet.

Brad

From redmine at redmine.open-bio.org  Sun May 20 23:30:49 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 21 May 2012 03:30:49 +0000
Subject: [Biopython-dev] [Biopython - Bug #3358] (New) Bio.PDB.Atom does not copy xtra dictionary
Message-ID:

Issue #3358 has been reported by Alexander Campbell.
---------------------------------------- Bug #3358: Bio.PDB.Atom does not copy xtra dictionary https://redmine.open-bio.org/issues/3358 Author: Alexander Campbell Status: New Priority: Normal Assignee: Category: Target version: URL: The Bio.PDB.Atom.copy function does not copy the object's xtra dictionary, leading to the source and the copy Atom objects sharing the same xtra dictionary. This can be observed by calling the id() function on the xtra attribute in the source and copy Atom objects, or more practically by copying an Atom object, adding a value to the copy's xtra dict, and observing that the same value is now present in the source's xtra dict. The fix is to have the Bio.PDB.Atom.copy function call copy.copy() function on the self.xtra dict, as does the Bio.PDB.Entity.copy function. The copied Atom object's xtra dict is now a copy of the source Atom object's xtra dict, and behaves as expected. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon May 21 12:37:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 21 May 2012 17:37:21 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? Message-ID: Hello all, This is something I talked to Bow a little about during our last weekly meeting for his GSoC meeting, but it is broader than just SearchIO... When describing BLAST results, or FASTA alignments, or indeed many other local alignments you typically have a (gapped) query sequence and match sequence fragment, and the co-ordinates describing which part of the full query and matched sequence this is. i.e. You are told the start and end of the subsequence (and perhaps strand). 
The same essentially applies to some multiple alignment formats
in AlignIO as well, including Stockholm/PFAM (where this is
encoded into the record name as identifier slash start-end),
FASTA output (which will be handled via SearchIO in future)
and MAF. http://biopython.org/wiki/Multiple_Alignment_Format

Indeed thinking about how best to handle this was the main
reason I haven't merged Andrew's MAF branch yet. (There are
subtleties, for instance how is the strand given in the file,
do you get start+end explicitly or must the end be inferred
from the start and the sequence, etc).

Currently recording these in the SeqRecord's annotation
dictionary 'works', but does not exploit the structure. In
particular, if the SeqRecord is sliced to get a fragment of
the alignment, this co-ordinate information is lost. It would
be nice if this preserved the start/end/strand and updated
it accordingly.

One idea for doing this is to introduce a new location
property to the SeqRecord (defaulting to None), which
would be a FeatureLocation object normally used for
SeqFeature objects. If an operation couldn't preserve
or update the location, it would become None.

Note that for slicing we will generally need to know the gap
characters of the sequence (in order to recalculate the
sub-sequence's start/end), which for the parsers may
mean some minor updates to ensure the default alphabet
specifies the '-' gap character.

On the other hand, perhaps this 'location' property idea
is overly complicated?

Maybe all we need is a common convention about which
keys to use in the annotation dictionary, and how to store
the information (e.g. Python counting, start < end, and
strand as +1 or -1 if present)?

Thoughts or feedback please? Would a worked example
help with my explanation?

Thanks,

Peter

From chapmanb at 50mail.com  Mon May 21 21:25:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 21 May 2012 21:25:39 -0400
Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects?
In-Reply-To: References: Message-ID: <87ehqduhv0.fsf@fastmail.fm> Peter; > When describing BLAST results, or FASTA alignments, or > indeed many other local alignments you typically have a > (gapped) query sequence and match sequence fragment, > and the co-ordinates describing which part of the full query > and matched sequence this is. i.e. You are told the start > and end of the subsequence (and perhaps strand). [...] > One idea for doing this is to introduce a new location > property to the SeqRecord (defaulting to None), which > would be a FeatureLocation object normally used for > SeqFeature objects. I'm not sure if I understand the representation, but could we handle this as a standard named SeqFeature within the SeqRecord? This would let you store the metadata like gap information within the SeqFeature qualifiers and avoid introducing a new property. > Maybe all we need is a common convention about which > keys to use in the annotation dictionary, and how to store > the information (e.g. Python counting, start < end, and > strand as +1 or -1 if present)? I'm becoming more of a fan of this type of convention key/value approach as opposed to specific attributes but it does seem nice to re-use your existing classes if it holds the same information. > Thoughts or feedback please? Would a worked example > help with my explanation? A worked example might help: not totally sure I grasp all the subtleties, Brad From p.j.a.cock at googlemail.com Tue May 22 05:44:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 10:44:27 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: <87ehqduhv0.fsf@fastmail.fm> References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 2:25 AM, Brad Chapman wrote: >> Thoughts or feedback please? Would a worked example >> help with my explanation? > > A worked example might help: not totally sure I grasp all the > subtleties, > Brad OK. 
This will work best in a mono-spaced font. This was picked out of one of our unit tests, bt009.txt - I just looked for a BLAST pairwise alignment with some gaps: Score = 151 bits (378), Expect = 9e-37 Identities = 88/201 (43%), Positives = 128/201 (62%), Gaps = 9/201 (4%) Query: 1 MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGD 59 M+R ++ITR TKET+IE+ +++D G+ +ST I F +HML TLLT+ + I+ D Sbjct: 1 MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATD 58 Query: 60 HETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYL 119 + D HH++EDVAI LG I LG+K GI+R+ IPMD+ALV LDIS R Sbjct: 59 K--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMA 116 Query: 120 VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARAL 179 + +L ++ +GG TE FF++ A+N+GITLH+++ G NTHHIIE FK+ AL Sbjct: 117 FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLAL 175 Query: 180 KQAVSIDESKVGEIPSSKGVL 200 +A I ++ EI S+KG++ Sbjct: 176 YEATRIVDN---EIRSTKGII 193 When looking at this as a pairwise alignment, for the query SeqRecord the sequence would be MTRISHITRNT...KGVL (with gaps), running from residue 1 to 200 inclusive (one based counting, or 0 to 200 in Python). There are 200 letters plus one gap, meaning the gapped sequence is 201 letters long. Similarly the matched sequence SeqRecord (or subject in BLAST's terminology) is also 201 letters, this time 193 amino acids (residue 1 to 193 inclusive, one based counting) plus 8 gaps. 
To turn this into code, something like this:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> query = SeqRecord(id="query", seq=Seq("MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGDHETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYLVFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARALKQAVSIDESKVGEIPSSKGVL"))
>>> match = SeqRecord(id="match", seq=Seq("MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATDK--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMAFVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLALYEATRIVDN---EIRSTKGII"))

Turn it into an alignment,

>>> from Bio.Align import MultipleSeqAlignment
>>> align = MultipleSeqAlignment([query, match])
>>> print align
Alphabet() alignment with 2 rows and 201 columns
MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TL...GVL query
MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITL...GII match

Assume the start/end co-ordinates are also stored somewhere
(and for nucleotide sequences, the strand too).

[As an aside, a pairwise multiple sequence alignment subclass
or similar for the SearchIO project could have a nicer pretty
print __str__ method showing where the two sequences agree -
as done in the BLAST text output etc.]

Now, suppose we slice this - for simplicity let's take the third chunk
as shown in the original text BLAST output above, i.e. columns
121 to 180 inclusive (one based counting):

>>> print align[:,120:180]
Alphabet() alignment with 2 rows and 60 columns
VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQN...RAL query
FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYN...LAL match

We know (by looking at the BLAST text output) that for this
sub-alignment the query fragment is residue 120 to 179 inclusive
(one based counting) and the match fragment is residue 117 to 175.

I would like the SeqRecord slicing (done by the alignment object)
to be able to deduce these new start/end co-ordinates from the
original co-ordinates.
This means when we do match[120:180] and query[120:180], we need to look at the position of the gaps, and thus convert from the ungapped coordinates (used for the start and end values) into the gapped coordinates (used for the alignment columns). Essentially here the new start is the old start plus the number of non-gap letters being removed from the start (before the slice point). The new end can then be calculated by adding the number of non-gap letters in the selected sequence, or from the old end value reducing it by the number of non-gap letters removed from the end. In fact there is no need to store the end coordinate in memory - it can be found on the fly from the start, sequence, and for nucleotides, strand - which is all you get in MAF format. Doing this would avoid inconsistent sets of values, but imposes a number of complications on the object representation. This is doable - but a sensible question is how common a use case is it to slice alignments (or SeqRecord objects) and care about their co-ordinates? This may actually be more important for classical multiple sequence alignments like Stockholm and MAF than for SearchIO. Peter From p.j.a.cock at googlemail.com Tue May 22 05:48:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 10:48:18 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 10:44 AM, Peter Cock wrote: > ... > I would like the SeqRecord slicing (done by the alignment object) > to be able to deduce these new start/end co-ordinates from the > original co-ordinates. > > ... > > This is doable - but a sensible question is how common a > use case is it to slice alignments (or SeqRecord objects) and > care about their co-ordinates? This may actually be more > important for classical multiple sequence alignments like > Stockholm and MAF than for SearchIO. 
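The bookkeeping described above (new start = old start plus the non-gap letters dropped before the slice point; new end = new start plus the non-gap letters kept) can be sketched in plain Python. This is only an illustration: the function name is hypothetical, it is not part of SeqRecord or MultipleSeqAlignment, and it assumes Python-style 0-based half-open coordinates, a forward-strand sequence and non-negative slice bounds.

```python
def slice_coordinates(gapped, start, col_from, col_to, gap="-"):
    """Return (new_start, new_end) for the row slice gapped[col_from:col_to].

    gapped   -- one aligned row, gap characters included
    start    -- ungapped 0-based position of the row's first letter
    col_from, col_to -- non-negative alignment column bounds
    """
    # Letters (not gaps) removed from the front shift the start.
    dropped = sum(1 for letter in gapped[:col_from] if letter != gap)
    # Letters (not gaps) inside the slice give the new length.
    kept = sum(1 for letter in gapped[col_from:col_to] if letter != gap)
    new_start = start + dropped
    return new_start, new_start + kept


# Toy row: 5 letters and 3 gaps, first letter at ungapped position 10.
row = "AC-GT--A"
print(slice_coordinates(row, 10, 2, 6))  # the slice is "-GT-"
```

For the toy row this prints (12, 14). Applied to the match row of the BLAST example with align[:,120:180], 116 letters precede column 120 and the slice keeps 59, giving (116, 175), i.e. residues 117 to 175 in one based counting, in line with the Sbjct numbering in the text output.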
I was struggling to come up with a simple self contained
motivating example. Here is a possible example with BLAST,
(although you can do similar things with multiple sequence
alignments), but it is actually a larger or different problem.

Suppose you have a domain of interest in a larger protein,
and you want to pull out similar domains from similar proteins
in a BLAST database. So, you do the BLAST search, and filter
the results (e.g. use a minimum match length to ensure you
are looking at full proteins). You then want to pull out just
the region of the matched protein corresponding to your
domain of interest.

To solve this task, a SeqRecord location property is just a
step in the right direction - but what this really boils down to
is mapping between the three different co-ordinate systems:
Ungapped query sequence <-> aligned columns (i.e. the
common gapped sequence coordinates) <-> ungapped
match sequence.

Maybe that would be some nice functionality to add... the
API needs a lot of thought though. Perhaps a specialized
GappedSeq object (which could let us deprecate the current
gapped alphabet class)?

Peter

From w.arindrarto at gmail.com  Tue May 22 06:21:25 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 22 May 2012 12:21:25 +0200
Subject: [Biopython-dev] GSoC Project Update -- 3
Message-ID:

Hi everyone,

I just posted my latest GSoC update here:
http://bow.web.id/blog/2012/05/from-bio-import-searchio/

To summarize the post and what I've done the last week:

* I finished writing all base SearchIO objects and tested them as
well. These objects are the QueryResult object (previously called
Result), representing search results from a single query; the Hit
object, representing pairwise alignments from a single database hit;
and the HSP object, representing a single alignment. I've also written
the docstrings for these objects, so you can run help() on them in an
interpreter session.
The post also includes a very brief outline of the base objects' features, if you are curious. * Using this, I was able to write a working prototype for SearchIO BLAST XML parsing. This prototype has also been tested, using the test cases I've generated previously. For now, it's implemented using our NCBIXML parser, just so that people can have a taste of what SearchIO will feel like. If you want to play around with the prototype, it's available here: https://github.com/bow/biopython/tree/searchio-blastxml. As always, feel free to notify me of suggestions, critiques, and/or feature requests :). regards, Bow From redmine at redmine.open-bio.org Tue May 22 06:40:05 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 22 May 2012 10:40:05 +0000 Subject: [Biopython-dev] [Biopython - Feature #3359] (New) Bio.Phylo.PAML.codeml doesn't parse output with multiple genes Message-ID: Issue #3359 has been reported by Brandon Invergo. ---------------------------------------- Feature #3359: Bio.Phylo.PAML.codeml doesn't parse output with multiple genes https://redmine.open-bio.org/issues/3359 Author: Brandon Invergo Status: New Priority: Normal Assignee: Brandon Invergo Category: Target version: URL: One particular combination of settings for PAML's codeml program creates output for separate genes. The current implementation of Bio.Phylo.PAML.codeml does not properly parse this. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Tue May 22 06:44:37 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 11:44:37 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? 
In-Reply-To: References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 10:44 AM, Peter Cock wrote: > On Tue, May 22, 2012 at 2:25 AM, Brad Chapman wrote: >>> Thoughts or feedback please? Would a worked example >>> help with my explanation? >> >> A worked example might help: not totally sure I grasp all the >> subtleties, >> Brad > > OK. This will work best in a mono-spaced font. This was > picked out of one of our unit tests, bt009.txt - I just looked > for a BLAST pairwise alignment with some gaps: > > [BLASTP example] On reflection, a translated BLAST search would have been more interesting - then you've got at least another layer of co-ordinate transformations to worry about. e.g. for TBLASTX, query nucleotide <-> query protein <-> gapped protein <-> matched protein <-> matched nucleotide. Looking at a short snippet from example bt096.txt, an easy case in that there are no gaps, we have: Score = 100 bits (214), Expect(2) = 4e-49 Identities = 37/44 (84%), Positives = 38/44 (86%), Gaps = 0/44 (0%) Frame = -2/-2 Query 148 FCIFSRDGVLPCWSGWSRTPDLR*SACLGLPKCWDYRCEPPRPA 17 FCIFSRDGV CW GWSRTPDL+*S LGLPKCWDYR EPPRPA Sbjct 630 FCIFSRDGVSSCWPGWSRTPDLK*STHLGLPKCWDYRREPPRPA 499 The translated query sequence is 44 amino acids (including a stop codon), thus 44*3 = 132 base pairs, explaining how it runs from position 148 to 17 (one based) in the nucleotide query sequence. Currently Bio.SeqFeature.FeatureLocation doesn't have anything really intended for mixing nucleotide and protein coordinates, so that may not be the best fit for how to hold and manipulate these co-ordinates. Hmm. 
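The length arithmetic for the TBLASTX snippet above can be checked in a few lines of Python. The helper name is made up for illustration, and it assumes one based inclusive nucleotide coordinates as printed in the BLAST text output.

```python
def codon_count(nuc_from, nuc_to):
    """Number of codons spanned by a one based inclusive nucleotide range.

    For minus-strand frames BLAST prints the larger coordinate first,
    so the order of the two arguments does not matter here.
    """
    n_bases = abs(nuc_from - nuc_to) + 1
    if n_bases % 3:
        raise ValueError("%i bases is not a whole number of codons" % n_bases)
    return n_bases // 3


print(codon_count(148, 17))   # query row of the snippet
print(codon_count(630, 499))  # subject row of the snippet
```

Both calls give 44, matching the 44 amino acids (stop codon included) in each translated row. Going the other way, from a gapped protein column back to a nucleotide position, additionally needs the frame and the gap positions, which is the extra layer of transformation mentioned above.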
Peter From p.j.a.cock at googlemail.com Tue May 22 07:07:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 12:07:15 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <4F9AFA1F.6030103@med.nyu.edu> References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu> Message-ID: Hi all, I've CC'd the BioRuby mailing list just to ensure you're aware of the potentially useful combination of MAF indexing and BGZF compression. We can continue this on the BioRuby list if more appropriate. The start of this Biopython-dev thread is here: http://lists.open-bio.org/pipermail/biopython-dev/2012-April/009561.html This might be a nice opportunity to combine the work of this year's OBF Google Summer of Code students - Clayton is doing MAF for BioRuby, and part of Artem's project could provide BGZF support for BioRuby. On Fri, Apr 27, 2012 at 8:57 PM, Andrew Sczesnak wrote: > Peter, > >> It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py >> and I'm willing to do this myself for MAF (while going over your index >> work - something I want to do anyway). The only potential catch is >> avoiding offset arithmetic. > > I have no problem with you doing this if you're willing. It would be great > to have some code review of MafIndex as well. I'm not sure if Clayton will be able to comment on the Python code, but he should have some thoughts on the MAF indexing itself. Regards, Peter From andrew.sczesnak at med.nyu.edu Tue May 22 17:10:23 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Tue, 22 May 2012 17:10:23 -0400 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: References: Message-ID: <4FBC00BF.5000500@med.nyu.edu> Peter, It sort of seems like every letter in a sequence needs to have its own annotation, mapping it to its chromosome/sequence and position of origin. 
In this way, when multiple sequences are sliced and concatenated the
annotation is preserved. For example,

a = GappedSeq("ATGATG")
               ^    ^
               |    chr1:6
               chr1:1

b = GappedSeq("GGG")
               ^ ^
               | chr1:502
               chr1:500

b = b.reverse_complement()
c = a + b = GappedSeq("ATGATGGGG")

Such that c[1].someproperty = "chr1[+]2" while c[7].someproperty =
"chr1[-]501". Strand information could be preserved on a per-letter
basis and flipped from -1 to +1 upon reverse_complement(). The API
could find and report contiguous stretches by analyzing these
per-letter annotations, for example:

>>> print c
GappedSeq('ATGATGGGG', someproperty=["chr1[+]1-6", "chr1[-]500-502"])

The issue of gaps and of translating multiple alignments of gapped
sequences could be resolved by having a convention where gaps always
belong to the right-nearest letter except in the case of right-terminal
gaps. For example:

a = GappedSeq("----AGCG-ATG---")
               000001234456666

a[0] = GappedSeq("----A")
a[1] = GappedSeq("G")
a[4] = GappedSeq("-A")
a[6] = GappedSeq("G---")

A nucleotide triplet of this sequence would thus look like this:

a[:3] = GappedSeq("----AGC")
a[-3:] = GappedSeq("ATG---")

In the case of slicing a MultipleSeqAlignment of GappedSeq objects,
there would have to be an "anchor" sequence (like there is in UCSC MAF
files) with which other sequences in the alignment are sliced in
reference to. For example:

a = GappedSeq("----AGCG-ATG---")
a = GappedSeq("----AGCG-ATG---")
a = GappedSeqAlignment(

On 05/21/2012 12:37 PM, Peter Cock wrote:
> Hello all,
>
> This is something I talked to Bow a little about during our last
> weekly meeting for his GSoC meeting, but it is broader than
> just SearchIO...
>
> When describing BLAST results, or FASTA alignments, or
> indeed many other local alignments you typically have a
> (gapped) query sequence and match sequence fragment,
> and the co-ordinates describing which part of the full query
> and matched sequence this is. i.e.
You are told the start > and end of the subsequence (and perhaps strand). > > The same essentially applies to some multiple alignment > formats in AlignIO as well, including Stockholm/PFAM > (where this is encoded into the record name as identifier > slash start-end), FASTA output (which will be handled via > SearchIO in future) and MAF. > > http://biopython.org/wiki/Multiple_Alignment_Format > > Indeed thinking about how best to handle this was the > main reason I haven't merged Andrew's MAF branch yet. > > (There are subtleties, for instance how is the strand given > in the file, do you get start+end explicitly or must the end > be inferred from the start and the sequence, etc). > > Currently recording these in the SeqRecord's annotation > dictionary 'works', but does not exploit the structure. In > particular, if the SeqRecord is sliced to get a fragment > of the alignment, this co-ordinate information is lost. It > would be nice if this preserved the start/end/strand and > updated it accordingly. > > One idea for doing this is to introduce a new location > property to the SeqRecord (defaulting to None), which > would be a FeatureLocation object normally used for > SeqFeature objects. > > If an operation couldn't preserve or update the location, > it would become None. Note that slicing we will generally > need to know the gap characters of the sequence (in > order to recalculate the sub-sequence's start/end), which > for the parsers may mean some minor updates to ensure > the default alphabet specifies the '-' gap character. > > On the order hand, perhaps this 'location' property idea > is overly complicated? > > Maybe all we need is a common convention about which > keys to use in the annotation dictionary, and how to store > the information (e.g. Python counting, start< end, and > strand as +1 or -1 if present)? > > Thoughts or feedback please? Would a worked example > help with my explanation? 
>
> Thanks,
>
> Peter

From andrew.sczesnak at med.nyu.edu Tue May 22 17:31:48 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Tue, 22 May 2012 17:31:48 -0400
Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects?
In-Reply-To:
References:
Message-ID: <4FBC05C4.1060302@med.nyu.edu>

Apologies, I accidentally hit send before finishing.

--

Peter,

It sort of seems like every letter in a sequence needs to have its own
annotation, mapping it to its chromosome/sequence and position of
origin. In this way, when multiple sequences are sliced and
concatenated the annotation is preserved. For example,

a = GappedSeq("ATGATG")
               ^    ^
               |    chr1:6
               chr1:1

b = GappedSeq("GGG")
               ^ ^
               | chr1:502
               chr1:500

b = b.reverse_complement()
c = a + b = GappedSeq("ATGATGGGG")

Such that c[1].someproperty = "chr1[+]2" while c[7].someproperty =
"chr1[-]501". Strand information could be preserved on a per-letter
basis and flipped from -1 to +1 upon reverse_complement(). The API
could find and report contiguous stretches by analyzing these
per-letter annotations, for example:

>>> print c
GappedSeq('ATGATGGGG', someproperty=["chr1[+]1-6", "chr1[-]500-502"])

The issue of gaps and of translating multiple alignments of gapped
sequences could be resolved by having a convention where gaps always
belong to the right-nearest letter except in the case of right-terminal
gaps. For example:

a = GappedSeq("----AGCG-ATG---")
               000001234456666

a[0] = GappedSeq("----A")
a[1] = GappedSeq("G")
a[4] = GappedSeq("-A")
a[6] = GappedSeq("G---")

A nucleotide triplet of this sequence would thus look like this:

a[:3] = GappedSeq("----AGC")
a[-3:] = GappedSeq("ATG---")

In the case of slicing a MultipleSeqAlignment of GappedSeq objects,
there would have to be an "anchor" sequence (like there is in UCSC MAF
files) with which other sequences in the alignment are sliced in
reference to.
For example:

a = GappedSeq("----AGCG-ATG---", id="a", anchor=True)
b = GappedSeq("AG--GG---ATAG--", id="b")
c = GappedSeq("A--CGG---ATAGGG", id="c")
d = GappedSeqAlignment([a, b, c])

>>> print d[:,:3]
SingleLetterAlphabet() alignment with 3 rows and 7 columns
----AGC a, anchor=True
AG--GG- b
A--CGG- c

One problem with this might be how to translate the multiple
alignment... in this case, should b and c have no translation?

Thanks,
Andrew

On 05/21/2012 12:37 PM, Peter Cock wrote:
> Hello all,
>
> This is something I talked to Bow a little about during our last
> weekly meeting for his GSoC meeting, but it is broader than
> just SearchIO...
>
> When describing BLAST results, or FASTA alignments, or
> indeed many other local alignments you typically have a
> (gapped) query sequence and match sequence fragment,
> and the co-ordinates describing which part of the full query
> and matched sequence this is. i.e. You are told the start
> and end of the subsequence (and perhaps strand).
>
> The same essentially applies to some multiple alignment
> formats in AlignIO as well, including Stockholm/PFAM
> (where this is encoded into the record name as identifier
> slash start-end), FASTA output (which will be handled via
> SearchIO in future) and MAF.
>
> http://biopython.org/wiki/Multiple_Alignment_Format
>
> Indeed thinking about how best to handle this was the
> main reason I haven't merged Andrew's MAF branch yet.
>
> (There are subtleties, for instance how is the strand given
> in the file, do you get start+end explicitly or must the end
> be inferred from the start and the sequence, etc).
>
> Currently recording these in the SeqRecord's annotation
> dictionary 'works', but does not exploit the structure. In
> particular, if the SeqRecord is sliced to get a fragment
> of the alignment, this co-ordinate information is lost. It
> would be nice if this preserved the start/end/strand and
> updated it accordingly.
> > One idea for doing this is to introduce a new location > property to the SeqRecord (defaulting to None), which > would be a FeatureLocation object normally used for > SeqFeature objects. > > If an operation couldn't preserve or update the location, > it would become None. Note that slicing we will generally > need to know the gap characters of the sequence (in > order to recalculate the sub-sequence's start/end), which > for the parsers may mean some minor updates to ensure > the default alphabet specifies the '-' gap character. > > On the order hand, perhaps this 'location' property idea > is overly complicated? > > Maybe all we need is a common convention about which > keys to use in the annotation dictionary, and how to store > the information (e.g. Python counting, start< end, and > strand as +1 or -1 if present)? > > Thoughts or feedback please? Would a worked example > help with my explanation? > > Thanks, > > Peter -- Andrew Sczesnak Bioinformatician, Littman Lab Howard Hughes Medical Institute New York University School of Medicine 540 First Avenue New York, NY 10016 p: (212) 263-6921 f: (212) 263-1498 e: andrew.sczesnak at med.nyu.edu From p.j.a.cock at googlemail.com Wed May 23 05:29:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 23 May 2012 10:29:52 +0100 Subject: [Biopython-dev] Fwd: buildbot failure in Biopython on Linux 64 - Python 2.6 In-Reply-To: <201205230307.q4N37ujM032082@testing.open-bio.org> References: <201205230307.q4N37ujM032082@testing.open-bio.org> Message-ID: Hi Brandon, I only tested your fix on my Mac which doesn't have PAML installed. The buildslaves caught some problems last night: e.g. 
from a 64bit Linux machine,

======================================================================
FAIL: testParseAllNSsites (__main__.ModTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PAML_codeml.py", line 245, in testParseAllNSsites
    self.assertEqual(len(results["NSsites"]), 6, version_msg)
AssertionError: Improper parsing for version 4.3

And from a 32bit Windows machine,

======================================================================
FAIL: testParseAllNSsites (test_PAML_codeml.ModTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\repositories\BuildBotBiopython\win32\build\build\py3.2\Tests\test_PAML_codeml.py", line 245, in testParseAllNSsites
    self.assertEqual(len(results["NSsites"]), 6, version_msg)
AssertionError: 1 != 6 : Improper parsing for version 4.1

Can you reproduce this locally?

Thanks,

Peter

From p.j.a.cock at googlemail.com Wed May 23 09:29:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 May 2012 14:29:44 +0100
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References:
Message-ID:

On Tue, Apr 17, 2012 at 5:11 PM, Kevin Jacobs wrote:
> On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock wrote:
>>
>> I've just rebased my bgzf branch, which I think is ready to apply to the
>> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
>> https://github.com/peterjc/biopython/tree/bgzf2
>>
>> Would anyone like to review this please? There are unittests and
>> plenty of docstrings - but so far nothing in the Tutorial though.
>>
>
> Hi Peter,
>
> I've implemented code to create BAM/tabix style index files and perform
> lookups, so it has been high on my list to test and validate your BGZF code
> (rather having to write my own).
> I'm notoriously short on time, but this is
> in the critical path for several projects and I'm going to work on it over
> the next week or so.
>
> -Kevin

Hi Kevin,

Did you get a chance to look at my BGZF code?

Thanks,

Peter

From bioinformed at gmail.com Wed May 23 10:09:03 2012
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Wed, 23 May 2012 10:09:03 -0400
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References:
Message-ID:

Hi Peter,

I've been going through it carefully, though I've been diverted a few
times (by fixing bgzip problems in Boost and adapting BAM/tabix interval
indexing to HDF5). I'll clean up my notes and finish going through the
code over the next few days.

-Kevin

On Wed, May 23, 2012 at 9:29 AM, Peter Cock wrote:
>
> Hi Kevin,
>
> Did you get a chance to look at my BGZF code?
>
> Thanks,
>
> Peter
>
>

From arklenna at gmail.com Wed May 23 17:56:03 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 23 May 2012 17:56:03 -0400
Subject: [Biopython-dev] GSoC python variant update 3
Message-ID:

Hi all,

Latest blog post here:
http://arklenna.tumblr.com/post/23630012065/week-1

Brief summary:

I have reversed my prior conclusion that `SeqRecord` is inadequate for
holding variant data. It is still not ideal, but the advantages of
using an existing native object are substantial, and the disadvantages
can be reduced by creating an accessor for the variant-specific data
within a `SeqRecord`.

I've made an outline of how I would store the information returned by
PyVCF within `SeqRecord` and `SeqFeature` objects. It includes a few
questions about the most logical way to store certain variant
information.

As the coding period has now started, I'll be pushing some prototypes
to GitHub in the near future.
Lenna

From p.j.a.cock at googlemail.com Thu May 24 05:18:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 24 May 2012 10:18:33 +0100
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

On Thu, May 24, 2012 at 6:52 AM, Artem Tarasov wrote:
> Hi all,
>
> it's a good point that many line-based formats need some sort of compression
> with indexing, and BGZF is good enough in that sense.

BGZF doesn't have to be used with line-based formats, anything
with sequential records would work (like BAM files of course). I've not
tried it to see how well it compressed, but SFF files in BGZF should
work too as another example.

>> So far, I think Artem's BGZF implementation is entirely in D; I may just
>> add Ruby support for BGZF separately.
>
> The only problem I see with that approach is that it's hardly possible to
> get parallel compression with MRI. But overall I tend to agree with Clayton.
> Firstly, it's hard to abstract away some common interface right now, not
> writing any code and looking at it. Secondly, there're still problems with D
> shared library support. We were assured by GDC developer that they'll get
> solved soon, but at the moment the situation is far from perfect.

My BGZF code is pure Python (using C zlib via Python's zlib library),
and does not currently tackle parallel compression or decompression.
There has been recent work in samtools for this.

We don't need parallel compression/decompression of BGZF for it to
be useful.
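
To make the "anything with sequential records" point concrete, here is a minimal standard-library sketch of block-wise gzip compression with random access. It is a simplification and not the Bio.bgzf code: real BGZF blocks carry an extra header subfield recording the block size and are capped at 64 KiB of uncompressed data, neither of which is reproduced here. The helper names and the tiny block size are assumptions chosen for the demonstration; the virtual offset layout (block start shifted left 16 bits, OR-ed with the offset inside the block) follows the BAM specification.

```python
import gzip
import zlib


def write_blocks(data, block_size):
    """Compress data as concatenated gzip members, one per block.

    Returns the compressed bytes and the start offset of each block
    within them.
    """
    blob = bytearray()
    starts = []
    for i in range(0, len(data), block_size):
        starts.append(len(blob))
        blob += gzip.compress(data[i:i + block_size])
    return bytes(blob), starts


def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one int."""
    return (block_start << 16) | within_block


def read_at(blob, virtual_offset, size):
    """Read up to size bytes at virtual_offset, from a single block only."""
    block_start = virtual_offset >> 16
    within_block = virtual_offset & 0xFFFF
    # wbits=31 makes zlib expect a gzip header; the decompressor stops at
    # the end of the first member, so only this one block is decoded.
    decompressor = zlib.decompressobj(31)
    block = decompressor.decompress(blob[block_start:])
    return block[within_block:within_block + size]


# Ten fixed-width records of 9 bytes each, two records per block.
data = b"".join(b"record%02d\n" % i for i in range(10))
blob, starts = write_blocks(data, block_size=18)

# Record 5 starts at byte 45, i.e. 9 bytes into the third block.
offset = make_virtual_offset(starts[2], 9)
print(read_at(blob, offset, 9))

# The concatenation is still an ordinary multi-member gzip stream.
print(gzip.decompress(blob) == data)
```

Because each block is a self-contained gzip member, an index only needs to store one virtual offset per record to jump straight to it.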
Peter From albl500 at york.ac.uk Thu May 24 07:06:45 2012 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 24 May 2012 12:06:45 +0100 Subject: [Biopython-dev] json formatting of SeqRecord objects Message-ID: <6817662.WQ7uPsvdBl@metabuntu> Dear all, I've written a fairly simple SeqRecord formatter to convert sequences to/from JSON objects, and wondered if it might be useful enough to be included in BioPython. It currently injects 'json' into SeqIO and AlignIO's _FormatToIterator and _FormatToWriter dictionaries, so can be used like any other SeqRecord format. I'm not sure where exactly I should submit it, but I thought here might do as an initial proposal.. I attach the source code. If you'd be interested in using it, let me know and I'll tidy it up to standards. Kind regards, Alex -- Alex Leach BSc. MRes. Department of Biology University of York York YO10 5DD United Kingdom EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: jsonIO.py Type: text/x-python Size: 5756 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Thu May 24 07:24:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 24 May 2012 12:24:27 +0100 Subject: [Biopython-dev] json formatting of SeqRecord objects In-Reply-To: <6817662.WQ7uPsvdBl@metabuntu> References: <6817662.WQ7uPsvdBl@metabuntu> Message-ID: On Thu, May 24, 2012 at 12:06 PM, Alex Leach wrote: > Dear all, > > I've written a fairly simple SeqRecord formatter to convert sequences to/from > JSON objects, and wondered if it might be useful enough to be included in > BioPython. That does look interesting, but from scanning the code the JSON representation is extremely Biopython specific. I take it there is no existing JSON representation in wide usage that you know of? It strikes me that something common between the Bio* projects would be much more valuable. 
I know that TogoWS (which uses both BioRuby and BioPerl internally) has some JSON support, but it looks more like a dump of the raw file from a quick look. How are you using this now? Do you pass JSON encoded SeqRecord objects over the network for something? > It currently injects 'json' into SeqIO and AlignIO's _FormatToIterator and > _FormatToWriter dictionaries, so can be used like any other SeqRecord format. > I'm not sure where exactly I should submit it, but I thought here might do as > an initial proposal.. > > I attach the source code. If you'd be interested in using it, let me know and > I'll tidy it up to standards. > > Kind regards, > Alex If you are happy with git, we'd suggest you make a fork on github (i.e. make a copy of the repository), then develop the new code on a new branch. Peter From p.j.a.cock at googlemail.com Thu May 24 07:32:53 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 24 May 2012 12:32:53 +0100 Subject: [Biopython-dev] json formatting of SeqRecord objects In-Reply-To: References: <6817662.WQ7uPsvdBl@metabuntu> Message-ID: On Thu, May 24, 2012 at 12:24 PM, Peter Cock wrote: > On Thu, May 24, 2012 at 12:06 PM, Alex Leach wrote: >> Dear all, >> >> I've written a fairly simple SeqRecord formatter to convert sequences to/from >> JSON objects, and wondered if it might be useful enough to be included in >> BioPython. > > That does look interesting, but from scanning the code the JSON > representation is extremely Biopython specific. I take it there is no > existing JSON representation in wide usage that you know of? It > strikes me that something common between the Bio* projects would > be much more valuable. > > I know that TogoWS (which uses both BioRuby and BioPerl internally) > has some JSON support, but it looks more like a dump of the raw > file from a quick look. e.g. 
Consider this two-protein example for UniProt:

http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE
http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE.fasta
http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE.json

Or with GenBank,

http://togows.dbcls.jp/entry/protein/117606345,117606345
http://togows.dbcls.jp/entry/protein/117606345,117606345.fasta
http://togows.dbcls.jp/entry/protein/117606345,117606345.json

Currently TogoWS' JSON output is essentially the raw record (here
plain text "swiss" format or plain text GenBank), using a JSON list
to make the division into the records explicit.

Peter

From mictadlo at gmail.com Fri May 25 02:49:13 2012
From: mictadlo at gmail.com (Mic)
Date: Fri, 25 May 2012 16:49:13 +1000
Subject: [Biopython-dev] [BioRuby] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

I think Picard-tools does parallel compression/decompression of BGZF.

Cheers,
Mic

On Thu, May 24, 2012 at 7:18 PM, Peter Cock wrote:
> On Thu, May 24, 2012 at 6:52 AM, Artem Tarasov wrote:
> > Hi all,
> >
> > it's a good point that many line-based formats need some sort of compression
> > with indexing, and BGZF is good enough in that sense.
>
> BGZF doesn't have to be used with line-based formats, anything
> with sequential records would work (like BAM files of course). I've not
> tried it to see how well it compressed, but SFF files in BGZF should
> work too as another example.
>
> >> So far, I think Artem's BGZF implementation is entirely in D; I may just
> >> add Ruby support for BGZF separately.
> >
> > The only problem I see with that approach is that it's hardly possible to
> > get parallel compression with MRI. But overall I tend to agree with Clayton.
> > Firstly, it's hard to abstract away some common interface right now, not
> > writing any code and looking at it.
> > Secondly, there're still problems with D
> > shared library support. We were assured by GDC developer that they'll get
> > solved soon, but at the moment the situation is far from perfect.
>
> My BGZF code is pure Python (using C zlib via Python's zlib library),
> and does not currently tackle parallel compression or decompression.
> There has been recent work in samtools for this.
>
> We don't need parallel compression/decompression of BGZF for it to
> be useful.
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From bioinformed at gmail.com Fri May 25 07:15:00 2012
From: bioinformed at gmail.com (Kevin Jacobs )
Date: Fri, 25 May 2012 07:15:00 -0400
Subject: [Biopython-dev] [BioRuby] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

On Fri, May 25, 2012 at 2:49 AM, Mic wrote:
> I think Picard-tools does parallel compression/decompression of BGZF.
>
>

Here is what Picard does for one command:

MergeSamFiles
Merges multiple SAM/BAM files into one file.

USE_THREADING=Boolean
Option to create a background thread to encode, compress and write to
disk the output file. The threaded version uses about 20% more CPU and
decreases runtime by ~20% when writing out a compressed BAM file.
Default value: false. This option can be set to 'null' to clear the
default value. Possible values: {true, false}

BAM output (dominated by zlib compression and/or IO write latency) is
run in a different thread, but is still performed sequentially over
blocks. The recent samtools fork attempts to buffer uncompressed BAM
blocks and allocates multiple threads to compress several in parallel
since they are independent.
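
The idea of compressing buffered blocks in parallel because they are independent can be sketched with the Python standard library. This is not the samtools or Picard implementation, only an illustration under stated assumptions: the function name and the block size are hypothetical, and plain gzip members stand in for real BGZF blocks. Whether threads give a speed-up depends on the interpreter releasing its lock during compression, which is not measured here.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_blocks_parallel(data, block_size=60000, workers=4):
    """Compress fixed-size blocks concurrently and concatenate the results.

    Each block becomes its own gzip member, so no block depends on the
    compression state of any other and the work can be spread over threads.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so the blocks are
        # written out in the same sequence as they were read.
        compressed = list(pool.map(gzip.compress, blocks))
    return b"".join(compressed)


payload = bytes(range(256)) * 2000  # 512,000 bytes of sample data
output = compress_blocks_parallel(payload)

# The result is a normal multi-member gzip stream.
print(gzip.decompress(output) == payload)
```

Only the compression step is parallel in this sketch; writing the blocks to disk stays sequential, which matches the ordering constraint described above.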
-Kevin

From p.j.a.cock at googlemail.com Mon May 28 07:06:40 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 28 May 2012 12:06:40 +0100
Subject: [Biopython-dev] HMMER in SearchIO
Message-ID:

Hi Bow,

I've been looking over your GSoC branch, and noticed that for HMMER3
we've only talked about the regular text output. I think that the table
output is also worth supporting (offers one line per query, or one line
per domain). This isn't tab separated but variable spaces to give a
fixed column layout, but should be easier to parse.

Something to think about later on...

Peter

From arklenna at gmail.com Tue May 29 17:32:47 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 29 May 2012 17:32:47 -0400
Subject: [Biopython-dev] SeqRecord id behavior
Message-ID:

Hi all,

I have some questions/comments regarding how SeqRecord handles various
arguments.

>>> print SeqRecord(seq="G")
ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
'G'
>>> print SeqRecord(seq="G", id=2)
TypeError: id argument should be a string
>>> print SeqRecord(seq="G", id=None)
Name: <unknown name>
Description: <unknown description>
Number of features: 0
'G'

1. Couldn't a sequence id hypothetically be an integer? In which
case, it could be converted to a string.

2. Regarding this comment on line 180:
https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L180

    if id is not None and not isinstance(id, basestring):
        #Lots of existing code uses id=None... this may be a bad idea.
        raise TypeError("id argument should be a string")

Why might that be a bad idea? id=None will currently set self.id to
None, so it doesn't affect the type checking.

3.
Is it desirable to be able to remove the id from the __str__
representation, or would it be more consistent to do this:

    if id == "<unknown id>" or id is None:
        self.id = "<unknown id>"
    else:
        (typecheck here)

Lenna

From p.j.a.cock at googlemail.com Tue May 29 18:02:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 May 2012 23:02:20 +0100
Subject: [Biopython-dev] SeqRecord id behavior
In-Reply-To:
References:
Message-ID:

On Tue, May 29, 2012 at 10:32 PM, Lenna Peterson wrote:
> Hi all,
>
> I have some questions/comments regarding how SeqRecord handles various
> arguments.
>
> >>> print SeqRecord(seq="G")
> ID: <unknown id>
> Name: <unknown name>
> Description: <unknown description>
> Number of features: 0
> 'G'
> >>> print SeqRecord(seq="G", id=2)
> TypeError: id argument should be a string
> >>> print SeqRecord(seq="G", id=None)
> Name: <unknown name>
> Description: <unknown description>
> Number of features: 0
> 'G'
>
> 1. Couldn't a sequence id hypothetically be an integer? In which
> case, it could be converted to a string.

We want to be able to assume a string for things like the string
formatting operators used in SeqRecord output (dealing with None
as a special case is annoying enough).

> 2. Regarding this comment on line 180:
> https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L180
>
>     if id is not None and not isinstance(id, basestring):
>         #Lots of existing code uses id=None... this may be a bad idea.
>         raise TypeError("id argument should be a string")
>
> Why might that be a bad idea? id=None will currently set self.id to
> None, so it doesn't affect the type checking.

Using None for the ID prevents code assuming it is a string
(but see below).

> 3. Is it desirable to be able to remove the id from the __str__
> representation,

No - the sequence and the ID are the two most important
bits of a SeqRecord.

> or would it be more consistent to do this:
>
>     if id == "<unknown id>" or id is None:
>         self.id = "<unknown id>"
>     else:
>         (typecheck here)
>
> Lenna

I never liked the fact that "<unknown id>" has a space in it.
This breaks the assumption of loads of file formats. Many file formats
don't like an empty ID, so maybe "" is better.

On the other hand, it is fairly common in Python to use None as
a missing data representation... which currently the SeqRecord
allows you to do.

Note these SeqRecord defaults predate Bio.SeqIO - if we didn't have
to worry about breaking existing code I would much rather make the
ID a mandatory SeqRecord argument.

Peter

From w.arindrarto at gmail.com Wed May 30 17:44:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 30 May 2012 23:44:04 +0200
Subject: [Biopython-dev] GSoC Project Update -- 4
Message-ID:

Hi everyone,

I just posted my latest GSoC update here:
http://bow.web.id/blog/2012/05/assembling-the-parsers/

To summarize:

I've been working on more SearchIO parsers last week, adding more
formats to support. We now have a SearchIO-specific BLAST+ XML parser
(it was first implemented on top of NCBIXML). It uses ElementTree as
the base XML parser, with promising performance gains. I've also
completed SearchIO's blast tabular parser, which takes in the BLAST+
tabular output files with or without headers. If the tabular file has
headers, it can parse any number of columns in any order as long as
the columns with hit and query IDs are present. Finally, I've finished
writing the HMMER plain text parser. For now, the parser can handle
outputs from hmmscan and hmmsearch, single and multiple queries. All
these parsers have been tested using the test cases I've generated
previously.

Additionally, I also had a public discussion with Peter on Github
regarding SearchIO objects here:
https://github.com/bow/biopython/commit/69a0ab64dfa7718f7455ca4c3961e95277fb4dbc#-P0,
if anyone is interested. It started as a discussion on some behaviors
of the HSP object, but also relates to other issues raised earlier
(the dynamic SeqRecord coordinates Peter brought up earlier and
Biopython's platform support).

That's it for this week :).
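
As a rough illustration of the header-driven tabular parsing summarized above, the following sketch reads the "# Fields:" comment line that BLAST+ writes in its commented tabular output and uses it to name the columns of each hit line. It is not the SearchIO parser: the function name is hypothetical, and the field names "query id" and "subject id" are assumptions that may differ between BLAST+ versions.

```python
def parse_commented_tabular(lines, required=("query id", "subject id")):
    """Yield one dict per hit line, keyed by the names in the header.

    Columns may appear in any number and order; only the fields listed
    in required must be present in the '# Fields:' line.
    """
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# Fields:"):
            fields = [name.strip() for name in line[len("# Fields:"):].split(",")]
            missing = [name for name in required if name not in fields]
            if missing:
                raise ValueError("missing required columns: %s" % ", ".join(missing))
        elif line and not line.startswith("#"):
            if fields is None:
                raise ValueError("data line seen before any '# Fields:' header")
            yield dict(zip(fields, line.split("\t")))


sample = [
    "# BLASTN 2.2.26+\n",
    "# Query: example_query\n",
    "# Fields: subject id, evalue, query id\n",
    "hit_one\t1e-20\texample_query\n",
    "hit_two\t0.003\texample_query\n",
]

for row in parse_commented_tabular(sample):
    print(row["query id"], row["subject id"], row["evalue"])
```

Headerless files would need the column order supplied some other way, for example as an argument listing the field names.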
cheers, Bow From redmine at redmine.open-bio.org Thu May 31 23:29:10 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 1 Jun 2012 03:29:10 +0000 Subject: [Biopython-dev] [Biopython - Bug #3360] (New) Bio.Phylo.Applications support for FastTree Message-ID: Issue #3360 has been reported by Eric Talevich. ---------------------------------------- Bug #3360: Bio.Phylo.Applications support for FastTree https://redmine.open-bio.org/issues/3360 Author: Eric Talevich Status: New Priority: Low Assignee: Category: Target version: URL: FastTree is the new hotness in maximum-likelihood tree inference, and the new default/recommendation in SATe (http://phylo.bio.ku.edu/software/sate/sate.html). It also does very fast neighbor-joining trees. Let's create a wrapper for it in Biopython. The interface is very simple and Unix-y, taking an alignment in Phylip or FASTA format as input. Interestingly, it prints the final tree to standard output, unlike RAxML and PhyML. Homepage: http://www.microbesonline.org/fasttree/ ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
From chris.mit7 at gmail.com Wed May 2 14:50:05 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Wed, 2 May 2012 10:50:05 -0400 Subject: [Biopython-dev] HMMER (+ BLAT) wrappers In-Reply-To: References: Message-ID: Hey Bow, I think it would be better to have an option to send the query to the local server should one be running as opposed to wrapping a gfServer that would be local for the duration of a given python process. This would allow for cases where someone has their BLAT queries split up in a script and not incur the loading time for the database multiple times. The gfServer/gfQuery setup is also rather a pain to use from my experience (it's all relative paths). I also think using the -pslx output would be a better default since -psl doesn't provide you with the sequence alignments. Chris On Wed, May 2, 2012 at 4:17 AM, Wibowo Arindrarto wrote: > Hi everyone, > > The past week I've been trying to generate some test cases for BLAST, > HMMER, et al. I was writing some short scripts to automate the test > case generation, when I realized that Biopython doesn't have wrappers > for HMMER and BLAT, so I decided to write them. The code is here: > https://github.com/bow/gsoc/blob/master/hmmer/_HMMER.py and here: > https://github.com/bow/gsoc/blob/master/blat/_BLAT.py. > > If it is of general interest to Biopython, I'd love to submit a pull > request for these wrappers. They were primarily written for test case > generation, but I imagine they won't require that many tweaks to make > it suitable for inclusion in Biopython. However, before I can do that, > there are some issues that I think needs to be discussed: > > 1. Where should the wrappers be put? I noticed that different wrappers > are located in different directories according to their 'theme' (e.g. > BLAST wrappers in Bio.Blast.Applications and ClustalW wrapper in > Bio.Align.Applications). For the HMMER wrapper, should it be put > inside Bio.Motif.Applications?
For the BLAT wrapper, should I create a > new Bio.Blat folder just for it? Yesterday I thought maybe it would be > easier if all application wrappers are put inside the same directory > (e.g. all in Bio.Applications), so maybe that's a viable option for > future releases? > > 2. How should shared options among slightly different programs be > handled? We can rely on creating abstract subclasses for them, but I > find it easier to simply create lists and then combine them in the > different programs. The current HMMER wrapper employs both of these > approaches, but I think it needs to stick to just one approach to make > the code easier to understand. > > 3. Is there a convention for naming the command line arguments? For > example, if the command line option trigger is '--domE', should I name > the Python variable, for example, 'domE', 'dome', 'dom_e', or 'dom_E'? > > 4. For the HMMER wrapper, there are some flags that are exclusive to > each other (i.e. the user can only choose one of the flags). If the > user chooses both, HMMER doesn't show any error messages ~ but nothing > is run. Should the wrapper check for such mutually exclusive flags > when it's created as well? > > 5. For BLAT, the installed suite includes a program that runs a BLAT > server to handle search requests from different clients. It doesn't > seem to be a typical program that should be wrapped by Biopython, but > I might be wrong. Should a wrapper for the server be included as well? 
> > cheers, > Bow > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Wed May 2 15:21:33 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 2 May 2012 17:21:33 +0200 Subject: [Biopython-dev] HMMER (+ BLAT) wrappers In-Reply-To: References: Message-ID: On Wed, May 2, 2012 at 4:50 PM, Chris Mitchell wrote: > Hey Bow, > > I think it would be better to have an option to send the query to the local > server should one be running as opposed to wrapping a gfServer that would be > local for the duration of a given python process. This would allow for > cases where someone has their BLAT queries split up in a script and not > incur the loading time for the database multiple times. The > gfServer/gfQuery setup is also rather a pain to use from my experience (it's > all relative paths). I also think using the -pslx output would be a better > default since -psl doesn't provide you with the sequence alignments. > > Chris > Hi Chris, You are talking about 'gfServer query ...', right? That does make gfServer usable enough to be wrapped. Thanks for pointing that out ~ I overlooked that gfServer can also do queries. I admit that the current BLAT wrapper is very minimal, but like I said, it shouldn't take that much time to write wrappers for the rest of the executables in the suite :). As for the file format, I actually prefer leaving the options as they are (i.e. the program's defaults) to keep surprises for users to a minimum (even though I agree that the pslx output is more informative). Bow From redmine at redmine.open-bio.org Wed May 2 19:04:38 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 2 May 2012 19:04:38 +0000 Subject: [Biopython-dev] [Biopython - Bug #3348] (New) Documentation error Message-ID: Issue #3348 has been reported by Patrick P.
---------------------------------------- Bug #3348: Documentation error https://redmine.open-bio.org/issues/3348 Author: Patrick P Status: New Priority: Normal Assignee: Category: Target version: URL: 4.3.3 Sequence [...] For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(4..18), like this: [...] -------------------------
                                                                         vvv
                                                                          v
 ---> in GenBank/EMBL notation using 1-based counting would be complement(6..18)
                                                                          ^
                                                                         ^^^
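To illustrate the off-by-one being reported, here is a minimal sketch (not Biopython code; `to_genbank_location` is a hypothetical helper written for this example) that converts a 0-based, end-exclusive Python-style location such as 5:18 into GenBank/EMBL 1-based, inclusive notation. Only the start shifts by one; the end value is unchanged.

```python
def to_genbank_location(start, end, strand=1):
    """Convert a 0-based, end-exclusive location to GenBank/EMBL notation.

    Hypothetical helper for illustration only. Python-style start:end
    covers positions start .. end-1 (0-based), which are positions
    start+1 .. end in 1-based, inclusive counting.
    """
    location = "%i..%i" % (start + 1, end)
    if strand == -1:
        # Reverse-strand features are wrapped in complement(...)
        return "complement(%s)" % location
    return location


if __name__ == "__main__":
    # The example from the report: 5:18 on the reverse strand
    print(to_genbank_location(5, 18, strand=-1))
```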
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu May 3 17:50:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 May 2012 18:50:55 +0100 Subject: [Biopython-dev] SeqIO circular In-Reply-To: References: Message-ID: On Saturday, April 28, 2012, Matthias Bernt wrote: > Dear developers, > > I would like to suggest a quick "fix" for the problem. Currently the > parser just returns true per default for the circular property. This > is a wrong piece of information for all circular sequences. > Furthermore its not possible to detect if the parser did return true > because it is its default value or if its really from the data. So I > suggest to return None if the parser does not parse the information. > > What do you think? This should be possible with minimal effort. > > The parsing side of this is trivial - the only piece missing is how best to present the information in the SeqRecord for BioSQL compatibility (and perhaps some extra work on our BioSQL bindings). That requires someone to test where BioPerl stores this in BioSQL (as that is the reference implementation). Without that, a "quick fix" will mostly likely create a bug in our BioSQL support - in that we wouldn't store the circular field in the same way as the other Bio* implementations. Peter From redmine at redmine.open-bio.org Fri May 4 19:33:54 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 4 May 2012 19:33:54 +0000 Subject: [Biopython-dev] [Biopython - Bug #3349] (New) Bio.PDB.Entity.copy child handling errors. Message-ID: Issue #3349 has been reported by Alexander Ford. 
---------------------------------------- Bug #3349: Bio.PDB.Entity.copy child handling errors. https://redmine.open-bio.org/issues/3349 Author: Alexander Ford Status: New Priority: Normal Assignee: Category: Target version: URL: https://github.com/asford/biopython/tree/entity_copy_fix The Bio.PDB.Entity.copy function, introduced in revision 0d8299a9, does not properly handle the entity child list. Iteration over the child_dict results in a loss of child ordering and the explicit call to detach_child results in a destructive modification of the copied entity's child elements' parent reference. The copy function should instead instantiate empty child_list and child_dict elements in the copy object and then add copies of each element in the source object's child_list via the Entity.add function. This will appropriately update the copy's child_dict and the child's parent reference while preserving child ordering. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Sat May 5 09:04:58 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 5 May 2012 10:04:58 +0100 Subject: [Biopython-dev] Fwd: [biopython] Fixing Entity copy method to preserve child ordering. (#37) In-Reply-To: References: Message-ID: Who wants to review this one? Peter ---------- Forwarded message ---------- From: *asford* Date: Friday, May 4, 2012 Subject: [biopython] Fixing Entity copy method to preserve child ordering. (#37) To: Peter Cock Pull request to fix issue #3349. 
You can merge this Pull Request by running: git pull https://github.com/asford/biopython entity_copy_fix Or you can view, comment on it, or merge it online at: https://github.com/biopython/biopython/pull/37 -- Commit Summary -- * Fixing Entity copy method to preserve child ordering. -- File Changes -- M Bio/PDB/Entity.py (11) -- Patch Links -- https://github.com/biopython/biopython/pull/37.patch https://github.com/biopython/biopython/pull/37.diff --- Reply to this email directly or view it on GitHub: https://github.com/biopython/biopython/pull/37 From eric.talevich at gmail.com Sat May 5 16:09:55 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 5 May 2012 12:09:55 -0400 Subject: [Biopython-dev] Fwd: [biopython] Fixing Entity copy method to preserve child ordering. (#37) In-Reply-To: References: Message-ID: I'll check it out today. On Sat, May 5, 2012 at 5:04 AM, Peter Cock wrote: > Who wants to review this one? > > Peter > > ---------- Forwarded message ---------- > From: *asford* > Date: Friday, May 4, 2012 > Subject: [biopython] Fixing Entity copy method to preserve child ordering. > (#37) > To: Peter Cock > > > Pull request to fix issue #3349. > > You can merge this Pull Request by running: > > git pull https://github.com/asford/biopython entity_copy_fix > > Or you can view, comment on it, or merge it online at: > > https://github.com/biopython/biopython/pull/37 > > -- Commit Summary -- > > * Fixing Entity copy method to preserve child ordering. 
> > -- File Changes -- > > M Bio/PDB/Entity.py (11) > > -- Patch Links -- > > https://github.com/biopython/biopython/pull/37.patch > https://github.com/biopython/biopython/pull/37.diff > > --- > Reply to this email directly or view it on GitHub: > https://github.com/biopython/biopython/pull/37 > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Sun May 6 11:09:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 6 May 2012 12:09:30 +0100 Subject: [Biopython-dev] Fwd: 2012 SciPy Bioinformatics Workshop In-Reply-To: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com> References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com> Message-ID: Dear Biopythoneers, Are any of us planning to attend the SciPy meeting? The 2012 SciPy Bioinformatics Workshop is crying out for a Biopython related talk... and from the email below it sounds like they're not just looking for a developer's perspective, but also how Python is being used in bioinformatics. It is quite close after BOSC and ISMB, but July 19 doesn't actually clash: http://www.open-bio.org/wiki/BOSC_2012 SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it clashes with the planned CodeFest too: http://www.open-bio.org/wiki/EU_Codefest_2012 July is definitely conference season... Peter ---------- Forwarded message ---------- From: *Chris Mueller* Date: Thursday, May 3, 2012 Subject: [Numpy-discussion] 2012 SciPy Bioinformatics Workshop To: "chris.mueller at lab7.io" We are pleased to announce the 2012 SciPy Bioinformatics Workshop held in conjunction with SciPy 2012 this July in Austin, TX. Python in biology is not dead yet... in fact, it's alive and well! Remember just a few short years ago when BioPerl ruled the world?
Just one minor paradigm shift* later and Python now has a commanding presence in bioinformatics. From Python bindings to common tools all the way to entire Python-based informatics platforms, Python is used everywhere** in modern bioinformatics. If you use Python for bioinformatics or just want to learn more about how it's being used, join us at the 2012 SciPy Bioinformatics Workshop. We will have speakers from both academia and industry showcasing how Python is enabling biologists to effectively work with large, complex data sets. The workshop will be held the evening of July 19 from 5-6:30. More information about SciPy is available on the conference site: http://conference.scipy.org/scipy2012/ !! Participate !! Are you using Python in bioinformatics? We'd love to have you share your story. We are looking for 3-4 speakers to share their experiences using Python for bioinformatics. Please contact Chris Mueller at chris.mueller [at] lab7.io and Ray Roberts at rroberts [at] enthought.com to volunteer. Please include a brief description or link to a paper/topic which you would like to discuss. Presentations will last for 15 minutes each and will be followed by a panel Q&A. -- * That would be next generation sequencing ** Yes, we aRe awaRe of that otheR language used eveRywhere, but let's celebRate Python Right now.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion at scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion From tiagoantao at gmail.com Sun May 6 11:16:36 2012 From: tiagoantao at gmail.com (Tiago Antão) Date: Sun, 6 May 2012 12:16:36 +0100 Subject: [Biopython-dev] [Biopython] Fwd: 2012 SciPy Bioinformatics Workshop In-Reply-To: References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com> Message-ID: Hi, On Sun, May 6, 2012 at 12:09 PM, Peter Cock wrote: > SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it > clashes with the planned CodeFest too: > http://www.open-bio.org/wiki/EU_Codefest_2012 Are any people from here going to the codefest? Tiago From p.j.a.cock at googlemail.com Mon May 7 08:37:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 7 May 2012 09:37:38 +0100 Subject: [Biopython-dev] [Biopython] Fwd: 2012 SciPy Bioinformatics Workshop In-Reply-To: References: <1336063455.23270.YahooMailNeo@web111204.mail.gq1.yahoo.com> Message-ID: On Sun, May 6, 2012 at 12:16 PM, Tiago Antão wrote: > Hi, > > On Sun, May 6, 2012 at 12:09 PM, Peter Cock wrote: >> SciPy 2012 as a whole does clash with ISMB, and for those in Europe, it >> clashes with the planned CodeFest too: >> http://www.open-bio.org/wiki/EU_Codefest_2012 > > Are any people from here going to the codefest? > > Tiago Brad is going to the pre-BOSC CodeFest in California, http://www.open-bio.org/wiki/Codefest_2012 I'm not sure if we have any Biopython folk signed up for the post-BOSC EU CodeFest in Italy yet. http://www.open-bio.org/wiki/EU_Codefest_2012 I aim to attend one of the CodeFests - trying to firm up summer travel plans now...
Peter

From arklenna at gmail.com Sun May 6 21:26:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 6 May 2012 17:26:30 -0400
Subject: [Biopython-dev] GSoC python variant update
Message-ID:

Hi all,

I've written a few new posts on my blog; here's the latest:

http://arklenna.tumblr.com/post/22542372076/spot-isa-dog

I will attach a UML diagram and include the part of the post addressing the diagram. Click through to the full post for a bonus Einstein quote!

-------

My main goals are not limited to:

* Make the structure parser and file-format agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan's C GFF parser?)
* Maintain encapsulation: limit how much each object can see of objects above and below it
* Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.

The `Variant` object's constructor allows an end user to change the default parsers. Practical implementation details of `parse()` and `write()` will need to be finessed - for example, ways to help the user sift through immense quantities of data. I'm still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.

`Parser` and `Writer` are both abstract classes that will define all methods found in known parsers/writers with `NotImplementedError`s. I'm speculating on whether a Variant-specific exception would be useful, but a custom message should suffice.

Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` would each inherit from both `Parser` and `Writer`. As the name implies, they would serve as the adapter between the generic `Variant` and the specific parser.

I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside `Variant`.
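
The structure described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual skeleton code from the project: `TabWrapper` is a hypothetical stand-in for a real adapter such as `PyVCFWrapper` (it splits tab-separated lines and deliberately does not call PyVCF or BCBio.GFF), and the `parsers` constructor argument is an assumed way of letting the end user swap the default parsers.

```python
import io


class Parser(object):
    """Abstract interface; concrete wrappers override parse()."""

    def parse(self, handle):
        raise NotImplementedError("parse() is not implemented")


class Writer(object):
    """Abstract interface; concrete wrappers override write()."""

    def write(self, records, handle):
        raise NotImplementedError("write() is not implemented")


class TabWrapper(Parser, Writer):
    """Hypothetical adapter standing in for e.g. PyVCFWrapper."""

    def parse(self, handle):
        for line in handle:
            # Skip header/comment lines and blank lines
            if line.startswith("#") or not line.strip():
                continue
            yield line.rstrip("\n").split("\t")

    def write(self, records, handle):
        for fields in records:
            handle.write("\t".join(fields) + "\n")


class Variant(object):
    """Generic entry point that delegates to a format-specific wrapper."""

    def __init__(self, parsers=None):
        # Build the defaults per instance, then apply user overrides
        self.parsers = {"tab": TabWrapper}
        if parsers:
            self.parsers.update(parsers)

    def parse(self, handle, filetype):
        return self.parsers[filetype]().parse(handle)

    def write(self, records, handle, filetype):
        self.parsers[filetype]().write(records, handle)


if __name__ == "__main__":
    data = io.StringIO("#CHROM\tPOS\tREF\tALT\nchr1\t100\tA\tG\n")
    for record in Variant().parse(data, "tab"):
        print(record)
```

Passing wrapper classes rather than instances keeps each `parse()` call independent of any state left over from a previous one.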
-------

I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I'd also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I'd be eternally grateful.

Regards,

Lenna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Variant_UML.png
Type: image/png
Size: 23313 bytes
Desc: not available
URL:

From chapmanb at 50mail.com Tue May 8 00:24:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 07 May 2012 20:24:39 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To:
References:
Message-ID: <87mx5jfrjs.fsf@fastmail.fm>

Lenna;
This all looks great for a top level overview of the classes. This should give you sufficient flexibility to work on the different file types. Another approach is to avoid some of the inheritance and have parse/write dispatch to VCF or GFF specific classes based on the filetype:

    if filetype == "vcf":
        variant_handler = PyVCFVariants()
    elif filetype == "gvf":
        variant_handler = GVFVariants()
    variant_handler.parse(*args)

Avoiding layers can be nice to simplify the architecture, as long as it gives you the flexibility you need.

My suggestion for digging more in the API design would be to start playing with some VCF files and getting comfortable with the data they have and where it would go in Biopython objects. VCF is much more widely used than GVF so it's a good practical place to start.

Thanks for all this work and best of luck on finals,
Brad

> Hi all,
>
> I've written a few new posts on my blog; here's the latest:
>
> http://arklenna.tumblr.com/post/22542372076/spot-isa-dog
>
> I will attach a UML diagram and include the part of the post
> addressing the diagram. Click through to the full post for a bonus
> Einstein quote!
> > ------- > > My main goals are not limited to: > > * Make the structure parser and file-format agnostic: an abstracted > OO design should allow anything to be slotted in (for example, > Marjan's C GFF parser?) > * Maintain encapsulation: limit how much each object can see of > objects above and below it > * Allow extension at multiple levels: some existing parsers may > process data in different ways; this structure should allow handling > both raw data and data in various formats. > > The `Variant` object's constructor allows an end user to change the > default parsers. Practical implementation details of `parse()` and > `write()` will need to be finessed - for example, ways to help the > user sift through immense quantities of data. I'm still in the process > of comparing the data contained in VCF/GVF files as well as the APIs > of PyVCF and BCBio.GFF. > > `Parser` and `Writer` are both abstract classes that will define all > methods found in known parsers/writers with `NotImplementedError`s. > I'm speculating on whether a Variant-specific exception would be > useful, but a custom message should suffice. > > Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` > would each inherit from both `Parser` and `Writer`. As the name > implies, they would serve as the adapter between the generic `Variant` > and the specific parser. > > I anticipate that this structure could easily be extended to allow > intermediate storage in DBs as well as innumerable > sorting/comparing/filtering methods inside `Variant`. > > ------- > > I would appreciate any and all feedback about the overall structure. > Namespace is definitely flexible. I'd also appreciate any specific > genomic variant workflows, and if somebody can point me to smallish > sample files of the same data in both VCF and GVF, I'd be eternally > grateful. 
>
> Regards,
>
> Lenna
Attachment: Variant_UML.png (image/png)
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From casbon at gmail.com Tue May 8 08:57:57 2012
From: casbon at gmail.com (James Casbon)
Date: Tue, 8 May 2012 09:57:57 +0100
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To: <87mx5jfrjs.fsf@fastmail.fm>
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID:

On 8 May 2012 01:24, Brad Chapman wrote:
>
> Lenna;
> This all looks great for a top level overview of the classes. This
> should give you sufficient flexibility to work on the different file
> types. Another approach is to avoid some of the inheritance and have
> parse/write dispatch to VCF or GFF specific classes based on the
> filetype:
>
> if filetype == "vcf":
>     variant_handler = PyVCFVariants()
> elif filetype == "gvf":
>     variant_handler = GVFVariants()
> variant_handler.parse(*args)
>
> Avoiding layers can be nice to simplify the architecture, as long as it
> gives you the flexibility you need.

Hi Lenna,

This looks like a good start, but I would agree with Brad that layers of inheritance aren't always the best way to proceed with Python.

Specific feedback: why does the Variant have parse/write methods when you state that you will use adaptation from the general variation class to the actual parser? I'm also slightly worried this could be pretty slow when dealing with the volume of data you get from a VCF file.

As for the points in your blog post... I have plenty of data; do we know of any SNP callers capable of creating GVF files? If so, I can give you both formats.

The simplest variant workflows would be to filter and then score on some metric. Filter would be to remove noise, so a quality threshold is the simplest one. The metric used depends on the experimental setup. For case/control, a Fisher's exact test is quite easy, or for a single population an HWE test is fairly simple.
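
A minimal sketch of the filter-then-score workflow just described, written in plain Python 3.8+ (for `math.comb`) with no VCF library involved. Everything here is an assumption made for illustration: variants are plain dicts with a hypothetical `qual` key, the threshold of 30 is arbitrary, and the two statistics are textbook versions (a two-sided Fisher's exact test on a 2x2 table and a chi-square statistic for Hardy-Weinberg equilibrium), not the method of any particular tool.

```python
from math import comb


def passes_quality(variant, min_qual=30.0):
    """Noise filter: keep variants whose quality meets the threshold."""
    qual = variant.get("qual")
    return qual is not None and qual >= min_qual


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    For case/control data: rows are cases and controls, columns are
    counts of the alternate and reference alleles.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme as the one observed
    # (small tolerance guards against floating point ties)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square statistic for deviation from Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    variants = [{"id": "v1", "qual": 50.0}, {"id": "v2", "qual": 10.0}]
    kept = [v for v in variants if passes_quality(v)]
    print([v["id"] for v in kept])
    print(fisher_exact_two_sided(3, 0, 0, 3))
    print(hwe_chi_square(25, 50, 25))
```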
Hope this helps,
--
James
http://casbon.me/

From w.arindrarto at gmail.com Wed May 9 16:24:43 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 9 May 2012 18:24:43 +0200
Subject: [Biopython-dev] GSoC Project Update -- 1
Message-ID:

Hi everyone,

I just posted my latest blog update here: http://bow.web.id/blog/2012/05/warming-up-for-the-coding-period/

To summarize, I've spent most of my time getting to know the programs I will support better. This has been done by:

1. Playing around with the programs to see how many different outputs I can generate.
2. Writing scripts to automate test case generation for each of the programs.
3. Writing wrappers (for programs not yet wrapped by Biopython: FASTA, HMMER, and BLAT) to ease writing the test case generators.
4. Continuing to complete my proposed SearchIO object naming scheme (http://bit.ly/searchio-terms)

The test cases, their generators, and the wrappers I've written are available in my non-Biopython gsoc repo here: http://github.com/bow/gsoc/. Additionally, I've used the generated test cases to improve a recent bug report and submitted a fix for the next release.

For the coming weeks before coding starts, I'm planning to play around more with XML and SQLite as I will use them in the code. I might start to add more skeleton code to my current development branch as well (https://github.com/bow/biopython).

cheers,
Bow

From arklenna at gmail.com Thu May 10 00:16:18 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 9 May 2012 20:16:18 -0400
Subject: [Biopython-dev] [GSoC] GSoC python variant update
In-Reply-To: <20120508114043.GC14359@thebird.nl>
References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl>
Message-ID:

I think my UML diagram may need a legend, or perhaps it should just be abandoned. I've written some skeleton code to try to avoid confusion about the pesky OO terms that have slightly different meanings for every language.
https://gist.github.com/2649676 Regarding concerns about inheritance: I think the UML diagram implies 3 levels of inheritance. The only inheritance I intended was from abstract interfaces like Parser or Writer, that only contain non-implemented methods. Because I can't guarantee that all future parsers will have common attribute and method names, the only solution I can see is to write an interface and inherit from that to make wrappers for each parser. Thank you to Eric for this link: (https://en.wikipedia.org/wiki/Fragile_base_class). The page states that the best way to avoid problems is to use an interface. Also thank you to Pjotr for the article about mixins (http://www.cs.utexas.edu/~lin/papers/aop03.pdf). I believe I'm using inheritance in a safe and helpful manner. James, I hope my clarification and skeleton code answer any questions you have about the implementation. Brad, I am using if statements to determine which parser to use, but I am still calling wrappers that inherit from an interface. Eric, I looked at the structure of PDBParser. Is the idea that a user might pass in an instance of StructureBuilder that already contained some structure and add to it? Or is there another purpose that isn't jumping out at me? In my skeleton code, I used the example of StructureBuilder, but I'm not sure if there's an advantage to passing the object rather than the object's name. And finally, Brad and James, I will do my best to get more conversant with VCF etc. If I'm not a user, I can't be a capable developer. Looking forward to any more structural feedback! Cheers, Lenna From eric.talevich at gmail.com Thu May 10 13:36:49 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 May 2012 09:36:49 -0400 Subject: [Biopython-dev] [GSoC] GSoC python variant update In-Reply-To: References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl> Message-ID: On Wed, May 9, 2012 at 8:16 PM, Lenna Peterson wrote: > I looked at the structure of PDBParser. 
Is the idea that a user might pass > in an instance of StructureBuilder that already contained some structure > and add to it? Or is there another purpose that isn't jumping out at me? In > my skeleton code, I used the example of StructureBuilder, but I'm not sure > if there's an advantage to passing the object rather than the object's name. > > My understanding of the producer/consumer design in Bio.PDB (I didn't write it) is that the logic for parsing the given file format is contained in the *Parser class, and the logic for building the target object is in the *Builder class. This is useful if the target object is somewhat complex to build, as is the case with PDB's Structure/Model/Chain/Residue/Atom hierarchy -- the parser just passes raw values along to the appropriate method on the StructureBuilder class. (The Internet also points out that this design is super useful if "producing" and "consuming" are asynchronous, which is not the case here... yet?) Regarding the shared interface, I think we've generally achieved this throughout most of Biopython by just remembering to implement the required methods on each parser and writer class -- just "parse" and "write", usually. Essentially, it's your design minus the common base class that enforces the interface; an error in the implementation would result in an AttributeError rather than a NotImplementedError. This works because (1) Python uses duck typing, unlike C++ and Java; (2) in Biopython, each file format is usually implemented by one dedicated person who can keep it all in their head, and we don't add new file formats very rapidly; (3) we maintain pretty good coverage with our unit tests, and certainly add unit tests for new parsers. Given all that, I think your design is superior, and it's quite clear how it all works from the way you've written it. 
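
The difference between the two approaches can be seen in a small sketch. The class names below are hypothetical and exist only for this example; the point is simply which exception a forgotten `parse()` produces with and without a common base class.

```python
class Parser(object):
    """Base class that enforces the interface (the design under discussion)."""

    def parse(self, handle):
        raise NotImplementedError(
            "%s must implement parse()" % type(self).__name__)


class IncompleteWrapper(Parser):
    """Inherits from Parser but forgets to override parse()."""


class DuckTypedWrapper(object):
    """No base class at all, and no parse() method either."""


def failure_type(parser):
    """Return the name of the exception raised when parse() is called."""
    try:
        parser.parse(None)
    except (NotImplementedError, AttributeError) as err:
        return type(err).__name__
    return None


if __name__ == "__main__":
    print(failure_type(IncompleteWrapper()))
    print(failure_type(DuckTypedWrapper()))
```

With the base class, the error message can name the wrapper that is incomplete; with pure duck typing, the failure only says that the attribute is missing.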
As for the difference between passing an instance of the *Builder object versus a reference to the *Builder class (did I get that right?), it requires slightly less code from the user to pass a reference to the class. Also, if you set the object-or-class as a default argument, remember that objects are mutable, so you risk hitting one of Python's most infamous gotchas (default arguments are only evaluated once, so the second time you use the parser, you'll be adding to the original object instead of starting with a fresh copy). Cheers, Eric From w.arindrarto at gmail.com Fri May 11 16:08:25 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 11 May 2012 18:08:25 +0200 Subject: [Biopython-dev] Biopython wrappers' behavior Message-ID: Hi everyone, There has been a recent discussion on Github (here: https://github.com/bow/biopython/commit/b0b1a460149d4a68f76ebde916471628cecfe4e7#-P0) regarding the way our command line wrappers are supposed to work. It started as a question on how to handle incompatible parameters and boils down to how much complexity we want to have in our wrappers. To give you an illustration: We have wrappers for BLAST, which raise exceptions if two incompatible parameters are used at the same time. This mimics BLAST's behavior, since it will also show errors if given that same combination of parameters sans using our wrapper. However, the way other programs handle incompatible parameters are not always the same as BLAST's. For example, HMMER doesn't show any errors but nothing still gets run, and EMBOSS seems to use the last parameter it sees, ignoring previous ones. I have not tested this for all available programs and parameters in each suite, but it seems reasonable to extrapolate the behavior to the rest of the programs in their respective suite. The question is, how should our wrappers handle this? Should we: * Raise errors whenever incompatible parameters are used (as seen in BLAST's wrappers)? Or perhaps just give warnings? 
This is an extra layer of complexity, but it would help users figure out what happened when something unexpected occurs while using our wrappers. * Leave it as it is and not worry about incompatible parameters at all? Perhaps we could also report a bug / feature request to the respective programs' authors and expect their default behavior to change? * (other ideas...)? I personally favor mimicking the programs' behavior as closely as possible. If a program gives errors, we should handle them in our code; if not, we leave it as it is, even if that results in some unexpected behavior. But this is just me. What do you think? How should our wrappers handle incompatible parameters? Bow From eric.talevich at gmail.com Sat May 12 18:38:27 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 12 May 2012 14:38:27 -0400 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: On Fri, May 11, 2012 at 12:08 PM, Wibowo Arindrarto wrote: > Hi everyone, > > There has been a recent discussion on Github (here: > > https://github.com/bow/biopython/commit/b0b1a460149d4a68f76ebde916471628cecfe4e7#-P0 > ) > regarding the way our command line wrappers are supposed to work. It > started as a question on how to handle incompatible parameters and > boils down to how much complexity we want to have in our wrappers. > > To give you an illustration: > > We have wrappers for BLAST, which raise exceptions if two incompatible > parameters are used at the same time. This mimics BLAST's behavior, > since it will also show errors if given that same combination of > parameters sans using our wrapper. However, the way other programs > handle incompatible parameters are not always the same as BLAST's. For > example, HMMER doesn't show any errors but nothing still gets run, and > EMBOSS seems to use the last parameter it sees, ignoring previous > ones.
I have not tested this for all available programs and parameters > in each suite, but it seems reasonable to extrapolate the behavior to > the rest of the programs in their respective suite. > > The question is, how should our wrappers handle this? Should we: > > * Raise errors whenever incompatible parameters are used (as seen in > BLAST's wrappers)? Or perhaps just give warnings? This is an extra > layer of complexity, but it would help users figure out if something > goes unexpected when using our wrappers. > * Leave it as it is and not worry about incompatible parameters at > all? Perhaps we could also report a bug / feature request to the > respective programs' authors and expect their default behavior to > change? > * (other ideas...)? > > I personally favor mimicking the programs' behavior as close as > possible. If it gives errors, we should handle it with our code, if > not then we leave it as it is, even if it results in some unexpected > behavior, but this is just me. What do you think? How should our > wrappers handle incompatible parameters? There are certain motivations that apply to command-line tools but not Python object-based wrappers. The first thing that comes to mind is the use of scripts and aliases on the command line, where an existing setting "--foo" can be reversed/nullified by adding the "--no-foo" later in the command line. Examples -- say these are set globally in /etc/profile: % alias ourwater="water -brief" % ourwater -nobrief % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo I think EMBOSS handles this situation in the most Unix-friendly way, while BLAST is being fussy and HMMer is... still in development. In any case, this situation doesn't apply in Python/Biopython. If we want to reverse or reset an attribute on an object, we assign a new value to it, problem solved. 
>>> some_cmd = SomeCommandlineWrapper(foo=True) >>> some_cmd.foo = False So, I would support these behaviors in general: 1. If conflicting options are specified together in the constructor (__init__), raise an exception: >>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom! 2. Where it's possible and intuitive, only use one attribute to specify boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just have 'foo', and let the 'nofoo' switch set that attribute to False. When building the command line for execution, sort it out again. I'm not sure about the easiest way to do this with Bio.Applications, but maybe we should come up with a standard mechanism for it. From w.arindrarto at gmail.com Mon May 14 19:29:44 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 14 May 2012 21:29:44 +0200 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: > There are certain motivations that apply to command-line tools but not > Python object-based wrappers. The first thing that comes to mind is the use > of scripts and aliases on the command line, where an existing setting > "--foo" can be reversed/nullified by adding the "--no-foo" later in the > command line. > > Examples -- say these are set globally in /etc/profile: > > % alias ourwater="water -brief" > % ourwater -nobrief > > % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" > % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo > > > I think EMBOSS handles this situation in the most Unix-friendly way, while > BLAST is being fussy and HMMer is... still in development. > > In any case, this situation doesn't apply in Python/Biopython. If we want to > reverse or reset an attribute on an object, we assign a new value to it, > problem solved. > >>>> some_cmd = SomeCommandlineWrapper(foo=True) >>>> some_cmd.foo = False > > So, I would support these behaviors in general: > > 1. 
If conflicting options are specified together in the constructor > (__init__), raise an exception: > >>>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom! > > 2. Where it's possible and intuitive, only use one attribute to specify > boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just have > 'foo', and let the 'nofoo' switch set that attribute to False. When building > the command line for execution, sort it out again. I'm not sure about the > easiest way to do this with Bio.Applications, but maybe we should come up > with a standard mechanism for it. > Hi Eric, Thanks for the explanation! It never occurred to me that we should consider custom command-line aliases as well, but that makes sense now. For your first point, there's already an incompatibility-checking mechanism implemented in the Bio.Blast.Applications module. It's currently tied to the _validate method in Bio.Blast.Applications, but it seems doable to generalize this into a method of Bio.Application.AbstractCommandline, so it's available to the rest of the command line wrappers (EMBOSS wrappers being some of them). As per your second point, I can imagine three general ways to do this (off the top of my head): 1. Implement a method in AbstractCommandline to override one parameter setting with its opposing parameter. This is perhaps similar to the _validate_incompatibilities method in Bio.Blast.Applications, only instead of raising an exception it deletes one of the parameters. 2. Implement a new _AbstractParameter subclass that can handle two different incompatible parameters (this is perhaps too complicated). 3. Implement an incompatibility checking mechanism in AbstractCommandline.__str__, so that we can define parameters that override their counterpart (e.g. foo and nofoo). This keeps both opposing parameters stored as object attributes (so __repr__ will reveal them both), but the overridden one won't get passed on to the console, as the __call__ method relies on __str__.
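As a rough illustration of the third option, here is a toy sketch. ToyCommandline and its overridden_by table are hypothetical stand-ins, not the real Bio.Application.AbstractCommandline API; the only idea taken from the description above is that __repr__ keeps both switches while __str__ drops the overridden one.

```python
class ToyCommandline(object):
    """Stand-in for a command line wrapper (not a Biopython class)."""

    # Hypothetical table: switch name -> the switch that overrides it.
    overridden_by = {"brief": "nobrief"}

    def __init__(self, program, **switches):
        self.program = program
        self.switches = dict(switches)

    def __repr__(self):
        # Every stored switch is shown, including overridden ones.
        args = ", ".join("%s=%r" % item
                         for item in sorted(self.switches.items()))
        return "ToyCommandline(%r, %s)" % (self.program, args)

    def __str__(self):
        # Only switches that survive the override check reach the console.
        parts = [self.program]
        for name in sorted(self.switches):
            if not self.switches[name]:
                continue
            winner = self.overridden_by.get(name)
            if winner is not None and self.switches.get(winner):
                continue  # the opposing switch is set, so drop this one
            parts.append("-" + name)
        return " ".join(parts)


cmd = ToyCommandline("water", brief=True, nobrief=True)
```

Here str(cmd) would give the string that gets executed, while repr(cmd) still shows both brief and nobrief for debugging.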
Bow From eric.talevich at gmail.com Mon May 14 19:53:50 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 14 May 2012 15:53:50 -0400 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: On Mon, May 14, 2012 at 3:29 PM, Wibowo Arindrarto wrote: > > There are certain motivations that apply to command-line tools but not > > Python object-based wrappers. The first thing that comes to mind is the > use > > of scripts and aliases on the command line, where an existing setting > > "--foo" can be reversed/nullified by adding the "--no-foo" later in the > > command line. > > > > Examples -- say these are set globally in /etc/profile: > > > > % alias ourwater="water -brief" > > % ourwater -nobrief > > > > % export COMMON_BLAST_OPTIONS="-d /opt/db/nr -e 1e-4 --foo" > > % blastall -i myseq.fa $COMMON_BLAST_OPTIONS --no-foo > > > > > > I think EMBOSS handles this situation in the most Unix-friendly way, > while > > BLAST is being fussy and HMMer is... still in development. > > > > In any case, this situation doesn't apply in Python/Biopython. If we > want to > > reverse or reset an attribute on an object, we assign a new value to it, > > problem solved. > > > >>>> some_cmd = SomeCommandlineWrapper(foo=True) > >>>> some_cmd.foo = False > > > > So, I would support these behaviors in general: > > > > 1. If conflicting options are specified together in the constructor > > (__init__), raise an exception: > > > >>>> SomeCommandlineWrapper(foo=True, nofoo=True) # kaboom! > > > > 2. Where it's possible and intuitive, only use one attribute to specify > > boolean behaviors. Instead of having 'foo' and 'nofoo' attributes, just > have > > 'foo', and let the 'nofoo' switch set that attribute to False. When > building > > the command line for execution, sort it out again. I'm not sure about the > > easiest way to do this with Bio.Applications, but maybe we should come up > > with a standard mechanism for it. 
> > > > Hi Eric, > > Thanks for the explanation! It never occured to me we should consider > custom command-line aliases as well, but that makes sense now. > > For your first point, there's already an incompatibility-checking > mechanism implemented in the Bio.Blast.Applications module. It's > currently tied-up to Bio.Blast.Applications's _validate method, but it > seems doable to generalize this into a method of > Bio.Application.AbstractCommandline, so it's available to the rest of > the command line wrappers (EMBOSS wrappers being some of them). > > As per your second point, I can imagine three general ways to do this > (on top of my head): > > 1. Implement a method to override one parameter setting with its > opposing parameter in AbstractCommandline. This is perhaps similar to > Bio.Blast.Application's _validate_incompatibilities method, only > instead of raising an exception it deletes one of the parameters. > > 2. Implement a new _AbstractParameter subclass that can handle two > different incompatible parameters (this is perhaps too complicated) > > 3. Implement an incompatibility checking mechanism in > AbstractCommandline.__str__, to define parameters that can override > its pair (e.g. foo and nofoo). This will keep the opposing parameters > stored as the object attribute (so a __repr__ will reveal them both), > but it won't get passed on to the console as the __call__ method > relies on __str__. > > Here's a fourth to consider, similar to your #1 (not to disagree with any of your suggestions): add an "_AntiSwitch" class to Bio.Applications, which includes a reference or string name of the attribute/parameter it nullifies. Would that be easier to specify when writing the application wrapper? 
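A rough sketch of how such an anti-switch might look. Switch, AntiSwitch and build_args are hypothetical names written for this illustration and do not use the actual Bio.Application classes; the assumption is simply that each parameter knows its name and whether it is set.

```python
class Switch(object):
    """Toy boolean command line switch (not a Biopython class)."""

    def __init__(self, name):
        self.name = name
        self.is_set = False


class AntiSwitch(Switch):
    """A switch that, when set, cancels another switch by name."""

    def __init__(self, name, nullifies):
        Switch.__init__(self, name)
        self.nullifies = nullifies


def build_args(parameters):
    """Return the command line flags, dropping any cancelled switches."""
    cancelled = set(p.nullifies for p in parameters
                    if isinstance(p, AntiSwitch) and p.is_set)
    return ["-" + p.name for p in parameters
            if p.is_set and p.name not in cancelled]


foo = Switch("foo")
nofoo = AntiSwitch("nofoo", nullifies="foo")
foo.is_set = True
args_before = build_args([foo, nofoo])
nofoo.is_set = True
args_after = build_args([foo, nofoo])
```

The wrapper author would only have to declare which attribute the anti-switch nullifies; the conflict is then sorted out when the command line is built.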
From clements at galaxyproject.org Mon May 14 20:57:19 2012 From: clements at galaxyproject.org (Dave Clements) Date: Mon, 14 May 2012 13:57:19 -0700 Subject: [Biopython-dev] 2012 Galaxy Community Conference Message-ID: Hello all, We are pleased to announce that early registration for the 2012 Galaxy Community Conference (GCC2012, http://galaxyproject.org/GCC2012) is now open. GCC2012 will be held July 25-27, at the UIC Forum, in Chicago, Illinois. The conference will feature two full days of presentations, discussions, lightning talks, and breakouts. We have also added a new full day of training this year, featuring 3 parallel tracks with four workshops each, covering seven to twelve different topics (please vote on topics by Friday May 18: http://bit.ly/GCC2012TDSurvey). The Galaxy Community Conference is for: * Sequencing core facility staff * Bioinformatics core staff * Bioinformatics tool and workflow developers * Bioinformatics focused principal investigators and researchers * Data producers * Power bioinformatics users This event is about integrating, analyzing, and sharing the diverse and very large datasets that are now typical in biomedical research. GCC2012 is an opportunity to share best practices with, and learn from, a large community of researchers and support staff who are facing the challenges of data-intensive biology. Galaxy is an open web-based platform for data intensive biomedical research (http://galaxyproject.org) that is widely used and deployed at research organizations of all sizes and around the world. Registration is very affordable, especially for post-docs and students. *You can save 36% to 42% by registering on or before June 11*. Conference lodging can also be booked. Low-cost rooms have been reserved on the UIC campus. You can also stay at the official conference hotel, at a substantial discount. There are a limited number of rooms available in both, and you are encouraged to register early. Thanks, and hope to see you in Chicago!
Dave Clements, on behalf of the GCC2012 Organizing Committee PS: Please help get the word out. A flyer and graphics are at http://wiki.g2.bx.psu.edu/Events/GCC2012/Promotion. -- http://galaxyproject.org/GCC2012 http://galaxyproject.org/ http://getgalaxy.org/ http://usegalaxy.org/ http://galaxyproject.org/wiki/ From erikclarke at gmail.com Tue May 15 16:44:32 2012 From: erikclarke at gmail.com (Erik Clarke) Date: Tue, 15 May 2012 09:44:32 -0700 Subject: [Biopython-dev] GEO library revamp Message-ID: Hi all, I saw on the wiki that the BioPython GEO library was in need of some TLC. I agree; a recent effort to use the parser for a project in our lab was stymied by its lack of flexibility (it seems to be particularly poor at reading GEO datasets, for instance). In response, we've developed a basic GEO module in Python loosely based on GEOQuery and the existing Geo module. Currently, our module is capable of downloading and parsing all four major GEO record types and providing rudimentary pretty-print output of the data. It also provides a representation of a GDS file in a form amenable to statistical analysis using SciPy. I've included a method that finds the enriched genes in a given subset as a demonstration. Since it was an internal project before this, I would appreciate any feedback in terms of usability, bugs, etc that we may not have caught. It's still under active development as I flesh out some of the missing features (better pretty-printing, bug fixes, complete unit-test coverage, etc). In any case, my development branch of BioPython is here: https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the new code is in the Bio/Geo folder (Records.py will replace Record.py). I've tried to make it as well-commented as possible. I have not yet tested it on Python < 2.7, but I plan on doing so. If this is of interest to anybody, I would be more than happy to tweak it as people saw fit and hopefully one day replace the current GEO parser. 
Cheers, Erik Clarke The Scripps Research Institute La Jolla, CA From w.arindrarto at gmail.com Tue May 15 20:32:08 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 15 May 2012 22:32:08 +0200 Subject: [Biopython-dev] Biopython wrappers' behavior In-Reply-To: References: Message-ID: >> As per your second point, I can imagine three general ways to do this >> (on top of my head): >> >> 1. Implement a method to override one parameter setting with its >> opposing parameter in AbstractCommandline. This is perhaps similar to >> Bio.Blast.Application's _validate_incompatibilities method, only >> instead of raising an exception it deletes one of the parameters. >> >> 2. Implement a new _AbstractParameter subclass that can handle two >> different incompatible parameters (this is perhaps too complicated) >> >> 3. Implement an incompatibility checking mechanism in >> AbstractCommandline.__str__, to define parameters that can override >> its pair (e.g. foo and nofoo). This will keep the opposing parameters >> stored as the object attribute (so a __repr__ will reveal them both), >> but it won't get passed on to the console as the __call__ method >> relies on __str__. >> > > Here's a fourth to consider, similar to your #1 (not to disagree with any of > your suggestions): add an "_AntiSwitch" class to Bio.Applications, which > includes a reference or string name of the attribute/parameter it nullifies. > Would that be easier to specify when writing the application wrapper? > That seems doable :). I can't imagine it being too hard to implement technically, as this will only be used for options that can be grouped under one name (i.e. opposing boolean parameters). 
From redmine at redmine.open-bio.org Wed May 16 09:32:51 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 16 May 2012 09:32:51 +0000 Subject: [Biopython-dev] [Biopython - Bug #3353] (New) Bio.KEGG.Enzymes change Message-ID: Issue #3353 has been reported by Thomas van Gurp. ---------------------------------------- Bug #3353: Bio.KEGG.Enzymes change https://redmine.open-bio.org/issues/3353 Author: Thomas van Gurp Status: New Priority: Normal Assignee: Category: Target version: URL: When retrieving an enzyme from [[http://soap.genome.jp/KEGG.wsdl]] using ec_handle = client.service.bget(ec) and Bio.KEGG.Enzyme.parse(ec_handle.split('\n')) there is an error in the way pathways are parsed. The layout changed from:

PATHWAY     PATH: MAP00130  Ubiquinone biosynthesis

to

PATHWAY     ec00030  Pentose phosphate pathway
            ec00480  Glutathione metabolism

A minor modification in the code fixes this issue:

    elif keyword == "PATHWAY     ":
        if data[:5] == 'PATH:':
            # Old layout: "PATH: MAP00130  Ubiquinone biosynthesis"
            path, map, name = data.split(None, 2)
            pathway = (path[:-1], map, name)
            record.pathway.append(pathway)
        else:
            # New layout: "ec00030  Pentose phosphate pathway"
            map, name = data.split(None, 1)
            pathway = ('PATH', map, name)
            record.pathway.append(pathway)

---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed May 16 10:19:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 16 May 2012 10:19:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3354] (New) Legacy blast XML parser returns prematurely StopIteration Message-ID: Issue #3354 has been reported by Martin Mokrejš.
---------------------------------------- Bug #3354: Legacy blast XML parser returns prematurely StopIteration https://redmine.open-bio.org/issues/3354 Author: Martin Mokrejš Status: New Priority: Normal Assignee: Category: Target version: URL: Hi, I am parsing some BLAST 2.2.24 XML output and the last record I get is the one from iteration 124. I see that this entry is followed by a new iteration section, which is probably the culprit. I will try a newer legacy BLAST, but still, maybe Biopython could work around this bug in the XML input?
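blastall -p blastn -A 4 -i SRR068315.fasta -d my_targets.fasta -F 0 -S 1 -r 2 -e 10e-30 -m 7

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>blastn 2.2.24 [Aug-08-2010]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs",  Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>my_targets.fasta</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>FYUQ5C204IQCOE length=283 xy=3463_2076 region=4 run=R_2009_07_08_19_30_38_</BlastOutput_query-def>
  <BlastOutput_query-len>318</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>1e-29</Parameters_expect>
      <Parameters_sc-match>2</Parameters_sc-match>
      <Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
      <Parameters_gap-open>5</Parameters_gap-open>
      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
[cut]
    <Iteration>
      <Iteration_iter-num>124</Iteration_iter-num>
      <Iteration_query-ID>lcl|124_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204JXGMI length=44 xy=3954_2264 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>350</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>125</Iteration_iter-num>
      <Iteration_query-ID>lcl|125_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204JFG82 length=173 xy=3749_2948 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>208</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>126</Iteration_iter-num>
      <Iteration_query-ID>lcl|126_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204I2D3A length=146 xy=3600_2628 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>205</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>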
Grepping for the iteration numbers, I can see a few more cases like that further on in the XML file:

      234
      1
      235
      236

      345
      1
      346
      347

      450
      1
      451
      452

      555
      1
      556
      557

      655
      1
      656
      657

      759
      1
      760
      761

      859
      1
      860
      861

      956
      1
      957
      958

      1050
      1
      1051
      1052

      1145
      1
      1146
      1147

      1239
      1
      1240
      1241

      1333
      1
      1334
      1335

      1430
      1
      1431
      1432

      1523
      1
      1524
      1525

      1610
      1
      1611
      1612

      1703
      1
      1704
      1705

      1792
      1
      1793
      1794

      1881
      1
      1882
      1883


After that, the problem does not occur again until the end of the XML file at:
     25698
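The check done with grep above could also be scripted. This is a minimal sketch, assuming the iteration numbers sit in Iteration_iter-num elements (as in NCBI's BLAST XML output) and that a healthy file numbers its iterations in increasing order; find_resets is a hypothetical helper, not part of Biopython, and the sample string is a shortened stand-in for the real file.

```python
import re

# Matches the iteration counter element used in NCBI BLAST XML output.
ITER_NUM = re.compile(r"<Iteration_iter-num>(\d+)</Iteration_iter-num>")


def find_resets(xml_text):
    """Return (previous, current) pairs where the counter fails to increase."""
    numbers = [int(n) for n in ITER_NUM.findall(xml_text)]
    return [(prev, cur)
            for prev, cur in zip(numbers, numbers[1:])
            if cur <= prev]


sample = """
<Iteration_iter-num>124</Iteration_iter-num>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_iter-num>125</Iteration_iter-num>
<Iteration_iter-num>126</Iteration_iter-num>
"""
resets = find_resets(sample)
```

Each reported pair marks a spot where a stray iteration numbered 1 interrupts the sequence, which is where the parser appears to stop.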
I am attaching the XML file with entries removed since about the last problematic place, with the two "closing" XML lines added so the file should be valid XML again. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 16 17:07:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 16 May 2012 18:07:18 +0100 Subject: [Biopython-dev] GEO library revamp In-Reply-To: References: Message-ID: On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote: > Hi all, > I saw on the wiki that the BioPython GEO library was in need of some TLC. > I agree; a recent effort to use the parser for a project in our lab was > stymied by its lack of flexibility (it seems to be particularly poor at > reading GEO datasets, for instance). > > In response, we've developed a basic GEO module in Python loosely based on > GEOQuery and the existing Geo module. Currently, our module is capable of > downloading and parsing all four major GEO record types and providing > rudimentary pretty-print output of the data. It also provides a > representation of a GDS file in a form amenable to statistical analysis > using SciPy. I've included a method that finds the enriched genes in a given > subset as a demonstration. > > Since it was an internal project before this, I would appreciate any > feedback in terms of usability, bugs, etc that we may not have caught. It's > still under active development as I flesh out some of the missing features > (better pretty-printing, bug fixes, complete unit-test coverage, etc). 
> > In any case, my development branch of BioPython is here: > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the > new code is in the Bio/Geo folder (Records.py will replace Record.py). I've > tried to make it as well-commented as possible. I have not yet tested it on > Python < 2.7, but I plan on doing so. > > If this is of interest to anybody, I would be more than happy to tweak it > as people saw fit and hopefully one day replace the current GEO parser. > > Cheers, > Erik Clarke > The Scripps Research Institute > La Jolla, CA Hi Erik, That does sound promising. Switching to using numpy seems very sensible :) As you'll have read on the "Project Ideas" list on the wiki, I was thinking we should draw inspiration from Sean Davis' GEOquery http://www.bioconductor.org/packages/bioc/html/GEOquery.html in R/Bioconductor - which I had previously used from Python via rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ Sean sometimes posts here on the Biopython lists, so it would be great if he could comment on your work. Peter From sdavis2 at mail.nih.gov Wed May 16 17:20:11 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 16 May 2012 13:20:11 -0400 Subject: [Biopython-dev] GEO library revamp In-Reply-To: References: Message-ID: On Wed, May 16, 2012 at 1:07 PM, Peter Cock wrote: > On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote: > > Hi all, > > I saw on the wiki that the BioPython GEO library was in need of some TLC. > > I agree; a recent effort to use the parser for a project in our lab was > > stymied by its lack of flexibility (it seems to be particularly poor at > > reading GEO datasets, for instance). > > > > In response, we've developed a basic GEO module in Python loosely based > on > > GEOQuery and the existing Geo module. Currently, our module is capable of > > downloading and parsing all four major GEO record types and providing > > rudimentary pretty-print output of the data. 
It also provides a > > representation of a GDS file in a form amenable to statistical analysis > > using SciPy. I've included a method that finds the enriched genes in a > given > > subset as a demonstration. > > > > Since it was an internal project before this, I would appreciate any > > feedback in terms of usability, bugs, etc that we may not have caught. > It's > > still under active development as I flesh out some of the missing > features > > (better pretty-printing, bug fixes, complete unit-test coverage, etc). > > > > In any case, my development branch of BioPython is here: > > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all > of the > > new code is in the Bio/Geo folder (Records.py will replace Record.py). > I've > > tried to make it as well-commented as possible. I have not yet tested it > on > > Python < 2.7, but I plan on doing so. > > > > If this is of interest to anybody, I would be more than happy to tweak it > > as people saw fit and hopefully one day replace the current GEO parser. > > > > Cheers, > > Erik Clarke > > The Scripps Research Institute > > La Jolla, CA > > Hi Erik, > > That does sound promising. Switching to using numpy seems > very sensible :) > > As you'll have read on the "Project Ideas" list on the wiki, I was > thinking we should draw inspiration from Sean Davis' GEOquery > http://www.bioconductor.org/packages/bioc/html/GEOquery.html > in R/Bioconductor - which I had previously used from Python via > rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ > > Sean sometimes posts here on the Biopython lists, so it would > be great if he could comment on your work. > > I'm looking forward to taking a look. It will be great to have a native python implementation. In the short term, Erik, you might take a look at some of the tests in the GEOquery package for some (high level) edge cases that I have stumbled onto over the years. 
Sean From erikclarke at gmail.com Wed May 16 18:51:45 2012 From: erikclarke at gmail.com (Erik Clarke) Date: Wed, 16 May 2012 11:51:45 -0700 Subject: [Biopython-dev] GEO library revamp In-Reply-To: References: Message-ID: <38363A1D-A317-4D8E-9AD4-A1EDE4ECC064@gmail.com> Thanks Sean, I'll definitely have a look at those. I'm looking forward to hearing your thoughts or critiques of the implementation. -Erik On May 16, 2012, at 10:20 AM, Sean Davis wrote: > > > On Wed, May 16, 2012 at 1:07 PM, Peter Cock wrote: > On Tue, May 15, 2012 at 5:44 PM, Erik Clarke wrote: > > Hi all, > > I saw on the wiki that the BioPython GEO library was in need of some TLC. > > I agree; a recent effort to use the parser for a project in our lab was > > stymied by its lack of flexibility (it seems to be particularly poor at > > reading GEO datasets, for instance). > > > > In response, we've developed a basic GEO module in Python loosely based on > > GEOQuery and the existing Geo module. Currently, our module is capable of > > downloading and parsing all four major GEO record types and providing > > rudimentary pretty-print output of the data. It also provides a > > representation of a GDS file in a form amenable to statistical analysis > > using SciPy. I've included a method that finds the enriched genes in a given > > subset as a demonstration. > > > > Since it was an internal project before this, I would appreciate any > > feedback in terms of usability, bugs, etc that we may not have caught. It's > > still under active development as I flesh out some of the missing features > > (better pretty-printing, bug fixes, complete unit-test coverage, etc). > > > > In any case, my development branch of BioPython is here: > > https://github.com/eclarke/biopython/tree/GEOQuery, and obviously all of the > > new code is in the Bio/Geo folder (Records.py will replace Record.py). I've > > tried to make it as well-commented as possible. 
I have not yet tested it on > > Python < 2.7, but I plan on doing so. > > > > If this is of interest to anybody, I would be more than happy to tweak it > > as people saw fit and hopefully one day replace the current GEO parser. > > > > Cheers, > > Erik Clarke > > The Scripps Research Institute > > La Jolla, CA > > Hi Erik, > > That does sound promising. Switching to using numpy seems > very sensible :) > > As you'll have read on the "Project Ideas" list on the wiki, I was > thinking we should draw inspiration from Sean Davis' GEOquery > http://www.bioconductor.org/packages/bioc/html/GEOquery.html > in R/Bioconductor - which I had previously used from Python via > rpy http://www.warwick.ac.uk/go/peter_cock/r/geo/ > > Sean sometimes posts here on the Biopython lists, so it would > be great if he could comment on your work. > > > I'm looking forward to taking a look. It will be great to have a native python implementation. In the short term, Erik, you might take a look at some of the tests in the GEOquery package for some (high level) edge cases that I have stumbled onto over the years. > > Sean From w.arindrarto at gmail.com Wed May 16 19:36:28 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 16 May 2012 21:36:28 +0200 Subject: [Biopython-dev] GSoC Project Update -- 2 Message-ID: Hi everyone, I just posted my latest GSoC blog update here: http://bow.web.id/blog/2012/05/the-final-preparations/ To summarize, I spent the last week playing with XML and SQLite, and by extension SeqIO's index and index_db. I didn't write as much real code as the week before (I mostly worked through online tutorials). Additionally, I started writing some of the SearchIO main methods, improved the test case generation time, and added more entries to the SearchIO terms table (http://bit.ly/searchio-terms). Finally, from this day onwards, I'm starting to code the actual SearchIO implementation.
The weekly plan will follow my proposed timeline (http://bit.ly/searchio-proposal) and I'll be writing mostly on my main SearchIO branch (https://github.com/bow/biopython/tree/searchio/Bio/SearchIO). cheers, Bow P.S. I also updated my blog last week so that the GSoC entries can be tracked through its own feed. The feed is available here: http://bow.web.id/feed/atom-gsoc.xml From arklenna at gmail.com Wed May 16 20:01:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 16 May 2012 16:01:30 -0400 Subject: [Biopython-dev] GSoC python variant update 2 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 Brief summary of this post: I don't think `SeqFeature` or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if this structure should be associated with `Seq`, i.e. by naming it `SeqVariant`, and would like feedback on this question. It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome. This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21. Regards, Lenna From p.j.a.cock at googlemail.com Wed May 16 20:47:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 16 May 2012 21:47:21 +0100 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: Message-ID: On Wed, May 16, 2012 at 9:01 PM, Lenna Peterson wrote: > Hi all, > > Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 > > Brief summary of this post: > > It could be very difficult to make PyVCF compatible with Python 2.5. What makes you worry? 
You mention argparse in the blog post, but that is for parsing command line arguments - and so is not really relevant for a library like Biopython (unless you are planning a bunch of command line tools too?). Peter From chapmanb at 50mail.com Thu May 17 00:19:01 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 16 May 2012 20:19:01 -0400 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: Message-ID: <871umju0ay.fsf@fastmail.fm> Lenna; Thanks for the update on your thinking. Sounds like you are right on track. > I don't think `SeqFeature` or an extension thereof would be > appropriate for storing Variant data; therefore, I intend to make a > new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if > this structure should be associated with `Seq`, i.e. by naming it > `SeqVariant`, and would like feedback on this question. I'm agreed about SeqFeature. Would you consider using _Record/_Call directly? Then you could provide functionality to convert this to/from basic SeqFeatures if needed. An advantage of using these structures explicitly is that you could plug in compatible APIs, like Aaron Quinlan's CyVCF: https://github.com/arq5x/cyvcf I don't think we should add a new representation class unless we explicitly need to store additional information. > It could be very difficult to make PyVCF compatible with Python > 2.5. Therefore, I am planning to write my project to be compatible > with Python 2.6 and delaying its inclusion in the main Biopython > branch until a future 2.6+ Biopython release. Alternate suggestions > are welcome. I'm agreed with this. I don't think 2.5 is as entrenched as 2.4 was, so I think we could move on a deprecation path for it. It's more important to be forward compatible with 3.x, and 2.6+ should make that easier.
Thanks again for sharing all your thoughts and digging into this, Brad From arklenna at gmail.com Fri May 18 04:35:10 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 18 May 2012 00:35:10 -0400 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: <871umju0ay.fsf@fastmail.fm> References: <871umju0ay.fsf@fastmail.fm> Message-ID: On Wed, May 16, 2012 at 4:47 PM, Peter Cock wrote: >> >> It could be very difficult to make PyVCF compatible with Python 2.5. > > What makes you worry? You mention argparse in the blog post, > but that is for parsing command line arguments - and so is > not really relevant for a library like Biopython (unless you are > planning a bunch of command line tools too?). > > Peter The absences that caught my eye were `with` and `next()`. The PyVCF developers aren't planning to implement 2.5 compatibility (https://github.com/jamescasbon/PyVCF/issues/30) and I don't have expertise in that transition. On Wed, May 16, 2012 at 8:19 PM, Brad Chapman wrote: > > >> I don't think `SeqFeature` or an extension thereof would be >> appropriate for storing Variant data; therefore, I intend to make a >> new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if >> this structure should be associated with `Seq`, i.e. by naming it >> `SeqVariant`, and would like feedback on this question. > > I'm agreed about SeqFeature. Would you consider using _Record/_Call > directly? Then you could provide functionality to convert this to/from > basic SeqFeatures if needed. An advantage of using these structures > explicitly is that you could plug in compatible APIs, like Aaron > Quinlan's CyVCF: > > https://github.com/arq5x/cyvcf > > I don't think we should add a new representation class unless we > explicitly need to store additional information. > The reason I suggested a new representation class is so data from all parsers can be stored in the same way. 
As far as I can tell, GVF doesn't store all of the information stored in VCF (for example, the headers). My concern was unexpected behavior if I tried to store GVF data in the exact same object used by VCF. On the other hand, your GFF parser outputs to SeqRecords/SeqFeatures, so if the PyVCF wrapper can output to SeqRecords as well, I probably wouldn't have to worry about an intermediate structure. I'll start by having the PyVCF wrapper use _Record and _Call to keep things simple. In any case, if I do end up writing an interface/new structure, I would definitely write it to allow substitution of CyVCF or other parsers. Lenna From p.j.a.cock at googlemail.com Sat May 19 12:02:35 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 19 May 2012 13:02:35 +0100 Subject: [Biopython-dev] Python 2.5 support? Message-ID: Hello all, I'm curious how many of you (on the dev list) are still using Python 2.5, and if the time has come to start deprecating our support for it? One good reason would be if it helps us with the Python 3 migration (where currently we are restricted to a subset of features available on or backported to the older Python releases). Here at work, we are mostly using Python 2.6, although the system default Python on many of our servers is Python 2.4 (CentOS boxes). So dropping Python 2.5 won't cause me personally a problem. A quick review of the main Linux distributions and their current long term support platforms might be prudent at this point. One issue is Jython support, which has to date targeted the C Python 2.5 feature set - their new alpha release of Jython 2.7 now brings with it some of the new functionality in C Python 2.6 & 2.7, which is helpful.
Regards, Peter From reece at harts.net Sun May 20 16:01:54 2012 From: reece at harts.net (Reece Hart) Date: Sun, 20 May 2012 09:01:54 -0700 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: <871umju0ay.fsf@fastmail.fm> Message-ID: On Wed, May 16, 2012 at 1:01 PM, Lenna Peterson wrote: > I don't think `SeqFeature` or an extension thereof would be appropriate > for storing Variant data; therefore, I intend to make a new structure based > on `_Record` and `_Call` in PyVCF. > > Brad> I don't think we should add a new representation class unless we > Brad> explicitly need to store additional information. > > The reason I suggested a new representation class is so data from all > parsers can be stored in the same way. Lenna makes a very sound point. A Variant class should be able to represent all variant types, and therefore represent *only* the salient features of a generalized variant. It should not be specific to a particular format. For instance, _Record expects a CHROM, but this immediately eliminates its use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and FORMAT are not intrinsic properties of a variant. Don't get me wrong -- it's exactly right for a *VCF* variant. However, _Record was never intended to be the variant abstraction that I think we should be aiming for at this time. Being VCF-specific isn't bad, but let's make sure the name accurately reflects the level of abstraction. Here's a counter example: variant = < ref_ac, var_type, loc, pre, post, rpt_count > ref_ac -- accession var_type -- type of variant/coordinate system (genomic, cds, protein) pre -- "before" seq (aka reference); empty if insertion post -- "after" seq (alt); empty if deletion or repeat rpt_count -- min, max count for repeats I implemented variants roughly this way once ( http://bitbucket.org/reece/bio-hgvs-perl). This structure is agnostic regarding peculiarities of a particular format. I show it as an example, not a proposal. 
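For illustration only, the tuple above might be written in Python roughly as follows. This is a sketch under assumptions: the field names are taken from the list above, while the `Variant` name, the `classify` helper and the accession string are hypothetical placeholders and not part of Biopython, PyVCF or bio-hgvs-perl.

```python
from collections import namedtuple

# Field names follow the tuple described above; the class name is an assumption.
Variant = namedtuple(
    "Variant", ["ref_ac", "var_type", "loc", "pre", "post", "rpt_count"]
)


def classify(variant):
    """Guess the kind of variant from its pre/post/rpt_count fields.

    Hypothetical helper: repeats are checked first because, as described
    above, 'post' is empty for both deletions and repeats.
    """
    if variant.rpt_count is not None:
        return "repeat"
    if not variant.pre and variant.post:
        return "insertion"
    if variant.pre and not variant.post:
        return "deletion"
    return "substitution"


# Placeholder accession and positions, purely for demonstration.
snv = Variant("NM_000000.0", "cds", 76, "A", "T", None)
ins = Variant("NM_000000.0", "cds", 76, "", "GG", None)
rpt = Variant("NM_000000.0", "cds", 76, "CAG", "", (10, 12))
```

Nothing here is tied to a file format: QUAL, FILTER and the like would live in whatever format-specific record wraps or produces such an object.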
> Therefore, I am planning to write my project to be compatible with Python > 2.6 and delaying its inclusion in the main Biopython branch until a future > 2.6+ Biopython release. > Has anyone ever polled to see what versions of Python people are using? I wonder whether we should care about 2.6 even (never mind 2.5). My guess is that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least it's ascending). I would be content to focus exclusively on 2.7 and 3.0. -Reece From mictadlo at gmail.com Sun May 20 23:04:04 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 21 May 2012 09:04:04 +1000 Subject: [Biopython-dev] Python 2.5 support? In-Reply-To: References: Message-ID: I am using 2.7.x, but e.g. on CentOS or Debian I installed the latest Python version in a different location and created a PYTHONPATH. The reason for 2.7.x is that a multiprocessing bug has been fixed in that version. Cheers, Mic On Sat, May 19, 2012 at 10:02 PM, Peter Cock wrote: > Hello all, > > I'm curious how many of you (on the dev list) are still > using Python 2.5, and if the time has come to start > deprecating our support for it? One good reason > would be if it helps us with the Python 3 migration > (where currently we are restricted to a subset of > features available on or back ported to the older > Python releases). > > Here at work, we are mostly using Python 2.6, > although the system default Python on many of > our servers is Python 2.4 (CentOS boxes). So > dropping Python 2.5 won't cause me personally > a problem. > > A quick review of the main Linux distributions > and their current long term support platforms > might be prudent at this point. > > One issue is Jython support, which has to date > targetted the C python 2.5 feature set - their > new alpha release of Jython 2.7 now brings > with it some of the new functionality in C > Python 2.6 & 2.7, which is helpful.
> > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon May 21 01:35:08 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 20 May 2012 21:35:08 -0400 Subject: [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: <871umju0ay.fsf@fastmail.fm> Message-ID: <87ehqe1flf.fsf@fastmail.fm> Lenna and Reece; >> The reason I suggested a new representation class is so data from all >> parsers can be stored in the same way. > > Lenna makes a very sound point. A Variant class should be able to represent > all variant types, and therefore represent *only* the salient features of a > generalized variant. It should not be specific to a particular format. I'm in agreement with you. My thought process is along the lines of: you'll help get to a general representation by exploring the deficiencies of the more specific ones. I think it's hard to invent a fully general scheme from outside. > For instance, _Record expects a CHROM, but this immediately eliminates its > use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and > FORMAT are not intrinsic properties of a variant. Don't get me wrong -- > it's exactly right for a *VCF* variant. However, _Record was never intended > to be the variant abstraction that I think we should be aiming for at this > time. Being VCF-specific isn't bad, but let's make sure the name accurately > reflects the level of abstraction. Also agreed, although you can fit a wide variety of things into this general scheme. 
Ignoring all of the specific naming it's: - the reference name (chromosome or space or contig or whatever you want to call it) - position - identifier - ref/alt seqs (or pre/post) - key-value pairs associated with the variant - genotypes associated with the variant (also with key-value pairs) The real difference between this and your bio-hgvs-perl example is what you expose as top level from the key-value pairs. VCF exposes QUAL and FILTER (and I guess identifier too) while you had different choices that were more right for your particular problem. This is all brainstorming, rather than a specific suggestion. If I have to think up something specific, I guess the right thing to do is make it easy to build a custom object representation that makes coding easy for specific problem sets from the more generic key/value information. > Has anyone ever polled to see what versions of python people are using? I > wonder whether we should care about 2.6 even (never mind 2.5). My guess is > that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least > it's ascending). I would be content to focus exclusively on 2.7 and > 3.0. I'm agreed, although practically dropping 2.6 support in Biopython won't happen for a while. Unless there are 2.7 features that we really need it shouldn't be too hard to support both. I only miss the multiple context manager support for with statements, and haven't let myself get hooked on ordered dicts or dictionary comprehensions yet. Brad From redmine at redmine.open-bio.org Mon May 21 03:30:49 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 21 May 2012 03:30:49 +0000 Subject: [Biopython-dev] [Biopython - Bug #3358] (New) Bio.PDB.Atom does not copy xtra dictionary Message-ID: Issue #3358 has been reported by Alexander Campbell.
---------------------------------------- Bug #3358: Bio.PDB.Atom does not copy xtra dictionary https://redmine.open-bio.org/issues/3358 Author: Alexander Campbell Status: New Priority: Normal Assignee: Category: Target version: URL: The Bio.PDB.Atom.copy function does not copy the object's xtra dictionary, leading to the source and the copy Atom objects sharing the same xtra dictionary. This can be observed by calling the id() function on the xtra attribute in the source and copy Atom objects, or more practically by copying an Atom object, adding a value to the copy's xtra dict, and observing that the same value is now present in the source's xtra dict. The fix is to have the Bio.PDB.Atom.copy function call copy.copy() function on the self.xtra dict, as does the Bio.PDB.Entity.copy function. The copied Atom object's xtra dict is now a copy of the source Atom object's xtra dict, and behaves as expected. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Mon May 21 16:37:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 21 May 2012 17:37:21 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? Message-ID: Hello all, This is something I talked to Bow a little about during our last weekly meeting for his GSoC meeting, but it is broader than just SearchIO... When describing BLAST results, or FASTA alignments, or indeed many other local alignments you typically have a (gapped) query sequence and match sequence fragment, and the co-ordinates describing which part of the full query and matched sequence this is. i.e. You are told the start and end of the subsequence (and perhaps strand). 
The same essentially applies to some multiple alignment formats in AlignIO as well, including Stockholm/PFAM (where this is encoded into the record name as identifier slash start-end), FASTA output (which will be handled via SearchIO in future) and MAF. http://biopython.org/wiki/Multiple_Alignment_Format Indeed thinking about how best to handle this was the main reason I haven't merged Andrew's MAF branch yet. (There are subtleties, for instance how is the strand given in the file, do you get start+end explicitly or must the end be inferred from the start and the sequence, etc). Currently recording these in the SeqRecord's annotation dictionary 'works', but does not exploit the structure. In particular, if the SeqRecord is sliced to get a fragment of the alignment, this co-ordinate information is lost. It would be nice if this preserved the start/end/strand and updated it accordingly. One idea for doing this is to introduce a new location property to the SeqRecord (defaulting to None), which would be a FeatureLocation object normally used for SeqFeature objects. If an operation couldn't preserve or update the location, it would become None. Note that when slicing we will generally need to know the gap characters of the sequence (in order to recalculate the sub-sequence's start/end), which for the parsers may mean some minor updates to ensure the default alphabet specifies the '-' gap character. On the other hand, perhaps this 'location' property idea is overly complicated? Maybe all we need is a common convention about which keys to use in the annotation dictionary, and how to store the information (e.g. Python counting, start < end, and strand as +1 or -1 if present)? Thoughts or feedback please? Would a worked example help with my explanation? Thanks, Peter From chapmanb at 50mail.com Tue May 22 01:25:39 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 21 May 2012 21:25:39 -0400 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects?
In-Reply-To: References: Message-ID: <87ehqduhv0.fsf@fastmail.fm> Peter; > When describing BLAST results, or FASTA alignments, or > indeed many other local alignments you typically have a > (gapped) query sequence and match sequence fragment, > and the co-ordinates describing which part of the full query > and matched sequence this is. i.e. You are told the start > and end of the subsequence (and perhaps strand). [...] > One idea for doing this is to introduce a new location > property to the SeqRecord (defaulting to None), which > would be a FeatureLocation object normally used for > SeqFeature objects. I'm not sure if I understand the representation, but could we handle this as a standard named SeqFeature within the SeqRecord? This would let you store the metadata like gap information within the SeqFeature qualifiers and avoid introducing a new property. > Maybe all we need is a common convention about which > keys to use in the annotation dictionary, and how to store > the information (e.g. Python counting, start < end, and > strand as +1 or -1 if present)? I'm becoming more of a fan of this type of convention key/value approach as opposed to specific attributes but it does seem nice to re-use your existing classes if it holds the same information. > Thoughts or feedback please? Would a worked example > help with my explanation? A worked example might help: not totally sure I grasp all the subtleties, Brad From p.j.a.cock at googlemail.com Tue May 22 09:44:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 10:44:27 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: <87ehqduhv0.fsf@fastmail.fm> References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 2:25 AM, Brad Chapman wrote: >> Thoughts or feedback please? Would a worked example >> help with my explanation? > > A worked example might help: not totally sure I grasp all the > subtleties, > Brad OK. 
This will work best in a mono-spaced font. This was picked out of one of our unit tests, bt009.txt - I just looked for a BLAST pairwise alignment with some gaps: Score = 151 bits (378), Expect = 9e-37 Identities = 88/201 (43%), Positives = 128/201 (62%), Gaps = 9/201 (4%) Query: 1 MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGD 59 M+R ++ITR TKET+IE+ +++D G+ +ST I F +HML TLLT+ + I+ D Sbjct: 1 MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATD 58 Query: 60 HETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYL 119 + D HH++EDVAI LG I LG+K GI+R+ IPMD+ALV LDIS R Sbjct: 59 K--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMA 116 Query: 120 VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARAL 179 + +L ++ +GG TE FF++ A+N+GITLH+++ G NTHHIIE FK+ AL Sbjct: 117 FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLAL 175 Query: 180 KQAVSIDESKVGEIPSSKGVL 200 +A I ++ EI S+KG++ Sbjct: 176 YEATRIVDN---EIRSTKGII 193 When looking at this as a pairwise alignment, for the query SeqRecord the sequence would be MTRISHITRNT...KGVL (with gaps), running from residue 1 to 200 inclusive (one based counting, or 0 to 200 in Python). There are 200 letters plus one gap, meaning the gapped sequence is 201 letters long. Similarly the matched sequence SeqRecord (or subject in BLAST's terminology) is also 201 letters, this time 193 amino acids (residue 1 to 193 inclusive, one based counting) plus 8 gaps. 
To turn this into code, something like this: >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> query = SeqRecord(id="query", seq=Seq("MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGDHETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYLVFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARALKQAVSIDESKVGEIPSSKGVL")) >>> match = SeqRecord(id="match", seq=Seq("MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATDK--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMAFVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLALYEATRIVDN---EIRSTKGII")) Turn it into an alignment, >>> from Bio.Align import MultipleSeqAlignment >>> align = MultipleSeqAlignment([query, match]) >>> print align Alphabet() alignment with 2 rows and 201 columns MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TL...GVL query MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITL...GII match Assume the start/end co-ordinates are also stored somewhere (and for nucleotide sequences, the strand too). [As an aside, a pairwise multiple sequence alignment subclass or similar for the SearchIO project could have a nicer pretty print __str__ method showing where the two sequences agree - as done in the BLAST text output etc.] Now, suppose we slice this - for simplicity let's take the third chunk as shown in the original text BLAST output above, i.e. columns 121 to 180 inclusive (one based counting): >>> print align[:,120:180] Alphabet() alignment with 2 rows and 60 columns VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQN...RAL query FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYN...LAL match We know that for this sub-alignment (by looking at the BLAST text output) that the query fragment is base 120 to 179 inclusive (one based counting) and the match fragment is base 117 to 175. I would like the SeqRecord slicing (done by the alignment object) to be able to deduce these new start/end co-ordinates from the original co-ordinates. 
This means when we do match[120:180] and query[120:180], we need to look at the position of the gaps, and thus convert from the ungapped coordinates (used for the start and end values) into the gapped coordinates (used for the alignment columns). Essentially here the new start is the old start plus the number of non-gap letters being removed from the start (before the slice point). The new end can then be calculated by adding the number of non-gap letters in the selected sequence, or from the old end value reducing it by the number of non-gap letters removed from the end. In fact there is no need to store the end coordinate in memory - it can be found on the fly from the start, sequence, and for nucleotides, strand - which is all you get in MAF format. Doing this would avoid inconsistent sets of values, but imposes a number of complications on the object representation. This is doable - but a sensible question is how common a use case is it to slice alignments (or SeqRecord objects) and care about their co-ordinates? This may actually be more important for classical multiple sequence alignments like Stockholm and MAF than for SearchIO. Peter From p.j.a.cock at googlemail.com Tue May 22 09:48:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 10:48:18 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 10:44 AM, Peter Cock wrote: > ... > I would like the SeqRecord slicing (done by the alignment object) > to be able to deduce these new start/end co-ordinates from the > original co-ordinates. > > ... > > This is doable - but a sensible question is how common a > use case is it to slice alignments (or SeqRecord objects) and > care about their co-ordinates? This may actually be more > important for classical multiple sequence alignments like > Stockholm and MAF than for SearchIO. 
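The start/end deduction described for the BLASTP example (old start plus the non-gap letters removed before the slice point, then plus the non-gap letters kept) could be sketched as below. This is only a sketch under assumptions: Python-style zero-based half-open coordinates, forward strand or protein only, '-' as the gap character, and non-negative slice bounds. The function name `sliced_coords` is hypothetical and not an existing Biopython API; the sequences are the gapped query and match rows of that alignment.

```python
def sliced_coords(gapped, start, begin, stop, gap="-"):
    """Ungapped (start, end) of gapped[begin:stop].

    'start' is the zero-based ungapped position where 'gapped' begins in
    the full sequence. Hypothetical helper; no strand handling.
    """
    # Non-gap letters dropped from the left shift the start forward.
    removed = begin - gapped[:begin].count(gap)
    # Non-gap letters kept determine the length of the new range.
    window = gapped[begin:stop]
    kept = len(window) - window.count(gap)
    new_start = start + removed
    return new_start, new_start + kept


# Gapped rows from the BLASTP example, in the same 60-column chunks.
query = (
    "MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGD"
    "HETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYL"
    "VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARAL"
    "KQAVSIDESKVGEIPSSKGVL"
)
match = (
    "MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATD"
    "K--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMA"
    "FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLAL"
    "YEATRIVDN---EIRSTKGII"
)

query_range = sliced_coords(query, 0, 120, 180)  # (119, 179)
match_range = sliced_coords(match, 0, 120, 180)  # (116, 175)
```

In one-based inclusive terms these are 120 to 179 for the query and 117 to 175 for the match, which agrees with the third block of the BLAST text output.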
I was struggling to come up with a simple self contained motivating example. Here is a possible example with BLAST, (although you can do similar things with multiple sequence alignments), but it is actually a larger or different problem. Suppose you have a domain of interest in a larger protein, and you want to pull out similar domains from similar proteins in a BLAST database. So, you do the BLAST search, and filter the results (e.g. use a minimum match length to ensure you are looking at full proteins). You then want to pull out just the region of the matched protein corresponding to your domain of interest. To solve this task, a SeqRecord location property is just a step in the right direction - but what this really boils down to is mapping between the three different co-ordinate systems: Ungapped query sequence <-> aligned columns (i.e. the common gapped sequence coordinates) <-> ungapped match sequence. Maybe that would be some nice functionality to add... the API needs a lot of thought though. Perhaps a specialized GappedSeq object (which could let us deprecate the current gapped alphabet class)? Peter From w.arindrarto at gmail.com Tue May 22 10:21:25 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 22 May 2012 12:21:25 +0200 Subject: [Biopython-dev] GSoC Project Update -- 3 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/from-bio-import-searchio/ To summarize the post and what I've done the last week: * I finished writing all base SearchIO objects and tested them as well. These objects are the QueryResult object (previously called Result), representing search results from a single query; the Hit object, representing pairwise alignments from a single database hit; and the HSP object, representing a single alignment. I've also written the docstrings for these objects, so you can run help() on them in an interpreter session.
The post also includes a very brief outline of the base objects' features, if you are curious. * Using this, I was able to write a working prototype for SearchIO BLAST XML parsing. This prototype has also been tested, using the test cases I've generated previously. For now, it's implemented using our NCBIXML parser, just so that people can have a taste of what SearchIO will feel like. If you want to play around with the prototype, it's available here: https://github.com/bow/biopython/tree/searchio-blastxml. As always, feel free to notify me of suggestions, critiques, and/or feature requests :). regards, Bow From redmine at redmine.open-bio.org Tue May 22 10:40:05 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 22 May 2012 10:40:05 +0000 Subject: [Biopython-dev] [Biopython - Feature #3359] (New) Bio.Phylo.PAML.codeml doesn't parse output with multiple genes Message-ID: Issue #3359 has been reported by Brandon Invergo. ---------------------------------------- Feature #3359: Bio.Phylo.PAML.codeml doesn't parse output with multiple genes https://redmine.open-bio.org/issues/3359 Author: Brandon Invergo Status: New Priority: Normal Assignee: Brandon Invergo Category: Target version: URL: One particular combination of settings for PAML's codeml program creates output for separate genes. The current implementation of Bio.Phylo.PAML.codeml does not properly parse this. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Tue May 22 10:44:37 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 11:44:37 +0100 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? 
In-Reply-To: References: <87ehqduhv0.fsf@fastmail.fm> Message-ID: On Tue, May 22, 2012 at 10:44 AM, Peter Cock wrote: > On Tue, May 22, 2012 at 2:25 AM, Brad Chapman wrote: >>> Thoughts or feedback please? Would a worked example >>> help with my explanation? >> >> A worked example might help: not totally sure I grasp all the >> subtleties, >> Brad > > OK. This will work best in a mono-spaced font. This was > picked out of one of our unit tests, bt009.txt - I just looked > for a BLAST pairwise alignment with some gaps: > > [BLASTP example] On reflection, a translated BLAST search would have been more interesting - then you've got at least another layer of co-ordinate transformations to worry about. e.g. for TBLASTX, query nucleotide <-> query protein <-> gapped protein <-> matched protein <-> matched nucleotide. Looking at a short snippet from example bt096.txt, an easy case in that there are no gaps, we have: Score = 100 bits (214), Expect(2) = 4e-49 Identities = 37/44 (84%), Positives = 38/44 (86%), Gaps = 0/44 (0%) Frame = -2/-2 Query 148 FCIFSRDGVLPCWSGWSRTPDLR*SACLGLPKCWDYRCEPPRPA 17 FCIFSRDGV CW GWSRTPDL+*S LGLPKCWDYR EPPRPA Sbjct 630 FCIFSRDGVSSCWPGWSRTPDLK*STHLGLPKCWDYRREPPRPA 499 The translated query sequence is 44 amino acids (including a stop codon), thus 44*3 = 132 base pairs, explaining how it runs from position 148 to 17 (one based) in the nucleotide query sequence. Currently Bio.SeqFeature.FeatureLocation doesn't have anything really intended for mixing nucleotide and protein coordinates, so that may not be the best fit for how to hold and manipulate these co-ordinates. Hmm. 
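To make the extra layer concrete, the nucleotide/protein arithmetic for this ungapped TBLASTX snippet could be sketched as below. It is a sketch under assumptions: one-based coordinates exactly as BLAST prints them (start greater than end on a minus frame), no gaps in the HSP, and a hypothetical helper name `codon_positions` that is not part of Biopython.

```python
def codon_positions(aa_index, nuc_start, nuc_end):
    """One-based nucleotide positions of the codon behind one amino acid.

    aa_index is zero-based within the HSP; nuc_start/nuc_end are the
    coordinates BLAST reports for the translated segment. Hypothetical
    helper; assumes the HSP has no gaps.
    """
    # Walk down the coordinates for minus frames, up for plus frames.
    step = 1 if nuc_end >= nuc_start else -1
    first = nuc_start + step * 3 * aa_index
    return (first, first + step, first + 2 * step)


# Query runs 148 -> 17 and subject 630 -> 499, both 44 residues long.
query_span = abs(148 - 17) + 1                  # 132 == 44 * 3
first_codon = codon_positions(0, 148, 17)       # (148, 147, 146)
last_codon = codon_positions(43, 148, 17)       # (19, 18, 17)
last_sbjct_codon = codon_positions(43, 630, 499)  # (501, 500, 499)
```

Once gaps are present, the amino acid index would first have to be converted from alignment columns to ungapped residues before applying this step.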
Peter From p.j.a.cock at googlemail.com Tue May 22 11:07:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 May 2012 12:07:15 +0100 Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond In-Reply-To: <4F9AFA1F.6030103@med.nyu.edu> References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu> Message-ID: Hi all, I've CC'd the BioRuby mailing list just to ensure you're aware of the potentially useful combination of MAF indexing and BGZF compression. We can continue this on the BioRuby list if more appropriate. The start of this Biopython-dev thread is here: http://lists.open-bio.org/pipermail/biopython-dev/2012-April/009561.html This might be a nice opportunity to combine the work of this year's OBF Google Summer of Code students - Clayton is doing MAF for BioRuby, and part of Artem's project could provide BGZF support for BioRuby. On Fri, Apr 27, 2012 at 8:57 PM, Andrew Sczesnak wrote: > Peter, > >> It should be easy enough to follow the BGZF changes to Bio/SeqIO/_index.py >> and I'm willing to do this myself for MAF (while going over your index >> work - something I want to do anyway). The only potential catch is >> avoiding offset arithmetic. > > I have no problem with you doing this if you're willing. It would be great > to have some code review of MafIndex as well. I'm not sure if Clayton will be able to comment on the Python code, but he should have some thoughts on the MAF indexing itself. Regards, Peter From andrew.sczesnak at med.nyu.edu Tue May 22 21:10:23 2012 From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak) Date: Tue, 22 May 2012 17:10:23 -0400 Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects? In-Reply-To: References: Message-ID: <4FBC00BF.5000500@med.nyu.edu> Peter, It sort of seems like every letter in a sequence needs to have its own annotation, mapping it to its chromosome/sequence and position of origin. 
In this way, when multiple sequences are sliced and concatenated the annotation is preserved. For example, a = GappedSeq("ATGATG") ^ ^ | chr1:6 chr1:1 b = GappedSeq("GGG") ^ ^ | chr1:502 chr1:500 b = b.reverse_complement() c = a + b = GappedSeq("ATGATGGGG") Such that c[1].someproperty = "chr1[+]2" while chr[7].someproperty = "chr1[-]501". Strand information could be preserved on a per-letter basis and flipped from -1 to +1 upon reverse_complement(). The API could find and report contiguous stretches by analyzing these per-letter annotations, for example: >>> print c GappedSeq('ATGATGGGG', someproperty=["chr1[+]1-6", "chr1[-]500-502"]) The issue of gaps and of translating multiple alignments of gapped sequences could be resolved by having a convention where gaps always belong to the right-nearest gap except in the case of right-terminal gaps. For example: a = GappedSeq("----AGCG-ATG---") 000001234456666 a[0] = GappedSeq("----A") a[1] = GappedSeq("G") a[4] = GappedSeq("-A") a[6] = GappedSeq("G---") A nucleotide triplet of this sequence would thus look like this: a[:3] = GappedSeq("----AGC") a[-3:] = GappedSeq("ATG---") In the case of slicing a MultipleSeqAlignment of GappedSeq objects, there would have to be an "anchor" sequence (like there is in UCSC MAF files) with which other sequences in the alignment are sliced in reference to. For example: a = GappedSeq("----AGCG-ATG---") a = GappedSeq("----AGCG-ATG---") a = GappedSeqAlignment( On 05/21/2012 12:37 PM, Peter Cock wrote: > Hello all, > > This is something I talked to Bow a little about during our last > weekly meeting for his GSoC meeting, but it is broader than > just SearchIO... > > When describing BLAST results, or FASTA alignments, or > indeed many other local alignments you typically have a > (gapped) query sequence and match sequence fragment, > and the co-ordinates describing which part of the full query > and matched sequence this is. i.e. 
> You are told the start
> and end of the subsequence (and perhaps strand).
>
> The same essentially applies to some multiple alignment
> formats in AlignIO as well, including Stockholm/PFAM
> (where this is encoded into the record name as identifier
> slash start-end), FASTA output (which will be handled via
> SearchIO in future) and MAF.
>
> http://biopython.org/wiki/Multiple_Alignment_Format
>
> Indeed thinking about how best to handle this was the
> main reason I haven't merged Andrew's MAF branch yet.
>
> (There are subtleties, for instance how is the strand given
> in the file, do you get start+end explicitly or must the end
> be inferred from the start and the sequence, etc).
>
> Currently recording these in the SeqRecord's annotation
> dictionary 'works', but does not exploit the structure. In
> particular, if the SeqRecord is sliced to get a fragment
> of the alignment, this co-ordinate information is lost. It
> would be nice if this preserved the start/end/strand and
> updated it accordingly.
>
> One idea for doing this is to introduce a new location
> property to the SeqRecord (defaulting to None), which
> would be a FeatureLocation object normally used for
> SeqFeature objects.
>
> If an operation couldn't preserve or update the location,
> it would become None. Note that for slicing we will generally
> need to know the gap characters of the sequence (in
> order to recalculate the sub-sequence's start/end), which
> for the parsers may mean some minor updates to ensure
> the default alphabet specifies the '-' gap character.
>
> On the other hand, perhaps this 'location' property idea
> is overly complicated?
>
> Maybe all we need is a common convention about which
> keys to use in the annotation dictionary, and how to store
> the information (e.g. Python counting, start < end, and
> strand as +1 or -1 if present)?
>
> Thoughts or feedback please? Would a worked example
> help with my explanation?
>
> Thanks,
>
> Peter

From andrew.sczesnak at med.nyu.edu Tue May 22 21:31:48 2012
From: andrew.sczesnak at med.nyu.edu (Andrew Sczesnak)
Date: Tue, 22 May 2012 17:31:48 -0400
Subject: [Biopython-dev] Start/end co-ordinates in SeqRecord objects?
In-Reply-To:
References:
Message-ID: <4FBC05C4.1060302@med.nyu.edu>

Apologies, I accidentally hit send before finishing.

--

Peter,

It sort of seems like every letter in a sequence needs to have its own
annotation, mapping it to its chromosome/sequence and position of
origin. In this way, when multiple sequences are sliced and
concatenated the annotation is preserved. For example,

a = GappedSeq("ATGATG")
               ^    ^
               |    chr1:6
               chr1:1

b = GappedSeq("GGG")
               ^ ^
               | chr1:502
               chr1:500

b = b.reverse_complement()

c = a + b = GappedSeq("ATGATGGGG")

Such that c[1].someproperty = "chr1[+]2" while c[7].someproperty =
"chr1[-]501". Strand information could be preserved on a per-letter
basis and flipped from -1 to +1 upon reverse_complement(). The API
could find and report contiguous stretches by analyzing these
per-letter annotations, for example:

>>> print c
GappedSeq('ATGATGGGG', someproperty=["chr1[+]1-6", "chr1[-]500-502"])

The issue of gaps and of translating multiple alignments of gapped
sequences could be resolved by having a convention where gaps always
belong to the nearest letter on their right, except in the case of
right-terminal gaps. For example:

a = GappedSeq("----AGCG-ATG---")
               000001234456666

a[0] = GappedSeq("----A")
a[1] = GappedSeq("G")
a[4] = GappedSeq("-A")
a[6] = GappedSeq("G---")

A nucleotide triplet of this sequence would thus look like this:

a[:3] = GappedSeq("----AGC")
a[-3:] = GappedSeq("ATG---")

In the case of slicing a MultipleSeqAlignment of GappedSeq objects,
there would have to be an "anchor" sequence (like there is in UCSC MAF
files) with which other sequences in the alignment are sliced in
reference to.
For example:

a = GappedSeq("----AGCG-ATG---", id="a", anchor=True)
b = GappedSeq("AG--GG---ATAG--", id="b")
c = GappedSeq("A--CGG---ATAGGG", id="c")

d = GappedSeqAlignment([a, b, c])

>>> print d[:,:3]
SingleLetterAlphabet() alignment with 3 rows and 7 columns
----AGC a, anchor=True
AG--GG- b
A--CGG- c

One problem with this might be how to translate the multiple
alignment... in this case, should b and c have no translation?

Thanks,

Andrew

On 05/21/2012 12:37 PM, Peter Cock wrote:
> Hello all,
>
> This is something I talked to Bow a little about during our last
> weekly meeting for his GSoC meeting, but it is broader than
> just SearchIO...
>
> When describing BLAST results, or FASTA alignments, or
> indeed many other local alignments you typically have a
> (gapped) query sequence and match sequence fragment,
> and the co-ordinates describing which part of the full query
> and matched sequence this is. i.e. You are told the start
> and end of the subsequence (and perhaps strand).
>
> The same essentially applies to some multiple alignment
> formats in AlignIO as well, including Stockholm/PFAM
> (where this is encoded into the record name as identifier
> slash start-end), FASTA output (which will be handled via
> SearchIO in future) and MAF.
>
> http://biopython.org/wiki/Multiple_Alignment_Format
>
> Indeed thinking about how best to handle this was the
> main reason I haven't merged Andrew's MAF branch yet.
>
> (There are subtleties, for instance how is the strand given
> in the file, do you get start+end explicitly or must the end
> be inferred from the start and the sequence, etc).
>
> Currently recording these in the SeqRecord's annotation
> dictionary 'works', but does not exploit the structure. In
> particular, if the SeqRecord is sliced to get a fragment
> of the alignment, this co-ordinate information is lost. It
> would be nice if this preserved the start/end/strand and
> updated it accordingly.
>
> One idea for doing this is to introduce a new location
> property to the SeqRecord (defaulting to None), which
> would be a FeatureLocation object normally used for
> SeqFeature objects.
>
> If an operation couldn't preserve or update the location,
> it would become None. Note that for slicing we will generally
> need to know the gap characters of the sequence (in
> order to recalculate the sub-sequence's start/end), which
> for the parsers may mean some minor updates to ensure
> the default alphabet specifies the '-' gap character.
>
> On the other hand, perhaps this 'location' property idea
> is overly complicated?
>
> Maybe all we need is a common convention about which
> keys to use in the annotation dictionary, and how to store
> the information (e.g. Python counting, start < end, and
> strand as +1 or -1 if present)?
>
> Thoughts or feedback please? Would a worked example
> help with my explanation?
>
> Thanks,
>
> Peter

--
Andrew Sczesnak
Bioinformatician, Littman Lab
Howard Hughes Medical Institute
New York University School of Medicine
540 First Avenue
New York, NY 10016
p: (212) 263-6921
f: (212) 263-1498
e: andrew.sczesnak at med.nyu.edu

From p.j.a.cock at googlemail.com Wed May 23 09:29:52 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 May 2012 10:29:52 +0100
Subject: [Biopython-dev] Fwd: buildbot failure in Biopython on Linux 64 - Python 2.6
In-Reply-To: <201205230307.q4N37ujM032082@testing.open-bio.org>
References: <201205230307.q4N37ujM032082@testing.open-bio.org>
Message-ID:

Hi Brandon,

I only tested your fix on my Mac which doesn't have PAML installed.
The buildslaves caught some problems last night: e.g.
from a 64bit Linux machine,

======================================================================
FAIL: testParseAllNSsites (__main__.ModTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PAML_codeml.py", line 245, in testParseAllNSsites
    self.assertEqual(len(results["NSsites"]), 6, version_msg)
AssertionError: Improper parsing for version 4.3

And from a 32bit Windows machine,

======================================================================
FAIL: testParseAllNSsites (test_PAML_codeml.ModTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\repositories\BuildBotBiopython\win32\build\build\py3.2\Tests\test_PAML_codeml.py", line 245, in testParseAllNSsites
    self.assertEqual(len(results["NSsites"]), 6, version_msg)
AssertionError: 1 != 6 : Improper parsing for version 4.1

Can you reproduce this locally?

Thanks,

Peter

From p.j.a.cock at googlemail.com Wed May 23 13:29:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 May 2012 14:29:44 +0100
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References:
Message-ID:

On Tue, Apr 17, 2012 at 5:11 PM, Kevin Jacobs wrote:
> On Tue, Apr 17, 2012 at 11:23 AM, Peter Cock
> wrote:
>>
>> I've just rebased my bgzf branch, which I think is ready to apply to the
>> trunk. It has been tested under Python 2, PyPy [*], Jython and Python 3.
>> https://github.com/peterjc/biopython/tree/bgzf2
>>
>> Would anyone like to review this please? There are unittests and
>> plenty of docstrings - but so far nothing in the Tutorial though.
>>
>
> Hi Peter,
>
> I've implemented code to create BAM/tabix style index files and perform
> lookups, so it has been high on my list to test and validate your BGZF code
> (rather than having to write my own).
> I'm notoriously short on time, but this is
> in the critical path for several projects and I'm going to work on it over
> the next week or so.
>
> -Kevin

Hi Kevin,

Did you get a chance to look at my BGZF code?

Thanks,

Peter

From bioinformed at gmail.com Wed May 23 14:09:03 2012
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Wed, 23 May 2012 10:09:03 -0400
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References:
Message-ID:

Hi Peter,

I've been going through it carefully, though I've been diverted a few
times (by fixing bgzip problems in Boost and adapting BAM/tabix
interval indexing to HDF5). I'll clean up my notes and finish going
through the code over the next few days.

-Kevin

On Wed, May 23, 2012 at 9:29 AM, Peter Cock wrote:
>
> Hi Kevin,
>
> Did you get a chance to look at my BGZF code?
>
> Thanks,
>
> Peter
>
>

From arklenna at gmail.com Wed May 23 21:56:03 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 23 May 2012 17:56:03 -0400
Subject: [Biopython-dev] GSoC python variant update 3
Message-ID:

Hi all,

Latest blog post here:
http://arklenna.tumblr.com/post/23630012065/week-1

Brief summary:

I have reversed my prior conclusion that `SeqRecord` is inadequate for
holding variant data. It is still not ideal, but the advantages of
using an existing native object are substantial, and the disadvantages
can be reduced by creating an accessor for the variant-specific data
within a `SeqRecord`.

I've made an outline of how I would store the information returned by
PyVCF within `SeqRecord` and `SeqFeature` objects. It includes a few
questions about the most logical way to store certain variant
information.

As the coding period has now started, I'll be pushing some prototypes
to GitHub in the near future.
Lenna

From p.j.a.cock at googlemail.com Thu May 24 09:18:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 24 May 2012 10:18:33 +0100
Subject: [Biopython-dev] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

On Thu, May 24, 2012 at 6:52 AM, Artem Tarasov wrote:
> Hi all,
>
> it's a good point that many line-based formats need some sort of compression
> with indexing, and BGZF is good enough in that sense.

BGZF doesn't have to be used with line-based formats, anything
with sequential records would work (like BAM files of course). I've not
tried it to see how well it compressed, but SFF files in BGZF should
work too as another example.

>> So far, I think Artem's BGZF implementation is entirely in D; I may just
>> add Ruby support for BGZF separately.
>
> The only problem I see with that approach is that it's hardly possible to
> get parallel compression with MRI. But overall I tend to agree with Clayton.
> Firstly, it's hard to abstract away some common interface right now, not
> writing any code and looking at it. Secondly, there're still problems with D
> shared library support. We were assured by GDC developer that they'll get
> solved soon, but at the moment the situation is far from perfect.

My BGZF code is pure Python (using C zlib via Python's zlib library),
and does not currently tackle parallel compression or decompression.
There has been recent work in samtools for this.

We don't need parallel compression/decompression of BGZF for it to
be useful.
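
As a rough illustration of the "pure Python on top of zlib" point, here
is a minimal sketch of writing and reading a single BGZF block, plus the
64-bit virtual offset used by BAM/tabix style indexes. It follows the
block layout described in the SAM/BAM specification. The function names
are made up for this example and are not the API of the bgzf branch, and
error handling is kept to the bare minimum.

```python
import gzip
import struct
import zlib


def make_bgzf_block(data):
    """Compress a small chunk of bytes into one BGZF block (illustrative helper)."""
    if len(data) > 65280:
        raise ValueError("too much data for a single block in this sketch")
    # Raw deflate stream: negative wbits means no zlib or gzip wrapper.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = compressor.compress(data) + compressor.flush()
    # BSIZE is the total block length minus one:
    # 18 byte header + compressed data + 8 byte trailer - 1.
    bsize = len(cdata) + 25
    if bsize > 0xFFFF:
        raise ValueError("compressed block would exceed 64 KiB")
    # gzip header with FLG=4 (FEXTRA), MTIME=0, XFL=0, OS=255, XLEN=6.
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # The 'BC' extra subfield carrying BSIZE is what makes this BGZF.
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def read_bgzf_block(buf, start=0):
    """Return (decompressed bytes, offset of the next block)."""
    if buf[start:start + 4] != b"\x1f\x8b\x08\x04":
        raise ValueError("not a BGZF block")
    (xlen,) = struct.unpack_from("<H", buf, start + 10)
    pos = start + 12
    extra_end = pos + xlen
    bsize = None
    # Walk the gzip extra subfields looking for the 'BC' one.
    while pos < extra_end:
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if (si1, si2, slen) == (66, 67, 2):
            (bsize,) = struct.unpack_from("<H", buf, pos + 4)
        pos += 4 + slen
    if bsize is None:
        raise ValueError("missing BC subfield")
    block_end = start + bsize + 1
    crc, isize = struct.unpack_from("<II", buf, block_end - 8)
    data = zlib.decompress(buf[extra_end:block_end - 8], -15)
    if len(data) != isize or (zlib.crc32(data) & 0xFFFFFFFF) != crc:
        raise ValueError("corrupt block")
    return data, block_end


def pack_voffset(block_start, within_block):
    """Combine a block's file offset and an offset inside its data."""
    return (block_start << 16) | within_block


def unpack_voffset(voffset):
    """Split a virtual offset back into (block_start, within_block)."""
    return voffset >> 16, voffset & 0xFFFF
```

Because each block is also a valid gzip member, the output can be
checked with the standard gzip module. The virtual offsets are opaque
numbers: adding to one is only meaningful while you stay inside the
same block, which is the reason for avoiding offset arithmetic in the
indexing code.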
Peter

From albl500 at york.ac.uk Thu May 24 11:06:45 2012
From: albl500 at york.ac.uk (Alex Leach)
Date: Thu, 24 May 2012 12:06:45 +0100
Subject: [Biopython-dev] json formatting of SeqRecord objects
Message-ID: <6817662.WQ7uPsvdBl@metabuntu>

Dear all,

I've written a fairly simple SeqRecord formatter to convert sequences
to/from JSON objects, and wondered if it might be useful enough to be
included in BioPython.

It currently injects 'json' into SeqIO and AlignIO's _FormatToIterator
and _FormatToWriter dictionaries, so can be used like any other
SeqRecord format. I'm not sure where exactly I should submit it, but I
thought here might do as an initial proposal..

I attach the source code. If you'd be interested in using it, let me
know and I'll tidy it up to standards.

Kind regards,
Alex

--
Alex Leach BSc. MRes.
Department of Biology
University of York
York YO10 5DD
United Kingdom

EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jsonIO.py
Type: text/x-python
Size: 5756 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Thu May 24 11:24:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 24 May 2012 12:24:27 +0100
Subject: [Biopython-dev] json formatting of SeqRecord objects
In-Reply-To: <6817662.WQ7uPsvdBl@metabuntu>
References: <6817662.WQ7uPsvdBl@metabuntu>
Message-ID:

On Thu, May 24, 2012 at 12:06 PM, Alex Leach wrote:
> Dear all,
>
> I've written a fairly simple SeqRecord formatter to convert sequences to/from
> JSON objects, and wondered if it might be useful enough to be included in
> BioPython.

That does look interesting, but from scanning the code the JSON
representation is extremely Biopython specific. I take it there is no
existing JSON representation in wide usage that you know of? It
strikes me that something common between the Bio* projects would
be much more valuable.

I know that TogoWS (which uses both BioRuby and BioPerl internally)
has some JSON support, but it looks more like a dump of the raw
file from a quick look.

How are you using this now? Do you pass JSON encoded SeqRecord
objects over the network for something?

> It currently injects 'json' into SeqIO and AlignIO's _FormatToIterator and
> _FormatToWriter dictionaries, so can be used like any other SeqRecord format.
> I'm not sure where exactly I should submit it, but I thought here might do as
> an initial proposal..
>
> I attach the source code. If you'd be interested in using it, let me know and
> I'll tidy it up to standards.
>
> Kind regards,
> Alex

If you are happy with git, we'd suggest you make a fork on github
(i.e. make a copy of the repository), then develop the new code on
a new branch.

Peter

From p.j.a.cock at googlemail.com Thu May 24 11:32:53 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 24 May 2012 12:32:53 +0100
Subject: [Biopython-dev] json formatting of SeqRecord objects
In-Reply-To:
References: <6817662.WQ7uPsvdBl@metabuntu>
Message-ID:

On Thu, May 24, 2012 at 12:24 PM, Peter Cock wrote:
> On Thu, May 24, 2012 at 12:06 PM, Alex Leach wrote:
>> Dear all,
>>
>> I've written a fairly simple SeqRecord formatter to convert sequences to/from
>> JSON objects, and wondered if it might be useful enough to be included in
>> BioPython.
>
> That does look interesting, but from scanning the code the JSON
> representation is extremely Biopython specific. I take it there is no
> existing JSON representation in wide usage that you know of? It
> strikes me that something common between the Bio* projects would
> be much more valuable.
>
> I know that TogoWS (which uses both BioRuby and BioPerl internally)
> has some JSON support, but it looks more like a dump of the raw
> file from a quick look.

e.g.
Consider this two protein example for UniProt:

http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE
http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE.fasta
http://togows.dbcls.jp/entry/uniprot/A1AG1_HUMAN,A1AG1_MOUSE.json

Or with GenBank,

http://togows.dbcls.jp/entry/protein/117606345,117606345
http://togows.dbcls.jp/entry/protein/117606345,117606345.fasta
http://togows.dbcls.jp/entry/protein/117606345,117606345.json

Currently TogoWS' JSON output is essentially the raw record (here
plain text "swiss" format or plain text GenBank), using a JSON list
to make the division into the records explicit.

Peter

From mictadlo at gmail.com Fri May 25 06:49:13 2012
From: mictadlo at gmail.com (Mic)
Date: Fri, 25 May 2012 16:49:13 +1000
Subject: [Biopython-dev] [BioRuby] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

I think Picard tools does parallel compression/decompression of BGZF.

Cheers,
Mic

On Thu, May 24, 2012 at 7:18 PM, Peter Cock wrote:

> On Thu, May 24, 2012 at 6:52 AM, Artem Tarasov
> wrote:
> > Hi all,
> >
> > it's a good point that many line-based formats need some sort of
> compression
> > with indexing, and BGZF is good enough in that sense.
>
> BGZF doesn't have to be used with line-based formats, anything
> with sequential records would work (like BAM files of course). I've not
> tried it to see how well it compressed, but SFF files in BGZF should
> work too as another example.
>
> >> So far, I think Artem's BGZF implementation is entirely in D; I may just
> >> add Ruby support for BGZF separately.
> >
> > The only problem I see with that approach is that it's hardly possible to
> > get parallel compression with MRI. But overall I tend to agree with
> Clayton.
> > Firstly, it's hard to abstract away some common interface right now, not
> > writing any code and looking at it.
> > Secondly, there're still problems
> with D
> > shared library support. We were assured by GDC developer that they'll get
> > solved soon, but at the moment the situation is far from perfect.
>
> My BGZF code is pure Python (using C zlib via Python's zlib library),
> and does not currently tackle parallel compression or decompression.
> There has been recent work in samtools for this.
>
> We don't need parallel compression/decompression of BGZF for it to
> be useful.
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From bioinformed at gmail.com Fri May 25 11:15:00 2012
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Fri, 25 May 2012 07:15:00 -0400
Subject: [Biopython-dev] [BioRuby] BGZF support, was Re: Biopython 1.60 plans and beyond
In-Reply-To:
References: <4F91E4CF.8040602@med.nyu.edu> <4F9AFA1F.6030103@med.nyu.edu>
Message-ID:

On Fri, May 25, 2012 at 2:49 AM, Mic wrote:

> I think Picard tools does parallel compression/decompression of BGZF.
>
>
Here is what Picard does for one command:

MergeSamFiles

Merges multiple SAM/BAM files into one file.

USE_THREADING=Boolean
    Option to create a background thread to encode, compress and write
    to disk the output file. The threaded version uses about 20% more
    CPU and decreases runtime by ~20% when writing out a compressed BAM
    file. Default value: false. This option can be set to 'null' to
    clear the default value. Possible values: {true, false}

BAM output (dominated by zlib compression and/or IO write latency) is
run in a different thread, but is still performed sequentially over
blocks. The recent samtools fork attempts to buffer uncompressed BAM
blocks and allocates multiple threads to compress several in parallel
since they are independent.
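
The "blocks are independent" idea can be sketched in a few lines of
Python. This is only an illustration under some assumptions: plain gzip
members stand in for BGZF blocks (real BGZF adds a BC extra field with
the block size to each member), and the chunk size, worker count and
function name are arbitrary choices for the example rather than what
samtools or Picard actually do.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Uncompressed bytes per block; an assumed value chosen to stay under
# the 64 KiB limit a BGZF block would impose.
CHUNK_SIZE = 0xFF00


def compress_blocks_parallel(data, workers=4):
    """Split data into fixed-size chunks and compress each one independently.

    Returns the compressed blocks in their original order.
    """
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the blocks can simply be
        # written out one after another once they are ready.
        return list(pool.map(gzip.compress, chunks))
```

Since no block depends on its neighbours, the concatenated output is a
valid multi-member gzip stream, and any single block can still be
decompressed on its own, which is what random access relies on.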

-Kevin

From p.j.a.cock at googlemail.com Mon May 28 11:06:40 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 28 May 2012 12:06:40 +0100
Subject: [Biopython-dev] HMMER in SearchIO
Message-ID:

Hi Bow,

I've been looking over your GSoC branch, and noticed that for HMMER3
we've only talked about the regular text output. I think that the
table output is also worth supporting (offers one line per query,
or one line per domain). This isn't tab separated but uses variable
spaces to give a fixed column layout, but should be easier to parse.

Something to think about later on...

Peter

From arklenna at gmail.com Tue May 29 21:32:47 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 29 May 2012 17:32:47 -0400
Subject: [Biopython-dev] SeqRecord id behavior
Message-ID:

Hi all,

I have some questions/comments regarding how SeqRecord handles various
arguments.

>>> print SeqRecord(seq="G")
ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
'G'
>>> print SeqRecord(seq="G", id=2)
TypeError: id argument should be a string
>>> print SeqRecord(seq="G", id=None)
Name: <unknown name>
Description: <unknown description>
Number of features: 0
'G'

1. Couldn't a sequence id hypothetically be an integer? In which
case, it could be converted to a string.

2. Regarding this comment on line 180:
https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L180

    if id is not None and not isinstance(id, basestring):
        #Lots of existing code uses id=None... this may be a bad idea.
        raise TypeError("id argument should be a string")

Why might that be a bad idea? id=None will currently set self.id to
None, so it doesn't affect the type checking.

3.
Is it desirable to be able to remove the id from the __str__
representation, or would it be more consistent to do this:

    if id == "<unknown id>" or id is None:
        self.id = "<unknown id>"
    else:
        (typecheck here)

Lenna

From p.j.a.cock at googlemail.com Tue May 29 22:02:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 May 2012 23:02:20 +0100
Subject: [Biopython-dev] SeqRecord id behavior
In-Reply-To:
References:
Message-ID:

On Tue, May 29, 2012 at 10:32 PM, Lenna Peterson wrote:
> Hi all,
>
> I have some questions/comments regarding how SeqRecord handles various
> arguments.
>
>>>> print SeqRecord(seq="G")
> ID: <unknown id>
> Name: <unknown name>
> Description: <unknown description>
> Number of features: 0
> 'G'
>>>> print SeqRecord(seq="G", id=2)
> TypeError: id argument should be a string
>>>> print SeqRecord(seq="G", id=None)
> Name: <unknown name>
> Description: <unknown description>
> Number of features: 0
> 'G'
>
> 1. Couldn't a sequence id hypothetically be an integer? In which
> case, it could be converted to a string.

We want to be able to assume a string for things like the string
formatting operators used in SeqRecord output (dealing with None
as a special case is annoying enough).

> 2. Regarding this comment on line 180:
> https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L180
>
>    if id is not None and not isinstance(id, basestring):
>        #Lots of existing code uses id=None... this may be a bad idea.
>        raise TypeError("id argument should be a string")
>
> Why might that be a bad idea? id=None will currently set self.id to
> None, so it doesn't affect the type checking.

Using None for the ID prevents code assuming it is a string (but
see below).

> 3. Is it desirable to be able to remove the id from the __str__
> representation,

No - the sequence and the ID are the two most important bits
of a SeqRecord.

> or would it be more consistent to do this:
>
>    if id == "<unknown id>" or id is None:
>        self.id = "<unknown id>"
>    else:
>        (typecheck here)
>
> Lenna

I never liked the fact that "<unknown id>" has a space in it.
This breaks the assumption of loads of file formats. Many file
formats don't like an empty ID, so maybe "" is better.

On the other hand, it is fairly common in Python to use None
as a missing data representation... which currently the SeqRecord
allows you to do.

Note these SeqRecord defaults predate Bio.SeqIO - if we didn't have
to worry about breaking existing code I would much rather make
the ID a mandatory SeqRecord argument.

Peter

From w.arindrarto at gmail.com Wed May 30 21:44:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 30 May 2012 23:44:04 +0200
Subject: [Biopython-dev] GSoC Project Update -- 4
Message-ID:

Hi everyone,

I just posted my latest GSoC update here:
http://bow.web.id/blog/2012/05/assembling-the-parsers/

To summarize: I've been working on more SearchIO parsers last week,
adding more formats to support. We now have a SearchIO-specific BLAST+
XML parser (it was first implemented on top of NCBIXML). It uses
ElementTree as the base XML parser, with promising performance gains.

I've also completed SearchIO's blast tabular parser, which takes in
the BLAST+ tabular output files with or without headers. If the
tabular file has headers, it can parse any number of columns in any
order as long as the columns with hit and query IDs are present.

Finally, I've finished writing the HMMER plain text parser. For now,
the parser can handle outputs from hmmscan and hmmsearch, single and
multiple queries. All these parsers have been tested using the test
cases I've generated previously.

Additionally, I also had a public discussion with Peter on Github
regarding SearchIO objects here:
https://github.com/bow/biopython/commit/69a0ab64dfa7718f7455ca4c3961e95277fb4dbc#-P0,
if anyone is interested. It started as a discussion on some behaviors
of the HSP object, but also relates to other issues raised earlier
(the dynamic SeqRecord coordinates Peter brought up earlier and
Biopython's platform support).

That's it for this week :).

cheers,
Bow
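
To make the "any columns in any order" behaviour of the tabular parser
concrete, here is a minimal sketch of the general idea, not the actual
SearchIO code. It assumes the commented BLAST+ tabular layout, where a
"# Fields:" comment line lists the column names separated by commas and
the data rows are tab separated. The function name and the exact column
names checked for ("query id", "subject id") are assumptions for the
example.

```python
def parse_commented_tabular(lines):
    """Yield one dict per hit row, keyed by the names on the '# Fields:' line."""
    fields = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("# Fields:"):
                # Column names come from the file itself, so their number
                # and order do not matter to the rest of the parser.
                names = line[len("# Fields:"):].split(",")
                fields = [name.strip() for name in names]
                for required in ("query id", "subject id"):
                    if required not in fields:
                        raise ValueError("missing required column: %s" % required)
            # Other comment lines (program, query, database) are skipped here.
            continue
        if fields is None:
            raise ValueError("data row seen before any '# Fields:' line")
        values = line.split("\t")
        if len(values) != len(fields):
            raise ValueError("expected %i columns, got %i"
                             % (len(fields), len(values)))
        yield dict(zip(fields, values))
```

Each row comes back as a plain dictionary of strings, so a caller can
look up row["query id"] or row["evalue"] regardless of where those
columns sat in the file. A real parser would go on to group rows by
query and convert the numeric columns.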