From mmokrejs at fold.natur.cuni.cz Wed Mar 2 18:00:04 2011
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 03 Mar 2011 00:00:04 +0100
Subject: [Biopython] traditional NCBI blast vs. blast+
Message-ID: <4D6ECBF4.9050006@fold.natur.cuni.cz>

Hi,

I needed to run and parse some blastn analysis. I had a look into the Tutorial and followed the currently recommended blast+ approach. Somewhat I was not getting any results. It seems to me a formatdb-formatted database is not readable by the blast+ tools. I had a look what tools are installed on my Gentoo Linux along with blastn, blastx and the other tools coming from blast+ bundle and from filenames I just could not guess what I am supposed to run over my fasta target database to make it searchable by blastn. I would prefer if biopython would throw out some error if there are no appropriate files (which names could be guessed depending on the (t)blastn/x/p, etc.).

The tutorial mentions that I should look up an older version of the Tutorial for examples on the old, NCBI blast usage via biopython. It took me a while but I found through Google some docs like that. ;-)

On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation, not a single README, HOWTO, Changes, just the binaries and libs. What is installed on other Linux platforms? Would you mind sharing this with me? I just failed to find by Google what tools I should use instead of the formatdb. I found some FAQ on the NCBI tools++ site but that talked just about C++ API etc., nothing from the user perspective.

On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ are not being installed because they have the same names as the utilities from "old" ncbi-tools (hence overwriting their files). The ncbi-tools++ package is not allowed to be installed on stable "systems" (lack of testing or open bug reports) so most people using Gentoo do NOT have ncbi-tools++ and probably won't for a while.
I propose to keep support for the "old" blast for a long while. Luckily, the blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

What do you think? Is the blast+ approach faster, more stable, or just newer so we all like to "upgrade"? Where are some docs and what is the formatdb-like tool in blast+. ;)

Thanks,
Martin

From nuin at genedrift.org Wed Mar 2 18:06:17 2011
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 2 Mar 2011 18:06:17 -0500
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID: <4FC7BB7C-9E17-4699-850E-0A4F4E63521B@genedrift.org>

Hi

Just answering your blast portion of the question:

- you have to run makeblastdb in order to create the database.
- you should be able to download the source of blast+ to compile, it should compile just fine on your system
- and yes, it seems to be faster and more stable than the previous version, at least on the tests I run

Paulo

On 2011-03-02, at 6:00 PM, Martin Mokrejs wrote:

> What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"? Where are some docs and what is the formatdb-like
> tool in blast+. ;)
> Thanks,
> Martin
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Thu Mar 3 05:27:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 10:27:54 +0000
Subject: [Biopython] traditional NCBI blast vs. blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID:

On Wed, Mar 2, 2011 at 11:00 PM, Martin Mokrejs wrote:
> Hi,
> I needed to run and parse some blastn analysis. I had a look into the Tutorial
> and followed the currently recommended blast+ approach. Somewhat I was not
> getting any results. It seems to me a formatdb-formatted database is not readable
> by the blast+ tools.

I think it is possible to get databases which will work with both legacy BLAST and BLAST+ (since the NCBI only offer one set for NR etc) but I have not tried to mix the two. As pointed out by Paulo, the successor to formatdb in BLAST+ is makeblastdb, so just use that instead.

> I had a look what tools are installed on my Gentoo Linux
> along with blastn, blastx and the other tools coming from blast+ bundle and from
> filenames I just could not guess what I am supposed to run over my fasta
> target database to make it searchable by blastn.

This is very clear in the BLAST+ documentation from the NCBI website (link given below), and is arguably a Gentoo packaging issue.

> I would prefer if biopython
> would throw out some error if there are no appropriate files (which names could
> be guessed depending on the (t)blastn/x/p, etc.).

BLAST+ itself generally gives useful errors.

> The tutorial mentions that I should look up an older version of the Tutorial
> for examples on the old, NCBI blast usage via biopython. It took me a while but
> I found through Google some docs like that. ;-)

You could have just downloaded one of the old Biopython releases (the zip or tar balls) and looked in the Doc subdirectory. I'll clarify the current text in the tutorial to point people there.

> On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
> not a single README, HOWTO, Changes, just the binaries and libs.

File a bug with Gentoo?

> What is installed
> on other Linux platforms? Would you mind sharing this with me? I just failed
> to find by Google what tools I should use instead of the formatdb. I found
> some FAQ on the NCBI tools++ site but that talked just about C++ API etc.,
> nothing from the user perspective.

You are probably looking for this, linked to from the BLAST+ download page:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/user_manual.pdf

> On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ are not being
> installed because they have the same names as the utilities from "old" ncbi-tools
> (hence overwriting their files). The ncbi-tools++ package is not allowed to be
> installed on stable "systems" (lack of testing or open bug reports) so most people
> using Gentoo do NOT have ncbi-tools++ and probably won't for a while.

I was aware of the name clash for rpsblast, and yes, this is a problem the NCBI could have avoided. You could just ignore the Gentoo package and get BLAST+ directly from the NCBI.

> I propose to keep support for the "old" blast for a long while.

We've already delayed deprecating the "legacy" BLAST wrappers, but probably we should do that after releasing Biopython 1.57.

> Luckily, the
> blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

The NCBI kept the same XML output format, and in fact the plain text output is close enough that our old text parser could be updated to cope.

> What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"?

I like BLAST+ for some new functionality (FASTA vs FASTA for example), but since the NCBI is dropping the "legacy" BLAST you will have to upgrade at some point.

> Where are some docs and what is the formatdb-like tool in blast+. ;)

I've given links to the docs above, they're linked to on the NCBI website.
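
For readers following the thread, here is a minimal sketch of the BLAST+ equivalents of formatdb and blastall -m 7. It only builds the command lines and runs them if the executables are found on the PATH. The helper names and the file names (target.fasta, query.fasta, hits.xml) are placeholders invented for illustration, and the options shown (-in, -dbtype, -out, -query, -db, -outfmt 5 for XML) should be checked against the BLAST+ user manual for your version.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta, dbtype="nucl", out=None):
    """Build a makeblastdb command line (the BLAST+ successor to formatdb)."""
    cmd = ["makeblastdb", "-in", fasta, "-dbtype", dbtype]
    if out is not None:
        cmd += ["-out", out]
    return cmd


def blastn_xml_cmd(query, db, out):
    """Build a blastn command line writing XML (-outfmt 5), like blastall -m 7."""
    return ["blastn", "-query", query, "-db", db, "-outfmt", "5", "-out", out]


def run_if_installed(cmd):
    """Run cmd only if its executable is on the PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Placeholder file names, for illustration only.
for cmd in (makeblastdb_cmd("target.fasta", out="target"),
            blastn_xml_cmd("query.fasta", "target", "hits.xml")):
    print(" ".join(cmd))
```

If the commands do run, the resulting XML file should be readable with Bio.Blast.NCBIXML in the same way as the legacy blastall -m 7 output mentioned above.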

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Mar 3 15:32:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Mar 2011 20:32:11 +0000
Subject: [Biopython] Fwd: [Bosc] Bioinformatics Open Source Conference (BOSC 2011)--Call for Abstracts
In-Reply-To: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
References: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov>
Message-ID:

Dear Biopythoneers,

BOSC will be in Vienna, Austria this year.

Peter

---------- Forwarded message ----------
From: Nomi Harris
Date: Thu, Mar 3, 2011 at 7:37 PM
Subject: [Bosc] Bioinformatics Open Source Conference (BOSC 2011)--Call for Abstracts
To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD Announcements List, GMOD Developers List
Cc: Nomi Harris

We invite you to submit an abstract to BOSC 2011! Please forward this message as appropriate, and forgive multiple postings.

Call for Abstracts for the 12th Annual Bioinformatics Open Source Conference (BOSC 2011)
An ISMB 2011 Special Interest Group (SIG)

Dates: July 15-16, 2011
Location: Vienna, Austria
Web site: http://www.open-bio.org/wiki/BOSC_2011
Email: bosc at open-bio.org
BOSC announcements mailing list: http://lists.open-bio.org/mailman/listinfo/bosc-announce

Important Dates:

April 18, 2011: Deadline for submitting abstracts to BOSC 2011
May 9, 2011: Notifications of accepted abstracts emailed to corresponding authors
July 13-14, 2011: Codefest 2011 programming session (see http://www.open-bio.org/wiki/Codefest_2011 for details)
July 15-16, 2011: BOSC 2011
July 17-19, 2011: ISMB 2011

The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community.
To be considered for acceptance, software systems representing the central topic in a presentation submitted to BOSC must be licensed with a recognized Open Source License, and be freely available for download in source code form.

We invite you to submit abstracts for talks and posters. Sessions include:

- Approaches to parallel processing
- Cloud-based approaches to improving software and data accessibility
- The Semantic Web in open source bioinformatics
- Data visualization
- Tools for next-generation sequencing
- Other Open Source software

In addition to the above sessions, there will be a panel discussion about "Meeting the challenges of inter-institutional collaboration". We are also working to arrange a joint session with one of the other ISMB SIGs.

Thanks to generous sponsorship from Eagle Genomics and an anonymous donor, we are pleased to announce a competition for three Student Travel Awards for BOSC 2011. Each winner will be awarded $250 to defray the costs of travel to BOSC 2011.

For instructions on submitting your abstract, please visit http://www.open-bio.org/wiki/BOSC_2011#Abstract_Submission_Information

BOSC 2011 Organizing Committee:
Nomi Harris and Peter Rice (co-chairs); Brad Chapman, Peter Cock, Erwin Frise, Darin London, Ron Taylor

_______________________________________________
BOSC mailing list
BOSC at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bosc

From hlapp at drycafe.net Fri Mar 4 18:26:25 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Mar 2011 18:26:25 -0500
Subject: [Biopython] Informatics job opportunity at NESCent
Message-ID: <1878F27F-000D-4C80-B9EA-A83F7887828F@drycafe.net>

(Apologies if you receive multiple copies, and also if you are not interested in job opportunities. In my defense, quite a few people on Bio* lists might qualify for (let alone enjoy) the position. And if you know someone who might be interested please forward.)

===================================================
User Interface Design and Web Application Developer
===================================================

The National Evolutionary Synthesis Center (NESCent) seeks a creative and enthusiastic individual to design user interfaces and web applications for scientific applications. The incumbent will work as part of a small informatics team in close collaboration with domain scientists.

NESCent is an NSF-funded center dedicated to cross-disciplinary research in evolutionary science. Our informatics team works closely with visiting and resident scientists to support their custom software and database development needs. All NESCent software products are open-source, and the Center has a number of initiatives to actively promote collaborative development of community software resources (informatics.nescent.org). Above all, we are enthusiastic about our work, about the mission of the Center, and about the contribution of informatics to that mission.

Job description:

The incumbent will design and develop user interfaces and web applications for databases and other software tools for sponsored scientists and staff. The job responsibilities include all stages of the software development process, including requirements gathering, design, implementation, release packaging and documentation, as part of a small team (typically 2-3 individuals) following project management best practices. We expect the incumbent to present their work at conferences and contribute to publications with scientific collaborators; interact regularly with visiting and resident scientists, other members of the informatics team and Center staff; and generally serve as an expert resource for Center personnel. The position provides opportunities for professional development.
Most informatics staff work at our Durham NC offices, located adjacent to Duke University, but we do support a wide range of technologies for virtual communication with off-site staff and collaborators.

Required Qualifications:

* Demonstrated success collaborating with clients on custom software solutions
* Experience with various stages of the software development cycle
* Expertise in development and testing of user interface designs
* Excellent communication skills, both virtual and face-to-face
* A four-year college degree in Computer Science, Bioinformatics or a related field

Preferred Qualifications:

* M.S. or Ph.D. in Computer Science, Bioinformatics or related field along with demonstrated interest in science, particularly biology
* Expertise in rapid application development and respective programming technologies and languages (e.g., modern scripting languages and web-application frameworks such as Python/Django, Ruby/Ruby-on-Rails, and Perl/Catalyst), fluency in Java programming, and prior experience in relational database programming (PostgreSQL or MySQL)
* Expertise in dynamic and interactive web technologies (JavaScript, CGI), web service (SOAP, REST, XML, JSON) and semantic web technologies
* Experience with open-source, and collaborative, software development, software usability design and assessment
* Expertise in graphic design, data visualization and/or scientific data integration

How to apply:

Please send cover letter, resume and contact information for three references to Dr. Karen Cranston, Training Coordinator and Bioinformatics Project Manager (karen.cranston at nescent.org). Review of applications will begin March 21, 2011. Informal inquiries or requests for additional information may be directed to Dr. Cranston by email or phone (+1-919-613-2275).

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From p.j.a.cock at googlemail.com Mon Mar 7 09:19:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Mar 2011 14:19:11 +0000
Subject: [Biopython] Tutorial proofreading?
Message-ID:

Hi all,

We're planning to do the Biopython 1.57 release soon, and one thing where some volunteer help would be useful is our documentation - in particular the tutorial.

These links are for the current tutorial, at the time of writing that means Biopython 1.56:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

These links are for the latest in-progress tutorial (automatically updated nightly from the git repository):
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

I would like some volunteers to proofread this please, and report any problems, suggestions or additions. Ideally I'd like people to check the examples work (although some will need the latest Biopython installed from the source code). Even reporting minor typos is useful, as fixing them will make a better impression for newcomers reading this.

Thanks,

Peter

P.S. The tutorial source file is here, if you are interested,
https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex

From anaryin at gmail.com Mon Mar 7 09:21:19 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Mar 2011 15:21:19 +0100
Subject: [Biopython] Tutorial proofreading?
In-Reply-To:
References:
Message-ID:

Will have a look at it this week, I noticed some problems in the Bio.PDB section (outdated code).

Cheers!

From rmb32 at cornell.edu Mon Mar 7 11:37:32 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 07 Mar 2011 11:37:32 -0500
Subject: [Biopython] Google Summer of Code project ideas
Message-ID: <4D7509CC.3040604@cornell.edu>

Hi all,

I'm going to be OBF project admin again this year for Google Summer of Code. OBF's application is due later this week, and we need to update our project ideas on the OBF wiki page and on each project's individual wiki pages.

So, for each of the OBF projects that wants to do GSoC again this year, please:

a.) Update the list of project ideas on your project's GSoC page (BioPython, BioPerl, BioRuby, etc). Add new ones, remove ones that have already been done or no longer relevant, etc.

b.) Update the list of project ideas on the main OBF GSoC page (http://www.open-bio.org/wiki/Google_Summer_of_Code) to match.

c.) Let me know via email that you have done so and it's ready for Google to peruse.

Please have the updates done, if possible, by this Friday (March 11). The number and quality of the project ideas are part of the evaluation process for whether OBF is accepted as a Summer of Code organization again this year, so let's come up with some good ones. :-)

Rob

----
Robert Buels
(prospective) 2011 OBF GSoC Organization Admin

From p.cherepanov at imperial.ac.uk Mon Mar 7 21:42:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 02:42:26 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>

is there an easy way to define a circular DNA sequence in BioPython?

It would be useful to have something like:

my_seq = Seq('ATGCATGC...ATGC', circular_dna)

am I missing something obvious??

Peter

From komalsnehal1991 at gmail.com Tue Mar 8 02:58:11 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Tue, 8 Mar 2011 13:28:11 +0530
Subject: [Biopython] Biopython Projects
Message-ID:

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India studying Bioengineering. I'm a fan of Python and I love Computational Biology and I plan to do my further studies in the same. I went through the projects on the Biopython page. I was very much interested in the RNA Structure project mentioned. Any contribution which I make will help me a lot and the organisation too. In fact, I am currently doing a project on RNA Editing. I'll be very happy to integrate my knowledge. Please help me on how I should proceed.

Komal

From p.j.a.cock at googlemail.com Tue Mar 8 03:45:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 08:45:31 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 2:42 AM, Peter Cherepanov wrote:
> is there an easy way to define a circular DNA sequence in BioPython?
>
> It would be useful to have something like:
>
> my_seq = Seq('ATGCATGC...ATGC', circular_dna)
>
> am I missing something obvious??
>
> Peter

No, but how would you expect it to act? We've talked about such an object before... I'd have to go through my old emails but I recall there being some annoying corner cases to consider with the slice method (__getitem__).

Peter

From p.j.a.cock at googlemail.com Tue Mar 8 05:48:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 10:48:13 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov wrote:
> I suppose if a DNA sequence is kept as a simple Python string, there is
> no easy way to have it "circular". I am a beginner in Python (I use it only
> occasionally, to solve very specific and simple-minded tasks, when manual
> match/cut-and-paste operations become too much of a burden). Having
> spent an extra hour to hack out and debug a piece of code to match/extract
> to/from circular plasmid sequences kept as Python strings, I thought: hey,
> wait a minute, there is such thing as BioPython, which should have made
> this task so much easier...
>
> Is there a way to "enhance" the Seq object? (or may be I do not know what
> I am talking about...).
>
> thanks a lot for responding!
>
> with best wishes,
>
> Peter

What I had in mind was a new class, CircularSeq, which would subclass the current Biopython Seq object, and still use a string internally for the sequence.

We could then modify the slice behaviour so that, perhaps, this would work by wrapping the origin:

c = CircularSeq('ACGTACGTACGT')
assert len(c)==12
print c[10:14]

It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat 14 as wrapped to 2, returning the four bases GTAC.

Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the same as 'ACGTACGTACGT'[10:] which is the last two letters only. This means anyone (or more importantly, any code) expecting the string like behaviour will get a nasty surprise (or a bug).

Another example, what about c[-2:]? For a plain string you'd get the last two letters. For a circular sequence you might think that should represent starting two before the origin, thus giving the last two letters plus the whole sequence? Also, c[-2:2] could mean the last two letters plus the first two letters, but for a plain python string that returns an empty string.
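
To make the contrast concrete, here is a minimal sketch comparing plain string slicing with one possible wrap-around rule. The helper circular_slice is hypothetical and is not part of Biopython; it assumes zero-based indices and simply takes (stop - start) letters beginning at start, counting modulo the sequence length. Other conventions are equally defensible, which is exactly the difficulty described above.

```python
def circular_slice(seq, start, stop):
    """Take (stop - start) letters from seq, starting at start, wrapping the origin.

    Hypothetical helper for illustration: zero-based indices, counted
    modulo len(seq). A non-positive span gives an empty string.
    """
    n = len(seq)
    count = stop - start
    if n == 0 or count <= 0:
        return seq[:0]
    first = start % n
    return "".join(seq[(first + i) % n] for i in range(count))


plain = "ACGTACGTACGT"

# Plain Python string slicing never wraps round the end.
print(plain[10:14])  # the last two letters only
print(plain[-2:2])   # an empty string

# The wrap-around rule sketched here crosses the origin instead.
print(circular_slice(plain, 10, 14))
print(circular_slice(plain, -2, 2))
```

Under this particular rule both circular calls return the four bases GTAC, while the plain string gives GT and an empty string respectively.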

Note that due to the way Python indexing works, single letter access is fine for negative indices, c[-2] would give the second last letter, 'G', which is consistent with wrapped counting back from the origin. We could also make c[14] wrap round to c[2] in this length 12 example (although there is a small risk of breaking code expecting an IndexError in this case).

There would be lots of other things to implement, like "in" and the find methods would need to check the substring across the origin. Then (for nucleotides), we'd need to ensure reverse_complement and complement also give a CircularSeq, likewise perhaps for the transcribe and back_transcribe. The translate method is particularly tricky as you can have an infinite reading frame, which might be represented as a circular protein sequence?

All in all, it is quite a lot of work, and there are several tricky bits where the desired behaviour is not clear cut. Could we come up with something useful or not?

Peter

P.S. Please CC the mailing list in your replies :)

From p.cherepanov at imperial.ac.uk Tue Mar 8 05:30:08 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 10:30:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID: <503B48D3-61BA-4C77-A441-00942366FFB4@imperial.ac.uk>

I suppose if a DNA sequence is kept as a simple Python string, there is no easy way to have it "circular". I am a beginner in Python (I use it only occasionally, to solve very specific and simple-minded tasks, when manual match/cut-and-paste operations become too much of a burden). Having spent an extra hour to hack out and debug a piece of code to match/extract to/from circular plasmid sequences kept as Python strings, I thought: hey, wait a minute, there is such thing as BioPython, which should have made this task so much easier...

Is there a way to "enhance" the Seq object?
(or may be I do not know what I am talking about...).

thanks a lot for responding!

with best wishes,

Peter

From moritz.beber at googlemail.com Tue Mar 8 06:32:44 2011
From: moritz.beber at googlemail.com (Moritz Beber)
Date: Tue, 08 Mar 2011 12:32:44 +0100
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID: <4D7613DC.2050506@googlemail.com>

On 03/08/2011 11:48 AM, Peter Cock wrote:
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?

If you just need circular behaviour in a small number of use cases, you could consider wrapping the sequence in a cycle iterator
http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle

From p.j.a.cock at googlemail.com Tue Mar 8 06:40:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 11:40:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <4D7613DC.2050506@googlemail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk> <4D7613DC.2050506@googlemail.com>
Message-ID:

On Tue, Mar 8, 2011 at 11:32 AM, Moritz Beber wrote:
>
> If you just need circular behaviour in a small number of use cases, you
> could consider wrapping the sequence in a cycle iterator
> http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
>

That might need a lot of memory if used on a long sequence like a bacterial genome, but an interesting idea.

Peter

From p.cherepanov at imperial.ac.uk Tue Mar 8 07:12:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 12:12:26 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:

c = CircularSeq('ATGCGGGGA')

where:

c[1:9] equals ATGCGGGGA (or, more awkwardly, c[0:9], if the original Python string numbering must be retained for some reasons)
c[8:7] equals GAATGCGGG
c[1:1] equals A (on a python string it is c[0:1] = A, of course)

Ideally, we would want to number such sequences from 1, after all these are the kind of objects we deal in biology.
And, most importantly of all, if must be able to: c.find('GGAATG') to return "7" Peter On 8 Mar 2011, at 10:48, Peter Cock wrote: > On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov > wrote: >> I suppose if a DNA sequence is kept as a simple Python string, there is >> no easy way to have it "circular". I am a beginner in Python (I use it only >> occasionally, to solve very specific and simple-minded tasks, when manual >> match/cut-and-paste operations become too much of a burden). Having >> spent an extra hour to hack out and debug a piece of code to match/extract >> to/from circular plasmid sequences kept as Python strings, I thought: hey, >> wait a minute, there is such thing as BioPython, which should have made >> this task so much easier... >> >> Is there a way to "enhance" the Seq object? (or may be I do not know what >> I am talking about...). >> >> thanks a lot for responding! >> >> with best wishes, >> >> Peter > > What I had in mind was a new class, CircularSeq, which would subclass > the current Biopython Seq object, and still use a string internally for the > sequence. > > We could then modify the slice behaviour so that, perhaps this would > by work wrapping the origin: > > c = CircularSeq('ACGTACGTACGT') > assert len(c)==12 > print c[10:14] > > It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat > 14 as wrapped to 2, returning the four bases GTAC. > > Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the > same as 'ACGTACGTACGT'[10:] which is the last two letters only. > This means anyone (or more importantly, any code) expecting the > string like behaviour will get a nasty surprise (or a bug). > > Another example, what about c[-2:]? For a plain string you'd > get the last two letters. For a circular sequence you might think > that should represent starting two before the origin, thus giving > the last two letter plus the whole sequence? 
> Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain python string that returns an empty string.
>
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
>
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
>
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
>
> Peter
>
> P.S. Please CC the mailing list in your replies :)

From p.j.a.cock at googlemail.com Tue Mar 8 08:24:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:24:07 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 12:12 PM, Peter Cherepanov wrote:
> Ideally, it would be an object where the last letter is hard-linked to the first. For example, we should be able to define:
>
> c = CircularSeq('ATGCGGGGA')
>
> where:
>
> c[1:9] equals ATGCGGGGA (or, more awkwardly, c[0:9], if the original
> Python string numbering must be retained for some reason)
> c[8:7] equals GAATGCGGG
> c[1:1] equals A (on a Python string it is c[0:1] = A, of course)
>
> Ideally, we would want to number such sequences from 1; after all, these
> are the kind of objects we deal with in biology.

Absolutely not - it would put the circular sequence completely
out of sync with the existing sequence objects in Biopython
and the Python string. Don't worry - you'll get used to zero
based counting, and the Python slicing is very beautiful once
you understand it.

> And, most importantly of all, it must be able to:
> c.find('GGAATG') to return "7"
>

Well, 6 in zero based counting, but yes, that would be the
expected result for find (and similarly for rfind). We'd also
need to do something with the split and rsplit methods to
include looking for matches over the origin.

Peter

From Leighton.Pritchard at scri.ac.uk Tue Mar 8 08:28:11 2011
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Tue, 8 Mar 2011 13:28:11 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID:

I've got 2p hanging around, so...

On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>> >> with best wishes, >> >> Peter > > What I had in mind was a new class, CircularSeq, which would subclass > the current Biopython Seq object, and still use a string internally for the > sequence. That seems sensible. The main issue, as I see it, is that the physical object is naturally represented by a circularly-linked list, and we have for circular sequences an indexing/co-ordinate system with a defined zero start/end point (which is essentially arbitrary - though is usually the origin of replication for bacterial chromosomes). This leads to a conflict between our natural expectations of Python indexing, and the meaning of the indexing on the physical object that's being represented. Whatever the ultimate implementation, there will either have to be a compromise between these two representations, or one or other view will be ignored. There will inevitably be value judgements that someone is unhappy with ;) > We could then modify the slice behaviour so that, perhaps this would > by work wrapping the origin: > > c = CircularSeq('ACGTACGTACGT') > assert len(c)==12 > print c[10:14] > > It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat > 14 as wrapped to 2, returning the four bases GTAC. That makes sense in Python indexing terms, but not in terms of the co-ordinate system for navigating the circular DNA. To be consistent with location information from GenBank and other sources where features wrap the origin of circular DNA, we would need c[10:2] to return the same result as c[10:14]. That gives us potentially the same problem as c[-2:2], as it currently returns an empty string. We'd have to modify Python slicing/indexing behaviour quite a bit to implement this 'naturally'. However, I don't think we should ignore the Python indexing format here, because we might want the ten bases after the base with co-ordinate 6 with c[6:6+10], which would give us a physically and conceptually sensible linear sequence that crosses the origin. 
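
A minimal sketch of the wrap-around slicing under discussion, written in plain Python rather than as a real Seq subclass. The helper name circular_slice and the rule used when stop <= start are assumptions made for illustration; nothing like this exists in Biopython, and the thread does not settle on a convention.

```python
def circular_slice(seq, start, stop):
    """Slice a circular sequence using zero-based, half-open coordinates.

    Assumed convention (hypothetical, one of several debated here):
    - if stop > start, return (stop - start) letters, wrapping over the
      origin as many times as needed;
    - otherwise the end is taken to lie past the origin, so that
      circular_slice(c, 10, 2) matches circular_slice(c, 10, 14).
    """
    n = len(seq)
    if n == 0:
        raise ValueError("cannot slice an empty circular sequence")
    if stop > start:
        length = stop - start
    else:
        length = (stop - start) % n
    first = start % n
    # Walk forward from the start position, wrapping with modular arithmetic.
    return "".join(seq[(first + i) % n] for i in range(length))


c = "ACGTACGTACGT"
print(circular_slice(c, 10, 14))  # GTAC, same letters as c[10:12] + c[0:2]
print(circular_slice(c, 10, 2))   # GTAC, GenBank-style wrap over the origin
print(circular_slice(c, 6, 16))   # GTACGTACGT, ten bases crossing the origin
```

Under this convention a request such as circular_slice(c, 0, 36) returns three concatenated copies of the sequence; capping the length at len(seq) would give the stricter behaviour instead.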
We'd probably want to do the obvious things with modular arithmetic, so that we don't return, say, three concatenated linearised circular sequences to a request like c[0:36] or c[6:42]. > Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the > same as 'ACGTACGTACGT'[10:] which is the last two letters only. > This means anyone (or more importantly, any code) expecting the > string like behaviour will get a nasty surprise (or a bug). I'm not sure it's wise to constrain functionality and adequate representation of a (very important! - showing my bacterial bias) physical structure to maintain that level of consistency with String. For instance, what would CircularSeq + Seq mean? Physically, and conceptually, not a lot. So we might want to deprecate the __add__ method for this object - not typical String behaviour but, in my opinion, appropriate. (You might remember that I was also generally not in favour of treating Seq objects as idealised Strings, so there's another bias for you ;) ) > Note that due to the way Python indexing works, single letter > access is fine for negative indices, c[-2] would give the second > last letter, 'G', which is consistent with wrapped counting back > from the origin. We could also make c[14] wrap round to c[2] in > this length 12 example (although there is a small risk of breaking > code expecting an IndexError in this case). I wouldn't be in favour that behaviour in a general sense, though I don't see how to avoid it cleanly. I think it would be best to be strict with indexing to the co-ordinate system to avoid possible degeneracy of feature locations. If we had a SNP at position 2, we could equally well associate it with any one of an infinite number of positions kl+2 where k is an integer and l is the sequence length, without modifying the computational result. 
I'm not keen on that kind of woolliness, but I think that it could possibly be avoided by modifying indexing to require at least one index that lies in the range [-l,l], and using modular arithmetic for slicing so that, for the example above, c[18:26] would not be treated as the valid slice c[6:14], but would instead throw an IndexError. > There would be lots of other things to implement, like "in" and the > find methods would need to check the substring across the origin. > Then (for nucleotides), we'd need to ensure reverse_complement > and complement also give a CircularSeq, likewise perhaps for the > transcribe and back_transcribe. Not to mention the other Biopython functions/methods that expect String-like indexing. Maybe a cast (of sorts) between CircularSeq and Seq would be useful for that, though I can imagine great problems, there. > The translate method is particularly > tricky as you can have an infinite reading frame, which might be > represented as a circular protein sequence? I would think that the test for that particular condition should be fairly straightforward (is there at least one stop codon in each of the six frames, taking into account the origin?). > All in all, it is quite a lot of work, and there are several tricky bits > where the desired behaviour is not clear cut. Could we come up > with something useful or not? I think that there's every possibility of coming up with something useful - the question is to what degree it fits the Biopython/Python idiom, or 'looks like' the physical object, and whether it gets included in Biopython. L. 
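
The cross-origin search mentioned earlier in the thread can be sketched in plain Python. The function name circular_find is hypothetical, coordinates are zero-based, and rejecting queries longer than the sequence is an assumption rather than a settled design.

```python
def circular_find(seq, sub):
    """Return the zero-based start of sub in circular seq, or -1.

    Assumption: a match may cross the origin at most once, so queries
    longer than the sequence itself are reported as not found.
    """
    n = len(seq)
    if not sub:
        return 0
    if len(sub) > n:
        return -1
    # Append just enough of the start of the sequence for a match to
    # run over the origin; any hit then starts at an index below n.
    extended = seq + seq[:len(sub) - 1]
    return extended.find(sub)


print(circular_find("ATGCGGGGA", "GGAATG"))  # 6 (7 in one-based counting)
```

This copies the sequence once, which is modest even for a bacterial genome; an rfind counterpart could follow the same pattern.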
--
Dr Leighton Pritchard MRSC
Plant Pathology Programme, SCRI (C block)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk
w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C
tel: No telephone during office refurbishment

From p.j.a.cock at googlemail.com Tue Mar 8 08:58:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:58:03 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References:
Message-ID:

On Tue, Mar 8, 2011 at 1:28 PM, Leighton Pritchard wrote:
> I've got 2p hanging around, so...
>
> On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
> wrote:
>>
>> What I had in mind was a new class, CircularSeq, which would subclass
>> the current Biopython Seq object, and still use a string internally for the
>> sequence.
>
> That seems sensible. The main issue, as I see it, is that the physical
> object is naturally represented by a circularly-linked list, and we have for
> circular sequences an indexing/co-ordinate system with a defined zero
> start/end point (which is essentially arbitrary - though is usually the
> origin of replication for bacterial chromosomes). This leads to a conflict
> between our natural expectations of Python indexing, and the meaning of the
> indexing on the physical object that's being represented.
>
> Whatever the ultimate implementation, there will either have to be a
> compromise between these two representations, or one or other view will be
> ignored. There will inevitably be value judgements that someone is unhappy
> with ;)

Indeed.

>> We could then modify the slice behaviour so that, perhaps this would
>> by work wrapping the origin:
>>
>> c = CircularSeq('ACGTACGTACGT')
>> assert len(c)==12
>> print c[10:14]
>>
>> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
>> 14 as wrapped to 2, returning the four bases GTAC.
>
> That makes sense in Python indexing terms, but not in terms of the
> co-ordinate system for navigating the circular DNA. To be consistent with
> location information from GenBank and other sources where features wrap the
> origin of circular DNA, we would need c[10:2] to return the same result as
> c[10:14]. That gives us potentially the same problem as c[-2:2], as it
> currently returns an empty string. We'd have to modify Python
> slicing/indexing behaviour quite a bit to implement this 'naturally'.
>
> However, I don't think we should ignore the Python indexing format here,
> because we might want the ten bases after the base with co-ordinate 6 with
> c[6:6+10], which would give us a physically and conceptually sensible linear
> sequence that crosses the origin.

I think we agree that c[10:14] and c[10:10+4] should give the four
bases GTAC wrapping the origin when c is circular sequence
ACGTACGTACGT, equivalently c[10:12] + c[0:2] using Python slicing.

Likewise for your example c[6:6+10] or c[6:16] this should give ten
bases wrapping the origin, equivalently c[6:12] + c[0:4] using Python
slicing.

> We'd probably want to do the obvious things with modular arithmetic, so that
> we don't return, say, three concatenated linearised circular sequences to a
> request like c[0:36] or c[6:42].

I disagree, returning the three concatenated linearised circular
sequences is what I would expect. This is one of the debatable issues
that will divide people.

Consider the (special and artificial) case of a circular plasmid with
an ORF wrapping round the origin (once, twice or infinitely). There the
ORF sequence is longer than the linearised plasmid, so slicing with
concatenation would be useful.

e.g.
http://www.ncbi.nlm.nih.gov/pubmed/9740124
Perriman and Ares (1998), Circular mRNA can direct translation of
extremely long repeating-sequence proteins in vivo.

and:
http://dx.doi.org/10.1385/1-59259-280-5:069
Perriman (2002), Circular mRNA Encoding for Monomeric and Polymeric
Green Fluorescent Protein

(Very cool work)

>> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
>> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
>> This means anyone (or more importantly, any code) expecting the
>> string like behaviour will get a nasty surprise (or a bug).
>
> I'm not sure it's wise to constrain functionality and adequate
> representation of a (very important! - showing my bacterial bias) physical
> structure to maintain that level of consistency with String.
> For instance,
> what would CircularSeq + Seq mean? Physically, and conceptually, not a lot.
> So we might want to deprecate the __add__ method for this object - not
> typical String behaviour but, in my opinion, appropriate.

We'd probably want to make addition of CircularSeq + Seq raise
a TypeError. Or, do a linearisation and simple addition with a
warning?

> (You might remember that I was also generally not in favour of treating
> Seq objects as idealised Strings, so there's another bias for you ;) )

I recall :)

>> Note that due to the way Python indexing works, single letter
>> access is fine for negative indices, c[-2] would give the second
>> last letter, 'G', which is consistent with wrapped counting back
>> from the origin. We could also make c[14] wrap round to c[2] in
>> this length 12 example (although there is a small risk of breaking
>> code expecting an IndexError in this case).
>
> I wouldn't be in favour that behaviour in a general sense, though I don't
> see how to avoid it cleanly. I think it would be best to be strict with
> indexing to the co-ordinate system to avoid possible degeneracy of feature
> locations. If we had a SNP at position 2, we could equally well associate
> it with any one of an infinite number of positions kl+2 where k is an
> integer and l is the sequence length, without modifying the computational
> result.

Yes, I was suggesting we could make c[x+n*length] act as c[x],
i.e. for *single* indexes which return one letter, apply the
modulo arithmetic.

Or, we leave this to follow the current Python string behaviour
where if the index is equal to the length or more, you get
an IndexError.
That avoids the ambiguity ;)

> I'm not keen on that kind of woolliness, but I think that it could
> possibly be avoided by modifying indexing to require at least one index that
> lies in the range [-l,l], and using modular arithmetic for slicing so that,
> for the example above, c[18:26] would not be treated as the valid slice
> c[6:14], but would instead throw an IndexError.

This depends on the treatment of things like c[0:36] or c[6:42]
discussed above (return 36 bases, or just 12?).

>> There would be lots of other things to implement, like "in" and the
>> find methods would need to check the substring across the origin.
>> Then (for nucleotides), we'd need to ensure reverse_complement
>> and complement also give a CircularSeq, likewise perhaps for the
>> transcribe and back_transcribe.
>
> Not to mention the other Biopython functions/methods that expect String-like
> indexing. Maybe a cast (of sorts) between CircularSeq and Seq would be
> useful for that, though I can imagine great problems, there.

Having a toseq method like the MutableSeq does could handle that,
returning a traditional linear Seq object. If the CircularSeq 'breaks'
too much expected string-like behaviour that would be important.

>> The translate method is particularly
>> tricky as you can have an infinite reading frame, which might be
>> represented as a circular protein sequence?
>
> I would think that the test for that particular condition should be fairly
> straightforward (is there at least one stop codon in each of the six frames,
> taking into account the origin?).

Having thought about this example at length before, it can be done
but I don't think it is all that straightforward ;)

>> All in all, it is quite a lot of work, and there are several tricky bits
>> where the desired behaviour is not clear cut. Could we come up
>> with something useful or not?
> > I think that there's every possibility of coming up with something useful - > the question is to what degree it fits the Biopython/Python idiom, or 'looks > like' the physical object, and whether it gets included in Biopython. > > L. Agreed. Peter From anaryin at gmail.com Tue Mar 8 16:39:07 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 8 Mar 2011 22:39:07 +0100 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com> Message-ID: Back to this question. Haven't had much time to look at it and it turned out to be a bit more complicated than what I thought. Permissive is an attribute of the PDBParser module and since the assignment takes place in the Atom module I don't see a straightforward way of pulling this off. However, and although there is the very simple solution of playing with the warnings module, the solution I offer is to allow a second level of "permissiveness" (PERMISSIVE=2) where all warnings are supressed. Cheers, J From laserson at mit.edu Tue Mar 8 22:07:54 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 8 Mar 2011 22:07:54 -0500 Subject: [Biopython] SeqRecord subclassing or composition Message-ID: I am trying to implement a data type for my work. Each object will have a sequence (derived from a single read) and lots of annotations and features. However, I want to implement some extra interface that is problem-specific to make my analysis more convenient. I am debating whether to subclass SeqRecord and simply implement the extra interface or define a new object that wraps a SeqRecord object and pass on the subset of native SeqRecord calls and/or simply access the underlying SeqRecord directly. One additional factor is that I want to be able to read/write INSDC-style files for the data (e.g., GenBank). 
Therefore, if I use the SeqIO parser, it will return native SeqRecords.
If I go the inheritance route, how do I cast a SeqRecord object to my
new subclass?

So, I am debating between inheritance

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        SeqRecord.__init__(self,*args,**kw)
        # But how do I cast a SeqRecord to an ImmuneChain?

or composition

class ImmuneChain(object):
    def __init__(self, *args, **kw):
        if isinstance(args[0],SeqRecord):
            self._record = args[0]
        else:
            # Initialize the underlying SeqRecord manually
            self._record.seq = ...

Any thoughts?

Thanks!
Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com Wed Mar 9 04:04:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 09:04:26 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To:
References:
Message-ID:

On Wed, Mar 9, 2011 at 3:07 AM, Uri Laserson wrote:
> I am trying to implement a data type for my work. Each object will have a
> sequence (derived from a single read) and lots of annotations and features.
> However, I want to implement some extra interface that is problem-specific
> to make my analysis more convenient.
>
> I am debating whether to subclass SeqRecord and simply implement the extra
> interface or define a new object that wraps a SeqRecord object and pass on
> the subset of native SeqRecord calls and/or simply access the underlying
> SeqRecord directly.
>
> One additional factor is that I want to be able to read/write INSDC-style
> files for the data (e.g., GenBank). Therefore, if I use the SeqIO parser,
> it will return native SeqRecords. If I go the inheritance route, how do I
> cast a SeqRecord object to my new subclass?
There is (currently at least) no option in SeqIO parse/read to
override the use of the SeqRecord object. So you'd need code
to 'upgrade' a SeqRecord into your class. Probably the simplest
route would be for its __init__ method to take a single argument
(a SeqRecord). Then you could have:

def my_parse(...):
    for seq_record in SeqIO.parse(...):
        yield MyClass(seq_record)

def my_read(...):
    return MyClass(SeqIO.read(...))

etc

> So, I am debating between inheritance
>
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         SeqRecord.__init__(self,*args,**kw)
>         # But how do I cast a SeqRecord to an ImmuneChain?

Unless you modify the methods/attributes too much, an
ImmuneChain subclass of SeqRecord should be usable
as is with SeqIO.write etc. You don't need to 'cast'.

Also note the above __init__ method can be more specific,
you might have say 10 init args for ImmuneChain, only some
of which you pass to the SeqRecord init. You could even
have a single __init__ argument of a SeqRecord, and copy
all its attributes.

> or composition
>
> class ImmuneChain(object):
>     def __init__(self, *args, **kw):
>         if isinstance(args[0],SeqRecord):
>             self._record = args[0]
>         else:
>             # Initialize the underlying SeqRecord manually
>             self._record.seq = ...

With the above approach you'd have to pass the private record
to SeqIO.write etc (anything which needs a SeqRecord). That
could be done inside methods of the ImmuneChain object (e.g.
you could expose the format method of the SeqRecord).

>
> Any thoughts?
>

You could alternatively go for a procedural style where you
write your code as functions taking SeqRecord objects
(perhaps expecting particular information in the annotation).
Peter From komalsnehal1991 at gmail.com Wed Mar 9 05:49:23 2011 From: komalsnehal1991 at gmail.com (Komal S) Date: Wed, 9 Mar 2011 02:49:23 -0800 Subject: [Biopython] ::Biopython Project Message-ID: Hi everyone, I'm Komal, a Junior Undergraduate Student from India studying Bioengineering. I'm a fan of Python and I love Computational Biology and I plan to do my further studies in the same. I went through the projects on the Biopython page. I was very much interested in the RNA Structure project mentioned. Any contribution which I make will help me a lot and the organisation too. In fact, I am currently doing a project on RNA Editing. I'll be very happy to integrate my knowledge. In fact, I have been trying to contact people on #obf-soc IRC. I think there is no separate IRC for Biopython. Please help me on how I should proceed. Komal From laserson at mit.edu Wed Mar 9 10:28:22 2011 From: laserson at mit.edu (Uri Laserson) Date: Wed, 9 Mar 2011 10:28:22 -0500 Subject: [Biopython] SeqRecord subclassing or composition In-Reply-To: References: Message-ID: > > Unless you modify the methods/atttributes too much, a > ImmuneChain subclass of SeqRecord should be usable > as is with SeqIO.write etc. You don't need to 'cast'. > I'm more worried about parsing than writing. As you mentioned, I will have to upgrade my SeqRecord object to an ImmuneChain object. So maybe the best approach is a combination of the two code snippets I included. 
It would subclass SeqRecord, and then manually check whether I am
initializing with a pre-existing SeqRecord or just data:

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        if isinstance(args[0],SeqRecord):
            # if initializing with SeqRecord, then manually transfer the data
            # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
            record = args[0]
            SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
                     description=record.description, dbxrefs=record.dbxrefs,
                     features=record.features, annotations=record.annotations,
                     letter_annotations=record.letter_annotations)
        else:
            # assume I'm initializing just like a regular SeqRecord:
            SeqRecord.__init__(self, *args, **kw)

        # Finally, I perform any problem-specific additional initializations
        # here.
        pass

Does this seem like a good solution?

Also, do you think that it would make sense to make a deep copy of the
SeqRecord object before I use it to initialize the ImmuneChain?

Uri

From p.j.a.cock at googlemail.com Wed Mar 9 10:32:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 15:32:50 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To:
References:
Message-ID:

On Wed, Mar 9, 2011 at 3:28 PM, Uri Laserson wrote:
>> Unless you modify the methods/atttributes too much, a
>> ImmuneChain subclass of SeqRecord should be usable
>> as is with SeqIO.write etc. You don't need to 'cast'.
>
> I'm more worried about parsing than writing. As you mentioned, I will have
> to upgrade my SeqRecord object to an ImmuneChain object.
> So maybe the best approach is a combination of the two code snippets I
> included. It would subclass SeqRecord, and then manually check whether I am
> initializing with a pre-existing SeqRecord or just data:
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         if isinstance(args[0],SeqRecord):
>             # if initializing with SeqRecord, then manually transfer the data
>             # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
>             record = args[0]
>             SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
>                      description=record.description, dbxrefs=record.dbxrefs,
>                      features=record.features, annotations=record.annotations,
>                      letter_annotations=record.letter_annotations)
>         else:
>             # assume I'm initializing just like a regular SeqRecord:
>             SeqRecord.__init__(self, *args, **kw)
>
>         # Finally, I perform any problem-specific additional initializations
>         # here.
>         pass
> Does this seem like a good solution?

I think it will work,

> Also, do you think that it would make sense to make a deep copy of the
> SeqRecord object before I use it to initialize the ImmuneChain?

Assuming you will be discarding the original SeqRecord, then I see
no reason to make a deep copy. It will just slow things down.

Peter

From jvb at Cs.Nott.AC.UK Wed Mar 9 10:33:28 2011
From: jvb at Cs.Nott.AC.UK (Jonathan Blakes)
Date: Wed, 09 Mar 2011 15:33:28 +0000
Subject: [Biopython] back-translation method for Seq object?
Message-ID: <4D779DC8.8090704@cs.nott.ac.uk>

This is a reply to an old thread (October 2008), but I thought someone
might find it useful.

In that thread, discussing the representation of back-translations using
ambiguous bases to avoid the factorial explosion of an all possibilities
back-translation, Bruce Southey gave a table similar to the one below but
some of the ambiguous codons were incorrect or the ambiguous codons were
too ambiguous and covered more than one amino acid. The codons for stop (*)
were also missing. Some were corrected later in the thread but not all.
Here are the correct ambiguous codons for the standard genetic code:

* = TAG, TAA, TGA = TAR, TGA
A = GCT, GCC, GCA, GCG = GCN
C = TGT, TGC = TGY
D = GAT, GAC = GAY
E = GAA, GAG = GAR
F = TTT, TTC = TTY
G = GGT, GGC, GGA, GGG = GGN
H = CAT, CAC = CAY
I = ATT, ATC, ATA = ATH
K = AAA, AAG = AAR
L = TTA, TTG, CTT, CTC, CTA, CTG = TTR, CTN
M = ATG = ATG
N = AAT, AAC = AAY
P = CCT, CCC, CCA, CCG = CCN
Q = CAA, CAG = CAR
R = CGT, CGC, CGA, CGG, AGA, AGG = CGN, AGR
S = TCT, TCC, TCA, TCG, AGT, AGC = TCN, AGY
T = ACT, ACC, ACA, ACG = ACN
V = GTT, GTC, GTA, GTG = GTN
W = TGG = TGG
Y = TAT, TAC = TAY

Even though this is still not a one-to-one mapping in 4/21 cases the
factorial explosion is significantly decreased. For example, the protein
ACDEFGHIKLMNPQRSTVWY* has 1,019,215,872 unambiguous back-translations.
Using the code above it has 16, or generally 2^(L+R+S+*).

If anyone has an algorithm for determining the set of non-overlapping
ambiguous codons from any codon table I would like to know.

Thanks,

Jon

--
Jonathan Blakes
School of Computer Science
University of Nottingham

From rasi at seas.harvard.edu Wed Mar 9 17:57:30 2011
From: rasi at seas.harvard.edu (Arvind Subramaniam)
Date: Wed, 9 Mar 2011 17:57:30 -0500
Subject: [Biopython] .ab1 file parser in biopython?
Message-ID:

Hi

I am new to biopython so please excuse me if this issue is obviously
simple. I am trying to parse .ab1 sequencing trace files in Biopython and
I cannot find the right module or method to do this job. Can someone
suggest how I can parse .ab1 files?

Thanks,
Arvind.

From cmckay at u.washington.edu Wed Mar 9 20:09:55 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Wed, 9 Mar 2011 17:09:55 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID:

Hello all. Biopython continues to be a lifesaver.

I'm trying to get the "raw" genbank locations for a downstream
application after parsing a genbank file. Is there any way to get at
this (or reproduce it)?
As it is, the SeqRecord feature has start and stop information for the whole feature, and a list of sub-features each with it's own start and stops. I'm looking for one concise text string the describes the entire feature location, much like the original raw genbank locations do. I searched the archives, but nothing popped into view. Thanks for your help! best, Cedar From chapmanb at 50mail.com Wed Mar 9 21:05:45 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 9 Mar 2011 21:05:45 -0500 Subject: [Biopython] "raw" genbank locations? In-Reply-To: References: Message-ID: <20110310020545.GA2185@kunkel> Cedar; Glad to hear Biopython has been helping out with your work. > I'm trying to get the "raw" genbank locations for a downstream > application after parsing a genbank file. Is there any way to get at > this (or reproduce it)? As it is, the SeqRecord feature has start and > stop information for the whole feature, and a list of sub-features > each with it's own start and stops. I'm looking for one concise text > string the describes the entire feature location, much like the > original raw genbank locations do. You can do this with the GenBank RecordParser, which doesn't parse the location strings: >>> from Bio.GenBank import RecordParser >>> parser = RecordParser() >>> handle = open("NT_019265.gb") >>> rec = parser.parse(handle) >>> for f in rec.features: ... print f.location ... 1..1250660 1..3290 215902..365470 217508 join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092) If you have SeqRecord objects from SeqIO you can do this in a ugly way by reaching into the internals of the GenBank writer: >>> from Bio import SeqIO >>> from Bio.SeqIO import InsdcIO >>> handle = open("NT_019265.gb") >>> for rec in SeqIO.parse(handle, "genbank"): ... for f in rec.features: ... 
print InsdcIO._insdc_feature_location_string(f, len(rec.seq)) ... 1..1250660 1..3290 215902..365470 217508 join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092) That might work for a quick hack but is not necessarily future proof is the internal change. Peter, do you think this would be useful to expose as a function of a SeqFeature directly, so you could do feature.insdc_string() or something similar? Brad
From p.j.a.cock at googlemail.com Thu Mar 10 03:57:20 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 08:57:20 +0000 Subject: [Biopython] "raw" genbank locations? In-Reply-To: <20110310020545.GA2185@kunkel> References: <20110310020545.GA2185@kunkel> Message-ID: On Thu, Mar 10, 2011 at 2:05 AM, Brad Chapman wrote: > Cedar; > Glad to hear Biopython has been helping out with your work. > >> I'm trying to get the "raw" genbank locations for a downstream >> application after parsing a genbank file. Is there any way to get at >> this (or reproduce it)? As it is, the SeqRecord feature has start and >> stop information for the whole feature, and a list of sub-features >> each with it's own start and stops.
I'm looking for one concise text
>> string the describes the entire feature location, much like the
>> original raw genbank locations do.
>
> You can do this with the GenBank RecordParser, which doesn't parse
> the location strings:
>
>>>> from Bio.GenBank import RecordParser
>>>> parser = RecordParser()
>>>> handle = open("NT_019265.gb")
>>>> rec = parser.parse(handle)
>>>> for f in rec.features:
> ...     print f.location
> ...
>
>
> If you have SeqRecord objects from SeqIO you can do this in a ugly
> way by reaching into the internals of the GenBank writer:
>
>>>> from Bio import SeqIO
>>>> from Bio.SeqIO import InsdcIO
>>>> handle = open("NT_019265.gb")
>>>> for rec in SeqIO.parse(handle, "genbank"):
> ...     for f in rec.features:
> ...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
> ...
>
>
> That might work for a quick hack but is not necessarily future proof
> is the internal change. Peter, do you think this would be useful to
> expose as a function of a SeqFeature directly, so you could do
> feature.insdc_string() or something similar?

A couple of people have asked for this, and since adding SeqIO output in GenBank/EMBL format (the code you refer to in InsdcIO) this would be very possible... the issue holding me back is the annoying special case(s) requiring to know the parent sequence's length. The problem is that currently the SeqFeature doesn't have this information - it doesn't have any link back to a parent SeqRecord (and indeed it doesn't even have to be created in the context of a SeqRecord).

Perhaps we can handle the case of between features N^1 on circular sequences of length N differently, maybe with a dedicated SeqFeature location class which would tell us it was at the origin? Then we'd be able to avoid the need to know the parent length.
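To make the special case concrete, here is a minimal pure-Python sketch of why the parent length is needed. It does not use Biopython, and the function name between_location_string is hypothetical (it is not an existing API). It assumes the usual INSDC notation, where a site between two adjacent bases is written n^n+1, and on a circular molecule of length N the site between the last and first base is written N^1.

```python
def between_location_string(position, parent_length):
    """Format a zero-length 'between' site in INSDC notation.

    `position` is the one-based number of the base just before the site,
    so 5 means "between base 5 and base 6". Hypothetical helper for
    illustration only, not part of Biopython.
    """
    if not 0 < position <= parent_length:
        raise ValueError("position must lie within the parent sequence")
    if position == parent_length:
        # Site between the last and the first base of a circular
        # sequence: this branch is the one that needs the parent length.
        return "%i^1" % parent_length
    return "%i^%i" % (position, position + 1)


ordinary = between_location_string(5, 100)
at_origin = between_location_string(100, 100)
print(ordinary)   # 5^6
print(at_origin)  # 100^1
```

Only the at-origin branch consults parent_length, which is the reason a dedicated "at the origin" location class would let an orphan feature format itself.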
Once that is resolved, an orphan SeqFeature could generate its own INSDC (GenBank/EMBL) location string without needing any extra information, and exposing this as an object method would be fine.

Peter

P.S. If we ever add a CircularSeq object - see other thread - then SeqFeature locations spanning the origin might need reworking too.

From p.j.a.cock at googlemail.com Thu Mar 10 04:00:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 09:00:51 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To:
References:
Message-ID:

On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam wrote:
> Hi
> I am new to biopython so please excuse me if this issue is obviously
> simple. I am trying to parse .ab1 sequencing trace files in Biopython
> and I cannot find the right module or method to do this job. Can
> someone suggest how I can parse .ab1 files?
> Thanks,
> Arvind.

You mean the ABI trace file format for capillary sequencing?

Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner if I want to recall the bases (the ABI software doesn't always do the best possible calling job).

http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
http://sourceforge.net/projects/tracetuner/

Peter

From chapmanb at 50mail.com Thu Mar 10 06:06:48 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 10 Mar 2011 06:06:48 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To:
References: <20110310020545.GA2185@kunkel>
Message-ID: <20110310110648.GA2302@kunkel>

Peter;
> > do you think this would be useful to
> > expose as a function of a SeqFeature directly, so you could do
> > feature.insdc_string() or something similar?
>
> A couple of people have asked for this, and since adding SeqIO
> output in GenBank/EMBL format (the code you refer to in InsdcIO)
> this would be very possible... the issue holding me back is the
> annoying special case(s) requiring to know the parent sequence's
> length.
The problem is that currently the SeqFeature doesn't > have this information - it doesn't have any link back to a parent > SeqRecord (and indeed it doesn't even have to be created in > the context of a SeqRecord). > > Perhaps we can handle the case of between features N^1 on > circular sequences of length N differently, maybe with a dedicated > SeqFeature location class which would tell us it was at the origin? > Then we'd be able to avoid the need to know the parent length. This is a great idea; makes sense to treat this as a special case since that's what it is. Another simple way would be to put the function on the SeqRecord class and call it with: rec.insdc_feature_string(feature); this places the responsibility of knowing the parent back on the library user. > P.S. If we ever add a CircularSeq object - see other thread- then > SeqFeature locations spanning the origin might need reworking > too. Makes sense. We can get the 99% of standard cases working now and then re-circle back on this once someone gets up the guts to tackle CircularSeq. Brad From p.j.a.cock at googlemail.com Thu Mar 10 06:52:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 11:52:48 +0000 Subject: [Biopython] "raw" genbank locations? In-Reply-To: <20110310110648.GA2302@kunkel> References: <20110310020545.GA2185@kunkel> <20110310110648.GA2302@kunkel> Message-ID: On Thu, Mar 10, 2011 at 11:06 AM, Brad Chapman wrote: > Peter; > >> > do you think this would be useful to >> > expose as a function of a SeqFeature directly, so you could do >> > feature.insdc_string() or something similar? >> >> A couple of people have asked for this, and since adding SeqIO >> output in GenBank/EMBL format (the code you refer to in InsdcIO) >> this would be very possible... the issue holding me back is the >> annoying special case(s) requiring to know the parent sequence's >> length. 
The problem is that currently the SeqFeature doesn't >> have this information - it doesn't have any link back to a parent >> SeqRecord (and indeed it doesn't even have to be created in >> the context of a SeqRecord). >> >> Perhaps we can handle the case of between features N^1 on >> circular sequences of length N differently, maybe with a dedicated >> SeqFeature location class which would tell us it was at the origin? >> Then we'd be able to avoid the need to know the parent length. > > This is a great idea; makes sense to treat this as a special case > since that's what it is. It is probably the most elegant solution without a big refactor. > Another simple way would be to put the > function on the SeqRecord class and call it with: > rec.insdc_feature_string(feature); this places the responsibility of > knowing the parent back on the library user. Yes, that would be simple. But don't we sometimes want to use 'orphan' SeqFeature objects (without a SeqRecord parent)? I'm thinking here about GFF3 files and the like. >> P.S. If we ever add a CircularSeq object - see other thread- then >> SeqFeature locations spanning the origin might need reworking >> too. > > Makes sense. We can get the 99% of standard cases working now and > then re-circle back on this once someone gets up the guts to tackle > CircularSeq. :) Peter From rmb32 at cornell.edu Thu Mar 10 12:15:41 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 10 Mar 2011 12:15:41 -0500 Subject: [Biopython] update Google Summer of Code project ideas Message-ID: <4D79073D.3090603@cornell.edu> Hi all, Please make sure the BioJava information is up to date for 2011 on both the OBF and BioJava wikis. Eric has done some work on it, but the current page has not been completely updated to reflect that it's 2011 and we're applying again. 
OBF wiki page: http://www.open-bio.org/wiki/Google_Summer_of_Code BioPython wiki: http://biopython.org/wiki/Google_Summer_of_Code Rob ---- Robert Buels (prospective) 2011 OBF GSoC Organization Admin From anaryin at gmail.com Thu Mar 10 12:25:04 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 10 Mar 2011 18:25:04 +0100 Subject: [Biopython] update Google Summer of Code project ideas In-Reply-To: <4D79073D.3090603@cornell.edu> References: <4D79073D.3090603@cornell.edu> Message-ID: I updated the date and added the project from last year to the page, to show we got another funded project. Cheers, J From p.j.a.cock at googlemail.com Thu Mar 10 12:42:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 17:42:58 +0000 Subject: [Biopython] Bugzilla -> Redmine migration Message-ID: Hi all, Anyone who has tried to file a bug recently will have noticed a big red message "Sorry, entering bugs into the product Biopython has been disabled." The reason for this is the OBF team are about to move us (and all the other Bio* projects using Bugzilla) to a Redmine server instead. See http://www.redmine.org/ I expect this to be completed in the next few days (with all the old bugs and accounts carried across). Hopefully this will include integration with our git repository as well. We'll make an announcement once it is ready, in the mean time, any new bugs could be emailed to the mailing list as a short term measure. Peter From laserson at mit.edu Thu Mar 10 13:22:42 2011 From: laserson at mit.edu (Uri Laserson) Date: Thu, 10 Mar 2011 13:22:42 -0500 Subject: [Biopython] .ab1 file parser in biopython? In-Reply-To: References: Message-ID: I also found the following code lying around somewhere. I copied it into one of my repositories: https://github.com/laserson/pytools/blob/master/ab1.py "Python implementation of an ABIF file reader according to Applied Biosystems' specificatons" as specified in March 2007, it appears. 
...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

On Thu, Mar 10, 2011 at 04:00, Peter Cock wrote:
> On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam
> wrote:
> > Hi
> > I am new to biopython so please excuse me if this issue is obviously
> > simple. I am trying to parse .ab1 sequencing trace files in Biopython
> > and I cannot find the right module or method to do this job. Can
> > someone suggest how I can parse .ab1 files?
> > Thanks,
> > Arvind.
>
> You mean the ABI trace file format for capillary sequencing?
>
> Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner
> if I want to recall the bases (the ABI software doesn't always to the
> best possible calling job).
>
> http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html
> http://sourceforge.net/projects/tracetuner/
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Thu Mar 10 13:37:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 18:37:04 +0000
Subject: [Biopython] .ab1 file parser in biopython?
In-Reply-To:
References:
Message-ID:

On Thu, Mar 10, 2011 at 6:22 PM, Uri Laserson wrote:
> I also found the following code lying around somewhere. I copied it into
> one of my repositories:
>
> https://github.com/laserson/pytools/blob/master/ab1.py
>
> "Python implementation of an ABIF file reader according to Applied
> Biosystems' specificatons" as specified in March 2007, it appears.
>

It's under the GPL license. If you contacted the named author, Francis Wolinski, and he was willing to re-licence for Biopython to use, then we could consider incorporating it.
Alternatively it shouldn't be too hard to reimplement it from scratch based on the published specification (and go one step further and consider output too).

http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf

Note some care would be needed to work on Python 3, but we can follow the example of our SFF parser here.

Is there actually a need for this though? As I said before, for my own needs getting the ABI file into FASTQ format (or FASTA+QUAL) has sufficed.

Peter

From cmckay at u.washington.edu Thu Mar 10 16:51:42 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 10 Mar 2011 13:51:42 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID:

Great! InsdcIO._insdc_feature_location_string was just what I needed. I was actually on the right track, trying to figure out how SeqIO wrote locations in genbank format, but your email arrived soon enough that I didn't have to finish the job. I realize this is a private method, so I would like an official way to do this.

Thanks so much guys, as usual, awesome service!

Cedar

From laserson at mit.edu Thu Mar 10 17:07:46 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 17:07:46 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
Message-ID:

Say I have a SeqRecord called A and a SeqRecord called B. A has a bunch of SeqFeatures associated with it, while B has none. I perform a gapped alignment between the two sequences. Now I want to copy the SeqFeatures from A onto B in a way that respects the coordinates of all the features.
For example (and please use a fixed-width font for this):

         0                       1
         0 1 2 3 4     5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
         FEATURE_1               FEATURE_2
         X X X X X     X X X     X X X X X X X X X X
A  - - - a c g g t - - a c a g a c g t g a t a c g
         | | | | |     | | |   | | |   | | | | | |
B  a a a a c g g t g g a c a t a c g - g a t a c g
   0                   1                     2
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6   7 8 9 0 1 2 3

In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and (10,19), respectively. Now I want to copy it to sequence B, where the feature coords should instead be (3,12) and (15,23).

Is there an easy way to do this in biopython already? Or are there any ideas for an elegant solution?

Thanks!
Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com Thu Mar 10 17:46:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 22:46:32 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To:
References:
Message-ID:

On Thu, Mar 10, 2011 at 10:07 PM, Uri Laserson wrote:
> Say I have a SeqRecord called A and a SeqRecord called B. A has a bunch of
> SeqFeatures associated with it, while B has none. I perform a gapped
> alignment between the two sequences. Now I want to copy the SeqFeatures
> from A onto B in a way that respects the coordinates of all the features.
>
> For example (and please use a fixed-width font for this):
>

I'm not quite sure I followed that figure.

> In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and
> (10,19), respectively. Now I want to copy it to sequence B, where the
> feature coords should instead be (3,12) and (15,23).
>
> Is there an easy way to do this in biopython already?

No, but I'm not sure how advisable it is anyway (if I have understood you right - see below).
> Or are there any ideas for an elegant solution? I actually wanted to do something similar to this myself. I had a draft genome I had annotated in GenBank format. We did some more sequencing and/or I tweaked the assembly, and I had a new very similar sequence in a FASTA file, and I wanted to copy the old annotation over. What I did was look for perfect matches between the regions spanned by the features (no introns in this case), and that meant all I needed to do was apply a shift to the SeqFeature location. There is a (private) method _shift which helped here (written for use in slicing a SeqRecord). In my case, that handled most of the annotation, and I did the nasty cases by hand (since I wanted to examine what had happened in the new assembly - it was a small genome). In your case the start and end co-ordinates may be shifted by different amounts (since you are doing gapped alignments). This worries me as the length of your features can change. For any gene or CDS features that is a problem (frame shifts). Have you thought about that? Perhaps you're dealing with non-coding features only? Peter From p.j.a.cock at googlemail.com Fri Mar 11 04:53:16 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Mar 2011 09:53:16 +0000 Subject: [Biopython] Transferring SeqFeatures between aligned sequences In-Reply-To: References: Message-ID: On Thu, Mar 10, 2011 at 11:25 PM, Uri Laserson wrote: >> I'm not quite sure I followed that figure. > > I think you understood perfectly. Good - your text was clearer for me. >> In your case the start and end co-ordinates may be shifted >> by different amounts (since you are doing gapped alignments). >> This worries me as the length of your features can change. >> For any gene or CDS features that is a problem (frame shifts). >> Have you thought about that? Perhaps you're dealing with >> non-coding features only? > > That's exactly the complication here. 
I have one reference sequence that is
> highly annotated, and I have a read that I want to align to it and transfer
> over the annotations to the corresponding positions.

OK - and do you want to worry about spotting frameshifts, and updating the translation for CDS features?

> One way I can handle this situation is that when I actually build the
> pairwise gapped alignment (which I do manually), in addition to the actual
> gapped-sequence strings, I can generate two lists that contain the ungapped
> coordinates of each sequence (in my diagram, this is the numbering above and
> below). Figuring out the new coords from the old coordinates is then a
> matter of matching the positions in the lists. (Though perhaps it's easier
> to implement using dictionaries, so I don't have to search the lists I
> generated.)

Yes, that kind of technique is also useful for mapping between gapped and ungapped coordinates in assembly files.

> Either way, in order to move the SeqFeature to the new sequence, should I
> make a deep copy of it and then manually modify the start and end coords?
> Uri

You could do, or create a new SeqFeature, or "steal" the old one and modify it. The latter technique would probably be fastest since there are no new objects to create, just a few integer attribute changes (location positions), but is perhaps a bit risky if you don't comment it clearly. If you do that, perhaps do this by popping the features from the old SeqRecord's feature list, modifying them, and adding them to the new SeqRecord's feature list.

If all your current annotation uses simple exact locations, life is easier. If there are fuzzy locations, then using the location object's private _shift method might be simplest.

Another query, are you going to look for inversions? In such cases the strand needs flipping and the start/end interchanged. The SeqRecord reverse complement method has to do this, and therefore the SeqFeature and its location and position classes all have a private _flip method.
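The dictionary approach to the coordinate transfer discussed above could look roughly like the following pure-Python sketch. It is only an illustration under stated assumptions: the function names ungapped_map and transfer_interval are made up for this example (they are not Biopython functions), coordinates are zero-based and half-open as in Python slicing, the gap character is "-", and the two gapped strings are the ones from the figure earlier in the thread. It ignores strand, fuzzy positions and frameshifts.

```python
def ungapped_map(gapped_a, gapped_b, gap="-"):
    """Map ungapped positions in A to ungapped positions in B.

    Only columns where both sequences have a residue are recorded, so
    positions of A that face a gap in B are simply absent from the dict.
    """
    if len(gapped_a) != len(gapped_b):
        raise ValueError("aligned strings must have the same length")
    a_to_b = {}
    a_pos = b_pos = 0
    for char_a, char_b in zip(gapped_a, gapped_b):
        if char_a != gap and char_b != gap:
            a_to_b[a_pos] = b_pos
        if char_a != gap:
            a_pos += 1
        if char_b != gap:
            b_pos += 1
    return a_to_b


def transfer_interval(start, end, a_to_b):
    """Translate a half-open interval on A into one on B.

    Returns None if no position of the interval is aligned to a residue.
    """
    aligned = [a_to_b[i] for i in range(start, end) if i in a_to_b]
    if not aligned:
        return None
    # The map is monotonic, so the first and last hits are the extremes.
    return aligned[0], aligned[-1] + 1


gapped_a = "---acggt--acagacgtgatacg"
gapped_b = "aaaacggtggacatacg-gatacg"
a_to_b = ungapped_map(gapped_a, gapped_b)

feature_1 = transfer_interval(0, 7, a_to_b)
feature_2 = transfer_interval(10, 19, a_to_b)
print(feature_1)  # (3, 12)
print(feature_2)  # (15, 23)
```

Note that the first feature spans 7 bases on A but 9 on B, because of the two-base insertion in B. That is exactly the length change (and potential frameshift) raised above, so a real implementation would need to decide how to treat it.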
[If you find these private methods useful, perhaps we can make them public? Let us know]

Thanks,

Peter

From thamelry at binf.ku.dk Fri Mar 11 08:08:55 2011
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 11 Mar 2011 14:08:55 +0100
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To:
References: <4D79073D.3090603@cornell.edu>
Message-ID:

Hi,

I've just added a proposal:

Mocapy++Biopython: from data to probabilistic models of biomolecules

Cheers,

--
Thomas Hamelryck, Eng., Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From laserson at mit.edu Fri Mar 11 12:03:58 2011
From: laserson at mit.edu (Uri Laserson)
Date: Fri, 11 Mar 2011 12:03:58 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To:
References:
Message-ID:

>
> OK - and do you want to worry about spotting frameshifts,
> and updating the translation for CDS features?
>

I can retranslate the features myself, wary of any frameshifts

> You could do, or create a new SeqFeature, or "steal" the old one and
> modify it. The later technique would probably be fastest since there
> are no new objects to create, just a few integer attributes changes
> (location positions), but is perhaps a bit risky if you don't comment
> it clearly. If you do that, perhaps do this by popping the features
> from the old SeqRecord's feature list, modify them, and add them
> to the new SeqRecord's feature list.
>

I can't steal the features because the source of the features is a reference sequence that I will reuse for millions of reads. I will have to make a copy. You believe that building a new SeqFeature would be faster/safer than using python's copy.deepcopy() method?

> Another query, are you going to look for inversions?
In such
> cases the strand needs flipping and the start/end interchanged.
> The SeqRecord reverse complement method has to do this,
> and therefore the SeqFeature and its location and position
> classes all have a private _flip method.
>

All the reads will be reverse complemented to the coding orientation before the transfer of the features, so I don't think this will be a problem.

> [If you find these private methods useful, perhaps we can make
> them public? Let us know]
>

It's hard to tell what the general API should be or what are the most common use-cases. For myself, I can get by with writing my own methods to modify the coordinates accordingly.

Uri

From p.j.a.cock at googlemail.com Fri Mar 11 12:15:09 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Mar 2011 17:15:09 +0000
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
In-Reply-To:
References:
Message-ID:

On Fri, Mar 11, 2011 at 5:03 PM, Uri Laserson wrote:
>> You could do, or create a new SeqFeature, or "steal" the old one and
>> modify it. The later technique would probably be fastest since there
>> are no new objects to create, just a few integer attributes changes
>> (location positions), but is perhaps a bit risky if you don't comment
>> it clearly. If you do that, perhaps do this by popping the features
>> from the old SeqRecord's feature list, modify them, and add them
>> to the new SeqRecord's feature list.
>
> I can't steal the features because the source of the features is a reference
> sequence that I will reuse for millions of reads. I will have to make a
> copy. You believe that building a new SeqFeature would be faster/safer than
> using python's copy.deepcopy() method?

Yes, in this case you will have to make a copy.
As to speed, I'm not sure which would be fastest - try it and see ;)

Note as long as you are not going to *change* the information in the qualifiers dictionary (and you may want to if you update the translation for example), then you can have the new SeqFeature share the old qualifiers dictionary. That is a bit sneaky but may help with speed (if speed is an issue).

>> [If you find these private methods useful, perhaps we can make
>> them public? Let us know]
>
> It's hard to tell what the general API should be or what are the most common
> use-cases. For myself, I can get by with writing my own methods to modify
> the coordinates accordingly.

Thanks,

Peter

From reece at harts.net Mon Mar 14 14:22:52 2011
From: reece at harts.net (Reece Hart)
Date: Mon, 14 Mar 2011 11:22:52 -0700
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To:
References: <4D79073D.3090603@cornell.edu>
Message-ID:

All-

I just added a GSoC Biopython proposal:
Variant representation, parser, generator, and coordinate converter

Comments and co-mentors welcome.
-Reece From 2huggie at gmail.com Wed Mar 16 04:26:44 2011 From: 2huggie at gmail.com (Timothy Wu) Date: Wed, 16 Mar 2011 16:26:44 +0800 Subject: [Biopython] [BioPython] Genbank parser Message-ID: Hi, I'm using Biopython to parse human genome files with code like this: for seq_record in SeqIO.parse(fd, "genbank"): * do something with seq_record* However something tripped on me: Traceback (most recent call last): File "./buildSyn.py", line 26, in main() File "./buildSyn.py", line 19, in main gene2SynMapping, syn2GeneMapping = mapper.getMappingDicts(files) File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 29, in getMappingDicts self.parseAndGetMapping(fd, gene2syn) File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 74, in parseAndGetMapping for seq_record in SeqIO.parse(fd, "genbank"): File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 525, in parse for r in i: File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 437, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 420, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 392, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table consumer.location(location_string) File "/usr/lib/pymodules/python2.6/Bio/GenBank/__init__.py", line 975, in location raise LocationParserError(location_line) Bio.GenBank.LocationParserError: 958574^958575..958886 The Genbank file involved has the following structure: CDS 958574^958575..958772 /gene="CSH2" /gene_synonym="CS-2; CSB; hCS-B" /exception="unclassified translation discrepancy" /note="placental lactogen; chorionic somatomammotropin B; Derived by automated computational analysis using gene prediction 
                     method: Curated Genomic."
                     /codon_start=1
                     /product="chorionic somatomammotropin hormone 2 isoform 3"
                     /protein_id="NP_072171.1"
                     /db_xref="GI:12408694"
                     /db_xref="CCDS:CCDS42368.1"
                     /db_xref="GeneID:1443"
                     /db_xref="HGNC:2441"
                     /db_xref="MIM:118820"

This isn't the first occurrence in this file, however I manually deleted what's equivalent of "^958575" in the location and it works out OK.

Is there something I can do? Right now I edit the genbank file instead (since I won't be needing the location information). And I'm not sure what the caret is supposed to represent.

Thanks for your attention.

Timothy

From p.j.a.cock at googlemail.com Wed Mar 16 07:43:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 Mar 2011 11:43:28 +0000
Subject: [Biopython] [BioPython] Genbank parser
In-Reply-To:
References:
Message-ID:

On Wed, Mar 16, 2011 at 8:26 AM, Timothy Wu <2huggie at gmail.com> wrote:
> Hi,
>
> I'm using Biopython to parse human genome files with code like this:
>
>        for seq_record in SeqIO.parse(fd, "genbank"):
>            * do something with seq_record*
>
> However something tripped on me:
>
> Traceback (most recent call last):
> ...
>    raise LocationParserError(location_line)
> Bio.GenBank.LocationParserError: 958574^958575..958886
>
> The Genbank file involved has the following structure:
>
>     CDS             958574^958575..958772
>                     /gene="CSH2"
> ...
>
> This isn't the first occurrence in this file, however I manually deleted
> what's equivalent of "^958575" in the location and it works out OK.
>
> Is there something I can do? Right now I edit the genbank file instead
> (since I won't be needing the location information)
> And I'm not sure what the caret is suppose to represent.

Hi Timothy,

I believe this to be an invalid GenBank file, and I would like you to contact the NCBI to check this.

The caret is used for 'between'. Here it seems to be saying that this feature starts between 958574 and 958575, and runs to 958772.
That would normally be represented just as 958575..958772 See also: http://bugzilla.open-bio.org/show_bug.cgi?id=3175 http://redmine.open-bio.org/issues/3175 (we're migrating the bug database, official announcement due soon) How many of this kind of 'broken' GenBank records have you found? I would hope it is just one or two that can be fixed by hand. If on the other hand the NCBI say this is valid, we need to handle this in the Biopython feature model... Peter From cjfields at illinois.edu Wed Mar 16 13:58:23 2011 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 16 Mar 2011 12:58:23 -0500 Subject: [Biopython] [ANNOUNCEMENT] Bugzilla to Redmine migration Message-ID: <34C8C0CB-9273-468E-86D7-74B22464F181@illinois.edu> (apologies if you receive multiple copies of this) All, We are currently about 95% done with a transition over to our new Redmine tracking system, to the point where we feel comfortable in going ahead with opening it to developers: http://redmine.open-bio.org/ All edits to bugzilla reports on our old system (http://bugzilla.open-bio.org/) are now disabled and the system is now read-only. Any new bugs and comments to old ones should be reported on the new Redmine server. For current Bugzilla users, we have migrated login IDs to Redmine (this is normally an email address), but we have reset user passwords for security reasons. There are two ways to access your account: 1) When logging in (http://redmine.open-bio.org/login), click on the 'Lost password' link. You will be prompted for your email address (this should be the same as your bugzilla login). An new email will be sent out containing directions for resetting your password and logging in. 2) It is possible the above may be automatically detected as spam. If the above doesn't work or the reset email isn't received within a day, contact support at helpdesk.open-bio.org to receive your new password. 
Also, note that Redmine has a different syntax for those who want to add links to their reports; see http://www.redmine.org/projects/redmine/wiki/RedmineTextFormatting. Let us know if you have any questions. chris Christopher Fields IGB Postdoctoral Fellow Genomics of Neural & Behavioral Plasticity University of Illinois Urbana-Champaign Institute for Genomic Biology 1206 W. Gregory Dr. , MC-195 Urbana, IL 61801 From rmb32 at cornell.edu Fri Mar 18 15:23:37 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Fri, 18 Mar 2011 15:23:37 -0400 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4D83B139.4010803@cornell.edu> Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2011 FAQ at http://bit.ly/hpoz8W Student applications are due April 8, 2011 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and whom to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2011 Administrator From laserson at mit.edu Mon Mar 21 19:38:10 2011 From: laserson at mit.edu (Uri Laserson) Date: Mon, 21 Mar 2011 19:38:10 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? 
Message-ID: If I load a GenBank-formatted record: a = SeqIO.parse('myfile.gb','gb').next() then set an annotation: a.annotations['myannotation'] = 'saveme' and then format the SeqRecord object as GenBank: a.format('gb') then 'myannotation' is lost. Is this expected behavior? If so, that's a huge bummer...what is the suggested method to store my own annotations in INSDC formats? Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Tue Mar 22 05:22:17 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 09:22:17 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Mon, Mar 21, 2011 at 11:38 PM, Uri Laserson wrote: > If I load a GenBank-formatted record: > >     a = SeqIO.parse('myfile.gb','gb').next() > > then set an annotation: > >     a.annotations['myannotation'] = 'saveme' > > and then format the SeqRecord object as GenBank: > >     a.format('gb') > > then 'myannotation' is lost. It isn't 'lost' in that it is still in your SeqRecord object in memory, but it isn't in the GenBank format output. > Is this expected behavior? Yes, there is no general field for record level annotation in the GenBank or EMBL file formats. Where did you expect it to be written? The same thing would happen with most file formats, e.g. FASTA has no annotation support at all beyond the free text description line. > If so, that's a huge bummer...what is the suggested method to > store my own annotations in INSDC formats? You could stuff record level information into a source feature's qualifier dictionary. It isn't elegant, but it would work.
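As a rough illustration of the source-feature workaround just described, here is a minimal sketch. It assumes a reasonably recent Biopython is installed; the helper names `annotations_to_qualifiers` and `add_dummy_source_feature` are made up for this example and are not part of Biopython. Values are wrapped as lists of strings and spaces in keys are replaced by underscores, which is a choice of this sketch rather than a Biopython requirement.

```python
def annotations_to_qualifiers(annotations):
    """Convert a record-level annotations dict into feature qualifiers.

    Hypothetical helper (not a Biopython function): every value becomes a
    list of strings, and spaces in keys are replaced with underscores.
    """
    qualifiers = {}
    for key, value in annotations.items():
        clean_key = str(key).strip().replace(" ", "_")
        values = value if isinstance(value, (list, tuple)) else [value]
        qualifiers[clean_key] = [str(v) for v in values]
    return qualifiers


def add_dummy_source_feature(record, annotations):
    """Attach a full-length 'source' feature carrying the annotations.

    Hypothetical helper; needs Biopython, which is imported lazily so the
    conversion function above can be used without it.
    """
    from Bio.SeqFeature import SeqFeature, FeatureLocation

    feature = SeqFeature(
        FeatureLocation(0, len(record.seq)),
        type="source",
        qualifiers=annotations_to_qualifiers(annotations),
    )
    record.features.insert(0, feature)
    return feature
```

With the record from the question this might be used as `add_dummy_source_feature(a, {'myannotation': 'saveme'})` just before `a.format('gb')`. Note that records downloaded from the NCBI usually already have a source feature, in which case adding the qualifiers to that existing feature is probably preferable to inserting a second one.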
The NCBI seems to have introduced the source feature primarily to store the taxon identifier and other little bits of information not handled explicitly in the header lines. (Plus this can handle chimeras which may have been a use case). Peter From laserson at mit.edu Tue Mar 22 11:08:08 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 22 Mar 2011 11:08:08 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: > > You could stuff record level information into a source feature's > qualifier dictionary. What are the allowed types for the values of the qualifiers dictionary (that will be output correctly in INSDC)? Is it possible to have lists of strings? What is the standard practice: a feature of type "source" that runs the entire length of the sequence? Or is it possible to have a SeqFeature with no position annotation? Ideally, if I slice the SeqFeature, I would like these annotations to stay with the slice no matter what. Uri From p.j.a.cock at googlemail.com Tue Mar 22 11:30:46 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 15:30:46 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson wrote: >> You could stuff record level information into a source feature's >> qualifier dictionary. > > What are the allowed types for the values of the qualifiers dictionary > (that will be output correctly in INSDC)? Is it possible to have lists of > strings? As far as the current Biopython output goes, you can basically use any (short) string as a qualifier key. Avoid keys with spaces in them (INSDC uses underscores) and other funny characters. For strict INSDC compliance there is probably a whitelist of allowed feature types... > What is the standard practice: a feature of type "source" that runs the > entire length of the sequence?
Or is it possible to have a SeqFeature with > no position annotation? Ideally, if I slice the SeqFeature, I would like > these annotations to stay with the slice no matter what. If you did have a SeqFeature without a location, we couldn't write it out in GenBank/EMBL format (the error handling here might be improved). If you have a SeqRecord with a (source) feature spanning the full sequence, and you slice the SeqRecord to take a subsequence, then that full length feature (and any other features not fully within the subsequence) would be lost. Using a source feature is really just a workaround for the fact that GenBank/EMBL do not support arbitrary record level annotation. Do you have to use this as your output format? Would you not be better off with using a database or something else instead? Peter From laserson at mit.edu Tue Mar 22 11:44:02 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 22 Mar 2011 11:44:02 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: > > As far as the current Biopython output goes, you can basically use any > (short) string as a qualifier key. > Sorry, I meant for the values, not the keys. Can you have a list of strings as a value? > Using a source feature is really just a workaround for the fact that > GenBank/EMBL do not support arbitrary record level annotation. > Do you have to use this as your output format? Agreed. Essentially, I have a huge pile of sequencing reads that are highly annotated. For any given read, there are some annotations that are independent of the sequence itself (which is what I am trying to implement now) and there are some annotations that are associated with subsequences (which is why SeqFeatures are very appropriate). Ideally, I want a file format that will store the data, be easily parsable (and fast), and can be readable using something like `less` (though this last feature is less important).
> Would you not be > better off with using a database or something else instead? > Well, initially I used XML to store the data, but I quickly realized I was reinventing the wheel, especially when it came to annotating features on top of the sequences. Are you suggesting something like SQLite? How would I deal with SeqFeature-type annotations? Uri > Peter > From p.j.a.cock at googlemail.com Tue Mar 22 12:14:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 16:14:05 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Tue, Mar 22, 2011 at 3:44 PM, Uri Laserson wrote: >> As far as the current Biopython output goes, you can basically use any >> (short) string as a qualifier key. > > Sorry, I meant for the values, not the keys. Can you have a list of strings > as a value? Right. Again, yes; I think a single string as the value should work too. This is because the INSDC feature table allows multiple values for a tag - for example you often get multiple database cross references. >> Using a source feature is really just a workaround for the fact that >> GenBank/EMBL do not support arbitrary record level annotation. >> Do you have to use this as your output format? > > Agreed. Essentially, I have a huge pile of sequencing reads that are highly > annotated. For any given read, there are some annotations that are > independent of the sequence itself (which is what I am trying to implement > now) and there are some annotations that are associated with subsequences > (which is why SeqFeatures are very appropriate). Ideally, I want a file > format that will store the data, be easily parsable (and fast), and can be > readable using something like `less` (though this last feature is less > important). For this the GenBank/EMBL format with the source feature trick does sound workable.
You just need to be careful how and when you create the dummy source feature - I'd do it at the last moment before writing out the file, and in that way you can avoid things like slicing throwing it away. >> Would you not be >> better off with using a database or something else instead? > > Well, initially I used XML to store the data, but I quickly realized I was > reinventing the wheel, especially when it came to annotating features > on top of the sequences. I wonder if one of the INSDC XML formats would work nicely here? i.e. If they can be extended more easily. We should look at adding a parser for them to Biopython (and write support too ideally of course). > Are you suggesting something like SQLite? How would I deal with > SeqFeature-type annotations? I was thinking you could use the BioSQL schema (run on SQLite if you wanted to, or MySQL or PostgreSQL etc). You'd still face the same issues if/when you wanted to dump the annotated records to a plain text file though. Peter From laserson at mit.edu Tue Mar 22 12:58:03 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 22 Mar 2011 12:58:03 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: > > For this the GenBank/EMBL format with the source feature trick > does sound workable. You just need to be careful how and > when you create the dummy source feature - I'd do it at the last > moment before writing out the file, and in that way you can avoid > things like slicing throwing it away. > > That's a good idea. This should be even easier since I am subclassing SeqRecord. I can override `format` to first take the whole annotations dictionary and dump it into the qualifiers dictionary of a `source` feature. I also have my own parser which wraps SeqIO; using SeqIO to parse the 'imgt' format, I can then copy the `source` qualifiers to the annotations dictionary and delete `source` feature entirely.
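The reading half of that round trip could look roughly like the sketch below. It is only an illustration under stated assumptions: `pop_source_annotations` is a hypothetical name, not a Biopython function, and the code assumes only that each feature exposes a `.type` string and a `.qualifiers` dict, as Biopython's SeqFeature objects do.

```python
def pop_source_annotations(features):
    """Remove 'source' features and return their qualifiers as annotations.

    Hypothetical helper: `features` is a list of objects with `.type` and
    `.qualifiers` attributes. The list is modified in place. Single-entry
    lists are unwrapped to the bare value; longer lists are kept as lists.
    """
    annotations = {}
    kept = []
    for feature in features:
        if feature.type == "source":
            for key, value in feature.qualifiers.items():
                if isinstance(value, list) and len(value) == 1:
                    value = value[0]
                annotations[key] = value
        else:
            kept.append(feature)
    features[:] = kept  # drop the source feature(s) from the caller's list
    return annotations
```

After parsing, something like `record.annotations.update(pop_source_annotations(record.features))` would restore the dictionary. Bear in mind that a genuine source feature also carries qualifiers such as the organism and taxon cross reference, so those would be moved as well.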
Does this sound reasonable? > I wonder if one of the INSDC XML formats would work nicely here? > i.e. If they can be extended more easily. We should look at adding a > parser for them to Biopython (and write support too ideally of course). > My only issue with this is that I'd rather not extend anyone's file format, but use a standard file format that fits my purpose. Otherwise, I might as well just go straight for a database, as below. (But there are some super-fast XML parsers out there.) > I was thinking you could use the BioSQL schema (run on SQLite if > you wanted to, or MySQL or PostgreSQL etc). You'd still face the > same issues if/when you wanted to dump the annotated records > to a plain text file though. > I suppose plain text readability is less important to me than ease of sharing the data. But when I dump a SeqRecord object to a BioSQL database, does it do it in a way that I can rebuild that object exactly with no loss of information? (I.e., does it solve the annotation dictionary problem that started this whole thread?) Uri From p.j.a.cock at googlemail.com Tue Mar 22 13:24:46 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 17:24:46 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Tue, Mar 22, 2011 at 4:58 PM, Uri Laserson wrote: >> For this the GenBank/EMBL format with the source feature trick >> does sound workable. You just need to be careful how and >> when you create the dummy source feature - I'd do it at the last >> moment before writing out the file, and in that way you can avoid >> things like slicing throwing it away. > > That's a good idea. This should be even easier since I am subclassing > SeqRecord. I can override `format` to first take the whole annotations > dictionary and dump it into the qualifiers dictionary of a `source` feature.
> I also have my own parser which wraps SeqIO; using SeqIO to parse the > 'imgt' format, I can then copy the `source` qualifiers to the annotations > dictionary and delete `source` feature entirely. Does this sound > reasonable? Yes, using your own parser/writer to take care of mapping between the SeqRecord annotations dictionary and a dummy feature sounds sensible. Also using 'imgt' rather than GenBank or EMBL will let you have longer feature qualifier keys - but these files are not as widely used/supported as the GenBank and EMBL formats. >> I wonder if one of the INSDC XML formats would work nicely here? >> i.e. If they can be extended more easily. We should look at adding a >> parser for them to Biopython (and write support too ideally of course). > > My only issue with this is that I'd rather not extend anyone's file format, > but use a standard file format that fits my purpose. Otherwise, I might as > well just go straight for a database, as below. (But there are some > super-fast XML parsers out there.) I haven't looked at the details to see if those XML file formats have a nice open ended misc annotation tag you could just use. >> I was thinking you could use the BioSQL schema (run on SQLite if >> you wanted to, or MySQL or PostgreSQL etc). You'd still face the >> same issues if/when you wanted to dump the annotated records >> to a plain text file though. > > I suppose plain text readability is less important to me than ease of > sharing the data. But when I dump a SeqRecord object to a BioSQL > database, does it do it in a way that I can rebuild that object exactly > with no loss of information? (I.e., does it solve the annotation dictionary > problem that started this whole thread?) Basically yes, subject to a few provisos, it should. Firstly note we don't support any per-letter-annotation in BioSQL.
Secondly, all the SeqRecord annotations and SeqFeature qualifiers will end up being stored as strings (in table bioentry_qualifier_value and table seqfeature_qualifier_value respectively). There may also be some fun with string values vs single entry lists containing one string. Peter From gori at cs.ru.nl Wed Mar 23 13:43:16 2011 From: gori at cs.ru.nl (Fabio Gori) Date: Wed, 23 Mar 2011 18:43:16 +0100 Subject: [Biopython] From genome to lineage with Entrez Message-ID: <201103231843.16762.gori@cs.ru.nl> Hi all, I have downloaded all the bacterial genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare their taxonomic lineages. I'm looking for a way to get their lineages with Entrez. From the files I can get the accession numbers and GIs, but I don't know how to get their taxonomic ids. I know that I can step from GIs to Taxids processing the file gi_taxid_nucl.dmp, but I'd prefer to use Entrez. Thanks in advance, Fabio -- F. Gori, PhD student Intelligent Systems ICIS (Institute for Computing and Information Sciences) Radboud University Nijmegen Home Page: http://www.cs.ru.nl/~gori/ From p.j.a.cock at googlemail.com Wed Mar 23 14:01:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 23 Mar 2011 18:01:32 +0000 Subject: [Biopython] From genome to lineage with Entrez In-Reply-To: <201103231843.16762.gori@cs.ru.nl> References: <201103231843.16762.gori@cs.ru.nl> Message-ID: On Wed, Mar 23, 2011 at 5:43 PM, Fabio Gori wrote: > Hi all, > > I have downloaded all the bacterial genomes > (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare > their taxonomic lineages. > > I'm looking for a way to get their lineages with Entrez. From the files I can > get the accession numbers and GIs, but I don't know how to get their taxonomic > ids. > I know that I can step from GIs to Taxids processing the file > gi_taxid_nucl.dmp, but I'd prefer to use Entrez.
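For the dump-file route mentioned in the question, a minimal sketch of the lookup is given here. It assumes each line of gi_taxid_nucl.dmp holds two whitespace-separated integers (GI, then taxid); the function name `load_gi_to_taxid` is made up for illustration. Only the GIs of interest are kept, so the whole file need not be held in memory.

```python
def load_gi_to_taxid(lines, wanted_gis):
    """Map each wanted GI to its taxid from gi_taxid dump lines.

    Hypothetical helper: `lines` is any iterable of text lines (for
    example an open file handle), each assumed to be 'GI<tab>taxid'.
    GIs that never appear in the dump are simply absent from the result.
    """
    wanted = set(int(gi) for gi in wanted_gis)
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        gi, taxid = int(parts[0]), int(parts[1])
        if gi in wanted:
            mapping[gi] = taxid
    return mapping
```

Typical use would be along the lines of `with open("gi_taxid_nucl.dmp") as handle: mapping = load_gi_to_taxid(handle, gis)`, where `gis` is the list of GIs collected from the FASTA headers.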
> I think you can do it with ELink, but personally I'd use the taxid dump file, since it sounds like you'll want to process hundreds of lineages. Peter From amenity at enthought.com Wed Mar 23 23:29:35 2011 From: amenity at enthought.com (Amenity Applewhite) Date: Wed, 23 Mar 2011 22:29:35 -0500 Subject: [Biopython] SciPy 2011 Call for Papers Message-ID: Hello, SciPy 2011 , the 10th Python in Science conference, will be held July 11 - 16, 2011, in Austin, TX. At this conference, novel applications and breakthroughs made in the pursuit of science using Python are presented. Attended by leading figures from both academia and industry, it is an excellent opportunity to experience the cutting edge of scientific software development. The conference is preceded by two days of tutorials, during which community experts provide training on several scientific Python packages. *We'd like to invite you to consider presenting at SciPy 2011.* The list of topics that are appropriate for the conference includes (but is not limited to): * new Python libraries for science and engineering; * applications of Python to the solution of scientific or computational problems; * high performance, parallel and GPU computing with Python; * use of Python in science education. *Specialized Tracks* This year we also have two specialized tracks. They will be run concurrent to the main conference. *Python in Data Science Chair: Peter Wang, Streamitive, Inc.* This track focuses on the advantages and challenges of applying Python in the emerging field of "data science". This includes a breadth of technologies, from wrangling realtime data streams from the social web, to machine learning and semantic analysis, to workflow and repository management for large datasets. 
*Python and Core Technologies Chair: Anthony Scopatz, Enthought, Inc.* In an effort to broaden the scope of SciPy and to engage the larger community of software developers, we are pleased to introduce the _Python & Core Technologies_ track. Talks will cover subjects that are not directly related to science and engineering, yet nonetheless affect scientific computing. Proposals on the Python language, visualization toolkits, web frameworks, education, and other topics are appropriate for this session. *Talk/Paper Submission* We invite you to take part by submitting a talk abstract on the conference website at: http://conference.scipy.org/scipy2011/papers.php Papers are included in the peer-reviewed conference proceedings, to be published online. *Important dates for authors:* Friday, April 15: Tutorial proposals due (remember: stipends will be provided for Tutorial instructors) http://conference.scipy.org/scipy2011/tutorials.php Sunday, April 24: Paper abstracts due Sunday, May 8: Student sponsorship request due http://conference.scipy.org/scipy2011/student.php Tuesday, May 10: Accepted talks announced Monday, May 16: Student sponsorships announced Monday, May 23: Early Registration ends Sunday, June 20: Papers due Monday-Tuesday, July 11 - 12: Tutorials Wednesday-Thursday, July 13 - July 14: Conference Friday-Saturday, July 15 - July 16: Sprints The SciPy 2011 Team @SciPy2011 http://twitter.com/SciPy2011 _________________________ Amenity Applewhite Enthought, Inc. Scientific Computing Solutions From michele.silva at gmail.com Fri Mar 25 02:11:41 2011 From: michele.silva at gmail.com (Michele) Date: Fri, 25 Mar 2011 03:11:41 -0300 Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython In-Reply-To: References: Message-ID: Hello everyone, I'm Michele, a computer scientist and passionate developer who is currently enrolled in a biomedicine course. That's why I got in touch with the biopython project and have tried its tools for biological computation. 
When I read the Mocapy++Biopython proposal I immediately fell in love with it. Let me tell you why. I have worked since 2005 with bayesian networks, modelling BN for medical learning environments and also programming algorithms for handling those nets. In the context of my masters in computer science with the Artificial Intelligence Group, we have published several papers on the idea of using bayesian networks to model the uncertainty associated with the students' behavior in learning environments (see, for example, Designing a Bayesian Network based Student Model for Distance Learning Environmentspublished at the Seventh IEEE International Conference on Advanced Learning Technologies, 2007). As for the C++ and Python glue, I also have enjoyed the project's proposal. I have been programming in C++ for more than 5 years, in small and big projects, mainly in microelectronics CAD and firmware development. Coincidentally, last year I started working with Python in bigger projects. I worked for ESSS, a company which develops software for scientific computing and engineering simulation. I worked with oil reservoir simulation, where the applications were developed in Python and the simulation core and the computer graphics algorithms were programmed in C++. If you want to have a feeling on what reservoir simulation and the applications I worked in look like, have a look at the Kraken's project website . I worked in both Python and C++ development, as well as in the glue through the use of boost python. Regarding the experience in biomolecular structure, I'm a beginner. I have started studying biomedicine this year and therefore have a lot to learn. I know a bit about the PDB format and molecular biology. I'm sure I can count on your help to continue learning. So that was my not-so-short presentation. I would love to get to know the community better and work together on the GSoC. Please let me know if you think I could write a proposal and If you can help me on that. 
Cheers, Michele Silva http://www.linkedin.com/pub/michele-silva/6/520/5b0 From p.j.a.cock at googlemail.com Fri Mar 25 03:37:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 25 Mar 2011 07:37:00 +0000 Subject: [Biopython] Public example FASTQ files (for Tutorial examples)? Message-ID: Hi all, One of the volunteers proofreading the Biopython tutorial noticed our links to specific example FASTQ files at the NCBI SRA don't work any more. They have withdrawn them from the FTP site, although you can still download the files in the compressed *.sra format and in theory convert them to FASTQ locally with the NCBI's toolkit (which is cross platform). Another option is to download the FASTQ files via the NCBI's web interface. Unless there is an obvious way to do this with a URL that I missed initially, we have a complicated situation to describe where the user can choose all the reads for an experiment or just the filtered set, and also choose to have them pre-trimmed or not. Plus for me at least, the HTTP download wasn't as robust as the FTP one was. I'm hoping someone could suggest a couple of other moderately sized FASTQ files which are public, on FTP or a static HTML server, which we can use in the tutorial. So, suggestions? Thanks! Peter From brettpthomas at gmail.com Tue Mar 29 10:50:38 2011 From: brettpthomas at gmail.com (Brett Thomas) Date: Tue, 29 Mar 2011 10:50:38 -0400 Subject: [Biopython] VCF files In-Reply-To: References: Message-ID: Hi all, I write software for genetic research, and the predominant file format we use is VCF, a new file format used to represent genetic variation in the 1000 genomes project. Has there been any discussion of a Biopython API for VCF files? I'd be happy to help if anybody is working on it.
Thanks, Brett From jamesrwagner at gmail.com Tue Mar 29 13:55:56 2011 From: jamesrwagner at gmail.com (James Wagner) Date: Tue, 29 Mar 2011 13:55:56 -0400 Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work Message-ID: Hello: I was trying just as a proof of concept to do an NCBI WWW BLAST query with a FASTA file containing more than one sequence (but still a small number of sequences). I tried with the opuntia.fasta file from the website, and set it up as follows: result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r")) blast_records = NCBIXML.parse(result_handle) then I try: for record in blast_records: print record.alignments and I obtain: [] Surely at the very least since there were 7 sequences in this file, I should get 7 empty lists, assuming of course none of the sequences gives a hit in nr, which I am sure is not the case either? What is still missing? I realize I could use SeqIO.parse to obtain each sequence from the FASTA file and do a separate qblast, but surely doing this separately for each protein would create unnecessary overhead with the network traffic compared to somehow sending off all the protein queries at once? From p.j.a.cock at googlemail.com Tue Mar 29 14:07:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Mar 2011 19:07:47 +0100 Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work In-Reply-To: References: Message-ID: On Tue, Mar 29, 2011 at 6:55 PM, James Wagner wrote: > Hello: > > I was trying just as a proof of concept to do an NCBI WWW BLAST query > with a FASTA file containing more than one sequence (but still a small > number of sequences). > > I tried with the opuntia.fasta file from the website, and set it up as follows: > > result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r")) > blast_records = NCBIXML.parse(result_handle) > > then I try: > > for record in blast_records: >
    print record.alignments > > and I obtain: > [] > > > Surely at the very least since there were 7 sequences in this file, I > should get 7 empty lists, assuming of course none of the sequences > gives a hit in nr, which I am sure is not the case either? Not necessarily; the NCBI may have fixed this, but for a long time if you had say 7 queries but only 2 gave hits, standalone BLAST's XML output would only contain those 2 hits. There would be nothing at all from the 5 queries without hits. This was/is very annoying, but right now I'm not sure if they have fixed this or not. Try getting back the results as plain text and manually inspect them. In the plain text output all the queries appear, and there is a clear "no hits found" message. > What is still missing? I realize I could use SeqIO.parse to obtain > each sequence from the FASTA file and do a separate qblast, but surely > doing this separately for each protein would create unnecessary > overhead with the network traffic compared to somehow sending off all > the protein queries at once? Yes, in theory a single large query should have less overhead than individual queries. Personally I'd just use standalone BLAST and run it locally if I had more than a few queries. Peter From jamesrwagner at gmail.com Tue Mar 29 16:43:35 2011 From: jamesrwagner at gmail.com (James Wagner) Date: Tue, 29 Mar 2011 16:43:35 -0400 Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work In-Reply-To: References: Message-ID: OK, when I try to create a .fasta file with just the first sequence in opuntia, I get no hits. However, when I just copy and paste the nucleotide sequence into the call as a string, I get 50 hits! This is consistent with what happens when pasting the first opuntia sequence into the NCBI BLAST web interface, though there I obtain 110 hits for intronic sequences in Opuntia chloroplast and chloroplasts.
As a secondary point I also find it curious the result with using NCBIWWW is limited to 50 hits (I thought it was 500 by default). But what is more problematic than the fact that I get no hits when using a FASTA file with only a single sequence, when clearly there are some very high homology hits present in nr. This is my code from beginning to end, where the file opuntia1.fasta is a file containing only the 1st sequence from opuntia.fasta, and when using the line for opuntia1.fasta it resulted in no hits. I am using BioPython 1.5.3 and Python 2.6 on Ubuntu if this has any effect on the results. I also tried it by obtaining a single sequence from SeqIO.parse and then obtaining the Seq of this sequence, and it also gave 50 hits. So it's basically just with using a FASTA file handle that I can't get it to work. #!/usr/bin/python from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML result_handle = NCBIWWW.qblast("blastn", "nr", "TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGA\CCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCTCATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAAACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCAGA") #result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia1.fasta", "r")) blast_record = NCBIXML.read(result_handle) for description in 
blast_record.descriptions: print description; #end of code. On Tue, Mar 29, 2011 at 2:07 PM, Peter Cock wrote: > On Tue, Mar 29, 2011 at 6:55 PM, James Wagner wrote: >> Hello: >> >> I was trying just as a proof of concept to do an NCBI WWW BLAST query >> with a FASTA file containing more than one sequence (but still a small >> number of sequences). >> >> I tried with the opuntia.fasta file from the website, and set it up as follows: >> >> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r")) >> blast_records = NCBIXML.parse(result_handle) >> >> then I try: >> >> for record in blast_records: >>     print record.alignments >> >> and I obtain: >> [] >> >> >> Surely at the very least since there were 7 sequences in this file, I >> should get 7 empty lists, assuming of course none of the sequences >> gives a hit in nr, which I am sure is not the case either? > > Not necessarily; the NCBI may have fixed this, but for a long time if > you had say 7 queries but only 2 gave hits, standalone BLAST's > XML output would only contain those 2 hits. There would be nothing > at all from the 5 queries without hits. This was/is very annoying, but > right now I'm not sure if they have fixed this or not. > > Try getting back the results as plain text and manually inspect them. > In the plain text output all the queries appear, and there is a clear > "no hits found" message. > >> What is still missing? I realize I could use SeqIO.parse to obtain >> each sequence from the FASTA file and do a separate qblast, but surely >> doing this separately for each protein would create unnecessary >> overhead with the network traffic compared to somehow sending off all >> the protein queries at once? > > Yes, in theory a single large query should have less overhead > than individual queries. Personally I'd just use standalone BLAST > and run it locally if I had more than a few queries.
>
> Peter
>

From rmb32 at cornell.edu Tue Mar 29 17:20:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 29 Mar 2011 14:20:41 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4D924D29.3020707@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code, thanks to Christian Zmasek and Hilmar Lapp for their excellent writing. Student applications are due April 8! Please spread it widely, we need to reach lots of students with it!

Rob Buels
OBF GSoC 2011 Admin

============================================================
*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***
============================================================

OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2011

Applications due 19:00 UTC, April 8, 2011.

http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a unique opportunity for undergraduate, masters, and PhD students to obtain hands-on experience writing and extending open-source software for bioinformatics under the mentorship of experienced developers from around the world. The program is the participation of the Open Bioinformatics Foundation (OBF) as a mentoring organization in the Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000 USD stipend, and may work entirely from their home or home institution. Participation is open to students from any country in the world except countries subject to US trade restrictions. Each student will have at least one dedicated mentor to show them the ropes and help them complete their project.

The Open Bioinformatics Foundation is particularly seeking students interested in both bioinformatics (computational biology) and software development.
Some initial project ideas are listed on the website. These range from Galaxy phylogenetics pipeline development in Biopython to lightweight sequence objects and lazy parsing in BioPerl, a DAS Server for large files on local filesystems, and mapping Java libraries to Perl/Ruby/Python using Biolib+SWIG+JNI. All project ideas are flexible and many can be adjusted in scope to match the skills of the student. We also welcome and encourage students proposing their own project ideas; historically some of the most successful Summer of Code projects are ones proposed by the students themselves. TO APPLY: Apply online at the Google Summer of Code website (http://socghop.appspot.com/), where you will also find GSoC program rules and eligibility requirements. The 12-day application period for students runs from Monday, March 28 through Friday, April 8th, 2011. INQUIRIES: We strongly encourage all interested students to get in touch with us with their ideas as early on as possible. See the OBF GSoC page for contact details. 2011 OBF Summer of Code: http://www.open-bio.org/wiki/Google_Summer_of_Code Google Summer of Code FAQ: http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs From albert.bogdanowicz at gmail.com Thu Mar 31 13:01:45 2011 From: albert.bogdanowicz at gmail.com (Albert Bogdanowicz) Date: Thu, 31 Mar 2011 19:01:45 +0200 Subject: [Biopython] Google Summer of Code idea Message-ID: <201103311901.45372.albert.bogdanowicz@gmail.com> Hello World, I am a bioinformatics student and I would like to take part in Google Summer of Code this year. I have an idea for a project that I could write. It would be a module for synthetic biology, especially BioBrick standard used in iGEM competition (http://ung.igem.org/Main_Page). I'm a bit late, but I hope this fact won't disqualify me. I would appreciate any help in determining a more detailed specification for such project. 
Albert Bogdanowicz From laserson at mit.edu Thu Mar 31 16:48:16 2011 From: laserson at mit.edu (Uri Laserson) Date: Thu, 31 Mar 2011 16:48:16 -0400 Subject: [Biopython] Google Summer of Code idea In-Reply-To: <201103311901.45372.albert.bogdanowicz@gmail.com> References: <201103311901.45372.albert.bogdanowicz@gmail.com> Message-ID: Hi Albert, Are you thinking of something like the Clotho project? http://www.clothocad.org/ Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu On Thu, Mar 31, 2011 at 13:01, Albert Bogdanowicz < albert.bogdanowicz at gmail.com> wrote: > Hello World, > I am a bioinformatics student and I would like to take part in Google > Summer > of Code this year. > I have an idea for a project that I could write. It would be a module for > synthetic biology, especially BioBrick standard used in iGEM competition > (http://ung.igem.org/Main_Page). > I'm a bit late, but I hope this fact won't disqualify me. I would > appreciate > any help in determining a more detailed specification for such project. > Albert Bogdanowicz > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From rmb32 at cornell.edu Thu Mar 31 17:58:52 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 31 Mar 2011 14:58:52 -0700 Subject: [Biopython] Reminder: GSoC proposals due in 1 week Message-ID: <4D94F91C.1080005@cornell.edu> Hi all, Just a reminder, Google Summer of Code student applications are due April 8! If you're a student planning to apply to GSoC with OBF, it's very much in your best interest to write your proposal *early*, like now, and get it into the hands of the developers and mentors on your subproject (BioPerl/Ruby/Python/etc) so that they can give you some feedback on it. 
The final proposals must, of course, still be submitted to Google through the GSoC web application, as described on the main GSoC site (http://www.google-melange.com/gsoc/homepage/google/gsoc2011). Rob Buels OBF GSoC 2011 Administrator From rmb32 at cornell.edu Thu Mar 31 18:04:49 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 31 Mar 2011 15:04:49 -0700 Subject: [Biopython] GSoC call for mentors Message-ID: <4D94FA81.5090701@cornell.edu> Hi all, For current developers on OBF projects: If you would not mind being a mentor to a Summer of Code student this summer, please make sure you sign up as an OBF mentor in the GSoC web app. There's a link under "mentors: apply now!" midway down the page at http://www.google-melange.com/. If you didn't do last year's summer of code, it would be a good idea to drop me an email introducing yourself, as well, or I won't know whether to approve your request. :-) Being signed up as an OBF GSoC mentor will give you access to the student proposals, as they come in, and the ability to comment on them and assign scores to the ones you think show the most promise. If you sign up as a mentor, please also add yourself to the two OBF GSoC mailing lists: OBF-GSoC and OBF-GSoC-mentors OBF-GSoC list: http://lists.open-bio.org/mailman/listinfo/gsoc OBF mentors: http://lists.open-bio.org/mailman/listinfo/gsoc-mentors Thanks in advance! Rob --- Robert Buels OBF GSoC 2011 Administrator From philip.machanick at gmail.com Thu Mar 31 19:49:33 2011 From: philip.machanick at gmail.com (Philip Machanick) Date: Fri, 1 Apr 2011 09:49:33 +1000 Subject: [Biopython] extending Motif class Message-ID: I want to add a new scoring function to the Motif class and in true object-oriented spirit would like to do it by deriving a new class rather than hacking the existing code. 
The general structure of my test program (all in 1 file) is:

from Bio.Motif import Motif

class ScannableMotif(Motif):
    def pwm_score_hit(self,sequence,position):
        ## stuff to compute my new score

from Bio import Motif

def main ():
    for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        for i in range(3):
            print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

The two different imports appear to be necessary. I need the first to be able to use the base class to derive a new one, and without the second when I use metaclass methods, I get

TypeError: Error when calling the metaclass bases
    module.__init__() takes at most 2 arguments (3 given)

The other problem: I can't directly invoke a metaclass method on a derived instance as above. The snippet below works as expected, but looks like a kludge to me. Is there a better way of accessing metaclass methods from a derived class object?

for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
    motif.__class__ = ScannableMotif # promote to the new class
    for i in range(3):
        print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

I think I have the class vs. metaclass concept straight but understanding why I need the two different flavours of import would be useful.

--
Philip Machanick
Rhodes University, Grahamstown 6140, South Africa
http://opinion-nation.blogspot.com/
+61-7-3871-0963 mobile +61 42 234 6909 skype philipmach

From chapmanb at 50mail.com Thu Mar 31 20:59:52 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 31 Mar 2011 20:59:52 -0400
Subject: [Biopython] extending Motif class
In-Reply-To:
References:
Message-ID: <20110401005952.GA2644@kunkel>

Philip;

> I want to add a new scoring function to the Motif class and in true
> object-oriented spirit would like to do it by deriving a new class rather
> than hacking the existing code.

The approach you want to take here is to define a function that takes a motif as an input:

def pwm_score_hit(motif, sequence, position):

instead of trying to inherit from Motif. What happens in your example is that you inherit from the Motif class:

> from Bio.Motif import Motif
>
> class ScannableMotif(Motif):
>     def pwm_score_hit(self,sequence,position):
>         ## stuff to compute my new score

Then you call ScannableMotif as if it were the Motif namespace, when it is actually a class. The parse function is defined in Bio.Motif, not the Motif class:

> from Bio import Motif
> def main ():
>     for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>         for i in range(3):
>             print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

Which is why you get an error. Your promotion trick does work but really is too tricky and you are better off with just a separate function that works on motif objects.

Hope this helps,
Brad

From bartek at rezolwenta.eu.org Thu Mar 31 21:07:53 2011
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Fri, 1 Apr 2011 03:07:53 +0200
Subject: [Biopython] extending Motif class
In-Reply-To:
References:
Message-ID:

Hi,

On Fri, Apr 1, 2011 at 1:49 AM, Philip Machanick wrote:

> I want to add a new scoring function to the Motif class and in true
> object-oriented spirit would like to do it by deriving a new class rather
> than hacking the existing code.
>

Well, if you want to keep your code separate from Biopython and be able to use it with newer versions, then maybe yes; but if you think that your code could be contributed to Biopython and be useful for other people, then I'd consider just contributing via GitHub.

> The general structure of my test program (all in 1 file) is:
>
> from Bio.Motif import Motif
>
> class ScannableMotif(Motif):
>     def pwm_score_hit(self,sequence,position):
>         ## stuff to compute my new score
>

That's OK, although I suspect something fishy is hidden in your code here. More later.

> from Bio import Motif
>

You shouldn't need to do that.

> def main ():
>     for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>         for i in range(3):
>             print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>

Is ScannableMotif now a module? Or is "parse" a class method? BTW, you can parse MEME files with Bio.Motif.parse...

> The two different imports appear to be necessary. I need the first to be
> able to use the base class to derive a new one, and without the second when
>

Yes, you need to import the module to subclass Motif.

> I use metaclass methods, I get
>
> TypeError: Error when calling the metaclass bases
>     module.__init__() takes at most 2 arguments (3 given)
>

I cannot reproduce this error and it is highly unlikely that it is related at all to the Biopython code, as Bio.Motif does not use any metaclasses. I think you cause it by something in your later code:

> The other problem: I can't directly invoke a metaclass method on a derived
> instance as above. The snippet below works as expected, but looks like a
> kludge to me. Is there a better way of accessing metaclass methods from a
> derived class object?
>
> for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
>     motif.__class__ = ScannableMotif # promote to the new class
>

There you go! Don't do this. It is not the way objects "get promoted" to other classes. You seem to be playing with some Python internals here.

>     for i in range(3):
>         print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)
>
> I think I have the class vs. metaclass concept straight but understanding
> why I need the two different flavours of import would be useful.
>

Don't get me wrong, but I don't think you _need_ any metaclasses here. I think your problem is that you are trying to change the class of an existing instance, which (while probably possible in Python) is absolutely not the way to go. If your code is able to produce the correct output using the complicated imports, that's interesting, but it is probably not the easiest way to achieve it. However, it's hard to say what exactly your goal is from the code you provided.

But, on the more constructive side of things, if you want to subclass Bio.Motif.Motif and add a new method to it, you can just do what you did in the beginning of your code (provided that you do not mess with m.__class__ or something). Then, your problem seems to be that the MEME parser fails to return your subclass and gives you a Bio.Motif.Motif vanilla class (or MEMEMotif). What you can do (if you insist on not adding the method to Bio.Motif.Motif) is to write a constructor able to create a ScannableMotif from a "normal" motif:

class ScannableMotif(Motif):
    def new_score_hit(self,sequence,position):
        return 1 # or something smarter...
    def __init__(self,m): #just copy it all...
        self.instances = m.instances
        self.has_instances=m.has_instances
        self.counts = m.counts
        self.has_counts=m.has_counts
        self.mask = m.mask
        self._pwm_is_current = False
        self._log_odds_is_current = False
        self.alphabet=m.alphabet
        self.length=m.length
        self.background=m.background
        self.beta=m.beta

And then you can do things like:

m=Bio.Motif.parse(f,"AlignAce")
s=ScannableMotif(m)
s.new_score_hit(seq,pos)

I hope this helps...

> --
> Philip Machanick
> Rhodes University, Grahamstown 6140, South Africa
> http://opinion-nation.blogspot.com/
> +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Bartek Wilczynski

From philip.machanick at gmail.com Thu Mar 31 23:21:19 2011
From: philip.machanick at gmail.com (Philip Machanick)
Date: Fri, 1 Apr 2011 13:21:19 +1000
Subject: [Biopython] extending Motif class
In-Reply-To:
References:
Message-ID:

Thanks.
The issue is that parse is not defined in the class but in the module and if I understand this right, this makes it a metaclass method. More below. On Fri, Apr 1, 2011 at 11:07 AM, Bartek Wilczynski wrote: > Hi, > > On Fri, Apr 1, 2011 at 1:49 AM, Philip Machanick < > philip.machanick at gmail.com> wrote: > >> I want to add a new scoring function to the Motif class and in true >> object-oriented spirit would like to do it by deriving a new class rather >> than hacking the existing code. >> >> Well, if you want to keep your code separate from biopython and ba able to > use it with newer versions than maybe yes, but if you think tha your code > code be contributed to biopython and useful for other people, than I'd > consider just contributing via github. > > >> The general structure of my test program (all in 1 file) is: >> >> from Bio.Motif import Motif >> >> class ScannableMotif(Motif): >> def pwm_score_hit(self,sequence,position): >> ## stuff to compute my new score >> >> That's OK, although I suspect something fishy is hidden in your code here. > more later. > > >> from Bio import Motif >> > > you shouldn't need to do that > > >> def main (): >> for motif in >> ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"): >> for i in range(3): >> print >> motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i) >> >> is ScannableMotif now a module? or is "parse" a class method? BTW, you can > parse MEME files with Bio.Motif.parse... > At this stage for proof of concept I'm putting this all in the same file as the main program. > > >> The two different imports appear to be necessary. 
I need the first to be >> able to use the base class to derive a new one, and without the second >> when >> > Yes, you need to import the module to subclass Motif > > >> I use metaclass methods, I get >> >> TypeError: Error when calling the metaclass bases >> module.__init__() takes at most 2 arguments (3 given) >> >> I cannot reproduce this error and it is highly unlikely that it is related > at all to the Biopython code as Bio.Motif does not uses any metaclasses. I > think you cause it by something in your later code: > > This happens specifically if I use Motif.parse (defined in __inti__.py in the Motif module directory). > The other problem: I can't directly invoke a metaclass method on a derived >> instance as above. The snippet below works as expected, but looks like a >> kludge to me. Is there a better way of accessing metaclass methods from a >> derived class object? >> >> for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"): >> motif.__class__ = ScannableMotif # promote to the new class >> > > There you go! don't do this. It is not the way objects "get promoted" to > other classes, You seem to be playing with some python internals here > > for i in range(3): >> print >> motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i) >> >> I think I have the class vs. metaclass concept straight but understanding >> why I need the two different flavours of import would be useful. >> > Don't get me wrong, but I don't think you _need_ any metaclasses here. I > think your problem is that you are trying to change the class of an existing > instance, which (while probably possible in python) is absolutely not the > way to go. If your code is able to produce the correct output using the > complicated imports it's interesting, but probably not the easiest way to > achieve it. However it's hard to say, what exactly is your goal from the > code you provided. 
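
The TypeError quoted in this thread is the one Python raises when a module object, rather than a class, ends up as a base class. Whether that is what happened in the original script is an assumption, but "from Bio import Motif" binds the name Motif to the module, so a class statement that runs while the name refers to the module would fail this way. The sketch below uses a stand-in module instead of Biopython; fake_motif_module, its Motif class, parse and subclass_error are all hypothetical names. It also shows that parse lives on the module, not on the class.

```python
import types

# Stand-in for a package such as Bio.Motif: a module object that holds a
# class and a module-level parse() function. Nothing here is Biopython code.
fake_motif_module = types.ModuleType("fake_motif_module")


class Motif(object):
    """Stand-in for the real Motif class."""


def parse(handle, fmt):
    """Module-level function: reachable as module.parse, not as Motif.parse."""
    return iter([Motif()])


fake_motif_module.Motif = Motif
fake_motif_module.parse = parse


def subclass_error(base):
    """Try to derive a class from `base`; return the TypeError text or None."""
    try:
        class ScannableMotif(base):
            pass
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(subclass_error(fake_motif_module))   # deriving from the module fails
    print(subclass_error(Motif))               # deriving from the class is fine
    print(hasattr(Motif, "parse"))             # parse is not a class attribute
```

Seen this way, no metaclass is involved on the user's side: a function defined in a package's __init__.py is an ordinary module attribute, so it is called through the module and is not inherited by subclasses of Motif.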
> > But, on the more constructive side of things, if you want to subclass > Bio.Motif and add a new method to it, you can just do what you did in the > beginning of your code (provided that you do not mess with m.__class__ or > something) > Then, your problem seems to be that the MEME parser fails to return your > subclass and gives you a Bio.Motif.Motif vanilla class (or MEMEMotif). What > you can do (if you insist on not adding the method to Bio.Motif.Motif), is > to write a constructor able to create a ScannableMotif from a "normal" > motif: > > class ScannableMotif(Motif): > def new_score_hit(self,sequence,position): > return 1 # or something smarter... > def __init__(self,m): #just copy it all... > self.instances = m.instances > self.has_instances=m.has_instances > self.counts = m.counts > self.has_counts=m.has_counts > self.mask = m.mask > self._pwm_is_current = False > self._log_odds_is_current = False > self.alphabet=m.alphabet > self.length=m.length > self.background=m.background > self.beta=m.beta > > And then you can do things like: > m=Bio.Motif.parse(f,"AlignAce") > s=ScannableMotif(m) > s.new_score_hit(seq,pos) > Thanks, this is more like what I was looking for. > I hope this helps... 
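
As a rough sketch of the two suggestions in this thread (a plain function that takes a motif, and a subclass constructed from an already parsed motif), the following uses a toy stand-in rather than the real Bio.Motif class. The Motif class here, its log_odds attribute, and the simple log-odds sum are assumptions made for illustration; the score the poster has in mind may well differ. Copying vars(m) replaces the attribute-by-attribute list, with the caveat that it is a shallow copy, so both objects share the underlying data.

```python
class Motif(object):
    """Hypothetical stand-in: one dict of log-odds values per motif column."""

    def __init__(self, log_odds):
        self.log_odds = log_odds
        self.length = len(log_odds)


def pwm_score_hit(motif, sequence, position):
    """Sum the per-column log-odds for the window starting at `position`."""
    window = sequence[position:position + motif.length]
    if len(window) < motif.length:
        raise ValueError("window runs past the end of the sequence")
    return sum(column[base] for column, base in zip(motif.log_odds, window))


class ScannableMotif(Motif):
    """Subclass created from an existing motif instead of changing __class__."""

    def __init__(self, m):
        # Shallow-copy every instance attribute of the parsed motif at once.
        self.__dict__.update(vars(m))

    def pwm_score_hit(self, sequence, position):
        # Delegate to the plain function so both styles give the same score.
        return pwm_score_hit(self, sequence, position)


if __name__ == "__main__":
    columns = [
        {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
        {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    ]
    parsed = Motif(columns)            # pretend this came from a parser
    scannable = ScannableMotif(parsed)
    for i in range(3):
        print(scannable.pwm_score_hit("ACGT", i))
```

The plain function works on any object that exposes the assumed attributes, which is why it needs no subclass at all; the wrapper is only useful if the method-call syntax is preferred.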
> > > >> -- >> >> Philip Machanick >> Rhodes University, Grahamstown 6140, South Africa >> http://opinion-nation.blogspot.com/ >> +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > > > -- > Bartek Wilczynski > > > -- Philip Machanick (still in Australia for a while; note new mail address) Rhodes University, Grahamstown 6140, South Africa http://opinion-nation.blogspot.com/ +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach From eric.talevich at gmail.com Thu Mar 31 23:53:51 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 31 Mar 2011 23:53:51 -0400 Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython In-Reply-To: References: Message-ID: Hi Michele, On Fri, Mar 25, 2011 at 2:11 AM, Michele wrote: > Hello everyone, > > I'm Michele, a computer scientist and passionate developer who is > currently enrolled in a biomedicine course. That's why I got in touch with > the biopython project and have tried its tools for biological computation. > In case you haven't heard back from anyone about your proposal yet -- you certainly sound qualified for this project and I encourage you to start an application on the main GSoC site (if you haven't already): http://www.google-melange.com/gsoc/homepage/google/gsoc2011 It's best to get the administrative bits out of the way and at least a stub of a proposal online early, to ensure you don't get caught by the deadline next week. Another good initial step is to sign up on GitHub and fork the Biopython source tree for yourself: https://github.com/biopython/biopython http://biopython.org/wiki/SourceCode Regarding the experience in biomolecular structure, I'm a beginner. I have > started studying biomedicine this year and therefore have a lot to learn. I > know a bit about the PDB format and molecular biology. 
I'm sure I can count > on your help to continue learning. > Certainly. :) So that was my not-so-short presentation. I would love to get to know the > community better and work together on the GSoC. Please let me know if you > think I could write a proposal and If you can help me on that. > Thanks for the introduction. We're always happy to help here. Cheers, Eric From mmokrejs at fold.natur.cuni.cz Wed Mar 2 23:00:04 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 03 Mar 2011 00:00:04 +0100 Subject: [Biopython] traditional NCBI blast vs. blast+ Message-ID: <4D6ECBF4.9050006@fold.natur.cuni.cz> Hi, I needed to run and parse some blastn analysis. I had a look into the Tutorial and followed the currently recommended blast+ approach. Somewhat I was not getting any results. It seems to me a formatdb-formatted database is not readable by the blast+ tools. I had a look what tools are installed on my Gentoo Linux along with blastn, blastx and the other tools coming from blast+ bundle and from filenames I just could not guess what am I supposed to run over my fasta target database to make it searchable by blastn. I would prefer if biopython would throw out some error if there are no appropriate files (which names could be guessed depending on the (t)blastn/x/p, etc.). The tutorial mentions that I should lookup an older version of the Tutorial for examples on the old, NCBI blast usage via biopython. It took me a while but I found through Google some docs like that. ;-) On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation, not a single README, HOWTO, Changes, just the binaries and libs. What is installed on other Linux platform, would you mind sharing this with me? I just failed to find by Google what tools should I use instead of the formatdb. I found some FAQ on the NCBI tools++ site but that talked just about C++ API etc., nothing from the user perspective. 
On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being installed because they have same name as the same utility from "old" ncbi-tools (hence overwting their files). The ncbi-tools++ package is not allowed to be installed on stable "systems" (lack of testing or open bug reports) so most people using Gentoo do NOT have ncbi-tools++ and probably won't for a while. I propose to keep support for the "old" blast for a long while. Luckily, the blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML. What do you think? Is the blast+ approach faster, more stable, or just newer so we all like to "upgrade"? Where are some docs and what is the formatdb-like tool in blast+. ;) Thanks, Martin From nuin at genedrift.org Wed Mar 2 23:06:17 2011 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 2 Mar 2011 18:06:17 -0500 Subject: [Biopython] traditional NCBI blast vs. blast+ In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz> References: <4D6ECBF4.9050006@fold.natur.cuni.cz> Message-ID: <4FC7BB7C-9E17-4699-850E-0A4F4E63521B@genedrift.org> Hi Just answering your blast portion of the question: - you have to run makeblastdb in order to create the database. - you should be able to download the source of blast+ to compile, it should compile just fine on your system - and yes, it seems to be faster and more stable than the previous version, at least on the tests I run Paulo On 2011-03-02, at 6:00 PM, Martin Mokrejs wrote: > Hi, > I needed to run and parse some blastn analysis. I had a look into the Tutorial > and followed the currently recommended blast+ approach. Somewhat I was not > getting any results. It seems to me a formatdb-formatted database is not readable > by the blast+ tools. I had a look what tools are installed on my Gentoo Linux > along with blastn, blastx and the other tools coming from blast+ bundle and from > filenames I just could not guess what am I supposed to run over my fasta > target database to make it searchable by blastn. 
I would prefer if biopython > would throw out some error if there are no appropriate files (which names could > be guessed depending on the (t)blastn/x/p, etc.). > The tutorial mentions that I should lookup an older version of the Tutorial > for examples on the old, NCBI blast usage via biopython. It took me a while but > I found through Google some docs like that. ;-) > On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation, > not a single README, HOWTO, Changes, just the binaries and libs. What is installed > on other Linux platform, would you mind sharing this with me? I just failed > to find by Google what tools should I use instead of the formatdb. I found > some FAQ on the NCBI tools++ site but that talked just about C++ API etc., > nothing from the user perspective. > On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being > installed because they have same name as the same utility from "old" ncbi-tools > (hence overwting their files). The ncbi-tools++ package is not allowed to be > installed on stable "systems" (lack of testing or open bug reports) so most people > using Gentoo do NOT have ncbi-tools++ and probably won't for a while. > I propose to keep support for the "old" blast for a long while. Luckily, the > blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML. > > What do you think? Is the blast+ approach faster, more stable, or just newer > so we all like to "upgrade"? Where are some docs and what is the formatdb-like > tool in blast+. ;) > Thanks, > Martin > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Mar 3 10:27:54 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Mar 2011 10:27:54 +0000 Subject: [Biopython] traditional NCBI blast vs. 
blast+
In-Reply-To: <4D6ECBF4.9050006@fold.natur.cuni.cz>
References: <4D6ECBF4.9050006@fold.natur.cuni.cz>
Message-ID:

On Wed, Mar 2, 2011 at 11:00 PM, Martin Mokrejs wrote:
> Hi,
> I needed to run and parse some blastn analysis. I had a look into the Tutorial
> and followed the currently recommended blast+ approach. Somewhat I was not
> getting any results. It seems to me a formatdb-formatted database is not readable
> by the blast+ tools.

I think it is possible to get databases which will work with both legacy BLAST and BLAST+ (since the NCBI only offer one set for NR etc) but I have not tried to mix the two. As pointed out by Paulo, the successor to formatdb in BLAST+ is makeblastdb, so just use that instead.

> I had a look what tools are installed on my Gentoo Linux
> along with blastn, blastx and the other tools coming from blast+ bundle and from
> filenames I just could not guess what am I supposed to run over my fasta
> target database to make it searchable by blastn.

This is very clear in the BLAST+ documentation from the NCBI website (link given below), and is arguably a Gentoo packaging issue.

> I would prefer if biopython
> would throw out some error if there are no appropriate files (which names could
> be guessed depending on the (t)blastn/x/p, etc.).

BLAST+ itself generally gives useful errors.

> The tutorial mentions that I should lookup an older version of the Tutorial
> for examples on the old, NCBI blast usage via biopython. It took me a while but
> I found through Google some docs like that. ;-)

You could have just downloaded one of the old Biopython releases (the zip or tar balls) and looked in the Doc subdirectory. I'll clarify the current text in the tutorial to point people there.

> On Gentoo the ncbi-tools++ (aka blast+) package installs no documentation,
> not a single README, HOWTO, Changes, just the binaries and libs.

File a bug with Gentoo?

> What is installed
> on other Linux platform, would you mind sharing this with me? I just failed
> to find by Google what tools should I use instead of the formatdb. I found
> some FAQ on the NCBI tools++ site but that talked just about C++ API etc.,
> nothing from the user perspective.

You are probably looking for this, linked to from the BLAST+ download page:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/user_manual.pdf

> On Gentoo, the {asn2asn,rpsblast,test_regexp} from ncbi-tools++ is not being
> installed because they have same name as the same utility from "old" ncbi-tools
> (hence overwriting their files). The ncbi-tools++ package is not allowed to be
> installed on stable "systems" (lack of testing or open bug reports) so most people
> using Gentoo do NOT have ncbi-tools++ and probably won't for a while.

I was aware of the name clash for rpsblast, and yes, this is a problem the NCBI could have avoided. You could just ignore the Gentoo package and get BLAST+ directly from the NCBI.

> I propose to keep support for the "old" blast for a long while.

We've already delayed deprecating the ``legacy'' BLAST wrappers, but probably we should do that after releasing Biopython 1.57.

> Luckily, the
> blastall -m 7 xml output seems to be parseable with Bio.Blast.NCBIXML.

The NCBI kept the same XML output format, and in fact the plain text output is close enough that our old text parser could be updated to cope.

> What do you think? Is the blast+ approach faster, more stable, or just newer
> so we all like to "upgrade"?

I like BLAST+ for some new functionality (FASTA vs FASTA for example), but since the NCBI is dropping the ``legacy'' BLAST you will have to upgrade at some point.

> Where are some docs and what is the formatdb-like tool in blast+. ;)

I've given links to the docs above, they're linked to on the NCBI website.
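
As a rough sketch of the makeblastdb step mentioned above, followed by a blastn search that writes XML for Bio.Blast.NCBIXML: the file names (target.fasta, query.fasta, target_db, result.xml), the e-value, and the helper function names are placeholders chosen for illustration. The functions only build argument lists; run() is the single place that actually calls the BLAST+ binaries, which must be on the PATH.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the arguments that format a FASTA file as a BLAST+ database."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype,
            "-out", db_name]


def blastn_command(query_path, db_name, xml_path, evalue=0.001):
    """Build the arguments for a blastn search with XML output (-outfmt 5)."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "5", "-out", xml_path]


def run(command):
    """Run one command; raises CalledProcessError if the tool exits non-zero."""
    subprocess.check_call(command)


if __name__ == "__main__":
    # Only print the command lines here; call run(...) on each to execute them.
    print(" ".join(makeblastdb_command("target.fasta", "target_db")))
    print(" ".join(blastn_command("query.fasta", "target_db", "result.xml")))
```

Because a failed makeblastdb or blastn run raises an exception in run(), a missing or wrongly formatted database shows up as an error instead of an empty result.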
Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 3 20:32:11 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Mar 2011 20:32:11 +0000 Subject: [Biopython] Fwd: [Bosc] Bioinformatics Open Source Conference (BOSC 2011)--Call for Abstracts In-Reply-To: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov> References: <3922D2BE-5A99-4CDE-91AB-B311C42E10CE@lbl.gov> Message-ID: Dear Biopythoneers, BOSC will be in Vienna, Austria this year. Peter ---------- Forwarded message ---------- From: Nomi Harris Date: Thu, Mar 3, 2011 at 7:37 PM Subject: [Bosc] Bioinformatics Open Source Conference (BOSC 2011)--Call for Abstracts To: bosc-announce at lists.open-bio.org, members at open-bio.org, GMOD Announcements List , GMOD Developers List Cc: Nomi Harris We invite you to submit an abstract to BOSC 2011! ?Please forward this message as appropriate, and forgive multiple postings. Call for Abstracts for the 12th Annual Bioinformatics Open Source Conference (BOSC 2011) An ISMB 2011 Special Interest Group (SIG) Dates: July 15-16, 2011 Location: Vienna, Austria Web site: http://www.open-bio.org/wiki/BOSC_2011 Email: bosc at open-bio.org BOSC announcements mailing list: http://lists.open-bio.org/mailman/listinfo/bosc-announce Important Dates: April 18, 2011: Deadline for submitting abstracts to BOSC 2011 May 9, 2011: Notifications of accepted abstracts emailed to corresponding authors July 13-14, 2011: Codefest 2011 programming session (see http://www.open-bio.org/wiki/Codefest_2011 for details) July 15-16, 2011: BOSC 2011 July 17-19, 2011: ISMB 2011 The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. 
To be considered for acceptance, software systems representing the
central topic in a presentation submitted to BOSC must be licensed with
a recognized Open Source License, and be freely available for download
in source code form.

We invite you to submit abstracts for talks and posters. Sessions include:
- Approaches to parallel processing
- Cloud-based approaches to improving software and data accessibility
- The Semantic Web in open source bioinformatics
- Data visualization
- Tools for next-generation sequencing
- Other Open Source software

In addition to the above sessions, there will be a panel discussion
about "Meeting the challenges of inter-institutional collaboration". We
are also working to arrange a joint session with one of the other ISMB
SIGs.

Thanks to generous sponsorship from Eagle Genomics and an anonymous
donor, we are pleased to announce a competition for three Student Travel
Awards for BOSC 2011. Each winner will be awarded $250 to defray the
costs of travel to BOSC 2011.

For instructions on submitting your abstract, please visit
http://www.open-bio.org/wiki/BOSC_2011#Abstract_Submission_Information

BOSC 2011 Organizing Committee:
Nomi Harris and Peter Rice (co-chairs); Brad Chapman, Peter Cock, Erwin
Frise, Darin London, Ron Taylor
_______________________________________________
BOSC mailing list
BOSC at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bosc

From hlapp at drycafe.net Fri Mar 4 23:26:25 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Mar 2011 18:26:25 -0500
Subject: [Biopython] Informatics job opportunity at NESCent
Message-ID: <1878F27F-000D-4C80-B9EA-A83F7887828F@drycafe.net>

(Apologies if you receive multiple copies, and also if you are not
interested in job opportunities. In my defense, quite a few people on
Bio* lists might qualify for (let alone enjoy) the position. And if you
know someone who might be interested please forward.)
=================================================== User Interface Design and Web Application Developer =================================================== The National Evolutionary Synthesis Center (NESCent) seeks a creative and enthusiastic individual to design user interfaces and web applications for scientific applications. The incumbent will work as part of a small informatics team in close collaboration with domain scientists. NESCent is an NSF-funded center dedicated to cross-disciplinary research in evolutionary science. Our informatics team works closely with visiting and resident scientists to support their custom software and database development needs. All NESCent software products are open- source, and the Center has a number of initiatives to actively promote collaborative development of community software resources (informatics.nescent.org). Above all, we are enthusiastic about our work, about the mission of the Center, and about the contribution of informatics to that mission. Job description: The incumbent will design and develop user interfaces and web applications for databases and other software tools for sponsored scientists and staff. The job responsibilities include all stages of the software development process, including requirements gathering, design, implementation, release packaging and documentation, as part of a small team (typically 2-3 individuals) following project management best practices. We expect the incumbent to present their work at conferences and contribute to publications with scientific collaborators; interact regularly with visiting and resident scientists, other members of the informatics team and Center staff; and generally serve as an expert resource for Center personnel. The position provides opportunities for professional development. 
Most informatics staff work at our Durham NC offices, located adjacent to Duke University, but we do support a wide range of technologies for virtual communication with off-site staff and collaborators. Required Qualifications: * Demonstrated success collaborating with clients on custom software solutions * Experience with various stages of the software development cycle * Expertise in development and testing of user interface designs * Excellent communication skills, both virtual and face-to-face * A four-year college degree in Computer Science, Bioinformatics or a related field Preferred Qualifications: * M.S. or Ph.D. in Computer Science, Bioinformatics or related field along with demonstrated interest in science, particularly biology * Expertise in rapid application development and respective programming technologies and languages (e.g., modern scripting languages and web-application frameworks such as Python/Django, Ruby/ Ruby-on-Rails, and Perl/Catalyst), fluency in Java programming, and prior experience in relational database programming (PostgreSQL or MySQL) * Expertise in dynamic and interactive web technologies (JavaScript, CGI), web service (SOAP, REST, XML, JSON) and semantic web technologies * Experience with open-source, and collaborative, software development, software usability design and assessment * Expertise in graphic design, data visualization and/or scientific data integration How to apply: Please send cover letter, resume and contact information for three references to Dr. Karen Cranston, Training Coordinator and Bioinformatics Project Manager (karen.cranston at nescent.org). Review of applications will begin March 21, 2011. Informal inquires or requests for additional information may be directed to Dr. Cranston by email or phone (+1-919-613-2275). 
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From p.j.a.cock at googlemail.com Mon Mar 7 14:19:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 7 Mar 2011 14:19:11 +0000
Subject: [Biopython] Tutorial proofreading?
Message-ID:

Hi all,

We're planning to do the Biopython 1.57 release soon, and one thing
volunteer help would be useful for is our documentation - in particular
the tutorial.

These links are for the current tutorial, which at the time of writing
means Biopython 1.56:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

These links are for the latest in-progress tutorial (automatically
updated nightly from the git repository):
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

I would like some volunteers to proofread this and report any problems,
suggestions or additions, please. Ideally I'd like people to check the
examples work (although some will need the latest Biopython installed
from the source code). Even reporting minor typos is useful, as fixing
them will make a better impression for newcomers reading this.

Thanks,

Peter

P.S. The tutorial source file is here, if you are interested,
https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex

From anaryin at gmail.com Mon Mar 7 14:21:19 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 7 Mar 2011 15:21:19 +0100
Subject: [Biopython] Tutorial proofreading?
In-Reply-To:
References:
Message-ID:

Will have a look at it this week, I noticed some problems in the Bio.PDB
section (outdated code).

Cheers!
From rmb32 at cornell.edu Mon Mar 7 16:37:32 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 07 Mar 2011 11:37:32 -0500 Subject: [Biopython] Google Summer of Code project ideas Message-ID: <4D7509CC.3040604@cornell.edu> Hi all, I'm going to be OBF project admin again this year for Google Summer of code. OBF's application is due later this week, and we need to update our project ideas on the OBF wiki page and on each project's individual wiki pages. So, for each of the OBF projects that wants to do GSoC again this year, please: a.) Update the list of project ideas on your project's GSoC page (BioPython, BioPerl, BioRuby, etc). Add new ones, remove ones that have already been done or no longer relevant, etc. b.) Update the list of project ideas on the main OBF GSoC page (http://www.open-bio.org/wiki/Google_Summer_of_Code) to match. c.) Let me know via email that you have done so and it's ready for Google to peruse. Please have the updates done, if possible, by this Friday (March 11). The number and quality of the project ideas are part of the evaluation process for whether OBF is accepted as a Summer of Code organization again this year, so let's come up with some good ones. :-) Rob ---- Robert Buels (prospective) 2011 OBF GSoC Organization Admin From p.cherepanov at imperial.ac.uk Tue Mar 8 02:42:26 2011 From: p.cherepanov at imperial.ac.uk (Peter Cherepanov) Date: Tue, 8 Mar 2011 02:42:26 +0000 Subject: [Biopython] define circular DNA (?) Message-ID: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk> is there an easy way to define a circular DNA sequence in BioPython? It would be useful to have something like: my_seq = Seq('ATGCATGC...ATGC', circular_dna) am I missing something obvious?? 
Peter

From komalsnehal1991 at gmail.com Tue Mar 8 07:58:11 2011
From: komalsnehal1991 at gmail.com (Komal S)
Date: Tue, 8 Mar 2011 13:28:11 +0530
Subject: [Biopython] Biopython Projects
Message-ID:

Hi everyone,

I'm Komal, a Junior Undergraduate Student from India studying
Bioengineering. I'm a fan of Python and I love Computational Biology and
I plan to do my further studies in the same. I went through the projects
on the Biopython page. I was very much interested in the RNA Structure
project mentioned. Any contribution which I make will help me a lot and
the organisation too. In fact, I am currently doing a project on RNA
Editing. I'll be very happy to integrate my knowledge. Please help me on
how I should proceed.

Komal

From p.j.a.cock at googlemail.com Tue Mar 8 08:45:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 08:45:31 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 2:42 AM, Peter Cherepanov wrote:
> is there an easy way to define a circular DNA sequence in BioPython?
>
> It would be useful to have something like:
>
> my_seq = Seq('ATGCATGC...ATGC', circular_dna)
>
> am I missing something obvious??
>
> Peter

No, but how would you expect it to act? We've talked
about such an object before... I'd have to go through my
old emails but I recall there being some annoying corner
cases to consider with the slice method (__getitem__).

Peter

From p.j.a.cock at googlemail.com Tue Mar 8 10:48:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 10:48:13 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov wrote:
> I suppose if a DNA sequence is kept as a simple Python string, there is
> no easy way to have it "circular". I am a beginner in Python (I use it only
> occasionally, to solve very specific and simple-minded tasks, when manual
> match/cut-and-paste operations become too much of a burden). Having
> spent an extra hour to hack out and debug a piece of code to match/extract
> to/from circular plasmid sequences kept as Python strings, I thought: hey,
> wait a minute, there is such thing as BioPython, which should have made
> this task so much easier...
>
> Is there a way to "enhance" the Seq object? (or may be I do not know what
> I am talking about...).
>
> thanks a lot for responding!
>
> with best wishes,
>
> Peter

What I had in mind was a new class, CircularSeq, which would subclass
the current Biopython Seq object, and still use a string internally for the
sequence.

We could then modify the slice behaviour so that, perhaps, this would
work by wrapping the origin:

c = CircularSeq('ACGTACGTACGT')
assert len(c)==12
print c[10:14]

It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
14 as wrapped to 2, returning the four bases GTAC.

Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
same as 'ACGTACGTACGT'[10:] which is the last two letters only.
This means anyone (or more importantly, any code) expecting the
string-like behaviour will get a nasty surprise (or a bug).

Another example, what about c[-2:]? For a plain string you'd
get the last two letters. For a circular sequence you might think
that should represent starting two before the origin, thus giving
the last two letters plus the whole sequence? Also, c[-2:2] could
mean the last two letters plus the first two letters, but for a
plain Python string that returns an empty string.
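The plain-string behaviour described above is easy to check directly. The sketch below uses an ordinary Python string rather than a Seq object, so it runs without Biopython. The wrapped_slice helper is purely hypothetical (no such function exists in Biopython) and shows just one of the possible wrap-around policies under discussion.

```python
s = "ACGTACGTACGT"

# Plain Python strings clip out-of-range slices instead of wrapping.
assert s[10:14] == "GT"      # same as s[10:], only the last two letters
assert s[10:14] == s[10:]
assert s[-2:] == "GT"        # the last two letters
assert s[-2:2] == ""         # start (index 10) is past the end (index 2)


def wrapped_slice(seq, start, end):
    """Return the letters from start up to (not including) end, wrapping the origin.

    Hypothetical helper: expects 0 <= start <= end, and keeps going round
    the circle if end is more than one full turn away.
    """
    n = len(seq)
    return "".join(seq[i % n] for i in range(start, end))


# The wrap-around reading of c[10:14]: two letters before the origin, two after.
assert wrapped_slice(s, 10, 14) == "GTAC"
assert wrapped_slice(s, 10, 14) == s[10:12] + s[0:2]
```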
Note that due to the way Python indexing works, single letter access is fine for negative indices, c[-2] would give the second last letter, 'G', which is consistent with wrapped counting back from the origin. We could also make c[14] wrap round to c[2] in this length 12 example (although there is a small risk of breaking code expecting an IndexError in this case). There would be lots of other things to implement, like "in" and the find methods would need to check the substring across the origin. Then (for nucleotides), we'd need to ensure reverse_complement and complement also give a CircularSeq, likewise perhaps for the transcribe and back_transcribe. The translate method is particularly tricky as you can have an infinite reading frame, which might be represented as a circular protein sequence? All in all, it is quite a lot of work, and there are several tricky bits where the desired behaviour is not clear cut. Could we come up with something useful or not? Peter P.S. Please CC the mailing list in your replies :) From p.cherepanov at imperial.ac.uk Tue Mar 8 10:30:08 2011 From: p.cherepanov at imperial.ac.uk (Peter Cherepanov) Date: Tue, 8 Mar 2011 10:30:08 +0000 Subject: [Biopython] define circular DNA (?) In-Reply-To: References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk> Message-ID: <503B48D3-61BA-4C77-A441-00942366FFB4@imperial.ac.uk> I suppose if a DNA sequence is kept as a simple Python string, there is no easy way to have it "circular". I am a beginner in Python (I use it only occasionally, to solve very specific and simple-minded tasks, when manual match/cut-and-paste operations become too much of a burden). Having spent an extra hour to hack out and debug a piece of code to match/extract to/from circular plasmid sequences kept as Python strings, I thought: hey, wait a minute, there is such thing as BioPython, which should have made this task so much easier... Is there a way to "enhance" the Seq object? 
(or may be I do not know what I am talking about...). thanks a lot for responding! with best wishes, Peter On 8 Mar 2011, at 08:45, Peter Cock wrote: > On Tue, Mar 8, 2011 at 2:42 AM, Peter Cherepanov wrote: >> is there an easy way to define a circular DNA sequence in BioPython? >> >> It would be useful to have something like: >> >> my_seq = Seq('ATGCATGC...ATGC', circular_dna) >> >> am I missing something obvious?? >> >> Peter > > No, but how would you expect it to act? We've talked > about such an object before... I'd have to go though my > old emails but I recall there being some annoying corner > cases to consider with the slice method (__getitem__). > > Peter From moritz.beber at googlemail.com Tue Mar 8 11:32:44 2011 From: moritz.beber at googlemail.com (Moritz Beber) Date: Tue, 08 Mar 2011 12:32:44 +0100 Subject: [Biopython] define circular DNA (?) In-Reply-To: References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk> Message-ID: <4D7613DC.2050506@googlemail.com> On 03/08/2011 11:48 AM, Peter Cock wrote: > On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov > wrote: >> I suppose if a DNA sequence is kept as a simple Python string, there is >> no easy way to have it "circular". I am a beginner in Python (I use it only >> occasionally, to solve very specific and simple-minded tasks, when manual >> match/cut-and-paste operations become too much of a burden). Having >> spent an extra hour to hack out and debug a piece of code to match/extract >> to/from circular plasmid sequences kept as Python strings, I thought: hey, >> wait a minute, there is such thing as BioPython, which should have made >> this task so much easier... >> >> Is there a way to "enhance" the Seq object? (or may be I do not know what >> I am talking about...). >> >> thanks a lot for responding! >> >> with best wishes, >> >> Peter > What I had in mind was a new class, CircularSeq, which would subclass > the current Biopython Seq object, and still use a string internally for the > sequence. 
> > We could then modify the slice behaviour so that, perhaps this would > by work wrapping the origin: > > c = CircularSeq('ACGTACGTACGT') > assert len(c)==12 > print c[10:14] > > It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat > 14 as wrapped to 2, returning the four bases GTAC. > > Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the > same as 'ACGTACGTACGT'[10:] which is the last two letters only. > This means anyone (or more importantly, any code) expecting the > string like behaviour will get a nasty surprise (or a bug). > > Another example, what about c[-2:]? For a plain string you'd > get the last two letters. For a circular sequence you might think > that should represent starting two before the origin, thus giving > the last two letter plus the whole sequence? Also, c[-2:2] could > mean the last two letters plus the first two letters, but for a > plain python string that returns an empty string. > > Note that due to the way Python indexing works, single letter > access is fine for negative indices, c[-2] would give the second > last letter, 'G', which is consistent with wrapped counting back > from the origin. We could also make c[14] wrap round to c[2] in > this length 12 example (although there is a small risk of breaking > code expecting an IndexError in this case). > > There would be lots of other things to implement, like "in" and the > find methods would need to check the substring across the origin. > Then (for nucleotides), we'd need to ensure reverse_complement > and complement also give a CircularSeq, likewise perhaps for the > transcribe and back_transcribe. The translate method is particularly > tricky as you can have an infinite reading frame, which might be > represented as a circular protein sequence? > > All in all, it is quite a lot of work, and there are several tricky bits > where the desired behaviour is not clear cut. Could we come up > with something useful or not? > > Peter > > P.S. 
Please CC the mailing list in your replies :)
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

If you just need circular behaviour in a small number of use cases, you
could consider wrapping the sequence in a cycle iterator
http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle

From p.j.a.cock at googlemail.com Tue Mar 8 11:40:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 11:40:08 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To: <4D7613DC.2050506@googlemail.com>
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk> <4D7613DC.2050506@googlemail.com>
Message-ID:

On Tue, Mar 8, 2011 at 11:32 AM, Moritz Beber wrote:
>
> If you just need circular behaviour in a small number of use cases, you
> could consider wrapping the sequence in a cycle iterator
> http://docs.python.org/release/2.6/library/itertools.html?highlight=cycle#itertools.cycle
>

That might need a lot of memory if used on a long sequence
like a bacterial genome, but an interesting idea.

Peter

From p.cherepanov at imperial.ac.uk Tue Mar 8 12:12:26 2011
From: p.cherepanov at imperial.ac.uk (Peter Cherepanov)
Date: Tue, 8 Mar 2011 12:12:26 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

ideally, it would be an object where the last letter is hard-linked to
the first. For example, we should be able to define:

c = CircularSeq('ATGCGGGGA')

where:

c[1:9] equals ATGCGGGGA (or, more awkwardly, c[0:9], if the original
Python string numbering must be retained for some reason)
c[8:7] equals GAATGCATG
c[1:1] equals A (on a python string it is c[0:1] = A, of course)

Ideally, we would want to number such sequences from 1, after all these
are the kind of objects we deal with in biology.
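The origin-spanning search raised earlier in the thread can be sketched with plain strings. This is only an illustration, not an existing Biopython method: circular_find is a hypothetical helper, and it reports zero-based positions in line with Python's own str.find.

```python
def circular_find(seq, sub):
    """Zero-based start of the first match of sub on the circular seq, or -1."""
    n = len(seq)
    if n == 0 or len(sub) > n:
        return -1
    # Append the first len(sub) - 1 letters so that matches which run
    # across the origin show up in an ordinary linear search.
    extended = seq + seq[:max(len(sub) - 1, 0)]
    return extended.find(sub)


plasmid = "ATGCGGGGA"
assert plasmid.find("GGAATG") == -1          # a linear search misses it
assert circular_find(plasmid, "GGAATG") == 6  # found across the origin
```

Only len(sub) - 1 extra letters are copied, so the memory overhead stays small even for a genome-sized sequence.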
And, most importantly of all, it must be able to:
c.find('GGAATG') to return "7"

Peter

On 8 Mar 2011, at 10:48, Peter Cock wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>>
>> with best wishes,
>>
>> Peter
>
> What I had in mind was a new class, CircularSeq, which would subclass
> the current Biopython Seq object, and still use a string internally for the
> sequence.
>
> We could then modify the slice behaviour so that, perhaps this would
> by work wrapping the origin:
>
> c = CircularSeq('ACGTACGTACGT')
> assert len(c)==12
> print c[10:14]
>
> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
> 14 as wrapped to 2, returning the four bases GTAC.
>
> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
> This means anyone (or more importantly, any code) expecting the
> string like behaviour will get a nasty surprise (or a bug).
>
> Another example, what about c[-2:]? For a plain string you'd
> get the last two letters. For a circular sequence you might think
> that should represent starting two before the origin, thus giving
> the last two letter plus the whole sequence?
Also, c[-2:2] could
> mean the last two letters plus the first two letters, but for a
> plain python string that returns an empty string.
>
> Note that due to the way Python indexing works, single letter
> access is fine for negative indices, c[-2] would give the second
> last letter, 'G', which is consistent with wrapped counting back
> from the origin. We could also make c[14] wrap round to c[2] in
> this length 12 example (although there is a small risk of breaking
> code expecting an IndexError in this case).
>
> There would be lots of other things to implement, like "in" and the
> find methods would need to check the substring across the origin.
> Then (for nucleotides), we'd need to ensure reverse_complement
> and complement also give a CircularSeq, likewise perhaps for the
> transcribe and back_transcribe. The translate method is particularly
> tricky as you can have an infinite reading frame, which might be
> represented as a circular protein sequence?
>
> All in all, it is quite a lot of work, and there are several tricky bits
> where the desired behaviour is not clear cut. Could we come up
> with something useful or not?
>
> Peter
>
> P.S. Please CC the mailing list in your replies :)

From p.j.a.cock at googlemail.com Tue Mar 8 13:24:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 8 Mar 2011 13:24:07 +0000
Subject: [Biopython] define circular DNA (?)
In-Reply-To:
References: <7A53E882-A840-4CBF-B98A-4CEF482C1FB9@imperial.ac.uk>
Message-ID:

On Tue, Mar 8, 2011 at 12:12 PM, Peter Cherepanov wrote:
> ideally, it would be an object were the last letter is hard-linked to the first. For example, we should be able to define:
>
> c = CircularSeq('ATGCGGGGA')
>
> where:
>
> c[1:9] equals ATGCGGGGA (or, more awkwardly, c[0:9], if the original
> Python string numbering must be retained for some reasons)
> c[8:7] equals GAATGCATG
> c[1:1] equals A (on a python string it is c[0:1] = A, of course)
>
> Ideally, we would want to number such sequences from 1, after all these
> are the kind of objects we deal in biology.

Absolutely not - it would put the circular sequence completely
out of sync with the existing sequence objects in Biopython
and the Python string. Don't worry - you'll get used to zero
based counting, and the Python slicing is very beautiful once
you understand it.

> And, most importantly of all, if must be able to:
> c.find('GGAATG') to return "7"
>

Well, 6 in zero based counting, but yes, that would be the
expected result for find (and similarly for rfind). We'd also
need to do something with the split and rsplit methods to
include looking for matches over the origin.

Peter

From Leighton.Pritchard at scri.ac.uk Tue Mar 8 13:28:11 2011
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Tue, 8 Mar 2011 13:28:11 +0000
Subject: [Biopython] define circular DNA (?)
Message-ID:

I've got 2p hanging around, so...

On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock"
wrote:

> On Tue, Mar 8, 2011 at 10:23 AM, Peter Cherepanov
> wrote:
>> I suppose if a DNA sequence is kept as a simple Python string, there is
>> no easy way to have it "circular". I am a beginner in Python (I use it only
>> occasionally, to solve very specific and simple-minded tasks, when manual
>> match/cut-and-paste operations become too much of a burden). Having
>> spent an extra hour to hack out and debug a piece of code to match/extract
>> to/from circular plasmid sequences kept as Python strings, I thought: hey,
>> wait a minute, there is such thing as BioPython, which should have made
>> this task so much easier...
>>
>> Is there a way to "enhance" the Seq object? (or may be I do not know what
>> I am talking about...).
>>
>> thanks a lot for responding!
>> >> with best wishes, >> >> Peter > > What I had in mind was a new class, CircularSeq, which would subclass > the current Biopython Seq object, and still use a string internally for the > sequence. That seems sensible. The main issue, as I see it, is that the physical object is naturally represented by a circularly-linked list, and we have for circular sequences an indexing/co-ordinate system with a defined zero start/end point (which is essentially arbitrary - though is usually the origin of replication for bacterial chromosomes). This leads to a conflict between our natural expectations of Python indexing, and the meaning of the indexing on the physical object that's being represented. Whatever the ultimate implementation, there will either have to be a compromise between these two representations, or one or other view will be ignored. There will inevitably be value judgements that someone is unhappy with ;) > We could then modify the slice behaviour so that, perhaps this would > by work wrapping the origin: > > c = CircularSeq('ACGTACGTACGT') > assert len(c)==12 > print c[10:14] > > It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat > 14 as wrapped to 2, returning the four bases GTAC. That makes sense in Python indexing terms, but not in terms of the co-ordinate system for navigating the circular DNA. To be consistent with location information from GenBank and other sources where features wrap the origin of circular DNA, we would need c[10:2] to return the same result as c[10:14]. That gives us potentially the same problem as c[-2:2], as it currently returns an empty string. We'd have to modify Python slicing/indexing behaviour quite a bit to implement this 'naturally'. However, I don't think we should ignore the Python indexing format here, because we might want the ten bases after the base with co-ordinate 6 with c[6:6+10], which would give us a physically and conceptually sensible linear sequence that crosses the origin. 
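The two readings above, a GenBank-style location whose end lies before its start and a Python-style c[6:6+10], can be reconciled in a small sketch. It works on plain strings, and circular_slice is a hypothetical helper rather than anything in Biopython. Treating end < start as "wrap once round the origin" and requiring the start to lie on the sequence are both assumptions made for the illustration, not decisions taken in this thread.

```python
def circular_slice(seq, start, end):
    """Zero-based, end-exclusive slice of a circular sequence.

    Assumptions for this sketch: start must lie on the sequence, and an
    end smaller than start means the feature wraps the origin once.
    """
    n = len(seq)
    if not 0 <= start < n:
        raise IndexError("start must lie on the sequence")
    if end < start:
        end += n  # e.g. a feature running from 10 round to 2
    return "".join(seq[i % n] for i in range(start, end))


c = "ACGTACGTACGT"
# Both spellings of the origin-wrapping slice give the same four bases.
assert circular_slice(c, 10, 2) == circular_slice(c, 10, 14) == "GTAC"
# Ten bases starting at co-ordinate 6: six up to the origin, four after it.
assert circular_slice(c, 6, 6 + 10) == c[6:12] + c[0:4]
```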
We'd probably want to do the obvious things with modular arithmetic, so that we don't return, say, three concatenated linearised circular sequences to a request like c[0:36] or c[6:42]. > Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the > same as 'ACGTACGTACGT'[10:] which is the last two letters only. > This means anyone (or more importantly, any code) expecting the > string like behaviour will get a nasty surprise (or a bug). I'm not sure it's wise to constrain functionality and adequate representation of a (very important! - showing my bacterial bias) physical structure to maintain that level of consistency with String. For instance, what would CircularSeq + Seq mean? Physically, and conceptually, not a lot. So we might want to deprecate the __add__ method for this object - not typical String behaviour but, in my opinion, appropriate. (You might remember that I was also generally not in favour of treating Seq objects as idealised Strings, so there's another bias for you ;) ) > Note that due to the way Python indexing works, single letter > access is fine for negative indices, c[-2] would give the second > last letter, 'G', which is consistent with wrapped counting back > from the origin. We could also make c[14] wrap round to c[2] in > this length 12 example (although there is a small risk of breaking > code expecting an IndexError in this case). I wouldn't be in favour that behaviour in a general sense, though I don't see how to avoid it cleanly. I think it would be best to be strict with indexing to the co-ordinate system to avoid possible degeneracy of feature locations. If we had a SNP at position 2, we could equally well associate it with any one of an infinite number of positions kl+2 where k is an integer and l is the sequence length, without modifying the computational result. 
I'm not keen on that kind of woolliness, but I think that it could possibly be avoided by modifying indexing to require at least one index that lies in the range [-l,l], and using modular arithmetic for slicing so that, for the example above, c[18:26] would not be treated as the valid slice c[6:14], but would instead throw an IndexError. > There would be lots of other things to implement, like "in" and the > find methods would need to check the substring across the origin. > Then (for nucleotides), we'd need to ensure reverse_complement > and complement also give a CircularSeq, likewise perhaps for the > transcribe and back_transcribe. Not to mention the other Biopython functions/methods that expect String-like indexing. Maybe a cast (of sorts) between CircularSeq and Seq would be useful for that, though I can imagine great problems, there. > The translate method is particularly > tricky as you can have an infinite reading frame, which might be > represented as a circular protein sequence? I would think that the test for that particular condition should be fairly straightforward (is there at least one stop codon in each of the six frames, taking into account the origin?). > All in all, it is quite a lot of work, and there are several tricky bits > where the desired behaviour is not clear cut. Could we come up > with something useful or not? I think that there's every possibility of coming up with something useful - the question is to what degree it fits the Biopython/Python idiom, or 'looks like' the physical object, and whether it gets included in Biopython. L. 
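The stop-codon test suggested above for spotting an endless reading frame can be sketched as follows. This is a hedged illustration on plain DNA strings. The helper names are hypothetical, only the standard stop codons TAA, TAG and TGA are considered, and the input is assumed to contain just A, C, G and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def frame_has_stop(seq, frame):
    """True if reading the circular seq from offset frame ever meets a stop codon."""
    n = len(seq)
    # Three laps of the circle always return to the starting offset,
    # so n codons cover every codon this frame can ever read.
    for k in range(n):
        start = frame + 3 * k
        codon = "".join(seq[(start + j) % n] for j in range(3))
        if codon in STOP_CODONS:
            return True
    return False


def has_endless_frame(seq):
    """True if any of the six frames can be read round the circle forever."""
    strands = (seq, reverse_complement(seq))
    return any(not frame_has_stop(strand, frame)
               for strand in strands for frame in range(3))


assert has_endless_frame("ATGAAA")    # ATG AAA repeats with no stop
assert not has_endless_frame("TAAC")  # every frame on both strands hits a stop
```

When the length is not a multiple of three, each frame runs into the next on every lap, which is why the loop covers three laps rather than one.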
-- Dr Leighton Pritchard MRSC Plant Pathology Programme, SCRI (C block) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel: No telephone during office refurbishment [The James Hutton Institute logo] Please note that from 1 April 2011, SCRI and the Macaulay Land Use Research Institute will join to become The James Hutton Institute. ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From p.j.a.cock at googlemail.com Tue Mar 8 13:58:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 8 Mar 2011 13:58:03 +0000 Subject: [Biopython] define circular DNA (?) In-Reply-To: References: Message-ID: On Tue, Mar 8, 2011 at 1:28 PM, Leighton Pritchard wrote: > I've got 2p hanging around, so... 
>
> On 08/03/2011 Tuesday, March 8, 10:48, "Peter Cock" wrote:
>>
>> What I had in mind was a new class, CircularSeq, which would subclass
>> the current Biopython Seq object, and still use a string internally for the
>> sequence.
>
> That seems sensible. The main issue, as I see it, is that the physical
> object is naturally represented by a circularly-linked list, and we have for
> circular sequences an indexing/co-ordinate system with a defined zero
> start/end point (which is essentially arbitrary - though is usually the
> origin of replication for bacterial chromosomes). This leads to a conflict
> between our natural expectations of Python indexing, and the meaning of the
> indexing on the physical object that's being represented.
>
> Whatever the ultimate implementation, there will either have to be a
> compromise between these two representations, or one or other view will be
> ignored. There will inevitably be value judgements that someone is unhappy
> with ;)

Indeed.

>> We could then modify the slice behaviour so that, perhaps, this would
>> work by wrapping the origin:
>>
>> c = CircularSeq('ACGTACGTACGT')
>> assert len(c)==12
>> print c[10:14]
>>
>> It *might* be nice to allow that to act like c[10:12] + c[0:2], i.e. treat
>> 14 as wrapped to 2, returning the four bases GTAC.
>
> That makes sense in Python indexing terms, but not in terms of the
> co-ordinate system for navigating the circular DNA. To be consistent with
> location information from GenBank and other sources where features wrap the
> origin of circular DNA, we would need c[10:2] to return the same result as
> c[10:14]. That gives us potentially the same problem as c[-2:2], as it
> currently returns an empty string. We'd have to modify Python
> slicing/indexing behaviour quite a bit to implement this 'naturally'.
>
> However, I don't think we should ignore the Python indexing format here,
> because we might want the ten bases after the base with co-ordinate 6 with
> c[6:6+10], which would give us a physically and conceptually sensible linear
> sequence that crosses the origin.

I think we agree that c[10:14] and c[10:10+4] should give the four bases GTAC wrapping the origin when c is the circular sequence ACGTACGTACGT, equivalently c[10:12] + c[0:2] using Python slicing. Likewise for your example c[6:6+10] or c[6:16], this should give ten bases wrapping the origin, equivalently c[6:12] + c[0:4] using Python slicing.

> We'd probably want to do the obvious things with modular arithmetic, so that
> we don't return, say, three concatenated linearised circular sequences to a
> request like c[0:36] or c[6:42].

I disagree, returning the three concatenated linearised circular sequences is what I would expect. This is one of the debatable issues that will divide people.

Consider the (special and artificial) case of a circular plasmid with an ORF wrapping round the origin (once, twice or infinitely many times): the ORF sequence is longer than the linearised plasmid, so slicing with concatenation would be useful. e.g.

http://www.ncbi.nlm.nih.gov/pubmed/9740124
Perriman and Ares (1998), Circular mRNA can direct translation of extremely long repeating-sequence proteins in vivo.

and:

http://dx.doi.org/10.1385/1-59259-280-5:069
Perriman (2002), Circular mRNA Encoding for Monomeric and Polymeric Green Fluorescent Protein

(Very cool work)

>> Note that with a plain string, 'ACGTACGTACGT'[10:14] gives the
>> same as 'ACGTACGTACGT'[10:] which is the last two letters only.
>> This means anyone (or more importantly, any code) expecting the
>> string like behaviour will get a nasty surprise (or a bug).
>
> I'm not sure it's wise to constrain functionality and adequate
> representation of a (very important! - showing my bacterial bias) physical
> structure to maintain that level of consistency with String.
> For instance, what would CircularSeq + Seq mean? Physically, and
> conceptually, not a lot.
> So we might want to deprecate the __add__ method for this object - not
> typical String behaviour but, in my opinion, appropriate.

We'd probably want to make addition of CircularSeq + Seq raise a TypeError. Or, do a linearisation and simple addition with a warning?

> (You might remember that I was also generally not in favour of treating
> Seq objects as idealised Strings, so there's another bias for you ;) )

I recall :)

>> Note that due to the way Python indexing works, single letter
>> access is fine for negative indices, c[-2] would give the second
>> last letter, 'G', which is consistent with wrapped counting back
>> from the origin. We could also make c[14] wrap round to c[2] in
>> this length 12 example (although there is a small risk of breaking
>> code expecting an IndexError in this case).
>
> I wouldn't be in favour of that behaviour in a general sense, though I don't
> see how to avoid it cleanly. I think it would be best to be strict with
> indexing to the co-ordinate system to avoid possible degeneracy of feature
> locations. If we had a SNP at position 2, we could equally well associate
> it with any one of an infinite number of positions kl+2 where k is an
> integer and l is the sequence length, without modifying the computational
> result.

Yes, I was suggesting we could make c[x+n*length] act as c[x], i.e. for *single* indexes which return one letter, apply the modulo arithmetic. Or, we leave this to follow the current Python string behaviour where if the index is equal to the length or more, you get an IndexError.
That avoids the ambiguity ;) > I'm not keen on that kind of woolliness, but I think that it could > possibly be avoided by modifying indexing to require at least one index that > lies in the range [-l,l], and using modular arithmetic for slicing so that, > for the example above, c[18:26] would not be treated as the valid slice > c[6:14], but would instead throw an IndexError. This depends on the treatment of things like c[0:36] or c[6:42] discussed above (return 36 bases, or just 12?). >> There would be lots of other things to implement, like "in" and the >> find methods would need to check the substring across the origin. >> Then (for nucleotides), we'd need to ensure reverse_complement >> and complement also give a CircularSeq, likewise perhaps for the >> transcribe and back_transcribe. > > Not to mention the other Biopython functions/methods that expect String-like > indexing. ?Maybe a cast (of sorts) between CircularSeq and Seq would be > useful for that, though I can imagine great problems, there. Having a toseq method like the MutableSeq does could handle that, returning a traditional linear Seq object. If the CircularSeq 'breaks' too much expected string-like behaviour that would be important. >> The translate method is particularly >> tricky as you can have an infinite reading frame, which might be >> represented as a circular protein sequence? > > I would think that the test for that particular condition should be fairly > straightforward (is there at least one stop codon in each of the six frames, > taking into account the origin?). Having thought about this example at length before, it can be done but I don't think it is all that straightforward ;) >> All in all, it is quite a lot of work, and there are several tricky bits >> where the desired behaviour is not clear cut. Could we come up >> with something useful or not? 
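The slicing options debated in this thread can be made concrete with a minimal sketch. The CircularSeq class below is hypothetical (no such class exists in Biopython); it wraps a plain string rather than subclassing Seq, and it picks one option at each disputed point: slices wrap the origin and may return more than one full turn, a slice whose stop is smaller than its start is read GenBank-style as crossing the origin, and single-letter access stays strict and raises IndexError like a string.

```python
class CircularSeq:
    """Hypothetical circular sequence wrapper; NOT an existing Biopython class."""

    def __init__(self, data):
        self._data = str(data)

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return self._data

    def __getitem__(self, index):
        n = len(self._data)
        if isinstance(index, int):
            # Single letters follow plain string rules: c[-2] works,
            # c[14] raises IndexError (the strict option from the thread).
            return self._data[index]
        if index.step not in (None, 1):
            raise ValueError("this sketch only supports a step of 1")
        if n == 0:
            return ""
        start = 0 if index.start is None else index.start
        if index.stop is None:
            # Open-ended slices stop at the origin, like a plain string.
            stop = n if start >= 0 else 0
        else:
            stop = index.stop
        if stop < start:
            # GenBank-style request that crosses the origin, e.g. c[10:2].
            stop += n
        # Walk forward from start; i % n wraps around the origin, so
        # c[0:36] on a length 12 sequence gives three concatenated copies.
        return "".join(self._data[i % n] for i in range(start, stop))

    def __contains__(self, sub):
        # Append the first len(sub) - 1 letters so that matches spanning
        # the origin are found too.
        sub = str(sub)
        return sub in self._data + self._data[:max(len(sub) - 1, 0)]


c = CircularSeq("ACGTACGTACGT")
print(c[10:14])  # GTAC, the same as c[10:12] + c[0:2] on a plain string
print(c[10:2])   # GTAC, origin-spanning request
```

Swapping the modulo walk for a check that rejects slices longer than the sequence would give the stricter behaviour argued for in the quoted replies; the sketch does not settle which is preferable.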
> > I think that there's every possibility of coming up with something useful - > the question is to what degree it fits the Biopython/Python idiom, or 'looks > like' the physical object, and whether it gets included in Biopython. > > L. Agreed. Peter From anaryin at gmail.com Tue Mar 8 21:39:07 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 8 Mar 2011 22:39:07 +0100 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com> Message-ID: Back to this question. Haven't had much time to look at it and it turned out to be a bit more complicated than what I thought. Permissive is an attribute of the PDBParser module and since the assignment takes place in the Atom module I don't see a straightforward way of pulling this off. However, and although there is the very simple solution of playing with the warnings module, the solution I offer is to allow a second level of "permissiveness" (PERMISSIVE=2) where all warnings are supressed. Cheers, J From laserson at mit.edu Wed Mar 9 03:07:54 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 8 Mar 2011 22:07:54 -0500 Subject: [Biopython] SeqRecord subclassing or composition Message-ID: I am trying to implement a data type for my work. Each object will have a sequence (derived from a single read) and lots of annotations and features. However, I want to implement some extra interface that is problem-specific to make my analysis more convenient. I am debating whether to subclass SeqRecord and simply implement the extra interface or define a new object that wraps a SeqRecord object and pass on the subset of native SeqRecord calls and/or simply access the underlying SeqRecord directly. One additional factor is that I want to be able to read/write INSDC-style files for the data (e.g., GenBank). 
Therefore, if I use the SeqIO parser, it will return native SeqRecords. If I go the inheritance route, how do I cast a SeqRecord object to my new subclass?

So, I am debating between inheritance

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        SeqRecord.__init__(self,*args,**kw)
        # But how do I cast a SeqRecord to an ImmuneChain?

or composition

class ImmuneChain(object):
    def __init__(self, *args, **kw):
        if isinstance(args[0],SeqRecord):
            self._record = args[0]
        else:
            # Initialize the underlying SeqRecord manually
            self._record.seq = ...

Any thoughts? Thanks!

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com Wed Mar 9 09:04:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 09:04:26 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To:
References:
Message-ID:

On Wed, Mar 9, 2011 at 3:07 AM, Uri Laserson wrote:
> I am trying to implement a data type for my work. Each object will have a
> sequence (derived from a single read) and lots of annotations and features.
> However, I want to implement some extra interface that is problem-specific
> to make my analysis more convenient.
>
> I am debating whether to subclass SeqRecord and simply implement the extra
> interface or define a new object that wraps a SeqRecord object and pass on
> the subset of native SeqRecord calls and/or simply access the underlying
> SeqRecord directly.
>
> One additional factor is that I want to be able to read/write INSDC-style
> files for the data (e.g., GenBank). Therefore, if I use the SeqIO parser,
> it will return native SeqRecords. If I go the inheritance route, how do I
> cast a SeqRecord object to my new subclass?
There is (currently at least) no option in SeqIO parse/read to override the use of the SeqRecord object. So you'd need code to 'upgrade' a SeqRecord into your class. Probably the simplest route would be for its __init__ method to take a single argument (a SeqRecord). Then you could have:

def my_parse(...):
    for seq_record in SeqIO.parse(...):
        yield MyClass(seq_record)

def my_read(...):
    return MyClass(SeqIO.read(...))

etc

> So, I am debating between inheritance
>
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         SeqRecord.__init__(self,*args,**kw)
>         # But how do I cast a SeqRecord to an ImmuneChain?

Unless you modify the methods/attributes too much, an ImmuneChain subclass of SeqRecord should be usable as is with SeqIO.write etc. You don't need to 'cast'.

Also note the above __init__ method can be more specific, you might have say 10 init args for ImmuneChain, only some of which you pass to the SeqRecord init. You could even have a single __init__ argument of a SeqRecord, and copy all its attributes.

> or composition
>
> class ImmuneChain(object):
>     def __init__(self, *args, **kw):
>         if isinstance(args[0],SeqRecord):
>             self._record = args[0]
>         else:
>             # Initialize the underlying SeqRecord manually
>             self._record.seq = ...

With the above approach you'd have to pass the private record to SeqIO.write etc (anything which needs a SeqRecord). That could be done inside methods of the ImmuneChain object (e.g. you could expose the format method of the SeqRecord).

>
> Any thoughts?
>

You could alternatively go for a procedural style where you write your code as functions taking SeqRecord objects (perhaps expecting particular information in the annotation).
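As a rough illustration of the composition route, here is a sketch in which the wrapper forwards unknown attributes to the wrapped record and exposes the record itself for anything that needs it. Everything here is an assumption made for illustration: Record is a tiny stand-in for Bio.SeqRecord.SeqRecord so the snippet runs without Biopython, and the junction annotation and the upgrade helper are hypothetical names.

```python
class Record:
    """Stand-in for Bio.SeqRecord.SeqRecord (assumption: a few attributes only)."""

    def __init__(self, seq, id="<unknown id>", annotations=None):
        self.seq = seq
        self.id = id
        self.annotations = annotations if annotations is not None else {}


class ImmuneChain:
    """Composition: wrap a record and add a problem-specific interface."""

    def __init__(self, record):
        self._record = record

    @property
    def record(self):
        # Hand this to anything that needs the plain record, e.g. a writer.
        return self._record

    def __getattr__(self, name):
        # Only called when normal lookup fails, so attributes defined on the
        # wrapper win and everything else is forwarded to the wrapped record.
        if name == "_record":
            raise AttributeError(name)
        return getattr(self._record, name)

    def __len__(self):
        return len(self._record.seq)

    @property
    def junction(self):
        # Hypothetical problem-specific accessor reading an annotation.
        return self._record.annotations.get("junction")


def upgrade(records):
    """Wrap each parsed record, in the spirit of my_parse above."""
    for record in records:
        yield ImmuneChain(record)


chains = list(upgrade([Record("ACGT", id="r1", annotations={"junction": "CG"})]))
print(chains[0].id, len(chains[0]), chains[0].junction)
```

With real SeqRecord objects the same wrapper would be fed from SeqIO.parse, and chain.record would be what gets passed to SeqIO.write.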
Peter From komalsnehal1991 at gmail.com Wed Mar 9 10:49:23 2011 From: komalsnehal1991 at gmail.com (Komal S) Date: Wed, 9 Mar 2011 02:49:23 -0800 Subject: [Biopython] ::Biopython Project Message-ID: Hi everyone, I'm Komal, a Junior Undergraduate Student from India studying Bioengineering. I'm a fan of Python and I love Computational Biology and I plan to do my further studies in the same. I went through the projects on the Biopython page. I was very much interested in the RNA Structure project mentioned. Any contribution which I make will help me a lot and the organisation too. In fact, I am currently doing a project on RNA Editing. I'll be very happy to integrate my knowledge. In fact, I have been trying to contact people on #obf-soc IRC. I think there is no separate IRC for Biopython. Please help me on how I should proceed. Komal From laserson at mit.edu Wed Mar 9 15:28:22 2011 From: laserson at mit.edu (Uri Laserson) Date: Wed, 9 Mar 2011 10:28:22 -0500 Subject: [Biopython] SeqRecord subclassing or composition In-Reply-To: References: Message-ID: > > Unless you modify the methods/atttributes too much, a > ImmuneChain subclass of SeqRecord should be usable > as is with SeqIO.write etc. You don't need to 'cast'. > I'm more worried about parsing than writing. As you mentioned, I will have to upgrade my SeqRecord object to an ImmuneChain object. So maybe the best approach is a combination of the two code snippets I included. 
It would subclass SeqRecord, and then manually check whether I am initializing with a pre-existing SeqRecord or just data:

class ImmuneChain(SeqRecord):
    def __init__(self, *args, **kw):
        if args and isinstance(args[0],SeqRecord):
            # if initializing with SeqRecord, then manually transfer the data
            # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
            record = args[0]
            SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
                     description=record.description, dbxrefs=record.dbxrefs,
                     features=record.features, annotations=record.annotations,
                     letter_annotations=record.letter_annotations)
        else:
            # assume I'm initializing just like a regular SeqRecord:
            SeqRecord.__init__(self, *args, **kw)

        # Finally, I perform any problem-specific additional initializations
        # here.
        pass

Does this seem like a good solution?

Also, do you think that it would make sense to make a deep copy of the SeqRecord object before I use it to initialize the ImmuneChain?

Uri

From p.j.a.cock at googlemail.com Wed Mar 9 15:32:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 9 Mar 2011 15:32:50 +0000
Subject: [Biopython] SeqRecord subclassing or composition
In-Reply-To:
References:
Message-ID:

On Wed, Mar 9, 2011 at 3:28 PM, Uri Laserson wrote:
>> Unless you modify the methods/attributes too much, an
>> ImmuneChain subclass of SeqRecord should be usable
>> as is with SeqIO.write etc. You don't need to 'cast'.
>
> I'm more worried about parsing than writing. As you mentioned, I will have
> to upgrade my SeqRecord object to an ImmuneChain object.
> So maybe the best approach is a combination of the two code snippets I
> included. It would subclass SeqRecord, and then manually check whether I am
> initializing with a pre-existing SeqRecord or just data:
> class ImmuneChain(SeqRecord):
>     def __init__(self, *args, **kw):
>         if args and isinstance(args[0],SeqRecord):
>             # if initializing with SeqRecord, then manually transfer the data
>             # based on the initializer for SeqRecord (http://goo.gl/X95Zf)
>             record = args[0]
>             SeqRecord.__init__(self, record.seq, id=record.id, name=record.name,
>                      description=record.description, dbxrefs=record.dbxrefs,
>                      features=record.features, annotations=record.annotations,
>                      letter_annotations=record.letter_annotations)
>         else:
>             # assume I'm initializing just like a regular SeqRecord:
>             SeqRecord.__init__(self, *args, **kw)
>
>         # Finally, I perform any problem-specific additional initializations
>         # here.
>         pass
> Does this seem like a good solution?

I think it will work,

> Also, do you think that it would make sense to make a deep copy of the
> SeqRecord object before I use it to initialize the ImmuneChain?

Assuming you will be discarding the original SeqRecord, then I see no reason to make a deep copy. It will just slow things down.

Peter

From jvb at Cs.Nott.AC.UK Wed Mar 9 15:33:28 2011
From: jvb at Cs.Nott.AC.UK (Jonathan Blakes)
Date: Wed, 09 Mar 2011 15:33:28 +0000
Subject: [Biopython] back-translation method for Seq object?
Message-ID: <4D779DC8.8090704@cs.nott.ac.uk>

This is a reply to an old thread (October 2008), but I thought someone might find it useful.

In that thread, discussing the representation of back-translations using ambiguous bases to avoid the combinatorial explosion of an all-possibilities back-translation, Bruce Southey gave a table similar to the one below, but some of the ambiguous codons were incorrect or were too ambiguous and covered more than one amino acid. The codons for stop (*) were also missing. Some were corrected later in the thread but not all.
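A minimal sketch of the ambiguity-code approach described in this message, using the codon table it gives for the standard genetic code. The function names are made up for illustration and nothing here is a Biopython API; the IUPAC sizes (R and Y stand for two bases, H for three, N for four) are the standard nucleotide ambiguity codes.

```python
from itertools import product
from math import prod

# Ambiguous codons for the standard genetic code, as tabulated in this message.
AMBIGUOUS_CODONS = {
    "*": ("TAR", "TGA"), "A": ("GCN",), "C": ("TGY",), "D": ("GAY",),
    "E": ("GAR",), "F": ("TTY",), "G": ("GGN",), "H": ("CAY",),
    "I": ("ATH",), "K": ("AAR",), "L": ("TTR", "CTN"), "M": ("ATG",),
    "N": ("AAY",), "P": ("CCN",), "Q": ("CAR",), "R": ("CGN", "AGR"),
    "S": ("TCN", "AGY"), "T": ("ACN",), "V": ("GTN",), "W": ("TGG",),
    "Y": ("TAY",),
}

# Number of concrete bases each IUPAC letter used above stands for.
IUPAC_SIZE = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "H": 3, "N": 4}


def ambiguous_back_translations(protein):
    """Yield every back-translation built from the ambiguous codons."""
    choices = (AMBIGUOUS_CODONS[aa] for aa in protein)
    for codons in product(*choices):
        yield "".join(codons)


def count_ambiguous(protein):
    """Number of ambiguous back-translations (2 per L, R, S or * residue)."""
    return prod(len(AMBIGUOUS_CODONS[aa]) for aa in protein)


def count_unambiguous(protein):
    """Number of fully specified back-translations."""
    total = 1
    for aa in protein:
        # Expand each ambiguous codon into the count of concrete codons it covers.
        total *= sum(
            prod(IUPAC_SIZE[base] for base in codon)
            for codon in AMBIGUOUS_CODONS[aa]
        )
    return total


protein = "ACDEFGHIKLMNPQRSTVWY*"
print(count_unambiguous(protein))  # 1019215872
print(count_ambiguous(protein))    # 16
```

The two counts reproduce the figures quoted in the message for the 21-letter test protein.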
Here are the correct ambiguous codons for the standard genetic code:

*  =  TAG, TAA, TGA                  =  TAR, TGA
A  =  GCT, GCC, GCA, GCG             =  GCN
C  =  TGT, TGC                       =  TGY
D  =  GAT, GAC                       =  GAY
E  =  GAA, GAG                       =  GAR
F  =  TTT, TTC                       =  TTY
G  =  GGT, GGC, GGA, GGG             =  GGN
H  =  CAT, CAC                       =  CAY
I  =  ATT, ATC, ATA                  =  ATH
K  =  AAA, AAG                       =  AAR
L  =  TTA, TTG, CTT, CTC, CTA, CTG   =  TTR, CTN
M  =  ATG                            =  ATG
N  =  AAT, AAC                       =  AAY
P  =  CCT, CCC, CCA, CCG             =  CCN
Q  =  CAA, CAG                       =  CAR
R  =  CGT, CGC, CGA, CGG, AGA, AGG   =  CGN, AGR
S  =  TCT, TCC, TCA, TCG, AGT, AGC   =  TCN, AGY
T  =  ACT, ACC, ACA, ACG             =  ACN
V  =  GTT, GTC, GTA, GTG             =  GTN
W  =  TGG                            =  TGG
Y  =  TAT, TAC                       =  TAY

Even though this is still not a one-to-one mapping in 4/21 cases, the combinatorial explosion is significantly decreased. For example, the protein ACDEFGHIKLMNPQRSTVWY* has 1,019,215,872 unambiguous back-translations. Using the code above it has 16, or generally 2^(L+R+S+*), where the exponent is the number of L, R, S and * residues in the protein.

If anyone has an algorithm for determining the set of non-overlapping ambiguous codons from any codon table I would like to know.

Thanks,

Jon

--
Jonathan Blakes
School of Computer Science
University of Nottingham

From rasi at seas.harvard.edu Wed Mar 9 22:57:30 2011
From: rasi at seas.harvard.edu (Arvind Subramaniam)
Date: Wed, 9 Mar 2011 17:57:30 -0500
Subject: [Biopython] .ab1 file parser in biopython?
Message-ID:

Hi
I am new to biopython so please excuse me if this issue is obviously simple. I am trying to parse .ab1 sequencing trace files in Biopython and I cannot find the right module or method to do this job. Can someone suggest how I can parse .ab1 files?
Thanks,
Arvind.

From cmckay at u.washington.edu Thu Mar 10 01:09:55 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Wed, 9 Mar 2011 17:09:55 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID:

Hello all. Biopython continues to be a lifesaver.

I'm trying to get the "raw" genbank locations for a downstream application after parsing a genbank file. Is there any way to get at this (or reproduce it)?
As it is, the SeqRecord feature has start and stop information for the whole feature, and a list of sub-features each with its own start and stops. I'm looking for one concise text string that describes the entire feature location, much like the original raw genbank locations do.

I searched the archives, but nothing popped into view. Thanks for your help!

best,
Cedar

From chapmanb at 50mail.com Thu Mar 10 02:05:45 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 9 Mar 2011 21:05:45 -0500
Subject: [Biopython] "raw" genbank locations?
In-Reply-To:
References:
Message-ID: <20110310020545.GA2185@kunkel>

Cedar;
Glad to hear Biopython has been helping out with your work.

> I'm trying to get the "raw" genbank locations for a downstream
> application after parsing a genbank file. Is there any way to get at
> this (or reproduce it)? As it is, the SeqRecord feature has start and
> stop information for the whole feature, and a list of sub-features
> each with its own start and stops. I'm looking for one concise text
> string that describes the entire feature location, much like the
> original raw genbank locations do.

You can do this with the GenBank RecordParser, which doesn't parse the location strings:

>>> from Bio.GenBank import RecordParser
>>> parser = RecordParser()
>>> handle = open("NT_019265.gb")
>>> rec = parser.parse(handle)
>>> for f in rec.features:
...     print f.location
...
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

If you have SeqRecord objects from SeqIO you can do this in an ugly way by reaching into the internals of the GenBank writer:

>>> from Bio import SeqIO
>>> from Bio.SeqIO import InsdcIO
>>> handle = open("NT_019265.gb")
>>> for rec in SeqIO.parse(handle, "genbank"):
...     for f in rec.features:
...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
...
1..1250660
1..3290
215902..365470
217508
join(342430..342515,363171..363300,365741..365814,376398..376499,390169..390297,391257..391379,392606..392679,398230..398419,399082..399167,399534..399650,405844..405913,406704..406761,406868..407010,407962..408091,408508..409092)

That might work for a quick hack but is not necessarily future proof if the internals change. Peter, do you think this would be useful to expose as a function of a SeqFeature directly, so you could do feature.insdc_string() or something similar?

Brad

From p.j.a.cock at googlemail.com Thu Mar 10 08:57:20 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Mar 2011 08:57:20 +0000
Subject: [Biopython] "raw" genbank locations?
In-Reply-To: <20110310020545.GA2185@kunkel>
References: <20110310020545.GA2185@kunkel>
Message-ID:

On Thu, Mar 10, 2011 at 2:05 AM, Brad Chapman wrote:
> Cedar;
> Glad to hear Biopython has been helping out with your work.
>
>> I'm trying to get the "raw" genbank locations for a downstream
>> application after parsing a genbank file. Is there any way to get at
>> this (or reproduce it)? As it is, the SeqRecord feature has start and
>> stop information for the whole feature, and a list of sub-features
>> each with its own start and stops.
>> I'm looking for one concise text
>> string that describes the entire feature location, much like the
>> original raw genbank locations do.
>
> You can do this with the GenBank RecordParser, which doesn't parse
> the location strings:
>
> >>> from Bio.GenBank import RecordParser
> >>> parser = RecordParser()
> >>> handle = open("NT_019265.gb")
> >>> rec = parser.parse(handle)
> >>> for f in rec.features:
> ...     print f.location
> ...
>
>
> If you have SeqRecord objects from SeqIO you can do this in an ugly
> way by reaching into the internals of the GenBank writer:
>
> >>> from Bio import SeqIO
> >>> from Bio.SeqIO import InsdcIO
> >>> handle = open("NT_019265.gb")
> >>> for rec in SeqIO.parse(handle, "genbank"):
> ...     for f in rec.features:
> ...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
> ...
>
>
> That might work for a quick hack but is not necessarily future proof
> if the internals change. Peter, do you think this would be useful to
> expose as a function of a SeqFeature directly, so you could do
> feature.insdc_string() or something similar?

A couple of people have asked for this, and since adding SeqIO output in GenBank/EMBL format (the code you refer to in InsdcIO) this would be very possible... the issue holding me back is the annoying special case(s) that require knowing the parent sequence's length. The problem is that currently the SeqFeature doesn't have this information - it doesn't have any link back to a parent SeqRecord (and indeed it doesn't even have to be created in the context of a SeqRecord).

Perhaps we can handle the case of between features N^1 on circular sequences of length N differently, maybe with a dedicated SeqFeature location class which would tell us it was at the origin? Then we'd be able to avoid the need to know the parent length.
Once that is resolved, an orphan SeqFeature could generate its own INSDC (GenBank/EMBL) location string without needing any extra information, and exposing this as an object method would be fine. Peter P.S. If we ever add a CircularSeq object - see other thread- then SeqFeature locations spanning the origin might need reworking too. From p.j.a.cock at googlemail.com Thu Mar 10 09:00:51 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 09:00:51 +0000 Subject: [Biopython] .ab1 file parser in biopython? In-Reply-To: References: Message-ID: On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam wrote: > Hi > ?I am new to biopython so please excuse me if this issue is obviously > simple. I am trying to parse .ab1 sequencing trace files in Biopython > and I cannot find the right module or method to do this job. Can > someone suggest how I can parse .ab1 files? > Thanks, > Arvind. You mean the ABI trace file format for capillary sequencing? Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner if I want to recall the bases (the ABI software doesn't always to the best possible calling job). http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html http://sourceforge.net/projects/tracetuner/ Peter From chapmanb at 50mail.com Thu Mar 10 11:06:48 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 10 Mar 2011 06:06:48 -0500 Subject: [Biopython] "raw" genbank locations? In-Reply-To: References: <20110310020545.GA2185@kunkel> Message-ID: <20110310110648.GA2302@kunkel> Peter; > > do you think this would be useful to > > expose as a function of a SeqFeature directly, so you could do > > feature.insdc_string() or something similar? > > A couple of people have asked for this, and since adding SeqIO > output in GenBank/EMBL format (the code you refer to in InsdcIO) > this would be very possible... the issue holding me back is the > annoying special case(s) requiring to know the parent sequence's > length. 
The problem is that currently the SeqFeature doesn't > have this information - it doesn't have any link back to a parent > SeqRecord (and indeed it doesn't even have to be created in > the context of a SeqRecord). > > Perhaps we can handle the case of between features N^1 on > circular sequences of length N differently, maybe with a dedicated > SeqFeature location class which would tell us it was at the origin? > Then we'd be able to avoid the need to know the parent length. This is a great idea; makes sense to treat this as a special case since that's what it is. Another simple way would be to put the function on the SeqRecord class and call it with: rec.insdc_feature_string(feature); this places the responsibility of knowing the parent back on the library user. > P.S. If we ever add a CircularSeq object - see other thread- then > SeqFeature locations spanning the origin might need reworking > too. Makes sense. We can get the 99% of standard cases working now and then re-circle back on this once someone gets up the guts to tackle CircularSeq. Brad From p.j.a.cock at googlemail.com Thu Mar 10 11:52:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 11:52:48 +0000 Subject: [Biopython] "raw" genbank locations? In-Reply-To: <20110310110648.GA2302@kunkel> References: <20110310020545.GA2185@kunkel> <20110310110648.GA2302@kunkel> Message-ID: On Thu, Mar 10, 2011 at 11:06 AM, Brad Chapman wrote: > Peter; > >> > do you think this would be useful to >> > expose as a function of a SeqFeature directly, so you could do >> > feature.insdc_string() or something similar? >> >> A couple of people have asked for this, and since adding SeqIO >> output in GenBank/EMBL format (the code you refer to in InsdcIO) >> this would be very possible... the issue holding me back is the >> annoying special case(s) requiring to know the parent sequence's >> length. 
The problem is that currently the SeqFeature doesn't >> have this information - it doesn't have any link back to a parent >> SeqRecord (and indeed it doesn't even have to be created in >> the context of a SeqRecord). >> >> Perhaps we can handle the case of between features N^1 on >> circular sequences of length N differently, maybe with a dedicated >> SeqFeature location class which would tell us it was at the origin? >> Then we'd be able to avoid the need to know the parent length. > > This is a great idea; makes sense to treat this as a special case > since that's what it is. It is probably the most elegant solution without a big refactor. > Another simple way would be to put the > function on the SeqRecord class and call it with: > rec.insdc_feature_string(feature); this places the responsibility of > knowing the parent back on the library user. Yes, that would be simple. But don't we sometimes want to use 'orphan' SeqFeature objects (without a SeqRecord parent)? I'm thinking here about GFF3 files and the like. >> P.S. If we ever add a CircularSeq object - see other thread- then >> SeqFeature locations spanning the origin might need reworking >> too. > > Makes sense. We can get the 99% of standard cases working now and > then re-circle back on this once someone gets up the guts to tackle > CircularSeq. :) Peter From rmb32 at cornell.edu Thu Mar 10 17:15:41 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 10 Mar 2011 12:15:41 -0500 Subject: [Biopython] update Google Summer of Code project ideas Message-ID: <4D79073D.3090603@cornell.edu> Hi all, Please make sure the BioJava information is up to date for 2011 on both the OBF and BioJava wikis. Eric has done some work on it, but the current page has not been completely updated to reflect that it's 2011 and we're applying again. 
OBF wiki page: http://www.open-bio.org/wiki/Google_Summer_of_Code BioPython wiki: http://biopython.org/wiki/Google_Summer_of_Code Rob ---- Robert Buels (prospective) 2011 OBF GSoC Organization Admin From anaryin at gmail.com Thu Mar 10 17:25:04 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 10 Mar 2011 18:25:04 +0100 Subject: [Biopython] update Google Summer of Code project ideas In-Reply-To: <4D79073D.3090603@cornell.edu> References: <4D79073D.3090603@cornell.edu> Message-ID: I updated the date and added the project from last year to the page, to show we got another funded project. Cheers, J From p.j.a.cock at googlemail.com Thu Mar 10 17:42:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 17:42:58 +0000 Subject: [Biopython] Bugzilla -> Redmine migration Message-ID: Hi all, Anyone who has tried to file a bug recently will have noticed a big red message "Sorry, entering bugs into the product Biopython has been disabled." The reason for this is the OBF team are about to move us (and all the other Bio* projects using Bugzilla) to a Redmine server instead. See http://www.redmine.org/ I expect this to be completed in the next few days (with all the old bugs and accounts carried across). Hopefully this will include integration with our git repository as well. We'll make an announcement once it is ready, in the mean time, any new bugs could be emailed to the mailing list as a short term measure. Peter From laserson at mit.edu Thu Mar 10 18:22:42 2011 From: laserson at mit.edu (Uri Laserson) Date: Thu, 10 Mar 2011 13:22:42 -0500 Subject: [Biopython] .ab1 file parser in biopython? In-Reply-To: References: Message-ID: I also found the following code lying around somewhere. I copied it into one of my repositories: https://github.com/laserson/pytools/blob/master/ab1.py "Python implementation of an ABIF file reader according to Applied Biosystems' specificatons" as specified in March 2007, it appears. 
................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu On Thu, Mar 10, 2011 at 04:00, Peter Cock wrote: > On Wed, Mar 9, 2011 at 10:57 PM, Arvind Subramaniam > wrote: > > Hi > > I am new to biopython so please excuse me if this issue is obviously > > simple. I am trying to parse .ab1 sequencing trace files in Biopython > > and I cannot find the right module or method to do this job. Can > > someone suggest how I can parse .ab1 files? > > Thanks, > > Arvind. > > You mean the ABI trace file format for capillary sequencing? > > Personally I use EMBOSS seqret (e.g. to make FASTQ), or tracetuner > if I want to recall the bases (the ABI software doesn't always to the > best possible calling job). > > http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/seqret.html > http://sourceforge.net/projects/tracetuner/ > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Thu Mar 10 18:37:04 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 18:37:04 +0000 Subject: [Biopython] .ab1 file parser in biopython? In-Reply-To: References: Message-ID: On Thu, Mar 10, 2011 at 6:22 PM, Uri Laserson wrote: > I also found the following code lying around somewhere. ?I copied it into > one of my repositories: > > https://github.com/laserson/pytools/blob/master/ab1.py > > "Python implementation of an ABIF file reader according to Applied > Biosystems' specificatons" as specified in March 2007, it appears. > Its under the GPL license. If you contacted the named author, Francis Wolinski, and he was willing to re-licence for Biopython to use, then we could consider incorporating it. 
Alternatively it shouldn't be too hard to reimplement it from scratch
based on the published specification (and go one step further and
consider output too).

http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf

Note some care would be needed to work on Python 3, but we can follow
the example of our SFF parser here.

Is there actually a need for this though? As I said before, for my own
needs getting the ABI file into FASTQ format (or FASTA+QUAL) has
sufficed.

Peter

From cmckay at u.washington.edu Thu Mar 10 21:51:42 2011
From: cmckay at u.washington.edu (Cedar McKay)
Date: Thu, 10 Mar 2011 13:51:42 -0800
Subject: [Biopython] "raw" genbank locations?
Message-ID:

Great! InsdcIO._insdc_feature_location_string was just what I needed.
I was actually on the right track, trying to figure out how SeqIO
wrote locations in genbank format, but your email arrived soon enough
that I didn't have to finish the job. I realize this is a private
method, so I would like an official way to do this.

Thanks so much guys, as usual, awesome service!

Cedar

From laserson at mit.edu Thu Mar 10 22:07:46 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 10 Mar 2011 17:07:46 -0500
Subject: [Biopython] Transferring SeqFeatures between aligned sequences
Message-ID:

Say I have a SeqRecord called A and a SeqRecord called B. A has a
bunch of SeqFeatures associated with it, while B has none. I perform a
gapped alignment between the two sequences. Now I want to copy the
SeqFeatures from A onto B in a way that respects the coordinates of
all the features.
For example (and please use a fixed-width font for this): 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 FEATURE_1 FEATURE_2 X X X X X X X X X X X X X X X X X X A - - - a c g g t - - a c a g a c g t g a t a c g | | | | | | | | | | | | | | | | | B a a a a c g g t g g a c a t a c g - g a t a c g 0 1 2 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and (10,19), respectively. Now I want to copy it to sequence B, where the feature coords should instead be (3,12) and (15,23). Is there an easy way to do this in biopython already? Or are there any ideas for an elegant solution? Thanks! Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Thu Mar 10 22:46:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Mar 2011 22:46:32 +0000 Subject: [Biopython] Transferring SeqFeatures between aligned sequences In-Reply-To: References: Message-ID: On Thu, Mar 10, 2011 at 10:07 PM, Uri Laserson wrote: > Say I have a SeqRecord called A and a SeqRecord called B. ?A has a bunch of > SeqFeatures associated with it, while B has none. ?I perform a gapped > alignment between the two sequences. ?Now I want to copy the SeqFeatures > from A onto B in a way that respects the coordinates of all the features. > > For example (and please use a fixed-width font for this): > I'm not quite sure I followed that figure. > In sequence A, the coords of Feature 1 and Feature 2 should be (0,7) and > (10,19), respectively. ?Now I want to copy it to sequence B, where the > feature coords should instead be (3,12) and (15,23). > > Is there an easy way to do this in biopython already? No, but I'm not sure how advisable it is anyway (if I have understood you right - see below). 
> Or are there any ideas for an elegant solution? I actually wanted to do something similar to this myself. I had a draft genome I had annotated in GenBank format. We did some more sequencing and/or I tweaked the assembly, and I had a new very similar sequence in a FASTA file, and I wanted to copy the old annotation over. What I did was look for perfect matches between the regions spanned by the features (no introns in this case), and that meant all I needed to do was apply a shift to the SeqFeature location. There is a (private) method _shift which helped here (written for use in slicing a SeqRecord). In my case, that handled most of the annotation, and I did the nasty cases by hand (since I wanted to examine what had happened in the new assembly - it was a small genome). In your case the start and end co-ordinates may be shifted by different amounts (since you are doing gapped alignments). This worries me as the length of your features can change. For any gene or CDS features that is a problem (frame shifts). Have you thought about that? Perhaps you're dealing with non-coding features only? Peter From p.j.a.cock at googlemail.com Fri Mar 11 09:53:16 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Mar 2011 09:53:16 +0000 Subject: [Biopython] Transferring SeqFeatures between aligned sequences In-Reply-To: References: Message-ID: On Thu, Mar 10, 2011 at 11:25 PM, Uri Laserson wrote: >> I'm not quite sure I followed that figure. > > I think you understood perfectly. Good - your text was clearer for me. >> In your case the start and end co-ordinates may be shifted >> by different amounts (since you are doing gapped alignments). >> This worries me as the length of your features can change. >> For any gene or CDS features that is a problem (frame shifts). >> Have you thought about that? Perhaps you're dealing with >> non-coding features only? > > That's exactly the complication here. 
> I have one reference sequence that is
> highly annotated, and I have a read that I want to align to it and transfer
> over the annotations to the corresponding positions.

OK - and do you want to worry about spotting frameshifts,
and updating the translation for CDS features?

> One way I can handle this situation is that when I actually build the
> pairwise gapped alignment (which I do manually), in addition to the actual
> gapped-sequence strings, I can generate two lists that contain the ungapped
> coordinates of each sequence (in my diagram, this is the numbering above and
> below).  Figuring out the new coords from the old coordinates is then a
> matter of matching the positions in the lists.  (Though perhaps it's easier
> to implement using dictionaries, so I don't have to search the lists I
> generated.)

Yes, that kind of technique is also useful for mapping between
gapped and ungapped coordinates in assembly files.

> Either way, in order to move the SeqFeature to the new sequence, should I
> make a deep copy of it and then manually modify the start and end coords?
> Uri

You could do, or create a new SeqFeature, or "steal" the old one and
modify it. The latter technique would probably be fastest since there
are no new objects to create, just a few integer attribute changes
(location positions), but is perhaps a bit risky if you don't comment
it clearly. If you do that, perhaps do this by popping the features
from the old SeqRecord's feature list, modifying them, and adding them
to the new SeqRecord's feature list.

If all your current annotation uses simple exact locations, life is
easier. If there are fuzzy locations, then using the location object's
private _shift method might be simplest.

Another query, are you going to look for inversions? In such
cases the strand needs flipping and the start/end interchanged.
The SeqRecord reverse complement method has to do this,
and therefore the SeqFeature and its location and position
classes all have a private _flip method.
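
The list-matching idea quoted above could be sketched roughly as follows. This is only an illustration, assuming both gapped strings come from the same pairwise alignment and that feature coordinates are half-open Python-style (start, end) pairs. The function names are hypothetical and not part of Biopython; the two sequences are the ones from the example earlier in this thread.

```python
def ungapped_to_column(gapped, gap="-"):
    """Entry i is the alignment column holding ungapped position i."""
    return [col for col, ch in enumerate(gapped) if ch != gap]


def column_to_ungapped(gapped, gap="-"):
    """Entry c is the ungapped position in column c, or None at a gap."""
    positions = []
    pos = 0
    for ch in gapped:
        if ch == gap:
            positions.append(None)
        else:
            positions.append(pos)
            pos += 1
    return positions


def transfer_span(start, end, gapped_a, gapped_b, gap="-"):
    """Map the half-open span [start, end) on A onto B via the alignment.

    Returns (new_start, new_end) on B, or None if every column of the
    span is a gap in B.
    """
    a_cols = ungapped_to_column(gapped_a, gap)
    b_pos = column_to_ungapped(gapped_b, gap)
    hits = [b_pos[c] for c in a_cols[start:end] if b_pos[c] is not None]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


# Gapped sequences from the example in this thread.
gapped_a = "---acggt--acagacgtgatacg"
gapped_b = "aaaacggtggacatacg-gatacg"

print(transfer_span(0, 7, gapped_a, gapped_b))    # (3, 12)
print(transfer_span(10, 19, gapped_a, gapped_b))  # (15, 23)
```

Note that the mapped span can be longer or shorter than the original, which is exactly the frameshift concern raised for CDS features.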
[If you find these private methods useful, perhaps we can make them public? Let us know] Thanks, Peter From thamelry at binf.ku.dk Fri Mar 11 13:08:55 2011 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Fri, 11 Mar 2011 14:08:55 +0100 Subject: [Biopython] update Google Summer of Code project ideas In-Reply-To: References: <4D79073D.3090603@cornell.edu> Message-ID: Hi, I've just added a proposal: Mocapy++Biopython: from data to probabilistic models of biomolecules Cheers, -- Thomas Hamelryck, Eng., Assoc. Prof. Group leader Structural Bioinformatics Bioinformatics center Department of Biology University of Copenhagen Ole Maaloes Vej 5 DK-2200 Copenhagen N Denmark http://www.binf.ku.dk/research/structural_bioinformatics/ From laserson at mit.edu Fri Mar 11 17:03:58 2011 From: laserson at mit.edu (Uri Laserson) Date: Fri, 11 Mar 2011 12:03:58 -0500 Subject: [Biopython] Transferring SeqFeatures between aligned sequences In-Reply-To: References: Message-ID: > > OK - and do you want to worry about spotting frameshifts, > and updating the translation for CDS features? > I can retranslate the features myself, weary of any frameshifts > You could do, or create a new SeqFeature, or "steal" the old one and > modify it. The later technique would probably be fastest since there > are no new objects to create, just a few integer attributes changes > (location positions), but is perhaps a bit risky if you don't comment > it clearly. If you do that, perhaps do this by popping the features > from the old SeqRecord's feature list, modify them, and add them > to the new SeqRecord's feature list. > I can't steal the features because the source of the features is a reference sequence that I will reuse for millions of reads. I will have to make a copy. You believe that building a new SeqFeature would be faster/safer than using python's copy.deepcopy() method? > Another query, are you going to look for inversions? 
In such > cases the strand needs flipping and the start/end interchanged. > The SeqRecord reverse complement method has to do this, > and therefore the SeqFeature and its location and position > classes all have a private _flip method. > All the reads will be reverse complemented to the coding orientation before the transfer of the features, so I don't think this will be a problem. > [If you find these private methods useful, perhaps we can make > them public? Let us know] > It's hard to tell what the general API should be or what are the most common use-cases. For myself, I can get by with writing my own methods to modify the coordinates accordingly. Uri From p.j.a.cock at googlemail.com Fri Mar 11 17:15:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Mar 2011 17:15:09 +0000 Subject: [Biopython] Transferring SeqFeatures between aligned sequences In-Reply-To: References: Message-ID: On Fri, Mar 11, 2011 at 5:03 PM, Uri Laserson wrote: >> You could do, or create a new SeqFeature, or "steal" the old one and >> modify it. The later technique would probably be fastest since there >> are no new objects to create, just a few integer attributes changes >> (location positions), but is perhaps a bit risky if you don't comment >> it clearly. If you do that, perhaps do this by popping the features >> from the old SeqRecord's feature list, modify them, and add them >> to the new SeqRecord's feature list. > > I can't steal the features because the source of the features is a reference > sequence that I will reuse for millions of reads. ?I will have to make a > copy. ?You believe that building a new SeqFeature would be faster/safer than > using python's copy.deepcopy() method? Yes, in this case you will have to make a copy. 
As to speed, I'm not sure which would be fastest - try it and see ;)

Note as long as you are not going to *change* the information in the
qualifiers dictionary (and you may want to if you update the
translation for example), then you can have the new SeqFeature share
the old qualifiers dictionary. That is a bit sneaky but may help with
speed (if speed is an issue).

>> [If you find these private methods useful, perhaps we can make
>> them public? Let us know]
>
> It's hard to tell what the general API should be or what are the most common
> use-cases.  For myself, I can get by with writing my own methods to modify
> the coordinates accordingly.

Thanks,

Peter

From reece at harts.net Mon Mar 14 18:22:52 2011
From: reece at harts.net (Reece Hart)
Date: Mon, 14 Mar 2011 11:22:52 -0700
Subject: [Biopython] update Google Summer of Code project ideas
In-Reply-To:
References: <4D79073D.3090603@cornell.edu>
Message-ID:

All-

I just added a GSoC Biopython proposal:
Variant representation, parser, generator, and coordinate converter

Comments and co-mentors welcome.
-Reece

From 2huggie at gmail.com Wed Mar 16 08:26:44 2011
From: 2huggie at gmail.com (Timothy Wu)
Date: Wed, 16 Mar 2011 16:26:44 +0800
Subject: [Biopython] [BioPython] Genbank parser
Message-ID:

Hi,

I'm using Biopython to parse human genome files with code like this:

        for seq_record in SeqIO.parse(fd, "genbank"):
            * do something with seq_record*

However something tripped on me:

Traceback (most recent call last):
  File "./buildSyn.py", line 26, in <module>
    main()
  File "./buildSyn.py", line 19, in main
    gene2SynMapping, syn2GeneMapping = mapper.getMappingDicts(files)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 29, in getMappingDicts
    self.parseAndGetMapping(fd, gene2syn)
  File "/home/thw/MyPythonPackage/frameworks/BioProg/idmapping/idmapper/human_genome_id_mapper.py", line 74, in parseAndGetMapping
    for seq_record in SeqIO.parse(fd, "genbank"):
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 525, in parse
    for r in i:
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 437, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 420, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 392, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 344, in _feed_feature_table
    consumer.location(location_string)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/__init__.py", line 975, in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError: 958574^958575..958886

The Genbank file involved has the following structure:

     CDS             958574^958575..958772
                     /gene="CSH2"
                     /gene_synonym="CS-2; CSB; hCS-B"
                     /exception="unclassified translation discrepancy"
                     /note="placental lactogen; chorionic somatomammotropin B;
                     Derived by automated computational analysis using gene
                     prediction method: Curated Genomic."
                     /codon_start=1
                     /product="chorionic somatomammotropin hormone 2 isoform 3"
                     /protein_id="NP_072171.1"
                     /db_xref="GI:12408694"
                     /db_xref="CCDS:CCDS42368.1"
                     /db_xref="GeneID:1443"
                     /db_xref="HGNC:2441"
                     /db_xref="MIM:118820"

This isn't the first occurrence in this file, however I manually deleted
what's equivalent of "^958575" in the location and it works out OK.

Is there something I can do? Right now I edit the genbank file instead
(since I won't be needing the location information).
And I'm not sure what the caret is supposed to represent.

Thanks for your attention.

Timothy

From p.j.a.cock at googlemail.com Wed Mar 16 11:43:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 Mar 2011 11:43:28 +0000
Subject: [Biopython] [BioPython] Genbank parser
In-Reply-To:
References:
Message-ID:

On Wed, Mar 16, 2011 at 8:26 AM, Timothy Wu <2huggie at gmail.com> wrote:
> Hi,
>
> I'm using Biopython to parse human genome files with code like this:
>
>         for seq_record in SeqIO.parse(fd, "genbank"):
>             * do something with seq_record*
>
> However something tripped on me:
>
> Traceback (most recent call last):
> ...
>     raise LocationParserError(location_line)
> Bio.GenBank.LocationParserError: 958574^958575..958886
>
> The Genbank file involved has the following structure:
>
>      CDS             958574^958575..958772
>                      /gene="CSH2"
> ...
>
> This isn't the first occurrence in this file, however I manually deleted
> what's equivalent of "^958575" in the location and it works out OK.
>
> Is there something I can do? Right now I edit the genbank file instead
> (since I won't be needing the location information)
> And I'm not sure what the caret is suppose to represent.

Hi Timothy,

I believe this to be an invalid GenBank file, and I would like you
to contact the NCBI to check this.

The caret is used for 'between'. Here it seems to be saying this
feature starts between 958574 and 958575, and runs to 958772.
That would normally be represented just as 958575..958772

See also:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
http://redmine.open-bio.org/issues/3175
(we're migrating the bug database, official announcement due soon)

How many of these 'broken' GenBank records have you found?
I would hope it is just one or two that can be fixed by hand.

If on the other hand the NCBI say this is valid, we need to
handle this in the Biopython feature model...

Peter

From cjfields at illinois.edu Wed Mar 16 17:58:23 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 16 Mar 2011 12:58:23 -0500
Subject: [Biopython] [ANNOUNCEMENT] Bugzilla to Redmine migration
Message-ID: <34C8C0CB-9273-468E-86D7-74B22464F181@illinois.edu>

(apologies if you receive multiple copies of this)

All,

We are currently about 95% done with a transition over to our new
Redmine tracking system, to the point where we feel comfortable in
going ahead with opening it to developers:

http://redmine.open-bio.org/

All edits to bugzilla reports on our old system
(http://bugzilla.open-bio.org/) are now disabled and the system is now
read-only. Any new bugs and comments to old ones should be reported on
the new Redmine server.

For current Bugzilla users, we have migrated login IDs to Redmine
(this is normally an email address), but we have reset user passwords
for security reasons. There are two ways to access your account:

1) When logging in (http://redmine.open-bio.org/login), click on the
'Lost password' link. You will be prompted for your email address
(this should be the same as your bugzilla login). A new email will be
sent out containing directions for resetting your password and logging
in.

2) It is possible the above may be automatically detected as spam. If
the above doesn't work or the reset email isn't received within a day,
contact support at helpdesk.open-bio.org to receive your new password.
Also, note that Redmine has a different syntax for those who want to add links to their reports; see http://www.redmine.org/projects/redmine/wiki/RedmineTextFormatting. Let us know if you have any questions. chris Christopher Fields IGB Postdoctoral Fellow Genomics of Neural & Behavioral Plasticity University of Illinois Urbana-Champaign Institute for Genomic Biology 1206 W. Gregory Dr. , MC-195 Urbana, IL 61801 From rmb32 at cornell.edu Fri Mar 18 19:23:37 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Fri, 18 Mar 2011 15:23:37 -0400 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4D83B139.4010803@cornell.edu> Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2011 FAQ at http://bit.ly/hpoz8W Student applications are due April 8, 2011 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and whom to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2011 Administrator From laserson at mit.edu Mon Mar 21 23:38:10 2011 From: laserson at mit.edu (Uri Laserson) Date: Mon, 21 Mar 2011 19:38:10 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? 
Message-ID: If I load a GenBank-formatted record: a = SeqIO.parse('myfile.gb','gb').next() then set an annotation: a.annotations['myannotation'] = 'saveme' and then format the SeqRecord object as GenBank: a.format('gb') then 'myannotation' is lost. Is this expected behavior? If so, that's a huge bummer...what is the suggested method to store my own annotations in INSDC formats? Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Tue Mar 22 09:22:17 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 09:22:17 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Mon, Mar 21, 2011 at 11:38 PM, Uri Laserson wrote: > If I load a GenBank-formatted record: > > ? ?a = SeqIO.parse('myfile.gb','gb').next() > > then set an annotation: > > ? ?a.annotations['myannotation'] = 'saveme' > > and then format the SeqRecord object as GenBank: > > ? ?a.format('gb') > > then 'myannotation' is lost. It isn't 'lost' in that it is still in your SeqRecord object in memory, but it isn't in the GenBank format output. > Is this expected behavior? Yes, there is no general field for record level annotation in the GenBank or EMBL file formats. Where did you expect it to be written? The same thing would happen with most file formats, e.g. FASTA has no annotation support at all beyond the free text description line. > If so, that's a huge bummer...what is the suggested method to > store my own annotations in INSDC formats? You could stuff record level information into a source feature's qualifier dictionary. It isn't elegant, but it would work. 
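
As a rough illustration of the source-feature workaround, the sketch below converts a record-level annotations dictionary into a qualifiers-style dictionary and back. It is plain Python with hypothetical helper names (nothing here is a Biopython function), and it assumes qualifier values are held as lists of strings, which is how Biopython's GenBank parser stores them. Spaces in keys are replaced by underscores, following INSDC naming practice.

```python
def annotations_to_qualifiers(annotations):
    """Build a qualifiers-style dict from a record-level annotations dict.

    Spaces in keys become underscores and every value is stored as a
    list of strings.
    """
    qualifiers = {}
    for key, value in annotations.items():
        safe_key = str(key).replace(" ", "_")
        if isinstance(value, (list, tuple)):
            qualifiers[safe_key] = [str(item) for item in value]
        else:
            qualifiers[safe_key] = [str(value)]
    return qualifiers


def qualifiers_to_annotations(qualifiers):
    """Inverse mapping: unwrap single-item lists back to plain strings."""
    annotations = {}
    for key, values in qualifiers.items():
        annotations[key] = values[0] if len(values) == 1 else list(values)
    return annotations


quals = annotations_to_qualifiers(
    {"myannotation": "saveme", "run ids": ["r1", "r2"]}
)
print(quals)
# {'myannotation': ['saveme'], 'run_ids': ['r1', 'r2']}
print(qualifiers_to_annotations(quals))
# {'myannotation': 'saveme', 'run_ids': ['r1', 'r2']}
```

The resulting dictionary could then be used as the qualifiers of a full-length "source" SeqFeature created just before writing the record, and unpacked again after parsing. Non-string values do not survive the round trip as anything other than strings.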
The NCBI seems to have introduced the source feature primarily to
store the taxon identifier and other little bits of information not
handled explicitly in the header lines. (Plus this can handle chimeras,
which may have been a use case.)

Peter

From laserson at mit.edu Tue Mar 22 15:08:08 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 11:08:08 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?
In-Reply-To:
References:
Message-ID:

>
> You could stuff record level information into a source feature's
> qualifier dictionary.

What are the allowed types for the values of the qualifiers dictionary
(that will be output correctly in INSDC)? Is it possible to have lists
of strings?

What is the standard practice: a feature of type "source" that runs
the entire length of the sequence? Or is it possible to have a
SeqFeature with no position annotation? Ideally, if I slice the
SeqFeature, I would like these annotations to stay with the slice no
matter what.

Uri

From p.j.a.cock at googlemail.com Tue Mar 22 15:30:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 15:30:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?
In-Reply-To:
References:
Message-ID:

On Tue, Mar 22, 2011 at 3:08 PM, Uri Laserson wrote:
>> You could stuff record level information into a source feature's
>> qualifier dictionary.
>
> What are the allowed types for the values of the qualifiers dictionary
> (that will be output correctly in INSDC)?  Is it possible to have lists of
> strings?

As far as the current Biopython output goes, you can basically use any
(short) string as a qualifier key. Avoid keys with spaces in them
(INSDC use underscores) and other funny characters. For strict INSDC
compliance there is probably a white list of allowed feature types...

> What is the standard practice: a feature of type "source" that runs the
> entire length of the sequence?
?Or is it possible to have a SeqFeature with > no position annotation? ?Ideally, if I slice the SeqFeature, I would like > these annotations to stay with the slice no matter what. If you did have a SeqFeature without a location, we couldn't write it out in GenBank/EMBL format (the error handling here might be improved). If you have a SeqRecord with a (source) feature spanning the full sequence, and you slice the SeqRecord to take a subsequence, then that full length feature (and any other features not fully within the subsequence) would be lost. Using a source feature is really just a work around for the fact that GenBank/EMBL do not support arbitrary record level annotation. Do you have to use this as your output format? Would you not be better off with using a database or something else instead? Peter From laserson at mit.edu Tue Mar 22 15:44:02 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 22 Mar 2011 11:44:02 -0400 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: > > As far as the current Biopython output goes, you can basically use any > (short) string as a qualifier key. > Sorry, I meant for the values, not the keys. Can you have a list of strings as a value? > Using a source feature is really just a work around for the fact that > GenBank/EMBL do not support arbitrary record level annotation. > Do you have to use this as your output format? Agreed. Essentially, I have a huge pile of sequencing reads that are highly annotated. For any given read, there are some annotations that are independent of the sequence itself (which is what I am trying to implement now) and there are some annotations that are associated with subsequences (which is why SeqFeatures are very appropriate). Ideally, I want a file format that will store the data, be easily parsable (and fast), and can be readable using something like `less` (though this last feature is less important). 
> Would you not be > better off with using a database or something else instead? > Well, initially I used XML to store the data, but I quickly realized I was reinventing the wheel, especially when it came to annotating features on top of the sequences. Are you suggesting something like SQLite? How would I deal with SeqFeature-type annotations? Uri > Peter > From p.j.a.cock at googlemail.com Tue Mar 22 16:14:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Mar 2011 16:14:05 +0000 Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats? In-Reply-To: References: Message-ID: On Tue, Mar 22, 2011 at 3:44 PM, Uri Laserson wrote: >> As far as the current Biopython output goes, you can basically use any >> (short) string as a qualifier key. > > Sorry, I meant for the values, not the keys. ?Can you have a list of strings > as a value? Right. Again yes, plus I think a single string as the value should work. This is because the INSDC feature table allows multiple values for a tag - for example you often get multiple database cross references. >> Using a source feature is really just a work around for the fact that >> GenBank/EMBL do not support arbitrary record level annotation. >> Do you have to use this as your output format? > > Agreed. ?Essentially, I have a huge pile of sequencing reads that are highly > annotated. ?For any given read, there are some annotations that are > independent of the sequence itself (which is what I am trying to implement > now) and there are some annotations that are associated with subsequences > (which is why SeqFeatures are very appropriate). ?Ideally, I want a file > format that will store the data, be easily parsable (and fast), and can be > readable using something like `less` (though this last feature is less > important). For this the GenBank/EMBL format with the source feature trick does sound workable. 
You just need to be careful how and when you create the dummy source
feature - I'd do it at the last moment before writing out the file,
and in that way you can avoid things like slicing throwing it away.

>> Would you not be
>> better off with using a database or something else instead?
>
> Well, initially I used XML to store the data, but I quickly realized I was
> reinventing the wheel, especially when it came to annotating features
> on top of the sequences.

I wonder if one of the INSDC XML formats would work nicely here?
i.e. If they can be extended more easily. We should look at adding a
parser for them to Biopython (and write support too ideally of course).

> Are you suggesting something like SQLite?  How would I deal with
> SeqFeature-type annotations?

I was thinking you could use the BioSQL schema (run on SQLite if
you wanted to, or MySQL or PostgreSQL etc). You'd still face the
same issues if/when you wanted to dump the annotated records
to a plain text file though.

Peter

From laserson at mit.edu Tue Mar 22 16:58:03 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 22 Mar 2011 12:58:03 -0400
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?
In-Reply-To:
References:
Message-ID:

>
> For this the GenBank/EMBL format with the source feature trick
> does sound workable. You just need to be careful how and
> when you create the dummy source feature - I'd do it at the last
> moment before writing out the file, and in that way you can avoid
> things like slicing throwing it away.
>
>
That's a good idea. This should be even easier since I am subclassing
SeqRecord. I can override `format` to first take the whole annotations
dictionary and dump it into the qualifiers dictionary of a `source`
feature. I also have my own parser which wraps SeqIO; using SeqIO to
parse the 'imgt' format, I can then copy the `source` qualifiers to the
annotations dictionary and delete the `source` feature entirely.
Does this sound reasonable?

> I wonder if one of the INSDC XML formats would work nicely here?
> i.e. if they can be extended more easily. We should look at adding a
> parser for them to Biopython (and write support too ideally of course).
>

My only issue with this is that I'd rather not extend anyone's file format,
but use a standard file format that fits my purpose. Otherwise, I might as
well just go straight for a database, as below. (But there are some
super-fast XML parsers out there.)

> I was thinking you could use the BioSQL schema (run on SQLite if
> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
> same issues if/when you wanted to dump the annotated records
> to a plain text file though.
>

I suppose plain text readability is less important to me than ease of
sharing the data. But when I dump a SeqRecord object to a BioSQL database,
does it do it in a way that I can rebuild that object exactly with no loss
of information? (I.e., does it solve the annotation dictionary problem that
started this whole thread?)

Uri

From p.j.a.cock at googlemail.com Tue Mar 22 17:24:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Mar 2011 17:24:46 +0000
Subject: [Biopython] User-defined SeqRecord annotations are trashed in INSDC formats?
In-Reply-To:
References:
Message-ID:

On Tue, Mar 22, 2011 at 4:58 PM, Uri Laserson wrote:
>> For this the GenBank/EMBL format with the source feature trick
>> does sound workable. You just need to be careful how and
>> when you create the dummy source feature - I'd do it at the last
>> moment before writing out the file, and in that way you can avoid
>> things like slicing throwing it away.
>
> That's a good idea. This should be even easier since I am subclassing
> SeqRecord. I can override `format` to first take the whole annotations
> dictionary and dump it into the qualifiers dictionary of a `source` feature.
> I also have my own parser which wraps SeqIO; using SeqIO to parse the
> 'imgt' format, I can then copy the `source` qualifiers to the annotations
> dictionary and delete the `source` feature entirely. Does this sound
> reasonable?

Yes, using your own parser/writer to take care of mapping between
the SeqRecord annotations dictionary and a dummy feature sounds
sensible. Also using 'imgt' rather than GenBank or EMBL will let you
have longer feature qualifier keys - but these files are not as widely
used/supported as the GenBank and EMBL formats.

>> I wonder if one of the INSDC XML formats would work nicely here?
>> i.e. if they can be extended more easily. We should look at adding a
>> parser for them to Biopython (and write support too ideally of course).
>
> My only issue with this is that I'd rather not extend anyone's file format,
> but use a standard file format that fits my purpose. Otherwise, I might as
> well just go straight for a database, as below. (But there are some
> super-fast XML parsers out there.)

I haven't looked at the details to see if those XML file formats have a
nice open ended misc annotation tag you could just use.

>> I was thinking you could use the BioSQL schema (run on SQLite if
>> you wanted to, or MySQL or PostgreSQL etc). You'd still face the
>> same issues if/when you wanted to dump the annotated records
>> to a plain text file though.
>
> I suppose plain text readability is less important to me than ease of
> sharing the data. But when I dump a SeqRecord object to a BioSQL
> database, does it do it in a way that I can rebuild that object exactly
> with no loss of information? (I.e., does it solve the annotation dictionary
> problem that started this whole thread?)

Basically yes, subject to a few provisos, it should. Firstly note we
don't support any per-letter-annotation in BioSQL.
Secondly, all the
SeqRecord annotations and SeqFeature qualifiers will end up being stored
as strings (in table bioentry_qualifier_value and table
seqfeature_qualifier_value respectively). There may also be
some fun with string values vs single entry lists containing one
string.

Peter

From gori at cs.ru.nl Wed Mar 23 17:43:16 2011
From: gori at cs.ru.nl (Fabio Gori)
Date: Wed, 23 Mar 2011 18:43:16 +0100
Subject: [Biopython] From genome to lineage with Entrez
Message-ID: <201103231843.16762.gori@cs.ru.nl>

Hi all,

I have downloaded all the bacterial genomes
(ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare
their taxonomic lineages.

I'm looking for a way to get their lineages with Entrez. From the files I can
get the accession numbers and GIs, but I don't know how to get their taxonomic
ids.
I know that I can step from GIs to Taxids by processing the file
gi_taxid_nucl.dmp, but I'd prefer to use Entrez.

Thanks in advance,
Fabio

--
F. Gori, PhD student
Intelligent Systems
ICIS (Institute for Computing and Information Sciences)
Radboud University Nijmegen
Home Page: http://www.cs.ru.nl/~gori/

From p.j.a.cock at googlemail.com Wed Mar 23 18:01:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 23 Mar 2011 18:01:32 +0000
Subject: [Biopython] From genome to lineage with Entrez
In-Reply-To: <201103231843.16762.gori@cs.ru.nl>
References: <201103231843.16762.gori@cs.ru.nl>
Message-ID:

On Wed, Mar 23, 2011 at 5:43 PM, Fabio Gori wrote:
> Hi all,
>
> I have downloaded all the bacterial genomes
> (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz) and I want to compare
> their taxonomic lineages.
>
> I'm looking for a way to get their lineages with Entrez. From the files I can
> get the accession numbers and GIs, but I don't know how to get their taxonomic
> ids.
> I know that I can step from GIs to Taxids by processing the file
> gi_taxid_nucl.dmp, but I'd prefer to use Entrez.
>

I think you can do it with ELink, but personally I'd use the
taxid dump file, since it sounds like you'll want to process
hundreds of lineages.

Peter

From amenity at enthought.com Thu Mar 24 03:29:35 2011
From: amenity at enthought.com (Amenity Applewhite)
Date: Wed, 23 Mar 2011 22:29:35 -0500
Subject: [Biopython] SciPy 2011 Call for Papers
Message-ID:

Hello,

SciPy 2011, the 10th Python in Science conference, will be held July 11 - 16,
2011, in Austin, TX.

At this conference, novel applications and breakthroughs made in the pursuit
of science using Python are presented. Attended by leading figures from both
academia and industry, it is an excellent opportunity to experience the
cutting edge of scientific software development.

The conference is preceded by two days of tutorials, during which community
experts provide training on several scientific Python packages.

*We'd like to invite you to consider presenting at SciPy 2011.*

The list of topics that are appropriate for the conference includes (but is
not limited to):

* new Python libraries for science and engineering;
* applications of Python to the solution of scientific or computational
  problems;
* high performance, parallel and GPU computing with Python;
* use of Python in science education.

*Specialized Tracks*

This year we also have two specialized tracks. They will be run concurrently
with the main conference.

*Python in Data Science
Chair: Peter Wang, Streamitive, Inc.*

This track focuses on the advantages and challenges of applying Python in the
emerging field of "data science". This includes a breadth of technologies,
from wrangling realtime data streams from the social web, to machine learning
and semantic analysis, to workflow and repository management for large
datasets.
*Python and Core Technologies
Chair: Anthony Scopatz, Enthought, Inc.*

In an effort to broaden the scope of SciPy and to engage the larger community
of software developers, we are pleased to introduce the _Python & Core
Technologies_ track. Talks will cover subjects that are not directly related
to science and engineering, yet nonetheless affect scientific computing.
Proposals on the Python language, visualization toolkits, web frameworks,
education, and other topics are appropriate for this session.

*Talk/Paper Submission*

We invite you to take part by submitting a talk abstract on the conference
website at:

http://conference.scipy.org/scipy2011/papers.php

Papers are included in the peer-reviewed conference proceedings, to be
published online.

*Important dates for authors:*

Friday, April 15: Tutorial proposals due (remember: stipends will be provided
for Tutorial instructors)
http://conference.scipy.org/scipy2011/tutorials.php

Sunday, April 24: Paper abstracts due

Sunday, May 8: Student sponsorship request due
http://conference.scipy.org/scipy2011/student.php

Tuesday, May 10: Accepted talks announced

Monday, May 16: Student sponsorships announced

Monday, May 23: Early Registration ends

Sunday, June 20: Papers due

Monday-Tuesday, July 11 - 12: Tutorials

Wednesday-Thursday, July 13 - July 14: Conference

Friday-Saturday, July 15 - July 16: Sprints

The SciPy 2011 Team
@SciPy2011
http://twitter.com/SciPy2011

_________________________
Amenity Applewhite
Enthought, Inc.
Scientific Computing Solutions

From michele.silva at gmail.com Fri Mar 25 06:11:41 2011
From: michele.silva at gmail.com (Michele)
Date: Fri, 25 Mar 2011 03:11:41 -0300
Subject: [Biopython] [GSoC] Proposal: Mocapy++Biopython
In-Reply-To:
References:
Message-ID:

Hello everyone,

I'm Michele, a computer scientist and passionate developer who is currently
enrolled in a biomedicine course. That's why I got in touch with the
biopython project and have tried its tools for biological computation.
When I read the Mocapy++Biopython proposal I immediately fell in love with
it. Let me tell you why.

I have worked with Bayesian networks since 2005, modelling BNs for medical
learning environments and also programming algorithms for handling those
nets. In the context of my masters in computer science with the Artificial
Intelligence Group, we have published several papers on the idea of using
Bayesian networks to model the uncertainty associated with the students'
behavior in learning environments (see, for example, "Designing a Bayesian
Network based Student Model for Distance Learning Environments", published
at the Seventh IEEE International Conference on Advanced Learning
Technologies, 2007).

As for the C++ and Python glue, I also enjoyed the project's proposal. I
have been programming in C++ for more than 5 years, in small and big
projects, mainly in microelectronics CAD and firmware development.
Coincidentally, last year I started working with Python in bigger projects.
I worked for ESSS, a company which develops software for scientific
computing and engineering simulation. I worked with oil reservoir
simulation, where the applications were developed in Python and the
simulation core and the computer graphics algorithms were programmed in
C++. If you want to get a feeling for what reservoir simulation and the
applications I worked on look like, have a look at the Kraken's project
website. I worked in both Python and C++ development, as well as on the glue
through the use of Boost.Python.

Regarding the experience in biomolecular structure, I'm a beginner. I have
started studying biomedicine this year and therefore have a lot to learn. I
know a bit about the PDB format and molecular biology. I'm sure I can count
on your help to continue learning.

So that was my not-so-short presentation. I would love to get to know the
community better and work together on the GSoC. Please let me know if you
think I could write a proposal and if you can help me with that.
Cheers,
Michele Silva
http://www.linkedin.com/pub/michele-silva/6/520/5b0

From p.j.a.cock at googlemail.com Fri Mar 25 07:37:00 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 25 Mar 2011 07:37:00 +0000
Subject: [Biopython] Public example FASTQ files (for Tutorial examples)?
Message-ID:

Hi all,

One of the volunteers proof reading the Biopython tutorial noticed our
links to specific example FASTQ files at the NCBI SRA don't work any
more. They have withdrawn them from the FTP site, although you can
still download the files in the compressed *.sra format and in theory
convert them to FASTQ locally with the NCBI's toolkit (which is cross
platform).

Another option is to download the FASTQ files via the NCBI's web
interface. Unless there is an obvious way to do this with a URL that
I missed initially, we have a complicated situation to describe where the
user can choose all the reads for an experiment or just the filtered set,
and also choose to have them pre-trimmed or not. Plus for me at least,
the HTTP download wasn't as robust as the FTP one was.

I'm hoping someone could suggest a couple of other moderately
sized FASTQ files which are public, on FTP or a static HTML server,
which we can use in the tutorial.

So, suggestions?

Thanks!

Peter

From brettpthomas at gmail.com Tue Mar 29 14:50:38 2011
From: brettpthomas at gmail.com (Brett Thomas)
Date: Tue, 29 Mar 2011 10:50:38 -0400
Subject: [Biopython] VCF files
In-Reply-To:
References:
Message-ID:

Hi all,

I write software for genetic research, and the predominant file format we
use is VCF, a new file format used to represent genetic variation in the
1000 genomes project.

Has there been any discussion of a Biopython API for VCF files? I'd be happy
to help if anybody is working on it.
Thanks,
Brett

From jamesrwagner at gmail.com Tue Mar 29 17:55:56 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 13:55:56 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
Message-ID:

Hello:

I was trying just as a proof of concept to do an NCBI WWW BLAST query
with a FASTA file containing more than one sequence (but still a small
number of sequences).

I tried with the opuntia.fasta file from the website, and set it up as follows:

result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
blast_records = NCBIXML.parse(result_handle)

then I try:

for record in blast_records:
     print record.alignments

and I obtain:
[]


Surely at the very least since there were 7 sequences in this file, I
should get 7 empty lists, assuming of course none of the sequences
gives a hit in nr, which I am sure is not the case either?

What is still missing? I realize I could use SeqIO.parse to obtain
each sequence from the FASTA file and do a separate qblast, but surely
doing this separately for each protein would create unnecessary
overhead with the network traffic compared to somehow sending off all
the protein queries at once?

From p.j.a.cock at googlemail.com Tue Mar 29 18:07:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Mar 2011 19:07:47 +0100
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To:
References:
Message-ID:

On Tue, Mar 29, 2011 at 6:55 PM, James Wagner wrote:
> Hello:
>
> I was trying just as a proof of concept to do an NCBI WWW BLAST query
> with a FASTA file containing more than one sequence (but still a small
> number of sequences).
>
> I tried with the opuntia.fasta file from the website, and set it up as follows:
>
> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
> blast_records = NCBIXML.parse(result_handle)
>
> then I try:
>
> for record in blast_records:
>      print record.alignments
>
> and I obtain:
> []
>
>
> Surely at the very least since there were 7 sequences in this file, I
> should get 7 empty lists, assuming of course none of the sequences
> gives a hit in nr, which I am sure is not the case either?

Not necessarily, the NCBI may have fixed this but for a long time if
you had say 7 queries but only 2 gave hits, stand alone BLAST's
XML output would only contain those 2 hits. There would be nothing
at all from the 5 hitless queries. This was/is very annoying, but
right now I'm not sure if they have fixed this or not.

Try getting back the results as plain text and manually inspect them.
In the plain text output all the queries appear, and there is a clear
"no hits found" message.

> What is still missing? I realize I could use SeqIO.parse to obtain
> each sequence from the FASTA file and do a separate qblast, but surely
> doing this separately for each protein would create unnecessary
> overhead with the network traffic compared to somehow sending off all
> the protein queries at once?

Yes, in theory a single large query should have less overhead
than individual queries. Personally I'd just use standalone BLAST
and run it locally if I had more than a few queries.

Peter

From jamesrwagner at gmail.com Tue Mar 29 20:43:35 2011
From: jamesrwagner at gmail.com (James Wagner)
Date: Tue, 29 Mar 2011 16:43:35 -0400
Subject: [Biopython] getting multiple BLAST (NCBIWWW) queries to work
In-Reply-To:
References:
Message-ID:

OK, when I try to create a .fasta file with just the first sequence in
opuntia, I get no hits. However, when I just copy and paste the nucleotide
sequence into the script, I get 50 hits! This is consistent with what
happens when copy pasting the first opuntia sequence into the NCBI BLAST web
interface, though there I obtain 110 hits for intronic sequences in Opuntia
chloroplast and chloroplasts.
As a secondary point I also find it curious that the result when using
NCBIWWW is limited to 50 hits (I thought it was 500 by default). But what is
more problematic is the fact that I get no hits when using a FASTA file with
only a single sequence, when clearly there are some very high homology hits
present in nr.

This is my code from beginning to end, where the file opuntia1.fasta is
a file containing only the 1st sequence from opuntia.fasta, and when
using the line for opuntia1.fasta it resulted in no hits. I am using
BioPython 1.5.3 and Python 2.6 on Ubuntu if this has any effect on the
results.

I also tried it by obtaining a single sequence from SeqIO.parse and
then obtaining the Seq of this sequence, and it also gave 50 hits. So
it's basically just with using a FASTA file handle that I can't get it
to work.

#!/usr/bin/python
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
result_handle = NCBIWWW.qblast("blastn", "nr", "TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGA\CCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCTCATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAAACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCAGA")
#result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia1.fasta", "r"))
blast_record = NCBIXML.read(result_handle)
for description in blast_record.descriptions:
    print description;
#end of code.

On Tue, Mar 29, 2011 at 2:07 PM, Peter Cock wrote:
> On Tue, Mar 29, 2011 at 6:55 PM, James Wagner wrote:
>> Hello:
>>
>> I was trying just as a proof of concept to do an NCBI WWW BLAST query
>> with a FASTA file containing more than one sequence (but still a small
>> number of sequences).
>>
>> I tried with the opuntia.fasta file from the website, and set it up as follows:
>>
>> result_handle = NCBIWWW.qblast("blastn", "nr", open("opuntia.fasta","r"))
>> blast_records = NCBIXML.parse(result_handle)
>>
>> then I try:
>>
>> for record in blast_records:
>>      print record.alignments
>>
>> and I obtain:
>> []
>>
>>
>> Surely at the very least since there were 7 sequences in this file, I
>> should get 7 empty lists, assuming of course none of the sequences
>> gives a hit in nr, which I am sure is not the case either?
>
> Not necessarily, the NCBI may have fixed this but for a long time if
> you had say 7 queries but only 2 gave hits, stand alone BLAST's
> XML output would only contain those 2 hits. There would be nothing
> at all from the 5 hitless queries. This was/is very annoying, but
> right now I'm not sure if they have fixed this or not.
>
> Try getting back the results as plain text and manually inspect them.
> In the plain text output all the queries appear, and there is a clear
> "no hits found" message.
>
>> What is still missing? I realize I could use SeqIO.parse to obtain
>> each sequence from the FASTA file and do a separate qblast, but surely
>> doing this separately for each protein would create unnecessary
>> overhead with the network traffic compared to somehow sending off all
>> the protein queries at once?
>
> Yes, in theory a single large query should have less overhead
> than individual queries. Personally I'd just use standalone BLAST
> and run it locally if I had more than a few queries.
>
> Peter
>

From rmb32 at cornell.edu Tue Mar 29 21:20:41 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 29 Mar 2011 14:20:41 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4D924D29.3020707@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code,
thanks to Christian Zmasek and Hilmar Lapp for their excellent writing.

Student applications are due April 8! Please spread it widely, we need
to reach lots of students with it!

Rob Buels
OBF GSoC 2011 Admin

============================================================
*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***
============================================================

OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2011

Applications due 19:00 UTC, April 8, 2011.
http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a
unique opportunity for undergraduate, masters, and PhD students to
obtain hands-on experience writing and extending open-source software
for bioinformatics under the mentorship of experienced developers from
around the world. The program is the participation of the Open
Bioinformatics Foundation (OBF) as a mentoring organization in the
Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000
USD stipend, and may work entirely from their home or home
institution. Participation is open to students from any country in the
world except countries subject to US trade restrictions. Each student
will have at least one dedicated mentor to show them the ropes and
help them complete their project.

The Open Bioinformatics Foundation is particularly seeking students
interested in both bioinformatics (computational biology) and software
development.
Some initial project ideas are listed on the website. These range from
Galaxy phylogenetics pipeline development in Biopython to lightweight
sequence objects and lazy parsing in BioPerl, a DAS Server for large
files on local filesystems, and mapping Java libraries to
Perl/Ruby/Python using Biolib+SWIG+JNI. All project ideas are flexible
and many can be adjusted in scope to match the skills of the student. We
also welcome and encourage students proposing their own project ideas;
historically some of the most successful Summer of Code projects are
ones proposed by the students themselves.

TO APPLY: Apply online at the Google Summer of Code website
(http://socghop.appspot.com/), where you will also find GSoC program
rules and eligibility requirements. The 12-day application period for
students runs from Monday, March 28 through Friday, April 8th, 2011.

INQUIRIES: We strongly encourage all interested students to get in
touch with us with their ideas as early on as possible. See the OBF
GSoC page for contact details.

2011 OBF Summer of Code:
http://www.open-bio.org/wiki/Google_Summer_of_Code
Google Summer of Code FAQ:
http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs

From albert.bogdanowicz at gmail.com Thu Mar 31 17:01:45 2011
From: albert.bogdanowicz at gmail.com (Albert Bogdanowicz)
Date: Thu, 31 Mar 2011 19:01:45 +0200
Subject: [Biopython] Google Summer of Code idea
Message-ID: <201103311901.45372.albert.bogdanowicz@gmail.com>

Hello World,
I am a bioinformatics student and I would like to take part in Google Summer
of Code this year.
I have an idea for a project that I could write. It would be a module for
synthetic biology, especially the BioBrick standard used in the iGEM
competition (http://ung.igem.org/Main_Page).
I'm a bit late, but I hope this fact won't disqualify me. I would appreciate
any help in determining a more detailed specification for such a project.
Albert Bogdanowicz

From laserson at mit.edu Thu Mar 31 20:48:16 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 31 Mar 2011 16:48:16 -0400
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201103311901.45372.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com>
Message-ID:

Hi Albert,

Are you thinking of something like the Clotho project?
http://www.clothocad.org/

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu


On Thu, Mar 31, 2011 at 13:01, Albert Bogdanowicz <
albert.bogdanowicz at gmail.com> wrote:

> Hello World,
> I am a bioinformatics student and I would like to take part in Google
> Summer
> of Code this year.
> I have an idea for a project that I could write. It would be a module for
> synthetic biology, especially BioBrick standard used in iGEM competition
> (http://ung.igem.org/Main_Page).
> I'm a bit late, but I hope this fact won't disqualify me. I would
> appreciate
> any help in determining a more detailed specification for such project.
> Albert Bogdanowicz
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From rmb32 at cornell.edu Thu Mar 31 21:58:52 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 14:58:52 -0700
Subject: [Biopython] Reminder: GSoC proposals due in 1 week
Message-ID: <4D94F91C.1080005@cornell.edu>

Hi all,

Just a reminder, Google Summer of Code student applications are due
April 8!

If you're a student planning to apply to GSoC with OBF, it's very much
in your best interest to write your proposal *early*, like now, and get
it into the hands of the developers and mentors on your subproject
(BioPerl/Ruby/Python/etc) so that they can give you some feedback on it.
The final proposals must, of course, still be submitted to Google
through the GSoC web application, as described on the main GSoC site
(http://www.google-melange.com/gsoc/homepage/google/gsoc2011).

Rob Buels
OBF GSoC 2011 Administrator

From rmb32 at cornell.edu Thu Mar 31 22:04:49 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 31 Mar 2011 15:04:49 -0700
Subject: [Biopython] GSoC call for mentors
Message-ID: <4D94FA81.5090701@cornell.edu>

Hi all,

For current developers on OBF projects:

If you would not mind being a mentor to a Summer of Code student this
summer, please make sure you sign up as an OBF mentor in the GSoC web
app. There's a link under "mentors: apply now!" midway down the page at
http://www.google-melange.com/. If you didn't do last year's summer of
code, it would be a good idea to drop me an email introducing yourself,
as well, or I won't know whether to approve your request. :-)

Being signed up as an OBF GSoC mentor will give you access to the
student proposals, as they come in, and the ability to comment on them
and assign scores to the ones you think show the most promise.

If you sign up as a mentor, please also add yourself to the two OBF GSoC
mailing lists: OBF-GSoC and OBF-GSoC-mentors

OBF-GSoC list: http://lists.open-bio.org/mailman/listinfo/gsoc
OBF mentors: http://lists.open-bio.org/mailman/listinfo/gsoc-mentors

Thanks in advance!

Rob

---
Robert Buels
OBF GSoC 2011 Administrator

From philip.machanick at gmail.com Thu Mar 31 23:49:33 2011
From: philip.machanick at gmail.com (Philip Machanick)
Date: Fri, 1 Apr 2011 09:49:33 +1000
Subject: [Biopython] extending Motif class
Message-ID:

I want to add a new scoring function to the Motif class and in true
object-oriented spirit would like to do it by deriving a new class
rather than hacking the existing code.
The general structure of my test program (all in 1 file) is:

from Bio.Motif import Motif

class ScannableMotif(Motif):
    def pwm_score_hit(self,sequence,position):
        ## stuff to compute my new score

from Bio import Motif

def main ():
    for motif in ScannableMotif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
        for i in range(3):
            print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

The two different imports appear to be necessary. I need the first to be
able to use the base class to derive a new one, and without the second when
I use metaclass methods, I get

TypeError: Error when calling the metaclass bases
    module.__init__() takes at most 2 arguments (3 given)

The other problem: I can't directly invoke a metaclass method on a derived
instance as above. The snippet below works as expected, but looks like a
kludge to me. Is there a better way of accessing metaclass methods from a
derived class object?

for motif in Motif.parse(open("/Users/philip/tmp/meme.txt"),"MEME"):
    motif.__class__ = ScannableMotif # promote to the new class
    for i in range(3):
        print motif.pwm_score_hit("CCTGGGGTCCCATTTCTCTTTTCTCTCCTGGGGTCCC",i)

I think I have the class vs. metaclass concept straight but understanding
why I need the two different flavours of import would be useful.

--
Philip Machanick
Rhodes University, Grahamstown 6140, South Africa
http://opinion-nation.blogspot.com/
+61-7-3871-0963 mobile +61 42 234 6909 skype philipmach
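
A note on the two imports in the message above, offered as a sketch rather
than a description of Biopython's internals: `from Bio.Motif import Motif`
binds the name `Motif` to a class, while `from Bio import Motif` rebinds the
same name to the package. Deriving from a module object raises a TypeError of
the kind quoted, and a function such as `parse` that lives at package level
is an ordinary function, not a metaclass method, so a subclass does not
inherit it. The example assumes that this is the situation here. It uses no
Biopython at all: `Motif`, `parse`, `from_motif`, `parse_scannable` and the
toy scoring rule are hypothetical stand-ins, and copying `__dict__` only
works for classes that do not use `__slots__`.

```python
import math


class Motif(object):
    """Hypothetical stand-in for a library motif class."""

    def __init__(self, name, instances):
        self.name = name
        self.instances = instances


def parse(lines):
    """Hypothetical stand-in for a package-level parser; yields base-class objects."""
    for line in lines:
        name, sites = line.strip().split(":")
        yield Motif(name, sites.split(","))


class ScannableMotif(Motif):
    """Subclass that adds a scoring method without touching the base class."""

    @classmethod
    def from_motif(cls, motif):
        # Build an instance of the subclass and copy the parsed state across,
        # instead of reassigning motif.__class__ in place.
        new = cls.__new__(cls)
        new.__dict__.update(motif.__dict__)
        return new

    def score_hit(self, sequence, position):
        # Toy score (an assumption, not a real PWM): the largest number of
        # matching letters between any stored site and the window at position.
        best = 0
        for site in self.instances:
            window = sequence[position:position + len(site)]
            best = max(best, sum(1 for a, b in zip(site, window) if a == b))
        return best


def parse_scannable(lines):
    """Wrap the package-level parser so callers receive the subclass."""
    for motif in parse(lines):
        yield ScannableMotif.from_motif(motif)


def subclassing_a_module_fails():
    """Return True if deriving a class from a module object raises TypeError."""
    try:
        class Broken(math):  # math is a module, not a class
            pass
    except TypeError:
        return True
    return False
```

With this layout the caller iterates over `parse_scannable(...)` and never
needs the `__class__` assignment. Importing the package under a different
name, for example `from Bio import Motif as motif_module`, would also keep
the class and the package from competing for the name `Motif`.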