From p.j.a.cock at googlemail.com Fri Apr 2 13:34:00 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 2 Apr 2010 18:34:00 +0100 Subject: [Biopython] Biopython 1.54 beta released Message-ID: Dear all, A beta release for Biopython 1.54 is now available for download and testing, as announced here: http://news.open-bio.org/news/2009/06/biopython-154-beta-released/ Note that I haven't done a fully detailed release announcement, we'll leave that for the official release. Source distributions and Windows installers are available from the downloads page on the Biopython website. http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on the new features - including the updated multiple sequence alignment object (which is what you'll now get when parsing alignments with Bio.AlignIO), the new Bio.Phylo module, and the Bio.SeqIO support for Standard Flowgram Format (SFF) files. (At least) 10 people contributed to this release (so far), which includes 4 new people: Anne Pajon (first contribution) Brad Chapman Christian Zmasek Eric Talevich Jose Blanca (first contribution) Kevin Jacobs (first contribution) Leighton Pritchard Michiel de Hoon Peter Cock Thomas Holder (first contribution) On behalf of the Biopython team, thank you for any feedback, bug reports, and contributions. Peter P.S. You may wish to subscribe to our news feed. For RSS links etc, see: http://biopython.org/wiki/News Biopython news is also on Twitter: http://twitter.com/biopython From p.j.a.cock at googlemail.com Fri Apr 2 13:39:08 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 2 Apr 2010 18:39:08 +0100 Subject: [Biopython] Biopython 1.54 beta released In-Reply-To: References: Message-ID: > Dear all, > > A beta release for Biopython 1.54 is now available for download > and testing, as announced here: > > http://news.open-bio.org/news/2009/06/biopython-154-beta-released/ > > Note that I haven't done a fully detailed release announcement, > we'll leave that for the official release. That URL should have been: http://news.open-bio.org/news/2010/04/biopython-1-54-beta-released/ Sorry for the extra email, Peter From cgohlke at uci.edu Fri Apr 2 19:05:25 2010 From: cgohlke at uci.edu (Christoph Gohlke) Date: Fri, 02 Apr 2010 16:05:25 -0700 Subject: [Biopython] Biopython 1.54b test failures Message-ID: <4BB67835.7030303@uci.edu> Hello, I get two test failures (see below) when running 'setup.py test' for biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are related to line ending style. Maybe it would be a good idea to use Python's universal newline support (available since 2.3) when opening text files for iteration over lines.
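For example (just a sketch, with a made-up file name), opening in universal newline mode makes the iteration independent of the line ending style:

    # 'rU' converts '\r\n' and '\r' line endings to '\n' on the fly (Python 2.3+)
    handle = open("some_scop_file.txt", 'rU')
    for line in handle:
        fields = line.rstrip("\n").split("\t")  # same result for Unix or DOS files
    handle.close()
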
All tests pass after the following changes: BIO/SCOP/Raf.py line 104: f = open(self.filename, 'rU') line 121: f = open(self.filename, 'rU') BIO/SCOP/Cla.py line 103: f = open(self.filename, 'rU') line 123: f = open(self.filename, 'rU') line 72 (inconsistent indentation): h.append("=".join(map(str,ht))) -- Christoph ====================================================================== ERROR: Test CLA file indexing ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SCOP_Cla.py", line 74, in testIndex rec = index['d1hbia_'] File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", line 127, in __getitem__ record = Record(line) File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", line 45, in __init__ self._process(line) File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Cla.py", line 51, in _process raise ValueError("I don't understand the format of %s" % line) ValueError: I don't understand the format of 5 ====================================================================== ERROR: testSeqMapIndex (test_SCOP_Raf.RafTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SCOP_Raf.py", line 68, in testSeqMapIndex r = index.getSeqMap("103m") File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", line 152, in getSeqMap sm = self[id] File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", line 125, in __getitem__ record = SeqMap(line) File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", line 196, in __init__ self._process(line) File "D:\Dev\Compile\Biopython\biopython-1.54b\build\lib.win-amd64-2.6\Bio\SCOP\Raf.py", line 216, in _process raise ValueError("Incompatible RAF version: "+self.version) ValueError: Incompatible RAF version: .01 ---------------------------------------------------------------------- Ran 143 tests in 98.871 seconds FAILED (failures = 2) From biopython at maubp.freeserve.co.uk Fri Apr 2 19:22:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Apr 2010 00:22:32 +0100 Subject: [Biopython] Biopython 1.54b test failures In-Reply-To: <4BB67835.7030303@uci.edu> References: <4BB67835.7030303@uci.edu> Message-ID: On Sat, Apr 3, 2010 at 12:05 AM, Christoph Gohlke wrote: > Hello, > > I get two test failures (see below) when running 'setup.py test' for > biopython 1.54b on win-amd64-py2.6 (built with msvc9). These are > related to line ending style. It is a known issue - a simple workaround is just to run something like unix2dos on the SCOP test files, and then the tests pass. > Maybe it would be a good idea to use Python's universal > newline support (available since 2.3) when opening text > files for iteration over lines. I had tried that in the past without success... > All tests pass after the following changes: > > BIO/SCOP/Raf.py > > line 104: >        f = open(self.filename, 'rU') > > line 121: >        f = open(self.filename, 'rU') > > BIO/SCOP/Cla.py > > line 103: >        f = open(self.filename, 'rU') > > line 123: >        f = open(self.filename, 'rU') > > line 72 (inconsistent indentation): >            h.append("=".join(map(str,ht))) > I recall trying the universal read lines thing before without success in the SCOP tests - maybe it was this line 72 thing that I missed.
I'll take another look at this next week (when I have access to a Windows machine). Thanks, Peter From skhadar at gmail.com Fri Apr 2 21:33:01 2010 From: skhadar at gmail.com (Khader Shameer) Date: Fri, 2 Apr 2010 19:33:01 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 Message-ID: Hi, I was trying to install BioPython using fink. Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Used the command "fink install biopython-py24" Got the following error: Failed: no package found for specification 'biopython-py24'! Tried 23, 24 and 25 - it is not working. Any idea why it is not working? Thanks, Shameer From vincent at vincentdavis.net Fri Apr 2 23:04:17 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Fri, 2 Apr 2010 21:04:17 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: Installing from source, following the instructions here, is straightforward; I just did it with the newest version, no problems: http://biopython.org/wiki/Download *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Fri, Apr 2, 2010 at 7:33 PM, Khader Shameer wrote: > Hi, > > I was trying to install BioPython using fink. > > Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > > Used the command "fink install biopython-py24" > Got the following error: > Failed: no package found for specification 'biopython-py24'! > Tried 23, 24 and 25 - it is not working. > > Any idea why it is not working? > > Thanks, > Shameer > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Sat Apr 3 06:33:48 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 3 Apr 2010 11:33:48 +0100 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: On Sat, Apr 3, 2010 at 2:33 AM, Khader Shameer wrote: > Hi, > > I was trying to install BioPython using fink. > > Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > > Used the command "fink install biopython-py24" > Got the following error: > Failed: no package found for specification 'biopython-py24'! > Tried 23, 24 and 25 - it is not working. > > Any idea why it is not working? Something to do with Fink? Also note we don't support Python 2.3 anymore (and Python 2.4 is on its last few releases as a supported version for Biopython). Apple provides python 2.5 (32bit) and python 2.6 (64bit) on Snow Leopard. I actually use python 2.6 on the Mac specifically because it is 64bit and can cope with more memory. As Vincent and our documentation suggest, try just installing from source. You'll need to install Apple's XCode tools first, and it seems to help if you tick the optional older SDKs as well. Peter From p.j.a.cock at googlemail.com Sat Apr 3 09:52:11 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 3 Apr 2010 14:52:11 +0100 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: >> Hi, >> >> I was trying to install BioPython using fink. >> >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin >> >> Used the command "fink install biopython-py24" >> Got the following error: >> Failed: no package found for specification 'biopython-py24'!
>> Tried 23, 24 and 25 - it is not working. >> >> Any idea why it is not working ? > > Something to do with Fink? Also note we don't > support Python 2.3 anymore (and Python 2.4 is > on its last few releases as a supported version > for Biopython). If you really want to use fink, I think you'll have to contact the fink team. Specifically it looks like Koen van der Drift is kindly taking care of packaging Biopython on Fink: http://pdb.finkproject.org/pdb/package.php/biopython-py24 http://pdb.finkproject.org/pdb/package.php/biopython-py25 http://pdb.finkproject.org/pdb/package.php/biopython-py26 Peter From skhadar at gmail.com Sat Apr 3 13:19:49 2010 From: skhadar at gmail.com (Khader Shameer) Date: Sat, 3 Apr 2010 11:19:49 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: Thanks Vincent, Peter : I have installed BioPython from source. On Sat, Apr 3, 2010 at 7:52 AM, Peter Cock wrote: > >> Hi, > >> > >> I was trying to install BioPython using fink. > >> > >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > >> > >> Used the command "fink install biopython-py24" > >> Got the following error: > >> Failed: no package found for specification 'biopython-py24'! > >> Tried 23, 24 and 25 - it is not working. > >> > >> Any idea why it is not working ? > > > > Something to do with Fink? Also note we don't > > support Python 2.3 anymore (and Python 2.4 is > > on its last few releases as a supported version > > for Biopython). > > If you really want to use fink, I think you'll have to > contact the fink team. Specifically it looks like > Koen van der Drift is kindly taking care of packaging > Biopython on Fink: > > http://pdb.finkproject.org/pdb/package.php/biopython-py24 > http://pdb.finkproject.org/pdb/package.php/biopython-py25 > http://pdb.finkproject.org/pdb/package.php/biopython-py26 > > Peter > From rmb32 at cornell.edu Sat Apr 3 16:09:27 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 13:09:27 -0700 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4BB7A077.4070802@cornell.edu> Hi all, Reminder: GSoC student proposals must be submitted to Google by April 9th, 19:00 UTC. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 5 proposals submitted to our org in Google's web app. Keep them coming, and let's see some really good ones! Rob Buels OBF GSoC 2010 Administrator From rmb32 at cornell.edu Sun Apr 4 00:37:38 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 21:37:38 -0700 Subject: [Biopython] Reminder: GSoC student applications due April 9, 19:00 UTC Message-ID: <4BB81792.8060001@cornell.edu> Hi all, Sending this again with a different subject line, just in case. GSoC student proposals must be submitted to Google through their web application by *April 9th, 19:00 UTC*. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 6 proposals submitted to our org in Google's web app. Keep them coming, and keep them good! 
Rob Buels OBF GSoC 2010 Administrator From ulfada at gmail.com Sun Apr 4 21:46:14 2010 From: ulfada at gmail.com (Sofia Lemons) Date: Sun, 4 Apr 2010 21:46:14 -0400 Subject: [Biopython] SoC project (BioPython and PyCogent) Message-ID: I'm working on an application for the Summer of Code project of integrating BioPython and PyCogent. I've looked through the list archives and saw Brad's general advice to other potential SoC applicants, but I thought I'd introduce myself and see if there was any advice specific to this project. I've used BioPython in the past and even explored the code a bit. I'm considering working on one or more of the bugs in Bugzilla if I can find time, and will work to familiarize myself with PyCogent. Are there any other concepts, projects, or people I should familiarize myself with (aside from what's listed on the ideas page, of course)? As you can see from my GitHub and Google Code accounts, I've got some experience with open source projects, but please do suggest any specific tools or methods you think I should try to get up to speed on, as well. Feel free to contact me off-list. Thanks, Sofia From stran104 at chapman.edu Mon Apr 5 06:59:28 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 5 Apr 2010 03:59:28 -0700 Subject: [Biopython] GSoC Ortholog Module Proposal Message-ID: Dear Biopython GSoC list, I am a student at Chapman University and over the last 18 months I have been using biopython to produce phylogenetic trees with ClustalW, T-Coffee, and PHYLIP. I have found the most difficult part to be identifying orthologs for the particular species that our lab is interested in studying. The orthology databases provide a large number of matches but each database requires its own wrapper and some databases are stronger than others with particular species. So far I have written wrappers to get ortholog IDs from InParanoid and then fetch the sequences from either NCBI or BioMart. This provides good results for most common species but not all. To handle rare species I have implemented the Reciprocal Smallest Distance orthology algorithm to run protein-protein searches. It is available at http://ortholog.us. I also have automated scripts to align protein families, concatenate aligned families, and create trees. For GSoC I would like to write a module to abstract finding orthologs as much as possible. This would greatly simplify creating custom evolutionary trees for biologists. The module could fetch orthologs from TreeFam, InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also provide support for producing alignments, concatenating alignments, removing sections of gaps, and constructing trees. Ortholog identification could be done with no dependency other than an internet connection. Alignments and trees would require the user to have the appropriate tools installed. The overhead of writing this type of code makes it difficult for evolutionary biologists and bio wet labs to get a picture of evolutionary relationships in specific groups of species. This module would aim to simplify creating custom phylogenetic trees. A timeline of milestones might look something like this: Week 1-2: Stable wrappers for InParanoid Week 3-4: Stable wrappers for Roundup Week 5-6: Stable wrappers for Treefam Week 6-7: Stable wrappers for BlastO Week 8-9: Ortholog module to abstract the database wrappers Week 10-11: Alignment and tree tools Is there any interest in having such a project? I'd be grateful to get some feedback either on or off list.
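To make the "abstract finding orthologs" goal a bit more concrete, here is a very rough sketch of the kind of single entry point I have in mind -- the function name, signature and return value are placeholders only, nothing like this exists in Biopython yet:

    # Illustrative stub only: one call that hides the per-database wrappers.
    def get_orthologs(identifier, species=None,
                      sources=("inparanoid", "roundup", "treefam", "blasto")):
        """Return a list of (source, species, ortholog_id) tuples (stub)."""
        results = []
        # ...each database would get its own wrapper hidden behind this one call...
        return results

    hits = get_orthologs("BRCA2_HUMAN", species="Mus musculus")

Each database wrapper would then only need to satisfy that one small interface.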
Best, -Matthew Strand From chapmanb at 50mail.com Mon Apr 5 07:50:00 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Apr 2010 07:50:00 -0400 Subject: [Biopython] SoC project (BioPython and PyCogent) In-Reply-To: References: Message-ID: <20100405115000.GB62718@sobchak.mgh.harvard.edu> Sofia; > I'm working on an application for the Summer of Code project of > integrating BioPython and PyCogent. Great -- glad to hear you are interested in the project. > I've looked through the list > archives and saw Brad's general advice to other potential SoC > applicants, but I thought I'd introduce myself and see if there was > any advice specific to this project. The overall goal is to provide integration between Biopython and PyCogent so programmers can benefit from the unique features and algorithms in each library. This has two general themes: - Ensuring interoperability between core objects like sequences, alignments and phylogenetic trees. - Using this interoperability to develop analysis workflows that utilize functionality from both libraries. Within this broad scope you are free to orient your proposal to whatever set of biological questions interests you. We've tried to sketch out some ideas we had on the GSoC page as a starting point. > I've used BioPython in the past > and even explored the code a bit. I'm considering working on one or > more of the bugs in Bugzilla if I can find time, and will work to > familiarize myself with PyCogent. Are there any other concepts, > projects, or people I should familiarize myself with (aside from > what's listed on the ideas page, of course)? Proposals are due this Friday, April 9th and normally require a few rounds of back and forth revisions to get to a competitive level. My suggestion would be to focus on learning enough of Biopython and PyCogent to write out a detailed project plan, with a week by week description of activities and specific goals. > As you can see from my > GitHub and Google Code accounts, I've got some experience with open > source projects, but please do suggest any specific tools or methods > you think I should try to get up to speed on, as well. The open source work is great; definitely include this in your proposal. A good outline to start with is: - Project summary -- A short abstract describing what you hope to accomplish during the summer, how you plan to go about it, and what motivates you to work on the project. - Personal summary -- Describe your background and how it will help you be successful during GSoC. Here is where you can sell yourself to all of the mentors ranking the project: why are you a good coder? Why is this project useful to us? How will working on the summer project encourage you to stay active in the community? - Project plan -- The detailed week by week description of plans mentioned above. Hope this helps, Brad From chapmanb at 50mail.com Mon Apr 5 08:05:54 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Apr 2010 08:05:54 -0400 Subject: [Biopython] GSoC Ortholog Module Proposal In-Reply-To: References: Message-ID: <20100405120554.GC62718@sobchak.mgh.harvard.edu> Matthew; Thanks for the introduction and pointers to your work. Your http://ortholog.us interface looks like a useful resource; it's really nice to see web interfaces being developed with programmable JSON APIs. Out of curiosity, is the code available for what you've done so far? > For GSoC I would like to write a module to abstract finding orthologs as > much as possible.
This would greatly simplify creating custom evolutionary > trees for biologists. The module could fetch orthologs from TreeFam, > InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also > provide support for producing alignments, concatenating alignments, removing > sections of gaps, and constructing trees. Ortholog identification could be > done with no dependency other than an internet connection. Alignments and > trees would require the user to have the appropriate tools installed. [...] > Is there any interest in having such a project? I'd be grateful to get some > feedback either on or off list. This is a good project idea and nicely spec'ed out. One additional direction that might also be worth exploring is using BioMart to retrieve orthologs from the Ensembl Compara work. Here's a recent thread on BioStar with the queries to use: http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale I don't know of Python programming interfaces to BioMart, but there is a nice R bioconductor library that can be leveraged with Rpy2: http://www.bioconductor.org/packages/bioc/html/biomaRt.html http://rpy.sourceforge.net/rpy2.html For the practical GSoC things, project proposals are due this Friday, April 9th so time is running short. I'm unfortunately a bit over-committed at this point to mentor but hopefully someone will be available to step in that role. I'm happy to make suggestions on the proposal as it comes together. Thanks, Brad From bjorn_johansson at bio.uminho.pt Mon Apr 5 09:50:25 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Mon, 5 Apr 2010 14:50:25 +0100 Subject: [Biopython] pro Message-ID: Hi, I have a problem that may be related to biopython (or not). I have written a plugin for a cross platform program (Wikidpad) that relies on some biopython modules. I do the development on ubuntu 9.10 and have Wikidpad installed using wine to be able to test the functionality on windows. Under wine I have added the following code to make biopython installed under linux available to the python interpreter (py2exe) under wine: if sys.platform == 'win32': sys.path.append("z:\usr\local\lib\python2.6\dist-packages") sys.path.append("z:\usr\lib/python2.6") line 40 in "SeqTools.py" below reads: from Bio import SeqIO I get the error below when importing the module under wikidpad running under wine File "C:\Program Files\WikidPad\user_extensions\SeqTools.py", line 40, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\__init__.py", line 303, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\InsdcIO.py", line 29, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\__init__.py", line 53, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py", line 319, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py", line 177, in __init__ File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 88, in __init__ File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 129, in collectRules File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 101, in addRule AttributeError: 'NoneType' object has no attribute 'split' I wonder if anyone has an immediate idea of what I am doing wrong? The python interpreter under wine seems to find the biopython modules. I cannot understand the error that I get afterwards..... grateful for help!
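(One aside, probably unrelated to the error: the backslashes in those path strings happen to work only because Python 2 leaves unrecognised escape sequences alone; raw strings are a safer way to write the same thing, e.g.:

    import sys
    if sys.platform == 'win32':
        # raw strings avoid any risk of backslash sequences being treated as escapes
        sys.path.append(r"z:\usr\local\lib\python2.6\dist-packages")
        sys.path.append(r"z:\usr\lib\python2.6")

This is just a tidier version of the snippet above, not a fix for the traceback.)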
/bjorn From eric.talevich at gmail.com Mon Apr 5 11:48:04 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 5 Apr 2010 11:48:04 -0400 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 2010/4/5 Björn Johansson > Hi, > I have a problem that may be related to biopython (or not). > I have written a plugin for a cross platform program (Wikidpad) that relies > on some biopython modules. > I do the development on ubuntu 9.10 and have Wikidpad installed using wine > to be able to test the functionality on windows. > > Under wine I have added the following code to make biopython installed > under > linux available to the python interpreter (py2exe) under wine: > [...] > It looks like spark relies on the docstrings in Bio.GenBank.LocationParser. Is there anything in py2exe that would strip the docstrings from compiled modules? Some optimizations do this -- I think "python -O3" strips docstrings, for instance. -Eric From p.j.a.cock at googlemail.com Mon Apr 5 12:16:43 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Apr 2010 17:16:43 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 2010/4/5 Eric Talevich > > It looks like spark relies on the docstrings in Bio.GenBank.LocationParser. > Is there anything in py2exe that would strip the docstrings from compiled > modules? Some optimizations do this -- I think "python -O3" strips > docstrings, for instance. You may be on to something there Eric. Björn, could you compare your file: z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py with the version we provide: http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py or: http://biopython.org/SRC/biopython/Bio/Parsers/spark.py In the medium term, I'd like to move the GenBank/EMBL location parsing to something simpler and faster (using regular expressions) and then deprecate Bio.GenBank.LocationParser and indeed the whole of Bio.parsers (which just has a copy of spark). There is a bug open on this with some code. But that isn't going to help Björn right now. Peter From stran104 at chapman.edu Mon Apr 5 15:02:21 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 5 Apr 2010 12:02:21 -0700 Subject: [Biopython] GSoC Ortholog Module Proposal Message-ID: > Thanks for the introduction and pointers to your work. Your > http://ortholog.us interface looks like a useful resource; it's > really nice to see web interfaces being developed with programmable > JSON APIs. Out of curiosity, is the code available for what you've > done so far? > Thanks, we have found it useful for finding unindexed orthologs. Fetching results from the pre-compiled databases is faster but of course requires writing wrappers that are time consuming to develop. The plan is to release all code as an open source Django app with a paper that is in the works. However, I'd be happy to share any code with mentors/organizers for evaluation purposes off-list in the meantime. > > This is a good project idea and nicely spec'ed out. One additional > direction that might also be worth exploring is using BioMart to > retrieve orthologs from the Ensembl Compara work. Here's a recent > thread on BioStar with the queries to use: > > > http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale > > I don't know of Python programming interfaces to BioMart, but there > is a nice R bioconductor library that can be leveraged with Rpy2: > I agree, this would be a good addition.
I have some messy Python wrappers to BioMart but the Rpy route would probably provide a more reliable solution with less effort. > http://www.bioconductor.org/packages/bioc/html/biomaRt.html > http://rpy.sourceforge.net/rpy2.html > > For the practical GSoC things, project proposals are due this > Friday, April 9th so time is running short. I'm unfortunately a bit > over-committed at this point to mentor but hopefully someone will > be available to step in that role. I'm happy to make suggestions on > the proposal as it comes together. > Thanks, I hope so too. I will post a full proposal in the near future. Feedback would of course be greatly appreciated. I'm a little unclear, do I need a mentor to submit a proposal? Is writing a proposal a moot point without a mentor? Best, -Matt Strand From vincent at vincentdavis.net Mon Apr 5 15:51:46 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Mon, 5 Apr 2010 13:51:46 -0600 Subject: [Biopython] Build CDF file Message-ID: The custom array for which I have data does not have a CDF file. I have been told that others have changed the header on the CEL files to reference a different CDF file. That only kinda makes sense to me. I obviously have CEL files. I also have the sequences that each probe matches and finally I have genome match data. By that I mean I know which probes are a perfect match and which are a mismatch and the location of the mismatch. Can I build a CDF file from this? How? Does it make sense to build a CDF for each hybrid (not sure that's the right word) of the organism if the genome is known for each? Not sure if this is better asked here or on the BioConductor list. If there is a python solution I would try that first, I think. I think the bioconductor package altcdfenvs LINK does this. I guess I should email Laurent Gautier, maybe he reads this :) *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From biopython at maubp.freeserve.co.uk Mon Apr 5 16:35:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Apr 2010 21:35:20 +0100 Subject: [Biopython] Build CDF file In-Reply-To: References: Message-ID: On Mon, Apr 5, 2010 at 8:51 PM, Vincent Davis wrote: > The custom array for which I have data does not have a CDF > file... Hi Vincent, Did you mean to post this to the BioConductor mailing list? Peter From biopython at maubp.freeserve.co.uk Mon Apr 5 16:53:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Apr 2010 21:53:42 +0100 Subject: [Biopython] Build CDF file In-Reply-To: <-3455855938884949614@unknownmsgid> References: <-3455855938884949614@unknownmsgid> Message-ID: On Mon, Apr 5, 2010 at 9:46 PM, Vincent Davis wrote: > > No, but maybe I should. I was hoping for a python solution > Are these CDF files of yours NetCDF files? http://en.wikipedia.org/wiki/NetCDF If so, try Scientific.IO.NetCDF from Konrad Hinsen's ScientificPython http://sourcesup.cru.fr/projects/scientific-py/ Peter From chapmanb at 50mail.com Tue Apr 6 08:26:27 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 6 Apr 2010 08:26:27 -0400 Subject: [Biopython] GSoC Ortholog Module Proposal In-Reply-To: References: Message-ID: <20100406122627.GE66230@sobchak.mgh.harvard.edu> Matthew; > > Thanks for the introduction and pointers to your work. Your > > http://ortholog.us interface looks like a useful resource; it's > > really nice to see web interfaces being developed with programmable > > JSON APIs. Out of curiosity, is the code available for what you've > > done so far?
> > Thanks, we have found it useful for finding unindexed orthologs. Fetching > results from the pre-compiled databases is faster but of course requires > writing wrappers that are time consuming to develop. The plan is to release > all code as an open source Django app with a paper that is in the works. > However, I'd be happy to share any code with mentors/organizers for > evaluation purposes off-list in the meantime. Cool; definitely let us know on the mailing lists when the paper and code are out. It would be fun to see. > > For the practical GSoC things, project proposals are due this > > Friday, April 9th so time is running short. I'm unfortunately a bit > > over-committed as this point to mentor but hopefully someone will > > be available to step in that role. I'm happy to make suggestions on > > the proposal as it comes together. > > Thanks, I hope so too. I will post a full proposal in the near future. > Feedback would of course be greatly appreciated. I'm a little unclear, do I > need a mentor to submit a proposal? Is writing a proposal a mute point > without a mentor? You will need a mentor and this is always the tough part of GSoC: there are more good students and ideas than mentors and funded spots. I would never discourage anyone from getting together a proposal; it is a good exercise and helps you think through the work you are planning to do. In terms of acceptance rates, it is lower when coming in later in the process with your own ideas since mentors will have already settled on a few ideas and begun feeling committed to students working on those. However, nothing is locked down or decided until the deadline hits, proposals are ranked by all of the mentors, and we see how many spots we'll get from Google. GSoC is kind of like interviewing job candidates without being sure how many positions you'll have at the end. In summary, if you feel like the proposal writing process would be interesting and useful to you, I'd definitely encourage you to go for it and see where it takes you. Brad From bjorn_johansson at bio.uminho.pt Wed Apr 7 05:33:39 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Wed, 7 Apr 2010 10:33:39 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: Hi, thank you very much for the information, I think it has to do with the docstrings, if I run with python -OO under linux, I get the same error msg. as for the two spark files, they seem identical, spark.py is the one i downloaded from http://biopython.org/SRC/biopython/Bio/Parsers/spark.py: diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py produces no output at all. I will try and find out if the optimization can be overridden for one file only. Thanks! /bjorn 2010/4/5 Peter Cock > 2010/4/5 Eric Talevich > > > > It looks like spark relies on the docstrings in > Bio.GenBank.LocationParser. > > Is there anything in py2exe that would strip the docstrings from compiled > > modules? Some optimizations do this -- I think "python -O3" strips > > docstrings, for instance. > > You may be on to something there Eric. 
> > Björn, could you compare your file: > z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py > with the version we provide: > http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py > or: > http://biopython.org/SRC/biopython/Bio/Parsers/spark.py > > In the medium term, I'd like to move the GenBank/EMBL location > parsing to something simpler and faster (using regular expressions) > and then deprecate Bio.GenBank.LocationParser and indeed the > whole of Bio.parsers (which just has a copy of spark). There is > a bug open on this with some code. But that isn't going to help > Björn right now. > > Peter > -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL http://www.bio.uminho.pt http://sites.google.com/site/bjornhome Work (direct) +351-253 601517 Private mob. +351-967 147 704 Dept of Biology (secretariate) +351-253 60 4310 Dept of Biology (fax) +351-253 678980 From p.j.a.cock at googlemail.com Wed Apr 7 05:37:59 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Apr 2010 10:37:59 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 2010/4/7 Björn Johansson: > Hi, > thank you very much for the information, I think it has to do with the > docstrings, if I run with python -OO under linux, I get the same error msg. > > as for the two spark files, they seem identical, spark.py is the one I > downloaded from > http://biopython.org/SRC/biopython/Bio/Parsers/spark.py: > > diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py > > produces no output at all. OK, thanks. I wanted to find out if py2exe was optimising the python files by editing them to remove the docstrings. It seems not. > I will try and find out if the optimization can be overridden for one file > only. > > Thanks! > /bjorn Peter From lunt at ctbp.ucsd.edu Wed Apr 7 20:57:07 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Wed, 7 Apr 2010 17:57:07 -0700 Subject: [Biopython] StockholmIO replaces "." with "-", why? Message-ID: Greetings All! It looks like line 364 of Bio.AlignIO.StockholmIO reads: seqs[id] += seq.replace(".","-") So when you load into memory alignments that mark gaps created to allow alignment to inserts with ".", (such as PFam alignments or the output of hmmer) that information is lost. I know there must be a good reason for this, but I am finding it a problem on my end.. -Bryan Lunt From fuxin at umail.iu.edu Wed Apr 7 21:40:02 2010 From: fuxin at umail.iu.edu (Fuxiao Xin) Date: Wed, 7 Apr 2010 21:40:02 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy Message-ID: Dear all, I am a third-year PhD student in Bioinformatics from Indiana University Bloomington. I am very interested in the Google Summer of Code project of Biopython "PDB-Tidy: command-line tools for manipulating PDB files". My own research needs extensive manipulation of PDB files, and I think this idea of adding more features to Bio.PDB and more command line options to analyze/present PDB data is excellent. This project is of strong interest to me since it will benefit my own research project as well. Programming Skills: I use perl and python during my daily research. I am now working on developing a new functional site predictor using protein structure information. The code will be open source, but the work is under review so the code is not released yet. My project plan: week1 1.
Renumber residues starting from 1 (or N) function name: renumberPDB, given a pdb file, rename the atom field numbering of the file to remove missing amino acids communicate with mentors to set standards of the code to follow for the rest of the functions create work log to keep track of progress; week2-3 2. Select a portion of the structure -- models, chains, etc. -- and write it to a new file (PDB, FASTA, and other formats) function name: rewritePDB, inputs will be a particular portion of a PDB file you want to write out (support 'chain', 'model', 'atom'), a file format (PDB, fasta), and the output name. 3. Perform some basic, well-established measures of model quality/validity function name: PDBquality the function will report RESOLUTION and ? of the structure 4. extract disorder region in PDB structure function name: PDBdisorder report missing residues in the structure atom field week3-4 5. make a function to draw a Ramachandran plot function name: ramaPLOT combine the two steps (calculating torsion angles and drawing the plot) into one function, give the option to draw the plot or not week5 6. open PDB files in the window for visualization, visualize PDBsuperpose results, output RMSD function name: superposePDB the function will look like the PDBsuperpose function in matlab; use Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other visualization tool to see the results week6 7. write a function to extract all experimental conditions of a PDB file, including pH, temperature, and salt function name: PDBcondition it will be easy to get pH and temperature information, but for salt, it will be hard to parse because there is no general rule for such information in the PDB file; parse REMARK 200 field; week7-8 8. extract PTM, function name: PDBptm difficult: the Post-translational modification annotation in PDB is not consistent, need to make a list of PTMs to work on parse MODRES field week9-10 9. extract ligand binding information function name: PDBligand parse HETNAM field Other obligations: I am aware that Google Summer of Code starts on May 24th, but I will have a review paper with my advisor due on June 1st, I hope it will be OK for me to start after June 1st, and I will make up the first week in August. Best, Fuxiao From eric.talevich at gmail.com Wed Apr 7 23:48:08 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 7 Apr 2010 23:48:08 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: Hi Fuxiao, Thanks for your interest in this project. I see you've been working on this proposal for a while already, so although the submission deadline is very close, I think you'll still be OK. I've interleaved my comments with your proposal below: On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin wrote: > Dear all, > > I am a third-year PhD student in Bioinformatics from Indiana University > Bloomington. I am very interested in the Google Summer of Code project of > Biopython "PDB-Tidy: command-line tools for manipulating PDB files". > > My own research needs extensive manipulation of PDB files, and I think > this > idea of adding more features to Bio.PDB and more command line options to > analyze/present PDB data is excellent. This project is of strong interest > to > me since it will benefit my own research project as well. > Good to hear. Does your lab have a website? This project requires some knowledge of structural biology, so it helps if we can see what specific research you've already done in that area.
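(Incidentally, on the week 2-3 item about selecting chains/models and writing them out: much of that can be layered on the existing Bio.PDB machinery. A rough, untested sketch -- the structure id and file names are made up:

    from Bio.PDB import PDBParser, PDBIO
    from Bio.PDB.PDBIO import Select

    class ChainSelect(Select):
        """Tell PDBIO to write out only the requested chain."""
        def __init__(self, chain_id):
            self.chain_id = chain_id
        def accept_chain(self, chain):
            return chain.get_id() == self.chain_id

    parser = PDBParser()
    structure = parser.get_structure("example", "example.pdb")  # made-up file name
    io = PDBIO()
    io.set_structure(structure)
    io.save("example_chain_A.pdb", ChainSelect("A"))

The FASTA output side could similarly reuse Bio.PDB.Polypeptide to pull out the sequence.)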
Programming Skills: I use perl and python during my daily research. I am now > working on developing a new functional site predictor using protein > structure information. The code will be open source, but the work is under > review so the code is not released yet. > Is there any other programming work you've done in the past that you could let us see? It doesn't have to be part of an existing open-source project; even some functioning snippets posted somewhere would help us get a sense of your coding style and abilities. Examples where you've used Biopython or another established toolkit for working with PDB files or other scientific data would be especially useful. We also like to see that you're familiar with a project's build tools, which in Biopython's case is GitHub and the standard Python mechanisms. So, if you could upload some of your prior work to GitHub and send us the link, that would be ideal. My project plan: > > week1 > 1. Renumber residues starting from 1 (or N) > function name: renumberPDB, given a pdb file, rename the atom field > numbering of the file to remove missing amino acids > communicate with mentors to set standards of the code to follow for the > rest > of the functions > create work log to keep track of process; > Biopython's coding standards generally follow an earlier version of PEP 8; hopefully you can pick it up quickly just by reading the source code for Bio.PDB -- so you don't really need that item listed here. In the past, students have maintained their weekly schedules on a wiki or other public document, and updated them continually throughout the summer. This functions as a work log, in a way. You would also have an e-mail record of your work from your weekly reports to this list. week2-3 > 2. Select a portion of the structure -- models, chains, etc. -- and write > it > to a new file (PDB, FASTA, and other formats) > function name: rewritePDB, inputs will be a particular portion of a PDB > file > you want to write out(support 'chain', 'model', 'atom'), a file format(PDB, > fasta), and the output name. > 3. Perform some basic, well-established measures of model quality/validity > function name: PDBquality > the function will report RESOLUTION and ? of the structure > 4. extract disorder region in PDB structure > function name: PDBdisorder > report missing residues in the structure atom field > These tasks seem reasonable. You don't need to commit to specific function names yet; it would be more helpful to describe the overall module layout you're planning, and list the dependencies for each (especially the components of Bio.PDB that come into play). > week3-4 > 5. make a function to draw a Ramachandran plot > function name: ramaPLOT > combine the two steps(calcualting torsion angles and draw the plot) into > one > function, give the option to draw the plot or not > This task has a number of dependencies which I think you should list and describe here. Because of those dependencies there's a significant chance of it taking longer than you planned -- so I'd recommend moving it to after the midterm evaluations, wherever those fit into your schedule. week5 > 6. 
open PDB files in the window for visulization, visulize PDBsuperpose > results, output RMSD > function name: superposePDB > the function will look like the PDBsuperpose function in matlab; use > Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other > visulization tool to see the results > Would you build Python wrappers for interacting with the chosen visualization tool, or just write a set of files and launch the viewer in a script? > week6 > 7. write a function to extract all experimental conditions of a PDB file, > includes PH, temperature, and salt > function name: PDBconditon > it will be easy to get PH and temperature information, but for salt, it > will > be hard to parse because there is no general rule of such information in > the > PDB file; parse REMARK 200 field; > Sounds handy. Would your script write out a report combining all of this info, or just extract requested elements? > week7-8 > 8. extract PTM, > function name: PDBptm > difficult: the Post-translational modification annotation in PDB is not > consistant, need to make a list of PTMs to work on > parse MODRES field > > week9-10 > 9. extract ligand binding information > function name: PDBligand > parse HETNAM field > Good. Some of these later items sound straightforward enough that it would be better to tackle them earlier in the summer. > Other obligations: I am aware that google summer code starts from May > 24th, > but I will have a review paper with my advisor due on June 1st, I hope it > will be OK for me to start after June 1st, and I will makeup the first week > in Auguest. > How much of the "community bonding period" will this occupy? The guideline is that you get set up with the build system, read documentation and do background research part-time between GSoC acceptance and May 24, and start writing code full-time on May 24. You can make up for a gap in your project plan by doing extra preparation before coding starts; would this be possible for you? Finally, the GSoC administration app (socghop.appspot.com) gets crowded as the deadline approaches, so it's best if you register yourself there and take care of the administrivia as soon as you can to avoid any trouble on Friday. Best regards, Eric From rozziite at gmail.com Wed Apr 7 23:48:16 2010 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 7 Apr 2010 23:48:16 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: Hi Fuxiao, Good start on the application! Some comments below. On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin wrote: > Dear all, > > I am a third year Phd student in Bioinformatics from Indiana University > Bloomington. ?I am very in interested in the google summer code project of > biopython "PDB-Tidy: command-line tools for manipulating PDB files". > > My own research needs extensive manipulation of PDB files, and I think ?this > idea of adding more features to Bio.PDB and more command line options to > analyze/present PDB data is excellent. This project is of strong interest to > me since it will benefit my own research project as well. > > Programming Skills: I use perl and python during my daily research. I am now > working on developing a new functional site predictor using protein > structure information. The code will be open source, but the work is under > review so the code is not released yet. > > My project plan: > > week1 > 1. 
Renumber residues starting from 1 (or N) > function name: renumberPDB, given a pdb file, rename the atom field > numbering of the file to remove missing amino acids > communicate with mentors to set standards of the code to follow for the rest > of the functions > create work log to keep track of process; > > week2-3 > 2. Select a portion of the structure -- models, chains, etc. -- and write it > to a new file (PDB, FASTA, and other formats) > function name: rewritePDB, inputs will be a particular portion of a PDB file > you want to write out(support 'chain', 'model', 'atom'), a file format(PDB, > fasta), and the output name. > 3. Perform some basic, well-established measures of model quality/validity > function name: PDBquality > the function will report RESOLUTION and ? of the structure Maybe you can get some inspiration of measures of model quality/validity from PDBREPORT database [0] and WHAT_IF [1] software. [0] http://swift.cmbi.ru.nl/gv/pdbreport/ [1] http://swift.cmbi.ru.nl/whatif/ > 4. extract disorder region in PDB structure > function name: PDBdisorder > report missing residues in the structure atom field > > week3-4 > 5. make a function to draw a Ramachandran plot > function name: ramaPLOT > combine the two steps(calcualting torsion angles and draw the plot) into one > function, give the option to draw the plot or not > > week5 > 6. open PDB files in the window for visulization, visulize PDBsuperpose > results, output RMSD > function name: superposePDB > the function will look like the PDBsuperpose function in matlab; use > Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other > visulization tool to see the results > week6 > 7. write a function to extract all experimental conditions of a PDB file, > includes PH, temperature, and salt > function name: PDBconditon > it will be easy to get PH and temperature information, but for salt, it will > be hard to parse because there is no general rule of such information in the > PDB file; parse REMARK 200 field; > > week7-8 > 8. extract PTM, > function name: PDBptm > difficult: the Post-translational modification annotation in PDB is not > consistant, need to make a list of PTMs to work on > parse MODRES field > > week9-10 > 9. extract ligand binding information > function name: PDBligand > parse HETNAM field > > > Other obligations: ?I am aware that google summer code starts from May 24th, > but I will have a review paper with my advisor due on June 1st, I hope it > will be OK for me to start after June 1st, and I will makeup the first week > in Auguest. > > Best, > Fuxiao > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fuxin at indiana.edu Thu Apr 8 03:40:36 2010 From: fuxin at indiana.edu (Fuxiao Xin) Date: Thu, 8 Apr 2010 03:40:36 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: hi Eric and Diana, Thanks for your quick reply. For the quality/validation problem, thanks Diana for pointing me to the two resources, I am surprised that there are so many "problems" defined for PDB files, and obviously I underestimate this task, and I think it's a very interesting problem to study and I'd like to devote more time on this task, I am thinking to make this task the main focus of my first period coding(before midterm check). What do you think? For Eric's responses, please find my reply in line. 
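To give a flavour of the quality/validation direction, one very simple check -- flagging jumps in residue numbering within a chain, which often (but not always) correspond to unresolved/disordered residues -- could be sketched with Bio.PDB roughly like this (untested, file name made up):

    from Bio.PDB import PDBParser
    from Bio.PDB.Polypeptide import is_aa

    parser = PDBParser()
    structure = parser.get_structure("example", "example.pdb")
    for chain in structure[0]:   # first model only
        numbers = [res.get_id()[1] for res in chain if is_aa(res)]
        gaps = [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]
        if gaps:
            print chain.get_id(), gaps   # numbering jumps, often missing residues

The real checks would of course need to handle insertion codes and compare against SEQRES/REMARK 465 as well.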
My own research needs extensive manipulation of PDB files, and I think this >> idea of adding more features to Bio.PDB and more command line options to >> analyze/present PDB data is excellent. This project is of strong interest >> to >> me since it will benefit my own research project as well. >> > > Good to hear. Does your lab have a website? This project requires some > knowledge of structural biology, so it helps if we can see what specific > research you've already done in that area. > Our lab's website is: http://www.informatics.indiana.edu/predrag/ , and one main focus of our lab is PTM and disorder; both need to deal with PDB files. A poster title shows my protein structure-based kernel work: http://www.iscb.org/rocky09-program/rocky09-poster-presenters-abstracts - they didn't put the abstract online. I could send you the abstract if you are interested. > Programming Skills: I use perl and python during my daily research. I am >> now >> working on developing a new functional site predictor using protein >> structure information. The code will be open source, but the work is under >> review so the code is not released yet. >> > > Is there any other programming work you've done in the past that you could > let us see? It doesn't have to be part of an existing open-source project; > even some functioning snippets posted somewhere would help us get a sense of > your coding style and abilities. Examples where you've used Biopython or > another established toolkit for working with PDB files or other scientific > data would be especially useful. > We also like to see that you're familiar with a project's build tools, which > in Biopython's case is GitHub and the standard Python mechanisms. So, if you > could upload some of your prior work to GitHub and send us the link, that > would be ideal. > I put some of my python code here: http://github.com/fuxiaoxin/my_python_code. I don't have code in python using Bio.PDB. For parsing PDB files, my code is in Perl for the sake of its regular expressions. I have seldom used BioPerl or Biopython in the past; I write all my own code, which is also why I think I am very familiar with all kinds of problems in PDB files. I am quite surprised to find Bio.PDB already has so many modules for various functions. I could upload some of my Perl functions if you would like to have a look: I have functions similar to PDBparser, NeighborSearch, DSSP, NACCESS. I have to say I am not very familiar with the build tools of Python, but I hope to learn them during the bonding period. I just guided myself through uploading my code to GitHub. :) My project plan: >> >> week1 >> 1. Renumber residues starting from 1 (or N) >> function name: renumberPDB, given a pdb file, rename the atom field >> numbering of the file to remove missing amino acids >> communicate with mentors to set standards of the code to follow for the >> rest >> of the functions >> create work log to keep track of progress; >> > > Biopython's coding standards generally follow an earlier version of PEP 8; > hopefully you can pick it up quickly just by reading the source code for > Bio.PDB -- so you don't really need that item listed here. > > I will learn from Bio.PDB source code and remove this one. > In the past, students have maintained their weekly schedules on a wiki or > other public document, and updated them continually throughout the summer. > This functions as a work log, in a way. You would also have an e-mail record > of your work from your weekly reports to this list. > That's great to know. > week2-3 >> 2.
Select a portion of the structure -- models, chains, etc. -- and write >> it >> to a new file (PDB, FASTA, and other formats) >> function name: rewritePDB, inputs will be a particular portion of a PDB >> file >> you want to write out(support 'chain', 'model', 'atom'), a file >> format(PDB, >> fasta), and the output name. >> 3. Perform some basic, well-established measures of model quality/validity >> function name: PDBquality >> the function will report RESOLUTION and ? of the structure >> 4. extract disorder region in PDB structure >> function name: PDBdisorder >> report missing residues in the structure atom field >> > > These tasks seem reasonable. You don't need to commit to specific function > names yet; it would be more helpful to describe the overall module layout > you're planning, and list the dependencies for each (especially the > components of Bio.PDB that come into play). > I will make a new proposal with these details by tomorrow. > >> week3-4 >> 5. make a function to draw a Ramachandran plot >> function name: ramaPLOT >> combine the two steps(calcualting torsion angles and draw the plot) into >> one >> function, give the option to draw the plot or not >> > > This task has a number of dependencies which I think you should list and > describe here. Because of those dependencies there's a significant chance of > it taking longer than you planned -- so I'd recommend moving it to after the > midterm evaluations, wherever those fit into your schedule. > I will add more details here. > week5 >> 6. open PDB files in the window for visulization, visulize PDBsuperpose >> results, output RMSD >> function name: superposePDB >> the function will look like the PDBsuperpose function in matlab; use >> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other >> visualization tool to see the results >> > > Would you build Python wrappers for interacting with the chosen > visualization tool, or just write a set of files and launch the viewer in a > script? > I am thinking of launching the script, since those PDB visualization tools already have very nice command line options and interfaces. But I think it is really important to be able to visualize the structure on the fly, especially when you are doing PDB superimpose. > week6 >> 7. write a function to extract all experimental conditions of a PDB file, >> includes PH, temperature, and salt >> function name: PDBconditon >> it will be easy to get PH and temperature information, but for salt, it >> will >> be hard to parse because there is no general rule of such information in >> the >> PDB file; parse REMARK 200 field; >> > > Sounds handy. Would your script write out a report combining all of this > info, or just extract requested elements? > I am thinking to put the results into a variable instead of a report, since it will be great for batch processing, and display the results immediately in interactive mode. > > Other obligations: I am aware that google summer code starts from May >> 24th, >> but I will have a review paper with my advisor due on June 1st, I hope it >> will be OK for me to start after June 1st, and I will makeup the first >> week >> in Auguest. >> > > How much of the "community bonding period" will this occupy? The guideline > is that you get set up with the build system, read documentation and do > background research part-time between GSoC acceptance and May 24, and start > writing code full-time on May 24. 
You can make up for a gap in your project > plan by doing extra preparation before coding starts; would this be possible > for you? > I think the bonding period will be really important for me to get known about the python build tools, and of course other stuff you mentors suggest me to learn, so I will devote my time for "bonding". But since I will get busy near the end of May, I plan to start early and do things more efficiently. > > Finally, the GSoC administration app (socghop.appspot.com) gets crowded as > the deadline approaches, so it's best if you register yourself there and > take care of the administrivia as soon as you can to avoid any trouble on > Friday. > Thanks for the reminding. I will incorporate you and Diana's suggestions to make a new version of proposal, by tomorrow night. But the idea is, the main project for the first period would be the quality/validation task , and the second period will be the Ramachandran plot. And I will fill in the time with other small functions. Thanks, Fuxiao From biopython at maubp.freeserve.co.uk Thu Apr 8 04:04:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 09:04:27 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: > Greetings All! > > It looks like line 364 of Bio.AlignIO.StockholmIO reads: > > seqs[id] += seq.replace(".","-") > > So when you load into memory alignments that mark gaps created to > allow alignment to inserts with ".", (such as PFam alignments or the > output of hmmer) that information is lost. > > I know there must be a good reason for this, but I am finding it a > problem on my end.. > > -Bryan Lunt Hi Bryan, Yes, is it done deliberately. The dot is a problem - it has a quite specific meaning of "same as above" on other alignment file formats, while "-" is an almost universal shorthand for gap/insertion. Consider the use case of Stockholm to PHYLIP/FASTA/Clustal conversion. Have you got a sample output file we can use as a unit test or at least discuss? As I recall, on the PFAM alignments I looked at there was no data loss by doing the dot to dash mapping. Peter From sma.hmc at gmail.com Thu Apr 8 05:41:26 2010 From: sma.hmc at gmail.com (Singer Ma) Date: Thu, 8 Apr 2010 02:41:26 -0700 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability Message-ID: I am a junior Computer Science major with heavy bioinformatic leanings at Harvey Mudd College. I know that it is very late for new summer of code applications, but I was wondering if you could have a look at my proposed schedule to give me some pointers and answer a few questions. I am also considering applying for the project involving adding more ways to use R through python, but I was unsure of which project had more users who wanted it completed. Questions: What does it mean by BioPython's acquired sequences? I can't seem to find out what or where information about "acquired sequences" is. Thus, I do not discuss anything about it in my current proposal. For the creation of workflows, do there already exist use and test cases for this or would I be best off looking for ones in papers and trying to mimic them? Right now, I have an example paper where the interoperability would have been helpful. Any other use cases I should immediately consider in my proposal? My current proposed schedule: For Bio Python and PyCogent interoperability. Week 1: Familiarization with the code and soliciting requests. 
While what seems intuitive to me might not seem so to others. It would be best to spend this time to determine a group of people who would highly benefit from the interoperability and ask them for what they would look for. For example, would they rather use one, save the data, and use the other. Would they want to use them directly. Basically, I want to get a good idea of how this code will be used before making my own decisions on how I think people will use it. Also important here is to create sets of data which can be used later on the process. Week 2 and 3: Code converting PyCogent and BioPython. The core objects in each package seem like they should not be too difficult to convert. This step will involve looking into the documentation and coding for PyCogent and BioPython, to determine what the core objects contain for each. One possible problem here is if either PyCogent or BioPython core objects use heavy subclassing, as determining subclassing in Python has been a nightmare in the past. Testing at this point will likely involve going through the entire round trip conversion, and seeing if everything looks the same. Week 4: Ensure that conversions allow the use of data from one program to the other. The workflows of codon usage to clustering code can be tested. One possible test set is from Sharp et. al. 1986. Here they found different codon usage for different genes. Additionally, it should be considered how codon usage can be used to help with making biologically accurate clusters. Week 5: Familiarize with phyloXML and make interoperable with PyCogent. phyloXML has already been added with BioPython. Making phyloXML work with PyCogent could be based on how it was adapted for BioPython. Clear risks here include problems with making sure that the API for phyloXML in PyCogent gives an intuitive interface to use phyloXML. Week 6 and 7: Adapt PyCogent to query genomics databases. Currently there is at least some support for PyCogent to query ENSEMBL. It seems like it would be useful to query other genomics databases such as Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL queries into their MySQL database. Ideally, if everything previously has been alright, the conversion of PyCogent to BioPython forms shoudl already be accounted for. Week 8-12: Slip days and additional features. The initial set of use cases will surely expand and this is extra time to allow for those use cases to be accounted for. Thanks, Singer Ma From biopython at maubp.freeserve.co.uk Thu Apr 8 06:04:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 11:04:10 +0100 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 10:41 AM, Singer Ma wrote: > I am a junior Computer Science major with heavy bioinformatic leanings > at Harvey Mudd College. I know that it is very late for new summer of > code applications, but I was wondering if you could have a look at my > proposed schedule to give me some pointers and answer a few questions. > I am also considering applying for the project involving adding more > ways to use R through python, but I was unsure of which project had > more users who wanted it completed. > > Questions: > What does it mean by BioPython's acquired sequences? I can't seem to > find out what or where information about "acquired sequences" is. > Thus, I do not discuss anything about it in my current proposal. 
http://www.biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability You mean "Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code."? I think Brad means using Biopython to load (parse) sequence data (e.g. with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in the sense of get/load data. > Week 6 and 7: Adapt PyCogent to query genomics databases. Currently > there is at least some support for PyCogent to query ENSEMBL. It seems > like it would be useful to query other genomics databases such as > Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL > queries into their MySQL database. ... Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are language neutral and we have Bio.Entrez to support them in Biopython. Peter From biopython at maubp.freeserve.co.uk Thu Apr 8 06:26:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 11:26:10 +0100 Subject: [Biopython] Biopython 1.54b test failures In-Reply-To: References: <4BB67835.7030303@uci.edu> Message-ID: On Sat, Apr 3, 2010 at 12:22 AM, Peter wrote: > > I recall trying the universal read lines thing before without > success in the SCOP tests - maybe it was this line 72 thing > that I missed. I'll take another look at this next week (when > I have access to a Windows machine). > You are right - that does make the two SCOP tests pass on Windows without having to first convert the SCOP example files from Unix to DOS/Windows newlines. Checked in. Would you like to be credited for this in the NEWS and CONTRIB files? Thanks, Peter From sma.hmc at gmail.com Thu Apr 8 06:31:10 2010 From: sma.hmc at gmail.com (Singer Ma) Date: Thu, 8 Apr 2010 03:31:10 -0700 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: > You mean "Connecting Biopython acquired sequences to PyCogent's > alignment, phylogenetic tree preparation and tree visualization code."? > > I think Brad means using Biopython to load (parse) sequence data (e.g. > with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in > the sense of get/load data. Ah, so, its just the most straightforward use of the conversion tools that would be made. Sorry, I thought I was missing something here. Shouldn't be this be taken care of in the first use case of "Allow round-trip conversion between biopython and pycogent core objects (sequence, alignment, tree, etc.)."? Or does this require me to determine how the interactions will be made? > > Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are > language neutral and we have Bio.Entrez to support them in Biopython. Ah, I misread my information, so NCBI Entrez can already be queried. What exactly do we need to get from ENSEMBL that isn't already supported then? Singer From chapmanb at 50mail.com Thu Apr 8 08:39:53 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 8 Apr 2010 08:39:53 -0400 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: <20100408123953.GG911@sobchak.mgh.harvard.edu> Singer; Thanks for the introduction and initial project plan. Glad that you are interested. I'll try to tackle a few of the specific points Peter has not already talked about, and suggest some specifics for the application. > Questions: > What does it mean by BioPython's acquired sequences? 
I can't seem to > find out what or where information about "acquired sequences" is. > Thus, I do not discuss anything about it in my current proposal. Following up on what Peter mentioned, what we're trying to say there is to use the results from step 1 (interoperability) to create unique workflows that use both Biopython and PyCogent. This is a suggested workflow to utilize some of the strengths of both packages. > For the creation of workflows, do there already exist use and test > cases for this or would I be best off looking for ones in papers and > trying to mimic them? Right now, I have an example paper where the > interoperability would have been helpful. Yes, that is exactly the right approach. The ideas we've suggested are just brainstorming; please select workflows that are interesting to you. > My current proposed schedule: > > For Bio Python and PyCogent interoperability. > Week 1: Familiarization with the code and soliciting requests. While > what seems intuitive to me might not seem so to others. It would be > best to spend this time to determine a group of people who would > highly benefit from the interoperability and ask them for what they > would look for. For example, would they rather use one, save the data, > and use the other. Would they want to use them directly. Basically, I > want to get a good idea of how this code will be used before making my > own decisions on how I think people will use it. Also important here > is to create sets of data which can be used later on the process. All of this type of non-coding work should be done in the community bonding period, from April 26th to the start of coding. When week 1 hits, you want to be ready to code. See the timeline for more specific information on dates: http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline > Week 5: Familiarize with phyloXML and make interoperable with > PyCogent. phyloXML has already been added with BioPython. Making > phyloXML work with PyCogent could be based on how it was adapted for > BioPython. Clear risks here include problems with making sure that the > API for phyloXML in PyCogent gives an intuitive interface to use > phyloXML. Again, all of the non-coding activities should be moved to before the actual coding period. In your timeline you want to focus on code deliverables for each week. Of course there will be learning and reading during the program, but you want to be sure to have a code centric focus. > Week 6 and 7: Adapt PyCogent to query genomics databases. Currently > there is at least some support for PyCogent to query ENSEMBL. It seems > like it would be useful to query other genomics databases such as > Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL > queries into their MySQL database. Ideally, if everything previously > has been alright, the conversion of PyCogent to BioPython forms shoudl > already be accounted for. Following up on your discussion with Peter, you should think about some workflows that use Biopython Entrez queries and PyCogent Ensembl queries to answer interesting questions that could not be done with either. This should help to focus your ideas on integration and workflows, as opposed to implementing new functionality. > Week 8-12: Slip days and additional features. The initial set of use > cases will surely expand and this is extra time to allow for those use > cases to be accounted for. You need to continue your detailed project plan for the entire period. 
See the examples in the NESCent application documentation to get an idea of the level of detail in accepted projects from previous years: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html Practically, applications are due tomorrow, so you should have a submission sent in to OpenBio through the GSoC interface (http://socghop.appspot.com). Hope this helps, Brad From vincent at vincentdavis.net Thu Apr 8 14:33:41 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 12:33:41 -0600 Subject: [Biopython] affy CEL and CDF reader Message-ID: I ended up writing my own modules for reading both affy Cel and CDF files. Long story as to why I did not just use what was available in biopython. I plan on making what I have done available to the biopython and will upload it as a fork. I will outline what ways what I have is different below. My question is: Are there any improvements(features) others would like to see beyond what is avalible in the current CelFile.py? I saw some posts a month or so ago about checking for consistency in cell file, I think it was something about making sure the stated number of probes was consistent with the intensity measurements. What is different, when an file is read Affycel.read('file') many atributes are set. for example a = affcel() a.read('testfile') a.filename, a.version, a.header.items() # a dictionary of all header items a.num_intensity a.intensity a.num_masks a.masks a.num_outliers a.outliers a.numb_modified a.modified I plan to add the ability return/call intensity values with our with outliers or mask values. All data is currently store in numpy structured arrays, currently a.intensity returns the structured array, but I plan on making it an option to easily choose how this is returned. also what to make an optional normalized intensity array so that if the data is normalized it can be stored with the affycel instance. My use case was that I was opening about 80 cel files and reading them in was slow. this allowed me to read each file as an instance of affycel stored in a list that I then pickled. It was then much faster to open them. Are improvements to the CelFile.py are of value to biopython? I hope to have the code pushed up to my fork on github late tonight. Just thought I would ask if there was any suggestion before I did. Also have an CDF file reader, but only have done some basic testing. I don't have a lot of use for this, do other biopython users? I am kinda working in a vacuum and am trying to get more involved in projects to improve my skills and knowledge. Any suggestions would be appreciated. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From sdavis2 at mail.nih.gov Thu Apr 8 14:56:12 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 14:56:12 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis wrote: > I ended up writing my own modules for reading both affy Cel and CDF files. > Long story as to why I did not just use what was available in biopython. > I plan on making what I have done available to the biopython and will upload > it as a fork. I will outline what ways what I have is different below. > My question is: Are there any improvements(features) others would like to > see beyond what is avalible in the current CelFile.py? 
> I saw some posts a month or so ago about checking for consistency in cell > file, I think it was something about making sure the stated number of probes > was consistent with the intensity measurements. > > What is different, > when an file is read Affycel.read('file') many atributes are set. for > example > a = affcel() > a.read('testfile') > a.filename, > a.version, > a.header.items() ?# a dictionary of all header items > a.num_intensity > a.intensity > a.num_masks > a.masks > a.num_outliers > a.outliers > a.numb_modified > a.modified > > I plan to add the ability return/call intensity values with our with > outliers or mask values. > All data is currently store in numpy structured arrays, > currently a.intensity returns the structured array, but I plan on making it > an option to easily choose how this is returned. > also what to make an optional normalized intensity array so that if the data > is normalized it can be stored with the affycel instance. My use case was > that I was opening about 80 cel files and reading them in was slow. this > allowed me to read each file as an instance of affycel stored in a list that > I then pickled. It was then much faster to open them. > > Are improvements to the CelFile.py are of value to biopython? > > I hope to have the code pushed up to my fork on github late tonight. Just > thought I would ask if there was any suggestion before I did. > > Also have an CDF file reader, but only have done some basic testing. I don't > have a lot of use for this, do other biopython users? > > I am kinda working in a vacuum and am trying to get more involved in > projects to improve my skills and knowledge. Any suggestions would be > appreciated. Just out of curiosity, is your work based on the affy sdk, or are you parsing stuff yourself? Sean From vincent at vincentdavis.net Thu Apr 8 15:03:38 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 13:03:38 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: Parsing it myself, But based directly an the affy documentation found here. http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis wrote: > On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > wrote: > > I ended up writing my own modules for reading both affy Cel and CDF > files. > > Long story as to why I did not just use what was available in biopython. > > I plan on making what I have done available to the biopython and will > upload > > it as a fork. I will outline what ways what I have is different below. > > My question is: Are there any improvements(features) others would like to > > see beyond what is avalible in the current CelFile.py? > > I saw some posts a month or so ago about checking for consistency in cell > > file, I think it was something about making sure the stated number of > probes > > was consistent with the intensity measurements. > > > > What is different, > > when an file is read Affycel.read('file') many atributes are set. for > > example > > a = affcel() > > a.read('testfile') > > a.filename, > > a.version, > > a.header.items() # a dictionary of all header items > > a.num_intensity > > a.intensity > > a.num_masks > > a.masks > > a.num_outliers > > a.outliers > > a.numb_modified > > a.modified > > > > I plan to add the ability return/call intensity values with our with > > outliers or mask values. 
> > All data is currently store in numpy structured arrays, > > currently a.intensity returns the structured array, but I plan on making > it > > an option to easily choose how this is returned. > > also what to make an optional normalized intensity array so that if the > data > > is normalized it can be stored with the affycel instance. My use case was > > that I was opening about 80 cel files and reading them in was slow. this > > allowed me to read each file as an instance of affycel stored in a list > that > > I then pickled. It was then much faster to open them. > > > > Are improvements to the CelFile.py are of value to biopython? > > > > I hope to have the code pushed up to my fork on github late tonight. Just > > thought I would ask if there was any suggestion before I did. > > > > Also have an CDF file reader, but only have done some basic testing. I > don't > > have a lot of use for this, do other biopython users? > > > > I am kinda working in a vacuum and am trying to get more involved in > > projects to improve my skills and knowledge. Any suggestions would be > > appreciated. > > Just out of curiosity, is your work based on the affy sdk, or are you > parsing stuff yourself? > > Sean > From sdavis2 at mail.nih.gov Thu Apr 8 15:40:01 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 15:40:01 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis wrote: > Parsing it myself, But based directly an the affy documentation found here. > http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ So, are you covering both binary and text formats for .CEL files? I think that modern .CEL files (those produced by GCOS) are binary and represent the majority of .CEL files produced today. Some of the I/O issues that you discuss are almost definitely dealt with by using the binary .CEL files. I'm certainly not an expert on Affy, so take all these questions/comments with a grain of salt. Sean > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis wrote: > >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis >> wrote: >> > I ended up writing my own modules for reading both affy Cel and CDF >> files. >> > Long story as to why I did not just use what was available in biopython. >> > I plan on making what I have done available to the biopython and will >> upload >> > it as a fork. I will outline what ways what I have is different below. >> > My question is: Are there any improvements(features) others would like to >> > see beyond what is avalible in the current CelFile.py? >> > I saw some posts a month or so ago about checking for consistency in cell >> > file, I think it was something about making sure the stated number of >> probes >> > was consistent with the intensity measurements. >> > >> > What is different, >> > when an file is read Affycel.read('file') many atributes are set. for >> > example >> > a = affcel() >> > a.read('testfile') >> > a.filename, >> > a.version, >> > a.header.items() ?# a dictionary of all header items >> > a.num_intensity >> > a.intensity >> > a.num_masks >> > a.masks >> > a.num_outliers >> > a.outliers >> > a.numb_modified >> > a.modified >> > >> > I plan to add the ability return/call intensity values with our with >> > outliers or mask values. >> > All data is currently store in numpy structured arrays, >> > currently a.intensity returns the structured array, but I plan on making >> it >> > an option to easily choose how this is returned. 
>> > also what to make an optional normalized intensity array so that if the >> data >> > is normalized it can be stored with the affycel instance. My use case was >> > that I was opening about 80 cel files and reading them in was slow. this >> > allowed me to read each file as an instance of affycel stored in a list >> that >> > I then pickled. It was then much faster to open them. >> > >> > Are improvements to the CelFile.py are of value to biopython? >> > >> > I hope to have the code pushed up to my fork on github late tonight. Just >> > thought I would ask if there was any suggestion before I did. >> > >> > Also have an CDF file reader, but only have done some basic testing. I >> don't >> > have a lot of use for this, do other biopython users? >> > >> > I am kinda working in a vacuum and am trying to get more involved in >> > projects to improve my skills and knowledge. Any suggestions would be >> > appreciated. >> >> Just out of curiosity, is your work based on the affy sdk, or are you >> parsing stuff yourself? >> >> Sean >> > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From vincent at vincentdavis.net Thu Apr 8 15:43:57 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 13:43:57 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: No I was not reading the binary files. That said I am interested in perusing that if there is interest. Do you have a link to the SDK? *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis wrote: > On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis > wrote: > > Parsing it myself, But based directly an the affy documentation found > here. > > > http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ > > So, are you covering both binary and text formats for .CEL files? I > think that modern .CEL files (those produced by GCOS) are binary and > represent the majority of .CEL files produced today. Some of the I/O > issues that you discuss are almost definitely dealt with by using the > binary .CEL files. > > I'm certainly not an expert on Affy, so take all these > questions/comments with a grain of salt. > > Sean > > > > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis > wrote: > > > >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > > >> wrote: > >> > I ended up writing my own modules for reading both affy Cel and CDF > >> files. > >> > Long story as to why I did not just use what was available in > biopython. > >> > I plan on making what I have done available to the biopython and will > >> upload > >> > it as a fork. I will outline what ways what I have is different below. > >> > My question is: Are there any improvements(features) others would like > to > >> > see beyond what is avalible in the current CelFile.py? > >> > I saw some posts a month or so ago about checking for consistency in > cell > >> > file, I think it was something about making sure the stated number of > >> probes > >> > was consistent with the intensity measurements. > >> > > >> > What is different, > >> > when an file is read Affycel.read('file') many atributes are set. 
for > >> > example > >> > a = affcel() > >> > a.read('testfile') > >> > a.filename, > >> > a.version, > >> > a.header.items() # a dictionary of all header items > >> > a.num_intensity > >> > a.intensity > >> > a.num_masks > >> > a.masks > >> > a.num_outliers > >> > a.outliers > >> > a.numb_modified > >> > a.modified > >> > > >> > I plan to add the ability return/call intensity values with our with > >> > outliers or mask values. > >> > All data is currently store in numpy structured arrays, > >> > currently a.intensity returns the structured array, but I plan on > making > >> it > >> > an option to easily choose how this is returned. > >> > also what to make an optional normalized intensity array so that if > the > >> data > >> > is normalized it can be stored with the affycel instance. My use case > was > >> > that I was opening about 80 cel files and reading them in was slow. > this > >> > allowed me to read each file as an instance of affycel stored in a > list > >> that > >> > I then pickled. It was then much faster to open them. > >> > > >> > Are improvements to the CelFile.py are of value to biopython? > >> > > >> > I hope to have the code pushed up to my fork on github late tonight. > Just > >> > thought I would ask if there was any suggestion before I did. > >> > > >> > Also have an CDF file reader, but only have done some basic testing. I > >> don't > >> > have a lot of use for this, do other biopython users? > >> > > >> > I am kinda working in a vacuum and am trying to get more involved in > >> > projects to improve my skills and knowledge. Any suggestions would be > >> > appreciated. > >> > >> Just out of curiosity, is your work based on the affy sdk, or are you > >> parsing stuff yourself? > >> > >> Sean > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > From vincent at vincentdavis.net Thu Apr 8 16:21:32 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 14:21:32 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: Maybe I should have started this discussion differently. Is there any need for improvements to the ability to read CEL files or CDF files and if so what are they? I am interested in contributing. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 12:33 PM, Vincent Davis wrote: > I ended up writing my own modules for reading both affy Cel and CDF files. > Long story as to why I did not just use what was available in biopython. > I plan on making what I have done available to the biopython and will > upload it as a fork. I will outline what ways what I have is different > below. > My question is: Are there any improvements(features) others would like to > see beyond what is avalible in the current CelFile.py? > I saw some posts a month or so ago about checking for consistency in cell > file, I think it was something about making sure the stated number of probes > was consistent with the intensity measurements. > > What is different, > when an file is read Affycel.read('file') many atributes are set. 
for > example > a = affcel() > a.read('testfile') > a.filename, > a.version, > a.header.items() # a dictionary of all header items > a.num_intensity > a.intensity > a.num_masks > a.masks > a.num_outliers > a.outliers > a.numb_modified > a.modified > > I plan to add the ability return/call intensity values with our with > outliers or mask values. > All data is currently store in numpy structured arrays, > currently a.intensity returns the structured array, but I plan on making it > an option to easily choose how this is returned. > also what to make an optional normalized intensity array so that if the > data is normalized it can be stored with the affycel instance. My use case > was that I was opening about 80 cel files and reading them in was slow. this > allowed me to read each file as an instance of affycel stored in a list that > I then pickled. It was then much faster to open them. > > Are improvements to the CelFile.py are of value to biopython? > > I hope to have the code pushed up to my fork on github late tonight. Just > thought I would ask if there was any suggestion before I did. > > Also have an CDF file reader, but only have done some basic testing. I > don't have a lot of use for this, do other biopython users? > > I am kinda working in a vacuum and am trying to get more involved in > projects to improve my skills and knowledge. Any suggestions would be > appreciated. > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > > From sdavis2 at mail.nih.gov Thu Apr 8 18:31:43 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 18:31:43 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 3:43 PM, Vincent Davis wrote: > No I was not reading the binary files. That said I am interested in perusing > that if there is interest. > Do you have a link to the SDK? I believe this will get you close: http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no I hope my questions are not taken the wrong way, but I have learned from the bioconductor project that dealing with vendor file formats is often a non-trivial pursuit. It isn't always easy to think of all the edge cases. Sean > ?*Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > ?my blog | > LinkedIn > > > On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis wrote: > >> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis >> wrote: >> > Parsing it myself, But based directly an the affy documentation found >> here. >> > >> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ >> >> So, are you covering both binary and text formats for .CEL files? ?I >> think that modern .CEL files (those produced by GCOS) are binary and >> represent the majority of .CEL files produced today. ?Some of the I/O >> issues that you discuss are almost definitely dealt with by using the >> binary .CEL files. >> >> I'm certainly not an expert on Affy, so take all these >> questions/comments with a grain of salt. >> >> Sean >> >> >> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis >> wrote: >> > >> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > > >> >> wrote: >> >> > I ended up writing my own modules for reading both affy Cel and CDF >> >> files. >> >> > Long story as to why I did not just use what was available in >> biopython. >> >> > I plan on making what I have done available to the biopython and will >> >> upload >> >> > it as a fork. I will outline what ways what I have is different below. 
>> >> > My question is: Are there any improvements(features) others would like >> to >> >> > see beyond what is avalible in the current CelFile.py? >> >> > I saw some posts a month or so ago about checking for consistency in >> cell >> >> > file, I think it was something about making sure the stated number of >> >> probes >> >> > was consistent with the intensity measurements. >> >> > >> >> > What is different, >> >> > when an file is read Affycel.read('file') many atributes are set. for >> >> > example >> >> > a = affcel() >> >> > a.read('testfile') >> >> > a.filename, >> >> > a.version, >> >> > a.header.items() ?# a dictionary of all header items >> >> > a.num_intensity >> >> > a.intensity >> >> > a.num_masks >> >> > a.masks >> >> > a.num_outliers >> >> > a.outliers >> >> > a.numb_modified >> >> > a.modified >> >> > >> >> > I plan to add the ability return/call intensity values with our with >> >> > outliers or mask values. >> >> > All data is currently store in numpy structured arrays, >> >> > currently a.intensity returns the structured array, but I plan on >> making >> >> it >> >> > an option to easily choose how this is returned. >> >> > also what to make an optional normalized intensity array so that if >> the >> >> data >> >> > is normalized it can be stored with the affycel instance. My use case >> was >> >> > that I was opening about 80 cel files and reading them in was slow. >> this >> >> > allowed me to read each file as an instance of affycel stored in a >> list >> >> that >> >> > I then pickled. It was then much faster to open them. >> >> > >> >> > Are improvements to the CelFile.py are of value to biopython? >> >> > >> >> > I hope to have the code pushed up to my fork on github late tonight. >> Just >> >> > thought I would ask if there was any suggestion before I did. >> >> > >> >> > Also have an CDF file reader, but only have done some basic testing. I >> >> don't >> >> > have a lot of use for this, do other biopython users? >> >> > >> >> > I am kinda working in a vacuum and am trying to get more involved in >> >> > projects to improve my skills and knowledge. Any suggestions would be >> >> > appreciated. >> >> >> >> Just out of curiosity, is your work based on the affy sdk, or are you >> >> parsing stuff yourself? >> >> >> >> Sean >> >> >> > _______________________________________________ >> > Biopython mailing list ?- ?Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From reece at berkeley.edu Thu Apr 8 19:38:10 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 16:38:10 -0700 Subject: [Biopython] SeqIO.parse exception on Google App Engine Message-ID: <4BBE68E2.2030803@berkeley.edu> Hi- I'm trying to fetch a Genbank record and parse it in the Google App Engine environment. A command line version works fine, but when using exactly the same code under Google App Engine, SeqIO throws throws the following exception: ... File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line 746, in parse_footer self.line = self.line.rstrip(os.linesep) AttributeError: 'module' object has no attribute 'linesep' The environment: - Ubuntu Lucid beta1 - Python 2.6.5 - Biopython 1.53 - GAE 1.3.2 Test case: I put together a simple test case that retrieves a raw (text) Genbank record using Bio.Entrez (efetch); this works in both environments. 
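For reference, the command-line path is roughly equivalent to the following (simplified; the e-mail address is a placeholder, NCBI ask you to supply a real one):

    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.org"   # placeholder
    handle = Entrez.efetch(db="nucleotide", id="NM_004006.2", rettype="gb")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print "%s / %s / %s" % (record.id, record.name, record.description)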
Parsing that record works on the command line, but not under GAE. - curl http://harts.net/reece/tmp/demo1.tgz | tar -xvzf- - cd demo1 - update symlink ./Bio to a Biopython tree eg$ ln -s /usr/share/pyshared/Bio Bio My intent is to prepend Bio to sys.paths much the way I would expect this to be deployed (i.e., without updating sys.path). Command line test: $ ./lookup fetch_text:LOCUS NM_004006 13993 bp mRNA linear PRI 25-MAR-2010 fetch_parse:NM_004006.2 / NM_004006 / Homo sapiens dystrophin (DMD), transcript variant Dp427m, GAE test: In the demo1 directory: $ dev_appserver.py . and, in another terminal: $ curl http://localhost:8080/ You'll see the exception in the http reply and in the appserver log Thanks for any help/advice/pointers, Reece P.S. I'm learning Python and GAE at the same time, so silly errors are possible (nay, likely). From chapmanb at 50mail.com Thu Apr 8 21:19:45 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 8 Apr 2010 21:19:45 -0400 Subject: [Biopython] SeqIO.parse exception on Google App Engine In-Reply-To: <4BBE68E2.2030803@berkeley.edu> References: <4BBE68E2.2030803@berkeley.edu> Message-ID: <20100409011945.GE2011@kunkel> Hi Reece; > I'm trying to fetch a Genbank record and parse it in the Google App Engine > environment. A command line version works fine, but when using exactly the > same code under Google App Engine, SeqIO throws throws the following > exception: > ... > File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line > 746, in parse_footer > self.line = self.line.rstrip(os.linesep) > AttributeError: 'module' object has no attribute 'linesep' The python on Google App Engine is a bit crippled and lacks some of the functionality of a full python install. It looks like one issue must be that os.linesep is not defined on GAE. A quick fix is to modify this to "\n", or just do: os.linesep = "\n" at the top of the Scanner.py file. It would be really useful if you were able to submit a patch or list of areas where Biopython fails on app engine and we can think about how to suitably modify the code base to work on GAE and still be compatible with Windows. I did a bit of work on this using Biopython in Google App Engine last year; code is on GitHub here: http://github.com/chapmanb/biosqlweb that might be helpful as a starting place for other ideas. Good luck and let us know how your GAE experience goes, Brad From reece at berkeley.edu Thu Apr 8 22:34:48 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 19:34:48 -0700 Subject: [Biopython] SeqIO.parse exception on Google App Engine In-Reply-To: <20100409011945.GE2011@kunkel> References: <4BBE68E2.2030803@berkeley.edu> <20100409011945.GE2011@kunkel> Message-ID: <4BBE9248.2080502@berkeley.edu> Hi Brad. Thanks for the quick reply. On 04/08/2010 06:19 PM, Brad Chapman wrote: > A quick fix is to > modify this to "\n", or just do: > > os.linesep = "\n" > > at the top of the Scanner.py file. > It turns out that this fix also works within the module that does the parse. To wit: from Bio import SeqIO os.linesep = '\n' rec = SeqIO.parse(...) > I did a bit of work on this using Biopython in Google App Engine > last year; code is on GitHub here: > http://github.com/chapmanb/biosqlweb > that might be helpful as a starting place for other ideas. > Yes, thank you for this. This is precisely where I started only a few days ago... 
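Spelling the workaround out a little more fully, the parsing side of the app now looks roughly like this (the function name is illustrative, not the actual demo1 code):

    import os
    os.linesep = "\n"    # GAE's sandboxed os module does not define linesep

    from StringIO import StringIO
    from Bio import SeqIO

    def parse_genbank_text(raw_text):
        # raw_text is the GenBank flat file text already fetched via efetch
        return SeqIO.read(StringIO(raw_text), "genbank")

The assignment just has to happen before any parsing is done; doing it at import time is the easy way to be sure.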
Cheers, Reece From reece at berkeley.edu Fri Apr 9 00:46:36 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 21:46:36 -0700 Subject: [Biopython] GenBank.Scanner use of os.linesep Message-ID: <4BBEB12C.8030907@berkeley.edu> Hi All- I recently discovered that the GenBank parser doesn't work on Google App Engine because os.linesep is undefined (GenBank/Scanner.py:746): 745 # if self.line[-1] == "\n" : self.line = self.line[:-1] 746 self.line = self.line.rstrip(os.linesep) 747 misc_lines.append(self.line) Defining os.linesep is sufficient to fix the problem (thanks to Brad Chapman). It seems to me that this use of os.linesep is probably mistaken here. If the file comes from efetch, the line separator will be \n regardless of platform [1] and that is what should be used in rstrip. It's possible that the file might come from a dog-foresaken CRLF platform and therefore contain that line separator. So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps, rstrip('\n\r'). Although the need for the latter is probably rare, I don't see that it costs anything to cover that case by adding \r. I'm new to this community, so I don't know whether we now have ferocious debate about the merits of line terminators or, rather, I submit a lame one-liner patch against the git HEAD. Thanks for Biopython. Cheers, Reece [1] For reference, here's a web request that should be equivalent to the efetch. On line 5, 0a is LF is \n. apt12j$ curl -s 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=238018044&rettype=gb' | hexdump -C | head 00000000 4c 4f 43 55 53 20 20 20 20 20 20 20 4e 4d 5f 30 |LOCUS NM_0| 00000010 30 34 30 30 36 20 20 20 20 20 20 20 20 20 20 20 |04006 | 00000020 20 20 20 31 33 39 39 33 20 62 70 20 20 20 20 6d | 13993 bp m| 00000030 52 4e 41 20 20 20 20 6c 69 6e 65 61 72 20 20 20 |RNA linear | 00000040 50 52 49 20 32 35 2d 4d 41 52 2d 32 30 31 30 0a |PRI 25-MAR-2010.| 00000050 44 45 46 49 4e 49 54 49 4f 4e 20 20 48 6f 6d 6f |DEFINITION Homo| -- Reece Hart, Ph.D. Chief Scientist, Genome Commons http://genomecommons.org/ Center for Computational Biology 324G Stanley Hall UC Berkeley / QB3 Berkeley, CA 94720 From biopython at maubp.freeserve.co.uk Fri Apr 9 04:54:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 09:54:53 +0100 Subject: [Biopython] GenBank.Scanner use of os.linesep In-Reply-To: <4BBEB12C.8030907@berkeley.edu> References: <4BBEB12C.8030907@berkeley.edu> Message-ID: On Fri, Apr 9, 2010 at 5:46 AM, Reece Hart wrote: > Hi All- > > I recently discovered that the GenBank parser doesn't work on Google App > Engine because os.linesep is undefined (GenBank/Scanner.py:746): > > ? 745 ? ?# ? ? ? ? ? ?if self.line[-1] == "\n" : self.line = self.line[:-1] > ? 746 ? ? ? ? ? ? ? ?self.line = self.line.rstrip(os.linesep) > ? 747 ? ? ? ? ? ? ? ?misc_lines.append(self.line) > > Defining os.linesep is sufficient to fix the problem (thanks to Brad > Chapman). > > It seems to me that this use of os.linesep is probably mistaken here. I agree. > If the > file comes from efetch, the line separator will be \n regardless of platform > [1] and that is what should be used in rstrip. It's possible that the file > might come from a dog-foresaken CRLF platform and therefore contain that > line separator. I think it would break in a more common setting - passing a file on Windows with CRLF, since Python will turn that into just \n. > So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps, > rstrip('\n\r'). 
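To make the difference concrete (Python 2.6), rstrip() with an argument strips any characters from that set:

    >>> line = "ORIGIN      \r\n"
    >>> line.rstrip("\n")      # what happens today where os.linesep == "\n"
    'ORIGIN      \r'
    >>> line.rstrip("\r\n")    # same set as '\n\r'
    'ORIGIN      '
    >>> line.rstrip()          # also removes trailing spaces and tabs
    'ORIGIN'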
Although the need for the latter is probably rare, I don't > see that it costs anything to cover that case by adding \r. A plain rstrip() would also work and get rid of any trailing whitespace. I've checked that in. > I'm new to this community, so I don't know whether we now have ferocious > debate about the merits of line terminators or, rather, I submit a lame > one-liner patch against the git HEAD. For something this trivial, your verbal patch is fine. Would you like to be added to the NEWS and CONTRIB file? Peter From biopython at maubp.freeserve.co.uk Fri Apr 9 08:08:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 13:08:03 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: > On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >> Greetings All! >> >> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >> >> seqs[id] += seq.replace(".","-") >> >> So when you load into memory alignments that mark gaps created to >> allow alignment to inserts with ".", (such as PFam alignments or the >> output of hmmer) that information is lost. >> >> I know there must be a good reason for this, but I am finding it a >> problem on my end.. >> >> -Bryan Lunt > > Hi Bryan, > > Yes, is it done deliberately. The dot is a problem - it has a quite > specific meaning of "same as above" on other alignment file > formats, while "-" is an almost universal shorthand for gap/insertion. > Consider the use case of Stockholm to PHYLIP/FASTA/Clustal > conversion. > > Have you got a sample output file we can use as a unit test or > at least discuss? As I recall, on the PFAM alignments I looked > at there was no data loss by doing the dot to dash mapping. According to http://sonnhammer.sbc.su.se/Stockholm.html >> Sequence letters may include any characters except >> whitespace. Gaps may be indicated by "." or "-". So a Stockholm file using a mixture of "." and "-" would be valid but a bit odd. Why would anyone do that? Peter From cjfields at illinois.edu Fri Apr 9 08:51:35 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 9 Apr 2010 07:51:35 -0500 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> On Apr 9, 2010, at 7:08 AM, Peter wrote: > On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: >> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >>> Greetings All! >>> >>> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >>> >>> seqs[id] += seq.replace(".","-") >>> >>> So when you load into memory alignments that mark gaps created to >>> allow alignment to inserts with ".", (such as PFam alignments or the >>> output of hmmer) that information is lost. >>> >>> I know there must be a good reason for this, but I am finding it a >>> problem on my end.. >>> >>> -Bryan Lunt >> >> Hi Bryan, >> >> Yes, is it done deliberately. The dot is a problem - it has a quite >> specific meaning of "same as above" on other alignment file >> formats, while "-" is an almost universal shorthand for gap/insertion. >> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal >> conversion. >> >> Have you got a sample output file we can use as a unit test or >> at least discuss? As I recall, on the PFAM alignments I looked >> at there was no data loss by doing the dot to dash mapping. 
> > According to http://sonnhammer.sbc.su.se/Stockholm.html >>> Sequence letters may include any characters except >>> whitespace. Gaps may be indicated by "." or "-". > > So a Stockholm file using a mixture of "." and "-" would be > valid but a bit odd. Why would anyone do that? > > Peter Just curious, b/c this is a point of contention in BioPerl. How does BioPython internally set what symbols correspond to residues/gaps/frameshifts/other? BioPerl retains the original sequence but uses regexes for validation and methods that return symbol-related information (e.g. gap counts). (BTW, the contention here isn't that we use regexes, but that we set them globally). chris From biopython at maubp.freeserve.co.uk Fri Apr 9 09:21:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 14:21:03 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> References: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> Message-ID: On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields wrote: > > > Just curious, b/c this is a point of contention in BioPerl. ?How does BioPython > internally set what symbols correspond to residues/gaps/frameshifts/other? > BioPerl retains the original sequence but uses regexes for validation and > methods that return symbol-related information (e.g. gap counts). > > (BTW, the contention here isn't that we use regexes, but that we set them globally). > > chris Hi Chris, The short answer is gaps are by default "-", and stop codons are "*", but beyond that it would be down to user code to interpret odd symbols. Our sequences have an alphabet object which can specify the letters (as a set of expected characters), with explicit support for a single gap character (usually "-"), and for proteins a single stop codon symbol (usually "*"). This could in theory be extended to define other symbols too. The gap char does get treated specially in some of the alignment code (e.g. for calling a consensus), but I don't think we have anything built in regarding frameshifts. Peter From biopython at maubp.freeserve.co.uk Fri Apr 9 09:30:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 14:30:55 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Fri, Apr 9, 2010 at 2:09 PM, Ivan Rossi wrote: > > On Fri, 9 Apr 2010, Peter wrote: > >> So a Stockholm file using a mixture of "." and "-" would be >> valid but a bit odd. Why would anyone do that? > > IIRC the "." are used for "gaps" at the extremes of sequences in a MSA. When > you do local sequence alignments, like blast and most HMMs do, gaps at the > extremes of sequences do not pay the usual penalty for gap opening. So in > Stockholm format distinguishes between gaps for what you paid a price during > the alignment ("-") and gaps-for-free (".") which are there just to pad each > row to the MSA width. So internal gaps (true gaps), versus leading or trailing padding. That makes sense - and is certainly how PFAM does things according to their FAQ: Quoting from http://pfam.sanger.ac.uk/help#tabview=tab3 >>> What is the difference between the - and . characters in your full alignments ? >>> >>> The '-' and '.' characters both represent gap characters. However they >>> do tell you some extra information about how the HMM has generated >>> the alignment. The '-' symbols are where the alignment of the sequence >>> has used a delete state in the HMM to jump past a match state. 
This >>> means that the sequence is missing a column that the HMM was >>> expecting to be there. The '.' character is used to pad gaps where one >>> sequence in the alignment has sequence from the HMMs insert state. >>> See the alignment below where both characters are used. The HMM >>> states emitting each column are shown. Note that residues emitted >>> from the Insert (I) state are in lower case. I wonder why doesn't this get mentioned anywhere on the format definitions: http://sonnhammer.sbc.su.se/Stockholm.html http://en.wikipedia.org/wiki/Stockholm_format Peter From cjfields at illinois.edu Fri Apr 9 09:28:42 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 9 Apr 2010 08:28:42 -0500 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> Message-ID: <9D6E3C31-B273-4B37-BFE8-8C951C025CBB@illinois.edu> On Apr 9, 2010, at 8:21 AM, Peter wrote: > On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields wrote: >> >> >> Just curious, b/c this is a point of contention in BioPerl. How does BioPython >> internally set what symbols correspond to residues/gaps/frameshifts/other? >> BioPerl retains the original sequence but uses regexes for validation and >> methods that return symbol-related information (e.g. gap counts). >> >> (BTW, the contention here isn't that we use regexes, but that we set them globally). >> >> chris > > Hi Chris, > > The short answer is gaps are by default "-", and stop codons are "*", but > beyond that it would be down to user code to interpret odd symbols. > > Our sequences have an alphabet object which can specify the letters (as > a set of expected characters), with explicit support for a single gap > character (usually "-"), and for proteins a single stop codon symbol (usually > "*"). This could in theory be extended to define other symbols too. The gap > char does get treated specially in some of the alignment code (e.g. for > calling a consensus), but I don't think we have anything built in regarding > frameshifts. > > Peter Within LocatableSeq we define the following: $GAP_SYMBOLS = '\-\.=~'; $FRAMESHIFT_SYMBOLS = '\\\/'; $OTHER_SYMBOLS = '\?'; $RESIDUE_SYMBOLS = '0-9A-Za-z\*'; Combined these can be used in a regex to validate sequence, or separately used for other purposes (counting gaps, frameshifts, etc.). The OTHER_SYMBOLS is rally a catch-all for anything residue-like (counted in the sequence). All of these can be redefined, but currently that's global, so it can have consequences in rare cases when mixing sequences from different formats. We may localize them to work around that (part of GSoC project for alignment reimplementation). We had a Symbol class at one point but I believe it was considered too 'heavy,' though this may be more a consequence of Perl's hammered-on OO. chris From reece at berkeley.edu Fri Apr 9 11:18:36 2010 From: reece at berkeley.edu (Reece Hart) Date: Fri, 09 Apr 2010 08:18:36 -0700 Subject: [Biopython] GenBank.Scanner use of os.linesep In-Reply-To: References: <4BBEB12C.8030907@berkeley.edu> Message-ID: <4BBF454C.4020502@berkeley.edu> Peter- > A plain rstrip() would also work and get rid of any trailing whitespace. > I've checked that in. > For something this trivial, your verbal patch is fine. Would you like > to be added to the NEWS and CONTRIB file? > Thanks for making this change so quickly. Please don't bother with the NEWS and CONTRIB file changes. 
Cheers, Reece From davidpkilgore at gmail.com Fri Apr 9 11:44:12 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Fri, 9 Apr 2010 08:44:12 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore Message-ID: Hello I just wanted to introduce myself to the Biopython project/community, and my intentions for participating as a student in this year's Google's Summer of Code. I have posted a rough draft of my proposal to the GSOC applications site for mentors to see. It is not complete but I am currently working on it, so as to make final improvements before the deadline. I haven't had time (due to school/work) to fix any of the bugs in the bug tracking system that has been pointed to before, but please no that I am no stranger to source code, and that I will make a great addition to the Biopython community after the summer. Please leave me feedback either by shooting me an email or leaving a message in the GSOC applications site. Also, be sure to check out my website shown in the proposal for additional qualifications. Thank you. -- Kizzo From lunt at ctbp.ucsd.edu Fri Apr 9 11:55:31 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Fri, 9 Apr 2010 08:55:31 -0700 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: Hello Peter, The HMMER suit of tools, and the Pfam website use "-" to indicate that an HMM visited a deletion state, and "." to indicate that the HMM on a different sequence visited an insertion state, and this gap is just added to maintain alignment. >foo AA...BBB---CCC >bar AAbazBBBDDDCCC In this example, the sequence "foo" doesn't have the DDD section of the profile HMM, the second sequence has not only the full model, but also contains an insert, "baz" that is not part of the HMM, for example, an extra-long loop. I hope this helps... -Bryan On Fri, Apr 9, 2010 at 5:08 AM, Peter wrote: > On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: >> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >>> Greetings All! >>> >>> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >>> >>> seqs[id] += seq.replace(".","-") >>> >>> So when you load into memory alignments that mark gaps created to >>> allow alignment to inserts with ".", (such as PFam alignments or the >>> output of hmmer) that information is lost. >>> >>> I know there must be a good reason for this, but I am finding it a >>> problem on my end.. >>> >>> -Bryan Lunt >> >> Hi Bryan, >> >> Yes, is it done deliberately. The dot is a problem - it has a quite >> specific meaning of "same as above" on other alignment file >> formats, while "-" is an almost universal shorthand for gap/insertion. >> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal >> conversion. >> >> Have you got a sample output file we can use as a unit test or >> at least discuss? As I recall, on the PFAM alignments I looked >> at there was no data loss by doing the dot to dash mapping. > > According to http://sonnhammer.sbc.su.se/Stockholm.html >>> Sequence letters may include any characters except >>> whitespace. Gaps may be indicated by "." or "-". > > So a Stockholm file using a mixture of "." and "-" would be > valid but a bit odd. Why would anyone do that? > > Peter > From biopython at maubp.freeserve.co.uk Fri Apr 9 12:09:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 17:09:16 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? 
In-Reply-To: References: Message-ID: Hi Bryan, On Fri, Apr 9, 2010 at 4:55 PM, Bryan Lunt wrote: > > Hello Peter, > The HMMER suit of tools, and the Pfam website use "-" to indicate that > an HMM visited a deletion state, and "." to indicate that the HMM on a > different sequence visited an insertion state, and this gap is just > added to maintain alignment. > >>foo > AA...BBB---CCC >>bar > AAbazBBBDDDCCC > > In this example, the sequence "foo" doesn't have the DDD section of > the profile HMM, > the second sequence has not only the full model, but also contains an > insert, "baz" that is not part of the HMM, for example, an extra-long > loop. > > I hope this helps... > -Bryan Yes, it does. I think this HMMER/PFAM convention should be noted on the definition of the Stockholm format - that might have prevented this problem in Biopython since none of the examples I'd looked at when writing the parser had this behaviour. Note your example is more subtle than the different between internal gaps and leading or trailing padding described by Ivan earlier: http://lists.open-bio.org/pipermail/biopython/2010-April/006396.html Could you point out a suitable (small) example from PFAM we can use for a unit test, or email me an example (off list)? Now, as to how to deal with this: We could extend the Biopython Alphabet objects to explicitly support multiple types of gaps (the current setup only really copes with a single gap character). Using this information we could handle some special cases like Stockholm to PHYLIP would require merging either gap onto a dash. This doesn't sound that straight forward though. Or, we can avoid explicit declarations about the sequence (just ignore the Biopython Alphabet object capabilities and use one of the generic alphabets), and leave the problem in the hands of the end user. This is bound to cause some unpleasant surprises one day, but might be the best solution. Peter From chapmanb at 50mail.com Fri Apr 9 16:21:32 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 9 Apr 2010 16:21:32 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: <20100409202132.GA20004@sobchak.mgh.harvard.edu> Vincent; Thanks for the work on the Affy Cel/CDF parsers. I don't know anything at all about the formats so can't help much with the technical questions, but wanted to help with a few more general points you raise. > > I ended up writing my own modules for reading both affy Cel and CDF files. This and the following discussion are a bit hard to follow. When I read through this thread I wasn't sure exactly what improvements you've made, how they affect back compatibility of the code, and how they help make the parser better going forward. A lot of this work is very specialized, so you are trying to catch the attention of the few people who know enough to help. If you can organize your code and e-mail in a way that makes it easy for them to comment and contribute, you'll increase the number of valuable responses you receive. It's an under appreciated skill, but very valuable for grabbing busy people's attention and getting feedback. > > Are improvements to the CelFile.py are of value to biopython? Absolutely. > Is there any need for improvements to the ability to read CEL files or CDF > files and if so what are they? I am interested in contributing. Yes. Make it faster, more complete, easier to use. There are general answers you can apply across the board. We definitely are looking for contributions and happy to have you interested. 
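Coming back to the Stockholm gap question for a moment: for the file conversion use case mentioned above (Stockholm to PHYLIP/FASTA/Clustal), a single gap symbol is exactly what you want. A minimal sketch, assuming Biopython 1.52 or later for AlignIO.convert and a PFAM alignment saved locally under a made-up filename:

from Bio import AlignIO

# Stockholm in, FASTA out; the Stockholm parser maps "." to "-" on reading,
# so the output alignment uses one gap symbol throughout.
count = AlignIO.convert("PF07750_full.sth", "stockholm",
                        "PF07750_full.fasta", "fasta")
print "Converted %i alignment(s)" % count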
Brad From chapmanb at 50mail.com Fri Apr 9 16:39:12 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 9 Apr 2010 16:39:12 -0400 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: References: Message-ID: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Kizzo; > I just wanted to introduce myself to the Biopython project/community, > and my intentions for participating as a student in this year's > Google's Summer of Code. I have posted a rough draft of my proposal > to the GSOC applications site for mentors to see. Glad you are interested in this and thanks for getting together a proposal. I wish you would have dropped us a line a bit earlier as we would have been happy to help with getting the application together. > It is not complete > but I am currently working on it, so as to make final improvements > before the deadline. I haven't had time (due to school/work) to fix > any of the bugs in the bug tracking system that has been pointed to > before, but please no that I am no stranger to source code, and that I > will make a great addition to the Biopython community after the > summer. Great. I noticed that you worked on GSoC with OpenCog last year. Is this the most recent code base from that work? https://code.launchpad.net/~kizzobot/opencog/python-bindings Have you still been involved with that community after the work? Did they decide not to do GSoC this year? Thanks again, Brad From davidpkilgore at gmail.com Fri Apr 9 16:52:57 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Fri, 9 Apr 2010 13:52:57 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: <20100409203912.GB20004@sobchak.mgh.harvard.edu> References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Message-ID: On Fri, Apr 9, 2010 at 1:39 PM, Brad Chapman wrote: > Kizzo; > >> I just wanted to introduce myself to the Biopython project/community, >> and my intentions for participating as a student in this year's >> Google's Summer of Code. ?I have posted a rough draft of my proposal >> to the GSOC applications site for mentors to see. > > Glad you are interested in this and thanks for getting together a > proposal. I wish you would have dropped us a line a bit earlier as > we would have been happy to help with getting the application > together. > >> It is not complete >> but I am currently working on it, so as to make final improvements >> before the deadline. ?I haven't had time (due to school/work) to fix >> any of the bugs in the bug tracking system that has been pointed to >> before, but please no that I am no stranger to source code, and that I >> will make a great addition to the Biopython community after the >> summer. > > Great. I noticed that you worked on GSoC with OpenCog last year. Is > this the most recent code base from that work? > > https://code.launchpad.net/~kizzobot/opencog/python-bindings > The core developers merged my bindings in with the main branch a long time ago, and yes that's the most recent codebase from that work. > Have you still been involved with that community after the work? Did > they decide not to do GSoC this year? > Oh yes, I'm still a regular on their IRC channel and mailing lists. OpenCog is closer to my passion, and I already had 2 proposals for OpenCog this summer ready, but unfortunately the project didn't get accepted for GSoC this year. I plan to work more with OpenCog as a potential PhD project, so am still am involved with OpenCog. 
> Thanks again, > Brad > -- Kizzo From vincent at vincentdavis.net Sat Apr 10 01:43:06 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Fri, 9 Apr 2010 23:43:06 -0600 Subject: [Biopython] Bio.Application now subprocess? Message-ID: I was considering writing a module for using the command line Affymetrix Power Tools Software LINK Mostly to convert between CEL file types but there are lots of other features If I read correctly will be replaced using subprocess. Are there any modules currently using subprcess rather than Bio.Application? Anything I should know but don't (as if you know what I know) or consider *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From biopython at maubp.freeserve.co.uk Sat Apr 10 06:28:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 11:28:19 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis wrote: > I was considering writing a module for using the command line Affymetrix > Power Tools Software > LINK > Mostly > to convert between CEL file types but there are lots of other features > If > I read correctly will be replaced using subprocess. Are there any modules > currently using subprcess rather than Bio.Application? > Anything I should know but don't (as if you know what I know) or consider Hi Vincent, The idea is to use a Bio.Application based wrapper to build a command line string, and invoke that with the subprocess module (i.e. use BOTH). The tutorial has several examples of this (e.g. alignment tools and BLAST). What have you been reading that makes you think Bio.Application is being replaced with subprocess? We should probably clarify it. Peter From vincent at vincentdavis.net Sat Apr 10 09:12:34 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 10 Apr 2010 07:12:34 -0600 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: Let me say it was late at night when I started reading thorough this and I am very new to it so.... The first function defines in Bio/Applications.py def generic_run(commandline): """Run an application with the given commandline (DEPRECATED)......We now recommend you invoke subprocess directly, using str(commandline).............""" The second class ApplicationResult: """"""Make results of a program available through a standard interface (DEPRECATED).................""" I think these should be moved tp the bottom if possible maybe below a comment section that indicates the item below are or are going to be deprecated. The last line in class AbstractCommandline(object): """....................... You would typically run the command line via a standard Python operating system call (e.g. using the subprocess module).""" I started to read though this example but thought I would read more about subprocess module, At this point it is not clear to me what bio/Applications is doing for me. subprocess seems simple. But I have a lot to learn and I assume that if I start by getting basic functionality with subprocess then it will make more sence One of the parts that is not clear to me is for example in Emboss class WaterCommandline(_EmbossCommandLine): .......... self.parameters = \ [_Option(["-asequence","asequence"], ["input", "file"], None, 1, "First sequence to align") Not really sure where the parts to the _option line are documented, I assume in the ...for p in parameters:...... Just not clear, I guess I need to study it more. 
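As a concrete illustration of the pattern described above (the wrapper object only builds the command line string; the subprocess module actually runs it), here is a minimal sketch using the EMBOSS water wrapper. It assumes EMBOSS is installed and on the PATH, and the two input FASTA filenames are made up:

import subprocess
from Bio.Emboss.Applications import WaterCommandline

# Build the command line string - nothing is executed at this point
cline = WaterCommandline(asequence="alpha.fasta", bsequence="beta.fasta",
                         gapopen=10, gapextend=0.5, outfile="water.txt")
print str(cline)

# Hand the string to subprocess to actually run the program
child = subprocess.Popen(str(cline), shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = child.communicate()
print "Return code:", child.returncode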
*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Apr 10, 2010 at 4:28 AM, Peter wrote: > On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis > wrote: > > I was considering writing a module for using the command line Affymetrix > > Power Tools Software > > LINK< > http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx > > > > Mostly > > to convert between CEL file types but there are lots of other features > > < > http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx > >If > > I read correctly will be replaced using subprocess. Are there any modules > > currently using subprcess rather than Bio.Application? > > Anything I should know but don't (as if you know what I know) or consider > > Hi Vincent, > > The idea is to use a Bio.Application based wrapper to build a command > line string, and invoke that with the subprocess module (i.e. use BOTH). > The tutorial has several examples of this (e.g. alignment tools and BLAST). > > What have you been reading that makes you think Bio.Application is > being replaced with subprocess? We should probably clarify it. > > Peter > From biopython at maubp.freeserve.co.uk Sat Apr 10 09:58:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 14:58:28 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 2:12 PM, Vincent Davis wrote: > Let me say it was late at night when I started reading thorough this and I > am very new to it so.... > The first function defines in Bio/Applications.py > def generic_run(commandline): OK, so you are looking at the API docs and/or the code. Bits of Bio/Applications.py are deprecated, and I think you are right - we can try and make the status clearer. Peter From rodrigo_faccioli at uol.com.br Sat Apr 10 13:23:19 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Sat, 10 Apr 2010 14:23:19 -0300 Subject: [Biopython] Bio.Application now subprocess? Message-ID: I've developed a class for this proposed. It might help you. Please, see the link below. http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From vincent at vincentdavis.net Sat Apr 10 13:30:05 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 10 Apr 2010 11:30:05 -0600 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: > > On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli < > rodrigo_faccioli at uol.com.br> wrote: > >> I've developed a class for this proposed. It might help you. Please, see >> the >> link below. > > > http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py >> >> > > Thanks, This might be a good place for me to start. Nit sure how this is different than Bio/Applications.py other than it is much simpler from a quick look. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli < rodrigo_faccioli at uol.com.br> wrote: > I've developed a class for this proposed. It might help you. Please, see > the > link below. 
> > > http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py > > Thanks, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structure Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Sat Apr 10 15:02:08 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 20:02:08 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 2:58 PM, Peter wrote: > > OK, so you are looking at the API docs and/or the code. > Bits of Bio/Applications.py are deprecated, and I think > you are right - we can try and make the status clearer. > Hi Vincent, I updated that a bit, hopefully it is clearer that a typical user doesn't need to look at Bio.Applications at all. Rather you might use the alignment tool wrappers in Bio.Align.Applications, or the EMBOSS wrappers in Bio.Emboss.Applications (etc) which internally use the classes defined in Bio.Applications. The *only* reason you'd use Bio.Applications directly now is to write a new command line tool wrapper. [Historically you might have used the old generic_run function in Bio.Applications, but that is deprecated now] Peter From biopython at maubp.freeserve.co.uk Sat Apr 10 16:33:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 21:33:57 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: <1101855478758905131@unknownmsgid> References: <1101855478758905131@unknownmsgid> Message-ID: On Sat, Apr 10, 2010 at 8:27 PM, Vincent Davis wrote: > > So that was/is my plan to use it to writes command lone tools for the > affymetrix apt dev commandline app. unless this is redundant in a way > I am not aware of. > Thanks Ah - right, now this makes sense. Are you on the dev mailing list (CC'd)? That would be a better place to ask. I'd start by looking at Bio.Align.Applications (less subclasses there) as a model. Peter From chapmanb at 50mail.com Mon Apr 12 08:37:31 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 12 Apr 2010 08:37:31 -0400 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Message-ID: <20100412123731.GJ20004@sobchak.mgh.harvard.edu> Kizzo; > > Have you still been involved with that community after the work? Did > > they decide not to do GSoC this year? > > Oh yes, I'm still a regular on their IRC channel and mailing lists. > OpenCog is closer to my passion, and I already had 2 proposals for > OpenCog this summer ready, but unfortunately the project didn't get > accepted for GSoC this year. I plan to work more with OpenCog as a > potential PhD project, so am still am involved with OpenCog. That's great to hear. One of the most important parts of GSoC for myself and many mentors is the chance to get additional folks involved in open source. Reviews of the applications have started, and the main aspect which would improve your proposal is to develop a specific project plan with detailed descriptions of week to week goals. 
For each week you should have: - Description of the specific weekly goal. - Details on the PyCogent and Biopython code you expect to be working with - Possible issues or areas of expansion you expect might impact the timeline - Expected work on documentation and testing. You want to have this integrated throughout the proposal. See the examples in the NESCent application documentation to get an idea of the level of detail in accepted projects from previous years: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html The content we'd like to see in the proposal is interconversion of core object (Sequence, Alignment, Phylogeny) in the first half of the summer, and applications of this interconversion to developing biological workflows in the second half of the summer. Feel free to be creative and pick work that is of interest to your studies. Since you can't edit the proposal currently, please prepare this in a publicly accessible Google Doc and provide a link from the public comments so other mentors can view it. Thanks, Brad From biopython at maubp.freeserve.co.uk Mon Apr 12 09:35:44 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Apr 2010 14:35:44 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Fri, Apr 9, 2010 at 7:50 PM, Bryan Lunt wrote: > Hello Peter, > > Thanks for your help recently on this! > I have here two files that I like to use as examples, because they are > fairly small, (203 sequences) > > The Pfam page summarizing this family is : > http://pfam.sanger.ac.uk/family/PF07750 > > Cheers! > -Bryan Lunt I see what you mean - using that webpage to get the full alignment (in any of the supported file formats) using the mixed gap option (dot or dash) does show both symbols in a meaningful way. Peter From tiagoantao at gmail.com Mon Apr 12 19:39:29 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 13 Apr 2010 00:39:29 +0100 Subject: [Biopython] ASN.1 and Entrez SNP Message-ID: Hi, Just a simple question: Entrez SNP seems to return ASN.1 format only. Is there any way to parse this in biopython? I've looked at SeqIO and found nothing... I can think of tools to process this outside, but I am just curious if this is processed natively with Biopython (being an exposed NCBI format...) Many thanks, Tiago PS - You can easily try this with: hdl = Entrez.efetch(db="snp", id="3739022") print hdl.read() -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Tue Apr 13 04:22:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Apr 2010 09:22:42 +0100 Subject: [Biopython] ASN.1 and Entrez SNP In-Reply-To: References: Message-ID: 2010/4/13 Tiago Ant?o : > Hi, > > Just a simple question: > Entrez SNP seems to return ASN.1 format only. > Is there any way to parse this in biopython? I've looked at SeqIO and > found nothing... > I can think of tools to process this outside, but I am just curious if > this is processed natively with Biopython (being an exposed NCBI > format...) 
> > Many thanks, > Tiago > PS - You can easily try this with: > hdl = Entrez.efetch(db="snp", id="3739022") > print hdl.read() Hi Tiago, No, we don't support ASN.1, and I don't see any good reason to - I think it would only be NCBI ASN.1 we'd we interested in, and I think that all their resources are available in other easier to use formats like XML these days. See also http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One Instead ask Entrez to give you the SNP data as XML: Entrez.efetch(db="snp", id="3739022", retmode="xml") Hopefully the SNP XML file has everything in it. You have a choice of Python XML parsers to use. However, the Bio.Entrez parser doesn't like this XML. This appears to be related (or caused by) a known NCBI bug. See http://bugzilla.open-bio.org/show_bug.cgi?id=2771 Peter From bala.biophysics at gmail.com Tue Apr 13 10:49:03 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 13 Apr 2010 16:49:03 +0200 Subject: [Biopython] removing redundant sequence Message-ID: Friends, Sorry if this question was asked before. Is there any function in Biopython that can remove redundant sequence records from a fasta file. Thanks, Bala From biopython at maubp.freeserve.co.uk Tue Apr 13 11:02:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Apr 2010 16:02:52 +0100 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian wrote: > Friends, > Sorry if this question was asked before. Is there any function in Biopython > that can remove redundant sequence records from a fasta file. > > Thanks, > Bala No, but you should be able to do this with Biopython - depending on what exactly you are asking for. When you say "redundant" do you mean 100% perfect identify? How big is your FASTA file - are you working with next-gen sequencing data and millions of reads?. If it is small enough you can keep all the data in memory to compare sequences to each other. Otherwise you might try using a checksum (e.g. SEGUID) to spot duplicates. Peter From schafer at rostlab.org Tue Apr 13 11:08:31 2010 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Tue, 13 Apr 2010 17:08:31 +0200 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: <4BC488EF.3000505@rostlab.org> Hey, I think not. But you can use an external tool like cd-hit or uniqueprot and implement a wrapper function for that in your code. Chris On 04/13/2010 04:49 PM, Bala subramanian wrote: > Friends, > Sorry if this question was asked before. Is there any function in Biopython > that can remove redundant sequence records from a fasta file. > > Thanks, > Bala > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Apr 15 11:03:02 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 15 Apr 2010 16:03:02 +0100 Subject: [Biopython] Draft abstract for BOSC 2010 Biopython Project Update Message-ID: Hi all, I should have circulated this earlier, but here is a draft abstract for a "Biopython Project Update" talk at BOSC 2010, to be submitted *today*. http://www.open-bio.org/wiki/BOSC_2010 I'm hoping to attend BOSC again this year and give the talk, but haven't sorted out the finances - Brad has offered to present if I can't go, hence the talk author list. 
If anyone else wants to help with slides etc (or as a standby speaker) please let me know. This is based on the abstract from last year, included in this PDF: http://www.open-bio.org/w/images/c/c7/BOSC2009_program_20090601.pdf In the PDF version of the abstract I've made the logo smaller this time ;) Comments welcome, Thanks, Peter -- Biopython Project Update Peter Cock, Brad Chapman In this talk we present the current status of the Biopython project (www.biopython.org), described in a application note published last year (Cock et al., 2009). Biopython celebrated its 10th Birthday last year, and has now been cited or referred to in over 150 scientific publications (a list is included on our website). At the end of 2009, following an extended evaluation period, Biopython successfully migrated from using CVS for source code control to using git, hosted on github.com. This has helped our existing developers to work and test new features on publicly viewable branches before being merged, and has also encouraged new contributors to work on additions or improvements. Currently about fifty people have their own Biopython repository on GitHub. In summer 2009 we had two Google Summer of Code (GSoC) project students working on phylogenetic code for Biopython in conjunction with the National Evolutionary Synthesis Center (NESCent). Eric Talevich?s work on phylogenetic trees including phyloXML support (Han and Zamesk, 2009) was merged and included with Biopython 1.54, and he continues to be actively involved with Biopython. We hope to include Nick Matzke?s module for biogeographical data from the Global Biodiversity Information Facility (GBIF) later this year. For summer 2010 we have Biopython related GSoC projects submitted via both NESCent and the Open Bioinformatics Foundation (OBF), and hope to have students working on Biopython once again. Since BOSC 2009, Biopython has seen four releases. Biopython 1.51 (August 2009) was an important milestone in dropping support for Python 2.3 and our legacy parsing infra-structure (Martel/Mindy), but was most noteworthy for FASTQ support (Cock et al., 2010). Biopython 1.52 (September 2009) introduced indexing of most sequence file formats for random access, and made interconverting sequence and alignment files easier. Biopython 1.53 (December 2009) included wrappers for the new NCBI BLAST+ command line tools, and much improved support for running under Jython. Our latest release is Biopython 1.54 (April/May 2010), new features include Bio.Phylo for phylogenetic trees (GSoC project), and support for Standard Flowgram Format (SFF) files used for 454 Life Sciences (Roche) sequencing. Biopython is free open source software available from www.biopython.org under the Biopython License Agreement (an MIT style license, http://www.biopython.org/DIST/LICENSE). References Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M.J. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. doi:10.1093/bioinformatics/btp163 Han, M.V. and Zmasek, C.M. (2009) phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 10:356. doi:10.1186/1471-2105-10-356 Cock, P.J.A., Fields, C.J., Goto N., Heuer, M.L., and Rice, P.M. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6) 1767-71. 
doi:10.1093/nar/gkp1137 From mok at bioxray.dk Thu Apr 15 11:15:01 2010 From: mok at bioxray.dk (Morten Kjeldgaard) Date: Thu, 15 Apr 2010 17:15:01 +0200 Subject: [Biopython] Entrez.efetch bug? Message-ID: <4BC72D75.1040505@bioxray.dk> Hi, I am getting an error with Entrez.efetch() with Biopython version 1.51. This is my handle: handle = Entrez.efetch(db='protein', id='114391',rettype='gp') When I subsequently do this: record = Entrez.read(handle) I get a syntax error from Expat: ExpatError: syntax error: line 1, column 0 However, if I do the following, it works: record = handle.read() but then I need to parse the resulting record using the Genbank parser, which is a nuisance since I normally should get this for free from the Entrez module. Comments, anyone? -- Morten From biopython at maubp.freeserve.co.uk Thu Apr 15 11:31:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Apr 2010 16:31:28 +0100 Subject: [Biopython] Entrez.efetch bug? In-Reply-To: <4BC72D75.1040505@bioxray.dk> References: <4BC72D75.1040505@bioxray.dk> Message-ID: On Thu, Apr 15, 2010 at 4:15 PM, Morten Kjeldgaard wrote: > Hi, > > I am getting an error with Entrez.efetch() with Biopython version 1.51. This > is my handle: > > handle = Entrez.efetch(db='protein', id='114391',rettype='gp') > In the above, you've asked Entrez to give you a plain text GenPept file (a protein GenBank file). > When I subsequently do this: > > ?record = Entrez.read(handle) > > I get a syntax error from Expat: > > ExpatError: syntax error: line 1, column 0 > The Bio.Entrez.read() and Bio.Entrez.parse() functions expect XML. > However, if I do the following, it works: > > record = handle.read() Well, yes, you get a big string stored as the variable record. > but then I need to parse the resulting record using the Genbank parser, > which is a nuisance since I normally should get this for free from the > Entrez module. > > Comments, anyone? Try this: from Bio import Entrez from Bio import SeqIO handle = Entrez.efetch(db='protein', id='114391',rettype='gp') record = SeqIO.read(handle, 'genbank') Peter From mok at bioxray.dk Thu Apr 15 17:28:24 2010 From: mok at bioxray.dk (Morten Kjeldgaard) Date: Thu, 15 Apr 2010 23:28:24 +0200 Subject: [Biopython] Entrez.efetch bug? In-Reply-To: References: <4BC72D75.1040505@bioxray.dk> Message-ID: <26E933F7-D7D2-48EC-82B4-4B654403F177@bioxray.dk> On 15/04/2010, at 17.31, Peter wrote: > record = SeqIO.read(handle, 'genbank') d'Oh!! :-) Thanks, just the hint I needed. Cheers, Morten From davidpkilgore at gmail.com Mon Apr 19 02:54:55 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Sun, 18 Apr 2010 23:54:55 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: <20100412123731.GJ20004@sobchak.mgh.harvard.edu> References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> <20100412123731.GJ20004@sobchak.mgh.harvard.edu> Message-ID: I have taken the time to carefully look over the links and examples you suggested, and came up with my own draft week by week plan for the summer. It is not perfect, or even complete, as I am in the closing weeks of school and things are getting really busy, but I managed to pull this together. You can visit the following public Google Docs link to get the Gnumeric spreadsheet of my timeline. If you would like me to, I will also convert it to some other format if you like (and if I can), or I can attach a copy of the file itself (or post it on my website) if for some reason the link does not work. Thank you. 
https://docs.google.com/leaf?id=0B4KRpw_6YxAjMzU3NDgxMWYtZGIxZi00YmY3LTk5MGQtNDlmMjYyYTRhN2M0&hl=en On Mon, Apr 12, 2010 at 5:37 AM, Brad Chapman wrote: > Kizzo; > >> > Have you still been involved with that community after the work? Did >> > they decide not to do GSoC this year? >> >> Oh yes, I'm still a regular on their IRC channel and mailing lists. >> OpenCog is closer to my passion, and I already had 2 proposals for >> OpenCog this summer ready, but unfortunately the project didn't get >> accepted for GSoC this year. ?I plan to work more with OpenCog as a >> potential PhD project, so am still am involved with OpenCog. > > That's great to hear. One of the most important parts of GSoC for > myself and many mentors is the chance to get additional folks > involved in open source. > > Reviews of the applications have started, and the main aspect which > would improve your proposal is to develop a specific project plan > with detailed descriptions of week to week goals. For each week you > should have: > > - Description of the specific weekly goal. > - Details on the PyCogent and Biopython code you expect to be working with > - Possible issues or areas of expansion you expect might impact the > ?timeline > - Expected work on documentation and testing. You want to have this > ?integrated throughout the proposal. > > See the examples in the NESCent application documentation to get an > idea of the level of detail in accepted projects from previous years: > > https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply > http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html > > The content we'd like to see in the proposal is interconversion of > core object (Sequence, Alignment, Phylogeny) in the first half of > the summer, and applications of this interconversion to developing > biological workflows in the second half of the summer. Feel free to > be creative and pick work that is of interest to your studies. > > Since you can't edit the proposal currently, please prepare this in > a publicly accessible Google Doc and provide a link from the public > comments so other mentors can view it. > > Thanks, > Brad > -- Kizzo From mjldehoon at yahoo.com Mon Apr 19 03:08:04 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 19 Apr 2010 00:08:04 -0700 (PDT) Subject: [Biopython] Fw: Entrez.efetch In-Reply-To: <910794.43889.qm@web56207.mail.re3.yahoo.com> Message-ID: <870000.56671.qm@web62402.mail.re1.yahoo.com> > I sent the mail to the biopython at biopython.org > but it was not delivered. It will be delivered if you subscribe to the mailing list. --- On Mon, 4/19/10, olumide olufuwa wrote: > From: olumide olufuwa > Subject: Fw: [Biopython]Entrez.efetch > To: biopython-owner at lists.open-bio.org > Cc: "Biopython mailing list" > Date: Monday, April 19, 2010, 2:50 AM > > > Hello Michel, > I sent the mail to the biopython at biopython.org > but it was not delivered. I have edited the message. > > > The code that > accepts UNIPROT ID, retrieves the record using > Entrez.efetch and then it > parsed to obtain the Pubmed ID which i use to search > Medline for the > Title, Abstract and other information about the entry. 
> The code: > > query_id=str(raw_input("please > > > enter your UNIPROT_ID: ")) #Request UNIPROT ID from user > Entrez.email="ludax5 at yahoo.com" > prothandle=Entrez.efetch(db="protein", > > > id=query_id, rettype="gb" #queries Protein DB with the > given ID > #The > program returns an error here if a wrong ID is given. > Details of the > error is given below > seq_record=SeqIO.read(prothandle, "gb") > for > > record in seq_record.annotations['references']: # To > obtain Pubmed id > from the seqrecord > ?? key_word=record.pubmed_id > ?? if key_word: > ???? > handle=Entrez.efetch(db="pubmed", > > id=key_word, rettype="medline") > ???? > medRecords=Medline.parse(handle) > ???? for rec in medRecords: #prints > title and Abstract > ???????? if rec.has_key('AB') and > rec.has_key('TI'): > ?????????? print "TITLE: ",rec['TI'] > ?????????? > print "ABSTRACT: ",rec['AB'] > ?????????? print ' ' > > > THE > PROBLEM: The program gives an error if a wrong ID is > entered or an ID > other than UNIPROT ID e.g PDB ID, GSS ID etc. > > > > An Example Run: > > > please enter your UNIPROT_ID: > 1wio #A PDB ID is given instead > > > Traceback (most recent call last): > ? File "file.py", line 11, in > > ??? seq_record=SeqIO.read(prothandle, "gb") > ? File > "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", > line 522, in > read > ??? raise ValueError("No records found in handle") > ValueError: > > No records found in handle > > I want to avoid this error, thus i > want the program to print "INCORRECT ID GIVEN"? when a > wrong or an > incorrect ID is given. > > > Thanks a lot. > lummy > > > > From olumideolufuwa at yahoo.com Mon Apr 19 03:30:24 2010 From: olumideolufuwa at yahoo.com (Olumide Olufuwa) Date: Mon, 19 Apr 2010 00:30:24 -0700 (PDT) Subject: [Biopython] Entrez.efetch In-Reply-To: Message-ID: <221701.32474.qm@web45106.mail.sp1.yahoo.com> Hello there, ? I wrote a program, I am not awesome in biopython but this is what it does: The program code that accepts user defined UNIPROT ID, retrieves the record using Entrez.efetch and then it is parsed to obtain the Pubmed ID which i use to search Medline for Title, Abstract and other information about the entry. The code is simply: query_id=str(raw_input("please enter your UNIPROT_ID: ")) #Request UNIPROT ID from user Entrez.email="ludax5 at yahoo.com" prothandle=Entrez.efetch(db="protein", id=query_id, rettype="gb" #queries Protein DB with the given ID #The program returns an error here if a wrong ID is given. Details of the error is given below seq_record=SeqIO.read(prothandle, "gb") for record in seq_record.annotations['references']: # To obtain Pubmed id from the seqrecord ?? key_word=record.pubmed_id ?? if key_word: ???? handle=Entrez.efetch(db="pubmed", id=key_word, rettype="medline") ???? medRecords=Medline.parse(handle) ???? for rec in medRecords: #prints title and Abstract ???????? if rec.has_key('AB') and rec.has_key('TI'): ?????????? print "TITLE: ",rec['TI'] ?????????? print "ABSTRACT: ",rec['AB'] ?????????? print ' ' THE PROBLEM: The program gives an error if a wrong ID is entered or an ID other than UNIPROT ID e.g PDB ID, GSS ID etc. An Example Run with a wrong ID is shown below: please enter your UNIPROT_ID: 1wio #A PDB ID is given instead Traceback (most recent call last): ? File "file.py", line 11, in ??? seq_record=SeqIO.read(prothandle, "gb") ? File "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 522, in read ??? 
raise ValueError("No records found in handle") ValueError: No records found in handle I want to avoid this error, thus i want the program to print "INCORRECT ID GIVEN"? when a wrong or an incorrect ID is given. Thanks a lot. lummy From mjldehoon at yahoo.com Mon Apr 19 03:45:59 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 19 Apr 2010 00:45:59 -0700 (PDT) Subject: [Biopython] Entrez.efetch In-Reply-To: <221701.32474.qm@web45106.mail.sp1.yahoo.com> Message-ID: <902706.80063.qm@web62402.mail.re1.yahoo.com> Put a try:/except: block around the call to SeqIO.read, as in: try: seq_record=SeqIO.read(prothandle, "gb") except ValueError: print "INCORRECT ID GIVEN" --Michiel --- On Mon, 4/19/10, Olumide Olufuwa wrote: > From: Olumide Olufuwa > Subject: [Biopython] Entrez.efetch > To: biopython at lists.open-bio.org > Date: Monday, April 19, 2010, 3:30 AM > > Hello there, > ? > I wrote a program, I am not awesome in biopython but this > is what it does: The program code that > accepts user defined UNIPROT ID, retrieves the record using > Entrez.efetch and then it > is parsed to obtain the Pubmed ID which i use to search > Medline for Title, Abstract and other information about the > entry. > The code is simply: > > query_id=str(raw_input("please > > > > enter your UNIPROT_ID: ")) #Request UNIPROT ID from user > Entrez.email="ludax5 at yahoo.com" > prothandle=Entrez.efetch(db="protein", > > > > id=query_id, rettype="gb" #queries Protein DB with the > given ID > #The > program returns an error here if a wrong ID is given. > Details of the > error is given below > seq_record=SeqIO.read(prothandle, "gb") > for > > record in seq_record.annotations['references']: # To > obtain Pubmed id > from the seqrecord > ?? key_word=record.pubmed_id > ?? if key_word: > ???? > > handle=Entrez.efetch(db="pubmed", > > id=key_word, rettype="medline") > ???? > medRecords=Medline.parse(handle) > ???? for rec in medRecords: #prints > title and Abstract > ???????? if rec.has_key('AB') and > rec.has_key('TI'): > ?????????? print "TITLE: ",rec['TI'] > ?????????? > print "ABSTRACT: ",rec['AB'] > ?????????? print ' ' > > > THE > PROBLEM: The program gives an error if a wrong ID is > entered or an ID > other than UNIPROT ID e.g PDB ID, GSS ID etc. > > > > An Example Run with a wrong ID is shown below: > > > please enter your UNIPROT_ID: > 1wio #A PDB ID is given instead > > > Traceback (most recent call last): > ? File "file.py", line 11, in > > ??? seq_record=SeqIO.read(prothandle, "gb") > ? File > "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", > line 522, in > read > ??? raise ValueError("No records found in handle") > ValueError: > > > No records found in handle > > I want to avoid this error, thus i > want the program to print "INCORRECT ID GIVEN"? when a > wrong or an > incorrect ID is given. > > > Thanks a lot. > lummy > > > > > ? ? ? > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fkauff at biologie.uni-kl.de Tue Apr 20 10:27:30 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 20 Apr 2010 16:27:30 +0200 Subject: [Biopython] Code for protein alpha helix prediction Message-ID: <4BCDB9D2.4050207@biologie.uni-kl.de> Hi all, I've recently been asked to help with screening protein sequences for certain features, something I don't really know much about... Yet! 
My questions: Is there some code in Biopython that allows for a quick check whether an amino acid sequece is likely to be a alpha helix? Couldn't find any. Or is there an algorithm that could be straightforwardly implemented in python, or a commandline tool that could be called from within a python script? Thanks in advance, Frank From rodrigo_faccioli at uol.com.br Tue Apr 20 11:34:47 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 20 Apr 2010 12:34:47 -0300 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de> References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: Hi Frank, I'm not sure if I understood your question. I'm computer scientist and I'm researching globular protein structure prediction. In fact, I've studied the application of Evolutionary Algorithms for it. Therefore, our goals are different. if I understood your question, you have a Fasta file of your protein. So, you need to communicate with databases such as NCBI, scop and CATH. In this way, I recommend you use Entrez BioPython module. Other suggestion is the use of BioPython Blast module. Sorry if my answer is not what you is looking for. Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Tue, Apr 20, 2010 at 11:27 AM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? > > Thanks in advance, > Frank > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Tue Apr 20 11:43:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Apr 2010 16:43:02 +0100 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de> References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? Hi Frank, There are lots of tools for predicting secondary structure (alpha helices, beta sheets etc) both de novo, and guided by reference sequences with known structures. Some of these are online web services. 
I'm pretty sure there is nothing for this built into Biopython, so for scripting this for a large number of sequences then (as you have also suggested), my first approach would be to look for command line tools which you could call from Python. I've never needed to do this myself, and have no specific recommendations regarding which tools to try first. If you do find some useful algorithms which could easily be implemented in Python, they could be worth including - maybe under Bio.SeqUtils? Peter From darnells at dnastar.com Tue Apr 20 14:16:22 2010 From: darnells at dnastar.com (Steve Darnell) Date: Tue, 20 Apr 2010 13:16:22 -0500 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: Frank, One of the most accurate (and popular) algorithms is PSIPRED. A stand-alone command line version is available: http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/ If memory serves, it requires a local installation of blast and the nr database. A position weight matrix generated from PSI-BLAST acts as input to a neural network, which makes the secondary structure predictions. The Rosetta Design group had a poll last year of people's favorite tools. There are plenty of others to try if PSIPRED doesn't meet your needs. http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi ction-algorithm/ I am not a PSIPRED developer, just a satisfied user. Regards, Steve -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Tuesday, April 20, 2010 10:43 AM To: Frank Kauff Cc: BioPython Mailing List Subject: Re: [Biopython] Code for protein alpha helix prediction On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? Hi Frank, There are lots of tools for predicting secondary structure (alpha helices, beta sheets etc) both de novo, and guided by reference sequences with known structures. Some of these are online web services. I'm pretty sure there is nothing for this built into Biopython, so for scripting this for a large number of sequences then (as you have also suggested), my first approach would be to look for command line tools which you could call from Python. I've never needed to do this myself, and have no specific recommendations regarding which tools to try first. If you do find some useful algorithms which could easily be implemented in Python, they could be worth including - maybe under Bio.SeqUtils? Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From fkauff at biologie.uni-kl.de Wed Apr 21 07:50:30 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 21 Apr 2010 13:50:30 +0200 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: <4BCEE686.3080803@biologie.uni-kl.de> Thanks everybody! 
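For the scripting side, a minimal sketch of shelling out to such a predictor from Python and collecting its output; the command name here is only a placeholder, so substitute the real tool (e.g. the psipred run script) and whatever arguments it expects:

import subprocess

# "predict_ss" is a stand-in for the real command line predictor
child = subprocess.Popen(["predict_ss", "query.fasta"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = child.communicate()
if child.returncode != 0:
    raise RuntimeError("Prediction failed:\n%s" % stderr)
for line in stdout.splitlines():
    print line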
Now I have plenty of tools to look at - the standalone version of psipred certainly fulfills the easy-to-use and quick-to-try-out requirements. Frank On 04/20/2010 08:16 PM, Steve Darnell wrote: > Frank, > > One of the most accurate (and popular) algorithms is PSIPRED. A > stand-alone command line version is available: > http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/ > > If memory serves, it requires a local installation of blast and the nr > database. A position weight matrix generated from PSI-BLAST acts as > input to a neural network, which makes the secondary structure > predictions. > > The Rosetta Design group had a poll last year of people's favorite > tools. There are plenty of others to try if PSIPRED doesn't meet your > needs. > > http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi > ction-algorithm/ > > I am not a PSIPRED developer, just a satisfied user. > > Regards, > Steve > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org > [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Tuesday, April 20, 2010 10:43 AM > To: Frank Kauff > Cc: BioPython Mailing List > Subject: Re: [Biopython] Code for protein alpha helix prediction > > On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff > wrote: > >> Hi all, >> >> I've recently been asked to help with screening protein sequences for >> certain features, something I don't really know much about... Yet! >> >> My questions: Is there some code in Biopython that allows for a quick >> > check > >> whether an amino acid sequece is likely to be a alpha helix? Couldn't >> > find > >> any. Or is there an algorithm that could be straightforwardly >> > implemented in > >> python, or a commandline tool that could be called from within a >> > python > >> script? >> > Hi Frank, > > There are lots of tools for predicting secondary structure (alpha > helices, > beta sheets etc) both de novo, and guided by reference sequences with > known structures. Some of these are online web services. > > I'm pretty sure there is nothing for this built into Biopython, so for > scripting > this for a large number of sequences then (as you have also suggested), > my first approach would be to look for command line tools which you > could > call from Python. I've never needed to do this myself, and have no > specific > recommendations regarding which tools to try first. > > If you do find some useful algorithms which could easily be implemented > in Python, they could be worth including - maybe under Bio.SeqUtils? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From fkauff at biologie.uni-kl.de Wed Apr 21 07:59:31 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 21 Apr 2010 13:59:31 +0200 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: <4BCEE8A3.3010008@biologie.uni-kl.de> Hi Peter, for the start, it seems psipred is the easiest one to use and to implement. I'll start with that, and once the parser for the output goes beyond the quick-and-dirty level, we can think about including it. Frank On 04/20/2010 05:43 PM, Peter wrote: > On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > >> Hi all, >> >> I've recently been asked to help with screening protein sequences for >> certain features, something I don't really know much about... Yet! 
>> >> My questions: Is there some code in Biopython that allows for a quick check >> whether an amino acid sequece is likely to be a alpha helix? Couldn't find >> any. Or is there an algorithm that could be straightforwardly implemented in >> python, or a commandline tool that could be called from within a python >> script? >> > Hi Frank, > > There are lots of tools for predicting secondary structure (alpha helices, > beta sheets etc) both de novo, and guided by reference sequences with > known structures. Some of these are online web services. > > I'm pretty sure there is nothing for this built into Biopython, so for scripting > this for a large number of sequences then (as you have also suggested), > my first approach would be to look for command line tools which you could > call from Python. I've never needed to do this myself, and have no specific > recommendations regarding which tools to try first. > > If you do find some useful algorithms which could easily be implemented > in Python, they could be worth including - maybe under Bio.SeqUtils? > > Peter > From bala.biophysics at gmail.com Wed Apr 21 10:25:35 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Wed, 21 Apr 2010 16:25:35 +0200 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: Peter, Sorry for the delayed reply. Yes i want to remove those sequences that are 100% identical but they have different identifier. I created a sample fasta file with two redundant sequences. But when i use checksums seguid to spot the redundancies, it spots only the first one. In [36]: for record in SeqIO.parse(open('t'),'fasta'): ....: print record.id, seguid(record.seq) ....: ....: A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8 *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw* AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY * In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda rec:seguid(rec.seq)) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /home/cbala/test/ in () /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in to_dict(sequences, key_function) 585 key = key_function(record) 586 if key in d : --> 587 raise ValueError("Duplicate key '%s'" % key) 588 d[key] = record 589 return d ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw' On Tue, Apr 13, 2010 at 5:02 PM, Peter wrote: > On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian > wrote: > > Friends, > > Sorry if this question was asked before. Is there any function in > Biopython > > that can remove redundant sequence records from a fasta file. > > > > Thanks, > > Bala > > No, but you should be able to do this with Biopython - depending on > what exactly you are asking for. > > When you say "redundant" do you mean 100% perfect identify? > > How big is your FASTA file - are you working with next-gen sequencing > data and millions of reads?. If it is small enough you can keep all > the data in memory to compare sequences to each other. Otherwise > you might try using a checksum (e.g. SEGUID) to spot duplicates. 
> > Peter > From biopython at maubp.freeserve.co.uk Wed Apr 21 11:10:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Apr 2010 16:10:45 +0100 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian wrote: > Peter, > Sorry for the delayed reply. Yes i want to remove those sequences that are > 100% identical but they have different identifier. I created a sample fasta > file with two redundant sequences. But when i use checksums seguid to spot > the redundancies, it spots only the first one. > > In [36]: for record in SeqIO.parse(open('t'),'fasta'): > ? ....: ? ? print record.id, seguid(record.seq) > ? ....: > ? ....: > A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8 > *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw > AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw* > AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA > AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA > AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ > AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM > AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ > *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY > AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY > * > In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda > rec:seguid(rec.seq)) > --------------------------------------------------------------------------- > ValueError ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Traceback (most recent call last) > > /home/cbala/test/ in () > > /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in > to_dict(sequences, key_function) > ? ?585 ? ? ? ? key = key_function(record) > ? ?586 ? ? ? ? if key in d : > --> 587 ? ? ? ? ? ? raise ValueError("Duplicate key '%s'" % key) > ? ?588 ? ? ? ? d[key] = record > ? ?589 ? ? return d > > ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw' Hi Bala, You know there are duplicate sequences in your file, so if you try to use the SEGUID as a key, there will be duplicate keys. Thus you get this error message. If you want to use Bio.SeqIO.to_dict you have to have unique keys. What you should do is loop over the records and keep a record of the checksums you have saved, and use that to ignore duplicates. I would use a python set rather than a python list for speed. You could do this with a for loop. However, I would probably use an iterator based approach with a generator function - I think it is more elegant but perhaps not so easy for a beginner: from Bio import SeqIO from Bio.SeqUtils.CheckSum import seguid def remove_dup_seqs(records): """"SeqRecord iterator to removing duplicate sequences.""" checksums = set() for record in records: checksum = seguid(record.seq) if checksum in checksums: print "Ignoring %s" % record.id continue checksums.add(checksum) yield record records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta")) count = SeqIO.write(records, "no_dups.fasta", "fasta") print "Saved %i records" % count Note I've used filename with Bio.SeqIO which requires Biopython 1.54b or later - for older versions use handles. See also: http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/ Peter From silvio.tschapke at googlemail.com Wed Apr 21 14:34:54 2010 From: silvio.tschapke at googlemail.com (Silvio Tschapke) Date: Wed, 21 Apr 2010 20:34:54 +0200 Subject: [Biopython] Entrez.efetch rettype retmode Message-ID: Hello. I am new to Biopython and I tried to download a whole record with efetch. The problem is that I get an error message in the output: ""Report 'full' not found in 'pmc' presentation"" Maybe I haven't understood the whole principle. 
But isn't it the goal of pmc to provide full text? I have read the help-page of efetch but it doesn't help me a lot. ---- handle = Entrez.efetch(db="pmc", id="2531137", rettype="full", retmode="text") string = str(handle.read()) f = open('./output.txt', 'w') f.write(string) ---- Thanks for your help! From robert.campbell at queensu.ca Wed Apr 21 16:14:10 2010 From: robert.campbell at queensu.ca (Robert Campbell) Date: Wed, 21 Apr 2010 16:14:10 -0400 Subject: [Biopython] Entrez.efetch rettype retmode In-Reply-To: References: Message-ID: <20100421161410.4fd950ec@adelie.biochem.queensu.ca> Hello Silvio, On Wed, 21 Apr 2010 20:34:54 +0200 Silvio Tschapke wrote: > Hello. > > I am new to Biopython and I tried to download a whole record with efetch. > The problem is that I get an error message in the output: > ""Report 'full' not found in 'pmc' presentation"" > Maybe I haven't understood the whole principle. > > But isn't it the goal of pmc to provide full text? I have read the help-page > of efetch but it doesn't help me a lot. > > > ---- > handle = Entrez.efetch(db="pmc", id="2531137", rettype="full", > retmode="text") > string = str(handle.read()) The documentation on efetch (http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html) specifies that: pmc - PubMed Central contains a number of articles classified as "open access" for which you may download the full text as XML. For the remaining articles in PMC you may download only the abstracts as XML. So you just need to change your retmode='text' to retmode='xml' and omit the rettype option altogether. You will find that not all articles are free to download this way though. I tried a random one and got an error message that the particular journal didn't allow download of full text as XML. Cheers, Rob -- Robert L. Campbell, Ph.D. Senior Research Associate/Adjunct Assistant Professor Botterell Hall Rm 644 Department of Biochemistry, Queen's University, Kingston, ON K7L 3N6 Canada Tel: 613-533-6821 Fax: 613-533-2497 http://pldserver1.biochem.queensu.ca/~rlc From laserson at mit.edu Wed Apr 21 21:07:19 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 21 Apr 2010 21:07:19 -0400 Subject: [Biopython] Bug in GenBank/EMBL parser? Message-ID: Hi, I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which supposedly conforms to the EMBL standard). The short story is that whenever there is a feature, the parser checks whether there are qualifiers in the feature with an assert statement, and does not allow features with no qualifiers. However, the IMGT flatfile is full of entries that have features with no qualifiers (only coordinates). Who is wrong here? Does the EMBL specification require that a feature have qualifiers? Or is this a bug to be fixed in the parser. To be more concrete, the parser broke on the following record: ID A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP. XX AC A03907; XX DT 11-MAR-1998 (Rel. 8, arrived in LIGM-DB ) DT 10-JUN-2008 (Rel. 200824-2, Last updated, Version 3) XX DE H.sapiens antibody D1.3 variable region protein ; DE unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV. XX KW antigen receptor; Immunoglobulin superfamily (IgSF); KW Immunoglobulin (IG); IG-Heavy; variable; diversity; joining; KW rearranged. 
XX OS Homo sapiens (human) OC cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; OC Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; OC Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; OC Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; OC Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; OC Homo/Pan/Gorilla group; Homo. XX RN [1] RP 1-412 RA ; RT "Recombinant antibodies and methods for their production."; RL Patent number EP0239400-A/10, 30-SEP-1987. RL MEDICAL RESEARCH COUNCIL. XX DR EMBL; A03907. XX FH Key Location/Qualifiers (from EMBL) FH FT source 1..412 FT /organism="Homo sapiens" FT /mol_type="unassigned DNA" FT /db_xref="taxon:9606" FT V_region 8..>412 FT /note="antibody D1.3 V region" FT sig_peptide 8..64 FT CDS 8..>412 FT /product="antibody D1.3 V region (VDJ)" FT /protein_id="CAA00308.1" FT /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG FT FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL FT HTDDTARYYCARERDYRLDYWGQGTTLTVSS" FT D_segment 356..371 FT J_segment 372..>412 FT /note="J(H)2 region" XX SQ Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other; tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct 60 ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc 120 catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca 180 gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta 240 taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt 300 cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag 360 agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca 412 // And the traceback was: ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (311, 0)) --------------------------------------------------------------------------- AssertionError Traceback (most recent call last) /Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/ in () /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features) 418 #This is a generator function 419 while True : --> 420 record = self.parse(handle, do_features) 421 if record is None : break 422 assert record.id is not None /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features) 401 feature_cleaner = FeatureValueCleaner()) 402 --> 403 if self.feed(handle, consumer, do_features) : 404 return consumer.data 405 else : /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features) 373 #Features (common to both EMBL and GenBank): 374 if do_features : --> 375 self._feed_feature_table(consumer, self.parse_features(skip=False)) 376 else : 377 self.parse_features(skip=True) # ignore the data /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip) 170 feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip()) 171 line = self.handle.readline() --> 172 features.append(self.parse_feature(feature_key, feature_lines)) 173 self.line = line 174 return features /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines) 267 else : 268 #Unquoted 
continuation --> 269 assert len(qualifiers) > 0 270 assert key==qualifiers[-1][0] 271 #if debug : print "Unquoted Cont %s:%s" % (key, line) AssertionError: Which is tracked to an assert statement in Scanner.py at line 269. It appears that the assumption in the code is that there is an unquoted continuation of a feature qualifier. Finally, I am using biopython 1.51 that I built from source using python 2.5 (from an EPD install 4.3.0). I am on a Mac running OS X 10.5.8 (Leopard) Thanks! Uri From biopython at maubp.freeserve.co.uk Thu Apr 22 04:56:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Apr 2010 09:56:52 +0100 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: > Hi, > > I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which > supposedly conforms to the EMBL standard). > > The short story is that whenever there is a feature, the parser checks > whether there are qualifiers in the feature with an assert statement, and > does not allow features with no qualifiers. ?However, the IMGT flatfile is > full of entries that have features with no qualifiers (only coordinates). > > Who is wrong here? ?Does the EMBL specification require that a feature have > qualifiers? ?Or is this a bug to be fixed in the parser. Hi Uri, Thank you for your detailed report, Since you have raised this, I went back over the EMBL documentation. All their example features qualifiers (and from personal experience all EMBL files from the EMBL and GenBank files from the NCBI) do have qualifiers. However, in Section 7.2 they are called "Optional qualifiers". http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 So it does look like an unwarranted assumption in the Biopython parser (even though it has been a safe assumption on "official" EMBL and GenBank files thus far), which we should fix. Could you file a bug please? http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython This also affect Biopython 1.54b (the latest release) and the current code in the repository. I would hope we can solve this before Biopython 1.54 proper is released. Regards, Peter From chapmanb at 50mail.com Thu Apr 22 08:18:10 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 22 Apr 2010 08:18:10 -0400 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: <20100422121810.GV29724@sobchak.mgh.harvard.edu> Bala; > > I created a sample fasta > > file with two redundant sequences. But when i use checksums seguid to spot > > the redundancies, it spots only the first one. > What you should do is loop over the records and keep a record > of the checksums you have saved, and use that to ignore duplicates. > I would use a python set rather than a python list for speed. > > You could do this with a for loop. However, I would probably use an > iterator based approach with a generator function - I think it is more > elegant but perhaps not so easy for a beginner: [... Nice code example from Peter ..] This is a nice problem example and discussion. Bala, it sounds like Peter provided some useful example code to solve this. Once you use this to get together a program that solves your problem, it would be very helpful if you could write it up as a Cookbook entry: http://biopython.org/wiki/Category:Cookbook That would help others in the future who will be tackling similar issues. 
Thanks much, Brad From cloudycrimson at gmail.com Fri Apr 23 03:56:45 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Fri, 23 Apr 2010 13:26:45 +0530 Subject: [Biopython] Qblast : no hits Message-ID: Hello freinds, I have a problem with qblast. I have sequences from the mass spectromerty equipment that needs to be BLASTed to find the protein it belongs to. When I blast these sequences in the NCBI website it takes some time (longer than usual ) but does gives me hits. When i blast them using the following code in biopython they dont give me any hits. CODE: **************************************************************************** >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR") >>> blast_results = result_handle.read() >>> save_file = open( "testseq.xml", "w") >>> save_file.write(blast_results) >>> save_file.close() **************************************************************************** OUTPUT: **************************************************************************** blastp BLASTP 2.2.23+ Alejandro A. Schäffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005. nr 12361 unnamed protein product 19 BLOSUM62 10 11 1 F 1 12361 unnamed protein product 19 10888645 -585703444 0 0 0.041 0.267 0.14 ***************************************************************************** Is this because a normal blast code doesn wait long till the results are given? I mean the RTOE error. if yes, how to control the "time of execution"? Or else what is the problem with my code? If you guys know anything on this issue, please give me your ideas. Thanking you in advance. Sincerely, Karthik From biopython at maubp.freeserve.co.uk Fri Apr 23 05:49:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Apr 2010 10:49:55 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hello Karthik On Fri, Apr 23, 2010 at 8:56 AM, Karthik Raja wrote: > Hello freinds, > > I have a ?problem with qblast. I have sequences from the mass > spectromerty equipment that needs to be BLASTed to find the protein it > belongs to. When I blast these sequences in the NCBI website it takes > some time (longer than usual ) but does gives me hits. When i blast > them using the following code in biopython they dont give me any hits. > > CODE: > > **************************************************************************** > >>>> from Bio.Blast import NCBIWWW >>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR") >>>> blast_results = result_handle.read() >>>> save_file = open( "testseq.xml", "w") >>>> save_file.write(blast_results) >>>> save_file.close() > > **************************************************************************** > > Is this because a normal blast code doesn wait long till the results are > given? I mean the RTOE error. if yes, how to control the "time of > execution"? What error? It looks like your example ran fine. > Or else what is the problem with my code? > > If you guys know anything on this issue, please give me your ideas. Differences between a manual BLAST search on the NCBI website and a script search via QBLAST are almost always down to different parameter settings. 
The NCBI have often adjusted the defaults on the website, and they no longer match the defaults on QBLAST. You should check things like the expectation cut off, the matrix, gap penalties etc. The simplest option would be just to copy the current defaults from the website into your python code. We probably need to put this into the Biopython FAQ ... Regards, Peter From cjfields at illinois.edu Fri Apr 23 08:00:07 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 23 Apr 2010 07:00:07 -0500 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: On Apr 23, 2010, at 4:49 AM, Peter wrote: >> ... > > Differences between a manual BLAST search on the NCBI website > and a script search via QBLAST are almost always down to different > parameter settings. The NCBI have often adjusted the defaults on > the website, and they no longer match the defaults on QBLAST. > You should check things like the expectation cut off, the matrix, > gap penalties etc. The simplest option would be just to copy the > current defaults from the website into your python code. > > We probably need to put this into the Biopython FAQ ... > > Regards, > > Peter Same for BioPerl. chris From cloudycrimson at gmail.com Fri Apr 23 23:27:10 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Sat, 24 Apr 2010 08:57:10 +0530 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, I did try changing the paramters according to the WWW BLAST and its gives an error saying "no RID or no RTOE found". Its the same error i was trying to tell you in the 1st post. Its the "request time of execution". Is there any way to change this RTOE i.e. to increase it? Any idea? On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields wrote: > On Apr 23, 2010, at 4:49 AM, Peter wrote: > > >> ... > > > > Differences between a manual BLAST search on the NCBI website > > and a script search via QBLAST are almost always down to different > > parameter settings. The NCBI have often adjusted the defaults on > > the website, and they no longer match the defaults on QBLAST. > > You should check things like the expectation cut off, the matrix, > > gap penalties etc. The simplest option would be just to copy the > > current defaults from the website into your python code. > > > > We probably need to put this into the Biopython FAQ ... > > > > Regards, > > > > Peter > > Same for BioPerl. > > chris > From p.j.a.cock at googlemail.com Sat Apr 24 07:40:27 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Sat, 24 Apr 2010 12:40:27 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: <6540A260-554B-488A-AED7-B0559883F7F7@googlemail.com> On 24 Apr 2010, at 04:27, Karthik Raja wrote: > Hello Peter, > > I did try changing the paramters according to the WWW BLAST and its > gives an > error saying "no RID or no RTOE found". Its the same error i was > trying to > tell you in the 1st post. Its the "request time of execution". Is > there any > way to change this RTOE i.e. to increase it? Any idea? > > On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields > wrote: > >> On Apr 23, 2010, at 4:49 AM, Peter wrote: >> >>>> ... >>> >>> Differences between a manual BLAST search on the NCBI website >>> and a script search via QBLAST are almost always down to different >>> parameter settings. The NCBI have often adjusted the defaults on >>> the website, and they no longer match the defaults on QBLAST. >>> You should check things like the expectation cut off, the matrix, >>> gap penalties etc. 
The simplest option would be just to copy the >>> current defaults from the website into your python code. >>> >>> We probably need to put this into the Biopython FAQ ... >>> >>> Regards, >>> >>> Peter >> >> Same for BioPerl. >> >> chris >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Sat Apr 24 07:49:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 24 Apr 2010 12:49:55 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hi all, Sorry for the blank email just now. On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote: > Hello Peter, > > I did try changing the paramters according to the WWW BLAST > and its gives an error saying "no RID or no RTOE found". Its the > same error i was trying to tell you in the 1st post. Its the "request > time of execution". Is there any way to change this RTOE i.e. to > increase it? Any idea? Please show us an example with this problem (i.e. the python code and the traceback). What is meant to happen is we send the query to the NCBI, and they reply with reference details (RID and RTOE) which are used to fetch the results after BLAST has finished running. My guess for what is happening is your parameters are for some reason invalid, and the NCBI is giving an error page (so no RID and no RTOE). Biopython tries to spot any error message in this situation, but in your case could not. Peter From cloudycrimson at gmail.com Sat Apr 24 23:24:59 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Sun, 25 Apr 2010 08:54:59 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, As said i did try changing the parameters of qblast according to the set in the web blast. The parameters that I changed are 1. Martrix 2. Word size 3. Expect There is a check box option in the web page that allows us to check it if we want the web blast to adjust according short sequences. I am not sure how to bring that option into the qblast. *Below given are the code and the traceback. 
* >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast ("blastp", "nr", "SSRVQDGMGLYTARRVR", auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=200000, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name= 'PAM30', nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=2, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None) *Traceback (most recent call last): * File "", line 1, in result_handle = NCBIWWW.qblast *("blastp", "nr", "SSRVQDGMGLYTARRVR",*auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', *expect=200000*, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, *matrix_name= 'PAM30'*, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, * word_size=2*, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in _parse_qblast_ref_page raise ValueError("No RID and no RTOE found in the 'please wait' page." ValueError: No RID and no RTOE found in the 'please wait' page. (there was probably a problem with your request) Here are a few examples of my MS sequences. 1. *IMYTALPVIGKRHFRPSFTR * 2. *RSSRGRGR * 3. *AGPGPRRAKAAPYR * 4. *ASRSYSSERRAR * 5. *AASAAPPRAGRPDRGPLALAGR * 6. *GSDGKSRGR * 7. *TYGWRAEPR * 8. *PPEPAREPRLSPRR * 9. *GVLTALRR * 10. *AGMRLPSRRQSFPAPVSR * *Sincerely, * *Karthikraja* On Sat, Apr 24, 2010 at 5:19 PM, Peter wrote: > Hi all, > > Sorry for the blank email just now. > > On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote: > > Hello Peter, > > > > I did try changing the paramters according to the WWW BLAST > > and its gives an error saying "no RID or no RTOE found". Its the > > same error i was trying to tell you in the 1st post. Its the "request > > time of execution". Is there any way to change this RTOE i.e. to > > increase it? Any idea? > > Please show us an example with this problem (i.e. the python > code and the traceback). > > What is meant to happen is we send the query to the NCBI, and > they reply with reference details (RID and RTOE) which are > used to fetch the results after BLAST has finished running. > > My guess for what is happening is your parameters are for > some reason invalid, and the NCBI is giving an error page > (so no RID and no RTOE). Biopython tries to spot any error > message in this situation, but in your case could not. 
> > Peter > From biopython at maubp.freeserve.co.uk Sun Apr 25 08:45:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 25 Apr 2010 13:45:05 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote: > *Below given are the code and the traceback. * Great - I can run that and get the same traceback. Here is a shorter version which does the same thing - removing all the parameters you don't actually set: from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", entrez_query='(none)', expect=200000, hitlist_size=50, matrix_name='PAM30', word_size=2, alignments=500, descriptions=500, format_type='XML') Getting shorter still: result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", matrix_name='PAM30') The problem is the matrix name - remove that and the error goes away. So progress :) Doing a little digging, this is the error message from the NCBI is: Message ID#35 Error: Cannot validate the Blast options: Gap existence and extension values of 11 and 1 not supported for PAM30 supported values are: 32767, 32767 7, 2 6, 2 5, 2 10, 1 9, 1 8, 1 As I guessed earlier, Biopython needed a little update to recognise this error message and pass it to the user. I've done that. In your case, you need to pick gap parameters appropriate for PAM30. Peter From cloudycrimson at gmail.com Mon Apr 26 04:38:59 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Mon, 26 Apr 2010 14:08:59 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, I tried out what you suggested and it works perfectly. I checked the result XML file and there was no problem at all. But I still have one more small issue that I am sure you can help me with. The main reason i wanted to use python was that I could put all the query sequences in a file and blast it. So when I tried the above code to blast a sequence that I have put in a fasta file, it gives an error. Same kinda error. Below are the code and traceback. >>> fasta_string = open("test.fasta").read() >>> result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50, word_size=2, alignments=500, descriptions=500,format_type='XML') *Traceback (most recent call last): * File "", line 2, in word_size=2, alignments=500, descriptions=500,format_type='XML') File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in _parse_qblast_ref_page raise ValueError("No RID and no RTOE found in the 'please wait' page." ValueError: No RID and no RTOE found in the 'please wait' page. (there was probably a problem with your request) Please let me know if you could sense in the problem with the code. Sincerely, Karthik On Sun, Apr 25, 2010 at 6:15 PM, Peter wrote: > On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote: > > > *Below given are the code and the traceback. * > > Great - I can run that and get the same traceback. 
> > Here is a shorter version which does the same thing - removing all the > parameters you don't actually set: > > from Bio.Blast import NCBIWWW > result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", > entrez_query='(none)', expect=200000, hitlist_size=50, > matrix_name='PAM30', word_size=2, alignments=500, descriptions=500, > format_type='XML') > > Getting shorter still: > > result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", > matrix_name='PAM30') > > The problem is the matrix name - remove that and the error goes away. > So progress :) > > Doing a little digging, this is the error message from the NCBI is: > > Message ID#35 Error: Cannot validate the Blast options: Gap existence > and extension values of 11 and 1 not supported for PAM30 > supported values are: > 32767, 32767 > 7, 2 > 6, 2 > 5, 2 > 10, 1 > 9, 1 > 8, 1 > > As I guessed earlier, Biopython needed a little update to recognise > this error message and pass it to the user. I've done that. > > In your case, you need to pick gap parameters appropriate for PAM30. > > Peter > From biopython at maubp.freeserve.co.uk Mon Apr 26 06:02:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 11:02:24 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hi Karthik, On Mon, Apr 26, 2010 at 9:38 AM, Karthik Raja wrote: > Hello Peter, > > I tried out what you suggested and it works perfectly. I checked the result > XML file and there was no problem at all. That's good :) > But I still have one more small issue that I am sure you can help me with. > The main reason i wanted to use python was that I could put all the query > sequences in a file and blast it. I wouldn't recommend that approach. For a modest number of queries, I would suggest doing one online BLAST query at a time. This will spread out the load on the NCBI, and means each time your XML results won't be too big. Trying to do too many queries at risks hitting an NCBI CPU limit, or having problems downloading a very large XML result file. For a large number of queries, I would suggest using standalone BLAST (installed and run locally) - especially if you want to use very lenient parameters giving lots of results (meaning large output files). > So when I tried the above code to blast a > sequence that I have put in a fasta file, it gives an error. Same kinda > error. Below are the code and traceback. > >>>> fasta_string = open("test.fasta").read() >>>> result_handle = NCBIWWW.qblast("blastp", "nr", > fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50, > word_size=2, alignments=500, descriptions=500,format_type='XML') > > *Traceback (most recent call last): > ... > ValueError: No RID and no RTOE found in the 'please wait' page. (there was > probably a problem with your request) > > Please let me know if you could sense in the problem with the code. > > Sincerely, > Karthik The code works fine - I just tried it using a FASTA file with four proteins. I would guess there is a problem with your FASTA file - perhaps there is a bad sequence in it, or too many sequences. Since you don't have the latest code we can't see the NCBI error message in the traceback, which would help a lot. 
I see you are running on Windows, so the easiest way to try this is to backup C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py and replace it with the new version from our repository: http://biopython.open-bio.org/SRC/biopython/Bio/Blast/NCBIWWW.py or: http://github.com/biopython/biopython/raw/master/Bio/Blast/NCBIWWW.py Or, could you send me the FASTA file to try it here (please send it to me directly, not the mailing list). Regards, Peter From nick_leake77 at hotmail.com Mon Apr 26 11:36:28 2010 From: nick_leake77 at hotmail.com (Nick Leake) Date: Mon, 26 Apr 2010 11:36:28 -0400 Subject: [Biopython] parsing a fasta with multiple entries Message-ID: Hello, I'm having trouble parsing a fasta file with multiple sequences - it is a fasta that has most of the transposable elements in fruit flies found at http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down. I want to be able to access the DNA sequences for manipulation and later removal from a chromosomal region. I originally thought that I could follow the same fasta format example shown in the biopython tutorial. However, that failed to work. I think it might be because there are multiple entries. Basically, I just want parse the information and have dictionaries hold the transposon elements name and sequence for later use. Can I do that with biopython or should I make my own parser? Any help would be greatly appreciated. I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. _________________________________________________________________ The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with Hotmail. http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5 From biopython at maubp.freeserve.co.uk Mon Apr 26 11:52:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:52:28 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: Hi Nick, On Mon, Apr 26, 2010 Nick Leake wrote: > Hello, > > I'm having trouble parsing an embl file (attached) with multiple > sequences. ?I want to be able to access the DNA sequences for > manipulation and removal from a chromosomal region. ?I originally > thought that I could follow the same fasta format example shown in the > biopython tutorial. ?However, that failed to work. ?Next, I tried to > convert the file to a fastq or a fasta to just follow the examples - > again, failed. ?So, I looked around and found some embl parsing code: > > from Bio import SeqIO > > p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") > p.next() > record=p.next() > > print record > > This kinda works, but fails to read all entries. Well, yes: from Bio import SeqIO #that imports the library p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") #that sets up the EMBL parser (although EMBL files are text so it is a bit #odd to open it in binary read mode) p.next() #reads the first record and discards it record=p.next() #reads the second record and stores as variable record You only ever try and look at the second record. See below... > ... ?In addition, I don't know what code I need to 'grab' the DNA > information for manipulations and remove these sequences from > a given DNA segment. ? ?Can I get a little guidance to > what I need to do or where I can look to help solve my problem? 
What you probably want to start with is a simple for loop, from Bio import SeqIO for record in SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41"),"embl"): print record.id, record.seq However, this runs into a problem: Traceback (most recent call last): ... ValueError: Expected sequence length 2, found 2483. Looking at your file (which was too big to send to the list), your EMBL file is invalid. Specifically this is failing on the record which starts: ID FROGGER standard; DNA; INV; 2 BP. That ID line says the sequence is just 2 base pairs, but in fact the seems to be 2483bp. The ID line should probably be edited like this: ID FROGGER standard; DNA; INV; 2483 BP. Fixing that shows up another similar problem, ID TV1 standard; DNA; INV; 1728 BP. should probably be: ID TV1 standard; DNA; INV; 1730 BP. Then there is this record: ID DDBARI1 standard; DNA; INV; 1676 BP. Several parts of the record suggest it should be 1676bp (not just the ID line, but also for example the SQ line), but there is actually 1677bp of sequence present. After making those three edits by hand, Biopython should parse it. I suspect your EMBL file has been manually edited. Where did it come from? Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 11:54:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:54:54 +0100 Subject: [Biopython] Fwd: help with parsing EMBL In-Reply-To: References: Message-ID: Hi all, I'm forwarding this email from Nick Leake about parsing EMBL files, but without his 1.3MB attachment. I'll reply to his questions in a follow up email... Peter ---------- Forwarded message ---------- From:?Nick Leake To:? Date:?Mon, 26 Apr 2010 09:35:45 -0400 Subject:?help with parsing Hello, I'm having trouble parsing an embl file (attached) with multiple sequences. ?I want to be able to access the DNA sequences for manipulation and removal from a chromosomal region. ?I originally thought that I could follow the same fasta format example shown in the biopython tutorial. ?However, that failed to work. ?Next, I tried to convert the file to a fastq or a fasta to just follow the examples - again, failed. ?So, I looked around and found some embl parsing code: from Bio import SeqIO p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") p.next() record=p.next() print record This kinda works, but fails to read all entries. ?Also, there is no 'record' argument for output. ?In addition, I don't know what code I need to 'grab' the DNA information for manipulations and remove these sequences from a given DNA segment. ? ?Can I get a little guidance to what I need to do or where I can look to help solve my problem? Any help would be greatly appreciated. ?I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. _________________________________________________________________ The New Busy is not the too busy. Combine all your e-mail accounts with Hotmail. http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4 ---------- Forwarded message ---------- From:?biopython-request at lists.open-bio.org To: Date:?Mon, 26 Apr 2010 09:44:02 -0400 Subject:?confirm 29081d7dc4252dd9c96c13f5018658d3414acbdc If you reply to this message, keeping the Subject: header intact, Mailman will discard the held message. ?Do this if the message is spam. 
?If you reply to this message and include an Approved: header with the list password in it, the message will be approved for posting to the list. ?The Approved: header can also appear in the first line of the body of the reply. From biopython at maubp.freeserve.co.uk Mon Apr 26 11:59:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:59:02 +0100 Subject: [Biopython] parsing a fasta with multiple entries In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 4:36 PM, Nick Leake wrote: > > Hello, > > I'm having trouble parsing a fasta file with multiple sequences - it is a fasta > that has most of the transposable elements in fruit flies found at > http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down. Hi Nick, You mean this file? http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.fasta > I want to be able to access the DNA sequences for manipulation and later > removal from a chromosomal region. ?I originally thought that I could follow > the same fasta format example shown in the biopython tutorial. ?However, > that failed to work. ?I think it might be because there are multiple entries. The Bio.SeqIO.read() function is for when there is a single record. The Bio.SeqIO.parse() function is for when you have multiple records. Could you clarify which bit of the tutorial was confusing? We'd like to make it better. > Basically, I just want parse the information and have dictionaries hold the > transposon elements name and sequence for later use. ?Can I do that with > biopython or should I make my own parser? Any help would be greatly > appreciated. ?I'm still very much a python novice and get frustrated by not > knowing how to ask my questions appropriately. You should be able to use the Bio.SeqIO.index() function for this. >>> from Bio import SeqIO >>> data = SeqIO.index("D_mel_transposon_sequence_set.fasta", "fasta") >>> data.keys()[:10] ['gb|U14101|TART-B', 'gb|AF162798|Dbuz\\BuT1', 'gb|U26847|Dvir\\Helena', 'gb|X67681|Bari1', 'gb|M69216|hobo', 'gb|U29466|Dkoe\\Gandalf', 'gb|Z27119|flea', 'gb|AB022762|aurora-element', 'gb|nnnnnnnn|Stalker3T', 'gb|AF518730|Dwil\\Vege'] >>> data["gb|nnnnnnnn|Stalker3T"] SeqRecord(seq=Seq('TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAAT...ACA', SingleLetterAlphabet()), id='gb|nnnnnnnn|Stalker3T', name='gb|nnnnnnnn|Stalker3T', description='gb|nnnnnnnn|Stalker3T STALKER3 372bp', dbxrefs=[]) >>> print data["gb|nnnnnnnn|Stalker3T"].seq TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAATATGTAAAGTAGAGTTAATATGTAAGTAAGCAAAAGACCACCAACACTTACATGAACACTCCAGCTCTTGAAATACGATCGAGCGCTTAAACATAAGCCGATCGCGGAGCGTGAGAGTGCCGAGCATACACCTAGCAGCTCAAGTGATTAAGATAAGATAAGATAAGATAACAAACACGTAGTCTTAAGCGCGTCATGTGCGGGTGGCTGTACCCAAGAACAGCAAAGTGAATTCATTCGAATAAACCGCTTCAAGCAGAGCAGAGCCAAGTCTATTATATCAACTTCAAAAATACCGTATAACCTTGAACCTATTACA Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 12:02:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 17:02:18 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 4:52 PM, Peter wrote: > Hi Nick, > > On Mon, Apr 26, 2010 Nick Leake wrote: >> Hello, >> >> I'm having trouble parsing an embl file (attached) with multiple >> sequences. ... > > After making those three edits by hand, Biopython should parse it. > I suspect your EMBL file has been manually edited. Where did it > come from? 
>From Nick's other email about the FASTA file, http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html I can can see that the funny EMBL file came from the Berkeley Drosophil Genome Project (BDGP)'s Natural Transposable Element Project: http://www.fruitfly.org/p_disrupt/TE.html Specifically this file: http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl I'll email them to alert them about the three obvious errors I discussed. Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 12:28:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 17:28:31 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 5:02 PM, Peter wrote: > > From Nick's other email about the FASTA file, > http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html > I can can see that the funny EMBL file came from the Berkeley Drosophil > ?Genome Project (BDGP)'s Natural Transposable Element Project: > http://www.fruitfly.org/p_disrupt/TE.html > > Specifically this file: > http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl > > I'll email them to alert them about the three obvious errors I discussed. There is also something odd going on with the features, which the Biopython parser seems to be ignoring... Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 18:04:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 23:04:15 +0100 Subject: [Biopython] parsing a fasta with multiple entries In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 8:05 PM, Nick Leake wrote: > Thanks Peter, > > All of the information is?very helpful.? I apologize for sending?second > email.? I was thinking that?the first email was going to be discarded for > having the attachment - which in hindsight is an obvious fact.? At that > time, I had only seen the initial email for rejecting the first. I managed to reply before sending the original email (without attachment) to the list - so partly my fault. >>> I want to be able to access the DNA sequences for manipulation and >>> later removal from a chromosomal region. ?I originally thought that I >>> could follow the same fasta format example shown in the biopython >>> tutorial. ?However, that failed to work. ?I think it might be because >>> there are multiple entries. >> >> The Bio.SeqIO.read() function is for when there is a single record. The >> Bio.SeqIO.parse() function is for when you have multiple records. Could >> you clarify which bit of the tutorial was confusing? We'd like to make it >> better. > > The tutorial I used was from > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html OK, good - that is the current version. > I will admit I didn't really know the difference from the Bio.SeqIO.read() > verse the Bio.SeqIO.parse() functions even though they should be > intuitive.? Still, the mentioned tutorial doen't seem to have a multiple > entry parsed example.?This is where my naivet??and confusion on > the matter probably started. It does (the file ls_orchid.fasta used in several examples has 94 entries), but I guess there is a lot of information in there and it can be overwhelming. 
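For example, the basic multi-record pattern from those tutorial examples looks roughly like this (just a sketch, using the ls_orchid.fasta file mentioned above):

from Bio import SeqIO
# one SeqRecord per entry in the multi-record FASTA file
for record in SeqIO.parse(open("ls_orchid.fasta"), "fasta"):
    print record.id, len(record.seq)

The same kind of loop applies to any multi-record FASTA file.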
Your problems with the funny EMBL file probably didn't help :( Peter From p.j.a.cock at googlemail.com Mon Apr 26 18:30:54 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 26 Apr 2010 23:30:54 +0100 Subject: [Biopython] Google Summer of Code - accepted students In-Reply-To: <4BD60D63.1040400@cornell.edu> References: <4BD60D63.1040400@cornell.edu> Message-ID: ---------- Forwarded message ---------- From: Robert Buels Date: Mon, Apr 26, 2010 at 11:02 PM Subject: Google Summer of Code - accepted students To: rmb32 at cornell.edu Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. ?We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From rmb32 at cornell.edu Mon Apr 26 18:02:11 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 26 Apr 2010 15:02:11 -0700 Subject: [Biopython] Google Summer of Code - accepted students Message-ID: <4BD60D63.1040400@cornell.edu> Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. 
Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From anaryin at gmail.com Tue Apr 27 00:29:36 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 27 Apr 2010 12:29:36 +0800 Subject: [Biopython] Google Summer of Code - accepted students In-Reply-To: References: <4BD60D63.1040400@cornell.edu> Message-ID: Hello all! Thanks for the confidence! I'm sure it's going to work alright! If anyone has any comments to add to my application feel free either to email me! Regards! Jo?o [...] Rodrigues On Monday, April 26, 2010, Peter Cock wrote: > ---------- Forwarded message ---------- > From: Robert Buels > Date: Mon, Apr 26, 2010 at 11:02 PM > Subject: Google Summer of Code - accepted students > To: rmb32 at cornell.edu > > > Hi all, > > I'm pleased to announce the acceptance of OBF's 2010 Google Summer of > Code students, listed in alphabetical order with their project titles > and primary mentors: > > Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including > Implementation of Multiple Sequence Alignment Algorithms > > Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, > Classification, and Visualization of Posttranslational Modification of > Proteins > > Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby > > Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & > Duplication Inference Algorithm for Binary and Non-binary Species Tree > > Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending > Bio.PDB: broadening the usefulness of BioPython's Structural Biology > module > > Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring > > Congratulations to our accepted students! > > All told, we had 52 applications submitted for the 6 slots (5 > originally assigned, plus 1 extra) allotted to us by Google. > Proposals were extremely competitive: 6 out of 52 translates to an > 11.5% acceptance rate. ?We received a lot of really excellent > proposals, the decisions were not easy. > > Thanks very much to all the students who applied, we very much > appreciate your hard work. > > Here's to a great 2010 Summer of Code, I'm sure these students will do > some wonderful work. > > Rob Buels > OBF GSoC 2010 Administrator > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Jo?o [...] 
Rodrigues @ http://stanford.edu/~joaor/ From rmb32 at cornell.edu Tue Apr 27 01:52:57 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 26 Apr 2010 22:52:57 -0700 Subject: [Biopython] Google Summer of Code - accepted students Message-ID: <4BD67BB9.3000804@cornell.edu> Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From biopython at maubp.freeserve.co.uk Tue Apr 27 05:45:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Apr 2010 10:45:20 +0100 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: On Thu, Apr 22, 2010 at 9:56 AM, Peter wrote: > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: >> Hi, >> >> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which >> supposedly conforms to the EMBL standard). >> >> The short story is that whenever there is a feature, the parser checks >> whether there are qualifiers in the feature with an assert statement, and >> does not allow features with no qualifiers. ?However, the IMGT flatfile is >> full of entries that have features with no qualifiers (only coordinates). >> >> Who is wrong here? ?Does the EMBL specification require that a feature have >> qualifiers? ?Or is this a bug to be fixed in the parser. > > Hi Uri, > > Thank you for your detailed report, > > Since you have raised this, I went back over the EMBL documentation. > All their example features qualifiers (and from personal experience all > EMBL files from the EMBL and GenBank files from the NCBI) do have > qualifiers. However, in Section 7.2 they are called "Optional qualifiers". > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 > > So it does look like an unwarranted assumption in the Biopython > parser (even though it has been a safe assumption on "official" EMBL > and GenBank files thus far), which we should fix. Bug filed and now fixed, http://bugzilla.open-bio.org/show_bug.cgi?id=3062 It turned out to be an invalid EMBL file where the features were over- indented. 
Biopython was quite happy to parse valid EMBL or GenBank files with features without qualifiers (although I don't recall seeing any examples from EMBL or the NCBI like this). Peter From silvio.tschapke at googlemail.com Wed Apr 28 05:24:25 2010 From: silvio.tschapke at googlemail.com (Silvio Tschapke) Date: Wed, 28 Apr 2010 11:24:25 +0200 Subject: [Biopython] save efetch results in different files Message-ID: Hi all, I'd like to download hundreds of pubmed entries in one turn, but save every entry in a single file for further processing with e.g. NLTK. Is this possible? Or what is the common way to do this? Or do I have to call efetch for every single pmid? I dont know how. Could you also explain me what handle.read() does? Entrez.read(handle) I understand, because it is documented, but handle.read() not. What kind of type is a handle? search_results = Entrez.read(Entrez.esearch(db="pubmed", term="Biopython", usehistory="y")) batch_size = 10 for start in range(0,count,batch_size): end = min(count, start+batch_size) print "Going to download record %i to %i" % (start+1, end) fetch_handle = Entrez.efetch(db="pubmed", rettype="xml", retstart=start, retmax=batch_size, webenv=search_results["WebEnv"], query_key=search_results["QueryKey"]) for pmid in search_results["IdList"]: out_handle = open(pmid+".txt", "w") HERE I HAVE TO ACCESS THE ENTRY FROM THE fetch_handle FOR THE CORRESPONDING pmid #data = Entrez.read(fetch_handle) #data = fetch_handle.read() fetch_handle.close() out_handle.write(data) out_handle.close() Cheers, Silvio From biopython at maubp.freeserve.co.uk Wed Apr 28 05:57:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Apr 2010 10:57:48 +0100 Subject: [Biopython] save efetch results in different files In-Reply-To: References: Message-ID: On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote: > Hi all, > > I'd like to download hundreds of pubmed entries in one turn, but save every > entry in a single file for further processing with e.g. NLTK. > Is this possible? Or what is the common way to do this? Or do I have to call > efetch for every single pmid? I dont know how. Personally I would probably save each pubmed result to a separate file named using the pmid - a Unix filesystem should cope fine with a few thousand files in a single directory. This is simple and lets you add more entries at a later date, and you have simple access to any record. The other approach of combining separate entries into multiple files sounds overly complicated (although possible), while another approach would be a single large file containing all the records in one. These would require a index if you needed random access to the entries by pmid. > Could you also explain me what handle.read() does? Entrez.read(handle) I > understand, because it is documented, but handle.read() not. What kind of > type is a handle? It is *like* a standard handle that you'd get in python from open(filename). This is an object supporting read() giving all the remaining data as a string, readline() giving the next line etc. Peter From laserson at mit.edu Wed Apr 28 14:49:40 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 14:49:40 -0400 Subject: [Biopython] SPARK error messages to be sent to stderr? 
Message-ID: The spark error messages when there is a parsing problem are currently getting sent to stdout: (line 181 in Bio/Parsers/spark.py) print "Syntax error at or near `%s' token" % token Can this be changed to: print >>sys.stderr, "Syntax error at or near `%s' token" % token This way the error messages can be handled separately. Thanks! Uri From laserson at mit.edu Wed Apr 28 15:12:28 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 15:12:28 -0400 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? Message-ID: Hi, I am trying to parse a large file of EMBL records that I know has some errors in it. However, rather than having the parser break when it gets to the error, I'd rather it just skip that record, and move on to the next one. I was wondering if this functionality is already built in somewhere. One way I can do this is like this: iterator = SeqIO.parse(ip,'embl').__iter__() while True: try: record = iterator.next() # Now I specify all the parsing errors I want to catch: except LocationParserError: # Reinitialize iterator at current file position. The iterator # then skips to the beginning of the next record and continues. iterator = SeqIO.parse(ip,'embl').__iter__() except StopIteration: break This way, whenever there is a parsing error, I just reinitialize the iterator at the current file position, and it seeks to the beginning of the next record. However, this requires me to write out the for loop manually (using StopIteration). Does anyone know of a cleaner/more elegant way of doing this? Thanks! Uri -- Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From laserson at mit.edu Wed Apr 28 17:38:52 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 17:38:52 -0400 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: This fixed the main problem with parsing IMGT files that have increased indentation. I also filed an additional bug/enhancement with a proposed patch, which should make biopython compatible with IMGT and still conform to the INSDC format: http://bugzilla.open-bio.org/show_bug.cgi?id=3069 Uri On Tue, Apr 27, 2010 at 05:45, Peter wrote: > On Thu, Apr 22, 2010 at 9:56 AM, Peter > wrote: > > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: > >> Hi, > >> > >> I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which > >> supposedly conforms to the EMBL standard). > >> > >> The short story is that whenever there is a feature, the parser checks > >> whether there are qualifiers in the feature with an assert statement, > and > >> does not allow features with no qualifiers. However, the IMGT flatfile > is > >> full of entries that have features with no qualifiers (only > coordinates). > >> > >> Who is wrong here? Does the EMBL specification require that a feature > have > >> qualifiers? Or is this a bug to be fixed in the parser. > > > > Hi Uri, > > > > Thank you for your detailed report, > > > > Since you have raised this, I went back over the EMBL documentation. > > All their example features qualifiers (and from personal experience all > > EMBL files from the EMBL and GenBank files from the NCBI) do have > > qualifiers. However, in Section 7.2 they are called "Optional > qualifiers". 
> > > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 > > > > So it does look like an unwarranted assumption in the Biopython > > parser (even though it has been a safe assumption on "official" EMBL > > and GenBank files thus far), which we should fix. > > Bug filed and now fixed, > http://bugzilla.open-bio.org/show_bug.cgi?id=3062 > > It turned out to be an invalid EMBL file where the features were over- > indented. Biopython was quite happy to parse valid EMBL or GenBank > files with features without qualifiers (although I don't recall seeing any > examples from EMBL or the NCBI like this). > > Peter > -- Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Wed Apr 28 18:11:43 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 28 Apr 2010 23:11:43 +0100 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? In-Reply-To: References: Message-ID: On Wednesday, April 28, 2010, Uri Laserson wrote: > Hi, > > I am trying to parse a large file of EMBL records that I know has some > errors in it. ?However, rather than having the parser break when it gets to > the error, I'd rather it just skip that record, and move on to the next one. > ?I was wondering if this functionality is already built in somewhere. ?One > way I can do this is like this: > > iterator = SeqIO.parse(ip,'embl').__iter__() > while True: > ? ?try: > ? ? ? ?record = iterator.next() > ? ?# Now I specify all the parsing errors I want to catch: > ? ?except LocationParserError: > ? ? ? ?# Reinitialize iterator at current file position. The iterator > ? ? ? ?# then skips to the beginning of the next record and continues. > ? ? ? ?iterator = SeqIO.parse(ip,'embl').__iter__() > ? ?except StopIteration: > ? ? ? ?break > > This way, whenever there is a parsing error, I just reinitialize the > iterator at the current file position, and it seeks to the beginning of the > next record. ?However, this requires me to write out the for loop manually > (using StopIteration). ?Does anyone know of a cleaner/more elegant way of > doing this? > > Thanks! Hi Uri, There is no obvious way to handle this within the Bio.SeqIO.parse framework. I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't so corrupt that it can't be scanned to identify each record). Just wrap each record access in an error handler. Peter From cloudycrimson at gmail.com Thu Apr 29 02:58:26 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Thu, 29 Apr 2010 12:28:26 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: hello Peter, Sorry for the late reply. I am writing to thank you. The suggestions you gave were of massive work in our research by reducing the BLASTing time. Thank you for taking interest, Sincerely, Karthikaja On Mon, Apr 26, 2010 at 5:27 PM, Peter wrote: > On Mon, Apr 26, 2010 at 12:52 PM, Peter > wrote: > > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja > wrote: > >> Hi Peter, > >> > >> I will seriously consider using the stand alone blast option. And thank > you > >> so much for the links. :) I have replaced the repository. > >> > >> You suspected a problem with the sequences but they work very well when > >> given directly in the code. I have attached my fasta file. Please tell > me > >> how it works with you. > >> > >> Karthikraja. 
> > > > You seem to have made a mistake with the FASTA file, there should be > > a read name on the ">" lines with the sequence on the subsequence lines. > > E.g. More like this: > > > >>Seq1 > > IMYTALPVIGKRHFRPSFTR > >>Seq2 > > RSSRGRGR > > (etc) > > > > As is, your file is valid but describes seven records each with no > sequence > > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc). > > P.S. The updated Biopython should have given you this error message: > > ValueError: Error message from NCBI: Message ID#32 Error: Query > contains no data: Query contains no sequence data > > Peter > From biopython at maubp.freeserve.co.uk Thu Apr 29 05:08:00 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 Apr 2010 10:08:00 +0100 Subject: [Biopython] save efetch results in different files In-Reply-To: References: Message-ID: On Wed, Apr 28, 2010 at 5:56 PM, Silvio Tschapke wrote: > > On Wed, Apr 28, 2010 at 11:57 AM, Peter wrote: >> >> On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote: >> > Hi all, >> > >> > I'd like to download hundreds of pubmed entries in one turn, but save >> > every entry in a single file for further processing with e.g. NLTK. >> > Is this possible? Or what is the common way to do this? Or do I have to >> > call efetch for every single pmid? I dont know how. >> >> Personally I would probably save each pubmed result to a separate file >> named using the pmid - a Unix filesystem should cope fine with a few >> thousand files in a single directory. This is simple and lets you add more >> entries at a later date, and you have simple access to any record. > > This is what I thought..to save each pubmed result to a separate file named > using the pmid, as you can see in the code snippet. > But it isn't working so far. Could you help me with the efetch_handle? I > have called efetch one time with all pmids. So the efetch_handle contains > all results. But now I need to pull out every single result from this handle > to save it in a separate file with its pmid. And I don't know how to do it. > Or isn't there another way..do I have to call efetch for every pmid and than > save it into a file inside the loop? > Because Biopython recommends to not do many queries per second I > thought it would be better to only call efetch one time for all pmids. The simplest answer is to make one efetch call per PMID, giving a single record at a time which you can save to individual files. You can still do this with the esearch+efetch history support. This does mean making many small queries to the NCBI, rather than batching them together - but the NCBI do not have any explicit guidelines on batch sizes. Note - you would be making over 100 queries, so make sure you don't run this during USA office hours! The more complex approach (which the NCBI might prefer) is to download batches of records together (e.g. 50 PMID results at once). If you wanted to save these to separate files, you would have to divide the text up yourself. I think you just need to look for lines starting "PMID-" so this shouldn't be too hard. Peter From cloudycrimson at gmail.com Fri Apr 30 06:50:08 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Fri, 30 Apr 2010 16:20:08 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: hello Peter, I have done blast for 25 sequences and have got 10 hits for each sequence. I have stored the results in an XML file. Now i need to *parse* it and the information in the cookbook isn helping me. 
>>> from Bio.Blast import NCBIWWW >>> result_handle = open("finaltest3.xml") >>> from Bio.Blast import NCBIXML >>> blast_records = NCBIXML.parse(result_handle) >>> for blast_record in blast_records: I am using the above code. Please tell me how to proceed to get information namely "sequence, seq id, e value and alignment". And I also have another doubt. While using q blast, is it possible to restrict the results to only human and mouse hits? If yes, it will be great if you could give me an example code or link. Sincerely, Karthik. On Thu, Apr 29, 2010 at 12:28 PM, Karthik Raja wrote: > > hello Peter, > > Sorry for the late reply. I am writing to thank you. The suggestions you > gave were of massive work in our research by reducing the BLASTing time. > Thank you for taking interest, > > Sincerely, > Karthikaja > On Mon, Apr 26, 2010 at 5:27 PM, Peter wrote: > >> On Mon, Apr 26, 2010 at 12:52 PM, Peter >> wrote: >> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja >> wrote: >> >> Hi Peter, >> >> >> >> I will seriously consider using the stand alone blast option. And thank >> you >> >> so much for the links. :) I have replaced the repository. >> >> >> >> You suspected a problem with the sequences but they work very well when >> >> given directly in the code. I have attached my fasta file. Please tell >> me >> >> how it works with you. >> >> >> >> Karthikraja. >> > >> > You seem to have made a mistake with the FASTA file, there should be >> > a read name on the ">" lines with the sequence on the subsequence lines. >> > E.g. More like this: >> > >> >>Seq1 >> > IMYTALPVIGKRHFRPSFTR >> >>Seq2 >> > RSSRGRGR >> > (etc) >> > >> > As is, your file is valid but describes seven records each with no >> sequence >> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc). >> >> P.S. The updated Biopython should have given you this error message: >> >> ValueError: Error message from NCBI: Message ID#32 Error: Query >> contains no data: Query contains no sequence data >> >> Peter >> > > From biopython at maubp.freeserve.co.uk Fri Apr 30 07:15:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 30 Apr 2010 12:15:05 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: On Fri, Apr 30, 2010 at 11:50 AM, Karthik Raja wrote: > hello Peter, > > I have done blast for 25 sequences and have got 10 hits for each sequence. I > have stored the results in an XML file. Now i need to *parse* it and the > information in the cookbook isn helping me. > >>>> from Bio.Blast import NCBIWWW >>>> result_handle = open("finaltest3.xml") >>>> from Bio.Blast import NCBIXML >>>> blast_records = NCBIXML.parse(result_handle) >>>> for blast_record in blast_records: > > I am using the above code. Please tell me how to proceed to get information > namely "sequence, seq id, e value and alignment". That should be fairly clear from the tutorial, look at the section titled "The BLAST record class". > And I also have another doubt. While using q blast, is it possible to > restrict the results to only human and mouse hits? If yes, it will be great > if you could give me an example code or link. You can ask the NCBI to filter the BLAST results for you with an Entrez query, one of the optional arguments to the Biopython qblast function. Something like "mouse[ORGN] OR human[ORGN]" should work. You can try out the Entrez query on the website to make sure you have the right syntax and terms. 
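[Editor's note: an illustrative sketch (not from the original thread) pulling the two answers above together - walking the parsed records for the hit ID, E-value and aligned sequences, and restricting a new qblast search to human and mouse with an Entrez query. The XML filename is Karthik's; the example protein sequence is taken from the earlier FASTA thread.]

from Bio.Blast import NCBIWWW, NCBIXML

# Walk the saved XML results (see "The BLAST record class" in the Tutorial).
result_handle = open("finaltest3.xml")
for blast_record in NCBIXML.parse(result_handle):
    print "Query:", blast_record.query
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print alignment.hit_id, alignment.title
            print "E-value: %g" % hsp.expect
            print hsp.query  # aligned query sequence
            print hsp.sbjct  # aligned hit (subject) sequence
result_handle.close()

# Restricting an online BLAST search to human and mouse hits:
handle = NCBIWWW.qblast("blastp", "nr", "IMYTALPVIGKRHFRPSFTR",
                        entrez_query="human[ORGN] OR mouse[ORGN]")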
Peter
I'll take another look at this next week (when I have access to a Windows machine). Thanks, Peter From skhadar at gmail.com Sat Apr 3 01:33:01 2010 From: skhadar at gmail.com (Khader Shameer) Date: Fri, 2 Apr 2010 19:33:01 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 Message-ID: Hi, I was trying to install BioPython using fink. Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Used the command "fink install biopython-py24" Got the following error: Failed: no package found for specification 'biopython-py24'! Tried 23, 24 and 25 - it is not working. Any idea why it is not working ? Thanks, Shameer From vincent at vincentdavis.net Sat Apr 3 03:04:17 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Fri, 2 Apr 2010 21:04:17 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: Installing from source, instructions here is straight forward, just did it with the newest version, no problems http://biopython.org/wiki/Download *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Fri, Apr 2, 2010 at 7:33 PM, Khader Shameer wrote: > Hi, > > I was trying to install BioPython using fink. > > Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > > Used the command "fink install biopython-py24" > Got the following error: > Failed: no package found for specification 'biopython-py24'! > Tried 23, 24 and 25 - it is not working. > > Any idea why it is not working ? > > Thanks, > Shameer > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Sat Apr 3 10:33:48 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 3 Apr 2010 11:33:48 +0100 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: On Sat, Apr 3, 2010 at 2:33 AM, Khader Shameer wrote: > Hi, > > I was trying to install BioPython using fink. > > Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > > Used the command "fink install biopython-py24" > Got the following error: > Failed: no package found for specification 'biopython-py24'! > Tried 23, 24 and 25 - it is not working. > > Any idea why it is not working ? Something to do with Fink? Also note we don't support Python 2.3 anymore (and Python 2.4 is on its last few releases as a supported version for Biopython). Apple provides python 2.5 (32bit) and python 2.6 (64bit) on Snow Leopard. I actually use python 2.6 on the Mac specifically because it is 64bit and can cope with more memory. As Vincent and our documentation suggests, try just installing from source. You'll need to install Apple's XCode tools first, and it seems to help if you tick the optional older SDKs as well. Peter From p.j.a.cock at googlemail.com Sat Apr 3 13:52:11 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 3 Apr 2010 14:52:11 +0100 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: >> Hi, >> >> I was trying to install BioPython using fink. >> >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin >> >> Used the command "fink install biopython-py24" >> Got the following error: >> Failed: no package found for specification 'biopython-py24'! 
>> Tried 23, 24 and 25 - it is not working. >> >> Any idea why it is not working ? > > Something to do with Fink? Also note we don't > support Python 2.3 anymore (and Python 2.4 is > on its last few releases as a supported version > for Biopython). If you really want to use fink, I think you'll have to contact the fink team. Specifically it looks like Koen van der Drift is kindly taking care of packaging Biopython on Fink: http://pdb.finkproject.org/pdb/package.php/biopython-py24 http://pdb.finkproject.org/pdb/package.php/biopython-py25 http://pdb.finkproject.org/pdb/package.php/biopython-py26 Peter From skhadar at gmail.com Sat Apr 3 17:19:49 2010 From: skhadar at gmail.com (Khader Shameer) Date: Sat, 3 Apr 2010 11:19:49 -0600 Subject: [Biopython] Biopython installation failed on Mac OSX 10.6 In-Reply-To: References: Message-ID: Thanks Vincent, Peter : I have installed BioPython from source. On Sat, Apr 3, 2010 at 7:52 AM, Peter Cock wrote: > >> Hi, > >> > >> I was trying to install BioPython using fink. > >> > >> Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) > >> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin > >> > >> Used the command "fink install biopython-py24" > >> Got the following error: > >> Failed: no package found for specification 'biopython-py24'! > >> Tried 23, 24 and 25 - it is not working. > >> > >> Any idea why it is not working ? > > > > Something to do with Fink? Also note we don't > > support Python 2.3 anymore (and Python 2.4 is > > on its last few releases as a supported version > > for Biopython). > > If you really want to use fink, I think you'll have to > contact the fink team. Specifically it looks like > Koen van der Drift is kindly taking care of packaging > Biopython on Fink: > > http://pdb.finkproject.org/pdb/package.php/biopython-py24 > http://pdb.finkproject.org/pdb/package.php/biopython-py25 > http://pdb.finkproject.org/pdb/package.php/biopython-py26 > > Peter > From rmb32 at cornell.edu Sat Apr 3 20:09:27 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 13:09:27 -0700 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4BB7A077.4070802@cornell.edu> Hi all, Reminder: GSoC student proposals must be submitted to Google by April 9th, 19:00 UTC. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 5 proposals submitted to our org in Google's web app. Keep them coming, and let's see some really good ones! Rob Buels OBF GSoC 2010 Administrator From rmb32 at cornell.edu Sun Apr 4 04:37:38 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 21:37:38 -0700 Subject: [Biopython] Reminder: GSoC student applications due April 9, 19:00 UTC Message-ID: <4BB81792.8060001@cornell.edu> Hi all, Sending this again with a different subject line, just in case. GSoC student proposals must be submitted to Google through their web application by *April 9th, 19:00 UTC*. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 6 proposals submitted to our org in Google's web app. Keep them coming, and keep them good! 
Rob Buels OBF GSoC 2010 Administrator From ulfada at gmail.com Mon Apr 5 01:46:14 2010 From: ulfada at gmail.com (Sofia Lemons) Date: Sun, 4 Apr 2010 21:46:14 -0400 Subject: [Biopython] SoC project (BioPython and PyCogent) Message-ID: I'm working on an application for the Summer of Code project of integrating BioPython and PyCogent. I've looked through the list archives and saw Brad's general advice to other potential SoC applicants, but I thought I'd introduce myself and see if there was any advice specific to this project. I've used BioPython in the past and even explored the code a bit. I'm considering working on one or more of the bugs in Bugzilla if I can find time, and will work to familiarize myself with PyCogent. Are there any other concepts, projects, or people I should familiarize myself with (aside from what's listed on the ideas page, of course)? As you can see from my GitHub and Google Code accounts, I've got some experience with open source projects, but please do suggest any specific tools or methods you think I should try to get up to speed on, as well. Feel free to contact me off-list. Thanks, Sofia From stran104 at chapman.edu Mon Apr 5 10:59:28 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 5 Apr 2010 03:59:28 -0700 Subject: [Biopython] GSoC Ortholog Module Proposal Message-ID: Dear Biopython GSoC list, I am a student at Chapman University and over the last 18 months I have been using biopython to produce phylogenetic trees with ClustalW, T-Coffee, and PHYLIP. I have found the most difficult part to be identifying ortholgos for the particular species that our lab is interested in studying. The orthology databases provide a great deal of matches but each database requires its own wrapper and some databases are stronger than others with particular species. So far I have written wrappers to get ortholog IDs from InParanoid and then fetch the sequences from either NCBI or BioMart. This provides good results for most common species but not all. To handle rare species I have implemented the Reverse Smallest Distance orthology algorithm to run protein-protein searches. It is available at http://ortholog.us. I also have automated scripts to align protein families, concatenate aligned families, and create trees. For GSoC I would like to write a module to abstract finding orthologs as much as possible. This would greatly simplify creating custom evolutionary trees for biologists. The module could fetch orthologs from TreeFam, InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also provide support for producing alignments, concatenating alignments, removing sections of gaps, and constructing trees. Ortholog identification could be done with no dependency other than an internet connection. Alignments and trees would require the user to have the appropriate tools installed. The overhead of writing this type of code makes it difficult for evolutionary biologists and bio wet labs to get a picture of evolutionary relationships in specific groups of species. This module would aim to simplify creating custom phylogenetic trees. A timeline of milestones might look something like this: Week 1-2: Stable wrappers for InParanoid Week 3-4: Stable wrappers for Roundup Week 5-6: Stable wrappers for Treefam Week 6-7: Stable wrappers for BlastO Week 8-9: Ortholog module to abstract the database wrappers Week 10-11: Alignment and tree tools Is there any interest in having such a project? I'd be grateful to get some feedback either on or off list. 
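[Editor's note: the "alignment and tree tools" weeks are only sketched in the proposal above, so here is a rough, hedged illustration of how pieces already in recent Biopython releases could be strung together for one protein family. It assumes clustalw2 is on the PATH and that family.fasta is a hypothetical file of unaligned ortholog sequences; a real module would of course wrap this far more carefully.]

import subprocess
from Bio import AlignIO, Phylo
from Bio.Align.Applications import ClustalwCommandline

# Align one family with ClustalW; it writes family.aln plus a family.dnd guide tree.
cline = ClustalwCommandline("clustalw2", infile="family.fasta")
subprocess.check_call(str(cline), shell=True)

alignment = AlignIO.read("family.aln", "clustal")
print "Aligned %i sequences, alignment length %i" \
      % (len(alignment), alignment.get_alignment_length())

# The new Bio.Phylo module can read and display the guide tree.
tree = Phylo.read("family.dnd", "newick")
Phylo.draw_ascii(tree)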
Best, -Matthew Strand From chapmanb at 50mail.com Mon Apr 5 11:50:00 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Apr 2010 07:50:00 -0400 Subject: [Biopython] SoC project (BioPython and PyCogent) In-Reply-To: References: Message-ID: <20100405115000.GB62718@sobchak.mgh.harvard.edu> Sofia; > I'm working on an application for the Summer of Code project of > integrating BioPython and PyCogent. Great -- glad to you hear you are interested in the project. > I've looked through the list > archives and saw Brad's general advice to other potential SoC > applicants, but I thought I'd introduce myself and see if there was > any advice specific to this project. The overall goal is to provide integration between Biopython and PyCogent so programmers can benefit from the unique features and algorithms in each library. This has two general themes: - Ensuring interoperability between core objects like sequences, alignments and phylogenetic trees. - Using this interoperability to develop analysis workflows that utilize functionality from both libraries. Within this broad scope you are free to orient your proposal to whatever set of biological questions that interest you. We've tried to sketch out some ideas we had on the GSoC page as a starting point. > I've used BioPython in the past > and even explored the code a bit. I'm considering working on one or > more of the bugs in Bugzilla if I can find time, and will work to > familiarize myself with PyCogent. Are there any other concepts, > projects, or people I should familiarize myself with (aside from > what's listed on the ideas page, of course)? Proposals are due this Friday, April 9th and normally require a few rounds of back and forth revisions to get to a competitive level. My suggestion would be to focus on learning enough of Biopython and PyCogent to write out a detailed project plan, with a week by week description of activities and specific goals. > As you can see from my > GitHub and Google Code accounts, I've got some experience with open > source projects, but please do suggest any specific tools or methods > you think I should try to get up to speed on, as well. The open source work is great; definitely include this in your proposal. A good outline to start with is: - Project summary -- A short abstract describing what you hope to accomplish during the summer, how you plan to go about it, and what motivates you to work on the project. - Personal summary -- Describe your background and how it will help you be successful during GSoC. Here is where you can sell yourself to all of the mentors ranking the project: why are you a good coder? Why is this project useful to use? How will working on the summer project encourage you to stay active in the community? - Project plan -- The detailed week by week description of plans mentioned above. Hope this helps, Brad From chapmanb at 50mail.com Mon Apr 5 12:05:54 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Apr 2010 08:05:54 -0400 Subject: [Biopython] GSoC Ortholog Module Proposal In-Reply-To: References: Message-ID: <20100405120554.GC62718@sobchak.mgh.harvard.edu> Matthew; Thanks for the introduction and pointers to your work. Your http://ortholog.us interface looks like a useful resource; it's really nice to see web interfaces being developed with programmable JSON APIs. Out of curiousity, is the code available for what you've done so far? > For GSoC I would like to write a module to abstract finding orthologs as > much as possible. 
This would greatly simplify creating custom evolutionary > trees for biologists. The module could fetch orthologs from TreeFam, > InParanoid, Harvard's Roundup, and Princeton's BLASTO. The module could also > provide support for producing alignments, concatenating alignments, removing > sections of gaps, and constructing trees. Ortholog identification could be > done with no dependency other than an internet connection. Alignments and > trees would require the user to have the appropriate tools installed. [...] > Is there any interest in having such a project? I'd be grateful to get some > feedback either on or off list. This is a good project idea and nicely spec'ed out. One additional direction that might also be worth exploring is using BioMart to retrieve orthologs from the Ensembl Compara work. Here's a recent thread on BioStar with the queries to use: http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale I don't know of Python programming interfaces to BioMart, but there is a nice R bioconductor library that can be leveraged with Rpy2: http://www.bioconductor.org/packages/bioc/html/biomaRt.html http://rpy.sourceforge.net/rpy2.html For the practical GSoC things, project proposals are due this Friday, April 9th so time is running short. I'm unfortunately a bit over-committed as this point to mentor but hopefully someone will be available to step in that role. I'm happy to make suggestions on the proposal as it comes together. Thanks, Brad From bjorn_johansson at bio.uminho.pt Mon Apr 5 13:50:25 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Mon, 5 Apr 2010 14:50:25 +0100 Subject: [Biopython] pro Message-ID: Hi, I have a problem that may be related to biopython (or not). I have written a plugin for a cross platform program (Wikidpad) that relies on some biopython modules. I do the development on ubuntu 9.10 and have Wikidpad installed using wine to be able to test the functionality on windows. Under wine I have added the following code to make biopython installed under linux available to the python interpreter (py2exe) under wine: if sys.platform == 'win32': sys.path.append("z:\usr\local\lib\python2.6\dist-packages") sys.path.append("z:\usr\lib/python2.6") line 40 in "SeqTools.py" below reads: from Bio import SeqIO I get the error below when importing the module under wikidpad running under wine File "C:\Program Files\WikidPad\user_extensions\SeqTools.py", line 40, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\__init__.py", line 303, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\SeqIO\InsdcIO.py", line 29, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\__init__.py", line 53, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py", line 319, in File "z:/usr/local/lib/python2.6/dist-packages\Bio\GenBank\LocationParser.py", line 177, in __init__ File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 88, in __init__ File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 129, in collectRules File "z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py", line 101, in addRule AttributeError: 'NoneType' object has no attribute 'split' I wonder if anyone has an immediate idea of what I am doing wrong? The python interpreter under wine seem to find the biopython modules. I cannot understand the error that I get afterwards..... grateful for help! 
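[Editor's note: two small, hedged observations on the snippet above, shown as a sketch rather than a definitive fix. The backslash escapes in those ordinary string literals only work by luck, so raw strings make the intent explicit; and since Bio.Parsers.spark builds its grammar from method docstrings, a cheap probe for stripped docstrings (the cause suggested later in this thread) can turn the cryptic AttributeError into a clear warning. The paths are of course specific to this particular Wine setup.]

import sys

if sys.platform == 'win32':
    # Raw strings keep the backslashes literal in the Windows-style paths.
    sys.path.append(r"z:\usr\local\lib\python2.6\dist-packages")
    sys.path.append(r"z:\usr\lib\python2.6")

def _probe():
    """Docstring probe."""
    pass

if _probe.__doc__ is None:
    # Compiled with -OO (or an optimising py2exe build): the docstrings that
    # Bio.GenBank's location parser relies on are gone, so the import below
    # would fail with "'NoneType' object has no attribute 'split'".
    print "Warning: docstrings have been stripped; Bio.GenBank cannot be imported"

from Bio import SeqIO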
/bjorn From eric.talevich at gmail.com Mon Apr 5 15:48:04 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 5 Apr 2010 11:48:04 -0400 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 2010/4/5 Bj?rn Johansson > Hi, > I have a problem that may be related to biopython (or not). > I have written a plugin for a cross platform program (Wikidpad) that relies > on some biopython modules. > I do the development on ubuntu 9.10 and have Wikidpad installed using wine > to be able to test the functionality on windows. > > Under wine I have added the following code to make biopython installed > under > linux available to the python interpreter (py2exe) under wine: > [...] > It looks like spark relies on the docstrings in Bio.GenBank.LocationParser. Is there anything in py2exe that would strip the docstrings from compiled modules? Some optimizations do this -- I think "python -O3" strips docstrings, for instance. -Eric From p.j.a.cock at googlemail.com Mon Apr 5 16:16:43 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Apr 2010 17:16:43 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 2010/4/5 Eric Talevich > > It looks like spark relies on the docstrings in Bio.GenBank.LocationParser. > Is there anything in py2exe that would strip the docstrings from compiled > modules? Some optimizations do this -- I think "python -O3" strips > docstrings, for instance. You may be on to something there Eric. Bj?rn, could compare your file: z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py with the version we provide: http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py or: http://biopython.org/SRC/biopython/Bio/Parsers/spark.py In the medium term, I'd like to move the GenBank/EMBL location parsing to something simpler and faster (using regular expressions) and then deprecate Bio.GenBank.LocationParser and indeed the whole of Bio.parsers (which just has a copy of spark). There is a bug open on this with some code. But that isn't going to help Bj?rn right now. Peter From stran104 at chapman.edu Mon Apr 5 19:02:21 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Mon, 5 Apr 2010 12:02:21 -0700 Subject: [Biopython] GSoC Ortholog Module Proposal Message-ID: > Thanks for the introduction and pointers to your work. Your > http://ortholog.us interface looks like a useful resource; it's > really nice to see web interfaces being developed with programmable > JSON APIs. Out of curiousity, is the code available for what you've > done so far? > Thanks, we have found it useful for finding unindexed orthologs. Fetching results from the pre-compiled databases is faster but of course requires writing wrappers that are time consuming to develop. The plan is to release all code as an open source Django app with a paper that is in the works. However, I'd be happy to share any code with mentors/organizers for evaluation purposes off-list in the meantime. > > This is a good project idea and nicely spec'ed out. One additional > direction that might also be worth exploring is using BioMart to > retrieve orthologs from the Ensembl Compara work. Here's a recent > thread on BioStar with the queries to use: > > > http://biostar.stackexchange.com/questions/569/how-do-i-match-orthologues-in-one-species-to-another-genome-scale > > I don't know of Python programming interfaces to BioMart, but there > is a nice R bioconductor library that can be leveraged with Rpy2: > I agree, this would be a good addition. 
I have some messy Python wrappers to BioMart but the Rpy route would probably provide a more reliable solution with less effort. > http://www.bioconductor.org/packages/bioc/html/biomaRt.html > http://rpy.sourceforge.net/rpy2.html > > For the practical GSoC things, project proposals are due this > Friday, April 9th so time is running short. I'm unfortunately a bit > over-committed as this point to mentor but hopefully someone will > be available to step in that role. I'm happy to make suggestions on > the proposal as it comes together. > Thanks, I hope so too. I will post a full proposal in the near future. Feedback would of course be greatly appreciated. I'm a little unclear: do I need a mentor to submit a proposal? Is writing a proposal a moot point without a mentor? Best, -Matt Strand

From vincent at vincentdavis.net Mon Apr 5 19:51:46 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Mon, 5 Apr 2010 13:51:46 -0600 Subject: [Biopython] Build CDF file Message-ID: The custom array for which I have data does not have a CDF file. I have been told that others have changed the header on the CEL files to reference a different CDF file. That only kinda makes sense to me. I obviously have CEL files. I also have the sequences that each probe matches and finally I have genome match data. By that I mean I know which probes are a perfect match and which are a mismatch and the location of the mismatch. Can I build a CDF file from this? How? Does it make sense to build a CDF for each hybrid (not sure that's the right word) of the organism if the genome is known for each. Not sure if this is better asked here or on the BioConductor list. If there is a Python solution I would try that first, I think. I think the bioconductor package altcdfenvs LINK does this. I guess I should email Laurent Gautier, maybe he reads this :) *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn

From biopython at maubp.freeserve.co.uk Mon Apr 5 20:35:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Apr 2010 21:35:20 +0100 Subject: [Biopython] Build CDF file In-Reply-To: References: Message-ID: On Mon, Apr 5, 2010 at 8:51 PM, Vincent Davis wrote: > The custom array for which I have data does not have a CDF > file... Hi Vincent, Did you mean to post this to the BioConductor mailing list? Peter

From biopython at maubp.freeserve.co.uk Mon Apr 5 20:53:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Apr 2010 21:53:42 +0100 Subject: [Biopython] Build CDF file In-Reply-To: <-3455855938884949614@unknownmsgid> References: <-3455855938884949614@unknownmsgid> Message-ID: On Mon, Apr 5, 2010 at 9:46 PM, Vincent Davis wrote: > > No, but maybe I should. I was hoping for a Python solution > Are these CDF files of yours NetCDF files? http://en.wikipedia.org/wiki/NetCDF If so, try Scientific.IO.NetCDF from Konrad Hinsen's ScientificPython http://sourcesup.cru.fr/projects/scientific-py/ Peter

From chapmanb at 50mail.com Tue Apr 6 12:26:27 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 6 Apr 2010 08:26:27 -0400 Subject: [Biopython] GSoC Ortholog Module Proposal In-Reply-To: References: Message-ID: <20100406122627.GE66230@sobchak.mgh.harvard.edu> Matthew; > > Thanks for the introduction and pointers to your work. Your > > http://ortholog.us interface looks like a useful resource; it's > > really nice to see web interfaces being developed with programmable > > JSON APIs. Out of curiousity, is the code available for what you've > > done so far?
> > Thanks, we have found it useful for finding unindexed orthologs. Fetching > results from the pre-compiled databases is faster but of course requires > writing wrappers that are time consuming to develop. The plan is to release > all code as an open source Django app with a paper that is in the works. > However, I'd be happy to share any code with mentors/organizers for > evaluation purposes off-list in the meantime. Cool; definitely let us know on the mailing lists when the paper and code are out. It would be fun to see. > > For the practical GSoC things, project proposals are due this > > Friday, April 9th so time is running short. I'm unfortunately a bit > > over-committed as this point to mentor but hopefully someone will > > be available to step in that role. I'm happy to make suggestions on > > the proposal as it comes together. > > Thanks, I hope so too. I will post a full proposal in the near future. > Feedback would of course be greatly appreciated. I'm a little unclear, do I > need a mentor to submit a proposal? Is writing a proposal a mute point > without a mentor? You will need a mentor and this is always the tough part of GSoC: there are more good students and ideas than mentors and funded spots. I would never discourage anyone from getting together a proposal; it is a good exercise and helps you think through the work you are planning to do. In terms of acceptance rates, it is lower when coming in later in the process with your own ideas since mentors will have already settled on a few ideas and begun feeling committed to students working on those. However, nothing is locked down or decided until the deadline hits, proposals are ranked by all of the mentors, and we see how many spots we'll get from Google. GSoC is kind of like interviewing job candidates without being sure how many positions you'll have at the end. In summary, if you feel like the proposal writing process would be interesting and useful to you, I'd definitely encourage you to go for it and see where it takes you. Brad From bjorn_johansson at bio.uminho.pt Wed Apr 7 09:33:39 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Wed, 7 Apr 2010 10:33:39 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: Hi, thank you very much for the information, I think it has to do with the docstrings, if I run with python -OO under linux, I get the same error msg. as for the two spark files, they seem identical, spark.py is the one i downloaded from http://biopython.org/SRC/biopython/Bio/Parsers/spark.py: diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py produces no output at all. I will try and find out if the optimization can be overridden for one file only. Thanks! /bjorn 2010/4/5 Peter Cock > 2010/4/5 Eric Talevich > > > > It looks like spark relies on the docstrings in > Bio.GenBank.LocationParser. > > Is there anything in py2exe that would strip the docstrings from compiled > > modules? Some optimizations do this -- I think "python -O3" strips > > docstrings, for instance. > > You may be on to something there Eric. 
> > Bj?rn, could compare your file: > z:/usr/local/lib/python2.6/dist-packages\Bio\Parsers\spark.py > with the version we provide: > http://github.com/biopython/biopython/blob/master/Bio/Parsers/spark.py > or: > http://biopython.org/SRC/biopython/Bio/Parsers/spark.py > > In the medium term, I'd like to move the GenBank/EMBL location > parsing to something simpler and faster (using regular expressions) > and then deprecate Bio.GenBank.LocationParser and indeed the > whole of Bio.parsers (which just has a copy of spark). There is > a bug open on this with some code. But that isn't going to help > Bj?rn right now. > > Peter > -- ______O_________oO________oO______o_______oO__ Bj?rn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL http://www.bio.uminho.pt http://sites.google.com/site/bjornhome Work (direct) +351-253 601517 Private mob. +351-967 147 704 Dept of Biology (secretariate) +351-253 60 4310 Dept of Biology (fax) +351-253 678980 From p.j.a.cock at googlemail.com Wed Apr 7 09:37:59 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Apr 2010 10:37:59 +0100 Subject: [Biopython] pro In-Reply-To: References: Message-ID: 010/4/7 Bj?rn Johansson : > Hi, > thank you very much for the information, I think it has to do with the > docstrings, if I run with python -OO under linux, I get the same error msg. > > as for the two spark files, they seem identical, spark.py is the one i > downloaded from > http://biopython.org/SRC/biopython/Bio/Parsers/spark.py: > > diff -w spark.py /usr/local/lib/python2.6/dist-packages/Bio/Parsers/spark.py > > produces no output at all. OK, thanks. I wanted to find out if py2exe was optimising the python files by editing them to remove the docstrings. It seems not. > I will try and find out if the optimization can be overridden for one file > only. > > Thanks! > /bjorn Peter From lunt at ctbp.ucsd.edu Thu Apr 8 00:57:07 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Wed, 7 Apr 2010 17:57:07 -0700 Subject: [Biopython] StockholmIO replaces "." with "-", why? Message-ID: Greetings All! It looks like line 364 of Bio.AlignIO.StockholmIO reads: seqs[id] += seq.replace(".","-") So when you load into memory alignments that mark gaps created to allow alignment to inserts with ".", (such as PFam alignments or the output of hmmer) that information is lost. I know there must be a good reason for this, but I am finding it a problem on my end.. -Bryan Lunt From fuxin at umail.iu.edu Thu Apr 8 01:40:02 2010 From: fuxin at umail.iu.edu (Fuxiao Xin) Date: Wed, 7 Apr 2010 21:40:02 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy Message-ID: Dear all, I am a third year Phd student in Bioinformatics from Indiana University Bloomington. I am very in interested in the google summer code project of biopython "PDB-Tidy: command-line tools for manipulating PDB files". My own research needs extensive manipulation of PDB files, and I think this idea of adding more features to Bio.PDB and more command line options to analyze/present PDB data is excellent. This project is of strong interest to me since it will benefit my own research project as well. Programming Skills: I use perl and python during my daily research. I am now working on developing a new functional site predictor using protein structure information. The code will be open source, but the work is under review so the code is not released yet. My project plan: week1 1. 
Renumber residues starting from 1 (or N)
function name: renumberPDB, given a PDB file, renumber the residues in the ATOM records to remove gaps left by missing amino acids
communicate with mentors to set standards of the code to follow for the rest of the functions
create a work log to keep track of progress;

week2-3
2. Select a portion of the structure -- models, chains, etc. -- and write it to a new file (PDB, FASTA, and other formats)
function name: rewritePDB, inputs will be a particular portion of a PDB file you want to write out (support 'chain', 'model', 'atom'), a file format (PDB, FASTA), and the output name.
3. Perform some basic, well-established measures of model quality/validity
function name: PDBquality
the function will report RESOLUTION and ? of the structure
4. extract disordered regions in a PDB structure
function name: PDBdisorder
report missing residues in the structure ATOM field

week3-4
5. make a function to draw a Ramachandran plot
function name: ramaPLOT
combine the two steps (calculating torsion angles and drawing the plot) into one function, give the option to draw the plot or not

week5
6. open PDB files in a window for visualization, visualize PDBsuperpose results, output RMSD
function name: superposePDB
the function will look like the PDBsuperpose function in MATLAB; use Bio.PDB.Superimposer() to perform the superposition, and use Jmol or another visualization tool to see the results

week6
7. write a function to extract all experimental conditions of a PDB file, including pH, temperature, and salt
function name: PDBcondition
it will be easy to get pH and temperature information, but for salt, it will be hard to parse because there is no general rule for such information in the PDB file; parse REMARK 200 field;

week7-8
8. extract PTM
function name: PDBptm
difficulty: the post-translational modification annotation in PDB is not consistent, need to make a list of PTMs to work on
parse MODRES field

week9-10
9. extract ligand binding information
function name: PDBligand
parse HETNAM field

Other obligations: I am aware that Google Summer of Code starts on May 24th, but I will have a review paper with my advisor due on June 1st. I hope it will be OK for me to start after June 1st, and I will make up the first week in August.

Best, Fuxiao

From eric.talevich at gmail.com Thu Apr 8 03:48:08 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 7 Apr 2010 23:48:08 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: Hi Fuxiao, Thanks for your interest in this project. I see you've been working on this proposal for a while already, so although the submission deadline is very close, I think you'll still be OK. I've interleaved my comments with your proposal below: On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin wrote: > Dear all, > > I am a third year Phd student in Bioinformatics from Indiana University > Bloomington. I am very in interested in the google summer code project of > biopython "PDB-Tidy: command-line tools for manipulating PDB files". > > My own research needs extensive manipulation of PDB files, and I think > this > idea of adding more features to Bio.PDB and more command line options to > analyze/present PDB data is excellent. This project is of strong interest > to > me since it will benefit my own research project as well. > Good to hear. Does your lab have a website? This project requires some knowledge of structural biology, so it helps if we can see what specific research you've already done in that area.
Programming Skills: I use perl and python during my daily research. I am now > working on developing a new functional site predictor using protein > structure information. The code will be open source, but the work is under > review so the code is not released yet. > Is there any other programming work you've done in the past that you could let us see? It doesn't have to be part of an existing open-source project; even some functioning snippets posted somewhere would help us get a sense of your coding style and abilities. Examples where you've used Biopython or another established toolkit for working with PDB files or other scientific data would be especially useful. We also like to see that you're familiar with a project's build tools, which in Biopython's case is GitHub and the standard Python mechanisms. So, if you could upload some of your prior work to GitHub and send us the link, that would be ideal. My project plan: > > week1 > 1. Renumber residues starting from 1 (or N) > function name: renumberPDB, given a pdb file, rename the atom field > numbering of the file to remove missing amino acids > communicate with mentors to set standards of the code to follow for the > rest > of the functions > create work log to keep track of process; > Biopython's coding standards generally follow an earlier version of PEP 8; hopefully you can pick it up quickly just by reading the source code for Bio.PDB -- so you don't really need that item listed here. In the past, students have maintained their weekly schedules on a wiki or other public document, and updated them continually throughout the summer. This functions as a work log, in a way. You would also have an e-mail record of your work from your weekly reports to this list. week2-3 > 2. Select a portion of the structure -- models, chains, etc. -- and write > it > to a new file (PDB, FASTA, and other formats) > function name: rewritePDB, inputs will be a particular portion of a PDB > file > you want to write out(support 'chain', 'model', 'atom'), a file format(PDB, > fasta), and the output name. > 3. Perform some basic, well-established measures of model quality/validity > function name: PDBquality > the function will report RESOLUTION and ? of the structure > 4. extract disorder region in PDB structure > function name: PDBdisorder > report missing residues in the structure atom field > These tasks seem reasonable. You don't need to commit to specific function names yet; it would be more helpful to describe the overall module layout you're planning, and list the dependencies for each (especially the components of Bio.PDB that come into play). > week3-4 > 5. make a function to draw a Ramachandran plot > function name: ramaPLOT > combine the two steps(calcualting torsion angles and draw the plot) into > one > function, give the option to draw the plot or not > This task has a number of dependencies which I think you should list and describe here. Because of those dependencies there's a significant chance of it taking longer than you planned -- so I'd recommend moving it to after the midterm evaluations, wherever those fit into your schedule. week5 > 6. 
open PDB files in the window for visulization, visulize PDBsuperpose > results, output RMSD > function name: superposePDB > the function will look like the PDBsuperpose function in matlab; use > Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other > visulization tool to see the results > Would you build Python wrappers for interacting with the chosen visualization tool, or just write a set of files and launch the viewer in a script? > week6 > 7. write a function to extract all experimental conditions of a PDB file, > includes PH, temperature, and salt > function name: PDBconditon > it will be easy to get PH and temperature information, but for salt, it > will > be hard to parse because there is no general rule of such information in > the > PDB file; parse REMARK 200 field; > Sounds handy. Would your script write out a report combining all of this info, or just extract requested elements? > week7-8 > 8. extract PTM, > function name: PDBptm > difficult: the Post-translational modification annotation in PDB is not > consistant, need to make a list of PTMs to work on > parse MODRES field > > week9-10 > 9. extract ligand binding information > function name: PDBligand > parse HETNAM field > Good. Some of these later items sound straightforward enough that it would be better to tackle them earlier in the summer. > Other obligations: I am aware that google summer code starts from May > 24th, > but I will have a review paper with my advisor due on June 1st, I hope it > will be OK for me to start after June 1st, and I will makeup the first week > in Auguest. > How much of the "community bonding period" will this occupy? The guideline is that you get set up with the build system, read documentation and do background research part-time between GSoC acceptance and May 24, and start writing code full-time on May 24. You can make up for a gap in your project plan by doing extra preparation before coding starts; would this be possible for you? Finally, the GSoC administration app (socghop.appspot.com) gets crowded as the deadline approaches, so it's best if you register yourself there and take care of the administrivia as soon as you can to avoid any trouble on Friday. Best regards, Eric From rozziite at gmail.com Thu Apr 8 03:48:16 2010 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 7 Apr 2010 23:48:16 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: Hi Fuxiao, Good start on the application! Some comments below. On Wed, Apr 7, 2010 at 9:40 PM, Fuxiao Xin wrote: > Dear all, > > I am a third year Phd student in Bioinformatics from Indiana University > Bloomington. ?I am very in interested in the google summer code project of > biopython "PDB-Tidy: command-line tools for manipulating PDB files". > > My own research needs extensive manipulation of PDB files, and I think ?this > idea of adding more features to Bio.PDB and more command line options to > analyze/present PDB data is excellent. This project is of strong interest to > me since it will benefit my own research project as well. > > Programming Skills: I use perl and python during my daily research. I am now > working on developing a new functional site predictor using protein > structure information. The code will be open source, but the work is under > review so the code is not released yet. > > My project plan: > > week1 > 1. 
Renumber residues starting from 1 (or N) > function name: renumberPDB, given a pdb file, rename the atom field > numbering of the file to remove missing amino acids > communicate with mentors to set standards of the code to follow for the rest > of the functions > create work log to keep track of process; > > week2-3 > 2. Select a portion of the structure -- models, chains, etc. -- and write it > to a new file (PDB, FASTA, and other formats) > function name: rewritePDB, inputs will be a particular portion of a PDB file > you want to write out(support 'chain', 'model', 'atom'), a file format(PDB, > fasta), and the output name. > 3. Perform some basic, well-established measures of model quality/validity > function name: PDBquality > the function will report RESOLUTION and ? of the structure Maybe you can get some inspiration of measures of model quality/validity from PDBREPORT database [0] and WHAT_IF [1] software. [0] http://swift.cmbi.ru.nl/gv/pdbreport/ [1] http://swift.cmbi.ru.nl/whatif/ > 4. extract disorder region in PDB structure > function name: PDBdisorder > report missing residues in the structure atom field > > week3-4 > 5. make a function to draw a Ramachandran plot > function name: ramaPLOT > combine the two steps(calcualting torsion angles and draw the plot) into one > function, give the option to draw the plot or not > > week5 > 6. open PDB files in the window for visulization, visulize PDBsuperpose > results, output RMSD > function name: superposePDB > the function will look like the PDBsuperpose function in matlab; use > Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other > visulization tool to see the results > week6 > 7. write a function to extract all experimental conditions of a PDB file, > includes PH, temperature, and salt > function name: PDBconditon > it will be easy to get PH and temperature information, but for salt, it will > be hard to parse because there is no general rule of such information in the > PDB file; parse REMARK 200 field; > > week7-8 > 8. extract PTM, > function name: PDBptm > difficult: the Post-translational modification annotation in PDB is not > consistant, need to make a list of PTMs to work on > parse MODRES field > > week9-10 > 9. extract ligand binding information > function name: PDBligand > parse HETNAM field > > > Other obligations: ?I am aware that google summer code starts from May 24th, > but I will have a review paper with my advisor due on June 1st, I hope it > will be OK for me to start after June 1st, and I will makeup the first week > in Auguest. > > Best, > Fuxiao > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fuxin at indiana.edu Thu Apr 8 07:40:36 2010 From: fuxin at indiana.edu (Fuxiao Xin) Date: Thu, 8 Apr 2010 03:40:36 -0400 Subject: [Biopython] About Google Summer Code Project PDB-tidy In-Reply-To: References: Message-ID: hi Eric and Diana, Thanks for your quick reply. For the quality/validation problem, thanks Diana for pointing me to the two resources, I am surprised that there are so many "problems" defined for PDB files, and obviously I underestimate this task, and I think it's a very interesting problem to study and I'd like to devote more time on this task, I am thinking to make this task the main focus of my first period coding(before midterm check). What do you think? For Eric's responses, please find my reply in line. 
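For reference, the resolution and experiment type that the proposed PDBquality function would report are already reachable through Bio.PDB's header parsing; a minimal sketch, with a hypothetical local file name (PDBquality itself is only a proposed name at this stage):

    from Bio.PDB import PDBParser

    parser = PDBParser(PERMISSIVE=1)
    structure = parser.get_structure("example", "example.pdb")  # hypothetical file
    header = parser.get_header()  # dictionary built from the PDB header records
    print header.get("resolution"), header.get("structure_method")

    # Missing residues show up as gaps in the ATOM records; counting the
    # ordinary (non-hetero) residues actually present in each chain of the
    # first model is one crude starting point for the disorder report.
    for chain in structure[0]:
        print chain.id, len([res for res in chain if res.id[0] == " "])
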
My own research needs extensive manipulation of PDB files, and I think this >> idea of adding more features to Bio.PDB and more command line options to >> analyze/present PDB data is excellent. This project is of strong interest >> to >> me since it will benefit my own research project as well. >> > > Good to hear. Does your lab have a website? This project requires some > knowledge of structural biology, so it helps if we can see what specific > research you've already done in that area. > Our lab's website is : http://www.informatics.indiana.edu/predrag/ , and one main focus of our lab is PTM and disorder, both need to deal with PDB files. A poster title shows my protein structure-based kernel work:* http://www.iscb.org/rocky09-program/rocky09-poster-presenters-abstracts, they didn't put the abstract online. I could send you the abstract if you are interested. * > Programming Skills: I use perl and python during my daily research. I am >> now >> working on developing a new functional site predictor using protein >> structure information. The code will be open source, but the work is under >> review so the code is not released yet. >> > > Is there any other programming work you've done in the past that you could > let us see? It doesn't have to be part of an existing open-source project; > even some functioning snippets posted somewhere would help us get a sense of > your coding style and abilities. Examples where you've used Biopython or > another established toolkit for working with PDB files or other scientific > data would be especially useful. > We also like to see that you're familiar with a project's build tools, which > in Biopython's case is GitHub and the standard Python mechanisms. So, if you > could upload some of your prior work to GitHub and send us the link, that > would be ideal. > I put some of my python code here: http://github.com/fuxiaoxin/my_python_code. I don't have code in python using Bio.PDB. For parsing PDB, my code are in perl for the sake of its regular expression, I seldomly use bioperl or biopython in the past, I write all my own code, that's also why I think I am very clear of all kinds of problems in PDB files. I am quite surprised to find Bio.PDB already have so many modules for various functions. I could upload some of my perl functions if you would like to have a look: I have functions similar to PDBparser, NeighborSearch, DSSP, NACCESS. I have to say I am not very familiar with the build tools of python. But I hope to learn it during the bonding period. I just guided myself through to upload my codes to Github, :) My project plan: >> >> week1 >> 1. Renumber residues starting from 1 (or N) >> function name: renumberPDB, given a pdb file, rename the atom field >> numbering of the file to remove missing amino acids >> communicate with mentors to set standards of the code to follow for the >> rest >> of the functions >> create work log to keep track of process; >> > > Biopython's coding standards generally follow an earlier version of PEP 8; > hopefully you can pick it up quickly just by reading the source code for > Bio.PDB -- so you don't really need that item listed here. > > I will learn from Bio.PDB source code and remove this one. > In the past, students have maintained their weekly schedules on a wiki or > other public document, and updated them continually throughout the summer. > This functions as a work log, in a way. You would also have an e-mail record > of your work from your weekly reports to this list. > That's great to know. > week2-3 >> 2. 
Select a portion of the structure -- models, chains, etc. -- and write >> it >> to a new file (PDB, FASTA, and other formats) >> function name: rewritePDB, inputs will be a particular portion of a PDB >> file >> you want to write out(support 'chain', 'model', 'atom'), a file >> format(PDB, >> fasta), and the output name. >> 3. Perform some basic, well-established measures of model quality/validity >> function name: PDBquality >> the function will report RESOLUTION and ? of the structure >> 4. extract disorder region in PDB structure >> function name: PDBdisorder >> report missing residues in the structure atom field >> > > These tasks seem reasonable. You don't need to commit to specific function > names yet; it would be more helpful to describe the overall module layout > you're planning, and list the dependencies for each (especially the > components of Bio.PDB that come into play). > I will make a new proposal with these details by tomorrow. > >> week3-4 >> 5. make a function to draw a Ramachandran plot >> function name: ramaPLOT >> combine the two steps(calcualting torsion angles and draw the plot) into >> one >> function, give the option to draw the plot or not >> > > This task has a number of dependencies which I think you should list and > describe here. Because of those dependencies there's a significant chance of > it taking longer than you planned -- so I'd recommend moving it to after the > midterm evaluations, wherever those fit into your schedule. > I will add more details here. > week5 >> 6. open PDB files in the window for visulization, visulize PDBsuperpose >> results, output RMSD >> function name: superposePDB >> the function will look like the PDBsuperpose function in matlab; use >> Bio.PDB.Superimposer() to perform the superimpose, use Jmol or other >> visualization tool to see the results >> > > Would you build Python wrappers for interacting with the chosen > visualization tool, or just write a set of files and launch the viewer in a > script? > I am thinking of launching the script, since those PDB visualization tools already have very nice command line options and interfaces. But I think it is really important to be able to visualize the structure on the fly, especially when you are doing PDB superimpose. > week6 >> 7. write a function to extract all experimental conditions of a PDB file, >> includes PH, temperature, and salt >> function name: PDBconditon >> it will be easy to get PH and temperature information, but for salt, it >> will >> be hard to parse because there is no general rule of such information in >> the >> PDB file; parse REMARK 200 field; >> > > Sounds handy. Would your script write out a report combining all of this > info, or just extract requested elements? > I am thinking to put the results into a variable instead of a report, since it will be great for batch processing, and display the results immediately in interactive mode. > > Other obligations: I am aware that google summer code starts from May >> 24th, >> but I will have a review paper with my advisor due on June 1st, I hope it >> will be OK for me to start after June 1st, and I will makeup the first >> week >> in Auguest. >> > > How much of the "community bonding period" will this occupy? The guideline > is that you get set up with the build system, read documentation and do > background research part-time between GSoC acceptance and May 24, and start > writing code full-time on May 24. 
You can make up for a gap in your project > plan by doing extra preparation before coding starts; would this be possible > for you? > I think the bonding period will be really important for me to get known about the python build tools, and of course other stuff you mentors suggest me to learn, so I will devote my time for "bonding". But since I will get busy near the end of May, I plan to start early and do things more efficiently. > > Finally, the GSoC administration app (socghop.appspot.com) gets crowded as > the deadline approaches, so it's best if you register yourself there and > take care of the administrivia as soon as you can to avoid any trouble on > Friday. > Thanks for the reminding. I will incorporate you and Diana's suggestions to make a new version of proposal, by tomorrow night. But the idea is, the main project for the first period would be the quality/validation task , and the second period will be the Ramachandran plot. And I will fill in the time with other small functions. Thanks, Fuxiao From biopython at maubp.freeserve.co.uk Thu Apr 8 08:04:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 09:04:27 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: > Greetings All! > > It looks like line 364 of Bio.AlignIO.StockholmIO reads: > > seqs[id] += seq.replace(".","-") > > So when you load into memory alignments that mark gaps created to > allow alignment to inserts with ".", (such as PFam alignments or the > output of hmmer) that information is lost. > > I know there must be a good reason for this, but I am finding it a > problem on my end.. > > -Bryan Lunt Hi Bryan, Yes, is it done deliberately. The dot is a problem - it has a quite specific meaning of "same as above" on other alignment file formats, while "-" is an almost universal shorthand for gap/insertion. Consider the use case of Stockholm to PHYLIP/FASTA/Clustal conversion. Have you got a sample output file we can use as a unit test or at least discuss? As I recall, on the PFAM alignments I looked at there was no data loss by doing the dot to dash mapping. Peter From sma.hmc at gmail.com Thu Apr 8 09:41:26 2010 From: sma.hmc at gmail.com (Singer Ma) Date: Thu, 8 Apr 2010 02:41:26 -0700 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability Message-ID: I am a junior Computer Science major with heavy bioinformatic leanings at Harvey Mudd College. I know that it is very late for new summer of code applications, but I was wondering if you could have a look at my proposed schedule to give me some pointers and answer a few questions. I am also considering applying for the project involving adding more ways to use R through python, but I was unsure of which project had more users who wanted it completed. Questions: What does it mean by BioPython's acquired sequences? I can't seem to find out what or where information about "acquired sequences" is. Thus, I do not discuss anything about it in my current proposal. For the creation of workflows, do there already exist use and test cases for this or would I be best off looking for ones in papers and trying to mimic them? Right now, I have an example paper where the interoperability would have been helpful. Any other use cases I should immediately consider in my proposal? My current proposed schedule: For Bio Python and PyCogent interoperability. Week 1: Familiarization with the code and soliciting requests. 
While what seems intuitive to me might not seem so to others. It would be best to spend this time to determine a group of people who would highly benefit from the interoperability and ask them for what they would look for. For example, would they rather use one, save the data, and use the other. Would they want to use them directly. Basically, I want to get a good idea of how this code will be used before making my own decisions on how I think people will use it. Also important here is to create sets of data which can be used later on the process. Week 2 and 3: Code converting PyCogent and BioPython. The core objects in each package seem like they should not be too difficult to convert. This step will involve looking into the documentation and coding for PyCogent and BioPython, to determine what the core objects contain for each. One possible problem here is if either PyCogent or BioPython core objects use heavy subclassing, as determining subclassing in Python has been a nightmare in the past. Testing at this point will likely involve going through the entire round trip conversion, and seeing if everything looks the same. Week 4: Ensure that conversions allow the use of data from one program to the other. The workflows of codon usage to clustering code can be tested. One possible test set is from Sharp et. al. 1986. Here they found different codon usage for different genes. Additionally, it should be considered how codon usage can be used to help with making biologically accurate clusters. Week 5: Familiarize with phyloXML and make interoperable with PyCogent. phyloXML has already been added with BioPython. Making phyloXML work with PyCogent could be based on how it was adapted for BioPython. Clear risks here include problems with making sure that the API for phyloXML in PyCogent gives an intuitive interface to use phyloXML. Week 6 and 7: Adapt PyCogent to query genomics databases. Currently there is at least some support for PyCogent to query ENSEMBL. It seems like it would be useful to query other genomics databases such as Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL queries into their MySQL database. Ideally, if everything previously has been alright, the conversion of PyCogent to BioPython forms shoudl already be accounted for. Week 8-12: Slip days and additional features. The initial set of use cases will surely expand and this is extra time to allow for those use cases to be accounted for. Thanks, Singer Ma From biopython at maubp.freeserve.co.uk Thu Apr 8 10:04:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 11:04:10 +0100 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 10:41 AM, Singer Ma wrote: > I am a junior Computer Science major with heavy bioinformatic leanings > at Harvey Mudd College. I know that it is very late for new summer of > code applications, but I was wondering if you could have a look at my > proposed schedule to give me some pointers and answer a few questions. > I am also considering applying for the project involving adding more > ways to use R through python, but I was unsure of which project had > more users who wanted it completed. > > Questions: > What does it mean by BioPython's acquired sequences? I can't seem to > find out what or where information about "acquired sequences" is. > Thus, I do not discuss anything about it in my current proposal. 
http://www.biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability You mean "Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code."? I think Brad means using Biopython to load (parse) sequence data (e.g. with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in the sense of get/load data. > Week 6 and 7: Adapt PyCogent to query genomics databases. Currently > there is at least some support for PyCogent to query ENSEMBL. It seems > like it would be useful to query other genomics databases such as > Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL > queries into their MySQL database. ... Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are language neutral and we have Bio.Entrez to support them in Biopython. Peter From biopython at maubp.freeserve.co.uk Thu Apr 8 10:26:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Apr 2010 11:26:10 +0100 Subject: [Biopython] Biopython 1.54b test failures In-Reply-To: References: <4BB67835.7030303@uci.edu> Message-ID: On Sat, Apr 3, 2010 at 12:22 AM, Peter wrote: > > I recall trying the universal read lines thing before without > success in the SCOP tests - maybe it was this line 72 thing > that I missed. I'll take another look at this next week (when > I have access to a Windows machine). > You are right - that does make the two SCOP tests pass on Windows without having to first convert the SCOP example files from Unix to DOS/Windows newlines. Checked in. Would you like to be credited for this in the NEWS and CONTRIB files? Thanks, Peter From sma.hmc at gmail.com Thu Apr 8 10:31:10 2010 From: sma.hmc at gmail.com (Singer Ma) Date: Thu, 8 Apr 2010 03:31:10 -0700 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: > You mean "Connecting Biopython acquired sequences to PyCogent's > alignment, phylogenetic tree preparation and tree visualization code."? > > I think Brad means using Biopython to load (parse) sequence data (e.g. > with Bio.AlignIO), and then give this to PyCogent. i.e. Acquire data in > the sense of get/load data. Ah, so, its just the most straightforward use of the conversion tools that would be made. Sorry, I thought I was missing something here. Shouldn't be this be taken care of in the first use case of "Allow round-trip conversion between biopython and pycogent core objects (sequence, alignment, tree, etc.)."? Or does this require me to determine how the interactions will be made? > > Are you are talking about the NCBI Entrez Utitlites (E-Utils)? Those are > language neutral and we have Bio.Entrez to support them in Biopython. Ah, I misread my information, so NCBI Entrez can already be queried. What exactly do we need to get from ENSEMBL that isn't already supported then? Singer From chapmanb at 50mail.com Thu Apr 8 12:39:53 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 8 Apr 2010 08:39:53 -0400 Subject: [Biopython] GSoC - BioPython and PyCogent Interoperability In-Reply-To: References: Message-ID: <20100408123953.GG911@sobchak.mgh.harvard.edu> Singer; Thanks for the introduction and initial project plan. Glad that you are interested. I'll try to tackle a few of the specific points Peter has not already talked about, and suggest some specifics for the application. > Questions: > What does it mean by BioPython's acquired sequences? 
I can't seem to > find out what or where information about "acquired sequences" is. > Thus, I do not discuss anything about it in my current proposal. Following up on what Peter mentioned, what we're trying to say there is to use the results from step 1 (interoperability) to create unique workflows that use both Biopython and PyCogent. This is a suggested workflow to utilize some of the strengths of both packages. > For the creation of workflows, do there already exist use and test > cases for this or would I be best off looking for ones in papers and > trying to mimic them? Right now, I have an example paper where the > interoperability would have been helpful. Yes, that is exactly the right approach. The ideas we've suggested are just brainstorming; please select workflows that are interesting to you. > My current proposed schedule: > > For Bio Python and PyCogent interoperability. > Week 1: Familiarization with the code and soliciting requests. While > what seems intuitive to me might not seem so to others. It would be > best to spend this time to determine a group of people who would > highly benefit from the interoperability and ask them for what they > would look for. For example, would they rather use one, save the data, > and use the other. Would they want to use them directly. Basically, I > want to get a good idea of how this code will be used before making my > own decisions on how I think people will use it. Also important here > is to create sets of data which can be used later on the process. All of this type of non-coding work should be done in the community bonding period, from April 26th to the start of coding. When week 1 hits, you want to be ready to code. See the timeline for more specific information on dates: http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline > Week 5: Familiarize with phyloXML and make interoperable with > PyCogent. phyloXML has already been added with BioPython. Making > phyloXML work with PyCogent could be based on how it was adapted for > BioPython. Clear risks here include problems with making sure that the > API for phyloXML in PyCogent gives an intuitive interface to use > phyloXML. Again, all of the non-coding activities should be moved to before the actual coding period. In your timeline you want to focus on code deliverables for each week. Of course there will be learning and reading during the program, but you want to be sure to have a code centric focus. > Week 6 and 7: Adapt PyCogent to query genomics databases. Currently > there is at least some support for PyCogent to query ENSEMBL. It seems > like it would be useful to query other genomics databases such as > Entrez of NCBI. Unfortunately, it seems like NCBI only has PERL > queries into their MySQL database. Ideally, if everything previously > has been alright, the conversion of PyCogent to BioPython forms shoudl > already be accounted for. Following up on your discussion with Peter, you should think about some workflows that use Biopython Entrez queries and PyCogent Ensembl queries to answer interesting questions that could not be done with either. This should help to focus your ideas on integration and workflows, as opposed to implementing new functionality. > Week 8-12: Slip days and additional features. The initial set of use > cases will surely expand and this is extra time to allow for those use > cases to be accounted for. You need to continue your detailed project plan for the entire period. 
See the examples in the NESCent application documentation to get an idea of the level of detail in accepted projects from previous years: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html Practically, applications are due tomorrow, so you should have a submission sent in to OpenBio through the GSoC interface (http://socghop.appspot.com). Hope this helps, Brad From vincent at vincentdavis.net Thu Apr 8 18:33:41 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 12:33:41 -0600 Subject: [Biopython] affy CEL and CDF reader Message-ID: I ended up writing my own modules for reading both affy Cel and CDF files. Long story as to why I did not just use what was available in biopython. I plan on making what I have done available to the biopython and will upload it as a fork. I will outline what ways what I have is different below. My question is: Are there any improvements(features) others would like to see beyond what is avalible in the current CelFile.py? I saw some posts a month or so ago about checking for consistency in cell file, I think it was something about making sure the stated number of probes was consistent with the intensity measurements. What is different, when an file is read Affycel.read('file') many atributes are set. for example a = affcel() a.read('testfile') a.filename, a.version, a.header.items() # a dictionary of all header items a.num_intensity a.intensity a.num_masks a.masks a.num_outliers a.outliers a.numb_modified a.modified I plan to add the ability return/call intensity values with our with outliers or mask values. All data is currently store in numpy structured arrays, currently a.intensity returns the structured array, but I plan on making it an option to easily choose how this is returned. also what to make an optional normalized intensity array so that if the data is normalized it can be stored with the affycel instance. My use case was that I was opening about 80 cel files and reading them in was slow. this allowed me to read each file as an instance of affycel stored in a list that I then pickled. It was then much faster to open them. Are improvements to the CelFile.py are of value to biopython? I hope to have the code pushed up to my fork on github late tonight. Just thought I would ask if there was any suggestion before I did. Also have an CDF file reader, but only have done some basic testing. I don't have a lot of use for this, do other biopython users? I am kinda working in a vacuum and am trying to get more involved in projects to improve my skills and knowledge. Any suggestions would be appreciated. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From sdavis2 at mail.nih.gov Thu Apr 8 18:56:12 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 14:56:12 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis wrote: > I ended up writing my own modules for reading both affy Cel and CDF files. > Long story as to why I did not just use what was available in biopython. > I plan on making what I have done available to the biopython and will upload > it as a fork. I will outline what ways what I have is different below. > My question is: Are there any improvements(features) others would like to > see beyond what is avalible in the current CelFile.py? 
> I saw some posts a month or so ago about checking for consistency in cell > file, I think it was something about making sure the stated number of probes > was consistent with the intensity measurements. > > What is different, > when an file is read Affycel.read('file') many atributes are set. for > example > a = affcel() > a.read('testfile') > a.filename, > a.version, > a.header.items() ?# a dictionary of all header items > a.num_intensity > a.intensity > a.num_masks > a.masks > a.num_outliers > a.outliers > a.numb_modified > a.modified > > I plan to add the ability return/call intensity values with our with > outliers or mask values. > All data is currently store in numpy structured arrays, > currently a.intensity returns the structured array, but I plan on making it > an option to easily choose how this is returned. > also what to make an optional normalized intensity array so that if the data > is normalized it can be stored with the affycel instance. My use case was > that I was opening about 80 cel files and reading them in was slow. this > allowed me to read each file as an instance of affycel stored in a list that > I then pickled. It was then much faster to open them. > > Are improvements to the CelFile.py are of value to biopython? > > I hope to have the code pushed up to my fork on github late tonight. Just > thought I would ask if there was any suggestion before I did. > > Also have an CDF file reader, but only have done some basic testing. I don't > have a lot of use for this, do other biopython users? > > I am kinda working in a vacuum and am trying to get more involved in > projects to improve my skills and knowledge. Any suggestions would be > appreciated. Just out of curiosity, is your work based on the affy sdk, or are you parsing stuff yourself? Sean From vincent at vincentdavis.net Thu Apr 8 19:03:38 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 13:03:38 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: Parsing it myself, But based directly an the affy documentation found here. http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis wrote: > On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > wrote: > > I ended up writing my own modules for reading both affy Cel and CDF > files. > > Long story as to why I did not just use what was available in biopython. > > I plan on making what I have done available to the biopython and will > upload > > it as a fork. I will outline what ways what I have is different below. > > My question is: Are there any improvements(features) others would like to > > see beyond what is avalible in the current CelFile.py? > > I saw some posts a month or so ago about checking for consistency in cell > > file, I think it was something about making sure the stated number of > probes > > was consistent with the intensity measurements. > > > > What is different, > > when an file is read Affycel.read('file') many atributes are set. for > > example > > a = affcel() > > a.read('testfile') > > a.filename, > > a.version, > > a.header.items() # a dictionary of all header items > > a.num_intensity > > a.intensity > > a.num_masks > > a.masks > > a.num_outliers > > a.outliers > > a.numb_modified > > a.modified > > > > I plan to add the ability return/call intensity values with our with > > outliers or mask values. 
> > All data is currently store in numpy structured arrays, > > currently a.intensity returns the structured array, but I plan on making > it > > an option to easily choose how this is returned. > > also what to make an optional normalized intensity array so that if the > data > > is normalized it can be stored with the affycel instance. My use case was > > that I was opening about 80 cel files and reading them in was slow. this > > allowed me to read each file as an instance of affycel stored in a list > that > > I then pickled. It was then much faster to open them. > > > > Are improvements to the CelFile.py are of value to biopython? > > > > I hope to have the code pushed up to my fork on github late tonight. Just > > thought I would ask if there was any suggestion before I did. > > > > Also have an CDF file reader, but only have done some basic testing. I > don't > > have a lot of use for this, do other biopython users? > > > > I am kinda working in a vacuum and am trying to get more involved in > > projects to improve my skills and knowledge. Any suggestions would be > > appreciated. > > Just out of curiosity, is your work based on the affy sdk, or are you > parsing stuff yourself? > > Sean > From sdavis2 at mail.nih.gov Thu Apr 8 19:40:01 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 15:40:01 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis wrote: > Parsing it myself, But based directly an the affy documentation found here. > http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ So, are you covering both binary and text formats for .CEL files? I think that modern .CEL files (those produced by GCOS) are binary and represent the majority of .CEL files produced today. Some of the I/O issues that you discuss are almost definitely dealt with by using the binary .CEL files. I'm certainly not an expert on Affy, so take all these questions/comments with a grain of salt. Sean > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis wrote: > >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis >> wrote: >> > I ended up writing my own modules for reading both affy Cel and CDF >> files. >> > Long story as to why I did not just use what was available in biopython. >> > I plan on making what I have done available to the biopython and will >> upload >> > it as a fork. I will outline what ways what I have is different below. >> > My question is: Are there any improvements(features) others would like to >> > see beyond what is avalible in the current CelFile.py? >> > I saw some posts a month or so ago about checking for consistency in cell >> > file, I think it was something about making sure the stated number of >> probes >> > was consistent with the intensity measurements. >> > >> > What is different, >> > when an file is read Affycel.read('file') many atributes are set. for >> > example >> > a = affcel() >> > a.read('testfile') >> > a.filename, >> > a.version, >> > a.header.items() ?# a dictionary of all header items >> > a.num_intensity >> > a.intensity >> > a.num_masks >> > a.masks >> > a.num_outliers >> > a.outliers >> > a.numb_modified >> > a.modified >> > >> > I plan to add the ability return/call intensity values with our with >> > outliers or mask values. >> > All data is currently store in numpy structured arrays, >> > currently a.intensity returns the structured array, but I plan on making >> it >> > an option to easily choose how this is returned. 
>> > also what to make an optional normalized intensity array so that if the >> data >> > is normalized it can be stored with the affycel instance. My use case was >> > that I was opening about 80 cel files and reading them in was slow. this >> > allowed me to read each file as an instance of affycel stored in a list >> that >> > I then pickled. It was then much faster to open them. >> > >> > Are improvements to the CelFile.py are of value to biopython? >> > >> > I hope to have the code pushed up to my fork on github late tonight. Just >> > thought I would ask if there was any suggestion before I did. >> > >> > Also have an CDF file reader, but only have done some basic testing. I >> don't >> > have a lot of use for this, do other biopython users? >> > >> > I am kinda working in a vacuum and am trying to get more involved in >> > projects to improve my skills and knowledge. Any suggestions would be >> > appreciated. >> >> Just out of curiosity, is your work based on the affy sdk, or are you >> parsing stuff yourself? >> >> Sean >> > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From vincent at vincentdavis.net Thu Apr 8 19:43:57 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 13:43:57 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: No I was not reading the binary files. That said I am interested in perusing that if there is interest. Do you have a link to the SDK? *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis wrote: > On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis > wrote: > > Parsing it myself, But based directly an the affy documentation found > here. > > > http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ > > So, are you covering both binary and text formats for .CEL files? I > think that modern .CEL files (those produced by GCOS) are binary and > represent the majority of .CEL files produced today. Some of the I/O > issues that you discuss are almost definitely dealt with by using the > binary .CEL files. > > I'm certainly not an expert on Affy, so take all these > questions/comments with a grain of salt. > > Sean > > > > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis > wrote: > > > >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > > >> wrote: > >> > I ended up writing my own modules for reading both affy Cel and CDF > >> files. > >> > Long story as to why I did not just use what was available in > biopython. > >> > I plan on making what I have done available to the biopython and will > >> upload > >> > it as a fork. I will outline what ways what I have is different below. > >> > My question is: Are there any improvements(features) others would like > to > >> > see beyond what is avalible in the current CelFile.py? > >> > I saw some posts a month or so ago about checking for consistency in > cell > >> > file, I think it was something about making sure the stated number of > >> probes > >> > was consistent with the intensity measurements. > >> > > >> > What is different, > >> > when an file is read Affycel.read('file') many atributes are set. 
for > >> > example > >> > a = affcel() > >> > a.read('testfile') > >> > a.filename, > >> > a.version, > >> > a.header.items() # a dictionary of all header items > >> > a.num_intensity > >> > a.intensity > >> > a.num_masks > >> > a.masks > >> > a.num_outliers > >> > a.outliers > >> > a.numb_modified > >> > a.modified > >> > > >> > I plan to add the ability return/call intensity values with our with > >> > outliers or mask values. > >> > All data is currently store in numpy structured arrays, > >> > currently a.intensity returns the structured array, but I plan on > making > >> it > >> > an option to easily choose how this is returned. > >> > also what to make an optional normalized intensity array so that if > the > >> data > >> > is normalized it can be stored with the affycel instance. My use case > was > >> > that I was opening about 80 cel files and reading them in was slow. > this > >> > allowed me to read each file as an instance of affycel stored in a > list > >> that > >> > I then pickled. It was then much faster to open them. > >> > > >> > Are improvements to the CelFile.py are of value to biopython? > >> > > >> > I hope to have the code pushed up to my fork on github late tonight. > Just > >> > thought I would ask if there was any suggestion before I did. > >> > > >> > Also have an CDF file reader, but only have done some basic testing. I > >> don't > >> > have a lot of use for this, do other biopython users? > >> > > >> > I am kinda working in a vacuum and am trying to get more involved in > >> > projects to improve my skills and knowledge. Any suggestions would be > >> > appreciated. > >> > >> Just out of curiosity, is your work based on the affy sdk, or are you > >> parsing stuff yourself? > >> > >> Sean > >> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > From vincent at vincentdavis.net Thu Apr 8 20:21:32 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 8 Apr 2010 14:21:32 -0600 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: Maybe I should have started this discussion differently. Is there any need for improvements to the ability to read CEL files or CDF files and if so what are they? I am interested in contributing. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Apr 8, 2010 at 12:33 PM, Vincent Davis wrote: > I ended up writing my own modules for reading both affy Cel and CDF files. > Long story as to why I did not just use what was available in biopython. > I plan on making what I have done available to the biopython and will > upload it as a fork. I will outline what ways what I have is different > below. > My question is: Are there any improvements(features) others would like to > see beyond what is avalible in the current CelFile.py? > I saw some posts a month or so ago about checking for consistency in cell > file, I think it was something about making sure the stated number of probes > was consistent with the intensity measurements. > > What is different, > when an file is read Affycel.read('file') many atributes are set. 
for > example > a = affcel() > a.read('testfile') > a.filename, > a.version, > a.header.items() # a dictionary of all header items > a.num_intensity > a.intensity > a.num_masks > a.masks > a.num_outliers > a.outliers > a.numb_modified > a.modified > > I plan to add the ability return/call intensity values with our with > outliers or mask values. > All data is currently store in numpy structured arrays, > currently a.intensity returns the structured array, but I plan on making it > an option to easily choose how this is returned. > also what to make an optional normalized intensity array so that if the > data is normalized it can be stored with the affycel instance. My use case > was that I was opening about 80 cel files and reading them in was slow. this > allowed me to read each file as an instance of affycel stored in a list that > I then pickled. It was then much faster to open them. > > Are improvements to the CelFile.py are of value to biopython? > > I hope to have the code pushed up to my fork on github late tonight. Just > thought I would ask if there was any suggestion before I did. > > Also have an CDF file reader, but only have done some basic testing. I > don't have a lot of use for this, do other biopython users? > > I am kinda working in a vacuum and am trying to get more involved in > projects to improve my skills and knowledge. Any suggestions would be > appreciated. > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > > From sdavis2 at mail.nih.gov Thu Apr 8 22:31:43 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 8 Apr 2010 18:31:43 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 3:43 PM, Vincent Davis wrote: > No I was not reading the binary files. That said I am interested in perusing > that if there is interest. > Do you have a link to the SDK? I believe this will get you close: http://www.affymetrix.com/partners_programs/programs/developer/fusion/index.affx?terms=no I hope my questions are not taken the wrong way, but I have learned from the bioconductor project that dealing with vendor file formats is often a non-trivial pursuit. It isn't always easy to think of all the edge cases. Sean > ?*Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > ?my blog | > LinkedIn > > > On Thu, Apr 8, 2010 at 1:40 PM, Sean Davis wrote: > >> On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis >> wrote: >> > Parsing it myself, But based directly an the affy documentation found >> here. >> > >> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/ >> >> So, are you covering both binary and text formats for .CEL files? ?I >> think that modern .CEL files (those produced by GCOS) are binary and >> represent the majority of .CEL files produced today. ?Some of the I/O >> issues that you discuss are almost definitely dealt with by using the >> binary .CEL files. >> >> I'm certainly not an expert on Affy, so take all these >> questions/comments with a grain of salt. >> >> Sean >> >> >> > On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis >> wrote: >> > >> >> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis > > >> >> wrote: >> >> > I ended up writing my own modules for reading both affy Cel and CDF >> >> files. >> >> > Long story as to why I did not just use what was available in >> biopython. >> >> > I plan on making what I have done available to the biopython and will >> >> upload >> >> > it as a fork. I will outline what ways what I have is different below. 
>> >> > My question is: Are there any improvements(features) others would like >> to >> >> > see beyond what is avalible in the current CelFile.py? >> >> > I saw some posts a month or so ago about checking for consistency in >> cell >> >> > file, I think it was something about making sure the stated number of >> >> probes >> >> > was consistent with the intensity measurements. >> >> > >> >> > What is different, >> >> > when an file is read Affycel.read('file') many atributes are set. for >> >> > example >> >> > a = affcel() >> >> > a.read('testfile') >> >> > a.filename, >> >> > a.version, >> >> > a.header.items() ?# a dictionary of all header items >> >> > a.num_intensity >> >> > a.intensity >> >> > a.num_masks >> >> > a.masks >> >> > a.num_outliers >> >> > a.outliers >> >> > a.numb_modified >> >> > a.modified >> >> > >> >> > I plan to add the ability return/call intensity values with our with >> >> > outliers or mask values. >> >> > All data is currently store in numpy structured arrays, >> >> > currently a.intensity returns the structured array, but I plan on >> making >> >> it >> >> > an option to easily choose how this is returned. >> >> > also what to make an optional normalized intensity array so that if >> the >> >> data >> >> > is normalized it can be stored with the affycel instance. My use case >> was >> >> > that I was opening about 80 cel files and reading them in was slow. >> this >> >> > allowed me to read each file as an instance of affycel stored in a >> list >> >> that >> >> > I then pickled. It was then much faster to open them. >> >> > >> >> > Are improvements to the CelFile.py are of value to biopython? >> >> > >> >> > I hope to have the code pushed up to my fork on github late tonight. >> Just >> >> > thought I would ask if there was any suggestion before I did. >> >> > >> >> > Also have an CDF file reader, but only have done some basic testing. I >> >> don't >> >> > have a lot of use for this, do other biopython users? >> >> > >> >> > I am kinda working in a vacuum and am trying to get more involved in >> >> > projects to improve my skills and knowledge. Any suggestions would be >> >> > appreciated. >> >> >> >> Just out of curiosity, is your work based on the affy sdk, or are you >> >> parsing stuff yourself? >> >> >> >> Sean >> >> >> > _______________________________________________ >> > Biopython mailing list ?- ?Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From reece at berkeley.edu Thu Apr 8 23:38:10 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 16:38:10 -0700 Subject: [Biopython] SeqIO.parse exception on Google App Engine Message-ID: <4BBE68E2.2030803@berkeley.edu> Hi- I'm trying to fetch a Genbank record and parse it in the Google App Engine environment. A command line version works fine, but when using exactly the same code under Google App Engine, SeqIO throws throws the following exception: ... File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line 746, in parse_footer self.line = self.line.rstrip(os.linesep) AttributeError: 'module' object has no attribute 'linesep' The environment: - Ubuntu Lucid beta1 - Python 2.6.5 - Biopython 1.53 - GAE 1.3.2 Test case: I put together a simple test case that retrieves a raw (text) Genbank record using Bio.Entrez (efetch); this works in both environments. 
Parsing that record works on the command line, but not under GAE. - curl http://harts.net/reece/tmp/demo1.tgz | tar -xvzf- - cd demo1 - update symlink ./Bio to a Biopython tree eg$ ln -s /usr/share/pyshared/Bio Bio My intent is to prepend Bio to sys.paths much the way I would expect this to be deployed (i.e., without updating sys.path). Command line test: $ ./lookup fetch_text:LOCUS NM_004006 13993 bp mRNA linear PRI 25-MAR-2010 fetch_parse:NM_004006.2 / NM_004006 / Homo sapiens dystrophin (DMD), transcript variant Dp427m, GAE test: In the demo1 directory: $ dev_appserver.py . and, in another terminal: $ curl http://localhost:8080/ You'll see the exception in the http reply and in the appserver log Thanks for any help/advice/pointers, Reece P.S. I'm learning Python and GAE at the same time, so silly errors are possible (nay, likely). From chapmanb at 50mail.com Fri Apr 9 01:19:45 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 8 Apr 2010 21:19:45 -0400 Subject: [Biopython] SeqIO.parse exception on Google App Engine In-Reply-To: <4BBE68E2.2030803@berkeley.edu> References: <4BBE68E2.2030803@berkeley.edu> Message-ID: <20100409011945.GE2011@kunkel> Hi Reece; > I'm trying to fetch a Genbank record and parse it in the Google App Engine > environment. A command line version works fine, but when using exactly the > same code under Google App Engine, SeqIO throws throws the following > exception: > ... > File "/local/home/reece/tmp/demo1/Bio/GenBank/Scanner.py", line > 746, in parse_footer > self.line = self.line.rstrip(os.linesep) > AttributeError: 'module' object has no attribute 'linesep' The python on Google App Engine is a bit crippled and lacks some of the functionality of a full python install. It looks like one issue must be that os.linesep is not defined on GAE. A quick fix is to modify this to "\n", or just do: os.linesep = "\n" at the top of the Scanner.py file. It would be really useful if you were able to submit a patch or list of areas where Biopython fails on app engine and we can think about how to suitably modify the code base to work on GAE and still be compatible with Windows. I did a bit of work on this using Biopython in Google App Engine last year; code is on GitHub here: http://github.com/chapmanb/biosqlweb that might be helpful as a starting place for other ideas. Good luck and let us know how your GAE experience goes, Brad From reece at berkeley.edu Fri Apr 9 02:34:48 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 19:34:48 -0700 Subject: [Biopython] SeqIO.parse exception on Google App Engine In-Reply-To: <20100409011945.GE2011@kunkel> References: <4BBE68E2.2030803@berkeley.edu> <20100409011945.GE2011@kunkel> Message-ID: <4BBE9248.2080502@berkeley.edu> Hi Brad. Thanks for the quick reply. On 04/08/2010 06:19 PM, Brad Chapman wrote: > A quick fix is to > modify this to "\n", or just do: > > os.linesep = "\n" > > at the top of the Scanner.py file. > It turns out that this fix also works within the module that does the parse. To wit: from Bio import SeqIO os.linesep = '\n' rec = SeqIO.parse(...) > I did a bit of work on this using Biopython in Google App Engine > last year; code is on GitHub here: > http://github.com/chapmanb/biosqlweb > that might be helpful as a starting place for other ideas. > Yes, thank you for this. This is precisely where I started only a few days ago... 
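Pulling the pieces together, a minimal end-to-end sketch of the workaround discussed above: the accession matches the record used elsewhere in this thread, the e-mail address is a placeholder, and reassigning os.linesep is harmless on a normal Python install.

    import os
    os.linesep = "\n"  # restore the attribute that App Engine's os module lacks

    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.org"  # placeholder; NCBI asks for a real address
    handle = Entrez.efetch(db="nucleotide", id="NM_004006.2", rettype="gb")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print record.id, len(record), record.description[:50]
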
Cheers, Reece From reece at berkeley.edu Fri Apr 9 04:46:36 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 08 Apr 2010 21:46:36 -0700 Subject: [Biopython] GenBank.Scanner use of os.linesep Message-ID: <4BBEB12C.8030907@berkeley.edu> Hi All- I recently discovered that the GenBank parser doesn't work on Google App Engine because os.linesep is undefined (GenBank/Scanner.py:746): 745 # if self.line[-1] == "\n" : self.line = self.line[:-1] 746 self.line = self.line.rstrip(os.linesep) 747 misc_lines.append(self.line) Defining os.linesep is sufficient to fix the problem (thanks to Brad Chapman). It seems to me that this use of os.linesep is probably mistaken here. If the file comes from efetch, the line separator will be \n regardless of platform [1] and that is what should be used in rstrip. It's possible that the file might come from a dog-foresaken CRLF platform and therefore contain that line separator. So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps, rstrip('\n\r'). Although the need for the latter is probably rare, I don't see that it costs anything to cover that case by adding \r. I'm new to this community, so I don't know whether we now have ferocious debate about the merits of line terminators or, rather, I submit a lame one-liner patch against the git HEAD. Thanks for Biopython. Cheers, Reece [1] For reference, here's a web request that should be equivalent to the efetch. On line 5, 0a is LF is \n. apt12j$ curl -s 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=238018044&rettype=gb' | hexdump -C | head 00000000 4c 4f 43 55 53 20 20 20 20 20 20 20 4e 4d 5f 30 |LOCUS NM_0| 00000010 30 34 30 30 36 20 20 20 20 20 20 20 20 20 20 20 |04006 | 00000020 20 20 20 31 33 39 39 33 20 62 70 20 20 20 20 6d | 13993 bp m| 00000030 52 4e 41 20 20 20 20 6c 69 6e 65 61 72 20 20 20 |RNA linear | 00000040 50 52 49 20 32 35 2d 4d 41 52 2d 32 30 31 30 0a |PRI 25-MAR-2010.| 00000050 44 45 46 49 4e 49 54 49 4f 4e 20 20 48 6f 6d 6f |DEFINITION Homo| -- Reece Hart, Ph.D. Chief Scientist, Genome Commons http://genomecommons.org/ Center for Computational Biology 324G Stanley Hall UC Berkeley / QB3 Berkeley, CA 94720 From biopython at maubp.freeserve.co.uk Fri Apr 9 08:54:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 09:54:53 +0100 Subject: [Biopython] GenBank.Scanner use of os.linesep In-Reply-To: <4BBEB12C.8030907@berkeley.edu> References: <4BBEB12C.8030907@berkeley.edu> Message-ID: On Fri, Apr 9, 2010 at 5:46 AM, Reece Hart wrote: > Hi All- > > I recently discovered that the GenBank parser doesn't work on Google App > Engine because os.linesep is undefined (GenBank/Scanner.py:746): > > ? 745 ? ?# ? ? ? ? ? ?if self.line[-1] == "\n" : self.line = self.line[:-1] > ? 746 ? ? ? ? ? ? ? ?self.line = self.line.rstrip(os.linesep) > ? 747 ? ? ? ? ? ? ? ?misc_lines.append(self.line) > > Defining os.linesep is sufficient to fix the problem (thanks to Brad > Chapman). > > It seems to me that this use of os.linesep is probably mistaken here. I agree. > If the > file comes from efetch, the line separator will be \n regardless of platform > [1] and that is what should be used in rstrip. It's possible that the file > might come from a dog-foresaken CRLF platform and therefore contain that > line separator. I think it would break in a more common setting - passing a file on Windows with CRLF, since Python will turn that into just \n. > So, I humbly propose that 746 be changed to either rstrip('\n') or, perhaps, > rstrip('\n\r'). 
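A quick interactive check makes the difference concrete on a CRLF-terminated line:

    >>> "ORIGIN\r\n".rstrip("\n")
    'ORIGIN\r'
    >>> "ORIGIN\r\n".rstrip("\n\r")
    'ORIGIN'
    >>> "ORIGIN\r\n".rstrip()   # strips any trailing whitespace, including \r\n
    'ORIGIN'
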
Although the need for the latter is probably rare, I don't > see that it costs anything to cover that case by adding \r. A plain rstrip() would also work and get rid of any trailing whitespace. I've checked that in. > I'm new to this community, so I don't know whether we now have ferocious > debate about the merits of line terminators or, rather, I submit a lame > one-liner patch against the git HEAD. For something this trivial, your verbal patch is fine. Would you like to be added to the NEWS and CONTRIB file? Peter From biopython at maubp.freeserve.co.uk Fri Apr 9 12:08:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 13:08:03 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: > On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >> Greetings All! >> >> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >> >> seqs[id] += seq.replace(".","-") >> >> So when you load into memory alignments that mark gaps created to >> allow alignment to inserts with ".", (such as PFam alignments or the >> output of hmmer) that information is lost. >> >> I know there must be a good reason for this, but I am finding it a >> problem on my end.. >> >> -Bryan Lunt > > Hi Bryan, > > Yes, is it done deliberately. The dot is a problem - it has a quite > specific meaning of "same as above" on other alignment file > formats, while "-" is an almost universal shorthand for gap/insertion. > Consider the use case of Stockholm to PHYLIP/FASTA/Clustal > conversion. > > Have you got a sample output file we can use as a unit test or > at least discuss? As I recall, on the PFAM alignments I looked > at there was no data loss by doing the dot to dash mapping. According to http://sonnhammer.sbc.su.se/Stockholm.html >> Sequence letters may include any characters except >> whitespace. Gaps may be indicated by "." or "-". So a Stockholm file using a mixture of "." and "-" would be valid but a bit odd. Why would anyone do that? Peter From cjfields at illinois.edu Fri Apr 9 12:51:35 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 9 Apr 2010 07:51:35 -0500 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> On Apr 9, 2010, at 7:08 AM, Peter wrote: > On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: >> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >>> Greetings All! >>> >>> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >>> >>> seqs[id] += seq.replace(".","-") >>> >>> So when you load into memory alignments that mark gaps created to >>> allow alignment to inserts with ".", (such as PFam alignments or the >>> output of hmmer) that information is lost. >>> >>> I know there must be a good reason for this, but I am finding it a >>> problem on my end.. >>> >>> -Bryan Lunt >> >> Hi Bryan, >> >> Yes, is it done deliberately. The dot is a problem - it has a quite >> specific meaning of "same as above" on other alignment file >> formats, while "-" is an almost universal shorthand for gap/insertion. >> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal >> conversion. >> >> Have you got a sample output file we can use as a unit test or >> at least discuss? As I recall, on the PFAM alignments I looked >> at there was no data loss by doing the dot to dash mapping. 
> > According to http://sonnhammer.sbc.su.se/Stockholm.html >>> Sequence letters may include any characters except >>> whitespace. Gaps may be indicated by "." or "-". > > So a Stockholm file using a mixture of "." and "-" would be > valid but a bit odd. Why would anyone do that? > > Peter Just curious, b/c this is a point of contention in BioPerl. How does BioPython internally set what symbols correspond to residues/gaps/frameshifts/other? BioPerl retains the original sequence but uses regexes for validation and methods that return symbol-related information (e.g. gap counts). (BTW, the contention here isn't that we use regexes, but that we set them globally). chris From biopython at maubp.freeserve.co.uk Fri Apr 9 13:21:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 14:21:03 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> References: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> Message-ID: On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields wrote: > > > Just curious, b/c this is a point of contention in BioPerl. ?How does BioPython > internally set what symbols correspond to residues/gaps/frameshifts/other? > BioPerl retains the original sequence but uses regexes for validation and > methods that return symbol-related information (e.g. gap counts). > > (BTW, the contention here isn't that we use regexes, but that we set them globally). > > chris Hi Chris, The short answer is gaps are by default "-", and stop codons are "*", but beyond that it would be down to user code to interpret odd symbols. Our sequences have an alphabet object which can specify the letters (as a set of expected characters), with explicit support for a single gap character (usually "-"), and for proteins a single stop codon symbol (usually "*"). This could in theory be extended to define other symbols too. The gap char does get treated specially in some of the alignment code (e.g. for calling a consensus), but I don't think we have anything built in regarding frameshifts. Peter From biopython at maubp.freeserve.co.uk Fri Apr 9 13:30:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 14:30:55 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Fri, Apr 9, 2010 at 2:09 PM, Ivan Rossi wrote: > > On Fri, 9 Apr 2010, Peter wrote: > >> So a Stockholm file using a mixture of "." and "-" would be >> valid but a bit odd. Why would anyone do that? > > IIRC the "." are used for "gaps" at the extremes of sequences in a MSA. When > you do local sequence alignments, like blast and most HMMs do, gaps at the > extremes of sequences do not pay the usual penalty for gap opening. So in > Stockholm format distinguishes between gaps for what you paid a price during > the alignment ("-") and gaps-for-free (".") which are there just to pad each > row to the MSA width. So internal gaps (true gaps), versus leading or trailing padding. That makes sense - and is certainly how PFAM does things according to their FAQ: Quoting from http://pfam.sanger.ac.uk/help#tabview=tab3 >>> What is the difference between the - and . characters in your full alignments ? >>> >>> The '-' and '.' characters both represent gap characters. However they >>> do tell you some extra information about how the HMM has generated >>> the alignment. The '-' symbols are where the alignment of the sequence >>> has used a delete state in the HMM to jump past a match state. 
This >>> means that the sequence is missing a column that the HMM was >>> expecting to be there. The '.' character is used to pad gaps where one >>> sequence in the alignment has sequence from the HMMs insert state. >>> See the alignment below where both characters are used. The HMM >>> states emitting each column are shown. Note that residues emitted >>> from the Insert (I) state are in lower case. I wonder why doesn't this get mentioned anywhere on the format definitions: http://sonnhammer.sbc.su.se/Stockholm.html http://en.wikipedia.org/wiki/Stockholm_format Peter From cjfields at illinois.edu Fri Apr 9 13:28:42 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 9 Apr 2010 08:28:42 -0500 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: <64ED30D9-CA83-42D5-846F-A8D7EA8261F9@illinois.edu> Message-ID: <9D6E3C31-B273-4B37-BFE8-8C951C025CBB@illinois.edu> On Apr 9, 2010, at 8:21 AM, Peter wrote: > On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields wrote: >> >> >> Just curious, b/c this is a point of contention in BioPerl. How does BioPython >> internally set what symbols correspond to residues/gaps/frameshifts/other? >> BioPerl retains the original sequence but uses regexes for validation and >> methods that return symbol-related information (e.g. gap counts). >> >> (BTW, the contention here isn't that we use regexes, but that we set them globally). >> >> chris > > Hi Chris, > > The short answer is gaps are by default "-", and stop codons are "*", but > beyond that it would be down to user code to interpret odd symbols. > > Our sequences have an alphabet object which can specify the letters (as > a set of expected characters), with explicit support for a single gap > character (usually "-"), and for proteins a single stop codon symbol (usually > "*"). This could in theory be extended to define other symbols too. The gap > char does get treated specially in some of the alignment code (e.g. for > calling a consensus), but I don't think we have anything built in regarding > frameshifts. > > Peter Within LocatableSeq we define the following: $GAP_SYMBOLS = '\-\.=~'; $FRAMESHIFT_SYMBOLS = '\\\/'; $OTHER_SYMBOLS = '\?'; $RESIDUE_SYMBOLS = '0-9A-Za-z\*'; Combined these can be used in a regex to validate sequence, or separately used for other purposes (counting gaps, frameshifts, etc.). The OTHER_SYMBOLS is rally a catch-all for anything residue-like (counted in the sequence). All of these can be redefined, but currently that's global, so it can have consequences in rare cases when mixing sequences from different formats. We may localize them to work around that (part of GSoC project for alignment reimplementation). We had a Symbol class at one point but I believe it was considered too 'heavy,' though this may be more a consequence of Perl's hammered-on OO. chris From reece at berkeley.edu Fri Apr 9 15:18:36 2010 From: reece at berkeley.edu (Reece Hart) Date: Fri, 09 Apr 2010 08:18:36 -0700 Subject: [Biopython] GenBank.Scanner use of os.linesep In-Reply-To: References: <4BBEB12C.8030907@berkeley.edu> Message-ID: <4BBF454C.4020502@berkeley.edu> Peter- > A plain rstrip() would also work and get rid of any trailing whitespace. > I've checked that in. > For something this trivial, your verbal patch is fine. Would you like > to be added to the NEWS and CONTRIB file? > Thanks for making this change so quickly. Please don't bother with the NEWS and CONTRIB file changes. 
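For the archives, here is a quick interactive session showing why the plain rstrip() covers both cases. This is just standard str.rstrip() behaviour (typed from memory rather than copied from a terminal), using a CRLF-terminated GenBank line as it would be read on a machine where os.linesep is "\n":

>>> import os
>>> line = "PRI 25-MAR-2010\r\n"
>>> line.rstrip(os.linesep)   # os.linesep == "\n" here, so the "\r" survives
'PRI 25-MAR-2010\r'
>>> line.rstrip("\n\r")       # strips both characters, whatever their order
'PRI 25-MAR-2010'
>>> line.rstrip()             # also removes any trailing spaces or tabs
'PRI 25-MAR-2010'
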
Cheers, Reece From davidpkilgore at gmail.com Fri Apr 9 15:44:12 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Fri, 9 Apr 2010 08:44:12 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore Message-ID: Hello I just wanted to introduce myself to the Biopython project/community, and my intentions for participating as a student in this year's Google's Summer of Code. I have posted a rough draft of my proposal to the GSOC applications site for mentors to see. It is not complete but I am currently working on it, so as to make final improvements before the deadline. I haven't had time (due to school/work) to fix any of the bugs in the bug tracking system that has been pointed to before, but please no that I am no stranger to source code, and that I will make a great addition to the Biopython community after the summer. Please leave me feedback either by shooting me an email or leaving a message in the GSOC applications site. Also, be sure to check out my website shown in the proposal for additional qualifications. Thank you. -- Kizzo From lunt at ctbp.ucsd.edu Fri Apr 9 15:55:31 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Fri, 9 Apr 2010 08:55:31 -0700 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: Hello Peter, The HMMER suit of tools, and the Pfam website use "-" to indicate that an HMM visited a deletion state, and "." to indicate that the HMM on a different sequence visited an insertion state, and this gap is just added to maintain alignment. >foo AA...BBB---CCC >bar AAbazBBBDDDCCC In this example, the sequence "foo" doesn't have the DDD section of the profile HMM, the second sequence has not only the full model, but also contains an insert, "baz" that is not part of the HMM, for example, an extra-long loop. I hope this helps... -Bryan On Fri, Apr 9, 2010 at 5:08 AM, Peter wrote: > On Thu, Apr 8, 2010 at 9:04 AM, Peter wrote: >> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt wrote: >>> Greetings All! >>> >>> It looks like line 364 of Bio.AlignIO.StockholmIO reads: >>> >>> seqs[id] += seq.replace(".","-") >>> >>> So when you load into memory alignments that mark gaps created to >>> allow alignment to inserts with ".", (such as PFam alignments or the >>> output of hmmer) that information is lost. >>> >>> I know there must be a good reason for this, but I am finding it a >>> problem on my end.. >>> >>> -Bryan Lunt >> >> Hi Bryan, >> >> Yes, is it done deliberately. The dot is a problem - it has a quite >> specific meaning of "same as above" on other alignment file >> formats, while "-" is an almost universal shorthand for gap/insertion. >> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal >> conversion. >> >> Have you got a sample output file we can use as a unit test or >> at least discuss? As I recall, on the PFAM alignments I looked >> at there was no data loss by doing the dot to dash mapping. > > According to http://sonnhammer.sbc.su.se/Stockholm.html >>> Sequence letters may include any characters except >>> whitespace. Gaps may be indicated by "." or "-". > > So a Stockholm file using a mixture of "." and "-" would be > valid but a bit odd. Why would anyone do that? > > Peter > From biopython at maubp.freeserve.co.uk Fri Apr 9 16:09:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Apr 2010 17:09:16 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? 
In-Reply-To: References: Message-ID: Hi Bryan, On Fri, Apr 9, 2010 at 4:55 PM, Bryan Lunt wrote: > > Hello Peter, > The HMMER suit of tools, and the Pfam website use "-" to indicate that > an HMM visited a deletion state, and "." to indicate that the HMM on a > different sequence visited an insertion state, and this gap is just > added to maintain alignment. > >>foo > AA...BBB---CCC >>bar > AAbazBBBDDDCCC > > In this example, the sequence "foo" doesn't have the DDD section of > the profile HMM, > the second sequence has not only the full model, but also contains an > insert, "baz" that is not part of the HMM, for example, an extra-long > loop. > > I hope this helps... > -Bryan Yes, it does. I think this HMMER/PFAM convention should be noted on the definition of the Stockholm format - that might have prevented this problem in Biopython since none of the examples I'd looked at when writing the parser had this behaviour. Note your example is more subtle than the different between internal gaps and leading or trailing padding described by Ivan earlier: http://lists.open-bio.org/pipermail/biopython/2010-April/006396.html Could you point out a suitable (small) example from PFAM we can use for a unit test, or email me an example (off list)? Now, as to how to deal with this: We could extend the Biopython Alphabet objects to explicitly support multiple types of gaps (the current setup only really copes with a single gap character). Using this information we could handle some special cases like Stockholm to PHYLIP would require merging either gap onto a dash. This doesn't sound that straight forward though. Or, we can avoid explicit declarations about the sequence (just ignore the Biopython Alphabet object capabilities and use one of the generic alphabets), and leave the problem in the hands of the end user. This is bound to cause some unpleasant surprises one day, but might be the best solution. Peter From chapmanb at 50mail.com Fri Apr 9 20:21:32 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 9 Apr 2010 16:21:32 -0400 Subject: [Biopython] affy CEL and CDF reader In-Reply-To: References: Message-ID: <20100409202132.GA20004@sobchak.mgh.harvard.edu> Vincent; Thanks for the work on the Affy Cel/CDF parsers. I don't know anything at all about the formats so can't help much with the technical questions, but wanted to help with a few more general points you raise. > > I ended up writing my own modules for reading both affy Cel and CDF files. This and the following discussion are a bit hard to follow. When I read through this thread I wasn't sure exactly what improvements you've made, how they affect back compatibility of the code, and how they help make the parser better going forward. A lot of this work is very specialized, so you are trying to catch the attention of the few people who know enough to help. If you can organize your code and e-mail in a way that makes it easy for them to comment and contribute, you'll increase the number of valuable responses you receive. It's an under appreciated skill, but very valuable for grabbing busy people's attention and getting feedback. > > Are improvements to the CelFile.py are of value to biopython? Absolutely. > Is there any need for improvements to the ability to read CEL files or CDF > files and if so what are they? I am interested in contributing. Yes. Make it faster, more complete, easier to use. There are general answers you can apply across the board. We definitely are looking for contributions and happy to have you interested. 
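As a concrete example of what helps reviewers: include a short snippet in your e-mail showing how the parser is meant to be called before and after your changes. Something along these lines - and since, as I said, I don't know this module, the function and attribute names below are only my guesses and the file name is made up, so please correct them to whatever CelFile.py really provides:

from Bio.Affy import CelFile

handle = open("example.CEL")      # made-up input file name
record = CelFile.read(handle)     # guessed parser entry point
handle.close()
print record.intensities          # guessed attribute holding the probe intensity array

Even a rough sketch like that makes it much easier to see what your new code changes.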
Brad From chapmanb at 50mail.com Fri Apr 9 20:39:12 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 9 Apr 2010 16:39:12 -0400 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: References: Message-ID: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Kizzo; > I just wanted to introduce myself to the Biopython project/community, > and my intentions for participating as a student in this year's > Google's Summer of Code. I have posted a rough draft of my proposal > to the GSOC applications site for mentors to see. Glad you are interested in this and thanks for getting together a proposal. I wish you would have dropped us a line a bit earlier as we would have been happy to help with getting the application together. > It is not complete > but I am currently working on it, so as to make final improvements > before the deadline. I haven't had time (due to school/work) to fix > any of the bugs in the bug tracking system that has been pointed to > before, but please no that I am no stranger to source code, and that I > will make a great addition to the Biopython community after the > summer. Great. I noticed that you worked on GSoC with OpenCog last year. Is this the most recent code base from that work? https://code.launchpad.net/~kizzobot/opencog/python-bindings Have you still been involved with that community after the work? Did they decide not to do GSoC this year? Thanks again, Brad From davidpkilgore at gmail.com Fri Apr 9 20:52:57 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Fri, 9 Apr 2010 13:52:57 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: <20100409203912.GB20004@sobchak.mgh.harvard.edu> References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Message-ID: On Fri, Apr 9, 2010 at 1:39 PM, Brad Chapman wrote: > Kizzo; > >> I just wanted to introduce myself to the Biopython project/community, >> and my intentions for participating as a student in this year's >> Google's Summer of Code. ?I have posted a rough draft of my proposal >> to the GSOC applications site for mentors to see. > > Glad you are interested in this and thanks for getting together a > proposal. I wish you would have dropped us a line a bit earlier as > we would have been happy to help with getting the application > together. > >> It is not complete >> but I am currently working on it, so as to make final improvements >> before the deadline. ?I haven't had time (due to school/work) to fix >> any of the bugs in the bug tracking system that has been pointed to >> before, but please no that I am no stranger to source code, and that I >> will make a great addition to the Biopython community after the >> summer. > > Great. I noticed that you worked on GSoC with OpenCog last year. Is > this the most recent code base from that work? > > https://code.launchpad.net/~kizzobot/opencog/python-bindings > The core developers merged my bindings in with the main branch a long time ago, and yes that's the most recent codebase from that work. > Have you still been involved with that community after the work? Did > they decide not to do GSoC this year? > Oh yes, I'm still a regular on their IRC channel and mailing lists. OpenCog is closer to my passion, and I already had 2 proposals for OpenCog this summer ready, but unfortunately the project didn't get accepted for GSoC this year. I plan to work more with OpenCog as a potential PhD project, so am still am involved with OpenCog. 
> Thanks again, > Brad > -- Kizzo From vincent at vincentdavis.net Sat Apr 10 05:43:06 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Fri, 9 Apr 2010 23:43:06 -0600 Subject: [Biopython] Bio.Application now subprocess? Message-ID: I was considering writing a module for using the command line Affymetrix Power Tools Software LINK Mostly to convert between CEL file types but there are lots of other features If I read correctly will be replaced using subprocess. Are there any modules currently using subprcess rather than Bio.Application? Anything I should know but don't (as if you know what I know) or consider *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From biopython at maubp.freeserve.co.uk Sat Apr 10 10:28:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 11:28:19 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis wrote: > I was considering writing a module for using the command line Affymetrix > Power Tools Software > LINK > Mostly > to convert between CEL file types but there are lots of other features > If > I read correctly will be replaced using subprocess. Are there any modules > currently using subprcess rather than Bio.Application? > Anything I should know but don't (as if you know what I know) or consider Hi Vincent, The idea is to use a Bio.Application based wrapper to build a command line string, and invoke that with the subprocess module (i.e. use BOTH). The tutorial has several examples of this (e.g. alignment tools and BLAST). What have you been reading that makes you think Bio.Application is being replaced with subprocess? We should probably clarify it. Peter From vincent at vincentdavis.net Sat Apr 10 13:12:34 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 10 Apr 2010 07:12:34 -0600 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: Let me say it was late at night when I started reading thorough this and I am very new to it so.... The first function defines in Bio/Applications.py def generic_run(commandline): """Run an application with the given commandline (DEPRECATED)......We now recommend you invoke subprocess directly, using str(commandline).............""" The second class ApplicationResult: """"""Make results of a program available through a standard interface (DEPRECATED).................""" I think these should be moved to the bottom if possible, maybe below a comment section that indicates the items below are or are going to be deprecated. The last line in class AbstractCommandline(object): """....................... You would typically run the command line via a standard Python operating system call (e.g. using the subprocess module).""" I started to read through this example but thought I would read more about the subprocess module. At this point it is not clear to me what Bio/Applications is doing for me. subprocess seems simple. But I have a lot to learn, and I assume that if I start by getting basic functionality with subprocess then it will make more sense. One of the parts that is not clear to me is, for example, in Emboss class WaterCommandline(_EmbossCommandLine): .......... self.parameters = \ [_Option(["-asequence","asequence"], ["input", "file"], None, 1, "First sequence to align") Not really sure where the parts of the _Option line are documented, I assume in the ...for p in parameters:...... Just not clear, I guess I need to study it more.
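If I am following the tutorial correctly, the intended pattern seems to be to build the command line string with the wrapper and then hand it to subprocess yourself, something like this (untested sketch - the FASTA file names are just made up, and I am not sure I have the printed string exactly right):

from Bio.Emboss.Applications import WaterCommandline
import subprocess

cline = WaterCommandline(asequence="alpha.fasta", bsequence="beta.fasta",
                         gapopen=10, gapextend=0.5, outfile="water.txt")
print str(cline)
# e.g. water -asequence=alpha.fasta -bsequence=beta.fasta -gapopen=10
#      -gapextend=0.5 -outfile=water.txt (the argument order may differ)

child = subprocess.Popen(str(cline), shell=True,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
stdout, stderr = child.communicate()
print child.returncode   # 0 if water ran OK

Is that the right idea?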
*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Apr 10, 2010 at 4:28 AM, Peter wrote: > On Sat, Apr 10, 2010 at 6:43 AM, Vincent Davis > wrote: > > I was considering writing a module for using the command line Affymetrix > > Power Tools Software > > LINK< > http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx > > > > Mostly > > to convert between CEL file types but there are lots of other features > > < > http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx > >If > > I read correctly will be replaced using subprocess. Are there any modules > > currently using subprcess rather than Bio.Application? > > Anything I should know but don't (as if you know what I know) or consider > > Hi Vincent, > > The idea is to use a Bio.Application based wrapper to build a command > line string, and invoke that with the subprocess module (i.e. use BOTH). > The tutorial has several examples of this (e.g. alignment tools and BLAST). > > What have you been reading that makes you think Bio.Application is > being replaced with subprocess? We should probably clarify it. > > Peter > From biopython at maubp.freeserve.co.uk Sat Apr 10 13:58:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 14:58:28 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 2:12 PM, Vincent Davis wrote: > Let me say it was late at night when I started reading thorough this and I > am very new to it so.... > The first function defines in Bio/Applications.py > def generic_run(commandline): OK, so you are looking at the API docs and/or the code. Bits of Bio/Applications.py are deprecated, and I think you are right - we can try and make the status clearer. Peter From rodrigo_faccioli at uol.com.br Sat Apr 10 17:23:19 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Sat, 10 Apr 2010 14:23:19 -0300 Subject: [Biopython] Bio.Application now subprocess? Message-ID: I've developed a class for this proposed. It might help you. Please, see the link below. http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From vincent at vincentdavis.net Sat Apr 10 17:30:05 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 10 Apr 2010 11:30:05 -0600 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: > > On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli < > rodrigo_faccioli at uol.com.br> wrote: > >> I've developed a class for this proposed. It might help you. Please, see >> the >> link below. > > > http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py >> >> > > Thanks, This might be a good place for me to start. Nit sure how this is different than Bio/Applications.py other than it is much simpler from a quick look. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Apr 10, 2010 at 11:23 AM, Rodrigo Faccioli < rodrigo_faccioli at uol.com.br> wrote: > I've developed a class for this proposed. It might help you. Please, see > the > link below. 
> > > http://github.com/rodrigofaccioli/PythonStudies/blob/master/src/FcfrpExecuteProgram.py > > Thanks, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structure Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Sat Apr 10 19:02:08 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 20:02:08 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: References: Message-ID: On Sat, Apr 10, 2010 at 2:58 PM, Peter wrote: > > OK, so you are looking at the API docs and/or the code. > Bits of Bio/Applications.py are deprecated, and I think > you are right - we can try and make the status clearer. > Hi Vincent, I updated that a bit, hopefully it is clearer that a typical user doesn't need to look at Bio.Applications at all. Rather you might use the alignment tool wrappers in Bio.Align.Applications, or the EMBOSS wrappers in Bio.Emboss.Applications (etc) which internally use the classes defined in Bio.Applications. The *only* reason you'd use Bio.Applications directly now is to write a new command line tool wrapper. [Historically you might have used the old generic_run function in Bio.Applications, but that is deprecated now] Peter From biopython at maubp.freeserve.co.uk Sat Apr 10 20:33:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 10 Apr 2010 21:33:57 +0100 Subject: [Biopython] Bio.Application now subprocess? In-Reply-To: <1101855478758905131@unknownmsgid> References: <1101855478758905131@unknownmsgid> Message-ID: On Sat, Apr 10, 2010 at 8:27 PM, Vincent Davis wrote: > > So that was/is my plan to use it to writes command lone tools for the > affymetrix apt dev commandline app. unless this is redundant in a way > I am not aware of. > Thanks Ah - right, now this makes sense. Are you on the dev mailing list (CC'd)? That would be a better place to ask. I'd start by looking at Bio.Align.Applications (less subclasses there) as a model. Peter From chapmanb at 50mail.com Mon Apr 12 12:37:31 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 12 Apr 2010 08:37:31 -0400 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> Message-ID: <20100412123731.GJ20004@sobchak.mgh.harvard.edu> Kizzo; > > Have you still been involved with that community after the work? Did > > they decide not to do GSoC this year? > > Oh yes, I'm still a regular on their IRC channel and mailing lists. > OpenCog is closer to my passion, and I already had 2 proposals for > OpenCog this summer ready, but unfortunately the project didn't get > accepted for GSoC this year. I plan to work more with OpenCog as a > potential PhD project, so am still am involved with OpenCog. That's great to hear. One of the most important parts of GSoC for myself and many mentors is the chance to get additional folks involved in open source. Reviews of the applications have started, and the main aspect which would improve your proposal is to develop a specific project plan with detailed descriptions of week to week goals. 
For each week you should have: - Description of the specific weekly goal. - Details on the PyCogent and Biopython code you expect to be working with - Possible issues or areas of expansion you expect might impact the timeline - Expected work on documentation and testing. You want to have this integrated throughout the proposal. See the examples in the NESCent application documentation to get an idea of the level of detail in accepted projects from previous years: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html The content we'd like to see in the proposal is interconversion of core object (Sequence, Alignment, Phylogeny) in the first half of the summer, and applications of this interconversion to developing biological workflows in the second half of the summer. Feel free to be creative and pick work that is of interest to your studies. Since you can't edit the proposal currently, please prepare this in a publicly accessible Google Doc and provide a link from the public comments so other mentors can view it. Thanks, Brad From biopython at maubp.freeserve.co.uk Mon Apr 12 13:35:44 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Apr 2010 14:35:44 +0100 Subject: [Biopython] StockholmIO replaces "." with "-", why? In-Reply-To: References: Message-ID: On Fri, Apr 9, 2010 at 7:50 PM, Bryan Lunt wrote: > Hello Peter, > > Thanks for your help recently on this! > I have here two files that I like to use as examples, because they are > fairly small, (203 sequences) > > The Pfam page summarizing this family is : > http://pfam.sanger.ac.uk/family/PF07750 > > Cheers! > -Bryan Lunt I see what you mean - using that webpage to get the full alignment (in any of the supported file formats) using the mixed gap option (dot or dash) does show both symbols in a meaningful way. Peter From tiagoantao at gmail.com Mon Apr 12 23:39:29 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 13 Apr 2010 00:39:29 +0100 Subject: [Biopython] ASN.1 and Entrez SNP Message-ID: Hi, Just a simple question: Entrez SNP seems to return ASN.1 format only. Is there any way to parse this in biopython? I've looked at SeqIO and found nothing... I can think of tools to process this outside, but I am just curious if this is processed natively with Biopython (being an exposed NCBI format...) Many thanks, Tiago PS - You can easily try this with: hdl = Entrez.efetch(db="snp", id="3739022") print hdl.read() -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Tue Apr 13 08:22:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Apr 2010 09:22:42 +0100 Subject: [Biopython] ASN.1 and Entrez SNP In-Reply-To: References: Message-ID: 2010/4/13 Tiago Ant?o : > Hi, > > Just a simple question: > Entrez SNP seems to return ASN.1 format only. > Is there any way to parse this in biopython? I've looked at SeqIO and > found nothing... > I can think of tools to process this outside, but I am just curious if > this is processed natively with Biopython (being an exposed NCBI > format...) 
> > Many thanks, > Tiago > PS - You can easily try this with: > hdl = Entrez.efetch(db="snp", id="3739022") > print hdl.read() Hi Tiago, No, we don't support ASN.1, and I don't see any good reason to - I think it would only be NCBI ASN.1 we'd we interested in, and I think that all their resources are available in other easier to use formats like XML these days. See also http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One Instead ask Entrez to give you the SNP data as XML: Entrez.efetch(db="snp", id="3739022", retmode="xml") Hopefully the SNP XML file has everything in it. You have a choice of Python XML parsers to use. However, the Bio.Entrez parser doesn't like this XML. This appears to be related (or caused by) a known NCBI bug. See http://bugzilla.open-bio.org/show_bug.cgi?id=2771 Peter From bala.biophysics at gmail.com Tue Apr 13 14:49:03 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 13 Apr 2010 16:49:03 +0200 Subject: [Biopython] removing redundant sequence Message-ID: Friends, Sorry if this question was asked before. Is there any function in Biopython that can remove redundant sequence records from a fasta file. Thanks, Bala From biopython at maubp.freeserve.co.uk Tue Apr 13 15:02:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Apr 2010 16:02:52 +0100 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian wrote: > Friends, > Sorry if this question was asked before. Is there any function in Biopython > that can remove redundant sequence records from a fasta file. > > Thanks, > Bala No, but you should be able to do this with Biopython - depending on what exactly you are asking for. When you say "redundant" do you mean 100% perfect identify? How big is your FASTA file - are you working with next-gen sequencing data and millions of reads?. If it is small enough you can keep all the data in memory to compare sequences to each other. Otherwise you might try using a checksum (e.g. SEGUID) to spot duplicates. Peter From schafer at rostlab.org Tue Apr 13 15:08:31 2010 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Tue, 13 Apr 2010 17:08:31 +0200 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: <4BC488EF.3000505@rostlab.org> Hey, I think not. But you can use an external tool like cd-hit or uniqueprot and implement a wrapper function for that in your code. Chris On 04/13/2010 04:49 PM, Bala subramanian wrote: > Friends, > Sorry if this question was asked before. Is there any function in Biopython > that can remove redundant sequence records from a fasta file. > > Thanks, > Bala > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Apr 15 15:03:02 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 15 Apr 2010 16:03:02 +0100 Subject: [Biopython] Draft abstract for BOSC 2010 Biopython Project Update Message-ID: Hi all, I should have circulated this earlier, but here is a draft abstract for a "Biopython Project Update" talk at BOSC 2010, to be submitted *today*. http://www.open-bio.org/wiki/BOSC_2010 I'm hoping to attend BOSC again this year and give the talk, but haven't sorted out the finances - Brad has offered to present if I can't go, hence the talk author list. 
If anyone else wants to help with slides etc (or as a standby speaker) please let me know. This is based on the abstract from last year, included in this PDF: http://www.open-bio.org/w/images/c/c7/BOSC2009_program_20090601.pdf In the PDF version of the abstract I've made the logo smaller this time ;) Comments welcome, Thanks, Peter -- Biopython Project Update Peter Cock, Brad Chapman In this talk we present the current status of the Biopython project (www.biopython.org), described in a application note published last year (Cock et al., 2009). Biopython celebrated its 10th Birthday last year, and has now been cited or referred to in over 150 scientific publications (a list is included on our website). At the end of 2009, following an extended evaluation period, Biopython successfully migrated from using CVS for source code control to using git, hosted on github.com. This has helped our existing developers to work and test new features on publicly viewable branches before being merged, and has also encouraged new contributors to work on additions or improvements. Currently about fifty people have their own Biopython repository on GitHub. In summer 2009 we had two Google Summer of Code (GSoC) project students working on phylogenetic code for Biopython in conjunction with the National Evolutionary Synthesis Center (NESCent). Eric Talevich?s work on phylogenetic trees including phyloXML support (Han and Zamesk, 2009) was merged and included with Biopython 1.54, and he continues to be actively involved with Biopython. We hope to include Nick Matzke?s module for biogeographical data from the Global Biodiversity Information Facility (GBIF) later this year. For summer 2010 we have Biopython related GSoC projects submitted via both NESCent and the Open Bioinformatics Foundation (OBF), and hope to have students working on Biopython once again. Since BOSC 2009, Biopython has seen four releases. Biopython 1.51 (August 2009) was an important milestone in dropping support for Python 2.3 and our legacy parsing infra-structure (Martel/Mindy), but was most noteworthy for FASTQ support (Cock et al., 2010). Biopython 1.52 (September 2009) introduced indexing of most sequence file formats for random access, and made interconverting sequence and alignment files easier. Biopython 1.53 (December 2009) included wrappers for the new NCBI BLAST+ command line tools, and much improved support for running under Jython. Our latest release is Biopython 1.54 (April/May 2010), new features include Bio.Phylo for phylogenetic trees (GSoC project), and support for Standard Flowgram Format (SFF) files used for 454 Life Sciences (Roche) sequencing. Biopython is free open source software available from www.biopython.org under the Biopython License Agreement (an MIT style license, http://www.biopython.org/DIST/LICENSE). References Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M.J. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. doi:10.1093/bioinformatics/btp163 Han, M.V. and Zmasek, C.M. (2009) phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 10:356. doi:10.1186/1471-2105-10-356 Cock, P.J.A., Fields, C.J., Goto N., Heuer, M.L., and Rice, P.M. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6) 1767-71. 
doi:10.1093/nar/gkp1137 From mok at bioxray.dk Thu Apr 15 15:15:01 2010 From: mok at bioxray.dk (Morten Kjeldgaard) Date: Thu, 15 Apr 2010 17:15:01 +0200 Subject: [Biopython] Entrez.efetch bug? Message-ID: <4BC72D75.1040505@bioxray.dk> Hi, I am getting an error with Entrez.efetch() with Biopython version 1.51. This is my handle: handle = Entrez.efetch(db='protein', id='114391',rettype='gp') When I subsequently do this: record = Entrez.read(handle) I get a syntax error from Expat: ExpatError: syntax error: line 1, column 0 However, if I do the following, it works: record = handle.read() but then I need to parse the resulting record using the Genbank parser, which is a nuisance since I normally should get this for free from the Entrez module. Comments, anyone? -- Morten From biopython at maubp.freeserve.co.uk Thu Apr 15 15:31:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Apr 2010 16:31:28 +0100 Subject: [Biopython] Entrez.efetch bug? In-Reply-To: <4BC72D75.1040505@bioxray.dk> References: <4BC72D75.1040505@bioxray.dk> Message-ID: On Thu, Apr 15, 2010 at 4:15 PM, Morten Kjeldgaard wrote: > Hi, > > I am getting an error with Entrez.efetch() with Biopython version 1.51. This > is my handle: > > handle = Entrez.efetch(db='protein', id='114391',rettype='gp') > In the above, you've asked Entrez to give you a plain text GenPept file (a protein GenBank file). > When I subsequently do this: > > ?record = Entrez.read(handle) > > I get a syntax error from Expat: > > ExpatError: syntax error: line 1, column 0 > The Bio.Entrez.read() and Bio.Entrez.parse() functions expect XML. > However, if I do the following, it works: > > record = handle.read() Well, yes, you get a big string stored as the variable record. > but then I need to parse the resulting record using the Genbank parser, > which is a nuisance since I normally should get this for free from the > Entrez module. > > Comments, anyone? Try this: from Bio import Entrez from Bio import SeqIO handle = Entrez.efetch(db='protein', id='114391',rettype='gp') record = SeqIO.read(handle, 'genbank') Peter From mok at bioxray.dk Thu Apr 15 21:28:24 2010 From: mok at bioxray.dk (Morten Kjeldgaard) Date: Thu, 15 Apr 2010 23:28:24 +0200 Subject: [Biopython] Entrez.efetch bug? In-Reply-To: References: <4BC72D75.1040505@bioxray.dk> Message-ID: <26E933F7-D7D2-48EC-82B4-4B654403F177@bioxray.dk> On 15/04/2010, at 17.31, Peter wrote: > record = SeqIO.read(handle, 'genbank') d'Oh!! :-) Thanks, just the hint I needed. Cheers, Morten From davidpkilgore at gmail.com Mon Apr 19 06:54:55 2010 From: davidpkilgore at gmail.com (Kizzo Kilgore) Date: Sun, 18 Apr 2010 23:54:55 -0700 Subject: [Biopython] Google's Summer of Code 2010 - David Kilgore In-Reply-To: <20100412123731.GJ20004@sobchak.mgh.harvard.edu> References: <20100409203912.GB20004@sobchak.mgh.harvard.edu> <20100412123731.GJ20004@sobchak.mgh.harvard.edu> Message-ID: I have taken the time to carefully look over the links and examples you suggested, and came up with my own draft week by week plan for the summer. It is not perfect, or even complete, as I am in the closing weeks of school and things are getting really busy, but I managed to pull this together. You can visit the following public Google Docs link to get the Gnumeric spreadsheet of my timeline. If you would like me to, I will also convert it to some other format if you like (and if I can), or I can attach a copy of the file itself (or post it on my website) if for some reason the link does not work. Thank you. 
https://docs.google.com/leaf?id=0B4KRpw_6YxAjMzU3NDgxMWYtZGIxZi00YmY3LTk5MGQtNDlmMjYyYTRhN2M0&hl=en On Mon, Apr 12, 2010 at 5:37 AM, Brad Chapman wrote: > Kizzo; > >> > Have you still been involved with that community after the work? Did >> > they decide not to do GSoC this year? >> >> Oh yes, I'm still a regular on their IRC channel and mailing lists. >> OpenCog is closer to my passion, and I already had 2 proposals for >> OpenCog this summer ready, but unfortunately the project didn't get >> accepted for GSoC this year. ?I plan to work more with OpenCog as a >> potential PhD project, so am still am involved with OpenCog. > > That's great to hear. One of the most important parts of GSoC for > myself and many mentors is the chance to get additional folks > involved in open source. > > Reviews of the applications have started, and the main aspect which > would improve your proposal is to develop a specific project plan > with detailed descriptions of week to week goals. For each week you > should have: > > - Description of the specific weekly goal. > - Details on the PyCogent and Biopython code you expect to be working with > - Possible issues or areas of expansion you expect might impact the > ?timeline > - Expected work on documentation and testing. You want to have this > ?integrated throughout the proposal. > > See the examples in the NESCent application documentation to get an > idea of the level of detail in accepted projects from previous years: > > https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#When_you_apply > http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&single=true&gid=0&output=html > > The content we'd like to see in the proposal is interconversion of > core object (Sequence, Alignment, Phylogeny) in the first half of > the summer, and applications of this interconversion to developing > biological workflows in the second half of the summer. Feel free to > be creative and pick work that is of interest to your studies. > > Since you can't edit the proposal currently, please prepare this in > a publicly accessible Google Doc and provide a link from the public > comments so other mentors can view it. > > Thanks, > Brad > -- Kizzo From mjldehoon at yahoo.com Mon Apr 19 07:08:04 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 19 Apr 2010 00:08:04 -0700 (PDT) Subject: [Biopython] Fw: Entrez.efetch In-Reply-To: <910794.43889.qm@web56207.mail.re3.yahoo.com> Message-ID: <870000.56671.qm@web62402.mail.re1.yahoo.com> > I sent the mail to the biopython at biopython.org > but it was not delivered. It will be delivered if you subscribe to the mailing list. --- On Mon, 4/19/10, olumide olufuwa wrote: > From: olumide olufuwa > Subject: Fw: [Biopython]Entrez.efetch > To: biopython-owner at lists.open-bio.org > Cc: "Biopython mailing list" > Date: Monday, April 19, 2010, 2:50 AM > > > Hello Michel, > I sent the mail to the biopython at biopython.org > but it was not delivered. I have edited the message. > > > The code that > accepts UNIPROT ID, retrieves the record using > Entrez.efetch and then it > parsed to obtain the Pubmed ID which i use to search > Medline for the > Title, Abstract and other information about the entry. 
> The code: > > query_id=str(raw_input("please > > > enter your UNIPROT_ID: ")) #Request UNIPROT ID from user > Entrez.email="ludax5 at yahoo.com" > prothandle=Entrez.efetch(db="protein", > > > id=query_id, rettype="gb" #queries Protein DB with the > given ID > #The > program returns an error here if a wrong ID is given. > Details of the > error is given below > seq_record=SeqIO.read(prothandle, "gb") > for > > record in seq_record.annotations['references']: # To > obtain Pubmed id > from the seqrecord > ?? key_word=record.pubmed_id > ?? if key_word: > ???? > handle=Entrez.efetch(db="pubmed", > > id=key_word, rettype="medline") > ???? > medRecords=Medline.parse(handle) > ???? for rec in medRecords: #prints > title and Abstract > ???????? if rec.has_key('AB') and > rec.has_key('TI'): > ?????????? print "TITLE: ",rec['TI'] > ?????????? > print "ABSTRACT: ",rec['AB'] > ?????????? print ' ' > > > THE > PROBLEM: The program gives an error if a wrong ID is > entered or an ID > other than UNIPROT ID e.g PDB ID, GSS ID etc. > > > > An Example Run: > > > please enter your UNIPROT_ID: > 1wio #A PDB ID is given instead > > > Traceback (most recent call last): > ? File "file.py", line 11, in > > ??? seq_record=SeqIO.read(prothandle, "gb") > ? File > "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", > line 522, in > read > ??? raise ValueError("No records found in handle") > ValueError: > > No records found in handle > > I want to avoid this error, thus i > want the program to print "INCORRECT ID GIVEN"? when a > wrong or an > incorrect ID is given. > > > Thanks a lot. > lummy > > > > From olumideolufuwa at yahoo.com Mon Apr 19 07:30:24 2010 From: olumideolufuwa at yahoo.com (Olumide Olufuwa) Date: Mon, 19 Apr 2010 00:30:24 -0700 (PDT) Subject: [Biopython] Entrez.efetch In-Reply-To: Message-ID: <221701.32474.qm@web45106.mail.sp1.yahoo.com> Hello there, ? I wrote a program, I am not awesome in biopython but this is what it does: The program code that accepts user defined UNIPROT ID, retrieves the record using Entrez.efetch and then it is parsed to obtain the Pubmed ID which i use to search Medline for Title, Abstract and other information about the entry. The code is simply: query_id=str(raw_input("please enter your UNIPROT_ID: ")) #Request UNIPROT ID from user Entrez.email="ludax5 at yahoo.com" prothandle=Entrez.efetch(db="protein", id=query_id, rettype="gb" #queries Protein DB with the given ID #The program returns an error here if a wrong ID is given. Details of the error is given below seq_record=SeqIO.read(prothandle, "gb") for record in seq_record.annotations['references']: # To obtain Pubmed id from the seqrecord ?? key_word=record.pubmed_id ?? if key_word: ???? handle=Entrez.efetch(db="pubmed", id=key_word, rettype="medline") ???? medRecords=Medline.parse(handle) ???? for rec in medRecords: #prints title and Abstract ???????? if rec.has_key('AB') and rec.has_key('TI'): ?????????? print "TITLE: ",rec['TI'] ?????????? print "ABSTRACT: ",rec['AB'] ?????????? print ' ' THE PROBLEM: The program gives an error if a wrong ID is entered or an ID other than UNIPROT ID e.g PDB ID, GSS ID etc. An Example Run with a wrong ID is shown below: please enter your UNIPROT_ID: 1wio #A PDB ID is given instead Traceback (most recent call last): ? File "file.py", line 11, in ??? seq_record=SeqIO.read(prothandle, "gb") ? File "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 522, in read ??? 
raise ValueError("No records found in handle") ValueError: No records found in handle I want to avoid this error, thus i want the program to print "INCORRECT ID GIVEN"? when a wrong or an incorrect ID is given. Thanks a lot. lummy From mjldehoon at yahoo.com Mon Apr 19 07:45:59 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 19 Apr 2010 00:45:59 -0700 (PDT) Subject: [Biopython] Entrez.efetch In-Reply-To: <221701.32474.qm@web45106.mail.sp1.yahoo.com> Message-ID: <902706.80063.qm@web62402.mail.re1.yahoo.com> Put a try:/except: block around the call to SeqIO.read, as in: try: seq_record=SeqIO.read(prothandle, "gb") except ValueError: print "INCORRECT ID GIVEN" --Michiel --- On Mon, 4/19/10, Olumide Olufuwa wrote: > From: Olumide Olufuwa > Subject: [Biopython] Entrez.efetch > To: biopython at lists.open-bio.org > Date: Monday, April 19, 2010, 3:30 AM > > Hello there, > ? > I wrote a program, I am not awesome in biopython but this > is what it does: The program code that > accepts user defined UNIPROT ID, retrieves the record using > Entrez.efetch and then it > is parsed to obtain the Pubmed ID which i use to search > Medline for Title, Abstract and other information about the > entry. > The code is simply: > > query_id=str(raw_input("please > > > > enter your UNIPROT_ID: ")) #Request UNIPROT ID from user > Entrez.email="ludax5 at yahoo.com" > prothandle=Entrez.efetch(db="protein", > > > > id=query_id, rettype="gb" #queries Protein DB with the > given ID > #The > program returns an error here if a wrong ID is given. > Details of the > error is given below > seq_record=SeqIO.read(prothandle, "gb") > for > > record in seq_record.annotations['references']: # To > obtain Pubmed id > from the seqrecord > ?? key_word=record.pubmed_id > ?? if key_word: > ???? > > handle=Entrez.efetch(db="pubmed", > > id=key_word, rettype="medline") > ???? > medRecords=Medline.parse(handle) > ???? for rec in medRecords: #prints > title and Abstract > ???????? if rec.has_key('AB') and > rec.has_key('TI'): > ?????????? print "TITLE: ",rec['TI'] > ?????????? > print "ABSTRACT: ",rec['AB'] > ?????????? print ' ' > > > THE > PROBLEM: The program gives an error if a wrong ID is > entered or an ID > other than UNIPROT ID e.g PDB ID, GSS ID etc. > > > > An Example Run with a wrong ID is shown below: > > > please enter your UNIPROT_ID: > 1wio #A PDB ID is given instead > > > Traceback (most recent call last): > ? File "file.py", line 11, in > > ??? seq_record=SeqIO.read(prothandle, "gb") > ? File > "/usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", > line 522, in > read > ??? raise ValueError("No records found in handle") > ValueError: > > > No records found in handle > > I want to avoid this error, thus i > want the program to print "INCORRECT ID GIVEN"? when a > wrong or an > incorrect ID is given. > > > Thanks a lot. > lummy > > > > > ? ? ? > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fkauff at biologie.uni-kl.de Tue Apr 20 14:27:30 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 20 Apr 2010 16:27:30 +0200 Subject: [Biopython] Code for protein alpha helix prediction Message-ID: <4BCDB9D2.4050207@biologie.uni-kl.de> Hi all, I've recently been asked to help with screening protein sequences for certain features, something I don't really know much about... Yet! 
My questions: Is there some code in Biopython that allows for a quick check whether an amino acid sequece is likely to be a alpha helix? Couldn't find any. Or is there an algorithm that could be straightforwardly implemented in python, or a commandline tool that could be called from within a python script? Thanks in advance, Frank From rodrigo_faccioli at uol.com.br Tue Apr 20 15:34:47 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 20 Apr 2010 12:34:47 -0300 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de> References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: Hi Frank, I'm not sure if I understood your question. I'm computer scientist and I'm researching globular protein structure prediction. In fact, I've studied the application of Evolutionary Algorithms for it. Therefore, our goals are different. if I understood your question, you have a Fasta file of your protein. So, you need to communicate with databases such as NCBI, scop and CATH. In this way, I recommend you use Entrez BioPython module. Other suggestion is the use of BioPython Blast module. Sorry if my answer is not what you is looking for. Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Tue, Apr 20, 2010 at 11:27 AM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? > > Thanks in advance, > Frank > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Tue Apr 20 15:43:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Apr 2010 16:43:02 +0100 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: <4BCDB9D2.4050207@biologie.uni-kl.de> References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? Hi Frank, There are lots of tools for predicting secondary structure (alpha helices, beta sheets etc) both de novo, and guided by reference sequences with known structures. Some of these are online web services. 
I'm pretty sure there is nothing for this built into Biopython, so for scripting this for a large number of sequences then (as you have also suggested), my first approach would be to look for command line tools which you could call from Python. I've never needed to do this myself, and have no specific recommendations regarding which tools to try first. If you do find some useful algorithms which could easily be implemented in Python, they could be worth including - maybe under Bio.SeqUtils? Peter From darnells at dnastar.com Tue Apr 20 18:16:22 2010 From: darnells at dnastar.com (Steve Darnell) Date: Tue, 20 Apr 2010 13:16:22 -0500 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: Frank, One of the most accurate (and popular) algorithms is PSIPRED. A stand-alone command line version is available: http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/ If memory serves, it requires a local installation of blast and the nr database. A position weight matrix generated from PSI-BLAST acts as input to a neural network, which makes the secondary structure predictions. The Rosetta Design group had a poll last year of people's favorite tools. There are plenty of others to try if PSIPRED doesn't meet your needs. http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi ction-algorithm/ I am not a PSIPRED developer, just a satisfied user. Regards, Steve -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Tuesday, April 20, 2010 10:43 AM To: Frank Kauff Cc: BioPython Mailing List Subject: Re: [Biopython] Code for protein alpha helix prediction On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > Hi all, > > I've recently been asked to help with screening protein sequences for > certain features, something I don't really know much about... Yet! > > My questions: Is there some code in Biopython that allows for a quick check > whether an amino acid sequece is likely to be a alpha helix? Couldn't find > any. Or is there an algorithm that could be straightforwardly implemented in > python, or a commandline tool that could be called from within a python > script? Hi Frank, There are lots of tools for predicting secondary structure (alpha helices, beta sheets etc) both de novo, and guided by reference sequences with known structures. Some of these are online web services. I'm pretty sure there is nothing for this built into Biopython, so for scripting this for a large number of sequences then (as you have also suggested), my first approach would be to look for command line tools which you could call from Python. I've never needed to do this myself, and have no specific recommendations regarding which tools to try first. If you do find some useful algorithms which could easily be implemented in Python, they could be worth including - maybe under Bio.SeqUtils? Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From fkauff at biologie.uni-kl.de Wed Apr 21 11:50:30 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 21 Apr 2010 13:50:30 +0200 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: <4BCEE686.3080803@biologie.uni-kl.de> Thanks everybody! 
Now I have plenty of tools to look at - the standalone version of psipred certainly fulfills the easy-to-use and quick-to-try-out requirements. Frank On 04/20/2010 08:16 PM, Steve Darnell wrote: > Frank, > > One of the most accurate (and popular) algorithms is PSIPRED. A > stand-alone command line version is available: > http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/ > > If memory serves, it requires a local installation of blast and the nr > database. A position weight matrix generated from PSI-BLAST acts as > input to a neural network, which makes the secondary structure > predictions. > > The Rosetta Design group had a poll last year of people's favorite > tools. There are plenty of others to try if PSIPRED doesn't meet your > needs. > > http://rosettadesigngroup.com/blog/456/fairest-secondary-structure-predi > ction-algorithm/ > > I am not a PSIPRED developer, just a satisfied user. > > Regards, > Steve > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org > [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Tuesday, April 20, 2010 10:43 AM > To: Frank Kauff > Cc: BioPython Mailing List > Subject: Re: [Biopython] Code for protein alpha helix prediction > > On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff > wrote: > >> Hi all, >> >> I've recently been asked to help with screening protein sequences for >> certain features, something I don't really know much about... Yet! >> >> My questions: Is there some code in Biopython that allows for a quick >> > check > >> whether an amino acid sequece is likely to be a alpha helix? Couldn't >> > find > >> any. Or is there an algorithm that could be straightforwardly >> > implemented in > >> python, or a commandline tool that could be called from within a >> > python > >> script? >> > Hi Frank, > > There are lots of tools for predicting secondary structure (alpha > helices, > beta sheets etc) both de novo, and guided by reference sequences with > known structures. Some of these are online web services. > > I'm pretty sure there is nothing for this built into Biopython, so for > scripting > this for a large number of sequences then (as you have also suggested), > my first approach would be to look for command line tools which you > could > call from Python. I've never needed to do this myself, and have no > specific > recommendations regarding which tools to try first. > > If you do find some useful algorithms which could easily be implemented > in Python, they could be worth including - maybe under Bio.SeqUtils? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From fkauff at biologie.uni-kl.de Wed Apr 21 11:59:31 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Wed, 21 Apr 2010 13:59:31 +0200 Subject: [Biopython] Code for protein alpha helix prediction In-Reply-To: References: <4BCDB9D2.4050207@biologie.uni-kl.de> Message-ID: <4BCEE8A3.3010008@biologie.uni-kl.de> Hi Peter, for the start, it seems psipred is the easiest one to use and to implement. I'll start with that, and once the parser for the output goes beyond the quick-and-dirty level, we can think about including it. Frank On 04/20/2010 05:43 PM, Peter wrote: > On Tue, Apr 20, 2010 at 3:27 PM, Frank Kauff wrote: > >> Hi all, >> >> I've recently been asked to help with screening protein sequences for >> certain features, something I don't really know much about... Yet! 
>> >> My questions: Is there some code in Biopython that allows for a quick check >> whether an amino acid sequece is likely to be a alpha helix? Couldn't find >> any. Or is there an algorithm that could be straightforwardly implemented in >> python, or a commandline tool that could be called from within a python >> script? >> > Hi Frank, > > There are lots of tools for predicting secondary structure (alpha helices, > beta sheets etc) both de novo, and guided by reference sequences with > known structures. Some of these are online web services. > > I'm pretty sure there is nothing for this built into Biopython, so for scripting > this for a large number of sequences then (as you have also suggested), > my first approach would be to look for command line tools which you could > call from Python. I've never needed to do this myself, and have no specific > recommendations regarding which tools to try first. > > If you do find some useful algorithms which could easily be implemented > in Python, they could be worth including - maybe under Bio.SeqUtils? > > Peter > From bala.biophysics at gmail.com Wed Apr 21 14:25:35 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Wed, 21 Apr 2010 16:25:35 +0200 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: Peter, Sorry for the delayed reply. Yes i want to remove those sequences that are 100% identical but they have different identifier. I created a sample fasta file with two redundant sequences. But when i use checksums seguid to spot the redundancies, it spots only the first one. In [36]: for record in SeqIO.parse(open('t'),'fasta'): ....: print record.id, seguid(record.seq) ....: ....: A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8 *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw* AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY * In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda rec:seguid(rec.seq)) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /home/cbala/test/ in () /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in to_dict(sequences, key_function) 585 key = key_function(record) 586 if key in d : --> 587 raise ValueError("Duplicate key '%s'" % key) 588 d[key] = record 589 return d ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw' On Tue, Apr 13, 2010 at 5:02 PM, Peter wrote: > On Tue, Apr 13, 2010 at 3:49 PM, Bala subramanian > wrote: > > Friends, > > Sorry if this question was asked before. Is there any function in > Biopython > > that can remove redundant sequence records from a fasta file. > > > > Thanks, > > Bala > > No, but you should be able to do this with Biopython - depending on > what exactly you are asking for. > > When you say "redundant" do you mean 100% perfect identify? > > How big is your FASTA file - are you working with next-gen sequencing > data and millions of reads?. If it is small enough you can keep all > the data in memory to compare sequences to each other. Otherwise > you might try using a checksum (e.g. SEGUID) to spot duplicates. 
> > Peter > From biopython at maubp.freeserve.co.uk Wed Apr 21 15:10:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Apr 2010 16:10:45 +0100 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: On Wed, Apr 21, 2010 at 3:25 PM, Bala subramanian wrote: > Peter, > Sorry for the delayed reply. Yes i want to remove those sequences that are > 100% identical but they have different identifier. I created a sample fasta > file with two redundant sequences. But when i use checksums seguid to spot > the redundancies, it spots only the first one. > > In [36]: for record in SeqIO.parse(open('t'),'fasta'): > ? ....: ? ? print record.id, seguid(record.seq) > ? ....: > ? ....: > A04321 44lpJ2F4Eb74aKigVa5Sut/J0M8 > *AF02161a asaPdDgrYXwwJItOY/wlQFGTmGw > AF02161b asaPdDgrYXwwJItOY/wlQFGTmGw* > AF021618 JvRNzgmeXDBbA9SL5+OQaH2V/zA > AF021622 JvRNzgmeXDBbA9SL5+OQaH2V/zA > AF021627 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ > AF021628 2GT4z2fXZdv9f51ng74C8o0rQXM > AF021629 zq4Fuy1DnR+nh4TbYk+jJ9ygfrQ > *AF02163a fOKCIiGvk6NaPDYY6oKx74tvcxY > AF02163b fOKCIiGvk6NaPDYY6oKx74tvcxY > * > In [37]: hivdict=SeqIO.to_dict(SeqIO.parse(open('t'),'fasta'),lambda > rec:seguid(rec.seq)) > --------------------------------------------------------------------------- > ValueError ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Traceback (most recent call last) > > /home/cbala/test/ in () > > /usr/lib/python2.5/site-packages/Bio/SeqIO/__init__.pyc in > to_dict(sequences, key_function) > ? ?585 ? ? ? ? key = key_function(record) > ? ?586 ? ? ? ? if key in d : > --> 587 ? ? ? ? ? ? raise ValueError("Duplicate key '%s'" % key) > ? ?588 ? ? ? ? d[key] = record > ? ?589 ? ? return d > > ValueError: Duplicate key 'asaPdDgrYXwwJItOY/wlQFGTmGw' Hi Bala, You know there are duplicate sequences in your file, so if you try to use the SEGUID as a key, there will be duplicate keys. Thus you get this error message. If you want to use Bio.SeqIO.to_dict you have to have unique keys. What you should do is loop over the records and keep a record of the checksums you have saved, and use that to ignore duplicates. I would use a python set rather than a python list for speed. You could do this with a for loop. However, I would probably use an iterator based approach with a generator function - I think it is more elegant but perhaps not so easy for a beginner: from Bio import SeqIO from Bio.SeqUtils.CheckSum import seguid def remove_dup_seqs(records): """"SeqRecord iterator to removing duplicate sequences.""" checksums = set() for record in records: checksum = seguid(record.seq) if checksum in checksums: print "Ignoring %s" % record.id continue checksums.add(checksum) yield record records = remove_dup_seqs(SeqIO.parse("with_dups.fasta", "fasta")) count = SeqIO.write(records, "no_dups.fasta", "fasta") print "Saved %i records" % count Note I've used filename with Bio.SeqIO which requires Biopython 1.54b or later - for older versions use handles. See also: http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/ Peter From silvio.tschapke at googlemail.com Wed Apr 21 18:34:54 2010 From: silvio.tschapke at googlemail.com (Silvio Tschapke) Date: Wed, 21 Apr 2010 20:34:54 +0200 Subject: [Biopython] Entrez.efetch rettype retmode Message-ID: Hello. I am new to Biopython and I tried to download a whole record with efetch. The problem is that I get an error message in the output: ""Report 'full' not found in 'pmc' presentation"" Maybe I haven't understood the whole principle. 
But isn't it the goal of pmc to provide full text? I have read the help-page of efetch but it doesn't help me a lot. ---- handle = Entrez.efetch(db="pmc", id="2531137", rettype="full", retmode="text") string = str(handle.read()) f = open('./output.txt', 'w') f.write(string) ---- Thanks for your help! From robert.campbell at queensu.ca Wed Apr 21 20:14:10 2010 From: robert.campbell at queensu.ca (Robert Campbell) Date: Wed, 21 Apr 2010 16:14:10 -0400 Subject: [Biopython] Entrez.efetch rettype retmode In-Reply-To: References: Message-ID: <20100421161410.4fd950ec@adelie.biochem.queensu.ca> Hello Silvio, On Wed, 21 Apr 2010 20:34:54 +0200 Silvio Tschapke wrote: > Hello. > > I am new to Biopython and I tried to download a whole record with efetch. > The problem is that I get an error message in the output: > ""Report 'full' not found in 'pmc' presentation"" > Maybe I haven't understood the whole principle. > > But isn't it the goal of pmc to provide full text? I have read the help-page > of efetch but it doesn't help me a lot. > > > ---- > handle = Entrez.efetch(db="pmc", id="2531137", rettype="full", > retmode="text") > string = str(handle.read()) The documentation on efetch (http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html) specifies that: pmc - PubMed Central contains a number of articles classified as "open access" for which you may download the full text as XML. For the remaining articles in PMC you may download only the abstracts as XML. So you just need to change your retmode='text' to retmode='xml' and omit the rettype option altogether. You will find that not all articles are free to download this way though. I tried a random one and got an error message that the particular journal didn't allow download of full text as XML. Cheers, Rob -- Robert L. Campbell, Ph.D. Senior Research Associate/Adjunct Assistant Professor Botterell Hall Rm 644 Department of Biochemistry, Queen's University, Kingston, ON K7L 3N6 Canada Tel: 613-533-6821 Fax: 613-533-2497 http://pldserver1.biochem.queensu.ca/~rlc From laserson at mit.edu Thu Apr 22 01:07:19 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 21 Apr 2010 21:07:19 -0400 Subject: [Biopython] Bug in GenBank/EMBL parser? Message-ID: Hi, I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which supposedly conforms to the EMBL standard). The short story is that whenever there is a feature, the parser checks whether there are qualifiers in the feature with an assert statement, and does not allow features with no qualifiers. However, the IMGT flatfile is full of entries that have features with no qualifiers (only coordinates). Who is wrong here? Does the EMBL specification require that a feature have qualifiers? Or is this a bug to be fixed in the parser. To be more concrete, the parser broke on the following record: ID A03907 IMGT/LIGM annotation : keyword level; unassigned DNA; HUM; 412 BP. XX AC A03907; XX DT 11-MAR-1998 (Rel. 8, arrived in LIGM-DB ) DT 10-JUN-2008 (Rel. 200824-2, Last updated, Version 3) XX DE H.sapiens antibody D1.3 variable region protein ; DE unassigned DNA; rearranged configuration; Ig-Heavy; regular; group IGHV. XX KW antigen receptor; Immunoglobulin superfamily (IgSF); KW Immunoglobulin (IG); IG-Heavy; variable; diversity; joining; KW rearranged. 
XX OS Homo sapiens (human) OC cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; OC Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; OC Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; OC Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; OC Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; OC Homo/Pan/Gorilla group; Homo. XX RN [1] RP 1-412 RA ; RT "Recombinant antibodies and methods for their production."; RL Patent number EP0239400-A/10, 30-SEP-1987. RL MEDICAL RESEARCH COUNCIL. XX DR EMBL; A03907. XX FH Key Location/Qualifiers (from EMBL) FH FT source 1..412 FT /organism="Homo sapiens" FT /mol_type="unassigned DNA" FT /db_xref="taxon:9606" FT V_region 8..>412 FT /note="antibody D1.3 V region" FT sig_peptide 8..64 FT CDS 8..>412 FT /product="antibody D1.3 V region (VDJ)" FT /protein_id="CAA00308.1" FT /translation="MAVLALLFCLVTFPSCILSQVQLKESGPGLVAPSQSLSITCTVSG FT FSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSL FT HTDDTARYYCARERDYRLDYWGQGTTLTVSS" FT D_segment 356..371 FT J_segment 372..>412 FT /note="J(H)2 region" XX SQ Sequence 412 BP; 105 A; 109 C; 104 G; 94 T; 0 other; tcagagcatg gctgtcctgg cattactctt ctgcctggta acattcccaa gctgtatcct 60 ttcccaggtg cagctgaagg agtcaggacc tggcctggtg gcgccctcac agagcctgtc 120 catcacatgc accgtctcag ggttctcatt aaccggctat ggtgtaaact gggttcgcca 180 gcctccagga aagggtctgg agtggctggg aatgatttgg ggtgatggaa acacagacta 240 taattcagct ctcaaatcca gactgagcat cagcaaggac aactccaaga gccaagtttt 300 cttaaaaatg aacagtctgc acactgatga cacagccagg tactactgtg ccagagagag 360 agattatagg cttgactact ggggccaagg caccactctc acagtctcct ca 412 // And the traceback was: ERROR: An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (311, 0)) --------------------------------------------------------------------------- AssertionError Traceback (most recent call last) /Volumes/External/home/laserson/research/church/vdj-ome/ref-data/IMGT/ in () /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features) 418 #This is a generator function 419 while True : --> 420 record = self.parse(handle, do_features) 421 if record is None : break 422 assert record.id is not None /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features) 401 feature_cleaner = FeatureValueCleaner()) 402 --> 403 if self.feed(handle, consumer, do_features) : 404 return consumer.data 405 else : /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features) 373 #Features (common to both EMBL and GenBank): 374 if do_features : --> 375 self._feed_feature_table(consumer, self.parse_features(skip=False)) 376 else : 377 self.parse_features(skip=True) # ignore the data /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_features(self, skip) 170 feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].rstrip()) 171 line = self.handle.readline() --> 172 features.append(self.parse_feature(feature_key, feature_lines)) 173 self.line = line 174 return features /Library/Frameworks/Python.framework/Versions/4.3.0/lib/python2.5/site-packages/Bio/GenBank/Scanner.pyc in parse_feature(self, feature_key, lines) 267 else : 268 #Unquoted 
continuation --> 269 assert len(qualifiers) > 0 270 assert key==qualifiers[-1][0] 271 #if debug : print "Unquoted Cont %s:%s" % (key, line) AssertionError: Which is tracked to an assert statement in Scanner.py at line 269. It appears that the assumption in the code is that there is an unquoted continuation of a feature qualifier. Finally, I am using biopython 1.51 that I built from source using python 2.5 (from an EPD install 4.3.0). I am on a Mac running OS X 10.5.8 (Leopard) Thanks! Uri From biopython at maubp.freeserve.co.uk Thu Apr 22 08:56:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Apr 2010 09:56:52 +0100 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: > Hi, > > I am trying to use the EMBL parse to parse the IMGT/LIGM flatfile (which > supposedly conforms to the EMBL standard). > > The short story is that whenever there is a feature, the parser checks > whether there are qualifiers in the feature with an assert statement, and > does not allow features with no qualifiers. ?However, the IMGT flatfile is > full of entries that have features with no qualifiers (only coordinates). > > Who is wrong here? ?Does the EMBL specification require that a feature have > qualifiers? ?Or is this a bug to be fixed in the parser. Hi Uri, Thank you for your detailed report, Since you have raised this, I went back over the EMBL documentation. All their example features qualifiers (and from personal experience all EMBL files from the EMBL and GenBank files from the NCBI) do have qualifiers. However, in Section 7.2 they are called "Optional qualifiers". http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 So it does look like an unwarranted assumption in the Biopython parser (even though it has been a safe assumption on "official" EMBL and GenBank files thus far), which we should fix. Could you file a bug please? http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython This also affect Biopython 1.54b (the latest release) and the current code in the repository. I would hope we can solve this before Biopython 1.54 proper is released. Regards, Peter From chapmanb at 50mail.com Thu Apr 22 12:18:10 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 22 Apr 2010 08:18:10 -0400 Subject: [Biopython] removing redundant sequence In-Reply-To: References: Message-ID: <20100422121810.GV29724@sobchak.mgh.harvard.edu> Bala; > > I created a sample fasta > > file with two redundant sequences. But when i use checksums seguid to spot > > the redundancies, it spots only the first one. > What you should do is loop over the records and keep a record > of the checksums you have saved, and use that to ignore duplicates. > I would use a python set rather than a python list for speed. > > You could do this with a for loop. However, I would probably use an > iterator based approach with a generator function - I think it is more > elegant but perhaps not so easy for a beginner: [... Nice code example from Peter ..] This is a nice problem example and discussion. Bala, it sounds like Peter provided some useful example code to solve this. Once you use this to get together a program that solves your problem, it would be very helpful if you could write it up as a Cookbook entry: http://biopython.org/wiki/Category:Cookbook That would help others in the future who will be tackling similar issues. 
Thanks much, Brad From cloudycrimson at gmail.com Fri Apr 23 07:56:45 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Fri, 23 Apr 2010 13:26:45 +0530 Subject: [Biopython] Qblast : no hits Message-ID: Hello freinds, I have a problem with qblast. I have sequences from the mass spectromerty equipment that needs to be BLASTed to find the protein it belongs to. When I blast these sequences in the NCBI website it takes some time (longer than usual ) but does gives me hits. When i blast them using the following code in biopython they dont give me any hits. CODE: **************************************************************************** >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR") >>> blast_results = result_handle.read() >>> save_file = open( "testseq.xml", "w") >>> save_file.write(blast_results) >>> save_file.close() **************************************************************************** OUTPUT: **************************************************************************** blastp BLASTP 2.2.23+ Alejandro A. Schäffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005. nr 12361 unnamed protein product 19 BLOSUM62 10 11 1 F 1 12361 unnamed protein product 19 10888645 -585703444 0 0 0.041 0.267 0.14 ***************************************************************************** Is this because a normal blast code doesn wait long till the results are given? I mean the RTOE error. if yes, how to control the "time of execution"? Or else what is the problem with my code? If you guys know anything on this issue, please give me your ideas. Thanking you in advance. Sincerely, Karthik From biopython at maubp.freeserve.co.uk Fri Apr 23 09:49:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Apr 2010 10:49:55 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hello Karthik On Fri, Apr 23, 2010 at 8:56 AM, Karthik Raja wrote: > Hello freinds, > > I have a ?problem with qblast. I have sequences from the mass > spectromerty equipment that needs to be BLASTed to find the protein it > belongs to. When I blast these sequences in the NCBI website it takes > some time (longer than usual ) but does gives me hits. When i blast > them using the following code in biopython they dont give me any hits. > > CODE: > > **************************************************************************** > >>>> from Bio.Blast import NCBIWWW >>>> result_handle = NCBIWWW.qblast("blastp", "nr", "AFAQVRCSGLARGGGYVLR") >>>> blast_results = result_handle.read() >>>> save_file = open( "testseq.xml", "w") >>>> save_file.write(blast_results) >>>> save_file.close() > > **************************************************************************** > > Is this because a normal blast code doesn wait long till the results are > given? I mean the RTOE error. if yes, how to control the "time of > execution"? What error? It looks like your example ran fine. > Or else what is the problem with my code? > > If you guys know anything on this issue, please give me your ideas. Differences between a manual BLAST search on the NCBI website and a script search via QBLAST are almost always down to different parameter settings. 
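Separately, one way to check programmatically whether a saved result like the testseq.xml file above actually contains any hits is Biopython's BLAST XML parser. A minimal sketch, assuming the file was written as in the example earlier in this thread:

from Bio.Blast import NCBIXML

result = NCBIXML.read(open("testseq.xml"))
if not result.alignments:
    print "No hits found"
else:
    for alignment in result.alignments:
        for hsp in alignment.hsps:
            print alignment.title, hsp.expect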
The NCBI have often adjusted the defaults on the website, and they no longer match the defaults on QBLAST. You should check things like the expectation cut off, the matrix, gap penalties etc. The simplest option would be just to copy the current defaults from the website into your python code. We probably need to put this into the Biopython FAQ ... Regards, Peter From cjfields at illinois.edu Fri Apr 23 12:00:07 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 23 Apr 2010 07:00:07 -0500 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: On Apr 23, 2010, at 4:49 AM, Peter wrote: >> ... > > Differences between a manual BLAST search on the NCBI website > and a script search via QBLAST are almost always down to different > parameter settings. The NCBI have often adjusted the defaults on > the website, and they no longer match the defaults on QBLAST. > You should check things like the expectation cut off, the matrix, > gap penalties etc. The simplest option would be just to copy the > current defaults from the website into your python code. > > We probably need to put this into the Biopython FAQ ... > > Regards, > > Peter Same for BioPerl. chris From cloudycrimson at gmail.com Sat Apr 24 03:27:10 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Sat, 24 Apr 2010 08:57:10 +0530 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, I did try changing the paramters according to the WWW BLAST and its gives an error saying "no RID or no RTOE found". Its the same error i was trying to tell you in the 1st post. Its the "request time of execution". Is there any way to change this RTOE i.e. to increase it? Any idea? On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields wrote: > On Apr 23, 2010, at 4:49 AM, Peter wrote: > > >> ... > > > > Differences between a manual BLAST search on the NCBI website > > and a script search via QBLAST are almost always down to different > > parameter settings. The NCBI have often adjusted the defaults on > > the website, and they no longer match the defaults on QBLAST. > > You should check things like the expectation cut off, the matrix, > > gap penalties etc. The simplest option would be just to copy the > > current defaults from the website into your python code. > > > > We probably need to put this into the Biopython FAQ ... > > > > Regards, > > > > Peter > > Same for BioPerl. > > chris > From p.j.a.cock at googlemail.com Sat Apr 24 11:40:27 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Sat, 24 Apr 2010 12:40:27 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: <6540A260-554B-488A-AED7-B0559883F7F7@googlemail.com> On 24 Apr 2010, at 04:27, Karthik Raja wrote: > Hello Peter, > > I did try changing the paramters according to the WWW BLAST and its > gives an > error saying "no RID or no RTOE found". Its the same error i was > trying to > tell you in the 1st post. Its the "request time of execution". Is > there any > way to change this RTOE i.e. to increase it? Any idea? > > On Fri, Apr 23, 2010 at 5:30 PM, Chris Fields > wrote: > >> On Apr 23, 2010, at 4:49 AM, Peter wrote: >> >>>> ... >>> >>> Differences between a manual BLAST search on the NCBI website >>> and a script search via QBLAST are almost always down to different >>> parameter settings. The NCBI have often adjusted the defaults on >>> the website, and they no longer match the defaults on QBLAST. >>> You should check things like the expectation cut off, the matrix, >>> gap penalties etc. 
The simplest option would be just to copy the >>> current defaults from the website into your python code. >>> >>> We probably need to put this into the Biopython FAQ ... >>> >>> Regards, >>> >>> Peter >> >> Same for BioPerl. >> >> chris >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Sat Apr 24 11:49:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 24 Apr 2010 12:49:55 +0100 Subject: [Biopython] Qblast : no hits In-Reply-To: References: Message-ID: Hi all, Sorry for the blank email just now. On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote: > Hello Peter, > > I did try changing the paramters according to the WWW BLAST > and its gives an error saying "no RID or no RTOE found". Its the > same error i was trying to tell you in the 1st post. Its the "request > time of execution". Is there any way to change this RTOE i.e. to > increase it? Any idea? Please show us an example with this problem (i.e. the python code and the traceback). What is meant to happen is we send the query to the NCBI, and they reply with reference details (RID and RTOE) which are used to fetch the results after BLAST has finished running. My guess for what is happening is your parameters are for some reason invalid, and the NCBI is giving an error page (so no RID and no RTOE). Biopython tries to spot any error message in this situation, but in your case could not. Peter From cloudycrimson at gmail.com Sun Apr 25 03:24:59 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Sun, 25 Apr 2010 08:54:59 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, As said i did try changing the parameters of qblast according to the set in the web blast. The parameters that I changed are 1. Martrix 2. Word size 3. Expect There is a check box option in the web page that allows us to check it if we want the web blast to adjust according short sequences. I am not sure how to bring that option into the qblast. *Below given are the code and the traceback. 
* >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast ("blastp", "nr", "SSRVQDGMGLYTARRVR", auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=200000, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name= 'PAM30', nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=2, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None) *Traceback (most recent call last): * File "", line 1, in result_handle = NCBIWWW.qblast *("blastp", "nr", "SSRVQDGMGLYTARRVR",*auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', *expect=200000*, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, *matrix_name= 'PAM30'*, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, * word_size=2*, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in _parse_qblast_ref_page raise ValueError("No RID and no RTOE found in the 'please wait' page." ValueError: No RID and no RTOE found in the 'please wait' page. (there was probably a problem with your request) Here are a few examples of my MS sequences. 1. *IMYTALPVIGKRHFRPSFTR * 2. *RSSRGRGR * 3. *AGPGPRRAKAAPYR * 4. *ASRSYSSERRAR * 5. *AASAAPPRAGRPDRGPLALAGR * 6. *GSDGKSRGR * 7. *TYGWRAEPR * 8. *PPEPAREPRLSPRR * 9. *GVLTALRR * 10. *AGMRLPSRRQSFPAPVSR * *Sincerely, * *Karthikraja* On Sat, Apr 24, 2010 at 5:19 PM, Peter wrote: > Hi all, > > Sorry for the blank email just now. > > On Sat, Apr 24, 2010 at 4:27 AM, Karthik Raja wrote: > > Hello Peter, > > > > I did try changing the paramters according to the WWW BLAST > > and its gives an error saying "no RID or no RTOE found". Its the > > same error i was trying to tell you in the 1st post. Its the "request > > time of execution". Is there any way to change this RTOE i.e. to > > increase it? Any idea? > > Please show us an example with this problem (i.e. the python > code and the traceback). > > What is meant to happen is we send the query to the NCBI, and > they reply with reference details (RID and RTOE) which are > used to fetch the results after BLAST has finished running. > > My guess for what is happening is your parameters are for > some reason invalid, and the NCBI is giving an error page > (so no RID and no RTOE). Biopython tries to spot any error > message in this situation, but in your case could not. 
> > Peter > From biopython at maubp.freeserve.co.uk Sun Apr 25 12:45:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 25 Apr 2010 13:45:05 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote: > *Below given are the code and the traceback. * Great - I can run that and get the same traceback. Here is a shorter version which does the same thing - removing all the parameters you don't actually set: from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", entrez_query='(none)', expect=200000, hitlist_size=50, matrix_name='PAM30', word_size=2, alignments=500, descriptions=500, format_type='XML') Getting shorter still: result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", matrix_name='PAM30') The problem is the matrix name - remove that and the error goes away. So progress :) Doing a little digging, this is the error message from the NCBI is: Message ID#35 Error: Cannot validate the Blast options: Gap existence and extension values of 11 and 1 not supported for PAM30 supported values are: 32767, 32767 7, 2 6, 2 5, 2 10, 1 9, 1 8, 1 As I guessed earlier, Biopython needed a little update to recognise this error message and pass it to the user. I've done that. In your case, you need to pick gap parameters appropriate for PAM30. Peter From cloudycrimson at gmail.com Mon Apr 26 08:38:59 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Mon, 26 Apr 2010 14:08:59 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hello Peter, I tried out what you suggested and it works perfectly. I checked the result XML file and there was no problem at all. But I still have one more small issue that I am sure you can help me with. The main reason i wanted to use python was that I could put all the query sequences in a file and blast it. So when I tried the above code to blast a sequence that I have put in a fasta file, it gives an error. Same kinda error. Below are the code and traceback. >>> fasta_string = open("test.fasta").read() >>> result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50, word_size=2, alignments=500, descriptions=500,format_type='XML') *Traceback (most recent call last): * File "", line 2, in word_size=2, alignments=500, descriptions=500,format_type='XML') File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 117, in qblast rid, rtoe = _parse_qblast_ref_page(handle) File "C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py", line 203, in _parse_qblast_ref_page raise ValueError("No RID and no RTOE found in the 'please wait' page." ValueError: No RID and no RTOE found in the 'please wait' page. (there was probably a problem with your request) Please let me know if you could sense in the problem with the code. Sincerely, Karthik On Sun, Apr 25, 2010 at 6:15 PM, Peter wrote: > On Sun, Apr 25, 2010 at 4:24 AM, Karthik Raja wrote: > > > *Below given are the code and the traceback. * > > Great - I can run that and get the same traceback. 
> > Here is a shorter version which does the same thing - removing all the > parameters you don't actually set: > > from Bio.Blast import NCBIWWW > result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", > entrez_query='(none)', expect=200000, hitlist_size=50, > matrix_name='PAM30', word_size=2, alignments=500, descriptions=500, > format_type='XML') > > Getting shorter still: > > result_handle = NCBIWWW.qblast("blastp", "nr", "SSRVQDGMGLYTARRVR", > matrix_name='PAM30') > > The problem is the matrix name - remove that and the error goes away. > So progress :) > > Doing a little digging, this is the error message from the NCBI is: > > Message ID#35 Error: Cannot validate the Blast options: Gap existence > and extension values of 11 and 1 not supported for PAM30 > supported values are: > 32767, 32767 > 7, 2 > 6, 2 > 5, 2 > 10, 1 > 9, 1 > 8, 1 > > As I guessed earlier, Biopython needed a little update to recognise > this error message and pass it to the user. I've done that. > > In your case, you need to pick gap parameters appropriate for PAM30. > > Peter > From biopython at maubp.freeserve.co.uk Mon Apr 26 10:02:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 11:02:24 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: Hi Karthik, On Mon, Apr 26, 2010 at 9:38 AM, Karthik Raja wrote: > Hello Peter, > > I tried out what you suggested and it works perfectly. I checked the result > XML file and there was no problem at all. That's good :) > But I still have one more small issue that I am sure you can help me with. > The main reason i wanted to use python was that I could put all the query > sequences in a file and blast it. I wouldn't recommend that approach. For a modest number of queries, I would suggest doing one online BLAST query at a time. This will spread out the load on the NCBI, and means each time your XML results won't be too big. Trying to do too many queries at risks hitting an NCBI CPU limit, or having problems downloading a very large XML result file. For a large number of queries, I would suggest using standalone BLAST (installed and run locally) - especially if you want to use very lenient parameters giving lots of results (meaning large output files). > So when I tried the above code to blast a > sequence that I have put in a fasta file, it gives an error. Same kinda > error. Below are the code and traceback. > >>>> fasta_string = open("test.fasta").read() >>>> result_handle = NCBIWWW.qblast("blastp", "nr", > fasta_string,entrez_query='(none)', expect=200000, hitlist_size=50, > word_size=2, alignments=500, descriptions=500,format_type='XML') > > *Traceback (most recent call last): > ... > ValueError: No RID and no RTOE found in the 'please wait' page. (there was > probably a problem with your request) > > Please let me know if you could sense in the problem with the code. > > Sincerely, > Karthik The code works fine - I just tried it using a FASTA file with four proteins. I would guess there is a problem with your FASTA file - perhaps there is a bad sequence in it, or too many sequences. Since you don't have the latest code we can't see the NCBI error message in the traceback, which would help a lot. 
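Coming back to the one-query-at-a-time suggestion above, a minimal sketch of that approach is shown here. The file names and the pause length are only placeholders; each result is written to its own XML file named after the query record.

import time
from Bio import SeqIO
from Bio.Blast import NCBIWWW

for record in SeqIO.parse(open("test.fasta"), "fasta"):
    print "Running BLAST for %s" % record.id
    result_handle = NCBIWWW.qblast("blastp", "nr", str(record.seq))
    save_file = open(record.id.replace("|", "_") + ".xml", "w")
    save_file.write(result_handle.read())
    save_file.close()
    # pause between queries to be gentle on the NCBI servers
    time.sleep(10)

For a large number of queries the standalone BLAST route mentioned above avoids both the pauses and the large downloads.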
I see you are running on Windows, so the easiest way to try this is to backup C:\Python26\lib\site-packages\Bio\Blast\NCBIWWW.py and replace it with the new version from our repository: http://biopython.open-bio.org/SRC/biopython/Bio/Blast/NCBIWWW.py or: http://github.com/biopython/biopython/raw/master/Bio/Blast/NCBIWWW.py Or, could you send me the FASTA file to try it here (please send it to me directly, not the mailing list). Regards, Peter From nick_leake77 at hotmail.com Mon Apr 26 15:36:28 2010 From: nick_leake77 at hotmail.com (Nick Leake) Date: Mon, 26 Apr 2010 11:36:28 -0400 Subject: [Biopython] parsing a fasta with multiple entries Message-ID: Hello, I'm having trouble parsing a fasta file with multiple sequences - it is a fasta that has most of the transposable elements in fruit flies found at http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down. I want to be able to access the DNA sequences for manipulation and later removal from a chromosomal region. I originally thought that I could follow the same fasta format example shown in the biopython tutorial. However, that failed to work. I think it might be because there are multiple entries. Basically, I just want parse the information and have dictionaries hold the transposon elements name and sequence for later use. Can I do that with biopython or should I make my own parser? Any help would be greatly appreciated. I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. _________________________________________________________________ The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with Hotmail. http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5 From biopython at maubp.freeserve.co.uk Mon Apr 26 15:52:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:52:28 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: Hi Nick, On Mon, Apr 26, 2010 Nick Leake wrote: > Hello, > > I'm having trouble parsing an embl file (attached) with multiple > sequences. ?I want to be able to access the DNA sequences for > manipulation and removal from a chromosomal region. ?I originally > thought that I could follow the same fasta format example shown in the > biopython tutorial. ?However, that failed to work. ?Next, I tried to > convert the file to a fastq or a fasta to just follow the examples - > again, failed. ?So, I looked around and found some embl parsing code: > > from Bio import SeqIO > > p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") > p.next() > record=p.next() > > print record > > This kinda works, but fails to read all entries. Well, yes: from Bio import SeqIO #that imports the library p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") #that sets up the EMBL parser (although EMBL files are text so it is a bit #odd to open it in binary read mode) p.next() #reads the first record and discards it record=p.next() #reads the second record and stores as variable record You only ever try and look at the second record. See below... > ... ?In addition, I don't know what code I need to 'grab' the DNA > information for manipulations and remove these sequences from > a given DNA segment. ? ?Can I get a little guidance to > what I need to do or where I can look to help solve my problem? 
What you probably want to start with is a simple for loop, from Bio import SeqIO for record in SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41"),"embl"): print record.id, record.seq However, this runs into a problem: Traceback (most recent call last): ... ValueError: Expected sequence length 2, found 2483. Looking at your file (which was too big to send to the list), your EMBL file is invalid. Specifically this is failing on the record which starts: ID FROGGER standard; DNA; INV; 2 BP. That ID line says the sequence is just 2 base pairs, but in fact the seems to be 2483bp. The ID line should probably be edited like this: ID FROGGER standard; DNA; INV; 2483 BP. Fixing that shows up another similar problem, ID TV1 standard; DNA; INV; 1728 BP. should probably be: ID TV1 standard; DNA; INV; 1730 BP. Then there is this record: ID DDBARI1 standard; DNA; INV; 1676 BP. Several parts of the record suggest it should be 1676bp (not just the ID line, but also for example the SQ line), but there is actually 1677bp of sequence present. After making those three edits by hand, Biopython should parse it. I suspect your EMBL file has been manually edited. Where did it come from? Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 15:54:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:54:54 +0100 Subject: [Biopython] Fwd: help with parsing EMBL In-Reply-To: References: Message-ID: Hi all, I'm forwarding this email from Nick Leake about parsing EMBL files, but without his 1.3MB attachment. I'll reply to his questions in a follow up email... Peter ---------- Forwarded message ---------- From:?Nick Leake To:? Date:?Mon, 26 Apr 2010 09:35:45 -0400 Subject:?help with parsing Hello, I'm having trouble parsing an embl file (attached) with multiple sequences. ?I want to be able to access the DNA sequences for manipulation and removal from a chromosomal region. ?I originally thought that I could follow the same fasta format example shown in the biopython tutorial. ?However, that failed to work. ?Next, I tried to convert the file to a fastq or a fasta to just follow the examples - again, failed. ?So, I looked around and found some embl parsing code: from Bio import SeqIO p=SeqIO.parse(open(r"transposon_sequence_set.embl.v.9.41","rb"),"embl") p.next() record=p.next() print record This kinda works, but fails to read all entries. ?Also, there is no 'record' argument for output. ?In addition, I don't know what code I need to 'grab' the DNA information for manipulations and remove these sequences from a given DNA segment. ? ?Can I get a little guidance to what I need to do or where I can look to help solve my problem? Any help would be greatly appreciated. ?I'm still very much a python novice and get frustrated by not knowing how to ask my questions appropriately. _________________________________________________________________ The New Busy is not the too busy. Combine all your e-mail accounts with Hotmail. http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4 ---------- Forwarded message ---------- From:?biopython-request at lists.open-bio.org To: Date:?Mon, 26 Apr 2010 09:44:02 -0400 Subject:?confirm 29081d7dc4252dd9c96c13f5018658d3414acbdc If you reply to this message, keeping the Subject: header intact, Mailman will discard the held message. ?Do this if the message is spam. 
?If you reply to this message and include an Approved: header with the list password in it, the message will be approved for posting to the list. ?The Approved: header can also appear in the first line of the body of the reply. From biopython at maubp.freeserve.co.uk Mon Apr 26 15:59:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 16:59:02 +0100 Subject: [Biopython] parsing a fasta with multiple entries In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 4:36 PM, Nick Leake wrote: > > Hello, > > I'm having trouble parsing a fasta file with multiple sequences - it is a fasta > that has most of the transposable elements in fruit flies found at > http://www.fruitfly.org/p_disrupt/TE.html#NAT right side, third box down. Hi Nick, You mean this file? http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.fasta > I want to be able to access the DNA sequences for manipulation and later > removal from a chromosomal region. ?I originally thought that I could follow > the same fasta format example shown in the biopython tutorial. ?However, > that failed to work. ?I think it might be because there are multiple entries. The Bio.SeqIO.read() function is for when there is a single record. The Bio.SeqIO.parse() function is for when you have multiple records. Could you clarify which bit of the tutorial was confusing? We'd like to make it better. > Basically, I just want parse the information and have dictionaries hold the > transposon elements name and sequence for later use. ?Can I do that with > biopython or should I make my own parser? Any help would be greatly > appreciated. ?I'm still very much a python novice and get frustrated by not > knowing how to ask my questions appropriately. You should be able to use the Bio.SeqIO.index() function for this. >>> from Bio import SeqIO >>> data = SeqIO.index("D_mel_transposon_sequence_set.fasta", "fasta") >>> data.keys()[:10] ['gb|U14101|TART-B', 'gb|AF162798|Dbuz\\BuT1', 'gb|U26847|Dvir\\Helena', 'gb|X67681|Bari1', 'gb|M69216|hobo', 'gb|U29466|Dkoe\\Gandalf', 'gb|Z27119|flea', 'gb|AB022762|aurora-element', 'gb|nnnnnnnn|Stalker3T', 'gb|AF518730|Dwil\\Vege'] >>> data["gb|nnnnnnnn|Stalker3T"] SeqRecord(seq=Seq('TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAAT...ACA', SingleLetterAlphabet()), id='gb|nnnnnnnn|Stalker3T', name='gb|nnnnnnnn|Stalker3T', description='gb|nnnnnnnn|Stalker3T STALKER3 372bp', dbxrefs=[]) >>> print data["gb|nnnnnnnn|Stalker3T"].seq TGTAGTGTATCTACCCTCAATATGTArAGTAGAGTTAATATGTAAGTAAGTAATATGTAAAGTAGAGTTAATATGTAAGTAAGCAAAAGACCACCAACACTTACATGAACACTCCAGCTCTTGAAATACGATCGAGCGCTTAAACATAAGCCGATCGCGGAGCGTGAGAGTGCCGAGCATACACCTAGCAGCTCAAGTGATTAAGATAAGATAAGATAAGATAACAAACACGTAGTCTTAAGCGCGTCATGTGCGGGTGGCTGTACCCAAGAACAGCAAAGTGAATTCATTCGAATAAACCGCTTCAAGCAGAGCAGAGCCAAGTCTATTATATCAACTTCAAAAATACCGTATAACCTTGAACCTATTACA Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 16:02:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 17:02:18 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 4:52 PM, Peter wrote: > Hi Nick, > > On Mon, Apr 26, 2010 Nick Leake wrote: >> Hello, >> >> I'm having trouble parsing an embl file (attached) with multiple >> sequences. ... > > After making those three edits by hand, Biopython should parse it. > I suspect your EMBL file has been manually edited. Where did it > come from? 
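For files like this a crude pre-flight check can save some head scratching: compare the length claimed on each ID line with the number of bases actually present before the // terminator. The sketch below only understands the "ID ... ; 123 BP." layout shown in these records, so treat it as a starting point rather than a validator.

import re

name = None
declared = None
bases = 0
for line in open("transposon_sequence_set.embl.v.9.41"):
    if line.startswith("ID "):
        name = line.split()[1]
        match = re.search(r"(\d+) BP", line)
        declared = int(match.group(1)) if match else None
        bases = 0
    elif line.startswith("//"):
        if name and declared is not None and declared != bases:
            print "%s: ID line says %i BP, sequence has %i bp" % (name, declared, bases)
    elif line.startswith("     "):
        # sequence data lines are indented; count letters only, skipping the numbering
        bases += sum(len(block) for block in line.split() if block.isalpha())

Something along these lines would have flagged the mismatched records straight away.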
>From Nick's other email about the FASTA file, http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html I can can see that the funny EMBL file came from the Berkeley Drosophil Genome Project (BDGP)'s Natural Transposable Element Project: http://www.fruitfly.org/p_disrupt/TE.html Specifically this file: http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl I'll email them to alert them about the three obvious errors I discussed. Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 16:28:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 17:28:31 +0100 Subject: [Biopython] help with parsing EMBL In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 5:02 PM, Peter wrote: > > From Nick's other email about the FASTA file, > http://lists.open-bio.org/pipermail/biopython/2010-April/006451.html > I can can see that the funny EMBL file came from the Berkeley Drosophil > ?Genome Project (BDGP)'s Natural Transposable Element Project: > http://www.fruitfly.org/p_disrupt/TE.html > > Specifically this file: > http://www.fruitfly.org/data/p_disrupt/datasets/ASHBURNER/D_mel_transposon_sequence_set.embl > > I'll email them to alert them about the three obvious errors I discussed. There is also something odd going on with the features, which the Biopython parser seems to be ignoring... Peter From biopython at maubp.freeserve.co.uk Mon Apr 26 22:04:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Apr 2010 23:04:15 +0100 Subject: [Biopython] parsing a fasta with multiple entries In-Reply-To: References: Message-ID: On Mon, Apr 26, 2010 at 8:05 PM, Nick Leake wrote: > Thanks Peter, > > All of the information is?very helpful.? I apologize for sending?second > email.? I was thinking that?the first email was going to be discarded for > having the attachment - which in hindsight is an obvious fact.? At that > time, I had only seen the initial email for rejecting the first. I managed to reply before sending the original email (without attachment) to the list - so partly my fault. >>> I want to be able to access the DNA sequences for manipulation and >>> later removal from a chromosomal region. ?I originally thought that I >>> could follow the same fasta format example shown in the biopython >>> tutorial. ?However, that failed to work. ?I think it might be because >>> there are multiple entries. >> >> The Bio.SeqIO.read() function is for when there is a single record. The >> Bio.SeqIO.parse() function is for when you have multiple records. Could >> you clarify which bit of the tutorial was confusing? We'd like to make it >> better. > > The tutorial I used was from > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html OK, good - that is the current version. > I will admit I didn't really know the difference from the Bio.SeqIO.read() > verse the Bio.SeqIO.parse() functions even though they should be > intuitive.? Still, the mentioned tutorial doen't seem to have a multiple > entry parsed example.?This is where my naivet??and confusion on > the matter probably started. It does (the file ls_orchid.fasta used in several examples has 94 entries), but I guess there is a lot of information in there and it can be overwhelming. 
Your problems with the funny EMBL file probably didn't help :( Peter From p.j.a.cock at googlemail.com Mon Apr 26 22:30:54 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 26 Apr 2010 23:30:54 +0100 Subject: [Biopython] Google Summer of Code - accepted students In-Reply-To: <4BD60D63.1040400@cornell.edu> References: <4BD60D63.1040400@cornell.edu> Message-ID: ---------- Forwarded message ---------- From: Robert Buels Date: Mon, Apr 26, 2010 at 11:02 PM Subject: Google Summer of Code - accepted students To: rmb32 at cornell.edu Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. ?We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From rmb32 at cornell.edu Mon Apr 26 22:02:11 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 26 Apr 2010 15:02:11 -0700 Subject: [Biopython] Google Summer of Code - accepted students Message-ID: <4BD60D63.1040400@cornell.edu> Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. 
Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From anaryin at gmail.com Tue Apr 27 04:29:36 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 27 Apr 2010 12:29:36 +0800 Subject: [Biopython] Google Summer of Code - accepted students In-Reply-To: References: <4BD60D63.1040400@cornell.edu> Message-ID: Hello all! Thanks for the confidence! I'm sure it's going to work alright! If anyone has any comments to add to my application feel free either to email me! Regards! Jo?o [...] Rodrigues On Monday, April 26, 2010, Peter Cock wrote: > ---------- Forwarded message ---------- > From: Robert Buels > Date: Mon, Apr 26, 2010 at 11:02 PM > Subject: Google Summer of Code - accepted students > To: rmb32 at cornell.edu > > > Hi all, > > I'm pleased to announce the acceptance of OBF's 2010 Google Summer of > Code students, listed in alphabetical order with their project titles > and primary mentors: > > Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including > Implementation of Multiple Sequence Alignment Algorithms > > Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, > Classification, and Visualization of Posttranslational Modification of > Proteins > > Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby > > Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & > Duplication Inference Algorithm for Binary and Non-binary Species Tree > > Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending > Bio.PDB: broadening the usefulness of BioPython's Structural Biology > module > > Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring > > Congratulations to our accepted students! > > All told, we had 52 applications submitted for the 6 slots (5 > originally assigned, plus 1 extra) allotted to us by Google. > Proposals were extremely competitive: 6 out of 52 translates to an > 11.5% acceptance rate. ?We received a lot of really excellent > proposals, the decisions were not easy. > > Thanks very much to all the students who applied, we very much > appreciate your hard work. > > Here's to a great 2010 Summer of Code, I'm sure these students will do > some wonderful work. > > Rob Buels > OBF GSoC 2010 Administrator > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Jo?o [...] 
Rodrigues @ http://stanford.edu/~joaor/ From rmb32 at cornell.edu Tue Apr 27 05:52:57 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 26 Apr 2010 22:52:57 -0700 Subject: [Biopython] Google Summer of Code - accepted students Message-ID: <4BD67BB9.3000804@cornell.edu> Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From biopython at maubp.freeserve.co.uk Tue Apr 27 09:45:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Apr 2010 10:45:20 +0100 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: On Thu, Apr 22, 2010 at 9:56 AM, Peter wrote: > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: >> Hi, >> >> I am trying to use the EMBL parser to parse the IMGT/LIGM flatfile (which >> supposedly conforms to the EMBL standard). >> >> The short story is that whenever there is a feature, the parser checks >> whether there are qualifiers in the feature with an assert statement, and >> does not allow features with no qualifiers. However, the IMGT flatfile is >> full of entries that have features with no qualifiers (only coordinates). >> >> Who is wrong here? Does the EMBL specification require that a feature have >> qualifiers? Or is this a bug to be fixed in the parser. > > Hi Uri, > > Thank you for your detailed report, > > Since you have raised this, I went back over the EMBL documentation. > All their example features (and from personal experience all > EMBL files from the EMBL and GenBank files from the NCBI) do have > qualifiers. However, in Section 7.2 they are called "Optional qualifiers". > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 > > So it does look like an unwarranted assumption in the Biopython > parser (even though it has been a safe assumption on "official" EMBL > and GenBank files thus far), which we should fix. Bug filed and now fixed, http://bugzilla.open-bio.org/show_bug.cgi?id=3062 It turned out to be an invalid EMBL file where the features were over-indented.
Biopython was quite happy to parse valid EMBL or GenBank files with features without qualifiers (although I don't recall seeing any examples from EMBL or the NCBI like this). Peter From silvio.tschapke at googlemail.com Wed Apr 28 09:24:25 2010 From: silvio.tschapke at googlemail.com (Silvio Tschapke) Date: Wed, 28 Apr 2010 11:24:25 +0200 Subject: [Biopython] save efetch results in different files Message-ID: Hi all, I'd like to download hundreds of pubmed entries in one turn, but save every entry in a single file for further processing with e.g. NLTK. Is this possible? Or what is the common way to do this? Or do I have to call efetch for every single pmid? I don't know how. Could you also explain to me what handle.read() does? Entrez.read(handle) I understand, because it is documented, but handle.read() not. What kind of type is a handle?

search_results = Entrez.read(Entrez.esearch(db="pubmed", term="Biopython",
                                            usehistory="y"))
batch_size = 10
for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="pubmed", rettype="xml",
                                 retstart=start, retmax=batch_size,
                                 webenv=search_results["WebEnv"],
                                 query_key=search_results["QueryKey"])
    for pmid in search_results["IdList"]:
        out_handle = open(pmid+".txt", "w")
        HERE I HAVE TO ACCESS THE ENTRY FROM THE fetch_handle FOR THE CORRESPONDING pmid
        #data = Entrez.read(fetch_handle)
        #data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
        out_handle.close()

Cheers, Silvio From biopython at maubp.freeserve.co.uk Wed Apr 28 09:57:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Apr 2010 10:57:48 +0100 Subject: [Biopython] save efetch results in different files In-Reply-To: References: Message-ID: On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote: > Hi all, > > I'd like to download hundreds of pubmed entries in one turn, but save every > entry in a single file for further processing with e.g. NLTK. > Is this possible? Or what is the common way to do this? Or do I have to call > efetch for every single pmid? I don't know how. Personally I would probably save each pubmed result to a separate file named using the pmid - a Unix filesystem should cope fine with a few thousand files in a single directory. This is simple and lets you add more entries at a later date, and you have simple access to any record. The other approach of combining separate entries into multiple files sounds overly complicated (although possible), while another approach would be a single large file containing all the records in one. These would require an index if you needed random access to the entries by pmid. > Could you also explain to me what handle.read() does? Entrez.read(handle) I > understand, because it is documented, but handle.read() not. What kind of > type is a handle? It is *like* a standard handle that you'd get in python from open(filename). This is an object supporting read() giving all the remaining data as a string, readline() giving the next line etc. Peter From laserson at mit.edu Wed Apr 28 18:49:40 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 14:49:40 -0400 Subject: [Biopython] SPARK error messages to be sent to stderr?
Message-ID: The SPARK error messages when there is a parsing problem are currently getting sent to stdout: (line 181 in Bio/Parsers/spark.py)

print "Syntax error at or near `%s' token" % token

Can this be changed to:

print >>sys.stderr, "Syntax error at or near `%s' token" % token

This way the error messages can be handled separately. Thanks! Uri From laserson at mit.edu Wed Apr 28 19:12:28 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 15:12:28 -0400 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? Message-ID: Hi, I am trying to parse a large file of EMBL records that I know has some errors in it. However, rather than having the parser break when it gets to the error, I'd rather it just skip that record, and move on to the next one. I was wondering if this functionality is already built in somewhere. One way I can do this is like this:

iterator = SeqIO.parse(ip,'embl').__iter__()
while True:
    try:
        record = iterator.next()
    # Now I specify all the parsing errors I want to catch:
    except LocationParserError:
        # Reinitialize iterator at current file position. The iterator
        # then skips to the beginning of the next record and continues.
        iterator = SeqIO.parse(ip,'embl').__iter__()
    except StopIteration:
        break

This way, whenever there is a parsing error, I just reinitialize the iterator at the current file position, and it seeks to the beginning of the next record. However, this requires me to write out the for loop manually (using StopIteration). Does anyone know of a cleaner/more elegant way of doing this? Thanks! Uri -- Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From laserson at mit.edu Wed Apr 28 21:38:52 2010 From: laserson at mit.edu (Uri Laserson) Date: Wed, 28 Apr 2010 17:38:52 -0400 Subject: [Biopython] Bug in GenBank/EMBL parser? In-Reply-To: References: Message-ID: This fixed the main problem with parsing IMGT files that have increased indentation. I also filed an additional bug/enhancement with a proposed patch, which should make biopython compatible with IMGT and still conform to the INSDC format: http://bugzilla.open-bio.org/show_bug.cgi?id=3069 Uri On Tue, Apr 27, 2010 at 05:45, Peter wrote: > On Thu, Apr 22, 2010 at 9:56 AM, Peter > wrote: > > On Thu, Apr 22, 2010 at 2:07 AM, Uri Laserson wrote: > >> Hi, > >> > >> I am trying to use the EMBL parser to parse the IMGT/LIGM flatfile (which > >> supposedly conforms to the EMBL standard). > >> > >> The short story is that whenever there is a feature, the parser checks > >> whether there are qualifiers in the feature with an assert statement, > and > >> does not allow features with no qualifiers. However, the IMGT flatfile > is > >> full of entries that have features with no qualifiers (only > coordinates). > >> > >> Who is wrong here? Does the EMBL specification require that a feature > have > >> qualifiers? Or is this a bug to be fixed in the parser. > > > > Hi Uri, > > > > Thank you for your detailed report, > > > > Since you have raised this, I went back over the EMBL documentation. > > All their example features (and from personal experience all > > EMBL files from the EMBL and GenBank files from the NCBI) do have > > qualifiers. However, in Section 7.2 they are called "Optional > qualifiers".
> > > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#7.2 > > > > So it does look like an unwarranted assumption in the Biopython > > parser (even though it has been a safe assumption on "official" EMBL > > and GenBank files thus far), which we should fix. > > Bug filed and now fixed, > http://bugzilla.open-bio.org/show_bug.cgi?id=3062 > > It turned out to be an invalid EMBL file where the features were > over-indented. Biopython was quite happy to parse valid EMBL or GenBank > files with features without qualifiers (although I don't recall seeing any > examples from EMBL or the NCBI like this). > > Peter > -- Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Wed Apr 28 22:11:43 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 28 Apr 2010 23:11:43 +0100 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? In-Reply-To: References: Message-ID: On Wednesday, April 28, 2010, Uri Laserson wrote: > Hi, > > I am trying to parse a large file of EMBL records that I know has some > errors in it. However, rather than having the parser break when it gets to > the error, I'd rather it just skip that record, and move on to the next one. > I was wondering if this functionality is already built in somewhere. One > way I can do this is like this:
>
> iterator = SeqIO.parse(ip,'embl').__iter__()
> while True:
>     try:
>         record = iterator.next()
>     # Now I specify all the parsing errors I want to catch:
>     except LocationParserError:
>         # Reinitialize iterator at current file position. The iterator
>         # then skips to the beginning of the next record and continues.
>         iterator = SeqIO.parse(ip,'embl').__iter__()
>     except StopIteration:
>         break
>
> This way, whenever there is a parsing error, I just reinitialize the > iterator at the current file position, and it seeks to the beginning of the > next record. However, this requires me to write out the for loop manually > (using StopIteration). Does anyone know of a cleaner/more elegant way of > doing this? > > Thanks! Hi Uri, There is no obvious way to handle this within the Bio.SeqIO.parse framework. I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't so corrupt that it can't be scanned to identify each record). Just wrap each record access in an error handler. Peter From cloudycrimson at gmail.com Thu Apr 29 06:58:26 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Thu, 29 Apr 2010 12:28:26 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: hello Peter, Sorry for the late reply. I am writing to thank you. The suggestions you gave were of massive help in our research by reducing the BLASTing time. Thank you for taking interest, Sincerely, Karthikaja On Mon, Apr 26, 2010 at 5:27 PM, Peter wrote: > On Mon, Apr 26, 2010 at 12:52 PM, Peter > wrote: > > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja > wrote: > >> Hi Peter, > >> > >> I will seriously consider using the stand alone blast option. And thank > you > >> so much for the links. :) I have replaced the repository. > >> > >> You suspected a problem with the sequences but they work very well when > >> given directly in the code. I have attached my fasta file. Please tell > me > >> how it works with you. > >> > >> Karthikraja.
> > > > You seem to have made a mistake with the FASTA file, there should be > > a read name on the ">" lines with the sequence on the subsequent lines. > > E.g. More like this: > > > >>Seq1 > > IMYTALPVIGKRHFRPSFTR > >>Seq2 > > RSSRGRGR > > (etc) > > > > As is, your file is valid but describes seven records each with no > sequence > > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc). > > P.S. The updated Biopython should have given you this error message: > > ValueError: Error message from NCBI: Message ID#32 Error: Query > contains no data: Query contains no sequence data > > Peter > From biopython at maubp.freeserve.co.uk Thu Apr 29 09:08:00 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 Apr 2010 10:08:00 +0100 Subject: [Biopython] save efetch results in different files In-Reply-To: References: Message-ID: On Wed, Apr 28, 2010 at 5:56 PM, Silvio Tschapke wrote: > > On Wed, Apr 28, 2010 at 11:57 AM, Peter wrote: >> >> On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote: >> > Hi all, >> > >> > I'd like to download hundreds of pubmed entries in one turn, but save >> > every entry in a single file for further processing with e.g. NLTK. >> > Is this possible? Or what is the common way to do this? Or do I have to >> > call efetch for every single pmid? I don't know how. >> >> Personally I would probably save each pubmed result to a separate file >> named using the pmid - a Unix filesystem should cope fine with a few >> thousand files in a single directory. This is simple and lets you add more >> entries at a later date, and you have simple access to any record. > > This is what I thought..to save each pubmed result to a separate file named > using the pmid, as you can see in the code snippet. > But it isn't working so far. Could you help me with the efetch_handle? I > have called efetch one time with all pmids. So the efetch_handle contains > all results. But now I need to pull out every single result from this handle > to save it in a separate file with its pmid. And I don't know how to do it. > Or isn't there another way..do I have to call efetch for every pmid and then > save it into a file inside the loop? > Because Biopython recommends to not do many queries per second I > thought it would be better to only call efetch one time for all pmids. The simplest answer is to make one efetch call per PMID, giving a single record at a time which you can save to individual files. You can still do this with the esearch+efetch history support. This does mean making many small queries to the NCBI, rather than batching them together - but the NCBI do not have any explicit guidelines on batch sizes. Note - you would be making over 100 queries, so make sure you don't run this during USA office hours! The more complex approach (which the NCBI might prefer) is to download batches of records together (e.g. 50 PMID results at once). If you wanted to save these to separate files, you would have to divide the text up yourself. I think you just need to look for lines starting "PMID-" so this shouldn't be too hard. Peter From cloudycrimson at gmail.com Fri Apr 30 10:50:08 2010 From: cloudycrimson at gmail.com (Karthik Raja) Date: Fri, 30 Apr 2010 16:20:08 +0530 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: hello Peter, I have done blast for 25 sequences and have got 10 hits for each sequence. I have stored the results in an XML file. Now I need to *parse* it and the information in the cookbook isn't helping me.
>>> from Bio.Blast import NCBIWWW
>>> result_handle = open("finaltest3.xml")
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> for blast_record in blast_records:

I am using the above code. Please tell me how to proceed to get information namely "sequence, seq id, e value and alignment". And I also have another doubt. While using qblast, is it possible to restrict the results to only human and mouse hits? If yes, it will be great if you could give me an example code or link. Sincerely, Karthik. On Thu, Apr 29, 2010 at 12:28 PM, Karthik Raja wrote: > > hello Peter, > > Sorry for the late reply. I am writing to thank you. The suggestions you > gave were of massive help in our research by reducing the BLASTing time. > Thank you for taking interest, > > Sincerely, > Karthikaja > On Mon, Apr 26, 2010 at 5:27 PM, Peter wrote: > >> On Mon, Apr 26, 2010 at 12:52 PM, Peter >> wrote: >> > On Mon, Apr 26, 2010 at 12:25 PM, Karthik Raja >> wrote: >> >> Hi Peter, >> >> >> >> I will seriously consider using the stand alone blast option. And thank >> you >> >> so much for the links. :) I have replaced the repository. >> >> >> >> You suspected a problem with the sequences but they work very well when >> >> given directly in the code. I have attached my fasta file. Please tell >> me >> >> how it works with you. >> >> >> >> Karthikraja. >> > >> > You seem to have made a mistake with the FASTA file, there should be >> > a read name on the ">" lines with the sequence on the subsequent lines. >> > E.g. More like this: >> > >> >>Seq1 >> > IMYTALPVIGKRHFRPSFTR >> >>Seq2 >> > RSSRGRGR >> > (etc) >> > >> > As is, your file is valid but describes seven records each with no >> sequence >> > (instead their names are IMYTALPVIGKRHFRPSFTR, RSSRGRGR, etc). >> >> P.S. The updated Biopython should have given you this error message: >> >> ValueError: Error message from NCBI: Message ID#32 Error: Query >> contains no data: Query contains no sequence data >> >> Peter >> > > From biopython at maubp.freeserve.co.uk Fri Apr 30 11:15:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 30 Apr 2010 12:15:05 +0100 Subject: [Biopython] Fwd: Qblast : no hits In-Reply-To: References: Message-ID: On Fri, Apr 30, 2010 at 11:50 AM, Karthik Raja wrote: > hello Peter, > > I have done blast for 25 sequences and have got 10 hits for each sequence. I > have stored the results in an XML file. Now I need to *parse* it and the > information in the cookbook isn't helping me. >
>>>> from Bio.Blast import NCBIWWW
>>>> result_handle = open("finaltest3.xml")
>>>> from Bio.Blast import NCBIXML
>>>> blast_records = NCBIXML.parse(result_handle)
>>>> for blast_record in blast_records:
>
> I am using the above code. Please tell me how to proceed to get information > namely "sequence, seq id, e value and alignment". That should be fairly clear from the tutorial - look at the section titled "The BLAST record class". > And I also have another doubt. While using qblast, is it possible to > restrict the results to only human and mouse hits? If yes, it will be great > if you could give me an example code or link. You can ask the NCBI to filter the BLAST results for you with an Entrez query, one of the optional arguments to the Biopython qblast function. Something like "mouse[ORGN] OR human[ORGN]" should work. You can try out the Entrez query on the website to make sure you have the right syntax and terms. Peter
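
As a rough illustration of the loop Peter describes above - a minimal sketch, assuming the finaltest3.xml file saved earlier in this thread, and using the attributes of the Bio.Blast record classes as populated by NCBIXML (Python 2 style, matching the code already posted on the list):

from Bio.Blast import NCBIXML

result_handle = open("finaltest3.xml")
for blast_record in NCBIXML.parse(result_handle):
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print "Query:   %s" % blast_record.query
            print "Hit id:  %s (%s)" % (alignment.hit_id, alignment.title)
            print "E-value: %g" % hsp.expect
            print hsp.query   # aligned query fragment
            print hsp.match   # match line
            print hsp.sbjct   # aligned hit fragment
result_handle.close()

For the organism restriction, the Entrez query is passed as the optional entrez_query argument when running the search online; the program, database and query filename below are placeholders, not taken from the thread:

from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastp", "nr", open("my_queries.fasta").read(),
                               entrez_query="mouse[ORGN] OR human[ORGN]")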
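
Similarly, for the earlier "save efetch results in different files" thread, a minimal sketch of the one-efetch-call-per-PMID approach Peter suggests, writing each PubMed record to a file named after its PMID. The search term, email address and the MEDLINE text rettype here are assumptions - adjust to taste:

from Bio import Entrez

Entrez.email = "your.name@example.com"  # tell the NCBI who you are

search_handle = Entrez.esearch(db="pubmed", term="Biopython", usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

# One small efetch call per PMID; each record is saved to its own file.
for pmid in search_results["IdList"]:
    fetch_handle = Entrez.efetch(db="pubmed", id=pmid,
                                 rettype="medline", retmode="text")
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle = open(pmid + ".txt", "w")
    out_handle.write(data)
    out_handle.close()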
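
And for the "Can the GenBank/EMBL parser recover from errors?" thread, a minimal sketch of the Bio.SeqIO.index pattern Peter suggests: indexing only records where each entry starts, so full parsing (and any parse error) happens when an individual record is fetched. The filename is hypothetical and ValueError is only an assumed exception class - catch whatever the broken records in your file actually raise:

from Bio import SeqIO

embl_index = SeqIO.index("imgt_ligm.embl", "embl")  # hypothetical filename

good = 0
skipped = []
for key in embl_index.keys():
    try:
        record = embl_index[key]  # full parsing happens on access
        good += 1
        # ... work with record here ...
    except ValueError, err:       # assumed error class for bad records
        skipped.append(key)
        print "Skipping %s: %s" % (key, err)
print "Parsed %i records, skipped %i" % (good, len(skipped))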