From bugzilla-daemon at portal.open-bio.org Sun Apr 3 07:30:24 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Apr 3 08:12:55 2005
Subject: [Biopython-dev] [Bug 1767] New: Bio/trie.c can crash on Windows
Message-ID: <200504031130.j33BUO4v019943@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1767

           Summary: Bio/trie.c can crash on Windows
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows 2000
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: mdehoon@ims.u-tokyo.ac.jp

In Bio/trie.c, the function strdup is being used, which is not part of the ANSI-C standard. As a result, when Bio/trie.c is compiled, the resulting trie module links to two C runtime libraries (msvcrt.dll and msvcr71.dll), which are incompatible with each other and can cause crashes. To fix this bug, we need to write our own strdup function using ANSI-C standard functions.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mdehoon at ims.u-tokyo.ac.jp Wed Apr 6 22:30:35 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 6 22:20:03 2005
Subject: [Biopython-dev] Re:
In-Reply-To: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
References: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
Message-ID: <42549B4B.8000408@ims.u-tokyo.ac.jp>

To post to biopython-dev@biopython.org, you need to subscribe first via the biopython website. This was done to stop the huge amounts of spam we were receiving earlier.

If you want to submit a patch, the best way is to create a bug report first (also via the biopython website) and then add an attachment containing the patch. Patches sent to the mailing list tend to get lost, and are sometimes rejected by the spam filter.
If you want to submit a larger piece of code, for example a new module, you can post a message to biopython-dev@biopython.org first to describe the code, and then send it to one of the developers.

Hope this helps!

--Michiel.

Glen van Ginkel wrote:
> I tried to submit code to biopython-dev@biopython.org but got a message
> telling me I was not allowed to mail them
>
> How do I go about submitting code?
>
> Thank you
>
> Glen van Ginkel

--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From gvg500 at york.ac.uk Thu Apr 7 03:13:39 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr 7 02:08:10 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504070813.39242.gvg500@york.ac.uk>

Hi Guys

As a project for my MRes in bioinformatics I have had to write a Python package that would expand Biopython's capabilities in interacting with useful programs. Here I submit code to interact with the Hmmer suite of programs. Basically, you instantiate the Hmmer object (like a Hmmer command-line builder object), build a command line, execute it and grab the results. Really simple for the user. This only works on UNIX, since Hmmer is not available on Windows platforms, and it obviously assumes you have Hmmer already installed on your PC.

I have also tried to integrate the Bio.Align.Generic alignment object so that the Hmmer object is able to handle alignment objects by writing the records of the alignment object to a temporary FASTA file.

Since I have to write this up as a report I would greatly appreciate any criticism of any kind and perhaps some suggestions as to how I might improve the code. Also, if you can think of any other ways to implement the Hmmer stuff please let me know.

At the moment I am working on a test suite for the project.
If you would like the code I have written to exercise some of the methods please let me know and I'll send it over.

I also have a bit of documentation I would like to add to the Application package because I feel it failed to help me concerning certain aspects. How would I go about this?

I look forward to hearing your suggestions.

Glen van Ginkel

From gvg500 at york.ac.uk Thu Apr 7 08:10:29 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr 7 07:25:03 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504071310.29943.gvg500@york.ac.uk>

Hi Guys

As a project for my MRes in bioinformatics I have had to write a Python package that would expand Biopython's capabilities in interacting with useful programs. Here I submit code to interact with the Hmmer suite of programs. Basically, you instantiate the Hmmer object (like a Hmmer command-line builder object), build a command line, execute it and grab the results. Really simple for the user. This only works on UNIX, since Hmmer is not available on Windows platforms, and it obviously assumes you have Hmmer already installed on your PC.

I have also tried to integrate the Bio.Align.Generic alignment object so that the Hmmer object is able to handle alignment objects by writing the records of the alignment object to a temporary FASTA file.

Since I have to write this up as a report I would greatly appreciate any criticism of any kind and perhaps some suggestions as to how I might improve the code. Also, if you can think of any other ways to implement the Hmmer stuff please let me know.

At the moment I am working on a test suite for the project. If you would like the code I have written to exercise some of the methods please let me know and I'll send it over.

I also have a bit of documentation I would like to add to the Application package because I feel it failed to help me concerning certain aspects. How would I go about this?

I look forward to hearing your suggestions.
Glen van Ginkel

Attached is the code as well as an example of a test suite using the files from Eddy2003 and an experiment file experiment.fas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Applications.py
Type: application/x-python
Size: 16558 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/Applications-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: __init__.py
Type: application/x-python
Size: 320 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/__init__-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: HmmerStandalone.py
Type: application/x-python
Size: 9139 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/HmmerStandalone-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testsuite.py Type: application/x-python Size: 2427 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/testsuite-0001.bin -------------- next part -------------- >S134211|S134211|GLOBIN - BRINE SHRIMP1 DKATIKRTWATVTDLPSFGRNVFLSVFAAK >S134212|S134212|GLOBIN - BRINE SHRIMP2 PEYKNLFVEFRNIPASELASSERLLYHGGR >S134213|S134213|GLOBIN - BRINE SHRIMP3 VLSSIDEAIAGIDTPDRAVKTLLALGERHI >S134214|S134214|GLOBIN - BRINE SHRIMP4 SRGTVRRHFEAFSYAFIDELKQRGVESADL >S134215|S134215|GLOBIN - BRINE SHRIMP5 AAWRRGWDNIVNVLEAGLLRRQIDLEVTGL >S134216|S134216|GLOBIN - BRINE SHRIMP6 SCVDVANIQESWSKVSGDLKTTGSVVFQRM >S134217|S134217|GLOBIN - BRINE SHRIMP7 INGHPEYQQLFRQFRDVDLDKLGESNSFVA From gvg500 at york.ac.uk Tue Apr 12 05:10:57 2005 From: gvg500 at york.ac.uk (Glen van Ginkel) Date: Tue Apr 12 05:07:14 2005 Subject: [Biopython-dev] Hmmer API Message-ID: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk> This is a desperate-ish plea to any biopython developer. Recently submitted an explanation of code that interacts with the Hmmer (Eddy 2003) suite of programs to the list. I am required to write up a project concerning this Hmmer API and would hugely appreciate it if ANY of the developers could give me some feed-back. Bearing in mind that this is in your spare time, I'm not asking for a major commentary. The focus of the project was to produce something that might be accepted as part of the BioPython project. What I'm looking for is just a sort of "yeah probably will be accepted but.." or "No! you've got a long way to go before we accept that gubbins, but.". 
I would really appreciate ANY response In anticipation Glen van Ginkel From fkauff at duke.edu Tue Apr 12 08:59:12 2005 From: fkauff at duke.edu (Frank Kauff) Date: Tue Apr 12 08:52:48 2005 Subject: [Biopython-dev] Hmmer API In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk> References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk> Message-ID: <1113310753.5132.5.camel@osiris.biology.duke.edu> Glen, I'd be happy to have a look at it. I was recently thinking of using Hmmer to do some alignment for one of my python scripts, so maybe your API comes in handy. I'm not too familiar with Hmmer yet, though. So I hope it comes with some example code...? And thanks for contributing to biopython! Frank On Tue, 2005-04-12 at 10:10 +0100, Glen van Ginkel wrote: > This is a desperate-ish plea to any biopython developer. > > > > Recently submitted an explanation of code that interacts with the Hmmer > (Eddy 2003) suite of programs to the list. I am required to write up a > project concerning this Hmmer API and would hugely appreciate it if ANY of > the developers could give me some feed-back. Bearing in mind that this is in > your spare time, I'm not asking for a major commentary. The focus of the > project was to produce something that might be accepted as part of the > BioPython project. What I'm looking for is just a sort of "yeah probably > will be accepted but.." or "No! you've got a long way to go before we accept > that gubbins, but.". > > > > I would really appreciate ANY response > > > > In anticipation > > > > Glen van Ginkel > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev -- Frank Kauff Dept. of Biology Duke University Box 90338 Durham, NC 27708 USA Phone 919-660-7382 Fax 919-660-7293 Web http://www.lutzonilab.net/member/frankkauff.shtml From jhackney at stanford.edu Tue Apr 12 16:20:24 2005 From: jhackney at stanford.edu (Jason A. 
Hackney) Date: Tue Apr 12 16:23:46 2005 Subject: [Biopython-dev] Hmmer API In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk> References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk> Message-ID: <6956556815154e4131d7683adf017b0b@stanford.edu> Hi Glen, I would also be willing to have a look at any code you've got so far. I've done a bit of Hmmer stuff in the past, so I'd be interested in seeing what you've got going. Cheers, Jason On Apr 12, 2005, at 2:10 AM, Glen van Ginkel wrote: > This is a desperate-ish plea to any biopython developer. > > > > Recently submitted an explanation of code that interacts with the Hmmer > (Eddy 2003) suite of programs to the list. I am required to write up a > project concerning this Hmmer API and would hugely appreciate it if > ANY of > the developers could give me some feed-back. Bearing in mind that this > is in > your spare time, I'm not asking for a major commentary. The focus of > the > project was to produce something that might be accepted as part of the > BioPython project. What I'm looking for is just a sort of "yeah > probably > will be accepted but.." or "No! you've got a long way to go before we > accept > that gubbins, but.". > > > > I would really appreciate ANY response > > > > In anticipation > > > > Glen van Ginkel > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > Jason A. 
Hackney
Postdoctoral Scholar
Department of Microbiology and Immunology
Stanford University

From idoerg at burnham.org Tue Apr 12 18:34:36 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 12 18:29:42 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <425C4CFC.5020203@burnham.org>

Hi folks,

Glen actually gave me the code, but I cannot really give him the response it merits, not within the time frame he would like. From a cursory look it is "yes, we can accept it, give us contingency tests as well". But if someone can actually run it a couple of times, and see what it's like, that would be great.

Sorry Glen, I didn't realize you were that pressed for time.

Cheers,

Iddo

Glen van Ginkel wrote:

>This is a desperate-ish plea to any biopython developer.
>
>Recently submitted an explanation of code that interacts with the Hmmer
>(Eddy 2003) suite of programs to the list. I am required to write up a
>project concerning this Hmmer API and would hugely appreciate it if ANY of
>the developers could give me some feed-back. Bearing in mind that this is in
>your spare time, I'm not asking for a major commentary. The focus of the
>project was to produce something that might be accepted as part of the
>BioPython project. What I'm looking for is just a sort of "yeah probably
>will be accepted but.." or "No! you've got a long way to go before we accept
>that gubbins, but.".
>
>I would really appreciate ANY response
>
>In anticipation
>
>Glen van Ginkel
>
>_______________________________________________
>Biopython-dev mailing list
>Biopython-dev@biopython.org
>http://biopython.org/mailman/listinfo/biopython-dev

--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9930
http://ffas.ljcrf.edu/~iddo

From idoerg at burnham.org Thu Apr 14 20:53:47 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu Apr 14 20:47:25 2005
Subject: [Biopython-dev] Speakers for BOSC needed
Message-ID: <425F109B.8040507@burnham.org>

Hi all,

It's that time of year again, and BOSC 2005 will be happening on June 23-24. The more Biopython representatives, the merrier. I will be around, but I will be dealing with my own SIG meeting, so I will not be able to give a talk. Is there someone who can give the BioPython "plenary"? It should be a 30-40 minute talk. Also, there are slots for shorter talks, so if you contributed an interesting module, or had an interesting experience with biopython you would like to share, please submit a talk.

For those of you who do not know what BOSC is, it's the Bioinformatics Open Source Conference, which is held as a satellite meeting of ISMB. I highly recommend this event; it is a real eye opener with respect to the world of open source and computational biology. More about it here: http://open-bio.org/bosc/

Cheers,

Iddo

--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
http://ffas.ljcrf.edu/~iddo
==========================
The First Automated Protein Function Prediction SIG
Detroit, MI
June 24, 2005
http://ffas.burnham.org/AFP

From idoerg at burnham.org Tue Apr 19 13:29:16 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 19 13:22:41 2005
Subject: [Biopython-dev] BOSC 2005
Message-ID: <42653FEC.2010907@burnham.org>

{Please pass the word!}

SECOND CALL FOR SPEAKERS

The 6th annual Bioinformatics Open Source Conference (BOSC'2005) is organized by the not-for-profit Open Bioinformatics Foundation.
The meeting will take place June 23-24, 2005 in Detroit, Michigan, USA, and is one of several Special Interest Group (SIG) meetings occurring in conjunction with the 13th International Conference on Intelligent Systems for Molecular Biology. see http://www.iscb.org/ismb2005 for more information. Because of the power of many Open Source bioinformatics packages in use by the Research Community today, it is not too presumptuous to say that the work of the Open Source Bioinformatics Community represents the cutting edge of Bioinformatics in general. This has been repeatedly demonstrated by the quality of presentations at previous BOSC conferences. This year, at BOSC 2005, we want to continue this tradition of excellence, while presenting this message to a wider part of the Research Community. Please, pass this message on to anyone you know that is interested in Bioinformatics software. BOSC PROGRAM & CONTACT INFO * Web: http://www.open-bio.org/bosc2005/ * Online Registration: https://www.cteusa.com/iscb4/ * Email: bosc@open-bio.org FEES * Corporate : $195 ($245 after May 16th) * Academic : $170 ($220 after May 16th) * Student : $145 ($195 after May 16th) SPEAKERS & ABSTRACTS WANTED The program committee is currently seeking abstracts for talks at BOSC 2005. BOSC is a great opportunity for you to tell the community about your use, development, or philosophy of open source software development in bioinformatics. The committee will select several submitted abstracts for 25-minute talks and others for shorter "lightning" talks. Accepted abstracts will be published on the BOSC web site. If you are interested in speaking at BOSC 2005, please send us before April 26, 2005: * an abstract (no more than a few paragraphs) * a URL for the project page, if applicable * information about the open source license used for your software or your release plans. Abstracts will be accepted for submission until April 26, 2005. 
Abstracts chosen for presentation will be announced May 12, 2005 (before the ISMB Early Registration Deadline). LIGHTNING-TALK SPEAKERS WANTED! The program committee is currently seeking speakers for the lightning talks at BOSC 2005. Lightning talks are quick - only five minutes long - and a great opportunity for you to give people a quick summary of your open source project, code, idea, or vision of the future. If you are interested in giving a lightning talk at BOSC 2005, please send us: * a brief title and summary (one or two lines) * a URL for the project page, if applicable * information about the open source license used for your software or your release plans. We will accept entries on-line until BOSC starts, but space for demos and lightning talks is limited.
Iddo Friedberg wrote: > Glen actually gave me the code, but I cannot really give him the > response it merits, not within th time frame he would like. From a > cursory look it is "yes, we can accept it, give us contingency tests as > well". But if soeone can actually run it a couple of times, and see what > it's like, that would be great. I'm not a hmmer user either, so it's hard to assess the code in great detail. What I like about the code is that it has extensive documentation (in the source code) and makes use of and is well integrated with the existing Biopython software. Since Hmmer is a rather standard bioinformatics tool, I think Biopython should support it. So I vote to accept this into Biopython. Once users start using Bio.Hmmer, maybe some issues with the code will show up, but then these users will be more familiar with Hmmer than we are, and can give more useful advice. About the documentation, it is quite extensive in the source code (which is fine too), but it would be nice to have some documentation outside of the source code. For example, this is something I wrote a while ago for Bio.LogisticRegression: http://www.biopython.org/docs/cookbook/LogisticRegression.html With such a documentation, the code will be more accessible. Without such documentation, people have to look through the CVS tree to search for the Hmmer package, or may not even notice it. --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From bugzilla-daemon at portal.open-bio.org Fri Apr 22 08:33:28 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Fri Apr 22 09:19:33 2005 Subject: [Biopython-dev] [Bug 1771] New: need some file from some xml module ? 
Message-ID: <200504221233.j3MCXSc7007748@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771

           Summary: need some file from some xml module ?
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: jeroen@xdh.nl

Hi,

Using the SProt.py module of version 1.40b, I got this:

  File "stepsget035.py", line 98, in list_swissprot_gpcrs
    from Bio.SwissProt import SProt
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SwissProt/SProt.py", line 39, in ?
    from Bio import SeqRecord
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SeqRecord.py", line 7, in ?
    from Bio import FormatIO
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/FormatIO.py", line 2, in ?
    from xml.sax import saxutils

Before, I used 1.24 and that gave no such error/dependency/bug/dunno-what-it-is, probably cuz I didn't need to import anything from some xml module.

Thanks for any help,
Jeroen

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 22 09:31:27 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 10:18:29 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221331.j3MDVRhb008729@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771

------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-22 09:31 -------
Which Python version are you using? See:

samma{mdehoon}8: python1.5
Python 1.5.2 (#1, Nov 28 2001, 02:33:46) [GCC 2.95.3 20010315 (release)] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> from xml.sax import saxutils
Traceback (innermost last):
  File "<stdin>", line 1, in ?
ImportError: No module named xml.sax
>>> ^D
samma{mdehoon}9: python2.2
Python 2.2.2 (#1, Jan 24 2003, 17:26:30)
[GCC 2.95.3 20010315 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.sax import saxutils
>>> ^D
samma{mdehoon}10:

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 22 10:46:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 11:22:41 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221446.j3MEkv9D009884@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771

jeroen@xdh.nl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Additional Comments From jeroen@xdh.nl 2005-04-22 10:46 -------
Yep, there was something wrong with the sax module of my python (version 2.3) installation, it has been fixed now.

thanks,
Jeroen

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
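
The interactive check shown in the comment above can also be scripted, so that it runs under whichever interpreter Biopython was built for. This is only a minimal sketch, not part of Biopython: the helper name `sax_available` is hypothetical, and it assumes nothing beyond the `from xml.sax import saxutils` import that appears in the reporter's traceback.

```python
import sys


def sax_available():
    """Return True if xml.sax.saxutils can be imported and used."""
    try:
        # The same import that fails in the traceback from Bio/FormatIO.py.
        from xml.sax import saxutils
    except ImportError:
        return False
    # Exercise one function so a present-but-broken module is also caught.
    return saxutils.escape("a<b") == "a&lt;b"


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("xml.sax usable:", sax_available())
```

Running the script with each installed interpreter in turn shows which of them has a working xml.sax package.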
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 16:45:31 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Fri Apr 22 17:18:28 2005 Subject: [Biopython-dev] [Bug 1772] New: Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record Message-ID: <200504222045.j3MKjVfL014797@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1772 Summary: Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P2 Component: Other AssignedTo: biopython-dev@biopython.org ReportedBy: dhendrix@compbio.berkeley.edu In the future, please run your code on more pdb files before you release it! I have a pdb file with no ATOM records, just HETATMs. So when I use parse_pdb_header to read the header, it runs until I'm out of memory or I (or the os) kill it, because it is reading in the header as anything that occurs before an ATOM record, and there ain't one! The pdbID is 1PBL. While it is unusual that there are no ATOM records, it can definitely occur! Also, there is the annoying printing of nonstandard resolution NOT APPLICABLE. for every NMR structure. Why???? THANK YOU! Donna ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 20:35:33 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Fri Apr 22 21:18:33 2005 Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record Message-ID: <200504230035.j3N0ZW5H017089@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1772 idoerg@burnham.org changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 22 20:36:52 2005 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Fri Apr 22 21:18:37 2005 Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record Message-ID: <200504230036.j3N0aqVG017107@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1772 idoerg@burnham.org changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |WORKSFORME ------- Additional Comments From idoerg@burnham.org 2005-04-22 20:36 ------- User reported using old version of Biopython. I checked it against 1.40b and CVS versions, could not duplicate. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 25 17:22:50 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Apr 25 18:21:05 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record
Message-ID: <200504252122.j3PLMoIh022842@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

dhendrix@compbio.berkeley.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |

------- Additional Comments From dhendrix@compbio.berkeley.edu 2005-04-25 17:22 -------
Thank you so much for the quick turnaround. Iddo Friedberg suggested that I use the top level of the CVS. I updated python and my biopython (and NumPy, etc...) and encountered the same behavior, as well as another little bug that I had fixed a while ago. Here are my diffs to parse_pdb_header.py, which are small but vital for me to get parse_pdb_header working for me.

122c122
< f=open(file,'r')
---
> f=open(filename,'r')
127c127
< if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
> if not re.search("\AATOM",l):

I can send you a little test program that fails on the current version of parse_pdb_header, if that will help. Regardless, I think it's a good idea to stop reading the file when you reach the end of it!

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
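
The behaviour asked for in this report, stopping at the first coordinate record or at the end of the file, whichever comes first, can be sketched as below. This is not the parse_pdb_header code itself: `split_header` is a hypothetical helper, and treating HETATM and MODEL as end-of-header markers alongside ATOM is an assumption based on the PDB format, where the record name fills the first six columns and is padded with spaces.

```python
import io

# Record names are six characters wide, so "ATOM" carries two trailing spaces.
END_OF_HEADER = ("ATOM  ", "HETATM", "MODEL ")


def split_header(handle):
    """Collect lines up to the first coordinate record or the end of input."""
    header = []
    # Iterating over the handle ends at EOF, so the loop cannot run forever
    # on a file that has no ATOM record at all.
    for line in handle:
        if line[0:6] in END_OF_HEADER:
            break
        header.append(line)
    return header


# Made-up entry with HETATM records only, as in the reported case.
example = io.StringIO(
    "HEADER    EXAMPLE ENTRY\n"
    "REMARK   2 RESOLUTION. NOT APPLICABLE.\n"
    "HETATM    1  C1  LIG A   1       0.000   0.000   0.000\n"
    "END\n"
)
print(len(split_header(example)))  # prints 2
```

Comparing the fixed-width record name avoids a regular expression per line, and the for loop gives the end-of-file guarantee without an explicit END check.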
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 03:49:37 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 04:20:11 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record
Message-ID: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-26 03:49 -------
I have accepted the first patch in CVS:

122c122
< f=open(file,'r')
---
> f=open(filename,'r')

But the second part doesn't seem right:

127c127
< if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
> if not re.search("\AATOM",l):

It will append HETATM lines to the header. Instead we can use

if not re.search("\AATOM",l) and not re.search("\AHETATM",l):

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From thamelry at binf.ku.dk Tue Apr 26 04:40:06 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Tue Apr 26 05:00:32 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record
In-Reply-To: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
References: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
Message-ID: <32883.83.92.3.59.1114504806.squirrel@www.binf.ku.dk>

> It will append HETATM lines to the header. Instead we can use
> if not re.search("\AATOM",l) and not re.search("\AHETATM",l):

Or even simpler:

record_type=l[0:6]
if record_type=='ATOM  ' or record_type=='HETATM' or record_type=='MODEL ':
    break
else:
    header.append(l)

Note that MODEL can also signal the end of the header.

I'll add it to the CVS.
Cheers,

-Thomas

From bugzilla-daemon at portal.open-bio.org Tue Apr 26 06:35:59 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 08:21:59 2005
Subject: [Biopython-dev] [Bug 1773] New: Martel.Parser.ParserPositionException
Message-ID: <200504261035.j3QAZx3P031804@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1773

           Summary: Martel.Parser.ParserPositionException
           Product: Biopython
           Version: Not Applicable
          Platform: PC
               URL: http://portal.open-bio.org/pipermail/biopython/2005-April/002604.html
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Martel/Mindy
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: marc.saric@gmx.de

I tried to index the following GenBank file with this simple script (as described in the cookbook) but it failed with the following traceback.

The file can be found at:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=BA000018

including all features (SNP, CDD, MGC, HPRD, STS)

I tried with Biopython 1.3, Biopython 1.4b and the Biopython-CVS as of 2005-04-15.

The program

===snip===
#!/usr/bin/env python
from Bio import GenBank

dict_file = "ba000018_s_aureus_n315_genome.gb"
index_file = "ba000018_s_aureus_n315_genome.idx"
GenBank.index_file(dict_file, index_file)
===snap===

The Traceback:

===snip===
Traceback (most recent call last):
  File "/home/saric/data/devel/workspace/scripts/hitman/index_gb.py", line 37, in ?
    GenBank.index_file(dict_file, index_file) # FIXME: This breaks with the N315 S.aureus-genome
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/GenBank/__init__.py", line 1283, in index_file
    SimpleSeqRecord.create_flatdb([filename], indexname, indexer)
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/SimpleSeqRecord.py", line 152, in create_flatdb
    creator.load(filename, builder = builder, fileid_info = {})
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/BaseDB.py", line 52, in load
    for record in iterator.iterate(source, cont_handler = builder):
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Martel/IterParser.py", line 71, in iterateFile
    raise Parser.ParserPositionException(self.start_position)
Martel.Parser.ParserPositionException: error parsing at or beyond character 5887615
===snap===

I use an x86 machine running SuSE Linux 9.1 (kernel 2.6.5-7.147-default, gcc version 3.3.3).

The error is most likely due to a trailing blank line in the GenBank file, which is there for all "official" downloads I checked (see my post on the Biopython mailing list (link above)).

Either this is a bug in GenBank (delivering invalid files) or something minor in the Martel parser, which does not like blank lines at the end of the file.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
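
If the trailing blank line really is the trigger, which the report only suspects, one workaround is to write a cleaned copy of the file before indexing it. A minimal sketch using only the standard library follows; `strip_trailing_blank_lines` is a hypothetical helper, not a Biopython function, and the sample record is made up for the demonstration.

```python
import os
import tempfile


def strip_trailing_blank_lines(src_path, dst_path):
    """Copy src_path to dst_path without blank lines at the end of the file.

    Trailing whitespace on the last non-blank line is removed as well.
    Returns the number of characters dropped.
    """
    with open(src_path, "r") as src:
        text = src.read()
    stripped = text.rstrip()
    # Keep exactly one newline after the last non-blank line.
    cleaned = stripped + "\n" if stripped else ""
    with open(dst_path, "w") as dst:
        dst.write(cleaned)
    return len(text) - len(cleaned)


# Demonstration on a tiny stand-in file that ends in two blank lines.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.gb")
clean = os.path.join(workdir, "clean.gb")
with open(raw, "w") as handle:
    handle.write("LOCUS       EXAMPLE\n//\n\n\n")

removed = strip_trailing_blank_lines(raw, clean)
with open(clean) as handle:
    cleaned_text = handle.read()
print(repr(cleaned_text), removed)
```

The cleaned copy could then be passed to GenBank.index_file in place of the original download. Whether that avoids the ParserPositionException is untested here.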
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 11:57:16 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 12:19:59 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record
Message-ID: <200504261557.j3QFvGrQ004138@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

------- Additional Comments From dhendrix@compbio.berkeley.edu 2005-04-26 11:57 -------
Of course it appends HETATM (and CONECT) records, but at least it stops reading the file :-). I'm fine with your change, as long as it recognizes the end of the file (or header).... THANK YOU!

From bugzilla-daemon at portal.open-bio.org Tue Apr 26 12:35:40 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:22 2005
Subject: [Biopython-dev] [Bug 1774] New: Bio.Clustalw: bug in computing the alignment.
Message-ID: <200504261635.j3QGZeP9004533@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1774

Summary: Bio.Clustalw: bug in computing the alignment.
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev@biopython.org
ReportedBy: crober@scri.ac.uk

When Bio.Clustalw is used to compute a sequence alignment, the alignment fails if more than one sequence in the input file has the same name. Bio.Clustalw does not report this error if the output file you specify already exists. In that case, Bio.Clustalw just reads the results from the existing output file rather than reporting the error.

#------- program.py
#! /usr/bin/python2.4
import sys
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL

# Build the clustalw command line from the input and output file names
cline = MultipleAlignCL(sys.argv[1])
cline.set_output(sys.argv[2])
# Run clustalw and read the resulting alignment
align = Clustalw.do_alignment(cline)

#---------- input.fas
>Putative binding site
ggaacggatgctcgcccagttccaccaacg
>Putative binding site
ggaacccatccttttctgcgtccacacagc
>Putative promoter inside
ggaacaggtgtttcgtcaacacgga
>Putative binding site
ggaacaaacacaactactgcactat

#------- command line to start creating the alignment
$ python2.4 program.py input.fas output.fas

#-------- ERROR MESSAGE when running clustalw as follows:
$ clustalw input.fas -outfile=output.fas
ERROR: Multiple sequences found with same name, Putative (first 30 chars are significant)
No. of seqs. read = 0. No alignment!

From bugzilla-daemon at portal.open-bio.org Tue Apr 26 13:06:33 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:29 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops parsing if there is no ATOM record
Message-ID: <200504261706.j3QH6XSe004914@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

thamelry@binf.ku.dk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

------- Additional Comments From thamelry@binf.ku.dk 2005-04-26 13:06 -------
Parsing header now stops on ATOM, HETATM, MODEL or EOF

From bugzilla-daemon at portal.open-bio.org Wed Apr 27 01:13:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Apr 27 01:20:36 2005
Subject: [Biopython-dev] [Bug 1774] Bio.Clustalw: bug in computing the alignment.
Message-ID: <200504270513.j3R5DPBN012863@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1774 mdehoon@ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-27 01:13 ------- Fixed in CVS, thanks. The bug was caused by interpreting the return value of os.popen as the exit status instead of the termination status. The exit status is the second byte of the termination status, so we need to divide the result of os.popen by 256 to get the exit status. For example, if the C code of a program a.out contains int main(void) { int status; ... return status; } then os.popen("a.out").close() returns status*256 instead of status. Same goes for os.system. This is also true at the C-level, so this is not a Python bug. Since we have calls to os.popen and os.system in various places in Biopython, the same bug may appear elsewhere also. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at ims.u-tokyo.ac.jp Wed Apr 27 02:18:19 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Wed Apr 27 02:07:11 2005 Subject: [Biopython-dev] Rethinking Seq objects Message-ID: <426F2EAB.3000709@ims.u-tokyo.ac.jp> Hi everybody, For my research, I tend to work a lot with sequences, but I find myself not using Bio.Seq much. I'd like to propose some changes to make sequence objects more useful. I'd be happy to hear comments from the other developers, in particular the original developers who probably thought this through much more than I have. There are five changes I'd like to propose: 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the MutableSeq class basically describe the same thing, except that one is read-only and the other one is not. 
If desired, we can add a readonly flag to the class to describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't have such a flag, my feeling is that it is not really needed for Seq objects either). 2) Make Seq objects a bit smarter about which type of sequence they contain. One reason I don't use Bio.Seq much is that I have to write >>> from Bio.Alphabet import IUPAC >>> my_alpha = IUPAC.unambiguous_dna >>> from Bio.Seq import Seq >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) which is too much typing. I am thinking about the following scheme when initializing a Seq object: - If the user specifies my_alpha, accept that alphabet. Raise an error if the sequence is not consistent with the alphabet - Assume the sequence is an unambiguous DNA sequence - If the sequence contains any characters other than ATCG, assume it is unambiguous RNA, otherwise accept the sequence - If the sequence contains any characters other than AUCG, assume it is a protein, otherwise accept the sequence - If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWY, assume it is ambiguous DNA, otherwise accept the sequence - If the sequence contains any characters other than GATCRYWSMKHBVDN, assume it is ambiguous RNA, otherwise accept the sequence - If the sequence contains any characters other than GAUCRYWSMKHBVDN, assume it is an extended protein sequence, otherwise accept the sequence - If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWYBXZ, yell at the user. 3) When changing a sequence, check if it is still consistent with the alphabet. Right now, we can do >>> from Bio.Seq import * >>> from Bio.Alphabet import IUPAC >>> my_alpha = IUPAC.unambiguous_dna >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >>> my_seq[:10] = "weirdstuff" >>> my_seq MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), IUPACUnambiguousDNA()) 4) Make Seq objects understand circular genomes. 
Many bacterial genomes are circular. It would be nice if we could take the indices [-1000:1000] from a Seq object, if it is circular, or [3999000:40001000] if the sequence is circular with length 4000000. 5) Perhaps it would be a good idea to add transcribe and translate methods to the Seq class. Currently, to translate a DNA sequence, we have to do >>> from Bio.Seq import Seq >>> from Bio import Translate >>> from Bio.Alphabet import IUPAC >>> my_alpha = IUPAC.unambiguous_dna >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >>> standard_translator = Translate.unambiguous_dna_by_id[1] >>> standard_translator.translate(my_seq) Seq('AIVMGR*KGAR', IUPACProtein()) which is too much typing for my taste. Any thoughts/comments/suggestions? --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From thamelry at binf.ku.dk Wed Apr 27 08:09:28 2005 From: thamelry at binf.ku.dk (thamelry@binf.ku.dk) Date: Wed Apr 27 08:05:56 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp> References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> Message-ID: <33165.83.92.3.59.1114603768.squirrel@www.binf.ku.dk> Hi Michiel, > Any thoughts/comments/suggestions? I happen to be doing some sequence stuff myself at the moment and I couldn't agree more with the points you raised. It could all be a lot more straightforward! Cheers, -Thomas From hoffman at ebi.ac.uk Wed Apr 27 08:37:03 2005 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Wed Apr 27 08:37:45 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp> References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> Message-ID: On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote: > 1) Make Seq objects mutable, and get rid of MutableSeq. 
I imagine it will be a lot slower to replace built-in strings with character arrays. Right now, I only use Seq when I absolutely have to. Personally, I'd love it if Seq were just a light-weight subclass of str without the performance penalties of the existing Seq. Using a Surrogate pattern slows down all those inner loops a lot. Also lots of unnecessary input-checking does as well. I think performance should be a concern when you are talking about what should be the most-used part of the library. Similarly, I think lots of magic trying to figure out the alphabet is a bad idea. There are only a few operations that actually require the alphabet to be known, and most of the time I store a sequence in memory I'm not going to need any of these, so having to deal with alphabet issues when it's unnecessary is just going to be a pain in the butt that will keep me from using Seq. Similarly, I use augmented alphabets with things like B in them and I don't want Seq yelling at me when there's no point. Sure, if it can't figure out how to revcom the sequence, but just to instantiate it? I think these principles from the Zen of Python would be well-considered here: Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Sparse is better than dense. Readability counts. In the face of ambiguity, refuse the temptation to guess. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. > > > > Right now, we can do > > > > from Bio.Seq import * > > > > from Bio.Alphabet import IUPAC > > > > my_alpha = IUPAC.unambiguous_dna > > > > my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) > > > > my_seq[:10] = "weirdstuff" > > > > my_seq > MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), > IUPACUnambiguousDNA()) "Doctor, it hurts when I do this." "Don't do that." > 4) Make Seq objects understand circular genomes. Many bacterial genomes are > circular. 
It would be nice if we could take the indices [-1000:1000] from a > Seq object, if it is circular, or [3999000:40001000] if the sequence is > circular with length 4000000. I'm sure that will be useful to some people. But having a CircularSeq subclass would make it easier to avoid this extra functionality from impacting on the primary use case. > 5) Perhaps it would be a good idea to add transcribe and translate methods to > the Seq class. +1 You would obviously have to specify an alphabet for this, but I'm fine with that so long as I'm not forced to when I don't need to. -- Michael Hoffman European Bioinformatics Institute From mdehoon at ims.u-tokyo.ac.jp Wed Apr 27 23:29:33 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Wed Apr 27 23:17:53 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> Message-ID: <4270589D.2090500@ims.u-tokyo.ac.jp> Michael Hoffman wrote: > On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote: > >> 1) Make Seq objects mutable, and get rid of MutableSeq. > > I imagine it will be a lot slower to replace built-in strings with > character arrays. Right now, I only use Seq when I absolutely have to. Well I wouldn't replace them with character arrays, the idea would be to reimplement the Seq class in C. So it would not be slower than built-in strings, maybe even a bit faster. The Seq object would look like a string object, but be mutable. > Similarly, I think lots of magic trying to figure out the alphabet is > a bad idea. There are only a few operations that actually require the > alphabet to be known, and most of the time I store a sequence in > memory I'm not going to need any of these, so having to deal with > alphabet issues when it's unnecessary is just going to be a pain in > the butt that will keep me from using Seq. Similarly, I use augmented > alphabets with things like B in them and I don't want Seq yelling at > me when there's no point. 
Sure, if it can't figure out how to revcom > the sequence, but just to instantiate it? OK, then how about this: - By default, don't assume a particular alphabet. Same as how it works now: >>> from Bio.Seq import * >>> Seq('ATCG') Seq('ATCG', Alphabet()) - If the user decides to specify the alphabet, make sure the sequence is consistent with it. Of course, if the alphabet is Alphabet(), don't do any input checking. So essentially, the user gets to decide whether she wants input checking for the sequence or not. >> >>> Right now, we can do >> >>> from Bio.Seq import * >> >>> from Bio.Alphabet import IUPAC >> >>> my_alpha = IUPAC.unambiguous_dna >> >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >> >>> my_seq[:10] = "weirdstuff" >> >>> my_seq >> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), >> IUPACUnambiguousDNA()) > > "Doctor, it hurts when I do this." > "Don't do that." Well you would be right if this were Biofortran. For a higher-level language, I would expect better checking to make sure an object is self-consistent. Python itself is full of checks and assertions. Another option would be to get rid of alphabets altogether. What good are they otherwise? >> 4) Make Seq objects understand circular genomes. Many bacterial >> genomes are circular. It would be nice if we could take the indices >> [-1000:1000] from a Seq object, if it is circular, or >> [3999000:40001000] if the sequence is circular with length 4000000. > > I'm sure that will be useful to some people. But having a CircularSeq > subclass would make it easier to avoid this extra functionality from > impacting on the primary use case. My feeling is that having a subclass is a bit of an overkill. The idea is to have an optional topology argument, which defaults to "linear". So the primary use case would not be affected. > >> 5) Perhaps it would be a good idea to add transcribe and translate >> methods to the Seq class. 
> > +1 > > You would obviously have to specify an alphabet for this, but I'm fine > with that so long as I'm not forced to when I don't need to. If the alphabet defaults to Alphabet() when creating a Seq object, then I'd think the transcribe and translate methods should work even if a user doesn't specify the sequence to be DNA or RNA. My current gripe with the Seq object is that there are too many steps to translate a DNA sequence. --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From Frederic.Sohm at iaf.cnrs-gif.fr Thu Apr 28 03:35:01 2005 From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm) Date: Thu Apr 28 03:28:19 2005 Subject: [Biopython-dev] Rethinking Seq objects Message-ID: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> Hi, I was following your discussion on the Seq object. I more or less agree with Michiel. But some thoughts: 1) get rid of MutableSeq and make all Seq mutable. Will it not be a problem for some people there? I mean I only use MutableSeq so no problem there for me but I assume that someone uses non-mutable Seq or is it a feature which is not needed? 2) Checking the alphabet. Yes. Good. With Michael's remark: do not force it on people who don't want it. It can be painful for really long sequences. 4) circular sequences and indices. Nice. From experience, not so easy to implement correctly though. 5) Translate and transcribe. Yes, obviously a good thing. If you are interested you can have a look at the DNA object in rana. If you want it, take it under a Biopython licence. It's only DNA and it's certainly not worth using it but it can give you some ideas. A lot of complexity is added by supporting a biological indexation [1:len] rather than a python one [0:len-1]. This would not be a sensible thing to do in biopython.
The DNA object is the C implementation of the Python string, modified to add alphabet checking (a modification of the string translate() method with the alphabet hard-coded in) and support for circular sequences. If you want, have a look at the DNA object for rana here: http://cvs.sourceforge.net/viewcvs.py/rana/rana/Rana/c_extension/DNAdata.c?rev=1.9&view=markup The code is pretty bad, well my code not the python one. Fred -- Frédéric Sohm Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés" UPR 2197 DEPSN, CNRS Institut de Neurosciences A. Fessard 1 Avenue de la Terrasse 91 198 GIF-SUR-YVETTE FRANCE Phone: +33 (0) 1 69 82 34 12 Fax:+33 (0) 1 69 82 34 47 From hoffman at ebi.ac.uk Thu Apr 28 04:56:51 2005 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Thu Apr 28 04:50:05 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <4270589D.2090500@ims.u-tokyo.ac.jp> References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> <4270589D.2090500@ims.u-tokyo.ac.jp> Message-ID: On Thu, 28 Apr 2005, Michiel Jan Laurens de Hoon wrote: > Michael Hoffman wrote: > >> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote: >> >> > 1) Make Seq objects mutable, and get rid of MutableSeq. >> >> I imagine it will be a lot slower to replace built-in strings with >> character arrays. Right now, I only use Seq when I absolutely have to. > Well I wouldn't replace them with character arrays, the idea would be to > reimplement the Seq class in C. So it would not be slower than built-in > strings, maybe even a bit faster. The Seq object would look like a string > object, but be mutable. If you can make a sequence class that is faster than the current built-in string, I would suggest you submit a patch to the Python tracker to make it a replacement for the current built-in string. :P > OK, then how about this: > - By default, don't assume a particular alphabet.
Same as how it works now: > > > > from Bio.Seq import * > > > > Seq('ATCG') > Seq('ATCG', Alphabet()) +1 >> > > > > my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >> > > > > my_seq[:10] = "weirdstuff" >> > > > > my_seq >> > MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), >> > IUPACUnambiguousDNA()) >> >> "Doctor, it hurts when I do this." >> "Don't do that." > > Well you would be right if this were Biofortran. For a higher-level language, > I would expect better checking to make sure an object is self-consistent. > Python itself is full of checks and assertions. How often have you actually put "weirdstuff" into the middle of a MutableSeq? Have you ever done this, or are you just imagining that it might happen? You Aren't Gonna Need It. The number of checks and assertions you can make is limitless and you have to know where to draw the line. To me, the line should be drawn at user input, but not at every internal change to a sequence made within a program. Maybe optional alphabet checking would help with this. > Another option would be to get rid of alphabets altogether. What good are > they otherwise? They're useful for transcription/translation/reverse complement operations. And as far as I'm concerned, that's a good place to do error checking, should it be necessary. >> But having a CircularSeq subclass would make it easier to avoid >> this extra functionality from impacting on the primary use case. > > My feeling is that having a subclass is a bit of an overkill. The idea is to > have an optional topology argument, which defaults to "linear". So the > primary use case would not be affected. If you're doing this in C, then my performance assumptions are perhaps incorrect. I wouldn't want every slice of my linear sequence to have to go through "is this circular?" logic in Python. 
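The circular indexing discussed in this exchange, where [-1000:1000] or [3999000:4001000] wraps around the origin of a circular genome, can be sketched as a standalone function. This is a minimal illustration under stated assumptions: circular_slice is a hypothetical name, not a Biopython function, it is written for a current Python 3 interpreter, and it takes no side between a topology argument and a CircularSeq subclass. Note that it deliberately departs from Python's usual slice semantics, under which a plain string slice [-2:2] is empty.

```python
def circular_slice(seq, start, stop):
    """Return seq[start:stop], treating seq as circular.

    start may be negative and stop may exceed len(seq); the slice may
    cover at most one full turn of the circle.
    """
    n = len(seq)
    length = stop - start
    if not 0 <= length <= n:
        raise ValueError("slice must cover between 0 and len(seq) positions")
    if length == 0:
        return seq[:0]
    start %= n                  # map the start onto the range 0..n-1
    end = start + length
    if end <= n:
        return seq[start:end]   # no wrap-around needed
    return seq[start:] + seq[:end - n]
```

For a ten-letter sequence "ABCDEFGHIJ", both circular_slice(seq, -2, 2) and circular_slice(seq, 8, 12) give the same four letters spanning the origin. The extra arithmetic on every call is the kind of per-slice cost that the reply above is concerned about for linear sequences.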
> If the alphabet defaults to Alphabet() when creating a Seq object, > then I'd think the transcribe and translate methods should work even > if a user doesn't specify the sequence to be DNA or RNA. My current > gripe with the Seq object is that there are too many steps to > translate a DNA sequence. Good point. Perhaps a warning when it has to guess? -- Michael Hoffman European Bioinformatics Institute From hoffman at ebi.ac.uk Thu Apr 28 04:59:51 2005 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Thu Apr 28 05:14:23 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> Message-ID: On Thu, 28 Apr 2005, Frédéric Sohm wrote: > 1) get rid of MutableSeq and make all Seq mutable. > Will it not be a problem for some people there? I mean I only use MutableSeq so > no problem there for me but I assume that someone uses non-mutable Seq or is it a > feature which is not needed? In the rest of CPython, immutable objects have two benefits: they are more memory-efficient (and sometimes space-efficient), and they are hashable. I don't think Seqs are usefully hashable right now, and Michiel says he will code the new Seq such that there won't be a significant performance impact. -- Michael Hoffman European Bioinformatics Institute From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:05:14 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Fri Apr 29 00:53:15 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> Message-ID: <4271C08A.3050306@ims.u-tokyo.ac.jp> Frédéric Sohm wrote: > 1) get rid of MutableSeq and make all Seq mutable. > Will it not be a problem for some people there? I mean I only use MutableSeq so > no problem there for me but I assume that someone uses non-mutable Seq or is it a > feature which is not needed?
As far as I can tell, the only way in which a mutable Seq may affect a user is in terms of performance, as Michael pointed out. But anyway, as soon as we reach some conclusion on Seq/MutableSeq on biopython-dev, I'll send a message to the biopython mailing list to see if any of the users will get into problems because of this. Also, I expect that MutableSeq will be around for some time as a deprecated class. --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:15:28 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Fri Apr 29 01:03:21 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> <4270589D.2090500@ims.u-tokyo.ac.jp> Message-ID: <4271C2F0.7070608@ims.u-tokyo.ac.jp> Michael Hoffman wrote: > On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote: >> Another option would be to get rid of alphabets altogether. What good >> are they otherwise? > > They're useful for transcription/translation/reverse complement > operations. And as far as I'm concerned, that's a good place to do > error checking, should it be necessary. > For transcription and translation, we don't need to know the alphabet. Effectively, by calling translate or transcribe, the user is telling us that the input sequence object is DNA or RNA, and that the output sequence is RNA (for transcription) or protein (for translation). Of course, when a character other than ACGTU is encountered, we need to raise an error. But the point is that knowing the Alphabet doesn't tell us anything we don't already know. For reverse complement, we also don't need to know the alphabet; it is either DNA or RNA. The only exception is when a user wants to reverse complement a sequence that does not contain a T or a U. 
But the current situation, where we have IUPACProtein, ExtendedIUPACProtein, IUPACAmbiguousDNA, IUPACUnambiguousDNA, ExtendedIUPACDNA, IUPACAmbiguousRNA, IUPACUnambiguousRNA alphabets, is overkill. It would be much easier to have a reverse_complement and a rna_reverse_complement function (or something like that). So I still don't see any use for alphabets other than input checking. Or am I missing something here? --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:20:08 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Fri Apr 29 01:08:07 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> Message-ID: <4271C408.9010401@ims.u-tokyo.ac.jp> Michael Hoffman wrote: > On Thu, 28 Apr 2005, Frédéric Sohm wrote: > >> 1) get rid of MutableSeq and make all Seq mutable. >> Will it not be a problem for some people there? I mean I only use >> MutableSeq so >> no problem there for me but I assume that someone uses non-mutable Seq >> or is it a >> feature which is not needed? > > In the rest of CPython, immutable objects have two benefits: they are more > memory-efficient (and sometimes space-efficient), and they are > hashable. I don't think Seqs are usefully hashable right now, and > Michiel says he will code the new Seq such that there won't be a > significant performance impact. Would you be willing to test the performance of a new Seq class? I haven't actually written any code yet, but I could send it to you when it's done before including it in Biopython. Note also that a mutable Seq class avoids the need for calls to tomutable and toseq, so there may be an overall performance gain. But it would be better to test this on a real-life case. --Michiel.
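A rough way to start on the performance comparison requested above is to time the same edit on an immutable and a mutable container. The sketch below is only an illustration under stated assumptions: it uses the built-in str and bytearray types of a current Python 3 interpreter as stand-ins, not Bio.Seq.Seq or MutableSeq, and the function names are hypothetical. It says nothing about how a C reimplementation of Seq would perform.

```python
import timeit


def edit_immutable(seq, start, replacement):
    """Return a new str with replacement written at start (copies the sequence)."""
    return seq[:start] + replacement + seq[start + len(replacement):]


def edit_mutable(buf, start, replacement):
    """Overwrite part of a bytearray in place and return the same object."""
    buf[start:start + len(replacement)] = replacement
    return buf


def compare(length=1_000_000, repeats=100):
    """Time both kinds of edit on a sequence of roughly `length` letters."""
    seq = "GATC" * (length // 4)
    buf = bytearray(seq, "ascii")
    t_str = timeit.timeit(lambda: edit_immutable(seq, 10, "AAAA"), number=repeats)
    t_buf = timeit.timeit(lambda: edit_mutable(buf, 10, b"AAAA"), number=repeats)
    return {"str": t_str, "bytearray": t_buf}
```

Calling compare() returns the two total timings in seconds. As the message above notes, a test on a real-life workload would be more informative than a micro-benchmark like this one.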
-- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From hoffman at ebi.ac.uk Fri Apr 29 04:02:05 2005 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Fri Apr 29 03:55:11 2005 Subject: [Biopython-dev] Rethinking Seq objects In-Reply-To: <4271C408.9010401@ims.u-tokyo.ac.jp> References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr> <4271C408.9010401@ims.u-tokyo.ac.jp> Message-ID: On Fri, 29 Apr 2005, Michiel Jan Laurens de Hoon wrote: > Would you be willing to test the performance of a new Seq class? I > haven't actually written any code yet, but I could send it to you > when it's done before including it in Biopython. Note also that a > mutable Seq class avoids the need for calls to tomutable and toseq, > so there may be an overall performance gain. But it would be better > to test this on a real-life case. Sure, I think I could whip something up. -- Michael Hoffman European Bioinformatics Institute From Frederic.Sohm at iaf.cnrs-gif.fr Fri Apr 29 04:03:56 2005 From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm) Date: Fri Apr 29 03:57:03 2005 Subject: [Biopython-dev] Rethinking Seq objects Message-ID: <1114761836.4271ea6ca540a@mail.iaf.cnrs-gif.fr> I am ready to do some performance testing, if you want. I am looking for a replacement for the DNA object I have written and could well switch to the biopython Seq object if it is faster. Fred Michael Hoffman wrote: > On Thu, 28 Apr 2005, Frédéric Sohm wrote: > >> 1) get rid of MutableSeq and make all Seq mutable. >> Will it not be a problem for some people there? I mean I only use >> MutableSeq so >> no problem there for me but I assume that someone uses non-mutable Seq >> or is it a >> feature which is not needed?
> > In the rest of CPython, immutable objects have two benefits: they are more > memory-efficient (and sometimes space-efficient), and they are > hashable. I don't think Seqs are usefully hashable right now, and > Michiel says he will code the new Seq such that there won't be a > significant performance impact. Would you be willing to test the performance of a new Seq class? I haven't actually written any code yet, but I could send it to you when it's done before including it in Biopython. Note also that a mutable Seq class avoids the need for calls to tomutable and toseq, so there may be an overall performance gain. But it would be better to test this on a real-life case. --Michiel. -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon -- Frédéric Sohm Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés" UPR 2197 DEPSN, CNRS Institut de Neurosciences A. Fessard 1 Avenue de la Terrasse 91 198 GIF-SUR-YVETTE FRANCE Phone: +33 (0) 1 69 82 34 12 Fax:+33 (0) 1 69 82 34 47
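Earlier in this thread, plain reverse_complement and rna_reverse_complement functions were suggested as a lighter alternative to alphabet objects. A minimal sketch of that idea follows, assuming a current Python 3 interpreter and plain strings rather than Seq objects. The function names come from the suggestion in the thread, but the bodies, the inclusion of the IUPAC ambiguity codes, and the error behaviour are assumptions made for illustration, not existing Biopython code.

```python
# Complement tables for DNA and RNA, including IUPAC ambiguity codes
# (an assumption; the thread does not say which letters to accept).
_DNA_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVNacgtryswkmbdhvn",
                                "TGCAYRSWMKVHDBNtgcayrswmkvhdbn")
_RNA_COMPLEMENT = str.maketrans("ACGURYSWKMBDHVNacguryswkmbdhvn",
                                "UGCAYRSWMKVHDBNugcayrswmkvhdbn")


def _reverse_complement(seq, table, kind):
    """Complement seq with table, reverse it, and reject unknown letters."""
    unexpected = {c for c in seq if ord(c) not in table}
    if unexpected:
        raise ValueError("not a %s sequence: unexpected %s"
                         % (kind, sorted(unexpected)))
    return seq.translate(table)[::-1]


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return _reverse_complement(seq, _DNA_COMPLEMENT, "DNA")


def rna_reverse_complement(seq):
    """Reverse complement of an RNA string."""
    return _reverse_complement(seq, _RNA_COMPLEMENT, "RNA")
```

Here the caller states the sequence type by choosing the function, so no alphabet object is needed, and a letter outside the chosen table raises an error at the point of use.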