From bugzilla-daemon at portal.open-bio.org Sun Apr 3 07:30:24 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Apr 3 08:12:55 2005
Subject: [Biopython-dev] [Bug 1767] New: Bio/trie.c can crash on Windows
Message-ID: <200504031130.j33BUO4v019943@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1767
Summary: Bio/trie.c can crash on Windows
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Windows 2000
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev@biopython.org
ReportedBy: mdehoon@ims.u-tokyo.ac.jp
In Bio/trie.c, the function strdup is being used, which is not part of the
ANSI-C standard. As a result, when Bio/trie.c is compiled, the resulting trie
module links to two C runtime libraries (msvcrt.dll and msvcr71.dll), which are
incompatible with each other and can cause crashes. To fix this bug, we need to
write our own strdup function using ANSI-C standard functions.
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mdehoon at ims.u-tokyo.ac.jp Wed Apr 6 22:30:35 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 6 22:20:03 2005
Subject: [Biopython-dev] Re:
In-Reply-To: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
References: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
Message-ID: <42549B4B.8000408@ims.u-tokyo.ac.jp>
To post to biopython-dev@biopython.org, you need to subscribe first via the
biopython website. This was done to stop the huge amounts of spam we were
receiving earlier. If you want to submit a patch, the best way is to create a
bug report first (also via the biopython website) and then add an attachment
containing the patch. Patches sent to the mailing list tend to get lost, and are
sometimes rejected by the spam filter. If you want to submit a larger piece of
code, for example a new module, you can post a message to
biopython-dev@biopython.org first to describe the code, and then send it to one
of the developers.
Hope this helps!
--Michiel.
Glen van Ginkel wrote:
> I tried to submit code to biopython-dev@biopython.org but got a message
> telling me I was not allowed to mail them
>
> How do I go about submitting code?
>
> Thank you
>
> Glen van Ginkel
>
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From gvg500 at york.ac.uk Thu Apr 7 03:13:39 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr 7 02:08:10 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504070813.39242.gvg500@york.ac.uk>
Hi Guys
As a project for my MRes in bioinformatics I have had to write a Python
package that would expand Biopython's capabilities in interacting with useful
programs. Here I submit code to interact with the Hmmer suite of programs.
Basically, you instantiate the Hmmer object (like a Hmmer commandline builder
object), build a commandline, execute it and grab the results. Really simple
for the user. This only works on UNIX since Hmmer is not available on Windows
platforms and it obviously assumes you have Hmmer already installed on your
PC.
I have also tried to integrate the Bio.Align.Generic alignment object so that
the Hmmer object is able to handle alignment objects by writing the records
of the alignment object to a temporary fasta file.
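The workflow described above (build a commandline, hand alignment records to
Hmmer through a temporary fasta file) might look roughly like the sketch
below. It is not the attached code: the class HmmerCommandline, its methods,
the helper write_temp_fasta and the file name globin.hmm are all hypothetical
names chosen for illustration, and the records are plain (name, sequence)
pairs rather than Bio.Align.Generic objects. Only the program name hmmbuild
comes from the Hmmer suite itself.

```python
import os
import tempfile


def write_temp_fasta(records):
    """Write (name, sequence) pairs to a temporary fasta file; return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
    with handle:
        for name, sequence in records:
            handle.write(">%s\n%s\n" % (name, sequence))
    return handle.name


class HmmerCommandline:
    """Hypothetical builder; not the class from the attached Applications.py."""

    def __init__(self, program):
        self.program = program
        self.arguments = []

    def add_argument(self, value):
        # Arguments are stored verbatim; nothing is validated against Hmmer.
        self.arguments.append(str(value))
        return self

    def as_list(self):
        return [self.program] + self.arguments

    def __str__(self):
        return " ".join(self.as_list())


records = [("SHRIMP1", "DKATIKRTWATVTDLPSFGRNVFLSVFAAK"),
           ("SHRIMP2", "PEYKNLFVEFRNIPASELASSERLLYHGGR")]
fasta_path = write_temp_fasta(records)
cmd = HmmerCommandline("hmmbuild").add_argument("globin.hmm").add_argument(fasta_path)
print(cmd)
os.remove(fasta_path)  # clean up the temporary file
```

On a machine with Hmmer installed, the list returned by as_list() could be
passed to subprocess.run() before the temporary file is removed; that step is
left out here so the sketch runs without Hmmer.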
Since I have to write this up as a report I would greatly appreciate any
criticism of any kind and perhaps some suggestions as to how I might improve
the code. Also, if you can think of any other ways to implement the Hmmer
stuff please let me know.
At the moment I am working on a test suite for the project. If you would like
the code I have written to exercise some of the methods please let me know
and I'll send it over.
I also have a bit of documentation I would like to add to the Application
package because I feel it failed to help me concerning certain aspects. How
would I go about this?
I look forward to hearing your suggestions.
Glen van Ginkel
From gvg500 at york.ac.uk Thu Apr 7 08:10:29 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr 7 07:25:03 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504071310.29943.gvg500@york.ac.uk>
Attached is the code as well as an example of a test suite using the files
from Eddy2003 and an experiment file experiment.fas.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Applications.py
Type: application/x-python
Size: 16558 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/Applications-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: __init__.py
Type: application/x-python
Size: 320 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/__init__-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: HmmerStandalone.py
Type: application/x-python
Size: 9139 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/HmmerStandalone-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testsuite.py
Type: application/x-python
Size: 2427 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/testsuite-0001.bin
-------------- next part --------------
>S134211|S134211|GLOBIN - BRINE SHRIMP1
DKATIKRTWATVTDLPSFGRNVFLSVFAAK
>S134212|S134212|GLOBIN - BRINE SHRIMP2
PEYKNLFVEFRNIPASELASSERLLYHGGR
>S134213|S134213|GLOBIN - BRINE SHRIMP3
VLSSIDEAIAGIDTPDRAVKTLLALGERHI
>S134214|S134214|GLOBIN - BRINE SHRIMP4
SRGTVRRHFEAFSYAFIDELKQRGVESADL
>S134215|S134215|GLOBIN - BRINE SHRIMP5
AAWRRGWDNIVNVLEAGLLRRQIDLEVTGL
>S134216|S134216|GLOBIN - BRINE SHRIMP6
SCVDVANIQESWSKVSGDLKTTGSVVFQRM
>S134217|S134217|GLOBIN - BRINE SHRIMP7
INGHPEYQQLFRQFRDVDLDKLGESNSFVA
From gvg500 at york.ac.uk Tue Apr 12 05:10:57 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Tue Apr 12 05:07:14 2005
Subject: [Biopython-dev] Hmmer API
Message-ID: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
This is a desperate-ish plea to any biopython developer.
Recently submitted an explanation of code that interacts with the Hmmer
(Eddy 2003) suite of programs to the list. I am required to write up a
project concerning this Hmmer API and would hugely appreciate it if ANY of
the developers could give me some feed-back. Bearing in mind that this is in
your spare time, I'm not asking for a major commentary. The focus of the
project was to produce something that might be accepted as part of the
BioPython project. What I'm looking for is just a sort of "yeah probably
will be accepted but.." or "No! you've got a long way to go before we accept
that gubbins, but.".
I would really appreciate ANY response
In anticipation
Glen van Ginkel
From fkauff at duke.edu Tue Apr 12 08:59:12 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Tue Apr 12 08:52:48 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <1113310753.5132.5.camel@osiris.biology.duke.edu>
Glen,
I'd be happy to have a look at it. I was recently thinking of using
Hmmer to do some alignment for one of my python scripts, so maybe your
API comes in handy. I'm not too familiar with Hmmer yet, though. So I
hope it comes with some example code...?
And thanks for contributing to biopython!
Frank
On Tue, 2005-04-12 at 10:10 +0100, Glen van Ginkel wrote:
> This is a desperate-ish plea to any biopython developer.
>
>
>
> Recently submitted an explanation of code that interacts with the Hmmer
> (Eddy 2003) suite of programs to the list. I am required to write up a
> project concerning this Hmmer API and would hugely appreciate it if ANY of
> the developers could give me some feed-back. Bearing in mind that this is in
> your spare time, I'm not asking for a major commentary. The focus of the
> project was to produce something that might be accepted as part of the
> BioPython project. What I'm looking for is just a sort of "yeah probably
> will be accepted but.." or "No! you've got a long way to go before we accept
> that gubbins, but.".
>
>
>
> I would really appreciate ANY response
>
>
>
> In anticipation
>
>
>
> Glen van Ginkel
>
>
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
--
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA
Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net/member/frankkauff.shtml
From jhackney at stanford.edu Tue Apr 12 16:20:24 2005
From: jhackney at stanford.edu (Jason A. Hackney)
Date: Tue Apr 12 16:23:46 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <6956556815154e4131d7683adf017b0b@stanford.edu>
Hi Glen,
I would also be willing to have a look at any code you've got so far.
I've done a bit of Hmmer stuff in the past, so I'd be interested in
seeing what you've got going.
Cheers,
Jason
On Apr 12, 2005, at 2:10 AM, Glen van Ginkel wrote:
> This is a desperate-ish plea to any biopython developer.
>
> Recently submitted an explanation of code that interacts with the Hmmer
> (Eddy 2003) suite of programs to the list. I am required to write up a
> project concerning this Hmmer API and would hugely appreciate it if ANY of
> the developers could give me some feed-back. Bearing in mind that this is in
> your spare time, I'm not asking for a major commentary. The focus of the
> project was to produce something that might be accepted as part of the
> BioPython project. What I'm looking for is just a sort of "yeah probably
> will be accepted but.." or "No! you've got a long way to go before we accept
> that gubbins, but.".
>
> I would really appreciate ANY response
>
> In anticipation
>
> Glen van Ginkel
>
Jason A. Hackney
Postdoctoral Scholar
Department of Microbiology and Immunology
Stanford University
From idoerg at burnham.org Tue Apr 12 18:34:36 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 12 18:29:42 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <425C4CFC.5020203@burnham.org>
Hi folks,
Glen actually gave me the code, but I cannot really give him the
response it merits, not within the time frame he would like. From a
cursory look it is "yes, we can accept it, give us contingency tests as
well". But if someone can actually run it a couple of times, and see what
it's like, that would be great.
Sorry Glen, I didn't realize you were that pressed for time.
Cheers,
Iddo
Glen van Ginkel wrote:
>This is a desperate-ish plea to any biopython developer.
>
>
>
>Recently submitted an explanation of code that interacts with the Hmmer
>(Eddy 2003) suite of programs to the list. I am required to write up a
>project concerning this Hmmer API and would hugely appreciate it if ANY of
>the developers could give me some feed-back. Bearing in mind that this is in
>your spare time, I'm not asking for a major commentary. The focus of the
>project was to produce something that might be accepted as part of the
>BioPython project. What I'm looking for is just a sort of "yeah probably
>will be accepted but.." or "No! you've got a long way to go before we accept
>that gubbins, but.".
>
>
>
>I would really appreciate ANY response
>
>
>
>In anticipation
>
>
>
>Glen van Ginkel
>
>
>
--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9930
http://ffas.ljcrf.edu/~iddo
From idoerg at burnham.org Thu Apr 14 20:53:47 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu Apr 14 20:47:25 2005
Subject: [Biopython-dev] Speakers for BOSC needed
Message-ID: <425F109B.8040507@burnham.org>
Hi all,
It's that time of year again, and BOSC 2005 will be happening on June
23-24. The more Biopython representatives, the merrier. I will be
around, but I will be dealing with my own SIG meeting, so I will not be
able to give a talk. Is there someone who can give the BioPython
"plenary"? It should be a 30-40 minute talk. Also, there are slots for
shorter talks, so if you contributed an interesting module, or had an
interesting experience with biopython you would like to share, please
submit a talk.
For those of you who do not know what BOSC is, it's the Bioinformatics
Open Source Conference, which is held as a satellite meeting of ISMB. I
highly recommend this event, it is a real eye opener with respect to the
world of open source, and computational biology. More about it here:
http://open-bio.org/bosc/
Cheers,
Iddo
--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
http://ffas.ljcrf.edu/~iddo
==========================
The First Automated Protein Function Prediction SIG
Detroit, MI June 24, 2005
http://ffas.burnham.org/AFP
From idoerg at burnham.org Tue Apr 19 13:29:16 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 19 13:22:41 2005
Subject: [Biopython-dev] BOSC 2005
Message-ID: <42653FEC.2010907@burnham.org>
{Please pass the word!}
SECOND CALL FOR SPEAKERS
The 6th annual Bioinformatics Open Source Conference (BOSC'2005) is organized by the
not-for-profit Open Bioinformatics Foundation. The meeting will take place
June 23-24, 2005 in Detroit, Michigan, USA, and is one of several Special Interest
Group (SIG) meetings occurring in conjunction with the 13th International Conference
on Intelligent Systems for Molecular Biology.
see http://www.iscb.org/ismb2005 for more information.
Because of the power of many Open Source bioinformatics packages in
use by the Research Community today, it is not too presumptuous to say
that the work of the Open Source Bioinformatics Community represents
the cutting edge of Bioinformatics in general. This has been repeatedly
demonstrated by the quality of presentations at previous BOSC conferences.
This year, at BOSC 2005, we want to continue this tradition of excellence,
while presenting this message to a wider part of the Research Community.
Please, pass this message on to anyone you know that is interested in
Bioinformatics software.
BOSC PROGRAM & CONTACT INFO
* Web: http://www.open-bio.org/bosc2005/
* Online Registration: https://www.cteusa.com/iscb4/
* Email: bosc@open-bio.org
FEES
* Corporate : $195 ($245 after May 16th)
* Academic : $170 ($220 after May 16th)
* Student : $145 ($195 after May 16th)
SPEAKERS & ABSTRACTS WANTED
The program committee is currently seeking abstracts for talks at BOSC
2005. BOSC is a great opportunity for you to tell the community about
your use, development, or philosophy of open source software development
in bioinformatics. The committee will select several submitted abstracts
for 25-minute talks and others for shorter "lightning" talks. Accepted
abstracts will be published on the BOSC web site.
If you are interested in speaking at BOSC 2005,
please send us before April 26, 2005:
* an abstract (no more than a few paragraphs)
* a URL for the project page, if applicable
* information about the open source license used for your software or
your release plans.
Abstracts will be accepted for submission until April 26, 2005.
Abstracts chosen for presentation will be announced May 12, 2005
(before the ISMB Early Registration Deadline).
LIGHTNING-TALK SPEAKERS WANTED!
The program committee is currently seeking speakers for the lightning
talks at BOSC 2005. Lightning talks are quick - only five minutes
long - and a great opportunity for you to give people a quick
summary of your open source project, code, idea, or vision of the future.
If you are interested in giving a lightning talk at BOSC 2005,
please send us:
* a brief title and summary (one or two lines)
* a URL for the project page, if applicable
* information about the open source license used for your software or
your release plans.
We will accept entries on-line until BOSC starts, but
space for demos and lightning talks is limited.
Iddo Friedberg wrote:
> Glen actually gave me the code, but I cannot really give him the
> response it merits, not within the time frame he would like. From a
> cursory look it is "yes, we can accept it, give us contingency tests as
> well". But if someone can actually run it a couple of times, and see what
> it's like, that would be great.
I'm not a hmmer user either, so it's hard to assess the code in great detail.
What I like about the code is that it has extensive documentation (in the source
code) and makes use of and is well integrated with the existing Biopython
software. Since Hmmer is a rather standard bioinformatics tool, I think
Biopython should support it. So I vote to accept this into Biopython. Once users
start using Bio.Hmmer, maybe some issues with the code will show up, but then
these users will be more familiar with Hmmer than we are, and can give more
useful advice.
About the documentation, it is quite extensive in the source code (which is fine
too), but it would be nice to have some documentation outside of the source
code. For example, this is something I wrote a while ago for Bio.LogisticRegression:
http://www.biopython.org/docs/cookbook/LogisticRegression.html
With such documentation, the code will be more accessible. Without it, people
have to look through the CVS tree to search for the Hmmer package, or may not
even notice it.
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 08:33:28 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 09:19:33 2005
Subject: [Biopython-dev] [Bug 1771] New: need some file from some xml module ?
Message-ID: <200504221233.j3MCXSc7007748@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1771
Summary: need some file from some xml module ?
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev@biopython.org
ReportedBy: jeroen@xdh.nl
Hi,
Using the SProt.py module of version 1.40b, I got this:
  File "stepsget035.py", line 98, in list_swissprot_gpcrs
    from Bio.SwissProt import SProt
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SwissProt/SProt.py", line 39, in ?
    from Bio import SeqRecord
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SeqRecord.py", line 7, in ?
    from Bio import FormatIO
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/FormatIO.py", line 2, in ?
    from xml.sax import saxutils
Before, I used 1.24, which gave no such error (or dependency, or bug, or
whatever it is), probably because I didn't need to import anything from an
xml module.
Thanks for any help,
Jeroen
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 09:31:27 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 10:18:29 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221331.j3MDVRhb008729@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1771
------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-22 09:31 -------
Which Python version are you using? See:
samma{mdehoon}8: python1.5
Python 1.5.2 (#1, Nov 28 2001, 02:33:46) [GCC 2.95.3 20010315 (release)] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> from xml.sax import saxutils
Traceback (innermost last):
  File "<stdin>", line 1, in ?
ImportError: No module named xml.sax
>>> ^D
samma{mdehoon}9: python2.2
Python 2.2.2 (#1, Jan 24 2003, 17:26:30)
[GCC 2.95.3 20010315 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.sax import saxutils
>>> ^D
samma{mdehoon}10:
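The interactive check above can also be scripted, which is handy when several
interpreters are installed. This is a small sketch for a current Python; the
helper name can_import is made up for illustration and is not part of
Biopython.

```python
import importlib
import sys


def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Report which interpreter is running and whether the module is available.
print(sys.version)
print("xml.sax.saxutils importable:", can_import("xml.sax.saxutils"))
```

If the second line reports False, the problem lies in the Python installation
rather than in Biopython.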
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 10:46:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 11:22:41 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221446.j3MEkv9D009884@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1771
jeroen@xdh.nl changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Additional Comments From jeroen@xdh.nl 2005-04-22 10:46 -------
Yep, there was something wrong with the sax module of my python (version 2.3)
installation, it has been fixed now.
thanks,
Jeroen
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 16:45:31 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 17:18:28 2005
Subject: [Biopython-dev] [Bug 1772] New: Bio.PDB's parse_pdb_header never
stops parsing if there is no ATOM record
Message-ID: <200504222045.j3MKjVfL014797@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
Summary: Bio.PDB's parse_pdb_header never stops parsing if there
is no ATOM record
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: Other
AssignedTo: biopython-dev@biopython.org
ReportedBy: dhendrix@compbio.berkeley.edu
In the future, please run your code on more pdb files before you release it!
I have a pdb file with no ATOM records, just HETATMs. So when I use
parse_pdb_header to read the header, it runs until I'm out of memory or I (or
the os) kill it, because it is reading in the header as anything that occurs
before an ATOM record, and there ain't one! The pdbID is 1PBL. While it is
unusual that there are no ATOM records, it can definitely occur!
Also, there is the annoying printing of
nonstandard resolution NOT APPLICABLE.
for every NMR structure. Why????
THANK YOU!
Donna
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 20:35:33 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 21:18:33 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504230035.j3N0ZW5H017089@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
idoerg@burnham.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
From bugzilla-daemon at portal.open-bio.org Fri Apr 22 20:36:52 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 21:18:37 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504230036.j3N0aqVG017107@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
idoerg@burnham.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |WORKSFORME
------- Additional Comments From idoerg@burnham.org 2005-04-22 20:36 -------
User reported using old version of Biopython. I checked it against 1.40b and CVS
versions, could not duplicate.
From bugzilla-daemon at portal.open-bio.org Mon Apr 25 17:22:50 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Apr 25 18:21:05 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504252122.j3PLMoIh022842@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
dhendrix@compbio.berkeley.edu changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |
------- Additional Comments From dhendrix@compbio.berkeley.edu 2005-04-25 17:22 -------
Thank you so much for the quick turnaround. Iddo Friedberg suggested that I use
the top level of the CVS. I updated python and my biopython (and NumPy, etc...)
and encountered the same behavior, as well as another little bug that I had
fixed a while ago. Here are my diffs to parse_pdb_header.py, which are small
but vital for me to get parse_pdb_header working for me.
122c122
< f=open(file,'r')
---
> f=open(filename,'r')
127c127
< if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
> if not re.search("\AATOM",l):
I can send you a little test program that fails on the current version of
parse_pdb_header, if that will help. Regardless, I think it's a good idea to
stop reading the file when you reach the end of it!
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 03:49:37 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 04:20:11 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-26 03:49 -------
I have accepted the first patch in CVS:
122c122
< f=open(file,'r')
---
> f=open(filename,'r')
But the second part doesn't seem right:
127c127
< if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
> if not re.search("\AATOM",l):
It will append HETATM lines to the header. Instead we can use
if not re.search("\AATOM",l) and not re.search("\AHETATM",l):
From thamelry at binf.ku.dk Tue Apr 26 04:40:06 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Tue Apr 26 05:00:32 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never
stops parsing if there is no ATOM record
In-Reply-To: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
References: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
Message-ID: <32883.83.92.3.59.1114504806.squirrel@www.binf.ku.dk>
> It will append HETATM lines to the header. Instead we can use
> if not re.search("\AATOM",l) and not re.search("\AHETATM",l):
Or even simpler:
record_type=l[0:6]
if record_type=='ATOM  ' or record_type=='HETATM' or record_type=='MODEL ':
    break
else:
    header.append(l)
Note that MODEL can also signal the end of the header.
I'll add it to the CVS.
Cheers,
-Thomas
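For readers following this thread, here is a self-contained sketch of the
header-reading idea discussed above: stop at the first ATOM, HETATM or MODEL
record, and also stop when the file ends. It is written for a current Python
and is not the code in parse_pdb_header.py; the function name
read_header_lines and the sample records are made up for illustration.

```python
import io

# Six-character record names that mark the end of the header section.
END_OF_HEADER = ("ATOM  ", "HETATM", "MODEL ")


def read_header_lines(handle):
    """Collect lines up to the first coordinate record or the end of the file."""
    header = []
    for line in handle:  # iterating over the handle stops at end of file
        if line[0:6] in END_OF_HEADER:
            break
        header.append(line)
    return header


# A stand-in file with HETATM records only, as in the report for 1PBL.
sample = io.StringIO(
    "HEADER    EXAMPLE\n"
    "REMARK   1\n"
    "HETATM    1  C1  LIG A   1\n"
    "END\n"
)
print(len(read_header_lines(sample)))  # prints 2
```

Because the loop iterates over the file instead of testing only for a record
name, a file with no coordinate records at all still terminates.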
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 06:35:59 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 08:21:59 2005
Subject: [Biopython-dev] [Bug 1773] New:
Martel.Parser.ParserPositionException
Message-ID: <200504261035.j3QAZx3P031804@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1773
Summary: Martel.Parser.ParserPositionException
Product: Biopython
Version: Not Applicable
Platform: PC
URL: http://portal.open-bio.org/pipermail/biopython/2005-April/002604.html
OS/Version: Linux
Status: NEW
Severity: minor
Priority: P2
Component: Martel/Mindy
AssignedTo: biopython-dev@biopython.org
ReportedBy: marc.saric@gmx.de
I tried to index the following Genbank file with this simple script
(as described in the cookbook) but it failed with the following traceback.
The file can be found in:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=BA000018
including all features (SNP, CDD, MGC, HPRD, STS)
I tried with Biopython 1.3, Biopython 1.4b and the Biopython-CVS as of
2005-04-15.
The program
===snip===
#!/usr/bin/env python
from Bio import GenBank
dict_file = "ba000018_s_aureus_n315_genome.gb"
index_file = "ba000018_s_aureus_n315_genome.idx"
GenBank.index_file(dict_file, index_file)
===snap===
The Traceback:
===snip===
Traceback (most recent call last):
  File "/home/saric/data/devel/workspace/scripts/hitman/index_gb.py", line 37, in ?
    GenBank.index_file(dict_file, index_file) # FIXME: This breaks with the N315 S.aureus-genome
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/GenBank/__init__.py", line 1283, in index_file
    SimpleSeqRecord.create_flatdb([filename], indexname, indexer)
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/SimpleSeqRecord.py", line 152, in create_flatdb
    creator.load(filename, builder = builder, fileid_info = {})
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/BaseDB.py", line 52, in load
    for record in iterator.iterate(source, cont_handler = builder):
  File "/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Martel/IterParser.py", line 71, in iterateFile
    raise Parser.ParserPositionException(self.start_position)
Martel.Parser.ParserPositionException: error parsing at or beyond character 5887615
===snap===
I use an x86 machine running SuSE Linux 9.1 (kernel 2.6.5-7.147-default,
gcc version 3.3.3).
The error is most likely due to a trailing blank line in the GenBank file,
which is there for all "official" downloads I checked (see my post on the
Biopython mailing list, linked above).
Either this is a bug in GenBank (delivering invalid files) or something minor in
the Martel parser, which does not like blank lines at the end of the file.
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 11:57:16 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 12:19:59 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504261557.j3QFvGrQ004138@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
------- Additional Comments From dhendrix@compbio.berkeley.edu 2005-04-26 11:57 -------
Of course it appends HETATM (and CONECT) records, but at least it stops reading
the file :-). I'm fine with your change, as long as it recognizes the end of
the file (or header).... THANK YOU!
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 12:35:40 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:22 2005
Subject: [Biopython-dev] [Bug 1774] New: Bio.Clustalw: bug in computing the
alignment.
Message-ID: <200504261635.j3QGZeP9004533@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1774
Summary: Bio.Clustalw: bug in computing the alignment.
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev@biopython.org
ReportedBy: crober@scri.ac.uk
Using Bio.Clustalw in order to process a sequence alignment, there is a bug in
computing the alignment when in the input file more than one sequence
has the same name. This bug is not reported by Bio.Clustalw if the output file
you specify already exists. In that latter case, Bio.Clustalw will
just read the results from the output file rather than reporting the error.
#------- program.py
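#! /usr/bin/python2.4
import sys
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
# Build the clustalw command line from the input and output file names.
cline = MultipleAlignCL(sys.argv[1])
cline.set_output(sys.argv[2])
# Only the name Clustalw is imported above, so call it without the Bio. prefix.
align = Clustalw.do_alignment(cline)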
#---------- input.fas
>Putative binding site
ggaacggatgctcgcccagttccaccaacg
>Putative binding site
ggaacccatccttttctgcgtccacacagc
>Putative promoter inside
ggaacaggtgtttcgtcaacacgga
>Putative binding site
ggaacaaacacaactactgcactat
#------- command line to start creating the alignment
$ python2.4 program.py input.fas output.fas
#-------- ERROR MESSAGE when running clustalw as follows:
$ clustalw input.fas -outfile=output.fas
ERROR: Multiple sequences found with same name, Putative (first 30 chars are
significant)
No. of seqs. read = 0. No alignment!
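A pre-flight check along the following lines would catch the duplicate names before clustalw is started. This is only a sketch: duplicate_names is a hypothetical helper, not part of Bio.Clustalw, and it assumes (going by the error message above, which reports the name as "Putative") that the name is the first whitespace-delimited word of the header line, cut to 30 characters.

```python
from collections import Counter


def duplicate_names(fasta_lines, significant=30):
    """Return the sorted sequence names that occur more than once."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the name is the first word, cut to `significant` chars.
            names.append(words[0][:significant] if words else "")
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


example = """>Putative binding site
ggaacggatgctcgcccagttccaccaacg
>Putative binding site
ggaacccatccttttctgcgtccacacagc
>Putative promoter inside
ggaacaggtgtttcgtcaacacgga
""".splitlines()

print(duplicate_names(example))
```

A caller could raise an error whenever the returned list is not empty, regardless of whether the output file already exists.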
From bugzilla-daemon at portal.open-bio.org Tue Apr 26 13:06:33 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:29 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
parsing if there is no ATOM record
Message-ID: <200504261706.j3QH6XSe004914@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1772
thamelry@binf.ku.dk changed:
What        |Removed     |Added
----------------------------------------------------------------------------
Status      |REOPENED    |RESOLVED
Resolution  |            |FIXED
------- Additional Comments From thamelry@binf.ku.dk 2005-04-26 13:06 -------
Parsing header now stops on ATOM, HETATM, MODEL or EOF
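The stopping rule can be pictured roughly as below. This is not the code committed to Bio.PDB, only a sketch under the assumption that the record name sits in the first six columns of each line; read_header_lines is a hypothetical name.

```python
STOP_RECORDS = ("ATOM", "HETATM", "MODEL")


def read_header_lines(lines):
    """Collect lines until the first coordinate record or the end of input."""
    header = []
    for line in lines:
        # PDB record names occupy the first six columns of each line.
        if line[:6].strip() in STOP_RECORDS:
            break
        header.append(line)
    return header


sample = [
    "HEADER    HYDROLASE",
    "TITLE     EXAMPLE WITHOUT ATOM RECORDS",
    "HETATM    1  O   HOH A   1",
    "CONECT    1",
]
print(read_header_lines(sample))
```

If none of the stop records is present, the loop simply ends at EOF and every line is returned.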
From bugzilla-daemon at portal.open-bio.org Wed Apr 27 01:13:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Apr 27 01:20:36 2005
Subject: [Biopython-dev] [Bug 1774] Bio.Clustalw: bug in computing the
alignment.
Message-ID: <200504270513.j3R5DPBN012863@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1774
mdehoon@ims.u-tokyo.ac.jp changed:
What        |Removed     |Added
----------------------------------------------------------------------------
Status      |NEW         |RESOLVED
Resolution  |            |FIXED
------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp 2005-04-27 01:13 -------
Fixed in CVS, thanks.
The bug was caused by interpreting the return value of os.popen as the exit
status instead of the termination status. The exit status is the second byte of
the termination status, so we need to divide the result of os.popen by 256 to
get the exit status.
For example, if the C code of a program a.out contains
int main(void)
{ int status;
...
return status;
}
then os.popen("a.out").close() returns status*256 instead of status. Same goes
for os.system. This is also true at the C-level, so this is not a Python bug.
Since we have calls to os.popen and os.system in various places in Biopython,
the same bug may appear elsewhere also.
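For illustration, the decoding can be left to the os module instead of dividing by 256 by hand. The sketch below is Unix-only and written for a current Python, not the 2.x interpreter of this thread; exit_status is a hypothetical helper name. Note also that in Python 3, os.popen(...).close() returns None rather than 0 when the command exits with status zero.

```python
import os


def exit_status(wait_status):
    """Extract the exit status from a Unix wait ("termination") status."""
    if not os.WIFEXITED(wait_status):
        # Killed by a signal, for example: there is no exit status to report.
        raise ValueError("process did not exit normally")
    return os.WEXITSTATUS(wait_status)


status = os.system("exit 3")  # the shell exits with status 3
print(status, exit_status(status))
```

Every call site that compares the raw result of os.system or os.popen(...).close() with an expected exit status would need the same conversion.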
From mdehoon at ims.u-tokyo.ac.jp Wed Apr 27 02:18:19 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 27 02:07:11 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Hi everybody,
For my research, I tend to work a lot with sequences, but I find myself not
using Bio.Seq much. I'd like to propose some changes to make sequence objects
more useful. I'd be happy to hear comments from the other developers, in
particular the original developers who probably thought this through much more
than I have.
There are five changes I'd like to propose:
1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the
MutableSeq class basically describe the same thing, except that one is read-only
and the other one is not. If desired, we can add a readonly flag to the class to
describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't
have such a flag, my feeling is that it is not really needed for Seq objects
either).
2) Make Seq objects a bit smarter about which type of sequence they contain. One
reason I don't use Bio.Seq much is that I have to write
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> from Bio.Seq import Seq
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
which is too much typing. I am thinking about the following scheme when
initializing a Seq object:
- If the user specifies my_alpha, accept that alphabet. Raise an error if the
sequence is not consistent with the alphabet
- Assume the sequence is an unambiguous DNA sequence
- If the sequence contains any characters other than ATCG, assume it is
unambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than AUCG, assume it is
a protein, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWY,
assume it is ambiguous DNA, otherwise accept the sequence
- If the sequence contains any characters other than GATCRYWSMKHBVDN, assume it
is ambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than GAUCRYWSMKHBVDN, assume it
is an extended protein sequence, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWYBXZ,
yell at the user.
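The cascade above can be sketched in a few lines. The letter sets and their order are taken directly from the list; guess_alphabet and the returned labels are hypothetical names, and real code would return Bio.Alphabet objects rather than strings.

```python
# Candidate alphabets in the order proposed above; the first match wins.
CANDIDATES = [
    ("unambiguous DNA", "ATCG"),
    ("unambiguous RNA", "AUCG"),
    ("protein", "ACDEFGHIKLMNPQRSTVWY"),
    ("ambiguous DNA", "GATCRYWSMKHBVDN"),
    ("ambiguous RNA", "GAUCRYWSMKHBVDN"),
    ("extended protein", "ACDEFGHIKLMNPQRSTVWYBXZ"),
]


def guess_alphabet(sequence):
    """Return the label of the first alphabet containing every letter."""
    letters = set(sequence.upper())
    for name, allowed in CANDIDATES:
        if letters <= set(allowed):
            return name
    raise ValueError("sequence fits none of the known alphabets")


print(guess_alphabet("GATCGATGGGCCTATTAGGATCGAAAATCGC"))  # unambiguous DNA
print(guess_alphabet("ACGTN"))                            # protein
```

The second call shows a consequence of the ordering: a DNA sequence containing N is accepted as protein, because every one of its letters is also an amino acid code and the protein check comes before the ambiguous DNA check.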
3) When changing a sequence, check if it is still consistent with the alphabet.
Right now, we can do
>>> from Bio.Seq import *
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq[:10] = "weirdstuff"
>>> my_seq
MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), IUPACUnambiguousDNA())
4) Make Seq objects understand circular genomes. Many bacterial genomes are
circular. It would be nice if we could take the indices [-1000:1000] from a Seq
object, if it is circular, or [3999000:40001000] if the sequence is circular
with length 4000000.
5) Perhaps it would be a good idea to add transcribe and translate methods to
the Seq class. Currently, to translate a DNA sequence, we have to do
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq)
Seq('AIVMGR*KGAR', IUPACProtein())
which is too much typing for my taste.
Any thoughts/comments/suggestions?
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From thamelry at binf.ku.dk Wed Apr 27 08:09:28 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Wed Apr 27 08:05:56 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Message-ID: <33165.83.92.3.59.1114603768.squirrel@www.binf.ku.dk>
Hi Michiel,
> Any thoughts/comments/suggestions?
I happen to be doing some sequence stuff myself at the moment
and I couldn't agree more with the points you raised. It could
all be a lot more straightforward!
Cheers,
-Thomas
From hoffman at ebi.ac.uk Wed Apr 27 08:37:03 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed Apr 27 08:37:45 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Message-ID:
On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
> 1) Make Seq objects mutable, and get rid of MutableSeq.
I imagine it will be a lot slower to replace built-in strings with
character arrays. Right now, I only use Seq when I absolutely have to.
Personally, I'd love it if Seq were just a light-weight subclass of
str without the performance penalties of the existing Seq. Using a
Surrogate pattern slows down all those inner loops a lot. Also lots of
unnecessary input-checking does as well. I think performance should be
a concern when you are talking about what should be the most-used part
of the library.
Similarly, I think lots of magic trying to figure out the alphabet is
a bad idea. There are only a few operations that actually require the
alphabet to be known, and most of the time I store a sequence in
memory I'm not going to need any of these, so having to deal with
alphabet issues when it's unnecessary is just going to be a pain in
the butt that will keep me from using Seq. Similarly, I use augmented
alphabets with things like B in them and I don't want Seq yelling at
me when there's no point. Sure, if it can't figure out how to revcom
the sequence, but just to instantiate it?
I think these principles from the Zen of Python would be
well-considered here:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Sparse is better than dense.
Readability counts.
In the face of ambiguity, refuse the temptation to guess.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
> Right now, we can do
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> my_seq[:10] = "weirdstuff"
> >>> my_seq
> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'),
> IUPACUnambiguousDNA())
"Doctor, it hurts when I do this."
"Don't do that."
> 4) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:40001000] if the sequence is
> circular with length 4000000.
I'm sure that will be useful to some people. But having a CircularSeq
subclass would make it easier to avoid this extra functionality from
impacting on the primary use case.
> 5) Perhaps it would be a good idea to add transcribe and translate methods to
> the Seq class.
+1
You would obviously have to specify an alphabet for this, but I'm fine
with that so long as I'm not forced to when I don't need to.
--
Michael Hoffman
European Bioinformatics Institute
From mdehoon at ims.u-tokyo.ac.jp Wed Apr 27 23:29:33 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 27 23:17:53 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To:
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Message-ID: <4270589D.2090500@ims.u-tokyo.ac.jp>
Michael Hoffman wrote:
> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>
>> 1) Make Seq objects mutable, and get rid of MutableSeq.
>
> I imagine it will be a lot slower to replace built-in strings with
> character arrays. Right now, I only use Seq when I absolutely have to.
Well I wouldn't replace them with character arrays, the idea would be to
reimplement the Seq class in C. So it would not be slower than built-in strings,
maybe even a bit faster. The Seq object would look like a string object, but be
mutable.
> Similarly, I think lots of magic trying to figure out the alphabet is
> a bad idea. There are only a few operations that actually require the
> alphabet to be known, and most of the time I store a sequence in
> memory I'm not going to need any of these, so having to deal with
> alphabet issues when it's unnecessary is just going to be a pain in
> the butt that will keep me from using Seq. Similarly, I use augmented
> alphabets with things like B in them and I don't want Seq yelling at
> me when there's no point. Sure, if it can't figure out how to revcom
> the sequence, but just to instantiate it?
OK, then how about this:
- By default, don't assume a particular alphabet. Same as how it works now:
>>> from Bio.Seq import *
>>> Seq('ATCG')
Seq('ATCG', Alphabet())
- If the user decides to specify the alphabet, make sure the sequence is
consistent with it. Of course, if the alphabet is Alphabet(), don't do any input
checking. So essentially, the user gets to decide whether she wants input
checking for the sequence or not.
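A rough picture of this opt-in checking, in pure Python rather than the C implementation discussed above. CheckedSeq and its letters argument are hypothetical; a plain string of allowed letters stands in for a Bio.Alphabet object, and None plays the role of the generic Alphabet().

```python
class CheckedSeq:
    """Mutable sequence that validates letters only if an alphabet is given."""

    def __init__(self, data, letters=None):
        self.letters = None if letters is None else set(letters)
        self._check(data)
        self.data = list(data)

    def _check(self, text):
        if self.letters is None:
            return  # generic alphabet: no input checking at all
        bad = set(text) - self.letters
        if bad:
            raise ValueError("letters not in alphabet: " + "".join(sorted(bad)))

    def __setitem__(self, index, value):
        # Validate before touching the data, so a failed assignment changes nothing.
        self._check(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        elif len(value) == 1:
            self.data[index] = value
        else:
            raise ValueError("a single position needs a single letter")

    def __str__(self):
        return "".join(self.data)


dna = CheckedSeq("GATCGATGGG", letters="ACGT")
dna[:4] = "TTTT"            # accepted: all letters are in the alphabet
free = CheckedSeq("GATC")   # no alphabet given, so anything goes
free[:2] = "we"
print(dna, free)
```

With letters="ACGT", the assignment dna[:10] = "weirdstuff" raises ValueError and leaves the sequence unchanged.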
>> Right now, we can do
>> >>> from Bio.Seq import *
>> >>> from Bio.Alphabet import IUPAC
>> >>> my_alpha = IUPAC.unambiguous_dna
>> >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> >>> my_seq[:10] = "weirdstuff"
>> >>> my_seq
>> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'),
>> IUPACUnambiguousDNA())
>
> "Doctor, it hurts when I do this."
> "Don't do that."
Well you would be right if this were Biofortran. For a higher-level language, I
would expect better checking to make sure an object is self-consistent. Python
itself is full of checks and assertions.
Another option would be to get rid of alphabets altogether. What good are they
otherwise?
>> 4) Make Seq objects understand circular genomes. Many bacterial
>> genomes are circular. It would be nice if we could take the indices
>> [-1000:1000] from a Seq object, if it is circular, or
>> [3999000:40001000] if the sequence is circular with length 4000000.
>
> I'm sure that will be useful to some people. But having a CircularSeq
> subclass would make it easier to avoid this extra functionality from
> impacting on the primary use case.
My feeling is that having a subclass is a bit of an overkill. The idea is to
have an optional topology argument, which defaults to "linear". So the primary
use case would not be affected.
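As a sketch of what the topology argument might do, here is a pure Python function on plain strings. The name take and its signature are hypothetical; the proposal itself is for a Seq type written in C.

```python
def take(seq, start, stop, topology="linear"):
    """Return seq[start:stop], wrapping around the ends if circular."""
    if topology == "linear":
        return seq[start:stop]  # the primary use case: an ordinary slice
    if topology != "circular":
        raise ValueError("unknown topology: %r" % topology)
    n = len(seq)
    if n == 0:
        return seq
    # Index modulo the length so positions before 0 or beyond n wrap around.
    return "".join(seq[i % n] for i in range(start, stop))


genome = "ABCDEFGHIJ"
print(take(genome, -2, 3, topology="circular"))  # IJABC
print(take(genome, 8, 12, topology="circular"))  # IJAB
```

For comparison, the linear call take(genome, -2, 3) returns an empty string, because Python reads -2 as position 8, which lies past the stop position.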
>
>> 5) Perhaps it would be a good idea to add transcribe and translate
>> methods to the Seq class.
>
> +1
>
> You would obviously have to specify an alphabet for this, but I'm fine
> with that so long as I'm not forced to when I don't need to.
If the alphabet defaults to Alphabet() when creating a Seq object, then I'd
think the transcribe and translate methods should work even if a user doesn't
specify the sequence to be DNA or RNA. My current gripe with the Seq object is
that there are too many steps to translate a DNA sequence.
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From Frederic.Sohm at iaf.cnrs-gif.fr Thu Apr 28 03:35:01 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm)
Date: Thu Apr 28 03:28:19 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Hi,
I have been following your discussion on the Seq object. I more or less agree
with Michiel, but here are some thoughts:
1) Get rid of MutableSeq and make all Seq mutable.
Will it not be a problem for some people? I only use MutableSeq, so it is
no problem for me, but I assume that someone uses the non-mutable Seq. Or is it a
feature which is not needed?
2) Checking the alphabet. Yes, good, with Michael's remark that it should not be
forced on people who don't want it. It can be painful for really long sequences.
4) Circular sequences and indices. Nice. From experience, though, it is not so
easy to implement correctly.
5) Translate and transcribe. Yes, obviously a good thing.
If you are interested, you can have a look at the DNA object in rana.
If you want it, take it under a Biopython licence.
It's only DNA and it's certainly not worth using as it is, but it can give you
some ideas. A lot of complexity is added by supporting biological indexing [1:len]
rather than the Python one [0:len-1]. This would not be a sensible thing to do
in Biopython.
It is the C implementation of the Python string, modified with alphabet
checking (a modification of the string translate() method with the alphabet hard
coded in) and support for circular sequences.
If you want, have a look at the DNA object for rana here:
http://cvs.sourceforge.net/viewcvs.py/rana/rana/Rana/c_extension/DNAdata.c?rev=1.9&view=markup
The code is pretty bad; well, my code, not the Python one.
Fred
--
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47
From hoffman at ebi.ac.uk Thu Apr 28 04:56:51 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu Apr 28 04:50:05 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <4270589D.2090500@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
<4270589D.2090500@ims.u-tokyo.ac.jp>
Message-ID:
On Thu, 28 Apr 2005, Michiel Jan Laurens de Hoon wrote:
> Michael Hoffman wrote:
>
>> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>>
>> > 1) Make Seq objects mutable, and get rid of MutableSeq.
>>
>> I imagine it will be a lot slower to replace built-in strings with
>> character arrays. Right now, I only use Seq when I absolutely have to.
> Well I wouldn't replace them with character arrays, the idea would be to
> reimplement the Seq class in C. So it would not be slower than built-in
> strings, maybe even a bit faster. The Seq object would look like a string
> object, but be mutable.
If you can make a sequence class that is faster than the current
built-in string, I would suggest you submit a patch to the Python
tracker to make it a replacement for the current built-in string. :P
> OK, then how about this:
> - By default, don't assume a particular alphabet. Same as how it works now:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())
+1
>> > >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> > >>> my_seq[:10] = "weirdstuff"
>> > >>> my_seq
>> > MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'),
>> > IUPACUnambiguousDNA())
>>
>> "Doctor, it hurts when I do this."
>> "Don't do that."
>
> Well you would be right if this were Biofortran. For a higher-level language,
> I would expect better checking to make sure an object is self-consistent.
> Python itself is full of checks and assertions.
How often have you actually put "weirdstuff" into the middle of a
MutableSeq? Have you ever done this, or are you just imagining that it
might happen?
You Aren't Gonna Need It. The number of checks and assertions you can
make is limitless and you have to know where to draw the line. To me,
the line should be drawn at user input, but not at every internal
change to a sequence made within a program. Maybe optional alphabet
checking would help with this.
> Another option would be to get rid of alphabets altogether. What good are
> they otherwise?
They're useful for transcription/translation/reverse complement
operations. And as far as I'm concerned, that's a good place to do
error checking, should it be necessary.
>> But having a CircularSeq subclass would make it easier to avoid
>> this extra functionality from impacting on the primary use case.
>
> My feeling is that having a subclass is a bit of an overkill. The idea is to
> have an optional topology argument, which defaults to "linear". So the
> primary use case would not be affected.
If you're doing this in C, then my performance assumptions are perhaps
incorrect. I wouldn't want every slice of my linear sequence to have
to go through "is this circular?" logic in Python.
> If the alphabet defaults to Alphabet() when creating a Seq object,
> then I'd think the transcribe and translate methods should work even
> if a user doesn't specify the sequence to be DNA or RNA. My current
> gripe with the Seq object is that there are too many steps to
> translate a DNA sequence.
Good point. Perhaps a warning when it has to guess?
--
Michael Hoffman
European Bioinformatics Institute
From hoffman at ebi.ac.uk Thu Apr 28 04:59:51 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu Apr 28 05:14:23 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Message-ID:
On Thu, 28 Apr 2005, Frédéric Sohm wrote:
> 1) get rid of MutableSeq and make all Seq mutable.
> Will it not be a problem for some people there? I mean I only use MutableSeq so
> no problem there for me but I assume that someone uses non-mutable Seq or is it a
> feature which is not needed?
In the rest of CPython, immutable objects have two benefits: they are more
memory-efficient (and sometimes space-efficient), and they are
hashable. I don't think Seqs are usefully hashable right now, and
Michiel says he will code the new Seq such that there won't be a
significant performance impact.
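The hashability point is easy to see with built-in types. In this small illustration a str stands in for an immutable sequence and a bytearray for a mutable one; neither is a Biopython class.

```python
immutable = "GATC"
mutable = bytearray(b"GATC")

# An immutable string can be used as a dictionary key or a set member.
index = {immutable: "usable as a key"}

# A mutable bytearray cannot be hashed, so it cannot serve as a key.
try:
    hash(mutable)
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False

print(mutable_is_hashable)  # False
```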
--
Michael Hoffman
European Bioinformatics Institute
From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:05:14 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 00:53:15 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Message-ID: <4271C08A.3050306@ims.u-tokyo.ac.jp>
Frédéric Sohm wrote:
> 1) get rid of MutableSeq and make all Seq mutable.
> Will it not be a problem for some people there? I mean I only use MutableSeq so
> no problem there for me but I assume that someone uses non-mutable Seq or is it a
> feature which is not needed?
As far as I can tell, the only way in which a mutable Seq may affect a user is
in terms of performance, as Michael pointed out. But anyway, as soon as we reach
some conclusion on Seq/MutableSeq on biopython-dev, I'll send a message to the
biopython mailing list to see if any of the users will get into problems because
of this. Also, I expect that MutableSeq will be around for some time as a
deprecated class.
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:15:28 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 01:03:21 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To:
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp> <4270589D.2090500@ims.u-tokyo.ac.jp>
Message-ID: <4271C2F0.7070608@ims.u-tokyo.ac.jp>
Michael Hoffman wrote:
> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>> Another option would be to get rid of alphabets altogether. What good
>> are they otherwise?
>
> They're useful for transcription/translation/reverse complement
> operations. And as far as I'm concerned, that's a good place to do
> error checking, should it be necessary.
>
For transcription and translation, we don't need to know the alphabet.
Effectively, by calling translate or transcribe, the user is telling us that the
input sequence object is DNA or RNA, and that the output sequence is RNA (for
transcription) or protein (for translation). Of course, when a character other
than ACGTU is encountered, we need to raise an error. But the point is that
knowing the Alphabet doesn't tell us anything we don't already know.
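To make the argument concrete, a transcribe function that needs no alphabet object could look like the sketch below. It works on plain strings, the name transcribe is only borrowed from the proposal, and the accepted letters follow the ACGTU rule stated above. Translation would additionally need a codon table and is left out here.

```python
def transcribe(dna):
    """Return the RNA version of a nucleotide string, replacing T with U."""
    # Calling this function already says "this is DNA"; only the letters are checked.
    bad = set(dna.upper()) - set("ACGTU")
    if bad:
        raise ValueError("not a nucleotide: " + "".join(sorted(bad)))
    return dna.replace("T", "U").replace("t", "u")


print(transcribe("GATCGATGGG"))  # GAUCGAUGGG
```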
For reverse complement, we also don't need to know the alphabet; it is either
DNA or RNA. The only exception is when a user wants to reverse complement a
sequence that does not contain a T or a U. But the current situation, where we
have IUPACProtein, ExtendedIUPACProtein, IUPACAmbiguousDNA, IUPACUnambiguousDNA,
ExtendedIUPACDNA, IUPACAmbiguousRNA, IUPACUnambiguousRNA alphabets, is an
overkill. It would be much easier to have a reverse_complement and a
rna_reverse_complement function (or something like that).
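The two functions might be sketched as follows on plain strings. The names come from the suggestion above; restricting the tables to the four unambiguous letters is an assumption made for brevity, and any other character (such as an ambiguity code) is passed through unchanged.

```python
# Translation tables for the unambiguous letters, upper and lower case.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
_RNA_COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")


def reverse_complement(dna):
    """Complement each DNA letter, then reverse the result."""
    return dna.translate(_DNA_COMPLEMENT)[::-1]


def rna_reverse_complement(rna):
    """Complement each RNA letter, then reverse the result."""
    return rna.translate(_RNA_COMPLEMENT)[::-1]


print(reverse_complement("GATTC"))      # GAATC
print(rna_reverse_complement("GAUUC"))  # GAAUC
```

Because the caller picks the function, a sequence without any T or U is handled without guessing whether it is DNA or RNA.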
So I still don't see any use for alphabets other than input checking. Or am I
missing something here?
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From mdehoon at ims.u-tokyo.ac.jp Fri Apr 29 01:20:08 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 01:08:07 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To:
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Message-ID: <4271C408.9010401@ims.u-tokyo.ac.jp>
Michael Hoffman wrote:
> On Thu, 28 Apr 2005, Frédéric Sohm wrote:
>
>> 1) get rid of MutableSeq and make all Seq mutable.
>> Will it not be a problem for some people there? I mean I only use
>> MutableSeq so no problem there for me but I assume that someone uses
>> non-mutable Seq or is it a feature which is not needed?
>
> In the rest of CPython, immutable objects have two benefits: they are more
> memory-efficient (and sometimes space-efficient), and they are
> hashable. I don't think Seqs are usefully hashable right now, and
> Michiel says he will code the new Seq such that there won't be a
> significant performance impact.
Would you be willing to test the performance of a new Seq class? I haven't
actually written any code yet, but I could send it to you when it's done before
including it in Biopython. Note also that a mutable Seq class avoids the need
for calls to tomutable and toseq, so there may be an overall performance gain.
But it would be better to test this on a real-life case.
--Michiel.
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From hoffman at ebi.ac.uk Fri Apr 29 04:02:05 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Fri Apr 29 03:55:11 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <4271C408.9010401@ims.u-tokyo.ac.jp>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
<4271C408.9010401@ims.u-tokyo.ac.jp>
Message-ID:
On Fri, 29 Apr 2005, Michiel Jan Laurens de Hoon wrote:
> Would you be willing to test the performance of a new Seq class? I
> haven't actually written any code yet, but I could send it to you
> when it's done before including it in Biopython. Note also that a
> mutable Seq class avoids the need for calls to tomutable and toseq,
> so there may be an overall performance gain. But it would be better
> to test this on a real-life case.
Sure, I think I could whip something up.
--
Michael Hoffman
European Bioinformatics Institute
From Frederic.Sohm at iaf.cnrs-gif.fr Fri Apr 29 04:03:56 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm)
Date: Fri Apr 29 03:57:03 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <1114761836.4271ea6ca540a@mail.iaf.cnrs-gif.fr>
I am ready to do some performance testing, if you want. I am looking for a
replacement for the DNA object I have written and could well switch to the
biopython Seq object if it is faster.
Fred
--
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47