From bugzilla-daemon at portal.open-bio.org  Sun Apr  3 07:30:24 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Apr  3 08:12:55 2005
Subject: [Biopython-dev] [Bug 1767] New: Bio/trie.c can crash on Windows
Message-ID: <200504031130.j33BUO4v019943@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1767

           Summary: Bio/trie.c can crash on Windows
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows 2000
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: mdehoon@ims.u-tokyo.ac.jp


In Bio/trie.c, the function strdup is being used, which is not part of the
ANSI C standard. As a result, when Bio/trie.c is compiled, the resulting trie
module links to two C runtime libraries (msvcrt.dll and msvcr71.dll), which are
incompatible with each other and can cause crashes. To fix this bug, we need to
write our own strdup function using ANSI C standard functions.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mdehoon at ims.u-tokyo.ac.jp  Wed Apr  6 22:30:35 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr  6 22:20:03 2005
Subject: [Biopython-dev] Re: 
In-Reply-To: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
References: <200504061641.j36Gfnck010384@mail-gw0.york.ac.uk>
Message-ID: <42549B4B.8000408@ims.u-tokyo.ac.jp>

To post to biopython-dev@biopython.org, you need to subscribe first via the 
biopython website. This was done to stop the huge amounts of spam we were 
receiving earlier. If you want to submit a patch, the best way is to create a 
bug report first (also via the biopython website) and then add an attachment 
containing the patch. Patches sent to the mailing list tend to get lost, and are 
sometimes rejected by the spam filter. If you want to submit a larger piece of 
code, for example a new module, you can post a message to 
biopython-dev@biopython.org first to describe the code, and then send it to one 
of the developers.

Hope this helps!

--Michiel.


Glen van Ginkel wrote:

> I tried to submit code to biopython-dev@biopython.org but got a message
> telling me I was not allowed to mail them
> 
>  
> 
> How do I go about submitting code?
> 
>  
> 
> Thankyou
> 
>  
> 
> Glen van Ginkel
> 
> 
> 

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From gvg500 at york.ac.uk  Thu Apr  7 03:13:39 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr  7 02:08:10 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504070813.39242.gvg500@york.ac.uk>

Hi Guys

As a project for my MRes in bioinformatics I have had to write a Python 
package that would expand Biopython's capabilities in interacting with useful 
programs. Here I submit code to interact with the Hmmer suite of programs. 

Basically, you instantiate the Hmmer object (like a Hmmer command-line builder 
object), build a command line, execute it and grab the results. Really simple 
for the user. This only works on UNIX since Hmmer is not available on Windows 
platforms, and it obviously assumes you have Hmmer already installed on your 
PC. 
I have also tried to integrate the Bio.Align.Generic alignment object so that 
the Hmmer object is able to handle alignment objects by writing the records 
of the alignment object to a temporary FASTA file. 
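
The workflow described above (build a command line, write the alignment
records to a temporary FASTA file, then run the program) could look roughly
like the sketch below. This is only an illustration: the class name
HmmerCommandline, its methods, and the helper write_temp_fasta are
hypothetical and are not taken from the attached package. The output name
globin.hmm is likewise made up, and the -n option of hmmbuild is used purely
as an example.

```python
import tempfile


class HmmerCommandline:
    """Hypothetical command-line builder; not the API of the submitted code."""

    def __init__(self, program):
        self.program = program   # name of the executable, e.g. "hmmbuild"
        self.options = []        # (flag, value) pairs, in the order given
        self.arguments = []      # positional arguments, in the order given

    def set_option(self, flag, value=None):
        self.options.append((flag, value))

    def add_argument(self, value):
        self.arguments.append(value)

    def as_list(self):
        """Return the command as an argument list, e.g. for subprocess.run()."""
        cmd = [self.program]
        for flag, value in self.options:
            cmd.append(flag)
            if value is not None:
                cmd.append(str(value))
        return cmd + self.arguments


def write_temp_fasta(records):
    """Write (identifier, sequence) pairs to a temporary FASTA file.

    Returns the path of the file; the caller is responsible for deleting it.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".fas", delete=False) as handle:
        for identifier, sequence in records:
            handle.write(">%s\n%s\n" % (identifier, sequence))
        return handle.name


# Two records standing in for an alignment object.
records = [
    ("S134211", "DKATIKRTWATVTDLPSFGRNVFLSVFAAK"),
    ("S134212", "PEYKNLFVEFRNIPASELASSERLLYHGGR"),
]
fasta_path = write_temp_fasta(records)

builder = HmmerCommandline("hmmbuild")
builder.set_option("-n", "globin")   # example option only
builder.add_argument("globin.hmm")   # made-up output file name
builder.add_argument(fasta_path)
command = builder.as_list()
print(command)
```

Actually running the command (for example with subprocess.run(command))
requires Hmmer to be installed, so the sketch stops at building the argument
list.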

Since I have to write this up as a report I would greatly appreciate any 
criticism of any kind and perhaps some suggestions as to how I might improve 
the code. Also, if you can think of any other ways to implement the Hmmer 
stuff please let me know. 

At the moment I am working on a test suite for the project. If you would like 
the code I have written to exercise some of the methods please let me know 
and I'll send it over. 
I also have a bit of documentation I would like to add to the Application 
package because I feel it failed to help me concerning certain aspects. How 
would I go about this?

I look forward to hearing your suggestions. 

Glen van Ginkel
From gvg500 at york.ac.uk  Thu Apr  7 08:10:29 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Thu Apr  7 07:25:03 2005
Subject: [Biopython-dev] Hmmer integration modules/package
Message-ID: <200504071310.29943.gvg500@york.ac.uk>

Attached is the code as well as an example of a test suite using the files 
from Eddy2003 and an experiment file experiment.fas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Applications.py
Type: application/x-python
Size: 16558 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/Applications-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: __init__.py
Type: application/x-python
Size: 320 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/__init__-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: HmmerStandalone.py
Type: application/x-python
Size: 9139 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/HmmerStandalone-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testsuite.py
Type: application/x-python
Size: 2427 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20050407/aab5b8bd/testsuite-0001.bin
-------------- next part --------------
>S134211|S134211|GLOBIN - BRINE SHRIMP1
DKATIKRTWATVTDLPSFGRNVFLSVFAAK
>S134212|S134212|GLOBIN - BRINE SHRIMP2
PEYKNLFVEFRNIPASELASSERLLYHGGR
>S134213|S134213|GLOBIN - BRINE SHRIMP3
VLSSIDEAIAGIDTPDRAVKTLLALGERHI
>S134214|S134214|GLOBIN - BRINE SHRIMP4
SRGTVRRHFEAFSYAFIDELKQRGVESADL
>S134215|S134215|GLOBIN - BRINE SHRIMP5
AAWRRGWDNIVNVLEAGLLRRQIDLEVTGL
>S134216|S134216|GLOBIN - BRINE SHRIMP6
SCVDVANIQESWSKVSGDLKTTGSVVFQRM
>S134217|S134217|GLOBIN - BRINE SHRIMP7
INGHPEYQQLFRQFRDVDLDKLGESNSFVA
From gvg500 at york.ac.uk  Tue Apr 12 05:10:57 2005
From: gvg500 at york.ac.uk (Glen van Ginkel)
Date: Tue Apr 12 05:07:14 2005
Subject: [Biopython-dev] Hmmer API
Message-ID: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>

This is a desperate-ish plea to any biopython developer.

 
I recently submitted an explanation of code that interacts with the Hmmer
(Eddy 2003) suite of programs to the list. I am required to write up a
project concerning this Hmmer API and would hugely appreciate it if ANY of
the developers could give me some feedback. Bearing in mind that this is in
your spare time, I'm not asking for a major commentary. The focus of the
project was to produce something that might be accepted as part of the
Biopython project. What I'm looking for is just a sort of "yeah, probably
will be accepted, but..." or "No! You've got a long way to go before we accept
that gubbins, but...".

 
I would really appreciate ANY response

 
In anticipation

 
Glen van Ginkel

 
From fkauff at duke.edu  Tue Apr 12 08:59:12 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Tue Apr 12 08:52:48 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <1113310753.5132.5.camel@osiris.biology.duke.edu>

Glen,

I'd be happy to have a look at it. I was recently thinking of using
Hmmer to do some alignment for one of my python scripts, so maybe your
API comes in handy. I'm not too familiar with Hmmer yet, though. So I
hope it comes with some example code...?

And thanks for contributing to biopython! 

Frank

On Tue, 2005-04-12 at 10:10 +0100, Glen van Ginkel wrote:
> This is a desperate-ish plea to any biopython developer.
> 
>  
> 
> Recently submitted an explanation of code that interacts with the Hmmer
> (Eddy 2003) suite of programs to the list. I am required to write up a
> project concerning this Hmmer API and would hugely appreciate it if ANY of
> the developers could give me some feed-back. Bearing in mind that this is in
> your spare time, I'm not asking for a major commentary. The focus of the
> project was to produce something that might be accepted as part of the
> BioPython project. What I'm looking for is just a sort of "yeah probably
> will be accepted but.." or "No! you've got a long way to go before we accept
> that gubbins, but.". 
> 
>  
> 
> I would really appreciate ANY response
> 
>  
> 
> In anticipation
> 
>  
> 
> Glen van Ginkel
> 
>  
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net/member/frankkauff.shtml


From jhackney at stanford.edu  Tue Apr 12 16:20:24 2005
From: jhackney at stanford.edu (Jason A. Hackney)
Date: Tue Apr 12 16:23:46 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <6956556815154e4131d7683adf017b0b@stanford.edu>

Hi Glen,

I would also be willing to have a look at any code you've got so far. 
I've done a bit of Hmmer stuff in the past, so I'd be interested in 
seeing what you've got going.

Cheers,

Jason

On Apr 12, 2005, at 2:10 AM, Glen van Ginkel wrote:

> This is a desperate-ish plea to any biopython developer.
>
>
>
> Recently submitted an explanation of code that interacts with the Hmmer
> (Eddy 2003) suite of programs to the list. I am required to write up a
> project concerning this Hmmer API and would hugely appreciate it if 
> ANY of
> the developers could give me some feed-back. Bearing in mind that this 
> is in
> your spare time, I'm not asking for a major commentary. The focus of 
> the
> project was to produce something that might be accepted as part of the
> BioPython project. What I'm looking for is just a sort of "yeah 
> probably
> will be accepted but.." or "No! you've got a long way to go before we 
> accept
> that gubbins, but.".
>
>
>
> I would really appreciate ANY response
>
>
>
> In anticipation
>
>
>
> Glen van Ginkel
>
Jason A. Hackney
Postdoctoral Scholar
Department of Microbiology and Immunology
Stanford University

From idoerg at burnham.org  Tue Apr 12 18:34:36 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 12 18:29:42 2005
Subject: [Biopython-dev] Hmmer API
In-Reply-To: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
References: <200504120910.j3C9AYAF016856@mail-gw1.york.ac.uk>
Message-ID: <425C4CFC.5020203@burnham.org>

Hi folks,

Glen actually gave me the code, but I cannot really give him the 
response it merits, not within the time frame he would like. From a 
cursory look it is "yes, we can accept it, give us contingency tests as 
well". But if someone can actually run it a couple of times, and see what 
it's like, that would be great.

Sorry Glen, I didn't realize you were that pressed for time.

Cheers,

Iddo


Glen van Ginkel wrote:

>This is a desperate-ish plea to any biopython developer.
>
> 
>
>Recently submitted an explanation of code that interacts with the Hmmer
>(Eddy 2003) suite of programs to the list. I am required to write up a
>project concerning this Hmmer API and would hugely appreciate it if ANY of
>the developers could give me some feed-back. Bearing in mind that this is in
>your spare time, I'm not asking for a major commentary. The focus of the
>project was to produce something that might be accepted as part of the
>BioPython project. What I'm looking for is just a sort of "yeah probably
>will be accepted but.." or "No! you've got a long way to go before we accept
>that gubbins, but.". 
>
> 
>
>I would really appreciate ANY response
>
> 
>
>In anticipation
>
> 
>
>Glen van Ginkel


-- 

Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9930
http://ffas.ljcrf.edu/~iddo

From idoerg at burnham.org  Thu Apr 14 20:53:47 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu Apr 14 20:47:25 2005
Subject: [Biopython-dev] Speakers for BOSC needed
Message-ID: <425F109B.8040507@burnham.org>

Hi all,

It's that time of year again, and BOSC 2005 will be happening on June 
23-24. The more Biopython representatives, the merrier. I will be 
around, but I will be dealing with my own SIG meeting, so I will not be 
able to give a talk. Is there someone who can give the Biopython 
"plenary"? It should be a 30-40 minute talk. Also, there are slots for 
shorter talks, so if you contributed an interesting module, or had an 
interesting experience with Biopython you would like to share, please 
submit a talk.

For those of you who do not know what BOSC is, it's the Bioinformatics 
Open Source Conference, which is held as a satellite meeting of ISMB. I 
highly recommend this event; it is a real eye-opener with respect to the 
world of open source and computational biology. More about it here:

http://open-bio.org/bosc/

Cheers,

Iddo

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
http://ffas.ljcrf.edu/~iddo
==========================
The First Automated Protein Function Prediction SIG
Detroit, MI June 24, 2005
http://ffas.burnham.org/AFP

From idoerg at burnham.org  Tue Apr 19 13:29:16 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue Apr 19 13:22:41 2005
Subject: [Biopython-dev] BOSC 2005
Message-ID: <42653FEC.2010907@burnham.org>


{Please pass the word!}

SECOND CALL FOR SPEAKERS
 
The 6th annual Bioinformatics Open Source Conference (BOSC'2005) is organized by the
not-for-profit Open Bioinformatics Foundation. The meeting will take place
June 23-24, 2005 in Detroit, Michigan, USA, and is one of several Special Interest
Group (SIG) meetings occurring in conjunction with the 13th International Conference
on Intelligent Systems for Molecular Biology.
 
see http://www.iscb.org/ismb2005 for more information.
 
Because of the power of many Open Source bioinformatics packages in
use by the Research Community today, it is not too presumptuous to say 
that the work of the Open Source Bioinformatics Community represents 
the cutting edge of Bioinformatics in general. This has been repeatedly 
demonstrated by the quality of presentations at previous BOSC conferences.
This year, at BOSC 2005, we want to continue this tradition of excellence, 
while presenting this message to a wider part of the Research Community.  
Please, pass this message on to anyone you know that is interested in
Bioinformatics software. 
 
 
BOSC PROGRAM & CONTACT INFO

* Web: http://www.open-bio.org/bosc2005/
* Online Registration: https://www.cteusa.com/iscb4/
* Email: bosc@open-bio.org

FEES


* Corporate : $195 ($245 after May 16th)
* Academic : $170 ($220 after May 16th)
* Student : $145 ($195 after May 16th) 

SPEAKERS & ABSTRACTS WANTED

The program committee is currently seeking abstracts for talks at BOSC 
2005. BOSC is a great opportunity for you to tell the community about 
your use, development, or philosophy of open source software development 
in bioinformatics. The committee will select several submitted abstracts 
for 25-minute talks and others for shorter "lightning" talks. Accepted 
abstracts will be published on the BOSC web site.

If you are interested in speaking at BOSC 2005, 
please send us before April 26, 2005:

* an abstract (no more than a few paragraphs)
* a URL for the project page, if applicable
* information about the open source license used for your software or 
  your release plans.
 
Abstracts will be accepted for submission until April 26, 2005.
Abstracts chosen for presentation will be announced May 12, 2005 
(before the ISMB Early Registration Deadline).
 
LIGHTNING-TALK SPEAKERS WANTED!

The program committee is currently seeking speakers for the lightning 
talks at BOSC 2005. Lightning talks are quick - only five minutes 
long - and a great opportunity for you to give people a quick 
summary of your open source project, code, idea, or vision of the future.
 
If you are interested in giving a lightning talk at BOSC 2005, 
please send us:
 
* a brief title and summary (one or two lines)
* a URL for the project page, if applicable
* information about the open source license used for your software or 
  your release plans.
 
We will accept entries on-line until BOSC starts, but
space for demos and lightning talks is limited.
   
SOFTWARE DEMONSTRATIONS WANTED!
If you are involved in the development of Open Source Bioinformatics Software, 
you are invited to provide a short demonstration to attendees of BOSC 2005.
 
If you are interested in giving a software demonstration at BOSC 2005,
please send us:
 
* a brief title and summary (one or two lines)
* a URL for the project page, if applicable
* Internet connectivity requirements (e.g. a website application served on the 
  World Wide Web, or a web-based client application).
 
We will accept entries on-line until BOSC starts, but
space for demos and lightning talks is limited. 
 
** Because the mission of the OBF is to promote Open Source software, we will favor submissions for
projects that apply a recognized Open Source License, or adhere to the general Open Source Philosophy.
See the following websites for further details:
http://www.opensource.org/licenses/
http://www.opensource.org/docs/definition.php
 
 
SESSION CHAIRS WANTED

If you would like to be involved in BOSC 2005, we invite you to chair a session. This will 
not require much of your time. You will be given a schedule of presenters during your session. 
You simply introduce each speaker and manage the time of their presentation (25 minutes for full 
presentations, 5-10 minutes for lightning talks/demos, depending on the number of entries).
 
If you are interested in chairing a session, please send us your name and affiliation (if applicable).
 
-- 
cheers,

Bosc Organizing Committee


-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
http://ffas.ljcrf.edu/~iddo


From mdehoon at ims.u-tokyo.ac.jp  Wed Apr 20 09:28:31 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 20 09:19:20 2005
Subject: [Biopython-dev] Hmmer API
Message-ID: <426658FF.5020801@ims.u-tokyo.ac.jp>

Iddo Friedberg wrote:
> Glen actually gave me the code, but I cannot really give him the 
> response it merits, not within the time frame he would like. From a 
> cursory look it is "yes, we can accept it, give us contingency tests as 
> well". But if someone can actually run it a couple of times, and see what 
> it's like, that would be great.

I'm not a hmmer user either, so it's hard to assess the code in great detail. 
What I like about the code is that it has extensive documentation (in the source 
code) and makes use of and is well integrated with the existing Biopython 
software. Since Hmmer is a rather standard bioinformatics tool, I think 
Biopython should support it. So I vote to accept this into Biopython. Once users 
start using Bio.Hmmer, maybe some issues with the code will show up, but then 
these users will be more familiar with Hmmer than we are, and can give more 
useful advice.

About the documentation, it is quite extensive in the source code (which is fine 
too), but it would be nice to have some documentation outside of the source 
code. For example, this is something I wrote a while ago for Bio.LogisticRegression:

http://www.biopython.org/docs/cookbook/LogisticRegression.html

With such documentation, the code will be more accessible. Without it, 
people have to look through the CVS tree to search for the Hmmer 
package, or may not even notice it.

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 08:33:28 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 09:19:33 2005
Subject: [Biopython-dev] [Bug 1771] New: need some file from some xml module
	?
Message-ID: <200504221233.j3MCXSc7007748@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771

           Summary: need some file from some xml module ?
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: jeroen@xdh.nl


Hi,

Using the SProt.py module of version 1.40b, I got this:

  File "stepsget035.py", line 98, in list_swissprot_gpcrs
    from Bio.SwissProt import SProt
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SwissProt/SProt.py", line 39, in ?
    from Bio import SeqRecord
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/SeqRecord.py", line 7, in ?
    from Bio import FormatIO
  File "/usr/src/biopython-1.40b/build/lib.linux-i686-2.3/Bio/FormatIO.py", line 2, in ?
    from xml.sax import saxutils

Before, I used 1.24 and that gave no such error/dependency/bug/dunno-what-it-is,
probably because I didn't need to import anything from an xml module.


Thanks for any help,

Jeroen


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 09:31:27 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 10:18:29 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221331.j3MDVRhb008729@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771


------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp  2005-04-22 09:31 -------
Which Python version are you using? See:

samma{mdehoon}8: python1.5
Python 1.5.2 (#1, Nov 28 2001, 02:33:46)  [GCC 2.95.3 20010315 (release)] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> from xml.sax import saxutils
Traceback (innermost last):
  File "<stdin>", line 1, in ?
ImportError: No module named xml.sax
>>> ^D
samma{mdehoon}9: python2.2
Python 2.2.2 (#1, Jan 24 2003, 17:26:30)
[GCC 2.95.3 20010315 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.sax import saxutils
>>> ^D
samma{mdehoon}10:
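
The same check can be scripted, so that it reports the interpreter version and
whether the module imports. This is a minimal sketch for a current Python 3
interpreter (it relies on importlib, which the Python versions shown above do
not provide); can_import is a hypothetical helper name.

```python
import importlib
import sys


def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


print("Python", sys.version.split()[0])
print("xml.sax.saxutils importable:", can_import("xml.sax.saxutils"))
```

If the second line reports False, the problem lies with the Python
installation rather than with Biopython.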


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 10:46:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 11:22:41 2005
Subject: [Biopython-dev] [Bug 1771] need some file from some xml module ?
Message-ID: <200504221446.j3MEkv9D009884@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1771

jeroen@xdh.nl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Additional Comments From jeroen@xdh.nl  2005-04-22 10:46 -------
Yep, there was something wrong with the sax module of my Python (version 2.3) 
installation; it has been fixed now.

thanks,


Jeroen


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 16:45:31 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 17:18:28 2005
Subject: [Biopython-dev] [Bug 1772] New: Bio.PDB's parse_pdb_header never
	stops parsing if there is no ATOM record
Message-ID: <200504222045.j3MKjVfL014797@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

           Summary: Bio.PDB's parse_pdb_header never stops parsing if there
                    is no ATOM record
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: dhendrix@compbio.berkeley.edu


In the future, please run your code on more pdb files before you release it!

I have a pdb file with no ATOM records, just HETATMs.  So when I use
parse_pdb_header to read the header, it runs until I'm out of memory or I (or
the OS) kill it, because it is reading in the header as anything that occurs
before an ATOM record, and there ain't one!  The PDB ID is 1PBL.  While it is
unusual that there are no ATOM records, it can definitely occur!

Also, there is the annoying printing of 
nonstandard resolution  NOT APPLICABLE.
for every NMR structure.  Why????

THANK YOU!
Donna


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 20:35:33 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 21:18:33 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504230035.j3N0ZW5H017089@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

idoerg@burnham.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Fri Apr 22 20:36:52 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Fri Apr 22 21:18:37 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504230036.j3N0aqVG017107@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

idoerg@burnham.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |WORKSFORME


------- Additional Comments From idoerg@burnham.org  2005-04-22 20:36 -------
User reported using old version of Biopython. I checked it against 1.40b and CVS
versions, could not duplicate.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Mon Apr 25 17:22:50 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Apr 25 18:21:05 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504252122.j3PLMoIh022842@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

dhendrix@compbio.berkeley.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |


------- Additional Comments From dhendrix@compbio.berkeley.edu  2005-04-25 17:22 -------
Thank you so much for the quick turnaround.  Iddo Friedberg suggested that I use
the top level of the CVS.  I updated python and my biopython (and NumPy, etc...)
and encountered the same behavior, as well as another little bug that I had
fixed a while ago.  Here are my diffs to parse_pdb_header.py, which are small
but vital for me to get parse_pdb_header working for me.
122c122
<         f=open(file,'r')
---
>         f=open(filename,'r')
127c127
<         if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
>         if not re.search("\AATOM",l):

I can send you a little test program that fails on the current version of
parse_pdb_header, if that will help.  Regardless, I think it's a good idea to
stop reading the file when you reach the end of it!


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Tue Apr 26 03:49:37 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 04:20:11 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772


------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp  2005-04-26 03:49 -------
I have accepted the first patch in CVS:
122c122
<         f=open(file,'r')
---
>         f=open(filename,'r')
But the second part doesn't seem right:
127c127
<         if not re.search("\AATOM",l) and not re.search("\AEND",l):
---
>         if not re.search("\AATOM",l):
It will append HETATM lines to the header. Instead we can use
         if not re.search("\AATOM",l) and not re.search("\AHETATM",l):
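
To illustrate the difference between the two conditions, here is a small
sketch. It assumes the header loop simply collects lines until a stop
condition matches. The sample lines are synthetic (not the real 1PBL entry),
and collect_header is a hypothetical helper, not a function in
parse_pdb_header.py.

```python
import re

# Synthetic PDB-like lines: HETATM records only, no ATOM records.
lines = [
    "HEADER    EXAMPLE\n",
    "REMARK   2 RESOLUTION. NOT APPLICABLE.\n",
    "HETATM    1  C1  LIG A   1       0.000   0.000   0.000\n",
    "HETATM    2  C2  LIG A   1       1.000   0.000   0.000\n",
    "END\n",
]


def collect_header(lines, stop_records):
    """Collect lines until one starts with any of the given record names."""
    header = []
    for l in lines:
        if any(re.search(r"\A" + name, l) for name in stop_records):
            break
        header.append(l)
    return header


with_end = collect_header(lines, ["ATOM", "END"])
with_hetatm = collect_header(lines, ["ATOM", "HETATM"])

print(len(with_end))     # 4: both HETATM lines end up in the header
print(len(with_hetatm))  # 2: only the HEADER and REMARK lines
```

Both conditions stop before the end of the input here, but only the second
keeps coordinate records out of the header.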


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From thamelry at binf.ku.dk  Tue Apr 26 04:40:06 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Tue Apr 26 05:00:32 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never 
	stops parsing if there is no ATOM record
In-Reply-To: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
References: <200504260749.j3Q7nbeJ029909@portal.open-bio.org>
Message-ID: <32883.83.92.3.59.1114504806.squirrel@www.binf.ku.dk>


> It will append HETATM lines to the header. Instead we can use
>          if not re.search("\AATOM",l) and not re.search("\AHETATM",l):

Or even simpler:

record_type=l[0:6]
if record_type=='ATOM  ' or record_type=='HETATM' or record_type=='MODEL ':
    break
else:
    header.append(l)

Note that MODEL can also signal the end of the header.
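
A minimal, self-contained sketch of the loop above. The function name
read_header and the END_OF_HEADER tuple are hypothetical names chosen for
illustration; this is not the code committed to CVS.

```python
# Record names that end the header; PDB record names occupy columns 1-6.
END_OF_HEADER = ("ATOM  ", "HETATM", "MODEL ")


def read_header(lines):
    """Collect lines up to the first coordinate record or the end of input."""
    header = []
    for line in lines:
        if line[0:6] in END_OF_HEADER:
            break
        header.append(line)
    return header
```

Because the loop iterates over the input rather than waiting for a specific
record, it also stops at the end of a file that has no ATOM record at all.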

I'll add it to the CVS.

Cheers,

-Thomas


From bugzilla-daemon at portal.open-bio.org  Tue Apr 26 06:35:59 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 08:21:59 2005
Subject: [Biopython-dev] [Bug 1773] New:
	Martel.Parser.ParserPositionException
Message-ID: <200504261035.j3QAZx3P031804@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1773

           Summary: Martel.Parser.ParserPositionException
           Product: Biopython
           Version: Not Applicable
          Platform: PC
               URL: http://portal.open-bio.org/pipermail/biopython/2005-
                    April/002604.html
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Martel/Mindy
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: marc.saric@gmx.de


I tried to index the following GenBank file with this simple script
(as described in the cookbook), but it failed with the traceback below.

The file can be found in:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=BA000018

including all features (SNP, CDD, MGC, HPRD, STS)

I tried with Biopython 1.3, Biopython 1.4b and the Biopython-CVS as of
2005-04-15.

The program

===snip===
#!/usr/bin/env python
from Bio import GenBank

dict_file = "ba000018_s_aureus_n315_genome.gb"
index_file = "ba000018_s_aureus_n315_genome.idx"

GenBank.index_file(dict_file, index_file)
===snap===

The Traceback:

===snip===
Traceback (most recent call last):
  File "/home/saric/data/devel/workspace/scripts/hitman/index_gb.py",
line 37, in ?
    GenBank.index_file(dict_file, index_file) # FIXME: This breaks with
the N315 S.aureus-genome
  File
"/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/GenBank/__init__.py",
line 1283, in index_file
    SimpleSeqRecord.create_flatdb([filename], indexname, indexer)
  File
"/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/SimpleSeqRecord.py",
line 152, in create_flatdb
    creator.load(filename, builder = builder, fileid_info = {})
  File
"/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Bio/Mindy/BaseDB.py",
line 52, in load
    for record in iterator.iterate(source, cont_handler = builder):
  File
"/home/saric/transfer/source/biopython/biopython_cvs_20050415/biopython/build/lib.linux-i686-2.3/Martel/IterParser.py",
line 71, in iterateFile
    raise Parser.ParserPositionException(self.start_position)
Martel.Parser.ParserPositionException: error parsing at or beyond
character 5887615

===snap===

I use a x86-machine running SuSE-Linux 9.1 (kernel 2.6.5-7.147-default,
gcc version 3.3.3).

The error is most likely due to a trailing blank line in the GenBank file,
which is there for all "official" downloads I checked (see my post on the
Biopython mailing list, linked above).

Either this is a bug in GenBank (delivering invalid files) or something minor in
the Martel-Parser, which does not like blank lines at the end of the file.
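
If the trailing blank line really is the cause (this is a guess, not a
confirmed diagnosis), one possible workaround is to strip blank lines from the
end of the file before indexing it. The sketch below is plain Python and does
not touch Martel; the function name strip_trailing_blank_lines is hypothetical.

```python
def strip_trailing_blank_lines(text):
    """Return text without the blank lines that follow the last record."""
    lines = text.splitlines(keepends=True)
    # Drop whitespace-only lines from the end, keeping everything else intact.
    while lines and not lines[-1].strip():
        lines.pop()
    return "".join(lines)
```

The cleaned text could then be written to a temporary file and passed to
GenBank.index_file in place of the original download.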


From bugzilla-daemon at portal.open-bio.org  Tue Apr 26 11:57:16 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 12:19:59 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504261557.j3QFvGrQ004138@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772


------- Additional Comments From dhendrix@compbio.berkeley.edu  2005-04-26 11:57 -------
Of course it appends HETATM (and CONECT) records, but at least it stops reading
the file :-).  I'm fine with your change, as long as it recognizes the end of
the file (or header)....  THANK YOU!


From bugzilla-daemon at portal.open-bio.org  Tue Apr 26 12:35:40 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:22 2005
Subject: [Biopython-dev] [Bug 1774] New: Bio.Clustalw: bug in computing the
	alignment.
Message-ID: <200504261635.j3QGZeP9004533@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1774

           Summary: Bio.Clustalw: bug in computing the alignment.
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: crober@scri.ac.uk


When Bio.Clustalw is used to compute a sequence alignment, the alignment fails
if more than one sequence in the input file has the same name. Bio.Clustalw
does not report this error if the output file you specify already exists. In
that case, Bio.Clustalw just reads the results from the existing output file
rather than reporting the error.


#------- program.py
#! /usr/bin/python2.4
import sys
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL

cline = MultipleAlignCL(sys.argv[1])
cline.set_output(sys.argv[2])
align = Clustalw.do_alignment(cline)

#---------- input.fas
>Putative binding site
ggaacggatgctcgcccagttccaccaacg
>Putative binding site
ggaacccatccttttctgcgtccacacagc
>Putative promoter inside
ggaacaggtgtttcgtcaacacgga
>Putative binding site
ggaacaaacacaactactgcactat

#------- command line to start creating the alignment
$ python2.4 program.py input.fas output.fas

#-------- ERROR MESSAGE when running clustalw as follows:
$ clustalw input.fas -outfile=output.fas
ERROR: Multiple sequences found with same name, Putative (first 30 chars are
significant)
No. of seqs. read = 0. No alignment!
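
A caller can check for this condition before running clustalw. The sketch
below assumes, based only on the error message above, that the sequence name
is the first whitespace-delimited word of the title line, cut to 30
characters. The function name duplicate_names is hypothetical.

```python
from collections import Counter


def duplicate_names(fasta_text, significant=30):
    """Return the sorted sequence names that occur more than once."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            # Assumed naming rule: first word, truncated to 30 characters.
            names.append(words[0][:significant] if words else "")
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

For the input.fas shown above this returns ['Putative'], since all four title
lines start with the same word.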


From bugzilla-daemon at portal.open-bio.org  Tue Apr 26 13:06:33 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Apr 26 13:22:29 2005
Subject: [Biopython-dev] [Bug 1772] Bio.PDB's parse_pdb_header never stops
	parsing if there is no ATOM record
Message-ID: <200504261706.j3QH6XSe004914@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1772

thamelry@binf.ku.dk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED


------- Additional Comments From thamelry@binf.ku.dk  2005-04-26 13:06 -------
Parsing header now stops on ATOM, HETATM, MODEL or EOF


From bugzilla-daemon at portal.open-bio.org  Wed Apr 27 01:13:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Wed Apr 27 01:20:36 2005
Subject: [Biopython-dev] [Bug 1774] Bio.Clustalw: bug in computing the
	alignment.
Message-ID: <200504270513.j3R5DPBN012863@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1774

mdehoon@ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Additional Comments From mdehoon@ims.u-tokyo.ac.jp  2005-04-27 01:13 -------
Fixed in CVS, thanks.

The bug was caused by interpreting the value returned by os.popen(...).close()
as the exit status, when it is in fact the termination status. The exit status
is the second byte of the termination status, so we need to divide the returned
value by 256 to get the exit status.
For example, if the C code of a program a.out contains
int main(void)
{  int status;
   ...
   return status;
}
then os.popen("a.out").close() returns status*256 instead of status. Same goes
for os.system. This is also true at the C-level, so this is not a Python bug.
Since we have calls to os.popen and os.system in various places in Biopython,
the same bug may appear elsewhere also.
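
A small helper along these lines could be used wherever the return value of
os.popen(...).close() or os.system is inspected. This is a sketch that assumes
the Unix convention described above and a process that exited normally (not
one killed by a signal); the name exit_status is hypothetical. On Unix, the
standard library's os.WEXITSTATUS performs the same extraction.

```python
def exit_status(termination_status):
    """Extract the exit status from a 16-bit termination status."""
    # The close() method of an os.popen pipe returns None on success.
    if termination_status is None:
        return 0
    # The exit status is stored in the second (high) byte.
    return (termination_status >> 8) & 0xFF
```

With this helper, a termination status of 768 is reported as exit status 3.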


From mdehoon at ims.u-tokyo.ac.jp  Wed Apr 27 02:18:19 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 27 02:07:11 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <426F2EAB.3000709@ims.u-tokyo.ac.jp>

Hi everybody,

For my research, I tend to work a lot with sequences, but I find myself not 
using Bio.Seq much. I'd like to propose some changes to make sequence objects 
more useful. I'd be happy to hear comments from the other developers, in 
particular the original developers who probably thought this through much more 
than I have.

There are five changes I'd like to propose:

1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and the 
MutableSeq class basically describe the same thing, except that one is read-only 
and the other one is not. If desired, we can add a readonly flag to the class to 
describe if it is mutable or not. (Given that e.g. Numerical Python arrays don't 
have such a flag, my feeling is that it is not really needed for Seq objects 
either).

2) Make Seq objects a bit smarter about which type of sequence they contain. One 
reason I don't use Bio.Seq much is that I have to write
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> from Bio.Seq import Seq
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
which is too much typing. I am thinking about the following scheme when 
initializing a Seq object:
- If the user specifies my_alpha, accept that alphabet. Raise an error if the 
sequence is not consistent with the alphabet
- Assume the sequence is an unambiguous DNA sequence
- If the sequence contains any characters other than ATCG, assume it is 
unambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than AUCG, assume it is
a protein, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWY, 
assume it is ambiguous DNA, otherwise accept the sequence
- If the sequence contains any characters other than GATCRYWSMKHBVDN, assume it 
is ambiguous RNA, otherwise accept the sequence
- If the sequence contains any characters other than GAUCRYWSMKHBVDN, assume it 
is an extended protein sequence, otherwise accept the sequence
- If the sequence contains any characters other than ACDEFGHIKLMNPQRSTVWYBXZ, 
yell at the user.
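
The scheme above can be sketched as an ordered list of candidate alphabets
that is tried from top to bottom. The names CANDIDATES and guess_alphabet are
hypothetical, the letter sets are copied from the list above, and uppercase
input is assumed.

```python
# Candidate alphabets in the order proposed above.
CANDIDATES = [
    ("unambiguous DNA", "ATCG"),
    ("unambiguous RNA", "AUCG"),
    ("protein", "ACDEFGHIKLMNPQRSTVWY"),
    ("ambiguous DNA", "GATCRYWSMKHBVDN"),
    ("ambiguous RNA", "GAUCRYWSMKHBVDN"),
    ("extended protein", "ACDEFGHIKLMNPQRSTVWYBXZ"),
]


def guess_alphabet(sequence):
    """Return the name of the first candidate alphabet that fits."""
    letters = set(sequence)
    for name, allowed in CANDIDATES:
        if letters <= set(allowed):
            return name
    raise ValueError("no candidate alphabet fits %r" % sequence)
```

One consequence of this ordering is that a string such as 'GATCN' is accepted
as protein before ambiguous DNA is tried, because every one of its letters is
also an amino acid code.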

3) When changing a sequence, check if it is still consistent with the alphabet. 
Right now, we can do
 >>> from Bio.Seq import *
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> my_seq[:10] = "weirdstuff"
 >>> my_seq
MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), IUPACUnambiguousDNA())
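
A rough sketch of what such a check could look like. CheckedSeq is a
hypothetical stand-in, not the existing MutableSeq, and its alphabet is
simplified to a plain string of allowed letters.

```python
class CheckedSeq:
    """Hypothetical mutable sequence that validates every assignment."""

    def __init__(self, data, letters):
        self.letters = frozenset(letters)
        self._check(data)
        self._data = list(data)

    def _check(self, text):
        # Reject any letter that is not part of the alphabet.
        bad = set(text) - self.letters
        if bad:
            raise ValueError("letters not in alphabet: %s" % "".join(sorted(bad)))

    def __setitem__(self, index, value):
        self._check(value)
        if isinstance(index, slice):
            self._data[index] = list(value)
        elif len(value) == 1:
            self._data[index] = value
        else:
            raise ValueError("a single position takes a single letter")

    def __str__(self):
        return "".join(self._data)
```

With this class, assigning "weirdstuff" to a slice of a DNA sequence raises a
ValueError and leaves the sequence unchanged.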

4) Make Seq objects understand circular genomes. Many bacterial genomes are 
circular. It would be nice if we could take the indices [-1000:1000] from a Seq 
object if it is circular, or [3999000:4001000] if the sequence is circular 
with length 4000000.
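
To make the intended behaviour concrete, here is a sketch of circular slicing
as a plain function on strings. The name circular_slice is hypothetical, and
the modulo-based wrap-around is one possible reading of the proposal.

```python
def circular_slice(sequence, start, stop):
    """Return the letters from start up to stop, wrapping around the origin."""
    n = len(sequence)
    if n == 0:
        return sequence[:0]
    # Indices outside 0..n-1 are mapped back onto the circle.
    return "".join(sequence[i % n] for i in range(start, stop))
```

For an 8-letter circular sequence 'ABCDEFGH', both circular_slice(s, -2, 2)
and circular_slice(s, 6, 10) give 'GHAB', whereas the ordinary Python slice
s[-2:2] is empty.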

5) Perhaps it would be a good idea to add transcribe and translate methods to 
the Seq class. Currently, to translate a DNA sequence, we have to do
 >>> from Bio.Seq import Seq
 >>> from Bio import Translate
 >>> from Bio.Alphabet import IUPAC
 >>> my_alpha = IUPAC.unambiguous_dna
 >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
 >>> standard_translator = Translate.unambiguous_dna_by_id[1]
 >>> standard_translator.translate(my_seq)
Seq('AIVMGR*KGAR', IUPACProtein())
which is too much typing for my taste.
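
A sketch of what such methods could look like on a simplified class. SimpleSeq
is a hypothetical stand-in for Seq; the codon table is the standard genetic
code written in the usual TCAG order, and an incomplete codon at the end is
ignored, which is an assumption rather than a statement about Bio.Translate.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


class SimpleSeq:
    """Hypothetical sequence class with transcribe and translate methods."""

    def __init__(self, data):
        self.data = data

    def transcribe(self):
        return SimpleSeq(self.data.replace("T", "U"))

    def translate(self):
        dna = self.data.replace("U", "T")   # accept DNA or RNA input
        end = len(dna) - len(dna) % 3       # ignore an incomplete last codon
        try:
            protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, end, 3))
        except KeyError as error:
            raise ValueError("cannot translate codon %s" % error) from None
        return SimpleSeq(protein)

    def __str__(self):
        return self.data
```

Translation then takes a single call: str(SimpleSeq('ATGGCCTAA').translate())
gives 'MA*'.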

Any thoughts/comments/suggestions?

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From thamelry at binf.ku.dk  Wed Apr 27 08:09:28 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Wed Apr 27 08:05:56 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Message-ID: <33165.83.92.3.59.1114603768.squirrel@www.binf.ku.dk>


Hi Michiel,

> Any thoughts/comments/suggestions?

I happen to be doing some sequence stuff myself at the moment
and I couldn't agree more with the points you raised. It could
all be a lot more straightforward!

Cheers,

-Thomas

From hoffman at ebi.ac.uk  Wed Apr 27 08:37:03 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed Apr 27 08:37:45 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.62.0504271321290.28964@qnzvnan.rov.np.hx>

On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:

> 1) Make Seq objects mutable, and get rid of MutableSeq.

I imagine it will be a lot slower to replace built-in strings with
character arrays. Right now, I only use Seq when I absolutely have to.

Personally, I'd love it if Seq were just a light-weight subclass of
str without the performance penalties of the existing Seq. Using a
Surrogate pattern slows down all those inner loops a lot. Also lots of
unnecessary input-checking does as well. I think performance should be
a concern when you are talking about what should be the most-used part
of the library.

Similarly, I think lots of magic trying to figure out the alphabet is
a bad idea. There are only a few operations that actually require the
alphabet to be known, and most of the time I store a sequence in
memory I'm not going to need any of these, so having to deal with
alphabet issues when it's unnecessary is just going to be a pain in
the butt that will keep me from using Seq. Similarly, I use augmented
alphabets with things like B in them and I don't want Seq yelling at
me when there's no point. Sure, if it can't figure out how to revcom
the sequence, but just to instantiate it?

I think these principles from the Zen of Python would be
well-considered here:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Sparse is better than dense.
Readability counts.
In the face of ambiguity, refuse the temptation to guess.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.

> > > > Right now, we can do
> > > >  from Bio.Seq import *
> > > >  from Bio.Alphabet import IUPAC
> > > >  my_alpha = IUPAC.unambiguous_dna
> > > >  my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> > > >  my_seq[:10] = "weirdstuff"
> > > >  my_seq
> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), 
> IUPACUnambiguousDNA())

"Doctor, it hurts when I do this."
"Don't do that."

> 4) Make Seq objects understand circular genomes. Many bacterial genomes are 
> circular. It would be nice if we could take the indices [-1000:1000] from a 
> Seq object, if it is circular, or [3999000:40001000] if the sequence is 
> circular with length 4000000.

I'm sure that will be useful to some people. But having a CircularSeq
subclass would make it easier to avoid this extra functionality from
impacting on the primary use case.

> 5) Perhaps it would be a good idea to add transcribe and translate methods to 
> the Seq class.

+1

You would obviously have to specify an alphabet for this, but I'm fine
with that so long as I'm not forced to when I don't need to.
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute
From mdehoon at ims.u-tokyo.ac.jp  Wed Apr 27 23:29:33 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Wed Apr 27 23:17:53 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <Pine.LNX.4.62.0504271321290.28964@qnzvnan.rov.np.hx>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
	<Pine.LNX.4.62.0504271321290.28964@qnzvnan.rov.np.hx>
Message-ID: <4270589D.2090500@ims.u-tokyo.ac.jp>

Michael Hoffman wrote:

> On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
> 
>> 1) Make Seq objects mutable, and get rid of MutableSeq.
> 
> I imagine it will be a lot slower to replace built-in strings with
> character arrays. Right now, I only use Seq when I absolutely have to.
Well I wouldn't replace them with character arrays, the idea would be to 
reimplement the Seq class in C. So it would not be slower than built-in strings, 
maybe even a bit faster. The Seq object would look like a string object, but be 
mutable.

> Similarly, I think lots of magic trying to figure out the alphabet is
> a bad idea. There are only a few operations that actually require the
> alphabet to be known, and most of the time I store a sequence in
> memory I'm not going to need any of these, so having to deal with
> alphabet issues when it's unnecessary is just going to be a pain in
> the butt that will keep me from using Seq. Similarly, I use augmented
> alphabets with things like B in them and I don't want Seq yelling at
> me when there's no point. Sure, if it can't figure out how to revcom
> the sequence, but just to instantiate it?

OK, then how about this:
- By default, don't assume a particular alphabet. Same as how it works now:
 >>> from Bio.Seq import *
 >>> Seq('ATCG')
Seq('ATCG', Alphabet())
- If the user decides to specify the alphabet, make sure the sequence is 
consistent with it. Of course, if the alphabet is Alphabet(), don't do any input 
checking. So essentially, the user gets to decide whether she wants input 
checking for the sequence or not.

>> >>> Right now, we can do
>> >>>  from Bio.Seq import *
>> >>>  from Bio.Alphabet import IUPAC
>> >>>  my_alpha = IUPAC.unambiguous_dna
>> >>>  my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> >>>  my_seq[:10] = "weirdstuff"
>> >>>  my_seq
>> MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), 
>> IUPACUnambiguousDNA())
> 
> "Doctor, it hurts when I do this."
> "Don't do that."

Well you would be right if this were Biofortran. For a higher-level language, I 
would expect better checking to make sure an object is self-consistent. Python 
itself is full of checks and assertions.
Another option would be to get rid of alphabets altogether. What good are they 
otherwise?

>> 4) Make Seq objects understand circular genomes. Many bacterial 
>> genomes are circular. It would be nice if we could take the indices 
>> [-1000:1000] from a Seq object, if it is circular, or 
>> [3999000:40001000] if the sequence is circular with length 4000000.
> 
> I'm sure that will be useful to some people. But having a CircularSeq
> subclass would make it easier to avoid this extra functionality from
> impacting on the primary use case.

My feeling is that having a subclass is a bit of an overkill. The idea is to 
have an optional topology argument, which defaults to "linear". So the primary 
use case would not be affected.

> 
>> 5) Perhaps it would be a good idea to add transcribe and translate 
>> methods to the Seq class.
> 
> +1
> 
> You would obviously have to specify an alphabet for this, but I'm fine
> with that so long as I'm not forced to when I don't need to.

If the alphabet defaults to Alphabet() when creating a Seq object, then I'd 
think the transcribe and translate methods should work even if a user doesn't 
specify the sequence to be DNA or RNA. My current gripe with the Seq object is 
that there are too many steps to translate a DNA sequence.

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From Frederic.Sohm at iaf.cnrs-gif.fr  Thu Apr 28 03:35:01 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (=?iso-8859-1?b?RnLpZOlyaWM=?= Sohm)
Date: Thu Apr 28 03:28:19 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>

Hi,

I was following your discussion on Seq object. I more or less agree with Michiel.

But some thoughts :

1) get rid of MutableSeq and make all Seq mutable.
Will it not be a problem for some people? I only use MutableSeq, so there is
no problem for me, but I assume that someone uses the non-mutable Seq. Or is it
a feature which is not needed?

2) Checking the alphabet. Yes, good, with Michael's remark not to force it on
people who don't want it. It can be painful for really long sequences.

4) circular sequences and indices. Nice. from experience not so easy to
implement correctly though.

5) Translate and transcribe. Yes obviously a good thing.


If you are interested, you can have a look at the DNA object in rana.
If you want it, take it under a Biopython licence. 
It's only DNA and it's certainly not worth using, but it can give you some
ideas. A lot of complexity is added by supporting biological indexing [1:len]
rather than Python indexing [0:len-1]. This would not be a sensible thing to do
in Biopython.

This is the C implementation of the Python string, modified with alphabet
checking (a modification of the string translate() method with the alphabet
hard-coded in) and support for circular sequences. 
If you want, have a look at the DNA object for rana here:

http://cvs.sourceforge.net/viewcvs.py/rana/rana/Rana/c_extension/DNAdata.c?rev=1.9&view=markup

The code is pretty bad, well my code not the python one.

Fred

-- 
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47

From hoffman at ebi.ac.uk  Thu Apr 28 04:56:51 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu Apr 28 04:50:05 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <4270589D.2090500@ims.u-tokyo.ac.jp>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>
	<Pine.LNX.4.62.0504271321290.28964@qnzvnan.rov.np.hx>
	<4270589D.2090500@ims.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.62.0504280940500.30253@qnzvnan.rov.np.hx>

On Thu, 28 Apr 2005, Michiel Jan Laurens de Hoon wrote:

> Michael Hoffman wrote:
>
>>  On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>> 
>> >  1) Make Seq objects mutable, and get rid of MutableSeq.
>> 
>>  I imagine it will be a lot slower to replace built-in strings with
>>  character arrays. Right now, I only use Seq when I absolutely have to.

> Well I wouldn't replace them with character arrays, the idea would be to 
> reimplement the Seq class in C. So it would not be slower than built-in 
> strings, maybe even a bit faster. The Seq object would look like a string 
> object, but be mutable.

If you can make a sequence class that is faster than the current
built-in string, I would suggest you submit a patch to the Python
tracker to make it a replacement for the current built-in string. :P

> OK, then how about this:
> - By default, don't assume a particular alphabet. Same as how it works now:
> > > >  from Bio.Seq import *
> > > >  Seq('ATCG')
> Seq('ATCG', Alphabet())

+1

>> > > > >   my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> > > > >   my_seq[:10] = "weirdstuff"
>> > > > >   my_seq
>> >  MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'), 
>> >  IUPACUnambiguousDNA())
>> 
>>  "Doctor, it hurts when I do this."
>>  "Don't do that."
>
> Well you would be right if this were Biofortran. For a higher-level language, 
> I would expect better checking to make sure an object is self-consistent. 
> Python itself is full of checks and assertions.

How often have you actually put "weirdstuff" into the middle of a
MutableSeq? Have you ever done this, or are you just imagining that it
might happen?

You Aren't Gonna Need It. The number of checks and assertions you can
make is limitless and you have to know where to draw the line. To me,
the line should be drawn at user input, but not at every internal
change to a sequence made within a program. Maybe optional alphabet
checking would help with this.

> Another option would be to get rid of alphabets altogether. What good are 
> they otherwise?

They're useful for transcription/translation/reverse complement
operations. And as far as I'm concerned, that's a good place to do
error checking, should it be necessary.

>>  But having a CircularSeq subclass would make it easier to avoid
>>  this extra functionality from impacting on the primary use case.
>
> My feeling is that having a subclass is a bit of an overkill. The idea is to 
> have an optional topology argument, which defaults to "linear". So the 
> primary use case would not be affected.

If you're doing this in C, then my performance assumptions are perhaps
incorrect. I wouldn't want every slice of my linear sequence to have
to go through "is this circular?" logic in Python.

> If the alphabet defaults to Alphabet() when creating a Seq object,
> then I'd think the transcribe and translate methods should work even
> if a user doesn't specify the sequence to be DNA or RNA. My current
> gripe with the Seq object is that there are too many steps to
> translate a DNA sequence.

Good point. Perhaps a warning when it has to guess?
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute
From hoffman at ebi.ac.uk  Thu Apr 28 04:59:51 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu Apr 28 05:14:23 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Message-ID: <Pine.LNX.4.62.0504280957350.30253@qnzvnan.rov.np.hx>

On Thu, 28 Apr 2005, Frédéric Sohm wrote:

> 1) get rid of MutableSeq and make all Seq mutable.
> Will it not be a problem for some people there? I mean I only use MutableSeq so
> no problem there for me but I assume that someone uses non-mutable Seq or is it a
> feature which is not needed?

In the rest of CPython, immutable objects have two benefits: they are more
memory-efficient (and sometimes space-efficient), and they are
hashable. I don't think Seqs are usefully hashable right now, and
Michiel says he will code the new Seq such that there won't be a
significant performance impact.
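
The hashability point can be illustrated with built-in types. In this sketch
an immutable str and a mutable bytearray stand in for an immutable and a
mutable sequence class; it says nothing about how Seq itself behaves.

```python
counts = {}

# An immutable string is hashable, so it can serve as a dictionary key.
counts["GATC"] = 1

# A mutable bytearray is not hashable and is rejected as a key.
try:
    counts[bytearray(b"GATC")] = 1
except TypeError:
    mutable_is_hashable = False
else:
    mutable_is_hashable = True
```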
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute
From mdehoon at ims.u-tokyo.ac.jp  Fri Apr 29 01:05:14 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 00:53:15 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
Message-ID: <4271C08A.3050306@ims.u-tokyo.ac.jp>

Frédéric Sohm wrote:
> 1) get rid of MutableSeq and make all Seq mutable.
> Will it not be a problem for some people there? I mean I only use MutableSeq so
> no problem there for me but I assume that someone uses non-mutable Seq or is it a
> feature which is not needed?

As far as I can tell, the only way in which a mutable Seq may affect a user is 
in terms of performance, as Michael pointed out. But anyway, as soon as we reach 
some conclusion on Seq/MutableSeq on biopython-dev, I'll send a message to the 
biopython mailing list to see if any of the users will get into problems because 
of this. Also, I expect that MutableSeq will be around for some time as a 
deprecated class.

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From mdehoon at ims.u-tokyo.ac.jp  Fri Apr 29 01:15:28 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 01:03:21 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <Pine.LNX.4.62.0504280940500.30253@qnzvnan.rov.np.hx>
References: <426F2EAB.3000709@ims.u-tokyo.ac.jp>	<Pine.LNX.4.62.0504271321290.28964@qnzvnan.rov.np.hx>	<4270589D.2090500@ims.u-tokyo.ac.jp>
	<Pine.LNX.4.62.0504280940500.30253@qnzvnan.rov.np.hx>
Message-ID: <4271C2F0.7070608@ims.u-tokyo.ac.jp>

Michael Hoffman wrote:
 > On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>> Another option would be to get rid of alphabets altogether. What good 
>> are they otherwise?
> 
> They're useful for transcription/translation/reverse complement
> operations. And as far as I'm concerned, that's a good place to do
> error checking, should it be necessary.
> 

For transcription and translation, we don't need to know the alphabet. 
Effectively, by calling translate or transcribe, the user is telling us that the 
input sequence object is DNA or RNA, and that the output sequence is RNA (for 
transcription) or protein (for translation). Of course, when a character other 
than ACGTU is encountered, we need to raise an error. But the point is that 
knowing the Alphabet doesn't tell us anything we don't already know.

For reverse complement, we also don't need to know the alphabet; it is either 
DNA or RNA. The only exception is when a user wants to reverse complement a 
sequence that does not contain a T or a U. But the current situation, where we 
have IUPACProtein, ExtendedIUPACProtein, IUPACAmbiguousDNA, IUPACUnambiguousDNA, 
ExtendedIUPACDNA, IUPACAmbiguousRNA, IUPACUnambiguousRNA alphabets, is an 
overkill. It would be much easier to have a reverse_complement and a 
rna_reverse_complement function (or something like that).
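
A sketch of the two functions suggested above, operating on plain uppercase
strings. Only the four unambiguous letters are handled; any other character is
passed through unchanged, which is a simplification for illustration.

```python
# Translation tables mapping each base to its complement.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA string."""
    return sequence.translate(_DNA_COMPLEMENT)[::-1]


def rna_reverse_complement(sequence):
    """Reverse complement of an RNA string."""
    return sequence.translate(_RNA_COMPLEMENT)[::-1]
```

Here the caller's choice of function carries the information that an alphabet
object would otherwise provide: reverse_complement('AACG') gives 'CGTT' and
rna_reverse_complement('AACG') gives 'CGUU'.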

So I still don't see any use for alphabets other than input checking. Or am I 
missing something here?

--Michiel.

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon
From mdehoon at ims.u-tokyo.ac.jp  Fri Apr 29 01:20:08 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Fri Apr 29 01:08:07 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <Pine.LNX.4.62.0504280957350.30253@qnzvnan.rov.np.hx>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
	<Pine.LNX.4.62.0504280957350.30253@qnzvnan.rov.np.hx>
Message-ID: <4271C408.9010401@ims.u-tokyo.ac.jp>

Michael Hoffman wrote:
> On Thu, 28 Apr 2005, Frédéric Sohm wrote:
> 
>> 1) get rid of MutableSeq and make all Seq mutable.
>> Will it not be a problem for some people there? I mean I only use 
>> MutableSeq so
>> no problem there for me but I assume that someone uses non-mutable Seq 
>> or is it a
>> feature which is not needed?
> 
> In the rest of CPython, immutable have two benefits: they are more
> memory-efficient (and sometimes space-efficient), and they are
> hashable. I don't think Seqs are usefully hashable right now, and
> Michiel says he will code the new Seq such that there won't be a
> significant performance impact.

Would you be willing to test the performance of a new Seq class? I haven't 
actually written any code yet, but I could send it to you when it's done before 
including it in Biopython. Note also that a mutable Seq class avoids the need 
for calls to tomutable and toseq, so there may be an overall performance gain. 
But it would be better to test this on a real-life case.
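
As a rough illustration of what such a test might measure, here is a minimal 
sketch that uses a plain Python str and a bytearray as stand-ins for an 
immutable and a mutable sequence. These are not the Biopython Seq or 
MutableSeq classes, and the function names are hypothetical. The round-trip 
version converts before and after every change, much as calls to tomutable and 
toseq would.

```python
import timeit


def edit_with_round_trip(sequence, position, letter):
    # Immutable style: convert to a mutable buffer, edit, convert back,
    # analogous to calling tomutable and toseq around every change.
    buffer = bytearray(sequence, "ascii")
    buffer[position] = ord(letter)
    return buffer.decode("ascii")


def edit_in_place(buffer, position, letter):
    # Mutable style: change the letter directly, with no conversions.
    buffer[position] = ord(letter)
    return buffer


if __name__ == "__main__":
    # Stand-in data; a real-life test would use actual sequences.
    immutable = "ACGT" * 25000
    mutable = bytearray(immutable, "ascii")
    t_round = timeit.timeit(
        lambda: edit_with_round_trip(immutable, 10, "N"), number=200)
    t_place = timeit.timeit(
        lambda: edit_in_place(mutable, 10, "N"), number=200)
    print("round trip: %.4f s, in place: %.4f s" % (t_round, t_place))
```

The timings depend on the machine and on the sequence length, so this only 
shows the shape of the comparison and does not settle the question.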

--Michiel.

From hoffman at ebi.ac.uk  Fri Apr 29 04:02:05 2005
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Fri Apr 29 03:55:11 2005
Subject: [Biopython-dev] Rethinking Seq objects
In-Reply-To: <4271C408.9010401@ims.u-tokyo.ac.jp>
References: <1114673701.4270922513bad@mail.iaf.cnrs-gif.fr>
	<Pine.LNX.4.62.0504280957350.30253@qnzvnan.rov.np.hx>
	<4271C408.9010401@ims.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.62.0504290859080.6976@qnzvnan.rov.np.hx>

On Fri, 29 Apr 2005, Michiel Jan Laurens de Hoon wrote:

> Would you be willing to test the performance of a new Seq class? I
> haven't actually written any code yet, but I could send it to you
> when it's done before including it in Biopython. Note also that a
> mutable Seq class avoids the need for calls to tomutable and toseq,
> so there may be an overall performance gain. But it would be better
> to test this on a real-life case.

Sure, I think I could whip something up.
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute
From Frederic.Sohm at iaf.cnrs-gif.fr  Fri Apr 29 04:03:56 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm)
Date: Fri Apr 29 03:57:03 2005
Subject: [Biopython-dev] Rethinking Seq objects
Message-ID: <1114761836.4271ea6ca540a@mail.iaf.cnrs-gif.fr>

I am ready to do some performance testing, if you want. I am looking for a
replacement for the DNA object I have written and could well switch to the
biopython Seq object if it is faster. 


Fred

-- 
Frédéric Sohm
Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47