From mdehoon at c2b2.columbia.edu  Wed Nov  1 00:58:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 01 Nov 2006 00:58:41 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4545D9F1.2040902@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>
	<454574B4.3050407@c2b2.columbia.edu>
	<4545D9F1.2040902@maubp.freeserve.co.uk>
Message-ID: <45483791.7070803@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> With such a speed up, I'd guess you were using Bio.Fasta before.

Yes I was. I just went to the Biopython tutorial and used the stuff in
section 2.4. I didn't expect it to be *that* slow.

> I've noticed the same thing.  Are you dealing with NCBI style fasta 
> identifiers made up of several fields separated by "|" characters?

Yep.

>> For duplicate keys, there are at least four possibilities (raise an
>> exception, store only one of the keys, store neither of the keys
>> and don't raise an exception, store both after modifying one of the
>> keys). So this should also be an option.
> 
> Supporting all these options with an easy to understand interface
> looks too hard.
> 
> In my opinion if someone is trying to build a dictionary using
> repeated keys they have made a mistake (either in their datafile, or
> their record2key function) - so raising an exception is reasonable
> default behaviour (and is easy to code).

You're probably right. I'm fine with raising an exception.
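
A minimal sketch of that default, assuming a hypothetical helper named
records_to_dict and a stand-in Record type (neither is part of Bio.SeqIO, and
the identifiers are made up in the NCBI style). The record2key argument is the
key function discussed above, and a repeated key raises an exception rather
than being resolved silently:

```python
from collections import namedtuple

# Stand-in for a SeqRecord with only the fields used here (an assumption).
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=lambda record: record.id):
    """Build a dict from an iterable of records, refusing duplicate keys."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


# Made-up identifiers with several fields separated by "|" characters.
records = [
    Record("gi|1|ref|EXAMPLE_1|", "MKV"),
    Record("gi|2|ref|EXAMPLE_2|", "MLL"),
]
# Use the second "|"-separated field as the dictionary key.
by_gi = records_to_dict(records, record2key=lambda r: r.id.split("|")[1])
```

A caller who only wants the sequence can still write by_gi["1"].seq, which is
why a separate record2value argument may not be needed.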

>> In the File2SequenceDict above, answer[key] contains the complete 
>> record. Some people will want that. However, in my application I
>> only want to store the record.seq part in answer[key]. Somebody
>> else may want str(record.seq). So we'd also need a record2value
>> argument.
> 
> It does slightly undermine the "you only get SeqRecord objects" 
> principle.  On the other hand, it's a simple addition that is easy to
> explain and implement.  I'm happy to add this.

The point I was trying to make is that for a File2SequenceDict function 
to be useful, it would end up being too complex. In the answer above, a 
user could also do answer[key].seq to get the part she wants, so maybe a 
record2value argument is not essential in practice.

Part of my opposition against the File2SequenceDict function is that it 
requires the parser to be called File2SequenceIterator (which I don't 
like as a name, but more about that some other time), which then leads 
to a File2SequenceList function, which is software bloat.

So, how about making the functionality of File2SequenceDict available as 
a todict() method to the iterator object returned by 
File2SequenceIterator, or as an iterator2dict function?

--Michiel.

From biopython-dev at maubp.freeserve.co.uk  Wed Nov  1 05:09:59 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 01 Nov 2006 10:09:59 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45483791.7070803@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
Message-ID: <45487277.6080308@maubp.freeserve.co.uk>

> The point I was trying to make is that for a File2SequenceDict
> function to be useful, it would end up being too complex.

Of course I'm going to be biased here, but I do find the simple current
dictionary construction useful as it is.  Clearly we have slightly
different uses in mind (which is good - the design should try and cater
to most people).

> In the answer above, a user could also do answer[key].seq to get the
> part she wants, so maybe a record2value argument is not essential in
> practice.
> 
> Part of my opposition against the File2SequenceDict function is that
> it requires the parser to be called File2SequenceIterator (which I
> don't like as a name, but more about that some other time), which
> then leads to a File2SequenceList function, which is software bloat.
> 
> So, how about making the functionality of File2SequenceDict available
> as a todict() method to the iterator object returned by 
> File2SequenceIterator, or as an iterator2dict function?

I do like your first suggestion - the idea of adding a todict() method 
to the iterator objects.  However, that would require that all the 
parsers be written as (sub)classes, and right now several of them are 
written as generator functions.

I've found generator functions to be very simple and easy to
understand.  They seem like a good choice for simple file formats.  But
with a good enough reason, I could turn them into classes.
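
To illustrate the trade-off, here is a rough sketch (hypothetical names, not
the actual Bio.SeqIO parsers) of the same toy FASTA title parser written both
ways.  The generator function is shorter, but the plain generator object it
returns has no place to hang a todict() method; the class version does:

```python
import io


def fasta_titles(handle):
    """Generator function parser: yield the title line of each record."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].rstrip()


class FastaTitleIterator:
    """Class based equivalent, which has room for extra methods."""

    def __init__(self, handle):
        self._titles = fasta_titles(handle)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._titles)

    def to_dict(self, title2key=lambda title: title.split()[0]):
        """Consume the iterator into a dict, refusing duplicate keys."""
        answer = {}
        for title in self:
            key = title2key(title)
            if key in answer:
                raise ValueError("Duplicate key %r" % (key,))
            answer[key] = title
        return answer


data = ">alpha first record\nACGT\n>beta second record\nGGCC\n"
titles = list(fasta_titles(io.StringIO(data)))
by_id = FastaTitleIterator(io.StringIO(data)).to_dict()
```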

                         ----

Right now I am making both "file to dict" and "iterator to dict"
functions available:

File2SequenceDict(..., record2key) is implemented as
SequenceIter2Dict(File2SequenceIterator(...), record2key)

Also:
File2Alignment(...) is implemented as
Iter2Alignment(File2SequenceIterator(...))

And:
File2SequenceList(...) is implemented as list(File2SequenceIterator(...))

Leaving aside the names (which I notice are not currently consistent) I
would be fine with removing File2SequenceList, File2SequenceDict, and
File2Alignment but retaining the two functions which convert from a
SeqRecord returning iterator into dict or an alignment.

How does that sound Michiel (subject to agreeing on names)?

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Nov  1 17:50:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Nov 2006 17:50:46 -0500
Subject: [Biopython-dev] [Bug 2131] New: SProt.py fails to parse the current
	Swiss-Prot version 51.0
Message-ID: <bug-2131-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

           Summary: SProt.py fails to parse the current Swiss-Prot version
                    51.0
           Product: Biopython
           Version: 1.24
          Platform: Macintosh
        OS/Version: MacOS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: Biosql at hotmail.com


Hi, 

I'm running Mac OS X 10.4 with Python 2.5. I tried to parse the Swiss-Prot
.dat file with the latest SProt.py version and got this:

Traceback (most recent call last):
  File "Parser_SProt_to_DB.py", line 37, in <module>
    cur_record = s_iterator.next()
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 166, in next
    return self._parser.parse(File.StringHandle(data))
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 290, in parse
    self._scanner.feed(handle, self._consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 332, in feed
    self._scan_record(uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 337, in _scan_record
    fn(self, uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 369, in _scan_id
    self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 359, in _scan_line
    read_and_call(uhandle, event_fn, start=line_type)
  File "/sw/lib/python2.5/site-packages/Bio/ParserSupport.py", line 301, in read_and_call
    method(line)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 526, in identification
    self.data.sequence_length = int(cols[4])
ValueError: invalid literal for int() with base 10: 'AA.'

Any clue ?

Thanks !


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From chris.lasher at gmail.com  Wed Nov  1 22:49:04 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 1 Nov 2006 22:49:04 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
	<4542F123.9050106@c2b2.columbia.edu>
	<45434611.1040708@maubp.freeserve.co.uk>
	<4544458A.5000102@c2b2.columbia.edu>
	<45448FAF.1090104@maubp.freeserve.co.uk>
	<454574B4.3050407@c2b2.columbia.edu>
	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
	<45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>

I'd like to pitch in a few comments here.

Peter wrote:
> One point against names like File2SequenceIterator is the pun on two
> versus to (i.e. convert) will not be so obvious to non-native English
> speakers.

I'd like to second that. It's cute, sure, but FileToSequenceIterator
isn't that much more difficult, and leaves no room for confusion.
(e.g., Where's the File1SequenceIterator?)

Michiel wrote:
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.

Yikes! Are you serious? Why not make it easier and require a file-like
object? I would definitely not be for it taking a plain string. This
seems implicit rather than explicit. "Takes a file... or a file-like
object... or a string containing a filename... or just a string
containing the file contents... or a brief description of the data
that's in your file... or a bunch of smiley emoticons, if you're in a
good mood..." File-like objects are testable and leave little room for
surprise. Anything else seems like it's asking for a headache.

Which brings me to the issue of "guessing" a file's format. Yikes,
again! I'd expect that kind of "magickery" from Perl, but once again,
explicit is better than implicit. I honestly think it's not too much
to expect the user to know what filetype they're expecting BioPython
to deal with. Could you guys please explain the motivation behind this
to me? As I see it right now, the last thing I want is BioPython
incorrectly guessing my file format, and particularly, assuming that I
have put the proper extension to represent the file format. The
unified sequence object is what's beautiful about SeqIO, but the
guesswork that you are discussing having SeqIO's classes do is scary,
to me.

And I think by now it's predictable that I'm a fan of Peter's
suggestion to have an exception raised upon the attempt to create a
dictionary with identical IDs; all other options are, again, too
implicit for my tastes.

Thanks very much for developing SeqIO and discussing it so much, guys.
I think this will be a fantastic asset to BioPython! Keep on rockin'
it!

Chris

From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 06:29:38 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 06:29:38 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021129.kA2BTcOX010117@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 06:29 -------
Hi Jonathan,

What version of BioPython are you using?

I know that bugzilla needs updating to include more up to date version numbers,
but you aren't really using BioPython 1.24 are you?

Currently the latest release is 1.42, and this does include some updates for
SProt, e.g. bug 1948. There is also a more recent fix in CVS for bug 2043
dealing with new style RX lines.

Could you tell us which SProt file you are using (a URL would be fine)?  If
there are many that fail the same way, and you have a small example input file,
you could even attach it to this bug.

Peter



From biopython-dev at maubp.freeserve.co.uk  Thu Nov  2 07:49:30 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 02 Nov 2006 12:49:30 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>
	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
Message-ID: <4549E95A.6080605@maubp.freeserve.co.uk>

Chris Lasher wrote:
> I'd like to pitch in a few comments here.
> 
> Peter wrote:
>> One point against names like File2SequenceIterator is the pun on 
>> two versus to (i.e. convert) will not be so obvious to non-native 
>> English speakers.
> 
> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>  isn't that much more difficult, and leaves no room for confusion. 
> (e.g., Where's the File1SequenceIterator?)

I would be happy with FileToSequenceIterator, or even
FileToSequenceIter.  FileToSeqIter is shorter but we don't actually
return Seq objects so I would avoid that.

Does anyone else have any suggestions?

> Michiel wrote:
>> I like the idea of one argument that takes a file name or handle. I
>>  believe that that is how other Biopython functions work.

I've had a little look, and the only case I found is the recent
Bio.Nexus parser - and this choked on a StringIO handle on my machine
(fix checked in).

Chris Lasher wrote:
> Yikes! Are you serious? Why not make it easier and require a 
> file-like object? I would definitely not be for it taking a plain 
> string. This seems implicit rather than explicit. "Takes a file... or
> a file-like object... or a string containing a filename... or just a
> string containing the file contents... or a brief description of the
> data that's in your file... or a bunch of smiley emoticons, if 
> you're in a good mood..." File-like objects are testable and leave 
> little room for surprise. Anything else seems like it's asking for a 
> headache.

Trying to distinguish between an (invalid) filename and the contents of
a sequence file is just too much to ask - more a migraine than a headache.

As an experiment, I've implemented (but not checked in) automatic
handle/filename detection.  It seems to work (but I have not yet tried
exotic arguments like file names in Unicode, or random classes with a
__str__ method).  Still, it's messy.
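
For illustration only, handle/filename detection could look something like
the sketch below.  This is not the experimental code mentioned above; the
function name open_if_filename is made up, and it assumes that any string is
a filename and that anything with a read method is a handle:

```python
import io


def open_if_filename(handle_or_filename, mode="r"):
    """Return (handle, opened_here) for a filename or a file-like object."""
    if isinstance(handle_or_filename, str):
        # Any string is taken to be a filename.  A string holding the file
        # contents would end up here too and fail in open().
        return open(handle_or_filename, mode), True
    if hasattr(handle_or_filename, "read"):
        # Already a handle; the caller stays responsible for closing it.
        return handle_or_filename, False
    raise TypeError(
        "Expected a filename or a handle, got %r" % (handle_or_filename,)
    )


in_memory = io.StringIO(">example\nACGT\n")
handle, opened_here = open_if_filename(in_memory)
```

The opened_here flag is there so that the function which opened a file can
also close it, while leaving handles supplied by the caller alone.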

While it does sound like a nice idea for the end user, the idea of
filenames and handles is pretty important in Python, and maybe we
shouldn't worry about forcing newcomers to deal with handles.  After all,
the SeqIO system will make them deal with iterators and SeqRecords which
I think are far more complicated!

What do you think Michiel?

Chris Lasher wrote:
> Which brings me to the issue of "guessing" a file's format. Yikes, 
> again! I'd expect that kind of "magickery" from Perl, but once again,
> explicit is better than implicit. I honestly think it's not too much
> to expect the user to know what filetype they're expecting BioPython
> to deal with. Could you guys please explain the motivation behind 
> this to me? As I see it right now, the last thing I want is BioPython
> incorrectly guessing my file format, and particularly, assuming that
> I have put the proper extension to represent the file format. The 
> unified sequence object is what's beautiful about SeqIO, but the
> guesswork that you are discussing having SeqIO's classes do is scary,
> to me.

For comparison this quote is from the BioPerl SeqIO How-To:
>> [BioPerl's] SeqIO can try to guess based on known file extensions 
>> or content, ... it is a good idea to get into the practice of 
>> always specifying the format.

I want to stress that as written, the user can specify the file format
to the File2SequenceIterator function (and its variants).  Maybe we
should encourage people to explicitly supply the format in any Bio.SeqIO
documentation....

You asked about motivation for guessing the file format.  I break that
down into guessing the file format based on the file extension, or based
on the file's contents (see later).

I personally am perfectly happy with using a file extension to file
format mapping.  Maybe this reflects my computing background (more
DOS/Windows background than Unix/Linux).

Note that if the format is not specified, and the file extension is not
on the known list (e.g. "txt" or "data" which could be anything) then
the call to File2SequenceIterator function (or its variants) will fail
with an invalid format message/exception.
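
As a rough sketch of that behaviour (the mapping, the function name and the
exception type below are illustrative assumptions, not the actual list or
code in Bio.SeqIO):

```python
import os

# Illustrative extension to format mapping; the real list may differ.
EXTENSION_TO_FORMAT = {
    "fasta": "fasta",
    "faa": "fasta",
    "gbk": "genbank",
    "aln": "clustal",
}


def format_from_filename(filename, format=None):
    """Return the explicit format if given, else look up the extension."""
    if format is not None:
        # An explicit format always wins over the extension.
        return format
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            "Could not determine the file format from extension %r"
            % (extension,)
        )


guessed = format_from_filename("proteins.FASTA")
explicit = format_from_filename("proteins.txt", format="genbank")
```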

Assuming we don't make the format a required argument, and we keep the
extension to format mappings, then I should make a point of including
deliberate mismatches in the test suite - and check that they abort
with a SyntaxError.

Regarding guessing the format based on file contents:

For some applications, having a format guesser built into BioPython
might actually be very useful - the example given on the BioPerl website
is the back end of a web tool that took sequence input, where maybe you
can't trust the actual end user to know exactly what file format their
data is in.

Doing this for some file formats isn't too hard; often all you need to
see is the first line.  For other file formats it's very tricky and best
not attempted.  But is partial guess support even worth implementing -
especially as it may be less than perfect and get it wrong sometimes?
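
A minimal sketch of what partial guessing from the first line might look
like.  The function name is hypothetical and the list of formats is only a
sample; it returns None whenever the first line is not decisive:

```python
def guess_format_from_first_line(first_line):
    """Return a format name, or None when the first line is not decisive."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("LOCUS "):
        return "genbank"
    if first_line.startswith("CLUSTAL"):
        return "clustal"
    if first_line.startswith("# STOCKHOLM"):
        return "stockholm"
    # EMBL and Swiss-Prot records both start with an "ID   " line, so the
    # first line alone cannot separate them; decline to guess.
    return None


fasta_guess = guess_format_from_first_line(">gi|1|example\n")
swiss_guess = guess_format_from_first_line(
    "ID   CYC_PIG                 Reviewed;         104 AA.\n"
)
```

Returning None for the ambiguous cases keeps the guesser from being wrong,
at the price of only covering some formats.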

I think Michiel and I were happy to leave this question for later...

Chris Lasher wrote:
> And I think by now it's predictable that I'm a fan of Peter's 
> suggestion to have an exception raised upon the attempt to create a 
> dictionary with identical IDs; all other options are, again, too 
> implicit for my tastes.

Good.  Michiel agreed in another email:
>> 
>> You're probably right. I'm fine with raising an exception.
>> 

Have you been following the rest of that SeqRecord dictionary discussion
Chris?

> Thanks very much for developing SeqIO and discussing it so much, 
> guys. I think this will be a fantastic asset to BioPython! Keep on 
> rockin' it!
> 
> Chris

Thank you for your passionate feedback :)

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 12:38:26 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 12:38:26 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021738.kA2HcQKH017740@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


Biosql at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.24                        |Not Applicable


------- Comment #2 from Biosql at hotmail.com  2006-11-02 12:38 -------
I'm using the latest version of Biopython 1.42 with the latest version of
Sprot.py from the CVS. 

I used the Swiss-Prot file version 51 coming from here : 

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

I also tried to parse this file on a PC with python 2.4.3 and the latest
biopython version and got the same result. 

Jonathan



From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 13:07:03 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:07:03 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021807.kA2I73W3021327@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 13:07 -------
Created an attachment (id=491)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=491&action=view)
First four records from uniprot_sprot.dat.gz release 51

I was hoping for a smaller test case, uniprot_sprot.dat.gz is 185MB compressed,
and 836MB as plain text!  Anyway, I have extracted and attached a file with
just the first four records in it for anyone interested in testing.

I would guess from your stack trace that it is this recent change to the ID
line that has caused the trouble:

http://ca.expasy.org/sprot/relnotes/sp_news.html#rel9.0

Old (with MoleculeType):
ID   EntryName DataClass; MoleculeType; SequenceLength.

New (without MoleculeType):
ID   EntryName DataClass; SequenceLength.

e.g.
ID   CYC_PIG                 Reviewed;         104 AA.
ID   Q3ASY8_CHLCH            Unreviewed;     36805 AA.

This shouldn't be too hard to fix...
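
For illustration, here is one way to accept both layouts.  This is only a
sketch with a hypothetical function name, not the fix that went into
SProt.py.  Splitting on whitespace gives six fields for the old layout and
five for the new one, which is why int(cols[4]) in the stack trace ended up
being handed 'AA.':

```python
def parse_id_line(line):
    """Parse a Swiss-Prot ID line in either the old or the new layout."""
    cols = line.split()
    if len(cols) not in (5, 6) or cols[0] != "ID" or cols[-1] != "AA.":
        raise ValueError("Not a recognised ID line: %r" % (line,))
    if len(cols) == 6:
        # Old: ID EntryName DataClass; MoleculeType; SequenceLength AA.
        molecule_type = cols[3].rstrip(";")
    else:
        # New: ID EntryName DataClass; SequenceLength AA.
        molecule_type = None
    return {
        "entry_name": cols[1],
        "data_class": cols[2].rstrip(";"),
        "molecule_type": molecule_type,
        # The length is always the field just before the trailing "AA.".
        "sequence_length": int(cols[-2]),
    }


new_style = parse_id_line("ID   CYC_PIG                 Reviewed;         104 AA.")
```

Counting from the end of the line for the length avoids depending on whether
the MoleculeType field is present.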



From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 13:41:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:41:46 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021841.kA2Ifkpg025233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 13:41 -------
Fix checked into CVS, please reopen the bug if you run into problems.

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython

file Bio/SwissProt/SProt.py
Revision 1.34 made 2nd Nov 2006

This is the test script I used with the example file from comment 3 attachment
491

from Bio.SwissProt import SProt

#Works
rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.SequenceParser())
for record in rec_iter :
    print record.id
    print record.seq

#Failed
rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.RecordParser())
for record in rec_iter :
    print record



From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 14:16:53 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 14:16:53 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021916.kA2JGrGY028566@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #5 from Biosql at hotmail.com  2006-11-02 14:16 -------
Thank you Peter !

So fast and so good. 

Jonathan



From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 16:27:25 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:27:25 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
	Swiss-Prot version (RX and OH lines are broken)
In-Reply-To: <bug-2043-42@http.bugzilla.open-bio.org/>
Message-ID: <200611022127.kA2LRPkN009879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2043


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 16:27 -------
It seems to be working, judging from the small amount of testing I did on
another Swiss-Prot bug.

Marking as fixed - please reopen if needed.



From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 16:38:01 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:38:01 -0500
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To: <bug-1944-42@http.bugzilla.open-bio.org/>
Message-ID: <200611022138.kA2Lc1Zi010834@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 16:38 -------
While working on bug 2059, I've been tempted to make some similar changes to
Marc's suggestions about handling the SeqRecord's id/name/description, and the
addition of an "add SeqRecord" method.

I like the idea of adding a method to iterate over the sequences.  How about
something a little simpler (which I haven't tested yet):

     def __iter__(self):
         """Iterate over the SeqRecord objects making up the alignment"""
         return iter(self._records)

i.e. Use the fact that self._records is a list, and will support iteration
itself.  This avoids having to keep track of the current iteration position in
our own next method.

Also, would anyone else like to be able to iterate over the columns?
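
A toy sketch of both kinds of iteration, assuming a stand-in class (this is
not Bio.Align.Generic) that stores (id, sequence string) pairs in
self._records, with all sequences the same length:

```python
class MiniAlignment:
    """Toy alignment holding (id, sequence string) pairs in a list."""

    def __init__(self, records):
        self._records = list(records)

    def __iter__(self):
        """Iterate over the records making up the alignment."""
        # The list already supports iteration, so no next method is needed.
        return iter(self._records)

    def columns(self):
        """Iterate over the alignment columns, each as a string."""
        sequences = [sequence for _, sequence in self._records]
        return ("".join(column) for column in zip(*sequences))


alignment = MiniAlignment([("alpha", "AC-T"), ("beta", "ACGT")])
rows = list(alignment)
columns = list(alignment.columns())
```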



From mdehoon at c2b2.columbia.edu  Thu Nov  2 21:20:23 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:20:23 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
	<45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <454AA767.9030506@c2b2.columbia.edu>

Peter wrote:
> Right now I am making both "file to dict" and "iterator to dict"
> functions available:
> 
> File2SequenceDict(..., record2key) is implemented as
> SequenceIter2Dict(File2SequenceIterator(...), record2key)
> 
> Also:
> File2Alignment(...) is implemented as
> Iter2Alignment(File2SequenceIterator(...))
> 
> And:
> File2SequenceList(...) is implemented as list(File2SequenceIterator(...))
> 
> Leaving aside the names (which I notice are not currently consistent) I
> would be fine with removing File2SequenceList, File2SequenceDict, and
> File2Alignment but retaining the two functions which convert from a
> SeqRecord returning iterator into dict or an alignment.
> 
> How does that sound Michiel (subject to agreeing on names)?
That sounds good to me.

--Michiel.

From mdehoon at c2b2.columbia.edu  Thu Nov  2 21:44:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:44:47 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4549E95A.6080605@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
	<4549E95A.6080605@maubp.freeserve.co.uk>
Message-ID: <454AAD1F.5050006@c2b2.columbia.edu>

Peter wrote:
> Chris Lasher wrote:
>> Peter wrote:
>>> One point against names like File2SequenceIterator is the pun on 
>>> two versus to (i.e. convert) will not be so obvious to non-native 
>>> English speakers.
>> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>>  isn't that much more difficult, and leaves no room for confusion. 
>> (e.g., Where's the File1SequenceIterator?)
> 
> I would be happy with FileToSequenceIterator, or even
> FileToSequenceIter.  FileToSeqIter is shorter but we don't actually
> return Seq objects so I would avoid that.
> 
> Does anyone else have any suggestions?

Yes, but let's discuss function names after we decide which functions we 
want.

> While it does sound like a nice idea for the end user, the idea of
> filenames and handles is pretty important in python, and maybe we
> shouldn't worry about forcing newcomers to deal with handles.  After all,
> the SeqIO system will make them deal with iterators and SeqRecords which
> I think are far more complicated!
> 
> What do you think Michiel?

My preferred solution would be for File2SequenceIterator to take handles 
only.
Same as Bio.Blast:

from Bio.Blast import NCBIXML

# The parser is given an open handle, not a filename
blast_out = open('my_blast.out')
b_parser = NCBIXML.BlastParser()
b_record = b_parser.parse(blast_out)

> Chris Lasher wrote:
>> Which brings me to the issue of "guessing" a file's format. Yikes, 
>> again! I'd expect that kind of "magickery" from Perl, but once again,
>> explicit is better than implicit. I honestly think it's not too much
>> to expect the user to know what filetype they're expecting BioPython
>> to deal with. Could you guys please explain the motivation behind 
>> this to me?
 >......
>
> I think Michiel and I were happy to leave this question for later...
> 
I am leaning towards Chris' opinion. File type guessing (from extension 
or file contents) doesn't seem really necessary. At least, I don't 
remember a user asking for it. The benefits of file type guessing from 
the extension are minimal (since a user can probably do that more 
reliably himself, knowing the file names he's likely to encounter). And 
since file type guessing will not be foolproof, it may even be 
confusing. Once file type guessing is available in Biopython though, 
we're committed to it and we'll have to support it. So I'd be happier 
without the file type guessing functionality.

That said, if somebody really wants it, I can live with it.

--Michiel.


From biopython-dev at maubp.freeserve.co.uk  Fri Nov  3 06:48:17 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 03 Nov 2006 11:48:17 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454AAD1F.5050006@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>
	<454AAD1F.5050006@c2b2.columbia.edu>
Message-ID: <454B2C81.9090309@maubp.freeserve.co.uk>

My apologies for this somewhat long email.

Handles and Filenames
=====================

Currently the individual format specific iterators just require a handle
(and not a filename).  Are we all happy with this?

Michiel de Hoon wrote:
>> While it does sound like a nice idea for the end user, the idea of 
>> filenames and handles is pretty important in python, and maybe we 
>> shouldn't worry about forcing newcomers to deal with handles.  After
>> all, the SeqIO system will make them deal with iterators and
>> SeqRecords which I think are far more complicated!
>> 
>> What do you think Michiel?
> 
> My preferred solution would be for File2SequenceIterator to take
> handles only.

Assuming we keep the non-ambiguous file extension to file format
mappings, allowing a filename as a possible argument to
File2SequenceIterator (and any variants) makes good sense.

Note that most handle objects have a "name" attribute to get the
filename, which could be used to determine the file extension.  i.e. We
can still do the file extension to file format mapping using just a file
handle (instead of a filename).
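A minimal sketch of that idea, written for a current Python 3. The
EXTENSION_TO_FORMAT table and the format_from_handle helper are
hypothetical names used only for illustration; this is not the actual
Bio.SeqIO code. The extensions are the ones used as examples in this
thread:

```python
import os
from io import StringIO

# Hypothetical, deliberately small mapping; not the table used in Bio.SeqIO.
EXTENSION_TO_FORMAT = {
    "fasta": "fasta",
    "faa": "fasta",
    "fa": "fasta",
    "gbk": "genbank",
    "sth": "stockholm",
}


def format_from_handle(handle):
    """Return a format name guessed from handle.name, or None if unknown."""
    # Not every handle has a usable name: StringIO objects have none, and
    # a handle opened from a file descriptor reports an integer.
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        return None
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return EXTENSION_TO_FORMAT.get(extension)


# An in-memory handle has no filename, so no guess is possible.
print(format_from_handle(StringIO(">seq1\nACGT\n")))
```

Returning None rather than raising leaves it to the caller to decide
whether a missing or unknown extension is an error.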

Currently File2SequenceIterator has separate named arguments for a
handle, filename and format.  If no handle is provided, it will open one
using the filename provided.

We could make the handle and format the first arguments as a compromise?

If we drop the extension to file format mapping (see below), then I
agree File2SequenceIterator could just expect a handle and not a filename.

Guessing File Formats
=====================

>> Chris Lasher wrote:
>>> Which brings me to the issue of "guessing" a file's format.
>>> Yikes, again! I'd expect that kind of "magickery" from Perl, but
>>> once again, explicit is better than implicit. I honestly think
>>> it's not too much to expect the user to know what filetype
>>> they're expecting BioPython to deal with. Could you guys please
>>> explain the motivation behind this to me?

Michiel de Hoon wrote:
> I am leaning towards Chris' opinion. File type guessing (from
> extension or file contents) doesn't seem really necessary. At least,
> I don't remember a user asking for it. The benefits of file type
> guessing from the extension are minimal (since a user can probably do
> that more reliably himself, knowing the file names he's likely to
> encounter). And since file type guessing will not be foolproof, it
> may even be confusing. Once file type guessing is available in
> Biopython though, we're committed to it and we'll have to support it.
> So I'd be happier without the file type guessing functionality.
> 
> That said, if somebody really wants it, I can live with it.

I agree that we shouldn't implement file format guessing based on the
contents of a file (unless, as you say, we get strong feedback wanting it).

I personally want the file extension to format mapping, but then I am
fairly disciplined about using file extensions.  As I seem to be the
only voice advocating this, it looks like I may have to give in...

Is it worth asking on the main discussion list to canvass opinion?

Maybe we should settle on the function names before doing that - it
would be better to replace the current function names now, before too
many people are used to them.

Functions and Naming
====================
This is where I think things stand for Bio/SeqIO/__init__.py

We have functions to do the following, where "file" may mean just a
handle, or perhaps the choice of a handle or filename (see above):

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

Possible names without the digit two: FileToSequenceIterator,
SequencesToDict, SequencesToAlignment, and SequencesToFile

I think Michiel wanted to drop the following "wrapper functions" as code
bloat:

(*) File to list of SeqRecord objects, currently File2SequenceList
     Just use list(File2SequenceIterator(...)) instead

(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
     Just use SequenceIter2Dict(File2SequenceIterator(...)) instead

(*) File to alignment, currently File2Alignment
     Just use Iter2Alignment(File2SequenceIterator(...))

The reason I invented the above three examples was so I could do things
like this in one line (assuming my files have valid known extensions):

rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align    = File2Alignment(filename="demo.sth")

or perhaps:

align    = File2Alignment(filename="demo.aln", format="clustal")

The alternative suggestions seem to lead to using file handles and an
explicit format, with a second function to convert from an iterator if
required.  While this can be done in one line - I find the following
much less straightforward:

rec_iter = File2SequenceIterator(open("demo.faa"), "fasta")

rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank"))

rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"),
                                                    "fasta"))

align = Iter2Alignment(File2SequenceIterator(open("demo.sth"),
                                              "stockholm"))


Peter


From sbassi at gmail.com  Sat Nov  4 15:48:22 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 4 Nov 2006 17:48:22 -0300
Subject: [Biopython-dev] Microbiology module
Message-ID: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>

I am working on functions for industrial microbiology, such as:
growth rate equations, continuous culture equations, batch culture,
yields for different sources of energy (and for fermentation or
respiration), oxygen consumption rate, constants, thermodynamic equations
used in bioreactors, cell cultures and so on.
Biopython is lacking such a module, but I am not sure if this is out
of scope. Is there a chance to include it in Biopython, or is this not
useful?
I think this could extend Biopython into a whole new area (bioprocess
and microbiology).
Please tell me what the maintainers think about this. If this idea is
rejected, I will write ugly, uncommented code for my own use,
but if it is accepted, I will write nice, documented code for people to see
:)
Best regards,
SB.

-- 
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318

From sbassi at gmail.com  Sun Nov  5 09:49:20 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 5 Nov 2006 11:49:20 -0300
Subject: [Biopython-dev] Microbiology module
In-Reply-To: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
References: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
	<2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
Message-ID: <b43bf2080611050649l196da833q920a571871e727a3@mail.gmail.com>

On 11/5/06, Thomas Hamelryck <thamelry at binf.ku.dk> wrote:
>
> Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right?
>

Yes, some methods could be used as a base for systems biology.

From thamelry at binf.ku.dk  Sun Nov  5 09:33:53 2006
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Sun, 5 Nov 2006 15:33:53 +0100
Subject: [Biopython-dev] Microbiology module
In-Reply-To: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
References: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
Message-ID: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>

On 11/4/06, Sebastian Bassi <sbassi at gmail.com> wrote:
>
> I am working in functions for industrial microbiology. Like:
> Growth rate equations, Continuous culture equations, batch culture,
> yields for different source of energy (and for fermentation or
> respiration), oxygen consume rate, constants, thermodynamic equations
> used in bioreactors, cell cultures and so on.


Sounds like a fun project, and a potentially valuable addition to Biopython.
I guess some of the topics you mention might be of relevance to systems
biology, right?

Best regards,

----
Thomas Hamelryck, Marie Curie EU-Research fellow
Bioinformatics center
Institute of Molecular Biology
University of Copenhagen
Universitetsparken 15 - Building 10
DK-2100 Copenhagen Ø
Denmark
Homepage: http://www.binf.ku.dk/Protein_structure


From idoerg at burnham.org  Tue Nov  7 12:48:05 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue, 07 Nov 2006 09:48:05 -0800
Subject: [Biopython-dev] InterProScan parser?
Message-ID: <4550C6D5.10606@burnham.org>

Hi,

Does anybody have an interproscan parser, by any chance? Preferably for 
the XML or EBIXML output.

Thanks,

Iddo

-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 12:13:05 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:13:05 -0500
Subject: [Biopython-dev] [Bug 2137] New: Install from CVS fails on
	clistfnsmodule.c compilation
Message-ID: <bug-2137-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137

           Summary: Install from CVS fails on clistfnsmodule.c compilation
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: chris.lasher at gmail.com


On November 7, 2006, I did a fresh checkout of BioPython from the CVS
repository. Attempts to build/install the CVS checkout are failing on attempts
to compile Bio/clistfnsmodule.c. The main culprit seems to be a missing file,
Python.h.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 12:15:42 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:15:42 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081715.kA8HFg6e017131@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #1 from chris.lasher at gmail.com  2006-11-08 12:15 -------
Created an attachment (id=497)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=497&action=view)
Output from failed installation.

This is the output from my failed installation.



From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 12:33:41 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:33:41 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081733.kA8HXfpb018644@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp  2006-11-08 12:33 -------
This very much looks like a problem with your Python installation. Do you have
the Python.h header file on your system?
This problem may arise if you installed python using an rpm. If so, make sure
to install the python-devel rpm also. That one contains Python.h.



From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 13:41:43 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 13:41:43 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081841.kA8IfhFJ023784@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #3 from chris.lasher at gmail.com  2006-11-08 13:41 -------
(In reply to comment #2)
> This very much looks like a problem with your Python installation. Do you have
> the Python.h header file on your system?
> This problem may arise if you installed python using an rpm. If so, make sure
> to install the python-devel rpm also. That one contains Python.h.
> 

Good call! My apologies, I feel foolish now. For Debian/*buntu users, the
package to get is python-dev.

Should I add something about the Python development packages being necessary
for installation from CVS source on http://biopython.org/wiki/CVS ?



From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 14:05:13 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 14:05:13 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081905.kA8J5DIg025030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp  2006-11-08 14:05 -------
> Should I add something about the Python development packages being necessary
> for installation from CVS source on http://biopython.org/wiki/CVS ?

The Python development packages are always needed, so also when installing an
official Biopython release.
If you could add some text to that effect to the Biopython wiki somewhere, that
would be great.
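
A quick way to check for the header before building, as a sketch only:
it assumes a current Python with the standard-library sysconfig module,
and find_python_header is a hypothetical helper name, not part of
Biopython or its setup script.

```python
import os
import sysconfig


def find_python_header():
    """Return the expected path of Python.h and whether the file exists."""
    # sysconfig reports the include directory of the running interpreter.
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return header, os.path.exists(header)


header, present = find_python_header()
print(header, "found" if present else "missing")
```

If the header is reported missing, installing the distribution's Python
development package (python-devel or python-dev, as mentioned above)
is the likely fix.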

Closing this bug.



From idoerg at burnham.org  Wed Nov  8 21:39:23 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Wed, 08 Nov 2006 18:39:23 -0800
Subject: [Biopython-dev] [BioPython] EUtils module
In-Reply-To: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
References: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
Message-ID: <455294DB.6000105@burnham.org>

Srinivas Iyyer wrote:
> Dear Group,
> 
> I downloaded EUtils module. 
> 
> I am trying to reproduce the code given in :
> 
> http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html


> 
> I am getting Errors. 

This is code from an alpha version of EUtils used at a presentation. I 
don't think it was meant to be reproducible, or even made it into the 
final module.

You might want to look under the hood. There is a README file in the 
EUtils installation, which has some examples.

But NCBI changes the EUtils specifications quite frequently, so chances 
are, if no one has used EUtils for a while, that it might be broken.

> 
> I want to know which databases in Entrez are supported
> by EUtils.
> 
> Could any one please help me whats the problem.
> 
> Are not many people using EUtils. 
> 
> Thanks
> 
>>>> import EUtils
>>>> dbs = EUtils.dblist()
> 
> Traceback (most recent call last):
>   File "<pyshell#1>", line 1, in -toplevel-
>     dbs = EUtils.dblist()
> AttributeError: 'module' object has no attribute
> 'dblist'
>>>> dbinfo = EUtils.dbinfo("pubmed")
> 
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in -toplevel-
>     dbinfo = EUtils.dbinfo("pubmed")
> AttributeError: 'module' object has no attribute
> 'dbinfo'
> 
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From mdehoon at c2b2.columbia.edu  Fri Nov 10 01:28:49 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Fri, 10 Nov 2006 01:28:49 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454B2C81.9090309@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>
	<454B2C81.9090309@maubp.freeserve.co.uk>
Message-ID: <45541C21.6080402@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> Currently the individual format specific iterators just require a handle
> (and not a filename).  Are we all happy with this?

Happy.

> We could make the handle and format the first arguments as a compromise?

If in doubt, don't add it to Biopython!
It's much easier to add a functionality later, should the need arise, 
than to remove one.

> I personally want the file extension to format mapping, but then I am
> fairly disciplined about using file extensions.  As I seem to be the
> only voice advocating this, it looks like I may have to give in...
> 
> Is it worth asking on the main discussion list to canvass opinion?

Sure, go ahead. But ask for *why* a user wants file extension to format 
mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to 
know which usage case that we haven't thought about yet warrants file 
extension to format mapping.

> We have functions to do the following, where "file" may mean just a
> handle, or perhaps the choice of a handle or filename (see above):
> 
> (*) File to SeqRecord iterator, currently File2SequenceIterator
> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
> (*) Write SeqRecord iterator/list to a file, currently Sequences2File

If:
   File2SequenceIterator doesn't infer the file format from the extension
and
   File2SequenceIterator takes handles only, so no file names,
then:
   Why do we need the File2SequenceIterator function?

Btw, we should make a new Biopython release once the dust settles.

--Michiel.

From idoerg at burnham.org  Fri Nov 10 02:30:17 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu, 09 Nov 2006 23:30:17 -0800
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45541C21.6080402@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>
	<45541C21.6080402@c2b2.columbia.edu>
Message-ID: <45542A89.6050202@burnham.org>

Michiel de Hoon wrote:
> Peter (BioPython Dev) wrote:
>> Currently the individual format specific iterators just require a handle
>> (and not a filename).  Are we all happy with this?
> 
> Happy.

I second that.

I have two arguments against passing filenames instead of handles:

1) It is standard practice in Biopython to pass file handles as arguments
to a parser rather than filenames. If we break this, we would have to
start thinking about which parser takes a handle and which a filename.
Things will be a mess.

2) Also, what if you are not passing a real file? E.g. I have
applications that pass StringIO streams into the parser. You are
lumping two levels of IO into one, and IMHO that is bad practice. In
other words, a filehandle can always be generated from a file, easily

 >>> filefunc(open('myfile'))

but you cannot generate a file from a filehandle type of data. OK, you
can programmatically generate a tmp file for reading, but that places a
burden on the user.

3) The last argument against rigid filename extensions is 
interoperability with other applications that generate those files. 
Suppose you have one application that generates fasta files with a .tfa 
extension, and another with a .fa extension and yet a third with .pfa 
extensions... and those extensions are important to you for other 
reasons, like knowing which is a nucleic acid file and which is protein. 
Actually, all the NCBI genomic files are built like this... :)

OK, three arguments. I think that relying on filename extensions for 
content is rather DOS-ish and places an extra burden on the user. I'm 
suffering enough on my Windows machine with Rasmol trying to open all my 
.pdb files. Including those where pdb stands for "Palm Pilot database" 
rather than Protein Data Bank.
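
To illustrate the handle argument, here is a small sketch for a current
Python 3. The count_fasta_records function is a made-up example, not a
Biopython function; the point is only that code written against a handle
works unchanged on a real file and on an in-memory stream:

```python
from io import StringIO


def count_fasta_records(handle):
    """Count FASTA records by counting the lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# An in-memory stream behaves like an open file for this purpose.
text = ">seq1\nACGT\n>seq2\nGGCC\n"
print(count_fasta_records(StringIO(text)))

# With a real file the call has the same shape:
#     count_fasta_records(open("myfile"))
```

A function that insisted on a filename would force the StringIO caller
to write a temporary file first.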


> 
>> We could make the handle and format the first arguments as a compromise?
> 
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise, 
> than to remove one.

We could add the format as an OPTIONAL keyword argument, with a "None" 
default value. And have the parser recognize the format from a lookahead 
using a magic regexp for each format. The user-passed format overrides 
the parser guesswork. Shouldn't be too hard to implement, as file 
formats are very distinct.
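
A rough sketch of such a lookahead, for a current Python 3. The
MAGIC_PATTERNS table and resolve_format are hypothetical names, the
patterns cover only four formats, and the handle is assumed to be
seekable so that the peeked line can be handed back to the parser:

```python
import re
from io import StringIO

# Hypothetical first-line signatures; a real table would need more formats
# and more careful patterns.
MAGIC_PATTERNS = [
    ("fasta", re.compile(r">")),
    ("genbank", re.compile(r"LOCUS\s")),
    ("clustal", re.compile(r"CLUSTAL")),
    ("stockholm", re.compile(r"# STOCKHOLM")),
]


def resolve_format(handle, format=None):
    """Return the user's format if given, else guess from the first line."""
    if format is not None:
        return format  # an explicit format always overrides guessing
    start = handle.tell()
    first_line = handle.readline()
    handle.seek(start)  # rewind so the parser still sees the whole file
    for name, pattern in MAGIC_PATTERNS:
        if pattern.match(first_line):
            return name
    raise ValueError("Could not guess the file format; please specify it")


print(resolve_format(StringIO(">seq1\nACGT\n")))
print(resolve_format(StringIO("LOCUS       demo\n"), format="fasta"))
```

The second call shows the override: the content looks like GenBank, but
the caller's explicit "fasta" wins and the file is never inspected.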


> 
>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions.  As I seem to be the
>> only voice advocating this, it looks like I may have to give in...
>>
>> Is it worth asking on the main discussion list to canvass opinion?
> 
> Sure, go ahead. But ask for *why* a user wants file extension to format 
> mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to 
> know which usage case that we haven't thought about yet warrants file 
> extension to format mapping.
> 
>> We have functions to do the following, where "file" may mean just a
>> handle, or perhaps the choice of a handle or filename (see above):
>>
>> (*) File to SeqRecord iterator, currently File2SequenceIterator
>> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
>> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
>> (*) Write SeqRecord iterator/list to a file, currently Sequences2File
> 
> If:
>    File2SequenceIterator doesn't infer the file format from the extension
> and
>    File2SequenceIterator takes handles only, so no file names,
> then:
>    Why do we need the File2SequenceIterator function?
> 
> Btw, we should make a new Biopython release once the dust settles.
> 
> --Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From biopython-dev at maubp.freeserve.co.uk  Mon Nov 13 19:49:02 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 14 Nov 2006 00:49:02 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45542A89.6050202@burnham.org>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>
	<45542A89.6050202@burnham.org>
Message-ID: <4559127E.3050109@maubp.freeserve.co.uk>

Iddo Friedberg wrote:
> 3) The last argument against rigid filename extensions is 
> interoperability with other applications that generate those files. 
> Suppose you have one application that generates fasta files with a
> .tfa extension, and another with a .fa extension and yet a third with
> .pfa extensions... and those extensions are important to you for
> other reasons, like knowing which is a nucleic acid file and which is
> protein. Actually, all the NCBI genomic files are built like this...
> :)

Interesting tidbit.

If you are using "exotic" file extensions, then you would have to
explicitly tell my Bio.SeqIO code the file's format.

Although "fa" is currently a known extension mapped to fasta format in
Bio.SeqIO, your other examples are not.  Are these other extensions used
outside the internal systems of the NCBI?

> OK, three arguments. I think that relying on filename extensions for
> content is rather DOS-ish and places an extra burden on the user.

I'm not trying to force anyone into using specific filename extensions -
  I'm trying to make life easier for people who already do this (or who
download their data from online sources like the NCBI or PFAM - which do
seem to be consistent in their naming conventions).

> I'm suffering enough on my Windows machine with Rasmol trying to open
> all my .pdb files. Including those where pdb stands for "Palm Pilot
> database" rather than Protein Data Bank.

Yes - multiple interpretations of a given file extension are a problem.
I've noticed that same PDB extension clash too (but I don't use a Palm
pilot any more).

Can anyone think of any common extensions used for more than one file
format?  I know Clustal uses *.aln for its alignments which is perhaps
asking for trouble...

> We could add the format as an OPTIONAL keyword argument, with a "None"
> default value. And have the parser recognize the format from a
> lookahead using a magic regexp for each format. The user-passed
> format overrides the parser guesswork. Shouldn't be too hard to
> implement, as file formats are very distinct.

Currently the format is an optional keyword argument defaulting to None.
When it is omitted, I currently use a limited filename extension to
format mapping (assuming the filename is available) to deduce/guess the
format.

Peter


From idoerg at burnham.org  Tue Nov 14 12:19:14 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue, 14 Nov 2006 09:19:14 -0800
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4559127E.3050109@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>	<45542A89.6050202@burnham.org>
	<4559127E.3050109@maubp.freeserve.co.uk>
Message-ID: <4559FA92.8070408@burnham.org>

Peter (BioPython Dev) wrote:
> Iddo Friedberg wrote:
>> 3) The last argument against rigid filename extensions is 
>> interoperability with other applications that generate those files. 
>> Suppose you have one application that generates fasta files with a
>> .tfa extension, and another with a .fa extension and yet a third with
>> .pfa extensions... and those extensions are important to you for
>> other reasons, like knowing which is a nucleic acid file and which is
>> protein. Actually, all the NCBI genomic files are built like this...
>> :)
> 
> Interesting tidbit.
> 
> If you are using "exotic" file extensions, then you would have to
> explicitly tell my Bio.SeqIO code the file's format.
> 
> Although "fa" is currently a known extension mapped to fasta format in
> Bio.SeqIO, your other examples are not.  Are these other extensions used
> outside the internal systems of the NCBI?

I wouldn't call it a tidbit, or exotic. It is very prevalent: NCBI's 
GenBank genomic repositories are very much deferred to. The point is, 
since NCBI uses one standard of file extensions for its genomic 
databases, TIGR another (actually, TIGR points to GenBank for completed 
genomes), UCSC a third... then maybe relying on file suffixes is not 
such a great idea.

See for example the E. coli genome:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12

Some are in fasta format, but have different contents: whole genome, 
noncoding RNA, protein. The same goes for those in GenBank format. So 
the NCBI suffixes denote not only the file format, but the biological 
content as well.

Also, for the reasons I gave in my previous email, I think we should 
stick with passing file handles, not file names.

There is no real need to pass a filename rather than a file handle. 
If you need information from the filename, you can read the filename 
from the file handle:

 >>> foo = open('foo')

 >>> print foo.name
'foo'

And the functions could still accept StringIO streams if needed.

> 
>> 
> 
> I'm not trying to force anyone into using specific filename extensions -
>   I'm trying to make life easier for people who already do this (or who
> download their data from online sources like the NCBI or PFAM - which do
> seem to be consistent in their naming conventions).
> 

You cannot rely on such consistency prevailing. Especially not with NCBI.;)


> 
>> We could add the format as an OPTIONAL keyword argument, with a "None"
>> default value. And have the parser recognize the format from a
>> lookahead using a magic regexp for each format. The user-passed
>> format overrides the parser guesswork. Shouldn't be too hard to
>> implement, as file formats are very distinct.
> 
> Currently the format is an optional keyword argument defaulting to None.
> When it is omitted, I currently use a limited filename extension to
> format mapping (assuming the filename is available) to deduce/guess the
> format.
> 


Ideally, the data format should be supplied by the user. Second best is 
inferring from parsing the first line or so in the file. Third is 
filename extension. But both options B and C are not very good 
practices, IMHO.


> Peter
> 
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From bugzilla-daemon at portal.open-bio.org  Tue Nov 14 15:48:49 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 15:48:49 -0500
Subject: [Biopython-dev] [Bug 2143] New: Error parsing BLAT output (using
	out=blast format)
Message-ID: <bug-2143-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2143

           Summary: Error parsing BLAT output (using out=blast format)
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: fgibbons at hms.harvard.edu


Attempting to parse this BLAT output (see below) raises an "I couldn't find the
sbjct in" exception.

After looking at the code, it seems to me that the problem is an overly strict
regexp, that relies on a single space between the "Sbjct:" and the integer that
follows it. Replace the literal space with '\s*', and it goes away. This in
fact matches the regexp used to match the "Query:". I can't imagine that it
might hurt things, even in the main NCBIBlastParser, but you never know.... 

(All of the above refers to the method sbjct in class _HSPConsumer, file
NCBIStandalone.py)
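
A small illustration of the proposed change. The two patterns below are simplified assumptions, not the expressions actually used in NCBIStandalone.py, and the sample lines are hypothetical:

```python
import re

# Assumed, simplified patterns; the real ones live in NCBIStandalone.py.
strict = re.compile(r"Sbjct: (\d+)")      # exactly one space required
relaxed = re.compile(r"Sbjct:\s*(\d+)")   # any amount of whitespace, even none

# Hypothetical lines with different spacing after the colon.
single_space = "Sbjct: 192 MAINSGTRRL 201"
padded = "Sbjct:   192 MAINSGTRRL 201"
no_space = "Sbjct:192 MAINSGTRRL 201"

strict_hits = [bool(strict.match(s)) for s in (single_space, padded, no_space)]
relaxed_starts = [relaxed.match(s).group(1) for s in (single_space, padded, no_space)]
```

The strict pattern only accepts the first line, while the relaxed one extracts the start position from all three.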

-Frank Gibbons (fgibbons at hms.harvard.edu)
-------------------------------------

Reference:  Kent, WJ. (2002) BLAT - The BLAST-like alignment tool

Query=  NCU00001
        (54 letters)

Database:  all_proteins.fasta
           293697 sequences; 128,064,135 total letters

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

MGG_10872.5                                                           101   1e-21


>MGG_10872.5
          Length = 245

 Score = 101 bits (260), Expect = 1e-21
 Identities = 54/54 (100%), Positives = 54/54 (100%), Gaps = 0/54 (0%)

Query:   1 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 54
           MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL
Sbjct: 192 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 245

  Database: all_proteins.fasta
BLASTP 2.2.4 [blat]


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Tue Nov 14 17:03:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 17:03:40 -0500
Subject: [Biopython-dev] [Bug 2143] Error parsing BLAT output (using
	out=blast format)
In-Reply-To: <bug-2143-42@http.bugzilla.open-bio.org/>
Message-ID: <200611142203.kAEM3eu3014395@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2143


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2006-11-14 17:03 -------
Fixed in CVS, thanks.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From chris.lasher at gmail.com  Tue Nov 14 19:51:26 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 14 Nov 2006 19:51:26 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4559FA92.8070408@burnham.org>
References: <45425925.8090607@maubp.freeserve.co.uk>
	<45487277.6080308@maubp.freeserve.co.uk>
	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
	<4549E95A.6080605@maubp.freeserve.co.uk>
	<454AAD1F.5050006@c2b2.columbia.edu>
	<454B2C81.9090309@maubp.freeserve.co.uk>
	<45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org>
	<4559127E.3050109@maubp.freeserve.co.uk>
	<4559FA92.8070408@burnham.org>
Message-ID: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>

Just pitching in again, I agree with Michiel with regards to the list
of functions necessary. To restate, these would be:

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

I also think there's wisdom to Michiel's statement it's easier to add
functionality than it is to remove it.

I agree with Iddo on his arguments against dealing with filename
extensions. Upon reflection, however, I feel comfortable with a
lookahead-based file-format guesser for the sake of convenience and as
a matter of compromise to those who are not keen on being explicit in
regards to every detail. It's been stated that bio file formats are
quite distinct. I tried to think of a counterexample but failed.

Finally, to reply to Michiel's question on release, it does seem once
SeqIO is solidified this would certainly be worthy of a new release.
SeqIO is a big step in a good direction for BioPython.

Chris

From biopython-dev at maubp.freeserve.co.uk  Wed Nov 15 07:52:58 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 15 Nov 2006 12:52:58 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>
References: <45425925.8090607@maubp.freeserve.co.uk>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>
	<45542A89.6050202@burnham.org>	<4559127E.3050109@maubp.freeserve.co.uk>	<4559FA92.8070408@burnham.org>
	<128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>
Message-ID: <455B0DAA.9040000@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Just pitching in again, I agree with Michiel with regards to the list
> of functions necessary. To restate, these would be:

On Monday I switched from the "2" pun names to "To" giving the following:

(*) FileToSequenceIterator, previously File2SequenceIterator
     File to SeqRecord iterator

(*) SequencesToDict, previously SequenceIter2Dict
     SeqRecord iterator/list to dictionary

(*) SequencesToAlignment, previously Iter2Alignment
     SeqRecord iterator/list to alignment

(*) SequencesToFile, previously Sequences2File
     Write SeqRecord iterator/list to a file

I agree that these are all important "core functions".

> I also think there's wisdom to Michiel's statement it's easier to add
> functionality than it is to remove it.

Very true.  On that note...

We also currently have three "convenience functions", which seem
scheduled for removal based on these discussions.  Unless anyone speaks
up for these three, I'll remove them (and update the Wiki to match):

(*) FileToSequenceList previously called File2SequenceList
(*) FileToSequenceDict previously called File2SequenceDict
(*) FileToAlignment    previously called File2Alignment

These simply wrap FileToSequenceIterator with the list, SequencesToDict
or SequencesToAlignment function.

> I agree with Iddo on his arguments against dealing with filename
> extensions. Upon reflection, however, I feel comfortable with a
> lookahead-based file-format guesser for the sake of convenience and as
> a matter of compromise to those who are not keen on being explicit in
> regards to every detail. It's been stated that bio file formats are
> quite distinct. I tried to think of a counterexample but failed.

I would say telling EMBL and Swiss (aka SwissProt aka UniProt) apart is
tricky.  They both start with an "ID ..." line and finish with "//", the
feature table format is the big difference.

If we did try guessing file formats by looking at the file contents, I
would not try and guess every file format which Bio.SeqIO could read -
just those which are easily identifiable.  In this case, I would be
inclined not to try and tell EMBL and SwissProt apart, and simply abort
with "Unrecognised format".
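
A rough sketch of such a partial guesser, assuming only the first line is inspected. The function name and prefix table are hypothetical; anything not clearly identifiable, including the shared "ID" start of EMBL and SwissProt, is rejected rather than guessed:

```python
# Hypothetical table of first-line prefixes that identify a format on their own.
DISTINCT_PREFIXES = [
    (">", "fasta"),
    ("LOCUS ", "genbank"),
    ("CLUSTAL", "clustal"),
    ("# STOCKHOLM", "stockholm"),
]


def guess_format(first_line):
    """Guess a format from a file's first line, or raise ValueError."""
    for prefix, name in DISTINCT_PREFIXES:
        if first_line.startswith(prefix):
            return name
    # Lines starting with "ID " could be EMBL or SwissProt, so they
    # deliberately fall through to the error along with everything else.
    raise ValueError("Unrecognised format")
```

Callers that hit the error would then have to state the format explicitly.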

Peter


From biopython-dev at maubp.freeserve.co.uk  Tue Nov 28 08:24:35 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 28 Nov 2006 13:24:35 +0000
Subject: [Biopython-dev] [BioPython] Problems with Win Release for
 Python 2.5: Numeric, KDTree
In-Reply-To: <005301c7129b$f3222300$b400a8c0@Sirius>
References: <005301c7129b$f3222300$b400a8c0@Sirius>
Message-ID: <456C3893.6060402@maubp.freeserve.co.uk>

Hendrik Weisser wrote:
> The main question for me is whether these issues (the 2nd, mostly) can be 
> addressed quickly, or whether it is recommended to use the "old" Python 2.4 
> and corresponding packages for the time being. Can anyone help me with that?

Yes - assuming you don't have all the compilers and stuff to compile
your own libraries (and therefore need to use the Windows installers),
using Windows with Python 2.4 and Numeric 24.2 with BioPython 1.42
should be fine.

Personally I use Python 2.4 on Linux (as shipped with the distribution)
and Python 2.3 on my Windows machine.

Both work fine with BioPython and Numeric - although I have not used
Bio.PDB very much.

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Nov 29 14:03:08 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Nov 2006 14:03:08 -0500
Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails
	with blastall 2.2.14
In-Reply-To: <bug-2090-42@http.bugzilla.open-bio.org/>
Message-ID: <200611291903.kATJ38DJ007489@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2090


------- Comment #1 from grunberg at embl.de  2006-11-29 14:03 -------
Things get worse with the current blastall 2.2.15. _scan_parameters in
NCBIStandalone.py expects "Number of HSP's better" which in the later blastall
versions has changed to: "Number of sequences better".  
This prevents the parser from fetching the next two lines even though they
would be there and then we get exceptions etc. 

Another independent problem occurs further down -- The lines::
  T: 11
  A: 40
have now changed to::
  Neighboring words threshold: 11
  Window for multiple hits: 40
and again we run into an exception. Both problems also occur in the latest CVS
snapshot.

Both can be fixed with some additional attempt_read_and_call but I am not sure
whether my quick and dirty fixes follow the right spirit...

change A:
---------
INSERT BEFORE...::
        # not in blastx 2.2.1
        attempt_read_and_call(uhandle, consumer.query_length,
                              has_re=re.compile(r"[Ll]ength of query"))
...These two statements::

        # in blastall 2.2.15
        attempt_read_and_call(uhandle, consumer.noevent,
                                 start="Number of HSP's gapped:")

        attempt_read_and_call(uhandle, consumer.noevent,
                          start="Number of HSP's successfully")

Change B:
---------
REPLACE::
        # not in BLASTN 2.2.9
        attempt_read_and_call(uhandle, consumer.threshold, start='T')
        read_and_call(uhandle, consumer.window_size, start='A')

BY::
        # not in BLASTN 2.2.9
        attempt_read_and_call(uhandle, consumer.threshold, start='T')
        attempt_read_and_call(uhandle, consumer.window_size, start='A')
        ## renamed in BLASTALL 2.2.15
        attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring')
        attempt_read_and_call(uhandle, consumer.window_size, start='Window')

Could someone with more Biopython experience please validate and apply the fix?
THX!
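
The idea behind change B, accepting either the old or the renamed label for the same value, can be shown in isolation. This is a standalone sketch and not how NCBIStandalone.py is structured (it uses a scanner/consumer design); the table and function names are hypothetical:

```python
# Hypothetical table: each field lists the labels that may introduce it.
PARAMETER_LABELS = {
    "threshold": ("T:", "Neighboring words threshold:"),
    "window_size": ("A:", "Window for multiple hits:"),
}


def parse_parameter_lines(lines):
    """Collect integer parameters, accepting old or new label spellings."""
    values = {}
    for line in lines:
        stripped = line.strip()
        for field, labels in PARAMETER_LABELS.items():
            for label in labels:
                if stripped.startswith(label):
                    # int() tolerates the whitespace left after the label.
                    values[field] = int(stripped[len(label):])
    return values


old_style = parse_parameter_lines(["  T: 11", "  A: 40"])
new_style = parse_parameter_lines(
    ["  Neighboring words threshold: 11", "  Window for multiple hits: 40"]
)
```

Both spellings yield the same dictionary, so code further down does not need to know which blastall version produced the report.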


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mdehoon at c2b2.columbia.edu  Wed Nov  1 05:58:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 01 Nov 2006 00:58:41 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4545D9F1.2040902@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>
	<454574B4.3050407@c2b2.columbia.edu>
	<4545D9F1.2040902@maubp.freeserve.co.uk>
Message-ID: <45483791.7070803@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> With such a speed up, I'd guess you were using Bio.Fasta before.

Yes I was. I just went to the Biopython tutorial and used the stuff in
section 2.4. I didn't expect it to be *that* slow.

> I've noticed the same thing.  Are you dealing with NCBI style fasta 
> identifiers made up of several fields separated by "|" characters?

Yep.

>> For duplicate keys, there are at least four possibilities (raise an
>> exception, store only one of the keys, store neither of the keys
>> and don't raise an exception, store both after modifying one of the
>> keys). So this should also be an option.
> 
> Supporting all these options with an easy to understand interface
> looks too hard.
> 
> In my opinion if someone is trying to build a dictionary using
> repeated keys they have made a mistake (either in their datafile, or
> their record2key function) - so raising an exception is reasonable
> default behaviour (and is easy to code).

You're probably right. I'm fine with raising an exception.

>> In the File2SequenceDict above, answer[key] contains the complete 
>> record. Some people will want that. However, in my application I
>> only want to store the record.seq part in answer[key]. Somebody
>> else may want str(record.seq). So we'd also need a record2value
>> argument.
> 
> It does slightly undermine the "you only get SeqRecord objects" 
> principle.  On the other hand, it's a simple addition that is easy to
> explain and implement.  I'm happy to add this.

The point I was trying to make is that for a File2SequenceDict function 
to be useful, it would end up being too complex. In the answer above, a 
user could also do answer[key].seq to get the part she wants, so maybe a 
record2value argument is not essential in practice.

Part of my opposition against the File2SequenceDict function is that it 
requires the parser to be called File2SequenceIterator (which I don't 
like as a name, but more about that some other time), which then leads 
to a File2SequenceList function, which is software bloat.

So, how about making the functionality of File2SequenceDict available as 
a todict() method to the iterator object returned by 
File2SequenceIterator, or, as an iterator2dict function?

--Michiel.


From biopython-dev at maubp.freeserve.co.uk  Wed Nov  1 10:09:59 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 01 Nov 2006 10:09:59 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45483791.7070803@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
Message-ID: <45487277.6080308@maubp.freeserve.co.uk>

> The point I was trying to make is that for a File2SequenceDict
> function to be useful, it would end up being too complex.

Of course I'm going to be biased here, but I do find the simple current
dictionary construction useful as it is.  Clearly we have slightly
different uses in mind (which is good - the design should try and cater
to most people).

> In the answer above, a user could also do answer[key].seq to get the
> part she wants, so maybe a record2value argument is not essential in
> practice.
> 
> Part of my opposition against the File2SequenceDict function is that
> it requires the parser to be called File2SequenceIterator (which I
> don't like as a name, but more about that some other time), which
> then leads to a File2SequenceList function, which is software bloat.
> 
> So, how about making the functionality of File2SequenceDict available
> as a todict() method to the iterator object returned by 
> File2SequenceIterator, or, as an iterator2dict function?

I do like your first suggestion - the idea of adding a todict() method 
to the iterator objects.  However, that would require that all the 
parsers be written as (sub)classes, and right now several of them are 
written as generator functions.

I've found using generator functions to be very simple, and easy to
understand.  They seem like a good choice for simple file formats.  But
with a good enough reason, I could turn them into classes.
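
For readers unfamiliar with the approach, here is a minimal generator-function parser for FASTA. It is a sketch only: the name is hypothetical, and it yields plain (title, sequence) tuples rather than SeqRecord objects so that it runs without Biopython.

```python
import io


def simple_fasta_iterator(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)  # emit the previous record
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if title is not None:
        yield title, "".join(chunks)  # emit the final record


records = list(simple_fasta_iterator(io.StringIO(">a desc\nAC\nGT\n>b\nTT\n")))
```

Because a generator function returns a built-in generator object, there is no class on which to hang a todict() method, which is the limitation raised above.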

                         ----

Right now I am making both "file to dict" and "iterator to dict"
functions available:

File2SequenceDict(..., record2key) is implemented as
SequenceIter2Dict(File2SequenceIterator(...), record2key)

Also:
File2Alignment(...) is implemented as
Iter2Alignment(File2SequenceIterator(...))

And:
File2SequenceList(...) is implemented as list(File2SequenceIterator(...))

Leaving aside the names (which I notice are not currently consistent) I
would be fine with removing File2SequenceList, File2SequenceDict, and
File2Alignment but retaining the two functions which convert from a
SeqRecord returning iterator into dict or an alignment.
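
As a rough sketch of the iterator-to-dictionary function under discussion, with the duplicate-key behaviour agreed earlier in the thread. The function name, the `Record` stand-in for SeqRecord, and the choice of ValueError are all assumptions made for illustration:

```python
from collections import namedtuple

# Stand-in for SeqRecord so the sketch runs without Biopython installed.
Record = namedtuple("Record", ["id", "seq"])


def sequences_to_dict(records, record2key=None):
    """Build a dict from an iterable of records, rejecting duplicate keys."""
    if record2key is None:
        def record2key(record):
            return record.id  # assumed default key
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # ValueError is an assumption; the thread only says "an exception".
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


records = [
    Record("gi|6273291|gb|AF191665.1|AF191665", "TATACATTAAAG"),
    Record("gi|6273290|gb|AF191664.1|AF191664", "TATACATAAAAG"),
]
# NCBI-style identifiers: use the fourth "|"-separated field as the key.
by_accession = sequences_to_dict(iter(records), lambda r: r.id.split("|")[3])
```

A caller who only wants the sequence can then write `by_accession["AF191665.1"].seq`, which is why a separate record2value argument may not be needed.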

How does that sound Michiel (subject to agreeing on names)?

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Nov  1 22:50:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Nov 2006 17:50:46 -0500
Subject: [Biopython-dev] [Bug 2131] New: SProt.py fails to parse the current
	Swiss-Prot version 51.0
Message-ID: <bug-2131-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

           Summary: SProt.py fails to parse the current Swiss-Prot version
                    51.0
           Product: Biopython
           Version: 1.24
          Platform: Macintosh
        OS/Version: MacOS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: Biosql at hotmail.com


Hi, 

I'm running Mac OS 10.4 with Python 2.5, and tried to parse the Swiss-Prot .dat
file with the latest SProt.py version and got this:

Traceback (most recent call last):
  File "Parser_SProt_to_DB.py", line 37, in <module>
    cur_record = s_iterator.next()
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 166, in next
    return self._parser.parse(File.StringHandle(data))
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 290, in parse
    self._scanner.feed(handle, self._consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 332, in feed
    self._scan_record(uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 337, in _scan_record
    fn(self, uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 369, in _scan_id
    self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 359, in _scan_line
    read_and_call(uhandle, event_fn, start=line_type)
  File "/sw/lib/python2.5/site-packages/Bio/ParserSupport.py", line 301, in read_and_call
    method(line)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 526, in identification
    self.data.sequence_length = int(cols[4])
ValueError: invalid literal for int() with base 10: 'AA.'

Any clue ?

Thanks !


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From chris.lasher at gmail.com  Thu Nov  2 03:49:04 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 1 Nov 2006 22:49:04 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
	<4542F123.9050106@c2b2.columbia.edu>
	<45434611.1040708@maubp.freeserve.co.uk>
	<4544458A.5000102@c2b2.columbia.edu>
	<45448FAF.1090104@maubp.freeserve.co.uk>
	<454574B4.3050407@c2b2.columbia.edu>
	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
	<45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>

I'd like to pitch in a few comments here.

Peter wrote:
> One point against names like File2SequenceIterator is the pun on two
> versus to (i.e. convert) will not be so obvious to non-native English
> speakers.

I'd like to second that. It's cute, sure, but FileToSequenceIterator
isn't that much more difficult, and leaves no room for confusion.
(e.g., Where's the File1SequenceIterator?)

Michiel wrote:
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.

Yikes! Are you serious? Why not make it easier and require a file-like
object? I would definitely not be for it taking a plain string. This
seems implicit rather than explicit. "Takes a file... or a file-like
object... or a string containing a filename... or just a string
containing the file contents... or a brief description of the data
that's in your file... or a bunch of smiley emoticons, if you're in a
good mood..." File-like objects are testable and leave little room for
surprise. Anything else seems like it's asking for a headache.

Which brings me to the issue of "guessing" a file's format. Yikes,
again! I'd expect that kind of "magickery" from Perl, but once again,
explicit is better than implicit. I honestly think it's not too much
to expect the user to know what filetype they're expecting BioPython
to deal with. Could you guys please explain the motivation behind this
to me? As I see it right now, the last thing I want is BioPython
incorrectly guessing my file format, and particularly, assuming that I
have put the proper extension to represent the file format. The
unified sequence object is what's beautiful about SeqIO, but the
guesswork that you are discussing having SeqIO's classes do is scary,
to me.

And I think by now it's predictable that I'm a fan of Peter's
suggestion to have an exception raised upon the attempt to create a
dictionary with identical IDs; all other options are, again, too
implicit for my tastes.

Thanks very much for developing SeqIO and discussing it so much, guys.
I think this will be a fantastic asset to BioPython! Keep on rockin'
it!

Chris


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 11:29:38 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 06:29:38 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021129.kA2BTcOX010117@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 06:29 -------
Hi Jonathan,

What version of BioPython are you using?

I know that bugzilla needs updating to include more up to date version numbers,
but you aren't really using BioPython 1.24 are you?

Currently the latest release is 1.42, and this does include some updates for
SProt, e.g. bug 1948. There is also a more recent fix in CVS for bug 2043
dealing with new style RX lines.

Could you tell us which SProt file you are using (a URL would be fine).  If
there are many that fail the same way, and you have a small example input file,
you could even attach it to this bug.

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython-dev at maubp.freeserve.co.uk  Thu Nov  2 12:49:30 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 02 Nov 2006 12:49:30 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>
	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
Message-ID: <4549E95A.6080605@maubp.freeserve.co.uk>

Chris Lasher wrote:
> I'd like to pitch in a few comments here.
> 
> Peter wrote:
>> One point against names like File2SequenceIterator is the pun on 
>> two versus to (i.e. convert) will not be so obvious to non-native 
>> English speakers.
> 
> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>  isn't that much more difficult, and leaves no room for confusion. 
> (e.g., Where's the File1SequenceIterator?)

I would be happy with FileToSequenceIterator, or even
FileToSequenceIter.  FileToSeqIter is shorter but we don't actually
return Seq objects so I would avoid that.

Does anyone else have any suggestions?

> Michiel wrote:
>> I like the idea of one argument that takes a file name or handle. I
>>  believe that that is how other Biopython functions work.

I've had a little look, and the only case I found is the recent
Bio.Nexus parser - and this choked on a StringIO handle on my machine
(fix checked in).

Chris Lasher wrote:
> Yikes! Are you serious? Why not make it easier and require a 
> file-like object? I would definitely not be for it taking a plain 
> string. This seems implicit rather than explicit. "Takes a file... or
> a file-like object... or a string containing a filename... or just a
> string containing the file contents... or a brief description of the
> data that's in your file... or a bunch of smiley emoticons, if 
> you're in a good mood..." File-like objects are testable and leave 
> little room for surprise. Anything else seems like it's asking for a 
> headache.

Trying to distinguish between an (invalid) filename and the contents of
a sequence file is just too much to ask - more a migraine than a headache.

As an experiment, I've implemented (but not checked in) automatic
handle/filename detection.  It seems to work (but I have not yet tried
exotic arguments like file names in Unicode, or random classes with a
__str__ method).  Still, it's messy.
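
The experimental code mentioned above was not checked in and is not shown in this thread. The following is only a hypothetical sketch of what handle/filename detection could look like, with an invented function name:

```python
import io


def open_if_filename(source, mode="r"):
    """Return (handle, opened_here) for a filename or an existing handle."""
    if isinstance(source, str):
        # Treat any string as a filename; telling a filename apart from
        # file contents is not attempted here.
        return open(source, mode), True
    if hasattr(source, "read"):
        return source, False  # already file-like; the caller keeps ownership
    raise TypeError("Expected a filename or a file-like object")


stream = io.StringIO(">seq1\nACGT\n")
handle, opened_here = open_if_filename(stream)
```

The `opened_here` flag shows part of the messiness: whoever opens the file is also responsible for closing it, so the two cases need different clean-up.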

While it does sound like a nice idea for the end user, the idea of
filenames and handles is pretty important in python, and maybe we
shouldn't worry about forcing newcomers to deal with handles.  After all,
the SeqIO system will make them deal with iterators and SeqRecords which
I think are far more complicated!

What do you think Michiel?

Chris Lasher wrote:
> Which brings me to the issue of "guessing" a file's format. Yikes, 
> again! I'd expect that kind of "magickery" from Perl, but once again,
> explicit is better than implicit. I honestly think it's not too much
> to expect the user to know what filetype they're expecting BioPython
> to deal with. Could you guys please explain the motivation behind 
> this to me? As I see it right now, the last thing I want is BioPython
> incorrectly guessing my file format, and particularly, assuming that
> I have put the proper extension to represent the file format. The 
> unified sequence object is what's beautiful about SeqIO, but the
> guesswork that you are discussing having SeqIO's classes do is scary,
> to me.

For comparison this quote is from the BioPerl SeqIO How-To:
>> [BioPerl's] SeqIO can try to guess based on known file extensions 
>> or content, ... it is a good idea to get into the practice of 
>> always specifying the format.

I want to stress that as written, the user can specify the file format
to the File2SequenceIterator function (and its variants).  Maybe we
should encourage people to explicitly supply the format in any Bio.SeqIO
documentation....

You asked about motivation for guessing the file format.  I break that
down into guessing the file format based on the file extension, or based
on the file's contents (see later).

I personally am perfectly happy with using a file extension to file
format mapping.  Maybe this reflects my computing background (more
DOS/Windows background than Unix/Linux).

Note that if the format is not specified, and the file extension is not
on the known list (e.g. "txt" or "data" which could be anything) then
the call to File2SequenceIterator function (or its variants) will fail
with an invalid format message/exception.

Assuming we don't make the format a required argument, and we keep the
extension to format mappings, then I should make a point of including
deliberate mismatches in the test suites - and check that they abort
with a SyntaxError.

Regarding guessing the format based on file contents:

For some applications, having a format guesser built into BioPython
might actually be very useful - the example given on the BioPerl website
is the back end of a web tool that took sequence input, where maybe you
can't trust the actual end user to know exactly what file format their
data is in.

Doing this for some file formats isn't too hard, often all you need to
see is the first line.  For other file formats it's very tricky and best
not attempted.  But, is partial guess support even worth implementing -
especially as it may be less than perfect and get it wrong sometimes?

I think Michiel and I were happy to leave this question for later...

Chris Lasher wrote:
> And I think by now it's predictable that I'm a fan of Peter's 
> suggestion to have an exception raised upon the attempt to create a 
> dictionary with identical IDs; all other options are, again, too 
> implicit for my tastes.

Good.  Michiel agreed in another email:
>> 
>> You're probably right. I'm fine with raising an exception.
>> 

Have you been following the rest of that SeqRecord dictionary discussion
Chris?

> Thanks very much for developing SeqIO and discussing it so much, 
> guys. I think this will be a fantastic asset to BioPython! Keep on 
> rockin' it!
> 
> Chris

Thank you for your passionate feedback :)

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 17:38:26 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 12:38:26 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021738.kA2HcQKH017740@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


Biosql at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.24                        |Not Applicable


------- Comment #2 from Biosql at hotmail.com  2006-11-02 12:38 -------
I'm using the latest version of Biopython 1.42 with the latest version of
Sprot.py from the CVS. 

I used the Swiss-Prot file version 51 coming from here : 

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

I also tried to parse this file on a PC with python 2.4.3 and the latest
biopython version and got the same result. 

Jonathan


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 18:07:03 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:07:03 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021807.kA2I73W3021327@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 13:07 -------
Created an attachment (id=491)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=491&action=view)
First four records from uniprot_sprot.dat.gz release 51

I was hoping for a smaller test case; uniprot_sprot.dat.gz is 185MB compressed,
and 836MB as plain text!  Anyway, I have extracted and attached a file with
just the first four records in it for anyone interested in testing.

I would guess from your stack trace that this recent change to the ID line
has caused the trouble:

http://ca.expasy.org/sprot/relnotes/sp_news.html#rel9.0

Old (with MoleculeType):
ID   EntryName DataClass; MoleculeType; SequenceLength.

New (without MoleculeType):
ID   EntryName DataClass; SequenceLength.

e.g.
ID   CYC_PIG                 Reviewed;         104 AA.
ID   Q3ASY8_CHLCH            Unreviewed;     36805 AA.

This shouldn't be too hard to fix...
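
As a rough illustration of coping with both layouts (a sketch only, not
the code in SProt.py; the function name parse_id_line is hypothetical),
the number of ";"-separated fields is enough to tell them apart:

```python
def parse_id_line(line):
    """Parse a Swiss-Prot ID line in either the old or the new layout.

    Returns (entry_name, data_class, molecule_type, sequence_length);
    molecule_type is None for the new layout, which no longer has it.
    """
    if not line.startswith("ID   "):
        raise ValueError("Not an ID line: %r" % (line,))
    # Drop the line code and the trailing full stop, then split the fields.
    fields = line[5:].rstrip().rstrip(".").split(";")
    entry_name, data_class = fields[0].split()
    if len(fields) == 3:
        # Old: EntryName DataClass; MoleculeType; SequenceLength.
        molecule_type = fields[1].strip()
        length_field = fields[2]
    elif len(fields) == 2:
        # New: EntryName DataClass; SequenceLength.
        molecule_type = None
        length_field = fields[1]
    else:
        raise ValueError("Unexpected ID line layout: %r" % (line,))
    # The length field looks like "104 AA"; keep only the number.
    sequence_length = int(length_field.split()[0])
    return entry_name, data_class, molecule_type, sequence_length
```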


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 18:41:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:41:46 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021841.kA2Ifkpg025233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 13:41 -------
Fix checked into CVS, please reopen the bug if you run into problems.

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython

file Bio/SwissProt/SProt.py
Revision 1.34 made 2nd Nov 2006

This is the test script I used with the example file from comment 3
(attachment 491):

from Bio.SwissProt import SProt

# SequenceParser: works
rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.SequenceParser())
for record in rec_iter:
    print record.id
    print record.seq

# RecordParser: failed
rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.RecordParser())
for record in rec_iter:
    print record


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 19:16:53 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 14:16:53 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current
	Swiss-Prot version 51.0
In-Reply-To: <bug-2131-42@http.bugzilla.open-bio.org/>
Message-ID: <200611021916.kA2JGrGY028566@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131


------- Comment #5 from Biosql at hotmail.com  2006-11-02 14:16 -------
Thank you Peter !

So fast and so good. 

Jonathan


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 21:27:25 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:27:25 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
	Swiss-Prot version (RX and OH lines are broken)
In-Reply-To: <bug-2043-42@http.bugzilla.open-bio.org/>
Message-ID: <200611022127.kA2LRPkN009879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2043


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 16:27 -------
It seems to be working, from the small amount of testing I did on another
Swiss-Prot bug.

Marking as fixed - please reopen if needed.


From bugzilla-daemon at portal.open-bio.org  Thu Nov  2 21:38:01 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:38:01 -0500
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To: <bug-1944-42@http.bugzilla.open-bio.org/>
Message-ID: <200611022138.kA2Lc1Zi010834@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2006-11-02 16:38 -------
While working on bug 2059, I've been tempted to make some similar changes to
Marc's suggestions about handling the SeqRecord's id/name/description, and the
addition of an "add SeqRecord" method.

I like the idea of adding a method to iterate over the sequences.  How about
something a little simpler (which I haven't tested yet):

     def __iter__(self):
         """Iterate over the SeqRecord objects making up the alignment"""
         return iter(self._records)

i.e. Use the fact that self._records is a list, and will support iteration
itself.  This avoids having to keep track of the current iteration position in
our own next method.
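
A self-contained sketch of the same idea on a toy class (RecordCollection
is a hypothetical stand-in, not the Align.Generic code):

```python
class RecordCollection(object):
    """Toy container holding records in a private list (hypothetical)."""

    def __init__(self, records):
        self._records = list(records)

    def __iter__(self):
        """Iterate over the stored records."""
        # Delegating to the list gives every for-loop its own fresh
        # iterator, so no position has to be tracked by hand.
        return iter(self._records)
```

Because each call to __iter__ returns a new list iterator, the object can
be looped over repeatedly, and nested loops over it do not interfere.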

Also, would anyone else like to be able to iterate over the columns?


From mdehoon at c2b2.columbia.edu  Fri Nov  3 02:20:23 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:20:23 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>
	<45483791.7070803@c2b2.columbia.edu>
	<45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <454AA767.9030506@c2b2.columbia.edu>

Peter wrote:
> Right now I am making both "file to dict" and "iterator to dict"
> functions available:
> 
> File2SequenceDict(..., record2key) is implemented as
> SequenceIter2Dict(File2SequenceIterator(...), record2key)
> 
> Also:
> File2Alignment(...) is implemented as
> Iter2Alignment(File2SequenceIterator(...))
> 
> And:
> File2SequenceList(...) is implemented as list(File2SequenceIterator(...))
> 
> Leaving aside the names (which I notice are not currently consistent) I
> would be fine with removing File2SequenceList, File2SequenceDict, and
> File2Alignment but retaining the two functions which convert from a
> SeqRecord returning iterator into dict or an alignment.
> 
> How does that sound Michiel (subject to agreeing on names)?
That sounds good to me.

--Michiel.


From mdehoon at c2b2.columbia.edu  Fri Nov  3 02:44:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:44:47 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4549E95A.6080605@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
	<4549E95A.6080605@maubp.freeserve.co.uk>
Message-ID: <454AAD1F.5050006@c2b2.columbia.edu>

Peter wrote:
> Chris Lasher wrote:
>> Peter wrote:
>>> One point against names like File2SequenceIterator is the pun on 
>>> two versus to (i.e. convert) will not be so obvious to non-native 
>>> English speakers.
>> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>>  isn't that much more difficult, and leaves no room for confusion. 
>> (e.g., Where's the File1SequenceIterator?)
> 
> I would be happy with FileToSequenceIterator, or even
> FileToSequenceIter.  FileToSeqIter is shorter but we don't actually
> return Seq objects so I would avoid that.
> 
> Does anyone else have any suggestions?

Yes, but let's discuss function names after we decide which functions we 
want.

> While it does sound like a nice idea for the end user, the idea of
> filenames and handles is pretty important in python, and maybe we
> shouldn't worry about forcing newcomers to deal with handles.  After all,
> the SeqIO system will make them deal with iterators and SeqRecords which
> I think are far more complicated!
> 
> What do you think Michiel?

My preferred solution would be for File2SequenceIterator to take handles 
only.
Same as Bio.Blast:

from Bio.Blast import NCBIXML

blast_out = open('my_blast.out')
b_parser = NCBIXML.BlastParser()
b_record = b_parser.parse(blast_out)
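
One consequence of a handles-only interface, sketched with a toy function
(iter_fasta_titles is hypothetical, not a Biopython function): anything
that yields lines will do, whether it comes from open() or from an
in-memory StringIO.

```python
import io


def iter_fasta_titles(handle):
    """Yield the title line (without '>') of each FASTA record in a handle."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].strip()


# An in-memory handle here; open("some_file.fasta") would work the same way.
demo = io.StringIO(">seq1 first\nACGT\n>seq2 second\nGGCC\n")
titles = list(iter_fasta_titles(demo))
```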

> Chris Lasher wrote:
>> Which brings me to the issue of "guessing" a file's format. Yikes, 
>> again! I'd expect that kind of "magickery" from Perl, but once again,
>> explicit is better than implicit. I honestly think it's not too much
>> to expect the user to know what filetype they're expecting BioPython
>> to deal with. Could you guys please explain the motivation behind 
>> this to me?
 >......
>
> I think Michiel and I were happy to leave this question for later...
> 
I am leaning towards Chris' opinion. File type guessing (from extension 
or file contents) doesn't seem really necessary. At least, I don't 
remember a user asking for it. The benefits of file type guessing from 
the extension are minimal (since a user can probably do that more 
reliably himself, knowing the file names he's likely to encounter). And 
since file type guessing will not be foolproof, it may even be 
confusing. Once file type guessing is available in Biopython though, 
we're committed to it and we'll have to support it. So I'd be happier 
without the file type guessing functionality.

That said, if somebody really wants it, I can live with it.

--Michiel.


From biopython-dev at maubp.freeserve.co.uk  Fri Nov  3 11:48:17 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 03 Nov 2006 11:48:17 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454AAD1F.5050006@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>
	<454AAD1F.5050006@c2b2.columbia.edu>
Message-ID: <454B2C81.9090309@maubp.freeserve.co.uk>

My apologies for this somewhat long email.

Handles and Filenames
=====================

Currently the individual format specific iterators just require a handle
(and not a filename).  Are we all happy with this?

Michiel de Hoon wrote:
>> While it does sound like a nice idea for the end user, the idea of 
>> filenames and handles is pretty important in python, and maybe we 
>> shouldn't worry about forcing newcomers to deal with handles.  After
>> all, the SeqIO system will make them deal with iterators and
>> SeqRecords which I think are far more complicated!
>> 
>> What do you think Michiel?
> 
> My preferred solution would be for File2SequenceIterator to take
> handles only.

Assuming we keep the non-ambiguous file extension to file format
mappings, allowing a filename as a possible argument to
File2SequenceIterator (and any variants) makes good sense.

Note that most handle objects have a "name" attribute to get the
filename, which could be used to determine the file extension.  i.e. We
can still do the file extension to file format mapping using just a file
handle (instead of a filename).
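
A sketch of that idea, assuming a made-up extension table (the names
EXTENSION_TO_FORMAT and guess_format are hypothetical, and the mapping
itself is only an example). Handles without a usable name, such as
StringIO objects, simply give no answer:

```python
import os

# Hypothetical mapping from file extension to format name.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".faa": "fasta",
    ".gbk": "genbank",
    ".sth": "stockholm",
}


def guess_format(handle):
    """Return a format name based on handle.name, or None if unknown."""
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        # No name attribute (e.g. StringIO), or not a path string.
        return None
    extension = os.path.splitext(name)[1].lower()
    return EXTENSION_TO_FORMAT.get(extension)
```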

Currently File2SequenceIterator has separate named arguments for a
handle, filename and format.  If no handle is provided, it will open one
using the filename provided.

We could make the handle and format the first arguments as a compromise?

If we drop the extension to file format mapping (see below), then I
agree File2SequenceIterator could just expect a handle and not a filename.

Guessing File Formats
=====================

>> Chris Lasher wrote:
>>> Which brings me to the issue of "guessing" a file's format.
>>> Yikes, again! I'd expect that kind of "magickery" from Perl, but
>>> once again, explicit is better than implicit. I honestly think
>>> it's not too much to expect the user to know what filetype
>>> they're expecting BioPython to deal with. Could you guys please
>>> explain the motivation behind this to me?

Michiel de Hoon wrote:
> I am leaning towards Chris' opinion. File type guessing (from
> extension or file contents) doesn't seem really necessary. At least,
> I don't remember a user asking for it. The benefits of file type
> guessing from the extension are minimal (since a user can probably do
> that more reliably himself, knowing the file names he's likely to
> encounter). And since file type guessing will not be foolproof, it
> may even be confusing. Once file type guessing is available in
> Biopython though, we're committed to it and we'll have to support it.
> So I'd be happier without the file type guessing functionality.
> 
> That said, if somebody really wants it, I can live with it.

I agree that we shouldn't implement file format guessing based on the
contents of a file (unless, as you say, we get strong feedback wanting it).

I personally want the file extension to format mapping, but then I am
fairly disciplined about using file extensions.  As I seem to be the
only voice advocating this, it looks like I may have to give in...

Is it worth asking on the main discussion list to canvass opinion?

Maybe we should settle on the function names before doing that - it
would be better to replace the current function names now, before too many
people are used to them.

Functions and Naming
====================
This is where I think things stand for Bio/SeqIO/__init__.py

We have functions to do the following, where "file" may mean just a
handle, or perhaps the choice of a handle or filename (see above):

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

Possible names without the digit two: FileToSequenceIterator,
SequencesToDict, SequencesToAlignment, and SequencesToFile

I think Michiel wanted to drop the following "wrapper functions" as code
bloat:

(*) File to list of SeqRecord objects, currently File2SequenceList
     Just use list(File2SequenceIterator(...)) instead

(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
     Just use SequenceIter2Dict(File2SequenceIterator(...)) instead

(*) File to alignment, currently File2Alignment
     Just use Iter2Alignment(File2SequenceIterator(...))

The reason I invented the above three examples was so I could do things
like this in one line (assuming my files have valid known extensions):

rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align    = File2Alignment(filename="demo.sth")

or perhaps:

align    = File2Alignment(filename="demo.aln", format="clustal")

The alternative suggestions seem to lead to using file handles and an
explicit format, with a second function to convert from an iterator if
required.  While this can be done in one line - I find the following
much less straightforward:

rec_iter = File2SequenceIterator(open("demo.faa"), "fasta")

rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank"))

rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"),
                                                    "fasta"))

align = Iter2Alignment(File2SequenceIterator(open("demo.sth"),
                                              "stockholm"))


Peter


From sbassi at gmail.com  Sat Nov  4 20:48:22 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 4 Nov 2006 17:48:22 -0300
Subject: [Biopython-dev] Microbiology module
Message-ID: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>

I am working on functions for industrial microbiology, such as:
growth rate equations, continuous culture equations, batch culture,
yields for different sources of energy (and for fermentation or
respiration), oxygen consumption rate, constants, thermodynamic equations
used in bioreactors, cell cultures and so on.
Biopython is lacking such a module, but I am not sure if this is out
of scope. Is there a chance to include it in Biopython, or is this not
useful?
I think this could extend Biopython into a whole new area (bioprocess
and microbiology).
Please tell me what the maintainers think about this. If this idea is
rejected, I will write ugly and uncommented code for my own use,
but if accepted, I will write nice, documented code for people to see
:)
Best regards,
SB.

-- 
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318


From sbassi at gmail.com  Sun Nov  5 14:49:20 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 5 Nov 2006 11:49:20 -0300
Subject: [Biopython-dev] Microbiology module
In-Reply-To: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
References: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
	<2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
Message-ID: <b43bf2080611050649l196da833q920a571871e727a3@mail.gmail.com>

On 11/5/06, Thomas Hamelryck <thamelry at binf.ku.dk> wrote:
>
> Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right?
>

Yes, some methods could be used as a base for systems biology.


From thamelry at binf.ku.dk  Sun Nov  5 14:33:53 2006
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Sun, 5 Nov 2006 15:33:53 +0100
Subject: [Biopython-dev] Microbiology module
In-Reply-To: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
References: <b43bf2080611041248l12b9e7f3x2882f3c084a27d23@mail.gmail.com>
Message-ID: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>

On 11/4/06, Sebastian Bassi <sbassi at gmail.com> wrote:
>
> I am working on functions for industrial microbiology, such as:
> growth rate equations, continuous culture equations, batch culture,
> yields for different sources of energy (and for fermentation or
> respiration), oxygen consumption rate, constants, thermodynamic equations
> used in bioreactors, cell cultures and so on.


Sounds like a fun project, and a potentially valuable addition to Biopython.
I guess some of the topics you mention might be of relevance to systems
biology, right?

Best regards,

----
Thomas Hamelryck, Marie Curie EU-Research fellow
Bioinformatics center
Institute of Molecular Biology
University of Copenhagen
Universitetsparken 15 - Building 10
DK-2100 Copenhagen Ø
Denmark
Homepage: http://www.binf.ku.dk/Protein_structure


From idoerg at burnham.org  Tue Nov  7 17:48:05 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue, 07 Nov 2006 09:48:05 -0800
Subject: [Biopython-dev] InterProScan parser?
Message-ID: <4550C6D5.10606@burnham.org>

Hi,

Does anybody have an interproscan parser, by any chance? Preferably for 
the XML or EBIXML output.

Thanks,

Iddo

-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 17:13:05 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:13:05 -0500
Subject: [Biopython-dev] [Bug 2137] New: Install from CVS fails on
	clistfnsmodule.c compilation
Message-ID: <bug-2137-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137

           Summary: Install from CVS fails on clistfnsmodule.c compilation
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: chris.lasher at gmail.com


On November 7, 2006, I did a fresh checkout of BioPython from the CVS
repository. Attempts to build/install the CVS checkout are failing on attempts
to compile Bio/clistfnsmodule.c. The main culprit seems to be a missing file,
Python.h.


From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 17:15:42 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:15:42 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081715.kA8HFg6e017131@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #1 from chris.lasher at gmail.com  2006-11-08 12:15 -------
Created an attachment (id=497)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=497&action=view)
Output from failed installation.

This is the output from my failed installation.


From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 17:33:41 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 12:33:41 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081733.kA8HXfpb018644@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp  2006-11-08 12:33 -------
This very much looks like a problem with your Python installation. Do you have
the Python.h header file on your system?
This problem may arise if you installed python using an rpm. If so, make sure
to install the python-devel rpm also. That one contains Python.h.


From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 18:41:43 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 13:41:43 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081841.kA8IfhFJ023784@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


------- Comment #3 from chris.lasher at gmail.com  2006-11-08 13:41 -------
(In reply to comment #2)
> This very much looks like a problem with your Python installation. Do you have
> the Python.h header file on your system?
> This problem may arise if you installed python using an rpm. If so, make sure
> to install the python-devel rpm also. That one contains Python.h.
> 

Good call! My apologies, I feel foolish now. For Debian/*buntu users, the
package to get is python-dev.

Should I add something about the Python development packages being necessary
for installation from CVS source on http://biopython.org/wiki/CVS ?


From bugzilla-daemon at portal.open-bio.org  Wed Nov  8 19:05:13 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Nov 2006 14:05:13 -0500
Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on
	clistfnsmodule.c compilation
In-Reply-To: <bug-2137-42@http.bugzilla.open-bio.org/>
Message-ID: <200611081905.kA8J5DIg025030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2137


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp  2006-11-08 14:05 -------
> Should I add something about the Python development packages being necessary
> for installation from CVS source on http://biopython.org/wiki/CVS ?

The Python development packages are always needed, including when installing
an official Biopython release.
If you could add some text to that effect to the Biopython wiki somewhere, that
would be great.

Closing this bug.


From idoerg at burnham.org  Thu Nov  9 02:39:23 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Wed, 08 Nov 2006 18:39:23 -0800
Subject: [Biopython-dev] [BioPython] EUtils module
In-Reply-To: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
References: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
Message-ID: <455294DB.6000105@burnham.org>

Srinivas Iyyer wrote:
> Dear Group,
> 
> I downloaded EUtils module. 
> 
> I am trying to reproduce the code given in :
> 
> http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html


> 
> I am getting Errors. 

This is code from an alpha version of EUtils used at a presentation. I 
don't think it was meant to be reproducible, or even made it into the 
final module.

You might want to look under the hood. There is a README file in the 
EUtils installation, which has some examples.

But NCBI changes the EUtils specifications quite frequently, so chances 
are, if no one has used EUtils for a while, that it might be broken.

> 
> I want to know which databases in Entrez are supported
> by EUtils.
> 
> Could any one please help me whats the problem.
> 
> Are not many people using EUtils. 
> 
> Thanks
> 
>>>> import EUtils
>>>> dbs = EUtils.dblist()
> 
> Traceback (most recent call last):
>   File "<pyshell#1>", line 1, in -toplevel-
>     dbs = EUtils.dblist()
> AttributeError: 'module' object has no attribute
> 'dblist'
>>>> dbinfo = EUtils.dbinfo("pubmed")
> 
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in -toplevel-
>     dbinfo = EUtils.dbinfo("pubmed")
> AttributeError: 'module' object has no attribute
> 'dbinfo'
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From mdehoon at c2b2.columbia.edu  Fri Nov 10 06:28:49 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Fri, 10 Nov 2006 01:28:49 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454B2C81.9090309@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>
	<454B2C81.9090309@maubp.freeserve.co.uk>
Message-ID: <45541C21.6080402@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> Currently the individual format specific iterators just require a handle
> (and not a filename).  Are we all happy with this?

Happy.

> We could make the handle and format the first arguments as a compromise?

If in doubt, don't add it to Biopython!
It's much easier to add a functionality later, should the need arise, 
than to remove one.

> I personally want the file extension to format mapping, but then I am
> fairly disciplined about using file extensions.  As I seem to be the
> only voice advocating this, it looks like I may have to give in...
> 
> Is it worth asking on the main discussion list to canvass opinion?

Sure, go ahead. But ask *why* a user wants file extension to format 
mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to 
know which use case that we haven't thought about yet warrants file 
extension to format mapping.

> We have functions to do the following, where "file" may mean just a
> handle, or perhaps the choice of a handle or filename (see above):
> 
> (*) File to SeqRecord iterator, currently File2SequenceIterator
> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
> (*) Write SeqRecord iterator/list to a file, currently Sequences2File

If:
   File2SequenceIterator doesn't infer the file format from the extension
and
   File2SequenceIterator takes handles only, so no file names,
then:
   Why do we need the File2SequenceIterator function?

Btw, we should make a new Biopython release once the dust settles.

--Michiel.


From idoerg at burnham.org  Fri Nov 10 07:30:17 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu, 09 Nov 2006 23:30:17 -0800
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45541C21.6080402@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>
	<45541C21.6080402@c2b2.columbia.edu>
Message-ID: <45542A89.6050202@burnham.org>

Michiel de Hoon wrote:
> Peter (BioPython Dev) wrote:
>> Currently the individual format specific iterators just require a handle
>> (and not a filename).  Are we all happy with this?
> 
> Happy.

I second that.

I have two arguments against accepting filenames:

1) It is standard practice in Biopython to pass a file handle as the 
argument to a parser rather than a filename. If we break this, we would 
have to start thinking about which parser takes a handle and which a 
filename. Things will be a mess.

2) Also, what if you are not passing a real file? E.g. I have 
applications that pass StringIO streams into the parser. By accepting 
filenames you are lumping two levels of IO into one, and IMHO that is 
bad practice. In other words, a filehandle can always be generated from 
a file, easily:

 >>> filefunc(open('myfile'))

but you cannot generate a file from filehandle-type data. OK, you 
can programmatically generate a tmp file for reading, but that places a 
burden on the user.
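
To illustrate the point about handles, here is a minimal sketch in modern
Python 3 syntax. The function count_fasta_records is a hypothetical stand-in
for a parser, not part of Biopython; it only assumes that the handle yields
lines of text.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by looking for title lines on any text handle.

    Hypothetical stand-in for a parser: it never needs a filename,
    only an object that can be iterated line by line.
    """
    return sum(1 for line in handle if line.startswith(">"))


# An in-memory stream works exactly like an open file would.
in_memory = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n")
n_records = count_fasta_records(in_memory)

# For a real file the caller simply opens it first, for example:
#     with open("myfile") as handle:
#         n_records = count_fasta_records(handle)
```

The same function serves both cases because the caller, not the parser,
decides where the data comes from.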

3) The last argument against rigid filename extensions is 
interoperability with other applications that generate those files. 
Suppose you have one application that generates fasta files with a .tfa 
extension, and another with a .fa extension and yet a third with .pfa 
extensions... and those extensions are important to you for other 
reasons, like knowing which is a nucleic acid file and which is protein. 
Actually, all the NCBI genomic files are built like this... :)

OK, three arguments. I think that relying on filename extensions for 
content is rather DOS-ish and places an extra burden on the user. I'm 
suffering enough on my Windows machine with Rasmol trying to open all my 
.pdb files. Including those where pdb stands for "Palm Pilot database" 
rather than Protein Data Bank.


> 
>> We could make the handle and format the first arguments as a compromise?
> 
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise, 
> than to remove one.

We could add the format as an OPTIONAL keyword argument, with a "None" 
default value, and have the parser recognize the format from a lookahead 
using a magic regexp for each format. A user-passed format overrides 
the parser's guesswork. It shouldn't be too hard to implement, as file 
formats are very distinct.
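
A rough sketch of that idea in Python 3 follows. Everything in it is an
assumption made for illustration: the name resolve_format, the MAGIC table
and its three patterns are hypothetical, and a seekable handle is assumed so
that the lookahead can be undone.

```python
import io
import re

# Hypothetical "magic" patterns, one per format, tested against the first line.
MAGIC = [
    ("fasta", re.compile(r"^>")),
    ("genbank", re.compile(r"^LOCUS\s")),
    ("clustal", re.compile(r"^CLUSTAL")),
]


def resolve_format(handle, format=None):
    """Return the format name, preferring the caller's explicit choice."""
    if format is not None:
        # A user-passed format always overrides the guesswork.
        return format
    # Look ahead one line, then rewind so the parser sees the whole file.
    position = handle.tell()
    first_line = handle.readline()
    handle.seek(position)
    for name, pattern in MAGIC:
        if pattern.match(first_line):
            return name
    raise ValueError("Unrecognised format")


handle = io.StringIO(">seq1 example\nACGT\n")
guessed = resolve_format(handle)
forced = resolve_format(handle, format="genbank")
```

Because the handle is rewound after the lookahead, the real parser can then
be started on the same handle without losing the first line.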


> 
>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions.  As I seem to be the
>> only voice advocating this, it looks like I may have to give in...
>>
>> Is it worth asking on the main discussion list to canvass opinion?
> 
> Sure, go ahead. But ask for *why* a user wants file extension to format 
> mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to 
> know which usage case that we haven't thought about yet warrants file 
> extension to format mapping.
> 
>> We have functions to do the following, where "file" may mean just a
>> handle, or perhaps the choice of a handle or filename (see above):
>>
>> (*) File to SeqRecord iterator, currently File2SequenceIterator
>> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
>> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
>> (*) Write SeqRecord iterator/list to a file, currently Sequences2File
> 
> If:
>    File2SequenceIterator doesn't infer the file format from the extension
> and
>    File2SequenceIterator takes handles only, so no file names,
> then:
>    Why do we need the File2SequenceIterator function?
> 
> Btw, we should make a new Biopython release once the dust settles.
> 
> --Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From biopython-dev at maubp.freeserve.co.uk  Tue Nov 14 00:49:02 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 14 Nov 2006 00:49:02 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45542A89.6050202@burnham.org>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>
	<45542A89.6050202@burnham.org>
Message-ID: <4559127E.3050109@maubp.freeserve.co.uk>

Iddo Friedberg wrote:
> 3) The last argument against rigid filename extensions is 
> interoperability with other applications that generate those files. 
> Suppose you have one application that generates fasta files with a
> .tfa extension, and another with a .fa extension and yet a third with
> .pfa extensions... and those extensions are important to you for
> other reasons, like knowing which is a nucleic acid file and which is
> protein. Actually, all the NCBI genomic files are built like this...
> :)

Interesting tidbit.

If you are using "exotic" file extensions, then you would have to
explicitly tell my Bio.SeqIO code the file's format.

Although "fa" is currently a known extension mapped to fasta format in
Bio.SeqIO, your other examples are not.  Are these other extensions used
outside the internal systems of the NCBI?

> OK, three arguments. I think that relying on filename extensions for
> content is rather DOS-ish and places an extra burden on the user.

I'm not trying to force anyone into using specific filename extensions -
  I'm trying to make life easier for people who already do this (or who
download their data from online sources like the NCBI or PFAM - which do
seem to be consistent in their naming conventions).

> I'm suffering enough on my Windows machine with Rasmol trying to open
> all my .pdb files. Including those where pdb stands for "Palm Pilot
> database" rather than Protein Data Bank.

Yes - multiple interpretations of a given file extension are a problem.
I've noticed that same PDB extension clash too (but I don't use a Palm
Pilot any more).

Can anyone think of any common extensions used for more than one file
format?  I know Clustal uses *.aln for its alignments which is perhaps
asking for trouble...

> We could add the format as an OPTIONAL keyword argument, with a "None"
> default value. And have the parser recognize the format from a
> lookahead using a magic regexp for each format. The user passed
> format overrides the parser guesswork. Shouldn't be too hard to
> implement, as file formats are very distinct.

Currently the format is an optional keyword argument defaulting to None.
When it is omitted, I currently use a limited filename extension to
format mapping (assuming the filename is available) to deduce/guess the
format.
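
For readers who have not seen the code, the behaviour described above could
look roughly like the following Python 3 sketch. It is not the actual
Bio.SeqIO implementation: format_from_filename and the table are
hypothetical, and only the "fa" entry is taken from this thread (the other
entries are illustrative guesses).

```python
import os

# Only "fa" -> fasta is confirmed in this thread; the rest are assumptions.
EXTENSION_TO_FORMAT = {
    "fa": "fasta",
    "fasta": "fasta",
    "gbk": "genbank",
    "aln": "clustal",
}


def format_from_filename(filename, format=None):
    """Return the explicit format if given, else guess from the extension."""
    if format is not None:
        return format
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            "Unknown extension %r, please give the format explicitly" % extension
        )


known = format_from_filename("ecoli.fa")
explicit = format_from_filename("ecoli.tfa", format="fasta")
```

With such a table, a file using an extension like .tfa or .pfa is only
readable when the caller names the format.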

Peter


From idoerg at burnham.org  Tue Nov 14 17:19:14 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue, 14 Nov 2006 09:19:14 -0800
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4559127E.3050109@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>	<4542F123.9050106@c2b2.columbia.edu>	<45434611.1040708@maubp.freeserve.co.uk>	<4544458A.5000102@c2b2.columbia.edu>	<45448FAF.1090104@maubp.freeserve.co.uk>	<454574B4.3050407@c2b2.columbia.edu>	<4545D9F1.2040902@maubp.freeserve.co.uk>	<45483791.7070803@c2b2.columbia.edu>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>	<45542A89.6050202@burnham.org>
	<4559127E.3050109@maubp.freeserve.co.uk>
Message-ID: <4559FA92.8070408@burnham.org>

Peter (BioPython Dev) wrote:
> Iddo Friedberg wrote:
>> 3) The last argument against rigid filename extensions is 
>> interoperability with other applications that generate those files. 
>> Suppose you have one application that generates fasta files with a
>> .tfa extension, and another with a .fa extension and yet a third with
>> .pfa extensions... and those extensions are important to you for
>> other reasons, like knowing which is a nucleic acid file and which is
>> protein. Actually, all the NCBI genomic files are built like this...
>> :)
> 
> Interesting tidbit.
> 
> If you are using "exotic" file extensions, then you would have to
> explicitly tell my Bio.SeqIO code the file's format.
> 
> Although "fa" is currently a known extension mapped to fasta format in
> Bio.SeqIO, your other examples are not.  Are these other extensions used
> outside the internal systems of the NCBI?

I would not call it a tidbit, or exotic. It is very prevalent; NCBI's 
GenBank genomic repositories are very much deferred to. The point is, 
since NCBI uses one standard of file extensions for its genomic 
databases, TIGR another (actually, TIGR points to GenBank for completed 
genomes), UCSC a third... then maybe relying on file suffixes is not 
such a great idea.

See for example the E. coli genome:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12

Some are in FASTA format, but have different contents: whole genome, 
noncoding RNA, protein. The same goes for those in GenBank format. So 
the NCBI suffixes denote not only the file format, but the biological 
content as well.

Also, for the reasons I gave in my previous email, I think we should 
stick with passing file handles, not file names.

There is no real need to pass a filename rather than a file handle. 
If you need information from the filename, you can read the filename 
from the file handle:

 >>> foo = open('foo')

 >>> print foo.name
'foo'

And the functions could still accept StringIO streams if needed.
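
The same idea in Python 3 syntax, as a small sketch. The helper handle_name
is hypothetical; it relies on the fact that file objects returned by open()
have a name attribute while io.StringIO streams do not.

```python
import io
import os
import tempfile


def handle_name(handle):
    """Return the file name behind a handle, or None if there is none."""
    return getattr(handle, "name", None)


# A real file handle remembers the path it was opened with.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "foo.fa")
    with open(path, "w") as out:
        out.write(">seq1\nACGT\n")
    with open(path) as handle:
        real_name = handle_name(handle)

# An in-memory stream has no name, so the caller must supply the format.
stream_name = handle_name(io.StringIO(">seq1\nACGT\n"))
```

Any code that wants to use the filename therefore has to cope with it being
unavailable.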

> 
>> 
> 
> I'm not trying to force anyone into using specific filename extensions -
>   I'm trying to make life easier for people who already do this (or who
> download their data from online sources like the NCBI or PFAM - which do
> seem to be consistent in their naming conventions).
> 

You cannot rely on such consistency prevailing. Especially not with NCBI.;)


> 
>> We could add the format as an OPTIONAL keyword argument, with a "None"
>> default value. And have the parser recognize the format from a
>> lookahead using a magic regexp for each format. The user passed
>> format overrides the parser guesswork. Shouldn't be too hard to
>> implement, as file formats are very distinct.
> 
> Currently the format is an optional keyword argument defaulting to None.
> When it is omitted, I currently use a limited filename extension to
> format mapping (assuming the filename is available) to deduce/guess the
> format.
> 


Ideally, the data format should be supplied by the user. Second best is 
inferring it from parsing the first line or so in the file. Third is 
the filename extension. But both the second and third options are not 
very good practice, IMHO.


> Peter
> 
> 
> 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From bugzilla-daemon at portal.open-bio.org  Tue Nov 14 20:48:49 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 15:48:49 -0500
Subject: [Biopython-dev] [Bug 2143] New: Error parsing BLAT output (using
	out=blast format)
Message-ID: <bug-2143-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2143

           Summary: Error parsing BLAT output (using out=blast format)
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: fgibbons at hms.harvard.edu


Attempting to parse this BLAT output (see below) raises an "I couldn't find the
sbjct in" exception.

After looking at the code, it seems to me that the problem is an overly strict
regexp, that relies on a single space between the "Sbjct:" and the integer that
follows it. Replace the literal space with '\s*', and it goes away. This in
fact matches the regexp used to match the "Query:". I can't imagine that it
might hurt things, even in the main NCBIBlastParser, but you never know.... 

(All of the above refers to the method sbjct in class _HSPConsumer, file
NCBIStandalone.py)
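
The effect of that change can be shown with a small Python sketch. The two
patterns below are simplified, hypothetical stand-ins and not the actual
expressions in NCBIStandalone.py; they only contrast a literal space with
'\s*' after "Sbjct:".

```python
import re

# Hypothetical, simplified patterns; the real parser's regexps are longer.
strict = re.compile(r"Sbjct: (\d+)")     # exactly one space required
relaxed = re.compile(r"Sbjct:\s*(\d+)")  # any amount of whitespace, even none

lines = [
    "Sbjct: 192 MAINSG 245",   # one space
    "Sbjct:  192 MAINSG 245",  # two spaces
    "Sbjct:192 MAINSG 245",    # no space
]

strict_hits = [bool(strict.match(line)) for line in lines]
relaxed_hits = [bool(relaxed.match(line)) for line in lines]
start_positions = [int(relaxed.match(line).group(1)) for line in lines]
```

The strict pattern accepts only the first line, while the relaxed one
accepts all three and still captures the same start position.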

-Frank Gibbons (fgibbons at hms.harvard.edu)
-------------------------------------

Reference:  Kent, WJ. (2002) BLAT - The BLAST-like alignment tool

Query=  NCU00001
        (54 letters)

Database:  all_proteins.fasta
           293697 sequences; 128,064,135 total letters

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

MGG_10872.5                                                           101  1e-21


>MGG_10872.5
          Length = 245

 Score = 101 bits (260), Expect = 1e-21
 Identities = 54/54 (100%), Positives = 54/54 (100%), Gaps = 0/54 (0%)

Query:   1 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 54
           MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL
Sbjct: 192 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 245

  Database: all_proteins.fasta
BLASTP 2.2.4 [blat]


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Tue Nov 14 22:03:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 17:03:40 -0500
Subject: [Biopython-dev] [Bug 2143] Error parsing BLAT output (using
	out=blast format)
In-Reply-To: <bug-2143-42@http.bugzilla.open-bio.org/>
Message-ID: <200611142203.kAEM3eu3014395@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2143


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2006-11-14 17:03 -------
Fixed in CVS, thanks.




From chris.lasher at gmail.com  Wed Nov 15 00:51:26 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 14 Nov 2006 19:51:26 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4559FA92.8070408@burnham.org>
References: <45425925.8090607@maubp.freeserve.co.uk>
	<45487277.6080308@maubp.freeserve.co.uk>
	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
	<4549E95A.6080605@maubp.freeserve.co.uk>
	<454AAD1F.5050006@c2b2.columbia.edu>
	<454B2C81.9090309@maubp.freeserve.co.uk>
	<45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org>
	<4559127E.3050109@maubp.freeserve.co.uk>
	<4559FA92.8070408@burnham.org>
Message-ID: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>

Just pitching in again, I agree with Michiel with regards to the list
of functions necessary. To restate, these would be:

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

I also think there's wisdom to Michiel's statement it's easier to add
functionality than it is to remove it.

I agree with Iddo on his arguments against dealing with filename
extensions. Upon reflection, however, I feel comfortable with a
lookahead-based file-format guesser for the sake of convenience and as
a matter of compromise to those who are not keen on being explicit in
regards to every detail. It's been stated that bio file formats are
quite distinct. I tried to think of a counterexample but failed.

Finally, to reply to Michiel's question on release, it does seem once
SeqIO is solidified this would certainly be worthy of a new release.
SeqIO is a big step in a good direction for BioPython.

Chris


From biopython-dev at maubp.freeserve.co.uk  Wed Nov 15 12:52:58 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 15 Nov 2006 12:52:58 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>
References: <45425925.8090607@maubp.freeserve.co.uk>	<45487277.6080308@maubp.freeserve.co.uk>	<128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>	<4549E95A.6080605@maubp.freeserve.co.uk>	<454AAD1F.5050006@c2b2.columbia.edu>	<454B2C81.9090309@maubp.freeserve.co.uk>	<45541C21.6080402@c2b2.columbia.edu>
	<45542A89.6050202@burnham.org>	<4559127E.3050109@maubp.freeserve.co.uk>	<4559FA92.8070408@burnham.org>
	<128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com>
Message-ID: <455B0DAA.9040000@maubp.freeserve.co.uk>

Chris Lasher wrote:
> Just pitching in again, I agree with Michiel with regards to the list
> of functions necessary. To restate, these would be:

On Monday I switched from the "2" pun names to "To" giving the following:

(*) FileToSequenceIterator, previously File2SequenceIterator
     File to SeqRecord iterator

(*) SequencesToDict, previously SequenceIter2Dict
     SeqRecord iterator/list to dictionary

(*) SequencesToAlignment, previously Iter2Alignment
     SeqRecord iterator/list to alignment

(*) SequencesToFile, previously Sequences2File
     Write SeqRecord iterator/list to a file

I agree that these are all important "core functions".

> I also think there's wisdom to Michiel's statement it's easier to add
> functionality than it is to remove it.

Very true.  On that note...

We also currently have three "convenience functions", which seem
scheduled for removal based on these discussions.  Unless anyone speaks
up for these three, I'll remove them (and update the Wiki to match):

(*) FileToSequenceList previously called File2SequenceList
(*) FileToSequenceDict previously called File2SequenceDict
(*) FileToAlignment    previously called File2Alignment

These simply wrap FileToSequenceIterator with the list, SequencesToDict
or SequencesToAlignment function.
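
To show how thin such wrappers are, here is a self-contained Python 3
sketch. None of it is the Bio.SeqIO code: fasta_iterator, sequences_to_dict,
file_to_list and file_to_dict are hypothetical stand-ins, records are plain
(identifier, sequence) tuples rather than SeqRecord objects, and the
duplicate-key exception follows the preference stated earlier in the thread.

```python
import io


def fasta_iterator(handle):
    """Yield (identifier, sequence) tuples from a FASTA handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first word of the title line as the identifier.
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def sequences_to_dict(records, key_function=lambda record: record[0]):
    """Build a dictionary from records, refusing duplicate keys."""
    answer = {}
    for record in records:
        key = key_function(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % key)
        answer[key] = record
    return answer


def file_to_list(handle):
    # The "convenience function" is a single call around the iterator.
    return list(fasta_iterator(handle))


def file_to_dict(handle):
    return sequences_to_dict(fasta_iterator(handle))


example = ">a first\nAC\nGT\n>b\nTT\n"
as_list = file_to_list(io.StringIO(example))
as_dict = file_to_dict(io.StringIO(example))
```

Since each wrapper is a single line, users who want a list or dictionary can
write the composition themselves once the core functions exist.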

> I agree with Iddo on his arguments against dealing with filename
> extensions. Upon reflection, however, I feel comfortable with a
> lookahead-based file-format guesser for the sake of convenience and as
> a matter of compromise to those who are not keen on being explicit in
> regards to every detail. It's been stated that bio file formats are
> quite distinct. I tried to think of a counterexample but failed.

I would say telling EMBL and Swiss (aka SwissProt aka UniProt) apart is
tricky.  They both start with an "ID ..." line and finish with "//"; the
feature table format is the big difference.

If we did try guessing file formats by looking at the file contents, I
would not try and guess every file format which Bio.SeqIO could read -
just those which are easily identifiable.  In this case, I would be
inclined not to try and tell EMBL and SwissProt apart, and simply abort
with "Unrecognised format".
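
That policy could be sketched as follows in Python 3. The function name, the
prefix tables and the sample lines are all made up for illustration; the
only point is that an "ID" first line is refused rather than guessed.

```python
# Hypothetical prefixes for formats that are easy to identify.
EASY_PREFIXES = {">": "fasta", "LOCUS": "genbank", "CLUSTAL": "clustal"}

# First-line prefixes shared by more than one format (EMBL and SwissProt).
AMBIGUOUS_PREFIXES = ("ID ",)


def guess_from_first_line(first_line):
    """Guess only the easy formats; abort on anything doubtful."""
    if first_line.startswith(AMBIGUOUS_PREFIXES):
        # Deliberately no attempt to tell EMBL and SwissProt apart.
        raise ValueError("Unrecognised format")
    for prefix, name in EASY_PREFIXES.items():
        if first_line.startswith(prefix):
            return name
    raise ValueError("Unrecognised format")


easy = guess_from_first_line(">made_up_identifier some description")

try:
    guess_from_first_line("ID   MADE_UP_ENTRY   placeholder text")
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```

A caller who gets this exception simply has to name the format explicitly.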

Peter


From biopython-dev at maubp.freeserve.co.uk  Tue Nov 28 13:24:35 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 28 Nov 2006 13:24:35 +0000
Subject: [Biopython-dev] [BioPython] Problems with Win Release for
 Python 2.5: Numeric, KDTree
In-Reply-To: <005301c7129b$f3222300$b400a8c0@Sirius>
References: <005301c7129b$f3222300$b400a8c0@Sirius>
Message-ID: <456C3893.6060402@maubp.freeserve.co.uk>

Hendrik Weisser wrote:
> The main question for me is whether these issues (the 2nd, mostly) can be 
> addressed quickly, or whether it is recommended to use the "old" Python 2.4 
> and corresponding packages for the time being. Can anyone help me with that?

Yes - assuming you don't have all the compilers and stuff to compile
your own libraries (and therefore need to use the Windows installers),
using Windows with Python 2.4 and Numeric 24.2 with BioPython 1.42
should be fine.

Personally I use Python 2.4 on Linux (as shipped with the distribution)
and Python 2.3 on my Windows machine.

Both work fine with BioPython and Numeric - although I have not used
Bio.PDB very much.

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Nov 29 19:03:08 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Nov 2006 14:03:08 -0500
Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails
	with blastall 2.2.14
In-Reply-To: <bug-2090-42@http.bugzilla.open-bio.org/>
Message-ID: <200611291903.kATJ38DJ007489@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2090


------- Comment #1 from grunberg at embl.de  2006-11-29 14:03 -------
Things get worse with the current blastall 2.2.15. _scan_parameters in
NCBIStandalone.py expects "Number of HSP's better" which in the later blastall
versions has changed to: "Number of sequences better".  
This prevents the parser from fetching the next two lines even though they
would be there and then we get exceptions etc. 

Another independent problem occurs further down -- The lines::
  T: 11
  A: 40
have now changed to::
  Neighboring words threshold: 11
  Window for multiple hits: 40
and again we run into an exception. Both problems also occur in the latest CVS
snapshot.

Both can be fixed with some additional attempt_read_and_call calls, but I am not
sure whether my quick and dirty fixes follow the right spirit...

change A:
---------
INSERT BEFORE...::
        # not in blastx 2.2.1
        attempt_read_and_call(uhandle, consumer.query_length,
                              has_re=re.compile(r"[Ll]ength of query"))
...These two statements::

        # in blastall 2.2.15
        attempt_read_and_call(uhandle, consumer.noevent,
                                 start="Number of HSP's gapped:")

        attempt_read_and_call(uhandle, consumer.noevent,
                          start="Number of HSP's successfully")

Change B:
---------
REPLACE::
        # not in BLASTN 2.2.9
        attempt_read_and_call(uhandle, consumer.threshold, start='T')
        read_and_call(uhandle, consumer.window_size, start='A')

BY::
        # not in BLASTN 2.2.9
        attempt_read_and_call(uhandle, consumer.threshold, start='T')
        attempt_read_and_call(uhandle, consumer.window_size, start='A')
        ## renamed in BLASTALL 2.2.15
        attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring')
        attempt_read_and_call(uhandle, consumer.window_size, start='Window')

Could someone with more Biopython experience please validate and apply the fix?
THX!
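
For anyone validating the change, the difference between the two helpers can
be illustrated with a simplified, self-contained Python 3 sketch. These are
NOT Biopython's implementations: LineReader, parse_footer and the function
bodies are hypothetical stand-ins that only mimic "must match" versus "may
match" on the next line, using a start= prefix as in change B.

```python
class LineReader:
    """Minimal line source with one-line lookahead (hypothetical helper)."""

    def __init__(self, lines):
        self._lines = list(lines)
        self._index = 0

    def peekline(self):
        if self._index < len(self._lines):
            return self._lines[self._index]
        return ""

    def readline(self):
        line = self.peekline()
        if line:
            self._index += 1
        return line


def attempt_read_and_call(reader, method, start):
    """Consume the next line only if it starts with `start`."""
    if not reader.peekline().startswith(start):
        return False
    method(reader.readline())
    return True


def read_and_call(reader, method, start):
    """Like attempt_read_and_call, but a mismatch is an error."""
    if not attempt_read_and_call(reader, method, start):
        raise ValueError("Expected a line starting with %r" % start)


def parse_footer(lines):
    """Accept both the old and the renamed parameter lines."""
    reader = LineReader(lines)
    found = {}

    def threshold(line):
        found["threshold"] = int(line.split(":")[1])

    def window_size(line):
        found["window_size"] = int(line.split(":")[1])

    attempt_read_and_call(reader, threshold, start="T")
    attempt_read_and_call(reader, window_size, start="A")
    attempt_read_and_call(reader, threshold, start="Neighboring")
    attempt_read_and_call(reader, window_size, start="Window")
    return found


old_style = parse_footer(["T: 11", "A: 40"])
new_style = parse_footer(
    ["Neighboring words threshold: 11", "Window for multiple hits: 40"]
)
```

In this sketch the optional calls let either spelling through, whereas a
mandatory read_and_call with start='A' would fail on the renamed lines.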

