From mdehoon at c2b2.columbia.edu Wed Nov 1 00:58:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 01 Nov 2006 00:58:41 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4545D9F1.2040902@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk>
Message-ID: <45483791.7070803@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> With such a speed up, I'd guess you were using Bio.Fasta before.

Yes I was. I just went to the Biopython tutorial and used the stuff in
section 2.4. I didn't expect it to be *that* slow.

> I've noticed the same thing. Are you dealing with NCBI style fasta
> identifiers made up of several fields separated by "|" characters?

Yep.

>> For duplicate keys, there are at least four possibilities (raise an
>> exception, store only one of the keys, store neither of the keys
>> and don't raise an exception, store both after modifying one of the
>> keys). So this should also be an option.
>
> Supporting all these options with an easy to understand interface
> looks too hard.
>
> In my opinion if someone is trying to build a dictionary using
> repeated keys they have made a mistake (either in their datafile, or
> their record2key function) - so raising an exception is reasonable
> default behaviour (and is easy to code).

You're probably right. I'm fine with raising an exception.

>> In the File2SequenceDict above, answer[key] contains the complete
>> record. Some people will want that. However, in my application I
>> only want to store the record.seq part in answer[key]. Somebody
>> else may want str(record.seq). So we'd also need a record2value
>> argument.
>
> It does slightly undermine the "you only get SeqRecord objects"
> principle. On the other hand, it's a simple addition that is easy to
> explain and implement. I'm happy to add this.

The point I was trying to make is that for a File2SequenceDict function
to be useful, it would end up being too complex.

In the answer above, a user could also do answer[key].seq to get the
part she wants, so maybe a record2value argument is not essential in
practice.

Part of my opposition against the File2SequenceDict function is that it
requires the parser to be called File2SequenceIterator (which I don't
like as a name, but more about that some other time), which then leads
to a File2SequenceList function, which is software bloat.

So, how about making the functionality of File2SequenceDict available as
a todict() method to the iterator object returned by
File2SequenceIterator, or, as an iterator2dict function?

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Wed Nov 1 05:09:59 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 01 Nov 2006 10:09:59 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45483791.7070803@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu>
Message-ID: <45487277.6080308@maubp.freeserve.co.uk>

> The point I was trying to make is that for a File2SequenceDict
> function to be useful, it would end up being too complex.

Of course I'm going to be biased here, but I do find the simple current
dictionary construction useful as it is. Clearly we have slightly
different uses in mind (which is good - the design should try and cater
to most people).

> In the answer above, a user could also do answer[key].seq to get the
> part she wants, so maybe a record2value argument is not essential in
> practice.
>
> Part of my opposition against the File2SequenceDict function is that
> it requires the parser to be called File2SequenceIterator (which I
> don't like as a name, but more about that some other time), which
> then leads to a File2SequenceList function, which is software bloat.
>
> So, how about making the functionality of File2SequenceDict available
> as a todict() method to the iterator object returned by
> File2SequenceIterator, or, as an iterator2dict function?

I do like your first suggestion - the idea of adding a todict() method
to the iterator objects. However, that would require that all the
parsers be written as (sub)classes, and right now several of them are
written as generator functions.

I've found using generator functions to be very simple, and easy to
understand. They seem like a good choice for simple file formats. But
with a good enough reason, I could turn them into classes.

----

Right now I am making both "file to dict" and "iterator to dict"
functions available:

File2SequenceDict(..., record2key) is implemented as
SequenceIter2Dict(File2SequenceIterator(...), record2key)

Also:
File2Alignment(...) is implemented as
Iter2Alignment(File2SequenceIterator(...))

And:
File2SequenceList(...) is implemented as list(File2SequenceIterator(...))

Leaving aside the names (which I notice are not currently consistent) I
would be fine with removing File2SequenceList, File2SequenceDict, and
File2Alignment but retaining the two functions which convert from a
SeqRecord returning iterator into dict or an alignment.

How does that sound Michiel (subject to agreeing on names)?

Peter

From bugzilla-daemon at portal.open-bio.org Wed Nov 1 17:50:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Nov 2006 17:50:46 -0500
Subject: [Biopython-dev] [Bug 2131] New: SProt.py fails to parse the current Swiss-Prot version 51.0
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

           Summary: SProt.py fails to parse the current Swiss-Prot version 51.0
           Product: Biopython
           Version: 1.24
          Platform: Macintosh
        OS/Version: MacOS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: Biosql at hotmail.com

Hi,

I'm running on a mac OS 10.4, python 2.5 and tried to parse the
Swiss-Prot .dat file with the latest SProt.py version and get this :

Traceback (most recent call last):
  File "Parser_SProt_to_DB.py", line 37, in
    cur_record = s_iterator.next()
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 166, in next
    return self._parser.parse(File.StringHandle(data))
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 290, in parse
    self._scanner.feed(handle, self._consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 332, in feed
    self._scan_record(uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 337, in _scan_record
    fn(self, uhandle, consumer)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 369, in _scan_id
    self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 359, in _scan_line
    read_and_call(uhandle, event_fn, start=line_type)
  File "/sw/lib/python2.5/site-packages/Bio/ParserSupport.py", line 301, in read_and_call
    method(line)
  File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 526, in identification
    self.data.sequence_length = int(cols[4])
ValueError: invalid literal for int() with base 10: 'AA.'

Any clue ?

Thanks !

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From chris.lasher at gmail.com Wed Nov 1 22:49:04 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Wed, 1 Nov 2006 22:49:04 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>

I'd like to pitch in a few comments here.

Peter wrote:
> One point against names like File2SequenceIterator is the pun on two
> versus to (i.e. convert) will not be so obvious to non-native English
> speakers.

I'd like to second that. It's cute, sure, but FileToSequenceIterator
isn't that much more difficult, and leaves no room for confusion. (e.g.,
Where's the File1SequenceIterator?)

Michiel wrote:
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.

Yikes! Are you serious? Why not make it easier and require a file-like
object? I would definitely not be for it taking a plain string. This
seems implicit rather than explicit. "Takes a file... or a file-like
object... or a string containing a filename... or just a string
containing the file contents... or a brief description of the data
that's in your file... or a bunch of smiley emoticons, if you're in a
good mood..." File-like objects are testable and leave little room for
surprise. Anything else seems like it's asking for a headache.

Which brings me to the issue of "guessing" a file's format. Yikes,
again! I'd expect that kind of "magickery" from Perl, but once again,
explicit is better than implicit. I honestly think it's not too much to
expect the user to know what filetype they're expecting BioPython to
deal with. Could you guys please explain the motivation behind this to
me? As I see it right now, the last thing I want is BioPython
incorrectly guessing my file format, and particularly, assuming that I
have put the proper extension to represent the file format. The unified
sequence object is what's beautiful about SeqIO, but the guesswork that
you are discussing having SeqIO's classes do is scary, to me.

And I think by now it's predictable that I'm a fan of Peter's suggestion
to have an exception raised upon the attempt to create a dictionary with
identical IDs; all other options are, again, too implicit for my tastes.

Thanks very much for developing SeqIO and discussing it so much, guys. I
think this will be a fantastic asset to BioPython! Keep on rockin' it!

Chris

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 06:29:38 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 06:29:38 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0
In-Reply-To:
Message-ID: <200611021129.kA2BTcOX010117@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 06:29 -------
Hi Jonathan,

What version of BioPython are you using? I know that bugzilla needs
updating to include more up to date version numbers, but you aren't
really using BioPython 1.24 are you?

Currently the latest release is 1.42, and this does include some updates
for SProt, e.g. bug 1948. There is also a more recent fix in CVS for bug
2043 dealing with new style RX lines.

Could you tell us which SProt file you are using (a URL would be fine).
If there are many that fail the same way, and you have a small example
input file, you could even attach it to this bug.

Peter

From biopython-dev at maubp.freeserve.co.uk Thu Nov 2 07:49:30 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 02 Nov 2006 12:49:30 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com>
Message-ID: <4549E95A.6080605@maubp.freeserve.co.uk>

Chris Lasher wrote:
> I'd like to pitch in a few comments here.
>
> Peter wrote:
>> One point against names like File2SequenceIterator is the pun on
>> two versus to (i.e. convert) will not be so obvious to non-native
>> English speakers.
>
> I'd like to second that. It's cute, sure, but FileToSequenceIterator
> isn't that much more difficult, and leaves no room for confusion.
> (e.g., Where's the File1SequenceIterator?)

I would be happy with FileToSequenceIterator, or even
FileToSequenceIter. FileToSeqIter is shorter but we don't actually
return Seq objects so I would avoid that.

Does anyone else have any suggestions?

> Michiel wrote:
>> I like the idea of one argument that takes a file name or handle. I
>> believe that that is how other Biopython functions work.

I've had a little look, and the only case I found is the recent
Bio.Nexus parser - and this choked on a StringIO handle on my machine
(fix checked in).

Chris Lasher wrote:
> Yikes! Are you serious? Why not make it easier and require a
> file-like object? I would definitely not be for it taking a plain
> string. This seems implicit rather than explicit. "Takes a file... or
> a file-like object... or a string containing a filename... or just a
> string containing the file contents... or a brief description of the
> data that's in your file... or a bunch of smiley emoticons, if
> you're in a good mood..." File-like objects are testable and leave
> little room for surprise. Anything else seems like it's asking for a
> headache.

Trying to distinguish between an (invalid) filename and the contents of
a sequence file is just too much to ask - more a migraine than a
headache.

As an experiment, I've implemented (but not checked in) automatic
handle/filename detection. It seems to work (but I have not yet tried
exotic arguments like file names in Unicode, or random classes with a
__str__ method). Still it's messy.

While it does sound like a nice idea for the end user, the idea of
filenames and handles is pretty important in python, and maybe we
shouldn't worry about forcing newcomers to deal with handles. After all,
the SeqIO system will make them deal with iterators and SeqRecords which
I think are far more complicated!

What do you think Michiel?

Chris Lasher wrote:
> Which brings me to the issue of "guessing" a file's format. Yikes,
> again! I'd expect that kind of "magickery" from Perl, but once again,
> explicit is better than implicit. I honestly think it's not too much
> to expect the user to know what filetype they're expecting BioPython
> to deal with. Could you guys please explain the motivation behind
> this to me? As I see it right now, the last thing I want is BioPython
> incorrectly guessing my file format, and particularly, assuming that
> I have put the proper extension to represent the file format. The
> unified sequence object is what's beautiful about SeqIO, but the
> guesswork that you are discussing having SeqIO's classes do is scary,
> to me.

For comparison this quote is from the BioPerl SeqIO How-To:

>> [BioPerl's] SeqIO can try to guess based on known file extensions
>> or content, ... it is a good idea to get into the practice of
>> always specifying the format.

I want to stress that as written, the user can specify the file format
to the File2SequenceIterator function (and its variants). Maybe we
should encourage people to explicitly supply the format in any Bio.SeqIO
documentation....

You asked about motivation for guessing the file format. I break that
down into guessing the file format based on the file extension, or based
on the file's contents (see later).

I personally am perfectly happy with using a file extension to file
format mapping. Maybe this reflects my computing background (more
DOS/Windows background than Unix/Linux).

Note that if the format is not specified, and the file extension is not
on the known list (e.g. "txt" or "data" which could be anything) then
the call to File2SequenceIterator function (or its variants) will fail
with an invalid format message/exception.

Assuming we don't make the format a required argument, and we keep the
extension to format mappings, then I should make a point of including
deliberate mismatches in the test suites - and check that they abort
with a SyntaxError.

Regarding guessing the format based on file contents:

For some applications, having a format guesser built into BioPython
might actually be very useful - the example given on the BioPerl website
is the back end of a web tool that took sequence input, where maybe you
can't trust the actual end user to know exactly what file format their
data is in.

Doing this for some file formats isn't too hard, often all you need to
see is the first line. For other file formats it's very tricky and best
not attempted.

But, is partial guess support even worth implementing - especially as it
may be less than perfect and get it wrong sometimes?

I think Michiel and I were happy to leave this question for later...

Chris Lasher wrote:
> And I think by now it's predictable that I'm a fan of Peter's
> suggestion to have an exception raised upon the attempt to create a
> dictionary with identical IDs; all other options are, again, too
> implicit for my tastes.

Good. Michiel agreed in another email:
>>
>> You're probably right. I'm fine with raising an exception.
>>

Have you been following the rest of that SeqRecord dictionary discussion
Chris?

> Thanks very much for developing SeqIO and discussing it so much,
> guys. I think this will be a fantastic asset to BioPython! Keep on
> rockin' it!
>
> Chris

Thank you for your passionate feedback :)

Peter

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 12:38:26 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 12:38:26 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0
In-Reply-To:
Message-ID: <200611021738.kA2HcQKH017740@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

Biosql at hotmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.24                        |Not Applicable

------- Comment #2 from Biosql at hotmail.com 2006-11-02 12:38 -------
I'm using the latest version of Biopython 1.42 with the latest version
of Sprot.py from the CVS.

I used the Swiss-Prot file version 51 coming from here :
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

I also tried to parse this file on a PC with python 2.4.3 and the latest
biopython version and got the same result.

Jonathan

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 13:07:03 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:07:03 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0
In-Reply-To:
Message-ID: <200611021807.kA2I73W3021327@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 13:07 -------
Created an attachment (id=491)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=491&action=view)
First four records from uniprot_sprot.dat.gz release 51

I was hoping for a smaller test case, uniprot_sprot.dat.gz is 185MB
compressed, and 836MB as plain text!

Anyway, I have extracted and attached a file with just the first four
records in it for anyone interested in testing.

I would guess from your stack trace that this recent change to the ID
line has caused the trouble:

http://ca.expasy.org/sprot/relnotes/sp_news.html#rel9.0

Old (with MoleculeType):
    ID EntryName DataClass; MoleculeType; SequenceLength.

New (without MoleculeType):
    ID EntryName DataClass; SequenceLength.

e.g.
    ID CYC_PIG Reviewed; 104 AA.
    ID Q3ASY8_CHLCH Unreviewed; 36805 AA.

This shouldn't be too hard to fix...
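
The ID line change described in comment 3 can be illustrated with a short sketch. This is not the fix that went into SProt.py; `parse_id_line` is a hypothetical helper, and the old-style example line (with `STANDARD;` and `PRT;`) is an assumed illustration of the "with MoleculeType" layout. It only shows why indexing a fixed column breaks once the MoleculeType field disappears, which is consistent with `int(cols[4])` receiving 'AA.' in the traceback.

```python
def parse_id_line(line):
    """Parse a Swiss-Prot ID line in either the old or the new layout.

    Hypothetical helper for illustration only, not Biopython API.
    Returns (entry_name, data_class, molecule_type, sequence_length);
    molecule_type is None for new-style lines.
    """
    fields = line.split()
    if not fields or fields[0] != "ID":
        raise ValueError("Not an ID line: %r" % line)
    cols = fields[1:]
    if len(cols) == 5:
        # Old: EntryName DataClass; MoleculeType; SequenceLength AA.
        name, data_class, molecule_type, length, unit = cols
        molecule_type = molecule_type.rstrip(";")
    elif len(cols) == 4:
        # New: EntryName DataClass; SequenceLength AA.
        name, data_class, length, unit = cols
        molecule_type = None
    else:
        raise ValueError("Unexpected number of fields in ID line: %r" % line)
    if unit != "AA.":
        raise ValueError("Expected the line to end with 'AA.': %r" % line)
    return name, data_class.rstrip(";"), molecule_type, int(length)


# New-style line taken from the bug comment.
new_style = parse_id_line("ID   CYC_PIG   Reviewed;   104 AA.")
# Old-style line: an assumed example of the layout with MoleculeType.
old_style = parse_id_line("ID   CYC_PIG   STANDARD;   PRT;   104 AA.")
```

Counting fields rather than reading a fixed index lets one function accept both layouts, at the cost of rejecting any line that fits neither.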

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 13:41:46 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 13:41:46 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0
In-Reply-To:
Message-ID: <200611021841.kA2Ifkpg025233@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 13:41 -------
Fix checked into CVS, please reopen the bug if you run into problems.

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython

file Bio/SwissProt/SProt.py Revision 1.34 made 2nd Nov 2006

This is the test script I used with the example file from comment 3
attachment 491

    from Bio.SwissProt import SProt

    #Works
    rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.SequenceParser())
    for record in rec_iter :
        print record.id
        print record.seq

    #Failed
    rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.RecordParser())
    for record in rec_iter :
        print record

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 14:16:53 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 14:16:53 -0500
Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0
In-Reply-To:
Message-ID: <200611021916.kA2JGrGY028566@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2131

------- Comment #5 from Biosql at hotmail.com 2006-11-02 14:16 -------
Thank you Peter !

So fast and so good.

Jonathan

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 16:27:25 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:27:25 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken)
In-Reply-To:
Message-ID: <200611022127.kA2LRPkN009879@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2043

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 16:27 -------
It seems to be working from the small amount of testing I did on another
Swiss-Prot bug. Marking as fixed - please reopen if needed.

From bugzilla-daemon at portal.open-bio.org Thu Nov 2 16:38:01 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:38:01 -0500
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200611022138.kA2Lc1Zi010834@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 16:38 -------
While working on bug 2059, I've been tempted to make some similar
changes to Marc's suggestions about handling the SeqRecord's
id/name/description, and the addition of an "add SeqRecord" method.

I like the idea of adding a method to iterate over the sequences. How
about something a little simpler (which I haven't tested yet):

    def __iter__(self):
        """Iterate over the SeqRecord objects making up the alignment"""
        return iter(self._records)

i.e. Use the fact that self._records is a list, and will support
iteration itself. This avoids having to keep track of the current
iteration position in our own next method.

Also, would anyone else like to be able to iterate over the columns?
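
A minimal sketch of the `__iter__` idea from comment 2, together with one possible way to iterate over columns. `SimpleAlignment` and its `columns()` method are hypothetical names for illustration, not the Align.Generic API, and plain strings stand in for SeqRecord objects; rows are assumed to have equal length.

```python
class SimpleAlignment(object):
    """Toy alignment holding its rows in a list (stand-in for Align.Generic)."""

    def __init__(self, records):
        # Keep the rows in a plain list, as self._records is in the comment.
        self._records = list(records)

    def __iter__(self):
        """Iterate over the rows by delegating to the list's own iterator."""
        return iter(self._records)

    def columns(self):
        """Yield each alignment column as a string (hypothetical method)."""
        rows = [str(record) for record in self._records]
        # zip(*rows) transposes the rows; it stops at the shortest row,
        # so equal-length rows are assumed here.
        for column in zip(*rows):
            yield "".join(column)


alignment = SimpleAlignment(["ACG", "A-G"])
rows = list(alignment)
cols = list(alignment.columns())
```

Because `iter()` returns a fresh list iterator on each call, two loops over the same alignment do not interfere with each other, which is the bookkeeping a hand-written next method would otherwise need.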

From mdehoon at c2b2.columbia.edu Thu Nov 2 21:20:23 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:20:23 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk>
Message-ID: <454AA767.9030506@c2b2.columbia.edu>

Peter wrote:
> Right now I am making both "file to dict" and "iterator to dict"
> functions available:
>
> File2SequenceDict(..., record2key) is implemented as
> SequenceIter2Dict(File2SequenceIterator(...), record2key)
>
> Also:
> File2Alignment(...) is implemented as
> Iter2Alignment(File2SequenceIterator(...))
>
> And:
> File2SequenceList(...) is implemented as list(File2SequenceIterator(...))
>
> Leaving aside the names (which I notice are not currently consistent) I
> would be fine with removing File2SequenceList, File2SequenceDict, and
> File2Alignment but retaining the two functions which convert from a
> SeqRecord returning iterator into dict or an alignment.
>
> How does that sound Michiel (subject to agreeing on names)?

That sounds good to me.

--Michiel.
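
The "iterator to dict" conversion agreed on above, with an exception on duplicate keys, might look roughly like the following. This is a sketch under stated assumptions, not the SequenceIter2Dict code: `records_to_dict` and the `Record` stand-in are hypothetical names, the default key (`record.id`) is assumed, and ValueError is just one reasonable choice of exception.

```python
from collections import namedtuple

# Minimal stand-in for a SeqRecord, carrying only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None):
    """Build a dict from any iterable of records, refusing duplicate keys.

    record2key maps a record to its key; by default record.id is used.
    """
    if record2key is None:
        record2key = lambda record: record.id
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # A repeated key most likely means a mistake in the data file
            # or in record2key, so stop rather than silently overwrite.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


# NCBI style identifiers: use the fourth "|" separated field as the key.
records = [Record("gi|1|ref|NP_1|", "ACGT"), Record("gi|2|ref|NP_2|", "GGCC")]
by_accession = records_to_dict(iter(records), lambda r: r.id.split("|")[3])
```

Since the whole record is stored, a caller who only needs the sequence can write `by_accession["NP_1"].seq`, which is the argument made earlier in the thread for leaving out a record2value parameter.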

From mdehoon at c2b2.columbia.edu Thu Nov 2 21:44:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 02 Nov 2006 21:44:47 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4549E95A.6080605@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk>
Message-ID: <454AAD1F.5050006@c2b2.columbia.edu>

Peter wrote:
> Chris Lasher wrote:
>> Peter wrote:
>>> One point against names like File2SequenceIterator is the pun on
>>> two versus to (i.e. convert) will not be so obvious to non-native
>>> English speakers.
>> I'd like to second that. It's cute, sure, but FileToSequenceIterator
>> isn't that much more difficult, and leaves no room for confusion.
>> (e.g., Where's the File1SequenceIterator?)
>
> I would be happy with FileToSequenceIterator, or even
> FileToSequenceIter. FileToSeqIter is shorter but we don't actually
> return Seq objects so I would avoid that.
>
> Does anyone else have any suggestions?

Yes, but let's discuss function names after we decide which functions we
want.

> While it does sound like a nice idea for the end user, the idea of
> filenames and handles is pretty important in python, and maybe we
> shouldn't worry about forcing newcomers to deal with handles. After all,
> the SeqIO system will make them deal with iterators and SeqRecords which
> I think are far more complicated!
>
> What do you think Michiel?

My preferred solution would be for File2SequenceIterator to take handles
only.
Same as Bio.Blast:

    from Bio.Blast import NCBIXML

    blast_out = open('my_blast.out')
    b_parser = NCBIXML.BlastParser()
    b_record = b_parser.parse(blast_out)

> Chris Lasher wrote:
>> Which brings me to the issue of "guessing" a file's format. Yikes,
>> again! I'd expect that kind of "magickery" from Perl, but once again,
>> explicit is better than implicit. I honestly think it's not too much
>> to expect the user to know what filetype they're expecting BioPython
>> to deal with. Could you guys please explain the motivation behind
>> this to me?
>......
>
> I think Michiel and I were happy to leave this question for later...
>

I am leaning towards Chris' opinion. File type guessing (from extension
or file contents) doesn't seem really necessary. At least, I don't
remember a user asking for it. The benefits of file type guessing from
the extension are minimal (since a user can probably do that more
reliably himself, knowing the file names he's likely to encounter). And
since file type guessing will not be foolproof, it may even be
confusing. Once file type guessing is available in Biopython though,
we're committed to it and we'll have to support it. So I'd be happier
without the file type guessing functionality.

That said, if somebody really wants it, I can live with it.

--Michiel.
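
To make the "handles only" preference concrete, here is a minimal sketch of a parser written as a generator function that accepts any file-like object. `fasta_iterator` is a hypothetical name and a deliberately simplified FASTA reader, not the Bio.SeqIO implementation; it yields (title, sequence) tuples rather than SeqRecord objects.

```python
from io import StringIO


def fasta_iterator(handle):
    """Yield (title, sequence) tuples from a FASTA-like handle.

    Simplified sketch: any text before the first ">" line is ignored.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                # A new title line closes the previous record.
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        # Emit the final record once the handle is exhausted.
        yield title, "".join(chunks)


# An in-memory handle works the same way as an open file, so no
# temporary file is needed for testing.
handle = StringIO(">a desc\nACG\nT\n>b\nGG\n")
parsed = list(fasta_iterator(handle))
```

Because the function never opens anything itself, the caller decides whether the data comes from a file on disk, a network stream, or a string in a test.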

From biopython-dev at maubp.freeserve.co.uk Fri Nov 3 06:48:17 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 03 Nov 2006 11:48:17 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454AAD1F.5050006@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu>
Message-ID: <454B2C81.9090309@maubp.freeserve.co.uk>

My apologies for this somewhat long email.

Handles and Filenames
=====================

Currently the individual format specific iterators just require a handle
(and not a filename). Are we all happy with this?

Michiel de Hoon wrote:
>> While it does sound like a nice idea for the end user, the idea of
>> filenames and handles is pretty important in python, and maybe we
>> shouldn't worry about forcing newcomers to deal with handles. After
>> all, the SeqIO system will make them deal with iterators and
>> SeqRecords which I think are far more complicated!
>>
>> What do you think Michiel?
>
> My preferred solution would be for File2SequenceIterator to take
> handles only.

Assuming we keep the non-ambiguous file extension to file format
mappings, allowing a filename as a possible argument to
File2SequenceIterator (and any variants) makes good sense.

Note that most handle objects have a "name" attribute to get the
filename, which could be used to determine the file extension. i.e. We
can still do the file extension to file format mapping using just a file
handle (instead of a filename).
Currently File2SequenceIterator has separate named arguments for a handle, filename and format. If no handle is provided, it will open one using the filename provided.

We could make the handle and format the first arguments as a compromise?

If we drop the extension to file format mapping (see below), then I agree File2SequenceIterator could just expect a handle and not a filename.

Guessing File Formats
=====================

>> Chris Lasher wrote:
>>> Which brings me to the issue of "guessing" a file's format.
>>> Yikes, again! I'd expect that kind of "magickery" from Perl, but
>>> once again, explicit is better than implicit. I honestly think
>>> it's not too much to expect the user to know what filetype
>>> they're expecting BioPython to deal with. Could you guys please
>>> explain the motivation behind this to me?

Michiel de Hoon wrote:
> I am leaning towards Chris' opinion. File type guessing (from
> extension or file contents) doesn't seem really necessary. At least,
> I don't remember a user asking for it. The benefits of file type
> guessing from the extension are minimal (since a user can probably do
> that more reliably himself, knowing the file names he's likely to
> encounter). And since file type guessing will not be foolproof, it
> may even be confusing. Once file type guessing is available in
> Biopython though, we're committed to it and we'll have to support it.
> So I'd be happier without the file type guessing functionality.
>
> That said, if somebody really wants it, I can live with it.

I agree that we shouldn't implement file format guessing based on the contents of a file (unless, as you say, we get strong feedback wanting it).

I personally want the file extension to format mapping, but then I am fairly disciplined about using file extensions. As I seem to be the only voice advocating this, it looks like I may have to give in...

Is it worth asking on the main discussion list to canvass opinion?
Maybe we should settle on the function names before doing that - it would be better to replace the current function names now, before too many people are used to them.

Functions and Naming
====================

This is where I think things stand for Bio/SeqIO/__init__.py

We have functions to do the following, where "file" may mean just a handle, or perhaps the choice of a handle or filename (see above):

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

Possible names without the digit two: FileToSequenceIterator, SequencesToDict, SequencesToAlignment, and SequencesToFile

I think Michiel wanted to drop the following "wrapper functions" as code bloat:

(*) File to list of SeqRecord objects, currently File2SequenceList
    Just use list(File2SequenceIterator(...)) instead
(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
    Just use SequenceIter2Dict(File2SequenceIterator(...)) instead
(*) File to alignment, currently File2Alignment
    Just use Iter2Alignment(File2SequenceIterator(...))

The reason I invented the above three wrapper functions was so I could do things like this in one line (assuming my files have valid known extensions):

rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align = File2Alignment(filename="demo.sth")

or perhaps:

align = File2Alignment(filename="demo.aln", format="clustal")

The alternative suggestions seem to lead to using file handles and an explicit format, with a second function to convert from an iterator if required.
While this can be done in one line - I find the following much less straightforward:

rec_iter = File2SequenceIterator(open("demo.faa"), "fasta")
rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank"))
rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"), "fasta"))
align = Iter2Alignment(File2SequenceIterator(open("demo.sth"), "stockholm"))

Peter

From sbassi at gmail.com Sat Nov 4 15:48:22 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 4 Nov 2006 17:48:22 -0300
Subject: [Biopython-dev] Microbiology module
Message-ID:

I am working on functions for industrial microbiology, such as: growth rate equations, continuous culture equations, batch culture, yields for different sources of energy (and for fermentation or respiration), oxygen consumption rate, constants, thermodynamic equations used in bioreactors, cell cultures and so on.
Biopython is lacking such a module, but I am not sure if this is out of scope. Is there a chance to include it in Biopython, or is this not useful?
I think this could extend Biopython into a whole new area (bioprocess and microbiology).
Please tell me what the maintainers think about this. If this idea is rejected, I will write ugly and uncommented code for my own use, but if it is accepted, I will write nice, documented code for people to see :)
Best regards,
SB.

--
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318

From sbassi at gmail.com Sun Nov 5 09:49:20 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 5 Nov 2006 11:49:20 -0300
Subject: [Biopython-dev] Microbiology module
In-Reply-To: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
References: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com>
Message-ID:

On 11/5/06, Thomas Hamelryck wrote:
>
> Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right?
> Yes, some methods could be used as a base for systems biology. From thamelry at binf.ku.dk Sun Nov 5 09:33:53 2006 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Sun, 5 Nov 2006 15:33:53 +0100 Subject: [Biopython-dev] Microbiology module In-Reply-To: References: Message-ID: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com> On 11/4/06, Sebastian Bassi wrote: > > I am working in functions for industrial microbiology. Like: > Growth rate equations, Continuous culture equations, batch culture, > yields for different source of energy (and for fermentation or > respiration), oxygen consume rate, constants, thermodynamic equations > used in bioreactors, cell cultures and so on. Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right? Best regards, ---- Thomas Hamelryck, Marie Curie EU-Research fellow Bioinformatics center Institute of Molecular Biology University of Copenhagen Universitetsparken 15 - Building 10 DK-2100 Copenhagen ? Denmark Homepage: http://www.binf.ku.dk/Protein_structure From idoerg at burnham.org Tue Nov 7 12:48:05 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Tue, 07 Nov 2006 09:48:05 -0800 Subject: [Biopython-dev] InterProScan parser? Message-ID: <4550C6D5.10606@burnham.org> Hi, Does anybody have an interproscan parser, by any chance? Preferably for the XML or EBIXML output. Thanks, Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. 
La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From bugzilla-daemon at portal.open-bio.org Wed Nov 8 12:13:05 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:13:05 -0500 Subject: [Biopython-dev] [Bug 2137] New: Install from CVS fails on clistfnsmodule.c compilation Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2137 Summary: Install from CVS fails on clistfnsmodule.c compilation Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: chris.lasher at gmail.com On November 7, 2006, I did a fresh checkout of BioPython from the CVS repository. Attempts to build/install the CVS checkout are failing on attempts to compile Bio/clistfnsmodule.c. The main culprit seems to be a missing file, Python.h. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 12:15:42 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:15:42 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081715.kA8HFg6e017131@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #1 from chris.lasher at gmail.com 2006-11-08 12:15 ------- Created an attachment (id=497) --> (http://bugzilla.open-bio.org/attachment.cgi?id=497&action=view) Output from failed installation. This is the output from my failed installation. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 12:33:41 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:33:41 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081733.kA8HXfpb018644@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2006-11-08 12:33 ------- This very much looks like a problem with your Python installation. Do you have the Python.h header file on your system? This problem may arise if you installed python using an rpm. If so, make sure to install the python-devel rpm also. That one contains Python.h. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 13:41:43 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 13:41:43 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081841.kA8IfhFJ023784@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #3 from chris.lasher at gmail.com 2006-11-08 13:41 ------- (In reply to comment #2) > This very much looks like a problem with your Python installation. Do you have > the Python.h header file on your system? > This problem may arise if you installed python using an rpm. If so, make sure > to install the python-devel rpm also. That one contains Python.h. > Good call! My apologies, I feel foolish now. 
For Debian/*buntu users, the package to get is python-dev. Should I add something about the Python development packages being necessary for installation from CVS source on http://biopython.org/wiki/CVS ? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 14:05:13 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 14:05:13 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081905.kA8J5DIg025030@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-11-08 14:05 ------- > Should I add something about the Python development packages being necessary > for installation from CVS source on http://biopython.org/wiki/CVS ? The Python development packages are always needed, so also when installing an official Biopython release. If you could add some text to that effect to the Biopython wiki somewhere, that would be great. Closing this bug. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
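For anyone hitting the same build failure, a quick check along the following lines may help before attempting to compile. This is only a sketch: it assumes a reasonably modern Python that ships the standard sysconfig module, and the helper names python_header_path and have_python_headers are made up for illustration. The package names in the message are the ones mentioned in the bug comments above.

```python
import os
import sysconfig


def python_header_path():
    """Return the location where this interpreter expects Python.h."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def have_python_headers():
    """True if Python.h is present, so C extensions can be compiled."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    if have_python_headers():
        print("Found", python_header_path())
    else:
        # Package names as given in Bug 2137; they vary by distribution.
        print("Python.h is missing; install python-devel (rpm) "
              "or python-dev (Debian/Ubuntu) and try again.")
```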
From idoerg at burnham.org Wed Nov 8 21:39:23 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Wed, 08 Nov 2006 18:39:23 -0800
Subject: [Biopython-dev] [BioPython] EUtils module
In-Reply-To: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
References: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
Message-ID: <455294DB.6000105@burnham.org>

Srinivas Iyyer wrote:
> Dear Group,
>
> I downloaded EUtils module.
>
> I am trying to reproduce the code given in :
>
> http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
>
> I am getting Errors.

This is code from an alpha version of EUtils used at a presentation. I don't think it was meant to be reproducible, or even made it into the final module. You might want to look under the hood. There is a README file in the EUtils installation, which has some examples. But NCBI change the EUtils specifications quite frequently, so chances are, if no one has used EUtils for a while, that it might be broken.

>
> I want to know which databases in Entrez are supported
> by EUtils.
>
> Could any one please help me whats the problem.
>
> Are not many people using EUtils.
>
> Thanks
>
>>>> import EUtils
>>>> dbs = EUtils.dblist()
>
> Traceback (most recent call last):
>   File "", line 1, in -toplevel-
>     dbs = EUtils.dblist()
> AttributeError: 'module' object has no attribute
> 'dblist'
>>>> dbinfo = EUtils.dbinfo("pubmed")
>
> Traceback (most recent call last):
>   File "", line 1, in -toplevel-
>     dbinfo = EUtils.dbinfo("pubmed")
> AttributeError: 'module' object has no attribute
> 'dbinfo'
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From mdehoon at c2b2.columbia.edu Fri Nov 10 01:28:49 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Fri, 10 Nov 2006 01:28:49 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <454B2C81.9090309@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> Message-ID: <45541C21.6080402@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > Currently the individual format specific iterators just require a handle > (and not a filename). Are we all happy with this? Happy. > We could make the handle and format the first arguments as a compromise? If in doubt, don't add it to Biopython! It's much easier to add a functionality later, should the need arise, than to remove one. > I personally want the file extension to format mapping, but then I am > fairly disciplined about using file extensions. As I seem to be the > only voice advocating this, it looks like I may have to give in... > > Is it worth asking on the main discussion list to canvas opinion? Sure, go ahead. But ask for *why* a user wants file extension to format mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to know which usage case that we haven't thought about yet warrants file extension to format mapping. 
> We have functions to do the following, where "file" may mean just a > handle, or perhaps the choice of a handle or filename (see above): > > (*) File to SeqRecord iterator, currently File2SequenceIterator > (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict > (*) SeqRecord iterator/list to alignment, currently Iter2Alignment > (*) Write SeqRecordwher iterator/list to a file, currently Sequences2File If: File2SequenceIterator doesn't infer the file format from the extension and File2SequenceIterator takes handles only, so no file names, then: Why do we need the File2SequenceIterator function? Btw, we should make a new Biopython release once the dust settles. --Michiel. From idoerg at burnham.org Fri Nov 10 02:30:17 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Thu, 09 Nov 2006 23:30:17 -0800 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45541C21.6080402@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> Message-ID: <45542A89.6050202@burnham.org> Michiel de Hoon wrote: > Peter (BioPython Dev) wrote: >> Currently the individual format specific iterators just require a handle >> (and not a filename). Are we all happy with this? > > Happy. I second that. I have two arguments against that: 1) It is standard practice in biopython to pass file handle as arguments to a parser rather than a filename. If we break this, we would start thinking which parser takes a handle and which a filename. 
Things will be a mess.

2) Also, what if you are not passing a real file? E.g. I have applications that pass StringIO streams into the parser. You are lumping two levels of IO into one, and IMHO that is bad practice. In other words, a filehandle can always be generated from a file, easily

>>> filefunc(open('myfile'))

but you cannot generate a file from filehandle-type data. OK, you can programmatically generate a tmp file for reading, but that places a burden on the user.

3) The last argument against rigid filename extensions is interoperability with other applications that generate those files. Suppose you have one application that generates fasta files with a .tfa extension, and another with a .fa extension and yet a third with .pfa extensions... and those extensions are important to you for other reasons, like knowing which is a nucleic acid file and which is protein. Actually, all the NCBI genomic files are built like this... :)

OK, three arguments. I think that relying on filename extensions for content is rather DOS-ish and places an extra burden on the user. I'm suffering enough on my Windows machine with Rasmol trying to open all my .pdb files. Including those where pdb stands for "Palm Pilot database" rather than Protein Data Bank.

>
>> We could make the handle and format the first arguments as a compromise?
>
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise,
> than to remove one.

We could add the format as an OPTIONAL keyword argument, with a "None" default value. And have the parser recognize the format from a lookahead using a magic regexp for each format. The user-passed format overrides the parser guesswork. Shouldn't be too hard to implement, as file formats are very distinct.

>
>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions.
As I seem to be the >> only voice advocating this, it looks like I may have to give in... >> >> Is it worth asking on the main discussion list to canvas opinion? > > Sure, go ahead. But ask for *why* a user wants file extension to format > mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to > know which usage case that we haven't thought about yet warrants file > extension to format mapping. > >> We have functions to do the following, where "file" may mean just a >> handle, or perhaps the choice of a handle or filename (see above): >> >> (*) File to SeqRecord iterator, currently File2SequenceIterator >> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict >> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment >> (*) Write SeqRecordwher iterator/list to a file, currently Sequences2File > > If: > File2SequenceIterator doesn't infer the file format from the extension > and > File2SequenceIterator takes handles only, so no file names, > then: > Why do we need the File2SequenceIterator function? > > Btw, we should make a new Biopython release once the dust settles. > > --Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. 
La Jolla, CA 92037, USA T: +1 858 646 3100 x3516 http://iddo-friedberg.org http://BioFunctionPrediction.org From biopython-dev at maubp.freeserve.co.uk Mon Nov 13 19:49:02 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 14 Nov 2006 00:49:02 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45542A89.6050202@burnham.org> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> Message-ID: <4559127E.3050109@maubp.freeserve.co.uk> Iddo Friedberg wrote: > 3) The last argument against rigid filename extensions is > interoperability with other applications that generate those files. > Suppose you have one application that generates fasta files with a > .tfa extension, and another with a .fa extension and yet a third with > .pfa extensions... and those extensions are important to you for > other reasons, like knowing which is a nucleic acid file and which is > protein. Actually, all the NCBI genomic files are built like this... > :) Interesting tidbit. If you are using "exotic" file extensions, then you would have to explicitly tell my Bio.SeqIO code the file's format. Although "fa" is currently a known extension mapped to fasta format in Bio.SeqIO, your other examples are not. Are these other extensions used outside the internal systems of the NCBI? > OK, three arguments. 
I think that relying on filename extensions for > content is rather DOS-ish and places an extra burden on the user. I'm not trying to force anyone into using specific filename extensions - I'm trying to make life easier for people who already do this (or who download their data from online sources like the NCBI or PFAM - which do seem to be consistent in their naming conventions). > I'm suffering enough on my Windows machine with Rasmol trying to open > all my .pdb files. Including those where pdb stands for "Palm Pilot > database" rather than Protein Data Bank. Yes - multiple interpretations of a given file format are a problem. I've noticed that same PDB extension clash too (but I don't use a Palm pilot any more). Can anyone think of any common extensions used for more than one file format? I know Clustal uses *.aln for its alignments which is perhaps asking for trouble... > We could add the format as a OPTIONAL keyword argument, with a "None" > default value. And have the parser recognize the format from a > lookahead using a magic regexp fro each format. The user passed > format overrides the parser guesswork. Shouldn't be too hard to > implement, as file formats are very distinct. Currently the format is an optional keyword argument defaulting to None. When it is omitted, I currently use a limited filename extension to format mapping (assuming the filename is available) to deduce/guess the format. 
Peter

From idoerg at burnham.org Tue Nov 14 12:19:14 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Tue, 14 Nov 2006 09:19:14 -0800
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4559127E.3050109@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk>
Message-ID: <4559FA92.8070408@burnham.org>

Peter (BioPython Dev) wrote:
> Iddo Friedberg wrote:
>> 3) The last argument against rigid filename extensions is
>> interoperability with other applications that generate those files.
>> Suppose you have one application that generates fasta files with a
>> .tfa extension, and another with a .fa extension and yet a third with
>> .pfa extensions... and those extensions are important to you for
>> other reasons, like knowing which is a nucleic acid file and which is
>> protein. Actually, all the NCBI genomic files are built like this...
>> :)
>
> Interesting tidbit.
>
> If you are using "exotic" file extensions, then you would have to
> explicitly tell my Bio.SeqIO code the file's format.
>
> Although "fa" is currently a known extension mapped to fasta format in
> Bio.SeqIO, your other examples are not. Are these other extensions used
> outside the internal systems of the NCBI?

I wouldn't call it a tidbit, or exotic. It is very prevalent; NCBI's GenBank genomic repositories are very much deferred to.
The point is, since NCBI uses one standard of file extensions for its genomic databases, TIGR another (actually, TIGR points to GenBank for completed genomes), UCSC a third... then maybe relying on file suffixes is not such a great idea.

See for example the E. coli genome:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12

Some are in fasta format, but have different contents: whole genome, noncoding RNA, protein. Same with those that are in GenBank format. So the NCBI suffixes denote not only the file format, but the biological content as well.

Also, for the reasons I gave in my previous email, I think we should stick with passing file handles, not file names. There is no real need to pass a filename rather than a file handle. If you need information from the filename, you can read the filename from the file handle:

>>> foo = open('foo')
>>> foo.name
'foo'

And the functions could still accept StringIO streams if needed.

>
>>
>
> I'm not trying to force anyone into using specific filename extensions -
> I'm trying to make life easier for people who already do this (or who
> download their data from online sources like the NCBI or PFAM - which do
> seem to be consistent in their naming conventions).
>

You cannot rely on such consistency prevailing. Especially not with NCBI. ;)

>
>> We could add the format as a OPTIONAL keyword argument, with a "None"
>> default value. And have the parser recognize the format from a
>> lookahead using a magic regexp fro each format. The user passed
>> format overrides the parser guesswork. Shouldn't be too hard to
>> implement, as file formats are very distinct.
>
> Currently the format is an optional keyword argument defaulting to None.
> When it is omitted, I currently use a limited filename extension to
> format mapping (assuming the filename is available) to deduce/guess the
> format.
>

Ideally, the data format should be supplied by the user.
Second best is inferring from parsing the first line or so in the file. Third is filename extension. But the second and third options are both not very good practice, IMHO.

> Peter
>

--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From bugzilla-daemon at portal.open-bio.org Tue Nov 14 15:48:49 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 15:48:49 -0500
Subject: [Biopython-dev] [Bug 2143] New: Error parsing BLAT output (using out=blast format)
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2143

           Summary: Error parsing BLAT output (using out=blast format)
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: fgibbons at hms.harvard.edu

Attempting to parse this BLAT output (see below) raises an "I couldn't find the sbjct in" exception. After looking at the code, it seems to me that the problem is an overly strict regexp that relies on a single space between the "Sbjct:" and the integer that follows it. Replace the literal space with '\s*', and it goes away. This in fact matches the regexp used to match the "Query:". I can't imagine that it might hurt things, even in the main NCBIBlastParser, but you never know....
(All of the above refers to the method sbjct in class _HSPConsumer, file NCBIStandalone.py)

-Frank Gibbons (fgibbons at hms.harvard.edu)

-------------------------------------
Reference: Kent, WJ. (2002) BLAT - The BLAST-like alignment tool

Query= NCU00001
(54 letters)

Database: all_proteins.fasta
293697 sequences; 128,064,135 total letters

                                                 Score    E
Sequences producing significant alignments:     (bits)  Value

MGG_10872.5                                       101   1e-21

>MGG_10872.5
Length = 245

Score = 101 bits (260), Expect = 1e-21
Identities = 54/54 (100%), Positives = 54/54 (100%), Gaps = 0/54 (0%)

Query: 1   MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 54
           MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL
Sbjct: 192 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 245

Database: all_proteins.fasta
BLASTP 2.2.4 [blat]

From bugzilla-daemon at portal.open-bio.org Tue Nov 14 17:03:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Nov 2006 17:03:40 -0500
Subject: [Biopython-dev] [Bug 2143] Error parsing BLAT output (using out=blast format)
In-Reply-To:
Message-ID: <200611142203.kAEM3eu3014395@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2143

mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-11-14 17:03 -------
Fixed in CVS, thanks.
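To illustrate the kind of change described in Bug 2143, the sketch below contrasts a pattern that demands exactly one space after "Sbjct:" with one that uses \s* instead. Both patterns and the helper sbjct_start are hypothetical illustrations and are not copied from NCBIStandalone.py; the exact spacing that BLAT emitted is not recoverable from the report, so the test strings here are invented.

```python
import re

# Hypothetical patterns for illustration; not the ones in NCBIStandalone.py.
STRICT_SBJCT = re.compile(r"Sbjct: (\d+)")      # exactly one literal space
TOLERANT_SBJCT = re.compile(r"Sbjct:\s*(\d+)")  # zero or more whitespace


def sbjct_start(line, pattern):
    """Return the subject start coordinate as an int, or None if no match."""
    match = pattern.match(line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Invented example lines with zero, one and several spaces.
    for line in ("Sbjct:192 MAINSG", "Sbjct: 192 MAINSG", "Sbjct:   192 MAINSG"):
        print(repr(line),
              sbjct_start(line, STRICT_SBJCT),
              sbjct_start(line, TOLERANT_SBJCT))
```

The tolerant pattern accepts everything the strict one does, which is consistent with the reporter's expectation that the change should not hurt ordinary BLAST output.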
From chris.lasher at gmail.com Tue Nov 14 19:51:26 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 14 Nov 2006 19:51:26 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4559FA92.8070408@burnham.org> References: <45425925.8090607@maubp.freeserve.co.uk> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk> <4559FA92.8070408@burnham.org> Message-ID: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> Just pitching in again, I agree with Michiel with regards to the list of functions necessary. To restate, these would be: (*) File to SeqRecord iterator, currently File2SequenceIterator (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict (*) SeqRecord iterator/list to alignment, currently Iter2Alignment (*) Write SeqRecord iterator/list to a file, currently Sequences2File I also think there's wisdom to Michiel's statement that it's easier to add functionality than it is to remove it. I agree with Iddo on his arguments against dealing with filename extensions. Upon reflection, however, I feel comfortable with a lookahead-based file-format guesser for the sake of convenience and as a matter of compromise to those who are not keen on being explicit in regards to every detail. It's been stated that bio file formats are quite distinct. I tried to think of a counterexample but failed. Finally, to reply to Michiel's question on release, it does seem once SeqIO is solidified this would certainly be worthy of a new release. SeqIO is a big step in a good direction for BioPython.
Chris From biopython-dev at maubp.freeserve.co.uk Wed Nov 15 07:52:58 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Wed, 15 Nov 2006 12:52:58 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> References: <45425925.8090607@maubp.freeserve.co.uk> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk> <4559FA92.8070408@burnham.org> <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> Message-ID: <455B0DAA.9040000@maubp.freeserve.co.uk> Chris Lasher wrote: > Just pitching in again, I agree with Michiel with regards to the list > of functions necessary. To restate, these would be: On Monday I switched from the "2" pun names to "To" giving the following: (*) FileToSequenceIterator, previously File2SequenceIterator File to SeqRecord iterator (*) SequencesToDict, previously SequenceIter2Dict SeqRecord iterator/list to dictionary (*) SequencesToAlignment, previously Iter2Alignment SeqRecord iterator/list to alignment (*) SequencesToFile, previously Sequences2File Write SeqRecord iterator/list to a file I agree that these are all important "core functions". > I also think there's wisdom to Michiel's statement it's easier to add > functionality than it is to remove it. Very true. On that note... We also currently have three "convenience functions", which seem scheduled for removal based on these discussions. 
Unless anyone speaks up for these three, I'll remove them (and update the Wiki to match): (*) FileToSequenceList previously called File2SequenceList (*) FileToSequenceDict previously called File2SequenceDict (*) FileToAlignment previously called File2Alignment These simply wrap FileToSequenceIterator with the list, SequencesToDict or SequencesToAlignment function. > I agree with Iddo on his arguments against dealing with filename > extensions. Upon reflection, however, I feel comfortable with a > lookahead-based file-format guesser for the sake of convenience and as > a matter of compromise to those who are not keen on being explicit in > regards to every detail. It's been stated that bio file formats are > quite distinct. I tried to think of a counterexample but failed. I would say telling EMBL and Swiss (aka SwissProt aka Unigene) apart is tricky. They both start with an "ID ..." line and finish with "//", the feature table format is the big difference. If we did try guessing file formats by looking at the file contents, I would not try and guess every file format which Bio.SeqIO could read - just those which are easily identifiable. In this case, I would be inclined not to try and tell EMBL and SwissProt apart, and simply abort with "Unrecognised format". Peter From biopython-dev at maubp.freeserve.co.uk Tue Nov 28 08:24:35 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 28 Nov 2006 13:24:35 +0000 Subject: [Biopython-dev] [BioPython] Problems with Win Release for Python 2.5: Numeric, KDTree In-Reply-To: <005301c7129b$f3222300$b400a8c0@Sirius> References: <005301c7129b$f3222300$b400a8c0@Sirius> Message-ID: <456C3893.6060402@maubp.freeserve.co.uk> Hendrik Weisser wrote: > The main question for me is whether these issues (the 2nd, mostly) can be > adressed quickly, or whether it is recommended to use the "old" Python 2.4 > and corresponding packages for the time being. Can anyone help me with that? 
Yes - assuming you don't have all the compilers and stuff to compile your own libraries (and therefore need to use the Windows installers), using Windows with Python 2.4 and Numeric 24.2 with BioPython 1.42 should be fine. Personally I use Python 2.4 on Linux (as shipped with the distribution) and Python 2.3 on my Windows machine. Both work fine with BioPython and Numeric - although I have not used Bio.PDB very much. Peter From bugzilla-daemon at portal.open-bio.org Wed Nov 29 14:03:08 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Nov 2006 14:03:08 -0500 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200611291903.kATJ38DJ007489@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 ------- Comment #1 from grunberg at embl.de 2006-11-29 14:03 ------- Things get worse with the current blastall 2.2.15. _scan_parameters in NCBIStandalone.py expects "Number of HSP's better" which in the later blastall versions has changed to: "Number of sequences better". This prevents the parser from fetching the next two lines even though they would be there and then we get exceptions etc. Another independent problem occurs further down -- The lines:: T: 11 A: 40 have now changed to:: Neighboring words threshold: 11 Window for multiple hits: 40 and again we run into an exception. Both problems also occur in the latest CVS snapshot. Both can be fixed with some additional attempt_read_and_call but I am not sure whether my quick and dirty fixes follow the right spirit...
Change A:
---------
INSERT BEFORE...::

    # not in blastx 2.2.1
    attempt_read_and_call(uhandle, consumer.query_length, has_re=re.compile(r"[Ll]ength of query"))

...These two statements::

    # in blastall 2.2.15
    attempt_read_and_call(uhandle, consumer.noevent, start="Number of HSP's gapped:")
    attempt_read_and_call(uhandle, consumer.noevent, start="Number of HSP's successfully")

Change B:
---------
REPLACE::

    # not in BLASTN 2.2.9
    attempt_read_and_call(uhandle, consumer.threshold, start='T')
    read_and_call(uhandle, consumer.window_size, start='A')

BY::

    # not in BLASTN 2.2.9
    attempt_read_and_call(uhandle, consumer.threshold, start='T')
    attempt_read_and_call(uhandle, consumer.window_size, start='A')
    ## renamed in BLASTALL 2.2.15
    attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring')
    attempt_read_and_call(uhandle, consumer.window_size, start='Window')

Could someone with more Biopython experience please validate and apply the fix? THX! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Wed Nov 1 05:58:41 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Wed, 01 Nov 2006 00:58:41 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4545D9F1.2040902@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> Message-ID: <45483791.7070803@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > With such a speed up, I'd guess you were using Bio.Fasta before. Yes I was. I just went to the Biopython tutorial and used the stuff in section 2.4. I didn't expect it to be *that* slow. > I've noticed the same thing.
Are you dealing with NCBI style fasta > identifiers made up of several fields separated by "|" characters? Yep. >> For duplicate keys, there are at least four possibilities (raise an >> exception, store only one of the keys, store neither of the keys >> and don't raise an exception, store both after modifying one of the >> keys). So this should also be an option. > > Supporting all these options with an easy to understand interface > looks too hard. > > In my opinion if someone is trying to build a dictionary using > repeated keys they have made a mistake (either in their datafile, or > their record2key function) - so raising an exception is reasonable > default behaviour (and is easy to code). You're probably right. I'm fine with raising an exception. >> In the File2SequenceDict above, answer[key] contains the complete >> record. Some people will want that. However, in my application I >> only want to store the record.seq part in answer[key]. Somebody >> else may want str(record.seq). So we'd also need a record2value >> argument. > > It does slightly undermine the "you only get SeqRecord objects" > principle. On the other hand, its a simple addition that is easy to > explain and implement. I'm happy to add this. The point I was trying to make is that for a File2SequenceDict function to be useful, it would end up being too complex. In the answer above, a user could also do answer[key].seq to get the part she wants, so maybe a record2value argument is not essential in practice. Part of my opposition against the File2SequenceDict function is that it requires the parser to be called File2SequenceIterator (which I don't like as a name, but more about that some other time), which then leads to a File2SequenceList function, which is software bloat. So, how about making the functionality of File2SequenceDict available as a todict() method to the iterator object returned by File2SequenceIterator, or, as a iterator2dict function? --Michiel. 
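[Editorial sketch, not part of the original thread.] The behaviour agreed on above (build a dictionary from an iterator of records, with a user-supplied record2key function, and raise an exception on a repeated key) might look roughly like this. The function name records_to_dict and the Record stand-in are hypothetical; the real code works on SeqRecord objects, and only the record2key argument name comes from the discussion. Written for current Python 3.

```python
from collections import namedtuple

# Stand-in for SeqRecord, so the sketch runs without Biopython installed.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None):
    """Build {key: record} from any iterable of records.

    Hypothetical helper: raises ValueError on a repeated key instead of
    silently keeping or dropping one of the records.
    """
    if record2key is None:
        record2key = lambda record: record.id
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


# NCBI-style identifiers made up of "|"-separated fields (invented values).
records = [
    Record("gi|1|ref|A1|", "ACGT"),
    Record("gi|2|ref|B2|", "GGCC"),
]

# Use the fourth field of the identifier as the dictionary key.
by_accession = records_to_dict(iter(records), lambda r: r.id.split("|")[3])
print(sorted(by_accession))
print(by_accession["A1"].seq)
```

As noted in the thread, a caller who only wants the sequence can write answer[key].seq, so a separate record2value argument is left out of this sketch.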
From biopython-dev at maubp.freeserve.co.uk Wed Nov 1 10:09:59 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 01 Nov 2006 10:09:59 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45483791.7070803@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> Message-ID: <45487277.6080308@maubp.freeserve.co.uk> > The point I was trying to make is that for a File2SequenceDict > function to be useful, it would end up being too complex. Of course I'm going to be biased here, but I do find the simple current dictionary construction useful as it is. Clearly we have slightly different uses in mind (which is good - the design should try and cater to most people). > In the answer above, a user could also do answer[key].seq to get the > part she wants, so maybe a record2value argument is not essential in > practice. > > Part of my opposition against the File2SequenceDict function is that > it requires the parser to be called File2SequenceIterator (which I > don't like as a name, but more about that some other time), which > then leads to a File2SequenceList function, which is software bloat. > > So, how about making the functionality of File2SequenceDict available > as a todict() method to the iterator object returned by > File2SequenceIterator, or, as a iterator2dict function? I do like your first suggestion - the idea of adding a todict() method to the iterator objects. However, that would require that all the parsers be written as (sub)classes, and right now several of them are written as generator functions. I've found using generator functions to be very simple, and easy to understand. They seem like a good choice for simple file formats. 
But with a good enough reason, I could turn them into classes. ---- Right now I am making both "file to dict" and "iterator to dict" functions available: File2SequenceDict(..., record2key) is implemented as SequenceIter2Dict(File2SequenceIterator(...), record2key) Also: File2Alignment(...) is implemented as Iter2Alignment(File2SequenceIterator(...)) And: File2SequenceList(...) is implemented as list(File2SequenceIterator(...)) Leaving aside the names (which I notice are not currently consistent) I would be fine with removing File2SequenceList, File2SequenceDict, and File2Alignment but retaining the two functions which convert from a SeqRecord returning iterator into dict or an alignment. How does that sound Michiel (subject to agreeing on names)? Peter From bugzilla-daemon at portal.open-bio.org Wed Nov 1 22:50:46 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Nov 2006 17:50:46 -0500 Subject: [Biopython-dev] [Bug 2131] New: SProt.py fails to parse the current Swiss-Prot version 51.0 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2131 Summary: SProt.py fails to parse the current Swiss-Prot version 51.0 Product: Biopython Version: 1.24 Platform: Macintosh OS/Version: MacOS X Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: Biosql at hotmail.com Hi, I'm running on a mac OS 10.4, python 2.5 and tried to parse the Swiss-Prot .dat file with the latest SProt.py version and get this : Traceback (most recent call last): File "Parser_SProt_to_DB.py", line 37, in cur_record = s_iterator.next() File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 166, in next return self._parser.parse(File.StringHandle(data)) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 290, in parse self._scanner.feed(handle, self._consumer) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 332, in feed
self._scan_record(uhandle, consumer) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 337, in _scan_record fn(self, uhandle, consumer) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 369, in _scan_id self._scan_line('ID', uhandle, consumer.identification, exactly_one=1) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 359, in _scan_line read_and_call(uhandle, event_fn, start=line_type) File "/sw/lib/python2.5/site-packages/Bio/ParserSupport.py", line 301, in read_and_call method(line) File "/sw/lib/python2.5/site-packages/Bio/SwissProt/SProt.py", line 526, in identification self.data.sequence_length = int(cols[4]) ValueError: invalid literal for int() with base 10: 'AA.' Any clue ? Thanks ! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chris.lasher at gmail.com Thu Nov 2 03:49:04 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Wed, 1 Nov 2006 22:49:04 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> Message-ID: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> I'd like to pitch in a few comments here. Peter wrote: > One point against names like File2SequenceIterator is the pun on two > versus to (i.e. convert) will not be so obvious to non-native English > speakers. I'd like to second that. It's cute, sure, but FileToSequenceIterator isn't that much more difficult, and leaves no room for confusion. 
(e.g., Where's the File1SequenceIterator?) Michiel wrote: > I like the idea of one argument that takes a file name or handle. I > believe that that is how other Biopython functions work. Yikes! Are you serious? Why not make it easier and require a file-like object? I would definitely not be for it taking a plain string. This seems implicit rather than explicit. "Takes a file... or a file-like object... or a string containing a filename... or just a string containing the file contents... or a brief description of the data that's in your file... or a bunch of smiley emoticons, if you're in a good mood..." File-like objects are testable and leave little room for surprise. Anything else seems like it's asking for a headache. Which brings me to the issue of "guessing" a file's format. Yikes, again! I'd expect that kind of "magickery" from Perl, but once again, explicit is better than implicit. I honestly think it's not too much to expect the user to know what filetype they're expecting BioPython to deal with. Could you guys please explain the motivation behind this to me? As I see it right now, the last thing I want is BioPython incorrectly guessing my file format, and particularly, assuming that I have put the proper extension to represent the file format. The unified sequence object is what's beautiful about SeqIO, but the guesswork that you are discussing having SeqIO's classes do is scary, to me. And I think by now it's predictable that I'm a fan of Peter's suggestion to have an exception raised upon the attempt to create a dictionary with identical IDs; all other options are, again, too implicit for my tastes. Thanks very much for developing SeqIO and discussing it so much, guys. I think this will be a fantastic asset to BioPython! Keep on rockin' it! 
Chris From bugzilla-daemon at portal.open-bio.org Thu Nov 2 11:29:38 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 06:29:38 -0500 Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0 In-Reply-To: Message-ID: <200611021129.kA2BTcOX010117@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2131 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 06:29 ------- Hi Jonathan, What version of BioPython are you using? I know that bugzilla needs updating to include more up to date version numbers, but you aren't really using BioPython 1.24 are you? Currently the latest release is 1.42, and this does include some updates for SProt, e.g. bug 1948. There is also a more recent fix in CVS for bug 2043 dealing with new style RX lines. Could you tell us which SProt file you are using (a URL would be fine). If there are many that fail the same way, and you have a small example input file, you could even attach it to this bug. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From biopython-dev at maubp.freeserve.co.uk Thu Nov 2 12:49:30 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Thu, 02 Nov 2006 12:49:30 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> Message-ID: <4549E95A.6080605@maubp.freeserve.co.uk> Chris Lasher wrote: > I'd like to pitch in a few comments here. > > Peter wrote: >> One point against names like File2SequenceIterator is the pun on >> two versus to (i.e. convert) will not be so obvious to non-native >> English speakers. > > I'd like to second that. It's cute, sure, but FileToSequenceIterator > isn't that much more difficult, and leaves no room for confusion. > (e.g., Where's the File1SequenceIterator?) I would be happy with FileToSequenceIterator, or even FileToSequenceIter. FileToSeqIter is shorter but we don't actually return Seq objects so I would avoid that. Does anyone else have any suggestions? > Michiel wrote: >> I like the idea of one argument that takes a file name or handle. I >> believe that that is how other Biopython functions work. I've had a little look, and the only case I found is the recent Bio.Nexus parser - and this choked on a StringIO handle on my machine (fix checked in). Chris Lasher wrote: > Yikes! Are you serious? Why not make it easier and require a > file-like object? I would definitely not be for it taking a plain > string. This seems implicit rather than explicit. "Takes a file... or > a file-like object... or a string containing a filename... 
or just a > string containing the file contents... or a brief description of the > data that's in your file... or a bunch of smiley emoticons, if > you're in a good mood..." File-like objects are testable and leave > little room for surprise. Anything else seems like it's asking for a > headache. Trying to distinguish between an (invalid) filename and the contents of a sequence file is just too much to ask - more a migraine than a headache. As an experiment, I've implemented (but not checked in) automatic handle/filename detection. It seems to work (but I have not yet tried exotic arguments like file names in Unicode, or random classes with a __str__ method). Still, it's messy. While it does sound like a nice idea for the end user, the idea of filenames and handles is pretty important in python, and maybe we shouldn't worry about forcing newcomers to deal with handles. After all, the SeqIO system will make them deal with iterators and SeqRecords which I think are far more complicated! What do you think Michiel? Chris Lasher wrote: > Which brings me to the issue of "guessing" a file's format. Yikes, > again! I'd expect that kind of "magickery" from Perl, but once again, > explicit is better than implicit. I honestly think it's not too much > to expect the user to know what filetype they're expecting BioPython > to deal with. Could you guys please explain the motivation behind > this to me? As I see it right now, the last thing I want is BioPython > incorrectly guessing my file format, and particularly, assuming that > I have put the proper extension to represent the file format. The > unified sequence object is what's beautiful about SeqIO, but the > guesswork that you are discussing having SeqIO's classes do is scary, > to me. For comparison this quote is from the BioPerl SeqIO How-To: >> [BioPerl's] SeqIO can try to guess based on known file extensions >> or content, ... it is a good idea to get into the practice of >> always specifying the format.
I want to stress that as written, the user can specify the file format to the File2SequenceIterator function (and its variants). Maybe we should encourage people to explicitly supply the format in any Bio.SeqIO documentation.... You asked about motivation for guessing the file format. I break that down into guessing the file format based on the file extension, or based on the file's contents (see later). I personally am perfectly happy with using a file extension to file format mapping. Maybe this reflects my computing background (more DOS/Windows background than Unix/Linux). Note that if the format is not specified, and the file extension is not on the known list (e.g. "txt" or "data" which could be anything) then the call to File2SequenceIterator function (or its variants) will fail with an invalid format message/exception. Assuming we don't make the format a required argument, and we keep the extension to format mappings, then I should make a point of including deliberate mismatches in the test suites - and check that they abort with a SyntaxError. Regarding guessing the format based on file contents: For some applications, having a format guesser built into BioPython might actually be very useful - the example given on the BioPerl website is the back end of a web tool that took sequence input, where maybe you can't trust the actual end user to know exactly what file format their data is in. Doing this for some file formats isn't too hard, often all you need to see is the first line. For other file formats it's very tricky and best not attempted. But, is partial guess support even worth implementing - especially as it may be less than perfect and get it wrong sometimes? I think Michiel and I were happy to leave this question for later...
Chris Lasher wrote: > And I think by now it's predictable that I'm a fan of Peter's > suggestion to have an exception raised upon the attempt to create a > dictionary with identical IDs; all other options are, again, too > implicit for my tastes. Good. Michiel agreed in another email: >> >> You're probably right. I'm fine with raising an exception. >> Have you been following the rest of that SeqRecord dictionary discussion Chris? > Thanks very much for developing SeqIO and discussing it so much, > guys. I think this will be a fantastic asset to BioPython! Keep on > rockin' it! > > Chris Thank you for your passionate feedback :) Peter From bugzilla-daemon at portal.open-bio.org Thu Nov 2 17:38:26 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 12:38:26 -0500 Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0 In-Reply-To: Message-ID: <200611021738.kA2HcQKH017740@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2131 Biosql at hotmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Version|1.24 |Not Applicable ------- Comment #2 from Biosql at hotmail.com 2006-11-02 12:38 ------- I'm using the latest version of Biopython 1.42 with the latest version of Sprot.py from the CVS. I used the Swiss-Prot file version 51 coming from here : ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz I also tried to parse this file on a PC with python 2.4.3 and the latest biopython version and got the same result. Jonathan -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Nov 2 18:07:03 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 13:07:03 -0500 Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0 In-Reply-To: Message-ID: <200611021807.kA2I73W3021327@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2131 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 13:07 ------- Created an attachment (id=491) --> (http://bugzilla.open-bio.org/attachment.cgi?id=491&action=view) First four records from uniprot_sprot.dat.gz release 51 I was hoping for a smaller test case, uniprot_sprot.dat.gz is 185MB compressed, and 836MB as plain text! Anyway, I have extracted and attached a file with the just the first four records in it for anyone interested in testing. I would guess from your stack trace that this recent change to the ID line that has caused the trouble: http://ca.expasy.org/sprot/relnotes/sp_news.html#rel9.0 Old (with MoleculeType): ID EntryName DataClass; MoleculeType; SequenceLength. New (without MoleculeType): ID EntryName DataClass; SequenceLength. e.g. ID CYC_PIG Reviewed; 104 AA. ID Q3ASY8_CHLCH Unreviewed; 36805 AA. This shouldn't be too hard to fix... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
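[Editorial sketch, not part of the original thread.] One way to accept both ID line layouts described in comment #3 is to split on the semicolons and treat MoleculeType as optional. This is only an illustration under stated assumptions: parse_id_line is a hypothetical name and is not the fix that was checked into SProt.py, the old-style example line (with "STANDARD" and "PRT") is an assumed illustration of the older layout, and the code targets current Python 3.

```python
def parse_id_line(line):
    """Parse a Swiss-Prot ID line, with or without the MoleculeType field.

    Hypothetical helper; returns (entry_name, data_class, molecule_type,
    sequence_length), where molecule_type is None for the new layout.
    """
    if not line.startswith("ID"):
        raise ValueError("Not an ID line: %r" % (line,))
    body = line[2:].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    head = parts[0].split()  # entry name followed by data class
    if len(head) != 2:
        raise ValueError("Unexpected ID line: %r" % (line,))
    entry_name, data_class = head
    if len(parts) == 3:      # old layout: DataClass; MoleculeType; Length.
        molecule_type, length_field = parts[1], parts[2]
    elif len(parts) == 2:    # new layout: DataClass; Length.
        molecule_type, length_field = None, parts[1]
    else:
        raise ValueError("Unexpected ID line: %r" % (line,))
    number, unit = length_field.split()
    if unit != "AA":
        raise ValueError("Unexpected length field: %r" % (length_field,))
    return entry_name, data_class, molecule_type, int(number)


# New-style lines from the comment above, plus an assumed old-style example.
print(parse_id_line("ID   CYC_PIG   Reviewed;   104 AA."))
print(parse_id_line("ID   Q3ASY8_CHLCH   Unreviewed;   36805 AA."))
print(parse_id_line("ID   CYC_PIG   STANDARD;   PRT;   104 AA."))
```

Counting fields after the split avoids relying on a fixed column index such as cols[4], which is where the traceback in the original report shows int() being handed 'AA.'.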
From bugzilla-daemon at portal.open-bio.org Thu Nov 2 18:41:46 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 13:41:46 -0500 Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0 In-Reply-To: Message-ID: <200611021841.kA2Ifkpg025233@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2131 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 13:41 ------- Fix checked into CVS, please reopen the bug if you run into problems. http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython file Bio/SwissProt/SProt.py Revision 1.34 made 2nd Nov 2006 This is the test script I used with the example file from comment 3 attachment 491

    from Bio.SwissProt import SProt

    #Works
    rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.SequenceParser())
    for record in rec_iter :
        print record.id
        print record.seq

    #Failed
    rec_iter = SProt.Iterator(open("uniprot_sprot_f4.dat"), SProt.RecordParser())
    for record in rec_iter :
        print record

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Nov 2 19:16:53 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 14:16:53 -0500 Subject: [Biopython-dev] [Bug 2131] SProt.py fails to parse the current Swiss-Prot version 51.0 In-Reply-To: Message-ID: <200611021916.kA2JGrGY028566@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2131 ------- Comment #5 from Biosql at hotmail.com 2006-11-02 14:16 ------- Thank you Peter ! So fast and so good. Jonathan -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 2 21:27:25 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Nov 2006 16:27:25 -0500 Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken) In-Reply-To: Message-ID: <200611022127.kA2LRPkN009879@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2043 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 16:27 ------- It seems to be working from the small amount of testing I did on another Swiss-Prot bug. Marking as fixed - please reopen if needed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Nov 2 21:38:01 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Nov 2006 16:38:01 -0500
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200611022138.kA2Lc1Zi010834@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2006-11-02 16:38 -------
While working on bug 2059, I've been tempted to make some similar changes to
Marc's suggestions about handling the SeqRecord's id/name/description, and the
addition of an "add SeqRecord" method.

I like the idea of adding a method to iterate over the sequences. How about
something a little simpler (which I haven't tested yet):

def __iter__(self):
    """Iterate over the SeqRecord objects making up the alignment"""
    return iter(self._records)

i.e. Use the fact that self._records is a list, and will support iteration
itself. This avoids having to keep track of the current iteration position in
our own next method.

Also, would anyone else like to be able to iterate over the columns?

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
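The delegation idea in the comment above, together with the open question about iterating over columns, could be sketched roughly as follows. This is a toy stand-in and not the real Bio.Align.Generic class: the name SimpleAlignment, the columns() method and the use of plain (id, sequence) tuples in place of SeqRecord objects are all assumptions made for illustration.

```python
class SimpleAlignment:
    """Toy alignment container (hypothetical, not Bio.Align.Generic)."""

    def __init__(self, records):
        # Each record is a plain (id, sequence_string) tuple in this sketch.
        self._records = list(records)

    def __iter__(self):
        """Iterate over the records by delegating to the underlying list."""
        return iter(self._records)

    def columns(self):
        """Yield each alignment column as a string, left to right."""
        sequences = [seq for _id, seq in self._records]
        # zip(*sequences) transposes the rows into columns.
        for column in zip(*sequences):
            yield "".join(column)


alignment = SimpleAlignment(
    [("alpha", "AC-T"), ("beta", "ACGT"), ("gamma", "A-GT")]
)
row_ids = [rec_id for rec_id, _seq in alignment]
column_strings = list(alignment.columns())
```

Because __iter__ simply hands back the list's own iterator, no position counter or next method is needed, which is the simplification the comment argues for.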
From mdehoon at c2b2.columbia.edu Fri Nov 3 02:20:23 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Thu, 02 Nov 2006 21:20:23 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45487277.6080308@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> Message-ID: <454AA767.9030506@c2b2.columbia.edu> Peter wrote: > Right now I am making both "file to dict" and "iterator to dict" > functions available: > > File2SequenceDict(..., record2key) is implemented as > SequenceIter2Dict(File2SequenceIterator(...), record2key) > > Also: > File2Alignment(...) is implemented as > Iter2Alignment(File2SequenceIterator(...)) > > And: > File2SequenceList(...) is implemented as list(File2SequenceIterator(...)) > > Leaving aside the names (which I notice are not currently consistent) I > would be fine with removing File2SequenceList, File2SequenceDict, and > File2Alignment but retaining the two functions which convert from a > SeqRecord returning iterator into dict or an alignment. > > How does that sound Michiel (subject to agreeing on names)? That sounds good to me. --Michiel. 
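As a rough illustration of the "iterator to dict" function agreed on above, here is a minimal sketch that takes any iterable of records plus a record2key callback and raises on duplicate keys, as discussed earlier in the thread. The names records_to_dict, default_key and second_field are hypothetical, and records are plain dicts rather than SeqRecord objects, so this is not the Bio.SeqIO implementation.

```python
def default_key(record):
    """Default record2key: use the record's id unchanged."""
    return record["id"]


def second_field(record):
    """Example record2key for NCBI-style ids such as 'gi|123|ref|NP_1'."""
    return record["id"].split("|")[1]


def records_to_dict(records, record2key=default_key):
    """Consume an iterable of records into a dict, rejecting repeated keys."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # A repeated key most likely means a mistake in the data file
            # or in record2key, so fail loudly rather than overwrite.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record
    return answer


records = [
    {"id": "gi|123|ref|NP_1", "seq": "MKV"},
    {"id": "gi|456|ref|NP_2", "seq": "MLA"},
]
by_gi = records_to_dict(iter(records), record2key=second_field)
sequence_only = by_gi["123"]["seq"]
```

The last line shows why a separate record2value argument may be unnecessary: the caller can pull out the part of the record they want after the lookup.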
From mdehoon at c2b2.columbia.edu Fri Nov 3 02:44:47 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Thu, 02 Nov 2006 21:44:47 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4549E95A.6080605@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> Message-ID: <454AAD1F.5050006@c2b2.columbia.edu> Peter wrote: > Chris Lasher wrote: >> Peter wrote: >>> One point against names like File2SequenceIterator is the pun on >>> two versus to (i.e. convert) will not be so obvious to non-native >>> English speakers. >> I'd like to second that. It's cute, sure, but FileToSequenceIterator >> isn't that much more difficult, and leaves no room for confusion. >> (e.g., Where's the File1SequenceIterator?) > > I would be happy with FileToSequenceIterator, or even > FileToSequenceIter. FileToSeqIter is shorter but we don't actually > return Seq objects so I would avoid that. > > Does anyone else have any suggestions? Yes, but let's discuss function names after we decide which functions we want. > While it does sound like a nice idea for the end user, the idea of > filenames and handles is pretty important in python, and maybe we > shouldn't worry about forcing newcomers deal with handles. After all, > the SeqIO system will make them deal with iterators and SeqRecords which > I think are far more complicated! > > What do you think Michiel? My preferred solution would be for File2SequenceIterator to take handles only. 
Same as Bio.Blast:

from Bio.Blast import NCBIXML

blast_out = open('my_blast.out')
b_parser = NCBIXML.BlastParser()
b_record = b_parser.parse(blast_out)

> Chris Lasher wrote:
>> Which brings me to the issue of "guessing" a file's format. Yikes,
>> again! I'd expect that kind of "magickery" from Perl, but once again,
>> explicit is better than implicit. I honestly think it's not too much
>> to expect the user to know what filetype they're expecting BioPython
>> to deal with. Could you guys please explain the motivation behind
>> this to me?
>......
>
> I think Michiel and I were happy to leave this question for later...
>

I am leaning towards Chris' opinion. File type guessing (from extension
or file contents) doesn't seem really necessary. At least, I don't
remember a user asking for it. The benefits of file type guessing from
the extension are minimal (since a user can probably do that more
reliably himself, knowing the file names he's likely to encounter). And
since file type guessing will not be foolproof, it may even be
confusing. Once file type guessing is available in Biopython though,
we're committed to it and we'll have to support it. So I'd be happier
without the file type guessing functionality.

That said, if somebody really wants it, I can live with it.

--Michiel.
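The "handles only, explicit format" position argued above might look something like the sketch below. It is an assumption-laden illustration, not Bio.SeqIO code: the function names, the format registry and the tiny FASTA reader yielding (title, sequence) tuples are all invented here.

```python
import io


def _fasta_records(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical registry: format name -> iterator function.
_FORMAT_TO_ITERATOR = {"fasta": _fasta_records}


def records_from_handle(handle, format_name):
    """Return a record iterator; the caller must name the format."""
    try:
        iterator = _FORMAT_TO_ITERATOR[format_name]
    except KeyError:
        raise ValueError("Unknown format %r" % (format_name,))
    return iterator(handle)


example = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGG\n")
parsed = list(records_from_handle(example, "fasta"))
```

Since the function only ever sees a handle, it works the same for an open file, an in-memory stream or a network response, and nothing is guessed from a filename.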
From biopython-dev at maubp.freeserve.co.uk Fri Nov 3 11:48:17 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Fri, 03 Nov 2006 11:48:17 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <454AAD1F.5050006@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> Message-ID: <454B2C81.9090309@maubp.freeserve.co.uk> My apologies for this somewhat long email. Handles and Filenames ===================== Currently the individual format specific iterators just require a handle (and not a filename). Are we all happy with this? Michiel de Hoon wrote: >> While it does sound like a nice idea for the end user, the idea of >> filenames and handles is pretty important in python, and maybe we >> shouldn't worry about forcing newcomers deal with handles. After >> all, the SeqIO system will make them deal with iterators and >> SeqRecords which I think are far more complicated! >> >> What do you think Michiel? > > My preferred solution would be for File2SequenceIterator to take > handles only. Assuming we keep the non-ambiguous file extension to file format mappings, allowing a filename as a possible argument to File2SequenceIterator (and any variants) makes good sense. Note that most handle objects have a "name" attribute to get the filename, which could be used to determine the file extension. i.e. We can still do the file extension to file format mapping using just a file handle (instead of a filename). 
Currently File2SequenceIterator has separate named arguments for a handle, filename and format. If no handle is provided, it will open one using the filename provided. We could make the handle and format the first arguments as a compromise? If we drop the extension to file format mapping (see below), then I agree File2SequenceIterator could just expect a handle and not a filename. Guessing File Formats ===================== >> Chris Lasher wrote: >>> Which brings me to the issue of "guessing" a file's format. >>> Yikes, again! I'd expect that kind of "magickery" from Perl, but >>> once again, explicit is better than implicit. I honestly think >>> it's not too much to expect the user to know what filetype >>> they're expecting BioPython to deal with. Could you guys please >>> explain the motivation behind this to me? Michiel de Hoon wrote: > I am leaning towards Chris' opinion. File type guessing (from > extension or file contents) doesn't seem really necessary. At least, > I don't remember a user asking for it. The benefits of file type > guessing from the extension are minimal (since a user can probably do > that more reliably himself, knowing the file names he's likely to > encounter). And since file type guessing will not be foolproof, it > may even be confusing. Once file type guessing is available in > Biopython though, we're committed to it and we'll have to support it. > So I'd be happier without the file type guessing functionality. > > That said, if somebody really wants it, I can live with it. I agree that we shouldn't implement file format guessing based on the contents of a file (unless, as you say, we get strong feedback wanting it). I personally want the file extension to format mapping, but then I am fairly disciplined about using file extensions. As I seem to be the only voice advocating this, it looks like I may have to give in... Is it worth asking on the main discussion list to canvas opinion? 
Maybe we should settle on the function names before doing that - it
would be better to replace the current function names now, before too
many people are used to them.

Functions and Naming
====================

This is where I think things stand for Bio/SeqIO/__init__.py

We have functions to do the following, where "file" may mean just a
handle, or perhaps the choice of a handle or filename (see above):

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

Possible names without the digit two: FileToSequenceIterator,
SequencesToDict, SequencesToAlignment, and SequencesToFile

I think Michiel wanted to drop the following "wrapper functions" as code
bloat:

(*) File to list of SeqRecord objects, currently File2SequenceList
    Just use list(File2SequenceIterator(...)) instead

(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
    Just use SequenceIter2Dict(File2SequenceIterator(...)) instead

(*) File to alignment, currently File2Alignment
    Just use Iter2Alignment(File2SequenceIterator(...))

The reason I invented the above three examples was so I could do things
like this in one line (assuming my files have valid known extensions):

rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align = File2Alignment(filename="demo.sth")

or perhaps:

align = File2Alignment(filename="demo.aln", format="clustal")

The alternative suggestions seem to lead to using file handles and an
explicit format, with a second function to convert from an iterator if
required.
While this can be done in one line - I find the following much less straight forward: rec_iter = File2SequenceIterator(open("demo.faa"), "fasta") rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank")) rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"), "fasta")) align = Iter2Alignment(File2SequenceIterator(open("demo.sth"), "stockholm")) Peter From sbassi at gmail.com Sat Nov 4 20:48:22 2006 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 4 Nov 2006 17:48:22 -0300 Subject: [Biopython-dev] Microbiology module Message-ID: I am working in functions for industrial microbiology. Like: Growth rate equations, Continuous culture equations, batch culture, yields for different source of energy (and for fermentation or respiration), oxygen consume rate, constants, thermodynamic equations used in bioreactors, cell cultures and so on. Biopython is lacking such a module, but I am not sure if this is out of scope. Is there a chance to include it in Biopython, or this is not useful? I think this could extend Biopython into a whole new area (bioprocess and microbiology). Please tell me what maintainers think about this. If this idea is rejected, I will make ugly and uncommented code for my own consuming, but if passed, I will write very nice and documented for people to see :) Best regards, SB. -- Bioinformatics news: http://www.bioinformatica.info Lriser: http://www.linspire.com/lraiser_success.php?serial=318 From sbassi at gmail.com Sun Nov 5 14:49:20 2006 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 5 Nov 2006 11:49:20 -0300 Subject: [Biopython-dev] Microbiology module In-Reply-To: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com> References: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com> Message-ID: On 11/5/06, Thomas Hamelryck wrote: > > Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right? 
> Yes, some methods could be used as a base for systems biology. From thamelry at binf.ku.dk Sun Nov 5 14:33:53 2006 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Sun, 5 Nov 2006 15:33:53 +0100 Subject: [Biopython-dev] Microbiology module In-Reply-To: References: Message-ID: <2d7c25310611050633n19deb680r5cbf936195110b2@mail.gmail.com> On 11/4/06, Sebastian Bassi wrote: > > I am working in functions for industrial microbiology. Like: > Growth rate equations, Continuous culture equations, batch culture, > yields for different source of energy (and for fermentation or > respiration), oxygen consume rate, constants, thermodynamic equations > used in bioreactors, cell cultures and so on. Sounds like a fun project, and a potentially valuable addition to Biopython. I guess some of the topics you mention might be of relevance to systems biology, right? Best regards, ---- Thomas Hamelryck, Marie Curie EU-Research fellow Bioinformatics center Institute of Molecular Biology University of Copenhagen Universitetsparken 15 - Building 10 DK-2100 Copenhagen ? Denmark Homepage: http://www.binf.ku.dk/Protein_structure From idoerg at burnham.org Tue Nov 7 17:48:05 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Tue, 07 Nov 2006 09:48:05 -0800 Subject: [Biopython-dev] InterProScan parser? Message-ID: <4550C6D5.10606@burnham.org> Hi, Does anybody have an interproscan parser, by any chance? Preferably for the XML or EBIXML output. Thanks, Iddo -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. 
La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From bugzilla-daemon at portal.open-bio.org Wed Nov 8 17:13:05 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:13:05 -0500 Subject: [Biopython-dev] [Bug 2137] New: Install from CVS fails on clistfnsmodule.c compilation Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2137 Summary: Install from CVS fails on clistfnsmodule.c compilation Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: chris.lasher at gmail.com On November 7, 2006, I did a fresh checkout of BioPython from the CVS repository. Attempts to build/install the CVS checkout are failing on attempts to compile Bio/clistfnsmodule.c. The main culprit seems to be a missing file, Python.h. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 17:15:42 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:15:42 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081715.kA8HFg6e017131@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #1 from chris.lasher at gmail.com 2006-11-08 12:15 ------- Created an attachment (id=497) --> (http://bugzilla.open-bio.org/attachment.cgi?id=497&action=view) Output from failed installation. This is the output from my failed installation. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 17:33:41 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 12:33:41 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081733.kA8HXfpb018644@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2006-11-08 12:33 ------- This very much looks like a problem with your Python installation. Do you have the Python.h header file on your system? This problem may arise if you installed python using an rpm. If so, make sure to install the python-devel rpm also. That one contains Python.h. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 18:41:43 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 13:41:43 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081841.kA8IfhFJ023784@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 ------- Comment #3 from chris.lasher at gmail.com 2006-11-08 13:41 ------- (In reply to comment #2) > This very much looks like a problem with your Python installation. Do you have > the Python.h header file on your system? > This problem may arise if you installed python using an rpm. If so, make sure > to install the python-devel rpm also. That one contains Python.h. > Good call! My apologies, I feel foolish now. 
For Debian/*buntu users, the package to get is python-dev. Should I add something about the Python development packages being necessary for installation from CVS source on http://biopython.org/wiki/CVS ? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 8 19:05:13 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Nov 2006 14:05:13 -0500 Subject: [Biopython-dev] [Bug 2137] Install from CVS fails on clistfnsmodule.c compilation In-Reply-To: Message-ID: <200611081905.kA8J5DIg025030@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2137 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-11-08 14:05 ------- > Should I add something about the Python development packages being necessary > for installation from CVS source on http://biopython.org/wiki/CVS ? The Python development packages are always needed, so also when installing an official Biopython release. If you could add some text to that effect to the Biopython wiki somewhere, that would be great. Closing this bug. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
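For anyone hitting the same missing Python.h error, a quick check like the one below can confirm whether the development headers are present before attempting a build. This is only a sketch: it relies on the standard-library sysconfig module found in current Python versions (not the Python of this thread's era), and the helper names are made up.

```python
import os
import sysconfig


def python_header_path():
    """Return the path where Python.h is expected for this interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def have_python_headers():
    """True if Python.h exists, i.e. C extensions can likely be compiled."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    if have_python_headers():
        print("Found", python_header_path())
    else:
        print("Python.h not found; install the Python development package")
```

On rpm-based systems the missing piece is typically python-devel, and on Debian-style systems python-dev, as noted in the comments above.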
From idoerg at burnham.org Thu Nov 9 02:39:23 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Wed, 08 Nov 2006 18:39:23 -0800
Subject: [Biopython-dev] [BioPython] EUtils module
In-Reply-To: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
References: <20061109014938.11961.qmail@web38113.mail.mud.yahoo.com>
Message-ID: <455294DB.6000105@burnham.org>

Srinivas Iyyer wrote:
> Dear Group,
>
> I downloaded EUtils module.
>
> I am trying to reproduce the code given in :
>
> http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
>
> I am getting Errors.

This is code from an alpha version of EUtils used at a presentation. I
don't think it was meant to be reproducible, or even made it into the
final module. You might want to look under the hood. There is a README
file in the EUtils installation, which has some examples. But NCBI
changes the EUtils specifications quite frequently, so chances are, if
no one has used EUtils for a while, that it might be broken.

>
> I want to know which databases in Entrez are supported
> by EUtils.
>
> Could any one please help me whats the problem.
>
> Are not many people using EUtils.
>
> Thanks
>
>>>> import EUtils
>>>> dbs = EUtils.dblist()
>
> Traceback (most recent call last):
>   File "", line 1, in -toplevel-
>     dbs = EUtils.dblist()
> AttributeError: 'module' object has no attribute
> 'dblist'
>>>> dbinfo = EUtils.dbinfo("pubmed")
>
> Traceback (most recent call last):
>   File "", line 1, in -toplevel-
>     dbinfo = EUtils.dbinfo("pubmed")
> AttributeError: 'module' object has no attribute
> 'dbinfo'
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From mdehoon at c2b2.columbia.edu Fri Nov 10 06:28:49 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Fri, 10 Nov 2006 01:28:49 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <454B2C81.9090309@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> Message-ID: <45541C21.6080402@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > Currently the individual format specific iterators just require a handle > (and not a filename). Are we all happy with this? Happy. > We could make the handle and format the first arguments as a compromise? If in doubt, don't add it to Biopython! It's much easier to add a functionality later, should the need arise, than to remove one. > I personally want the file extension to format mapping, but then I am > fairly disciplined about using file extensions. As I seem to be the > only voice advocating this, it looks like I may have to give in... > > Is it worth asking on the main discussion list to canvas opinion? Sure, go ahead. But ask for *why* a user wants file extension to format mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to know which usage case that we haven't thought about yet warrants file extension to format mapping. 
> We have functions to do the following, where "file" may mean just a > handle, or perhaps the choice of a handle or filename (see above): > > (*) File to SeqRecord iterator, currently File2SequenceIterator > (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict > (*) SeqRecord iterator/list to alignment, currently Iter2Alignment > (*) Write SeqRecordwher iterator/list to a file, currently Sequences2File If: File2SequenceIterator doesn't infer the file format from the extension and File2SequenceIterator takes handles only, so no file names, then: Why do we need the File2SequenceIterator function? Btw, we should make a new Biopython release once the dust settles. --Michiel. From idoerg at burnham.org Fri Nov 10 07:30:17 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Thu, 09 Nov 2006 23:30:17 -0800 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45541C21.6080402@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> Message-ID: <45542A89.6050202@burnham.org> Michiel de Hoon wrote: > Peter (BioPython Dev) wrote: >> Currently the individual format specific iterators just require a handle >> (and not a filename). Are we all happy with this? > > Happy. I second that. I have two arguments against that: 1) It is standard practice in biopython to pass file handle as arguments to a parser rather than a filename. If we break this, we would start thinking which parser takes a handle and which a filename. 
Things will be a mess.

2) Also, what if you are not passing a real file? E.g. I have
applications that pass StringIO streams into the parser. You are lumping
two levels of IO into one, and IMHO that is bad practice. In other
words, a filehandle can always be generated from a file, easily

>>> filefunc(open('myfile'))

but you cannot generate a file from a filehandle type of data. OK, you
can programmatically generate a tmp file for reading, but that places a
burden on the user.

3) The last argument against rigid filename extensions is
interoperability with other applications that generate those files.
Suppose you have one application that generates fasta files with a .tfa
extension, and another with a .fa extension and yet a third with .pfa
extensions... and those extensions are important to you for other
reasons, like knowing which is a nucleic acid file and which is protein.
Actually, all the NCBI genomic files are built like this... :)

OK, three arguments. I think that relying on filename extensions for
content is rather DOS-ish and places an extra burden on the user. I'm
suffering enough on my Windows machine with Rasmol trying to open all my
.pdb files. Including those where pdb stands for "Palm Pilot database"
rather than Protein Data Bank.

>
>> We could make the handle and format the first arguments as a compromise?
>
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise,
> than to remove one.

We could add the format as an OPTIONAL keyword argument, with a "None"
default value. And have the parser recognize the format from a lookahead
using a magic regexp for each format. The user-passed format overrides
the parser guesswork. Shouldn't be too hard to implement, as file
formats are very distinct.

>
>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions.
As I seem to be the >> only voice advocating this, it looks like I may have to give in... >> >> Is it worth asking on the main discussion list to canvas opinion? > > Sure, go ahead. But ask for *why* a user wants file extension to format > mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to > know which usage case that we haven't thought about yet warrants file > extension to format mapping. > >> We have functions to do the following, where "file" may mean just a >> handle, or perhaps the choice of a handle or filename (see above): >> >> (*) File to SeqRecord iterator, currently File2SequenceIterator >> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict >> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment >> (*) Write SeqRecordwher iterator/list to a file, currently Sequences2File > > If: > File2SequenceIterator doesn't infer the file format from the extension > and > File2SequenceIterator takes handles only, so no file names, > then: > Why do we need the File2SequenceIterator function? > > Btw, we should make a new Biopython release once the dust settles. > > --Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. 
La Jolla, CA 92037, USA T: +1 858 646 3100 x3516 http://iddo-friedberg.org http://BioFunctionPrediction.org From biopython-dev at maubp.freeserve.co.uk Tue Nov 14 00:49:02 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 14 Nov 2006 00:49:02 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45542A89.6050202@burnham.org> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> Message-ID: <4559127E.3050109@maubp.freeserve.co.uk> Iddo Friedberg wrote: > 3) The last argument against rigid filename extensions is > interoperability with other applications that generate those files. > Suppose you have one application that generates fasta files with a > .tfa extension, and another with a .fa extension and yet a third with > .pfa extensions... and those extensions are important to you for > other reasons, like knowing which is a nucleic acid file and which is > protein. Actually, all the NCBI genomic files are built like this... > :) Interesting tidbit. If you are using "exotic" file extensions, then you would have to explicitly tell my Bio.SeqIO code the file's format. Although "fa" is currently a known extension mapped to fasta format in Bio.SeqIO, your other examples are not. Are these other extensions used outside the internal systems of the NCBI? > OK, three arguments. 
I think that relying on filename extensions for > content is rather DOS-ish and places an extra burden on the user. I'm not trying to force anyone into using specific filename extensions - I'm trying to make life easier for people who already do this (or who download their data from online sources like the NCBI or PFAM - which do seem to be consistent in their naming conventions). > I'm suffering enough on my Windows machine with Rasmol trying to open > all my .pdb files. Including those where pdb stands for "Palm Pilot > database" rather than Protein Data Bank. Yes - multiple interpretations of a given file format are a problem. I've noticed that same PDB extension clash too (but I don't use a Palm pilot any more). Can anyone think of any common extensions used for more than one file format? I know Clustal uses *.aln for its alignments which is perhaps asking for trouble... > We could add the format as a OPTIONAL keyword argument, with a "None" > default value. And have the parser recognize the format from a > lookahead using a magic regexp fro each format. The user passed > format overrides the parser guesswork. Shouldn't be too hard to > implement, as file formats are very distinct. Currently the format is an optional keyword argument defaulting to None. When it is omitted, I currently use a limited filename extension to format mapping (assuming the filename is available) to deduce/guess the format. 
Peter From idoerg at burnham.org Tue Nov 14 17:19:14 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Tue, 14 Nov 2006 09:19:14 -0800 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4559127E.3050109@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu> <4545D9F1.2040902@maubp.freeserve.co.uk> <45483791.7070803@c2b2.columbia.edu> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk> Message-ID: <4559FA92.8070408@burnham.org> Peter (BioPython Dev) wrote: > Iddo Friedberg wrote: >> 3) The last argument against rigid filename extensions is >> interoperability with other applications that generate those files. >> Suppose you have one application that generates fasta files with a >> .tfa extension, and another with a .fa extension and yet a third with >> .pfa extensions... and those extensions are important to you for >> other reasons, like knowing which is a nucleic acid file and which is >> protein. Actually, all the NCBI genomic files are built like this... >> :) > > Interesting tidbit. > > If you are using "exotic" file extensions, then you would have to > explicitly tell my Bio.SeqIO code the file's format. > > Although "fa" is currently a known extension mapped to fasta format in > Bio.SeqIO, your other examples are not. Are these other extensions used > outside the internal systems of the NCBI? I wouldn't say tidbit or exotic. It is very prevalent; NCBI's GenBank genomic repositories are very much deferred to.
The point is, since NCBI uses one standard of file extensions for its genomic databases, TIGR another (actually, TIGR points to GenBank for completed genomes), UCSC a third... then maybe relying on file suffixes is not such a great idea. See for example the E. coli genome: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12 Some are in fasta format, but have different contents: whole genome, noncoding RNA, protein. Same with those that are in GenBank format. So the NCBI suffixes denote not only the file format, but the biological content as well. Also, for the reasons I gave in my previous email, I think we should stick with passing file handles, not file names. There is no real need to pass a filename rather than a file handle. If you need information from the filename, you can read the filename from the file handle:

>>> foo = open('foo')
>>> print(foo.name)
foo

And the functions could still accept StringIO streams if needed. > >> > > I'm not trying to force anyone into using specific filename extensions - > I'm trying to make life easier for people who already do this (or who > download their data from online sources like the NCBI or PFAM - which do > seem to be consistent in their naming conventions). > You cannot rely on such consistency prevailing. Especially not with NCBI. ;) > >> We could add the format as an OPTIONAL keyword argument, with a "None" >> default value. And have the parser recognize the format from a >> lookahead using a magic regexp for each format. The user passed >> format overrides the parser guesswork. Shouldn't be too hard to >> implement, as file formats are very distinct. > > Currently the format is an optional keyword argument defaulting to None. > When it is omitted, I currently use a limited filename extension to > format mapping (assuming the filename is available) to deduce/guess the > format. > Ideally, the data format should be supplied by the user.
Second best is inferring from parsing the first line or so in the file. Third is filename extension. But neither the second nor the third option is very good practice, IMHO. > Peter > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From bugzilla-daemon at portal.open-bio.org Tue Nov 14 20:48:49 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Nov 2006 15:48:49 -0500 Subject: [Biopython-dev] [Bug 2143] New: Error parsing BLAT output (using out=blast format) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2143 Summary: Error parsing BLAT output (using out=blast format) Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: fgibbons at hms.harvard.edu Attempting to parse this BLAT output (see below) raises an "I couldn't find the sbjct in" exception. After looking at the code, it seems to me that the problem is an overly strict regexp that relies on a single space between the "Sbjct:" and the integer that follows it. Replace the literal space with '\s*', and the error goes away. This in fact matches the regexp used to match the "Query:". I can't imagine that it might hurt things, even in the main NCBIBlastParser, but you never know.... (All of the above refers to the method sbjct in class _HSPConsumer, file NCBIStandalone.py) -Frank Gibbons (fgibbons at hms.harvard.edu) ------------------------------------- Reference: Kent, WJ.
(2002) BLAT - The BLAST-like alignment tool Query= NCU00001 (54 letters) Database: all_proteins.fasta 293697 sequences; 128,064,135 total letters Score E Sequences producing significant alignments: (bits) Value MGG_10872.5 101 1e-21 >MGG_10872.5 Length = 245 Score = 101 bits (260), Expect = 1e-21 Identities = 54/54 (100%), Positives = 54/54 (100%), Gaps = 0/54 (0%) Query: 1 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 54 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL Sbjct: 192 MAINSGTRRLKNSVYNPLAEISVYVGKIKISLIEVISNIVKEKNPEVFIIRIRL 245 Database: all_proteins.fasta BLASTP 2.2.4 [blat] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 14 22:03:40 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Nov 2006 17:03:40 -0500 Subject: [Biopython-dev] [Bug 2143] Error parsing BLAT output (using out=blast format) In-Reply-To: Message-ID: <200611142203.kAEM3eu3014395@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2143 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-11-14 17:03 ------- Fixed in CVS, thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
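The regexp change proposed in bug 2143 can be illustrated with a small standalone sketch. This is not the code from NCBIStandalone.py: the pattern names, the helper function and the sample lines (including their spacing) are assumptions, chosen only to show why a literal single space after "Sbjct:" rejects lines that `\s*` accepts.

```python
import re

# Hypothetical patterns for illustration; not the regexps in NCBIStandalone.py.
STRICT = re.compile(r"Sbjct: (\d+)")     # requires exactly one space
RELAXED = re.compile(r"Sbjct:\s*(\d+)")  # accepts any amount of whitespace


def sbjct_start(line, pattern):
    """Return the subject start coordinate, or None if the line does not match."""
    match = pattern.match(line)
    return int(match.group(1)) if match else None


# Sample lines with assumed spacing (the archive does not preserve the original).
single_space = "Sbjct: 192 MAINSGTRRL 201"
extra_space = "Sbjct:   192 MAINSGTRRL 201"

for line in (single_space, extra_space):
    print(repr(line), sbjct_start(line, STRICT), sbjct_start(line, RELAXED))
```

The relaxed pattern still matches everything the strict one did, which is consistent with the reporter's expectation that the change should not hurt the main BLAST parser.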
From chris.lasher at gmail.com Wed Nov 15 00:51:26 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 14 Nov 2006 19:51:26 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4559FA92.8070408@burnham.org> References: <45425925.8090607@maubp.freeserve.co.uk> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk> <4559FA92.8070408@burnham.org> Message-ID: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> Just pitching in again, I agree with Michiel with regards to the list of functions necessary. To restate, these would be: (*) File to SeqRecord iterator, currently File2SequenceIterator (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict (*) SeqRecord iterator/list to alignment, currently Iter2Alignment (*) Write SeqRecord iterator/list to a file, currently Sequences2File I also think there's wisdom to Michiel's statement that it's easier to add functionality than it is to remove it. I agree with Iddo on his arguments against dealing with filename extensions. Upon reflection, however, I feel comfortable with a lookahead-based file-format guesser for the sake of convenience and as a matter of compromise to those who are not keen on being explicit in regards to every detail. It's been stated that bio file formats are quite distinct. I tried to think of a counterexample but failed. Finally, to reply to Michiel's question on release, it does seem that once SeqIO is solidified it would certainly be worthy of a new release. SeqIO is a big step in a good direction for BioPython.
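A lookahead-based guesser of the kind discussed here could be sketched as below. This is an assumption-laden illustration, not anything in Bio.SeqIO: the function names and format labels are hypothetical, only three easily recognised first-line signatures are checked, and the handle must support tell() and seek() so the lookahead can be undone. A format given by the caller always overrides the guess.

```python
import io


def guess_format(handle):
    """Guess a format label from the first non-blank line (hypothetical helper).

    The handle is rewound afterwards. Returns None if nothing is recognised.
    """
    start = handle.tell()
    first = ""
    while True:
        line = handle.readline()
        if not line:          # end of file reached
            break
        if line.strip():      # first non-blank line
            first = line
            break
    handle.seek(start)        # undo the lookahead
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith("CLUSTAL"):
        return "clustal"
    return None               # ambiguous or unknown content


def resolve_format(handle, format=None):
    """Use the caller's format if given, otherwise guess from the content."""
    if format is not None:
        return format
    guessed = guess_format(handle)
    if guessed is None:
        raise ValueError("Unrecognised format")
    return guessed
```

Because it works on a handle rather than a filename, the same function accepts StringIO objects as well as open files.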
Chris From biopython-dev at maubp.freeserve.co.uk Wed Nov 15 12:52:58 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Wed, 15 Nov 2006 12:52:58 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> References: <45425925.8090607@maubp.freeserve.co.uk> <45487277.6080308@maubp.freeserve.co.uk> <128a885f0611011949x52044856qccfef42deffdbebc@mail.gmail.com> <4549E95A.6080605@maubp.freeserve.co.uk> <454AAD1F.5050006@c2b2.columbia.edu> <454B2C81.9090309@maubp.freeserve.co.uk> <45541C21.6080402@c2b2.columbia.edu> <45542A89.6050202@burnham.org> <4559127E.3050109@maubp.freeserve.co.uk> <4559FA92.8070408@burnham.org> <128a885f0611141651g3e010050i84d8aea766ebdc31@mail.gmail.com> Message-ID: <455B0DAA.9040000@maubp.freeserve.co.uk> Chris Lasher wrote: > Just pitching in again, I agree with Michiel with regards to the list > of functions necessary. To restate, these would be: On Monday I switched from the "2" pun names to "To" giving the following: (*) FileToSequenceIterator, previously File2SequenceIterator File to SeqRecord iterator (*) SequencesToDict, previously SequenceIter2Dict SeqRecord iterator/list to dictionary (*) SequencesToAlignment, previously Iter2Alignment SeqRecord iterator/list to alignment (*) SequencesToFile, previously Sequences2File Write SeqRecord iterator/list to a file I agree that these are all important "core functions". > I also think there's wisdom to Michiel's statement it's easier to add > functionality than it is to remove it. Very true. On that note... We also currently have three "convenience functions", which seem scheduled for removal based on these discussions. 
Unless anyone speaks up for these three, I'll remove them (and update the Wiki to match): (*) FileToSequenceList previously called File2SequenceList (*) FileToSequenceDict previously called File2SequenceDict (*) FileToAlignment previously called File2Alignment These simply wrap FileToSequenceIterator with the list, SequencesToDict or SequencesToAlignment function. > I agree with Iddo on his arguments against dealing with filename > extensions. Upon reflection, however, I feel comfortable with a > lookahead-based file-format guesser for the sake of convenience and as > a matter of compromise to those who are not keen on being explicit in > regards to every detail. It's been stated that bio file formats are > quite distinct. I tried to think of a counterexample but failed. I would say telling EMBL and Swiss (aka SwissProt aka UniProt) apart is tricky. They both start with an "ID ..." line and finish with "//"; the feature table format is the big difference. If we did try guessing file formats by looking at the file contents, I would not try and guess every file format which Bio.SeqIO could read - just those which are easily identifiable. In this case, I would be inclined not to try and tell EMBL and SwissProt apart, and simply abort with "Unrecognised format". Peter From biopython-dev at maubp.freeserve.co.uk Tue Nov 28 13:24:35 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 28 Nov 2006 13:24:35 +0000 Subject: [Biopython-dev] [BioPython] Problems with Win Release for Python 2.5: Numeric, KDTree In-Reply-To: <005301c7129b$f3222300$b400a8c0@Sirius> References: <005301c7129b$f3222300$b400a8c0@Sirius> Message-ID: <456C3893.6060402@maubp.freeserve.co.uk> Hendrik Weisser wrote: > The main question for me is whether these issues (the 2nd, mostly) can be > addressed quickly, or whether it is recommended to use the "old" Python 2.4 > and corresponding packages for the time being. Can anyone help me with that?
Yes - assuming you don't have all the compilers and stuff to compile your own libraries (and therefore need to use the Windows installers), using Windows with Python 2.4 and Numeric 24.2 with BioPython 1.42 should be fine. Personally I use Python 2.4 on Linux (as shipped with the distribution) and Python 2.3 on my Windows machine. Both work fine with BioPython and Numeric - although I have not used Bio.PDB very much. Peter From bugzilla-daemon at portal.open-bio.org Wed Nov 29 19:03:08 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Nov 2006 14:03:08 -0500 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200611291903.kATJ38DJ007489@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 ------- Comment #1 from grunberg at embl.de 2006-11-29 14:03 ------- Things get worse with the current blastall 2.2.15. _scan_parameters in NCBIStandalone.py expects "Number of HSP's better" which in the later blastall versions has changed to: "Number of sequences better". This prevents the parser from fetching the next two lines even though they would be there and then we get exceptions etc. Another independent problem occurs further down -- The lines:: T: 11 A: 40 have now changed to:: Neighboring words threshold: 11 Window for multiple hits: 40 and again we run into an exception. Both problems are also present in the latest CVS snapshot. Both can be fixed with some additional attempt_read_and_call but I am not sure whether my quick and dirty fixes follow the right spirit...
Change A:
---------

INSERT BEFORE...::

    # not in blastx 2.2.1
    attempt_read_and_call(uhandle, consumer.query_length,
                          has_re=re.compile(r"[Ll]ength of query"))

...These two statements::

    # in blastall 2.2.15
    attempt_read_and_call(uhandle, consumer.noevent,
                          start="Number of HSP's gapped:")
    attempt_read_and_call(uhandle, consumer.noevent,
                          start="Number of HSP's successfully")

Change B:
---------

REPLACE::

    # not in BLASTN 2.2.9
    attempt_read_and_call(uhandle, consumer.threshold, start='T')
    read_and_call(uhandle, consumer.window_size, start='A')

BY::

    # not in BLASTN 2.2.9
    attempt_read_and_call(uhandle, consumer.threshold, start='T')
    attempt_read_and_call(uhandle, consumer.window_size, start='A')
    ## renamed in BLASTALL 2.2.15
    attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring')
    attempt_read_and_call(uhandle, consumer.window_size, start='Window')

Could someone with more Biopython experience please validate and apply the fix? THX!

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
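The idea behind Change B (accept either the old or the new label for the same parameter, and tolerate its absence) can be shown in isolation. The sketch below is a standalone illustration and does not use or reproduce the Biopython scanner/consumer machinery: the function name, the dictionary keys and the regular expressions are assumptions, built only from the two label spellings quoted in the bug report.

```python
import re

# Old and new spellings of the same two parameters, as quoted in the report.
THRESHOLD = re.compile(r"^(?:T|Neighboring words threshold):\s*(\d+)")
WINDOW = re.compile(r"^(?:A|Window for multiple hits):\s*(\d+)")


def parse_parameters(lines):
    """Collect threshold and window size from whichever label style is present.

    Missing lines are simply skipped, mirroring the "attempt" behaviour
    described in the report (hypothetical helper).
    """
    found = {}
    for line in lines:
        line = line.strip()
        for name, pattern in (("threshold", THRESHOLD), ("window_size", WINDOW)):
            match = pattern.match(line)
            if match:
                found[name] = int(match.group(1))
    return found


old_style = ["T: 11", "A: 40"]
new_style = ["Neighboring words threshold: 11", "Window for multiple hits: 40"]
print(parse_parameters(old_style))
print(parse_parameters(new_style))
```

Both label styles yield the same result, so downstream code does not need to know which blastall version produced the report.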