From biopython-dev at maubp.freeserve.co.uk Sat Feb 3 14:41:43 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sat, 03 Feb 2007 19:41:43 +0000 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> Message-ID: <45C4E577.2060405@maubp.freeserve.co.uk> I've compiled a rough list of modules using Numeric with the aid of grep. As far as I can tell, we have two big lumps of code (Michiel's Bio.Cluster module and Thomas' Bio.PDB and SVDSuperimposer modules) plus a selection of miscellaneous bits. Thomas & Michiel - have you looked or made plans to moving your code? Also, do we have any active developers familiar with the following modules: Jeffrey Chang (no longer active?): NaiveBayes.py MaxEntropy.py Harry Zuzan (no longer active?): Affy/CelFile.py and Affy/celmodule.cc Unnamed authors: KDTree/KDTree.py (includes C++ code so tricky) LogisticRegression.py MarkovModel.py Statistics/lowess.py distance.py kNN.py Some of which could be converted "by hand" almost trivially (e.g. distance.py), but see also: http://www.scipy.org/Porting_to_NumPy Peter From bsouthey at gmail.com Tue Feb 6 09:10:25 2007 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 6 Feb 2007 08:10:25 -0600 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: <45C4E577.2060405@maubp.freeserve.co.uk> References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> <45C4E577.2060405@maubp.freeserve.co.uk> Message-ID: Hi, I would be interested in trying to help out on this as my time permits. Is there any plan of action? Bruce On 2/3/07, Peter wrote: > I've compiled a rough list of modules using Numeric with the aid of grep. > > As far as I can tell, we have two big lumps of code (Michiel's > Bio.Cluster module and Thomas' Bio.PDB and SVDSuperimposer modules) plus > a selection of miscellaneous bits. 
> > Thomas & Michiel - have you looked or made plans to moving your code? > > Also, do we have any active developers familiar with the following modules: > > Jeffrey Chang (no longer active?): > NaiveBayes.py > MaxEntropy.py > > Harry Zuzan (no longer active?): > Affy/CelFile.py and Affy/celmodule.cc > > Unnamed authors: > KDTree/KDTree.py (includes C++ code so tricky) > LogisticRegression.py > MarkovModel.py > Statistics/lowess.py > distance.py > kNN.py > > Some of which could be converted "by hand" almost trivially (e.g. > distance.py), but see also: > > http://www.scipy.org/Porting_to_NumPy > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython-dev at maubp.freeserve.co.uk Tue Feb 6 10:16:42 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 06 Feb 2007 15:16:42 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <44461DD4.80306@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> Message-ID: <45C89BDA.6060803@maubp.freeserve.co.uk> Albert Krewinkel wrote: >> I am trying to parse a EMBL-formated file with biopython, but I >> couldn't find any working parser for this. When I try to use the >> Martel-based parser as described in one of the mailinglist-threads, I >> get the following error... Peter wrote: > OK, we have the following files in BioPython: > > Bio/formatdefs/embl.py (wrapper) > Bio/expressions/embl/__init__.py (dummy file) > Bio/expressions/embl/embl65.py (contains Martel definition) > > ... > > It does look like an out of date [Martel] file format definition in > BioPython (assuming that example code from Jeff Chang is fine). 
I haven't touched the Martel file format definition, but I have been looking at EMBL parsing for Bio.SeqIO Based on my experience with the poor performance of the old Martel GenBank on large files, I would expect the same issue to apply to the Martel EMBL parser (even if it was updated). So, I have been looking at re-writing my Python based GenBank parser (in Bio.GenBank) instead: Notes and attachment showing the idea here: http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c14 I am thinking of sticking with the current scanner/consumer model in Bio/GenBank/__init__.py but simply replacing the (GenBank only) _Scanner class with a "GenBank scanner" and an "EMBL scanner" (based on a common base class which will handle the feature table). These new scanners would both feed into the existing consumers. In particular, the "Feature Consumer" which builds a SeqRecord with SeqFeature objects. I have this more or less working. Does this sound like a sensible way to include EMBL support? While it would be possible to use the new EMBL parser in much the same way as the current GenBank parser, I would recommend most users simply invoke them via Bio.SeqIO for normal work. I could put most of the new code in Bio/GenBank and create a new module/directory called Bio/EMBL, or just stick everything in Bio/GenBank - I'm not that fussed either way given I want to push Bio.SeqIO as the main interface. (Once that is settled I can rearrange the new code to slot in as appropriate.) Michiel - how does this plan sound? And should I try and get these changes done and tested in time for the next release - or wait until afterwards? 
Peter From mdehoon at c2b2.columbia.edu Tue Feb 6 12:05:48 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 06 Feb 2007 12:05:48 -0500 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45C89BDA.6060803@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> Message-ID: <45C8B56C.2070809@c2b2.columbia.edu> Peter wrote: > Does this sound like a sensible way to include EMBL support? > > While it would be possible to use the new EMBL parser in much the same > way as the current GenBank parser, I would recommend most users simply > invoke them via Bio.SeqIO for normal work. That sounds good to me. > I could put most of the new code in Bio/GenBank and create a new > module/directory called Bio/EMBL, or just stick everything in > Bio/GenBank - I'm not that fussed either way given I want to push > Bio.SeqIO as the main interface. That sounds good too. > Michiel - how does this plan sound? And should I try and get these > changes done and tested in time for the next release - or wait until > afterwards? Either way is fine with me. We can do the Bronx release in the near future, and do another release when the EMBL stuff is done. But it's up to you. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Tue Feb 6 11:56:47 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 06 Feb 2007 11:56:47 -0500 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> <45C4E577.2060405@maubp.freeserve.co.uk> Message-ID: <45C8B34F.9090502@c2b2.columbia.edu> Bruce Southey wrote: > Hi, > I would be interested in trying to help out on this as my time > permits. Is there any plan of action? > Not as far as I know. 
Can I nominate you as the Biopython Numeric-to-numpy conversion project leader?
According to the numpy website, this conversion should be easy, but not
without some effort. Basically, you'd have to find out if the Python
scripts that use Numerical Python need to be modified for numpy, and if
the C-code using Numerical Python still compiles and runs correctly with
numpy. If few changes are needed, we may be able to make Biopython
compatible with both Numerical Python and numpy.

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From bugzilla-daemon at portal.open-bio.org Wed Feb 7 12:07:27 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 7 Feb 2007 12:07:27 -0500
Subject: [Biopython-dev] [Bug 1981] GenBank parser generates unusual feature qualifiers.
In-Reply-To:
Message-ID: <200702071707.l17H7RCH031173@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1981

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2007-02-07 12:07 EST -------
As part of my work to include EMBL support, I ended up changing the GenBank
parser behaviour for the newline/whitespace in feature qualifier descriptions
to what Marc was suggesting.

I haven't noticed any side effects yet...

Marking this bug as fixed.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
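
A minimal sketch of what "compatible with both Numerical Python and numpy" could look like at the import level, as raised in the Numeric-to-NumPy thread above. The helper name load_array_module is hypothetical and is not part of Biopython; the sketch only picks whichever package is importable and does not paper over any API differences between the two, which would still need checking module by module.

```python
import importlib


def load_array_module(candidates=("numpy", "Numeric")):
    """Return (name, module) for the first importable package.

    Hypothetical helper for illustration only: candidates are tried
    in order, so numpy is preferred and Numeric is the fallback.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed; try the next candidate.
            continue
    raise ImportError(
        "none of these packages could be imported: %s" % ", ".join(candidates)
    )
```

A module would call this once at import time and use the returned module object everywhere else, which keeps the choice of array package in a single place.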
From biopython-dev at maubp.freeserve.co.uk Wed Feb 7 12:04:04 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 07 Feb 2007 17:04:04 +0000
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <45C8B56C.2070809@c2b2.columbia.edu>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu>
Message-ID: <45CA0684.40107@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>> Does this sound like a sensible way to include EMBL support?
>>
>> ...
>
> Either way is fine with me. We can do the Bronx release in the near
> future, and do another release when the EMBL stuff is done. But it's up
> to you.

This took longer than I expected, but it's done now.

There is a new file Bio/GenBank/Scanner.py which includes a base "INSDC
scanner" which handles the common code (e.g. feature tables) with two
subclasses, a GenBankScanner and an EmblScanner.

I have updated Bio/GenBank/__init__.py to remove my old GenBank-only
scanner, and call the new GenBankScanner instead.

I have also updated Bio.SeqIO to use this new code for both GenBank and
EMBL formats.

http://www.biopython.org/wiki/SeqIO

Note: The handling of newlines and white spaces has changed slightly as
a result of these changes. I updated the expected output for the
test_GenBank unit test.

Incidentally I think this "fixes" Bug 1981:
http://bugzilla.open-bio.org/show_bug.cgi?id=1981

Other than that, touch wood, nothing should have changed for GenBank
users. The relevant unit tests look fine.

The EMBL support has a few bits that need polishing (search for TODO in
Bio/GenBank/Scanner.py for points that I noted at the time), and some
rigorous testing of course.

I should probably add some EMBL examples to the SeqIO unit test...
Peter From bugzilla-daemon at portal.open-bio.org Mon Feb 12 12:23:05 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 12 Feb 2007 12:23:05 -0500 Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails In-Reply-To: Message-ID: <200702121723.l1CHN5Qg007996@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1921 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-02-12 12:23 EST ------- Leighton, Are you still happy with the patch as submitted? If so, I'll commit it to CVS. I know next to nothing about SQL, so I'll have to rely on your judgment here. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Feb 12 12:24:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 12 Feb 2007 12:24:45 -0500 Subject: [Biopython-dev] [Bug 1982] Patch to BioSQL/Loader.py In-Reply-To: Message-ID: <200702121724.l1CHOjv8008116@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1982 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-02-12 12:24 EST ------- > This patch contains print and sys.stdout.write statements that report back to > the user when errors/unusual events occur. Is this acceptable within the > BioPython style? Using Python's warning module would be better. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
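
A minimal sketch of the suggestion in the Bug 1982 comment above: reporting an unusual event through Python's warnings module rather than print or sys.stdout.write. The function load_record and its arguments are hypothetical stand-ins and are not taken from BioSQL/Loader.py; only warnings.warn itself is standard library.

```python
import warnings


def load_record(record_id, known_ids):
    """Hypothetical loader step: skip records that were already loaded.

    Instead of printing a message, the unusual event is reported with
    warnings.warn, so callers can silence it, log it, or turn it into
    an error through the standard warnings filters.
    """
    if record_id in known_ids:
        warnings.warn(
            "record %r is already loaded; skipping" % record_id,
            UserWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return False
    known_ids.add(record_id)
    return True
```

Compared with print, this leaves the policy to the user: for example, warnings.simplefilter("error") turns every such event into an exception during testing.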
From bugzilla-daemon at portal.open-bio.org Tue Feb 13 10:28:07 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Feb 2007 10:28:07 -0500 Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails In-Reply-To: Message-ID: <200702131528.l1DFS7od009962@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1921 ------- Comment #6 from lpritc at scri.sari.ac.uk 2007-02-13 10:28 EST ------- Hi Michiel, I've had no further issues since I put in the patch locally, so it works for me, at least. No doubt someone will point out any further problems that I've missed in my usage ;) L. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 13 11:27:48 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Feb 2007 11:27:48 -0500 Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails In-Reply-To: Message-ID: <200702131627.l1DGRmuL012469@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1921 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-02-13 11:27 EST ------- Patch accepted in CVS, thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Feb 15 16:29:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 15 Feb 2007 16:29:45 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702152129.l1FLTjhH010350@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-02-15 16:29 EST ------- I would think that no "close" is needed in PDBParser's get_structure method. If the argument "file" is an open handle, get_structure closes it, which appears to be an unwanted side-effect. If the argument "file" is a string, then "file = open(file)" in get_structure opens a file, and Python will close it for you once the file variable goes out of scope (when the function returns). Thomas, am I missing something? Do you have any objections against removing the "close"? --Michiel. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 16 03:21:47 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 16 Feb 2007 03:21:47 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702160821.l1G8Ll30032274@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 ------- Comment #2 from thamelry at binf.ku.dk 2007-02-16 03:21 EST ------- (In reply to comment #0) > Do you have any objections against removing the > "close"? Nope - it makes sense to remove it. -Thomas -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
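
A minimal sketch of the handle-ownership point discussed for Bug 2169 above: a function that accepts either a filename or an open handle should not close a handle it was given. The function read_lines and the sample data are made up for illustration and are not PDBParser's code. As a design variant, the sketch closes files it opened itself explicitly rather than relying on the variable going out of scope.

```python
import io


def read_lines(source):
    """Read all lines from a filename or from an already-open handle.

    Only a file opened here is closed here; a handle passed in by the
    caller is left open, avoiding the side effect described above.
    """
    opened_here = isinstance(source, str)
    handle = open(source) if opened_here else source
    try:
        return handle.readlines()
    finally:
        if opened_here:
            handle.close()


# Usage with a caller-owned handle (made-up sample data).
caller_handle = io.StringIO("ATOM 1\nATOM 2\n")
lines = read_lines(caller_handle)
# The caller's handle is still open and can be rewound and reused.
caller_handle.seek(0)
```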
From bugzilla-daemon at portal.open-bio.org Fri Feb 16 10:40:31 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 16 Feb 2007 10:40:31 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702161540.l1GFeVuX018588@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-02-16 10:40 EST ------- I have removed the "close" in PDBParser.py in CVS. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Mon Feb 19 06:24:29 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Feb 2007 11:24:29 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45CA0684.40107@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> Message-ID: <45D988ED.3070309@maubp.freeserve.co.uk> Peter wrote: >>> Does this sound like a sensible way to include EMBL support? >>> >>> ... > > This took longer than I expected, but its done now. Has anyone had a chance to try out the revised EMBL/GenBank parser? I could ask on the main list, but as testing the EMBL parsing would require installing the CVS release (or updating just Bio/GenBank and Bio/SeqIO by hand) that seems a bit much to ask. There are three main things I would like feedback on: (a) Has any existing code using Bio.GenBank been affected at all. 
(b) Does Bio.SeqIO read your favourite EMBL/GenBank files?

(c) How does parsing the file as "genbank-cds" and "embl-cds" look?
i.e. this returns each CDS feature with its stated amino acid
translation as a SeqRecord.

Does anyone else think that getting the genes themselves in this way is
a useful option? I'm not sure about the simplistic code to choose the
SeqRecord id/name/description - this is difficult as there is a lot of
variation in annotation conventions.

> I should probably add some EMBL examples to the SeqIO unit test...

I have added a single record EMBL file to the test suite.

Peter

From mdehoon at c2b2.columbia.edu Mon Feb 19 10:42:11 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Mon, 19 Feb 2007 10:42:11 -0500
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <45D988ED.3070309@maubp.freeserve.co.uk>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk>
Message-ID: <45D9C553.1060006@c2b2.columbia.edu>

Hi Peter,

Currently, the SeqIO test fails on Mac OS X (see below). It looks like
this is due to a different line separator being used on Macintosh.

--Michiel.

mdehoon:~/biopython/Tests $ python run_tests.py test_SeqIO
test_SeqIO ...
FAIL
======================================================================
FAIL: test_SeqIO
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 148, in runTest
    self.runSafeTest()
  File "run_tests.py", line 161, in runSafeTest
    cur_test = __import__(self.test_name)
  File "/Users/mdehoon/biopython/Tests/test_SeqIO.py", line 261, in
    check_simple_write_read(records, test_write_read_non_alignment_formats)
  File "/Users/mdehoon/biopython/Tests/test_SeqIO.py", line 140, in check_simple_write_read
    WriteSequences(sequences=records, handle=handle, format=format)
  File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/__init__.py", line 229, in WriteSequences
    writer_class(handle).write_file(sequences)
  File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/Interfaces.py", line 235, in write_file
    self.write_records(records)
  File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/Interfaces.py", line 225, in write_records
    self.write_record(record)
  File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/FastaIO.py", line 108, in write_record
    assert os.linesep not in description
AssertionError

----------------------------------------------------------------------
Ran 1 test in 0.429s

FAILED (failures=1)
mdehoon:~/biopython/Tests $

From biopython-dev at maubp.freeserve.co.uk Tue Feb 20 05:48:45 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Feb 2007 10:48:45 +0000
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <45D9C553.1060006@c2b2.columbia.edu>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk> <45D9C553.1060006@c2b2.columbia.edu>
Message-ID: <45DAD20D.2020903@maubp.freeserve.co.uk>

Michiel de Hoon
wrote: > Hi Peter, > > Currently, the SeqIO test fails on Mac OS X (see below). It looks like > this is due to a different line separator being used on Macintosh. > > --Michiel. Embarrassingly, it also does this on Linux. I should have caught that... This was caused by the fasta writer checking that the record ID and description do not contain new lines, which should normally be the case. When I tried just "python test_SeqIO" I found that the SwissProt parser had created a record from SwissProt/sp003 whose description contained a new line. Arguably we should update Bio.SwissProt to clean up new line characters in the description... I have updated FastaIO.py to cope with newlines in the ID or description (by replacing them with spaces) and the SeqIO unit test now passes on Linux. It should be fine on MacOS too. Peter From bugzilla-daemon at portal.open-bio.org Thu Feb 22 09:36:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Feb 2007 09:36:18 -0500 Subject: [Biopython-dev] [Bug 2216] New: Bio/Nexus/Tree.py class Tree adds extra space in to_string Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2216 Summary: Bio/Nexus/Tree.py class Tree adds extra space in to_string Product: Biopython Version: Not Applicable Platform: PC OS/Version: Windows XP Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk The unit test for Bio.Nexus (test_Nexus.py) is failing due to an extra space at the end of the string representation of (some) trees: Output : "Test Tree module. ... tree tree1 = (((((('one should be punished, for (that)!','isn''that [a] strange name?'),'t2 the name'),t8,t9),t6),t7),(t5,t1)) ;\n" Expected: "Test Tree module. ... 
tree tree1 = (((((('one should be punished, for (that)!','isn''that [a] strange name?'),'t2 the name'),t8,t9),t6),t7),(t5,t1)); \n" I think this bug was introduced six months ago in revision 1.7 (fkauff) as part of a switch from building the tree string by concatenation to doing "".join(list of string) which should be faster. (Note that I just checked in a minor change to the tree parser to cope with some odd white space) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Sat Feb 24 18:41:36 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sat, 24 Feb 2007 23:41:36 +0000 Subject: [Biopython-dev] Bio.SeqIO Message-ID: <45E0CD30.4060108@maubp.freeserve.co.uk> Another update on the Bio.SeqIO code... Based on the previous discussions (e.g. Iddo's well argued emails in November 2006) I have now removed the code for mapping known file extensions to file formats. The file format is now a required argument (it can be made into an optional argument in future if we decide to support file format guessing one day). The provisional documentation is here, which now includes a few examples converting from one file format to another - selecting only certain records: http://www.biopython.org/wiki/SeqIO Are you all happy with the interface defined in Bio/SeqIO/__init__.py consisting of the four functions: SequenceIterator(handle, format) SequencesToDict(sequences, key_function=None) SequencesToAlignment(sequences, ...) WriteSequences(sequences, handle, format) Does anyone want to suggest different names for these functions? Do you think the argument order is sensible for WriteSequences? Note that we may want to improve the generic alignment class at some point (e.g. see bug 1944). 
If alignments could be initialized from a SeqRecord list/iterator this would make the SequencesToAlignment() function "obsolete"... Peter From mdehoon at c2b2.columbia.edu Sun Feb 25 00:14:31 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 00:14:31 -0500 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45DAD20D.2020903@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk> <45D9C553.1060006@c2b2.columbia.edu> <45DAD20D.2020903@maubp.freeserve.co.uk> Message-ID: <45E11B37.8040902@c2b2.columbia.edu> Peter wrote: > Michiel de Hoon wrote: >> Hi Peter, >> >> Currently, the SeqIO test fails on Mac OS X (see below). It looks like >> this is due to a different line separator being used on Macintosh. >> >> --Michiel. > I have updated FastaIO.py to cope with newlines in the ID or description > (by replacing them with spaces) and the SeqIO unit test now passes on > Linux. It should be fine on MacOS too. Latest version works fine on Mac OS. Thanks! --Michiel. From bugzilla-daemon at portal.open-bio.org Sun Feb 25 01:11:07 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Feb 2007 01:11:07 -0500 Subject: [Biopython-dev] [Bug 2216] Bio/Nexus/Tree.py class Tree adds extra space in to_string In-Reply-To: Message-ID: <200702250611.l1P6B7VJ013499@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2216 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-02-25 01:11 EST ------- Created an attachment (id=574) --> (http://bugzilla.open-bio.org/attachment.cgi?id=574&action=view) Patch to Bio/Nexus/Trees.py This patch removes the extra space in front of the ';'. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Sun Feb 25 06:42:21 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 06:42:21 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E1761D.4060906@c2b2.columbia.edu> Thanks Peter! Peter wrote: > The file format is now a required argument (it can be made into an > optional argument in future if we decide to support file format guessing > one day). > Looks good! I have just some minor comments: Currently the format has to be in lower-case. It might be better to make the format case-insensitive. So I won't have to remember whether it is "fasta", "Fasta", or "FASTA". Three of the ValueErrors raised by WriteSequences and SequenceIterator are actually TypeErrors: if isinstance(handle, basestring) : if not format : if not isinstance(format, basestring) : The "if not format" is actually not needed, since Python will complain already if these functions are called without the correct number of arguments. For an incorrect format argument, WriteSequences raises an AssertionError. A ValueError (as in SequenceIterator) seems more appropriate. Also, it might be a good idea to print possible values for the format if the user passes an incorrect format. Btw, the docstring for SequenceIterator mentions guessing the file format from the handle if the format is not specified. --Michiel. 
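
The review comments above (TypeError for a wrong argument type, ValueError for an unknown format, case-insensitive names, and listing the accepted values in the message) can be sketched as a single validation helper. This is an illustration under assumptions, not the code in Bio/SeqIO/__init__.py: the name check_format and the SUPPORTED_FORMATS tuple are hypothetical, and str is used where the 2007 code tested against basestring.

```python
# Illustrative list only; the real set of formats is not given in this thread.
SUPPORTED_FORMATS = ("fasta", "genbank", "embl", "clustal")


def check_format(format):
    """Validate a file format name and return it in lower case."""
    if not isinstance(format, str):
        # Wrong kind of object: a TypeError rather than a ValueError.
        raise TypeError(
            "format must be a string, got %s" % type(format).__name__
        )
    if not format:
        raise ValueError("format must not be an empty string")
    name = format.lower()  # accept "fasta", "Fasta" and "FASTA" alike
    if name not in SUPPORTED_FORMATS:
        # Listing the accepted names makes the functionality discoverable.
        raise ValueError(
            "unknown format %r; expected one of: %s"
            % (format, ", ".join(SUPPORTED_FORMATS))
        )
    return name
```

Calling the helper at the top of both the reading and the writing function would keep the two error behaviours identical, which was one of the inconsistencies pointed out above.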
From mdehoon at c2b2.columbia.edu Sun Feb 25 06:52:18 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 06:52:18 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E17872.8020500@c2b2.columbia.edu> Peter wrote: > SequenceIterator(handle, format) > SequencesToDict(sequences, key_function=None) > SequencesToAlignment(sequences, ...) > WriteSequences(sequences, handle, format) > ... > Do you think the argument order is sensible for WriteSequences? > Yes. For one thing, it is consistent with the pickle function dump: pickle.dump(object, file, protocol=None). --Michiel. From biopython-dev at maubp.freeserve.co.uk Sun Feb 25 07:50:19 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sun, 25 Feb 2007 12:50:19 +0000 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E1761D.4060906@c2b2.columbia.edu> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E1761D.4060906@c2b2.columbia.edu> Message-ID: <45E1860B.4060903@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Currently the format has to be in lower-case. It might be better to make > the format case-insensitive. So I won't have to remember whether it is > "fasta", "Fasta", or "FASTA". If you do use a non lower-case form, then a ValueError is raised - so it should be easy to see what has happened. Does anyone else care either way? > Three of the ValueErrors raised by WriteSequences and SequenceIterator > are actually TypeErrors: Good point. I have changed the type tests for the handle and format to raise TypeErrors. > The "if not format" is actually not needed, since Python will complain > already if these functions are called without the correct number of > arguments. I was actually trying to catch cases where format was supplied as None, or the empty string "". 
I have moved this below the type check, so it is only checking for an empty string and will still raise a ValueError. > For an incorrect format argument, WriteSequences raises an > AssertionError. A ValueError (as in SequenceIterator) seems more > appropriate. Agreed. I hadn't noticed that remaining assertion. > Also, it might be a good idea to print possible values for > the format if the user passes an incorrect format. At the moment this is a fairly short list, but it should grow in future. Doing this would make the functionality more discoverable. It would also help where the user had tried another name for a supported format, e.g. "genpept" versus "genbank", or "clustalw" versus "clustal". > Btw, the docstring for SequenceIterator mentions guessing the file > format from the handle if the format is not specified. Whoops. Fixed. Thanks for your attention to detail Michiel. Peter From mdehoon at c2b2.columbia.edu Sun Feb 25 20:28:06 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 20:28:06 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E237A6.4040801@c2b2.columbia.edu> Peter wrote: > SequenceIterator(handle, format) > SequencesToDict(sequences, key_function=None) > SequencesToAlignment(sequences, ...) > WriteSequences(sequences, handle, format) > > Does anyone want to suggest different names for these functions? > Instead of >>> from Bio.SeqIO import SequenceIterator, WriteSequences >>> SequenceIterator(handle, format) >>> WriteSequences(sequences, handle, format) I would prefer >>> from Bio import SeqIO >>> SeqIO.read(handle, format) >>> SeqIO.write(sequences, handle, format) for the following reasons: 1) Similar functions in the Python standard library use a short verb that describes what the function does, not what the function returns. 
For example: >>> myfile = open("myfile.txt") # Note: this returns an iterator >>> myfile.read() >>> pickle.load(handle) >>> pickle.dump(object, handle) >>> xml.sax.parse(source, handler) 2) The lack of symmetry between SequenceIterator and WriteSequences makes them harder to remember. Each time I use Bio.SeqIO, I wonder whether it is SequenceIterator or ReadSequences. 3) SequenceIterator is not factually correct; it would be a SeqRecordIterator. But that is even harder to remember, and involves even more typing. 4) The "Sequence" in SequenceIterator and WriteSequences is redundant. As these functions are in the SeqIO module, we already know they handle sequences. In addition, new users will probably not know what an iterator is. 5) Bio.SeqIO being a new module allows us to correct some design errors from the past. One thing that always bothered me in Biopython is that it is hard to guess its usage; I always need to look up in the manual how to use a particular parser. Now, "read" and "write" are generic names that can be used by similar functions in other Biopython modules. For example, the new Blast XML parser tentatively uses NCBIXML.parse. This function returns an iterator, with a Blast record for each Blast query, resembling how "read" works in Bio.SeqIO. Renaming the NCBIXML parser function to NCBIXML.read would give us some internal consistency in Biopython and enable us to guess the function name without having to look it up in the manual each time. --Michiel. From mdehoon at c2b2.columbia.edu Sun Feb 25 20:45:16 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 20:45:16 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E23BAC.7050702@c2b2.columbia.edu> Peter wrote: > Note that we may want to improve the generic alignment class at some > point (e.g. see bug 1944). 
If alignments could be initialized from a > SeqRecord list/iterator this would make the SequencesToAlignment() > function "obsolete"... > I think that the functionality of SequencesToAlignment fits better in Bio.Align than in Bio.SeqIO. The easiest way to accomplish this might be to change the __init__ of the Alignment class from def __init__(self, alphabet) to def __init__(self, alphabet, records=[]) --Michiel. From bugzilla-daemon at portal.open-bio.org Sun Feb 25 20:49:32 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Feb 2007 20:49:32 -0500 Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more In-Reply-To: Message-ID: <200702260149.l1Q1nWLG024799@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1944 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-02-25 20:49 EST ------- Note that an Alignment class is essentially a list of SeqRecords. We can get this functionality (and also simplify this class) by having the Alignment class inherit from a list. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Mon Feb 26 04:53:46 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 26 Feb 2007 04:53:46 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E26E7B.5030501@unity.ncsu.edu> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> Message-ID: <45E2AE2A.4050207@c2b2.columbia.edu> Alex Griffing wrote: > >> The easiest way to accomplish this might be to change the __init__ of >> the Alignment class from >> >> def __init__(self, alphabet) >> >> to >> >> def __init__(self, alphabet, records=[]) >> > > Does this apply here? 
> http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects > In theory, yes, but since we won't be modifying records it doesn't matter. The full function would look like: def __init__(self, alphabet, records=[]): self._alphabet = alphabet self._records = list(records) The "list" is necessary since the user may pass an iterator for records instead of a list. --Michiel. From bugzilla-daemon at portal.open-bio.org Mon Feb 26 05:32:58 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 26 Feb 2007 05:32:58 -0500 Subject: [Biopython-dev] [Bug 2216] Bio/Nexus/Tree.py class Tree adds extra space in to_string In-Reply-To: Message-ID: <200702261032.l1QAWwWD011505@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2216 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2007-02-26 05:32 EST ------- I checked in that fix, and confirmed the unit test passes. See biopython/Bio/Nexus/Trees.py revision: 1.10 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 05:40:29 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Feb 2007 10:40:29 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45E2AE2A.4050207@c2b2.columbia.edu>
References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> <45E2AE2A.4050207@c2b2.columbia.edu>
Message-ID: <45E2B91D.2080509@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Alex Griffing wrote:
>>> The easiest way to accomplish this might be to change the __init__ of
>>> the Alignment class from
>>>
>>> def __init__(self, alphabet)
>>>
>>> to
>>>
>>> def __init__(self, alphabet, records=[])
>>>
>> Does this apply here?
>> http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects
>>
> In theory, yes, but since we won't be modifying records it doesn't
> matter. The full function would look like:
>
> def __init__(self, alphabet, records=[]):
>     self._alphabet = alphabet
>     self._records = list(records)
>
> The "list" is necessary since the user may pass an iterator for records
> instead of a list.

We (or the user) might well change the records - in particular, they
might add MORE records. How about this:

    def __init__(self, alphabet, records=None):
        """Initialize a new Alignment object.

        Arguments:
        o alphabet - The alphabet to use for the sequence objects that
                     are created. This alphabet must be a gapped type.
        o records - A list or iterator of SeqRecord objects whose
                    (gapped) sequences must all be the same length.
        """
        self._alphabet = alphabet
        if records:
            self._records = list(records)
            #TODO - Check all seq lengths are the same?
            #TODO - Check the seq's alphabet is compatible?
        else:
            self._records = []

This passes relevant unit tests.
Peter From mdehoon at c2b2.columbia.edu Mon Feb 26 06:53:00 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 26 Feb 2007 06:53:00 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E2B91D.2080509@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> <45E2AE2A.4050207@c2b2.columbia.edu> <45E2B91D.2080509@maubp.freeserve.co.uk> Message-ID: <45E2CA1C.6090306@c2b2.columbia.edu> Peter wrote: > Michiel de Hoon wrote: >> Alex Griffing wrote: >>>> The easiest way to accomplish this might be to change the __init__ of >>>> the Alignment class from >>>> >>>> def __init__(self, alphabet) >>>> >>>> to >>>> >>>> def __init__(self, alphabet, records=[]) >>>> >>> Does this apply here? >>> http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects >>> >> In theory, yes, but since we won't be modifying records it doesn't >> matter. The full function would look like: >> >> def __init__(self, alphabet, records=[]): >> self._alphabet = alphabet >> self._records = list(records) >> >> The "list" is necessary since the user may pass an iterator for records >> instead of a list. > > We (or the user) might well change the records - in particular, they > might add MORE records. How about this: > Still, it doesn't matter, since we're making a copy of records via the list function. So self._records is not the same as the records default argument in the __init__ function. If a user adds more records, they will end up in the object-specific self._records. --Michiel. 
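The point argued in this thread (a shared default list is only a problem if it is stored without copying) can be checked with a small sketch. The class names SharedDefault and CopiedDefault are hypothetical and are not Biopython code; they only mirror the two ways the records argument could be stored.

```python
class SharedDefault:
    """Hypothetical class: stores the default list itself (no copy)."""

    def __init__(self, records=[]):
        # The single default list object is aliased by every instance
        # created without an explicit records argument.
        self._records = records


class CopiedDefault:
    """Hypothetical class: copies with list(), as proposed in the thread."""

    def __init__(self, records=[]):
        # list() builds a fresh list per instance, so the shared default
        # is never modified. It also accepts an iterator.
        self._records = list(records)


a, b = SharedDefault(), SharedDefault()
a._records.append("rec1")
shared_leak = b._records  # b sees the record appended through a

c, d = CopiedDefault(), CopiedDefault()
c._records.append("rec1")
copied_leak = d._records  # d is unaffected

from_iterator = CopiedDefault(iter(["x", "y"]))._records
```

Under these assumptions the copy made by list() is what keeps instances independent, whichever default (an empty list or None) is used in the signature.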
From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 14:59:45 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Feb 2007 19:59:45 +0000 Subject: [Biopython-dev] test_Cluster.py fails, "nclusters should be positive" Message-ID: <45E33C31.907@maubp.freeserve.co.uk> I just updated my Linux box running Python 2.4.3 to the CVS version of BioPython (including building KDTree). The bad news is that test_Cluster fails (full details below) with cluster.error: kcluster: nclusters should be positive The rest of the unit tests look OK except for some untested modules: the BioSQL ones, test_GFF, test_Wise and test_psw (the last two fail because I do not have dnal installed). Interesting some of these tests which pass under Linux fail under Windows due to differences in floating point display. We might want to fix that at some point... Peter -- $ python test_Cluster.py test_Cluster test_mean_median: data = [ 34.300, 3.000, 2.000] mean is 13.100; median is 3.000 data = [ 5.000, 10.000, 15.000, 20.000] mean is 12.500; median is 12.500 data = [ 1.000, 2.000, 3.000, 5.000, 7.000, 11.000, 13.000, 17.000] mean is 7.375; median is 6.000 data = [ 100.000, 19.000, 3.000, 1.500, 1.400, 1.000, 1.000, 1.000] mean is 15.988; median is 1.450 test_matrix_parse: Read data1 (correct) Read data2 (correct) Refused incorrect matrix data3 Refused incorrect matrix data4 Refused incorrect matrix data5 Refused incorrect matrix data6 Refused incorrect matrix data7 Refused incorrect matrix data8 Refused incorrect matrix data9 Refused incorrect matrix data10 test_kcluster First data set Traceback (most recent call last): File "test_Cluster.py", line 479, in ? 
run_tests(module = "Bio.Cluster") File "test_Cluster.py", line 472, in run_tests test_kcluster(module) File "test_Cluster.py", line 181, in test_kcluster clusterid, error, nfound = kcluster (data1, nclusters=nclusters, mask=mask1, weight=weight1, transpose=0, npass=100, method='a', dist='e') cluster.error: kcluster: nclusters should be positive From mdehoon at c2b2.columbia.edu Mon Feb 26 16:17:52 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 26 Feb 2007 16:17:52 -0500 Subject: [Biopython-dev] test_Cluster.py fails, "nclusters should be positive" In-Reply-To: <45E33C31.907@maubp.freeserve.co.uk> References: <45E33C31.907@maubp.freeserve.co.uk> Message-ID: <45E34E80.3060005@c2b2.columbia.edu> Peter wrote: > I just updated my Linux box running Python 2.4.3 to the CVS version of > BioPython (including building KDTree). > > The bad news is that test_Cluster fails (full details below) with > cluster.error: kcluster: nclusters should be positive Does this happen to be a 64-bits Linux box? One user with a 64-bits machine had this problem before. I have a fix for it, but it's not in CVS yet. > Interesting some of these tests which pass under Linux fail under > Windows due to differences in floating point display. We might want to > fix that at some point... Are these errors also from test_Cluster? --Michiel. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 16:49:44 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Feb 2007 21:49:44 +0000 Subject: [Biopython-dev] test_Cluster.py fails In-Reply-To: <45E34E80.3060005@c2b2.columbia.edu> References: <45E33C31.907@maubp.freeserve.co.uk> <45E34E80.3060005@c2b2.columbia.edu> Message-ID: <45E355F8.2080103@maubp.freeserve.co.uk> Michiel Jan Laurens de Hoon wrote: > Peter wrote: >> I just updated my Linux box running Python 2.4.3 to the CVS version of >> BioPython (including building KDTree). >> >> The bad news is that test_Cluster fails (full details below) with >> cluster.error: kcluster: nclusters should be positive > > Does this happen to be a 64-bits Linux box? One user with a 64-bits > machine had this problem before. I have a fix for it, but it's not in > CVS yet. Yes, this is a 64-bit Linux box. Is there a simple patch I could test? >> Interesting some of these tests which pass under Linux fail under >> Windows due to differences in floating point display. We might want to >> fix that at some point... > > Are these errors also from test_Cluster? No, other things like test_SVDSuperimposer (used by Bio.PDB) and some others. Peter From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 20:24:48 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Feb 2007 01:24:48 +0000 Subject: [Biopython-dev] Parsing NCBI Protein Tables (PTT) files in Bio.SeqIO Message-ID: <45E38860.7030001@maubp.freeserve.co.uk> I have a rough NCBI Protein Table (*.ptt) file parser prepared for Bio.SeqIO (see below). This file format was discussed in August 2006, and is unusual in that it does not actually contain any sequences (!). This means the parser returns SeqRecord objects with empty sequences, but with populated annotation fields. 
I believe Leighton Pritchard was interested in parsing PTT files from
the NCBI. Does something like this (below) look useful? Does anyone know
of a link to any "official" documentation on this file format?

Leighton, you also mentioned parsing the NCBI's GFF files, which seem to
be a tab separated variable dump of the information found in a GenBank
file's features table (link to documentation welcome).

An entire GFF file could be turned into a single SeqRecord with no
sequence, but with many sub features as SeqFeatures (akin to the results
of the existing "genbank" parser). The location information would be
simplified for GFF.

Also, it looks like parsing just the CDS entries from a GFF file into
"sequence free" SeqRecords would also be sensible... (akin to the
existing "genbank-cds" parser).

Peter

--

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def NcbiProteinTableIterator(handle):
    """Returns a SeqRecord for each entry in an NCBI Protein Table (PTT file)

    Note that the SeqRecord object's sequence will be zero length (empty).
    """
    # Skip the first line; the second should read "<count> proteins"
    line = handle.readline()
    line = handle.readline()
    parts = line.rstrip().split()
    if len(parts) != 2 or parts[1].lower() != "proteins":
        raise SyntaxError("Second line not recognised as an NCBI Protein Table (PTT file)")

    # The third line is the tab separated column header
    line = handle.readline().strip()
    if line != "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct":
        raise SyntaxError("Third line not recognised as an NCBI Protein Table (PTT file)")

    # Column indices
    LOCATION = 0
    STRAND = 1
    LENGTH = 2
    PID = 3
    GENE = 4
    SYNONYM = 5
    CODE = 6
    COG = 7
    PRODUCT = 8

    # Column index -> annotation key
    mapping = {
        LOCATION : "location",
        STRAND : "strand",
        PID : "PID",
        GENE : "gene",
        SYNONYM : "synonym",
        CODE : "locus_tag", # Is this always correct?
        COG : "COG",
        PRODUCT : "product",
        }

    count = int(parts[0])
    for line in handle:
        parts = line.rstrip().split("\t")
        record = SeqRecord(seq=Seq(""), id=parts[PID], name=parts[GENE],
                           description=parts[PRODUCT])
        # A "-" means the field is empty
        if parts[LENGTH] != "-":
            record.annotations["length"] = int(parts[LENGTH])
        for field, key in mapping.items():
            if parts[field] != "-":
                record.annotations[key] = parts[field]
        #TODO - Make sure STRAND is treated same as for GenBank
        yield record
        count -= 1
    # The header count should match the number of records read
    if count != 0:
        raise SyntaxError("Record header number of records wrong?")

From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 20:34:04 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Feb 2007 01:34:04 +0000
Subject: [Biopython-dev] GFF3 files in Bio.SeqIO
In-Reply-To: <45E38860.7030001@maubp.freeserve.co.uk>
References: <45E38860.7030001@maubp.freeserve.co.uk>
Message-ID: <45E38A8C.6090007@maubp.freeserve.co.uk>

Peter wrote:
> Leighton, you also mentioned parsing the NCBI's GFF files, which seem to
> be a tab separated variable dump of the information found in a GenBank
> file's features table (link to documentation welcome).
>
> An entire GFF file could be turned into a single SeqRecord with no
> sequence, but with many sub features as SeqFeatures (akin to the results
> of the existing "genbank" parser). The location information would be
> simplified for GFF.
>
> Also, it looks like parsing just the CDS entries from a GFF file into
> "sequence free" SeqRecords would also be sensible... (akin to the
> existing "genbank-cds" parser).

I went through my old emails, and actually you did point me in this
direction:

http://song.sourceforge.net/gff3.shtml
http://www.sequenceontology.org/gff3.shtml

The file format does look much more complicated than I had first
thought. Interestingly the file format does allow for FASTA records to
be appended to it - however the NCBI at least does not do this.
Perhaps a more general GFF3 parser would be more useful than a sequence
orientated one for Bio.SeqIO?

Peter

From mdehoon at c2b2.columbia.edu Tue Feb 27 13:26:10 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 27 Feb 2007 13:26:10 -0500
Subject: [Biopython-dev] test_Cluster.py fails
In-Reply-To: <45E355F8.2080103@maubp.freeserve.co.uk>
References: <45E33C31.907@maubp.freeserve.co.uk> <45E34E80.3060005@c2b2.columbia.edu> <45E355F8.2080103@maubp.freeserve.co.uk>
Message-ID: <45E477C2.1030101@c2b2.columbia.edu>

Peter wrote:
>>> The bad news is that test_Cluster fails (full details below) with
>>> cluster.error: kcluster: nclusters should be positive
>>
>> Does this happen to be a 64-bits Linux box? One user with a 64-bits
>> machine had this problem before. I have a fix for it, but it's not in
>> CVS yet.
>
> Yes, this is a 64-bit Linux box. Is there a simple patch I could test?
>

I have updated Bio.Cluster and the corresponding tests in CVS now. Could
you try this new code and see if the problem is solved on the 64-bits
machines? Thanks!

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Wed Feb 28 13:05:20 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Feb 2007 18:05:20 +0000
Subject: [Biopython-dev] test_Cluster.py fails
In-Reply-To: <45E477C2.1030101@c2b2.columbia.edu>
References: <45E33C31.907@maubp.freeserve.co.uk> <45E34E80.3060005@c2b2.columbia.edu> <45E355F8.2080103@maubp.freeserve.co.uk> <45E477C2.1030101@c2b2.columbia.edu>
Message-ID: <45E5C460.4070000@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>>>> The bad news is that test_Cluster fails (full details below) with
>>>> cluster.error: kcluster: nclusters should be positive
>>> Does this happen to be a 64-bits Linux box? One user with a 64-bits
>>> machine had this problem before.
I have a fix for it, but it's not in >>> CVS yet. >> Yes, this is a 64-bit Linux box. Is there a simple patch I could test? >> > I have updated Bio.Cluster and the corresponding tests in CVS now. Could > you try this new code and see if the problem is solved on the 64-bits > machines? Updated to CVS, and test_Cluster.py now passes on my 64-bit Linux machine. I have not attempted any further testing of the module. Thank you :) Peter From biopython-dev at maubp.freeserve.co.uk Sat Feb 3 19:41:43 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sat, 03 Feb 2007 19:41:43 +0000 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> Message-ID: <45C4E577.2060405@maubp.freeserve.co.uk> I've compiled a rough list of modules using Numeric with the aid of grep. As far as I can tell, we have two big lumps of code (Michiel's Bio.Cluster module and Thomas' Bio.PDB and SVDSuperimposer modules) plus a selection of miscellaneous bits. Thomas & Michiel - have you looked or made plans to moving your code? Also, do we have any active developers familiar with the following modules: Jeffrey Chang (no longer active?): NaiveBayes.py MaxEntropy.py Harry Zuzan (no longer active?): Affy/CelFile.py and Affy/celmodule.cc Unnamed authors: KDTree/KDTree.py (includes C++ code so tricky) LogisticRegression.py MarkovModel.py Statistics/lowess.py distance.py kNN.py Some of which could be converted "by hand" almost trivially (e.g. 
distance.py), but see also: http://www.scipy.org/Porting_to_NumPy Peter From bsouthey at gmail.com Tue Feb 6 14:10:25 2007 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 6 Feb 2007 08:10:25 -0600 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: <45C4E577.2060405@maubp.freeserve.co.uk> References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> <45C4E577.2060405@maubp.freeserve.co.uk> Message-ID: Hi, I would be interested in trying to help out on this as my time permits. Is there any plan of action? Bruce On 2/3/07, Peter wrote: > I've compiled a rough list of modules using Numeric with the aid of grep. > > As far as I can tell, we have two big lumps of code (Michiel's > Bio.Cluster module and Thomas' Bio.PDB and SVDSuperimposer modules) plus > a selection of miscellaneous bits. > > Thomas & Michiel - have you looked or made plans to moving your code? > > Also, do we have any active developers familiar with the following modules: > > Jeffrey Chang (no longer active?): > NaiveBayes.py > MaxEntropy.py > > Harry Zuzan (no longer active?): > Affy/CelFile.py and Affy/celmodule.cc > > Unnamed authors: > KDTree/KDTree.py (includes C++ code so tricky) > LogisticRegression.py > MarkovModel.py > Statistics/lowess.py > distance.py > kNN.py > > Some of which could be converted "by hand" almost trivially (e.g. 
> distance.py), but see also: > > http://www.scipy.org/Porting_to_NumPy > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython-dev at maubp.freeserve.co.uk Tue Feb 6 15:16:42 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 06 Feb 2007 15:16:42 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <44461DD4.80306@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> Message-ID: <45C89BDA.6060803@maubp.freeserve.co.uk> Albert Krewinkel wrote: >> I am trying to parse a EMBL-formated file with biopython, but I >> couldn't find any working parser for this. When I try to use the >> Martel-based parser as described in one of the mailinglist-threads, I >> get the following error... Peter wrote: > OK, we have the following files in BioPython: > > Bio/formatdefs/embl.py (wrapper) > Bio/expressions/embl/__init__.py (dummy file) > Bio/expressions/embl/embl65.py (contains Martel definition) > > ... > > It does look like an out of date [Martel] file format definition in > BioPython (assuming that example code from Jeff Chang is fine). I haven't touched the Martel file format definition, but I have been looking at EMBL parsing for Bio.SeqIO Based on my experience with the poor performance of the old Martel GenBank on large files, I would expect the same issue to apply to the Martel EMBL parser (even if it was updated). 
So, I have been looking at re-writing my Python based GenBank parser (in Bio.GenBank) instead: Notes and attachment showing the idea here: http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c14 I am thinking of sticking with the current scanner/consumer model in Bio/GenBank/__init__.py but simply replacing the (GenBank only) _Scanner class with a "GenBank scanner" and an "EMBL scanner" (based on a common base class which will handle the feature table). These new scanners would both feed into the existing consumers. In particular, the "Feature Consumer" which builds a SeqRecord with SeqFeature objects. I have this more or less working. Does this sound like a sensible way to include EMBL support? While it would be possible to use the new EMBL parser in much the same way as the current GenBank parser, I would recommend most users simply invoke them via Bio.SeqIO for normal work. I could put most of the new code in Bio/GenBank and create a new module/directory called Bio/EMBL, or just stick everything in Bio/GenBank - I'm not that fussed either way given I want to push Bio.SeqIO as the main interface. (Once that is settled I can rearrange the new code to slot in as appropriate.) Michiel - how does this plan sound? And should I try and get these changes done and tested in time for the next release - or wait until afterwards? Peter From mdehoon at c2b2.columbia.edu Tue Feb 6 17:05:48 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 06 Feb 2007 12:05:48 -0500 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45C89BDA.6060803@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> Message-ID: <45C8B56C.2070809@c2b2.columbia.edu> Peter wrote: > Does this sound like a sensible way to include EMBL support? 
> > While it would be possible to use the new EMBL parser in much the same > way as the current GenBank parser, I would recommend most users simply > invoke them via Bio.SeqIO for normal work. That sounds good to me. > I could put most of the new code in Bio/GenBank and create a new > module/directory called Bio/EMBL, or just stick everything in > Bio/GenBank - I'm not that fussed either way given I want to push > Bio.SeqIO as the main interface. That sounds good too. > Michiel - how does this plan sound? And should I try and get these > changes done and tested in time for the next release - or wait until > afterwards? Either way is fine with me. We can do the Bronx release in the near future, and do another release when the EMBL stuff is done. But it's up to you. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Tue Feb 6 16:56:47 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 06 Feb 2007 11:56:47 -0500 Subject: [Biopython-dev] Moving from Numeric to NumPy In-Reply-To: References: <3ADAFA9D-6731-4823-9609-AA864E5C0869@fas.harvard.edu> <45C4E577.2060405@maubp.freeserve.co.uk> Message-ID: <45C8B34F.9090502@c2b2.columbia.edu> Bruce Southey wrote: > Hi, > I would be interested in trying to help out on this as my time > permits. Is there any plan of action? > Not as far as I know. Can I nominate you as the Biopython Numeric-to-numpy conversion project leader? According to the numpy website, this conversion should be easy, but not without some effort. Basically, you'd have to find out if the Python scripts that use Numerical Python need to be modified for numpy, and if the C-code using Numerical Python still compiles and runs correctly with numpy. If few changes are needed, we may be able to make Biopython compatible with both Numerical Python and numpy. --Michiel. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From bugzilla-daemon at portal.open-bio.org Wed Feb 7 17:07:27 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Feb 2007 12:07:27 -0500 Subject: [Biopython-dev] [Bug 1981] GenBank parser generates unusual feature qualifiers. In-Reply-To: Message-ID: <200702071707.l17H7RCH031173@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1981 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2007-02-07 12:07 EST ------- As part of my work to include EMBL support, I ended up changing the GenBank parser behaviour for the newline/whitespace in feature qualifier descriptions to what Marc was suggesting. I haven't noticed any side effects yet... Marking this bug as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Wed Feb 7 17:04:04 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 07 Feb 2007 17:04:04 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45C8B56C.2070809@c2b2.columbia.edu> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> Message-ID: <45CA0684.40107@maubp.freeserve.co.uk> Michiel Jan Laurens de Hoon wrote: > Peter wrote: >> Does this sound like a sensible way to include EMBL support? >> >> ... > > Either way is fine with me. 
We can do the Bronx release in the near
> future, and do another release when the EMBL stuff is done. But it's up
> to you.

This took longer than I expected, but it's done now.

There is a new file Bio/GenBank/Scanner.py which includes a base "INSDC
scanner" which handles the common code (e.g. feature tables) with two
subclasses, a GenBankScanner and an EmblScanner.

I have updated Bio/GenBank/__init__.py to remove my old GenBank-only
scanner, and call the new GenBankScanner instead.

I have also updated Bio.SeqIO to use this new code for both GenBank and
EMBL formats.

http://www.biopython.org/wiki/SeqIO

Note: The handling of newlines and white space has changed slightly as a
result of these changes. I updated the expected output for the
test_GenBank unit test.

Incidentally I think this "fixes" Bug 1981:
http://bugzilla.open-bio.org/show_bug.cgi?id=1981

Other than that, touch wood, nothing should have changed for GenBank
users. The relevant unit tests look fine.

The EMBL support has a few bits that need polishing (search for TODO in
Bio/GenBank/Scanner.py for points that I noted at the time), and some
rigorous testing of course. I should probably add some EMBL examples to
the SeqIO unit test...

Peter

From bugzilla-daemon at portal.open-bio.org Mon Feb 12 17:23:05 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 12 Feb 2007 12:23:05 -0500
Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails
In-Reply-To:
Message-ID: <200702121723.l1CHN5Qg007996@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1921

------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-02-12 12:23 EST -------
Leighton,

Are you still happy with the patch as submitted? If so, I'll commit it to CVS.
I know next to nothing about SQL, so I'll have to rely on your judgment here.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Feb 12 17:24:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 12 Feb 2007 12:24:45 -0500 Subject: [Biopython-dev] [Bug 1982] Patch to BioSQL/Loader.py In-Reply-To: Message-ID: <200702121724.l1CHOjv8008116@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1982 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-02-12 12:24 EST ------- > This patch contains print and sys.stdout.write statements that report back to > the user when errors/unusual events occur. Is this acceptable within the > BioPython style? Using Python's warning module would be better. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Feb 13 15:28:07 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Feb 2007 10:28:07 -0500 Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails In-Reply-To: Message-ID: <200702131528.l1DFS7od009962@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1921 ------- Comment #6 from lpritc at scri.sari.ac.uk 2007-02-13 10:28 EST ------- Hi Michiel, I've had no further issues since I put in the patch locally, so it works for me, at least. No doubt someone will point out any further problems that I've missed in my usage ;) L. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
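On the suggestion in Bug 1982 comment #2 to use Python's warnings module rather than print or sys.stdout.write, a minimal sketch of the pattern follows. The function load_feature and its qualifier check are hypothetical illustrations and are not taken from BioSQL/Loader.py.

```python
import warnings


def load_feature(qualifiers):
    """Hypothetical loader step, not the actual BioSQL/Loader.py code."""
    if "db_xref" not in qualifiers:
        # Report the unusual event through the warnings machinery, so
        # callers can filter it, log it, or turn it into an error.
        warnings.warn("feature has no db_xref qualifier", UserWarning)
        return []
    return list(qualifiers["db_xref"])


# A caller (or a unit test) can capture the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    refs = load_feature({"gene": ["thrA"]})
```

Unlike a print statement, a warning raised this way is under the caller's control: it can be silenced with warnings.simplefilter("ignore") or promoted to an exception with warnings.simplefilter("error").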
From bugzilla-daemon at portal.open-bio.org Tue Feb 13 16:27:48 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 13 Feb 2007 11:27:48 -0500 Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails In-Reply-To: Message-ID: <200702131627.l1DGRmuL012469@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1921 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-02-13 11:27 EST ------- Patch accepted in CVS, thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Feb 15 21:29:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 15 Feb 2007 16:29:45 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702152129.l1FLTjhH010350@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-02-15 16:29 EST ------- I would think that no "close" is needed in PDBParser's get_structure method. If the argument "file" is an open handle, get_structure closes it, which appears to be an unwanted side-effect. If the argument "file" is a string, then "file = open(file)" in get_structure opens a file, and Python will close it for you once the file variable goes out of scope (when the function returns). Thomas, am I missing something? Do you have any objections against removing the "close"? --Michiel. 
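[Editorial sketch, not part of the original comment.] The point about not closing a handle that the caller supplied can be illustrated with a small helper. The name read_lines below is hypothetical and merely stands in for a method such as get_structure; it is written for current Python (str rather than basestring). It closes only a file that it opened itself, and does so explicitly, since closing when a variable goes out of scope relies on CPython's reference counting and may be delayed on other interpreters.

```python
import io


def read_lines(source):
    """Return the lines of `source`, which may be a filename or an open handle.

    Hypothetical helper for illustration only; it is not Bio.PDB code.
    """
    if isinstance(source, str):
        # We opened this file ourselves, so we are responsible for closing it.
        with open(source) as handle:
            return handle.readlines()
    # The caller owns this handle, so leave it open for the caller to close.
    return source.readlines()


# Usage: a handle passed in by the caller is still open afterwards.
caller_handle = io.StringIO("HEADER    EXAMPLE\nEND\n")
lines = read_lines(caller_handle)
```

With this arrangement the unwanted side effect described above (the caller's handle being closed behind its back) cannot occur.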
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 16 08:21:47 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 16 Feb 2007 03:21:47 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702160821.l1G8Ll30032274@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 ------- Comment #2 from thamelry at binf.ku.dk 2007-02-16 03:21 EST ------- (In reply to comment #0) > Do you have any objections against removing the > "close"? Nope - it makes sense to remove it. -Thomas -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Feb 16 15:40:31 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 16 Feb 2007 10:40:31 -0500 Subject: [Biopython-dev] [Bug 2169] 'close' method is missing for ReseekFile wrapper In-Reply-To: Message-ID: <200702161540.l1GFeVuX018588@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2169 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-02-16 10:40 EST ------- I have removed the "close" in PDBParser.py in CVS. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython-dev at maubp.freeserve.co.uk Mon Feb 19 11:24:29 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Feb 2007 11:24:29 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45CA0684.40107@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> Message-ID: <45D988ED.3070309@maubp.freeserve.co.uk> Peter wrote: >>> Does this sound like a sensible way to include EMBL support? >>> >>> ... > > This took longer than I expected, but it's done now. Has anyone had a chance to try out the revised EMBL/GenBank parser? I could ask on the main list, but as testing the EMBL parsing would require installing the CVS release (or updating just Bio/GenBank and Bio/SeqIO by hand) that seems a bit much to ask. There are three main things I would like feedback on: (a) Has any existing code using Bio.GenBank been affected at all? (b) Does Bio.SeqIO read your favourite EMBL/GenBank files? (c) How does parsing the file as "genbank-cds" or "embl-cds" look? This returns each CDS feature, with its stated amino acid translation, as a SeqRecord. Does anyone else think getting the genes themselves in this way is a useful option? I'm not sure about the simplistic code to choose the SeqRecord id/name/description - this is difficult as there is a lot of variation in annotation conventions. > I should probably add some EMBL examples to the SeqIO unit test... I have added a single-record EMBL file to the test suite.
Peter From mdehoon at c2b2.columbia.edu Mon Feb 19 15:42:11 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 19 Feb 2007 10:42:11 -0500 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45D988ED.3070309@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk> Message-ID: <45D9C553.1060006@c2b2.columbia.edu> Hi Peter, Currently, the SeqIO test fails on Mac OS X (see below). It looks like this is due to a different line separator being used on Macintosh. --Michiel. mdehoon:~/biopython/Tests $ python run_tests.py test_SeqIO test_SeqIO ... FAIL ====================================================================== FAIL: test_SeqIO ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 148, in runTest self.runSafeTest() File "run_tests.py", line 161, in runSafeTest cur_test = __import__(self.test_name) File "/Users/mdehoon/biopython/Tests/test_SeqIO.py", line 261, in check_simple_write_read(records, test_write_read_non_alignment_formats) File "/Users/mdehoon/biopython/Tests/test_SeqIO.py", line 140, in check_simple_write_read WriteSequences(sequences=records, handle=handle, format=format) File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/__init__.py", line 229, in WriteSequences writer_class(handle).write_file(sequences) File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/Interfaces.py", line 235, in write_file self.write_records(records) File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/Interfaces.py", line 225, in write_records self.write_record(record) File "/Users/mdehoon/biopython/build/lib.macosx-10.3-i386-2.5/Bio/SeqIO/FastaIO.py", line 108, in write_record assert os.linesep 
not in description AssertionError ---------------------------------------------------------------------- Ran 1 test in 0.429s FAILED (failures=1) mdehoon:~/biopython/Tests $ From biopython-dev at maubp.freeserve.co.uk Tue Feb 20 10:48:45 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Feb 2007 10:48:45 +0000 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45D9C553.1060006@c2b2.columbia.edu> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk> <45D9C553.1060006@c2b2.columbia.edu> Message-ID: <45DAD20D.2020903@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Hi Peter, > > Currently, the SeqIO test fails on Mac OS X (see below). It looks like > this is due to a different line separator being used on Macintosh. > > --Michiel. Embarrassingly, it also does this on Linux. I should have caught that... This was caused by the fasta writer checking that the record ID and description do not contain new lines, which should normally be the case. When I tried just "python test_SeqIO" I found that the SwissProt parser had created a record from SwissProt/sp003 whose description contained a new line. Arguably we should update Bio.SwissProt to clean up new line characters in the description... I have updated FastaIO.py to cope with newlines in the ID or description (by replacing them with spaces) and the SeqIO unit test now passes on Linux. It should be fine on MacOS too. 
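[Editorial sketch, not part of the original message.] The fix described above, replacing line breaks in the ID or description with spaces rather than asserting on os.linesep, might look roughly like the following. The function name fasta_title is hypothetical and this is not the actual FastaIO.py code; it only illustrates why handling every newline convention avoids platform-dependent behaviour.

```python
def fasta_title(record_id, description):
    """Build a one-line FASTA title, replacing any line breaks with spaces.

    Hypothetical helper for illustration; not the code in Bio/SeqIO/FastaIO.py.
    """
    title = "%s %s" % (record_id, description)
    # Handle "\r\n" first so a Windows line break becomes a single space,
    # then the Unix ("\n") and old Macintosh ("\r") conventions.
    for newline in ("\r\n", "\n", "\r"):
        title = title.replace(newline, " ")
    return ">" + title.strip()


# Usage: a description that spans two lines still yields a one-line title.
example_title = fasta_title("P12345", "first line\nsecond line")
```

Because the cleanup does not depend on os.linesep, the same input gives the same title on Linux, Mac OS and Windows.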
Peter From bugzilla-daemon at portal.open-bio.org Thu Feb 22 14:36:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 22 Feb 2007 09:36:18 -0500 Subject: [Biopython-dev] [Bug 2216] New: Bio/Nexus/Tree.py class Tree adds extra space in to_string Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2216 Summary: Bio/Nexus/Tree.py class Tree adds extra space in to_string Product: Biopython Version: Not Applicable Platform: PC OS/Version: Windows XP Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk The unit test for Bio.Nexus (test_Nexus.py) is failing due to an extra space at the end of the string representation of (some) trees: Output : "Test Tree module. ... tree tree1 = (((((('one should be punished, for (that)!','isn''that [a] strange name?'),'t2 the name'),t8,t9),t6),t7),(t5,t1)) ;\n" Expected: "Test Tree module. ... tree tree1 = (((((('one should be punished, for (that)!','isn''that [a] strange name?'),'t2 the name'),t8,t9),t6),t7),(t5,t1)); \n" I think this bug was introduced six months ago in revision 1.7 (fkauff) as part of a switch from building the tree string by concatenation to doing "".join(list of string) which should be faster. (Note that I just checked in a minor change to the tree parser to cope with some odd white space) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Sat Feb 24 23:41:36 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sat, 24 Feb 2007 23:41:36 +0000 Subject: [Biopython-dev] Bio.SeqIO Message-ID: <45E0CD30.4060108@maubp.freeserve.co.uk> Another update on the Bio.SeqIO code... Based on the previous discussions (e.g. 
Iddo's well argued emails in November 2006) I have now removed the code for mapping known file extensions to file formats. The file format is now a required argument (it can be made into an optional argument in future if we decide to support file format guessing one day). The provisional documentation is here, which now includes a few examples converting from one file format to another - selecting only certain records: http://www.biopython.org/wiki/SeqIO Are you all happy with the interface defined in Bio/SeqIO/__init__.py consisting of the four functions: SequenceIterator(handle, format) SequencesToDict(sequences, key_function=None) SequencesToAlignment(sequences, ...) WriteSequences(sequences, handle, format) Does anyone want to suggest different names for these functions? Do you think the argument order is sensible for WriteSequences? Note that we may want to improve the generic alignment class at some point (e.g. see bug 1944). If alignments could be initialized from a SeqRecord list/iterator this would make the SequencesToAlignment() function "obsolete"... Peter From mdehoon at c2b2.columbia.edu Sun Feb 25 05:14:31 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 00:14:31 -0500 Subject: [Biopython-dev] EMBL flatfile parsing In-Reply-To: <45DAD20D.2020903@maubp.freeserve.co.uk> References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de> <44461DD4.80306@maubp.freeserve.co.uk> <45C89BDA.6060803@maubp.freeserve.co.uk> <45C8B56C.2070809@c2b2.columbia.edu> <45CA0684.40107@maubp.freeserve.co.uk> <45D988ED.3070309@maubp.freeserve.co.uk> <45D9C553.1060006@c2b2.columbia.edu> <45DAD20D.2020903@maubp.freeserve.co.uk> Message-ID: <45E11B37.8040902@c2b2.columbia.edu> Peter wrote: > Michiel de Hoon wrote: >> Hi Peter, >> >> Currently, the SeqIO test fails on Mac OS X (see below). It looks like >> this is due to a different line separator being used on Macintosh. >> >> --Michiel. 
> I have updated FastaIO.py to cope with newlines in the ID or description > (by replacing them with spaces) and the SeqIO unit test now passes on > Linux. It should be fine on MacOS too. Latest version works fine on Mac OS. Thanks! --Michiel. From bugzilla-daemon at portal.open-bio.org Sun Feb 25 06:11:07 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Feb 2007 01:11:07 -0500 Subject: [Biopython-dev] [Bug 2216] Bio/Nexus/Tree.py class Tree adds extra space in to_string In-Reply-To: Message-ID: <200702250611.l1P6B7VJ013499@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2216 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-02-25 01:11 EST ------- Created an attachment (id=574) --> (http://bugzilla.open-bio.org/attachment.cgi?id=574&action=view) Patch to Bio/Nexus/Trees.py This patch removes the extra space in front of the ';'. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Sun Feb 25 11:42:21 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 06:42:21 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E1761D.4060906@c2b2.columbia.edu> Thanks Peter! Peter wrote: > The file format is now a required argument (it can be made into an > optional argument in future if we decide to support file format guessing > one day). > Looks good! I have just some minor comments: Currently the format has to be in lower-case. It might be better to make the format case-insensitive. So I won't have to remember whether it is "fasta", "Fasta", or "FASTA". 
Three of the ValueErrors raised by WriteSequences and SequenceIterator are actually TypeErrors: if isinstance(handle, basestring) : if not format : if not isinstance(format, basestring) : The "if not format" is actually not needed, since Python will complain already if these functions are called without the correct number of arguments. For an incorrect format argument, WriteSequences raises an AssertionError. A ValueError (as in SequenceIterator) seems more appropriate. Also, it might be a good idea to print possible values for the format if the user passes an incorrect format. Btw, the docstring for SequenceIterator mentions guessing the file format from the handle if the format is not specified. --Michiel. From mdehoon at c2b2.columbia.edu Sun Feb 25 11:52:18 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 06:52:18 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E17872.8020500@c2b2.columbia.edu> Peter wrote: > SequenceIterator(handle, format) > SequencesToDict(sequences, key_function=None) > SequencesToAlignment(sequences, ...) > WriteSequences(sequences, handle, format) > ... > Do you think the argument order is sensible for WriteSequences? > Yes. For one thing, it is consistent with the pickle function dump: pickle.dump(object, file, protocol=None). --Michiel. From biopython-dev at maubp.freeserve.co.uk Sun Feb 25 12:50:19 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sun, 25 Feb 2007 12:50:19 +0000 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E1761D.4060906@c2b2.columbia.edu> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E1761D.4060906@c2b2.columbia.edu> Message-ID: <45E1860B.4060903@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Currently the format has to be in lower-case. It might be better to make > the format case-insensitive. 
So I won't have to remember whether it is > "fasta", "Fasta", or "FASTA". If you do use a non lower-case form, then a ValueError is raised - so it should be easy to see what has happened. Does anyone else care either way? > Three of the ValueErrors raised by WriteSequences and SequenceIterator > are actually TypeErrors: Good point. I have changed the type tests for the handle and format to raise TypeErrors. > The "if not format" is actually not needed, since Python will complain > already if these functions are called without the correct number of > arguments. I was actually trying to catch cases where format was supplied as None, or the empty string "". I have moved this below the type check, so it is only checking for an empty string and will still raise a ValueError. > For an incorrect format argument, WriteSequences raises an > AssertionError. A ValueError (as in SequenceIterator) seems more > appropriate. Agreed. I hadn't noticed that remaining assertion. > Also, it might be a good idea to print possible values for > the format if the user passes an incorrect format. At the moment this is a fairly short list, but it should grow in future. Doing this would make the functionality more discoverable. It would also help where the user had tried another name for a supported format, e.g. "genpept" versus "genbank", or "clustalw" versus "clustal". > Btw, the docstring for SequenceIterator mentions guessing the file > format from the handle if the format is not specified. Whoops. Fixed. Thanks for your attention to detail Michiel. 
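[Editorial sketch, not part of the original message.] The argument checks agreed in this exchange (TypeError for a non-string format, ValueError for an empty or unknown format, and an error message that lists the supported names) could be sketched as below. The function check_format and the KNOWN_FORMATS tuple are hypothetical; the real list of formats lives in Bio/SeqIO/__init__.py and is longer. The sketch uses str where the 2007 code used basestring.

```python
# Hypothetical registry for illustration; the names are taken from this thread.
KNOWN_FORMATS = ("clustal", "embl", "fasta", "genbank")


def check_format(format):
    """Validate a file format name, raising TypeError or ValueError."""
    if not isinstance(format, str):
        # Wrong kind of object altogether (e.g. None or a handle).
        raise TypeError("Need a string for the file format (lower case)")
    if not format:
        # Right type, but an empty string carries no information.
        raise ValueError("Format required (lower case string)")
    if format not in KNOWN_FORMATS:
        # Listing the options makes the supported formats discoverable.
        raise ValueError("Unknown format %r. Supported formats: %s"
                         % (format, ", ".join(KNOWN_FORMATS)))
    return format


# Usage: a mixed-case name is rejected with a message listing the options.
try:
    check_format("FASTA")
except ValueError as err:
    message = str(err)
```

If case-insensitive names were preferred instead, a single format.lower() before the lookup would be enough; that choice is left open here, as it was in the thread.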
Peter From mdehoon at c2b2.columbia.edu Mon Feb 26 01:28:06 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 20:28:06 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E237A6.4040801@c2b2.columbia.edu> Peter wrote: > SequenceIterator(handle, format) > SequencesToDict(sequences, key_function=None) > SequencesToAlignment(sequences, ...) > WriteSequences(sequences, handle, format) > > Does anyone want to suggest different names for these functions? > Instead of >>> from Bio.SeqIO import SequenceIterator, WriteSequences >>> SequenceIterator(handle, format) >>> WriteSequences(sequences, handle, format) I would prefer >>> from Bio import SeqIO >>> SeqIO.read(handle, format) >>> SeqIO.write(sequences, handle, format) for the following reasons: 1) Similar functions in the Python standard library use a short verb that describes what the function does, not what the function returns. For example: >>> myfile = open("myfile.txt") # Note: this returns an iterator >>> myfile.read() >>> pickle.load(handle) >>> pickle.dump(object, handle) >>> xml.sax.parse(source, handler) 2) The lack of symmetry between SequenceIterator and WriteSequences makes them harder to remember. Each time I use Bio.SeqIO, I wonder whether it is SequenceIterator or ReadSequences. 3) SequenceIterator is not factually correct; it would be a SeqRecordIterator. But that is even harder to remember, and involves even more typing. 4) The "Sequence" in SequenceIterator and WriteSequences is redundant. As these functions are in the SeqIO module, we already know they handle sequences. In addition, new users will probably not know what an iterator is. 5) Bio.SeqIO being a new module allows us to correct some design errors from the past. 
One thing that always bothered me in Biopython is that it is hard to guess its usage; I always need to look up in the manual how to use a particular parser. Now, "read" and "write" are generic names that can be used by similar functions in other Biopython modules. For example, the new Blast XML parser tentatively uses NCBIXML.parse. This function returns an iterator, with a Blast record for each Blast query, resembling how "read" works in Bio.SeqIO. Renaming the NCBIXML parser function to NCBIXML.read would give us some internal consistency in Biopython and enable us to guess the function name without having to look it up in the manual each time. --Michiel. From mdehoon at c2b2.columbia.edu Mon Feb 26 01:45:16 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 25 Feb 2007 20:45:16 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E0CD30.4060108@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> Message-ID: <45E23BAC.7050702@c2b2.columbia.edu> Peter wrote: > Note that we may want to improve the generic alignment class at some > point (e.g. see bug 1944). If alignments could be initialized from a > SeqRecord list/iterator this would make the SequencesToAlignment() > function "obsolete"... > I think that the functionality of SequencesToAlignment fits better in Bio.Align than in Bio.SeqIO. The easiest way to accomplish this might be to change the __init__ of the Alignment class from def __init__(self, alphabet) to def __init__(self, alphabet, records=[]) --Michiel. 
From bugzilla-daemon at portal.open-bio.org Mon Feb 26 01:49:32 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 25 Feb 2007 20:49:32 -0500 Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more In-Reply-To: Message-ID: <200702260149.l1Q1nWLG024799@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1944 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-02-25 20:49 EST ------- Note that an Alignment class is essentially a list of SeqRecords. We can get this functionality (and also simplify this class) by having the Alignment class inherit from a list. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Mon Feb 26 09:53:46 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 26 Feb 2007 04:53:46 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E26E7B.5030501@unity.ncsu.edu> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> Message-ID: <45E2AE2A.4050207@c2b2.columbia.edu> Alex Griffing wrote: > >> The easiest way to accomplish this might be to change the __init__ of >> the Alignment class from >> >> def __init__(self, alphabet) >> >> to >> >> def __init__(self, alphabet, records=[]) >> > > Does this apply here? > http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects > In theory, yes, but since we won't be modifying records it doesn't matter. The full function would look like: def __init__(self, alphabet, records=[]): self._alphabet = alphabet self._records = list(records) The "list" is necessary since the user may pass an iterator for records instead of a list. --Michiel. 
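[Editorial sketch, not part of the original message.] The difference between storing a mutable default argument directly and copying it with list() can be checked with two toy classes. SharedDefault and CopiedDefault are hypothetical names used only for this demonstration; neither is the Biopython Alignment class.

```python
class SharedDefault:
    def __init__(self, records=[]):
        # Stores a reference to the one default list shared by all calls.
        self._records = records


class CopiedDefault:
    def __init__(self, records=[]):
        # list() makes a fresh copy (and also accepts an iterator).
        self._records = list(records)


a, b = SharedDefault(), SharedDefault()
a._records.append("seq1")
shared_leak = b._records    # ["seq1"]: the append leaked into b

c, d = CopiedDefault(), CopiedDefault()
c._records.append("seq1")
copied_leak = d._records    # []: each object has its own list

from_iterator = CopiedDefault(iter(["x", "y"]))._records
```

This supports the argument above: as long as __init__ copies the argument, records added later end up in the object-specific list and the shared default is never modified.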
From bugzilla-daemon at portal.open-bio.org Mon Feb 26 10:32:58 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 26 Feb 2007 05:32:58 -0500 Subject: [Biopython-dev] [Bug 2216] Bio/Nexus/Tree.py class Tree adds extra space in to_string In-Reply-To: Message-ID: <200702261032.l1QAWwWD011505@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2216 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2007-02-26 05:32 EST ------- I checked in that fix, and confirmed the unit test passes. See biopython/Bio/Nexus/Trees.py revision: 1.10 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 10:40:29 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Feb 2007 10:40:29 +0000 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E2AE2A.4050207@c2b2.columbia.edu> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> <45E2AE2A.4050207@c2b2.columbia.edu> Message-ID: <45E2B91D.2080509@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Alex Griffing wrote: >>> The easiest way to accomplish this might be to change the __init__ of >>> the Alignment class from >>> >>> def __init__(self, alphabet) >>> >>> to >>> >>> def __init__(self, alphabet, records=[]) >>> >> Does this apply here? >> http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects >> > In theory, yes, but since we won't be modifying records it doesn't > matter. 
The full function would look like:
>
>     def __init__(self, alphabet, records=[]):
>         self._alphabet = alphabet
>         self._records = list(records)
>
> The "list" is necessary since the user may pass an iterator for records
> instead of a list.

We (or the user) might well change the records - in particular, they might add MORE records. How about this:

    def __init__(self, alphabet, records=None):
        """Initialize a new Alignment object.

        Arguments:
        o alphabet - The alphabet to use for the sequence objects that are
        created. This alphabet must be a gapped type.
        o records - A list or iterator of SeqRecord objects whose (gapped)
        sequences must be the same length.
        """
        self._alphabet = alphabet
        if records:
            self._records = list(records)
            #TODO - Check all seq lengths are the same?
            #TODO - Check the seq's alphabet is compatible?
        else:
            self._records = []

This passes relevant unit tests.

Peter

From mdehoon at c2b2.columbia.edu Mon Feb 26 11:53:00 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 26 Feb 2007 06:53:00 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45E2B91D.2080509@maubp.freeserve.co.uk> References: <45E0CD30.4060108@maubp.freeserve.co.uk> <45E23BAC.7050702@c2b2.columbia.edu> <45E26E7B.5030501@unity.ncsu.edu> <45E2AE2A.4050207@c2b2.columbia.edu> <45E2B91D.2080509@maubp.freeserve.co.uk> Message-ID: <45E2CA1C.6090306@c2b2.columbia.edu> Peter wrote: > Michiel de Hoon wrote: >> Alex Griffing wrote: >>>> The easiest way to accomplish this might be to change the __init__ of >>>> the Alignment class from >>>> >>>> def __init__(self, alphabet) >>>> >>>> to >>>> >>>> def __init__(self, alphabet, records=[]) >>>> >>> Does this apply here? >>> http://www.python.org/doc/faq/general.html#why-are-default-values-shared-between-objects >>> >> In theory, yes, but since we won't be modifying records it doesn't >> matter.
The full function would look like: >> >> def __init__(self, alphabet, records=[]): >> self._alphabet = alphabet >> self._records = list(records) >> >> The "list" is necessary since the user may pass an iterator for records >> instead of a list. > > We (or the user) might well change the records - in particular, they > might add MORE records. How about this: > Still, it doesn't matter, since we're making a copy of records via the list function. So self._records is not the same as the records default argument in the __init__ function. If a user adds more records, they will end up in the object-specific self._records. --Michiel. From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 19:59:45 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Feb 2007 19:59:45 +0000 Subject: [Biopython-dev] test_Cluster.py fails, "nclusters should be positive" Message-ID: <45E33C31.907@maubp.freeserve.co.uk> I just updated my Linux box running Python 2.4.3 to the CVS version of BioPython (including building KDTree). The bad news is that test_Cluster fails (full details below) with cluster.error: kcluster: nclusters should be positive The rest of the unit tests look OK except for some untested modules: the BioSQL ones, test_GFF, test_Wise and test_psw (the last two fail because I do not have dnal installed). Interesting some of these tests which pass under Linux fail under Windows due to differences in floating point display. We might want to fix that at some point... 
Peter -- $ python test_Cluster.py test_Cluster test_mean_median: data = [ 34.300, 3.000, 2.000] mean is 13.100; median is 3.000 data = [ 5.000, 10.000, 15.000, 20.000] mean is 12.500; median is 12.500 data = [ 1.000, 2.000, 3.000, 5.000, 7.000, 11.000, 13.000, 17.000] mean is 7.375; median is 6.000 data = [ 100.000, 19.000, 3.000, 1.500, 1.400, 1.000, 1.000, 1.000] mean is 15.988; median is 1.450 test_matrix_parse: Read data1 (correct) Read data2 (correct) Refused incorrect matrix data3 Refused incorrect matrix data4 Refused incorrect matrix data5 Refused incorrect matrix data6 Refused incorrect matrix data7 Refused incorrect matrix data8 Refused incorrect matrix data9 Refused incorrect matrix data10 test_kcluster First data set Traceback (most recent call last): File "test_Cluster.py", line 479, in ? run_tests(module = "Bio.Cluster") File "test_Cluster.py", line 472, in run_tests test_kcluster(module) File "test_Cluster.py", line 181, in test_kcluster clusterid, error, nfound = kcluster (data1, nclusters=nclusters, mask=mask1, weight=weight1, transpose=0, npass=100, method='a', dist='e') cluster.error: kcluster: nclusters should be positive From mdehoon at c2b2.columbia.edu Mon Feb 26 21:17:52 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 26 Feb 2007 16:17:52 -0500 Subject: [Biopython-dev] test_Cluster.py fails, "nclusters should be positive" In-Reply-To: <45E33C31.907@maubp.freeserve.co.uk> References: <45E33C31.907@maubp.freeserve.co.uk> Message-ID: <45E34E80.3060005@c2b2.columbia.edu> Peter wrote: > I just updated my Linux box running Python 2.4.3 to the CVS version of > BioPython (including building KDTree). > > The bad news is that test_Cluster fails (full details below) with > cluster.error: kcluster: nclusters should be positive Does this happen to be a 64-bits Linux box? One user with a 64-bits machine had this problem before. I have a fix for it, but it's not in CVS yet. 
> Interesting some of these tests which pass under Linux fail under > Windows due to differences in floating point display. We might want to > fix that at some point... Are these errors also from test_Cluster? --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Mon Feb 26 21:49:44 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Feb 2007 21:49:44 +0000 Subject: [Biopython-dev] test_Cluster.py fails In-Reply-To: <45E34E80.3060005@c2b2.columbia.edu> References: <45E33C31.907@maubp.freeserve.co.uk> <45E34E80.3060005@c2b2.columbia.edu> Message-ID: <45E355F8.2080103@maubp.freeserve.co.uk> Michiel Jan Laurens de Hoon wrote: > Peter wrote: >> I just updated my Linux box running Python 2.4.3 to the CVS version of >> BioPython (including building KDTree). >> >> The bad news is that test_Cluster fails (full details below) with >> cluster.error: kcluster: nclusters should be positive > > Does this happen to be a 64-bits Linux box? One user with a 64-bits > machine had this problem before. I have a fix for it, but it's not in > CVS yet. Yes, this is a 64-bit Linux box. Is there a simple patch I could test? >> Interesting some of these tests which pass under Linux fail under >> Windows due to differences in floating point display. We might want to >> fix that at some point... > > Are these errors also from test_Cluster? No, other things like test_SVDSuperimposer (used by Bio.PDB) and some others. Peter From biopython-dev at maubp.freeserve.co.uk Tue Feb 27 01:24:48 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Feb 2007 01:24:48 +0000 Subject: [Biopython-dev] Parsing NCBI Protein Tables (PTT) files in Bio.SeqIO Message-ID: <45E38860.7030001@maubp.freeserve.co.uk> I have a rough NCBI Protein Table (*.ptt) file parser prepared for Bio.SeqIO (see below). 
This file format was discussed in August 2006, and is unusual in that it
does not actually contain any sequences (!). This means the parser
returns SeqRecord objects with empty sequences, but with populated
annotation fields.

I believe Leighton Pritchard was interested in parsing PTT files from
the NCBI. Does something like this (below) look useful? Does anyone
know of a link to any "official" documentation on this file format?

Leighton, you also mentioned parsing the NCBI's GFF files, which seem to
be a tab separated variable dump of the information found in a GenBank
file's features table (link to documentation welcome).

An entire GFF file could be turned into a single SeqRecord with no
sequence, but with many sub features as SeqFeatures (akin to the results
of the existing "genbank" parser). The location information would be
simplified for GFF.

Also, it looks like parsing just the CDS entries from a GFF file into
"sequence free" SeqRecords would also be sensible... (akin to the
existing "genbank-cds" parser).

Peter

--

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def NcbiProteinTableIterator(handle) :
    """Returns a SeqRecord for each entry in an NCBI Protein Table (PTT file)

    Note that the SeqRecord object's sequence will be zero length (empty).
    """
    # Skip the first line (not used here), then read the second line,
    # which is expected to be of the form "<count> proteins"
    line = handle.readline()
    line = handle.readline()
    parts = line.rstrip().split()
    if len(parts) != 2 or parts[1].lower() != "proteins" :
        raise SyntaxError("Second line not recognised as an NCBI Protein Table (PTT file)")
    # The third line holds the tab separated column headers
    line = handle.readline().strip()
    if line != "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct" :
        raise SyntaxError("Third line not recognised as an NCBI Protein Table (PTT file)")

    # Column indices, in the order given by the header line
    LOCATION = 0
    STRAND = 1
    LENGTH = 2
    PID = 3
    GENE = 4
    SYNONYM = 5
    CODE = 6
    COG = 7
    PRODUCT = 8

    # Column index -> annotation key (LENGTH is handled separately as an int)
    mapping = {
        LOCATION : "location",
        STRAND : "strand",
        PID : "PID",
        GENE : "gene",
        SYNONYM : "synonym",
        CODE : "locus_tag", # Is this always correct?
        COG : "COG",
        PRODUCT : "product",
        }

    count = int(parts[0])
    for line in handle :
        parts = line.rstrip().split("\t")
        record = SeqRecord(seq=Seq(""), id=parts[PID], name=parts[GENE], description=parts[PRODUCT])
        # A "-" means the field is empty
        if parts[LENGTH] != "-" :
            record.annotations["length"] = int(parts[LENGTH])
        for field,key in mapping.items() :
            if parts[field] != "-" :
                record.annotations[key] = parts[field]
        #TODO - Make sure STRAND is treated same as for GenBank
        yield record
        count -= 1
    # The number of records should match the count from the second line
    if count != 0 :
        raise SyntaxError("Record header number of records wrong?")

From biopython-dev at maubp.freeserve.co.uk Tue Feb 27 01:34:04 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Feb 2007 01:34:04 +0000
Subject: [Biopython-dev] GFF3 files in Bio.SeqIO
In-Reply-To: <45E38860.7030001@maubp.freeserve.co.uk>
References: <45E38860.7030001@maubp.freeserve.co.uk>
Message-ID: <45E38A8C.6090007@maubp.freeserve.co.uk>

Peter wrote:
> Leighton, you also mentioned parsing the NCBI's GFF files, which seem to
> be a tab separated variable dump of the information found in a GenBank
> file's features table (link to documentation welcome).
>
> An entire GFF file could be turned into a single SeqRecord with no
> sequence, but with many sub features as SeqFeatures (akin to the results
> of the existing "genbank" parser). The location information would be
> simplified for GFF.
>
> Also, it looks like parsing just the CDS entries from a GFF file into
> "sequence free" SeqRecords would also be sensible... (akin to the
> existing "genbank-cds" parser).

I went through my old emails, and actually you did point me in this
direction:

http://song.sourceforge.net/gff3.shtml
http://www.sequenceontology.org/gff3.shtml

The file format does look much more complicated than I had first
thought.

Interestingly the file format does allow for FASTA records to be
appended to it - however the NCBI at least does not do this.

Perhaps a more general GFF3 parser would be more useful than a sequence
orientated one for Bio.SeqIO?

Peter

From mdehoon at c2b2.columbia.edu Tue Feb 27 18:26:10 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 27 Feb 2007 13:26:10 -0500
Subject: [Biopython-dev] test_Cluster.py fails
In-Reply-To: <45E355F8.2080103@maubp.freeserve.co.uk>
References: <45E33C31.907@maubp.freeserve.co.uk>
 <45E34E80.3060005@c2b2.columbia.edu>
 <45E355F8.2080103@maubp.freeserve.co.uk>
Message-ID: <45E477C2.1030101@c2b2.columbia.edu>

Peter wrote:
>>> The bad news is that test_Cluster fails (full details below) with
>>> cluster.error: kcluster: nclusters should be positive
>>
>> Does this happen to be a 64-bits Linux box? One user with a 64-bits
>> machine had this problem before. I have a fix for it, but it's not in
>> CVS yet.
>
> Yes, this is a 64-bit Linux box. Is there a simple patch I could test?
>

I have updated Bio.Cluster and the corresponding tests in CVS now. Could
you try this new code and see if the problem is solved on the 64-bits
machines?

Thanks!

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Wed Feb 28 18:05:20 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Feb 2007 18:05:20 +0000
Subject: [Biopython-dev] test_Cluster.py fails
In-Reply-To: <45E477C2.1030101@c2b2.columbia.edu>
References: <45E33C31.907@maubp.freeserve.co.uk>
 <45E34E80.3060005@c2b2.columbia.edu>
 <45E355F8.2080103@maubp.freeserve.co.uk>
 <45E477C2.1030101@c2b2.columbia.edu>
Message-ID: <45E5C460.4070000@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>>>> The bad news is that test_Cluster fails (full details below) with
>>>> cluster.error: kcluster: nclusters should be positive
>>> Does this happen to be a 64-bits Linux box? One user with a 64-bits
>>> machine had this problem before. I have a fix for it, but it's not in
>>> CVS yet.
>> Yes, this is a 64-bit Linux box. Is there a simple patch I could test?
>>
> I have updated Bio.Cluster and the corresponding tests in CVS now. Could
> you try this new code and see if the problem is solved on the 64-bits
> machines?

Updated to CVS, and test_Cluster.py now passes on my 64-bit Linux
machine. I have not attempted any further testing of the module.

Thank you :)

Peter
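
A footnote to the GFF3 question raised earlier in this thread (whether a
more general GFF3 parser would be more useful than a sequence orientated
one). Below is a minimal sketch of what a non-sequence reader might look
like, assuming the nine tab separated columns and the "##FASTA" directive
described at http://www.sequenceontology.org/gff3.shtml. The function
names iter_gff3_features and parse_gff3_attributes are hypothetical, this
is not Bio.SeqIO code, the sample data is made up, and unlike the code
above it is written for Python 3.

```python
import io
from urllib.parse import unquote

# The nine GFF3 columns, in file order
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_attributes(text):
    """Turn the ninth column (tag=value;tag=value) into a dict of lists."""
    attributes = {}
    if text in ("", "."):
        # "." marks an undefined field
        return attributes
    for pair in text.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Multiple values are comma separated; split before unescaping
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def iter_gff3_features(handle):
    """Yield one dict per feature line, ignoring any appended FASTA records."""
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA") or line.startswith(">"):
            # Everything from here on is sequence data, not features
            break
        if not line or line.startswith("#"):
            # Blank lines, comments and other directives are skipped
            continue
        parts = line.split("\t")
        if len(parts) != 9:
            raise ValueError("Expected 9 tab separated columns, got %i" % len(parts))
        feature = dict(zip(GFF3_COLUMNS, parts))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        feature["attributes"] = parse_gff3_attributes(parts[8])
        yield feature


# Made-up example data, not taken from a real NCBI file
example = (
    "##gff-version 3\n"
    "chr1\texample\tCDS\t190\t255\t.\t+\t0\tID=cds0;Name=thrL;Note=leader%20peptide\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGT\n"
)
features = list(iter_gff3_features(io.StringIO(example)))
for feature in features:
    print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

Yielding plain dictionaries keeps the sketch independent of SeqRecord;
turning each one into a SeqFeature, or grouping features by their
ID/Parent attributes, would be a separate step.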